Research article Special Issues

Computer-aided differentiates benign from malignant IPMN and MCN with a novel feature selection algorithm

  • In clinical practice, preoperatively differentiating benign from malignant intraductal papillary mucinous neoplasm (IPMN) and mucinous cystic neoplasm (MCN) is crucial for deciding on the subsequent treatment strategy. However, it remains challenging because benign and malignant lesions usually appear similar in both imaging and clinical indices. Therefore, a computer-aided diagnosis (CAD) system based on radiomics and clinical indices is proposed in this paper to address this problem. A total of 107 patients were enrolled: 90 cases were randomly selected as the training set, with 5-fold cross-validation used to build the diagnostic model, while the remaining 17 cases formed an independent testing set to validate the performance. In total, 436 high-throughput radiomics features and 9 clinical indices were designed and extracted. A novel feature selection algorithm named BLR (Bootstrapping repeated LASSO with Random selections) was proposed to select the most effective features. The selected features were then fed to a support vector machine (SVM) to differentiate benign from malignant lesions. In the cross-validation cohort and the independent testing cohort, the areas under the receiver operating characteristic curve (AUC) of the CAD scheme were 0.83 and 0.92, respectively. These results suggest that the proposed CAD system is effective in diagnosing these tumors.

    Citation: Chengkang Li, Ran Wei, Yishen Mao, Yi Guo, Ji Li, Yuanyuan Wang. Computer-aided differentiates benign from malignant IPMN and MCN with a novel feature selection algorithm[J]. Mathematical Biosciences and Engineering, 2021, 18(4): 4743-4760. doi: 10.3934/mbe.2021241




    In the agricultural sector, managing production in response to customer needs remains challenging. The problem stems from a lack of know-how and knowledge management [1,2,3]. These issues indicate that entrepreneurs need effective tools for developing and increasing productivity in the production process for long-term stability. Additionally, inaccurate yield prediction can have far-reaching effects on food production, supply systems, economies and global food security. Uncertainty in predicting crop yields and quality can lead to lower crop yields, reduced income, financial instability for agricultural producers, increased production costs, shortages and price fluctuations [4]. In marketing, it contributes to price volatility, suboptimal policy choices and resource misallocation, and it disrupts international trade agreements and negotiations, affecting trade balances and economic stability [5]. Accurate forecasting is even more critical for farmers who need to adapt their practices to changing climate conditions [6].

    We focused on arabica coffee (Coffea arabica L.) grown in northern Thailand. It is among the most popular coffee species because its features and flavor offer higher quality than other types [7]. Currently, the coffee business is becoming more competitive, and its production forecast is of great interest to the stakeholders involved. Uncertainty in forecasting coffee production leaves the whole supply chain vulnerable, from coffee farmers to exporters, importers, roasters and retailers, leading to supply gaps, disappointed customers and potential damage to brand reputation [8]. Furthermore, price volatility affects their profitability and financial planning [9]. Uncertain forecasts can lead to overstocking or understocking [10]. Coffee cultivation is usually a long-term undertaking. If farmers or entrepreneurs lack knowledge of process management, production forecasts become uncertain [11].

    To address these challenges in coffee businesses, artificial intelligence (AI) is considered essential for modern manufacturing processes in agriculture and industry. This technology is likely to generate increased efficacy and effectiveness in the production process to enhance companies' potential according to international standards. Over the past few years, the agriculture and industry sectors have introduced various technologies to increasingly modernize their manufacturing and agricultural operations. AI is also being used to analyze the data and accelerate the operating system with flexibility that leads to more effectiveness in producing products or services in accordance with customer needs [12,13,14].

    The artificial neural network (ANN), a machine learning (ML) algorithm, is viewed as a vital data-modeling tool [15]. It processes data through the functional structures of neural networks. Input and output data are passed through a network of neurons; variants include single-layer perceptrons, multilayer perceptrons, recurrent ANNs and self-organizing maps [16]. Recently, many agricultural sectors have applied the ANN to predict the productivity of products [17,18]. Kittichotsatsawat et al. [19] used basic ANN models to predict the productivity of coffee in northern Thailand. Bhojani et al. [20] applied ANN to wheat yield prediction. Palanivel et al. [21] also utilized ANN to predict crop yield. Important factors considered include the area, productivity zone, rainfall, relative humidity, temperature, etc. [22,23,24,25,26]. Apart from ML models, other methods can help to improve the productivity of cherry coffee. Examples are the analytical hierarchy process (AHP) and frequency ratio (FR), attention mechanism (AM), convolutional neural network (CNN), hyperspectral image (HSI), spectral–spatial features from principal component analysis (PCA), weighted linear combination (WLC) and attitude determination and control subsystem (ADCS) models [27,28,29,30].

    The autoregressive integrated moving average (ARIMA) model is a statistical tool that has been used to predict the output of various agricultural products. The model is built following the Box-Jenkins methodology, which involves (i) stationarity, (ii) co-integration and (iii) an error correction mechanism [31]. Padhan [32] employed ARIMA to forecast agricultural productivity in India. ARIMA has also been used to predict the productivity of goods in terms of area, zone, relative humidity, rainfall, temperature, etc. [33,34,35,36].

    Techniques such as ANN and ARIMA may be used to determine the productivity of coffee to meet customer requirements. According to the literature, several works have applied the ANN and ARIMA models to agricultural forecasting, covering commodities, agricultural products, crop prices, etc. [37,38,39]. However, to our knowledge, the two models have not yet been compared for coffee production forecasting. Therefore, we aim to predict the cherry coffee yield and compare the performance of the ARIMA and ANN models. The findings will help in analyzing the trends of Thai coffee effectively and sustainably.

    In this study, datasets were collected for 15 years from 2004 to 2018 (180 months). The input predictor data were obtained from the Thai Agricultural Economics Office and the Meteorological Department. They included the cultivated area, productivity zone, monthly rainfall, monthly relative humidity (RH) and monthly temperatures. The output data was the productivity yield each year. Coffee was harvested in only six months of each year, and the yield has shown an increasing trend over the past several years, as shown in Figure 1.

    Figure 1.  Historical data of arabica coffee production in Thailand.

    Based on the literature, it is necessary to normalize and standardize the values of input features and output targets before developing ML models [40,41]. In this work, the input and output variables are normalized in the range 0‒1, using:

    $$N = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

    where N is the normalized data; X is the measured value; and Xmin and Xmax are the minimum and maximum values.
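    As a minimal sketch of Eq. (1), the snippet below scales one variable to the range 0‒1. The function name `min_max_normalize` and the rainfall readings are hypothetical and serve only as an illustration; they are not taken from the study's dataset.

```python
def min_max_normalize(values):
    """Scale a sequence of numbers to the range 0-1 following Eq. (1)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Eq. (1) would divide by zero when every value is identical.
        raise ValueError("all values are identical; Eq. (1) is undefined")
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical monthly rainfall readings in mm (illustrative only).
rainfall_mm = [12.0, 85.5, 140.2, 60.0, 0.0]
normalized = min_max_normalize(rainfall_mm)
print(normalized)
```

    The smallest reading maps to 0 and the largest to 1, so variables measured in different units become comparable before model fitting.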

    The performances of the ARIMA and ANN models were measured on a validation dataset. The selected ANN was then tested, and the coefficient of determination (R2), the root mean square error (RMSE) [42,43] and the mean squared error (MSE) were compared.

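    $$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (\mathbf{y} - \bar{y}\mathbf{1})^{\top}(\mathbf{y} - \bar{y}\mathbf{1}) \tag{2}$$
    $$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = (\hat{\mathbf{y}} - \bar{y}\mathbf{1})^{\top}(\hat{\mathbf{y}} - \bar{y}\mathbf{1}) \tag{3}$$
    $$R^2 = 1 - \frac{SS_{\mathrm{regress}}}{SS_{\mathrm{total}}} \tag{4}$$

    where yi and ŷi are the observed and predicted values, ȳ is the mean of the observed values, SSregress (SSR) is the sum of squares due to regression (explained sum of squares) and SStotal (SST) is the total sum of squares.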

    $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ E(x_i) - M(x_i) \right]^2} \tag{5}$$
    $$MSE = (RMSE)^2 \tag{6}$$

    where n is the sample size of the testing dataset, while E(xi) and M(xi) are interpolated/predicted and observed values, respectively.
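    The metrics above can be sketched in a few lines of Python. The function name `regression_metrics` and the sample values are hypothetical. Note one assumption: R2 is computed here in the residual form, 1 − SS_residual/SS_total, which is a common convention and, for a least-squares fit with an intercept, equals SS_regression/SS_total; the study does not state which implementation it used.

```python
import math


def regression_metrics(observed, predicted):
    """Return SST, SSR, MSE, RMSE and R2 for paired observed/predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    sst = sum((y - mean_obs) ** 2 for y in observed)    # Eq. (2)
    ssr = sum((p - mean_obs) ** 2 for p in predicted)   # Eq. (3)
    sse = sum((p - y) ** 2 for p, y in zip(predicted, observed))
    if sst == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    mse = sse / n                                       # Eq. (6)
    rmse = math.sqrt(mse)                               # Eq. (5)
    r2 = 1.0 - sse / sst  # residual form of R2 (assumed convention)
    return {"sst": sst, "ssr": ssr, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical normalized yields (illustrative only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

    Because Eq. (6) defines the MSE as the square of the RMSE, either value can be recovered from the other; an RMSE of 0.1348, for instance, corresponds to an MSE of approximately 0.018.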

    A prediction is built on the foundation of some scientific calculation based on historical data. The variable datasets were analyzed through the ARIMA model with Python programming and ANN model using MATLAB programming.

    ARIMA model is a technique of statistics and econometrics that evaluates the events that will happen over each period of time.

    $$y_t = \theta_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q} \tag{7}$$

    where yt and εt are the actual value and random error at time period t, respectively; φi (i = 1, 2, …, p) and θj (j = 0, 1, 2, …, q) are model parameters; and p and q are integers, often referred to as the orders of the model. The random errors, εt, are assumed to be independently and identically distributed with a mean of zero and a constant variance of σ2.

    $$Y_t = f(t) + \varepsilon_t \tag{8}$$

    where Yt signifies production in year t, f(t) denotes a function of time t and εt denotes the production error (i.e., the difference between observed and forecasted production for year t). Once a functional link between production and time (in other words, a time series model) has been built, production for year t + 1 can be forecasted. The first stage in creating this model is determining whether the time series under consideration is stationary or non-stationary.

    $$\varphi_p(B)\, \Delta^d h_t = c + \theta_q(B)\, g_t \tag{9}$$

    where ht is the variable being forecast at time t, B is the lag (backshift) operator, gt is the error term (Y − Ŷ, in which Ŷ is the estimated value of Y), φp(B) is the non-seasonal autoregressive (AR) operator, represented as a polynomial in the backshift operator, Δ^d = (1 − B)^d is the non-seasonal difference of order d, θq(B) is the non-seasonal moving average (MA) operator, also represented as a polynomial in the backshift operator, and the φ's and θ's are the parameters to be estimated.
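    As a worked illustration of Eq. (9), consider the orders p = 2, d = 1 and q = 2. The expansion below assumes the sign convention of Eq. (7), in which the AR and MA polynomials are written with minus signs; the symbol w_t for the differenced series is introduced here only for readability.

```latex
% Operator polynomials under the sign convention of Eq. (7) (assumed):
%   \varphi_2(B) = 1 - \varphi_1 B - \varphi_2 B^2
%   \theta_2(B)  = 1 - \theta_1 B  - \theta_2 B^2
\begin{align*}
(1 - \varphi_1 B - \varphi_2 B^2)(1 - B)\, h_t &= c + (1 - \theta_1 B - \theta_2 B^2)\, g_t \\
w_t &= h_t - h_{t-1} \qquad \text{(first difference, } d = 1\text{)} \\
w_t &= c + \varphi_1 w_{t-1} + \varphi_2 w_{t-2} + g_t - \theta_1 g_{t-1} - \theta_2 g_{t-2}
\end{align*}
```

    In words, the once-differenced series is modeled as a constant plus two of its own lags and two lagged errors, which is Eq. (7) applied to w_t instead of y_t.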

    The variable datasets were prepared with the time series components in mind, including trend, season, cycle and irregularity. For the ARIMA model, the data were split into two parts, with 156 observations for training and 24 for testing. The historical observations and random errors were used to estimate future values. Prediction followed the Box-Jenkins approach to ARIMA modeling, a three-step iterative technique consisting of (i) model identification, (ii) parameter estimation and (iii) residual diagnostic testing [44].

    The unit root test series graphs revealed the autocorrelation function (ACF) and partial ACF (PACF) [45]. The zig-zag trend will show the increase to meet the stationary series graphs. After the stationary time series process was identified, the ARIMA model was defined by the autoregressive integrated moving average model (p, d, q). Python programming was used to detect a suitable residual of the ACF graph with a 95% confidence band [46]. Next, a suitable ARIMA model was used in prediction (156 data for training and 24 data for testing).

    ANN is a complex multivariate model to approximate the unknown expectation function of a random variable. Weights will be used to estimate the parameters in the ANN model.

    $$t_i = \sum_{j=1}^{n} w_{ij} x_j \tag{10}$$

    where n is the number of inputs, wij is the weight of the connection between the ith and jth nodes and xj is the input from node j. The output Oi is then calculated through a transfer function f:

    $$O_i = f(t_i) \tag{11}$$
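    Eqs. (10) and (11) can be sketched for a single node as follows. The function name `neuron_output`, the weights and the inputs are hypothetical. The hyperbolic tangent is used as the transfer function because MATLAB's TANSIG is mathematically equivalent to tanh; no bias term is included, in line with Eq. (10) as written.

```python
import math


def neuron_output(weights, inputs, transfer=math.tanh):
    """Weighted sum of inputs (Eq. 10) passed through a transfer function (Eq. 11)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    t_i = sum(w * x for w, x in zip(weights, inputs))  # Eq. (10)
    return transfer(t_i)                               # Eq. (11)


# Six hypothetical normalized inputs (X1..X6) and illustrative weights.
inputs = [0.40, 0.55, 0.20, 0.75, 0.60, 0.30]
weights = [0.8, 0.5, -0.3, 0.1, -0.6, 0.2]
output = neuron_output(weights, inputs)
print(output)
```

    A full network repeats this computation for every node in each hidden layer, with the weights adjusted during training.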

    The variable datasets were randomly separated into three groups to prevent overfitting: 70% for training, 15% for validation and 15% for testing [47]. During ANN training, the algorithm was furnished with the performance of minimum or maximum through the shortest path in order to gain the network's yield size. The network was trained by backpropagation on the training set, with the weights updated to minimize the MSE [48].
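    A random 70/15/15 partition of the 180 monthly records can be sketched as below. The function name `split_indices` and the fixed seed are assumptions made for reproducibility of the illustration; the study's actual partition was produced in MATLAB and is not reproduced here.

```python
import random


def split_indices(n_obs, seed=0):
    """Shuffle record indices and split them 70% / 15% / 15%."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)  # seeded for a repeatable illustration
    n_train = n_obs * 70 // 100
    n_val = n_obs * 15 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to testing
    return train, val, test


train_idx, val_idx, test_idx = split_indices(180)
print(len(train_idx), len(val_idx), len(test_idx))
```

    With 180 records this gives 126 training, 27 validation and 27 testing samples, and every record appears in exactly one group.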

    The neural networks were trained by means of a training set, and the output datasets were compared with the fixed weight. Then, feed-forward backpropagation was utilized to compare output datasets with the fixed weight. The MSE was utilized to test and calculate epochs to validate the variable dataset in the part of the neural network running. Neurons of the neural network will utilize a definite function in the hidden layer and gather the combination and bias. Lastly, the output of variable data will give the predicted model [49]. The crop yield index was determined via the input and output variables set. During the neural network process, each independent variable set was assessed and revealed by a partial dependence plot (PDP) [50,51]. The highest importance value showed the relative importance of that parameter from each index.

    The predicted crop yield of cherry coffee was validated against the input variables dataset based on the difference between observed and predicted coffee crops. The ANN model was evaluated with the leave-one-out cross-validation technique, while the ARIMA model used time-based cross-validation. The 156 training months were evenly partitioned for the cross-validation. Three rounds of ANN and ARIMA evaluation were carried out across the variable datasets, and each dataset was trained and treated individually. Finally, model performance was evaluated in terms of the RMSE and MSE. The best model was the one with the largest R2 and the smallest RMSE.

    Time series analysis of ARIMA was completed based on the 180 monthly input variables dataset. Variable datasets were detected through the unit root with stationary test in order to examine the stationary or non-stationary data. A statistical test was used to consider these data, specifically the Augmented Dickey-Fuller (ADF) test, and based on the p-value of the ADF-test, if the result shows a value less than 0.05, it will be identified as stationary [52].

    The ADF test on the original series gave a p-value of 0.9210, so the series was identified as non-stationary. However, the ACF and PACF in Figure 2 suggested that the time series was seasonal and stationary. The data were therefore differenced once (d = 1). This time, the p-value was 1.426 × 10^−13; thus, the differenced series was identified as stationary. However, the ACF and PACF still showed seasonal components at lags of 12, 24 and 36 in Figure 3.

    Figure 2.  Autocorrelation function (ACF) and partial ACF of coffee prediction.
    Figure 3.  Autocorrelation function (ACF) and partial ACF of coffee prediction through adjusting and rearranging data.
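    The first differencing (d = 1) and the inspection of seasonal lags described above can be sketched with the standard library alone. The series below is synthetic (a linear trend plus a 12-month cycle) and only stands in for the coffee data; the function names are hypothetical. The band ±1.96/√n is the usual large-sample approximation of a 95% confidence band for the sample ACF. The sketch does not reproduce the ADF test.

```python
import math


def difference(series, lag=1):
    """Return the lag-differenced series, e.g. y[t] - y[t-1] for lag=1."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic 180-month series: upward trend plus an annual cycle (illustrative).
series = [0.01 * t + math.sin(2 * math.pi * t / 12) for t in range(180)]
diffed = difference(series)                # d = 1
acf = sample_acf(diffed, 36)
band = 1.96 / math.sqrt(len(diffed))       # approximate 95% band
seasonal_lags = [k for k in (12, 24, 36) if abs(acf[k]) > band]
print(seasonal_lags)
```

    For a series with an annual cycle, the autocorrelations at lags 12, 24 and 36 stay outside the band even after differencing, which is the kind of pattern described for Figure 3.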

    The data were divided into two parts: 156 observations for training and 24 for testing. The model was then tuned on the training data by fitting ARIMA models under 64 combinations of the autoregressive (AR(p)), integrated (I(d)) and moving average (MA(q)) orders [53]. The Akaike information criterion (AIC) was computed for each combination [54], as shown in Table 1.

    Table 1.  Optimized ARIMA model parameters of coffee yields.
    | No. | Parameter (p, d, q) | AIC |
    |---|---|---|
    | 1 | (2, 1, 2) | −168.0802 |
    | 2 | (2, 1, 3) | −166.9035 |
    | 3 | (2, 0, 3) | −166.7370 |
    | 4 | (3, 1, 2) | −166.4853 |
    | 5 | (3, 0, 2) | −165.7990 |
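    The order selection summarized in Table 1 can be sketched as follows. The search grid is an assumption: the text reports 64 conditions, which would match p, d and q each ranging over 0 to 3, but the exact grid is not stated. The AIC values are the five reported in Table 1; no model is fitted in this sketch.

```python
from itertools import product

# Assumed search grid: p, d, q in {0, 1, 2, 3}, which yields 64 combinations.
candidate_orders = list(product(range(4), repeat=3))

# AIC values as reported in Table 1 for the five best candidates.
reported_aic = {
    (2, 1, 2): -168.0802,
    (2, 1, 3): -166.9035,
    (2, 0, 3): -166.7370,
    (3, 1, 2): -166.4853,
    (3, 0, 2): -165.7990,
}

# The preferred model is the one with the lowest AIC.
best_order = min(reported_aic, key=reported_aic.get)
print(len(candidate_orders), best_order)
```

    In a full workflow, each candidate order would be fitted to the 156 training observations and its AIC recorded before taking the minimum.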


    Table 1 shows the five (p, d, q) settings with the lowest AIC; the minimum AIC of −168.0802 was obtained at (2, 1, 2). Once the (p, d, q) parameters were identified, the model was assessed by time-based cross-validation. Instead of random sampling, the data were split sequentially by forward chaining, in which the training window is extended step by step and the following block is used for testing, for a total of 12 rounds.
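    Forward chaining can be sketched as an expanding-window split. The layout below is an assumption: it divides the 156 training months into 13 equal blocks of 12 months, so that each of the 12 rounds trains on all earlier blocks and tests on the next one. The text does not give the exact train/test ratio per round, and the function name is hypothetical.

```python
def forward_chaining_splits(n_obs, n_rounds):
    """Build expanding-window (train, test) index lists for time-based CV."""
    fold = n_obs // (n_rounds + 1)  # size of each block
    splits = []
    for k in range(1, n_rounds + 1):
        train_idx = list(range(0, k * fold))              # all earlier months
        test_idx = list(range(k * fold, (k + 1) * fold))  # the next block
        splits.append((train_idx, test_idx))
    return splits


splits = forward_chaining_splits(156, 12)
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

    Unlike random sampling, every test block lies strictly after its training window, so no future observation leaks into model fitting.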

    In Table 2, the ARIMA model shows a training R2 of 0.7041, RMSE of 0.1348 and MSE of 0.0181. The targets and outputs of the variable datasets followed the same trend direction, as shown in Figure 4. The time-based cross-validation gave a higher average R2 (0.7383 for training and 0.3931 for testing), while the average MSE of the test data was smaller (0.0451 compared with 0.0469). Therefore, this model configuration was considered suitable for prediction.

    Table 2.  R2 and MSE of training and testing and time-based cross-validation of coffee yields.
    | Evaluation | Metric | Train | Test |
    |---|---|---|---|
    | Training and testing | MSE | 0.0181 | 0.0469 |
    | Training and testing | RMSE | 0.1348 | 0.2167 |
    | Training and testing | R2 | 0.7041 | 0.3521 |
    | Time-based cross-validation | R2 | 0.7383 | 0.3931 |
    | Time-based cross-validation | MSE | 0.0046 | 0.0451 |

    Figure 4.  Targets and outputs for the coffee prediction.

    The ANN analysis was achieved based on the 180 monthly input data. The variable datasets included the cultivated area (X1) for the coffee. The productivity zone (X2) is the factor that implies the quantity of cherry coffee in each crop. The rainfall data (X3) is a crucial factor for the output of coffee in each year. The proper amount of RH (X4) will enable the high quantity of coffee yield. The maximum and minimum ambient temperatures (X5 and X6) are essential to the coffee productivity.

    For the ANN hyperparameter setting, the number of hidden layers and the number of neurons (i.e., processing elements (PEs)) in each hidden layer were optimized by trial and error in predicting the coffee yields. The MSE, RMSE and R2 values were used to evaluate the optimal parameters. The network properties were set as follows: (a) a feed-forward backpropagation network type, (b) training and adaptation learning functions based on the Levenberg-Marquardt algorithm with TRAINLM and LEARNGDM, (c) MSE as the performance function, (d) varying hidden layers and six neurons and (e) the TANSIG transfer function [55]. The performances of the ANN models in predicting cherry coffee productivity yield were then evaluated for various ANN configurations, as listed in Table 3. The best training results were obtained with two hidden layers and one PE in each hidden layer, which gave R values for the training, testing and validation phases of 0.9921, 0.9384 and 0.8723, respectively, and an MSE of the validating data of 19576, as also shown in Figures 5 and 6.

    Table 3.  Performances of various ANN configurations through MSE, R training, R testing, R validation, R overall and R2.
    | Number of hidden layers | PEs | MSE | R Training | R Testing | R Validation | R Overall | R2 |
    |---|---|---|---|---|---|---|---|
    | 1 | 1 | 20791 | 0.9014 | 0.5100 | 0.9196 | 0.8491 | 0.7210 |
    | 1 | 2 | 27138 | 0.7977 | 0.7603 | 0.5115 | 0.7764 | 0.6028 |
    | 1 | 3 | 40226 | 0.8079 | 0.8405 | 0.7176 | 0.7965 | 0.6344 |
    | 1 | 4 | 42773 | 0.7506 | 0.3192 | 0.7525 | 0.7096 | 0.5035 |
    | 1 | 5 | 50880 | 0.8474 | 0.7186 | 0.4740 | 0.7795 | 0.6076 |
    | 1 | 6 | 39929 | 0.8580 | 0.6132 | 0.7423 | 0.7794 | 0.6075 |
    | 1 | 7 | 72995 | 0.9761 | 0.6014 | 0.4396 | 0.8464 | 0.7164 |
    | 1 | 8 | 20092 | 0.7720 | 0.6825 | 0.6804 | 0.7431 | 0.5522 |
    | 1 | 9 | 20419 | 0.9798 | 0.8299 | 0.8797 | 0.9291 | 0.8632 |
    | 1 | 10 | 15591 | 0.9810 | 0.5857 | 0.9452 | 0.9000 | 0.8100 |
    | 2 | 1 | 19576 | 0.9921 | 0.9384 | 0.8723 | 0.9643 | 0.9299 |
    | 2 | 2 | 20384 | 0.7900 | 0.7736 | 0.6675 | 0.7807 | 0.6095 |
    | 2 | 3 | 14133 | 0.7776 | 0.8219 | 0.7922 | 0.7747 | 0.6002 |
    | 2 | 4 | 27636 | 0.9143 | 0.5913 | 0.8700 | 0.8816 | 0.7772 |
    | 2 | 5 | 28923 | 0.9297 | 0.7595 | 0.9057 | 0.9085 | 0.8254 |
    | 2 | 6 | 16209 | 0.9131 | 0.7751 | 0.7987 | 0.8690 | 0.7552 |
    | 2 | 7 | 37072 | 0.7236 | 0.6636 | 0.6070 | 0.7042 | 0.4959 |
    | 2 | 8 | 16657 | 0.9825 | 0.9049 | 0.8930 | 0.9629 | 0.9272 |
    | 2 | 9 | 7218 | 0.8151 | 0.6677 | 0.8071 | 0.8039 | 0.6463 |
    | 2 | 10 | 16288 | 0.9761 | 0.6510 | 0.8018 | 0.9164 | 0.8398 |

    Figure 5.  Validation performance with the MSE value of the validating data.
    Figure 6.  ANN modeling in coffee yield (ton) prediction.

    Figure 5 shows the validation performance, in terms of the MSE of the validating data, for the best training results of the ANN model. The learning rate was set to 0.02, and the maximum number of training cycles was 1000 epochs. The optimal validation performance of the ANN model with two hidden layers and one PE each was found at 13 epochs, based on the lowest MSE of the validating data for this ANN configuration.

    Figure 6 shows the predicted values versus the measured values for the training, testing and validating data and for the whole dataset. The close agreement between predicted and measured values indicates the accuracy of the neural network in predicting the cherry coffee productivity yield from the input variables.

    Figure 7 shows one-way partial dependence plots (PDPs) for each variable's relative importance. The maximum monthly temperature (X5) (Figure 7e) had the largest effect in this model, with the PDP value varying from 0.0501 to 0.5211. The second most important predictor was the productivity zone (X2) (Figure 7b), with the PDP value ranging from 0.0896 to 0.2929. The third most important variable was monthly rainfall (X3) (Figure 7c), with the PDP value ranging from 0.0899 to 0.2760. The cultivated area (X1) (Figure 7a) and minimum monthly temperature (X6) (Figure 7f) showed only small differences in PDP value, from 0.1373 to 0.2010 and from 0.1883 to 0.1275, respectively. Lastly, the relative humidity (X4) showed marginal effects on the model, with PDP values from 0.1295 to 0.1670.

    Figure 7.  One-way PDPs of coffee yield prediction.
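    A one-way partial dependence curve of the kind shown in Figure 7 can be sketched as follows: one feature is fixed at each grid value in every record, and the model predictions are averaged. The function name `partial_dependence`, the toy linear model and the records are hypothetical stand-ins; the trained ANN from the study is not reproduced here.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average model prediction with one feature fixed at each grid value."""
    pd_values = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)               # copy so the data stay unchanged
            modified[feature_index] = value    # overwrite the feature of interest
            total += model(modified)
        pd_values.append(total / len(rows))
    return pd_values


def toy_model(row):
    """Hypothetical stand-in for a trained predictor (not the study's ANN)."""
    return 0.2 + 0.5 * row[0] - 0.1 * row[1]


# Hypothetical normalized records with two features (illustrative only).
rows = [[0.0, 0.2], [0.5, 0.4], [1.0, 0.9]]
pdp = partial_dependence(toy_model, rows, 0, [0.0, 0.5, 1.0])
pdp_range = max(pdp) - min(pdp)
print(pdp, pdp_range)
```

    The spread between the lowest and highest PDP values gives a simple way to rank predictors, which is how the ranges quoted for X1 to X6 are compared.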

    The maximum temperature (e) affected productivity significantly. If the temperature was higher than or equal to 29 ℃, productivity decreased. If the minimum temperature (f) was at or below 15–20 ℃, productivity improved. The rainfall (c) was one of the essential factors because coffee productivity depended on the amount of rainfall each year. A suitable rainfall should be less than 100 mm, leading to good coffee plantation conditions. The productivity zone (b) and cultivated area (a) directly affected the quantity of coffee production. If farmers have a larger productivity zone and area, they will have higher production. Finally, relative humidity (d) should be high because it is preferable for coffee cultivation.

    The yield of arabica coffee is relatively unstable due to many factors, for example, changing weather conditions, different soil pH, fluctuation of ambient temperature and changes in air moisture. Therefore, it is essential to forecast coffee productivity in order to meet customers' expectations.

    In this study, ARIMA and ANN were deployed to analyze and predict the crop yield of arabica coffee using data from 2004 to 2018. Both models were able to forecast coffee production. The prediction performances of these models were evaluated using R2 and RMSE. The ARIMA model was optimized for (p, d, q) at (2, 1, 2). Its R2 and RMSE were 0.7041 and 0.1348, respectively. The ANN model employed the Levenberg-Marquardt algorithm with the TRAINLM and LEARNGDM functions, two hidden layers and one PE in each hidden layer. Its R2 and RMSE values of 0.9299 and 0.0642 were highly acceptable. With respect to the R2 and RMSE, the ANN model performed better than the ARIMA model.

    Table 4 shows a comparison with other works concerning different agricultural products. In most of these works, the ANN showed a better R2 than the ARIMA, and the RMSE of the ARIMA was higher than that of the ANN, for example in forecasting rainfall and in predicting pod damage of pigeon pea [35,56,57,58,59,60,61]. In some agricultural predictions, however, the ARIMA performed comparably or better, such as the higher R2 of ARIMA in predicting soil salt and water content in crop rootzones and the prediction of sugarcane production in Bihar [62].

    Table 4.  Comparison of prediction performances with the literature.
    | Reference | Output | Period (yrs) | Predictors | R2 (ANN) | R2 (ARIMA) | RMSE (ANN) | RMSE (ARIMA) |
    |---|---|---|---|---|---|---|---|
    | [56] | Chickpea production | 5 | rainfall, minimum and maximum temperatures | 0.960 | 0.591 | 66.72 | 159.63 |
    | [59] | Wheat production | 58 | total annual precipitation, applied fertilizer, population and cultivated area | 0.930 | - | 0.39 | 1.46 |
    | [62] | Soil salt and water content | 5 | crop rootzone | 0.886 | 0.898 | - | - |
    | [57] | Behavioral pattern of rainfall | 93 | rainfall | 0.984 | 0.953 | 5.518 | 35.88 |
    | [61] | Sugarcane production | 81 | area, production, yield | - | - | 12.99 | 13.82 |
    | [58] | Crop planning | 32 | rainfall | 0.790 | 0.750 | 93.97 | 97.12 |
    | [35] | Pod damage of pigeon pea | 27 | relative humidity | 0.770 | 0.650 | 1.97 | 2.16 |
    | [60] | Agricultural and water resources | 100 | rainfall, temperature | - | - | 59.03 | 76.78 |


    We aimed to forecast the cherry coffee production of arabica coffee cultivated in northern Thailand. Two models for forecasting arabica coffee yields, ARIMA and ANN, were compared. The ARIMA model yielded a coefficient of determination (R2) of 0.7041 and an RMSE of 0.1348. The ANN model produced a higher R2 of 0.9299 and a lower RMSE of 0.0642. In estimating yearly arabica coffee production, both models were determined to be adequate, but the ANN model appeared to perform better. Compared with the R2 and RMSE values reported in the literature, shown in Table 4, the ANN and ARIMA models gave reasonable R2 and RMSE values and were therefore considered suitable for coffee prediction.

    With respect to the shortcomings of this work, they include missing data and the quality and quantity of data for coffee yield prediction. We considered merely six variable datasets; the area and productivity zone, rainfall, RH and temperature. The available amount of data remained low for these factors. Other factors that affect the coffee productivity, such as the amount of fertilizer, climate uncertainty each year, soil moisture, wind speed and amount of sunlight should also be considered, as they will help capture the full complexity of coffee yield. Moreover, flexible models that can capture the dynamic relationships between various factors affecting coffee yield may also be considered.

    For future work, other ML algorithms such as decision trees, random forests, support vector machines, K-nearest neighbors, K-means clustering, principal component analysis and naive Bayes may be considered. Other techniques such as data augmentation from multiple sources, sensitivity analysis and sustainability analysis may be incorporated. Moreover, the feasibility of using remote sensing data, such as satellite imagery, to supplement the existing predictor variables and improve the forecasting models may be assessed. Factors affected by climate change may also be considered.

    The productivity of arabica coffee varies depending on the cultivated area, total rainfall, ambient temperature and RH, among other factors. They affect the yield of cherry coffee in each month. Accurate forecasting of the crop yield is crucial for responding to customer needs. We used ANN and ARIMA models to predict the yield of arabica coffee using time-series data from 2004 to 2018. Both models could forecast coffee production satisfactorily. Within the dataset considered, the ANN (R2 and RMSE of 0.9299 and 0.0642) appeared to perform better than the ARIMA (R2 and RMSE of 0.7041 and 0.1348) model.

    The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was partially supported by Chiang Mai University. One of the authors (Y.K.) wishes to acknowledge the CMU Graduate School for the Research Assistant grant. We also wish to thank the Supply Chain and Engineering Management Research Unit (SCEM), Chiang Mai University, for providing research facilities. This research is part of the project "A Strategic Roadmap Toward the Next Level of Intelligent, Sustainable and Human-Centered SME: SME 5.0" from the European Union's Horizon 2021 research and innovation program under the Marie Skłodowska-Curie Grant agreement No. 101086487.

    Conceptualization, K.Y.T. and N.T.; Methodology, Y.K. and N.T.; Data curation, Y.K.; Formal analysis, Y.K., A.B. and E.R.; Investigation, Y.K. and A.B.; Writing—original draft preparation, Y.K.; Writing—review and editing, N.T. and E.R.; Supervision, K.Y.T.; Funding acquisition, K.Y.T.

    The data from the Climate Department included rainfall, RH and minimum and maximum temperature. The Agricultural Economics Office and Meteorological Department provided the area and productivity zone data.

    All authors declare no conflicts of interest.



    [1] K. Tulla, A. Maker, Can we better predict the biologic behavior of incidental IPMN? A comprehensive analysis of molecular diagnostics and biomarkers in intraductal papillary mucinous neoplasms of the pancreas, Langenbecks Arch. Surg., 403 (2018), 151-194. doi: 10.1007/s00423-017-1644-z
    [2] M. Daude, F. Muscari, C. Buscail, N. Carrere, P. Otal, J. Selves, et al., Outcomes of nonresected main-duct intraductal papillary mucinous neoplasms of the pancreas, World J. Gastroenterol., 21 (2015), 2658-2667. doi: 10.3748/wjg.v21.i9.2658
    [3] J. Farrell, Prevalence, diagnosis and management of pancreatic cystic neoplasms: current status and future directions, Gut Liver, 9 (2015), 571-589.
    [4] K. Ohta, M. Tanada, Y. Sugawara, N. Teramoto, H. Iguchi, Usefulness of positron emission tomography (pet)/contrast-enhanced computed tomography (ce-ct) in discriminating between malignant and benign intraductal papillary mucinous neoplasms (ipmns), Pancreatology, 17 (2017), 911-919. doi: 10.1016/j.pan.2017.09.010
    [5] S. Choi, J. Kim, M. Yu, H. Eun, H. Lee, J. Han, Diagnostic performance and imaging features for predicting the malignant potential of intraductal papillary mucinous neoplasm of the pancreas: a comparison of eus, contrast-enhanced ct and mri, Abdom. Radiol., 42 (2017), 1449-1458. doi: 10.1007/s00261-017-1053-3
    [6] D. D. D. Brennan, G. A. Zamboni, V. D. Raptopoulos, J. B. Kruskal, Comprehensive preoperative assessment of pancreatic adenocarcinoma with 64-section volumetric CT, Radiographics, 27 (2007), 1653-1666. doi: 10.1148/rg.276075034
    [7] M. Tanaka, C. F. Castillo, V. Adsay, S. Chari, M. Falconi, J. Y. Jang, et al., International consensus guidelines 2012 for the management of IPMN and MCN of the pancreas, Pancreatology, 12 (2012), 183-197. doi: 10.1016/j.pan.2012.04.004
    [8] M. Tanaka, S. Chari, V. Adsay, F. Castillo, M. Falconi, M. Shimizu, et al., International consensus guidelines for management of intraductal papillary mucinous neoplasms and mucinous cystic neoplasms of the pancreas, Pancreatology, 6 (2006), 17-32. doi: 10.1159/000090023
    [9] Y. Gu, C. Lan, H. Pei, S. N. Yang, F. Y. Liu, L. L. Xiao, Applicative value of serum CA19-9, CEA, CA125 and CA242 in diagnosis and prognosis for patients with pancreatic cancer treated by concurrent chemoradiotherapy, Asian Pac. J. Cancer Prev., 16 (2015), 6569-6573. doi: 10.7314/APJCP.2015.16.15.6569
    [10] C. Jayasree, M. Abhishek, G. Lior, A. Marc, L. Liana, A. Peter, et al., CT radiomics to predict high risk intraductal papillary mucinous neoplasms of the pancreas, Med. Phys., 45 (2018), 5019-5029. doi: 10.1002/mp.13159
    [11] S. Park, L. C. Chu, R. Hruban, B. Vogelstein, K. W. Kinzler, A. L. Yuille, et al., Differentiating autoimmune pancreatitis from pancreatic ductal adenocarcinoma with CT radiomics features, Diagn. Interventional Imaging, 101 (2020), 555-564. doi: 10.1016/j.diii.2020.03.002
    [12] Y. Zhang, C. Cheng, Z. Liu, L. Wang, G. Pan, G. Sun, et al., Radiomics analysis for the differentiation of autoimmune pancreatitis and pancreatic ductal adenocarcinoma in 18F-FDG PET/CT, Med. Phys., 46 (2019), 4520-4530. doi: 10.1002/mp.13733
    [13] R. Wei, K. Lin, W. Yan, Y. Guo, Y. Wang, J. Li, et al., Computer-aided diagnosis of pancreas serous cystic neoplasms: a radiomics method on preoperative MDCT images, Technol. Cancer Res. Treat., 18 (2019), 1-9.
    [14] D. Sahani, A. Kambadakone, M. Macari, N. Takahashi, S. Chari, F. Castillo, Diagnosis and management of cystic pancreatic lesions, Am. J. Roentgenol., 200 (2013), 343-354. doi: 10.2214/AJR.12.8862
    [15] Y. Chou, C. Tiu, G. Hung, S. Wu, T. Chang, H. Chiang, Stepwise logistic regression analysis of tumor contour features for breast ultrasound diagnosis, Ultrasound Med. Biol., 27 (2001), 1493-1498. doi: 10.1016/S0301-5629(01)00466-5
    [16] Y. Guo, Y. Hu, M. Qiao, Y. Wang, J. Yu, J. Li, et al., Radiomics analysis on ultrasound for prediction of biologic behavior in breast invasive ductal carcinoma, Clin. Breast Cancer, 18 (2018), e335-e344. doi: 10.1016/j.clbc.2017.08.002
    [17] R. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Trans. Syst. Man Cybern., 3 (1973), 610-621.
    [18] M. Galloway, Texture analysis using gray level run lengths, NASA STI/Recon Tech. Rep. N, 4 (1975), 172-179.
    [19] G. Thibault, B. Fertil, C. Navarro, S. Pereira, P. Cau, N. Levy, et al., Shape and texture indices application to cell nuclei classification, Int. J. Pattern Recognit. Artif. Intell., 27 (2013), 1357002. doi: 10.1142/S0218001413570024
    [20] M. Amadasun, R. King, Textural features corresponding to textural properties, IEEE Trans. Syst. Man Cybern., 19 (1989), 1264-1274. doi: 10.1109/21.44046
    [21] A. Dalalyan., M. Hebiri, J. Lederer, On the prediction performance of the Lasso, Bernoulli, 23 (2017), 552-581.
    [22] H. Richard, R. Patricia, R. Vincent, Lasso and probabilistic inequalities for multivariate point processes, Bernoulli, 21 (2015), 83-143.
  • © 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)