Rainfall Model Using Principal Component Regression Analysis with R Software in Sulawesi

DOI: 10.24042/djm.v3i3.6108

Indonesia is a tropical country with two seasons, rainy and dry. Climate change is now making rainfall erratic. Rainfall is influenced by several factors, one of which is the set of local-scale factors. This research aimed to build a rainfall model for Sulawesi in order to describe the relationship between rainfall and local-scale factors there. The data were secondary data from Badan Pusat Statistik (BPS), consisting of 15 samples with 6 variables; the small sample size reflects the limited secondary data available. The data were processed using Principal Component Regression Analysis. First, the local-scale factor variables were reduced to principal components that explain most of the variability in the original data; these components were then used as predictors in a regression analysis. The analysis was carried out in R Studio. The results show that two principal components explain 75.2% of the variability of the original data and that only one of them is significant for rainfall. The regression model indicates that rainfall moves in the same direction as humidity, air temperature, air pressure, and solar radiation, and in the opposite direction to wind velocity. Overall, the study provides an overview of applying Principal Component Regression analysis in R to model rainfall in the Sulawesi region.

http://ejournal.radenintan.ac.id/index.php/desimal/index

*Correspondence: E-mail: budi.nurani@unpad.ac.id

Rain is a common natural phenomenon that occurs around the world. Because of climate change caused by the greenhouse effect, rainfall has become erratic. Rainfall data are nevertheless needed to estimate water availability for local living things, to determine the boundaries between the rainy and dry seasons, and to manage flood and drought disasters. The intensity of rainfall is usually influenced by several factors, one of which is the set of local-scale factors: air humidity, air temperature, air pressure, wind speed, solar radiation, and so on. Sulawesi is one of the areas in Indonesia considered to have quite low rainfall, at around 1000-2000 mm/year. Several scientific papers have discussed rainfall models, including a sea surface temperature rainfall model for West Kalimantan using the Stepwise Regression method (Handiana et al., 2016) and a rainfall estimation model with climatic factors for Bangladesh using Multiple Regression Analysis (Navid & Niloy, 2018). The methods employed in these works were less than optimal because they did not consider the multicollinearity problem. As a result, the models obtained contained multicollinearity and should be improved.
Many scientific works now employ analytical methods to overcome the multicollinearity problem, for example Ridge Regression (Gorgees & Ali, 2017), Principal Components Analysis applied to the factors that affect the human development index in East Java (Sudrajat, 2016), and Latent Root Regression applied to the factors that affect the JCI on the Indonesia Stock Exchange. The analysis methods used in the mentioned scientific papers were generally assisted by Microsoft Excel, SAS, and SPSS (Untari & Susanti, 2017).
In this research, a rainfall model for the Sulawesi region was built from multivariate data using Principal Component Regression Analysis to overcome the multicollinearity problem, so that a suitable model for prediction studies could be obtained with the help of R Studio software.

METHODS
The method used in this research was a literature study for theoretical studies and experimental studies through simulation and data processing with the Principal Component Regression model.

The Basic Concepts of Principal Component Regression Analysis
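One of the basic algebraic concepts used is that of the eigenvalues and eigenvectors associated with Principal Component Analysis. The eigenvalues $\lambda$ of an $n \times n$ matrix $A$ are the solutions of

$$|A - \lambda I| = 0, \tag{1}$$

and the eigenvector $x$ corresponding to an eigenvalue $\lambda$ is a nonzero vector satisfying

$$A x = \lambda x. \tag{2}$$

The concept of correlation is also needed to describe the linear relationship between two or more quantitative variables. The correlation value is a standardized covariance:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}. \tag{3}$$

If the correlation between the independent variables is strong enough, it can cause multicollinearity.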
If there is multicollinearity, a variable that has a strong correlation with other variables in the regression model might have an unreliable and unstable power of prediction (Rencher, 2002).
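The eigenvalue and eigenvector definitions can be checked numerically. The following is a minimal Python/NumPy sketch, not the R Studio code used in the study; the 2 x 2 matrix is an illustrative assumption that borrows only the 0.83 correlation value reported later for one pair of variables.

```python
import numpy as np

# Illustrative 2 x 2 correlation matrix (assumed for this sketch, not the study's Table 2).
A = np.array([[1.0, 0.83],
              [0.83, 1.0]])

# eigh is intended for symmetric matrices; eigenvalues are returned in ascending order
# and the eigenvectors are the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Check the defining relation A x = lambda x for every eigenpair at once.
residual = A @ eigenvectors - eigenvectors * eigenvalues

print("eigenvalues:", eigenvalues)
print("largest |A x - lambda x| entry:", np.abs(residual).max())
```

A strong correlation pushes one eigenvalue toward zero, which is the algebraic signature of the multicollinearity described above.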

Regression Analysis
Regression analysis is a statistical method for examining, modeling, and predicting relationships between variables. The relationship can be expressed as an equation that connects the independent variable ($X$) with the dependent variable ($Y$) (Montgomery et al., 2012). In general, a regression model with $k$ independent variables and $n$ observations can be written as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n.$$

The regression model can also be written in matrix notation as follows:

$$Y = X\beta + \varepsilon.$$

The parameters can be estimated with the Least Squares Method if the data do not contain multicollinearity. The estimator is calculated as follows:

$$\hat{\beta} = (X'X)^{-1} X'Y,$$

where $\hat{\beta}$ is an unbiased estimator of the parameter $\beta$, such that $E(\hat{\beta}) = \beta$.
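The least squares estimator can be illustrated with a short Python/NumPy sketch. The data below are made up and noise-free (an assumption chosen so that the known coefficients are recovered exactly); they are not the study's data, and the study itself used R Studio.

```python
import numpy as np

# Made-up predictors for five observations (assumed values for illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Noise-free response generated from known coefficients: beta = (2, 3, -1).
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the normal equations (X'X) beta = X'y instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to computing the inverse of X'X directly; when X'X is nearly singular because of multicollinearity, both approaches become unstable, which is the motivation for the principal component approach.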

Principal Component Analysis
Principal Component Analysis (PCA) was introduced by Karl Pearson in 1901 and named by Harold Hotelling in 1933. According to Sudrajat (2016), Principal Component Analysis is the best method for solving the multicollinearity problem because the resulting principal components are uncorrelated (their correlation is zero) under all research data conditions. PCA can be based on either a covariance matrix or a correlation matrix.
If $P$ is an orthogonal matrix of size $p \times p$, the principal components are defined as combinations of the original independent variables, which can be expressed in matrix form as follows:

$$W = P'X,$$

where $P$ is the $p \times p$ eigenvector matrix and $X$ is the $p \times 1$ vector of original variables. Written as a linear combination, the $j$-th principal component is:

$$W_j = p_{1j} X_1 + p_{2j} X_2 + \dots + p_{pj} X_p.$$

If the original variables are measured in different units, they are first transformed into standard scores (standardization) using the following formula:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{s_j}, \tag{10}$$

where $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the $j$-th variable. When PCA is based on a correlation matrix, the criterion is to retain the principal components whose eigenvalues are at least one ($\lambda \geq 1$). The retained principal components should together represent approximately 75% of the total variance (information) of the independent variables.
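The standardization, eigen-decomposition, and selection criteria described above can be sketched in Python/NumPy as follows. This is an illustrative sketch on synthetic data, not the study's R Studio code; the random data, the seed, and all variable names are assumptions.

```python
import numpy as np

# Synthetic stand-in data: 15 observations of 5 predictors (NOT the BPS data).
rng = np.random.default_rng(0)
n, p = 15, 5
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 1] + 0.3 * rng.normal(size=n)  # make two columns strongly correlated

# Standardize each column: subtract its mean, divide by its sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, sorted by decreasing eigenvalue.
R = np.corrcoef(Z, rowvar=False)
eigvals, P = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

# Principal component scores: each column of W is a linear combination of the Z columns.
W = Z @ P

# Selection criteria named in the text: eigenvalue >= 1 and cumulative variance share.
keep = eigvals >= 1.0
cumulative_share = np.cumsum(eigvals) / p

print("eigenvalues:", np.round(eigvals, 3))
print("components with eigenvalue >= 1:", int(keep.sum()))
print("cumulative variance share:", np.round(cumulative_share, 3))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, dividing the cumulative sum by `p` gives the cumulative share of variance directly. The columns of `W` are mutually uncorrelated, which is what removes the multicollinearity in the subsequent regression.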

Principal Component Regression Analysis
According to Mariana (2013), Principal Component Regression Analysis combines principal component analysis with regression analysis, with the principal component analysis serving as a preliminary analysis stage. The principle of Principal Component Regression is to select several principal components to be used as independent variables in a regression whose coefficients are estimated by the Least Squares Method.
There are two ways of forming a Principal Component Regression through principal component analysis, namely using a covariance matrix or a correlation matrix (Jolliffe, 2010). The choice between them depends on the range of the observations of the independent variables.
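If $P$ is an orthogonal $p \times p$ matrix with $P'P = PP' = I$, where $W = XP$, then the multiple linear regression equation becomes the Principal Component Regression as follows:

$$Y = X\beta + \varepsilon = XPP'\beta + \varepsilon = W\alpha + \varepsilon,$$

where $\alpha$ denotes the vector of regression parameters and $\alpha = P'\beta$.
The Principal Component Regression model that has been reduced to $m$ principal components is stated as follows:

$$Y = \alpha_0 + \alpha_1 W_1 + \alpha_2 W_2 + \dots + \alpha_m W_m + \varepsilon.$$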
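The regression on principal components can be sketched end to end in Python/NumPy. This is a hedged illustration rather than the study's R Studio code: the function name `pcr_fit`, the synthetic data, and the choice to map the coefficients back to the standardized predictors are all assumptions made for the example.

```python
import numpy as np

def pcr_fit(X, y, m):
    """Regress y on the first m principal components of the standardized X.

    Returns the intercept, the component coefficients (alpha), and the implied
    coefficients on the standardized predictors (P_m @ alpha).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, P = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:m]      # indices of the m largest eigenvalues
    P_m = P[:, order]
    W = Z @ P_m                                # component scores used as regressors
    design = np.column_stack([np.ones(len(y)), W])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alpha0, alpha = coef[0], coef[1:]
    beta_z = P_m @ alpha                       # back-transform to standardized predictors
    return alpha0, alpha, beta_z

# Synthetic example (NOT the study's data): 15 observations, 5 predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0, 0.0, 0.3]) + 0.1 * rng.normal(size=15)

alpha0, alpha, beta_z = pcr_fit(X, y, m=2)
print("intercept:", alpha0)
print("component coefficients:", alpha)
print("implied coefficients on standardized predictors:", beta_z)
```

The signs of the back-transformed coefficients are what allow statements about whether the response moves in the same or the opposite direction as each original predictor. If all components are kept (`m` equal to the number of predictors), the back-transformed coefficients coincide with ordinary least squares on the standardized predictors.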

R Studio Software in Principal Component Regression Analysis
R Studio is software for statistical computing and data processing (Chambers, 2008). It is an integrated development environment (IDE) for R, a programming language for statistics and graphics. R Studio was founded by JJ Allaire, the creator of the ColdFusion programming language, and is partly written in the C++ programming language.

Research Data
The data used in this research were the data of rainfall, humidity, air temperature, air pressure, wind speed, and solar radiation in 15 districts/cities of Sulawesi in 2018. The data were obtained from the Badan Pusat Statistik (BPS) Sulawesi as presented in Appendix 1.

Building the Principal Component Regression Model
In this section, a rainfall model for Sulawesi, specifically for the 15 studied locations, was built using Principal Component Regression Analysis by performing the following steps:

Standardizing the Independent Variables
In this step, the original independent variables ($X$) were transformed into standardized independent variables ($Z$) using equation (10) because they had different measurement scales. The standardized independent variables obtained through R Studio data processing are displayed in Table 1.

Establishing a Correlation Matrix between Standardized Independent Variables
In this step, the correlation matrix between the standardized independent variables ($Z$) was formed to check whether a multicollinearity problem was present. The correlations between the five standardized independent variables were calculated using equation (3) through R Studio data processing, and the results can be seen in Table 2. Table 2 shows that one pair of variables, $Z_2$ and $Z_3$, has a correlation of 0.83. Thus, it can be concluded that the data contained a multicollinearity problem, which can prevent the fitted model from predicting the dependent variable precisely.
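Screening a correlation matrix for strongly correlated pairs can be automated. The Python sketch below is illustrative only: the helper name `correlated_pairs`, the 0.8 cutoff, and every matrix entry except the 0.83 between the second and third variables are assumptions, since Table 2 is not reproduced here.

```python
import numpy as np

def correlated_pairs(R, threshold=0.8):
    """Return (i, j, r) for variable pairs whose |correlation| meets the threshold.

    Variables are numbered from 1 to match the subscripts used in the text.
    The default threshold of 0.8 is an assumed rule of thumb.
    """
    pairs = []
    p = R.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if abs(R[i, j]) >= threshold:
                pairs.append((i + 1, j + 1, float(R[i, j])))
    return pairs

# Hypothetical 5 x 5 correlation matrix; only the 0.83 entry comes from the text.
R = np.eye(5)
R[1, 2] = R[2, 1] = 0.83
R[0, 4] = R[4, 0] = -0.35

print(correlated_pairs(R))
```

With real data, `R` would be obtained from the standardized variables, for example with `np.corrcoef(Z, rowvar=False)`.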

Eigenvalues and Eigenvectors
In this step, the eigenvalues and eigenvectors were calculated as in equations (1) and (2) through R Studio data processing. Based on Table 4, the first two principal components had eigenvalues greater than one ($\lambda \geq 1$), namely 2.449 and 1.312. Together, these two principal components explained 75.2% of the variability of the original data. Therefore, the principal components used were $W_1$ and $W_2$.

Based on Table 6, $F_{\text{calculated}} > F_{\text{table}}$, so $H_0$ was rejected. It can be inferred that at least one principal component variable ($W$) contributed to the dependent variable ($Y$).
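The 75.2% figure follows directly from the two reported eigenvalues, because the eigenvalues of a correlation matrix of five variables sum to 5. A short Python check, using only the values reported in the text:

```python
# Eigenvalues of the two retained principal components, as reported in the text.
retained_eigenvalues = [2.449, 1.312]
n_variables = 5  # humidity, air temperature, air pressure, wind speed, solar radiation

# For a correlation matrix the total variance equals the number of variables.
shares = [100 * ev / n_variables for ev in retained_eigenvalues]
cumulative = 100 * sum(retained_eigenvalues) / n_variables

print([round(s, 1) for s in shares])  # [49.0, 26.2]
print(round(cumulative, 1))           # 75.2
```

The first component alone accounts for about 49% of the variance and the second for about 26%.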

Performing
Then, to determine whether each principal component variable ($W$) contributed to the dependent variable ($Y$), an individual regression coefficient test was performed using the t-test. Based on Table 7, $|t_{\text{calculated},1}| < t_{\text{table}}$, so $H_0$ was not rejected. Thus, it can be concluded that the principal component $W_1$ did not significantly influence the dependent variable $Y$. In contrast, $|t_{\text{calculated},2}| > t_{\text{table}}$, so $H_0$ was rejected, and it can be concluded that $W_2$ significantly influenced the dependent variable $Y$.
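The individual coefficient test can be sketched in Python/NumPy as follows. This is an illustration on synthetic data, not a reproduction of Table 7: the helper name `coefficient_t_stats`, the simulated scores, and the 5% significance level are assumptions (the text does not state the level used). The critical value 2.179 is the two-sided 5% point of Student's t with 15 - 3 = 12 degrees of freedom.

```python
import numpy as np

def coefficient_t_stats(W, y):
    """Least squares fit of y on W with an intercept; returns (coef, t statistics)."""
    n = len(y)
    design = np.column_stack([np.ones(n), W])
    k = design.shape[1]                       # number of estimated parameters
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard errors of the coefficients
    return coef, coef / se

# Synthetic component scores (NOT the study's data): y depends on the second one only.
rng = np.random.default_rng(2)
W = rng.normal(size=(15, 2))
y = 3.0 + 2.0 * W[:, 1] + 0.5 * rng.normal(size=15)

coef, t = coefficient_t_stats(W, y)

T_TABLE = 2.179  # assumed 5% two-sided critical value, 12 degrees of freedom
for j in (1, 2):
    decision = "reject H0" if abs(t[j]) > T_TABLE else "do not reject H0"
    print(f"W_{j}: t = {t[j]:.3f}, {decision}")
```

Each t statistic is the estimated coefficient divided by its standard error, and the null hypothesis that the coefficient is zero is rejected when its absolute value exceeds the tabulated critical value.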

CONCLUSION
Based on the description above, it can be concluded that a rainfall model for the Sulawesi region with local-scale factors, based on secondary data from Badan Pusat Statistik (BPS), can be built using Principal Component Regression analysis assisted by R Studio software. The five original independent variables were reduced to two principal component variables, which explain 75.2% of the variability of the original data, and only one of these was significant for the dependent variable. The resulting regression model shows that rainfall moves in the same direction as air humidity, air temperature, air pressure, and solar radiation, and in the opposite direction to wind speed. The use of R Studio can simplify and speed up the calculations of the Principal Component Regression Analysis.