CLUSTERING AND RISK ANALYSIS OF THE EARTHQUAKE IN SULAWESI USING MINI BATCH K-MEANS, K-MEDOIDS, AND MAXIMUM LIKELIHOOD METHOD

,


Introduction
Sulawesi is a triple junction of Eurasian, Hindia-Australia, and Pacific Ocean plates so Sulawesi has complex structural geology such as the Palu-Koro fault, North Sulawesi trench, Mantano fault, East Sangihe trust, Tolo trust, Batui trust, Lawanopo Fault, Kolaka fault, Waianae fault, and Tukang Besi (White et al., 2017).The effects of that collisions are the release of seismic waves in the subduction zone where the energy is spread out radially in the subsurface of has caused earthquakes significantly.The energy effect of the seismic waves can cause changes in stress in the near epicenter of the earthquakes.The accumulation of this stress change can cause changes in seismic activity around the postearthquake area so that it can trigger earthquakes in the future.The relationship between seismic stress or recurrence of the earthquake with time failure is divided into interseismic, coseismic, and postseismic (Nurana et al., 2021).The stress accumulation of earthquakes event and ductility of rock correlated with the bvalue and a-value parameters.The b-value indicates the stress of the tectonic parameter, and the a-value indicates seismicity within a given region (Scholz, 1968).
The a value and b value parameter have been widely used for recurrence earthquake identification, such as spatial-temporal condition recent seismicity in Sumatera (Nurana et al., 2021), a-value and b-value indicator of the earthquake potential in West Nusa Tenggara (Ernandi & Madlazim, 2020), statistical analysis of South Sulawesi earthquake (Kurniawati & Muhammad Rahman, 2020), and bvalue and seismic hazard investigation (Ray et al., 2019).Based on their research of the a-value and b-value, we found a difference in characteristic seismicity within the region.Consequently, stress states are also varied within the area because of the material characteristic of the subsurface difference (Scholz, 1968(Scholz, , 2015)).The relations of the frequency earthquake event with the a-value and the bvalue parameter were revealed by Gutenberg & Richter (1944).Furthermore, the comprehensive interpretation, of machine learning is recently used to analyze the event earthquake characteristics where the resemblance characteristics are signed by equal clusters.
Clustering is one of the parts of data mining that is used for exploratory information on seismic events and is part of unsupervised learning categories.From these techniques, the earthquake or seismic patterns of the unlabeled attribute can be determined.The gaussian mixture model has been applied by Seydoux et al. (2020) to clustering earthquake signals and this method is one unsupervised machine learning technique.Inherently, K-Medoids and K-Means clustering is also applied to clustering earthquake risk where R programming is implemented in the clustering process (Ahmad & Irsalinda, 2020;Novinanti et al., 2017;Rifa et al., 2020;Yuan, 2021).The euclidean distance used to find the centroid cluster of the catalog earthquake is the square of the straight line distance of two points in Euclidean space.
Some research has been conducted in the study area to explain the tectonic process in Sulawesi, such as tectonic activities in Sulawesi (Zakaria & Sidarto, 2015), geological history of Western Sulawesi (White et al., 2017), seismic hazard assessment (Cipta et al., 2016;Eva, 2016).Clustering earthquake risk by K-Means and K-Medoids method with R programming has been conducted by Ahmad & Irsalinda (2020), Novinanti et al. (2017), and Rifa et al. (2020).Moreover, research on seismic activities must be continuously conducted to explain the recent seismic activities in both spatial and temporal conditions.For seismic activities analysis, the raw data is split to be some attributes such as time, date, magnitude, and depth.Then, the maximum likelihood method is applied to determine the b-value of the Gutenberg-Richter formula, and Mini Batch K-Means of machine learning methods are applied to describe the earthquake event clusters, where this method is not applied in any clustering earthquake research yet.The results of the cluster of this method are compared with K-Means and K-Medoids method.That methods are validated by Davies-Boudin method (Davies & Bouldin, 1979).In particular, identify Gutenberg-Richer equation of Sulawesi earthquake of 1980 to early 2022, temporal and spatial analysis of b-value, efficiency and validation of clustering methods, temporal and spatial clustering, and spatial earthquake class are conducted.Python language program is implemented to perform this analysis.

Data and Methods
The earthquake data used in this research is secondary data obtained from the Incorporated Research Institution for Seismology (IRIS) by entering the coordinates or boundaries of the research area.IRIS data have been widely used in seismic analysis, namely earthquake risk (Rifa et al., 2020), regional tomography earthquake (Koulakov et al., 2014), earthquake hazard (Papadopoulos & Kijko, 1991), etc.This data is used to analyze the characteristics and risks of the earthquakes that occurred over four decades in Sulawesi.During this decade, there are some destructive earthquakes, such as the earthquake in Donggala Regency in 2018, the earthquake in Mamuju Regency in 1984, and the earthquake in Mamuju Regency in 2021, etc.The earthquake events have been collected in the period 1980-2022, with the earthquake magnitude being more than equal to 5 SR, where there are 1039 data in that period.The data were analyzed using machine learning methods (Mini Batch K-Means and K-Medoids methods) and statistical method (maximum likelihood method).In machine learning, Mini Batch K-Means and K-Medoids methods are validated using the Davies Bouldin Index (DBI) method to determine the optimal number of clusters.The maximum Likelihood method is used to determine the pattern or level of activity of tectonic events (b-value and avalue).
The attributes used to cluster seismic events are earthquake epicenter, depth, and magnitude, while a-value and b-value analysis, earthquake epicenter, depth, magnitude, hours, minute, years, and month attributes are needed.To perform that analysis, the module of python language programming are pandas, geopandas, numpy, sci-kit-learn (Pedregosa et al., 2011), matplotlib, and seaborn.Meanwhile, temporal and spatial analysis of the a-value and b-value is conducted by the maximum likelihood method with zmap 6.0 free under Matlab language programming.The earthquake classifications are divided into six classes as shown in table 1.The earthquake maximum recorded by the seismometer in this data is magnitude 7.9 with serious damage effect as revealed in Table 1.The class of the earthquake in Table 1 are used to map the spatial distribution of the earthquake effect.
Table 1.Class and effect of the earthquake event.

Magnitude
Class Effect 2.5 or less 2.5 to 5.4 5.5 to 6.0 6.1 to 6.9 7.0 to 7.9

Machine Learning Methods
Machine learning is a method for analyzing data mining, and is divided into supervised and unsupervised learning but in this paper, unsupervised learning, like Mini Batch K-Means and K-Medoids, is used to reach out the number clusters of an earthquake event.Before it is conducted, the number of clusters of data is determined by the elbow and silhouette method where the advantage of a silhouette over than elbow is cluster structure and data distributions within the cluster as revealed in Table 2 (P. Rousseeuw et al., 1997).where is number of the earthquake event, ( ) silhouette value (P.J. Rousseeuw, 1987).The numbers clusters in both elbow and silhouette methods have been validated by The Davies-Bouldin index (DBI) which is followed these formulas (Davies & Bouldin, 1979): where is called the Davies-Bouldin index, is the ratio of the qth root of the qth moment of the points in cluster i,j about the mean, + with separation between cluster i,j, , .
Both Mini Bath K-Means and K-Medoids use the Euclidean distance to find centroid points within the earthquake event catalog.Mathematically the formula Euclidean distance follows equation 3. Mini Batch K-Means Method is more efficient in convergent than K-Means Method because the gradient descent method is applied to the optimization process on the updated centroid.Consequently, this method can be applied to big data analysis.to anticipate the outlier data of earthquake catalog, K-Medoids or partitioning around medoid method (PAM) is applied in this research.Like K-Means or Mini Batch K-Means methods, the K-Medoids method uses the same attributes as the input in its algorithm.In principle, the K-Medoids method divided n the earthquake data to be k cluster is a method that uses Euclidean distance to determine the medoid of each cluster.The algorithm of the K-Medoids method: Step 1: Select the initial model 1: Given: K, iterations, data set X 2: Initialize each c ∈ C with an x picked randomly from X 3: v ← 0 4: for i = 1 to t do 5: M ← b examples picked randomly from X 6: for x ∈ M do 7: d[x] ← f(C, x) // Cache the center nearest to x 8: end for 9: Select k objects having the first k smallest values as initial medoids.
Step 2: (Update medoids) 10: 10:Find a new medoid of each cluster, which is the object minimizing the total distance to other 11: objects in its cluster.Update the current medoid in each cluster by replacing it with the new medoid.
Step 3: (Assign objects to medoids) 12: Assign each object to the nearest medoid and obtain the clustering result.13: Calculate the sum of the distance from all objects to their medoids.If the sum is equal to the previous one, then stop the algorithm.Otherwise, go back to Step 1.
A flowchart used to cluster earthquake event data and its spatial mapping is shown in Figure 1.

The Maximum Likelihood Method and Estimated Damage
b-value called statistical parameters can describe the earthquake event which is caused by tectonic activities.The maximum likelihood method is used to calculate this value following Equation 4 (Kijko, 1988;Utsu, 1965): b-value and a-value are the parameters that use to describe the seismicity caused by tectonic in the regional catalog of seismicity.The relations of these constants have been described by following linear formula (Gutenberg & Richter, 1944):

Results and Discussions The Clustering Earthquake Events by Mini Batch K-Means and K-Medoids
The clustering of earthquake events has been successfully analysed or extracted information of the seismic events where the amount of the recorded data is 1039 events on magnitude ≥ 5 SR.The optimum number of the cluster has been obtained by Elbow and Silhouette coefficient for Mini Batch K-Means, K-Means, and K-Medoids as shown in Figure 3  Based on Figure 3 and Table 3, The optimum number cluster of K-Means, Mini Batch K-Means, and K-Medoids are 3, 3, and 4 respectively, however, Mini Batch K-Means is more efficient in time or converges faster than the others because of applied gradient descent optimization on its algorithm (Pedregosa et al., 2011;Sculley, 2010).Gradient descent is one of the optimization algorithms used to find a local minimum of the earthquake attribute namely epicenter, magnitude, and depth.For more accuracy, the Silhouette method is applied to determine of the number clusters where the highest value of the Silhouette coefficient is used as the cluster number.Silhouette coefficient can be figured out by structure relations within the member clusters (Hidayati et al., 2021;Sculley, 2010).The other advantage of computing the Silhouette coefficient using python programming, the member size within each cluster can be interpreted descriptively as shown in Figure 4.  Based on Table 4, the highest Silhouette coefficient on Both K-Means and Mini Batch K-Means is 0.73, and its member distribution is revealed in Figure 4.It means that earthquake events in each cluster have a strong similarity in the characteristic.Meanwhile K-Medoids method, the highest Silhouette coefficient is in 2 clusters with 0.70 Silhouette coefficient as in Table 4. Based on Table 2, this coefficient value is indicated that the cluster member has either a reasonable structure or quite a resemblance.
Based on the description of clustering the earthquake event, Both the Elbow and Silhouette coefficient method for Mini Batch K-Means and K-Means Method have a match in determining the optimum number of the cluster, namely 3 clusters.However, both the Elbow and Silhouette coefficient methods for K-Medoids are not in line to determine the optimum number cluster where the Elbow method is 4 clusters whereas the Silhouette coefficient is 4 clusters.In this case, Davies & Bouldin (1979) in a cluster separation measure had shown a method to validate of number clusters which are Davies-Bouldin Index (DBI).This method has been adopted by Ahmad & Irsalinda (2020) in Fuzzy K-Means with R programing language to validation of their models.In the same way, this research also applied the DBI technique to the number cluster validation for the K-Medoids method as in Table 5.The lower value of DBI is indicating a better cluster (Davies & Bouldin, 1979;Pedregosa et al., 2011) so The optimum number of the cluster used to map temporal and spatial clusters are 3 clusters for Mini Batch K-Means and 2 clusters for K-Medoids.
In this research, both temporal and spatial mapping as in Figure 5 and Figure 6 have been conducted to analyze earthquake events in Sulawesi so that the earthquake information is more comprehensive.based on the depth, Fowler (2004) has stated that the earthquake event is divided into three classes namely shallow earthquakes with less than 70 km, intermediate earthquakes between 70 km up to 300 km, and deep earthquakes with more than 300 km.Interesting results are depicted in Figures 5b and 5c where the pattern of the earthquake events are well segmented and have a similar classification as described by Fowler (2004), instead in Figure 5b.The moderate up to strong damages are in cluster 0 where its depth below 100 km, meanwhile the depth of the others cluster are in among above 100 km up to below 350 km (intermediate earthquakes with the average damage in moderate class), and above 350 (deep earthquake with the average damage is slight class.The K-Medoids method only recognized 2 classes, where cluster 0 below 100 km (shallow earthquake) and cluster 1 above 99 km.The members of cluster 1 are in the intermediate and deep earthquakes.The cluster 0 both Mini Bath K-Means and K-Medoids methods are the representation of the shallow earthquake where this cluster is more devastating than others because the earthquake event in this depth is close to the surface.Therefore the wave energy is received more destructive than others of the earthquake class.For example, Setiyono et al.(2019) recorded the shallow earthquake in Donggala-Palu-Sigi with 5.9 and 7.4 magnitudes of 10 km and 11 km in depth, respectively, are moderate and seriously destructive.Based on Figure 6a, cluster 0 are covered North Sulawesi, Gorontolo, Middle Sulawesi, and West Sulawesi, some regions of South Sulawesi, and Southeast Sulawesi Province, cluster 1 are covered North Sulawesi, Gorontolo, and some regions of Middle Sulawesi Province, and cluster 2 are covered some region of Gorontalo, Center Sulawesi, and Southeast Sulawesi Province.The most dominant earthquake events occur in North Sulawesi, Gorontolo, Middle Sulawesi, and West Sulawesi Province, while the earthquake events in South Sulawesi Province rarely occurs in the others so that disaster caused by the earthquake probably minimum happened in this province.Based on K-Medoids in either Figure 5c or Figure 6b, the effect of earthquake caused by cluster 0 is not felt in West and South Sulawesi because of the distance of the earthquake events (extremely attenuated wave) and the earthquake type (intermediate earthquake).This result has a similarity in cluster 1 with Mini Batch K-Means Method.The centroid point of each cluster is revealed in Table 6, where the shallow earthquake is represented by centroid 0, the intermediate earthquake is represented by centroid 1, and the deep earthquake is represented by centroid 2. The seismicity activities of the earthquake of each cluster are described by a-value and b-value parameters.The a and b-Value and Estimated Damage Cluster 0 in this paper is hight destructive than the others because cluster members are in the shallow earthquake.During the period 1980-2022, the percentage of moderate, major, and strong earthquake events was recorded above 80 percent, under 10 percent, and under 5 percent, respectively, as shown in Figure 7a.Other than that, slight, severe, and serious damage s was recorded above 80 percent, under 10 percent, and under 5 percent, respectively as shown in Figure 7b.For example, the Slight and strong damage earthquake in Manado with magnitude 5.3 SR at 26 km, magnitude 5.2 SR at 33 km, and magnitude 6.7 SR at 63 km was confirmed by Setiyono et al.(2019).The damage caused by the earthquakes, namely slight damage to some buildings, died, injured, and bridges.Another example of serious damage with magnitude 7.4 SR at 11 km was recorded in Donggala-Palu-Sigi which damage caused by this disaster were deaths, buried, injured and lost people, damaged buildings, etc.The frequency of earthquake events and damage classification as shown in Figures 7c and 7d The effects of the earthquake events have been revealed in Figure 8, where the earthquake events are relatively related to tectonic activities.The frequencymagnitude relation caused by tectonic activity expressed by Gutenberg & Richter (1944) is reciprocal as shown in Figures 8a, 8b and 8c.The magnitude complement (Mc) of the earthquake event is 5 SR.This value is similar to the magnitude minimum in Figure 8b, so the magnitude complement is the lowest magnitude or minimum frequency of the earthquake events.The temporal magnitude complement (Mc) variations in Figure 8d showed the minimum wave energy on the subduction zone is diverse (Scholz, 2015), and the minimum wave energy increased pattern in the period 1990 to 2005 shows the energy gradient of the earthquake is relatively both strong and destructive.and 1.Based on this statement, the earthquake events in this research are mostly on plutonic and volcanic rocks (White et al., 2017).There are a major minimum curve and two minor minimum curves as shown in Figure 8e.The first minimum that occurred during the period 1980-1989 was increasing either force or accumulated stress and decreasing either force or release energy (Rigo et al., 2018;Scholz, 2015).In the major earthquakes of 1984 and 1986, there was a decreasing gradient of the This pattern was repeated during the period 1990-2001, so the energy of the seismic wave was released.The increasing gradient that happened during the period  Mini Batch K-Medoids is classified into minor effect up to slight although these regions have high seismicity of the earthquake.

Conclusion
The Analysis pattern of the earthquake events is successfully conducted by Mini Batch K-Means and K-Medoids method (machine learning), and the maximum likelihood method.Other than that, Mini Batch K-Means and K-Medoids method have been validated by Davies-Bouldin Index.While the maximum likelihood method is used to calculate the a-value and b-value.Mini Batch K-Means has classified the depth of the earthquake events into shallow, intermediate, and deep earthquakes, while the K-Medoids method can classifications of the earthquake events into shallow earthquakes and earthquake events above 100 km.Based on the Davies-Bouldin index and silhouette coefficient, the Mini Batch K-Means method is more robust than K-Medoids Method, instead, the earthquake events in Sulawesi with magnitude ≥ 5 during the period 1980 -early 2022.Furthermore, the classifications of the earthquake event based on Mini Batch K-Means are more relevant to Fowler's concept (2004), and time efficient than both the K-Means and K-Medoids Methods.
Both Mini Batch K-Means and K-Medoids methods have depicted the pattern of earthquake events in Sulawesi, where the earthquake events are categorized as dangerous natural hazards because both serious and major earthquakes are shallow earthquake clusters.The Seismicities and stress level on the subduction zones is figured out by a-value and b-value both temporal and spatial.The b-value and a-value obtained in Sulawesi are 0.99 and 8.1, repeatedly.These values are shown that stress and seismicities are so enough strong.Based on the b-value temporal, the earthquake events are periodic events, where during the periods 1990-2001 was major release energy and during the periods 2001-2007 was major accumulated stress on subduction zones.The regions with a bvalue and a-value less than 0.9 and 7.5, respectively, and in cluster 0, namely the western part of North Sulawesi, Gorontalo, Middle Sulawesi, and West Sulawesi Province, are as vulnerable to earthquake disasters.Meanwhile, the region in cluster 1 and cluster 2 with a b-value and a-value more than 0.9 and 7.5 respectively namely South Sulawesi, the Northern part of North Sulawesi, and Southeast Sulawesi Province, are categorized as minor earthquake disasters.
The results of this study can provide an overview of seismic activities and their impact in Sulawesi, but a comprehensive analysis focal mechanism analysis to determine the fault and slip of structural geology can be conducted.This analysis is using waveform seismic data.For clustering with machine learning, Mini Batch K-Means clustering can be applied to cluster the seismic event, and this method can be used to cluster big data because more efficient than the K-Means, and K-Medoids Methods.Deep and machine learning can be combined for pattern prediction of seismic events.

Figure 1 .
Figure 1.Flowchart of the clustering earthquake events.

Figure 2 .
Figure 2. Flowchart of estimated damage and b-value and a-value analysis.

Figure 3 .
Figure 3. Elbow method using K-Means, Mini Batch K-Means, and K-Medoids algorithm on the earthquake data in the period of 1980 to 2022.

Figure 4 .
Figure 4. Silhouette method earthquake data in the period of 1980 to 2022; (a) K-Means Method for 2 clusters, (b) Mini Batch K-Means for 3 clusters, (c) K-Medoids for 2 clusters, and (d) K-Medoids for 3 clusters.

Figure 6 .
Figure 6.Spatial mapping of (a) cluster of Mini Batch K-Means method, and (b) clusters of K-Medoids method during the period 1980 -2022.

Figure 7 .Figure 8 .
Figure 7.The temporal and spatial mapping of (a) earthquake categories, (b) damage percentage, (c) the earthquake frequency, and (d) damaged classifications.
From the Mc value, the avalue was obtained at 8.1 that is indicated strong seismotectonic activities in the subduction zones, and the b-value obtained at 0.99 indicated moderate level stress on the subduction zones.So, equation 6 is written to: log N(M) = 81-0.99M..................................... (7) Scholz (1968) stated that b-values are reciprocal with the accumulated stress and a-values are linear with seismicity activities.Other than that, Scholz (1968) also stated that volcanic origins are characterized by b-value between 0 2001-2007 was stress accumulations.This stress was released in 2008 where it is characterized by a decrease in the b-value gradient.Based on this description, it might occur in any year periodically.Both the bvalue and a-value are spatially shown in Figure 9 which this mapping is depicted the b-value and a-value distributions of each province in Sulawesi.

Table 2 .
Interpretation of Silhouette coefficient.

Table 3 .
Optimum cluster and its time consumption.

Table 4 .
Silhouette coefficient of K-Means, Mini Batch K-Means, and K-Medoids Method.

Table 6 .
The centroid location of earthquake events during the period 1980 -2022.