SIMILARITY ANALYSIS OF USER TRAJECTORIES BASED ON HAVERSINE DISTANCE AND NEEDLEMAN WUNSCH ALGORITHM

This paper discusses the similarity between two trajectories using the Needleman-Wunsch algorithm. The calculation steps are interpolating the trajectories, calculating the distance between the trajectory coordinates, identifying equivalent coordinate points, transforming the trajectories into sequences of alphabetic letters, aligning the sequences, and measuring the similarity based on the alignment results. The similarity obtained is compared directly with the length of the path segment shared by the two trajectories. The calculation results show that the accuracy of the alignment method reaches more than 90%.


Introduction
Anything whose position or value changes over time is called a moving object. Cars on a highway, airplanes in flight, rockets, satellites, stock prices, and temperatures in a region are examples of moving objects. Recording the movement of such objects produces a set of coordinate points or values called a trajectory (Mello et al., 2019). A trajectory is the movement history of an object, expressed as an ordered set of locations or values $p_1, p_2, \dots, p_n$ observed at times $t_1, t_2, \dots, t_n$, so that a trajectory is written as $T = \{(p_1, t_1), (p_2, t_2), \dots, (p_n, t_n)\}$, where $n$ is the number of observed points (Toohey & Duckham, 2015). Measuring trajectory similarity is a practical problem in several application domains. For example, customer trajectory analysis in a supermarket is used to organize product placement on shelves. Another example is finding unusual migration patterns in a group of birds or detecting rare trajectories. Other domains include stock and data analysis, travel route recommendation systems, and friend recommendation systems based on interests, hobbies, and locations visited. Trajectory similarity can also be used to predict future phenomena such as cardinal directions, storms, and earthquakes (Wang et al., 2013).
One way to calculate the similarity of two trajectories is the alignment method of Čavojský et al. (2020), who used the Needleman-Wunsch algorithm to align paths that had previously been converted into alphabetic letter sequences. Sabarish et al. converted the trajectories into a graph and measured the similarity of the routes based on edges and vertices (Sabarish et al., 2020). Trajectory similarity can also be calculated using approximation equations built from a regression model or interpolation. Magdy et al. used three approaches to measure the similarity of paths: regression, interpolation, and curve barcoding (Magdy et al., 2018).
In Čavojský and Drozda (2019), interpolation is carried out to equalize the number of coordinate points on the two trajectories. As a result, there is no upper bound on the distance between consecutive coordinate points in a trajectory, so segments that the two trajectories share may be classified as unequal because no points lie close enough to each other. This paper proposes a slightly different way of applying interpolation: new points are added to a path whenever the distance between consecutive measured coordinate points exceeds the specified equivalence distance ($\epsilon$).
Previously, Chua and Foo (2015) used the Needleman-Wunsch algorithm to detect suspicious activity on smart home devices and compared the results with those obtained from a decision tree. Ju et al. (2018) used the algorithm to identify smart card use and calculate transaction similarity scores to find their relationship with student achievement. Garhwal and Yan (2019) used the algorithm to detect watermarks in images and reported that the method could recognize watermarks in some image data with up to 100% accuracy. Čavojský et al. (2020) note that the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) aligns DNA, RNA, and protein sequences but can also be used for path alignment.
In this paper, the Needleman-Wunsch algorithm is used to calculate the similarity of two paths. The algorithm was initially designed to align sequences of alphabetic letters (Beretta, 2018), (Irawan et al., 2019). To align trajectories given as coordinate points (numerical sequences), they must first be transformed into alphabetic letter sequences. This transformation requires a distance calculation, for example the Euclidean distance (Čavojský et al., 2020), the Manhattan distance (Chen et al., 2005), or the Haversine distance (Anisya & Swara, 2017), (Sofwan et al., 2019). The distance is needed to identify the equivalent coordinate points on the two paths. The coordinate points are then labeled with the same character if they are equivalent and with different characters if they are not. To improve the accuracy of the trajectory similarity, the coordinates of the two paths can be interpolated using linear interpolation before the transformation is carried out (Boubrahimi et al., 2018).

Material and Method
The main question of this paper is how to determine the similarity of the trajectories of two moving objects based on coordinates obtained from a GPS (global positioning system) device. Suppose $r$ and $s$ are two trajectories recorded from the positions of moving objects as sets of latitude-longitude coordinates. Then $r$ and $s$ can be written as the ordered sets $r = \{r_1, \dots, r_m\}$ and $s = \{s_1, \dots, s_n\}$, where $r_i = (\varphi_{r,i}, \lambda_{r,i})$ and $s_j = (\varphi_{s,j}, \lambda_{s,j})$ for $i = 1, \dots, m$ and $j = 1, \dots, n$. The steps for calculating the similarity of the two paths are outlined in the flowchart in Figure 1.
Figure 1: Flowchart to determine the similarity between two trajectories (Čavojský et al., 2020)
The steps for aligning the trajectories $r$ and $s$ in Figure 1 are: (1) interpolation, which adds new coordinate points between the existing coordinate points on the path; (2) transformation, which changes the numerical sequence of coordinate points in the path into a sequence of alphabetic letters; (3) alignment, which arranges matching characters or inserts gap characters so that the columns of the two sequences contain identical characters; (4) counting the number of matching or identical characters; and (5) calculating the path similarity. An essential concept in interpolation and path transformation is the distance between points, which is used to determine the number of interpolation points and to identify equivalent coordinate points.
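The alignment step (step 3) can be illustrated with a minimal Needleman-Wunsch sketch. The scoring values (match = 1, mismatch = -1, gap = -1), the gap symbol "-", the function name, and the example letter sequences are assumptions made for illustration; they are not taken from the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b and return the two aligned strings."""
    m, n = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical letter sequences for a 5-point and a 4-point path
aligned_r, aligned_s = needleman_wunsch("ABCDE", "ABFD")
print(aligned_r)
print(aligned_s)
```

With these assumed scores, the shorter sequence is padded with a gap so that equal letters fall in the same columns, which is what steps (4) and (5) rely on.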

Distance between points
The distance between two points given as latitude-longitude coordinates can be calculated using the Haversine equation introduced by James Inman in 1835 (Inman, 1849). Given two ordered pairs of geographic coordinates $p = (\varphi_1, \lambda_1)$ and $q = (\varphi_2, \lambda_2)$, where $\varphi_1, \varphi_2$ and $\lambda_1, \lambda_2$ are latitude and longitude, respectively, the distance between $p$ and $q$ is defined as
$$d(p, q) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \qquad (1)$$
where $r$ is the radius of the earth, approximately 6,371 km. In addition to the Haversine formula, the distance between $p$ and $q$ can be calculated using the Euclidean distance as in (Čavojský et al., 2020), namely
$$d(p, q) = \sqrt{(\varphi_2 - \varphi_1)^2 + (\lambda_2 - \lambda_1)^2} \qquad (2)$$
The distance obtained from equation (2) is still in geodetic form, so to convert it into kilometers it must be multiplied by the radius of the earth $r$. Another, very similar method is the Manhattan distance used by Chen et al. (2005), which is
$$d(p, q) = |\varphi_2 - \varphi_1| + |\lambda_2 - \lambda_1| \qquad (3)$$
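The Haversine distance described above can be sketched in a few lines. The function name, the choice of meters as the output unit, and the input convention of (latitude, longitude) in decimal degrees are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean earth radius in meters (about 6,371 km)


def haversine(p, q, radius=EARTH_RADIUS_M):
    """Great-circle distance between p and q, each (latitude, longitude) in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points
    return 2 * radius * math.asin(min(1.0, math.sqrt(h)))


# One degree of longitude along the equator is roughly 111.19 km
print(round(haversine((0.0, 0.0), (0.0, 1.0))))
```

Returning meters directly keeps the result comparable with an equivalence distance given in meters.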

Trajectory Interpolation
Linear interpolation is a method for estimating the value at an unknown point from two known points joined by a straight line (Chapra & Canale, 1998). Path interpolation is linear interpolation applied to two consecutive coordinate points in a path that are more than $\epsilon$ apart. Suppose $p = (\varphi_1, \lambda_1)$ and $q = (\varphi_2, \lambda_2)$ are the points to be interpolated; then the interpolation point $p_m = (\varphi_m, \lambda_m)$ is defined as the midpoint of $p$ and $q$, with
$$\varphi_m = \frac{\varphi_1 + \varphi_2}{2}, \qquad \lambda_m = \frac{\lambda_1 + \lambda_2}{2} \qquad (4)$$
Interpolation is done recursively until all consecutive points in the path are less than $\epsilon$ apart. The interpolation steps for a path are shown in more detail in the flowchart in Figure 2 (Chapra & Canale, 1998). Interpolation updates the number of coordinate points in the trajectory, that is, the old coordinate points plus the interpolated coordinate points. The number of interpolation points added between two points can be more than one and depends on the distance between the two points and the selected value of $\epsilon$.
As an example, consider a path $s$ with four initial coordinate points $s_1, s_2, s_3, s_4$. Using equation (1), the distances between consecutive points in the path are $d(s_1, s_2) = 30.812$ m, $d(s_2, s_3) = 12.1431$ m, and $d(s_3, s_4) = 119.7142$ m. To perform interpolation, select $\epsilon$ as the largest distance allowed between adjacent coordinate points after the path is interpolated, then apply the interpolation algorithm shown in the flowchart in Figure 2. If we take $\epsilon = 20$ m, new coordinate points are added to the path such that all distances between adjacent points are less than 20 m. The results of the interpolation of path $s$ can be seen in Table 1. In the second column of Table 1, $s'_2, s'_5, s'_6, s'_7, s'_8, s'_9, s'_{10}, s'_{11}$ are the new points added to path $s$ by interpolation, and the fifth column shows that the distance between each pair of consecutive coordinates is less than $\epsilon = 20$ m. The number of coordinate points on the path after interpolation becomes 12, that is, four starting points ($s'_1 = s_1$, $s'_3 = s_2$, $s'_4 = s_3$, $s'_{12} = s_4$) plus eight interpolated points.
The changes in the coordinate points on the path can be seen in Table 1. In Figure 3 (right), the yellow circles are the original coordinate points in the path, while the green × signs mark the interpolation points added to the path. Between points $s_2$ and $s_3$ no new points are added because $d(s_2, s_3) = 12.1431$ m is already less than 20 m.
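The recursive midpoint interpolation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function name is hypothetical, the distance function is passed in as a parameter, and the usage example places four points on a straight line in a planar coordinate system measured in meters so that their spacings reproduce the distances quoted in the text (30.812 m, 12.1431 m, and 119.7142 m). For real latitude-longitude data, a Haversine function would be passed as `dist` instead.

```python
import math


def interpolate_path(path, eps, dist):
    """Insert midpoints until consecutive points are less than eps apart (eps > 0)."""
    if not path:
        return []

    def fill(p, q):
        # Return the new points strictly between p and q
        if dist(p, q) < eps:
            return []
        mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
        return fill(p, mid) + [mid] + fill(mid, q)

    out = [path[0]]
    for p, q in zip(path, path[1:]):
        out.extend(fill(p, q))  # interpolated points, if any
        out.append(q)           # original point
    return out


# Planar stand-in for the four-point path (coordinates in meters, assumed)
s = [(0.0, 0.0), (30.812, 0.0), (42.9551, 0.0), (162.6693, 0.0)]
s_new = interpolate_path(s, eps=20.0, dist=math.dist)
print(len(s_new))  # 4 original points plus 8 interpolated points
```

Because each segment is halved repeatedly, the 30.812 m segment receives one new point, the 12.1431 m segment none, and the 119.7142 m segment seven, which matches the eight interpolated points reported for the path.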

Trajectory Transformation
Interpolation can be done directly on a single path, but transforming a path into a sequence of alphabetic letters requires a second path for comparison. An essential part of the transformation is identifying the equivalent coordinates of the two paths, then labeling them with the same letter if the coordinates are equivalent and with different letters if they are not.
Suppose $r$ and $s$ are two paths that will be converted into sequences of alphabetic letters. The first step is to calculate the distance between the coordinate points in $r$ and the coordinate points in $s$, then look for the coordinate points in $s$ that are less than $\epsilon$ from the coordinate points in $r$, and then do the labeling. The flow for converting a path into a sequence of letters is shown in more detail in the lintasantosequence function in Figure 5 and the flowchart in Figure 6. First, we calculate all the distances between the coordinates in $r$ and $s$ using equation (1) and convert them into meters; the results are given in Table 3. In Table 3, distances of less than 20 m are marked in bold red. Pairs of points less than 20 m apart are equivalent points, i.e., points to be labeled with the same letter. The labeling based on Table 3 is shown in Table 4. The transformation of trajectories $r$ and $s$ above is done without interpolating first. The path transformation process is the same with or without interpolation; what differs is the number of coordinate points that must be transformed. In Table 4, three letter labels appear on both paths, meaning that three coordinate points in $r$ and $s$ are equivalent to each other, as illustrated in Figure 4. In Figure 4, the equivalent coordinates are marked with a red circle that contains a green circle (part of path $r$) and a yellow circle (part of path $s$). The figure shows that $r_1 \equiv s_1$, $r_2 \equiv s_2$, and $r_4 \equiv s_4$. In fact, $r_2 \equiv s_3$ and $r_5 \equiv s_4$ also hold, but since $d(r_2, s_2) \le d(r_2, s_3)$ and $d(r_4, s_4) \le d(r_5, s_4)$, we pass over $s_3$ and $r_5$, and they are considered not equivalent to any other point, like point $r_3$.
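One possible reading of the labeling rule is sketched below: candidate pairs closer than the equivalence distance are visited from nearest to farthest, and a pair is labeled with a shared letter only if neither point has been paired already. The function name, the nearest-first tie-breaking, the planar test coordinates, and the label alphabet are assumptions for illustration; this is not the lintasantosequence function from the paper.

```python
import math


def paths_to_sequences(r, s, eps, dist=math.dist):
    """Label two paths so that equivalent points (closer than eps) share a letter."""
    # All candidate pairs closer than eps, nearest first
    pairs = sorted(
        (dist(p, q), i, j)
        for i, p in enumerate(r)
        for j, q in enumerate(s)
        if dist(p, q) < eps
    )
    paired_r, match_s = set(), {}
    for _, i, j in pairs:
        if i not in paired_r and j not in match_s:
            paired_r.add(i)
            match_s[j] = i  # point s[j] is equivalent to point r[i]

    # Unique symbols starting at 'A' (continues past 'Z' for long paths)
    labels = (chr(ord("A") + k) for k in range(len(r) + len(s)))
    seq_r = [next(labels) for _ in r]
    seq_s = [seq_r[match_s[j]] if j in match_s else next(labels)
             for j in range(len(s))]
    return "".join(seq_r), "".join(seq_s)


# Hypothetical planar coordinates in meters: 5 points on r, 4 points on s
r = [(0, 0), (30, 0), (60, 40), (160, 0), (175, 0)]
s = [(1, 0), (31, 0), (43, 0), (162, 0)]
print(paths_to_sequences(r, s, eps=20))
```

In this made-up example the second point of `r` is within 20 m of two points of `s`, and the last point of `s` is within 20 m of two points of `r`; only the nearer pairing receives a shared letter in each case, so three letters are common to both sequences.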

The similarity of the trajectories
Trajectory similarity, as defined by Čavojský et al. (2020), is based on the number of identical coordinate positions and the number of gaps in the two aligned sequences. Definition (similarity). The paths $r$ and $s$, which consist of $m$ and $n$ coordinate points, are considered equal when the condition in equation (5) is satisfied, where #match is the number of matching characters and #maxgap is the largest number of gaps in the gap subset.
Using the definition above, we express the percentage of similarity of two paths as
$$sim(\%) = \frac{\#match}{\max(m, n)} \times 100\% \qquad (6)$$
In DNA sequences, a match is defined as the same base located at the same position in two DNA strands. In paths, a match is a pair of equivalent coordinates on both paths, represented by the same character at the same position in both sequences. The parameters $m$ and $n$ in equation (5) are the lengths of sequence $r$ and sequence $s$, respectively. So in Figure 4 above, we get $\#match = 3$, $m = 5$, and $n = 4$, so the similarity of the paths $r$ and $s$ is
$$sim(\%) = \frac{3}{\max(5, 4)} \times 100\% = 60\% \qquad (7)$$
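The percentage calculation can be sketched as below, assuming that the similarity is the number of matching columns divided by the length of the longer original sequence, which is consistent with the counts quoted for Figure 4 (3 matches, lengths 5 and 4). The function name, the gap symbol "-", and the example strings are hypothetical.

```python
def similarity_percent(aligned_r, aligned_s, gap="-"):
    """Percentage similarity of two aligned sequences of equal length."""
    # A match is the same non-gap character in the same column
    n_match = sum(1 for a, b in zip(aligned_r, aligned_s) if a == b and a != gap)
    m = sum(1 for c in aligned_r if c != gap)  # original length of r
    n = sum(1 for c in aligned_s if c != gap)  # original length of s
    longest = max(m, n)
    return 100.0 * n_match / longest if longest else 0.0


# Hypothetical aligned sequences with 3 matches, m = 5, n = 4
print(similarity_percent("ABCDE", "ABFD-"))  # 60.0
```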

Result and Discussion
The number of interpolated points added to a trajectory depends on the maximum distance ($\epsilon$) desired between consecutive points after the path is interpolated. The smaller the value of $\epsilon$, the more interpolation points are generated, and vice versa. Path interpolation increases the accuracy of the similarity between trajectories that share a route, because it estimates additional points along the lines formed by the recorded points. Interpolation can therefore confirm which parts of two trajectories follow the same direction and route.

Example 1
Look at trajectories A and B in Figure 7: the length of track A (green) is 277.98 meters, and the length of track B (yellow) is 317.90 meters. The similarity between paths A and B is the length of the shared path segment divided by the length of the longer path. Paths A and B coincide for 107.13 meters, so the similarity value is
$$\frac{107.13}{317.90} \times 100\% = 33.70\% \qquad (8)$$
The segment traversed by both objects, namely the thick blue line, is easy for the human eye to recognize but difficult for a computer. Therefore, as described in the previous section, we need methods that are easy for a program to carry out. If the similarities obtained in equations (9) and (10) are compared with the similarity in equation (8), the differences are 6.3% and 0.37%, respectively. The accuracy of the similarity is 81% before interpolation and increases to 99% after interpolation.
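The reference similarity used in this example, the shared length divided by the longer path length, can be checked with a short calculation. The function name is hypothetical; the lengths are the ones quoted in the text.

```python
def length_similarity_percent(shared_m, length_a_m, length_b_m):
    """Shared length as a percentage of the longer of the two path lengths."""
    return 100.0 * shared_m / max(length_a_m, length_b_m)


# Example 1: tracks A and B share 107.13 m; their lengths are 277.98 m and 317.90 m
print(round(length_similarity_percent(107.13, 277.98, 317.90), 2))
```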

Example 2
Look at tracks C and D in Figure 13: the length of track C (green) is 3897.79 meters, and the length of track D (yellow) is 4370.64 meters. The similarity between paths C and D is the length of the path segment traversed by both objects divided by the length of the longer path. Tracks C and D coincide for a distance of 2989.65 meters, so the similarity value is
$$\frac{2989.65}{4370.64} \times 100\% = 68.40\% \qquad (11)$$
The segment traversed by both objects, namely the thick blue line, is easy for the human eye to recognize but difficult for a computer. Therefore, as described in the previous section, we need methods that are easy for a program to carry out.
In Figure 8, the equivalent coordinates of the two trajectories, which are less than 15 meters apart ($\epsilon$), are marked with red circles; there are ten equivalent coordinate points. Paths C and D in Figure 8 have 19 and 42 coordinate points, respectively. Using the definition of similarity in the previous section, the similarity value between the two trajectories is
$$\frac{10}{\max(19, 42)} \times 100\% = 23.81\% \qquad (12)$$
If we compare the similarities obtained in equations (12) and (13) with the similarity in equation (11), the differences are 44.59% and 8.78%, respectively. The accuracy of the similarity is 35% before interpolation and increases to 87% after interpolation.
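The reported accuracies are consistent with measuring accuracy as one minus the difference relative to the length-based reference similarity. That formula is an assumption inferred from the quoted numbers, since the text does not state it explicitly, and the function name is hypothetical. A short check for Example 2:

```python
def accuracy_percent(reference, difference):
    """Accuracy as 100 * (1 - |difference| / reference); an assumed definition."""
    return 100.0 * (1.0 - abs(difference) / reference)


# Reference similarity from the path lengths, in percent
reference = 100.0 * 2989.65 / 4370.64
# Similarity before interpolation: 10 matches over the longer sequence (42 points)
before = 100.0 * 10 / 42

print(round(reference - before, 2))                            # difference before interpolation
print(round(accuracy_percent(reference, reference - before)))  # accuracy before interpolation
print(round(accuracy_percent(reference, 8.78)))                # accuracy after interpolation
```

Under this assumed definition, the same calculation applied to Example 1 (differences of 6.3% and 0.37% against a reference of about 33.70%) gives roughly 81% and 99%.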

Conclusion
From the discussion, we can conclude that interpolation significantly increases the accuracy of the similarity value between two trajectories. The path similarity can be calculated without prior alignment by counting the number of identical characters or labels after the transformation and dividing it by the length of the longest sequence produced by the path transformation. Alignment is necessary if we want to use the definition of similarity to state whether two paths are equal or different.