Enhance Clustering Algorithm Using Optimization

Unsupervised learning can reveal the structure of datasets without relying on any labels, and K-means clustering is one such method. Traditionally the initial cluster centers have been selected randomly, on the assumption that the algorithm will then move them toward better clusters. However, studies have shown that there are methods to improve this initialization as well as the K-means process itself. This paper examines these results on different types of datasets to study whether they hold for all types of data. Another method used for unsupervised clustering is based on Particle Swarm Optimization (PSO). In the second part, this paper studies the classic K-means algorithm and a hybrid K-means algorithm that uses PSO to improve the results from K-means. The hybrid K-means algorithm is compared with standard K-means clustering on two benchmark classification problems. In this project we used Kaggle datasets of different sizes (small, medium and large) to compare PSO, K-means and hybrid K-means.


Introduction
In unsupervised learning, training does not use labels of any kind. This can reduce the time required to prepare training data, and allows researchers to see the natural structure of the data. One method for unsupervised learning is K-means, which divides the data into k separate groups. Each cluster is assumed to be Gaussian and spherical, with each data point belonging to the cluster whose center is closest to it.
The traditional way to initialize K-means is to assign cluster centers randomly and let the algorithm move those random centers to appropriate locations. However, depending on the structure of the data, this does not always produce predictable clusters after training. Bradley and Fayyad developed a refined initialization method that improves on the random initial clusters.
The refined initial clusters are then used in the K-means algorithm to separate the data. They are designed to produce more predictable clusters. A clustering algorithm based on Particle Swarm Optimization has also been used for clustering image and data vectors. This paper compares the hybrid K-means algorithm against standard PSO and the standard K-means algorithm (scikit-learn package) on the heart disease, breast cancer, diabetes, Wine Quantity and MNIST datasets, which were chosen to cover different types of data.
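The standard K-means procedure with random initialization described in this introduction can be sketched as follows. This is a minimal pure-Python illustration, not the scikit-learn implementation used in the experiments; the function names `euclidean` and `kmeans`, the `seed` argument and the rule of keeping the previous center for an empty cluster are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, k, iterations=50, seed=0):
    """K-means with randomly chosen initial centers (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                  for p in data]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels
```

On two well-separated groups of points the sketch recovers the two groups regardless of which points are drawn as initial centers; on less clearly structured data the outcome depends on the random draw, which is the motivation for the refined initialization.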

A. Background
Clustering is one of the most challenging data mining methods in the knowledge discovery process. Managing large amounts of data is a daunting task because the goal is to find a suitable partition in an unsupervised manner (i.e. without prior knowledge), attempting to maximize the similarity within clusters and minimize the similarity between clusters while keeping cluster cohesion high. The data are grouped into subsets in such a way that similar instances are collected together, while different instances belong to different groups.
The instances are thus organized into a more efficient representation that characterizes the sampled population. The output of cluster analysis is therefore the set of groups or clusters that form the partition of the dataset. In short, clustering is the process of grouping data by sound mathematical analysis. The use of data mining and knowledge discovery has permeated a variety of machine learning systems.

B. Motivation
As the number of digital documents has grown exponentially over the years with the growth of the Internet, managing information search and retrieval has become an increasingly important issue. Advanced methods of organizing large volumes of unstructured text into a small number of meaningful clusters would be of great help for tasks such as indexing, filtering, automated metadata generation, building web resource catalogues and, generally, any application that requires text organization.
There are also many people who are interested in reading particular news stories, so there is a need to group news articles from the large number of articles available. Many articles are added to each source, and many of them relate to the same issues but appear in different sources. By grouping articles, we can narrow down the search domain for recommendations, as most users are interested in issues related to a few groups.
International Journal of Research in Engineering, Science and Management Volume-3, Issue-9, September-2020 journals.resaim.com/ijresm | ISSN (Online): 2581-5792 | RESAIM Publishing 137
This can improve time efficiency on a large scale and can help identify similar issues from different sources. The main motivation is to compare different types of unsupervised algorithms in order to learn how they behave, what their advantages and disadvantages are, and how to choose an unsupervised learning algorithm depending on the type of dataset. In this paper we describe the flow of our hybrid K-means clustering algorithm, and compare and analyze the behavior of the algorithms on two types of dataset.
We also vary the parameters of the unsupervised learning algorithms and observe the error rate and silhouette score. By comparing the hybrid K-means clustering algorithm with the standard PSO algorithm and the K-means algorithm, we identify their advantages and disadvantages.

A. Standard K-means
The refined initialization method uses a set of J subsamples of the data. Each subsample is a small, randomly selected percentage of the original data. From each subsample a set of k cluster centers is obtained; any empty cluster is given the point with the greatest distortion as its center, and the whole subsample is then clustered again. Once no subsample has an empty cluster, the J * k center points are clustered using random initialization.
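The refinement steps above can be sketched as follows. This is a simplified pure-Python illustration that follows the description in this section rather than any published reference implementation; the names `lloyd` and `refined_centers`, the default of 10 subsamples, the `seed` argument and the exact rule used to pick the point of greatest distortion are assumptions.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lloyd(points, centers, iterations=50):
    """K-means iterations from given centers, re-seeding empty clusters."""
    k = len(centers)
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Empty cluster: re-seed at the point farthest from its own
                # center (assumed reading of "greatest distortion").
                worst = max(range(len(points)),
                            key=lambda i: dist(points[i], centers[labels[i]]))
                updated.append(list(points[worst]))
        if updated == centers:
            break
        centers = updated
    return centers


def refined_centers(data, k, n_subsamples=10, fraction=0.1, seed=0):
    """Cluster J small subsamples, then cluster the J * k resulting centers."""
    rng = random.Random(seed)
    size = min(len(data), max(k, int(len(data) * fraction)))
    candidates = []
    for _ in range(n_subsamples):
        sub = rng.sample(data, size)
        candidates.extend(lloyd(sub, rng.sample(sub, k)))
    # Cluster the J * k candidate centers, starting from a random selection.
    return lloyd(candidates, rng.sample(candidates, k))
```

The returned centers would then be passed to K-means on the full dataset in place of a random start.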
The centers produced by this final clustering are used as the initial K-means cluster centers for the whole dataset. Two measures of clustering quality have been used previously. The first was the average class purity, a measure based on data labels. The second was the distortion, or squared L2 distance of the data from their cluster centers, where the L2 (Euclidean) distance between two vectors x and y with n components is given as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

This paper uses the silhouette score as a measure of quality. This score, from -1 to 1, compares the intra-cluster distance of a data point with its distance to the nearest neighboring cluster. A negative score represents mis-clustered data, with points assigned to one cluster that should be in another. A positive score represents well-defined clusters, with a higher score meaning more distinct clusters. A score of 0 represents overlapping clusters.
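As a rough illustration of how the silhouette score behaves, the sketch below computes the mean silhouette coefficient in pure Python. The experiments in this paper rely on an existing library for this score; the function name `silhouette_score` and the convention that a single-point cluster scores 0 are assumptions of this example.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(data, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

For four points forming two tight, distant pairs, labeling each pair as one cluster gives a score close to 1, while splitting each pair across the two clusters gives a negative score.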
To truly investigate the difference between the random and refined initialization, and to compare PSO algorithms, K-means algorithm with hybrid K-means, 5

B. Standard Particle Swarm Optimization
PSO was inspired by the social behavior of bird flocks and was first developed by Eberhart and Kennedy in 1995. It is a population-based method in which the algorithm maintains a set of particles, each representing a candidate solution to an optimization problem. PSO aims to find the particle position that gives the best evaluation of a given fitness function. This section describes how Particle Swarm Optimization works and then covers the PSO and hybrid K-means PSO clustering algorithms. For this purpose, the following symbols are defined:
$N_d$: Dimension of the data vectors
$N_c$: Number of cluster centroids
$z_p$: The p-th data vector
$m_j$: Centroid vector of cluster j
$C_j$: Subset of data vectors that form cluster j
One of the key elements of clustering is the similarity measure used to assign the data to a predetermined number of clusters. Two prominent similarity measures are the Euclidean distance, which is used for data vector clustering, and the cosine similarity, which is used for document clustering. Here the Euclidean distance is used as the measure of similarity. The data vectors within a cluster are a small Euclidean distance from one another, and are associated with the one centroid vector of that cluster. The distance from a data vector to a centroid is determined using equation 1:

$$d(z_p, m_j) = \sqrt{\sum_{k=1}^{N_d} (z_{pk} - m_{jk})^2} \qquad (1)$$

Flow chart: The algorithm initially starts with a set of randomly generated points, where each point is the position of a particle in $N_d$-dimensional space. Associated with each particle is its velocity vector. Each particle i holds the following information:
$x_i$: The current position of the particle
$v_i$: The current velocity of the particle
$y_i$: The personal best position of the particle
The velocity and position of a particle at the next time instance are then calculated as:

$$v_{i,k}(t+1) = w\,v_{i,k}(t) + c_1 r_{1,k}(t)\,\bigl(y_{i,k}(t) - x_{i,k}(t)\bigr) + c_2 r_{2,k}(t)\,\bigl(\hat{y}_k(t) - x_{i,k}(t)\bigr) \qquad (3)$$

$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (4)$$

where $w$ is the inertia weight, $c_1$ and $c_2$ are the acceleration constants, $r_{1,k}(t), r_{2,k}(t) \sim U(0, 1)$, $\hat{y}(t)$ is the global best position and $k = 1, \ldots, N_d$. As is clear from equation 3, the velocity is updated based on three components: the first is a fraction of the previous velocity, the second is the cognitive component, which is a function of the distance of the particle from its personal best position, and the third is the social component, which is a function of the distance of the particle from the global best position. The personal best position of a particle, defined as the position that gives the best evaluation of the fitness function $f$ over all time instances so far, is updated as:

$$y_i(t+1) = \begin{cases} y_i(t) & \text{if } f(x_i(t+1)) \ge f(y_i(t)) \\ x_i(t+1) & \text{if } f(x_i(t+1)) < f(y_i(t)) \end{cases} \qquad (5)$$
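A single particle update (velocity, position and personal best) can be sketched as below, assuming the standard PSO update with an inertia term, a cognitive term and a social term, and a fitness function that is minimized. The function name `pso_step` and the default values w = 0.72 and c1 = c2 = 1.49 are assumptions chosen for illustration; the paper does not state the parameter values it used.

```python
import random


def pso_step(position, velocity, personal_best, global_best, fitness,
             w=0.72, c1=1.49, c2=1.49, rng=random):
    """One PSO update for a single particle (illustrative sketch).

    w, c1 and c2 are assumed defaults, not values taken from the paper.
    """
    new_velocity = []
    for k in range(len(position)):
        r1, r2 = rng.random(), rng.random()  # r1, r2 ~ U(0, 1)
        inertia = w * velocity[k]
        cognitive = c1 * r1 * (personal_best[k] - position[k])
        social = c2 * r2 * (global_best[k] - position[k])
        new_velocity.append(inertia + cognitive + social)
    # Position update: move the particle by its new velocity.
    new_position = [x + v for x, v in zip(position, new_velocity)]
    # Personal best update: keep the old best unless the new position
    # gives a strictly lower fitness value.
    if fitness(new_position) < fitness(personal_best):
        personal_best = list(new_position)
    return new_position, new_velocity, personal_best
```

When the particle already sits at both its personal best and the global best, the cognitive and social terms vanish and only the inertia term moves it, which makes the role of w easy to see in isolation.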

C. Hybrid K-means Clustering
The hybrid K-means algorithm combines the K-means and PSO clustering methods. K-means is executed once, and its result is used to seed one of the particles in the PSO clustering algorithm. The PSO algorithm is then executed.

Algorithm for Hybrid K-means clustering:
1) Number of particles = 10
2) Execute K-means on the data and assign the calculated centroids to one particle
3) Initialize the other nine particles with randomly selected cluster centroids
4) For each iteration i:
   a) For each particle j:
      i) For each data vector:
         A. Calculate the Euclidean distance (equation 1) from the data vector to every cluster centroid of the particle.
         B. Assign the data vector to the cluster whose centroid is at the minimum Euclidean distance.
      ii) Calculate the fitness function.
   b) Update the local best position using equation 5.
   c) Update the global best position as the position of the particle that minimizes the fitness function.
   d) Update the cluster centroids using equations 3 and 4.
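The steps above can be sketched in pure Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation: the fitness function is assumed to be a quantization error (the mean, over non-empty clusters, of the average point-to-centroid distance), and the names `hybrid_kmeans_pso`, `quantization_error` and the values w = 0.72, c1 = c2 = 1.49 are assumptions.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantization_error(centroids, data):
    """Assumed fitness: mean over non-empty clusters of the average
    distance between each point and its nearest centroid."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in data:
        j = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        sums[j] += dist(p, centroids[j])
        counts[j] += 1
    used = [s / n for s, n in zip(sums, counts) if n]
    return sum(used) / len(used)


def kmeans(data, k, rng, iterations=50):
    """Plain K-means from random starting points; returns the centroids."""
    centers = [list(p) for p in rng.sample(data, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in data:
            groups[min(range(k), key=lambda j: dist(p, centers[j]))].append(p)
        centers = [[sum(c) / len(g) for c in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def hybrid_kmeans_pso(data, k, n_particles=10, iterations=50,
                      w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = random.Random(seed)
    dims = len(data[0])
    # Step 2: seed one particle with the K-means centroids.
    swarm = [kmeans(data, k, rng)]
    # Step 3: the other particles start from randomly selected data points.
    swarm += [[list(p) for p in rng.sample(data, k)]
              for _ in range(n_particles - 1)]
    velocity = [[[0.0] * dims for _ in range(k)] for _ in range(n_particles)]
    best = [[list(c) for c in particle] for particle in swarm]
    best_fit = [quantization_error(p, data) for p in swarm]
    g = min(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = [list(c) for c in best[g]], best_fit[g]
    for _ in range(iterations):
        for i, particle in enumerate(swarm):
            # Steps 4a and 4b: evaluate fitness, update the personal best.
            fit = quantization_error(particle, data)
            if fit < best_fit[i]:
                best_fit[i], best[i] = fit, [list(c) for c in particle]
            # Step 4c: update the global best.
            if best_fit[i] < g_fit:
                g_fit, g_pos = best_fit[i], [list(c) for c in best[i]]
        for i, particle in enumerate(swarm):
            # Step 4d: velocity and position update of every centroid.
            for j in range(k):
                for d in range(dims):
                    r1, r2 = rng.random(), rng.random()
                    velocity[i][j][d] = (
                        w * velocity[i][j][d]
                        + c1 * r1 * (best[i][j][d] - particle[j][d])
                        + c2 * r2 * (g_pos[j][d] - particle[j][d]))
                    particle[j][d] += velocity[i][j][d]
    return g_pos, g_fit
```

Because the global best starts from the best of the initial particles, which include the K-means seed, the fitness returned by this sketch can never be worse than that of the K-means result it was seeded with.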

A. Hybrid K-means vs Standard K-means vs PSO comparison
In this section we discuss the silhouette scores obtained on each dataset. In each test, the random K-means method used was the one from the scikit-learn package for Python. This was also the basis for the random initialization within the refined method. Each clustering is limited to 50 iterations. The subsamples used for refinement are 10% of each of the original datasets. The Heart Disease dataset was tested with 300 rows using the refined initialization, and the hybrid K-means method provides an additional 10% accuracy.

Fig. 3. Heart disease dataset accuracy
The number of clusters is also varied. The results show that the two methods are comparable at their peaks. This dataset is not very complex, as it has a limited number of both features and classes. This lack of complexity can cause the similar peaks. Each method is able to identify three different classes, similar to the true labels. However, when the number of clusters did not equal the number of classes.
For the Breast Cancer Diagnostic dataset we used 600 rows to compare the three algorithms, and again obtained a 10% higher silhouette score at 3 clusters and a lower error rate. The next dataset used for comparison was the Diabetes dataset, with 800 rows.
The same method of comparison was used on the MNIST and Wine Quantity datasets, with the exception of extending the range of cluster counts. Since the MNIST set has 10 classes, the range of tested cluster counts needed to be larger. The number of class labels in the set does not directly affect the algorithm: because these are unsupervised learning methods, the training of the clusters does not use the labels. However, the structure of the data is more complex, with a number of features comparable to the Wine Quantity dataset. The results for this set are shown in figures 6 and 7. They indicate why a refined initialization method may be preferred to a random one. While it requires more computation at the beginning of training, it can produce better performance. This improved performance appears when the dataset is complex enough, while simple datasets may not show any improvement at all between the two approaches.
On all five datasets the refined method performs as well as or better than the standard K-means and PSO methods. The scoring metric is based on the concept of distance. As features are removed, the dimensionality of the feature space is reduced, causing smaller distances between points. However, this affects the random and refined methods equally, which allows them to be compared, and hybrid K-means gives better results than the standard K-means and PSO clustering algorithms.
This greatly reduced feature space simplifies the structure of the data, which is reflected in the results. The gap between the performance measures is very large from 8 to 20 features. This region shows that the refined approach can maintain a better definition of the clusters as the difficulty increases. There is a saturation point beyond which the methods perform the same way. Table 1 lists the performance of the three algorithms on the Wine and Digits datasets, averaged over 10 simulations. One thing to note here is that although the quantization error can be compared between algorithms on a given dataset, it is not comparable between different datasets. This is because the quantization error depends on the number of clusters, the pre-processing of the data and the number of samples, among other things that differ greatly between datasets.
From the wine dataset it is clear that PSO performs worse than K-means in terms of both quantization error and silhouette score. However, significant improvements can be seen with the hybrid K-means algorithm. When a single particle in the PSO algorithm is seeded with the result of the K-means algorithm, the resulting algorithm works much better than the original PSO and better than standard random K-means.

Conclusion
As a method, K-means is a quick and easy way to examine the structure of data. However, it has its flaws and has potential for improvement. This paper illustrated strategies to improve K-means performance through the use of refined initial centers and particle swarm optimization. The key to using unsupervised learning effectively is to understand how and when to use each method. No single method will always lead to the best clustering of the data.
In future work, determining the number of clusters dynamically using the silhouette score may be included. The applications of the hybrid K-means algorithm can also be extended to areas such as data clustering and image clustering.