
Scalable Clustering Applying Local Accretions

Abstract : This thesis focuses on methods for reducing the computational complexity of specific algorithms so that they can handle Big Data. It presents well-known algorithms and new ones from several machine learning fields (unsupervised and supervised learning), which use modern techniques such as Locality Sensitive Hashing (LSH) to reduce algorithmic complexity efficiently. In the first part, we study scalable clustering based on the Mean Shift algorithm for continuous features. We propose a new design for Mean Shift clustering that uses locality sensitive hashing and a distributed system. A variant for categorical features, based on binary coding and the Hamming distance, is also proposed. In the second part, we introduce a scalable Clusterwise method, which combines a clustering algorithm with PLS regression. The goal is to find clusters of entities such that the overall sum of squared errors from the regressions performed over these clusters is minimized, where each cluster may have a different variance. We improve its running time and scalability by applying clustering before the regression task. In this part of the thesis we also investigate feature selection. We present two efficient distributed algorithms based on Rough Set Theory for large-scale data pre-processing under the Spark framework. The first approach (Sp-RST) splits the given dataset into partitions with smaller numbers of features, which are then processed in parallel. The second approach (LSH-dRST) uses locality sensitive hashing as a clustering method to determine appropriate partitions of the feature set. In the last part, we share this work as an open source project. This project, titled Clustering4Ever, lets anyone read the source code and test the different algorithms, either via notebooks or by calling the API directly. Its design enables the generation of algorithms that work for many types of data.
Keywords : Clustering

Cited literature [132 references]
Contributor : Abes Star
Submitted on : Thursday, August 20, 2020 - 3:53:05 AM
Last modification on : Wednesday, December 2, 2020 - 5:48:34 PM
Long-term archiving on : Tuesday, December 1, 2020 - 8:41:10 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02917865, version 1



Gaël Beck. Scalable Clustering Applying Local Accretions. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Paris-Nord - Paris XIII, 2019. English. ⟨NNT : 2019PA131004⟩. ⟨tel-02917865⟩


