clusteringSpark

Let's cluster the universe!

License	Apache-2.0
GroupId	org.clustering4ever
ArtifactId	clusteringspark_2.11
Last Version	0.9.8
Release Date	May 3, 2020
Type	jar
Description	clusteringSpark: Let's cluster the universe!
Project URL	https://github.com/Clustering4Ever/Clustering4Ever
Project Organization	Clustering4Ever
Source Code Management	https://github.com/Clustering4Ever/Clustering4Ever

Download clusteringspark_2.11

Filename	Size
clusteringspark_2.11-0.9.8.pom
clusteringspark_2.11-0.9.8.jar	1 MB
clusteringspark_2.11-0.9.8-sources.jar	1 MB
clusteringspark_2.11-0.9.8-javadoc.jar	1 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.clustering4ever/clusteringspark_2.11/ -->
<dependency>
    <groupId>org.clustering4ever</groupId>
    <artifactId>clusteringspark_2.11</artifactId>
    <version>0.9.8</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.clustering4ever/clusteringspark_2.11/
implementation 'org.clustering4ever:clusteringspark_2.11:0.9.8'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.clustering4ever/clusteringspark_2.11/
implementation ("org.clustering4ever:clusteringspark_2.11:0.9.8")

Apache Buildr

'org.clustering4ever:clusteringspark_2.11:jar:0.9.8'

Apache Ivy

<dependency org="org.clustering4ever" name="clusteringspark_2.11" rev="0.9.8">
  <artifact name="clusteringspark_2.11" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.clustering4ever', module='clusteringspark_2.11', version='0.9.8')
)

Scala SBT

libraryDependencies += "org.clustering4ever" % "clusteringspark_2.11" % "0.9.8"

Leiningen

[org.clustering4ever/clusteringspark_2.11 "0.9.8"]

Dependencies

compile (3)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.11.12
org.clustering4ever : clusteringscala_2.11	jar	0.9.8
gov.nist.math : jama	jar	1.0.3

provided (2)

Group / Artifact	Type	Version
org.apache.spark : spark-core_2.11	jar	2.3.3
org.apache.spark : spark-mllib_2.11	jar	2.3.3
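
Because the two Spark artifacts are in the provided scope, they are not pulled in transitively and the consuming build has to supply them. A minimal build.sbt sketch of that setup follows; the coordinates and versions come from the tables above, while the choice of scalaVersion and of keeping Spark in the provided scope in your own build (typical when the job is launched with spark-submit) are assumptions.

```scala
// build.sbt (sketch): scalaVersion matches the scala-library version listed above
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix, giving clusteringspark_2.11
  "org.clustering4ever" %% "clusteringspark" % "0.9.8",
  // Spark is "provided" for clusteringspark, so declare it explicitly here.
  // Assumption: the cluster supplies Spark at run time; drop "provided" for local runs.
  "org.apache.spark" %% "spark-core"  % "2.3.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.3" % "provided"
)
```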

Project Modules

There are no modules declared in this project.

Clustering 4️⃣ Ever

Welcome to Clustering 4️⃣ Ever, a Big Data clustering library gathering clustering and other unsupervised algorithms, along with quality indices. Don't hesitate to check our Wiki, ask questions, or make recommendations in our Gitter.

API documentation

Include it in your project

Add the following line to your build.sbt:

libraryDependencies += "org.clustering4ever" % "clustering4ever_2.11" % "0.11.0"

If needed, add one of these resolvers:

resolvers += Resolver.bintrayRepo("clustering4ever", "C4E")
resolvers += "mvnrepository" at "http://mvnrepository.com/artifact/"

You can also take specific parts (Core, ScalaClustering, ...) from Bintray or Maven.

Available algorithms

Emphasized algorithms are implemented in Scala.
Bold algorithms are implemented in Spark.
Some algorithms are available in both versions.

Clustering algorithms

Jenks Natural Breaks
Epsilon Proximity*
- Scalar Epsilon Proximity*, Binary Epsilon Proximity*, Mixed Epsilon Proximity*, Any Object Epsilon Proximity*
K-Centers*
- K-Means*, K-Modes*, K-Prototypes*, Any Object K-Centers*
Gaussian Mixtures
Self Organizing Maps (Original project)
G-Stream (Original project)
PatchWork (Original project)
Random Local Area *
OPTICS
Clusterwize
Tensor Biclustering algorithms (Original project)
- Folding-Spectral, Unfolding-Spectral, Thresholding Sum Of Squared Trajectory Length, Thresholding Individuals Trajectory Length, Recursive Biclustering, Multiple Biclustering
Ant-Tree
- Continuous Ant-Tree, Binary Ant-Tree, Mixed Ant-Tree
DC-DPM (Original project) - Distributed Clustering based on Dirichlet Process Mixture
SG2Stream

Algorithms followed by a * can be run through the benchmarking classes.
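
To give a feel for the K-Centers family listed above, here is a minimal plain-Scala sketch of K-Means (Lloyd's iterations). It is not the Clustering4Ever implementation and does not use its API; the names KMeansSketch, kMeans, closest, and mean are hypothetical and chosen only for illustration.

```scala
import scala.util.Random

// Illustrative sketch only; not the library's K-Means
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the center closest to p (ties go to the lowest index)
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points
  def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(j => points.map(_(j)).sum / points.size)

  // Returns the final centers and the cluster index of every input point
  def kMeans(data: Seq[Point], k: Int, maxIter: Int, seed: Long): (Seq[Point], Seq[Int]) = {
    // Initial centers: k points drawn at random from the data
    var centers: Seq[Point] = new Random(seed).shuffle(data).take(k)
    var assignment: Seq[Int] = data.map(closest(_, centers))
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = data.zip(assignment).groupBy(_._2)
      val previous = centers
      // Move each center to the mean of its members; keep it if the cluster is empty
      centers = previous.indices.map { c =>
        groups.get(c).map(g => mean(g.map(_._1))).getOrElse(previous(c))
      }
      val next = data.map(closest(_, centers))
      changed = next != assignment
      assignment = next
      iter += 1
    }
    (centers, assignment)
  }
}
```

Swapping the mean for a per-feature mode, and the Euclidean distance for a Hamming distance, would turn the same loop into a K-Modes sketch for binary or categorical data.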

Preprocessing

UMAP
Gradient Ascent (Mean-Shift related)
- Scalar Gradient Ascent, Binary Gradient Ascent, Mixed Gradient Ascent, Any Object Gradient Ascent
Rough Set Features Selection
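
As an illustration of the idea behind gradient ascent as a preprocessing step, the sketch below moves every point uphill toward a local density mode using classic mean-shift updates with a Gaussian kernel. This is an assumption about the general technique, not the library's implementation; GradientAscentSketch, shift, ascend, and the bandwidth parameter are hypothetical names.

```scala
// Illustrative mean-shift sketch; not the Clustering4Ever Gradient Ascent
object GradientAscentSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // One update: replace p by the Gaussian-weighted mean of the original data
  def shift(p: Point, data: Seq[Point], bandwidth: Double): Point = {
    val weights = data.map(q => math.exp(-sqDist(p, q) / (2 * bandwidth * bandwidth)))
    val total = weights.sum
    Array.tabulate(p.length) { j =>
      data.zip(weights).map { case (q, w) => w * q(j) }.sum / total
    }
  }

  // Apply a fixed number of updates to every point; the original data stays fixed
  def ascend(data: Seq[Point], bandwidth: Double, iterations: Int): Seq[Point] =
    data.map { start =>
      (1 to iterations).foldLeft(start)((p, _) => shift(p, data, bandwidth))
    }
}
```

After the ascent, points that started in the same dense region end up close to the same mode, which tends to make a subsequent clustering step easier.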

Quality Indices

You can compute your quality measures manually with the dedicated classes for local or distributed collections. The helpers ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed let you test indices on multiple clusterings at once.

Internal Indices
- Davies Bouldin
- Ball Hall
External Indices
- Multiple Classification
  - Mutual Information, Normalized Mutual Information
  - Purity
  - Accuracy, Precision, Recall, fBeta, f1, RAND, ARAND, Matthews correlation coefficient, CzekanowskiDice, RogersTanimoto, FolkesMallows, Jaccard, Kulcztnski, McNemar, RusselRao, SokalSneath1, SokalSneath2
- Binary Classification
  - Accuracy, Precision, Recall, fBeta, f1
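
For readers unfamiliar with these indices, the following plain-Scala sketch computes one internal index (Davies Bouldin) and two external ones (Purity and RAND) from their textbook definitions. It does not use the Clustering4Ever classes; IndicesSketch and its method names are hypothetical, and cluster ids and labels are assumed to be plain integers.

```scala
// Textbook definitions only; not the library's indices implementation
object IndicesSketch {
  type Point = Array[Double]

  private def dist(a: Point, b: Point): Double =
    math.sqrt(a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum)

  private def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(j => points.map(_(j)).sum / points.size)

  // Purity: share of points that carry the majority label of their cluster
  def purity(clusters: Seq[Int], labels: Seq[Int]): Double = {
    require(clusters.nonEmpty && clusters.size == labels.size)
    val majorityCounts = clusters.zip(labels).groupBy(_._1).values.toSeq
      .map(members => members.groupBy(_._2).values.map(_.size).max)
    majorityCounts.sum.toDouble / clusters.size
  }

  // Rand index: share of point pairs on which clustering and labels agree
  def randIndex(clusters: Seq[Int], labels: Seq[Int]): Double = {
    require(clusters.size >= 2 && clusters.size == labels.size)
    val (c, l) = (clusters.toVector, labels.toVector)
    val agreements = for (i <- c.indices; j <- i + 1 until c.size)
      yield (c(i) == c(j)) == (l(i) == l(j))
    agreements.count(identity).toDouble / agreements.size
  }

  // Davies-Bouldin: mean over clusters of the worst (scatter_i + scatter_j) / centroid distance
  def daviesBouldin(data: Seq[Point], clusters: Seq[Int]): Double = {
    val groups = data.zip(clusters).groupBy(_._2).values.map(_.map(_._1)).toVector
    require(groups.size >= 2, "needs at least two clusters")
    val centroids = groups.map(mean)
    // Scatter: average distance of a cluster's members to its centroid
    val scatter = groups.zip(centroids).map { case (g, c) => g.map(dist(_, c)).sum / g.size }
    val k = groups.size
    val worstRatios = (0 until k).map { i =>
      (0 until k).filter(_ != i)
        .map(j => (scatter(i) + scatter(j)) / dist(centroids(i), centroids(j)))
        .max
    }
    worstRatios.sum / k
  }
}
```

Lower Davies-Bouldin values indicate compact, well-separated clusters, whereas Purity and the Rand index grow toward 1 as the clustering agrees more with the reference labels.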

Clustering benchmarking and analysis

The classes ClusteringChainingLocal, BigDataClusteringChaining, and DistributedClusteringChaining, together with the descendants of ChainingOneAlgorithm, let you run multiple clustering algorithms, respectively: locally and in parallel; sequentially on a distributed system; in parallel on a distributed system; and locally and in parallel. They also let you generate many different vectorizations of the data while keeping track of the information attached to each clustering, including the vectorization used, the clustering model, the clustering number, and the clustering arguments.

The classes ClustersIndicesAnalysisLocal and ClustersIndicesAnalysisDistributed are devoted to the analysis of clustering indices.

The classes ClustersAnalysisLocal and ClustersAnalysisDistributed are used to describe the obtained clusterings in terms of distributions, proportions of categorical features, and so on.
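
The bookkeeping idea behind chaining (run several algorithms over several vectorizations and remember what produced each result) can be sketched in plain Scala as follows. This is only a conceptual illustration under assumed names: ChainingSketch, RunInfo, Algo, and runAll are hypothetical and are not the signatures of the chaining classes named above.

```scala
// Conceptual sketch; none of these types belong to Clustering4Ever
object ChainingSketch {
  // What is remembered about one clustering run
  final case class RunInfo(
    clusteringNumber: Int,
    vectorizationId: Int,
    algorithm: String,
    args: Map[String, Double],
    assignment: Seq[Int]
  )

  // An algorithm is described by a name, its arguments, and a function from vectors to cluster ids
  final case class Algo(name: String, args: Map[String, Double], run: Seq[Array[Double]] => Seq[Int])

  // Run every algorithm on every vectorization, numbering the runs in order
  def runAll(vectorizations: Map[Int, Seq[Array[Double]]], algos: Seq[Algo]): Seq[RunInfo] = {
    val jobs = for {
      (vecId, data) <- vectorizations.toSeq.sortBy(_._1)
      algo <- algos
    } yield (vecId, data, algo)
    jobs.zipWithIndex.map { case ((vecId, data, algo), n) =>
      RunInfo(n, vecId, algo.name, algo.args, algo.run(data))
    }
  }
}
```

Because each job is independent, the final map could be executed in parallel without changing the recorded information.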

Coming soon (developed by our team)

DESOM: Deep Embedded Self-Organizing Map: Joint Representation Learning and Self-Organization
SOM: Kohonen self-organizing map
SOMperf: SOM performance metrics and quality indices (https://github.com/FlorentF9/SOMperf/)
skstab: a module for clustering stability analysis in Python with a scikit-learn compatible API (https://github.com/FlorentF9/skstab)
FunCLBM: Functional Conditional Latent Block Model
Spark Time Series Set data analysis
UMAP
Gaussian Mixture Models
DBScan
Bayesian Optimization for AutoML

Citation

If you publish material based on information obtained from this repository, please mention in your acknowledgements the assistance you received from this community work. This will help others obtain the same information and replicate your experiments, because having results is cool but being able to compare them with others is better.

Citation: @misc{C4E, url = "https://github.com/Clustering4Ever/Clustering4Ever", institution = "Paris 13 University, LIPN UMR CNRS 7030"}

C4E-Notebooks examples

Basic usage of the implemented algorithms is demonstrated in BeakerX and Jupyter notebooks, available through Binder.

They can also be downloaded directly from our Notebooks repository in different formats, such as Jupyter or SparkNotebook.

Miscellaneous

Helper functions to generate Clusterizable collections

You can easily generate collections of basic Clusterizable using the helpers in org.clustering4ever.util.{ArrayAndSeqTowardGVectorImplicit, ScalaCollectionImplicits, SparkImplicits}, or explore Clusterizable and EasyClusterizable for more advanced usage.

Which data structures are recommended for best performance

ArrayBuffer or ParArray are recommended as vector containers for local applications. If the data is larger, don't hesitate to switch to RDD.
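
The sketch below shows the two local containers side by side and, as a comment, the step up to an RDD. It assumes Scala 2.11 or 2.12 (the versions this artifact targets), where parallel collections ship with the standard library; ContainersSketch and the sample values are hypothetical, and sc stands for an existing SparkContext that is not created here.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only; assumes Scala 2.11/2.12 where .par is available without extra modules
object ContainersSketch {
  // Local, sequential container of vectors
  val local: ArrayBuffer[Array[Double]] =
    ArrayBuffer(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))

  // Local, parallel container: calling .par on an Array yields a ParArray
  val parallel = local.toArray.par

  // Operations on a ParArray are spread over the available cores
  val norms = parallel.map(v => math.sqrt(v.map(x => x * x).sum))

  // For larger data, the same vectors could be distributed with Spark,
  // assuming a SparkContext named sc is in scope:
  //   val rdd = sc.parallelize(local)
}
```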

Versions

Version	Release Date
0.9.8	May 3, 2020
0.9.7	Apr 14, 2020
0.9.6	Jun 11, 2019
0.9.5	May 19, 2019
0.9.4	Mar 15, 2019
0.9.3	Feb 14, 2019
0.9.2	Feb 14, 2019
0.9.1	Feb 13, 2019
0.8.4	Jan 27, 2019
0.8.3	Jan 20, 2019
0.8.2	Jan 17, 2019
0.8.1	Jan 14, 2019
0.8.0	Jan 8, 2019
0.7.3	Nov 25, 2018
0.7.2	Nov 15, 2018
0.7.1	Oct 31, 2018
0.7.0	Oct 27, 2018
0.6.30	Oct 19, 2018