Simmetrics Core

A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.

License	The Apache Software License, Version 2.0
Categories	Metrics Application Testing & Monitoring Monitoring
GroupId	com.github.mpkorstanje
ArtifactId	simmetrics-core
Last Version	4.1.1
Release Date	24-Aug-2016
Type	jar

Download simmetrics-core

Filename	Size
simmetrics-core-4.1.1.pom
simmetrics-core-4.1.1.jar	130 KB
simmetrics-core-4.1.1-sources.jar	84 KB
simmetrics-core-4.1.1-javadoc.jar	414 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.mpkorstanje/simmetrics-core/ -->
<dependency>
    <groupId>com.github.mpkorstanje</groupId>
    <artifactId>simmetrics-core</artifactId>
    <version>4.1.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.mpkorstanje/simmetrics-core/
implementation 'com.github.mpkorstanje:simmetrics-core:4.1.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.mpkorstanje/simmetrics-core/
implementation ("com.github.mpkorstanje:simmetrics-core:4.1.1")

Apache Buildr

'com.github.mpkorstanje:simmetrics-core:jar:4.1.1'

Apache Ivy

<dependency org="com.github.mpkorstanje" name="simmetrics-core" rev="4.1.1">
  <artifact name="simmetrics-core" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.mpkorstanje', module='simmetrics-core', version='4.1.1')
)

Scala SBT

libraryDependencies += "com.github.mpkorstanje" % "simmetrics-core" % "4.1.1"

Leiningen

[com.github.mpkorstanje/simmetrics-core "4.1.1"]

Dependencies

compile (2)

Group / Artifact	Type	Version
com.google.guava : guava	jar	19.0
commons-codec : commons-codec	jar	1.10

test (4)

Group / Artifact	Type	Version
com.google.caliper : caliper	jar	1.0-beta-2
junit : junit	jar	4.12
org.mockito : mockito-core	jar	1.10.19
org.hamcrest : hamcrest-all	jar	1.3

Project Modules

There are no modules declared in this project.

SimMetrics

Usage

For quick and easy use, StringMetrics and StringDistances provide a collection of well-known similarity and distance metrics.

String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";

StringMetric metric = StringMetrics.cosineSimilarity();

float result = metric.compare(str1, str2); //0.4767
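
To make the difference between an unbounded distance and a normalized similarity concrete, here is a minimal, library-independent sketch. The class and method names are hypothetical, and dividing by the longer input's length is only one common way to normalize; it is not claimed to be how SimMetrics implements its Levenshtein metric. For brevity it compares char values, so the Unicode caveat applies to it.

```java
class DistanceVsSimilarity {

    // Levenshtein distance via dynamic programming over two rows.
    // The result is a non-negative edit count with no fixed upper bound.
    static int levenshteinDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer input,
    // which maps the distance onto the range [0, 1].
    static float levenshteinSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        return 1.0f - (float) levenshteinDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshteinDistance("kitten", "sitting"));   // 3
        System.out.println(levenshteinSimilarity("kitten", "sitting")); // 1 - 3/7
    }
}
```

The distance grows with the length of the inputs, while the similarity stays between 0 and 1, which makes scores from different string pairs comparable.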

The StringMetricBuilder and StringDistanceBuilder are convenience tools to build string similarity and distance metrics. Any class implementing Metric or Distance respectively can be used to build a metric. The builders support simplification, tokenization, token-filtering, token-transformation, and caching.

For usage see the examples section.

For a terse syntax, use import static org.simmetrics.builders.StringMetricBuilder.with;

String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";

StringMetric metric =
        with(new CosineSimilarity<>())
        .simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
        .simplify(Simplifiers.replaceNonWord())
        .tokenize(Tokenizers.whitespace())
        .build();

float result = metric.compare(str1, str2); //0.5720

Metrics that operate on lists, sets, or multisets are generic and can be used to compare collections of arbitrary elements. The elements in the collection must implement equals and hashCode.

Set<Integer> scores1 = new HashSet<>(asList(1, 1, 2, 3, 5, 8, 11, 19));
Set<Integer> scores2 = new HashSet<>(asList(1, 2, 4, 8, 16, 32, 64));

SetMetric<Integer> metric = new OverlapCoefficient<>();

float result = metric.compare(scores1, scores2); // 0.4285
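
As a rough illustration of where that value comes from, the overlap coefficient of two sets is the size of their intersection divided by the size of the smaller set. The sketch below uses only the JDK; the class name, method name, and handling of empty sets are assumptions for illustration, not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OverlapCoefficientSketch {

    // |A ∩ B| / min(|A|, |B|). Relies on the elements' equals and hashCode.
    static <T> float overlap(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f; // assumption: two empty sets count as identical
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0f;
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (float) intersection.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        // The duplicate 1 collapses, so both sets hold 7 elements.
        Set<Integer> scores1 = new HashSet<>(Arrays.asList(1, 1, 2, 3, 5, 8, 11, 19));
        Set<Integer> scores2 = new HashSet<>(Arrays.asList(1, 2, 4, 8, 16, 32, 64));

        // The intersection is {1, 2, 8}, so the result is 3 / 7.
        System.out.println(overlap(scores1, scores2));
    }
}
```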

Unicode

Due to Java's Unicode character representation, some care must be taken when dealing with texts containing characters outside the Basic Multilingual Plane. A metric that compares strings by their char values can report an unexpectedly high similarity, because every other char may be the same high surrogate.

All provided metrics, simplifiers and tokenizers use Unicode code points rather than char values.
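
The effect of shared surrogates can be seen with plain JDK calls. The sketch below is illustrative only: the class and the matchingFraction helper are hypothetical, and the helper is a deliberately naive position-by-position comparison rather than one of the library's metrics. It compares two different supplementary characters, U+101D1 and U+101DB, first by char value and then by code point.

```java
class SurrogateDemo {

    // Fraction of positions at which two equally long sequences agree.
    // Assumes both arrays have the same, non-zero length.
    static float matchingFraction(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (float) same / a.length;
    }

    public static void main(String[] args) {
        // Two different code points outside the Basic Multilingual Plane.
        String a = new String(Character.toChars(0x101D1));
        String b = new String(Character.toChars(0x101DB));

        System.out.println(a.length());                      // 2 chars (a surrogate pair)
        System.out.println(a.codePointCount(0, a.length())); // 1 code point
        System.out.println(a.charAt(0) == b.charAt(0));      // true: same high surrogate

        // Compared by char, half of the positions match even though
        // the two characters are different.
        System.out.println(matchingFraction(a.chars().toArray(), b.chars().toArray()));           // 0.5
        // Compared by code point, nothing matches.
        System.out.println(matchingFraction(a.codePoints().toArray(), b.codePoints().toArray())); // 0.0
    }
}
```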

When implementing your own tokenizer, take care to split the string on code points rather than chars. For example:

String str1 = "𐇑𐇛𐇜𐇐𐇡";

Tokenizer tokenizer = input -> {
    List<String> tokens = new ArrayList<>();
    for (int start = 0; start < input.length(); start = input.offsetByCodePoints(start, 1)){
        int end = input.offsetByCodePoints(start, 1);
        tokens.add(input.substring(start, end));
    }
    return tokens;
};

List<String> result = tokenizer.tokenizeToList(str1); // [ 𐇑, 𐇛, 𐇜, 𐇐, 𐇡 ]

Versions

Version
4.1.1 24-Aug-2016
4.1.0 08-Jan-2016
4.0.1 16-Nov-2015
4.0.0 15-Nov-2015
3.2.3 17-Oct-2015
3.2.1 22-Aug-2015
3.2.0 02-Aug-2015
3.1.0 24-Jul-2015
3.0.4 01-Jul-2015
3.0.3 04-Jun-2015
3.0.2 03-Jun-2015
3.0.1 06-Mar-2015
3.0.0 05-Mar-2015
