liblevenshtein

A library for spelling-correction based on Levenshtein Automata.

License	MIT License
Categories	Auto Application Layer Libs Code Generators
GroupId	com.github.universal-automata
ArtifactId	liblevenshtein
Last Version	3.0.0
Release Date	29-May-2016
Type	jar
Description	liblevenshtein: A library for spelling-correction based on Levenshtein Automata.
Project URL	https://github.com/universal-automata/liblevenshtein-java/
Source Code Management	https://github.com/universal-automata/liblevenshtein-java/

Download liblevenshtein

Filename	Size
liblevenshtein-3.0.0.pom
liblevenshtein-3.0.0.jar	166 KB
liblevenshtein-3.0.0-sources.jar	92 KB
liblevenshtein-3.0.0-javadoc.jar	261 bytes

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.universal-automata/liblevenshtein/ -->
<dependency>
    <groupId>com.github.universal-automata</groupId>
    <artifactId>liblevenshtein</artifactId>
    <version>3.0.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.universal-automata/liblevenshtein/
implementation 'com.github.universal-automata:liblevenshtein:3.0.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.universal-automata/liblevenshtein/
implementation ("com.github.universal-automata:liblevenshtein:3.0.0")

Apache Buildr

'com.github.universal-automata:liblevenshtein:jar:3.0.0'

Apache Ivy

<dependency org="com.github.universal-automata" name="liblevenshtein" rev="3.0.0">
  <artifact name="liblevenshtein" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.universal-automata', module='liblevenshtein', version='3.0.0')
)

Scala SBT

libraryDependencies += "com.github.universal-automata" % "liblevenshtein" % "3.0.0"

Leiningen

[com.github.universal-automata/liblevenshtein "3.0.0"]

Dependencies

runtime (8)

Group / Artifact	Type	Version
com.google.code.findbugs : annotations	jar	3.0.1
com.google.guava : guava	jar	19.0
com.google.protobuf : protobuf-java-util	jar	3.0.0-beta-3
com.google.protobuf : protobuf-java	jar	3.0.0-beta-3
it.unimi.dsi : fastutil	jar	7.0.12
org.apache.commons : commons-lang3	jar	3.4
org.projectlombok : lombok	jar	1.16.8
org.slf4j : slf4j-api	jar	1.7.21

Project Modules

There are no modules declared in this project.

liblevenshtein

Java

A library for generating Finite State Transducers based on Levenshtein Automata.

Levenshtein transducers accept a query term and return all terms in a dictionary that are within n spelling errors of it. They constitute a highly efficient (in both space and time) class of spelling correctors that work very well when you do not require context while making suggestions. Forget about performing a linear scan over your dictionary, with a quadratic implementation of the Levenshtein distance or Damerau-Levenshtein distance, to find all terms that are sufficiently close to the user's query. These babies find all the matching terms from your dictionary in time linear in the length of the query term (not in the size of the dictionary).
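
For comparison, the baseline that the transducer replaces can be sketched as follows. This is a minimal illustration, not part of liblevenshtein: the class name NaiveSpellingScan and its methods are hypothetical, and the distance is the textbook dynamic-programming Levenshtein distance. Every query costs one quadratic distance computation per dictionary term, which is the work the transducer avoids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical baseline, not part of liblevenshtein.
public final class NaiveSpellingScan {

  /** Textbook Levenshtein distance: O(|a| * |b|) time, two rows of memory. */
  public static int levenshtein(final String a, final String b) {
    int[] previous = new int[b.length() + 1];
    int[] current = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      previous[j] = j; // cost of building b's prefix from the empty string
    }
    for (int i = 1; i <= a.length(); i++) {
      current[0] = i; // cost of deleting the first i characters of a
      for (int j = 1; j <= b.length(); j++) {
        final int substitution =
          previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        final int deletion = previous[j] + 1;
        final int insertion = current[j - 1] + 1;
        current[j] = Math.min(substitution, Math.min(deletion, insertion));
      }
      final int[] swap = previous;
      previous = current;
      current = swap;
    }
    return previous[b.length()];
  }

  /** Linear scan: computes one distance per dictionary term. */
  public static List<String> candidates(
      final List<String> dictionary, final String query, final int maxDistance) {
    final List<String> matches = new ArrayList<>();
    for (final String term : dictionary) {
      if (levenshtein(query, term) <= maxDistance) {
        matches.add(term);
      }
    }
    return matches;
  }

  public static void main(final String[] args) {
    final List<String> dictionary =
      Arrays.asList("the", "be", "to", "of", "for", "not");
    System.out.println(candidates(dictionary, "foo", 1)); // prints [for]
  }
}
```

The cost of this scan grows with the number of terms in the dictionary, whereas the transducer's query cost is described above as depending only on the length of the query term.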

If you need context, then take the candidates generated by the transducer as a starting place, and plug them into whatever model you're using for context (such as by selecting the sequence of terms that have the greatest probability of appearing together).
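
One way to picture that re-ranking step is the sketch below. It is an assumption-laden illustration rather than anything shipped with the library: the ContextRanker class, its best method, and the bigram counts are all made up for the example, and a real context model would likely be more elaborate. The candidate terms are passed in as plain strings, as they might be after reading them from the transducer's results.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical context model, not part of liblevenshtein.
public final class ContextRanker {

  /** Made-up bigram counts, keyed as "previousWord candidate". */
  private final Map<String, Integer> bigramCounts;

  public ContextRanker(final Map<String, Integer> bigramCounts) {
    this.bigramCounts = bigramCounts;
  }

  /**
   * Returns the candidate seen most often after previousWord.
   * Ties keep the earliest candidate; an empty list yields null.
   */
  public String best(final String previousWord, final List<String> candidates) {
    String best = null;
    int bestCount = -1;
    for (final String candidate : candidates) {
      final int count =
        bigramCounts.getOrDefault(previousWord + " " + candidate, 0);
      if (count > bestCount) {
        best = candidate;
        bestCount = count;
      }
    }
    return best;
  }

  public static void main(final String[] args) {
    final Map<String, Integer> counts = new HashMap<>();
    counts.put("not to", 120); // illustrative numbers only
    counts.put("not for", 40);
    counts.put("not on", 15);

    // Candidate terms as a spelling corrector might return them for "foo".
    final List<String> candidates =
      Arrays.asList("do", "of", "on", "to", "for", "not", "you");

    System.out.println(new ContextRanker(counts).best("not", candidates)); // prints to
  }
}
```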

For a quick demonstration, please visit the GitHub Page. There's also a command-line interface, liblevenshtein-java-cli; please see its README.md for acquisition and usage information.

The library is currently written in Java, CoffeeScript, and JavaScript, but I will be porting it to other languages soon. If you have a specific language you would like to see it in, or a package-management system you would like it deployed to, let me know.

Branches

Branch	Description
master	Latest, development source
release	Latest, release source
release-3.x	Latest, release source for version 3.x
release-2.x	Latest, release source for version 2.x

Project Management

Issues are managed on waffle.io. Below you will find a graph on the rate at which I've been closing them.

Please visit Bountysource to pledge your support for ongoing issues.

Documentation

When it comes to documentation, you have several options:

Basic Usage:

Minimum Java Version

liblevenshtein has been developed against Java ≥ 1.8. It will not work with prior versions.

Installation

Latest, Development Release

Add a Maven dependency on Artifactory. For example, in a Gradle project, you would modify your repositories as follows:

repositories {
  maven {
    url 'https://oss.jfrog.org/artifactory/oss-release-local'
  }
}

Latest, Stable Release

Add a Maven dependency on one of the following:

Maven

<dependency>
  <groupId>com.github.universal-automata</groupId>
  <artifactId>liblevenshtein</artifactId>
  <version>3.0.0</version>
</dependency>

Apache Buildr

'com.github.universal-automata:liblevenshtein:jar:3.0.0'

Apache Ivy

<dependency org="com.github.universal-automata" name="liblevenshtein" rev="3.0.0" />

Groovy Grape

@Grapes(
@Grab(group='com.github.universal-automata', module='liblevenshtein', version='3.0.0')
)

Gradle / Grails

compile 'com.github.universal-automata:liblevenshtein:3.0.0'

Scala SBT

libraryDependencies += "com.github.universal-automata" % "liblevenshtein" % "3.0.0"

Leiningen

[com.github.universal-automata/liblevenshtein "3.0.0"]

Git

% git clone --progress git@github.com:universal-automata/liblevenshtein-java.git
Cloning into 'liblevenshtein-java'...
remote: Counting objects: 8117, done.        
remote: Compressing objects: 100% (472/472), done.        
remote: Total 8117 (delta 352), reused 0 (delta 0), pack-reused 7619        
Receiving objects: 100% (8117/8117), 5.52 MiB | 289.00 KiB/s, done.
Resolving deltas: 100% (5366/5366), done.
Checking connectivity... done.

% cd liblevenshtein-java
% git pull --progress
Already up-to-date.

% git fetch --progress --tags
% git checkout --progress 3.0.0
Note: checking out '3.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 4f0f172... pushd and popd silently

% git submodule init
% git submodule update

Usage

Let's say you have the following content in a plain text file called top-20-most-common-english-words.txt (note that the file has one term per line):

the
be
to
of
and
a
in
that
have
I
it
for
not
on
with
he
as
you
do
at

The following provides you a way to query its content:

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.github.liblevenshtein.collection.dictionary.SortedDawg;
import com.github.liblevenshtein.serialization.PlainTextSerializer;
import com.github.liblevenshtein.serialization.ProtobufSerializer;
import com.github.liblevenshtein.serialization.Serializer;
import com.github.liblevenshtein.transducer.Algorithm;
import com.github.liblevenshtein.transducer.Candidate;
import com.github.liblevenshtein.transducer.ITransducer;
import com.github.liblevenshtein.transducer.factory.TransducerBuilder;

// ...

final SortedDawg dictionary;
final Path dictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.txt");
try (final InputStream stream = Files.newInputStream(dictionaryPath)) {
  // The PlainTextSerializer constructor accepts an optional boolean specifying
  // whether the dictionary is already sorted lexicographically, in ascending
  // order.  If it is sorted, then passing true will optimize the construction
  // of the dictionary; you may pass false whether the dictionary is sorted or
  // not (this is the default and safest behavior if you don't know whether the
  // dictionary is sorted).
  final Serializer serializer = new PlainTextSerializer(false);
  dictionary = serializer.deserialize(SortedDawg.class, stream);
}

final ITransducer<Candidate> transducer = new TransducerBuilder()
  .dictionary(dictionary)
  .algorithm(Algorithm.TRANSPOSITION)
  .defaultMaxDistance(2)
  .includeDistance(true)
  .build();

for (final String queryTerm : new String[] {"foo", "bar"}) {
  System.out.println(
    "+-------------------------------------------------------------------------------");
  System.out.printf("| Spelling Candidates for Query Term: \"%s\"%n", queryTerm);
  System.out.println(
    "+-------------------------------------------------------------------------------");
  for (final Candidate candidate : transducer.transduce(queryTerm)) {
    System.out.printf("| d(\"%s\", \"%s\") = [%d]%n",
      queryTerm,
      candidate.term(),
      candidate.distance());
  }
}

// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "foo"
// +-------------------------------------------------------------------------------
// | d("foo", "do") = [2]
// | d("foo", "of") = [2]
// | d("foo", "on") = [2]
// | d("foo", "to") = [2]
// | d("foo", "for") = [1]
// | d("foo", "not") = [2]
// | d("foo", "you") = [2]
// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "bar"
// +-------------------------------------------------------------------------------
// | d("bar", "a") = [2]
// | d("bar", "as") = [2]
// | d("bar", "at") = [2]
// | d("bar", "be") = [2]
// | d("bar", "for") = [2]

// ...
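
The distances in the listing above can be spot-checked with a small standalone function. The sketch below assumes that Algorithm.TRANSPOSITION counts a swap of two adjacent characters as a single edit, alongside insertion, deletion, and substitution; it uses the optimal-string-alignment recurrence, which may differ from the library's internal definition on some inputs. The class name TranspositionDistance is hypothetical.

```java
// Hypothetical helper, not part of liblevenshtein.
public final class TranspositionDistance {

  /** Edit distance where an adjacent swap costs one edit (optimal string alignment). */
  public static int distance(final String a, final String b) {
    final int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) {
      d[i][0] = i; // delete every character of a's prefix
    }
    for (int j = 0; j <= b.length(); j++) {
      d[0][j] = j; // insert every character of b's prefix
    }
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        final int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int best = Math.min(
          d[i - 1][j - 1] + cost,
          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
        // Adjacent transposition: "xy" in a lines up with "yx" in b.
        if (i > 1 && j > 1
            && a.charAt(i - 1) == b.charAt(j - 2)
            && a.charAt(i - 2) == b.charAt(j - 1)) {
          best = Math.min(best, d[i - 2][j - 2] + 1);
        }
        d[i][j] = best;
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(final String[] args) {
    System.out.println(distance("foo", "for")); // prints 1
    System.out.println(distance("bar", "a"));   // prints 2
    System.out.println(distance("ab", "ba"));   // prints 1 (2 without transposition)
  }
}
```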

If you want to serialize your dictionary to a format that's easy to read later, do the following:

final Path serializedDictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.protobuf.bytes");
try (final OutputStream stream = Files.newOutputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  serializer.serialize(dictionary, stream);
}

Then, you can read the dictionary later, in much the same way you read the plain text version:

final SortedDawg deserializedDictionary;
try (final InputStream stream = Files.newInputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  deserializedDictionary = serializer.deserialize(SortedDawg.class, stream);
}

Serialization is not restricted to dictionaries; you may also (de)serialize transducers.

Please see the wiki for more details.

Reference

This library is based largely on the work of Stoyan Mihov, Klaus Schulz, and Petar Nikolaev Mitankin: Fast String Correction with Levenshtein-Automata. For more information, please see the wiki.

liblevenshtein-java is maintained by @dylon (dylon.devo+liblevenshtein-java@gmail.com)

Universal Automata

Various libraries regarding universal automata

Versions

Version
3.0.0 29-May-2016
2.2.3 04-May-2016
2.2.3-alpha.8 04-May-2016
2.2.3-alpha.1 04-May-2016
