elasticsearch-langdetect

Language detection for Elasticsearch

License	The Apache License, Version 2.0
Categories	Search Business Logic Libraries Elasticsearch
GroupId	org.xbib.elasticsearch.plugin
ArtifactId	elasticsearch-langdetect
Last Version	5.4.0.2
Release Date	08-Jun-2017
Type	jar
Description	Language detection for Elasticsearch
Project URL	https://github.com/jprante/elasticsearch-langdetect
Project Organization	xbib
Source Code Management	https://github.com/jprante/elasticsearch-langdetect

Download elasticsearch-langdetect

Filename	Size
elasticsearch-langdetect-5.4.0.2.pom
elasticsearch-langdetect-5.4.0.2.jar	2 MB
elasticsearch-langdetect-5.4.0.2-sources.jar	2 MB
elasticsearch-langdetect-5.4.0.2-plugin.zip	2 MB
elasticsearch-langdetect-5.4.0.2-javadoc.jar	489 bytes

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-langdetect/ -->
<dependency>
    <groupId>org.xbib.elasticsearch.plugin</groupId>
    <artifactId>elasticsearch-langdetect</artifactId>
    <version>5.4.0.2</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-langdetect/
implementation 'org.xbib.elasticsearch.plugin:elasticsearch-langdetect:5.4.0.2'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-langdetect/
implementation ("org.xbib.elasticsearch.plugin:elasticsearch-langdetect:5.4.0.2")

Apache Buildr

'org.xbib.elasticsearch.plugin:elasticsearch-langdetect:jar:5.4.0.2'

Apache Ivy

<dependency org="org.xbib.elasticsearch.plugin" name="elasticsearch-langdetect" rev="5.4.0.2">
  <artifact name="elasticsearch-langdetect" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.xbib.elasticsearch.plugin', module='elasticsearch-langdetect', version='5.4.0.2')
)

Scala SBT

libraryDependencies += "org.xbib.elasticsearch.plugin" % "elasticsearch-langdetect" % "5.4.0.2"

Leiningen

[org.xbib.elasticsearch.plugin/elasticsearch-langdetect "5.4.0.2"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.elasticsearch : elasticsearch	jar	5.4.0

test (3)

Group / Artifact	Type	Version
org.apache.logging.log4j : log4j-core	jar	2.8.2
junit : junit	jar	4.12
org.elasticsearch.plugin : transport-netty4-client	jar	5.4.0

Project Modules

There are no modules declared in this project.

A langdetect plugin for Elasticsearch

This plugin for Elasticsearch is built on Nakatani Shuyo’s language detector.

It uses character 3-grams and a Bayesian filter with various normalizations and feature sampling. The precision is over 99% for 53 languages.
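To make the idea of character 3-grams combined with a Bayesian filter more concrete, here is a minimal Python sketch. It is not the plugin's code and leaves out the normalizations and feature sampling mentioned above; the function names, the `smoothing` parameter, and the tiny training corpus are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Return the overlapping character 3-grams of a lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram frequency profile per language from {lang: [texts]}."""
    return {
        lang: Counter(g for text in texts for g in trigrams(text))
        for lang, texts in samples.items()
    }


def detect(text, profiles, smoothing=0.5):
    """Return {lang: probability} using naive Bayes with additive smoothing."""
    vocabulary = {g for counts in profiles.values() for g in counts}
    log_scores = {}
    for lang, counts in profiles.items():
        # One extra vocabulary slot stands in for all unseen trigrams.
        total = sum(counts.values()) + smoothing * (len(vocabulary) + 1)
        log_scores[lang] = sum(
            math.log((counts[g] + smoothing) / total) for g in trigrams(text)
        )
    # Turn log scores into probabilities, assuming a uniform prior over languages.
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}


# Hypothetical toy corpus, far too small for real use.
SAMPLES = {
    "en": ["this is a simple test of the english language", "the light of the dawn"],
    "de": ["das ist ein einfacher test der deutschen sprache", "einigkeit und recht und freiheit"],
}
PROFILES = train(SAMPLES)
```

Calling `detect("das ist ein test", PROFILES)` returns a dictionary of probabilities that sum to 1, with the German profile scoring highest for this input. A real detector needs much larger profiles per language.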

The plugin offers a mapping type to specify fields where you want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang', as you can see in the example. The field can be queried for language codes.

You can use the multi_field mapping type to combine this plugin with the attachment mapper plugin, to enable language detection in base64-encoded binary data. Currently, only UTF-8 text is supported.

The plugin also offers a REST endpoint. You can post a short UTF-8 text to it, and the plugin responds with a list of recognized languages.

Here is the list of recognized language codes:

Table 1. Languages

Code	Description
af	Afrikaans
ar	Arabic
bg	Bulgarian
bn	Bengali
cs	Czech
da	Danish
de	German
el	Greek
en	English
es	Spanish
et	Estonian
fa	Farsi
fi	Finnish
fr	French
gu	Gujarati
he	Hebrew
hi	Hindi
hr	Croatian
hu	Hungarian
id	Indonesian
it	Italian
ja	Japanese
kn	Kannada
ko	Korean
lt	Lithuanian
lv	Latvian
mk	Macedonian
ml	Malayalam
mr	Marathi
ne	Nepali
nl	Dutch
no	Norwegian
pa	Eastern Punjabi
pl	Polish
pt	Portuguese
ro	Romanian
ru	Russian
sk	Slovak
sl	Slovene
so	Somali
sq	Albanian
sv	Swedish
sw	Swahili
ta	Tamil
te	Telugu
th	Thai
tl	Tagalog
tr	Turkish
uk	Ukrainian
ur	Urdu
vi	Vietnamese
zh-cn	Chinese
zh-tw	Traditional Chinese characters (Taiwan, Hong Kong, Macau)

Table 2. Compatibility matrix

Plugin version	Elasticsearch version	Release date
5.4.0.2	5.4.0	Jun 8, 2017
5.4.0.1	5.4.0	May 30, 2017
5.4.0.0	5.4.0	May 10, 2017
5.3.2.0	5.3.2	Apr 30, 2017
5.3.1.0	5.3.1	Apr 30, 2017
5.3.0.2	5.3.0	Apr 3, 2017
5.3.0.1	5.3.0	Apr 1, 2017
5.3.0.0	5.3.0	Mar 30, 2017
5.2.2.0	5.2.2	Mar 2, 2017
5.2.1.0	5.2.1	Mar 2, 2017
5.1.2.0	5.1.2	Jan 26, 2017
2.4.4.1	2.4.4	Jan 25, 2017
2.3.3.0	2.3.3	Jun 11, 2016
2.3.2.0	2.3.2	Jun 11, 2016
2.3.1.0	2.3.1	Apr 11, 2016
2.2.1.0	2.2.1	Apr 11, 2016
2.2.0.2	2.2.0	Mar 25, 2016
2.2.0.1	2.2.0	Mar 6, 2016
2.1.1.0	2.1.1	Dec 20, 2015
2.1.0.0	2.1.0	Dec 15, 2015
2.0.1.0	2.0.1	Dec 15, 2015
2.0.0.0	2.0.0	Nov 12, 2015
1.6.0.0	1.6.0	Jul 1, 2015
1.4.4.1	1.4.4	Apr 3, 2015
1.4.4.1	1.4.4	Mar 4, 2015
1.4.0.2	1.4.0	Nov 26, 2014
1.4.0.1	1.4.0	Nov 20, 2014
1.4.0.0	1.4.0	Nov 14, 2014
1.3.1.0	1.3.0	Jul 30, 2014
1.2.1.1	1.2.1	Jun 18, 2014

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/5.4.0.2/elasticsearch-langdetect-5.4.0.2-plugin.zip

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.4.4.1/elasticsearch-langdetect-2.4.4.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install langdetect -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/1.6.0.0/elasticsearch-langdetect-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Examples

Note	The examples are written for Elasticsearch 5.x and need to be adapted for earlier versions of Elasticsearch.

A simple language detection example

In this example, we create a simple detector field, and write text to it for detection.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}

PUT /test/docs/1
{
      "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

PUT /test/docs/2
{
      "text" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}

PUT /test/docs/3
{
      "text" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "en"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "de"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "fr"
           }
       }
}

Indexing language-detected text alongside the language code

Indexing only the language code is not enough in most cases. The language-detected text should also be passed to a specific analyzer so that language-specific analysis is applied. The plugin supports this through the language_to parameter.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages": [
                  "de",
                  "en",
                  "fr",
                  "nl",
                  "it"
               ],
               "language_to": {
                  "de": "german_field",
                  "en": "english_field"
               }
            },
            "german_field": {
               "analyzer": "german",
               "type": "text"
            },
            "english_field": {
               "analyzer": "english",
               "type": "text"
            }
         }
      }
   }
}

PUT /test/docs/1
{
  "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "english_field" : "light"
       }
   }
}
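The effect of language_to in the mapping above can be pictured as a simple routing step. The following Python sketch is only a conceptual model of that behavior, not the plugin's implementation; `route_text` and `LANGUAGE_TO` are hypothetical names introduced for illustration.

```python
# Mirrors the language_to object from the mapping above.
LANGUAGE_TO = {"de": "german_field", "en": "english_field"}


def route_text(text, detected_languages, language_to):
    """Return {target_field: text} for every detected language that has a mapping."""
    return {
        language_to[lang]: text
        for lang in detected_languages
        if lang in language_to
    }
```

Under this model, a text detected as "en" ends up in english_field, where the english analyzer applies, which is why the match query for "light" finds the document. A text detected only as "fr" has no entry in language_to and is not routed anywhere.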

Language code and `multi_field`

Using multi-fields, it is possible to store the text alongside the detected language(s). Here, we use another (short nonsense) example text for demonstration, which has more than one detected language code.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "de",
                        "en",
                        "fr",
                        "nl",
                        "it"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text" : "light"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text.language" : "en"
       }
   }
}

Language detection in a binary field with `attachment` mapper plugin

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "attachment",
               "fields": {
                  "content": {
                     "type": "text",
                     "fields": {
                        "language": {
                           "type": "langdetect",
                           "binary": true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

In a shell, enter the following commands:

rm -f index.tmp
echo -n '{"content":"' >> index.tmp
echo "This is a very simple text in plain english" | base64 | tr -d '\n' >> index.tmp
echo -n '"}' >> index.tmp
curl -XPOST --data-binary "@index.tmp" 'localhost:9200/test/docs/1'
rm index.tmp
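For readers who prefer to build the request body in a script, the same payload can be produced in Python. This is a minimal sketch using only the standard library; `build_attachment_body` is a hypothetical helper, and the `content` field name is taken from the shell commands above.

```python
import base64
import json


def build_attachment_body(text, field="content"):
    """Return a JSON document whose field holds the base64-encoded UTF-8 text."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return json.dumps({field: encoded})
```

Unlike the `echo` pipeline, this sketch does not append a trailing newline to the text before encoding. Because the body is produced by `json.dumps`, quoting and escaping are handled automatically.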

POST /test/_refresh

POST /test/_search
{
   "query" : {
       "match" : {
            "content" : "very simple"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "content.language" : "en"
       }
   }
}

Language detection REST API Example

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "languages" : [
    {
      "language" : "en",
      "probability" : 0.9999972283490304
    }
  ]
}

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Das ist ein Test'
{
  "languages" : [
    {
      "language" : "de",
      "probability" : 0.9999985460514316
    }
  ]
}

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Datt isse ne test'
{
  "languages" : [
    {
      "language" : "no",
      "probability" : 0.5714275763833249
    },
    {
      "language" : "nl",
      "probability" : 0.28571402563882925
    },
    {
      "language" : "de",
      "probability" : 0.14285660343967294
    }
  ]
}
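A client usually wants just the most probable language from such a response. The following Python sketch assumes the response shape shown in the examples above (a "languages" array of objects with "language" and "probability"); `best_language` and its `min_probability` parameter are hypothetical and not part of the plugin.

```python
import json


def best_language(response_text, min_probability=0.0):
    """Return (code, probability) of the most probable language, or None."""
    languages = json.loads(response_text).get("languages", [])
    candidates = [entry for entry in languages if entry["probability"] >= min_probability]
    if not candidates:
        return None
    top = max(candidates, key=lambda entry: entry["probability"])
    return top["language"], top["probability"]


# Response copied from the 'Datt isse ne test' example above.
SAMPLE_RESPONSE = """
{
  "languages" : [
    { "language" : "no", "probability" : 0.5714275763833249 },
    { "language" : "nl", "probability" : 0.28571402563882925 },
    { "language" : "de", "probability" : 0.14285660343967294 }
  ]
}
"""
```

For ambiguous input like this, a cutoff such as `min_probability=0.9` makes the helper return None instead of a weak guess, which lets the caller fall back to a default language.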

Using the _langdetect endpoint from Sense

GET _langdetect
{
   "text": "das ist ein test"
}

Change profile of language detection

There is a "short text" profile that is better at detecting the language of texts with only a few words.

curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "de",
    "probability" : 0.9999993070517024
  } ]
}
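When calling the endpoint from a script, the profile is passed as a query parameter, as in the curl command above. The following Python sketch builds such a request with the standard library only. The helper names are hypothetical, and it writes `pretty=true` where the curl example uses the bare `pretty` flag.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import Request


def langdetect_url(host="localhost:9200", profile=None, pretty=True):
    """Build the _langdetect URL, optionally selecting a detection profile."""
    params = {}
    if pretty:
        params["pretty"] = "true"
    if profile:
        params["profile"] = profile
    return urlunsplit(("http", host, "/_langdetect", urlencode(params), ""))


def langdetect_request(text, **url_options):
    """Prepare (but do not send) a POST request carrying the UTF-8 text."""
    return Request(langdetect_url(**url_options), data=text.encode("utf-8"), method="POST")
```

For example, `langdetect_request("Das ist ein Test", profile="short-text")` prepares a request that can be sent with `urllib.request.urlopen` once a node with the plugin installed is running.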

Settings

These settings can be used in elasticsearch.yml to modify language detection.

Use with caution. You do not need to modify these settings; the list is given only for completeness. To modify the model parameters successfully, you should study the source code and be familiar with probabilistic matching using naive Bayes with character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994.

Name	Description
`languages`	a comma-separated list of language codes (such as de,en,fr…) used to restrict (and speed up) the detection process
`map.<code>`	a substitution code for a language code
`number_of_trials`	number of trials, affects CPU usage, default: 7
`alpha`	additional smoothing parameter, default: 0.5
`alpha_width`	the width of smoothing, default: 0.05
`iteration_limit`	safeguard to break loop, default: 10000
`prob_threshold`	default: 0.1
`conv_threshold`	detection is terminated when normalized probability exceeds this threshold, default: 0.99999
`base_freq`	default: 10000

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Credits

Thanks to Alexander Reelsen for his OpenNLP plugin, from which I copied and adapted the mapping type code.

License

elasticsearch-langdetect - a language detection plugin for Elasticsearch

Derived work of language-detection by Nakatani Shuyo http://code.google.com/p/language-detection/

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Versions

Version
5.4.0.2 08-Jun-2017
5.4.0.0 10-May-2017
5.3.2.0 30-Apr-2017
5.3.1.0 30-Apr-2017
5.3.0.2 06-Apr-2017
5.3.0.1 01-Apr-2017
5.2.2.0 02-Mar-2017
5.1.2.0 26-Jan-2017
2.4.4.1 25-Jan-2017
2.4.4.0 15-Jan-2017
2.4.3.0 15-Jan-2017
