elasticsearch-analysis-decompound

License	The Apache License, Version 2.0
Categories	Search Business Logic Libraries Elasticsearch
GroupId	org.xbib.elasticsearch.plugin
ArtifactId	elasticsearch-analysis-decompound
Last Version	6.3.2.0
Release Date	03-Oct-2018
Type	jar
Description	elasticsearch-analysis-decompound
Project URL	https://github.com/jprante/elasticsearch-analysis-decompound
Project Organization	xbib
Source Code Management	https://github.com/jprante/elasticsearch-analysis-decompound

Download elasticsearch-analysis-decompound

Filename	Size
elasticsearch-analysis-decompound-6.3.2.0.pom
elasticsearch-analysis-decompound-6.3.2.0.jar	1 MB
elasticsearch-analysis-decompound-6.3.2.0-sources.jar	1 MB
elasticsearch-analysis-decompound-6.3.2.0-plugin.zip	1 MB
elasticsearch-analysis-decompound-6.3.2.0-javadoc.jar	261 bytes

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-analysis-decompound/ -->
<dependency>
    <groupId>org.xbib.elasticsearch.plugin</groupId>
    <artifactId>elasticsearch-analysis-decompound</artifactId>
    <version>6.3.2.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-analysis-decompound/
implementation 'org.xbib.elasticsearch.plugin:elasticsearch-analysis-decompound:6.3.2.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.xbib.elasticsearch.plugin/elasticsearch-analysis-decompound/
implementation ("org.xbib.elasticsearch.plugin:elasticsearch-analysis-decompound:6.3.2.0")

Apache Buildr

'org.xbib.elasticsearch.plugin:elasticsearch-analysis-decompound:jar:6.3.2.0'

Apache Ivy

<dependency org="org.xbib.elasticsearch.plugin" name="elasticsearch-analysis-decompound" rev="6.3.2.0">
  <artifact name="elasticsearch-analysis-decompound" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.xbib.elasticsearch.plugin', module='elasticsearch-analysis-decompound', version='6.3.2.0')
)

Scala SBT

libraryDependencies += "org.xbib.elasticsearch.plugin" % "elasticsearch-analysis-decompound" % "6.3.2.0"

Leiningen

[org.xbib.elasticsearch.plugin/elasticsearch-analysis-decompound "6.3.2.0"]

Dependencies

test (4)

Group / Artifact	Type	Version
org.apache.logging.log4j : log4j-core	jar	2.11.0
org.xbib.elasticsearch : elasticsearch-test-framework	jar	6.3.2.1
org.xbib.elasticsearch : elasticsearch-analysis-common	jar	6.3.2.1
org.elasticsearch.plugin : transport-netty4-client	jar	6.3.2

Project Modules

There are no modules declared in this project.

Decompound plugin for Elasticsearch

This is an implementation of a word decompounder plugin for Elasticsearch.

Not all languages compound several words into one. Compounding is used in German, the Scandinavian languages, Finnish, and Korean.

This code is a reworked implementation of the Baseforms Tool found in the ASV toolbox of Chris Biemann, Automatische Sprachverarbeitung of Leipzig University.

Lucene comes with two compound word token filters: a dictionary-based and a hyphenation-based variant. Both have a disadvantage: they require loading a word list into memory before they run. This decompounder does not require word lists and can process German-language text out of the box. It uses prebuilt Compact Patricia Tries, provided by the ASV toolbox, for efficient word segmentation.
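
For contrast, the dictionary-based approach mentioned above can be illustrated with a small Python sketch. It is neither this plugin's algorithm nor Lucene's actual code: the function dictionary_subwords, its parameters, and the three-word list are hypothetical, chosen only to show why such a filter needs a word list in memory before it can split anything.

```python
def dictionary_subwords(token, dictionary, min_len=2, max_len=15):
    """Return every dictionary word found inside token, in order of position.

    Simplified illustration of dictionary-based decompounding; the
    parameter names and defaults are assumptions made for this sketch.
    """
    lowered = token.lower()
    found = []
    for start in range(len(lowered)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(lowered):
                break  # candidate would run past the end of the token
            candidate = lowered[start:end]
            if candidate in dictionary:
                found.append(candidate)
    return found


# Hypothetical word list; without it, nothing can be split.
WORDS = {"donau", "dampf", "schiff"}

print(dictionary_subwords("Donaudampfschiff", WORDS))  # → ['donau', 'dampf', 'schiff']
print(dictionary_subwords("Donaudampfschiff", set()))  # → []
```

With an empty word list the sketch finds no subwords at all, which is the dependency the decompounder described here avoids.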

Table 1. Compatibility matrix

Plugin version	Elasticsearch version	Release date
5.4.3.0	5.4.3	Aug 24 2017
5.4.0.0	5.4.0	May 12 2017
5.1.1.0	5.1.1	Dec 19 2016
2.4.1.0	2.4.1	Nov 16 2016
2.3.4.0	2.3.4	Jul 30 2016
2.3.3.0	2.3.3	Jun 1 2016
2.3.2.0	2.3.2	Jun 1 2016
2.3.1.0	2.3.1	Jun 1 2016
2.3.0.0	2.3.0	Mar 31 2016
2.2.1.0	2.2.1	Mar 31 2016
2.2.0.0	2.2.0	Feb 19 2016
2.1.1.0	2.1.1	Dec 22 2015
2.1.0.0	2.1.0	Dec 8 2015
1.7.1.3	1.7.1	Nov 17 2015
1.5.2.0	1.5.2	Oct 26 2015

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-decompound/5.4.3.0/elasticsearch-analysis-decompound-5.4.3.0-plugin.zip

Do not forget to restart the node after installing.

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-decompound/2.4.1.0/elasticsearch-analysis-decompound-2.4.1.0-plugin.zip

Do not forget to restart the node after installing.

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Example

PUT /test
{
   "settings": {
       "index": {
           "analysis": {
               "filter": {
                   "decomp":{
                       "type" : "decompound"
                   }
               },
               "analyzer": {
                   "decomp": {
                       "type": "custom",
                       "tokenizer" : "standard",
                       "filter" : [
                           "decomp",
                           "unique",
                           "german_normalization",
                           "lowercase"
                       ]
                   }
               }
           }
       }
   },
   "mappings": {
      "docs" : {
            "properties": {
                "text" : {
                    "type" : "text",
                    "analyzer": "decomp"
                }
            }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "dampf schiff"
        }
    }
}

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"

It is recommended to add the Unique token filter to skip tokens that occur more than once.
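
The effect of removing repeated tokens can be sketched in Python on the token list quoted above. This only emulates the outcome and is not Elasticsearch's implementation; the function unique_tokens is hypothetical, and the sketch assumes duplicates are dropped wherever they occur in the stream.

```python
def unique_tokens(tokens):
    """Keep the first occurrence of each token and drop later repeats."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


# Token stream quoted above for the sample sentence.
TOKENS = [
    "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der",
    "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei",
    "auf", "auf", "dem", "dem",
    "Donaudampfschiff", "Donau", "dampf", "schiff",
    "hat", "hat", "viel", "viel",
    "Ökosteuer", "Ökosteuer", "gekostet", "gekosten",
]

deduplicated = unique_tokens(TOKENS)
print(len(TOKENS), len(deduplicated))  # → 27 20
```

Words that the decompounder leaves unsplit, such as "Die" or "auf", account for all seven repeats in this sample.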

The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

Threshold

The decompounding algorithm uses a threshold to decide whether a word has been decomposed successfully. If the threshold is too low, words could silently disappear from the index. In that case, adjust the threshold until words no longer disappear.

The default threshold value is 0.51. You can modify it in the settings:

"index" : {
    "analysis" : {
        "filter" : {
            "decomp" : {
                "type" : "decompound",
                "threshold" : 0.51
            }
        }
    }
}
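
When trying several threshold values, it can help to generate the index body rather than edit it by hand. The following Python sketch does only that; the helper decompound_index_body is hypothetical, the analyzer chain is copied from the Example section, and the value 0.6 is an arbitrary illustration, not a recommendation.

```python
import json


def decompound_index_body(threshold=0.51):
    """Build an index-creation body whose decompound filter uses threshold."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "filter": {
                        "decomp": {"type": "decompound", "threshold": threshold}
                    },
                    "analyzer": {
                        "decomp": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": [
                                "decomp",
                                "unique",
                                "german_normalization",
                                "lowercase",
                            ],
                        }
                    },
                }
            }
        }
    }


# 0.6 is an arbitrary example value.
body = decompound_index_body(threshold=0.6)
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a PUT request that creates a test index, in the same way as in the Example section.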

Subwords

Sometimes only the decomposed subwords should be indexed. To do this, set the parameter "subwords_only" to true:

"index" : {
    "analysis" : {
        "filter" : {
            "decomp" : {
                "type" : "decompound",
                "subwords_only" : true
            }
        }
    }
}
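
The difference between the two modes can be sketched in Python using the sample output from the Example section. This is an emulation under an assumption, not the plugin's code: it assumes that "subwords_only" simply omits the original token and keeps the parts the decompounder reports, including words that are not split. The names DECOMPOSED and emit_tokens are hypothetical.

```python
# (original token, parts reported by the decompounder), derived from the
# sample output in the Example section. Words that are not split are
# listed with themselves as their only part.
DECOMPOSED = [
    ("Die", ["Die"]),
    ("Jahresfeier", ["Jahr", "feier"]),
    ("der", ["der"]),
    ("Rechtsanwaltskanzleien", ["Recht", "anwalt", "kanzlei"]),
    ("auf", ["auf"]),
    ("dem", ["dem"]),
    ("Donaudampfschiff", ["Donau", "dampf", "schiff"]),
    ("hat", ["hat"]),
    ("viel", ["viel"]),
    ("Ökosteuer", ["Ökosteuer"]),
    ("gekostet", ["gekosten"]),
]


def emit_tokens(decomposed, subwords_only=False):
    """Emit each original token followed by its parts, or the parts alone."""
    tokens = []
    for original, parts in decomposed:
        if not subwords_only:
            tokens.append(original)
        tokens.extend(parts)
    return tokens


print(emit_tokens(DECOMPOSED))                      # original tokens plus parts
print(emit_tokens(DECOMPOSED, subwords_only=True))  # parts only (assumed behavior)
```

With subwords_only left at false, the sketch reproduces the 27 tokens listed in the Example section; the output for true should be checked against a real index before relying on it.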

References

The Compact Patricia Trie data structure can be found in

Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 1968, 15(4):514–534

The compound splitter used for generating features for document classification is described in

Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland

The base form reduction step (for Norwegian) is described in

Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL 2006, Turku, Finland

License

Decompounder Analysis Plugin for Elasticsearch

Derived work of ASV toolbox http://asv.informatik.uni-leipzig.de/asv/methoden

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Versions

Version
6.3.2.0 03-Oct-2018
5.4.0.0 12-May-2017
5.1.1.0 19-Dec-2016
