ClearWSD NLP4J Parser

ClearWSD wrapper for NLP4J

License	The Apache License, Version 2.0
GroupId	io.github.clearwsd
ArtifactId	clearwsd-nlp4j
Last Version	0.12.1
Release Date	14-Sep-2020
Type	jar
Description	ClearWSD NLP4J Parser (ClearWSD wrapper for NLP4J)

Download clearwsd-nlp4j

Filename	Size
clearwsd-nlp4j-0.12.1.pom
clearwsd-nlp4j-0.12.1.jar	6 KB
clearwsd-nlp4j-0.12.1-sources.jar	5 KB
clearwsd-nlp4j-0.12.1-javadoc.jar	35 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-nlp4j/ -->
<dependency>
    <groupId>io.github.clearwsd</groupId>
    <artifactId>clearwsd-nlp4j</artifactId>
    <version>0.12.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-nlp4j/
implementation 'io.github.clearwsd:clearwsd-nlp4j:0.12.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-nlp4j/
implementation ("io.github.clearwsd:clearwsd-nlp4j:0.12.1")

Apache Buildr

'io.github.clearwsd:clearwsd-nlp4j:jar:0.12.1'

Apache Ivy

<dependency org="io.github.clearwsd" name="clearwsd-nlp4j" rev="0.12.1">
  <artifact name="clearwsd-nlp4j" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='io.github.clearwsd', module='clearwsd-nlp4j', version='0.12.1')
)

Scala SBT

libraryDependencies += "io.github.clearwsd" % "clearwsd-nlp4j" % "0.12.1"

Leiningen

[io.github.clearwsd/clearwsd-nlp4j "0.12.1"]

Dependencies

compile (6)

Group / Artifact	Type	Version
io.github.clearwsd : clearwsd-core	jar	0.12.1
edu.emory.mathcs.nlp : nlp4j-api	jar	1.1.3
edu.emory.mathcs.nlp : nlp4j-english	jar	1.1.3
com.google.guava : guava	jar	27.0-jre
org.slf4j : slf4j-api	jar	1.7.25
ch.qos.logback : logback-classic	jar	1.2.3

provided (1)

Group / Artifact	Type	Version
org.projectlombok : lombok	jar	1.18.4

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.12

Project Modules

There are no modules declared in this project.

ClearWSD

ClearWSD is a word sense disambiguation tool for the JVM, with core modules available under an Apache 2.0 license. It provides simple APIs for integration with other libraries, as well as a command-line interface (CLI) for non-programmatic use. It is modular, allowing for alternative implementations of sub-components such as parsers or resources used for feature extraction.

It is meant for use in both research and production settings. Main features include:

State-of-the-art results in verb sense disambiguation over VerbNet classes
Automatic optimization of feature subsets and hyperparameters
Production-ready pre-trained models
Easy training of new models using CLI
1000+ sense predictions per second on a 2014 MacBook Pro

API

The easiest way to use ClearWSD in your project is through Maven: add the corresponding ClearWSD dependencies to your project's pom.xml.

Releases are distributed through Maven Central.

To try out ClearWSD in your project, you will need to include three modules, the first being clearwsd-core:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-core</artifactId>
  <version>0.12.1</version>
</dependency>

The second is a parser module, used for pre-processing and feature extraction. A wrapper for the NLP4J dependency parser is provided:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-nlp4j</artifactId>
  <version>0.12.1</version>
</dependency>

Finally, to use pre-trained word sense disambiguation models (compatible with NLP4J), just add the following:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-models</artifactId>
  <version>0.12.1</version>
</dependency>

You can then try out a pre-trained model (from OntoNotes) with the following:

import java.util.List;

import io.github.clearwsd.DefaultSensePredictor;
import io.github.clearwsd.SensePrediction;
import io.github.clearwsd.corpus.ontonotes.OntoNotesSense;
import io.github.clearwsd.parser.Nlp4jDependencyParser;

public class Test {
    public static void main(String[] args) {
        Nlp4jDependencyParser parser = new Nlp4jDependencyParser(); // load dependency parser
        DefaultSensePredictor<OntoNotesSense> wsd = DefaultSensePredictor.loadFromResource(
                "models/nlp4j-ontonotes.bin", parser); // load WSD model

        String sentence = "Mary took the bus to school (which " // 8 --> travel by means of
                + "took about 30 minutes), and studiously "     // 3 --> require or necessitate
                + "took notes about the Bolsheviks "            // 2 --> light verb usage
                + "taking over the Winter Palace";              // 9 --> claim or conquer, become in control of

        List<String> tokens = parser.tokenize(sentence); // split sentence into tokens

        // display sense predictions and their definitions
        for (SensePrediction<OntoNotesSense> prediction : wsd.predict(tokens)) {
            System.out.println(prediction.sense().getNumber() + " --> " + prediction.sense().getName());
        }
    }
}

Command Line Interface

ClearWSD provides a command-line interface for training, evaluation, and application of word sense disambiguation models.

To build ClearWSD, you will need Java 8 or above and Apache Maven.

On OS X/Linux, you can then build the project for CLI use:

git clone https://github.com/clearwsd/clearwsd.git
cd clearwsd
mvn package -DskipTests -P build-nlp4j-cli

To use the Stanford Parser wrapper module (GPL licensed) instead, use build-stanford-cli:

mvn package -DskipTests -P build-stanford-cli

You can see a help message and available options with the following command (assuming you have already followed the CLI setup instructions):

java -jar clearwsd-cli-*.jar --help

Usage: WordSenseCLI [options]
  Options:
    -model, -m
      Path to classifier model (for loading or saving)
    -input, -i
      Path to unlabeled input file for new predictions
    -train, -t
      Path to training data (required for training)
    -valid, -dev, -v
      Path to validation data
    -cv, -folds
      Number of cross-validation folds
      Default: 0
    -test
      Path to test data
    --itl, --interactive, --loop
      Start an interactive test session on provided model (after training 
      and/or testing)
      Default: false
    --om
      Output misses on evaluation data in separate files
      Default: false
    --reparse
      Reparse, even if a parsed file of the same name already exists
      Default: false
    --help, --usage
      Display usage
    -corpus
      Training/evaluation corpus type
      Default: Semlink
      Possible Values: [Semeval, Semlink]
    -dataExt
      Extension for training data file (only needed for Semeval XML corpora)
      Default: .data.xml
    -ext
      Parse file extension, appended to input file names to save parses
      Default: .dep
    -inventory, -inv
      Sense inventory
      Possible Values: [VerbNet, WordNet, OntoNotes, Counting]
    -inventoryPath
      Sense inventory path (optional)
    -keyExt
      Extension for sense key file (only needed for Semeval XML corpora)
      Default: .gold.key.txt
    -output, -o
      Path to output file where predictions on the input file are stored

Training

To train a new model, you must specify the path to a training data file with -train, as well as a path for the resulting saved model, using -model:

java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -model path/to/save/model.bin

The default corpus (Semlink) expects files with one instance per line in the following format:

document_id <space> sentence_id <space> token# <space> lemma <space> sense_label <tab> sentence_text

sentence_text should be a single sentence containing the instance, with tokens separated by spaces:

example.txt 25 3 get comprehend-87.2-1	Oh , I get it .
example.txt 57 2 get get-13.5.1-1	Did you get that part ?
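
The line format above can be read back with a few lines of Java. The following is a minimal sketch, not part of ClearWSD: the class name SemlinkInstance and its fields are hypothetical, and it assumes exactly five space-separated fields before the tab. In both example lines, token# matches the zero-based position of the lemma in sentence_text, so the sketch treats it as a zero-based index.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one line of a Semlink-style training file (not a ClearWSD class). */
final class SemlinkInstance {
    final String documentId;
    final int sentenceId;
    final int tokenIndex;      // zero-based position of the target token, as in the examples
    final String lemma;
    final String senseLabel;
    final List<String> tokens; // sentence_text split on whitespace

    private SemlinkInstance(String documentId, int sentenceId, int tokenIndex,
                            String lemma, String senseLabel, List<String> tokens) {
        this.documentId = documentId;
        this.sentenceId = sentenceId;
        this.tokenIndex = tokenIndex;
        this.lemma = lemma;
        this.senseLabel = senseLabel;
        this.tokens = tokens;
    }

    /** Parses "document_id sentence_id token# lemma sense_label<TAB>sentence_text". */
    static SemlinkInstance parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Missing tab before sentence text: " + line);
        }
        String[] fields = line.substring(0, tab).trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields before the tab: " + line);
        }
        List<String> tokens = Arrays.asList(line.substring(tab + 1).trim().split("\\s+"));
        return new SemlinkInstance(fields[0], Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]), fields[3], fields[4], tokens);
    }

    public static void main(String[] args) {
        SemlinkInstance inst = parse("example.txt 25 3 get comprehend-87.2-1\tOh , I get it .");
        // Print the labeled token together with its sense label.
        System.out.println(inst.tokens.get(inst.tokenIndex) + " / " + inst.senseLabel);
    }
}
```

A check like this can be useful for validating a training file before passing it to -train, since a missing tab or a misaligned token index is easy to overlook by eye.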

Evaluation

The CLI provides several modes of evaluation/application. You can perform cross-validation, test on a specific dataset, apply a trained model to raw text, or try out a model interactively by typing in test sentences.

Cross Validation

Specify the number of folds with -cv. For example, -cv 5 runs 5-fold cross-validation:

java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -cv 5
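
For readers unfamiliar with k-fold cross-validation, the idea can be sketched as follows. This is only an illustration of the general technique under simple assumptions (round-robin assignment, no shuffling); how ClearWSD itself assigns instances to folds is not described here, and the class name KFoldSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative k-fold split (hypothetical; not ClearWSD's implementation). */
final class KFoldSketch {

    /** Assigns item i to fold i % k, so fold sizes differ by at most one. */
    static <T> List<List<T>> split(List<T> items, int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            folds.get(i % k).add(items.get(i));
        }
        return folds;
    }

    /** Training data for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<Integer> instances = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> folds = split(instances, 5);
        // Each round trains on k-1 folds and evaluates on the remaining one.
        for (int heldOut = 0; heldOut < folds.size(); heldOut++) {
            System.out.println("test=" + folds.get(heldOut)
                    + " train=" + trainingSet(folds, heldOut));
        }
    }
}
```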

Test Dataset

Specify a test file with -test:

java -jar clearwsd-cli-*.jar -test path/to/test/file.txt -model path/to/trained/model.bin

Application

To apply a trained model to new (raw) data, specify a path with -input. Optionally specify an output path with -output:

java -jar clearwsd-cli-*.jar -input path/to/raw/data.txt -output path/to/predictions.txt \
-model clearwsd-models/src/main/resources/models/nlp4j-ontonotes.bin

Interactive Testing

Use --loop (or its aliases --itl and --interactive) to start an interactive command-line test loop, where you can enter sentences and see predictions:

java -jar clearwsd-cli-*.jar --loop -model clearwsd-models/src/main/resources/models/nlp4j-verbnet-3.3.bin

After the parser and model finish loading, you should then be able to enter test sentences and see predicted senses:

Enter test input ("EXIT" to quit).
> please take notes

Please
take[25.2]
notes

> Take the train home.

Take[51.4.3]
the
train
home

> Take on the government

Take[98]
on
the
government

> Take the money out of the vault

Take[13.5.1]
the
money
out
of
the
vault
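
If you want to post-process output like the above, each line can be split into a token and an optional sense label. The sketch below assumes the layout shown in the session (one token per line, with the predicted sense in square brackets directly after a tagged token); the class name LoopOutputLine is hypothetical and not part of ClearWSD.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for one output line of the interactive loop, e.g. "Take[13.5.1]". */
final class LoopOutputLine {
    // A token followed by a bracketed sense label at the end of the line.
    private static final Pattern LABELED = Pattern.compile("^(.+)\\[([^\\[\\]]+)\\]$");

    final String token;
    final String sense; // null when the token carries no sense label

    private LoopOutputLine(String token, String sense) {
        this.token = token;
        this.sense = sense;
    }

    static LoopOutputLine parse(String line) {
        String trimmed = line.trim();
        Matcher m = LABELED.matcher(trimmed);
        if (m.matches()) {
            return new LoopOutputLine(m.group(1), m.group(2));
        }
        return new LoopOutputLine(trimmed, null);
    }

    public static void main(String[] args) {
        String[] lines = {"Take[13.5.1]", "the", "money", "out", "of", "the", "vault"};
        for (String line : lines) {
            LoopOutputLine parsed = parse(line);
            // Only tagged tokens are reported.
            if (parsed.sense != null) {
                System.out.println(parsed.token + " -> " + parsed.sense);
            }
        }
    }
}
```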

License

Please refer to the LICENSE.txt in individual modules.

ClearWSD

Versions

Version
0.12.1 14-Sep-2020
0.12.0 30-Jun-2019
0.10.0 07-May-2019
