About
extjwnl-data-mcr30 prepackages jars with wordnet data from the Multilingual Central Repository 3.0 (2016 release; currently only the Spanish portion).
A configuration file is included to make these resources easy to use in your project.
Getting started
In your pom.xml:
<dependency>
<groupId>net.sf.extjwnl</groupId>
<artifactId>extjwnl</artifactId>
<version>2.0.3</version>
</dependency>
<dependency>
<groupId>net.sf.extjwnl.mcr</groupId>
<artifactId>extjwnl-data-spa-mcr30</artifactId>
<version>1.0.5</version>
</dependency>
In your code:
import net.sf.extjwnl.dictionary.*;
Dictionary d = Dictionary.getDefaultResourceInstance();
Mapping Between Dictionaries
extjwnl-data-mcr30 also contains an alignment module which supports loading multiple dictionaries and mapping word senses between them. To use it, first add the following dependency to your pom.xml:
<dependency>
<groupId>net.sf.extjwnl.mcr</groupId>
<artifactId>extjwnl-data-alignment-mcr30</artifactId>
<version>1.0.5</version>
</dependency>
Then you can load the MCR 3.0 Spanish wordnet together with two versions (3.0 and 3.1) of Princeton WordNet:
import net.sf.extjwnl.data.Synset;
import net.sf.extjwnl.dictionary.*;
import net.sf.extjwnl.data.mcr30.alignment.*;
Dictionary spa = InterLingualIndex.getDictionary("mcr30", "spa");
Dictionary wn31 = InterLingualIndex.getDictionary("wn31", "eng");
Dictionary wn30 = InterLingualIndex.getDictionary("wn30", "eng");
After that, if you have a Spanish synset, you can find the corresponding English synset (if a mapping exists):
Synset englishSynset = InterLingualIndex.mapSynset(spanishSynset, wn31);
If you need to map many synsets, use the SynsetMapper interface instead for better performance:
SynsetMapper mapper = InterLingualIndex.loadMapper(spa, wn31);
Synset englishSynset1 = mapper.mapSynset(spanishSynset1);
Synset englishSynset2 = mapper.mapSynset(spanishSynset2);
...
For more information, see the javadoc.
Acknowledgements
The data for this package comes from the Multilingual Central Repository (MCR):
Aitor Gonzalez-Agirre, Egoitz Laparra and German Rigau (2012) Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base. In Proceedings of the 6th Global WordNet Conference (GWC 2012) Matsue, Japan.
@InProceedings{Gonzalez-Agirre:Laparra:Rigau:2012,
author = "Aitor Gonzalez-Agirre and Egoitz Laparra and German Rigau",
title = "Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base",
booktitle = "Proceedings of the 6th Global WordNet Conference (GWC 2012)",
year = 2012,
address = "Matsue",
}
This package is designed for use with extjwnl. The resource bundling is based on the pattern set by extjwnl-data-wn31 for the English-language Princeton WordNet 3.1.
Princeton University "About WordNet." WordNet. Princeton University. 2010.
MCR data is converted into extjwnl format via a modified version of the wn-mcr-transform script. You can find the modified version here.
The MCR is aligned with Princeton WordNet 3.0, so to realign it to Princeton WordNet 3.1 we use the 3.0->3.1 mapping file mapping_wordnet.json from:
@misc{ZendelWordNetConv19,
author = {Zendel, Oliver},
title = {WordNet v3.0 vs. v3.1 mapping},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ozendelait/wordnet-to-json}},
commit = {7521b70937355e826ea7e028a615108cdb18d0ee}
}
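The realignment described above amounts to chaining two lookups: from an MCR synset to its Princeton WordNet 3.0 counterpart, and from there to WordNet 3.1. The following is a minimal sketch of that idea only. The class name RealignmentSketch, the realign method, and all identifiers in the sample data are hypothetical; the sketch does not model the actual layout of mapping_wordnet.json or the way this package stores alignments.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: all identifiers below are made up, and the real
// mapping_wordnet.json layout is not modeled here.
class RealignmentSketch {

    // Chains two lookups: MCR synset -> WordNet 3.0 synset -> WordNet 3.1 synset.
    // Returns an empty Optional when either step has no mapping.
    static Optional<String> realign(String mcrId,
                                    Map<String, String> mcrToWn30,
                                    Map<String, String> wn30ToWn31) {
        return Optional.ofNullable(mcrToWn30.get(mcrId))
                       .map(wn30ToWn31::get);
    }

    public static void main(String[] args) {
        // Hypothetical alignment of two Spanish synsets with WordNet 3.0.
        Map<String, String> mcrToWn30 = new HashMap<>();
        mcrToWn30.put("spa-1", "wn30-1");
        mcrToWn30.put("spa-2", "wn30-2");

        // Hypothetical 3.0 -> 3.1 mapping that covers only one of them.
        Map<String, String> wn30ToWn31 = new HashMap<>();
        wn30ToWn31.put("wn30-1", "wn31-9");

        System.out.println(realign("spa-1", mcrToWn30, wn30ToWn31)); // Optional[wn31-9]
        System.out.println(realign("spa-2", mcrToWn30, wn30ToWn31)); // Optional.empty
    }
}
```

Returning an Optional mirrors the caveat in the usage section that a mapping may not exist for every synset.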
Stemming
Language-specific stemming rules are packaged in each data module; for example, here are the Spanish-specific stemming rules.
Exceptional Forms
For Spanish, exceptional forms (irregular verb conjugations, noun pluralizations, and adjective pluralizations) are enumerated using the morphala project. All lemmas from the MCR dictionary are run through morphala's conjugation/pluralization routines. From each resulting derived form, we attempt to reverse-derive the lemma as a base form via the standard DetachSuffixesOperation. When this fails, we treat the derived form as an exception and add it to supplemental_spa.txt.
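The round-trip check described above can be sketched in plain Java. This is an illustration under stated assumptions, not the project's build code: the class ExceptionalFormSketch, the two suffix rules in NOUN_RULES, and the sample plural/lemma pairs are all hypothetical. They stand in for the real Spanish stemming rules, for the behavior of DetachSuffixesOperation, and for morphala's output, none of which are reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: the suffix rules and sample data are hypothetical,
// not the rules shipped in the Spanish data module or morphala's real output.
class ExceptionalFormSketch {

    // Hypothetical noun rules: {suffix to detach, replacement ending}.
    static final String[][] NOUN_RULES = { {"es", ""}, {"s", ""} };

    // Produces candidate base forms by applying every rule whose suffix matches.
    static Set<String> detachSuffixes(String derived) {
        Set<String> candidates = new LinkedHashSet<>();
        for (String[] rule : NOUN_RULES) {
            String suffix = rule[0];
            if (derived.endsWith(suffix) && derived.length() > suffix.length()) {
                String stem = derived.substring(0, derived.length() - suffix.length());
                candidates.add(stem + rule[1]);
            }
        }
        return candidates;
    }

    // Returns the derived forms whose lemma the rules cannot recover;
    // these are the forms that would be recorded as exceptions.
    static List<String> findExceptions(Map<String, String> derivedToLemma) {
        List<String> exceptions = new ArrayList<>();
        for (Map.Entry<String, String> e : derivedToLemma.entrySet()) {
            if (!detachSuffixes(e.getKey()).contains(e.getValue())) {
                exceptions.add(e.getKey());
            }
        }
        return exceptions;
    }

    public static void main(String[] args) {
        // Derived form -> lemma, standing in for generated plurals.
        Map<String, String> derived = new LinkedHashMap<>();
        derived.put("casas", "casa");   // recovered by detaching "s"
        derived.put("flores", "flor");  // recovered by detaching "es"
        derived.put("peces", "pez");    // z -> c change defeats both rules
        System.out.println(findExceptions(derived)); // prints [peces]
    }
}
```

With a richer rule set, fewer forms would fall through to the exception list, which is why the regular stemming rules and the exception file are maintained together.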
Future Work
If you are interested in adding support for languages beyond Spanish (such as Portuguese), please open an issue on this project. The bare minimum for a new language is to bring in its language-specific dataset from the MCR and to add stemming rules for regular inflections. A bonus would be to enhance morphala with the support needed to generate exceptional forms for that language.