GermaNLTK
An Introduction to German NLTK Features
Philipp Nahratow
Martin Gäbler
Stefan Reinhardt
Raphael Brand
Leon Schröder
v0.01
GermaNLTK is an integration of GermaNet and Projekt Deutscher Wortschatz into NLTK. GermaNet is a semantically-oriented dictionary of German, similar to WordNet.
GermaNLTK started as a student project at Hochschule der Medien Stuttgart (http://www.hdm-stuttgart.de/).
Legal hints
All source code provided by the GermaNLTK project is open source (GPL). The data it relies on remains the property of its respective owners.
The Germanet Corpus Reader is based on a proprietary corpus by Universität Tübingen. We are not allowed to distribute the files ourselves, so please refer to http://www.sfs.uni-tuebingen.de/GermaNet/index.shtml for further information on how to obtain the corpus files.
The German Baseform Lemmatizer is based on a database provided by Projekt Deutscher Wortschatz.
THE DATABASE IS FREE FOR EDUCATIONAL AND RESEARCH PURPOSES BUT NOT FOR COMMERCIAL USE. FOR MORE INFORMATION VISIT: http://wortschatz.uni-leipzig.de/
Germanet Corpus Reader
Installation
Some manual steps
Until the GermaNLTK features are fully integrated into NLTK, you need to perform the following steps manually:
- Locate your Python installation folder.
- Navigate to “\Lib\site-packages\nltk\corpus”.
- Open the file “__init__.py” which you find in this folder.
- You will see a list of LazyCorpusLoader instances being instantiated. In this list, insert the following line:
  germanet = LazyCorpusLoader('germanet', GermaNetCorpusReader)
- Close the file and navigate to the folder “reader”.
- Open the file ‘__init__.py’ in the ‘reader’ folder and add the following import:
  from nltk.corpus.reader.germanet import *
- Within the list ‘__all__’, add the following item:
  'GermaNetCorpusReader'
- Copy the following files to the “reader” folder:
  - germanet.py
  - GermanetDBBuilder.py
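If you are unsure where the folders from the steps above are located, a short standard-library sketch can print them (paths are resolved for the running interpreter, so run it with the Python installation you want to modify):

```python
import os
import sysconfig

# "purelib" is the site-packages folder where pure-Python packages
# (including NLTK) are installed.
site_packages = sysconfig.get_paths()["purelib"]
corpus_dir = os.path.join(site_packages, "nltk", "corpus")
reader_dir = os.path.join(corpus_dir, "reader")
print(corpus_dir)
print(reader_dir)
```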
Get the data
Although the code included in NLTK to access GermaNet is open source, the database itself is proprietary to Universität Tübingen. It is free for educational and research purposes, though.
Please read http://www.sfs.uni-tuebingen.de/GermaNet/index.shtml on how to obtain the data.
Unpack & Install the data
You will receive a zipped folder containing a number of XML files. The corpus reader does not work on the XML files directly, but converts them into an SQLite database. Extract the XML files to a folder and remember its path. When you access functionality of the corpus reader for the first time, a file chooser will prompt you to select the folder where you unpacked the XML files.
First Steps
Imports
GermaNet is a corpus reader similar to WordNet and can be imported like this:
>>> from nltk.corpus import germanet
For more compact code, we recommend:
>>> from nltk.corpus import germanet as gn
The object 'germanet' (or 'gn', respectively) is an instance of the class GermaNetCorpusReader.
Encodings
To avoid hassle with encodings, you should set up the corpus reader so it knows how the input you pass to it is encoded and how the output it returns should be encoded:
>>> gn.setInputEncoding('utf-8')
>>> gn.setOutputEncoding('utf-8')
If you do not provide any encoding information, the corpus reader tries to determine appropriate encodings by itself, inspecting STDOUT/STDIN and the locale package. This works well when you use GermaNet from a console; nevertheless, it is strongly recommended to set the input and output encodings manually.
The Germanet Corpus Reader
Germanet
TODO: Some information on the structure of GermaNet
Synsets
A synset is an object which represents a set of synonyms that share a common meaning.
To explore a word, you can use the synsets() method:
>>> gn.synsets('gehen')
[Synset('funktionieren.v.1'),
Synset('funktionieren.v.2'),
Synset('gehen.v.3'),
Synset('gehen.v.4'),
Synset('gehen.v.5'),
Synset('gehen.v.6'),
Synset('handeln.v.1'),
Synset('gehen.v.8'),
Synset('gehen.v.9'),
Synset('gehen.v.10'),
Synset('gehen.v.11'),
Synset('gehen.v.12'),
Synset('gehen.v.13'),
Synset('gehen.v.14'),
Synset('auseinandergehen.v.5')]
Each synset has a unique identifier, like 'funktionieren.v.2'.
If a synset identifier is given, you can use the synset() method
to create a synset object:
>>> funktionieren = gn.synset('funktionieren.v.2')
You can now call further methods on the synset object:
>>> funktionieren.hypernym_paths()
[[Synset('GNROOT.n.1'),
Synset('sein.v.2'),
Synset('stattfinden.v.1'),
Synset('passieren.v.1'),
Synset('funktionieren.v.2')]]
The Synset class, as well as the Lemma class explained below, both derive from the abstract class _GermaNetObject.
Lemmas
Each synset comprises at least one lemma. A lemma is the citation form of a
set of word forms, e.g. "halten" is the citation form of "hält" and "hielt".
To see which lemmas a Synset object contains, type this:
>>> funktionieren.lemmas
[Lemma('funktionieren.v.2.funktionieren'),
Lemma('funktionieren.v.2.funzen'),
Lemma('funktionieren.v.2.gehen'),
Lemma('funktionieren.v.2.laufen'),
Lemma('funktionieren.v.2.arbeiten')]
One lemma can be contained in multiple synsets. To see in which senses one
specific citation form occurs, type:
>>> gn.lemmas('brennen')
[Lemma('brennen.v.1.brennen'),
Lemma('verbrennen.v.1.brennen'),
Lemma('brennen.v.3.brennen'),
Lemma('brennen.v.4.brennen'),
Lemma('brennen.v.5.brennen'),
Lemma('destillieren.v.1.brennen'),
Lemma('brennen.v.7.brennen'),
Lemma('brennen.v.8.brennen')]
For each sense in which a lemma occurs, there is a unique identifier. It
consists of the identifier of the synset and the citation form of the word.
E.g. 'verbrennen.v.1.brennen' consists of the synset identifier 'verbrennen.v.1'
and the citation form 'brennen'. This makes it possible to distinguish lemmas
that share the same citation form: 'verbrennen.v.1.brennen' and
'destillieren.v.1.brennen' specify exactly which sense of 'brennen' is meant.
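Because the citation form is always the last dot-separated component, the two parts can be recovered from an identifier string. The following helper is not part of the corpus reader, just a small illustration:

```python
def split_lemma_id(lemma_id):
    """Split a lemma identifier into (synset identifier, citation form)."""
    # rsplit with maxsplit=1 cuts only at the last dot, so dots inside
    # the synset identifier ('verbrennen.v.1') are preserved.
    synset_id, citation_form = lemma_id.rsplit(".", 1)
    return synset_id, citation_form

print(split_lemma_id("verbrennen.v.1.brennen"))
# → ('verbrennen.v.1', 'brennen')
```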
GermaNet contains additional relations between lemmas: antonyms, pertainyms and
participles.
Similarity
The corpus reader provides several methods to determine the similarity between synsets.
>>> hund = gn.synset('Hund.n.2')
>>> katze = gn.synset('Katze.n.2')
>>> automobil = gn.synset('Automobil.n.1')
Path similarity
``synset1.path_similarity(synset2):``
Return a score denoting how similar two word senses are, based on the
shortest path that connects the senses in the is-a (hypernym/hyponym)
taxonomy. The score is in the range 0 to 1.
>>> hund.path_similarity(katze)
0.33333333333333331
>>> hund.path_similarity(automobil)
0.066666666666666666
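Assuming the same formula as NLTK's WordNet path_similarity, the score is 1/(d + 1), where d is the shortest path length between the two synsets. A standalone sketch reproduces the values above from the distances shown in the shortest_path_distance() examples further down (2 for Hund/Katze, 14 for Hund/Automobil):

```python
def path_similarity_from_distance(distance):
    # Path similarity: 1 / (shortest path length + 1).
    # Identical synsets (distance 0) score 1.0.
    return 1.0 / (distance + 1)

print(path_similarity_from_distance(2))   # Hund/Katze  → 0.333...
print(path_similarity_from_distance(14))  # Hund/Automobil → 0.0666...
```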
Wu-Palmer Similarity
``synset1.wup_similarity(synset2):``
Return a score denoting how similar two word senses are, based on the
depth of the two senses in the taxonomy and that of their Least Common
Subsumer (most specific ancestor node).
>>> hund.wup_similarity(katze)
0.88888888888888884
>>> hund.wup_similarity(automobil)
0.36363636363636365
Synset Relations
Get the path(s) from this synset to the root node, where each path is a
list of the synset nodes traversed on the way to the root:
>>> katze.hypernym_paths()
[[Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2'),
Synset('natürliches Objekt.n.1'), Synset('Lebewesen.n.1'),
Synset('natürliches Lebewesen.n.1'), Synset('höheres Lebewesen.n.1'),
Synset('Tier.n.1'), Synset('Gewebetier.n.1'), Synset('Chordatier.n.1'),
Synset('Wirbeltier.n.1'), Synset('Säugetier.n.1'),
Synset('höherer Säuger.n.1'), Synset('Raubtier.n.1'),
Synset('Landraubtier.n.1'), Synset('katzenartiges Landraubtier.n.1'),
Synset('Katze.n.2')],
[Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2'),
Synset('natürliches Objekt.n.1'), Synset('Lebewesen.n.1'),
Synset('natürliches Lebewesen.n.1'), Synset('höheres Lebewesen.n.1'),
Synset('Tier.n.1'), Synset('Haustier.n.1'), Synset('Katze.n.2')]]
Get the topmost hypernym(s) of this synset in GermaNet. Mostly GNROOT.n.1:
>>> katze.root_hypernyms()
[Synset('GNROOT.n.1')]
Find all common hypernyms of two synsets:
>>> katze.common_hypernyms(hund)
[Synset('Landraubtier.n.1'), Synset('Säugetier.n.1'),
Synset('Haustier.n.1'), Synset('Entität.n.1'),
Synset('Tier.n.1'), Synset('natürliches Objekt.n.1'),
Synset('Chordatier.n.1'), Synset('Gewebetier.n.1'),
Synset('Lebewesen.n.1'), Synset('natürliches Lebewesen.n.1'),
Synset('Wirbeltier.n.1'), Synset('höheres Lebewesen.n.1'),
Synset('höherer Säuger.n.1'), Synset('GNROOT.n.1'),
Synset('Raubtier.n.1'), Synset('Objekt.n.2')]
>>> katze.common_hypernyms(automobil)
[Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2')]
Find the lowest common hypernyms of two synsets:
>>> katze.lowest_common_hypernyms(hund)
[Synset('Landraubtier.n.1')]
>>> katze.lowest_common_hypernyms(automobil)
[Synset('Objekt.n.2')]
Get the path(s) from this synset to the root, counting the distance of each node
from the initial node on the way:
>>> automobil.hypernym_distances()
set([(Synset('Radfahrzeug.n.1'), 1), (Synset('Artefakt.n.1'), 5),
(Synset('Entität.n.1'), 8), (Synset('Fahrzeug.n.1'), 3),
(Synset('Objekt.n.2'), 7), (Synset('Transportmittel.n.1'), 4),
(Synset('Ding.n.1'), 6), (Synset('GNROOT.n.1'), 9),
(Synset('Automobil.n.1'), 0), (Synset('Landfahrzeug.n.1'), 2)])
Find the shortest path distance between two synsets (if one exists):
>>> katze.shortest_path_distance(hund)
2
>>> katze.shortest_path_distance(automobil)
14
Synset Closures
Compute the transitive closure of a synset with respect to a relation, via breadth-first search:
>>> hyper = lambda s: s.hypernyms()
>>> hypo = lambda s: s.hyponyms()
>>> list(katze.closure(hyper))
[Synset('katzenartiges Landraubtier.n.1'), Synset('Haustier.n.1'),
Synset('Landraubtier.n.1'), Synset('Tier.n.1'), Synset('Raubtier.n.1'),
Synset('höheres Lebewesen.n.1'), Synset('höherer Säuger.n.1'),
Synset('natürliches Lebewesen.n.1'), Synset('Säugetier.n.1'),
Synset('Lebewesen.n.1'), Synset('Wirbeltier.n.1'),
Synset('natürliches Objekt.n.1'), Synset('Chordatier.n.1'),
Synset('Objekt.n.2'), Synset('Gewebetier.n.1'), Synset('Entität.n.1'),
Synset('GNROOT.n.1')]
>>> list(katze.closure(hyper, depth=1)) == katze.hypernyms()
True
>>> list(katze.closure(hypo, depth=1)) == katze.hyponyms()
True
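The breadth-first search behind closure() can be illustrated with a plain-Python sketch over a hypothetical toy hypernym graph (the real reader walks the GermaNet database; function and parameter names here are illustrative):

```python
from collections import deque

def closure(node, relation, depth=-1):
    """Breadth-first transitive closure of `node` under `relation`.

    `relation` maps a node to its direct neighbours (e.g. hypernyms);
    `depth` limits how many levels are explored (-1 = unlimited),
    mirroring the depth argument shown above.
    """
    seen = {node}
    queue = deque((n, 1) for n in relation(node))
    while queue:
        current, level = queue.popleft()
        if current in seen or (depth != -1 and level > depth):
            continue
        seen.add(current)
        yield current
        queue.extend((n, level + 1) for n in relation(current))

# Toy hypernym chain: Katze -> Haustier -> Tier -> GNROOT
hypernyms = {"Katze": ["Haustier"], "Haustier": ["Tier"],
             "Tier": ["GNROOT"], "GNROOT": []}
print(list(closure("Katze", hypernyms.__getitem__)))           # full closure
print(list(closure("Katze", hypernyms.__getitem__, depth=1)))  # direct hypernyms only
```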
German Wortschatz Lemmatizer
The German Wortschatz Lemmatizer gives you the ability to find the lemma (or: citation form, canonical form, dictionary form) of a given word. It is based on data provided by Projekt Deutscher Wortschatz.
Installation
Some manual steps
- The following files have to be copied to the folder ‘Lib\site-packages\nltk\stem’ in your Python installation path.
- Open the file ‘__init__.py’ in the same folder and add the following import:
  from GermanWortschatzLemmatizer import *
When you access the lemmatizer for the first time, you will be prompted to choose the file ‘baseforms_by_projekt_deutscher_wortschatz.txt’. A script will then be run that creates a database from it.
Using the Lemmatizer
The German Wortschatz Lemmatizer can be imported like this:
>>> from nltk.stem import GermanWortschatzLemmatizer
For more compact code, we recommend:
>>> from nltk.stem import GermanWortschatzLemmatizer as gwl
Using the lemmatizer is straightforward: simply call the lemmatize() function with the word you would like to lemmatize:
>>> gwl.lemmatize('geht')
'gehen'
If the lemmatizer can’t find a lemma for the word you provided, it returns the word unchanged:
>>> gwl.lemmatize('UnsinnigesWort')
'UnsinnigesWort'
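This fall-back contract is easy to emulate. The following sketch uses a hypothetical miniature lookup table standing in for the Wortschatz database, just to show the behaviour described above:

```python
# Hypothetical toy lookup table (the real lemmatizer queries a database
# built from the Wortschatz base-form file).
BASEFORMS = {"geht": "gehen", "hielt": "halten", "hält": "halten"}

def lemmatize(word):
    # Return the base form if known, otherwise the word unchanged.
    return BASEFORMS.get(word, word)

print(lemmatize("geht"))            # → 'gehen'
print(lemmatize("UnsinnigesWort"))  # → 'UnsinnigesWort'
```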