
GermaNLTK

An Introduction to German NLTK Features

Philipp Nahratow

Martin Gäbler

Stefan Reinhardt

Raphael Brand

Leon Schröder

v0.01

GermaNLTK is an integration of GermaNet and Projekt Deutscher Wortschatz into NLTK. GermaNet is a semantically-oriented dictionary of German, similar to WordNet. 

GermaNLTK started as a student project at Hochschule der Medien Stuttgart (http://www.hdm-stuttgart.de/).

Legal hints

All source code provided by the GermaNLTK project is open source (GPL). The data it relies on belongs to its respective owners.

The GermaNet Corpus Reader is based on a proprietary corpus owned by Universität Tübingen. We are not allowed to distribute the files ourselves, so please see http://www.sfs.uni-tuebingen.de/GermaNet/index.shtml for information on how to obtain the corpus files.

The German Baseform Lemmatizer is based on a database provided by Projekt Deutscher Wortschatz. 

THE DATABASE IS FREE FOR EDUCATIONAL AND RESEARCH PURPOSES BUT NOT FOR COMMERCIAL USE. FOR MORE INFORMATION VISIT: http://wortschatz.uni-leipzig.de/

GermaNet Corpus Reader

Installation

Some manual steps

Until the GermaNLTK features are fully integrated into NLTK, you need to perform the following steps manually:

  1. Locate your Python installation folder.

  2. Navigate to “\Lib\site-packages\nltk\corpus”.

  3. Open the file “__init__.py” in this folder.

    • You will see a list of LazyCorpusLoader instances being created.

    • In this list, insert the following line:

germanet = LazyCorpusLoader('germanet', GermaNetCorpusReader)

  4. Close the file and navigate to the folder “reader”.

  5. Open the file “__init__.py” in the “reader” folder.

    • Add the following import:

from nltk.corpus.reader.germanet import *

    • Within the list “__all__”, add the following item:

'GermaNetCorpusReader'

  6. Copy the following files to the “reader” folder:

    • germanet.py

    • GermanetDBBuilder.py
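After completing these steps, the relevant excerpts of the two modified files should look roughly like the following sketch. The placeholder comments stand for the surrounding entries, which differ between NLTK versions:

    # Lib\site-packages\nltk\corpus\__init__.py (excerpt)
    # ... other LazyCorpusLoader entries ...
    germanet = LazyCorpusLoader('germanet', GermaNetCorpusReader)

    # Lib\site-packages\nltk\corpus\reader\__init__.py (excerpt)
    from nltk.corpus.reader.germanet import *

    __all__ = [
        # ... existing reader class names ...
        'GermaNetCorpusReader',
    ]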

Get the data

Although the code included in NLTK to access GermaNet is open source, the database itself is proprietary to Universität Tübingen. It is free for educational and research purposes, though.

Please see http://www.sfs.uni-tuebingen.de/GermaNet/index.shtml for information on how to obtain the data.

Unpack & Install the data

You will receive a zipped folder containing a number of XML files. The corpus reader does not work on the XML files directly, but converts them to an SQLite database. Extract the XML files to a folder and remember its path. When you access functionality of the corpus reader for the first time, a file chooser will prompt you to select the folder where you unpacked the XML files.

First Steps

Imports

GermaNet is a corpus reader similar to WordNet and can be imported like this:

>>> from nltk.corpus import germanet

For more compact code, we recommend:

>>> from nltk.corpus import germanet as gn

The object 'germanet' (or 'gn', respectively) is an instance of the class GermaNetCorpusReader.

Encodings

To avoid hassle with encodings, you should set up the corpus reader so that it knows how the input you pass to it is encoded and how the output it returns should be encoded:

    

    >>> gn.setInputEncoding('utf-8')

    >>> gn.setOutputEncoding('utf-8')

    

If you don't provide any information about encodings, the corpus reader will try to determine appropriate encodings by itself by inspecting STDIN/STDOUT and the locale package. This works well when you use GermaNet from a console; nevertheless, it is strongly recommended to set the input and output encodings manually.
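As a minimal sketch, a typical script that queries GermaNet with non-ASCII input might therefore start like this (the query word is only an illustrative example):

    # -*- coding: utf-8 -*-
    from nltk.corpus import germanet as gn

    # Declare how the strings we pass in are encoded and how returned
    # strings should be encoded, instead of relying on auto-detection.
    gn.setInputEncoding('utf-8')
    gn.setOutputEncoding('utf-8')

    # Words containing umlauts or ß can now be passed safely.
    synsets = gn.synsets('Fußball')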

The GermaNet Corpus Reader

GermaNet

GermaNet groups German nouns, verbs and adjectives into sets of synonyms (synsets) and links these synsets through conceptual relations such as hypernymy and hyponymy, much like the Princeton WordNet does for English.

Synsets

A synset is an object which represents a set of synonyms that share a common meaning.

To look up the senses of a word, use the synsets() method:

    >>> gn.synsets('gehen')

    [Synset('funktionieren.v.1'),

     Synset('funktionieren.v.2'),

     Synset('gehen.v.3'),

     Synset('gehen.v.4'),

     Synset('gehen.v.5'),

     Synset('gehen.v.6'),

     Synset('handeln.v.1'),

     Synset('gehen.v.8'),

     Synset('gehen.v.9'),

     Synset('gehen.v.10'),

     Synset('gehen.v.11'),

     Synset('gehen.v.12'),

     Synset('gehen.v.13'),

     Synset('gehen.v.14'),

     Synset('auseinandergehen.v.5')]

Each synset has a unique identifier, like 'funktionieren.v.2'. If a synset identifier is given, you can use the synset() method to create a synset object:

    >>> funktionieren = gn.synset('funktionieren.v.2')

    

You can now call further methods on the synset object:

    >>> funktionieren.hypernym_paths() 

    [[Synset('GNROOT.n.1'),

    Synset('sein.v.2'),

    Synset('stattfinden.v.1'),

    Synset('passieren.v.1'),

    Synset('funktionieren.v.2')]]
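The element just before the synset itself in each path is a direct hypernym. You can also query the direct hypernyms without walking the full path; a sketch, with the expected result read off the single path above:

    >>> funktionieren.hypernyms()
    [Synset('passieren.v.1')]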

    

Both the Synset class and the Lemma class (explained below) derive from the abstract class _GermaNetObject.

Lemmas

Each synset comprises at least one Lemma. A lemma is the citation form of a set of word forms, e.g. "halten" is the citation form of "hält" and "hielt".

To see which lemmas a Synset object contains, type this:

    >>> funktionieren.lemmas

    [Lemma('funktionieren.v.2.funktionieren'),

    Lemma('funktionieren.v.2.funzen'),

    Lemma('funktionieren.v.2.gehen'),

    Lemma('funktionieren.v.2.laufen'),

    Lemma('funktionieren.v.2.arbeiten')]

    

One lemma can be contained in multiple synsets. To see in which senses one specific citation form occurs, type:

    >>> gn.lemmas('brennen')

    [Lemma('brennen.v.1.brennen'),

    Lemma('verbrennen.v.1.brennen'),

    Lemma('brennen.v.3.brennen'),

    Lemma('brennen.v.4.brennen'),

    Lemma('brennen.v.5.brennen'),

    Lemma('destillieren.v.1.brennen'),

    Lemma('brennen.v.7.brennen'),

    Lemma('brennen.v.8.brennen')]

    

For each sense in which a lemma occurs, there is a unique identifier. It consists of the identifier of the synset and the citation form of the word, e.g. 'verbrennen.v.1.brennen' consists of the synset identifier 'verbrennen.v.1' and the citation form 'brennen'. This makes it possible to distinguish lemmas that share the same citation form: 'verbrennen.v.1.brennen' and 'destillieren.v.1.brennen' specify exactly which sense of 'brennen' is meant.
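Because a lemma identifier is just the synset identifier plus the citation form, you can split it to get back to the underlying synset. A small sketch (the variable names are only illustrative):

    >>> lemma_id = 'verbrennen.v.1.brennen'
    >>> synset_id, citation_form = lemma_id.rsplit('.', 1)
    >>> gn.synset(synset_id)
    Synset('verbrennen.v.1')
    >>> citation_form
    'brennen'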

GermaNet contains additional relations between Lemmas: antonyms, pertonyms and participles.

Similarity

The GermaNet corpus reader provides several methods to determine the similarity between two synsets.

    >>> hund = gn.synset('Hund.n.2')

    >>> katze = gn.synset('Katze.n.2')

    >>> automobil = gn.synset('Automobil.n.1')

    

Path similarity

``synset1.path_similarity(synset2):``

Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

    

    >>> hund.path_similarity(katze) 

    0.33333333333333331

    >>> hund.path_similarity(automobil)

    0.066666666666666666
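In NLTK's WordNet reader this score is defined as 1 / (shortest_path_distance + 1). Assuming the GermaNet reader uses the same formula (the values above and the distances shown further below are consistent with it), the first score can be reproduced by hand:

    >>> 1.0 / (katze.shortest_path_distance(hund) + 1)
    0.33333333333333331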

    

Wu-Palmer Similarity

``synset1.wup_similarity(synset2):``

Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

    

    >>> hund.wup_similarity(katze) 

    0.88888888888888884

    >>> hund.wup_similarity(automobil)

    0.36363636363636365

Synset Relations

Get the path(s) from this synset to the root node, where each path is a list of the synset nodes traversed on the way to the root:

    >>> katze.hypernym_paths()

    [[Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2'),

    Synset('natürliches Objekt.n.1'), Synset('Lebewesen.n.1'),

    Synset('natürliches Lebewesen.n.1'), Synset('höheres Lebewesen.n.1'),

    Synset('Tier.n.1'), Synset('Gewebetier.n.1'), Synset('Chordatier.n.1'),

    Synset('Wirbeltier.n.1'), Synset('Säugetier.n.1'),

    Synset('höherer Säuger.n.1'), Synset('Raubtier.n.1'),

    Synset('Landraubtier.n.1'), Synset('katzenartiges Landraubtier.n.1'),

    Synset('Katze.n.2')],

    [Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2'),

    Synset('natürliches Objekt.n.1'), Synset('Lebewesen.n.1'),

    Synset('natürliches Lebewesen.n.1'), Synset('höheres Lebewesen.n.1'),

    Synset('Tier.n.1'), Synset('Haustier.n.1'), Synset('Katze.n.2')]]

Get the topmost hypernym(s) of this synset in GermaNet. This is usually GNROOT.n.1:

    >>> katze.root_hypernyms()

    [Synset('GNROOT.n.1')]

    

Find all common hypernyms of two synsets:

    >>> katze.common_hypernyms(hund)

    [Synset('Landraubtier.n.1'), Synset('Säugetier.n.1'),

    Synset('Haustier.n.1'), Synset('Entität.n.1'),

    Synset('Tier.n.1'), Synset('natürliches Objekt.n.1'),

    Synset('Chordatier.n.1'), Synset('Gewebetier.n.1'),

    Synset('Lebewesen.n.1'), Synset('natürliches Lebewesen.n.1'),

    Synset('Wirbeltier.n.1'), Synset('höheres Lebewesen.n.1'),

    Synset('höherer Säuger.n.1'), Synset('GNROOT.n.1'),

    Synset('Raubtier.n.1'), Synset('Objekt.n.2')]

    

    >>> katze.common_hypernyms(automobil)

    [Synset('GNROOT.n.1'), Synset('Entität.n.1'), Synset('Objekt.n.2')]

Find the lowest common hypernyms of two synsets:

    

    >>> katze.lowest_common_hypernyms(hund)

    [Synset('Landraubtier.n.1')]

    

    >>> katze.lowest_common_hypernyms(automobil)

    [Synset('Objekt.n.2')]

    

Get the synsets on the path(s) from this synset to the root, together with the distance of each from the initial synset:

    >>> automobil.hypernym_distances()

    set([(Synset('Radfahrzeug.n.1'), 1), (Synset('Artefakt.n.1'), 5),

    (Synset('Entität.n.1'), 8), (Synset('Fahrzeug.n.1'), 3),

    (Synset('Objekt.n.2'), 7), (Synset('Transportmittel.n.1'), 4),

    (Synset('Ding.n.1'), 6), (Synset('GNROOT.n.1'), 9),

    (Synset('Automobil.n.1'), 0), (Synset('Landfahrzeug.n.1'), 2)])
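In this example the largest distance, 9, is the number of steps from 'Automobil.n.1' up to GNROOT. A small sketch extracting it from the result above:

    >>> max(distance for synset, distance in automobil.hypernym_distances())
    9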

Find the shortest path distance between two synsets (if one exists):

    >>> katze.shortest_path_distance(hund)

    2

    

    >>> katze.shortest_path_distance(automobil)

    14

Synset Closures

Compute the transitive closure of a synset with respect to a relation, traversed breadth-first:

    >>> hyper = lambda s: s.hypernyms()

    >>> hypo = lambda s: s.hyponyms()

    

    >>> list(katze.closure(hyper))

    [Synset('katzenartiges Landraubtier.n.1'), Synset('Haustier.n.1'),

    Synset('Landraubtier.n.1'), Synset('Tier.n.1'), Synset('Raubtier.n.1'),

    Synset('höheres Lebewesen.n.1'), Synset('höherer Säuger.n.1'),

    Synset('natürliches Lebewesen.n.1'), Synset('Säugetier.n.1'),

    Synset('Lebewesen.n.1'), Synset('Wirbeltier.n.1'),

    Synset('natürliches Objekt.n.1'), Synset('Chordatier.n.1'),

    Synset('Objekt.n.2'), Synset('Gewebetier.n.1'), Synset('Entität.n.1'),

    Synset('GNROOT.n.1')]

    >>> list(katze.closure(hyper, depth=1)) == katze.hypernyms()

    True

    

    >>> list(katze.closure(hypo, depth=1)) == katze.hyponyms()

    True
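Analogously, the hyponym closure collects the whole subtree below a synset. A short sketch; since 'Katze.n.2' was shown above to be a direct hyponym of 'Haustier.n.1', the membership test should succeed:

    >>> haustier = gn.synset('Haustier.n.1')
    >>> alle_haustiere = list(haustier.closure(hypo))
    >>> katze in alle_haustiere
    True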

German Wortschatz Lemmatizer

The German Wortschatz Lemmatizer gives you the ability to find the lemma (or: citation form, canonical form, dictionary form) of a given word. It is based on data provided by Projekt Deutscher Wortschatz.

Installation

Some manual steps

  1. Copy the lemmatizer source files to the folder “\Lib\site-packages\nltk\stem” in your Python installation path.

  2. Open the file “__init__.py” in the same folder and add the following import:

    from GermanWortschatzLemmatizer import *

  3. When you access the lemmatizer for the first time, you will be prompted to choose the file “baseforms_by_projekt_deutscher_wortschatz.txt”. A script will then be run which creates a database.

Using the Lemmatizer

The German Wortschatz Lemmatizer can be imported like this:

    >>> from nltk.stem import GermanWortschatzLemmatizer

For more compact code, we recommend:

    >>> from nltk.stem import GermanWortschatzLemmatizer as gwl

Using the lemmatizer is simple: call the lemmatize() method with the word you would like to lemmatize:

    >>> gwl.lemmatize('geht')

    'gehen'

If the lemmatizer can't find a lemma for the word you provided, it returns the word unchanged:

    >>> gwl.lemmatize('UnsinnigesWort')

    'UnsinnigesWort'
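As a final sketch, the lemmatizer can be applied token by token to a naively tokenized sentence. The example sentence and its lemmas are illustrative assumptions, not output taken from the database:

    # Lemmatize every token of a whitespace-tokenized sentence.
    satz = 'Die Hunde gingen nach Hause'
    lemmata = [gwl.lemmatize(wort) for wort in satz.split()]
    # Tokens known to the database are replaced by their base forms;
    # unknown tokens are returned unchanged (see above).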
