Resources
- Main website - http://www.nltk.org/
- List of all data provided by nltk.download() - Note that this is an annoying XML file and might show nothing in a browser.
- The NLTK Book - Pretty helpful!
Installation
To install.
apt-get install python-nltk
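Untested by me here, but pip should also do the trick if you want something newer than the Debian package.
pip install nltk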
To use.
import nltk
Data
Before doing anything too interesting you need to install some data files.
$ python
>>> import nltk
>>> nltk.download()
This creates a GUI file thingy that allows you to download the parts you want. I just did this and let it go crazy.
>>> nltk.download('all')
Note that it puts all the goop it acquires in ~/nltk_data/. After struggling to make and download my own concordances independently, this is really worth the price of admission. Seriously, there are a lot of big files that this grabs. Some are probably not critical (grammars/basque_grammars.zip) but it does seem like a good place to start.
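If you don't want everything, you can grab individual packages by name (these identifiers come from the standard download list) and, if you like, point them somewhere other than ~/nltk_data/. If you use a non-standard directory you also have to add it to nltk.data.path so NLTK can find things later.
>>> nltk.download('punkt')     # Sentence tokenizer models.
>>> nltk.download('brown')     # The Brown corpus.
>>> nltk.download('book', download_dir='/tmp/nltk_data')   # Everything the NLTK Book uses.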
An example of the kinds of things that the download step will acquire is Panlex, a cross-referenced database of all words in all languages. Note that it’s 2GB, so plan this download accordingly. See: https://dev.panlex.org/db/
Bizarre Bonus Functionality
- nltk.chat.chatbots()
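That drops you into an interactive menu of toy chatbots (Eliza and friends). Something like this should work, though treat the exact names as a best guess from memory.
import nltk.chat
nltk.chat.chatbots()      # Menu; pick a bot and type at it.
nltk.chat.eliza_chat()    # Or jump straight to Eliza.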
Sample Corpora
The nltk data provides some texts to play with. I don’t know how useful they are for serious work but they do help with pedagogy.
To access books do this.
In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
In [2]: text6
Out[2]: <Text: Monty Python and the Holy Grail>
These text[1-9] objects are nltk.text.Text types.
Here’s how to get specific titles in the various corpora.
mov= nltk.Text(nltk.corpus.shakespeare.words('merchant.xml'))
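The Gutenberg corpus works the same way; this fileid is one of the standard ones that ships with the data (check nltk.corpus.gutenberg.fileids() if in doubt).
emma= nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprise")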
Some organizational functions.
nltk.corpus.brown.categories()
nltk.corpus.brown.fileids()
nltk.corpus.brown.root # Where in the real file system it lives.
nltk.corpus.brown.abspath(fileid)
nltk.corpus.brown.raw()[:500]
nltk.corpus.brown.readme()
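The Brown corpus is categorized, so the category names that categories() reports can be used to pull out just one slice of it.
nltk.corpus.brown.words(categories='news')[:10]
nltk.corpus.brown.sents(categories='news')[0]
len(nltk.corpus.brown.words(categories='news'))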
Loading Custom Corpora
corpus_root='/home/xed/X/P/nltk/blogcorpus/'
myfiles= ['xed13','xed14','xed15','xed16']
xedcorp= nltk.corpus.PlaintextCorpusReader(corpus_root,myfiles)
xedcorp.fileids()
The myfiles argument can also be a pattern like r'.*\.txt', which is treated as a regular expression (the r at the front just makes it a raw string so the backslash isn't mangled).
Note that xedcorp and text6 are different kinds of objects. These are true.
- type(text6) == nltk.text.Text
- type(xedcorp) == nltk.corpus.reader.plaintext.PlaintextCorpusReader
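If you want the Text conveniences (concordance and friends) on your own corpus, wrapping the reader's word list in nltk.Text seems to be the way; the file ids here are my made-up ones from above.
xedtext= nltk.Text(xedcorp.words())           # All the files mashed together.
xedtext13= nltk.Text(xedcorp.words('xed13'))  # Or just one of them.
xedtext.concordance('python')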
Handy functions
Here are some methods that are defined for nltk.text.Text.
- text6.concordance("swallow")
- text6.collocations()
- text6.dispersion_plot(["coconut","swallow","shrubbery"])
- text6.plot
- text6.unicode_repr
- text6.common_contexts(["grail","shrubber"])
- text6.findall
- text6.readability
- text6.vocab
- text6.concordance
- text6.index
- text6.similar("swallow")
- text6.count
- text6.name
- text6.tokens
- len(text6)
- nltk.FreqDist(text6)
- nltk.FreqDist(text6).most_common(20)
- wordlist=sorted(set(text6))
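Putting a couple of those together, the frequency distribution is probably the most immediately useful. This is the normal FreqDist interface.
fd= nltk.FreqDist(text6)
fd.most_common(20)    # The 20 most frequent tokens.
fd['swallow']         # How many times 'swallow' shows up.
fd.plot(20,cumulative=True)   # Needs matplotlib.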
Tokenizing
Tokenizing is the process of taking a long string of stuff and converting it into a list of words. It sounds easy, but it is not.
The default is to tokenize by word.
text='''I was born in a crossfire hurricane and I howled at my Ma in the driving rain but it's all right now, in fact, it's a gas.'''
tokens= nltk.word_tokenize(text)
print '%s %s'%(tokens[6],tokens[-2]) # Produces 'hurricane gas'.
tagged= nltk.pos_tag(tokens) # Tags words with part of speech.
Tokenizing by sentence
This messy thing seems to tokenize by sentence, i.e. it will give you a list of all the sentences in a document.
from nltk.tokenize import regexp_tokenize
regexp_tokenize(text, pattern=r'\.(\s+|$)', gaps=True)
Or here’s a slightly more effective version.
regexp_tokenize(text, pattern=r'[.?!](\s+|$)', gaps=True)
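The built-in alternative is nltk.sent_tokenize, which uses the trained Punkt model (so it needs the punkt data from the download step) and copes better with abbreviations.
nltk.sent_tokenize("Dr. Smith went to Washington. It rained. The end.")
# Should give something like ['Dr. Smith went to Washington.', 'It rained.', 'The end.']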