To install.

apt-get install python-nltk

To use.

import nltk


Before doing anything too interesting you need to install some data files.

$ python
>>> import nltk
>>> nltk.download()

This creates a GUI file thingy that allows you to download the parts you want. I just did this and let it go crazy.


Note that it puts all the goop it acquires in ~/nltk_data/. After struggling to make and download my own concordances independently, I find this really worth the price of admission. Seriously, there are a lot of big files that this grabs. Some are probably not critical (grammars/, for example), but it does seem like a good place to start.
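Since the download can get large, here is a minimal sketch for seeing which parts of ~/nltk_data/ take the most room. It uses only the standard library; the dir_sizes helper is my own hypothetical name, not part of NLTK.

```python
import os

def dir_sizes(root):
    """Return {name: total bytes} for each top-level directory under root."""
    sizes = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if not os.path.isdir(path):
            continue  # Skip loose files at the top level.
        total = 0
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if os.path.isfile(full):
                    total += os.path.getsize(full)
        sizes[entry] = total
    return sizes

data_dir = os.path.expanduser("~/nltk_data")  # Location named in these notes.
if os.path.isdir(data_dir):
    # Largest directories first.
    for name, size in sorted(dir_sizes(data_dir).items(), key=lambda kv: -kv[1]):
        print("%-12s %8.1f MB" % (name, size / 1e6))
```

If the directory does not exist yet, the script prints nothing.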

An example of the kind of thing that the download step will acquire is Panlex, a cross-referenced database of all words in all languages. Note that it’s 2 GB, so plan this download accordingly. See:

Bizarre Bonus Functionality


Sample Corpora

The nltk data provides some texts to play with. I don’t know how useful they are for serious work but they do help with pedagogy.

To access books do this.

In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [2]: text6
Out[2]: <Text: Monty Python and the Holy Grail>

The text1 through text9 objects are instances of nltk.text.Text.

Here’s how to get specific titles in the various corpora.

mov= nltk.Text(nltk.corpus.shakespeare.words('merchant.xml'))

Some organizational functions.

nltk.corpus.brown.root      # Where in the real file system it lives.

Loading Custom Corpora

corpus_root= '/path/to/corpus'  # Hypothetical; use the directory holding your files.
myfiles= ['xed13','xed14','xed15','xed16']
xedcorp= nltk.corpus.PlaintextCorpusReader(corpus_root,myfiles)

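The myfiles argument can also be a regular expression such as r'.*\.txt'. Note that this is a regular expression, not a shell glob, so '*.txt' will not work. The r at the front makes it a raw string, which keeps Python from interpreting the backslash before the regular expression engine sees it.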
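To check a file pattern before handing it to the corpus reader, a small standard-library sketch like the following can help. It assumes (my assumption, not confirmed in these notes) that the pattern has to match the whole file name; the file names and the matching helper are made up for illustration. Python 3.

```python
import re

# Hypothetical directory listing.
filenames = ['xed13', 'xed14', 'notes.txt', 'draft.txt', 'readme.md']

def matching(pattern, names):
    """Return the names that the regular expression matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]

txt_files = matching(r'.*\.txt', filenames)
xed_files = matching(r'xed1[34]', filenames)
print(txt_files)
print(xed_files)

# A shell-style glob is not a valid regular expression:
# the leading '*' has nothing to repeat.
try:
    re.compile('*.txt')
    glob_is_valid = True
except re.error as err:
    glob_is_valid = False
    print('bad pattern:', err)
```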

Note that xedcorp and text6 are different kinds of objects. Both of the following are true.

  • type(text6) == nltk.text.Text

  • type(xedcorp) == nltk.corpus.reader.plaintext.PlaintextCorpusReader

Handy functions

Here are some methods that are defined for nltk.text.Text.

  • text6.concordance('swallow')

  • text6.collocations()

  • text6.dispersion_plot(["coconut","swallow","shrubbery"])

  • text6.plot

  • text6.unicode_repr

  • text6.common_contexts(['grail','shrubbery'])

  • text6.findall

  • text6.readability

  • text6.vocab

  • text6.index

  • text6.similar('swallow')

  • text6.count


  • text6.tokens

  • len(text6)

  • nltk.FreqDist(text6)

  • nltk.FreqDist(text6).most_common(20)

  • wordlist=sorted(set(text6))
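The last few items in the list (length, frequency counts, sorted vocabulary) can be tried on any list of tokens without the downloaded texts. This sketch uses collections.Counter from the standard library as a stand-in; as far as I know, FreqDist in NLTK 3 builds on Counter, so most_common behaves the same way. The token list is made up.

```python
from collections import Counter

# A made-up token list standing in for text6.tokens.
tokens = "we want a shrubbery ! a nice shrubbery , and not too expensive".split()

freq = Counter(tokens)           # Stand-in for nltk.FreqDist(tokens).
top_two = freq.most_common(2)    # The two most frequent tokens with counts.
wordlist = sorted(set(tokens))   # Sorted list of distinct tokens.

print(len(tokens))               # Total number of tokens.
print(top_two)
print(wordlist)
```

Punctuation counts as a token here, just as it does in the NLTK texts.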


Tokenizing is the process of taking a long string of stuff and converting it into a list of words. It sounds easy, but it is not.

The default is to tokenize by word.

text='''I was born in a crossfire hurricane and I howled at my Ma in
the driving rain but it's all right now, in fact, it's a gas.'''
tokens= nltk.word_tokenize(text)
print('%s %s'%(tokens[6],tokens[-2])) # Produces 'hurricane gas'.
tagged= nltk.pos_tag(tokens) # Tags words with part of speech.
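To see why tokenizing is harder than it sounds, compare two naive standard-library approaches on part of the same text. This is only a sketch of the problem, not what nltk.word_tokenize does internally.

```python
import re

text = "it's all right now, in fact, it's a gas."

# Splitting on whitespace leaves punctuation glued to the words.
naive = text.split()
print(naive)

# Pulling out runs of word characters and single punctuation marks
# separates the punctuation, but also chops "it's" into three pieces.
pieces = re.findall(r"\w+|[^\w\s]", text)
print(pieces)
```

Neither result is quite what you want: the first keeps 'gas.' as one token, and the second splits the contraction at the apostrophe. Handling cases like these is the job of a real tokenizer.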

Tokenizing by sentence

This messy thing seems to tokenize by sentence, i.e. it will give you a list of all the sentences in a document.

nltk.regexp_tokenize(text, pattern=r'\.(?:\s+|$)', gaps=True)

Or here’s a slightly more effective version.

nltk.regexp_tokenize(text, pattern=r'[.?!](?:\s+|$)', gaps=True)
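The same idea can be tried with only the standard library, which makes it easy to see where a regular expression approach falls down. This is a rough sketch: split_sentences is my own hypothetical helper, and it treats the pattern as the gap between sentences, dropping empty pieces.

```python
import re

def split_sentences(text):
    """Split on '.', '?' or '!' followed by whitespace or end of text."""
    pattern = r'[.?!](?:\s+|$)'
    # The matched punctuation and whitespace are the gaps; drop empty leftovers.
    return [s for s in re.split(pattern, text) if s]

doc = "It was late. Was it over? Yes! We left."
print(split_sentences(doc))
# ['It was late', 'Was it over', 'Yes', 'We left']

# Abbreviations fool it, since "Dr." looks like the end of a sentence.
print(split_sentences("Dr. Smith arrived."))
# ['Dr', 'Smith arrived']
```

For serious work, NLTK also ships nltk.sent_tokenize, which relies on the downloaded punkt data rather than on a single pattern.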