Resources
- Main website - http://www.nltk.org/
- List of all data provided by nltk.download() - Note that this is an annoying XML file and might show nothing in a browser.
- The NLTK Book - Pretty helpful!
Installation
To install.
apt-get install python-nltk
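Untested by me here, but pip should also do the trick if you want something newer than the Debian package.
pip install nltk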
To use.
import nltk
Data
Before doing anything too interesting you need to install some data files.
$ python
>>> import nltk
>>> nltk.download()
This creates a GUI file thingy that allows you to download the parts you want. I just did this and let it go crazy.
>>> nltk.download('all')
Note that it puts all the goop it acquires in ~/nltk_data/. After struggling to make and download my own concordances independently, this is really worth the price of admission. Seriously, there are a lot of big files that this grabs. Some are probably not critical (grammars/basque_grammars.zip) but it does seem like a good place to start.
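If you don't want everything, you can grab individual packages by name (these identifiers come from the standard download list) and, if you like, point them somewhere other than ~/nltk_data/. If you use a non-standard directory you also have to add it to nltk.data.path so NLTK can find things later.
>>> nltk.download('punkt')     # Sentence tokenizer models.
>>> nltk.download('brown')     # The Brown corpus.
>>> nltk.download('book', download_dir='/tmp/nltk_data')   # Everything the NLTK Book uses.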
An example of the kinds of things that the download step will acquire is Panlex, a cross-referenced database of all words in all languages. Note that it’s 2GB, so plan this download accordingly. See: https://dev.panlex.org/db/
Bizarre Bonus Functionality
- nltk.chat.chatbots()
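That drops you into an interactive menu of toy chatbots (Eliza and friends). Something like this should work, though treat the exact names as a best guess from memory.
import nltk.chat
nltk.chat.chatbots()      # Menu; pick a bot and type at it.
nltk.chat.eliza_chat()    # Or jump straight to Eliza.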
Sample Corpora
The nltk data provides some texts to play with. I don’t know how useful they are for serious work but they do help with pedagogy.
To access books do this.
In [1]: from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
In [2]: text6
Out[2]: <Text: Monty Python and the Holy Grail>
These text[1-9] objects are nltk.text.Text types.
Here’s how to get specific titles in the various corpora.
mov= nltk.Text(nltk.corpus.shakespeare.words('merchant.xml'))
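The Gutenberg corpus works the same way; this fileid is one of the standard ones that ships with the data (check nltk.corpus.gutenberg.fileids() if in doubt).
emma= nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprise")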
Some organizational functions.
nltk.corpus.brown.categories()
nltk.corpus.brown.fileids()
nltk.corpus.brown.root # Where in the real file system it lives.
nltk.corpus.brown.abspath(fileid)
nltk.corpus.brown.raw()[:500]
nltk.corpus.brown.readme()
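The Brown corpus is categorized, so the category names that categories() reports can be used to pull out just one slice of it.
nltk.corpus.brown.words(categories='news')[:10]
nltk.corpus.brown.sents(categories='news')[0]
len(nltk.corpus.brown.words(categories='news'))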
Loading Custom Corpora
corpus_root='/home/xed/X/P/nltk/blogcorpus/'
myfiles= ['xed13','xed14','xed15','xed16']
xedcorp= nltk.corpus.PlaintextCorpusReader(corpus_root,myfiles)
xedcorp.fileids()
The myfiles argument can also be a pattern like r'.*\.txt', which is treated as a regular expression (the r at the front just makes it a raw string so the backslash isn't mangled).
Note that xedcorp and text6 are different kinds of objects. These are true.
- type(text6) == nltk.text.Text
- type(xedcorp) == nltk.corpus.reader.plaintext.PlaintextCorpusReader
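If you want the Text conveniences (concordance and friends) on your own corpus, wrapping the reader's word list in nltk.Text seems to be the way; the file ids here are my made-up ones from above.
xedtext= nltk.Text(xedcorp.words())           # All the files mashed together.
xedtext13= nltk.Text(xedcorp.words('xed13'))  # Or just one of them.
xedtext.concordance('python')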
Handy functions
Here are some methods that are defined for nltk.text.Text.
- text6.concordance("swallow")
- text6.collocations()
- text6.dispersion_plot(["coconut","swallow","shrubbery"])
- text6.plot
- text6.unicode_repr
- text6.common_contexts(["grail","shrubber"])
- text6.findall
- text6.readability
- text6.vocab
- text6.concordance
- text6.index
- text6.similar("swallow")
- text6.count
- text6.name
- text6.tokens
- len(text6)
- nltk.FreqDist(text6)
- nltk.FreqDist(text6).most_common(20)
- wordlist=sorted(set(text6))
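Putting a couple of those together, the frequency distribution is probably the most immediately useful. This is the normal FreqDist interface.
fd= nltk.FreqDist(text6)
fd.most_common(20)    # The 20 most frequent tokens.
fd['swallow']         # How many times 'swallow' shows up.
fd.plot(20,cumulative=True)   # Needs matplotlib.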
Tokenizing
Tokenizing is the process of taking a long string of stuff and converting it into a list of words. It sounds easy, but it is not.
The default is to tokenize by word.
text='''I was born in a crossfire hurricane and I howled at my Ma in the driving rain but it's all right now, in fact, it's a gas.'''
tokens= nltk.word_tokenize(text)
print '%s %s'%(tokens[6],tokens[-2]) # Produces 'hurricane gas'.
tagged= nltk.pos_tag(tokens) # Tags words with part of speech.
Tokenizing by sentence
This messy thing seems to tokenize by sentence, i.e. it will give you a list of all the sentences in a document.
from nltk.tokenize import regexp_tokenize
regexp_tokenize(text, pattern=r'\.(\s+|$)', gaps=True)
Or here’s a slightly more effective version.
regexp_tokenize(text, pattern=r'[.?!](\s+|$)', gaps=True)
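The built-in alternative is nltk.sent_tokenize, which uses the trained Punkt model (so it needs the punkt data from the download step) and copes better with abbreviations.
nltk.sent_tokenize("Dr. Smith went to Washington. It rained. The end.")
# Should give something like ['Dr. Smith went to Washington.', 'It rained.', 'The end.']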