apt-get install python-nltk
Before doing anything too interesting you need to install some data files.
$ python
>>> import nltk
>>> nltk.download()
This brings up a GUI downloader that lets you pick which parts you want. I just did this and let it go crazy.
Note that it puts all the goop it acquires in ~/nltk_data by default. Having struggled to make and download my own concordances independently, I'd say this is really worth the price of admission. Seriously, there are a lot of big files that this grabs. Some are probably not critical (e.g. grammars/basque_grammars.zip) but it does seem like a good place to start.
Bizarre Bonus Functionality
The nltk data provides some texts to play with. I don’t know how useful they are for serious work but they do help with pedagogy.
To access the books, do this.
In : from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In : text6
Out: <Text: Monty Python and the Holy Grail>
The text[1-9] objects are nltk.text.Text objects.
Here’s how to get at specific titles in the various corpora. Some organizational functions:
nltk.corpus.brown.categories()
nltk.corpus.brown.fileids()
nltk.corpus.brown.root             # Where in the real file system it lives.
nltk.corpus.brown.abspath(fileid)
nltk.corpus.brown.raw()[:500]
nltk.corpus.brown.readme()
Loading Custom Corpora
corpus_root= '/home/xed/X/P/nltk/blogcorpus/'
myfiles= ['xed13','xed14','xed15','xed16']
xedcorp= nltk.corpus.PlaintextCorpusReader(corpus_root,myfiles)
xedcorp.fileids()
The myfiles argument can also be a regular expression pattern instead of an explicit list. Note that you need that r at the front to make it a raw string suitable for a regular expression, which it is.
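Conceptually the reader just maps fileids to files under the corpus root. Here's a rough stdlib-only illustration of that idea (file names made up for the demo; the real work is done by nltk.corpus.PlaintextCorpusReader):

```python
import os
import re
import tempfile

# A corpus root is just a directory; a pattern fileids argument
# just selects the files under it whose names match the regex.
root = tempfile.mkdtemp()
for name in ['xed13', 'xed14', 'xed15', 'notes.bak']:
    with open(os.path.join(root, name), 'w') as f:
        f.write('some text\n')

pattern = r'xed.*'  # like passing a raw-string regex as myfiles
fileids = sorted(n for n in os.listdir(root) if re.fullmatch(pattern, n))
print(fileids)  # ['xed13', 'xed14', 'xed15']
```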
Note that xedcorp and text6 are different kinds of objects. These are true.
type(text6) == nltk.text.Text
type(xedcorp) == nltk.corpus.reader.plaintext.PlaintextCorpusReader
Here are some methods that are defined for Text objects: concordance(), similar(), and collocations().
Tokenizing is the process of taking a long string of stuff and converting it into a list of words. It sounds easy, but it is not.
The default is to tokenize by word.
text= '''I was born in a crossfire hurricane and I howled at my Ma
in the driving rain but it's all right now, in fact, it's a gas.'''
tokens= nltk.word_tokenize(text)
print('%s %s' % (tokens[6],tokens[-2])) # Produces 'hurricane gas'.
tagged= nltk.pos_tag(tokens)            # Tags words with part of speech.
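To see why it's not easy, compare naive whitespace splitting with a slightly smarter regex. This is a stdlib-only sketch to show the problem, not what nltk.word_tokenize actually does:

```python
import re

text = "it's all right now, in fact, it's a gas."

# Naive: punctuation stays glued to the words.
print(text.split())  # ..., 'now,', ..., 'a', 'gas.']

# Slightly smarter: words (allowing one internal apostrophe,
# so contractions survive) or single punctuation marks.
tokens = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?|[.,!?]", text)
print(tokens)  # ["it's", 'all', 'right', 'now', ',', ..., 'gas', '.']
```

And even this falls over on hyphens, quotes, numbers, abbreviations, etc., which is why a real tokenizer is worth having.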
Tokenizing by sentence
This messy thing seems to tokenize by sentence, i.e. it will give you a list of all the sentences in a document.
from nltk.tokenize import regexp_tokenize
regexp_tokenize(text, pattern=r'\.(\s+|$)', gaps=True)
Or here’s a slightly more effective version.
regexp_tokenize(text, pattern=r'[.?!](\s+|$)', gaps=True)
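The same gap-splitting idea works with the stdlib's re directly. A rough sketch (example sentence made up), which, like the version above, will wrongly split on abbreviations like "Dr.":

```python
import re

text = "I was born. Was it all right? In fact, it's a gas!"

# Split at ., ?, or ! followed by whitespace or end of string;
# the non-capturing group keeps the separators out of the result.
sentences = [s for s in re.split(r'[.?!](?:\s+|$)', text) if s]
print(sentences)
# ['I was born', 'Was it all right', "In fact, it's a gas"]
```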