User:Alvations/NLTK cheatsheet

NLTK (Natural Language Toolkit) (http://nltk.org/) is a nifty library for human language analysis (aka Computational Linguistics/Natural Language Processing). It is written in and for Python by the noted computational/field linguist Steven Bird.

This user page is set up to answer some hiccups that new NLTK users will chance upon, especially when using the WordNet modules. I am using an Ubuntu 12.04 LTS distro, so most of my solutions to troubleshoot the hiccups are in Ubuntu's context. I also have another user page for a Python-related cheatsheet.

Installation/General

How to install NLTK?

The main NLTK page has a simple installation guide (see http://nltk.org/install.html). Note that NLTK requires Python 2.6-2.7.

How to check NLTK's version?

First, open the Python interpreter, import nltk, and then evaluate nltk.__version__; voila, the version number pops up!

$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.__version__
'2.0.4'

How do I download the corpora and additional packages from nltk?

Although NLTK provides the basic tools for language processing, resources such as corpora, dictionaries, treebanks, grammars and pre-trained language models are often necessary to process training/testing data. To access these resources through NLTK, use its download module.

>>> import nltk
>>> nltk.__version__
'2.0.4'
>>> nltk.download()
showing info http://nltk.github.com/nltk_data/
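
Besides the interactive downloader window, nltk.download() also accepts a package or collection identifier, which is handy on headless machines; for example:

>>> import nltk
>>> nltk.download('wordnet')  # just the WordNet data
>>> nltk.download('book')     # everything used in the NLTK book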

How do I update nltk to the latest version?

I suggest a pip install to ensure that NLTK stays in sync with your Python distribution. What I normally do is repeat the installation process from the NLTK installation guide, to make sure that NLTK dependencies like PyYAML or NumPy are also updated. To simply update NLTK, try this:

$ sudo pip install -U nltk  # -U upgrades an existing install
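
After updating, you can confirm the installed version from the shell without opening the interpreter:

$ python -c "import nltk; print nltk.__version__"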

Corpus Readers

From experience, I have tried to write various corpus readers to read corpora for NLP, but none of my readers have come close to the elegance of Steven Bird's in NLTK. So here are my attempts to first go through all the pre-coded corpus readers, then extend them to take in more corpora or to read them in a more robust way. Go to NLTK Corpora Readers

How do I convert from the Princeton sense-key format (used in SemCor) to the offset-pos format (e.g. 01234567-x)?

The preferred Princeton sense-key format is also used in SemCor (e.g. bus%1:06:00::), but I often find the offset-pos format (e.g. 01234567-x) more palatable. So here is a short piece of code to switch between the formats (author: Francis Bond):

>>> import nltk
>>> from nltk.corpus import wordnet as ewn
>>> def sc2ss(lemma, sensekey, senseno):
...     # Look up a synset given the information from SemCor.
...     # Assuming it is the same WN version (e.g. 3.0).
...     p = ['', 'n', 'v', 'a', 'r', 's']  # pos mapping
...     return ewn.synset('%s.%s.%02d' % (lemma, p[int(sensekey[0])], int(senseno)))
...
>>> ss = sc2ss('live', '2:42:06::', '2')
>>> print ss, ss.definition, ss.lexname, '(%08d-%s)' % (ss.offset, ss.pos)
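
Going the other way is just string formatting over the same synset attributes used above; a minimal sketch (ss2of is my own name for the helper):

>>> def ss2of(ss):
...     # offset-pos string, e.g. 01234567-x, from a synset
...     return '%08d-%s' % (ss.offset, ss.pos)
...
>>> print ss2of(sc2ss('live', '2:42:06::', '2'))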

How do I get the corpus instances from Senseval through NLTK?

(source: nltk-users Google group)

>>> import nltk
>>> from nltk.corpus import senseval

# This shows which files from Senseval are in NLTK. Sadly there are only 4.
>>> print senseval.fileids()

# This yields the line-by-line XML version of the Senseval data.
# Note that the "\n" are also in each line.
>>> print senseval.raw()

# Most probably the individual instances are what you wanted to get.
>>> for id in senseval.fileids():
...     # This accesses all instances from each file.
...     insts = senseval.instances(id)
...     # Looping through the instances.
...     for i in insts:
...         print i
...         # A SensevalInstance holds (word, position, context, senses),
...         # so you can access each attribute as such:
...         print i.word, i.position, i.senses
...         print i.context
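
As a quick usage example, you can tally the sense distribution of a single file with collections.Counter over i.senses; a small sketch ('hard.pos' should be one of the four fileids listed above):

>>> from collections import Counter
>>> insts = senseval.instances('hard.pos')
>>> print Counter(s for i in insts for s in i.senses).most_common()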

How to get the adverbial forms of adjectives (quick => quickly)

Although it is rare, WordNet has a relation called pertainym. It connects the relevant adjective to its adverbial form.

>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.all_synsets(): # loop through all synsets in WordNet
...     for l in ss.lemmas: # loop through the possible lemmas in that synset
...         x = l.pertainyms() # access the lemma's pertainyms
...         if len(x) > 0:
...             print str(ss.offset)+"-"+ss.pos, l, x
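
To index it the other way round (adjective => adverb), restrict the loop to adverb synsets and key on the pertainym. A minimal sketch, assuming NLTK 2.x attribute-style accessors (lemma.name, not lemma.name()):

>>> from nltk.corpus import wordnet as wn
>>> adj2adv = {}
>>> for ss in wn.all_synsets('r'):  # adverb synsets only
...     for l in ss.lemmas:
...         for pert in l.pertainyms():  # the adjective this adverb pertains to
...             adj2adv.setdefault(pert.name, []).append(l.name)
...
>>> print adj2adv.get('quick', [])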

HPSG related

How to read ERG *.tdl into Python?

def readTDL(tdlfile):
	# Read a TDL file and return a list of entries, one string per entry.
	obj, temp = [], []
	for line in open(tdlfile):
		if line.startswith(";"): continue  # skip TDL comment lines
		temp.append(line)
		if line[-3:] == "].\n":  # an entry ends with "]." at the end of a line
			obj.append("".join(temp).strip())
			temp = []
	return obj
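
A quick sanity check of the reader (the lexicon.tdl path is an assumption; point it at your own ERG checkout):

entries = readTDL('lexicon.tdl')  # assumed path
print len(entries), "TDL entries read"
print entries[0]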

Get a dictionary (lemma set) from ERG lexicon.tdl

def vocab2lemmas(vocab):
	# Extract the lemma from each TDL entry identifier by dropping the
	# suffix after the last underscore, e.g. "abandon_v1" => "abandon".
	lemmas = set()
	for word in vocab:
		lemma = word.split()[0].rpartition("_")[0]
		if lemma == "":  # no underscore in the identifier
			lemma = word.split()[0]
		lemmas.add(lemma)
	return lemmas
vocab = readTDL('lexicon.tdl')
lemmas = vocab2lemmas(vocab)

Getting idioms from ERG idioms.mtr

idioms = [i.split()[0] for i in readTDL('idioms.mtr')]

Getting HPSG parses from ACE

import os

def installACE():
	# Download and unpack the ACE binary and the precompiled ERG grammar image.
	os.system("wget -P ~/ http://sweaglesw.org/linguistics/ace/download/ace-0.9.16-x86-64.tar.gz")
	os.system("tar -zxvf ~/ace-0.9.16-x86-64.tar.gz -C ~/ace-0.9.16")
	os.system("wget -P ~/ http://sweaglesw.org/linguistics/ace/download/erg-1212-x86-64-0.9.16.dat.bz2")
	os.system("bzip2 -dc ~/erg-1212-x86-64-0.9.16.dat.bz2 > ~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat")

def aceParse(sent, onlyMRS=False, parameters=""):
	# Pipe a sentence through the ACE binary and return the non-empty
	# output lines, minus ACE's first status line.
	if onlyMRS == True:
		parameters += " -T"
	cmd = 'echo "' + sent + '" | ~/ace-0.9.16/ace -g ~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat ' + parameters
	return [p.strip() for p in os.popen(cmd) if p.strip() != ""][1:]

# If you have not installed ACE, uncomment and try the following line:
#installACE()
sentence = "This is a foo bar sentence."
parse_outputs = aceParse(sentence)
for po in parse_outputs:
	print po
#TODO: write a prettify function to make the ACE output humanly readable
#TODO: add ACE parameter functions
#TODO: properly get the MRS outputs into Python variables
#TODO: a proper ace object so that I can have ace.install(), ace.parse(), parse.prettify()
#TODO: ace.update()
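
A safer variant of the shell pipeline above, as a sketch: subprocess feeds the sentence to ACE's stdin directly and sidesteps echo's quoting pitfalls (aceParseSafe and its defaults are my own naming, assuming the same ACE paths as installACE() above):

import os
import subprocess

def aceParseSafe(sent, grammar="~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat", flags=None):
	# Hypothetical alternative to aceParse(); same output convention:
	# non-empty lines minus ACE's first status line.
	cmd = [os.path.expanduser("~/ace-0.9.16/ace"),
	       "-g", os.path.expanduser(grammar)] + (flags or [])
	proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
	out, _ = proc.communicate(sent)
	return [l.strip() for l in out.splitlines() if l.strip()][1:]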


Simple NLP examples

Movie Review classifier

http://stackoverflow.com/questions/21107075/classification-using-movie-review-corpus-in-nltk-python

NumPy array usage

http://stackoverflow.com/questions/27027680/working-out-word-document-vectors-from-nested-dictionary