Lemmatization of English words in sentences in XML format with Python


Environment: Python 2.7, NLTK 3.0
The input XML file looks like this:


<?xml version="1.0" encoding="UTF-8"?>
<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>

<item id="2" asks-for="cause" most-plausible-alternative="1">
<p>the woman tolerate the woman friend 's difficult behavior . </p>
<a1>the woman know the woman friend be go through a hard time . </a1>
<a2>the woman felt that the woman friend take advantage of her kindness . </a2>
</item>
...

</sentences>
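
To make the structure concrete, here is a minimal sketch of reading one such item with the standard library's ElementTree. The inline SAMPLE string is simply the first item copied from the file above (without the XML declaration), and the names SAMPLE and items are illustrative, not part of the original script.

```python
import xml.etree.ElementTree as ET

# First <item> from the sample above, inlined so the sketch runs on its own.
SAMPLE = """<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>
</sentences>"""

root = ET.fromstring(SAMPLE)

# Each <item> carries its metadata as attributes and its sentences
# (<p>, <a1>, <a2>) as child elements.
items = []
for item in root:
    sentences = dict((child.tag, child.text.strip()) for child in item)
    items.append((item.get('id'), item.get('asks-for'), sentences))
```

Iterating over the root yields the item elements, and iterating over an item yields its sentence elements, which is the same two-level traversal the script below relies on.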

Python Code


# This setting is only needed if you hit an error about 'encoding utf-8' (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import xml.etree.cElementTree as ET  # library for XML processing

from nltk.tokenize import word_tokenize  # library for word tokenization

from nltk.stem import WordNetLemmatizer  # library for word lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

tree = ET.parse('input.xml')  # parse the XML tree from input.xml
root = tree.getroot()  # get the root element of the tree

for item_of_root in root:  # for each item
    for sentence in item_of_root:  # for each sentence in the item
        words = word_tokenize(sentence.text)  # split the sentence into words
        sentenceNew = ""  # container for the new lemmatized sentence
        for word in words:  # for each word in the sentence
            # lemmatize the word, treating it as a verb (pos='v')
            lamWord = wordnet_lemmatizer.lemmatize(word, pos='v')
            sentenceNew += lamWord + ' '  # append the lemmatized word to the container
        sentence.text = sentenceNew  # store the new sentence in the tree

tree.write('output.xml')  # output the lemmatized tree to a file
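
The same traversal can be wrapped in a function that accepts any word-level lemmatizer, which makes it easy to try on a small string without NLTK data installed. This is only a sketch under stated assumptions, not the script above: lemmatize_tree, toy_lemmatize, and TOY_LEMMAS are hypothetical names, whitespace splitting stands in for word_tokenize, and the small lookup table stands in for WordNet.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for WordNet: a tiny lookup table of verb forms.
TOY_LEMMAS = {'is': 'be', 'rising': 'rise', 'knew': 'know'}


def toy_lemmatize(word):
    """Return the lemma from the toy table, or the word unchanged."""
    return TOY_LEMMAS.get(word, word)


def lemmatize_tree(root, lemmatize):
    """Rewrite the text of every sentence element under each item, in place."""
    for item in root:
        for sentence in item:
            # Whitespace split is an assumption; the script uses word_tokenize.
            words = (sentence.text or '').split()
            sentence.text = ' '.join(lemmatize(w) for w in words)
    return root


root = ET.fromstring(
    '<sentences><item id="1"><p>the sun is rising .</p></item></sentences>')
lemmatize_tree(root, toy_lemmatize)
result = root[0][0].text
```

With NLTK available, the callable could instead be something like lambda w: wordnet_lemmatizer.lemmatize(w, pos='v'). Note that ' '.join leaves no trailing space after the last word, whereas the loop in the script appends one after every word.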

 
