Murzal Arsya: xml


Wednesday, July 29, 2009

XML Document processing comparison between java and python

I had a project that required XML document processing. At first I planned to do it in Java, but since the project was not very big, I decided to use Python. I have to say it was a good decision. In my opinion, XML document processing feels more natural in Python compared to Java.

Let's go straight to the code comparison.
In Java, there are several steps involved before you actually get the XML document elements that match a given XPath expression.

// Uses classes from javax.xml.parsers, javax.xml.xpath, and org.w3c.dom.
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("books.xml");

// A separate variable name: declaring a second "factory" would not compile.
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xpath = xpathFactory.newXPath();
XPathExpression expr = xpath.compile("//book[author='Neal Stephenson']/title/text()");

Object result = expr.evaluate(doc, XPathConstants.NODESET);

NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println(nodes.item(i).getNodeValue());
}

I took the example from the IBM website. You should check it out if you are interested in using XPath in Java.

Now the same thing in Python, using the lxml library.

from lxml import etree

doc = etree.parse("books.xml")
nodes = doc.xpath("//book[author='Neal Stephenson']/title/text()")
print(nodes)
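If you want to try the query without lxml installed or a books.xml file on disk, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree. ElementTree only supports a subset of XPath, so the path starts with ".//" and leaves out the text() step. The sample inventory and the helper name titles_by_author are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for books.xml.
BOOKS_XML = """
<inventory>
  <book><title>Snow Crash</title><author>Neal Stephenson</author></book>
  <book><title>Burning Tower</title><author>Larry Niven</author></book>
  <book><title>Zodiac</title><author>Neal Stephenson</author></book>
</inventory>
"""

def titles_by_author(xml_text, author):
    """Return the titles of all books whose <author> text equals author."""
    root = ET.fromstring(xml_text.strip())
    # [author='...'] keeps only <book> elements with a matching <author> child.
    # Note: an author name containing a single quote would break this path.
    path = ".//book[author='{}']/title".format(author)
    return [title.text for title in root.findall(path)]

titles = titles_by_author(BOOKS_XML, "Neal Stephenson")
print(titles)
```

With lxml the full expression from the post can be passed to xpath() directly, which is the main reason I prefer it for anything beyond simple lookups.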

I mostly use Java for my projects and am pretty happy with it, even though I sometimes wish Java had some of the syntactic sugar found in C#.
So, if you get to choose the programming language for your XML projects, give Python a try. I am sure you will like it.

Friday, June 12, 2009

Processing xml document in python using lxml

In the previous entry, I discussed libxml2. Now I am going to discuss lxml, which in my opinion is awesome. I really appreciate the developer team that created lxml. It makes processing XML documents easier.
The best way to get it is to go to this link and choose the appropriate version. Once you have installed lxml, try it out. Here is a sample of how to use it.

from lxml import etree
import sys

def getChildInfo(child):
    # format one "child" element as "id: ..., data: ..."
    return "id: " + child.get("id") + ", data: " + child.text + "\n"

# sample document: two "child" elements with ids 1 and 2
# (the root element name is assumed)
xmlString = "<root><child id='1'>c1</child><child id='2'>c2</child></root>"
tree = etree.fromstring(xmlString)
children = tree.xpath("//child")
sys.stdout.write(str(len(children)) + "\n")
for el in [ getChildInfo(child) for child in children ]:
    sys.stdout.write(el)

# look up each child by its id attribute
childList = tree.xpath("//child[@id='1']")
sys.stdout.write(getChildInfo(childList[0]))
childList = tree.xpath("//child[@id='2']")
sys.stdout.write(getChildInfo(childList[0]))
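One thing to watch in getChildInfo: child.get("id") returns None when the attribute is missing, and child.text is None for an empty element, so the string concatenation raises a TypeError on such input. Below is a rough sketch of a more forgiving version. It uses the standard library's xml.etree.ElementTree so it runs without lxml, but get() and text behave the same way on lxml elements. The helper name get_child_info, the "(no id)" placeholder, and the sample document are my own assumptions.

```python
import xml.etree.ElementTree as ET

def get_child_info(child):
    """Describe a <child> element without failing on missing id or text."""
    # Element.get() accepts a default value for a missing attribute.
    child_id = child.get("id", "(no id)")
    # .text is None when the element has no text content.
    data = child.text if child.text is not None else ""
    return "id: {}, data: {}".format(child_id, data)

# Hypothetical input: the second child has no id, the third has no text.
root = ET.fromstring(
    "<root><child id='1'>c1</child><child>c2</child><child id='3'/></root>"
)
lines = [get_child_info(child) for child in root.findall(".//child")]
for line in lines:
    print(line)
```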

For more examples and other information about lxml, check out this link.

Thursday, June 11, 2009

Processing xml document in python using libxml

I had to spend some time gathering information about processing XML documents in Python. There are several different libraries to choose from, such as minidom, libxml, etree, etc. I had been working with XPath, so I needed XPath support in whichever library I worked with, and I chose libxml. It was difficult to find information on how to use its API: there is very little documentation, and few guides, examples, or tutorials out there.
After spending some time with it, I became familiar with it. I am writing this so other people can learn how to use libxml in more realistic situations. Many of the examples I found on the internet are way too basic and did not suit my needs.

example 1:
--------------------------------------


import sys
import libxml2

def main():
    doc = libxml2.parseFile("data.xml")
    ctxt = doc.xpathNewContext()
    # get student elements in xml doc
    res = ctxt.xpathEval("//student")
    # output the value of the first attribute of the first "student" element
    sys.stdout.write(res[0].get_properties().content + "\n")

    for chld in res[0].children:
        if chld.type == "element":
            # output all information within the child element
            sys.stdout.write(chld.content + "\n")

    doc.freeDoc()
    ctxt.xpathFreeContext()

if __name__ == '__main__':
    main()

--------------------------------------

example 2:
--------------------------------------


import sys
import libxml2

def main():
    doc = libxml2.parseFile("layout.xml")
    ctxt = doc.xpathNewContext()
    # get all "Property" elements whose "name" attribute is "Location" and
    # whose parent is an "Object" element with a "name" starting with "pickList"
    res = ctxt.xpathEval("//Object[starts-with(@name,'pickList')]/Property[@name='Location']")
    for r in res:
        sys.stdout.write(r.content + "\n")

    # output the number of matches
    sys.stdout.write(str(len(res)))

    doc.freeDoc()
    ctxt.xpathFreeContext()

if __name__ == '__main__':
    main()

--------------------------------------
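To make clear what the XPath expression in example 2 selects, here is the same filter spelled out step by step in plain Python. This is only a sketch using the standard library's xml.etree.ElementTree (which has no starts-with() function), not libxml2. The layout document below is invented, since the real layout.xml is not shown here, and pick_list_locations is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for layout.xml.
LAYOUT_XML = """
<Layout>
  <Object name="pickListA">
    <Property name="Location">10, 20</Property>
    <Property name="Size">100, 30</Property>
  </Object>
  <Object name="label1">
    <Property name="Location">5, 5</Property>
  </Object>
  <Object name="pickListB">
    <Property name="Location">10, 60</Property>
  </Object>
</Layout>
"""

def pick_list_locations(xml_text):
    """Mimic //Object[starts-with(@name,'pickList')]/Property[@name='Location']."""
    root = ET.fromstring(xml_text.strip())
    found = []
    # //Object: every Object element anywhere in the document
    for obj in root.iter("Object"):
        # [starts-with(@name,'pickList')]
        if obj.get("name", "").startswith("pickList"):
            # /Property[@name='Location']: direct children only
            found.extend(obj.findall("Property[@name='Location']"))
    return [prop.text for prop in found]

locations = pick_list_locations(LAYOUT_XML)
print(locations)
```

The "label1" object also has a Location property, but it is skipped because its name does not start with "pickList".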

It is also useful to read the source code of libxml2.py, just to see what's going on under the hood.

Not long after writing this post, I found an even better XML API: lxml. I will write a separate post about this wonderful XML API for Python.