Tuesday, April 20, 2010

Processing Large XML Documents with PHP 5


There are different ways to process XML documents in PHP 5. You can process them with SimpleXML, SAX, XMLReader or DOM, and they all have their pros and cons (see my "XML in PHP 5" workshop slides for more details). But when it comes to large XML documents, the choices look quite limited.
Therefore I did some benchmark testing with the different extensions. The XML document is approximately 10MB in size and consists of a lot of blog entries from Planet PHP. The task to solve was to get the title of the entry with the ID 4365. Not a particularly complicated task; with more complex queries, the results may differ.



The results (as a text file) were actually not that surprising. SAX and XMLReader were very low on memory usage, but slower than DOM/XPath. Here's a chart of the initial results (parsing the full document).
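For reference, the DOM/XPath variant from the benchmark looks roughly like the following sketch. The element and attribute names (`<entry id="...">`, `<title>`) and the inline sample document are assumptions standing in for planet.xml, which isn't reproduced here. The whole document is loaded into memory before the query runs, which is exactly why this approach is fast but memory-hungry.

```php
<?php
// Sketch of the DOM/XPath approach: load the entire document into memory,
// then run a single XPath query. Node names and the inline sample are
// assumptions standing in for planet.xml.
$doc = new DOMDocument();
$doc->loadXML('<entries>'
    . '<entry id="1"><title>Some other post</title></entry>'
    . '<entry id="4365"><title>Processing Large XML Documents with PHP 5</title></entry>'
    . '</entries>');

$xpath = new DOMXPath($doc);
// string() makes evaluate() return the text content directly
$title = $xpath->evaluate('string(//entry[@id="4365"]/title)');

echo $title, "\n";
?>
```

With a real 10MB file you would use `$doc->load('planet.xml')` instead of the inline string; the XPath stays the same.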



But if we assume there's only one entry with ID 4365, we don't have to process the full document and can stop after the first match (aka FO or "firstonly" in the results) was found. As this entry is in the first 10% of our example document, the results are quite different, to no surprise. With this approach and some luck with the order of the entries, we can cut down the processing time considerably, which is not possible with the DOM approach. There, it's all-or-nothing.
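A "firstonly" pass with XMLReader could look like this minimal sketch. Again, the node names and the tiny inline sample are assumptions, not the original benchmark script; the point is the early `break`, which means everything after the matching entry is never parsed at all.

```php
<?php
// Hypothetical sketch of the "firstonly" XMLReader approach: stream through
// the document and stop as soon as the wanted entry is found.
$xml = <<<XML
<entries>
  <entry id="1"><title>Some other post</title></entry>
  <entry id="4365"><title>Processing Large XML Documents with PHP 5</title></entry>
  <entry id="9999"><title>Never parsed</title></entry>
</entries>
XML;

$reader = new XMLReader();
$reader->XML($xml); // with a file you'd use $reader->open('planet.xml')

$title = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'entry'
        && $reader->getAttribute('id') === '4365') {
        // Walk into this entry until we reach the text of its <title>
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::TEXT) {
                $title = $reader->value;
                break 2; // stop parsing: the rest of the file is never touched
            }
        }
    }
}
$reader->close();

echo $title, "\n";
?>
```

Because XMLReader only ever holds the current node, memory stays flat no matter where the entry sits in the file; only the elapsed time depends on its position.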



In the result charts you may also have noticed the options "Expand" and "Expand & SimpleXML". I added a new method to XMLReader this weekend called "expand()" (it's in CVS now). With this method, you can convert a node caught with XMLReader to a DOMElement. See also the libxml2 page for more information. This can be very useful if you want to do DOM operations on only a small part of a huge XML document. With the "Expand" script, we expand the node matching ID = 4365 with XMLReader and then apply an XPath operation on it. As you can see, it needs a few lines of code (the expand() method only returns a node, but we need a document for XPath), but after that, we can use every XPath expression and DOM method we want, or even convert it to SimpleXML, as we do in the "Expand & SimpleXML" script. That's maybe a little bit pointless in this case, as we don't save a lot of coding or time, but if your subtrees are more complex or you want to build a new XML document, this can be quite useful. The time and memory used are approximately the same as with the plain XMLReader script (no surprise, since most of the time is spent traversing the XML document, not parsing the subtree).
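The expand-then-XPath dance could be sketched like this. The node names and the inline sample are assumptions (not the original script), but the `importNode()` step is the part the paragraph above describes: expand() only returns a node, so we wrap it in a fresh DOMDocument before XPath can see it.

```php
<?php
// Sketch of the "Expand" approach: locate the node with XMLReader, expand()
// it into a DOMElement, wrap it in a DOMDocument, and run XPath on just
// that subtree. Node names and the inline sample are assumptions.
$xml = '<entries>'
     . '<entry id="1"><title>Some other post</title></entry>'
     . '<entry id="4365"><title>Processing Large XML Documents with PHP 5</title></entry>'
     . '</entries>';

$reader = new XMLReader();
$reader->XML($xml);

$title = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'entry'
        && $reader->getAttribute('id') === '4365') {
        // expand() only returns a node, but XPath needs a document,
        // so import the subtree into a fresh DOMDocument first.
        $doc = new DOMDocument();
        $doc->appendChild($doc->importNode($reader->expand(), true));

        $xpath = new DOMXPath($doc);
        $title = $xpath->evaluate('string(/entry/title)');

        // Or hand the same subtree to SimpleXML, as in "Expand & SimpleXML":
        // $entry = simplexml_import_dom($doc->documentElement);
        // $title = (string) $entry->title;
        break;
    }
}
$reader->close();

echo $title, "\n";
?>
```

Only the one expanded entry lives in memory as a DOM tree; the rest of the document is streamed past and discarded.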



I also did some benchmarks with XSLT (the chart). First I used the traditional method: load the whole XML document into memory, then transform it. Time and memory used are more or less the same as with plain DOM processing, which is no surprise, since the task this script has to do is almost the same as the XPath one. But it gets interesting with the expand() feature of XMLReader. As we just want to transform the one entry, we search for it with XMLReader, create a DOMElement resp. a DOMDocument, and feed only that to the XSLT processor. This saves a lot of memory and scales very well on the memory side. It takes longer time-wise (if you parse the full document, but that's the worst-case scenario anyway), but if your XML documents are really huge (larger than your available RAM, for example), then this (or another XMLReader approach) is the only feasible solution, IMHO.
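Combining XSLT with expand() could look like this sketch. The stylesheet and node names are assumptions made up for the example (the original benchmark scripts used their own); the key idea is that only the one expanded entry is in memory when the transformation runs.

```php
<?php
// Sketch of the XSLT-on-a-subtree idea: find the entry with XMLReader,
// expand it into its own DOMDocument, and feed only that to XSLTProcessor.
// The stylesheet and node names are assumptions, not the original scripts.
$xsl = new DOMDocument();
$xsl->loadXML('<xsl:stylesheet version="1.0"'
    . ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    . '<xsl:output method="text"/>'
    . '<xsl:template match="/entry"><xsl:value-of select="title"/></xsl:template>'
    . '</xsl:stylesheet>');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

$reader = new XMLReader();
$reader->XML('<entries><entry id="4365">'
    . '<title>Processing Large XML Documents with PHP 5</title>'
    . '</entry></entries>');

$result = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'entry'
        && $reader->getAttribute('id') === '4365') {
        $doc = new DOMDocument();
        $doc->appendChild($doc->importNode($reader->expand(), true));
        // Only this one entry is in memory when the transformation runs
        $result = $proc->transformToXML($doc);
        break;
    }
}
$reader->close();

echo $result, "\n";
?>
```

The same pattern scales to documents larger than RAM: the stylesheet is compiled once, and each expanded subtree can be transformed and freed in turn.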



To sum up: XMLReader is a powerful extension for parsing large XML documents. It's usually much faster than SAX (about twice as fast), while still scaling without problems on the memory side. With the expand() method, it's now also possible to mix the features of DOM/SimpleXML/XSLT with XMLReader, if you only have to process parts of an XML document.



Here are the scripts for reference:

SAX

XMLReader

Expand

Expand & SimpleXML

DOM & XPath

XSLT

XSLT w/ XMLReader

Comments

Bill Humphries

@ 11.05.2004 08:22 CEST

I need to dedicate a box to PHP5 testing.



Thanks for the write-up on this, because it gives me some ideas for some strategies to use with XMLReader.
Daniel Veillard

@ 15.05.2004 19:33 CEST

Interesting. At the libxml2 level, SAX is about twice as fast as the xmlReader, but your experiment points out one more problem of the SAX API: when you cross language boundaries, the callbacks are extremely expensive, especially if converting strings is needed. That's why SAX is not a good API for exporting a fast parser to, say, PHP or Python; any advantage you may gain with the C parser is lost in the marshalling of the strings. The reader, in comparison, allows you to minimize the marshalling: you have far more integers (cheaper), all attributes and their values are marshalled only if asked for, and checking the element type allows you to short-circuit potentially expensive operations.

I think that adding Reader operations like NextElement(Name?) or NextType(type) would have even more potential for fast processing, and would be very convenient for the kind of operations you describe. NextElement(title) would stop only once per article (and there is a glob of optimization possible at the libxml2 level for such searching).

Daniel
ashook
@ 24.03.2005 17:39 CEST

Hi Daniel,

I want to try your reference scripts, but I need memreport.php. It would be nice if you could send it to me, please.

Thanks in advance.

Greetings,
Ashook
Oskar Austegard
@ 18.07.2005 20:23 CEST

Has anyone run a test on HUGE (multi-GB) XML files? Would XMLReader scale to this size?
chregu
@ 18.07.2005 20:28 CEST

Hi Oskar: XMLReader scales to any size, as it only parses the document chunk by chunk.
giulio
@ 26.08.2005 12:38 CEST

Very useful, thanks!

I am currently using the combination of XMLReader + expand() and then handling the single needed node with DOM.

Great speed!
Ben Margolin
@ 04.10.2005 00:01 CEST

This was enormously useful info, thanks! Appreciate the examples. While it's slightly clunky to expand/SimpleXML import, it's VERY convenient; I was pleasantly surprised to see there isn't much of a performance penalty, either.



Looking forward to using XMLReader for processing giant feeds... (150MB+ XML...)
anusha
@ 10.01.2006 18:42 CEST

Hi,

Problem: I have to parse an XML file larger than 5GB. If I do that using DOM, it throws an out-of-memory exception.

Which parser should I use? Should I go for SAX, or will I face the same problem?
chregu
@ 10.01.2006 19:39 CEST

If you have PHP 5, I'd recommend XMLReader.
Henry
@ 26.02.2006 12:04 CEST

The reference scripts have been removed. I would much appreciate having them available again.

Anyhow, I would like to make some speed/memory comparisons between DOM, SAX, XMLReader and SimpleXML as implementations of an xml2array parser. The structure should be the one given by PEAR::XML_Unserializer. Any idea or prediction which of them is most useful for this job? The XML structure I would work with is not very deeply nested. Attributes are rare, too.

Thanks so far
Pravin
@ 14.03.2006 12:34 CEST

Hi,

I read your blog; it seems interesting and shows me a way to find a solution for parsing and storing large XML in a DB (up to 200 MB).

Can you send the example scripts that you mentioned?
chregu
@ 14.03.2006 13:09 CEST

The examples are online again, sorry to all who didn't find them...
Hosato
@ 21.03.2006 02:56 CEST

I'm trying to use XMLReader on a Windows install of PHP 5.1.2 but can't find a way to enable it. Does it require that PHP be recompiled with --enable-xmlreader added to the configure line or is there a build of PHP out there with this already done? I'm doing this for a client and don't have the time to figure out how to compile PHP myself.



I'm trying to parse a 28MB XML file and load it into a MySQL database. Is there another way to do it that won't require me to recompile PHP?



Any help would be greatly appreciated. Thanks.
Roman
@ 23.08.2006 02:39 CEST

ashook , memreport.php is available at https://svn.bitflux.org/repos/public/php5examples/largexml/memreport.php
Kanthan Arul
@ 30.08.2006 12:43 CEST

Does this work with cross-domain fetching, without crossdomain.xml?

My testing showed varied results.

Could someone clarify, please?
bwdow
@ 21.12.2006 01:35 CEST

Nice article, but could you give an example of reading XML with XMLReader? I was looking for an example.
chregu
@ 21.12.2006 08:53 CEST

bwdow: http://php5.bitflux.org/xml-namics/slide_64.php and ff. has some examples
trevor
@ 26.02.2007 18:53 CEST

I think it's weird that we are going in circles. The reason large data volumes got broken down into a relational DB was for this EXACT situation: fast searching through giant data.



now we've gone full circle, back to flat files, and needing a way to search them quickly again?



why not just structure your file hierarchy like a database, where the folder names represent tables, and the data in the xml is structured in a relational manner - this keeps the sizes down, and your main title file, would only have two elements, name, and location.



Technology seems to chase its tail an awful lot. I mean, if you have PHP 5, then why not just use a DB if you suspect your files are going to become huge?



that said, if your files are reasonable enough and you don't want a db - this advice is great, and thanks a lot!

/tre
carreau
@ 30.06.2008 09:24 CEST

Could I ask you for 'memreport.php' and the XML/XSL files to test your benchmark?

Your benchmark and blog are really interesting.

Thank you very much

JC
Mikesloper
@ 08.07.2008 15:03 CEST

Hey Christian



Great post and thanks for doing the work, saved me some time and effort.



All the files you need are here:

https://svn.bitflux.org/repos/public/php5examples/largexml/
Not Web Design
@ 21.07.2008 14:24 CEST

Perfect information - thank you. Saved me a lot of time. XMLReader it is then :)



@trevor - I agree that we are going in circles, but XML is often used to transport data, not for storage. So sometimes huge XML files have to be read and imported.
sonja
@ 16.01.2009 09:38 CEST

I'd just like to queue up with all the others.



I'm using XMLReader to do exactly the thing "Not Web Design" mentioned - and it works great. Especially in combination with expand() and SimpleXML.



Thanks!
Satya Prakash
@ 02.12.2009 11:23 CEST

I have read this presentation http://php5.bitflux.org/phpconf2004/slide_29.php



Thanks!
Blabi
@ 15.03.2010 12:57 CEST

Great test results,

but where can I find the planet.xml?



Thanks in advance,



Blabi
rgi
@ 24.03.2010 22:34 CEST

Great article!

Has anyone tried to parse large XML files using XMLReader + XMLWriter?
Jason
@ 26.03.2010 11:38 CEST

nice post,

With this information I made my XML operations work faster; I never knew that my XML parsing could be done so fast...

expand() -> nice idea
