Processing Large XML Documents with PHP 5
There are several ways to process XML documents in PHP 5: SimpleXML, SAX, XMLReader or DOM, each with its own pros and cons (see my "XML in PHP 5" workshop slides for more details about them). But when it comes to large XML documents, the choices look quite limited.
Therefore I did some benchmark testing with the different extensions. The XML document is approximately 10 MB in size and consists of a lot of blog entries from Planet PHP. The exercise to be solved was to get the title of the entry with the ID 4365. Not a very complicated task; with more complicated questions, the results may differ.
The results (as a text file) were actually not that surprising. SAX and XMLReader were very low on memory usage, but slower than DOM/XPath. Here's a chart of the initial results (parsing the full document).
But if we assume there's only one entry with ID = 4365, then we don't have to process the full document and can stop after the first match (aka FO or firstonly in the results) is found. As this entry is within the first 10% of our example document, the results are quite different, to no surprise. With this approach and some luck with the order of the entries, we can cut down the processing time considerably, which is not possible with the DOM approach. There, it's all or nothing.
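The "first only" approach can be sketched as follows. This is a minimal illustration, not the actual benchmark script: the file name planet.xml, the element names entry/title and the id attribute are assumptions about the benchmark document's structure.

```php
<?php
// Stream through the document with XMLReader and stop as soon as the
// wanted entry has been found, instead of parsing everything.
$reader = new XMLReader();
$reader->open('planet.xml');

$title = null;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && $reader->name == 'entry'
        && $reader->getAttribute('id') == '4365') {
        // Walk inside this entry until we hit its <title>.
        while ($reader->read()) {
            if ($reader->nodeType == XMLReader::ELEMENT
                && $reader->name == 'title') {
                $reader->read();          // advance to the text node
                $title = $reader->value;
                break 2;                  // stop parsing, we have what we need
            }
        }
    }
}
$reader->close();
echo $title, "\n";
```

If the entry sits early in the document, most of the file is never touched, which is where the time savings in the firstonly results come from.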
In the result charts you may also have noticed the options "Expand" and "Expand & SimpleXML". I added a new method to XMLReader this weekend called expand() (it's in CVS now). With this method, you can convert a node reached with XMLReader into a DOMElement. See also the libxml2 page for more information. This can be very useful if you want to do DOM operations on only a small part of a huge XML document. With the "Expand" script, we expand the node matching ID = 4365 with XMLReader and then apply an XPath operation on it. As you can see, it needs a few lines of code (the expand() method only returns a node, but we need a document for XPath), but after that we can use every XPath expression and DOM method we want, and even convert it to SimpleXML, as we do in the "Expand & SimpleXML" script. That is maybe a little pointless in this case, as we don't save much coding or time, but if your subtrees are more complex or you want to build a new XML document, it can be quite useful. The time and memory used are approximately the same as with the plain XMLReader script (no surprise, since most of the time is spent traversing the XML document and not parsing the subtree).
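A sketch of the expand() technique, again with assumed file and element names (planet.xml, entry, title, id). The detour through a fresh DOMDocument is needed because expand() returns a bare node, but DOMXPath wants a document:

```php
<?php
// Locate the entry with XMLReader, then turn just that subtree into a
// DOMDocument for XPath, and optionally hand it over to SimpleXML.
$reader = new XMLReader();
$reader->open('planet.xml');

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && $reader->name == 'entry'
        && $reader->getAttribute('id') == '4365') {

        // expand() only returns a DOMNode; import it into its own document.
        $dom = new DOMDocument();
        $dom->appendChild($dom->importNode($reader->expand(), true));

        // Now any XPath expression or DOM method works on the subtree.
        $xpath = new DOMXPath($dom);
        echo $xpath->evaluate('string(/entry/title)'), "\n";

        // Or convert the same subtree to SimpleXML instead.
        $sxe = simplexml_import_dom($dom);
        echo (string) $sxe->title, "\n";
        break;
    }
}
$reader->close();
```

Only the one entry ever lives in memory as a DOM tree; the rest of the document is streamed past.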
I also did some benchmarks with XSLT (see the chart). First I used the traditional method of loading the whole XML document into memory and then transforming it. Time and memory used are more or less the same as with plain DOM processing, which is no surprise, since the task this script has to do is almost the same as the XPath one. But it gets interesting with the expand() feature of XMLReader. As we only want to transform the one entry, we search for it with XMLReader, create a DOMElement, or rather a DOMDocument, and feed only that to the XSLT processor. This saves a lot of memory and scales very well on the memory side. It takes longer time-wise (if you parse the full document, but that's the worst-case scenario anyway), but if your XML documents are really huge (more than your available RAM, for example), then this (or another XMLReader approach) is the only feasible solution, IMHO.
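The XSLT variant can be sketched like this; the stylesheet name entry.xsl and the document structure are assumptions for illustration:

```php
<?php
// Only the matching entry is expanded into a small DOMDocument, and only
// that document is handed to the XSLT processor, so memory usage stays
// flat no matter how big the input file is.
$reader = new XMLReader();
$reader->open('planet.xml');

$xsl = new DOMDocument();
$xsl->load('entry.xsl');
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && $reader->name == 'entry'
        && $reader->getAttribute('id') == '4365') {

        $dom = new DOMDocument();
        $dom->appendChild($dom->importNode($reader->expand(), true));

        echo $proc->transformToXML($dom);
        break;
    }
}
$reader->close();
```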
To sum up: XMLReader is a powerful extension for parsing large XML documents. It's usually much faster than SAX (about twice as fast), while still scaling without problems on the memory side. With the expand() method, it's now also possible to mix the features of DOM/SimpleXML/XSLT with XMLReader if you only have to process parts of an XML document.
Here are the scripts for reference:
SAX
XMLReader
Expand
Expand & SimpleXML
DOM & XPath
XSLT
XSLT w/ XMLReader
Comments
Bill Humphries
@ 11.05.2004 08:22 CEST
I need to dedicate a box to PHP5 testing.
Thanks for the write-up on this, because it gives me some ideas for strategies to use with XMLReader.
Daniel Veillard
@ 15.05.2004 19:33 CEST
Interesting. At the libxml2 level, SAX is about twice as fast as the xmlReader, but your experiment points out one more problem of the SAX API: when you cross language boundaries, the callbacks are extremely expensive, especially if converting strings is needed. That's why SAX is not a good API for exporting a fast parser to, say, PHP or Python; any advantage you may gain with the C parser is lost in the marshalling of the strings. The reader, in comparison, makes it possible to minimize the marshalling: you have far more integers (cheaper), all attributes and their values are marshalled only if asked for, and checking the element type allows you to short-circuit potentially expensive operations.
I think that adding reader operations like NextElement(Name?) or NextType(type) would have even more potential for fast processing, and would be very convenient for the kind of operations you describe: NextElement(title) would stop only once per article (and there is a lot of optimization possible at the libxml2 level for such searching).
Daniel
ashook
@ 24.03.2005 17:39 CEST
Hi Daniel,
I want to try your reference scripts, but I need memreport.php; it would be nice if you could send it to me, please.
Thanks in advance,
greetings,
Ashook
Oskar Austegard
@ 18.07.2005 20:23 CEST
Has anyone run a test on HUGE (multi-GB) XML files? Would XMLReader scale to this size?
chregu
@ 18.07.2005 20:28 CEST
Hi Oskar: XMLReader scales to any size as it only parses chunk by chunk.
giulio
@ 26.08.2005 12:38 CEST
Very useful, thanks!
I am currently using the combination of XMLReader + expand() and then handle the single needed node with DOM.
Great speed!
Ben Margolin
@ 04.10.2005 00:01 CEST
This was enormously useful info, thanks! Appreciate the examples. While it's slightly clunky to expand/SimpleXML import, it's VERY convenient; I was pleasantly surprised to see there isn't much of a performance penalty, either.
Looking forward to using XMLReader for processing giant feeds... (150MB+ XML...)
anusha
@ 10.01.2006 18:42 CEST
Hi,
Problem: I have to parse an XML file larger than 5 GB. If I do that using DOM, it throws an out-of-memory exception.
Which parser should I use? Should I go for SAX, or will I face the same problem?
Henry
@ 26.02.2006 12:04 CEST
The reference scripts have been removed. I would much appreciate having them available again.
Anyhow, I would like to make some speed/memory comparisons between DOM, SAX, XMLReader and SimpleXML as implementations of an xml2array parser. The structure should be the one given by PEAR::XML_Unserializer. Any idea or prediction which of them is most useful for this job? The XML structure I would work with is not very deeply nested, and attributes are rare too.
Thanks so far
Pravin
@ 14.03.2006 12:34 CEST
Hi,
I read your blog; it seems interesting and shows me a way to find a solution for parsing large XML (up to 200 MB) and storing it in a DB.
Can you send the example scripts you mentioned?
Hosato
@ 21.03.2006 02:56 CEST
I'm trying to use XMLReader on a Windows install of PHP 5.1.2 but can't find a way to enable it. Does it require that PHP be recompiled with --enable-xmlreader added to the configure line or is there a build of PHP out there with this already done? I'm doing this for a client and don't have the time to figure out how to compile PHP myself.
I'm trying to parse a 28MB XML file and load it into a MySQL database. Is there another way to do it that won't require me to recompile PHP?
Any help would greatly be appreciated. Thanks.
Roman
@ 23.08.2006 02:39 CEST
ashook: memreport.php is available at https://svn.bitflux.org/repos/public/php5examples/largexml/memreport.php
Kanthan Arul
@ 30.08.2006 12:43 CEST
Does this work with cross-domain fetching, without crossdomain.xml?
My testing showed varied results.
Could someone clarify, please?
bwdow
@ 21.12.2006 01:35 CEST
Nice article, but couldn't you give an example of reading XML with XMLReader? I was looking for an example.
chregu
@ 21.12.2006 08:53 CEST
bwdow: http://php5.bitflux.org/xml-namics/slide_64.php and the following slides have some examples
trevor
@ 26.02.2007 18:53 CEST
I think it's weird that we are going in circles. The reason large data volumes got broken down into a relational DB was for this EXACT situation: fast searching through giant data.
Now we've gone full circle, back to flat files, and needing a way to search them quickly again?
Why not just structure your file hierarchy like a database, where the folder names represent tables and the data in the XML is structured in a relational manner? This keeps the sizes down, and your main title file would only have two elements: name and location.
Technology seems to chase its tail an awful lot. I mean, if you have PHP 5, then why not just use a DB if you suspect your files are going to become huge?
That said, if your files are reasonable enough and you don't want a DB, this advice is great, and thanks a lot!
/tre
carreau
@ 30.06.2008 09:24 CEST
Could I ask you for memreport.php and the XML/XSL files to test your benchmark?
Your benchmark and blog are really interesting.
Thank you very much
JC
Mikesloper
@ 08.07.2008 15:03 CEST
Hey Christian
Great post and thanks for doing the work, saved me some time and effort.
All the files you need are here:
https://svn.bitflux.org/repos/public/php5examples/largexml/
Not Web Design
@ 21.07.2008 14:24 CEST
Perfect information - thank you. Saved me a lot of time. XMLReader it is then :)
@trevor - I agree that we are going in circles, but XML is often used to transport data, not for storage. So sometimes huge XML files have to be read and imported.
sonja
@ 16.01.2009 09:38 CEST
I'd just like to queue up with all the others.
I'm using XMLReader to do exactly the thing "Not Web Design" mentioned - and it works great, especially in combination with expand() and SimpleXML.
Thanks!
Satya Prakash
@ 02.12.2009 11:23 CEST
I have read this presentation http://php5.bitflux.org/phpconf2004/slide_29.php
Thanks!
Blabi
@ 15.03.2010 12:57 CEST
Great test results,
but where can I find the planet.xml?
Thanks in advance,
Blabi