Tuesday, April 20, 2010

XML in PHP 5 - What's New?

Author: Christian Stocker
Title: XML in PHP 5 - What's New?
Date: 18th March 2004

Intended Audience

Introduction

XML in PHP 4

XML in PHP 5

•  Streams support

SAX

DOM

•  Reading the DOM

•  XPath

•  Writing to the DOM

•  Extending Classes

•  HTML

Validation

SimpleXML

•  Writing to SimpleXML Documents

•  Interoperability

XSLT

•  Calling PHP Functions

Summary

Links

•  PHP 4 Specific

•  PHP 5 Specific

•  Standards

•  Tools




Intended Audience

This article is intended for PHP developers at all levels who are interested in using the new XML functionality in PHP 5. Only basic, general knowledge about XML is assumed. However, it's an advantage if you have already worked with XML in PHP.


Introduction

In today's Internet world, XML isn't just a buzzword anymore, but a widely accepted and used standard. Therefore XML support was taken more seriously for PHP 5 than it was in PHP 4. In PHP 4 you were almost always faced with non-standard, API-breaking, memory leaking, incomplete functionality. Although some of these deficiencies were dealt with in the 4.3 series of PHP 4, the developers nevertheless decided to dump almost everything and start from scratch in PHP 5.


This article will give an introduction to all the new exciting features PHP 5 has to offer regarding XML.


XML in PHP 4

PHP has had XML support from its early days. While this was "only" a SAX based interface, it did at least allow parsing any XML documents without too much hassle. Further XML support came with PHP 4 and the domxml extension. Later the XSLT extension, with Sablotron as backend, was added. During the PHP 4 life cycle, additional features like HTML, XSLT and DTD-validation were added to the domxml extension. Unfortunately, since the xslt and domxml extensions never really left the experimental stage, and changed their API more than once, they were not enabled by default, and frequently not installed on hosts. Furthermore, the domxml extension did not implement the DOM standard defined by the W3C, but had its own method-naming. While this was improved in the 4.3 series of PHP, together with a lot of memory leak and other fixes, it never reached a truly stable stage, and it was almost impossible to really fix the deeper issues. Also, only the SAX extension was enabled by default, so the other extensions never achieved widespread distribution.


For all these reasons, the PHP XML developers decided to start from scratch for PHP 5, and to follow commonly used standards.


XML in PHP 5

Almost everything regarding XML support was rewritten for PHP 5. All the XML extensions are now based on the excellent libxml2 library by the GNOME project. This allows for interoperability between the different extensions, so that the core developers only need to work with one underlying library. For example, the quite complex and now largely improved memory management had to be implemented only once for all XML-related extensions.


In addition to the better-known SAX support inherited from PHP 4, PHP 5 supports DOM according to the W3C standard and XSLT with the very fast libxslt engine. It also incorporates the new PHP-specific SimpleXML extension and a much improved, standards-compliant SOAP extension. Given the increasing importance of XML, the PHP developers decided to enable more XML support by default. This means that you now get SAX, DOM and SimpleXML enabled out of the box, which ensures that they will be installed on many more servers in the future. XSLT and SOAP support, however, still need to be explicitly configured into a PHP build.


Streams support

All the XML extensions now support PHP streams throughout, even when the stream is not opened directly from PHP code. In PHP 5 you can access a PHP stream, for example, from an <xsl:include> or an <xi:include> directive. Basically, you can access a PHP stream anywhere you can access a normal file.


Streams in general were introduced in PHP 4.3 and were further improved in PHP 5 as a way of generalizing file-access, network-access, and other operations that share a common set of functions. You can even implement your own streams with PHP code, and thus unify and simplify access to your data. See the PHP documentation for more details about that.


SAX

SAX stands for Simple API for XML. It's a callback-based interface for parsing XML documents. SAX support has been available since PHP 3 and hasn't changed a lot since then. For PHP 5 the API is unchanged, so your old code should still work. The only difference is that it's not based on the expat library anymore, but on the libxml2 library.


This change introduced some problems with namespace support, which are currently resolved in libxml2 2.6, but not in older versions of libxml2. Therefore, if you use xml_parser_create_ns(), you are strongly advised to install libxml2 2.6 or above on your system.
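
To make the callback model concrete, here is a minimal sketch of a SAX parser that collects the title elements from a small inline document. The sample XML, the variable names and the use of closures (which need PHP 5.3 or later) are assumptions made for illustration; on PHP 5.0 you would pass function names instead.

```php
<?php
// Hypothetical sample document, inlined so the sketch is self-contained.
$xml = '<articles>'
     . '<item><title>PHP Weekly</title></item>'
     . '<item><title>Rock-solid code</title></item>'
     . '</articles>';

$titles  = array();  // collected title texts
$inTitle = false;    // true while the parser is inside a <title> element

$parser = xml_parser_create();
// Keep element names as written; by default the parser upper-cases them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attrs) use (&$titles, &$inTitle) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Text may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$titles, &$inTitle) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

$ok = xml_parse($parser, $xml, true);

foreach ($titles as $title) {
    print $title . "\n";
}
```

Note how much state the script has to track by hand ($inTitle); this is the price of SAX's low memory use, and the reason the DOM and SimpleXML approaches below are usually more comfortable.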


DOM

DOM (Document Object Model) is a standard for accessing XML document trees, defined by the W3C. In PHP 4, the domxml extension was used for doing just that. The main problem with the domxml extension was that it didn't follow the standard method names. It also had memory leak issues for a long time (they were fixed in PHP 4.3).


The new DOM extension is completely based on the W3C standard, including method and property names. If you're familiar with DOM from other languages, for example in JavaScript, it will be much easier for you to code similar functionality in PHP. You don't have to check the documentation all the time, because the methods and parameters are identical.


As a consequence of this new W3C compatibility, your old domxml-based scripts won't work anymore. The API is quite different in PHP 5. But if you used the "almost W3C compatible" method names available in PHP 4.3, porting isn't such a big deal. You only need to change the loading and saving methods, and remove the underscore in the method names (the DOM standard uses studlyCaps). Other adjustments here and there may be necessary, but the main logic can stay the same.


Reading the DOM

I will not explain all the features of the DOM extension in this article; that would be overkill. You may want to bookmark the documentation available at http://www.w3.org/DOM, which basically corresponds to the implementation in PHP 5.


For most of the examples in this article we will use the same XML file: a much-simplified version of the RSS available at zend.com. Paste the following into a text file and save it as articles.xml:


<?xml version="1.0" encoding="iso-8859-1"?>
<articles>
    <item>
        <title>PHP Weekly: Issue # 172</title>
        <link>http://www.zend.com/zend/week/week172.php</link>
    </item>
    <item>
        <title>Tutorial: Develop rock-solid code in PHP: Part three</title>
        <link>http://www.zend.com/zend/tut/tut-hatwar3.php</link>
    </item>
</articles>


To load this example into a DOM object, you have to create a DomDocument object, and then load the XML file:


$dom = new DomDocument();
$dom->load("articles.xml");


As mentioned above, you could use a PHP stream to load an XML document. You would do this by writing:


$dom->load("file:///articles.xml");


(or any other type of stream, as appropriate).


If you want to output the XML document to the browser or as standard output, use:


print $dom->saveXML();


If you want to save it to a file, use:


print $dom->save("newfile.xml");


(Note that this action will send the filesize to stdout.)


There's not much functionality in this example, of course, so let's do something more useful: let's grab all the titles. There are different ways to do this, the easiest being getElementsByTagName($tagname):


$titles = $dom->getElementsByTagName("title");

foreach ($titles as $node) {
    print $node->textContent . "\n";
}


The property textContent is not part of the older DOM Level 2 specification; it comes from DOM Level 3 Core. It's a convenience property to access all the text nodes of an element quickly. The DOM Level 2 way to read this would have been:


$node->firstChild->data;


(but only if you were sure that firstChild was the text node you needed, otherwise you would have to loop through all the child nodes to find that).


One other thing to notice is that getElementsByTagName() returns a DomNodeList, and not an array as the similar function get_elements_by_tagname() did in PHP 4. But as you can see in the example, you can easily loop through it with a foreach directive. You could also directly access the nodes with $titles->item(0). This would return the first title element.
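
A short sketch of working with a DomNodeList by index rather than with foreach. The inline sample document and the variable names are assumptions made so that the snippet runs on its own.

```php
<?php
// Hypothetical sample document, inlined so the sketch runs on its own.
$dom = new DOMDocument();
$dom->loadXML(
    '<articles>'
    . '<item><title>First title</title></item>'
    . '<item><title>Second title</title></item>'
    . '</articles>'
);

$titles = $dom->getElementsByTagName('title');

$count   = $titles->length;                     // number of matched nodes
$first   = $titles->item(0)->firstChild->data;  // text node of the first title
$missing = $titles->item(5);                    // out of range, so null

// Index-based loop, as an alternative to foreach.
$all = array();
for ($i = 0; $i < $titles->length; $i++) {
    $all[] = $titles->item($i)->textContent;
}

print implode(", ", $all) . "\n";
```

Because item() returns null for an index that does not exist, it is worth checking the result before dereferencing it when the document structure is not guaranteed.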


Another approach to getting all the titles would be to loop through the nodes starting with the root element. As you can see, this is way more complicated, but it's also more flexible should you need more than just the title elements.


foreach ($dom->documentElement->childNodes as $articles) {
    // if node is an element (nodeType == 1) and the name is "item", loop further
    if ($articles->nodeType == 1 && $articles->nodeName == "item") {
        foreach ($articles->childNodes as $item) {
            // if node is an element and the name is "title", print it
            if ($item->nodeType == 1 && $item->nodeName == "title") {
                print $item->textContent . "\n";
            }
        }
    }
}


XPath

XPath is something like SQL for XML. With XPath you can query an XML document for a specific node matching some criteria. To get all the title nodes with XPath, just do the following:


<?php

$xp = new DomXPath($dom);
$titles = $xp->query("/articles/item/title");

foreach ($titles as $node) {
    print $node->textContent . "\n";
}

?>


This is almost the same code as with getElementsByTagName(), but XPath is much more powerful. For example, if we had a title element as a direct child of the articles element (instead of being the child of an item element), getElementsByTagName() would return it as well. With /articles/item/title we only pick up the title elements that are placed at the desired level. This is just a simple example; further possibilities might be:


  • /articles/item[position() = 1]/title returning the title element of the first item element.
  • /articles/item/title[@id = '23'] returning all title elements having an attribute id with the value 23
  • /articles//title returning all title elements that are placed below articles

You can also query for elements which have a specific sibling element, or which have a certain text content, or using namespaces, etc. If you have to query XML documents a lot, learning to use XPath properly will save you a lot of time. It's much easier to use, faster in execution, and requires less code than the standard DOM methods.
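
The queries listed above can be tried out with a sketch like the following. The sample document (which has an extra title directly below articles), the $texts helper and the closure syntax (PHP 5.3 or later) are assumptions made for illustration.

```php
<?php
// Hypothetical document with one <title> outside any <item>.
$dom = new DOMDocument();
$dom->loadXML(
    '<articles>'
    . '<title>Feed title</title>'
    . '<item><title id="23">First</title></item>'
    . '<item><title id="42">Second</title></item>'
    . '</articles>'
);
$xp = new DOMXPath($dom);

// Illustrative helper: turn a node list into an array of strings.
$texts = function ($nodes) {
    $out = array();
    foreach ($nodes as $node) {
        $out[] = $node->textContent;
    }
    return $out;
};

// Title of the first item only.
$firstItem = $texts($xp->query('/articles/item[position() = 1]/title'));

// Titles carrying a specific attribute value.
$byId = $texts($xp->query("/articles/item/title[@id = '23']"));

// Titles at any depth below articles, in document order.
$anyDepth = $texts($xp->query('/articles//title'));

// Titles at exactly the item level; the feed title is skipped.
$itemLevel = $texts($xp->query('/articles/item/title'));

// For comparison: getElementsByTagName() ignores the nesting level.
$byTagName = $dom->getElementsByTagName('title')->length;

print implode(", ", $anyDepth) . "\n";
```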

Writing to the DOM

The Document Object Model can not only be read and queried; you can also manipulate and write to it. (The DOM standard is a little verbose, because its writers tried to support just about every imaginable situation, but it does the job very well). See the next example, where a new element is added to our articles.xml:


$item = $dom->createElement("item");
$title = $dom->createElement("title");
$titletext = $dom->createTextNode("XML in PHP5");
$title->appendChild($titletext);
$item->appendChild($title);
$dom->documentElement->appendChild($item);

print $dom->saveXML();


First, we create all the needed nodes: an item element, a title element and a text node containing the title of the item. Then we chain all the nodes together by appending the text node to the title element and appending the title element to the item element. Finally we insert the item element into the root element articles, and voilà! - we have a new article listed in our XML document.
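
Appending is not the only way to change the tree. The following sketch, which assumes a small inline document instead of articles.xml, also sets an attribute, inserts a node at a chosen position and removes one again.

```php
<?php
// Hypothetical starting document with a single item.
$dom = new DOMDocument();
$dom->loadXML('<articles><item><title>Existing article</title></item></articles>');

// createElement() accepts the text content as an optional second argument.
$item  = $dom->createElement('item');
$title = $dom->createElement('title', 'XML in PHP5');
$title->setAttribute('id', '34');
$item->appendChild($title);

// Insert the new item before the current first child instead of appending it.
$root = $dom->documentElement;
$root->insertBefore($item, $root->firstChild);

// Remove the item that was there originally (it is now the last child).
$root->removeChild($root->lastChild);

// Passing a node to saveXML() serializes just that subtree.
$result = $dom->saveXML($root);

print $result . "\n";
```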


Extending Classes

While the above examples were all doable with PHP 4 and the domxml extension (only the API was a little bit different), the ability to extend DOM classes with your own code is a new feature of PHP 5. This makes it possible to write more readable code. Here's the whole example again, rewritten as a class that extends DomDocument:


class Articles extends DomDocument {
    function __construct() {
        // has to be called!
        parent::__construct();
    }

    function addArticle($title) {
        $item = $this->createElement("item");
        $titlespace = $this->createElement("title");
        $titletext = $this->createTextNode($title);
        $titlespace->appendChild($titletext);
        $item->appendChild($titlespace);
        $this->documentElement->appendChild($item);
    }
}

$dom = new Articles();
$dom->load("articles.xml");
$dom->addArticle("XML in PHP5");

print $dom->save("newfile.xml");


HTML

An often-overlooked feature in PHP 4 is the HTML support in libxml2. You can not only load well-formed XML documents with the DOM extension, but you can also load not-well-formed HTML documents, treat them as regular DomDocument objects, and use all the available methods and features such as XPath and SimpleXML.


This HTML capability is very useful if you need to access content from a website you don't control. With the help of XPath, XSLT or SimpleXML you avoid a lot of coding, as compared with using regular expressions or a SAX parser. This is especially useful if the HTML document is not well structured (a frequent problem!).


The code below fetches the php.net index page, parses it and prints the content of the first title element:


$dom = new DomDocument();
$dom->loadHTMLFile("http://www.php.net/");
$title = $dom->getElementsByTagName("title");

print $title->item(0)->textContent;


Note that you may get errors as part of your output when expected elements are not found.


If you're one of those people still outputting HTML 4 code on their web pages, there is good news for you, too. The DOM extension can not only load HTML documents, but can also save them as HTML 4. Just use $dom->saveHTML() after you have built up your DOM document. Note that, for simply making HTML code W3C standards compliant, you're far better off using the Tidy extension. The HTML support in libxml2 is not tuned for every possible case, and doesn't cope well with input in uncommon formats.
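
To try the HTML parser without a network connection, here is a sketch that feeds it a deliberately sloppy string. The markup is made up for the example, and libxml_use_internal_errors() is assumed to be available (it arrived in a later PHP 5 release than 5.0.0).

```php
<?php
// Deliberately sloppy markup: unclosed <p> and <b> tags, no doctype.
$html = '<html><head><title>Example page</title></head>'
      . '<body><p>First paragraph<p>Second <b>paragraph</body></html>';

// Collect parser complaints internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// The repaired tree can be used like any other DomDocument.
$title      = $dom->getElementsByTagName('title')->item(0)->textContent;
$paragraphs = $dom->getElementsByTagName('p')->length;

$xp   = new DOMXPath($dom);
$bold = $xp->query('//p/b')->item(0)->textContent;

print $title . "\n";
```

The parser closes the first paragraph when it meets the second one, so the tree ends up with two p elements even though the input never closed either of them.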


Validation

Validation of XML documents is getting more and more important. For example, if you get an XML document from some foreign source, you need to verify that it follows a certain format before you can process it. Luckily it's not necessary to write your own validating code in PHP, because you can use one of the three widely used standards for doing this: DTD, XML Schema or RelaxNG.


  • DTD is a standard that comes from SGML days, and lacks some of the newer XML features (like namespaces). Also, because it's not written in XML, it's not easily parsed and/or transformed.
  • XML Schema is a standard defined by the W3C. It's very extensive and has taken care of almost every imaginable need for validating XML documents.
  • RelaxNG was an answer to the complex XML Schema standard, and was created by an independent group. More and more programs support RelaxNG, since it's much easier to implement than XML Schema.

If you don't have legacy schema documents, or overly complex XML documents, go for RelaxNG. It's easier to write, easier to read, and more and more tools support it. There's even a tool called Trang, which automatically creates a RelaxNG document from sample XML document(s). Furthermore, only RelaxNG (and the aging DTDs) is fully supported by libxml2, although full XML Schema support is coming along.


The syntax for validating XML documents is quite simple:


  • $dom->validate(); (takes no argument: the document is checked against the DTD named in its own DOCTYPE declaration)
  • $dom->relaxNGValidate('articles.rng');
  • $dom->schemaValidate('articles.xsd');

At present, these all simply return true or false. Errors are dumped out as PHP warnings. Obviously this is not the ideal way to give good feedback to the user, and it will be enhanced in one of the releases after PHP 5.0.0. The exact implementation is currently under discussion, but will certainly lead to better error reporting for parse errors and so on.
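
Here is a self-contained sketch of RelaxNG and DTD validation. The schema, the DTD and both sample documents are made up for the example, and relaxNGValidateSource() is used instead of relaxNGValidate() only so that no schema file is needed. Collecting messages with libxml_use_internal_errors() assumes a later PHP 5 release than 5.0.0.

```php
<?php
// Hypothetical RelaxNG schema: <articles> holds one or more <item> elements,
// each with a <title> followed by a <link>.
$schema = '<grammar xmlns="http://relaxng.org/ns/structure/1.0">'
        . '<start><element name="articles"><oneOrMore>'
        . '<element name="item">'
        . '<element name="title"><text/></element>'
        . '<element name="link"><text/></element>'
        . '</element>'
        . '</oneOrMore></element></start>'
        . '</grammar>';

$good = '<articles><item><title>T</title>'
      . '<link>http://example.com/</link></item></articles>';
$bad  = '<articles><item><title>Link is missing</title></item></articles>';

// Collect validation messages instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadXML($good);
$goodIsValid = $dom->relaxNGValidateSource($schema);

$dom = new DOMDocument();
$dom->loadXML($bad);
$badIsValid = $dom->relaxNGValidateSource($schema);

$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

// DTD validation: validate() uses the DOCTYPE found in the document itself.
$dom = new DOMDocument();
$dom->loadXML(
    '<!DOCTYPE articles ['
    . '<!ELEMENT articles (item+)>'
    . '<!ELEMENT item (title, link)>'
    . '<!ELEMENT title (#PCDATA)>'
    . '<!ELEMENT link (#PCDATA)>'
    . ']>' . $good
);
$dtdIsValid = $dom->validate();

libxml_use_internal_errors($previous);

print ($badIsValid ? "valid" : "invalid") . "\n";
```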

SimpleXML

SimpleXML is the latest addition to the XML family in PHP. The goal of the SimpleXML extension is to provide easy access to XML documents using standard object properties and iterators. This extension doesn't have many methods, but it's quite powerful nonetheless. Getting all the title nodes from our document requires even less code than before:


$sxe = simplexml_load_file("articles.xml");

foreach ($sxe->item as $item) {
    print $item->title . "\n";
}


What does this do? It first loads articles.xml into a SimpleXML object. Then it gets all elements named item with the property $sxe->item. Finally $item->title gives us the content of the title element. That's it. You could also query attributes with associative arrays, using: $item->title['id'].


As you can see, there's a lot of magic behind this, and there are different ways to get the desired result. For example, $item->title[0] returns the same result as the example. On the other hand, foreach($sxe->item->title as $item) only returns the first title, and not all the titles stored in the document (as I - coming from XPath - would have expected).
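
The different access paths are easier to compare side by side. This sketch assumes a small inline document in which the first item has two title elements, so that the difference between the two loops becomes visible.

```php
<?php
// Hypothetical document; the first item deliberately has two titles.
$sxe = simplexml_load_string(
    '<articles>'
    . '<item><title id="23">First</title><title>Alternate</title></item>'
    . '<item><title id="42">Second</title></item>'
    . '</articles>'
);

// Properties return SimpleXMLElement objects; cast to get plain strings.
$firstTitle = (string) $sxe->item[0]->title;        // same as ->title[0]
$firstId    = (string) $sxe->item[0]->title['id'];  // attribute access
$itemCount  = count($sxe->item);

// Iterating $sxe->item visits every <item> child of the root element.
$perItem = array();
foreach ($sxe->item as $item) {
    $perItem[] = (string) $item->title;
}

// $sxe->item->title only looks inside the first <item>.
$firstItemOnly = array();
foreach ($sxe->item->title as $title) {
    $firstItemOnly[] = (string) $title;
}

print implode(", ", $perItem) . "\n";
```

The second loop never reaches the title of the second item, which is the behavior described above: $sxe->item without an index stands for the first item only.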


SimpleXML is actually one of the first extensions to use most of the new features available with Zend Engine 2. It's therefore also the testing ground for these new features. You should be aware that bugs and unexpected behavior are not uncommon during this stage of development.


Besides the traditional "loop through all the nodes" approach, as shown in the example above, there's also an XPath interface in SimpleXML, which provides even easier access to individual nodes:


foreach ($sxe->xpath('/articles/item/title') as $item) {
    print $item . "\n";
}


Admittedly the code isn't shorter than in the previous example, but given more complex or deeply nested XML documents you'll find that using XPath together with SimpleXML saves you a lot of typing.


Writing to SimpleXML documents

You can not only parse and read, but also change SimpleXML documents. At least, to some extent:


$sxe->item->title = "XML in PHP5 ";  //new text content for the title element
$sxe->item->title['id'] = 34; // new attribute for the title element
$xmlString = $sxe->asXML(); // returns the SimpleXML object as a serialized XML string
print $xmlString;
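
The following sketch repeats the changes above on an inline document and reads them back. It also uses addChild() to create new elements; that method is assumed to be available, as it was only added in a later PHP 5 release and is not part of the original PHP 5.0 API.

```php
<?php
// Hypothetical starting document with a single item.
$sxe = simplexml_load_string(
    '<articles><item><title>Old title</title></item></articles>'
);

$sxe->item->title       = 'XML in PHP5';  // replace the text content
$sxe->item->title['id'] = 34;             // add an attribute

// Assumption: addChild() exists (later PHP 5 releases only).
$newItem = $sxe->addChild('item');
$newItem->addChild('title', 'Second article');

// Read everything back to confirm the changes.
$titles = array();
foreach ($sxe->item as $item) {
    $titles[] = (string) $item->title;
}
$id = (string) $sxe->item[0]->title['id'];

print $sxe->asXML();
```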


Interoperability

As SimpleXML is also based on libxml2, you can easily convert SimpleXML objects to DomDocument objects and vice versa without a big impact on speed (the document doesn't have to be copied internally). With this mechanism you can have the best of both worlds, using the tool best suited for the job in hand. It works with the following methods:


  • $sxe = simplexml_import_dom($dom);
  • $dom = dom_import_simplexml($sxe);
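
A small sketch of the round trip, using an inline document made up for the example. Note that dom_import_simplexml() hands back the element node that the SimpleXML object stands for; the owning document is reachable through its ownerDocument property.

```php
<?php
// Hypothetical document, loaded through the DOM extension.
$dom = new DOMDocument();
$dom->loadXML('<articles><item><title>Old title</title></item></articles>');

// Same underlying tree, now seen through SimpleXML.
$sxe = simplexml_import_dom($dom);
$sxe->item->title = 'New title';

// The change is visible through the DOM object, because nothing was copied.
$viaDom = $dom->getElementsByTagName('title')->item(0)->textContent;

// And back again: this returns the root element as a DOM node.
$element  = dom_import_simplexml($sxe);
$rootName = $element->nodeName;

print $viaDom . "\n";
```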

XSLT

XSLT is a language for transforming XML documents into other XML documents. XSLT is itself written in XML, and belongs to the family of functional languages, which have a different approach to that of procedural and object-orientated languages like PHP.


There were two different XSLT processors implemented in PHP 4: Sablotron (in the more widely used and known xslt extension), and libxslt (within the domxml extension). The two APIs were not compatible with each other, and their feature sets were also different.


In PHP 5, only the libxslt processor is supported. Libxslt was chosen because it's also based on libxml2 and therefore fits perfectly into the XML concept of PHP 5.


It would theoretically be possible to port the Sablotron binding to PHP 5 as well, but unfortunately no one has done so yet. Therefore, if you're using Sablotron you will have to switch to the libxslt processor for PHP 5. libxslt is - with the exception of the JavaScript support - feature-equivalent to Sablotron. Even the useful Sablotron-specific scheme handlers can be reimplemented with the much more powerful and portable PHP streams. In addition, libxslt is one of the fastest XSLT implementations available, so you'll get a nice speed boost for free (the execution speed can be double that of Sablotron).


As with all the other extensions discussed in this article, you can exchange XML documents from the XSL extension to the DOM extension and vice versa. In fact you have to, as ext/xsl doesn't have an interface to load and save XML documents, but uses the one from the DOM extension.


You don't need many methods for starting an XSLT transformation, and there is no W3C standard for it, therefore the API was "borrowed" from Mozilla.


First, you need an XSLT stylesheet. Paste the following into a new file and save it as articles.xsl:


<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="titles" />
  <xsl:template match="/articles">
    <h2><xsl:value-of select="$titles" /></h2>
    <xsl:for-each select=".//title">
      <h3><xsl:value-of select="." /></h3>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


Then call it with a PHP script:


<?php

/* load the xml file and stylesheet as domdocuments */
$xsl = new DomDocument();
$xsl->load("articles.xsl");
$inputdom = new DomDocument();
$inputdom->load("articles.xml");

/* create the processor and import the stylesheet */
$proc = new XsltProcessor();
$proc->importStylesheet($xsl);
$proc->setParameter(null, "titles", "Titles");

/* transform and output the xml document */
$newdom = $proc->transformToDoc($inputdom);

print $newdom->saveXML();

?>


The above example first loads the XSLT stylesheet articles.xsl with the help of the DOM method load(). Then it creates a new XsltProcessor object, which imports the loaded XSLT stylesheet for later execution. Parameters can be set with setParameter(namespaceURI, name, value), and finally it starts the transformation with transformToDoc($inputdom), which returns a new DomDocument.


This API has the advantage that you can make dozens of XSLT transformations with the same stylesheet, just loading it once and reusing it, as transformToDoc() can be applied to different XML documents.


Besides transformToDoc(), there are two other transformation methods: transformToXML($dom), which returns a string, and transformToURI($dom, $uri), which saves the transformation to a file or a PHP stream. Note that if you want to use an XSLT feature such as <xsl:output method="html"> or indent="yes", you can't use transformToDoc(), because the DomDocument cannot retain this information. These directives will be used only if you output the transformation directly to a string or a file.
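
The sketch below shows both points at once: one imported stylesheet applied to two documents, and transformToXML() honoring an <xsl:output> directive. The stylesheet, the prefix parameter and the two sample documents are made up for the example, and the xsl extension is assumed to be compiled in.

```php
<?php
// Hypothetical stylesheet that writes plain text, one title per line.
$xsl = new DOMDocument();
$xsl->loadXML(
    '<xsl:stylesheet version="1.0"'
    . ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    . '<xsl:output method="text"/>'
    . '<xsl:param name="prefix"/>'
    . '<xsl:template match="/articles">'
    . '<xsl:for-each select="item/title">'
    . '<xsl:value-of select="$prefix"/>'
    . '<xsl:value-of select="."/>'
    . '<xsl:text>&#10;</xsl:text>'
    . '</xsl:for-each>'
    . '</xsl:template>'
    . '</xsl:stylesheet>'
);

// Import once ...
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
$proc->setParameter('', 'prefix', '- ');

// ... and transform as many documents as needed.
$first = new DOMDocument();
$first->loadXML(
    '<articles><item><title>A</title></item>'
    . '<item><title>B</title></item></articles>'
);
$second = new DOMDocument();
$second->loadXML('<articles><item><title>C</title></item></articles>');

// transformToXml() returns a string and respects method="text".
$out1 = $proc->transformToXml($first);
$out2 = $proc->transformToXml($second);

print $out1 . $out2;
```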


Calling PHP Functions

One of the latest features added to the XSLT extension is the ability to call any PHP function from within an XSLT stylesheet. While XML/XSLT purists will certainly dislike this (such stylesheets won't be portable anymore, and could easily mix logic and design), it can be very useful in some special cases. XSLT is very limited when it comes down to functions. Even outputting a date in different languages can be painful to implement - but with this feature it's no more complicated than with PHP itself. Here's the PHP snippet for adding a function into XSLT:


<?php

function dateLang() {
    return strftime("%A");
}

$xsl = new DomDocument();
$xsl->load("datetime.xsl");
$inputdom = new DomDocument();
$inputdom->load("today.xml");

$proc = new XsltProcessor();
$proc->registerPhpFunctions();

/* import the stylesheet into the processor */
$proc->importStylesheet($xsl);

/* transform and output the xml document */
$newdom = $proc->transformToDoc($inputdom);

print $newdom->saveXML();

?>


Here's the XSLT stylesheet, datetime.xsl, that will call that function:


<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:php="http://php.net/xsl">
<xsl:template match="/">
  <xsl:value-of select="php:function('dateLang')" />
</xsl:template>
</xsl:stylesheet>


And here's an absolute minimal XML file, today.xml, to pass through the stylesheet (although articles.xml would achieve the same result):


<?xml version="1.0" encoding="iso-8859-1"?>
<today></today>


The stylesheet above, together with the PHP script and any xml file loaded, will output the current weekday name in the language defined in the locale settings. You could add more arguments to php:function(), which would also be passed to the PHP function. Additionally, there's php:functionString(). This function automatically converts all input parameters to strings, so that you don't need to convert them when they reach PHP.


Note that you will need to call $proc->registerPhpFunctions(); before the transformation; otherwise, for security reasons, the PHP function calls will not work (can you always trust your XSLT stylesheets?). A more refined access system (i.e. one that limits access to specific methods) is not available yet, but would not be impossible to implement in a future PHP 5 release.
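
To see argument passing and the difference between the two XSLT functions, here is a sketch that calls two built-in PHP functions from an inline stylesheet. The stylesheet, the label attribute and the choice of strtoupper() and strrev() are assumptions made for the example; php:functionString() is used where a node is passed, so that PHP receives a plain string.

```php
<?php
// Hypothetical stylesheet calling built-in PHP functions.
$xsl = new DOMDocument();
$xsl->loadXML(
    '<xsl:stylesheet version="1.0"'
    . ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform"'
    . ' xmlns:php="http://php.net/xsl">'
    . '<xsl:output method="text"/>'
    . '<xsl:template match="/">'
    // The attribute node is converted to a string before PHP sees it.
    . '<xsl:value-of select="php:functionString(\'strtoupper\', /today/@label)"/>'
    . '<xsl:text>: </xsl:text>'
    // A literal string argument is passed through as it is.
    . '<xsl:value-of select="php:function(\'strrev\', \'abc\')"/>'
    . '</xsl:template>'
    . '</xsl:stylesheet>'
);

$inputdom = new DOMDocument();
$inputdom->loadXML('<today label="monday"/>');

$proc = new XSLTProcessor();
$proc->registerPHPFunctions();   // without this call the stylesheet fails
$proc->importStylesheet($xsl);

$out = $proc->transformToXml($inputdom);

print $out . "\n";
```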


Summary

XML support in PHP has taken a great step forward. It is standards-compliant, well behaved, feature-rich and interoperable, and because much of it is enabled by default, it can now be taken for granted.


PHP 4's much-disliked domxml extension has been completely rewritten. The new DOM extension follows the W3C standard almost to the dot, and has also resolved a lot of internal memory problems. With the added support of some general PHP features, such as class inheritance and stream support, even more powerful and tightly integrated XML applications will be possible.


The newly added SimpleXML extension is an easy and fast way to access XML documents. It can save you a lot of coding, especially if you have structured documents or are able to use the power of XPath.


Thanks to libxml2, the underlying library used for all PHP 5 XML extensions, validation of XML documents using DTD, RelaxNG or (to some extent) XML Schema is now supported.


XSLT support also got a facelift and now uses the libxslt library, which should improve performance over the old Sablotron library. Furthermore, the ability to call PHP functions from within XSLT stylesheets allows you to write more powerful (though unfortunately less portable) XSLT code.


If you used XML in PHP 4 or in another language, you will love the XML support in PHP 5. XML in PHP 5 is much improved, is standards-compliant, and is finally on a par with other tools and languages.


Links

PHP 4 specific




PHP 5 specific




Standards




Tools

