
</first>
<second>
<name></name>
510 Part IV: Not So Simple Applications

</second>
<third>
<name>Joe</name>
</third>
<fourth>
<name>Jill</name>
<name>Bob</name>
</fourth>
</outer>
EOT;

Then we can point SimpleXML at it and print out the object we receive as a result:

$xml = simplexml_load_string($doc);
print_r($xml);

The results are as follows:

simplexml_element Object
(
    [first] => simplexml_element Object
        (
        )
    [second] => simplexml_element Object
        (
            [name] => simplexml_element Object
                (
                )
        )
    [third] => simplexml_element Object
        (
            [name] => Joe
        )
    [fourth] => simplexml_element Object
        (
            [name] => Array
                (
                    [0] => Jill
                    [1] => Bob
                )
        )
)
Chapter 15: XML Parsing 511

The first tag, <first>, had no content, so it becomes an empty
simplexml_element object. The <second> tag did have some content, a <name> tag,
but <name> itself was empty. We end up with an object having a single property,
called name, whose value is another empty object. But now look at the <third>
tag. This time <name> had a value, the string Joe. Now the name property,
instead of pointing to another object, has a simple string value. And in the
<fourth> tag, where the <name> tag was repeated, the result is an array of
strings. This all makes good sense when you compare the output to the XML source.
But if you don't know what that source looked like, then discovering the nature
of what you've received can be a little tricky. If you're going to be routinely
parsing complicated documents that can have widely varying contents, then using
something like the DOM XML parser, which supports XPath functions you can use
to search through the XML in an easy yet powerful fashion, might be the way to go.
On the other hand, to use those routines, you have to learn new syntax rules,
lots of new method calls, and so on. Whereas objects: we know from objects.
SimpleXML lets you focus more on the PHP side of the problem, and if that's where
most of your real work lies, it can be a godsend.
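
One way to cope with not knowing the shape in advance is to lean on the fact
that a SimpleXML element can be iterated and cast to a string whether a tag
appeared zero, one, or many times. What follows is a minimal sketch, assuming a
current PHP build in which the class is named SimpleXMLElement (the output shown
earlier comes from an older build). The helper name child_names() and the sample
document are made up for illustration.

```php
<?php
// Hypothetical helper: collect the text of every <name> child of $parent,
// whether the tag is absent, appears once, or repeats.
function child_names(SimpleXMLElement $parent)
{
    $names = [];
    foreach ($parent->name as $name) {
        $names[] = (string) $name;   // cast the element to its text content
    }
    return $names;
}

// Sample document, assumed to match the shape discussed in the text.
$doc = <<<EOT
<outer>
<first></first>
<second><name></name></second>
<third><name>Joe</name></third>
<fourth><name>Jill</name><name>Bob</name></fourth>
</outer>
EOT;

$xml = simplexml_load_string($doc);

foreach (['first', 'second', 'third', 'fourth'] as $tag) {
    print $tag . ': ' . implode(', ', child_names($xml->$tag)) . "\n";
}
```

Because the loop always produces a plain PHP array of strings, the calling code
never has to ask whether it was handed an object, a string, or an array.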



Code Breakdown
Most of the work we'll be doing, as you saw in the smaller version earlier in the
chapter, has to do with what happens at our end of the connection. We always want
to be able to deliver some kind of content to our Netsloth front door, even if it
isn't always current up to the instant the user sees the page. If we were
delivering stock quotations, for instance, we'd have to worry much more about
that subject. But for news headlines (particularly of the sort covered by a site
like Slashdot, more valuable for their uniqueness than as the latest breaking
stories), being roughly up-to-date is fine. Besides, Slashdot specifically
requests that you not hit its site more than once every 30 minutes, or else you
might find yourself banned. One advantage we'll have is that we know what we'll
be getting: the XML we'll receive from Slashdot is quite predictable. A typical
Slashdot headline XML document is formatted like this (though, again, none of
these stories are real):

<?xml version="1.0"?>
<backslash xmlns:backslash="http://slashdot.org/backslash.dtd">

<story>
<title>Aliens Sue OCS For Copyright Infringement</title>

<url>http://slashdot.org/article.pl?sid=03/08/15/1454224</url>
<time>1917-08-15 15:51:00</time>
<author>nobody</author>
<department>this-is-completely-fake</department>
<topic>107</topic>
<comments>65</comments>
<section>aliens</section>
<image>topicms.gif</image>
</story>

<story>
<title>Diesel-Powered Barbie a Hit in Midwest</title>

<url>http://slashdot.org/article.pl?sid=03/08/15/1451223</url>
<time>3002-08-15 14:55:00</time>
<author>no one</author>
<department>we-made-these-up-they-are-not-real</department>
<topic>126</topic>
<comments>3014</comments>
<section>basement</section>
<image>topictoys.gif</image>
</story>

<story>
<title>Change Most Often Found Within Other Pants</title>

<url>http://slashdot.org/article.pl?sid=03/08/14/2222214</url>
<time>666-08-15 12:05:00</time>
<author>CowgirlJane</author>
<department>because-it-would-be-wrong</department>
<topic>134</topic>
<comments>253</comments>
<section>science</section>
<image>topicscience.gif</image>
</story>

</backslash>

Note that each story element has associated title, url, time, author,
department, topic, comments, section, and image elements. Further note that the
whole document is a backslash element; that is, it's bounded by <backslash> and
</backslash> tags.
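
To see how that predictability pays off, here is a rough sketch of walking the
story elements with SimpleXML. It assumes a current PHP build, uses a cut-down
copy of the sample feed above, and is not the chapter's own code. The
$headlines array is a name invented for this illustration.

```php
<?php
// A cut-down sample in the same shape as the feed shown above.
$feed = <<<EOT
<?xml version="1.0"?>
<backslash xmlns:backslash="http://slashdot.org/backslash.dtd">
<story>
<title>Aliens Sue OCS For Copyright Infringement</title>
<url>http://slashdot.org/article.pl?sid=03/08/15/1454224</url>
<comments>65</comments>
</story>
<story>
<title>Diesel-Powered Barbie a Hit in Midwest</title>
<url>http://slashdot.org/article.pl?sid=03/08/15/1451223</url>
<comments>3014</comments>
</story>
</backslash>
EOT;

$xml = simplexml_load_string($feed);

// Copy the fields we care about into plain PHP arrays.
$headlines = [];
foreach ($xml->story as $story) {
    $headlines[] = [
        'title'    => (string) $story->title,
        'url'      => (string) $story->url,
        'comments' => (int) $story->comments,
    ];
}

foreach ($headlines as $h) {
    print "{$h['title']} ({$h['comments']} comments)\n";
}
```

Casting each child to a string or integer right away keeps the rest of the
script working with ordinary PHP values rather than SimpleXML objects.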


Laying the groundwork
All of the code is contained in a single file, slashdot.php, that lives in the
/book/xml-rpc directory. But it gets displayed via an include() call from the
front page of the enhanced version of Netsloth, which lives in /book/netsloth21.
So the first order of business here, besides setting up the remote URL we're
working with, is to figure out just where in the heck we are:

<?php
// example using new simplexml extension

// made possible by the nice folks at slashdot.org
$url = 'http://www.slashdot.org/slashdot.xml';

// let's get some stuff out of the way
$this_dir = dirname(__FILE__);
$root_dir = str_replace(
    $_SERVER['PHP_SELF']
    , ''
    , $_SERVER['PATH_TRANSLATED']
);
$src_dir = str_replace(
    $root_dir
    , ''
    , $this_dir
);

We are going to want to create URLs that point to content in the current
directory, /book/xml-rpc, but that is not the "current directory" from the Web
server's point of view, because we're being included from another script. That
script's name is in $_SERVER['PHP_SELF'], and the full file system path to that
script is in $_SERVER['PATH_TRANSLATED']. So we make the assumption that we share
a common root directory with the script that called us, and do some string
algebra to remove that common root from our current location, leaving us with,
in theory, the correct Web server path to where we are.
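
To make the string algebra concrete, here is a small sketch that runs the same
two str_replace() calls on made-up values, so it can be tried without a Web
server. The function name web_path_of() and the /var/www/html document root are
assumptions for illustration only.

```php
<?php
// Hypothetical helper wrapping the string algebra described above.
// All three arguments are plain strings standing in for dirname(__FILE__),
// $_SERVER['PHP_SELF'], and $_SERVER['PATH_TRANSLATED'].
function web_path_of($this_dir, $php_self, $path_translated)
{
    // Strip the calling script's URL path from its file system path,
    // leaving the (assumed shared) document root.
    $root_dir = str_replace($php_self, '', $path_translated);

    // Strip that root from our own directory to get our Web path.
    return str_replace($root_dir, '', $this_dir);
}

// Made-up values; /var/www/html is an assumed document root.
print web_path_of(
    '/var/www/html/book/xml-rpc',
    '/book/netsloth21/index.php',
    '/var/www/html/book/netsloth21/index.php'
) . "\n";   // prints /book/xml-rpc
```

If the shared-root assumption does not hold (an aliased directory, say), the
second str_replace() finds nothing to remove and the full file system path comes
back unchanged, which is a handy symptom to look for when links come out wrong.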
Next, we check to see if it's time to get new stories from the Slashdot site. If
it is, we try to download them. If the download fails, because the network
connection has trouble, or the remote site is down, or someone in Boise went nuts
with a backhoe, we pick up the last copy we downloaded from a cache file. (If
that fails too, there's not much to do but apologize and give up.):


// a file to hold previously retrieved data
$cachefile = "{$this_dir}/slashdot.xml";

// We want nice clean error messages.
ini_set('display_errors', 0);

// first, we need to check whether we've hit their
// site within the last half-hour, BECAUSE:
//
// "For those who don't know, you can get slashdot.rdf
// or slashdot.xml to receive a list of headlines for
// Slashdot. The document is fairly self explanatory,
// and the rules are simple: Do whatever you want, but
// don't access the file more than once every 30 minutes.
// The server is plenty bogged down without adding a
// hundred stock tickers refreshing themselves every
// 60 seconds. If your automated loading of slashdot
// becomes too much of a burden on our servers, you
// run the risk of having your IP banned, so play fair!"
//
// -- http://slashdot.org/code.shtml
//

$xml = false;
if ( !file_exists($cachefile)
    or !($last_time = @filectime($cachefile))
)
{
    $last_time = 0;
}
else
{
    $xml = @file_get_contents($cachefile);
}

$this_time = time();
if ($last_time + 1800 < $this_time)
{
    // OK, try getting a new copy
    $newxml = @file_get_contents($url);
    if ($newxml === FALSE)
    {
        error_log("Unable to contact $url");
        if ($xml === FALSE)
        {
            error_log("Unable to open cache file: $cachefile");
            $msg = 'Unable to obtain Slashdot.org content. Please try again later.';
            error_log($msg);
            print "<h3>$msg</h3>\n";
            return;
        }
        $this_time = $last_time;
    }



}

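Note that we use the built-in PHP function filectime() to get the time at which
the cache file was last changed. (On Unix-like systems this is the inode change
time rather than a true creation time, but because we rewrite the file on every
successful download, it marks our last fetch.) That lets us tell users how
current the information they're looking at is.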
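
The timing test itself is small enough to pull out and try on its own. The
sketch below is one way it might look, assuming a hypothetical helper named
cache_needs_refresh() and a throwaway file in the system temp directory. It is
not part of slashdot.php.

```php
<?php
// Hypothetical helper: true when the cache file is missing, unreadable, or
// older than $min_interval seconds (1800 = the 30 minutes Slashdot asks for).
function cache_needs_refresh($cachefile, $now, $min_interval = 1800)
{
    if (!file_exists($cachefile)) {
        return true;
    }
    $last_time = @filectime($cachefile);
    if (!$last_time) {
        return true;    // could not read the timestamp
    }
    return $last_time + $min_interval < $now;
}

// Try it against a throwaway file created just now.
$tmp = tempnam(sys_get_temp_dir(), 'sd');

print 'just written: ';
var_dump(cache_needs_refresh($tmp, time()));

print 'an hour on: ';
var_dump(cache_needs_refresh($tmp, time() + 3600));

unlink($tmp);
```

Passing the current time in as an argument, rather than calling time() inside
the helper, makes it easy to pretend that an hour has gone by.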
If we are able to load the XML file from the URL, then we need to save it out as
a new copy of the cache file, in preparation for times to come:

    else
    {
        $xml = $newxml;
        if (!@file_put_contents($cachefile, $xml))
        {
            // you might want to put an alert to the site webmaster
            // in here - if permission problems are preventing you
            // from caching the headlines, you'll hit the site too
            // often, and There Will Be Trouble. possibly including
            // flying monkeys.
            error_log("Problem caching Slashdot content to $cachefile");
        }
    }
}
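
Put together, the fetch-then-fall-back logic can be tried without touching the
network at all. The following is only a sketch under stated assumptions: the
function load_headlines() and the stand-in fetchers $works and $broken are
invented here, and the 30-minute check is left out to keep the example short.

```php
<?php
// Hypothetical wrapper around the fetch-or-fall-back logic. $fetch is any
// callable that returns the XML string, or false on failure, so the network
// call can be swapped out when experimenting.
function load_headlines($fetch, $cachefile)
{
    $newxml = call_user_func($fetch);
    if ($newxml !== false) {
        // Fresh copy: save it for next time, but use it even if saving fails.
        if (@file_put_contents($cachefile, $newxml) === false) {
            error_log("Problem caching content to $cachefile");
        }
        return $newxml;
    }
    // Fetch failed: fall back to the cache, or false if that is missing too.
    return file_exists($cachefile) ? @file_get_contents($cachefile) : false;
}

// Stand-ins for a working and a failing remote site.
$works  = function () { return '<backslash/>'; };
$broken = function () { return false; };

$cachefile = sys_get_temp_dir() . '/headline-cache-' . uniqid() . '.xml';

var_dump(load_headlines($broken, $cachefile));  // nothing cached yet
var_dump(load_headlines($works, $cachefile));   // fetched and cached
var_dump(load_headlines($broken, $cachefile));  // served from the cache

unlink($cachefile);
```

The three calls walk through the three situations the text describes: no cache
and no connection, a successful download, and a failed download rescued by the
cached copy.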

Then we run the XML that's now sitting in the variable $xml through SimpleXML
and create an object. We also check to see how many stories we're supposed to
display. In the Netsloth21 home page, we set $storycount to the number we want to
use. That variable will be visible here. In case we're being called by someone
else, we also check to see if $storycount was passed as part of the URL, or from
a form, by looking in the $_REQUEST superglobal. If no one has told us
differently, we set the count to zero, meaning that we want to display every
story we can get:
