Following on from my previous post Scraping data with PHP and cURL, today I’m going to show you how to read a website’s contents using PHP and XPath.

One of the biggest problems I have faced when dealing with a scraped web page (read: a block of HTML) is how to get specific text elements out of it. Several years ago I would’ve gone with regex. That was always a hassle, and I’ve still yet to fully get my head around how it works. These days, though, I have discovered a useful little thing called XPath.

Basically, how it works is you load the HTML into a DOMDocument, and then you can use XPath to navigate through the elements and attributes of the document, pulling their data out in the process.
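To make that concrete, here is a minimal sketch of the whole round trip on a tiny hard-coded HTML string (the string here is just a stand-in for whatever you scraped):

```php
<?php
// A stand-in for HTML you scraped from a real site
$content = '<html><head><title>Test</title></head><body><p>Hello</p></body></html>';

$dom = new DOMDocument();
// The @ suppresses warnings that real-world, imperfect HTML often triggers
@$dom->loadHTML($content);

$xpath = new DOMXPath($dom);

// Query the text node inside the <title> element
$title = $xpath->query('//title/text()');
echo $title->item(0)->nodeValue; // Test
```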

Using PHP and XPath

Now if we were to consider the following HTML:

<html>   
  <head>
    <title>Test</title>
  </head>
  <body>
    <div id="mainDiv">
      <div id="leftDiv">
        <p class="bodyText">This is left</p>
        <img src="images/test.jpg" />
      </div>
      <div id="rightDiv">
        <p class="bodyText">This is right</p>
      </div>
    </div>
    <div id="footer">
      <p class="smallText">This is the footer</p>
    </div>
  </body>
</html>

And now, let’s say I wanted to get the text of the rightDiv. This is how we would go about it:

<?php
  // $content is the content you scraped (via curl for example)
  $dom = new DOMDocument();
  @$dom->loadHTML($content);
  $xpath = new DOMXPath($dom);
  $rightDivText = $xpath->query("//html/body/div[@id='mainDiv']/div[@id='rightDiv']/p/text()"); 
  
  // Returns: This is right
  echo $rightDivText->item(0)->nodeValue;
?>

Simply, it loads the $content (HTML) into a DOMDocument, and then using XPath it looks for the div with the id of ‘mainDiv’; inside ‘mainDiv’ it then looks for the div with the id of ‘rightDiv’. Finally, it gets the text of that div’s ‘p’ element.
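Worth knowing: since an id should pin down a single element, the same node can be reached with a shorter query using the `//` (search anywhere) axis, so the full path from the root isn’t strictly needed. A quick sketch against the same markup:

```php
<?php
// The same markup as above, reduced to the relevant part
$content = '<html><body><div id="mainDiv"><div id="rightDiv">'
         . '<p class="bodyText">This is right</p></div></div></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// '//' searches the whole document, so the id alone is enough here
$rightDivText = $xpath->query("//div[@id='rightDiv']/p/text()");
echo $rightDivText->item(0)->nodeValue; // This is right
```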

Now that is one of the more straightforward scenarios, made easier by the fact that most of the elements have their own IDs like ‘mainDiv’ or ‘rightDiv’. If we were to look at the same HTML block again, but this time without IDs, this is how it would look:

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div>
      <div>
        <p class="bodyText">This is left</p>
        <img src="images/test.jpg" />
      </div>

      <div>
        <p class="bodyText">This is right</p>
      </div>
    </div>

    <div id="footer">
      <p class="smallText">This is the footer</p>
    </div>
  </body>
</html>

This time, let’s get the src of the image in the former leftDiv.

<?php
// $content is the content you scraped (via curl for example)
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$imgSrc = $xpath->query("//html/body/div[1]/div[1]/img/@src");

// Returns: images/test.jpg
echo $imgSrc->item(0)->nodeValue;
?>

As you can see, the difference this time is that instead of using the div’s ID, we used the element’s position. Inside the body it looks for the first div, and then inside of that it again looks for the first child div, before returning the src of the img element.
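Using the same positional style, you could grab the text of the second inner div (the former rightDiv) just by bumping the index. A sketch against the ID-less markup above:

```php
<?php
// The ID-less markup from above, as a single string
$content = '<html><body><div><div><p class="bodyText">This is left</p>'
         . '<img src="images/test.jpg" /></div>'
         . '<div><p class="bodyText">This is right</p></div></div>'
         . '<div id="footer"><p class="smallText">This is the footer</p></div>'
         . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

// div[2] selects the second child div instead of the first
$rightText = $xpath->query("//html/body/div[1]/div[2]/p/text()");
echo $rightText->item(0)->nodeValue; // This is right
```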

Again though, this is probably another pretty simple solution, but it definitely gives you an idea of how XPath works and some of the ways it can be used. The next scenario is basically the one I came face to face with: one in which we have multiple divs with the same id. (Strictly speaking, IDs are supposed to be unique, but scraped pages don’t always play by the rules.) Consider this HTML:

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div id="container">
      <div id="person">
        <p>Bob</p>
        <p>25</p>
        <p>180lbs</p>
        <img src="images/p1.jpg" />
      </div>
      <div id="person">
        <p>Stacy</p>
        <p>26</p>
        <p>187lbs</p>
        <img src="images/p2.jpg" />
      </div>
      <div id="person">
        <p>John</p>
        <p>21</p>
        <p>255lbs</p>
        <img src="images/p3.jpg" />
      </div>
    </div>
    <div id="footer"><p class="smallText">This is the footer</p></div>
  </body>
</html>

So here we have a webpage that displays a bunch of people, with information such as their name, age, weight, and picture. As someone who is interested in all of these people, I need to scrape the website and then use XPath to loop through each person and grab their details.

To do this, we query for all of the ‘person’ divs inside the ‘container’ div and then loop through them. DOMXPath’s query() method takes an optional second argument, a context node, so we can run the same relative queries from the earlier examples against each ‘person’ div in turn. Finally, we store all the results in an array.

<?php
// $content is the content you scraped (via curl for example)
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$peopleDiv = $xpath->query("//div[@id='container']/div[@id='person']");
$results   = array();

foreach ($peopleDiv as $person) {
    // Passing $person as the context node makes each query relative to it
    $name   = trim($xpath->query("p[1]/text()", $person)->item(0)->nodeValue);
    $age    = trim($xpath->query("p[2]/text()", $person)->item(0)->nodeValue);
    $weight = trim($xpath->query("p[3]/text()", $person)->item(0)->nodeValue);
    $image  = trim($xpath->query("img[1]/@src", $person)->item(0)->nodeValue);

    $results[] = array(
        'name'   => $name,
        'age'    => $age,
        'weight' => $weight,
        'image'  => $image
    );
}
?>

There you have it. If you run print_r() on the $results array, you should have yourself a bunch of values containing the information of Bob, Stacy, and John.

Now, a totally awesome cheat for all of this is the XPath Helper extension for Google Chrome. All you need to do is hold down the shift key as you hover over elements on a webpage, and it will show you the XPath query for that element, which is a HUGE help!

As always, feel free to comment and/or improve on the above. Also, don’t forget to follow me @JAGracie.