Scraping the sh*t out of the interwebz – Part #1

PHP’s curl library is by far my favorite tool. Combined with DOMDocument it provides a powerful API to gather content from remote sites or manipulate remote forms programmatically. In the upcoming couple of articles / tutorials I will show you how to use it to scrape website content, log in to remote interfaces, act as a gateway for dealing with APIs and, at the end, do cool (and dodgy) stuff like breaking captchas.

In my examples I will use my own wrapper class (curl.class.php), which I have been gradually building over the past couple of years, expanding it with additional features as I needed them.

Ok, so the first example will be a pretty basic one: the goal is to get listings from eBay based on a keyword passed in to our script.

The whole script will look something like this. I will explain it line by line below:

<?php
 
if (isset($_GET['keyword']) && $_GET['keyword']){
	$keyword = $_GET['keyword'];
} elseif (isset($argv[1]) && $argv[1]){
	$keyword = $argv[1];
} else {
	die("usage: php {$argv[0]} [keyword]\n");
}
 
require_once("../curl.class.php");
 
$curl = new Curl();
$page = $curl->get("http://www.ebay.com/sch/i.html?_trksid=p2050601.m570.l1313&_nkw=".urlencode($keyword)."&_sacat=0&_from=R40");
 
if ($page && $curl->getHttpCode()>=200 && $curl->getHttpCode()<400){
 
	$dom = new DOMDocument();
	@$dom->loadHTML($page);
 
	$tables = $dom->getElementsByTagName('table');
	for($i=0;$i<$tables->length;$i++){
 
		if ($tables->item($i)->getAttribute("itemtype")!="http://schema.org/Offer"){
			continue;
		}
 
		$h4s = $tables->item($i)->getElementsByTagName('h4');
		if (!$h4s->length){
			continue;
		}
 
		$links = $h4s->item(0)->getElementsByTagName('a');
		if (!$links->length){
			continue;
		}
 
		$item_title = $links->item(0)->textContent;
		$item_url = $links->item(0)->getAttribute("href");
 
		print($item_title."\t".$item_url."\n");
 
	}
 
 
} else {
	print("unexpected error occurred\n");
}

Basic enough, right? The top couple of lines don’t need much explanation, I guess: check if a “keyword” parameter is passed in via the GET request or via the command line. If not, show the user a usage message and exit.

The first thing we need to do when scraping a site is to check the URL structure. If you go to ebay.com and search for something, you will see the keyword appear in the URL (in the _nkw parameter). Therefore:

$curl = new Curl();
$page = $curl->get("http://www.ebay.com/sch/i.html?_trksid=p2050601.m570.l1313&_nkw=".urlencode($keyword)."&_sacat=0&_from=R40");
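If gluing the query string together by hand feels fragile, http_build_query can do the encoding for you. This is only a sketch: the function name ebay_search_url is made up for illustration, and the parameter values are simply copied from the URL above.

```php
<?php
// Hypothetical helper: builds the same search URL from an array of parameters.
// The parameter names and values are taken from the URL used in this article.
function ebay_search_url($keyword)
{
    $params = array(
        '_trksid' => 'p2050601.m570.l1313',
        '_nkw'    => $keyword,   // the search keyword
        '_sacat'  => 0,
        '_from'   => 'R40',
    );

    // http_build_query URL-encodes every value (spaces become "+", like urlencode).
    return 'http://www.ebay.com/sch/i.html?' . http_build_query($params, '', '&');
}

print(ebay_search_url('gm part') . "\n");
```

Either way you end up with the same URL, and the get() call is what actually fetches it.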

This will download the page and store the HTML in the $page variable. It is unlikely but possible that ebay.com is down, so it is better to check the response code. We will accept any 2xx or 3xx response code (my curl class handles redirects by default):

if ($page && $curl->getHttpCode()>=200 && $curl->getHttpCode()<400){
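If you don’t have a wrapper class of your own, here is a minimal sketch of what get() and getHttpCode() might look like on top of PHP’s curl functions. To be clear, this is not the actual curl.class.php: the class name MinimalCurl, the user agent string, the redirect limit and the timeout are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for curl.class.php (the real class is not shown here).
class MinimalCurl
{
    private $httpCode = 0;
    private $userAgent;

    // The default user agent is a placeholder; put your own contact details in it.
    public function __construct($userAgent = 'example-scraper/0.1 (contact: you@example.com)')
    {
        $this->userAgent = $userAgent;
    }

    // Fetch a URL; returns the response body as a string, or false on a transport error.
    public function get($url)
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,  // follow redirects
            CURLOPT_MAXREDIRS      => 5,     // assumed limit
            CURLOPT_TIMEOUT        => 30,    // assumed timeout in seconds
            CURLOPT_USERAGENT      => $this->userAgent,
        ));

        $body = curl_exec($ch);
        // Status code of the last response; stays 0 if no response was received.
        $this->httpCode = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

        return $body;
    }

    public function getHttpCode()
    {
        return $this->httpCode;
    }
}
```

With redirects followed, getHttpCode() reports the status of the final response, so the 2xx/3xx check is applied to the page you actually received.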

Neat, we have the HTML, let’s initialize the DOM parser:

	$dom = new DOMDocument();
	@$dom->loadHTML($page);

Notice the @ sign in front of the loadHTML method call. I know it’s ugly and I should really use my own error handler, etc., but unfortunately I’m lazy and DOMDocument is like a teenage girl on Valentine’s Day: she wants everything to be perfect and exactly the way she imagined, otherwise she starts to complain. In practice that means it emits a warning for every bit of invalid markup it finds, and real-world pages have plenty. Anyway, now it’s time to have a look at eBay’s HTML structure via Firebug.
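A side note for the less lazy: libxml can collect its parse warnings internally instead of printing them, which removes the need for the @. A minimal sketch, assuming a small inline HTML string in place of the downloaded page (the markup is made up for illustration):

```php
<?php
// Stand-in for the downloaded page; deliberately sloppy markup.
$page = '<html><body><h4><a href="/item/1">Some item</a></h4><p>unclosed</body></html>';

// Ask libxml to buffer parse errors instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($page);

// Inspect the buffered errors if you care about them, then free the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

$link = $dom->getElementsByTagName('a')->item(0);
print($link->textContent . "\t" . $link->getAttribute('href') . "\n");
```

The upside over @ is that the errors are not thrown away: $errors holds an array of LibXMLError objects you can log or ignore as you see fit.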

[Screenshot curl_1: eBay’s search result markup inspected in Firebug]

Good, this will be easy. It looks like each result is in its own table with itemtype="http://schema.org/Offer" (god bless microdata). The title and the URL can be gathered from the text content and href attribute of the first link inside the table’s one and only H4 tag. Some people prefer using XPath to get to the target element directly, but I have found drilling down the DOM more reliable:

	$tables = $dom->getElementsByTagName('table');
	for($i=0;$i<$tables->length;$i++){
 
		if ($tables->item($i)->getAttribute("itemtype")!="http://schema.org/Offer"){
			continue;
		}
 
		$h4s = $tables->item($i)->getElementsByTagName('h4');
		if (!$h4s->length){
			continue;
		}
 
		$links = $h4s->item(0)->getElementsByTagName('a');
		if (!$links->length){
			continue;
		}
 
		$item_title = $links->item(0)->textContent;
		$item_url = $links->item(0)->getAttribute("href");
 
		print($item_title."\t".$item_url."\n");
 
	}
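For comparison, here is roughly what the XPath route looks like. Treat it as a sketch: the inline HTML is a made-up stand-in for the eBay results page, and the queries assume the same table / H4 / link structure described above.

```php
<?php
// Made-up sample markup: one offer table and one unrelated table.
$page = <<<HTML
<html><body>
<table itemtype="http://schema.org/Offer"><tr><td>
  <h4><a href="http://example.com/item/1">First item</a></h4>
</td></tr></table>
<table><tr><td><h4><a href="http://example.com/ad">Not an offer</a></h4></td></tr></table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parse warnings quiet
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$items = array();

// Select only the tables that carry the Offer itemtype.
foreach ($xpath->query('//table[@itemtype="http://schema.org/Offer"]') as $table) {
    // First link inside the first H4 of this table, relative to the table node.
    $link = $xpath->query('(.//h4)[1]//a', $table)->item(0);
    if ($link === null) {
        continue;
    }
    $items[] = array(
        'title' => trim($link->textContent),
        'url'   => $link->getAttribute('href'),
    );
}

foreach ($items as $item) {
    print($item['title'] . "\t" . $item['url'] . "\n");
}
```

It is shorter, but a single typo in a query silently returns nothing, which is part of why I stick with drilling down step by step.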

And we are done. It is a basic example, but as you can tell it is a really powerful way to gather data from remote sites. What can you use it for? Well, pretty much anything. Add stock prices to your site by scraping Google Finance, add a customized eBay affiliate widget to your sidebar based on your keywords; the sky is the limit.

9 comments

  1. Emil Vikström

    This lacks a discussion of robots.txt and user agents, stuff that all web crawlers/scrapers should adhere to. In short: make sure your user-agent is set to something sensible, preferably with some contact information, and make sure you follow the rules set up by the target site’s robots.txt file.

  2. Phillip

    If you’re looking to grab something oddly specific, phpQuery can make it easier than DOMDocument can: it lets you use jQuery-style selectors to grab elements. Because they’re jQuery based, you can easily use the browser console to build them out too. Just something to keep in mind if you want the 4th header containing the word “Trunip” that has the class “farkle”.

  3. Gary Teh

    Cool! My teammate and I decided to make the entire scraping layer abstract and used this instead:

    https://krake.io/scrap-gm

    {
      origin_url: 'http://www.ebay.com/sch/?_nkw=GM%20Part&_pgn=1',
      columns: [
        {
          col_name: 'item_name',
          dom_query: 'h4 a'
        }, {
          col_name: 'item_detail_url',
          dom_query: 'h4 a',
          required_attribute: 'href',
          options: {
            columns: [{
              col_name: 'description',
              dom_query: '#desc_div'
            }, {
              col_name: 'seller_name',
              dom_query: '.mbg a[[0]]'
            }, {
              col_name: 'seller_profile_url',
              dom_query: '.mbg a[[0]]',
              required_attribute: 'href'
            }]
          }
        }, {
          col_name: 'item_image',
          dom_query: '.img img',
          required_attribute: 'src'
        }
      ],
      next_page: {
        dom_query: '.next'
      }
    };

  4. David Bradbury

    Thanks for posting this. I’ve been working on scraping data recently so that I can keep track of trends in certain products I use. Remember to use RSS feeds or APIs when available to be more polite and decrease processing time! In the case of eBay, they have ebay dot com/sch/rss/. You can customize the results as well, and they will be nicely organized for processing.

    • Gary Teh

      Yup, the engine allows for the processing of RSS feeds. Inappropriate use of web scraping does impose a huge bandwidth penalty on the content provider.

  5. Pingback: Scraping the sh*t out of the interwebz – Part #2 | Shut up and code
