Sunday, December 2, 2012

Web Scraping using Java


Extracting structured data from web sites is not a trivial task. Most of the information on the web today is in the form of Hypertext Markup Language (HTML) documents which are viewed by humans with a browser. HTML documents are sometimes written by hand, sometimes with the aid of HTML tools. Given that the format of HTML documents is designed for presentation purposes, not automated extraction, and the fact that most of the HTML content on the web is ill-formed (“broken”), extracting data from such documents can be compared to the task of extracting structure from unstructured documents.

In our previous post, we discussed simple web harvesting. In this post, we will scrape more complex information from naaptol.com.

To start with a slightly more complex scraping task, we will extract the mobile handset items listed on a given naaptol.com page URL.


naptolConf.xml

Now we will try to understand the configuration file and how it works.

<list>
  <xpath expression='//*[@id="productView"]'>
    <html-to-xml prunetags="yes">
      <http url="${url}"/>
    </html-to-xml>
  </xpath>
</list>
The <html-to-xml> processor cleans up the HTML downloaded by the <http> processor for the given URL and produces XHTML content. The <xpath> processor then evaluates the XPath expression against that XHTML and produces a list of the matching items.

Here, we get every element whose id is productView. If the page has 3 elements with id productView, the list will contain three items, from element [0] to [2]. <list> holds these items as a sequence.
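
To make the matching step concrete, here is a minimal sketch that uses the JDK's own javax.xml.xpath API rather than Web-Harvest. It only illustrates what the XPath expression selects; it is not how Web-Harvest is implemented. The class name ProductViewXPathDemo, the method productViewTexts, and the sample markup are assumptions made up for this illustration, and the markup is already well-formed, so it stands in for the output of <html-to-xml>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ProductViewXPathDemo {

    // Hypothetical, already well-formed markup standing in for the cleaned-up page.
    public static final String SAMPLE_XHTML =
        "<html><body>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone A\">Phone A</p></div>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone B\">Phone B</p></div>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone C\">Phone C</p></div>"
        + "<div id=\"footer\">not a product</div>"
        + "</body></html>";

    // Returns the text content of every element whose id is productView, in document order.
    public static List<String> productViewTexts(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                "//*[@id=\"productView\"]",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                texts.add(matches.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        List<String> items = productViewTexts(SAMPLE_XHTML);
        for (int i = 0; i < items.size(); i++) {
            System.out.println("[" + i + "] " + items.get(i));
        }
    }
}
```

With the sample markup, the list has three entries, indexed [0] to [2]; the footer element is skipped because its id does not match.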

<loop item="link" index="index">
The loop iterates through the specified list and executes the body logic for each item. So item="link"
exposes each item in the list in turn, from index [0] to [n-1], as the variable link.

Now, in the body section, we take the item variable, apply an XPath expression to it, and store the extracted data in a variable called productName using the code below:
<var-def name="productName">
  <xpath expression='//p[@class="proName"]//@title'>
    <var name="link" />
  </xpath>
</var-def>

In the above excerpt, we extract the title of the paragraph with class proName from each list item in turn and store it in the variable productName.
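
The same loop-and-extract idea can be sketched with the JDK's javax.xml.xpath API. This is only an analogy for the loop body, not Web-Harvest's implementation. The class name ProductNameExtractor, the method extractProductNames, and the sample markup are hypothetical. One difference is worth noting: the sketch evaluates the inner expression relative to each matched node, so it starts with a dot (.//p) to stay inside the current item instead of searching the whole document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ProductNameExtractor {

    // Hypothetical, already well-formed markup standing in for the cleaned-up page.
    public static final String SAMPLE_XHTML =
        "<html><body>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone A\">A</p></div>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone B\">B</p></div>"
        + "<div id=\"productView\"><p class=\"proName\" title=\"Phone C\">C</p></div>"
        + "<div id=\"footer\"><p title=\"ignored\">footer</p></div>"
        + "</body></html>";

    // Collects the title attribute of the proName paragraph inside each productView element.
    public static List<String> extractProductNames(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: build the list of items, as the <list> section does.
            NodeList items = (NodeList) xpath.evaluate(
                "//*[@id=\"productView\"]",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);

            // Step 2: visit each item in turn, as the loop does with item="link".
            List<String> productNames = new ArrayList<>();
            for (int index = 0; index < items.getLength(); index++) {
                // Step 3: apply the inner expression to the current item only.
                String productName = (String) xpath.evaluate(
                    ".//p[@class=\"proName\"]//@title",
                    items.item(index),
                    XPathConstants.STRING);
                productNames.add(productName);
            }
            return productNames;
        } catch (XPathExpressionException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        for (String name : extractProductNames(SAMPLE_XHTML)) {
            System.out.println(name);
        }
    }
}
```

For the sample markup, this yields one product name per productView element; the title inside the footer is never reached because the footer is not part of the list.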


Naptol.java
package basic;

import java.io.FileNotFoundException;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

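public class Naptol {
    public static void main(String[] args) {
        try {
            // Load the Web-Harvest configuration described above
            ScraperConfiguration config =
                new ScraperConfiguration(
                    "D:/R&D/WebScrapping/src/basic/naptolConf.xml");

            // The second argument is the scraper's working directory
            Scraper scraper = new Scraper(config, "D:/");

            // Supplies the value of ${url} used by the <http> processor
            scraper.addVariableToContext("url",
                "http://www.naaptol.com/buy/mobile_phones/mobile_handsets.html");
            scraper.execute();

            // Reads the variable named "naptol"; the configuration must define it
            // (that definition is not part of the excerpts shown above)
            Variable varScrappedContent =
                (Variable) scraper.getContext().getVar("naptol");

            // Printing the scraped data here
            System.out.println(varScrappedContent.toString());
        } catch (FileNotFoundException e) {
            // Thrown when the configuration file cannot be found at the given path
            e.printStackTrace();
        }
    }
}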