Wednesday, January 5, 2011

Web Page Scraping using Java

In this post, we are going to learn the fundamentals of web scraping and implement a web scraper using a Java API.

Agenda of this post
  • What is Web Scraping
  • Web Scraping technique
  • Useful API for web scraping
  • Sample code using a Java API


Web scraping (also called Web harvesting or Web data extraction) is a technique of extracting information from websites.
It describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
Using a web scraper, you can extract useful content from a web page and convert it into whatever format you need.

Web Scraping technique:
These are the typical steps involved in web scraping:
  • Connect : Connect to the remote site over HTTP or FTP.
  • Extract : Extract information from the website.
  • Process : Filter the useful data from the source and format it as required.
  • Save : Save the data in the desired format.
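The four steps above can be sketched in plain Java without any scraping library. The URL and output file name below are placeholders, and the regular-expression "processing" step is only for illustration; real pages call for a proper HTML parser.

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleScraper {

    // Process step: pull the <title> text out of raw HTML.
    // A regex is enough for a sketch; real scrapers should use an HTML parser.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) throws IOException {
        // Connect + Extract: download the page source over HTTP.
        URL url = new URL("http://example.com/"); // placeholder URL
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // Process: filter the useful piece of data.
        String title = extractTitle(html.toString());

        // Save: write the result to a file in the desired format.
        FileWriter out = new FileWriter("title.txt");
        out.write(title);
        out.close();
        System.out.println(title);
    }
}
```

A dedicated tool such as Web-Harvest takes care of the same pipeline declaratively, which is what the rest of this post shows.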

There are various web scraping tools and APIs available. I am going to use Web-Harvest for my web scraping example.

Web-Harvest
Web-Harvest is an open-source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them.
For more detail: click here
To download the Web-Harvest API: click here




For simplicity, I am going to write a scraper using Web-Harvest which will scrape a portion of this blog, say "Web scraping (also called Web harvesting or Web data extraction) is ............".

Here is my sample Java code: WebHarvestTest.java

import java.io.FileNotFoundException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class WebHarvestTest
{
    public static void main(String[] args)
    {
        try {
            String strPageURL =
                "http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html";

            // Load the XML configuration that defines the extraction procedure
            ScraperConfiguration config = new ScraperConfiguration(
                "K:/R&D/WebScrapping/src/basic/webHarvestConf.xml");

            // The second argument is the scraper's working directory
            Scraper scraper = new Scraper(config, "D:/");
            scraper.addVariableToContext("url", strPageURL);
            scraper.setDebug(true);
            scraper.execute();

            // Read back the variable populated by the configuration
            Variable varScrappedContent =
                (Variable) scraper.getContext().getVar("scrappedContent");

            // Printing the scraped data here
            System.out.println(varScrappedContent.toString());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}


In the sample Java code above, the file K:/R&D/WebScrapping/src/basic/webHarvestConf.xml defines the extraction procedure.
The URL of this blog, "http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html", is passed to the scraper through the "url" context variable.

Every extraction procedure in Web-Harvest is user-defined through an XML-based configuration file. Each configuration file describes a sequence of processors that execute common tasks in order to accomplish the final goal. The processors execute as a pipeline: the output of one processor is the input to the next.

Here is our webHarvestConf.xml




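A minimal sketch of such a configuration, assuming standard Web-Harvest syntax (the exact contents of the original file may differ):

```xml
<config charset="UTF-8">
    <!-- Download the page passed in through the "url" context variable
         and clean it up into well-formed XHTML -->
    <var-def name="content">
        <html-to-xml>
            <http url="${url}"/>
        </html-to-xml>
    </var-def>

    <!-- Query the cleaned XHTML: take the <h1> as the title and the
         div with id "defId" as the data, then format the result -->
    <var-def name="scrappedContent">
        <xquery>
            <xq-param name="doc">
                <var name="content"/>
            </xq-param>
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title := data($doc//h1[1])
                let $data  := data($doc//div[@id="defId"])
                return
                    <myContent>
                        <title>{$title}</title>
                        <data>{$data}</data>
                    </myContent>
            ]]></xq-expression>
        </xquery>
    </var-def>
</config>
```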
You need to view the source of this blog page and find the id of the div containing the data to be scraped. If you view the source, you will see that the id of the required div is defId.
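For reference, the markup being targeted has roughly this shape (illustrative; the actual attributes and surrounding markup on the page will differ):

```html
<div id="defId">
  Web scraping (also called Web harvesting or Web data extraction) is
  a computer software technique of extracting information from websites.
</div>
```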

When Web-Harvest executes the first part of the configuration, the following steps occur:
  1. The http processor downloads the content from the specified URL.
  2. The html-to-xml processor cleans up that HTML, producing XHTML content.
  3. A query processor then selects the required pieces of content from the XHTML produced in the previous step.
In <xq-expression>, we do the following:
  • Get the <h1> value and store it in a variable called title.
  • Scrape the content of the div whose id is defId and store it in a variable called data.
  • Finally, format the content.


The scraped output looks like this:

<myContent>
   <title>Half Wit</title>
   <data>Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites.</data>
</myContent>