Wednesday, January 5, 2011

Web Page Scraping using Java

In this post, we are going to learn the fundamentals of web scraping and implement a web scraper using a Java API.

Agenda of this post
  • What is Web Scraping
  • Web Scraping technique
  • Useful API for web scraping
  • Sample code using a Java API


Web scraping (also called Web harvesting or Web data extraction) is a technique of extracting information from websites.
It describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
Using a web scraper, you can extract useful content from a web page and convert it into whatever format you need.

Web Scraping technique:
These are the typical steps involved in web scraping:
  • Connect : Connect to the remote site over HTTP or FTP.
  • Extract : Extract information from the website.
  • Process : Filter the useful data from the source and format it as required.
  • Save : Save the data in the desired format.
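The four steps above can be sketched in plain Java without any scraping library. The URL and output file name below are placeholders, and the regular-expression "processing" step is only for illustration; real pages call for a proper HTML parser.

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleScraper {

    // Process step: pull the <title> text out of raw HTML.
    // A regex is enough for a sketch; real scrapers should use an HTML parser.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) throws IOException {
        // Connect + Extract: download the page source over HTTP.
        URL url = new URL("http://example.com/"); // placeholder URL
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // Process: filter the useful piece of data.
        String title = extractTitle(html.toString());

        // Save: write the result to a file in the desired format.
        FileWriter out = new FileWriter("title.txt");
        out.write(title);
        out.close();
        System.out.println(title);
    }
}
```

A dedicated tool such as Web-Harvest takes care of the same pipeline declaratively, which is what the rest of this post shows.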

There are various web scraping tools and APIs available. I am going to use Web-Harvest for my web scraping example.

Web-Harvest
Web-Harvest is an open-source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them.
For more detail: click here
To download the Web-Harvest API: click here




For simplicity, I am going to write a scraper using Web-Harvest which will scrape a portion of this blog, say "Web scraping (also called Web harvesting or Web data extraction) is ............".

Here is my sample Java code: WebHarvestTest.java

import java.io.FileNotFoundException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class WebHarvestTest
{
    public static void main(String[] args)
    {
        try {
            String strPageURL =
                "http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html";

            // Load the XML configuration that defines the extraction procedure
            ScraperConfiguration config = new ScraperConfiguration(
                "K:/R&D/WebScrapping/src/basic/webHarvestConf.xml");

            // The second argument is the scraper's working directory
            Scraper scraper = new Scraper(config, "D:/");
            scraper.addVariableToContext("url", strPageURL);
            scraper.setDebug(true);
            scraper.execute();

            // Read back the variable populated by the configuration
            Variable varScrappedContent =
                (Variable) scraper.getContext().getVar("scrappedContent");

            // Printing the scraped data here
            System.out.println(varScrappedContent.toString());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}


In the sample Java code above, the file K:/R&D/WebScrapping/src/basic/webHarvestConf.xml defines the extraction procedure.
The URL of this blog, "http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html", is passed to the scraper through the "url" context variable.

Every extraction procedure in Web-Harvest is user-defined through an XML-based configuration file. Each configuration file describes a sequence of processors that execute common tasks in order to accomplish the final goal. The processors execute as a pipeline: the output of one processor is the input to the next.

Here is our webHarvestConf.xml




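A minimal sketch of such a configuration, assuming standard Web-Harvest syntax (the exact contents of the original file may differ):

```xml
<config charset="UTF-8">
    <!-- Download the page passed in through the "url" context variable
         and clean it up into well-formed XHTML -->
    <var-def name="content">
        <html-to-xml>
            <http url="${url}"/>
        </html-to-xml>
    </var-def>

    <!-- Query the cleaned XHTML: take the <h1> as the title and the
         div with id "defId" as the data, then format the result -->
    <var-def name="scrappedContent">
        <xquery>
            <xq-param name="doc">
                <var name="content"/>
            </xq-param>
            <xq-expression><![CDATA[
                declare variable $doc as node() external;
                let $title := data($doc//h1[1])
                let $data  := data($doc//div[@id="defId"])
                return
                    <myContent>
                        <title>{$title}</title>
                        <data>{$data}</data>
                    </myContent>
            ]]></xq-expression>
        </xquery>
    </var-def>
</config>
```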
You need to view the source of this blog page and find the id of the div containing the data to be scraped. If you view the source, you will see that the id of the required div is defId.
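For reference, the markup being targeted has roughly this shape (illustrative; the actual attributes and surrounding markup on the page will differ):

```html
<div id="defId">
  Web scraping (also called Web harvesting or Web data extraction) is
  a computer software technique of extracting information from websites.
</div>
```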

When Web-Harvest executes the first part of the configuration, the following steps occur:
  1. The http processor downloads the content from the specified URL.
  2. The html-to-xml processor cleans up that HTML, producing XHTML content.
  3. A query processor then selects the required pieces of content from the XHTML produced in the previous step.
In <xq-expression>, we do the following:
  • Get the <h1> value and store it in a variable called title.
  • Scrape the content of the div whose id is defId and store it in a variable called data.
  • Finally, format the content.


The scraped output looks like this:

<myContent>
   <title>Half Wit</title>
   <data>Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites.</data>
</myContent>