How to write a Web Crawler in Java. Part-1

The task of a crawler is to keep feeding information from the internet into the database of a search engine. It literally crawls the internet from page to page, link by link, and downloads what it finds into the database. A search engine is basically made up of four parts:

Web Crawler
Database
Search Algorithm
Search system that binds all the above together

For more information on crawlers, visit the wiki page for web crawlers. Crawler development can be planned out in phases, as we will be doing here.

To begin with, we will develop a very trivial crawler that just crawls the URL spoon-fed to it.
Then we will make a crawler capable of extracting URLs from the downloaded web page.
Next we can add a queue system to the crawler that tracks the number of URLs still to be downloaded.
We can then add the capability to extract only the user-visible text from the web page.
Thereafter we will make a multi-threaded downloader that makes fuller use of the available network bandwidth.
And we will also add some kind of front end to it, probably in PHP.
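
To make the second and third phases of the plan a little more concrete, here is a minimal sketch of how link extraction and a URL queue could fit together. It is not the code of the later parts of this series: the class name LinkQueueSketch and all of its methods are hypothetical, the regular expression is deliberately naive (a real HTML parser would be more robust), and example.com is only a placeholder.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the article's Main class.
class LinkQueueSketch {

    // Naive pattern for href="..." or href='...' attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    private final Deque<String> pending = new ArrayDeque<>(); // URLs still to be downloaded
    private final Set<String> seen = new HashSet<>();         // URLs that were queued at least once

    // Returns the absolute http(s) links found in the page, resolved against the page's own URL.
    static List<String> extractLinks(String html, URI pageUrl) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash); // drop the fragment, it points into the same page
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = pageUrl.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException ex) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    // Adds a URL to the queue only if it has not been queued before.
    boolean enqueue(String url) {
        if (seen.add(url)) {
            pending.addLast(url);
            return true;
        }
        return false;
    }

    // Next URL to download, or null when the queue is empty.
    String next() {
        return pending.pollFirst();
    }

    // Number of URLs still waiting to be downloaded.
    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href='http://example.com/'>Home</a>";
        LinkQueueSketch queue = new LinkQueueSketch();
        for (String link : extractLinks(html, URI.create("http://example.com/index.html"))) {
            queue.enqueue(link);
        }
        System.out.println(queue.pendingCount() + " URLs still to be downloaded");
    }
}
```

The set of already seen URLs is what keeps the crawler from queueing the same page twice, and the queue size is the "number of URLs still to be downloaded" mentioned in the plan.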

In this part of the article we will make a simple Java crawler that crawls a single page on the internet. NetBeans is the main tool used for the crawler development, and the database will be implemented in MySQL. Make a new project in NetBeans and save it under a name like “WebC” or “w1”. By default there will be a class called Main.java in the default package of the project. Write the following code in its main() function. This class will be worked on later, and new classes will be added once we get going.

package net.viralpatel.java.webcrawler;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Downloads a single web page and prints its HTML source.
 *
 * @author vimal
 */
public class Main {

    /**
     * @param args the command line arguments (unused)
     */
    public static void main(String[] args) {
        try {
            URL myUrl = new URL("http://www.vimalkumarpatel.blogspot.com/");
            // try-with-resources closes the reader and the underlying stream when done.
            // The page is assumed to be UTF-8 encoded.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(myUrl.openStream(), StandardCharsets.UTF_8))) {
                String line;
                // readLine() returns null once the end of the page is reached.
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        } catch (IOException ex) {
            // Covers a malformed URL as well as network failures.
            ex.printStackTrace();
        }
    }
}

Voilà, there is your first baby crawler :) Watch the output when you first run it: when it runs successfully, it will show you the HTML code of the web page ‘www.vimalkumarpatel.blogspot.com’.
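
Printing the page is fine for a first run, but the later phases will need the downloaded HTML as a value they can work on. One possible way to prepare for that, shown here only as a sketch, is to move the download loop into a method that returns the page as a String. The class name PageDownloaderSketch and the method download are hypothetical, and the page is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical refactoring of the loop in Main; assumes UTF-8 content.
class PageDownloaderSketch {

    // Reads everything the URL serves and returns it as one string,
    // with each line terminated by '\n'.
    static String download(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://www.vimalkumarpatel.blogspot.com/").toURL();
        String html = download(url);
        System.out.println(html.length() + " characters downloaded");
    }
}
```

Because the method only depends on a URL, it can also be tried out against a local file: URL while you are offline.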

Troubleshooting the Web Crawler

The crawler may hiccup or stumble on some errors, most probably network errors related to the proxy settings of NetBeans and the JVM. In that case you can change the proxy IP and port for NetBeans at Tools >> Options >> General >> Proxy Settings. You may also need to pass the same settings to the JVM on the command line, which can be done in NetBeans at File >> ‘w1’ Properties >> Run >> VM Options. Write the following in the text box there:

-Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<port for the same>

Example:

-Dhttp.proxyHost=172.16.3.1 -Dhttp.proxyPort=3128
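
If you would rather not touch the VM options, the same two properties can also be set from code. The sketch below is one way to do that, not a required step: the class name ProxySettingsSketch and its methods are hypothetical, and the address is just the example value from above. To be safe, set the properties before the crawler opens its first connection.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper; the host and port used in main are the example values from the text.
class ProxySettingsSketch {

    // Same effect as passing -Dhttp.proxyHost=... -Dhttp.proxyPort=... to the JVM.
    static void useSystemProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    // Alternative: a Proxy object that can be passed to URL.openConnection(Proxy)
    // when only a single connection should go through the proxy.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        useSystemProxy("172.16.3.1", 3128);
        System.out.println("Proxy set to " + System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```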

Future Work

Keep visiting this site for the next article, coming soon, in which we will discuss possible improvements to our crawler along the plan we chalked out earlier. Also watch out for an article on how to integrate your Eclipse IDE with Google Android’s ADT for Android application development.

View Comments

Cleber Adriani says:

April 17, 2009 at 3:44 am

Coolll, i'm looking forward to see the thread implementation.
Marcello de Sales says:

May 4, 2009 at 11:11 pm

This is the first step for the crawler, but you still need to add an HTML parser to extract the meaningful and "readable" from the "code" data... From there, you can play with search algorithms, indexing, etc... As Google bots does, they save the entire output as your code snippet as a cache and apply the other techniques... Others also use this technique for Web Scrapping or data harvest, which may be illegal depending on the Terms and Conditions on a given website...

Marcello de Sales
spiderwick says:

June 7, 2009 at 5:45 pm

nice starter guide, probably next part you could add Libxml for HTML parsing and URL rebuild. also robots.txt exclusion rules.
Sheppounet says:

August 3, 2009 at 11:33 pm

Very interesting !
But i can't found the next part ... It is already written ?
justonefix says:

August 13, 2009 at 2:17 am

hi , im working on similar project , my aim is to build a high capacity web crawler , just wanted to ask what would it be the average speed of links checked per second for a fast crawler, what i did is a mysql based crawler , and maximum i did is 10 checked links per 1 sec, on arraylist based loop in the java code, with mysql retrieving loop this speed is 2 checked links per a second .
Arnab says:

October 23, 2009 at 8:37 pm

How do you parse contents which are rendered by Javascript after load. Thanks Arnab
amit says:

February 15, 2011 at 1:50 pm

please provide the next tutorial on this topic

Nikhil says:

June 2, 2011 at 1:47 am

Hey I just wanted to ask if anyone would be able to give me a good starting point for developing a web crawler to crawl the 'deep web', basically to performing deep searching. I was just wondering if anyone had any idea about this aspect of web crawling? If so any guidance with regard to a good starting point for the deep web crawler development or a good website to refer to for more information or a good book to read for this task, would be highly appreciated!!!

darren says:

July 7, 2011 at 1:18 am

This is something I have been looking for days.Thanks a lot.
darren says:

July 7, 2011 at 1:22 am

Is there anyway we can develop the webcrawler to visit all web sites that contains say an item may be a book or any sporting goods?Please post a tutorial if we can build that in less then a 100lines of codes.
Thanks again
adcha says:

December 16, 2011 at 2:45 pm

have you already posted the second part?

ViralPatel.net