How to write a Web Crawler in Java. Part-1

The task of a crawler is to keep feeding information from the internet into the database of the search engine. It literally crawls over the internet from page to page, link by link, and downloads the information it finds into the database. A search engine is basically made up of four parts:
  1. Web Crawler
  2. Database
  3. Search Algorithm
  4. Search system that binds all the above together
For more information on crawlers, visit the wiki page for web crawlers. Crawler development can be planned out in phases, as we will be doing:
  1. To begin with, we will develop a very trivial crawler that just crawls the URL spoon-fed to it.
  2. Then we will give the crawler the capability to extract URLs from the downloaded web page.
  3. Next we will add a queue system to the crawler that tracks the number of URLs still to be downloaded.
  4. We can then add the capability to extract only the user-visible text from the web page.
  5. Thereafter we will make a multi-threaded downloader that utilizes our network bandwidth to the maximum.
  6. And we will also add some kind of front end to it, probably in PHP.
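Phases 2 and 3 of the plan above can be previewed with a rough sketch. It is only an illustration, not the code of the later parts: the class name CrawlQueueSketch, its methods and the regular expression are assumptions made here. The pattern is deliberately simplified and only catches absolute http/https links inside double-quoted href attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract URLs from HTML (phase 2) and queue them (phase 3).
class CrawlQueueSketch {

    // Simplified assumption: only absolute http(s) links in double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    private final Queue<String> pending = new ArrayDeque<>(); // URLs still to be downloaded
    private final Set<String> seen = new HashSet<>();         // every URL ever queued

    // Returns the links found in the given HTML, in document order.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Queues a URL unless it was queued before; returns true if it was added.
    boolean enqueue(String url) {
        if (seen.add(url)) {
            pending.add(url);
            return true;
        }
        return false;
    }

    // Returns the next URL to download, or null when the queue is empty.
    String next() {
        return pending.poll();
    }

    // Number of URLs still waiting to be downloaded.
    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"http://example.com/b\">B</a>"
                + "<a href=\"http://example.com/a\">A again</a>";
        CrawlQueueSketch queue = new CrawlQueueSketch();
        for (String url : extractUrls(html)) {
            queue.enqueue(url);
        }
        System.out.println("URLs still to download: " + queue.pendingCount());
    }
}
```

The set of seen URLs keeps the crawler from queueing the same page twice, which matters as soon as pages link back to each other.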
In this part of the article we will make a simple Java crawler that crawls a single page on the internet. NetBeans is the primary tool used for the crawler development, and the database will be implemented in MySQL. Make a new project in NetBeans and save it under a name like “WebC” or “w1”. By default there will be a class called Main in the default package of the project. Write the following code in its main() function. This class will be worked upon later, and new classes will be added once we get going.
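/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

/**
 *
 * @author vimal
 */
public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            // Put the address of the page to crawl between the quotes.
            URL my_url = new URL("");
            // Wrap the page's byte stream so it can be read line by line.
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            // Print the downloaded HTML until the stream is exhausted.
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            // Network and malformed-URL errors end up here.
            ex.printStackTrace();
        }
    }
}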
Voilà, there is your first baby crawler :) Watch the output when you first run it: when it runs successfully, it will show you the HTML code of the web page whose address you put in the URL.
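As a possible refinement of the baby crawler, here is a minimal sketch that closes the stream when it is done and returns the page as a string instead of printing it line by line. The class name PageReaderSketch and its methods are hypothetical, the UTF-8 charset is an assumption (real pages may declare another encoding), and URI.create(...).toURL() is swapped in for the new URL(...) constructor used in the code of this article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the same single-page download, with the stream closed automatically.
class PageReaderSketch {

    // Reads everything from the given reader, one line at a time, and closes it.
    static String readAll(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Downloads the page at the given address. Assumes the page is UTF-8 encoded.
    static String fetch(String address) throws IOException {
        return readAll(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageReaderSketch <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

Keeping readAll separate from fetch means the reading logic can be tried on any Reader, for example a StringReader, without touching the network.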

Troubleshooting the Web Crawler

The crawler may hiccup or stumble upon some errors, most probably network errors related to the proxy settings of NetBeans and the JVM. In such a case you can change the proxy IP and port for NetBeans at Tools>>Options>>General>>Proxy Settings. You may also need to feed the same values to the JVM via the command line. In NetBeans that can be done at File>>’w1′ Properties>>Run>>VM Options: write the following in the text box there.
-Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<port for the same>
example: -Dhttp.proxyHost= -Dhttp.proxyPort=3128
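If you would rather not touch the VM options, the same two properties can also be set from code. The sketch below is only an illustration: the class ProxySettingsSketch and its method are hypothetical, and proxy.example.com is a placeholder host, not a real proxy. Note that http.proxyHost and http.proxyPort cover http:// URLs only; https:// URLs use the separate https.proxyHost and https.proxyPort properties.

```java
// Hypothetical sketch: set the JVM proxy properties from code instead of -D options.
class ProxySettingsSketch {

    // Equivalent to passing -Dhttp.proxyHost=<host> -Dhttp.proxyPort=<port> to the JVM.
    static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // proxy.example.com is a placeholder; use your own proxy IP and port.
        useHttpProxy("proxy.example.com", 3128);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

Call useHttpProxy at the start of main(), before the crawler opens its URL.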

Future Work

Keep visiting this site for the next article, following soon, in which we will discuss possible improvements to our crawler along the plan we chalked out earlier. Also watch out for an article on how to integrate the Eclipse IDE with Google Android’s ADT for Android application development.
