Monday 29 April 2013

Python Wikipedia Crawler To Get All Images Of A Page


Although there are many frameworks for crawling the web, such as Scrapy, there is no need for them when writing simple crawlers.

The aim of this tutorial is to write a Wikipedia crawler that downloads all images of a given Wikipedia page.

Requirements:

1. Python
2. BeautifulSoup - Python package
3. urllib and urllib2 - Python packages


Make sure the above packages are installed. Python and urllib come preinstalled on most versions of Linux. To install BeautifulSoup, type the following command in a terminal:

pip install bs4

Have a look at the code here.

Steps Followed:

1. Send a request to Wikipedia to connect to the site so that data can be transferred.

2. The URL (the variable site here) is opened, and the HTML source of the URL is retrieved and stored in page.

3. Find all img tags in the page.

4. Save the images one by one in the output folder (a minimal sketch of these steps is shown below).
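As a quick illustration of the steps above, here is a minimal sketch of such a crawler. This is not the code from the repository linked above; the page URL, the User-Agent string and the "output" folder name are placeholder assumptions, and it uses the urllib and urllib2 packages listed in the requirements (Python 2).

import os
import urllib
import urllib2
import urlparse
from bs4 import BeautifulSoup

# Placeholder page URL and output folder; change them to suit your needs.
site = "https://en.wikipedia.org/wiki/Python_(programming_language)"
output_dir = "output"

# Steps 1 and 2: send the request and read the page source into `page`.
request = urllib2.Request(site, headers={"User-Agent": "Mozilla/5.0"})
page = urllib2.urlopen(request).read()

# Step 3: find every img tag in the page.
soup = BeautifulSoup(page, "html.parser")
img_tags = soup.find_all("img")

# Step 4: save the images one by one into the output folder.
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

for img in img_tags:
    src = img.get("src")
    if not src:
        continue
    # Wikipedia often uses protocol-relative URLs like //upload.wikimedia.org/...
    img_url = urlparse.urljoin(site, src)
    filename = os.path.join(output_dir, os.path.basename(src))
    urllib.urlretrieve(img_url, filename)
    print "Saved", filename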

Please contribute on Git to make this crawler more useful.

  
