Search engines such as Google and Yahoo! pull Web pages into their search results by using Web bots (also called spiders or crawlers), which are programs that scan the Internet and index websites into a database. Web bots can be written in most programming languages, including C, Perl, Python, and PHP, all of which let software engineers write scripts that perform procedural tasks, such as Web scanning and indexing.
Step 1
Open a plain text editor, such as Notepad, which is included with Microsoft Windows, or Mac OS X's TextEdit, where you will write the Python Web bot script.
Step 2
Start the Python script with the following lines of code, replacing the example URL with the URL of the website you wish to scan and the example database name with the name of the file that will store the results:
import urllib2
import re

enter_point = 'http://www.exampleurl.com'
db_name = 'example.sql'
Step 3
Include the following lines of code to define a helper function that removes duplicate entries from the list of URLs the Web bot collects:
def uniq(seq):
    # Keep the first occurrence of each URL and drop repeats
    seen = {}
    for item in seq:
        seen[item] = None
    return seen.keys()
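As a quick sanity check, a working, order-preserving version of the de-duplication helper can be exercised on a sample list (the URLs below are made up for illustration):

```python
def uniq(seq):
    # Keep the first occurrence of each item, preserving order
    # (dicts preserve insertion order in Python 3.7+)
    seen = {}
    for item in seq:
        seen[item] = None
    return list(seen.keys())

sample = ['http://a.example', 'http://b.example', 'http://a.example']
print(uniq(sample))  # prints ['http://a.example', 'http://b.example']
```

Running the helper on a list with a repeated entry confirms that only the first occurrence survives.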
Step 4
Gather the URLs linked from a page by adding the following function:
def geturls(url):
    request = urllib2.Request(url)
    # Identify the bot to the server with a User-Agent header
    request.add_header('User-Agent', 'Bot_name')
    content = urllib2.urlopen(request).read()
    # Find every absolute http link on the page (non-greedy match)
    urls = re.findall('href="(http://.*?)"', content)
    return urls
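To see the link-extraction pattern work without touching the network, it can be run against a small HTML snippet (the page content below is made up for illustration):

```python
import re

content = ('<a href="http://www.exampleurl.com/page1">One</a> '
           '<a href="http://www.exampleurl.com/page2">Two</a>')

# Non-greedy .*? stops at the closing quote, so each href is captured separately
urls = re.findall(r'href="(http://.*?)"', content)
print(urls)  # prints ['http://www.exampleurl.com/page1', 'http://www.exampleurl.com/page2']
```

A greedy pattern (`.*` instead of `.*?`) would instead swallow everything up to the last quote on the line, which is why the non-greedy form is used.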
Step 5
To complete the Web bot, open the file that it will use as its database and specify what information it should store:
# Append every unique URL found at the entry point to the results file
db = open(db_name, 'a')
allurls = uniq(geturls(enter_point))
db.write('\n'.join(allurls) + '\n')
db.close()
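The store step can be sketched in isolation, substituting a hard-coded list of URLs for the live crawl results so it runs without a network connection (the file name and URLs are placeholders):

```python
db_name = 'example.sql'  # a plain text file of URLs, despite the .sql extension

# Stand-in for uniq(geturls(enter_point)) -- hard-coded for illustration
allurls = ['http://www.exampleurl.com/page1',
           'http://www.exampleurl.com/page2']

# Open in append mode so repeated runs add to, rather than overwrite, the file
with open(db_name, 'a') as db:
    for url in allurls:
        db.write(url + '\n')
```

Using a `with` block guarantees the file is flushed and closed even if a write fails partway through.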
Step 6
Save the text document and upload it to a server or computer with an Internet connection, where you can execute the script and begin scanning Web pages.
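The steps above target Python 2, whose urllib2 module was folded into urllib.request in Python 3. On a current interpreter, the assembled script can be sketched as follows (the entry URL, bot name, and file name are placeholders, and calling main() starts the actual crawl):

```python
import re
import urllib.request

enter_point = 'http://www.exampleurl.com'  # replace with the site to scan
db_name = 'example.sql'                    # plain text file that stores the results

def uniq(seq):
    # Drop duplicate URLs while preserving order
    seen = {}
    for item in seq:
        seen[item] = None
    return list(seen.keys())

def geturls(url):
    # Fetch the page, identifying the bot with a User-Agent header
    request = urllib.request.Request(url, headers={'User-Agent': 'Bot_name'})
    content = urllib.request.urlopen(request).read().decode('utf-8', 'ignore')
    # Collect every absolute http link on the page
    return re.findall(r'href="(http://.*?)"', content)

def main():
    # Append each unique URL found at the entry point to the results file
    with open(db_name, 'a') as db:
        for url in uniq(geturls(enter_point)):
            db.write(url + '\n')

# Uncomment to run the crawl:
# main()
```

The call to main() is left commented out so the functions can be imported and tested without making a live request.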