First, let’s start with an easy one:
import urllib2
response = urllib2.urlopen('http://drapor.me')
print response.read()
<!DOCTYPE html>
<html>
<head>
......
</html>
We can also build a Request object explicitly and pass it to urlopen:
import urllib2
url = 'http://drapor.me'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()
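The code above is Python 2; in Python 3, urllib2 was folded into urllib.request. Here is a sketch of the same request in Python 3 (the timeout and the try/except are my additions, not from the original post):

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

url = 'http://drapor.me'       # the URL from the post
request = Request(url)         # build the Request object explicitly

try:
    response = urlopen(request, timeout=10)
    html = response.read().decode('utf-8')
    print(html[:200])          # show the start of the page
except URLError as exc:
    # No network (or the site is down); the Request object itself is still valid.
    print('fetch failed:', exc)
```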
However, there is far too much in the output; we want to filter it down to just the content we care about (for instance, every article title on the page), so we can bring in a regex (tutorial I recommend) to help us:
import urllib2
import re

response = urllib2.urlopen('http://drapor.me')
content = response.read()            # First, store the content of the whole page
content = content.replace('\n', '')  # Strip all newline characters to avoid some trouble
pattern = r'<a class="article-title" href=".*?">.*?</a>'  # Then your regex; remember the 'r'
result = re.findall(pattern, content)  # Find all the matches
with open("index.html", "w") as f:     # Write the results into a new HTML file
    for item in result:
        f.write(item + '<br>')
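To see what re.findall does here without hitting the network, we can run the same pattern against a tiny stand-in for the page source (the sample HTML below is mine, not drapor.me's actual markup):

```python
import re

# A tiny stand-in for the fetched page; the class name matches the post's pattern.
sample = ('<a class="article-title" href="/post/1">First post</a>\n'
          '<a class="article-title" href="/post/2">Second post</a>')
sample = sample.replace('\n', '')  # same newline-stripping trick as above
pattern = r'<a class="article-title" href=".*?">.*?</a>'
result = re.findall(pattern, sample)
print(result)  # each whole <a>...</a> tag is one match
```

The non-greedy `.*?` is what keeps each match from swallowing everything up to the last `</a>` on the page.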
One pitfall: if you try to capture just the title with a variable-width look-behind, Python's re module rejects it with `error: look-behind requires fixed-width pattern` — which is why the pattern above matches the whole tag instead.
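Since this error trips people up, here is a minimal reproduction (a sketch of my own, not from the original post):

```python
import re

# A fixed-width look-behind compiles fine: 'href="' is always 6 characters.
re.compile(r'(?<=href=")[^"]+')

# A variable-width look-behind (it contains .*?) is rejected at compile time.
try:
    re.compile(r'(?<=href=".*?">).*?(?=</a>)')
    msg = 'no error'
except re.error as exc:
    msg = str(exc)
print(msg)  # mentions that look-behind requires a fixed-width pattern
```

The look-ahead `(?=</a>)` is allowed to be variable-width; only the look-behind carries the fixed-width restriction.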
Open the new HTML file we created in a browser, and we get this:
Wow! How fantastic!!!
And here’s my first attempt (also my first crawler):
URL: http://stat.drapor.me/ Code: https://github.com/GabrielDrapor/FiveManStat-Of-NBAleague