First, let’s start with an easy one:
import urllib2
response = urllib2.urlopen('http://drapor.me')
print response.read()
<!DOCTYPE html>
<html>
<head>
......
</html>
We can also build a Request object explicitly and pass it to urlopen:
import urllib2
url = 'http://drapor.me'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()
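The code above is Python 2; in Python 3, urllib2 was folded into urllib.request. Here is a sketch of the same request in Python 3 (the timeout and the try/except are my additions, not from the original post):

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

url = 'http://drapor.me'       # the URL from the post
request = Request(url)         # build the Request object explicitly

try:
    response = urlopen(request, timeout=10)
    html = response.read().decode('utf-8')
    print(html[:200])          # show the start of the page
except URLError as exc:
    # No network (or the site is down); the Request object itself is still valid.
    print('fetch failed:', exc)
```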
However, there is far too much in the output; we want to filter it down to just the content we care about (for instance, every article title on the page), so we can bring in a regex (tutorial I recommend) to help us:
import urllib2
import re

response = urllib2.urlopen('http://drapor.me')
content = response.read()            # First, store the content of the whole page
content = content.replace('\n', '')  # Strip all newline characters to avoid some trouble
pattern = r'<a class="article-title" href=".*?">.*?</a>'  # Then your regex; remember the 'r'
result = re.findall(pattern, content)  # Find all the matches
with open("index.html", "w") as f:     # Write the results into a new HTML file
    for item in result:
        f.write(item + '<br>')
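To see what re.findall does here without hitting the network, we can run the same pattern against a tiny stand-in for the page source (the sample HTML below is mine, not drapor.me's actual markup):

```python
import re

# A tiny stand-in for the fetched page; the class name matches the post's pattern.
sample = ('<a class="article-title" href="/post/1">First post</a>\n'
          '<a class="article-title" href="/post/2">Second post</a>')
sample = sample.replace('\n', '')  # same newline-stripping trick as above
pattern = r'<a class="article-title" href=".*?">.*?</a>'
result = re.findall(pattern, sample)
print(result)  # each whole <a>...</a> tag is one match
```

The non-greedy `.*?` is what keeps each match from swallowing everything up to the last `</a>` on the page.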
One pitfall: if you try to capture just the title with a variable-width look-behind, Python's re module rejects it with `error: look-behind requires fixed-width pattern` — which is why the pattern above matches the whole tag instead.
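Since this error trips people up, here is a minimal reproduction (a sketch of my own, not from the original post):

```python
import re

# A fixed-width look-behind compiles fine: 'href="' is always 6 characters.
re.compile(r'(?<=href=")[^"]+')

# A variable-width look-behind (it contains .*?) is rejected at compile time.
try:
    re.compile(r'(?<=href=".*?">).*?(?=</a>)')
    msg = 'no error'
except re.error as exc:
    msg = str(exc)
print(msg)  # mentions that look-behind requires a fixed-width pattern
```

The look-ahead `(?=</a>)` is allowed to be variable-width; only the look-behind carries the fixed-width restriction.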
Open the new HTML file we created in a browser, and we get this:
Wow! How fantastic!!!
And here’s my first attempt (also my first crawler):
URL: http://stat.drapor.me/ Code: https://github.com/GabrielDrapor/FiveManStat-Of-NBAleague