In this Code With Me series, I'll introduce one of the beauties of the 'pattern' library in Python. Pattern is a web mining module for Python. It has tools for:
Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
Network Analysis: graph centrality and visualization.
For now, we'll focus on data mining, specifically...
Grabbing all text from a website
The Algorithm
Code:
# Python 2 code (raw_input and the print statement)
from pattern.web import URL, plaintext

with open('wordsGrab.txt', 'w') as w:
    site = raw_input("Input website: ")
    s = URL(site).download()  # download the page source from the URL that was entered
    s = plaintext(s)  # keep only the plain text, removing all HTML tags and code
    plain_txt = s.encode('UTF-8')  # encode as UTF-8 so non-ASCII characters don't raise an encoding error when printed or written
    print plain_txt  # show the text on the shell
    print >> w, plain_txt  # write the same text to wordsGrab.txt
print "done print on wordsGrab.txt"
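If you're curious what the plaintext step is roughly doing, here is a minimal sketch that uses only the Python 3 standard library. It is not how pattern implements plaintext(); it just illustrates the idea of dropping tags and keeping the visible text. TextExtractor and html_to_text are names I made up for this illustration, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    sample = "<html><body><h1>Hello</h1><p>Tea &amp; cake</p></body></html>"
    print(html_to_text(sample))
```

This sketch puts every text chunk on its own line, so inline tags such as <b> also cause a line break. A real library handles spacing, block elements and comments much more carefully, which is a good reason to keep using plaintext() for actual scraping.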