Vlad

[CodewithMe] WordGrab: Grab all text on a website




In this Code with Me series, I'll introduce one of the beauties of the 'pattern' library in Python. Pattern is a web mining module for Python. It has tools for:


Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser

Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet

Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)

Network Analysis: graph centrality and visualization.



For now, we'll focus on data mining specifically.



Grabbing all text on a website



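The Algorithm

Ask the user for a website URL.

Download the page's HTML source from that URL.

Strip the HTML tags and code, keeping only the plain text.

Encode the text as UTF-8.

Print the text on the shell and write it to wordsGrab.txt.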






Code:


from pattern.web import URL, plaintext

# Note: this script uses Python 2 syntax (raw_input and the print statement).
with open('wordsGrab.txt', 'w') as w:
    site = raw_input("Input website: ")

    s = URL(site).download()  # download the HTML source of the URL entered
    s = plaintext(s)  # keep only the plain text, removing all HTML tags and code

    # Encode the text as UTF-8 so that non-ASCII characters
    # can be printed on the shell and written to the file.
    str_plain_txt = s.encode('UTF-8')

    print str_plain_txt
    print >> w, str_plain_txt  # write the same text to wordsGrab.txt

print "Done: text saved to wordsGrab.txt"
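If you don't have pattern installed, the tag-stripping step can be roughly approximated with Python 3's standard library. This is only a sketch of the idea and not how pattern's plaintext() works internally. TextExtractor and html_to_text are hypothetical names made up for this example, and the sample HTML is invented, so nothing is downloaded.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented sample page, used instead of a real download.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Caf\u00e9</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)

text = html_to_text(sample)
encoded = text.encode("utf-8")  # same UTF-8 encoding step as in the script above
print(text)
```

The encode step mirrors the script above: the extracted text is a Unicode string, and encoding it as UTF-8 gives bytes that can safely hold non-ASCII characters such as the "é" in the sample.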





