[CodewithMe]WordGrab: Grab all text on a website

Vlad
Oct 23, 2020
1 min read

In this Code with Me series, I'll introduce one of the beauties of the 'pattern' library in Python. Pattern is a web mining module for Python. It has tools for:

Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser

Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet

Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)

Network Analysis: graph centrality and visualization.

For now, we'll focus on data mining, specifically:

Grabbing all text in a website

The Algorithm

Code:

# Python 2 syntax (raw_input, print statement)
from pattern.web import URL, plaintext

with open('wordsGrab.txt', 'w') as w:
    site = raw_input("Input website: ")
    s = URL(site).download()  # download the page source from the URL entered above
    s = plaintext(s)  # keep only the plain text, removing all HTML tags and code
    plain_txt = s.encode('UTF-8')  # encode the unicode text as UTF-8; the default ASCII codec fails on non-ASCII characters
    str_plain_txt = str(plain_txt)  # make sure we have a plain str for printing on the shell and in the text output
    print str_plain_txt
    print >> w, str_plain_txt  # write the text to wordsGrab.txt

print "done print on wordsGrab.txt"
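If you are on Python 3 and don't have Pattern installed, the same idea (drop the tags, keep the text, save it as UTF-8) can be sketched with only the standard library. This is a rough illustration and not how Pattern's plaintext() works internally. The names TextExtractor, html_to_text and save_text are made up for this example, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # pieces of visible text, in document order
        self.skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def save_text(text, path):
    """Write text to path as UTF-8 so non-ASCII characters are kept."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Invented sample page, used only to show the function in action.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Caf\u00e9</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
```

Calling html_to_text(sample) gives the heading and paragraph text without the style and script contents, and save_text(html_to_text(sample), 'wordsGrab.txt') writes it to a file. In Python 3 strings are already Unicode, so the encoding is chosen once when the file is opened instead of calling encode() by hand.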

