Vlad

Scraping the Web-1

This method has been one of my favorites as of late. Scraping the web is much like picking all of your favorite fruits and veggies first, mixing them up, and serving your appetite a salad; you then eat your least favorite last, or the other way around. The web has been one of the most beautiful yet most risky creations of the 21st century; it can either make or break you. Along with its creation came the making of the 'spider' on the web. These spiders are web scrapers whose goal could be industrial research, API redirection, or some informational gain. Scraping can also be done with bad intentions, like stealing leads, hijacking marketing campaigns, identity theft, and the early phase of hacking.


Search engines, I can say, are the 'legal' web crawlers. One rookie question could be: how does a search engine work? A brief explanation is that these engines are run by a huge number of servers capable of handling a tremendous amount of data. These search engines employ a lot of spiders to crawl the web and provide the information needed by the users.
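To make that concrete, here is a minimal sketch of the core loop such a spider runs: fetch a page, extract its links, and queue them for the next fetch. This is my own illustration, not any search engine's actual implementation; the seed URL and the page cap are placeholders.

# Minimal sketch of a spider's core loop: fetch, extract links, repeat.
# The seed URL and the 20-page cap are arbitrary placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed, max_pages=20):
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            queue.append(urljoin(url, a['href']))  # resolve relative links
    return seen

crawl('http://example.com/')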


Billions, or a number that man can't count, of pieces of information and data have been stored on the internet. A single keyword search on Google could lead you to thousands of results. You can even check a person's background and information just by typing their name and hitting search; several profile accounts will appear in front of you. That is getting all the needed information in just a few seconds, with just a few lines of code and a few keystrokes and mouse clicks.


In this blog, I will walk you through and introduce you to the art of crawling, better known as scraping. As you may have noticed from the title, 'Scraping the Web-1', I know this topic is so huge that one post will not be enough. I'll be covering the basics in this one.


The Process

Below is a simple Python snippet for a basic crawler. The step numbers in the comments match the explanations that follow.


import requests
from bs4 import BeautifulSoup

link = '8.8.8.8'  # example value; in practice this comes from the input file

with requests.Session() as c:                            # (1) open a session named c
    url = 'http://www.ipvoid.com/ip-blacklist-check/'    # (2) URL to be crawled
    c.get(url)                                           # (3) initial GET within the session
    search_data = dict(ip=link)                          # (4) form field 'ip' mapped to our input
    page = c.post(url, data=search_data,                 # (5) POST the form with a Referer header
                  headers={"Referer": "www.ipvoid.com/"})
    page = page.content                                  # (6) raw HTML of the response
    soup = BeautifulSoup(page, 'html.parser')            #     build the parser over that HTML
    td1 = soup.find_all('div', {'class': 'table-responsive'})  # (7) grab the results block





(1) Using Python's requests module, I established a session that will be identified as c.

(2) I then assigned the URL to be crawled.

(3) With the session still open, we GET the URL to be crawled.

(4) From fig1, we determined that the form's field name is ip. We create a variable that maps the form name to link (the variable input from the input file).

(5) Create another variable that will execute the POST request, passing the form data as a dictionary. Three parameters are passed: the URL of the site to be crawled, the variable that represents the input, and the headers. The Referer header is needed for site validation, so the site treats the session as a legitimate one and grants access.

(6) We then get the content of our response via page.content.

(7) With BeautifulSoup, we build a parser over the HTML output (the web server's response) and pull out the data we need.
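
Since link is fed from an input file, here is a hedged sketch of how the whole routine could be driven end to end. The file name ips.txt and its one-IP-per-line format are my assumptions for illustration, not something dictated by the snippet above.

# Hypothetical driver: feed the crawler one IP per line from an input file.
# 'ips.txt' and its one-IP-per-line format are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

url = 'http://www.ipvoid.com/ip-blacklist-check/'

with open('ips.txt') as f, requests.Session() as c:
    c.get(url)                                  # establish the session once
    for line in f:
        link = line.strip()
        if not link:
            continue                            # skip blank lines
        page = c.post(url, data={'ip': link},
                      headers={"Referer": "www.ipvoid.com/"})
        soup = BeautifulSoup(page.content, 'html.parser')
        results = soup.find_all('div', {'class': 'table-responsive'})
        print(link, '->', len(results), 'result block(s)')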


With fewer than 10 lines of core Python, we managed to crawl this threat intel site.



At (7), below is the web server's response. We parsed the content of the div tag with class table-responsive, as per the image below. This tag contains the whole IP Address Information section, including the table with class table table-striped table-bordered that holds the IP's basic info and the blacklist report.
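
To pull the individual fields out of that table, you can walk its rows. The sketch below assumes each row is a label/value pair of cells, which is typical for this kind of report table but may not match the site's exact markup.

# Sketch: flatten the report table inside the 'table-responsive' div
# into a dict. Assumes each <tr> holds a label cell and a value cell;
# adjust to the site's real markup as needed.
from bs4 import BeautifulSoup

def parse_report(html):
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', {'class': 'table-responsive'})
    info = {}
    if div:
        for row in div.find_all('tr'):
            cells = [c.get_text(strip=True) for c in row.find_all(['td', 'th'])]
            if len(cells) >= 2:
                info[cells[0]] = cells[1]
    return info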




This wraps up the first scraping blog. Stay tuned for more of this, as I will show different exercises on web crawling, and we'll see the beauty of it.


krontek>>halt



