My first version of Spyder focused only on crawling a single website or a group of websites, digging deep into the domain by gathering the elements and information the site contains. It can gather links, search for strings in the web pages behind those links, and check whether a given site is up or down.
How it all began
I've been in the field of cybersecurity for more than 3 years now, and as an analyst and engineer, one of the most time-consuming and grueling tasks I've faced is information gathering. This is the part where every one of us goes into a zone of concentration, gathering all the data and information we need about a certain threat or vulnerability. That is why I came up with a "what if" along the way. What if I could make a tool that would do the information gathering for me? You know, search the web about a certain risk or attack, its magnitude and whatnot. Or find the reputation of a certain file hash in VirusTotal, all in just a few seconds or a minute. I think that would be so awesome. It wouldn't only save me a lot of time, it would also take tasks off my plate so I could proceed with other areas of my work. I'm talking about real-time automation, baby!
The technology
So, yeah, quite a historic day! The problem and the solution had been defined. All I needed was the technology and resources for it. I needed a programming language that would help with such an immense task. One capable of automation. One that is easy and simple for me to code in. During my college days I familiarized myself with C, C++ and some other languages, so I decided to create a simple web crawler (one that gathers the links on a website) in C++. I did make it, and the program ran fine. But it took me hundreds of lines of code just for that one task. So I did a lot more research on how I could achieve my goal with minimal code. Maybe some library imports could help me. Then one day I came across some YouTube videos showing how people created web crawling programs to gather information on a certain website. The majority of these YouTubers wrote their applications in Python. Python is a cool programming language, for a lot of reasons. Here are a few:
* First, it is easy to learn. Yes, so easy! Easier than a piece of cake (you know, assembly language? haha). For me it's C with a little mix of Java. C, because the language is simple and, to me, feels close to C in syntax and structure. Java-like, because it gives me a lot of libraries and imports that could help with my future development, and it is somewhat OOP.
* Second, it has extensive support libraries. I need this because I don't have to code the specifics myself, like string manipulation, OS interfaces and the like. It saves me time and it saves me lines ;)
* Third, it has user-friendly data structures. Its built-in list and dictionary types can be used to construct fast runtime data structures.
* Fourth, less time, more productivity. With Python I have developed a lot of programming projects and tools in six months (which I will show you in this blog). The reason behind this is that it has a clean object-oriented design, provides enhanced process control capabilities, and has strong integration and text processing capabilities, plus its own unit testing framework.
* Last but not least, it can run on all major platforms. Yes indeed! Python also has PyPI (the Python Package Index), which has a lot of third-party modules capable of interacting with other languages and platforms. In the future, I'll show you a program that runs in C but is coded in Python ;).
The Alpha version
Spyder is my first project coded in one of the best programming languages out there. My first version took me 286 lines of code, with 3 modules. I have partitioned my app into 3 parts: NightCrawler, Ctrl+F and IsItDwn.
a. NightCrawler - this crawls the site and gets all the links present on it. No duplicate links, and all arranged in alphabetical order.
b. Ctrl+F - this searches for strings in the web content. It outputs the line and the link where the strings were found.
c. IsItDwn - this checks whether a link/site is accessible or not.
First is NightCrawler. With the help of Beautiful Soup and mechanize, I have managed to gather all the links a certain website contains. Let's say we want to gather the links of the Hacker News site.
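To give a feel for the idea, here is a minimal sketch of link gathering with no duplicates and alphabetical ordering. It is not Spyder's actual code: it swaps Beautiful Soup and mechanize for Python's standard-library html.parser so it runs without extra installs, and the names LinkCollector and extract_links, as well as the example.com URL and the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag into a set, so duplicates vanish."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn "/news" into "https://example.com/news".
                self.links.add(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique links found in html, sorted alphabetically."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)


# Hypothetical page content; a real run would download the HTML first.
sample = '<a href="/news">News</a> <a href="/ask">Ask</a> <a href="/news">Again</a>'
links = extract_links(sample, "https://example.com/")
for link in links:
    print(link)
```

The set takes care of the duplicate "/news" link and sorted() gives the alphabetical order, so the sample prints the "/ask" link first and the "/news" link second.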
Above are the links crawled. As you can see in the figure, the results are stored in NighCrawler.txt. The next module is Ctrl+F. The idea is to search the text of one or more web pages. The URLs to be searched are read from search_input.txt, and the search results are stored in search_output.txt. I parse through all the web elements (HTML, JavaScript, CSS, etc.). To do that, I open each link with the urllib library and read through every line. An outer for loop opens the web content for every link in Links (the links from the input file). Inside it, another for loop searches for the text in every line of each web page. The code is below.
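To make the nested loop concrete, here is a minimal sketch of the idea rather than Spyder's exact code. The function names fetch_lines and search_pages, the case-insensitive matching, and the fake page used for the offline demo are all my assumptions for illustration.

```python
from urllib.request import urlopen


def fetch_lines(url):
    """Download a page and return its lines as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace").splitlines()


def search_pages(urls, term, fetch=fetch_lines):
    """Return (url, line_number, line) for every line that contains term."""
    hits = []
    needle = term.lower()  # assumption: the search ignores letter case
    for url in urls:  # outer loop: one page per link
        for number, line in enumerate(fetch(url), start=1):  # inner loop: each line
            if needle in line.lower():
                hits.append((url, number, line.strip()))
    return hits


# Offline demo: a dictionary stands in for the real downloads.
fake_pages = {
    "https://example.com/shop": [
        "<h1>Laptops</h1>",
        "  <li>ASUS laptop</li>",
        "  <li>Another laptop</li>",
    ],
}
results = search_pages(list(fake_pages), "asus", fetch=fake_pages.get)
for url, number, line in results:
    print(f"{url} (line {number}): {line}")
```

For a real run, the list of URLs would come from the lines of search_input.txt, the default fetch_lines would do the downloading, and each hit would be written to search_output.txt instead of printed.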
So let's say you want to look for an ASUS laptop on your top 3 favorite online shops. Just provide the URLs in search_input.txt, as below.
---------- search_input.txt ----------
http://ph.priceprice.com/laptops/
https://www.lazada.com.ph/shop-laptops/
https://www.bestbuy.com/site/searchpage.jsp?id=pcat17071&sc=Global&usc=All%20Categories&type=page&browsedCategory=pcmcat138500050001&st=categoryid%24pcmcat138500050001&qp=currentoffers_facet%3DCurrent+Deals%7EOn+Sale
--------------------------------------
After executing the program, this is the output written to search_output.txt:
The last module is IsItDwn. I decided to include this because, as analysts/engineers, we are always conscious of the sites/URLs that are on priority monitoring. For this module, I use webpage.code to read the status code of the website's response. Basically, my program sends a request to access a certain site, and the server answers with a response whose status code tells us whether the site is reachable.
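Here is a rough sketch of that request-and-response check, written for Python 3's urllib and not taken from Spyder itself. The function name check_site, the rule that any 4xx/5xx answer counts as DOWN, and the fake opener used for the offline demo are my assumptions.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def check_site(url, opener=urlopen):
    """Return (status_code, verdict); status_code is None if nothing answered."""
    try:
        with opener(url, timeout=10) as response:
            code = response.status
    except HTTPError as error:
        code = error.code  # the server answered, but with a 4xx/5xx code
    except (URLError, OSError):
        return None, "DOWN"  # DNS failure, refused connection, timeout
    verdict = "UP" if code < 400 else "DOWN"  # assumption: errors mean DOWN
    return code, verdict


class FakeResponse:
    """Stand-in for a real HTTP response, for the offline demo only."""

    def __init__(self, status):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def fake_opener(url, timeout=10):
    """Pretend that any URL containing 'down' is unreachable."""
    if "down" in url:
        raise URLError("unreachable")
    return FakeResponse(200)


for site in ["https://example.com/", "https://down.example.com/"]:
    print(site, check_site(site, opener=fake_opener))
```

Calling check_site(url) without the opener argument sends a real request. HTTPError is caught before URLError because it is a subclass of it, which lets the sketch still report the status code when the server answers with an error.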
Say, for instance, you want to monitor the sites listed in isitdown.txt.
After running the program, the results are shown below.
And that's it for this post. My next one will cover the next patch and the changes it brings. So stay tuned... You may watch the video of this below. Please don't forget to hit like and subscribe to my channel for more cool posts like this. Thank you, and let's innovate!