Welcome to another installment of Can We Scrape It!
In this episode, I will take you to one of the most visited web applications at the moment. Now that the playoffs have come to a close, the best thing to do is keep an eye on your teams and players this offseason, since this is the time for player movement, free agency, and signings. It may be the 'offseason', but the league's website will still be busy.
With NBA.com as its main website, the league has flourished online and provides a much-needed supply of data to feed basketball nerds. Live scores, videos, news updates, player and team statistics, game schedules: NBA.com is the mecca for all information related to the NBA. There's just one main concern here, just one big question. Can we scrape it?
According to statista.com, 33.93 million Facebook users followed the NBA as of March 2018. Meanwhile, last year's NBA Finals series drew 20.4 million viewers in the United States alone. Game 7 of the 2016 NBA Finals, LeBron vs. Curry part 2, peaked at 44 million viewers, the highest number the league has drawn on live TV. Imagine those humongous numbers, and that's just in the United States. What about Asia, Europe, Africa, and the other continents? The NBA has proven to be one of the most popular sports brands worldwide. It is ranked 3rd, just behind the NFL and the Premier League, among the most watched sports leagues on the planet.
>>TEST<<
Before grabbing any data from the website, we first test whether the server allows us to do what we want. We do that by sending requests through a session in a loop, like so:
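import requests

with requests.Session() as c:
    url = 'https://sports.abs-cbn.com/nba?gr=www'
    c.get(url)  # first request, lets the session pick up any cookies
    # second request, sent with a Referer header
    page = c.post(url, headers={"Referer": "https://ph.global.nba.com"})
    page = page.content
    # <codes to validate connection>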
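The validation step is not shown in the post, so here is only a minimal sketch of what such a check might look like. The helper name `connection_looks_ok` and the list of block-page markers are assumptions made for illustration, not the author's actual code.

```python
def connection_looks_ok(status_code, body):
    """Return True when a response looks like a normal page, not a block page."""
    # Anything other than 200 is treated as a failed connection in this sketch.
    if status_code != 200:
        return False
    text = body.decode("utf-8", errors="ignore").lower()
    # Hypothetical markers of a block page; adjust to what the site really returns.
    blocked_markers = ("captcha", "access denied")
    return not any(marker in text for marker in blocked_markers)
```

Inside the session block, it could be called as `connection_looks_ok(page.status_code, page.content)` on the response object, before `page` is overwritten with its content.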
To judge whether we can get data from this web application, we set up the following metrics:
a. Does the website prohibit scraping?
b. Can we get an unlimited amount of data without a captcha stopping us?
Sending the request to the site's server gives us the following response.
From the response above, we can say that we're in! That settles our first metric.
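A successful response only tells us that the server answered. A common complementary check for metric a is to read the site's robots.txt. The sketch below uses the standard library's `urllib.robotparser`; the rules in `SAMPLE_ROBOTS` and the example.com URLs are made up for illustration and are not the real rules of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only to demonstrate the check.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory
    return parser.can_fetch(user_agent, url)
```

For a live site, the same parser can load the rules itself with `set_url(...)` followed by `read()` instead of `parse(...)`.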
>>GRAB 'n GET<<
So I have put together a simple spider that gets the kind of data an avid fan usually looks for on the website.
> get_sched - This function gets the full schedule listed on the home page. With it, you can not only keep track of whether your favorite team is playing, but also check the current score. Below is a sample execution.
> get_news - This crawls the headlines from the homepage. A basketball fan will always look at the headlines to see how the league is doing. A short snippet of details, together with a cup of coffee, is enough to satisfy a fan.
> get_player_stat - Want to check on your idol's current stats? This function gets a player's points per game (PPG), rebounds per game (RPG), and assists per game (APG). Here is a sample with 3 of the best players in the Finals.
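The spider's source is not included in the post, so the following is only a rough sketch of how the PPG/RPG/APG extraction could work once a player page has been downloaded. The markup in `SAMPLE_HTML`, the `data-label` attribute, the placeholder values, and the helper name `parse_player_stat` are all hypothetical; the real page structure will differ.

```python
from html.parser import HTMLParser

# Hypothetical markup with placeholder values, not real page content.
SAMPLE_HTML = """
<div class="player">
  <span class="stat" data-label="PPG">20.0</span>
  <span class="stat" data-label="RPG">5.0</span>
  <span class="stat" data-label="APG">7.5</span>
</div>
"""

class StatParser(HTMLParser):
    """Collect the text of elements whose data-label is PPG, RPG, or APG."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._label = None  # label of the stat element we are currently inside

    def handle_starttag(self, tag, attrs):
        label = dict(attrs).get("data-label")
        if label in ("PPG", "RPG", "APG"):
            self._label = label

    def handle_data(self, data):
        # Only keep text that directly follows a recognized stat tag.
        if self._label and data.strip():
            self.stats[self._label] = float(data.strip())
            self._label = None

def parse_player_stat(html):
    """Return a dict such as {'PPG': ..., 'RPG': ..., 'APG': ...}."""
    parser = StatParser()
    parser.feed(html)
    return parser.stats
```

Calling `parse_player_stat(SAMPLE_HTML)` returns the three placeholder values keyed by their labels.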
To answer metric b, I have listed my top 10 current players. We'll test whether the website blocks us after scraping some amount of data over a period of time.
Here are my 10 players, listed in the players.txt file.
And below is the result.
We gathered the data we needed for 10 players in just 2 minutes and 15 seconds, with no interruptions. That shows we can pull a lot of data in a small amount of time. It would also explain why some of the well-known applications that crawl this website can gather and manipulate huge amounts of data, to the point that you can create your own league with your favorite players based on their current stats. Cool and interesting...
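The metric b test above boils down to a timed loop over the names in players.txt. Here is a minimal sketch of that loop, assuming one player name per line in the file and a `fetch` callable that returns `None` when the site blocks the request. Both helper names and that `None` convention are assumptions for illustration, not the spider's actual interface.

```python
import time

def read_players(path):
    """Read one player name per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

def scrape_players(names, fetch, delay=0.0):
    """Fetch stats for each name and time the whole run.

    Stops early when fetch returns None, which this sketch treats as
    'blocked' (for example, a captcha page came back).
    """
    results = {}
    start = time.monotonic()
    for name in names:
        stats = fetch(name)
        if stats is None:  # assumption: None signals a block
            break
        results[name] = stats
        time.sleep(delay)  # optional pause between requests
    elapsed = time.monotonic() - start
    return results, elapsed
```

If `results` ends up with fewer entries than `names`, the run was interrupted; otherwise `elapsed` is the total time in seconds for the whole list.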
<krontek>halt