Vlad

Can we scrape it? Eps 3: Lazada


Heyya fellas! Welcome to another scraping episode. In this blog I will show you how I manipulated and hooked data from one of the well-known e-commerce websites, lazada.com. Lazada is one of the most popular online shops to date, especially here in the Philippines. For me it is the mecca of all online shoppers out there. Easy transactions: look for a sale or search for your item, sort the results to your liking, read the product details and comments, check the product's rating, and if you're satisfied then you're good to go! Buy the product or add it to your cart! That simple..


In this blog, I will show you how to automate these things and make your product searching a lot easier.



Let the scraping begin..


First, we must know the target's request headers. So as a preliminary step, we try searching for something, let's say Asus..

Obviously we have lazada.com.ph as the site's referer and host. But take a hard look at the query string params for ASUS.

Let's see if we get the same result when we query 'macbook'..


Diagnosing this further, we see in the headers view that 'q', the value from the search form, is visible in the query params. Both requests have almost identical syntax, except for spm, which serves as an ID for the request.


Taking a look further on the URL requests:


https://www.lazada.com.ph/catalog/?q=macbook&_keyori=ss&from=input&spm=a2o4l.home.search.go.45d86ef0aitbzQ


https://www.lazada.com.ph/catalog/?q=asus&_keyori=ss&from=input&spm=a2o4l.searchlistbrand.search.go.33f77ad6Wv8f28


What URL string parameters will we use for our scraping script? Can we try https://www.lazada.com.ph/catalog/?q= ? Let's see the results below..
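Building that query URL in code is a one-liner. A quick sketch using only the standard library (the search term here is just an example):

```python
from urllib.parse import quote_plus

# hedged sketch: build the search URL from just the q parameter
base = "https://www.lazada.com.ph/catalog/?q="
url = base + quote_plus("macbook pro")  # quote_plus handles spaces and special characters
print(url)  # https://www.lazada.com.ph/catalog/?q=macbook+pro
```

Using quote_plus keeps the script safe for multi-word searches, which a plain string concatenation would break.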



Nice! So let's have that as our URL params then. For the referer? Obviously lazada.com.ph. Again, why do we need to set a referer for our scraper? The main reason is so websites won't block our session; to have a successful session with a certain web application, client-server trust must be established. It's like visiting a certain site, where the referer field serves as one of your passes to be allowed access.


Now we have the initial ingredients, it's time! Let's see if we can crawl this baby!


Below is the code, we run it!
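A minimal sketch of such a request, using only the standard library (the header values here are illustrative assumptions, not necessarily what I used):

```python
import urllib.request

# hedged sketch: build the search request with the referer header set,
# as discussed above (header values are assumptions)
url = "https://www.lazada.com.ph/catalog/?q=asus"
req = urllib.request.Request(url, headers={
    "Referer": "https://www.lazada.com.ph/",
    "User-Agent": "Mozilla/5.0",
})
# urllib.request.urlopen(req).read() would perform the actual fetch
```

The Request object carries the referer with it, so the server sees us as a client coming from Lazada's own pages.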


And we have..



Yep! We're in! And you know what this means: we can pull data from this site and use it for our own application or for automation. In this blog, I decided to keep this easy and basic. Let's grab the product name, current price tag, user ratings, and the product URL so users can just visit the page of the product they like most.


Looking at the web server's response, it seems that the site's product JSON data is embedded in a script tag. This data contains the complete product details, and that is all we need.





The problem now is how we can parse this particular JSON data embedded inside a script tag. Well, let me show you some ways:


a. You can use BeautifulSoup (BS) and handle it like my old-fashioned parsing technique from previous posts:


scripts = soup.find_all('script') #grab every script tag on the page

raw = next(s.text for s in scripts if 'listItems' in s.text) #the tag holding the embedded JSON

prod_name = [item['name'] for item in json.loads(raw)['listItems']] #parse the product name from each listing

#will iterate and seek for the object name (key for the product name) under listItems


b. Without the use of BS, regex-search for 'listItems', the JSON object we need.



I decided to go with b, because I don't want to use any extra library; I want this program to be fast, plus I haven't used this method on this blog yet =)
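To see the idea behind option b in isolation, here's a toy example with a made-up page snippet (the JSON values are invented for illustration):

```python
import json
import re

# toy page snippet imitating JSON embedded in a script tag (values are made up)
page = '<script>window.pageData = {"listItems":[{"name":"ASUS X509","price":"23995"}]};</script>'

# grab the array that follows the "listItems" key, then parse it as JSON
match = re.search(r'"listItems":(\[.*?\])', page)
items = json.loads(match.group(1))
print(items[0]["name"])  # ASUS X509
```

No HTML parser involved: the regex lifts the JSON straight out of the markup, and json.loads turns it into ordinary Python objects.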


script = re.findall(r'"listItems":(.*)',link) #will output the entire key-value pair

str_scr = ''.join(script) #join the regex matches into a single string

str_scr = str_scr.replace("\\","") #strip backslashes to avoid escape issues; for the JSON to be completely parsed, we must follow the proper JSON format

print(str_scr, file=w) #write the result to a text file; w is the handle for outl.txt



We then execute the code, resulting in this..



So the first stage is done; we have parsed the entire JSON object to get the data we need.


Now it's time to properly extract the data we need from the JSON and print it on the cmd.


To do that, we need to open the file that holds the data.


Links1 = [Link1.strip() for Link1 in open('outl.txt','r').readlines()] #read each line of outl.txt


for reg in Links1: #loop over the file's lines

    finds = re.findall(r'"name":"(.*?)",|"price":"(.*?)",|"ratingScore":"(.*?)",|"productUrl":"(.*?)",',reg) #find every field by regex alternation


    for find in finds: #loop through the list of matches

        str_find = ''.join(find) #join each match tuple into a string for a better read format

        print(str_find) #print the converted result



#--more code manipulation---#
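To make the alternation-and-join trick above concrete, here's a self-contained run on a single toy line shaped like the ones in outl.txt (the values are invented):

```python
import re

# toy line in the same shape as the dumped JSON (values are made up)
line = ('"name":"ASUS X509","price":"23995",'
        '"ratingScore":"4.8","productUrl":"//www.lazada.com.ph/products/x509",')

# each match is a 4-tuple with exactly one non-empty slot,
# so joining the tuple yields just the captured value
finds = re.findall(r'"name":"(.*?)",|"price":"(.*?)",|"ratingScore":"(.*?)",|"productUrl":"(.*?)",', line)
fields = [''.join(find) for find in finds]
for field in fields:
    print(field)
```

Running this prints the name, price, rating, and URL on separate lines, which is exactly what the loop above does over the whole file.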

Running our program, we get the results below.

We just extracted the parameters we needed: the product name, product URL, product price, and the ratings. So now you guys have an idea of how it is possible to manipulate Lazada; we can get the data we want, however and whenever we want it. May this help you in building your application.


krontek> halt


