
Web Scraping with a Simple and Basic Technique in Python

Hello all! Web scraping is the activity of visiting websites and fetching the needed information from them. To automate this activity, developers write scripts in different languages and tools (JavaScript, PHP, Python, Node.js, PhantomJS, etc.). While every language has some benefits over the others, the core concept of web scraping remains the same. I have scraped many different kinds of websites over time, and my favorite language for it has been Python. In this tutorial, we will discuss how to scrape almost any website with Python.

Again, the most important part is the concept. Once the concept is clear, you can pick the language of your choice, though I will provide sample code in Python to get you started. The concept applies to websites in general, including those that require authentication.
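For authenticated sites, the idea is to perform the login request first and reuse its cookies for everything after. Here is a minimal sketch, assuming a hypothetical site whose login form POSTs username and password fields to /login (the URLs and field names are illustrative, not from any real site):

import requests

# A requests.Session keeps cookies across requests, so the login survives.
session = requests.Session()
session.post("https://example.com/login",            # hypothetical login endpoint
             data={"username": "me", "password": "secret"})

# Subsequent requests carry the session cookie set by the login response.
page = session.get("https://example.com/members")    # hypothetical protected page
print(page.status_code)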

Web Scraping Concept:

1- Make a proper request to the web page you want to scrape.

  • Make an HTTP request to the website. To do this, you need to send the right headers and request data.
  • To figure out the right request, use the Network tab of Chrome Developer Tools.
  • Open a new tab in Chrome and open Chrome Developer Tools (Ctrl + Shift + I). Click on the Network tab and enable Preserve log.
  • The current tab is now set to record the requests for whatever website it opens.
  • Open the particular request you want to replicate and note its request method (GET, POST), request headers, and request data; your scraper will need to reproduce these.

2- Find the data you need and store it.

  • Create a soup object (in Python, via BeautifulSoup) from the received response content.
  • Find the needed data using soup methods (find, find_all); a minimal sketch of both steps follows.
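To make the two steps concrete, here is a minimal sketch against a hypothetical page (the URL, headers, and markup are illustrative; copy the real ones from the Network tab):

import requests
from bs4 import BeautifulSoup

# Step 1: make a request that mirrors what the browser sends.
url = "https://example.com/listings"              # hypothetical page
headers = {"user-agent": "Mozilla/5.0"}           # copy real headers from DevTools
response = requests.get(url, headers=headers)

# Step 2: parse the response and pull out the needed data.
soup = BeautifulSoup(response.text, "lxml")
for item in soup.find_all("div", {"class": "listing"}):   # hypothetical markup
    print(item.get_text(strip=True))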

Let’s take a live example as a web scraping demonstration:

  • Website: Naukri.com
  • Objective: Scrape Python job listings

Prerequisites:

  • Install the requests, beautifulsoup4, and lxml libraries (e.g. via pip) for Python 3, or their equivalents for your Python version.
  • Install any other libraries you may need.

Example:

1- We will manually analyze the Python jobs page to find the request URL, request method, request headers, and request data.

The resulting URL is https://www.naukri.com/python-jobs

  • Check the Headers tab inside the Network tab
  • General:
  • Request URL: https://www.naukri.com/python-jobs
  • Request Method: POST
  • Status Code: 200
  • Request Headers: accept, accept-encoding, accept-language, etc.
  • Form Data: qp, ql, qe, qm, etc.

2- We will construct a similar request in Python and get the response.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.naukri.com/python-jobs"

# Headers copied from the browser request. 'content-length' is omitted because
# requests computes it automatically from the body.
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "en-US,en;q=0.9",
    'cache-control': "no-cache",
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'origin': "https://www.naukri.com",
    'referer': "https://www.naukri.com/python-jobs",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
}

payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qp\"\r\n\r\npython\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"ql\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qe\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qm\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qx\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qs\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qo\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qk[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwdt\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qsb_section\"\r\n\r\nhome\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qpremTagLabel\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"sid\"\r\n\r\n15556764581018\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwd[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qck[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"edu[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcug[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcpg[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qctc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qco[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qrefresh\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"xt\"\r\n\r\nadv\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qtc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; 
name=\"fpsubmiturl\"\r\n\r\nhttps://www.naukri.com/python-jobs\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qlcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"src\"\r\n\r\nsortby\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"px\"\r\n\r\n1\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"latLong\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"

data = {"qp":"python","ql":",","qe":"","qm":"","qx":"","qi[]":"","qf[]":"","qr[]":"","qs":"f","qo":"","qjt[]":"","qk[]":"","qwdt":"","qsb_section":"home","qpremTagLabel":"","sid":"15556764581018","qwd[]":"","qcf[]":"","qci[]":"","qck[]":"","edu[]":"","qcug[]":"","qcpg[]":"","qctc[]":"","qco[]":"","qcjt[]":"","qcr[]":"","qcl[]":"","qrefresh":"","xt":"adv","qtc[]":"","fpsubmiturl":"https://www.naukri.com/python-jobs","qlcl[]":"","latLong":"", "src":"sortby", "px": 1}

response = requests.post(base_url, data=payload, headers=headers, params=data)
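Before parsing, it is worth confirming that the request actually succeeded. A quick, purely illustrative check:

# Sanity check: 200 should match the status code seen in DevTools, and a
# non-trivial body length suggests we received real HTML.
print(response.status_code)
print(len(response.text))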

3- We will extract the job data from the response.

soup = BeautifulSoup(response.text, "lxml")
posts = []
for div in soup.find_all('div', {'type': 'tuple', 'class': 'row'}):
    post = {
        'company_name': ''
    }
    try:
        # The company name sits in a <span class="org"> inside each job tuple.
        company = div.find('span', {'class': 'org'}).getText()
        company = company.strip()
        company = company.replace(',', ' || ')
        post['company_name'] = company
    except AttributeError:
        # Some tuples have no company span; keep the default empty name.
        pass
    posts.append(post)
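Having collected the rows, you can store them however you like. A minimal sketch that writes the posts list built above to a CSV file:

import csv

# Persist the scraped rows (one column here; add more fields as you extract them).
with open("python_jobs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["company_name"])
    writer.writeheader()
    writer.writerows(posts)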

Awesome! In a similar fashion, you can scrape almost any website. Do try out the concept to create your first scraper.

Also, let me know if you have suggestions for improvement or any questions.

Note: This article is for educational purposes only and in no way endorses illegal web scraping.


The Bot 2: How to Make a Browserless Web Scraper?

We will continue from our last post, which discussed how to make a simple bot that accesses webpages anonymously.
That bot worked fine, but it has some drawbacks.
For example, it used a browser to open webpages, which makes it a little slow. It also uses more memory and storage (though the cache can be cleared regularly). Sometimes, when the browser is closed forcefully, Firefox may pop up a dialog asking whether to open in safe mode or reset; this interrupts the automation of the bot, and you then have to reset the Firefox proxy settings.

Now, in this post we will create a web scraper (the bot 2) that accesses web pages without a web browser (a browserless web scraper). We are going to use PhantomJS as a headless browser (it still executes the JavaScript on the page) and bash to automate it.

Programs
We will make two files, phantomBot.js and bashBot.sh.

phantomBot.js

var page = require('webpage').create();
var system = require('system');

// Give each resource up to 10 seconds before timing out.
page.settings.resourceTimeout = 10000;
page.onResourceTimeout = function(error) {
    console.log(error.errorString);
    console.log(error.errorCode);
    console.log(error.url);
    phantom.exit();
};

// Log every resource that comes back: its id, HTTP status, and URL.
page.onResourceReceived = function(response) {
    console.log("####" + response.id + "####" + response.status + "####" + response.url);
};

if (system.args.length === 1) {
    console.log("Enter Url");
    phantom.exit();
} else {
    page.open(system.args[1], function(status) {
        console.log(status);
        if (status === 'success') {
            var title = page.evaluate(function() {
                return document.title;
            });
            console.log(title);
        }
        phantom.exit();
    });
}

Explanation
1- page is a webpage object.
2- system is the object used to read command-line arguments.
3- a timeout for the page load is set.
4- a function is defined to be called on timeout.
5- a function is defined to be called whenever resource data is received successfully.
6- check whether a URL was passed on the command line.
7- open the URL using the webpage object (page).
8- read the title of the document once the URL loads successfully.
9- exit PhantomJS.

bashBot.sh

#!/bin/bash
sudo service tor restart
array=("http://www.example.com/p" "http://www.example.com/q")
for item in "${array[@]}"
do
    sudo phantomjs --proxy=127.0.0.1:9050 --proxy-type=socks5 phantomBot.js "$item"
done
sudo bash bashBot.sh   # re-run this script so it keeps cycling with a fresh IP

Explanation
1- Tor is restarted to change the user's IP.
2- an array of links that pay you for traffic (do not leave spaces before or after the '=' sign in bash assignments).
3- iterate through each URL and pass it to the PhantomJS program phantomBot.js to open it.
4- the script re-runs itself so the cycle repeats with a new IP.

Note
To start the bot, run bash bashBot.sh from the command line.

Great, you have created a slightly more efficient bot (a browserless web scraper).
Read websites’ terms, take your own risks, and enjoy the free money. 🙂


The Bot: How to Make a Simple Anonymous Web Scraper?

A web scraper is a program that automates the process of accessing websites, with or without a browser. In this post, by web scraping I mean only accessing webpages, and the web scraper built here accesses them through a web browser.

An anonymous web scraper is one that keeps the identity of the program hidden, i.e. a program that accesses a website without revealing its identifying information (IP address).

Usage (though this should be used at one’s own risk)
There are websites that offer you money for bringing traffic to them. Using such anonymous web scrapers (the bots), you can send fake traffic to those websites and earn some money.

Note: These programs were tested on Ubuntu 14.04. Similar programs can be written for other platforms as well.
We will discuss two ways of making such a web scraper here using Python.
For the first method, we will use Python's webbrowser library to open webpages and Tor to provide anonymity. We will use the subprocess library to automate it.

Bot

import time
import subprocess as sp
import webbrowser

urls = ["www.example.com/p", "www.example.com/q"]
count = 10000
while count >= 0:
    for url in urls:
        webbrowser.open(url)
        time.sleep(2)
    time.sleep(4)
    sp.call(["sudo", "killall", "firefox"])
    sp.call(["sudo", "/etc/init.d/tor", "restart"])
    count -= 1

Explanation

1- urls contains the list of URLs; bringing traffic to them earns you money.
2- the while loop repeats the iteration over the list of urls.
3- each url is opened using webbrowser (Firefox by default), followed by a two-second wait.
4- after opening all the urls, it waits four more seconds to let the pages load properly (these delays can be tuned to how long the website takes to load its pages).
5- Firefox is forcefully closed.
6- Tor is restarted so that the user's IP changes.

Note
Before running the program, configure Firefox to use a SOCKS5 proxy on port 9050 so that its requests go through Tor.
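If you prefer to script that configuration too, one hedged option is to append the proxy preferences to the profile's user.js file. The profile directory below is hypothetical; look under ~/.mozilla/firefox/ for yours:

import os

# Hypothetical profile directory; replace with your actual one.
profile = os.path.expanduser("~/.mozilla/firefox/XXXXXXXX.default")
prefs = [
    'user_pref("network.proxy.type", 1);',           # 1 = manual proxy configuration
    'user_pref("network.proxy.socks", "127.0.0.1");',
    'user_pref("network.proxy.socks_port", 9050);',  # Tor's default SOCKS port
]
with open(os.path.join(profile, "user.js"), "a") as f:
    f.write("\n".join(prefs) + "\n")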

For the second method, we will simply open the browser through subprocess:

import time
import subprocess as sp

urls = ["www.example.com/p", "www.example.com/q"]
count = 10000
while count >= 0:
    for url in urls:
        child = sp.Popen("firefox %s" % url, shell=True)
        time.sleep(2)
    time.sleep(4)
    sp.call(["sudo", "killall", "firefox"])
    sp.call(["sudo", "/etc/init.d/tor", "restart"])
    count -= 1

Explanation
1- the webpages are opened as subprocesses instead of through the webbrowser library; the rest works as in the first method.

Now you are all set up with your anonymous web scraper.

Note
Do set up Firefox for the proxy as in the first method.
With such bots, the pages must execute JavaScript to register the visit, which is why browsers are used.
The other technique, creating the bot using PhantomJS and bash, can be found here.
Read websites’ terms, take your own risks, and enjoy the free money. 🙂