Web Scraping with a Simple and Basic Technique in Python

Hello all! Web scraping is the activity of visiting websites and fetching the information you need from them. To automate this activity, developers write scripts in different programming languages (JavaScript, PHP, Python, Node.js, PhantomJS, etc.). While every language has some benefits over the others, the core concept of web scraping remains the same. I have scraped many different kinds of websites over time, and my favorite language for the job is Python. In this tutorial, we will discuss how to scrape almost any website with Python.

Again, the most important part is the concept. Once the concept is clear, you can pick the language of your choice, though I will provide sample code in Python to get you started. The concept covers a general website and should also work on websites that require authentication.

Web Scraping Concept:

1- Make a proper request to the web page you want to scrape.

  • Make an HTTP request to the website. To do this, you need to send the right headers and request data.
  • To form the right request, use the Network tab of Chrome Developer Tools.
  • Open a new tab in the Chrome browser, open Chrome Developer Tools (Ctrl + Shift + I), click on the Network tab, and then enable "Preserve log".
  • The current tab is now all set to record the requests for whatever website it opens.
  • Open the particular request you want to replicate and note its request method (GET, POST), request headers, and request data, all of which you need to reproduce in your web scraper. A minimal sketch of such a request follows this list.
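
For a simple GET page, replaying the headers you recorded is often enough. Here is a minimal sketch (the URL and header values below are placeholders; copy the real ones from the Network tab):

import requests

# Placeholder URL and headers; replace them with the values recorded in DevTools.
url = "https://example.com/some-page"
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    'user-agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}

response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the server accepted the request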

2- Find the data you need and store it.

  • Create a BeautifulSoup object (in Python) from the received response content.
  • Find the data you need using soup methods (find, find_all); a short sketch of both follows this list.
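
As a quick illustration of these two methods, here is a minimal sketch parsing a hard-coded HTML snippet (the markup is invented for the example):

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real response body.
html = """
<div class="row"><span class="org">Acme Corp</span></div>
<div class="row"><span class="org">Globex</span></div>
"""

soup = BeautifulSoup(html, "lxml")

# find() returns the first match; find_all() returns every match.
first = soup.find('span', {'class': 'org'})
print(first.getText())            # Acme Corp

for span in soup.find_all('span', {'class': 'org'}):
    print(span.getText())         # Acme Corp, then Globex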

Let's take a live example to demonstrate web scraping:

  • Website: Naukri.com
  • Objective: Scrape Python job listings

Prerequisites:

  • Install the requests, beautifulsoup4, and lxml libraries for Python 3 (e.g., pip install requests beautifulsoup4 lxml).
  • Install any other required libraries as needed.

Example:

1- We will manually analyze the Python jobs page to find the request URL, request method, request headers, and request data.

The request URL is https://www.naukri.com/python-jobs

  • Check the Headers tab of the request in the Network tab
  • General:
  • Request URL: https://www.naukri.com/python-jobs
  • Request Method: POST
  • Status Code: 200
  • Request Headers: accept, accept-encoding, accept-language, etc.
  • Form Data: qp, ql, qe, qm, etc.

2- We will create a similar request object in Python and get the response.

base_url = "https://www.naukri.com/python-jobs"
headers = {
                    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                    'accept-encoding': "gzip, deflate, br",
                    'accept-language': "en-US,en;q=0.9",
                    'cache-control': "no-cache",
                    'content-length': "487",
                    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
                    'origin': "https://www.naukri.com",
                    'referer': "https://www.naukri.com/python-jobs",
                    'upgrade-insecure-requests': "1",
                    'user-agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/
                    }

payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qp\"\r\n\r\npython\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"ql\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qe\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qm\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qx\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qs\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qo\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qk[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwdt\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qsb_section\"\r\n\r\nhome\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qpremTagLabel\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"sid\"\r\n\r\n15556764581018\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwd[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qck[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"edu[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcug[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcpg[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qctc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qco[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qrefresh\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"xt\"\r\n\r\nadv\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qtc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; 
name=\"fpsubmiturl\"\r\n\r\nhttps://www.naukri.com/python-jobs\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qlcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"src\"\r\n\r\nsortby\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"px\"\r\n\r\n1\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"latLong\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"

data = {"qp":"python","ql":",","qe":"","qm":"","qx":"","qi[]":"","qf[]":"","qr[]":"","qs":"f","qo":"","qjt[]":"","qk[]":"","qwdt":"","qsb_section":"home","qpremTagLabel":"","sid":"15556764581018","qwd[]":"","qcf[]":"","qci[]":"","qck[]":"","edu[]":"","qcug[]":"","qcpg[]":"","qctc[]":"","qco[]":"","qcjt[]":"","qcr[]":"","qcl[]":"","qrefresh":"","xt":"adv","qtc[]":"","fpsubmiturl":"https://www.naukri.com/python-jobs","qlcl[]":"","latLong":"", "src":"sortby", "px": 1}

response = requests.request("POST", base_url, data=payload, headers=headers, params=data)
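
Before parsing, it is worth confirming the request actually succeeded; this small check is an addition to the original walkthrough:

# Fail fast if the server did not return a 2xx status code.
response.raise_for_status()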

3- We will extract the job data from the response.

soup = BeautifulSoup(response.text, "lxml")

# Collect one record per job tuple on the page.
posts = []
for div in soup.find_all('div', {'type': 'tuple', 'class': 'row'}):
    post = {
        'company_name': ''
    }
    try:
        company = div.find('span', {'class': 'org'}).getText()
        company = company.strip()
        company = company.replace(',', ' || ')
        post['company_name'] = company
    except AttributeError:
        # This listing has no company span; keep the empty default.
        pass
    posts.append(post)
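
To complete step 2 ("find the data and store it"), you can persist the results, for example to a CSV file. A minimal sketch using Python's standard csv module (the filename is arbitrary):

import csv

# Write the scraped posts (the list built in the loop above) to a CSV file.
with open('python_jobs.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['company_name'])
    writer.writeheader()
    writer.writerows(posts)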

Awesome! In a similar fashion, you can scrape almost any website. Do try out the concept and create your first scraper.

Also, let me know if you have suggestions for improvement or any queries.

Note: This article is for educational purposes only and in no way endorses illegal web scraping.
