
Optimization – Store Static Files on AWS S3 with Git Hooks and the AWS CLI

In this tutorial, we will learn how to store your static files, such as JS, CSS, JSON, and images, on AWS S3. This can be a drastic improvement for your web servers: every page load triggers dozens of follow-up requests for such static files. These requests need no server-side processing (which is exactly why the files can live on S3, a plain storage service), yet they still hit your web servers and consume significant capacity. Serving them from S3 offloads that work from the web servers. We are going to use Git hooks and the AWS CLI to achieve this.

The challenge is the repeated upload of your static files to S3, which is tedious if done manually. Git hooks and the AWS CLI can work together to automate the syncing of your static files.

Git Hooks

Git hooks, in simple terms, are scripts that Git runs at intermediate steps of a Git command. For example, the pre-push hook executes on every git push command, just before the code is actually pushed.

AWS CLI

The AWS CLI (Command Line Interface), in simple terms, is a command line tool with commands for interacting with AWS services such as S3, EC2, and so on.

Here, we can put an AWS CLI command inside the pre-push hook.

The process is this: you define a constant, say ASSETS_URL, holding the base URL of your static files. For example, in your test environment it would be http://localhost/project/, and in production it would be your S3 address (or a CloudFront URL such as https://cdn.example.com/ in front of S3). The static file URLs would then look like ASSETS_URL.'assets/img/a.png', ASSETS_URL.'assets/css/a.css', ASSETS_URL.'assets/js/a.js', and so on.

The testing and development process remains the same, as all file copies stay on your local server. In production, however, the pages will look for these files at the CDN address. So, before shipping your code to the production servers, you need to upload all new or modified static files to S3.

This is where Git hooks come into the picture. One of them is the pre-push hook, which you can create or edit at .git/hooks/pre-push (see https://stackoverflow.com/a/14672883/2560576 for a sample pre-push hook).

In the pre-push hook, add an AWS CLI command that syncs your local assets folder to the assets folder of your S3 bucket. For example: aws s3 sync assets/ s3://bucket/assets/ --profile aws_credential_profile --acl public-read
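
A Git hook can be any executable script. As a minimal sketch, assuming Python 3 is available, a pre-push hook along these lines could run the sync (the bucket name and credential profile below are the same placeholders as above):

#!/usr/bin/env python3
# .git/hooks/pre-push  (make it executable: chmod +x .git/hooks/pre-push)
# Sketch: sync the local assets/ folder to S3 before every push.
import subprocess
import sys

result = subprocess.run([
    "aws", "s3", "sync", "assets/", "s3://bucket/assets/",
    "--profile", "aws_credential_profile",
    "--acl", "public-read",
])

# A non-zero exit status aborts the push, so a failed sync is never silently ignored.
sys.exit(result.returncode)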

So, when development is complete, you run git push as usual to push your code to the remote repository. Thanks to the Git pre-push hook, all the static files are synced to your S3 bucket's assets folder just before the actual push.

Now only requests that need processing reach your web server, and all static file requests are served from S3.

Hope this helps someone. Please share your feedback on anything to improve or add.

More: Automatic PWA Converter Platform

Thanks!


Web Scraping with a Simple and Basic Technique in Python

Hello all! Web scraping is the activity of visiting websites and fetching the needed information from them. To automate this activity, developers write scripts in different programming languages (JavaScript, PHP, Python, Node.js, PhantomJS, etc.). While every language has some benefits over the others, the core concept of web scraping remains the same. I have scraped many different kinds of websites over time, and my favourite language for it is Python. In this tutorial, we will discuss how to scrape almost any website with Python.

Again, the most important part is the concept. Once the concept is clear, you can choose the language of your choice, though I will provide sample Python code to get started. The concept covers a general website, and it should also work on websites that require authentication.

Web Scraping Concept:

1- Make a proper request to the web page you want to scrape.

  • Make an HTTP request to the website. To do this, you need to send the right headers and request data.
  • To form the right request, you can use the Network tab in Chrome Developer Tools.
  • Open a new tab in Chrome, open Chrome Developer Tools (Ctrl + Shift + I), click the Network tab, and enable Preserve log.
  • The current tab is now set to record the requests made by whichever website it opens.
  • Open the particular request you want to reproduce and note the request type (GET, POST), request headers, and request data; you will need these while building the scraper.

2- Find the data you need and store it.

  • Create a soup object (in Python, with BeautifulSoup) from the received response content.
  • Find the needed data using soup methods such as find and find_all (a minimal sketch follows this list).
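
To illustrate both steps before the live example, here is a minimal generic sketch (the URL, tag name, and class below are placeholders, and it uses a plain GET request; the live example further down needs a POST request):

import requests
from bs4 import BeautifulSoup

# Step 1: make the request (placeholder URL and a browser-like user-agent header).
response = requests.get("https://example.com", headers={"user-agent": "Mozilla/5.0"})

# Step 2: create a soup object and find the needed data.
soup = BeautifulSoup(response.text, "lxml")
for item in soup.find_all("div", {"class": "item"}):  # placeholder tag/class
    heading = item.find("h2")
    if heading is not None:
        print(heading.get_text(strip=True))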

Let’s take a live example for a web scraping demonstration:

  • Website: Naukri.com
  • Objective: Scrape different job listings for Python

Prerequisites:

  • Install the requests, beautifulsoup4 (bs4), and lxml libraries for Python 3, or the equivalents for your Python version.
  • Install any other required libraries as needed.

Example:

1- We will manually analyze the page for Python jobs to find the request URL, request type, request headers, and request data.

The resulting URL is https://www.naukri.com/python-jobs

  • Check the Headers tab of the request in the Network tab.
  • General:
  • Request URL: https://www.naukri.com/python-jobs
  • Request Method: POST
  • Status Code: 200
  • Request Headers: accept, accept-encoding, accept-language, etc.
  • Form Data: qp, ql, qe, qm, etc.

2- We will create a similar request object in Python and get the response.

import requests

base_url = "https://www.naukri.com/python-jobs"

# Request headers copied from the Chrome Network tab.
# Note: content-length is omitted here; the requests library sets it automatically.
headers = {
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "en-US,en;q=0.9",
    'cache-control': "no-cache",
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'origin': "https://www.naukri.com",
    'referer': "https://www.naukri.com/python-jobs",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}

# Raw multipart request body copied from the browser; it must stay on a single line.
payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qp\"\r\n\r\npython\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"ql\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qe\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qm\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qx\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qi[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qs\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qo\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qk[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwdt\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qsb_section\"\r\n\r\nhome\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qpremTagLabel\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"sid\"\r\n\r\n15556764581018\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qwd[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcf[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qci[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qck[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"edu[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcug[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcpg[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qctc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qco[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcjt[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcr[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qrefresh\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"xt\"\r\n\r\nadv\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qtc[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"fpsubmiturl\"\r\n\r\nhttps://www.naukri.com/python-jobs\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"qlcl[]\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"src\"\r\n\r\nsortby\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"px\"\r\n\r\n1\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"latLong\"\r\n\r\n\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"

data = {"qp":"python","ql":",","qe":"","qm":"","qx":"","qi[]":"","qf[]":"","qr[]":"","qs":"f","qo":"","qjt[]":"","qk[]":"","qwdt":"","qsb_section":"home","qpremTagLabel":"","sid":"15556764581018","qwd[]":"","qcf[]":"","qci[]":"","qck[]":"","edu[]":"","qcug[]":"","qcpg[]":"","qctc[]":"","qco[]":"","qcjt[]":"","qcr[]":"","qcl[]":"","qrefresh":"","xt":"adv","qtc[]":"","fpsubmiturl":"https://www.naukri.com/python-jobs","qlcl[]":"","latLong":"", "src":"sortby", "px": 1}

response = requests.request("POST", base_url, data=payload, headers=headers, params=data)

3- We will find the job data from the response.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

# Each job posting on the results page is a div with type="tuple" and class "row".
for div in soup.find_all('div', {'type': 'tuple', 'class': 'row'}):
    post = {
        'company_name': ''
    }
    try:
        company = div.find('span', {'class': 'org'}).getText()
        company = company.strip()
        company = company.replace(',', ' || ')
        post['company_name'] = company
    except AttributeError:
        # The span is missing for some listings; keep the default empty value.
        pass
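
As a small follow-up (not part of the original snippet), you could store every post in a list while looping and then print a quick summary:

jobs = []
for div in soup.find_all('div', {'type': 'tuple', 'class': 'row'}):
    post = {'company_name': ''}
    try:
        post['company_name'] = div.find('span', {'class': 'org'}).getText().strip().replace(',', ' || ')
    except AttributeError:
        pass
    jobs.append(post)

print(len(jobs), "job postings found")
for job in jobs[:5]:
    print(job['company_name'])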

Awesome! In a similar fashion, you can scrape almost any website. Do try out the concept and create your first scraper.

Also, let me know your suggestions for improvement and any queries.

Note: This article is for educational purposes only and in no way endorses illegal web scraping.


AWS Resource Migration Across Accounts – Part 1 – EC2

There are cases when one needs to migrate to a new AWS account. One very obvious reason: you are using AWS free-tier services, and after a year, when the free tier is over, you may want to create a new AWS account.

But when you create a new AWS account, you need to recreate all of your AWS resources, like EC2, ELB, S3, RDS, Elasticsearch, etc.

Along with recreating all the resources, you would need to retain all the data, tools, and settings of your current AWS resources. For example, suppose you were running an EC2 machine for a website and had installed various tools on it, like phpMyAdmin, Node.js, nginx, Python, and Gunicorn.

Now, creating a new account and the corresponding resources is okay, but restoring everything to its current state is not very straightforward.

But let me tell you, it is not actually very difficult either.

This article of the AWS account migration series focuses on migrating EC2 resources.

Terminologies:

  • naws: new AWS account
  • oaws: old AWS account

Steps to migrate EC2 from oaws to naws:

  1. Create an AMI of the EC2 instance in oaws.
  2. Edit the permissions of the above-created AMI.
  3. Add the naws account ID in step 2. This shares the AMI with naws (a boto3 sketch of steps 1-3 is shown after this list).
  4. Wait for a few minutes. Sharing the AMI might take 5-15 minutes.
  5. In naws, go to the AMI list and select Shared AMIs. You should see the above-created AMI here. In case you don't see it yet, please wait a while.
  6. Once you see the new AMI in naws, select it and launch a new EC2 instance.
  7. Create a new security group in naws, copying the rules from oaws.
  8. Attach the above-created security group to the new EC2 instance in naws.
  9. You should now have a working EC2 instance, a clone of the one in oaws. You might have to SSH into the new instance and start your servers if they are not part of a startup script, since the instance has only just been launched.
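
For reference, steps 1-3 can also be scripted. Below is a minimal boto3 sketch under a few assumptions: it runs with your oaws credentials configured locally, the AMI is unencrypted and EBS-backed, and the instance ID, AMI name, region, and naws account ID are placeholders:

import boto3

# Assumes oaws credentials are configured locally; the region is a placeholder.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Step 1: create an AMI from the EC2 instance (placeholder instance ID and name).
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="migration-ami")
image_id = image["ImageId"]

# Wait until the AMI is available before editing its permissions.
ec2.get_waiter("image_available").wait(ImageIds=[image_id])

# Steps 2-3: add the naws account ID (placeholder) as a launch permission,
# which shares the AMI with the new account.
ec2.modify_image_attribute(
    ImageId=image_id,
    LaunchPermission={"Add": [{"UserId": "123456789012"}]},
)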

Hack For Removal Of Google’s View Image Option

You might have noticed that Google has removed its View Image option. This is an initiative by Google to reduce online theft of pictures, and it should also reduce copyright violations involving online images.

But the removal of this option has also made things harder for bloggers and other kinds of users. There are already various plugins out there which claim to bring that option back. But if you're reluctant to install any external plugin, or you don't want to mess with Google's practice, this article presents a clean way to get the URL of those images, with 100% surety.

This method is completely lightweight, in the sense that it requires no third-party plugin. It also won't signal to Google that a particular user could be violating the View Image practice. Let's jump in and see how this can be done.

(This is a developers'/programmers' hack, so it might feel a bit lengthy, but it is surely a clean one.)

1- Search your image on Google.

2- Click on the image. It will take you to the respective URL it has been indexed from.

3- Now, on this page, if you find the same image, great: you can right-click on it and save the image.

4.1- If you can't find the image on the page, press Ctrl+U, or right-click and choose View Page Source.

4.2- In the newly opened tab, search for the text "og:image". If you find it, the URL in the content attribute of that tag should be the required image.

4.3- If you don't find that text, try another image from the related images in the Google search and go back to step 2.
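
If you prefer to script steps 4.1-4.2, here is a small sketch using requests and BeautifulSoup (the page URL is a placeholder, and not every page defines an og:image tag):

import requests
from bs4 import BeautifulSoup

# Placeholder: the page that the Google image result points to.
page_url = "https://example.com/some-article"
response = requests.get(page_url, headers={"user-agent": "Mozilla/5.0"})

soup = BeautifulSoup(response.text, "lxml")
tag = soup.find("meta", {"property": "og:image"})

if tag is not None and tag.get("content"):
    print("Image URL:", tag["content"])
else:
    print("No og:image tag found; try another search result.")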

Now you can find the URL of any image, not just images on Google, without relying on any malfunctioning or third-party plugin.

Hope this article was helpful. Please like/share. Have a good day!