Categories
bash bot headless browser non-browser phantomjs tor traffic web scraper

The Bot 2 : How to make a browserless web scraper?

We will continue from our last post, which discussed how to make a simple bot that accesses web pages anonymously.
The bot there worked fine, but it has some drawbacks.
For example, it uses a browser to open web pages, which makes it a little slow. It also consumes more memory and storage (though the cache can be cleared regularly). And sometimes, when the browser is closed forcefully, it pops up a dialog box asking whether to open Firefox in safe mode or reset it. This interrupts the automation of the bot, and you then have to reset the Firefox proxy settings as well.

Now, in this post we will create a web scraper (the bot 2) that accesses web pages without a web browser (a browserless web scraper). We are going to use PhantomJS as a headless browser (it executes the JavaScript on the web page) and bash to automate it.

Programs
We will make two files: phantomBot.js and bashBot.sh.

phantomBot.js

var page = require('webpage').create();
var system = require('system');

page.settings.resourceTimeout = 10000;
page.onResourceTimeout = function(error){
    console.log(error.errorString);
    console.log(error.errorCode);
    console.log(error.url);
    phantom.exit();
};

page.onResourceReceived = function(response){
    console.log("####" + response.id + "####" + response.status + "####" + response.url);
};

if (system.args.length === 1){
    console.log("Enter Url");
    phantom.exit();
} else {
    page.open(system.args[1], function(status){
        console.log(status);
        if (status === 'success'){
            var title = page.evaluate(function(){
                return document.title;
            });
            console.log(title);
        }
        phantom.exit();
    });
}

Explanation
1- page is a webpage object.
2- system is used to read the command-line arguments.
3- a timeout for page loading is set.
4- a callback is defined to be called on timeout.
5- a callback is defined to be called whenever a resource is received.
6- check whether a URL was passed on the command line.
7- open the URL using the webpage object (page).
8- read the title of the document once the URL has loaded successfully.
9- exit PhantomJS.
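
You can try phantomBot.js on its own before wiring it into the bash script. A quick run against a single page (the URL below is just a placeholder) would look like this:

phantomjs phantomBot.js http://www.example.com/p
# prints one "####id####status####url" line per resource received,
# then the load status ("success") and finally the page title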

bashBot.sh

# restart tor so this run gets a new circuit (and a new exit ip)
sudo service tor restart
array=("http://www.example.com/p" "http://www.example.com/q")
for item in "${array[@]}"
do
    sudo phantomjs --proxy=127.0.0.1:9050 --proxy-type=socks5 phantomBot.js "$item"
done
# the script re-runs itself so the bot keeps going with a fresh ip
sudo bash bashBot.sh

Explanation
1- tor is restarted to change the user's IP.
2- an array of the links that pay you for traffic (do not leave spaces before or after the '=' sign).
3- iterate over the URLs and pass each one to the PhantomJS program phantomBot.js to be opened.
4- the script re-runs itself so the whole process repeats.

Note
To start the bot, run bash bashBot.sh from the command line.
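
Before starting it, you can make a rough check that requests really leave through Tor. This assumes Tor is listening on its default SOCKS port 9050; check.torproject.org simply reports whether the request arrived via a Tor exit node:

# the "Congratulations" line only appears when the request went through Tor
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/ | grep -i congratulations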

Great, you have now created a slightly more efficient bot (a browserless web scraper).
Read the websites' terms, take your own risks and enjoy the free money. 🙂

Categories
bot Proxy Python subprocess tor traffic web scraper webbrowser

The Bot : How To Make A Simple Anonymous Web Scraper?

A web scraper is a program that automates the process of accessing websites, with or without a browser. In this post, by web scraping I mean only accessing web pages, and the scraper built here opens them through a web browser.

An anonymous web scraper is one that keeps the identity of the program hidden, i.e. a program that accesses a website without revealing its information (IP address).

Usage (though this should be used at one's own risk)
There are websites which offer you money for bringing traffic to them. Using such anonymous web scrapers (the bots), you can send fake traffic to their websites and earn some money.

Note: These programs were tested on Ubuntu 14.04. Similar programs can be made on other platforms as well.
We will discuss two ways of making such a web scraper here using Python.
For the first method, we will use Python's webbrowser library to open web pages and Tor to provide anonymity. We will use the subprocess library to automate it.

Bot

import time
import subprocess as sp
import webbrowser

urls = ["www.example.com/p", "www.example.com/q"]
count = 10000
while count >= 0:
    for url in urls:
        webbrowser.open(url)
        time.sleep(2)
    time.sleep(4)
    sp.call(["sudo", "killall", "firefox"])
    sp.call(["sudo", "/etc/init.d/tor", "restart"])
    count -= 1

Explanation

1- urls contains the list of URLs that pay you for traffic.
2- the while loop repeats the whole process; the inner for loop iterates over the list of URLs.
3- each URL is opened using webbrowser (Firefox by default), followed by a two-second pause.
4- after all the URLs are opened, the bot waits another four seconds to let the pages load properly (these delays can be adjusted to how long the websites take to load).
5- Firefox is closed forcefully.
6- tor is restarted so that the user's IP changes.

Note
Before running the program, configure Firefox to use a SOCKS5 proxy on 127.0.0.1, port 9050, so that requests go through Tor.
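
One way to preset these proxy settings (a sketch; the profile directory name below is a placeholder, and the preference names assume a standard Firefox install) is to append them to the profile's user.js file:

# replace the profile directory with your own (look under ~/.mozilla/firefox/)
PROFILE=~/.mozilla/firefox/your-profile.default
cat >> "$PROFILE/user.js" <<'EOF'
user_pref("network.proxy.type", 1);                 // 1 = manual proxy configuration
user_pref("network.proxy.socks", "127.0.0.1");      // tor runs locally
user_pref("network.proxy.socks_port", 9050);        // default tor SOCKS port
user_pref("network.proxy.socks_version", 5);        // socks5
user_pref("network.proxy.socks_remote_dns", true);  // resolve DNS through the proxy too
EOF

Firefox reads user.js on startup, so close it before running the bot.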

For the second method, we will simply open the browser through subprocess:

import time
import subprocess as sp

urls = ["www.example.com/p", "www.example.com/q"]
count = 10000
while count >= 0:
    for url in urls:
        child = sp.Popen("firefox %s" % url, shell=True)
        time.sleep(2)
    time.sleep(4)
    sp.call(["sudo", "killall", "firefox"])
    sp.call(["sudo", "/etc/init.d/tor", "restart"])
    count -= 1

Explanation
1- the only change is that each web page is now opened by launching Firefox directly as a subprocess instead of through the webbrowser library.

Now you are all set up with your anonymous web scraper.

Note
Do set up the Firefox proxy as in the first method.
Such bots must execute JavaScript to send their information, which can be done using browsers.
The other technique, creating the bot using PhantomJS and bash, can be found here.
Read the websites' terms, take your own risks and enjoy the free money. 🙂

Categories
Dryscrape Polipo Proxy Python socks tor

Proxy The Proxy : How to set a socks5 proxy in dryscrape (python library)?

Dryscrape is a Python library that can execute the JavaScript on a web page when it is opened. You can read more about it here.

Now we can set an HTTP proxy for a dryscrape session in the following way:

sess = dryscrape.Session(base_url = "http://www.example.com")
sess.set_proxy(host = 'localhost', port = port_no)

The problem is that we can set only an HTTP proxy on a dryscrape session.
Sometimes, however, we may want to set a SOCKS proxy for the dryscrape session. I wanted to connect it through Tor to make an anonymous web scraper, but Tor speaks only the SOCKS protocol.

Now, there is an interesting tool called Polipo.
Polipo is a web proxy that supports the SOCKS protocol. The more interesting thing (maybe just to me and other novice explorers) is that it can pipeline a request on to another proxy server that speaks a different protocol than the one the request arrived on.

In simpler language, we can configure Polipo to route requests from one server listening on some protocol to another server using another protocol.

So, we will configure our Polipo server to listen for HTTP requests and forward them over the SOCKS protocol. It will proxy the proxy of the initial request (which might sound puzzling).

Solution:

Following is the configuration that can be added in your /etc/polipo/config file,

socksParentProxy = "localhost:9050"
socksProxyType = socks5

proxyAddress = "localhost"
proxyPort = 8118

Explanation:

socksParentProxy determines the next machine and port number to which incoming requests are forwarded (Tor running on my machine on its default port 9050).
socksProxyType determines the protocol spoken by that next server.
proxyAddress determines the address the Polipo server listens on.
proxyPort determines the port number of the Polipo server (default 8118).
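
Once Polipo has been restarted with this configuration, a quick way to confirm the HTTP-to-SOCKS chain works (assuming Polipo on port 8118 and Tor on 9050, as above) is:

# send a request through polipo (http proxy); polipo should hand it off to tor
curl -x http://localhost:8118 https://check.torproject.org/ | grep -i congratulations
# the "Congratulations" line only shows up when the request reached the site via Tor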

Now we can set the Polipo server as the proxy for dryscrape's requests in the following way:
sess = dryscrape.Session(base_url = "http://www.example.com")
sess.set_proxy(host = "localhost", port = 8118)

So now, whenever dryscrape requests a web page, the request first goes to Polipo and then gets forwarded to Tor (or some other server).

Cheers. Now, using dryscrape, we can make requests through the SOCKS protocol (a SOCKS5 proxy).

Some of the related questions on stackoverflow can be found here.

Categories
aws bad configuration nginx s3

The Dark-Knight : How to configure your nginx to retrieve static pages of your website from amazon s3 storage during 502 bad gateway error?

Nginx is widely used as a front-end proxy to a PHP server, which means that when a client makes a request, nginx receives it first and then passes it to the PHP (back-end) server for processing. Nginx is able to handle on the order of 10,000 concurrent connections.

Now, let's talk about the topic.
There can be situations when you receive server errors (502 Bad Gateway, 503 Service Unavailable, or anything like that). They are serious nightmares if you have ever faced them, and they create a bad impression on your users. I have faced it, so I can tell you that it is really embarrassing. You can lose a lot of your users, and other disasters might follow.

The cause of such errors can be anything from bad code to high load (high load is a good thing, but being unable to handle it is not).

Now, this is one solution where you learn how to configure your system to save yourself from that embarrassment. I named it the Dark Knight because you shouldn't need it, but it still guards your website.

Solution:

Suppose the URL to back up has the following form: www.example.com/a/b.

Suppose the static page for www.example.com/a/b is stored in S3 as bucket/a/b/a_b.html. The URL for it would be https://s3.amazonaws.com/bucket/a/b/a_b.html.

Now, the following changes can be added to your nginx configuration:

location @static {
    rewrite ^ $request_uri;
    rewrite /(.*)/(.*) /bucket/$1/$2/$1_$2.html;
    proxy_pass https://s3.amazonaws.com;
}

location /index.php {
    error_page 502 =200 @static;
    fastcgi_intercept_errors on;
    # body
}

Explanation:

The usual cause of a 502 error is PHP failing to handle any more requests.

fastcgi_intercept_errors tells nginx to intercept error responses coming back from PHP, and error_page then redirects to a given location when a particular error code (502 here) is sent by PHP, also changing the response code returned to the client (to 200).

Now in @static location,

The current value of $uri is /index.php and $request_uri is /a/b, so we rewrite it into the form of our static page's URL in the S3 bucket.

The first rewrite changes $uri from /index.php -> /a/b.

The second rewrite changes $uri from /a/b -> /bucket/a/b/a_b.html.

The proxy_pass directive then serves the content from the resulting URL (the static page's URL) without changing the URL seen by the client.
Note that we can't append a URI after the URL in proxy_pass inside a named location (nginx will not accept it).
Do give the S3 bucket proper permissions so the objects can be read.
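
You can check both halves of the setup by hand (a sketch using the example bucket and page names from above; the php5-fpm service name is an assumption, adjust it to your PHP back-end):

# 1) the static copy must be publicly readable on s3
curl -I https://s3.amazonaws.com/bucket/a/b/a_b.html   # expect HTTP/1.1 200 OK

# 2) stop the php back-end to provoke the 502; nginx should still answer 200
sudo service php5-fpm stop
curl -I http://www.example.com/a/b                     # expect 200, served from the s3 copy
sudo service php5-fpm start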

Cheers. You have now learnt how to configure nginx to handle a 502 error. Similar things can also be done for other server errors.

Here are a few related questions on Stack Overflow: here, here, here, here.