The Bot 2: How to make a browserless web scraper?

We continue from our last post, which discussed how to make a simple bot that accesses webpages anonymously.
The bot there worked fine, but it has some drawbacks.
For example, it used a browser to open webpages, which makes it a little slow. It also uses more memory and storage (though the cache can be cleared regularly). Sometimes, when the browser is closed forcefully, Firefox pops up a dialog box asking whether to start in safe mode or reset it. This interrupts the bot's automation, and you then have to set the Firefox proxy settings again.

In this post we will create a web scraper (the bot 2) that accesses web pages without a web browser (a browserless web scraper). We are going to use PhantomJS as a headless browser (it executes the JavaScript on the web page) and bash to automate it.

Programs
We will make two files, phantomBot.js and bashBot.sh.

phantomBot.js

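// Webpage object and the system module (for command-line arguments).
var page = require('webpage').create();
var system = require('system');

// Give up on any resource that takes longer than 10 seconds (value in ms).
page.settings.resourceTimeout = 10000;
page.onResourceTimeout = function (error) {
    console.log(error.errorString);
    console.log(error.errorCode);
    console.log(error.url);
    phantom.exit();
};

// Log every response that arrives: id, HTTP status and URL.
page.onResourceReceived = function (response) {
    console.log("####" + response.id + "####" + response.status + "####" + response.url);
};

if (system.args.length === 1) {
    // args[0] is the script name, so a length of 1 means no URL was given.
    console.log("Enter Url");
    phantom.exit();
} else {
    page.open(system.args[1], function (status) {
        console.log(status);
        if (status === 'success') {
            // Runs inside the page and returns its title.
            var title = page.evaluate(function () {
                return document.title;
            });
            console.log(title);
        }
        phantom.exit();
    });
}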

Explanation
1- page is the webpage object.
2- system is the object used to read the command-line arguments.
3- A timeout of 10 seconds (10000 ms) is set for each resource request.
4- A function is defined to be called when a resource request times out; it logs the error and exits.
5- A function is defined to be called whenever a response for a resource is received; it logs the response id, status and URL.
6- Check whether a URL was passed on the command line.
7- Open the URL using the webpage object (page).
8- Read the title of the document once the URL has loaded successfully.
9- Exit PhantomJS.
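
The response handler in step 5 prints one line per response in the form ####id####status####url. If you want to read those lines back in bash, a minimal sketch could look like the one below. The function name parse_resource_line is made up for this example and is not part of PhantomJS; it only assumes the line format printed by phantomBot.js.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PhantomJS): split one
# "####id####status####url" line, as printed by phantomBot.js,
# into three tab-separated fields.
parse_resource_line() {
  local line="$1"
  local sep='####'
  # Skip lines that are not response records (for example the page title).
  [[ "$line" == "$sep"* ]] || return 1
  local rest="${line#"$sep"}"      # drop the leading marker
  local id="${rest%%"$sep"*}"      # text before the next marker
  rest="${rest#*"$sep"}"
  local status="${rest%%"$sep"*}"
  local url="${rest#*"$sep"}"      # everything after the last marker
  printf '%s\t%s\t%s\n' "$id" "$status" "$url"
}

parse_resource_line "####1####200####http://www.example.com/p"
```

Lines that do not start with the #### marker, such as the load status or the title, make the function return a non-zero status, so they can be filtered out in a loop.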

bashBot.sh

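#!/bin/bash
# Restart Tor so that a new circuit (and usually a new exit IP) is used.
sudo service tor restart
# No spaces are allowed around '=' in an assignment.
array=("http://www.example.com/p" "http://www.example.com/q")
# Quote the expansions so each URL is passed as exactly one argument.
for item in "${array[@]}"
do
    sudo phantomjs --proxy=127.0.0.1:9050 --proxy-type=socks5 phantomBot.js "$item"
done
# Re-run the script from the start.
sudo bash bashBot.sh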

Explanation
1- Tor is restarted to change the user's IP address.
2- An array of links that pay you for the traffic they receive. (Do not leave spaces before or after the '=' sign.)
3- Iterate through the URLs and pass each one to the PhantomJS program phantomBot.js to open it.
4- Re-run the program.
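
The array syntax from step 2 and the loop from step 3 can be tried on their own, without Tor or PhantomJS. The sketch below is a dry run under that assumption: visit_url is a hypothetical stand-in for the phantomjs call and only prints the URL it would open.

```shell
#!/usr/bin/env bash
# Dry run of the loop in bashBot.sh. visit_url is a hypothetical stand-in
# for the phantomjs call, so nothing is fetched.
visit_url() {
  printf 'would open: %s\n' "$1"
}

# No spaces around '=', otherwise bash reads "array" as a command name.
array=("http://www.example.com/p" "http://www.example.com/q")

# Quoting "${array[@]}" passes each element through as exactly one word.
for item in "${array[@]}"; do
  visit_url "$item"
done
```

Quoting matters once a URL contains characters such as ? or *, which an unquoted expansion would hand to the shell for filename matching.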

Note
To start the bot, type bash bashBot.sh on the command line.

Great, you have created a slightly more efficient bot (a browserless web scraper).
Read the websites' terms, take your own risks and enjoy free money. 🙂
