Faking user-agent in Python


Post by Arjen Balfoort » 13 Nov 2015 11:45

One of the reasons I started coding was that I really dislike repetitive tasks. They are mind-numbing, a total waste of time and because they bore me to death I get sloppy and make mistakes. These tasks are ideally suited for automation.

I also know that being ranked on Distrowatch is important. Unfortunately, I'd have to visit the SolydXK page on Distrowatch to up the counter and that is...a boring repetitive task and I simply forget to do it.

So, I automated it with a Python script. I saved that script as /etc/cron.daily/distrowatch and made it executable (very important). Now I don't have to worry about that anymore.

About the script: using urllib to scrape the content of a site is very well documented. However, many sites block any request that doesn't carry a valid user agent string and answer with a 403 Forbidden message (by default, urllib identifies itself as Python-urllib). If you add a browser's user agent string to the request, the server will think a regular browser is requesting the page and will return the content. In this case, it will up the counter on Distrowatch's SolydXK page (I assume).
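To see the difference without touching the network, here is a minimal sketch that compares the user agent urllib announces by default with one set on the request. The URL and the names default_ua, browser_ua and sent_ua are only illustrative, not part of the script further down.

```python
from urllib.request import Request, build_opener

# Hypothetical values, for illustration only
url = 'http://example.com/'
browser_ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0'

# What urllib sends when you set nothing: 'Python-urllib/<version>'
default_ua = dict(build_opener().addheaders)['User-agent']

# Pass the header when creating the request (same effect as add_header)
req = Request(url, headers={'User-Agent': browser_ua})

# Request stores header names capitalized, hence 'User-agent'
sent_ua = req.get_header('User-agent')

print(default_ua)
print(sent_ua)
```

Nothing is actually requested here; the request object is only built and inspected.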

You can check your user agent here: http://www.useragentstring.com
It also has a list of valid user agent strings: http://www.useragentstring.com/pages/us ... string.php

Here's the script:

Code:

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from random import choice

my_url = 'http://distrowatch.com/table.php?distribution=solydxk'

# http://www.useragentstring.com/pages/useragentstring.php
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Version/8.0 Safari/602.1 Epiphany/3.18.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0'
]

# Create a request object with given URL
req = Request(my_url)
# Get a random user agent and add that to the request object
ua = choice(user_agents)
req.add_header('User-Agent', ua)
# Get the output of the URL
output = urlopen(req, timeout=5)

# Debugging
# print("{}\n\n{}".format(ua, output.read().decode('utf-8')))

This script randomly chooses a user agent string from the user_agents list. At the bottom of the script you see a debug line where I read the output and decode it before printing it. If you ever want to scrape pages this way, don't forget to decode the output, because the read() method of the object returned by urlopen gives you bytes rather than a string. I used utf-8 decoding to catch those special characters.
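Not every site serves utf-8, so a slightly more careful variant could use the charset the server declares and fall back to utf-8 otherwise. This is only a sketch: decode_body is a made-up helper name and the sample bytes stand in for a real response.

```python
def decode_body(raw, charset=None):
    """Decode response bytes, falling back to utf-8 if no charset is given."""
    # errors='replace' avoids an exception on bytes that don't fit the charset
    return raw.decode(charset or 'utf-8', errors='replace')

# With the script above this could be called as:
#   html = decode_body(output.read(), output.headers.get_content_charset())

# Sample data instead of a real response
sample = 'Caf\u00e9'.encode('utf-8')
text = decode_body(sample)
latin = decode_body('Caf\u00e9'.encode('latin-1'), 'latin-1')

print(type(sample).__name__, type(text).__name__)
```

get_content_charset() returns None when the server doesn't declare a charset, which is why the helper needs a fallback.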


SolydXK needs you!
Development | Testing | Translations

Post by Zill » 13 Nov 2015 14:22

Well, I am struggling to work out how to respond to this one!

Firstly, I disagree that "being ranked on Distrowatch is important". While Distrowatch is of passing interest to many Linux users, I would never call it important. Linux distros live or die on how good they are, not on their counter rating!

Secondly, the concept of faking anything seems very unethical to me and so I cannot approve of automating fake hits for a website.

While the python script itself is a good example of coding that will be of interest to other programmers, I will not be running this script.

Post by Arjen Balfoort » 13 Nov 2015 14:58

Scraping is a widely used technique, and without a concrete example it would have been a boring story.
This is the coding section. So, don't make it bigger than it really is.


Post by ilu » 13 Nov 2015 18:09

Schoelje wrote: One of the reasons I started coding was that I really dislike repetitive tasks. They are mind-numbing, a total waste of time and because they bore me to death I get sloppy and make mistakes. These tasks are ideally suited for automation.

That's exactly why I want to learn Python: I heard it's the best way to automate website "reading". I really hope it will enable me to "scrape" (what does that mean exactly?) this forum to keep the Solyd wiki page I was working on up to date. I got really tired of doing that by hand: boring, tedious, a waste of time, and I started slacking. If I ever get to achieve anything, you will tell me whether I need to fake the ua for that. :D

Post by Arjen Balfoort » 13 Nov 2015 18:47

Here's a wiki explanation on scraping: https://en.wikipedia.org/wiki/Web_scraping
The "Legal issues" section is an interesting read.

To demonstrate what scraping exactly is: save the code above in a file named distrowatch.py, open it and add this code at the bottom of the script:

Code:

with open('distrowatch.html', 'w', encoding='utf-8') as f:
    f.write(output.read().decode('utf-8'))

Save the file and make the script executable.

Now open a terminal and run:

Code:

./distrowatch.py

This will save the output of the URL in the script in the file distrowatch.html. Open the html file in a browser. Do you recognize its contents?

You can imagine that with the right regular expressions you can extract valuable data from those pages.
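As a rough illustration of that idea, the sketch below pulls the page title and all link targets out of a small piece of HTML. The HTML string is made up and only stands in for the contents of distrowatch.html; the patterns are deliberately simple and will not cope with every page.

```python
import re

# Hypothetical HTML, standing in for the contents of distrowatch.html
html = (
    '<html><head><title>Example page</title></head>'
    '<body><a href="http://example.com/a">A</a> '
    '<a href="http://example.com/b">B</a></body></html>'
)

# Non-greedy match of everything between the title tags
title_match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
title = title_match.group(1).strip() if title_match else None

# Every double-quoted href value on the page
links = re.findall(r'href="([^"]+)"', html)

print(title)
print(links)
```

Regular expressions get brittle on messy markup; for anything bigger, the html.parser module from the standard library is probably the safer choice.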


Post by kurotsugi » 14 Nov 2015 11:18

@Zill: you can see the Distrowatch case just as an example of how the script can be used. Actual implementations would vary greatly, and the script itself could be combined with other functions to make it do various jobs. A site counter usually has a mechanism to avoid counting repeated visits from the same address, so even if the script is used on Distrowatch it might not affect the rank much.

Post by Zill » 14 Nov 2015 11:59

kurotsugi wrote: @Zill: you can see the Distrowatch case just as an example of the script usage...
... and I agree that, purely as a coding example, this could be useful. I am just concerned that, if a site counter is to have any meaningful value at all, IMO it is not appropriate to try to cheat this with bots! :o

