Show DevBest [PY] Data Mining Script

Brackson · Nov 2, 2013

Current version: v1.1

Code:

#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

def main():
   store_into_file()

while True:
   main()

This script allows you to take the HTML contents of a webpage and store them in a text file. You do not need to install any modules for this (AFAIK). If you want to parse the content before it's stored in the file, you can use BeautifulSoup.
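As a rough sketch of parsing before storing, the snippet below pulls only the page title out of the HTML. It assumes Python 3 and uses the standard library's html.parser as a stand-in for a third-party parser; TitleParser and extract_title are hypothetical names for illustration, not part of the script above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append instead of assigning.
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the stripped page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

The string returned by extract_title could be written to the text file in place of the full source, or next to it.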

This would come in handy if you were building an archive site, if you want to log a website's content just because, etc.

I made this for educational purposes, and I thought I'd just release it because I don't really need it. Thanks for viewing!

Cronus · Nov 2, 2013

Nice, make a php one

10/10

Sway · Nov 2, 2013

The original code wouldn't work because you didn't call the function.. which leads me to the question: is there any reason for why you put that code in a function?

Brackson · Nov 2, 2013

Sway said:
The original code wouldn't work because you didn't call the function.. which leads me to the question: is there any reason for why you put that code in a function?

I didn't call the function because I didn't find it really necessary. However, if you were to use the script, this is how you would do it.

Code:

#!/usr/bin/env python

import urllib2, time

def store_into_file():
  url = 'http://google.com/' # URL that you want to mine.
  data = urllib2.urlopen(url).read() # Get the HTML source of URL.

  current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

  r = open('%s.txt' % (current_time), 'w') # Create the file.
  r.write(data) # Put the source in the file.
  r.close() # Close the file.

while True:
  store_into_file()

Quackster · Nov 2, 2013

I don't understand this, it just leeches HTML from a website, you don't splice and dice the HTML to grab any specific data. So it's not really a miner, just a HTML leecher?

Sway · Nov 2, 2013

Brackson said:

I didn't call the function because I didn't find it really necessary. However, if you were to use the script, this is how you would do it.

Code:

#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

def main():
   store_into_file()

if __name__ == '__main__':
   main()

You should include the complete source next time but it's pretty neat. Keep it up!

Brackson · Nov 2, 2013

Quackster said:
I don't understand this, it just leeches HTML from a website, you don't splice and dice the HTML to grab any specific data. So it's not really a miner, just a HTML leecher?

It grabs the data, and stores it in a text file. I would use BeautifulSoup to parse it before it's stored, but I figured that's something the user should do since I have no idea what they'd be using the script for.

Quackster · Nov 2, 2013

Brackson said:
It grabs the data, and stores it in a text file. I would use BeautifulSoup to parse it before it's stored, but I figured that's something the user should do since I have no idea what they'd be using it for.

It's not a miner at the moment, it just grabs HTML.

Brackson · Nov 2, 2013

Quackster said:
It's not a miner at the moment, it just grabs HTML.

You can make it continuously loop like this.

Code:

#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

while True:
   store_into_file()

Thanks for pointing that out though, I'll update the OP.
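A bare `while True` loop requests the page as fast as it can, and a filename like `14:30:05.txt` contains colons, which Windows does not allow. Here is a minimal sketch of a gentler loop, assuming Python 3 (urllib.request in place of urllib2). The names snapshot_filename, fetch and mine, and the `rounds` and `delay` parameters, are hypothetical and not from the script above.

```python
import os
import time
from datetime import datetime
from urllib.request import urlopen


def snapshot_filename(moment):
    """Build a filename from a datetime, using dashes instead of colons."""
    return moment.strftime('%Y-%m-%d_%H-%M-%S') + '.txt'


def fetch(url):
    """Download the raw bytes of the page."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def mine(url, directory, rounds, delay, fetch=fetch, sleep=time.sleep):
    """Save `rounds` snapshots of `url`, pausing `delay` seconds between requests."""
    saved = []
    for i in range(rounds):
        data = fetch(url)
        # The counter prefix keeps two snapshots taken in the same second apart.
        name = '%03d_%s' % (i, snapshot_filename(datetime.now()))
        path = os.path.join(directory, name)
        with open(path, 'wb') as f:
            f.write(data)
        saved.append(path)
        if i < rounds - 1:
            sleep(delay)
    return saved
```

Calling something like `mine('http://google.com/', '.', 10, 60)` would take ten snapshots, one per minute, into the current directory.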

Brackson · Nov 2, 2013

Here's a looping version of this script with proxy support, so you don't get firewalled.

Code:

#!/usr/bin/env python

import urllib2, time, random

proxies = [proxy.strip() for proxy in open('proxies.txt', 'r')] # Setting up the proxies.

def store_into_file(proxies):
  random_proxy = random.choice(proxies)
  proxy = urllib2.ProxyHandler({'http': random_proxy})
  opener = urllib2.build_opener(proxy) # Build an opener that actually routes requests through the proxy.

  try:
    url = 'http://google.com/' # URL that you want to mine.
    data = opener.open(url).read() # Get the HTML source of URL via the proxy.

    current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

    r = open('%s.txt' % (current_time), 'w') # Create the file.
    r.write(data) # Put the source in the file.
    r.close() # Close the file.
  except:
    raise # Re-raise any error (dead proxy, network failure) unchanged.

while True:
  store_into_file(proxies)

Include a 'proxies.txt' file with a proxy list separated by line breaks in the same directory, and if you have a good proxy list, you won't be firewalled.
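For anyone on Python 3, the same proxy idea can be sketched with urllib.request. This is only an illustration under that assumption: load_proxies, opener_for and fetch_via_random_proxy are hypothetical helper names, and the proxy addresses in the usage note are placeholders.

```python
import random
import urllib.request


def load_proxies(lines):
    """Keep the non-empty lines, stripped of whitespace and line breaks."""
    return [line.strip() for line in lines if line.strip()]


def opener_for(proxy):
    """Build an opener whose HTTP requests are routed through `proxy`."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler)


def fetch_via_random_proxy(url, proxies, timeout=10):
    """Fetch `url` through one randomly chosen proxy from the list."""
    proxy = random.choice(proxies)
    with opener_for(proxy).open(url, timeout=timeout) as response:
        return response.read()
```

With a proxies.txt in the same directory, `load_proxies(open('proxies.txt'))` gives the list to pass in. Each line would hold one proxy, for example `http://10.0.0.1:8080`.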

Sysode · Nov 3, 2013

Wait.. So this does the same as viewing source code and copying it?

Appreciate this is for educational purposes though.

Brackson · Nov 3, 2013

Sysode said:
Wait.. So this does the same as viewing source code and copying it?

Appreciate this is for educational purposes though.

Essentially, yes, but it's automatic and faster than doing it by hand.

I wrote it because I wanted experience with web parsers in Python, and it was fun to make. I released it because I don't need it and it's better for it to be public rather than sitting on my Desktop.
