Show DevBest [PY] Data Mining Script

Brackson

卍卍卍卍卍卍卍卍卍卍卍
Jun 20, 2013
262
46
Current version: v1.1
Code:
#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

def main():
   store_into_file()

while True:
   main()

This script allows you to take the HTML contents of a webpage and store them in a text file. You do not need to install any modules for this (AFAIK). If you want to parse the content before it's stored into the file, you can use BeautifulSoup.

This would come in handy if you were building an archive site, if you want to log a website's content just because, etc.

I made this for educational purposes, and I thought I'd just release it because I don't really need it. Thanks for viewing!
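
For anyone on Python 3, where urllib2 no longer exists and its functions live in urllib.request, here is a rough sketch of the same idea. It is not the script from this post: the helper names timestamp_filename and save_source, the 10 second timeout, and the dash-separated filename are all assumptions made for the example (colons are not allowed in Windows filenames).

```python
# Sketch only: a Python 3 take on the fetch-and-store idea.
# Helper names, the timeout and the filename format are assumptions.

import os
import time
import urllib.request


def timestamp_filename(when=None):
    """Build a name like '14-03-59.txt' from a struct_time (default: now)."""
    return time.strftime('%H-%M-%S', when or time.localtime()) + '.txt'


def save_source(data, directory='.', when=None):
    """Write raw bytes to a timestamped file and return its path."""
    path = os.path.join(directory, timestamp_filename(when))
    # 'wb' because urlopen().read() returns bytes in Python 3.
    with open(path, 'wb') as handle:
        handle.write(data)
    return path


def store_into_file(url, directory='.'):
    """Fetch the page at url and store its source in directory."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return save_source(response.read(), directory)
```

Calling store_into_file('http://google.com/') once would fetch the page and return the path of the file it wrote.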
 

Sway

Ruby and Python are married.
Dec 19, 2010
194
76
The original code wouldn't work because you didn't call the function.. which leads me to the question: is there any reason for why you put that code in a function?
 

Brackson

The original code wouldn't work because you didn't call the function.. which leads me to the question: is there any reason for why you put that code in a function?
I didn't call the function because I didn't find it really necessary, however if you were to use the script, this is how you would do it.

Code:
#!/usr/bin/env python

import urllib2, time

def store_into_file():
  url = 'http://google.com/' # URL that you want to mine.
  data = urllib2.urlopen(url).read() # Get the HTML source of URL.

  current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

  r = open('%s.txt' % (current_time), 'w') # Create the file.
  r.write(data) # Put the source in the file.
  r.close() # Close the file.

while True:
  store_into_file()
 

Quackster

a devbest user says what
Aug 22, 2010
1,764
1,241
I don't understand this, it just leeches HTML from a website, you don't splice and dice the HTML to grab any specific data. So it's not really a miner, just a HTML leecher?
 

Sway

I didn't call the function because I didn't find it really necessary, however if you were to use the script, this is how you would do it.

Code:
#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

def main():
   store_into_file()

if __name__ == '__main__':
   main()

You should include the complete source next time but it's pretty neat. Keep it up!
 

Brackson

I don't understand this, it just leeches HTML from a website, you don't splice and dice the HTML to grab any specific data. So it's not really a miner, just a HTML leecher?
It grabs the data, and stores it in a text file. I would use BeautifulSoup to parse it before it's stored, but I figured that's something the user should do since I have no idea what they'd be using the script for.
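
As a rough idea of what parsing before storing could look like, here is a sketch that pulls the page title and the link targets out of the HTML. It deliberately swaps BeautifulSoup for Python 3's built-in html.parser so nothing has to be installed; the class name TitleAndLinks and the extract helper are made up for this example, and what you actually pull out depends on what you are mining for.

```python
# Sketch only: uses the standard library's html.parser as a stand-in
# for BeautifulSoup. TitleAndLinks and extract are hypothetical names.

from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:  # Skip anchors that have no href attribute.
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html):
    """Return (title, links) for an HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

The string passed to extract would be the decoded page source, and the result could be written to the text file instead of the raw HTML.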
 

Quackster

It grabs the data, and stores it in a text file. I would use BeautifulSoup to parse it before it's stored, but I figured that's something the user should do since I have no idea what they'd be using it for.
It's not a miner at the moment, it just grabs HTML.
 

Brackson

It's not a miner at the moment, it just grabs HTML.
You can make it continuously loop like this.
Code:
#!/usr/bin/env python

import urllib2, time

def store_into_file():
   url = 'http://google.com/' # URL that you want to mine.
   data = urllib2.urlopen(url).read() # Get the HTML source of URL.

   current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

   r = open('%s.txt' % (current_time), 'w') # Create the file.
   r.write(data) # Put the source in the file.
   r.close() # Close the file.

while True:
   store_into_file()

Screenshot .

Thanks for pointing that out though, I'll update the OP.
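
One thing to watch with a bare while True loop: it requests the page as fast as the connection allows, and two fetches inside the same second write to the same HH:MM:SS filename, so the second overwrites the first. A possible way around that is to pause between fetches. The sketch below is Python 3 and only illustrative; the poll function, its parameters and the 60 second default are assumptions, not part of the script above.

```python
# Sketch only: a paced loop. poll() and its parameters are hypothetical.

import time


def poll(fetch, interval_seconds=60, max_runs=None, sleep=time.sleep):
    """Call fetch() repeatedly with a pause in between.

    Runs forever when max_runs is None. Returns the number of calls
    that finished without an error.
    """
    runs = 0
    succeeded = 0
    while max_runs is None or runs < max_runs:
        try:
            fetch()
            succeeded += 1
        except OSError as error:
            # urllib.error.URLError is a subclass of OSError, so a failed
            # request is logged here and the loop carries on.
            print('fetch failed: %s' % error)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)  # No pause after the final run.
    return succeeded
```

Here fetch would be any zero-argument function that downloads and stores one copy of the page.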
 

Brackson

Here's a looping version of this script with proxy support, so you don't get firewalled.
Code:
#!/usr/bin/env python

import urllib2, time, random

proxies = [proxy.strip() for proxy in open('proxies.txt', 'r') if proxy.strip()] # Load the proxies, skipping blank lines.

def store_into_file(proxies):
  random_proxy = random.choice(proxies) # Pick a random proxy for this request.
  proxy = urllib2.ProxyHandler({'http': random_proxy})
  opener = urllib2.build_opener(proxy) # The handler only takes effect through an opener.

  url = 'http://google.com/' # URL that you want to mine.
  data = opener.open(url).read() # Get the HTML source of URL through the proxy.

  current_time = time.strftime('%H:%M:%S', time.localtime()) # Get the current time so we can use it for the txt filename.

  r = open('%s.txt' % (current_time), 'w') # Create the file.
  r.write(data) # Put the source in the file.
  r.close() # Close the file.

while True:
  store_into_file(proxies)


Put a 'proxies.txt' file in the same directory as the script, with one proxy per line. With a good proxy list, you are much less likely to get firewalled.
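
For reference, a Python 3 sketch of the proxy part on its own. The key detail is that a ProxyHandler does nothing until it is passed to build_opener and the request goes through that opener. The function names load_proxies, opener_for and fetch_via_random_proxy, and the 10 second timeout, are assumptions for this example.

```python
# Sketch only: proxy rotation with urllib.request (Python 3).
# Function names and the timeout are hypothetical.

import random
import urllib.request


def load_proxies(path='proxies.txt'):
    """Read one proxy per line, ignoring blank lines and stray whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def opener_for(proxy):
    """Build an opener that routes http:// requests through proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler)


def fetch_via_random_proxy(url, proxies, timeout=10):
    """Pick a proxy at random and fetch url through it."""
    opener = opener_for(random.choice(proxies))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each line of proxies.txt would hold a single host:port entry, and fetch_via_random_proxy returns the page source as bytes.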
 

Sysode

Front-End Developer
Dec 11, 2012
1,673
848
Wait.. So this does the same as viewing source code and copying it?

Appreciate this is for educational purposes though.
 

Brackson

Wait.. So this does the same as viewing source code and copying it?

Appreciate this is for educational purposes though.
Essentially, yes, but it's automatic and faster than doing it by hand.

I wrote it because I wanted experience with web parsers in Python, and it was fun to make. I released it because I don't need it and it's better for it to be public rather than sitting on my Desktop. :)
 
