Website crawling and scraping with Selenium and Python

The amount of information on the Web grows exponentially, and a flurry of applications (mobile, web, or otherwise) have come about that wish to harness it all. There are many ways to harvest information from the web, but the one that is seemingly the most ubiquitous is ‘scraping’. Scraping is what most search engines employ in some form or the other: the ‘spiders’ that crawl the web looking for metadata embedded in websites, the price-comparison sites that allow users to make purchase decisions, and probably a gazillion other things that I cannot even imagine.




How Scraping Differs

In this article we are going to look at website crawling with Python. The crawling work can be classified based on where the HTML DOM is modified and rendered.

Scraping Type 1:

The DOM modification is done on the server side, and the finished HTML string is delivered to the front end, which means we do not need Selenium. We can achieve this with plain Python and a tool for parsing the HTML code.

Scraping Type 2:

If the DOM modification is done by client-side scripting, we have to use a browser to render the page before we can scrape the data.
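
A quick way to tell the two types apart is to fetch the raw HTML without a browser and check whether the text you are after is already in it. Below is a minimal sketch of that check; the URL and the probe string are placeholders you would replace with your own target.

# Is the content server-rendered (Type 1) or client-rendered (Type 2)?
from urllib.request import Request, urlopen

url = "https://example.com/some-page"    # placeholder target page
probe = "text you expect on the page"    # placeholder snippet

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
raw_html = urlopen(req).read().decode('utf-8', errors='replace')

# If the snippet is already in the raw HTML, a plain fetch plus a parser
# is enough; otherwise the page is likely rendered on the client side.
if probe in raw_html:
    print("Type 1: fetch the HTML and parse it directly")
else:
    print("Type 2: a browser (Selenium) is needed to render the page")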

Preparation:

There are a few things you must ensure in your development environment:

  1. Python : You must have Python 2 or 3 installed on your system. (The examples below use Python 3 syntax.)
  2. Beautifulsoup : This Python package is used to parse the scraped HTML code and extract data from it. Use the command below to install it in your Python environment:
    pip install beautifulsoup4
  3. Selenium : Install Selenium if the site you are scraping falls under scraping Type 2. The Selenium Python package installation command is given below:
    pip install selenium
  4. Chrome : I used Google Chrome for all examples while preparing this tutorial. You could use Firefox if you wish, and I’ll explain how you could do that when we get to the code.
  5. ChromeDriver : This is the final element you need to install. Head over to this link, and download the latest ChromeDriver (it will be a .zip file) for your OS. Extract the downloaded zip file, and copy the contents to any directory you wish (as long as you do not move that directory around). Then add the path to that directory to your OS’s ‘PATH’ environment variable, and you’re good to go. If you are on Windows, you will need to launch a new instance of the command line app; if you are on Linux, just open a new shell. This picks up the newly modified PATH variable. A quick sanity check for this setup follows the list.
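
If everything above went in correctly, the short script below should open a Chrome window, print the page title, and exit cleanly. It is only a sanity check for the setup, not part of the scraping examples.

# Sanity check for steps 3-5: Selenium must be able to find chromedriver
from selenium import webdriver

driver = webdriver.Chrome()       # fails here if chromedriver is not on PATH
driver.get('https://example.com')
print(driver.title)               # should print "Example Domain"
driver.quit()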

Let's Go 

Scraping Type 1:

Now we are going to crawl the site https://www.smithandcrown.com/currency/bitcoin/ and get the description information, which falls under scraping Type 1.

Note : Before going ahead with this crawl, please spend some time studying BeautifulSoup, which will help you get at the HTML data as quickly as possible.
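
For orientation, the handful of BeautifulSoup calls used throughout this article boil down to the sketch below, run against a made-up inline HTML string rather than the real site.

# The BeautifulSoup calls used in this article, on a toy document
from bs4 import BeautifulSoup

html = '<div class="summary"><p>Bitcoin is a cryptocurrency.</p></div>'
bs = BeautifulSoup(html, 'html.parser')

div = bs.find('div', class_='summary')   # first tag matching name and class
same = bs.select_one('.summary')         # the CSS-selector equivalent
paras = bs.select('.summary p')          # all matches, returned as a list

print(div.get_text(strip=True))          # -> Bitcoin is a cryptocurrency.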

Type 1 : Create the Python code to crawl the web page and then parse the HTML using the BeautifulSoup library. The crawl is achieved through Python’s built-in urllib.request module (known as urllib2 on Python 2). For this type we do not need to install Selenium or ChromeDriver. The sample code is given below.



"""
This class provide the data scraping from stack overflow and parse
that using the Beautifulsoup library

@author Ramachandran K
"""

from bs4 import BeautifulSoup
import urllib2

class CrawlerDataParser(object):

    #Function to crawl the description information and return
    def parse_smithandcrown_html(self, html):
        if not html :
            return None
        bs = BeautifulSoup(html, 'html.parser')
        summary = bs.find('div', class_="summary")
        update_value = {}
        description = summary.get_text(strip=True)
        if description :
            update_value['description'] = description
        return update_value

    

if __name__ == '__main__':

    parser = CrawlerDataParser();
    url = "https://www.smithandcrown.com/currency/bitcoin/"

    # Adding header
    hdr = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    response = urllib2.urlopen(req)
    parsed_data = parser.parse_smithandcrown_html(response.read());
    print parsed_data
    print 'Processed successfully';

Cons:
We cannot crawl data from sites that render the DOM on the client side.

Scraping Type 2:

Now we are going to crawl the site https://www.cryptocompare.com/coins/btc/overview and get the description information, which falls under scraping Type 2.

This site uses AngularJS to render the DOM on the client side, so we cannot get the populated HTML using the method above. Instead we use Selenium, which drives a real browser and lets the client-side rendering run; we will use the Chrome browser to crawl the data.

The sample Python code is given below:



"""
This read data cryptocompare website using selenium

@author Ramachandran K
"""

import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

class Crawler(object):
 def crawl_data(self):
  chrome_options = Options()
  chrome_options.add_argument("--window-size=1920,1080")
  # Using the crome driver file path
  driver = webdriver.Chrome('/home/ubuntu/chromedriver', chrome_options=chrome_options)  # Optional argument, if not 
  driver.get('https://www.cryptocompare.com/coins/btc/overview');
  time.sleep(0.5)
  result = driver.find_element_by_class_name('profile-col-main')
  htmlText =  result.get_attribute('innerHTML')
   # Here we  initialized the beautiful soup to extract data from the html  
  bs = BeautifulSoup(htmlText, 'html.parser')
  ele = bs.select_one('.profile-props')

  currency = ele.select('.propblock-body')
  # Here we printing the crawled data
  print currency[2].get_text(strip=True)
  print currency[3].get_text(strip=True)
  print currency[4].get_text(strip=True)
  driver.quit()

if __name__ == "__main__":
 crawl = Crawler()
 crawl.crawl_data();
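
One caveat: the fixed time.sleep(0.5) is fragile, because on a slow connection the AngularJS app may not have finished rendering within half a second. Selenium’s explicit waits are the more robust option. Below is a minimal sketch that polls for the same profile-col-main element instead of sleeping.

# Explicit wait: poll the DOM until the element appears (or 10 s pass)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('/home/ubuntu/chromedriver')
driver.get('https://www.cryptocompare.com/coins/btc/overview')

result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'profile-col-main'))
)
print(result.get_attribute('innerHTML')[:200])  # first 200 chars as a check
driver.quit()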

When you run this script, Selenium opens a Chrome window, loads the website in that browser, and then crawls the data.


Headless Chrome:

If we want Selenium to work without a browser window opening, we can use headless Chrome, which prevents the Chrome window from appearing while still loading and rendering the page for scraping. Headless mode is supported in Chrome 59 and later. To enable it, we need to add the single line of code below to the Chrome options:

chrome_options.add_argument("--headless")
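
Put together with the options from the Type 2 example, the headless setup looks like the sketch below. The --disable-gpu flag was historically recommended alongside headless mode on some platforms; treat it as optional.

# Headless variant of the Chrome options from the Type 2 example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")               # no visible window
chrome_options.add_argument("--disable-gpu")            # optional, see above
chrome_options.add_argument("--window-size=1920,1080")  # viewport still affects layout
driver = webdriver.Chrome('/home/ubuntu/chromedriver', chrome_options=chrome_options)
driver.get('https://www.cryptocompare.com/coins/btc/overview')
print(driver.title)  # confirms the page loaded without opening a window
driver.quit()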

The most commonly used Selenium HTML access functions are the following (a short sketch follows this list):
  1. find_element_by_id
  2. find_element_by_name
  3. find_element_by_xpath
  4. find_element_by_link_text
  5. find_element_by_partial_link_text
  6. find_element_by_tag_name
  7. find_element_by_class_name
  8. find_element_by_css_selector
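
Here is a quick sketch of a few of these lookups in use. It runs against the example.com placeholder page; the commented-out lines assume elements that page does not necessarily have.

# A few element lookups in action
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')

heading = driver.find_element_by_tag_name('h1')
print(heading.text)  # the page's top-level heading

# Hypothetical lookups, for pages that have such elements:
# box = driver.find_element_by_id('search-box')
# row = driver.find_element_by_css_selector('table tr.active')
driver.quit()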
These are the methods that can be used in an action chain (a short sketch follows the list):
  1. click() → Clicks an element.
  2. click_and_hold() → Holds down the left mouse button on an element.
  3. context_click() → Right-clicks on an element.
  4. double_click() → Double-clicks on an element.
  5. move_to_element() → Moves the mouse to the middle of an element.
  6. key_down() → Sends a key press without releasing it (only useful with modifier keys: Control, Shift, and Alt).
  7. key_up() → Releases a previously pressed modifier key.
  8. send_keys() → Sends keys to the currently focused element.
  9. release() → Releases a held mouse button on an element.
  10. perform() → Performs all stored actions.
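
As a quick illustration, the sketch below queues a few of these actions and runs them with perform(). The element lookup reuses example.com’s heading; on a real page you would target your own elements.

# ActionChains sketch: nothing happens until perform() is called
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://example.com')
element = driver.find_element_by_tag_name('h1')

actions = ActionChains(driver)
actions.move_to_element(element)   # hover over the heading
actions.context_click(element)     # then right-click it
actions.key_down(Keys.CONTROL)     # hold Control down ...
actions.key_up(Keys.CONTROL)       # ... and release it again
actions.perform()                  # run all queued actions in order

driver.quit()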
I’m sure that if you have reasonable expertise in Python and a basic grasp of the HTML DOM, you can extend or twist or turn these little code snippets to satisfy all your scraping needs. Have a happy time scraping the web, and leave me a thought in the comments if you have any queries or comments.
