sample code to retrieve webpage contents using selenium

Selenium WebDriver is primarily a tool for automating web applications for testing purposes, but it has also become popular for scraping dynamic webpages that use JavaScript to retrieve content on the fly as the page loads. Such pages are hard to scrape with the plain HTTP tools built into most programming languages, because the query and its response only happen when an action is performed in the browser, and that is difficult to mimic with out-of-the-box solutions. This is where Selenium comes in handy: it fires the action that initiates the query and captures the response when it comes through.

This blog post works through an example: we will look at the behaviour of a popular webpage that loads content on the fly, and at how Selenium WebDriver can be used to extract the information we need.

Let's say that one needs to extract the results and the match links for all matches in the 1999/00 season from the Premier League website, the link to which is here. The webpage loads the results for a set of matches in descending order of date, but not the full list. It fetches data for additional sets of matches when the user scrolls to the bottom of the page.

Picture 1: showing a default set of matches when the webpage loads

Picture 2: Webpage loading additional match details when the user scrolls down

Each time a query is sent to the server, the newly fetched fixtures are loaded into a section of the webpage that has a class named "fixtures". This can be seen in picture 3 below.

Picture 3: Webpage loads details of additional matches to a class named "fixtures"

The logic that can be used in this case is to make the web browser scroll to the bottom of the page repeatedly until no more content loads, after which the contents of "fixtures" can be retrieved from the browser. This can be done using the code below.

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# path_to_chrome_driver and season_link must be defined before this point.
# The calls below use the Selenium 4 API (Service object, find_elements with By).
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(service=Service(path_to_chrome_driver), options=options)
driver.get(season_link)

# pause_time (in seconds) gives the browser enough time to make the query and fetch the results
pause_time = 120

last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # wait to load page
    time.sleep(pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # which means end of page
        break
    # update the last height
    last_height = new_height

# all elements with the class "fixtures", now holding the full list of matches
fixtures = driver.find_elements(By.CLASS_NAME, "fixtures")
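
Once the loop finishes, the elements in fixtures still have to be turned into match links. The following is a minimal sketch of one way to do that, not the only one. It assumes the match links appear as ordinary <a href="..."> tags inside the "fixtures" section; the actual markup of the page is not shown in this post, so the sample HTML and the names LinkCollector and extract_links are hypothetical. It parses an HTML fragment with Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from every <a href=...> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open <a href=...>
        if self._href is not None:
            self._text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(part for part in self._text if part)
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in an HTML fragment."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup, standing in for the real contents of "fixtures"
sample_html = """
<div class="fixture"><a href="/match/1001">Team A 2-1 Team B</a></div>
<div class="fixture"><a href="/match/1002">Team C 0-0 Team D</a></div>
"""
print(extract_links(sample_html))
```

With a live browser session, the fragment would come from the page instead of a string, for example extract_links(fixtures[0].get_attribute("innerHTML")).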

The latest ChromeDriver can be downloaded from the JSON at this location.

A common error with ChromeDriver is the message "This version of ChromeDriver only supports Chrome version xx". It means the driver does not match the version of Chrome installed on the machine. It is resolved by going to the above location, downloading the latest driver (one that matches the installed Chrome version), and updating the path to the executable in the code.
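
To make the mismatch easier to spot, the two version numbers can be pulled out of the error text. The sketch below assumes the message uses the wording quoted above and is followed by a "Current browser version is ..." line, which is how ChromeDriver has typically reported it. The sample message, its version numbers, and the helper name parse_version_mismatch are illustrative and not taken from a real run.

```python
import re

# Patterns for the two version numbers in the (assumed) error wording
SUPPORTED_RE = re.compile(r"only supports Chrome version (\d+)")
CURRENT_RE = re.compile(r"Current browser version is (\d+)")


def parse_version_mismatch(message):
    """Return (driver_supports, browser_major) as ints, or None if not found."""
    supported = SUPPORTED_RE.search(message)
    current = CURRENT_RE.search(message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


# Illustrative message text; the version numbers are made up for the example
sample_message = (
    "session not created: This version of ChromeDriver only supports "
    "Chrome version 114\n"
    "Current browser version is 116.0.5845.96 with binary path /usr/bin/chrome"
)

result = parse_version_mismatch(sample_message)
if result is not None:
    supported, current = result
    print(f"ChromeDriver supports Chrome {supported}, but Chrome {current} is installed")
```

In practice the message would come from the exception raised when the driver fails to start (typically selenium.common.exceptions.SessionNotCreatedException), for example by passing str(exc) to the helper.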
