Purpose: Download a collection of comics for my kids. I found a site with these ebooks, but each book was on a separate page with the download link. It would take a lot of time to manually visit each page and download all 54 books. Enter Python to the rescue.

You can get the full code here.

We will be using selenium and chromedriver to automate our Chrome browser and make it perform the repetitive task of visiting around 50 different pages, clicking on download links, and saving the PDF files into a directory.

If you want to create robust, browser-based regression automation suites and tests, scale and distribute scripts across many environments, then you want to use Selenium WebDriver, a collection of language specific bindings to drive a browser – the way it is meant to be driven.

From https://www.selenium.dev/

Install the selenium and multithread modules for Python.

pip3 install selenium multithread

WebDriver is an open source tool for automated testing of webapps across many browsers. It provides capabilities for navigating to web pages, user input, JavaScript execution, and more. ChromeDriver is a standalone server that implements the W3C WebDriver standard. ChromeDriver is available for Chrome on Android and Chrome on Desktop (Mac, Linux, Windows and ChromeOS).

From https://chromedriver.chromium.org/

Install chromedriver. You need to match chromedriver to the version of Google Chrome you have installed. Get the version number of Chrome from: Help > About Google Chrome. My version number is Version 91.0.4472.77 (Official Build) (64-bit).

On visiting chromedriver’s download page, it states:

Current Releases
If you are using Chrome version 92, please download ChromeDriver 92.0.4515.43
If you are using Chrome version 91, please download ChromeDriver 91.0.4472.101
If you are using Chrome version 90, please download ChromeDriver 90.0.4430.24
If you are using Chrome version 89, please download ChromeDriver 89.0.4389.23
For older version of Chrome, please see below for the version of ChromeDriver that supports it.

So I download chromedriver_linux64.zip for my Chrome version, extract it, copy it to the right directory in my path, and assign permissions:

unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/
sudo chmod +x /usr/bin/chromedriver

Now, you can test it out by executing the following Python code:

import os
from selenium import webdriver
from multithread import Downloader

my_url = 'https://readasterix.blogspot.com/2017/01/download-asterix-adventures-in-pdf-en.html'
chromedriver = "/usr/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

You will see Chrome open in a new window, displaying the message “Chrome is being controlled by automated test software”, which means Selenium is working and you are ready to automate downloading your comics.

Now let us examine the target site.

Our target site is https://readasterix.blogspot.com/2017/01/download-asterix-adventures-in-pdf-en.html. It is a simple page listing links, each of which points to another HTML page on which the actual PDF links are present.

Visit the main page, and extract links from there to other pages where PDF links are present:

driver.get(my_url)
elements = driver.find_elements_by_tag_name('a')
links = []
# First collect the links into a list of pages where PDF files can be found.
for el in elements:
    href = el.get_attribute('href')
    if href is not None and "e.filing.ml/p/" in href:
        links.append(href)

Here, I am selecting links containing the text “e.filing.ml/p/”, since I noted that the pages containing PDF links follow this pattern. Use Chrome dev tools to examine the site before you design a scraper.
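The filtering step above can be pulled out into a small helper that is easy to test without a browser (the pattern string is the one observed on this particular site):

```python
def collect_pdf_pages(hrefs, pattern="e.filing.ml/p/"):
    """Keep only the hrefs that look like pages hosting PDF download links."""
    return [h for h in hrefs if h is not None and pattern in h]
```

You would then call collect_pdf_pages with the list of href attributes gathered from the anchor tags.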

Now, create a directory for the downloads and navigate there.

dirpath = "AsterixComics"  # Create a directory for these files.
try:
    os.makedirs(dirpath, exist_ok=True)
    os.chdir(dirpath)
except Exception as e:
    print(f"Failed to create directory: {str(e)}")

Now we have a list of links where the files are present. Let us visit them one by one and download each file.

for i, link in enumerate(links):
    print(f'{i} - {link} ..Visiting..')
    driver.get(link)
    newel = driver.find_element_by_link_text('Download Document')
    download_link = newel.get_attribute('href')
    file_name = download_link.split('/')[-1]  # Extract the file name from the link
    print(f"Downloading {download_link} => {file_name}")
    # Download only if the file doesn't already exist. This is helpful if the
    # server resets our connection, which is very common when mass downloading
    # or mirroring sites.
    if not os.path.exists(file_name):
        # The multithread library can download a file over several threads,
        # which is much faster. The wget module or simply the ubiquitous
        # requests module would also work here.
        download_object = Downloader(download_link, file_name)
    else:
        print("Already downloaded.")

On each of these pages, we select the links with the text “Download Document”; these point directly to the PDF files.

We then extract a file name from each link by taking the substring after the last / symbol, and use the multithread library to download the files one by one, skipping any files that were already downloaded.
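As noted in the comments, the download step is not tied to the multithread library. A minimal single-threaded sketch using only the standard library (urllib), with the same name-from-URL and skip-if-present logic:

```python
import os
import urllib.request

def filename_from_url(url):
    """Take everything after the last '/' as the file name."""
    return url.split('/')[-1]

def download_if_missing(url, file_name=None):
    """Download url to file_name unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    file_name = file_name or filename_from_url(url)
    if os.path.exists(file_name):
        return False
    urllib.request.urlretrieve(url, file_name)
    return True
```

This trades the speed of multi-threaded downloading for having no third-party dependency.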

You can use a similar technique to download from any complex site, even those behind usernames, passwords, and complex forms. You will have to implement delays and wait for elements to load, especially when the site is rich in asynchronous JavaScript code.