The goal of this post is to develop a utility that facilitates the following:

  1. Retrieve HTML from the target webpage.
  2. Parse the HTML, extracting all URL references to embedded PDF links.
  3. For each embedded PDF link, download the document and save it locally.

Plenty of third-party libraries can query and retrieve a webpage’s links in a single API call. The purpose of this post, however, is to show that by combining elements of the Python Standard Library with the Requests package, a lot can be accomplished with minimal overhead.


Step I: Acquire HTML

Before we begin, it’s important to mention that if you’re following along on a computer situated behind a firewall or corporate proxy, you’ll need to provide the necessary proxy server details as part of the requests.get call. For example, assume an individual with username “user33” and password “Password33” has their web traffic routed through “corporate.proxy.com” via port 8080. “user33” would first need to specify their authentication details in a dictionary, then pass the dictionary to requests.get’s optional proxies argument:

"""
NOTE: This step is for individuals working behind a firewall
or corporate proxy. If this does not apply, skip this section.
"""
import requests

# creating proxies dict to submit along with requests.get =>
proxies = {'http': 'http://user33:Password33@corporate.proxy.com:8080',
           'https': 'https://user33:Password33@corporate.proxy.com:8080'}


# arbitrary URL from which to harvest PDFs =>
URL = "https://en.wikipedia.org/wiki/Conjugate_prior"

# simplified version of requests.get call for illustration only =>
requests.get(URL, proxies=proxies)


Note that the proxies argument would be required for each subsequent invocation of requests.get.
For the remainder of the post, I’ll assume we are not working behind a proxy, and will present all code examples without referencing the proxies argument.
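
If you’d rather not repeat the proxies argument on every call, one option is to attach the dictionary to a requests.Session object, which then applies it to every request made through that session. A minimal sketch, reusing the hypothetical credentials above:

import requests

# hypothetical credentials from the example above =>
proxies = {'http': 'http://user33:Password33@corporate.proxy.com:8080',
           'https': 'https://user33:Password33@corporate.proxy.com:8080'}

# a Session carries settings such as proxies across all of its requests.
session = requests.Session()
session.proxies.update(proxies)

# subsequent calls no longer need the proxies keyword argument =>
response = session.get("https://en.wikipedia.org/wiki/Conjugate_prior")
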

The library that facilitates communication between Python and the target webpage is requests. It exposes a simple, intuitive interface that works right out of the box (“batteries included”). Retrieving a webpage’s HTML is as straightforward as:

import requests
requests.get(<URL>).text   


Here, URL is a string representing the target URL. requests.get returns a Response object, and by accessing its text attribute, we request that the webpage’s content be returned as plain text, which allows for parsing with regular expressions in the next step.
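
Before reaching for text, it can also be useful to peek at the Response object itself, for instance to confirm the request succeeded. A short sketch:

import requests

URL = "https://en.wikipedia.org/wiki/Conjugate_prior"
response = requests.get(URL)

# status_code holds the HTTP status returned by the server (200 indicates success).
print(response.status_code)

# raise_for_status() raises an exception for 4xx/5xx responses.
response.raise_for_status()

# text decodes the body to a string; content exposes the raw bytes.
html = response.text
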

What follows is the logic corresponding to Step I of our PDF Harvester walkthrough:

# ===================================================================
# PDF Harvester I of III: Retrieve HTML as plain text               |
# ===================================================================
import requests

URL = "https://en.wikipedia.org/wiki/Conjugate_prior"

# instruct requests object to return HTML as plain text.
html = requests.get(URL).text


The HTML has been obtained. We now need to highlight and extract references to all embedded PDF links. For this step, we’ll make use of regular expressions, available in the Python Standard Library in the re module.

Step II: Extract PDF URLs from HTML

A cursory review of the HTML from webpages with embedded PDF links revealed the following:

  1. Valid PDF URLs will almost always be embedded within an href tag.
  2. Valid PDF URLs will in all cases be preceded by http or https.
  3. Valid PDF URLs will in all cases be followed by a trailing >.
  4. Valid PDF URLs cannot contain whitespace.

After a bit of trial and error, the following regular expression was found to have acceptable performance for our test cases (note the escaped dot, so that .pdf matches a literal period rather than any character):

"(?=href=).*(https?://\S+\.pdf).*?>"


Two excellent sites to practice building and testing regular expressions are Pythex and RegExr. Both allow you to construct regular expressions and determine how they match against the target text. I find myself using both on a regular basis. Highly recommended!

What follows is the logic for Step II:

# ===================================================================
# PDF Harvester II of III: Extract PDF URLs from HTML               |
# ===================================================================
import requests
import re

# Specify URL for webpage of interest.
URL  = "https://en.wikipedia.org/wiki/Conjugate_prior"
html = requests.get(URL).text

# Search html and compile PDF URLs.
pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

# Optionally display content of pdf_urls.
print(pdf_urls)


Note that the regular expression is prefixed with an r when passed to re.findall. This tells Python to treat what follows as a raw string literal, so backslashes are passed through to the regular expression engine unchanged rather than being interpreted as escape sequences.
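
A quick illustration of the difference between a normal string and a raw string literal:

# a normal string interprets \t as a single tab character;
# a raw string keeps the backslash and the letter as two separate characters.
print(len("\t"))    # 1
print(len(r"\t"))   # 2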

re.findall returns a list of matches extracted from the source text. In our case, it returns a list of URLs corresponding to PDF documents.

For our last step, we need to retrieve the documents associated with our collection of links and write them to file locally. We introduce another module from the Python Standard Library, os.path, which lets us split a path into its components so we can retain each document’s original filename when saving it to disk.

For example, consider the following URL:

"http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf"


To capture Lecture11.pdf, we pass the absolute URL to os.path.split, which returns a tuple with everything preceding the filename as the first element and the filename (with extension) as the second element:

>>> import os.path
>>> url = "http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf"
>>> os.path.split(url)
('http://Statistical_Modeling/Fall_2017/Lectures', 'Lecture11.pdf')


Therefore, we can capture the filename and extension by calling os.path.split(url), and using Python’s index notation to specify the element at the second position in the tuple, os.path.split(url)[1]. This will be used to name the documents we save locally.
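
As an aside, os.path.basename returns the same trailing component directly, if you prefer to skip the indexing:

>>> import os.path
>>> os.path.basename("http://Statistical_Modeling/Fall_2017/Lectures/Lecture11.pdf")
'Lecture11.pdf'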

Step III: Write PDFs to File

This step differs from the initial HTML retrieval in that we need to request the content as bytes, not text. By calling requests.get(url).content, we access the raw bytes that comprise the PDF, then write those bytes to file locally. Here’s the logic for the third and final step:

# ===================================================================
# PDF Harvester III of III: Write PDF(s) to file                    |
# ===================================================================
import requests
import re
import os
import os.path


URL      = "https://en.wikipedia.org/wiki/Conjugate_prior"
html     = requests.get(URL).text
pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

# Set working directory to desired location.
os.chdir("C:\\user33\\")

# Request PDF content and write to file for all entries.
for pdf in pdf_urls:

    # Get filename from url for naming file locally.
    pdfname = os.path.split(pdf)[1]

    try:
        # Request the PDF content as raw bytes.
        r = requests.get(pdf).content

        # Write the bytes to a local file named after the document.
        with open(pdfname, "wb") as f:
            f.write(r)

    except (requests.exceptions.RequestException, OSError):
        print("Unable to download {}.".format(pdfname))
        continue

print("\nProcessing complete!")


Notice that we wrap both the requests.get call and with open(pdfname, "wb")... in a try-except block: this handles situations that would prevent us from downloading or saving a PDF, such as empty redirects or invalid links.
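
Two optional refinements, shown here as a sketch of a hypothetical helper rather than as part of the harvester itself: requests returns a Response object even for error pages (a 404 still has content), so raise_for_status() can convert HTTP errors into exceptions our except clause will catch; and since we already have the raw bytes, we can confirm the payload really is a PDF, as PDF files begin with the marker %PDF.

import requests

def download_pdf(pdf_url, pdfname):
    """Illustrative helper: download a single PDF, returning True on success."""
    try:
        resp = requests.get(pdf_url)
        resp.raise_for_status()              # raise on 4xx/5xx responses.

        # PDF documents begin with the bytes b"%PDF"; skip anything that doesn't.
        if resp.content[:4] != b"%PDF":
            print("{} does not appear to be a PDF; skipping.".format(pdfname))
            return False

        with open(pdfname, "wb") as f:
            f.write(resp.content)
        return True

    except (requests.exceptions.RequestException, OSError):
        print("Unable to download {}.".format(pdfname))
        return False
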

We end up with roughly two dozen lines of code, including comments and library imports.

Lastly, we present the PDF Harvester with the commands collected into a function and with comments stripped away:

# ===================================================================
# PDF Harvester                                                     |
# ===================================================================
import requests
import re
import os
import os.path


def pdf_harvester(url, loc=None):
    """
    Retrieve url's html and extract references to PDFs.
    Download PDFs, writing to `loc`. If `loc` is None,
    save to current working directory.
    """
    print("Harvesting PDFs from => {}\n".format(url))

    os.chdir(os.getcwd() if loc is None else loc)
    html     = requests.get(url).text
    pdf_urls = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

    for pdf in pdf_urls:

        pdfname = os.path.split(pdf)[1]

        try:
            print("Downloading {}...".format(pdfname))
            r = requests.get(pdf).content
            with open(pdfname, "wb") as f:
                f.write(r)

        except (requests.exceptions.RequestException, OSError):
            print("Unable to download {}.".format(pdfname))
            continue

    print("\nProcessing complete!")




# example calling `pdf_harvester` =>
>>> URL = "https://en.wikipedia.org/wiki/Poisson_point_process"
>>> pdf_harvester(URL, loc="C:\\user33\\")
Harvesting PDFs from => https://en.wikipedia.org/wiki/Poisson_point_process

Processing complete!


Conclusion

In this post, we demonstrated step-by-step how to develop a useful application that can be put to work delivering content immediately. As always, be sure to check each website’s policy as it pertains to automated data acquisition prior to running this or similar tools.
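
For the programmatic side of that check, the Standard Library’s urllib.robotparser module can read a site’s robots.txt and report whether a given page may be fetched. A minimal sketch, using a wildcard user agent:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# can_fetch reports whether the given user agent may retrieve the URL.
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Conjugate_prior"))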


Until next time, happy coding!