I’ve been working on a genealogy project for several months now (trying to find the baby daddy of someone many generations ago, juicy stuff but the details aren’t important). One incredibly useful and fascinating resource has been old newspaper articles, which I’ve accessed primarily through Newspapers.com. While this site is an archival treasure trove, they don’t yet have papers available in the counties where my ancestor lived, namely Fentress and Overton Counties in Tennessee. I’m almost positive that if I did have access to papers from those areas, my family would appear, and I would be one step closer to solving this family mystery.
I decided to look one day to see if there were any other archives that had papers from these areas, and I found that the Library of Congress (duh, I should have known) has an unbelievable directory of US newspapers that is searchable by county and year. I found many newspapers on this list that aren’t on Newspapers.com, but only one that is fully digitized and available on the Library of Congress website. The others are on microfilm at the Tennessee State Library, which I have other big plans for. For now, though: while this digitized paper was published in a town that isn’t directly relevant to my family, I thought it would still be a fun project to download all the pages and attempt to run some OCR to look for any mention of my family’s name.
The newspaper I attempted to download is The Rugbeian and District Reporter out of Rugby, TN. The town was apparently founded as an “experimental utopian colony” in 1880, where the “second sons” of England could be free to own land and live life as they wished. It was plagued by disorganization and typhoid from the start, so the town (and this newspaper) didn’t last very long.
On the Library of Congress website, each page of the paper has its own URL, and you can cycle through all pages of an issue and all issues of a paper by clicking through arrow buttons. You can also download the pages in multiple formats. I wanted to download the PDF version and the “OCR ALTO” version, which I came to learn is an xml format denoting the layout of text on an image and, in my case, the actual text present in the image. My goal was to write scripts that would:
- cycle through all issues and all the pages of each issue
- download the PDF and OCR ALTO format for each image
- extract the text from each OCR ALTO file
With a heavy assist from Claude, I accomplished this in two scripts: download_pages.py for steps 1 and 2, and transcribe_pages.py for step 3, both available in my GitHub repository.
The main function in download_pages.py that kicks off the algorithm is download_newspaper_pages(), which takes one parameter: the URL of the first page of the first issue of the newspaper you want to download. It uses Selenium and Google Chrome to perform the web scraping. Selenium needs to be installed before this function will work, and I’ll leave those instructions for elsewhere (see my first post for another example where I used Selenium).
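To give a sense of the overall shape, here is a heavily simplified sketch of what a function like this can look like. To be clear, this is not the actual implementation from the repo: the link text used to find elements and the next-page logic are placeholders, and the real script also organizes files into the publication/date folder structure shown below, collects metadata, and handles errors (more on that next).

# A minimal sketch, NOT the real download_newspaper_pages(): walk page by page
# with Selenium, grabbing the PDF and OCR ALTO download links from each page.
# The link text used to locate elements is a placeholder, not the loc.gov layout.
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

def download_newspaper_pages(first_page_url, out_dir="downloads"):
    driver = webdriver.Chrome()  # assumes Chrome and chromedriver are installed
    os.makedirs(out_dir, exist_ok=True)
    url = first_page_url
    page_num = 1
    while url:
        driver.get(url)

        # Placeholder selectors -- the real page layout differs.
        pdf_href = driver.find_element(By.PARTIAL_LINK_TEXT, "PDF").get_attribute("href")
        alto_href = driver.find_element(By.PARTIAL_LINK_TEXT, "ALTO").get_attribute("href")

        # Save both formats for the current page.
        for href, ext in [(pdf_href, "pdf"), (alto_href, "xml")]:
            resp = requests.get(href, timeout=60)
            resp.raise_for_status()
            with open(os.path.join(out_dir, f"page_{page_num}.{ext}"), "wb") as f:
                f.write(resp.content)

        # Follow the "next" arrow until there isn't one.
        next_links = driver.find_elements(By.PARTIAL_LINK_TEXT, "Next")
        url = next_links[0].get_attribute("href") if next_links else None
        page_num += 1

    driver.quit()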
This file also includes several helper functions that extract the metadata for each issue from the site, sanitize filenames, and handle technical difficulties. Oftentimes the Library of Congress site would get stuck and send you to a “we’re having technical difficulties” page, which usually resolved if the page was simply refreshed. So a big part of the error handling is checking that this page didn’t appear, and if it did, refreshing until the publication page rendered again.
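The refresh-and-retry idea is simple enough to sketch in a few lines (the exact phrase being checked and the wait times here are my assumptions, not necessarily what the script does):

import time

def wait_for_page(driver, url, max_retries=10, delay=5):
    # Takes an existing Selenium driver. Keep reloading until the
    # "technical difficulties" page goes away, or give up after max_retries.
    # The phrase and timing are illustrative.
    driver.get(url)
    for _ in range(max_retries):
        if "technical difficulties" not in driver.page_source.lower():
            return True
        time.sleep(delay)
        driver.refresh()
    return False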
The pages are extracted into a downloads folder in the same directory as the script, structured like this:

downloads/
├── Publication_Title_1/
│   ├── date_1/
│   │   ├── page_1.pdf
│   │   ├── page_1.xml
│   │   ├── page_2.pdf
│   │   ├── page_2.xml
│   │   └── metadata.json
│   ├── date_2/
│   │   ├── page_1.pdf
│   │   ├── page_1.xml
│   │   ├── page_2.pdf
│   │   ├── page_2.xml
│   │   └── metadata.json
│   └── ...
After running this algorithm to download all of the page files, the transcribe_pages.py file can be run to pull text out of the OCR ALTO XML files. When I started this project I assumed that this file only gave the layout of the document but not the actual text itself, meaning I’d have to perform my own OCR. But the file actually has the OCR already completed, and all I needed to do was extract the text from each page.
The main function in this script is extract_all_text_from_alto(), which takes one parameter: the path to the folder for the publication (downloads/Publication_Title_1 in the example above). The script then cycles through all of the pages in this folder and generates a new text file for each page, so there will be a PDF, XML, and TXT file for each page.
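For the curious, ALTO stores each recognized word as a String element with a CONTENT attribute, so the heart of the extraction boils down to something like this (a simplified sketch rather than the exact code in the repo; the loose namespace handling and output naming here are my own choices):

# Sketch of the core extraction step: pull the CONTENT attribute off every
# <String> element in the ALTO file and write the words out as plain text.
import glob
import os
import xml.etree.ElementTree as ET

def alto_to_text(xml_path):
    root = ET.parse(xml_path).getroot()
    # Match on the local tag name so the ALTO namespace version doesn't matter.
    words = [
        el.attrib["CONTENT"]
        for el in root.iter()
        if el.tag.endswith("String") and "CONTENT" in el.attrib
    ]
    return " ".join(words)

def extract_all_text(publication_dir):
    # e.g. extract_all_text("downloads/Publication_Title_1")
    for xml_path in glob.glob(os.path.join(publication_dir, "*", "*.xml")):
        txt_path = os.path.splitext(xml_path)[0] + ".txt"
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(alto_to_text(xml_path))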
After extracting all this text from all these pages, it was time to look for mention of my family. I did a classic Ctrl + F on the folder, searched for some key phrases, and found nothing. But I guess that’s research for ya.
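For what it’s worth, the same sweep can be done in a few lines of Python across all of the generated text files, which makes it easy to try several spellings at once (the surnames below are just placeholders):

import glob

def search_pages(publication_dir, terms):
    # Case-insensitive search across every extracted .txt file.
    for txt_path in glob.glob(f"{publication_dir}/*/*.txt"):
        with open(txt_path, encoding="utf-8") as f:
            text = f.read().lower()
        for term in terms:
            if term.lower() in text:
                print(f"{term!r} found in {txt_path}")

search_pages("downloads/Publication_Title_1", ["Smith", "Smyth"])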
If I can get my hands on those other microfilms, then we’ll really be cooking with gas. Stay tuned.