Extracting Text from Unsearchable PDFs: AI Theory and a Method

JJ Jordan
11 min read · May 9, 2024


AI hype. AI art. AI generated news. AI teaching in schools. AI written articles. AI taking your job. In today’s capitalistic societies, artificial intelligence weaves itself into the fabric of daily life. It seems omnipresent, revolutionary, exhilarating, yet enshrouded in mystery. To some, it embodies the apex of human ingenuity, merging the frontiers of computer science, mathematics, and psychology. Alternatively, skeptics maintain that the hype over AI is fueled by corporate marketing maneuvers, painting a picture of a technological utopia that is far from being realized. They assert that much of what is lauded as AI is overstated and underdelivers, serving more as a vehicle for profit rather than genuine scientific advancement.

If economists are good at one thing, it is cutting through the hype.

Through the prism of foundational microeconomics principles, social science researchers can leverage AI to automate fundamental practices like data collection.

Using AI to Make Shoe Leather

One imperative lesson to remember is highlighted by the late statistician David A. Freedman in his 1991 article Statistical Models and Shoe Leather. Freedman showcases several historical examples to stress that advanced statistical techniques can rarely serve as substitutes for good design and relevant data. Notably, he recounts how John Snow, an English physician in 1850s London, demonstrated that cholera is a waterborne infectious disease. Snow not only presented compelling arguments aligned with the germ theory of disease, but he also engaged in scientific investigation and meticulous data analysis.

During the 1853 cholera epidemic, Snow compared cholera cases among consumers of water from two suppliers, based on the location of their water intake from the Thames River. He created a spot map to show where cases occurred and linked each house to its specific water supply. Then by surveying thousands of households across different regions of London, Snow deduced that relocating the water intake points could have saved over 1,000 lives. His data collection and analysis were extensive, incorporating distinctions between ecological and individual correlations while employing persuasive counterfactual reasoning.

Map of London showing water-distribution system from Snow, J. (1855).

Using this example from Snow, Freedman advocates for a more cautious approach to interpreting statistics, emphasizing the need for thorough data collection and rigorous model testing. Freedman champions the concept of shoe leather, a metaphor for diligent, ground-level data collection and empirical investigation, as vital for improving the validity of statistical modeling in social science research.

“The force of the argument results from the clarity of the prior reasoning, the bringing together of many different lines of evidence, and the amount of shoe leather Snow was willing to use to get the data.” David A. Freedman

A table from Snow, J. (1855).

Although the analysis conducted in Snow's 1855 paper may offer limited insight for us in the 2020s, historical data can be valuable for identifying trends and understanding the causal mechanisms behind current social outcomes. In fields such as urban and environmental economics, long-term panel datasets are employed to address pressing social questions. The Census Linking Project, for instance, spans from 1850 to 1940 and encompasses over 700 million links for individuals living in the United States. These links enable researchers across the social sciences to build longitudinal datasets that accurately represent the population.

However, much of the world's data is unstructured (text, images, audio, or video) rather than the comfortable numerical or categorical data that practitioners typically handle, so unstructured data represents a largely untapped source of information. Furthermore, a significant portion of textual data is stored in non-searchable PDF formats, such as scanned documents. Or it sits in libraries, in schools, in newsrooms, or in art galleries, still waiting for someone, maybe you, to come along and scan it. Expend a bit of leather.

In this brief article, I present a straightforward use case designed to help econometricians and other applied practitioners become adept at breaking down AI products and using them as tools to enhance data collection by converting scanned text documents into editable data. In other words, using AI to make shoe leather. For a shortcut to the code, check out my GitHub.

What is Optical Character Recognition?

OCR software converts images of typed, handwritten or printed text into editable and searchable machine-encoded data. While the optophone, the first OCR machine, was developed in 1913 using sonification to assist visually impaired individuals, today the process has been enhanced by the application of AI. Modern OCR systems use advanced machine learning algorithms, particularly convolutional neural networks, to improve character recognition accuracy even in complex or low-quality images.

While many commercial products offer OCR software, there are several reasons why writing your own Python code to deploy OCR is beneficial. Most importantly, writing OCR code enables smoother integration with existing workflows and regression models, while also allowing you to monitor and adjust any data transformations that take place. For instance, commercial OCRs may automatically correct spelling errors, which could be detrimental when assessing the social context of historical documents.

Additionally, writing custom OCR code provides the flexibility to tailor the process to particular document formats. For example, commercial OCRs often struggle with files containing both text and tables on the same page, leading to formatting issues. Moreover, you can develop functions that are reusable across projects, much like MATLAB functions. With these features in mind, note that there are several open-source OCR programs, one of the most popular being Tesseract.

A Theoretical Walkthrough of Tesseract

The Tesseract OCR engine, developed at HP between 1984 and 1994, surpassed its competitors in its 1995 debut at UNLV's Fourth Annual Test of OCR Accuracy, a test series begun in the early 1990s and sponsored by the U.S. Department of Energy. Its methods remained largely a secret until HP released it as open source in 2005. Processing follows a traditional step-by-step pipeline, though several of its stages were unusual in their day and remain so now. Key to its functionality is a connected component analysis that stores component outlines as blobs. This design decision, though computationally expensive, enables the detection of inverse (white-on-black) text as easily as black-on-white text. Recognition involves organizing blobs into text lines, which are then analyzed for fixed-pitch or proportional text. Text lines are segmented differently based on character spacing, with fixed-pitch text chopped into character cells and proportional text segmented using definite and fuzzy spaces. Recognition then proceeds in two passes: words recognized satisfactorily on the first pass are handed to an adaptive classifier as training data, and a second pass re-recognizes the leftover words with the benefit of that adaptation.
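To make the first stage concrete, here is a minimal sketch of connected component analysis in Python. It is not Tesseract's internal implementation (which works on component outlines); it simply labels the dark pixels of a hypothetical binarized page image, page.png, using scipy, and summarizes each component as a blob with a bounding box, the raw material for the line finding described next.

import numpy as np
from PIL import Image
from scipy import ndimage

# Load a scanned page (hypothetical file) and binarize it: dark pixels = ink
page = np.array(Image.open('page.png').convert('L'))
ink = page < 128

# Label connected components and get a bounding box for each one
labels, n_blobs = ndimage.label(ink)
boxes = ndimage.find_objects(labels)

# Summarize each component as a "blob": coordinates and height
blobs = [{'x0': sl[1].start, 'x1': sl[1].stop,
          'top': sl[0].start, 'bottom': sl[0].stop,
          'height': sl[0].stop - sl[0].start}
         for sl in boxes]
print(f'found {n_blobs} blobs')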

Tesseract's distinctive features, line finding and baseline finding, use methods that would be familiar to econometricians. The line finding algorithm is designed to identify skewed pages without requiring de-skewing, which preserves image quality. Central to this process are blob filtering and line construction. Assuming roughly uniform text regions, a percentile height filter removes drop-caps and vertically touching characters. The median height approximates the text size in the region, allowing smaller blobs, typically punctuation, special characters, or noise, to be filtered out. Sorting blobs by their x-coordinates helps align them with text lines and reduces the chance of incorrect assignments on skewed pages. When measuring how the remaining filtered blobs overlap with existing rows, a running average y-shift is used to vertically adjust each blob. For each blob, the y-shift is updated with an exponentially weighted running average of the form

$$\bar{y}_{\text{shift}} \;\leftarrow\; \alpha\,\Delta y_{\text{blob}} + (1 - \alpha)\,\bar{y}_{\text{shift}},$$

where $\Delta y_{\text{blob}}$ is the vertical offset of the blob's bottom from its row, $\alpha \in (0.5, 0.7)$, and $\alpha$ is positively correlated with the number of rows on a page. The running average allows the algorithm to track the skew angle while remaining unaffected by descenders.
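As a toy numeric illustration of that update rule, with made-up offsets rather than real page data, the running average blends each new blob's offset into the previous estimate:

def update_y_shift(y_shift, blob_offset, alpha=0.6):
    """Exponentially weighted running average of the vertical offset."""
    return alpha * blob_offset + (1 - alpha) * y_shift

# Offsets drifting upward as we move along a skewed line (illustrative values)
y_shift = 0.0
for offset in [0.8, 1.1, 1.6, 2.0, 2.3]:
    y_shift = update_y_shift(y_shift, offset)

print(round(y_shift, 2))  # running estimate tracks the gradual drift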

After filtering, the majority of the page's blobs are matched to a row, setting a foundation for establishing an accurate baseline. The baselines are estimated through a least median of squares (LMS) fit, and blobs that were initially filtered out are reassigned to their appropriate lines. The LMS can handle outliers, such as punctuation and special characters, that would skew a least squares (LS) fit. LMS regression, formalized by Rousseeuw (1984), builds on robust estimators such as Siegel's (1982) repeated median and was developed to improve upon the LS estimate, which can be driven to arbitrarily large, aberrant values when outliers are present.

Consider the standard linear model

$$y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + e_i, \qquad i = 1, \dots, n.$$

The classical LS approach estimates $\boldsymbol{\beta}$ by solving

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} r_i^{2},$$

where $r_i$ is the residual,

$$r_i = y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta}.$$

While several estimators attempt to add robustness to LS by replacing the square with something else (e.g., the Least Absolute Values estimator (Harter, 1977), the M estimator (Huber, 1973), and the generalized M estimators (Mallows, 1975)), the repeated median instead applies medians coordinate-wise. For the j-th coordinate of the vector $\boldsymbol{\beta}$, it is defined as

$$\hat{\beta}_j = \operatorname{med}_{i_1}\Bigl(\cdots\operatorname{med}_{i_{p-1}}\bigl(\operatorname{med}_{i_p}\,\beta_j(i_1,\dots,i_p)\bigr)\Bigr),$$

where $\beta_j(i_1,\dots,i_p)$ is the j-th coordinate of the fit determined by observations $i_1,\dots,i_p$. The LMS, in turn, replaces the sum in LS with a median, giving the estimator

$$\min_{\boldsymbol{\beta}} \;\operatorname{med}_{i}\; r_i^{2}.$$
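To see the difference in practice, here is a small numeric sketch contrasting an LS fit with an LMS fit on a toy baseline with one outlying blob. The crude grid search stands in for a proper LMS solver and the numbers are made up; it is an illustration, not Rousseeuw's algorithm.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(20, dtype=float)
z = 0.05 * t + rng.normal(0, 0.02, size=20)  # gently skewed baseline, slope 0.05
z[-1] += 3.0                                 # one outlier (e.g., a stray mark)

# Ordinary least squares fit: the slope gets pulled toward the outlier
ls_slope, ls_intercept = np.polyfit(t, z, 1)

# Least median of squares via a crude grid search over candidate lines
slopes = np.linspace(-0.5, 0.5, 201)
intercepts = np.linspace(-1.0, 1.0, 201)
best_crit, lms_fit = np.inf, None
for b1 in slopes:
    for b0 in intercepts:
        crit = np.median((z - (b0 + b1 * t)) ** 2)
        if crit < best_crit:
            best_crit, lms_fit = crit, (b1, b0)

print('LS  slope:', round(ls_slope, 3))    # noticeably above 0.05
print('LMS slope:', round(lms_fit[0], 3))  # stays close to 0.05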

In this model, the points along the original straight baseline are denoted by $t$ and their vertical coordinates by $z$. We are searching for a baseline $s$ that satisfies $s(t_i) = z_i$ and $s'(t_i) = s_i$ at the given points. Suppose there are two points, $t_1 < t_2$, and that $z_1, z_2, s_1, s_2$ are given real numbers. A single quadratic polynomial can serve as the baseline if

$$\frac{s_1 + s_2}{2} = \frac{z_2 - z_1}{t_2 - t_1}.$$

Otherwise, for every $t_1 < \theta < t_2$ there is a unique quadratic spline $s$ with a simple knot at $\theta$ satisfying

$$s(t_i) = z_i, \qquad s'(t_i) = s_i, \qquad i = 1, 2,$$

and this spline will do the trick. The quadratic spline has the advantage of being reasonably stable compared with other methods. Adding it all together, the algorithm runs as follows:

1. Perform connected component analysis of the image.
2. Filter blobs, removing drop-caps, underlines and isolated noise.
3. Sort blobs using the x-coordinate as the sort key.
4. Make initial rows:
   Set the running average y shift to 0.
   For each blob in sorted order:
       Find the existing row which has most vertical overlap with the blob.
       If there is no overlapping row
       Then
           Make a new row and put the blob in it.
           Record the coordinates of the blob as the top and bottom of the row.
       Else
           Add the blob to the row.
           Expand the limits of the row with the blob's top and bottom.
           Update the running average y shift with the bottom of the blob.
       End If
   End For
5. Fit baselines.
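For concreteness, here is a minimal Python sketch of steps 2 through 4, assuming the blob dictionaries produced by the connected component sketch earlier. The height thresholds, the value of alpha, and the exact y-shift update are illustrative assumptions rather than Tesseract's internal choices.

import numpy as np

def make_rows(blobs, alpha=0.6):
    """blobs: dicts with 'x0', 'top', 'bottom' (image coordinates, y grows downward)."""
    # Step 2: filter blobs whose height is far from the median (drop-caps, noise)
    med_h = np.median([b['bottom'] - b['top'] for b in blobs])
    kept = [b for b in blobs if 0.5 * med_h <= b['bottom'] - b['top'] <= 2.0 * med_h]

    # Step 3: sort by x so the vertical drift from skew accumulates gradually
    kept.sort(key=lambda b: b['x0'])

    # Step 4: build rows, shifting each blob by the running average y-shift
    rows, y_shift = [], 0.0
    for b in kept:
        top, bottom = b['top'] - y_shift, b['bottom'] - y_shift
        overlap = lambda r: min(bottom, r['bottom']) - max(top, r['top'])
        candidates = [r for r in rows if overlap(r) > 0]
        if not candidates:
            row = {'top': top, 'bottom': bottom, 'blobs': [b]}
            rows.append(row)
        else:
            row = max(candidates, key=overlap)
            row['blobs'].append(b)
            row['top'], row['bottom'] = min(row['top'], top), max(row['bottom'], bottom)
        # Update the running average y-shift with the bottom of the blob
        y_shift = alpha * (b['bottom'] - row['bottom']) + (1 - alpha) * y_shift
    return rows

Baseline fitting (step 5) would then run a robust fit, such as the LMS grid-search sketch above, on the blob bottoms within each row.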

This is a brief synopsis of how the unique features of Tesseract work. Evidently, and dare I say obviously, reaching into the statistical black box reveals standard, familiar regression functions. Now it's time for the code, which will create a Python dictionary containing the text extracted from the PDF.

Before implementing Tesseract, several Python libraries will be used to process the scanned documents into a suitable format. We will follow the steps below:

  1. Read the PDF from the path where it is stored on your PC.
  2. Crop the pages to just the area of the text, dropping any white space.
  3. Convert the cropped PDF pages into PNG images.
  4. Read the images into Tesseract.
  5. Use Tesseract to extract text from the images.

I assume that you have Python 3.12 on your PC and that you have installed the following libraries: PyPDF2, pdfminer.six, pdf2image, Pillow, and pytesseract, along with Poppler (poppler-utils) and the Tesseract engine themselves. Be sure to add both Poppler and Tesseract to your PC's PATH environment variable. First, import the libraries:

from PyPDF2 import PdfWriter, PdfReader
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTFigure
from PIL import Image
from pdf2image import convert_from_path
import pytesseract
import os
import json

Then, create a function that uses PDFMiner's coordinates to locate and crop the text element on the page, and saves the cropped page to a new PDF using PyPDF2.

def cut_pix(bite, canvas):
    # Bounding box of the element: (x0, y0) is the lower-left corner, (x1, y1) the upper-right
    [pix_left, pix_bottom, pix_right, pix_top] = [bite.x0, bite.y0, bite.x1, bite.y1]
    # Crop the page to those coordinates
    canvas.mediabox.lower_left = (pix_left, pix_bottom)
    canvas.mediabox.upper_right = (pix_right, pix_top)
    cut_writer = PdfWriter()
    cut_writer.add_page(canvas)
    with open('cut_pix.pdf', 'wb') as cut_pdf_file:
        cut_writer.write(cut_pdf_file)

Next, create a function that converts the PDF file into a PNG image file. For this function to run, you need to download Poppler and put the folder (Release-23.XX.0-0 or the newest version) on your drive. Also, as stated above, add the folder to your PATH environment variable.

def pdf2png(in_file):
    # Change the path location after downloading Poppler
    pixs = convert_from_path(
        in_file,
        poppler_path=r'C:\Release-23.11.0-0\poppler-23.11.0\Library\bin')
    pix = pixs[0]
    out_file = 'PDF_pix.png'
    pix.save(out_file, 'PNG')

Then, build one more function which reads the image files into Tesseract and extracts the text from the images.

def png2text(pix_path):
    img = Image.open(pix_path)
    text = pytesseract.image_to_string(img)
    return text

For this example, I am converting pages from the 2007 Television and Radio Broadcasting Yearbook. The Yearbook, which began publication in 1947, has extensive locational data on radio and television broadcasting stations and markets across the United States and Canada. First, import the PDF and create a reader object.

#Change the name of the pdf
pdf_location = '[your_pdf].pdf'
pdf_attach = open(pdf_location, 'rb')
pdfread = PdfReader(pdf_attach)

Create the dictionary that will hold the text extracted from each image, with one key per page, then iterate over the pages extracted from the PDF.

text_in_canvas = {}
for scroll_num, scroll in enumerate(extract_pages(pdf_location)):

    # Match the pdfminer layout page with the corresponding PyPDF2 page object
    canvas = pdfread.pages[scroll_num]
    scroll_text = []
    line_format = []
    text_in_pixs = []
    scroll_content = []

    # Sort the page's elements from top to bottom
    scroll_bite = [(bite.y1, bite) for bite in scroll._objs]
    scroll_bite.sort(key=lambda a: a[0], reverse=True)

    for i, component in enumerate(scroll_bite):
        bite = component[1]
        if isinstance(bite, LTFigure):

            # Crop the figure, convert it to a PNG, and run Tesseract on it
            cut_pix(bite, canvas)
            pdf2png('cut_pix.pdf')
            pix_text = png2text('PDF_pix.png')
            text_in_pixs.append(pix_text)
            scroll_content.append(pix_text)
            scroll_text.append('pix')
            line_format.append('pix')

    dctkey = 'Page_' + str(scroll_num)
    text_in_canvas[dctkey] = [scroll_text, line_format, text_in_pixs, scroll_content]

Close the PDF, then remove the temporary files and save the results. I chose to store the results in a JSON file for further analysis.

pdf_attach.close()
os.remove('cut_pix.pdf')
os.remove('PDF_pix.png')
result = ''.join(text_in_canvas['Page_0'][3])
with open("sample.json", "w") as outfile:
json.dump(result, outfile)
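As a quick, optional sanity check, the stored text can be reloaded for downstream analysis:

import json

# Reload the extracted text written above and preview the first few characters
with open("sample.json") as infile:
    recovered = json.load(infile)
print(recovered[:200])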

While AI might change various aspects of daily life, it is important to maintain a critical perspective amidst the hype. Stripping away the facade reveals that behind the buzzwords lie foundational principles based on familiar regression models. The Tesseract OCR engine exemplifies how AI can speed up the process of collecting historical data for contemporary analysis and bridge the gap between abstract statistical models and grounded empirical evidence. By wielding these tools thoughtfully, practitioners can harness AI as a means to automate the creation of shoe leather.

GitHub link.

References

Agrawal, A., Gans, J., & Goldfarb, A. (2022). Prediction Machines, Updated and Expanded: The Simple Economics of Artificial Intelligence. Harvard Business Press.

Freedman, D. A. (1991). Statistical models and shoe leather. Sociological methodology, 291–313.

Snow, John. (1855) 1965. On the Mode of Communication of Cholera. Reprint ed. New York: Hafner.

Abramitzky, R., Boustan, L., Eriksson, K., Pérez, S., & Rashid, M. (2020). Census Linking Project: Version 2.0 [dataset]. https://censuslinkingproject.org

Thomas, E. (2021). Turning Letters into Tones: A century ago, the optophone allowed blind people to hear the printed word. IEEE Spectrum, 58(7), 34–39.

Smith, R. (2007, September). An overview of the Tesseract OCR engine. In Ninth international conference on document analysis and recognition (ICDAR 2007) (Vol. 2, pp. 629–633). IEEE.

Smith, R. (1995, August). A simple and efficient skew detection algorithm via text row accumulation. In Proceedings of 3rd International Conference on Document Analysis and Recognition (Vol. 2, pp. 1145–1148). IEEE.

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American statistical association, 79(388), 871–880

Siegel, A. F. (1982). Robust regression using repeated medians. Biometrika, 69(1), 242–244.

Schumaker, L. L. (1983). On shape preserving quadratic spline interpolation. SIAM Journal on Numerical Analysis, 20(4), 854–864.


JJ Jordan

Inference on culture, technology, and finance from the perspective of a PhD economics candidate.