Create searchable PDFs using Ghostscript and Tesseract

Created by Jeremy Burgess, Modified on Wed, 17 Jan, 2024 at 2:25 PM by Jeremy Burgess

Symptoms

You have scanned PDF files but cannot search them for text or copy and paste their content into another application.

Cause

The scanned PDF contains only page images; it has no text layer. To fix this, create a searchable PDF that includes one.

Resolution

In general, this is best done at the source, i.e. configure the scanner so that it performs OCR on the document during the scanning process.

Prerequisites:

Xpdf command line tools https://www.xpdfreader.com/download.html (choco install xpdf-utils)
Ghostscript command line https://ghostscript.com/releases/gsdnld.html (choco install ghostscript)
Tesseract command line https://tesseract-ocr.github.io/tessdoc/Downloads.html (choco install tesseract)
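
After installing, it can help to confirm that the executables resolve from the current PowerShell session. The following is a minimal sketch, assuming the executable names pdfinfo.exe, gswin64c.exe, and tesseract.exe; the helper name Test-ToolOnPath is hypothetical, and some installs use different names (for example gswin32c.exe for 32-bit Ghostscript).

```powershell
# Hypothetical helper: returns $true when an executable name resolves on PATH
function Test-ToolOnPath {
    param([Parameter(Mandatory)][string]$Name)
    return [bool](Get-Command -Name $Name -CommandType Application -ErrorAction SilentlyContinue)
}

# Executable names assumed by this article; adjust them if your install differs
$requiredTools = 'pdfinfo.exe', 'gswin64c.exe', 'tesseract.exe'

foreach ($tool in $requiredTools) {
    if (Test-ToolOnPath -Name $tool) {
        Write-Host "Found:   $tool"
    } else {
        Write-Warning "Missing: $tool (install it or add its folder to PATH)"
    }
}
```

Any tool reported as missing needs either its folder added to PATH or its full path written into the script.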

If you already have scanned PDFs and want to make them searchable, try the following example script. Xpdf (for pdfinfo.exe), Ghostscript, and Tesseract must be installed on the computer first; you will also need to add their folders to the PATH environment variable or amend the script to give the full location of each executable.

# Define input and output folder paths
$inputFolder = "C:\Demonstration\In"
$outputFolder = "C:\Demonstration\Out"

# Ensure the output folder exists, create it if necessary
if (-not (Test-Path -Path $outputFolder -PathType Container)) {
    New-Item -Path $outputFolder -ItemType Directory
}

# Get all PDF files in the input folder (other file types are skipped)
$pdfFiles = Get-ChildItem -Path $inputFolder -Filter *.pdf -File

# Loop through each PDF file
foreach ($pdfFile in $pdfFiles) {
    # Define input and output file paths
    $sourcePDF = $pdfFile.FullName
    $outputPDF = Join-Path -Path $outputFolder -ChildPath $pdfFile.Name

    # Create a temporary folder for page images and text files
    $tempFolder = Join-Path -Path $outputFolder -ChildPath "temp_$($pdfFile.BaseName)"
    if (-not (Test-Path -Path $tempFolder -PathType Container)) {
        New-Item -Path $tempFolder -ItemType Directory
    }

    # Get the total number of pages from the "Pages:" line of the pdfinfo output
    $totalPages = [int]((pdfinfo.exe $sourcePDF | Select-String -Pattern '^Pages:').Line -replace '\D')

    # Loop through each page in the PDF
    for ($page = 1; $page -le $totalPages; $page++) {
        $outputImage = Join-Path -Path $tempFolder -ChildPath "$page.png"
        gswin64c.exe -sDEVICE=png16m -r300 -o $outputImage -sPageList="$page" "$sourcePDF"

        # Perform OCR using Tesseract on the image
        tesseract.exe $outputImage "$tempFolder\temp-output-$page" pdf
    }

    # Create a list of the individual page PDF file paths produced by Tesseract
    $pagePDFs = 1..$totalPages | ForEach-Object { Join-Path -Path $tempFolder -ChildPath "temp-output-$_.pdf" }

    # Merge all individual page PDFs into a single multipage searchable PDF using Ghostscript.
    # Passing the array directly avoids Invoke-Expression and its quoting problems;
    # -o sets the output file and implies -dBATCH -dNOPAUSE.
    gswin64c.exe -sDEVICE=pdfwrite -dSAFER -o $outputPDF $pagePDFs

    # Clean up the temporary folder
    Remove-Item -Path $tempFolder -Recurse -ErrorAction SilentlyContinue

    Write-Host "Processed $($pdfFile.Name)"
}

Write-Host "OCR process completed."
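
To confirm that an output file actually gained a text layer, one option is to extract its text and check whether anything comes back. The sketch below assumes pdftotext.exe from the Xpdf command line tools is on the PATH and reuses the output folder from the script above; the function names Test-HasVisibleText and Test-PdfHasText are hypothetical and chosen only for illustration.

```powershell
# Hypothetical helper: $true when the text contains any non-whitespace character
function Test-HasVisibleText {
    param([string]$Text)
    return -not [string]::IsNullOrWhiteSpace($Text)
}

# Hypothetical helper: extracts text with pdftotext ("-" sends it to stdout) and checks it
function Test-PdfHasText {
    param([Parameter(Mandatory)][string]$Path)
    $text = (pdftotext.exe $Path "-") -join "`n"
    return Test-HasVisibleText -Text $text
}

# Same output folder as in the script above
$outputFolder = "C:\Demonstration\Out"

# Report any output PDF that still has no extractable text
if (Test-Path -Path $outputFolder -PathType Container) {
    Get-ChildItem -Path $outputFolder -Filter *.pdf -File | ForEach-Object {
        if (-not (Test-PdfHasText -Path $_.FullName)) {
            Write-Warning "No text layer found in $($_.Name)"
        }
    }
}
```

A page that yields only whitespace or page-break characters is treated as having no text, so image-only PDFs are flagged. The same check could be run on the input folder first to skip files that are already searchable.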