Symptoms
You have scanned PDF files but cannot search them for text or copy and paste their text into another application.
Cause
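There is no text layer in the PDF: each scanned page is stored only as an image. To search or copy the text, you need to create a searchable PDF, that is, one with an OCR text layer.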
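To confirm that a missing text layer is the cause, one option is to extract the text from the PDF and see whether anything comes back. The following is a minimal sketch, assuming the Xpdf tool pdftotext.exe (part of the Xpdf command line tools listed under Prerequisites) is on the PATH. The function names Test-HasText and Test-PdfTextLayer and the sample path are illustrative only and are not part of any tool.

```powershell
# Hypothetical helper: true if the extracted text contains anything
# other than whitespace (spaces, line breaks, form feeds).
function Test-HasText {
    param([string]$Text)
    return ($Text -match '\S')
}

# Hypothetical helper: extracts text with Xpdf's pdftotext ("-" sends the
# text to stdout) and reports whether the PDF appears to have a text layer.
function Test-PdfTextLayer {
    param([Parameter(Mandatory)][string]$Path)
    $text = (& pdftotext.exe $Path - 2>$null) -join "`n"
    return (Test-HasText -Text $text)
}

# Example call (assumed path):
# Test-PdfTextLayer -Path "C:\Demonstration\In\scan.pdf"
```

A result of False suggests the file is image-only. A result of True only means some text was found; a PDF with a text layer on just a few pages would also return True.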
Resolution
In general, this is best done at the source: configure the scanner to perform OCR on the document during the scanning process.
Prerequisites:
- Xpdf command line tools https://www.xpdfreader.com/download.html (choco install xpdf-utils)
- Ghostscript command line https://ghostscript.com/releases/gsdnld.html (choco install ghostscript)
- Tesseract command line https://tesseract-ocr.github.io/tessdoc/Downloads.html (choco install tesseract)
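The script in this article calls these tools by executable name, so they need to be resolvable from the PATH. A quick way to check is sketched below; the function name Get-MissingTool is a hypothetical helper, and the executable names are the ones the script uses (gswin64c.exe assumes the 64-bit Ghostscript build).

```powershell
# Hypothetical helper: returns the names from $Names that cannot be
# resolved as commands in the current session (for example, not on PATH).
function Get-MissingTool {
    param([string[]]$Names)
    $Names | Where-Object { -not (Get-Command -Name $_ -ErrorAction SilentlyContinue) }
}

# Executables used by the script in this article.
$missing = @(Get-MissingTool -Names 'pdfinfo.exe', 'gswin64c.exe', 'tesseract.exe')
if ($missing.Count -gt 0) {
    Write-Warning "Not found on PATH: $($missing -join ', ')"
} else {
    Write-Host "All required tools were found."
}
```

If a tool is reported as missing, either add its install folder to the PATH or replace the executable name in the script with its full path.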
If you already have PDFs and want to create searchable documents from them, try the following example script. Xpdf (for pdfinfo.exe), Ghostscript and Tesseract must be installed on the computer first; you will also need to add their paths to the PATH environment variable or amend the script to give the location of the executables.
# Define input and output folder paths
$inputFolder = "C:\Demonstration\In"
$outputFolder = "C:\Demonstration\Out"

# Ensure the output folder exists, create it if necessary
if (-not (Test-Path -Path $outputFolder -PathType Container)) {
    New-Item -Path $outputFolder -ItemType Directory | Out-Null
}

# Get all PDF files in the input folder
$pdfFiles = Get-ChildItem -Path $inputFolder -Filter *.pdf -File

# Loop through each PDF file
foreach ($pdfFile in $pdfFiles) {
    # Define input and output file paths
    $sourcePDF = $pdfFile.FullName
    $outputPDF = Join-Path -Path $outputFolder -ChildPath $pdfFile.Name

    # Get the total number of pages in the PDF (pdfinfo.exe is part of Xpdf)
    $totalPages = [int]((pdfinfo.exe $sourcePDF | Select-String '^Pages:').Line -replace '\D')
    if ($totalPages -lt 1) {
        Write-Warning "Could not read the page count of $($pdfFile.Name); skipping."
        continue
    }

    # Create a temporary folder for page images and single-page PDFs
    $tempFolder = Join-Path -Path $outputFolder -ChildPath "temp_$($pdfFile.BaseName)"
    if (-not (Test-Path -Path $tempFolder -PathType Container)) {
        New-Item -Path $tempFolder -ItemType Directory | Out-Null
    }

    # Loop through each page in the PDF
    for ($page = 1; $page -le $totalPages; $page++) {
        # Render the page to a 300 dpi PNG using Ghostscript
        $outputImage = Join-Path -Path $tempFolder -ChildPath "$page.png"
        gswin64c.exe -sDEVICE=png16m -r300 -o $outputImage "-sPageList=$page" $sourcePDF

        # Perform OCR using Tesseract on the image; "pdf" writes temp-output-<page>.pdf
        tesseract.exe $outputImage "$tempFolder\temp-output-$page" pdf
    }

    # Create a list of individual page PDF file paths
    $pagePDFs = 1..$totalPages | ForEach-Object { "$tempFolder\temp-output-$_.pdf" }

    # Merge all individual page PDFs into a single multipage searchable PDF using Ghostscript.
    # The call operator passes each path as its own argument, so paths with spaces are safe.
    & gswin64c.exe -sDEVICE=pdfwrite -dSAFER -o $outputPDF $pagePDFs

    # Clean up temporary folder
    Remove-Item -Path $tempFolder -Recurse -ErrorAction SilentlyContinue
    Write-Host "Processed $($pdfFile.Name)"
}
Write-Host "OCR process completed."