Monday, 20 January 2020

ocr - How to automatically find non-searchable PDFs


Suppose I have a directory full of PDFs. In most of them the text is fully searchable, which is how I need them to be. But a few of them are just image scans, and they need to be OCR-ed.


Other than simply doing a batch OCR on the entire directory, is there a way to quickly identify which PDFs are the image-only ones that actually need to be OCR-ed?


I'm not a programmer, but a Linux-friendly solution would be preferred.



Answer



I'm not sure if this is a 100% solution, but I came up with the following script, which should get you a good part of the way if not the whole way (I have not gone through the PDF spec). Pass it the directory that holds all the PDFs as its only argument (it will search subdirectories too).


#!/bin/bash

# Require exactly one argument: the directory to scan.
if [[ "$#" -ne 1 ]]
then
    echo "Usage: $0 /path/to/PDFDirectory"
    exit 1
fi

PDFDIRECTORY="$1"
READ_ERROR=0

# Read NUL-delimited paths so file names with spaces or newlines survive.
while IFS= read -r -d $'\0' FILE; do
    PDFFONTS_OUT="$(pdffonts "$FILE" 2>/dev/null)"
    RET_PDFFONTS="$?"
    if [[ "$RET_PDFFONTS" -ne 0 ]]
    then
        READ_ERROR=1
        echo "Error while reading $FILE. Skipping..."
        continue
    fi
    # pdffonts prints two header lines, then one line per font.
    FONTS="$(( $(echo "$PDFFONTS_OUT" | wc -l) - 2 ))"
    if [[ "$FONTS" -le 0 ]]
    then
        echo "NOT SEARCHABLE: $FILE"
    else
        echo "SEARCHABLE: $FILE"
    fi
# -iname matches both .pdf and .PDF.
done < <(find "$PDFDIRECTORY" -type f -iname '*.pdf' -print0)

echo "Done."
if [[ "$READ_ERROR" = "1" ]]
then
    echo "There were some errors."
fi

It works by counting the fonts specified in each PDF. If a file does not have any fonts, it is assumed to consist only of images. (This might trip up on password-protected files; I don't have any to test against.) If a file has some pages that are searchable and some that are images, this won't work, but it will probably be useful for separating scanned image documents in a PDF container from "real" PDFs.
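To make the font-counting step easier to see on its own, here is a minimal sketch of just that part. The function name count_fonts is made up for this illustration and is not part of the script above. It assumes, as the script does, that the first two lines of pdffonts output are the column header and the dashed separator, so every line after those is one font.

```shell
#!/bin/bash

# count_fonts (hypothetical helper): read pdffonts output on stdin and
# print how many font rows it contains.
# Assumption: the first two lines are the header and the separator.
count_fonts() {
    local lines
    lines=$(wc -l)
    if [ "$lines" -gt 2 ]; then
        echo $(( lines - 2 ))
    else
        # Header only (or no output at all): no fonts found.
        echo 0
    fi
}
```

You would use it as `pdffonts somefile.pdf | count_fonts`. A result of 0 means the file is treated as image-only.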


You can, of course, comment out the branch of the if-then-else statement which does not apply if you only want to print the files which are not searchable.
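If you would rather not edit the script, a stripped-down variant along these lines should print only the files that need OCR, one path per line. This is a rough sketch, not a tested replacement: the function name list_unsearchable is my own invention, and it uses the same "no fonts means image-only" assumption as the script above.

```shell
#!/bin/bash

# list_unsearchable (hypothetical helper): print the path of every PDF
# under the given directory for which pdffonts reports no fonts.
list_unsearchable() {
    local dir="$1" file out
    while IFS= read -r -d '' file; do
        # Silently skip files that pdffonts cannot read.
        out="$(pdffonts "$file" 2>/dev/null)" || continue
        # Two lines or fewer means header only, i.e. no fonts listed.
        if [[ "$(wc -l <<< "$out")" -le 2 ]]; then
            printf '%s\n' "$file"
        fi
    done < <(find "$dir" -type f -name '*.pdf' -print0)
}
```

Calling `list_unsearchable /path/to/PDFDirectory > needs-ocr.txt` would then leave you with a plain list of files to OCR. Unlike the full script, this version does not report files it failed to read.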

