During my work at KbLab, I built a tool that transforms raw PDF scans into structured NLP-ready datasets. This is useful for researchers working with print-first documents who want to curate structured data, and it can also serve as a pre-processing step for building a knowledge base in RAG projects.
Architecture
The pipeline consists of several components:
- Document Ingestion - Handles raw PDF scans and digitized documents from various sources
- OCR Processing - Uses Tesseract for text extraction and document segmentation
- Quality Validation - Applies confidence scores and OCR quality metrics to filter out poor extractions and highlight problematic sections of a page
- Data Structuring - Segments the OCR output into structured datasets using the Python libraries lxml and NLTK
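To make the quality-validation step concrete, here is a minimal sketch of filtering OCR tokens by confidence. It assumes input shaped like pytesseract's `image_to_data` output (parallel lists of tokens and per-token confidences); the function name, threshold, and sample data are illustrative, not taken from the actual project.

```python
def filter_by_confidence(ocr_data, min_conf=60.0):
    """Keep tokens whose OCR confidence meets the threshold; also
    return the indices of low-confidence tokens so problematic spans
    can be flagged for review. (Hypothetical helper, not project code.)"""
    kept, flagged = [], []
    for i, (token, conf) in enumerate(zip(ocr_data["text"], ocr_data["conf"])):
        if not token.strip():
            continue  # skip empty cells that Tesseract emits for layout
        if float(conf) >= min_conf:
            kept.append(token)
        else:
            flagged.append(i)
    return " ".join(kept), flagged

# Fabricated example mirroring Tesseract's dict output,
# where confidences arrive as strings
sample = {
    "text": ["The", "qu1ck", "brown", "fox"],
    "conf": ["96", "41", "88", "93"],
}
clean_text, low_conf_idx = filter_by_confidence(sample)
print(clean_text)     # "The brown fox"
print(low_conf_idx)   # [1]
```

A real pipeline would tune the threshold per document collection and pass the flagged indices downstream so reviewers can inspect the corresponding page regions rather than silently dropping them.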
Notes
This project was created for historical document curation, but it can be adapted to other document-processing workflows, such as pre-processing for RAG.