During my work at KbLab, I built a tool that transforms raw PDF scans into structured, ready-to-use datasets. It was initially designed for researchers working with print-first documents, but it can also serve as a pre-processing stage when curating a knowledge base for RAG applications.
Architecture
The workflow consists of four modules:
- Document Ingestion - Handles raw PDF scans and digitized documents from various sources
- OCR Processing - Uses Tesseract for text extraction and document segmentation
- Quality Validation - Uses confidence intervals and OCR quality metrics to filter out poor extractions and flag problematic sections of a page
- Data Structuring - Segments the OCR output into structured datasets using the Python libraries lxml and NLTK
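As an illustration of the Quality Validation step, here is a minimal sketch of one way a page could be accepted or rejected from per-word OCR confidences. It is not the tool's actual implementation: the `Word` class, the function names, the thresholds, and the 0-100 confidence scale are all assumptions made for this example, and the interval is a simple normal approximation around the mean word confidence.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for one OCR'd word."""
    text: str
    conf: float  # assumed OCR confidence on a 0-100 scale


def mean_confidence_interval(words, z=1.96):
    """Return (mean, lower, upper) of word confidence, normal approximation."""
    confs = [w.conf for w in words]
    n = len(confs)
    if n == 0:
        return 0.0, 0.0, 0.0
    mean = sum(confs) / n
    if n == 1:
        return mean, mean, mean
    # Sample variance, then the standard error of the mean.
    variance = sum((c - mean) ** 2 for c in confs) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    # Clamp to the valid confidence range.
    return mean, max(0.0, mean - half_width), min(100.0, mean + half_width)


def validate_page(words, min_lower_bound=60.0, min_word_conf=50.0):
    """Accept a page only if the interval's lower bound clears the threshold.

    Individual low-confidence words are listed so they can be reviewed.
    Both thresholds are illustrative defaults, not values from the tool.
    """
    mean, lower, upper = mean_confidence_interval(words)
    flagged = [w.text for w in words if w.conf < min_word_conf]
    return {
        "mean": mean,
        "interval": (lower, upper),
        "accepted": lower >= min_lower_bound,
        "flagged_words": flagged,
    }


if __name__ == "__main__":
    page = [Word("Stockholm", 96), Word("den", 91), Word("l2", 35), Word("mars", 88)]
    print(validate_page(page))
```

Using the lower bound rather than the mean makes the check stricter for short or noisy pages, where a few confident words could otherwise mask a poor extraction.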
Notes
This project was created for historical document curation, but it can be adapted to other document processing workflows, such as pre-processing for RAG.
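For the RAG case, one possible adaptation is to pack the structured text segments into size-limited chunks before indexing. The sketch below is an assumption about how that could look, not part of the tool: `chunk_segments` and the `max_chars` limit are hypothetical names chosen for this example.

```python
def chunk_segments(segments, max_chars=500):
    """Greedily pack consecutive text segments into chunks of at most max_chars.

    A single segment longer than max_chars is kept whole as its own chunk
    rather than being split mid-sentence.
    """
    chunks = []
    current = ""
    for seg in segments:
        seg = seg.strip()
        if not seg:
            continue  # skip empty segments left over from OCR
        candidate = f"{current} {seg}" if current else seg
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = seg
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    paragraphs = ["First paragraph of a page.", "", "Second paragraph.", "Third."]
    for chunk in chunk_segments(paragraphs, max_chars=45):
        print(repr(chunk))
```

Packing whole segments keeps paragraph boundaries intact, which tends to matter more for retrieval quality than hitting an exact chunk size.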