During my work at KbLab, I built a tool that transforms raw PDF scans into structured NLP-ready datasets. This is useful for researchers working with print-first documents who want to curate structured data, and it can also serve as a pre-processing step for building a knowledge base in RAG projects.
Architecture
The pipeline consists of several components:
- Document Ingestion - Handles raw PDF scans and digitized documents from various sources
- OCR Processing - Uses Tesseract for text extraction and document segmentation
- Quality Validation - Applies confidence scores and OCR quality metrics to filter out poor extractions and highlight problematic sections of a page
- Data Structuring - Segments the OCR output into structured datasets using the Python libraries lxml and NLTK
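To make the quality-validation step concrete, here is a minimal sketch of filtering OCR tokens by confidence. It assumes input shaped like pytesseract's `image_to_data` output (parallel lists of tokens and per-token confidences); the function name, threshold, and sample data are illustrative, not taken from the actual project.

```python
def filter_by_confidence(ocr_data, min_conf=60.0):
    """Keep tokens whose OCR confidence meets the threshold; also
    return the indices of low-confidence tokens so problematic spans
    can be flagged for review. (Hypothetical helper, not project code.)"""
    kept, flagged = [], []
    for i, (token, conf) in enumerate(zip(ocr_data["text"], ocr_data["conf"])):
        if not token.strip():
            continue  # skip empty cells that Tesseract emits for layout
        if float(conf) >= min_conf:
            kept.append(token)
        else:
            flagged.append(i)
    return " ".join(kept), flagged

# Fabricated example mirroring Tesseract's dict output,
# where confidences arrive as strings
sample = {
    "text": ["The", "qu1ck", "brown", "fox"],
    "conf": ["96", "41", "88", "93"],
}
clean_text, low_conf_idx = filter_by_confidence(sample)
print(clean_text)     # "The brown fox"
print(low_conf_idx)   # [1]
```

A real pipeline would tune the threshold per document collection and pass the flagged indices downstream so reviewers can inspect the corresponding page regions rather than silently dropping them.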
Notes
This project was created for historical document curation, but it can be adapted to other document-processing workflows, such as pre-processing for RAG.