Document Intelligence / OCR / Vision-Language Models
Self-Hosted Multi-Model OCR/VL Document Preprocessing Pipeline
A self-hosted OCR/VL preprocessing pipeline for turning messy technical documents into structured, retrieval-ready data for downstream AI systems.
Overview
I built the ingestion layer of this pipeline: parsing, cleaning, structuring, and preparing complex documents before downstream retrieval and AI workflows could use them.
This was not a simple PDF upload flow. The source files were technical documents where no two documents followed the same structure. Some pages were text-heavy, some were table-heavy, some depended on visual layout, and some required OCR/VL interpretation because ordinary PDF text extraction lost too much meaning.
I researched, tested, and evaluated multiple OCR and vision-language model approaches, then designed a self-hosted architecture that combined three different models because no single OCR/VL model handled every document type well enough on its own.
The work also involved dealing with real infrastructure issues: GPU setup, NVIDIA/CUDA compatibility, CUBLAS errors, VRAM limits, memory issues, model loading problems, and speed bottlenecks.
Problem
The source files were technical documents, not clean text documents.
No two documents followed the same structure. Some contained dense paragraphs, some relied heavily on tables, some had diagrams or visual sections, some were scanned, and some lost important meaning when processed through ordinary PDF text extraction.
A single OCR model could not handle all document types well. Plain text extraction lost structure from tables, layouts, and visual sections. Each OCR/VL model showed different strengths and weaknesses depending on the page type.
Large PDFs and image-heavy pages created speed, memory, and GPU bottlenecks. Self-hosting the models introduced CUDA, NVIDIA, CUBLAS, dependency, and VRAM issues.
The challenge was to build a preprocessing architecture that could handle heterogeneous technical documents and produce structured outputs despite inconsistent source formats.
Technical challenges I solved
No single OCR/VL model was good enough. I tested multiple OCR and vision-language model approaches and compared their outputs on different document types. Some models were better at plain text extraction, some handled layout better, some were better for tables, some were more useful for visual or image-heavy sections, some produced cleaner markdown-style output, and some were faster but less structurally accurate.
Because of that, I designed the pipeline as a multi-model OCR/VL system instead of depending on one model. The architecture combined three models so the system could use the strengths of each one and produce better structured document outputs. The goal was not simply to extract text. The goal was to extract usable technical structure.
The documents were highly inconsistent. Some pages were mostly text, some were dominated by tables, some had multi-column layouts, some included diagrams, image regions, or visual references, some were scanned or partially scanned, and some had repeated headers, footers, numbering, and formatting noise.
This meant the pipeline had to be flexible enough to handle different page types instead of treating every page as the same extraction problem. I designed the preprocessing flow around page-level processing and structured output generation so the system could preserve more meaning from the original documents.
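One piece of the formatting noise described above, repeated headers and footers, can be handled at page level. The following is a minimal sketch, not the pipeline's actual code: it assumes each page is already available as a list of text lines, and the function name, the `min_share` threshold, and the sample lines are all hypothetical.

```python
from collections import Counter


def strip_repeated_lines(
    pages: list[list[str]], min_share: float = 0.6, min_pages: int = 3
) -> list[list[str]]:
    """Drop lines that recur on at least `min_share` of pages (likely headers/footers)."""
    # With very few pages, "repeated" is not meaningful, so leave the input alone.
    if len(pages) < min_pages:
        return [list(lines) for lines in pages]
    # Count each distinct stripped line once per page.
    seen_on = Counter()
    for lines in pages:
        seen_on.update({line.strip() for line in lines if line.strip()})
    threshold = min_share * len(pages)
    noise = {text for text, count in seen_on.items() if count >= threshold}
    return [[line for line in lines if line.strip() not in noise] for lines in pages]


# Illustrative input: the first line of every page is a running header.
pages = [
    ["ACME Manual Rev 3", "1. Installation", "Mount the unit."],
    ["ACME Manual Rev 3", "2. Wiring", "Connect terminal A."],
    ["ACME Manual Rev 3", "3. Testing", "Run the self-test."],
]
cleaned = strip_repeated_lines(pages)
```

Exact-match counting like this would miss headers that embed a page number; normalizing digits before counting is one possible extension.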
Self-hosting created GPU and runtime problems. I worked through NVIDIA/CUDA setup, driver compatibility, CUDA version mismatches, CUBLAS errors, model loading problems, dependency issues, VRAM limits, and runtime instability.
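For the VRAM side of this, one common way to watch memory is to poll `nvidia-smi`. The sketch below is an assumption about how such a check could look, not a record of the tooling used in the project. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`, which report memory in MiB; the helper names are hypothetical.

```python
import subprocess

# One CSV line per GPU, e.g. "1024, 24576" (used MiB, total MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(output: str) -> list[tuple[int, int]]:
    """Parse nvidia-smi CSV output into (used_mib, total_mib) pairs, one per GPU."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        readings.append((used, total))
    return readings


def read_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and return the current readings (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)


def has_headroom(readings: list[tuple[int, int]], needed_mib: int) -> bool:
    """True if at least one GPU has `needed_mib` of free memory."""
    return any(total - used >= needed_mib for used, total in readings)
```

A check like `has_headroom(read_vram(), needed_mib=8000)` before loading a model would turn some out-of-memory crashes into an explicit "wait or skip" decision.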
Speed became a serious constraint. OCR/VL pipelines can be slow, especially when processing large technical PDFs page by page. Running multiple models improved quality, but it also multiplied the processing cost per page.
I worked on batching where possible, tuning image resolution and DPI, caching reusable prompt templates, monitoring GPU and VRAM usage, reducing unnecessary repeated processing, improving memory handling, and testing acceleration modules where relevant.
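Two of these measures, batching and prompt-template caching, can be sketched with the standard library alone. This is an illustration under assumptions: the template texts, the page-type keys, and the function names are hypothetical, and the real pipeline may have batched and cached differently.

```python
from collections.abc import Iterable, Iterator
from functools import lru_cache
from itertools import islice

# Hypothetical prompt templates, keyed by page type.
PROMPT_TEMPLATES = {
    "text": "Transcribe this page as {output_format}. Keep headings.",
    "table": "Extract every table on this page as {output_format}.",
}


@lru_cache(maxsize=None)
def render_prompt(page_type: str, output_format: str) -> str:
    """Render a template once per (page_type, output_format) and reuse the result."""
    return PROMPT_TEMPLATES[page_type].format(output_format=output_format)


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items so pages reach the GPU in groups."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# Usage: group page numbers into batches of four and reuse one rendered prompt.
for page_batch in batched(range(1, 11), size=4):
    prompt = render_prompt("table", "markdown")
```

The batch size is the knob that trades throughput against VRAM: larger batches keep the GPU busier but raise peak memory.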
The output had to be useful for downstream AI systems. Raw OCR text was not enough. The output needed to be structured enough to support chunking, metadata enrichment, embedding, retrieval, citations, and later evaluation.
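To make "structured enough to support chunking, metadata enrichment, and citations" concrete, here is a minimal sketch of paragraph-based chunking that attaches source and page metadata to every chunk. The function name, the ID scheme, and the `max_chars` default are assumptions for illustration, not the project's actual chunker.

```python
def chunk_page(text: str, source: str, page: int, max_chars: int = 800) -> list[dict]:
    """Group paragraphs into chunks of roughly `max_chars`, each with citation metadata.

    A single paragraph longer than `max_chars` is kept whole rather than split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"id": f"{source}#p{page}-c{i}", "source": source, "page": page, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]
```

Because each chunk carries its source file and page number, a retrieval hit can be cited back to the original page without a second lookup.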
Architecture and implementation
The architecture followed a multi-model preprocessing flow.
Raw technical documents were first converted into page-level inputs. Each page could then be processed through OCR/VL components depending on the type of content it contained: text-heavy pages, table-heavy pages, scanned pages, or visually structured pages.
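The routing idea can be sketched as a small classifier over cheap per-page statistics. Everything specific here is an assumption: the `PageStats` fields, the thresholds, and the three model names are hypothetical placeholders, since the actual models and routing rules are not named above.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    text_chars: int    # characters recovered by plain PDF text extraction
    table_cells: int   # cells reported by a table detector
    image_area: float  # share of the page covered by images, 0.0 to 1.0


def classify_page(stats: PageStats) -> str:
    """Assign one of four page types using illustrative thresholds."""
    if stats.text_chars < 50 and stats.image_area > 0.5:
        return "scanned"  # almost no embedded text, mostly image
    if stats.table_cells >= 20:
        return "table"
    if stats.image_area > 0.3:
        return "visual"
    return "text"


# Four page types mapped onto three hypothetical model slots.
ROUTES = {
    "text": "text_ocr_model",
    "scanned": "text_ocr_model",
    "table": "layout_model",
    "visual": "vision_language_model",
}


def route_page(stats: PageStats) -> str:
    """Return the name of the model slot that should process this page."""
    return ROUTES[classify_page(stats)]
```

Keeping the classifier separate from the route table means thresholds can be tuned against evaluation results without touching the model wiring.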
Because no single model performed best across every document type, the architecture combined three models into one preprocessing system.
Each model contributed where it was strongest, so the pipeline produced cleaner structured output than a single OCR path could.
The system generated structured markdown or JSON-style outputs that preserved more of the original document meaning: headings, sections, tables, visual descriptions, and page-level context.
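As an illustration of what a JSON-style page output could look like, the sketch below defines a page record with the elements listed above and renders a table as markdown. The schema and field names are assumptions, not the project's actual output format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Table:
    header: list[str]
    rows: list[list[str]]


@dataclass
class PageRecord:
    source: str  # original file name
    page: int    # 1-based page number, kept for page-level context
    headings: list[str] = field(default_factory=list)
    text: str = ""
    tables: list[Table] = field(default_factory=list)
    visual_descriptions: list[str] = field(default_factory=list)


def to_markdown_table(table: Table) -> str:
    """Render a table as pipe-delimited markdown with a separator row."""
    lines = [
        "| " + " | ".join(table.header) + " |",
        "| " + " | ".join("---" for _ in table.header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in table.rows]
    return "\n".join(lines)


# Illustrative record; the values are made up.
record = PageRecord(
    source="manual.pdf",
    page=12,
    headings=["4.2 Torque Settings"],
    tables=[Table(header=["Bolt", "Torque"], rows=[["M6", "10 Nm"]])],
    visual_descriptions=["Exploded view of the flange assembly."],
)
as_json = json.dumps(asdict(record), indent=2)
```

`asdict` recurses into nested dataclasses, so the same record serializes to JSON for storage and to markdown for model-friendly text.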
That output was then prepared for chunking, metadata enrichment, embedding, retrieval, and evaluation.
The infrastructure was self-hosted, so part of the architecture involved making the models run reliably on GPU hardware. I had to solve NVIDIA/CUDA compatibility problems, CUBLAS errors, model loading issues, VRAM limits, memory issues, and speed bottlenecks.
Speed was a major constraint, so I worked on acceleration strategies such as batching, DPI tuning, prompt-template caching, GPU/VRAM monitoring, memory handling, and acceleration module testing.
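DPI tuning comes down to simple arithmetic: PDF page sizes are given in points (1/72 inch), so pixel count grows with the square of the DPI. The sketch below shows that relationship and how a pixel budget translates into a DPI cap. The function names and the budget value are hypothetical.

```python
import math


def render_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi` (72 points per inch)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def max_dpi_for_budget(width_pt: float, height_pt: float, max_pixels: int) -> int:
    """Largest whole DPI whose rendered page stays within roughly `max_pixels`.

    pixels = (w / 72 * dpi) * (h / 72 * dpi), solved for dpi.
    """
    return int(math.sqrt(max_pixels * 72 * 72 / (width_pt * height_pt)))


# US Letter is 612 x 792 points (8.5 x 11 inches).
low = render_size(612, 792, 200)   # (1700, 2200)
high = render_size(612, 792, 300)  # (2550, 3300)
cap = max_dpi_for_budget(612, 792, max_pixels=4_000_000)
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, which is why a modest DPI reduction can noticeably cut inference time and memory on image-heavy pages, at the cost of small-text legibility.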
What I built
I built the OCR/VL preprocessing workflow for converting messy technical documents into structured AI-ready content.
The result was a preprocessing system designed for messy real-world technical documents, not clean demo PDFs. The work covered:
- Researching OCR/VL model options
- Testing different extraction approaches
- Evaluating output quality across different document types
- Designing a three-model OCR/VL architecture
- Self-hosting the OCR/VL stack
- Setting up GPU infrastructure
- Debugging NVIDIA/CUDA compatibility issues
- Resolving CUBLAS and model loading problems
- Handling VRAM and memory constraints
- Improving speed through batching, caching, DPI tuning, and acceleration testing
- Generating structured markdown or JSON-style outputs
- Handling tables, visual sections, scanned pages, and inconsistent layouts
- Preparing outputs for chunking, metadata, embedding, retrieval, and evaluation
System pieces
- Self-hosted OCR/VL stack
- Three-model OCR/VL architecture
- OCR model research and comparison
- Vision-language model testing
- Model output evaluation
- GPU VM setup
- NVIDIA/CUDA configuration
- CUDA compatibility debugging
- CUBLAS issue resolution
- Model loading debugging
- VRAM and memory management
- Memory issue investigation
- Batch OCR processing
- DPI tuning
- Prompt-template caching
- GPU/VRAM monitoring
- Acceleration module testing
- PDF preprocessing
- Page-level processing
- Visual document understanding
- Table extraction and handling
- Image/visual section description
- Structured markdown output
- JSON-style layout output
- Chunk preparation
- Metadata enrichment
- Embedding preparation
- Retrieval preparation
- Evaluation preparation
- Retries and failure handling
- Logging and debugging
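Retries and failure handling for a per-page model call can be sketched as a small wrapper with exponential backoff and logging. This is a generic pattern shown under assumptions, not the project's actual implementation; the function name and defaults are hypothetical.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("ocr_pipeline")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting. In practice it may be worth retrying only transient errors, since a page that fails deterministically will fail on every attempt.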
Why it was technically hard
This was technically hard because it combined three difficult problems at the same time.
First, the documents were inconsistent. There was no stable template, no predictable layout, and no single extraction method that worked everywhere.
Second, the model layer was imperfect. Every OCR/VL model had strengths and weaknesses, so I had to evaluate the tradeoffs and design a system that combined multiple models instead of trusting one.
Third, the infrastructure was heavy. Self-hosting OCR/VL models meant dealing with GPU setup, CUDA/NVIDIA problems, VRAM limits, memory behavior, speed bottlenecks, and runtime instability.
The system had to balance quality, speed, reliability, and output structure. That is what made it a real AI systems project instead of a basic document parser.
Why this matters
Most useful AI systems depend on the quality of their source layer.
If the documents are processed badly, the retrieval system will be bad. If the chunks are bad, the answers will be bad. If the structure is lost, the AI cannot reliably recover it later.
This project shows that I can build the infrastructure underneath serious AI systems: the part that turns messy technical documents into structured, searchable, retrieval-ready data.
It also shows that I can make practical architecture decisions under real constraints: model limitations, document inconsistency, GPU issues, speed constraints, and downstream retrieval requirements.