Advanced RAG / Hierarchical Retrieval / Source-Grounded Answers
Hybrid RAG System for Hierarchical Documents
A retrieval system for complex structured documents that preserved hierarchy, combined BM25 and vector search, ranked results with Reciprocal Rank Fusion (RRF), and returned source-grounded answers.
Overview
I built an AI system for answering questions from complex structured documents.
The hard part was retrieval. The documents were not flat text files. They had hierarchy: sections, subsections, nested references, headings, numbering, and context that only made sense if the structure was preserved.
A simple vector search was not reliable enough. The system had to understand where information came from, keep document hierarchy intact, retrieve exact terms when wording mattered, retrieve semantic matches when meaning mattered, and return answers with source references.
I worked on the full retrieval layer: document processing, Anthropic-style contextualized chunking, hierarchy extraction, metadata design, hybrid search, RRF ranking, source references, and evaluation.
The goal was simple: make the system answer from the right part of the document, not just from text that sounded related.
Problem
The source material was made of complex hierarchical documents.
The documents were not written like simple articles or blog posts. They had nested structure, repeated headings, numbered sections, cross-references, and context that depended on where a piece of text appeared inside the document.
If the documents were chunked naively, the system could separate a paragraph from the heading or parent section that gave it meaning. If hierarchy was ignored, retrieval could return a related sentence from the wrong part of the document.
If only vector search was used, the system could miss exact terms, names, IDs, section numbers, or domain-specific wording. If only keyword search was used, the system could miss semantically relevant content written in different words.
If metadata was too weak, the system could not reliably tell which document, section, or subsection a retrieved chunk belonged to. If answer generation received incomplete context, it could produce a fluent answer grounded in the wrong source.
The challenge was to build a retrieval system that preserved structure, combined exact and semantic search, and made answers traceable back to the correct source material.
Technical challenges I solved
The document hierarchy had to be extracted. The documents contained parent sections, child sections, headings, numbering, nested content, and internal references. Treating everything as plain text would destroy important context.
I worked on extracting and representing that hierarchy so the system could understand where each piece of content belonged. The goal was not just to split documents into chunks. The goal was to preserve the relationship between a chunk and its place in the original document structure.
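As a rough illustration of keeping each piece of text tied to its place in the structure, here is a minimal sketch. It assumes headings use dotted numbering such as "1.2 Scope"; the real documents may have used a different convention, and the names extract_blocks and Block are hypothetical. A body line that happens to start with a number would be misread as a heading, so a real parser would need stronger signals.

```python
import re
from dataclasses import dataclass

# Assumption: a heading is a dotted section number followed by a title, e.g. "1.2 Scope".
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Block:
    path: list[str]  # heading titles from the top-level section down to the nearest heading
    number: str      # section number of the nearest heading, "" if there is none
    text: str


def extract_blocks(lines):
    """Attach every body line to the heading path it appears under."""
    stack = []   # (depth, number, title) for each currently open heading
    blocks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            number, title = match.group(1), match.group(2)
            depth = number.count(".") + 1
            # A new heading closes any open heading at the same or a deeper level.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, number, title))
        else:
            blocks.append(Block(
                path=[title for _, _, title in stack],
                number=stack[-1][1] if stack else "",
                text=line,
            ))
    return blocks


doc = [
    "1 General",
    "Intro text.",
    "1.1 Scope",
    "Applies to all units.",
    "2 Definitions",
    "Terms used here.",
]
for block in extract_blocks(doc):
    print(block.number, " > ".join(block.path), "|", block.text)
```

The stack is what keeps a paragraph under "1.1 Scope" from being confused with a paragraph under a different section that happens to reuse the same heading title.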
Chunking needed context, not just text splitting. Bad chunking creates bad retrieval. If chunks are too small, they lose context. If chunks are too large, search becomes noisy. If chunks ignore hierarchy, the system may retrieve the right words from the wrong section.
I used Anthropic-style contextualized chunking so chunks carried additional context from the surrounding document. Instead of treating each chunk as an isolated text block, the system enriched chunks with context about where they came from and what they meant inside the larger document.
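The idea of a chunk carrying its own context can be sketched as follows. This is a simplified stand-in, not the pipeline described here: the contextual retrieval approach published by Anthropic has a language model write a short situating description for each chunk, while this sketch builds a deterministic location header from the hierarchy path and leaves the model call as an optional hook. The function name contextualize and the describe parameter are hypothetical.

```python
from typing import Callable, Optional, Sequence


def contextualize(
    chunk_text: str,
    doc_title: str,
    path: Sequence[str],
    describe: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Return the text to index: a short context header followed by the raw chunk."""
    location = " > ".join([doc_title, *path])
    header = f"[Location: {location}]"
    if describe is not None:
        # Hypothetical hook: a model-written sentence on what the chunk means
        # inside the larger document. Any callable (location, text) -> str works.
        header += f"\n[Context: {describe(location, chunk_text)}]"
    return f"{header}\n{chunk_text}"


indexed_text = contextualize(
    "Applies to all units.",
    doc_title="Manual",
    path=["General", "Scope"],
)
print(indexed_text)
```

Because the header is part of the indexed text, both keyword and embedding search can see where the chunk came from, which helps separate identical sentences that appear under different sections.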
Metadata had to carry meaning. Each chunk needed useful metadata: source document, section, subsection, hierarchy path, page or location information, and other fields needed to trace the answer back to the source.
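One possible shape for that metadata is sketched below. The field names are illustrative assumptions drawn from the list above, not the actual schema used in the project.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    # All field names are hypothetical; they mirror the fields named in the text.
    chunk_id: str
    document: str
    section: str
    subsection: Optional[str]
    hierarchy_path: tuple[str, ...]
    page: Optional[int] = None

    def citation(self) -> str:
        """Human-readable reference that can be shown next to an answer."""
        parts = [self.document, " > ".join(self.hierarchy_path)]
        if self.page is not None:
            parts.append(f"p. {self.page}")
        return ", ".join(parts)


meta = ChunkMetadata(
    chunk_id="c1",
    document="Manual",
    section="General",
    subsection="Scope",
    hierarchy_path=("General", "Scope"),
    page=12,
)
print(meta.citation())
print(asdict(meta))  # plain dict, ready to store alongside the chunk in an index
```

Keeping the record immutable makes it harder for a later pipeline stage to silently drop or overwrite the fields that traceability depends on.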
Vector search alone was not enough. Vector search is useful for semantic meaning, but it can miss exact wording. BM25 is useful for exact matching, but it can miss semantically relevant passages written in different words.
The retrieval layer combined BM25 and vector search. BM25 helped catch exact terms and domain-specific wording. Vector search helped catch semantic matches and paraphrased meaning. Reciprocal Rank Fusion merged both result sets into one ranked list.
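Reciprocal Rank Fusion itself is small enough to sketch in a few lines. Each chunk receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The value k = 60 is the constant commonly used for RRF; the value used in this system is not stated, so treat it as an assumption, along with the function name.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c5", "c3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['c1', 'c3', 'c5', 'c7']
```

RRF only looks at rank positions, so BM25 scores and vector similarities never need to be normalized onto a shared scale. Chunks that both retrievers rank highly rise to the top.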
The system needed source-grounded answers. Source references had to be preserved through document processing, chunking, retrieval, and final answer generation so answers could be checked.
Evaluation had to expose retrieval failures. I worked on evaluation workflows that tested whether the system retrieved the right source material and whether the final answer was supported by that material.
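A retrieval check of this kind can be sketched with two standard metrics: hit rate at k and mean reciprocal rank. The data layout below (one expected chunk id per question) and the function names are assumptions for illustration; the actual evaluation workflow is not described in that detail.

```python
def hit_rate_at_k(results, expected, k=5):
    """Fraction of questions whose expected chunk appears in the top k results."""
    if not results:
        return 0.0
    hits = sum(1 for q, ranked in results.items() if expected[q] in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results, expected):
    """Average of 1 / rank of the expected chunk, counting a miss as 0."""
    if not results:
        return 0.0
    total = 0.0
    for q, ranked in results.items():
        if expected[q] in ranked:
            total += 1.0 / (ranked.index(expected[q]) + 1)
    return total / len(results)


# Hypothetical evaluation set: question id -> ranked chunk ids returned by retrieval.
results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
expected = {"q1": "b", "q2": "w"}

print(hit_rate_at_k(results, expected, k=2))    # → 0.5
print(mean_reciprocal_rank(results, expected))  # → 0.25
```

Scoring retrieval separately from answer quality makes it possible to tell a retrieval miss apart from a generation error when a final answer is wrong.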
Architecture and implementation
The architecture started with document processing.
Raw documents were parsed and converted into structured content. The system extracted hierarchy, headings, sections, subsections, and source information. That structure was then used during chunking so each chunk kept its relationship to the original document.
The chunking layer used Anthropic-style contextualization. Each chunk was enriched with surrounding context so it remained understandable even after being separated from the full document.
The retrieval layer used hybrid search. BM25 retrieved chunks based on exact terms, names, numbers, and wording. Vector search retrieved chunks based on semantic meaning. RRF fused the BM25 and vector results into a single ranked list.
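To make the exact-term side concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is illustrative only: the project presumably used a search library or engine rather than hand-written scoring, and the tokenizer, the parameter values k1 = 1.5 and b = 0.75, and the function names are all assumptions. The tokenizer keeps dotted section numbers such as "4.2" as single tokens so they can match exactly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep dotted numbers like "4.2" together; drop trailing punctuation.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Rank chunk ids by BM25 score for the query; chunks with no match are omitted."""
    if not chunks:
        return []
    docs = {cid: tokenize(text) for cid, text in chunks.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for cid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[cid] = score
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


chunks = {
    "c1": "Section 4.2 covers valve inspection.",
    "c2": "General safety guidance for operators.",
    "c3": "Inspection intervals are listed in section 4.3.",
}
print(bm25_rank("section 4.2", chunks))
```

In this example the chunk that contains the literal token "4.2" ranks first, which is the kind of exact reference an embedding model may blur with neighbouring section numbers.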
The answer layer used the retrieved content and source references to generate a grounded response.
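One way to carry source references into generation is to number each retrieved chunk and instruct the model to cite by number. The sketch below only assembles the prompt text; the instruction wording, the citation format, and the function name are assumptions, and the call to a language model is deliberately left out.

```python
def build_grounded_prompt(question, retrieved):
    """Assemble a prompt from (citation, text) pairs given in ranked order."""
    sources = [
        f"[{i}] {citation}\n{text}"
        for i, (citation, text) in enumerate(retrieved, start=1)
    ]
    return (
        "Answer using only the sources below. Cite sources as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        + "\n\n".join(sources)
        + f"\n\nQuestion: {question}"
    )


prompt = build_grounded_prompt(
    "What is in scope?",
    [("Manual, General > Scope, p. 12", "Applies to all units.")],
)
print(prompt)
```

Because every numbered source keeps its document and hierarchy path, a citation such as [1] in the generated answer can be mapped back to the exact section it came from.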
The evaluation layer tested whether the system retrieved the right material and whether the answer matched the expected source-backed response.
The important architectural decision was to treat hierarchy and retrieval strategy as first-class parts of the system. The document structure had to survive ingestion, chunking, retrieval, and answer generation.
What I built
I built the retrieval system for complex hierarchical documents.
The result was a RAG system designed for structured, hierarchical documents where source accuracy mattered. The work included:
- Researching and testing retrieval approaches
- Processing unstructured documents into structured outputs
- Extracting document hierarchy
- Preserving headings, sections, subsections, and source paths
- Building Anthropic-style contextualized chunking
- Enriching chunks with surrounding document context
- Designing metadata for retrieval and traceability
- Preparing chunks for embeddings and search
- Implementing BM25 retrieval for exact-term matching
- Implementing vector search for semantic retrieval
- Using RRF to combine BM25 and vector search results
- Improving retrieval logic beyond basic vector lookup
- Connecting retrieved chunks back to source references
- Building source-grounded answer generation
- Testing answer quality and source correctness
- Debugging cases where retrieval returned related but incorrect sections
- Analyzing cases where BM25, vector search, or RRF ranked sources incorrectly
- Improving the pipeline based on evaluation results
- Documenting and diagramming the architecture
System pieces
- Document preprocessing
- Hierarchy extraction
- Heading and section parsing
- Subsection tracking
- Anthropic-style contextualized chunking
- Chunk context enrichment
- Metadata design
- Hierarchy path metadata
- Embedding preparation
- Vector retrieval
- BM25 retrieval
- Hybrid search
- Reciprocal Rank Fusion
- Metadata-aware retrieval
- Source-aware retrieval
- Source references
- Grounded answer generation
- Retrieval evaluation
- Answer evaluation
- Failure analysis
- Architecture diagrams
- Debugging and iteration
Why it was technically hard
This was technically hard because the documents were structured, but the structure was not simple.
The important information was not only in the text itself. It was also in the hierarchy around the text: the heading, the parent section, the subsection, the numbering, and the source path.
A basic RAG pipeline would flatten that structure and treat the document as a bag of chunks. That creates problems because the system can retrieve something that sounds related but belongs to the wrong part of the source material.
A vector-only approach was not enough. It could find meaning, but miss exact terms and references. A keyword-only approach was not enough either. It could find exact words, but miss semantically related answers.
The hard part was combining structure, contextualized chunks, metadata, BM25, vector search, RRF, source references, and evaluation into one retrieval system.
The system had to balance searchability, context, exact matching, semantic matching, source traceability, and answer quality.