OPTIMIZING PDF INGESTION FOR LARGE LANGUAGE MODELS IN RAG ARCHITECTURES


Rishab Bansal, Binita Mukesh Shah, Abhijit Chanda, Vrushali Parate

Abstract

The Portable Document Format (PDF) is widely used for enterprise information communication and archival, but its emphasis on visual fidelity presents major barriers to ingestion into Large Language Model (LLM)-based systems. High-quality data ingestion is critical for Retrieval-Augmented Generation (RAG) systems, which increasingly rely on unstructured organizational knowledge. Complex PDFs, featuring tables, figures, headers, footers, and intricate layouts, often suffer from context loss and semantic degradation during extraction, impairing RAG performance. This paper presents a survey of existing research on parsing such documents for LLM vectorization. It identifies a gap between the capabilities of current parsing techniques, often evaluated on simplified benchmarks, and the needs of real-world enterprise documents. Key challenges highlighted include layout interpretation, contextualization of tables and images, OCR noise reduction, and preservation of semantic relationships. The paper categorizes existing approaches into pipeline-based methods, holistic Vision-Language Models (VLMs), hybrid systems, and graph-based representations. Analysis of reported performance reveals persistent gaps between model accuracy and human-level understanding, especially in complex reasoning tasks, and highlights limitations in current benchmarks. Based on this review, the paper offers practical recommendations for engineers, emphasizing semantic chunking, layout-aware tool selection, multimodal strategies, and metadata enrichment. Future directions include improving multimodal model robustness, establishing realistic benchmarks, enhancing explainability, and ensuring semantic fidelity, i.e., the accurate capture and representation of a document's intended meaning and structure, in PDF ingestion pipelines for RAG systems.
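One of the abstract's practical recommendations, semantic chunking, can be illustrated with a minimal sketch: rather than splitting extracted PDF text at fixed character offsets, paragraphs are kept intact and merged into size-bounded chunks so each chunk remains a self-contained semantic unit. The function name and the size limit below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of semantic chunking for RAG ingestion. Paragraphs
# (separated by blank lines in the extracted text) are grouped into
# chunks whose length stays under max_chars, so no paragraph is cut
# mid-sentence. semantic_chunks and max_chars are illustrative names.

def semantic_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of at most roughly max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow.
        if current and current_len + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para) + 2  # +2 for the joining blank line
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A production pipeline would typically split on layout-derived boundaries (headings, table and figure regions) rather than blank lines alone, which is where the layout-aware tooling surveyed in the paper comes in.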
