SCALABLE REAL-TIME FEATURE ENGINEERING PIPELINES FOR LARGE LANGUAGE MODEL TRAINING: A DISTRIBUTED SYSTEMS APPROACH
Abstract
The rapid growth of large language models (LLMs) has created unprecedented demand for scalable preprocessing systems that can handle petabyte-scale, multilingual datasets in real time. Existing big data frameworks, originally developed for general machine learning applications, struggle with the specific challenges of LLM training, particularly deduplication, quality assessment, and tokenization at web scale. This paper presents a distributed systems framework tailored for real-time feature engineering in LLM workflows. The architecture employs a multi-tier microservices strategy, combining locality-sensitive hashing (LSH) for near-linear deduplication, an adaptive, machine-learning-based quality-scoring mechanism, and a parallelized tokenization system. Optimized MinHash-LSH parameters (128 hash functions, 16 bands) achieve 94.7% recall at similarity thresholds of 0.8 or higher, with a false-positive rate below 3%.
The quality-assessment component, which evaluates linguistic, semantic, structural, and domain-specific characteristics, achieves 85% accuracy and correlates strongly with human annotations (r = 0.782). When processing 450 TB of diverse text, the system delivers a 3.2× increase in throughput (50.2 million documents per hour), a 41% reduction in cost ($0.083 per million documents), and a 35% improvement in energy efficiency over Apache Spark, Flink, and commercial baselines, while maintaining sub-3-second P95 latency and 99.7% uptime over a six-month period. Evaluation with GPT-2 Medium shows a 12% reduction in perplexity, 35% faster convergence, and a +8.3-point improvement in GLUE score. The full implementation and reproducibility resources are released as open source to encourage validation and broader access. These results demonstrate how domain-specific, integrated systems can improve the efficiency, fairness, and sustainability of future LLM infrastructure.
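The MinHash-LSH configuration named in the abstract (128 hash functions grouped into 16 bands of 8 rows) can be sketched as follows. This is an illustrative, self-contained implementation for intuition only, not the paper's released code; the shingle size, the seeded-MD5 hash family, and all function names are assumptions of this sketch.

```python
# Minimal MinHash-LSH sketch of the paper's deduplication configuration:
# 128 hash functions split into 16 bands of 8 rows each. Illustrative only;
# the hash family (seeded MD5) and 5-gram shingling are assumptions.
import hashlib
from collections import defaultdict

NUM_HASHES = 128                    # hash functions per signature
NUM_BANDS = 16                      # LSH bands
ROWS = NUM_HASHES // NUM_BANDS      # 8 rows per band

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(doc):
    """128-slot MinHash signature: per seed, keep the minimum hash over shingles."""
    sh = shingles(doc)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(NUM_HASHES)
    ]

def lsh_candidates(signatures):
    """Band the signatures; documents colliding in any band become candidate duplicates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(NUM_BANDS):
            band_key = (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))
            buckets[band_key].add(doc_id)
    # Keep only buckets that actually pair documents together.
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

With this banding scheme, two documents whose signatures agree on a fraction s of slots collide in at least one band with probability 1 − (1 − s^8)^16, which rises sharply near s ≈ 0.8, matching the similarity threshold reported in the abstract.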