OPTIMIZING SPARK-BASED DATA PIPELINES THROUGH ADAPTIVE PARQUET FILE REDUCTION TECHNIQUES

Voolla Sandeep Kumar

doi:10.12732/ijam.v38i11s.1898

Published: Nov 26, 2025

Keywords:

Apache Spark; Parquet file compaction; Adaptive data layout optimization; Distributed file systems; Lakehouse architecture

Abstract

Large Spark-based data pipelines are increasingly limited by storage-layer inefficiencies rather than by computation. Among these inefficiencies, poor choices in file-size granularity and write parallelism, which manifest as small-file proliferation, imbalanced partitioning, and metadata bloat, have become major bottlenecks in distributed analytics applications. This article provides a technical review of adaptive Parquet file reduction methods aimed at improving Spark pipeline efficiency across heterogeneous workloads and storage environments. Peer-reviewed research on distributed file systems, columnar storage systems, workload characterization, and Lakehouse system evaluations was critically reviewed. Findings from the literature were compared to examine the relationships among file size, scheduling overhead, shuffle behaviour, compression trade-offs, and metadata scaling. The cross-study synthesis suggests that adaptive file consolidation generally decreases query latency when datasets are fragmented, particularly at moderate to high fragmentation levels, depending on workload characteristics and file distribution. The reviewed studies also show that metadata overhead and task scheduling costs increase with the number of files, whereas excessive file coalescing can limit parallelism and create executor imbalance. The analysis indicates that existing compaction methods rely mainly on static, non-predictive heuristics derived from distributed storage block sizes and do not adapt to workload characteristics. Adaptive query execution mechanisms improve runtime plan optimization but do not address persistent storage fragmentation. Major research gaps remain in integrated storage-execution co-optimization, multivariate performance modelling, streaming-aware compaction, and energy-efficient file layout management.
The results highlight that Parquet file granularity is an important structural determinant of performance in Spark ecosystems. Further development of adaptive file reduction frameworks grounded in statistical analysis is likely to be important for maintaining the scalability, efficiency, and cost-effectiveness of present-day distributed analytics systems.
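
To make the trade-off described in the abstract concrete, the following is a minimal sketch of the kind of static, block-size-derived heuristic the review characterizes, extended with a floor on the number of output files so that consolidation does not starve parallelism. It is not the article's method. The function name `plan_output_files`, its parameters, the 128 MiB target, and the example numbers are all assumptions chosen for illustration.

```python
import math

# Assumed target size per output file; 128 MiB is a common block-size-derived
# default, used here purely for illustration.
DEFAULT_TARGET_BYTES = 128 * 1024**2


def plan_output_files(total_bytes, current_files,
                      target_file_bytes=DEFAULT_TARGET_BYTES,
                      min_files=1):
    """Hypothetical helper: choose how many files to compact a dataset into.

    total_bytes       -- total size of the fragmented dataset
    current_files     -- number of files before compaction
    target_file_bytes -- desired size of each output file (assumption)
    min_files         -- floor that preserves read parallelism (assumption)
    """
    if total_bytes <= 0 or current_files <= 0:
        return 0
    # Size-driven estimate: enough files that each is near the target size.
    by_size = math.ceil(total_bytes / target_file_bytes)
    # Do not coalesce below the parallelism floor.
    planned = max(by_size, min_files)
    # Compaction only reduces the file count; never plan more than exist.
    return min(planned, current_files)


if __name__ == "__main__":
    # 10 GiB spread over 20,000 small files, with a floor of 16 output files.
    print(plan_output_files(10 * 1024**3, 20_000, min_files=16))
    # 1 GiB spread over 5,000 files: the parallelism floor dominates.
    print(plan_output_files(1024**3, 5_000, min_files=16))
```

In a Spark job, the returned count could be passed to `DataFrame.repartition(n)` before writing Parquet. An adaptive approach of the kind the review calls for would replace the fixed target and floor with values derived from observed workload characteristics.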

Issue

Vol. 38 No. 11s (2025)

Section

Articles
