OPTIMIZING LLM PERFORMANCE THROUGH CI/CD PIPELINES IN CLOUD-BASED ENVIRONMENTS
Abstract
Deploying large language models (LLMs) in cloud environments presents significant challenges, particularly their high computational demands, latency, memory consumption, and the lack of automated, reproducible workflows. As LLMs continue to scale and become integral to enterprise and research systems, the need for efficient, low-cost, reproducible deployment strategies has become critical. Traditional manual deployment methods often cause performance instability and hinder operational scalability. To address these issues, this study explores the integration of CI/CD (Continuous Integration/Continuous Deployment) pipelines within Python-based cloud environments as a lightweight approach to automating model benchmarking and inference tracking. Using the Open LLM Performance Benchmark dataset, which includes metrics such as model size, benchmark scores (e.g., ARC, MMLU, HellaSwag, TruthfulQA), latency, and memory usage, we evaluate a diverse set of public models, including DistilGPT-2, TinyLlama, GPT-Neo-125M, Falcon-rw-1b, and others. All experiments are conducted in Google Colab to simulate low-infrastructure environments. The proposed CI/CD workflow incorporates automated prompt generation, inference execution, latency and memory profiling, and structured logging. In addition, version control is simulated using DVC-style file hashes, and experiments are tracked with MLflow. Key findings highlight a clear trade-off between model size, performance, and cost: smaller models such as Tiny-GPT2 deliver superior latency but lower benchmark scores, whereas larger models such as Falcon-rw-1b yield higher accuracy at the expense of increased memory use and inference time. The CI/CD pipeline improved reproducibility, execution traceability, and scalability. These results underscore the potential of lightweight CI/CD frameworks to streamline LLM deployment for teams operating under resource constraints.
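The latency and memory profiling step of such a pipeline can be sketched in standard-library Python as follows. This is a minimal illustration, not the study's actual code: `run_inference` is a hypothetical stand-in for a real model call (e.g., a Hugging Face `generate` invocation), and `tracemalloc` measures Python-level allocations only, not GPU memory.

```python
import time
import tracemalloc

def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for an actual model inference call.
    return prompt.upper()

def profile_inference(prompt: str) -> dict:
    """Run one inference and record latency (seconds) and peak traced memory (bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    output = run_inference(prompt)
    latency = time.perf_counter() - start
    _, peak_mem = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"output": output, "latency_s": latency, "peak_mem_bytes": peak_mem}

record = profile_inference("hello world")
```

In a CI/CD run, a record like this would be emitted per model and per prompt, then appended to a structured log for comparison across commits.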
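The "DVC-style file hashes" mentioned above can be simulated with a content hash per artifact, as sketched below. The file name and metadata fields are illustrative assumptions, not taken from the study; the point is only that hashing a run's output file makes the experiment record verifiable and reproducible.

```python
import hashlib
import json

def file_hash(path: str) -> str:
    """Return an MD5 content hash, similar to the hashes DVC stores for tracked files."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative example: write a results log, then record its hash in a manifest.
with open("results.json", "w") as f:
    json.dump({"model": "distilgpt2", "latency_s": 0.42}, f)

manifest = {"file": "results.json", "md5": file_hash("results.json")}
```

A tool like MLflow would then log this manifest alongside the run's metrics, so any later rerun can be checked against the original artifact hash.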