ENHANCING MACHINE LEARNING PERFORMANCE THROUGH STATISTICAL FEATURE SELECTION IN HIGH-DIMENSIONAL GENOMIC AND FINANCIAL DATA

Esmail Hasan Abdullatif Al-Sabri, Aaqil Abbas Shah, Kanwal Iqbal, Memoona Liaqat, Faiza Sami, Aseel Smerat

Abstract

High-dimensional datasets pose considerable problems for machine learning models, particularly for interpretability, computational speed, and predictive performance. This paper explores how statistical feature selection methods, namely LASSO, Ridge Regression, Elastic Net, and Mutual Information, can improve the performance and transparency of machine learning algorithms applied to genomic and financial data. We compare Random Forest, Support Vector Machine, and Neural Network models on gene expression profiles from The Cancer Genome Atlas (TCGA) and credit scoring data from the Home Credit Default Risk dataset, using several metrics: accuracy, F1-score, SHAP-based interpretability, and resource usage. The findings indicate that Elastic Net consistently outperforms the other approaches in handling correlated features and in balancing sparsity against stability, whereas Mutual Information is effective at revealing non-linear relationships. Feature selection improves model generalization while reducing training time and memory use by up to 40% and 30%, respectively, making machine learning pipelines more interpretable and scalable. These results highlight the importance of statistical rigor in high-dimensional machine learning workflows for achieving robust and explainable AI.
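The two behaviours highlighted in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use the TCGA or Home Credit data: the dataset is synthetic, the penalty settings (`alpha`, `l1_ratio`), the bin count, and all function names are assumptions chosen for illustration, and the solver is a plain coordinate-descent Elastic Net written with the Python standard library only. Features 0 and 1 are near-duplicates, feature 2 is independently informative, and features 3 to 5 are pure noise. The second part estimates mutual information by simple binning to show that a purely quadratic dependence is detected where an unrelated variable is not.

```python
import math
import random
from collections import Counter


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, sweeps=200):
    """Coordinate descent for the objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    Assumption: no intercept, so columns and target should be roughly centred."""
    n = len(X)
    columns = list(zip(*X))
    z = [sum(v * v for v in col) / n for col in columns]
    l1, l2 = alpha * l1_ratio, alpha * (1.0 - l1_ratio)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, l1) / (z[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                w[j] = new_w
    return w


def binned_mutual_information(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) after equal-width binning."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


# Synthetic data (hypothetical): 400 samples, 6 features.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    x0 = rng.gauss(0, 1)
    x1 = x0 + 0.05 * rng.gauss(0, 1)            # near-duplicate of x0
    x2 = rng.gauss(0, 1)                         # independent, informative
    noise = [rng.gauss(0, 1) for _ in range(3)]  # uninformative features
    X.append([x0, x1, x2] + noise)
    y.append(1.5 * x0 + 1.5 * x1 + 2.0 * x2 + 0.1 * rng.gauss(0, 1))

coef = elastic_net(X, y)
selected = [j for j, c in enumerate(coef) if c != 0.0]
print("coefficients:", [round(c, 3) for c in coef])
print("selected feature indices:", selected)

# Mutual information: quadratic dependence versus an unrelated feature.
x = [row[0] for row in X]
mi_nonlinear = binned_mutual_information(x, [v * v for v in x])
mi_noise = binned_mutual_information(x, [row[5] for row in X])
print("MI(x, x^2):", round(mi_nonlinear, 3), " MI(x, noise):", round(mi_noise, 3))
```

In this setup the L2 part of the penalty tends to spread weight across the two correlated features rather than arbitrarily dropping one, while the L1 part shrinks the noise coefficients toward zero. The binned estimator is a crude stand-in for the more careful mutual information estimators a real study would use, and its values depend on the bin count.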