IMPROVING THE STABILITY OF THE ADAMW OPTIMISER ON EXTREMELY SCARCE E-LEARNING DATA VIA A MUTUAL-INFORMATION-BASED ADAPTIVE MODIFIER
Abstract
Deep-learning models trained on MOOC log data suffer from high instability and poor generalisation because the available training samples are scarce and heavily class-imbalanced. AdamW is the de facto optimiser in this setting, yet training is extremely sensitive to its hyper-parameters (β1, β2) when the batch size is small. We propose MILM-AdamW, an adaptive variant that re-scales β1 and the effective learning rate on the fly using an estimate of the mutual information I(X; Y | θ_t) between the current mini-batch inputs and labels. A lightweight MINE (Mutual Information Neural Estimation) network with 64 hidden neurons is trained alongside the main model and supplies the estimate I_t every tenth step. Extensive experiments on three public educational datasets (OULAD, KDD15, EdNet) under 5 %, 10 % and 20 % sampling scenarios show that MILM-AdamW raises average AUC by 3.8 percentage points, cuts the AUC standard deviation by 32 %, and reduces wall-clock convergence time by 13 %, with no additional model parameters or GPU memory.
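The abstract leaves the update rule implicit, so the sketch below shows one plausible way the pieces fit together in PyTorch. The Donsker-Varadhan MINE bound and the 64-unit statistics network follow the abstract; the sigmoid mapping from the MI estimate to (β1, lr) and the names MineStatistics, apply_mi_modifier, and kappa are illustrative assumptions, not the authors' published formulation.

```python
import math

import torch
import torch.nn as nn


class MineStatistics(nn.Module):
    """Statistics network T(x, y) for the Donsker-Varadhan bound used by MINE:
    I(X; Y) >= E_joint[T] - log E_marginal[exp(T)].
    The single 64-unit hidden layer mirrors the abstract's "lightweight
    MINE network with 64 hidden neurons"."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1))


def mi_lower_bound(t_net: MineStatistics, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """One-batch MI estimate; marginal samples come from shuffling y within
    the mini-batch. log E[exp(T)] is computed stably as logsumexp(T) - log n."""
    joint = t_net(x, y).mean()
    y_marginal = y[torch.randperm(y.size(0))]
    marginal = torch.logsumexp(t_net(x, y_marginal), dim=0).squeeze() - math.log(y.size(0))
    return joint - marginal


def apply_mi_modifier(opt: torch.optim.AdamW, mi_value: float,
                      beta1_base: float = 0.9, lr_base: float = 1e-3,
                      kappa: float = 1.0) -> None:
    """Hypothetical re-scaling rule (assumption): a bounded sigmoid multiplier
    damps both beta1 and the effective learning rate when the current batch
    carries little mutual information; kappa is an assumed sensitivity knob."""
    s = 1.0 / (1.0 + math.exp(-kappa * mi_value))  # multiplier in (0, 1)
    for group in opt.param_groups:
        _, beta2 = group["betas"]
        group["betas"] = (beta1_base * s, beta2)
        group["lr"] = lr_base * s


def train(model: nn.Module, loader, x_dim: int, y_dim: int) -> None:
    """Training-loop sketch: the MINE estimate is refreshed every tenth step,
    as the abstract states. `model` and `loader` (yielding float tensors
    x: (n, x_dim), y: (n, y_dim)) are assumed to exist."""
    t_net = MineStatistics(x_dim, y_dim)
    mine_opt = torch.optim.Adam(t_net.parameters(), lr=1e-4)
    main_opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    mi_value = 0.0
    for step, (x, y) in enumerate(loader):
        if step % 10 == 0:                      # refresh I_t every tenth step
            mine_opt.zero_grad()
            neg_mi = -mi_lower_bound(t_net, x, y)
            neg_mi.backward()
            mine_opt.step()
            mi_value = -neg_mi.item()
        apply_mi_modifier(main_opt, mi_value)   # re-scale beta1 and lr on the fly
        main_opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()
        main_opt.step()
```

Because the statistics network touches only the mini-batch tensors and the optimiser's param_groups, this construction adds no parameters or activations to the main model itself, consistent with the abstract's claim of zero extra model parameters or GPU memory.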