Student Performance Prediction

Open University Student Performance Imbalanced Classification

Project Summary

During the second semester of my Master’s degree, I chose imbalanced classification of university student performance as the topic of my data science final project, implemented in Python. I came across the Open University Learning Analytics dataset while browsing Kaggle and decided to use it. Beyond applying the standard data science workflow from data cleaning to modeling, the primary objectives of the project were:

To identify, using statistical methods, the student features that significantly affect final result status.
To apply sampling techniques when building predictive models and compare the prediction results of the different methods.
To examine the potential impact of dimensionality reduction on model performance.

Results

Two algorithms were used, the XGBoost classifier and the CatBoost classifier. Both oversampling and SMOTE were applied to the training data; oversampling outperformed SMOTE, and the models with the highest fail-class recall were all trained on oversampled datasets. In the results table, the highest accuracies come from models trained on the original imbalanced dataset, but fail-class recall is the more important metric here, and the oversampled models produced the highest recall values: 81.34% for XGBoost and 84.91% for CatBoost. Dimensionality reduction did not improve model performance; it significantly worsened fail-class recall.
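To make the sampling step concrete, here is a minimal sketch of oversampling, assuming it means random oversampling (duplicating minority-class rows until the classes are balanced). The function name random_oversample and the toy data are hypothetical and only for illustration; this is not the project's actual code, and it uses only the Python standard library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority count
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (made up): 8 "pass" rows and 2 "fail" rows
rows = [[i] for i in range(10)]
labels = ["pass"] * 8 + ["fail"] * 2

X_res, y_res = random_oversample(rows, labels)
balanced_counts = Counter(y_res)
print(balanced_counts)
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution and the reported recall reflects real conditions.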

Analysis of Best Model Results

The confusion matrix for CatBoost with oversampling gives an accuracy of 87.97%, which is lower than the 89.68% of CatBoost trained on the original imbalanced dataset. Fail-class recall, however, differs by 8.45 percentage points: CatBoost trained on the imbalanced data reached only 76.46%, against 84.91% with oversampling. This indicates that sampling techniques matter a great deal when dealing with imbalanced data. With these findings, the objectives of the project were achieved.
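The trade-off between accuracy and fail-class recall can be read straight off a confusion matrix. The sketch below uses made-up counts, not the project's real confusion matrices, and treats "fail" as the positive class; the helper names are hypothetical. The last two lines recompute the gaps from the percentages reported above.

```python
def accuracy(tp, fn, fp, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)


def recall(tp, fn):
    """Share of actual positives (here: failing students) that are caught."""
    return tp / (tp + fn)


# Hypothetical counts for 1000 students, 200 of whom fail.
# "imbalanced" stands for a model trained on the original data,
# "oversampled" for a model trained on the oversampled data.
imbalanced = {"tp": 150, "fn": 50, "fp": 30, "tn": 770}
oversampled = {"tp": 170, "fn": 30, "fp": 70, "tn": 730}

acc_imb = accuracy(**imbalanced)
acc_over = accuracy(**oversampled)
rec_imb = recall(imbalanced["tp"], imbalanced["fn"])
rec_over = recall(oversampled["tp"], oversampled["fn"])

print(f"imbalanced:  accuracy={acc_imb:.2%}, fail recall={rec_imb:.2%}")
print(f"oversampled: accuracy={acc_over:.2%}, fail recall={rec_over:.2%}")

# Gaps computed from the figures reported in this write-up (percentage points)
recall_gap = round(84.91 - 76.46, 2)
accuracy_gap = round(89.68 - 87.97, 2)
print(recall_gap, accuracy_gap)
```

In this toy case the oversampled model loses a little accuracy by flagging more passing students as at risk, but it catches more of the students who actually fail, which is the same pattern the reported results show.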

Dataset source: Kaggle dataset link