Recruitment Scam Prediction

Employment and Recruitment Scam Prediction

Project Summary

This is my second personal project. It uses the popular Employment Scam Aegean Dataset (EMSCAD), available on Kaggle, to predict whether a given job post is real or fraudulent. The work followed the usual data science steps: data cleaning, exploratory data analysis (EDA), text mining, machine learning, and deep learning. All code was written in Python. The objectives of this project are:

To experiment with the effects of various sampling techniques on the performance of machine learning models when classifying a highly imbalanced dataset (95/5, with fraud as the minority class).
To compare the performance of models trained on structured and unstructured data.

Results

Among the structured features in the cleaned dataset, the chi-square scores graph showed that most categorical variables had some degree of association with the fraudulent target variable. The exceptions were telecommuting and required_experience, whose scores were close to zero.
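To make the chi-square scores above more concrete, here is a minimal sketch of a Pearson chi-square statistic computed from a contingency table of one categorical feature against the target. It is an illustration only: the project may have used a library routine instead, the function name chi_square_score is hypothetical, and the toy values below are made up rather than taken from EMSCAD.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Pearson chi-square statistic for a categorical feature vs. a categorical target.

    Hypothetical helper for illustration; not the project's own code.
    """
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts per (feature value, class)
    feature_totals = Counter(feature)          # row totals
    target_totals = Counter(target)            # column totals

    score = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count expected in this cell if feature and target were independent.
            expected = f_count * t_count / n
            score += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return score


# Toy data, invented for this example (2 fraudulent posts, 6 real ones).
fraudulent = [1, 1, 0, 0, 0, 0, 0, 0]
toy_features = {
    # Separates the two classes perfectly, so the score is high.
    "has_company_logo": [0, 0, 1, 1, 1, 1, 1, 1],
    # Same split of values in both classes, so the score is zero.
    "telecommuting": [0, 1, 0, 1, 0, 1, 0, 1],
}

scores = {name: chi_square_score(values, fraudulent)
          for name, values in toy_features.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

A score near zero means the feature's values are spread across real and fraudulent posts in almost the same proportions, which is why such a feature adds little for separating the two classes.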

Oversampling generally produced the highest recall for the minority fraud class, compared with both the original dataset and the dataset with SMOTE applied. The combination of a TF-IDF vectorizer with oversampling gave the best fraud-class recall, at 88.67%. Among all oversampled models, the model trained with the TF-IDF vectorizer also yielded the highest accuracy, at 98.95%. One interesting finding is that using mixed data with a bidirectional LSTM neural network produced worse results than using unstructured text alone.
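As a rough illustration of the oversampling idea and of the fraud-class recall metric reported above, the sketch below duplicates minority-class rows at random until both classes are the same size. It assumes plain random oversampling, which may differ from the implementation used in the project; the names random_oversample and recall are hypothetical, the data is a made-up 95/5 toy split, and the TF-IDF and model training steps are not shown.

```python
import random


def random_oversample(texts, labels, minority_label=1, seed=0):
    """Duplicate minority-class rows at random until the classes are balanced.

    Hypothetical helper for illustration; apply it to the training split only,
    so that duplicated rows never leak into the evaluation data.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority-class rows to oversample")

    # Draw (with replacement) as many extra minority rows as needed.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    order = majority + minority + extra
    rng.shuffle(order)
    return [texts[i] for i in order], [labels[i] for i in order]


def recall(y_true, y_pred, positive=1):
    """Share of truly positive (fraudulent) posts that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Made-up training split with a 95/5 imbalance (0 = real, 1 = fraudulent).
train_labels = [0] * 95 + [1] * 5
train_texts = [f"job post {i}" for i in range(100)]

balanced_texts, balanced_labels = random_oversample(train_texts, train_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Recall for the fraud class is the natural metric to watch here, since a model can reach about 95% accuracy on such a split by predicting every post as real while catching no fraud at all.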

Model deployment

An interactive web application was built around the model trained on structured data, and the overall quality of the resulting data product was good. Using the Gradio library, a simple app that detects fraudulent job posts was developed and deployed successfully on Heroku PaaS. Feel free to have a look and try it out yourself with any sample recruitment post you find online!

Deployed web app: Gradio app on Heroku

Dataset source: Kaggle dataset link