Background: Fraud is a significant problem for any bank. It can take many forms and involves a large number of variables.
Data: This project is based on a credit card transaction dataset from a Canadian bank containing about 800,000 records and 29 features.
Objective: The goal is to build a predictive model to determine whether a given transaction will be fraudulent or not.
Since fraud detection is a binary classification problem, I applied several classic supervised classification models, such as logistic regression, XGBoost, and LightGBM. The primary evaluation metric is ROC AUC.
Original dataset:
Since this dataset is in line-delimited JSON format, I first transformed it into a dataframe.
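A minimal sketch of that load step with `pandas.read_json(lines=True)`; the two-record sample and its column names here are illustrative, not the real file:

```python
import io
import pandas as pd

# Stand-in for the real file: each line is one JSON transaction record.
# In practice, pass the dataset's file path instead of this StringIO buffer.
sample = io.StringIO(
    '{"transactionAmount": 98.55, "isFraud": false}\n'
    '{"transactionAmount": 74.10, "isFraud": true}\n'
)
df = pd.read_json(sample, lines=True)  # lines=True parses line-delimited JSON
print(df.shape)
```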
I used my domain knowledge in banking to select a few key features and conducted descriptive analysis. Take the column "posEntryMode" for example: among all the fraudulent transactions,
Among the non-fraudulent transactions,
From the above two plots, it appears that "posEntryMode" had an influence on whether a transaction was fraudulent, so this feature was added to my model.
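The comparison behind those two plots can be sketched with `value_counts(normalize=True)`; the toy dataframe below stands in for the real transaction data:

```python
import pandas as pd

# Toy data standing in for the transaction dataframe.
df = pd.DataFrame({
    "posEntryMode": [2, 2, 5, 9, 9, 9, 5, 2],
    "isFraud":      [1, 1, 0, 0, 1, 0, 0, 0],
})

# Share of each posEntryMode value within fraudulent vs. legitimate rows.
fraud_dist = df.loc[df["isFraud"] == 1, "posEntryMode"].value_counts(normalize=True)
legit_dist = df.loc[df["isFraud"] == 0, "posEntryMode"].value_counts(normalize=True)
print(fraud_dist)
print(legit_dist)
```

Plotting each distribution (e.g. with `fraud_dist.plot.bar()`) reproduces the two charts above.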
Case 1: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent'.
Case 2: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent', 'merchantCategoryCode'.
Case 3: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent', 'transactionType'.
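The three cases share a common base, so they can be expressed compactly as lists (names taken directly from the cases above):

```python
# Base features shared by all three cases.
base_features = ['cvvNotSame', 'amountOver', 'posEM_new', 'hour',
                 'transactionAmount', 'availableMoney', 'cardPresent']

feature_sets = {
    "case1": base_features,
    "case2": base_features + ['merchantCategoryCode'],
    "case3": base_features + ['transactionType'],
}
print({k: len(v) for k, v in feature_sets.items()})  # {'case1': 7, 'case2': 8, 'case3': 8}
```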
Implemented the train-test split, cross-validation, model fitting, and ROC curve visualization.
Used MinMax scaling before applying the model.
Comparing the ROC AUC before and after scaling the dataframe shows that scaling increased the metric, so I used the scaled dataframe in the subsequent modeling experiments.
In addition to the previous procedures, I used grid search for hyperparameter tuning (learning rate, number of estimators), selected the optimal parameter combination, and plotted the ROC curve.
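That grid search can be sketched as below. `GradientBoostingClassifier` is used as a stand-in so the snippet runs without xgboost installed; with xgboost available, substitute `xgboost.XGBClassifier` and keep the same grid. The grid values themselves are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled feature matrix.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [50, 100],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select parameters by ROC AUC, matching the primary metric
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```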
Similar to the XGBoost modeling, I used grid search to tune the LightGBM hyperparameters (learning rate, number of estimators, max_depth), chose the optimal parameter combination, and plotted the ROC curve.
After comparing the ROC AUC of the three models across the three cases, both XGBoost and LightGBM with feature combination case 2 achieved the highest ROC AUC of 0.76, a 7% increase over the baseline. However, a disadvantage of XGBoost was its relatively slow running time, so LightGBM with case 2 was the optimal model. To investigate this model further, I produced the following feature importance plot. It is clear that "transactionAmount", "availableMoney", and "posEM_new" were the key factors determining whether a transaction was fraudulent.
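A feature-importance plot like the one above can be produced from the fitted model's `feature_importances_` attribute (which lightgbm's `LGBMClassifier` also exposes); the sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for the tuned model:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Case-2 feature names from the text; synthetic data stands in for the real columns.
feature_names = ['cvvNotSame', 'amountOver', 'posEM_new', 'hour',
                 'transactionAmount', 'availableMoney', 'cardPresent',
                 'merchantCategoryCode']
X, y = make_classification(n_samples=500, n_features=len(feature_names),
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Normalized importances, sorted so the most important feature plots on top.
importances = pd.Series(model.feature_importances_,
                        index=feature_names).sort_values()
print(importances)
# importances.plot.barh()  # the feature-importance bar chart
```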
In future work, I will experiment with more feature combinations and apply additional machine learning models with hyperparameter tuning.