QiangUM/Bank_Fraud_Detection

Fraud Detection with Bank Transaction Data

Introduction

Background: Fraud is a significant problem for any bank. It can take many forms and involves a large number of variables.

Data: This project is based on a credit card transaction dataset from a bank in Canada, containing about 800,000 records and 29 features.

Objective: The goal is to build a predictive model to determine whether a given transaction will be fraudulent or not.

Methodology

Since fraud detection is a binary classification problem, I applied several classic supervised classification models: logistic regression, XGBoost, and LightGBM. The primary evaluation metric is ROC_AUC.

Data Preparation

Original dataset:

(screenshot: the original dataset in line-delimited JSON format)

Since this dataset is in line-delimited JSON format, I first transformed it into a dataframe.

(screenshot: the transformed dataframe)
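The loading step can be sketched as follows. This is a minimal, illustrative example: the two inline records and their column names are made up to stand in for the real 800,000-row file, which would be read the same way by passing its path to `pd.read_json`.

```python
import io

import pandas as pd

# Two sample records in line-delimited JSON, mimicking the transaction file;
# the real dataset has ~800,000 records and 29 features.
raw = (
    '{"transactionAmount": 98.55, "cardPresent": true, "isFraud": false}\n'
    '{"transactionAmount": 74.10, "cardPresent": false, "isFraud": true}\n'
)

# lines=True tells pandas that each line is a separate JSON object.
df = pd.read_json(io.StringIO(raw), lines=True)
print(df.shape)  # (2, 3)
```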

EDA

I utilized my domain knowledge in banking to select a few key features and conducted descriptive analysis. Take the column "posEntryMode" for example. Among the fraudulent transactions:

(plot: distribution of posEntryMode for fraudulent transactions)

Among the non-fraudulent transactions:

(plot: distribution of posEntryMode for non-fraudulent transactions)

From the above 2 plots, "posEntryMode" appeared to have an influence on whether a transaction was fraudulent, so this feature was added to my model.
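This kind of comparison can be sketched with a groupby. The toy dataframe below is illustrative (the entry-mode codes and labels are invented); the real analysis runs the same aggregation on the full dataframe.

```python
import pandas as pd

# Toy transactions; in the project this is the full prepared dataframe.
df = pd.DataFrame({
    "posEntryMode": [2, 2, 5, 9, 9, 9, 80, 5],
    "isFraud":      [0, 0, 0, 1, 1, 0, 1, 0],
})

# Fraud rate per POS entry mode: large differences across modes suggest
# the column carries predictive signal.
fraud_rate = df.groupby("posEntryMode")["isFraud"].mean()
print(fraud_rate)
```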

Feature engineering (3 cases)

Case 1: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent'.

Case 2: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent', 'merchantCategoryCode'.

Case 3: explanatory variables include 'cvvNotSame', 'amountOver', 'posEM_new', 'hour', 'transactionAmount', 'availableMoney', 'cardPresent', 'transactionType'.
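The three feature cases above can be captured in a small lookup, using the column names exactly as listed:

```python
# Shared base features from Case 1.
base = ['cvvNotSame', 'amountOver', 'posEM_new', 'hour',
        'transactionAmount', 'availableMoney', 'cardPresent']

# The three candidate feature sets described above.
cases = {
    "case1": base,
    "case2": base + ['merchantCategoryCode'],
    "case3": base + ['transactionType'],
}
```

Selecting a case is then just `X = df[cases["case2"]]` on the prepared dataframe.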

Model 1 - Logistic regression (Case 2)

Implemented train-test split, cross-validation, model fitting, and ROC curve visualization.
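A minimal sketch of that pipeline, using a synthetic imbalanced dataset in place of the real Case 2 feature matrix (the actual X and y come from the prepared dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the Case 2 features; 10% positive class
# roughly mimics the rarity of fraud.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validated ROC_AUC on the training split.
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")

# Fit on the full training split and score the held-out test split.
clf.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(cv_auc.mean(), test_auc)
```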

Model 1.2 - Logistic regression after scaling (Case 2)

Used MinMax scaling before applying the model.

Comparing the two ROC_AUC values before and after scaling, scaling the dataframe increased the metric. Therefore, I used the scaled dataframe in the subsequent modeling experiments.
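The scaling step can be sketched with a Pipeline, which also keeps the scaler fitted only on training data during cross-validation. The tiny array below is illustrative, not the real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# MinMax scaling maps each feature into [0, 1] before the model sees it.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0], [4.0, 100.0]])
y = np.array([0, 0, 1, 1])
model.fit(X, y)

# Each column's minimum maps to 0 and its maximum to 1.
scaled = model.named_steps["minmaxscaler"].transform(X)
print(scaled.min(axis=0), scaled.max(axis=0))  # [0. 0.] [1. 1.]
```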

Model 2 - XGBoost (Case 2)

In addition to the previous procedures, I utilized the grid search method to conduct hyperparameter tuning (learning rate, number of estimators), selected the optimal combination of parameters, and plotted the ROC curve.

Model 3 - lightGBM (Case 2)

Similar to the XGBoost modeling, I used the Grid Search approach to conduct the hyperparameter tuning (learning rate, number of estimators, max_depth) for the lightGBM model, chose the optimal combination of parameters, and plotted the roc_auc curve.

Conclusion

After comparing the ROC_AUC of the 3 models across the 3 feature cases, XGBoost and lightGBM with the Case 2 features both achieved the highest ROC_AUC of 0.76, a 7% improvement over the baseline. However, a disadvantage of XGBoost was its relatively slow running time. Therefore, lightGBM with Case 2 was the optimal model. To further investigate this model, I produced a feature importance plot, which showed that "transactionAmount", "availableMoney", and "posEM_new" were the key factors in determining whether a transaction was fraudulent.

Future

In future work, I will experiment with more feature combinations. Additionally, I will apply more machine learning models with hyperparameter tuning.
