
Predicting Churn for a Ride-Sharing Company

Research Problem

A ride-sharing company (Company X) is interested in predicting rider retention. Using rider activity data, we developed a model that identifies the factors that best predict retention. We also offer suggestions for operationalizing these insights to help Company X.

Data

We have a mix of rider demographics, rider behavior, ride characteristics, and rider/driver ratings of each other. The data span a 7-month period.

Variable                  Description
city                      City this user signed up in
phone                     Primary device for this user
signup_date               Date of account registration
last_trip_date            Last time user completed a trip
avg_dist                  Average distance (in miles) per trip taken in first 30 days after signup
avg_rating_by_driver      Rider's average rating, as rated by drivers, over all trips
avg_rating_of_driver      Rider's average rating of their drivers over all trips
surge_pct                 Percent of trips taken with surge multiplier > 1
avg_surge                 Average surge multiplier over all of user's trips
trips_in_first_30_days    Number of trips user took in first 30 days after signing up
luxury_car_user           TRUE if user took a luxury car in first 30 days
weekday_pct               Percent of user's trips occurring during a weekday

Defining Churn

We converted the date strings into datetime objects to calculate the churn outcome variable. Users were identified as having churned if they had not used the ride-share service in the thirty days prior to the data pull date (2014-07-01):

from datetime import datetime, timedelta
import numpy as np

def convert_dates(df):
    # parse the raw date strings into datetime objects
    df['last_trip_date'] = df['last_trip_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    df['signup_date'] = df['signup_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    # data were pulled on 2014-07-01; "active" means a trip within the last 30 days
    current_date = datetime.strptime('2014-07-01', '%Y-%m-%d')
    active_date = current_date - timedelta(days=30)
    # churn label: 1 if the user's last trip came before the active window
    y = np.array([0 if last_trip_date > active_date else 1 for last_trip_date in df['last_trip_date']])
    return y

Categorical variables whose classes were represented as strings were encoded as numeric labels:

from sklearn import preprocessing

def label_encode(df, encode_list):
    # add an integer-coded copy of each string-valued column, keeping the original
    le = preprocessing.LabelEncoder()
    for col in encode_list:
        le.fit(df[col])
        df[col + '_enc'] = le.transform(df[col])
    return df
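
For example, assuming the raw data live in a CSV (the file name below is hypothetical), the preprocessing steps chain together as:

import pandas as pd

df = pd.read_csv('churn.csv')             # hypothetical file name
y = convert_dates(df)                     # churn labels, as defined above
df = label_encode(df, ['city', 'phone'])  # the two string-valued columns in the data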

Exploratory Data Analysis and Feature Engineering

We discovered that some of the predictor variables (e.g., average distance, number of trips in the first 30 days) were markedly positively skewed. These variables also included zero values, so simple corrections for skew, such as a log transform, could not be applied directly.

Skewed data were normalized using an inverse hyperbolic sine transformation, arcsinh(x) = ln(x + sqrt(x^2 + 1)), which is defined at zero and behaves like a log transform for large values:

import numpy as np

def normalize_inv_hyperbol_sine(df, col):
    # arcsinh compresses large values like a log transform but handles zeros
    df[col + '_normalized'] = np.arcsinh(np.array(df[col]))

This worked well to normalize the data.
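
Applied to the skewed predictors (a usage sketch; the exact set of transformed columns is an assumption):

for col in ['avg_dist', 'trips_in_first_30_days']:
    normalize_inv_hyperbol_sine(df, col)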

While examining the distributions of the variables, we noticed that the percent of a user's trips occurring during a weekday had an interesting distribution, with pronounced spikes at 0% and 100% and a roughly Gaussian-looking distribution between those extremes.

We decided to create dummy variables splitting this variable into three categories:

  1. All rides on weekdays
  2. All rides on weekends
  3. Mix of weekdays and weekends

def categorize_weekday_pct(df):
    # indicator flags for the three weekday-usage patterns listed above
    df['all_weekday'] = (df.weekday_pct == 100).astype('int')
    df['all_weekend'] = (df.weekday_pct == 0).astype('int')
    df['mix_weekday_weekend'] = ((df.weekday_pct < 100) & (df.weekday_pct > 0)).astype('int')
    return df

Classification/Predictive Analytics

Random Forest is a great place to start with a classification problem like this: it is fast, easy to use, and reasonably accurate right out of the box. Our Random Forest classifier produced an F1 score of 77% on unseen data.
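
A minimal sketch of this baseline; the hyperparameters and the 80/20 split are assumptions rather than our exact settings, and X/y are the engineered feature matrix and churn labels from the steps above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# hold out 20% of riders to estimate performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(f1_score(y_test, rf.predict(X_test)))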

To improve on this fit, we next tried boosted classification models. While boosted models require more tuning (and therefore take a bit longer to get working than Random Forest), they are usually more accurate.

  1. Gradient boost
  • Using scikit-learn's GridSearchCV, we first performed a grid search to determine the best hyperparameters for a GradientBoostingClassifier. The resulting classifier performed well, with an F1 score of 83% on unseen data.
  2. XGBoost
  • XGBoost performed comparably on the unseen data, with an average F1 score from cross-validation of almost 84%.
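
As a sketch of both boosted approaches (the parameter grid and hyperparameters below are illustrative, not the grids we actually searched):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier

# grid search over GradientBoostingClassifier hyperparameters, scored by F1
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3, 4],
}
gs = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring='f1', cv=5, n_jobs=-1)
gs.fit(X_train, y_train)  # X_train, y_train as in the random forest sketch
print(gs.best_params_, gs.best_score_)

# cross-validated F1 for an XGBoost classifier with comparable settings
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=3)
print(cross_val_score(xgb, X_train, y_train, scoring='f1', cv=5).mean())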

Results

The accuracy, recall, and precision that XGBoost produced on unseen data confirmed that it is a good choice, as it generalizes well for this application:

  • Accuracy: 78.29%
  • Recall: 86.25%
  • Precision: 80.74%
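
These figures can be reproduced from the held-out predictions; a sketch, assuming the fitted XGBoost model and train/test split from the earlier sketches:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# fit on the training split, score on the held-out split
xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
print(accuracy_score(y_test, pred), recall_score(y_test, pred), precision_score(y_test, pred))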

Although further feature engineering could yield improvements, the current model would certainly be helpful in identifying customer segments that should be investigated further.

Running a feature importance analysis on the XGBoost model suggests that surge percentage, average ride distance, and number of trips taken in the first 30 days are the most relevant predictors in this model. Next steps would include comparing users who are predicted to churn against those who are not on these three features. This could lead to actionable insights and would therefore be a priority of continuing work on this project.
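
A sketch of how such a ranking can be read off xgboost's scikit-learn wrapper (this assumes X is a pandas DataFrame, so column names are available):

# rank features by the fitted model's importance scores, highest first
for name, score in sorted(zip(X.columns, xgb.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {score:.3f}')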

Recommendations for Company X

  • Use the best-fitting model (above) to obtain predicted churn probabilities for individual riders. Target those whose probability of churning exceeds some cutoff, chosen by considering a profit curve built from the confusion matrix at each candidate threshold (see the sketch after this list).

  • Further investigate the variables identified above as important predictors of churn.

  • Offer discounts or free rides to at-risk users to try to retain them; there is no need to target users below the chosen probability threshold.
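
A minimal sketch of choosing that cutoff with a profit curve; the per-rider dollar values are placeholder assumptions, and xgb, X_test, and y_test come from the sketches above:

import numpy as np

benefit_tp = 20.0  # assumed profit from retaining a rider who would have churned
cost_fp = -5.0     # assumed cost of discounting a rider who would have stayed anyway

probs = xgb.predict_proba(X_test)[:, 1]  # predicted churn probabilities
thresholds = np.linspace(0, 1, 101)
profits = [((probs >= t) & (y_test == 1)).sum() * benefit_tp
           + ((probs >= t) & (y_test == 0)).sum() * cost_fp
           for t in thresholds]
best_cutoff = thresholds[int(np.argmax(profits))]

Riders whose predicted probability exceeds best_cutoff would be targeted; everyone below it is left alone, per the last bullet.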

What We Learned

How useful are feature engineering and normalizing skewed data?

Classifiers like random forests and boosted trees are quite robust to skewed and non-normally distributed data. We probably did not need to spend time transforming our data or creating dummy variables for the percent of weekday rides.

Contributors

Our team included Micah Shanks (github.com/Jomonsugi), Stuart King (github.com/Stuart-D-King), Jennifer Waller (github.com/jw15), and Ian
