Project: Investigate a Dataset - No-shows appointment

Introduction

Dataset Description

The dataset, a record of about 110,00 medical appointments in brazil focuses on whether a patient showed up for a scheduled appointment. The dataset contains 14 variables (columns) which are:

PatientId - Identification of the patient
AppointmentID - Identification of the appointment
Gender - (M)ale or (F)emale
ScheduledDay - The day the patient set up the appointment
AppointmentDay - The day of the appointment
Age - Age of the patient
Neighbourhood - The location of the hospital.
Scholarship - Whether or not the patient was enrolled in the Brazilian welfare program (0 = False, 1 = True)
Hipertension - Whether or not the patient had Hypertension (0 = False, 1 = True)
Diabetes - Was the patient diabetic? (0 = False, 1 = True)
Alcoholism - Was the patient an alcoholic? (0 = False, 1 = True)
Handcap - Was the patient handicapped? (0 = False, 1 = True)
SMS_received - One (1) or more messages sent to the patient
No-show - Did the patient not show up for the appointment? (No = Patient showed up, Yes = Patient did not)

More information about the dataset can be found here

Question(s) for Analysis

What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment? Specifically:

Which gender is more likely to miss an appointment?
Does the length of days between the schedule date and appointment date affect No-show?
Does sending SMS reduce the number of No-shows?
Are those in the Brazilian welfare program more likely to keep an appointment?
Does the presence of any of the conditions (Hypertension, Diabetes, Alcoholism and Handicap) influence Noshow?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Upgrade pandas to use dataframe.explode() function. 
# !pip install --upgrade pandas==0.25.0

Data Wrangling

General Properties

# load the data
df = pd.read_csv('Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv')

# let's see the first five rows
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	5.589978e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	4.262962e+12	5642549	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	8.679512e+11	5642828	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	8.841186e+12	5642494	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

# How many rows and columns does the data have?
df.shape

(110527, 14)

# inspect data types and look for instances of missing or possibly errant data.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

There are no missing data in any of the variables, however the ScheduledDay and AppointmentDay are not in the right types. Also Hipertension variable was wrongly spelt.

# Check the data for duplicates
df.duplicated().sum()

Data Cleaning

The following will be done to clean up the dataset for the sake of the intended analysis:

Drop irrelevant variables (PatientId, AppointmentID, Neighbourhood)
Rename variables for correctness (Hipertension, Handcap) and consistency (SMS_received, No-show)
Convert ScheduledDay and AppointmentDay to the appropriate type (datetime)
Keep only rows where the AppointmentDay is greater or same as the ScheduledDay

# Drop irrelevant columns/variables
df_clean = df.copy() # Make a copy of the original dataset for fallback or reference if neccessary

df_clean.drop(['PatientId', 'AppointmentID', 'Neighbourhood'], axis=1, inplace=True)

# view the columns left
df_clean.columns

Index(['Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Scholarship',
       'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received',
       'No-show'],
      dtype='object')

# Rename variables for correctness and consistency
columns={'Hipertension': 'Hypertension', 'Handcap': 'Handicap', 'SMS_received': 'SmsReceived', 'No-show': 'NoShow'}
df_clean.rename(columns=columns, inplace=True)

df_clean.head(2)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Gender	ScheduledDay	AppointmentDay	Age	Scholarship	Hypertension	Diabetes	Alcoholism	Handicap	SmsReceived	NoShow
0	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	0	1	0	0	0	0	No
1	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	0	0	0	0	0	0	No

# Convert ScheduledDay and AppointmentDay to the appropriate type (datetime)
df_clean.ScheduledDay = pd.to_datetime(df_clean.ScheduledDay)
df_clean.AppointmentDay = pd.to_datetime(df_clean.AppointmentDay)

df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 11 columns):
Gender            110527 non-null object
ScheduledDay      110527 non-null datetime64[ns]
AppointmentDay    110527 non-null datetime64[ns]
Age               110527 non-null int64
Scholarship       110527 non-null int64
Hypertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handicap          110527 non-null int64
SmsReceived       110527 non-null int64
NoShow            110527 non-null object
dtypes: datetime64[ns](2), int64(7), object(2)
memory usage: 9.3+ MB

# Keep only rows where AppointmentDay >= ScheduledDay
df_clean['ActualDateTimeDifference'] = df_clean.AppointmentDay - df_clean.ScheduledDay

# calculate the length of days (appointment day inclusive) from scheduled date
df_clean['LenDays'] = df_clean['ActualDateTimeDifference'] .dt.days + 1

df_clean[['ScheduledDay', 'AppointmentDay', 'ActualDateTimeDifference',  'LenDays', 'NoShow']].head(10)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ScheduledDay	AppointmentDay	ActualDateTimeDifference	LenDays	NoShow
0	2016-04-29 18:38:08	2016-04-29	-1 days +05:21:52	0	No
1	2016-04-29 16:08:27	2016-04-29	-1 days +07:51:33	0	No
2	2016-04-29 16:19:04	2016-04-29	-1 days +07:40:56	0	No
3	2016-04-29 17:29:31	2016-04-29	-1 days +06:30:29	0	No
4	2016-04-29 16:07:23	2016-04-29	-1 days +07:52:37	0	No
5	2016-04-27 08:36:51	2016-04-29	1 days 15:23:09	2	No
6	2016-04-27 15:05:12	2016-04-29	1 days 08:54:48	2	Yes
7	2016-04-27 15:39:58	2016-04-29	1 days 08:20:02	2	Yes
8	2016-04-29 08:02:16	2016-04-29	-1 days +15:57:44	0	No
9	2016-04-27 12:48:25	2016-04-29	1 days 11:11:35	2	No

df_clean.LenDays.describe()

count    110527.000000
mean         10.183702
std          15.254996
min          -6.000000
25%           0.000000
50%           4.000000
75%          15.000000
max         179.000000
Name: LenDays, dtype: float64

Wait!, we have appointments (e.g. -6 days) before schedule?. The above table shows us that days with values of 0 are actually appointments scheduled for the same day. We remove the row(s) with LenDays < 0.

# get the indices of rows we want to drop 
indices = df_clean[df_clean.LenDays < 0].index

# remove the rows
df_clean.drop(indices, inplace=True)

df_clean.LenDays.describe()

count    110522.000000
mean         10.184253
std          15.255115
min           0.000000
25%           0.000000
50%           4.000000
75%          15.000000
max         179.000000
Name: LenDays, dtype: float64

Exploratory Data Analysis

Which gender is more likely to miss an appointment?

# Determine the proportions of males and females
df_clean.Gender.value_counts(normalize=True)

F    0.649979
M    0.350021
Name: Gender, dtype: float64

65% of the patients were females while the rest 35% were males

def create_grouped_data(key):
    """
    Create a pandas grouped data with the provided key
    Args:
        key (str) - The key provided to the pandas DataFrame groupby method
    Returns:
        A pandas grouped data
    """
    return df_clean.groupby(key).NoShow.value_counts(normalize=True) * 100


gender_noshow = create_grouped_data('Gender')

def create_side_by_side_barchart(grouped_data, keys, labels=None, colors=('b', 'g'), 
                                 title='', ylab='Proportion(%)', xlab='No Show'):
    """
     A helper function that creates a side by side (dodged) bar chart
     Args:
       grouped_data (pandas.core.series.Series) - The data created by pandas DataFrame groupby method
       keys         (iterable) - A list or tuple of category or levels of the variable passed to the groupby method
       labels       (iterable) - labels to use as the plot legend
       ylab         (str) - Y-axis label of the plot
       xlab         (str) - X-axis label of the plot
     Returns:
       None
    """
    
    index = np.arange(grouped_data[keys[0]].count())
    width = 0.35
    
    if labels is None:
        labels = keys

    plt.bar(index, grouped_data[keys[0]], width, color=colors[0], label=labels[0])
    plt.bar(index + width, grouped_data[keys[1]], width, color=colors[1], label=labels[1])

    plt.ylabel(ylab)
    plt.xlabel(xlab)
    plt.title(title)
    plt.xticks(index + width / 2, ['No', 'Yes'] * 2) # add xtick labels

    # # legend
    plt.legend();

create_side_by_side_barchart(gender_noshow, keys=('M', 'F'), title='Proportion of Noshow by Gender')

The same proportion (approx. 20%) of males and females did not show up for the scheduled appointment. This implied that Noshow is not influenced by gender.

Does the length of days between the schedule date and appointment date affect No-show?

It is pertinent to note that taking the days component alone (LenDays) meant a loss of the time component. My interest here is the days difference so it suffices for this question.

# I intend to use LenDays as a categorical variable since Noshow is a categorical variable
# let's see the 128 unique days
print('The {} unique days are: '.format(df_clean.LenDays.nunique()))
df_clean.LenDays.unique()

The 129 unique days are: 





array([  0,   2,   3,   1,   4,   9,  29,  10,  23,  11,  18,  17,  14,
        28,  24,  21,  15,  16,  22,  43,  30,  31,  42,  32,  56,  45,
        46,  39,  37,  38,  44,  50,  60,  52,  53,  65,  67,  91,  66,
        84,  78,  87, 115, 109,  63,  70,  72,  57,  58,  51,  59,  41,
        49,  73,  64,  20,  33,  34,   6,  35,  36,  12,  13,  40,  47,
         8,   5,   7,  25,  26,  48,  27,  19,  61,  55,  62, 176,  54,
        77,  69,  83,  76,  89,  81, 103,  79,  68,  75,  85, 112,  80,
        86,  98,  94, 142, 155, 162, 169, 104, 133, 125,  96,  88,  90,
       151, 126, 127, 111, 119,  74,  71,  82, 108, 110, 102, 122, 101,
       105,  92,  97,  93, 107,  95, 139, 132, 179, 117, 146, 123])

# Let's see the distribution of the length of days
df_clean.LenDays.plot.hist(bins=20);

# A helper function to group the days
# We could further collapse the groups than what was done here
def group_days(days):
    """
      Categorize number of days
      Args:
         days (int) - The days to group
      Returns:
         An integer between 0 and 8 (inclusive)
    """
    if days == 0:
        return 0
    elif days <= 3:
        return 1
    elif days <= 7:
        return 2
    elif days <= 14:
        return 3
    elif days <= 30:
        return 4
    elif days <= 60:
        return 5
    elif days <= 90:
        return 6
    elif days <= 120:
        return 7
    elif days <= 150:
        return 8
    elif days <= 180:
        return 9

df_clean['LenDaysGroup'] = df_clean.LenDays.map(group_days)

df_clean.LenDaysGroup.value_counts()

0    38563
2    17510
4    17371
1    14675
3    12025
5     8283
6     1878
7      132
9       56
8       29
Name: LenDaysGroup, dtype: int64

days_group_noshow = create_grouped_data('LenDaysGroup')


noshow_yeses = [val for index, val in enumerate(days_group_noshow) if index % 2 != 0]
labels = ['Same Day', 'Within 3 Days', 'Within a Week', 'Within 2 Weeks', 'Within 1 Month', 'Within 2 Months', 'Within 3 Months', 
                   'Within 4 Months', 'Within 5 Months', 'Within 6 Months']
index = np.arange(len(noshow_yeses))

# combine the index, yeses and labels so that it can be sorted accurately
data = list(zip(noshow_yeses, labels))

# sort the data by the yeses
data.sort(key=lambda x: x[0])

# transpose to the right shape for plotting: data[0] is now the index
data = list(zip(*data))

plt.figure(figsize=(8, 6))
plt.barh(index, data[0])
plt.xlabel('Proportion (%)')
plt.ylabel('Length of Days')
plt.title('Proportion of Noshow Based on Length of Days')
plt.yticks(index, data[1])
plt.show();

We see that less than 5% of same day (zero-day) appointments were not kept. The story is very different when the day of appointment is much later than the scheduled day.

Note that we could drill further into length of days with respect to gender and presence of disease or type of disease

Does sending SMS reduce the number of No-shows?

# Find the proportion of those that were sent sms
df_clean.SmsReceived.value_counts(normalize=True)

0    0.67896
1    0.32104
Name: SmsReceived, dtype: float64

About 32% of the patients were sent 1 or more SMS

sms_noshow = create_grouped_data('SmsReceived')

labels = ('Received 0', 'Received 1 or more')
title = 'Proportion of Noshow Based on Number of SMS Received'
create_side_by_side_barchart(sms_noshow, keys=(0, 1), title=title, labels=labels)

27.6% of those who received 1 or more SMS did not show up for their appointments while 16.7% of those who did not receive any sms did not show up for their appointments.

Are those in the Brazilian welfare program more likely to keep an appointment?

# How many people are in the welfare program?

# df_clean.Scholarship.value_counts(normalize=True) * 100
# Same as below

df_clean.Scholarship.mean() * 100 # We can do this because the column contains only 0s and 1s

9.8270027686795398

9.8% of the population were in the Brazilian welfare program

scholarship_noshow = create_grouped_data('Scholarship')
print(scholarship_noshow)
labels = ('Not Sponsored', 'Sponsored')
title = 'Proportion of Noshow Based on Number of SMS Received'

create_side_by_side_barchart(scholarship_noshow, keys=(0, 1), title=title, labels=labels)

Scholarship  NoShow
0            No        80.196867
             Yes       19.803133
1            No        76.263696
             Yes       23.736304
Name: NoShow, dtype: float64

About 23.7% of those who were on welfare program missed their appointments while about 20% of those who were not did not keep their appointment.

Does the Presence of any of the Conditions Influence NoShow?

has_any_condition = (df_clean.Hypertension == 1) | (df_clean.Diabetes == 1) | (df_clean.Alcoholism == 1)  | (df_clean.Handicap == 1)

df_condition = df_clean[has_any_condition]
df_no_condition = df_clean[has_any_condition == False]

print('{:.2f}% of the population had at least 1 of the listed conditions'.format(len(df_condition) / len(df_clean) * 100))
print('{:.2f}% of the population had none of the listed conditions'.format(len(df_no_condition) / len(df_clean) * 100))

23.80% of the population had at least 1 of the listed conditions
76.20% of the population had none of the listed conditions

condition_noshow = df_condition.NoShow.value_counts(normalize=True) * 100
no_condition_noshow = df_no_condition.NoShow.value_counts(normalize=True) * 100

fig = plt.figure(figsize=(16, 6))
ax1, ax2 = fig.subplots(1, 2)

ax1.pie(condition_noshow, labels=condition_noshow.index, autopct='%1.1f%%',shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title('Proportion(%) of NoShow Among the Population with Any Condition')
ax1.set_xlabel("No Show")

ax2.pie(no_condition_noshow, labels=no_condition_noshow.index, autopct='%1.1f%%', shadow=True, startangle=90)
ax2.axis('equal')
ax2.set_title('Proportion(%) of NoShow Among the Population with None of the Conditions')
ax2.set_xlabel("No Show")

plt.show()

A larger proportion (20.9%) of those without any condition did not keep their appointments.

Conclusions

Firstly, it is pertinent to note that this analysis is tentative.

Gender did not seem to influence the missing of hospital appointments (NoShow) among the population.
Also, the length of days between scheduled date and appointment day seems to play a row on NoShow as 0 days (same-day) appointments were less missed.
The reception of SMS seemed to influence NoShow. However, more insight could be gained if presence of disease or the length of days were considered along with receiving SMS.
More proportion of those sponsored also missed their appointment relative to those who were not sponsored.
Finally, more people without any condition (3% more than those with any condition) missed their appointment. Here also, we could probe further by checking which of the conditions (Hypertension or Diabetes or Alcoholism or Handicap) had more of NoShow.

Limitations

The current analysis does not perform any hypothesis test which is necessary to provide concrete answers to the research questions. Secondly, statistically modelling (logistic regression) was not performed. This would give insights on the combination of factors that would predict NoShow.

# Create Html File Needed For Project Submission
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Investigate_a_Dataset.html		Investigate_a_Dataset.html
Investigate_a_Dataset.ipynb		Investigate_a_Dataset.ipynb
output_24_0.png		output_24_0.png
output_29_0.png		output_29_0.png
output_31_0.png		output_31_0.png
output_36_0.png		output_36_0.png
output_41_1.png		output_41_1.png
output_45_0.png		output_45_0.png
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project: Investigate a Dataset - No-shows appointment

Table of Contents

Introduction

Dataset Description

Question(s) for Analysis

Data Wrangling

General Properties

Data Cleaning

Exploratory Data Analysis

Which gender is more likely to miss an appointment?

Does the length of days between the schedule date and appointment date affect No-show?

Does sending SMS reduce the number of No-shows?

Are those in the Brazilian welfare program more likely to keep an appointment?

Does the Presence of any of the Conditions Influence NoShow?

Conclusions

Limitations

About

Releases

Packages

Languages

maqui7295/investigate-a-dataset

Folders and files

Latest commit

History

Repository files navigation

Project: Investigate a Dataset - No-shows appointment

Table of Contents

Introduction

Dataset Description

Question(s) for Analysis

Data Wrangling

General Properties

Data Cleaning

Exploratory Data Analysis

Which gender is more likely to miss an appointment?

Does the length of days between the schedule date and appointment date affect No-show?

Does sending SMS reduce the number of No-shows?

Are those in the Brazilian welfare program more likely to keep an appointment?

Does the Presence of any of the Conditions Influence NoShow?

Conclusions

Limitations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages