We have data about users who hit our site: whether they converted or not as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users and the number of pages visited during that session (as a proxy for site activity/time spent on site).
Goal:
- Predict conversion rate
- Come up with recommendations for the product team and the marketing team to improve conversion rate
Data: available at data/conversion_data.csv
Table: "conversion_data" - information about signed-in users during one session. Each row is a user session.
- country : user country based on the IP address
- age : user age. Self-reported at sign-in step
- new_user : whether the user created the account during this session or already had an account and simply came back to the site
- source : marketing channel source
  - Ads: came to the site by clicking on an advertisement
  - Seo: came to the site by clicking on search results
  - Direct: came to the site by typing the URL directly in the browser
- total_pages_visited: total number of pages visited during the session.
  - This is a proxy for time spent on site and engagement during the session.
- converted: this is our label. 1 means they converted within the session, 0 means they left without buying anything.
- The company goal is to increase conversion rate: # conversions / total sessions.
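The conversion rate defined above can be computed directly from the `converted` column, because the mean of a 0/1 label equals conversions divided by sessions. The following is a small sketch; the helper name `conversion_rate` and its `by` parameter are hypothetical and are not part of the task description.

```python
import pandas as pd


def conversion_rate(dataframe, by=None):
    """Return # conversions / total sessions, overall or per group.

    Hypothetical helper: `by` is an optional column name (or list of
    names) to group on, e.g. "country" or "source".
    """
    if by is None:
        # Mean of a 0/1 column is the share of rows equal to 1.
        return dataframe["converted"].mean()
    return dataframe.groupby(by)["converted"].mean()
```

Calling `conversion_rate(df, by="source")` would give one rate per marketing channel, which may be a useful starting point for the recommendations.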
- Define a function named `csv_to_dataframe` that accepts `filepath` as a parameter.
  - The function should return a dataframe.
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If the given `filepath` does not exist, the function should raise FileNotFoundError.
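One possible sketch of `csv_to_dataframe`, assuming the default options of `pandas.read_csv` are adequate for the file. `read_csv` already raises FileNotFoundError for a missing path; the explicit check here only makes that requirement visible.

```python
import os

import pandas as pd


def csv_to_dataframe(filepath):
    """Read the CSV file at `filepath` into a pandas DataFrame."""
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"No such file: {filepath}")
    return pd.read_csv(filepath)
```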
- Define a function named `dtype_category` that accepts `dataframe` and a list of columns as parameters.
  - The function should return a dataframe with the type of the given columns changed to "category".
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If a column name that does not exist is passed, the function should raise KeyError.
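A minimal sketch of `dtype_category`. It assumes the input should not be modified in place, so it works on a copy. The parameter name `column_list` is borrowed from the later functions, since the description only says "list of columns".

```python
import pandas as pd


def dtype_category(dataframe, column_list):
    """Return a copy of `dataframe` with `column_list` cast to category."""
    missing = [col for col in column_list if col not in dataframe.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    result = dataframe.copy()
    for col in column_list:
        result[col] = result[col].astype("category")
    return result
```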
- Define a function named `centre_and_scale` that accepts `dataframe` and `column_list` as parameters.
  - The function should return a dataframe in which the given numerical columns are centred and scaled.
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If a column name that does not exist is passed, the function should raise KeyError.
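The description does not say which scaling is meant. The sketch below assumes standardisation: subtract the column mean and divide by the population standard deviation (`ddof=0`). If the sample standard deviation is wanted instead, `ddof=1` would be the change to make.

```python
import pandas as pd


def centre_and_scale(dataframe, column_list):
    """Return a copy with `column_list` standardised to mean 0, std 1."""
    missing = [col for col in column_list if col not in dataframe.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    result = dataframe.copy()
    for col in column_list:
        mean = result[col].mean()
        # Assumption: population standard deviation (ddof=0).
        # A constant column has std 0 and would produce NaN values.
        std = result[col].std(ddof=0)
        result[col] = (result[col] - mean) / std
    return result
```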
- Define a function named `label_encoder` that accepts `dataframe` and `column_list` (the variables to be encoded) as parameters.
  - The function should return a dataframe with the encoded variables.
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If a column name that does not exist or is not of categorical type is passed, the function should raise KeyError.
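A sketch of `label_encoder` that uses the integer codes pandas already stores for categorical columns. This is one possible encoding; the description does not prescribe how labels map to integers, and here they follow the order of the column's categories.

```python
import pandas as pd


def label_encoder(dataframe, column_list):
    """Return a copy with each categorical column replaced by integer codes."""
    for col in column_list:
        if col not in dataframe.columns:
            raise KeyError(f"Column not found: {col}")
        if not isinstance(dataframe[col].dtype, pd.CategoricalDtype):
            raise KeyError(f"Column is not categorical: {col}")
    result = dataframe.copy()
    for col in column_list:
        # .cat.codes gives the position of each value in the category
        # list; missing values are encoded as -1.
        result[col] = result[col].cat.codes
    return result
```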
- Define a function named `one_hot_encoder` that accepts `dataframe` and `column_list` (the variables to be encoded) as parameters.
  - The function should return a dataframe with the encoded variables.
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If a column name that does not exist or is not of categorical type is passed, the function should raise KeyError.
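A sketch of `one_hot_encoder` built on `pandas.get_dummies`. It assumes every category should get its own 0/1 column named `<column>_<value>` and that the original column is dropped. Whether one level should be omitted (for example with `drop_first=True`) is left open by the description.

```python
import pandas as pd


def one_hot_encoder(dataframe, column_list):
    """Return a copy with each categorical column expanded into 0/1 columns."""
    for col in column_list:
        if col not in dataframe.columns:
            raise KeyError(f"Column not found: {col}")
        if not isinstance(dataframe[col].dtype, pd.CategoricalDtype):
            raise KeyError(f"Column is not categorical: {col}")
    # dtype=int yields 0/1 integers rather than booleans.
    return pd.get_dummies(dataframe, columns=list(column_list), dtype=int)
```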
- Define a function named `skewness` that accepts `dataframe` and `column_list` (the variables whose skewness is to be determined) as parameters.
  - The function should return a list with the skewness of each given column.
  - Since a list of values is required, the return type should be a list.
  - If a column name that does not exist or is of categorical type is passed, the function should raise KeyError.
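A sketch of `skewness` using `Series.skew`, which returns the bias-corrected sample skewness. As an assumption beyond the description, any non-numeric column (not only categorical ones) is rejected with KeyError, since skewness is undefined for it.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def skewness(dataframe, column_list):
    """Return a list with the skewness of each column in `column_list`."""
    for col in column_list:
        if col not in dataframe.columns:
            raise KeyError(f"Column not found: {col}")
        # Assumption: treat every non-numeric dtype like a categorical one.
        if not is_numeric_dtype(dataframe[col]):
            raise KeyError(f"Column is not numeric: {col}")
    return [float(dataframe[col].skew()) for col in column_list]
```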
- Define a function named `sqrt_transform` that accepts `dataframe` and `column_list` (the variables to be sqrt transformed) as parameters.
  - The function should return a dataframe in which the given columns are sqrt transformed.
  - Since a dataframe is required, the return type should be a pandas dataframe.
  - If a column name that does not exist or is of categorical type is passed, the function should raise KeyError.
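A sketch of `sqrt_transform` using `numpy.sqrt`. It reads the requirement as "return the full dataframe with the given columns replaced by their square roots" and assumes the values are non-negative (negative inputs would become NaN). Rejecting all non-numeric columns, not only categorical ones, is also an assumption.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sqrt_transform(dataframe, column_list):
    """Return a copy with the square root applied to `column_list`."""
    for col in column_list:
        if col not in dataframe.columns:
            raise KeyError(f"Column not found: {col}")
        # Assumption: treat every non-numeric dtype like a categorical one.
        if not is_numeric_dtype(dataframe[col]):
            raise KeyError(f"Column is not numeric: {col}")
    result = dataframe.copy()
    for col in column_list:
        result[col] = np.sqrt(result[col])
    return result
```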
- Define a function named `plots` that accepts `dataframe` and `column_list` (the variables to be plotted) as parameters.
  - The function should return subplots containing a histogram and a boxplot for each numeric variable.
  - Since a plot is required, the return type should be a matplotlib object.
  - If a column name that does not exist is passed, the function should raise KeyError.
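A sketch of `plots` that lays out one row per column, with the histogram on the left and the boxplot on the right. Returning the `matplotlib.figure.Figure` is one reading of "matplotlib object"; the bin count and figure size are arbitrary choices, and `column_list` is assumed to be non-empty.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plots(dataframe, column_list):
    """Return a Figure with a histogram and a boxplot per column."""
    missing = [col for col in column_list if col not in dataframe.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    n_rows = len(column_list)
    # squeeze=False keeps `axes` two-dimensional even for a single row.
    fig, axes = plt.subplots(
        nrows=n_rows, ncols=2, figsize=(10, 4 * n_rows), squeeze=False
    )
    for row, col in enumerate(column_list):
        values = dataframe[col].dropna()
        axes[row, 0].hist(values, bins=30)
        axes[row, 0].set_title(f"Histogram of {col}")
        axes[row, 1].boxplot(values)
        axes[row, 1].set_title(f"Boxplot of {col}")
    fig.tight_layout()
    return fig
```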