Dave3625 - Lab2

Data wrangling

Feature engineering on the Titanic dataset
This is a classic dataset used in many data mining tutorials and demos -- perfect for getting started with exploratory analysis and building binary classification models to predict survival.

About The Lab

In this lab, we will start to look at feature engineering on the Titanic dataset.

The titanic and titanic2 data frames describe the survival status of individual passengers on the Titanic. The titanic data frame does not contain information from the crew, but it does contain actual ages of half of the passengers. - LakeForest.edu

We will be using pandas, numpy and seaborn.

Solution added

New imports

We will use a new package in this lab. Don't worry, it's a Python standard library module called re, so remember to add

#Import modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

to your imports.

Load the titanic set found under /data/Titanic.csv as we did in Lab1

Hint: Click view as "Raw" and copy the url

Tasks

1. Check for null and nan values

In lab 1 we used df.isna().sum() to check for NaN values. Since we didn't find many, we converted blank spaces into np.nan to help us proceed. Let's try it again on this dataset.

df.isna().sum()
# You can also use df.isnull().sum()

Since the NaN values are encoded differently in this dataset than in lab 1, we can use the function right out of the box. Fill Age, Fare and Embarked with sensible values (Embarked could be filled with "S").

During the last lab, people asked how to fill NaN values with meaningful values. What counts as a meaningful value differs from dataset to dataset, but let's just use the median for all numeric columns. Hint: filling columns with median values

df["column"] = df["column"].fillna(df["column"].median())
# To fill with a set value or a char, change .fillna("desired value")
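As a rough sketch of the check-then-fill workflow described above, the example below uses a small made-up data frame (`toy`, with invented values) in place of the real Titanic data. Only the column names come from the lab.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Numeric columns: fill with the column median.
for col in ["Age", "Fare"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical column: fill with a fixed value, here "S" as suggested above.
toy["Embarked"] = toy["Embarked"].fillna("S")

# Count again to confirm nothing is missing any more.
missing_after = toy.isna().sum()
print(missing_before, missing_after, sep="\n\n")
```

The median is computed on the non-missing values only, so the missing Age here is filled with 35.0.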

We can also see that many people have NaN for Cabin. It's not as easy as just filling in a dummy value here. We could fill with “no cabin”, but for machine learning we prefer numerical or bool values. To achieve this, let's make a new bool column:

HasCabin = True / False, with all NaN values set to False and all others set to True

df["HasCabin"] = df.Cabin.isnull()

Do a df.head() and you can see we have a new column, but there is an error.

Hint:

Try to find the error before checking the hint

Adding a new column based on data available is considered creating a new feature.

2. Adding a feature

Let's extract the title for each person on the boat and make a new column called «Title». As we can see from the data set, the format for names is LastName, Title. RestOfName

An easy way to extract a certain string is to use

lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1)

What is this syntax? It's called regex (regular expressions), and an explanation can be found here

And in our case we would like to put this data in a new column, so we can run

df["Title"] = df.Name.apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
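To see what the pattern does, here is a small sketch on a few names in the dataset's format. The helper `extract_title` and the last entry ("NoTitleHere") are hypothetical additions for illustration. The helper returns None instead of raising an error when a name has no title, which the one-line lambda would not do.

```python
import re

import pandas as pd

# A few names in the "LastName, Title. RestOfName" format,
# plus one invented entry without a title.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "NoTitleHere",
])

# A space, then a capitalised word (captured), then a literal dot.
TITLE_PATTERN = re.compile(r" ([A-Z][a-z]+)\.")


def extract_title(name):
    """Return the title found in `name`, or None if the pattern does not match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


titles = names.apply(extract_title)
print(titles)
```

The first three names give Mr, Miss and Mrs. The raw string prefix (r'...') keeps Python from treating `\.` as a string escape.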

Check with df.head() that you now have a column called Title. We can now see how many people have each title. This can be done in many ways, for example by calling

df["column"].value_counts()  # you need to replace "column"
# with the name of the column you want to count

count

As we can see from the count, we have 18 titles, some of them with only one person. Replace Mlle and Ms with "Miss", and Mme with "Mrs" using:

df["column"] = df["column"].replace({'xxx':'yyy', 'jjj':'iiii', … 'uuu':'iii'})

We can also group all titles held by only a few people into a single "Unique" category

df["column"] = df["column"].replace(["x","y", … , "n"], "Unique")
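The two replace steps can be sketched on a made-up list of titles as follows. Mapping Mme to "Mrs" is an assumption on my part (Mme abbreviates Madame), and the threshold `min_count` is a hypothetical way of deciding which titles count as rare. You can just as well write the list of rare titles by hand.

```python
import pandas as pd

# Made-up list of titles for illustration; real counts will differ.
titles = pd.Series(["Mr", "Mlle", "Ms", "Mme", "Miss", "Mrs", "Capt", "Don", "Mr"])

# Step 1: map spelling variants onto the common titles (dict form of replace).
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Step 2: find titles that occur fewer than `min_count` times
# (hypothetical threshold) and merge them into one category (list form).
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index.tolist()
titles = titles.replace(rare, "Unique")

final_counts = titles.value_counts()
print(final_counts)
```

In this toy data, Capt and Don each occur once, so both end up in "Unique".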

And do a new count of titles and see if you get something similar to this: recount

You can also produce a plot with

sns.countplot(x='Title', data=df);  # Seaborn countplot
plt.xticks(rotation=45);

plot

3. Convert Age and Fare into categorical data.

This can be done using pandas qcut function

df['CatAge'] = pd.qcut(df["Age"], q=4, labels=False)

Do this for both Age and Fare.
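A minimal sketch of quantile binning on invented numbers: with q=4 and labels=False, pd.qcut returns the integer bin index (0 to 3) for each row, with roughly the same number of rows per bin. The column name CatFare and the toy values are assumptions for illustration. The `duplicates="drop"` argument is only needed when repeated values make two quantile edges coincide.

```python
import pandas as pd

# Invented values; the real Age and Fare columns will look different.
toy = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5, 6, 7, 8],
    "Fare": [0.0, 0.0, 0.0, 7.0, 8.0, 10.0, 30.0, 80.0],
})

# Four equally populated bins, returned as integer codes 0..3.
toy["CatAge"] = pd.qcut(toy["Age"], q=4, labels=False)

# Fare has many repeated zeros, so two quantile edges are identical here.
# duplicates="drop" merges them instead of raising an error,
# which leaves fewer than four bins.
toy["CatFare"] = pd.qcut(toy["Fare"], q=4, labels=False, duplicates="drop")

print(toy)
```

Because the bins are based on quantiles rather than fixed widths, a skewed column such as Fare still gets spread across the categories.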

4. Convert dataframe to binary data

To train a model easily, we want all data to be numerical. To achieve this, we need to drop columns that don't make sense to convert to a numerical value. At this point, your dataframe should look something like this:

dataframe

Identify columns that we need to drop to convert to a numerical dataset.

Solution

Drop the columns you identified with

df = df.drop(["column1", ... , "columnN"], axis=1)

Converting to binary data is a trivial task in pandas. Try using pd.get_dummies. This works well for analytic tasks, but you could also use scikit-learn's OneHotEncoder() for machine learning tasks.
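As a small sketch of what pd.get_dummies does, the example below encodes two text columns of a made-up data frame. The column selection is an assumption, so adapt it to whatever columns remain in your own dataframe after dropping.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "HasCabin": [False, True, False],
})

# Each listed column is replaced by one indicator column per distinct value,
# named "<column>_<value>". Columns not listed are left unchanged.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])

print(encoded.columns.tolist())
```

Every distinct value gets its own column, so a column with many distinct values (such as Name) would produce a very wide table. That is one reason to drop such columns first.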

All done

At the end of the lab you should have a table looking something like finalTable

In this lab you have:

  • engineered some new features such as 'Title' and 'HasCabin'
  • dealt with missing values, binned your numerical data and transformed all features into numeric variables

More hints

This section will be updated after the first lab session

Useful links

You can find useful information about feature engineering here

pandas cheatsheet

License

Distributed under the MIT License. See LICENSE for more information.