Import and alias Pandas:
import pandas as pd
Load up the table from the link, and extract the dataset out of it. If you're having issues with this, look carefully at the sample code provided in the reading:
url='http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2018/seasontype/2'
df = pd.read_html(url, header=0)[0]
Next up, rename the columns so that they are similar to the column definitions provided to you on the website. Be careful and don't accidentally use any column names twice. If a column uses special characters, you can replace them with regular characters to make it easier to work with:
df.columns=['RK', 'PLAYER','TEAM','GP','G','A','PTS','+/-','PIM','PTS/G','SOG','PCT','GWG','\G','\A','+G','+A']
type(df)
df.head()
RK | PLAYER | TEAM | GP | G | A | PTS | +/- | PIM | PTS/G | SOG | PCT | GWG | \G | \A | +G | +A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RK | PLAYER | TEAM | GP | G | A | PTS | +/- | PIM | PTS/G | SOG | PCT | GWG | G | A | G | A |
1 | 1 | Connor McDavid, C | EDM | 82 | 41 | 67 | 108 | 20 | 26 | 1.32 | 274 | 15.0 | 7 | 5 | 15 | 1 | 3 |
2 | 2 | Claude Giroux, LW | PHI | 82 | 34 | 68 | 102 | 28 | 20 | 1.24 | 193 | 17.6 | 1 | 9 | 27 | 0 | 0 |
3 | 3 | Nikita Kucherov, RW | TB | 80 | 39 | 61 | 100 | 15 | 42 | 1.25 | 279 | 14.0 | 7 | 8 | 28 | 0 | 0 |
4 | 4 | Evgeni Malkin, C | PIT | 78 | 42 | 56 | 98 | 16 | 87 | 1.26 | 239 | 17.6 | 7 | 14 | 24 | 0 | 0 |
Convert argument to a numeric type.
- If ‘raise’, then invalid parsing will raise an exception
- If ‘coerce’, then invalid parsing will be set as NaN
- If ‘ignore’, then invalid parsing will return the inpu
df['GP']= pd.to_numeric(df.GP, errors='coerce')
df['G']= pd.to_numeric(df.GP, errors='coerce')
df['A']= pd.to_numeric(df.GP, errors='coerce')
df['PTS']= pd.to_numeric(df.GP, errors='coerce')
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
selector=df.GP.isnull() & df.G.isnull()& df.A.isnull() & df.PTS.isnull()
Locate the NaN data using the index
#selector
bad_rows=df[selector].index
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names
df.drop(bad_rows, inplace=True)
Delete the 'RK' column:
del df['RK']
df.head()
PLAYER | TEAM | GP | G | A | PTS | +/- | PIM | PTS/G | SOG | PCT | GWG | \G | \A | +G | +A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Connor McDavid, C | EDM | 82.0 | 82.0 | 82.0 | 82.0 | 20 | 26 | 1.32 | 274 | 15.0 | 7 | 5 | 15 | 1 | 3 |
2 | Claude Giroux, LW | PHI | 82.0 | 82.0 | 82.0 | 82.0 | 28 | 20 | 1.24 | 193 | 17.6 | 1 | 9 | 27 | 0 | 0 |
3 | Nikita Kucherov, RW | TB | 80.0 | 80.0 | 80.0 | 80.0 | 15 | 42 | 1.25 | 279 | 14.0 | 7 | 8 | 28 | 0 | 0 |
4 | Evgeni Malkin, C | PIT | 78.0 | 78.0 | 78.0 | 78.0 | 16 | 87 | 1.26 | 239 | 17.6 | 7 | 14 | 24 | 0 | 0 |
5 | Nathan MacKinnon, C | COL | 74.0 | 74.0 | 74.0 | 74.0 | 11 | 55 | 1.31 | 284 | 13.7 | 12 | 12 | 20 | 0 | 1 |
Make sure there are no holes in your index by resetting it. There is an example of this in the reading material. By the way, drop the original index.
df.reset_index()
df.head()
PLAYER | TEAM | GP | G | A | PTS | +/- | PIM | PTS/G | SOG | PCT | GWG | \G | \A | +G | +A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Connor McDavid, C | EDM | 82.0 | 82.0 | 82.0 | 82.0 | 20 | 26 | 1.32 | 274 | 15.0 | 7 | 5 | 15 | 1 | 3 |
2 | Claude Giroux, LW | PHI | 82.0 | 82.0 | 82.0 | 82.0 | 28 | 20 | 1.24 | 193 | 17.6 | 1 | 9 | 27 | 0 | 0 |
3 | Nikita Kucherov, RW | TB | 80.0 | 80.0 | 80.0 | 80.0 | 15 | 42 | 1.25 | 279 | 14.0 | 7 | 8 | 28 | 0 | 0 |
4 | Evgeni Malkin, C | PIT | 78.0 | 78.0 | 78.0 | 78.0 | 16 | 87 | 1.26 | 239 | 17.6 | 7 | 14 | 24 | 0 | 0 |
5 | Nathan MacKinnon, C | COL | 74.0 | 74.0 | 74.0 | 74.0 | 11 | 55 | 1.31 | 284 | 13.7 | 12 | 12 | 20 | 0 | 1 |
Do the data-types of each column reflect the values you see when you look through the data using a text editor / spread sheet program? If you see object where you expect to see int32 or float64, that is a good indicator that there might be a string or missing value or erroneous value in the column.
Check the data type of all columns, and ensure those that should be numeric are numeric.
df.dtypes
PLAYER object
TEAM object
GP float64
G float64
A float64
PTS float64
+/- object
PIM object
PTS/G object
SOG object
PCT object
GWG object
\G object
\A object
+G object
+A object
dtype: object
Convert argument to a numeric type
df['+/-']= pd.to_numeric(df['+/-'], errors='coerce')
df['PIM']= pd.to_numeric(df.PIM, errors='coerce')
df['PTS/G']= pd.to_numeric(df['PTS/G'], errors='coerce')
df['SOG']= pd.to_numeric(df.SOG, errors='coerce')
df['PCT']= pd.to_numeric(df.PCT, errors='coerce')
df['GWG']= pd.to_numeric(df['GWG'], errors='coerce')
df['\G']= pd.to_numeric(df['\G'], errors='coerce')
df['\A']= pd.to_numeric(df['\A'], errors='coerce')
df['+G']= pd.to_numeric(df['+G'], errors='coerce')
df['+A']= pd.to_numeric(df['+A'], errors='coerce')
df.head()
PLAYER | TEAM | GP | G | A | PTS | +/- | PIM | PTS/G | SOG | PCT | GWG | \G | \A | +G | +A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Connor McDavid, C | EDM | 82.0 | 82.0 | 82.0 | 82.0 | 20 | 26 | 1.32 | 274 | 15.0 | 7 | 5 | 15 | 1 | 3 |
2 | Claude Giroux, LW | PHI | 82.0 | 82.0 | 82.0 | 82.0 | 28 | 20 | 1.24 | 193 | 17.6 | 1 | 9 | 27 | 0 | 0 |
3 | Nikita Kucherov, RW | TB | 80.0 | 80.0 | 80.0 | 80.0 | 15 | 42 | 1.25 | 279 | 14.0 | 7 | 8 | 28 | 0 | 0 |
4 | Evgeni Malkin, C | PIT | 78.0 | 78.0 | 78.0 | 78.0 | 16 | 87 | 1.26 | 239 | 17.6 | 7 | 14 | 24 | 0 | 0 |
5 | Nathan MacKinnon, C | COL | 74.0 | 74.0 | 74.0 | 74.0 | 11 | 55 | 1.31 | 284 | 13.7 | 12 | 12 | 20 | 0 | 1 |
df.dtypes
PLAYER object
TEAM object
GP float64
G float64
A float64
PTS float64
+/- int64
PIM int64
PTS/G float64
SOG int64
PCT float64
GWG int64
\G int64
\A int64
+G int64
+A int64
dtype: object
Try use your_data_frame['your_column'].unique() or equally, your_data_frame.your_column.unique() to see the unique values of each column and identify the rogue values.
If you find any value that should be properly encoded to NaNs, you can convert them either using the na_values parameter when loading the dataframe. Or alternatively, use one of the other methods discussed in the reading.
df.PCT.unique()
array([15. , 17.6, 14. , 13.7, 17.5, 13. , 9.4, 11.7, 13.8, 12.7, 18.7,
12.9, 8.8, 14.4, 16.3, 10.6, 11.8, 14.2, 14.9, 23.4, 11.9, 10.5,
17.4, 9.5, 10.8, 10.1, 12.4, 18.3, 13.3, 14.8, 13.2, 11.3, 6.3])
Figure out which indexing method you would need to use in order to index your dataframe with: [2:4, 'col3']. Finally, display the results:
df.iloc[2:4, [3]]
G | |
---|---|
3 | 80.0 |
4 | 78.0 |