This repository holds the course materials for the Spring 2018 edition of Statistics 154: Modern Statistical Prediction and Machine Learning at UC Berkeley.
- Instructor: Gaston Sanchez, gasigiri [at] berkeley [dot] edu
- Class Time: MWF 11-12pm in 180 Tan
- Session Dates: 01/17/18 - 05/04/18
- Code #: 30887
- Units: 4 (more info here)
- Office Hours: MW 2:15-3:15pm in 309 Evans (or by appointment)
- Piazza:
- Final: Tue May 8, 7-10pm (room TBD)
- GSI: Omid Solari (Mon. 5-6pm, Wed. 8-10am in 444 Evans)

| Lab | Date | Room | GSI |
|-----|------|------|-----|
| 101 | M 12-2pm | 334 Evans | Omid Solari |
| 102 | M 3-5pm | 334 Evans | Omid Solari |

This is an introductory-level course in statistical learning, with an emphasis on regression and classification methods, and a pinch of unsupervised methods. Time permitting, the course covers the following topics (not necessarily in the order shown; see the syllabus for more info):
- Process of predictive model building
- Data Preprocessing
- Regression Models
    - Linear models
    - Non-linear models (time permitting)
    - Tree-based methods
- Classification Models
    - Linear models
    - Non-linear models
    - Tree-based methods
    - Support Vector Machines (time permitting)
- Unsupervised methods like PCA and Clustering
- Data spending: splitting and resampling methods
- Bias-Variance Trade-off
- Model Assessment
- Model Selection
Throughout the semester we will explore the predictive modeling lifecycle, including question formulation, data preprocessing, exploratory data analysis and visualization, model building, model assessment/validation, model selection, and decision-making.
- Multivariate calculus or the equivalent, esp. partial derivatives; e.g. Math 53
- Linear algebra or the equivalent (matrices, vector spaces); e.g. Math 54
- Statistical inference or the equivalent; e.g. Stat 135
- Scripting experience in R required; e.g. Stat 133
This course will build on a lot of material from matrix algebra. In particular, you should be comfortable with notions such as vector spaces, inner products, norms, matrix products/transpose/rank/determinants/inverses, as well as matrix decompositions.
You should also have some scripting experience, preferably in R, at the level of writing functions, conditionals (if-then-else structures), for loops, and while loops, as well as sampling, reading in data sets, and exporting results.
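As a rough gauge of that scripting level, here is a minimal sketch in R. It is not taken from the course materials: the helper `split_indices()`, the seed, the 80/20 split, and the choice of the built-in `mtcars` data are all assumptions made for illustration.

```r
# Hypothetical helper (not from the course): split row indices into
# a training set and a test set.
split_indices <- function(n, train_frac = 0.8, seed = 154) {
  # Conditionals: validate the inputs before doing any work
  if (n < 2) {
    stop("need at least two observations")
  } else if (train_frac <= 0 || train_frac >= 1) {
    stop("train_frac must be strictly between 0 and 1")
  }
  set.seed(seed)
  n_train <- floor(train_frac * n)
  train <- sample(n, size = n_train)  # sampling without replacement from 1:n
  list(train = sort(train), test = setdiff(seq_len(n), train))
}

# Read in a data set: mtcars ships with R, so it is written to a
# temporary CSV file and read back to mimic loading external data.
infile <- tempfile(fileext = ".csv")
write.csv(mtcars, infile, row.names = FALSE)
dat <- read.csv(infile)

idx <- split_indices(nrow(dat))

# For loop: fit one single-predictor linear model per candidate variable
# and record its mean squared error on the held-out test rows.
predictors <- c("wt", "hp", "disp")
test_mse <- numeric(length(predictors))
names(test_mse) <- predictors
for (p in predictors) {
  fit <- lm(reformulate(p, response = "mpg"), data = dat[idx$train, ])
  pred <- predict(fit, newdata = dat[idx$test, ])
  test_mse[p] <- mean((dat$mpg[idx$test] - pred)^2)
}

# Export results to a CSV file
outfile <- tempfile(fileext = ".csv")
write.csv(data.frame(predictor = predictors, test_mse = test_mse),
          outfile, row.names = FALSE)
```

If most of this reads comfortably (a function with input checks, `sample()`, a loop over models, and CSV input/output), your scripting background is probably in the right range.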
Last but not least, it helps to know the basics of R Markdown (Rmd) files and to have some knowledge of LaTeX, especially experience writing math symbols and equations.
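To give a sense of the kind of LaTeX involved, the following illustrative snippet typesets the ordinary least squares estimator. The formula and notation are assumptions chosen for illustration, not taken from the course materials: $\mathbf{X}$ is an $n \times p$ design matrix with full column rank and $\mathbf{y}$ is the response vector.

```latex
% Illustrative only: OLS estimator, assuming X has full column rank
\[
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2
  = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
\]
```

The same expression touches several of the matrix algebra notions listed above: norms, transposes, products, and inverses.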
There is no official textbook for this course, although we will use the following texts as supporting material:
An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani. Springer, 2013. It is freely available online in PDF format (courtesy of the authors).
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Springer, 2009 (2nd Ed). This book is more mathematically and conceptually advanced than ISL. It is freely available online in PDF format (courtesy of the authors).
Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Springer, 2013.
Data Mining and Statistics for Decision Making by Stephane Tuffery. Wiley, 2011.
By the end of the course, we expect you to:
- Have a basic, yet solid, understanding of the predictive modeling process/lifecycle.
- Be able to read a well-described algorithm, and write code to implement it computationally (in R).
- Know the pros and cons of each predictive technique.
- Be able to describe (to non-professionals) what a predictive technique is doing.
- We will be using a combination of materials such as slides, tutorials, reading assignments, and chalk-and-talk.
- The main computational tool will be the computing and programming environment R.
- The main workbench will be the IDE RStudio. You will also use a terminal emulator to work with the command line.
- Please read the course logistics and policies for more details about the structure of the course, dos and don'ts, etc.
Unless otherwise noted, this work, by Gaston Sanchez, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Author: Gaston Sanchez