```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```

# (PART) Model {-}

# Introduction

The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest) and ignore "noise" (i.e. random variation that you're not interested in). Here we only cover "predictive" models, which, as the name suggests, generate predictions. There is another type of model that we're not going to discuss: "data discovery" models. These models don't make predictions, but instead help you discover interesting relationships within your data.

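To make this concrete, here is a small illustration that is not part of the original text: we simulate a dataset where the signal is a known straight line (the intercept 2 and slope 3 are invented for the example) and the noise is random scatter around it. A fitted linear model summarises a hundred observations with just two numbers, and those numbers should land close to the true signal.

```{r}
# Simulate data where the "signal" is the line y = 2 + 3x and the
# "noise" is random variation around it
sim <- data.frame(x = runif(100, 0, 10))
sim$y <- 2 + 3 * sim$x + rnorm(100, sd = 2)

# The fitted coefficients should be close to the true values (2 and 3),
# capturing the signal while ignoring the noise
coef(lm(y ~ x, data = sim))
```
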
This book is not going to give you a deep understanding of the mathematical theory that underlies models. It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:

* In [model basics], you'll learn how models work, focussing on the important
  family of linear models. You'll learn general tools for gaining insight
  into what a predictive model tells you about your data, focussing on simple
  simulated datasets.

* In [model building], you'll learn how to use models to pull out known
  patterns in real data. Once you have recognised an important pattern,
  it's useful to make it explicit in a model, because then you can more
  easily see the subtler signals that remain.

* In [many models], you'll learn how to use many simple models to help
  understand complex datasets. This is a powerful technique, but to access
  it you'll need to combine modelling and programming tools.

* In [model assessment], you'll learn a little bit about how you might
  quantitatively assess whether a model is good or not. You'll learn two
  powerful techniques, cross-validation and bootstrapping, that are built
  on the idea of generating many random datasets that you fit many
  models to (see the sketch after this list).

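As a preview of that resampling idea, here is a minimal sketch that is not code from the book; `mtcars` and the `mpg ~ wt` model are arbitrary stand-ins. We fit the same model to many bootstrap resamples of the data and look at how much the estimate varies from resample to resample.

```{r}
# Fit the same model to many bootstrap resamples of mtcars and keep
# the slope estimate from each fit
slopes <- replicate(100, {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  coef(lm(mpg ~ wt, data = boot))[["wt"]]
})

# The spread of the resampled slopes shows how stable the estimate is
quantile(slopes, c(0.025, 0.5, 0.975))
```
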
In this book, we are going to use models as a tool for exploration, completing the trifecta of EDA tools introduced in Part 1. This is not how models are usually taught, but they make for a particularly useful tool in this context. Every exploratory analysis will involve some transformation, modelling, and visualisation.

## Exploring vs. confirming

Models are more commonly taught as tools for doing inference, or for confirming that a hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:

1. Each observation can either be used for exploration or confirmation,
   not both.

1. You can use an observation as many times as you like for exploration,
   but you can only use it once for confirmation. As soon as you use an
   observation twice, you've switched from confirmation to exploration.

This is necessary because to confirm a hypothesis you must use data that is independent of the data that you used to generate the hypothesis. Otherwise you will be over-optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading. If you are serious about doing a confirmatory analysis, before you begin the analysis you should split your data into three pieces:

1. 60% of your data goes into a __training__ (or exploration) set. You're
   allowed to do anything you like with this data: visualise it and fit tons
   of models to it.

1. 20% goes into a __query__ set. You can use this data to compare models
   or visualisations by hand, but you're not allowed to use it as part of
   an automated process.

1. 20% is held back for a __test__ set. You can only use this data ONCE, to
   test your final model.

This partitioning allows you to explore the training data, occasionally generating candidate hypotheses that you check with the query set. When you are confident you have the right model, you can check it once with the test data.
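
A minimal sketch of such a split in base R, with `mtcars` standing in for your own data (the proportions follow the text, up to rounding):

```{r}
# Split a dataset into training (60%), query (20%), and test (20%) sets
# by shuffling the row indices once and cutting them into three pieces
set.seed(123)
n <- nrow(mtcars)
idx <- sample(n)
train <- mtcars[idx[seq_len(round(0.6 * n))], ]
query <- mtcars[idx[seq(round(0.6 * n) + 1, round(0.8 * n))], ]
test  <- mtcars[idx[seq(round(0.8 * n) + 1, n)], ]
```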