-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathinfo.otl
229 lines (229 loc) · 9.7 KB
/
info.otl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
info.otl clpoda 2016_0425
# Time-stamp: <Thu 2016 May 12 05:06:21 PMPM clpoda>
Intro
Some notes about Dan Bikle's Data Science for Time Series course, Spring 2016.
Session topics, summary notes.
Session 1, Histograms, scatter plots, correlation, time series calculations, linear regression.
Session 2, AWS, Linux, git, heroku.
Session 3, Vectors, SQL, Pandas, NumPy.
Session 4, Machine Learning, Py scikit-learn, naive bayes, linear regr, logistical regr, SVM, GBRT, Neural Nets.
Session 5, R language.
Session 6, Convolutional Neural Network with Java Script.
Session 7, Predictions from features from time series.
Session 8, Visualizations of predictions.
Session 9, Serve visualizations to the web.
Spark; Tableau; lab session.
Session 10, Deploy app to AWS and heroku which predicts a financial market.
Security
Assume all SCAE lab machines are infected with keystroke capture & other malware.
Do not use any valuable passwords in the lab.
Terms, acronyms
accuracy, a measurement of prediction; (number of correct predictions / total predictions).
array, A type in a language; a collection of items; Python has lists instead of arrays.
AWS, Amazon Web Services; cloud-based computing services; start with S3 and EC2.
classification, One type of a problem, eg, only predict if next point is up or down from prior.
correlation, tbd
CRUD, Create, Retrieve, Update, Delete; basic concepts to understand a system.
CCLUD, Create, Copy, List, Update, Delete; basic concepts to understand a system.
dataframe, tbd
django, tbd
DSTS, Data Science, Time Series.
EC2, Elastic Compute Cloud, part of AWS; provides scalable virtual private servers.
effectiveness, a measurement of prediction; tbd
feature, tbd. See lag, pctlag, lead, pctlead.
git, Distributed version control system; subcmds include init, add, commit, clone, push, pull.
github, Web site to store files using git; large collection of freely-avbl tools & data.
Hadoop, tbd
Heroku, Cloud-based platform-as-a-service; push a git repo to it, & it runs the code.
IAAS, Infrastructure as a service, eg, AWS.
ipython, an advanced interactive Python interpreter.
label, tbd, title of a column of data.
linear regression, tbd
median, tbd, the middle value of a collection, where an equal number of values are above and below it.
Sort by value and pick the middle.
neural network, tbd
nodejs, JS on server.
numpy, tbd
numpy array, Behaves like a math vector.
overfitting, Too many features in the model for the available number of observations.
PAAS, Platform as a service, eg, Heroku.
pandas, tbd
pctlag, A statistical feature, eg, pctlag1 is the percent change over 1 day (yesterday to today).
pctlagN, The percent change over N days (from N days ago to today).
pctlead, A statistical feature, the predicted percent change from today to tomorrow.
predicate, tbd
pyspy, tbd
regression, One type of a problem, eg, predict both direction & magnitude of next point.
S3, Simple Storage Service, part of AWS; provides web-based storage.
SCAE, Santa Clara Adult Education; site of the course, in lab room K1.
scikit-learn, tbd
Spark, tbd
SPY, Symbol for ETF based on S & P 500 index.
SQL, Structured Query Language, used for dbase i/f.
ssh, Secure shell; connect to remote computers; more secure than telnet.
syntax.us, One of Dan B's domains where code, notes, & data reside.
TA, Technical Analysis.
vector, A container filled with numbers; eg, a row of a table; can be nested.
Links, References
http://www.syntax.us/posts/python_preso
python_preso (2016_0224); setup & run for session 1.
vbox, linux, anaconda3, ta-lib, pyspy, herokuspy.
https://github.com/danbikle/
pyspy, herokuspy, sklearnem.
https://clp-hspy10.herokuapp.com/
Built at SCAE lab & pushed to heroku.
https://clp-hspy11.herokuapp.com/
Built at KLab & pushed to heroku.
https://clp-d1d2.herokuapp.com/
Built at SCAE lab & pushed to heroku.
https://clp-2-d1d2.herokuapp.com/
Built at KLab & pushed to heroku.
truefx.com, foreign exchange data, updated each minute.
Much more data than SPY's single closing price at end of day.
numpy, Search for 'cs231 numpy tutorial'
http://www.kdnuggets.com/2016/04/pocket-guide-data-science.html
Good, brief overview of some steps in a data analysis process.
http://www.kdnuggets.com/2016/04/datacamp-learning-python-data-analysis-data-science.html
Concepts, ideas, algorithms
How much data needed by machine learning s/w?
Depends on number of features; rule-of-thumb:
For 1-2 features, need 100 observations.
For n>2 features, need 10**n observations.
Stock market data: about 252 trading days/year.
tbd
Procedures
This section does not contain all steps of all procedures.
It just gives some highlights and some key steps.
More details are at the noted pages for each item.
Install and configure virtual box & VM for Ubuntu o/s.
tbd
Backup vbox VM to USB stick for using elsewhere.
virtualbox
See Oracle VM vbox manager screen.
Select the desired VM & power it off if it is on.
Main vbox mgr menu:/ file/ export appliance
Copy to USB stick for backup & importing elsewhere.
Predict SPY using pyspy.
From session 1, 3/5/16.
Note.1. See details at syntax.us, python_preso (2016_0224), they are not copied here.
Get pyspy s/w from github.
mkdir -p ~/ddata
cd ~
git clone https://github.com/danbikle/pyspy
Edit pyspy/bin/noon.bash (or night.bash):
Set range of data to retrieve, eg, 'gen_train.py 1987 2015 ...'
Set range of data to predict, eg, 'gen_test.py 2015 2017 ...'
Run pyspy/bin/noon.bash
Get current data about the stock or index to predict; & plot predictions.
Open the learn_test.png o/p file to see results.
Deploy s/w to heroku.com & use it.
From session 2, 3/12/16.
Note.1. See details at syntax.us, python_preso (2016_0224), they are not copied here.
Note.2. Running Python code on heroku might require detailed understanding of the system & tools.
Create acct at heroku.com.
Get herokuspy s/w from github.
cd ~
git clone https://github.com/danbikle/herokuspy
Get heroku toolbelt.
wget https://s3.amazonaws.com/assets.heroku.com/heroku-client/heroku-client.tgz
Configure ssh, postgres, django, and heroku per the python_preso page.
Test it by opening a web browser to localhost:5000 to see the data.
Deploy to heroku.
cd ~/herokuspy
~/heroku-client/bin/heroku create <h_name>
h_name must be unique; you may be warned if it exists already.
git push heroku master
~/heroku-client/bin/heroku run python manage.py migrate
See the expected web page at this URL:
https://<h_name>.herokuapp.com
Deploy s/w to AWS & use it.
Create acct at aws.amazon.com.
tbd
Make predictions using scikit-learn with different techniques, & compare them.
From session 4, 3/26/16.
Note.1. See details at syntax.us, sklearnem (2016_0326), they are not copied here.
Get sklearnem s/w from github.
cd ~
git clone https://github.com/danbikle/sklearnem
Execute the demo s/w.
cd sklearnem
./runem.bash
See the o/p data printed to the screen after each demo is run.
Convert csv data to format for scikit-learn.
csv -> pandas() -> dataframe -> numpy() -> array -> scikit-learn().
csv: process w/ pandas; o/p is dataframe.
df: process w/ numpy; o/p is array.
array: process w/ scikit-learn.
Code, Python
tbd
Code, Python and Javascript keywords
Python Javascript
dict object
key property
value value
list array
print console.log('...')
# comment // comment
Python Javascript
Code, Javascript
tbd
Code, Other
tbd
Commands, SQL
\quit
Commands, Postgres (different from SQL cmds)
tbd
Infrastructure, define & setup (o/s, VM, tools, apps).
tbd
Tools
Anaconda, Python & R languages; pkgs avbl via proprietary pkg mgr conda.
ipython, an interactive python interpreter that offers more than 'python'.
run /home/.../learn_test.py ...
run -d /home/.../learn_test.py ...
help()
Python
Ubuntu Linux, 64-bit, v14.04 LTS.
Ubuntu Unity desktop environment.
Click top 'Search' icon on menu bar (at left edge) to open search field.
Type 'terminal' then press Enter to find icons to start X Terminal w/ shell.
Type 'chrom' then press Enter to find icons to start Chrome or Chromium browser.
VirtualBox
Install Guest Additions to configure video & resolution properly, etc.
VboxMenu/ Devices/ Install Guest Adds.
When pointer is active inside the VM:
Press Right-Ctrl key at keyboard to release pointer from VM;
to allow mouse & keybd to talk to the host o/s.
Project ideas.
Study & learn pivotal tracker in depth.
: Mon2016_0509_12:19
How to store details abt implementation, design choices, specs?
User stories & acceptance tests.
Any place for data that's outside the user story paradigm?
Use story:task records for details?
How to find them when needed?
Where to put general details that apply to many stories?
Some links can be made w/ src code repository.
Build permuted index of some data sci terms.
: Mon2016_0509_12:17
Find good dictionary for source text.
wikipedia.
wordnik.
Build an indexing tool & use it.
Post to nods site.
Evaluate prediction performance of some models.
: Mon2016_0509_12:17
Test multiple algorithms & training data periods ea day.
Test w/ different number & combination of features.
Write tool to run many tests with a simple cmd.
Download data & run tests locally.
Post the results daily, over time.
Make investment decisions each day.
Use SPY for public results.
Test DB's stmt abt Apple predicts.
Using older data will not give good predicts b/c company today is very different.
Consider using other stocks & indexes for private results.
Vanguard index funds.
Forex.
Find & publish how to get data from forex site w/o login?
Misc notes
tbd