- Pipes and directing output
- Essential utilities
  - Explore (head, tail, more, less, grep)
  - Transform (sed, awk, cut, tr, sort, join)
  - Schedule (cron, watch)
  - Visualize (gnuplot)
- Regular Expressions
- Git and version control
- Data Structures
  - Dictionaries and Hash Tables
  - Trees (binary, balanced, splay, B)
  - Heaps
  - Stacks and Queues
  - Graphs and Networks
  - Sets
- Algorithms
  - Search (BFS, DFS, A*, Dijkstra's)
  - Sorting (merge, quick, heap, radix)
  - Selection
- Performance (Asymptotic Analysis, hardware restrictions, indexing, etc.)
- HTTP
- APIs and REST
- HTML and XML
- Parsing (CSS and XPath)
- Web Scraping
- PDF parsing
- Descriptive statistics (mean, mode, variance, skew, etc.)
- Estimation (confidence intervals, bias and error, etc.)
- Correlation (covariance, goodness of fit, causation, etc.)
- Distributions (PMF, CDF, Normal, Binomial, convolution, etc.)
- Significance (Hypothesis testing, p-value, ANOVA, etc.)
- Bayesian Statistics
- Monte Carlo Methods
- Sampling
- Feature Preparation
  - Vectorization (binning, bag of words, tf-idf)
  - Selection (automatic and manual)
  - Normalization
  - Regularization and Smoothing
- Natural Language Processing
  - N-grams
  - Tokenization
  - Sentiment Analysis
- Information Retrieval
- SQL
- NoSQL (document, graph, key-value)
- Filesystem and Text
- Grammar of Graphics (ggplot2, Bokeh)
- Interactivity
- Geographic display
- MapReduce paradigm (Hadoop)
- Distributed Datastores (HDFS, Cassandra, HBase)
- Hadoop Ecosystem (Pig, Hive, HBase, Flume, Sqoop, etc.)
- Real-Time (Spark, Storm, Shark)
- Distributed Machine Learning
- Unsupervised
  - Clustering (K-means, Hierarchical, etc.)
  - Association Analysis (FP-Growth, MDS, etc.)
  - Dimensionality Reduction (PCA, SVD, etc.)
- Supervised
  - Classification (Naive Bayes, kNN, Logistic Regression, etc.)
  - Regression (Linear, Polynomial, Tree, etc.)
- Recommendation
  - Similarity metrics (Jaccard, Pearson, Euclidean, etc.)
  - Item vs. User vs. Content based
  - Limitations (Cold-start problem, preference collection, performance)
- Optimization (cost functions, hill climbing, simulated annealing, etc.)
- Anomaly Detection and Time Series
- Evaluation
  - Cross Validation
  - ROC plot
  - Bias vs. Variance
  - Recall vs. Precision
  - Bootstrap