This is a collaborative attempt to define what belongs in a data science curriculum and, in doing so, to advance the field. To contribute, fork this repo and submit a pull request, or open an issue.

vaibhavsanjaylalka/data-science-curriculum

Curriculum

Utilities: Shell and POSIX

  • Pipes and directing output
  • Essential utilities
    • Explore (head, tail, more, less, grep)
    • Transform (sed, awk, cut, tr, sort, join)
    • Schedule (cron, watch)
    • Visualize (gnuplot)
  • Regular Expressions
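
As a rough illustration of how the explore and transform utilities compose, the sketch below mimics a `grep | awk | sort | uniq -c` pipeline in Python. The log lines, the field layout, and the function name `count_status_codes` are made up for this example and do not come from the curriculum.

```python
import re
from collections import Counter

# Hypothetical log lines, invented purely for illustration.
LOG_LINES = [
    "2024-01-01 GET /index.html 200",
    "2024-01-01 GET /missing 404",
    "2024-01-02 POST /login 200",
    "2024-01-02 GET /missing 404",
]

def count_status_codes(lines, pattern=r"\bGET\b"):
    """Keep lines matching a regex, take the last field, count each value."""
    matcher = re.compile(pattern)
    filtered = (line for line in lines if matcher.search(line))  # like grep
    codes = (line.split()[-1] for line in filtered)              # like awk '{print $NF}'
    return Counter(codes)                                        # like sort | uniq -c

status_counts = count_status_codes(LOG_LINES)
print(dict(status_counts))
```

Each generator stage consumes the previous one lazily, which is loosely analogous to how a shell pipe streams lines between processes.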

Software Engineering

  • Git and version control
  • Data Structures
    • Dictionaries and Hash Tables
    • Trees (binary, balanced, splay, B)
    • Heaps
    • Stacks and Queues
    • Graphs and Networks
    • Sets
  • Algorithms
    • Search (BFS, DFS, A*, Dijkstra's)
    • Sorting (merge, quick, heap, radix)
    • Selection
  • Performance (Asymptotic Analysis, hardware restrictions, indexing, etc.)
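
To make the data structures and search topics concrete, here is a minimal sketch of breadth-first search over a graph stored as a dictionary of adjacency lists, using a queue. The graph contents and the name `bfs_shortest_path` are assumptions made for illustration.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return a fewest-edges path from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}   # doubles as the visited set
    queue = deque([start])    # FIFO queue gives breadth-first order
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical directed graph for illustration.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
path = bfs_shortest_path(GRAPH, "A", "E")
print(path)
```

With a hash-based dictionary and a deque, each vertex and edge is handled once, so the running time is O(V + E).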

Acquire

  • HTTP
  • APIs and REST
  • HTML and XML
  • Parsing (CSS and XPath)
  • Web Scraping
  • PDF parsing
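
The parsing topic can be sketched with the standard library alone. The example below applies the limited XPath subset supported by `xml.etree.ElementTree` to a small XML document; the document and the helper names are invented for illustration, and real scraping work would more likely use a fuller HTML/XPath library.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, invented for illustration.
CATALOG_XML = """<catalog>
  <book id="b1"><title>First Title</title><price>30.00</price></book>
  <book id="b2"><title>Second Title</title><price>25.50</price></book>
</catalog>"""

def extract_titles(xml_text):
    """Return the text of every <title> directly under a <book>."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./book/title")]

def find_price(xml_text, book_id):
    """Look up one book's price using an attribute predicate."""
    root = ET.fromstring(xml_text)
    node = root.find(f"./book[@id='{book_id}']/price")
    return float(node.text) if node is not None else None

titles = extract_titles(CATALOG_XML)
print(titles, find_price(CATALOG_XML, "b2"))
```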

Statistics

  • Descriptive statistics (mean, mode, variance, skew, etc.)
  • Estimation (confidence intervals, bias and error, etc.)
  • Correlation (covariance, goodness of fit, causation, etc.)
  • Distributions (PMF, CDF, Normal, Binomial, convolution, etc.)
  • Significance (Hypothesis testing, p-value, ANOVA, etc.)
  • Bayesian Statistics
  • Monte Carlo Methods
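
Estimation and Monte Carlo methods meet in the bootstrap. The sketch below computes a percentile bootstrap confidence interval for a mean; the sample values, the default of 2000 resamples, and the function name are assumptions chosen for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements, invented for illustration.
SAMPLE = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0]
ci_low, ci_high = bootstrap_mean_ci(SAMPLE)
print(statistics.fmean(SAMPLE), (ci_low, ci_high))
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the shortest to write down.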

Transform

  • Sampling
  • Feature Preparation
    • Vectorization (binning, bag of words, tf-idf)
    • Selection (automatic and manual)
    • Normalization
    • Regularization and Smoothing
  • Natural Language Processing
    • N-grams
    • Tokenization
    • Sentiment Analysis
    • Information Retrieval
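
Tokenization, n-grams, and tf-idf vectorization can be sketched in a few lines. The version below uses the unsmoothed weighting tf × log(N / df), which is one common variant among several; the tokenizer regex and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed tf-idf)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical corpus, invented for illustration.
DOCS = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tf_idf(DOCS)
print(ngrams(tokenize(DOCS[0])))
```

A term that appears in every document, such as "the" here, gets weight zero, which is the intended effect of the inverse document frequency factor.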

Store

  • SQL
  • NoSQL (document, graph, key-value)
  • Filesystem and Text
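
For the SQL topic, a minimal sketch using Python's built-in `sqlite3` module with an in-memory database shows table creation, parameterized inserts, and a grouped aggregate. The table name, columns, and rows are invented for illustration.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?)",  # placeholders avoid SQL injection
        [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    )
    conn.commit()
    return conn

def mean_by_sensor(conn):
    """Return {sensor: average value} using GROUP BY."""
    rows = conn.execute(
        "SELECT sensor, AVG(value) FROM measurements "
        "GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    return dict(rows)

conn = build_demo_db()
averages = mean_by_sensor(conn)
print(averages)
```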

Visualize and Present

  • Grammar of Graphics (ggplot2, Bokeh)
  • Interactivity
  • Geographic display

Data at Scale

  • MapReduce paradigm (Hadoop)
  • Distributed Datastores (HDFS, Cassandra, HBase)
  • Hadoop Ecosystem (Pig, Hive, HBase, Flume, Sqoop, etc.)
  • Real-Time (Spark, Storm, Shark)
  • Distributed Machine Learning
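
The MapReduce paradigm can be illustrated on a single machine. The sketch below runs the map, shuffle, and reduce phases of a word count in plain Python; it does not use Hadoop or any distributed framework, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["a b a", "b c"])
print(counts)
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data across the network; only the shape of the computation carries over from this sketch.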

Machine Learning

  • Unsupervised
    • Clustering (K-means, Hierarchical, etc.)
    • Association Analysis (FP-Growth, MDS, etc.)
    • Dimensionality Reduction (PCA, SVD, etc.)
  • Supervised
    • Classification (Naive Bayes, kNN, Logistic Regression, etc.)
    • Regression (Linear, Polynomial, Tree, etc.)
  • Recommendation
    • Similarity metrics (Jaccard, Pearson, Euclidean, etc.)
    • Item vs. User vs. Content based
    • Limitations (Cold-start problem, preference collection, performance)
  • Optimization (cost functions, hill climbing, simulated annealing, etc.)
  • Anomaly Detection and timeseries
  • Evaluation
    • Cross Validation
    • ROC plot
    • Bias vs. Variance
    • Recall vs. Precision
    • Bootstrap
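
Two of the evaluation topics can be sketched without any library: precision and recall for a binary classifier, and the index splits used by k-fold cross validation. The labels, the interleaved fold assignment, and the function names are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical labels and predictions, invented for illustration.
Y_TRUE = [1, 0, 1, 1, 0]
Y_PRED = [1, 1, 0, 1, 0]
precision, recall = precision_recall(Y_TRUE, Y_PRED)
splits = list(k_fold_indices(6, 3))
print(precision, recall, splits[0])
```

Every sample lands in exactly one test fold, so each observation is used for validation once and for training k - 1 times.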

