
This is a collaborative attempt to define what belongs in a data science curriculum in order to advance the field. To contribute, fork this repo and submit a pull request, or open an issue.


vaibhavsanjaylalka/data-science-curriculum

 
 


Curriculum

Utilities: Shell and POSIX

  • Pipes and directing output
  • Essential utilities
    • Explore (head, tail, more, less, grep)
    • Transform (sed, awk, cut, tr, sort, join)
    • Schedule (cron, watch)
    • Visualize (gnuplot)
  • Regular Expressions

Software Engineering

  • Git and version control
  • Data Structures
    • Dictionaries and Hash Tables
    • Trees (binary, balanced, splay, B)
    • Heaps
    • Stacks and Queues
    • Graphs and Networks
    • Sets
  • Algorithms
    • Search (BFS, DFS, A*, Dijkstra's)
    • Sorting (merge, quick, heap, radix)
    • Selection
  • Performance (Asymptotic Analysis, hardware restrictions, indexing, etc.)
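
As one illustration of the search and data structure topics above, here is a minimal sketch of breadth-first search over a graph stored as a dictionary of adjacency lists. The function name `bfs_shortest_path` and the sample graph are hypothetical, chosen only for this example; the sketch assumes an unweighted graph, where BFS finds a path with the fewest edges.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])   # FIFO queue gives level-by-level exploration
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor == goal:
                    # Walk the parent links back to the start.
                    path = []
                    while neighbor is not None:
                        path.append(neighbor)
                        neighbor = parents[neighbor]
                    return path[::-1]
                queue.append(neighbor)
    return None


# Hypothetical directed graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
path = bfs_shortest_path(graph, "a", "e")
print(path)
```

The dictionary lookup and the deque each take constant time per operation, so the whole search runs in time proportional to the number of nodes plus edges.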

Data Acquisition

  • HTTP
  • APIs and REST
  • HTML and XML
  • Parsing (CSS and XPath)
  • Web Scraping
  • PDF parsing
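
To make the HTML parsing and scraping topics above concrete, here is a minimal sketch that extracts link targets from an HTML string using only Python's standard-library `html.parser`. The class name `LinkExtractor` and the sample markup are assumptions for illustration; a real scraper would first fetch the page over HTTP and would likely use CSS or XPath selectors from a third-party library.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content used only for illustration.
html = (
    '<html><body><a href="/about">About</a> '
    '<a href="https://example.com/data.csv">Data</a>'
    '<a name="x">no link</a></body></html>'
)
parser = LinkExtractor()
parser.feed(html)
links = parser.links
print(links)
```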

Statistics and Probability

  • Descriptive statistics (mean, mode, variance, skew, etc.)
  • Estimation (confidence intervals, sampling, etc.)
  • Correlation (covariance, goodness of fit, causation, etc.)
  • Distributions
    • PMF, PDF, CDF, CMF
    • Histograms and Scatterplots
    • Normal, Binomial, Exponential
    • Probability Plot
    • Central Limit Theorem
  • Significance (Hypothesis testing, p-value, ANOVA, etc.)
  • Conditional Probability
    • Bayesian Statistics
    • Random Variables and Conditional Distributions
    • Monte Carlo Methods
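
As a small example of the Monte Carlo methods listed above, the following sketch estimates π by sampling points uniformly in the unit square and counting the fraction that fall inside the quarter circle of radius 1. The function name `estimate_pi` and the seed are hypothetical choices for this example, not part of the curriculum.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4 and the unit square has area 1.
    return 4.0 * inside / n_samples


estimate = estimate_pi(100_000)
print(estimate)
```

The standard error of the estimate shrinks in proportion to one over the square root of the sample count, so each extra digit of accuracy costs roughly 100 times more samples.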

Transform

  • Sampling
  • Feature Preparation
    • Vectorization (binning, bag of words, tf-idf)
    • Selection (automatic and manual)
    • Normalization
    • Regularization and Smoothing
  • Natural Language Processing
    • N-grams
    • Tokenization
    • Sentiment Analysis
    • Information Retrieval
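
The tokenization and tf-idf items above can be sketched in a few lines. This example assumes whitespace tokenization, term frequency as count divided by document length, and inverse document frequency as the natural log of (number of documents / documents containing the term). That is one of several common tf-idf variants; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus used only for illustration.
docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
print(scores[0])
```

Under this variant a term that appears in every document, such as "the" here, scores zero, while a term unique to one document scores highest.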

Store

  • SQL (Postgres, MySQL)
  • NoSQL (document, graph, key-value)
  • Filesystem and Text

Data at Scale

  • MapReduce paradigm (Hadoop)
  • Distributed Datastores (HDFS, Cassandra, HBase)
  • Hadoop Ecosystem (Pig, Hive, HBase, Flume, Sqoop, etc.)
  • Real-Time (Spark, Storm, Shark)
  • Distributed Machine Learning

Machine Learning

  • Unsupervised
    • Clustering (K-means, Hierarchical, etc.)
    • Association Analysis (FP-Growth, MDS, etc.)
    • Dimensionality Reduction (PCA, SVD, etc.)
  • Supervised
    • Classification (Naive Bayes, kNN, Logistic Regression, etc.)
    • Regression (Linear, Polynomial, Tree, etc.)
  • Recommendation
    • Similarity metrics (Jaccard, Pearson, Euclidean, etc.)
    • Item vs. User vs. Content based
    • Limitations (Cold-start problem, preference collection, performance)
  • Optimization (cost functions, hill climbing, simulated annealing, etc.)
  • Anomaly Detection and Time Series Analysis
  • Evaluation
    • Cross Validation
    • ROC plot
    • Bias vs. Variance
    • Recall vs. Precision
    • Bootstrap

Visualize and Present

  • Grammar of Graphics (ggplot2, Bokeh)
  • Interactivity (Javascript, HTML, D3.js, CSS)
  • Geographic display (i.e. maps)
  • Charts, plots, and layout (Visual Display of Quantitative Information)
