DATA SCIENCE

COURSE SYLLABUS

This syllabus outlines the modules and chapters of a data science course. Module 1 covers the fundamentals of programming in Python, including data structures, functions, NumPy, Matplotlib, and Pandas. Module 2 focuses on exploratory data analysis, data visualization, linear algebra, and probability and statistics. Module 3 covers foundations of natural language processing and machine learning, including bag-of-words modeling, text preprocessing, and classification and regression models such as k-nearest neighbors. Later modules cover supervised learning models (support vector machines, decision trees, and ensembles) and real-world machine learning case studies. The syllabus provides a comprehensive overview of key data science topics and their applications.

Module 1: Fundamentals of Programming

Chapter 1: Python for Data Science: Introduction


2.1 Python, Anaconda and relevant packages installations
2.2 Why learn Python?
2.3 Keywords and Identifiers
2.4 Comments, Indentation, and Statements
2.5 Variables and Datatypes in Python
2.6 Standard Input and Output
2.7 Operators
2.8 Control flow: If...else
2.9 Control flow: while loop
2.10 Control flow: for loop
2.11 Control flow: break and continue
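
For quick reference, a tiny illustrative sketch of the control-flow constructs listed above (not part of the course materials; the numbers are made up):

    # If/else, while, for, break and continue in one small example.
    numbers = [3, 7, 10, 15, 22]

    for n in numbers:
        if n % 2 == 0:
            print(n, "is even")
        else:
            print(n, "is odd")

    total = 0
    i = 0
    while i < len(numbers):
        if numbers[i] > 20:
            break              # stop at the first number above 20
        if numbers[i] == 10:
            i += 1
            continue           # skip the value 10
        total += numbers[i]
        i += 1
    print("total:", total)     # 3 + 7 + 15 = 25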

Chapter 2: Python for Data Science: Data Structures


3.1 Lists
3.2 Tuples part 1
3.3 Tuples part 2
3.4 Sets
3.5 Dictionary
3.6 Strings
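
A minimal illustrative sketch of the built-in data structures covered in this chapter (example values are made up):

    # Lists, tuples, sets, dictionaries, and strings in a few lines.
    scores = [91, 78, 85, 62]            # list: ordered and mutable
    scores.append(70)

    point = (3, 4)                       # tuple: ordered and immutable
    x, y = point                         # tuple unpacking

    tags = {"python", "data", "python"}  # set: unique elements only
    print(tags)                          # {'python', 'data'} (order not guaranteed)

    student = {"name": "Asha", "marks": 91}   # dictionary: key-value pairs
    student["passed"] = True

    course = "data science"              # string: immutable sequence of characters
    print(course.upper(), course.split())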

Chapter 3 : Python for Data Science: Functions


4.1 Introduction
4.2 Types of Functions
4.3 Function Arguments
4.4 Recursive Functions
4.5 Lambda Functions
4.6 Modules
4.7 Packages
4.8 File Handling
4.9 Exception Handling
4.10 Debugging Python
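
A short illustrative sketch (not from the course materials) touching function arguments, recursion, lambda functions, and exception handling:

    def power(base, exponent=2):
        """Default and keyword arguments."""
        return base ** exponent

    def factorial(n):
        """A simple recursive function."""
        return 1 if n <= 1 else n * factorial(n - 1)

    square = lambda x: x * x             # lambda: a small anonymous function

    def safe_divide(a, b):
        """Exception handling: return None instead of crashing on division by zero."""
        try:
            return a / b
        except ZeroDivisionError:
            return None

    print(power(3), power(2, exponent=5))   # 9 32
    print(factorial(5))                     # 120
    print(square(7))                        # 49
    print(safe_divide(10, 0))               # None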

Chapter 4 : Python for Data Science: NumPy


5.1 Introduction to NumPy.
5.2 Numerical operations.
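
A minimal NumPy sketch (illustrative only) of array creation and a few numerical operations:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.arange(4)                  # array([0, 1, 2, 3])

    print(a + b)                      # element-wise addition
    print(a * 2)                      # broadcasting with a scalar
    print(a.mean(), a.std())          # basic statistics
    print(np.dot(a, b))               # dot product

    m = a.reshape(2, 2)               # reshape into a 2x2 matrix
    print(m.T)                        # transpose
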
Chapter 5: Python for Data Science: Matplotlib
6.1 Introduction to Matplotlib
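
A small Matplotlib sketch (illustrative; the data is synthetic) producing a line plot and a histogram:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(x, np.sin(x), label="sin(x)")     # line plot
    ax1.set_title("Line plot")
    ax1.legend()
    ax2.hist(np.random.randn(1000), bins=30)   # histogram of random data
    ax2.set_title("Histogram")
    plt.tight_layout()
    plt.show()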

Chapter 6: Python for Data Science: Pandas


7.1 Getting started with pandas
7.2 Data Frame Basics
7.3 Key Operations on Data Frames.
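
A minimal pandas sketch (illustrative; the column names and values are made up) of DataFrame basics and key operations:

    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Asha", "Ben", "Chen", "Dina"],
        "score": [91, 78, 85, 62],
        "group": ["A", "B", "A", "B"],
    })

    print(df.head())                            # first rows
    print(df.describe())                        # summary statistics of numeric columns
    print(df[df["score"] > 80])                 # boolean filtering
    print(df.groupby("group")["score"].mean())  # group-by aggregation
    df["passed"] = df["score"] >= 70            # derived column
    print(df.sort_values("score", ascending=False))
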
Module 2: Data Science: Exploratory Data Analysis and Data Visualization

Chapter 1 : Plotting for exploratory data analysis (EDA)


10.1 Introduction to Iris dataset and 2D scatter-plot
10.2 3D Scatter-plot.
10.3 Pair plots.
10.4 Limitations of Pair plots
10.5 Histogram and introduction to PDF(Probability Density Function)
10.6 Univariate analysis using PDF.
10.7 CDF(Cumulative distribution function)
10.8 Variance, Standard Deviation
10.9 Median
10.10 Percentiles and Quantiles
10.11 IQR(InterQuartile Range), MAD(Median Absolute Deviation).
10.12 Box-plot with whiskers
10.13 Violin plots.
10.14 Summarizing plots, Univariate, Bivariate, and Multivariate analysis.
10.15 Multivariate probability density, contour plot.
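
The plots in this chapter can be reproduced in a few lines of seaborn/matplotlib. An illustrative sketch on the Iris dataset (assumes seaborn can fetch its copy of the dataset):

    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = sns.load_dataset("iris")   # seaborn's bundled copy of the Iris data

    # 2-D scatter plot colored by species.
    sns.scatterplot(data=iris, x="petal_length", y="petal_width", hue="species")
    plt.show()

    # Pair plot: every pairwise scatter plot plus per-feature distributions.
    sns.pairplot(iris, hue="species")
    plt.show()

    # Box plot and violin plot of one feature, split by class.
    sns.boxplot(data=iris, x="species", y="petal_length")
    plt.show()
    sns.violinplot(data=iris, x="species", y="petal_length")
    plt.show()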

Chapter 2 : Linear Algebra


11.1 Why learn it?
11.2 Introduction to Vectors (2-D, 3-D, n-D), row vectors and column vectors
11.3 Dot product and the angle between 2 vectors.
11.4 Projection and unit vector
11.5 Equation of a line (2-D), plane(3-D) and hyperplane (n-D)
11.6 Distance of a point from a plane/hyperplane, half-spaces.
11.7 Equation of a circle (2-D), sphere (3-D) and hypersphere (n-D)
11.8 Equation of an ellipse (2-D), ellipsoid (3-D) and hyperellipsoid (n-D)
11.9 Square, Rectangle.
11.10 Hypercube, Hypercuboid.
11.11 Revision Questions
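
A short NumPy sketch (illustrative only) of the vector operations above: dot product, angle between vectors, unit vector, and distance of a point from a hyperplane:

    import numpy as np

    a = np.array([3.0, 4.0])
    b = np.array([1.0, 0.0])

    dot = np.dot(a, b)                                  # dot product
    cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
    angle_deg = np.degrees(np.arccos(cos_theta))        # angle between a and b
    unit_a = a / np.linalg.norm(a)                      # unit vector along a

    # Distance of a point x from the hyperplane w.x + b0 = 0 is |w.x + b0| / ||w||.
    w, b0 = np.array([1.0, -2.0]), 3.0
    x = np.array([2.0, 1.0])
    distance = abs(np.dot(w, x) + b0) / np.linalg.norm(w)

    print(dot, angle_deg, unit_a, distance)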

Chapter 3 : Probability and Statistics


12.1 Introduction to Probability and Statistics.
12.2 Population & Sample.
12.3 Gaussian/Normal Distribution and its PDF(Probability Density Function).
12.4 CDF (Cumulative Distribution Function) of Gaussian/Normal Distribution
12.5 Symmetric distribution, Skewness, and Kurtosis
12.6 Standard normal variate (z) and standardization.
12.7 Kernel density estimation.
12.8 Sampling distribution & Central Limit Theorem.
12.9 Q-Q Plot: Is a given random variable Gaussian distributed?
12.10 How are distributions used?
12.11 Chebyshev’s inequality
12.12 Discrete and Continuous Uniform distributions.
12.13 How to randomly sample data points. [Uniform Distribution]
12.14 Bernoulli and Binomial distribution
12.15 Log-normal
12.16 Power law distribution
12.17 Box-Cox transform.
12.18 Applications of Non-Gaussian Distributions.
12.19 Co-variance
12.20 Pearson Correlation Coefficient
12.21 Spearman Rank Correlation Coefficient
12.22 Correlation vs Causation
12.23 How to use Correlations?
12.24 Confidence Intervals(C.I) Introduction
12.25 Computing confidence interval given the underlying distribution
12.26 C.I for the mean of a normal random variable.
12.27 Confidence Interval using bootstrapping.
12.28 Hypothesis Testing methodology, Null-hypothesis, p-value
12.29 Hypothesis testing intuition with coin toss example
12.30 Resampling and permutation test.
12.31 K-S Test for the similarity of two distributions.
12.32 Code Snippet K-S Test.
12.33 Hypothesis Testing: another example.
12.34 Resampling and permutation test: another example.
12.35 How to use Hypothesis testing?
12.36 Proportional Sampling.
12.37 Revision Questions.
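
In the spirit of item 12.32 above, a minimal SciPy sketch (illustrative only) of the two-sample Kolmogorov-Smirnov test, plus a one-sample normality check:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample_a = rng.normal(loc=0.0, scale=1.0, size=500)    # Gaussian sample
    sample_b = rng.uniform(low=-2.0, high=2.0, size=500)   # Uniform sample

    # Two-sample K-S test: are the two samples drawn from the same distribution?
    statistic, p_value = stats.ks_2samp(sample_a, sample_b)
    print(f"K-S statistic = {statistic:.3f}, p-value = {p_value:.4f}")
    # A small p-value rejects the null hypothesis that both samples
    # come from the same underlying distribution.

    # One-sample K-S test against the standard normal distribution.
    print(stats.kstest(sample_a, "norm"))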

Chapter 4 : Interview Questions on Probability and Statistics


13.1 Question & Answers

Chapter 5 : Dimensionality reduction and Visualization:


14.1 What is dimensionality reduction?
14.2 Row vector and Column vector.
14.3 How to represent a dataset?
14.4 How to represent a dataset as a Matrix.
14.5 Data preprocessing: Feature Normalization
14.6 Mean of a data matrix.
14.7 Data preprocessing: Column Standardization
14.8 Co-variance of a Data Matrix.
14.9 MNIST dataset (784 dimensional)
14.10 Code to load MNIST data set.
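
A short sketch (illustrative only; uses made-up data) of representing a dataset as a matrix, column standardization, and its covariance matrix. Loading the 784-dimensional MNIST dataset is shown only as a comment because it downloads a large file:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # An n x d data matrix: 100 points, 5 features (synthetic).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))

    X_std = StandardScaler().fit_transform(X)    # column standardization
    cov = np.cov(X_std, rowvar=False)            # d x d covariance matrix
    print(cov.shape)                             # (5, 5)

    # To load MNIST instead (downloads from openml.org):
    # from sklearn.datasets import fetch_openml
    # mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    # X, y = mnist.data, mnist.target
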
Chapter 6 : Principal Component Analysis.
15.1 Why learn it.
15.2 Geometric intuition.
15.3 Mathematical objective function.
15.4 Alternative formulation of PCA: distance minimization
15.5 Eigenvalues and eigenvectors.
15.6 PCA for dimensionality reduction and visualization.
15.7 Visualize MNIST dataset.
15.8 Limitations of PCA
15.9 Code example.
15.10 PCA for dimensionality reduction (not visualization)
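
A minimal scikit-learn PCA sketch (illustrative; it uses the small 64-dimensional digits dataset rather than full MNIST to stay lightweight):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    digits = load_digits()                      # 8x8 digit images, 64 features
    X = StandardScaler().fit_transform(digits.data)

    pca = PCA(n_components=2)                   # keep the top 2 principal components
    X_2d = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)        # variance explained per component

    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=10)
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.colorbar(label="digit")
    plt.show()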

Chapter 7 : T-distributed stochastic neighborhood embedding (t-SNE)


16.1 What is t-SNE?
16.2 Neighborhood of a point, Embedding.
16.3 Geometric intuition.
16.4 Crowding problem.
16.5 How to apply t-SNE and interpret its output (distill.pub)
16.6 t-SNE on MNIST.
16.7 Code example.
16.8 Revision Questions.
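
A matching t-SNE sketch with scikit-learn (illustrative; again on the small digits dataset, since t-SNE on full MNIST is slow):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()

    # Perplexity (roughly, the effective neighborhood size) is the key knob;
    # always try several values, since a single t-SNE plot can mislead.
    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    X_2d = tsne.fit_transform(digits.data)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=10)
    plt.title("t-SNE embedding of the digits dataset")
    plt.show()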

Chapter 8 : Interview Questions on Dimensionality Reduction


17.1 Question & Answers

Module 3: Foundations of Natural Language Processing and Machine Learning

Chapter 1 : Real world problem: Predict rating given product reviews on Amazon.
18.1 Dataset overview: Amazon Fine Food reviews(EDA)
18.2 Data Cleaning: Deduplication.
18.3 Why convert text to a vector?
18.4 Bag of Words (BoW)
18.5 Text Preprocessing: Stemming, Stop-word removal, Tokenization, Lemmatization
18.6 uni-gram, bi-gram, n-grams.
18.7 tf-idf (term frequency- inverse document frequency)
18.8 Why use the log in IDF?
18.9 Word2Vec.
18.10 Avg-Word2Vec, tf-idf weighted Word2Vec
18.11 Bag of Words(code sample)
18.12 Text Preprocessing(code sample)
18.13 Bi-Grams and n-grams(code sample)
18.14 TF-IDF(code sample)
18.15 Word2Vec(code sample)
18.16 Avg-Word2Vec and TFIDF-Word2Vec(Code Sample)
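
In the spirit of the code-sample items above, a minimal text-featurization sketch (illustrative; the review snippets are made up) covering Bag of Words, n-grams, and TF-IDF with scikit-learn. Word2Vec-style dense embeddings are usually built with the gensim library and are not shown here:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    reviews = [
        "good coffee, great taste",
        "taste was not good",
        "great product, will buy again",
    ]

    # Bag of Words: each document becomes a sparse vector of word counts.
    bow = CountVectorizer(stop_words="english")
    X_bow = bow.fit_transform(reviews)
    print(bow.get_feature_names_out())          # vocabulary (scikit-learn >= 1.0)

    # n-grams: ngram_range=(1, 2) keeps both uni-grams and bi-grams.
    bigram = CountVectorizer(ngram_range=(1, 2))
    print(bigram.fit_transform(reviews).shape)

    # TF-IDF: down-weights terms that occur in many documents.
    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(reviews)
    print(X_tfidf.shape)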

Chapter 2 : Classification and Regression Models: K-Nearest Neighbors


19.1 How does “Classification” work?
19.2 Data matrix notation.
19.3 Classification vs Regression (examples)
19.4 K-Nearest Neighbors Geometric intuition with a toy example.
19.5 Failure cases of K-NN
19.6 Distance measures: Euclidean (L2), Manhattan (L1), Minkowski, Hamming
19.7 Cosine Distance & Cosine Similarity
19.8 How to measure the effectiveness of k-NN?
19.9 Test/Evaluation time and space complexity.
19.10 k-NN Limitations.
19.11 Decision surface for K-NN as K changes.
19.12 Overfitting and Underfitting.
19.13 Need for Cross validation.
19.14 K-fold cross validation.
19.15 Visualizing train, validation and test datasets
19.16 How to determine overfitting and underfitting?
19.17 Time based splitting
19.18 k-NN for regression.
19.19 Weighted k-NN
19.20 Voronoi diagram.
19.21 Binary search tree
19.22 How to build a kd-tree.
19.23 Find nearest neighbors using kd-tree
19.24 Limitations of kd-tree
19.25 Extensions.
19.26 Hashing vs LSH.
19.27 LSH for cosine similarity
19.28 LSH for euclidean distance.
19.29 Probabilistic class label
19.30 Code Sample: Decision boundary.
19.31 Code Samples: Cross-Validation
19.32 Revision Questions
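
A short scikit-learn k-NN sketch (illustrative) showing the basic classifier and k-fold cross-validation to choose k:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)   # k-NN is distance based, so scale features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # k-fold cross-validation on the training data to pick k.
    for k in [1, 3, 5, 11]:
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
        print(f"k={k:2d}  mean CV accuracy = {scores.mean():.3f}")

    # Refit the chosen model and evaluate once on the held-out test set.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))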

Chapter 3 : Interview Questions on k-NN


20.1 Question & Answers

Chapter 4 : Classification algorithms in various situations:


21.1 Introduction
21.2 Imbalanced vs balanced dataset.
21.3 Multi-class classification.
21.4 k-NN, given a distance or similarity matrix
21.5 Train and test set differences.
21.6 Impact of Outliers
21.7 Local Outlier Factor(Simple solution: mean distance to k-NN).
21.8 k-distance (A), N(A)
21.9 reachability-distance(A, B)
21.10 Local-reachability-density(A)
21.11 Local Outlier Factor(A)
21.12 Impact of Scale & Column standardization.
21.13 Interpretability
21.14 Feature importance & Forward Feature Selection
21.15 Handling categorical and numerical features.
21.16 Handling missing values by imputation.
21.17 Curse of dimensionality.
21.18 Bias-Variance tradeoff.
21.19 Intuitive understanding of bias-variance.
21.20 Revision Questions.
21.21 Best and worst case of an algorithm.
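
For items 21.7-21.11, a minimal Local Outlier Factor sketch with scikit-learn (illustrative; two outliers are planted in synthetic data):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(100, 2)),      # dense inlier cluster
                   [[6.0, 6.0], [-5.0, 7.0]]])     # two planted outliers

    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(X)                    # -1 for outliers, +1 for inliers
    print(np.where(labels == -1)[0])               # indices flagged as outliers
    print(lof.negative_outlier_factor_[-2:])       # (negated) LOF scores of the planted points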

Chapter 5 : Performance measurement of models:


22.1 Accuracy
22.2 Confusion matrix, TPR, FPR, FNR, TNR
22.3 Precision & recall, F1-score.
22.4 Receiver Operating Characteristic (ROC) curve and AUC.
22.5 Log-loss.
22.6 R-Squared/ Coefficient of determination.
22.7 Median absolute deviation (MAD)
22.8 Distribution of errors.
22.9 Revision Questions
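
A minimal sketch (illustrative; the labels and scores are made up) computing the metrics listed above with scikit-learn:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 log_loss, precision_score, recall_score,
                                 roc_auc_score)

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # made-up ground truth
    y_prob = [0.1, 0.4, 0.8, 0.9, 0.35, 0.2, 0.7, 0.6]    # predicted P(class = 1)
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("confusion:\n", confusion_matrix(y_true, y_pred))  # rows: true, cols: predicted
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("AUC      :", roc_auc_score(y_true, y_prob))        # uses scores, not hard labels
    print("log-loss :", log_loss(y_true, y_prob))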

Chapter 6 : Interview Questions on Performance Measurement of Models.


23.1 Question & Answers

Chapter 7 : Naive Bayes


24.1 Conditional probability.
24.2 Independent vs Mutually exclusive events.
24.3 Bayes Theorem with examples.
24.4 Exercise problems on Bayes Theorem
24.5 Naive Bayes algorithm.
24.6 Toy example: Train and test stages.
24.7 Naive Bayes on Text data.
24.8 Laplace/Additive Smoothing.
24.9 Log-probabilities for numerical stability.
24.10 Bias and Variance tradeoff.
24.11 Feature importance and interpretability.
24.12 Imbalanced data
24.13 Outliers.
24.14 Missing values.
24.15 Handling Numerical features (Gaussian NB)
24.16 Multiclass classification.
24.17 Similarity or Distance matrix.
24.18 Large dimensionality.
24.19 Best and worst cases.
24.20 Code example
24.21 Exercise: Apply Naive Bayes to Amazon reviews.
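
A minimal Multinomial Naive Bayes sketch (illustrative; tiny made-up snippets, not the Amazon Fine Food reviews) with Laplace smoothing via the alpha parameter:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["great taste loved it", "terrible product waste of money",
             "good value will buy again", "awful smell not good"]
    labels = [1, 0, 1, 0]                 # 1 = positive, 0 = negative (made up)

    # alpha is the Laplace/additive smoothing parameter (alpha=1.0 is classic Laplace).
    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(texts, labels)

    print(model.predict(["great product good taste"]))        # predicted class
    print(model.predict_proba(["great product good taste"]))  # class probabilities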

Chapter 8 : Logistic Regression:


25.1 Geometric intuition of logistic regression
25.2 Sigmoid function: Squashing
25.3 Mathematical formulation of objective function.
25.4 Weight Vector.
25.5 L2 Regularization: Overfitting and Underfitting.
25.6 L1 regularization and sparsity.
25.7 Probabilistic Interpretation: Gaussian Naive Bayes
25.8 Loss minimization interpretation
25.9 Hyperparameter search: Grid Search and Random Search
25.10 Column Standardization.
25.11 Feature importance and model interpretability.
25.12 Collinearity of features.
25.13 Train & Run time space and time complexity.
25.14 Real world cases.
25.15 Non-linearly separable data & feature engineering.
25.16 Code sample: Logistic regression, GridSearchCV, RandomSearchCV
25.17 Extensions to Logistic Regression: Generalized linear models (GLM)
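
In the spirit of item 25.16, a minimal scikit-learn sketch (illustrative) of L2-regularized logistic regression with a grid search over the inverse regularization strength C:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),                        # column standardization
        ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
    ])

    # Smaller C means stronger regularization (C is roughly 1/lambda).
    grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
    grid.fit(X_train, y_train)

    print("best C:", grid.best_params_)
    print("test accuracy:", grid.score(X_test, y_test))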

Chapter 9 : Linear Regression.


26.1 Geometric intuition of Linear Regression.
26.2 Mathematical formulation.
26.3 Real world Cases.
26.4 Code sample for Linear Regression
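
A matching scikit-learn sketch (illustrative; synthetic data) for ordinary least squares linear regression:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    y = 3.0 * X[:, 0] + 4.0 + rng.normal(scale=0.1, size=200)

    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)     # should be close to [3.0] and 4.0
    print(reg.score(X, y))               # R-squared on the training data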

Chapter 10 : Solving optimization problems


27.1 Differentiation.
27.2 Online differentiation tools
27.3 Maxima and Minima
27.4 Vector calculus: Grad
27.5 Gradient descent: geometric intuition.
27.6 Learning rate.
27.7 Gradient descent for linear regression.
27.8 SGD algorithm
27.9 Constrained optimization & PCA
27.10 Logistic regression formulation revisited.
27.11 Why does L1 regularization create sparsity?
27.12 Revision Questions.
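
A small NumPy sketch (illustrative; synthetic data) of gradient descent for linear regression, the setting of item 27.7:

    import numpy as np

    # Synthetic data: y = 3x + 4 plus a little noise.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = 3.0 * x + 4.0 + rng.normal(scale=0.1, size=200)

    w, b = 0.0, 0.0        # parameters to learn
    lr = 0.1               # learning rate

    for step in range(500):
        y_hat = w * x + b
        error = y_hat - y
        grad_w = 2 * np.mean(error * x)   # d(MSE)/dw
        grad_b = 2 * np.mean(error)       # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b

    print(f"learned w = {w:.3f}, b = {b:.3f}")   # should be close to 3 and 4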

Chapter 11 : Interview questions on Logistic Regression and Linear Regression


28.1 Question & Answers

Module 4: Machine Learning II (Supervised Learning Models)

Chapter 1 : Support Vector Machines (SVM)


29.1 Geometric intuition.
29.2 Mathematical derivation.
29.3 Why we take values +1 and -1 for support vector planes.
29.4 Loss function(Hinge Loss) based interpretation.
29.5 Dual form of SVM formulation.
29.6 Kernel trick.
29.7 Polynomial kernel.
29.8 RBF-Kernel.
29.9 Domain specific Kernels.
29.10 Train and run time complexities.
29.11 nu-SVM: control errors and support vectors.
29.12 SVM Regression.
29.13 Cases.
29.14 Code Sample.
29.15 Revision Questions.
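
A minimal scikit-learn SVM sketch (illustrative) comparing a linear kernel with an RBF kernel; C and gamma are the usual hyperparameters to tune:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    for kernel in ["linear", "rbf"]:
        # C controls the softness of the margin; gamma the width of the RBF kernel.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{kernel:6s} kernel: mean CV accuracy = {scores.mean():.3f}")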

Chapter 2 : Interview Questions on Support Vector Machine


30.1 Questions & Answers

Chapter 3 : Decision Trees


31.1 Geometric Intuition of decision tree: Axis parallel hyperplanes.
31.2 Sample Decision tree.
31.3 Building a decision Tree: Entropy(Intuition behind entropy)
31.4 Building a decision Tree: Information Gain
31.5 Building a decision Tree: Gini Impurity.
31.6 Building a decision Tree: Constructing a DT.
31.7 Building a decision Tree: Splitting numerical features.
31.8 Feature standardization.
31.9 Categorical features with many possible values.
31.10 Overfitting and Underfitting.
31.11 Train and Run time complexity.
31.12 Regression using Decision Trees.
31.13 Cases
31.14 Code Samples.
31.15 Revision questions
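
A minimal decision-tree sketch with scikit-learn (illustrative), showing the split criterion, the depth hyperparameter that controls over/underfitting, and feature importances:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # criterion can be "gini" (Gini impurity) or "entropy" (information gain).
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X_train, y_train)

    print("test accuracy:", tree.score(X_test, y_test))
    print("feature importances:", tree.feature_importances_)
    print(export_text(tree))      # text view of the axis-parallel splits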

Chapter 4 : Interview Questions on Decision Trees.


32.1 Question & Answers

Chapter 5 : Ensemble Models:


33.1 What are ensembles?
33.2 Bootstrapped Aggregation (Bagging) Intuition.
33.3 Random Forests and their construction.
33.4 Bias-Variance tradeoff.
33.5 Train and Run-time Complexity.
33.6 Bagging: code Sample.
33.7 Extremely randomized trees.
33.8 Random Forest: Cases.
33.9 Boosting Intuition
33.10 Residuals, Loss functions, and gradients.
33.11 Gradient Boosting
33.12 Regularization by Shrinkage.
33.13 Train and Run time complexity.
33.14 XGBoost: Boosting + Randomization
33.15 AdaBoost: geometric intuition.
33.16 Stacking models.
33.17 Cascading classifiers.
33.18 Kaggle competitions vs Real world.
33.19 Revision Questions.
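
A short ensemble sketch with scikit-learn (illustrative) contrasting bagging, a random forest, and gradient boosting; XGBoost offers a very similar fit/predict interface but is a separate library:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        # BaggingClassifier uses decision trees as its default base estimator.
        "bagging":           BaggingClassifier(n_estimators=100, random_state=0),
        "random forest":     RandomForestClassifier(n_estimators=100, random_state=0),
        # learning_rate is the shrinkage/regularization knob in boosting.
        "gradient boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                                        random_state=0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:18s} mean CV accuracy = {scores.mean():.3f}")
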
Module 6: Machine Learning Real-World Case Studies

Chapter 1 : Case study 1 : Quora Question pair similarity problem


36.1 Business/Real-world problem: Problem Definition
36.2 Business objectives and constraints.
36.3 Mapping to an ML problem: Data overview
36.4 Mapping to an ML problem: ML problem and performance metric.
36.5 Mapping to an ML problem: Train-test split
36.6 EDA: Basic Statistics.
36.7 EDA: Basic Feature Extraction.
36.8 EDA: Text Preprocessing.
36.9 EDA: Advanced Feature Extraction.
36.10 EDA: Feature analysis.
36.11 EDA: Data Visualization: T-SNE.
36.12 EDA: TF-IDF weighted word-vector featurization.
36.13 ML Models: Loading data.
36.14 ML Models: Random Model.
36.15 ML Models: Logistic Regression & Linear SVM
36.16 ML Models: XGBoost
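
A skeletal sketch (illustrative only; it does not use the actual Quora dataset) of the modeling flow in this case study: TF-IDF featurization of both questions, a train-test split, and a random baseline versus logistic regression compared on log-loss:

    import numpy as np
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    # Made-up stand-in data: question pairs and a duplicate(1)/not-duplicate(0) label.
    q1 = ["how do i learn python", "what is machine learning", "best way to lose weight",
          "how to cook rice", "what is the capital of france", "how do i learn java"]
    q2 = ["how can i learn python fast", "who won the world cup", "how to reduce weight quickly",
          "how is rice cooked", "why is the sky blue", "best way to learn java"]
    y = np.array([1, 0, 1, 1, 0, 1])

    # TF-IDF featurization of both questions, concatenated into one sparse matrix.
    tfidf = TfidfVectorizer().fit(q1 + q2)
    X = hstack([tfidf.transform(q1), tfidf.transform(q2)]).tocsr()

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)

    # Random-model baseline: predict class probabilities uniformly at random.
    rng = np.random.default_rng(0)
    random_probs = rng.uniform(size=(y_te.shape[0], 2))
    random_probs /= random_probs.sum(axis=1, keepdims=True)
    print("random baseline log-loss:", log_loss(y_te, random_probs, labels=[0, 1]))

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("logistic regression log-loss:",
          log_loss(y_te, model.predict_proba(X_te), labels=[0, 1]))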
