Data Science Lectureflow
Why Python?, Features of Python Programming, Python Installation, Print Function, Comments
Variables and Data Types
Operators in Python
Arithmetic, Assignment, Logical, Comparison, Identity, Membership
Collections
List, Tuple, Set, Dictionary
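A minimal sketch of how the four collection types differ (the values are illustrative):

    scores = [88, 92, 75]                 # list: ordered, mutable
    point = (3.0, 4.0)                    # tuple: ordered, immutable
    tags = {"python", "data", "python"}   # set: unique items only
    user = {"name": "Ada", "age": 36}     # dictionary: key-value pairs

    scores.append(95)                     # lists grow in place
    x, y = point                          # tuples support unpacking
    tags.add("sql")                       # sets silently drop duplicates
    print(user["name"], len(scores), sorted(tags))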
Conditional Statements
If, If-else, If-elif-else, Nested If-else
Looping Statements
for loop, while loop, Nested Loops, range() Function
Control Statements
break, continue, pass
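A short sketch tying the looping and control statements together (the numbers are arbitrary):

    for i in range(5):          # range(5) yields 0, 1, 2, 3, 4
        if i == 2:
            continue            # skip the rest of this iteration
        if i == 4:
            break               # exit the loop early
        print(i)                # prints 0, 1, 3

    n = 3
    while n > 0:                # repeats until the condition is False
        n -= 1

    def todo():
        pass                    # placeholder body; does nothing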
Functions
Definition, Types of Functions, Defining a Function, Calling a Function, Function Arguments, Lambda Functions
Scope of Variables
Global, Local
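A hedged sketch combining function definition, arguments, lambda functions, and global vs. local scope:

    counter = 0                              # global variable

    def greet(name, punctuation="!"):        # default argument value
        return "Hello, " + name + punctuation

    def bump():
        global counter                       # opt in to changing the global
        counter += 1                         # otherwise counter would be local

    square = lambda x: x * x                 # lambda: small anonymous function

    print(greet("Ada"), greet(name="Bob", punctuation="?"))
    bump()
    print(counter, square(5))                # 1 25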
Modules
Introduction, How to import?, Math module, Random Module, Packages
Input - Output
Reading Input from Keyboard, Printing Output
Files and Exceptions Handling
File Operations: Opening and Closing, Reading and Writing; Exceptions: try, except, finally
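A small sketch of file operations wrapped in try/except/finally; the filename notes.txt is hypothetical:

    try:
        with open("notes.txt", "w") as f:    # "w" creates/overwrites the file
            f.write("first line\n")
        with open("notes.txt") as f:         # default mode is read
            print(f.read())
    except OSError as err:
        print("File operation failed:", err)
    finally:
        print("Runs whether or not an exception occurred.")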
OOPS Concepts
Class, Objects, Inheritance, Polymorphism, Overloading
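A compact sketch of these OOP ideas with an illustrative Animal hierarchy; overriding speak() demonstrates polymorphism:

    class Animal:
        def __init__(self, name):
            self.name = name
        def speak(self):
            return "..."

    class Dog(Animal):                   # Dog inherits from Animal
        def speak(self):                 # method overriding
            return "Woof"

    class Cat(Animal):
        def speak(self):
            return "Meow"

    for pet in (Dog("Rex"), Cat("Mia")): # same call, different behavior
        print(pet.name, pet.speak())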
Introduction to Databases: Types of databases: Relational (SQL) and Non-Relational (NoSQL). Key
concepts: Tables, records, fields, primary keys, and foreign keys. Differences between SQL and
NoSQL databases.
Setting Up Databases: Installing and configuring database management systems (e.g., MySQL,
PostgreSQL, SQLite). Introduction to cloud databases (e.g., AWS RDS, Google Cloud SQL, Azure
SQL).
Structured Query Language (SQL) Data Manipulation: Basics: SELECT, INSERT, UPDATE,
DELETE. Filtering with WHERE, LIKE, IN, and BETWEEN. Sorting and limiting results (ORDER
BY, LIMIT).
Data Aggregation: GROUP BY and HAVING clauses. Aggregation functions: COUNT, SUM,
AVG, MIN, MAX.
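These basics can be tried without installing a server by driving SQLite from Python's built-in sqlite3 module; the sales table and its rows are made up for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")       # throwaway in-memory database
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("east", 100), ("east", 250), ("west", 80)])
    conn.commit()                             # persist the inserts

    # Filtering, sorting, limiting
    cur.execute("SELECT region, amount FROM sales "
                "WHERE amount BETWEEN 50 AND 200 "
                "ORDER BY amount DESC LIMIT 5")
    print(cur.fetchall())

    # Aggregation with GROUP BY and HAVING
    cur.execute("SELECT region, COUNT(*), SUM(amount), AVG(amount) "
                "FROM sales GROUP BY region HAVING SUM(amount) > 100")
    print(cur.fetchall())
    conn.close()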
Joins and Relationships: Types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER
JOIN. Using joins to combine data from multiple tables. Self-joins and subqueries.
Advanced SQL: Window functions (ROW_NUMBER, RANK, DENSE_RANK). Common Table
Expressions (CTEs). Recursive queries.
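A sketch of a CTE combined with a window function, again through sqlite3 (window functions require SQLite 3.25 or newer); the scores table is illustrative:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (student TEXT, subject TEXT, mark INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?)",
                     [("ann", "math", 91), ("ann", "cs", 85),
                      ("bob", "math", 77), ("bob", "cs", 88)])

    query = """
    WITH ranked AS (                          -- Common Table Expression
        SELECT student, subject, mark,
               RANK() OVER (PARTITION BY subject ORDER BY mark DESC) AS rnk
        FROM scores
    )
    SELECT * FROM ranked WHERE rnk = 1        -- top mark per subject
    """
    print(conn.execute(query).fetchall())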
SELECT statement, SELECT statement with WHERE clause, SELECT statement with GROUP BY clause, SELECT statement with HAVING clause
NoSQL Databases Introduction to NoSQL: Key-value stores (e.g., Redis). Document stores (e.g.,
MongoDB). Column-family stores (e.g., Cassandra). Graph databases (e.g., Neo4j).
MongoDB: CRUD operations (Create, Read, Update, Delete). Querying collections using MongoDB
Query Language (MQL). Indexing and aggregation framework in MongoDB.
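A hedged pymongo sketch of the CRUD operations and a one-stage aggregation pipeline; it assumes a local MongoDB server on the default port, and the shop.products collection is hypothetical:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    products = client["shop"]["products"]

    products.insert_one({"name": "lamp", "price": 20})              # Create
    doc = products.find_one({"name": "lamp"})                       # Read
    products.update_one({"name": "lamp"}, {"$set": {"price": 25}})  # Update
    products.delete_one({"name": "lamp"})                           # Delete

    # Aggregation framework: average price per (hypothetical) category field
    pipeline = [{"$group": {"_id": "$category", "avg": {"$avg": "$price"}}}]
    for row in products.aggregate(pipeline):
        print(row)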
Database Connectivity Connecting Databases with Python: Using libraries like sqlite3, PyMySQL,
and psycopg2. Executing SQL queries programmatically. Handling database transactions.
ORM Frameworks: Introduction to Object-Relational Mapping (ORM). Using SQLAlchemy or
Django ORM for database interactions.
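A minimal SQLAlchemy sketch (1.4+ declarative style) against an in-memory SQLite database; the User model is illustrative, not part of the course materials:

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)          # emit CREATE TABLE

    with Session(engine) as session:
        session.add(User(name="Ada"))         # ORM insert
        session.commit()
        for user in session.query(User).filter_by(name="Ada"):
            print(user.id, user.name)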
Data Extraction and Transformation ETL Processes: Extracting data from multiple sources (e.g.,
APIs, flat files, web scraping). Transforming and cleaning data using SQL or Python. Loading
processed data into a database.
Data Cleaning: Handling missing values, duplicates, and outliers in SQL. String manipulation using
SQL functions (e.g., CONCAT, SUBSTRING, TRIM). Date and time operations in databases.
Database Optimization Indexing: Creating and using indexes for faster querying. Understanding the
trade-offs of indexing.
Query Optimization: Analyzing query execution plans. Optimizing joins and subqueries for
performance.
Database Design: Normalization and denormalization. Creating efficient schemas for data science
workflows.
Linear Algebra Basics Linear algebra is crucial for understanding data representations,
transformations, and machine learning algorithms. Key Topics Vectors Definition: A vector is an
ordered set of numbers, representing data points in n-dimensional space. Operations: Addition,
subtraction, scalar multiplication. Dot Product: Measures the similarity between two vectors. Norms:
Measure the length or magnitude of a vector (e.g., L1, L2 norms).
Matrices Definition: A matrix is a two-dimensional array of numbers. Operations: Addition,
multiplication, and transposition. Matrix Multiplication: Used in transformations and projections.
Inverse and Determinant: Important for solving linear equations.
Matrix Decompositions Eigenvalues and Eigenvectors: Essential for dimensionality reduction
techniques like PCA (Principal Component Analysis). Singular Value Decomposition (SVD): Used
in recommendation systems and noise reduction.
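The vector and matrix operations above map directly onto NumPy; a sketch with small illustrative arrays:

    import numpy as np

    u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
    print(u + v, 2 * u)                               # addition, scalar multiplication
    print(u @ v)                                      # dot product = 11.0
    print(np.linalg.norm(u, 1), np.linalg.norm(u))    # L1 and L2 norms

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    print(A.T)                                        # transpose
    print(np.linalg.det(A), np.linalg.inv(A))         # determinant, inverse

    vals, vecs = np.linalg.eig(A)                     # eigen-decomposition (as in PCA)
    U, s, Vt = np.linalg.svd(A)                       # singular value decomposition
    print(vals, s)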
Applications in Data Science Representing datasets as matrices. Feature transformations (e.g.,
scaling, PCA). Neural networks: Weights and activations are manipulated as matrices.
Descriptive Statistics Measures of Central Tendency Mean: Average of data points. Median: Middle
value when data is sorted. Mode: Most frequently occurring value.
Measures of Dispersion Variance: Measure of data spread around the mean. Standard Deviation:
Square root of variance. Range: Difference between the maximum and minimum values.
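Computing these summary statistics in Python (toy data):

    import numpy as np
    from statistics import mode

    data = np.array([2, 4, 4, 7, 9])
    print(np.mean(data), np.median(data), mode(data.tolist()))  # 5.2 4.0 4
    print(np.var(data), np.std(data))     # variance and its square root
    print(data.max() - data.min())        # range = 7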
Inferential Statistics Sampling Random Sampling: Selecting data points randomly to avoid bias.
Sampling Distributions: Distribution of a statistic (e.g., sample mean).
Hypothesis Testing Null Hypothesis (H₀) and Alternative Hypothesis (H₁). p-value: Probability of observing results at least as extreme as the current ones, assuming the null hypothesis is true. Confidence Intervals: Range of values within which a population parameter lies with a certain confidence level.
Correlation and Causation Correlation Coefficient (r): Measures linear relationship between two
variables. Causation: Understanding if one variable causes another.
Applications in Data Science Data summarization for EDA (e.g., mean, variance). Identifying
relationships between variables using correlation. Performing A/B testing and drawing conclusions.
Basics of Probability Definitions Experiment: A process with an uncertain outcome. Sample Space
(S): All possible outcomes of an experiment. Event: A subset of the sample space.
Rules of Probability Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Multiplication Rule: P(A ∩ B) = P(A) · P(B | A). Complement Rule: P(A′) = 1 − P(A).
Conditional Probability: P(A | B) = P(A ∩ B) / P(B), the probability of A given that B has occurred.
Bayes' Theorem: P(A | B) = P(B | A) · P(A) / P(B), used to update the probability of A after observing B.
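A worked Bayes example with made-up numbers, the classic medical-test setting:

    p_d = 0.01                    # P(disease): 1% prevalence
    p_pos_d = 0.99                # P(positive | disease): sensitivity
    p_pos_no_d = 0.05             # P(positive | no disease): false-positive rate

    p_pos = p_pos_d * p_d + p_pos_no_d * (1 - p_d)   # total probability of a positive
    p_d_pos = p_pos_d * p_d / p_pos                  # Bayes' theorem
    print(round(p_d_pos, 3))      # ~0.167: a positive test is far from certain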
Probability Distributions Discrete Distributions Bernoulli Distribution: For binary outcomes (e.g.,
success/failure). Binomial Distribution: Sum of several Bernoulli trials. Poisson Distribution:
Number of events occurring in a fixed interval.
Continuous Distributions Normal Distribution (Gaussian): Bell-shaped curve; central in statistics.
Uniform Distribution: Equal probability across a range. Exponential Distribution: Time between
events in a Poisson process.
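Sampling from each listed distribution with NumPy's random generator (parameters are illustrative):

    import numpy as np
    rng = np.random.default_rng(0)

    bern = rng.binomial(1, 0.3, size=10)     # Bernoulli = binomial with n=1
    binom = rng.binomial(10, 0.3, size=10)   # sum of 10 Bernoulli trials
    pois = rng.poisson(4.0, size=10)         # events per fixed interval
    norm = rng.normal(0.0, 1.0, size=10)     # Gaussian bell curve
    unif = rng.uniform(0.0, 1.0, size=10)    # equal probability on [0, 1)
    expo = rng.exponential(2.0, size=10)     # waiting times in a Poisson process
    print(norm.mean(), norm.std())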
Applications in Data Science Predictive modeling (e.g., classification probabilities). Understanding
distributions of data (e.g., normality assumption). Bayesian methods in machine learning (e.g., Naive
Bayes classifier). Markov chains and probabilistic graphical models.
Linear Algebra in Statistics and Probability: Covariance matrices, correlation coefficients.
Eigenvalues in PCA for reducing dimensionality.
Statistics and Probability in Machine Learning: Model evaluation using statistical metrics (e.g.,
precision, recall). Probability distributions in probabilistic models.
Analyzing Correlations Using correlation matrices and heatmaps to explore relationships. Scatterplot
matrices for pairwise comparisons of variables. Statistical tests for assessing correlation significance
(Pearson, Spearman).
Insights Discovery Grouping and aggregating data for deeper insights (groupby() in pandas).
Identifying key drivers or factors behind trends using feature importance. Using pivot tables to
summarize and analyze data.
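A pandas sketch of correlation, grouping, and pivoting on a toy DataFrame; seaborn's heatmap() is one common way to visualize the correlation matrix:

    import pandas as pd

    df = pd.DataFrame({"city": ["NY", "NY", "LA", "LA"],
                       "sales": [10, 15, 7, 12],
                       "visits": [100, 160, 80, 130]})

    print(df[["sales", "visits"]].corr())        # correlation matrix for a heatmap
    print(df.groupby("city")["sales"].agg(["mean", "sum"]))  # grouped aggregates
    print(pd.pivot_table(df, values="sales", index="city", aggfunc="sum"))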
Automating EDA Leveraging libraries like Pandas Profiling, Sweetviz, or D-Tale for quick insights.
Creating reproducible EDA reports for stakeholders.
What is NumPy and why is it used? Installing and importing NumPy Difference between NumPy
arrays and Python lists
array(), arange(), linspace() zeros(), ones(), empty(), full() Creating identity matrices (eye()) Random
number generation (random.rand(), random.randn(), random.randint())
Shape, size, and data type: ndim, shape, dtype, itemsize, nbytes
1D, 2D, and nD slicing Fancy indexing Boolean indexing (filtering with conditions) Reshaping
arrays (reshape(), ravel(), flatten())
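A sketch of the creation, attribute, and indexing patterns listed above:

    import numpy as np

    a = np.arange(12).reshape(3, 4)           # 0..11 as a 3x4 matrix
    print(a.ndim, a.shape, a.dtype, a.nbytes)

    print(a[1, 2], a[:, 1])                   # single element and column slice
    print(a[[0, 2]])                          # fancy indexing: rows 0 and 2
    print(a[a > 6])                           # boolean indexing (filtering)
    print(a.ravel()[:5], np.linspace(0, 1, 5))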
Element-wise operations (addition, subtraction, multiplication, division) Matrix operations (dot(),
matmul(), transpose()) Broadcasting
sum(), mean(), median(), std(), var() min(), max(), argmin(), argmax() Axis-based aggregation
Stacking (vstack(), hstack(), concatenate()) Splitting arrays (split(), hsplit(), vsplit()) Sorting arrays
(sort(), argsort())
Solving linear equations (linalg.solve()) Matrix inversion (linalg.inv()) Determinant (linalg.det())
Eigenvalues and eigenvectors (linalg.eig())
Using np.nan, np.isnan(), and handling with np.nanmean(), np.nanstd() etc.
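A sketch covering axis-based aggregation, stacking, solving a linear system, and NaN-aware statistics:

    import numpy as np

    a = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(a.sum(axis=0), a.mean(axis=1))      # axis-based aggregation
    print(np.vstack([a, a]).shape, np.hstack([a, a]).shape)

    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    print(np.linalg.solve(A, b))              # [2. 3.]: solution of Ax = b

    x = np.array([1.0, np.nan, 3.0])
    print(np.isnan(x), np.nanmean(x), np.nanstd(x))  # NaN-aware statistics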
Machine learning is a core component of data science, enabling systems to learn patterns from data
and make predictions or decisions. This module introduces supervised learning, unsupervised
learning, and model evaluation techniques, forming the foundation of machine learning.
Supervised learning involves training models on labeled datasets, where input-output pairs are
known. The goal is to predict the output for unseen inputs.
Linear Regression Predicting sales or revenue trends. Estimating relationships between variables
(e.g., advertising spend vs. sales).
Decision Tree - Credit risk analysis. Customer segmentation.
Random Forests - Fraud detection. Predictive maintenance in manufacturing.
Performance Metrics for Regression Models: Mean Squared Error (MSE): Average squared difference between actual and predicted values. Root Mean Squared Error (RMSE): Square root of MSE, in the same units as the target variable. Mean Absolute Error (MAE): Average absolute difference between actual and predicted values.
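These regression metrics in scikit-learn, on hypothetical predictions:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    y_true = np.array([3.0, 5.0, 7.0])
    y_pred = np.array([2.5, 5.5, 8.0])

    mse = mean_squared_error(y_true, y_pred)
    print(mse, np.sqrt(mse))                    # MSE and RMSE
    print(mean_absolute_error(y_true, y_pred))  # MAE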
Classification Algorithms: Logistic Regression K-Nearest Neighbors (KNN) Decision Trees Random
Forest Support Vector Machines (SVM) Metrics: Confusion Matrix, Precision, Recall, F1 Score,
ROC-AUC
Performance Metrics for Classification Models: Accuracy: Proportion of correct predictions. Precision: True Positives / (True Positives + False Positives). Recall (Sensitivity): True Positives / (True Positives + False Negatives). F1 Score: Harmonic mean of precision and recall. ROC-AUC: Measures the trade-off between sensitivity and specificity.
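The classification metrics in scikit-learn, on toy labels and scores:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]   # predicted probabilities

    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
          recall_score(y_true, y_pred), f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score))      # AUC uses the scores, not labels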
Unsupervised learning deals with unlabeled data, focusing on uncovering hidden patterns or
structures.
Clustering Algorithms K-Means Clustering - Customer segmentation in marketing. Grouping similar
items in recommendation systems.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - Spatial data analysis (e.g.,
earthquake epicenter clustering). Anomaly detection in networks.
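A clustering sketch on synthetic blobs; the eps and min_samples values are illustrative, and DBSCAN's label -1 marks noise points:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, DBSCAN

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:10], km.cluster_centers_.shape)

    db = DBSCAN(eps=0.9, min_samples=5).fit(X)   # density-based; no cluster count
    print(set(db.labels_))                       # -1 = noise/anomalies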
Model Evaluation - Model evaluation ensures the reliability and generalization of machine learning
models on unseen data.
Train-Test Split
Cross-Validation
Performance Metrics - Confusion Matrix
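A model-evaluation sketch combining a hold-out split, a confusion matrix, and 5-fold cross-validation on the iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import confusion_matrix

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(confusion_matrix(y_te, model.predict(X_te)))   # hold-out evaluation
    print(cross_val_score(model, X, y, cv=5).mean())     # 5-fold CV accuracy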
What is Deep Learning - Definition Deep Learning leverages artificial neural networks with multiple
layers (hence "deep") to extract high-level abstractions from data. It's particularly
effective in processing large amounts of unstructured data like images, audio, and text. Key
Characteristics Automatic Feature Extraction: Unlike traditional machine learning, it doesn’t require
manual feature engineering. Scalability: Performs better as the size of data increases. Applications:
Speech recognition, computer vision, and natural language processing.
Neural Networks: Basics and Architecture Neural Networks are the building blocks of deep learning,
consisting of layers of nodes (neurons) that process data.
Key Components Neurons (Nodes): Basic units that take inputs, apply weights, biases, and activation
functions, then produce outputs.
Layers: Input Layer: Receives raw data. Hidden Layers: Perform computations and extract features.
Output Layer: Produces the final predictions or classifications.
Weights and Biases: Parameters learned during training to minimize error.
Activation Functions: Introduce non-linearity, enabling the model to learn complex patterns
Forward and Backward Propagation Forward Propagation: Data passes through the network,
generating predictions. Backward Propagation: Adjusts weights and biases using gradients from the
loss function to minimize error.
Architectures Feedforward Neural Networks (FNNs): Information flows in one direction (input to
output). Convolutional Neural Networks (CNNs): Specialized for image data, extracting spatial
hierarchies. Recurrent Neural Networks (RNNs): Process sequential data, like time series or text.
Deep Learning frameworks like TensorFlow and PyTorch simplify model building and training.
TensorFlow Overview: Developed by Google, TensorFlow provides extensive tools for designing
and deploying deep learning models. Features: Tensor manipulation, automatic differentiation, and
scalability. Keras API for high-level abstractions.
Basic Workflow: Import libraries (tensorflow and keras). Define the model architecture. Compile the
model with an optimizer and loss function. Train and evaluate the model.
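That four-step workflow as a tiny runnable sketch; the layer sizes and synthetic data are illustrative, not a prescribed architecture:

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(200, 4)
    y = (X.sum(axis=1) > 2).astype(int)           # synthetic binary labels

    model = keras.Sequential([                    # 1. define the architecture
        keras.layers.Input(shape=(4,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",               # 2. compile
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, verbose=0)          # 3. train
    print(model.evaluate(X, y, verbose=0))        # 4. evaluate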
PyTorch Overview: Developed by Facebook, PyTorch is widely used for research due to its
flexibility and dynamic computation graph. Features: Easy debugging and customizability. Integrates
well with Python’s native libraries.
Basic Workflow: Import torch and torch.nn. Define a custom neural network class. Specify the loss
function and optimizer. Train and evaluate the model.
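The equivalent PyTorch sketch, with an explicit training loop showing forward and backward propagation; again, sizes and data are illustrative:

    import torch
    import torch.nn as nn

    X = torch.rand(200, 4)
    y = (X.sum(dim=1) > 2).float().unsqueeze(1)

    class Net(nn.Module):                     # custom neural network class
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                                        nn.Linear(16, 1), nn.Sigmoid())
        def forward(self, x):
            return self.layers(x)

    model = Net()
    loss_fn = nn.BCELoss()                    # loss function
    opt = torch.optim.SGD(model.parameters(), lr=0.1)   # optimizer

    for _ in range(100):                      # training loop
        opt.zero_grad()
        loss = loss_fn(model(X), y)           # forward propagation
        loss.backward()                       # backward propagation (gradients)
        opt.step()                            # update weights and biases
    print(loss.item())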
Comparison of TensorFlow and PyTorch
Image Classification with Convolutional Neural Networks (CNNs)
CNN Architecture: Convolutional Layers: Extract features using filters/kernels. Convolution
Operation: Slide kernels over the image to create feature maps. Pooling Layers: Reduce spatial
dimensions to make computation efficient. Example: Max pooling selects the highest value in a
region. Fully Connected Layers: Flatten feature maps and connect to output neurons.
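A compact Keras version of that pipeline (convolution, pooling, flatten, fully connected); the 28x28 grayscale input shape is an assumption, typical of MNIST-style data:

    from tensorflow import keras

    cnn = keras.Sequential([
        keras.layers.Input(shape=(28, 28, 1)),          # 28x28 grayscale image
        keras.layers.Conv2D(16, kernel_size=3,
                            activation="relu"),         # feature maps via kernels
        keras.layers.MaxPooling2D(pool_size=2),         # shrink spatial dimensions
        keras.layers.Flatten(),                         # feature maps -> vector
        keras.layers.Dense(10, activation="softmax"),   # 10 class probabilities
    ])
    cnn.summary()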
Computer Vision Definition and significance Applications in industries (e.g., healthcare, autonomous
vehicles, retail) Relation to AI and Machine Learning, Basics of Image Processing (Digital image
representation: pixels, resolution Color spaces (RGB, HSV, Grayscale) Image transformations:
scaling, rotation, translation Filters: edge detection, blurring, sharpening Image segmentation and
thresholding),
Key Algorithms and Techniques Feature extraction: SIFT, SURF, ORB Object detection: Haar
cascades, YOLO, SSD Object tracking: Kalman filters, optical flow Image classification and CNNs
Semantic and instance segmentation
Computer Vision Libraries and Frameworks OpenCV TensorFlow/Keras PyTorch Dlib
Practical Use Cases Facial recognition systems Optical character recognition (OCR) Autonomous
driving systems Medical imaging analysis
Introduction to NLP Definition and importance Real-world applications (chatbots, sentiment
analysis, translation)
Text Preprocessing Tokenization and lemmatization Stopword removal Text normalization
Stemming vs. lemmatization
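A preprocessing sketch with NLTK; the download() calls fetch one-time resources, whose exact names can vary across NLTK versions:

    import nltk
    nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    text = "The cats were running faster than the dogs."
    tokens = nltk.word_tokenize(text.lower())            # tokenization
    tokens = [t for t in tokens if t.isalpha()
              and t not in stopwords.words("english")]   # stopword removal

    print([PorterStemmer().stem(t) for t in tokens])     # stemming chops suffixes
    print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # dictionary forms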
Core Concepts in NLP Language models: n-grams, bag of words TF-IDF: term frequency-inverse
document frequency Part-of-speech tagging Named entity recognition (NER) Dependency parsing
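Bag-of-words and TF-IDF side by side with scikit-learn, on a toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ["data science is fun", "science needs data", "fun with python"]

    bow = CountVectorizer().fit_transform(corpus)     # raw term counts
    tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by IDF
    print(bow.toarray().shape, tfidf.toarray().round(2))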
Deep Learning for NLP Word embeddings: Word2Vec, GloVe, FastText Recurrent Neural Networks
(RNNs) GRUs and LSTMs Transformers: BERT, GPT, RoBERTa Sequence-to-sequence models
Attention mechanisms
NLP Tasks Text classification Machine translation Sentiment analysis Summarization (extractive and
abstractive) Question answering systems
NLP Tools and Libraries NLTK and SpaCy Hugging Face Transformers Gensim Stanford CoreNLP
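The quickest entry point is a Hugging Face pipeline, which downloads a default sentiment model on first use (internet access assumed):

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")   # default pretrained model
    print(classifier("This course makes NLP approachable."))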
What is GenAI? Definition: Generative AI refers to a class of artificial intelligence systems designed
to create new content such as text, images, audio, or videos based on patterns learned from existing
data.
How It Works: Uses deep learning models to understand data distributions and generate outputs that
mimic the training data. Techniques include Variational Autoencoders (VAEs), Generative
Adversarial Networks (GANs), and Transformer-based models like GPT.
Comparison with Traditional AI/ML
Core Idea: Unlike traditional AI, which focuses on classification or prediction, Generative AI emphasizes content creation.
Objective: Traditional AI/ML focuses on prediction, classification, or decision-making; Generative AI emphasizes creativity and synthesis.
Input and Output: Traditional AI takes input and outputs a class label, numerical value, or recommendation. Generative AI takes a prompt and generates entirely new content.
Algorithms and Approaches: Traditional AI uses algorithms like decision trees, SVMs, and simple neural networks. Generative AI leverages advanced architectures like GANs, VAEs, and Transformers.
Applications & Examples Text Generation: Chatbots (e.g., ChatGPT). Content creation for
blogs, emails, and marketing.
Image Generation: Tools like DALL-E and Stable Diffusion for art, design, and visual storytelling.
Video and Audio Synthesis: AI-powered tools for dubbing, video editing, and speech synthesis (e.g.,
Deepfake technology).
Scientific Research: AI-assisted drug discovery and protein structure prediction.
Gaming: Generating realistic environments, NPC interactions, and dynamic storylines.
Real-World Examples: Adobe’s AI tools for content editing. Google's Bard for search augmentation.
OpenAI Codex for code generation.
LLM and the Transformers Architecture Introduction to LLMs (Large Language Models): LLMs are
AI models trained on massive datasets of text to generate human-like responses and understand
context. Examples: GPT-4, BERT, RoBERTa.
Transformers Architecture: Key Innovation: Replaced RNNs and CNNs for sequential data
processing with the "attention mechanism."
Components: Encoder-Decoder setup (e.g., BERT focuses on encoding; GPT focuses on decoding).
Multi-Head Attention for identifying relationships between data elements. Feedforward Networks for
transformations.
Key Concepts & Terminology Latent Space: A multi-dimensional representation of data learned
by models to capture hidden patterns and features.
Generative vs. Discriminative Models: Generative: Models that create new content (e.g., GANs,
VAEs). Discriminative: Models that focus on distinguishing between different data points (e.g.,
classifiers).
Fine-Tuning: Customizing pre-trained models on specific tasks or domains using smaller datasets.
Prompt Engineering: Crafting specific inputs (prompts) to guide LLMs in generating desired outputs.
Tokenization: Breaking down text into smaller units (tokens) for processing by AI models.
Zero-Shot and Few-Shot Learning: Zero-Shot: Performing tasks without specific task examples in
training. Few-Shot: Performing tasks with minimal examples provided during inference.
Attention Mechanism: A method enabling models to focus on relevant parts of input data when
generating outputs.
Pretraining and Fine-Tuning: Pretraining: Training on massive general datasets. Fine-Tuning:
Tailoring the model to specific use cases with domain-specific data.
Getting Started with LLMs
a. Open-source & closed-source LLMs
b. Tools to get started with LLMs
c. Introduction to Hugging Face & Google Colab
d. Introduction to the LangChain framework
e. Running your first LLM
Prompt Engineering for LLMs
a. Introduction & basics of prompting
b. Basic examples
c. Prompting techniques
d. Tips for designing prompts
e. Using prompt templates in LangChain
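Item (e) might look like the following hedged sketch; LangChain's import paths have shifted across versions, so treat it as illustrative:

    from langchain.prompts import PromptTemplate

    template = PromptTemplate(
        input_variables=["topic", "audience"],
        template="Explain {topic} to {audience} in three sentences.",
    )
    prompt = template.format(topic="RAG", audience="beginners")
    print(prompt)   # the string you would pass to an LLM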
Retrieval Augmented Generation (RAG) for LLMs - Part 1
a. Introduction to RAG & its components
b. Benefits & use cases of RAG
c. Getting started with RAG using LangChain
d. Text extraction & creating vector embeddings from documents
e. Storing & retrieving vectors from a vector database
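A conceptual sketch of steps (d) and (e): embed documents, index the vectors, and retrieve the nearest chunks for a query. It uses sentence-transformers with FAISS; the model name is one common choice, not prescribed by the course:

    import faiss
    from sentence_transformers import SentenceTransformer

    docs = ["RAG combines retrieval with generation.",
            "Vector databases store embeddings.",
            "Paris is the capital of France."]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(docs)                      # document embeddings

    index = faiss.IndexFlatL2(vecs.shape[1])       # exact L2 vector index
    index.add(vecs)                                # "store" step

    query = model.encode(["What does RAG do?"])
    _, ids = index.search(query, k=2)              # "retrieve" step
    print([docs[i] for i in ids[0]])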
Retrieval Augmented Generation (RAG) for LLMs - Part 2
a. Setting up a RAG pipeline with an LLM
b. Creating a prompt template for RAG
c. Loading the knowledge context into the prompt
d. Executing the RAG pipeline for text summarization
Fine-Tuning in Generative AI - Part 1
a. Why fine-tune a model?
b. Concepts of fine-tuning pre-trained models
c. Applications of fine-tuned LLM models
d. Fine-tuning techniques
e. Model selection and data preparation
Fine-Tuning in Generative AI - Part 2
a. Tokenizing the dataset for training with the Transformers library
b. Setting up the training structure with the Transformers library
c. Incorporating an evaluation method
d. Hyperparameter tuning & training
e. LLM fine-tuning pitfalls
Generative AI for Business Applications
a. Exploring Generative AI applications in various industries
b. Exploring Generative AI for creative tasks (image generation)
c. Case studies of successful Generative AI implementations
d. Exploring UI libraries: Streamlit & Gradio