Data Science 7th Sem AIML ITE Notes Complete LONG
https://yashnote.notion.site/Data-Science-1180e70e8a0f80bbbfa2fdee5d1f1d85?pvs=4
Unit 1
Introduction to Data Science
Difference among AI, Machine Learning, and Data Science
Comparison of AI, ML, and Data Science:
Basic Introduction of Python
Key Features of Python:
Common Use Cases of Python:
Python for Data Science
1. Pandas
2. NumPy
3. Scikit-learn
4. Data Visualization
5. Advanced Python Concepts for Data Science
Introduction to Google Colab
Key Features of Google Colab:
Use Cases of Google Colab:
Popular Dataset Repositories
Discussion on Some Datasets:
Data Pre-processing
Python Example: Data Cleaning (Handling Missing Values)
Data Scales
Python Example: Encoding Ordinal Data
Similarity and Dissimilarity Measures
Python Example: Cosine and Euclidean Similarity
Sampling and Quantization of Data
Sampling:
Quantization:
Python Example: Random Sampling and Quantization
Filtering
Python Example: Moving Average and Median Filter
Data Transformation
Python Example: Data Normalization and Log Transformation
Data Merging
Python Example: Merging DataFrames
Data Visualization
Python Example: Basic Data Visualization using matplotlib
Principal Component Analysis (PCA)
Python Example: PCA in Python
Correlation
Python Example: Calculating Correlation
Chi-Square Test
Python Example: Chi-Square Test
Summary
Unit 2
Regression Analysis
Linear Regression
Python Example: Simple Linear Regression
Generalized Linear Models (GLM)
Python Example: Logistic Regression
Regularized Regression
Python Example: Ridge and Lasso Regression
Summary of Key Concepts
Cross-Validation
Types of Cross-Validation:
Python Example: K-Fold Cross-Validation
Training and Testing Data Set
Python Example: Train-Test Split
Overview of Nonlinear Regression
Python Example: Nonlinear Regression (Polynomial Regression)
Overview of Ridge Regression
Advantages:
Python Example: Ridge Regression
Summary of Key Concepts
Latent Variables
Examples:
Structural Equation Modeling (SEM)
Key Components of SEM:
Python Libraries for SEM:
Python Example: Factor Analysis (Latent Variable)
Factor Analysis Example (Latent Variables Extraction)
SEM Example Using semopy
Structural Equation Model Example:
Explanation:
Summary of Key Concepts
What are Latent Variables?
Example:
What is Structural Equation Modeling (SEM)?
Breaking Down SEM:
Why Use SEM?
Example to Understand SEM:
Example of SEM in Python (Basic Workflow)
Explanation of the Python Code:
Recap:
Unit 3
Data Science: Forecasting
Overview:
Key Concepts:
Types of Forecasting
Time Series Forecasting Methods
Conclusion:
Time Series Data Analysis
Overview:
Key Concepts in Time Series Analysis
Preprocessing Time Series Data
Time Series Models for Forecasting
Conclusion:
Data Science: Stationarity and Seasonality in Time Series Data
Overview:
1. Stationarity in Time Series
Definition:
Checking for Stationarity:
Making Time Series Stationary:
2. Seasonality in Time Series
Definition:
Key Features of Seasonality:
Identifying Seasonality:
Types of Seasonality:
Handling Seasonality in Time Series Models:
Conclusion:
Python Implementation
Data Science: Recurrent Models & Autoregressive Models in Time Series
Overview:
1. Autoregressive (AR) Models in Time Series
Overview:
Key Features of AR Models:
ACF and PACF in AR Models:
2. Recurrent Neural Networks (RNN) for Time Series Forecasting
Overview:
Types of RNNs:
RNN for Time Series Forecasting:
Key Steps in RNN-Based Time Series Forecasting:
3. Python Implementation of Autoregressive (AR) Models
AR Model in Python using statsmodels
4. Python Implementation of Recurrent Neural Networks (RNN) for Time Series Forecasting
RNN Model using Keras and TensorFlow
Conclusion:
Unit 4
Data Science: Classification in Machine Learning
Overview:
1. Types of Classification Problems
2. Popular Classification Algorithms
3. Python Implementation of Classification Algorithms
Step 1: Import Libraries
4. Logistic Regression
4.1: Load Data and Prepare for Training
4.2: Train Logistic Regression Model
4.3: Output Evaluation
5. K-Nearest Neighbors (KNN)
5.1: Train KNN Classifier
6. Support Vector Machine (SVM)
6.1: Train SVM Classifier
7. Model Evaluation and Comparison
7.1: Comparison of Models
8. Conclusion
Summary of Results:
Next Steps:
Data Science: Linear Discriminant Analysis (LDA)
Overview:
1. Key Concepts of LDA
3. Applications of LDA
4. Python Implementation of LDA
Step 1: Import Required Libraries
Step 2: Load Data and Prepare for Training
Step 3: Apply LDA for Dimensionality Reduction
Step 4: Train a Classifier on the LDA-transformed Data
Step 5: Output Evaluation
5. LDA vs. PCA
6. Applications of LDA
7. Conclusion
Data Science: Overview of Support Vector Machine (SVM) and Decision Trees (DT)
1. Support Vector Machine (SVM)
Overview:
SVM Basics:
2. Decision Trees (DT)
Overview:
Decision Tree Working:
3. Python Implementation
3.1: Support Vector Machine (SVM) Implementation
Step 1: Import Libraries
Step 2: Load the Dataset
Step 3: Train an SVM Model
Step 4: Visualize Results (Optional)
3.2: Decision Tree (DT) Implementation
Step 1: Import Libraries
Step 2: Load Data and Split
Step 3: Train a Decision Tree Model
Step 4: Visualize the Decision Tree
4. Comparison of SVM and Decision Trees
5. Conclusion
Data Science: Clustering and Clustering Techniques
1. Overview of Clustering
2. Types of Clustering
2.1. Centroid-based Clustering
2.2. Density-based Clustering
2.3. Hierarchical Clustering
2.4. Model-based Clustering
3. Detailed Explanation of Popular Clustering Techniques
3.1. K-means Clustering
Steps of K-means:
3.2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Parameters of DBSCAN:
3.3. Agglomerative Hierarchical Clustering
Steps:
4. Illustration of Clustering Techniques through Python
Step 1: Import Required Libraries
Step 2: Load the Iris Dataset
5. 1. K-means Clustering
Step 1: Apply K-means Clustering
Step 2: Visualize the Clusters
Step 3: Evaluate K-means
6. 2. DBSCAN (Density-Based Spatial Clustering)
Step 1: Apply DBSCAN Clustering
Step 2: Visualize the Clusters
Step 3: Evaluate DBSCAN
7. 3. Agglomerative Clustering (Hierarchical Clustering)
Step 1: Apply Agglomerative Clustering
Step 2: Visualize the Clusters
8. Clustering Results Comparison
9. Conclusion
Unit 1
Introduction to Data Science
Data Science is a multidisciplinary field that combines statistics, computer
science, mathematics, and domain-specific knowledge to extract insights and
knowledge from structured and unstructured data. Data Science applies scientific
methods, processes, algorithms, and systems to analyze vast amounts of data
and generate actionable insights. In today's world, where data is generated in
massive volumes from various sources such as social media, business
transactions, IoT devices, etc., Data Science plays a critical role in making sense
of that data.
1. Data Collection: Gathering data from various sources (web scraping, APIs,
surveys, sensors, etc.).
4. Data Wrangling and Cleaning: Ability to preprocess data, handle missing data,
and deal with data inconsistencies.
Finance: Fraud detection, algorithmic trading, risk management.
Key Points:
Types of AI:
Narrow AI: AI systems designed for specific tasks (e.g., Siri, Alexa,
recommendation engines).
Examples:
Self-driving cars (AI-driven vehicles).
Key Points:
Goal of ML: To enable machines to learn from data and improve with
experience.
Types of ML:
Examples:
Data Science:
Data Science is a more comprehensive field that integrates AI, ML, and other tools
to work with data in various forms. It focuses on extracting insights and
knowledge from data using a mix of statistics, algorithms, and domain knowledge.
While AI and ML are tools used in Data Science, Data Science is concerned with
the entire data lifecycle from collection to insight generation.
Key Points:
Goal of Data Science: To extract actionable insights from large datasets using
a mix of techniques.
Scope: Data Science includes AI, ML, and various other techniques like data
mining and business intelligence.
Examples:
Scope: AI is very broad and includes ML and more; ML is narrower, focused on learning from data; Data Science is comprehensive and includes ML, AI, and more.
Python is a high-level, interpreted, general-purpose programming language. It
was created by Guido van Rossum and first released in 1991. Python emphasizes
code readability and simplicity, making it an ideal choice for beginners and
professionals alike. Its extensive libraries and frameworks make it highly versatile,
used across various domains, including web development, data analysis, artificial
intelligence, and scientific computing.
2. Interpreted Language: Python code is executed line by line, which allows for
interactive debugging.
Data Science and Machine Learning: With libraries like Pandas, NumPy,
Scikit-learn, TensorFlow.
Scientific Computing: Python is widely used in academic and research
settings for simulations, data analysis, and scientific computation.
import pandas as pd
Pivot Tables
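A minimal pandas pivot_table sketch, using an assumed toy sales DataFrame (the column names and values are illustrative):

import pandas as pd

# Illustrative sales data (assumed for this sketch)
sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Revenue': [100, 150, 200, 120]
})

# Pivot: rows = Region, columns = Product, values = summed Revenue
pivot = pd.pivot_table(sales, index='Region', columns='Product',
                       values='Revenue', aggfunc='sum')
print(pivot)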
dates = pd.date_range('20230101', periods=6)
ts = pd.Series(range(6), index=dates)
resampled = ts.resample('2D').sum()
2. NumPy
NumPy is fundamental for numerical computing in Python.
2.1 Advanced Array Operations
import numpy as np
# Broadcasting
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
result = a + b # b is broadcast to match a's shape
# Fancy indexing
x = np.arange(10)
indices = [2, 5, 8]
selected = x[indices]
2.2 Vectorization
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
y = sigmoid(x) # Vectorized operation
3. Scikit-learn
Scikit-learn is a machine learning library for Python.
3.1 Pipeline Creation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumes X_train, X_test, y_train, y_test from a prior train_test_split
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
3.2 Cross-Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()
scores = cross_val_score(rf, X, y, cv=5)  # X, y as defined above
4. Data Visualization
4.1 Matplotlib
import matplotlib.pyplot as plt

# x, y from the sigmoid example above
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'r-', label='Data')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()
4.2 Seaborn
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips)
plt.show()
# List comprehension
squares = [x**2 for x in range(10)]
# Generator expression
sum_of_squares = sum(x**2 for x in range(1000000))
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))
from functools import reduce
product = reduce(lambda x, y: x * y, numbers)
These concepts and libraries form the core of Python's data science ecosystem,
providing powerful tools for data manipulation, analysis, and visualization.
2. Free GPU/TPU Access: Colab provides free access to GPUs and TPUs, which
are vital for high-performance tasks like deep learning.
5. Integration with Google Drive: You can save and load datasets and notebooks
directly to and from Google Drive.
7. Markdown and LaTeX Support: Colab allows for the inclusion of Markdown
and LaTeX (for writing mathematical equations) alongside code.
Data Science and Machine Learning: Due to its GPU and TPU support, Colab
is commonly used for training machine learning models.
Educational Purposes: It's widely used by students and educators for learning
Python and machine learning without the need for local installation.
1. Kaggle Datasets:
Website: https://www.kaggle.com/datasets
Kaggle is one of the largest platforms for data science competitions and
also hosts a wide range of datasets. Users can search for datasets by
category, size, or application domain.
Popular Datasets:
Website: https://archive.ics.uci.edu/ml/index.php
The UCI Machine Learning Repository is a popular destination for publicly
available datasets, widely used in machine learning research and
education.
Popular Datasets:
Website: https://datasetsearch.research.google.com/
Google’s Dataset Search allows users to find datasets across the web on
different platforms. It indexes datasets from a variety of sources such as
academic journals, governmental agencies, and open data platforms.
4. Data.gov:
Website: https://www.data.gov/
Popular Datasets:
Crime Data: Data related to crimes across various U.S. cities and
states.
Website: https://registry.opendata.aws/
Amazon Web Services (AWS) hosts numerous open datasets for public
use, including datasets for satellite imagery, genomics, and machine
learning models.
Popular Datasets:
2. MNIST Dataset:
Description: The MNIST dataset contains 70,000 grayscale images (28×28 pixels) of handwritten digits from 0 to 9.
Use Case: It is a standard benchmark for image classification; even relatively simple models can achieve high accuracy rates on this dataset.
3. Iris Dataset:
Description: The Iris dataset includes features such as petal length, petal
width, sepal length, and sepal width for three species of Iris flowers.
Use Case: It is a classic benchmark for multi-class classification, where the goal is to predict the Iris species from these measurements.
In summary, Python and Google Colab are essential tools for data scientists,
offering powerful features for data analysis, machine learning, and scientific
computing. Popular dataset repositories like Kaggle, UCI, and Data.gov provide
valuable datasets that are commonly used for academic, research, and
commercial purposes. Understanding and analyzing these datasets is a critical
skill in data science.
Data Pre-processing
Data pre-processing is a critical step in the data analysis and machine learning
pipeline. It involves preparing raw data to make it suitable for further analysis or
model training. The quality of the data can significantly influence the performance
of machine learning models. Data pre-processing helps in handling missing
values, removing noise, scaling, transforming, and integrating data from multiple
sources.
Key steps in data pre-processing include:
2. Data Integration: Combining data from multiple sources into a unified dataset.
4. Data Reduction: Reducing the volume of data to make analysis more efficient
without losing important information.
Example: If you have a dataset with missing values, you can fill them using the
mean, median, or mode of the available data (imputation). Alternatively, rows with
missing values can be removed if they are not critical.
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'Age': [25, 30, np.nan, 22, np.nan],
        'Salary': [50000, 54000, np.nan, 42000, 60000]}
df = pd.DataFrame(data)

# Impute missing values with the column mean
df_imputed = df.fillna(df.mean())
# Or simply drop rows containing missing values
df_dropped = df.dropna()
print(df_imputed)
Data Scales
Data can exist on different scales, which determine the type of statistical analysis
and machine learning techniques applicable to it. Understanding data scales is
vital for selecting the right methods for data processing.
1. Nominal Scale:
2. Ordinal Scale:
Example: Ratings (Excellent, Good, Fair, Poor), ranking in a race (1st, 2nd,
3rd).
3. Interval Scale:
In this scale, the intervals between values are meaningful, but there is no
true zero point. Differences are consistent.
4. Ratio Scale:
This scale has all the characteristics of the interval scale, with a true zero
point that indicates the absence of the quantity being measured.
from sklearn.preprocessing import OrdinalEncoder

# Example of ordinal data: education levels
education_levels = [['High School'], ['Bachelor'], ['Master'], ['PhD']]

# Ordinal encoding with an explicit category order
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded_education = encoder.fit_transform(education_levels)
print(encoded_education)
Python Example: Cosine and Euclidean Similarity
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

# Example vectors
vector_a = [1, 0, -1]
vector_b = [0, 1, 0]

# Cosine similarity
cos_sim = cosine_similarity([vector_a], [vector_b])
print("Cosine Similarity:", cos_sim)

# Euclidean distance
euc_dist = euclidean(vector_a, vector_b)
print("Euclidean Distance:", euc_dist)
1. Random Sampling: Each data point has an equal probability of being selected.
3. Systematic Sampling: Data points are selected at regular intervals from the
dataset.
Quantization:
Quantization involves converting continuous data into discrete values or levels.
import numpy as np
# Random sampling
data = np.arange(1, 101)
sample = np.random.choice(data, size=10, replace=False)
print("Random Sample:", sample)
# Quantization (Bin data into 5 levels)
quantized_data = np.digitize(data, bins=[20, 40, 60, 80])
print("Quantized Data:", quantized_data)
Filtering
Filtering is a technique used to remove or reduce noise from a dataset. It is an
essential step in data pre-processing, especially in signal processing and time-
series data. The goal is to smooth the data or remove outliers that can skew the
results of your analysis.
1. Moving Average Filter: Averages the data points over a sliding window,
helping to smooth out short-term fluctuations.
2. Median Filter: Replaces each data point with the median of neighboring
points, often used for outlier removal.
import numpy as np
import pandas as pd
from scipy.ndimage import median_filter

# Noisy sample data
data = np.array([2, 3, 50, 4, 5, 6, 80, 7, 8, 9])

# Moving average filter (window of 3)
moving_avg = pd.Series(data).rolling(window=3).mean()
print("Moving Average:\n", moving_avg)

# Median filter
median_filt = pd.Series(median_filter(data, size=3))
print("Median Filter:\n", median_filt)
Data Transformation
Data transformation is the process of converting data into a format suitable for
analysis. This can involve scaling, normalizing, encoding categorical data, or
transforming features to reduce skewness.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Min-max normalization (scale each feature to [0, 1])
normalized = MinMaxScaler().fit_transform(data)
print("Normalized Data:\n", normalized)

# Log transformation
log_transformed = np.log(data + 1)
print("Log Transformed Data:\n", log_transformed)
Data Merging
Data merging involves combining two or more datasets into a single dataset based
on a common attribute or key. Common merging operations include:
2. Joining: Merging datasets based on a key (like SQL joins: inner, left, right, and
outer).
import pandas as pd

# Sample data
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 75]})

# Inner join on the common key 'ID'
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)
Data Visualization
Data visualization is a key aspect of data analysis as it helps to understand
patterns, trends, and relationships in the data. Common visualization techniques
include:
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
    'Height': [150, 160, 170, 180, 190],
    'Weight': [50, 60, 70, 80, 90]
})

# Scatter plot of Height vs Weight
plt.scatter(data['Height'], data['Weight'])
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Standardize before PCA
data_scaled = StandardScaler().fit_transform(data)

# Applying PCA
pca = PCA(n_components=1)  # Reducing to 1 principal component
data_pca = pca.fit_transform(data_scaled)
print("PCA Transformed Data:\n", data_pca)
Correlation
Correlation measures the strength and direction of a linear relationship between
two variables. It ranges from -1 to 1:
0: No correlation
import pandas as pd
# Sample data
data = pd.DataFrame({
'X': [1, 2, 3, 4, 5],
'Y': [2, 4, 6, 8, 10]
})
# Pearson correlation
correlation = data.corr(method='pearson')
print("Pearson Correlation:\\n", correlation)
Chi-Square Test
The Chi-Square test is used to determine if there is a significant association
between two categorical variables. It compares the observed frequencies with the
expected frequencies to test for independence.
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of two categorical variables (illustrative observed frequencies)
data = pd.DataFrame({
    'Product_A': [30, 20],
    'Product_B': [25, 25]
})

chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square statistic:", chi2)
print("p-value:", p)
Summary
Filtering: Smooths and cleans data using techniques like moving average and
median filters.
Data Visualization: Visualizes data trends using plots like scatter plots,
histograms, and bar charts.
All these concepts are critical to understanding how to process, analyze, and draw
insights from data, and Python provides powerful libraries like pandas , numpy , and
matplotlib to handle these tasks.
Unit 2
Regression Analysis
Regression analysis is a statistical technique used to model and analyze the
relationship between a dependent variable (target) and one or more independent
variables (features). The goal of regression is to predict or explain the dependent
variable based on the given independent variables.
Types of regression analysis:
Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample data (simple linear relationship)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Train-test split, then fit the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

# Model parameters
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
2. Poisson Regression: For count data, using the log link function.
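As a brief sketch of Poisson regression (illustrative count data, using statsmodels' GLM, which applies the log link by default for the Poisson family):

import numpy as np
import statsmodels.api as sm

# Illustrative counts y against a single predictor (assumed data)
X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
y = np.array([1, 2, 4, 6, 9, 13])

poisson_model = sm.GLM(y, X, family=sm.families.Poisson())  # log link by default
result = poisson_model.fit()
print(result.params)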
Python Example: Logistic Regression
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative binary data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Probability of class 1
print("Predicted probabilities:\n", log_reg.predict_proba(X_test))
Regularized Regression
Regularized regression methods help prevent overfitting by adding a penalty term
to the loss function in the linear regression model. The most common forms of
regularized regression are:
from sklearn.linear_model import Ridge, Lasso
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Ridge (L2 penalty) and Lasso (L1 penalty) with illustrative alpha values
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
Common types include logistic regression (for binary classification) and
Poisson regression (for count data).
3. Regularized Regression:
These techniques are fundamental in machine learning and statistical modeling for
solving various prediction and classification problems.
Cross-Validation
Cross-validation is a model evaluation technique that helps assess how well a
machine learning model will generalize to unseen data. Instead of splitting the
dataset into just training and testing sets, cross-validation divides the data into
multiple subsets (folds) and trains the model multiple times, each time using a
different subset for validation and the rest for training.
Types of Cross-Validation:
1. K-Fold Cross-Validation: The data is split into k equal-sized subsets (folds).
The model is trained k times, each time using k-1 folds for training and the
remaining fold for validation. The final result is the average of the results from
the k iterations.
2. Stratified K-Fold: Similar to K-Fold, but ensures each fold has a representative
proportion of classes for classification tasks.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# KFold Cross-Validation
kf = KFold(n_splits=3)
model = LinearRegression()

scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
print("MSE per fold:", -scores)
print("Average MSE:", -scores.mean())
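For classification tasks, the stratified variant mentioned above is used the same way; a small sketch on the Iris data (the fold count and model are illustrative choices):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Each fold keeps the class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=skf)
print("Accuracy per fold:", scores)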
1. Training Set: Used to train the machine learning model. The model learns the
relationships between the input features and the target variable.
2. Testing Set: Used to evaluate the model's performance on unseen data. The
testing set is used to assess how well the model generalizes to new, unseen
examples.
Splitting the dataset is typically done in a ratio, such as 70% for training and 30%
for testing. In cases where the dataset is large, an additional validation set may
also be used for hyperparameter tuning.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
Python Example: Nonlinear Regression (Polynomial Regression)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Nonlinear (quadratic) sample data, illustrative
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])

# Degree-2 polynomial regression pipeline
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
y_pred = poly_model.predict(X)
print("Predicted values:\n", y_pred)
Advantages:
Reduces model complexity and prevents overfitting.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Ridge Regression (alpha = regularization strength)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)
y_pred = ridge_reg.predict(X)
mse_ridge = mean_squared_error(y, y_pred)

# Results
print("Ridge Regression Predictions:", y_pred)
print("Mean Squared Error:", mse_ridge)
The training set is used to train the model, and the test set is used to
evaluate performance on unseen data.
3. Nonlinear Regression:
4. Ridge Regression:
By understanding and implementing these regression techniques, you can better
model complex data relationships and create more robust predictive models.
Latent Variables
Latent variables are variables that are not directly observed but are inferred or
estimated from other observed variables. They are commonly used in fields such
as psychology, social sciences, and econometrics to represent abstract concepts
like intelligence, socioeconomic status, or customer satisfaction, which are not
directly measurable.
Examples:
Customer Satisfaction: Latent variables might include satisfaction or loyalty,
which are inferred from responses to survey questions.
Latent variables are often modeled using factor analysis or structural equation
modeling (SEM).
2. Latent Variables: Inferred from observed variables (e.g., abstract traits like
"satisfaction").
2. Structural Model: Specifies the relationships between latent variables (similar
to regression).
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Survey responses Q1–Q4 (illustrative values) assumed to reflect one latent factor
df = pd.DataFrame({
    'Q1': [3, 4, 5, 4, 2, 5],
    'Q2': [2, 4, 5, 3, 2, 4],
    'Q3': [3, 5, 4, 4, 1, 5],
    'Q4': [2, 3, 5, 4, 2, 4]
})

fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(df)
print("Factor loadings:\n", fa.loadings_)
In this example, we assume that the observed variables (e.g., survey questions Q1
to Q4) are used to estimate a single latent factor.
import pandas as pd
from semopy import Model

# Observed indicators (Q1–Q3 values are illustrative; L1–L3 as in the notes)
data = {
    'Q1': [4, 5, 6, 7, 8],
    'Q2': [3, 5, 6, 6, 7],
    'Q3': [4, 6, 7, 7, 8],
    'L1': [3, 4, 5, 6, 7],
    'L2': [4, 5, 6, 6, 8],
    'L3': [5, 6, 7, 7, 9]
}
df = pd.DataFrame(data)

# Model description: measurement model, then the structural path
desc = """
Satisfaction =~ Q1 + Q2 + Q3
Loyalty =~ L1 + L2 + L3
# Structural paths
Loyalty ~ Satisfaction
"""

model = Model(desc)
# Estimation (a real analysis needs far more observations than this toy sample):
# model.fit(df)
# print(model.inspect())
Explanation:
Satisfaction =~ Q1 + Q2 + Q3: This line specifies that the latent variable
"Satisfaction" is inferred from the observed variables Q1, Q2, and Q3.
1. Latent Variables: These are abstract variables that are not directly observed
but are inferred from other measured variables. Latent variables are commonly
used to represent unobservable constructs like intelligence, satisfaction, or
economic status.
Example:
Imagine you're studying happiness. You can't directly measure someone's
happiness with a single number, but you can observe behaviors and answers to
questions like:
These questions provide clues about happiness, but happiness itself is a latent
variable because it’s not directly measurable — it's inferred from these observable
indicators.
What is Structural Equation Modeling (SEM)?
Structural Equation Modeling (SEM) is a statistical technique that allows
researchers to analyze relationships between observed variables and latent
variables. SEM combines:
SEM helps create a complex model where both observed variables (things we
can measure, like test scores or survey responses) and latent variables (hidden
traits like intelligence or stress levels) are analyzed together.
Example: Salary, job satisfaction score, and hours worked are observable
measures.
2. Latent Variables: These are unobserved variables that influence the observed
variables. We model latent variables based on how they impact the indicators.
Example: Job satisfaction might be a latent variable, inferred from observable
questions like "How much do you enjoy your work?" and "How likely are you
to recommend your job to a friend?"
1. Model complex relationships: SEM allows you to study how different latent
variables (e.g., intelligence, motivation) and observable variables (e.g., test
scores) influence each other.
2. Estimate latent variables: With SEM, you can estimate how much an
unobserved (latent) variable is contributing to the patterns you observe in your
data.
3. Test theoretical models: Researchers can use SEM to test if their theoretical
models, which include latent concepts, fit well with the real-world data they
collect.
1. Motivation (latent).
2. Intelligence (latent).
But how do you measure motivation or intelligence? These are latent variables,
so you use observed variables to infer them, like:
Using SEM, you can create a model to test how well this hypothesis fits your
actual data.
import pandas as pd
import semopy

# Let's assume we have the following dataset
data = {
    'Hours_Studied': [10, 20, 30, 15, 25],
    'Class_Participation': [8, 9, 7, 6, 9],
    'IQ_Score': [110, 130, 120, 115, 125],
    'Problem_Solving': [90, 85, 88, 92, 87],
    'Exam_Score': [85, 90, 80, 75, 88]
}
df = pd.DataFrame(data)

# Measurement model for the latent variables, then the regression relationships
desc = '''
Motivation =~ Hours_Studied + Class_Participation
Intelligence =~ IQ_Score + Problem_Solving
# Regression relationships
Exam_Score ~ Motivation + Intelligence
'''

model = semopy.Model(desc)
# Fitting and inspection (a real analysis needs many more observations):
# model.fit(df)
# print(model.inspect())
2. Fit the Model: The model.fit() function fits the SEM model to the data,
estimating the latent variables and relationships.
3. Inspect Results: After fitting, you can inspect the model to see how well it fits
the data and the strength of relationships.
Recap:
Latent variables are unobservable factors that affect observable variables.
Unit 3
Data Science: Forecasting
Overview:
Forecasting is the process of making predictions about the future based on
historical data. In data science, forecasting typically refers to time series
forecasting, where the goal is to predict future values based on past data. This is
widely used in fields like finance, economics, inventory management, and sales
prediction.
Key Concepts:
Time Series Data: A sequence of data points measured at successive time
intervals. Time series data is often used for forecasting purposes.
Forecasting Horizon: The period for which predictions are made. It can be
short-term (hours, days), medium-term (weeks, months), or long-term (years).
Types of Forecasting
1. Qualitative Forecasting:
When to Use: When there is limited historical data or when the data is
subjective (e.g., consumer behavior predictions).
Techniques:
2. Quantitative Forecasting:
Techniques:
Concept: Assumes that the next value in the series will be the same as the
previous value.
Formula:
\hat{y}_{t+1} = y_t
Use: This method works best when data has no trend or seasonal
component.
4. Autoregressive Integrated Moving Average (ARIMA):
Components:
AR(p): Autoregressive part, uses past values of the series.
I(d): Integrated part, the order of differencing applied to make the series stationary.
MA(q): Moving Average part, uses previous forecast errors.
Kernel Trick: SVR uses kernel functions to map input data into higher-
dimensional space, making it easier to separate data for non-linear
relationships.
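A minimal SVR forecasting sketch, assuming a synthetic series and simple lag features (the kernel, C value, and data below are illustrative, not from the notes):

import numpy as np
from sklearn.svm import SVR

# Synthetic series; predict y[t] from its two previous values (lag features)
np.random.seed(0)
series = np.sin(0.2 * np.arange(60)) + np.random.normal(0, 0.05, 60)
X = np.column_stack([series[1:-1], series[:-2]])  # lag-1 and lag-2
y = series[2:]

svr = SVR(kernel='rbf', C=10.0)
svr.fit(X[:-10], y[:-10])        # train on all but the last 10 points
preds = svr.predict(X[-10:])     # one-step-ahead forecasts for the held-out points
print(preds)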
Conclusion:
Forecasting plays a crucial role in making informed predictions about future
events based on historical data. Depending on the characteristics of the data
(e.g., trend, seasonality), different methods, such as ARIMA, exponential
smoothing, and machine learning techniques, can be applied. It's essential to
evaluate forecasting models using appropriate metrics like MAE, RMSE, and R² to
assess their performance and choose the best model for accurate predictions.
In practice, tools such as R and Octave can be used to implement these models
and perform statistical analysis, making the forecasting process efficient and
scalable.
Seasonality (S): The repeating pattern or cycle in the data at regular
intervals due to seasonal factors (e.g., daily, monthly, or yearly).
3. Stationarity:
Stationary vs Non-Stationary:
Stationary: The properties of the series do not change over time (e.g.,
no trend or seasonality).
Unit Root: A time series with a unit root is typically non-stationary (e.g., a
random walk).
Partial Autocorrelation Function (PACF): Shows the partial correlation
between a time series and its lags, after removing the effects of
intermediate lags.
2. Seasonality Adjustment:
3. Smoothing:
4. Differencing:
Seasonal differencing: Subtracting the value from the same time period in
the previous season (e.g., subtract last year's monthly sales from this
year's monthly sales).
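A brief sketch of inspecting and removing seasonality, assuming monthly data with a yearly cycle (the synthetic series and the period of 12 are illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range('2020-01-01', periods=48, freq='MS')
values = 10 + 0.5 * np.arange(48) + 5 * np.sin(2 * np.pi * np.arange(48) / 12)
ts = pd.Series(values, index=idx)

# Decompose into trend, seasonal, and residual components
decomp = seasonal_decompose(ts, model='additive', period=12)
print(decomp.seasonal.head(12))

first_diff = ts.diff()        # removes the trend
seasonal_diff = ts.diff(12)   # removes the yearly seasonal pattern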
Conclusion:
Time series data analysis involves various methods and techniques to analyze
sequential data and forecast future values. By identifying patterns like trends and
seasonality, and applying models such as ARIMA, SARIMA, and Exponential
Smoothing, time series forecasting can help make informed predictions. Proper
evaluation metrics are essential for selecting the best forecasting model, ensuring
the predictions are both accurate and reliable.
Both concepts are fundamental for analyzing time series data and building
effective forecasting models.
Definition:
A stationary time series is one where:
In other words, a stationary series does not exhibit any trend, seasonality, or
structural changes over time. Stationarity is crucial because most time series
forecasting models (e.g., ARIMA) assume that the data is stationary.
Checking for Stationarity:
To check whether a time series is stationary, we can look for the following:
Visual Inspection: Plot the time series and see if there are trends, seasonal
patterns, or changing variance.
Statistical Tests:
Making Time Series Stationary:
If a time series is non-stationary, we can apply techniques to make it stationary:
1. Differencing:
2. Log Transformation:
3. Smoothing:
4. Detrending:
Subtract the estimated trend from the original data to eliminate the trend
component.
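A small sketch of these transformations on an assumed trending series (the data is illustrative):

import numpy as np
import pandas as pd

# Exponentially trending series (illustrative)
np.random.seed(1)
ts = pd.Series(np.exp(0.05 * np.arange(100)) + np.random.normal(0, 0.1, 100))

log_ts = np.log(ts)                  # log transform stabilizes growing variance
stationary = log_ts.diff().dropna()  # first difference removes the trend
print(stationary.head())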
Definition:
Seasonality refers to repeating patterns or cycles in the time series data at regular
intervals, often due to predictable external factors such as climate, holidays, or
business cycles. Seasonality manifests in data that exhibits consistent fluctuations
over specific time periods (e.g., daily, monthly, yearly).
1. Periodic Patterns: The data follows a regular, predictable cycle or pattern that
repeats over specific periods.
Example: Retail sales tend to spike during the holiday season (e.g.,
Christmas) every year.
Identifying Seasonality:
Visual Inspection: Plot the time series data and look for regular fluctuations or
patterns that repeat over specific time intervals.
Types of Seasonality:
1. Additive Seasonality:
The seasonal effect is constant over time. The size of the seasonal effect
does not depend on the level of the time series.
Model: y_t = T_t + S_t + R_t (trend, seasonal, and residual components add together).
2. Multiplicative Seasonality:
The seasonal effect changes in proportion to the level of the time series.
The seasonal fluctuations increase as the series value increases.
Model: y_t = T_t \times S_t \times R_t (the components multiply together).
Conclusion:
Stationarity is an essential concept for time series analysis because many
time series models assume that the data is stationary. Stationarity ensures that
the statistical properties of the series, such as mean and variance, do not
change over time. If a series is non-stationary, techniques like differencing,
transformation, and detrending are applied to make it stationary.
Python Implementation
Below is a Python illustration of time series analysis, focusing on forecasting,
stationarity, and seasonality.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series (illustrative): trend + seasonality + noise, 100 points
np.random.seed(0)
time = np.arange(100)
values = 0.5 * time + 10 * np.sin(2 * np.pi * time / 12) + np.random.normal(0, 2, 100)
df = pd.DataFrame({'Time': time, 'Value': values})

# Stationarity check using ADF test
result = adfuller(df['Value'])
print("ADF Statistic:", result[0])
print("p-value:", result[1])
if result[1] < 0.05:
    print("The series is stationary.")
else:
    print("The series is non-stationary.")

# ACF and PACF plots help choose the ARIMA orders
plot_acf(df['Value'], lags=20)
plot_pacf(df['Value'], lags=20)
plt.show()

# Fit an ARIMA model and forecast the next 10 points
model = ARIMA(df['Value'], order=(1, 1, 1))
model_fit = model.fit()
forecast = model_fit.forecast(steps=10)

# Plot forecast
plt.figure(figsize=(10, 5))
plt.plot(df['Time'], df['Value'], label="Original")
plt.plot(range(100, 110), forecast, label="Forecast", color='red')
plt.title("Forecasting with ARIMA")
plt.legend()
plt.show()
Autoregressive (AR) models are statistical models that predict future values
of a time series based on its own past values.
Overview:
Autoregressive models predict the current value of a time series as a linear
combination of its previous values. These models assume that the series is a
linear function of its past values and that the relationship between past values can
explain the future values.
Key Features of AR Models:
Stationarity: AR models assume that the time series is stationary, meaning its
mean and variance are constant over time. Non-stationary data may need to
be transformed using differencing or detrending.
Order Selection (p): The order of the AR model defines how many past values
(lags) are used to predict the current value. It is typically determined using the
Autocorrelation Function (ACF) or Partial Autocorrelation Function (PACF)
plots.
Overview:
A Recurrent Neural Network (RNN) is a type of artificial neural network that is
well-suited for sequential data, such as time series. Unlike traditional feed-forward
neural networks, RNNs have loops that allow information to persist, making them
capable of learning from previous time steps in a sequence.
RNN Architecture: RNNs have a feedback loop that passes the hidden state
from one time step to the next. This allows the network to maintain memory of
past states.
Types of RNNs:
1. Vanilla RNN: The basic form of RNN, where the hidden state depends on the
previous hidden state and current input.
2. Model Definition: Build an RNN model using layers like LSTM or GRU.
4. Prediction: Use the trained model to make predictions on future time steps.
AR Model in Python using statsmodels
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error

# Synthetic time series: sine wave with added noise
np.random.seed(0)
series = np.sin(0.1 * np.arange(200)) + np.random.normal(0, 0.1, 200)

# Train/test split that preserves the temporal order
train, test = series[:160], series[160:]

# Fit an autoregressive model with 10 lags
model = AutoReg(train, lags=10)
model_fitted = model.fit()

# Make predictions
predictions = model_fitted.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)

mse = mean_squared_error(test, predictions)
print(f'Mean Squared Error: {mse}')
In this implementation:
We generate a synthetic time series using a sine wave with added noise.
We make predictions on the test data and evaluate the model's performance
using Mean Squared Error (MSE).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from sklearn.metrics import mean_squared_error

# Synthetic series (sine wave with noise), scaled to [0, 1]
np.random.seed(0)
series = np.sin(0.1 * np.arange(300)) + np.random.normal(0, 0.1, 300)
scaler = MinMaxScaler()
data = scaler.fit_transform(series.reshape(-1, 1))

# Prepare the data for training (create sequences)
def create_dataset(data, time_step=1):
    X, y = [], []
    for i in range(len(data) - time_step):
        X.append(data[i:i+time_step, 0])
        y.append(data[i+time_step, 0])
    return np.array(X), np.array(y)

time_step = 10
X, y = create_dataset(data, time_step)

# Reshape the data for the LSTM model (samples, time steps, features)
X = X.reshape(X.shape[0], X.shape[1], 1)

# Train/test split, preserving temporal order
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Simple LSTM model with 50 units
model = Sequential([LSTM(50, input_shape=(time_step, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=16, verbose=0)

# Make predictions
predictions = model.predict(X_test)

# Inverse transform the predictions and actual values to the original scale
predictions_rescaled = scaler.inverse_transform(predictions)
y_test_rescaled = scaler.inverse_transform(y_test.reshape(-1, 1))

mse = mean_squared_error(y_test_rescaled, predictions_rescaled)
print(f'Mean Squared Error: {mse}')
In this implementation:
We build a simple LSTM model with 50 units and train it on the time series
data.
We then make predictions on the test set and evaluate the model's
performance using Mean Squared Error (MSE).
Conclusion:
Autoregressive (AR) models are classical time series models that predict
future values based on past values. They are widely used when the data
shows a linear dependence on previous observations.
Recurrent Neural Networks (RNNs), especially LSTM and GRU, are deep
learning models designed to capture long-term dependencies in sequential
data. They are ideal for time series forecasting when the data has complex,
non-linear relationships.
Python provides robust libraries like statsmodels for AR models and TensorFlow
for building RNNs, making it easy to implement both approaches for time
series forecasting.
By choosing the appropriate model based on the nature of your data, you can
make accurate forecasts and gain valuable insights from time series data.
Unit 4
Data Science: Classification in Machine Learning
Overview:
Classification is a supervised learning technique where the goal is to predict the
categorical class labels of new observations based on a labeled dataset. In
classification problems, the output variable is discrete (i.e., belongs to a specific
class or category). For example, predicting whether an email is spam or not spam
(binary classification) or predicting the species of a flower based on its features
(multi-class classification).
There are many classification algorithms, each with strengths and weaknesses.
The choice of algorithm depends on the dataset, the problem, and the
assumptions behind the algorithms.
Multi-Class Classification: The target variable has more than two classes
(e.g., predicting the type of flower based on its features).
Example: Predicting the species of a flower (setosa, versicolor, virginica).
Below, we will implement Logistic Regression, K-Nearest Neighbors (KNN), and
Support Vector Machine (SVM) classifiers using Python with the popular scikit-
learn library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris
4. Logistic Regression
Logistic regression is a simple classification model used for binary or multi-class
classification. We'll use the Iris dataset, which has 3 species of flowers as
classes.
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset and split into training and test sets
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model and make predictions
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Confusion Matrix: Shows true positives, false positives, true negatives, and
false negatives.
Classification Report: Displays precision, recall, and F1-score for each class.
from sklearn.neighbors import KNeighborsClassifier

# Train the KNN classifier (k = 5) on the same split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test)
print("Accuracy (KNN):", accuracy_score(y_test, y_pred_knn))
from sklearn.svm import SVC

# Train the SVM classifier (RBF kernel) on the same split
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm.predict(X_test)

print("Confusion Matrix (SVM):\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report (SVM):\n", classification_report(y_test, y_pred_svm))
8. Conclusion
Logistic Regression: Suitable for binary and multi-class classification tasks. It
works well when the relationship between the features and the class is
approximately linear.
Support Vector Machine (SVM): A powerful classifier that works well in high-
dimensional spaces, especially with the kernel trick. It is effective for both
linear and non-linear classification tasks.
Summary of Results:
We have implemented three different classification algorithms: Logistic
Regression, K-Nearest Neighbors, and Support Vector Machine.
You can easily adapt these models to different datasets and modify their
hyperparameters (e.g., k for KNN, C and kernel for SVM) to optimize performance
for specific tasks.
Next Steps:
Hyperparameter Tuning: You can use techniques like Grid Search or Random
Search to optimize hyperparameters.
Model Interpretability: You can explore model interpretability using tools like
SHAP or LIME to understand how models make predictions.
1. Class Separation:
It maximizes the between-class variance while minimizing the within-
class variance to ensure that the classes are as distinct as possible in the
feature space.
The aim is to project the data onto a new axis (or axes) that maximally
separates the classes.
Within each class, the projection should reduce the spread of data points
as much as possible.
4. Dimensionality Reduction:
LDA can also reduce the dimensionality of the dataset by selecting the
most important features, which can improve model performance and
speed up training time.
3. Applications of LDA
Dimensionality Reduction: LDA is widely used to reduce the number of
features while preserving the class separability.
Classification: Once the linear discriminants are computed, they can be used
as new features for classification models (e.g., Logistic Regression, KNN).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the Iris dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize LDA and project the data onto the linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)  # Reduce to 2 components for visualization
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
After dimensionality reduction using LDA, we can use a classifier to predict the
classes.
from sklearn.linear_model import LogisticRegression

# Train a classifier on the LDA-transformed features
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train_lda, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_lda)
print("Accuracy:", accuracy_score(y_test, y_pred))
Classification Report: Contains precision, recall, and F1-score for each class.
LDA works best when there is a clear distinction between the classes and is
typically used for classification tasks. PCA is better suited for reducing the feature
space when there is no class label information.
6. Applications of LDA
1. Text Classification: LDA is widely used in text classification tasks (e.g., spam
detection) where each document can be represented as a vector of features
(word counts, term frequencies).
2. Face Recognition: LDA is used to reduce the dimensionality of image data and
enhance class separability.
3. Medical Diagnosis: LDA can help classify patients into different disease
categories based on medical measurements and test results.
7. Conclusion
Linear Discriminant Analysis (LDA) is a powerful technique for both
dimensionality reduction and classification. It maximizes class separability by
finding linear combinations of features.
LDA is ideal for supervised classification problems with clearly defined class
labels. It is commonly used for reducing high-dimensional data and improving
classifier performance.
The Iris dataset example showed how LDA can be applied to reduce the
number of features and visualize class separation in 2D, followed by training a
classifier (Logistic Regression) to make predictions.
1. Support Vector Machine (SVM)
Overview:
A Support Vector Machine (SVM) is a powerful supervised learning algorithm
primarily used for classification tasks, but it can also be applied to regression.
SVM works by finding the hyperplane that best separates the classes in a feature
space, with a maximum margin between the classes. It is effective for both linear
and non-linear classification problems using the kernel trick.
Key characteristics of SVM:
Maximal Margin: SVM aims to maximize the margin between the support
vectors (the data points closest to the hyperplane) and the decision boundary.
Support Vectors: These are the data points that are closest to the hyperplane
and play a key role in defining the decision boundary.
SVM Basics:
Linear SVM: For linearly separable data, SVM finds a hyperplane that
maximizes the margin.
Non-Linear SVM: For non-linearly separable data, the kernel trick maps the
data into a higher-dimensional space where it becomes linearly separable.
The goal is to maximize the margin, which is the distance between the closest
data points (support vectors) and the hyperplane.
Overview:
Decision Trees (DT) are a popular machine learning algorithm used for both
classification and regression tasks. A decision tree splits the data into subsets
based on feature values, and it recursively splits the data at each node until the
data in each leaf node is as homogenous as possible. Decision trees are easy to
understand, interpret, and visualize.
Key characteristics of Decision Trees:
Splitting Criterion: The decision tree algorithm uses criteria like Gini Impurity
(for classification) or Mean Squared Error (MSE) (for regression) to choose
the best feature to split on at each step.
Overfitting: Decision trees can easily overfit the training data if they are too
deep, leading to poor generalization on unseen data. Pruning is used to avoid
overfitting by limiting the tree depth or removing nodes that provide little
information.
Non-Linear Decision Boundaries: Decision trees can model non-linear
relationships, which makes them versatile for various types of datasets.
2. Splitting: The dataset is split at each node based on the feature that provides
the best separation.
3. Leaf Nodes: These nodes represent the final classification or regression value
(for classification, this is a class label).
4. Pruning: Pruning reduces the size of the tree by removing nodes that do not
improve classification accuracy.
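A short sketch of cost-complexity pruning in scikit-learn (Iris data; the chosen alpha is an illustrative pick from the pruning path):

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate pruning strengths along the cost-complexity path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# A larger ccp_alpha prunes more aggressively (smaller tree)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=42)
pruned.fit(X_train, y_train)
print("Pruned tree depth:", pruned.get_depth())
print("Test accuracy:", pruned.score(X_test, y_test))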
3. Python Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM model with an RBF kernel
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train, y_train)

# Make predictions
y_pred = svm.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Optional: visualize decision regions by refitting the SVM on 2 PCA components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
svm_2d = SVC(kernel='rbf').fit(X_train_pca, y_train)  # 2D model used only for plotting

xx, yy = np.meshgrid(np.linspace(X_train_pca[:, 0].min() - 1, X_train_pca[:, 0].max() + 1, 200),
                     np.linspace(X_train_pca[:, 1].min() - 1, X_train_pca[:, 1].max() + 1, 200))
Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, edgecolor='k')
plt.show()
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split data into training and test sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier (Gini criterion, limited depth to reduce overfitting)
dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt.fit(X_train, y_train)

# Make predictions
y_pred = dt.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Decision trees are easy to visualize. Below, we use plot_tree() from sklearn to
display the tree.
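A minimal version of that call, assuming the dt model and imports from the block above:

plt.figure(figsize=(10, 6))
plot_tree(dt, feature_names=load_iris().feature_names,
          class_names=list(load_iris().target_names), filled=True)
plt.show()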
Type of Algorithm — SVM: linear or non-linear classification using the kernel trick; Decision Tree: tree-based, recursive splitting.
Performance on High-Dimensional Data — SVM: excellent (especially with non-linear kernels); Decision Tree: can overfit in high dimensions.
Other application areas noted for SVM include bioinformatics.
5. Conclusion
SVM: Best for datasets where the classes are clearly separated, especially
when the data is high-dimensional or non-linear. It’s a robust and powerful
classifier, especially when using appropriate kernels.
Decision Trees: Easy to interpret and visualize, ideal for both classification
and regression tasks. However, they tend to overfit without careful pruning or
regularization. Decision trees form the basis of ensemble methods like
Random Forest and Gradient Boosting.
Both techniques are widely used in machine learning and data science, and their
choice depends on the problem at hand, the dataset, and the interpretability
required for the model.
1. Overview of Clustering
Clustering is a type of unsupervised learning technique in machine learning,
where the goal is to group a set of objects into classes or clusters. Objects in the
same cluster are more similar to each other than to those in other clusters.
Clustering is widely used in various applications, including image segmentation,
customer segmentation, anomaly detection, and market research.
Unlike classification tasks, where we have labeled data, clustering deals with
unlabeled data, and the algorithm must find inherent patterns and groupings
within the data itself.
2. Types of Clustering
Clustering techniques can be broadly classified into the following categories:
K-medoids: Similar to K-means but uses actual data points as cluster centers
(medoids).
Divisive Clustering: A top-down approach that starts with all points in one
cluster and splits them iteratively.
Steps of K-means:
1. Initialize K centroids randomly or by selecting random data points.
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until convergence (i.e., when the centroids no longer change), as shown in the sketch below.
Parameters of DBSCAN:
Epsilon (ε): The radius within which points are considered neighbors.
MinPts: The minimum number of points within ε required for a point to be a core point (see the sketch below).
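A minimal DBSCAN sketch on the same standardized Iris features (the eps and min_samples values are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# eps is the neighborhood radius; min_samples the neighbors needed for a core point
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X)

print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points (label -1):", int(np.sum(labels == -1)))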
Steps:
1. Start with n clusters (each data point is its own cluster).
2. Merge the two closest clusters according to a linkage criterion (e.g., Ward, single, complete).
3. Repeat until the desired number of clusters remains, as in the sketch below.
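A minimal agglomerative clustering sketch (Ward linkage and 3 clusters are illustrative choices):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Bottom-up merging with Ward linkage into 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print("Cluster sizes:", [list(labels).count(c) for c in sorted(set(labels))])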
We'll standardize the features as clustering algorithms are sensitive to the scale of
the data.
5. 1. K-means Clustering
9. Conclusion
K-means: Performs well when the clusters are spherical and well-separated. It
is sensitive to the initial placement of centroids and requires the number of
clusters to be pre-specified.
DBSCAN: Useful when clusters have irregular shapes and when there is noise
in the data. It doesn't require the number of clusters to be defined in advance.