Data Science Notes
Ans: i) Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with thousands
of variables.
ii) Data science is a collection of techniques used to extract valuable insights from
data. It has become an essential tool for any organization that collects, stores, and
processes data as part of its operations. Data science techniques rely on finding useful
patterns, connections, and relationships within data.
iii) Example of Data Science:
a) Suppose XYZ is a subscription-based streaming service offering movies, TV shows,
and original content. They have been experiencing a high churn rate, meaning many
customers are cancelling their subscriptions after a short period.
b) XYZ wants to use data science to understand why customers are leaving and to
develop strategies to improve retention.
c) Steps in the Data Science mechanism:
1) Data Collection: XYZ collects a variety of data, including customer profiles (age,
location, subscription plan), usage data (viewing history, average watch time),
customer feedback (surveys, reviews), and churn data (cancellation dates and
reasons).
2) Data Cleaning: The data science team cleans the collected data to address issues
like missing values (e.g., missing age information), inconsistencies (e.g., different
formats for dates), and errors (e.g., typos in feedback).
3) Data Exploration: Exploratory Data Analysis (EDA) is performed to understand the
data better. This includes analysing trends such as peak viewing times, popular
content genres, and differences in usage patterns across different customer
demographics.
4) Data Analysis: Statistical methods and machine learning techniques are applied to
analyse the data. This involves identifying patterns and factors that contribute to
churn, such as low engagement or dissatisfaction with content.
5) Modeling: A predictive model is developed to forecast which customers are at high
risk of cancelling their subscriptions. Features such as “number of days since last
login” and “average watch time” are used in the model to predict churn risk.
6) Data Visualization: The results are presented using visual tools like dashboards and
charts. These visualizations help stakeholders understand key insights, such as the
impact of different factors on churn and the effectiveness of retention strategies.
7) Implementation: Based on the analysis, XYZ implements strategies to improve
retention. This includes personalized content recommendations to increase
engagement and targeted offers or incentives for high-risk customers to encourage
them to stay.
8) Results: Following the implementation of these strategies, XYZ observes a
reduction in the churn rate, increased customer engagement, and higher overall
satisfaction. This example shows the effectiveness of using data science to address
business challenges.
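A minimal sketch of the Series referenced below, assuming the standard pandas constructor with an explicit label index (illustrative reconstruction, since the original snippet is not shown above):
import pandas as pd
# Creating a Series with a labelled index
data = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Accessing data by label and by position
print(data['b'])     # Output: 20
print(data.iloc[0])  # Output: 10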
In this example, data is a Series with values [10, 20, 30, 40] and indexes ['a', 'b', 'c',
'd'].
2.) DataFrame:
i)Definition: A DataFrame is a two-dimensional, size-mutable, and potentially
heterogeneous tabular data structure with labelled axes (rows and columns). It is
similar to a table in a relational database or a data frame in R.
ii)Components:
• Columns: Each column is a Series, and the DataFrame can have multiple
columns, each with its own data type.
• Index: The labels for rows, which allow access to rows by these labels.
• Values: The actual data in the DataFrame organized into rows and columns.
iii)Example:
import pandas as pd
# Creating a DataFrame
data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']})
# Accessing data
print(data['Name']) # Output: Series with names
print(data.loc[1]) # Output: Row with index 1
In this example, data is a DataFrame with three columns ('Name', 'Age', and 'City') and
three rows of data.
Q.6) What is overfitting and underfitting? With an example explain how to recognize
and handle them.
Ans: 1. Overfitting
Definition:
Overfitting happens when a model learns not just the patterns in the training data but
also the noise and details specific to that dataset. As a result, the model performs
very well on the training data but poorly on unseen or test data because it fails to
generalize.
Recognition:
• The model has high accuracy on the training set but low accuracy on the test
set.
• The model is too complex (e.g., using too many parameters) and fits the training
data almost perfectly.
Example:
Imagine you are training a model to predict house prices. If your model is overfitting,
it may learn that a specific house in the training data had a red door and that
influenced its price, but in reality, the color of the door has no real impact on price. So
when you show it a house with a blue door, it might give a bad prediction.
How to Handle Overfitting:
1. Simplify the Model: Use fewer features or a less complex model to reduce
unnecessary learning.
2. Regularization: Techniques like L1 or L2 regularization add penalties to the
model's complexity, encouraging it to keep the weights small and avoid learning
noise.
3. Cross-Validation: Use techniques like k-fold cross-validation to test the model's
generalizability on multiple subsets of the data.
4. More Training Data: If possible, adding more training data can help the model
generalize better by seeing more examples.
2. Underfitting
Definition:
Underfitting occurs when a model is too simple to capture the underlying patterns in
the data. This happens when the model cannot adequately learn from the data,
resulting in poor performance on both the training and test datasets.
Recognition:
• The model has low accuracy on both the training and test sets.
• The model is not complex enough and fails to capture the important patterns in
the data.
Example:
Consider the same house price prediction scenario. If your model is underfitting, it
might only consider one or two basic features like the number of rooms but ignore
other important factors like location or size. As a result, its predictions would be
overly simplistic and inaccurate.
How to Handle Underfitting:
1. Increase Model Complexity: Use a more complex model with additional
features or layers (in the case of neural networks) to capture more patterns.
2. Feature Engineering: Add or modify features to provide the model with more
relevant information.
3. Reduce Regularization: If regularization is being used, too much of it can cause
underfitting. You may need to reduce it.
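A minimal sketch of how the train/test accuracy gap reveals both problems, assuming scikit-learn and a synthetic dataset (illustrative only, not part of the original notes):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# A depth-1 tree tends to underfit; an unlimited-depth tree tends to overfit
for depth in [1, 3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # Overfitting: high train accuracy but much lower test accuracy
    # Underfitting: low accuracy on both sets
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")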
2. Bivariate Analysis
Definition:
Bivariate analysis involves the analysis of two variables to understand their
relationship. It helps in discovering correlations, trends, or patterns between the two
variables.
Purpose:
• Investigate the relationship between two variables.
• Understand whether one variable impacts or is related to the other (e.g., does
height affect weight?).
Example:
If you have data on people’s ages and their income, bivariate analysis would explore
whether there is a relationship between age and income. You might find that older
individuals tend to have higher incomes.
Common Techniques:
• Scatter plots: Visualizes the relationship between two continuous variables.
• Correlation coefficients (e.g., Pearson’s correlation) to measure the strength and
direction of the relationship.
• Crosstabulation and bar plots for categorical variables.
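As a small illustration of the techniques above (hypothetical age and income values, not from the notes), the Pearson correlation and a scatter plot could be produced like this:
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical age/income data for illustration
df = pd.DataFrame({'age':    [22, 30, 35, 42, 50, 58],
                   'income': [25000, 40000, 48000, 60000, 72000, 80000]})
# Pearson correlation coefficient between the two variables
print(df['age'].corr(df['income']))  # close to +1 indicates a strong positive relationship
# Scatter plot to visualize the relationship
plt.scatter(df['age'], df['income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()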
3. Multivariate Analysis
Definition:
Multivariate analysis deals with the analysis of more than two variables at once. It
explores relationships and interactions between multiple variables to understand how
they collectively impact each other.
Purpose:
• Understand complex relationships between several variables.
• Investigate the effects of multiple independent variables on one or more
dependent variables.
• Identify patterns and interactions that might be missed in univariate or bivariate
analyses.
Example:
In a dataset with variables like age, income, education level, and spending habits,
multivariate analysis might explore how age and education level together influence
income or how these factors collectively affect spending habits.
Common Techniques:
• Multiple regression analysis: Examines how multiple independent variables
influence a dependent variable.
• Principal component analysis (PCA): Reduces the dimensionality of the data
while preserving important patterns.
• Multivariate analysis of variance (MANOVA): Tests the effect of several
independent variables on more than one dependent variable
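A minimal sketch of two of these techniques, assuming a small hypothetical dataset (illustrative only):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
# Hypothetical data: age, education level (years), and income
df = pd.DataFrame({'age':       [25, 32, 40, 48, 55],
                   'education': [12, 16, 16, 18, 14],
                   'income':    [30000, 52000, 61000, 80000, 58000]})
# Multiple regression: how age and education together influence income
reg = LinearRegression().fit(df[['age', 'education']], df['income'])
print(reg.coef_, reg.intercept_)
# PCA: reduce the three variables to two components while keeping most of the variance
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(pca.explained_variance_ratio_)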
iii) The expected information still required to classify a tuple after partitioning the dataset D on an attribute A with v distinct values is:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
iv) Information gain is defined as the difference between the original information requirement and the new requirement after partitioning on attribute A:
Gain(A) = Info(D) − Info_A(D)
v) A higher gain indicates that the attribute provides a better partition, leading to purer (or more homogeneous) subsets.
vi) The attribute with the highest gain is selected for splitting at each step in a decision tree.
vii) Information gain tends to favor attributes with many possible values (e.g., product_ID), which can lead to highly specific partitions that provide little meaningful classification information.
viii) A split on an attribute like product_ID results in many partitions, each containing a single tuple. In this case, the entropy after the split is 0, meaning no further information is needed. However, this partitioning is not useful for classification.
ix) To counter this bias, the Gain Ratio method uses split information, which measures the potential information generated by splitting the data. It is defined as:
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
x) Split information represents the potential for dividing the dataset into partitions and takes into account the distribution of tuples across these partitions.
xi) Formula for Gain Ratio:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
This ensures that attributes with many values (and hence many partitions) are not unfairly favored.
xii) The attribute with the highest gain ratio is selected as the splitting attribute in a decision tree.
xiii) As the split information approaches 0, the gain ratio becomes unstable. To prevent issues, constraints are applied to ensure that the selected attribute maintains a reasonable gain ratio, avoiding attributes with excessive splitting.
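A minimal sketch of these entropy, gain, and gain-ratio calculations in Python (hypothetical attribute values and class labels, for illustration only):
import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_and_ratio(attribute, labels):
    total = len(labels)
    info_d = entropy(labels)
    info_a = 0.0      # expected information after partitioning on the attribute
    split_info = 0.0  # potential information generated by the split itself
    for value in set(attribute):
        subset = [lab for att, lab in zip(attribute, labels) if att == value]
        weight = len(subset) / total
        info_a += weight * entropy(subset)
        split_info -= weight * math.log2(weight)
    gain = info_d - info_a
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical attribute values and class labels
print(gain_and_ratio(['low', 'low', 'high', 'high'], ['no', 'no', 'yes', 'yes']))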
Q.2.) What is classification? Hence explain the decision tree concept.
Ans: i) Classification is the most widely used data science task in business. The objective of a classification model is to predict a target variable that is binary (e.g., a loan decision) or categorical (e.g., a customer type) when a set of input variables are given.
ii) The model does this by learning the generalized relationship between the predicted target variable and all other input attributes from a known dataset.
Decision tree concept:
1) Definition and Basic Components of Decision Trees:
A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It splits the data into subsets based on the most significant features, forming a tree-like structure where each internal node represents a decision (split), and each leaf node represents the outcome (class or value).
Root Node: The top node of the tree, representing the entire dataset, where the first split occurs.
Internal Nodes: Represent decisions based on features that lead to further splits.
Leaf Nodes: The final output after splitting, representing the predicted class or value.
2) Splitting Criteria
The key idea behind decision trees is to divide data into groups that are as homogeneous as possible based on certain criteria. The most common criterion used for splitting is:
Entropy (used in Information Gain): Measures the randomness or uncertainty in the data. The formula for entropy is:
Info(D) = − Σ_{i=1}^{m} p_i × log2(p_i)
and the expected information after a split is:
Info_A(D) = Σ_{k=1}^{n} (|D_k| / |D|) × Info(D_k)
Where:
D is the entire dataset,
D_k are the partitions of D after a split based on a feature,
n is the number of partitions.
The goal is to maximize information gain with each split.
Manhattan Distance:
d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + ... + |x_{ip} − x_{jp}|
2) Binary Variables:
i) A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present.
ii) A binary variable is symmetric if both states are equally valuable and there is no preference for which outcome is coded as 0 or 1, and asymmetric if the outcomes are of unequal importance, such as disease test results, where the rarer, more important outcome is coded as 1 and the other as 0.
iii) To calculate the dissimilarity between objects i and j for symmetric binary variables:
d(i, j) = (r + s) / (q + r + s + t)
where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j.
3) Categorical Variables:
i) A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.
ii) The dissimilarity between two objects i and j can be calculated as:
d(i, j) = (p − m) / p
where m is the number of variables for which i and j are in the same state, and p is the total number of variables.
4) Ordinal Variables:
i) An ordinal variable can be discrete or continuous. Here order is important, e.g., rank. They can be treated like interval-scaled variables.
ii) To compute dissimilarity: replace each value x_if by its rank r_if ∈ {1, ..., M_f}, map the rank onto [0, 1] using z_if = (r_if − 1) / (M_f − 1), and then compute the dissimilarity using the methods for interval-scaled variables.
5) Ratio-Scaled Variables:
i) A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{−Bt}.
ii) To compute dissimilarity: apply a logarithmic transformation y_if = log(x_if) and then treat the transformed values as interval-scaled.
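As a small illustration of the binary case above (assumed example values, not from the notes), the symmetric dissimilarity d(i, j) = (r + s) / (q + r + s + t) can be computed as:
def binary_dissimilarity(obj_i, obj_j):
    # Count the four contingency-table cells for two binary vectors
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 0)
    # Symmetric binary dissimilarity: mismatches over all variables
    return (r + s) / (q + r + s + t)

# Hypothetical objects described by five symmetric binary variables
print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # Output: 0.4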
Q.5.) What is cluster analysis? Explain the types of data in cluster analysis.
Ans: i) Cluster analysis is a method of grouping a set of objects (data points) in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups.
ii) There are various types of data in clustering analysis. We will see these types with dissimilarity. Dissimilarity in clustering analysis refers to a measure of how different two data points or objects are from each other.
(refer to Q.3. for further answer)
Q.2.) What is data discretization? Explain with example.
Ans:
i) Data discretization is the process of converting continuous data into discrete categories or intervals.
ii) This is particularly useful in data mining and machine learning, where algorithms often perform better with categorical data rather than continuous numerical data.
iii) Discretization can reduce the complexity of the data, making it easier to analyze and interpret.
iv) Methods of Discretization
1. Equal-width Binning: The range of continuous values is divided into intervals of equal width.
2. Equal-frequency Binning: The data is divided into intervals such that each interval contains approximately the same number of data points.
3. Clustering-based Discretization: Clustering techniques are used to group similar values together.
Example of Data Discretization
Let's consider a simple example with a continuous variable: Age.
Here is a dataset of ages:
Ages: 23, 45, 18, 30, 50, 26, 34, 29, 42, 60
Step 1: Choose Discretization Method
Suppose we choose Equal-width Binning and want to create three bins (or intervals).
Step 2: Define Bins
1. Minimum Age: 18
2. Maximum Age: 60
3. Range: 60 - 18 = 42
4. Width of Each Bin: Range / Number of Bins = 42 / 3 = 14
Now, we can create the bins:
Bin 1: 18 to 32 (18 + 14)
Bin 2: 33 to 46 (32 + 14)
Bin 3: 47 to 60 (46 + 14)
Step 3: Assign Data to Bins
Now, we can assign each age to its respective bin:
Age 23: Bin 1
Age 45: Bin 2
Age 18: Bin 1
Age 30: Bin 1
Age 50: Bin 3
Age 26: Bin 1
Age 34: Bin 2
Age 29: Bin 1
Age 42: Bin 2
Age 60: Bin 3
Resulting Discretized Data
After discretization, the ages can be represented as categories:
Age Groups:
- Group 1 (18-32): 23, 18, 30, 26, 29
- Group 2 (33-46): 45, 34, 42
- Group 3 (47-60): 50, 60
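A minimal sketch of this equal-width binning with pandas, assuming pd.cut and the bin edges derived above:
import pandas as pd
ages = pd.Series([23, 45, 18, 30, 50, 26, 34, 29, 42, 60])
# Equal-width binning into three intervals covering the range 18-60
groups = pd.cut(ages, bins=[17, 32, 46, 60],
                labels=['Group 1 (18-32)', 'Group 2 (33-46)', 'Group 3 (47-60)'])
print(groups.value_counts())
# Alternatively, pd.cut(ages, bins=3) computes equal-width edges automatically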
Median: The middle value when the data is ordered. If the sepal lengths are sorted as
4.9, 5.0, and 5.1 cm, the median is 5.0 cm.
Mode: The most frequently occurring value. If the sepal lengths are 5.0, 5.0, and 5.1
cm, the mode is 5.0 cm.
2)Measure of Spread: These metrics describe the variability of the dataset.
Range: The difference between the maximum and minimum values. For example, if
the sepal lengths are 4.5 cm (min) and 5.5 cm (max), the range is 5.5−4.5=1.0 cm.
Standard Deviation: This indicates how much the values deviate from the mean. For a dataset with a mean of 5.0 cm, if the lengths are 4.8, 5.0, and 5.2 cm, the standard deviation would show that the values are close to the mean.
Variance: The average of the squared differences from the mean. If the mean is 5.0 cm and the data points are 4.8 and 5.2 cm, the variance reflects how spread out these points are from the mean.
Multivariate Exploration: This involves studying more than one attribute in the dataset simultaneously to understand relationships between attributes.
1) Central Data Point:
This represents a hypothetical observation point made up of the mean of each attribute in the dataset.
For example, in the Iris dataset, the central mean point for sepal length, sepal width, petal length, and petal width could be expressed as {5.006, 3.418, 1.464, 0.244}.
2) Correlation:
This measures the statistical relationship between two attributes using the Pearson correlation coefficient (r), which ranges from -1 to 1. For example, there is a strong positive correlation between temperature and ice cream sales if, as temperatures rise, ice cream sales increase.
The correlation coefficient quantifies this relationship, indicating how closely related the two attributes are. A coefficient of 0.8 suggests a strong positive correlation.
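A short sketch of both ideas with pandas (hypothetical flower measurements, not the actual Iris values):
import pandas as pd
# Hypothetical measurements: four attributes, five observations
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 5.0, 5.2, 4.8],
                   'sepal_width':  [3.5, 3.0, 3.4, 3.6, 3.1],
                   'petal_length': [1.4, 1.4, 1.5, 1.5, 1.6],
                   'petal_width':  [0.2, 0.2, 0.2, 0.3, 0.2]})
# Central data point: the mean of every attribute
print(df.mean())
# Pearson correlation between every pair of attributes
print(df.corr())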
2. Search Engines:
i) Search engines intelligently predict what a person is typing and automatically complete the sentences. For example, when you type "game" in Google, it suggests options like "Game of Thrones," "Game of Life," or, if you're interested in math, "game theory."
ii) All these suggestions are provided using autocomplete that uses Natural Language Processing to guess what the person wants to ask.
3. Voice Assistants:
i) Voice assistants are essential tools today. Siri, Alexa, and Google Assistant are commonly used for making calls, setting reminders, scheduling meetings, setting alarms, and browsing the internet.
ii) They use a complex combination of speech recognition, natural language understanding, and natural language processing to understand what humans are saying and then act on it.
4. Language Translators:
i) Language translation is made easy with tools like Google Translate, which can convert text from one language to another, such as English to Hindi.
ii) Modern translators utilize sequence-to-sequence modeling in Natural Language Processing, which is more precise than the older Statistical Machine Translation (SMT) method.
5. Sentiment Analysis:
i) Sentiment analysis allows companies to gauge how users feel about a particular topic or product by analyzing social media and other forms of communication.
ii) By using techniques such as natural language processing, computational linguistics, and text analysis, companies can determine whether the general sentiment is positive, negative, or neutral.
6. Grammar Checkers:
Grammar checkers are essential tools for ensuring error-free writing, especially in professional reports and academic assignments. They rely on natural language processing (NLP) to provide accurate suggestions.
7. Email Classification and Filtering:
Email classification and filtering use natural language processing (NLP) to automatically sort incoming emails into categories, improving organization and reducing clutter.
Q.15.) Write the algorithm of the k-means and k-medoid partitioning methods.
Ans:
The K-Means Clustering Method:
i. The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
ii. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
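A minimal sketch of the k-means loop itself (illustrative only; assumes NumPy and a fixed number of iterations rather than a convergence check):
import numpy as np

def k_means(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each object to the cluster with the nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: update each center to the mean of the objects assigned to it
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 2-3 repeat; a fuller version stops when assignments no longer change
    return labels, centers

# Hypothetical 2-D points forming two obvious groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
print(k_means(points, k=2))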
Q.18.) Briefly explain how to compute dissimilarity between objects described by binary, categorical (nominal), and ordinal variables.
Ans: (refer to answer of Q.3.)
Q.22.) What is a histogram?
Ans:
1) Definition: A histogram is a type of bar chart that represents the frequency distribution of a dataset by dividing the data into intervals, known as bins, and plotting the number of observations (frequency) within each bin.
2) Data Grouping: The data is divided into consecutive, non-overlapping intervals (bins). Each bin represents a range of values, and the height of each bar indicates the number of data points that fall within that range.
3) Continuous Data: Histograms are primarily used for continuous data, where values can fall within any range, making them suitable for showing distributions like height, weight, temperature, etc.
4) Shape Representation: The shape of the histogram can provide insights into the data distribution, including whether it is normal, skewed (left or right), uniform, or bimodal.
5) Bin Width: The width of the bins can significantly affect the appearance of the histogram. Wider bins may oversimplify the data, while narrower bins can introduce noise. Choosing the right bin width is crucial for accurate representation.
6) Frequency vs. Relative Frequency: A histogram can show either the absolute frequency (the count of data points in each bin) or relative frequency (the proportion of data points in each bin compared to the total number of observations).
7) X and Y Axes: The x-axis of a histogram represents the bins (ranges of data), while the y-axis represents the frequency (count) of data points in each bin.
8) Comparison: Histograms can be used to compare distributions between different datasets by overlaying multiple histograms on the same plot, which can help identify differences in central tendencies or variances.
9) Applications: Histograms are widely used in statistics, data analysis, and machine learning to analyze data distributions, detect outliers, and understand the spread of data.
10) Limitations: While histograms provide a useful overview of data distribution, they can obscure specific data points and may not be suitable for small datasets or when the data has a large number of unique values, where a different type of visualization might be more appropriate.
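A minimal sketch of drawing a histogram with Matplotlib (hypothetical values, for illustration only):
import matplotlib.pyplot as plt
# Hypothetical continuous data, e.g., heights in cm
heights = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180]
# Divide the data into 5 equal-width bins and plot the frequency of each bin
plt.hist(heights, bins=5, edgecolor='black')
plt.xlabel('Height (cm)')  # x-axis: the bins (ranges of data)
plt.ylabel('Frequency')    # y-axis: count of data points in each bin
plt.title('Histogram of heights')
plt.show()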
Q.24.) Differentiate between supervised and unsupervised learning.
Ans:
1. Definition: Supervised learning is trained on labeled data with input-output pairs, while unsupervised learning is trained on unlabeled data without explicit outputs.
2. Objective: Supervised learning predicts outputs for unseen data based on learned relationships, while unsupervised learning identifies patterns or structures in the data.
3. Types of Problems: Supervised learning is used for classification and regression tasks, while unsupervised learning is used for clustering and association tasks.
4. Data Requirements: Supervised learning requires a large amount of labeled data, while unsupervised learning does not require labeled data, making it easier to collect.
5. Examples of Algorithms: Supervised learning includes linear regression, decision trees, and neural networks, while unsupervised learning includes k-means clustering, PCA, and hierarchical clustering.
6. Evaluation: Supervised learning performance is evaluated using accuracy, precision, and recall, while unsupervised learning performance evaluation is challenging without ground truth.
7. Complexity: Supervised learning is generally more complex due to model fitting and prediction, while unsupervised learning is often simpler, but complexity increases with data patterns.
8. Use Cases: Supervised learning is used in spam detection, image classification, and diagnostics, while unsupervised learning is used in customer segmentation, anomaly detection, and insights.
9. Learning Process: In supervised learning, learning is guided by labeled data with feedback on predictions; in unsupervised learning, learning is exploratory with no feedback loop.
10. Real-World Application: Supervised learning is applied in scenarios requiring high accuracy, like credit scoring, while unsupervised learning is useful for exploratory data analysis before applying supervised methods.
3. Binary Splits: For each discrete-valued attribute A with v unique values, the number of possible binary splits that can be formed is 2^v − 2, excluding the empty and full sets.
5. Selecting Splitting Attributes: The attribute that results in the lowest Gini index (indicating highest purity) is selected as the splitting criterion, promoting better classification.
6. Continuous-Valued Attributes: For attributes with continuous values, split points are determined by the midpoints of sorted values, with the optimal point being the one that minimizes the Gini index.
8. Gini Impurity vs. Entropy: Unlike entropy, which measures uncertainty, the Gini index focuses solely on the distribution of classes, offering a simpler interpretation in decision trees.
9. Handling Multi-class Problems: The Gini index effectively handles multi-class scenarios, making it suitable for datasets with more than two classes.
10. Overfitting Considerations: While a lower Gini index suggests a better split, care should be taken to avoid overfitting by analyzing the model's performance on unseen data.
11. Applications: The Gini index is a core component of decision tree algorithms, often utilized in random forests and several machine learning frameworks for classification tasks.
12. Iterative Process: The decision tree grows by recursively applying the Gini index to find splits until the stopping criteria are met, leading to a finalized model that classifies instances efficiently.
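A small sketch of the Gini index computation for one candidate split, assuming the standard definition Gini(D) = 1 − Σ p_i² (hypothetical labels, for illustration only):
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # Weighted Gini index of a binary split: lower means purer partitions
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Hypothetical binary split of six class labels
print(gini_split(['yes', 'yes', 'yes'], ['no', 'no', 'yes']))  # approx. 0.222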
1. Matplotlib is a popular Python library for data visualization, widely used to create static, interactive, and animated plots. It's especially useful for displaying trends, relationships, and distributions in data.
2. The main purpose of Matplotlib is to provide an easy way to generate plots and graphs, making data analysis and interpretation simpler.
3. Installation: To install Matplotlib, use the command:
pip install matplotlib
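A very small usage sketch (hypothetical values, for illustration only):
import matplotlib.pyplot as plt
# A simple line plot showing a trend over time
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 160, 158, 170]
plt.plot(months, sales, marker='o')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly sales trend')
plt.show()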
Q.4.) State and explain the steps to perform hypothesis testing.
Ans:
1. State the Hypotheses:
Formulate the null hypothesis (H₀), which assumes no effect or difference, and the alternative hypothesis (H₁), which proposes a possible effect or difference.
2. Select Significance Level (α):
Choose a significance level, usually 0.05 or 5%, which represents the probability of rejecting the null hypothesis when it's actually true. This is your threshold for making decisions.
3. Choose the Test Type:
Decide which statistical test to use based on the data type and sample size. Common tests include the t-test, z-test, and chi-square test.
4. Collect and Prepare Data:
Gather data relevant to the hypotheses, ensuring it is clean, accurate, and representative of the population.
5. Calculate the Test Statistic:
Use your chosen test formula to calculate the test statistic (e.g., t-value or z-value), which will measure the degree of difference between the observed data and what is expected under the null hypothesis.
6. Find the Critical Value or P-value:
The critical value defines the cutoff point for rejecting the null hypothesis. Alternatively, you can calculate a p-value, which tells the probability of observing your results under the null hypothesis.
7. Compare Test Statistic to Critical Value:
If the test statistic exceeds the critical value, or if the p-value is less than α, you have enough evidence to reject the null hypothesis.
8. Make a Decision:
Based on the comparison, either reject the null hypothesis (if there is evidence for the alternative) or fail to reject it (if there isn't).
9. Draw a Conclusion:
Interpret the result in the context of your research. Clearly state if there's evidence supporting the alternative hypothesis or if the results align with the null.
10. Report the Findings:
Summarize the hypothesis, test used, test statistic, p-value, and the decision. This helps others understand the analysis and conclusions drawn from it.
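A minimal sketch of these steps using a one-sample t-test with SciPy (hypothetical sample values and hypothesized mean, for illustration only):
from scipy import stats
# Step 1: H0: the population mean equals 50; H1: it does not.
# Step 2: significance level
alpha = 0.05
# Step 4: sample data (hypothetical)
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56, 47]
# Steps 3 and 5-6: the one-sample t-test returns the test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Steps 7-9: compare the p-value to alpha and state the conclusion
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 50.")
else:
    print("Fail to reject H0: no significant difference from 50.")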