Data Science Notes
Ans: i) Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with thousands
of variables.
ii) Data science is a collection of techniques used to extract valuable insights from
data. It has become an essential tool for any organization that collects, stores, and
processes data as part of its operations. Data science techniques rely on finding useful
patterns, connections, and relationships within data.
iii) Example of Data Science:
a) Suppose XYZ is a subscription-based streaming service offering movies, TV shows,
and original content. They have been experiencing a high churn rate, meaning many
customers are cancelling their subscriptions after a short period.
b) XYZ wants to use data science to understand why customers are leaving and to
develop strategies to improve retention.
c) Steps in the Data Science mechanism:
1) Data Collection: XYZ collects a variety of data, including customer profiles (age,
location, subscription plan), usage data (viewing history, average watch time),
customer feedback (surveys, reviews), and churn data (cancellation dates and
reasons).
2) Data Cleaning: The data science team cleans the collected data to address issues
like missing values (e.g., missing age information), inconsistencies (e.g., different
formats for dates), and errors (e.g., typos in feedback).
3) Data Exploration: Exploratory Data Analysis (EDA) is performed to understand the
data better. This includes analysing trends such as peak viewing times, popular
content genres, and differences in usage patterns across different customer
demographics.
4) Data Analysis: Statistical methods and machine learning techniques are applied to
analyse the data. This involves identifying patterns and factors that contribute to
churn, such as low engagement or dissatisfaction with content.
5) Modeling: A predictive model is developed to forecast which customers are at high
risk of cancelling their subscriptions. Features such as “number of days since last
login” and “average watch time” are used in the model to predict churn risk.
6) Data Visualization: The results are presented using visual tools like dashboards and
charts. These visualizations help stakeholders understand key insights, such as the
impact of different factors on churn and the effectiveness of retention strategies.
7) Implementation: Based on the analysis, XYZ implements strategies to improve
retention. This includes personalized content recommendations to increase
engagement and targeted offers or incentives for high-risk customers to encourage
them to stay.
8) Results: Following the implementation of these strategies, XYZ observes a
reduction in the churn rate, increased customer engagement, and higher overall
satisfaction. This example shows the effectiveness of using data science to address
business challenges.
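A minimal sketch of the Series referenced below, assuming the standard pandas constructor with an explicit label index (illustrative reconstruction, since the original snippet is not shown above):
import pandas as pd
# Creating a Series with a labelled index
data = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Accessing data by label and by position
print(data['b'])     # Output: 20
print(data.iloc[0])  # Output: 10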
In this example, data is a Series with values [10, 20, 30, 40] and indexes ['a', 'b', 'c',
'd'].
2.) DataFrame:
i)Definition: A DataFrame is a two-dimensional, size-mutable, and potentially
heterogeneous tabular data structure with labelled axes (rows and columns). It is
similar to a table in a relational database or a data frame in R.
ii)Components:
• Columns: Each column is a Series, and the DataFrame can have multiple
columns, each with its own data type.
• Index: The labels for rows, which allow access to rows by these labels.
• Values: The actual data in the DataFrame organized into rows and columns.
iii)Example:
import pandas as pd
# Creating a DataFrame
data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']})
# Accessing data
print(data['Name']) # Output: Series with names
print(data.loc[1]) # Output: Row with index 1
In this example, data is a DataFrame with three columns ('Name', 'Age', and 'City') and
three rows of data.
Q.6) What is overfitting and underfitting? With an example explain how to recognize
and handle them.
Ans: 1. Overfitting
Definition:
Overfitting happens when a model learns not just the patterns in the training data but
also the noise and details specific to that dataset. As a result, the model performs
very well on the training data but poorly on unseen or test data because it fails to
generalize.
Recognition:
• The model has high accuracy on the training set but low accuracy on the test
set.
• The model is too complex (e.g., using too many parameters) and fits the training
data almost perfectly.
Example:
Imagine you are training a model to predict house prices. If your model is overfitting,
it may learn that a specific house in the training data had a red door and that
influenced its price, but in reality, the color of the door has no real impact on price. So
when you show it a house with a blue door, it might give a bad prediction.
How to Handle Overfitting:
1. Simplify the Model: Use fewer features or a less complex model to reduce
unnecessary learning.
2. Regularization: Techniques like L1 or L2 regularization add penalties to the
model's complexity, encouraging it to keep the weights small and avoid learning
noise.
3. Cross-Validation: Use techniques like k-fold cross-validation to test the model's
generalizability on multiple subsets of the data.
4. More Training Data: If possible, adding more training data can help the model
generalize better by seeing more examples.
2. Underfitting
Definition:
Underfitting occurs when a model is too simple to capture the underlying patterns in
the data. This happens when the model cannot adequately learn from the data,
resulting in poor performance on both the training and test datasets.
Recognition:
• The model has low accuracy on both the training and test sets.
• The model is not complex enough and fails to capture the important patterns in
the data.
Example:
Consider the same house price prediction scenario. If your model is underfitting, it
might only consider one or two basic features like the number of rooms but ignore
other important factors like location or size. As a result, its predictions would be
overly simplistic and inaccurate.
How to Handle Underfitting:
1. Increase Model Complexity: Use a more complex model with additional
features or layers (in the case of neural networks) to capture more patterns.
2. Feature Engineering: Add or modify features to provide the model with more
relevant information.
3. Reduce Regularization: If regularization is being used, too much of it can cause
underfitting. You may need to reduce it.
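A minimal sketch of how the train/test accuracy gap reveals both problems, assuming scikit-learn and a synthetic dataset (illustrative only, not part of the original notes):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Synthetic classification data for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# A depth-1 tree tends to underfit; an unlimited-depth tree tends to overfit
for depth in [1, 3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # Overfitting: high train accuracy but much lower test accuracy
    # Underfitting: low accuracy on both sets
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")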
2. Bivariate Analysis
Definition:
Bivariate analysis involves the analysis of two variables to understand their
relationship. It helps in discovering correlations, trends, or patterns between the two
variables.
Purpose:
• Investigate the relationship between two variables.
• Understand whether one variable impacts or is related to the other (e.g., does
height affect weight?).
Example:
If you have data on people’s ages and their income, bivariate analysis would explore
whether there is a relationship between age and income. You might find that older
individuals tend to have higher incomes.
Common Techniques:
• Scatter plots: Visualizes the relationship between two continuous variables.
• Correlation coefficients (e.g., Pearson’s correlation) to measure the strength and
direction of the relationship.
• Crosstabulation and bar plots for categorical variables.
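As a small illustration of the techniques above (hypothetical age and income values, not from the notes), the Pearson correlation and a scatter plot could be produced like this:
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical age/income data for illustration
df = pd.DataFrame({'age':    [22, 30, 35, 42, 50, 58],
                   'income': [25000, 40000, 48000, 60000, 72000, 80000]})
# Pearson correlation coefficient between the two variables
print(df['age'].corr(df['income']))  # close to +1 indicates a strong positive relationship
# Scatter plot to visualize the relationship
plt.scatter(df['age'], df['income'])
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()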
3. Multivariate Analysis
Definition:
Multivariate analysis deals with the analysis of more than two variables at once. It
explores relationships and interactions between multiple variables to understand how
they collectively impact each other.
Purpose:
• Understand complex relationships between several variables.
• Investigate the effects of multiple independent variables on one or more
dependent variables.
• Identify patterns and interactions that might be missed in univariate or bivariate
analyses.
Example:
In a dataset with variables like age, income, education level, and spending habits,
multivariate analysis might explore how age and education level together influence
income or how these factors collectively affect spending habits.
Common Techniques:
• Multiple regression analysis: Examines how multiple independent variables
influence a dependent variable.
• Principal component analysis (PCA): Reduces the dimensionality of the data
while preserving important patterns.
• Multivariate analysis of variance (MANOVA): Tests the effect of several
independent variables on more than one dependent variable
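A minimal sketch of two of these techniques, assuming a small hypothetical dataset (illustrative only):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
# Hypothetical data: age, education level (years), and income
df = pd.DataFrame({'age':       [25, 32, 40, 48, 55],
                   'education': [12, 16, 16, 18, 14],
                   'income':    [30000, 52000, 61000, 80000, 58000]})
# Multiple regression: how age and education together influence income
reg = LinearRegression().fit(df[['age', 'education']], df['income'])
print(reg.coef_, reg.intercept_)
# PCA: reduce the three variables to two components while keeping most of the variance
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(pca.explained_variance_ratio_)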
iii) The expected information still required to classify a tuple after partitioning the dataset D on an attribute A with v distinct values is:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
iv) Information gain is defined as the difference between the original information requirement and the new requirement after partitioning on attribute A:
Gain(A) = Info(D) − Info_A(D)
v) A higher gain indicates that the attribute provides a better partition, leading to purer (or more homogeneous) subsets.
vi) The attribute with the highest gain is selected for splitting at each step in a decision tree.
vii) Information gain tends to favor attributes with many possible values (e.g., product_ID), which can lead to highly specific partitions that provide little meaningful classification information.
viii) A split on an attribute like product_ID results in many partitions, each containing a single tuple. In this case, the entropy after the split is 0, meaning no further information is needed. However, this partitioning is not useful for classification.
ix) To counter this bias, the Gain Ratio method uses split information, which measures the potential information generated by splitting the data. It is defined as:
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
x) Split information represents the potential for dividing the dataset into partitions and takes into account the distribution of tuples across these partitions.
xi) Formula for Gain Ratio:
GainRatio(A) = Gain(A) / SplitInfo_A(D)
This ensures that attributes with many values (and hence many partitions) are not unfairly favored.
xii) The attribute with the highest gain ratio is selected as the splitting attribute in a decision tree.
xiii) As the split information approaches 0, the gain ratio becomes unstable. To prevent issues, constraints are applied to ensure that the selected attribute maintains a reasonable gain ratio, avoiding attributes with excessive splitting.
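A minimal sketch of these entropy, gain, and gain-ratio calculations in Python (hypothetical attribute values and class labels, for illustration only):
import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_and_ratio(attribute, labels):
    total = len(labels)
    info_d = entropy(labels)
    info_a = 0.0      # expected information after partitioning on the attribute
    split_info = 0.0  # potential information generated by the split itself
    for value in set(attribute):
        subset = [lab for att, lab in zip(attribute, labels) if att == value]
        weight = len(subset) / total
        info_a += weight * entropy(subset)
        split_info -= weight * math.log2(weight)
    gain = info_d - info_a
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical attribute values and class labels
print(gain_and_ratio(['low', 'low', 'high', 'high'], ['no', 'no', 'yes', 'yes']))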
Q.2.) What is classification? Hence explain the decision tree concept.
Ans: i) Classification is the most widely used data science task in business. The objective of a classification model is to predict a target variable that is binary (e.g., a loan decision) or categorical (e.g., a customer type) when a set of input variables are given.
ii) The model does this by learning the generalized relationship between the predicted target variable and all other input attributes from a known dataset.
Decision tree concept:
1) Definition and Basic Components of Decision Trees:
A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It splits the data into subsets based on the most significant features, forming a tree-like structure where each internal node represents a decision (split), and each leaf node represents the outcome (class or value).
Root Node: The top node of the tree, representing the entire dataset, where the first split occurs.
Internal Nodes: Represent decisions based on features that lead to further splits.
Leaf Nodes: The final output after splitting, representing the predicted class or value.
2) Splitting Criteria
The key idea behind decision trees is to divide data into groups that are as homogeneous as possible based on certain criteria. The most common criterion used for splitting is:
Entropy (used in Information Gain): Measures the randomness or uncertainty in the data. The formula for entropy is:
Info(D) = − Σ_{i=1}^{m} p_i × log2(p_i)
and the expected information after a split is:
Info_A(D) = Σ_{k=1}^{n} (|D_k| / |D|) × Info(D_k)
Where:
D is the entire dataset,
D_k are the partitions of D after a split based on a feature,
n is the number of partitions.
The goal is to maximize information gain with each split.
Manhattan Distance:
d(i, j) = |x_{i1} − x_{j1}| + |x_{i2} − x_{j2}| + ... + |x_{ip} − x_{jp}|
2) Binary Variables:
i) A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present.
ii) A binary variable is symmetric if both states are equally valuable and there is no preference for which outcome is coded as 0 or 1, and asymmetric if the outcomes are of unequal importance, such as disease test results, where the rarer, more important outcome is coded as 1 and the other as 0.
iii) To calculate the dissimilarity between objects i and j for symmetric binary variables:
d(i, j) = (r + s) / (q + r + s + t)
where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j.
3) Categorical Variables:
i) A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map_color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.
ii) The dissimilarity between two objects i and j can be calculated as:
d(i, j) = (p − m) / p
where m is the number of variables for which i and j are in the same state, and p is the total number of variables.
4) Ordinal Variables:
i) An ordinal variable can be discrete or continuous. Here order is important, e.g., rank. They can be treated like interval-scaled variables.
ii) To compute dissimilarity: replace each value x_if by its rank r_if ∈ {1, ..., M_f}, map the rank onto [0, 1] using z_if = (r_if − 1) / (M_f − 1), and then compute the dissimilarity using the methods for interval-scaled variables.
5) Ratio-Scaled Variables:
i) A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{−Bt}.
ii) To compute dissimilarity: apply a logarithmic transformation y_if = log(x_if) and then treat the transformed values as interval-scaled.
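As a small illustration of the binary case above (assumed example values, not from the notes), the symmetric dissimilarity d(i, j) = (r + s) / (q + r + s + t) can be computed as:
def binary_dissimilarity(obj_i, obj_j):
    # Count the four contingency-table cells for two binary vectors
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 0)
    # Symmetric binary dissimilarity: mismatches over all variables
    return (r + s) / (q + r + s + t)

# Hypothetical objects described by five symmetric binary variables
print(binary_dissimilarity([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # Output: 0.4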
Q.5.) What is cluster analysis? Explain the types of data in cluster analysis.
Ans: i) Cluster analysis is a method of grouping a set of objects (data points) in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups.
ii) There are various types of data in clustering analysis. We will see these types with dissimilarity. Dissimilarity in clustering analysis refers to a measure of how different two data points or objects are from each other.
(refer to Q.3. for further answer)
Q.2.) What is data discretization? Explain with example.
Ans:
i) Data discretization is the process of converting continuous data into discrete categories or intervals.
ii) This is particularly useful in data mining and machine learning, where algorithms often perform better with categorical data rather than continuous numerical data.
iii) Discretization can reduce the complexity of the data, making it easier to analyze and interpret.
iv) Methods of Discretization
1. Equal-width Binning: The range of continuous values is divided into intervals of equal width.
2. Equal-frequency Binning: The data is divided into intervals such that each interval contains approximately the same number of data points.
3. Clustering-based Discretization: Clustering techniques are used to group similar values together.
Example of Data Discretization
Let's consider a simple example with a continuous variable: Age.
Here is a dataset of ages:
Ages: 23, 45, 18, 30, 50, 26, 34, 29, 42, 60
Step 1: Choose Discretization Method
Suppose we choose Equal-width Binning and want to create three bins (or intervals).
Step 2: Define Bins
1. Minimum Age: 18
2. Maximum Age: 60
3. Range: 60 - 18 = 42
4. Width of Each Bin: Range / Number of Bins = 42 / 3 = 14
Now, we can create the bins:
Bin 1: 18 to 32 (18 + 14)
Bin 2: 33 to 46 (32 + 14)
Bin 3: 47 to 60 (46 + 14)
Step 3: Assign Data to Bins
Now, we can assign each age to its respective bin:
Age 23: Bin 1
Age 45: Bin 2
Age 18: Bin 1
Age 30: Bin 1
Age 50: Bin 3
Age 26: Bin 1
Age 34: Bin 2
Age 29: Bin 1
Age 42: Bin 2
Age 60: Bin 3
Resulting Discretized Data
After discretization, the ages can be represented as categories:
Age Groups:
- Group 1 (18-32): 23, 18, 30, 26, 29
- Group 2 (33-46): 45, 34, 42
- Group 3 (47-60): 50, 60
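A minimal sketch of this equal-width binning with pandas, assuming pd.cut and the bin edges derived above:
import pandas as pd
ages = pd.Series([23, 45, 18, 30, 50, 26, 34, 29, 42, 60])
# Equal-width binning into three intervals covering the range 18-60
groups = pd.cut(ages, bins=[17, 32, 46, 60],
                labels=['Group 1 (18-32)', 'Group 2 (33-46)', 'Group 3 (47-60)'])
print(groups.value_counts())
# Alternatively, pd.cut(ages, bins=3) computes equal-width edges automatically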
Median: The middle value when the data is ordered. If the sepal lengths are sorted as
4.9, 5.0, and 5.1 cm, the median is 5.0 cm.
Mode: The most frequently occurring value. If the sepal lengths are 5.0, 5.0, and 5.1
cm, the mode is 5.0 cm.
2)Measure of Spread: These metrics describe the variability of the dataset.
Range: The difference between the maximum and minimum values. For example, if
the sepal lengths are 4.5 cm (min) and 5.5 cm (max), the range is 5.5−4.5=1.0 cm.
Standard Deviation: This indicates how much the values deviate from the mean. For a dataset with a mean of 5.0 cm, if the lengths are 4.8, 5.0, and 5.2 cm, the standard deviation would show that the values are close to the mean.
Variance: The average of the squared differences from the mean. If the mean is 5.0 cm and the data points are 4.8 and 5.2 cm, the variance reflects how spread out these points are from the mean.
Multivariate Exploration: This involves studying more than one attribute in the dataset simultaneously to understand relationships between attributes.
1) Central Data Point:
This represents a hypothetical observation point made up of the mean of each attribute in the dataset.
For example, in the Iris dataset, the central mean point for sepal length, sepal width, petal length, and petal width could be expressed as {5.006, 3.418, 1.464, 0.244}.
2) Correlation:
This measures the statistical relationship between two attributes using the Pearson correlation coefficient (r), which ranges from -1 to 1. For example, there is a strong positive correlation between temperature and ice cream sales if, as temperatures rise, ice cream sales increase.
The correlation coefficient quantifies this relationship, indicating how closely related the two attributes are. A coefficient of 0.8 suggests a strong positive correlation.
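A short sketch of both ideas with pandas (hypothetical flower measurements, not the actual Iris values):
import pandas as pd
# Hypothetical measurements: four attributes, five observations
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 5.0, 5.2, 4.8],
                   'sepal_width':  [3.5, 3.0, 3.4, 3.6, 3.1],
                   'petal_length': [1.4, 1.4, 1.5, 1.5, 1.6],
                   'petal_width':  [0.2, 0.2, 0.2, 0.3, 0.2]})
# Central data point: the mean of every attribute
print(df.mean())
# Pearson correlation between every pair of attributes
print(df.corr())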
2. Search Engines:
i) Search engines intelligently predict what a person is typing and automatically complete the sentences. For example, when you type "game" in Google, it suggests options like "Game of Thrones," "Game of Life," or, if you're interested in math, "game theory."
ii) All these suggestions are provided using autocomplete that uses Natural Language Processing to guess what the person wants to ask.
3. Voice Assistants:
i) Voice assistants are essential tools today. Siri, Alexa, and Google Assistant are commonly used for making calls, setting reminders, scheduling meetings, setting alarms, and browsing the internet.
ii) They use a complex combination of speech recognition, natural language understanding, and natural language processing to understand what humans are saying and then act on it.
4. Language Translators:
i) Language translation is made easy with tools like Google Translate, which can convert text from one language to another, such as English to Hindi.
ii) Modern translators utilize sequence-to-sequence modeling in Natural Language Processing, which is more precise than the older Statistical Machine Translation (SMT) method.
5. Sentiment Analysis:
i) Sentiment analysis allows companies to gauge how users feel about a particular topic or product by analyzing social media and other forms of communication.
ii) By using techniques such as natural language processing, computational linguistics, and text analysis, companies can determine whether the general sentiment is positive, negative, or neutral.
6. Grammar Checkers:
Grammar checkers are essential tools for ensuring error-free writing, especially in professional reports and academic assignments. They rely on natural language processing (NLP) to provide accurate suggestions.
7. Email Classification and Filtering:
Email classification and filtering use natural language processing (NLP) to automatically sort incoming emails into categories, improving organization and reducing clutter.
Q.15.) Write the algorithm of the k-means and k-medoid partitioning methods.
Ans:
The K-Means Clustering Method:
i. The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
ii. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
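A minimal sketch of the k-means loop itself (illustrative only; assumes NumPy and a fixed number of iterations rather than a convergence check):
import numpy as np

def k_means(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each object to the cluster with the nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: update each center to the mean of the objects assigned to it
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 2-3 repeat; a fuller version stops when assignments no longer change
    return labels, centers

# Hypothetical 2-D points forming two obvious groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
print(k_means(points, k=2))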
Q.18.) Briefly explain how to compute dissimilarity between objects described by binary, categorical (nominal), and ordinal variables.
Ans: (refer to answer of Q.3.)
Q.22.) What is a histogram?
Ans:
1) Definition: A histogram is a type of bar chart that represents the frequency distribution of a dataset by dividing the data into intervals, known as bins, and plotting the number of observations (frequency) within each bin.
2) Data Grouping: The data is divided into consecutive, non-overlapping intervals (bins). Each bin represents a range of values, and the height of each bar indicates the number of data points that fall within that range.
3) Continuous Data: Histograms are primarily used for continuous data, where values can fall within any range, making them suitable for showing distributions like height, weight, temperature, etc.
4) Shape Representation: The shape of the histogram can provide insights into the data distribution, including whether it is normal, skewed (left or right), uniform, or bimodal.
5) Bin Width: The width of the bins can significantly affect the appearance of the histogram. Wider bins may oversimplify the data, while narrower bins can introduce noise. Choosing the right bin width is crucial for accurate representation.
6) Frequency vs. Relative Frequency: A histogram can show either the absolute frequency (the count of data points in each bin) or relative frequency (the proportion of data points in each bin compared to the total number of observations).
7) X and Y Axes: The x-axis of a histogram represents the bins (ranges of data), while the y-axis represents the frequency (count) of data points in each bin.
8) Comparison: Histograms can be used to compare distributions between different datasets by overlaying multiple histograms on the same plot, which can help identify differences in central tendencies or variances.
9) Applications: Histograms are widely used in statistics, data analysis, and machine learning to analyze data distributions, detect outliers, and understand the spread of data.
10) Limitations: While histograms provide a useful overview of data distribution, they can obscure specific data points and may not be suitable for small datasets or when the data has a large number of unique values, where a different type of visualization might be more appropriate.
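A minimal sketch of drawing a histogram with Matplotlib (hypothetical values, for illustration only):
import matplotlib.pyplot as plt
# Hypothetical continuous data, e.g., heights in cm
heights = [150, 152, 155, 158, 160, 161, 163, 165, 165, 168, 170, 172, 175, 180]
# Divide the data into 5 equal-width bins and plot the frequency of each bin
plt.hist(heights, bins=5, edgecolor='black')
plt.xlabel('Height (cm)')  # x-axis: the bins (ranges of data)
plt.ylabel('Frequency')    # y-axis: count of data points in each bin
plt.title('Histogram of heights')
plt.show()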
Q.24.) Differentiate between supervised and unsupervised learning.
Ans:
1. Definition: Supervised learning is trained on labeled data with input-output pairs, while unsupervised learning is trained on unlabeled data without explicit outputs.
2. Objective: Supervised learning predicts outputs for unseen data based on learned relationships, while unsupervised learning identifies patterns or structures in the data.
3. Types of Problems: Supervised learning is used for classification and regression tasks, while unsupervised learning is used for clustering and association tasks.
4. Data Requirements: Supervised learning requires a large amount of labeled data, while unsupervised learning does not require labeled data, making it easier to collect.
5. Examples of Algorithms: Supervised learning includes linear regression, decision trees, and neural networks, while unsupervised learning includes k-means clustering, PCA, and hierarchical clustering.
6. Evaluation: Supervised learning performance is evaluated using accuracy, precision, and recall, while unsupervised learning performance evaluation is challenging without ground truth.
7. Complexity: Supervised learning is generally more complex due to model fitting and prediction, while unsupervised learning is often simpler, but complexity increases with data patterns.
8. Use Cases: Supervised learning is used in spam detection, image classification, and diagnostics, while unsupervised learning is used in customer segmentation, anomaly detection, and insights.
9. Learning Process: In supervised learning, learning is guided by labeled data with feedback on predictions; in unsupervised learning, learning is exploratory with no feedback loop.
10. Real-World Application: Supervised learning is applied in scenarios requiring high accuracy, like credit scoring, while unsupervised learning is useful for exploratory data analysis before applying supervised methods.
3. Binary Splits: For each discrete-valued attribute A with v unique values, the number of possible binary splits that can be formed is 2^v − 2, excluding the empty and full sets.
5. Selecting Splitting Attributes: The attribute that results in the lowest Gini index (indicating highest purity) is selected as the splitting criterion, promoting better classification.
6. Continuous-Valued Attributes: For attributes with continuous values, split points are determined by the midpoints of sorted values, with the optimal point being the one that minimizes the Gini index.
8. Gini Impurity vs. Entropy: Unlike entropy, which measures uncertainty, the Gini index focuses solely on the distribution of classes, offering a simpler interpretation in decision trees.
9. Handling Multi-class Problems: The Gini index effectively handles multi-class scenarios, making it suitable for datasets with more than two classes.
10. Overfitting Considerations: While a lower Gini index suggests a better split, care should be taken to avoid overfitting by analyzing the model's performance on unseen data.
11. Applications: The Gini index is a core component of decision tree algorithms, often utilized in random forests and several machine learning frameworks for classification tasks.
12. Iterative Process: The decision tree grows by recursively applying the Gini index to find splits until the stopping criteria are met, leading to a finalized model that classifies instances efficiently.
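A small sketch of the Gini index computation for one candidate split, assuming the standard definition Gini(D) = 1 − Σ p_i² (hypothetical labels, for illustration only):
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # Weighted Gini index of a binary split: lower means purer partitions
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Hypothetical binary split of six class labels
print(gini_split(['yes', 'yes', 'yes'], ['no', 'no', 'yes']))  # approx. 0.222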
1. Matplotlib is a popular Python library for data visualization, widely used to create static, interactive, and animated plots. It's especially useful for displaying trends, relationships, and distributions in data.
2. The main purpose of Matplotlib is to provide an easy way to generate plots and graphs, making data analysis and interpretation simpler.
3. Installation: To install Matplotlib, use the command:
pip install matplotlib
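A very small usage sketch (hypothetical values, for illustration only):
import matplotlib.pyplot as plt
# A simple line plot showing a trend over time
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 160, 158, 170]
plt.plot(months, sales, marker='o')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly sales trend')
plt.show()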
Q.4.) State and explain the steps to perform hypothesis testing.
Ans:
1. State the Hypotheses:
Formulate the null hypothesis (H₀), which assumes no effect or difference, and the alternative hypothesis (H₁), which proposes a possible effect or difference.
2. Select Significance Level (α):
Choose a significance level, usually 0.05 or 5%, which represents the probability of rejecting the null hypothesis when it's actually true. This is your threshold for making decisions.
3. Choose the Test Type:
Decide which statistical test to use based on the data type and sample size. Common tests include the t-test, z-test, and chi-square test.
4. Collect and Prepare Data:
Gather data relevant to the hypotheses, ensuring it is clean, accurate, and representative of the population.
5. Calculate the Test Statistic:
Use your chosen test formula to calculate the test statistic (e.g., t-value or z-value), which will measure the degree of difference between the observed data and what is expected under the null hypothesis.
6. Find the Critical Value or P-value:
The critical value defines the cutoff point for rejecting the null hypothesis. Alternatively, you can calculate a p-value, which tells the probability of observing your results under the null hypothesis.
7. Compare Test Statistic to Critical Value:
If the test statistic exceeds the critical value, or if the p-value is less than α, you have enough evidence to reject the null hypothesis.
8. Make a Decision:
Based on the comparison, either reject the null hypothesis (if there is evidence for the alternative) or fail to reject it (if there isn't).
9. Draw a Conclusion:
Interpret the result in the context of your research. Clearly state if there's evidence supporting the alternative hypothesis or if the results align with the null.
10. Report the Findings:
Summarize the hypothesis, test used, test statistic, p-value, and the decision. This helps others understand the analysis and conclusions drawn from it.
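A minimal sketch of these steps using a one-sample t-test with SciPy (hypothetical sample values and hypothesized mean, for illustration only):
from scipy import stats
# Step 1: H0: the population mean equals 50; H1: it does not.
# Step 2: significance level
alpha = 0.05
# Step 4: sample data (hypothetical)
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56, 47]
# Steps 3 and 5-6: the one-sample t-test returns the test statistic and p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Steps 7-9: compare the p-value to alpha and state the conclusion
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 50.")
else:
    print("Fail to reject H0: no significant difference from 50.")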