PDS Question Bank

The document provides an introduction to data science, outlining its need, benefits, and the data science process, which includes data retrieval, cleansing, analysis, modeling, and presentation. It also covers Python libraries like NumPy and pandas for data manipulation, including creating ndarrays, data structures, and performing operations. Additionally, it discusses data cleaning techniques, handling missing data, and visualization methods using pandas and matplotlib.

UNIT I INTRODUCTION TO DATA SCIENCE

Introduction: Need for data science - Benefits and uses - Causality and Experimentation - Facets
of data-Data science process: Retrieving data - Cleansing, integrating and transforming data -
Exploratory Data Analysis - Build the models - Presenting findings and building applications

2-Mark Questions with Answers

1. What is data science?


o Answer: Data science is an interdisciplinary field that combines statistical
analysis, machine learning, and data engineering to extract insights and
knowledge from structured and unstructured data.

2. Why is data science needed?


o Answer: Data science is needed to analyze large and complex datasets,
uncover patterns, predict outcomes, and drive decision-making in various
fields like business, healthcare, and technology.

3. List any two benefits of data science.


o Answer:
 Helps organizations make data-driven decisions.
 Enables predictions and forecasts through machine learning models.

4. What is causality in data science?


o Answer: Causality refers to understanding cause-and-effect relationships,
where one variable directly influences another.

5. What is the role of experimentation in data science?


o Answer: Experimentation helps test hypotheses and determine causal
relationships between variables, often using techniques like A/B testing.

6. What are the facets of data in data science?


o Answer: The facets of data include structure, format, quality, completeness,
consistency, and granularity.

7. What are the main steps in the data science process?


o Answer:

1. Retrieving data
2. Cleansing, integrating, and transforming data
3. Exploratory Data Analysis (EDA)
4. Building models
5. Presenting findings and building applications

8. What is data retrieval?


o Answer: Data retrieval involves accessing and collecting data from various
sources like databases, APIs, or web scraping.
9. What is data cleansing?
o Answer: Data cleansing is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in data to improve its quality.

10. What is Exploratory Data Analysis (EDA)?


o Answer: EDA is the process of summarizing, visualizing, and understanding
data to uncover patterns, trends, and relationships.

11. Why is data integration important?


o Answer: Data integration combines data from multiple sources to create a
unified view, improving analysis and decision-making.

12. What is a data science model?


o Answer: A data science model is a mathematical representation or algorithm
used to analyze data and make predictions or decisions.

13. What are the benefits of presenting findings in data science?


o Answer: Presenting findings ensures stakeholders can understand insights and
use them for actionable decisions.

14. What is the purpose of building applications in data science?


o Answer: Applications automate processes, integrate data science models, and
provide end-users with actionable tools and insights.

15. What are structured and unstructured data?


o Answer:

 Structured Data: Organized data stored in tabular formats (e.g.,


databases).
 Unstructured Data: Unorganized data like text, images, or videos.

16. What is the importance of transforming data?


o Answer: Data transformation converts raw data into a suitable format for
analysis, improving compatibility and interpretability.

16-Mark Questions with Answers

1. Explain the need for data science and its benefits and uses.
o Answer:
1. Need for Data Science:
 The exponential growth of data requires tools to analyze and
derive insights.
 Traditional methods are insufficient for handling large-scale,
complex datasets.
 Data science provides techniques to predict outcomes and
improve decision-making.
2. Benefits:
 Drives business decisions through data-driven insights.
 Enhances operational efficiency and cost savings.
 Provides predictive and prescriptive analytics for future
planning.
3. Uses:
 In healthcare: Predicting diseases and optimizing treatments.
 In finance: Fraud detection and credit risk analysis.
 In retail: Personalized recommendations and inventory
management.

2. Describe causality and experimentation in data science with examples.


o Answer:
1. Causality:
 Refers to understanding cause-and-effect relationships between
variables.
 Example: Identifying whether increased advertising (cause)
leads to higher sales (effect).
2. Experimentation:
 Involves testing hypotheses to validate causality.
 A/B Testing: Splitting users into groups to test different
variations of a feature and measure its impact.
Example: Testing two website designs to determine which
increases user engagement.
3. Importance:
 Helps make informed decisions.
 Avoids incorrect conclusions based on correlations alone.

3. Explain the facets of data in data science.


o Answer:
1. Structure:
 Data can be structured (e.g., tables) or unstructured (e.g.,
images).
2. Format:
 Data may exist in formats like JSON, CSV, SQL, or XML.
3. Quality:
 High-quality data ensures accuracy, completeness, and
consistency.
4. Completeness:
 Ensures all necessary information is present in the dataset.
5. Granularity:
 Refers to the level of detail in data, which impacts its usability
for different analyses.

4. Discuss the steps involved in the data science process.


o Answer:
1. Retrieving Data:
 Accessing data from sources like databases, APIs, or web
scraping.
2. Cleansing, Integrating, and Transforming Data:
 Cleaning: Removing errors and inconsistencies.
 Integrating: Combining data from multiple sources.
 Transforming: Converting data into a suitable format.
3. Exploratory Data Analysis (EDA):
 Visualizing data using plots, identifying trends, and detecting
anomalies.
4. Building Models:
 Developing machine learning models for predictions or
classifications.
5. Presenting Findings and Building Applications:
 Creating reports, dashboards, or interactive applications for
end-users.

5. Explain the importance of data cleansing, integration, and transformation in the


data science process.
o Answer:
1. Data Cleansing:
 Identifies and corrects errors like missing values, duplicates,
and outliers.
 Example: Replacing missing values with the column mean.
2. Data Integration:
 Combines datasets from multiple sources into a single, unified
view.
 Example: Merging customer data from an e-commerce
platform with payment history.
3. Data Transformation:
 Converts data into a suitable format for analysis.
 Example: Scaling numerical data or encoding categorical
variables.
4. Significance:
 Improves data quality.
 Ensures compatibility for machine learning models.
 Enhances the accuracy of analysis.

6. Describe Exploratory Data Analysis (EDA) and its role in the data science
process.
o Answer:
1. Definition:
 EDA is the process of summarizing and visualizing data to
identify patterns, relationships, and anomalies.
2. Steps in EDA:
 Visualizing distributions using histograms or density plots.
 Checking relationships between variables using scatter plots.
 Identifying missing values or outliers.
3. Role in Data Science:
 Provides insights to guide feature engineering.
 Detects potential issues in data quality.
4. Example:
 Analyzing customer purchase data to understand buying
patterns.
5. Tools:
 Libraries like pandas, matplotlib, and seaborn.
7. Explain the importance of presenting findings and building applications in data
science.
o Answer:
1. Presenting Findings:
 Converts complex data insights into actionable visuals like
dashboards, reports, and graphs.
 Example: A sales dashboard showing trends and forecasts.
2. Building Applications:
 Embeds data science models into software tools for automation
and scalability.
 Example: A recommendation system for e-commerce websites.
3. Importance:
 Improves decision-making for stakeholders.
 Enhances user experience with real-time insights.
 Bridges the gap between data analysis and business application.

UNIT II PYTHON LIBRARIES FOR DATA SCIENCE 9


NumPy Basics: Arrays and Vectorized Computation - The NumPy ndarray - Creating ndarrays -
Data Types for ndarrays - Arithmetic with NumPy Arrays - Basic Indexing and Slicing - Boolean
Indexing - Transposing Arrays and Swapping Axes. Introduction to pandas Data Structures:
Series, DataFrame - Essential Functionality: Dropping Entries - Indexing, Selection, and Filtering
- Function Application and Mapping- Sorting and Ranking. Summarizing and Computing
Descriptive Statistics - Unique Values, Value Counts, and Membership. Reading and Writing Data
in Text Format.

2-Mark Questions with Answers

1. What is a NumPy ndarray?


o Answer: A NumPy ndarray (n-dimensional array) is a multi-dimensional array
object that is fast and flexible, designed for numerical computations in Python.

2. How do you create a NumPy ndarray?


o Answer: Use functions like np.array(), np.zeros(), np.ones(), or np.arange() to
create an ndarray. Example:
o arr = np.array([1, 2, 3])

3. What are the data types supported by NumPy ndarrays?


o Answer: Common data types include int32, int64, float32, float64, bool,
complex, and object.

4. What is vectorized computation in NumPy?


o Answer: Vectorized computation allows operations to be applied element-
wise to arrays without explicit loops, improving performance.
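A minimal sketch of vectorized computation versus an explicit loop (the array values are illustrative):

```python
import numpy as np

# Vectorized computation: the operation applies to every element at once.
arr = np.array([1, 2, 3, 4])
squared = arr ** 2                          # element-wise power

# Equivalent explicit loop (much slower for large arrays):
squared_loop = np.array([x ** 2 for x in arr])
```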

5. What is Boolean indexing in NumPy?


o Answer: Boolean indexing uses a Boolean array to filter elements in another
array. Example:
o arr[arr > 5]
6. How do you transpose an array in NumPy?
o Answer: Use the T attribute or np.transpose() function. Example:
o transposed = arr.T

7. What is a pandas Series?


o Answer: A pandas Series is a one-dimensional labeled array capable of
holding data of any type (integer, float, string, etc.).

8. What is a pandas DataFrame?


o Answer: A DataFrame is a two-dimensional, size-mutable, and labeled data
structure similar to a spreadsheet or SQL table.

9. How do you drop entries in a pandas DataFrame?


o Answer: Use the drop() method to remove rows or columns. Example:
o df.drop('column_name', axis=1)

10. What is the purpose of the apply() function in pandas?


o Answer: The apply() function applies a function along the axis of a
DataFrame or Series. Example:
o df['column'].apply(np.sqrt)

11. How do you sort a pandas DataFrame by values?


o Answer: Use the sort_values() method. Example:
o df.sort_values(by='column_name')

12. What is the difference between unique() and value_counts() in pandas?


o Answer: unique() returns unique values in a column, while value_counts()
provides the count of each unique value.

13. How do you compute descriptive statistics in pandas?


o Answer: Use methods like mean(), median(), std(), sum(), or describe().

14. What is the purpose of read_csv() in pandas?


o Answer: The read_csv() function reads data from a CSV file into a pandas
DataFrame.

15. How do you write a pandas DataFrame to a text file?


o Answer: Use the to_csv() method to save a DataFrame to a text or CSV file.

16. What is the significance of indexing in pandas?


o Answer: Indexing allows selection, filtering, and alignment of data for easier
manipulation.

16-Mark Questions with Answers

1. Explain the creation of NumPy ndarrays and arithmetic operations with


examples.
o Answer:
1. Creating ndarrays:
 Using np.array():
 arr = np.array([1, 2, 3, 4])
 Using np.zeros() and np.ones():
 zeros = np.zeros((2, 2))
 ones = np.ones((3,))
 Using np.arange() and np.linspace():
 arr1 = np.arange(0, 10, 2)
 arr2 = np.linspace(0, 1, 5)
2. Arithmetic Operations:
 Element-wise addition, subtraction, multiplication, and
division:
 arr1 + arr2
 arr1 * arr2
 Broadcasting for operations on arrays of different shapes.
3. Advantages:
 Faster computation.
 Reduced memory usage.
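The creation functions and arithmetic described above can be combined in one short sketch (values chosen only for illustration):

```python
import numpy as np

# Creating ndarrays with the constructors listed above.
arr1 = np.arange(0, 10, 2)        # [0 2 4 6 8]
arr2 = np.linspace(0, 1, 5)       # 5 evenly spaced values in [0, 1]
zeros = np.zeros((2, 2))
ones = np.ones((3,))

# Element-wise arithmetic on same-shaped arrays.
total = arr1 + arr2

# Broadcasting: a scalar is stretched to match the array's shape.
scaled = arr1 * 10
```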

2. Discuss indexing, slicing, and Boolean indexing in NumPy with examples.


o Answer:
1. Indexing:
 Access specific elements using integer indices:
 arr[2] # Access the 3rd element
2. Slicing:
 Extract subsets using slice notation:
 arr[1:4] # Extract elements from index 1 to 3
3. Boolean Indexing:
 Filter elements based on a condition:
 arr[arr > 5]
4. Examples:
 Access rows/columns of multi-dimensional arrays:
 arr[1, :] # Access the 2nd row
 arr[:, 2] # Access the 3rd column
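The indexing forms above can be demonstrated together on a small illustrative array:

```python
import numpy as np

arr = np.array([3, 8, 1, 9, 4, 7])
third = arr[2]            # integer indexing: the 3rd element
sub = arr[1:4]            # slicing: elements at indices 1..3
big = arr[arr > 5]        # Boolean indexing: keep values > 5

mat = np.array([[1, 2, 3],
                [4, 5, 6]])
row2 = mat[1, :]          # 2nd row
col3 = mat[:, 2]          # 3rd column
```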

3. Explain the key pandas data structures (Series and DataFrame) with examples.
o Answer:
1. Series:
 One-dimensional labeled array:
 s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
2. DataFrame:
 Two-dimensional labeled data structure:
 df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3. Key Operations:
 Indexing: Access rows or columns using labels.
 Selection: Use loc[] or iloc[] for label-based or position-based
selection.
4. Significance:
 Ideal for structured data.
 Simplifies data manipulation.
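A brief sketch of both structures and label- vs. position-based selection (the data is illustrative):

```python
import pandas as pd

# Series: one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# DataFrame: two-dimensional labeled table.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
by_label = df.loc['x', 'A']      # label-based selection
by_position = df.iloc[1, 1]      # position-based selection
col_b = df['B']                  # column selection
```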

4. Describe essential functionality in pandas, including dropping entries, filtering,


and mapping.
o Answer:
1. Dropping Entries:
 Remove rows/columns using drop():
 df.drop('column_name', axis=1)
2. Filtering:
 Use conditions to filter rows:
 df[df['column'] > 10]
3. Mapping:
 Transform column values using map() or apply():
 df['column'].map(lambda x: x * 2)
4. Example:
 Drop duplicates:
 df.drop_duplicates()
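The dropping, filtering, and mapping operations above in one runnable sketch (sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'ann'],
                   'score': [12, 8, 12]})

no_dupes = df.drop_duplicates()            # drops the repeated row
high = df[df['score'] > 10]                # Boolean filtering
doubled = df['score'].map(lambda x: x * 2) # element-wise mapping
trimmed = df.drop('name', axis=1)          # drop a column
```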

5. Explain summarizing and computing descriptive statistics in pandas.


o Answer:
1. Descriptive Statistics:
 Use mean(), median(), std(), etc., to compute metrics:
 df['column'].mean()
 df.describe()
2. Unique Values:
 Identify unique values using unique():
 df['column'].unique()
3. Value Counts:
 Count occurrences of each unique value:
 df['column'].value_counts()
4. Membership:
 Check if values belong to a set:
 df['column'].isin([1, 2, 3])
5. Significance:
 Provides insights into data distributions.
 Guides further analysis.
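These summarization methods can be sketched on a small hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A', 'C', 'A']})
uniques = df['grade'].unique()          # distinct values
counts = df['grade'].value_counts()     # frequency of each value
is_ab = df['grade'].isin(['A', 'B'])    # Boolean membership mask

nums = pd.Series([2, 4, 6, 8])
avg = nums.mean()                       # descriptive statistic
```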

6. Discuss reading and writing data in pandas with examples.


o Answer:
1. Reading Data:
 Use read_csv() to load data:
 df = pd.read_csv('file.csv')
2. Writing Data:
 Save data using to_csv():
 df.to_csv('output.csv', index=False)
3. Other File Formats:
 Read/write Excel:
 pd.read_excel('file.xlsx')
4. Significance:
 Enables integration with external data sources.
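A round-trip sketch of writing and re-reading a CSV; a temporary directory is used here only so the example is self-contained:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'city': ['Pune', 'Delhi']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    df.to_csv(path, index=False)       # index=False omits the row labels
    restored = pd.read_csv(path)       # load it back into a DataFrame
```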
UNIT III DATA CLEANING, PREPARATION AND VISUALIZATION 9
Data Cleaning and Preparation: Handling Missing Data - Data Transformation: Removing
Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and
Filtering Outliers- String Manipulation: Vectorized String Functions in pandas. Plotting with
pandas and matplotlib: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point
Plots.

2-Mark Questions with Answers

1. What is data cleaning?


o Answer: Data cleaning is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in a dataset to improve its quality and
suitability for analysis.

2. How can missing data be handled in pandas?


o Answer: Missing data can be handled in pandas by methods such as:
 Dropping rows or columns using dropna().
 Filling missing values using fillna() with a specific value or statistical
methods (e.g., mean, median).

3. What is data transformation?


o Answer: Data transformation involves changing the format, structure, or
values of data to make it suitable for analysis, such as scaling, mapping, or
encoding.

4. How do you remove duplicates in pandas?


o Answer: Use the drop_duplicates() function in pandas to remove duplicate
rows from a DataFrame.

5. What is an outlier?
o Answer: An outlier is a data point that deviates significantly from the rest of
the dataset and can affect analysis results.

6. How can outliers be detected in pandas?


o Answer: Outliers can be detected using:
 Statistical methods (e.g., z-scores, IQR).
 Visualization techniques like boxplots.

7. What are vectorized string functions in pandas?


o Answer: Vectorized string functions in pandas operate on entire columns of
string data efficiently, such as str.lower(), str.upper(), and str.replace().

8. What is the difference between a histogram and a density plot?


o Answer: A histogram shows the frequency distribution of data, while a
density plot represents the data's probability density function.

9. What is the use of a scatter plot?


o Answer: A scatter plot visualizes the relationship between two continuous
variables by plotting data points on a two-dimensional axis.
10. What is the purpose of matplotlib in Python?
o Answer: matplotlib is a Python library used for creating static, interactive, and
animated visualizations, such as line plots, bar charts, and scatter plots.

11. What is the function of fillna() in pandas?


o Answer: The fillna() function fills missing values in a DataFrame or Series
with specified values or methods like forward-fill or backward-fill.

12. How do you replace values in a pandas DataFrame?


o Answer: Use the replace() function to substitute specified values in a
DataFrame or Series with new values.

13. What is the purpose of str.contains() in pandas?


o Answer: The str.contains() function checks if a substring is present in a string
column and returns a Boolean Series.

14. What is a bar plot?


o Answer: A bar plot displays categorical data with rectangular bars, where the
length of each bar is proportional to its value.

15. How do you plot a line graph using pandas?


o Answer: Use the plot.line() function in pandas or specify kind='line' in the
plot() function to create a line plot.

16. How can duplicate rows be identified in pandas?


o Answer: Duplicate rows can be identified using the duplicated() function,
which returns a Boolean Series indicating duplicate entries.

16-Mark Questions with Answers

1. Explain the steps involved in data cleaning, including handling missing data and
removing duplicates.
o Answer:
1. Definition: Data cleaning involves detecting and correcting errors in
datasets to ensure reliability.
2. Steps:
 Handling Missing Data:
 Drop rows/columns with missing values using dropna().
 Fill missing values using fillna() with:
 Mean, median, or mode for numerical data.
 Specific values for categorical data.
 Removing Duplicates:
 Detect duplicates using duplicated().
 Remove duplicates using drop_duplicates().
3. Importance:
 Reduces noise.
 Ensures consistency in analysis.
 Improves model accuracy.
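The cleaning steps above, sketched on a small hypothetical dataset with one missing value and one duplicate row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, 30],
                   'city': ['Pune', 'Delhi', 'Pune', 'Pune']})

dropped = df.dropna()                          # drop rows with missing values
filled = df.fillna({'age': df['age'].mean()})  # fill NaN with the column mean
dupes = filled.duplicated()                    # Boolean mask of repeated rows
deduped = filled.drop_duplicates()
```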
2. Describe data transformation techniques with examples of removing duplicates,
mapping, replacing values, and filtering outliers.
o Answer:
1. Removing Duplicates:
 Example: df.drop_duplicates().
2. Transforming Data Using a Function or Mapping:
 Example: Map categorical values to numbers using
df['column'].map({'A': 1, 'B': 2}).
3. Replacing Values:
 Example: Replace incorrect entries using
df.replace({'old_value': 'new_value'}).
4. Filtering Outliers:
 Use statistical methods like:
 IQR Method: Filter out values outside
Q1 − 1.5×IQR and Q3 + 1.5×IQR.
 Z-Score: Remove values with a z-score > 3.
5. Significance:
 Prepares data for accurate analysis.
 Reduces bias and anomalies.

3. Explain vectorized string functions in pandas with examples.


o Answer:
1. Definition: Vectorized string functions in pandas enable efficient
string operations on entire columns or Series.
2. Common Functions:
 str.lower(): Converts text to lowercase.
 str.upper(): Converts text to uppercase.
 str.strip(): Removes leading/trailing whitespaces.
 str.replace(): Replaces substrings.
 str.contains(): Checks for the presence of substrings.
3. Example:
4. df['column'] = df['column'].str.lower()
5. df['column'] = df['column'].str.replace('old', 'new')
6. Applications:
 Text preprocessing in NLP.
 Standardizing categorical variables.
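The vectorized string functions above can be chained on a whole Series (the strings are illustrative):

```python
import pandas as pd

s = pd.Series(['  Apple ', 'BANANA', 'apple'])
clean = s.str.strip().str.lower()               # normalize whitespace and case
replaced = clean.str.replace('apple', 'mango')  # substring replacement
has_an = clean.str.contains('an')               # Boolean substring check
```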

4. Describe the various types of plots in pandas and matplotlib with examples.
o Answer:
1. Line Plot:
 Used for visualizing trends over time.
 Example:
 df.plot.line(x='date', y='value')
2. Bar Plot:
 Visualizes categorical data.
 Example:
 df['category'].value_counts().plot.bar()
3. Histogram:
 Displays data distribution.
 Example:
 df['column'].plot.hist(bins=10)
4. Density Plot:
 Shows the probability density function.
 Example:
 df['column'].plot.kde()
5. Scatter Plot:
 Visualizes relationships between two variables.
 Example:
 df.plot.scatter(x='x_column', y='y_column')
6. Significance:
 Helps in data exploration.
 Identifies patterns, trends, and relationships.
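A minimal line-plot sketch, assuming matplotlib is installed; the non-interactive Agg backend renders the figure to a file rather than a window, and the sales figures are made up:

```python
import matplotlib
matplotlib.use('Agg')              # render to files, no display needed
import matplotlib.pyplot as plt
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'month': [1, 2, 3, 4],
                   'sales': [10, 14, 9, 17]})

ax = df.plot.line(x='month', y='sales', title='Monthly sales')
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, 'line.png')
    ax.figure.savefig(out)
    saved = os.path.getsize(out) > 0   # a non-empty PNG was written
plt.close('all')
```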

5. Discuss the steps and importance of detecting and filtering outliers.


o Answer:
1. Definition: Outliers are extreme values that deviate significantly from
the dataset.
2. Detection Methods:
 Statistical Methods:
 IQR Method: Identify values outside
Q1 − 1.5×IQR and Q3 + 1.5×IQR.
 Z-Score: Detect values with a z-score > 3.
 Visualization:
 Use boxplots or scatter plots.
3. Filtering Methods:
 Use conditions to remove outliers:
 df = df[(df['column'] > lower_limit) & (df['column'] <
upper_limit)]
4. Importance:
 Ensures accurate analysis.
 Prevents distortion in statistical results.
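The IQR filtering condition above as a runnable sketch (the dataset is hypothetical, with one obvious outlier):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})  # 95 is an outlier

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the IQR fences.
filtered = df[(df['value'] >= lower) & (df['value'] <= upper)]
```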

UNIT IV STATISTICAL ANALYSIS 9


Introduction - Data Preparation - Exploratory Data Analysis: Data summarization - Data
distribution - Outlier Treatment - Measuring asymmetry - Continuous distribution - Empirical
Distribution; Estimation: Mean - Variance - Randomness - Sampling - Covariance -
Correlation, Measuring the Variability in Estimates: Point estimates - Confidence intervals;
Hypothesis Testing: Using confidence intervals - Using p-values.

2-Mark Questions with Answers

1. What is data preparation?


o Answer: Data preparation is the process of cleaning, transforming, and
organizing raw data into a suitable format for analysis or modeling.

2. What is Exploratory Data Analysis (EDA)?


o Answer: EDA involves summarizing and visualizing datasets to uncover
patterns, detect anomalies, and check assumptions before applying statistical
methods.

3. What are outliers in a dataset?


o Answer: Outliers are data points that significantly deviate from the majority
of the dataset and can potentially distort analysis.

4. Define data distribution.


o Answer: Data distribution describes how values in a dataset are spread out,
often represented using histograms, density plots, or probability distributions.

5. What is skewness?
o Answer: Skewness measures the asymmetry of a data distribution. A
distribution can be positively skewed (right-skewed), negatively skewed (left-
skewed), or symmetric.

6. What is a continuous distribution?


o Answer: A continuous distribution represents data that can take an infinite
number of values within a range, such as heights or weights. Examples include
the normal and uniform distributions.

7. What is an empirical distribution?


o Answer: An empirical distribution is a distribution function derived from
observed data, often visualized through empirical cumulative distribution
functions (ECDFs).

8. What is meant by a point estimate?


o Answer: A point estimate is a single value calculated from sample data to
estimate a population parameter, such as the sample mean for the population
mean.

9. What is a confidence interval?


o Answer: A confidence interval is a range of values that is likely to contain the
true population parameter with a specified probability (e.g., 95%).

10. Define p-value in hypothesis testing.


o Answer: The p-value is the probability of obtaining a test statistic at least as
extreme as the one observed, assuming the null hypothesis is true.

11. What is covariance?


o Answer: Covariance measures the extent to which two variables change
together. A positive value indicates a direct relationship, while a negative
value indicates an inverse relationship.

12. What is correlation?


o Answer: Correlation measures the strength and direction of the linear
relationship between two variables. It is standardized between -1 and 1.

13. What is randomness in sampling?


o Answer: Randomness in sampling ensures that each member of the population
has an equal chance of being selected, minimizing bias.

14. What is hypothesis testing?


o Answer: Hypothesis testing is a statistical method used to decide whether
there is enough evidence to reject a null hypothesis in favor of an alternative
hypothesis.

15. What is the difference between mean and variance?


o Answer: The mean is the average of the data values, while variance measures
the spread of the data around the mean.

16. What is the role of confidence intervals in hypothesis testing?


o Answer: Confidence intervals help determine whether a parameter estimate
supports or rejects the null hypothesis by checking if the hypothesized value
lies within the interval.

16-Mark Questions with Answers

1. Explain the importance and steps involved in data preparation.


o Answer:
1. Definition: Data preparation ensures data quality by cleaning,
transforming, and organizing data for analysis.
2. Steps:
 Data Cleaning: Handle missing values, remove duplicates, and
correct errors.
 Data Transformation: Normalize, scale, or encode variables
for analysis.
 Feature Selection: Select relevant variables to improve model
performance.
 Data Splitting: Divide data into training, validation, and
testing sets.
3. Significance:
 Improves data quality.
 Reduces noise and inconsistencies.
 Enhances model accuracy.

2. What is Exploratory Data Analysis (EDA)? Explain the techniques used in EDA.
o Answer:
1. Definition: EDA involves analyzing datasets to summarize their main
characteristics using visual and statistical methods.
2. Techniques:
 Data Summarization: Calculate measures like mean, median,
and standard deviation.
 Data Distribution: Use histograms, boxplots, and density plots
to understand spread.
 Outlier Detection: Identify outliers using boxplots or z-scores.
 Measuring Asymmetry: Compute skewness and kurtosis.
 Visualizations: Use scatter plots, pair plots, and heatmaps.
3. Applications:
 Detect anomalies.
 Discover patterns.
 Guide feature engineering.

3. Discuss the concepts of mean, variance, covariance, and correlation in detail.


o Answer:
1. Mean: The average value of a dataset:
Mean = (1/n) Σᵢ xᵢ
2. Variance: The spread of data around the mean:
Variance = (1/n) Σᵢ (xᵢ − μ)²
3. Covariance: Measures how two variables change together:
Cov(X, Y) = (1/n) Σᵢ (xᵢ − X̄)(yᵢ − Ȳ)
4. Correlation: A standardized measure of linear relationship:
r = Cov(X, Y) / (σ_X σ_Y)
5. Applications:
 Mean: Central tendency.
 Variance: Data spread.
 Covariance: Variable dependency.
 Correlation: Strength and direction of relationships.
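These four statistics computed with NumPy on illustrative data (y is an exact linear function of x, so the correlation is 1; `bias=True` requests the population normalization 1/n used in the formulas above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])         # perfectly linear in x

mean_x = x.mean()                          # central tendency
var_x = x.var()                            # population variance
cov_xy = np.cov(x, y, bias=True)[0, 1]     # population covariance
corr_xy = np.corrcoef(x, y)[0, 1]          # Pearson correlation
```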

4. Explain hypothesis testing with confidence intervals and p-values.


o Answer:
1. Definition: Hypothesis testing assesses evidence to decide whether to
reject a null hypothesis (H0).
2. Using Confidence Intervals:
 Construct a confidence interval around the sample statistic.
 Reject H0 if the hypothesized value lies outside the
interval.
3. Using p-values:
 Compute the probability of observing the data, assuming
H0 is true.
 Reject H0 if p ≤ α (significance level, e.g., 0.05).
4. Steps:
 Define H0 and H1.
 Choose a significance level (α).
 Calculate the test statistic and p-value.
 Make a decision (reject or fail to reject H0).
5. Example:
 Null Hypothesis: The mean of a population is μ0.
 Confidence Interval: Check if μ0 is within the interval.
6. Applications:
 Testing the effectiveness of a new drug.
 Verifying if two datasets have the same mean.
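A minimal one-sample z-test sketch of the procedure above, using only the standard library. The sample values and the assumed known population σ are hypothetical; a real analysis with unknown σ would use a t-test instead:

```python
import math

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]
mu0 = 5.0          # H0: the population mean is 5.0
sigma = 0.2        # assumed known population standard deviation

n = len(sample)
xbar = sum(sample) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 95% confidence interval for the mean.
half_width = 1.96 * sigma / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

reject = p_value <= 0.05   # fail to reject here: mu0 lies inside the CI
```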
5. Describe point estimates, confidence intervals, and their role in variability
measurement.
o Answer:
1. Point Estimate:
 A single value used to estimate a population parameter.
 Example: Sample mean (x̄) for the population mean (μ).
2. Confidence Interval:
 A range of values that likely contains the population parameter.
 Formula for the mean: CI = x̄ ± Z·(σ/√n)
3. Role in Measuring Variability:
 Point estimates provide a single value, but confidence intervals
account for variability in the data.
 Wider intervals indicate greater uncertainty in the estimate.
4. Applications:
 Assessing precision in sample estimates.
 Supporting hypothesis testing.

UNIT V PREDICTION AND INFERENCE WITH MACHINE LEARNING 9


Machine learning - Modeling Process - Training model - Validating model - Predicting new
observations - Supervised learning algorithms - Linear Regression - Unsupervised learning
algorithms - K Means Clustering - Reinforcement learning.

2-Mark Questions with Answers

1. What is Machine Learning?


o Answer: Machine learning is a subset of artificial intelligence that enables
systems to learn patterns from data and make decisions or predictions without
being explicitly programmed.
2. What are the three types of machine learning?
o Answer: The three types are:
1. Supervised Learning.
2. Unsupervised Learning.
3. Reinforcement Learning.
3. What is meant by training a model in machine learning?
o Answer: Training a model involves feeding data into an algorithm to learn
patterns and relationships between input (features) and output (labels).
4. What is model validation in machine learning?
o Answer: Model validation is the process of assessing a trained model's
performance on unseen data to ensure it generalizes well.
5. Define supervised learning.
o Answer: Supervised learning involves training a model on labeled data, where
the input-output relationship is known.
6. Define unsupervised learning.
o Answer: Unsupervised learning involves using unlabeled data to identify
patterns or groupings within the dataset.
7. What is Linear Regression used for?
o Answer: Linear regression is used to predict a continuous output based on one
or more input features.
8. What is K-Means clustering?
o Answer: K-Means clustering is an unsupervised algorithm that partitions data
into k clusters based on feature similarity.
9. What is reinforcement learning?
o Answer: Reinforcement learning is a type of machine learning where an agent
interacts with an environment to maximize cumulative rewards.
10. What is the objective function of K-Means?
o Answer: The objective is to minimize the sum of squared distances between
data points and their respective cluster centroids.
11. What is the difference between training and testing in machine learning?
o Answer: Training involves learning patterns from data, while testing evaluates
the model's performance on unseen data.
12. What is the formula for Linear Regression?
o Answer: Y = β0 + β1X1 + ⋯ + βnXn + ε, where the β values are coefficients
and ε is the error term.
13. What is the discount factor in reinforcement learning?
o Answer: The discount factor (γ) determines the importance of future
rewards in reinforcement learning.
14. What is a centroid in K-Means clustering?
o Answer: A centroid is the center of a cluster, calculated as the mean of all
points assigned to that cluster.
15. What is the difference between supervised and unsupervised learning?
o Answer: Supervised learning uses labeled data, while unsupervised learning
uses unlabeled data to identify patterns.
16. What is overfitting in machine learning?
o Answer: Overfitting occurs when a model performs well on training data but
poorly on unseen data due to memorizing rather than generalizing patterns.

16-Mark Questions with Answers

1. Explain the machine learning modeling process in detail.


o Answer:
1. Data Collection: Gather and preprocess data.
2. Feature Selection: Select the most relevant features.
3. Training: Train the model using labeled or unlabeled data.
4. Validation: Evaluate the model's performance on validation data.
5. Testing: Use unseen data to test the model.
6. Deployment: Deploy the model in a production environment.
7. Monitoring and Updating: Continuously monitor performance and
retrain if necessary.
2. Describe Linear Regression in detail, including its mathematical formulation,
assumptions, and applications.
o Answer:
1. Definition: Linear regression predicts a continuous target variable
based on input features.
2. Formula: Y=β0+β1X1+⋯+βnXn+ϵY = \beta_0 + \beta_1X_1 + \dots
+ \beta_nX_n + \epsilon.
3. Assumptions:
 Linear relationship between independent and dependent
variables.
 Homoscedasticity: Constant variance of errors.
 No multicollinearity among independent variables.
 Errors are normally distributed.
4. Applications:
 House price prediction.
 Sales forecasting.
 Predicting exam scores.
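A least-squares fit of the formula above, done with NumPy rather than a dedicated ML library to keep the sketch self-contained; the data is synthetic, generated around y ≈ 2x + 1:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])   # roughly 2x + 1 with noise

# Design matrix with an intercept column; solve min ||X·beta - y||².
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

pred = intercept + slope * 5.0             # predict a new observation
```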
3. Explain K-Means clustering in detail with its algorithm, objective function, and
applications.
o Answer:
1. Definition: K-Means partitions data into k clusters based on feature
similarity.
2. Algorithm:
 Initialize k centroids.
 Assign points to the nearest centroid.
 Update centroids as the mean of assigned points.
 Repeat until centroids stabilize.
3. Objective Function: J = Σᵢ₌₁ᵏ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖²
4. Applications:
 Customer segmentation.
 Market analysis.
 Image compression.
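A minimal K-Means sketch implementing the assign/update loop above. It uses a simple deterministic initialization (the first k points) rather than a production-grade seeding scheme such as k-means++, and the two "blobs" of points are illustrative:

```python
import numpy as np

def kmeans(points, k, iters=10):
    """Minimal K-Means: assign to nearest centroid, then recompute means."""
    centroids = points[:k].astype(float).copy()   # naive initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        for i in range(k):
            if (labels == i).any():
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

# Two well-separated clusters of 2-D points.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(pts, k=2)
```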
4. Discuss reinforcement learning in detail, including its components, algorithms,
and applications.
o Answer:
1. Definition: Reinforcement learning involves an agent learning through
trial and error in an environment to maximize cumulative rewards.
2. Components:
 Agent: Decision-maker.
 Environment: Interaction space.
 State (S): Current situation.
 Action (A): Possible moves.
 Reward (R): Feedback for actions.
 Policy (π): Strategy for choosing actions.
3. Algorithms:
 Q-Learning: Learn the action-value function Q(s, a).
 Deep Q-Networks (DQN): Use deep learning for complex
environments.
4. Applications:
 Robotics.
 Self-driving cars.
 Game playing (e.g., AlphaGo).
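A toy Q-learning sketch of the components above: a hypothetical 4-state corridor where the agent earns a reward of 1 for reaching the rightmost state, with ε-greedy exploration and the standard update rule Q(s,a) ← Q(s,a) + α(r + γ·max Q(s',·) − Q(s,a)):

```python
import random

# Tiny deterministic corridor: states 0..3, actions 0=left, 1=right.
N_STATES, GOAL = 4, 3
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                       # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda act: Q[s][act])
        nxt, r, done = step(s, a)
        # Q-learning update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[s][a])
        s = nxt

# Greedy policy after training: move right in every non-terminal state.
policy = [max((0, 1), key=lambda act: Q[s][act]) for s in range(N_STATES)]
```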
5. Compare supervised, unsupervised, and reinforcement learning with examples.
o Answer:
Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Definition | Learns from labeled data | Learns from unlabeled data | Learns from interactions with the environment
Examples | Linear Regression, Decision Trees | K-Means, PCA | Q-Learning, DQN
Goal | Predict outcomes for unseen data | Discover hidden patterns | Maximize cumulative reward
Applications | Fraud detection, Sales forecasting | Customer segmentation, Anomaly detection | Robotics, Game playing
