PDS Question Bank

The document provides an introduction to data science, outlining its need, benefits, and the data science process, which includes data retrieval, cleansing, analysis, modeling, and presentation. It also covers Python libraries like NumPy and pandas for data manipulation, including creating ndarrays, data structures, and performing operations. Additionally, it discusses data cleaning techniques, handling missing data, and visualization methods using pandas and matplotlib.

UNIT I INTRODUCTION TO DATA SCIENCE

Introduction: Need for data science - Benefits and uses - Causality and Experimentation - Facets
of data-Data science process: Retrieving data - Cleansing, integrating and transforming data -
Exploratory Data Analysis - Build the models - Presenting findings and building applications

2-Mark Questions with Answers

1. What is data science?


o Answer: Data science is an interdisciplinary field that combines statistical
analysis, machine learning, and data engineering to extract insights and
knowledge from structured and unstructured data.

2. Why is data science needed?


o Answer: Data science is needed to analyze large and complex datasets,
uncover patterns, predict outcomes, and drive decision-making in various
fields like business, healthcare, and technology.

3. List any two benefits of data science.


o Answer:
 Helps organizations make data-driven decisions.
 Enables predictions and forecasts through machine learning models.

4. What is causality in data science?


o Answer: Causality refers to understanding cause-and-effect relationships,
where one variable directly influences another.

5. What is the role of experimentation in data science?


o Answer: Experimentation helps test hypotheses and determine causal
relationships between variables, often using techniques like A/B testing.

6. What are the facets of data in data science?


o Answer: The facets of data include structure, format, quality, completeness,
consistency, and granularity.

7. What are the main steps in the data science process?


o Answer:

1. Retrieving data
2. Cleansing, integrating, and transforming data
3. Exploratory Data Analysis (EDA)
4. Building models
5. Presenting findings and building applications

8. What is data retrieval?


o Answer: Data retrieval involves accessing and collecting data from various
sources like databases, APIs, or web scraping.
9. What is data cleansing?
o Answer: Data cleansing is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in data to improve its quality.

10. What is Exploratory Data Analysis (EDA)?


o Answer: EDA is the process of summarizing, visualizing, and understanding
data to uncover patterns, trends, and relationships.

11. Why is data integration important?


o Answer: Data integration combines data from multiple sources to create a
unified view, improving analysis and decision-making.

12. What is a data science model?


o Answer: A data science model is a mathematical representation or algorithm
used to analyze data and make predictions or decisions.

13. What are the benefits of presenting findings in data science?


o Answer: Presenting findings ensures stakeholders can understand insights and
use them for actionable decisions.

14. What is the purpose of building applications in data science?


o Answer: Applications automate processes, integrate data science models, and
provide end-users with actionable tools and insights.

15. What are structured and unstructured data?


o Answer:

 Structured Data: Organized data stored in tabular formats (e.g.,


databases).
 Unstructured Data: Unorganized data like text, images, or videos.

16. What is the importance of transforming data?


o Answer: Data transformation converts raw data into a suitable format for
analysis, improving compatibility and interpretability.

16-Mark Questions with Answers

1. Explain the need for data science and its benefits and uses.
o Answer:
1. Need for Data Science:
 The exponential growth of data requires tools to analyze and
derive insights.
 Traditional methods are insufficient for handling large-scale,
complex datasets.
 Data science provides techniques to predict outcomes and
improve decision-making.
2. Benefits:
 Drives business decisions through data-driven insights.
 Enhances operational efficiency and cost savings.
 Provides predictive and prescriptive analytics for future
planning.
3. Uses:
 In healthcare: Predicting diseases and optimizing treatments.
 In finance: Fraud detection and credit risk analysis.
 In retail: Personalized recommendations and inventory
management.

2. Describe causality and experimentation in data science with examples.


o Answer:
1. Causality:
 Refers to understanding cause-and-effect relationships between
variables.
 Example: Identifying whether increased advertising (cause)
leads to higher sales (effect).
2. Experimentation:
 Involves testing hypotheses to validate causality.
 A/B Testing: Splitting users into groups to test different
variations of a feature and measure its impact.
Example: Testing two website designs to determine which
increases user engagement.
3. Importance:
 Helps make informed decisions.
 Avoids incorrect conclusions based on correlations alone.

3. Explain the facets of data in data science.


o Answer:
1. Structure:
 Data can be structured (e.g., tables) or unstructured (e.g.,
images).
2. Format:
 Data may exist in formats like JSON, CSV, SQL, or XML.
3. Quality:
 High-quality data ensures accuracy, completeness, and
consistency.
4. Completeness:
 Ensures all necessary information is present in the dataset.
5. Granularity:
 Refers to the level of detail in data, which impacts its usability
for different analyses.

4. Discuss the steps involved in the data science process.


o Answer:
1. Retrieving Data:
 Accessing data from sources like databases, APIs, or web
scraping.
2. Cleansing, Integrating, and Transforming Data:
 Cleaning: Removing errors and inconsistencies.
 Integrating: Combining data from multiple sources.
 Transforming: Converting data into a suitable format.
3. Exploratory Data Analysis (EDA):
 Visualizing data using plots, identifying trends, and detecting
anomalies.
4. Building Models:
 Developing machine learning models for predictions or
classifications.
5. Presenting Findings and Building Applications:
 Creating reports, dashboards, or interactive applications for
end-users.

5. Explain the importance of data cleansing, integration, and transformation in the


data science process.
o Answer:
1. Data Cleansing:
 Identifies and corrects errors like missing values, duplicates,
and outliers.
 Example: Replacing missing values with the column mean.
2. Data Integration:
 Combines datasets from multiple sources into a single, unified
view.
 Example: Merging customer data from an e-commerce
platform with payment history.
3. Data Transformation:
 Converts data into a suitable format for analysis.
 Example: Scaling numerical data or encoding categorical
variables.
4. Significance:
 Improves data quality.
 Ensures compatibility for machine learning models.
 Enhances the accuracy of analysis.

6. Describe Exploratory Data Analysis (EDA) and its role in the data science
process.
o Answer:
1. Definition:
 EDA is the process of summarizing and visualizing data to
identify patterns, relationships, and anomalies.
2. Steps in EDA:
 Visualizing distributions using histograms or density plots.
 Checking relationships between variables using scatter plots.
 Identifying missing values or outliers.
3. Role in Data Science:
 Provides insights to guide feature engineering.
 Detects potential issues in data quality.
4. Example:
 Analyzing customer purchase data to understand buying
patterns.
5. Tools:
 Libraries like pandas, matplotlib, and seaborn.
7. Explain the importance of presenting findings and building applications in data
science.
o Answer:
1. Presenting Findings:
 Converts complex data insights into actionable visuals like
dashboards, reports, and graphs.
 Example: A sales dashboard showing trends and forecasts.
2. Building Applications:
 Embeds data science models into software tools for automation
and scalability.
 Example: A recommendation system for e-commerce websites.
3. Importance:
 Improves decision-making for stakeholders.
 Enhances user experience with real-time insights.
 Bridges the gap between data analysis and business application.

UNIT II PYTHON LIBRARIES FOR DATA SCIENCE 9


NumPy Basics: Arrays and Vectorized Computation - The NumPy ndarray - Creating ndarrays -
Data Types for ndarrays - Arithmetic with NumPy Arrays - Basic Indexing and Slicing - Boolean
Indexing - Transposing Arrays and Swapping Axes. Introduction to pandas Data Structures:
Series, DataFrame - Essential Functionality: Dropping Entries - Indexing, Selection, and Filtering
- Function Application and Mapping- Sorting and Ranking. Summarizing and Computing
Descriptive Statistics - Unique Values, Value Counts, and Membership. Reading and Writing Data
in Text Format.

2-Mark Questions with Answers

1. What is a NumPy ndarray?


o Answer: A NumPy ndarray (n-dimensional array) is a multi-dimensional array
object that is fast and flexible, designed for numerical computations in Python.

2. How do you create a NumPy ndarray?


o Answer: Use functions like np.array(), np.zeros(), np.ones(), or np.arange() to
create an ndarray. Example:
o arr = np.array([1, 2, 3])

3. What are the data types supported by NumPy ndarrays?


o Answer: Common data types include int32, int64, float32, float64, bool,
complex, and object.

4. What is vectorized computation in NumPy?


o Answer: Vectorized computation allows operations to be applied element-
wise to arrays without explicit loops, improving performance.
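A minimal sketch of vectorized computation versus an explicit loop (the array values are illustrative):

```python
import numpy as np

# Vectorized computation: the operation applies to every element at once.
arr = np.array([1, 2, 3, 4])
squared = arr ** 2                          # element-wise power

# Equivalent explicit loop (much slower for large arrays):
squared_loop = np.array([x ** 2 for x in arr])
```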

5. What is Boolean indexing in NumPy?


o Answer: Boolean indexing uses a Boolean array to filter elements in another
array. Example:
o arr[arr > 5]
6. How do you transpose an array in NumPy?
o Answer: Use the T attribute or np.transpose() function. Example:
o transposed = arr.T

7. What is a pandas Series?


o Answer: A pandas Series is a one-dimensional labeled array capable of
holding data of any type (integer, float, string, etc.).

8. What is a pandas DataFrame?


o Answer: A DataFrame is a two-dimensional, size-mutable, and labeled data
structure similar to a spreadsheet or SQL table.

9. How do you drop entries in a pandas DataFrame?


o Answer: Use the drop() method to remove rows or columns. Example:
o df.drop('column_name', axis=1)

10. What is the purpose of the apply() function in pandas?


o Answer: The apply() function applies a function along the axis of a
DataFrame or Series. Example:
o df['column'].apply(np.sqrt)

11. How do you sort a pandas DataFrame by values?


o Answer: Use the sort_values() method. Example:
o df.sort_values(by='column_name')

12. What is the difference between unique() and value_counts() in pandas?


o Answer: unique() returns unique values in a column, while value_counts()
provides the count of each unique value.

13. How do you compute descriptive statistics in pandas?


o Answer: Use methods like mean(), median(), std(), sum(), or describe().

14. What is the purpose of read_csv() in pandas?


o Answer: The read_csv() function reads data from a CSV file into a pandas
DataFrame.

15. How do you write a pandas DataFrame to a text file?


o Answer: Use the to_csv() method to save a DataFrame to a text or CSV file.

16. What is the significance of indexing in pandas?


o Answer: Indexing allows selection, filtering, and alignment of data for easier
manipulation.

16-Mark Questions with Answers

1. Explain the creation of NumPy ndarrays and arithmetic operations with


examples.
o Answer:
1. Creating ndarrays:
 Using np.array():
 arr = np.array([1, 2, 3, 4])
 Using np.zeros() and np.ones():
 zeros = np.zeros((2, 2))
 ones = np.ones((3,))
 Using np.arange() and np.linspace():
 arr1 = np.arange(0, 10, 2)
 arr2 = np.linspace(0, 1, 5)
2. Arithmetic Operations:
 Element-wise addition, subtraction, multiplication, and
division:
 arr1 + arr2
 arr1 * arr2
 Broadcasting for operations on arrays of different shapes.
3. Advantages:
 Faster computation.
 Reduced memory usage.
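The creation functions and arithmetic described above can be combined in one short sketch (values chosen only for illustration):

```python
import numpy as np

# Creating ndarrays with the constructors listed above.
arr1 = np.arange(0, 10, 2)        # [0 2 4 6 8]
arr2 = np.linspace(0, 1, 5)       # 5 evenly spaced values in [0, 1]
zeros = np.zeros((2, 2))
ones = np.ones((3,))

# Element-wise arithmetic on same-shaped arrays.
total = arr1 + arr2

# Broadcasting: a scalar is stretched to match the array's shape.
scaled = arr1 * 10
```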

2. Discuss indexing, slicing, and Boolean indexing in NumPy with examples.


o Answer:
1. Indexing:
 Access specific elements using integer indices:
 arr[2] # Access the 3rd element
2. Slicing:
 Extract subsets using slice notation:
 arr[1:4] # Extract elements from index 1 to 3
3. Boolean Indexing:
 Filter elements based on a condition:
 arr[arr > 5]
4. Examples:
 Access rows/columns of multi-dimensional arrays:
 arr[1, :] # Access the 2nd row
 arr[:, 2] # Access the 3rd column
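The indexing forms above can be demonstrated together on a small illustrative array:

```python
import numpy as np

arr = np.array([3, 8, 1, 9, 4, 7])
third = arr[2]            # integer indexing: the 3rd element
sub = arr[1:4]            # slicing: elements at indices 1..3
big = arr[arr > 5]        # Boolean indexing: keep values > 5

mat = np.array([[1, 2, 3],
                [4, 5, 6]])
row2 = mat[1, :]          # 2nd row
col3 = mat[:, 2]          # 3rd column
```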

3. Explain the key pandas data structures (Series and DataFrame) with examples.
o Answer:
1. Series:
 One-dimensional labeled array:
 s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
2. DataFrame:
 Two-dimensional labeled data structure:
 df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3. Key Operations:
 Indexing: Access rows or columns using labels.
 Selection: Use loc[] or iloc[] for label-based or position-based
selection.
4. Significance:
 Ideal for structured data.
 Simplifies data manipulation.
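A brief sketch of both structures and label- vs. position-based selection (the data is illustrative):

```python
import pandas as pd

# Series: one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# DataFrame: two-dimensional labeled table.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
by_label = df.loc['x', 'A']      # label-based selection
by_position = df.iloc[1, 1]      # position-based selection
col_b = df['B']                  # column selection
```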

4. Describe essential functionality in pandas, including dropping entries, filtering,


and mapping.
o Answer:
1. Dropping Entries:
 Remove rows/columns using drop():
 df.drop('column_name', axis=1)
2. Filtering:
 Use conditions to filter rows:
 df[df['column'] > 10]
3. Mapping:
 Transform column values using map() or apply():
 df['column'].map(lambda x: x * 2)
4. Example:
 Drop duplicates:
 df.drop_duplicates()
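The dropping, filtering, and mapping operations above in one runnable sketch (sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'ann'],
                   'score': [12, 8, 12]})

no_dupes = df.drop_duplicates()            # drops the repeated row
high = df[df['score'] > 10]                # Boolean filtering
doubled = df['score'].map(lambda x: x * 2) # element-wise mapping
trimmed = df.drop('name', axis=1)          # drop a column
```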

5. Explain summarizing and computing descriptive statistics in pandas.


o Answer:
1. Descriptive Statistics:
 Use mean(), median(), std(), etc., to compute metrics:
 df['column'].mean()
 df.describe()
2. Unique Values:
 Identify unique values using unique():
 df['column'].unique()
3. Value Counts:
 Count occurrences of each unique value:
 df['column'].value_counts()
4. Membership:
 Check if values belong to a set:
 df['column'].isin([1, 2, 3])
5. Significance:
 Provides insights into data distributions.
 Guides further analysis.
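These summarization methods can be sketched on a small hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A', 'C', 'A']})
uniques = df['grade'].unique()          # distinct values
counts = df['grade'].value_counts()     # frequency of each value
is_ab = df['grade'].isin(['A', 'B'])    # Boolean membership mask

nums = pd.Series([2, 4, 6, 8])
avg = nums.mean()                       # descriptive statistic
```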

6. Discuss reading and writing data in pandas with examples.


o Answer:
1. Reading Data:
 Use read_csv() to load data:
 df = pd.read_csv('file.csv')
2. Writing Data:
 Save data using to_csv():
 df.to_csv('output.csv', index=False)
3. Other File Formats:
 Read/write Excel:
 pd.read_excel('file.xlsx')
4. Significance:
 Enables integration with external data sources.
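A round-trip sketch of writing and re-reading a CSV; a temporary directory is used here only so the example is self-contained:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'city': ['Pune', 'Delhi']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    df.to_csv(path, index=False)       # index=False omits the row labels
    restored = pd.read_csv(path)       # load it back into a DataFrame
```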
UNIT III DATA CLEANING, PREPARATION AND VISUALIZATION 9
Data Cleaning and Preparation: Handling Missing Data - Data Transformation: Removing
Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and
Filtering Outliers- String Manipulation: Vectorized String Functions in pandas. Plotting with
pandas and matplotlib: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point
Plots.

2-Mark Questions with Answers

1. What is data cleaning?


o Answer: Data cleaning is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in a dataset to improve its quality and
suitability for analysis.

2. How can missing data be handled in pandas?


o Answer: Missing data can be handled in pandas by methods such as:
 Dropping rows or columns using dropna().
 Filling missing values using fillna() with a specific value or statistical
methods (e.g., mean, median).

3. What is data transformation?


o Answer: Data transformation involves changing the format, structure, or
values of data to make it suitable for analysis, such as scaling, mapping, or
encoding.

4. How do you remove duplicates in pandas?


o Answer: Use the drop_duplicates() function in pandas to remove duplicate
rows from a DataFrame.

5. What is an outlier?
o Answer: An outlier is a data point that deviates significantly from the rest of
the dataset and can affect analysis results.

6. How can outliers be detected in pandas?


o Answer: Outliers can be detected using:
 Statistical methods (e.g., z-scores, IQR).
 Visualization techniques like boxplots.

7. What are vectorized string functions in pandas?


o Answer: Vectorized string functions in pandas operate on entire columns of
string data efficiently, such as str.lower(), str.upper(), and str.replace().

8. What is the difference between a histogram and a density plot?


o Answer: A histogram shows the frequency distribution of data, while a
density plot represents the data's probability density function.

9. What is the use of a scatter plot?


o Answer: A scatter plot visualizes the relationship between two continuous
variables by plotting data points on a two-dimensional axis.
10. What is the purpose of matplotlib in Python?
o Answer: matplotlib is a Python library used for creating static, interactive, and
animated visualizations, such as line plots, bar charts, and scatter plots.

11. What is the function of fillna() in pandas?


o Answer: The fillna() function fills missing values in a DataFrame or Series
with specified values or methods like forward-fill or backward-fill.

12. How do you replace values in a pandas DataFrame?


o Answer: Use the replace() function to substitute specified values in a
DataFrame or Series with new values.

13. What is the purpose of str.contains() in pandas?


o Answer: The str.contains() function checks if a substring is present in a string
column and returns a Boolean Series.

14. What is a bar plot?


o Answer: A bar plot displays categorical data with rectangular bars, where the
length of each bar is proportional to its value.

15. How do you plot a line graph using pandas?


o Answer: Use the plot.line() function in pandas or specify kind='line' in the
plot() function to create a line plot.

16. How can duplicate rows be identified in pandas?


o Answer: Duplicate rows can be identified using the duplicated() function,
which returns a Boolean Series indicating duplicate entries.

16-Mark Questions with Answers

1. Explain the steps involved in data cleaning, including handling missing data and
removing duplicates.
o Answer:
1. Definition: Data cleaning involves detecting and correcting errors in
datasets to ensure reliability.
2. Steps:
 Handling Missing Data:
 Drop rows/columns with missing values using dropna().
 Fill missing values using fillna() with:
 Mean, median, or mode for numerical data.
 Specific values for categorical data.
 Removing Duplicates:
 Detect duplicates using duplicated().
 Remove duplicates using drop_duplicates().
3. Importance:
 Reduces noise.
 Ensures consistency in analysis.
 Improves model accuracy.
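The cleaning steps above, sketched on a small hypothetical dataset with one missing value and one duplicate row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, 30],
                   'city': ['Pune', 'Delhi', 'Pune', 'Pune']})

dropped = df.dropna()                          # drop rows with missing values
filled = df.fillna({'age': df['age'].mean()})  # fill NaN with the column mean
dupes = filled.duplicated()                    # Boolean mask of repeated rows
deduped = filled.drop_duplicates()
```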
2. Describe data transformation techniques with examples of removing duplicates,
mapping, replacing values, and filtering outliers.
o Answer:
1. Removing Duplicates:
 Example: df.drop_duplicates().
2. Transforming Data Using a Function or Mapping:
 Example: Map categorical values to numbers using
df['column'].map({'A': 1, 'B': 2}).
3. Replacing Values:
 Example: Replace incorrect entries using
df.replace({'old_value': 'new_value'}).
4. Filtering Outliers:
 Use statistical methods like:
 IQR Method: Filter out values outside
Q1 − 1.5×IQR and Q3 + 1.5×IQR.
 Z-Score: Remove values with a z-score > 3.
5. Significance:
 Prepares data for accurate analysis.
 Reduces bias and anomalies.

3. Explain vectorized string functions in pandas with examples.


o Answer:
1. Definition: Vectorized string functions in pandas enable efficient
string operations on entire columns or Series.
2. Common Functions:
 str.lower(): Converts text to lowercase.
 str.upper(): Converts text to uppercase.
 str.strip(): Removes leading/trailing whitespaces.
 str.replace(): Replaces substrings.
 str.contains(): Checks for the presence of substrings.
3. Example:
4. df['column'] = df['column'].str.lower()
5. df['column'] = df['column'].str.replace('old', 'new')
6. Applications:
 Text preprocessing in NLP.
 Standardizing categorical variables.
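The vectorized string functions above can be chained on a whole Series (the strings are illustrative):

```python
import pandas as pd

s = pd.Series(['  Apple ', 'BANANA', 'apple'])
clean = s.str.strip().str.lower()               # normalize whitespace and case
replaced = clean.str.replace('apple', 'mango')  # substring replacement
has_an = clean.str.contains('an')               # Boolean substring check
```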

4. Describe the various types of plots in pandas and matplotlib with examples.
o Answer:
1. Line Plot:
 Used for visualizing trends over time.
 Example:
 df.plot.line(x='date', y='value')
2. Bar Plot:
 Visualizes categorical data.
 Example:
 df['category'].value_counts().plot.bar()
3. Histogram:
 Displays data distribution.
 Example:
 df['column'].plot.hist(bins=10)
4. Density Plot:
 Shows the probability density function.
 Example:
 df['column'].plot.kde()
5. Scatter Plot:
 Visualizes relationships between two variables.
 Example:
 df.plot.scatter(x='x_column', y='y_column')
6. Significance:
 Helps in data exploration.
 Identifies patterns, trends, and relationships.
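A minimal line-plot sketch, assuming matplotlib is installed; the non-interactive Agg backend renders the figure to a file rather than a window, and the sales figures are made up:

```python
import matplotlib
matplotlib.use('Agg')              # render to files, no display needed
import matplotlib.pyplot as plt
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'month': [1, 2, 3, 4],
                   'sales': [10, 14, 9, 17]})

ax = df.plot.line(x='month', y='sales', title='Monthly sales')
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, 'line.png')
    ax.figure.savefig(out)
    saved = os.path.getsize(out) > 0   # a non-empty PNG was written
plt.close('all')
```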

5. Discuss the steps and importance of detecting and filtering outliers.


o Answer:
1. Definition: Outliers are extreme values that deviate significantly from
the dataset.
2. Detection Methods:
 Statistical Methods:
 IQR Method: Identify values outside
Q1 − 1.5×IQR and Q3 + 1.5×IQR.
 Z-Score: Detect values with a z-score > 3.
 Visualization:
 Use boxplots or scatter plots.
3. Filtering Methods:
 Use conditions to remove outliers:
 df = df[(df['column'] > lower_limit) & (df['column'] <
upper_limit)]
4. Importance:
 Ensures accurate analysis.
 Prevents distortion in statistical results.
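The IQR filtering condition above as a runnable sketch (the dataset is hypothetical, with one obvious outlier):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})  # 95 is an outlier

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the IQR fences.
filtered = df[(df['value'] >= lower) & (df['value'] <= upper)]
```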

UNIT IV STATISTICAL ANALYSIS 9


Introduction - Data Preparation - Exploratory Data Analysis: Data summarization - Data
distribution - Outlier Treatment - Measuring asymmetry - Continuous distribution - Empirical
Distribution; Estimation: Mean - Variance - Randomness - Sampling - Covariance -
Correlation, Measuring the Variability in Estimates: Point estimates - Confidence intervals;
Hypothesis Testing: Using confidence intervals - Using p-values.

2-Mark Questions with Answers

1. What is data preparation?


o Answer: Data preparation is the process of cleaning, transforming, and
organizing raw data into a suitable format for analysis or modeling.

2. What is Exploratory Data Analysis (EDA)?


o Answer: EDA involves summarizing and visualizing datasets to uncover
patterns, detect anomalies, and check assumptions before applying statistical
methods.

3. What are outliers in a dataset?


o Answer: Outliers are data points that significantly deviate from the majority
of the dataset and can potentially distort analysis.

4. Define data distribution.


o Answer: Data distribution describes how values in a dataset are spread out,
often represented using histograms, density plots, or probability distributions.

5. What is skewness?
o Answer: Skewness measures the asymmetry of a data distribution. A
distribution can be positively skewed (right-skewed), negatively skewed (left-
skewed), or symmetric.

6. What is a continuous distribution?


o Answer: A continuous distribution represents data that can take an infinite
number of values within a range, such as heights or weights. Examples include
the normal and uniform distributions.

7. What is an empirical distribution?


o Answer: An empirical distribution is a distribution function derived from
observed data, often visualized through empirical cumulative distribution
functions (ECDFs).

8. What is meant by a point estimate?


o Answer: A point estimate is a single value calculated from sample data to
estimate a population parameter, such as the sample mean for the population
mean.

9. What is a confidence interval?


o Answer: A confidence interval is a range of values that is likely to contain the
true population parameter with a specified probability (e.g., 95%).

10. Define p-value in hypothesis testing.


o Answer: The p-value is the probability of obtaining a test statistic at least as
extreme as the one observed, assuming the null hypothesis is true.

11. What is covariance?


o Answer: Covariance measures the extent to which two variables change
together. A positive value indicates a direct relationship, while a negative
value indicates an inverse relationship.

12. What is correlation?


o Answer: Correlation measures the strength and direction of the linear
relationship between two variables. It is standardized between -1 and 1.

13. What is randomness in sampling?


o Answer: Randomness in sampling ensures that each member of the population
has an equal chance of being selected, minimizing bias.

14. What is hypothesis testing?


o Answer: Hypothesis testing is a statistical method used to decide whether
there is enough evidence to reject a null hypothesis in favor of an alternative
hypothesis.

15. What is the difference between mean and variance?


o Answer: The mean is the average of the data values, while variance measures
the spread of the data around the mean.

16. What is the role of confidence intervals in hypothesis testing?


o Answer: Confidence intervals help determine whether a parameter estimate
supports or rejects the null hypothesis by checking if the hypothesized value
lies within the interval.

16-Mark Questions with Answers

1. Explain the importance and steps involved in data preparation.


o Answer:
1. Definition: Data preparation ensures data quality by cleaning,
transforming, and organizing data for analysis.
2. Steps:
 Data Cleaning: Handle missing values, remove duplicates, and
correct errors.
 Data Transformation: Normalize, scale, or encode variables
for analysis.
 Feature Selection: Select relevant variables to improve model
performance.
 Data Splitting: Divide data into training, validation, and
testing sets.
3. Significance:
 Improves data quality.
 Reduces noise and inconsistencies.
 Enhances model accuracy.

2. What is Exploratory Data Analysis (EDA)? Explain the techniques used in EDA.
o Answer:
1. Definition: EDA involves analyzing datasets to summarize their main
characteristics using visual and statistical methods.
2. Techniques:
 Data Summarization: Calculate measures like mean, median,
and standard deviation.
 Data Distribution: Use histograms, boxplots, and density plots
to understand spread.
 Outlier Detection: Identify outliers using boxplots or z-scores.
 Measuring Asymmetry: Compute skewness and kurtosis.
 Visualizations: Use scatter plots, pair plots, and heatmaps.
3. Applications:
 Detect anomalies.
 Discover patterns.
 Guide feature engineering.

3. Discuss the concepts of mean, variance, covariance, and correlation in detail.


o Answer:
1. Mean: The average value of a dataset:
Mean = (1/n) Σᵢ xᵢ
2. Variance: The spread of data around the mean:
Variance = (1/n) Σᵢ (xᵢ − μ)²
3. Covariance: Measures how two variables change together:
Cov(X, Y) = (1/n) Σᵢ (xᵢ − X̄)(yᵢ − Ȳ)
4. Correlation: A standardized measure of linear relationship:
r = Cov(X, Y) / (σ_X σ_Y)
5. Applications:
 Mean: Central tendency.
 Variance: Data spread.
 Covariance: Variable dependency.
 Correlation: Strength and direction of relationships.
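These four statistics computed with NumPy on illustrative data (y is an exact linear function of x, so the correlation is 1; `bias=True` requests the population normalization 1/n used in the formulas above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])         # perfectly linear in x

mean_x = x.mean()                          # central tendency
var_x = x.var()                            # population variance
cov_xy = np.cov(x, y, bias=True)[0, 1]     # population covariance
corr_xy = np.corrcoef(x, y)[0, 1]          # Pearson correlation
```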

4. Explain hypothesis testing with confidence intervals and p-values.


o Answer:
1. Definition: Hypothesis testing assesses evidence to decide whether to
reject a null hypothesis (H0).
2. Using Confidence Intervals:
 Construct a confidence interval around the sample statistic.
 Reject H0 if the hypothesized value lies outside the
interval.
3. Using p-values:
 Compute the probability of observing the data, assuming
H0 is true.
 Reject H0 if p ≤ α (significance level, e.g., 0.05).
4. Steps:
 Define H0 and H1.
 Choose a significance level (α).
 Calculate the test statistic and p-value.
 Make a decision (reject or fail to reject H0).
5. Example:
 Null Hypothesis: The mean of a population is μ0.
 Confidence Interval: Check if μ0 is within the interval.
6. Applications:
 Testing the effectiveness of a new drug.
 Verifying if two datasets have the same mean.
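A minimal one-sample z-test sketch of the procedure above, using only the standard library. The sample values and the assumed known population σ are hypothetical; a real analysis with unknown σ would use a t-test instead:

```python
import math

sample = [5.1, 4.9, 5.3, 5.2, 4.8, 5.0, 5.4, 5.1]
mu0 = 5.0          # H0: the population mean is 5.0
sigma = 0.2        # assumed known population standard deviation

n = len(sample)
xbar = sum(sample) / n
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 95% confidence interval for the mean.
half_width = 1.96 * sigma / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)

reject = p_value <= 0.05   # fail to reject here: mu0 lies inside the CI
```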
5. Describe point estimates, confidence intervals, and their role in variability
measurement.
o Answer:
1. Point Estimate:
 A single value used to estimate a population parameter.
 Example: Sample mean (x̄) for the population mean (μ).
2. Confidence Interval:
 A range of values that likely contains the population parameter.
 Formula for the mean: CI = x̄ ± Z·(σ/√n)
3. Role in Measuring Variability:
 Point estimates provide a single value, but confidence intervals
account for variability in the data.
 Wider intervals indicate greater uncertainty in the estimate.
4. Applications:
 Assessing precision in sample estimates.
 Supporting hypothesis testing.

UNIT V PREDICTION AND INFERENCE WITH MACHINE LEARNING 9


Machine learning - Modeling Process - Training model - Validating model - Predicting new
observations - Supervised learning algorithms - Linear Regression - Unsupervised learning
algorithms - K Means Clustering - Reinforcement learning.

2-Mark Questions with Answers

1. What is Machine Learning?


o Answer: Machine learning is a subset of artificial intelligence that enables
systems to learn patterns from data and make decisions or predictions without
being explicitly programmed.
2. What are the three types of machine learning?
o Answer: The three types are:
1. Supervised Learning.
2. Unsupervised Learning.
3. Reinforcement Learning.
3. What is meant by training a model in machine learning?
o Answer: Training a model involves feeding data into an algorithm to learn
patterns and relationships between input (features) and output (labels).
4. What is model validation in machine learning?
o Answer: Model validation is the process of assessing a trained model's
performance on unseen data to ensure it generalizes well.
5. Define supervised learning.
o Answer: Supervised learning involves training a model on labeled data, where
the input-output relationship is known.
6. Define unsupervised learning.
o Answer: Unsupervised learning involves using unlabeled data to identify
patterns or groupings within the dataset.
7. What is Linear Regression used for?
o Answer: Linear regression is used to predict a continuous output based on one
or more input features.
8. What is K-Means clustering?
o Answer: K-Means clustering is an unsupervised algorithm that partitions data
into k clusters based on feature similarity.
9. What is reinforcement learning?
o Answer: Reinforcement learning is a type of machine learning where an agent
interacts with an environment to maximize cumulative rewards.
10. What is the objective function of K-Means?
o Answer: The objective is to minimize the sum of squared distances between
data points and their respective cluster centroids.
11. What is the difference between training and testing in machine learning?
o Answer: Training involves learning patterns from data, while testing evaluates
the model's performance on unseen data.
12. What is the formula for Linear Regression?
o Answer: Y = β0 + β1X1 + ⋯ + βnXn + ε, where the β values are coefficients
and ε is the error term.
13. What is the discount factor in reinforcement learning?
o Answer: The discount factor (γ) determines the importance of future
rewards in reinforcement learning.
14. What is a centroid in K-Means clustering?
o Answer: A centroid is the center of a cluster, calculated as the mean of all
points assigned to that cluster.
15. What is the difference between supervised and unsupervised learning?
o Answer: Supervised learning uses labeled data, while unsupervised learning
uses unlabeled data to identify patterns.
16. What is overfitting in machine learning?
o Answer: Overfitting occurs when a model performs well on training data but
poorly on unseen data due to memorizing rather than generalizing patterns.

16-Mark Questions with Answers

1. Explain the machine learning modeling process in detail.


o Answer:
1. Data Collection: Gather and preprocess data.
2. Feature Selection: Select the most relevant features.
3. Training: Train the model using labeled or unlabeled data.
4. Validation: Evaluate the model's performance on validation data.
5. Testing: Use unseen data to test the model.
6. Deployment: Deploy the model in a production environment.
7. Monitoring and Updating: Continuously monitor performance and
retrain if necessary.
2. Describe Linear Regression in detail, including its mathematical formulation,
assumptions, and applications.
o Answer:
1. Definition: Linear regression predicts a continuous target variable
based on input features.
2. Formula: Y=β0+β1X1+⋯+βnXn+ϵY = \beta_0 + \beta_1X_1 + \dots
+ \beta_nX_n + \epsilon.
3. Assumptions:
 Linear relationship between independent and dependent
variables.
 Homoscedasticity: Constant variance of errors.
 No multicollinearity among independent variables.
 Errors are normally distributed.
4. Applications:
 House price prediction.
 Sales forecasting.
 Predicting exam scores.
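A least-squares fit of the formula above, done with NumPy rather than a dedicated ML library to keep the sketch self-contained; the data is synthetic, generated around y ≈ 2x + 1:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])   # roughly 2x + 1 with noise

# Design matrix with an intercept column; solve min ||X·beta - y||².
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

pred = intercept + slope * 5.0             # predict a new observation
```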
3. Explain K-Means clustering in detail with its algorithm, objective function, and
applications.
o Answer:
1. Definition: K-Means partitions data into k clusters based on feature
similarity.
2. Algorithm:
 Initialize k centroids.
 Assign points to the nearest centroid.
 Update centroids as the mean of assigned points.
 Repeat until centroids stabilize.
3. Objective Function: J = Σᵢ₌₁ᵏ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖²
4. Applications:
 Customer segmentation.
 Market analysis.
 Image compression.
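A minimal K-Means sketch implementing the assign/update loop above. It uses a simple deterministic initialization (the first k points) rather than a production-grade seeding scheme such as k-means++, and the two "blobs" of points are illustrative:

```python
import numpy as np

def kmeans(points, k, iters=10):
    """Minimal K-Means: assign to nearest centroid, then recompute means."""
    centroids = points[:k].astype(float).copy()   # naive initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        for i in range(k):
            if (labels == i).any():
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

# Two well-separated clusters of 2-D points.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(pts, k=2)
```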
4. Discuss reinforcement learning in detail, including its components, algorithms,
and applications.
o Answer:
1. Definition: Reinforcement learning involves an agent learning through
trial and error in an environment to maximize cumulative rewards.
2. Components:
 Agent: Decision-maker.
 Environment: Interaction space.
 State (S): Current situation.
 Action (A): Possible moves.
 Reward (R): Feedback for actions.
 Policy (π): Strategy for choosing actions.
3. Algorithms:
 Q-Learning: Learn the action-value function Q(s, a).
 Deep Q-Networks (DQN): Use deep learning for complex
environments.
4. Applications:
 Robotics.
 Self-driving cars.
 Game playing (e.g., AlphaGo).
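A toy Q-learning sketch of the components above: a hypothetical 4-state corridor where the agent earns a reward of 1 for reaching the rightmost state, with ε-greedy exploration and the standard update rule Q(s,a) ← Q(s,a) + α(r + γ·max Q(s',·) − Q(s,a)):

```python
import random

# Tiny deterministic corridor: states 0..3, actions 0=left, 1=right.
N_STATES, GOAL = 4, 3
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                       # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda act: Q[s][act])
        nxt, r, done = step(s, a)
        # Q-learning update toward r + gamma * max_a' Q(s', a')
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[nxt]) - Q[s][a])
        s = nxt

# Greedy policy after training: move right in every non-terminal state.
policy = [max((0, 1), key=lambda act: Q[s][act]) for s in range(N_STATES)]
```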
5. Compare supervised, unsupervised, and reinforcement learning with examples.
o Answer:
Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Definition | Learns from labeled data | Learns from unlabeled data | Learns from interactions with the environment
Examples | Linear Regression, Decision Trees | K-Means, PCA | Q-Learning, DQN
Goal | Predict outcomes for unseen data | Discover hidden patterns | Maximize cumulative reward
Applications | Fraud detection, Sales forecasting | Customer segmentation, Anomaly detection | Robotics, Game playing
