AW Term1 - DS

The document outlines key topics in Data Science for Grade 12, including Data Governance, Exploratory Data Analysis (EDA), and Decision Trees. It emphasizes the importance of data governance in ensuring data quality and privacy, introduces EDA as a method for understanding data patterns, and explains decision trees as a tool for decision-making and classification. Additionally, it includes assignments and ethical guidelines related to data handling and privacy laws.


Academic Window

Data Science
Grade 12
S.No. Topic
1. Data Governance
2. Exploratory Data Analysis
3. Decision Tree
4. Classification Algorithms - II
5. Communication Skills
Chapter 1: Data Governance

1. What is Data Governance?


 Data governance is a system of rules, policies, and technologies that help manage and protect data.
 It ensures data quality, security, and proper usage.
 Defines who can access data, how it is used, and under what conditions.
 Focuses on:
o Data integrity (accuracy and reliability)

o Data availability (accessible when needed)

o Data usability (easy to use)

o Data consistency (uniform across systems)

2. Ethical Guidelines
 Ethics refers to moral principles that guide behavior.
 Data and software should be used responsibly for the benefit of society.
 Key ethical values to follow:
o Integrity (honesty in handling data)

o Objectivity (no personal bias)

o Non-discrimination (fairness in data usage)

 Important ethical practices in data handling:


o Keep data secure.

o Ensure AI models are fair and unbiased.

o Be transparent about data collection and usage.

o Use data-minimizing technologies (collect only necessary data).

3. Data Privacy
 Data privacy means individuals have control over their personal information.
 Secure storage alone is not enough—users must consent to data collection.
 The individual is the sole owner of their data and can request its deletion.
 Data privacy laws help protect personal information.
Major Data Privacy Laws:
1. GDPR (General Data Protection Regulation - EU)
o Effective from May 25, 2018.
o Gives consumers control over their data.

o Key aspects:

 Consent before collecting data.


 Right to access and delete personal data.
 Notification of data breaches.
 Privacy built into data systems.
2. HIPAA (Health Insurance Portability and Accountability Act - USA)
o Protects healthcare and insurance-related personal data.

o Allows individuals to access and correct their medical records.

o Protects:

 Names, phone numbers, emails.


 Geographical and biometric data.
 Medical records and social security numbers.
3. CCPA (California Consumer Privacy Act - USA)
o Effective from January 1, 2020.

o Protects consumer data of California residents.

o Consumers can:

 Know what data companies have.


 Request data deletion.
 Stop businesses from selling their data.
4. COPPA (Children’s Online Privacy Protection Act - USA)
o Passed in 1998, effective from April 21, 2000.

o Protects data of children under 13.

o Websites must get parental consent before collecting children's data.

5. PDP Bill (Personal Data Protection Bill - India)


o Introduced on December 11, 2019.

o Aims to protect personal data and set up a Data Protection Authority of India.

o Still under review by experts and lawmakers.


ASSIGNMENT
TOPIC

LEGENDS

Class Work | Homework | Challenge Yourself | Strongly recommended to practice | Evaluation

1.  How can data governance frameworks evolve to address the challenges posed by emerging
technologies like Artificial Intelligence and Blockchain, while ensuring compliance with data
privacy laws?
 What are the potential consequences for an organization if it fails to implement a robust data
governance system, especially regarding data privacy and security regulations?

2.  Considering the GDPR and CCPA, how can businesses balance user privacy with their need
for data collection and analysis to drive growth and innovation?

3.  What are the implications of a data breach under different data privacy laws (GDPR, HIPAA,
CCPA) and how can organizations ensure they comply with breach notification regulations
effectively?

4.  In what ways can organizations ensure data integration and interoperability while
maintaining high standards of data security and privacy across different platforms and systems?

5. How do data governance practices vary across industries (e.g., healthcare, e-commerce,
finance), and what unique challenges do these industries face in maintaining data quality and
security?
6. What role does data architecture play in ensuring data availability and consistency across
various data sources, and how can this be managed efficiently in large-scale organizations?

7.  How can ethical principles, such as transparency and accountability, be integrated into AI-driven data systems to minimize bias and discrimination?
 How can organizations develop a data governance strategy that accommodates both the growing volume of data and the increasing demands for transparency and user consent in data management?

8. How can governments and regulatory bodies enforce data privacy laws effectively, especially in
the context of cross-border data flows and multinational companies?
Chapter 2- Exploratory Data Analysis (EDA)
1. Introduction to EDA
EDA is the process of understanding and exploring data before using it for decision-making or building
models. It helps us:
✅ Find patterns in data
✅ Detect unusual data points (anomalies)
✅ Check if our assumptions about data are correct
EDA is mainly done using:
 Summary statistics (like mean, median, range, standard deviation, etc.)
 Graphs & charts (like histograms, scatter plots, and box plots)
💡 Why is EDA important?
 It helps us understand the data before using it for machine learning.
 It improves decision-making by revealing insights.
 It ensures data quality by identifying missing values or incorrect data.

2. Types of EDA
EDA can be classified into three types:
A. Univariate Analysis (One Variable at a Time)
📌 Definition:
Univariate analysis focuses on studying only one variable at a time. It helps us understand how a single
feature behaves.
📌 Example:
If we have a dataset of students, we can study the variable "height" separately.
📌 Methods Used:
1. Statistical Methods:
o Mean – Average value
o Median – Middle value
o Mode – Most frequent value
o Range – Difference between maximum and minimum value
o Standard Deviation – Measures how spread out the data is
2. Graphical Methods:
o Histogram – Shows how often different values appear
o Box Plot – Helps identify outliers (unusual values)
o Bar Chart – Used for categorical data (e.g., favorite color of students)
o Pie Chart – Shows proportion of different categories
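The statistical methods above can be sketched in a few lines of pandas, using a small made-up list of student heights (all values are hypothetical):

```python
import pandas as pd

# A hypothetical sample of student heights in cm (one tall outlier at 190)
heights = pd.Series([150, 152, 155, 155, 158, 160, 162, 190])

mean = heights.mean()                        # average value
median = heights.median()                    # middle value of the sorted data
mode = heights.mode()[0]                     # most frequent value
value_range = heights.max() - heights.min()  # maximum minus minimum
std_dev = heights.std()                      # how spread out the data is
```

For this sample the mean (160.25) sits above the median (156.5) because the single tall outlier pulls the average up, which is exactly the kind of pattern univariate EDA is meant to reveal.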

B. Bivariate Analysis (Two Variables at a Time)


📌 Definition:
Bivariate analysis examines the relationship between two variables.
📌 Example:
 Studying the relationship between price and sales of a product.
 Checking if height and weight are related.
📌 Methods Used:
1. Scatter Plot – Helps see the relationship between two variables.
2. Line Chart – Shows trends over time (e.g., monthly sales).
3. Correlation Analysis – Measures how strongly two variables are related.
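As a sketch of correlation analysis, assuming pandas and a made-up table of study hours versus exam scores:

```python
import pandas as pd

# Hypothetical data: scores rise steadily with study hours
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 78],
})

# Pearson correlation: +1 = perfect positive, -1 = perfect negative, 0 = none
r = df["hours"].corr(df["score"])
```

A value of r close to +1, as here, indicates a strong positive relationship; a scatter plot of the same two columns would show the points lying almost on a straight line.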

C. Multivariate Analysis (More Than Two Variables)


📌 Definition:
Multivariate analysis studies multiple variables together to find patterns.
📌 Example:
 Studying age, income, and spending habits together to understand customer behavior.
 Analyzing how weather, location, and traffic affect fuel consumption.
📌 Methods Used:
1. Pair Plots – Shows relationships between all variables.
2. Cluster Analysis – Groups similar data points together.
3. Principal Component Analysis (PCA) – Reduces the number of variables while keeping important
information.
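A minimal PCA sketch with scikit-learn (assumed available), using a tiny made-up table of age, income, and monthly spend; the features are standardized first so that income's large scale does not dominate the components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual income, monthly spend]
X = np.array([
    [20, 30000, 1200],
    [25, 40000, 1800],
    [30, 52000, 2500],
    [35, 61000, 3100],
    [40, 70000, 3600],
])

X_scaled = StandardScaler().fit_transform(X)  # put all features on one scale
pca = PCA(n_components=2)                     # keep 2 of the 3 dimensions
reduced = pca.fit_transform(X_scaled)
```

Because the three columns move together, the first principal component captures most of the variance, so little information is lost in the reduction.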

3. Data Cleaning
Before analyzing data, we must clean it to remove errors and incorrect values.
📌 Why is Data Cleaning Important?
 Dirty data can lead to wrong conclusions.
 Some models don’t work well if the data has missing values or errors.
📌 Steps to Clean Data:
✔ 1. Remove Duplicate Data
 Example: If a student's record appears twice, we remove the extra entry.
✔ 2. Remove Irrelevant Data
 Example: If we are studying customer purchases, columns like “Customer’s Favorite Color” might
not be useful.
✔ 3. Handle Outliers
 Outliers are extreme values that do not fit the pattern.
 Example: If most people have a salary between ₹20,000 - ₹50,000, but one person has ₹5,00,000,
this might be an outlier.
✔ 4. Fix Data Type Issues
 Example: If a date is stored as text instead of a date format, we need to fix it.
✔ 5. Handle Missing Data
 If data is missing, we can:
o Remove the row (if only a few values are missing).
o Fill in missing values with the mean (for numbers) or most common value (for categories).
 Example: If some students forgot to enter their age, we can fill missing ages with the average age of
the class.
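The cleaning steps above can be sketched with pandas on a small hypothetical student table (all names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Asha", "Asha", "Ravi", "Meena"],  # "Asha" appears twice
    "age":    [17, 17, None, 18],                 # Ravi's age is missing
    "joined": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-03-01"],
})

df = df.drop_duplicates()                        # step 1: remove duplicate rows
df["joined"] = pd.to_datetime(df["joined"])      # step 4: text -> proper date type
df["age"] = df["age"].fillna(df["age"].mean())   # step 5: fill missing age with the mean
```

After cleaning, the duplicate row is gone and Ravi's missing age is filled with the class average (17.5).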

4. Common Graphs Used in EDA


📊 Histogram – Shows the frequency of data points within certain ranges.
📊 Box Plot – Helps in detecting outliers and understanding data distribution.
📊 Scatter Plot – Shows relationships between two numerical variables.
📊 Bar Chart – Used for categorical data like “Favorite Sport” of students.
📊 Pie Chart – Shows proportions like “Percentage of male and female students in a class.”
5. Summary
🔹 EDA is the first step in understanding data before using it for decision-making or machine learning.
🔹 It helps find patterns, relationships, and errors in the data.
🔹 EDA uses statistics and graphs to explore data.
🔹 Data Cleaning is an important part of EDA to ensure accuracy.

ASSIGNMENT
TOPIC

LEGENDS

Class Work | Homework | Challenge Yourself | Strongly recommended to practice | Evaluation

1.  What is bivariate analysis?
 Which type of graph is commonly used to analyze the relationship between two numerical variables?

2.  How would you decide whether to use a histogram or a box plot for analyzing a
variable like "income" in a population survey?
 If two variables (e.g., study hours and exam scores) show a weak correlation in a
scatter plot, what steps would you take to investigate further?
3.  Why might it be dangerous to perform machine learning on data without first doing
EDA? Give real-world examples to justify your answer.
 A company sees a sudden spike in product returns. How can EDA help in identifying
the cause? What kind of plots or analysis would you use?
4.  Create an EDA plan for a new online food delivery startup that wants to understand
customer behavior. Which types of EDA would you perform and why?
 If you are analyzing climate data with variables like temperature, humidity, and
rainfall across cities, how would you apply multivariate analysis to draw insights?

5.  Design a step-by-step data cleaning strategy for a messy dataset collected from social media platforms, including text, dates, and user activity.

6. Imagine your EDA shows some missing values in a crucial column. Would you prefer to
drop the rows or fill in missing values? Justify your decision.

7.  How does a scatter plot differ in use from a line chart when analyzing trends over time?
Which would you use to track a student’s academic progress and why?
 When might a pie chart give misleading information? Can you suggest a better
alternative for the same data and explain your reasoning?

Chapter 3: Introduction to Decision Trees
 A decision tree is a tree-like diagram used for decision-making.
 Internal nodes represent questions, branches show possible outcomes, and leaf nodes indicate final
decisions.
 Decision trees are commonly used in daily life, e.g., deciding how much bread to buy.
 They help classify data based on chosen criteria.
 The Random Forest algorithm improves decision trees by combining multiple trees for better
decisions.
Example of a Decision Tree

                 [Is it raining?]
                  /            \
                Yes             No
                /                 \
    [Take umbrella?]       [Wear sunglasses?]
        /      \               /       \
      Yes      No            Yes       No
       |        |             |         |
   Outcome1  Outcome2     Outcome3  Outcome4
Explanation:
 The root node is the first decision: "Is it raining?"
 Each branch represents a possible answer (Yes/No).
 Each internal node is a decision to make next.
 The leaf nodes (Outcome1, Outcome2, etc.) are the final outcomes based on the decisions made.
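The tree above is just a series of nested questions, so it can be written directly as if/else rules (the outcome labels are the placeholders from the diagram):

```python
def decide(raining: bool, take_umbrella: bool, wear_sunglasses: bool) -> str:
    """Walk the umbrella/sunglasses decision tree from the root to a leaf."""
    if raining:                  # root node: "Is it raining?"
        if take_umbrella:        # internal node on the "Yes" branch
            return "Outcome1"    # leaf node
        return "Outcome2"
    if wear_sunglasses:          # internal node on the "No" branch
        return "Outcome3"
    return "Outcome4"
```

Each call follows exactly one path from the root to a leaf, which is how a decision tree classifies a single data point.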

Random Forest Algorithm


A random forest is like a team of decision trees working together to make better predictions. Instead of
using just one tree, it uses many trees and takes a final decision based on majority voting (for
classification) or averaging (for regression).

Why Use Random Forest?


✅ More Accurate → Combines many trees to reduce mistakes.
✅ Handles Missing Data → Works even if some data is missing.
✅ Avoids Overfitting → Doesn't memorize the training data too much.
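A minimal random-forest sketch with scikit-learn; the pass/fail data below is made up, and 50 trees is an illustrative choice, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical students: [study hours, attendance %]; label 1 = pass, 0 = fail
X = [[1, 40], [2, 50], [3, 60], [6, 85], [7, 90], [8, 95]]
y = [0, 0, 0, 1, 1, 1]

# 50 trees, each trained on a random sample of the rows and features
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Classification: each tree votes, and the majority wins
prediction = forest.predict([[7, 88]])
```

A new student who studies 7 hours with 88% attendance lands among the passing examples, so the majority of trees vote "pass".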

2. Applications of Decision Trees


 Decision trees are widely used in supervised learning for classification and regression.
 They handle non-linear trends better than linear models.
 Easy to visualize, interpret, and understand, even for non-technical users.
 Can process both numerical (like marks) and categorical (like pass, fail, average) data with minimal data cleaning.
 Less affected by outliers and missing values.
 Regression trees are used for continuous data, while classification trees handle categorical data.
 Regression trees predict by calculating the mean, whereas classification trees use the mode of
observations.
Regression Tree
A regression tree is a way to make predictions using a tree-like model. It is used when we need to predict
numbers instead of categories.
How It Works:
1. Start with the data → A table with inputs and the number we want to predict.
2. Find the best split → The tree divides the data into two groups based on a rule.
3. Keep splitting → Each group is split again until the tree stops growing.
4. Make predictions → The final groups (called "leaves") give the predicted number.
Classification Tree
A classification tree is a type of decision tree used to predict categories instead of numbers. It helps in
making decisions like "Yes or No" or "Red, Blue, or Green."
How It Works:
1. Start with the data → A table with inputs and a category to predict.
2. Find the best split → The tree divides the data into two or more groups based on a rule.
3. Keep splitting → Each group is split again until the tree stops growing.
4. Make predictions → The final groups (called "leaves") contain the predicted category.
Summary:
 Classification trees predict categories (like Pass/Fail, Yes/No).
 They split data into groups to make decisions.
 Each group has a predicted category, based on the most common result in that group.
Difference Between Regression Tree and Classification Tree

Feature      | Regression Tree 🌳                        | Classification Tree 🌳
Predicts     | Numbers (e.g., price, temperature, score) | Categories (e.g., Yes/No, Pass/Fail, Color)
Example      | Predicts a student’s exam score           | Predicts if a student will Pass or Fail
Splitting    | Splits to reduce error                    | Splits to make groups more pure and to reach conclusions
Final Output | A number (average of values in a group)   | A category (most common category in a group)
Example to Understand:
1. Regression Tree:
o If study hours = 3, predict 65 marks.
2. Classification Tree:
o If study hours = 3, predict Pass.
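The same example can be run with scikit-learn's two tree classes; the study-hours data below is hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

hours   = [[1], [2], [3], [4], [5]]
scores  = [35, 50, 65, 80, 95]                      # numbers -> regression tree
results = ["Fail", "Fail", "Pass", "Pass", "Pass"]  # categories -> classification tree

reg = DecisionTreeRegressor(random_state=0).fit(hours, scores)
clf = DecisionTreeClassifier(random_state=0).fit(hours, results)

score_pred = reg.predict([[3]])  # a number (mean of the values in the leaf)
label_pred = clf.predict([[3]])  # a category (mode of the values in the leaf)
```

For the same input (3 study hours), the regression tree returns a number while the classification tree returns a category, which is exactly the distinction in the table above.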

3. Creating a Decision Tree


 Real-world example: In the 1970s, Dr. Lee Goldman developed a decision tree for diagnosing heart
attacks in submarines.
 Steps to create a decision tree:
1. Define the main objective (root node).
2. Draw branches and leaf nodes to represent decisions and outcomes.
3. Calculate and assign probabilities to each outcome.
 Decision trees can also be expressed as a set of decision rules for structured decision-making.
Important theory Questions
1. What Is a Decision Tree and How Does It Work?
 Focus: Understand the overall concept and structure.
 Discussion Points:
o The tree-like structure of decision trees with a root, internal nodes (decision points),
branches, and leaf nodes (final outcomes).
o How decision trees use a series of questions to partition the data based on different features,
leading to predictions or classifications.

2. What Are the Key Components of a Decision Tree?


 Focus: Break down the parts of the tree and their roles.
 Discussion Points:
o Root Node: The initial decision point.
o Internal Nodes: Represent tests on features.
o Branches: Outcome paths from each test.
o Leaf Nodes: Final decision or predicted class/value.

3. How Do Decision Trees Split Data, and What Criteria Are Used?
 Focus: The methodology behind the formation of branches.
 Discussion Points:
o Information Gain: How splitting on a feature can reduce uncertainty (entropy) in the
dataset.
o Gini Impurity: A measure of how often a randomly chosen element from the set would be
incorrectly labeled, commonly used in classification tasks.
o How these criteria help determine the “best” split at each internal node.

4. What Is the Difference Between Classification Trees and Regression Trees?


 Focus: Compare trees based on their output type.
 Discussion Points:
o Classification Trees: Predict categorical outcomes (e.g., Yes/No, Pass/Fail).
o Regression Trees: Predict continuous numerical values (e.g., exam scores, prices).
o How the splitting criteria and final predictions differ (using the mode versus the mean).

5. What Is Overfitting in Decision Trees and How Can It Be Mitigated?


 Focus: Learn about common pitfalls and remedies.
 Discussion Points:
o Overfitting: When the tree fits the training data too closely, capturing noise rather than the
underlying trend.
o Mitigation Techniques:
 Pruning: Removing unnecessary branches after the tree is built.
 Setting a Maximum Depth: Limiting how deep the tree can grow.
 Minimum Samples per Leaf: Requiring a minimum number of samples in each
terminal node.

6. How Does Pruning Improve the Performance of a Decision Tree?


 Focus: Understand post-processing steps.
 Discussion Points:
o Purpose of Pruning: To remove parts of the tree that may capture noise rather than signal,
thereby reducing overfitting.
o Techniques:
 Pre-Pruning (Early Stopping): Cease tree growth early based on specific criteria.
 Post-Pruning: Grow the full tree and then remove non-contributing branches.

7. What Is a Random Forest and How Does It Enhance Decision Trees?


 Focus: Understand ensemble methods.
 Discussion Points:

o Concept: A random forest builds multiple decision trees and aggregates their results for a
more robust prediction.

o Benefits:
 Increases prediction accuracy.
 Reduces the risk of overfitting (each tree is trained on a random subset of features
and/or data).
 Enhances model stability.
8. How Does Majority Voting Work in Random Forests (for Classification) and Averaging (for
Regression)?
 Focus: The final prediction mechanism in ensemble learning.
 Discussion Points:
o Majority Voting: For classification tasks, each tree votes for a class label and the label with
the most votes is selected.
o Averaging: In regression tasks, the continuous predictions from all trees are averaged to
produce the final prediction.

9. How Do Decision Trees Handle Missing Data and Outliers?


 Focus: Practical considerations in real-world data processing.
 Discussion Points:
o Missing Data: Many decision tree algorithms can either ignore missing values during
training or assign default directions based on the majority rule.
o Outliers: Decision trees are relatively less affected by outliers since splits are based on
thresholds that mainly consider the majority of data points.

10. What Are Some Real-World Applications of Decision Trees and Random Forests?
 Focus: Understand where and why these algorithms are applied.
 Discussion Points:
o Everyday Decisions: Using decision trees for simple, understandable decision-making
processes (like choosing how much bread to buy).
o Medical Diagnosis: For instance, diagnosing heart attacks or other conditions based on
symptom queries.
o Academic and Business Predictions: Examples include predicting student performance or
market trends using regression trees.
ASSIGNMENT
TOPIC

LEGENDS

Class Work | Homework | Challenge Yourself | Strongly recommended to practice | Evaluation

1.
 Suppose you are designing a decision tree to help students choose between Science,
Commerce, and Humanities. What features (criteria) would you include, and why?
 How would a regression tree help a business predict monthly sales based on advertising
budget and number of staff? Illustrate with a real-world scenario.

2.  You're given a dataset with missing values and outliers. How would a decision tree
handle this differently compared to linear regression?
 If your classification tree gives wrong predictions for new data, what steps would you
take to improve its accuracy?

3.  Between a decision tree and a random forest, which would you choose for a project
predicting student performance and why?
 In what situations might a classification tree be more useful than a regression tree, and
vice versa? Support your answer with examples.
4.  A decision tree model is very accurate on training data but performs poorly on new
data. What is the issue, and how does Random Forest solve it?
 If you were explaining how a random forest works to a non-technical person, what real-
life analogy would you use and why?

5.  Create a basic decision tree to help someone decide whether to take an umbrella when
going out. Include at least three decision criteria.
 Design a plan to build a decision tree model that predicts whether a customer will buy a
product online. Mention the data you’ll collect, how you’ll clean it, and how you’ll
structure the tree.

6. How would you combine decision trees and real-time sensor data (like from fitness bands
or smart watches) to monitor health risks?

7.  Compare the performance of a single decision tree and a random forest on the same
dataset. What insights can you draw from the difference?
 In what cases can a decision tree give misleading results? How can you detect and avoid
such situations?
 Why do decision trees perform better with non-linear data compared to linear models?
Give an example where this difference is significant.
 Decision trees can be interpreted as a set of if-then rules. How does this make them
easier to explain to stakeholders compared to other machine learning models?

Chapter 4: Classification Algorithms – II
1. Introduction
 In the previous chapter, we studied Decision Trees for classification.
 This chapter introduces another powerful classification algorithm called K-Nearest Neighbors (K-
NN).
 We will also learn about cross-validation, a technique used to evaluate the accuracy and
effectiveness of a machine learning model.
2. Introduction to K-Nearest Neighbors (K-NN)
What is K-NN?
 K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm that can be used for
both classification and regression tasks.
 It is mainly used in pattern recognition and data mining.
 The algorithm assumes that similar objects exist close to each other in a feature space.
How Does K-NN Work?
1. The algorithm calculates the distance between the new data point and all the points in the dataset.
2. It selects the ‘K’ nearest points to the new data point.
3. It performs a majority vote on the class labels of the ‘K’ nearest neighbors.
4. The class with the most votes is assigned as the predicted class of the new data point.
Choosing the Right ‘K’ Value
 ‘K’ is the number of neighbors considered while classifying a new data point.
 If K=1, the algorithm simply assigns the class of the closest neighbor.
 If K=2, the algorithm checks the two closest points and takes a majority vote, and so on.
 The choice of K significantly affects accuracy:
o Low K values (e.g., K=1 or K=3) → More sensitive to noise (outliers).
o High K values (e.g., K=15 or more) → More generalized but may lose important details.
 Optimal K is found experimentally, usually in the range K=7 to K=13.
 Odd values of K are preferred to avoid ties in voting.
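A minimal K-NN sketch with scikit-learn on made-up 2-D points; K=3 here is only for illustration, since the chapter's advice is to tune K experimentally:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated hypothetical clusters of points
X = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # odd K avoids ties in voting
knn.fit(X, y)                              # "lazy": this just stores the data

# Prediction: find the 3 nearest stored points and take a majority vote
label = knn.predict([[2, 2]])
```

The point (2, 2) sits inside the first cluster, so all three of its nearest neighbors are class "A" and the vote is unanimous.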
Characteristics of K-NN
1. Lazy Learning – No explicit training phase; all computation happens at the prediction stage.
2. Non-Parametric – Makes no assumptions about the data distribution, unlike algorithms such as
linear regression.
3. Pros and Cons of K-NN
Advantages of K-NN
✔ Simple and Easy to Implement – Requires minimal setup and is easy to understand.
✔ No Training Step – Unlike many machine learning algorithms, K-NN does not learn a model but directly
uses training data for classification.
✔ Works Well for Multi-Class Classification – Can handle more than two classes without modifications.
✔ Can be Used for Both Classification and Regression – Unlike some models designed only for
classification or regression.
✔ No Assumptions About Data – Unlike parametric models, K-NN does not require the data to follow a
specific distribution.
Disadvantages of K-NN
✘ Computationally Expensive – K-NN must compute distances for all training data points at prediction
time, making it slow for large datasets.
✘ High Memory Usage – The entire dataset must be stored in memory for classification.
✘ Sensitive to Imbalanced Data – If one class dominates the dataset, K-NN may incorrectly classify points
into the majority class.
✘ Affected by Outliers – Distance-based classification means a single outlier can heavily impact results.
✘ Curse of Dimensionality – Performance decreases as the number of features increases because the notion
of "distance" becomes less meaningful in high-dimensional space.
✘ Depends on Distance Metrics – Works only if a meaningful way to measure distances (e.g., Euclidean,
Manhattan, Minkowski) exists.
4. Cross-Validation
What is Cross-Validation?
 Cross-validation is a technique used to evaluate a machine learning model by splitting the dataset
into training and testing parts multiple times.
 It helps assess how well a model generalizes to unseen data.
 Instead of using a single train-test split, cross-validation ensures that every data point is used for both
training and testing.
Steps of Cross-Validation
1. Reserve a portion of the dataset for validation (testing).
2. Train the model on the remaining dataset.
3. Evaluate the model using the validation set and measure accuracy.
K-Fold Cross-Validation
 One common technique is K-Fold Cross-Validation, where the dataset is divided into K equal
parts (folds).
 The model is trained on K-1 folds and tested on the remaining one.
 The process repeats K times, each time using a different fold for testing.
 The final accuracy is the average of all K iterations.
Example: 5-Fold Cross-Validation
 Data is divided into 5 subsets.
 The model is trained on 4 subsets and tested on the remaining 1.
 This process is repeated 5 times, ensuring each fold serves as a test set once.
 Common choice: K = 10 is widely used in applied machine learning.
Using Cross-Validation to Tune ‘K’ in K-NN
 Cross-validation can also be used to determine the optimal value of K in K-NN.
 The model is tested with different K values, and the one that gives the best accuracy is chosen.
 Typically, the accuracy is highest for K = 7 to 13 and declines as K increases beyond this range.
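A sketch of using cross-validation to compare K values, assuming scikit-learn; the one-feature dataset is made up, and cv=3 is chosen only because the toy dataset is tiny:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature data with two well-separated classes
X = [[1], [2], [3], [4], [10], [11], [12], [13], [14]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

results = {}
for k in (1, 3, 5):
    # 3-fold CV: train on 2 folds, test on the third, repeat 3 times
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=3)
    results[k] = scores.mean()          # average accuracy over the folds

best_k = max(results, key=results.get)  # keep the K with the best mean accuracy
```

On this toy data a larger K starts pulling in neighbors from the other class, so its cross-validated accuracy drops; that is the over-generalization effect the chapter describes for overly high K values.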

Summary
🔹 K-NN is a simple, non-parametric, lazy learning algorithm used mainly for classification.
🔹 It classifies a new data point based on the majority class of its K nearest neighbors.
🔹 The value of ‘K’ is crucial and is determined experimentally.
🔹 K-NN is memory-intensive and sensitive to outliers but works well for multi-class problems.
🔹 Cross-validation improves model evaluation by using multiple train-test splits.
🔹 K-Fold Cross-Validation (K=10) is commonly used to avoid overfitting and ensure reliable accuracy
estimates.
🔹 Cross-validation helps determine the best value of K in the K-NN algorithm.
MCQs on K-Nearest Neighbors (K-NN) and Cross-Validation
1. What type of machine learning algorithm is K-Nearest Neighbors (K-NN)?
A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Semi-supervised Learning
✅ Answer: A) Supervised Learning

2. What is the primary use of the K-NN algorithm?


A) Clustering
B) Regression only
C) Classification and Regression
D) Reinforcement Learning
✅ Answer: C) Classification and Regression

3. What is the main assumption behind the K-NN algorithm?


A) Data is normally distributed
B) Similar objects exist close to each other in feature space
C) Data must be linearly separable
D) All features are equally important
✅ Answer: B) Similar objects exist close to each other in feature space

4. What does ‘K’ represent in K-NN?


A) Number of input features
B) Number of training samples
C) Number of nearest neighbors considered for classification
D) Number of decision trees
✅ Answer: C) Number of nearest neighbors considered for classification

5. How does K-NN determine the class of a new data point?


A) By computing the sum of all feature values
B) By using a predefined rule set
C) By identifying K closest points and taking a majority vote
D) By training a deep neural network
✅ Answer: C) By identifying K closest points and taking a majority vote

6. Which distance metric is commonly used in K-NN?


A) Manhattan Distance
B) Euclidean Distance
C) Minkowski Distance
D) All of the above
✅ Answer: D) All of the above

7. What happens if we choose K=1 in K-NN?


A) The algorithm only considers the closest data point for classification
B) It takes a majority vote among all data points
C) It will ignore the input data
D) The accuracy will always be 100%
✅ Answer: A) The algorithm only considers the closest data point for classification

8. What is a disadvantage of using a very high value for K?


A) The model becomes too sensitive to noise
B) The decision boundary becomes too complex
C) The model becomes more generalized and loses details
D) The model does not work for classification problems
✅ Answer: C) The model becomes more generalized and loses details
9. Why are odd values of K preferred in K-NN?
A) They reduce the complexity of calculations
B) They prevent ties in majority voting
C) They are easier to implement in Python
D) They make K-NN work faster
✅ Answer: B) They prevent ties in majority voting

10. What is the major drawback of K-NN?


A) It cannot be used for classification problems
B) It requires a long training phase
C) It is computationally expensive and memory inefficient
D) It cannot handle missing values
✅ Answer: C) It is computationally expensive and memory inefficient

11. What is the curse of dimensionality in K-NN?


A) More dimensions lead to better accuracy
B) Increasing the number of features can reduce the effectiveness of distance calculations
C) K-NN cannot handle high-dimensional data
D) It always causes overfitting
✅ Answer: B) Increasing the number of features can reduce the effectiveness of distance calculations

12. Which of the following is NOT an advantage of K-NN?


A) Simple and easy to implement
B) Works well with high-dimensional data
C) Can be used for multi-class classification
D) No training step required
✅ Answer: B) Works well with high-dimensional data

13. How does K-NN handle outliers?


A) K-NN is highly sensitive to outliers
B) K-NN ignores outliers automatically
C) K-NN removes outliers before classification
D) K-NN works better with outliers
✅ Answer: A) K-NN is highly sensitive to outliers

14. What does "lazy learning" mean in K-NN?


A) The algorithm does not use training data
B) The algorithm does not have a specific training phase and performs computations during prediction
C) The algorithm ignores feature importance
D) The algorithm does not work well with small datasets
✅ Answer: B) The algorithm does not have a specific training phase and performs computations during
prediction

15. What type of algorithm is K-NN in terms of parameter assumptions?


A) Parametric
B) Non-parametric
C) Semi-parametric
D) Rule-based
✅ Answer: B) Non-parametric

MCQs on Cross-Validation
16. What is the purpose of cross-validation?
A) To reduce training time
B) To improve accuracy on training data
C) To evaluate how well a model generalizes to new data
D) To increase bias in the model
✅ Answer: C) To evaluate how well a model generalizes to new data

17. What is the main concept behind K-Fold Cross-Validation?


A) The dataset is divided into K folds, and the model is tested on each fold
B) The model is trained multiple times on the same dataset
C) The dataset is split into two parts: training and validation
D) K-NN is used to cross-check results
✅ Answer: A) The dataset is divided into K folds, and the model is tested on each fold
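The splitting idea in Q17 can be illustrated with index lists only (no model attached): each of the K folds serves once as the test set while the remaining folds form the training set. A schematic sketch, assuming samples are identified by position:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.
    For simplicity, any remainder samples (when n_samples % k != 0)
    are left out of the test folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so the whole dataset contributes to evaluation while each evaluation still uses unseen data.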

18. What is the commonly used value of ‘K’ in K-Fold Cross-Validation?


A) 5
B) 10
C) 15
D) 20
✅ Answer: B) 10

19. How does cross-validation help in choosing the best value of K for K-NN?
A) By testing different K values and selecting the one with the best accuracy
B) By ignoring outliers in the dataset
C) By normalizing the dataset before classification
D) By increasing the number of neighbors
✅ Answer: A) By testing different K values and selecting the one with the best accuracy
20. What happens if we use too many folds in K-Fold Cross-Validation?
A) The training set becomes too small, leading to high variance
B) The model accuracy always improves
C) The model never overfits
D) The dataset size increases
✅ Answer: A) The training set becomes too small, leading to high variance
Short Answer Questions on K-NN
1. What type of machine learning algorithm is K-NN?
o K-NN is a supervised learning algorithm.
2. What are the two main applications of K-NN?
o K-NN can be used for classification and regression problems.
3. What is the basic assumption behind K-NN?
o The assumption is that similar data points exist close to each other in feature space.
4. What does ‘K’ represent in the K-NN algorithm?
o ‘K’ represents the number of nearest neighbors considered for classification.
5. How does K-NN classify a new data point?
o It identifies K closest points and assigns the most common class among them.
6. What are the commonly used distance metrics in K-NN?
o Euclidean distance, Manhattan distance, and Minkowski distance.
7. What happens when K=1 in K-NN?
o The algorithm assigns the class of the single nearest data point.
8. Why are odd values of K preferred in K-NN?
o Odd values prevent ties in majority voting.
9. What is the major drawback of K-NN?
o It is computationally expensive and requires high memory.
10. What is the ‘curse of dimensionality’ in K-NN?
o It means that as the number of features increases, the effectiveness of distance calculations decreases.
11. How does K-NN handle outliers?
o K-NN is highly sensitive to outliers as they can affect distance calculations.
12. Why is K-NN called a "lazy learning" algorithm?
o It does not have a training phase and performs computations during prediction.
13. Is K-NN a parametric or non-parametric algorithm? Why?
o It is non-parametric because it does not assume any specific distribution of the data.
Short Answer Questions on Cross-Validation
14. What is the purpose of cross-validation?
o It helps to evaluate how well a model generalizes to new data.
15. How does K-Fold Cross-Validation work?
o It splits the dataset into K folds and tests the model on each fold one by one.
16. What is the commonly used value of ‘K’ in K-Fold Cross-Validation?
o K = 10 is a commonly used value.
17. How does cross-validation help in choosing the best K for K-NN?
o By testing different K values and selecting the one with the best accuracy.
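This selection procedure can be made concrete with a self-contained sketch that combines a tiny K-NN predictor with K-Fold splits and scores several candidate K values (the toy data and helper names are illustrative; in practice a library routine such as scikit-learn's cross_val_score does this work):

```python
from collections import Counter
import math

def knn_predict(train, point, k):
    """Majority vote among the k nearest training points."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], point))
    return Counter(lbl for _, lbl in by_dist[:k]).most_common(1)[0][0]

def cv_accuracy(data, k, folds=5):
    """Average K-NN accuracy over `folds` cross-validation folds."""
    fold_size = len(data) // folds
    correct = total = 0
    for i in range(folds):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        for point, label in test:
            correct += knn_predict(train, point, k) == label
            total += 1
    return correct / total

# Toy data: two well-separated clusters
data = [((x, y), "A") for x in (0, 1, 2) for y in (0, 1, 2)]
data += [((x, y), "B") for x in (8, 9, 10) for y in (8, 9, 10)]

# Score each candidate K and keep the most accurate one
best_k = max([1, 3, 5], key=lambda k: cv_accuracy(data, k))
```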
18. What happens if we use too many folds in K-Fold Cross-Validation?
o The training set becomes too small, leading to high variance.
19. Why is cross-validation important in machine learning?
o It helps to reduce overfitting and ensures the model performs well on unseen data.
20. What is the key difference between K-NN and cross-validation?
o K-NN is a classification algorithm, whereas cross-validation is a technique to evaluate model performance.
Difference between K-NN and Decision Tree
 K-NN is a lazy learner, meaning it does not build a model but makes predictions on the fly.
 Decision Trees actively learn patterns from data during training and use a structured tree for
predictions.
 K-NN is memory-intensive and performs better with small datasets, whereas Decision Trees scale well
with larger datasets.
ASSIGNMENT
TOPIC

LEGENDS

Class Work | Homework | Challenge Yourself | Strongly recommended to practice | Evaluation

1.

2.

3.

4.

5.
6.

7.

8.
Chapter 5: Communication Skills
🔹 1. What is Communication?
 Definition: Exchange of information, ideas, thoughts, or feelings.
 Types:
o Verbal – Spoken or written (e.g., presentations, emails)
o Non-verbal – Body language, gestures, facial expressions
o Visual – Charts, graphs, images used to convey data

🔹 2. Importance of Communication in Data Science


 Explains complex data insights in simple terms
 Builds trust with stakeholders
 Helps in decision-making
 Bridges the gap between data experts and business teams

🔹 3. Key Elements of Communication


1. Sender – The one who conveys the message
2. Message – The information or idea
3. Medium – Channel used (e.g., email, report, graph)
4. Receiver – The person who gets the message
5. Feedback – Response from the receiver

🔹 4. Barriers to Effective Communication


 Technical jargon
 Poor data visualization
 Misinterpretation of charts
 Lack of context or background
 Cultural/language differences
🔹 5. Data Storytelling
 Combines data + narrative + visuals
 Helps make data insights understandable
 Example: Instead of just showing numbers, tell a story (e.g., "Sales increased after new ad
campaign")

🔹 6. Visual Communication Tools


 Bar Charts – Compare categories
 Line Graphs – Show trends over time
 Pie Charts – Show parts of a whole
 Histograms – Show frequency distribution
 Scatter Plots – Show relationship between two variables
 Dashboards – Combine multiple visual elements

🔹 7. Tips for Effective Communication in Data Science


✅ Know your audience
✅ Avoid technical terms if unnecessary
✅ Use simple and clear visuals
✅ Highlight key insights
✅ Tell a story with your data
✅ Practice active listening and give clear responses

🔹 8. Tools for Communication in Data Science


 Presentation Tools: PowerPoint, Google Slides
 Reporting Tools: Excel, Tableau, Power BI
 Communication Platforms: Email, Slack, Zoom
 Documentation: Jupyter Notebooks, PDFs
ASSIGNMENT
TOPIC

LEGENDS

Class Work | Homework | Challenge Yourself | Strongly recommended to practice | Evaluation

1.
You are asked to present your analysis of a customer satisfaction survey to a non-technical
audience. How would you ensure they understand your insights?

2.  A pie chart shows that 45% of customers prefer product A, 35% prefer B, and 20% prefer C. How can you use storytelling to explain this visually and meaningfully?
 Your team uses highly technical terms in a report. The business stakeholders don't understand it. What changes would you suggest to improve communication?

3. Why do you think visual communication is often more effective than verbal
communication when presenting large datasets?

4. Imagine you are explaining a complex correlation graph to a school principal with no
background in data science. How would you break down the information?

5. Your manager says, “We don’t need visualizations; just give me the numbers.” Do you
agree or disagree? Justify your viewpoint with examples.
6.  Create a mini-presentation outline using storytelling, visuals, and key takeaways to explain the increase in student attendance over the last 6 months.
 Design a dashboard layout to present school performance data (marks, attendance, activities). What visual elements would you include and why?
 During a project review, your graph is misinterpreted by the team, leading to a wrong decision. What could be the reason, and how would you prevent this in future?
7. You have conflicting feedback: one person wants detailed numbers, another wants only
charts. How would you balance both needs in your report?

