AW Term1 - DS
Data Science
Grade 12
S.No.  Topic
1. Data Governance
2. Exploratory Data Analysis
3. Decision Tree
4. Classification Algorithms - II
5. Communication Skills
Chapter 1: Data Governance
2. Ethical Guidelines
Ethics refers to moral principles that guide behavior.
Data and software should be used responsibly for the benefit of society.
Key ethical values to follow:
o Integrity (honesty in handling data)
3. Data Privacy
Data privacy means individuals have control over their personal information.
Secure storage alone is not enough—users must consent to data collection.
The individual is the sole owner of their data and can request its deletion.
Data privacy laws help protect personal information.
Major Data Privacy Laws:
1. GDPR (General Data Protection Regulation - EU)
o Effective from May 25, 2018.
o Gives consumers control over their data.
o Key aspects:
o Protects:
o Consumers can:
2. Personal Data Protection Bill (India)
o Aims to protect personal data and set up a Data Protection Authority of India.
LEGENDS
1. How can data governance frameworks evolve to address the challenges posed by emerging
technologies like Artificial Intelligence and Blockchain, while ensuring compliance with data
privacy laws?
What are the potential consequences for an organization if it fails to implement a robust data
governance system, especially regarding data privacy and security regulations?
2. Considering the GDPR and CCPA, how can businesses balance user privacy with their need
for data collection and analysis to drive growth and innovation?
3. What are the implications of a data breach under different data privacy laws (GDPR, HIPAA,
CCPA) and how can organizations ensure they comply with breach notification regulations
effectively?
4. In what ways can organizations ensure data integration and interoperability while
maintaining high standards of data security and privacy across different platforms and systems?
5. How do data governance practices vary across industries (e.g., healthcare, e-commerce,
finance), and what unique challenges do these industries face in maintaining data quality and
security?
6. What role does data architecture play in ensuring data availability and consistency across
various data sources, and how can this be managed efficiently in large-scale organizations?
7. How can ethical principles, such as transparency and accountability, be integrated into AI-driven data systems to minimize bias and discrimination?
How can organizations develop a data governance strategy that accommodates both the growing
volume of data and the increasing demands for transparency and user consent in data
management?
8. How can governments and regulatory bodies enforce data privacy laws effectively, especially in
the context of cross-border data flows and multinational companies?
Chapter 2: Exploratory Data Analysis (EDA)
1. Introduction to EDA
EDA is the process of understanding and exploring data before using it for decision-making or building
models. It helps us:
✅ Find patterns in data
✅ Detect unusual data points (anomalies)
✅ Check if our assumptions about data are correct
EDA is mainly done using:
Summary statistics (like mean, median, range, standard deviation, etc.)
Graphs & charts (like histograms, scatter plots, and box plots)
💡 Why is EDA important?
It helps us understand the data before using it for machine learning.
It improves decision-making by revealing insights.
It ensures data quality by identifying missing values or incorrect data.
2. Types of EDA
EDA can be classified into three types:
A. Univariate Analysis (One Variable at a Time)
📌 Definition:
Univariate analysis focuses on studying only one variable at a time. It helps us understand how a single
feature behaves.
📌 Example:
If we have a dataset of students, we can study the variable "height" separately.
📌 Methods Used:
1. Statistical Methods:
o Mean – Average value
o Median – Middle value
o Mode – Most frequent value
o Range – Difference between maximum and minimum value
o Standard Deviation – Measures how spread out the data is
2. Graphical Methods:
o Histogram – Shows how often different values appear
o Box Plot – Helps identify outliers (unusual values)
o Bar Chart – Used for categorical data (e.g., favorite color of students)
o Pie Chart – Shows proportion of different categories
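To make the univariate methods above concrete, here is a minimal Python sketch. It assumes pandas and matplotlib are installed, and a hypothetical students.csv file with a numeric "height" column (both names are made up for illustration).

```python
# Univariate analysis sketch: summary statistics and basic plots for one variable.
# Assumes a hypothetical students.csv with a numeric "height" column.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("students.csv")
heights = df["height"]

# Statistical methods
print("Mean:", heights.mean())
print("Median:", heights.median())
print("Mode:", heights.mode().iloc[0])
print("Range:", heights.max() - heights.min())
print("Standard deviation:", heights.std())

# Graphical methods
heights.plot(kind="hist", title="Histogram of heights")
plt.show()
heights.plot(kind="box", title="Box plot of heights")
plt.show()
```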
3. Data Cleaning
Before analyzing data, we must clean it to remove errors and incorrect values.
📌 Why is Data Cleaning Important?
Dirty data can lead to wrong conclusions.
Some models don’t work well if the data has missing values or errors.
📌 Steps to Clean Data:
✔ 1. Remove Duplicate Data
Example: If a student's record appears twice, we remove the extra entry.
✔ 2. Remove Irrelevant Data
Example: If we are studying customer purchases, columns like “Customer’s Favorite Color” might
not be useful.
✔ 3. Handle Outliers
Outliers are extreme values that do not fit the pattern.
Example: If most people have a salary between ₹20,000 - ₹50,000, but one person has ₹5,00,000,
this might be an outlier.
✔ 4. Fix Data Type Issues
Example: If a date is stored as text instead of a date format, we need to fix it.
✔ 5. Handle Missing Data
If data is missing, we can:
o Remove the row (if only a few values are missing).
o Fill in missing values with the mean (for numbers) or most common value (for categories).
Example: If some students forgot to enter their age, we can fill missing ages with the average age of
the class.
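A minimal pandas sketch of these cleaning steps. The file name and column names (salary, date, age, city, favourite_colour) are hypothetical and only illustrate the idea.

```python
# Data-cleaning sketch with pandas (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("raw_data.csv")

# 1. Remove duplicate rows
df = df.drop_duplicates()

# 2. Remove an irrelevant column, if it exists
df = df.drop(columns=["favourite_colour"], errors="ignore")

# 3. Flag outliers in "salary" using the IQR rule
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print("Possible outliers:", len(outliers))

# 4. Fix a data type issue: parse dates stored as text
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# 5. Handle missing values: mean for numbers, most common value for categories
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])
```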
ASSIGNMENT
TOPIC
LEGENDS
1. What is bivariate analysis?
Which type of graph is commonly used to analyze the relationship between two numerical
variables?
2. How would you decide whether to use a histogram or a box plot for analyzing a
variable like "income" in a population survey?
If two variables (e.g., study hours and exam scores) show a weak correlation in a
scatter plot, what steps would you take to investigate further?
3. Why might it be dangerous to perform machine learning on data without first doing
EDA? Give real-world examples to justify your answer.
A company sees a sudden spike in product returns. How can EDA help in identifying
the cause? What kind of plots or analysis would you use?
4. Create an EDA plan for a new online food delivery startup that wants to understand
customer behavior. Which types of EDA would you perform and why?
If you are analyzing climate data with variables like temperature, humidity, and
rainfall across cities, how would you apply multivariate analysis to draw insights?
5. If you are analyzing climate data with variables like temperature, humidity, and
rainfall across cities, how would you apply multivariate analysis to draw insights?
Design a step-by-step data cleaning strategy for a messy dataset collected from social
media platforms, including text, dates, and user activity.
6. Imagine your EDA shows some missing values in a crucial column. Would you prefer to
drop the rows or fill in missing values? Justify your decision.
7. How does a scatter plot differ in use from a line chart when analyzing trends over time?
Which would you use to track a student’s academic progress and why?
When might a pie chart give misleading information? Can you suggest a better
alternative for the same data and explain your reasoning?
8. How does a scatter plot differ in use from a line chart when analyzing trends over time?
Which would you use to track a student’s academic progress and why?
When might a pie chart give misleading information? Can you suggest a better
alternative for the same data and explain your reasoning?
Chapter 3: Introduction to Decision Trees
A decision tree is a tree-like diagram used for decision-making.
Internal nodes represent questions, branches show possible outcomes, and leaf nodes indicate final
decisions.
Decision trees are commonly used in daily life, e.g., deciding how much bread to buy.
They help classify data based on chosen criteria.
The Random Forest algorithm improves decision trees by combining multiple trees for better
decisions.
Example of a Decision Tree

                 [Is it raining?]
                  /            \
                Yes             No
                /                 \
      [Take umbrella?]     [Wear sunglasses?]
         /       \            /        \
       Yes        No        Yes         No
        |          |         |           |
    Outcome1   Outcome2   Outcome3   Outcome4
Explanation:
The root node is the first decision: "Is it raining?"
Each branch represents a possible answer (Yes/No).
Each internal node is a decision to make next.
The leaf nodes (Outcome1, Outcome2, etc.) are the final outcomes based on the decisions made.
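Because every path through the tree is just a sequence of yes/no questions, the example above can also be written as nested if/else rules. A minimal sketch (the outcome labels simply mirror the diagram):

```python
# The example tree above expressed as nested if/else rules.
def decide(is_raining: bool, take_umbrella: bool, wear_sunglasses: bool) -> str:
    if is_raining:                      # root node
        if take_umbrella:               # internal node on the "Yes" branch
            return "Outcome1"
        return "Outcome2"
    else:
        if wear_sunglasses:             # internal node on the "No" branch
            return "Outcome3"
        return "Outcome4"

print(decide(is_raining=True, take_umbrella=True, wear_sunglasses=False))  # Outcome1
```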
Example: A regression tree predicts a numeric value, such as a student's exam score, while a classification tree predicts a category, such as whether a student will Pass or Fail.
3. How Do Decision Trees Split Data, and What Criteria Are Used?
Focus: The methodology behind the formation of branches.
Discussion Points:
o Information Gain: How splitting on a feature can reduce uncertainty (entropy) in the
dataset.
o Gini Impurity: A measure of how often a randomly chosen element from the set would be
incorrectly labeled, commonly used in classification tasks.
o How these criteria help determine the “best” split at each internal node.
o Concept: A random forest builds multiple decision trees and aggregates their results for a
more robust prediction.
o Benefits:
Increases prediction accuracy.
Reduces the risk of overfitting (each tree is trained on a random subset of features
and/or data).
Enhances model stability.
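A minimal sketch of the split criteria discussed above, computing entropy, Gini impurity, and the information gain of one candidate split. The class labels are made up for illustration.

```python
# Entropy, Gini impurity, and information gain for a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical class labels before and after one candidate split
parent = ["Pass", "Pass", "Pass", "Fail", "Fail", "Fail"]
left   = ["Pass", "Pass", "Pass"]      # rows sent to the left branch
right  = ["Fail", "Fail", "Fail"]      # rows sent to the right branch

# Information gain = parent entropy minus the weighted entropy of the children
weighted = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted

print("Parent Gini impurity:", gini(parent))   # 0.5
print("Information gain:", info_gain)          # 1.0 for this perfect split
```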
8. How Does Majority Voting Work in Random Forests (for Classification) and Averaging (for
Regression)?
Focus: The final prediction mechanism in ensemble learning.
Discussion Points:
o Majority Voting: For classification tasks, each tree votes for a class label and the label with
the most votes is selected.
o Averaging: In regression tasks, the continuous predictions from all trees are averaged to
produce the final prediction.
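A minimal scikit-learn sketch of this aggregation step. It assumes scikit-learn is installed; the dataset is synthetic and only illustrates the idea.

```python
# Aggregating many trees in a random forest (classification shown; regression averages).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class dataset, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Classification: the 100 trees' predictions are combined and the winning class is returned
print("Combined class prediction:", forest.predict(X[:1]))
# Class probabilities averaged across the trees (scikit-learn uses soft voting)
print("Averaged class probabilities:", forest.predict_proba(X[:1]))

# For regression, RandomForestRegressor averages the trees' numeric predictions instead.
```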
10. What Are Some Real-World Applications of Decision Trees and Random Forests?
Focus: Understand where and why these algorithms are applied.
Discussion Points:
o Everyday Decisions: Using decision trees for simple, understandable decision-making
processes (like choosing how much bread to buy).
o Medical Diagnosis: For instance, diagnosing heart attacks or other conditions based on
symptom queries.
o Academic and Business Predictions: Examples include predicting student performance or
market trends using regression trees.
ASSIGNMENT
TOPIC
LEGENDS
1. Suppose you are designing a decision tree to help students choose between Science,
Commerce, and Humanities. What features (criteria) would you include, and why?
How would a regression tree help a business predict monthly sales based on advertising
budget and number of staff? Illustrate with a real-world scenario.
2. You're given a dataset with missing values and outliers. How would a decision tree
handle this differently compared to linear regression?
If your classification tree gives wrong predictions for new data, what steps would you
take to improve its accuracy?
3. Between a decision tree and a random forest, which would you choose for a project
predicting student performance and why?
In what situations might a classification tree be more useful than a regression tree, and
vice versa? Support your answer with examples.
4. A decision tree model is very accurate on training data but performs poorly on new
data. What is the issue, and how does Random Forest solve it?
If you were explaining how a random forest works to a non-technical person, what real-life analogy would you use and why?
5. Create a basic decision tree to help someone decide whether to take an umbrella when
going out. Include at least three decision criteria.
Design a plan to build a decision tree model that predicts whether a customer will buy a
product online. Mention the data you’ll collect, how you’ll clean it, and how you’ll
structure the tree.
6. How would you combine decision trees and real-time sensor data (like from fitness bands
or smart watches) to monitor health risks?
7. Compare the performance of a single decision tree and a random forest on the same
dataset. What insights can you draw from the difference?
In what cases can a decision tree give misleading results? How can you detect and avoid
such situations?
Why do decision trees perform better with non-linear data compared to linear models?
Give an example where this difference is significant.
Decision trees can be interpreted as a set of if-then rules. How does this make them
easier to explain to stakeholders compared to other machine learning models?
8. Why do decision trees perform better with non-linear data compared to linear models?
Give an example where this difference is significant.
Decision trees can be interpreted as a set of if-then rules. How does this make them
easier to explain to stakeholders compared to other machine learning models?
Chapter 4: Classification Algorithms – II
Introduction
In the previous chapter, we studied Decision Trees for classification.
This chapter introduces another powerful classification algorithm called K-Nearest Neighbors (K-NN).
We will also learn about cross-validation, a technique used to evaluate the accuracy and
effectiveness of a machine learning model.
2. Introduction to K-Nearest Neighbors (K-NN)
What is K-NN?
K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm that can be used for
both classification and regression tasks.
It is mainly used in pattern recognition and data mining.
The algorithm assumes that similar objects exist close to each other in a feature space.
How Does K-NN Work?
1. The algorithm calculates the distance between the new data point and all the points in the dataset.
2. It selects the ‘K’ nearest points to the new data point.
3. It performs a majority vote on the class labels of the ‘K’ nearest neighbors.
4. The class with the most votes is assigned as the predicted class of the new data point.
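The four steps can be written as a small toy implementation in Python. This is for illustration only; the training points and labels are made up.

```python
# Toy K-NN classifier following the four steps above (illustrative only).
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

# Hypothetical training data: (feature vector, class label)
training_data = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"),
                 ((5.0, 8.0), "B"), ((6.0, 9.0), "B"), ((5.5, 7.5), "B")]

def knn_predict(new_point, k=3):
    # Step 1: distance from the new point to every training point
    distances = [(dist(new_point, point), label) for point, label in training_data]
    # Step 2: keep the K nearest neighbours
    nearest = sorted(distances)[:k]
    # Steps 3-4: majority vote on their class labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.9), k=3))   # "A" – its nearest neighbours are mostly class A
```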
Choosing the Right ‘K’ Value
‘K’ is the number of neighbors considered while classifying a new data point.
If K=1, the algorithm simply assigns the class of the closest neighbor.
If K=3, the algorithm checks the three closest points and takes a majority vote, and so on.
The choice of K significantly affects accuracy:
o Low K values (e.g., K=1 or K=3) → More sensitive to noise (outliers).
o High K values (e.g., K=15 or more) → More generalized but may lose important details.
Optimal K is found experimentally, usually in the range K=7 to K=13.
Odd values of K are preferred to avoid ties in voting.
Characteristics of K-NN
1. Lazy Learning – No explicit training phase; all computation happens at the prediction stage.
2. Non-Parametric – Makes no assumptions about the data distribution, unlike algorithms such as
linear regression.
3. Pros and Cons of K-NN
Advantages of K-NN
✔ Simple and Easy to Implement – Requires minimal setup and is easy to understand.
✔ No Training Step – Unlike many machine learning algorithms, K-NN does not learn a model but directly
uses training data for classification.
✔ Works Well for Multi-Class Classification – Can handle more than two classes without modifications.
✔ Can be Used for Both Classification and Regression – Unlike some models designed only for
classification or regression.
✔ No Assumptions About Data – Unlike parametric models, K-NN does not require the data to follow a
specific distribution.
Disadvantages of K-NN
✘ Computationally Expensive – K-NN must compute distances for all training data points at prediction
time, making it slow for large datasets.
✘ High Memory Usage – The entire dataset must be stored in memory for classification.
✘ Sensitive to Imbalanced Data – If one class dominates the dataset, K-NN may incorrectly classify points
into the majority class.
✘ Affected by Outliers – Distance-based classification means a single outlier can heavily impact results.
✘ Curse of Dimensionality – Performance decreases as the number of features increases because the notion
of "distance" becomes less meaningful in high-dimensional space.
✘ Depends on Distance Metrics – Works only if a meaningful way to measure distances (e.g., Euclidean,
Manhattan, Minkowski) exists.
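The three distance metrics named above differ only in the exponent used to combine coordinate differences; a minimal sketch with made-up feature vectors:

```python
# Manhattan, Euclidean, and general Minkowski distance between two vectors.
def minkowski(a, b, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)   # made-up feature vectors

print("Manhattan (p=1):", minkowski(a, b, 1))   # 7.0
print("Euclidean (p=2):", minkowski(a, b, 2))   # 5.0
print("Minkowski (p=3):", minkowski(a, b, 3))   # ~4.5
```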
4. Cross-Validation
What is Cross-Validation?
Cross-validation is a technique used to evaluate a machine learning model by splitting the dataset
into training and testing parts multiple times.
It helps assess how well a model generalizes to unseen data.
Instead of using a single train-test split, cross-validation ensures that every data point is used for both
training and testing.
Steps of Cross-Validation
1. Reserve a portion of the dataset for validation (testing).
2. Train the model on the remaining dataset.
3. Evaluate the model using the validation set and measure accuracy.
K-Fold Cross-Validation
One common technique is K-Fold Cross-Validation, where the dataset is divided into K equal
parts (folds).
The model is trained on K-1 folds and tested on the remaining one.
The process repeats K times, each time using a different fold for testing.
The final accuracy is the average of all K iterations.
Example: 5-Fold Cross-Validation
Data is divided into 5 subsets.
The model is trained on 4 subsets and tested on the remaining 1.
This process is repeated 5 times, ensuring each fold serves as a test set once.
Common choice: K = 10 is widely used in applied machine learning.
Using Cross-Validation to Tune ‘K’ in K-NN
Cross-validation can also be used to determine the optimal value of K in K-NN.
The model is tested with different K values, and the one that gives the best accuracy is chosen.
Typically, the accuracy is highest for K = 7 to 13 and declines as K increases beyond this range.
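A minimal scikit-learn sketch of tuning K with 10-fold cross-validation. Scikit-learn is assumed to be installed; the built-in Iris dataset is used only as a convenient stand-in.

```python
# Choosing K for K-NN with 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores_by_k = {}
for k in range(1, 16, 2):                       # odd K values to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10)  # 10-fold cross-validation
    scores_by_k[k] = scores.mean()

best_k = max(scores_by_k, key=scores_by_k.get)
print("Mean accuracy for each K:", scores_by_k)
print("Best K by cross-validation:", best_k)
```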
Summary
🔹 K-NN is a simple, non-parametric, lazy learning algorithm used mainly for classification.
🔹 It classifies a new data point based on the majority class of its K nearest neighbors.
🔹 The value of ‘K’ is crucial and is determined experimentally.
🔹 K-NN is memory-intensive and sensitive to outliers but works well for multi-class problems.
🔹 Cross-validation improves model evaluation by using multiple train-test splits.
🔹 K-Fold Cross-Validation (K=10) is commonly used to avoid overfitting and ensure reliable accuracy
estimates.
🔹 Cross-validation helps determine the best value of K in the K-NN algorithm.
MCQs on K-Nearest Neighbors (K-NN) and Cross-Validation
1. What type of machine learning algorithm is K-Nearest Neighbors (K-NN)?
A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Semi-supervised Learning
✅ Answer: A) Supervised Learning
MCQs on Cross-Validation
16. What is the purpose of cross-validation?
A) To reduce training time
B) To improve accuracy on training data
C) To evaluate how well a model generalizes to new data
D) To increase bias in the model
✅ Answer: C) To evaluate how well a model generalizes to new data
19. How does cross-validation help in choosing the best value of K for K-NN?
A) By testing different K values and selecting the one with the best accuracy
B) By ignoring outliers in the dataset
C) By normalizing the dataset before classification
D) By increasing the number of neighbors
✅ Answer: A) By testing different K values and selecting the one with the best accuracy
20. What happens if we use too many folds in K-Fold Cross-Validation?
A) The training set becomes too small, leading to high variance
B) The model accuracy always improves
C) The model never overfits
D) The dataset size increases
✅ Answer: A) The training set becomes too small, leading to high variance
Short Answer Questions on K-NN
1. What type of machine learning algorithm is K-NN?
o K-NN is a supervised learning algorithm.
2. What are the two main applications of K-NN?
o K-NN can be used for classification and regression problems.
3. What is the basic assumption behind K-NN?
o The assumption is that similar data points exist close to each other in feature space.
4. What does ‘K’ represent in the K-NN algorithm?
o ‘K’ represents the number of nearest neighbors considered for classification.
5. How does K-NN classify a new data point?
o It identifies K closest points and assigns the most common class among them.
6. What are the commonly used distance metrics in K-NN?
o Euclidean distance, Manhattan distance, and Minkowski distance.
7. What happens when K=1 in K-NN?
o The algorithm assigns the class of the single nearest data point.
8. Why are odd values of K preferred in K-NN?
o Odd values prevent ties in majority voting.
9. What is the major drawback of K-NN?
o It is computationally expensive and requires high memory.
10. What is the ‘curse of dimensionality’ in K-NN?
It means that as the number of features increases, the effectiveness of distance calculations
decreases.
11. How does K-NN handle outliers?
K-NN is highly sensitive to outliers as they can affect distance calculations.
12. Why is K-NN called a "lazy learning" algorithm?
It does not have a training phase and performs computations during prediction.
13. Is K-NN a parametric or non-parametric algorithm? Why?
It is non-parametric because it does not assume any specific distribution of the data.
Short Answer Questions on Cross-Validation
14. What is the purpose of cross-validation?
It helps to evaluate how well a model generalizes to new data.
15. How does K-Fold Cross-Validation work?
It splits the dataset into K folds and tests the model on each fold one by one.
16. What is the commonly used value of ‘K’ in K-Fold Cross-Validation?
K = 10 is a commonly used value.
17. How does cross-validation help in choosing the best K for K-NN?
By testing different K values and selecting the one with the best accuracy.
18. What happens if we use too many folds in K-Fold Cross-Validation?
The training set becomes too small, leading to high variance.
19. Why is cross-validation important in machine learning?
It helps to reduce overfitting and ensures the model performs well on unseen data.
20. What is the key difference between K-NN and cross-validation?
K-NN is a classification algorithm, whereas cross-validation is a technique to evaluate model
performance.
Difference between KNN and Decision Tree
K-NN is a lazy learner, meaning it does not build a model but makes predictions on the fly.
Decision Trees actively learn patterns from data during training and use a structured tree for
predictions.
K-NN is memory-intensive and performs better with small datasets, whereas Decision Trees scale well
with larger datasets.
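As a rough illustration of this difference, the sketch below fits both models on the same small dataset with scikit-learn; the dataset choice and the exact accuracies are incidental.

```python
# Lazy K-NN vs. a trained decision tree on the same data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# K-NN: "fitting" only stores the training data; the work happens at prediction time
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

# Decision tree: fitting actually learns a tree of if-then rules from the data
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("K-NN test accuracy:         ", knn.score(X_test, y_test))
print("Decision tree test accuracy:", tree.score(X_test, y_test))
```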
ASSIGNMENT
TOPIC
LEGENDS
1.
2.
3.
4.
5.
6.
7.
8.
Chapter 5: Communication Skills
🔹 1. What is Communication?
Definition: Exchange of information, ideas, thoughts, or feelings.
Types:
o Verbal – Spoken or written (e.g., presentations, emails)
o Non-verbal – Body language, gestures, facial expressions
o Visual – Charts, graphs, images used to convey data
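As a small illustration of visual communication, the same information can be shown as a chart rather than a sentence. A minimal matplotlib sketch (the percentages mirror the pie-chart example in the questions below):

```python
# Turning numbers into a visual (values are illustrative).
import matplotlib.pyplot as plt

products = ["Product A", "Product B", "Product C"]
preference = [45, 35, 20]          # % of customers preferring each product

plt.bar(products, preference)
plt.ylabel("Customers preferring the product (%)")
plt.title("Customer preference by product")
plt.show()
```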
LEGENDS
1. You are asked to present your analysis of a customer satisfaction survey to a non-technical
audience. How would you ensure they understand your insights?
2. A pie chart shows that 45% of customers prefer product A, 35% prefer B, and 20%
prefer C. How can you use storytelling to explain this visually and meaningfully?
Your team uses highly technical terms in a report. The business stakeholders don't
understand it. What changes would you suggest to improve communication?
3. Why do you think visual communication is often more effective than verbal
communication when presenting large datasets?
4. Imagine you are explaining a complex correlation graph to a school principal with no
background in data science. How would you break down the information?
5. Your manager says, “We don’t need visualizations; just give me the numbers.” Do you
agree or disagree? Justify your viewpoint with examples.
6. Create a mini-presentation outline using storytelling, visuals, and key takeaways to
explain the increase in student attendance over the last 6 months.
Design a dashboard layout to present school performance data (marks, attendance,
activities). What visual elements would you include and why?
During a project review, your graph is misinterpreted by the team, leading to a wrong
decision. What could be the reason, and how would you prevent this in future?
7. You have conflicting feedback: one person wants detailed numbers, another wants only
charts. How would you balance both needs in your report?
8. You have conflicting feedback: one person wants detailed numbers, another wants only
charts. How would you balance both needs in your report?