Data Science Exam Prep-Unit 2
✅
Yes, let's start!
I'm ready to teach you everything from your study material for 21CSS303T - Data Science in a simple
yet detailed way, making sure nothing is missed.
Please go ahead and upload or paste the first part of your study material.
Great! Let’s go through Unit II – Reshaping from your Data Science (21CSS303T) syllabus in a simple
and detailed way, without skipping anything. This section is about how to rearrange tabular data in
Pandas (a Python data analysis library).
Pivoting
Melting
Stacking
Unstacking
🔁 1. Pivoting
Goal: Turn unique values from one column into new column headers.
Printed using ChatGPT to PDF, powered by PDFCrowd HTML to PDF API. 1/18
✅ `pivot()` function
python
import pandas as pd
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'London', 'New York', 'London'],
    'Sales': [200, 150, 220, 180]
})
python
Output:
📌 Key Points:
`index='Date'` → rows are grouped by Date.
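The pivot call itself did not survive in this export. Here is a minimal sketch of what it probably looked like, reusing the sales data from above. The name `pivot_df` is taken from the stacking example later on; the `columns='City'` and `values='Sales'` arguments are inferred from the output shown there, not quoted from the material.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'London', 'New York', 'London'],
    'Sales': [200, 150, 220, 180]
})

# Each unique City becomes a column; each unique Date becomes a row label.
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print(pivot_df)
```

Note that `pivot()` needs every (Date, City) pair to be unique; duplicate pairs raise an error.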
🔄 2. Melting
Goal: Convert from wide format to long format (opposite of pivoting).
python
Output:
Date City Sales
📌 Key Points:
`reset_index()` brings Date from index back to column.
`melt()`:
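The melt code is missing from this export, so here is a hedged sketch of the idea, assuming the same sales data as in the pivoting section. The argument values (`id_vars='Date'`, `var_name='City'`, `value_name='Sales'`) are my assumption, chosen to match the `Date City Sales` output header.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'London', 'New York', 'London'],
    'Sales': [200, 150, 220, 180]
})
pivot_df = df.pivot(index='Date', columns='City', values='Sales')  # wide format

# reset_index() turns the Date index back into a normal column,
# then melt() folds the city columns into (City, Sales) pairs.
melted_df = pivot_df.reset_index().melt(
    id_vars='Date', var_name='City', value_name='Sales'
)
print(melted_df)
```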
🔁 Stacking
Goal: Convert columns into a new row-level index.
python
stacked_df = pivot_df.stack()
print(stacked_df)
Output:
Date City
2024-01-01 London 150
New York 200
2024-01-02 London 180
New York 220
dtype: int64
🔄 Unstacking
Goal: Convert multi-index back into columns.
python
unstacked_df = stacked_df.unstack()
print(unstacked_df)
Output: same as `pivot_df`
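To see that stacking and unstacking undo each other, here is a small self-contained round trip. It rebuilds `pivot_df` with an assumed `pivot()` call (the original call is not in this export), so treat it as a sketch rather than the exact code from the material.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'London', 'New York', 'London'],
    'Sales': [200, 150, 220, 180]
})
pivot_df = df.pivot(index='Date', columns='City', values='Sales')

stacked = pivot_df.stack()     # Series indexed by (Date, City)
unstacked = stacked.unstack()  # innermost level (City) moves back to columns

# Compare the values of the round-tripped table with the original.
print((unstacked.values == pivot_df.values).all())
```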
💡 Summary Table:
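Operation | Description | Function Used
Pivot | Turn unique values from one column into new column headers | `pivot()`
Melt | Convert wide format to long format | `melt()`
Stack | Convert columns into a new row-level index | `stack()`
Unstack | Convert a multi-index level back into columns | `unstack()`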
You're now clear with everything about reshaping in Pandas — pivot, melt, stack, and unstack — with
examples, explanations, and key points!
Shall we move on to the next part of the material?
Great! Let’s now cover the second file: Unit II – Data Cleaning, Preparation & Transformation from
Data Science (21CSS303T) in a simple yet detailed manner, without missing anything. This will help
you write long, clear answers in your exam.
🧠 Why is it important?
Raw data is usually noisy, incomplete, or inconsistent.
Dirty data can lead to wrong decisions and bad model performance.
🔧 2. Data Preparation
✅ What is it?
Data preparation = Collect + Clean + Label data for machine learning.
👨‍💻 Steps:
1. Collect raw data.
🧠 Why is it important?
Improves accuracy & reliability of analysis.
Helps in identifying patterns & trends.
B. Imputation
3. Regression Imputation – Predict the missing value from the other columns.
4. Multiple Imputation – Fill each gap several times with different plausible values and combine the results, so the uncertainty of the estimate is preserved.
python
import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', np.nan],
    'Age': [25, np.nan, 30, 22, 35],
    'Salary': [50000, 54000, np.nan, 42000, 48000]
}
df = pd.DataFrame(data)
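The rest of this example is cut off in the export. A hedged sketch of how the two main strategies (deletion and simple imputation) could be applied to this DataFrame follows; filling with the column mean and the `'Unknown'` placeholder are illustrative choices of mine, not necessarily what the material used.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', np.nan],
    'Age': [25, np.nan, 30, 22, 35],
    'Salary': [50000, 54000, np.nan, 42000, 48000]
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Option A (deletion): drop every row that has at least one missing value.
dropped = df.dropna()

# Option B (imputation): fill numeric gaps with the column mean.
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
filled['Salary'] = filled['Salary'].fillna(filled['Salary'].mean())
filled['Name'] = filled['Name'].fillna('Unknown')  # placeholder label (assumption)

print(missing_counts)
print(dropped)
print(filled)
```

Deletion keeps only the two complete rows here, which shows why imputation is often preferred on small datasets.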
🔄 4. Data Transformation
✅ What is it?
Changing raw data into a better format to help analysis and decision making.
✅ 2. Attribute Construction
Creating new features from existing ones.
Example:
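The original example is missing from this export. As a stand-in, here is a hypothetical one: the column names `Quantity`, `Unit_Price`, and `Total` are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    'Quantity': [2, 5, 1],
    'Unit_Price': [10.0, 3.0, 25.0],
})

# Construct a new attribute from two existing ones.
orders['Total'] = orders['Quantity'] * orders['Unit_Price']
print(orders)
```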
✅ 3. Data Generalization
Converts detailed data into abstract, higher-level forms.
Types:
1. Attribute Generalization
Use a hierarchy.
Salary = 52000 → Salary Group = 50K–60K
4. Text Generalization
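The salary example above (52000 → 50K–60K) can be sketched in code. This is only one possible way to do it, assuming fixed bands of 10,000; the sample salaries other than 52000 are made up.

```python
import pandas as pd

salaries = pd.Series([42000, 52000, 58000, 61000], name='Salary')

# Floor each salary to the start of its 10,000-wide band.
lower = (salaries // 10000) * 10000

# Build a label such as '50K–60K' from the band's lower and upper edge.
salary_group = (
    (lower // 1000).astype(str) + 'K–' + ((lower + 10000) // 1000).astype(str) + 'K'
)
print(salary_group.tolist())
```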
✅ 4. Data Aggregation
Collect data from different sources and summarize it.
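A small sketch of aggregation, assuming two hypothetical sources (`store_a` and `store_b`) that share the same columns: the data is first combined, then summarized per city.

```python
import pandas as pd

# Two hypothetical data sources with the same structure.
store_a = pd.DataFrame({'City': ['London', 'New York'], 'Sales': [150, 200]})
store_b = pd.DataFrame({'City': ['London', 'New York'], 'Sales': [180, 220]})

# Step 1: collect the sources into one table.
combined = pd.concat([store_a, store_b], ignore_index=True)

# Step 2: summarize, here with the total and average sales per city.
summary = combined.groupby('City')['Sales'].agg(['sum', 'mean'])
print(summary)
```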
✅ 5. Data Discretization
Convert continuous data into categories (bins).
Used in ML and feature engineering.
📌 Example in Python:
python
import pandas as pd

data = {'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)
# bins=3 splits the range 22–60 into three equal-width intervals
df['Age_Binned'] = pd.cut(df['Age'], bins=3, labels=['Young', 'Middle-aged', 'Old'])
print(df)
✅ 6. Data Normalization
Scale values to a specific range, like [0,1] or [-1,1].
🧠 Why normalize?
Improves model performance, especially for:
Use Python functions like `dropna()`, `fillna()`, `cut()`, etc.
You're now fully prepared with this unit — from data cleaning to transformation — with code,
examples, and explanation.
Would you like to move on to the next file?
Awesome! Now let’s cover the final part of Unit II from your Data Science (21CSS303T) material —
String Manipulation in Python — explained clearly and in detail so you can write long answers easily.
🔠 3. Case Manipulation
python
Function Output
str1 = "Hello"
str2 = "Python"
text = "apple,banana,grape"
python
fruits = text.split(",")
print(fruits) # ['apple', 'banana', 'grape']
python
new_text = "-".join(fruits)
print(new_text) # 'apple-banana-grape'
🔍 7. Removing Whitespace
python
Method | Output
`text.strip()` | `'Python'`
`text.lstrip()` | `'Python '`
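The definition of `text` for this section is missing from the export. Assuming it was `" Python "` (one space on each side, which matches the outputs in the table), the three stripping methods behave like this:

```python
text = " Python "  # assumed input: one space before and one after

print(repr(text.strip()))   # removes whitespace on both sides
print(repr(text.lstrip()))  # removes whitespace on the left only
print(repr(text.rstrip()))  # removes whitespace on the right only
```

Using `repr()` makes the remaining spaces visible in the printed result.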
🎨 8. String Formatting
Let’s say:
python
name = "Alice"
age = 25
f-string | `f"My name is {name} and I am {age} years old."` | `'My name is Alice and I am 25 years old.'`
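Here is the f-string row as runnable code. The `str.format()` line is added by me for comparison only; the other rows of the original table are not in this export, so I cannot say whether it was covered there.

```python
name = "Alice"
age = 25

# f-string: expressions inside {} are evaluated in place.
with_fstring = f"My name is {name} and I am {age} years old."

# str.format(): the same result with positional placeholders.
with_format = "My name is {} and I am {} years old.".format(name, age)

print(with_fstring)
print(with_format)
```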
text = "Python123"
Function | Explanation | Output
`"HELLO".isupper()` | All uppercase? | `True`
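Most rows of this table are missing from the export. As a hedged reconstruction, these are standard string checks applied to the `text = "Python123"` value defined above; which of them the original table listed is my guess.

```python
text = "Python123"

checks = {
    'isalnum': text.isalnum(),     # only letters and digits? True
    'isalpha': text.isalpha(),     # only letters? False, it contains digits
    'isdigit': text.isdigit(),     # only digits? False, it contains letters
    'isupper': "HELLO".isupper(),  # all uppercase? True
}
print(checks)
```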
python
📌 Final Tips:
Use slicing for parts of strings.
Use `split()` and `join()` to go between strings and lists.
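A quick sketch of these two tips together; the sample string `s` is made up for illustration.

```python
s = "Hello Python"

first_five = s[0:5]         # characters at positions 0 to 4
reversed_s = s[::-1]        # step of -1 walks the string backwards

words = s.split(" ")        # string -> list
rejoined = " ".join(words)  # list -> string

print(first_five, reversed_s, words, rejoined)
```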
You're now fully covered with all three topics of Unit II:
✅ Reshaping
✅ Data Cleaning & Transformation
✅ String Manipulation
Want a quick summary or practice questions for revision now?
Great! Let’s now finish Unit II of Data Science (21CSS303T) with the last section from the file you
uploaded — covering Summarizing, Binning, Standardization, and Outliers in a clear, simple, and
detailed way.
📊 1. Data Summarization
Data summarization means reducing the size and complexity of data by focusing on the main
features, patterns, or trends.
Measure | Meaning | Example
Mean | Average value | (10 + 20 + 30)/3 = 20
Median | Middle value when sorted | 10, 20, 30 → median = 20
Measure | Description
Standard Deviation | Typical distance of values from the mean (spread); the square root of the variance.
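These measures are easy to check in pandas. A minimal sketch using the same three values as the mean and median examples (10, 20, 30); note that pandas computes the sample standard deviation by default, which is an implementation detail not mentioned in the material.

```python
import pandas as pd

values = pd.Series([10, 20, 30])

mean = values.mean()
median = values.median()
std_sample = values.std()      # sample SD (divides by n - 1), the pandas default
std_pop = values.std(ddof=0)   # population SD (divides by n)
value_range = values.max() - values.min()

print(mean, median, std_sample, std_pop, value_range)
```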
📈 C. Sample Distribution
This tells you about the shape of the data distribution.
Concept Meaning
🧺 2. Binning
Binning = Grouping continuous values into intervals (bins).
Used in: data preprocessing, feature engineering, visualization.
✅ Types of Binning:
Type | Description | Example
Equal-Width Binning | All bins have equal range. | Ages 22, 25, 30 → [20–30]
Equal-Frequency Binning | Each bin has nearly the same number of values. | 5 values split into 3 bins.
Custom Binning | Manual bins based on domain knowledge. | Income → Low, Medium, High
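The three types map naturally onto `pd.cut()` and `pd.qcut()`. A sketch under my own assumptions: the ages come from the discretization example in this unit, while the labels and the custom bin edges (30 and 50) are invented.

```python
import pandas as pd

ages = pd.Series([22, 25, 30, 35, 40, 45, 50, 55, 60])

# Equal-width: the range 22-60 is cut into 3 intervals of equal size.
equal_width = pd.cut(ages, bins=3, labels=['Young', 'Middle-aged', 'Old'])

# Equal-frequency: each bin receives (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=['Low', 'Mid', 'High'])

# Custom: edges chosen by hand; intervals are (0, 30], (30, 50], (50, 120].
custom = pd.cut(ages, bins=[0, 30, 50, 120], labels=['<=30', '31-50', '>50'])

print(pd.DataFrame({'Age': ages, 'Width': equal_width,
                    'Freq': equal_freq, 'Custom': custom}))
```

With this evenly spread data, equal-width and equal-frequency binning give the same groups; on skewed data they would differ.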
⚖️ 3. Standardization
Standardization is used to make all features comparable by scaling them.
Helps especially in machine learning when features have different scales.
✅ Why Standardize?
Benefit | Explanation
Equal Feature Importance | Prevents large values from dominating.
Better Model Performance | ML algorithms like KNN, SVM work better with standardized data.
✏️ Techniques of Standardization
🧮 A. Z-score Normalization
Converts data to have mean = 0 and std = 1
Formula:
$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$
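A minimal NumPy sketch of the z-score formula, using made-up values. It uses the population standard deviation (NumPy's default); using the sample standard deviation instead is also common and changes the result slightly.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean, then divide by the standard deviation.
z = (x - x.mean()) / x.std()  # np.std defaults to the population SD (ddof=0)

print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```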
📉 B. Min-Max Normalization
Scales values to range [0, 1]
Formula:
$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
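The same made-up values run through the min-max formula, as a quick sketch:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# The smallest value maps to 0, the largest to 1, everything else in between.
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # [0.   0.25 0.5  0.75 1.  ]
```

If all values are identical, `max(x) - min(x)` is zero and the formula is undefined, so that case needs separate handling.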
Outliers affect:
Model accuracy
Statistical analysis
🧠 Types of Outliers:
Type | Meaning
Global Outliers | Obvious values that are far away from all other data. Easy to spot.
Contextual Outliers | Look normal in one context, but abnormal in another. Requires domain knowledge.
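The surviving material does not show a detection method, so the following is my own choice: the common 1.5 × IQR rule, which flags values far outside the middle 50% of the data. The numbers are invented, with one obvious global outlier.

```python
import pandas as pd

values = pd.Series([12, 14, 15, 13, 16, 14, 15, 120])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the data

# Anything beyond 1.5 * IQR from the quartiles is flagged (a rule of thumb).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())  # [120]
```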
❗ 5. Anomalies
Anomalies are unusual instances in data.
Can be:
Rare events
Extreme values
Different patterns
🧠 Example:
A ₹1 lakh credit card transaction at midnight might be normal for one person but anomalous for
another.
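One way to make "normal for one person, anomalous for another" concrete is to compare each transaction with that customer's own history. A rough sketch with invented customers and amounts; the "10 times the customer's median" threshold is an arbitrary assumption, not a standard rule.

```python
import pandas as pd

tx = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'amount': [90000, 110000, 95000, 100000, 500, 800, 650, 100000],
})

# Typical spend per customer, repeated on every row of that customer.
typical = tx.groupby('customer')['amount'].transform('median')

# The same 100000 amount is ordinary for A but far above B's usual spend.
tx['anomalous'] = tx['amount'] > 10 * typical
print(tx)
```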
Global
Contextual
Or based on behavior
2. Binning
Equal width
Equal frequency
Custom
3. Standardization
Z-score → Mean = 0, SD = 1.
Min-Max → Values between [0, 1].
Reshaping
Data Cleaning & Preparation
String Manipulation
Would you like a 1-page quick revision sheet or some practice questions for revision now?
Here’s a 1-page quick revision sheet for Unit II – Data Science (21CSS303T). You can save or revise
from this before the exam.
🧼 DATA CLEANING
Concept Description
🔄 DATA TRANSFORMATION
Technique | Purpose
Aggregation | Combine & summarize data
Discretization | Convert continuous → categorical (`pd.cut()`)
🔤 STRING MANIPULATION
Task | Example
Length | `len(s)`
Slicing | `s[0:5]`, `s[::-1]`
📊 DATA SUMMARIZATION
Category | Measures
Centrality | Mean, Median, Mode
Dispersion | Std Dev, Variance, Range
Distribution | Histogram, Tally, Skewness, Kurtosis
🧺 BINNING
Type | Example
Equal Width | 22, 25, 30 → [20–30], [30–40]
Equal Frequency | 5 values split into ~equal bins
Custom | Income → Low, Medium, High
⚖️ STANDARDIZATION
Technique Formula
Type | Description
Global Outlier | Far from all other data (clear outlier)
Contextual | Unusual in some context only