
Data Science Exam Prep – Unit 2

I have a Data Science exam tomorrow, subject code 21CSS303T.


I will share the study material with you one by one... you are going to teach me everything from each study
material. You need to be careful that you don't miss anything from the study material; teach me in a
simple way yet in detail (only then will I be able to write long answers as well), because I have very
limited time for preparation.
Shall we start?


Yes, let's start!
I'm ready to teach you everything from your study material for 21CSS303T - Data Science in a simple
yet detailed way, making sure nothing is missed.

Please go ahead and upload or paste the first part of your study material.

Great! Let’s go through Unit II – Reshaping from your Data Science (21CSS303T) syllabus in a simple
and detailed way, without skipping anything. This section is about how to rearrange tabular data in
Pandas (a Python data analysis library).

🧩 What is Reshaping in Pandas?


Reshaping means changing the structure or layout of a DataFrame (a table-like data structure in
Pandas). This is useful when you want to:

Convert rows into columns.


Convert columns into rows.
Change from wide to long format or vice versa.

These operations are called:

Pivoting

Melting
Stacking
Unstacking

Let’s look at each one in detail.

🔁 1. Pivoting
Goal: Turn unique values from one column into new column headers.
✅ `pivot()` function

Suppose you have this data:

python

import pandas as pd

df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'City': ['New York', 'London', 'New York', 'London'],
'Sales': [200, 150, 220, 180]
})

This is how it looks:

| Date       | City     | Sales |
|------------|----------|-------|
| 2024-01-01 | New York | 200   |
| 2024-01-01 | London   | 150   |
| 2024-01-02 | New York | 220   |
| 2024-01-02 | London   | 180   |

Now apply pivot:

python

pivot_df = df.pivot(index='Date', columns='City', values='Sales')


print(pivot_df)

Output:

City        London  New York
Date
2024-01-01     150       200
2024-01-02     180       220

📌 Key Points:
`index='Date'` → rows are grouped by Date.

`columns='City'` → City names become column headers.


`values='Sales'` → table shows Sales values.

🔄 2. Melting
Goal: Convert from wide format to long format (opposite of pivoting).

python

melted_df = pivot_df.reset_index().melt(id_vars='Date', var_name='City', value_name='Sales')


print(melted_df)

Output:

         Date      City  Sales
0  2024-01-01    London    150
1  2024-01-02    London    180
2  2024-01-01  New York    200
3  2024-01-02  New York    220

📌 Key Points:
`reset_index()` brings Date from index back to column.
`melt()`:

`id_vars='Date'`: keeps Date as is.

`var_name='City'`: names the column that holds former column headers.

`value_name='Sales'`: names the column for values.

📚 3. Stacking and Unstacking


These are used to convert between columns and row-level indexes (MultiIndex).

🔁 Stacking
Goal: Convert columns into a new row-level index.

python

stacked_df = pivot_df.stack()
print(stacked_df)

Output:

Date        City
2024-01-01  London      150
            New York    200
2024-01-02  London      180
            New York    220
dtype: int64

Now you have a multi-level index: (Date, City).

🔄 Unstacking
Goal: Convert multi-index back into columns.

python

unstacked_df = stacked_df.unstack()
print(unstacked_df)

Output: same as `pivot_df`

City        London  New York
Date
2024-01-01     150       200
2024-01-02     180       220

💡 Summary Table:
| Operation | Description | Function Used |
|-----------|-------------|---------------|
| Pivot | Long → Wide (rows → columns) | `pivot()` / `pivot_table()` |
| Melt | Wide → Long (columns → rows) | `melt()` |
| Stack | Columns → Row index (multi-index) | `stack()` |
| Unstack | Row index → Columns | `unstack()` |
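The table above also lists `pivot_table()`, which isn't demonstrated in the material. As a minimal sketch (the duplicate-sales data below is made up for illustration): unlike `pivot()`, which raises an error when an index/column pair repeats, `pivot_table()` aggregates the duplicates with `aggfunc`.

python

import pandas as pd

# Hypothetical data with two sales for (2024-01-01, London);
# df.pivot() would fail here, pivot_table() aggregates instead.
df_dup = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02'],
    'City': ['London', 'London', 'New York', 'New York'],
    'Sales': [150, 160, 200, 220]
})

pt = df_dup.pivot_table(index='Date', columns='City', values='Sales', aggfunc='sum')
print(pt)  # London for 2024-01-01 becomes 150 + 160 = 310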

You're now clear with everything about reshaping in Pandas — pivot, melt, stack, and unstack — with
examples, explanations, and key points!
Shall we move on to the next part of the material?

Great! Let’s now cover the second file: Unit II – Data Cleaning, Preparation & Transformation from
Data Science (21CSS303T) in a simple yet detailed manner, without missing anything. This will help
you write long, clear answers in your exam.

🧼 1. Data Cleaning (Data Cleansing / Data Preprocessing)


✅ What is it?
It means fixing or removing errors, missing values, or inconsistencies in your data.

🧠 Why is it important?
Raw data is usually noisy, incomplete, inconsistent.
Dirty data can lead to wrong decisions and bad model performance.

Algorithms need clean, correct data to learn properly.


Clean data leads to better models, more reliable analysis, and clear insights.

🔧 2. Data Preparation
✅ What is it?
Data preparation = Collect + Clean + Label data for machine learning.

👨‍💻 Steps:
1. Collect raw data.

2. Clean it (remove errors, fill missing values).


3. Label it (add target values if needed).
4. Explore and visualize the data.

🧠 Why is it important?
Improves accuracy & reliability of analysis.
Helps in identifying patterns & trends.

Makes ML models perform better.


Saves time by reducing mistakes during analysis.

❌ 3. Handling Missing Data


🔎 What are missing values?
Values that are not available in a dataset.
Appear as: `NaN`, blank cells, “NA”, “unknown”, etc.

⚠️ Why should we care?


Can reduce data quality.
Many ML models can’t handle them.

May cause bias if certain groups are affected more.


Deleting them may cause loss of useful data.

🔨 Techniques to handle missing values:


A. Deletion

1. Listwise Deletion – Remove entire rows/columns with missing data.

Simple, but you lose data.


2. Pairwise Deletion – Exclude missing values only from the specific calculations they affect, keeping the rest of each row.

Less wasteful, but still risky if the missingness is not random.

B. Imputation

1. Mean/Median/Mode Imputation – Replace missing value with mean/median/mode.

Works if data isn’t skewed.


2. KNN Imputation – Use the average from the 'k' nearest similar rows (see the sketch after this list).

3. Regression Imputation – Predict missing value using other columns.
4. Multiple Imputation – Fill missing values with multiple possible values for better accuracy.
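The material names KNN imputation but doesn't show code for it. Below is a minimal sketch using scikit-learn's `KNNImputer`; this assumes scikit-learn is installed and uses made-up numeric data (KNN imputation works on numeric columns).

python

import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric data with missing values
data = pd.DataFrame({
    'Age':    [25, None, 30, 22, 35],
    'Salary': [50000, 54000, None, 42000, 48000]
})

# Each missing value is replaced by the average of the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed)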

🧪 Python Code: Handling Missing Values


Create a sample DataFrame:

python

import pandas as pd
import numpy as np

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', np.nan],
'Age': [25, np.nan, 30, 22, 35],
'Salary': [50000, 54000, np.nan, 42000, 48000]
}
df = pd.DataFrame(data)

1️⃣ Check for Missing Values


python

df.isnull() # shows where data is missing


df.isnull().sum() # count of missing values in each column

2️⃣ Delete Missing Values


python

df.dropna() # drop rows with missing values


df.dropna(axis=1) # drop columns with missing values

3️⃣ Fill Missing Values


python

df.fillna(0) # fill with a constant

# Fill using statistics


df['Age'] = df['Age'].fillna(df['Age'].mean())              # mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median())   # median
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])        # mode

4️⃣ Forward and Backward Fill


python

df.ffill()   # forward fill (older pandas: df.fillna(method='ffill'))

df.bfill()   # backward fill (older pandas: df.fillna(method='bfill'))

✅ Choosing the Right Approach:


| Criteria | Recommended Action |
|----------|--------------------|
| Many missing values | Use imputation rather than deletion |
| Random missing data | Mean/Median/Mode or KNN imputation |
| Missing data has a pattern | Use regression or multiple imputation |
| High impact on results | Use advanced imputation to avoid bias |

🔄 4. Data Transformation
✅ What is it?
Changing raw data into a better format to help analysis and decision making.

🔹 Techniques of Data Transformation


✅ 1. Data Smoothing
Removes noise from data using algorithms.
Helps highlight important patterns.

Example: Using a moving average to reduce fluctuations in data.
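The moving-average idea can be written in one line with pandas; a small sketch with made-up sales figures:

python

import pandas as pd

sales = pd.Series([200, 150, 220, 180, 260, 210])

# 3-period moving average: each value is the mean of itself and the two
# previous values, which smooths out short-term fluctuations
smoothed = sales.rolling(window=3).mean()
print(smoothed)  # the first two values are NaN (not enough history yet)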

✅ 2. Attribute Construction
Creating new features from existing ones.

Example:

From: `Date` column → create `Day`, `Month`, `Year`


Helps improve data mining and model accuracy.
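A minimal sketch of the `Date` → `Day`/`Month`/`Year` construction, assuming the dates start out as strings:

python

import pandas as pd

df = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02', '2024-02-15']})

# Construct new attributes from the existing Date column
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
print(df)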

✅ 3. Data Generalization
Converts detailed data into abstract, higher-level forms.

Types:

1. Attribute Generalization

Replace specific data with categories.

Example: Age = 25 → Age Group = 20-30


2. Hierarchical Generalization

Use a hierarchy.

Example: “Toyota Corolla” → “Toyota” → “Car”


3. Numeric Generalization

Group numbers into ranges.

Salary = 52000 → Salary Group = 50K–60K
4. Text Generalization

Replace words with general terms.

“John lives in New York” → “Person lives in City”

✅ 4. Data Aggregation
Collect data from different sources and summarize it.

Helps in giving a broader view for analysis.


Example: Sum of sales per month from daily sales data.
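A minimal sketch of the monthly-sum example, using made-up daily sales:

python

import pandas as pd

daily = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-10']),
    'Sales': [200, 150, 220, 180]
})

# Sum of sales per month from daily sales data
monthly = daily.groupby(daily['Date'].dt.to_period('M'))['Sales'].sum()
print(monthly)  # 2024-01 -> 350, 2024-02 -> 400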

✅ 5. Data Discretization
Convert continuous data into categories (bins).
Used in ML and feature engineering.

📌 Example in Python:
python

data = {'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60]}
df = pd.DataFrame(data)
df['Age_Binned'] = pd.cut(df['Age'], bins=3, labels=['Young', 'Middle-aged', 'Old'])
print(df)

✅ 6. Data Normalization
Scale values to a specific range, like [0,1] or [-1,1].

🧠 Why normalize?
Improves model performance, especially for:

KNN, SVM, Neural Networks

Faster convergence in training.


Prevents large-scale features from dominating.
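A quick sketch of min-max normalization to [0, 1] in plain pandas (the salary values are made up):

python

import pandas as pd

df = pd.DataFrame({'Salary': [42000, 48000, 50000, 54000, 60000]})

# Min-max normalization: (x - min) / (max - min) maps every value into [0, 1]
df['Salary_norm'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
print(df)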

✅✅ Summary Points to Remember for Exam:


Data Cleaning: Remove/fix errors, missing values.
Data Preparation: Clean + format data for ML.

Missing Values: Handle by deletion or imputation.

Transformation Techniques: Smoothing, Construction, Generalization, Aggregation,


Discretization, Normalization.

Use Python functions like `dropna()`, `fillna()`, `cut()`, etc.

You're now fully prepared with this unit — from data cleaning to transformation — with code,
examples, and explanation.
Would you like to move on to the next file?

Awesome! Now let’s cover the final part of Unit II from your Data Science (21CSS303T) material —
String Manipulation in Python — explained clearly and in detail so you can write long answers easily.

🧵 1. What is a String in Python?


A string is a sequence of characters enclosed in single (' ') or double (" ") quotes.
Strings are immutable – once created, they cannot be changed, but new strings can be created
from them (by slicing, joining, etc.).

✂️ 2. Basic String Operations


python

text = "Hello, Python!"

| Operation | Code | Output |
|-----------|------|--------|
| Length | `len(text)` | `14` |
| Access character | `text[0]` | `'H'` |
| Last character | `text[-1]` | `'!'` |
| Slice | `text[0:5]` or `text[:5]` | `'Hello'` |
| Slice from index | `text[7:]` | `'Python!'` |
| Reverse | `text[::-1]` | `'!nohtyP ,olleH'` |

🔠 3. Case Manipulation
python

text = "hello python"

| Function | Output |
|----------|--------|
| `text.upper()` | `'HELLO PYTHON'` |
| `text.lower()` | `'hello python'` |
| `text.title()` | `'Hello Python'` |
| `text.capitalize()` | `'Hello python'` |
| `text.swapcase()` | `'HELLO PYTHON'` |

➕ 4. Concatenation & Repetition


python

str1 = "Hello"
str2 = "Python"

| Operation | Code | Output |
|-----------|------|--------|
| Concatenation | `str1 + " " + str2` | `'Hello Python'` |
| Repetition | `str1 * 3` | `'HelloHelloHello'` |

🔍 5. Searching and Replacing


python

text = "Python is fun"

| Operation | Code | Output |
|-----------|------|--------|
| Substring exists | `"Python" in text` | `True` |
| Not in text | `"Java" not in text` | `True` |
| Find position | `text.find("fun")` | `10` |
| Find index | `text.index("is")` | `7` |
| Replace word | `text.replace("fun", "awesome")` | `'Python is awesome'` |

🍌 6. Splitting and Joining Strings


python

text = "apple,banana,grape"

Split: Break string into a list.

python

fruits = text.split(",")
print(fruits) # ['apple', 'banana', 'grape']

Join: Combine list into string with a separator.

python

new_text = "-".join(fruits)
print(new_text) # 'apple-banana-grape'

🔍 7. Removing Whitespace
python

text = " Python "

| Method | Output |
|--------|--------|
| `text.strip()` | `'Python'` |
| `text.lstrip()` | `'Python '` |
| `text.rstrip()` | `' Python'` |

🎨 8. String Formatting
Let’s say:

python

name = "Alice"
age = 25

| Method | Code | Output |
|--------|------|--------|
| f-string | `f"My name is {name} and I am {age} years old."` | `'My name is Alice and I am 25 years old.'` |
| `format()` method | `"My name is {} and I am {} years old.".format(name, age)` | `'My name is Alice and I am 25 years old.'` |
| `%` operator | `"My name is %s and I am %d years old." % (name, age)` | `'My name is Alice and I am 25 years old.'` |

✅ 9. Checking String Properties


python

text = "Python123"

| Function | Explanation | Output |
|----------|-------------|--------|
| `text.isalpha()` | Only alphabets? | `False` |
| `text.isdigit()` | Only digits? | `False` |
| `text.isalnum()` | Only letters and numbers? | `True` |
| `text.isspace()` | Only spaces? | `False` |
| `"hello".islower()` | All lowercase? | `True` |
| `"HELLO".isupper()` | All uppercase? | `True` |

🔁 10. Reversing Words in a Sentence


python

text = "Hello World"
reversed_words = " ".join(text.split()[::-1])
print(reversed_words)  # 'World Hello'

📌 Final Tips:
Use slicing for parts of strings.
Use `split()` and `join()` to go between strings and lists.

Use formatting to make clean outputs.


Use string methods to clean or analyze text.
Always remember: strings are immutable.

You're now fully covered with all three topics of Unit II:
✅ Reshaping
✅ Data Cleaning & Transformation
✅ String Manipulation
Want a quick summary or practice questions for revision now?

Great! Let’s now finish Unit II of Data Science (21CSS303T) with the last section from the file you
uploaded — covering Summarizing, Binning, Standardization, and Outliers in a clear, simple, and
detailed way.

📊 1. Data Summarization
Data summarization means reducing the size and complexity of data by focusing on the main
features, patterns, or trends.

🧭 A. Centrality (Center of Data)


These describe the “middle” of the data:

| Measure | Meaning | Example |
|---------|---------|---------|
| Mean | Average value | (10 + 20 + 30)/3 = 20 |
| Median | Middle value when sorted | 10, 20, 30 → median = 20 |
| Mode | Most frequent value | [2, 3, 3, 5] → mode = 3 |

📏 B. Dispersion (Spread of Data)


These show how scattered the data is:

| Measure | Description |
|---------|-------------|
| Standard Deviation | Average distance from the mean (spread). |
| Variance | Square of the standard deviation. |
| Range | Difference between maximum and minimum values. |

📈 C. Sample Distribution
This tells you about the shape of the data distribution.

| Concept | Meaning |
|---------|---------|
| Histogram | A bar chart showing the frequency of value ranges. |
| Tally | Manual count of occurrences. |
| Skewness | Asymmetry in data (left/right skewed). |
| Kurtosis | Whether the data has a sharp or flat peak. |
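All of these measures have direct pandas equivalents; a small sketch with made-up numbers:

python

import pandas as pd

values = pd.Series([10, 20, 20, 30, 90])

print(values.mean())                # centrality: mean
print(values.median())              # centrality: median
print(values.mode()[0])             # centrality: mode (most frequent value)
print(values.std())                 # dispersion: standard deviation
print(values.var())                 # dispersion: variance
print(values.max() - values.min())  # dispersion: range
print(values.skew())                # distribution shape: skewness
print(values.kurtosis())            # distribution shape: kurtosis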

🧺 2. Binning
Binning = Grouping continuous values into intervals (bins).
Used in: data preprocessing, feature engineering, visualization.

✅ Types of Binning:
| Type | Description | Example |
|------|-------------|---------|
| Equal-Width Binning | All bins have an equal range. | Ages 22, 25, 30 → [20–30] |
| Equal-Frequency Binning | Each bin has nearly the same number of values. | 5 values split into 3 bins. |
| Custom Binning | Manual bins based on domain knowledge. | Income → Low, Medium, High |
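Equal-width and equal-frequency binning map directly to `pd.cut()` and `pd.qcut()`; a small sketch reusing the earlier age data:

python

import pandas as pd

ages = pd.Series([22, 25, 30, 35, 40, 45, 50, 55, 60])

# Equal-width: 3 bins covering equal ranges of age
equal_width = pd.cut(ages, bins=3, labels=['Young', 'Middle-aged', 'Old'])

# Equal-frequency: 3 bins with roughly the same number of values in each
equal_freq = pd.qcut(ages, q=3, labels=['Low', 'Mid', 'High'])

# Custom: bin edges chosen from domain knowledge
custom = pd.cut(ages, bins=[20, 30, 50, 70], labels=['20-30', '30-50', '50-70'])

print(pd.DataFrame({'Age': ages, 'EqualWidth': equal_width,
                    'EqualFreq': equal_freq, 'Custom': custom}))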

⚖️ 3. Standardization
Standardization is used to make all features comparable by scaling them.
Helps especially in machine learning when features have different scales.

✅ Why Standardize?
| Benefit | Explanation |
|---------|-------------|
| Equal Feature Importance | Prevents large values from dominating. |
| Better Model Performance | ML algorithms like KNN and SVM work better with standardized data. |
| Faster Convergence | Makes training faster and more stable. |

✏️ Techniques of Standardization
🧮 A. Z-score Normalization
Converts data to have mean = 0 and std = 1
Formula:

z = (x − mean) / standard deviation

📉 B. Min-Max Normalization
Scales values to range [0, 1]
Formula:

x_scaled = (x − min(x)) / (max(x) − min(x))

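The z-score formula maps directly onto pandas (min-max scaling was already sketched in the normalization section above); a minimal example with made-up salaries:

python

import pandas as pd

x = pd.Series([42000, 48000, 50000, 54000, 60000])

# Z-score: subtract the mean and divide by the standard deviation,
# so the result has mean 0 and standard deviation 1
z = (x - x.mean()) / x.std()
print(z)
print(round(z.mean(), 2), round(z.std(), 2))  # approximately 0.0 and 1.0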
🚨 4. Outliers & Noise


✅ What is an Outlier?
A data point very different from others.
Also called: anomaly, deviant, abnormality.

They affect:

Model accuracy
Statistical analysis

May be valid data or errors.

🧠 Types of Outliers:
| Type | Meaning |
|------|---------|
| Global Outliers | Values far away from all other data. Easy to spot. |
| Contextual Outliers | Look normal in one context but abnormal in another. Requires domain knowledge. |
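The material defines outliers but does not prescribe a detection rule. One common approach (an assumption here, not taken from the study material) is the IQR rule:

python

import pandas as pd

values = pd.Series([10, 12, 13, 12, 11, 14, 13, 95])  # 95 looks like a global outlier

# IQR rule (assumption, not from the study material): flag points more than
# 1.5 * IQR outside the middle 50% of the data
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # only 95 is flagged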

❗ 5. Anomalies
Anomalies are unusual instances in data.
Can be:
Rare events

Extreme values
Different patterns

🧠 Example:
A ₹1 lakh credit card transaction at midnight might be normal for one person but anomalous for
another.

Anomalies can be:

Global
Contextual

Or based on behavior

✅ Summary for Writing Long Answers:


1. Summarization

Centrality → Mean, Median, Mode.


Dispersion → Std dev, Variance, Range.
Distribution → Histogram, Skewness, Kurtosis.

2. Binning

Equal width
Equal frequency

Custom

3. Standardization

Z-score → Mean = 0, SD = 1.
Min-Max → Values between [0, 1].

4. Outliers & Anomalies

Global & Contextual


Affect analysis & models

May need removal or deeper analysis

You’ve now completed ALL of Unit II, including:

Reshaping
Data Cleaning & Preparation
String Manipulation

Summarizing, Binning, and Standardization


Outliers and Anomalies

Would you like a 1-page quick revision sheet or some practice questions for revision now?

1 page quick revision sheet

Here’s a 1-page quick revision sheet for Unit II – Data Science (21CSS303T). You can save or revise
from this before the exam.

📘 UNIT II – QUICK REVISION SHEET


🔁 RESHAPING (Pandas)
| Operation | Function | Description |
|-----------|----------|-------------|
| Pivot | `pivot()` | Long → Wide format (columns from values) |
| Melt | `melt()` | Wide → Long format |
| Stack | `stack()` | Columns → MultiIndex rows |
| Unstack | `unstack()` | MultiIndex rows → Columns |

🧼 DATA CLEANING
| Concept | Description |
|---------|-------------|
| Goal | Remove noise, errors, missing or inconsistent data |
| Importance | Ensures accuracy, reduces bias, improves the model |

Handling Missing Values:

Deletion: `dropna()` (row/column)


Imputation: Mean/Median/Mode, KNN, Regression
Fill: `fillna(0)` or with forward/backward fill

🔄 DATA TRANSFORMATION
| Technique | Purpose |
|-----------|---------|
| Smoothing | Removes noise |
| Attribute Construction | Create new features |
| Generalization | Abstract values (e.g., 25 → 20–30 age group) |
| Aggregation | Combine & summarize data |
| Discretization | Convert continuous → categorical (`pd.cut()`) |
| Normalization | Scale data to [0, 1] or a standard range |

🔤 STRING MANIPULATION
| Task | Example |
|------|---------|
| Length | `len(s)` |
| Slicing | `s[0:5]`, `s[::-1]` |
| Case Change | `.upper()`, `.lower()`, etc. |
| Search/Replace | `"in" in s`, `.replace()` |
| Split/Join | `.split(",")`, `"-".join(list)` |
| Whitespace | `.strip()`, `.lstrip()`, etc. |
| Format | `f"My name is {name}"` |
| Check Properties | `.isalpha()`, `.isdigit()`, etc. |
| Reverse Words | `" ".join(s.split()[::-1])` |

📊 DATA SUMMARIZATION
| Category | Measures |
|----------|----------|
| Centrality | Mean, Median, Mode |
| Dispersion | Std Dev, Variance, Range |
| Distribution | Histogram, Tally, Skewness, Kurtosis |

🧺 BINNING
| Type | Example |
|------|---------|
| Equal Width | 22, 25, 30 → [20–30], [30–40] |
| Equal Frequency | 5 values split into ~equal bins |
| Custom | Income → Low, Medium, High |

⚖️ STANDARDIZATION
| Technique | Formula |
|-----------|---------|
| Z-score | (x − μ)/σ (mean = 0, SD = 1) |
| Min-Max | (x − min)/(max − min) → [0, 1] |

🚨 OUTLIERS & ANOMALIES

| Type | Description |
|------|-------------|
| Global Outlier | Far from all other data (clear outlier) |
| Contextual | Unusual in a specific context only |
| Anomaly | Rare, abnormal data (e.g., fraud detection) |

Would you like this as a PDF or printable file?
