Eda 1
Feature       Series                     DataFrame
Dimensions    1D                         2D
Data Types    Single dtype per Series    Multiple dtypes (one per column)
Example: A dataset of student records with columns like Name, Age, Marks is "data" in EDA.
import pandas as pd
df = pd.read_csv('data.csv')
2. Apply Operations:
o Column-wise: df['column'].mean()
o Aggregation: df.groupby('category').sum()
Example:
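As a minimal sketch of the two operations listed under Apply Operations, the snippet below builds a small in-memory DataFrame instead of reading data.csv. The column names `category` and `value` and all the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for data.csv
df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "value": [10, 20, 30, 40],
})

# Column-wise operation: mean of a single column
value_mean = df["value"].mean()

# Aggregation: total of "value" within each category
category_totals = df.groupby("category")["value"].sum()

print(value_mean)
print(category_totals)
```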
2. Visualize:
• Methods:
• MAR (Missing at Random): Depends on observed data (e.g., age missing more for
females).
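One rough way to look for a MAR pattern is to compare the share of missing values across the groups of an observed column. The sketch below assumes hypothetical `gender` and `age` columns with invented values; a gap between groups is only a hint, not proof of the mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical records: age is missing more often for one group
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age": [np.nan, 31, np.nan, np.nan, 25, 40, np.nan, 33],
})

# Fraction of missing ages within each gender group
missing_rate = df["age"].isna().groupby(df["gender"]).mean()
print(missing_rate)
```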
What data analysis approaches exist, and how does EDA differ from the other approaches?
Data analysis can be broadly categorized into several approaches, each serving different
purposes. Exploratory Data Analysis (EDA) is distinct in its objectives and methods.
Key Takeaway
EDA is the first step in data analysis, focusing on understanding data before applying
statistical tests or predictive models. Other approaches build on EDA’s insights to answer
specific questions or solve problems.
Exploratory Data Analysis (EDA) has been pivotal in solving real-world problems across
industries. Below are 5 concrete examples where EDA revealed hidden patterns, anomalies,
or actionable insights:
EDA transforms raw data into actionable insights—whether it’s saving lives, money, or time!
Why do we need data visualisation tools for exploratory data analysis? Explain a use case with an example. (16 marks)
The Role of Data Visualization in Exploratory Data Analysis (EDA) – 16 Marks
• Humans process visuals faster than raw numbers (e.g., spotting trends in a graph vs.
a table).
• Reveals hidden patterns (outliers, clusters, correlations) that summary statistics may
miss.
• EDA Approach:
• EDA Approach:
• EDA Approach:
o KDE Plot (Kernel Density Estimate) reveals peaks (e.g., more young adults vs.
seniors).
• EDA Approach:
• EDA Approach:
1. Histogram of Wait Times → Reveals most patients wait 20-40 mins, but some wait 2+
hours (outliers).
2. Box Plot by Department → Emergency room has the highest variability in wait times.
3. Scatter Plot (Wait Time vs. Staff Count) → Shows longer waits when fewer nurses
are on duty.
4. Heatmap (Wait Time by Hour & Day) → Identifies peak hours (e.g., Monday
mornings).
Outcome: Hospital adjusts staff schedules and streamlines triage processes, reducing
average wait time by 30%.
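Each of the four views in this case study has a numeric counterpart that can be checked before plotting. The sketch below uses a small invented table; the column names (`department`, `day`, `hour`, `nurses`, `wait_min`) and every value are assumptions made for illustration, not hospital data.

```python
import pandas as pd

# Invented visit records
visits = pd.DataFrame({
    "department": ["ER", "ER", "ER", "Cardiology", "Cardiology", "Pediatrics"],
    "day": ["Mon", "Mon", "Tue", "Mon", "Tue", "Tue"],
    "hour": [9, 9, 14, 9, 14, 14],
    "nurses": [2, 3, 6, 5, 6, 6],
    "wait_min": [130, 90, 25, 35, 30, 20],
})

# 1. Histogram counterpart: number of visits per wait-time bin
bins = pd.cut(visits["wait_min"], bins=[0, 20, 40, 120, 240])
wait_hist = bins.value_counts().sort_index()

# 2. Box-plot counterpart: spread of wait times by department
wait_spread = visits.groupby("department")["wait_min"].agg(["min", "median", "max"])

# 3. Scatter-plot counterpart: correlation between staffing and wait time
staff_corr = visits["nurses"].corr(visits["wait_min"])

# 4. Heatmap counterpart: mean wait for each day/hour cell
wait_grid = visits.pivot_table(index="day", columns="hour", values="wait_min", aggfunc="mean")

print(wait_hist, wait_spread, staff_corr, wait_grid, sep="\n\n")
```

A negative `staff_corr` here mirrors the scatter-plot finding that waits grow when fewer nurses are on duty.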
5. Conclusion
• Different charts serve different purposes (e.g., histograms for distributions, scatter
plots for correlations).
(This answer covers theoretical importance, practical examples, tools, and a case study—
sufficient for a 16-mark question.)
Pandas is a powerful Python library for data manipulation and analysis. Below are 8 critical
functions used in EDA, along with their purpose, syntax, and examples.
1. head() / tail()
Syntax:
Example:
import pandas as pd
df = pd.read_csv("sales_data.csv")
Output:
2 103 Keyboard 3 50
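A minimal sketch of `head()` and `tail()` on an in-memory frame, so it runs without sales_data.csv. The column names follow the ones this section uses; the rows are invented, apart from the Keyboard row shown in the output.

```python
import pandas as pd

# Invented rows; only the column names and the Keyboard row come from the section
df = pd.DataFrame({
    "Order_ID": [101, 102, 103, 104, 105, 106],
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor", "Speaker", "Laptop"],
    "Quantity": [1, 2, 3, 1, 2, 1],
    "Price": [1200, 20, 50, 300, 80, 1200],
})

first_rows = df.head(3)   # first 3 rows; head() alone returns 5
last_rows = df.tail(2)    # last 2 rows; tail() alone returns 5

print(first_rows)
print(last_rows)
```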
2. info()
Purpose: Provides a summary of the DataFrame (columns, data types, non-null counts).
Why Useful: Detects missing values and checks data types.
Syntax:
df.info()
Example:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
3. describe()
Purpose: Generates descriptive statistics (count, mean, std, min, max, quartiles).
Why Useful: Identifies central tendency, spread, and outliers in numerical data.
Syntax:
Example:
df.describe()
Output:
Quantity Price
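As a sketch of what `describe()` returns, the snippet below runs it on invented Quantity and Price values and pulls individual statistics out of the result.

```python
import pandas as pd

# Invented values for the two numeric columns
df = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "Price": [20, 50, 300, 1200],
})

stats = df.describe()   # rows: count, mean, std, min, 25%, 50%, 75%, max
print(stats)

# Individual statistics can be looked up by row label and column name
mean_price = stats.loc["mean", "Price"]
median_quantity = stats.loc["50%", "Quantity"]
```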
4. isnull().sum()
Syntax:
df.isnull().sum()
Example:
print(df.isnull().sum())
Output:
Order_ID 0
Product 5
Quantity 0
Price 0
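The counts are often easier to judge as a share of all rows. A minimal sketch with invented data, where only Product has gaps:

```python
import numpy as np
import pandas as pd

# Invented rows; Product is missing in two of them
df = pd.DataFrame({
    "Order_ID": [101, 102, 103, 104],
    "Product": ["Laptop", np.nan, "Mouse", None],
    "Quantity": [1, 2, 3, 1],
})

missing_count = df.isnull().sum()        # missing values per column
missing_pct = df.isnull().mean() * 100   # same information as a percentage

print(missing_count)
print(missing_pct)
```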
5. value_counts()
Syntax:
Example:
print(df['Product'].value_counts())
Output:
Laptop 30
Monitor 25
Keyboard 20
Mouse 15
Speaker 10
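A short sketch of `value_counts()` on an invented Product column, including the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

# Invented product column
products = pd.Series(
    ["Laptop", "Mouse", "Laptop", "Monitor", "Laptop", "Mouse"], name="Product"
)

counts = products.value_counts()                 # sorted from most to least frequent
shares = products.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```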
6. groupby()
Syntax:
df.groupby('Column').agg(['mean', 'sum'])
Example:
print(df.groupby('Product')['Price'].mean())
Output:
Product
Laptop 1200
Monitor 300
Keyboard 50
Mouse 20
Speaker 80
7. corr()
Syntax:
df.corr(numeric_only=True)  # skips text columns, which otherwise raise an error in pandas 2.x
Example:
print(df[['Quantity', 'Price']].corr())
Output:
Quantity Price
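A sketch of `corr()` on invented data in which Quantity falls as Price rises, so the Pearson coefficient comes out negative. Selecting the numeric columns first avoids errors from text columns.

```python
import pandas as pd

# Invented data: cheaper items sell in larger quantities
df = pd.DataFrame({
    "Product": ["Mouse", "Keyboard", "Speaker", "Monitor", "Laptop"],
    "Quantity": [5, 4, 3, 2, 1],
    "Price": [20, 50, 80, 300, 1200],
})

corr_matrix = df[["Quantity", "Price"]].corr()   # Pearson by default
print(corr_matrix)

quantity_price = corr_matrix.loc["Quantity", "Price"]
```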
8. plot()
Syntax:
df['Column'].plot(kind='hist')
Example:
import matplotlib.pyplot as plt
df['Price'].plot(kind='box')
plt.show()
Output:
Conclusion
Describe, with syntax, how to summarise, aggregate, and group data in exploratory data analysis, and provide an example.
Summarizing, Aggregating, and Grouping Data in EDA (with Syntax & Example)
In Exploratory Data Analysis (EDA), summarizing, aggregating, and grouping data helps
uncover trends, patterns, and key statistics. Below is a detailed breakdown with syntax and
examples using Pandas.
1. Summarizing Data
Key Functions:
• describe() → Generates descriptive statistics (count, mean, std, min, max, quartiles).
• mean(), median(), sum(), count(), std(), min(), max() → Compute specific metrics.
Syntax:
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame(data)
print(df.describe())
# Mean price
print(df['Price'].mean())
Output:
Price Quantity
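A minimal, self-contained sketch of the summary functions listed above. The Price and Quantity values are invented for illustration.

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    "Price": [1200, 300, 50, 20, 1100],
    "Quantity": [2, 1, 3, 5, 1],
})

print(df.describe())   # full summary table

mean_price = df["Price"].mean()             # single metric on one column
median_quantity = df["Quantity"].median()
price_range = df["Price"].max() - df["Price"].min()
```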
2. Aggregating Data
Aggregation combines multiple values into a single result (e.g., sum, average).
Key Functions:
Syntax:
df['column'].sum()
Example:
Output:
Price Quantity
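Several aggregations can be computed at once with `agg()`. A sketch on invented Price and Quantity values:

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    "Price": [1200, 300, 50, 20, 1100],
    "Quantity": [2, 1, 3, 5, 1],
})

total_quantity = df["Quantity"].sum()   # one aggregation on one column

# Different aggregations per column, returned as one table
summary = df.agg({"Price": ["mean", "max"], "Quantity": ["sum", "max"]})
print(summary)
```

Cells for combinations that were not requested (for example the sum of Price) come back as NaN.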
3. Grouping Data
Key Function:
Syntax:
df.groupby('column').mean(numeric_only=True)  # averages numeric columns only
Example:
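# Average price and total quantity per product
grouped = df.groupby('Product').agg({'Price': 'mean', 'Quantity': 'sum'})
print(grouped)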
Output:
Price Quantity
Product
Keyboard 50 3
Laptop 1150 3
Monitor 300 1
Mouse 20 5
→ Insight: Laptops have the highest avg price ($1150), Mice sell the most (5 units).
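import pandas as pd
import matplotlib.pyplot as plt

# Product and Price are assumed values, chosen to reproduce the grouped output above
data = {
    'Product': ['Laptop', 'Monitor', 'Keyboard', 'Mouse', 'Laptop'],
    'Price': [1200, 300, 50, 20, 1100],
    'Quantity': [2, 1, 3, 5, 1]
}
df = pd.DataFrame(data)
# 1. Summarize
print(df.describe())
# 2. Aggregate
print(df[['Price', 'Quantity']].agg(['sum', 'mean']))
# 3. GroupBy: average price and total quantity per product
grouped = df.groupby('Product').agg({'Price': 'mean', 'Quantity': 'sum'})
print(grouped)
# 4. Visualization
grouped.plot(kind='bar')
plt.show()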
Output:
• Graph: Bar plot comparing avg price and total quantity per product.
Key Takeaways
These techniques help in identifying trends, outliers, and business insights during EDA.
Data induction refers to the process of deriving general patterns, rules, or models from
specific observations in a dataset. It involves learning from data to make predictions or
decisions without explicit programming.
• Contrast with Deduction: Deduction starts with general rules → specific outcomes
(e.g., math proofs).
4. Handling Uncertainty: Works with noisy, incomplete data (real-world datasets are
rarely perfect).
• Examples:
o Regression: House price prediction (input: sq. ft, location → output: price).
• Algorithms:
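As a sketch of the regression example, the snippet below fits a straight line to invented house sizes and prices with NumPy's least-squares `polyfit`. It uses square footage only and leaves out location; it stands in for whichever regression algorithm is applied and is not tied to a specific library model.

```python
import numpy as np

# Invented training data: size in square feet -> price
sqft = np.array([800, 1000, 1200, 1500, 2000], dtype=float)
price = np.array([160_000, 200_000, 240_000, 300_000, 400_000], dtype=float)

# Induce a general rule (price = slope * sqft + intercept) from the specific examples
slope, intercept = np.polyfit(sqft, price, deg=1)

# Apply the learned rule to an unseen house
predicted = slope * 1800 + intercept
print(round(predicted))
```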
• Examples:
• Algorithms:
• Concept: Discovers "if X, then Y" rules (e.g., "If {diapers}, then {beer}").
• Example:
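The strength of a rule such as "if {diapers}, then {beer}" is usually measured by its support and confidence. A plain-Python sketch on invented baskets:

```python
# Invented shopping baskets
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

antecedent, consequent = {"diapers"}, {"beer"}

n_total = len(baskets)
n_antecedent = sum(antecedent <= b for b in baskets)            # baskets containing diapers
n_both = sum((antecedent | consequent) <= b for b in baskets)   # baskets containing both

support = n_both / n_total            # how often the rule applies overall
confidence = n_both / n_antecedent    # how often beer appears given diapers

print(support, confidence)
```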
• Example:
% Background knowledge: parent(X,Y) ← father(X,Y).
1. Data Collection: Gather raw data (e.g., sales records, sensor logs).
50,000 700 No
Python Implementation:
from sklearn.tree import DecisionTreeClassifier
y = ['No', 'Yes']
model = DecisionTreeClassifier().fit(X, y)
| |--- class: No
2. Bias-Variance Tradeoff: Simpler models may underfit; complex models may overfit.
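The trade-off can be seen by fitting polynomials of different degree to the same noisy sample. In this sketch the data, noise level, and degrees are arbitrary choices: a degree-1 fit is too simple for the curve, while a degree-9 fit pushes training error down by following the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth curve (invented data)
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def errors(degree):
    """Return (training MSE, test MSE) for a polynomial fit of the given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 9):
    print(degree, errors(degree))
```

Training error can only fall as the degree grows, so it is the test error that shows whether the extra flexibility is actually helping.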
7. Real-World Applications
3. Retail: Recommender systems (e.g., "Customers who bought X also bought Y").
8. Conclusion
• Data induction is the core of machine learning, enabling systems to learn from data.