Lab 02 - Introduction to Pandas

This document outlines Lab 02 for the CS471 Machine Learning course, focusing on using Pandas and Scikit-Learn for data manipulation and exploratory data analysis. Students will work with the Titanic and NYC Airbnb datasets to develop skills in data handling, visualization, and machine learning model training. The lab includes tasks for data loading, cleaning, analysis, and requires a self-reflection report on the learning outcomes and challenges faced during the lab.

Uploaded by

mukhan.bese22seecs

Faculty of Computing

Lab 02: Introduction to Pandas and Scikit-Learn
CS471 Machine Learning

BESE – 13B
03rd February 2025

Lab Engineer: Mr. Junaid Sajid

Instructor: Dr. Muhammad Daud Abdullah Asif
Lab 02: Introduction to Pandas and Scikit-Learn

Lab Objectives:

1. Develop hands-on experience in manipulating datasets using Python libraries such
as Pandas and Scikit-Learn.
2. Perform exploratory data analysis (EDA) to identify patterns, trends, and insights in
datasets.
3. Visualize data effectively to support analysis and reporting.

End Goal:

Students will be able to manipulate datasets, perform exploratory data analysis, and create
insightful visualizations using Python.

Datasets:

1. Titanic Dataset:
○ Description: Data on passengers of the Titanic, including survival information.
○ Source: Downloadable from Kaggle (Titanic Dataset).

2. NYC Airbnb Dataset:

○ Description: Listings data with features such as price, location, and
availability.
○ Source: Downloadable from Kaggle (New York City Airbnb Open Data).

Pandas:

• Importing Pandas:

import pandas as pd

• Creating a Series:

data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)

• Creating a DataFrame:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

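• Loading a dataset:

# 'data.csv' is a placeholder file name; the calls below assume it has an 'Age' column
df = pd.read_csv('data.csv')
print(df.head())

• Basic operations:

df.info()  # info() prints its summary directly and returns None, so no print() is needed
print(df.describe())
print(df['Age'].mean())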

scikit-learn:

• Loading a sample dataset:

from sklearn.datasets import load_iris

data = load_iris()
print(data.keys())
print(data['feature_names'])
print(data['target_names'])

• Splitting the dataset:

from sklearn.model_selection import train_test_split

X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

• Training a machine learning model:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

• Making predictions and evaluating:

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Matplotlib:

• Importing Matplotlib:

import matplotlib.pyplot as plt

• Plotting a basic graph:

x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]

plt.plot(x, y, marker='o')
plt.title("Basic Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

Exploratory Data Analysis on Iris Dataset:

• Load the Iris Dataset

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt  # needed by the plots that follow

# Load the dataset
iris_sklearn = load_iris()
iris_df = pd.DataFrame(data=iris_sklearn.data,
                       columns=iris_sklearn.feature_names)

# Map the integer class labels to species names
iris_df['species'] = iris_sklearn.target
iris_df['species'] = iris_df['species'].replace(
    {0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display the first few rows
print(iris_df.head())

• Data Visualization with Matplotlib

• Scatter Plot:

# Column names must stay on one line; a line break inside the quotes is a syntax error
plt.scatter(iris_df['sepal length (cm)'], iris_df['sepal width (cm)'],
            c='blue', label='Sepal')
plt.scatter(iris_df['petal length (cm)'], iris_df['petal width (cm)'],
            c='red', label='Petal')
plt.title("Scatter Plot of Sepal and Petal Dimensions")
plt.xlabel("Length (cm)")
plt.ylabel("Width (cm)")
plt.legend()
plt.show()

• Histogram:

plt.hist(iris_df['sepal length (cm)'], bins=10,
         color='green', alpha=0.7)
plt.title("Histogram of Sepal Length")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Frequency")
plt.show()

• Bar Plot:

species_counts = iris_df['species'].value_counts()
species_counts.plot(kind='bar', color=['blue', 'orange', 'green'])
plt.title("Species Count")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()

Lab Tasks:

Task 1: Data Loading and Exploration

1. Load the Titanic Dataset

2. Inspect the dataset:
○ Display column names, data types, and summary statistics using Pandas.
○ Identify missing values and discuss their implications.
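
The inspection steps in Task 1 can be sketched as follows. This is a minimal illustration, not the expected solution: the small in-memory DataFrame is a made-up stand-in for the Kaggle file, and its lowercase column names (`survived`, `pclass`, `sex`, `age`, `fare`) are assumptions that may differ from the CSV you download.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; in the lab you would call
# pd.read_csv(...) on the downloaded Kaggle file instead.
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 1, 2, 3],
    "sex": ["male", "female", "female", "male", "female", "male"],
    "age": [22.0, 38.0, np.nan, 54.0, np.nan, 2.0],
    "fare": [7.25, 71.28, 7.92, 51.86, 0.0, 21.08],
})

# Column names, data types, and summary statistics
print(titanic.columns.tolist())
print(titanic.dtypes)
print(titanic.describe())

# Missing values: absolute count and share of rows, per column
missing_counts = titanic.isna().sum()
missing_share = titanic.isna().mean()
print(missing_counts[missing_counts > 0])
print(missing_share.round(2))
```

Looking at the share of missing rows, and not only the count, helps when deciding whether a column can be imputed or is better dropped.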

Task 2: Data Cleaning and Manipulation

1. Handle missing values:

○ Fill missing values in the Age column with the median age.
○ Drop rows with zero values in the Fare column.
2. Create new features:
○ Add a family_size feature by combining sibsp (siblings/spouses) and
parch (parents/children).
○ Create a travel_alone feature indicating whether the passenger traveled
alone.
3. Filter data:
○ Select passengers aged between 18 and 50 and save the subset as a new
DataFrame.
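
One possible way to carry out the Task 2 steps is sketched below on made-up data. The column names are assumptions, and so is the definition of `family_size` as the plain sum of `sibsp` and `parch`; some analyses add 1 to count the passenger as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; replace with the DataFrame loaded in Task 1.
titanic = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 60.0, 4.0, 28.0],
    "fare": [7.25, 71.28, 0.0, 26.55, 21.08, 8.05],
    "sibsp": [1, 1, 0, 0, 3, 0],
    "parch": [0, 0, 0, 0, 1, 0],
})

# 1. Handle missing values: fill Age with the median, drop zero fares
median_age = titanic["age"].median()
titanic["age"] = titanic["age"].fillna(median_age)
cleaned = titanic[titanic["fare"] != 0].copy()

# 2. Create new features (assumption: family_size excludes the passenger)
cleaned["family_size"] = cleaned["sibsp"] + cleaned["parch"]
cleaned["travel_alone"] = cleaned["family_size"] == 0

# 3. Filter: passengers aged 18 to 50, both bounds included
adults = cleaned[cleaned["age"].between(18, 50)].copy()

print(cleaned)
print(adults)
```

Calling `.copy()` after each filter keeps `cleaned` and `adults` independent of the original frame, which avoids pandas' chained-assignment warnings when new columns are added.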

Task 3: Exploratory Data Analysis

1. Analyze categorical variables:

○ Visualize the survival rate by gender and passenger class.
2. Analyze numerical variables:
○ Plot a histogram for the Age column and overlay the median.
○ Create a box plot of Fare across different passenger classes.
3. Detect correlations:
○ Compute and visualize the correlation matrix using a heatmap.
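
The quantities behind the Task 3 plots can be computed as in the sketch below. The data is invented and the column names are assumptions; the plotting calls are left out so the focus stays on the tables that feed the charts.

```python
import pandas as pd

# Hypothetical sample rows; replace with the cleaned Titanic DataFrame.
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "sex": ["male", "female", "female", "male",
            "female", "male", "female", "male"],
    "age": [22.0, 38.0, 26.0, 54.0, 27.0, 2.0, 35.0, 40.0],
    "fare": [7.25, 71.28, 7.92, 51.86, 11.13, 21.08, 53.10, 13.00],
})

# Survival rate per category: the mean of a 0/1 column is the share of 1s
survival_by_sex = titanic.groupby("sex")["survived"].mean()
survival_by_class = titanic.groupby("pclass")["survived"].mean()

# Median age, e.g. to draw as a vertical line over the Age histogram
median_age = titanic["age"].median()

# Correlation matrix over numeric columns only; text columns such as
# 'sex' must be excluded (or encoded) before correlations are computed
corr = titanic.select_dtypes("number").corr()

print(survival_by_sex)
print(survival_by_class)
print(corr.round(2))
```

Each result plugs straight into a plot: the two survival tables into `.plot(kind='bar')`, `median_age` into `plt.axvline`, and `corr` into a heatmap function such as Seaborn's `heatmap` or Matplotlib's `imshow`.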

Task 4: Advanced Data Visualization

1. Plot geospatial data using the NYC Airbnb Dataset:

○ Visualize the geographic distribution of listings using a scatter plot of latitude
and longitude.
○ Create a bar plot of average price per neighborhood.
2. Identify outliers in the price column using a box plot.
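
A minimal sketch of the Task 4 computations follows. The rows are invented, and the column names (`neighbourhood_group`, `latitude`, `longitude`, `price`) are assumed to match the Kaggle file; check them against your download. The outlier rule shown is the 1.5 × IQR convention, which is also what Matplotlib's box plot uses by default to mark points beyond the whiskers.

```python
import pandas as pd

# Hypothetical sample listings; replace with the Airbnb CSV from Kaggle.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn",
                            "Queens", "Queens", "Manhattan", "Brooklyn"],
    "latitude": [40.75, 40.78, 40.68, 40.69, 40.74, 40.72, 40.76, 40.67],
    "longitude": [-73.98, -73.96, -73.95, -73.94, -73.87, -73.85, -73.99, -73.96],
    "price": [150, 200, 90, 110, 60, 80, 5000, 100],
})

# Average price per area, highest first (input for the bar plot)
avg_price = (listings.groupby("neighbourhood_group")["price"]
             .mean()
             .sort_values(ascending=False))

# Outliers by the 1.5 * IQR rule
q1 = listings["price"].quantile(0.25)
q3 = listings["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = listings[(listings["price"] < lower) | (listings["price"] > upper)]

print(avg_price)
print(outliers[["neighbourhood_group", "price"]])
```

A single extreme listing dominates the Manhattan average in this toy data, which is why it is worth inspecting outliers before interpreting the bar plot. For the geographic view, a scatter plot of `longitude` (x) against `latitude` (y) reproduces the rough outline of the city.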

Lab Summary and Deliverables:

1. Deliverables: Python code, visualizations, a self-reflection report, and a report
summarizing:
○ Data cleaning steps.
○ Insights derived from EDA and visualizations.
○ Observations regarding patterns and anomalies in the datasets.
2. Learning Outcomes:
○ Gained practical skills in data manipulation using Pandas and NumPy.
○ Conducted exploratory data analysis to extract insights.
○ Applied effective visualization techniques to support data storytelling.

Self-Reflection Report Guidelines

Report Template (To Be Included in Notebook):

Students will answer the following questions as part of their notebook submission:

1. Understanding of Concepts:
○ Summarize the key concepts of data manipulation and EDA you applied in
this lab.
○ What new skills or knowledge did you gain?
Example:
"Through this lab, I learned how to handle missing values effectively using Pandas
and how to create new features to extract meaningful insights. The correlation matrix
visualization was particularly insightful for identifying relationships among variables."
2. Challenges Faced:
○ Describe any difficulties encountered during the lab tasks.
○ How did you resolve them?
Example:
"I initially struggled with visualizing correlations using a heatmap. After referring to
the Pandas and Seaborn documentation, I realized I needed to preprocess the data
to exclude non-numerical columns."
