Data Science Lab Group Submission

The document discusses the analysis of a ride-sharing dataset, highlighting the differences between various dataset formats (.csv, .txt, .xlsx) and methods for handling missing values. It categorizes fields in the dataset into numerical and categorical types, identifies redundant data, and proposes problem statements for analysis. The document also outlines exploratory data analysis (EDA) techniques used in R, including visualizations and key takeaways from the analysis.

Uploaded by

mredward eu


TEB2043: Data Science

Lab Group Submission

Wednesday (10am – 12pm)

Group Members:

Submission Date: 19/02/2025

1. Difference between Dataset Extensions
Different file formats are used to store and share datasets, each with its own advantages and
drawbacks. Some of the most common formats include:

.csv (Comma-Separated Values):

This is one of the most widely used formats for storing tabular data. Each line in the file represents
a row, with values separated by commas. It is highly compatible with data analysis tools such as
Python, R, Excel, and databases. However, it does not support multiple sheets, formatting, or
formulas.

.txt (Plain Text):

This is a basic text file that may contain structured or unstructured data. Structured text files often
use delimiters such as commas, tabs, or spaces to separate values. Unlike .csv files, a .txt file does
not have a standard format and requires additional processing to extract structured data.

.xlsx (Excel Spreadsheet):

A format developed by Microsoft and standardized as Office Open XML, which allows for multiple
sheets, advanced formatting, formulas, and charts. This format is user-friendly and commonly used
for manual data entry and visualization, though it is less suited to large-scale data processing than .csv.

The dataset provided to our group is available in .txt and .xlsx formats. Since the .txt file uses a
delimiter like a comma, it can be converted into a structured .csv-like format for easier processing
in R or Python. The dataset contains ride-sharing data, including information about ride requests,
locations, fares, traffic conditions, and user ratings. Understanding the differences between these
file formats is crucial for selecting the best approach to analyzing the data efficiently.
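
As a minimal sketch of the point above, a comma-delimited .txt file can be read exactly like a .csv and then saved as one. The file contents and column names below are made up for illustration and are not taken from the group's dataset.

```r
# Hypothetical sample: two rides written to a temporary comma-delimited .txt file.
txt_path <- tempfile(fileext = ".txt")
writeLines(c(
  "Ride_ID,Vehicle_Type,Fare_Amount_dollar",
  "1,Sedan,12.50",
  "2,Motorcycle,7.25"
), txt_path)

# read.csv() looks at the delimiter, not at the file extension.
rides <- read.csv(txt_path, sep = ",", header = TRUE)

# Save a .csv copy so that other tools can open the same data directly.
csv_path <- tempfile(fileext = ".csv")
write.csv(rides, csv_path, row.names = FALSE)

str(rides)
```

Writing without row names keeps the .csv copy free of an extra index column when it is read back.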
2. Filling In and Correcting Missing/Incomplete Values
In real-world datasets, missing or incomplete values are common due to human errors, technical
issues, or missing data collection points. Various techniques are used to handle these missing
values, depending on the nature of the data and the analysis requirements:

Removing Missing Values:

If only a small proportion of the data contains missing values, those rows or columns may be
removed. This approach works well when missing values are random and do not significantly
impact the dataset.
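
A short sketch of this approach in R, using a small made-up data frame (the column names and values are assumptions): first check how many rows are affected, then drop them.

```r
# Hypothetical data frame with two incomplete rows out of four.
df_demo <- data.frame(
  Fare_Amount_dollar = c(12.5, NA, 9.0, 15.0),
  Vehicle_Type = c("Sedan", "SUV", NA, "Motorcycle")
)

# Share of rows that contain at least one missing value.
missing_share <- mean(!complete.cases(df_demo))

# Keep only the rows in which every field is present.
df_complete <- na.omit(df_demo)

missing_share
nrow(df_complete)
```

Checking the share first helps decide whether removal is reasonable or whether too much data would be lost.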

Imputation Techniques (to fill missing data):

• For Numerical Data: Missing values can be replaced using statistical measures such as
the mean, median, or mode. For example, missing fare amounts could be imputed using
the average fare of similar rides.

• For Categorical Data: The most common category (mode) can be used to fill in missing
values. Alternatively, missing values can be labeled as "Unknown" or "Not Available."
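
The two imputation styles above could look roughly like this in R. The data frame is invented for illustration, and the median is used for the numerical field only as one possible choice.

```r
# Hypothetical data with one missing fare and one missing payment method.
df_demo <- data.frame(
  Fare_Amount_dollar = c(10, NA, 14, 30),
  Payment_Method = c("Cash", "Card", NA, "Card"),
  stringsAsFactors = FALSE
)

# Numerical field: replace NA with the median of the observed values.
fare_median <- median(df_demo$Fare_Amount_dollar, na.rm = TRUE)
df_demo$Fare_Amount_dollar[is.na(df_demo$Fare_Amount_dollar)] <- fare_median

# Categorical field: replace NA with the most frequent category (mode).
# table() skips NA by default; which.max() returns the first level on a tie.
method_mode <- names(which.max(table(df_demo$Payment_Method)))
df_demo$Payment_Method[is.na(df_demo$Payment_Method)] <- method_mode

df_demo
```

The median is less sensitive to unusually expensive rides than the mean, which is why it is often preferred for skewed fare data.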

Prediction-Based Approaches:

More advanced methods use machine learning models to predict missing values based on other
variables in the dataset.

For our case, if missing data was encountered in the dataset, we would opt to use simpler
techniques such as dropping missing values or basic imputation with mean/median/mode.
To identify missing values in R, the following functions can be used:

any(is.na(df)) # Check if any missing values exist

colSums(is.na(df)) # Count missing values in each column

To identify missing values in Excel, the following method can be used:

3. Categorical and Numerical Fields in the Dataset
Data in a dataset can be broadly classified into two main types:

• Numerical Data: Represents measurable quantities that have mathematical significance.
Examples include ride distances, fare amounts, and user ratings.

• Categorical Data: Represents distinct categories or groups without inherent numerical
meaning. Examples include payment methods, vehicle types, and traffic conditions.

Classification of Fields in our Ride Sharing Dataset:

• Numerical Fields (7):

No.  Field Name             Data Type  Description
1    Latitude Pickup        num        The latitude coordinate of the pickup location.
2    Longitude Pickup       num        The longitude coordinate of the pickup location.
3    Latitude Dropoff       num        The latitude coordinate of the drop-off location.
4    Longitude Dropoff      num        The longitude coordinate of the drop-off location.
5    Ride Distance (miles)  num        The total distance traveled in miles from pickup to drop-off.
6    Fare Amount ($)        num        The total fare charged for the ride in dollars.
7    User Rating            int        The rating given by the passenger to the driver, typically on a scale from 1 to 5.

• Categorical Fields (11):

No.  Field Name         Data Type  Description
1    Ride ID            int        A unique identifier for each ride request.
2    Request Time       chr        The date and time when the ride was requested.
3    Payment Method     chr        The payment method.
4    Driver ID          chr        A unique identifier for the driver assigned to the ride.
5    Vehicle Type       chr        The type of vehicle used for the ride.
6    Traffic Condition  chr        The level of traffic congestion during the ride.
7    Peak Hours         chr        Whether the ride was taken during peak hours.
8    Day of Week        chr        The day of the week when the ride was requested.
9    Public Holiday     chr        Indicates if the ride was taken on a public holiday.
10   Pickup Location    chr        A combined value representing the latitude and longitude of the pickup point.
11   Dropoff Location   chr        A combined value representing the latitude and longitude of the drop-off point.
Redundant Data

There are certain fields that can be deemed redundant in this dataset. The redundant fields are:

1. Pickup Location – This field combines Latitude Pickup and Longitude Pickup, which
are already included as separate fields.

2. Dropoff Location – This field combines Latitude Dropoff and Longitude Dropoff,
which are already included separately.

These two fields can be dropped because they duplicate information already available in the
dataset.
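
Dropping the redundant fields could be done along these lines. This is only a sketch on a made-up data frame; the underscore-style column names are an assumption about how the fields appear after loading.

```r
# Hypothetical extract with separate coordinates and a combined location field.
df_demo <- data.frame(
  Latitude_Pickup = c(3.15, 3.10),
  Longitude_Pickup = c(101.70, 101.65),
  Pickup_Location = c("3.15,101.70", "3.10,101.65")
)

# Names of the fields considered redundant; names that are absent are simply ignored.
redundant <- c("Pickup_Location", "Dropoff_Location")

# Keep every column whose name is not in the redundant list.
df_slim <- df_demo[, !(names(df_demo) %in% redundant), drop = FALSE]

names(df_slim)
```

Selecting by name rather than by position keeps the step safe if the column order changes.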

Replacing Certain Values

To make the dataset more numerical for machine learning or statistical analysis, categorical
fields can be encoded as numbers:

Field              Replacement
Traffic Condition  Low → 1, Medium → 2, High → 3
Peak Hours         No → 0, Yes → 1
Public Holiday     No → 0, Yes → 1
Day of Week        Monday → 1, ..., Sunday → 7
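
One possible way to apply these replacements in R is sketched below. The data frame and the new column names (Traffic_Code, Peak_Code, Day_Code) are hypothetical; the level order is stated explicitly because R would otherwise sort the categories alphabetically.

```r
# Hypothetical rows covering the fields from the replacement table.
df_demo <- data.frame(
  Traffic_Condition = c("Low", "High", "Medium", "Low"),
  Peak_Hours = c("No", "Yes", "Yes", "No"),
  Day_of_Week = c("Monday", "Sunday", "Wednesday", "Friday")
)

# Ordered category: fix the level order, then take the integer codes (Low = 1, High = 3).
traffic_levels <- c("Low", "Medium", "High")
df_demo$Traffic_Code <- as.integer(factor(df_demo$Traffic_Condition, levels = traffic_levels))

# Yes/No flag: the comparison gives TRUE/FALSE, which converts to 1/0.
df_demo$Peak_Code <- as.integer(df_demo$Peak_Hours == "Yes")

# Day of week: the position in the ordered list gives Monday = 1, ..., Sunday = 7.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
df_demo$Day_Code <- match(df_demo$Day_of_Week, day_levels)

df_demo
```

Public Holiday can be encoded the same way as Peak Hours.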
4. Problem Statement of the Dataset
A problem statement is a concise description of an issue or challenge that needs to be addressed
through data analysis. It sets the direction for research and guides the investigation.

Given that our dataset pertains to ride-sharing services, the following problem statements could
be formulated:
• How do peak hours and traffic conditions impact ride fares?

Analyzing the relationship between ride fares and peak hours/traffic conditions can help
ride-sharing companies adjust pricing strategies accordingly.

• What factors influence the user rating of ride-sharing services?

Investigating whether fare amounts, ride distances, or vehicle types impact user ratings
can provide insights into service quality.

• Is there a correlation between ride distance and fare amount?

Understanding the relationship between ride distance and pricing structure can reveal
potential inefficiencies in fare calculation models.

• Which vehicle type is most commonly used in high-traffic conditions?

Analyzing vehicle type preferences in different traffic conditions can inform fleet
management decisions.
5. EDA on Given Dataset (Using R in Google Colab)
Load Data Set:
df <- read.csv("ridesharing.txt", sep=",", header=TRUE)
Check Data Structure:
str(df)

summary(df)
Identify Missing Values:
colSums(is.na(df))

Check on Categorical Variables:

table(df$Payment_Method)

table(df$Vehicle_Type)
Visualizations:

Bar Plot for Vehicle Type Distribution:

vehicle_counts <- table(df$Vehicle_Type)
bar_positions <- barplot(vehicle_counts, main="Vehicle Type Distribution", col="green",
ylim=c(0, max(vehicle_counts) * 1.2), xlab="Vehicle Type", ylab="Count")
text(bar_positions, vehicle_counts, labels=vehicle_counts, pos=3, cex=0.8)

Determining Most Used Vehicle Types in High-Traffic Conditions:

high_traffic_vehicle_counts <- table(df$Vehicle_Type[df$Traffic_Condition == "High"])
bar_positions <- barplot(high_traffic_vehicle_counts, main="Vehicle Type in High Traffic Conditions",
col="purple", ylim=c(0, max(high_traffic_vehicle_counts) * 1.2), xlab="Vehicle Type", ylab="Count")
text(bar_positions, high_traffic_vehicle_counts, labels=high_traffic_vehicle_counts, pos=3, cex=0.8)
Analyzing Peak Hours and Traffic Conditions Impact on Ride Fares:
boxplot(df$Fare_Amount_dollar ~ df$Traffic_Condition, main="Fare Amount by Traffic Condition",
col=c("red", "green", "yellow"), ylab="Fare Amount ($)", xlab="Traffic Condition")
means <- tapply(df$Fare_Amount_dollar, df$Traffic_Condition, mean, na.rm=TRUE)
text(1:length(means), means, labels=round(means, 2), pos=3, cex=0.8)
boxplot(df$Fare_Amount_dollar ~ df$Peak_Hours, main="Fare Amount by Peak Hours", col=c("blue",
"orange"), ylab="Fare Amount ($)", xlab="Peak Hours")
means_peak <- tapply(df$Fare_Amount_dollar, df$Peak_Hours, mean, na.rm=TRUE)
text(1:length(means_peak), means_peak, labels=round(means_peak, 2), pos=3, cex=0.8)
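
Because the problem statement asks about peak hours and traffic conditions together, it may also help to look at the mean fare for each combination of the two. The sketch below uses invented values; only the column names follow the ones used in this section.

```r
# Hypothetical fares for a few rides.
df_demo <- data.frame(
  Fare_Amount_dollar = c(10, 20, 30, 40, 50, 60),
  Traffic_Condition = c("Low", "Low", "High", "High", "Low", "High"),
  Peak_Hours = c("No", "Yes", "No", "Yes", "No", "Yes")
)

# Mean fare for every Peak_Hours x Traffic_Condition combination.
fare_means <- aggregate(Fare_Amount_dollar ~ Peak_Hours + Traffic_Condition,
                        data = df_demo, FUN = mean)

print(fare_means)
```

A table like this shows whether the effect of peak hours is similar across traffic conditions or differs between them.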

Comparing Ride Distance and Number of Rides by Traffic Condition:
boxplot(df$Ride_Distance_miles ~ df$Traffic_Condition, main="Ride Distance by Traffic Condition",
col=c("red", "green", "yellow"), ylab="Ride Distance (miles)", xlab="Traffic Condition")
means <- tapply(df$Ride_Distance_miles, df$Traffic_Condition, mean, na.rm=TRUE)
text(1:length(means), means, labels=round(means, 2), pos=3, cex=0.8)
traffic_counts <- table(df$Traffic_Condition)
bar_positions <- barplot(traffic_counts, main="Number of Rides by Traffic Condition",
col=c("red", "green", "yellow"), ylim=c(0, max(traffic_counts) * 1.2),
xlab="Traffic Condition", ylab="Number of Rides")
text(bar_positions, traffic_counts, labels=traffic_counts, pos=3, cex=0.8)
Understanding Factors Affecting User Ratings:
plot(df$Fare_Amount_dollar, df$User_Rating, main="Fare Amount vs User Rating",
xlab="Fare Amount ($)", ylab="User Rating", col="blue")

plot(df$Ride_Distance_miles, df$User_Rating, main="Ride Distance vs User Rating",
xlab="Ride Distance (miles)", ylab="User Rating", col="blue")

Examining Correlation Between Ride Distance and Fare Amount:

cor(df$Ride_Distance_miles, df$Fare_Amount_dollar)
plot(df$Ride_Distance_miles, df$Fare_Amount_dollar, main="Distance vs Fare",
xlab="Ride Distance (miles)", ylab="Fare Amount ($)", col="red")
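
To go one step beyond the correlation coefficient, a simple linear fit gives an estimate of how much the fare changes per mile. The sketch below uses made-up numbers rather than the ride-sharing data, and fitting a straight line is an assumption, not something the dataset is known to follow.

```r
# Hypothetical distances (miles) and fares (dollars).
distance <- c(1, 2, 3, 4, 5)
fare <- c(5.1, 6.9, 9.2, 10.8, 13.0)

# Pearson correlation, ignoring any incomplete pairs.
r <- cor(distance, fare, use = "complete.obs")

# Linear model: the intercept acts as a base fare, the slope as fare per mile.
fit <- lm(fare ~ distance)

r
coef(fit)
```

A correlation close to 1 together with a stable slope would support the idea that fares scale roughly linearly with distance.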
Histogram of Fare Amount:
hist_data <- hist(df$Fare_Amount_dollar, main="Distribution of Fare Amount", col="blue", plot=FALSE)
bar_positions <- barplot(hist_data$counts, names.arg=round(hist_data$breaks[-1], 2),
main="Distribution of Fare Amount", col="blue", ylim=c(0, max(hist_data$counts) * 1.2),
xlab="Fare Amount ($)", ylab="Frequency")
text(bar_positions, hist_data$counts, labels=hist_data$counts, pos=3, cex=0.8)

Google Colab Link:

https://colab.research.google.com/drive/1OS3tw1S3z3qKPGdFPwEtrrAiDP6ZZ4fd?usp=sharing

Takeaways:
• No missing data was found in the dataset.
• Motorcycles appear to be the most used vehicle type overall, but in high-traffic conditions
customers tend to prefer sedans.
• Fares can be higher not only during peak hours but also in low-traffic conditions.
• This is because customers tend to take more rides and travel further in low-traffic
conditions.
• Most 3-star ratings are given to moderately priced rides (neither too expensive nor
too cheap).
• The ratio of fare to ride distance is stable and does not fluctuate much.
• The majority of fares fall between $500 and $900.
• All of these takeaways are based on a single dataset covering a specific period of time.
