
Data Science Lab Group Submission

The document discusses the analysis of a ride-sharing dataset, highlighting the differences between various dataset formats (.csv, .txt, .xlsx) and methods for handling missing values. It categorizes fields in the dataset into numerical and categorical types, identifies redundant data, and proposes problem statements for analysis. The document also outlines exploratory data analysis (EDA) techniques used in R, including visualizations and key takeaways from the analysis.

Uploaded by

mredward eu

TEB2043: Data Science

Lab Group Submission


Wednesday (10am – 12pm)

Group Members:

Submission Date: 19/02/2025


1. Difference between Dataset Extensions
Different file formats are used to store and share datasets, each with its own advantages and
drawbacks. Some of the most common formats include:

.csv (Comma-Separated Values):

This is one of the most widely used formats for storing tabular data. Each line in the file represents
a row, with values separated by commas. It is highly compatible with data analysis tools such as
Python, R, Excel, and databases. However, it does not support multiple sheets, formatting, or
formulas.

.txt (Plain Text):

This is a basic text file that may contain structured or unstructured data. Structured text files often
use delimiters such as commas, tabs, or spaces to separate values. Unlike .csv files, a .txt file does
not have a standard format and requires additional processing to extract structured data.

.xlsx (Excel Spreadsheet):

An Office Open XML spreadsheet format developed by Microsoft, which allows for multiple sheets,
advanced formatting, formulas, and charts. This format is user-friendly and commonly used for manual
data entry and visualization, though it is less suited to large-scale data processing than .csv.

The dataset provided to our group is available in .txt and .xlsx formats. Since the .txt file is
comma-delimited, it can be read in the same way as a .csv file in R or Python. The dataset contains
ride-sharing data, including information about ride requests, locations, fares, traffic conditions,
and user ratings. Understanding the differences between these file formats helps in selecting an
efficient approach to analyzing the data.
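As a minimal sketch of the point above, the following R code writes a small made-up table to a comma-delimited .txt file, reads it back, and then saves and reloads it as a .csv. The data values and temporary file paths are illustrative assumptions, not part of the group's dataset. Only base R functions are used; reading the .xlsx version would need an add-on package such as readxl, which is not shown here.

```r
# Toy data standing in for the ride-sharing file (values are made up).
rides <- data.frame(
  Ride_ID = 1:3,
  Vehicle_Type = c("Sedan", "Motorcycle", "SUV"),
  Fare_Amount_dollar = c(12.5, 7.0, 20.25),
  stringsAsFactors = FALSE
)

# Write it out as a comma-delimited .txt file in a temporary location.
txt_path <- tempfile(fileext = ".txt")
write.table(rides, txt_path, sep = ",", row.names = FALSE, quote = FALSE)

# A delimited .txt file reads like a .csv once the separator is stated.
from_txt <- read.table(txt_path, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Saving with a .csv extension changes the file name, not the tabular content.
csv_path <- tempfile(fileext = ".csv")
write.csv(from_txt, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare the two data frames after the round trip.
print(all.equal(from_txt, from_csv))
```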
2. Filling In and Correcting Missing/Incomplete Values
In real-world datasets, missing or incomplete values are common due to human errors, technical
issues, or missing data collection points. Various techniques are used to handle these missing
values, depending on the nature of the data and the analysis requirements:

Removing Missing Values:

If only a small proportion of the data contains missing values, those rows or columns may be
removed. This approach works well when missing values are random and do not significantly
impact the dataset.

Imputation Techniques (to fill missing data):

• For Numerical Data: Missing values can be replaced using statistical measures such as
the mean, median, or mode. For example, missing fare amounts could be imputed using
the average fare of similar rides.

• For Categorical Data: The most common category (mode) can be used to fill in missing
values. Alternatively, missing values can be labeled as "Unknown" or "Not Available."

Prediction-Based Approaches:

More advanced methods use machine learning models to predict missing values based on other
variables in the dataset.

For our case, if missing data was encountered in the dataset, we would opt to use simpler
techniques such as dropping missing values or basic imputation with mean/median/mode.
To identify missing values in R, the following functions can be used:

any(is.na(df)) # Check if any missing values exist


colSums(is.na(df)) # Count missing values in each column
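The removal and basic imputation options described above could look roughly like this in R. The data frame `df_demo` and its values are made up for illustration, and using the median for fares and the mode for payment methods is just one reasonable choice, not a rule.

```r
# Toy data with gaps (values are made up for illustration).
df_demo <- data.frame(
  Fare_Amount_dollar = c(10, NA, 30, 20),
  Payment_Method = c("Cash", "Card", NA, "Card"),
  stringsAsFactors = FALSE
)

# Option 1: drop every row that contains at least one missing value.
df_dropped <- na.omit(df_demo)

# Option 2: impute. The numeric column gets the median of the observed values.
df_imputed <- df_demo
fare_median <- median(df_imputed$Fare_Amount_dollar, na.rm = TRUE)
df_imputed$Fare_Amount_dollar[is.na(df_imputed$Fare_Amount_dollar)] <- fare_median

# The categorical column gets the most frequent observed category (the mode).
# table() skips NA by default; which.max() takes the first entry if there is a tie.
method_counts <- table(df_imputed$Payment_Method)
method_mode <- names(method_counts)[which.max(method_counts)]
df_imputed$Payment_Method[is.na(df_imputed$Payment_Method)] <- method_mode

print(nrow(df_dropped))
print(df_imputed)
```

Dropping rows is simplest but shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.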

To identify missing values in Excel, the following method can be used:


3. Categorical and Numerical Fields in the Dataset
Data in a dataset can be broadly classified into two main types:

• Numerical Data: Represents measurable quantities that have mathematical significance.
Examples include ride distances, fare amounts, and user ratings.

• Categorical Data: Represents distinct categories or groups without inherent numerical
meaning. Examples include payment methods, vehicle types, and traffic conditions.

Classification of Fields in our Ride Sharing Dataset:

• Numerical Fields (7):

No.  Field Name             Data Type  Description
1    Latitude Pickup        num        The latitude coordinate of the pickup location.
2    Longitude Pickup       num        The longitude coordinate of the pickup location.
3    Latitude Dropoff       num        The latitude coordinate of the drop-off location.
4    Longitude Dropoff      num        The longitude coordinate of the drop-off location.
5    Ride Distance (miles)  num        The total distance traveled in miles from pickup to drop-off.
6    Fare Amount ($)        num        The total fare charged for the ride in dollars.
7    User Rating            int        The rating given by the passenger to the driver, typically on a scale from 1 to 5.

• Categorical Fields (11):

No.  Field Name         Data Type  Description
1    Ride ID            int        A unique identifier for each ride request.
2    Request Time       chr        The date and time when the ride was requested.
3    Payment Method     chr        The payment method.
4    Driver ID          chr        A unique identifier for the driver assigned to the ride.
5    Vehicle Type       chr        The type of vehicle used for the ride.
6    Traffic Condition  chr        The level of traffic congestion during the ride.
7    Peak Hours         chr        Whether the ride was taken during peak hours.
8    Day of Week        chr        The day of the week when the ride was requested.
9    Public Holiday     chr        Indicates if the ride was taken on a public holiday.
10   Pickup Location    chr        A combined value representing the latitude and longitude of the pickup point.
11   Dropoff Location   chr        A combined value representing the latitude and longitude of the drop-off point.
Redundant Data

There are certain fields that can be deemed redundant in this dataset. The redundant fields are:

1. Pickup Location – This field combines Latitude Pickup and Longitude Pickup, which
are already included as separate fields.

2. Dropoff Location – This field combines Latitude Dropoff and Longitude Dropoff,
which are already included separately.

These two fields can be dropped because they duplicate information already available in the
dataset.
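Before dropping a combined field, it can be worth confirming that it really matches the separate coordinate columns. The sketch below does this on made-up data. The column names with underscores and the "latitude, longitude" text format of the combined field are assumptions, since the exact format in the dataset is not shown here.

```r
# Toy data; the "lat, lon" text format of Pickup_Location is an assumption.
df_demo <- data.frame(
  Latitude_Pickup = c(3.15, 3.20),
  Longitude_Pickup = c(101.70, 101.65),
  Pickup_Location = c("3.15, 101.7", "3.2, 101.65"),
  stringsAsFactors = FALSE
)

# Split the combined text field back into numeric latitude and longitude.
parts <- strsplit(df_demo$Pickup_Location, ",", fixed = TRUE)
parsed_lat <- as.numeric(trimws(sapply(parts, `[`, 1)))
parsed_lon <- as.numeric(trimws(sapply(parts, `[`, 2)))

# The field is redundant only if both parts match the separate columns.
is_redundant <- isTRUE(all.equal(parsed_lat, df_demo$Latitude_Pickup)) &&
  isTRUE(all.equal(parsed_lon, df_demo$Longitude_Pickup))

# Drop the combined column only when the check passes.
df_clean <- df_demo
if (is_redundant) {
  df_clean <- df_demo[, names(df_demo) != "Pickup_Location", drop = FALSE]
}

print(is_redundant)
print(names(df_clean))
```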

Replacing Certain Values

To make the dataset easier to use in machine learning or statistical analysis, categorical
fields can be encoded as numbers:

Field              Replacement
Traffic Condition  Low → 1, Medium → 2, High → 3
Peak Hours         No → 0, Yes → 1
Public Holiday     No → 0, Yes → 1
Day of Week        Monday → 1, ..., Sunday → 7
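One possible way to apply the replacements listed above in R is sketched below on made-up data. It assumes the category labels are spelled exactly as in the table (for example "Low", "Yes", "Monday") and that columns use underscore names. Any value that does not match a label would become NA, so a quick check for NA after encoding is advisable.

```r
# Toy data; labels are assumed to be spelled as in the replacement table.
df_demo <- data.frame(
  Traffic_Condition = c("High", "Low", "Medium", "Low"),
  Peak_Hours = c("Yes", "No", "No", "Yes"),
  Day_of_Week = c("Monday", "Sunday", "Wednesday", "Friday"),
  stringsAsFactors = FALSE
)

# Ordered category: look each label up in a named vector.
traffic_map <- c(Low = 1, Medium = 2, High = 3)
df_demo$Traffic_Code <- unname(traffic_map[df_demo$Traffic_Condition])

# Binary category: TRUE/FALSE converted to 1/0.
df_demo$Peak_Code <- as.integer(df_demo$Peak_Hours == "Yes")

# Day of week: position of the label in a Monday-to-Sunday list.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
df_demo$Day_Code <- match(df_demo$Day_of_Week, day_levels)

print(df_demo)
```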
4. Problem Statement of the Dataset
A problem statement is a concise description of an issue or challenge that needs to be addressed
through data analysis. It sets the direction for research and guides the investigation.

Given that our dataset pertains to ride-sharing services, the following problem statements could
be formulated:
• How do peak hours and traffic conditions impact ride fares?

Analyzing the relationship between ride fares and peak hours/traffic conditions can help
ride-sharing companies adjust pricing strategies accordingly.

• What factors influence the user rating of ride-sharing services?

Investigating whether fare amounts, ride distances, or vehicle types impact user ratings
can provide insights into service quality.

• Is there a correlation between ride distance and fare amount?

Understanding the relationship between ride distance and pricing structure can reveal
potential inefficiencies in fare calculation models.

• Which vehicle type is most commonly used in high-traffic conditions?

Analyzing vehicle type preferences in different traffic conditions can inform fleet
management decisions.
5. EDA on Given Dataset (Using R in Google Colab)
Load Data Set:
df <- read.csv("ridesharing.txt", sep=",", header=TRUE)
Check Data Structure:
str(df)

summary(df)
Identify Missing Values:
colSums(is.na(df))

Check on Categorical Variables:

# Column names use underscores, consistent with the rest of this script
table(df$Payment_Method)

table(df$Vehicle_Type)
Visualizations:

Bar Plot for Vehicle Type Distribution:


vehicle_counts <- table(df$Vehicle_Type)
bar_positions <- barplot(vehicle_counts, main="Vehicle Type Distribution", col="green",
ylim=c(0, max(vehicle_counts) * 1.2), xlab="Vehicle Type", ylab="Count")
text(bar_positions, vehicle_counts, labels=vehicle_counts, pos=3, cex=0.8)

Determining Most Used Vehicle Types in High-Traffic Conditions:


high_traffic_vehicle_counts <- table(df$Vehicle_Type[df$Traffic_Condition == "High"])
bar_positions <- barplot(high_traffic_vehicle_counts, main="Vehicle Type in High Traffic Conditions",
col="purple", ylim=c(0, max(high_traffic_vehicle_counts) * 1.2), xlab="Vehicle Type", ylab="Count")
text(bar_positions, high_traffic_vehicle_counts, labels=high_traffic_vehicle_counts, pos=3, cex=0.8)
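The two bar plots above look at vehicle types overall and within high traffic separately. A cross-tabulation can show the same comparison in one table. The following is a small sketch on made-up data, so the counts and the resulting "top" vehicle type are illustrative only.

```r
# Toy data (made up); column names follow the underscore style used in this script.
df_demo <- data.frame(
  Vehicle_Type = c("Sedan", "Motorcycle", "Sedan", "SUV", "Motorcycle", "Motorcycle"),
  Traffic_Condition = c("High", "Low", "High", "Low", "High", "Low"),
  stringsAsFactors = FALSE
)

# Rows are vehicle types, columns are traffic conditions.
counts <- table(df_demo$Vehicle_Type, df_demo$Traffic_Condition)

# Share of each vehicle type within each traffic condition (columns sum to 1).
shares <- prop.table(counts, margin = 2)

# Vehicle type with the highest count in the "High" column.
top_high <- names(which.max(counts[, "High"]))

print(counts)
print(round(shares, 2))
print(top_high)
```

Comparing shares rather than raw counts helps when the traffic conditions have very different numbers of rides.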
Analyzing Peak Hours and Traffic Conditions Impact on Ride Fares:
# Groups are ordered alphabetically (High, Low, Medium), so the colours map to
# High = red, Low = green, Medium = yellow, matching the later traffic plots.
boxplot(df$Fare_Amount_dollar ~ df$Traffic_Condition, main="Fare Amount by Traffic Condition",
col=c("red", "green", "yellow"), ylab="Fare Amount ($)", xlab="Traffic Condition")
# Label each box with its group mean
means <- tapply(df$Fare_Amount_dollar, df$Traffic_Condition, mean, na.rm=TRUE)
text(seq_along(means), means, labels=round(means, 2), pos=3, cex=0.8)

boxplot(df$Fare_Amount_dollar ~ df$Peak_Hours, main="Fare Amount by Peak Hours",
col=c("blue", "orange"), ylab="Fare Amount ($)", xlab="Peak Hours")
means_peak <- tapply(df$Fare_Amount_dollar, df$Peak_Hours, mean, na.rm=TRUE)
text(seq_along(means_peak), means_peak, labels=round(means_peak, 2), pos=3, cex=0.8)
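The boxplots above treat traffic condition and peak hours one at a time. To see how the two factors combine, the mean fare can be computed for every pair of values. This is a sketch on made-up numbers, so the figures say nothing about the real dataset.

```r
# Toy data (made up) with both grouping factors.
df_demo <- data.frame(
  Fare_Amount_dollar = c(10, 20, 30, 40, 50, 60),
  Traffic_Condition = c("Low", "Low", "High", "High", "Low", "High"),
  Peak_Hours = c("No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Mean fare for each combination of traffic condition and peak-hour flag.
fare_summary <- aggregate(
  Fare_Amount_dollar ~ Traffic_Condition + Peak_Hours,
  data = df_demo,
  FUN = mean
)

print(fare_summary)
```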

Comparing Ride Distance and Number of Rides by Traffic Condition:
boxplot(df$Ride_Distance_miles ~ df$Traffic_Condition, main="Ride Distance by Traffic Condition",
col=c("red", "green", "yellow"), ylab="Ride Distance (miles)", xlab="Traffic Condition")
means <- tapply(df$Ride_Distance_miles, df$Traffic_Condition, mean, na.rm=TRUE)
text(seq_along(means), means, labels=round(means, 2), pos=3, cex=0.8)

traffic_counts <- table(df$Traffic_Condition)
bar_positions <- barplot(traffic_counts, main="Number of Rides by Traffic Condition",
col=c("red", "green", "yellow"), ylim=c(0, max(traffic_counts) * 1.2),
xlab="Traffic Condition", ylab="Number of Rides")
text(bar_positions, traffic_counts, labels=traffic_counts, pos=3, cex=0.8)
Understanding Factors Affecting User Ratings:
plot(df$Fare_Amount_dollar, df$User_Rating, main="Fare Amount vs User Rating",
xlab="Fare Amount ($)", ylab="User Rating", col="blue")

plot(df$Ride_Distance_miles, df$User_Rating, main="Ride Distance vs User Rating",
xlab="Ride Distance (miles)", ylab="User Rating", col="blue")
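Because ratings take only a few integer values, points in the scatter plots above stack on the same horizontal lines and can be hard to read. A numeric summary per rating level may complement them. The sketch below uses made-up data; choosing Spearman's rank correlation for the ordinal rating is an assumption about what suits this data, not the method used in the original analysis.

```r
# Toy data (made up): fares and integer ratings.
df_demo <- data.frame(
  Fare_Amount_dollar = c(5, 8, 12, 15, 20, 25),
  User_Rating = c(2L, 3L, 3L, 4L, 4L, 5L)
)

# Average fare for each rating level.
mean_fare_by_rating <- tapply(df_demo$Fare_Amount_dollar, df_demo$User_Rating, mean)

# Rank-based correlation, which only uses the ordering of the values.
rho <- cor(df_demo$Fare_Amount_dollar, df_demo$User_Rating, method = "spearman")

print(mean_fare_by_rating)
print(rho)
```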

Examining Correlation Between Ride Distance and Fare Amount:


cor(df$Ride_Distance_miles, df$Fare_Amount_dollar)
plot(df$Ride_Distance_miles, df$Fare_Amount_dollar, main="Distance vs Fare",
xlab="Ride Distance (miles)", ylab="Fare Amount ($)", col="red")
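The correlation coefficient measures how closely fare and distance move together, but not how much the fare changes per mile. A simple linear fit gives that as well. The following sketch uses made-up values; interpreting the intercept as a base fare and the slope as a per-mile rate is an illustrative assumption about the pricing structure.

```r
# Toy data (made up): distance in miles and fare in dollars.
df_demo <- data.frame(
  Ride_Distance_miles = c(1, 2, 3, 4, 5),
  Fare_Amount_dollar = c(5.1, 6.9, 9.2, 10.8, 13.0)
)

# Pearson correlation between distance and fare.
r <- cor(df_demo$Ride_Distance_miles, df_demo$Fare_Amount_dollar)

# Least-squares line: fare = intercept + slope * distance.
fit <- lm(Fare_Amount_dollar ~ Ride_Distance_miles, data = df_demo)
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

# Fare per mile for each ride, to see how stable the ratio is.
fare_per_mile <- df_demo$Fare_Amount_dollar / df_demo$Ride_Distance_miles

print(r)
print(c(intercept = intercept, slope = slope))
print(round(fare_per_mile, 2))
```

To overlay the fitted line on an existing scatter plot of the same variables, `abline(fit)` can be called after `plot()`.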
Histogram of Fare Amount:
# Compute the bins only; titles and colours are set in barplot() below
hist_data <- hist(df$Fare_Amount_dollar, plot=FALSE)
# Each bar is labelled with the upper edge of its bin
bar_positions <- barplot(hist_data$counts, names.arg=round(hist_data$breaks[-1], 2),
main="Distribution of Fare Amount", col="blue", ylim=c(0, max(hist_data$counts) * 1.2),
xlab="Fare Amount ($)", ylab="Frequency")
text(bar_positions, hist_data$counts, labels=hist_data$counts, pos=3, cex=0.8)
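To put a number on where most fares fall, the share of rides inside a chosen fare range can be computed directly alongside the histogram. The sketch below uses made-up fares, and the 500 to 900 range and the 200-wide bins are illustrative assumptions.

```r
# Toy fares (made up).
fares <- c(450, 520, 610, 700, 880, 910, 1200)

# Share of fares inside a chosen range (bounds included).
in_range <- fares >= 500 & fares <= 900
share_in_range <- mean(in_range)

# Counts per fixed-width bin; each bin excludes its lower edge and includes its upper edge.
bin_counts <- table(cut(fares, breaks = seq(400, 1200, by = 200)))

print(sum(in_range))
print(round(share_in_range, 2))
print(bin_counts)
```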

Google Colab Link:


https://colab.research.google.com/drive/1OS3tw1S3z3qKPGdFPwEtrrAiDP6ZZ4fd?usp=sharing

Takeaways:
• No missing data was found in the dataset.
• Motorcycles appear to be the most used vehicle type overall, but in high-traffic conditions
customers tend to prefer sedans.
• Fares tend to be higher not only during peak hours but also in low-traffic conditions.
• A likely reason is that customers take more rides and travel further when traffic is low.
• Most 3-star ratings are given for moderately priced rides (neither very expensive nor very
cheap).
• The ratio of fare to ride distance is stable and does not fluctuate much.
• The majority of fares fall between $500 and $900.
• All of these takeaways are based on a single dataset covering a specific period of time, so
they may not generalize beyond it.
