Data Science Lab Group Submission
Group Members:
The .csv (comma-separated values) format is one of the most widely used formats for storing
tabular data. Each line in the file represents a row, with values separated by commas. It is highly
compatible with data analysis tools such as Python, R, Excel, and databases. However, it does not
support multiple sheets, formatting, or formulas.
A .txt file is a basic text file that may contain structured or unstructured data. Structured text
files often use delimiters such as commas, tabs, or spaces to separate values. Unlike .csv files, a
.txt file has no standard layout, so the delimiter and header must be identified before the data can
be read as a table.
The .xlsx format was developed by Microsoft for Excel and allows for multiple sheets, advanced
formatting, formulas, and charts. This format is user-friendly and commonly used for manual data
entry and visualization, though it is less suited to large-scale data processing than .csv.
The dataset provided to our group is available in .txt and .xlsx formats. Because the .txt file is
comma-delimited, it can be read in the same way as a .csv file for easier processing in R or
Python. The dataset contains ride-sharing data, including information about ride requests,
locations, fares, traffic conditions, and user ratings. Understanding the differences between these
file formats is important for choosing an efficient way to load and analyze the data.
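As a minimal sketch of reading a comma-delimited .txt file as a table, the example below writes a few made-up rows to a temporary file and reads them back with base R. The file contents and the Ride_ID column are hypothetical stand-ins, not the actual dataset.

```r
# Hypothetical sample mimicking a comma-delimited .txt file (not the real data)
sample_lines <- c(
  "Ride_ID,Ride_Distance_miles,Traffic_Condition",
  "1,4.2,Low",
  "2,7.9,High",
  "3,3.1,Medium"
)
txt_path <- tempfile(fileext = ".txt")
writeLines(sample_lines, txt_path)

# read.csv looks only at the delimiter, not at the file extension
rides <- read.csv(txt_path, sep = ",", header = TRUE)

# Optionally save a .csv copy for tools that expect that extension
csv_path <- tempfile(fileext = ".csv")
write.csv(rides, csv_path, row.names = FALSE)

str(rides)
```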
2. Filling In and Correcting Missing/Incomplete Values
In real-world datasets, missing or incomplete values are common due to human error, technical
issues, or gaps in data collection. Several techniques can be used to handle these missing
values, depending on the nature of the data and the analysis requirements:
The simplest option is deletion. If only a small proportion of the data contains missing values,
the affected rows or columns may be removed. This approach works well when values are missing
at random and removing them does not noticeably change the dataset.
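A minimal sketch of removing rows with missing values in base R is shown here. The data frame is a small hypothetical example, not the actual dataset.

```r
# Hypothetical data frame with a few missing values (not the real dataset)
rides <- data.frame(
  Fare_Amount_dollar  = c(520, NA, 760, 610),
  Ride_Distance_miles = c(4.2, 7.9, NA, 5.0),
  User_Rating         = c(3, 4, 5, 3)
)

# Drop every row that has at least one missing value
rides_complete <- na.omit(rides)

# Drop only the rows where one specific column is missing
rides_with_fare <- rides[!is.na(rides$Fare_Amount_dollar), ]

nrow(rides_complete)   # rows with no missing values at all
nrow(rides_with_fare)  # rows with a known fare
```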
Another option is imputation, where missing values are replaced with estimates:
• For Numerical Data: Missing values can be replaced using statistical measures such as
the mean, median, or mode. For example, missing fare amounts could be imputed using
the average fare of similar rides.
• For Categorical Data: The most common category (mode) can be used to fill in missing
values. Alternatively, missing values can be labeled as "Unknown" or "Not Available."
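The two imputation cases above could be sketched in base R as follows. The data values are hypothetical, and the median is used for the numerical column purely as an illustration.

```r
# Hypothetical data frame (not the real dataset)
rides <- data.frame(
  Fare_Amount_dollar = c(500, NA, 700, 600),
  Vehicle.Type       = c("Sedan", "Motorcycle", NA, "Motorcycle"),
  stringsAsFactors   = FALSE
)

# Numerical column: replace NA with the median of the observed values
fare_median <- median(rides$Fare_Amount_dollar, na.rm = TRUE)
rides$Fare_Amount_dollar[is.na(rides$Fare_Amount_dollar)] <- fare_median

# Categorical column: replace NA with the most frequent category (mode)
type_counts <- table(rides$Vehicle.Type)  # table() excludes NA by default
type_mode   <- names(type_counts)[which.max(type_counts)]
rides$Vehicle.Type[is.na(rides$Vehicle.Type)] <- type_mode

rides
```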
Prediction-Based Approaches:
More advanced methods use machine learning models to predict missing values based on other
variables in the dataset.
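As one possible illustration of a prediction-based approach, the sketch below fills a missing fare using a simple linear regression on ride distance. The data are made up, and the choice of a linear model is an assumption made for the example rather than a method applied to the actual dataset.

```r
# Hypothetical data: the last fare is missing (not the real dataset)
rides <- data.frame(
  Ride_Distance_miles = c(2, 4, 6, 8, 10, 5),
  Fare_Amount_dollar  = c(210, 390, 610, 790, 1000, NA)
)

# Fit on rows with a known fare (lm drops rows containing NA by default)
fit <- lm(Fare_Amount_dollar ~ Ride_Distance_miles, data = rides)

# Predict the fare only for the rows where it is missing
missing_fare <- is.na(rides$Fare_Amount_dollar)
rides$Fare_Amount_dollar[missing_fare] <-
  predict(fit, newdata = rides[missing_fare, , drop = FALSE])

rides
```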
In our case, if missing data were encountered in the dataset, we would opt for simpler
techniques such as dropping the affected rows or basic imputation with the mean, median, or mode.
To identify missing values in R, the following functions can be used:
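A minimal sketch using base R functions commonly used for this purpose (is.na, anyNA, colSums, complete.cases), applied to a small hypothetical data frame rather than the actual dataset:

```r
# Hypothetical data frame with missing values (not the real dataset)
rides <- data.frame(
  Fare_Amount_dollar = c(520, NA, 760),
  User_Rating        = c(3, NA, 5),
  Traffic_Condition  = c("Low", "High", NA)
)

# Is there any missing value at all?
has_missing <- anyNA(rides)

# Number of missing values in each column
na_per_column <- colSums(is.na(rides))

# Number of rows that contain at least one missing value
incomplete_rows <- sum(!complete.cases(rides))

# Row positions where a specific column is missing
missing_fare_rows <- which(is.na(rides$Fare_Amount_dollar))

na_per_column
```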
Two fields in this dataset can be considered redundant:
1. Pickup Location – This field combines Latitude Pickup and Longitude Pickup, which
are already included as separate fields.
2. Dropoff Location – This field combines Latitude Dropoff and Longitude Dropoff,
which are already included separately.
These two fields can be dropped because they duplicate information already available in the
dataset.
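Dropping the two combined location fields could look like the sketch below. The column names and coordinate values are assumptions made for illustration; the actual names depend on how the file's header is read.

```r
# Hypothetical data frame; column names are assumed, not taken from the file
rides <- data.frame(
  Pickup.Location   = c("(40.71, -74.00)", "(40.73, -73.99)"),
  Latitude.Pickup   = c(40.71, 40.73),
  Longitude.Pickup  = c(-74.00, -73.99),
  Dropoff.Location  = c("(40.76, -73.98)", "(40.70, -74.01)"),
  Latitude.Dropoff  = c(40.76, 40.70),
  Longitude.Dropoff = c(-73.98, -74.01)
)

# Keep every column except the two redundant combined fields
redundant     <- c("Pickup.Location", "Dropoff.Location")
rides_reduced <- rides[, !(names(rides) %in% redundant), drop = FALSE]

names(rides_reduced)
```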
To make the dataset easier to use in machine learning or statistical analysis, categorical
fields can be encoded as numbers:
Field                Replacement
Traffic Condition    Low → 1, Medium → 2, High → 3
Peak Hours           No → 0, Yes → 1
Public Holiday       No → 0, Yes → 1
Day of Week          Monday → 1, ..., Sunday → 7
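The replacements in the table could be applied in base R roughly as follows. The column names Peak_Hours, Public_Holiday, and Day_of_Week are assumed for illustration, and the rows are made up.

```r
# Hypothetical data frame; column names other than Traffic_Condition are assumed
rides <- data.frame(
  Traffic_Condition = c("Low", "High", "Medium"),
  Peak_Hours        = c("Yes", "No", "Yes"),
  Public_Holiday    = c("No", "No", "Yes"),
  Day_of_Week       = c("Monday", "Sunday", "Wednesday"),
  stringsAsFactors  = FALSE
)

# Ordered categories: the position in the vector is the numeric code
traffic_levels <- c("Low", "Medium", "High")
day_levels     <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")
rides$Traffic_Condition <- match(rides$Traffic_Condition, traffic_levels)
rides$Day_of_Week       <- match(rides$Day_of_Week, day_levels)

# Yes/No fields: No becomes 0, Yes becomes 1
rides$Peak_Hours     <- as.integer(rides$Peak_Hours == "Yes")
rides$Public_Holiday <- as.integer(rides$Public_Holiday == "Yes")

rides
```

Using match() has the side effect that any unexpected spelling becomes NA, which makes data-entry inconsistencies easy to spot after encoding.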
4. Problem Statement of the Dataset
A problem statement is a concise description of an issue or challenge that needs to be addressed
through data analysis. It sets the direction for research and guides the investigation.
Given that our dataset pertains to ride-sharing services, the following problem statements could
be formulated:
• How do peak hours and traffic conditions impact ride fares?
Analyzing the relationship between ride fares and peak hours/traffic conditions can help
ride-sharing companies adjust pricing strategies accordingly.
• What factors influence user ratings?
Investigating whether fare amounts, ride distances, or vehicle types impact user ratings
can provide insights into service quality.
• How does ride distance relate to fare amount?
Understanding the relationship between ride distance and pricing structure can reveal
potential inefficiencies in fare calculation models.
• Which vehicle types do customers prefer under different traffic conditions?
Analyzing vehicle type preferences in different traffic conditions can inform fleet
management decisions.
5. EDA on Given Dataset (Using R in Google Colab)
Load Data Set:
df <- read.csv("ridesharing.txt", sep=",", header=TRUE)
Check Data Structure:
str(df)
summary(df)
Identify Missing Values:
colSums(is.na(df))
table(df$Vehicle.Type)
Visualizations:
Comparing Ride Distance and Number of Rides by Traffic Condition:
# Distribution of ride distance for each traffic condition
boxplot(df$Ride_Distance_miles ~ df$Traffic_Condition,
        main="Ride Distance by Traffic Condition",
        col=c("red", "green", "yellow"),
        xlab="Traffic Condition", ylab="Ride Distance (miles)")
# Label each box with its group mean
means <- tapply(df$Ride_Distance_miles, df$Traffic_Condition, mean, na.rm=TRUE)
text(seq_along(means), means, labels=round(means, 2), pos=3, cex=0.8)
# Number of rides for each traffic condition, with counts above the bars
traffic_counts <- table(df$Traffic_Condition)
bar_positions <- barplot(traffic_counts, main="Number of Rides by Traffic Condition",
        col=c("red", "green", "yellow"), ylim=c(0, max(traffic_counts) * 1.2),
        xlab="Traffic Condition", ylab="Number of Rides")
text(bar_positions, traffic_counts, labels=traffic_counts, pos=3, cex=0.8)
Understanding Factors Affecting User Ratings:
plot(df$Fare_Amount_dollar, df$User_Rating, main="Fare Amount vs User Rating",
     xlab="Fare Amount ($)", ylab="User Rating", col="blue")
Takeaways:
• No missing data was found in the dataset
• Motorcycles seem to be the overall favourite for ride sharing, but when traffic is heavy,
customers tend to prefer sedans
• Fares can be higher not only during peak hours but also when traffic is light.
• This is because customers tend to take more rides and travel further when traffic is
light.
• Most 3-star ratings are given for moderately priced rides (neither too expensive nor
too cheap)
• The ratio of fares to ride distance is stable and doesn’t fluctuate much
• The majority of fares fall between $500 and $900.
• All the takeaways are based on data from a specific period of time and may not hold
outside it.
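Some of the takeaways above, such as vehicle preference by traffic condition and the stability of the fare-to-distance ratio, could be checked along the following lines. This is a sketch on made-up rows that reuses the column names from the EDA code; it is not the analysis that produced the takeaways.

```r
# Hypothetical rows reusing the EDA column names (not the real dataset)
rides <- data.frame(
  Vehicle.Type        = c("Motorcycle", "Sedan", "Motorcycle",
                          "Sedan", "Motorcycle", "Sedan"),
  Traffic_Condition   = c("Low", "High", "Medium", "High", "Low", "Low"),
  Fare_Amount_dollar  = c(600, 800, 500, 900, 700, 650),
  Ride_Distance_miles = c(6, 8, 5, 9, 7, 6.5)
)

# Vehicle preference: count of rides per vehicle type and traffic condition
vehicle_by_traffic <- table(rides$Vehicle.Type, rides$Traffic_Condition)

# Average fare for each traffic condition
mean_fare <- tapply(rides$Fare_Amount_dollar, rides$Traffic_Condition, mean)

# Fare-to-distance ratio: a narrow spread suggests stable pricing per mile
fare_per_mile <- rides$Fare_Amount_dollar / rides$Ride_Distance_miles

vehicle_by_traffic
mean_fare
summary(fare_per_mile)
```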