DMDS mini project final


KAMALA EDUCATION SOCIETY

PRATIBHA COLLEGE OF COMMERCE AND COMPUTER STUDIES

Academic year 2023-2024

TY BCA(SCI)

MINI PROJECT

Project Title: Uber Data Analysis through Visualisations in R.

Team members: 10

191_Karuna Jadhav

193_Kirti More

194_Jyoti Patare

195_Manisha Munkampalli

196_Priyanka Patil

197_Vaishnavi More

198_Trupti Shinde

199_Yogita Harde

206_Rameshwari Kambale

207_Gauri Kadam

Project Guide Name: Ms. Rutuja Chavan

Project Guide Signature:

INDEX

Sr No Content

1. Introduction

2. Problem Area / Problem Statement

3. Data Set Information about CSV File

4. Pre-processing

5. Algorithm Type

6. Output Prediction

7. Conclusion

1. Introduction
In this project, we explore data analysis through visualisations in R in order to extract
insights from Uber ride data. We applied ideas of data visualisation with Tableau and R
packages (ggplot2, dplyr, and tidyr) to help comprehend complex data and obtain business
insights. Our goal is to improve decision-making and service innovations by gaining a deeper
understanding of user behaviour, trends, and patterns connected to Uber rides through
visualisation and exploration.

2. Problem Area / Problem Statement

Ride-sharing services such as Uber produce a significant amount of data about journeys,
driver activities, and customer interactions in the context of urban mobility and
transportation. This data contains insightful information that has the potential to greatly
impact operational strategy, increase customer satisfaction, and improve service efficiency.
Without strong analytical and visual aids, it can be difficult to glean information from such
vast and intricate datasets.

The challenge at hand is to use R programming language visualisation techniques to make use
of Uber's vast data in order to identify important trends and insights that would otherwise be
hard to identify. The following topics are the focus of the analysis:

1. Temporal and Spatial Patterns: Identify and visualize trends related to ride requests,
such as peak hours, day-of-week variations, and high-demand geographic areas.
Understanding these patterns can help in optimizing driver deployment and improving
service availability.
2. Operational Efficiency: Assess and visualize metrics related to trip duration, wait
times, and driver utilization. This analysis will highlight operational bottlenecks,
inefficiencies, and areas where service improvements are needed.
3. Customer Behaviour Analysis: Explore and visualize patterns in customer behaviour,
including ride frequency, spending habits, and demographic characteristics. Insights
into these patterns can inform targeted marketing strategies and personalized service
offerings.
4. Demand Forecasting: Use historical data to visualize trends and forecast future
demand for rides. Accurate demand forecasts can guide resource planning and ensure
a balanced supply of drivers and vehicles.
5. Driver Performance Metrics: Visualize key performance indicators for drivers, such
as ratings, completion rates, and earnings. This can help in understanding driver
performance and identifying factors that contribute to high or low performance.
6. Overall Service Quality: Generate visualizations that aggregate various metrics to
provide a comprehensive view of overall service quality and user satisfaction.

By employing R for data visualization, this analysis seeks to transform raw Uber data into
clear, actionable insights that support strategic decision-making, operational improvements,
and enhanced customer experiences.

3. Data Set Information about CSV File
The dataset used for analysis is a CSV file containing Uber ride data. It covers rides from
April to September 2014 and includes fields like:

• Date and Time of Ride


• Pickup and Drop-off Locations (latitude and longitude)
• Base Fare
• Distance Travelled
• Rider ID
• Driver ID

4. Pre-processing
For pre-processing the data, the following steps are performed:

• Handling missing values: Identify and address missing data points in the dataset, either by
imputing values or removing records.
• Data format conversion: Convert date and time strings into appropriate date-time formats to
enable time-based analysis.
• Feature extraction: Derive additional features like day of the week and hour of the day from the
date-time information to facilitate trend analysis.
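
The three steps above can be sketched in R as follows. This is a minimal illustration on a made-up three-row data frame, not the project's actual script: the column name Date.Time and the month/day/year timestamp format are assumptions, while Lat, Lon and Base follow the column names used in the plotting code later in this report.

```r
library(lubridate)

# Toy stand-in for the raw CSV contents (values are invented for illustration).
raw <- data.frame(
  Date.Time = c("4/1/2014 0:11:00", "4/1/2014 17:45:00", NA),
  Lat  = c(40.7690, 40.7267, 40.7316),
  Lon  = c(-73.9549, -74.0345, -73.9873),
  Base = c("B02512", "B02512", "B02598"),
  stringsAsFactors = FALSE
)

# 1. Handle missing values: here incomplete records are simply dropped.
clean <- raw[complete.cases(raw), ]

# 2. Convert the date-time strings into POSIXct values.
clean$Date.Time <- mdy_hms(clean$Date.Time)

# 3. Derive features for trend analysis.
clean$day       <- factor(day(clean$Date.Time))
clean$month     <- factor(month(clean$Date.Time, label = TRUE))
clean$dayofweek <- factor(wday(clean$Date.Time, label = TRUE))
clean$hour      <- factor(hour(clean$Date.Time))
```

Dropping rows is the simplest choice for missing values; imputing them instead would need a rule suited to each column.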

5. Algorithm Type
In this analysis project, the primary focus is on visual exploratory analysis rather than
algorithmic modelling. Therefore, the project may not require specific algorithmic steps like
clustering or classification. Instead, the emphasis is on utilizing R's visualization libraries to
create insightful plots and graphs.

6. Output Prediction
Since this project is more about exploration and insights, the "output prediction" might not be
in the traditional sense of predictive modelling. Instead, the project's outcome would be a
collection of informative visualizations and graphs that highlight trends such as:

• Peak ride demand times during the day or week.
• Visualization of ride density at different locations.
• Trends in fare and distance over time.
• Analysis of ride behaviour for specific user segments (e.g., weekdays vs. weekends).

These visualizations would provide Uber with actionable insights to enhance
operational efficiency and improve the overall rider experience.

1) Importing the Essential Packages
In the first step of our R project, we import the essential packages that we will use in this
Uber data analysis project. Some of the important R libraries that we will use are:

1. ggplot2- It is widely used for creating aesthetic visualization plots and is one of
the most popular and powerful data visualization packages in R. Developed by Hadley
Wickham, it is based on the Grammar of Graphics, a theoretical framework for data
visualization that provides a structured approach to creating a wide variety of plots.

2. ggthemes- The ggthemes package in R extends the functionality of ggplot2 by providing
additional themes, scales, and geoms that enhance the visual appeal and customization of
plots.

3. lubridate- The lubridate package in R is designed to simplify date and time manipulation.
It provides a set of functions that make it easier to parse, manipulate, and perform
operations on date-time objects. This is particularly useful in data analysis where handling
date and time information accurately is crucial.

4. tidyr- The tidyr package in R is designed to help users tidy their data, making it easier to
analyze and visualize. It provides a set of functions for reshaping and tidying data frames,
aligning with the principles of tidy data. Tidy data means each variable is a column, each
observation is a row, and each type of observational unit forms a table.

5. DT- The DT package in R is an interface to the DataTables JavaScript library, which is a
powerful tool for creating interactive and feature-rich tables. It provides a way to display data
frames in a web-based table format that supports features such as sorting, filtering, and
pagination.

6. scales- The scales package in R is designed to provide functionality for scaling and
formatting various aspects of data visualizations, particularly when using ggplot2. It helps
in transforming and customizing the appearance of axis labels, tick marks, and other
scaling-related features.
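
A possible way to load these packages is sketched below. It is an illustration rather than the project's own code: dplyr is included because the introduction lists it among the packages used, and the check for uninstalled packages is an added convenience.

```r
# Packages described above, plus dplyr (mentioned in the introduction).
pkgs <- c("ggplot2", "ggthemes", "lubridate", "dplyr", "tidyr", "DT", "scales")

# Find out which of them are not installed yet.
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  # Report what is missing instead of failing halfway through.
  message("Install these packages first: ", paste(not_installed, collapse = ", "))
} else {
  # Attach every package so that ggplot(), comma, theme_map() etc. are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```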

2) Creating vector of colors to be implemented in our plots

In this step of the project, we create a vector of colors that will be used in our plotting
functions. You can also select your own set of colors.
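
The report does not list the colors that were actually used, so the vector below is only an example. Seven values are assumed because the vector is later passed to scale_fill_manual() for a fill mapped to the day of the week; the hex codes themselves are arbitrary.

```r
# Seven hex colors, one per day of the week (values chosen for illustration only).
colors <- c("#CC1011", "#665555", "#05A399", "#CFCACA",
            "#F5E840", "#0683C9", "#E075B0")
```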

3) Reading the Data into their designated variables


Now, we read several CSV files that contain the data from April 2014 to September 2014.
We store these in corresponding data frames like apr_data, may_data, etc. After we have
read the files, we combine all of this data into a single data frame called 'data_2014'.
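
The reading and combining step could look like the sketch below. To keep it runnable without the real data, it first writes two one-row stand-in files to a temporary folder; the file names, the Date.Time column name and the sample values are hypothetical, and only two of the six months are shown.

```r
# Write two tiny stand-in CSV files (hypothetical names and invented values).
apr_file <- file.path(tempdir(), "uber-apr-sample.csv")
may_file <- file.path(tempdir(), "uber-may-sample.csv")
write.csv(data.frame(Date.Time = "4/1/2014 0:11:00", Lat = 40.7690,
                     Lon = -73.9549, Base = "B02512"),
          apr_file, row.names = FALSE)
write.csv(data.frame(Date.Time = "5/1/2014 0:02:00", Lat = 40.7521,
                     Lon = -73.9914, Base = "B02598"),
          may_file, row.names = FALSE)

# Read each month into its own data frame.
apr_data <- read.csv(apr_file)
may_data <- read.csv(may_file)

# Stack the monthly data frames; rbind() needs identical column names in each.
data_2014 <- rbind(apr_data, may_data)
```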

4) Plotting the trips by the hours in a day


In the next step of the R project, we use the ggplot function to plot the number of trips that
passengers made in each hour of the day.

Code:

ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity", fill = "ivory", color = "brown") +
  ggtitle("Trips Every Hour") +
  theme(legend.position = "none") +
  scale_y_continuous(labels = comma)
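
The plot above expects a summary table hour_data with the columns hour and Total, but the report does not show how it was built. One plausible construction with dplyr is sketched here on invented data; the data frame name trips is a hypothetical stand-in for the combined data with the hour already extracted.

```r
library(dplyr)

# Toy data: one row per trip, hour of pickup already extracted (invented values).
trips <- data.frame(hour = c(0, 8, 8, 17, 17, 17))

# Count trips per hour; n() returns the number of rows in each group.
hour_data <- trips %>%
  group_by(hour) %>%
  summarize(Total = n())
```

Because Total is pre-computed, the plot uses stat = "identity" so that bar heights are taken directly from this column.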

Code:

ggplot(month_hour, aes(hour, Total, fill = month)) +
  geom_bar(stat = "identity") +
  ggtitle("Trips by Hour and Month") +
  scale_y_continuous(labels = comma)

5) Plotting data by trips during every day of the month

We observe from the resulting visualization that the 30th of the month had the highest number
of trips over the period, which is mostly contributed by the month of April.

Code:

ggplot(day_group, aes(day, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Day") +
  theme(legend.position = "none") +
  scale_y_continuous(labels = comma)

Code:

ggplot(month_weekday, aes(month, Total, fill = dayofweek)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Trips by Day and Month") +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = colors)

6) Trips by bases
In this section, we find the number of trips made from each base.

Code:

ggplot(data_2014, aes(Base)) +
  geom_bar(fill = "gold") +
  scale_y_continuous(labels = comma) +
  ggtitle("Trips by Bases")
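
Unlike the earlier plots, this one calls geom_bar() without stat = "identity", so ggplot2 counts the rows for each Base itself. The same counts can be checked numerically with base R, as in this small sketch; the data frame trips and its values are invented for illustration.

```r
# Toy data: one row per trip with its dispatching base (invented values).
trips <- data.frame(Base = c("B02512", "B02598", "B02598", "B02617"))

# Tabulate trips per base; this matches the bar heights geom_bar() would draw.
base_counts <- as.data.frame(table(Base = trips$Base), responseName = "Total")
```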

7) Plotting the Heat map


Code:

ggplot(day_and_hour, aes(day, hour, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Hour and Day")
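
A heat map needs one row per cell, here one row for every combination of day and hour, with the trip count in Total. The report does not show how day_and_hour was produced; a possible dplyr construction is sketched below on invented data, with trips as a hypothetical stand-in for the combined data frame.

```r
library(dplyr)

# Toy data: one row per trip with day of month and hour (invented values).
trips <- data.frame(day  = c(1, 1, 1, 2, 2),
                    hour = c(8, 8, 17, 8, 17))

# Group by both keys so that each (day, hour) pair becomes one heat map cell.
day_and_hour <- trips %>%
  group_by(day, hour) %>%
  summarize(Total = n(), .groups = "drop")
```

Grouping by a different pair of columns in the same way would give the summary tables needed for heat maps over other dimensions.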

Code:

ggplot(month_weekday, aes(dayofweek, month, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Month and Day of Week")

Code:

ggplot(month_base, aes(Base, month, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Month and Bases")

Code:

ggplot(day0fweek_bases, aes(Base, dayofweek, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Bases and Day of Week")

8) NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP)
Code:

ggplot(data_2014, aes(x = Lon, y = Lat)) +
  geom_point(size = 1, color = "green") +
  scale_x_continuous(limits = c(min_long, max_long)) +
  scale_y_continuous(limits = c(min_lat, max_lat)) +
  theme_map() +
  ggtitle("NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP)")
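
The map code relies on four limits (min_lat, max_lat, min_long, max_long) that are not defined anywhere in the report. The sketch below sets them to an approximate bounding box around New York City; these numbers are assumptions, not values taken from the project. It also shows which points would fall inside the box, using invented coordinates in a hypothetical data frame trips.

```r
# Approximate bounding box around New York City (assumed values).
min_lat  <- 40.5774
max_lat  <- 40.9176
min_long <- -74.15
max_long <- -73.7004

# Toy pickups; the third point lies north of the box (invented values).
trips <- data.frame(Lat = c(40.7690, 40.7267, 41.2000),
                    Lon = c(-73.9549, -74.0345, -73.9000))

# Keep only the points inside the box.
inside <- trips$Lat >= min_lat & trips$Lat <= max_lat &
          trips$Lon >= min_long & trips$Lon <= max_long
nyc_trips <- trips[inside, ]
```

When limits are set on the scales instead, ggplot2 treats out-of-range points as missing and leaves them off the plot with a warning, so filtering first mainly makes the choice explicit.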

7. Conclusion
At the end of the Uber data analysis R project, we observed how to create data visualizations.
We made use of packages like ggplot2 that allowed us to plot various types of visualizations
that pertained to several time-frames of the year. With this, we could conclude how time
affected customer trips. Finally, we made a geo plot of New York that provided us with the
details of how various users made trips from different bases.

Before focusing on features, we must understand the insights that exploratory data analysis
(EDA) provides. In addition, we use several plots to visualise the data, which helps us realise
that we lack information on the cost of taxis, other cabs' prices, and weather patterns. Other
value count charts display the quantity and kind of data in the dataset. Following that, we
convert all categorical values into a continuous data type and fill the NaN values in price
with the median of all other values. Subsequently, recursive feature elimination (RFE) was
used to carry out the most crucial step, feature selection. The top 25 features were chosen
with RFE's assistance.

We use our remaining dataset to test four different models, of which three perform best:
Decision Tree, Random Forest, and Gradient Boosting Regressor, with training accuracy of
over 96%. This indicates that all three algorithms have very high predictive power on our
dataset with the selected features; nevertheless, we ultimately choose Random Forest since it
is less prone to overfitting and allows us to create a function that predicts the price using
the same model.
