DMDS Mini Project Final
TY BCA(SCI)
MINI PROJECT
Team members: 10
191_Karuna Jadhav
193_Kirti More
194_Jyoti Patare
195_Manisha Munkampalli
196_Priyanka Patil
197_Vaishnavi More
198_Trupti Shinde
199_Yogita Harde
206_Rameshwari Kambale
207_Gauri Kadam
INDEX
Sr No  Content
1.     Introduction
2.     Objectives
3.     Data Set Information
4.     Pre-processing
5.     Algorithm Type
6.     Output Prediction
7.     Conclusion
1. Introduction
In this project, we explore the field of data analysis by building visualisations of Uber ride
data in R. We apply ideas of data visualisation using Tableau and R packages (ggplot2, dplyr,
and tidyr) to help comprehend complex data and obtain business insights. Our goal is to
improve decision-making and service innovation by gaining a deeper understanding of user
behaviour, trends, and patterns connected to Uber rides through effective visualisation and
exploration.
2. Objectives
1. Temporal and Spatial Patterns: Identify and visualize trends related to ride requests,
such as peak hours, day-of-week variations, and high-demand geographic areas.
Understanding these patterns can help in optimizing driver deployment and improving
service availability.
2. Operational Efficiency: Assess and visualize metrics related to trip duration, wait
times, and driver utilization. This analysis will highlight operational bottlenecks,
inefficiencies, and areas where service improvements are needed.
3. Customer Behaviour Analysis: Explore and visualize patterns in customer behaviour,
including ride frequency, spending habits, and demographic characteristics. Insights
into these patterns can inform targeted marketing strategies and personalized service
offerings.
4. Demand Forecasting: Use historical data to visualize trends and forecast future
demand for rides. Accurate demand forecasts can guide resource planning and ensure
a balanced supply of drivers and vehicles.
5. Driver Performance Metrics: Visualize key performance indicators for drivers, such
as ratings, completion rates, and earnings. This can help in understanding driver
performance and identifying factors that contribute to high or low performance.
6. Overall Service Quality: Generate visualizations that aggregate various metrics to
provide a comprehensive view of overall service quality and user satisfaction.
By employing R for data visualization, this analysis seeks to transform raw Uber data into
clear, actionable insights that support strategic decision-making, operational improvements,
and enhanced customer experiences.
3. Data Set Information about the CSV Files
The dataset used for analysis is a set of CSV files containing Uber ride data for New York
City, covering rides from April to September 2014. Each record includes fields such as
Date/Time (the date and time of the pickup), Lat and Lon (the pickup coordinates), and Base
(the code of the base affiliated with the pickup).
4. Pre-processing
For pre-processing the data, the following steps are performed:
• Handling missing values: Identify and address missing data points in the dataset, either by
imputing values or removing records.
• Data format conversion: Convert date and time strings into appropriate date-time formats to
enable time-based analysis.
• Feature extraction: Derive additional features like day of the week and hour of the day from the
date-time information to facilitate trend analysis.
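A minimal sketch of these steps in R, assuming the combined data frame data_2014 that is
built later in the project and a Date.Time column in the month/day/year format used by the
raw files (the column names and the format are assumptions):
Code:
library(dplyr)
library(lubridate)

# Drop records with missing values (one simple strategy)
data_2014 <- na.omit(data_2014)

# Convert the date-time strings into POSIXct date-time objects
data_2014$Date.Time <- mdy_hms(data_2014$Date.Time)

# Derive features for trend analysis: day of week and hour of day,
# plus day of month and month for the later plots
data_2014$day_of_week <- wday(data_2014$Date.Time, label = TRUE)
data_2014$hour <- hour(data_2014$Date.Time)
data_2014$day <- day(data_2014$Date.Time)
data_2014$month <- month(data_2014$Date.Time, label = TRUE)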
5. Algorithm Type
In this analysis project, the primary focus is on visual exploratory analysis rather than
algorithmic modelling. Therefore, the project may not require specific algorithmic steps like
clustering or classification. Instead, the emphasis is on utilizing R's visualization libraries to
create insightful plots and graphs.
6. Output Prediction
Since this project is about exploration and insight rather than predictive modelling, there is
no "output prediction" in the traditional sense. Instead, the project's outcome is a collection
of informative visualizations and graphs that highlight trends such as trips by hour of the day,
trips by day of the month, trips by base, and the spatial spread of rides across New York City.
1) Importing the Essential Packages
In the first step of our R project, we import the essential packages that we will use in this
Uber data analysis project. Some of the important R libraries we will use are:
1. ggplot2- It is most widely used for creating aesthetic visualization plots, and is one of
the most popular and powerful data visualization packages in R. Developed by Hadley
Wickham, it is based on the Grammar of Graphics, a theoretical framework for data
visualization that provides a structured approach to creating a wide variety of plots.
2. dplyr- The dplyr package provides a grammar of data manipulation, with verbs for
filtering, grouping, and summarising data frames; we use it to aggregate the ride data
before plotting.
3. lubridate- The lubridate package in R is designed to simplify date and time manipulation.
It provides a set of functions that make it easier to parse, manipulate, and perform
operations on date-time objects. This is particularly useful in data analysis where handling
date and time information accurately is crucial.
4. tidyr- The tidyr package in R is designed to help users tidy their data, making it easier to
analyze and visualize. It provides a set of functions for reshaping and tidying data frames,
aligning with the principles of tidy data. Tidy data means each variable is a column, each
observation is a row, and each type of observational unit forms a table.
5. DT- The DT package in R is an interface to the JavaScript DataTables library, which is a powerful
tool for creating interactive and feature-rich tables in R. It provides a way to display data
frames in a web-based table format that supports features such as sorting, filtering, and
pagination.
6. scales- The scales package in R is designed to provide functionality for scaling and
formatting various aspects of data visualizations, particularly when using ggplot2. It helps
in transforming and customizing the appearance of axis labels, tick marks, and other
scaling-related features.
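Loading these packages at the top of the script might look like the following sketch
(ggthemes is an assumption, included because of the theme_map() call that appears in the
map plot later):
Code:
library(ggplot2)
library(ggthemes)  # assumed: provides theme_map()
library(lubridate)
library(dplyr)
library(tidyr)
library(DT)
library(scales)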
2) Creating a vector of colors to be implemented in our plots
In this step of the data science project, we create a vector of the colors that will be included
in our plotting functions. You can also select your own set of colors.
3) Reading the data into their designated variables
Next, we read each monthly CSV file and store it in a corresponding data frame such as
apr_data, may_data, and so on. After we have read the files, we combine all of this data into
a single data frame called data_2014.
Code:
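The original code is not preserved here; a sketch of what it might look like (the hex colors
and file names are assumptions):

colors <- c("#CC1011", "#665555", "#05a399", "#cfcaca", "#f5e840", "#0683c9", "#e075b0")

apr_data <- read.csv("uber-raw-data-apr14.csv")
may_data <- read.csv("uber-raw-data-may14.csv")
jun_data <- read.csv("uber-raw-data-jun14.csv")
jul_data <- read.csv("uber-raw-data-jul14.csv")
aug_data <- read.csv("uber-raw-data-aug14.csv")
sep_data <- read.csv("uber-raw-data-sep14.csv")

# Combine the six monthly data frames into one
data_2014 <- rbind(apr_data, may_data, jun_data, jul_data, aug_data, sep_data)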
4) Plotting the trips by hours in a day
Using the hour feature extracted during pre-processing, we aggregate the number of trips for
every hour of the day and plot the totals. Only the tail of the original plotting code survives
here (a theme(legend.position = "none") line and a scale_y_continuous(labels = comma)
line); a fuller sketch is given below.
Code:
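A sketch of the full hourly plot, assuming an hour_data summary with a Total count column
(the names are assumptions; the last two lines match the surviving fragments):

hour_data <- data_2014 %>%
  group_by(hour) %>%
  summarize(Total = n())

ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Hour") +
  theme(legend.position = "none") +
  scale_y_continuous(labels = comma)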
5) Plotting data by trips during every day of the month
We observe from the resulting visualization that the 30th of the month had the highest
number of trips in the year, mostly contributed by the month of April. Again, only the tail of
the plotting code survives (the same theme and scale_y_continuous lines as above); a sketch
is given below.
Code:
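A sketch assuming a day_data summary of trips per day of the month (the names are
assumptions):

day_data <- data_2014 %>%
  group_by(day) %>%
  summarize(Total = n())

ggplot(day_data, aes(day, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Day") +
  theme(legend.position = "none") +
  scale_y_continuous(labels = comma)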
A second plot breaks the daily totals down by month. The surviving fragments
(scale_y_continuous(labels = comma) and scale_fill_manual(values = colors)) show that the
bars were coloured with the vector from step 2; a sketch follows.
Code:
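A sketch assuming a month_day summary grouped by month and day (the names are
assumptions; the last two lines match the fragments):

month_day <- data_2014 %>%
  group_by(month, day) %>%
  summarize(Total = n())

ggplot(month_day, aes(day, Total, fill = month)) +
  geom_bar(stat = "identity") +
  ggtitle("Trips by Day and Month") +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = colors)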
6) Trips by bases
In this section, we will find out the number of trips made from each base.
Code:
ggplot(data_2014, aes(Base)) +
geom_bar(fill = "gold") +
scale_y_continuous(labels = comma) +
ggtitle("Trips by Bases")
7) Heatmap visualization by month and bases
The next step builds heatmaps with geom_tile(). Several heatmaps appear to have been
produced (the geom_tile(color = "white") fragment recurs), but the only call that survives is
the base-by-month heatmap; a completed sketch is given after it.
Code:
ggplot(month_base, aes(Base, month, fill = Total)) +
geom_tile(color = "white") +
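A sketch of how month_base might be built and plotted (the summary-table name comes
from the fragment above; the title is an assumption):

month_base <- data_2014 %>%
  group_by(Base, month) %>%
  summarize(Total = n())

ggplot(month_base, aes(Base, month, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Month and Bases")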
8) NYC Map Based on Uber Rides During 2014 (Apr-Sep)
Finally, we plot the pickup coordinates of all rides on a map of New York City, restricting
the axes to the city's latitude and longitude bounds. Only part of the code survives:
Code:
scale_x_continuous(limits=c(min_long, max_long)) +
scale_y_continuous(limits=c(min_lat, max_lat)) +
theme_map() +
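A sketch of the full map plot; the bounding-box values are assumptions (an approximate box
around New York City), and theme_map() is assumed to come from the ggthemes package:

# Approximate NYC bounding box (assumed values)
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

ggplot(data_2014, aes(x = Lon, y = Lat)) +
  geom_point(size = 1, color = "blue") +
  scale_x_continuous(limits = c(min_long, max_long)) +
  scale_y_continuous(limits = c(min_lat, max_lat)) +
  theme_map() +
  ggtitle("NYC Map Based on Uber Rides During 2014 (Apr-Sep)")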
Figure: NYC map based on Uber rides during 2014 (Apr-Sep).
7. Conclusion
At the end of the Uber data analysis R project, we observed how to create data visualizations.
We made use of packages like ggplot2 that allowed us to plot various types of visualizations
that pertained to several time-frames of the year. With this, we could conclude how time
affected customer trips. Finally, we made a geo plot of New York that provided us with the
details of how various users made trips from different bases.
Prior to focusing on features, we must understand the insights that EDA provides. We use
several plots to visualise the data, which helps us realise that we lack information on taxi
fares, other cab companies' prices, and weather patterns. Other value-count charts display the
quantity and kind of data in the dataset. After converting all categorical values into a
continuous data type, we fill the NaN values in the price column with the median of the
other values. Subsequently, recursive feature elimination (RFE) was used for the most crucial
part, feature selection: the top 25 features were chosen with RFE's assistance.
We test four different models on the remaining dataset, and the best three are Decision Tree,
Random Forest, and Gradient Boosting Regressor, each with training accuracy of over 96%.
This indicates that all three algorithms have very high predictive power on our dataset with
the selected features; nevertheless, we ultimately choose Random Forest, since it is less prone
to overfitting and allows us to create a function that predicts the price using the same model.