
UBER TRIP ANALYSIS

Submitted in partial fulfilment of the requirements for the award of the certificate of

BACHELOR OF TECHNOLOGY

By
NAME: S. ANIRUDH
ROLL NO: 228T1A05F7

DHANEKULA INSTITUTE OF ENGINEERING & TECHNOLOGY

GANGURU, VIJAYAWADA - 521 139


Affiliated to JNTUK, Kakinada & Approved by AICTE, New Delhi
Certified by ISO 9001:2015, Accredited by NBA


DHANEKULA INSTITUTE OF ENGINEERING & TECHNOLOGY

GANGURU, VIJAYAWADA - 521 139


Affiliated to JNTUK, Kakinada & Approved by AICTE, New Delhi
Certified by ISO 9001:2015, Accredited by NBA

Department of Computer Science & Engineering

CERTIFICATE

This is to certify that the Summer Internship work entitled “UBER TRIP ANALYSIS”
is a bona fide record of internship work done by S. ANIRUDH (228T1A05F7) for the
award of the Summer Internship in Computer Science and Engineering by Jawaharlal
Nehru Technological University, Kakinada during the academic year 2024-2025.

Head of Department:
Dr. K. SOWMYA
Professor & HOD, CSE (AI & ML)

EXTERNAL EXAMINER
DHANEKULA INSTITUTE OF ENGINEERING & TECHNOLOGY
Department of Computer Science & Engineering
VISION – MISSION – PEOs

Institute Vision: Pioneering Professional Education through Quality.

Institute Mission:
* Providing quality education through state-of-the-art infrastructure, laboratories and committed staff.
* Moulding students as proficient, competent, and socially responsible engineering personnel with ingenious intellect.
* Involving faculty members and students in research and development works for the betterment of society.

Department Vision: To empower budding talents and equip them with employability skills in addition to human values by optimizing the resources.

Department Mission:
* To encourage students to become pioneers in the global competition with problem-solving skills.
* To make students innovative, with the skills to explore employment opportunities and/or to become entrepreneurs.
* To promote a research environment and inculcate corporate social responsibility.

Program Educational Objectives (PEOs): Graduates of Computer Science & Engineering will:

PEO1: Excel in problem solving and designing new products for a competitive and challenging business environment.

PEO2: Contribute to technological innovation, research and society through the application of information technology in a diversified world.
PROGRAM OUTCOMES (POs)

1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. Problem Analysis: Identify, formulate, review research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

3. Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.

4. Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information, to provide valid conclusions.

5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modelling, to complex engineering activities with an understanding of the limitations.

6. The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.

7. Environment and Sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.

8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.

9. Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.

11. Project Management and Finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.

12. Life-Long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1: Design and develop Information Technology based systems with high professional skills.
PSO2: Qualify in national and international level competitive examinations for successful higher studies and get employment in IT-enabled industries.
Internship Mappings

Project Title         PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
Uber Trip Analysis

Mapping Level    Mapping Description
1                Low Level Mapping with PO & PSO
2                Moderate Mapping with PO & PSO
3                High Level Mapping with PO & PSO


S. ANIRUDH
228T1A05F7
III-I
Contents
1. Internship carried out Company/Organization Details ----------------------------- 8
2. Internship Log ------------------------------------------------------------------ 9
3. Domain area of the Internship ---------------------------------------------------- 10
4. Project report ------------------------------------------------------------------- 11
5. Data Sets ------------------------------------------------------------------------ 21
6. Source Code with Output ---------------------------------------------------------- 22
7. Conclusion ----------------------------------------------------------------------- 30
1. Internship carried out Company/Organization Details:

The address of the headquarters is:

Blackbucks Corporate Headquarters
2nd Floor, Vaswani Presidio,
100 Feet Road 84/2, Panathur Main Road, Off Outer Ring Road, Kaverappa Layout,
Kadubeesanahalli, India, 560103

Company Name: IIDT BLACKBUCKS
Company Status: Active
Registration Number: 167412
Company Category: Educational Institutions
Age of Company: 8 years
Activity: Internship, Job Guarantees, Global Partnerships, Focus on Emerging Technologies, Government Initiatives.
2. Internship Log:
Intern name: S. ANIRUDH
Intern period: 28-05-2024 to 20-07-2024
Company: IIDT-BLACKBUCK
3. Domain area of the Internship:
Domain Description

Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and
improve from experience without being explicitly programmed. It focuses on developing
algorithms that can analyze data, identify patterns, and make decisions with minimal human
intervention. Below, we'll explore the concepts, types, algorithms, and applications of machine
learning in detail.

Types of Machine Learning

1. Supervised Learning: The model learns from labeled data. It predicts the output for new
data based on learned patterns.
✓ Regression: Predicts continuous values (e.g., house prices, temperatures).
✓ Classification: Predicts discrete labels (e.g., spam detection, image classification).
2. Unsupervised Learning: The model learns from unlabeled data. It identifies patterns or
structures in the data.
✓ Clustering: Groups similar data points together (e.g., customer segmentation).
✓ Dimensionality Reduction: Reduces the number of features while preserving
important information (e.g., PCA, t-SNE).
3. Semi-Supervised Learning: Combines both labeled and unlabeled data to improve learning
accuracy.
4. Reinforcement Learning: The model learns by interacting with an environment and
receiving rewards or penalties based on its actions. It aims to maximize cumulative rewards
(e.g., game playing, robotics).
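
To make these categories concrete, the following short sketch (not part of the original report; it assumes scikit-learn and NumPy are installed, and uses a built-in toy dataset) trains a supervised classifier and an unsupervised clustering model on the same data:

# Illustrative sketch: supervised vs. unsupervised learning with scikit-learn
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning (classification): the model is trained on labelled data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Unsupervised learning (clustering): the model sees only the features, no labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", np.bincount(kmeans.labels_))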

Common Machine Learning Algorithms

1. Linear Regression: Models the relationship between a dependent variable and one or more
independent variables by fitting a linear equation.
2. Logistic Regression: A classification algorithm that models the probability of a binary
outcome using a logistic function.
3. Decision Trees: A tree-like model where each node represents a feature, each branch
represents a decision rule, and each leaf represents an outcome.
4. Random Forest: An ensemble method that constructs multiple decision trees and combines
their predictions for improved accuracy and robustness.
5. Support Vector Machines (SVM): Finds the hyperplane that best separates different classes
in the feature space.
6. K-Nearest Neighbours (KNN): A simple algorithm that assigns labels based on the
majority label among the nearest neighbours in the feature space.
7. K-Means Clustering: An unsupervised algorithm that partitions data into K clusters based
on feature similarity.
8. Neural Networks: Composed of layers of interconnected nodes (neurons) that can learn
complex patterns from data.
✓ Deep Learning: A subset of neural networks with multiple layers (deep neural
networks) that can model complex relationships, especially for tasks like image and
speech recognition.
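
As an illustration of how several of these algorithms are used in practice, the sketch below (not from the original report; the dataset and parameter choices are assumptions) fits a few of the listed classifiers on a small built-in dataset and compares their test accuracy:

# Illustrative sketch: comparing a few of the algorithms listed above with scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Feature scaling helps distance- and margin-based models (KNN, SVM, logistic regression)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")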

Steps in a Machine Learning Workflow

1. Problem Definition: Clearly define the problem, objectives, and constraints.
2. Data Collection: Gather relevant data from various sources.
3. Data Pre-processing: Clean and prepare data for analysis (handling missing values,
normalization, etc.).
4. Feature Engineering: Select and transform features to improve model performance.
5. Model Selection: Choose appropriate algorithms based on the problem type and data
characteristics.
6. Model Training: Train the model using training data and optimize its parameters.
7. Model Evaluation: Assess the model's performance using testing data and metrics like
accuracy, precision, recall, F1-score, etc.
8. Hyperparameter Tuning: Fine-tune the model's hyperparameters to improve performance.
9. Model Deployment: Integrate the trained model into a production environment for real-time
predictions.
10. Monitoring and Maintenance: Continuously monitor the model's performance and update
it as needed.
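
The following compact sketch (an illustration only; the synthetic dataset and parameter choices are assumptions, not taken from the report) strings steps 2 to 8 of this workflow together with scikit-learn:

# Illustrative sketch of the workflow: data, pre-processing, training, tuning and evaluation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 2 - Data collection (synthetic data standing in for a real source)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Steps 3/4 - Pre-processing and feature preparation inside a pipeline
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),  # Step 5 - model selection
])

# Step 6 - Training on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 8 - Hyperparameter tuning with cross-validated grid search
grid = GridSearchCV(pipe, {"model__n_estimators": [100, 200]}, cv=5)
grid.fit(X_train, y_train)

# Step 7 - Evaluation with accuracy, precision, recall and F1-score
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))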

4. Project report:
JUPYTER NOTEBOOK:
Jupyter Notebook is an open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. It is widely used in data science,
machine learning, scientific research, and other fields for interactive data analysis and exploration.

Jupyter Notebook supports various programming languages, but it is most commonly used with
Python. The notebooks consist of "cells," which can contain code (executable cells) or
markdown/text (markdown cells). When you run a code cell, the output is displayed directly below
it, making it an ideal environment for data manipulation, visualization, and model building.

To use Jupyter Notebook, follow these steps:

1. Installation: If you haven't installed Jupyter Notebook, you can do so using pip (Python's package manager). Open your terminal or command prompt and type the following command: pip install jupyter

2. Launching Jupyter Notebook: After installation, you can start Jupyter Notebook by running the following command in your terminal: jupyter notebook

This will open a web browser with the Jupyter Notebook interface.

3. Create a New Notebook: In the Jupyter Notebook interface, you can click the "New"
button and select "Python 3" (or any other available kernel) to create a new notebook.

4. Working with Cells: Each notebook consists of cells. You can add new cells using the "+"
button in the toolbar. To execute a code cell, click on it and press "Shift + Enter" or use the "Run"
button in the toolbar.

5. Markdown Cells: To add text or documentation to your notebook, you can create
markdown cells. In a markdown cell, you can write plain text and use markdown syntax to format
the text, add headings, lists, links, images, etc.

6. Saving and Sharing: Jupyter Notebooks automatically save your work periodically. You
can also save the notebook manually using "File" -> "Save and Checkpoint." To share your
notebook with others, you can download it as a file or use services like nbviewer, GitHub, or
Google Colab.

Jupyter Notebook provides an interactive and collaborative environment, making it a popular
choice for data analysis, prototyping ML models, and sharing research findings. It's a powerful tool
for working with data and code while providing a narrative context for your analyses and
visualizations.
Here are some popular Python libraries along with their brief definitions used in this project:

1. NumPy:
- Definition: NumPy is the fundamental package for scientific computing in Python. It provides
support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical
functions to operate on these arrays.

2. Pandas:
- Definition: Pandas is a powerful library for data manipulation and analysis. It offers data
structures like DataFrames and Series, which allow you to efficiently handle and analyse
structured data.

3. Matplotlib:
- Definition: Matplotlib is a plotting library used to create static, interactive, and animated
visualizations in Python. It offers a wide range of customizable plots and charts for data
visualization.

4. Seaborn:
- Definition: Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating statistical graphics. It simplifies the creation of complex visualizations and enhances the
default Matplotlib styles.

5. Scikit-learn:
- Definition: Scikit-learn is a widely-used machine learning library in Python. It provides tools
for various machine learning tasks such as classification, regression, clustering, dimensionality
reduction, and model evaluation.
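
As a quick illustration (not part of the original report) of how these five libraries fit together, the short sketch below generates a toy dataset with NumPy, wraps it in a Pandas DataFrame, clusters it with Scikit-learn, and plots the result with Seaborn and Matplotlib:

# Illustrative sketch: the five libraries above working together on synthetic data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# NumPy: generate two well-separated groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=(0, 0), size=(50, 2)),
                    rng.normal(loc=(5, 5), size=(50, 2))])

# Pandas: put the points into a DataFrame for convenient handling
df = pd.DataFrame(points, columns=["x", "y"])

# Scikit-learn: cluster the points with K-Means
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df[["x", "y"]])

# Seaborn + Matplotlib: visualize the clusters
sns.scatterplot(data=df, x="x", y="y", hue="cluster")
plt.title("Toy K-Means example")
plt.show()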

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science. In this topic, we will learn what the K-means clustering
algorithm is, how the algorithm works, and the Python implementation of K-means clustering.
What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning Algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters, i.e., until the cluster assignments stop changing.
The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They may be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each datapoint to the new closest centroid of
each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
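
To connect these steps to code, here is a minimal from-scratch sketch in plain NumPy (for illustration only; the project itself uses scikit-learn's KMeans, and empty-cluster handling is omitted for brevity):

# Illustrative sketch: the K-Means steps above implemented directly with NumPy
import numpy as np

def kmeans_from_scratch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once no centroid moves any more, otherwise repeat
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step-7: the model (labels and centroids) is ready
    return labels, centroids

# Example usage on a tiny synthetic dataset with two obvious groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
labels, centroids = kmeans_from_scratch(X, k=2)
print(centroids)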

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the data into different
clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we are selecting the below two
points as k points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We
will compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the centroids. Consider the
below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue
centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and
yellow for clear visualization.
o As we need to find the closest clusters, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of these
clusters, and will find the new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue
points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step-4, which is finding new centroids
or K-points.

o We will repeat the process by finding the center of gravity of the clusters, so the new centroids
will be as shown in the below image:

o As we have got the new centroids, we will again draw the median line and reassign the data points.
So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line,
which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will
be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on how efficient the clusters it
forms are. But choosing the optimal number of clusters is a big task. There are different ways to
find the optimal number of clusters, but here we discuss the most appropriate method to find
the number of clusters, i.e., the value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which
defines the total variations within a cluster. The formula to calculate the value of WCSS (for 3
clusters) is given below:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

In the above formula of WCSS,

Σ_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point and
its centroid within cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K; a short code sketch of this procedure is shown below.
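
The following sketch (an illustration on assumed synthetic data, not taken from the report) implements the elbow method with scikit-learn, whose inertia_ attribute is exactly the WCSS defined above:

# Illustrative sketch: elbow method on synthetic data using KMeans.inertia_ (the WCSS)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()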

Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow
method. The graph for the elbow method looks like the below image:
5. Data Set:
6. Source Code with Output:

# importing the necessary Python libraries and the dataset


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import plotly.express as px

# Load the dataset


data = pd.read_csv("/content/uber-raw-data-sep14.csv")
data

# Convert Date/Time to datetime object


data["Date/Time"] = pd.to_datetime(data["Date/Time"])
data

# Prepare the data according to days, hours, and weekdays


data["day"] = data["Date/Time"].apply(lambda x: x.day)
data["weekdays"] = data["Date/Time"].apply(lambda x: x.weekday())
data["hour"] = data["Date/Time"].apply(lambda x: x.hour)
data

# Select relevant features for clustering


features = data[['Lat', 'Lon', 'hour', 'weekdays']]
features = features.fillna(0)
features

# Apply K-Means clustering


kmeans = KMeans(n_clusters=5, random_state=0).fit(features)
kmeans

# Add the cluster labels to the original dataset


data['cluster'] = kmeans.labels_
data

# Visualize the clusters


sns.set(rc={'figure.figsize':(12, 10)})
sns.scatterplot(x='Lon', y='Lat', hue='cluster', data=data,
palette='viridis')
plt.title('K-Means Clustering of Uber Trips')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.show()
# Analyze the cluster centers
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=['Lat',
'Lon', 'hour', 'weekdays'])
print(cluster_centers)

# Analyze the distribution of clusters by day


sns.set(rc={'figure.figsize':(15, 13)})
# sns.distplot() has been removed from recent seaborn releases; histplot is the modern equivalent
sns.histplot(data["day"], bins=30, kde=False)
plt.title('Distribution of Uber Trips by Day')
plt.xlabel('Day')
plt.ylabel('Number of Trips')
plt.show()
# Analyze the distribution of clusters by hour
sns.set(rc={'figure.figsize':(15, 13)})
sns.histplot(data["hour"], bins=24, kde=False)
plt.title('Distribution of Uber Trips by Hour')
plt.xlabel('Hour')
plt.ylabel('Number of Trips')
plt.show()

# Analyze the distribution of clusters by weekday


sns.set(rc={'figure.figsize':(15, 13)})
sns.histplot(data["weekdays"], bins=7, kde=False, color='red')
plt.title('Distribution of Uber Trips by Weekday')
plt.xlabel('Weekday')
plt.ylabel('Number of Trips')
plt.show()
# Analyze the correlation of hours and weekdays within clusters
df = data.groupby(["weekdays", "hour", "cluster"]).size().unstack(level=-1).fillna(0)
sns.heatmap(df, annot=False, cmap='viridis')
plt.title('Heatmap of Uber Trips by Hour and Weekday within Clusters')
plt.xlabel('Hour')
plt.ylabel('Weekday')
plt.show()
# Plotly visualization
# Cluster labels are cast to strings so Plotly treats them as discrete categories
# (with numeric labels the points get a continuous colour scale and color_discrete_map is ignored)
color_discrete_map = {'0': 'red', '1': 'blue', '2': 'green', '3': 'purple', '4': 'orange'}

# Data preparation for Plotly visualization
data_plotly = data.rename(columns={"Lat": "lat", "Lon": "lon", "cluster": "kmeans_label"})
data_plotly["kmeans_label"] = data_plotly["kmeans_label"].astype(str)
kmeans_centroids = cluster_centers.rename(columns={"Lat": "lat", "Lon": "lon"})
kmeans_centroids['label'] = kmeans_centroids.index.astype(str)

# Scatter mapbox for clusters
kmeans_fig = px.scatter_mapbox(data_plotly, lat="lat", lon="lon",
                               color='kmeans_label', zoom=10,
                               color_discrete_map=color_discrete_map)
kmeans_fig.update_layout(mapbox_style="carto-positron")

# Scatter mapbox for centroids
pin_recommendations = px.scatter_mapbox(kmeans_centroids, lat='lat', lon='lon',
                                        color='label',
                                        size=[10] * kmeans_centroids.shape[0],
                                        size_max=10,
                                        color_discrete_map=color_discrete_map)

# Add centroids to the map
kmeans_fig.add_traces(pin_recommendations.data)
kmeans_fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
kmeans_fig.show()

7. Conclusion:

At the end of this Uber data analysis, the dataset used for the project contained records of Uber
ridership in New York City for six months of 2014. While exploring it, we noticed that, contrary to
our initial intuition, weather variables had little or no effect on ridership. Going further in the
analysis, it became clear that demand follows distinct patterns both during the day and across the
week. We also learned how to create data visualizations, making use of libraries such as
Matplotlib, Seaborn and Plotly that allowed us to plot several kinds of visualizations covering
different time frames of the year. From these we could conclude how time affected customer trips.
Finally, we made a geo plot of New York that showed how users made journeys from the different bases.