Project Document
CERTIFICATE
This is to certify that the Summer Internship work entitled “UBER TRIP ANALYSIS”
is a bonafide record of internship work done by S. ANIRUDH (228T1A05F7) for the
award of the Summer Internship in Computer Science and Engineering by Jawaharlal
Nehru Technological University, Kakinada during the academic year 2024-2025.
Head of Department:
Dr. K. SOWMYA
Professor, HOD, CSE (AI & ML)
EXTERNAL EXAMINER
DHANEKULA INSTITUTE OF ENGINEERING & TECHNOLOGY
Department of Computer Science & Engineering
VISION – MISSION – PEOs
Department Vision: To empower the budding talents and ensure them with probable employability skills in addition to human values by optimizing the resources.
Program Outcomes (POs):
2. Problem Analysis: identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
5. Modern Tool Usage: create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex engineering
activities with an understanding of the limitations.
6. The Engineer And Society: apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
8. Ethics: apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
11. Project Management And Finance: demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-Long Learning: recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
[PO/PSO mapping table: the project "Uber trip analysis" is mapped against Program Outcomes PO1-PO12 and Program Specific Outcomes PSO1-PSO2, with columns for Mapping Level and Mapping Description.]
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and
improve from experience without being explicitly programmed. It focuses on developing
algorithms that can analyze data, identify patterns, and make decisions with minimal human
intervention. Below, we'll explore the concepts, types, algorithms, and applications of machine
learning in detail.
Types of Machine Learning:
1. Supervised Learning: The model learns from labeled data. It predicts the output for new data based on learned patterns (see the code sketch after this list).
✓ Regression: Predicts continuous values (e.g., house prices, temperatures).
✓ Classification: Predicts discrete labels (e.g., spam detection, image classification).
2. Unsupervised Learning: The model learns from unlabeled data. It identifies patterns or
structures in the data.
✓ Clustering: Groups similar data points together (e.g., customer segmentation).
✓ Dimensionality Reduction: Reduces the number of features while preserving
important information (e.g., PCA, t-SNE).
3. Semi-Supervised Learning: Combines both labeled and unlabeled data to improve learning
accuracy.
4. Reinforcement Learning: The model learns by interacting with an environment and
receiving rewards or penalties based on its actions. It aims to maximize cumulative rewards
(e.g., game playing, robotics).
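As a quick illustration of the difference between the first two types, the following minimal Python sketch trains one supervised and one unsupervised model with scikit-learn. The toy blob dataset and variable names are our own illustrative assumptions, not part of the internship data:

# Supervised vs. unsupervised learning on toy data (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 200 points in two blobs; y holds the (known) blob labels.
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# Supervised: the model sees the labels y and learns to predict them.
clf = LogisticRegression().fit(X, y)
print("Predicted label for the first point:", clf.predict(X[:1])[0])

# Unsupervised: the model sees only X and discovers the group structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster assigned to the first point:", km.labels_[0])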
Common Machine Learning Algorithms:
1. Linear Regression: Models the relationship between a dependent variable and one or more independent variables by fitting a linear equation (see the sketch after this list).
2. Logistic Regression: A classification algorithm that models the probability of a binary
outcome using a logistic function.
3. Decision Trees: A tree-like model where each node represents a feature, each branch
represents a decision rule, and each leaf represents an outcome.
4. Random Forest: An ensemble method that constructs multiple decision trees and combines
their predictions for improved accuracy and robustness.
5. Support Vector Machines (SVM): Finds the hyperplane that best separates different classes
in the feature space.
6. K-Nearest Neighbours (KNN): A simple algorithm that assigns labels based on the
majority label among the nearest neighbours in the feature space.
7. K-Means Clustering: An unsupervised algorithm that partitions data into K clusters based
on feature similarity.
8. Neural Networks: Composed of layers of interconnected nodes (neurons) that can learn
complex patterns from data.
✓ Deep Learning: A subset of neural networks with multiple layers (deep neural
networks) that can model complex relationships, especially for tasks like image and
speech recognition.
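To make the first of these algorithms concrete, here is a minimal sketch of linear regression with scikit-learn; the synthetic data and the assumed true line y = 3x + 2 are made up purely for illustration:

# Linear regression on synthetic data (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))                   # one independent variable
y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 1, size=100)  # noisy line y = 3x + 2

model = LinearRegression().fit(x, y)
print("Learned slope:", model.coef_[0])        # should be close to 3
print("Learned intercept:", model.intercept_)  # should be close to 2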
4. Project report:
JUPYTER NOTEBOOK:
Jupyter Notebook is an open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. It is widely used in data science,
machine learning, scientific research, and other fields for interactive data analysis and exploration.
Jupyter Notebook supports various programming languages, but it is most commonly used with
Python. The notebooks consist of "cells," which can contain code (executable cells) or
markdown/text (markdown cells). When you run a code cell, the output is displayed directly below
it, making it an ideal environment for data manipulation, visualization, and model building.
1. Installing Jupyter Notebook: You can install Jupyter Notebook using pip (Python's package manager). Open your terminal or command prompt and type the following command: pip install jupyter
2. Launching Jupyter Notebook: After installation, you can start Jupyter Notebook by running the command: jupyter notebook
This will open a web browser with the Jupyter Notebook interface.
3. Create a New Notebook: In the Jupyter Notebook interface, you can click the "New"
button and select "Python 3" (or any other available kernel) to create a new notebook.
4. Working with Cells: Each notebook consists of cells. You can add new cells using the "+"
button in the toolbar. To execute a code cell, click on it and press "Shift + Enter" or use the "Run"
button in the toolbar.
5. Markdown Cells: To add text or documentation to your notebook, you can create
markdown cells. In a markdown cell, you can write plain text and use markdown syntax to format
the text, add headings, lists, links, images, etc.
6. Saving and Sharing: Jupyter Notebooks automatically save your work periodically. You
can also save the notebook manually using "File" -> "Save and Checkpoint." To share your
notebook with others, you can download it as a file or use services like nbviewer, GitHub, or Google Colab.
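As a small illustration of steps 3-5 above, a first code cell in a new notebook might look like the sketch below (the DataFrame contents are made up for illustration); when the cell is run, the output of the last expression renders directly below it:

# A typical first code cell: import a library, build data, display a summary.
import pandas as pd

df = pd.DataFrame({"trip_minutes": [12, 35, 8, 21]})
df.describe()  # in a notebook, this summary table renders below the cell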
PYTHON LIBRARIES:
1. NumPy:
- Definition: NumPy is the fundamental package for scientific computing in Python. It provides
support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical
functions to operate on these arrays.
2. Pandas:
- Definition: Pandas is a powerful library for data manipulation and analysis. It offers data structures like DataFrame and Series, which allow you to efficiently handle and analyze structured data.
3. Matplotlib:
- Definition: Matplotlib is a plotting library used to create static, interactive, and animated
visualizations in Python. It offers a wide range of customizable plots and charts for data
visualization.
4. Seaborn:
- Definition: Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating statistical graphics. It simplifies the creation of complex visualizations and enhances the
default Matplotlib styles.
5. Scikit-learn:
- Definition: Scikit-learn is a widely-used machine learning library in Python. It provides tools
for various machine learning tasks such as classification, regression, clustering, dimensionality
reduction, and model evaluation.
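The short sketch below gives a one-line taste of each of the five libraries described above; the array and DataFrame contents are illustrative assumptions:

# A minimal tour of the five libraries described above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

arr = np.arange(6).reshape(2, 3)               # NumPy: multi-dimensional arrays
df = pd.DataFrame(arr, columns=list("abc"))    # Pandas: labeled DataFrames
sns.set_theme()                                # Seaborn: improved default styles
df.plot(kind="bar")                            # Matplotlib (via pandas): plotting
plt.title("Toy bar chart")
plt.show()
km = KMeans(n_clusters=2, n_init=10).fit(arr)  # scikit-learn: ML estimators
print(km.labels_)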
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-means clustering algorithm is and how it works, along with a Python implementation of K-means clustering.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning Algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The K-means algorithm mainly performs two tasks:
• Determines the best value for the K center points or centroids through an iterative process.
• Assigns each data point to its closest k-center; the data points near a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is away from the other clusters.
The following steps explain the working of the K-means clustering algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, the model is ready.
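The steps above can be reproduced in a few lines with scikit-learn's KMeans. In the sketch below, the two variables M1 and M2 echo the worked example that follows, but the actual data values are randomly generated assumptions:

# K-means on two variables M1 and M2 (data values made up for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two loose groups of points in the M1-M2 plane.
X = np.vstack([rng.normal([2.0, 2.0], 0.5, (25, 2)),
               rng.normal([7.0, 7.0], 0.5, (25, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print("Final centroids:\n", km.cluster_centers_)
print("First five cluster labels:", km.labels_[:5])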
Let us work through the steps with an example. Suppose we have two variables, M1 and M2, plotted on an x-y scatter plot.
o Let's take the number of clusters as K=2, so we will try to group the dataset into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here we select two points as centroids that are not part of our dataset.
o Now we assign each data point of the scatter plot to its closest centroid. We compute this by calculating the distance between two points (for example, with the Euclidean distance) and drawing a median line between the two centroids.
Points on the left side of this line are nearer to the K1 (blue) centroid, and points on the right side are nearer to the yellow centroid; we color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new centroids: we compute the center of gravity of the points in each cluster and move each centroid there.
o Next, we reassign each data point to the new closest centroid. For this, we repeat the same process of drawing a median line between the centroids.
After the new median line is drawn, one yellow point falls on its left side and two blue points fall on its right, so these three points are assigned to new centroids.
As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points.
o We repeat the process of finding the center of gravity of each cluster's points, which gives the new centroids.
o With the new centroids, we again draw the median line and reassign the data points.
o Now there are no data points on the wrong side of the line, which means our model has converged.
As our model is ready, we can remove the assumed centroids, and the two final clusters remain.
The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but choosing the optimal number of clusters is a challenging task. There are several ways to find the optimal number of clusters; here we discuss the most appropriate method for finding the number of clusters, i.e., the value of K. The method is given below:
Elbow Method
The elbow method is one of the most popular ways to find the optimal number of clusters. It uses the concept of the WCSS value, where WCSS stands for Within-Cluster Sum of Squares and measures the total variation within the clusters. The formula to calculate the WCSS (for 3 clusters) is given below:
WCSS = ∑_{Pi ∈ Cluster1} distance(Pi, C1)² + ∑_{Pi ∈ Cluster2} distance(Pi, C2)² + ∑_{Pi ∈ Cluster3} distance(Pi, C3)²
Here ∑_{Pi ∈ Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster 1 and its centroid C1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.
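For instance, both distance measures can be computed in a line of NumPy each; the point and centroid coordinates below are arbitrary illustrative values:

# Euclidean vs. Manhattan distance between a data point and a centroid.
import numpy as np

p = np.array([1.0, 2.0])  # data point
c = np.array([4.0, 6.0])  # centroid

euclidean = np.sqrt(np.sum((p - c) ** 2))  # sqrt(3^2 + 4^2) = 5.0
manhattan = np.sum(np.abs(p - c))          # 3 + 4 = 7.0
print(euclidean, manhattan)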
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of the bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this technique is known as the elbow method.
5. Data Set:
6. Source Code with Output:
7. Conclusion:
At the end of the Uber data analysis, the dataset we used for this undertaking included records of Uber cab ridership in the city of New York for six months of 2014. As we explored it, we noticed that, against our initial intuition, the weather variables had little or no effect on ridership. Going further in the analysis, it became clear that demand follows distinct patterns both during the day and across the week. We learned how to create data visualizations, making use of packages like ggplot2 that allowed us to plot several kinds of visualizations pertaining to different time frames of the year. From this we could conclude how time affected customer trips. Finally, we made a geo plot of New York that provided us with the details of how various customers made trips from different bases.