UBER RELATED DATA ANALYSIS USING
MACHINE LEARNING
By
RISHI.S
(Reg.No. 37110642)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600119, TAMIL NADU
APRIL 2021
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with "A" grade by NAAC
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai - 600119
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
This is to certify that this Professional Training Report is the bonafide work of RISHI. S
(37110642) who underwent the professional training in "Uber Related Data Analysis
using Machine Learning" under our supervision from November 2020 to April 2021.
Internal Guide
DR. B. ANKAYARKANNI, M.E., Ph.D.
Internal Examiner                         External Examiner
DECLARATION
I, RISHI. S (37110642), hereby declare that the Professional Training Report on "UBER
RELATED DATA ANALYSIS USING MACHINE LEARNING", done by me under the
guidance of DR. B. ANKAYARKANNI, M.E., Ph.D. at Sathyabama Institute of Science and
Technology, is submitted in partial fulfilment of the requirements for the award of the
Bachelor of Engineering degree in Computer Science.
DATE:
PLACE: CHENNAI                         SIGNATURE OF THE CANDIDATE
ACKNOWLEDGEMENT
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many
ways for the completion of the project.
ABSTRACT
The project "Uber Related Data Analysis using Machine Learning" explains the
working of an Uber dataset, which contains data produced by Uber for New York
City. Uber is a P2P platform that links riders to drivers who can take them to their
destination. The dataset includes primary data on Uber pickups with details
including the date and time of the ride as well as longitude-latitude
information. Using this information, the report explains the use of the k-means
clustering algorithm on the dataset to classify the various parts of New York
City. Since the industry is booming and expected to keep growing, effective taxi
dispatching will help each driver and passenger reduce the time spent seeking
out one another. The model is employed to predict demand at points across
the city.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS 10
LIST OF FIGURES 11
1 INTRODUCTION 12
1.1 Introduction to Uber Data and Predictions 12
1.1.1 Machine Learning Predictions 12
Introduction to Machine Learning 13
Overview 13
Machine Learning Approaches 14
Theory 14
Models 15
2 LITERATURE REVIEW 18
Aim Of Project 21
Objectives 21
Scope Of Project 21
4 MODULE IMPLEMENTATIONS 22
Modules 22
%Pylab Inline 22
Pandas 22
Seaborn 23
Kmeans 24
YellowBrick 24
Folium 24
Modules Hardware/Software 24
Architectural Design 25
Module Implementations 26
5 RESULTS AND DISCUSSION 27
6 CONCLUSION AND FUTURE WORK 33
REFERENCES
APPENDIX
A. PLAGIARISM REPORT
B. BASE PAPER
C. SOURCE CODE
LIST OF ABBREVIATIONS
ABBREVIATIONS EXPANSION
ML Machine Learning
GA Genetic Algorithms
LIST OF FIGURES
IMPORTING MODULES 27
IMPORTING DATASET 27
FINAL DATASET 28
VISUALISATION OF X-Y GRAPH FREQUENCY OF CABS 28
VISUALIZATION OF DATA BASED ON MONTH 29
VISUALIZATION OF DATA BASED ON DAY 29
PLOTTING HEAT MAP OF DATASET 29
CROSS REFERENCING LATITUDE AND LONGITUDE 30
PLOTTING COLLECTION OF POINTS IN XY GRAPH 30
GETTING THE DATA OF CENTROIDS FROM ALGORITHM 30
PLOTTING THE CENTROIDS ON MAP 31
TOTAL NUMBER OF TRIPS BASED ON CLUSTERS 31
Overview
Machine learning involves computers discovering how they can perform tasks without
being explicitly programmed to do so. It involves computers learning from the data
provided so that they can perform certain tasks. For simple tasks assigned to computers,
it is possible to program algorithms telling the machine how to execute all the steps
required to solve the problem at hand; on the computer's part, no learning is
required. For more advanced tasks, it can be challenging for a human to
manually create the needed algorithms. In practice, it can turn out to be more
effective to help the machine develop its own algorithm, instead of having human
programmers specify every needed step.
Machine learning approaches are traditionally divided into three broad categories,
depending on the nature of the "signal" or "feedback" available to the learning
system:
• Supervised learning: the computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps
inputs to outputs.
• Unsupervised learning: no labels are given to the learning algorithm, leaving it on
its own to find structure in its input.
• Reinforcement learning: a program interacts with a dynamic environment in which it
must achieve a certain goal, receiving feedback in the form of rewards as it
navigates the problem space.
Other approaches have been developed which do not fit neatly into this three-fold
categorisation, and sometimes more than one is used by the same machine learning
system, for example topic modelling, dimensionality reduction or meta-learning.
As of 2020, deep learning has become the dominant approach for much ongoing
work in the field of machine learning.
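The supervised setting described above can be illustrated with a toy sketch (not from the report): a nearest-neighbour "rule" learned purely from labelled examples supplied by a "teacher".

```python
# Toy supervised learning: a 1-nearest-neighbour rule learned from
# labelled examples (illustrative data, not from the Uber project).
def predict(examples, x):
    """examples: list of (input, label); returns the label of the closest input."""
    return min(examples, key=lambda ex: abs(ex[0] - x))[1]

# the "teacher" provides example inputs with their desired outputs
training = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

print(predict(training, 1.4))   # small
print(predict(training, 8.7))   # large
```

Here the "general rule" is implicit: the learner simply maps any new input to the output of its nearest training example.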
Theory
Deep learning uses multiple hidden layers in an artificial neural network. This
approach tries to model the way the human brain processes light and sound into
vision and hearing. Some successful applications of deep learning are computer
vision and speech recognition.
CHAPTER 2
LITERATURE REVIEW
In one study, k-means clustering is used to estimate the most likely collection
points at a given time and to predict the best nightlife hotspots by learning
trends from previous Uber pickups. This has been verified using a Lyft test set
and is consistent with Yelp's best results.[1]
Poulsen, L.K., in this document, conducted a spatial analysis of Green Cab and
Uber rides in the outer boroughs of New York City to determine the competitive
position of the NYCTLC. They found that the demand for green taxis continues to
increase, but the number of Uber trips in the same area is growing faster.
They did not find any variation between the green taxi and Uber when
variations were observed between weekdays and weekends. This research
recommends that the NYCTLC create a dashboard that analyzes and displays data in
real time, as the authors believe this will increase its competitiveness compared to
Uber. Uber is a recent taxi operator in New York and is constantly devouring the
market share of the yellow and green taxis of the New York Taxi and Limousine
Commission (NYCTLC).[2]
Ahmed, M., has shown, by using detailed trip-level data on taxis and for-hire
vehicles together with incident-level complaint data, how the entry of Uber and
Lyft damaged the quality of taxi services in New York City. The overall effect of
the ride-hailing services was enormous and widespread. One of these effects is the
expansion of the rivalry between Uber and Lyft over the quality of taxi service.
They use a new set of complaint data to measure the (lack of) quality of service
that had never been analyzed before, focusing on the quality dimensions that
generate most of the complaints. The increased competition from these shared
travel services has had a noticeable impact on the behavior of taxi drivers.[5]
Wallsten, S., stated that the results for New York and Chicago are consistent with
the possibility that taxis react to the new competition by improving quality. In New
York, the rise of Uber is linked to a reduction in complaints about taxi trips in the
city. They discuss the competitive effect of ride-sharing on the taxi industry using
the complete dataset of the New York City Taxi and Limousine Commission,
covering more than one billion taxi trips, complaint records for New York and
Chicago, and Google Trends data on the success of Uber, the largest shared
travel service.[6]
AIM OF PROJECT
The aim of the project is to predict the pickup of the cab with respect to the
location given by the user in the app, using clusters that represent the main
coordinate points in a city. This is done by applying data visualization based on
the frequency of trips travelled with respect to the day of the month, and data
analysis using the k-means clustering algorithm.
OBJECTIVES
The objective of the project is to plot a heat map for the prediction of the
frequency of trips across a collection of points (latitude, longitude) in a dataset,
which is data visualization, and then to feed this data into the algorithm adopted
in the project.
SCOPE OF PROJECT
Based on the aim and the objectives of the project, the scope of the project is to
apply an algorithm best suited to analyze the data imported from the dataset, such
that the algorithm can be trained on the data before proceeding to data analysis.
CHAPTER 4
MODULE IMPLEMENTATION
MODULES
The modules used are based on the process of data acquisition, processing and
analysis of a dataset.
The dataset includes primary data on Uber pick-ups with details including the date
and time of the ride as well as longitude-latitude information of the city.
The dataset is imported and then analyzed using clustering algorithms
(k-means clustering). The module description can be explained by importing the
following modules from Python.
%PYLAB INLINE
PyLab can be defined as a procedural interface to the Matplotlib object-oriented
plotting library. Matplotlib is the whole package; matplotlib.pyplot is a module in
Matplotlib; and PyLab is a module that gets installed alongside Matplotlib.
PyLab can be defined as a convenience module that bulk imports matplotlib.pyplot
(for plotting) and NumPy (for mathematics and dealing with arrays) into a single
namespace.
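As a sketch, the explicit imports below are the script-friendly equivalent of what %pylab inline pulls into the namespace (the "Agg" backend and output file name are illustrative choices, not part of the report):

```python
# Explicit-import equivalent of the %pylab inline bulk import.
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen; notebooks use inline instead
import matplotlib.pyplot as plt

x = np.arange(5)
plt.plot(x, x ** 2)              # with %pylab this would just be plot(x, x**2)
plt.savefig("curve.png")
```

Explicit imports are generally preferred outside notebooks, since %pylab floods the namespace with hundreds of names.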
PANDAS
In computer programming, pandas can be defined as a software library written for
the Python programming language for data manipulation and analysis. In particular,
it offers data structures and operations for manipulating numerical tables and time
series, and it is free software released under the three-clause BSD license.
Pandas is principally used for data analysis. Pandas allows importing data from
various file formats like comma-separated values, JSON, SQL and Microsoft Excel.
Pandas supports various data manipulation operations like merging, reshaping and
selecting, as well as data cleaning and data wrangling features.
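A minimal sketch of these pandas operations, mirroring the derived columns the project uses later (the inline CSV and its values are invented sample data, not the real uber.csv):

```python
# Sketch: load CSV data and derive day-of-month / weekday / hour columns.
# Assumption: the column names mimic the project's dataset; values are invented.
import io
import pandas as pd

csv = io.StringIO(
    "Date/Time,Lat,Lon\n"
    "4/1/2014 0:11:00,40.769,-73.9549\n"
    "4/3/2014 17:22:00,40.7267,-74.0345\n"
)
data = pd.read_csv(csv)
data["Date/Time"] = pd.to_datetime(data["Date/Time"])
data["dom"] = data["Date/Time"].dt.day          # day of month
data["weekday"] = data["Date/Time"].dt.weekday  # Monday = 0
data["hour"] = data["Date/Time"].dt.hour
print(data[["dom", "weekday", "hour"]])
```

The .dt accessor performs the same per-row datetime extraction that the report's appendix code does with map() and helper functions.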
SEABORN
Seaborn helps you explore and understand your data. Its plotting functions
operate on dataframes and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce
informative plots. Its dataset-oriented, declarative API lets you focus on what the
different elements of your plots mean, rather than on the details of how to draw
them.
It gives us the capability to create amplified data visuals. This helps us understand
the data by displaying it in a visual context to unearth any hidden correlations
between variables or trends that might not be obvious initially. Seaborn has a high-
level interface as compared to the low level of Matplotlib.
Seaborn comes with a large number of high-level interfaces and customized
themes that matplotlib lacks as it’s not easy to figure out the settings that make
plots attractive. Matplotlib functions don’t work well with dataframes, whereas
seaborn does.
Seaborn’s integration with matplotlib allows you to use it across the many
environments that matplotlib supports, including exploratory analysis in notebooks,
real-time interaction in GUI applications, and archival output in a number of raster
and vector formats.
While you can be productive using only seaborn functions, full customization of
your graphics will require some knowledge of matplotlib’s concepts and API. One
aspect of the learning curve for new users of seaborn will be knowing when
dropping down to the matplotlib layer is necessary to achieve a particular
customization. On the other hand, users coming from matplotlib will find that much
of their knowledge transfers.
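As an illustrative sketch, a seaborn heatmap over a small pandas DataFrame (the weekday/hour counts below are invented sample data, and the "Agg" backend is chosen only to avoid needing a display):

```python
# Sketch: seaborn heatmap of trip counts by weekday and hour.
# Assumption: the counts are toy values, not the real Uber dataset.
import matplotlib
matplotlib.use("Agg")            # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# toy cross-tabulation: rows = weekday, columns = hour, values = trip counts
counts = pd.DataFrame(
    [[5, 12, 30], [7, 15, 28], [4, 20, 35]],
    index=["Mon", "Tue", "Wed"],
    columns=[8, 12, 18],
)
ax = sns.heatmap(counts, annot=True, fmt="d")
ax.set_xlabel("hour")
ax.set_ylabel("weekday")
plt.savefig("heatmap.png")
```

Because seaborn.heatmap accepts a DataFrame directly, the row and column labels become the axis tick labels with no extra work.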
KMEANS
The module is employed effectively; the goal of the algorithm is to find groups
within the data, with the number of groups represented by the variable K. Data
points are clustered based on feature similarity.
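The grouping idea can be sketched from scratch with Lloyd's algorithm on toy 2-D points (the project itself uses scikit-learn's KMeans; this is only an illustration of the technique):

```python
# From-scratch sketch of k-means (Lloyd's algorithm) on toy 2-D points.
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            groups[nearest].append(p)
        # update step: each centroid moves to the mean of its group
        centroids = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                     if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(pts, [(0, 0), (10, 10)]))   # -> [(0.0, 0.5), (10.0, 10.5)]
```

Each iteration alternates the two steps above until the centroids stop moving, which minimizes the within-group sum of squared distances.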
YELLOWBRICK
The Yellowbrick visualizers allow users to steer the model selection process,
building intuitions around feature engineering, algorithm selection and
hyperparameter tuning. For example, they help diagnose common problems
surrounding model complexity and bias, heteroscedasticity, underfitting and
overtraining, or class balance issues. By applying visualizers to the model
selection workflow, Yellowbrick lets you steer predictive models toward more
successful results, faster.
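Yellowbrick's KElbowVisualizer automates the classic elbow heuristic for choosing K; the underlying computation can be sketched with scikit-learn alone (the blob data below is invented, not the Uber dataset):

```python
# Elbow-method sketch: the computation behind yellowbrick's KElbowVisualizer.
# Assumption: toy blob data standing in for the project's coordinates.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated 2-D blobs -> the inertia curve should "elbow" at k = 3
X = np.vstack([rng.normal(loc, 0.1, size=(30, 2)) for loc in (0, 5, 10)])

inertias = {}
for k in range(1, 7):
    # inertia_ = sum of squared distances from samples to their centroid
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertias[k], 2))
```

Plotting inertia against k, the "elbow" is the point after which adding clusters yields only marginal improvement; that k is the one the visualizer marks.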
FOLIUM
Folium can be defined as a module and a tool for plotting maps; it is a powerful
library that helps you create several sorts of Leaflet maps. Folium's output is
interactive, which makes the library very useful for dashboard building. By
default, Folium creates a map in a separate HTML file and plots the data
obtained from the dataset.
MODULE SOFTWARE/HARDWARE
Software Configuration:
Anaconda Navigator
Jupyter Notebook
Google Chrome
Hardware Configuration:
Windows 10,8,7
Windows Server 2019 or Windows Server 2016
Memory of at least 4 GB RAM
Storage of at least 128 GB
1 GB Graphics Card
ARCHITECTURAL DESIGN
• The dataset includes primary data on Uber pick-ups with details including the
date and time of the ride as well as longitude-latitude information of the city.
• The dataset is imported and then analyzed using clustering algorithms
(k-means clustering).
• Import the packages numpy and seaborn, and import the dataset (uber.csv)
from Kaggle or GitHub.
• Import the data into a Jupyter notebook and, using the data in the dataset,
calculate the frequency of trips in a given area (New York City) from the
latitudes and longitudes in the dataset.
• Generate the heatmap and predict the pinpoint locations of travel of the cab.
• Mark the centroid points on the map using the module folium.
• Predict the cluster from which the cab is scheduled to pick up, from the data.
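The frequency-counting step above can be sketched without any plotting libraries by bucketing coordinates into a coarse grid (the pickup records below are invented examples, not rows from uber.csv):

```python
# Sketch: count trips per area by bucketing (lat, lon) into a 0.1-degree grid.
# Assumption: toy pickup coordinates standing in for the dataset's rows.
from collections import Counter

pickups = [(40.76, -73.95), (40.77, -73.96), (40.64, -73.78), (40.76, -73.95)]

# grid cell index = coordinate * 10 truncated toward zero (~0.1-degree cells)
cells = Counter((int(lat * 10), int(lon * 10)) for lat, lon in pickups)
for cell, trips in cells.most_common():
    print(cell, trips)
```

The cells with the highest counts are the "hotspots" that the heatmap and the k-means centroids make visible on the real data.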
CHAPTER 5
RESULTS AND DISCUSSIONS
As a result of this system being introduced into our data analysis, it will act as a
mechanism to assign a cab to a customer and cover the gap in communication
between the company and the customer. The company will be able to understand
the customer better by deploying more cabs based on the hotspots plotted on the
map using the algorithm.
After importing the dataset, add more columns such as dom, weekday and hour.
FIG 5.3 FINAL DATASET
After importing (Fig 5.3), let us visualize the data on the basis of the frequency of
cabs travelling in a day with respect to day and month.
Figures 5.4 to 5.11 discuss how the data is visualized on the basis of day and
month, plotting all the points of the dataset on an x-y graph referencing latitude
and longitude; after plotting the points we proceed to data analysis using the
algorithm.
Figure 5.10 shows taking the data from the dataset, using the k-means algorithm
to calculate the centroids, and plotting the points on the map.
The data is collected by means of k-labels, labelling the data in the dataset in the
form of clusters. Fig 5.11 explains the total number of trips on the basis of
clusters.
FIG 5.13 PREDICTION OF PICKUP OF CAB BASED ON THE CLUSTER
Based on figures 5.1 to 5.12 we can conclude that, using data analysis, the
program can predict the pickup location of the cab based on the clusters produced
by k-means clustering, and schedule the cab for pickup.
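The final prediction step amounts to a nearest-centroid lookup, which can be sketched as follows (the centroid coordinates are invented examples, not the fitted values from the report's figures):

```python
# Sketch: assign a new pickup location to its nearest k-means centroid.
# Assumption: the centroids are toy values, not the actual fitted centroids.
def predict_cluster(centroids, point):
    """Return the index of the centroid closest to point (lat, lon)."""
    return min(range(len(centroids)),
               key=lambda i: (centroids[i][0] - point[0]) ** 2
                           + (centroids[i][1] - point[1]) ** 2)

centroids = [(40.73, -73.99), (40.68, -73.96), (40.77, -73.87)]
print(predict_cluster(centroids, (40.76, -73.90)))   # -> 2
```

This mirrors what kmeans.predict() does in the appendix code: the returned index identifies the cluster from which the cab would be dispatched.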
CHAPTER 6
CONCLUSION AND FUTURE WORK
CONCLUSION
This program will make the system deploy more cabs to the required locations
and makes it flexible for users. Users need not worry about the location, as the
program will help in scheduling a cab for pickup nearest to their location. The
program demonstrates machine learning concepts such as data visualization and
data analysis, which makes it flexible and efficient for future work.
FUTURE WORK
In future, the system will provide the pickup location to the users. Users can send
their location to the app, and the program developed in the project will predict the
nearest location to the user and assign a cab to the user. The program and the
data elements in the program must be tested by Uber so that it can be used in an
operational environment. This will make the program for predicting trips using
data analysis more flexible and efficient for users.
REFERENCES
[1]. Poulsen, L.K., Dekkers, D., Wagenaar, N., Snijders, W., Lewinsky, B.,
Mukkamala, R.R. and Vatrapu, R., 2016, June. Green Cabs vs. Uber in New York
City. In 2016 IEEE International Congress on Big Data (BigData Congress) (pp.
222-229). IEEE.
[2]. Verma, N. and Baliyan, N., 2017, July. PAM clustering based taxi hotspot
detection for informed driving. In 2017 8th International Conference on Computing,
Communication and Networking Technologies (ICCCNT) (pp. 1-7). IEEE.
[3]. Sotiropoulos, D.N., Pournarakis, D.E. and Giaglis, G.M., 2016, July. A genetic
algorithm approach for topic clustering: A centroid-based encoding scheme. In
2016 7th International Conference on Information, Intelligence, Systems &
Applications (IISA) (pp. 1-8). IEEE.
[4]. Guha, S. and Mishra, N., 2016. Clustering data streams. In Data stream
management (pp. 169-187). Springer, Berlin, Heidelberg.
[5]. Shah, D., Kumaran, A., Sen, R. and Kumaraguru, P., 2019, May. Travel Time
Estimation Accuracy in Developing Regions: An Empirical Case Study with Uber
Data in Delhi-NCR. In Companion Proceedings of The 2019 World Wide Web
Conference (pp. 130-136). ACM.
[6]. Ahmed, M., Johnson, E.B. and Kim, B.C., 2018. The Impact of Uber and Lyft
on Taxi Service Quality: Evidence from New York City. Available at SSRN 3267082.
[7]. Wallsten, S., 2015. The competitive effects of the sharing economy: how is
Uber changing taxis. Technology Policy Institute, 22, pp.1-21.
[8]. Verma, N. and Baliyan, N., 2017, July. PAM clustering based taxi hotspot
detection for informed driving. In 2017 8th International Conference on Computing,
Communication and Networking Technologies (ICCCNT) (pp. 1-7). IEEE.
[9]. Kumar, A., Surana, J., Kapoor, M. and Nahar, P.A., CSE 255 Assignment II
Perfecting Passenger Pickups: An Uber Case Study.
APPENDIX
A) PLAGIARISM REPORT
UberrelatedDataAnalysis (1).pdf
ORIGINALITY REPORT
Similarity Index: 8%
Internet Sources: 6%
Publications: 4%
Student Papers: 3%
PRIMARY SOURCES
1. www.simplilearn.com (Internet Source) — 3%
2. cseweb.ucsd.edu (Internet Source) — 2%
3. G. Anitha, Mohamed Ismail, S.K. Lakshmanaprabu. "Identification and
characterisation of choroidal neovascularisation using e-Health data through an
optimal classifier", Electronic Government, an International Journal, 2020
(Publication) — 1%
5. www.ubdc.ac.uk (Internet Source) — 1%
B) BASE PAPER
Abstract—The paper explains the working of an Uber dataset, which contains data
produced by Uber for New York City. Uber is defined as a P2P platform. The
platform links you to drivers who can take you to your destination. The dataset
includes primary data on Uber pickups with details including the date and time of
the ride as well as longitude-latitude information. Using the information, the paper
explains the use of the k-means clustering algorithm on the set of data to classify
the various parts of New York City. Since the industry is booming and expected to
grow shortly, effective taxi dispatching will facilitate each driver and passenger to
reduce the wait time to seek out one another. The model is employed to predict
the demand on points of the city.

Keywords—Artificial Neural Network, Genetic Algorithms, K-means Clustering,
Recurrent Neural Network, Time delay Neural Network, Convolutional Neural
Network.

I. INTRODUCTION

The Uber platform connects you with drivers who can take you to your destination
or location. This dataset includes primary data on Uber collections with details that
include the date and time of travel, as well as information on longitude and
latitude. Uber started in San Francisco and has operations in over 900
metropolitan areas worldwide. The prediction of the frequency of trips from the
data is done by implementing the k-means clustering algorithm.

The standard algorithm describes the maximum variance within the group as the
sum of squared Euclidean distances between the points and the corresponding
centroid. The use of the digital computer has since moved to technology where
the program involves the use of neural networks; examples are RNN (Recurrent
Neural Network) and TDNN (Time delay Neural Network) for importing data from
the Uber dataset, which take the data for forecasting on a time horizon.

The ultimate aim of the project is to predict the pickup of the cab on the basis of
clusters defined by the k-means clustering algorithm. This algorithm is used to
divide the dataset into k groups, where k is defined as the number of groups
provided by the user. The standard algorithm describes the maximum variance
within the group as the sum of squared Euclidean distances between the points
and the corresponding centroid.

The important packages used in the project are pandas, numpy, seaborn,
kmeans, yellowbrick and folium.

II. LITERATURE SURVEY

The past few years have seen tremendous growth in Uber-related data analysis
using machine learning. People are coming up with various methods to analyze
Uber-related data. In one such method, k-means clustering is used to estimate the
most likely collection points at a given time and to predict the best nightlife
hotspots by learning trends from previous Uber pickups. The center of the taxi
service decides on the area to be targeted for the pickup of passengers.

This can be justified by explaining that machine learning is the core of Uber and
how it has contributed to tremendous growth: bridging the supply-demand gap,
reduction in ETA, and route optimization.

Poulsen, L.K., in this document, applied an experiment of spatial analysis of Green
Cab and Uber to hotspots of New York to determine the competitive position of the
NYCTLC. The resulting research showed that as the demand for green cabs in
hotspots grew, the demand for Uber taxis in the hotspots also grew. This research
recommends that the NYCTLC create a dashboard that analyzes and displays
data in real time, believing this will increase its competitiveness compared to Uber.
Uber is a recent taxi operator in New York and is constantly devouring the market
share of the yellow and green taxis of the New York Taxi and Limousine
Commission (NYCTLC). The NYCTLC is an agency of the New York City
Government which licenses and regulates the taxi and vehicle-for-hire industries
and also app-based companies. The commission was founded on March 2, 1971
and its headquarters are based in New York.[1]

Faghih, S.S. recommends a recent modeling approach in Manhattan, New York
City, to capture the demand for e-hail services, particularly the Uber application.
Uber collection data is aggregated at the Manhattan TAD level and at 15-minute
time intervals. This aggregation allows the implementation of a modern approach
to spatio-temporal modeling to obtain a spatial and temporal understanding of the
demand. For a typical day, two space-time models were developed using Uber
collection data, STAR and LASSO-STAR, and the MSPE in turn determines the
output of the models. The results of the MSPE have shown that it is
recommended to use the LASSO-STAR system instead of the STAR design. A
comparison between the demand for yellow and Uber taxis in 2014 and 2015 in
New York shows that the demand for Uber has increased.[2]

Guha explained the grouping of sequences calculated and observed by using a
small amount of memory and time, as was necessary for applications that needed
to develop a data-flow model to handle large data sets, and considered
categorizing the data in the form of clusters.[3]

Ahmed, M., has shown, by using detailed trip-level data on taxis and for-hire
vehicles together with incident-level complaint data, how the entry of Uber and
Lyft damaged the quality of taxi services in New York City. The overall effect of
the ride-hailing services was enormous and widespread. One of these effects is
the expansion of the rivalry between Uber and Lyft over the quality of taxi service.
They use a new set of complaint data to measure the (lack of) quality of service
that had never been analyzed before, focusing on the quality dimensions that
generate most of the complaints. The increased competition from these shared
travel services has had a noticeable impact on the behavior of taxi drivers.[4]

Wallsten, S., stated that the results for New York and Chicago are consistent with
the possibility that taxis react to the new competition by improving quality. In New
York, the rise of Uber is linked to a reduction in complaints about taxi trips in the
city. They discuss the competitive effect of ride-sharing on the taxi industry using
the complete dataset of the New York City Taxi and Limousine Commission,
covering more than one billion taxi trips, complaint records for New York and
Chicago, and Google Trends data on the success of Uber, the largest shared
travel service.[5]

Sotiropoulos, D.N., represented that this document addresses the problem of
grouping by using a new approach to genetic algorithms that is highly scalable to
large volumes of textual details, developing a coding scheme based on centroids.
We apply the k-means clustering algorithm in this document. Clustering is the
unsupervised machine learning algorithm used to solve grouping problems based
on similarities. This technique has aroused interest in a wide range of scientific
fields that address clustering methods to solve complex classification problems.[6]

Faghih, S.S. said that the demand for e-hail services is growing rapidly,
particularly in large cities. Uber is the first and most famous e-hail company in the
United States and New York City. A comparison between the demand for yellow
and Uber taxis in New York in 2014 and 2015 shows that the demand for Uber
has increased. To study the forecast performance of the models, data for a typical
day was chosen. Our goal in this document is to describe how these models can
be used for forecasting Uber demand. The Uber data contains information about
the position and time of the pick-ups and returns of each trip during a day; the
available data is the Uber historical data of April 2014.[7] Kumar states that
k-means clustering is used to estimate the most likely collection points at a given
time and to predict the best nightlife hotspots by learning trends from previous
Uber pickups.[8, 9]

L. Liu, C. Andris, and C. Ratti planned a strategy to disclose cab drivers' working
patterns by inspecting their unbroken anatomy track.[10] R-H Hwang focuses on
GPS and the locality to pick up passengers, with a venue-to-venue plot model
referred to as an OFF-ON model.[11]

A. System Architecture

The system architecture for the given module is as follows:

Fig 2. Raw dataset (csv file)

C. Data Importing

A huge amount of trip data will be collected from Uber for training and testing
data. From the collected dataset, the latitude and longitude will be clustered and
classified based on the frequency of trips travelled by the cab during the day.
When these criteria are considered, data preprocessing will be done on these
datasets.

D. Data Visualization

E. Testing Data

Fig. 8. K-means clustering
Fig. 9. Importing the k-means module and calculating the centroids of the dataset
Fig. 14. Predicting the pickup of the cab from the particular cluster and showing
the cluster coordinates.
C) SOURCE CODE

%pylab inline
import pandas
import seaborn

data = pandas.read_csv('Desktop/uber.csv')
data
data.tail()

data['Date/Time'] = data['Date/Time'].map(pandas.to_datetime)

def get_dom(dt):
    return dt.day

data['dom'] = data['Date/Time'].map(get_dom)
data.tail()

def get_weekday(dt):
    return dt.weekday()

data['weekday'] = data['Date/Time'].map(get_weekday)

def get_hour(dt):
    return dt.hour

data['hour'] = data['Date/Time'].map(get_hour)
data.tail()

hist(data.dom, bins=30, rwidth=.8, range=(0.5, 30.5))
xlabel('date of the month')
ylabel('frequency')

def count_rows(rows):
    return len(rows)

by_date = data.groupby('dom').apply(count_rows)
by_date

by_date_sorted = by_date.sort_values()
by_date_sorted

bar(range(1, 31), by_date_sorted)
xticks(range(1, 31), by_date_sorted.index)
xlabel('date of the month')
ylabel('frequency')
title('Frequency by DoM - uber - Apr 2014');

hist(data.hour, bins=24, range=(.5, 24))
hist(data.weekday, bins=7, range=(-.5, 6.5), rwidth=.8, color='#AA6666', alpha=.4)
xticks(range(7), 'Mon Tue Wed Thu Fri Sat Sun'.split())

by_cross = data.groupby('weekday hour'.split()).apply(count_rows).unstack()
seaborn.heatmap(by_cross)

hist(data['Lat'], bins=100, range=(40.5, 41));
hist(data['Lon'], bins=100, range=(-74.1, -73.9));

hist(data['Lon'], bins=100, range=(-74.1, -73.9), color='g', alpha=.5, label='longitude')
grid()
legend(loc='upper left')
twiny()
hist(data['Lat'], bins=100, range=(40.5, 41), color='r', alpha=.5, label='latitude')
legend(loc='best');

figure(figsize=(20, 20))
plot(data['Lon'], data['Lat'], '.', ms=1, alpha=.5)
xlim(-74.2, -73.7)
ylim(40.7, 41)

clus = data[['Lat', 'Lon']]
clus.dtypes

pip install sklearn
pip install yellowbrick

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(clus)
centroids = kmeans.cluster_centers_
centroids

clocation = pandas.DataFrame(centroids, columns=['Latitude', 'Longitude'])
clocation.head()

plt.scatter(clocation['Latitude'], clocation['Longitude'], marker="x", color='r', s=200)

pip install folium
import folium

centroid = clocation.values.tolist()
map = folium.Map(location=[40.71600413400166, -73.98971408426613], zoom_start=10)
for point in range(0, len(centroid)):
    folium.Marker(centroid[point], popup=centroid[point]).add_to(map)
map

label = kmeans.labels_
label

data_new = data
data_new['Clusters'] = label
data_new

seaborn.factorplot(data=data_new, x="Clusters", kind="count", size=7, aspect=2)

count_3 = 0
count_0 = 0
for value in data_new['Clusters']:
    if value == 3:
        count_3 += 1
    if value == 0:
        count_0 += 1
print(count_0, count_3)

new_location = [(40.86, -75.56)]
kmeans.predict(new_location)
clocation.head()