Code Review

This document describes the steps in a PySpark program to find the nearest area codes to a given area code based on latitude and longitude. It cross joins area code data to create all pairings, calculates distances between points using a UDF, ranks distances within each area code group, and filters to the top 5 closest area codes within a maximum distance. The output is saved to a file specified at runtime.


1. Lines 1 to 5 import the project's dependencies.
2. Line 7 sets MAX_DISTANCE to 1000 km, the maximum distance it will cover. You can change that value as needed.
3. Line 9 creates the Spark session with the app name "nearby_area_codes".
4. Line 10 defines a function called "haversine" that finds the distance between two latitude/longitude points.
5. Line 24 registers "haversine" as a UDF (user-defined function) so we can apply it to DataFrames later in the code (see the sketch after step 13's example).
6. Line 26, "file = argv[1]", takes the input file path that you pass when you run the code.
7. Line 27 reads the .csv file you passed in.
8. Lines 28 and 29 filter out any rows whose latitude or longitude is null or empty.
9. Line 31 keeps only the rows whose "clean_phone_country_code" is US.
10. Line 32 groups the rows by "clean_phone_area_code" and renames that column to "npa".
11. Line 33 takes, for each group created in step 10, the average of the "clean_phone_latitude" and "clean_phone_longitude" columns, and renames those columns to "latitude" and "longitude" in the output.
12. Line 35 selects "npa", "latitude" and "longitude" into the output DataFrame (the future output file).
13. Line 36 cross joins those selected columns with themselves, creating new "right_npa", "right_latitude" and "right_longitude" columns. See the example below.

Before steps 12 and 13 we have something like this:

npa  latitude  longitude
212  22.22     34.44
123  12.22     21.33

right_npa  right_latitude  right_longitude
221        14.33           98.33
779        33.44           66.43

After steps 12 and 13 (after the cross join):

npa  latitude  longitude  right_npa  right_latitude  right_longitude
212  22.22     34.44      221        14.33           98.33
123  12.22     21.33      221        14.33           98.33
212  22.22     34.44      779        33.44           66.43
123  12.22     21.33      779        33.44           66.43

For more information on PySpark join types, please look at this article: https://luminousmen.com/post/introduction-to-pyspark-join-types
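
The reviewed source file is not reproduced in this document, so here is a minimal sketch of what steps 1 through 13 could look like in PySpark. The constant, function and column names come from the steps above; the variable names (df, area_codes, right, pairs), the CSV reader options and the exact filter expressions are assumptions, not necessarily the original code.

from math import radians, sin, cos, asin, sqrt
from sys import argv

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

MAX_DISTANCE = 1000  # maximum distance to cover, in km (step 2)

spark = SparkSession.builder.appName("nearby_area_codes").getOrCreate()  # step 3

def haversine(lat1, lon1, lat2, lon2):
    # Step 4: great-circle distance in km between two latitude/longitude points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the Earth's mean radius

# Step 5: register the function as a UDF so it can be applied to DataFrame columns.
haversine_udf = F.udf(haversine, DoubleType())

# Steps 6-8: read the CSV passed as the first argument and drop rows with
# missing coordinates (all columns are read as strings here).
file = argv[1]
df = spark.read.csv(file, header=True)
df = df.filter(F.col("clean_phone_latitude").isNotNull() & (F.col("clean_phone_latitude") != ""))
df = df.filter(F.col("clean_phone_longitude").isNotNull() & (F.col("clean_phone_longitude") != ""))

# Steps 9-12: keep US rows, then average the coordinates per area code ("npa").
area_codes = (
    df.filter(F.col("clean_phone_country_code") == "US")
      .groupBy(F.col("clean_phone_area_code").alias("npa"))
      .agg(F.avg(F.col("clean_phone_latitude").cast("double")).alias("latitude"),
           F.avg(F.col("clean_phone_longitude").cast("double")).alias("longitude"))
      .select("npa", "latitude", "longitude")
)

# Step 13: cross join the table with itself, with the right-hand columns renamed.
right = (area_codes
         .withColumnRenamed("npa", "right_npa")
         .withColumnRenamed("latitude", "right_latitude")
         .withColumnRenamed("longitude", "right_longitude"))
pairs = area_codes.crossJoin(right)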
14. Lines 38 to 41 pull the values of the "latitude", "longitude", "right_latitude" and "right_longitude" columns into separate variables.
15. Line 43 calculates the distance for each row by applying the UDF created in step 5 to those columns, and stores the result in a new column named "distance".
16. Line 44 filters out the rows where "npa" equals "right_npa", so no area code is paired with itself.
17. Line 45 creates a new column named "rank": the table npa_dist is partitioned into groups by the "npa" column and, within each group, sorted by the "distance" column in ascending order.

To understand step 17, here is a demo below. You can also visit this page for more information: https://www.datasciencemadesimple.com/populate-row-number-in-pyspark-row-number-by-group/
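
The demo is a sketch that continues the names from the previous sketch (pairs, haversine_udf); npa_dist is the table name mentioned in step 17. The real code extracts the coordinate columns into variables first (step 14), which is only mirrored loosely here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Steps 14-15: apply the haversine UDF to each row's pair of coordinates
# and store the result in a new "distance" column.
npa_dist = pairs.withColumn(
    "distance",
    haversine_udf(F.col("latitude"), F.col("longitude"),
                  F.col("right_latitude"), F.col("right_longitude")),
)

# Step 16: drop the rows where an area code is paired with itself.
npa_dist = npa_dist.filter(F.col("npa") != F.col("right_npa"))

# Step 17: number the rows within each npa group, closest distance first.
w = Window.partitionBy("npa").orderBy(F.col("distance").asc())
npa_dist = npa_dist.withColumn("rank", F.row_number().over(w))

npa_dist.show()
# Each npa now carries rank 1, 2, 3, ... for its nearest right_npa neighbours.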

18. Line 46 filters on the rank values, keeping only those that are less than 6 in every npa group.
19. Line 47 filters the distance column so that only rows with a distance less than MAX_DISTANCE remain.
20. Line 48 finally saves the result to the output file. That path is also given by us at run time, in the second argument (as sketched below).
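
A sketch of the last three steps, again continuing the names from the sketches above. The review only says the result is saved to the path given in the second argument, so the output format (CSV here) and write options are assumptions.

from sys import argv
from pyspark.sql import functions as F

# Step 18: keep only the 5 closest area codes in every npa group (rank 1 to 5).
result = npa_dist.filter(F.col("rank") < 6)

# Step 19: keep only pairs that are closer than MAX_DISTANCE kilometres.
result = result.filter(F.col("distance") < MAX_DISTANCE)

# Step 20: write the result to the output path passed as the second argument.
output_path = argv[2]
result.write.csv(output_path, header=True)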
