Code Review

This document describes the steps in a PySpark program to find the nearest area codes to a given area code based on latitude and longitude. It cross joins area code data to create all pairings, calculates distances between points using a UDF, ranks distances within each area code group, and filters to the top 5 closest area codes within a maximum distance. The output is saved to a file specified at runtime.


1. Lines 1 to 5 import the project's dependencies.
2. Line 7 sets MAX_DISTANCE to 1000 km, the maximum distance it will cover. You can change that value as needed.
3. Line 9 creates the Spark session with the app name "nearby_area_codes".
4. Line 10 defines a function called "haversine" that finds the distance between two latitude/longitude points.
5. Line 24 registers "haversine" as a UDF (user-defined function) so we can apply it to DataFrames later in the code (see the sketch after step 13's example).
6. Line 26, "file = argv[1]", takes the input file path that you pass when you run the code.
7. Line 27 reads the .csv file you passed in.
8. Lines 28 and 29 filter out any rows whose latitude or longitude is null or empty.
9. Line 31 keeps only the rows whose "clean_phone_country_code" is US.
10. Line 32 groups the rows by "clean_phone_area_code" and renames that column to "npa".
11. Line 33 takes, for each group created in step 10, the average of the "clean_phone_latitude" and "clean_phone_longitude" columns, and renames those columns to "latitude" and "longitude" in the output.
12. Line 35 selects "npa", "latitude" and "longitude" into the output DataFrame (the future output file).
13. Line 36 cross joins those selected columns with themselves, creating new "right_npa", "right_latitude" and "right_longitude" columns. See the example below.

Before steps 12 and 13 we have something like this:

npa  latitude  longitude
212  22.22     34.44
123  12.22     21.33

right_npa  right_latitude  right_longitude
221        14.33           98.33
779        33.44           66.43

After steps 12 and 13 (after the cross join):

npa  latitude  longitude  right_npa  right_latitude  right_longitude
212  22.22     34.44      221        14.33           98.33
123  12.22     21.33      221        14.33           98.33
212  22.22     34.44      779        33.44           66.43
123  12.22     21.33      779        33.44           66.43

For more information on PySpark join types, please look at this article: https://luminousmen.com/post/introduction-to-pyspark-join-types
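
The reviewed source file is not reproduced in this document, so here is a minimal sketch of what steps 1 through 13 could look like in PySpark. The constant, function and column names come from the steps above; the variable names (df, area_codes, right, pairs), the CSV reader options and the exact filter expressions are assumptions, not necessarily the original code.

from math import radians, sin, cos, asin, sqrt
from sys import argv

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

MAX_DISTANCE = 1000  # maximum distance to cover, in km (step 2)

spark = SparkSession.builder.appName("nearby_area_codes").getOrCreate()  # step 3

def haversine(lat1, lon1, lat2, lon2):
    # Step 4: great-circle distance in km between two latitude/longitude points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the Earth's mean radius

# Step 5: register the function as a UDF so it can be applied to DataFrame columns.
haversine_udf = F.udf(haversine, DoubleType())

# Steps 6-8: read the CSV passed as the first argument and drop rows with
# missing coordinates (all columns are read as strings here).
file = argv[1]
df = spark.read.csv(file, header=True)
df = df.filter(F.col("clean_phone_latitude").isNotNull() & (F.col("clean_phone_latitude") != ""))
df = df.filter(F.col("clean_phone_longitude").isNotNull() & (F.col("clean_phone_longitude") != ""))

# Steps 9-12: keep US rows, then average the coordinates per area code ("npa").
area_codes = (
    df.filter(F.col("clean_phone_country_code") == "US")
      .groupBy(F.col("clean_phone_area_code").alias("npa"))
      .agg(F.avg(F.col("clean_phone_latitude").cast("double")).alias("latitude"),
           F.avg(F.col("clean_phone_longitude").cast("double")).alias("longitude"))
      .select("npa", "latitude", "longitude")
)

# Step 13: cross join the table with itself, with the right-hand columns renamed.
right = (area_codes
         .withColumnRenamed("npa", "right_npa")
         .withColumnRenamed("latitude", "right_latitude")
         .withColumnRenamed("longitude", "right_longitude"))
pairs = area_codes.crossJoin(right)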
14. Lines 38 to 41 pull the values of the "latitude", "longitude", "right_latitude" and "right_longitude" columns into separate variables.
15. Line 43 calculates the distance for each row by applying the UDF created in step 5 to those columns, and stores the result in a new column named "distance".
16. Line 44 filters out the rows where "npa" equals "right_npa", so no area code is paired with itself.
17. Line 45 creates a new column named "rank": the table npa_dist is partitioned into groups by the "npa" column and, within each group, sorted by the "distance" column in ascending order.

To understand step 17, here is a demo below. You can also visit this page for more information: https://www.datasciencemadesimple.com/populate-row-number-in-pyspark-row-number-by-group/
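
The demo is a sketch that continues the names from the previous sketch (pairs, haversine_udf); npa_dist is the table name mentioned in step 17. The real code extracts the coordinate columns into variables first (step 14), which is only mirrored loosely here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Steps 14-15: apply the haversine UDF to each row's pair of coordinates
# and store the result in a new "distance" column.
npa_dist = pairs.withColumn(
    "distance",
    haversine_udf(F.col("latitude"), F.col("longitude"),
                  F.col("right_latitude"), F.col("right_longitude")),
)

# Step 16: drop the rows where an area code is paired with itself.
npa_dist = npa_dist.filter(F.col("npa") != F.col("right_npa"))

# Step 17: number the rows within each npa group, closest distance first.
w = Window.partitionBy("npa").orderBy(F.col("distance").asc())
npa_dist = npa_dist.withColumn("rank", F.row_number().over(w))

npa_dist.show()
# Each npa now carries rank 1, 2, 3, ... for its nearest right_npa neighbours.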

18. Line 46 filters on the rank values, keeping only those that are less than 6 in every npa group.
19. Line 47 filters the distance column so that only rows with a distance less than MAX_DISTANCE remain.
20. Line 48 finally saves the result to the output file. That path is also given by us at run time, in the second argument (as sketched below).
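
A sketch of the last three steps, again continuing the names from the sketches above. The review only says the result is saved to the path given in the second argument, so the output format (CSV here) and write options are assumptions.

from sys import argv
from pyspark.sql import functions as F

# Step 18: keep only the 5 closest area codes in every npa group (rank 1 to 5).
result = npa_dist.filter(F.col("rank") < 6)

# Step 19: keep only pairs that are closer than MAX_DISTANCE kilometres.
result = result.filter(F.col("distance") < MAX_DISTANCE)

# Step 20: write the result to the output path passed as the second argument.
output_path = argv[2]
result.write.csv(output_path, header=True)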
