Scraping ArcGIS
FOR FUN AND PROFIT, KNOWLEDGE AND HUMANITY
@mathdroid
1. Pre: Who, What, Why
2. Main: How, Tools, Usage
3. Post: Growth, Reach, Impact, Lessons learned
Who
What
Background
- Had just finished surgery, still in the hospital
- Tech Twitter started talking about the new coronavirus
Why
If COVID-19 data were more accessible to everyone, more useful things could be built to combat it.
Why
1. Too many sources
2. Not formatted uniformly
3. CORS

REPUTABLE
[JHU CSSE screenshot here]
[Worldometers screenshot here]
How
Analyze:
- The dashboard is an auto-updating SPA
- No login required
- Need to extract data that is available in the page
How
ALL ROADS LEAD TO ROME
How
Who would win?
1. An extensive HTTP client library, combined with a blazing fast DOM parser/manipulator
2. A smol inspect element boi

How
Who would win?
1. An extensive HTTP client library, combined with a blazing fast DOM parser/manipulator
2. A STRONK inspect element boi
How
EASY?
REQUIRED HEADERS
How
fetch("https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/3/query?f=json&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&outStatistics=%5B%7B%22statisticType%22%3A%22exceedslimit%22%2C%22outStatisticFieldName%22%3A%22exceedslimit%22%2C%22maxPointCount%22%3A4000%2C%22maxRecordCount%22%3A2000%2C%22maxVertexCount%22%3A250000%7D%5D", {
  "referrer": "https://www.arcgis.com/apps/opsdashboard/index.html",
  "referrerPolicy": "no-referrer-when-downgrade",
  "body": null,
  "method": "GET",
  "mode": "cors"
});
How
curl 'https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/3/query?f=json&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&outStatistics=%5B%7B%22statisticType%22%3A%22exceedslimit%22%2C%22outStatisticFieldName%22%3A%22exceedslimit%22%2C%22maxPointCount%22%3A4000%2C%22maxRecordCount%22%3A2000%2C%22maxVertexCount%22%3A250000%7D%5D' \
  -H 'Referer: https://www.arcgis.com/apps/opsdashboard/index.html' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36' \
  -H 'DNT: 1' \
  --compressed
Tool 1: Insomnia
“SCRAPE” IS POSSIBLE!
[scribble]
Tool 2: ZEIT Now
SUCCESS
Data wrangling
Things to consider:
- All XHR to ArcGIS servers use the same format (they have docs on this)
- There are 2 main formats (sketched below):
  - when the returned data is an array (collection)
  - when the returned data is an item (statistics)
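A minimal sketch of both shapes, assuming Node 18+ for global fetch: each query comes back as a features array of attributes objects, and a statistics query just returns a single aggregate feature. The layer index and the "Confirmed" field name here are illustrative assumptions, not taken from the slides.

// Sketch: querying an ArcGIS FeatureServer layer.
// Layer index ("/1") and field names are assumptions for illustration.
const BASE =
  "https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/1/query";

async function query(params) {
  const qs = new URLSearchParams({ f: "json", where: "1=1", ...params });
  // the dashboard's services wanted a Referer header (see the curl above)
  const res = await fetch(`${BASE}?${qs}`, {
    headers: { Referer: "https://www.arcgis.com/apps/opsdashboard/index.html" },
  });
  return res.json();
}

async function main() {
  // Format 1 (collection): one attributes object per feature.
  const rows = await query({ outFields: "*", returnGeometry: "false" });
  const regions = rows.features.map((f) => f.attributes);

  // Format 2 (statistics): a single feature holding the aggregate value.
  const stats = await query({
    outStatistics: JSON.stringify([
      { statisticType: "sum", onStatisticField: "Confirmed", outStatisticFieldName: "total" },
    ]),
  });
  console.log(regions.length, stats.features[0].attributes.total);
}

main();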
Repeat as needed:
- Value in [bracket] will be available in req.query
- Handle data that exceeds limits (ArcGIS caps each response at 1000 results max)
- Do heavy calculations server-side, but cache the result for a bit
(the first two are sketched below; caching comes up again under Cache setup)
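A minimal sketch of such an endpoint, assuming a ZEIT Now function at api/[country].js; the file name, the Country_Region field, and the node-fetch dependency are illustrative assumptions:

// api/[country].js — the [country] segment of the path lands in req.query.
const fetch = require("node-fetch");

const BASE =
  "https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/1/query";

module.exports = async (req, res) => {
  const { country } = req.query; // value from the [bracket] in the file name

  // ArcGIS caps each response, so page through with resultOffset until the
  // server stops reporting exceededTransferLimit.
  const features = [];
  for (let offset = 0; ; offset += 1000) {
    const qs = new URLSearchParams({
      f: "json",
      where: country ? `Country_Region='${country}'` : "1=1", // field name assumed
      outFields: "*",
      returnGeometry: "false",
      resultOffset: String(offset),
      resultRecordCount: "1000",
    });
    const page = await (await fetch(`${BASE}?${qs}`)).json();
    features.push(...(page.features || []));
    if (!page.exceededTransferLimit) break; // no more pages
  }

  res.json(features.map((f) => f.attributes));
};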
uWu wats this
Open Graph image gen
1. Fetch required data
2. Generate HTML string (+CSS/JS)
3. Screenshot the generated HTML using Puppeteer
4. Return the image/png
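The four steps above, as a hedged sketch of a ZEIT Now function; getStats and renderHtml are hypothetical helpers, and a real lambda would likely need chrome-aws-lambda + puppeteer-core rather than full puppeteer:

// Sketch only: assumes puppeteer is available in the deployment.
const puppeteer = require("puppeteer");

module.exports = async (req, res) => {
  const stats = await getStats(req.query.country); // 1. fetch required data (hypothetical helper)
  const html = renderHtml(stats);                  // 2. generate HTML string (hypothetical helper)

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 630 }); // standard OG image size
  await page.setContent(html, { waitUntil: "networkidle0" });
  const png = await page.screenshot({ type: "png" });   // 3. screenshot the HTML
  await browser.close();

  res.setHeader("Content-Type", "image/png");           // 4. return the image/png
  res.send(png);
};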
Cache setup
stale-while-revalidate
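On ZEIT Now's CDN this is a single response header set inside the function; a minimal sketch (the durations are illustrative):

// Serve the cached copy for 60s; after that, keep serving the stale copy
// while the CDN re-runs the function in the background.
res.setHeader("Cache-Control", "s-maxage=60, stale-while-revalidate=300");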
NOW WE WAIT
MONITORING
Logs
Post
Out of nowhere, NOTICED BY SENPAI
(LITERALLY THE JAVASCRIPT SENPAI)
Then
And not only websites
AS STRONG AS YOUR CREDIT CARD
$0 PER MONTH
Others
- A ton of new friends
- Some job offers
- Sponsorship
- Made a tool to help “scrape” in this way using Puppeteer (a ZEIT Now version upgrade broke it: max 10s per lambda)
Lessons learned
- I would use ZEIT again when there are no long-running processes (a good fit for lambdas), since it’s very economical and highly scalable.
- I would set up a persistence layer from day 1, along with git-style diffing.
- Analytics are useful, but they can be EXPENSIVE.
- Integration tests are ESSENTIAL, especially when you are scraping.
Just build it.
People
Yahya @k1m0ch1
Yogs @teman_bahagia
Dito @morpigg
You