Scraping ArcGIS
FOR FUN AND PROFIT, KNOWLEDGE AND HUMANITY
@mathdroid
1. Pre: Who, What, Why
2. Main: How, Tools, Usage
3. Post: Growth, Reach, Impact, Lessons learned
Who
What
Background
- Had just finished surgery, still in the hospital
- Tech Twitter started talking about the new coronavirus
Why
If COVID-19 data were more accessible to everyone, more useful things could be built to combat it.
Why
1. Too many sources
2. Not formatted uniformly
3. CORS

REPUTABLE
[JHU CSSE screenshot here]
[Worldometers screenshot here]
How
Analyze:
- The dashboard is an auto-updating SPA
- No login required
- Need to extract data that is available in the page
How
ALL ROADS LEAD TO ROME
How
Who would win?
1. An extensive HTTP client library, combined with a blazing fast DOM parser/manipulator
2. A smol inspect element boi

How
Who would win?
1. An extensive HTTP client library, combined with a blazing fast DOM parser/manipulator
2. A STRONK inspect element boi
How
EASY?
REQUIRED HEADERS
How
fetch("https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/3/query?f=json&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&outStatistics=%5B%7B%22statisticType%22%3A%22exceedslimit%22%2C%22outStatisticFieldName%22%3A%22exceedslimit%22%2C%22maxPointCount%22%3A4000%2C%22maxRecordCount%22%3A2000%2C%22maxVertexCount%22%3A250000%7D%5D", {
  "referrer": "https://www.arcgis.com/apps/opsdashboard/index.html",
  "referrerPolicy": "no-referrer-when-downgrade",
  "body": null,
  "method": "GET",
  "mode": "cors"
});
How
curl 'https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/3/query?f=json&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&outStatistics=%5B%7B%22statisticType%22%3A%22exceedslimit%22%2C%22outStatisticFieldName%22%3A%22exceedslimit%22%2C%22maxPointCount%22%3A4000%2C%22maxRecordCount%22%3A2000%2C%22maxVertexCount%22%3A250000%7D%5D' \
  -H 'Referer: https://www.arcgis.com/apps/opsdashboard/index.html' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36' \
  -H 'DNT: 1' \
  --compressed
Tool 1: Insomnia
“SCRAPE” IS POSSIBLE!
[scribble]
Tool 2: ZEIT Now
SUCCESS
Data wrangling
Things to consider:
- All XHR to ArcGIS servers use the same format (they have docs on this)
- There are 2 main formats (sketched below):
  - when the returned data is an array (collection)
  - when the returned data is an item (statistics)
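A minimal sketch of both shapes, assuming Node 18+ for global fetch: each query comes back as a features array of attributes objects, and a statistics query just returns a single aggregate feature. The layer index and the "Confirmed" field name here are illustrative assumptions, not taken from the slides.

// Sketch: querying an ArcGIS FeatureServer layer.
// Layer index ("/1") and field names are assumptions for illustration.
const BASE =
  "https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/1/query";

async function query(params) {
  const qs = new URLSearchParams({ f: "json", where: "1=1", ...params });
  // the dashboard's services wanted a Referer header (see the curl above)
  const res = await fetch(`${BASE}?${qs}`, {
    headers: { Referer: "https://www.arcgis.com/apps/opsdashboard/index.html" },
  });
  return res.json();
}

async function main() {
  // Format 1 (collection): one attributes object per feature.
  const rows = await query({ outFields: "*", returnGeometry: "false" });
  const regions = rows.features.map((f) => f.attributes);

  // Format 2 (statistics): a single feature holding the aggregate value.
  const stats = await query({
    outStatistics: JSON.stringify([
      { statisticType: "sum", onStatisticField: "Confirmed", outStatisticFieldName: "total" },
    ]),
  });
  console.log(regions.length, stats.features[0].attributes.total);
}

main();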
Repeat as needed:
- Value in [bracket] will be available in req.query
- Handle data that exceeds limits (ArcGIS caps each response at 1000 results max)
- Do heavy calculations server-side, but cache the result for a bit
(the first two are sketched below; caching comes up again under Cache setup)
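A minimal sketch of such an endpoint, assuming a ZEIT Now function at api/[country].js; the file name, the Country_Region field, and the node-fetch dependency are illustrative assumptions:

// api/[country].js — the [country] segment of the path lands in req.query.
const fetch = require("node-fetch");

const BASE =
  "https://services9.arcgis.com/N9p5hsImWXAccRNI/arcgis/rest/services/Nc2JKvYFoAEOFCG5JSI6/FeatureServer/1/query";

module.exports = async (req, res) => {
  const { country } = req.query; // value from the [bracket] in the file name

  // ArcGIS caps each response, so page through with resultOffset until the
  // server stops reporting exceededTransferLimit.
  const features = [];
  for (let offset = 0; ; offset += 1000) {
    const qs = new URLSearchParams({
      f: "json",
      where: country ? `Country_Region='${country}'` : "1=1", // field name assumed
      outFields: "*",
      returnGeometry: "false",
      resultOffset: String(offset),
      resultRecordCount: "1000",
    });
    const page = await (await fetch(`${BASE}?${qs}`)).json();
    features.push(...(page.features || []));
    if (!page.exceededTransferLimit) break; // no more pages
  }

  res.json(features.map((f) => f.attributes));
};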
uWu wats this
Open Graph image gen
1. Fetch required data
2. Generate HTML string (+CSS/JS)
3. Screenshot the generated HTML using Puppeteer
4. Return the image/png
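The four steps above, as a hedged sketch of a ZEIT Now function; getStats and renderHtml are hypothetical helpers, and a real lambda would likely need chrome-aws-lambda + puppeteer-core rather than full puppeteer:

// Sketch only: assumes puppeteer is available in the deployment.
const puppeteer = require("puppeteer");

module.exports = async (req, res) => {
  const stats = await getStats(req.query.country); // 1. fetch required data (hypothetical helper)
  const html = renderHtml(stats);                  // 2. generate HTML string (hypothetical helper)

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 630 }); // standard OG image size
  await page.setContent(html, { waitUntil: "networkidle0" });
  const png = await page.screenshot({ type: "png" });   // 3. screenshot the HTML
  await browser.close();

  res.setHeader("Content-Type", "image/png");           // 4. return the image/png
  res.send(png);
};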
Cache setup
stale-while-revalidate
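On ZEIT Now's CDN this is a single response header set inside the function; a minimal sketch (the durations are illustrative):

// Serve the cached copy for 60s; after that, keep serving the stale copy
// while the CDN re-runs the function in the background.
res.setHeader("Cache-Control", "s-maxage=60, stale-while-revalidate=300");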
NOW WE WAIT
MONITORING
Logs
Post
Out of nowhere, NOTICED BY SENPAI
(LITERALLY THE JAVASCRIPT SENPAI)
Then
And not only websites
AS STRONG AS YOUR CREDIT CARD
$0 PER MONTH
Others
- A ton of new friends
- Some job offers
- Sponsorship
- Made a tool to help “scrape” in this way using Puppeteer (a ZEIT Now version upgrade broke it: max 10s per lambda)
Lessons learned
- I would use ZEIT again when there are no long-running processes (a good fit for lambdas), since it’s very economical and highly scalable.
- I would set up a persistence layer from day 1, along with git-style diffing.
- Analytics are useful, but they can be EXPENSIVE.
- Integration tests are ESSENTIAL, especially when you are scraping.
Just build it.
People
Yahya @k1m0ch1
Yogs @teman_bahagia
Dito @morpigg
You