Big Data and Analytics: Getting Started with ArcGIS
Mike Park, Erik Hoel
• Big Data is a loosely defined term used to describe data sets so large and
complex that they become awkward to work with using standard software in a
tolerable elapsed time
- Big data "size" is a constantly moving target, ranging from a few dozen terabytes to
many petabytes of data
- 90% of all recorded data has been generated in the past three years
• Every 60 seconds:
- 100,000 tweets
- 2.4 million Google searches
- 11 million instant messages
- 170 million email messages
- 1,800 TB of data
[Figures: NYC Taxis by Day; Manhattan Taxis, Friday after 8pm]
Big data
What techniques are applied to handle it?
• Data distribution – large datasets are split into smaller datasets and distributed across a
collection of machines
• Parallel processing – using a collection of machines to process the smaller datasets, combining
the partial results together
• Fault tolerance – making copies of the partitioned data to ensure that if a machine fails, the
dataset can still be processed
• Commodity hardware – using standard hardware that is not dependent upon exotic
architectures, topologies, or data storage (e.g., RAID)
• Scalability – algorithms and frameworks that can be easily scaled to run on larger collections of
machines in order to address larger datasets
“Big data is not about the data.”
– Gary King, Harvard University, Director, Inst. for Quantitative Social Science
(Making the point that while data is plentiful and easy to collect, the real value is in the analytics)
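A minimal sketch of the data distribution and parallel processing ideas above, written in plain Python. It is illustrative only and is not how GeoAnalytics is implemented: worker threads stand in for machines, and the function names (`partition`, `partial_sum`, `distributed_mean`) are hypothetical. Fault tolerance (keeping copies of each partition) is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Data distribution: split records into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_sum(chunk):
    """Work done independently on one chunk: return (sum, count)."""
    return sum(chunk), len(chunk)


def distributed_mean(records, n_workers=4):
    """Parallel processing: compute partial results, then combine them."""
    chunks = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
print(distributed_mean(values))  # 50.5
```

The mean is computed from per-chunk sums and counts rather than per-chunk means, so the combined result stays correct even when the chunks have different sizes.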
ArcGIS users have big data
• Smart Sensors
- Electrical meters (AMI), SCADA, UAVs
• GPS Telemetry
- Vehicle tracking, smartphone data collectors, workforce tracking, geofencing
• Internet data
- Social media streams, web log files, customer sentiment
• Sensor data
- Weather sensors, stream gauge measurements, heavy equipment monitors, …
• Imagery
- Satellites, frame cameras, drones
GeoAnalytics Examples
• Aggregate vehicle locations into cells for each 10 minute period to reveal traffic
patterns
• Aggregate 911 call logs into census blocks by hour to reveal call patterns
• Enrich very large numbers of point locations with contextual data and then
select the subset of locations meeting certain criteria
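The first example above (vehicle locations aggregated into cells for each 10 minute period) can be sketched in plain Python as a space-time binning step. This is a rough illustration, not the GeoAnalytics tool itself: the cell size, the reference time `EPOCH`, the sample points, and the function name `space_time_bin` are all assumptions made for the example.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

CELL_SIZE = 1000.0                 # assumed cell edge length, in map units
TIME_STEP = timedelta(minutes=10)  # the 10 minute period from the example
EPOCH = datetime(2016, 1, 1)       # hypothetical reference time for time steps


def space_time_bin(x, y, t):
    """Return the (column, row, time step) bin that a point falls into."""
    col = math.floor(x / CELL_SIZE)
    row = math.floor(y / CELL_SIZE)
    step = (t - EPOCH) // TIME_STEP  # whole number of 10 minute periods
    return col, row, step


# Hypothetical vehicle locations: (x, y, timestamp)
points = [
    (120.0, 80.0, datetime(2016, 1, 1, 0, 3)),
    (950.0, 400.0, datetime(2016, 1, 1, 0, 9)),
    (1500.0, 200.0, datetime(2016, 1, 1, 0, 4)),
    (130.0, 90.0, datetime(2016, 1, 1, 0, 12)),
]

# Count the points that fall into each space-time bin.
counts = Counter(space_time_bin(x, y, t) for x, y, t in points)
for bin_key, n in sorted(counts.items()):
    print(bin_key, n)
```

Because each point's bin depends only on that point, the binning can run independently on each partition of a distributed dataset, with the per-bin counts summed afterwards.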
10.4
Road ahead?
GeoAnalytics 10.4
What is it, and what does it enable me to do?
• It provides me:
- The ability to do fast batch analysis on large raster and image datasets
• Batch analysis means the ability to run analysis jobs on large datasets
- The input is a persisted standard or big dataset
- The output is a persisted standard or big dataset
• Datasets
- Standard geospatial data (geodatabases, files, services)
- Big Data (databases, files, services)
• Key point:
With suitably scaled GeoAnalytics, jobs that would take hours now take minutes
GeoAnalytics Extension for Server 10.4
• Adds out of the box analytics to ArcGIS Server
- Analysis in ArcGIS Pro and Portal
- Powered by a new Analysis Service / Toolbox in Server
- Focused analysis for big data
• Works with:
- Standard geospatial data (geodatabases, files, services)
- Big Data (databases, files, services)
GeoAnalytics Extension for Server 10.4
Overview
• Users are able to manage, analyze, and visualize big data to derive valuable
information
GeoAnalytics
Performance: minutes, not hours
[Figure: two performance charts; x-axis: Cores (1 to 7)]
GeoAnalytics
User Experience - Analysis
• ArcGIS Pro:
- Out of the box tools that run in Server and process
services and registered data using a GP tool interface
Analytic capabilities
ArcGIS 10.4 release
• Summarize Data
- Aggregate Points by Polygon + time
- Aggregate by Cell + time
- Summarize Nearby + time
- Summarize Within + time
• Find Locations
- Find Existing Locations
- Find Similar Locations
• Analyze Patterns
- Calculate Density
- Find Hot Spots + time
• Use Proximity
- Create Buffers + time
• Manage Data
- Extract Data
- Field Calculator
- Geocode Addresses
• Both GIS data stores and big data stores are supported
- Map and Feature services
- ArcGIS SQL Data Store
[Figure: diagrams with X, Y, and Time axes; labels: 30 minutes, 10 meters, 5 minutes, 4 meters]
Why Time Is Important 10.4
Summarization
Aggregation
Summary statistics
• Numeric Statistics
- Count
- Min
- Max
- Sum
- Mean
- Standard Deviation
- Variance
• Text Statistics
- Min (alphabetical ordering)
- Max (alphabetical ordering)
- Any
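The statistics listed above can be sketched with Python's standard library. This is an illustration under stated assumptions rather than the tool's implementation: the slide does not say whether standard deviation and variance are sample or population statistics (the sample versions are used here), "Any" is assumed to mean an arbitrary value from the group, and Python compares strings by code point, which is only an approximation of alphabetical ordering. The function names are hypothetical.

```python
import statistics


def numeric_summary(values):
    """Numeric statistics for one group (needs at least two values)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),       # sample std. dev. (assumption)
        "variance": statistics.variance(values),  # sample variance (assumption)
    }


def text_summary(values):
    """Text statistics for one group of strings."""
    return {
        "min": min(values),  # first in code point order
        "max": max(values),  # last in code point order
        "any": values[0],    # an arbitrary member; here simply the first
    }


print(numeric_summary([10, 20, 30]))
print(text_summary(["Oak", "Elm", "Pine"]))
```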
Aggregation
Summary methods
Aggregation 10.4
Point counts and attribute means
[Figure: example of point counts and attribute means]
Path generation 10.4
Vertex count aggregation
[Figure: example of vertex counts]
Spatio-temporal big data store 10.4
Management
• I don’t need to learn anything new; I use it just like existing GP tools