Introduction to Big Data with Apache Spark
UC Berkeley
This Lecture
Course Goals
1. Learn about Data Science
Data Acquisition
Data Preparation
Analysis
Data Presentation
Data Products
2. Learn how to perform Data Science
3. Learn to write Apache Spark programs
History and development
Conceptual model
How the Spark cluster model works
Spark essentials (transformations, actions, persistence, broadcast variables, accumulators, key-value pairs, PySpark API)
Debugging Spark programs
Using Spark MLlib for Machine Learning
W. E. Deming
Images: http://culturacientifica.wikispaces.com/CONTRIBUCIONES+DE+SIR+RONALD+FISHER+A+LA+ESTADISTICA+GENETICA
http://es.wikipedia.org/wiki/William_Edwards_Deming
John W. Tukey
Howard Dresner
1989: Business Intelligence
Images: http://www.businessintelligence.info/definiciones/business-intelligence-system-1958.html
http://www.betterworldbooks.com/exploratory-data-analysis-id-0201076160.aspx
https://www.flickr.com/photos/42266634@N02/4621418442
Google
1996: Prototype Search Engine
Exponential growth in data volume
2010: The Data Deluge
http://en.wikipedia.org/wiki/Seven_Countries_Study
http://www.theguardian.com/world/2012/nov/07/nate-silver-election-forecasts-right
Searches for Facebook
http://arxiv.org/abs/1401.4208
https://www.facebook.com/notes/mike-develin/debunking-princeton/10151947421191849
Graph Data
Lots of interesting data has a graph structure:
Social networks
Telecommunication networks
Computer networks
Road networks
Collaborations/Relationships
Some of these graphs can get quite large (e.g., Facebook user graph)
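As a rough illustration of what "graph structure" means for data like a social network, the sketch below stores a handful of friendships as an adjacency list in plain Python. The names, the edge list, and the helper functions build_adjacency and degree are hypothetical and chosen only for this example; they are not taken from the lecture, and a graph at the scale of the Facebook user graph would need a distributed system rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        # An undirected edge is recorded in both directions.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def degree(adjacency, node):
    """Number of neighbors of node; 0 if the node is not in the graph."""
    return len(adjacency[node]) if node in adjacency else 0


# Hypothetical friendship edges, made up for illustration only.
friendships = [("ana", "bo"), ("bo", "cy"), ("ana", "cy"), ("cy", "di")]
graph = build_adjacency(friendships)

print(sorted(graph["cy"]))
print(degree(graph, "cy"))
```

The same edge-list representation (one pair per line) is a common starting point when graph data is later loaded into a cluster framework.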
Internet of Things: Example Measurements
Figure: Humidity vs. Time and Temperature (C) for sensors 101, 104, 109, 110, and 111; x-axis: Date, 7/7/03 9:40 to 7/9/03 11:03.
Height scale: 36 meters. Sensor IDs by height: 33m: 111; 32m: 110; 30m: 109, 108, 107; 20m: 106, 105, 104; 10m: 103, 102, 101
Collected data also used for traffic reporting
http://www.511.org/
http://en.wikipedia.org/wiki/FasTrak
Crowdsourcing + Physical modeling + Sensing + Data Assimilation
http://traffic.berkeley.edu