Artificial Intelligence and Data Analytics (AIDA) E-Learning Module

Introduction to Big Data Computing

Song Guo
Department of Computing
The Hong Kong Polytechnic University
Roadmap
• What is Big Data?

• What is Big Data Analytics?

• Big Data System

• Applications of Big Data

Where Big Data Come From
• Posts to social media sites
• Digital pictures and videos
• Software logs, cameras
• Microphones
• Sensing data
• Scans of government documents
• GPS trails
• Purchase transaction records
• Cell phone GPS signals
• Traffic
• ...

What is Big Data
• Definition from Wikipedia:
• “Big data” is a field that treats ways to analyze,
systematically extract information from, or deal with data
sets that are too large or complex to be dealt with by
traditional data-processing application software.
• The challenges include capturing data, data storage,
data analysis, search, sharing, transfer, visualization,
querying, updating, information privacy and data source.
• Tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data
analytics methods that extract value from data.

Characterization of Big Data
• “Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.” -- Gartner

• “While enterprises struggle to consolidate systems and collapse
redundant databases to enable greater operational, analytical, and
collaborative consistencies, changing economic conditions have made
this job more difficult. Ecommerce, in particular, has exploded data
management challenges along three dimensions: volumes, velocity and
variety. In 2001/02, IT organizations must compile a variety of
approaches to have at their disposal for dealing with each.” – Doug Laney

Characterization of Big Data
• “Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.” -- Gartner

• Vs of big data
• Volume
• Velocity
• Variety
• Veracity

Volume: scale of data
• Data volume is increasing exponentially
• Generated by a huge number of devices and sensors
• 5 billion people have mobile phones
• A modern car has 100 sensors
• CERN's Large Hadron Collider (LHC) generates 15 PB/year
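To put the LHC figure in perspective, the yearly volume can be converted into a sustained data rate. This is a back-of-the-envelope sketch only: it assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and data produced evenly over the year, none of which the slide states. The function name is illustrative.

```python
# Back-of-the-envelope: sustained rate implied by a yearly data volume.
# Assumptions (not from the slide): 1 PB = 10**15 bytes, 365-day year,
# and data produced evenly throughout the year.
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_rate_bytes_per_s(petabytes_per_year: float) -> float:
    """Average bytes per second needed to accumulate the given yearly volume."""
    return petabytes_per_year * PB / SECONDS_PER_YEAR


rate = sustained_rate_bytes_per_s(15)
print(f"{rate / 10**6:.0f} MB/s")  # roughly 476 MB/s, around the clock
```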

Velocity: speed of data generation
• Data is generated fast
• e.g., every 60 seconds, 11 million instant messages and
168 million emails are sent

Velocity: speed of data processing
• Data need to be processed fast
• Online Data Analytics: late decisions mean missed
opportunities
• E.g. 1: Based on your current location and your purchase
history, send promotions right now for the store next to you
• E.g. 2: Sensors monitor your activities and body, and notify
you if there are abnormal measurements
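The sensor example (E.g. 2) can be sketched as a check that runs on each reading as it arrives, rather than in a later batch job. The rule used here (flag a reading more than k standard deviations from the mean of the previous few readings), the window size, and the heart-rate values are all illustrative assumptions, not something the slides prescribe.

```python
from collections import deque
from statistics import mean, stdev


def abnormal_readings(stream, window=5, k=3.0):
    """Yield (index, value) for readings far from the recent baseline.

    A reading is flagged when it lies more than k standard deviations
    from the mean of the previous `window` accepted readings.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > k * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical heart-rate readings (beats per minute).
heart_rate = [72, 74, 71, 73, 75, 74, 140, 73, 72]
alerts = list(abnormal_readings(heart_rate))
print(alerts)  # [(6, 140)]
```

Because the function is a generator, each alert is produced as soon as the offending reading is seen, which is the point of online analytics.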

Variety: data in many forms
[Figure: examples of data forms: text, image, signal (voice, audio), graph, and tuple]
• Text: “John likes to watch movies. Mary likes movies too.” and “John also likes to watch football games.”
• Image: a grid of pixel values (125 200 225; 105 150 255; 15 75 175)
• Tuple: a table with columns signal1 and signal2 (13.58 7.24; 12.11 12.50; 13.49 8.66; 11.25 10.98; 14.57 13.75; 13.22 9.02)
• Graph: four nodes with weighted edges, shown as an adjacency matrix A and an edge list E

A   1    2    3    4
1   0    1.2  4.3  0
2   1.2  0    0.8  0
3   4.3  0.8  0    2.6
4   0    0    2.6  0

E   Node1  Node2  Weight
1   1      2      1.2
2   1      3      4.3
3   2      3      0.8
4   3      4      2.6
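The graph in the figure is given in two equivalent forms: a weighted adjacency matrix A and an edge list E. A minimal sketch of converting between them follows, assuming the graph is undirected and that a weight of 0 means "no edge"; the function names are illustrative.

```python
# Adjacency matrix A from the figure (nodes are numbered 1 to 4).
A = [
    [0.0, 1.2, 4.3, 0.0],
    [1.2, 0.0, 0.8, 0.0],
    [4.3, 0.8, 0.0, 2.6],
    [0.0, 0.0, 2.6, 0.0],
]


def to_edge_list(matrix):
    """Return (node1, node2, weight) for each non-zero entry above the diagonal."""
    n = len(matrix)
    return [
        (i + 1, j + 1, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] != 0
    ]


def to_adjacency_matrix(edges, n):
    """Rebuild a symmetric n-by-n matrix from an undirected edge list."""
    matrix = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        matrix[u - 1][v - 1] = w
        matrix[v - 1][u - 1] = w
    return matrix


E = to_edge_list(A)
print(E)  # [(1, 2, 1.2), (1, 3, 4.3), (2, 3, 0.8), (3, 4, 2.6)]
```

The edge list stores only the four edges that exist, while the matrix stores all 16 entries; for large, sparse graphs such as social networks that difference is what makes the edge-list form attractive.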
Variety: data in many forms
• A single application may generate/collect many types of
data, e.g., the types of data stored in emails
• Tabular data: attributes like subject, to, from
• Text (in email body)
• Image (in attachment)
• Hyperlinks
• Types of data
• Relational Data (e.g., Tables)
• Text Data (e.g., comments)
• Semi-structured Data (e.g., XML)
• Graph Data (e.g., social network)
• What else?
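The email example can be sketched with Python's standard `email` package: one message is split into tabular attributes, body text, image attachments, and hyperlinks. The helper name `split_email`, the sample addresses, and the simple URL pattern are illustrative assumptions.

```python
import re
from email.message import EmailMessage


def split_email(msg: EmailMessage) -> dict:
    """Separate one email into the data types it carries."""
    body_part = msg.get_body(preferencelist=("plain",))
    text = body_part.get_content() if body_part is not None else ""
    images = [
        part.get_filename()
        for part in msg.iter_attachments()
        if part.get_content_maintype() == "image"
    ]
    return {
        # Tabular data: fixed attributes, one row per email.
        "tabular": {
            "subject": str(msg["Subject"]),
            "from": str(msg["From"]),
            "to": str(msg["To"]),
        },
        "text": text,      # free text from the body
        "images": images,  # file names of image attachments
        # Simplified URL pattern, good enough for this sketch only.
        "hyperlinks": re.findall(r"https?://\S+", text),
    }


# Build a small sample message (all values are made up).
msg = EmailMessage()
msg["Subject"] = "Trip photos"
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg.set_content("The full album is at https://example.com/album if you want more.")
msg.add_attachment(
    b"placeholder bytes, not a real image",
    maintype="image",
    subtype="png",
    filename="photo.png",
)

record = split_email(msg)
print(record["tabular"], record["images"], record["hyperlinks"])
```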

Veracity: uncertainty of data
• Is the data accurate?
• Measurement error
• Human errors like typos in names/addresses
• Does the data come from a reliable source?
• What if data from different sources are not consistent?

Fake, Paid-For Reviews on Amazon


Evolution of Data Analytics

Process of Data Analytics

Big Data Driven Approach

Big Data Processing Platforms

Big Data System
• Data Collection and Analytics on
• Clusters of machines
• Program/Query from a single machine perspective
• Data Processing (NoSQL)
• Traditional databases (e.g., Microsoft SQL Server, Oracle) could not
keep up with the needs of Big Data companies (e.g., Google, Facebook)
• What did they need? Extremely high insert throughput (e.g., tweets)
• What did they not need? Transactions (ACID)
• They developed NoSQL “databases” themselves
• Analytic Platform (Hadoop and Spark)

Techniques towards Big Data
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

Big Data Analytics Platform

Hadoop Cluster

Big Data Programming - MapReduce
• Map: a mapping that is responsible for dividing the
data and transforming the original data into
key-value pairs
• Shuffle: the process of further organizing and
delivering the Map output to the Reduce
• the output of the Map must be sorted and partitioned by key
• then passed to the corresponding Reduce
• Reduce: a merge that processes the values with the
same key and then outputs the final result
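The three phases can be illustrated with the classic word-count task. The sketch below runs in a single process purely to show the data flow. It is not how Hadoop is programmed, and the function names and sample documents are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(document):
    """Map: turn one document into (word, 1) key-value pairs."""
    for word in document.lower().split():
        yield word.strip(".,"), 1


def shuffle_phase(pairs):
    """Shuffle: sort the Map output by key and group values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(key, values):
    """Reduce: merge all values that share a key into one result."""
    return key, sum(values)


documents = ["John likes to watch movies.", "Mary likes movies too."]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped))
print(word_counts)
# {'john': 1, 'likes': 2, 'mary': 1, 'movies': 2, 'to': 1, 'too': 1, 'watch': 1}
```

In a real cluster, each document (or block of a file) would be mapped on a different machine, and the shuffle would move each key's values over the network to the machine running its Reduce.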

MapReduce Pipeline

Data Visualization

Taxi trajectories in New York City from November 15 to December 31, 2013
Big Data Application - AlphaGo
• AlphaGo learns from 30 million moves of 160 thousand games
played by experts (5-9 dan) → Big Data
• AlphaGo uses deep learning and neural networks combined with
Monte-Carlo tree search to decide the moves → Analytics

Big Data Application - GFT
• Google Flu Trends (GFT) was once held up as the prototypical
example of the power of big data
• By leveraging search term data, a group of data scientists with
little relevant expertise were able to predict the spread of flu
across the continental United States
• More accurate than the “experts” at the Centers for Disease
Control and Prevention (CDC) with their models built from
expensive survey data

Big Data Application - DeepMind
• Disease Treatment: Joint research between Google
DeepMind and Moorfields Eye Hospital
• Eyecare professionals diagnose eye conditions by using
optical coherence tomography (OCT) scans (over 1,000 a
day at Moorfields alone)
• Achieved an error rate of 5.5%, comparable to the two
best retina specialists (6.7% and 6.8% error rates)

Big Data Application - CityBrain

