1 Introduction To Big Data Management and Processing

The document discusses big data storage and processing. It introduces concepts like Hadoop ecosystem, HDFS, NoSQL databases, and data processing techniques like MapReduce and Spark. It also talks about challenges in big data and popular technologies used.

Uploaded by

tranngocbaooooo12062003


1

Lecture 1
Introduction to big data storage and
processing

2
Syllabus
No. Lecture
1 Overview of big data storage and processing
2 Hadoop ecosystem
3 Hadoop Distributed File System (HDFS)
4 NoSQL non-relational databases, part 1: Overview
5 NoSQL non-relational databases, part 2: Common distributed architectures
6 NoSQL non-relational databases, part 3: SQL queries on NoSQL
7 Distributed messaging systems
8 Batch processing techniques for big data, part 1: MapReduce
9 Batch processing techniques for big data, part 2: Apache Spark
10 Stream processing techniques for big data: Spark Streaming
11 Big data architecture: Lambda architecture
12 Big data analytics: Spark ML

3
How big is big data?

4
5
How big is big data?

6
Data science: The 4th paradigm for scientific
discovery

7
Big data in 2008

8
Big data in 2014

9
Big data today

10
Big numbers

11
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum physics, etc.)

12
Data is the new oil

13
The 5 V's of big data

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them (Wikipedia)

14
Big data – big value

15
source: wipro.com
Big Data in education industry
• Customized and Dynamic Learning Programs
• Reframing Course Material
• Grading Systems
• Career Prediction

16
Edtech
• Coursera
• VioEdu
• https://byjus.com/
• Engaging Video Lessons
• Personalized Learning Journeys
• Mapped to the Syllabus
• In-depth Analysis
• Engaging Interactive Questions

17
Big Data in healthcare industry
• Reduce treatment costs and avoid unnecessary diagnoses
• Predict epidemic outbreaks and plan preventive measures
• Avoid preventable diseases

18
Big Data in government sector
• Welfare schemes
  • Make faster and better-informed decisions
  • Identify areas that are in immediate need of attention
  • Overcome national challenges such as unemployment and terrorism
• Cyber security
  • Fraud detection
  • Catching tax evaders

19
Big Data in media and entertainment
industry
• Predicting the interests of audiences
• Optimized or on-demand scheduling of media streams
in digital media distribution platforms
• Getting insights from customer reviews
• Effective targeting of the advertisements
• Examples: Spotify, Amazon Prime

20
Big data in scientific discovery

CERN's Large Hadron Collider (LHC) generates 15 PB a year

21
Maximilien Brice, © CERN
Top 10 Company Market Cap
Ranking History (1998-2018)

https://www.youtube.com/watch?v=fobx4wIS6W0
22
Top 10 Company Market Cap
Ranking History (1998-2018)

23
Big data technology stack

24
Scalable data management
• Scalability
  • Able to manage increasingly large volumes of data
• Accessibility
  • Able to maintain efficiency when reading and writing data (I/O) to and from data storage systems
• Transparency
  • In a distributed environment, users should be able to access data over the network as easily as if the data were stored locally.
  • Users should not have to know the physical location of data to access it.
• Availability
  • Fault tolerance
  • The number of users, system failures, or other consequences of distribution shouldn't compromise availability.

25
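Data I/O landscape
[Figure: typical data rates seen from one cluster node]
• CPUs: 10 GB/s
• Storage (slower device): 100 MB/s, 3-12 ms random access, $0.025 per GB
• Storage (faster device): 600 MB/s, 0.1 ms random access, $0.35 per GB
• Network: 1 Gb/s or 125 MB/s
  • Nodes in the same rack: 1 Gb/s or 125 MB/s
  • Nodes in another rack: 0.1 Gb/s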
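
To make these rates concrete, the sketch below estimates how long one sequential pass over 1 TB would take at each rate quoted on the slide. The 1 TB data size, the decimal units (1 TB = 10^12 bytes), and the name `scan_seconds` are assumptions chosen for illustration; they do not come from the slide.

```python
# Rates quoted on the slide, converted to bytes per second.
# Decimal units are assumed (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
RATES = {
    "CPUs at 10 GB/s": 10e9,
    "storage at 600 MB/s": 600e6,
    "network at 1 Gb/s (125 MB/s)": 125e6,
    "storage at 100 MB/s": 100e6,
}

ONE_TB = 1e12  # illustrative data size, not from the slide


def scan_seconds(num_bytes, bytes_per_second):
    """Time for one sequential pass, ignoring latency and contention."""
    return num_bytes / bytes_per_second


for name, rate in RATES.items():
    minutes = scan_seconds(ONE_TB, rate) / 60
    print(f"{name}: {minutes:.1f} minutes")
```

Even under these idealized assumptions, a single 100 MB/s device needs hours for one pass, which is one motivation for spreading data and computation across many nodes.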

26
Scalable data ingestion and
processing
• Data ingestion
  • Data from different, complementary information systems must be combined to give a more comprehensive basis for analysis
  • How can data be ingested efficiently from various distributed, heterogeneous sources?
    • Different data formats
    • Different data models and schemas
    • Security and privacy
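
As a minimal sketch of the format and schema problem, the example below reads the same kind of record from a CSV source and a JSON source and maps both onto one common schema. The sample data, the field names, and the target schema (`id`, `name`, `signup_date`) are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Two hypothetical sources that describe the same kind of entity
# using different formats and different field names.
CSV_SOURCE = "user_id,full_name,signup\n1,Alice,2021-03-01\n2,Bob,2021-04-15\n"
JSON_SOURCE = '[{"id": 3, "name": "Carol", "created_at": "2021-05-20"}]'


def from_csv(text):
    """Map CSV rows onto the common schema (id, name, signup_date)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["user_id"]),
            "name": row["full_name"],
            "signup_date": row["signup"],
        }


def from_json(text):
    """Map JSON objects onto the same common schema."""
    for obj in json.loads(text):
        yield {
            "id": int(obj["id"]),
            "name": obj["name"],
            "signup_date": obj["created_at"],
        }


def ingest():
    """Combine both sources into one list of records, ordered by id."""
    records = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
    return sorted(records, key=lambda record: record["id"])


for record in ingest():
    print(record)
```

Each source gets its own small adapter, so adding a new source means writing one more adapter rather than changing the downstream code.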

• Data processing
  • How can a massive volume of data be processed in a timely fashion?
  • How can a massive stream of data be processed in real time?
  • Traditional parallel and distributed processing (OpenMP, MPI)
    • Steep learning curve
    • Limited scalability
    • Fault tolerance is hard to achieve
    • Expensive high-performance computing infrastructure
  • Novel real-time processing architectures
    • E.g. mini-batches in Spark Streaming
    • E.g. complex event processing in Apache Flink
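
The mini-batch idea can be sketched in plain Python: incoming events are grouped into fixed-length time windows, and each window is handed on as a small batch once it closes. This is not Spark Streaming's API; the function name `mini_batches`, the one-second window, and the sample events are assumptions made for illustration, and the sketch assumes events arrive in timestamp order.

```python
def mini_batches(events, batch_seconds):
    """Yield (window_start, values) each time a time window closes.

    `events` is an iterable of (timestamp_in_seconds, value) pairs,
    assumed to arrive in timestamp order.
    """
    current_start = None
    current_values = []
    for timestamp, value in events:
        window_start = int(timestamp // batch_seconds) * batch_seconds
        if current_start is not None and window_start != current_start:
            # The previous window is complete: emit it as one batch.
            yield current_start, current_values
            current_values = []
        current_start = window_start
        current_values.append(value)
    if current_values:
        # Emit the last, possibly partial, window.
        yield current_start, current_values


# Hypothetical event stream: (timestamp, value) pairs.
events = [(0.2, 5), (0.9, 3), (1.1, 7), (2.5, 1), (2.7, 4)]

for start, values in mini_batches(events, batch_seconds=1):
    print(f"window starting at {start}s: count={len(values)}, sum={sum(values)}")
```

Each emitted batch can then be processed with ordinary batch logic, which is the trade-off behind this style: simpler processing in exchange for latency of up to one window length.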

27
Scalable analytic algorithms
• Challenges
  • Big volume
  • Big dimensionality
  • Real-time processing
• Scaling up machine learning algorithms
  • Adapting the algorithm to handle big data on a single machine
    • E.g. sub-sampling
    • E.g. principal component analysis
    • E.g. feature extraction and feature selection
  • Scaling up algorithms through parallelism
    • E.g. k-NN classification based on MapReduce
    • E.g. scaling up support vector machines (SVM) with a divide-and-conquer approach
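
One way k-NN classification might be split into map and reduce steps is sketched below in plain Python (no Hadoop involved). Each partition of the training data finds its own k nearest neighbours of the query in the map step, and the reduce step merges those candidates into the global k nearest before a majority vote. The function names, the toy data, and the use of squared Euclidean distance are assumptions for illustration, not the method of any particular paper.

```python
import heapq
from collections import Counter
from functools import reduce


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_partition(partition, query, k):
    """Map step: the k nearest (distance, label) pairs within one partition."""
    scored = ((squared_distance(point, query), label) for point, label in partition)
    return heapq.nsmallest(k, scored)


def reduce_candidates(left, right, k):
    """Reduce step: merge two candidate lists and keep the k nearest."""
    return heapq.nsmallest(k, left + right)


def knn_classify(partitions, query, k):
    """Classify `query` by majority vote over its k nearest neighbours."""
    local = [map_partition(part, query, k) for part in partitions]
    nearest = reduce(lambda a, b: reduce_candidates(a, b, k), local)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data, already split into two partitions.
partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((5.0, 5.0), "b")],
    [((0.2, 0.1), "a"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")],
]

print(knn_classify(partitions, query=(0.3, 0.3), k=3))
```

Because each partition returns at most k candidates, the data sent to the reduce step stays small no matter how large the partitions are.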

28
E.g. curse of dimensionality
• The number of samples required (to achieve the same accuracy) grows exponentially with the number of variables.
• In practice, the number of training examples is fixed.
• => The classifier's performance will usually degrade when the number of features is large.

In fact, after a certain point, increasing the dimensionality of the problem by adding new features actually degrades the performance of the classifier.
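
A simple grid argument illustrates the exponential growth. This is an illustrative assumption, not the slide's own derivation: if each variable is covered by a fixed number of sample values, the number of grid cells to fill is that number raised to the number of variables. Conversely, with a fixed number of training examples, the resolution available per variable shrinks quickly. The function names below are hypothetical.

```python
def samples_needed(per_axis, dimensions):
    """Samples needed to place `per_axis` values along every axis of a grid."""
    return per_axis ** dimensions


def per_axis_resolution(num_samples, dimensions):
    """Values per axis that a fixed sample budget can afford on a grid."""
    return num_samples ** (1.0 / dimensions)


# Growth of the required sample count at 10 values per variable.
for d in (1, 2, 3, 5, 10):
    print(f"{d} variables: {samples_needed(10, d)} samples")

# Shrinking resolution with a fixed budget of 1000 training examples.
for d in (1, 3, 10):
    print(f"{d} variables: about {per_axis_resolution(1000, d):.1f} values per variable")
```

With 1000 examples and 10 variables, the budget covers only about two values per variable, which is one intuition for why adding features can hurt a classifier trained on a fixed data set.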

29
Utilization and interpretability of big data
• Domain expertise is needed to identify problems and interpret analytics results
• Scalable visualization of millions of data points, to make them easier to interpret and understand

30
Privacy and security

31
Big data job trends

32
Talent shortage in big data

33
Big data skill set

34
How to land big data related jobs
• Learn to code
  • Coursera
  • Udacity
  • freeCodeCamp
  • Codecademy
• Math, stats and machine learning
  • Kaggle
• Hadoop, NoSQL, Spark
• Visualization and reporting
  • Tableau
  • Pentaho
• Meetup & share
• Find a mentor
• Internships, projects

35
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Source: Foundational Methodology for Data Science, IBM, 2015

36

DeepQA: Incremental Progress in Precision and Confidence, 6/2007-11/2010
[Figure: precision (0% to 100%) plotted against % answered (0% to 100%). Curves are labelled Baseline, 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, 4/2010, and 11/2010; an annotation reads "Now Playing in the Winners Cloud".]
37
Cleaning big data: most time-consuming,
least enjoyable data science task
• Data preparation accounts for about 80% of the work of
data scientists

source: https://www.forbes.com/

38
Cleaning big data: most time-consuming,
least enjoyable data science task
• 57% of data scientists regard cleaning and organizing
data as the least enjoyable part of their work and 19%
say this about collecting data sets.

40
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data

41
Thank you for your attention!
