
1 Introduction To Big Data Management and Processing

The document discusses big data storage and processing. It introduces concepts like Hadoop ecosystem, HDFS, NoSQL databases, and data processing techniques like MapReduce and Spark. It also talks about challenges in big data and popular technologies used.

1

Lecture 1
Introduction to big data storage and
processing

2
Syllabus
No. Lecture

1 Overview of big data storage and processing
2 The Hadoop ecosystem
3 The Hadoop Distributed File System (HDFS)
4 NoSQL non-relational databases, part 1: Overview
5 NoSQL non-relational databases, part 2: Common distributed architectures
6 NoSQL non-relational databases, part 3: SQL queries on NoSQL
7 Distributed messaging systems
8 Batch processing techniques for big data, part 1: MapReduce
9 Batch processing techniques for big data, part 2: Apache Spark
10 Stream processing techniques for big data: Spark Streaming
11 Big data architecture: Lambda architecture
12 Big data analytics: Spark ML

3
How big is big data?

4
5
How big is big data?

6
Data science: The 4th paradigm for scientific
discovery

7
Big data in 2008

8
Big data in 2014

9
Big data today

10
Big numbers

11
Big data sources
• E-commerce
• Social networks
• Internet of things
• Data-intensive experiments (bioinformatics, quantum
physics, etc)

12
Data is the new oil

13
The 5 V's of big data

Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to
deal with them (Wikipedia)

14
Big data – big value

15
source: wipro.com
Big Data in education industry
• Customized and Dynamic Learning Programs
• Reframing Course Material
• Grading Systems
• Career Prediction

16
Edtech
• Coursera
• VioEdu
• https://byjus.com/
• Engaging Video Lessons
• Personalized Learning Journeys
• Mapped to the Syllabus
• In-depth Analysis
• Engaging Interactive Questions

17
Big Data in healthcare industry
• Reduce treatment costs and avoid unnecessary diagnoses
• Predict epidemic outbreaks and plan preventive
measures
• Avoid preventable diseases

18
Big Data in government sector
• Welfare Schemes
• Make faster and informed decisions
• Identify areas that are in immediate need of attention
• Overcome national challenges such as unemployment and
terrorism
• Cyber security
• Fraud detection
• Catching tax evaders

19
Big Data in media and entertainment
industry
• Predicting the interests of audiences
• Optimized or on-demand scheduling of media streams
in digital media distribution platforms
• Getting insights from customer reviews
• Effective targeting of the advertisements
• Example
• Spotify, Amazon Prime

20
Big data in scientific discovery

CERN’s Large Hadron Collider (LHC) generates 15 PB a year


21
Maximilien Brice, © CERN
Top 10 Company Market Cap
Ranking History (1998-2018)

https://www.youtube.com/watch?v=fobx4wIS6W0
22
Top 10 Company Market Cap
Ranking History (1998-2018)

23
Big data technology stack

24
Scalable data management
• Scalability
• Able to manage increasingly large volumes of data
• Accessibility
• Able to maintain efficiency in reading and writing data (I/O)
to and from data storage systems
• Transparency
• In a distributed environment, users should be able to access
data over the network as easily as if the data were stored
locally.
• Users should not have to know the physical location of data to
access it.
• Availability
• Fault tolerance
• The number of users, system failures, or other consequences
of distribution shouldn't compromise availability.

25
Data I/O landscape
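[Figure: data transfer rates as seen from one node]
• CPUs: 10 GB/s
• Slower storage device: 100 MB/s, 3-12 ms random access, $0.025 per GB
• Faster storage device: 600 MB/s, 0.1 ms random access, $0.35 per GB
• Network: 1 Gb/s or 125 MB/s
• Nodes in same rack: 1 Gb/s or 125 MB/s
• Nodes in another rack: 0.1 Gb/s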
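
To get a feel for these rates, the following minimal sketch estimates how long one sequential pass over a dataset would take at each throughput shown on the slide. Only the throughput numbers come from the slide; the 1 TB dataset size, the dictionary labels, and the function name `scan_seconds` are assumptions made for illustration, and latency and contention are ignored.

```python
# Throughput figures from the slide, converted to megabytes per second.
# 1 Gb/s = 125 MB/s, so 0.1 Gb/s = 12.5 MB/s.
RATES_MB_PER_S = {
    "CPUs (10 GB/s)": 10_000.0,
    "storage at 600 MB/s": 600.0,
    "storage at 100 MB/s": 100.0,
    "nodes in same rack (1 Gb/s)": 125.0,
    "nodes in another rack (0.1 Gb/s)": 12.5,
}

DATASET_MB = 1_000_000.0  # assumption: a 1 TB dataset (decimal units)


def scan_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time for one sequential pass, ignoring latency and contention."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    for name, rate in RATES_MB_PER_S.items():
        hours = scan_seconds(DATASET_MB, rate) / 3600
        print(f"{name:34s} {hours:8.2f} h")
```

At 100 MB/s a single machine needs roughly 2.8 hours to read 1 TB, which is one reason big data systems spread data over many disks and read them in parallel.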

26
Scalable data ingestion and
processing
• Data ingestion
• Data from different, complementary information systems must be combined to
provide a more comprehensive basis for analysis
• How to ingest data efficiently from various distributed, heterogeneous
sources?
• Different data formats
• Different data models and schemas
• Security and privacy

• Data processing
• How to process massive volumes of data in a timely fashion?
• How to process massive streams of data in real time?
• Traditional parallel, distributed processing (OpenMP, MPI)
• Steep learning curve
• Scalability is limited
• Fault tolerance is hard to achieve
• Expensive, high-performance computing infrastructure
• Novel real-time processing architectures
• E.g. mini-batches in Spark Streaming
• E.g. complex event processing in Apache Flink
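
The mini-batch idea named above can be illustrated without any cluster: a stream is cut into small batches by time interval, and each batch is then processed with ordinary batch code. The sketch below is plain Python, not the Spark Streaming API; the names `mini_batches` and `word_counts`, the `(timestamp, text)` event format, and the 1-second interval are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, text payload)


def mini_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group a time-ordered event stream into fixed-length batch intervals."""
    batch: List[str] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval
        # Close every window that ended before this event arrived.
        # Windows without events are emitted as empty batches.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        yield batch


def word_counts(batch: List[str]) -> Counter:
    """Ordinary batch job applied to one mini-batch."""
    counts: Counter = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    stream = [(0.0, "a b"), (0.4, "b"), (1.2, "c a"), (3.1, "a")]
    for i, batch in enumerate(mini_batches(stream, interval=1.0)):
        print(i, dict(word_counts(batch)))
```

The design trade-off is latency against throughput: a shorter interval gives fresher results, while a longer one amortizes per-batch overhead over more events.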

27
Scalable analytic algorithms
• Challenges
• Big volume
• Big dimensionality
• Real-time processing
• Scaling up machine learning algorithms
• Adapting the algorithm to handle big data on a single machine
• E.g. sub-sampling
• E.g. principal component analysis
• E.g. feature extraction and feature selection
• Scaling up algorithms by parallelism
• E.g. k-NN classification based on MapReduce
• E.g. scaling up support vector machines (SVM) by a divide-and-conquer
approach
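
As an illustration of scaling up by parallelism, here is a minimal single-process sketch of k-NN classification in a map/reduce style. It is one possible formulation under stated assumptions, not the specific method the lecture refers to: each data partition independently finds its own k nearest candidates (map), the candidate lists are merged into a global top k (reduce), and the label is chosen by majority vote. The function names and the toy data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math
from collections import Counter
from functools import reduce
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], str]  # (feature vector, class label)
Candidate = Tuple[float, str]        # (distance to query, class label)


def map_partition(partition: List[Point], query: Sequence[float], k: int) -> List[Candidate]:
    """Map step: the k nearest candidates inside one partition."""
    scored = [(math.dist(features, query), label) for features, label in partition]
    return heapq.nsmallest(k, scored)


def reduce_candidates(a: List[Candidate], b: List[Candidate], k: int) -> List[Candidate]:
    """Reduce step: merge two candidate lists and keep the k nearest."""
    return heapq.nsmallest(k, a + b)


def knn_classify(partitions: List[List[Point]], query: Sequence[float], k: int) -> str:
    # In a real cluster each map_partition call would run on a different node.
    local = [map_partition(p, query, k) for p in partitions]
    top_k = reduce(lambda a, b: reduce_candidates(a, b, k), local, [])
    # Majority vote among the global k nearest neighbours.
    return Counter(label for _, label in top_k).most_common(1)[0][0]


if __name__ == "__main__":
    partitions = [
        [((0.0, 0.0), "red"), ((1.0, 0.0), "red")],
        [((5.0, 5.0), "blue"), ((0.0, 1.0), "red"), ((6.0, 5.0), "blue")],
    ]
    print(knn_classify(partitions, (0.2, 0.1), k=3))
```

Because each partition returns at most k candidates, the data shuffled to the reducer is bounded by k times the number of partitions, regardless of how large the training set is.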

28
E.g. Curse of dimensionality
• The number of samples required (to achieve the same accuracy)
grows exponentially with the number of variables!
• In practice, the number of training examples is fixed!
• => the classifier’s performance will usually degrade for a large
number of features!

In fact, after a certain point, increasing the dimensionality of the
problem by adding new features actually degrades the performance of
the classifier.
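
One simple way to make the exponential growth concrete is to assume that "the same accuracy" means keeping the same sampling density along every feature axis. Under that assumption (a modeling choice for illustration, not a claim from the slide), covering d features with m sample points per axis requires m to the power d samples. The function name `samples_needed` and the choice of 10 points per axis are hypothetical.

```python
def samples_needed(points_per_axis: int, n_features: int) -> int:
    """Grid size needed to keep a fixed sampling density per feature axis."""
    return points_per_axis ** n_features


if __name__ == "__main__":
    for d in (1, 2, 3, 10, 20):
        print(f"{d:2d} features -> {samples_needed(10, d):,} samples")
```

With a fixed training set, every added feature therefore makes the data sparser in the feature space, which is the mechanism behind the performance drop described above.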

29
Utilization and interpretability of big
data
• Domain expertise to identify problems and
interpret analytics results
• Scalable visualization of millions of data points
to facilitate their interpretation and
understanding

30
Privacy and security

31
Big data job trends

32
Talent shortage in big data

33
Big data skill set

34
How to land big data related jobs
• Learn to code
• Coursera
• Udacity
• Freecodecamp
• Codecademy
• Math, Stats and machine learning
• Kaggle
• Hadoop, NoSQL, Spark
• Visualization and Reporting
• Tableau
• Pentaho
• Meetup & Share
• Find a mentor
• Internships, projects

35
Data science method
1. Formulate a question
2. Gather data
3. Analyze data
4. Product

Source: Foundational Methodology for Data Science, IBM, 2015

36


DeepQA: Incremental Progress in Precision and Confidence, 6/2007-11/2010

[Figure: precision (y-axis, 0% to 100%) versus % answered (x-axis, 0% to 100%) for successive system versions: Baseline, 12/2007, 5/2008, 8/2008, 12/2008, 5/2009, 10/2009, 4/2010, 11/2010. Annotation: "Now Playing in the Winners Cloud".]
37
Cleaning big data: most time-consuming,
least enjoyable data science task
• Data preparation accounts for about 80% of the work of
data scientists

source: https://www.forbes.com/

38
Cleaning big data: most time-consuming,
least enjoyable data science task
• 57% of data scientists regard cleaning and organizing
data as the least enjoyable part of their work and 19%
say this about collecting data sets.

39

40
Online courses
• https://www.coursera.org/learn/nosql-database-systems
• https://who.rocq.inria.fr/Vassilis.Christophides/Big/index.htm
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data
• https://www.coursera.org/learn/big-data-integration-processing?specialization=big-data
• https://www.coursera.org/learn/big-data-management?specialization=big-data
• https://www.coursera.org/learn/hadoop
• https://www.coursera.org/learn/scala-spark-big-data

41
Thank you
for your
attention!!!

42
