BIG DATA FUNDAMENTALS
Presented by: Le Ngoc Thanh
Outline
o Big Data Platform and Technologies
• IBM Big Data Platform
o Digging into Big Data Technology
• Big Data Technology Stack
• Big Data Analytics Platforms and Software
• Big Data Landscape 2018
o Big Data and Data Science
• The Data Process for Big Data
• Data Analyst vs. Data Scientist
©lnthanh
Big Data Platform
Comprehensive, enterprise-ready, integrated
Main tasks in Big data
o Aggregation
o Manipulation
o Analysis
o Visualization
Big data is multidisciplinary
o Technologies applied to Big data should include
• Massively parallel processing databases
• Distributed databases
• Data mining grids
• Distributed filesystems
• Cloud computing platforms
• Strong internet
• Scalable storage systems
o These can be drawn from several fields such as Statistics,
Computer science, Applied mathematics, Economics, etc.
IBM Big data platform
o Provides a solution designed specifically with the needs of the
enterprise in mind.
A Big data platform should offer
o Comprehensive
• Every dimension of the Big data challenge is addressed.
o Enterprise-ready
• Performance, security, usability and reliability features are included.
o Integrated
• The introduction of Big data technologies to the enterprise should be
simplified and accelerated.
• Integration with the information supply chain, including databases, data
warehouses, and business intelligence applications.
o Moreover, a Big Data platform should also be
• Open-source based, with low-latency reads/updates, ad-hoc queries,
scalability, extensibility, robust fault tolerance, and minimal maintenance.
Components in a Big data platform
Digging into Big Data Technology
Digging deeper, better insights
Big data technology stack
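4. Traditional and advanced analytics
3. Organizing data services and tools
2. Operational databases
1. Security infrastructure
0. Redundant physical infrastructure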
Layer 0: Redundant physical infrastructure
o The physical infrastructure is the lowest level.
• Hardware, network, etc.
o Your company might already have a data center or
have made investments in physical infrastructure.
o Hence, you may want to find a way to utilize existing
assets.
Where did most of this begin?
o A prioritized list of these principles should include
statements about the following:
• Performance
• Availability
• Scalability
• Flexibility
• Cost
It grows bigger...
...then very big
Why redundant?
o Most big data implementations need to be highly available.
o That is, networks, servers, and physical storage must be both
resilient and redundant.
o A system is resilient to failure or changes when sufficient
redundant resources are in place, ready to jump into action.
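A minimal sketch of the idea above, assuming two interchangeable copies of the same data. The names `read_with_failover`, `primary`, and `standby` are hypothetical and only illustrate how a redundant resource can "jump into action" when another one fails.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant resource could serve the request."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each redundant replica in order and return the first successful read."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError as exc:
            # This replica is down; fall through to the next redundant copy.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for key {key!r}")


# Hypothetical replicas: the primary is down, the standby holds the same data.
def primary(key: str) -> str:
    raise ConnectionError("primary server unreachable")


def standby(key: str) -> str:
    return {"sensor-1": "21.5"}[key]


value = read_with_failover([primary, standby], "sensor-1")
```

The read succeeds although the primary is unreachable. With no healthy replica left, the call raises `AllReplicasFailed`, which is why redundancy has to be sized for the failures you expect.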
Layer 1: Security infrastructure
o Security and privacy requirements for big data are similar
to those for conventional data environments.
o They have to be closely aligned to specific business needs.
• Data access: the data should be available only to those who have a
legitimate business need for examining or interacting with it.
• Application access: protection from unauthorized usage or access is
offered by most APIs.
• Data encryption: the most challenging aspect, since it extremely stresses
the systems’ resources. Encrypt only data elements that require this level
of security.
• Threat detection: the inclusion of mobile devices and social networks
exponentially increases both the amount of data and the opportunities for
security threats.
Layer 2: Operational databases
o The core of any Big data environment is the database
engines holding collections of data elements
relevant to a business.
• Atomicity: if any part of the transaction or the underlying system fails,
the entire transaction fails.
• Consistency: only transactions with valid data will be performed.
• Isolation: multiple and simultaneous transactions do not interfere with
each other. All valid transactions will execute until completed and in the
order they were submitted for processing.
• Durability: after the data from the transaction is written to the
database, it stays there “forever.”
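The atomicity and consistency properties can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the sample balances are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected invalid data (consistency),
        # so the whole transaction was rolled back (atomicity).
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit to `bob` is applied first and then undone when the debit violates the constraint, so no partial transfer is left behind.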
Layer 3: Organizing Data Services and Tools
o Organizing data services are, in reality, an ecosystem
of tools and technologies that can be used to gather
and assemble data in preparation for further
processing.
o Technologies in this layer include the following:
• A distributed file system
• Serialization services
• Coordination services
• Extract, transform, and load (ETL) tools
• Workflow services
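As a rough sketch of what an extract, transform, and load (ETL) step does, the example below uses only the Python standard library. The CSV content, the column names, and the dictionary used as a target store are hypothetical stand-ins for a real source system and data warehouse.

```python
import csv
import io

# Hypothetical raw source data with a missing reading and stray spaces.
RAW_CSV = """sensor,temp_c
a, 21.5
b,
a, 22.5
"""


def extract(text):
    """Extract: parse raw CSV text into row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with missing readings and cast types."""
    clean = []
    for row in rows:
        reading = (row["temp_c"] or "").strip()
        if reading:
            clean.append({"sensor": row["sensor"].strip(), "temp_c": float(reading)})
    return clean


def load(rows, store):
    """Load: append cleaned rows to the target store, grouped by sensor."""
    for row in rows:
        store.setdefault(row["sensor"], []).append(row["temp_c"])
    return store


warehouse = load(transform(extract(RAW_CSV)), {})
```

Keeping the three stages as separate functions mirrors how ETL tools let each stage be swapped or scheduled independently by a workflow service.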
Hadoop, MapReduce and Big Table
o New technologies to store, access, and analyze huge
amounts of data.
• They proved to be the sparks that led to a new generation of data
management.
• They address one of the most fundamental problems: the
capability of processing massive amounts of data efficiently,
cost effectively, and in a timely fashion.
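The MapReduce programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases for a word count, not the Hadoop API; the function names and the two input splits are assumptions for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; on a cluster each would be mapped on a different node.
splits = ["big data is big", "data science uses data"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model suitable for massive data volumes.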
Layer 4: Traditional and advanced analytics
o What does your business now do with all the data in all
its forms to try to make sense of it for the business?
• Managing big data holistically requires many different analysis
approaches, depending on the problem being solved, to help the
business to successfully plan for the future.
• Some analyses will use a traditional data warehouse, while
others will take advantage of advanced predictive analytics.
o Key techniques: analytical data warehouses and data
marts, Big data analytics, reporting and visualization, and
Big data applications, etc.
Big data platform and analytics software
o Features of Big data platform and analytics software
• Data ingestion, data management, ETL and warehouse,
Hadoop system and stream computing
• Analytics/machine learning, content management, data
integration and governance
• Provides efficiency in the workplace
• Provides accurate data
• Gives answers to complex questions
• It is secure
Source: https://www.predictiveanalyticstoday.com/bigdata-platforms-bigdata-analytics-software/
Big data analytic platform tools
o There are some key Big data analytic platform tools
available for enterprise use
Reference for more: https://www.predictiveanalyticstoday.com/bigdata-platforms-bigdata-analytics-software/
Example of Analytics platform for Real-time Data ingestion, Streaming analytics
Source: https://www.xenonstack.com/blog/big-data-engineering/iot-analytics-platform-solutions/
Big Data Landscape 2018
Source: http://mattturck.com/bigdata2018/, updated 15/07/2018
Big Data and Data Science
…
What is Data science?
o Data science is the process of distilling insights from
data to inform decisions.
What is Data science?
o In data science, the size of the data is less important.
o One can use data of all sizes (small, medium, and big) that
is related to a business or scientific case.
Data science process for Big data
o The data science process for Big data could include
the following steps:
Data scientist vs. Data analyst
Job trends of Data analysts (left) and Data scientists (right)
Source: https://www.edureka.co/blog/difference-between-data-scientist-and-data-analyst/