
Basic Description of Big Data as a Distributed Processing and Storage Strategy

Uploaded by

CASJORGE

BIG DATA: AN INNOVATIVE DISTRIBUTED PROCESSING & STORAGE SCHEME

Jorge Castro – CEO


Big Data Training program 2017
Thank you very much
for your time
If you have any questions about this document
please don’t hesitate to contact me at:

[email protected]

▪ @casjorge1967

Human power evolution

[Figure: agricultural society, then industrial society, then information society]
Information society evolution

[Figure: staircase timeline with five steps (20 years ago, 15 years ago, today, in 5 years, in 10 years) marking PC, Internet, cloud, and software development as optional or mandatory at each step. Twenty years ago both the PC and the Internet were optional.]
A new software definition

[Figure labels: Good ideas (processes, automation, classifying, order); code; Big Data; Cloud (IaaS, PaaS/FaaS); service; mass passing; economy of scale]
Software: A new definition

▪ In his 2016 book “The Seventh Sense”, Joshua Cooper Ramo argues that access to reliable translation algorithms is more important than the ability to speak different languages. Being a polyglot will therefore become an archaic specialty, since machines will take care of translation. That is, power lies not with those who master more languages or speak better English, but with those who control the systems and algorithms that make better translations. For Ramo, what will replace English will not be Chinese, Spanish, German, or Arabic, but algorithms and protocols.
Software: A new definition

▪ Back in 1970, Joseph Weizenbaum stated that “anyone with a reasonably ordered mind can become a good programmer, but requires maturity to tolerate the long time between an effort and something that shows success.” The age of connectivity will create a new caste. This caste does not require training millions of programmers, but rather a group of people with strong technical skills to design and control systems and protocols and who, with historical and political understanding, can influence collective thinking.
Automation: Pros & Cons

▪ 45% of the human activities that are currently paid could be affected in one way or another, or displaced, by automation. The study covered more than 800 occupations and 2,000 activities across 3 major capabilities (social, cognitive, and physical). The conclusion, although not hopeless given the human time it frees up for reinvention, is blunt: no activity is immune to automation.

McKinsey, “Four Fundamentals of Workplace Automation”, Nov 2015

▪ While only 5% of human activities can be fully automated using current technology, 60% of occupations could automate at least 30% of their current activities, completely redefining how we do our work and how we allocate our time to the most relevant activities.
Automation: Pros & Cons

▪ Quill, an intelligent narrative program, can analyze data, generate natural language, and write reports without the user even suspecting that a machine is doing the work.

▪ Amazon's Kiva robots can plan, navigate, and coordinate the logistics of huge warehouses for shipping to customers.

▪ IBM's Watson can suggest accurate, analysis-based medical treatments for certain ailments, and it already has creative and learning abilities that will allow it to be used in jobs previously reserved for the human brain.
Table of contents
Big data: A distributed processing and storage scheme

1. Introduction
a. Provoking quotes
b. Definition
c. Features
d. Real prominent use cases
2. Concepts
a. Basic concepts
b. Analysis concepts
3. Big Data mechanisms
4. NoSQL devices
5. Processing fundamentals
a. Map-Reduce
b. Bulk synchronous parallel
1. Introduction
Provoking quotes about big data

“Any enterprise CEO really ought to be able to ask a
question that involves connecting data across the
organization, be able to run a company effectively, and
especially to be able to respond to unexpected events.
Most organizations are missing this ability to connect all
the data together.”
– Tim Berners-Lee

“Information is the oil of the 21st century, and analytics
is the combustion engine.”

Peter Sondergaard, Gartner Research


▪ “Without big data, you are blind and deaf in the middle of a freeway.”

Geoffrey Moore, management consultant and theorist

▪ “The world is one big data problem.” – Andrew McAfee

▪ “Data really powers everything that we do.” – Jeff Weiner

▪ “In God we trust. All others must bring data.” – W. Edwards Deming

▪ “Data are becoming the new raw material of business.” – Craig Mundie
Big Data Definition

Big data is the field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources. It is used when traditional technology is insufficient.

In traditional technologies:
▪ The main data management paradigm is the RDBMS, which was fine for many of the earlier data schemes.
▪ The prerequisites for properly managing today's data deluge are not in place.
Big Data Features

▪ Big data becomes the feasible option when traditional technologies are not, because of challenges in the 5 Vs: volume, velocity, variety, veracity, and value.

Variability is also considered a feature by some analysts, to reinforce the challenge posed by schemes that vary with time.
Big Data Quoted features

Volume: “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”
– Eric Schmidt, of Google, 2010

Velocity: “With data collection, ‘the sooner the better’ is always the best answer.”
– Marissa Mayer

Variety: “Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Big Data Quoted features

Veracity: “Torture the data, and it will confess to anything.”
– Ronald Coase

Value: “You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence and judgment.”
– Alvin Toffler
Big Data: Prominent real use cases

Health retail: Walgreens
• At Walgreens, clinicians at in-store health clinics use big data to deliver advanced analytics at the point of care, to better assess patient conditions and provide recommendations that improve overall health and avoid future medical costs. Over 7.5 billion medical events for 100 million people power the big data system, with information such as demographics, enrollment, diagnoses, procedures, and data from managed-care plans.

Airlines: Delta
• Delta has used big data to help with one of the most uncomfortable travel situations that exists: lost baggage. With over 130 million bags checked per year, the company held a lot of bag-tracking data and became the first major airline to let customers track their bags from mobile devices. To date, the app has been downloaded over 11 million times. It gives customers much greater peace of mind while traveling and differentiates Delta as a customer-centric company.

Automotive: Tesla
• Tesla excels at instrumenting vehicles with sensors and sending all the data back to the mother ship for analysis, using an Apache Hadoop® cluster to collect the data. The data is used to improve the company's R&D, car performance, car maintenance, and customer satisfaction. The company is notified if a car is not functioning properly, and consumers can be advised to get it serviced.
Big Data: Prominent real use cases

Logistics: UPS
• UPS makes 16.9 million package and document deliveries every day and ships over 4 billion items per year using almost 100,000 vehicles. One application is fleet optimization: on-truck telematics and advanced algorithms help with routes, engine idle time, and predictive maintenance. Since starting the program, the company has saved over 39 million gallons of fuel and avoided driving 364 million miles.

Telecommunications: Sprint
• Sprint spoke about using big data analytics to improve quality and customer experience while reducing network error rates and customer churn. They handle tens of billions of transactions per day for 53 million users, and their big data analytics put real-time intelligence into the network, driving a 90% increase in capacity.

Financial Services: AMEX
• AMEX looked to shift from traditional BI-based hindsight reporting, or trailing indicators of how the business was doing, toward predicting loyalty. Their sophisticated predictive models analyzed historical transactions with 115 variables to forecast potential churn. In the Australian market, they now believe they can identify 24% of the accounts that will close within four months.
2. Concepts
Basic terminology

Understanding traditional technology helps to comprehend the paradigm shift that Big Data has produced.

Understanding the new from the traditional!
From traditional to Big Data concepts: Traditional terminology

▪ Dataset: Collection of related data where each member shares the same attributes.
▪ Data analysis: Process of examining data to find facts, relations, patterns, insights, or trends.
▪ BI (business intelligence): Process of gaining insight into an enterprise's workings to improve decision making.
▪ OLTP (online transaction processing): Software system that processes transaction-oriented data.
▪ OLAP (online analytical processing): System used to process data analysis queries.
▪ ETL (extract, transform, load): Process to load data from a source system into a target system.
▪ Data warehouse: Central, enterprise-wide repository with historical and current data.
▪ Data mart: Subset of related data stored in a data warehouse.
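As an illustration of the ETL definition, here is a minimal sketch in Python. The CSV contents, the `orders` table, and the function names are all hypothetical; an in-memory SQLite database stands in for the target system, and the final query is the kind of analysis question an OLAP system would answer.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a small export from a transactional (OLTP) system.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,4.25
3,alice,7.00
"""


def extract(text):
    """Extract: read raw rows (all strings) from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and normalize customer names."""
    return [
        (int(r["order_id"]), r["customer"].title(), float(r["amount"]))
        for r in rows
    ]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory target database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(SOURCE_CSV)))
    return conn


# An analysis-style query over the loaded data.
QUERY = (
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
)
for row in run_etl().execute(QUERY):
    print(row)
```

Keeping the three steps as separate functions mirrors the definition: each step can be replaced (a different source, a different target) without touching the others.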
From traditional to Big Data concepts: Big Data terminology

▪ Hadoop: Open-source software framework for large-scale data storage and processing that can run on commodity hardware.
▪ Machine learning: Process of teaching computers to learn from existing data and apply that knowledge to formulate predictions with new data.
▪ Data mining: Automated, software-based techniques to identify hidden and unknown patterns and trends.
▪ Big Data BI: Evolution of traditional BI that includes semi-structured and unstructured data, to be processed together with existing data.
▪ Advanced visual tools: User-friendly tools for descriptive, diagnostic, predictive, and prescriptive analytics.
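The machine learning definition (learn from existing data, then predict for new data) can be made concrete with a very small sketch. Ordinary least squares on one variable is used here only as a simple stand-in technique; the slides do not name a specific algorithm, and the data points and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from existing (x, y) observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical existing data: the target grows by 2 per unit of input.
history_x = [1, 2, 3, 4]
history_y = [3, 5, 7, 9]

model = fit_line(history_x, history_y)
print(predict(model, 5))
```

The split into `fit_line` and `predict` reflects the two halves of the definition: training on existing data, then formulating predictions with new data.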
From traditional to Big Data concepts: Analysis types

▪ Descriptive analysis: Responds to “what” questions. It is static and usually based on OLTP systems.
▪ Diagnostic analysis: Responds to “why” questions. It is interactive and used in OLAP systems.
▪ Predictive analysis: Responds to “which” questions by simulating various scenarios.
▪ Prescriptive analysis: Responds to “what-if” questions. It usually has its own visualization tool.
From traditional to Big Data concepts: Traditional, Drivers, Big Data

Traditional: For more than 20 years, databases have revolved around relational database systems based on ACID properties and concepts, a standard query language, and a single schema for developing and maintaining enterprise solutions.
From traditional to Big Data concepts: Traditional, Drivers, Big Data

Drivers: The latest generation of internet businesses, such as social networks and powerful search engines, along with IoT, smart cities, cloud computing, the data deluge, digital transformation, lower costs, …
BIG DATA ENABLERS:
• Analytics & Data Science
• Digitization
• Affordable Technology & Commodity Hardware
• Social Media
• Hyper-Connected Communities & Devices
• Cloud Computing
• Business Intelligence evolution
Traditional concepts

▪ CAP theorem: Consistency, Availability, Partition tolerance
▪ ACID: Atomicity, Consistency, Isolation, Durability
▪ BASE: Basically Available, Soft state, Eventual consistency

Pros and cons: limited and precise scope in structured data.
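To make the "A" in ACID concrete, here is a minimal sketch of an atomic transfer using Python's standard `sqlite3` module. The `accounts` table, the names, and the amounts are hypothetical. The credit is deliberately applied before the debit so that a failed debit shows the earlier update being rolled back.

```python
import sqlite3


def make_bank():
    """Create a hypothetical two-account database in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Atomicity: both updates are applied, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when src has too little money.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


bank = make_bank()
print(transfer(bank, "alice", "bob", 30), balances(bank))
print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The `CHECK` constraint plays the role of the consistency rule: the database refuses any state with a negative balance, and the transaction boundary ensures the partial credit never becomes visible.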
New requirements: When to use Big Data?

▪ Volume: terabytes, petabytes
▪ Velocity: streaming (social networks, IoT, smart city)
▪ Variety: structured, semi-structured, unstructured
3. Big data mechanisms
Distributed processing & storage technologies
New approaches to solutions

Clusters built in an inexpensive way (commodity hardware and open-source software).

▪ Processing: parallel, distributed, clusters
▪ Storage: RDBMS, NoSQL, NewSQL
New Processing Concepts: 7 Big Data solution mechanisms for processing

Processing engine
• Responsible for processing data based on predefined logic.

Resource manager
• Schedules and prioritizes requests according to individual processing workloads.

Data transfer engine
• Enables data to be moved into or out of the big data solution (event, file, and relational).

Query engine
• Processing engine focused on queries issued via a front-end user interface.

Analytics engine
• Supports advanced statistical and machine learning algorithms.

Workflow engine
• Provides the ability to design and process a complex sequence of operations.

Coordination engine
• Ensures operational consistency across all servers, supporting distributed locks and queues and asynchronous communications.
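As one illustration of the mechanisms listed above, the workflow engine idea (designing and running a sequence of dependent operations) can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not how any particular product works; the task names and the tiny pipeline are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run tasks in dependency order.

    tasks maps a task name to a function that receives the results so far;
    dependencies maps a task name to the set of tasks it must wait for.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical four-step pipeline: ingest -> clean -> aggregate -> report.
TASKS = {
    "ingest": lambda r: [3, None, 1, 2],
    "clean": lambda r: [v for v in r["ingest"] if v is not None],
    "aggregate": lambda r: sum(r["clean"]),
    "report": lambda r: f"total={r['aggregate']}",
}
DEPENDENCIES = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

order, results = run_workflow(TASKS, DEPENDENCIES)
print(order)
print(results["report"])
```

Declaring dependencies separately from the task bodies is the design point: the engine, not the task author, decides the execution order.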
4. NoSQL devices
Distributed storage technologies
New Storage Concepts

Big Data usually relies on the partition tolerance feature.

▪ Storage media: in memory, on disk
▪ Storage families: RDBMS, NoSQL, NewSQL
▪ Storage devices: DFS (distributed file systems), KVS (key-value systems), document DB, columnar DB, graph DB
New Storage Concepts: 6 Big Data solution mechanisms for storage

RDBMS
• Good fit for transactional workloads and generally runs on a single node. Does not provide out-of-the-box redundancy or fault tolerance. Scales vertically. Some proprietary solutions are DDB, but with shared storage that is a single point of failure.

KVS (key-value storage)
• Acts as a hash table storing key-value pairs. Lookups are done only by key, the store is oblivious to the value, an update is done as a delete or an insert, and it is highly scalable.

Document DB
• Based on key-value pairs, but the value is an encoded document. The store is value aware and the value is self-describing. A select can refer to a field inside a value. Partial updates and nested schemas are supported. Not a good fit when several documents have to be updated in one transaction.

Column family DB
• Groups related columns in a row, resulting in column families composed of columns or super-columns. Flexible schema support, with each row identified by a key.

Graph DB
• Persists interconnected entities. It focuses on the links between entities, called edges, and each entity or edge can have attributes as key-value pairs. Optimized for node traversal.

NewSQL
• Combines the ACID guarantees of an RDBMS with the scalability and fault tolerance of NoSQL. It supports SQL for data definition and data manipulation.
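The differences between three of these storage models can be sketched with plain Python data structures. This is only an in-memory illustration of the access patterns described above, not the API of any real product; the keys, names, and helper functions are hypothetical.

```python
from collections import deque

# Key-value store: the value is an opaque blob, so lookup is by key only.
kvs = {"user:1": b'{"name": "Ana", "city": "Bogota"}'}

# Document store: the value is a self-describing, possibly nested document.
documents = {
    "user:1": {"name": "Ana", "address": {"city": "Bogota"}},
    "user:2": {"name": "Luis", "address": {"city": "Cali"}},
}


def find_by_field(docs, path, expected):
    """Select documents by a (possibly nested) field inside the value."""
    matches = []
    for key, doc in docs.items():
        value = doc
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value == expected:
            matches.append(key)
    return matches


def partial_update(docs, key, field, value):
    """Change one field without rewriting the whole document."""
    docs[key][field] = value


# Graph store: entities plus edges, optimized for traversal.
edges = {"Ana": ["Luis"], "Luis": ["Marta"], "Marta": [], "Pedro": []}


def reachable(graph, start):
    """Breadth-first traversal from one node along its outgoing edges."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


print(kvs["user:1"])
print(find_by_field(documents, "address.city", "Cali"))
print(reachable(edges, "Ana"))
```

Note that the key-value store offers nothing comparable to `find_by_field`: because it is oblivious to the value, finding users in a given city would require fetching and decoding every value on the client side.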
5. Processing fundamentals
Basic algorithms for distributed processing
New approaches to solutions: Basic processing examples

▪ Batch processing: Map Reduce job, BSP (Bulk Synchronous Parallel) job
▪ Real-time processing:
• ESP (Event Stream Processing): single-source events
• CEP (Complex Event Processing): multiple sources and different time series
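The Map Reduce job mentioned above follows a map, shuffle, reduce pattern. The sketch below shows that pattern for the classic word count on a single machine; it is an illustration of the programming model only, not of Hadoop's API, and the function names and sample inputs are hypothetical. In a real cluster, each input split would be mapped on a different node and the shuffle would move data across the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of input splits."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big clusters"]))
```

Because `map_phase` looks at one split at a time and `reduce_phase` at one key at a time, both phases can run in parallel, which is what makes the model suitable for clusters of commodity hardware.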
Full simple big data solution example

Full complex big data solution example

Simple macro steps for designing a Big Data solution
HADOOP WORLD: Mapping technologies

MECHANISM | PRODUCT
DFS | HDFS
MAP REDUCE | APACHE MAP REDUCE, COUCHBASE, COUCHDB, INFINISPAN
COORDINATION ENGINE | ZOOKEEPER, CONSUL, DOOZERD, ETCD
QUERY ENGINE | HIVE, PIG
ANALYTICS ENGINE | MAHOUT
WORKFLOW ENGINE | OOZIE
DATA TRANSFER ENGINE | FLUME (EVENTS), SQOOP (RDBMS), SCRIBE (FILES)
Crossing concepts & current vendor solutions

STORAGE DEVICES
MECHANISM | PRODUCT
KVS | AMAZON DYNAMO DB, RIAK, REDIS
DOCUMENT DB | MONGO DB, COUCH DB, TERRASTORE
COLUMN FAMILY DB | AMAZON SIMPLE DB, CASSANDRA, HBASE
GRAPH DB | NEO4J, INFINITE GRAPH, ORIENT DB
NEWSQL | VOLTDB, FOUNDATION DB, NUODB, INNODB

IN MEMORY STORAGE DEVICES
MECHANISM | PRODUCT
IMDG | IN MEMORY DATA FABRIC, HAZELCAST, ORACLE COHERENCE
IMDB | AEROSPIKE, MEMSQL, ALTIBASE HDB, EXTREMEDB, PIVOTAL GEMFIRE
REAL TIME | APACHE SPARK, APACHE STORM, APACHE TEZ

Become a member: sign up for the Big Data Colombia Meetup

www.softy365.com
[email protected]