Big Data Intro-1
AND PLATFORMS
Prof. JP AGGARWAL
Introduction
• Almost everything we do leaves a digital trail that is
recorded and saved forever
• In fact, it is estimated that 90% of the world's data was
generated in the last two years alone.
• In the space of 13 years, this figure has increased by an
estimated 60x from just 2 zettabytes in 2010.
• The 120 zettabytes generated in 2023 are expected to grow to roughly 150% of that figure by 2025, reaching 181 zettabytes.
Source: https://explodingtopics.com/blog/data-generated-per-day
Source: https://www.domo.com/data-never-sleeps
Big Data Era
• We are exposed to more information each day than our
15th century ancestors were exposed to in a lifetime
• The 3 Vs of Big Data (Volume, Velocity, Variety)
• Data beyond storage capacity or data beyond local
processing capability.
• Big data refers to large volumes of structured, semi-structured, and unstructured data. The analysis of big data leads to better insights for business.
• Why Big Data all of a sudden?
Sources of Big Data (Internal and External)
• Web searches, browsing, and social media
• Astronomy and telescopes
• Financial data
• Medical data
• Telephone calls and email
• Surveillance data
• Internet of Things (smart watches, smart fridges, etc.)
• Sensor data (e.g. smartphones, GPS)
Big Data Analytics
• Big Data Analytics is the process of examining large data sets (current as well as historical) to uncover hidden patterns, unknown correlations, market trends, customer preferences, potential failures, and other useful business information.
[Diagram: archived data, transactions, and social media feed into Big Data, which drives predictions.]
• Analyze both internal and external data
• Batch Analytics
• On historical data
• Real-Time Analytics
• On current data in real time
• e.g. credit card transactions
Turning Big Data into Value
[Diagram: live data feeds help answer questions such as "What's the social sentiment for my brand or products?" and "How do I better predict future outcomes?"]
Applications of Big Data
Better Understand and Target Customers
Understand and Optimize Business
• Retailers can optimize their stock based on predictive models generated from social media data, web search trends, and weather forecasts.
• Supply chains and delivery routes can be optimized using data from geographic positioning and radio-frequency identification (RFID) sensors.
Improving Health
• We can use all the data from smart watches and wearable devices to better understand the links between lifestyles and diseases.
Improving Sports Performance
• Sensor technology is built into sports equipment.
• Many elite sports teams track athletes outside the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing!
Big Data Sizes
Source: https://www.c-sharpcorner.com/UploadFile/mahesh/cracking-big-data/
Case Study – Analyze Payment Risks
A product distributor wants to match invoices to payments received from vendors in order to predict payment risks. The invoice system and the receivables system are maintained in Oracle, and they want to present the data using Tableau, which is a data visualization tool.
Source: SpiderOpsNet
Big Data Solution Highlights
A cluster of 30 low-cost (commodity class) machines
Big Data Industry Examples
Market Basket Analysis
Predict Inventory Demand
Medical device calibration
Bank loan default prediction and fraud detection
Algorithmic trading in stock markets
Data collection for insurance underwriting
Sensor data ingestion and processing
DNA analysis in genetics
Short-term load forecasting in a power grid
Traditional Technology
• Large relational databases on SAN (storage area networks)
• Highly parallel processors
• Data may be distributed, but the data is brought to the processing
• High-end hardware with limited scalability (to the tune of $50,000 / TB)
• Parallel processing
How is Big Data Technology Different?
• Distributed processing
• Processing is divided into parallel tasks – as many as the number of blocks
• Tasks are distributed across the cluster of machines
• The process is sent to where the data is, instead of bringing the data to the process
• Fault-tolerant processing
Question
Big Data Solution Stages
BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION
Introduction to Hadoop
HDFS
• HDFS is a storage layer of Hadoop suitable for
distributed storage and processing.
• It provides file permissions, authentication, and
streaming access to file system data.
• HDFS can be accessed through the Hadoop command-line interface, as sketched below.
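A minimal sketch of driving the standard `hdfs dfs` commands from Python; it assumes a configured Hadoop client on the PATH, and the directory and file names are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and return its stdout as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# List the HDFS root directory
print(hdfs("-ls", "/"))

# Create a directory and upload a local file (hypothetical paths)
hdfs("-mkdir", "-p", "/data/logs")
hdfs("-put", "web_access.log", "/data/logs/")

# Stream the file contents back to the client
print(hdfs("-cat", "/data/logs/web_access.log"))
```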
Hadoop Architecture
• In the 1990s: hard disk capacity 1 GB to 20 GB, RAM 64 to 128 MB, I/O 10 Kbps
• Now: hard disk capacity 1 TB, RAM 16 – 32 GB, I/O 100 MB/s
• E.g. 100 TB of data accumulates within a few years, so you approach a data centre
• E.g. processing time: 1 person – 12 months, 12 people – 1 month
• E.g. files continuously coming in for processing
• You save data because you might need to process it later.
• So, you write a program to process the data.
• Before Hadoop, computation was processor-bound.
Hadoop Architecture
• Hadoop is an open-source framework from the Apache Software Foundation
• For small data, it is better to use a local disk
• Hadoop = HDFS (storage) + MapReduce (processing)
• Key terms for HDFS
• Cluster (a set of machines)
• Commodity hardware (cheap hardware)
• Streaming access pattern (write once, read many times)
• Blocks
• 4 KB in a typical operating system file system
• If you store 2 KB in a 4 KB block, the remaining space is wasted
Streaming Access Pattern
• HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
• It stores data in large blocks – such as 64 MB or 128 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks your hard drive has to do to read the data.
• In addition, HDFS is a user-space file system, so there is a single central name node that contains an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more), and are split up into several blocks. In order to read a file, the code asks the name node for a list of blocks and then reads the blocks sequentially, as in the sketch below.
• The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.
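The read path described above can be sketched in a few lines. This is an illustrative example only, assuming pyarrow with libhdfs is available on the client; the name node host, port, and file path are hypothetical.

```python
# A minimal sketch of the write-once / read-many access pattern: stream a large
# HDFS file sequentially in big chunks rather than seeking around in it.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode-host", 8020)   # hypothetical name node

CHUNK = 128 * 1024 * 1024  # read in 128 MB chunks, in line with the HDFS block size
total_bytes = 0
with hdfs.open_input_stream("/data/clickstream/2023.log") as stream:
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        total_bytes += len(chunk)   # a real job would parse or aggregate here

print(f"Read {total_bytes} bytes sequentially")
```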
Hadoop Architecture
• HDFS block size – 64 MB to 128 MB
• E.g. a 35 MB file occupies only 35 MB – no space is wasted
• Five services (daemons):
• Master – Name Node
• Master – Secondary Name Node
• Master – Job Tracker (Resource Manager)
• Slave – Data Node
• Slave – Task Tracker (Node Manager)
• Master services can talk to each other, slave services can talk to each other, and a master can talk to its slave services and vice versa (NN <-> DN, JT <-> TT).
Hadoop Architecture
• Writing to a Hadoop cluster (e.g. a 200 MB file, for illustration) for faster processing
• With a 64 MB block size, 4 blocks will be required (3 × 64 MB and 1 × 8 MB), as in the sketch below
• The client will contact the Name Node
• Metadata (data about data) will be stored at the Name Node
• A replication factor is applied to the blocks
• The Data Node communicates back to the client that the file has been written successfully.
• Data Nodes send heartbeats + block reports to the Name Node regularly.
• If any Data Node fails to send a heartbeat, the Name Node will modify the metadata and place the affected blocks on another node.
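The block arithmetic in the 200 MB example can be checked with a few lines of plain Python (illustrative only, not a Hadoop API call):

```python
# Plain-Python sketch of how a file is split into HDFS blocks.
def split_into_blocks(file_size_mb, block_size_mb=64):
    full, last = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if last:
        blocks.append(last)          # the final block only occupies what it needs
    return blocks

blocks = split_into_blocks(200)       # the 200 MB example from the slide
print(len(blocks), "blocks:", blocks) # 4 blocks: [64, 64, 64, 8]

replication_factor = 3                # default HDFS replication
print("Total block replicas stored:", len(blocks) * replication_factor)
```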
Hadoop Architecture
• What if the Name Node crashes?
• Name Nodes are highly reliable machines (they don't store huge amounts of data, only metadata)
• Still a single point of failure
• Processing:
• The Job Tracker accepts the client's processing request.
• The Job Tracker talks to the Name Node to get block details for the requested file
• The metadata is shared by the Name Node with the Job Tracker
• The Job Tracker selects the best possible Data Node and assigns the task to the Task Tracker on that node.
• This process is called MAP.
Limitations of Hadoop vs. Spark
Apache Spark
Driverless Cars
• Waymo, Uber, Tesla, others
• Mission: design a driverless car that can drive on highways
and roads
• Goals:
• Fully autonomous operation
• High safety
• Fleet operation
• Solutions
• Deep learning from video
• Object detection model
• Action model
Forensics
• Palantir and Amazon, as well as Chinese companies
• Mission: design system to predict crime
• Goals:
• Integration across multiple surveillance sources
• High reliability
• Transparent algorithms that comply with laws
• Solutions
• Predictive algorithms using history, surveillance and behaviour
analysis
• Transparent explanations
• Face and gait recognition
Military AI
• USA, Russia, China. Others?
• Mission: design autonomous forces for ground, sea,
amphibian and air combat and surveillance
• Goals:
• Autonomous combat
• Autonomous surveillance
• Swarm tactics and strategy
• Solutions
• Autonomous weaponised robots for ground combat
• Driverless terrestrial, underwater, and aerial combat vehicles
• Drone and dust swarms
Checkpoint Frequency
• Default: every 1 hour
• Every 64 MB of the edits log
• After 1 million transactions
Big Data Solution Stages
BIG DATA INGESTION → BIG DATA STORAGE → BIG DATA ANALYTICS → BIG DATA VISUALIZATION
Types of Data
• Internal Data
• Transactional data from relational databases
• Archived files
• Log files from applications
• Website click data
• Customer files
• External Data
• Social media, e.g. Twitter
• Web scraping
• Third-party agencies supplying trending data
Big Data Ingestion
• Batch Data
• Streaming Data
Types of data for Ingestion
• Data Ingestion: Getting data into the big data system.
• Batch Data
• Relational databases
• Archived Files
• Streaming Data
• Continuous flow of data
• Social media e.g. Twitter
• Data coming from a customer every 10 minutes
Batch Data
• Data that comes in periodically:
• Once a day or once a week etc.
• Examples:
• Data from RDBMS
• Data from archived files
• Data coming from clients
• Can be processed through a scheduler
• Bulk loading to big data system
• Tools: Sqoop, distcp
Streaming Data
• Data that comes in continuously
• Examples:
• Data from Twitter
• Data from application log files
• Data from messaging systems
• Needs to be processed as soon as the data arrives
• Fault tolerance and exactly-once processing are important
• Tools: Flume, Nifi, Flink
Features to Look Out For
• Fault tolerance
• No data should be lost, even if the process aborts
• Accurate processing
• Should not miss any data
• Should cope if the destination system is down
• Exactly-once processing
• If the process restarts, it should not re-process data that was already processed
• High scalability
• Should be able to handle a huge number of records per minute
Sqoop
• Part of Apache open source
• Used for importing relational data into Hadoop
• Extracts data from databases such as MySQL, Oracle, DB2, etc.
• Can import into HDFS, Hive, or HBase
• Can also be used to export data back to the database (see the sketch below)
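A sketch of a typical `sqoop import` invocation, launched from Python to stay consistent with the other examples; the JDBC URL, credentials, table, and target directory are hypothetical.

```python
# A minimal sketch of a "sqoop import" call that copies a relational table into HDFS.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",     # hypothetical JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",        # keep secrets out of the command line
    "--table", "invoices",
    "--target-dir", "/data/raw/invoices",               # destination directory in HDFS
    "--num-mappers", "4",                                # parallel copy with 4 map tasks
], check=True)
```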
DistCp
• DistCp is part of Apache Hadoop
• Used for importing file data into HDFS
• It can be used to get data from Amazon S3 into HDFS
• It can also be used to copy data from one HDFS installation to another
• It uses MapReduce to copy data in parallel
Flume
• Part of Apache open source
• Parallel copying of streaming data into HDFS
• A distributed, reliable, and available service
• Collects, aggregates, and moves log data
• Can move large amounts of data efficiently
• Used for getting web logs and program logs from multiple sources into HDFS
• Sqoop for structured data, Flume for unstructured data
Apache Flink
• Part of Apache open source, and also supported by Confluent Technologies
• Continuous processing of streaming data
• Can process millions of events per second
• Source -> sink based architecture, like Flume
• Exactly-once processing of data
• Provides high-level dataset APIs
• Supports out-of-order data processing
Out of Order Data
• Some data in the stream may arrive late
• If ingestion time is used, that data will be placed in the wrong batch
• Flink makes it possible to assign this data to the correct batch using the timestamp carried in the data itself, as illustrated below
• The event time that is part of the data can be used to determine the batch it belongs to
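A plain-Python illustration (not the Flink API) of why event time matters for late records; the timestamps are hypothetical.

```python
# A record produced at 10:04 arrives late, at 10:07. Bucketing it by ingestion
# time puts it in the wrong 1-minute batch; bucketing by event time does not.
from datetime import datetime

def one_minute_bucket(ts):
    """Assign a timestamp to a 1-minute batch/window."""
    return ts.replace(second=0, microsecond=0)

event_time     = datetime(2023, 5, 1, 10, 4, 30)   # when the reading happened
ingestion_time = datetime(2023, 5, 1, 10, 7, 2)    # when it reached the system

print("Ingestion-time bucket:", one_minute_bucket(ingestion_time))  # 10:07 - wrong batch
print("Event-time bucket:    ", one_minute_bucket(event_time))      # 10:04 - correct batch
```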
NiFi
• Part of Apache open source
• Supports dataflows with a graphical user interface
• Can route messages from data ingestion to processing using the GUI
• Can be configured with various connectors:
• Kafka
• HDFS
• Provides data provenance: tracks data as it flows through the system.
Kafka – Distributed Streaming Platform
• What is Apache Kafka?
• A distributed streaming platform
Example – Smart Electricity Meter
• A smart electricity meter generates load-related data every minute.
• A Kafka server receives this data every minute.
• Similarly, there are other meters in the city that send load data to this Kafka server.
• An application reads and processes this data, e.g. computing and monitoring the load of every house.
• As soon as the load goes above a pre-defined threshold, send an SMS within at most a few seconds or minutes.
• This is real-time stream processing (sketched below).
Apache Kafka is a highly scalable and distributed platform for creating and processing streams in real time.
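A minimal sketch of the producer side of this example using the kafka-python client; the broker address, topic name, and message layout are illustrative assumptions.

```python
# Producer side of the smart-meter example: publish one simulated reading per minute.
import json, time, random
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    reading = {
        "meter_id": "MTR-1042",
        "timestamp": int(time.time()),
        "load_kw": round(random.uniform(0.2, 6.0), 2),       # simulated load reading
    }
    producer.send("meter-load", value=reading)                # hypothetical topic name
    producer.flush()
    time.sleep(60)
```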
How does Kafka Work?
• Typically, 3 components:
• Publisher / Producer
• A client application that sends data records (messages)
• The broker is responsible for receiving messages from the producer and storing them on local disk.
• The consumers are client applications that read messages from the broker and process them (see the sketch below).
• The Broker is at the center and acts as the
middleman between Producer and
Consumer.
• Kafka Server is the Messaging Broker.
• Kafka works as Pub-Sub Messaging
System
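A matching consumer sketch that reads the same (assumed) `meter-load` topic from the broker and raises an alert when the load crosses a threshold; the topic, threshold, and SMS call are placeholders.

```python
# Consumer side: read meter readings from the broker and alert on high load.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "meter-load",                                            # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="load-monitor",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

THRESHOLD_KW = 5.0

def send_sms_alert(meter_id, load_kw):
    # Placeholder for a real SMS gateway call
    print(f"ALERT: {meter_id} load {load_kw} kW exceeds {THRESHOLD_KW} kW")

for message in consumer:
    reading = message.value
    if reading["load_kw"] > THRESHOLD_KW:
        send_sms_alert(reading["meter_id"], reading["load_kw"])
```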
Who created Kafka and Why ?
• Created at LinkedIn and open-sourced in 2011.
• It was initially designed to handle the data integration problem.
Who created Kafka and Why ?
• All the boxes (applications) generate and store some data.
• Data generated in one application is often needed in other applications.
• LinkedIn solved this problem using a pub-sub architecture.
Kafka in the Enterprise Application Ecosystem
• The circulatory system of your data ecosystem.
• Kafka occupies a central place in the real-time data integration infrastructure.
• Data producers can send data as messages.
• Messages are sent to the Kafka broker as soon as the business event occurs.
• Consumers can consume messages from the broker as soon as the data arrives at the broker.
• Can be designed for processing data in milliseconds.
• Producers always interact with Kafka brokers and need not be concerned with who is using the data.
• Producers and consumers can be added, removed, or modified as the business case evolves.