Big Data Analytics Platforms

Reading Reference for Lecture 2

2 E6893 Big Data Analytics – Lecture 2: Big Data Platform © 2021 CY Lin, Columbia University
Remind -- Apache Hadoop

The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.

The project includes these modules:

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-
throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

http://hadoop.apache.org
Remind -- Hadoop-related Apache Projects
• Ambari™: A web-based tool for provisioning, managing, and monitoring Hadoop
clusters. It also provides a dashboard for viewing cluster health and the ability to view
MapReduce, Pig and Hive applications visually.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for
large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel
computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications,
including ETL, machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which
provides a powerful and flexible engine to execute an arbitrary DAG of tasks to
process data for both batch and interactive use-cases.
• ZooKeeper™: A high-performance coordination service for distributed applications.

Four distinctive layers of Hadoop

Common Use Cases for Big Data in Hadoop

• Log Data Analysis

– The most common use case; it fits the HDFS write-once, read-often model
well.
• Data Warehouse Modernization
• Fraud Detection
• Risk Modeling
• Social Sentiment Analysis
• Image Classification
• Graph Analysis
• Beyond

D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Example: Business Value of Log Analysis – “Struggle Detection”

D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014
Remind -- MapReduce example

http://www.alex-hanna.com
MapReduce Process on User Behavior via Log Analysis

D. deRoos et al, Hadoop for Dummies, John Wiley & Sons, 2014

Setting Up the Hadoop Environment

• Local (standalone) mode

• Pseudo-distributed mode
• Fully-distributed mode

Data Storage Operations on HDFS

• Hadoop is designed to work best with a modest number of extremely large files.
• Average file sizes ➔ larger than 500MB.

• Write Once, Read Often model.

• Content of individual files cannot be modified, other than appending new data at
the end of the file.

• What we can do:

– Create a new file
– Append content to the end of a file
– Delete a file
– Rename a file
– Modify file attributes like owner

Remind -- Hadoop Distributed File System (HDFS)

http://hortonworks.com/hadoop/hdfs/
HDFS blocks

• A file is divided into blocks (default: 64 MB in Hadoop 1, 128 MB since Hadoop 2) and each block is replicated in multiple places (default: 3 replicas)

• Dividing files into blocks is normal for a file system. E.g., the default block size in Linux is 4KB.
What sets HDFS apart is the scale.
• Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a
central server.
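
As a worked example of the block arithmetic on this slide, the sketch below computes how many blocks a file occupies and how much raw storage its replicas consume. The function name and the 1 GB file size are illustrative assumptions; the 64 MB block size and the replication factor of 3 are the defaults quoted above.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (number of blocks, number of block replicas, raw bytes stored)."""
    # A file is cut into fixed-size blocks; the last block may be partly filled.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is copied `replication` times across the cluster.
    num_replicas = num_blocks * replication
    # A partly filled last block only uses the space it needs,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, num_replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024 * MB)  # a hypothetical 1 GB file
print(blocks, replicas, raw // MB)
```

For comparison, the same 1 GB file cut into 4 KB blocks would need 262,144 blocks, which is why large blocks keep the amount of metadata tracked by the central server manageable.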

HDFS blocks

• Replication patterns of data blocks in HDFS.

• When HDFS stores the replicas of the original blocks across the Hadoop cluster, it tries to
ensure that the block replicas do not share a single point of failure (for example, the same node or rack).
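
A simplified sketch of rack-aware placement for three replicas follows. It is a toy model rather than HDFS's placement code: the node and rack names are hypothetical, and it assumes the cluster has at least one remote rack with two nodes. It loosely follows the commonly documented pattern of one replica on the writer's node and two on different nodes of another rack.

```python
def place_replicas(writer_node, topology):
    """Pick three nodes for a block's replicas.

    topology maps node name -> rack name (hypothetical names).
    """
    local_rack = topology[writer_node]
    # Replica 1: the node that is writing the block.
    first = writer_node
    # Replica 2: a node on a different rack, so losing one rack cannot lose the block.
    second = next(n for n in sorted(topology) if topology[n] != local_rack)
    # Replica 3: a different node on the same rack as replica 2.
    third = next(
        n for n in sorted(topology)
        if topology[n] == topology[second] and n != second
    )
    return [first, second, third]

topology = {
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB",
}
replicas = place_replicas("node1", topology)
print(replicas)
```

With this layout the three replicas span two racks and three distinct nodes, so neither a single node failure nor a single rack failure removes every copy.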

HDFS is a User-Space-Level file system

Interaction between HDFS components

HDFS Federation

• Before Hadoop 2.0, the NameNode was a single point of failure and an operational
bottleneck.
• Before Hadoop 2, few Hadoop clusters were able to
scale beyond 3,000 or 4,000 nodes.
• Multiple NameNodes can be used in Hadoop 2.x. (HDFS High Availability
feature – one is in an Active state, the other one is in a Standby state).

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
High Availability of the NameNodes

• Active NameNode
• Standby NameNode – keeps the state of the block locations and block metadata in memory and
takes over the HDFS checkpointing responsibilities.

• JournalNode – if a failure occurs, the Standby NameNode reads all completed journal entries to
ensure the new Active NameNode is fully consistent with the state of the cluster.
• ZooKeeper – provides coordination and configuration services for distributed systems.
Several useful commands for HDFS

• All hadoop commands are invoked by the bin/hadoop script.

• % hadoop fsck / -files -blocks:

➔ list the blocks that make up each file in HDFS.

• For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is
file.
• A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost

• The HDFS file system shell commands are similar to Linux file commands, with the following
general syntax: hdfs dfs -file_cmd

• For instance mkdir runs as:

$ hdfs dfs -mkdir /user/directory_name
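
The sketch below illustrates the two ideas on this slide: how a fully qualified path splits into scheme, host, and path, and how a file system shell command line is assembled. The helper names are hypothetical, and the command builder assumes the `hdfs dfs` launcher shipped with current Hadoop releases; it only builds the argument list and does not run anything.

```python
from urllib.parse import urlsplit

def describe_path(uri):
    """Split a fully qualified path into scheme, NameNode host, and path."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path

def hdfs_dfs(file_cmd, *args):
    """Build the argument list for `hdfs dfs -<file_cmd> <args>` without running it."""
    return ["hdfs", "dfs", f"-{file_cmd}", *args]

print(describe_path("hdfs://namenodehost/parent/child"))
print(hdfs_dfs("mkdir", "/user/directory_name"))
```

On a machine with a Hadoop client installed, the returned list could be passed to `subprocess.run` to execute the command.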

Several useful commands for HDFS -- II

YARN

• YARN – Yet Another Resource Negotiator:

– A tool that enables the other processing frameworks to run on Hadoop.

– A general-purpose resource management facility that can schedule and
assign CPU cycles and memory (and in the future, other resources, such as
network bandwidth) from the Hadoop cluster to waiting applications.

➔ YARN has converted Hadoop from simply a batch
processing engine into a platform for many different modes
of data processing, from traditional batch to interactive
queries to streaming analysis.

Four distinctive layers of Hadoop

Hadoop execution

1. The client application submits an application request to the JobTracker.

2. The JobTracker determines how many processing resources are needed to execute the entire
application.
3. The JobTracker looks at the state of the slave nodes and queues all the map tasks and reduce tasks
for execution.
4. As processing slots become available on the slave nodes, map tasks are deployed to the slave nodes.
Map tasks are assigned to nodes where the data they process is stored.
5. The JobTracker monitors task progress. If a task fails, it is restarted on the next available slot.
6. After the map tasks are finished, reduce tasks process the interim results sets from the map tasks.
7. The result set is returned to the client application.
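
Steps 3 and 4 above can be sketched as a toy scheduler that queues every map task and hands each one to a node that already stores its input block, provided that node has a free slot. This is an illustrative model, not the JobTracker's scheduling code; the task names, node names, and slot counts are made up.

```python
from collections import deque

def schedule_map_tasks(block_locations, free_slots):
    """Assign each map task to a node that stores its input block, if a slot is free.

    block_locations: task name -> list of nodes holding that task's block.
    free_slots: node name -> number of free map slots.
    Returns (assignments, still_queued).
    """
    slots = dict(free_slots)          # copy so the caller's dict is not modified
    queue = deque(block_locations)    # step 3: all map tasks are queued
    assignments, still_queued = {}, []
    while queue:
        task = queue.popleft()
        # Step 4: prefer a node where the task's data is already stored.
        node = next((n for n in block_locations[task] if slots.get(n, 0) > 0), None)
        if node is None:
            still_queued.append(task)  # wait until a slot frees up
        else:
            slots[node] -= 1
            assignments[task] = node
    return assignments, still_queued

block_locations = {
    "map-0": ["node1", "node2"],
    "map-1": ["node1", "node3"],
    "map-2": ["node1"],
}
assignments, waiting = schedule_map_tasks(
    block_locations, {"node1": 1, "node2": 1, "node3": 1}
)
print(assignments, waiting)
```

Here `map-2` stays queued because the only node holding its block has no free slot, which mirrors the "as processing slots become available" wording of step 4.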

Limitation of original Hadoop 1

• MapReduce is a successful batch-oriented programming model.

• A glass ceiling in terms of wider use:

– Exclusive tie to MapReduce, which means it could be used only for batch-
style workloads and for general-purpose analysis.

• Triggered demands for additional processing modes:

– Graph Analysis
– Stream data processing
– Message passing
➔ Demand is growing for real-time and ad-hoc analysis
➔ Analysts ask many smaller questions against subsets of data
and need a near-instant response.
➔ Some analysts are more used to SQL & Relational databases

YARN was created to move beyond the limitations
of a Hadoop 1 / MapReduce world.
Hadoop Data Processing Architecture

YARN’s application execution

• Client submits an application to Resource Manager.

• The Resource Manager asks a Node Manager to create an Application Master instance, which then starts up.
• The Application Master initializes itself and registers with the Resource Manager.
• The Application Master figures out how many resources are needed to execute the application.
• The Application Master then requests the necessary resources from the Resource Manager. It sends
heartbeat messages to the Resource Manager throughout its lifetime.
• The Resource Manager accepts the request and queues it.
• As the requested resources become available on the slave nodes, the Resource Manager grants the
Application Master leases for containers on specific slave nodes.
• …. ➔ only need to decide on how much memory tasks can have.
Remind -- MapReduce Data Flow

http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/
MapReduce Use Case Example – flight data

• Data Source: Airline On-time Performance data set (flight data set).
– All the logs of domestic flights from the period of October 1987 to April 2008.
– Each record represents an individual flight where various details are
captured:
• Time and date of arrival and departure
• Originating and destination airports
• Amount of time taken to taxi from the runway to the gate.

– Download it from Statistical Computing: http://stat-computing.org/dataexpo/2009/

Other datasets available from Statistical Computing

http://stat-computing.org/dataexpo/

Flight Data Schema

MapReduce Use Case Example – flight data

• Count the number of flights for each carrier

• Serial way (not MapReduce):
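
A minimal sketch of the serial approach: read the records one by one and keep a running count per carrier. The column name `UniqueCarrier` and the three sample rows are assumptions made for illustration; they stand in for the real flight data file.

```python
import csv
import io
from collections import Counter

def count_flights_by_carrier(csv_file):
    """Serially count one flight per row, keyed by the carrier column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row["UniqueCarrier"]] += 1
    return counts

# Three made-up rows standing in for the real data set.
sample = io.StringIO(
    "Year,Month,UniqueCarrier,Origin,Dest\n"
    "2008,1,WN,IAD,TPA\n"
    "2008,1,AA,JFK,LAX\n"
    "2008,1,WN,IND,BWI\n"
)
counts = count_flights_by_carrier(sample)
print(dict(counts))
```

The whole file passes through a single loop on a single machine, which is the bottleneck the parallel version removes.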

MapReduce Use Case Example – flight data

• Count the number of flights for each carrier

• Parallel way:
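
The parallel approach can be sketched in plain Python as three phases: map each input split to (carrier, 1) pairs, shuffle the pairs so that values are grouped by carrier, and reduce each group to a sum. This toy version runs the phases sequentially in one process; the record layout (carrier in the third field) and the sample rows are assumptions.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit (carrier, 1) for every flight record in one input split."""
    return [(line.split(",")[2], 1) for line in lines]

def shuffle(mapped_splits):
    """Shuffle and sort: group all emitted values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for carrier, one in pairs:
            grouped[carrier].append(one)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Reducer: sum the list of ones for each carrier."""
    return {carrier: sum(ones) for carrier, ones in grouped.items()}

# Two made-up input splits; on a cluster each would be mapped on a different node.
splits = [
    ["2008,1,WN,IAD,TPA", "2008,1,AA,JFK,LAX"],
    ["2008,1,WN,IND,BWI", "2008,1,UA,ORD,SFO"],
]
mapped = [map_phase(split) for split in splits]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

Because each call to `map_phase` touches only its own split, the map calls are independent and could run on separate nodes at the same time.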

MapReduce application flow

MapReduce steps for flight data computation

FlightsByCarrier application

Create FlightsByCarrier.java:

FlightsByCarrier application

FlightsByCarrier Mapper

FlightsByCarrier Reducer

Run the code

See Result

HBase

HBase is modeled after Google’s BigTable and written in Java. It is developed on top of
HDFS.

It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of
information caught within a large collection of empty or unimportant data, such as finding
the 50 largest items in a group of 2 billion records, or finding the non-zero items
representing less than 0.1% of a huge collection).

HBase features compression, in-memory operation, and Bloom filters on a per-column basis.

An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database. Each table must have an element defined as a Primary Key (the row key),
and all access attempts to HBase tables must use this Primary Key. An HBase column
represents an attribute of an object.
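
To make the data model concrete, here is a toy in-memory table in which every read and write goes through a row key and only the cells that exist are stored. It is a sketch of the idea, not the HBase client API; the class name, row keys, and column names are hypothetical.

```python
class SparseTable:
    """Toy table: row key -> {column name: value}; absent cells take no space."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        # All access goes through the row key, as the slide describes.
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

table = SparseTable()
table.put("flight-0001", "carrier", "WN")
table.put("flight-0001", "origin", "IAD")
table.put("flight-0002", "carrier", "AA")   # no "origin" cell stored for this row

print(table.get("flight-0001"))
print(table.get("flight-0002", "origin"))
```

Rows may carry different sets of columns, which is what makes this layout a good fit for sparse data.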

Characteristics of data in HBase

Sparse data

HDFS lacks random read and write access. This is where HBase comes into the picture. It's a
distributed, scalable, big data store, modeled after Google's BigTable. It stores data as
key/value pairs.
HBase Architecture

Creating, Dropping, and Altering DB in Hive

Another Hive Example
Hive’s operation modes

Reference

Spark Core provides the basic functionality of Spark, including components for:

• Task Scheduling
• Memory Management
• Fault Recovery
• Interacting with Storage Systems
• and more

Spark Core is also home to the API that defines resilient distributed datasets (RDDs), Spark’s main
programming abstraction.

An RDD represents a collection of items distributed across many compute nodes that can be
manipulated in parallel.

First language to use — Python

Spark’s Python Shell (PySpark Shell)
bin/pyspark

• At a high level, every Spark application consists of a driver program that launches various
parallel operations on a cluster.

• The driver program contains your application’s main function and defines distributed
datasets on the cluster, then applies operations to them.

• In the preceding example, the driver program was the Spark shell itself.

• Driver programs access Spark through a SparkContext object, which represents a
connection to a computing cluster.

• In the shell, a SparkContext is automatically created as the variable called sc.

Driver programs typically manage a number of nodes called executors.

If we run the count() operation on a cluster, different machines might count lines in different
ranges of the file.

lambda —> define functions inline in Python.
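
A short plain-Python illustration of an inline lambda used as a filter predicate, next to the equivalent named function. The sample lines are made up. In PySpark the same lambda would typically be passed to an RDD's `filter` method; the built-in `filter` is used here so the snippet runs without Spark.

```python
lines = [
    "Apache Spark",
    "Spark has a Python API",
    "and a Scala API",
]

# A lambda is an anonymous function written inline as an expression.
python_lines = list(filter(lambda line: "Python" in line, lines))

# The same predicate written as a named function.
def has_python(line):
    return "Python" in line

assert python_lines == list(filter(has_python, lines))
print(python_lines)
```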

Resilient Distributed Dataset (RDD) Basics

• An RDD in Spark is an immutable distributed collection of objects.

• Each RDD is split into multiple partitions, which may be computed on different nodes of the
cluster.

• Users create RDDs in two ways: by loading an external dataset, or by distributing a
collection of objects in their driver program.

• Once created, RDDs offer two types of operations: transformations and actions.

[Code figure with three annotated steps: create an RDD, apply a transformation, run an action]

Transformations and actions are different because of the way Spark computes RDDs.
==> Spark computes an RDD lazily: only the first time it is used in an action.
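
The difference between transformations and actions can be modeled with a toy class that merely records transformations and runs them only when an action is called. This is a plain-Python stand-in for illustration, not Spark's implementation; the class, the `load_lines` helper, and the sample lines are hypothetical.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source      # a function returning the raw items
        self._steps = steps        # recorded transformations, not yet applied

    def filter(self, predicate):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (predicate,))

    def _compute(self):
        items = self._source()
        for predicate in self._steps:
            items = [item for item in items if predicate(item)]
        return items

    def count(self):
        # Action: triggers the actual computation.
        return len(self._compute())

    def first(self):
        # Action: also triggers the computation.
        return self._compute()[0]

loads = []

def load_lines():
    loads.append(1)  # record every time the source is really read
    return ["Spark in Python", "Spark in Scala", "Python lambda"]

lines = LazyDataset(load_lines)                              # create
python_lines = lines.filter(lambda line: "Python" in line)   # transformation
assert loads == []                                           # nothing computed yet
print(python_lines.count())                                  # action
```

The source is read only when `count()` runs, which is the lazy behavior the slide describes.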
Persistence in Spark

• By default, RDDs are computed each time you run an action on them.
• If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using
RDD.persist().
• RDD.persist() will then store the RDD contents in memory and reuse them in future actions.
• Persisting RDDs on disk instead of memory is also possible.
• Not persisting by default may seem unusual, but it makes sense for big
data: if an RDD is not reused, there is no reason to spend storage space on it.
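
The effect of persisting can be illustrated with a toy model that counts how often the data is really computed, with and without a `persist()` call. It is a sketch of the behavior described above, not Spark's caching code; the class and function names are hypothetical.

```python
class Dataset:
    """Toy model of recompute-by-default versus persist(); not Spark's implementation."""

    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cache = None
        self.computations = 0      # how many times the data was really computed

    def persist(self):
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._cache is not None:
            return self._cache     # reuse the stored contents
        self.computations += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

    def count(self):
        return len(self.collect())

def squares():
    return [n * n for n in range(5)]

plain = Dataset(squares)
plain.count()
plain.collect()
print(plain.computations)   # two actions, two computations

kept = Dataset(squares).persist()
kept.count()
kept.collect()
print(kept.computations)    # two actions, one computation
```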

Using Spark SQL — Steps and Example

Query testtweet.json
Get it from Learning Spark GitHub ==> https://github.com/databricks/learning-spark/tree/master/files

Machine Learning Library in Spark — MLlib

An example of using MLlib for a text classification task, e.g., identifying spam emails.

Feature Extraction Example — TF-IDF
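
A small worked example of TF-IDF in plain Python, using the textbook weighting tf × log(N / df), where tf is the term count within a document, N is the number of documents, and df is the number of documents containing the term. This is an illustrative sketch and not MLlib's code; libraries often use a smoothed variant of the idf term, and the three sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)   # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "cheap pills buy cheap",
    "meeting agenda for monday",
    "buy the agenda",
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "cheap" gets the highest weight because it is frequent there and absent elsewhere, while "buy" is down-weighted because it also appears in another document.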
