
Quizzes & Exams

QUIZ 1 (Classes 1-7)

1 – Despite its capabilities, TEZ still needs to store intermediate output to HDFS.
(2020, 2021)
FALSE
(Week 4, slide 48)

2 – Volume drives the need for processing and storage parallelism, and its management during
processing of large datasets.
(2020, 2021)
TRUE
(Week 1, slide 35)

3 – The servers where the data resides can only perform the map operation.
(2020, 2021)
TRUE
(Week 4, slide 27)

4 – The process of generating a new fsimage from a merge operation is called the Checkpoint process.
(2020, 2021)
TRUE
(Week 3, slide 15)

5 – Hive follows a “schema on read” approach, unlike RDBMS, which enforces “schema on write.”
(2020, 2021)
TRUE
(Week 5, slide 29)

6 – Big Data is a characterization only for volumes of data above one petabyte.
(2020, 2021)
FALSE
(Definition)

7 – The term MapReduce refers exclusively to a programming model.


(2020, 2021)
FALSE
(Week 4, slide 8)

8 – Hadoop is considered a schema-on-write system regarding write operations.


(2020, 2021)
FALSE
(Week 2, slide 30)

9 – One of the key design principles of HDFS is that it should favor low latency random access over high
sustained bandwidth.
(2020, 2021)
FALSE
(Week 2, slide 34)

10 – One of the key design principles of HDFS is that it should be able to use commodity hardware.
(2020, 2021)
TRUE
(Week 2, slide 24)

11 – The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything
systems, connected by a good network, working together to solve the same problem.
(2020, 2021)
FALSE
(Week 2, slide 23)
12 – Apache Tez is an engine built on top of Apache Hadoop YARN.
(2020, 2021)
TRUE
(Week 4, slide 42)

13 – The name node is not involved in the actual data transfer.


(2020, 2021)
FALSE
(Week 3, slide 33)

14 – Business benefits are frequently higher when addressing the variety of data than when addressing
volume
(2020, 2021)
TRUE
(Week 1, slide 37)

15 – In object storage, data is stored close to processing, just like in HDFS, but with rich metadata.
(2020, 2021)
FALSE
(Week 5, slide 6)

16 – The map function in MapReduce processes key/value pairs to generate a set of intermediate
key/value pairs.
(2020, 2021)
TRUE
(Week 4, slide 26)

17 - Essentially, MapReduce divides a computation into two sequential stages: map and reduce.
(2020, 2021)
FALSE
(Week 4, slide 26)

18 – When a HiveQL query is executed, Hive translates it into MapReduce, saving the time and effort
of writing actual MapReduce jobs.
(2020, 2021)
TRUE
(Week 5, slide 21)

19 – When we drop an external table in Hive, both the data and the schema will be dropped.
(2020, 2021)
FALSE
(Week 6, slide 34)

20 – Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed
as structured tables.
(2020, 2021)
TRUE
(Week 5, slide 20)

--//--

21 – YARN manages resources and monitors workloads, in a secure multitenant environment, while
ensuring high availability across multiple Hadoop clusters.
(2021)
TRUE
(Week 4, slide 34)
22 – TEZ provides the capability to build an application framework that allows for a complex DAG of tasks
for high-performance data processing only in batch mode.
(2021)
FALSE
(Week 4, slide 42)

23 – The operation
CREATE TABLE external
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
creates an external table in Hive.
(2021)
FALSE
(Should be CREATE EXTERNAL TABLE)
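For comparison, a minimal sketch of the external form is given below (the table name and HDFS path are illustrative, not taken from the slides):

CREATE EXTERNAL TABLE ext_example
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/ext_example';  -- external tables typically point at an existing HDFS directory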

24 – Internal tables in Hive can only be stored as TEXTFILE.


(2021)
FALSE
(Week 6, slide 50)
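As an illustration of why this is false, a hedged sketch of an internal (managed) table stored in a non-text format (table and column names are made up):

CREATE TABLE sales_orc
(id STRING,
amount DOUBLE)
STORED AS ORC;  -- managed tables can also be stored as SEQUENCEFILE, PARQUET, AVRO, etc.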

25 – We can say that internal tables in Hive adhere to a principle called “data on schema”.
(2021)
TRUE
(Week 5, slide 31)

QUIZ 2 (Classes 8-12)

1 – Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server.
(2020, 2021)
FALSE

2 – We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and 1000
records/sec, and emit up to 2MB/sec.
(2020, 2021)
FALSE

3 – In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data.
(2020, 2021)
TRUE

4 – In step 2, also, Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python
script against the streaming input data.
(2021)
FALSE

5 – In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream.
(2021)
TRUE

6 – All data in DynamoDB is replicated in two availability zones


(2020, 2021)
FALSE

7 – An end user provisions a Lambda service with steps similar to those used to provision an EC2 instance.
(2020, 2021)
FALSE

8 – The concept of partitioning can be used to reduce the cost of querying the data
(2020, 2021)
TRUE

9 – In a publish/subscribe model, although data producers are decoupled from data consumers,
publishers know who the consumers are.
(2020, 2021)
FALSE

10 – In Hive most of the optimizations are not based on the cost of query execution.
(2020, 2021)
TRUE
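As context (not from the slides): recent Hive versions do include a cost-based optimizer, but it has to be switched on and fed table statistics. A rough sketch of the usual settings, with an illustrative table name:

SET hive.cbo.enable = true;                      -- enable the cost-based optimizer
SET hive.compute.query.using.stats = true;       -- answer simple queries from stored statistics
SET hive.stats.fetch.column.stats = true;        -- use column-level statistics during planning
ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS;  -- collect the statistics the CBO needs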

11 – The number of shards cannot be modified after the Kinesis stream is created.
(2020, 2021)
FALSE

12 – The basic idea of vectorized query execution is to process a batch of columns as an array of line
vectors.
(2020, 2021)
FALSE
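For reference, vectorized query execution (which processes batches of rows as arrays of column vectors) is enabled per session; a minimal sketch, assuming an ORC-backed table named customer:

SET hive.vectorized.execution.enabled = true;         -- process rows in batches of column vectors
SET hive.vectorized.execution.reduce.enabled = true;  -- also vectorize the reduce side where supported
SELECT gender, COUNT(*) FROM customer GROUP BY gender;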

13 – In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle.
(2020, 2021)
TRUE
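A minimal bucketing sketch (table name, clustering column and bucket count are illustrative): each row is assigned to a bucket by hashing the clustering column.

SET hive.enforce.bucketing = true;  -- needed on older Hive versions before inserting into bucketed tables
CREATE TABLE customer_bucketed
(id STRING,
name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS   -- hash(id) mod 32 decides the bucket
STORED AS ORC;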

14 – The selection of the partition key is always an important factor for performance. It should always
be a low-cardinality attribute, to avoid the overhead of too many sub-directories.
(2020, 2021)
TRUE

15 – To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING, state STRING)
PARTITIONED BY (country STRING)
(2020, 2021)
FALSE
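The statement is false because the partition column is repeated in the regular column list. A corrected sketch (Hive derives country as a virtual column from the partition directory name):

CREATE TABLE customer
(id STRING,
name STRING,
gender STRING,
state STRING)
PARTITIONED BY (country STRING);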

16 – We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3.
(2020, 2021)
TRUE

17 – Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service.
(2021)
TRUE

18 – Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose.


(2021)
TRUE

19 – AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked.
(2020, 2021)
TRUE

20 – DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes.
(2020, 2021)
FALSE

21 – The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually
provision servers using the "AWS Kinesis Data Firehose Scaling API".
(2021)
FALSE

22 – Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC.
(2021)
TRUE

23 – Since it uses SQL as its query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE

24 – AWS Glue ETL jobs are Spark-based.


(2021)
TRUE

25 – Kinesis Data Streams is a good choice for long-term data storage and analytics.
(2021)
FALSE

26 – Amazon OpenSearch/Elasticsearch stores CSV documents.


(2021)
FALSE

-- // --

27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE

28 – Stream processing applications process data continuously in real time, usually after a store operation
(in HD or SSD).
(2020)
FALSE

29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE

30 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE

BDF 2020 Review Questions 1

1. What is HDFS, and what are HDFS design goals?


HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
• Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
• Fault tolerance—Keeps multiple copies of data to recover from failure.
• Capability to run on commodity hardware—Designed to run on commodity hardware.
• Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
• Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
• Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and
a task tracker (processing component). Processing is done where data exists, to avoid data
movement across nodes of the cluster.
• High throughput—Designed for parallel data storage and retrieval.
• HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node keeps the entire metadata, called the namespace (a hierarchy of files and directories),
in physical memory for quicker response to client requests; its persistent image on disk is called the
fsimage. Any change to the metadata is recorded in a transactional log called the edit log. For
persistence, both of these files are written to host OS drives. The name node responds to multiple
client requests simultaneously (it is a multithreaded system) and tells the client which data nodes to
connect to in order to write or read the data. While writing, a file is broken down into multiple chunks,
called blocks, of 128MB by default (64MB in some older configurations). Each block is stored as a
separate file on data nodes. Based on the replication factor of a file, multiple copies or replicas of each
block are stored for fault tolerance.

3. What is the default data block placement policy?


By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:
• The first replica is written to the data node creating the file.
• The second replica is written to another data node within the same rack.
• The third replica is written to a data node in a different rack.

4. What is the replication pipeline? What is its significance?


Data nodes maintain a pipeline for data transfer: data node 1 does not need to wait
for a complete block to arrive before it can start transferring it to data node 2 in the flow. In fact, the
data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When
data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and
immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the
first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts
transferring it to data node 3, and so on. This way, all the data nodes in the flow (except the last one)
receive data from the previous data node and, at the same time, transfer it to the next data node in
the flow, to improve the write performance by avoiding a wait at each stage.

5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.

6. How can you enable rack awareness in Hadoop?


You can make the Hadoop cluster rack aware by using a script that enables the master node to map
the network topology of the cluster using the properties topology.script.file.name or
net.topology.script.file.name, available in the core-site.xml configuration file. First, you must change
this property to specify the name of the script file. Then you must write the script and place it in the
file at the specified location. The script should accept a list of IP addresses and return the
corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument
and return /rack1 as the output.

7. What is the data block replication factor?


An application or a job can specify the number of replicas of a file that HDFS should maintain. The
number of copies or replicas of each block of a file is called the replication factor of that file. The
replication factor is configurable and can be changed at the cluster level or for each file when it is
created, or even later for a stored file.

8. What is block size, and how is it controlled?


When a client writes a file to a data node, it splits the file into multiple chunks, called blocks. This data
partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize
configuration property in the hdfs-site.xml file and applies to files that are created without a block
size specification. When creating a file, the client can also specify a block size to override
the cluster-wide configuration.

9. What is a checkpoint, and who performs this operation?


The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint by
downloading fsimage and the edit log file from the name node and then uploading the new fsimage
back to the name node. The name node performs a checkpoint upon restart (not periodically,
though—only on name node start-up).

10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that these blocks were not
corrupted (corruption might happen because of faults in a storage device, network transmission faults,
or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another
data node that has a replica of the corrupted block, to get another copy of the block. As for accidental
deletion: if the HDFS trash feature is enabled, a deleted file is first moved to the user's .Trash directory
and can be restored from there until the configured trash interval expires; otherwise the deletion is
permanent.

12. How can you access and manage files in HDFS?


You can access the files and data stored in HDFS in many different ways. For example, you can use
HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs
package, write a MapReduce job, or write Hive or Pig queries. In addition, you can even use a web
browser to browse the files from an HDFS cluster.

13. What two issues does HDFS encounter in Hadoop 1.0?


First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name
node, but it’s not an active-passive configuration. The secondary name node thus cannot be used for
failover in case the name node fails. Second, as the number of data nodes grows beyond 4,000, the
performance of the name node degrades, setting a kind of upper limit to the number of nodes in a
cluster.

14. What is a daemon?


The word daemon comes from the UNIX world. It refers to a process or service that runs in the
background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we
have daemons such as name node, data node, and secondary name node.

15. What is YARN and what does it do?


In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of
HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major
functions: resource management and application life-cycle management. The JobTracker previously
handled those functions. Now MapReduce is just a batch-mode computational layer sitting on top of
YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource
management and application life-cycle management functionalities. This makes Hadoop a general-
purpose data processing platform that is not constrained only to MapReduce.

16. What is uber-tasking optimization?


The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same
container or in the same JVM in which that application-specific Application Master is running. The
basic idea behind uber-tasking optimization is that the distributed task allocation and management
overhead exceeds the benefits of executing tasks in parallel for smaller jobs, hence it is optimal to
execute smaller jobs in the same JVM or container as the Application Master.

17. What are the different components of YARN?


Aligning to the original master-slave architecture principle, even YARN has a global or master Resource
Manager for managing cluster resources and a per-node (slave) Node Manager that takes direction
from the Resource Manager and manages resources on the node. These two form the computation
fabric for YARN. Apart from these, there is a per-application Application Master, which is merely an
application-specific library tasked with negotiating resources from the global Resource Manager and
coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers
also are present—these are a group of computing resources, such as memory, CPU, disk, and network.

What were the key design strategies for HDFS to become fault tolerant?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
Fault tolerance—Keeps multiple copies of data to recover from failure.
Capability to run on commodity hardware—Designed to run on commodity hardware.
Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a
task tracker (processing component). Processing is done where data exists, to avoid data movement
across nodes of the cluster.
High throughput—Designed for parallel data storage and retrieval.
HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

To what does the term data locality refer?


Data locality is the concept of processing data locally wherever possible. This concept is central to
Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across the
network by bringing the processing to the data instead of the reverse.

2nd Exam 2020

You may assume that you are a Data Scientist in a consulting assignment with the CDO (Chief Data
Officer) of Exportera (a large company). The CDO asks you lots of questions and very much appreciates
precise answers to his questions.

1. You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects
of flooding from the River Guardião. The municipality has already distributed IoT devices across
the river that are able to measure the flow of the water. In the kick-off meeting, you say: "I have
an idea for a possible solution. We may need to use a number of AWS services.” And then you
explain your solution. What do you say?
In order to collect and process these real-time streams of data records regarding the flow of the water,
the data from the IoT devices across the river is sent to AWS Kinesis Data Streams. Then this data is
processed using AWS Lambda and stored in AWS S3 buckets. The data collected into
Kinesis Data Streams can be used for simple data analysis and reporting in real time, using Amazon
Elasticsearch/OpenSearch. Finally, we can send the processed records to dashboards to visualize the
variability of the flow of the water, using Amazon QuickSight.

You do not need to worry about computer resources if you use AWS Kinesis. This system will enable
you to acquire data through a gateway in the cloud that will receive your data; then you can process
it using AWS Lambda and later store the data in a storage system, e.g. AWS S3.

2. You go to a conference and hear a speaker saying the following: "Real-time analytics is a key
factor for digital transformation. Companies everywhere are using their datalakes based on
technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You rise your hand for a comment. What are going to say? Please, justify
carefully.
Firstly, I would like to clarify that Hadoop is a stack of different components, one of which is the HDFS,
which is its distributed file system. Moreover, although HDFS has been thriving recently, there are plenty of
other choices such as Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, or
Apache Kudu for IoT and analytic data.

Hadoop is a framework that incorporates several technologies including batch, streaming (which you
mentioned) and others. HDFS is Hadoop's file system.

3. In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?
To begin with, Hive is Hadoop’s SQL-like analytics query engine, which enables the writing of data
processing logic or queries in a declarative language similar to SQL, called HiveQL. The brain of Hive is
the Hive Driver, which maintains the lifecycle of a HiveQL statement: it comprises a query compiler,
optimizer and executor, executing the task plan generated by the compiler in proper dependency
order while interacting with the underlying Hadoop instance.

Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer
over Hadoop MapReduce with a SQL-like query engine. It enables several types of connections to
other databases using the Hive Driver; the Hive Driver can, for instance, connect through ODBC to
relational databases.

4. On implementing Hadoop, the CDO is worried about understanding how a name node ensures that
all the data nodes are functioning properly. What can you tell him to reassure him?
The name node, or master node, contains the metadata related to the Hadoop Distributed File System,
such as the file name, file path or configurations. It keeps track of all the data nodes, or slave nodes,
using the heartbeat methodology. Each data node regularly sends heartbeat signals to the name node.
After receiving these signals, the name node knows that the slave nodes are alive and functioning
properly. In the event that the name node does not receive a heartbeat signal from a data node,
that particular data node is considered inactive, and the name node starts the process of replicating
its blocks on some other data node.

Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

5. The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is
concerned about losing data if the data delivery to the destination is falling behind data writing to
the delivery stream. Can you help him out understanding how this process works, in order to
alleviate his concerns?
In Amazon Kinesis Firehose, the scaling is handled automatically, up to gigabytes per second, in order
to meet the demand. Amazon Kinesis Firehose will write each log record to Amazon Simple Storage
Service (Amazon S3) for durable storage of the raw log data, and the Amazon Kinesis Analytics
application will continuously run a Kinesis Streaming SQL statement against the streaming input data.
In addition, if data delivery to the destination falls behind, Kinesis Data Firehose keeps the data
buffered and retries delivery (for Amazon S3, retries continue for up to 24 hours). This way, you should
not be concerned.

6. The CDO of Exportera asks you to prepare the talking points for a presentation he must make
to the board regarding a budget increase for his team. It is important that the board members
understand the impacts of data variety versus data volume. What can you tell
them?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s). Volume is the most commonly
recognized characteristic of Big Data, representing the large amount of data available for analysis to
extract valuable information. The time and expense required to process large datasets drives the need
for processing and storage parallelism. On the other hand, data variety represents the need to analyze
data from multiple sources, domains and data types to bring added value to the business, which drives
the need for distributed processing. This way, a budget increase is needed to meet the added
requirements that big data processing brings to the table.

While volume can be handled by scaling the processing and storage resources, the issues related to
variety require customization of processing, namely interfaces, and programming of handlers. For
instance, handling objects coming from SQL sources or OSI-PI sources requires different types of
programming skills. Not having the right skills to handle this can ruin the ambition of storing such
diverse, and therefore rich, data.
To acquire these skills we need a budget increase for the department.

7. The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables can
only be made on the basis of one parameter, making it not as useful as it could be." What can
you answer him?
You are wrong. Hive partitioning can be done on more than one column of the original table.
However, the partition key should always be a low-cardinality attribute, to avoid the overhead of too
many partitions. It is very useful, as a query with partition filtering will only load data from the specified
partitions, so it can execute much faster than a normal query that filters by a non-partitioning field.
Hive partitioning can be done with more than one column from the original table. The partitioning
works by dividing the table into sub-directories (folders), one for each combination of partition values.
The partitioning can even be further optimized with bucketing, as sketched below.
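A hedged sketch of multi-column partitioning (table, columns and values are illustrative); a partition-filtered query reads only the matching sub-directories:

CREATE TABLE customer
(id STRING,
name STRING,
gender STRING)
PARTITIONED BY (country STRING, state STRING);

SELECT COUNT(*)
FROM customer
WHERE country = 'PT' AND state = 'Lisboa';  -- scans only .../country=PT/state=Lisboa/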

8. The CDO of Exportera has been exploring stream processing and is worried about the latency of
a solution based on a platform such as Kinesis Data Streams. To address the question, what can
you tell him?
Kinesis Data Streams can be used efficiently to solve a variety of streaming data problems. It ensures
durability and elasticity, which enables you to scale the stream up or down, so that you never lose
data records before they expire. The delay between the time a record is put into the stream and the
time it can be retrieved is typically less than 1 second, so you don’t have to worry!

The Kinesis Data Streams platform is built on high-throughput, large-bandwidth
components.
Kinesis ensures durability and elasticity of data acquired through streaming. The elasticity of Kinesis
Data Streams enables you to scale the stream up or down, so that you never lose data records before
they expire (1 day by default).

9. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully
understand it." Then you explain it very clearly to him with a small example: counting the words of
the following text:
"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"
Justify carefully.
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
One of the applications of the MapReduce algorithm is to count words. If we take into consideration
the example mentioned:
1 – In the first step, the map() phase, all the words in the text are split into a list ordered from the first
word of the text until the last one:
“Mary”,
“had”,
“a”,
“little”,
“lamb”,
“Little”,
“lamb”,
(…)
“white”,
“as”,
“snow”

Still in the map() phase, each word is converted into a key-value pair, where the key is the word and the
value is 1:
(“Mary”: 1),
(“had”: 1),
(“a”: 1),
(…)
(“snow”:1).
2 – In the shuffle phase, all the key-value pairs are sorted alphabetically by the key, so that similar keys
can get together
3 – Finally, in the reduce phase, there is an aggregation by key (word), performing a count of the
respective values. This way, the final result is the key-value pairs of the distinct words and their count
of occurrences throughout the text

MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its
implementation is usually divided into three steps: mapping the data variables considered, shuffling
those variables following a pattern/directive, and then aggregating/reducing them.
Let us take the text above as reference for the following example: We want to count the number of
words in the text.
In the first step we would map all the words, let say, continuously, getting a list like:
"Mary
had
a
little
...
as
snow"
If one wants to count the number of words, a number should be associated with each word for later
counting. So we can associate "1" with each word, creating, still in the Map step, a kind of key-value set
associating each word with "1", e.g.:
<"Mary",1>, where the word is the Key.
<Mary,1>
<little,1>
In the next step one could shuffle this key-value set so that all similar keys come together. But
since one just wants to count the number of words, we just need to sum the values and this operation
is the Reduce operation in this case.
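As a side note, assuming a hypothetical table lines with a single STRING column line holding the text, the same word count can be written in HiveQL; Hive compiles the GROUP BY into exactly this map, shuffle and reduce pattern:

SELECT word, COUNT(*) AS occurrences
FROM lines
LATERAL VIEW explode(split(line, '\\s+')) w AS word  -- map: emit one row per token
GROUP BY word;                                       -- shuffle by word, then reduce with COUNT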

10. In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in
motion is no different than processing data at rest." If you had to join the conversation, what
would you say to them?
I don’t agree with you. Processing data in motion is very different from processing data at rest. The
operational difference between streaming data and data at rest lies in when the data is stored and
analyzed. When dealing with data at rest, each observation is recorded and stored before performing
analysis on the data. On the other hand, in streaming data, each event is processed as it is read, and
subsequent results are stored in a database. However, there are some similarities: data at rest and
streaming data can come from the same source, they can be processed with the same analytics and
can be stored with the same storage service.
Processing data in motion (streaming) is much different from processing data at rest. The main
differences are that:
1. Analytics over streaming data needs to be done on the data as it arrives, not on data at rest.
2. Data in motion needs buffering to process the incoming volumes, while comparable processing of
data at rest can be done through querying.
3. The concept of incoming data in streaming does not apply in data at rest. In the former case one
needs to adapt the acquisition to the amount of incoming data.
Of course, there are also some similarities, namely in data quality control, such as the detection of
common types of errors like invalid or missing values. Anyway, the processing, which is what we
are talking about, is quite different in the two cases.

--//--

1. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand
it. Can you explain very clearly its processes, and the role of the NameNode and the DataNodes in a
MapReduce operation?" Can you help him out?
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
During this process, the NameNode allocates which DataNodes, meaning individual servers, will
perform the map operation and which will perform the reduce operation. This way, mapper
DataNodes perform the map phase and produce key-value pairs and the reducer DataNodes apply the
reduce function to the key-value pairs and generate the final output.

2. The CDO of Exportera asks you to prepare the talking points for a presentation he must make to
the board regarding a budget increase for his team. It is important that the board members
understand what data variety versus data variability is, as well as its impacts on the analytics
platforms (and, of course, business benefits of addressing them.) What do you write to make it
crystal clear?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s).
While the variety characteristic represents the need to analyze data from multiple sources and data
types, the variability refers to the changes in dataset characteristics, whether in the data flow rate,
format/structure and/or volume. When it comes to variety of data, distributed processing should be
applied on different types of data, followed by individual pre-analytics. On the other hand, variability
implies the need to scale-up or scale-down to efficiently handle the additional processing load, which
justifies the use of cloud computing. Finally, regarding the benefits to the business, both have their
advantages: variety of data brings an additional richness to the business, as more details from multiple
domains are available; variability keeps the systems efficient, as we don’t always have to design and
provision resources for the expected peak capacity.
