
Quizzes & Exams

QUIZ 1 (Classes 1-7)

1 – Despite its capabilities, TEZ still needs to store intermediate output to HDFS.
(2020, 2021)
FALSE
(Week 4, slide 48)

2 – Volume drives the need for processing and storage parallelism, and its management during
processing of large datasets.
(2020, 2021)
TRUE
(Week 1, slide 35)

3 – The servers where the data resides can only perform the map operation.
(2020, 2021)
TRUE
(Week 4, slide 27)

4 – The process of generating a new fsimage from a merge operation is called the Checkpoint process.
(2020, 2021)
TRUE
(Week 3, slide 15)

5 – Hive follows a “schema on read” approach, unlike RDBMS, which enforces “schema on write.”
(2020, 2021)
TRUE
(Week 5, slide 29)

6 – Big Data is a characterization only for volumes of data above one petabyte.
(2020, 2021)
FALSE
(Definition)

7 – The term MapReduce refers exclusively to a programming model.


(2020, 2021)
FALSE
(Week 4, slide 8)

8 – Hadoop is considered a schema-on-write system regarding write operations.


(2020, 2021)
FALSE
(Week 2, slide 30)

9 – One of the key design principles of HDFS is that it should favor low latency random access over high
sustained bandwidth.
(2020, 2021)
FALSE
(Week 2, slide 34)

10 – One of the key design principles of HDFS is that it should be able to use commodity hardware.
(2020, 2021)
TRUE
(Week 2, slide 24)

11 – The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything
systems, connected by a good network, working together to solve the same problem.
(2020, 2021)
FALSE
(Week 2, slide 23)
12 – Apache Tez is an engine built on top of Apache Hadoop YARN.
(2020, 2021)
TRUE
(Week 4, slide 42)

13 – The name node is not involved in the actual data transfer.


(2020, 2021)
FALSE
(Week 3, slide 33)

14 – Business benefits are frequently higher when addressing the variety of data than when addressing
volume
(2020, 2021)
TRUE
(Week 1, slide 37)

15 – In object storage, data is stored close to processing, just like in HDFS, but with rich metadata.
(2020, 2021)
FALSE
(Week 5, slide 6)

16 – The map function in MapReduce processes key/value pairs to generate a set of intermediate
key/value pairs.
(2020, 2021)
TRUE
(Week 4, slide 26)

17 - Essentially, MapReduce divides a computation into two sequential stages: map and reduce.
(2020, 2021)
FALSE
(Week 4, slide 26)

18 – When a HiveQL query is executed, Hive translates it into MapReduce, saving the time and effort
of writing actual MapReduce jobs.
(2020, 2021)
TRUE
(Week 5, slide 21)

19 – When we drop an external table in Hive, both the data and the schema will be dropped.
(2020, 2021)
FALSE
(Week 6, slide 34)

20 – Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed
as structured tables.
(2020, 2021)
TRUE
(Week 5, slide 20)

--//--

21 – YARN manages resources and monitors workloads, in a secure multitenant environment, while
ensuring high availability across multiple Hadoop clusters.
(2021)
TRUE
(Week 4, slide 34)
22 – TEZ provides the capability to build an application framework that allows for a complex DAG of tasks
for high-performance data processing only in batch mode.
(2021)
FALSE
(Week 4, slide 42)

23 – The operation
CREATE TABLE external
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
creates an external table in Hive.
(2021)
FALSE
(Should be CREATE EXTERNAL TABLE)
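For comparison, a minimal sketch of the external form is given below (the table name and HDFS path are illustrative, not taken from the slides):

CREATE EXTERNAL TABLE ext_example
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/ext_example';  -- external tables typically point at an existing HDFS directory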

24 – Internal tables in Hive can only be stored as TEXTFILE.


(2021)
FALSE
(Week 6, slide 50)
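As an illustration of why this is false, a hedged sketch of an internal (managed) table stored in a non-text format (table and column names are made up):

CREATE TABLE sales_orc
(id STRING,
amount DOUBLE)
STORED AS ORC;  -- managed tables can also be stored as SEQUENCEFILE, PARQUET, AVRO, etc.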

25 – We can say that internal tables in Hive adhere to a principle called “data on schema”.
(2021)
TRUE
(Week 5, slide 31)

QUIZ 2 (Classes 8-12)

1 – Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server.
(2020, 2021)
FALSE

2 – We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and 1000
records/sec, and emit up to 2MB/sec.
(2020, 2021)
FALSE

3 – In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data.
(2020, 2021)
TRUE

4 – In step 2, also, Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python
script against the streaming input data.
(2021)
FALSE

5 – In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream.
(2021)
TRUE

6 – All data in DynamoDB is replicated in two availability zones


(2020, 2021)
FALSE

7 – An end user provisions a Lambda service with steps similar to those used to provision an EC2 instance.
(2020, 2021)
FALSE

8 – The concept of partitioning can be used to reduce the cost of querying the data
(2020, 2021)
TRUE

9 – In a publish/subscribe model, although data producers are decoupled from data consumers,
publishers know who the consumers are.
(2020, 2021)
FALSE

10 – In Hive most of the optimizations are not based on the cost of query execution.
(2020, 2021)
TRUE
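As context (not from the slides): recent Hive versions do include a cost-based optimizer, but it has to be switched on and fed table statistics. A rough sketch of the usual settings, with an illustrative table name:

SET hive.cbo.enable = true;                      -- enable the cost-based optimizer
SET hive.compute.query.using.stats = true;       -- answer simple queries from stored statistics
SET hive.stats.fetch.column.stats = true;        -- use column-level statistics during planning
ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS;  -- collect the statistics the CBO needs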

11 – The number of shards cannot be modified after the Kinesis stream is created.
(2020, 2021)
FALSE

12 – The basic idea of vectorized query execution is to process a batch of columns as an array of line
vectors.
(2020, 2021)
FALSE
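For reference, vectorized query execution (which processes batches of rows as arrays of column vectors) is enabled per session; a minimal sketch, assuming an ORC-backed table named customer:

SET hive.vectorized.execution.enabled = true;         -- process rows in batches of column vectors
SET hive.vectorized.execution.reduce.enabled = true;  -- also vectorize the reduce side where supported
SELECT gender, COUNT(*) FROM customer GROUP BY gender;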

13 – In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle.
(2020, 2021)
TRUE
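A minimal bucketing sketch (table name, clustering column and bucket count are illustrative): each row is assigned to a bucket by hashing the clustering column.

SET hive.enforce.bucketing = true;  -- needed on older Hive versions before inserting into bucketed tables
CREATE TABLE customer_bucketed
(id STRING,
name STRING)
CLUSTERED BY (id) INTO 32 BUCKETS   -- hash(id) mod 32 decides the bucket
STORED AS ORC;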

14 – The selection of the partition key is always an important factor for performance. It should always
be a low-cardinality attribute, to avoid the overhead of too many sub-directories.
(2020, 2021)
TRUE

15 – To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING, state STRING)
PARTITIONED BY (country STRING)
(2020, 2021)
FALSE
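The statement is false because the partition column is repeated in the regular column list. A corrected sketch (Hive derives country as a virtual column from the partition directory name):

CREATE TABLE customer
(id STRING,
name STRING,
gender STRING,
state STRING)
PARTITIONED BY (country STRING);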

16 – We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3.
(2020, 2021)
TRUE

17 – Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service.
(2021)
TRUE

18 – Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose.


(2021)
TRUE

19 – AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked.
(2020, 2021)
TRUE

20 – DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes.
(2020, 2021)
FALSE

21 – The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually
provision servers using the "AWS Kinesis Data Firehose Scaling API".
(2021)
FALSE

22 – Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC.
(2021)
TRUE

23 – Since it uses SQL as its query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE

24 – AWS Glue ETL jobs are Spark-based.


(2021)
TRUE

25 – Kinesis Data Streams is a good choice for long-term data storage and analytics.
(2021)
FALSE

26 – Amazon OpenSearch/Elasticsearch stores CSV documents.


(2021)
FALSE

-- // --

27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE

28 – Stream processing applications process data continuously in real time, usually after a store operation
(in HD or SSD).
(2020)
FALSE

29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE

30 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE

BDF 2020 Review Questions 1

1. What is HDFS, and what are HDFS design goals?


HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
• Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
• Fault tolerance—Keeps multiple copies of data to recover from failure.
• Capability to run on commodity hardware—Designed to run on commodity hardware.
• Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
• Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
• Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and
a task tracker (processing component). Processing is done where data exists, to avoid data
movement across nodes of the cluster.
• High throughput—Designed for parallel data storage and retrieval.
• HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node keeps the entire metadata, called the namespace (a hierarchy of files and directories),
in physical memory for quicker response to client requests; its persistent image on disk is called the
fsimage. Any change to the metadata is recorded in a transactional log called the edit log. For
persistence, both of these files are written to host OS drives. The name node responds to multiple
client requests simultaneously (it is a multithreaded system) and tells the client which data nodes to
connect to in order to write or read the data. While writing, a file is broken down into multiple chunks,
called blocks, of 128MB by default (64MB in some older configurations). Each block is stored as a
separate file on data nodes. Based on the replication factor of a file, multiple copies or replicas of each
block are stored for fault tolerance.

3. What is the default data block placement policy?


By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:
• The first replica is written to the data node creating the file.
• The second replica is written to another data node within the same rack.
• The third replica is written to a data node in a different rack.

4. What is the replication pipeline? What is its significance?


Data nodes maintain a pipeline for data transfer: data node 1 does not need to wait
for a complete block to arrive before it can start transferring it to data node 2 in the flow. In fact, the
data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When
data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and
immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the
first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts
transferring it to data node 3, and so on. This way, all the data nodes in the flow (except the last one)
receive data from the previous data node and, at the same time, transfer it to the next data node in
the flow, to improve the write performance by avoiding a wait at each stage.

5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.

6. How can you enable rack awareness in Hadoop?


You can make the Hadoop cluster rack aware by using a script that enables the master node to map
the network topology of the cluster using the properties topology.script.file.name or
net.topology.script.file.name, available in the core-site.xml configuration file. First, you must change
this property to specify the name of the script file. Then you must write the script and place it in the
file at the specified location. The script should accept a list of IP addresses and return the
corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument
and return /rack1 as the output.

7. What is the data block replication factor?


An application or a job can specify the number of replicas of a file that HDFS should maintain. The
number of copies or replicas of each block of a file is called the replication factor of that file. The
replication factor is configurable and can be changed at the cluster level or for each file when it is
created, or even later for a stored file.

8. What is block size, and how is it controlled?


When a client writes a file to a data node, it splits the file into multiple chunks, called blocks. This data
partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize
configuration property in the hdfs-site.xml file and applies to files that are created without a block
size specification. When creating a file, the client can also specify a block size to override
the cluster-wide configuration.

9. What is a checkpoint, and who performs this operation?


The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint by
downloading fsimage and the edit log file from the name node and then uploading the new fsimage
back to the name node. The name node performs a checkpoint upon restart (not periodically,
though—only on name node start-up).

10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that these blocks were not
corrupted (corruption might happen because of faults in a storage device, network transmission faults,
or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another
data node that has a replica of the corrupted block, to get another copy of the block. As for accidental
deletion: if the HDFS trash feature is enabled, a deleted file is first moved to the user's .Trash directory
and can be restored from there until the configured trash interval expires; otherwise the deletion is
permanent.

12. How can you access and manage files in HDFS?


You can access the files and data stored in HDFS in many different ways. For example, you can use
HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs
package, write a MapReduce job, or write Hive or Pig queries. In addition, you can even use a web
browser to browse the files from an HDFS cluster.

13. What two issues does HDFS encounter in Hadoop 1.0?


First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name
node, but it’s not an active-passive configuration. The secondary name node thus cannot be used for
failover in case the name node fails. Second, as the number of data nodes grows beyond 4,000, the
performance of the name node degrades, setting a kind of upper limit to the number of nodes in a
cluster.

14. What is a daemon?


The word daemon comes from the UNIX world. It refers to a process or service that runs in the
background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we
have daemons such as name node, data node, and secondary name node.

15. What is YARN and what does it do?


In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of
HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major
functions: resource management and application life-cycle management. The JobTracker previously
handled those functions. Now MapReduce is just a batch-mode computational layer sitting on top of
YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource
management and application life-cycle management functionalities. This makes Hadoop a general-
purpose data processing platform that is not constrained only to MapReduce.

16. What is uber-tasking optimization?


The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same
container or in the same JVM in which that application-specific Application Master is running. The
basic idea behind uber-tasking optimization is that the distributed task allocation and management
overhead exceeds the benefits of executing tasks in parallel for smaller jobs, hence it is optimal to
execute smaller jobs in the same JVM or container as the Application Master.

17. What are the different components of YARN?


Aligning to the original master-slave architecture principle, even YARN has a global or master Resource
Manager for managing cluster resources and a per-node (slave) Node Manager that takes direction
from the Resource Manager and manages resources on the node. These two form the computation
fabric for YARN. Apart from these, there is a per-application Application Master, which is merely an
application-specific library tasked with negotiating resources from the global Resource Manager and
coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers
also are present—these are a group of computing resources, such as memory, CPU, disk, and network.

What were the key design strategies for HDFS to become fault tolerant?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
Fault tolerance—Keeps multiple copies of data to recover from failure.
Capability to run on commodity hardware—Designed to run on commodity hardware.
Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a
task tracker (processing component). Processing is done where data exists, to avoid data movement
across nodes of the cluster.
High throughput—Designed for parallel data storage and retrieval.
HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

To what does the term data locality refer?


Data locality is the concept of processing data locally wherever possible. This concept is central to
Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across the
network by bringing the processing to the data instead of the reverse.

2nd Exam 2020

You may assume that you are a Data Scientist in a consulting assignment with the CDO (Chief Data
Officer) of Exportera (a large company). The CDO asks you lots of questions and very much appreciates
precise answers to his questions.

1. You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects
of flooding from the River Guardião. The municipality has already distributed IoT devices across
the river that are able to measure the flow of the water. In the kick-off meeting, you say: "I have
an idea for a possible solution. We may need to use a number of AWS services.” And then you
explain your solution. What do you say?
In order to collect and process these real-time streams of data records regarding the flow of the water,
the data from the IoT devices across the river is sent to AWS Kinesis Data Streams. Then this data is
processed using AWS Lambda and stored in AWS S3 buckets. The data collected into
Kinesis Data Streams can be used for simple data analysis and reporting in real time, using Amazon
Elasticsearch/OpenSearch. Finally, we can send the processed records to dashboards to visualize the
variability of the flow of the water, using Amazon QuickSight.

You do not need to worry about computer resources if you use AWS Kinesis. This system will enable
you to acquire data through a gateway in the cloud that will receive your data; then you can process
it using AWS Lambda and later store the data in a storage system, e.g. AWS S3.

2. You go to a conference and hear a speaker saying the following: "Real-time analytics is a key
factor for digital transformation. Companies everywhere are using their datalakes based on
technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You rise your hand for a comment. What are going to say? Please, justify
carefully.
Firstly, I would like to clarify that Hadoop is a stack of different components, one of which is the HDFS,
which is its distributed file system. Moreover, although HDFS has been thriving recently, there are plenty of
other choices such as Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, or
Apache Kudu for IoT and analytic data.

Hadoop is a framework that incorporates several technologies including batch, streaming (which you
mentioned) and others. HDFS is Hadoop's file system.

3. In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?
To begin with, Hive is Hadoop’s SQL-like analytics query engine, which enables the writing of data
processing logic or queries in a declarative language similar to SQL, called HiveQL. The brain of Hive is
the Hive Driver, which maintains the lifecycle of a HiveQL statement: it comprises a query compiler,
optimizer and executor, executing the task plan generated by the compiler in proper dependency
order while interacting with the underlying Hadoop instance.

Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer
over Hadoop MapReduce with a SQL-like query engine. It enables several types of connections to
other databases using the Hive Driver; the Hive Driver can, for instance, connect through ODBC to
relational databases.

4. On implementing Hadoop, the CDO is worried about understanding how a name node ensures that
all the data nodes are functioning properly. What can you tell him to reassure him?
The name node, or master node, contains the metadata related to the Hadoop Distributed File System,
such as the file name, file path or configurations. It keeps track of all the data nodes, or slave nodes,
using the heartbeat methodology. Each data node regularly sends heartbeat signals to the name node.
After receiving these signals, the name node knows that the slave nodes are alive and functioning
properly. In the event that the name node does not receive a heartbeat signal from a data node,
that particular data node is considered inactive, and the name node starts the process of replicating
its blocks on some other data node.

Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

5. The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is
concerned about losing data if the data delivery to the destination is falling behind data writing to
the delivery stream. Can you help him out understanding how this process works, in order to
alleviate his concerns?
In Amazon Kinesis Firehose, the scaling is handled automatically, up to gigabytes per second, in order
to meet the demand. Amazon Kinesis Firehose will write each log record to Amazon Simple Storage
Service (Amazon S3) for durable storage of the raw log data, and the Amazon Kinesis Analytics
application will continuously run a Kinesis Streaming SQL statement against the streaming input data.
In addition, if data delivery to the destination falls behind, Kinesis Data Firehose keeps the data
buffered and retries delivery (for Amazon S3, retries continue for up to 24 hours). This way, you should
not be concerned.

6. The CDO of Exportera asks you to prepare the talking points for a presentation he must make
to the board regarding a budget increase for his team. It is important that the board members
understand the impacts of data variety versus data volume. What can you tell
them?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s). Volume is the most commonly
recognized characteristic of Big Data, representing the large amount of data available for analysis to
extract valuable information. The time and expense required to process large datasets drives the need
for processing and storage parallelism. On the other hand, data variety represents the need to analyze
data from multiple sources, domains and data types to bring added value to the business, which drives
the need for distributed processing. This way, a budget increase is needed to meet the added
requirements that big data processing brings to the table.

While volume can be handled by scaling the processing and storage resources, the issues related to
variety require customization of processing, namely interfaces, and programming of handlers. For
instance, handling objects coming from SQL sources or OSI-PI sources requires different types of
programming skills. Not having the right skills to handle this can ruin the ambition of storing such
diverse, and therefore rich, data.
To acquire these skills we need a budget increase for the department.

7. The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables can
only be made on the basis of one parameter, making it not as useful as it could be." What can
you answer him?
You are wrong. Hive partitioning can be done on more than one column of the original table.
However, the partition key should always be a low-cardinality attribute, to avoid the overhead of too
many partitions. It is very useful, as a query with partition filtering will only load data from the specified
partitions, so it can execute much faster than a normal query that filters by a non-partitioning field.
Hive partitioning can be done with more than one column from the original table. The partitioning
works by dividing the table into sub-directories (folders), one for each combination of partition values.
The partitioning can even be further optimized with bucketing, as sketched below.
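A hedged sketch of multi-column partitioning (table, columns and values are illustrative); a partition-filtered query reads only the matching sub-directories:

CREATE TABLE customer
(id STRING,
name STRING,
gender STRING)
PARTITIONED BY (country STRING, state STRING);

SELECT COUNT(*)
FROM customer
WHERE country = 'PT' AND state = 'Lisboa';  -- scans only .../country=PT/state=Lisboa/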

8. The CDO of Exportera has been exploring stream processing and is worried about the latency of
a solution based on a platform such as Kinesis Data Streams. To address the question, what can
you tell him?
Kinesis Data Streams can be used efficiently to solve a variety of streaming data problems. It ensures
durability and elasticity, which enables you to scale the stream up or down, so that you never lose
data records before they expire. The delay between the time a record is put into the stream and the
time it can be retrieved is typically less than 1 second, so you don’t have to worry!

The Kinesis Data Streams platform is built on high-throughput, large-bandwidth
components.
Kinesis ensures durability and elasticity of data acquired through streaming. The elasticity of Kinesis
Data Streams enables you to scale the stream up or down, so that you never lose data records before
they expire (1 day by default).

9. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully
understand it." Then you explain it very clearly to him with a small example: counting the words of
the following text:
"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"
Justify carefully.
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
One of the applications of the MapReduce algorithm is to count words. If we take into consideration
the example mentioned:
1 – In the first step, the map() phase, all the words in the text are split into a list ordered from the first
word of the text until the last one:
“Mary”,
“had”,
“a”,
“little”,
“lamb”,
“Little”,
“lamb”,
(…)
“white”,
“as”,
“snow”

Still in the map() phase, each word is converted into a key-value pair, where the key is the word and the
value is 1:
(“Mary”: 1),
(“had”: 1),
(“a”: 1),
(…)
(“snow”:1).
2 – In the shuffle phase, all the key-value pairs are sorted alphabetically by the key, so that similar keys
can get together
3 – Finally, in the reduce phase, there is an aggregation by key (word), performing a count of the
respective values. This way, the final result is the key-value pairs of the distinct words and their count
of occurrences throughout the text

MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its
implementation is usually divided into three steps: mapping the data variables considered, shuffling
those variables following a pattern/directive, and then aggregating/reducing them.
Let us take the text above as reference for the following example: We want to count the number of
words in the text.
In the first step we would map all the words, let say, continuously, getting a list like:
"Mary
had
a
little
...
as
snow"
If one wants to count the number of words, a number should be associated with each word for later
counting. So we can associate "1" with each word, creating, still in the Map step, a kind of key-value set
associating each word with "1", e.g.:
<"Mary",1>, where the word is the Key.
<Mary,1>
<little,1>
In the next step one could shuffle this key-value set so that all similar keys come together. But
since one just wants to count the number of words, we just need to sum the values and this operation
is the Reduce operation in this case.
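As a side note, assuming a hypothetical table lines with a single STRING column line holding the text, the same word count can be written in HiveQL; Hive compiles the GROUP BY into exactly this map, shuffle and reduce pattern:

SELECT word, COUNT(*) AS occurrences
FROM lines
LATERAL VIEW explode(split(line, '\\s+')) w AS word  -- map: emit one row per token
GROUP BY word;                                       -- shuffle by word, then reduce with COUNT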

10. In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in
motion is no different than processing data at rest." If you had to join the conversation, what
would you say to them?
I don’t agree with you. Processing data in motion is very different from processing data at rest. The
operational difference between streaming data and data at rest lies in when the data is stored and
analyzed. When dealing with data at rest, each observation is recorded and stored before performing
analysis on the data. On the other hand, in streaming data, each event is processed as it is read, and
subsequent results are stored in a database. However, there are some similarities: data at rest and
streaming data can come from the same source, they can be processed with the same analytics and
can be stored with the same storage service.
Processing data in motion (streaming) is much different from processing data at rest. The main
differences are that:
1. Analytics over streaming data needs to be done on the data as it arrives, not on data at rest.
2. Data in motion needs buffering to process the incoming volumes, while comparable processing of
data at rest can be done through querying.
3. The concept of incoming data in streaming does not apply in data at rest. In the former case one
needs to adapt the acquisition to the amount of incoming data.
Of course, there are also some similarities, namely in data quality control, such as the detection of
common types of errors like invalid or missing values. Anyway, the processing, which is what we
are talking about, is quite different in the two cases.

--//--

1. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand
it. Can you explain very clearly its processes, and the role of the NameNode and the DataNodes in a
MapReduce operation?" Can you help him out?
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
During this process, the NameNode allocates which DataNodes, meaning individual servers, will
perform the map operation and which will perform the reduce operation. This way, mapper
DataNodes perform the map phase and produce key-value pairs and the reducer DataNodes apply the
reduce function to the key-value pairs and generate the final output.

2. The CDO of Exportera asks you to prepare the talking points for a presentation he must make to
the board regarding a budget increase for his team. It is important that the board members
understand what data variety versus data variability is, as well as its impacts on the analytics
platforms (and, of course, business benefits of addressing them.) What do you write to make it
crystal clear?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s).
While the variety characteristic represents the need to analyze data from multiple sources and data
types, the variability refers to the changes in dataset characteristics, whether in the data flow rate,
format/structure and/or volume. When it comes to variety of data, distributed processing should be
applied on different types of data, followed by individual pre-analytics. On the other hand, variability
implies the need to scale-up or scale-down to efficiently handle the additional processing load, which
justifies the use of cloud computing. Finally, regarding the benefits to the business, both have their
advantages: variety of data brings an additional richness to the business, as more details from multiple
domains are available; variability keeps the systems efficient, as we don’t always have to design and
provision resources for the expected peak capacity.
