Exame 1 PDF
True/False (Set 1):
2 – Volume drives the need for processing and storage parallelism, and its management during processing of large datasets. TRUE
3 – The servers where the data resides can only perform the map operation. TRUE
4 – The process of generating a new fsimage from a merge operation is called the Checkpoint process. TRUE
5 – Hive follows a "schema on read" approach, unlike RDBMS, which enforces "schema on write." TRUE
6 – Big Data is a characterization only for volumes of data above one petabyte. FALSE
7 – The term MapReduce refers exclusively to a programming model. FALSE
8 – Hadoop is considered a schema-on-write system regarding write operations. FALSE
9 – One of the key design principles of HDFS is that it should favor low-latency random access over high sustained bandwidth. FALSE
10 – One of the key design principles of HDFS is that it should be able to use commodity hardware. TRUE
11 – The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything systems, connected by a good network, working together to solve the same problem. FALSE
12 – Apache Tez is an engine built on top of Apache Hadoop YARN. TRUE
13 – The name node is not involved in the actual data transfer. FALSE
14 – Business benefits are frequently higher when addressing the variety of data than when addressing volume. TRUE
15 – In object storage, data is stored close to processing, just like in HDFS, but with rich metadata. FALSE
16 – The map function in MapReduce processes key/value pairs to generate a set of intermediate key/value pairs. TRUE
17 – Essentially, MapReduce divides a computation into two sequential stages: map and reduce. FALSE
18 – When a HiveQL query is executed, Hive translates the query into MapReduce, saving the time and effort of writing actual MapReduce jobs. TRUE
19 – When we drop an external table in Hive, both the data and the schema will be dropped. FALSE
20 – Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed as structured tables. TRUE
21 – YARN manages resources and monitors workloads, in a secure multitenant environment, while ensuring high availability across multiple Hadoop clusters. TRUE
22 – TEZ provides the capability to build an application framework that allows for a complex DAG of tasks for high-performance data processing only in batch mode. FALSE
23 – The operation CREATE TABLE external (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '; creates an external table in Hive. FALSE (should be CREATE EXTERNAL TABLE)
24 – Internal tables in Hive can only be stored as TEXTFILE. FALSE
25 – We can say that internal tables in Hive adhere to a principle called "data on schema". TRUE
True/False (Set 2):
1 – (Fig. 1) For a log analytics solution: the web server in this example is an Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull log records from the Apache Web Server. (F)
2 – We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1 MB/sec and 1000 records/sec, and emit up to 2 MB/sec. (F)
3 – In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service (Amazon S3) for durable storage of the raw log data. (T)
4 – In step 2, also, the Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python script against the streaming input data. (F)
5 – In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute and output that data to a second Firehose delivery stream. (T)
6 – All data in DynamoDB is replicated in two availability zones. (F)
7 – An end user provisions a Lambda service with similar steps as it provisions an EC2 instance. (F)
8 – The concept of partitioning can be used to reduce the cost of querying the data. (T)
9 – In a publish/subscribe model, although data producers are decoupled from data consumers, publishers know who the consumers are. (F)
10 – In Hive most of the optimizations are not based on the cost of query execution. (T)
11 – The number of shards cannot be modified after the Kinesis stream is created. (F)
13 – In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle. (T)
14 – The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute, to avoid the overhead of too many sub-directories. (T)
15 – To partition the table customer by country we use the following HiveQL statement: CREATE TABLE customer(id STRING, name STRING, gender STRING, state STRING, country STRING) PARTITIONED BY (country STRING) (F)
16 – We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60 seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3. (T)
17 – (Fig. 2) Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)
18 – Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)
19 – AWS Lambda polls the stream periodically (once per second) for new records. When it detects new records, it invokes the Lambda function by passing the new records as a parameter. If no new records are detected, the Lambda function is not invoked. (T)
20 – DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes. (F)
21 – The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually provision servers using the "AWS Kinesis Data Firehose Scaling API". (F)
22 – Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)
23 – Since it uses SQL as query language, Amazon Athena is a relational/transactional database. (F)
24 – AWS Glue ETL jobs are Spark-based. (T)
25 – Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)
26 – Amazon OpenSearch/Elasticsearch stores CSV documents. (F)
True/False (Set 3):
1. Big Data is a characterization only for volumes of data above one petabyte. (F)
2. The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything systems, connected by a good network, working together to solve the same problem. (F)
3. Volume drives the need for processing and storage parallelism, and its management during processing of large datasets. (T)
4. Business benefits are frequently higher when addressing the variety of data than when addressing volume. (T)
5. Hadoop is considered a schema-on-write system regarding write operations. (F)
6. One of the key design principles of HDFS is that it should be able to use commodity hardware. (T)
7. One of the key design principles of HDFS is that it should favor low-latency random access over high sustained bandwidth. (F)
8. The process of generating a new fsimage from a merge operation is called the Checkpoint process. (T)
9. The map function in MapReduce processes key/value pairs to generate a set of intermediate key/value pairs. (T)
10. YARN manages resources and monitors workloads, in a secure multitenant environment, while ensuring high availability across multiple Hadoop clusters. (T)
11. When a HiveQL query is executed, Hive translates the query into MapReduce, saving the time and effort of writing actual MapReduce jobs. (T)
12. Hive follows a "schema on read" approach, unlike RDBMS, which enforces "schema on write." (T)
13. Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed as structured tables. (T)
14. Apache Tez is an engine built on top of Apache Hadoop YARN. (T)
15. Tez provides the capability to build an application framework that allows for a complex DAG of tasks for high-performance data processing only in batch mode. (F)
16. Despite its capabilities, TEZ still needs the storage of intermediate output to HDFS. (F)
17. The name node is not involved in the actual data transfer. (T)
18. The operation: CREATE TABLE external (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '; creates an external table in Hive. (F, should be CREATE EXTERNAL)
19. The term MapReduce refers exclusively to a programming model. (F)
20. Essentially, MapReduce divides a computation into two sequential stages: map and reduce. (F)
21. The servers where the data resides can only perform the map operation. (T)
22. In object storage, data is stored close to processing, just like in HDFS, but with rich metadata. (F)
23. When we drop an external table in Hive, both the data and the schema will be dropped. (F)
24. Internal tables in Hive can only be stored as TEXTFILE. (F)
25. We can say that internal tables in Hive adhere to a principle called "data on schema". (T)
True/False (Set 4):
1. By default, the Hive query execution engine processes one column of a table at a time. (F)
2. Clickstream data from web applications can be collected directly in a data lake and a portion of that data can be moved out to a data warehouse for daily reporting. We think of this concept as inside-out data movement. (T)
3. (Fig. 1) Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull log records from the Apache Web Server. (F)
4. Data storage formats like ORC and Parquet rely on metadata which describes a set of values in a section of the data, called a stripe. If, for example, the user is interested in values <10 and the metadata says all the data in this stripe is between 20 and 30, the stripe is not relevant to the query at all, and the query can skip over it. (T) (See the sketch after this list.)
5. In Athena, if your files are too large or not splittable, parallelism can be limited due to query processing halting until one reader has finished reading the complete file. (T)
6. In Athena, to change the name of Table1 to Table2, we would use the following instruction, as we would in Hive: ALTER TABLE Table1 RENAME TO Table2; (F)
7. (Fig. 1) In step 2, also, the Amazon Kinesis Analytics application will continuously run a Presto script against the streaming input data. (F)
8. (Fig. 1) In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service (Amazon S3) for durable storage of the raw log data. (T)
9. (Fig. 1) In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute and output that data to a second Firehose delivery stream. (T)
10. Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)
11. Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)
12. Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)
13. One of the best use cases for AWS Glue, since it is a fully managed service, is if you require extensive configuration changes from it. (F)
14. One of the characteristics of a serverless architecture, such as Athena's, is that, as the name implies, there are no servers provisioned to begin with, and it is up to the user to completely provision all servers and services. (F)
15. Presto is an open-source distributed SQL query engine optimized for batch ETL type jobs. (F)
16. (Fig. 2) Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)
17. The basic idea of vectorized query execution is to process a batch of columns as an array of line vectors. (F)
18. (Fig. 1) We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1 MB/sec and 1000 records/sec, and emit up to 2 MB/sec. (F)
19. You can use precisely the same set of tools to collect, prepare, and process real-time streaming data as the tools that you have traditionally used for batch analytics. That is the basic premise of the Lambda architecture. (F)
20. You may copy the product catalog data stored in your database to your search service to make it easier to look through your product catalog and offload the search queries from the database. We think of this concept as outside-in data movement. (F)
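The stripe-skipping idea in question 4 above can be sketched in plain Python; the stripe boundaries, row values and the <10 predicate below are invented purely for illustration:

import operator

# Hypothetical stripes: each carries min/max metadata plus its row values.
stripes = [
    {"id": 0, "min": 20, "max": 30, "rows": [20, 25, 30]},
    {"id": 1, "min": 3,  "max": 12, "rows": [3, 7, 12]},
]

def scan_less_than(stripes, threshold):
    hits = []
    for stripe in stripes:
        # If every value in the stripe is >= threshold, the metadata proves
        # the stripe cannot match, so its rows are never read.
        if stripe["min"] >= threshold:
            continue
        hits.extend(v for v in stripe["rows"] if operator.lt(v, threshold))
    return hits

print(scan_less_than(stripes, 10))   # stripe 0 is skipped; prints [3, 7]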
HDFS and Hadoop Q&A:
1. What is HDFS, and what are HDFS design goals?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component of Hadoop (with built-in redundancy at the software level). When HDFS was implemented originally, certain assumptions and design goals were discussed: • Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes. • Fault tolerance—Keeps multiple copies of data to recover from failure. • Capability to run on commodity hardware—Designed to run on commodity hardware. • Write once and read many times—Based on a concept of write once, read multiple times, with an assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in the fastest possible way. • Capability to handle large data sets and streaming data access—Targeted to small numbers of very large files for the storage of large data sets. • Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a JobTracker (processing component). Processing is done where data exists, to avoid data movement across nodes of the cluster. • High throughput—Designed for parallel data storage and retrieval. • HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or application can create directories and recursively store files inside them.
2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node (master of HDFS) contains the metadata related to the file system (information about each file, as well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain application data in a partitioned manner, for parallel writes and reads. • The name node holds the entire metadata, called the namespace (a hierarchy of files and directories), in physical memory, for quicker response to client requests. This is called the fsimage. Any change to the metadata is recorded in a transactional file called the edit log. For persistence, both of these files are written to host OS drives. The name node simultaneously responds to multiple client requests (in a multithreaded system) and provides information to the client to connect to data nodes to write or read the data. While writing, a file is broken down into multiple chunks, called blocks, of 128 MB by default (64 MB in older versions). Each block is stored as a separate file on data nodes. Based on the replication factor of a file, multiple copies or replicas of each block are stored for fault tolerance.
3. What is the default data block placement policy?
By default, three copies, or replicas, of each block are placed, per the default block placement policy. The objective is a properly load-balanced, fast-access, fault-tolerant file system: • The first replica is written to the data node creating the file. • The second replica is written to another data node within the same rack. • The third replica is written to a data node in a different rack.
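A small Python sketch of this default placement policy; the rack layout and node names are hypothetical:

import random

# Hypothetical cluster topology: rack id -> data nodes on that rack.
RACKS = {
    "/rack1": ["dn1", "dn2", "dn3"],
    "/rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node):
    # 1st replica: the data node creating the file.
    # 2nd replica: another data node on the same rack.
    # 3rd replica: a data node on a different rack.
    writer_rack = next(rack for rack, nodes in RACKS.items() if writer_node in nodes)
    second = random.choice([n for n in RACKS[writer_rack] if n != writer_node])
    other_rack = random.choice([r for r in RACKS if r != writer_rack])
    third = random.choice(RACKS[other_rack])
    return [writer_node, second, third]

print(place_replicas("dn2"))   # e.g. ['dn2', 'dn3', 'dn5']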
4. What is the replication pipeline? What is its significance?
Data nodes maintain a pipeline for data transfer. Having said that, data node 1 does not need to wait for a complete block to arrive before it can start transferring it to data node 2 in the flow. In fact, the data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3, and so on. This way, all the data nodes in the flow (except the last one) receive data from the previous data node and, at the same time, transfer it to the next data node in the flow, improving the write performance by avoiding a wait at each stage.
5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to improve the performance of the block write operation and to minimize network congestion. The HDFS client transparently caches the file into a temporary local file; when it accumulates enough data for a block size, the client reaches out to the name node. At this time, the name node responds by inserting the filename into the file system hierarchy and allocating data nodes for its storage. The client then flushes the block of data from the local, temporary file to the closest data node, and that data node transfers the block to other data nodes (as instructed by the name node, based on the replication factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk of network congestion.
6. How can you enable rack awareness in Hadoop?
You can make the Hadoop cluster rack aware by using a script that enables the master node to map the network topology of the cluster using the properties topology.script.file.name or net.topology.script.file.name, available in the core-site.xml configuration file. First, you must change this property to specify the name of the script file. Then you must write the script and place it in the file at the specified location. The script should accept a list of IP addresses and return the corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument and return /rack1 as the output.
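A minimal Python version of such a topology script; the host-to-rack mapping below is invented, but the contract is the one described above (arguments in, one rack identifier per line out):

#!/usr/bin/env python3
# Hypothetical topology script pointed to by net.topology.script.file.name.
# Hadoop invokes it with one or more IP addresses/host names as arguments
# and expects one rack identifier per argument on standard output.
import sys

RACK_MAP = {                      # invented mapping, for illustration only
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
    "host.foo.bar": "/rack1",
}

for host in sys.argv[1:]:
    print(RACK_MAP.get(host, "/default-rack"))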
7. What is the data block replication factor?
An application or a job can specify the number of replicas of a file that HDFS should maintain. The number of copies or replicas of each block of a file is called the replication factor of that file.
The replication factor is configurable and can be changed at the cluster level or for each file when it is created, or even later for a stored file.
8. What is block size, and how is it controlled?
When a client writes a file to a data node, it splits the file into multiple chunks, called blocks. This data partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize configuration property in the hdfs-site.xml file and applies to files that are created without a block size specification. When creating a file, the client can also specify a block size to override the cluster-wide configuration.
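The arithmetic implied by block size and replication can be sketched as follows; the 1 GB file is just an example:

import math

def block_usage(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    # Number of HDFS blocks needed for the file, and the total number of
    # block replicas the cluster ends up storing for it.
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    return blocks, blocks * replication

# Example: a 1 GB file with the default 128 MB block size and 3 replicas.
print(block_usage(1 * 1024**3))   # (8, 24)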
9. What is a checkpoint, and who performs this operation?
The process of generating a new fsimage by merging transactional records from the edit log into the current fsimage is called a checkpoint. The secondary name node periodically performs a checkpoint by downloading the fsimage and the edit log file from the name node and then uploading the new fsimage back to the name node. The name node performs a checkpoint upon restart (not periodically, though—only on name node start-up).
10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A block-report from a data node contains a list of all blocks on that specific data node.
11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while reading the blocks, the client references these checksums to verify that the blocks were not corrupted (corruption might happen because of faults in a storage device, network transmission faults, or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another data node that has a replica of the corrupted block, to get another copy of the block.
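The checksum mechanism can be illustrated with a short, self-contained Python sketch (real HDFS keeps per-chunk checksums in hidden metadata files; this only shows the principle):

import hashlib

BLOCK_SIZE = 4  # unrealistically small block size, just for the illustration

def write_blocks(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    checksums = [hashlib.md5(b).hexdigest() for b in blocks]  # kept separately
    return blocks, checksums

def read_and_verify(blocks, checksums):
    for i, (block, expected) in enumerate(zip(blocks, checksums)):
        if hashlib.md5(block).hexdigest() != expected:
            raise IOError("block %d is corrupted, read a replica from another data node" % i)
    return b"".join(blocks)

blocks, sums = write_blocks(b"hello hdfs world")
blocks[1] = b"XXXX"                      # simulate corruption of one block
try:
    read_and_verify(blocks, sums)
except IOError as err:
    print(err)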
12. How can you access and manage files in HDFS?
You can access the files and data stored in HDFS in many different ways. For example, you can use HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs package, write a MapReduce job, or write Hive or Pig queries. In addition, you can even use a web browser to browse the files from an HDFS cluster.
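Besides the Java API, the FS shell and the web UI mentioned above, a Python client that talks WebHDFS can be used as well. A minimal sketch, assuming the third-party hdfs package (HdfsCLI) and a hypothetical name node address:

from hdfs import InsecureClient   # third-party "hdfs" (HdfsCLI) package, assumed installed

# Hypothetical WebHDFS endpoint of the name node.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

print(client.list("/"))                            # list the root directory
client.upload("/data/logs/app.log", "app.log")     # copy a local file into HDFS
with client.read("/data/logs/app.log") as reader:  # read it back
    print(reader.read()[:100])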
13. What two issues does HDFS encounter in Hadoop 1.0?
First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name node, but it is not an active-passive configuration. The secondary name node thus cannot be used for failover in case the name node fails. Second, as the number of data nodes grows beyond 4,000, the performance of the name node degrades, setting a kind of upper limit to the number of nodes in a cluster.
14. What is a daemon?
The word daemon comes from the UNIX world. It refers to a process or service that runs in the background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we have daemons such as the name node, data node, and secondary name node.
15. What is YARN and what does it do?
In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major functions: resource management and application life-cycle management. The JobTracker previously handled those functions. Now MapReduce is just a batch-mode computational layer sitting on top of YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource management and application life-cycle management functionalities. This makes Hadoop a general-purpose data processing platform that is not constrained only to MapReduce.
16. What is uber-tasking optimization?
The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same container or in the same JVM in which that application-specific Application Master is running. The basic idea behind uber-tasking optimization is that, for smaller jobs, the distributed task allocation and management overhead exceeds the benefits of executing the tasks in parallel, hence it is optimal to execute smaller jobs in the same JVM or container as the Application Master.
17. What are the different components of YARN?
Aligning to the original master-slave architecture principle, YARN also has a global or master Resource Manager for managing cluster resources and a per-node (slave) Node Manager that takes direction from the Resource Manager and manages resources on the node. These two form the computation fabric for YARN. Apart from that, there is a per-application Application Master, which is merely an application-specific library tasked with negotiating resources from the global Resource Manager and coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers are also present—these are a group of computing resources, such as memory, CPU, disk, and network.
18. What were the key design strategies for HDFS to become fault tolerant?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component of Hadoop (with built-in redundancy at the software level). When HDFS was implemented originally, certain assumptions and design goals were discussed: Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes. Fault tolerance—Keeps multiple copies of data to recover from failure. Capability to run on commodity hardware—Designed to run on commodity hardware. Write once and read many times—Based on a concept of write once, read multiple times, with an assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in the fastest possible way. Capability to handle large data sets and streaming data access—Targeted to small numbers of very large files for the storage of large data sets. Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a JobTracker (processing component). Processing is done where data exists, to avoid data movement across nodes of the cluster. High throughput—Designed for parallel data storage and retrieval. HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or application can create directories and recursively store files inside them.
19. To what does the term data locality refer?
Data locality is the concept of processing data locally wherever possible. This concept is central to Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across the network by bringing the processing to the data instead of the reverse.
Scenario questions:
1. You are hired to help prevent the effects of flooding from the River Guardião. The municipality has already distributed IoT devices across the river that are able to measure the flow of the water. In the kick-off meeting, you say: "I have an idea for a possible solution. We may need to use a number of AWS services." And then you explain your solution. What do you say?
In order to collect and process these real-time streams of data records regarding the flow of the water, the data from the IoT devices across the river is sent to AWS Kinesis Data Streams. Then this data is processed using the AWS Lambda module and stored in AWS S3 buckets. The data collected into Kinesis Data Streams can be used for simple data analysis and reporting in real time, using AWS Elasticsearch. Finally, we can send the processed records to dashboards to visualize the variability of the flow of water, using Amazon QuickSight.
You do not need to worry about compute resources if you use AWS Kinesis. This system will enable you to acquire data through a gateway on the cloud that will receive your data; then you can process it using the AWS Lambda module and later store the data in a storage system, e.g. AWS S3.
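As a sketch of the ingestion step of this solution, a flow sensor could push its readings into Kinesis Data Streams with boto3 roughly like this (the stream name, region and payload fields are assumptions):

import json
import boto3

# Assumptions: the stream "river-guardiao-flow" already exists in eu-west-1
# and AWS credentials are configured in the environment.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

def send_reading(sensor_id, flow_m3_per_s):
    record = {"sensor_id": sensor_id, "flow_m3_per_s": flow_m3_per_s}
    kinesis.put_record(
        StreamName="river-guardiao-flow",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=sensor_id,   # keeps readings from one sensor in order, on one shard
    )

send_reading("sensor-42", 118.5)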
2. "Real-time analytics is a key factor for digital transformation. Companies everywhere are using their data lakes based on technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and customer preferences." You raise your hand for a comment. What are you going to say? Please justify.
Firstly, I would like to clarify that Hadoop is a stack of different components, one of which is HDFS, its distributed file system. Moreover, although HDFS is thriving recently, there are plenty of other choices such as Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, or Apache Kudu for IoT and analytic data.
Hadoop is a framework that incorporates several technologies, including batch, streaming (which you mentioned) and others. HDFS is Hadoop's file system.
3. "OK, I'll explain what the Hive Driver is and what are its key functions/roles." What do you say, trying to be very complete and precise?
To begin with, Hive is Hadoop's SQL-like analytics query engine, which enables the writing of data processing logic or queries in a declarative language similar to SQL, called HiveQL. The brain of Hive is the Hive Driver, which maintains the lifecycle of a HiveQL statement: it comprehends a query compiler, optimizer and executor, executing the task plan generated by the compiler in proper dependency order while interacting with the underlying Hadoop instance.
Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer over Hadoop MapReduce with a SQL-like query engine. It enables several types of connections to other databases using the Hive Driver. The Hive Driver can, for instance, connect through ODBC to relational databases.
4. On implementing Hadoop, the CDO is worried about understanding how a name node ensures that all the data nodes are functioning properly. What can you tell him to reassure him?
The name node, or master node, contains the metadata related to the Hadoop Distributed File System, such as the file name, file path or configurations. It keeps track of all the data nodes, or slave nodes, using the heartbeat methodology. Each data node regularly sends heartbeat signals to the name node. After receiving these signals, the name node ensures that the slave nodes are alive and functioning properly. In the event that the name node does not receive a heartbeat signal from a data node, that particular data node is considered inactive and the name node starts the process of block replication on some other data node. Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A block-report from a data node contains a list of all blocks on that specific data node.
5. The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if the data delivery to the destination falls behind the data writing to the delivery stream. Can you help him understand how this process works, in order to alleviate his concerns?
In Amazon Kinesis Firehose, the scaling is handled automatically, up to gigabytes per second, in order to meet the demand. Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service (Amazon S3) for durable storage of the raw log data, and the Amazon Kinesis Analytics application will continuously run a Kinesis Streaming SQL statement against the streaming input data. This way, you should not be concerned.
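A minimal sketch of how a producer would hand records to such a Firehose delivery stream with boto3 (the delivery stream name, region and record content are assumptions; buffering to S3 and scaling are handled by the service):

import json
import boto3

# Assumptions: a delivery stream named "web-logs-to-s3" already exists with an
# S3 destination configured; AWS credentials come from the environment.
firehose = boto3.client("firehose", region_name="eu-west-1")

log_record = {"status": 200, "path": "/index.html", "bytes": 5120}
firehose.put_record(
    DeliveryStreamName="web-logs-to-s3",
    Record={"Data": (json.dumps(log_record) + "\n").encode("utf-8")},
)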
6. It is important that the board members understand what the impacts of data variety versus data volume are. What can you tell them?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed: Volume, Velocity, Variety and Variability (also known as the 4 V's). Volume is the most commonly recognized characteristic of Big Data, representing the large amount of data available for analysis to extract valuable information. The time and expense required to process large datasets drives the need for processing and storage parallelism. On the other hand, data variety represents the need to analyze data from multiple sources, domains and data types to bring added value to the business, which drives the need for distributed processing. This way, a budget increase is needed to meet the added requirements that big data processing brings to the table.
While volume can be handled by scaling the processing and storage resources, the issues related to variety require customization of processing, namely interfaces and the programming of handlers. For instance, handling objects coming from SQL sources or OSI-PI sources requires different types of programming skills. Not having the right skills to handle this can ruin the ambition of storing different - and so rich - data. To have this kind of skills, we need a budget increase in the department.
7. The CDO of Exportera tells you: "From what I understand, in Hive, the partitioning of tables can only be made on the basis of one parameter, making it not as useful as it could be." What can you answer him?
You are wrong. Hive partitioning can be made with more than one parameter of the original table. However, the partition key should always be a low-cardinality attribute, to avoid the overhead of too many partitions. It is very useful, as a query with partition filtering will only load data from the specified partitions, so it can execute much faster than a normal query that filters by a non-partitioning field.
Hive partitioning can be made with more than one parameter from the original table. The partitioning is made by dividing the table into smaller tables, stored in sub-folders, one for each combination of the partition values. The partitioning can even be further optimized with bucketing.
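A sketch of multi-column partitioning in HiveQL, driven from Python; it assumes the third-party PyHive client and a reachable HiveServer2 endpoint (host, user and table are made up). Note that the partition columns appear only in the PARTITIONED BY clause, not in the column list, which is also why the CREATE TABLE statement in question 15 of Set 2 above is marked (F):

from pyhive import hive   # third-party PyHive client, assumed installed

# Hypothetical HiveServer2 endpoint.
conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cur = conn.cursor()

# Partitioning on more than one column: Hive creates one sub-directory per
# (country, state) combination under the table location.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customer (
        id STRING,
        name STRING,
        gender STRING
    )
    PARTITIONED BY (country STRING, state STRING)
""")

# Filtering on the partition keys lets Hive read only the matching
# sub-directories (partition pruning), which is where the speed-up comes from.
cur.execute("SELECT count(*) FROM customer WHERE country = 'PT' AND state = 'Lisboa'")
print(cur.fetchall())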
8. The CDO of Exportera has been exploring stream processing and is worried about the latency of a solution based on a platform such as Kinesis Data Streams. To address the question, what can you tell him?
Kinesis Data Streams can be used efficiently to solve a variety of streaming data problems. It ensures durability and elasticity, which enables you to scale the stream up or down, so that you never lose data records before they expire. The delay between the time a record is put into the stream and the time it can be retrieved is typically less than 1 second, so you don't have to worry!
The Kinesis Data Streams platform is based on high-throughput, large-bandwidth components. Kinesis ensures durability and elasticity of the data acquired through streaming. The elasticity of Kinesis Data Streams enables you to scale the stream up or down, so that you never lose data records before they expire (1-day default retention).
9. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." You then explain it very clearly to him with a small example: counting the words of the following text: "Mary had a little lamb [] Little lamb, little lamb [] Mary had a little lamb [] Its fleece was white as snow". Justify carefully.
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and reduce. In the map phase, the required data is read from HDFS and processed in parallel by multiple independent map tasks. A map() function is defined which produces key-value outputs. Then, the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all the values for each key, summarizing the inputs.
One of the applications of the MapReduce algorithm is to count words. If we take into consideration the example mentioned:
1 – In the first step, the map() phase, all the words in the text are split into a list ordered from the first word of the text until the last one: "Mary", "had", [] "a", [] "little", [] "lamb", [] "Little", [] "lamb", [] (…) [] "white", [] "as", [] "snow". Still in the map() phase, each word is converted into a key-value pair, where the key is the word and the value is 1: ("Mary": 1), ("had": 1), ("a": 1), (…) ("snow": 1).
2 – In the shuffle phase, all the key-value pairs are sorted alphabetically by the key, so that similar keys get together.
3 – Finally, in the reduce phase, there is an aggregation by key (word), performing a count of the respective values. This way, the final result is the key-value pairs of the distinct words and their count of occurrences throughout the text.
MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its implementation is usually divided in three steps: mapping the data variables considered, shuffling those variables following a pattern/directive, and then aggregating/reducing them. Let us take the text above as reference for the following example: we want to count the number of words in the text. In the first step we would map all the words, let's say, continuously, getting a list like: "Mary [] had [] a [] little [] ... [] as [] snow". If one wants to count the number of words, a number should be associated with each word for later counting. So we can associate "1", and we can create, still in the Map step, a kind of key-value database associating each word with "1", e.g.: <"Mary",1>, where the word is the key. <Mary,1> <little,1> In the next step one could shuffle this created database so that all similar keys get together. But since one just wants to count the number of words, we just need to sum the values, and this operation is the Reduce operation in this case.
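The three stages described above can be reproduced in a few lines of plain Python (the text is the nursery rhyme from the question, with punctuation dropped so that a simple split() is enough; in a real MapReduce job the map and reduce functions would run on different DataNodes):

from itertools import groupby

TEXT = ("Mary had a little lamb Little lamb little lamb "
        "Mary had a little lamb Its fleece was white as snow")

# Map: emit one (word, 1) pair per word (case is normalized so that
# "Little" and "little" count as the same word).
mapped = [(word.lower(), 1) for word in TEXT.split()]

# Shuffle: sort the pairs by key so that equal keys end up next to each other.
shuffled = sorted(mapped, key=lambda pair: pair[0])

# Reduce: sum the values of each group of equal keys.
counts = {word: sum(value for _, value in group)
          for word, group in groupby(shuffled, key=lambda pair: pair[0])}

print(counts["lamb"], counts["mary"], counts["little"])   # 4 2 4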
10. In the Exportera cafeteria you hear someone telling the following: "Oh, processing data in motion is no different than processing data at rest." If you had to join the conversation, what would you say to them?
I don't agree with you. Processing data in motion is very different from processing data at rest. The operational difference between streaming data and data at rest lies in when the data is stored and analyzed. When dealing with data at rest, each observation is recorded and stored before performing analysis on the data. On the other hand, in streaming data, each event is processed as it is read, and subsequent results are stored in a database. However, there are some similarities: data at rest and streaming data can come from the same source, they can be processed with the same analytics, and they can be stored with the same storage service.
Processing data in motion (streaming) is much different from processing data at rest. The main differences are that: 1. Analytics over streaming data needs to be done on incoming data and not on steady data. 2. Data in motion needs buffering to process amounts of data, while similar steady-data processing can be done through querying. 3. The concept of incoming data in streaming does not apply to data at rest. In the former case one needs to adapt the acquisition to the amount of incoming data. Of course, there are also some similarities, namely in quality control of data, namely the detection of common types of errors like invalid values or missing values. Anyway, the processing - which is what we are talking about - is much different in both cases.
1. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it. Can you explain very clearly its processes, and the role of the NameNode and the DataNodes in a MapReduce operation?" Can you help him out?
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and reduce. In the map phase, the required data is read from HDFS and processed in parallel by multiple independent map tasks. A map() function is defined which produces key-value outputs. Then, the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all the values for each key, summarizing the inputs. During this process, the NameNode allocates which DataNodes, meaning individual servers, will perform the map operation and which will perform the reduce operation. This way, mapper DataNodes perform the map phase and produce key-value pairs, and the reducer DataNodes apply the reduce function to the key-value pairs and generate the final output.
2. The CDO of Exportera asks you to prepare the talking
points for a presentation he must make to the board
regarding a budget increase for his team. It is important
that the board members understand what data variety
versus data variability is, as well as their impacts on the
analytics platforms (and, of course, the business benefits of
addressing them). What do you write to make it crystal
clear?
When it comes to Big Data applications, there are 4
characteristics which need to be addressed: Volume,
Velocity, Variety and Variability (also known as the 4 V's).
While the variety characteristic represents the need to
analyze data from multiple sources and data types, the
variability refers to the changes in dataset characteristics,
whether in the data flow rate, format/structure and/or
volume. When it comes to variety of data, distributed
processing should be applied on different types of data,
followed by individual pre-analytics. On the other hand,
variability implies the need to scale-up or scale-down to
efficiently handle the additional processing load, which
justifies the use of cloud computing. Finally, regarding the
benefits to the business, both have their advantages:
variety of data brings an additional richness to the
business, as more details from multiple domains are
available; variability keeps the systems efficient, as we
don't always have to design and provision resources for the
expected peak capacity.