Data Engineer Interview Questions
SET OF 57 QUESTIONS
Data engineering is a term used in big data. It focuses on the application of data collection and analysis. The data generated from various sources is just raw data; data engineering helps to convert this raw data into useful information.
Data modeling is the method of documenting a complex software design as a diagram so that anyone can easily understand it. It is a conceptual representation of data objects, the associations between different data objects, and the rules.
There are mainly two types of schemas in data modeling: 1) Star schema and 2) Snowflake schema.
Hadoop Common: It is a common set of utilities and libraries that are utilized by Hadoop.
HDFS: This Hadoop component relates to the file system in which the Hadoop data is stored. It is a distributed file system with high bandwidth.
Hadoop MapReduce: It is based on an algorithm that provides large-scale data processing.
Hadoop YARN: It is used for resource management within the Hadoop cluster. It can also be used for
task scheduling for users.
6) What is NameNode?
It is the centerpiece of HDFS. It stores the metadata of HDFS and tracks the various files across the cluster. The actual data is not stored here; it is stored in the DataNodes.
It is a utility that allows for the creation of Map and Reduce jobs and submits them to a specific cluster.
Blocks are the smallest unit of a data file. Hadoop automatically splits huge files into small pieces.
Block Scanner verifies the list of blocks that are present on a DataNode.
10) What are the steps that occur when Block Scanner detects a corrupted data block?
Following are the steps that occur when Block Scanner finds a corrupted data block:
1) First of all, when Block Scanner finds a corrupted data block, the DataNode reports it to the NameNode.
2) The NameNode starts the process of creating a new replica from a healthy replica of the block.
3) The NameNode tries to bring the replication count of the correct replicas up to the replication factor; until this match is achieved, the corrupted data block is not deleted.
There are two messages which NameNode gets from DataNode. They are 1) Block report and 2)
Heartbeat.
Hadoop's main XML configuration files are:
mapred-site.xml
core-site.xml
hdfs-site.xml
yarn-site.xml
The four V's of big data are:
Velocity
Variety
Volume
Veracity
14) Explain the features of Hadoop
Hadoop is compatible with many types of hardware, and it is easy to add new hardware within a specific node.
It stores the data in the cluster, which is independent of the rest of the operations.
By default, Hadoop creates 3 replicas of each block, stored on different nodes.
setup(): It is used for configuring parameters like the size of the input data and the distributed cache.
cleanup(): This method is used to clean up temporary files; it is called once at the end of the task.
reduce(): It is the heart of the reducer; it is called once per key with the associated values (see the sketch after this list).
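Here is a minimal sketch of a reducer overriding these methods, assuming a word-count style job; the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// setup() runs once before any keys are processed; reduce() runs once per
// key with all of that key's values.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Read configuration parameters or distributed-cache files here.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}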
The abbreviation COSHH stands for Classification and Optimization based Scheduler for Heterogeneous Hadoop systems.
17) Explain Star Schema
Star Schema or Star Join Schema is the simplest type of Data Warehouse schema. It is known as a star schema because its structure resembles a star. In the star schema, the center of the star has one fact table, surrounded by multiple associated dimension tables. This schema is used for querying large data sets.
1) Integrate data from sources like RDBMS, SAP, MySQL, and Salesforce.
2) Store the extracted data in either a NoSQL database or HDFS.
3) Deploy the big data solution using processing frameworks like Pig, Spark, and MapReduce.
File System Check (FSCK) is a command used by HDFS. The FSCK command is used to check for inconsistencies and problems in files, for example: hdfs fsck /user/data -files -blocks.
A Snowflake Schema is an extension of a Star Schema that adds additional dimension tables. It is called a snowflake schema because its diagram looks like a snowflake. The dimension tables are normalized, which splits the data into additional tables.
21) Explain Hadoop distributed file system
Hadoop works with scalable distributed file systems like S3, HFTP FS, the local FS, and HDFS. The Hadoop Distributed File System is modeled on the Google File System. This file system is designed so that it can easily run on a large cluster of computers.
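As a small illustration of the HDFS client API, the following sketch lists a directory; the fs.defaultFS address and the path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Connects to a (hypothetical) HDFS NameNode and prints each file's
// path and length.
public class ListDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}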
Data engineers have many responsibilities. They manage the source systems of data. Data engineers simplify complex data structures and prevent the duplication of data. Many times they also provide ELT and data transformation.
Modes in Hadoop are 1) Standalone mode, 2) Pseudo-distributed mode, and 3) Fully distributed mode.
25) How to achieve security in Hadoop?
1) The first step is to secure the authentication channel of the client to the server, which provides a time-stamped ticket to the client.
2) In the second step, the client uses the received time-stamped ticket to request a service ticket from the TGS (Ticket Granting Server).
3) In the last step, the client uses the service ticket to authenticate itself to a specific server.
In Hadoop, NameNode and DataNode communicate with each other. Heartbeat is the signal sent by
DataNode to NameNode on a regular basis to show its presence.
It is a large amount of structured and unstructured data that cannot be easily processed by traditional data storage methods. Data engineers use Hadoop to manage big data.
28) What is FIFO scheduling?
It is a Hadoop job scheduling algorithm. In FIFO scheduling, the scheduler selects jobs from a work queue, oldest job first.
29) Mention default port numbers on which task tracker, NameNode, and job tracker run in Hadoop
Default port numbers on which task tracker, NameNode, and job tracker run in Hadoop are as follows:
Task Tracker: 50060
NameNode: 50070
Job Tracker: 50030
The distance between two nodes is equal to the sum of their distances to their closest common ancestor node. The method getDistance() is used to calculate the distance between two nodes.
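A minimal sketch of this idea, assuming Hadoop's NetworkTopology API; the rack and node names are made up. Each hop to a common ancestor adds one to the distance, so two nodes in the same rack are at distance 2 and nodes in different racks are at distance 4.

import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.NodeBase;

public class DistanceDemo {
    public static void main(String[] args) {
        NetworkTopology topology = new NetworkTopology();
        NodeBase node1 = new NodeBase("/rack1/datanode1"); // made-up locations
        NodeBase node2 = new NodeBase("/rack1/datanode2");
        NodeBase node3 = new NodeBase("/rack2/datanode3");
        topology.add(node1);
        topology.add(node2);
        topology.add(node3);
        System.out.println(topology.getDistance(node1, node2)); // 2: same rack
        System.out.println(topology.getDistance(node1, node3)); // 4: different racks
    }
}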
Commodity hardware is easy to obtain and affordable. It is a system that is compatible with Windows,
MS-DOS, or Linux.
NameNode stores the metadata for HDFS, like block information and namespace information.
In a Hadoop cluster, the NameNode chooses the DataNode that is closer to the requesting client (ideally in a nearby rack) when serving a read or write request, which improves network traffic. The NameNode maintains the rack ID of each DataNode to achieve this rack information. This concept is called Rack Awareness in Hadoop.
NameNode crash: If the NameNode crashes, then Secondary NameNode’s FsImage can be used to
recreate the NameNode.
Checkpoint: It is used by Secondary NameNode to confirm that data is not corrupted in HDFS.
Update: It automatically updates the EditLog and FsImage file. It helps to keep FsImage file on Secondary
NameNode updated.
36) What happens when NameNode is down, and the user submits a new job?
NameNode is the single point of failure in Hadoop, so the user cannot submit a new job and it cannot execute. If the NameNode is down, the job will fail; the user needs to wait for the NameNode to restart before running any job.
1. Shuffle: In this phase, the output from the mappers is copied to the Reducer as its input.
2. Sort: In this phase, Hadoop sorts the input to the Reducer using the same key.
3. Reduce: In this phase, output values associated with a key are reduced to consolidate the data into the final output.
The Hadoop framework uses a Context object with the Mapper class in order to interact with the remaining system. The Context object gets the system configuration details and the job in its constructor.
We use the Context object to pass information in the setup(), cleanup(), and map() methods. This object makes vital information available during the map operations.
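A minimal mapper sketch showing the Context object in use; the class name and the demo.lowercase parameter are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The Context object carries the job configuration into setup() and is
// used to emit key/value pairs from map().
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private boolean toLowerCase;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        toLowerCase = conf.getBoolean("demo.lowercase", false); // made-up parameter
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = toLowerCase ? value.toString().toLowerCase() : value.toString();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), ONE);
            }
        }
    }
}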
40) What is the default replication factor available in HDFS, and what does it indicate?
The default replication factor available in HDFS is three. The default replication factor indicates that there will be three replicas of each data block.
In a Big Data system, the size of the data is huge, and that is why it does not make sense to move data across the network. Instead, Hadoop tries to move computation closer to the data. This way, the data remains local to its stored location.
In HDFS, the balancer is an administrative tool used by admin staff to rebalance data across the DataNodes; it moves blocks from overutilized to underutilized nodes. It can be run with the hdfs balancer command.
Distributed Cache lets the Hadoop framework copy files needed by an application to the nodes on which a task has to be executed. This is done before the execution of the task starts. Distributed Cache supports the distribution of read-only files as well as zip and jar files.
Hive table definitions, mappings, and metadata are stored in the Metastore. This can be stored in an RDBMS supported by JPOX.
SerDe is a short name for Serializer/Deserializer. In Hive, a SerDe allows reading data from a table and writing it to a specific field in any format you want.
The Hive data model contains the following components:
Tables
Partitions
Buckets
48) Explain the use of Hive in Hadoop eco-system.
Hive provides an interface to manage data stored in Hadoop eco-system. Hive is used for mapping and
working with HBase tables. Hive queries are converted into MapReduce jobs in order to hide the
complexity associated with creating and running MapReduce jobs.
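A minimal HiveServer2 client sketch; the connection URL, table, and column names are placeholders. Hive compiles the query into MapReduce (or Tez/Spark) jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Runs one aggregate query over a hypothetical sales table via JDBC.
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}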
Hive supports the following complex data types:
Map
Struct
Array
Union
In Hive, .hiverc is the initialization file. This file is loaded first when we start the Command Line Interface (CLI) for Hive. We can set the initial values of parameters in the .hiverc file, for example: SET hive.cli.print.header=true;
51) Is it possible to create more than one table in Hive for a single data file?
Yes, we can create more than one table schema for a single data file. Hive saves each schema in the Hive Metastore. Based on these schemas, we can retrieve dissimilar results from the same data.
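A sketch of the idea using two external tables over the same (made-up) HDFS directory; each schema parses the same files differently.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates two schemas over one data location; dropping either table
// leaves the underlying files untouched because both are EXTERNAL.
public class MultiSchemaDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE EXTERNAL TABLE logs_raw (line STRING) "
                       + "LOCATION '/user/demo/logs'");
            stmt.execute("CREATE EXTERNAL TABLE logs_csv (ts STRING, severity STRING, msg STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                       + "LOCATION '/user/demo/logs'");
        }
    }
}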
52) Explain different SerDe implementations available in Hive
There are many SerDe implementations available in Hive. You can also write your own custom SerDe
implementation. Following are some famous SerDe implementations (a usage sketch follows this list):
OpenCSVSerde
RegexSerDe
DelimitedJSONSerDe
ByteStreamTypedSerDe
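As a usage sketch, the built-in OpenCSVSerde (full class name org.apache.hadoop.hive.serde2.OpenCSVSerde) can be attached to a table so that Hive parses the underlying files as CSV; the connection URL and table name are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Creates a table whose rows are serialized/deserialized by OpenCSVSerde.
public class SerdeDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE csv_events (id STRING, payload STRING) "
                       + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'");
        }
    }
}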
Hive provides the following table-generating functions:
explode(array)
json_tuple()
stack()
explode(map)
The CREATE statement in MySQL can create the following objects:
Database
Index
Table
User
Procedure
Trigger
Event
View
Function
Use the REGEXP operator to search for a string in a MySQL column. We can also define various types of regular expressions and search using REGEXP.
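A small JDBC sketch of a REGEXP search; the connection URL, table, and column names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Selects rows whose name column starts with "Ja" using MySQL's REGEXP.
public class RegexSearchDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/demo?user=root&password=secret"; // placeholder
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT name FROM employees WHERE name REGEXP '^Ja'")) {
            while (rs.next()) {
                System.out.println(rs.getString("name"));
            }
        }
    }
}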
57) Explain how data analytics and big data can increase company revenue?
Following are the ways in which data analytics and big data can increase company revenue: