Bigdata Final
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
Data is initially divided into directories and files. Files are divided into
uniformly sized blocks of 128 MB or 64 MB (128 MB is preferred, and is the
default in Hadoop 2.x and later).
These files are then distributed across various cluster nodes for further
processing.
HDFS, which sits on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
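The block size and replication behaviour described above are configurable. As a sketch, an hdfs-site.xml fragment along the following lines sets a 128 MB block size and a replication factor of 3; dfs.blocksize and dfs.replication are standard HDFS property names, while the values shown are illustrative:

```xml
<configuration>
  <!-- Split files into 128 MB blocks (value is in bytes) -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Keep 3 copies of each block to tolerate hardware failure -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```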
1. Pie Chart
Pie charts are one of the most common and basic data visualization techniques,
used across a wide range of applications. Pie charts are ideal for illustrating
proportions, or part-to-whole comparisons.
Because pie charts are relatively simple and easy to read, they’re best suited for
audiences who might be unfamiliar with the information or are only interested
in the key takeaways. For viewers who require a more thorough explanation of
the data, pie charts fall short in their ability to display complex information.
2. Bar Chart
The classic bar chart, or bar graph, is another common and easy-to-use method
of data visualization. In this type of visualization, one axis of the chart shows
the categories being compared, and the other, a measured value. The length of
the bar indicates how each group measures according to the value.
One drawback is that labeling and clarity can become problematic when there
are too many categories included. Like pie charts, they can also be too simple
for more complex data sets.
3. Histogram
Unlike bar charts, histograms illustrate the distribution of data over a continuous
interval or defined period. These visualizations are helpful in identifying where
values are concentrated, as well as where there are gaps or unusual values.
Histograms are especially useful for showing the frequency of a particular
occurrence. For instance, if you’d like to show how many clicks your website
received each day over the last week, you can use a histogram. From this
visualization, you can quickly determine which days your website saw the
greatest and fewest number of clicks.
4. Gantt Chart
Gantt charts are particularly common in project management, as they’re useful
in illustrating a project timeline or progression of tasks. In this type of chart,
tasks to be performed are listed on the vertical axis and time intervals on the
horizontal axis. Horizontal bars in the body of the chart represent the duration of
each activity.
Utilizing Gantt charts to display timelines can be incredibly helpful, enabling
team members to keep track of every aspect of a project. Even if you’re not a
project management professional, familiarizing yourself with Gantt charts can
help you stay organized.
6. Box and Whisker Plot
A box and whisker plot, or box plot, provides a visual summary of data through
its quartiles. First, a box is drawn from the first quartile to the third
quartile of the data set. A line within the box represents the median.
“Whiskers,” or lines, are then drawn extending from the box to the minimum
(lower extreme) and maximum (upper extreme). Outliers are represented by
individual points plotted beyond the whiskers.
This type of chart is helpful in quickly identifying whether or not the data is
symmetrical or skewed, as well as providing a visual summary of the data set
that can be easily interpreted.
7. Waterfall Chart
A waterfall chart is a visual representation that illustrates how a value changes
as it’s influenced by different factors, such as time. The main goal of this chart
is to show the viewer how a value has grown or declined over a defined period.
For example, waterfall charts are popular for showing spending or earnings over
time.
8. Area Chart
An area chart, or area graph, is a variation on a basic line graph in which the
area underneath the line is shaded to represent the total value of each data point.
When several data series must be compared on the same graph, stacked area
charts are used.
This method of data visualization is useful for showing changes in one or more
quantities over time, as well as showing how each quantity combines to make
up the whole. Stacked area charts are effective in showing part-to-whole
comparisons.
9. Scatter Plot
Another technique commonly used to display data is a scatter plot. A scatter
plot displays data for two variables as represented by points plotted against the
horizontal and vertical axis. This type of data visualization is useful in
illustrating the relationships that exist between variables and can be used to
identify trends or correlations in data.
Scatter plots are most effective for fairly large data sets, since it’s often easier to
identify trends when there are more data points present. Additionally, the closer
the data points are grouped together, the stronger the correlation or trend tends
to be.
The source data is stored in the distributed file system, which is typically
partitioned across multiple nodes in a cluster. The query engine provides an
SQL-like interface for querying the data, but instead of executing queries
directly against the source data, it can create materialized views based on the
queries. The materialized views are stored as physical tables in the distributed
file system, which can be queried like any other table.
When a query is executed against the materialized view, the query engine can
read the data directly from the physical table, rather than computing the result
from scratch each time. This can significantly improve query performance,
especially for complex queries that involve aggregations, joins, or other
expensive operations.
However, materialized views in big data also have some limitations. Because
the data is stored in a distributed file system, updates to the source data can be
slow or require complex synchronization mechanisms. Additionally, the
materialized views themselves can take up significant storage space, especially
if they are created based on large datasets or complex queries. As a result,
materialized views in big data are typically used in combination with other
optimization techniques, such as partitioning, indexing, and caching, to achieve
the best possible query performance.
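In Hive (version 3 and later), this pattern can be sketched as follows. The table and column names (sales, region, amount) are illustrative assumptions, but CREATE MATERIALIZED VIEW and ALTER MATERIALIZED VIEW ... REBUILD are the actual HiveQL statements for this feature:

```sql
-- Precompute an expensive aggregation once and store it as a physical table
CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- Later queries over the same aggregation can be answered from the
-- materialized view instead of rescanning the source data
SELECT region, total_amount FROM sales_by_region;

-- When the source data changes, the view must be refreshed
ALTER MATERIALIZED VIEW sales_by_region REBUILD;
```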
Consistency –
Consistency means that all nodes hold the same copies of a replicated data
item, visible to the various transactions: every node in a distributed
cluster returns the same, most recent, successfully written value.
Consistency refers to every client having the same view of the data. There are
various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.
Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In
simple terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.
Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover
from partitions once the partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to
the same concept.
In CAP, the term consistency refers to the consistency of the values in
different copies of the same data item in a replicated distributed system.
In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.
The services provided by this cloud computing architecture for big data include:
Scalability: The architecture is designed to scale horizontally by adding
more nodes to the cluster. This allows for the processing of large amounts of
data and high-volume requests.
Fault-tolerance: The architecture is designed to be fault-tolerant, meaning
that it can continue to operate even if some nodes fail.
Cost-effectiveness: The architecture is cost-effective because it leverages
commodity hardware and open-source software.
Flexibility: The architecture is flexible because it can handle various types
of data and can be adapted to different data processing and analysis
requirements.
Real-time data processing: The architecture can handle real-time data
streams and process them in real-time, allowing for real-time analytics and
insights.
Security and governance: The architecture provides tools and technologies
for managing data security and governance, ensuring that data is protected
and compliant with regulations.
3. GROUP: The GROUP operator is used to group the data based on one or
more columns. It takes one or more columns as input and returns a bag of tuples
for each group.
Example:
mydata_grouped = GROUP mydata BY name;
4. JOIN: The JOIN operator is used to combine two or more data sets based on
a common column. It takes two or more relations as input and returns a relation
that contains the columns of all the input relations.
Example:
mydata_join = JOIN mydata BY name, mydata2 BY name;
5. FOREACH: The FOREACH operator is used to perform a transformation on
each tuple of a relation. It takes an expression as input and applies it to each
tuple in the relation.
Example:
mydata_transformed = FOREACH mydata GENERATE name, salary * 12 AS annual_salary;
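Putting the operators above together, a small end-to-end Pig Latin script might look like the following. This is a sketch under assumptions: the input file employees.txt, its schema (name, dept, salary), and the output path are all hypothetical.

```pig
-- Load a delimited file with an explicit schema (file and schema are illustrative)
mydata = LOAD 'employees.txt' AS (name:chararray, dept:chararray, salary:int);

-- Group the records by department
by_dept = GROUP mydata BY dept;

-- For each department, compute the average salary
avg_salary = FOREACH by_dept GENERATE group AS dept, AVG(mydata.salary) AS avg_sal;

-- Write the result back to HDFS
STORE avg_salary INTO 'avg_salary_by_dept';
```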
14. Formulate the steps for the creation of employee tables with empid,
empname, empaddress, empphoneno using HIVE programming. Address
the following queries:
1. Write a query for insertion of values to the table.
2. Write a query for listing all the employees of the ABC company
with the name 'Radhika'.
3. Write a query to list the address of employees located in
Sadashivanagar.
Here are the steps to create an employee table with empid, empname,
empaddress, empphoneno using HIVE programming:
1. Open the HIVE shell or any other interface like Hue or Beeline and connect
to the Hadoop cluster.
2. Create a database in HIVE where the employee table will be stored, and
switch to it, using the following commands:
CREATE DATABASE employee_db;
USE employee_db;
3. Create the employee table with the required columns using the following
command:
CREATE TABLE employee (
empid INT,
empname STRING,
empaddress STRING,
empphoneno STRING
);
4. To insert values into the table, use the following command (the values
shown are sample data):
INSERT INTO employee VALUES (1, 'Radhika', 'Sadashivanagar', '9876543210');
5. To list all employees of the ABC company with the name 'Radhika', use the
following command:
SELECT * FROM employee WHERE empname = 'Radhika';
This will return a list of all employees with the name 'Radhika'.
6. To list the addresses of employees located in Sadashivanagar, use the
following command:
SELECT empname, empaddress FROM employee WHERE empaddress = 'Sadashivanagar';