15CS34E Analytic Computing Key


15CS34E Analytic computing

Answer key

Part-A

1. The method of moments is a method of estimation of population parameters. One starts with deriving
equations that relate the population moments (i.e., the expected values of powers of the random variable under
consideration) to the parameters of interest.
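For instance, a minimal sketch (Python; the exponential-distribution setting and the parameter values are illustrative assumptions): for an Exponential(λ) population, the first moment is E[X] = 1/λ, so equating it to the sample mean x̄ gives the estimate λ̂ = 1/x̄.

```python
import numpy as np

# Method-of-moments sketch: estimate the rate of an assumed Exponential(lam) population.
# For Exponential(lam), the first moment is E[X] = 1/lam, so matching it to the
# sample mean x_bar gives lam_hat = 1 / x_bar.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1 / 2.5, size=10_000)  # illustrative true rate lam = 2.5

x_bar = sample.mean()          # first sample moment
lam_hat = 1.0 / x_bar          # method-of-moments estimate of lam
print(f"sample mean = {x_bar:.4f}, estimated rate = {lam_hat:.4f}")
```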

2. Reports are important and can be valuable. Reports used correctly will add value. But reports have
their limits, and it is important to understand what they are. In the end, an organization will need both
reporting and analysis to succeed in taming big data, just as both reporting and analysis have been
utilized to tame every other data source that’s come along in the past. The key is to understand the
difference between a report and an analysis. It is also critical to understand how they both fit together.
Without that understanding, your organization won’t get it right.

3.

4. The Flajolet–Martin algorithm approximates the number of distinct elements in a
stream using a single pass and space that is logarithmic in the maximum number of
possible distinct elements in the stream.

5. Association rule mining has great importance in data mining, and the Apriori algorithm is a
standard approach to it. The efficiency of the Apriori algorithm can be improved by reducing the
database size as well as by reducing the time wasted on scanning the transactions.

6. Clustering methods are used to identify groups of similar objects in multivariate data sets collected
from fields such as marketing, bio-medicine and geo-spatial analysis. There are different types of
clustering methods, including partitioning methods and hierarchical clustering.
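As a minimal illustration of the two families (a sketch assuming scikit-learn is available; the synthetic data and parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two synthetic groups of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Partitioning method: k-means splits the data into k flat clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical method: agglomerative clustering merges points bottom-up.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(np.bincount(kmeans_labels), np.bincount(hier_labels))
```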

7. Applying data mining techniques to large social media data sets has the potential to continue to
improve search results for everyday search engines, realize specialized target marketing for businesses,
help psychologists study behavior, provide new insights into social structure for sociologists, and
personalize the web.

8. OLTP and OLAP are both online processing systems. OLTP is a transactional processing system,
while OLAP is an analytical processing system. The basic difference between OLTP and OLAP is
that OLTP is an online database-modifying system, whereas OLAP is an online database query-answering
system.

9. Sharding is a type of database partitioning that separates very large databases into smaller, faster,
more easily managed parts called data shards. The word shard means a small part of a whole.
10. MapReduce Algorithm for Matrix Multiplication

 Facts: The final step in the MapReduce algorithm is to produce the matrix A × B. The unit of
computation of the matrix A × B is one element in the matrix.

 Conclusion: The inputs to the reduce( ) step (function) of the MapReduce
algorithm are one row vector from matrix A and one column vector from matrix B.
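To make this concrete, here is a minimal single-machine sketch (Python) of the one-step map/reduce logic: the map phase emits each matrix entry under every output key (i, k) that needs it, and the reduce phase for key (i, k) pairs row i of A with column k of B. The matrices and helper names are illustrative, not part of the answer above.

```python
from collections import defaultdict

# Minimal single-machine sketch of MapReduce matrix multiplication (one-step version).
# A is I x J, B is J x K; the reduce key (i, k) gathers row i of A and column k of B.
A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # sparse dict: (i, j) -> value
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # sparse dict: (j, k) -> value
I, J, K = 2, 2, 2

def map_phase():
    # Each A[i, j] is sent to every key (i, k); each B[j, k] to every key (i, k).
    for (i, j), a in A.items():
        for k in range(K):
            yield (i, k), ("A", j, a)
    for (j, k), b in B.items():
        for i in range(I):
            yield (i, k), ("B", j, b)

def reduce_phase(key, values):
    # Pair a_ij with b_jk on the shared index j and sum the products.
    row = {j: v for tag, j, v in values if tag == "A"}
    col = {j: v for tag, j, v in values if tag == "B"}
    return key, sum(row[j] * col[j] for j in row if j in col)

groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)          # the "shuffle" step groups values by key

C = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(C)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```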

Part-B

11.

12. a) Evolution of Analytic Scalability

The world of big data requires new levels of scalability. As the amount of data organizations process
continues to increase, the same old methods for handling data just won’t work anymore. Organizations
that don’t update their technologies to provide a higher level of scalability will quite simply choke on big
data. Key enablers of this scalability include the convergence of the analytic and data environments,
massively parallel processing (MPP) architectures, the cloud, grid computing, and MapReduce.

A History Of Scalability

Until well into the 1900s, doing analytics was very, very difficult. Doing a deep analysis, such as
building a predictive model, required manually computing all of the statistics. Scalability of any sort was virtually
nonexistent. The amount of data has grown at least as fast as the computing power of the machines that
process it. As new big data sources become available the boundaries are being pushed further.

The Convergence Of The Analytic And Data Environments

It used to be that analytic professionals had to pull all their data together into a separate analytics
environment to do analysis. None of the data that was required was together in one place, and the tools
that did analysis didn’t have a way to run where the data sat. The only option was to pull the data
together in a separate analytics environment and then start performing analysis. Much of the work
analytic professionals do falls into the realm of advanced analytics, which encompasses data mining,
predictive modeling, and other advanced techniques. There is an interesting parallel between what
analysts did in the early days and what data warehousing is all about. There’s not much difference
between a data set as analysts define it and a “table” in a database. Analysts have done “merges” of
their data sets for years. That is the exact same thing as a “join” of tables in a database.

In both a merge and a join, two or more data sets or tables are combined together. Analysts do what is
called “data preparation.” In the data warehousing world this process is called “extract, transform, and
load (ETL).” Basically, analysts were building custom data marts and mini–data warehouses before the
terms data mart or data warehouse were invented. The Relational Database Management System
(RDBMS) started to not only become popular, but to become more scalable and more widely adopted.
Initially, databases were built for each specific purpose or team, and relational databases were spread all
over an organization. Such single-purpose databases are often called “data marts.” Combining the various
database systems into one big system yields what is called an Enterprise Data Warehouse (EDW).

Modern In-Database Architecture


In an enterprise data warehousing environment, most of the data sources have already been
brought into one place. Move the analysis to the data instead of moving the data to the
analysis.

b) (i) Statistical Concepts


Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data. In applying statistics to, e.g., a scientific, industrial, or social
problem, it is conventional to begin with a statistical population or a statistical model
process to be studied.
Populations and Parameters
A population is any large collection of objects or individuals, such as Americans,
students, or trees about which information is desired.
A parameter is any summary number, like an average or percentage that describes the
entire population.
The population mean μ and the population proportion p are two different population
parameters. For example:
We might be interested in learning about μ, the average weight of all middle-aged
female Americans. The population consists of all middle-aged female Americans, and
the parameter is μ.
Or, we might be interested in learning about p, the proportion of likely American voters
approving of the president's job performance. The population comprises all likely
American voters, and the parameter is p.
The problem is that 99.999999999999... % of the time, we don't — or can't — know the
real value of a population parameter.
Samples and statistics
A sample is a representative group drawn from the population.
A statistic is any summary number, like an average or percentage that describes the
sample.
The sample mean, x̄, and the sample proportion, p̂, are two different sample statistics.
For example:
We might use x̄, the average weight of a random sample of 100 middle-aged female
Americans, to estimate μ, the average weight of all middle-aged female Americans.
Or, we might use p̂, the proportion in a random sample of 1000 likely American voters
who approve of the president's job performance, to estimate p, the proportion of all
likely American voters who approve of the president's job performance.
Because samples are manageable in size, we can determine the actual value of any
statistic. We use the known value of the sample statistic to learn about the unknown
value of the population parameter.
Confidence Intervals
Suppose we want to estimate an actual population mean μ. As you know, we can only
obtain x̄, the mean of a sample randomly selected from the population of interest. We
can use x̄ to find a range of values,
L ≤ μ ≤ U,
that we can be really confident contains the population mean μ. The range of values is
called a "confidence interval."
General form of most confidence intervals
The previous example illustrates the general form of most confidence intervals, namely:
sample estimate ± margin of error
That is:
lower limit L = sample estimate − margin of error
and:
upper limit U = sample estimate + margin of error
Once we've obtained the interval, we can claim that we are really confident that the
value of the population parameter is somewhere between the value of L and the value
of U.
So far, we've been very general in our discussion of the calculation and interpretation of
confidence intervals. To be more specific about their use, let's consider a specific
interval, namely the "t-interval for a population mean μ."
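A minimal numerical sketch of such a t-interval (Python, assuming NumPy and SciPy are available; the sample data are synthetic):

```python
import numpy as np
from scipy import stats

# t-interval sketch for a population mean mu, using synthetic "weight" data.
rng = np.random.default_rng(0)
sample = rng.normal(loc=150.0, scale=20.0, size=25)   # e.g. 25 sampled weights

n = sample.size
x_bar = sample.mean()                       # sample estimate
s = sample.std(ddof=1)                      # sample standard deviation
t_mult = stats.t.ppf(0.975, df=n - 1)       # multiplier for a 95% interval
margin = t_mult * s / np.sqrt(n)            # margin of error

L, U = x_bar - margin, x_bar + margin
print(f"95% confidence interval for mu: ({L:.2f}, {U:.2f})")
```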

(ii) (1) Reservoir sampling is a family of randomized algorithms for randomly
choosing k samples from a list of n items, where n is either a very large or unknown
number. Typically, n is large enough that the list does not fit into main memory;
for example, a list of search queries in Google.
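A minimal sketch of the classic reservoir-sampling idea (Python, Algorithm R style; an integer range stands in for the stream here):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randint(0, i)          # item i is kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```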
(2) Statistical Inference, Learning and Models in Big Data: Big data provides big
opportunities for statistical inference, but perhaps even bigger challenges, often
related to differences in volume, variety, velocity, and veracity of information when
compared to smaller, carefully collected datasets.

13.(a) The Flajolet-Martin Algorithm


The principle behind the FM algorithm is that it is possible to estimate the number of
distinct elements by hashing the elements of the universal set to a bit-string that is
sufficiently long.
This means that the length of the bit-string must be such that there are more possible
results of the hash function than there are elements of the universal set.
Before we can estimate the number of distinct elements, we first choose an upper
bound m on the number of distinct elements. This bound gives us the maximum number of
distinct elements that we might be able to detect. Choosing m to be too small will
influence the precision of our measurement; choosing an m that is far bigger than the
number of distinct elements will only use too much memory. The memory that is
required is on the order of log2 m bits.
For most applications, a bit-array of 64 bits is sufficiently large. The array needs to be
initialized to zero. We will then use one or more adequate hash functions. These
hash functions map the input to a number that is representable by our bit-array. This
number is then analyzed for each record: if the resulting number has k trailing zeros,
we set the kth bit in the bit array to one.
Finally, we can estimate the currently seen number of distinct elements by taking
the index of the first zero bit in the bit-array. This index is usually denoted by R. We can
then estimate the number of unique elements N as roughly 2^R. The
algorithm is as follows:
Algorithm
Pick a hash function h that maps each of the N elements to at least log2 N bits.
For each stream element a, let r(a) be the number of trailing 0s in h(a).
Record R, the maximum r(a) seen.
Estimate = 2^R.
Example
r(a) = the position of the first 1 counting from the right. Say h(a) = 12; then 12 is 1100 in
binary, so r(a) = 2.
Example
Suppose the stream is 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1, ... and let h be the chosen hash function.
The transformed stream (h applied to each item) is 4, 5, 2, 4, 2, 5, 3, 5, 4, 2, 5, 4.
Each of the above elements is converted into its binary equivalent: 100, 101, 10, 100,
10, 101, 11, 101, 100, 10, 101, 100.
We compute r of each item in the above stream: 2, 0, 1, 2, 1, 0, 0, 0, 2, 1, 0, 2.
So R = max r = 2. Output 2^R = 2^2 = 4.
A very simple and heuristic intuition as to why Flajolet-Martin works can be explained as
follows:
1. h(a) hashes a with equal probability to any of N values.
2. Then h(a) is a sequence of log2 N bits, where a 2^(-r) fraction of all a's have a tail of r
zeros:
About 50% of a's hash to ***0
About 25% of a's hash to **00
So, if we saw a longest tail of r = 2 (i.e., an item hash ending in *100), then we have
probably seen about four distinct items so far.
3. So, it takes hashing about 2^r items before we see one with a zero-suffix of length r.
More formally, we can see that the algorithm works because the probability that a given
hash value h(a) ends in at least r zeros is 2^(-r). In the case of m different elements, the
probability that R ≥ r (where R is the maximum tail length seen so far) is given by
P(R ≥ r) = 1 − (1 − 2^(-r))^m.
Since 2^(-r) is small, (1 − 2^(-r))^m ≈ e^(−m·2^(-r)).
If 2^r ≫ m, then 1 − e^(−m·2^(-r)) ≈ m/2^r ≈ 0.
If 2^r ≪ m, then 1 − e^(−m·2^(-r)) ≈ 1.
Thus, 2^R will almost always be around m.
Variations to the FM Algorithm
There are reasons why the simple FM algorithm won't work with just a single hash
function. The expected value of 2^R is actually infinite: each time R increases by one, the
probability halves, but the value 2^R doubles. In order to get a much smoother
estimate, one that is also more reliable, we can use many hash functions. Another problem
with the FM algorithm in the above form is that the results vary a lot. A common solution
is to run the algorithm multiple times with different hash functions, and combine the
results from the different runs.
One idea is to take the mean of the results from each hash function,
obtaining a single estimate of the cardinality. The problem with this is that averaging is
very susceptible to outliers (which are likely here).
A different idea is to use the median, which is less prone to be influenced by outliers.
The problem with this is that the result can only take the form of some power of 2. Thus,
no matter how many hash functions we use, should the correct value of m lie between
two powers of 2, say at 400, then it will be impossible to obtain a close estimate. A
common solution is to combine both the mean and the median:
1. Create k·l hash functions and split them into k distinct groups (each of size l).
2. Within each group, take the mean of the 2^R estimates.
3. Finally, take the median of the k group means as the final estimate.
Sometimes, an outsized 2^R will bias some of the group means and make them too large.
However, taking the median of the group averages will reduce the influence of this effect
almost to nothing.
Moreover, if the groups themselves are large enough, then the averages can be
essentially any number, which enables us to approach the true value m as long as we
use enough hash functions.
Groups should be of size at least some small multiple of log2 m.
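A minimal single-machine sketch of the estimator and the mean-within-groups / median-across-groups combination described above (Python; the multiply-shift hash functions and the group sizes are illustrative, not the hash function used in the worked example):

```python
import random
import statistics

def trailing_zeros(x: int) -> int:
    """Number of trailing 0 bits in x (0 is capped at 32 zeros)."""
    if x == 0:
        return 32
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream, num_groups=5, group_size=4, seed=42):
    """Flajolet-Martin estimate: mean of 2^R within each group, median across groups."""
    rng = random.Random(seed)
    # Illustrative random hash functions h(x) = (a*x + b) mod 2^32, a odd.
    hashes = [(rng.randrange(1, 1 << 32) | 1, rng.randrange(1 << 32))
              for _ in range(num_groups * group_size)]
    R = [0] * len(hashes)
    for item in stream:
        for i, (a, b) in enumerate(hashes):
            h = (a * hash(item) + b) & 0xFFFFFFFF
            R[i] = max(R[i], trailing_zeros(h))   # track the longest tail of zeros
    estimates = [2 ** r for r in R]
    group_means = [statistics.mean(estimates[g * group_size:(g + 1) * group_size])
                   for g in range(num_groups)]
    return statistics.median(group_means)

print(fm_estimate(x % 500 for x in range(100_000)))   # roughly 500 distinct items
```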

(b) Decaying Windows


Pure sliding windows are not the only way by which the evolution of data streams can
be taken into account during the mining process. A second way is to introduce a decay
factor into the computation. Specifically, the weight of each transaction is multiplied by a
decay factor less than one when a new transaction arrives. The overall effect of such an approach
is to create an exponential decay function on the arrivals in the data stream. Such a
model is quite effective for evolving data streams, since recent transactions are counted
more significantly during the mining process. Specifically, the decay factor is applied
only to those itemsets whose counts are affected by the current transaction. However,
the decay factor has to be applied in a modified way, by taking into account the last
time that the itemset was touched by an update. This approach works because the
counts of each itemset reduce by the same decay factor in each iteration, as long as a
transaction count is not added to it. Such an approach is also applicable to other mining
problems where statistics are represented as the sum of decaying values.
We discuss a few applications of decaying windows to find interesting aggregates over
data streams.
The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over
the world, with the name of the movie as part of the element. We want to keep a
summary of the stream that is the most popular movies “currently.” While the notion of
“currently” is imprecise, intuitively, we want to discount the popularity of an older movie
that may have sold many tickets, but most of these decades ago. Thus, a newer movie
that sold n tickets in each of the last 10 weeks is probably more popular than a movie
that sold 2n tickets last week but nothing in previous weeks.
One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if
the ith ticket is for that movie, and 0 otherwise. Pick a window size N, which is the
number of most recent tickets that would be considered in evaluating popularity. Then,
use the method of the DGIM algorithm to estimate the number of tickets for each movie,
and rank movies by their estimated counts.
This technique might work for movies, because there are only thousands of movies, but
it would fail if we were instead recording the popularity of items sold at Amazon, or the
rate at which different Twitter-users tweet, because there are too many Amazon
products and too many tweeters. Further, it only offers approximate answers.
Describing a Decaying Window
One approach is to redefine the question so that we are not asking for a simple count
of 1s in a window. Instead, we compute a smooth aggregation of all the 1s ever seen in the
stream, but with decaying weights: the further in the past a 1 is found, the less weight
it is given. Formally, let a stream currently consist of the elements a1, a2, ..., at,
where a1 is the first element to arrive and at is the current element. Let c be
a small constant, such as 10^-6 or 10^-9. Define the exponentially decaying window for this
stream to be the sum
Σ (i = 1 to t) a_i (1 − c)^(t − i).
The effect of this definition is to spread out the weights of the stream elements as far
back in time as the stream goes. In contrast, a fixed window with the same sum of the
weights, 1/c, would put equal weight 1 on each of the most recent 1/c elements to
arrive and weight 0 on all previous elements. This is illustrated in the figure below.
It is much easier to adjust the sum in an exponentially decaying window than in a sliding
window of fixed length. In the sliding window, we have to somehow take into
consideration the element that falls out of the window each time a new element arrives.
This forces us to keep the exact elements along with the sum, or to use some
approximation scheme such as DGIM. But in the case of a decaying window, when a
new element a_(t+1) arrives at the stream input, all we need to do is the following:
1. Multiply the current sum by (1 − c).
2. Add a_(t+1).
The reason this method works is that each of the previous elements has now moved
one position further from the current element, so its weight is multiplied by (1 − c).
Further, the weight on the current element is (1 − c)^0 = 1, so adding a_(t+1) is the
correct way to include the new element's contribution.
Now we can try to solve the problem of finding the most popular movies in a stream of
ticket sales. We can use an exponentially decaying window with a constant c, say 10^-9;
that is, we are approximating a sliding window that holds the last one billion ticket sales. For
each movie, we can imagine a separate stream with a 1 each time a ticket for that
movie appears in the stream, and a 0 each time a ticket for some other movie arrives.
The decaying sum of the 1s thus measures the current popularity of the movie.
To optimize this process, we can avoid maintaining counts for the unpopular
movies: if the popularity score for a movie falls below a threshold, its score is dropped from the
counting. A good threshold value to use is 1/2.
When a new ticket arrives on the stream, do the following:
1. For each movie whose score is currently maintained, multiply its score by (1 − c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that
score. If there is no score for M, create one and initialize it to 1.
3. If any score is below the threshold 1/2, drop that score.
A point to be noted is that the sum of all scores is 1/c. Thus, there cannot be more than
2/c movies with a score of 1/2 or more, or else the sum of the scores would exceed 1/c.
Thus, 2/c is a limit on the number of movies being counted at any time. Of course, in
practice, the number of actively counted movies would be much less than 2/c. If the
number of items is very large, then other, more sophisticated techniques are required.
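A minimal sketch of these update rules (Python; the movie labels and the value of c are illustrative):

```python
# Decaying-window popularity counter: each score approximates an exponentially
# decaying count of 1s for one movie. c is the decay constant from the text.
c = 1e-3            # illustrative; the text suggests values like 1e-9
threshold = 0.5
scores = {}

def new_ticket(movie):
    # 1. Multiply each maintained score by (1 - c).
    for m in scores:
        scores[m] *= (1 - c)
    # 2. Add 1 to the score of the ticket's movie (create it if absent).
    scores[movie] = scores.get(movie, 0.0) + 1.0
    # 3. Drop any score that has fallen below the threshold 1/2.
    for m in [m for m, s in scores.items() if s < threshold]:
        del scores[m]

for ticket in ["A", "A", "B", "A", "C", "A", "B", "A"]:
    new_ticket(ticket)
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # most popular first
```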

14.a) We can classify the visualization of social networks in four main groups,
depending on the main focus of the predominant visual task for which the
visualization metaphor is envisioned. These are: structural, the most common
representation thus far, semantic, which emphasizes the meaning of the entities
and relationships over their structure, temporal and statistical.
Structural Visualization
A structural visualization of the social networks focuses precisely on that,
its structure. The structure can be thought of as the topology of a graph that
represents the actors and relationships in a social network. There are two
predominant approaches to structural visualization: node-link diagrams and
matrix-oriented methods. While node-link diagrams are easy to interpret and
depict explicitly the links between nodes, matrix-oriented representations usually
make a better use of limited display area, critical for today’s availability
of visualizations in a multitude of devices. In the recent years, we have seen
efforts to combine the best of the two types into a series of hybrid representations.
The value of network layout in visualization. The purpose of
a visualization is to allow users to gain insight on the complexity of social dynamics.
Therefore, an aspect such as the layout of network elements must be
evaluated with respect to its value toward that goal. Numerous layout algorithms
have been proposed, each of them with its strengths and weaknesses. In
general, we can highlight a number of high level properties that a layout must
satisfy: (1) The positioning of nodes and links in a diagram should facilitate
the readability of the network. (2) The positioning of nodes should help uncover
any inherent clusterability of the network, i.e., if a certain group of nodes
is considered to be a cluster, the user should be able to extract that information
from the layout, and (3), the position of nodes should result in a trustworthy
representation of the social network,
Matrix-oriented Techniques. Matrix-oriented techniques are
visualization metaphors designed to represent a social network via an explicit
display of its adjacency or incidence matrix. In this visualization, each link is
represented as a grid location with cartesian coordinates corresponding to the
nodes in the network. Color and opacity are often used to represent important
structural or semantic quantities. Consequently, a visual representation of the
adjacency matrix is much more compact than a node-link diagram and overlap
free, since each link is represented without overlap.
One of the challenges with this matrix representation is enabling the users
to visually identify local and global structures in the network. In general, this
depends on the way the nodes are ordered along the axes of the matrix view.
When the order of nodes is arbitrary, the matrix view may not exhibit the clustering
present in the data. Reordering of a matrix is a complex combinatorial
problem, and depends on a given objective function. In general, if we follow
the high-level desired properties of a good network visualization, it should retain
clusters in close proximity.
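As a small illustration of the matrix view (a sketch assuming NumPy; the toy network and the degree-based reordering heuristic are illustrative, not a published reordering method):

```python
import numpy as np

# Toy undirected network with two obvious clusters: {0,1,2} and {3,4,5}.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1          # adjacency matrix: one cell per link

# The matrix view depends heavily on node ordering. A simple illustrative
# heuristic: sort nodes by degree so that similar rows sit next to each other.
order = np.argsort(-A.sum(axis=1))
print("original:\n", A)
print("reordered:\n", A[np.ix_(order, order)])
```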
Hybrid Techniques. Node-link diagrams are easy to interpret
when compared to adjacency matrix diagrams, but the latter are usually more
effective to show overall structure where the network is dense. This observation
has led to a number of hybrid techniques, which combine matrix-oriented
techniques with node-link diagrams in an attempt to overcome the issues associated
with each of them.
Semantic and Temporal Visualization
Structural visualizations, although they unify the depiction of both overview
and detail information, are less effective when the social network becomes
large. Node-link diagrams become rapidly cluttered, and algorithmic layouts
often result in "hairballs", as highly connected nodes tend to concentrate in
the center of the display. For this reason, recent approaches have focused on
a different aspect of social networks: semantics. Instead of highlighting the
explicit relationships found in the data, these represent high-level attributes
and connections of actors and links, either as specified explicitly in the data or
implicitly inferred from cross-referencing external sources.
Ontology-based Visualization. One such semantic visualization
is the use of ontologies to represent the types of actors and relationships in a
social network. An example is Ontovis, where a typical node-link diagram
is enhanced with a high-level graphical representation of the ontology [38].
An ontology is a graph whose nodes represent node types and whose links represent
types of relationships. Once an ontology is known, either given explicitly or
extracted from the data itself, a social network can be represented implicitly
by the relationships in this graph. The goal in this type of visualization is not
necessarily to discover the structural properties of the network, but rather the
distribution of attributes of nodes and links.
Temporal Visualization. A special type of semantic information
that has captured the attention of social network visualization researchers
is time. Since social interaction is a time-dependent phenomenon, it is only
natural to represent the temporal dimension using visual means. Nonetheless,
the visualization of dynamic networks has not been explored in depth. One
can argue that one of the reasons for this lack of treatment is that handling
the temporal dimension from a structural point of view alone is limited and
insufficient. A semantic approach, as the ones suggested above, seems more
appropriate. However, time as a dimension deserves a treatment different from
other data-specific node attributes. One of the difficulties in representing time is
a shortage of dimensions to depict a dynamic network in a 2D display. As
an alternative, one can represent time along precisely a temporal dimension.
Moody et al. consider two types of such dynamic visualizations: flipbooks,
where nodes remain static while the edges vary over time, and movies, where
nodes are allowed to move as the relationships change.
Statistical Visualization
An important aspect of visualization is the interplay between analytical and
visual tools to gain insight. This is at the core of visual analytics. For this
reason, analysts often explore summary visualizations to understand the distributions
of variables of interest. These variables often correspond to network
statistics that represent the structure of the network, such as degree, centrality
and the clustering coefficient. While the first two describe the importance of nodes across the network, the latter metric
indicates how clusterable the nodes in a network are.

(b) (i)We organize the advances on evolution in a multi-dimensional framework.


We identify four dimensions that are associated with knowledge discovery in
social networks and elaborate on their interplay in the context of evolution.
Dimension 1: Dealing with Time. We identify two different threads of
research on the analysis of dynamic social networks. Both threads consider
the temporal information about the social network for model learning, but they
exploit the temporal data differently and, ultimately, deliver different types of
knowledge about network dynamics.
Methods of the first thread learn a single model that captures the dynamics
of the network by exploiting information on how the network has changed from
one timepoint to the next. These methods observe the temporal data as a time
series that has a beginning and an end. Advances in this thread include [19, 25,
27, 45, 26]; many of the methods are of an evolutionary nature. The model they
learn provides insights on how the network has evolved, and can also be used
for prediction on how it will change in the future, provided that the process
which generated the data is stationary. Hence, such a model delivers laws on
the evolution of the network, given its past data.
Methods of the second thread learn one model at each timepoint and adapt
it to the data arriving at the next timepoint. Although some early examples [46,
2, 37] are contemporary with or even older than the first thread, the popularity of
this thread has grown more recently. These methods observe the temporal data
as an endless stream. They deliver insights on how each community evolves,
and mostly assume a non-stationary data-generating process.
The differences between the two threads are fundamental in many aspects.
The object of study for the first thread is a time series; for the second one it is
a stream. From the algorithmic perspective, the methods of the first thread can
exploit all information available about the network and use data structures that
accommodate this information in an efficient way, e.g., a matrix of interactions
among network members. Methods of the second thread cannot know
how the network will grow and how many entities will show up at each timepoint,
so they must resort to structures that can adapt to an evolving network size. But foremost, the two threads
differ in their objective (see Dimension 2
below): while the second thread monitors communities, the first thread derives
laws of evolution that govern all communities in a given network.
Dimension 2: Objective of Study. One of the most appealing objectives at
the eve of the Social Web was understanding how social networks are formed
and how they evolve. This is pursued mainly within the first research thread of
Dimension 1, which seeks to find laws of evolution that explain the dynamics
of the large social networks we see in the Web.
Dimension 3: Definition of a Community. A community may be defined
by structure, e.g., communities as cliques. Alternatively, it may be defined
by proximity or similarity of members, whereupon many authors assume that
nodes are proximal if they are linked, while others assume that nodes are similar
if they have similar semantics.
Dimension 4: Evolution as Objective vs Assumption. The first category
encompasses methods that discretize the time axis to study how communities
have changed from one timepoint to the next. The second category encompasses
methods that make the assumption of smooth evolution across time and
learn community models that preserve temporal smoothness.

(ii) Many forms of dynamic and streaming algorithms can be used to perform trend
detection in a wide variety of social networking applications.
In such applications, the data is dynamically clustered in a streaming fashion and can
be used in order to determine the important patterns of changes.
Examples of such streaming data could be multi-dimensional data, text streams,
streaming time-series data, and trajectory data. Key trends and events in the data can
be discovered with the use of clustering methods. Social tagging can be useful in areas
including indexing, search, taxonomy generation, clustering, classification, social
interest discovery, etc.
Indexing
Tags can be useful for indexing sites faster. Users bookmark sites launched
by their friends or colleagues before a search engine bot can find them. Tags
are also useful in deeper indexing.
Search
Tags have been found useful for web search, personalized search and enterprise
search. Tags offer multiple descriptions of a given resource, which
potentially increases the likelihood that searcher and tagger find a common
language; thus, using tags, retrieval effectiveness may be enhanced. Social
bookmarking can provide search data not currently provided by other sources.
Semantic Query Expansion. Schenkel et al. [45] develop an incremental
top-k algorithm for answering queries of the form Q(u, q1, q2, ..., qn)
where u is the user and q1, q2, ..., qn are the query keywords.
Enhanced Similarity Measure. Estimating similarity between
a query and a web page is important to the web search problem. Tags provided
by web users provide different perspectives and so are usually good summaries
of the corresponding web pages and provide a new metadata for the similarity
calculation between a query and a web page.
Enhanced Static Ranking. Estimating the quality of a web
page is also important to the web search problem. The amount of annotations
assigned to a page indicates its popularity and indicates its quality in some
sense. In order to explore social tags for measuring the quality of web pages,
researchers have exploited the tagging graph.
Personalized Search. Furthermore, personal tags are naturally
good resources for describing a person’s interests. So personal search could
be enhanced via exploring personal tags.

15.(a) Hierarchical visualization techniques


The visualization techniques discussed so far focus on visualizing multiple dimensions
simultaneously. However, for a large data set of high dimensionality, it would be difficult
to visualize all dimensions at the same time. Hierarchical visualization techniques
partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a
hierarchical manner.
“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical
visualization method. Suppose we want to visualize a 6-D data set, where the dimensions
are F, X1, ..., X5. We want to observe how dimension F changes with respect to the other
dimensions. We can first fix the values of dimensions X3, X4, X5 to some selected values,
say, c3, c4, c5. We can then visualize F, X1, X2 using a 3-D plot, called a world, as shown in
Figure 2.19. The position of the origin of the inner world is located at the point (c3, c4, c5)
in the outer world, which is another 3-D plot using dimensions X3, X4, X5. A user can
interactively change, in the outer world, the location of the origin of the inner world.
The user then views the resulting changes of the inner world. Moreover, a user can vary
the dimensions used in the inner world and the outer world. Given more dimensions,
more levels of worlds can be used, which is why the method is called "worlds-within-worlds."
As another example of hierarchical visualization methods, tree-maps display hierarchical
data as a set of nested rectangles. For example, Figure 2.20 shows a tree-map
visualizing Google news stories. All news stories are organized into seven categories, each
shown in a large rectangle of a unique color. Within each category (i.e., each rectangle
at the top level), the news stories are further partitioned into smaller subcategories.
In early days, visualization techniques were mainly for numeric data. Recently, more
and more non-numeric data, such as text and social networks, have become available.
Visualizing and analyzing such data attracts a lot of interest.
There are many new visualization techniques dedicated to these kinds of data. For
example, many people on the Web tag various objects such as pictures, blog entries, and
product reviews. A tag cloud is a visualization of statistics of user-generated tags. Often,
in a tag cloud, tags are listed alphabetically or in a user-preferred order. The importance
of a tag is indicated by font size or color.
In summary, visualization provides effective tools to explore data. We have introduced
several popular methods and the essential ideas behind them. There are many
existing tools and methods. Moreover, visualization can be used in data mining in various
aspects. In addition to visualizing data, visualization can be used to represent the
data mining process, the patterns obtained from a mining method, and user interaction
with the data. Visual data mining is an important research and development direction.
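As a minimal sketch of the tree-map principle that area is proportional to value (pure Python; the categories, counts, and one-level slice layout are illustrative):

```python
# Tree-map idea: each category gets a rectangle whose area is proportional to
# its value; here a simple one-level "slice" layout along the x-axis.
categories = {"World": 40, "Business": 25, "Sports": 20, "Technology": 15}  # illustrative counts
width, height = 100.0, 60.0
total = sum(categories.values())

x = 0.0
for name, value in categories.items():
    w = width * value / total          # slice width proportional to the value
    print(f"{name:12s} rect: x={x:6.2f}, y=0.00, w={w:6.2f}, h={height:.2f}")
    x += w
```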

(b) (i) Hadoop Framework


Apache Hadoop is a framework that allows distributed processing of large data sets
across clusters of commodity computers using a simple programming model. It is
designed to scale-up from single servers to thousands of machines, each providing
computation and storage. Rather than rely on hardware to deliver high-availability, the
framework itself is designed to detect and handle failures at the application layer, thus
delivering a highly available service on top of a cluster of computers, each of which may
be prone to failures.
In short, Hadoop is an open-source software framework for storing and processing big
data in a distributed way on large clusters of commodity hardware. Basically, it
accomplishes the following two tasks:
1. Massive data storage.
2. Faster processing.
Advantage of Hadoop
Problems in data transfer made organizations think about an alternate way. The
following examples explain the need for Hadoop.
The transfer speed is around 100 MB/s and a standard disk is 1 TB, so the time to read the entire
disk is about 10,000 s, or roughly 3 h! Simply increasing processing speed may not be very helpful,
for two reasons:
Network bandwidth is now more of a limiting factor.
Physical limits of processor chips have been reached.
If 100 TB of data is to be scanned on a 1000-node cluster, then with
remote storage at 10 MB/s bandwidth it would take 165 min, whereas with
local storage at 50 MB/s it would take 33 min.
So it is better to move computation rather than moving data.
Taking care of hardware failure cannot be made optional in Big Data Analytics but has
to be made a rule. In the case of 1000 nodes, we need to consider, say, 4000 disks, 8000
cores, 25 switches, 1000 NICs and 2000 RAMs (16 TB). The mean time between failures
could be even less than a day, since commodity hardware is used. There is a need for a
fault-tolerant store to guarantee reasonable availability.
Hadoop Goals
The main goals of Hadoop are listed below:
1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerance: It is designed with very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handle hardware failures: The resiliency of these clusters comes from the
software’s ability to detect and handle failures at the application layer.
The Hadoop framework can store huge amounts of data by dividing the data into blocks
and storing it across multiple computers, and computations can be run in parallel across
multiple connected machines.
Hadoop gained its importance because of its ability to process huge amount of variety
of data generated every day especially from automated sensors and social media using
low-cost commodity hardware.
Since processing is done in batches, throughput is high but latency is also high. Latency is
the time (minutes/seconds or clock periods) to perform some action or produce some
result, whereas throughput is the number of such actions executed or results produced
per unit of time. The throughput of a memory system is termed memory bandwidth.
Hadoop Assumptions
Hadoop was developed with large clusters of computers in mind with the following
assumptions:
1. Hardware will fail, since it considers a large cluster of computers.
2. Processing will be run in batches; so aims at high throughput as opposed to low
latency.
3. Applications that run on Hadoop Distributed File System (HDFS) have large
datasets typically from gigabytes to terabytes in size.
4. Portability is important.
5. Availability of high-aggregate data bandwidth and scale to hundreds of nodes in
a single cluster.
6. Should support tens of millions of files in a single instance.
7. Applications need a write-once-read-many access model.
(ii) Core Components of Hadoop
Hadoop consists of the following components:
1. Hadoop Common: This package provides file system and OS level abstractions.
It contains libraries and utilities required by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
that provides a limited interface for managing the file system.
3. Hadoop MapReduce: MapReduce is the key algorithm that the Hadoop
MapReduce engine uses to distribute work around a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): It is a
resource- management platform responsible for managing compute resources in
clusters and using them for scheduling of users’ applications.
Hadoop Common Package
This consists of necessary Java archive (JAR) files and scripts needed to start Hadoop.
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher version. The standard
start-up and shut-down scripts need Secure Shell (SSH) to be setup between the nodes
in the cluster.
HDFS (storage) and MapReduce (processing) are the two core components of Apache
Hadoop. Both HDFS and MapReduce work in unison and they are co-deployed, such
that there is a single cluster that provides the ability to move computation to the data.
Thus, the storage system HDFS is not physically separate from a processing system
MapReduce.
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that provides a limited interface for managing the file
system to allow it to scale and provide high throughput. HDFS creates multiple replicas
of each data block and distributes them on computers throughout a cluster to enable
reliable and rapid access. When a file is loaded into HDFS, it is replicated and
fragmented into “blocks” of data, which are stored across the cluster nodes; the cluster
nodes are also called the DataNodes. The NameNode is responsible for storage and
management of metadata, so that when MapReduce or another execution framework
calls for the data, the NameNode informs it where the data that is needed resides.
Figure 1 shows the NameNode and DataNode block replication in HDFS architecture.
Main Features of Hadoop
1. HDFS creates multiple replicas of data blocks for reliability, placing them on the
computer nodes around the cluster.
2. Hadoop’s target is to run on clusters of the order of 10,000 nodes.
3. A file consists of many 64 MB blocks.
Main Components of HDFS
NameNode
NameNode is the master that contains the metadata. In general, it maintains the
directories and files and manages the blocks which are present on the DataNode. The
following are the functions of NameNode:
1. Manages namespace of the file system in memory.
2. Maintains “inode” information.
3. Maps inode to the list of blocks and locations.
4. Takes care of authorization and authentication.
5. Creates checkpoints and logs the namespace changes.
NameNode and DataNode block replication.
So the NameNode maps DataNode to the list of blocks, monitors status (health) of
DataNode and replicates the missing blocks.
DataNodes
DataNodes are the slaves which provide the actual storage and are deployed on each
machine. They are responsible for processing read and write requests for the clients.
The following are the other functions of DataNode:
1. Handles block storage on multiple volumes and also maintains block integrity.
2. Periodically sends heartbeats and also the block reports to NameNode.
Figure 2 shows, as a sequence diagram, how HDFS handles job processing requests from the user.
The user copies the input files into DFS and submits the job to the
client. The client gets the input file information from DFS, creates splits, and uploads the job
information to DFS. The JobTracker puts the ready job into an internal queue. The JobScheduler
picks the job from the queue and initializes it by creating a job object. The JobTracker
creates a list of tasks and assigns one map task to each input split. TaskTrackers send
heartbeats to the JobTracker to indicate whether they are ready to run new tasks. The JobTracker
chooses a task from the first job in the priority queue and assigns it to the TaskTracker.
Secondary NameNode is responsible for performing periodic checkpoints. These are
used to restart the NameNode in case of failure. MapReduce can then process the data
where it is located.
Main Components of MapReduce
The main components of MapReduce are listed below:
1. JobTrackers: JobTracker is the master which manages the jobs and resources
in the cluster. The JobTracker tries to schedule each map task on the TaskTracker
which is running on the same DataNode as the underlying block.
2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in
the cluster. They are responsible for running the map and reduce tasks as
instructed by the JobTracker.
3. JobHistoryServer: JobHistoryServer is a daemon that saves historical
information about completed tasks/applications.
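To make the division of labour between map and reduce tasks concrete, here is a minimal Hadoop Streaming-style word count sketch in Python. In a real job the mapper and reducer would be separate scripts reading stdin and emitting tab-separated key/value lines, and the framework would perform the sort/shuffle; here that step is simulated with a plain sort, and the input lines are illustrative.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map task: emit (word, 1) for every word seen in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce task: pairs arrive sorted by key (as the shuffle phase guarantees),
    # so counts for the same word can be summed in one pass.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["hadoop stores data in hdfs", "mapreduce processes data in parallel"]
    shuffled = sorted(mapper(lines))         # simulate the sort/shuffle step
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```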
