15CS34E Analytic Computing Answer Key Part-A
Answer key
Part-A
1. The method of moments is a method of estimating population parameters. One starts by deriving
equations that relate the population moments (i.e., the expected values of powers of the random variable under
consideration) to the parameters of interest; these equations are then solved using the sample moments in place
of the unknown population moments, which yields estimates of the parameters.
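For illustration, here is a minimal method-of-moments sketch in Python (an assumed example, not part of the original answer): the shape k and scale theta of a Gamma distribution are estimated by equating the first two population moments, E[X] = k*theta and Var[X] = k*theta^2, to their sample counterparts, giving theta = Var/Mean and k = Mean^2/Var.

# Method-of-moments sketch (illustrative): estimate the shape k and scale
# theta of a Gamma distribution from the first two sample moments.
#   E[X] = k * theta,  Var[X] = k * theta**2
#   => theta = Var/Mean, k = Mean**2/Var
import numpy as np

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # synthetic data

mean = sample.mean()
var = sample.var()

theta_hat = var / mean          # scale estimate
k_hat = mean ** 2 / var         # shape estimate
print(k_hat, theta_hat)         # close to the true values 2.0 and 3.0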
2. Reports are important and can be valuable. Reports used correctly will add value. But reports have
their limits, and it is important to understand what they are. In the end, an organization will need both
reporting and analysis to succeed in taming big data, just as both reporting and analysis have been
utilized to tame every other data source that’s come along in the past. The key is to understand the
difference between a report and an analysis. It is also critical to understand how they both fit together.
Without that understanding, your organization won’t get it right.
3.
4. The Flajolet–Martin algorithm approximates the number of distinct elements in a
stream with a single pass, using space that is logarithmic in the maximum number of
possible distinct elements in the stream.
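A minimal single-hash sketch of the idea (illustrative only; the hash choice and the 0.77351 correction factor follow the standard textbook presentation, and a practical version averages many hash functions to reduce variance):

# Flajolet-Martin sketch: hash each element, track the maximum number of
# trailing zeros R of the hash values, and estimate the count as ~ 2**R.
import hashlib

def trailing_zeros(x: int) -> int:
    if x == 0:
        return 32
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r / 0.77351   # standard correction factor

print(fm_estimate(["a", "b", "c", "a", "b", "d"]))  # rough estimate of 4 distinct items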
6. Clustering methods are used to identify groups of similar objects in multivariate data sets collected
from fields such as marketing, biomedicine and geo-spatial analysis. There are different types of clustering
methods, including partitioning methods and hierarchical clustering.
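A brief illustrative sketch, assuming scikit-learn is available, showing one partitioning method (k-means) and one hierarchical method (agglomerative clustering) applied to the same synthetic data:

# Partitioning (k-means) and hierarchical (agglomerative) clustering on
# two well-separated synthetic groups of points.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(kmeans_labels[:5], hier_labels[:5])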
8. OLTP and OLAP are both online processing systems. OLTP is a transaction processing system,
while OLAP is an analytical processing system. The basic difference between OLTP and OLAP is
that OLTP is an online database modifying system, whereas OLAP is an online database query answering
system.
9. Sharding is a type of database partitioning that separates very large databases into smaller, faster,
more easily managed parts called data shards. The word shard means a small part of a whole.
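A hypothetical sketch of hash-based sharding (the shard names and the modulo routing scheme are illustrative assumptions, not a specific database's API):

# Route each record to one of several shards by hashing its key.
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]   # illustrative shard names

def shard_for(key: str) -> str:
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("customer:1001"))   # each key maps deterministically to one shard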
10. MapReduce Algorithm for Matrix Multiplication
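In the commonly taught one-step scheme for C = A x B, the map phase emits each element of A and B once for every output cell (i, k) it contributes to, and the reduce phase multiplies matching elements and sums them. The sketch below simulates this in plain Python (the dictionary-of-nonzeros representation and the local grouping step stand in for the framework's shuffle; it is an illustration, not Hadoop code):

from collections import defaultdict

def map_phase(A, B, n_cols_B, n_rows_A):
    # A and B are dicts {(i, j): value} of non-zero entries.
    for (i, j), a in A.items():            # A[i, j] is needed for every column k of B
        for k in range(n_cols_B):
            yield (i, k), ("A", j, a)
    for (j, k), b in B.items():            # B[j, k] is needed for every row i of A
        for i in range(n_rows_A):
            yield (i, k), ("B", j, b)

def reduce_phase(pairs):
    grouped = defaultdict(list)            # stands in for the shuffle/sort step
    for key, value in pairs:
        grouped[key].append(value)
    C = {}
    for (i, k), values in grouped.items():
        a_vals = {j: v for tag, j, v in values if tag == "A"}
        b_vals = {j: v for tag, j, v in values if tag == "B"}
        C[(i, k)] = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    return C

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # 2x2 matrices
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(reduce_phase(map_phase(A, B, n_cols_B=2, n_rows_A=2)))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}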
Part-B
11.
The world of big data requires new levels of scalability. As the amount of data organizations process
continues to increase, the same old methods for handling data just won’t work anymore. Organizations
that don’t update their technologies to provide a higher level of scalability will quite simply choke on big
data. Key enablers of this scalability include the convergence of the analytic and data environments,
massively parallel processing (MPP) architectures, the cloud, grid computing, and MapReduce.
A History Of Scalability
Until well into the 1900s, doing analytics was very, very difficult. A deep analysis, such as building a
predictive model, required manually computing all of the statistics. Scalability of any sort was virtually
nonexistent. The amount of data has grown at least as fast as the computing power of the machines that
process it. As new big data sources become available the boundaries are being pushed further.
It used to be that analytic professionals had to pull all their data together into a separate analytics
environment to do analysis. None of the data that was required was together in one place, and the tools
that did analysis didn’t have a way to run where the data sat. The only option was to pull the data
together in a separate analytics environment and then start performing analysis. Much of the work
analytic professionals do falls into the realm of advanced analytics, which encompasses data mining,
predictive modeling, and other advanced techniques. There is an interesting parallel between what
analysts did in the early days and what data warehousing is all about. There’s not much difference
between a data set as analysts define it and a “table” in a database. Analysts have done “merges” of
their data sets for years. That is the exact same thing as a “join” of tables in a database.
In both a merge and a join, two or more data sets or tables are combined together. Analysts do what is
called “data preparation.” In the data warehousing world this process is called “extract, transform, and
load (ETL).” Basically, analysts were building custom data marts and mini–data warehouses before the
terms data mart or data warehouse were invented. The Relational Database Management System
(RDBMS) started to not only become popular, but to become more scalable and more widely adopted.
Initially, databases were built for each specific purpose or team, and relational databases were spread
all over an organization. Such single-purpose databases are often called "data marts." Combining the
various database systems into one big system yields an Enterprise Data Warehouse (EDW).
14.a) We can classify the visualization of social networks in four main groups,
depending on the main focus of the predominant visual task for which the
visualization metaphor is envisioned. These are: structural, the most common
representation thus far; semantic, which emphasizes the meaning of the entities
and relationships over their structure; temporal; and statistical.
Structural Visualization
A structural visualization of the social networks focuses precisely on that,
its structure. The structure can be thought of as the topology of a graph that
represents the actors and relationships in a social network. There are two
predominant approaches to structural visualization: node-link diagrams and
matrix-oriented methods. While node-link diagrams are easy to interpret and
depict explicitly the links between nodes, matrix-oriented representations usually
make better use of limited display area, which is critical given today's availability
of visualizations on a multitude of devices. In recent years, we have seen
efforts to combine the best of the two types into a series of hybrid representations.
The value of network layout in visualization. The purpose of
a visualization is to allow users to gain insight into the complexity of social dynamics.
Therefore, an aspect such as the layout of network elements must be
evaluated with respect to its value toward that goal. Numerous layout algorithms
have been proposed, each of them with its strengths and weaknesses. In
general, we can highlight a number of high level properties that a layout must
satisfy: (1) The positioning of nodes and links in a diagram should facilitate
the readability of the network. (2) The positioning of nodes should help uncover
any inherent clusterability of the network, i.e., if a certain group of nodes
is considered to be a cluster, the user should be able to extract that information
from the layout, and (3) the position of nodes should result in a trustworthy
representation of the social network.
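As a small illustration of a structural, node-link visualization (assuming networkx and matplotlib are installed), a force-directed spring layout is one common way to pursue the readability and clusterability properties listed above:

# Draw a node-link diagram of a small built-in social network using a
# force-directed (spring) layout.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()                     # small built-in social network
pos = nx.spring_layout(G, seed=42)             # force-directed layout
nx.draw(G, pos, node_size=80, with_labels=False)
plt.savefig("karate_layout.png")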
Matrix-oriented Techniques. Matrix-oriented techniques are
visualization metaphors designed to represent a social network via an explicit
display of its adjacency or incidence matrix. In this visualization, each link is
represented as a grid location with Cartesian coordinates corresponding to the
nodes in the network. Color and opacity are often used to represent important
structural or semantic quantities. Consequently, a visual representation of the
adjacency matrix is much more compact than a node-link diagram and is overlap
free, since each link is drawn in its own cell of the grid.
One of the challenges with this matrix representation is enabling the users
to visually identify local and global structures in the network. In general, this
depends on the way the nodes are ordered along the axes of the matrix view.
When the order of nodes is arbitrary, the matrix view may not exhibit the clustering
present in the data. Reordering of a matrix is a complex combinatorial
problem, and depends on a given objective function. In general, if we follow
the high-level desired properties of a good network visualization, it should retain
clusters in close proximity.
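A small sketch of the matrix-oriented view (assuming networkx and numpy), in which the adjacency matrix is reordered by detected communities so that clusters appear as dense blocks near the diagonal; the community-detection step here is an illustrative choice:

# Build the adjacency matrix of a small network and reorder its rows and
# columns by community membership.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
communities = nx.algorithms.community.greedy_modularity_communities(G)
order = [n for community in communities for n in sorted(community)]
A = nx.to_numpy_array(G, nodelist=order)       # reordered adjacency matrix
print(A.shape, A.sum())                        # 34 x 34 matrix; sum = 2 * number of edges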
Hybrid Techniques. Node-link diagrams are easy to interpret
when compared to adjacency matrix diagrams, but the latter are usually more
effective to show overall structure where the network is dense. This observation
has led to a number of hybrid techniques, which combine matrix-oriented
techniques with node-link diagrams in an attempt to overcome the issues associated
with each of them.
Semantic and Temporal Visualization
Structural visualizations, although they unify the depiction of both overview
and detail information, are less effective when the social network becomes
large. Node-link diagrams become rapidly cluttered and algorithmic layouts
often result in "hairballs", as highly connected nodes tend to concentrate in
the center of the display. For this reason, recent approaches have focused on
a different aspect of social networks: semantics. Instead of highlighting the
explicit relationships found in the data, these represent high level attributes
and connections of actors and links, either as specified explicitly in the data, or
implicitly inferred from cross-referencing external sources.
Ontology-based Visualization. One such semantic visualization
is the use of ontologies to represent the types of actors and relationships in a
social network. An example is Ontovis, where a typical node-link diagram
is enhanced with a high-level graphical representation of the ontology [38].
An ontology is a graph whose nodes represent node types and links represent
types of relationships. Once an ontology is known, either given explicitly or
extracted from the data itself, a social network can be represented implicitly
by the relationships in this graph. The goal in this type of visualization is not
necessarily to discover the structural properties of the network, but rather the
distribution of attributes of nodes and links.
Temporal Visualization. A special type of semantic information
that has captured the attention of social network visualization researchers
is time. Since social interaction is a time-dependent phenomenon, it is only
natural to represent the temporal dimension using visual means. Nonetheless,
the visualization of dynamic networks has not been explored in depth. One
can argue that one of the reasons for this lack of treatment is that handling
the temporal dimension from a structural point of view alone is limited and
insufficient. A semantic approach, such as the ones suggested above, seems more
appropriate. However, time as a dimension deserves a treatment different from
other data-specific node attributes. One of the difficulties in representing time is
a shortage of dimensions to depict a dynamic network in a 2D display. As
an alternative, one can represent time along precisely a temporal dimension.
Moody et al. consider two types of such dynamic visualizations: flipbooks,
where nodes remain static while the edges vary over time, and movies, where
nodes are allowed to move as the relationships change.
Statistical Visualization
An important aspect of visualization is the interplay between analytical and
visual tools to gain insight. This is at the core of visual analytics. For this
reason, analysts often explore summary visualizations to understand the distributions
of variables of interest. These variables often correspond to network
statistics that represent the structure of the network, such as degree, centrality
and the clustering coefficient. While the first two describe the importance of nodes
across the network, the latter metric indicates how clusterable the nodes in
a network are.
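An illustrative computation of these summary statistics (assuming networkx) on a small built-in social network:

# Degree, betweenness centrality and clustering coefficient for each node.
import networkx as nx

G = nx.karate_club_graph()
degrees = dict(G.degree())
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)                  # per-node clustering coefficient
print(max(degrees, key=degrees.get),           # most connected actor
      max(betweenness, key=betweenness.get),   # most "between" actor
      sum(clustering.values()) / len(clustering))  # average clustering coefficient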
(ii) Many forms of dynamic and streaming algorithms can be used to perform trend
detection in a wide variety of social networking applications.
In such applications, the data is dynamically clustered in a streaming fashion and can
be used in order to determine the important patterns of changes.
Examples of such streaming data could be multi-dimensional data, text streams,
streaming time-series data, and trajectory data. Key trends and events in the data can
be discovered with the use of clustering methods. Social tagging can be useful in areas
including indexing, search,
taxonomy generation, clustering, classification, social interest discovery, etc.
Indexing
Tags can be useful for indexing sites faster. Users bookmark sites launched
by their friends or colleagues before a search engine bot can find them. Tags
are also useful in deeper indexing.
Search
Tags have been found useful for web search, personalized search and enterprise
search. Tags offer multiple descriptions of a given resource, which
potentially increases the likelihood that searcher and tagger find a common
language; thus, by using tags, retrieval effectiveness may be enhanced. Social
bookmarking can provide search data not currently provided by other sources.
Semantic Query Expansion. Schenkel et al. [45] develop an incremental
top-k algorithm for answering queries of the form Q(u, q1, q2, ..., qn)
where u is the user and q1, q2, ..., qn are the query keywords.
Enhanced Similarity Measure. Estimating similarity between
a query and a web page is important to the web search problem. Tags provided
by web users provide different perspectives and so are usually good summaries
of the corresponding web pages and provide a new metadata for the similarity
calculation between a query and a web page.
Enhanced Static Ranking. Estimating the quality of a web
page is also important to the web search problem. The number of annotations
assigned to a page indicates its popularity and, in some sense, its quality.
In order to explore social tags for measuring the quality of web pages,
researchers have exploited the tagging graph.
Personalized Search. Furthermore, personal tags are naturally
good resources for describing a person's interests, so personalized search could
be enhanced by exploring personal tags.
(b) (i) Hadoop Framework
Apache Hadoop is a framework that allows distributed processing of large data sets
across clusters of commodity computers using a simple programming model. It is
designed to scale up from single servers to thousands of machines, each providing local
computation and storage. Rather than rely on hardware to deliver high-availability, the
framework itself is designed to detect and handle failures at the application layer, thus
delivering a highly available service on top of a cluster of computers, each of which may
be prone to failures.
In short, Hadoop is an open-source software framework for storing and processing big
data in a distributed way on large clusters of commodity hardware. Basically, it
accomplishes the following two tasks:
1. Massive data storage.
2. Faster processing.
Advantage of Hadoop
Problems in data transfer made organizations think about an alternative approach. The
following examples explain the motivation for Hadoop.
A typical disk transfer speed is around 100 MB/s and a standard disk holds 1 TB, so the time to read an
entire disk is about 10,000 s, or roughly 3 h. Simply increasing processing speed may not be very helpful,
for two reasons:
Network bandwidth is now more of a limiting factor.
The physical limits of processor chips have been reached.
If 100 TB of data is to be scanned on a 1000-node cluster, then:
in the case of remote storage with 10 MB/s bandwidth per node, it would take about 165 min;
in the case of local storage with 50 MB/s per node, it would take about 33 min.
So it is better to move computation to the data than to move the data.
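A quick back-of-the-envelope check of those figures (assuming the bandwidths are per-node rates in MB/s, which is what makes the quoted times come out):

# 100 TB spread over 1000 nodes means ~100 GB scanned by each node.
dataset_mb = 100 * 1_000_000          # 100 TB expressed in MB
nodes = 1000
per_node_mb = dataset_mb / nodes      # 100,000 MB per node

for rate_mb_s in (10, 50):            # remote vs. local storage bandwidth
    minutes = per_node_mb / rate_mb_s / 60
    print(rate_mb_s, "MB/s ->", round(minutes), "min")   # ~167 min and ~33 min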
Taking care of hardware failure cannot be optional in big data analytics; it has to be the rule.
With 1000 nodes, we need to consider, say, 4000 disks, 8000
cores, 25 switches, 1000 NICs and 2000 RAM modules (16 TB). The mean time between failures
could be even less than a day, since commodity hardware is used. There is a need for a
fault-tolerant store to guarantee reasonable availability.
Hadoop Goals
The main goals of Hadoop are listed below:
1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerance: It is designed with a very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handle hardware failures: The resiliency of these clusters comes from the
software’s ability to detect and handle failures at the application layer.
The Hadoop framework can store huge amounts of data by dividing the data into blocks
and storing it across multiple computers, and computations can be run in parallel across
multiple connected machines.
Hadoop gained its importance because of its ability to process the huge volume and variety
of data generated every day, especially from automated sensors and social media, using
low-cost commodity hardware.
Since processing is done in batches, throughput is high but latency is also high. Latency is
the time (in minutes, seconds or clock periods) taken to perform some action or produce some
result, whereas throughput is the number of such actions executed or results produced
per unit of time. The throughput of a memory system is termed memory bandwidth.
Hadoop Assumptions
Hadoop was developed with large clusters of computers in mind with the following
assumptions:
1. Hardware will fail, since it considers a large cluster of computers.
2. Processing will be run in batches; so aims at high throughput as opposed to low
latency.
3. Applications that run on Hadoop Distributed File System (HDFS) have large
datasets typically from gigabytes to terabytes in size.
4. Portability is important.
5. It should provide high aggregate data bandwidth and scale to hundreds of nodes in
a single cluster.
6. Should support tens of millions of files in a single instance.
7. Applications need a write-once-read-many access model.
(ii) Core Components of Hadoop
Hadoop consists of the following components:
1. Hadoop Common: This package provides file system and OS-level abstractions.
It contains libraries and utilities required by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
that provides a limited interface for managing the file system.
3. Hadoop MapReduce: MapReduce is the key algorithm that the Hadoop
MapReduce engine uses to distribute work around a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): It is a
resource-management platform responsible for managing compute resources in
clusters and using them for scheduling users' applications.
Hadoop Common Package
This consists of necessary Java archive (JAR) files and scripts needed to start Hadoop.
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher version. The standard
start-up and shut-down scripts need Secure Shell (SSH) to be setup between the nodes
in the cluster.
HDFS (storage) and MapReduce (processing) are the two core components of Apache
Hadoop. Both HDFS and MapReduce work in unison and they are co-deployed, such
that there is a single cluster that provides the ability to move computation to the data.
Thus, the storage system HDFS is not physically separate from a processing system
MapReduce.
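As a minimal illustration of the map/reduce programming model (simulated locally in Python; on a real cluster the map and reduce functions would be executed by the framework, for example via Hadoop Streaming), the classic word count emits (word, 1) pairs in the map phase, groups them by key in the shuffle, and sums the counts in the reduce phase:

from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1          # emit (key, 1) for every word

def reducer(word, counts):
    return word, sum(counts)           # sum the values for each key

lines = ["Hadoop stores big data", "MapReduce processes big data"]
shuffled = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        shuffled[word].append(count)   # shuffle/sort: group values by key

print(dict(reducer(w, c) for w, c in shuffled.items()))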
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that provides a limited interface for managing the file
system to allow it to scale and provide high throughput. HDFS creates multiple replicas
of each data block and distributes them on computers throughout a cluster to enable
reliable and rapid access. When a file is loaded into HDFS, it is replicated and
fragmented into “blocks” of data, which are stored across the cluster nodes; the cluster
nodes are also called the DataNodes. The NameNode is responsible for storage and
management of metadata, so that when MapReduce or another execution framework
calls for the data, the NameNode informs it where the data that is needed resides.
Figure 1 shows the NameNode and DataNode block replication in HDFS architecture.
Main Features of Hadoop
1. HDFS creates multiple replicas of data blocks for reliability, placing them on the
computer nodes around the cluster.
2. Hadoop’s target is to run on clusters of the order of 10,000 nodes.
3. A file consists of many 64 MB blocks.
Main Components of HDFS
NameNode
NameNode is the master that contains the metadata. In general, it maintains the
directories and files and manages the blocks which are present on the DataNode. The
following are the functions of NameNode:
1. Manages namespace of the file system in memory.
2. Maintains “inode” information.
3. Maps inode to the list of blocks and locations.
4. Takes care of authorization and authentication.
5. Creates checkpoints and logs the namespace changes.
(Figure 1: NameNode and DataNode block replication.)
So the NameNode maps DataNode to the list of blocks, monitors status (health) of
DataNode and replicates the missing blocks.
DataNodes
DataNodes are the slaves which provide the actual storage and are deployed on each
machine. They are responsible for processing read and write requests for the clients.
The following are the other functions of DataNode:
1. Handles block storage on multiple volumes and also maintains block integrity.
2. Periodically sends heartbeats and also the block reports to NameNode.
Figure 2 shows how Hadoop handles job processing requests from the user in the form of a
sequence diagram. The user copies the input files into DFS and submits the job to the
client. The client gets the input file information from DFS, creates splits and uploads the job
information to DFS. The JobTracker puts the ready job into its internal queue. The JobScheduler
picks the job from the queue and initializes it by creating a job object. The JobTracker
creates a list of tasks and assigns one map task for each input split. TaskTrackers send
heartbeats to the JobTracker to indicate whether they are ready to run new tasks. The JobTracker
chooses a task from the first job in the priority queue and assigns it to a TaskTracker.
Secondary NameNode is responsible for performing periodic checkpoints. These are
used to restart the NameNode in case of failure. MapReduce can then process the data
where it is located.
Main Components of MapReduce
The main components of MapReduce are listed below:
1. JobTrackers: JobTracker is the master which manages the jobs and resources
in the cluster. The JobTracker tries to schedule each map task on the TaskTracker
that is running on the same DataNode as the underlying data block.
2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in
the cluster. They are responsible for running the map and reduce tasks as
instructed by the JobTracker.
3. JobHistoryServer: JobHistoryServer is a daemon that saves historical
information about completed tasks/applications.