15CS34E Analytic Computing Key
Answer key
Part-A
1. The method of moments is a method of estimation of population parameters. One starts with deriving
equations that relate the population moments (i.e., the expected values of powers of the random variable under
consideration) to the parameters of interest.
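As an illustration of the idea above, here is a minimal Python sketch. It assumes a gamma distribution with shape k and scale theta (a choice made for this example, not part of the original answer), whose first two moments are E[X] = k*theta and Var[X] = k*theta^2. Replacing the population moments with sample moments and solving gives the estimates.

```python
from statistics import fmean, pvariance

def gamma_method_of_moments(sample):
    """Estimate gamma shape k and scale theta by matching two moments.

    Population moments: E[X] = k * theta, Var[X] = k * theta**2.
    Solving for the parameters: theta = Var / E, k = E**2 / Var.
    """
    mean = fmean(sample)
    var = pvariance(sample, mu=mean)  # second central sample moment
    theta_hat = var / mean
    k_hat = mean * mean / var
    return k_hat, theta_hat

# Illustrative data only: sample mean 5, sample variance 5.
sample = [2.0, 4.0, 6.0, 8.0]
k_hat, theta_hat = gamma_method_of_moments(sample)
```

The same recipe applies to other distributions: write the moments in terms of the parameters, substitute sample moments, and solve.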
2. Reports are important and can be valuable. Reports used correctly will add value. But reports have
their limits, and it is important to understand what they are. In the end, an organization will need both
reporting and analysis to succeed in taming big data, just as both reporting and analysis have been
utilized to tame every other data source that’s come along in the past. The key is to understand the
difference between a report and an analysis. It is also critical to understand how they both fit together.
Without that understanding, your organization won’t get it right.
3.
4. The Flajolet–Martin algorithm approximates the number of distinct elements in a
stream in a single pass, using space that is logarithmic in the maximum number of
possible distinct elements in the stream.
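A minimal sketch of the single-hash form of this idea follows. It hashes each element, tracks the largest number of trailing zero bits R seen so far, and reports 2**R. The use of MD5 truncated to 32 bits and the function names are assumptions made for illustration. Practical versions combine many hash functions and apply a correction factor to reduce variance, which this sketch omits.

```python
import hashlib

def trailing_zeros(n, width=32):
    """Number of trailing zero bits in n (defined as `width` when n == 0)."""
    if n == 0:
        return width
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def hash32(item):
    """Deterministic 32-bit hash of an item (MD5 is an illustrative choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def fm_estimate(stream):
    """One pass over the stream; only a single small integer is kept in memory."""
    max_zeros = 0
    for item in stream:
        max_zeros = max(max_zeros, trailing_zeros(hash32(item)))
    return 2 ** max_zeros
```

Because the state depends only on which hash values have been seen, repeated elements never change the estimate.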
5. An approach to improving the efficiency of the Apriori algorithm. Association rule mining is of great
importance in data mining. ... The proposed method improves the efficiency of the Apriori algorithm
by reducing the database size as well as the time wasted on scanning the transactions.
6. Clustering methods are used to identify groups of similar objects in multivariate data sets collected
from fields such as marketing, bio-medical and geo-spatial. There are different types of clustering
methods, including partitioning methods and hierarchical clustering.
7. Applying data mining techniques to large social media data sets has the potential to continue to
improve search results for everyday search engines, realize specialized target marketing for businesses,
help psychologists study behavior, provide new insights into social structure for sociologists, personalize
web.
8. OLTP and OLAP are both online processing systems. OLTP is a transactional processing system,
while OLAP is an analytical processing system. ... The basic difference between OLTP and OLAP is
that OLTP is an online database modifying system, whereas OLAP is an online database query answering
system.
9. Sharding is a type of database partitioning that separates very large databases into smaller, faster,
more easily managed parts called data shards. The word shard means a small part of a whole.
10. MapReduce Algorithm for Matrix Multiplication
Facts: The final step in the MapReduce algorithm is to produce the matrix A × B. The unit of
computation of matrix A × B is one element in the matrix: (A × B)_ij = Σ_k A_ik · B_kj.
Conclusion: The inputs to the reduce( ) step (function) of the MapReduce
algorithm are one row vector from matrix A and one column vector from matrix B.
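The following is a small single-machine Python sketch of this scheme, not a distributed implementation. The function names (map_phase, shuffle, reduce_phase) are illustrative assumptions. The map step tags each matrix entry with the output cell (i, j) it contributes to, so that each reduce call receives row i of A and column j of B.

```python
from collections import defaultdict

def map_phase(A, B):
    """Emit ((i, j), (tag, k, value)) for every output cell an entry feeds."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    for i in range(n_rows):
        for k in range(n_inner):
            for j in range(n_cols):
                yield (i, j), ("A", k, A[i][k])
    for k in range(n_inner):
        for j in range(n_cols):
            for i in range(n_rows):
                yield (i, j), ("B", k, B[k][j])

def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Input: one row of A and one column of B; output: one element of A x B."""
    row = {k: v for tag, k, v in values if tag == "A"}
    col = {k: v for tag, k, v in values if tag == "B"}
    return key, sum(row[k] * col[k] for k in row)

def mapreduce_matmul(A, B):
    groups = shuffle(map_phase(A, B))
    result = [[0] * len(B[0]) for _ in range(len(A))]
    for key, values in groups.items():
        (i, j), value = reduce_phase(key, values)
        result[i][j] = value
    return result

product = mapreduce_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each reduce call is independent of the others, which is what allows the output elements to be computed in parallel.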
Part-B
11.
The world of big data requires new levels of scalability. As the amount of data organizations process
continues to increase, the same old methods for handling data just won’t work anymore. Organizations
that don’t update their technologies to provide a higher level of scalability will quite simply choke on big
data. Technologies that help meet this need include the convergence of the analytic and data
environments, massively parallel processing (MPP) architectures, the cloud, grid computing, and MapReduce.
A History Of Scalability
Until well into the 1900s, doing analytics was very, very difficult. A deep analysis, such as building a
predictive model, required manually computing all of the statistics. Scalability of any sort was virtually
nonexistent. Since then, the amount of data has grown at least as fast as the computing power of the
machines that process it. As new big data sources become available, the boundaries are being pushed further.
It used to be that analytic professionals had to pull all their data together into a separate analytics
environment to do analysis. None of the data that was required was together in one place, and the tools
that did analysis didn’t have a way to run where the data sat. The only option was to pull the data
together in a separate analytics environment and then start performing analysis. Much of the work
analytic professionals do falls into the realm of advanced analytics, which encompasses data mining,
predictive modeling, and other advanced techniques. There is an interesting parallel between what
analysts did in the early days and what data warehousing is all about. There’s not much difference
between a data set as analysts define it and a “table” in a database. Analysts have done “merges” of
their data sets for years. That is the exact same thing as a “join” of tables in a database.
In both a merge and a join, two or more data sets or tables are combined. Analysts do what is
called “data preparation.” In the data warehousing world this process is called “extract, transform, and
load (ETL).” Basically, analysts were building custom data marts and mini–data warehouses before the
terms data mart or data warehouse were invented. The Relational Database Management System
(RDBMS) then started not only to become popular, but also to become more scalable and more widely adopted.
Initially, databases were built for each specific purpose or team, and relational databases were spread all
over an organization. Such single-purpose databases are often called “data marts.” Combining the various
database systems into one big system yields what is called an Enterprise Data Warehouse (EDW).
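The equivalence between an analyst's merge and a database join, noted above, can be sketched in Python. The data, the table and column names, and the helper functions are all hypothetical. The merge is done by hand on in-memory data sets, and the join is run through the standard-library sqlite3 module.

```python
import sqlite3

# Hypothetical data sets: (customer id, name) and (customer id, order amount).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, 20.0), (1, 5.0), (3, 12.5)]

def merge(left, right):
    """Analyst-style merge: match rows of two data sets on their first field."""
    index = {}
    for key, amount in right:
        index.setdefault(key, []).append(amount)
    return sorted(
        (key, name, amount)
        for key, name in left
        for amount in index.get(key, [])
    )

def join(left, right):
    """The same combination expressed as an inner join of two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", left)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", right)
    rows = conn.execute(
        "SELECT c.id, c.name, o.amount FROM customers c "
        "JOIN orders o ON o.customer_id = c.id"
    ).fetchall()
    conn.close()
    return sorted(rows)

merged = merge(customers, orders)
joined = join(customers, orders)
```

Both routes produce the same rows; the difference lies in where the work runs, not in what is computed.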
14.a) We can classify the visualization of social networks into four main groups,
depending on the main focus of the predominant visual task for which the
visualization metaphor is envisioned. These are: structural, the most common
representation thus far; semantic, which emphasizes the meaning of the entities
and relationships over their structure; temporal; and statistical.
Structural Visualization
A structural visualization of the social networks focuses precisely on that,
its structure. The structure can be thought of as the topology of a graph that
represents the actors and relationships in a social network. There are two
predominant approaches to structural visualization: node-link diagrams and
matrix-oriented methods. While node-link diagrams are easy to interpret and
depict explicitly the links between nodes, matrix-oriented representations usually
make better use of limited display area, which is critical now that visualizations
are available on a multitude of devices. In recent years, we have seen
efforts to combine the best of the two types into a series of hybrid representations.
The value of network layout in visualization. The purpose of
a visualization is to allow users to gain insight on the complexity of social dynamics.
Therefore, an aspect such as the layout of network elements must be
evaluated with respect to its value toward that goal. Numerous layout algorithms
have been proposed, each of them with its strengths and weaknesses. In
general, we can highlight a number of high level properties that a layout must
satisfy: (1) The positioning of nodes and links in a diagram should facilitate
the readability of the network. (2) The positioning of nodes should help uncover
any inherent clusterability of the network, i.e., if a certain group of nodes
is considered to be a cluster, the user should be able to extract that information
from the layout. (3) The position of nodes should result in a trustworthy
representation of the social network.
Matrix-oriented Techniques. Matrix-oriented techniques are
visualization metaphors designed to represent a social network via an explicit
display of its adjacency or incidence matrix. In this visualization, each link is
represented as a grid location with cartesian coordinates corresponding to the
nodes in the network. Color and opacity are often used to represent important
structural or semantic quantities. Consequently, a visual representation of the
adjacency matrix is much more compact than a node-link diagram and is overlap
free, since each link occupies its own cell.
One of the challenges with this matrix representation is enabling the users
to visually identify local and global structures in the network. In general, this
depends on the way the nodes are ordered along the axes of the matrix view.
When the order of nodes is arbitrary, the matrix view may not exhibit the clustering
present in the data. Reordering of a matrix is a complex combinatorial
problem, and depends on a given objective function. In general, if we follow
the high-level desired properties of a good network visualization, it should retain
clusters in close proximity.
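The matrix view and the effect of node ordering described above can be sketched in Python. The example graph, the text rendering, and the use of matrix bandwidth as the objective function are all assumptions made for illustration; other objective functions are possible.

```python
def adjacency_matrix(edges, order):
    """Build a symmetric 0/1 adjacency matrix with rows and columns in `order`."""
    position = {node: idx for idx, node in enumerate(order)}
    size = len(order)
    matrix = [[0] * size for _ in range(size)]
    for u, v in edges:
        matrix[position[u]][position[v]] = 1
        matrix[position[v]][position[u]] = 1
    return matrix

def render(matrix):
    """Text stand-in for the grid display: '#' marks a link, '.' marks no link."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )

def bandwidth(matrix):
    """Largest distance of any link cell from the diagonal (smaller is tighter)."""
    return max(
        (abs(i - j)
         for i, row in enumerate(matrix)
         for j, cell in enumerate(row) if cell),
        default=0,
    )

# Hypothetical network with two triangles: {a, b, c} and {x, y, z}.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z")]
arbitrary = adjacency_matrix(edges, ["a", "x", "b", "y", "c", "z"])
clustered = adjacency_matrix(edges, ["a", "b", "c", "x", "y", "z"])
```

With the clustered ordering, the two groups appear as two filled blocks along the diagonal; with the arbitrary ordering, the same links are scattered across the grid.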
Hybrid Techniques. Node-link diagrams are easy to interpret
when compared to adjacency matrix diagrams, but the latter are usually more
effective at showing overall structure when the network is dense. This observation
has led to a number of hybrid techniques, which combine matrix-oriented
techniques with node-link diagrams in an attempt to overcome the issues associated
with each of them.
Semantic and Temporal Visualization
Although structural visualizations unify the depiction of both overviews
and detail information, they are less effective when the social network becomes
large. Node-link diagrams become rapidly cluttered and algorithmic layouts
often result in "hairballs", as highly connected nodes tend to concentrate in
the center of the display. For this reason, recent approaches have focused on
a different aspect of social networks: semantics. Instead of highlighting the
explicit relationships found in the data, these represent high level attributes
and connections of actors and links, either as specified explicitly in the data, or
implicitly inferred from cross-referencing external sources.
Ontology-based Visualization. One such semantic visualization
is the use of ontologies to represent the types of actors and relationships in a
social network. An example is Ontovis, where a typical node-link diagram
is enhanced with a high-level graphical representation of the ontology [38].
An ontology is a graph whose nodes represent node types and whose links represent
types of relationships. Once an ontology is known, either given explicitly or
extracted from the data itself, a social network can be represented implicitly
by the relationships in this graph. The goal in this type of visualization is not
necessarily to discover the structural properties of the network, but rather the
distribution of attributes of nodes and links.
Temporal Visualization. A special type of semantic information
that has captured the attention of social network visualization researchers
is time. Since social interaction is a time-dependent phenomenon, it is only
natural to represent the temporal dimension using visual means. Nonetheless,
the visualization of dynamic networks has not been explored in depth. One
can argue that one of the reasons for this lack of treatment is that handling
the temporal dimension from a structural point of view alone is limited and
insufficient. A semantic approach, such as the ones suggested above, seems more
appropriate. However, time as a dimension deserves a treatment different from
other data-specific node attributes. One of the difficulties in representing time is
a shortage of dimensions to depict a dynamic network in a 2D display. As
an alternative, one can represent time along precisely a temporal dimension.
Moody et al. consider two types of such dynamic visualizations: flipbooks,
where nodes remain static while the edges vary over time, and movies, where
nodes are allowed to move as the relationships change.
Statistical Visualization
An important aspect of visualization is the interplay between analytical and
visual tools to gain insight. This is at the core of visual analytics. For this
reason, analysts often explore summary visualizations to understand the distributions
of variables of interest. These variables often correspond to network
statistics that represent the structure of the network, such as degree, centrality
and the clustering coefficient. While the first two describe the importance of nodes
across the network, the latter metric indicates how clusterable the nodes in
a network are.
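Two of these network statistics can be computed with a short Python sketch. The example graph and function names are hypothetical. The clustering coefficient used here is the common local definition: the fraction of pairs of a node's neighbours that are themselves linked.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) links."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, node):
    """Number of links attached to a node."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbours
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical network: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = build_adjacency(edges)
degrees = {node: degree(adj, node) for node in adj}
coefficients = {node: clustering_coefficient(adj, node) for node in adj}
```

A summary visualization would then plot the distribution of these per-node values, for example as a histogram of degrees.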
(ii) Many forms of dynamic and streaming algorithms can be used to perform trend
detection in a wide variety of social networking applications.
In such applications, the data is dynamically clustered in a streaming fashion, and the
clusters can be used to determine the important patterns of change.
Examples of such streaming data include multi-dimensional data, text streams,
streaming time-series data, and trajectory data. Key trends and events in the data can
be discovered with the use of clustering methods.
Social tagging can be useful in areas including indexing, search,
taxonomy generation, clustering, classification, social interest discovery, etc.
Indexing
Tags can be useful for indexing sites faster. Users bookmark sites launched
by their friends or colleagues before a search engine bot can find them. Tags
are also useful in deeper indexing.
Search
Tags have been found useful for web search, personalized search and enterprise
search. Tags offer multiple descriptions of a given resource, which
potentially increases the likelihood that searcher and tagger find a common
language, and thus the use of tags may enhance retrieval effectiveness. Social
bookmarking can provide search data not currently provided by other sources.
Semantic Query Expansion. Schenkel et al. [45] develop an incremental
top-k algorithm for answering queries of the form Q(u, q1, q2, ..., qn)
where u is the user and q1, q2, ..., qn are the query keywords.
Enhanced Similarity Measure. Estimating similarity between
a query and a web page is important to the web search problem. Tags provided
by web users offer different perspectives and so are usually good summaries
of the corresponding web pages; they provide new metadata for the similarity
calculation between a query and a web page.
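One simple way to use tags as metadata in a similarity calculation is sketched below in Python. Jaccard similarity over tag sets is an assumption made for this illustration, not the measure used in the cited work, and the pages and tags are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def rank_pages(query_terms, page_tags):
    """Score each page by the overlap of its tags with the query keywords."""
    scores = {page: jaccard(query_terms, tags) for page, tags in page_tags.items()}
    # Highest score first; ties are broken by page name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical pages with user-assigned tags.
page_tags = {
    "p1": ["python", "tutorial", "code"],
    "p2": ["cooking", "recipes"],
    "p3": ["python", "snakes"],
}
ranking = rank_pages(["python", "tutorial"], page_tags)
```

In practice such a tag-based score would be combined with content-based similarity rather than replace it.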
Enhanced Static Ranking. Estimating the quality of a web
page is also important to the web search problem. The number of annotations
assigned to a page indicates its popularity and, in some sense, its
quality. In order to explore social tags for measuring the quality of web pages,
researchers have exploited the tagging graph.
Personalized Search. Furthermore, personal tags are naturally
good resources for describing a person’s interests, so personalized search could
be enhanced by exploring personal tags.