
International Journal of Science Engineering and Advance Technology (IJSEAT), Vol. 3, Issue 9, September 2015, ISSN 2321-6905

Data Mining with Big Data Using HACE Theorem


Pawan P.¹, Trivikram Rao²
¹M.Tech (CSE), Usha Rama College of Engineering and Technology, A.P., India.
²Asst. Professor, Dept. of Information Technology, Usha Rama College of Engineering and Technology, A.P., India.

Abstract — The term Big Data comprises large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.

Keywords — Big Data, data mining, heterogeneity, autonomous sources, complex and evolving associations

I. INTRODUCTION

By some estimates, 2.5 quintillion bytes of data are created every day, and 90 percent of the data in the world today were produced within the past two years [16]. Our capability for data generation has never been so powerful and enormous since the invention of information technology in the early 19th century. Consider an example: Dr. Yan Mo won the 2012 Nobel Prize in Literature, probably the most controversial Nobel Prize in this category. Searching Google for "Yan Mo Nobel Prize" returned 1,050,000 web pointers on the Internet (as of 3 January 2013). "For all praises as well as criticisms," Mo said recently, "I am grateful." What types of praises and criticisms has Mo actually received over his 31-year writing career? As comments keep coming in on the Internet and in various news media, can we summarize all types of opinions in different media in a real-time fashion, including updated, cross-referenced discussions by critics? This type of summarization program is an excellent example of Big Data processing, as the information comes from multiple, heterogeneous, autonomous sources with complex and evolving relationships, and it keeps growing.

As another example, on 4 October 2012 the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within two hours [1]. Among all these tweets, the specific moments that generated the most discussion revealed public interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means of sensing public interests and generating feedback in real time, and are more appealing than generic media such as radio or TV broadcasting. Another example is Flickr, a public picture-sharing site, which received 1.8 million photos per day, on average, from February to March 2012 [2]. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day. Indeed, as the old saying goes, "a picture is worth a thousand words"; the billions of pictures on Flickr are a treasure trove for exploring human society, social events, public affairs, disasters, and so on, if only we have the power to harness this enormous amount of data.

The above examples demonstrate the rise of Big Data applications, where data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a "tolerable elapsed time." The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions [3]. In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible. For example, the Square Kilometre Array (SKA) [4] in radio astronomy consists of 1,000 to 1,500 15-meter dishes in a central 5-km area. It provides 100 times more sensitive vision than any existing radio telescope, answering fundamental questions about the Universe. However, with a data volume of 40 gigabytes (GB) per second, the data generated by the SKA are exceptionally large. Although researchers have confirmed that interesting patterns, such as transient radio anomalies [5], can be discovered from SKA data, existing methods can only work in an offline fashion and are incapable of handling this Big Data scenario in real time. As a result, the unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.
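To make the storage arithmetic in these examples concrete, the short Python sketch below estimates the daily volumes implied by the Flickr and SKA figures quoted above; the helper function and unit constants are illustrative only.

```python
# Rough daily-volume estimates for the Flickr and SKA examples quoted above.
# The helper function and constants are illustrative only.

def daily_volume_bytes(rate_per_second: float = 0.0,
                       items_per_day: float = 0.0,
                       bytes_per_item: float = 0.0) -> float:
    """Bytes accumulated over one day from a byte rate and/or an item count."""
    seconds_per_day = 24 * 60 * 60
    return rate_per_second * seconds_per_day + items_per_day * bytes_per_item

MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15  # decimal units, as used in the text

flickr = daily_volume_bytes(items_per_day=1.8e6, bytes_per_item=2 * MB)
ska = daily_volume_bytes(rate_per_second=40 * GB)

print(f"Flickr: ~{flickr / TB:.1f} TB/day")  # ~3.6 TB/day, matching the text
print(f"SKA:    ~{ska / PB:.2f} PB/day")     # ~3.46 PB/day at 40 GB/s
```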


II. BIG DATA CHARACTERISTICS: HACE THEOREM

HACE Theorem. Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data. These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naïve sense, we can imagine that a number of blind men are trying to size up a giant elephant (see Fig. 1), which, in this context, is the Big Data. The goal of each blind man is to draw a picture (or conclusion) of the elephant from the part of the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant "feels" like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, let us assume that 1) the elephant is growing rapidly and its pose changes constantly, and 2) each blind man may have his own (possibly unreliable and inaccurate) information sources that give him biased knowledge about the elephant (e.g., one blind man may exchange his impression of the elephant with another blind man, where the exchanged knowledge is inherently biased). Exploring the Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources (blind men) to help draw the best possible picture and reveal the genuine gesture of the elephant in a real-time fashion. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, considering that each individual may speak a different language (heterogeneous and diverse information sources) and that they may even have privacy concerns about the messages they deliver in the information exchange process.

Figure 1. The blind men and the giant elephant: the localized (limited) view of each blind man leads to a biased conclusion.

Autonomous Sources with Distributed and Decentralized Control

Autonomous data sources with distributed and decentralized control are a main characteristic of Big Data applications. Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralized control. This is similar to the World Wide Web (WWW) setting, where each web server provides a certain amount of information and each server is able to function fully without necessarily relying on other servers. On the other hand, the enormous volumes of data also make an application vulnerable to attacks or malfunctions if the whole system has to rely on a centralized control unit. For major Big Data-related applications, such as Google, Flickr, Facebook, and Walmart, a large number of server farms are deployed all over the world to ensure nonstop services and quick responses for local markets. Such autonomous sources are not only the result of technical designs but also of legislation and regulation in different countries and regions. For example, Walmart's Asian markets are inherently different from its North American markets in terms of seasonal promotions, top-selling items, and customer behaviors. More specifically, local government regulations also affect the wholesale management process and result in restructured data representations and data warehouses for local markets.

Complex and Evolving Relationships

While the volume of Big Data increases, so do the complexity and the relationships underneath the data. In the early stages of centralized information systems, the focus was on finding the best feature values to represent each observation. This is similar to using a number of data fields, such as age, gender, income, education background, and so on, to characterize each individual. This type of sample-feature representation inherently treats each individual as an independent entity, without considering their social connections, which are one of the most important aspects of human society. Our friend circles may be formed based on common hobbies, or people may be connected by biological relationships. Such social connections exist not only in our daily activities but are also very common in cyberworlds. For example, major social network sites, such as Facebook or Twitter, are mainly characterized by social functions such as friend connections and followers (on Twitter). The correlations between individuals inherently complicate the whole data representation and any reasoning process on the data.
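The difference between a flat sample-feature representation and one that also carries social connections can be sketched as follows; the records and friendship pairs are hypothetical, and networkx is used only as one convenient way to hold the connection structure.

```python
# A minimal, illustrative contrast between a flat feature table and a
# relational view of the same individuals; all records are hypothetical.
import networkx as nx

# Flat sample-feature representation: each person is an independent entity.
people = {
    "alice": {"age": 34, "gender": "F", "income": 72000, "education": "MSc"},
    "bob":   {"age": 29, "gender": "M", "income": 51000, "education": "BSc"},
    "carol": {"age": 41, "gender": "F", "income": 88000, "education": "PhD"},
}

# Adding social connections turns the data into a graph, and any reasoning
# now has to account for correlations between connected individuals.
g = nx.Graph()
g.add_nodes_from((name, attrs) for name, attrs in people.items())
g.add_edges_from([("alice", "bob"), ("bob", "carol")])  # friend connections

for name in g.nodes:
    neighbours = list(g.neighbors(name))
    print(f"{name}: features={people[name]}, connected_to={neighbours}")
```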


III. DATA MINING CHALLENGES WITH BIG DATA

For an intelligent learning database system [6] to handle Big Data, the essential requirement is to scale up to the exceptionally large volume of data and to provide treatments for the characteristics featured by the aforementioned HACE theorem. Fig. 2 shows a conceptual view of the Big Data processing framework, which includes three tiers, from the inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations and data volumes may grow continuously, an effective computing platform has to take distributed, large-scale data storage into consideration for computing. For example, typical data mining algorithms require all data to be loaded into main memory; this, however, is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other I/O costs), even if we do have a super-large main memory to hold all data for computing. The challenges at Tier II center around semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, as well as add technical barriers to Big Data access (Tier I) and mining algorithms (Tier III).

Figure 2. Big Data processing framework.

At Tier III, the data mining challenges concentrate on algorithm designs that tackle the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. The model and parameters are then adjusted according to the feedback. In the whole process, information sharing is not only a prerequisite for the smooth development of each stage, but also a purpose of Big Data processing.

Tier I: Big Data Mining Platform

In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform therefore needs efficient access to at least two types of resources: data and computing processors. For small-scale data mining tasks, a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals; indeed, many data mining algorithms are designed for this type of problem setting. For medium-scale data mining tasks, the data are typically large (and possibly distributed) and cannot fit into main memory. Common solutions are to rely on parallel computing [7] or collective mining [12] to sample and aggregate data from different sources, and then use parallel programming (such as the Message Passing Interface) to carry out the mining process.

Tier II: Big Data Semantics and Application Knowledge

Semantics and application knowledge in Big Data refer to numerous aspects related to regulations, policies, user knowledge, and domain information. The two most important issues at this tier are 1) data sharing and privacy, and 2) domain and application knowledge. The former provides answers to concerns about how data are maintained, accessed, and shared, whereas the latter focuses on answering questions such as "what are the underlying applications?" and "what knowledge or patterns do users intend to discover from the data?"

Information Sharing and Data Privacy

Information sharing is an ultimate goal for all systems involving multiple parties [8]. While the motivation for sharing is clear, a real-world concern is that Big Data applications are related to sensitive information, such as banking transactions and medical records, and simple data exchanges or transmissions do not resolve privacy concerns. For example, knowing people's locations and their preferences enables a variety of useful location-based services, but public disclosure of an individual's locations and movements over time can have serious consequences for privacy. To protect privacy, two common approaches are to

1) restrict access to the data, such as by adding certification or access control to the data entries, so that sensitive information is accessible only to a limited group of users, and

2) anonymize data fields such that sensitive information cannot be pinpointed to an individual record [15].
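A minimal sketch of the second approach, anonymizing data fields, might look like the following; the record layout, salt, and generalization rules are hypothetical, and a production system would rely on vetted anonymization models such as those discussed in [15].

```python
# Illustrative field anonymization: hash direct identifiers and generalize
# quasi-identifiers. The record layout and rules here are hypothetical.
import hashlib

SALT = "replace-with-a-secret-salt"  # assumption: kept outside the shared data

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

def generalize_age(age: int) -> str:
    """Coarsen age into a 10-year band so it no longer pinpoints a person."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def anonymize(record: dict) -> dict:
    return {
        "patient_id": pseudonymize(record["name"]),
        "age_band": generalize_age(record["age"]),
        "zip_prefix": record["zip"][:3] + "**",   # drop the fine-grained suffix
        "diagnosis": record["diagnosis"],          # the attribute being studied
    }

print(anonymize({"name": "Jane Doe", "age": 47, "zip": "52001", "diagnosis": "T2D"}))
```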


Domain and Application Knowledge

Domain and application knowledge provides essential information for designing Big Data mining algorithms and systems. In a simple case, domain knowledge can help identify the right features for modeling the underlying data (e.g., blood glucose level is clearly a better feature than body mass for diagnosing Type II diabetes). Domain and application knowledge can also help design achievable business objectives for Big Data analytical techniques. For example, stock market data are a typical domain that constantly generates a large quantity of information, such as bids, buys, and puts, every single second. The market continuously evolves and is affected by different factors, such as domestic and international news, government reports, and natural disasters. An appealing Big Data mining task is to design a system that predicts the movement of the market in the next one or two minutes. Such a system, even if its prediction accuracy is only slightly better than random guessing, will bring significant business value to its developers [9].

Tier III: Big Data Mining Algorithms

Local Learning and Model Fusion for Multiple Information Sources

As Big Data applications are characterized by autonomous sources and decentralized control, aggregating distributed data sources at a centralized site for mining is systematically prohibitive due to the potential transmission cost and privacy concerns. On the other hand, although we can always carry out mining activities at each distributed site, the biased view of the data collected at each site often leads to biased decisions or models, just as in the elephant and blind men case. Under such circumstances, a Big Data mining system has to enable an information exchange and fusion mechanism to ensure that all distributed sites (or information sources) can work together to achieve a global optimization goal. Model mining and correlation are the key steps to ensure that models or patterns discovered from multiple information sources can be consolidated to meet the global mining objective. More specifically, global mining can be characterized as a two-step process (local mining and global correlation) at the data, model, and knowledge levels.

Mining from Sparse, Uncertain, and Incomplete Data

Sparse, uncertain, and incomplete data are defining features of Big Data applications. With sparse data, the number of data points is too small to draw reliable conclusions. This is normally a complication of the data dimensionality problem, where data in a high-dimensional space (such as more than 1,000 dimensions) do not show clear trends or distributions. For most machine learning and data mining algorithms, high-dimensional sparse data significantly deteriorate the reliability of the models derived from the data. Common approaches are to employ dimension reduction or feature selection [10] to reduce the data dimensions, or to carefully include additional samples to alleviate the data scarcity, for example through generic unsupervised learning methods in data mining.

Mining Complex and Dynamic Data

The rise of Big Data is driven by the rapid increase of complex data and their changes in volume and in nature [6]. Documents posted on WWW servers, Internet backbones, social networks, communication networks, transportation networks, and so on are all characterized by complex data. While complex dependency structures underneath the data raise the difficulty for our learning systems, they also offer exciting opportunities that simple data representations are incapable of achieving. For example, researchers have successfully used Twitter, a well-known social networking site, to detect events such as earthquakes and major social activities with nearly real-time speed and very high accuracy. In addition, by summarizing the queries users submit to search engines all over the world, it is now possible to build an early warning system for detecting fast-spreading flu outbreaks [11]. Making use of complex data is a major challenge for Big Data applications, because any two parties in a complex network are potentially linked to each other by a social connection. The number of such connections is quadratic with respect to the number of nodes in the network, so a million-node network may be subject to about one trillion potential connections. For a large social network site such as Facebook, the number of active users has already reached 1 billion, and analyzing such an enormous network is a big challenge for Big Data mining.

IV. RELATED WORK

Due to the multisource, massive, heterogeneous, and dynamic characteristics of application data in a distributed environment, one of the most important requirements of Big Data is to carry out computing on petabyte (PB)- and even exabyte (EB)-level data with a complex computing process. Therefore, utilizing a parallel computing infrastructure, its corresponding programming language support, and software models to efficiently analyze and mine the distributed data is the critical goal for Big Data processing to move from "quantity" to "quality."

Currently, Big Data processing mainly depends on parallel programming models like MapReduce, as well as on providing cloud computing platforms of Big Data services for the public.
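As a toy illustration of this map-and-reduce style of computation, the sketch below gathers per-class counts and feature sums (the kind of summation-form statistics many learning algorithms need) from partitioned data; the records, partitioning, and function names are hypothetical, and a real deployment would run the same logic on a MapReduce framework such as Hadoop.

```python
# Toy MapReduce-style aggregation of summation-form statistics
# (per-class counts and feature sums). Data and names are hypothetical.
from collections import defaultdict

partitions = [  # data already split across "nodes"
    [("spam", [1.0, 0.0]), ("ham", [0.2, 0.9])],
    [("spam", [0.8, 0.1]), ("spam", [0.9, 0.3]), ("ham", [0.1, 1.0])],
]

def map_partition(records):
    """Map step: emit {class: [count, feature_sums]} for one partition."""
    local = defaultdict(lambda: [0, None])
    for label, features in records:
        count, sums = local[label]
        local[label][0] = count + 1
        local[label][1] = features if sums is None else [a + b for a, b in zip(sums, features)]
    return dict(local)

def reduce_all(mapped):
    """Reduce step: merge the partial statistics from every partition."""
    total = defaultdict(lambda: [0, None])
    for partial in mapped:
        for label, (count, sums) in partial.items():
            t_count, t_sums = total[label]
            total[label][0] = t_count + count
            total[label][1] = sums if t_sums is None else [a + b for a, b in zip(t_sums, sums)]
    return dict(total)

stats = reduce_all(map(map_partition, partitions))
print(stats)  # e.g. class means for naive Bayes follow from sums / counts
```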


MapReduce is a batch-oriented parallel computing model, and there is still a certain performance gap between it and relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have therefore received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics required to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently. To improve the efficiency of such algorithms, Chu et al. proposed a general-purpose parallel programming method, applicable to a large number of machine learning algorithms, based on the simple MapReduce programming model on multicore processors. Ten classical data mining algorithms were realized in this framework, including locally weighted linear regression, k-means, logistic regression, naive Bayes, linear support vector machines, independent variable analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks [14].

V. CONCLUSION

Driven by real-world applications and key industrial stakeholders, and initiated by national funding agencies, managing and mining Big Data has proven to be a challenging yet very compelling task. While the term Big Data literally concerns data volumes, our HACE theorem suggests that the key characteristics of Big Data are 1) huge volume with heterogeneous and diverse data sources, 2) autonomous with distributed and decentralized control, and 3) complex and evolving data and knowledge associations. Such combined characteristics suggest that Big Data require a "big mind" to consolidate data for maximum value [13].

To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data. At the data level, the autonomous information sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data and produce altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and to fuse decisions from multiple sources to gain the best model out of the Big Data. At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors. A system needs to be carefully designed so that unstructured data can be linked through their complex relationships to form useful patterns, and so that the growth of data volumes and item relationships helps form legitimate patterns to predict trends and the future.

REFERENCES

[1] R. Ahmed and G. Karypis, "Algorithms for Mining the Evolution of Conserved Relational States in Dynamic Networks," Knowledge and Information Systems, vol. 33, no. 3, pp. 603-630, Dec. 2012.

[2] M.H. Alam, J.W. Ha, and S.K. Lee, "Novel Approaches to Crawling Important Pages Early," Knowledge and Information Systems, vol. 33, no. 3, pp. 707-734, Dec. 2012.

[3] S. Aral and D. Walker, "Identifying Influential and Susceptible Members of Social Networks," Science, vol. 337, pp. 337-341, 2012.

[4] A. Machanavajjhala and J.P. Reiter, "Big Privacy: Protecting Confidentiality in Big Data," ACM Crossroads, vol. 19, no. 1, pp. 20-23, 2012.

[5] S. Banerjee and N. Agarwal, "Analyzing Collective Behavior from Blogs Using Swarm Intelligence," Knowledge and Information Systems, vol. 33, no. 3, pp. 523-547, Dec. 2012.

[6] E. Birney, "The Making of ENCODE: Lessons for Big-Data Projects," Nature, vol. 489, pp. 49-51, 2012.

[7] J. Bollen, H. Mao, and X. Zeng, "Twitter Mood Predicts the Stock Market," J. Computational Science, vol. 2, no. 1, pp. 1-8, 2011.

[8] S. Borgatti, A. Mehra, D. Brass, and G. Labianca, "Network Analysis in the Social Sciences," Science, vol. 323, pp. 892-895, 2009.

[9] J. Bughin, M. Chui, and J. Manyika, "Clouds, Big Data, and Smart Assets: Ten Tech-Enabled Business Trends to Watch," McKinsey Quarterly, 2010.

[10] D. Centola, "The Spread of Behavior in an Online Social Network Experiment," Science, vol. 329, pp. 1194-1197, 2010.


[11] E.Y. Chang, H. Bai, and K. Zhu, "Parallel Algorithms for Mining Large-Scale Rich-Media Data," Proc. 17th ACM Int'l Conf. Multimedia (MM '09), pp. 917-918, 2009.

[12] R. Chen, K. Sivakumar, and H. Kargupta, "Collective Mining of Bayesian Networks from Distributed Heterogeneous Data," Knowledge and Information Systems, vol. 6, no. 2, pp. 164-187, 2004.

[13] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, "Efficient Algorithms for Influence Maximization in Social Networks," Knowledge and Information Systems, vol. 33, no. 3, pp. 577-601, Dec. 2012.

[14] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.

[15] G. Cormode and D. Srivastava, "Anonymized Data: Generation, Models, Usage," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 1015-1018, 2009.

[16] "IBM What Is Big Data: Bring Big Data to the Enterprise," http://www-01.ibm.com/software/data/bigdata/, IBM, 2012.
