Chapter-1 (Introduction)
We say we live in the information age, but in practice we live in the data age. Digitalization has increased the
volume of data many times over compared with earlier eras, and this growth is explosive. The World Wide Web
(WWW), social networks, supermarkets, business houses, industries, and other sources generate data on the scale
of petabytes and more.
◼ Major sources of explosive growth of abundant data
◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube, …
Without automated tools, it is not possible to uncover the knowledge and information hidden in this heap of
data.
◼ We are drowning in data, but starving for knowledge!
◼ “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.
Data Mining – Why?
Data warehouse technology includes data cleaning, data integration, and online analytical processing (OLAP),
that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as
the ability to view information from different angles.
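The summarization and "different angles" ideas can be illustrated with a minimal Python sketch. The sales records, the dimension names, and the roll_up helper are all hypothetical and chosen only for illustration; a real OLAP engine works on a data cube, not on a list of tuples.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, quarter, amount).
SALES = [
    ("North", "Laptop", "Q1", 1200),
    ("North", "Phone", "Q1", 800),
    ("South", "Laptop", "Q1", 1500),
    ("South", "Laptop", "Q2", 700),
    ("North", "Phone", "Q2", 300),
]

# Position of each dimension inside a record (an assumption of this sketch).
DIMENSIONS = {"region": 0, "product": 1, "quarter": 2}


def roll_up(records, dims):
    """Sum the amount over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[DIMENSIONS[d]] for d in dims)
        totals[key] += record[3]
    return dict(totals)


# The same data viewed from two different angles.
by_region = roll_up(SALES, ["region"])
by_product_quarter = roll_up(SALES, ["product", "quarter"])
print(by_region)
print(by_product_quarter)
```

Choosing a different list of dimensions gives a different view of the same records, which is roughly what aggregation along selected dimensions means here.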
Although OLAP tools support multidimensional analysis and decision making, in-depth analysis requires
additional data analysis tools, for example, data mining tools that provide data classification, clustering,
outlier/anomaly detection, and the characterization of changes in data over time.
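As one small illustration of outlier/anomaly detection, the sketch below flags values that lie far from the mean, using a z-score test. The z-score rule, the threshold of 2.0, and the DAILY_ORDERS data are assumptions made for this example; they are only one of many possible techniques and are not prescribed by the text.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily order counts with one unusual day.
DAILY_ORDERS = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(DAILY_ORDERS))
```

With this data only the value 95 is reported. A single extreme value also inflates the mean and the standard deviation, so more robust measures are often preferred in practice.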
Since the 1990s, huge volumes of data have accumulated beyond databases and data warehouses, e.g., on the World
Wide Web and in web-based databases and Internet-based, globally interconnected, heterogeneous databases and
information bases. These repositories play a vital role in the information industry.
Analyzing data in such different forms effectively and efficiently, by integrating information retrieval, data
mining, and information network analysis technologies, is a challenging task.
The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has
far exceeded our human ability for comprehension without powerful tools. This situation (the abundance of data
combined with the need for powerful data analysis tools) is described as a data-rich but information-poor
situation.
The first four steps (Steps 1 – 4) are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to
the user and may be stored as new knowledge in the knowledge base.
Although data mining is shown as one step in the knowledge discovery process, in industry, in media, and in
the research milieu, the term data mining is often used to refer to the entire knowledge discovery process.
Therefore, we adopt a broad view of data mining functionality: Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
What Kinds of Data can be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a
target application.
However, the most basic forms of data for mining applications are database data, data warehouse data, and
transactional data.
Data mining can also be applied to other forms of data, e.g., data streams, ordered/sequence data, graph or
networked data, spatial data, text data, multimedia data, and the WWW.
The interdisciplinary nature of data mining research and development contributes significantly to the success of
data mining and its extensive applications.
What Kinds of Applications are Targeted?
As a highly application-driven discipline, data mining has seen great successes in many applications. Its
applications are limited only by the human imagination.
Two highly successful and popular application examples of data mining are business intelligence and
search engines.
Business Intelligence (BI): BI technologies provide historical, current, and predictive views of business
operations. Examples include reporting, online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics. Data mining is the core of business
intelligence.
Web Search Engines: A web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list (sometimes called hits). The hits may consist
of web pages, images, and other types of files. Web search engines are essentially very large data mining
applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling
and indexing to searching.
Major Issues in Data Mining
Data mining is a dynamic and fast-expanding field with great strengths. The major issues in data mining research
can be partitioned into the following five groups.
Mining methodology
✓ Mining various and new kinds of knowledge: Due to the diversity of applications, new mining tasks
continue to emerge, making data mining a dynamic and fast-growing field.
✓ Mining knowledge in multidimensional space: When searching for knowledge in large data sets, we can
explore the data in multidimensional space.
✓ Data mining – an interdisciplinary effort: The power of data mining can be substantially enhanced by
integrating new methods from multiple disciplines.
✓ Boosting the power of discovery in a networked environment: Most data objects reside in a linked or
interconnected environment, whether it be the Web, database relations, files, or documents.
✓ Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors, exceptions, or
uncertainty, or are incomplete. Errors and noise may confuse the data mining process, leading to the
derivation of erroneous patterns.
✓ Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns generated by data
mining processes are interesting. Pattern interestingness is user dependent. Therefore, techniques are
needed to assess the interestingness of discovered patterns based on subjective measures.
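One way to picture user-dependent pattern evaluation is to let each user supply their own thresholds for what counts as interesting. The sketch below scores a rule "antecedent implies consequent" with support and confidence. These two measures, the thresholds, the TRANSACTIONS data, and the function names are assumptions introduced for illustration and are not named in the text above.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """Apply user-chosen thresholds to decide whether a rule is kept."""
    rule_support = support(antecedent | consequent, transactions)
    rule_confidence = confidence(antecedent, consequent, transactions)
    return rule_support >= min_support and rule_confidence >= min_confidence


# Two users with different thresholds judge the same rule differently.
lenient = is_interesting({"bread"}, {"milk"}, TRANSACTIONS, 0.5, 0.7)
strict = is_interesting({"bread"}, {"milk"}, TRANSACTIONS, 0.5, 0.8)
print(lenient, strict)
```

The same rule passes for the lenient user and fails for the strict one, which is the sense in which interestingness depends on who is asking.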
User interaction
✓ Interactive mining: Interactive mining should allow users to dynamically change the focus of a search and to
refine mining requests based on returned results.
✓ Incorporation of background knowledge: Background knowledge, constraints, rules, and other
information regarding the domain under study should be incorporated into the knowledge discovery
process.
✓ Ad hoc data mining and data mining query languages: There should be high-level data mining query
languages or other high-level flexible user interfaces that give users the freedom to define ad hoc data
mining tasks.
✓ Presentation and visualization of data mining results: How can a data mining system present data mining
results, vividly and flexibly, so that the discovered knowledge can be easily understood and directly
usable by humans?
Efficiency and scalability
✓ Efficiency and scalability of data mining algorithms: The algorithms must be efficient and scalable to
effectively extract information from huge amounts of data in many data repositories or in dynamic data
streams.
✓ Parallel, distributed, and incremental mining algorithms: The enormous size of many data sets, the wide
distribution of data, and the computational complexity of some data mining methods are prime factors
motivating the development of such algorithms.
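The idea behind incremental (and partition-based) mining can be sketched with simple item counting: counts computed on earlier data are kept, and only the new batch is scanned and merged in. The batches and the count_items helper are hypothetical; real mining algorithms maintain far richer summaries than item counts.

```python
from collections import Counter


def count_items(transactions):
    """Count how many transactions contain each item."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


# Hypothetical data arriving in two batches (or stored on two machines).
old_batch = [["bread", "milk"], ["bread"], ["milk", "butter"]]
new_batch = [["bread", "butter"], ["milk"]]

# Incremental update: reuse the old counts and scan only the new batch.
counts = count_items(old_batch)
counts.update(count_items(new_batch))  # Counter.update adds the counts

# The result matches a full recount over all the data.
full_recount = count_items(old_batch + new_batch)
print(counts == full_recount)
```

Because the partial counts merge by addition, the same pattern applies when the batches are partitions processed in parallel on different machines.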
Diversity of data types
✓ Handling complex types of data: Diverse applications generate a wide spectrum of new and complex data
types.
✓ Mining dynamic, networked, and global data repositories: Mining gigantic, interconnected information
networks may help disclose many more patterns and knowledge in heterogeneous data sets than can be
discovered from a small set of isolated data repositories.
Data mining and society
✓ Social impacts of data mining: With data mining penetrating our everyday lives, it is important to study
the impact of data mining on society.
✓ Privacy-preserving data mining: Data mining poses the risk of disclosing an individual’s personal
information.
✓ Invisible data mining: People should be able to perform data mining or use data mining results simply by
mouse clicking, without any knowledge of data mining algorithms.