Data Mining 1
✓ Orange | Open-source data mining toolbox
✓ Apache Mahout | Ideal for complex and large-scale data mining
✓ SAS Enterprise Miner | Solve business problems with data mining
Examples of Databases:
II. What is DBMS?
❖ DBMS software primarily functions as an interface between the end user and the
database, simultaneously managing the data, the database engine, and the database
schema in order to facilitate the organization and manipulation of data.
❖ Though the functions of a DBMS vary greatly, general-purpose DBMS features and
capabilities should include:
✓ A user-accessible catalog describing metadata
✓ A DBMS library management system
✓ Data abstraction and independence
✓ Data security
✓ Logging and auditing of activity
✓ Support for concurrency and transactions
✓ Support for authorization of access
✓ Access support from remote locations
✓ DBMS data recovery support in the event of damage
✓ Enforcement of constraints to ensure the data follows certain rules
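Several of the capabilities listed above (a user-accessible metadata catalog, transaction support, and constraint enforcement) can be seen in a minimal sketch using Python's built-in sqlite3 module; the table and column names below are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Enforcement of constraints: the CHECK clause rejects invalid rows.
cur.execute("""
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        balance REAL CHECK (balance >= 0)
    )
""")
conn.commit()

# Support for transactions: either both inserts commit, or neither does.
try:
    with conn:
        conn.execute("INSERT INTO account (owner, balance) VALUES ('alice', 100)")
        conn.execute("INSERT INTO account (owner, balance) VALUES ('bob', -5)")  # violates CHECK
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back, including alice's row

rows = cur.execute("SELECT COUNT(*) FROM account").fetchone()[0]

# A user-accessible catalog describing metadata: sqlite_master lists schema objects.
catalog = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
```

Because the second insert violates the CHECK constraint, the DBMS rolls back the entire transaction, so the table is left empty while its definition remains visible in the catalog.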
❖ Purposes/Benefits:
✓ Integrate data from multiple sources into a single database and data model.
Congregating data into a single database also lets a single query engine
be used to present data in an ODS.
✓ Mitigate the problem of database isolation level lock contention in transaction
processing systems caused by attempts to run large, long-running analysis
queries in transaction processing databases.
✓ Maintain data history, even if the source transaction systems do not.
✓ Integrate data from multiple source systems, enabling a central view across the
enterprise. This benefit is always valuable, but particularly so when the
organization has grown by merger.
✓ Improve data quality, by providing consistent codes and descriptions, flagging
or even fixing bad data.
✓ Present the organization's information consistently.
✓ Provide a single common data model for all data of interest regardless of the
data's source.
✓ Restructure the data so that it makes sense to the business users.
✓ Restructure the data so that it delivers excellent query performance, even for
complex analytic queries, without impacting the operational systems.
✓ Add value to operational business applications, notably customer relationship
management (CRM) systems.
✓ Make decision-support queries easier to write.
✓ Organize and disambiguate repetitive data.
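A few of the benefits above (a single common data model, consistent codes and descriptions, and restructuring data from multiple source systems) can be sketched in a few lines of Python; the source records and the country-code mapping are invented purely for illustration.

```python
# Two source systems with inconsistent schemas and country spellings.
sales_system = [{"cust": "C1", "country": "USA"}, {"cust": "C2", "country": "U.S."}]
crm_system = [{"customer_id": "C3", "country_code": "United States"}]

# Consistent codes: map every source spelling to one canonical value.
COUNTRY_CODES = {"USA": "US", "U.S.": "US", "United States": "US"}

def to_common_model(record):
    """Restructure a source record into the single common data model."""
    cust = record.get("cust") or record.get("customer_id")
    raw = record.get("country") or record.get("country_code")
    return {"customer_id": cust, "country": COUNTRY_CODES.get(raw, raw)}

# The "warehouse": one integrated view across both source systems.
warehouse = [to_common_model(r) for r in sales_system + crm_system]
countries = {row["country"] for row in warehouse}
```

After integration, all three records share one schema and one country code, so a single query can present a central view across both systems.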
V. Define Data Mart?
❖ A data mart is a simple form of a data warehouse focused on a single subject (or
functional area), so it draws data from a limited number of sources such as sales,
finance, or marketing. Data marts are often built and controlled by a single
department within an organization. The sources could be internal operational systems,
a central data warehouse, or external data. Denormalization is the norm for data
modeling techniques in this system. Given that data marts generally cover only a
subset of the data contained in a data warehouse, they are often easier and faster to
implement.
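The role of denormalization in a data mart can be illustrated with a toy sketch: the mart table below repeats the product name in every row, so subject-area queries need no joins against a dimension table. The rows are invented sample data.

```python
# Normalized source (central-warehouse style): facts reference a dimension.
products = {1: "widget", 2: "gadget"}
sales_facts = [(1, 10.0), (2, 5.0), (1, 7.5)]  # (product_id, amount)

# Denormalized sales mart: each row carries the attributes it needs directly.
sales_mart = [
    {"product_name": products[pid], "amount": amt}
    for pid, amt in sales_facts
]

# A mart query reads one table directly; no lookup or join is required.
widget_total = sum(r["amount"] for r in sales_mart if r["product_name"] == "widget")
```

The trade-off is storage and redundancy in exchange for simpler, faster departmental queries, which matches the limited, single-subject scope of a mart.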
❖ Current usage of the term big data tends to refer to the use of predictive
analytics, user behavior analytics, or certain other advanced data analytics methods
that extract value from big data, and seldom to a particular size of data set. "There is
little doubt that the quantities of data now available are indeed large, but that's
not the most relevant characteristic of this new data ecosystem." Analysis of data
sets can find new correlations to "spot business trends, prevent diseases, combat
crime and so on". Scientists, business executives, medical practitioners, advertisers,
and governments alike regularly meet difficulties with large data sets in areas
including Internet searches, fintech, healthcare analytics, geographic information
systems, urban informatics, and business informatics. Scientists encounter limitations
in e-Science work, including meteorology, genomics, connectomics, complex physics
simulations, biology, and environmental research.
❖ The size and number of available data sets have grown rapidly as data is collected by
devices such as mobile devices, cheap and numerous information-sensing Internet of
things devices, aerial equipment (remote sensing), software logs, cameras, microphones,
radio-frequency identification (RFID) readers, and wireless sensor networks. The
world's technological per-capita capacity to store information has roughly doubled every
40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×2⁶⁰ bytes) of data
are generated. Based on an IDC report prediction, the global data volume was
predicted to grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013
and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. One question
for large enterprises is determining who should own big-data initiatives that affect the
entire organization.
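The growth figures above can be checked with a little arithmetic. The raw numbers (4.4 ZB in 2013 growing to 44 ZB in 2020, and storage capacity doubling every 40 months) come from the text; the derived rates below are simple computations on them.

```python
# IDC prediction: 4.4 ZB in 2013 -> 44 ZB in 2020, i.e. tenfold growth.
growth_factor = 44 / 4.4
years = 2020 - 2013

# Implied compound annual growth rate over those 7 years (~39% per year).
annual_rate = growth_factor ** (1 / years) - 1

# "Doubling every 40 months" corresponds to this annual growth rate (~23%).
doubling_rate = 2 ** (12 / 40) - 1
```

The comparison shows that the predicted data *generation* (about 39% per year) was expected to outpace the historical growth of storage *capacity* (about 23% per year), which is part of why big data is defined by more than sheer size.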
❖ Data science is a "concept to unify statistics, data analysis, informatics, and their
related methods" in order to "understand and analyze actual phenomena" with data. It
uses techniques and theories drawn from many fields within the context
of mathematics, statistics, computer science, information science, and domain
knowledge. However, data science is different from computer science and
information science. Turing Award winner Jim Gray imagined data science as a
"fourth paradigm" of science (empirical, theoretical, computational, and now data-
driven) and asserted that "everything about science is changing because of the impact
of information technology" and the data deluge.
❖ Data science is an interdisciplinary field focused on extracting knowledge from data
sets, which are typically large (see big data), and applying the knowledge and
actionable insights from data to solve problems in a wide range of application
domains. The field encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions, and presenting findings
to inform high-level decisions in a broad range of application domains. As such, it
incorporates skills from computer science, statistics, information science,
mathematics, information visualization, data integration, graphic design, complex
systems, communication and business. Statistician Nathan Yau, drawing on Ben Fry,
also links data science to human-computer interaction: users should be able to
intuitively control and explore data. In 2015, the American Statistical
Association identified database management, statistics and machine learning,
and distributed and parallel systems as the three emerging foundational professional
communities.
❖ Leading AI textbooks define the field as the study of "intelligent agents": any device
that perceives its environment
and takes actions that maximize its chance of achieving its goals. Colloquially, the
term "artificial intelligence" is often used to describe machines that mimic "cognitive"
functions that humans associate with the human mind, such as "learning" and
"problem solving".
❖ As machines become increasingly capable, tasks considered to require "intelligence"
are often removed from the definition of AI, a phenomenon known as the AI effect. A
quip in Tesler's Theorem says "AI is whatever hasn't been done yet." For
instance, optical character recognition is frequently excluded from things considered
to be AI, having become a routine technology. Modern machine capabilities generally
classified as AI include successfully understanding human speech, competing at the
highest level in strategic game systems (such as chess and Go) and in imperfect-
information games like poker, operating self-driving cars, intelligent routing in content
delivery networks, and military simulations.
❖ Artificial intelligence was founded as an academic discipline in 1956, and in the years
since has experienced several waves of optimism, followed by disappointment and
the loss of funding (known as an "AI winter"), followed by new approaches, success
and renewed funding. After AlphaGo defeated a professional Go player in 2015,
artificial intelligence once again attracted widespread global attention. For most of its
history, AI research has been divided into sub-fields that often fail to communicate
with each other. These sub-fields are based on technical considerations, such as
particular goals (e.g. "robotics" or "machine learning"), the use of particular tools
("logic" or artificial neural networks), or deep philosophical differences. Sub-fields
have also been based on social factors (particular institutions or the work of particular
researchers).
❖ The traditional problems (or goals) of AI research include reasoning, knowledge
representation, planning, learning, natural language processing, perception and the
ability to move and manipulate objects. AGI is among the field's long-term
goals. Approaches include statistical methods, computational intelligence,
and traditional symbolic AI. Many tools are used in AI, including versions of search
and mathematical optimization, artificial neural networks, and methods based on
statistics, probability and economics. The AI field draws upon computer
science, information engineering, mathematics, psychology, linguistics, philosophy,
and many other fields.
❖ The field was founded on the assumption that human intelligence "can be so precisely
described that a machine can be made to simulate it". This raises philosophical
arguments about the mind and the ethics of creating artificial beings endowed with
human-like intelligence. These issues have been explored
by myth, fiction and philosophy since antiquity. Some people also consider AI to be a
danger to humanity if it progresses unabated. Others believe that AI, unlike previous
technological revolutions, will create a risk of mass unemployment.
❖ In the twenty-first century, AI techniques have experienced a resurgence following
concurrent advances in computer power, large amounts of data, and theoretical
understanding; and AI techniques have become an essential part of the technology
industry, helping to solve many challenging problems in computer science, software
engineering and operations research.
❖ In its application across business problems, machine learning is also referred to as
predictive analytics.
❖ Machine learning involves computers discovering how they can perform tasks without
being explicitly programmed to do so. It involves computers learning from data
provided so that they carry out certain tasks. For simple tasks assigned to computers, it
is possible to program algorithms telling the machine how to execute all steps required
to solve the problem at hand; on the computer's part, no learning is needed. For more
advanced tasks, it can be challenging for a human to manually create the needed
algorithms. In practice, it can turn out to be more effective to help the machine
develop its own algorithm, rather than having human programmers specify every
needed step.
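The contrast above (the machine fitting its own rule rather than a programmer specifying every step) can be made concrete with a minimal example: ordinary least squares recovers the slope and intercept of a linear relationship directly from sample data. The data points and the underlying rule y = 2x + 1 are invented for illustration.

```python
# Example data generated by the (unknown to the program) rule y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares fit: the "learned" program is just these two numbers.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Apply the rule the machine derived from the data."""
    return slope * x + intercept
```

No human wrote "multiply by 2 and add 1" anywhere; the coefficients were estimated from the examples, which is the essence of learning from provided data.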
❖ The discipline of machine learning employs various approaches to teach computers to
accomplish tasks where no fully satisfactory algorithm is available. In cases where
vast numbers of potential answers exist, one approach is to label some of the correct
answers as valid. This can then be used as training data for the computer to improve
the algorithm(s) it uses to determine correct answers. For example, to train a system
for the task of digital character recognition, the MNIST dataset of handwritten digits
has often been used.
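The training idea above can be sketched with a tiny 1-nearest-neighbour classifier: labeled examples serve as training data, and a new input takes the label of its closest example. The two-feature "digit" vectors below are invented toy data, not MNIST.

```python
# Labeled training examples: (feature vector, correct answer).
training_data = [
    ((0.1, 0.9), "0"),
    ((0.2, 0.8), "0"),
    ((0.9, 0.1), "1"),
    ((0.8, 0.2), "1"),
]

def classify(point):
    """Return the label of the closest training example (1-nearest neighbour)."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    _, label = min(training_data, key=lambda ex: dist2(ex[0], point))
    return label

pred = classify((0.85, 0.15))
```

Adding more labeled examples improves the classifier without changing a line of its code, which is exactly the sense in which the labeled answers "train" the system.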
❖ Deep learning architectures have been applied to fields including material inspection
and board game programs, where they have produced results comparable to and in
some cases surpassing human expert performance.
❖ Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various
differences from biological brains. Specifically, neural networks tend to be static and
symbolic, while the biological brain of most living organisms is dynamic (plastic) and
analogue.
Similarity learning is an area of supervised machine learning closely related to
regression and classification, but the goal is to learn from examples using a similarity
function that measures how similar or related two objects are. It has applications
in ranking, recommendation systems, visual identity tracking, face verification, and
speaker verification.
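A similarity function of the kind described above can be illustrated with cosine similarity between feature vectors. Note the hedge: in similarity learning proper, the function itself is learned from example pairs; the fixed cosine measure and the vectors below are only an invented stand-in showing how such a function is used for verification.

```python
import math

def cosine_similarity(a, b):
    """Score in [-1, 1] measuring how aligned two feature vectors are."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Verification-style use: accept a pair as "same identity" above a threshold.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) > 0.9  # parallel vectors
diff = cosine_similarity([1.0, 0.0], [0.0, 1.0]) > 0.9            # orthogonal vectors
```

In face or speaker verification, the two vectors would be learned embeddings of two samples, and the threshold decides whether they belong to the same person.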
❖ Unsupervised learning
Unsupervised learning algorithms take a set of data that contains only inputs, and find
structure in the data, like grouping or clustering of data points. The algorithms,
therefore, learn from test data that has not been labeled, classified or categorized.
Instead of responding to feedback, unsupervised learning algorithms identify
commonalities in the data and react based on the presence or absence of such
commonalities in each new piece of data. A central application of unsupervised
learning is in the field of density estimation in statistics, such as finding the probability
density function, though unsupervised learning also encompasses other domains
involving summarizing and explaining data features.
Cluster analysis is the assignment of a set of observations into subsets (called clusters)
so that observations within the same cluster are similar according to one or more
predesignated criteria, while observations drawn from different clusters are dissimilar.
Different clustering techniques make different assumptions on the structure of the
data, often defined by some similarity metric and evaluated, for example, by internal
compactness, or the similarity between members of the same cluster, and separation,
the difference between clusters. Other methods are based on estimated
density and graph connectivity.
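The assignment-based clustering described above can be sketched with a minimal k-means loop: observations join the cluster whose centroid is nearest under a similarity metric, then centroids are re-estimated, until assignments stabilise. The one-dimensional points and the choice of two clusters are invented toy data.

```python
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]

def kmeans_1d(points, centroids, iters=10):
    """Tiny 1-D k-means: alternate assignment and centroid-update steps."""
    for _ in range(iters):
        # Assignment step: each observation joins its closest cluster.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

centroids = kmeans_1d(points, [0.0, 10.0])
```

The final centroids land near 1.0 and 8.0: members of each cluster are internally compact (similar to one another) while the two clusters stay well separated, which is the internal-compactness/separation criterion mentioned above.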
❖ Semi-supervised learning
Semi-supervised learning falls between unsupervised learning (with no labeled
training data) and supervised learning (with completely labeled training data). In
weakly supervised learning, the training labels are noisy, limited, or imprecise;
however, these labels are often cheaper to obtain, resulting in larger effective training
sets.
❖ Reinforcement learning