
DATA MINING

I. What is Data Mining?


❖ Data mining is a process of extracting and discovering patterns in large data
sets involving methods at the intersection of machine learning, statistics, and database
systems.
❖ Data mining is an interdisciplinary subfield of computer science and statistics with an
overall goal to extract information (with intelligent methods) from a data set and
transform the information into a comprehensible structure for further use.
❖ Data mining is the analysis step of the "knowledge discovery in databases" process,
or KDD. Aside from the raw analysis step, it also involves database and data
management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered
structures, visualization, and online updating.
❖ The term "Data Mining" is a misnomer, because the goal is the extraction of patterns
and knowledge from large amounts of data, not the extraction (mining) of data
itself. It is also a buzzword frequently applied to any form of large-scale data or
information processing (collection, extraction, warehousing, analysis, and
statistics), as well as to any application of computer decision support systems,
including artificial intelligence (e.g., machine learning) and business intelligence.

❖ Most frequently used tools for data mining:


✓ MonkeyLearn | No-code text mining tools
✓ RapidMiner | Drag and drop workflows or data mining in Python
✓ Oracle Data Mining | Predictive data mining models
✓ IBM SPSS Modeler | A predictive analytics platform for data scientists
✓ Weka | Open-source software for data mining
✓ Knime | Pre-built components for data mining projects
✓ H2O | Open-source library offering data mining in Python

✓ Orange | Open-source data mining toolbox
✓ Apache Mahout | Ideal for complex and large-scale data mining
✓ SAS Enterprise Miner | Solve business problems with data mining
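
The Python-based tools above automate this kind of work; as a minimal, self-contained illustration of the underlying idea (discovering which items frequently co-occur in transactions), here is a hand-rolled frequent-pair sketch using only the Python standard library. The transaction data is invented for illustration.

```python
# Count which item pairs frequently co-occur across transactions - a much
# simplified version of frequent-itemset mining.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk", "jam"},
]

min_support = 0.4  # a pair must appear in at least 40% of transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {pair: count / n for pair, count in pair_counts.items()
                  if count / n >= min_support}
for pair, support in sorted(frequent_pairs.items(), key=lambda x: -x[1]):
    print(pair, f"support={support:.2f}")
```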

Examples of databases and database tools:

1. SolarWinds Database Performance Analyzer


2. DbVisualizer
3. ManageEngine Applications Manager
4. Oracle RDBMS
5. IBM DB2
6. Microsoft SQL Server
7. SAP Sybase ASE
8. Teradata
9. ADABAS
10. MySQL
11. FileMaker
12. Microsoft Access
13. Informix
14. SQLite
15. PostgreSQL
16. Amazon RDS
17. MongoDB
18. Redis
19. CouchDB
20. Neo4j
21. OrientDB
22. Couchbase
23. Toad
24. phpMyAdmin
25. SQL Developer
26. Sequel Pro
27. Robomongo
28. Hadoop HDFS
29. Cloudera
30. MariaDB
31. Informix Dynamic Server
32. 4D (4th Dimension)
33. Altibase

II. What is DBMS?
❖ DBMS software primarily functions as an interface between the end user and the
database, simultaneously managing the data, the database engine, and the database
schema in order to facilitate the organization and manipulation of data.
❖ Though functions of a DBMS vary greatly, general-purpose DBMS features and
capabilities should include:
✓ A user-accessible catalog describing metadata
✓ A DBMS library management system
✓ Data abstraction and independence
✓ Data security
✓ Logging and auditing of activity
✓ Support for concurrency and transactions
✓ Support for authorization of access
✓ Access support from remote locations
✓ Data recovery support in the event of damage
✓ Enforcement of constraints to ensure the data follows certain rules
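
As a small illustration of several of these responsibilities (schema definition, constraint enforcement, transactions, and a queryable catalog), here is a sketch using Python's built-in sqlite3 module as a stand-in for a full DBMS. The table and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema with constraints that the DBMS will enforce.
cur.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary REAL CHECK (salary >= 0)
    )
""")

# Transaction support: either both inserts commit, or neither does.
try:
    with conn:  # the context manager wraps the statements in one transaction
        conn.execute("INSERT INTO employees VALUES (1, 'Ada', 5000)")
        conn.execute("INSERT INTO employees VALUES (2, 'Bob', -10)")  # violates CHECK
except sqlite3.IntegrityError:
    print("Transaction rolled back: constraint violated")

# The catalog (metadata) is itself queryable.
for row in cur.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(row)
```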

III. What are Database Systems?


❖ A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by
a database management system (DBMS). Together, the data and the DBMS, along
with the applications that are associated with them, are referred to as a database
system, often shortened to just database.

IV. What is Data Warehouse?


❖ In computing, a data warehouse (DW or DWH), also known as an enterprise data
warehouse (EDW), is a system used for reporting and data analysis and is considered
a core component of business intelligence. DWs are central repositories of integrated
data from one or more disparate sources. They store current and historical data in a
single place and are used to create analytical reports for workers throughout the
enterprise. The data stored in the warehouse is uploaded from operational
systems (such as marketing or sales). The data may pass through an operational data
store and may require data cleansing for additional operations to ensure data
quality before it is used in the DW for reporting.
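
A toy sketch of that flow (extract from disparate operational sources, cleanse, integrate into one model for reporting) is shown below using pandas. Real warehouses rely on dedicated ETL/ELT tooling; the sources and column names here are invented for illustration.

```python
import pandas as pd

# Extract: data from two hypothetical operational systems.
sales = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [100.0, None, 80.0]})
crm = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "SOUTH"]})

# Transform: cleanse and standardize before loading.
sales["amount"] = sales["amount"].fillna(0.0)   # handle missing values
crm["region"] = crm["region"].str.lower()        # normalize codes

# Integrate: join into a single model suitable for analytical reporting.
warehouse_fact = sales.merge(crm, on="customer_id", how="left")
print(warehouse_fact.groupby("region")["amount"].sum())
```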

❖ Purposes/Benefits:
✓ Integrate data from multiple sources into a single database and data model.
Consolidating data in one database also means a single query engine can be
used to present data in an ODS.
✓ Mitigate the problem of database isolation level lock contention in transaction
processing systems caused by attempts to run large, long-running analysis
queries in transaction processing databases.
✓ Maintain data history, even if the source transaction systems do not.
✓ Integrate data from multiple source systems, enabling a central view across the
enterprise. This benefit is always valuable, but particularly so when the
organization has grown by merger.
✓ Improve data quality, by providing consistent codes and descriptions, flagging
or even fixing bad data.
✓ Present the organization's information consistently.
✓ Provide a single common data model for all data of interest regardless of the
data's source.
✓ Restructure the data so that it makes sense to the business users.
✓ Restructure the data so that it delivers excellent query performance, even for
complex analytic queries, without impacting the operational systems.
✓ Add value to operational business applications, notably customer relationship
management (CRM) systems.
✓ Make decision-support queries easier to write.
✓ Organize and disambiguate repetitive data.

V. What is a Data Mart?
❖ A data mart is a simple form of a data warehouse that is focused on a single subject
(or functional area), hence they draw data from a limited number of sources such as
sales, finance or marketing. Data marts are often built and controlled by a single
department within an organization. The sources could be internal operational systems,
a central data warehouse, or external data. Denormalization is the norm for data
modeling techniques in this system. Given that data marts generally cover only a
subset of the data contained in a data warehouse, they are often easier and faster to
implement.
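
To make the denormalization point concrete, here is a minimal sketch of deriving a department-level sales mart by folding star-schema dimension tables into the fact table. The table and column names are illustrative only, not from any particular system; pandas is assumed.

```python
import pandas as pd

# Warehouse-style normalized tables (fact plus dimensions).
fact_sales = pd.DataFrame({"product_id": [10, 11, 10], "store_id": [1, 1, 2],
                           "units": [3, 5, 2]})
dim_product = pd.DataFrame({"product_id": [10, 11], "category": ["snacks", "drinks"]})
dim_store = pd.DataFrame({"store_id": [1, 2], "city": ["Lahore", "Karachi"]})

# Denormalize: fold the dimensions into the fact rows to get one flat mart table.
sales_mart = (fact_sales
              .merge(dim_product, on="product_id")
              .merge(dim_store, on="store_id"))

# The mart now answers departmental questions with simple queries.
print(sales_mart.groupby(["city", "category"])["units"].sum())
```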

DIFFERENCE BETWEEN DATA WAREHOUSE AND DATA MART


ATTRIBUTE                      Data warehouse       Data mart
SCOPE OF THE DATA              enterprise-wide      department-wide
NUMBER OF SUBJECT AREAS        multiple             single
DIFFICULTY TO BUILD            difficult            easy
TIME TO BUILD                  more                 less
STORAGE SIZE                   larger               limited

VI. What is Big Data?


❖ Big data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional data-processing application software. Data with many entries (rows)
offer greater statistical power, while data with higher complexity (more attributes or
columns) may lead to a higher false discovery rate. Big data analysis challenges
include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, updating, information privacy, and
data source. Big data was originally associated with three key
concepts: volume, variety, and velocity. The sheer scale of big data makes sampling
difficult, where earlier approaches could rely only on observation and sampling.
Consequently, big data often includes data with sizes that exceed the capacity of
traditional software to process within an acceptable time and value.
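
One common workaround when a data set will not fit in memory is to stream it in chunks and aggregate incrementally instead of loading it whole. A minimal pandas sketch of that idea follows; the file name and column are hypothetical.

```python
import pandas as pd
from collections import Counter

totals = Counter()
# Process the file one million rows at a time rather than reading it all at once.
for chunk in pd.read_csv("huge_clickstream.csv", chunksize=1_000_000):
    totals.update(chunk["country"].value_counts().to_dict())

print(totals.most_common(10))
```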

❖ Current usage of the term big data tends to refer to the use of predictive
analytics, user behavior analytics, or certain other advanced data analytics methods
that extract value from big data, and seldom to a particular size of data set. "There is
little doubt that the quantities of data now available are indeed large, but that's
not the most relevant characteristic of this new data ecosystem.” Analysis of data
sets can find new correlations to "spot business trends, prevent diseases, combat
crime and so on". Scientists, business executives, medical practitioners, advertising
and governments alike regularly meet difficulties with large data-sets in areas
including Internet searches, fintech, healthcare analytics, geographic information
systems, urban informatics, and business informatics. Scientists encounter limitations
in e-Science work, including meteorology, genomics, connectomics, complex physics
simulations, biology, and environmental research.
❖ The size and number of available data sets has grown rapidly as data is collected by
devices such as mobile devices, cheap and numerous information-sensing Internet of
things devices, aerial (remote sensing), software logs, cameras, microphones, radio-
frequency identification (RFID) readers and wireless sensor networks. The world's
technological per-capita capacity to store information has roughly doubled every 40
months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×2^60 bytes) of data
are generated. An IDC report predicted that the global data volume would grow
exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020, and that
by 2025 there would be 163 zettabytes of data. One question
for large enterprises is determining who should own big-data initiatives that affect the
entire organization.

VII. What is Data Science?


❖ Data science is an interdisciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from structured
and unstructured data, and apply knowledge and actionable insights from data across
a broad range of application domains. Data science is related to data
mining, machine learning and big data.

❖ Data science is a "concept to unify statistics, data analysis, informatics, and their
related methods" in order to "understand and analyze actual phenomena" with data. It
uses techniques and theories drawn from many fields within the context
of mathematics, statistics, computer science, information science, and domain
knowledge. However, data science is different from computer science and
information science. Turing Award winner Jim Gray imagined data science as a
"fourth paradigm" of science (empirical, theoretical, computational, and now data-
driven) and asserted that "everything about science is changing because of the impact
of information technology" and the data deluge.
❖ Data science is an interdisciplinary field focused on extracting knowledge from data
sets, which are typically large (see big data), and applying the knowledge and
actionable insights from data to solve problems in a wide range of application
domains. The field encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions, and presenting findings
to inform high-level decisions in a broad range of application domains. As such, it
incorporates skills from computer science, statistics, information science,
mathematics, information visualization, data integration, graphic design, complex
systems, communication and business. Statistician Nathan Yau, drawing on Ben Fry,
also links data science to human-computer interaction: users should be able to
intuitively control and explore data. In 2015, the American Statistical
Association identified database management, statistics and machine learning,
and distributed and parallel systems as the three emerging foundational professional
communities.

VIII. What is AI (Artificial Intelligence)?


❖ Artificial intelligence (AI) is intelligence demonstrated by machines, unlike
the natural intelligence displayed by humans and animals, which involves
consciousness and emotionality. The distinction between the former and the latter
categories is often revealed by the acronym chosen. 'Strong' AI is usually labelled
as artificial general intelligence (AGI) while attempts to emulate 'natural' intelligence
have been called artificial biological intelligence (ABI). Leading AI textbooks define

the field as the study of "intelligent agents": any device that perceives its environment
and takes actions that maximize its chance of achieving its goals. Colloquially, the
term "artificial intelligence" is often used to describe machines that mimic "cognitive"
functions that humans associate with the human mind, such as "learning" and
"problem solving".
❖ As machines become increasingly capable, tasks considered to require "intelligence"
are often removed from the definition of AI, a phenomenon known as the AI effect. A
quip in Tesler's Theorem says "AI is whatever hasn't been done yet." For
instance, optical character recognition is frequently excluded from things considered
to be AI, having become a routine technology. Modern machine capabilities generally
classified as AI include successfully understanding human speech, competing at the
highest level in strategic games (such as chess and Go) and in imperfect-information
games like poker, self-driving cars, intelligent routing in content delivery
networks, and military simulations.
❖ Artificial intelligence was founded as an academic discipline in 1956, and in the years
since has experienced several waves of optimism, followed by disappointment and
the loss of funding (known as an "AI winter"), followed by new approaches, success
and renewed funding. After AlphaGo defeated a professional Go player in 2015,
artificial intelligence once again attracted widespread global attention. For most of its
history, AI research has been divided into sub-fields that often fail to communicate
with each other. These sub-fields are based on technical considerations, such as
particular goals (e.g. "robotics" or "machine learning"), the use of particular tools
("logic" or artificial neural networks), or deep philosophical differences. Sub-fields
have also been based on social factors (particular institutions or the work of particular
researchers).
❖ The traditional problems (or goals) of AI research include reasoning, knowledge
representation, planning, learning, natural language processing, perception and the
ability to move and manipulate objects. AGI is among the field's long-term
goals. Approaches include statistical methods, computational intelligence,
and traditional symbolic AI. Many tools are used in AI, including versions of search
and mathematical optimization, artificial neural networks, and methods based on

statistics, probability and economics. The AI field draws upon computer
science, information engineering, mathematics, psychology, linguistics, philosophy,
and many other fields.
❖ The field was founded on the assumption that human intelligence "can be so precisely
described that a machine can be made to simulate it". This raises philosophical
arguments about the mind and the ethics of creating artificial beings endowed with
human-like intelligence. These issues have been explored
by myth, fiction and philosophy since antiquity. Some people also consider AI to be a
danger to humanity if it progresses unabated. Others believe that AI, unlike previous
technological revolutions, will create a risk of mass unemployment.
❖ In the twenty-first century, AI techniques have experienced a resurgence following
concurrent advances in computer power, large amounts of data, and theoretical
understanding; and AI techniques have become an essential part of the technology
industry, helping to solve many challenging problems in computer science, software
engineering and operations research.

IX. What is Machine Learning?


❖ Machine learning (ML) is the study of computer algorithms that improve
automatically through experience and by the use of data. It is seen as a part
of artificial intelligence. Machine learning algorithms build a model based on sample
data, known as "training data", in order to make predictions or decisions without
being explicitly programmed to do so. Machine learning algorithms are used in a
wide variety of applications, such as in medicine, email filtering, speech recognition,
and computer vision, where it is difficult or unfeasible to develop conventional
algorithms to perform the needed tasks.
❖ A subset of machine learning is closely related to computational statistics, which
focuses on making predictions using computers; but not all machine learning is
statistical learning. The study of mathematical optimization delivers methods, theory
and application domains to the field of machine learning. Data mining is a related
field of study, focusing on exploratory data analysis through unsupervised

learning. In its application across business problems, machine learning is also referred
to as predictive analytics.
❖ Machine learning involves computers discovering how they can perform tasks without
being explicitly programmed to do so. It involves computers learning from data
provided so that they carry out certain tasks. For simple tasks assigned to computers, it
is possible to program algorithms telling the machine how to execute all steps required
to solve the problem at hand; on the computer's part, no learning is needed. For more
advanced tasks, it can be challenging for a human to manually create the needed
algorithms. In practice, it can turn out to be more effective to help the machine
develop its own algorithm, rather than having human programmers specify every
needed step.
❖ The discipline of machine learning employs various approaches to teach computers to
accomplish tasks where no fully satisfactory algorithm is available. In cases where
vast numbers of potential answers exist, one approach is to label some of the correct
answers as valid. This can then be used as training data for the computer to improve
the algorithm(s) it uses to determine correct answers. For example, to train a system
for the task of digital character recognition, the MNIST dataset of handwritten digits
has often been used.
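
A minimal sketch of that digit-recognition setup is shown below, using scikit-learn's small built-in digits data set as a stand-in for MNIST (scikit-learn is assumed to be installed; the classifier choice is arbitrary).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)            # 8x8 images flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=2000)       # learn from the labeled examples
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```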

X. What is Deep Learning (DL)?


❖ Deep learning is a class of machine learning algorithms that uses
multiple layers to progressively extract higher-level features from the raw input. For
example, in image processing, lower layers may identify edges, while higher layers
may identify the concepts relevant to a human such as digits or letters or faces.
❖ Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning. Learning can be supervised, semi-supervised or unsupervised.
❖ Deep-learning architectures such as deep neural networks, deep belief networks, graph
neural networks, recurrent neural networks and convolutional neural networks have
been applied to fields including computer vision, speech recognition, natural language
processing, machine translation, bioinformatics, drug design, medical image analysis,

material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance.
❖ Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various
differences from biological brains. Specifically, neural networks tend to be static and
symbolic, while the biological brain of most living organisms is dynamic (plastic) and
analogue.
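
To make the "layers extract progressively higher-level features" idea concrete, here is a minimal convolutional network sketch. It assumes TensorFlow/Keras is installed, and the layer sizes are illustrative rather than tuned.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),            # e.g. grayscale digit images
    tf.keras.layers.Conv2D(16, 3, activation="relu"),     # low-level features (edges)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),     # higher-level features (strokes, parts)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),      # class scores, e.g. digits 0-9
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```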

XI. What are Supervised, Unsupervised, Semi-Supervised & Reinforcement Learning?

❖ Supervised learning

Supervised learning algorithms build a mathematical model of a set of data that
contains both the inputs and the desired outputs. The data is known as training data,
and consists of a set of training examples. Each training example has one or more
inputs and the desired output, also known as a supervisory signal. In the mathematical
model, each training example is represented by an array or vector, sometimes called a
feature vector, and the training data is represented by a matrix. Through iterative
optimization of an objective function, supervised learning algorithms learn a function
that can be used to predict the output associated with new inputs. An optimal function
will allow the algorithm to correctly determine the output for inputs that were not a
part of the training data. An algorithm that improves the accuracy of its outputs or
predictions over time is said to have learned to perform that task.

Types of supervised learning algorithms include active
learning, classification and regression. Classification algorithms are used when the
outputs are restricted to a limited set of values, and regression algorithms are used
when the outputs may have any numerical value within a range. As an example, for a
classification algorithm that filters emails, the input would be an incoming email, and
the output would be the name of the folder in which to file the email.
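
A tiny sketch of that email-filtering example as a supervised classification task follows: the inputs are email texts, the outputs are folder labels, and the model learns the mapping from labeled examples. The example emails are invented and scikit-learn is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "limited offer click here",    # spam-like
    "meeting agenda for monday", "project status report",  # work-related
]
folders = ["spam", "spam", "work", "work"]                  # the supervisory signal

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, folders)                                  # learn from labeled examples

print(model.predict(["free prize offer", "monday project meeting"]))
```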

Similarity learning is an area of supervised machine learning closely related to
regression and classification, but the goal is to learn from examples using a similarity
function that measures how similar or related two objects are. It has applications
in ranking, recommendation systems, visual identity tracking, face verification, and
speaker verification.

❖ Unsupervised learning

Unsupervised learning algorithms take a set of data that contains only inputs, and find
structure in the data, like grouping or clustering of data points. The algorithms,
therefore, learn from test data that has not been labeled, classified or categorized.
Instead of responding to feedback, unsupervised learning algorithms identify
commonalities in the data and react based on the presence or absence of such
commonalities in each new piece of data. A central application of unsupervised
learning is in the field of density estimation in statistics, such as finding the probability
density function, though unsupervised learning also encompasses other domains
involving summarizing and explaining data features.

Cluster analysis is the assignment of a set of observations into subsets (called clusters)
so that observations within the same cluster are similar according to one or more
predesignated criteria, while observations drawn from different clusters are dissimilar.
Different clustering techniques make different assumptions on the structure of the
data, often defined by some similarity metric and evaluated, for example, by internal
compactness, or the similarity between members of the same cluster, and separation,
the difference between clusters. Other methods are based on estimated
density and graph connectivity.
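
A minimal clustering sketch matching this description is shown below: the algorithm receives only unlabeled points and groups similar ones together. It assumes scikit-learn and NumPy are installed; the two synthetic blobs stand in for real data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled blobs of 2-D points around different centers.
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:\n", kmeans.cluster_centers_)
```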

❖ Semi-supervised learning

Semi-supervised learning falls between unsupervised learning (without any labeled
training data) and supervised learning (with completely labeled training data). Some of
the training examples are missing training labels, yet many machine-learning
researchers have found that unlabeled data, when used in conjunction with a small
amount of labeled data, can produce a considerable improvement in learning accuracy.

In weakly supervised learning, the training labels are noisy, limited, or imprecise;
however, these labels are often cheaper to obtain, resulting in larger effective training
sets.
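
As a hedged sketch of the semi-supervised setting, the example below hides most of the labels in a small data set (marking them with -1) and lets scikit-learn's label spreading propagate the few known labels to the unlabeled points. The 90% masking ratio is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
y_partial = np.copy(y)

rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9       # hide roughly 90% of the labels
y_partial[unlabeled] = -1                  # -1 means "label unknown"

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

accuracy = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"accuracy on originally unlabeled points: {accuracy:.2f}")
```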

❖ Reinforcement learning

Reinforcement learning is an area of machine learning concerned with how software
agents ought to take actions in an environment so as to maximize some notion of
cumulative reward. Due to its generality, the field is studied in many other disciplines,
such as game theory, control theory, operations research, information
theory, simulation-based optimization, multi-agent systems, swarm
intelligence, statistics and genetic algorithms. In machine learning, the environment is
typically represented as a Markov decision process (MDP). Many reinforcement
learning algorithms use dynamic programming techniques. Reinforcement learning
algorithms do not assume knowledge of an exact mathematical model of the MDP, and
are used when exact models are infeasible. Reinforcement learning algorithms are used
in autonomous vehicles or in learning to play a game against a human opponent.
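
A minimal tabular Q-learning sketch follows: the agent never sees the environment's model and learns action values purely from interaction, here in a toy five-state corridor where reaching the right end yields a reward. The environment and hyperparameters are invented for illustration.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]            # actions: 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment dynamics: reward 1 for stepping right off the last state."""
    if action == 1:
        if state == N_STATES - 1:
            return 0, 1.0                 # goal reached, restart at state 0
        return state + 1, 0.0
    return max(state - 1, 0), 0.0

state = 0
for _ in range(5000):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update rule.
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print("Learned action values (left, right) per state:")
for s, values in enumerate(Q):
    print(s, [round(v, 2) for v in values])
```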
