Unit 1 Notes DM
UNIT I
INTRODUCTION
Syllabus:
Data mining, Text mining, Web mining, Spatial mining, Process mining, BI
process- Private and Public intelligence, Strategic assessment of
implementing BI
Alternative names
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
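The seven steps above can be read as a pipeline. The following is a minimal sketch on a toy in-memory dataset, not a real KDD system. The two sources (crm, sales), the field names (id, age, spend) and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
# Toy sources; all field names and values are hypothetical.
crm = [{"id": 1, "age": 25}, {"id": 2, "age": None}, {"id": 3, "age": 41}]
sales = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": 80.0}, {"id": 3, "spend": 300.0}]

# 1. Data cleaning: drop records with missing values.
clean = [r for r in crm if r["age"] is not None]

# 2. Data integration: combine the two sources on the shared key.
spend_by_id = {r["id"]: r["spend"] for r in sales}
merged = [{**r, "spend": spend_by_id[r["id"]]} for r in clean if r["id"] in spend_by_id]

# 3. Data selection: keep only the attributes relevant to the task.
selected = [(r["age"], r["spend"]) for r in merged]

# 4. Data transformation: rescale spend relative to the largest value.
max_spend = max(s for _, s in selected)
transformed = [(age, s / max_spend) for age, s in selected]

# 5. Data mining: a trivial "pattern" here is the group of high spenders.
high_spenders = [(age, s) for age, s in transformed if s > 0.5]

# 6. Pattern evaluation: measure how much of the data the pattern covers.
coverage = len(high_spenders) / len(transformed)

# 7. Knowledge presentation: report the result to the user.
print(f"{len(high_spenders)} of {len(transformed)} customers are high spenders ({coverage:.0%})")
```

In a real project each step is far more involved; the point of the sketch is only the order in which the steps feed each other.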
Knowledge base:
User interface:
Data mining can be performed on a number of data stores:
Relational database
Data warehouse
Transactional database
Time-series data
Stream data
Multimedia database
Relational database
Transactional database
In general, a transactional database consists of a file where each
record represents a transaction.
Object-Relational Databases
Temporal Databases
Sequence Databases
Time-Series Databases
A time-series database stores sequences of values or events
obtained over repeated measurements of time (e.g., hourly,
daily, weekly).
Examples include data collected from the stock exchange,
inventory control, and the observation of natural phenomena
(like temperature and wind).
Spatial Databases
Data Streams
Valid means that the discovered patterns should hold true on new
data with sufficient degree of certainty.
Data mining is not a new discipline, but rather a new definition for
the use of many disciplines.
Data mining tools are readily combined with spreadsheets and
other software development tools.
Categorical data
Nominal data
Ordinal data
Numeric data
Interval data
Ratio data
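The taxonomy above determines which operations make sense on an attribute. The sketch below illustrates this with made-up values; the attribute names and the ordering of the credit ratings are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical attribute values used only to illustrate the taxonomy.
blood_group = ["A", "O", "B", "O"]           # nominal: labels with no order
credit_rating = ["low", "high", "medium"]    # ordinal: labels with an order
temperature_c = [10.0, 20.0]                 # interval: no true zero point
weight_kg = [10.0, 20.0]                     # ratio: has a true zero point

# Nominal data supports only equality tests and counting (e.g. the mode).
mode = Counter(blood_group).most_common(1)[0][0]

# Ordinal data can also be ranked once the order is made explicit.
rank = {"low": 0, "medium": 1, "high": 2}
sorted_ratings = sorted(credit_rating, key=rank.get)

# Interval data supports differences, but ratios are not meaningful:
# 20 degrees C is not "twice as hot" as 10 degrees C.
temp_diff = temperature_c[1] - temperature_c[0]

# Ratio data supports both differences and ratios: 20 kg is twice 10 kg.
weight_ratio = weight_kg[1] / weight_kg[0]

print(mode, sorted_ratings, temp_diff, weight_ratio)
```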
1. Associations
2. Predictions
3. Clusters
4. Sequential relationships
prediction,
association, and
clustering.
Based on the way in which the patterns are extracted from the
historical data, the learning algorithms of data mining methods
can be classified as either
supervised or
unsupervised.
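The difference between the two classes can be sketched on one-dimensional toy data. The supervised part uses a 1-nearest-neighbor rule and the unsupervised part cuts the data at its widest gap; both are stand-ins picked for brevity, and the values and labels are hypothetical.

```python
# Supervised: training records carry a class label (the output attribute).
labeled = [(1.0, "low"), (1.2, "low"), (8.9, "high"), (9.3, "high")]

def predict_1nn(x):
    """Return the label of the closest training record (1-nearest neighbor)."""
    return min(labeled, key=lambda rec: abs(rec[0] - x))[1]

# Unsupervised: no labels; structure is inferred from the values alone.
unlabeled = sorted([1.0, 9.3, 1.2, 8.9])

def split_at_largest_gap(values):
    """Cut a sorted list into two groups at its widest gap."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    cut = gaps.index(max(gaps)) + 1
    return values[:cut], values[cut:]

print(predict_1nn(2.0))
print(split_at_largest_gap(unlabeled))
```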
PREDICTION
That is, in order of increasing reliability, one might list the relevant
terms as guessing, predicting, and forecasting, respectively.
The hope is that the model can then be used to predict the classes
of other unclassified records and, more important, to accurately
predict actual future events.
Neural networks
Decision trees
CLUSTERING
ASSOCIATIONS
• Banking.
• Insurance.
• Health care.
• Medicine.
• Entertainment industry.
• Sports.
2. Banking.
(1) predict machinery failures before they occur through the use of
sensory data (enabling what is called condition-based
maintenance);
(1) predict when and how much certain bond prices will change;
6. Insurance.
(3) predict which customers are more likely to buy new policies
with special features; and
(1) predict disk drive failures well before they actually occur;
(2) identify and filter unwanted Web content and e-mail messages;
(3) identify the most profitable customers and provide them with
personalized services to maintain their repeat business; and
(3) forecast the level and the time of demand at different service
locations to optimally allocate organizational resources; and
11. Medicine.
14. Sports.
need to be addressed.
First and foremost, the analyst should be clear and concise about
the description of the data mining task so that the most relevant
data can be identified.
Example:
The four main steps needed to convert the raw, real-world data
into minable datasets.
Data Cleaning
Data Transformation
Data Reduction
2. Data Cleaning
3. Data Transformation
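As a rough illustration of cleaning, transformation and reduction, the sketch below imputes a missing value, applies min-max normalization and drops an uninformative attribute. The records and the specific choices (mean imputation, dropping a constant column) are assumptions made for the example, not the only options.

```python
from statistics import mean

# Hypothetical raw records: (age, income, country); None marks a missing value.
raw = [(25, 30000.0, "IN"), (None, 52000.0, "IN"), (41, 74000.0, "IN")]

# Data cleaning: impute the missing age with the mean of the known ages.
known_ages = [age for age, _, _ in raw if age is not None]
fill = mean(known_ages)
cleaned = [(age if age is not None else fill, inc, c) for age, inc, c in raw]

# Data transformation: min-max normalize income to the range [0, 1].
lo = min(inc for _, inc, _ in cleaned)
hi = max(inc for _, inc, _ in cleaned)
normalized = [(age, (inc - lo) / (hi - lo), c) for age, inc, c in cleaned]

# Data reduction: drop the country attribute, which is constant here and so
# carries no information for the mining task.
reduced = [(age, inc) for age, inc, _ in normalized]
print(reduced)
```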
No value is added by the data mining task until the business value
obtained from discovered knowledge patterns is identified and
recognized.
Step 6: Deployment
In many cases, it is the customer, not the data analyst, who carries
out the deployment steps.
Most data mining software tools employ more than one technique
(or algorithm) for each of these methods.
1. Classification
model testing/deployment.
1. Predictive accuracy.
2. Speed.
The computational costs involved in generating and using the
model, where faster is deemed to be better.
3. Robustness.
4. Scalability.
5. Interpretability.
The level of understanding and insight provided by the model
(e.g., how and/or what the model concludes on certain
predictions).
SIMPLE SPLIT
The simple split partitions the data into two mutually exclusive
subsets called a training set and a test set (or holdout set).
The training set is used by the inducer (model builder), and the built
classifier is then tested on the test set.
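A simple split can be sketched as follows. The function name simple_split, the 30% test fraction and the fixed seed are assumptions for illustration; the notes do not prescribe a particular ratio.

```python
import random

def simple_split(records, test_fraction=0.3, seed=0):
    """Partition records into mutually exclusive training and test sets."""
    shuffled = records[:]                    # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seeded for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                       # stand-in for ten labeled records
train, test = simple_split(data)
print(len(train), len(test))
```

Shuffling before the cut matters when the records are stored in some meaningful order, such as by date or by class.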
Each time, it is trained on all but one fold and then tested on the
remaining single fold.
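The fold rotation described above can be sketched like this. The helper k_fold_indices is hypothetical; it assigns records to folds round-robin and assumes the records have already been shuffled.

```python
def k_fold_indices(n_records, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    folds = [list(range(i, n_records, k)) for i in range(k)]  # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        # Train on every fold except the one held out for testing.
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_records=6, k=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The accuracy estimate is then the average of the k test-fold accuracies.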
1. Leave-one-out.
2. Bootstrapping.
3. Jackknifing.
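Two of these methods can be sketched briefly. Leave-one-out is k-fold cross-validation with k equal to the number of records, and bootstrapping draws the training set by sampling with replacement. The function names and the seed below are hypothetical.

```python
import random

def leave_one_out(records):
    """Yield (train, test) pairs where each record is the test set exactly once."""
    for i in range(len(records)):
        yield records[:i] + records[i + 1:], [records[i]]

def bootstrap_sample(records, seed=0):
    """Draw len(records) items with replacement; unsampled items form the test set."""
    rng = random.Random(seed)
    train = [rng.choice(records) for _ in records]
    test = [r for r in records if r not in train]
    return train, test

data = list(range(10))
loo = list(leave_one_out(data))
train, test = bootstrap_sample(data)
print(len(loo), len(train), len(test))
```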
true positive rate is plotted on the Y-axis and false positive rate
is plotted on the X-axis.
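The points of such a curve can be computed by sweeping a decision threshold over the classifier scores. The sketch below assumes binary labels (1 = positive, 0 = negative) and at least one record of each class; the labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                    # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))   # (X, Y) on the plot
    return points

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A curve that rises quickly toward the top-left corner indicates a better classifier than one that stays close to the diagonal.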
CLASSIFICATION TECHNIQUES
Statistical analysis.
Neural networks.
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets
2. Cluster Analysis
ANALYSIS METHODS
Statistical methods
Neural networks
Fuzzy logic
Each of these methods generally works with one of two general method
classes:
Divisive.
Agglomerative
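The agglomerative (bottom-up) class can be sketched as repeated merging of the two closest clusters; the divisive class works in the opposite direction, starting from one cluster and splitting it. The example below uses one-dimensional points and single linkage, both chosen only to keep the sketch short.

```python
def agglomerative(points, n_clusters):
    """Bottom-up clustering: start with singletons, merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)       # j > i, so index i stays valid
    return clusters

result = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], n_clusters=3)
print(result)
```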
(1) putting the items next to each other to make it more convenient
for the customers to pick them
(2) promoting the items as a package (do not put one on sale if
the other is on sale)
(3) placing them apart from each other so that the customer has to
walk the aisles to search for it, and by doing so potentially seeing
and buying other items
Apriori,
Eclat, and
FP-Growth.
These algorithms only do half the job, which is to identify the
frequent itemsets in the database.
APRIORI ALGORITHM
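A minimal sketch of the Apriori idea follows: frequent itemsets are found level by level, and a candidate of size k is kept only if all of its subsets of size k-1 are frequent. The baskets and the 0.5 minimum support are hypothetical. The confidence function at the end stands for the second half of the job, turning frequent itemsets into rules.

```python
from itertools import combinations

transactions = [   # hypothetical market baskets
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(min_support):
    """Level-wise search: size-k candidates come from frequent size-(k-1) sets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support({i}) >= min_support]
    frequent = []
    while level:
        frequent += level
        size = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, size - 1))
                 and support(c) >= min_support]
    return frequent

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

found = frequent_itemsets(min_support=0.5)
print(sorted(sorted(s) for s in found))
print(confidence({"bread"}, {"milk"}))
```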
TEXT MINING
Example
1. Information retrieval.
2. Information extraction.
3. Named-entity recognition.
4. Question answering.
5. Automatic summarization.
8. Machine translation.
12. Text-to-speech.
1. Marketing Applications
2. Security Applications
3. Biomedical Applications
4. Academic Applications
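Term-by-document matrix (each cell holds a term occurrence count):
Terms (columns): investment risk, project management, software
engineering, development, SAP, ...
Documents (rows): Document 1, Document 2, Document 3, Document 4,
Document 5, Document 6, ...
Nonzero counts per row: Document 1: 1, 1; Document 2: 1;
Document 3: 3, 1; Document 4: 1; Document 5: 2, 1; Document 6: 1, 1
...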
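A term-by-document matrix of this kind can be built with a few lines of code. The two document texts below are invented for illustration, and counting with str.count is a simplification (it matches raw substrings and ignores stemming and synonyms).

```python
# Hypothetical documents and vocabulary, for illustration only.
documents = {
    "Document 1": "investment risk rises when project management is weak",
    "Document 2": "software engineering and software development",
}
terms = ["investment risk", "project management", "software engineering", "development"]

# One row per document, one column per term; each cell is an occurrence count.
matrix = {name: [text.count(term) for term in terms]
          for name, text in documents.items()}

for name, row in matrix.items():
    print(name, row)
```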
Synonyms, homonyms
Stemming
The following are some of the most popular software tools used for
text mining.
Free software tools, some of which are open source, are available from
a number of nonprofit organizations:
The Web is perhaps the world’s largest data and text repository.
The Web also poses great challenges for effective and efficient knowledge
discovery:
Not only does the Web grow rapidly, but its content is also constantly
being updated. Blogs, news stories, stock market results, and weather
reports are typical examples.
Web crawlers are used to read through the content of a Web site
automatically.
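The core step of a crawler, extracting the links from a fetched page so that they can be visited next, can be sketched with the Python standard library. The page content and URLs below are placeholders, and no network access is performed; a real crawler would also fetch pages, track visited URLs and respect robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href=...> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(page)
print(parser.links)
```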
One authority will rarely have its Web page point to rival
authorities in the same domain.
HITS is a link analysis algorithm that rates Web pages using the
hyperlink information contained within them.
To gather the base document set, a root set that matches the
query is fetched from a search engine.
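Once the base set has been gathered, the hub and authority scores can be computed iteratively. The sketch below assumes the link graph is already available as a dictionary and contains at least one link; the three-page graph and the iteration count are hypothetical.

```python
def hits(graph, iterations=20):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, []))
                for p in pages}
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so that the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {"A": ["C"], "B": ["C"], "C": []}    # hypothetical three-page Web
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here page C ends up as the only authority, while A and B share the hub role because both point to it.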
2. User profiles
PROCESS MINING
An event log stores information about cases and activities, but also
information about event performers, event timestamps (moment
when the event is triggered) or data elements recorded with the
event.
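An event log of this shape can be sketched as a list of records. The cases, activities and performers below are invented. Grouping the events by case gives one trace per case, and counting which activity directly follows which is a common starting point for discovering a process model.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case id, activity, performer, timestamp).
event_log = [
    ("case1", "register", "ann", "2024-01-01T09:00"),
    ("case2", "register", "bob", "2024-01-01T09:30"),
    ("case1", "approve",  "cid", "2024-01-01T11:00"),
    ("case2", "reject",   "cid", "2024-01-02T10:00"),
    ("case1", "pay",      "ann", "2024-01-03T09:00"),
]

# Group events into one trace (time-ordered activity sequence) per case.
# ISO-formatted timestamps sort correctly as plain strings.
traces = defaultdict(list)
for case, activity, _performer, _ts in sorted(event_log, key=lambda e: e[3]):
    traces[case].append(activity)

# Count how often one activity is directly followed by another.
directly_follows = Counter(
    pair for trace in traces.values() for pair in zip(trace, trace[1:])
)
print(dict(traces))
print(dict(directly_follows))
```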
discovery,
conformance and
enhancement.
1. Process discovery:
2. Conformance
3. Enhancement:
For instance, by using time stamps in the event log one can
extend the model to show bottlenecks, service levels,
throughput times and frequencies.
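As a small illustration of using time stamps, the sketch below computes the throughput time of each case, that is, the time between its first and last event. The log entries are hypothetical.

```python
from datetime import datetime

# Hypothetical event log: (case id, activity, ISO timestamp).
log = [
    ("case1", "register", "2024-01-01T09:00"),
    ("case1", "approve",  "2024-01-01T11:00"),
    ("case1", "pay",      "2024-01-03T09:00"),
    ("case2", "register", "2024-01-01T09:30"),
    ("case2", "reject",   "2024-01-02T10:00"),
]

# Track the earliest and latest event time seen for each case.
span = {}
for case, _activity, ts in log:
    t = datetime.fromisoformat(ts)
    first, last = span.get(case, (t, t))
    span[case] = (min(first, t), max(last, t))

# Throughput time per case, in hours.
throughput_hours = {
    case: (last - first).total_seconds() / 3600
    for case, (first, last) in span.items()
}
slowest = max(throughput_hours, key=throughput_hours.get)
print(throughput_hours, "slowest:", slowest)
```

The same approach, applied to the gap between two consecutive activities instead of the whole case, points to where the bottlenecks are.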
Deterministic algorithms,
Genetic algorithms.
Why is BI important?
Step 1:
Step 2:
The data is cleaned and transformed into the data warehouse. The
tables can be linked, and data cubes are formed.
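A data cube can be thought of as a measure aggregated along chosen dimensions. The sketch below rolls up a tiny fact table; the dimensions (region, product, quarter), the sales figures and the helper roll_up are hypothetical.

```python
from collections import defaultdict

# Hypothetical cleaned fact rows: (region, product, quarter, sales).
facts = [
    ("North", "Laptop", "Q1", 100),
    ("North", "Phone",  "Q1", 50),
    ("South", "Laptop", "Q1", 70),
    ("North", "Laptop", "Q2", 120),
]

def roll_up(rows, dims):
    """Sum sales over the given dimension positions (0=region, 1=product, 2=quarter)."""
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[d] for d in dims)] += row[3]
    return dict(cube)

by_region = roll_up(facts, dims=(0,))
by_region_quarter = roll_up(facts, dims=(0, 2))
print(by_region)
print(by_region_quarter)
```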
Step 3:
Using the BI system, the user can ask queries, request ad-hoc reports
or conduct any other analysis.
The following are the four key players who use a Business
Intelligence system:
2. The IT users:
1. Boost productivity
2. To improve visibility
3. Fix Accountability:
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as for medium-
sized enterprises. The use of such type of system may be expensive for
routine business transactions.
2. Complexity:
3. Limited use
It takes almost one and a half years for a data warehousing system to be
completely implemented. Therefore, it is a time-consuming process.
Artificial Intelligence:
Collaborative BI:
Embedded BI:
Cloud Analytics:
Prepared by,
D.DURAI KUMAR,
GTEC.