overview_basics
The arcs in the diagram allow representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or not
the person is a smoker. It is worth noting that the variable PositiveXray is conditionally
independent of whether the patient has a family history of lung cancer or is a smoker,
given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC),
showing each possible combination of the values of its parent nodes, FamilyHistory (FH)
and Smoker (S), is as follows −
Types of data
At the highest level, two kinds of data exist: quantitative and qualitative.
Quantitative data deals with numbers and things you can measure objectively:
dimensions such as height, width, and length; temperature and humidity; prices; and area
and volume.
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and color.
Quantitative data, also referred to as numeric data, comes in two types: continuous
and discrete. As a general rule, counts are discrete and measurements are
continuous.
Discrete data is a count that can't be made more precise. Typically it involves
integers. For instance, the number of children (or adults, or pets) in your family is discrete
data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3
pets.
Continuous data, on the other hand, could be divided and reduced to finer and finer
levels. For example, you can measure the height of your kids at progressively more precise
scales—meters, centimeters, millimeters, and beyond—so height is continuous data.
There are three main kinds of qualitative data:
1. Binary (also called binomial)
2. Nominal
3. Ordinal
Binary data place things in one of two mutually exclusive categories: right/wrong,
true/false, or accept/reject.
When collecting unordered or nominal data, we assign individual items to named
categories that do not have an implicit or natural value or rank.
We also can have ordered or ordinal data, in which items are assigned to categories
that do have some kind of implicit or natural order, such as "Short, Medium, or Tall."
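The difference between nominal and ordinal data can be sketched in code. The example below is a minimal illustration, not a prescribed encoding: the `Height` enum, the `COLORS` set, and the `is_taller` helper are hypothetical names chosen for this sketch. It uses the "Short, Medium, Tall" categories from the text to show that comparison and sorting are meaningful for ordinal values, while nominal values only support equality and membership checks.

```python
from enum import IntEnum


class Height(IntEnum):
    """Ordinal categories: each member carries an implicit rank."""
    SHORT = 1
    MEDIUM = 2
    TALL = 3


# Nominal categories (hypothetical example): names only, no natural order.
COLORS = {"red", "green", "blue"}


def is_taller(a: Height, b: Height) -> bool:
    """Ordering comparisons make sense for ordinal data."""
    return a > b


observations = [Height.TALL, Height.SHORT, Height.MEDIUM]
ranked = sorted(observations)   # sorting follows the natural order
known_color = "green" in COLORS  # nominal data: only membership/equality
```

Note that the integer ranks only encode order; the gap between SHORT and MEDIUM is not claimed to equal the gap between MEDIUM and TALL.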
Advantages
The major advantage of this method is its fast processing time.
The processing time depends only on the number of cells in each dimension of the
quantized space, not on the number of data objects.
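A rough sketch of why the cell count drives the cost, assuming the method quantizes a 2-D object space into square cells: after one pass that assigns each point to a cell, later steps work on the cells alone. The function names, `cell_size`, and `min_points` are hypothetical parameters for this illustration, not part of any specific algorithm named in the text.

```python
from collections import Counter


def quantize(points, cell_size):
    """Count how many 2-D points fall into each grid cell."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


def dense_cells(cell_counts, min_points):
    """Return the cells holding at least min_points objects.

    This step loops over occupied cells, not over the original points.
    """
    return {cell for cell, n in cell_counts.items() if n >= min_points}


points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.9),
          (5.1, 5.2), (5.3, 5.4), (9.0, 0.5)]
counts = quantize(points, cell_size=1.0)
dense = dense_cells(counts, min_points=2)
```

Here three points share cell (0, 0) and two share cell (5, 5), so those two cells are reported as dense, while the lone point in cell (9, 0) is not.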
Model-based methods
In this method, a model is hypothesized for each cluster, and the method finds the
best fit of the data to the given model. This method locates the clusters by clustering the
density function, which reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outliers or noise into account. It therefore yields robust
clustering methods.
Constraint-based Method
In this method, the clustering is performed by incorporating user or
application-oriented constraints. A constraint refers to the user's expectation or the
properties of the desired clustering results. Constraints provide us with an interactive way of
communicating with the clustering process. Constraints can be specified by the user or by the
application requirement.
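One common way to express such constraints, used here purely as an assumption since the text does not name a constraint type, is as "must-link" and "cannot-link" pairs of objects. The sketch below checks a proposed cluster assignment against such pairs; `violated_constraints` and the sample labels are hypothetical.

```python
def violated_constraints(labels, must_link, cannot_link):
    """List the constraints that a cluster assignment breaks.

    labels maps each object to its cluster id.
    must_link pairs are expected to share a cluster;
    cannot_link pairs are expected to be in different clusters.
    """
    violations = []
    for a, b in must_link:
        if labels[a] != labels[b]:
            violations.append(("must_link", a, b))
    for a, b in cannot_link:
        if labels[a] == labels[b]:
            violations.append(("cannot_link", a, b))
    return violations


labels = {"a": 0, "b": 0, "c": 1, "d": 1}
problems = violated_constraints(
    labels,
    must_link=[("a", "b"), ("a", "c")],
    cannot_link=[("c", "d")],
)
```

A constraint-aware clustering procedure could use a check like this to reject or penalize assignments that conflict with the user's expectations.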
Outlier Analysis
Outlier analysis is one of the data mining functions. Outliers may be defined as
follows:
A database may contain data objects that do not comply with the general
behavior or model of the data. Such data objects, which are grossly different
from or inconsistent with the remaining set of data, are called outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can
be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects
that are a substantial distance from any other cluster are considered outliers.
Rather than using statistical or distance measures, deviation-based methods
identify outliers by examining differences in the main characteristics of
objects in a group.
Outliers can be caused by measurement or execution error.
Outliers may be the result of inherent data variability.
Many data mining algorithms try to minimize the influence of outliers or
eliminate them altogether.
This, however, could result in the loss of important hidden information
because one person’s noise could be another person’s signal.
Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
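The statistical and distance-based approaches mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the statistical test is a simple z-score check that presumes roughly normal data, the distance test flags a point when too few other points lie within a given radius, and the thresholds (`threshold=2.5`, `radius`, `min_neighbors`) and sample data are made up for the example.

```python
import math
import statistics


def zscore_outliers(values, threshold=2.5):
    """Statistical test: flag values far from the mean in std-dev units."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


def distance_outliers(points, radius, min_neighbors):
    """Distance test: flag points with too few neighbors within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged_amounts = zscore_outliers(amounts)

# Hypothetical 2-D points: a tight group plus one distant point.
locations = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
flagged_points = distance_outliers(locations, radius=2.0, min_neighbors=2)
```

In a setting such as fraud detection, the flagged items would be the starting point for further inspection rather than values to discard.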
Data Mining Applications
Data mining is widely used in diverse areas. A number of commercial data
mining systems are available today, yet many challenges remain in this field. In this
tutorial, we will discuss the applications and trends of data mining.
Here is the list of areas where data mining is widely used −
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some typical cases
are as follows −
Design and construction of data warehouses for multidimensional data
analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −
Design and construction of data warehouses based on the benefits of data
mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries,
providing various services such as fax, pager, cellular phone, internet messenger, images,
e-mail, web data transmission, etc. Due to the development of new computer and
communication technologies, the telecommunication industry is rapidly expanding. This is
the reason why data mining has become very important in helping to understand the
business.
Data mining in the telecommunication industry helps in identifying
telecommunication patterns, catching fraudulent activities, making better use of resources, and
improving quality of service. Here is the list of examples for which data mining improves
telecommunication services −
Multidimensional analysis of telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile telecommunication services.
Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological data
mining is a very important part of bioinformatics. Following are the aspects in which data
mining contributes to biological data analysis −
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Alignment, indexing, similarity search, and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. Huge amounts of data have
been collected from scientific domains such as geosciences, astronomy, etc. A large
number of data sets are being generated because of fast numerical simulations in various
fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.
Following are the applications of data mining in the field of scientific applications −
Data warehouses and data preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. Increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection −
Development of data mining algorithms for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of stream data.
Distributed data mining.
Visualization and query tools.