Data Mining 445545
Definition
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data mining is the process of extracting useful information stored in large databases.
It is a powerful tool that helps organizations retrieve useful information from the data warehouses available to them.
Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and unstructured databases, etc.
Data mining is used in numerous areas such as banking, insurance, pharmaceuticals, etc.
2. Relational Databases
A relational database is defined as a collection of data organized in tables with rows and columns.
The physical schema of a relational database defines the structure of its tables.
The logical schema of a relational database defines the relationships among its tables.
SQL is the standard language for querying relational databases.
Application: Data Mining, ROLAP model, etc.
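To make tables, rows, columns, and SQL queries concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers table and its columns are invented for illustration only.

import sqlite3

# In-memory relational database with a hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Pune"), ("Bob", "Mumbai"), ("Carol", "Pune")],
)

# SQL query over rows and columns: count customers per city.
for city, count in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(city, count)
conn.close()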
3. Data Warehouse
A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
There are three types of data warehouse: the enterprise data warehouse, the data mart, and the virtual warehouse.
Two approaches can be used to integrate data into a data warehouse: the query-driven approach and the update-driven approach.
Application: Business decision making, Data mining, etc.
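As a rough illustration of why data integrated from multiple sources supports querying and decision making, the sketch below uses pandas with made-up source systems, regions, and revenue figures; a real warehouse would be far larger and use dedicated tooling.

import pandas as pd

# Hypothetical sales records already integrated from two source systems.
sales = pd.DataFrame({
    "source":  ["store_db", "store_db", "web_db", "web_db"],
    "region":  ["North", "South", "North", "South"],
    "revenue": [1200.0, 800.0, 450.0, 950.0],
})

# A warehouse-style summary that supports decision making:
# total revenue per region across all integrated sources.
print(sales.groupby("region")["revenue"].sum())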
4. Transactional Databases
A transactional database consists of a file of transactions, where each transaction typically includes a unique transaction identifier and a list of the items making up the transaction (for example, the items purchased together in one store visit).
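A minimal sketch of how such transactional data can be laid out, using made-up transaction IDs and items:

# Each record is a transaction ID plus the list of items in that transaction.
transactions = [
    ("T100", ["computer", "software", "printer"]),
    ("T200", ["computer", "software"]),
    ("T300", ["printer", "paper"]),
]

# A typical question asked of such a database:
# in how many transactions does a given item appear?
item = "software"
count = sum(1 for _, items in transactions if item in items)
print(f"{item} appears in {count} of {len(transactions)} transactions")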
5. Multimedia Databases
Multimedia databases consist of audio, video, image, and text media.
They can be stored in object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, music databases, etc.
Spatial Databases
Spatial databases store geographical information.
They store data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
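As a small illustration of coordinate and polygon data, the sketch below uses the third-party shapely library with an invented boundary; real spatial databases add indexing and many more operations.

from shapely.geometry import Point, Polygon

# A hypothetical region boundary stored as a polygon of (x, y) coordinates.
boundary = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])

# Typical spatial queries: containment and area.
location = Point(2.5, 1.5)
print(boundary.contains(location))  # True: the point lies inside the polygon
print(boundary.area)                # area enclosed by the polygon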
6. Time-series Databases
Time-series databases contain time-stamped data such as stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
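A minimal sketch, with made-up prices, of handling an array of numbers indexed by time; it uses pandas rather than a dedicated time-series database, purely for illustration.

import pandas as pd

# Hypothetical daily closing prices indexed by date.
prices = pd.Series(
    [101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Typical time-series operations: weekly averages and day-over-day changes.
print(prices.resample("W").mean())
print(prices.diff())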
7. WWW
WWW refers to the World Wide Web, a collection of documents and resources such as audio, video, and text that are identified by Uniform Resource Locators (URLs), linked through HTML pages, accessed via web browsers, and available over the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
3. Association Analysis
Association analysis discovers rules such as buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%], where X is a variable representing a customer. A confidence of 50% means that if a customer buys a computer, there is a 50% chance that she will also buy software. A support of 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
Association rules are discarded if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
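The support and confidence of such a rule can be computed directly by counting transactions. A minimal sketch with made-up transactions and thresholds:

# Computing support and confidence for the rule "computer => software".
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "paper"},
    {"printer", "paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n            # fraction of all transactions containing both items
confidence = both / computer  # fraction of computer buyers who also bought software
print(f"support = {support:.2f}, confidence = {confidence:.2f}")

# Keep the rule only if it meets both user-chosen minimum thresholds.
min_support, min_confidence = 0.01, 0.50
print(support >= min_support and confidence >= min_confidence)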
4. Cluster Analysis
The class label is unknown: data are grouped to form new classes.
Clusters of objects are formed based on the principle of maximizing intraclass similarity and minimizing interclass similarity.
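A minimal clustering sketch using scikit-learn's KMeans on made-up, unlabeled two-dimensional points; the cluster labels are discovered from the data rather than given in advance.

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points: two natural groups, far apart in the plane.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # one centroid per discovered cluster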
5. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data.
Outliers are usually discarded as noise or exceptions.
Useful for fraud detection.
E.g. Detect purchases of extremely large amounts.
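A very small sketch of flagging an extremely large purchase amount in made-up data, using a simple standard-deviation rule; practical outlier detection typically uses more robust methods.

import numpy as np

# Made-up purchase amounts; the last one is suspiciously large.
amounts = np.array([25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 4200.0])

# Flag values more than two standard deviations from the mean
# (a low threshold, chosen because the sample here is tiny).
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])   # -> [4200.]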
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
E.g. identify stock evolution regularities for overall stock and for the stocks of particular companies.
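As a rough sketch of describing a trend for an object whose values change over time, here is a simple linear fit over made-up daily stock prices; real evolution analysis would use richer models.

import numpy as np

# Made-up daily closing prices for one stock.
days = np.arange(10)
prices = np.array([100, 101, 103, 102, 105, 107, 106, 109, 111, 112], dtype=float)

# Fit a straight line to summarize the overall trend (regularity) over time.
slope, intercept = np.polyfit(days, prices, 1)
print(f"average change per day: {slope:.2f}")  # positive slope means an upward trend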
1. Statistics:
Data mining has an inherent connection with statistics, which studies the collection, analysis, interpretation, and presentation of data. A statistical model is used to describe data classes and to model data: it describes the behavior of the objects in a class and their probability. Statistical models can be the outcome of data mining tasks such as classification and data characterization, or data mining tasks can be applied on top of statistical models.
Statistics uses mathematical analysis to represent, model, and summarize empirical data or real-world observations.
Statistical analysis involves a collection of methods, applicable to large amounts of data, for drawing conclusions and reporting trends.
Advantage:
Statistics can be used to model noise and missing data values. It provides tools for forecasting, predicting, and summarizing data, and it is useful for pattern mining. After a classification model is mined, statistical hypothesis testing is used for verification: a hypothesis test makes decisions using the test data, and the result is statistically significant if it is unlikely to have occurred by chance.
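As a hedged illustration of such verification, the sketch below compares two sets of made-up model accuracy scores with a two-sample t-test from SciPy; the scores and the choice of test are assumptions for the example.

from scipy import stats

# Made-up accuracies of two classification models over several evaluation runs.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82]
model_b = [0.75, 0.74, 0.78, 0.76, 0.77]

# Two-sample t-test: is the difference in mean accuracy statistically significant?
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference is unlikely to be due to chance.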
Disadvantage:
When a statistical model is applied to a large data set, it increases complexity and cost. When data mining is used to handle large real-time and streaming data, computation costs increase dramatically.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
When new data is entered into the computer, machine learning algorithms allow the learned model to grow or change. In machine learning, an algorithm is constructed to make predictions from the available data (predictive analysis). Machine learning is closely related to computational statistics.
The four types of machine learning are:
1. Supervised learning
Supervised learning is based on classification. It is also called inductive learning. In this method, the desired outputs (labels) are included in the training dataset (see the sketch after this list).
2. Unsupervised learning
Unsupervised learning is based on clustering. Clusters are formed on the basis of similarity measures, and desired outputs are not included in the training dataset.
3. Semi-supervised learning
Semi-supervised learning includes some desired outputs in the training dataset in order to generate the appropriate functions. This method generally avoids the need for a large number of labeled examples (i.e., desired outputs).
4. Active learning
Active learning is a powerful approach for analyzing data efficiently. The algorithm is designed so that it decides for which examples it needs the desired outputs, and the user plays an important role by supplying them.
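A minimal sketch contrasting supervised and unsupervised learning on made-up points, using scikit-learn; the data, the k-nearest-neighbors classifier, and KMeans are chosen only for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Supervised learning: the desired outputs (labels) are part of the training data.
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[1.1, 0.9], [5.0, 5.1]]))  # predicted labels for new points

# Unsupervised learning: no labels; groups are formed from similarity alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)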
3. Information Retrieval:
Information retrieval searches for information in documents, which may be text or multimedia, or may reside on the Web. It has two main characteristics:
The searched data is unstructured.
Queries are formed by keywords that do not have complex structures.
The most widely used information retrieval approach is the probabilistic model. Information retrieval combined with data mining techniques is used to find relevant topics in documents or on the web.
Uses: A large amount of data, both text and multimedia, is available and streamed on the web due to the fast growth of digitalization in the government sector, health care, and many other areas. Searching and analyzing this data raises many challenges, and hence information retrieval becomes increasingly important.
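As a small illustration of keyword-based retrieval over unstructured text, the sketch below ranks a made-up document collection against a keyword query using TF-IDF weighting and cosine similarity (a vector-space approach, rather than the probabilistic model mentioned above).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny made-up document collection and a keyword query.
docs = [
    "data mining extracts patterns from large databases",
    "hospitals digitize health care records",
    "government data is increasingly available online",
]
query = ["health care records"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Rank documents by similarity to the keyword query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores)  # the highest score marks the most relevant document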
Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. A set of attributes used to describe a given object is known as an attribute vector or feature vector.
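A minimal sketch of a customer object described by an attribute (feature) vector; the attribute names and values are invented for illustration.

# A customer object with several attributes.
customer = {
    "customer_id": 1001,      # identifier
    "city": "Pune",           # nominal attribute
    "age": 34,                # numeric attribute
    "monthly_spend": 250.75,  # numeric attribute
}

# The ordered collection of attribute values forms the feature vector.
attribute_vector = list(customer.values())
print(attribute_vector)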
Types of attributes:
Identifying attribute types is the first step of data preprocessing: we differentiate between the different types of attributes and then preprocess the data accordingly. The attribute types are described below.
Qualitative (Nominal (N), Ordinal (O), Binary (B))
Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:
1. Nominal Attributes (related to names): The values of a nominal attribute are names of things or symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes; there is no order (rank, position) among the values of a nominal attribute.
Example: hair color (black, brown, grey), marital status (single, married, divorced).
2. Binary Attributes: A binary attribute has only two values or states, for example yes or no, affected or unaffected, true or false.
Symmetric: Both values are equally important (e.g., gender).
Asymmetric: The two values are not equally important (e.g., a test result, where the positive outcome is rarer and more significant).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order of the values shows what is more important but does not indicate how much more important it is.
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of two types: interval-scaled and ratio-scaled.
An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point, or zero point. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice the value of another day's, we cannot say that the first day is twice as hot as the second.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the difference between values, and the mean, median, mode, quantile range, and five-number summary can be given. (A short numeric illustration of the interval/ratio distinction follows this list.)
2. Discrete: Discrete attributes have a finite or countably infinite set of values, which may be numerical or categorical.
Example: zip codes, profession, number of words in a document.
3. Continuous: Continuous attributes have an infinite number of possible states and are typically represented as floating-point values; there can be infinitely many values between 2 and 3.
Example: height, weight, temperature.
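A short numeric illustration of the interval/ratio distinction discussed above, using temperature; the values are made up.

# Interval vs. ratio scales, with temperature as the example.
celsius_day1, celsius_day2 = 10.0, 20.0

# Differences on an interval scale are meaningful:
print(celsius_day2 - celsius_day1)  # a 10 degree difference is interpretable

# Ratios on an interval scale are not: 20 C is not "twice as hot" as 10 C.
print(celsius_day2 / celsius_day1)  # 2.0, but physically meaningless

# Kelvin has a true zero point (a ratio scale), so the ratio is meaningful:
kelvin_day1, kelvin_day2 = celsius_day1 + 273.15, celsius_day2 + 273.15
print(kelvin_day2 / kelvin_day1)    # about 1.04, the actual ratio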