DM Notes
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer: the process would more appropriately be named knowledge mining, which
emphasizes that knowledge, not data, is what is mined from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
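The retail example above (products that are often purchased together) can be sketched as a simple pair count. This is only a minimal illustration, not how any particular data mining tool works: the function name `frequent_pairs`, the `min_count` threshold and the toy baskets are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_count=2):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives every pair
        # one canonical order, so (a, b) and (b, a) are counted together.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    # Keep only the pairs seen at least min_count times.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    ["bread", "milk", "diapers"],
    ["beer", "diapers", "bread"],
    ["milk", "diapers", "beer"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_count=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Counting every pair is only practical for small item sets; it is used here because it keeps the idea visible.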
Regression – attempts to find a function which models the data with the least error.
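As a small sketch of regression, the code below fits a straight line to a few points. It assumes that "least error" means least squared error and that the function is a line y = slope * x + intercept; both are assumptions for the example, and the data points are invented.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted line can then be used to estimate y for an x value that was not in the data.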
Presentation and visualization of data mining results. - Once patterns are discovered, they need
to be expressed in high-level languages and visual representations. These representations should be
easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and
incomplete objects while mining the data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. A pattern may be
uninteresting because it merely represents common knowledge or lacks novelty, so discovered patterns
need to be evaluated for interestingness.
Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient and
scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel; the results from the partitions are then merged.
Incremental algorithms incorporate database updates without having to mine the data again from scratch.
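The partition, process, merge idea and the incremental update can be sketched with item counting as the "mining" step. This is an illustration under assumptions: the helper names are hypothetical, the data is invented, and threads are used only to show the structure (in CPython they do not speed up CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count the transactions containing each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def parallel_count(transactions, n_parts=2):
    """Split the data, process the partitions concurrently, merge the results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merging Counters adds their counts
    return total


def incremental_update(total, new_transactions):
    """Fold new transactions into existing counts without a full re-scan."""
    total.update(count_items(new_transactions))
    return total


# Hypothetical transactions, invented for illustration only.
transactions = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
total = parallel_count(transactions, n_parts=2)
total = incremental_update(total, [["a", "d"]])
print(dict(total))
```

Counts merge cleanly because they are additive; mining results that are not additive need a more careful merge step.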
A DBMS (Database Management System) is a complete system used for managing digital databases that
allows storage of database content, creation/maintenance of data, search and other functionalities. On the other
hand, Data Mining is a field in computer science, which deals with the extraction of previously unknown and
interesting information from raw data. Usually, the data used as the input for the Data mining process is stored
in databases. Users who are inclined toward statistics use Data Mining. They utilize statistical models to look
for hidden patterns in data. Data miners are interested in finding useful relationships between different data
elements, which is ultimately profitable for businesses.
DBMS:
A DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to
the management (i.e., organization, storage and retrieval) of all databases that are installed in a system (i.e.
hard drive or network). There are different types of Database Management Systems, and
some of them are designed for the proper management of databases configured for specific purposes. The most
popular commercial Database Management Systems include Oracle, DB2 and Microsoft Access. All these
products provide means of allocating different levels of privileges to different users, making it possible
for a DBMS to be controlled centrally by a single administrator or to be allocated to several different people.
There are four important elements in any Database Management System. They are the modeling language,
data structures, query language and mechanism for transactions. The modeling language defines the language
of each database hosted in the DBMS. Currently several popular approaches such as hierarchical, network,
relational and object are in practice. Data structures help organize the data, such as individual records, files,
fields and their definitions, and objects such as visual media. The query language is used to retrieve and
update data; it also helps maintain the security of the database by checking login data, the access rights of
different users, and the protocols for adding data to the system.
SQL is a popular query language that is used in Relational Database Management Systems. Finally, the
mechanism that allows for transactions helps with concurrency and multiplicity. That mechanism makes sure
that the same record is not modified by multiple users at the same time, thus keeping the data integrity
intact. Additionally, a DBMS provides backup and other facilities as well.
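A short sketch of the query language and the transaction mechanism, using Python's built-in sqlite3 module with an in-memory database. The `accounts` table, the `transfer` function and the amounts are hypothetical. The example shows only the all-or-nothing behaviour of a transaction; it does not show locking between several concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the integrity rule: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts inside one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(ok, failed, balances)
```

In the failed transfer the credit to account 2 runs first, but the rollback undoes it, so the data stays consistent.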
Created by Ashwini
Gopwad
Data Mining Applications
Here is the list of areas where data mining is widely used −
The financial data in the banking and financial industry is generally reliable and of
high quality which facilitates systematic data analysis and data mining. Some of
the typical cases are as follows −
Data mining has wide application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will
continue to expand rapidly because of the increasing ease, availability and
popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer
retention and satisfaction. Here is the list of examples of data mining in the retail
industry −
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics and biomedical research. Biological
There are many data mining system products and domain specific data mining
applications. The new data mining systems and applications are being added to
the previous systems. Also, efforts are being made to standardize data mining
languages.
Data Types − The data mining system may handle formatted text,
record-based data, and relational data. The data could also be in
ASCII text, relational database data or data warehouse data.
Therefore, we should check what exact format the data mining
system can handle.
System Issues − We must consider the compatibility of a data
mining system with different operating systems. One data mining
system may run on only one operating system or on several. There
are also data mining systems that provide web-based user interfaces
and allow XML data as input.
Data Sources − Data sources refer to the data formats on which a data
mining system will operate. Some data mining systems may work
only on ASCII text files, while others work on multiple relational sources.
A data mining system should also support ODBC connections or OLE
DB for ODBC connections.
Data Mining functions and methodologies − Some data
mining systems provide only one data mining function, such as
classification, while others provide multiple data mining functions
such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis,
classification, prediction, clustering, outlier analysis, similarity
search, etc.
Coupling data mining with databases or data warehouse
systems − Data mining systems need to be coupled with a database
or a data warehouse system. The coupled components are integrated
into a uniform information processing environment. Here are the
types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining
system is considered row scalable if, when the number
of rows is enlarged 10 times, it takes no more than 10
times as long to execute the same query.
o Column (Dimension) Scalability − A data mining
system is considered as column scalable if the mining
query execution time increases linearly with the number
of columns.
Visualization Tools − Visualization in data mining can be
categorized as follows −
o Data Visualization
o Mining Results Visualization
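The row and column scalability criteria listed above can be turned into simple checks on measured query times. This is a sketch under assumptions: the function names, the 10% tolerance used to decide what counts as "linear", and the timings are all hypothetical.

```python
def is_row_scalable(time_base, time_10x_rows):
    """Row scalable: 10 times the rows takes at most 10 times as long."""
    return time_10x_rows <= 10 * time_base


def is_column_scalable(columns, times, tolerance=0.10):
    """Column scalable: query time grows roughly linearly with column count.

    Fits a least-squares line through the measurements and accepts them if
    every measured time lies within `tolerance` (relative) of that line.
    """
    n = len(columns)
    mean_c = sum(columns) / n
    mean_t = sum(times) / n
    sxx = sum((c - mean_c) ** 2 for c in columns)
    sxy = sum((c - mean_c) * (t - mean_t) for c, t in zip(columns, times))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_c
    return all(
        abs(t - (slope * c + intercept)) <= tolerance * t
        for c, t in zip(columns, times)
    )


# Hypothetical timings in seconds, invented for illustration only.
print(is_row_scalable(2.0, 18.0))                             # within 10x
print(is_row_scalable(2.0, 25.0))                             # more than 10x
print(is_column_scalable([10, 20, 40], [1.0, 2.0, 4.0]))      # linear growth
print(is_column_scalable([10, 20, 40], [1.0, 4.0, 16.0]))     # quadratic growth
```

In practice the timings would come from running the same mining query on data sets of different sizes.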
Data mining concepts are still evolving and here are the latest trends that we get
to see in this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse
systems and web database systems.
Standardisation of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.
2. Crime detection:
Data Mining detects outliers across a vast amount of data. The criminal data
includes all details of the crime that has happened. Data Mining will study the
patterns and trends and predict future events with better accuracy. The agencies
can find out which area is more prone to crime, how many police personnel
should be deployed, which age group should be targeted, vehicle numbers to be
scrutinized, etc.
the first users of data mining technology as it helps them with credit assessment.
Data mining analyses which services offered by banks are used by customers, what
type of customers use ATM cards and what they generally buy using their
cards (for cross-selling). To reduce customer attrition, banks use data mining to
analyse the transactions a customer makes before deciding to change banks.
Also, some outliers in transactions are analysed for fraud detection.
Basic DM task:
Data mining tasks are broadly divided into two categories: descriptive and
predictive.
2) Prediction
Prediction uses regression analysis to estimate unavailable or missing numeric
values in the data. If the class label is missing, classification is used to
predict it. Prediction is popular because of its relevance to business
intelligence. There are two ways of predicting data: predicting the class label
using a previously built class model, and predicting missing or incomplete data
using prediction analysis.
3) Classification
4) Association Analysis
Association analysis discovers the links between data items and the rules that
bind them, relating two or more attributes of the data. It associates attributes
that are frequently transacted together, and works out what are called association rules that
5) Outlier Analysis
Data objects that cannot be grouped into a given class or cluster are outliers.
They are often referred to as anomalies or surprises and are important to take
note of.
Although in some contexts outliers are treated as noise and discarded, in other
areas they can disclose useful information, so studying them can be very
valuable.
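One simple way to flag candidate outliers in a single numeric attribute is a z-score test. This is a swapped-in technique chosen for brevity, not the class or cluster based definition given above; the function name, the threshold of 2.0 and the amounts are assumptions made up for the example.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts with one unusual value.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or a finding to investigate (for example, possible fraud) depends on the application.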
6) Cluster Analysis
With this kind of analysis we can uncover patterns and changes in behaviour over
time, and find features such as time-series trends, periodicity, and
similarities between trends. It is applied in many fields, from space science to
retail marketing.