ABSTRACT
“Data is a precious thing and will last
longer than the systems themselves.”
– Tim Berners-Lee, inventor of the
World Wide Web.
DATA WAREHOUSING AND DATA MINING
DATA MINING
ALSTON FERNANDES 165077
In general terms, “mining” is the process of extracting some valuable
material from the earth, e.g. coal mining or diamond mining. In the
context of computer science, “Data Mining” refers to the extraction of
useful information from bulk data or data warehouses. The term itself is
slightly misleading. In coal or diamond mining, the result of the
extraction process is coal or diamond. In Data Mining, however, the
result of the extraction process is not data. Instead, the result is the
patterns and knowledge gained at the end of the extraction process. In
that sense, Data Mining is also known as Knowledge Discovery or
Knowledge Extraction.
Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in
Databases” in 1989. However, the term ‘data mining’ became more
popular in the business and press communities. Currently, Data Mining
and Knowledge Discovery are used interchangeably.
Nowadays, data mining is used almost everywhere that a large amount of
data is stored and processed. For example, banks typically use data
mining to identify prospective customers who could be interested in
credit cards, personal loans or insurance. Since banks hold the
transaction details and detailed profiles of their customers, they
analyze this data to find patterns that help them predict which
customers are likely to be interested in, say, a personal loan.
Main Purpose of Data Mining
Basically, the information gathered from Data Mining helps to uncover
hidden patterns and to predict future trends and behaviors, allowing
businesses to make informed decisions.
Technically, data mining is the computational process of analyzing data
from different perspectives, dimensions and angles, and
categorizing/summarizing it into meaningful information.
Data Mining can be applied to any type of data, e.g. Data Warehouses,
Transactional Databases, Relational Databases, Multimedia Databases,
Spatial Databases, Time-series Databases and the World Wide Web.
Data Mining Implementation Process
Business understanding:
In this phase, business and data-mining goals are established.
First, you need to understand the business and client objectives. You
need to define what your client wants (which, many times, even they
do not know themselves).
Take stock of the current data mining scenario. Factor resources,
assumptions, constraints, and other significant factors into your
assessment.
Using the business objectives and the current scenario, define your
data mining goals.
A good data mining plan is very detailed and should be developed to
accomplish both business and data mining goals.
Data understanding:
In this phase, a sanity check is performed on the data to verify that it
is appropriate for the data mining goals.
First, data is collected from multiple data sources available in the
organization.
These data sources may include multiple databases, flat files or data
cubes. Issues such as object matching and schema integration can
arise during the data integration process. It is a complex and tricky
process, as data from various sources is unlikely to match easily.
For example, table A contains an attribute named cust_no whereas
another table B contains an attribute named cust-id. It is therefore
difficult to determine whether both of these refer to the same thing.
Here, metadata should be used to reduce errors in the data
integration process.
The next step is to explore the properties of the acquired data. A good
way to explore the data is to answer the data mining questions (decided
in the business understanding phase) using query, reporting, and
visualization tools.
Based on the results of the queries, the data quality should be
ascertained.
Missing data, if any, should be acquired.
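The cust_no / cust-id mismatch described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the metadata map, the canonical name customer_id, the function names and
the sample rows are all hypothetical and do not come from any real system.

```python
# Hypothetical metadata: maps each source's column name to one canonical name.
COLUMN_METADATA = {
    "table_a": {"cust_no": "customer_id"},
    "table_b": {"cust-id": "customer_id"},
}


def to_canonical(source, row):
    """Rename a row's columns to canonical names using the metadata map."""
    mapping = COLUMN_METADATA[source]
    return {mapping.get(col, col): value for col, value in row.items()}


def integrate(rows_a, rows_b):
    """Combine rows from both tables, keyed on the canonical customer_id."""
    merged = {}
    for source, rows in (("table_a", rows_a), ("table_b", rows_b)):
        for row in rows:
            canonical = to_canonical(source, row)
            merged.setdefault(canonical["customer_id"], {}).update(canonical)
    return merged


# Illustrative rows only.
table_a = [{"cust_no": 101, "name": "A. Rao"}]
table_b = [{"cust-id": 101, "balance": 2500.0}]
customers = integrate(table_a, table_b)
```

Keeping the name mapping in one metadata table, rather than scattering
renames through the code, is what makes the matching decision explicit
and reviewable.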
Data preparation:
In this phase, data is made production ready.
The data preparation process consumes about 90% of the time of the
project.
The data from different sources should be selected, cleaned,
transformed, formatted, anonymized, and constructed (if required).
Data cleaning is a process to "clean" the data by smoothing noisy data
and filling in missing values.
For example, in a customer demographics profile, age data may be
missing. Such data is incomplete and the gaps should be filled. In
some cases, there could be outliers; for instance, age has a value of
300. Data could also be inconsistent; for instance, the name of the
customer is different in different tables.
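A minimal sketch of the age example above, assuming that implausible
ages are treated as missing and that gaps are filled with the median of
the remaining values. The records, the cut-off of 120 and the function
name are assumptions made for illustration; other filling strategies are
equally valid.

```python
from statistics import median

# Hypothetical customer records: one missing age, one implausible age (300).
records = [
    {"name": "Asha", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Chen", "age": 300},
    {"name": "Dina", "age": 46},
]

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off for this sketch


def clean_ages(rows, max_age=MAX_PLAUSIBLE_AGE):
    """Treat out-of-range ages as missing, then fill gaps with the median."""
    valid = [r["age"] for r in rows
             if r["age"] is not None and 0 <= r["age"] <= max_age]
    fill = median(valid)  # raises StatisticsError if no valid age exists
    cleaned = []
    for r in rows:
        age = r["age"]
        if age is None or not 0 <= age <= max_age:
            age = fill
        cleaned.append({**r, "age": age})
    return cleaned


cleaned = clean_ages(records)
```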
Data transformation operations change the data to make it useful in
data mining. The following transformations can be applied:
Data transformation:
Data transformation operations contribute toward the success of the
mining process.
Smoothing: It helps to remove noise from the data.
Aggregation: Summary or aggregation operations are applied to the data.
For example, the weekly sales data is aggregated to calculate the monthly
and yearly totals.
Generalization: In this step, low-level data is replaced by higher-level
concepts with the help of concept hierarchies. For example, the city is
replaced by the county.
Normalization: Normalization is performed when the attribute data are
scaled up or scaled down to fall within a specified range. Example: data
should fall in the range -2.0 to 2.0 post-normalization.
Attribute construction: New attributes that are helpful for data mining
are constructed from the given set of attributes and added to it.
The result of this process is a final data set that can be used in modeling.
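Three of the transformations listed above (smoothing, aggregation and
normalization) can be sketched as follows. The function names, the
trailing moving average as the smoothing method, the four-weeks-per-month
grouping and the sales figures are all assumptions chosen for
illustration; min-max scaling is only one of several normalization
methods.

```python
def moving_average(values, window=3):
    """Smoothing: replace each point by the mean of a trailing window."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def aggregate(values, group_size):
    """Aggregation: sum consecutive groups, e.g. 4 weekly totals per month."""
    return [sum(values[i:i + group_size])
            for i in range(0, len(values), group_size)]


def min_max_scale(values, new_min=-2.0, new_max=2.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the middle of the range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical weekly sales for eight weeks (two four-week "months").
weekly_sales = [10, 20, 30, 40, 50, 60, 70, 80]
monthly_sales = aggregate(weekly_sales, group_size=4)
scaled = min_max_scale(weekly_sales)
smoothed = moving_average(weekly_sales, window=2)
```

With the default arguments, the smallest weekly value maps to -2.0 and
the largest to 2.0, matching the range used in the normalization example.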
Modeling:
In this phase, mathematical models are used to determine data patterns.
Based on the business objectives, suitable modeling techniques
should be selected for the prepared dataset.
Create a test scenario to check the quality and validity of the model.
Run the model on the prepared dataset.
Results should be assessed by all stakeholders to make sure that
the model can meet the data mining objectives.
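The steps above (choose a technique, set up a test scenario, run the
model) can be sketched with a simple hold-out test. The 1-nearest-
neighbor classifier used here is just one possible technique, picked
because it fits in a few lines; the dataset, the feature choice and the
fixed train/test split are hypothetical.

```python
def nearest_neighbor_predict(train, point):
    """Predict the label of the closest training example (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda ex: squared_distance(ex[0], point))
    return label


def accuracy(train, test):
    """Share of held-out examples whose label is predicted correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)


# Hypothetical prepared dataset:
# (age, monthly income in thousands) -> took a personal loan?
dataset = [
    ((25, 2.0), "no"), ((30, 2.5), "no"), ((45, 6.0), "yes"),
    ((50, 7.0), "yes"), ((28, 2.2), "no"), ((48, 6.5), "yes"),
]

# Test scenario: hold out the last two examples, train on the rest.
train, test = dataset[:4], dataset[4:]
holdout_accuracy = accuracy(train, test)
```

In practice the features would usually be normalized first, and the
split would be randomized or cross-validated rather than fixed.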
Evaluation:
In this phase, patterns identified are evaluated against the business
objectives.
Results generated by the data mining model should be evaluated
against the business objectives.
Gaining business understanding is an iterative process. In fact, while
evaluating the results, new business requirements may be raised because
of what data mining reveals.
A go or no-go decision is taken on moving the model into the deployment
phase.
Deployment:
In the deployment phase, you ship your data mining discoveries to
everyday business operations.
The knowledge or information discovered during the data mining
process should be made easy to understand for non-technical
stakeholders.
A detailed deployment plan for the shipping, maintenance, and
monitoring of data mining discoveries is created.
A final project report is created with lessons learned and key
experiences during the project. This helps to improve the
organization's business policy.
Data mining Examples:
Example 1:
Consider the marketing head of a telecom service provider who wants to
increase revenues from long distance services. For a high ROI on his sales
and marketing efforts, customer profiling is important. He has a vast data
pool of customer information such as age, gender, income and credit history,
but it is impossible to determine the characteristics of people who prefer
long distance calls by manual analysis. Using data mining techniques, he may
uncover patterns linking heavy long distance callers to their
characteristics. For example, he might learn that his best customers are
married females between the ages of 45 and 54 who make more than $80,000
per year. Marketing efforts can then be targeted at that demographic.
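A minimal sketch of this kind of profiling: group customers into
segments and compare average long distance usage per segment. Every
record, field name and band boundary below is invented for illustration;
real profiling would use far more data and attributes such as income.

```python
from collections import defaultdict

# Hypothetical customer records; every value here is made up.
customers = [
    {"gender": "F", "married": True, "age": 47, "long_distance_min": 320},
    {"gender": "F", "married": True, "age": 52, "long_distance_min": 410},
    {"gender": "M", "married": False, "age": 29, "long_distance_min": 40},
    {"gender": "M", "married": True, "age": 49, "long_distance_min": 120},
    {"gender": "F", "married": False, "age": 33, "long_distance_min": 75},
]


def age_band(age, width=10, start=25):
    """Map an age to a band label such as '45-54'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


def average_usage_by_segment(rows):
    """Average long-distance minutes per (gender, married, age band)."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [sum, count]
    for r in rows:
        key = (r["gender"], r["married"], age_band(r["age"]))
        totals[key][0] += r["long_distance_min"]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


usage = average_usage_by_segment(customers)
best_segment = max(usage, key=usage.get)
```

On this toy data the heaviest-usage segment is married females aged
45-54, mirroring the pattern described in the example.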
Example 2:
A bank wants to find new ways to increase revenues from its credit card
operations. It wants to check whether usage would double if fees were
halved. The bank has multiple years of records on average credit card
balances, payment amounts, credit limit usage, and other key parameters. It
creates a model to check the impact of the proposed new business policy.
The results show that cutting fees in half for a targeted customer base
could increase revenues by $10 million.
Advantages of Data Mining
Marketing / Retail
Data mining helps marketing companies build models based on
historical data to predict who will respond to new marketing
campaigns such as direct mail or online marketing campaigns.
Using the results, marketers can take an appropriate approach to
selling profitable products to targeted customers.
Data mining benefits retail companies in much the same way as
marketing. Through market basket analysis, a store can arrange its
products so that customers can conveniently buy frequently purchased
items together. In addition, it helps retail companies offer discounts
on particular products that will attract more customers.
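Market basket analysis is commonly described in terms of support (how
often items appear together) and confidence (how often one item is
bought given another). The sketch below computes both for item pairs.
The baskets and function names are hypothetical, and production systems
would normally use a dedicated algorithm such as Apriori rather than
this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_support(transactions):
    """Support of each item pair: share of baskets containing both items."""
    counts = Counter()
    for basket in transactions:
        # Sorting makes each pair appear in one fixed (alphabetical) order.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(transactions) for pair, n in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    both = [b for b in with_antecedent if consequent in b]
    return len(both) / len(with_antecedent)


support = pair_support(baskets)
bread_butter_support = support[("bread", "butter")]
bread_to_butter = confidence(baskets, "bread", "butter")
```

A pair with high support and high confidence, such as bread and butter
in this toy data, is a candidate for being shelved together.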
Finance / Banking
Data mining gives financial institutions insight into loans and credit
reporting. By building a model from historical customer data, banks
and other financial institutions can distinguish good loans from bad
ones. In addition, data mining helps banks detect fraudulent credit
card transactions to protect the card owners.
Manufacturing
By applying data mining to operational engineering data, manufacturers
can detect faulty equipment and determine optimal control parameters.
For example, semiconductor manufacturers face the challenge that even
when the conditions of the manufacturing environments at different
wafer production plants are similar, wafer quality is not always the
same, and some wafers have defects for unknown reasons. Data mining has
been applied to determine the ranges of control parameters that lead to
the production of the golden wafer. Those optimal control parameters
are then used to manufacture wafers of the desired quality.
Governments
Data mining helps government agencies by digging through and analyzing
records of financial transactions to build patterns that can detect
money laundering or other criminal activities.
Important Future Trends in Data Mining
Businesses which have been slow in adopting the process of data
mining are now catching up with the others. Extracting important
information through the process of data mining is widely used to make
critical business decisions. In the coming decade, we can expect data
mining to become as ubiquitous as some of the more prevalent
technologies used today. Some of the key data mining trends for the
future include:
1. Multimedia Data Mining
This is one of the newer methods, and it is catching on because of
the growing ability to capture useful data accurately. It involves
the extraction of data from different kinds of multimedia sources
such as audio, text, hypertext, video and images, and the data is
converted into a numerical representation in different formats. This
method can be used for clustering and classification, for performing
similarity checks, and for identifying associations.
2. Ubiquitous Data Mining
This method involves the mining of data from mobile devices to get
information about individuals. Despite several challenges such as
complexity, privacy and cost, this method has great potential in
various industries, especially in studying human-computer
interactions.
3. Distributed Data Mining
This type of data mining is gaining popularity as it involves the
mining of huge amounts of information stored in different company
locations or at different organizations. Highly sophisticated
algorithms are used to extract data from the different locations and
provide proper insights and reports based on it.
4. Spatial and Geographic Data Mining
This is a newly trending type of data mining which involves
extracting information from environmental, astronomical, and
geographical data, including images taken from outer space. This
type of data mining can reveal aspects such as distance and topology,
and is mainly used in geographic information systems and other
navigation applications.
5. Time Series and Sequence Data Mining
The primary application of this type of data mining is the study of
cyclical and seasonal trends. This practice is also helpful in
analyzing random events which occur outside the normal series of
events. This method is mainly used by retail companies to assess
customers' buying patterns and behaviors.
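As a small illustration of studying seasonal trends, the sketch below
computes a seasonal index for each position in a repeating cycle: the
average for that position divided by the overall average. The quarterly
sales figures and the function name are hypothetical, and this simple
ratio is only one of many ways to describe seasonality.

```python
def seasonal_indices(series, period):
    """Average of each position in the cycle divided by the overall average."""
    overall = sum(series) / len(series)
    indices = []
    for pos in range(period):
        values = series[pos::period]  # every value at this cycle position
        indices.append((sum(values) / len(values)) / overall)
    return indices


# Hypothetical quarterly sales for two years (period = 4 quarters).
quarterly_sales = [100, 120, 90, 190, 110, 130, 100, 210]
indices = seasonal_indices(quarterly_sales, period=4)
peak_quarter = indices.index(max(indices)) + 1  # 1-based quarter number
```

An index above 1 marks a quarter that sells more than average; in this
toy series the fourth quarter is the seasonal peak.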