Unit 1
Data Mining:
Data mining is the process of extracting, from huge sets of data, the information needed to identify patterns, trends, and useful insights that allow a business to take data-driven decisions.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There is an enormous amount of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or to develop the company. There are many powerful instruments and techniques available to mine data and find better insights from it.
History of Data Mining
The term "Data Mining" was introduced in the 1990s, but data mining is the evolution of a field with an extensive history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of regression (1800s). The emergence and growing power of computers have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit, hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical statistics, Artificial
intelligence, and Machine learning.
Classical statistics:
Statistics is the basis of most of the technology on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and data relationships.
Artificial Intelligence:
Machine Learning:
Advantages of Data Mining:
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.
Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine price, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.
The following are areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for better insights and to identify best practices that will enhance healthcare services and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, then you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. The data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Using differential analysis, results can be compared between various stores and between customers in different demographic groups.
Data Mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing techniques that discover knowledge from data generated in educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An organization can use data mining to make precise decisions and also to predict students' results. With the results, the institution can concentrate on what to teach and how to teach.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial for finding patterns in a complex manufacturing process. Data mining can be used in system-level design to obtain the relationships between product architecture, product portfolio, and the data needs of customers. It can also be used to forecast product development time, cost, and expectations, among other tasks.
Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with its customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud Detection:
Billions of dollars are lost to fraud. Traditional methods of fraud detection are somewhat time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not difficult, but bringing out the truth from a suspect is a very challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, and so on. This includes text mining, which seeks meaningful patterns in data that is usually unstructured text. The information collected from previous investigations is compared, and a model for lie detection is constructed.
Data Mining in Financial Banking:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not immediately evident to managers or executives, because the volume of data is too large or it is generated too rapidly for experts to examine. Managers may use this data for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.
Data mining can be performed on the following types of data:
1) Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
2) Data warehouses:
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as marketing and finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
3) Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has stored various kinds of information.
4) Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.
5) Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.
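As a minimal illustration of the rollback capability described above, here is a hedged sketch using Python's built-in sqlite3 module; the accounts table and transfer amounts are invented for the example.

```python
import sqlite3

# Sketch of transactional behaviour: a failed transfer is undone by rollback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # Transfer 30 from alice to bob as one atomic transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated failure before crediting bob")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")  # never reached
    conn.commit()
except Exception:
    conn.rollback()  # undo the partial debit, leaving both balances unchanged

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100.0, 'bob': 50.0}
```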
Challenges of Implementation in Data Mining:
Although data mining is very powerful, it faces many challenges during its implementation. The challenges can be related to performance, data, methods, techniques, and so on. The data mining process becomes effective when the challenges or problems are correctly recognized and adequately resolved.
Incomplete and Noisy Data:
The process of extracting useful data from large volumes of data is data mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to errors in data measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than Rs. 500, and the accounting employees enter this information into their system. A person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.
Data Distribution:
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and images), complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information is a tough task. Most of the time, new technologies, tools, and methodologies have to be developed or refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals information about the buying habits and preferences of customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express. But many times, representing the information to the end-user in a precise and simple way is difficult. Since both the input data and the output information can be complicated, very efficient and successful data visualization processes need to be implemented to make the results useful.
The KDD Process
1. Building up an understanding of the application domain
This first step sets the scene for understanding what should be done with the various decisions such as transformation, algorithms, and representation. The individuals in charge of a KDD project need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur (including relevant prior knowledge).
2. Selecting and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge discovery process should be determined. This incorporates discovering what data is available, obtaining additional important data, and afterward integrating all the data for knowledge discovery into one data set, including the attributes that will be considered for the process. This process is important because data mining learns and discovers from the available data, which is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; from this respect, the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off with the opportunity for best understanding the phenomena. This trade-off is where the interactive and iterative aspect of KDD takes place: one begins with the best available data set and later expands it and observes the impact in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It incorporates data cleaning, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a data mining algorithm in this context. For example, when one suspects that a specific attribute is of insufficient reliability or has many missing values, this attribute could become the target of a supervised data mining algorithm: a prediction model for the attribute is created, and missing values can then be predicted. The extent to which one pays attention to this step depends on many factors. Regardless, studying these aspects is important and often revealing in itself with respect to enterprise data frameworks.
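A minimal sketch of this cleansing step, assuming the pandas library is available; the column names, values, and plausibility rule are invented for illustration.

```python
import pandas as pd
import numpy as np

# Handling missing values and removing an implausible record.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 230],   # 230 is an implausible value
    "income": [30000, 42000, 38000, np.nan, 52000, 61000],
})

# Fill missing numeric values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Drop rows whose age falls outside a plausible range (a simple domain rule).
clean = df[df["age"].between(0, 120)]
print(clean)
```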
4. Data Transformation
In this stage, appropriate data for data mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformation). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the quotient of attributes may often be the most significant factor, rather than each attribute by itself. In business, we may need to think about effects beyond our control as well as efforts and transient issues, for example, studying the impact of advertising accumulation. However, if we do not use the right transformation at the start, we may obtain a surprising effect that gives us insight into the transformation needed in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformation required.
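A small sketch of one such attribute transformation, discretizing a numerical attribute into bins, assuming pandas; the column name and bin edges are invented.

```python
import pandas as pd

# Discretize a numerical attribute into three ordinal categories.
df = pd.DataFrame({"income": [18000, 25000, 40000, 75000, 120000]})

df["income_band"] = pd.cut(
    df["income"],
    bins=[0, 30000, 80000, float("inf")],
    labels=["low", "medium", "high"],
)
print(df)
```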
5. Choosing the appropriate Data Mining task
We are now prepared to decide on which kind of data mining to use, for example classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two major goals in data mining: the first is prediction, and the second is description. Prediction is often referred to as supervised data mining, while descriptive data mining incorporates the unsupervised and visualization aspects of data mining. Most data mining techniques are based on inductive learning, where a model is built explicitly or implicitly by generalizing from an adequate number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of available data.
6. Choosing the Data Mining algorithm
Having chosen the technique, we now decide on the tactics. This stage includes selecting a particular method to be used for searching for patterns, which may involve multiple inducers. For example, when considering precision versus understandability, the former is better achieved with neural networks, while the latter is better achieved with decision trees. For each strategy of meta-learning, there are several possibilities for how it can be accomplished. Meta-learning focuses on explaining what causes a data mining algorithm to be successful or not on a specific problem. Thus, this methodology attempts to understand the conditions under which a data mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another division of the data for training and testing.
7. Employing the Data Mining algorithm
At last, the implementation of the data mining algorithm is reached. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
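A hedged sketch of steps 6 and 7 together, comparing one control parameter of a decision tree using ten-fold cross-validation; scikit-learn and the Iris dataset are stand-ins, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare settings of the minimum number of instances per leaf
# using ten-fold cross-validation.
X, y = load_iris(return_X_y=True)

for min_leaf in (1, 5, 20):
    model = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation
    print(f"min_samples_leaf={min_leaf}: mean accuracy={scores.mean():.3f}")
```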
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. We also consider the preprocessing steps with regard to their impact on the results of the data mining algorithm (for example, adding a feature in step 4 and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. The identified knowledge is also recorded for further use. The last step is the usage of, and overall feedback on, the discovery results acquired by data mining.
9. Using the discovered knowledge
Now we are prepared to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a set of data), but now the data becomes dynamic. Data structures may change (certain attributes may become unavailable), and the data domain might be modified (for example, an attribute may take a value that was not expected previously).
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns
and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing
large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge
to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process heavily depends on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new, unseen data.
Difference between KDD and Data Mining (parameter-by-parameter comparison table)
Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction.
Data Mining Techniques
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata, and it helps to classify data into different classes (a short illustrative sketch follows the list below). Data mining frameworks can themselves be classified as follows:
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
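As the short sketch promised above, here is a minimal illustration of classification as a predictive technique (assigning records to predefined classes); scikit-learn and the Iris dataset are illustrative assumptions, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Learn class boundaries from labelled records, then classify unseen records.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("predicted class of first test record:", clf.predict(X_test[:1])[0])
```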
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details but achieves simplification; it models data by its clusters. From a historical point of view, data modeling puts clustering in a framework rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an important role in data mining applications, for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify similar
data. This technique helps to recognize the differences and similarities between the data.
Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
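A minimal clustering sketch, assuming scikit-learn; the two-dimensional points are invented so that two natural groups are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans

# Group records by similarity without any labels.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```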
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
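A minimal regression sketch, assuming scikit-learn; the advertising-spend and sales figures are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Model the relationship between advertising spend and sales, then project a value.
ad_spend = np.array([[10], [20], [30], [40], [50]])   # spend (thousands of rupees)
sales    = np.array([25,  45,  65,  80, 105])         # units sold (thousands)

model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("projected sales for a spend of 60:", model.predict([[60]])[0])
```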
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden
pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have transaction data, for example, a list of grocery items that you have been buying for the last six months; it calculates the percentage of items being purchased together. Three measures are commonly used:
o Lift:
This measure compares the confidence of the rule with how often item B is purchased overall:
Lift(A→B) = Confidence(A→B) / Support(B)
o Support:
This measure shows how often items A and B are purchased together, relative to the overall dataset:
Support(A→B) = (Transactions containing both A and B) / (Total transactions)
o Confidence:
This measure shows how often item B is purchased when item A is purchased:
Confidence(A→B) = (Transactions containing both A and B) / (Transactions containing A)
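A small sketch computing these three measures for a hypothetical rule "bread -> butter"; the grocery transactions are invented.

```python
# Compute support, confidence and lift for the rule "bread -> butter".
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a       = sum("bread" in t for t in transactions)
count_b       = sum("butter" in t for t in transactions)
count_a_and_b = sum({"bread", "butter"} <= t for t in transactions)

support    = count_a_and_b / n            # P(bread and butter)
confidence = count_a_and_b / count_a      # P(butter | bread)
lift       = confidence / (count_b / n)   # confidence relative to butter's overall frequency

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```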
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains such as intrusion detection and fraud detection. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas such as network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor network data.
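A minimal outlier detection sketch using the interquartile-range (IQR) rule, assuming NumPy; the transaction amounts are invented.

```python
import numpy as np

# Flag values that fall far outside the interquartile range.
amounts = np.array([120, 130, 125, 118, 122, 127, 990])   # 990 looks anomalous

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print("outliers:", outliers)   # expected: [990]
```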
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this data mining technique helps to discover or recognize similar patterns in transaction data over a period of time.
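A small sketch in the spirit of sequential pattern mining, counting ordered item pairs across invented customer purchase sequences; real systems use dedicated algorithms such as GSP or PrefixSpan.

```python
from collections import Counter
from itertools import combinations

# Count ordered pairs (a bought before b) across customer purchase sequences.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "phone", "case"],
    ["phone", "case"],
]

pair_counts = Counter()
for seq in sequences:
    # every ordered pair appearing in this sequence, counted once per customer
    pair_counts.update(set(combinations(seq, 2)))

for (a, b), count in pair_counts.most_common(3):
    print(f"{a} -> {b}: appears in {count} of {len(sequences)} sequences")
```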
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
1) Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's global ongoing operations. This is done by excluding data that is not useful with respect to the subject and including all data needed by the users to understand the subject.
2) Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires data cleaning and integration to be performed during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
3) Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
4) Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures in
data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
History of Data Warehousing
The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy established the "business data warehouse."
In essence, the data warehousing idea was intended to provide an architectural model for the flow of information from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support multiple decision support environments. In large corporations, it was common for various decision support environments to operate independently.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
2) Store historical data: The data warehouse is required to store time-variant data from the past. This data is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.
The following are the functions of data warehouse tools and utilities
Data Transformation − Involves converting the data from legacy format to warehouse
format.
The Multidimensional Data Model is a method used for organizing data in the database, with good arrangement and assembly of the contents of the database.
The Multidimensional Data Model allows users to ask analytical questions associated with market or business trends, unlike relational databases, which allow users to access data only in the form of queries. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases. The model is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow data to be modeled and viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
The Multidimensional Data Model works on the basis of pre-decided steps.
The following stages should be followed by every project for building a Multi Dimensional
Data Model:
Stage 1: Assembling data from the client: In the first stage, a Multidimensional Data Model collects correct data from the client. Mostly, software professionals explain to the client the range of data that can be obtained with the selected technology and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, the Multidimensional Data Model recognizes and classifies all the data into the respective sections they belong to, and also makes it problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage forms the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are also known as "Dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth
stage, the factors which are recognized in the previous step are used further for identifying the
related qualities. These qualities are also known as “attributes” in the database.
Stage 5: Finding the actuality of the factors listed previously and their qualities: In the fifth stage, the Multidimensional Data Model separates and differentiates the facts from the factors which were collected. These facts play a significant role in the arrangement of a Multidimensional Data Model.
Stage 6: Building the schema to place the data, with respect to the information collected from the steps above: In the sixth stage, a schema is built on the basis of the data collected previously.
For Example:
1. Let us take the example of a firm. The revenue and cost of a firm can be analyzed on the basis of different factors such as the geographical location of the firm's workplace, the firm's products, advertisements done, time utilized to launch a product, etc.
Example 1
2. Let us take the example of the data of a factory which sells products per quarter in Bangalore. The data is represented in the table given below:
(Table: 2D factory data)
In the above presentation, the factory's sales for Bangalore are shown with respect to the time dimension (organized into quarters) and the item dimension (sorted according to the kind of item sold). The facts are represented in rupees (in thousands).
Now, if we desire to view the sales data with a third dimension, location (for example Kolkata, Delhi, and Mumbai), then the three-dimensional data can be represented as a series of two-dimensional tables, one per location, as shown below:
(Table: 3D data representation as 2D)
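A small sketch of this multidimensional view using a pandas pivot table, with the location dimension spread across column groups; all figures, items, and locations are invented.

```python
import pandas as pd

# Quarterly sales by item and location, pivoted into one 2D slice per location.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["phone", "laptop", "phone", "laptop", "phone", "phone"],
    "location": ["Bangalore", "Bangalore", "Bangalore", "Kolkata", "Kolkata", "Kolkata"],
    "amount":   [605, 825, 680, 952, 410, 512],   # sales in thousands of rupees
})

cube = sales.pivot_table(index="item", columns=["location", "quarter"],
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)
```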
Data Cleaning:
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted, duplicated, or incomplete data from a dataset. Even if results and algorithms appear to be correct, they are unreliable if the data is inaccurate. There are numerous ways for data to be duplicated or incorrectly labeled when merging multiple data sources.
o Accuracy: The business's database must contain only extremely accurate data.
Comparing them to other sources is one technique to confirm their veracity. The stored
data will also have issues if the source cannot be located or contains errors.
o Coherence: To ensure that the information on a person or body is the same throughout
all types of storage, the data must be consistent with one another.
o Validity: There must be rules or limitations in place for the stored data. The information
must also be confirmed to support its veracity.
o Uniformity: A database's data must all share the same units or values. Keeping the data uniform does not complicate the process, so it is a crucial component of the data cleansing process.
o Data Verification: Every step of the process, including its appropriateness and effectiveness, must be checked. The analysis, design, and validation stages all play a role in the verification process. The disadvantages often become obvious only after the data has been through a certain number of changes.
o Clean Data Backflow: After quality issues have been addressed, the cleaned data should flow back to replace the dirty data in the original sources, so that legacy applications can profit from it and a subsequent data-cleaning program is not needed.
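A minimal sketch of routine cleaning work (standardizing formats and removing duplicates), assuming pandas; the customer records are invented.

```python
import pandas as pd

# Standardise formats and drop duplicate customer records.
customers = pd.DataFrame({
    "name":  [" Asha Rao", "asha rao", "Vikram Mehta", "Vikram Mehta"],
    "phone": ["98450-12345", "9845012345", "99020 11111", "9902011111"],
})

# Trim and normalise case; keep digits only in phone numbers.
customers["name"]  = customers["name"].str.strip().str.title()
customers["phone"] = customers["phone"].str.replace(r"\D", "", regex=True)

# Remove exact duplicates that remain after standardisation.
customers = customers.drop_duplicates().reset_index(drop=True)
print(customers)
```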
Data Cleansing Tools can be very helpful if you are not confident of cleaning the data yourself or
have no time to clean up all the data sets. There are many data cleaning tools in the market. Here
are some top-ranked data cleaning tools, such as:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. Data Cleaner
6. Cloudingo
7. Reifier
8. IBM Infosphere Quality Stage
9. TIBCO Clarity
10. Winpure
Having clean data will ultimately increase overall productivity and allow for the highest quality
information in your decision-making. Benefits include:
o Ability to map the different functions and what your data is intended to do.
o Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
o Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Data Integration
Data integration is the process of merging data from several disparate sources. While performing data integration, you must deal with data redundancy, inconsistency, duplication, etc. In data mining, data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store, retaining and providing a unified view of the data.
Data integration is particularly important in the healthcare industry. Integrated data from various
patient records and clinics assist clinicians in identifying medical disorders and diseases by
integrating data from many systems into a single perspective of beneficial information from
which useful insights can be derived. Effective data collection and integration also improve
medical insurance claims processing accuracy and ensure that patient names and contact
information are recorded consistently and accurately. Interoperability refers to the sharing of
information across different systems.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data,
making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can
be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources
can be difficult, especially when it comes to ensuring data accuracy, consistency, and
timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the
system.
8. Integration with existing systems: Integrating new data sources with existing systems can
be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.
There are two major approaches to data integration: the "tight coupling approach" and the "loose coupling approach".
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the
integrated data. The data is extracted from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner, meaning that the data is integrated
at a high level, such as at the level of the entire dataset or schema. This approach is also known
as data warehousing, and it enables data consistency and integrity, but it can be inflexible and
difficult to change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual
data elements or records. Data is integrated in a loosely coupled manner, meaning that the data
is integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables
data flexibility and easy updates, but it can be difficult to maintain consistency and integrity
across multiple data sources.
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
And the data only remains in the actual source databases.
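A hedged sketch contrasting the two approaches with pandas: a tiny ETL step that loads a combined warehouse table (tight coupling) versus a function that queries the sources on demand (loose coupling). The source tables, column names, and the federated_lookup helper are invented for illustration.

```python
import pandas as pd

# Two invented source systems with different schemas.
crm_source  = pd.DataFrame({"cust_id": [1, 2], "FullName": ["Asha Rao", "Vikram Mehta"]})
billing_src = pd.DataFrame({"customer": [1, 2], "amount_due": [1200.0, 0.0]})

# Tight coupling (ETL): extract, transform to a common schema, load one warehouse table.
warehouse = (
    crm_source.rename(columns={"FullName": "name"})
    .merge(billing_src.rename(columns={"customer": "cust_id"}), on="cust_id")
)

# Loose coupling (federation): translate the request and fetch from the sources on demand.
def federated_lookup(cust_id):
    name = crm_source.loc[crm_source.cust_id == cust_id, "FullName"].iloc[0]
    due  = billing_src.loc[billing_src.customer == cust_id, "amount_due"].iloc[0]
    return {"cust_id": cust_id, "name": name, "amount_due": due}

print(warehouse)
print(federated_lookup(2))
```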
Moving data from one system to another requires a data pipeline that understands the structure and meaning of the data as well as defines the path it will take through the technical systems. The specific techniques used for data integration depend on the characteristics of the data sources and the requirements of the application.
Data Transformation
Data transformation is a technique used to convert raw data into a suitable format that efficiently eases data mining and the retrieval of strategic information. Data transformation includes data cleaning techniques and data reduction techniques to convert the data into the appropriate form.
Data transformation is an essential data preprocessing technique that must be performed on the
data before data mining to provide patterns that are easier to understand.
Data transformation changes the format, structure, or values of the data and converts them
into clean, usable data. Data may be transformed at two stages of the data pipeline for data
analytics projects. Organizations that use on-premises data warehouses generally use an ETL
(extract, transform, and load) process, in which data transformation is the middle step.
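A small sketch of two common transformations applied before mining, min-max scaling of a numeric attribute and one-hot encoding of a categorical attribute, assuming pandas; the data is invented.

```python
import pandas as pd

# Convert raw values into a form that is easier to mine.
df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "income": [30000, 52000, 41000, 61000],
})

# Min-max scaling of the numeric column to the range [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# One-hot encoding of the categorical column.
df = pd.get_dummies(df, columns=["city"])
print(df)
```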