DATA MINING
DATA MINING ISSUES
1. Human interaction: When a data mining task is undertaken, the goal is often not clear at the outset. Neither the users nor the technical experts know the results in advance, so a proper interface between the domain expert and the users is needed. The experts formulate the queries based on the users' demands.
2. Overfitting: Overfitting is a statistical error. When a model is generated for a particular data set, it is expected that the same model should accommodate future data sets as well. Overfitting occurs when the generated model fits the training data set very well but does not fit the test data set or future data sets.
3. Outliers: When a model is derived, some data values do not fit the model. These values are significantly different from the normal values, or they do not fit in any cluster. Such values are called outliers (a small sketch appears after this list).
4. Interpretation of the results: Interpreting the results obtained by data mining is a crucial task that goes beyond merely explaining the results; it requires expert analysis and interpretation. Hence, interpretation of the results is an issue in data mining.
5. Visualization of the results: Visualization of the results is useful to understand and quickly view the output of the different data mining algorithms.
6. Large data sets: Data mining models are generally designed and tested on small data sets. When these models are applied to very large data sets, they either fail or become unstable. Many models that work very well on normal data sets are inefficient at handling large data sets. The large data set issue can be handled with sampling and parallelization.
7. Noisy data: Data which has no meaning is called noisy data. These values need to be corrected or replaced with meaningful data.
8. Multimedia data: Many users demand mining tasks for graphical, video or audio data. Multimedia data can be an issue in data mining because traditional data mining tasks are designed for numeric or alphanumeric data.
9. Missing data: Sometimes the data is incomplete or missing. During the KDD process, this data may be filled in with nearest estimates. These estimates may give false or invalid results, creating problems.
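To make the outlier issue in point 3 concrete, here is a minimal sketch, assuming NumPy is available; the sample values and the standard 1.5 x IQR fences are illustrative choices, not taken from the text.

```python
# A minimal sketch of spotting outliers (issue 3) with the IQR rule,
# using NumPy; the sample values are purely illustrative.
import numpy as np

values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 54.0, 10.3, 9.7])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the "normal" range

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)   # 54.0 falls far outside the normal range
```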
**STAGES OF THE DATA MINING PROCESS (KDD)**
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process; it requires multiple iterations of the following steps to extract accurate knowledge from the data.
1. Selection: The data to be mined is not necessarily from a single source; it may have many heterogeneous origins. This data needs to be obtained from various data sources and files. Data selection is based on the mining goal: data relevant to the mining task is selected from the various sources.
2. Pre-processing: Pre-processing involves cleaning and integration of the data. The data selected for mining may contain incorrect or irrelevant values which lead to unwanted results, and some values may be missing or erroneous. Also, when data is collected from heterogeneous sources, it may involve varying data types and metrics. This data therefore needs to be cleaned and integrated to eliminate noise and inconsistency.
3. Transformation: Data transformation is the process of converting the data into a format suitable for processing. Here, the data is molded into the form required by the data mining process.
4. Mining: The mining step applies methods and techniques to extract the patterns present in the data. It involves transforming the relevant data records into patterns, for example using classification, by applying various data mining algorithms to the transformed data. The mining step generates the desired results for which the whole KDD process is undertaken.
APPLICATIONS OF DATA MINING
Data mining is used by many organizations to improve their customer base. They focus on customer behavioral patterns, market analysis, profit areas and product improvement.
(a) Education: Educational data mining deals with developing methods to discover knowledge from the education field. It is used to project students' areas of interest, future learning capacities and other aspects.
(b) Health and medicine: Data mining can be used effectively in health care systems. During the Covid-19 pandemic, predictions of the Covid-19 waves and the volume of patients were made using data mining.
(c) Market analysis: Market analysis is based on the patterns of purchases followed by customers. These patterns help a shop owner understand the buying behavior of customers, and accordingly useful decisions can be implemented to increase the profit of the store.
(d) Fraud detection: A fraud detection system helps in finding the pattern of a fraud, its potential attackers, and possible solutions using different data mining algorithms. These methods provide timely and efficient solutions for the detection and prevention of fraud.
!!DATA CLEANING!!
The first step in data preprocessing is data cleaning. Data cleaning includes handling missing data and noisy data.
(a) Missing data: Missing data is the case where some attributes or attribute values are missing or the data is not normalized. This situation can be handled by either ignoring the records or filling in the missing values.
(b) Noisy data: This is data with errors or data which has no meaning at all. It can lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with the binning method, regression and clustering.
!!DATA TRANSFORMATION!!
(a) Smoothing: The process of removing unnecessary data and cleaning the data so as to improve its usefulness.
(b) Aggregation: The process of collecting data from heterogeneous platforms and converting it to a uniform format. This improves the quality of the data.
(c) Discretization: Large data sets are complex to handle. Discretization is the process of breaking up the data into small intervals. These chunks are continuous and are supported by all the existing frameworks.
(d) Attribute construction: To improve the efficiency of the mining process, new attributes are generated from the existing data sets.
(e) Generalization: The process of converting low-level attributes to high-level attributes using a hierarchy.
(f) Normalization: In normalization, attributes are scaled to lie within a specified range (the cleaning, binning and normalization steps are illustrated in the sketch after this list).
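The following is a minimal sketch of the cleaning and transformation steps described above, assuming pandas is available; the column names and values are hypothetical and only serve to illustrate missing-value filling, binning (smoothing/discretization) and min-max normalization.

```python
# A minimal cleaning and transformation sketch using pandas.
# The column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [23, 45, None, 31, 52, 29],
                   "income": [32000, 81000, 45000, None, 98000, 40000]})

# Data cleaning: fill missing values with a nearest estimate (here, the column mean).
df = df.fillna(df.mean(numeric_only=True))

# Smoothing / discretization: bin the continuous "age" values into intervals.
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

# Normalization: min-max scale "income" into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df)
```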
!!DATA REDUCTION!!
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume.
Methods:
(a) Attribute selection: When data is collected from various sources, it may contain duplicate attributes, and some attributes are irrelevant. The attribute selection method removes such redundant and unnecessary attributes from the data set, resulting in an improved data set.
(b) Data cube aggregation: In this reduction method, aggregation is applied to selected data sets so as to obtain the data in a much simpler format.
(c) Numerosity reduction: In this reduction method, the actual data is substituted with a mathematical model of the data.
(d) Dimensionality reduction: In this reduction method, duplicate attributes are removed to reduce the data size.
!!DATA DISCRETIZATION!!
Large data sets are complex to handle. Discretization is the process of breaking up the data into small intervals, which reduces the data size. The data divided into intervals remains continuous, preserving its sequence. Every interval has its own name, and later these intervals can be replaced with actual data. These continuous chunks are supported by all the existing frameworks.
What is a data warehouse?
A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more.
Need for a Data Mart:
Since a data mart is related to a specific domain, information retrieval time is lower and efficiency is improved. It provides easy access to frequently requested data. Data marts are easy to implement, and the cost of implementation is less than that of a data warehouse. A data mart is agile: in case of a change in the model, a data mart can be rebuilt more quickly due to its smaller size. A data mart is defined by a single subject matter expert. Data can be segmented and stored on different hardware/software platforms.
Advantages and Disadvantages of a Data Mart:
Advantages: Data marts are domain specific, hence valuable to a specific group of users. They are cost effective and easy to implement. A data mart allows faster access to data. It is easy to use as it is specifically designed for the needs of its users. A data mart can accelerate business processes and is efficient to use. It contains historical data, which enables analysts to determine data trends.
Disadvantages: Many subsets of the corporate data warehouse may create an unnecessary burden. Data marts are very hard to maintain if they are created from unrelated data. A data mart cannot provide company-wide data analysis as its data set is limited.
!!Different OLAP Operations!!
OLAP operations are performed on multidimensional data. This data is organized in various dimensions, and every dimension includes multiple levels of abstraction. Various OLAP operations are available to present these views; they are based on a multidimensional view of data. Here is the list of OLAP operations (illustrated in the sketch after this list):
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
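As a rough, hedged illustration of roll-up, slice and pivot, the sketch below mimics these operations on a small in-memory cube using pandas group-by and pivot operations; the dimensions (year, quarter, city) and the sales figures are hypothetical, and a real OLAP server would of course work on a proper multidimensional cube.

```python
# A minimal sketch of OLAP-style roll-up, slice and pivot using pandas.
# The dimensions, measure and values are hypothetical.
import pandas as pd

cube = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q2"],
    "city":    ["Pune", "Pune", "Mumbai", "Pune", "Mumbai"],
    "sales":   [100, 150, 120, 170, 90],
})

# Roll-up: aggregate from (year, quarter, city) up to the year level.
rollup = cube.groupby("year")["sales"].sum()
print(rollup)

# Slice: fix one dimension (quarter == "Q2") to obtain a sub-cube.
slice_q2 = cube[cube["quarter"] == "Q2"]
print(slice_q2)

# Pivot (rotate): reorient the view with cities as columns.
pivoted = cube.pivot_table(index="year", columns="city", values="sales", aggfunc="sum")
print(pivoted)
```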
!!Snowflake Schema!!
A snowflake schema is a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. 'Snowflaking', i.e. the normalization of the dimension tables, can be done in many different ways.
•A snowflake schema is an arrangement of tables in a multidimensional database system.
Advantages of the Snowflake Schema: Data is structured. Data integrity is maintained. Less disk space is utilized.
Disadvantages of the Snowflake Schema: It requires more complex queries, and complex queries decrease performance.
!!Star Schema!!
•The star schema is the most common schema in data warehouses and is widely used to design them. The basic architecture of a star schema includes one fact table and many dimension tables. The advantage of the star schema is that it is very efficient in handling queries.
•A star schema contains one fact table associated with many dimension tables.
•The fact table contains the primary information of the data warehouse.
•The dimension tables hold the details of the surrounding entities.
•The primary key present in each dimension table is related to a foreign key present in the fact table.
Advantages of the Star Schema:
•Its performance is good because simple queries are used.
•It contains single (non-normalized) dimension tables.
•In a star schema, both dimension and fact tables are in de-normalized form.
•It has fewer foreign keys and hence shorter query execution time.
Disadvantages of the Star Schema:
•It has redundant data and is hence difficult to maintain or change.
•There are data integrity issues.
•Many-to-many relationships are not supported.
!!Definition of Classification!!
•As the name suggests, classification is the process of classifying data. It is a data mining technique used for analysis of the data. It is the process of finding the model that defines the classes and their concepts, and it identifies and categorizes sub-populations of the data.
•The main goal of a classification algorithm is to identify the category of a given data record; these algorithms are mainly used to predict the output for categorical data.
!!Applications of Classification!!
•Credit approval: classify an applicant as a good or poor credit risk.
•Target marketing: build the profile of a good customer.
•Medical diagnosis: develop a profile of stroke victims; cancer tumor cell identification.
•Fraud detection: determine whether a credit card purchase is fraudulent.
•Email spam classification: filter spam e-mail automatically.
•Banking: predict a bank customer's willingness to repay a loan.
!!Classifier!!
The algorithm which implements classification on a dataset is known as a classifier.
1. Binary classifier: if the classification problem has only two possible outcomes, it is called a binary classifier.
2. Multi-class classifier: if a classification problem has more than two outcomes, it is called a multi-class classifier.
!!Tree Pruning!!
•Tree pruning is a process in which branches induced by anomalies in the training data, caused by outliers or noise, are removed.
•Pruned trees are beneficial because they are smaller and less complex.
•Decision trees trained on any training data tend to overfit that data, so pruning also reduces the overfitting problem (a sketch follows the decision tree lists below).
•The pruning process cuts off the lower ends of the tree so as to make it simpler and less complex.
!!Advantages of Decision Trees!!
1. Decision trees (DTs) are easy to use and efficient.
2. Rules can be generated that are easy to interpret and understand.
3. They can handle nonlinear parameters easily.
4. The scalability of DTs is good, as the tree size is independent of the database size.
5. Trees can be constructed over many attributes.
!!Disadvantages of Decision Trees!!
1. They do not easily handle continuous data; such attribute domains must be divided into categories to be handled.
2. The mathematical computation of a decision tree usually requires more memory.
3. The computation of a decision tree is time consuming.
4. The space and time complexity of the decision tree model is relatively high.
5. Decision tree model training time is long, as the complexity is high.
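Below is a minimal sketch, assuming scikit-learn, of training a decision tree classifier and then "pruning" it by limiting depth and applying cost-complexity pruning, as discussed in the tree pruning section above; the dataset and the parameter values (max_depth=4, ccp_alpha=0.01) are illustrative choices only.

```python
# A minimal decision tree classification sketch with simple pruning,
# assuming scikit-learn; the data and parameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fully grown tree: tends to overfit the training data.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# "Pruned" tree: limit depth and apply cost-complexity pruning (ccp_alpha).
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01,
                                random_state=42).fit(X_train, y_train)

print("full   test accuracy:", full.score(X_test, y_test))
print("pruned test accuracy:", pruned.score(X_test, y_test))
print("node counts:", full.tree_.node_count, "vs", pruned.tree_.node_count)
```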
!!Classification Algorithms!!
The main aim of a classification algorithm is to identify the category of a given data record; these algorithms are mainly used to predict the output for categorical data.
!!Types!!
1. Statistical-based classification: Statistical classification is the broad supervised learning approach that trains a program to categorize new, unlabeled information based upon its relevance to known, labeled data.
2. Distance-based classification: Distance-based algorithms are nonparametric methods that can be used for classification. They classify objects by the difference between them as measured by distance functions.
3. Decision tree based classification: A decision tree builds the classification in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. It can handle both categorical and numerical data.
4. Neural network based classification: Artificial neural networks are relatively simple electronic networks of neurons based on the neural structure of the brain. They process records one at a time and learn by comparing their (initially largely arbitrary) classification of each record with its known actual classification.
!!BAYES CLASSIFICATION METHODS!!
Bayesian classification is a probabilistic approach to learning and inference based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learnt. Bayesian classification is based on Bayes' theorem, and Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Naive Bayes is a classifier which uses Bayes' theorem: it predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class, and the class with the highest probability is considered the most likely class.
!!Naïve Bayesian Classification!!
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms which all share a common principle: every pair of features being classified is independent of each other. It is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
!!Applications of Naive Bayes Algorithms!!
•Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
•Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; it can predict the probability of multiple classes of the target variable.
•Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (due to good results on multi-class problems and the independence assumption) and have a higher success rate than many other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments). A small sketch follows this list.
•Recommendation systems: A Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
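Here is a minimal spam-filtering sketch with a multinomial Naive Bayes classifier, assuming scikit-learn; the toy messages and labels are hypothetical and only illustrate how class membership probabilities are produced.

```python
# A minimal Naive Bayes spam-filtering sketch, assuming scikit-learn;
# the toy messages and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "lowest price on meds",
            "meeting at 10 am tomorrow", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # bag-of-words features

model = MultinomialNB().fit(X, labels)      # class priors + word likelihoods

new = vectorizer.transform(["free prize meeting"])
print(model.predict(new))                   # most probable class
print(model.predict_proba(new))             # per-class membership probabilities
```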
!!K-NEAREST-NEIGHBOR CLASSIFIERS!!
K-Nearest-Neighbor (KNN) is a supervised learning algorithm used for classification and regression; its major application is in classification for predictive problems. KNN stores all available cases and classifies new cases based on the similarity between data items. KNN makes no assumptions about the underlying data (it is a non-parametric algorithm).
!!SVM CLASSIFIER!!
•Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for classification as well as regression problems. SVM is a method for the classification of both linear and nonlinear data; it uses a nonlinear mapping to transform the original training data into a higher dimension.
•The main objective of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane.
!!Types of SVM!!
1. Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable and the classifier used is called a linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear and the classifier used is called a non-linear SVM classifier.
!!Linear Regression!!
•Linear regression models the relationship between two variables by fitting a linear equation to the observed data. It is the simplest type of regression.
•Linear regression attempts to find the mathematical relationship between variables.
•If the fitted relationship is a straight line it is considered a linear model; if it is a curved line, it is a non-linear model.
•The relationship between the dependent and independent variable is given by a straight line, and there is only one independent variable: Y = a + βx
!!Non-linear Regression!!
•Non-linear regression is a form of regression analysis in which data is fit to a model and then expressed as a mathematical function.
•Non-linear regression uses a curved function of an X variable (or variables) to predict a Y variable.
•Non-linear regression can, for example, model a prediction of population growth over time.
!!INTRODUCTION TO PREDICTION!!
Prediction is one of the data mining processes used to find a numerical output. Here, the training dataset contains the inputs and the corresponding numerical output values. From the training dataset, the algorithm derives a model, or predictor. When new data is given, the model should produce a numerical output. This method does not use a class label; the model predicts a continuous-valued function or ordered value.
•Prediction is the method of estimating missing or unavailable numerical data for a new observation.
•In prediction, the accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data. A small regression-based prediction sketch follows.
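The sketch below, assuming scikit-learn and NumPy, fits the linear model Y = a + βx mentioned above and then predicts a continuous value for a new input, which is exactly the kind of numerical output the prediction section describes; the data points are hypothetical.

```python
# A minimal sketch of fitting Y = a + βx and predicting a continuous value,
# assuming scikit-learn; the data points are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # observed outputs

model = LinearRegression().fit(x, y)
print("a (intercept):", model.intercept_)
print("beta (slope):", model.coef_[0])

# Prediction: a continuous-valued output for new, unseen input.
print("predicted y at x=6:", model.predict(np.array([[6.0]]))[0])
```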
!!Applications of Clustering!!
•Data summarization and compression: Clustering is useful in fields like image processing and vector quantization which require data summarization, compression and reduction.
•Trend detection in dynamic data: Clustering can also be applied for trend detection in dynamic data sets, as clusters of similar trends can be created.
•Social network analysis: In social network analysis, for example on Facebook or Twitter, clustering is used for generating sequences in images, videos or audio.
•Biological data analysis: In biological data, for example for cancer detection, clustering can be used to group similar images and videos.
•Marketing: Finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records.
•Biology: Clustering helps classify animals and plants into groups using similar functions or genes.
•Libraries: Book ordering.
•Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
•City planning: Identifying groups of houses according to their house type, value and geographical location.
•Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
•WWW: Document classification; clustering weblog data to discover groups of similar access patterns.
!!Requirements for Cluster Analysis!!
•Scalability: Highly scalable clustering algorithms are needed to deal with large databases such as big data.
•Ability to deal with different kinds of attributes: Algorithms should be applicable to any kind of data, such as numerical, categorical, and binary data.
•Discovery of clusters with arbitrary shape: Clustering algorithms should determine clusters of arbitrary shape; different types of distance measures are used, for example to find small circular clusters.
•High dimensionality: The clustering algorithm should be able to handle low- as well as high-dimensional data sets.
•Ability to deal with noisy data: Algorithms should be designed so that they can handle noisy data; algorithms that cannot handle it produce poor-quality clusters.
•Interpretability: Clustering algorithms should produce interpretable, comprehensible, and usable results.
!!Types of Clustering!!
1. Hierarchical clustering:
•In this type of clustering, a nested set of clusters is created.
•Every level in the hierarchy has a distinct set of clusters.
•At the lowest level, every item is its own cluster; these clusters are unique.
•As the level in the hierarchy increases, items are grouped together into larger clusters.
•These algorithms can be classified as agglomerative or divisive.
•Agglomerative: clusters are created in a bottom-up fashion (a sketch appears after this list).
•Divisive: clusters are created in a top-down fashion.
2. Partitional clustering:
•In this type of clustering, only one set of clusters is created.
•The desired number of clusters is defined in advance.
3. Categorical clustering:
•These algorithms work on categorical databases, in which values describe some characteristic or category, for example: what is your favourite colour?
4. Large-database clustering:
•These algorithms work on large databases, i.e. big data.
•They adapt to memory constraints by either sampling or compression techniques.
•In the case of sampling, data structures are used which can be compressed or pruned to fit into memory irrespective of the size of the database.
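As a minimal sketch of bottom-up (agglomerative) hierarchical clustering, assuming scikit-learn, the example below merges the closest clusters until two remain; the 2-D points, n_clusters=2 and the average linkage are illustrative choices.

```python
# A minimal bottom-up (agglomerative) hierarchical clustering sketch,
# assuming scikit-learn; the 2-D points and n_clusters=2 are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one dense group
                   [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])   # another dense group

# Each point starts as its own cluster; the closest clusters are merged
# repeatedly until the requested number of clusters remains.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1]
```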
!!K-MEANS: A Centroid-Based Technique!!
K-means clustering is an iterative clustering algorithm in which items are moved among a set of clusters until the desired set is reached. It is a type of squared-error algorithm, and a convergence criterion must be defined to obtain the final result.
Input:
D = {t1, t2, ..., tn} // set of elements
k // number of desired clusters
Output:
K // set of k clusters
K-means algorithm:
1. assign initial values for the means m1, m2, ..., mk;
2. repeat
3. assign each item ti to the cluster whose mean is closest;
4. calculate the new mean for each cluster;
5. until the convergence criterion is met;
!!Market Basket Analysis!!
•Market basket analysis is a data mining technique used by retailers to increase sales by better understanding customer purchasing patterns.
•It analyses large data sets, for example purchase histories, to find product groupings and products that are likely to be purchased together.
•Combinations of items that occur frequently in the transactions are examined. Such examination of transactions allows retailers to identify relationships between the items that people frequently purchase together.
•For example, people who buy bread and peanut butter also buy jelly; people who buy sugar may buy milk.
!!APRIORI ALGORITHM!!
Apriori is one of the best-known algorithms for generating association rules. It is a powerful algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm reflects the property it uses: the Apriori property (a small Python sketch follows this list).
•The Apriori property is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
•The algorithm uses an iterative, level-wise search.
•At any level, frequent k-itemsets are used to explore (k + 1)-itemsets.
•In the first step, the whole database is scanned and the count of each individual item is found; a minimum support is assumed.
•Items that satisfy the minimum support are kept, giving the set of frequent itemsets denoted L1 (level 1).
•L1 is used to find L2 (the set of frequent 2-itemsets).
•L2 is used to find L3, and the process continues until no more frequent k-itemsets can be found.
•The database has to be scanned each time to find the frequent k-itemsets Lk.
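The following is a minimal level-wise sketch in the spirit of Apriori, written in plain Python with a simplified candidate-generation step (unions of frequent k-itemsets) rather than the full pruning used by the real algorithm; the transactions and min_support value are hypothetical.

```python
# A minimal level-wise frequent-itemset sketch in the spirit of Apriori,
# written in plain Python; the transactions and min_support are hypothetical.
from itertools import combinations

transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk", "sugar"},
    {"sugar", "milk"},
    {"bread", "peanut butter", "milk"},
]
min_support = 2  # an itemset is "frequent" if it occurs in at least 2 transactions

def support(itemset):
    """Count how many transactions contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# L1: scan the database once and keep single items meeting min_support.
items = {i for t in transactions for i in t}
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 1
while level:
    print(f"L{k}:", [(sorted(s), support(s)) for s in level])
    # Candidate (k+1)-itemsets are built from unions of frequent k-itemsets.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
    level = [c for c in candidates if support(c) >= min_support]
    k += 1
```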
!!Components of Data Science!!
1. Statistics: Data science cannot be thought of without statistics; it is its basic and critical unit. A huge amount of numerical data is given to an algorithm, different statistical measures are applied to it, and useful output is generated.
2. Visualization: Huge amounts of data are represented so that they can be grasped at a glance and easily understood by business processes, their stakeholders and society.
3. Machine Learning: Different machine learning algorithms are used to make predictions about future or unseen data.
4. Deep Learning: A newer machine learning research technique in which the algorithm itself selects the analysis model to follow.
5. Advanced Computing: Data science is essentially data computing, so many advanced computing techniques and tools are needed to process data in data science.
6. Data Engineering: The preparation of systems for collecting, validating, and preparing high-quality data. Data engineers gather and prepare the data, and data scientists use the data to promote better business decisions.
7. Domain Expertise: Domain experts have knowledge about the data domain. The useful insights output by data science need to be clearly understood by the user; if the user is unable to interpret the results, domain experts are needed to interpret them.
!!DATA SCIENCE PROCESS!!
Step 1 - Discovery: This step involves obtaining data from all identified internal and external sources. The acquired data helps the user answer the business problem under study.
Step 2 - Data Preparation: The collected data has many problems such as missing values, blank columns and incorrect data formats, which need to be cleaned. Before the data is used for modelling it should be clean, transformed and in a single format, so data transformation is done at this step.
Step 3 - Model Planning: In this step of the data science process, the model for the training data set is decided. The user determines the method and technique to draw the relation between input variables and output variables; the model plan uses different statistical formulas and visualization tools.
Step 4 - Model Building: Models are built using the training data set and their performance is evaluated on the test data set. The actual model building happens here: the data scientist distributes the datasets for training and testing (a sketch appears after the data analytics points below).
Step 5 - Operationalize: In this stage, the user delivers the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing. Visual insights are prepared at this step to discover useful patterns in the data.
Step 6 - Communicate Results: This stage is about presenting the results and automating the analysis. The key results are communicated to all stakeholders, which helps the user decide whether the project is a success or a failure based on the inputs from the model. Actionable insight is a key outcome that shows how data science can deliver predictive and, later, prescriptive analytics.
!!BASICS OF DATA ANALYTICS!!
Data Analytics (DA) is the process of examining data sets to draw inferences about the information they contain, increasingly with the aid of specialized systems and software. Data analytics is used in many industries to allow companies and organizations to make better business decisions, and in the sciences to verify or disprove existing models or theories. The following points clarify data analytics:
1. Define the question or goal behind the analysis: what is the user trying to discover?
2. Collect the right data to answer this question.
3. Perform data cleaning/scrubbing to improve data quality, remove unnecessary data and prepare it for the analysis process: the right data, in the right format.
4. Manipulate the data using suitable tools or techniques.
5. Analyse and interpret the data using statistical tools and techniques, finding correlations, patterns, trends and outliers in the data.
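As a minimal sketch of the model building step (Step 4 above), the example below splits a dataset into training and test portions, fits a model on the training set and evaluates it on the held-out test set; it assumes scikit-learn, and the dataset and model choice are illustrative.

```python
# A minimal sketch of Step 4 (model building): split the data,
# train on the training set, evaluate on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Distribute the dataset into training and test portions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Performance is evaluated on data the model has never seen.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```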
!!CHALLENGES OF DATA SCIENCE TECHNOLOGY!!
Data science is broadening its branches all over the world, but it involves many challenges that slow data scientists down while dealing with data. Some of the major challenges are:
•A high variety of information and data is required for accurate analysis.
•An adequate data science talent pool is not available.
•Management does not provide financial support for a data science team.
•Unavailability of, or difficult access to, data.
•Data science results are not effectively used by business decision makers.
•Explaining data science to others is difficult.
•Privacy issues.
•Lack of a significant domain expert.
•If an organization is very small, it cannot have a data science team.
!!Difference between EDA and IDA!!
IDA means Initial Data Analysis. IDA focuses on: checking the assumptions required for model fitting; hypothesis testing; handling missing values; and making transformations of variables as needed. IDA is part of EDA. EDA is a method/philosophy for data analysis that employs a variety of (mostly graphical) techniques. The goals of EDA are to: maximise insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings. A small EDA sketch appears at the end of this section.
!!Principles of Data Visualization!!
1. Select the right graph/chart: Some charts are well intended but take more time to interpret, so choose the graph based on the kind of data and the message to be conveyed. Do not use a variety of graphs just for the sake of it; sometimes using numbers is better than graphs or charts. For example, display a pie chart for percentages instead of a bar graph.
2. Form must follow function: An intuitive design is more important than appealing charts, and graphs should convey the meaning of the data in an easy-to-understand manner.
3. Balance the design: Visual elements should be evenly distributed across plots, charts, colour, text, shape, and space. A symmetrical visual layout gives the best visualization of data.
4. Focus on the key areas: Ensure that key areas are highlighted so that the user notices them quickly.
5. Keep visualization simple: The visualizations displayed must be easy to understand. Remove unwanted information and avoid confusion; the goal of data visualization is simplicity.
6. Incorporate interactivity: Interactivity should be built into charts and graphs using data visualization tools.
7. Use patterns: Use visualization tools to display patterns in the data, marking similar patterns with similar colours.
8. Compare aspects: Display the same data in different charts, arranged horizontally, vertically or both, so that different aspects can be compared.
!!BENEFITS OF DATA VISUALIZATION!!
Data visualization helps business stakeholders analyze reports regarding sales, marketing strategies, and product interest. Based on the analysis, they can focus on the areas that require attention to increase profits, making the business more productive. Different visualization techniques are used to take quick action on the problem under study and to take the necessary actions for business growth. Visualization helps business users recognize new patterns and find errors in the data; these patterns help the users pay attention to areas that indicate progress, which in turn drives the business ahead to achieve its goals. Some visualization techniques are used to understand the story behind the data. As decision making becomes easier, visualization is also used for exploring different business insights and for grasping the latest trends in knowledge through data.
!!ADVANTAGES AND DISADVANTAGES OF EDA!!
Advantages of EDA: It gives valuable insights into the data. Visualization is an effective tool to detect outliers. It helps with feature selection.
Disadvantages of EDA: If not performed properly, EDA can be misleading. EDA is not effective when dealing with high-dimensional data.
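To close the section, here is a minimal EDA sketch, assuming pandas and matplotlib; the column name and values are hypothetical. It shows the kind of quick numerical summary and graphical outlier check described above.

```python
# A minimal EDA sketch, assuming pandas and matplotlib;
# the column name and data are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"income": [32, 35, 30, 38, 120, 33, 36, 31, 34, 37]})

# Maximise insight: quick numerical summary of the data set.
print(df.describe())

# Detect outliers and anomalies visually with a boxplot,
# and look at the underlying structure with a histogram.
fig, axes = plt.subplots(1, 2)
df.boxplot(column="income", ax=axes[0])
df["income"].hist(ax=axes[1])
plt.show()
```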