
Business Intelligence:

 storing and reporting data; looking backward

Business analytics:
 analyzing data; looking into the future
 the scientific process of transforming data into insights for the purpose of making better decisions
 an action-driven approach
 ensures the analysis drives business action

Methods of business analytics:

1. Descriptive: the analysis of historical data using two key methods – data aggregation and data mining.
 used to describe what has happened in the past
 uses fairly simple analysis techniques
 forms the core of everyday reporting
 shows companies the raw results of their actions

2. Predictive: uses probabilities to assess what could happen in the future.
 uses data mining
 uses statistical modelling and machine learning techniques

3. Prescriptive: shows companies which option is the best.
 borrows heavily from mathematics and computer science, using a variety of statistical methods
 emphasises actionable insights instead of data monitoring

What is data?
Data: values or measurements that represent conditions, objects or ideas.
In the context of data mining, data is often assumed to be available in tabular form.
Conventional tools:
 reporting
 query languages
 OLAP and spreadsheets
Disadvantages of conventional tools:
 often, only quite simple questions can be answered
 automation is difficult
 only small amounts of data can be handled (especially spreadsheets)
 only primitive statistical methods are involved
 OLAP is query-focused and offers only low-complexity analysis

Data mining: the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis.
“Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data owner.” (Hand et al. 2001)
Clustering: attempts to group individuals in a population together by their similarity, but
without regard to any specific purpose
 Useful in preliminary domain exploration

Classification: attempts to predict, for each individual in a population, which of a (small) set
of classes that individual belongs to
 Classification algorithms provide models that determine which class a new individual
belongs to
 Classification is related to scoring

Regression (value estimation): attempts to estimate or predict, for each individual, the numerical value of some target variable for that individual
 A regression model is generated by looking at other, similar individuals in the population
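The shared idea behind classification and regression can be sketched with a toy 1-nearest-neighbour rule: predict for a new individual by copying the most similar individual in the population. The data and function names below are illustrative, not from the course material:

```python
# Toy 1-nearest-neighbour prediction: find the most similar individual
# (smallest distance on a single numerical attribute) and copy its
# class label (classification) or numerical target (regression).
def nearest(population, x):
    # population entries: (attribute value, class label, numerical target)
    return min(population, key=lambda item: abs(item[0] - x))

population = [(1.0, "low", 10.0), (2.0, "low", 12.0),
              (8.0, "high", 40.0), (9.0, "high", 44.0)]

_, label, value = nearest(population, 8.5)
print(label)  # classification: high
print(value)  # regression: 40.0
```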

KDD (Knowledge Discovery in Databases): the systematic process of identifying valid, practical, and understandable patterns in massive and complicated data sets.
CRISP-DM: Cross Industry Standard Process for Data Mining

Accuracy: closeness between the value in the data and the true value
 Low accuracy of numerical attributes due to noisy measurements, limited precision,
wrong measurements, transposition of digits when entered manually.
 Low accuracy of categorical attributes due to erroneous entries and typos.

Syntactic accuracy is violated if an entry does not belong to the domain of the attribute.
Semantic accuracy is violated if an entry is not correct although it belongs to the domain of
the attribute.

Missing at random (MAR)

 The probability that a value for Y is missing does not depend on the true value of Y:
P(Y = ? | X_obs) = P(Y = ? | Y, X_obs)
 Example: The maintenance staff does not change the batteries of a sensor when it
rains. Thus, the sensor does not always provide measurements when it rains.

Nonignorable
 The probability that a value for 𝑌 is missing depends on the true value of 𝑌.
 Example: A sensor for the temperature will not work when there is frost.
Pearson’s Correlation Coefficient

The (sample) Pearson's correlation coefficient is a measure of the linear relationship between two numerical attributes A_1 and A_2.
The larger the absolute value of the Pearson correlation coefficient, the stronger the linear
relationship between the two attributes.
Positive (negative) correlation indicates a line with positive (negative) slope.
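The coefficient can be computed directly from its definition; a minimal sketch in plain Python, using made-up data rather than anything from the text:

```python
import math

# Sample Pearson correlation coefficient between two numerical
# attributes a1 and a2, computed from its definition:
# covariance divided by the product of the standard deviations.
def pearson(a1, a2):
    n = len(a1)
    m1, m2 = sum(a1) / n, sum(a2) / n
    cov = sum((x - m1) * (y - m2) for x, y in zip(a1, a2))
    s1 = math.sqrt(sum((x - m1) ** 2 for x in a1))
    s2 = math.sqrt(sum((y - m2) ** 2 for y in a2))
    return cov / (s1 * s2)

# A perfect line with positive slope gives coefficient 1.0
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```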

Single attributes:
 Categorical attributes: an outlier is a value that occurs with an extremely low frequency compared to the frequency of all other values
 Numerical attributes: outliers can be identified in boxplots or by statistical tests.
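For numerical attributes, the boxplot criterion corresponds to the common 1.5 × IQR rule behind boxplot whiskers; a minimal sketch with illustrative data:

```python
import statistics

# Boxplot-style outlier detection: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # [95]
```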

Supervised Learning: specific target


Unsupervised Learning: no specific target

What is Cluster Analysis?


Clustering is the process of grouping data into classes or clusters so that objects within a
cluster have high similarity in comparison to one another, but are very dissimilar to objects in
other clusters.
Cluster = collection of data objects that are similar to each other
Clustering is unsupervised learning

Hierarchical-Based Clustering: Creates a hierarchical decomposition of a set of data objects


Two main approaches:
 Agglomerative approach (bottom up): start with each object as a separate group,
then successively merge groups until a certain termination condition holds
 Divisive approach (top-down): start with all objects in one cluster, then successively
split up into smaller clusters until a certain termination condition holds
Measures for distances between two clusters:
 Single linkage: Minimum distance between two data points of different clusters
 Complete linkage: Maximum distance between two data points of different clusters
 Mean distance: distance between the means of the clusters
 Average linkage: Average distance of all single distances of data points from different
clusters
The dendrogram visualizes all splits/merges and helps to identify a suitable number of
clusters after the computation
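The four inter-cluster distance measures above can be sketched in plain Python for clusters of one-dimensional points (the data is illustrative):

```python
# Distance between two single points (1-D case for simplicity).
def dist(a, b):
    return abs(a - b)

# Minimum distance between two data points of different clusters.
def single_linkage(c1, c2):
    return min(dist(a, b) for a in c1 for b in c2)

# Maximum distance between two data points of different clusters.
def complete_linkage(c1, c2):
    return max(dist(a, b) for a in c1 for b in c2)

# Average distance over all pairs from different clusters.
def average_linkage(c1, c2):
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Distance between the means of the clusters.
def mean_distance(c1, c2):
    return dist(sum(c1) / len(c1), sum(c2) / len(c2))

c1, c2 = [1.0, 2.0], [5.0, 7.0]
print(single_linkage(c1, c2))    # 3.0  (2 vs 5)
print(complete_linkage(c1, c2))  # 6.0  (1 vs 7)
print(average_linkage(c1, c2))   # 4.5
print(mean_distance(c1, c2))     # 4.5  (mean 1.5 vs mean 6.0)
```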
Prediction = estimating an unknown value
Induction = generalizing from specific cases to general rules

Entropy = a measure of disorder that can be applied to a set 𝑆


Misclassification Rate = The sum of fractions of samples that represent a minority class
inside a set
Gini Index = Fraction of times in which any randomly drawn instance from the set 𝑆 would be
labeled incorrectly if its label was randomly chosen according to the label distribution inside
the set
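The three impurity measures above can all be computed from the class label distribution of a set 𝑆; a minimal sketch with an illustrative two-class set:

```python
import math
from collections import Counter

# Fraction of the set belonging to each class label.
def proportions(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

# Entropy: measure of disorder of the label distribution.
def entropy(labels):
    return -sum(p * math.log2(p) for p in proportions(labels))

# Misclassification rate: sum of the minority-class fractions,
# i.e. 1 minus the fraction of the majority class.
def misclassification_rate(labels):
    return 1 - max(proportions(labels))

# Gini index: probability of labelling a random instance incorrectly
# when the label is drawn from the set's own label distribution.
def gini(labels):
    return 1 - sum(p * p for p in proportions(labels))

s = ["yes", "yes", "no", "no"]    # maximally mixed two-class set
print(entropy(s))                 # 1.0
print(misclassification_rate(s))  # 0.5
print(gini(s))                    # 0.5
```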
Correlation = statistical relationship between two attributes
