Summary
Business analytics:
analyze data with a view to the future (forward-looking rather than purely descriptive)
scientific process of transforming data into insights for the purpose of making better
decisions
action-driven approach
ensure our analysis is driving business action.
What is data?
Data: values or measurements that represent conditions, objects or ideas
In the context of Data Mining, data is often assumed to be available in a tabular form
Conventional tools:
reporting
query-languages
OLAP and spreadsheets
Disadvantages of conventional tools:
Often, only quite simple questions can be answered
Automation is difficult
Only small amounts of data can be handled (esp. spreadsheets)
Only primitive statistical methods are involved
OLAP: query-focused and low complexity of analysis
Data mining: the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems.
“Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data owner.” (Hand et al. 2001)
Clustering: attempts to group individuals in a population together by their similarity, but
without regard to any specific purpose
Useful in preliminary domain exploration
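As an illustration of grouping by similarity without a target attribute, here is a minimal sketch of k-means on a single numerical attribute. The notes do not name a particular clustering algorithm, so k-means is an assumption, as are the function name kmeans_1d, the customer data, and the starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one numerical attribute (illustration only)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical yearly spend of eight customers; no class labels are used.
spend = [10, 12, 11, 13, 95, 100, 98, 102]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(centers)
print(clusters)
```

The two groups emerge purely from the similarity of the values, which is why clustering suits preliminary exploration: no purpose or label has to be specified in advance.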
Classification: attempts to predict, for each individual in a population, which of a (small) set
of classes that individual belongs to
Classification algorithms provide models that determine which class a new individual
belongs to
Classification is related to scoring
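The link between classification and scoring can be sketched with a small k-nearest-neighbour model: the score is the share of similar individuals in a class, and the predicted class is the one with the highest score. The choice of k-nearest neighbours, the function names, and the customer data are assumptions made for illustration, not something prescribed by the notes.

```python
def knn_score(train, new_point, target_class, k=3):
    """Share of the k nearest training individuals that belong to target_class."""
    # Sort training individuals by squared Euclidean distance to the new one.
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], new_point)),
    )
    neighbours = by_distance[:k]
    return sum(1 for _, label in neighbours if label == target_class) / k


def knn_classify(train, new_point, classes, k=3):
    """Assign the class with the highest score."""
    return max(classes, key=lambda c: knn_score(train, new_point, c, k))


# Hypothetical customers: (age, monthly spend) and whether they churned.
train = [
    ((25, 20), "churn"),
    ((27, 25), "churn"),
    ((30, 22), "churn"),
    ((50, 80), "stay"),
    ((55, 90), "stay"),
    ((60, 85), "stay"),
]
new_customer = (28, 30)
label = knn_classify(train, new_customer, ["churn", "stay"])
score = knn_score(train, new_customer, "churn")
print(label, score)
```

Classification returns only the label, while scoring keeps the number behind it, which allows individuals to be ranked by how likely they are to belong to a class.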
Regression (value estimation): attempts to estimate or predict, for each individual, the
numerical value of some variable for that individual
A regression model is generated by looking at other, similar individuals in the population
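The idea of estimating a value from other, similar individuals can be sketched as a nearest-neighbour estimate: take the k most similar individuals and average their known values. This is only one possible regression approach; the function name knn_estimate and the rent data are hypothetical.

```python
def knn_estimate(train, new_x, k=2):
    """Estimate a numerical value as the mean over the k most similar individuals."""
    # Similarity here is simply closeness in the single input attribute.
    nearest = sorted(train, key=lambda row: abs(row[0] - new_x))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical data: (flat size in square metres, monthly rent).
train = [(30, 600), (45, 800), (50, 900), (80, 1400), (100, 1700)]
estimate = knn_estimate(train, new_x=48)
print(estimate)
```

In contrast to classification, the output is a number rather than one of a small set of classes.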
KDD process: Knowledge Discovery in Databases (KDD) is the systematic process of identifying
valid, practical, and understandable patterns in massive and complicated data sets.
CRISP-DM: Cross Industry Standard Process for Data Mining
Accuracy: closeness between the value in the data and the true value
Low accuracy of numerical attributes due to noisy measurements, limited precision,
wrong measurements, transposition of digits when entered manually.
Low accuracy of categorical attributes due to erroneous entries and typos.
Syntactic accuracy is violated if an entry does not belong to the domain of the attribute.
Semantic accuracy is violated if an entry is not correct although it belongs to the domain of
the attribute.
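The difference between the two kinds of accuracy can be sketched as follows. The attribute "country", its domain, and the helper names are assumptions for illustration. Note that the semantic check needs the true value, which is usually not available in practice.

```python
# Assumed domain of a hypothetical categorical attribute "country".
COUNTRY_DOMAIN = {"Germany", "France", "Italy"}


def syntactically_accurate(value, domain):
    """True if the entry belongs to the domain of the attribute."""
    return value in domain


def semantically_accurate(value, true_value):
    """True if the entry equals the true value."""
    return value == true_value


# Two entries for a customer who actually lives in France.
typo = "Frnace"    # not in the domain: syntactic accuracy is violated
wrong = "Germany"  # in the domain but incorrect: semantic accuracy is violated

print(syntactically_accurate(typo, COUNTRY_DOMAIN))
print(syntactically_accurate(wrong, COUNTRY_DOMAIN))
print(semantically_accurate(wrong, "France"))
```

Syntactic violations can be found automatically by checking entries against the domain; semantic violations cannot, because the wrong entry looks perfectly valid.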
Nonignorable missing values
The probability that a value for 𝑌 is missing depends on the true value of 𝑌.
Example: A temperature sensor does not work when there is frost, so exactly the low
temperatures are the ones that go unrecorded.
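Why this kind of missingness cannot simply be ignored can be seen in a small sketch of the sensor example. The temperature values and the failure rule (no reading below 0 degrees) are assumptions chosen for illustration.

```python
# Hypothetical true temperatures in degrees Celsius.
true_temps = [-6, -3, -1, 2, 5, 8, 11]

# Assumed failure rule: the sensor records nothing when there is frost.
recorded = [t if t >= 0 else None for t in true_temps]

# Dropping the missing entries keeps only the warm measurements.
observed = [t for t in recorded if t is not None]

true_mean = sum(true_temps) / len(true_temps)
observed_mean = sum(observed) / len(observed)
print(true_mean, observed_mean)
```

Because the missing values are systematically the low ones, the mean of the observed values overestimates the true mean, so analysing only the complete records gives a biased picture.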
Pearson’s Correlation Coefficient
The (sample) Pearson's correlation coefficient is a measure of the linear relationship between
two numerical attributes 𝐴_1 and 𝐴_2.
The larger the absolute value of the Pearson correlation coefficient, the stronger the linear
relationship between the two attributes.
Positive (negative) correlation indicates a line with positive (negative) slope.
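A minimal sketch of how the sample coefficient can be computed: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson and the example values are hypothetical.

```python
from math import sqrt


def pearson(a1, a2):
    """Sample Pearson correlation coefficient of two numerical attributes."""
    n = len(a1)
    mean1 = sum(a1) / n
    mean2 = sum(a2) / n
    # Sum of products of deviations (numerator).
    cov = sum((x - mean1) * (y - mean2) for x, y in zip(a1, a2))
    # Sums of squared deviations (denominator).
    ss1 = sum((x - mean1) ** 2 for x in a1)
    ss2 = sum((y - mean2) ** 2 for y in a2)
    return cov / sqrt(ss1 * ss2)


rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])              # line with positive slope
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])             # line with negative slope
parabola = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x squared
print(rising, falling, parabola)
```

The last case shows the limitation implied by "linear": the second attribute is fully determined by the first, yet the coefficient is zero because the relationship is not a line.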
Outliers in single attributes:
Categorical attributes: an outlier is a value that occurs with a much lower
frequency than all other values
Numerical attributes: outliers can be identified in boxplots or by statistical tests.
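Both cases can be sketched in a few lines. For numerical attributes the sketch uses the usual boxplot rule (values more than 1.5 interquartile ranges beyond the quartiles); for categorical attributes it flags values whose relative frequency falls below a threshold. The 1.5 factor is the common boxplot convention, while the 5% threshold, the function names, and the data are assumptions.

```python
from collections import Counter
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


def rare_categories(values, max_share=0.05):
    """Categories whose relative frequency is at most max_share (assumed threshold)."""
    counts = Counter(values)
    n = len(values)
    return [category for category, count in counts.items() if count / n <= max_share]


# Hypothetical numerical attribute with one suspicious measurement.
measurements = [10, 12, 11, 13, 12, 11, 10, 13, 12, 95]
numeric_outliers = boxplot_outliers(measurements)

# Hypothetical categorical attribute with one rare value (likely a typo).
colours = ["red"] * 30 + ["blue"] * 28 + ["green"] * 41 + ["gren"]
categorical_outliers = rare_categories(colours)

print(numeric_outliers, categorical_outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine rare observation still has to be decided with domain knowledge.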