JSS Science and Technology University, MYSURU-570006
Department of Information Science and Engineering
Submitted by:
Anirudh D (01JST19PSE001)
Statistical Data Mining
Data mining is the extraction of information from huge sets of data; in other words, it is
the procedure of mining knowledge from data.
Because the scope of data mining is broad, a large variety of data mining methodologies
exist. These include Statistical Data Mining, Foundations of Data Mining, Visual and
Audio Data Mining, etc.
Statistical data mining techniques are designed for the efficient handling of huge amounts of
data that are typically multidimensional and possibly of various complex types. Major
statistical methods for data analysis include:
• Regression
• Generalized linear models
• Analysis of variance
• Mixed-effect models
• Factor analysis
• Discriminant analysis
• Survival analysis
• Quality control
1. Regression: Regression is a data mining technique used to fit an equation to a dataset.
   Regression analysis is a statistical method for modelling the relationship between a
   dependent variable and one or more independent variables. More specifically,
   regression analysis helps us to understand how the value of the dependent variable
   changes as an independent variable varies. It predicts continuous/real values such as
   temperature, age, salary, price, etc.
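The least-squares fit described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the years-vs-salary numbers are invented for the example.

```python
# Ordinary least squares for simple linear regression (one predictor).
# A minimal sketch using only built-ins; the data below are invented.

def fit_simple_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]   # perfectly linear: salary = 25 + 5 * years
b0, b1 = fit_simple_regression(years, salary)
print(b0, b1)  # → 25.0 5.0
```

The fitted equation can then be used to predict a continuous value (here, salary) for a new input.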
2. Generalized linear models: These models, and their generalizations, allow a categorical
   response variable to be related to a set of predictor variables in a manner similar to the
   modelling of a numeric response variable using linear regression. Generalized linear
   models include logistic regression and Poisson regression.
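As a sketch of how a generalized linear model relates a categorical response to a predictor, the following fits a logistic regression (the GLM with a logit link) by plain gradient ascent on the log-likelihood. The hours-vs-pass data are invented, and the fixed learning rate and iteration count are ad hoc choices for the example.

```python
import math

# Logistic regression fitted by gradient ascent on the log-likelihood.
# Minimal sketch; the binary pass/fail data below are invented.

def fit_logistic(xs, ys, lr=0.02, steps=20000):
    """Return (intercept, slope) for P(y=1|x) = 1/(1+exp(-(b0+b1*x)))."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient of log-likelihood w.r.t. b0
            g1 += (y - p) * x    # gradient w.r.t. b1
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 1, 0, 1, 1]      # categorical (binary) response
b0, b1 = fit_logistic(hours, passed)
p_at_6 = 1.0 / (1.0 + math.exp(-(b0 + b1 * 6)))
p_at_1 = 1.0 / (1.0 + math.exp(-(b0 + b1 * 1)))
```

The slope comes out positive, so the predicted probability of passing increases with study hours, which is exactly the categorical-response analogue of a positive linear-regression coefficient.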
3. Analysis of variance: Analysis of variance (ANOVA) is a statistical analysis tool that
   splits the observed aggregate variability found inside a data set into two parts:
   systematic factors and random factors. The systematic factors have a statistical
   influence on the given data set, while the random factors do not. Analysts use the
   ANOVA test to determine the influence that independent variables have on the
   dependent variable in a regression study.
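The variance split described above can be computed by hand for a one-way ANOVA: the between-group sum of squares captures the systematic factor, the within-group sum of squares captures the random factor, and their ratio of mean squares is the F statistic. The three groups of measurements below are invented for illustration.

```python
# One-way ANOVA: partition total variability into between-group
# (systematic) and within-group (random) parts and form the F statistic.

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of sample lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares (systematic factor)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares (random factor)
    ssw = 0.0
    for g in groups:
        m = sum(g) / len(g)
        ssw += sum((x - m) ** 2 for x in g)
    df_b, df_w = k - 1, n - k
    f_stat = (ssb / df_b) / (ssw / df_w)
    return f_stat, df_b, df_w

groups = [[10, 12, 11], [20, 22, 21], [30, 32, 31]]
f, df_b, df_w = one_way_anova(groups)
print(f, df_b, df_w)  # → 300.0 2 6
```

A large F (here, the group means differ far more than the within-group scatter) indicates that the grouping factor has a real influence on the response.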
4. Mixed-effect models: Mixed-effect models are an extension of simple linear models
   that allow both fixed and random effects, and are particularly useful when there is non-
   independence in the data, such as arises from a hierarchical structure. They describe
   relationships between a response variable and some covariates in data grouped
   according to one or more factors. Common areas of application include multilevel data,
   repeated measures data, block designs, and longitudinal data.
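One simple way to see the non-independence that motivates a random-intercept mixed model is the intraclass correlation (ICC): observations in the same group share a group effect and are therefore correlated. The sketch below uses the classical ANOVA-mean-squares estimator of the ICC for balanced groups; the grouped data are invented, with groups shifted strongly apart so the ICC should come out near 1.

```python
# Random-intercept intuition: the ICC estimated from one-way ANOVA mean
# squares measures how strongly observations within a group resemble each
# other. Minimal sketch for a balanced design with invented data.

def icc_random_intercept(groups):
    """ANOVA estimator of ICC: (MSB - MSW) / (MSB + (m - 1) * MSW)."""
    k = len(groups)
    m = len(groups[0])            # observations per group (balanced design)
    n = k * m
    grand = sum(sum(g) for g in groups) / n
    msb = sum(m * (sum(g) / m - grand) ** 2 for g in groups) / (k - 1)
    ssw = 0.0
    for g in groups:
        mean_g = sum(g) / m
        ssw += sum((x - mean_g) ** 2 for x in g)
    msw = ssw / (n - k)
    return (msb - msw) / (msb + (m - 1) * msw)

groups = [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
icc = icc_random_intercept(groups)
```

An ICC near 1 means most variability lies between groups, which is precisely the situation where treating observations as independent (a simple linear model) would be wrong and a mixed-effect model is called for.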
5. Factor analysis: Factor analysis is an exploratory data analysis method used to search
   for influential underlying factors in a set of observed variables. It helps in data
   interpretation by reducing the number of variables. It extracts the maximum common
   variance from all variables and puts it into a common score. Factor analysis is
   widely utilized in market research, advertising, psychology, finance, and operations
   research.
6. Discriminant analysis: Discriminant analysis is a statistical method that helps to
   understand the relationship between a categorical dependent variable and one or more
   independent variables. The dependent variable is the variable that is explained or
   predicted from the values of the independent variables. Discriminant analysis is similar
   to regression analysis and analysis of variance (ANOVA). It is commonly used in the
   social sciences.
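In its simplest setting, two classes, one feature, equal variances and equal priors, the linear discriminant rule reduces to assigning a new observation to the nearest class mean, i.e. thresholding halfway between the two means. The measurements below are invented for illustration.

```python
# Simplest linear discriminant: two classes, one feature, equal variances
# and priors, so the decision boundary is the midpoint of the class means.
# Minimal sketch with invented data.

def fit_lda_1d(class_a, class_b):
    """Return a classifier assigning a new x to class 'A' or 'B'."""
    mean_a = sum(class_a) / len(class_a)
    mean_b = sum(class_b) / len(class_b)
    threshold = (mean_a + mean_b) / 2.0
    def classify(x):
        if mean_a < mean_b:
            return "A" if x < threshold else "B"
        return "A" if x > threshold else "B"
    return classify

classify = fit_lda_1d([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
print(classify(4.0), classify(6.0))  # → A B
```

Note the parallel to regression and ANOVA mentioned above: the independent variable (the measurement) is used to predict the categorical dependent variable (the class).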
7. Survival analysis: Survival analysis is one of the primary statistical methods for
   analysing data on time to an event, such as death, heart attack, or device failure. Such
   data analysis is essential for many facets of legal proceedings, including apportioning
   the cost of future medical care, estimating years of life lost, evaluating product
   reliability, and assessing drug safety. Methods include Kaplan-Meier estimates of
   survival and Cox proportional hazards regression models.
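The Kaplan-Meier estimator mentioned above can be sketched directly: at each observed event time the survival probability is multiplied by the fraction of at-risk subjects who survive that time, and right-censored observations leave the risk set without triggering a drop. The (time, event) pairs below are invented, with event = 1 meaning the event (e.g. device failure) occurred and event = 0 meaning the observation was censored.

```python
# Kaplan-Meier estimate of the survival function S(t), handling
# right-censored observations. Minimal sketch with invented data.

def kaplan_meier(times, events):
    """Return [(t, S(t))] with S updated at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # count events (deaths) and all subjects leaving the risk set at t
        j, deaths = i, 0
        while j < len(data) and data[j][0] == t:
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            s *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, s))
        n_at_risk -= j - i       # both failures and censored leave the set
        i = j
    return curve

times  = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]         # 0 = censored observation
curve = kaplan_meier(times, events)
```

For these data the curve steps down to 0.8 at t=2, 0.6 at t=3, and 0.3 at t=5; the censored subjects shrink the risk set without producing a step of their own.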
8. Quality Control: Quality control is a set of methods used by organizations to achieve
   quality goals and continually improve the organization's ability to ensure that a
   software product will meet them. It confirms that the agreed standards are followed
   while working on the product.
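A classic statistical tool here is the control chart: measurements falling outside control limits signal that the process has drifted from its standard. The sketch below uses a simplified Shewhart-style individuals chart with limits at the mean ± 3 standard deviations; real charts usually estimate sigma from moving ranges, and the measurements are invented.

```python
import math

# Simplified Shewhart individuals control chart: flag any measurement
# outside mean ± 3 standard deviations of the historical process data.
# Minimal sketch with invented measurements.

def control_limits(samples):
    """Return (LCL, center line, UCL) from in-control historical data."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    return mean - 3 * sd, mean, mean + 3 * sd

def out_of_control(samples, lcl, ucl):
    """Return the measurements that violate the control limits."""
    return [x for x in samples if x < lcl or x > ucl]

history = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.1, 9.9]
lcl, center, ucl = control_limits(history)
flagged = out_of_control(history + [12.0], lcl, ucl)
```

All historical measurements fall inside the limits, while the new reading of 12.0 is flagged, prompting the quality team to investigate before the product drifts further from its goals.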