Statistical Learning - Introduction
Introduction to Statistics
PGPBDML
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Outline
1. Why Statistics?
2. Business Statistics-Tools
3. Types of Statistics - Descriptive and Inferential Statistics
4. Data Sources and Types of Datasets
5. Attributes of Datasets
6. Key Takeaways
Why Is Statistics So Important?
Event 2
• Enormous advances in computing power make it possible to
process and analyze massive amounts of data effectively
Why Is Statistics So Important?
Event 3
Big Data
• A set of data that cannot be managed, processed, or
analyzed with traditional software/algorithms within a
reasonable amount of time.
• Big data revolves around five Vs: Volume, Velocity, Variety, Value,
and Veracity.
Business Statistics-Tools
Classification
• For example, these models can classify and predict buyers and
non-buyers, or defaulters and non-defaulters on credit card loans.
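To make the buyer / non-buyer example concrete, here is a minimal sketch of a classifier in plain Python. It uses a nearest-centroid rule, chosen only because it is short; the slides do not say which classification model is meant. The feature names (monthly income, past purchases) and every data value are made up for illustration.

```python
from statistics import mean

# Made-up training data: (monthly_income, past_purchases) -> label.
# These values are illustrative assumptions, not real customer data.
training = [
    ((30.0, 0.0), "non-buyer"),
    ((35.0, 1.0), "non-buyer"),
    ((60.0, 5.0), "buyer"),
    ((70.0, 7.0), "buyer"),
]


def fit_centroids(rows):
    """Average the feature vectors of each class to get one centroid per label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))


centroids = fit_centroids(training)
prediction = predict(centroids, (65.0, 6.0))
print(prediction)
```

The same two functions would work for the defaulter / non-defaulter case by changing the labels and features. In practice the features would usually be scaled first so that no single one dominates the distance.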
Classic Definition of Statistics
Some Key Terms Used in Statistics
Population is the collection of all possible observations of a specified
characteristic of interest. An example is all the students in the
Quantitative Methods course in an MBA program.

Parameter is the population characteristic of interest. For example, you
are interested in the average income of a particular class of people. The
average income of this entire class of people is called a parameter.

Sample is a subset of the population. Suppose you want to select a team of
20 students from 200 students in an MBA program for participating in a
management quiz. The total 200 students is the population. The 20 students
selected for the quiz is the sample.

Statistic is based on a sample to make inferences about the population
parameter. If you look at the previous example, the average income in a
population can be estimated by the average income based on the sample.
This sample average is called a statistic.
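The four terms can be seen side by side in a short Python sketch. It assumes a hypothetical population of 200 incomes generated at random (the values are made up, not course data), draws a sample of 20, and compares the parameter with the statistic.

```python
import random
import statistics

# Hypothetical population: incomes for 200 people (made-up values).
random.seed(42)  # fixed seed so repeated runs give the same numbers
population = [random.randint(20_000, 80_000) for _ in range(200)]

# Parameter: a characteristic of the whole population (here, the mean income).
population_mean = statistics.mean(population)

# Sample: a subset of 20 observations drawn without replacement.
sample = random.sample(population, 20)

# Statistic: the same characteristic computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values will generally differ. The sample mean is only an estimate of the population mean, and a different sample would give a different estimate.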
Data Sources
Types of Data
• Record
    • Relational records
    • Data matrix, e.g., numerical matrix, crosstabs
    • Document data: text documents, term-frequency vector
    • Transaction data
• Graph and network
    • World Wide Web
    • Social or information networks
    • Molecular structures
• Ordered
    • Video data: sequence of images
    • Temporal data: time series
    • Sequential data: transaction sequences
    • Genetic sequence data
• Spatial, image and multimedia
    • Spatial data: maps
    • Image data
    • Video data
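As an illustration of the "document data" entry, the sketch below turns text documents into term-frequency vectors using only the Python standard library. The two documents are made up, and splitting on whitespace is a simplifying assumption; real text would normally need proper tokenization.

```python
from collections import Counter

# Two made-up documents, for illustration only.
documents = [
    "data mining finds patterns in data",
    "statistics summarizes data",
]

# Vocabulary: every distinct term across the documents, in sorted order.
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One term-frequency vector per document: the count of each vocabulary term.
tf_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vector in tf_vectors:
    print(vector)
```

Stacked row by row, these vectors form a numerical data matrix with one row per document and one column per term.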
Data Objects
Attributes
Attribute Types
Numeric Attribute Types
Take-aways