Whatisstat Unit 2
Kamakshaiah Musunuru
Associate Professor-Business Analytics
Dhruva College of Management
Contents
1 Business Statistics
1.1 Descriptive Statistics
1.2 Inferential Statistics
6 Important Questions
6.1 Short-answer questions
Business Statistics
Statistics is a way to get information from data. The term statistician is used to describe
so many different kinds of occupations that it has ceased to have any meaning. It is used,
for example, to describe a person who calculates baseball statistics as well as an individual
educated in statistical principles. We call the former a statistics practitioner and the latter a statistician. A
statistics practitioner is a person who uses statistical techniques properly. Examples of statistics practitioners include the following:
a financial analyst who develops stock portfolios based on historical rates of return;
an economist who uses statistical models to help explain and predict variables such as
inflation rate, unemployment rate, and changes in the gross domestic product; and
a market researcher who surveys consumers and converts the responses into useful information.
The term statistician refers to an individual who works with the mathematics of statistics. His
or her work involves research that develops techniques and concepts that may, in the future,
help the statistics practitioner. Statisticians are also statistics practitioners, frequently conducting
empirical research and consulting. For instance, your instructor is probably a statistician.
1.1 Descriptive Statistics
Descriptive statistics deals with methods of organizing, summarizing, and presenting data in a
convenient and informative way. One form of descriptive statistics uses graphical techniques that
allow statistics practitioners to present data in ways that make it easy for the reader to extract
useful information.
Another form of descriptive statistics uses numerical techniques to summarize data. One such
method, which you have already used frequently, calculates the average or mean. Other numerical
techniques include measures of central tendency and measures of variability.
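The numerical techniques mentioned above can be sketched in a few lines of Python. This is only an illustration using the standard-library statistics module; the data values are hypothetical (soft drinks consumed last week by ten students) and are not taken from any real survey.

```python
import statistics

# Hypothetical data: soft drinks consumed last week by ten students.
drinks = [3, 5, 2, 8, 5, 4, 7, 5, 1, 6]

# Measures of central tendency
mean = statistics.mean(drinks)      # arithmetic average
median = statistics.median(drinks)  # middle value of the sorted data
mode = statistics.mode(drinks)      # most frequent value

# Measures of variability
data_range = max(drinks) - min(drinks)         # largest minus smallest
sample_variance = statistics.variance(drinks)  # divides by n - 1
sample_std = statistics.stdev(drinks)          # square root of the variance

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", sample_variance, "std:", sample_std)
```

The central-tendency measures describe a typical value, while the variability measures describe how far the observations spread around it.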
1.2 Inferential Statistics
Inferential statistics is a body of methods used to draw conclusions or inferences about characteristics of populations based on sample data.
Statistical inference problems involve three key concepts: the population, the sample, and the statistical inference.
2.1 Population
A population is the group of all items of interest to a statistics practitioner. It is frequently very large
and may, in fact, be infinitely large. In the language of statistics, population does not necessarily
refer to a group of people. It may, for example, refer to the population of ball bearings produced
at a large plant.
A descriptive measure of a population is called a parameter. For instance, the parameter of
interest could be the mean number of soft drinks consumed by all the students at the university.
2.2 Sample
A sample is a set of data drawn from the studied population. A descriptive measure of a sample
is called a statistic. We use statistics to make inferences about parameters. For instance, we may
compute the statistic as the mean number of soft drinks consumed in the last week by 500 students
in a sample. We may then use the sample mean to infer the value of the population mean, which
is the parameter of interest in this problem.
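The relationship between a parameter and a statistic can be illustrated with a small simulation. The sketch below is only an assumption-laden example: the population of 50,000 students is generated artificially (weekly counts drawn uniformly from 0 to 14), so the numbers themselves carry no real-world meaning.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly soft-drink counts for 50,000 students.
population = [random.randint(0, 14) for _ in range(50_000)]

# Parameter: a descriptive measure of the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a random sample of 500 students.
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)

print("population mean (parameter):", round(population_mean, 3))
print("sample mean (statistic):    ", round(sample_mean, 3))
```

In practice the population mean is unknown; the simulation computes it only to show how close the sample mean typically comes.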
2.3 Statistical Inference
Statistical inference is the process of making an estimate, prediction, or decision about a population based on sample data. Because populations are almost always very large, investigating
each member of the population would be impractical and expensive. It is far easier and cheaper
to take a sample from the population of interest and draw conclusions or make estimates about
the population on the basis of information provided by the sample. However, such conclusions
and estimates are not always going to be correct. For this reason, we build into the statistical
inference a measure of reliability. There are two such measures: the confidence level and the significance
level. The confidence level is the proportion of times that an estimating procedure will be correct.
For example, we can produce an estimate of the average number of soft drinks consumed by
all 50,000 students that has a confidence level of 95%. In other words, estimates based on this form
of statistical inference will be correct 95% of the time. When the purpose of the statistical inference is to draw a conclusion about a population, the significance level measures how frequently the
conclusion will be wrong. For example, suppose that, as a result of the analysis, we conclude
that more than 50% of the electorate will vote for BJP, and thus that BJP will win. A 5% significance
level means that samples leading us to conclude that BJP wins the election will be wrong 5% of the time.
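The meaning of a 95% confidence level can be checked by simulation. The sketch below repeatedly draws samples of 500 from a hypothetical population of 50,000 students and counts how often an interval around the sample mean contains the true population mean. It assumes the usual large-sample normal approximation (sample mean plus or minus 1.96 standard errors); the population data are artificial.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly soft-drink counts for 50,000 students.
population = [random.randint(0, 14) for _ in range(50_000)]
true_mean = statistics.mean(population)

Z_95 = 1.96      # normal critical value for a 95% confidence level
trials = 1_000   # number of repeated samples
n = 500          # sample size

covered = 0
for _ in range(trials):
    sample = random.sample(population, n)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # estimated standard error
    # Does this interval estimate contain the true population mean?
    if m - Z_95 * se <= true_mean <= m + Z_95 * se:
        covered += 1

coverage = covered / trials
print("share of intervals containing the true mean:", coverage)
```

The printed share should be close to 0.95, which is what "correct 95% of the time" refers to: the procedure, not any single interval, has that success rate.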
Statistics is the mathematical science involving the collection, analysis and interpretation of data.
A number of specialties have evolved to apply statistical theory and methods to various disciplines.
Certain topics have "statistical" in their name but relate to manipulations of probability distributions
rather than to statistical analysis. The following is a list of several applications; the list
is not exhaustive.
Actuarial science is the discipline that applies mathematical and statistical methods to assess
risk in the insurance and finance industries.
Astrostatistics is the discipline that applies statistical analysis to the understanding of astronomical data.
Biostatistics is a branch of biology that studies biological phenomena and observations by
means of statistical analysis, and includes medical statistics.
Business analytics is a rapidly developing business process that applies statistical methods to
data sets (often very large) to develop new insights and understanding of business performance
and opportunities.
Chemometrics is the science of relating measurements made on a chemical system or process
to the state of the system via application of mathematical or statistical methods.
Demography is the statistical study of all populations. It can be a very general science that
can be applied to any kind of dynamic population, that is, one that changes over time or
space.
Econometrics is a branch of economics that applies statistical methods to the empirical study
of economic theories and relationships.
As we studied in the first unit, organizations are starving for the right data, because right
decisions depend on right and meaningful data. Data is abundant everywhere; however,
finding the right data to solve a business problem remains difficult despite the many large data repositories. The following are certain important data repositories for statisticians (the list is
not exhaustive).
4.1 Data repositories
AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public
data sets that can be seamlessly integrated into AWS cloud-based applications.
BigML big list of public data sources.
Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
Bitly 1.usa.gov data, anonymized clicks on gov links.
Canada Open Data, pilot project with many government and geospatial datasets.
Causality Workbench data repository.
at Texas Advanced Computing Center, supporting data-centric science.
Data Source Handbook, A Guide to Public Data, by Pete Warden, O'Reilly (Jan 2011).
Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
Data.gov.uk, publicly available data from UK (also London datastore.)
Data.gov/Education, central guide for education data resources including high-value data sets,
data visualization tools, resources for the classroom, applications created from open data and
more.
DataMarket, visualize the world's economy, societies, nature, and industries, with 100 million
time series from UN, World Bank, Eurostat and other important data providers.
Datamob, public data put to good use.
DataSF.org, a clearinghouse of datasets available from the City and County of San Francisco,
CA.
DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of
many on-line US Government datasets.
Delve, Data for Evaluating Learning in Valid Experiments.
Computational statistics, also known as statistical computing, is the interface between statistics
and computer science. It is the area of computational science (or scientific computing) specific to
the mathematical science of statistics. This area is also developing rapidly, leading to calls that a
broader concept of computing should be taught as part of general statistical education.
The terms computational statistics and statistical computing are often used interchangeably,
although Carlo Lauro (a former president of the International Association for Statistical Computing)
proposed making a distinction, defining statistical computing as the application of computer science
to statistics, and computational statistics as aiming at the design of algorithms for implementing
statistical methods on computers, including the ones unthinkable before the computer age (e.g.
bootstrap, simulation), as well as to cope with analytically intractable problems.
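The bootstrap mentioned above is a good example of a method that only became practical with computers. The sketch below shows one common variant, the percentile bootstrap, applied to the median of a small hypothetical sample. The function name bootstrap_ci and its parameters are illustrative choices made here, not part of any library.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical sample: soft drinks consumed last week by 15 students.
sample = [3, 5, 2, 8, 5, 4, 7, 5, 1, 6, 9, 4, 3, 6, 5]

def bootstrap_ci(data, stat, n_resamples=2_000, level=0.95):
    """Percentile bootstrap confidence interval for stat(data)."""
    estimates = sorted(
        stat(random.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    # Cut off the lower and upper tails of the resampled estimates.
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]

low, high = bootstrap_ci(sample, statistics.median)
print("sample median:", statistics.median(sample))
print("approximate 95% bootstrap interval:", (low, high))
```

No formula for the sampling distribution of the median is needed; the computer approximates it by recomputing the statistic on thousands of resamples.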
Statistical computing software can be categorized by its nature as follows:
Open source statistical packages:
DAP - a free replacement for SAS
gretl - GNU Regression, Econometrics and Time-series Library
Mondrian (software) - data analysis tool using interactive statistical graphics with a link
to R
Octave - programming language (very similar to Matlab) with statistical features
OpenMx - a package for structural equation modeling running in R
R - a free implementation of the S language
PSPP - a free software alternative to IBM SPSS Statistics
Perl Data Language - scientific computing with Perl
pandas - high-performance data structures and data analysis tools for Python, written in Python and Cython
(see also statsmodels, scikit-learn)
RapidMiner - a machine learning toolbox
Weka - a suite of machine learning software written at the University of Waikato
SciPy - a Python library for scientific computing; contains the stats sub-package, which
is partly based on the venerable STAT (a.k.a. PipeStat, formerly UNIXSTAT)
software
Freeware statistical packages
BV4.1
GeoDA
IDAMS/WinIDAMS
MINUIT
Proprietary statistical packages
Excel
SAS
IBM-SPSS
Minitab
Mathematica
S-PLUS
STATISTICA
STATA
Important Questions
1. What is the distinction between a statistician and a statistics practitioner?
2. What is descriptive statistics? Explain measures of central tendency and measures of
dispersion.
3. What is inferential statistics? Explain the goal of a statistician when performing statistical inference.
4. Explain the following:
Population
Sample
Statistical inference
5. Give a brief note on different statistical applications in business.
6. What is a data repository? Give a few examples.
7. Describe the various types of computing tools available for statistical analysis. Categorize them and
list a few important tools widely used by statisticians.