Proc Univariate HTML
Complete Guide to PROC UNIVARIATE
This tutorial explains how to explore data with PROC UNIVARIATE. It is one of the most powerful
SAS procedures for running descriptive statistics and for checking important assumptions of
various statistical techniques, such as normality and the absence of outliers. Despite its many
features, PROC UNIVARIATE is less popular than PROC MEANS.
Most SAS analysts are comfortable using PROC MEANS for summary statistics such as count,
mean, median and number of missing values. In reality, PROC UNIVARIATE surpasses PROC MEANS
in the options it supports. The main differences between the two procedures are listed
below.
1. PROC MEANS can calculate only a fixed set of percentile points, such as the 1st, 5th, 10th,
25th, 50th, 75th, 90th, 95th and 99th percentiles. It cannot calculate arbitrary custom
percentiles such as the 97.5th or 99.5th. PROC UNIVARIATE can calculate custom percentiles.
2. PROC UNIVARIATE can report extreme observations - the five lowest and five highest
values. PROC MEANS can report only the MIN and MAX values.
3. PROC UNIVARIATE supports normality tests to check whether a variable is normally
distributed. PROC MEANS does not support normality tests.
4. PROC UNIVARIATE generates multiple plots such as histograms, box plots and stem-and-leaf
diagrams, whereas PROC MEANS does not support graphics.
In the examples below, we use the sashelp.shoes dataset. SALES is the numeric (or
measured) variable.
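A minimal sketch of the basic call, assuming the sashelp.shoes sample dataset is available in your SAS session. The VAR statement names the analysis variable.

```sas
/* Default descriptive statistics for SALES */
proc univariate data=sashelp.shoes;
    var sales;
run;
```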
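The default output contains five sections:
1. Moments
2. Basic Statistical Measures
3. Tests for Location
4. Percentiles (Quantiles)
5. Extreme Observations - the five smallest and five largest values, shown with their row
(observation) numbers.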
Suppose you are asked to calculate basic statistics of sales by region. Here, region is a
grouping (or categorical) variable. The CLASS statement is used to specify the categorical
variable.
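The grouped analysis described above could be written as follows. This is a sketch that assumes REGION is the grouping variable in sashelp.shoes.

```sas
/* One set of descriptive statistics per level of REGION */
proc univariate data=sashelp.shoes;
    var sales;
    class region;
run;
```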
Similar output is generated for the other regions - Asia, Canada, Eastern Europe, Middle East
etc.
Suppose you want only the percentiles to appear in the output window. By default, PROC
UNIVARIATE creates five output tables : Moments, BasicMeasures, TestsForLocation, Quantiles,
and ExtremeObs. The ODS SELECT statement can be used to select only one of these tables.
Quantiles is the standard PROC UNIVARIATE table name for percentiles, which is the table we
want. ODS stands for Output Delivery System.
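A short sketch of selecting a single table. The ODS SELECT statement placed before the procedure applies to that procedure step only.

```sas
/* Show only the Quantiles table in the output */
ods select Quantiles;
proc univariate data=sashelp.shoes;
    var sales;
run;
```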
The ODS TRACE ON statement writes the name and label of each table that a SAS procedure
generates to the log window.
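One way to look up the table names is sketched below. Tracing is switched off again afterwards so that later steps do not clutter the log.

```sas
/* List the ODS table names produced by PROC UNIVARIATE in the log */
ods trace on;
proc univariate data=sashelp.shoes;
    var sales;
run;
ods trace off;
```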
The ODS OUTPUT statement is used to write a table from the results window to a SAS dataset.
In this example, temp is the name of the dataset in which the percentile information is
stored.
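A sketch of saving the percentiles to a dataset. The dataset name temp comes from the text. The PROC PRINT step is added here only to inspect the result.

```sas
/* Write the Quantiles table to the dataset TEMP */
ods output Quantiles=temp;
proc univariate data=sashelp.shoes;
    var sales;
run;

/* Inspect the saved percentiles */
proc print data=temp;
run;
```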
Just as we saved percentiles in the previous example, we can save extreme values using the
ExtremeObs table. The ODS OUTPUT statement tells SAS to write the extreme values to a
dataset named outlier. ExtremeObs is the standard PROC UNIVARIATE table name for
extreme values.
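The same pattern applied to extreme values might look like this. The CLASS statement is an assumption carried over from the earlier by-region example and can be dropped for overall extremes.

```sas
/* Write the five lowest and five highest SALES values per region to OUTLIER */
ods output ExtremeObs=outlier;
proc univariate data=sashelp.shoes;
    var sales;
    class region;
run;
```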
4. Checking Normality
Many statistical techniques assume that the data is normally distributed. It is important
to check this assumption before running a model.
1. Plot Histogram
2. Calculate Skewness
3. Normality Tests
I. Plot Histogram
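A sketch of a histogram with a fitted normal curve overlaid. The NORMAL option on the HISTOGRAM statement and the NOPRINT option (which suppresses the statistics tables) are choices made for this illustration.

```sas
/* Histogram of SALES with a fitted normal density curve */
proc univariate data=sashelp.shoes noprint;
    var sales;
    histogram sales / normal;
run;
```

If the bars follow the curve closely, the data looks roughly normal. A long tail on one side suggests skewness.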
II. Skewness
Positively skewed data has a few extremely large values that pull the mean upward. It is also
called right skewed.
Positive Skewness : If skewness > 0, the data is positively skewed. Another way to recognize
positive skewness : the mean is typically greater than the median, and the median is greater
than the mode.
Negatively skewed data has a few extremely small values that pull the mean downward. It is
also called left skewed.
Rule :
1. If skewness is less than −1 or greater than +1, the distribution is highly skewed.
2. If skewness is between −1 and −0.5 or between 0.5 and +1, the distribution is moderately
skewed.
3. If skewness is between −0.5 and 0.5, the distribution is approximately symmetric.
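To read the skewness value that the rule refers to, one option is to print only the Moments table, which includes skewness. This is a sketch; the statistic can also be saved with the OUTPUT statement and its SKEWNESS= keyword.

```sas
/* The Moments table reports skewness along with mean, variance and kurtosis */
ods select Moments;
proc univariate data=sashelp.shoes;
    var sales;
run;
```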
Since the skewness is greater than 1, the data is highly skewed and non-normal.
In the example above, the p-value is less than 0.05, so we reject the null hypothesis of
normality. It implies the distribution is not normal. If the p-value is greater than 0.05, we
fail to reject the null hypothesis and the data can be treated as approximately normal.
In this test, the null hypothesis states that the data is normally distributed.
If the p-value is greater than 0.05, normality is not rejected. In the example above, the
p-value is less than 0.05, which means the data is not normal.
This test can handle sample sizes greater than 2000.
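The tests above can be requested with the NORMAL option on the PROC UNIVARIATE statement. As a sketch, the code below prints only the TestsForNormality table, which reports several tests, including Shapiro-Wilk and Kolmogorov-Smirnov. SAS computes the Shapiro-Wilk statistic only when the sample size is 2000 or fewer.

```sas
/* Request normality tests and show only that table */
ods select TestsForNormality;
proc univariate data=sashelp.shoes normal;
    var sales;
run;
```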
With the PCTLPTS= option, we can calculate custom percentiles. Suppose you need to generate
the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th and 100th percentiles.
The OUTPUT OUT= statement tells SAS to save the percentile information in the TEMP
dataset. The PCTLPRE= option specifies the prefix for the names of the variables that hold
the PCTLPTS= percentiles.
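A sketch of the deciles request described above. The prefix p_ is an arbitrary choice for this illustration and produces variables named p_10, p_20 and so on up to p_100.

```sas
/* Save the 10th, 20th, ..., 100th percentiles of SALES to TEMP */
proc univariate data=sashelp.shoes noprint;
    var sales;
    output out=temp pctlpts=10 to 100 by 10 pctlpre=p_;
run;

/* Inspect the saved percentiles */
proc print data=temp;
run;
```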
Winsorized and trimmed means are less sensitive to outliers than the ordinary mean. They
should be reported instead of the mean when the data is highly skewed.
Trimmed Mean : Remove the extreme values and then calculate the mean of the remaining
values. A 10% trimmed mean removes the values below the 10th percentile and above the 90th
percentile before averaging.
Winsorized Mean : Cap the extreme values at the kth percentile level and then calculate the
mean. It is the same as the trimmed mean except that, instead of removing the extreme values,
we cap them at the kth percentile level.
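Both statistics can be requested with the TRIMMED= and WINSORIZED= options on the PROC UNIVARIATE statement. The sketch below assumes a proportion of 0.2, meaning 20% of observations in each tail. The proportion is an illustrative choice.

```sas
/* 20% trimmed and winsorized means of SALES */
ods select TrimmedMeans WinsorizedMeans;
proc univariate data=sashelp.shoes trimmed=0.2 winsorized=0.2;
    var sales;
run;
```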
Winsorized Mean
Winsorized Means
Percent Winsorized in Tail : 20% of values are winsorized in each tail (lower and upper).
Number Winsorized in Tail : 79 values are winsorized in each tail.
Trimmed Mean
The one-sample t-test tests the null hypothesis that the mean of the variable is equal to 0.
The alternative hypothesis is that the mean is not equal to 0. When you run PROC UNIVARIATE,
it generates a one-sample t-test by default in the 'Tests for Location' section of the output.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the mean
value of the variable is significantly different from zero.
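A sketch of printing only the location tests. MU0= sets the hypothesized mean. It defaults to 0 and is written out here only to make the null hypothesis explicit.

```sas
/* One-sample t-test of H0: mean of SALES = 0 */
ods select TestsForLocation;
proc univariate data=sashelp.shoes mu0=0;
    var sales;
run;
```

To test against a different value, change MU0=, for example to a target sales figure.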
8. Generate Plots
1. Histogram
2. Box Plot
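One possible way to produce both kinds of plot is sketched below. The PLOTS option adds a box plot and a normal probability plot to the output, and the HISTOGRAM statement draws the histogram. The exact appearance depends on whether ODS Graphics is enabled in your session.

```sas
/* Histogram plus the PLOTS panel (includes a box plot) for SALES */
proc univariate data=sashelp.shoes plots;
    var sales;
    histogram sales;
run;
```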
About Author:
Deepanshu founded ListenData with a simple objective - Make analytics
easy to understand and follow. He has over 10 years of experience in data
science. During his tenure, he has worked with global clients in various
domains like Banking, Insurance, Private Equity, Telecom and Human
Resource.