Fundamental & Advanced Analytics Using SAS
SAS Procedures for Fundamental & Advanced Analytics
› PROC MEANS
› PROC UNIVARIATE
› PROC FREQ
› PROC CORR
› PROC REG
› PROC SQL
PROC MEANS
› Prints descriptive statistics.
› Without any options, prints for all numeric variables in the data set:
– number of non-missing observations,
– mean,
– standard deviation,
– minimum and
– maximum.
› With options, prints only the requested statistics.
› Computing statistics for each value of a variable: BY statement
– Prerequisite: the data set must be sorted on the BY variable.
› CLASS statement
– Substitute for BY.
– A sorted data set is not needed.
SYNTAX
PROC MEANS data=XYZ;
RUN;
PROC MEANS Options
Option      Description
N           Number of non-missing observations used to compute the statistics
NMISS       Number of missing observations
MEAN        The mean
STD         The standard deviation
CV          The coefficient of variation
CLM         The 95% confidence interval for the mean
STDERR      The standard error of the mean
MIN         The minimum value
MAX         The maximum value
MEDIAN      The median
MAXDEC=n    The maximum number of decimal places in all table values
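The options in the table can be combined freely on the PROC MEANS statement. The following is a minimal sketch, assuming the SASHELP.SHOES sample data set that the later examples in this deck also use; the choice of statistics and of MAXDEC=2 is only illustrative.

```sas
/* Request a chosen set of statistics, rounded to 2 decimal places */
proc means data=sashelp.shoes n nmiss mean median std cv stderr clm maxdec=2;
  var sales inventory;   /* analysis variables */
run;
```

Listing any statistic keyword replaces the default set (N, mean, standard deviation, minimum, maximum), so only the requested columns appear in the output.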
Example Code
Using BY (data set sorted first)
PROC SORT DATA=SASHELP.SHOES OUT=Sorted_Shoes;
BY REGION;
RUN;
/* The BY variable must match the key the data set was sorted on */
PROC MEANS DATA=Sorted_Shoes n nmiss mean std;
BY REGION;
VAR STORES SALES INVENTORY;
RUN;
Using CLASS (no sorting needed)
PROC MEANS DATA=SASHELP.SHOES n nmiss mean std;
CLASS REGION;
VAR STORES SALES INVENTORY;
RUN;
Including Multiple CLASS Variables
PROC MEANS DATA=SASHELP.SHOES n nmiss mean std;
CLASS REGION PRODUCT;
VAR STORES SALES INVENTORY;
RUN;
PRINTALLTYPES
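› Used with multiple CLASS variables.
› Prints statistics for every combination of the CLASS variables: the overall total, each CLASS variable on its own, and the crossing of the variables. Without it, only the full crossing is printed.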
Sample Code:
PROC MEANS DATA= SASHELP.SHOES n nmiss mean std PRINTALLTYPES;
CLASS REGION PRODUCT;
VAR STORES SALES INVENTORY;
RUN;
PROC UNIVARIATE
› Similar to PROC MEANS.
› Also produces histograms & probability plots.
› Options:
› Histogram: Generates a histogram of all variables on the VAR statement.
› Qqplot: Produces a quantile-quantile plot to detect deviations from normality.
– Option NORMAL draws a straight reference line showing where the points would fall if the data were normally distributed.
– mu (mean) and sigma (standard deviation) set the parameters of that theoretical normal line.
– The value est requests that mu and sigma be estimated from the data.
Syntax:
Proc UNIVARIATE DATA= XYZ;
ID Var;
Var v1 v2 v3;
Histogram;
Qqplot / normal (mu=est sigma=est);
RUN;
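As a concrete instance of the PROC UNIVARIATE syntax, here is a small sketch assuming the SASHELP.SHOES sample data set; picking Sales as the analysis variable and Product as the ID variable is an illustrative assumption, not something the slides prescribe.

```sas
/* Descriptive statistics, histogram and normal Q-Q plot for one variable */
proc univariate data=sashelp.shoes;
  id product;                               /* labels the extreme observations */
  var sales;
  histogram sales / normal;                 /* overlay a fitted normal curve */
  qqplot sales / normal(mu=est sigma=est);  /* reference line from sample mean and std */
run;
```

Points that bend away from the reference line in the Q-Q plot suggest the variable is not normally distributed.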
PROC FREQ
Generates Frequency Tables
• One-way,
• Two-way and
• Three-way
One way Frequency Tables
PROC FREQ data= SASHELP.SHOES;
Tables Region product;
Run;
Option: NOCUM: Eliminates cumulative frequencies
PROC FREQ data= SASHELP.SHOES;
Tables Region product / nocum;
Run;
Two way & Three Way Frequency Tables
PROC FREQ data= SASHELP.SHOES;
Tables REGION * product; /* Two Way */
Tables REGION * product * Sales; /* Three Way */
Run;
› Two way: Region as rows and Product as columns.
› Three way: one Product by Sales table is produced for each value of Region.
Option: CHISQ: Adds chi-square tests of association to the output
PROC FREQ data= SASHELP.SHOES;
Tables REGION * product / chisq;
Run;
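The three-way request above crosses by Sales, a continuous variable, so the resulting tables would have one column per distinct sales value. A more readable three-way table uses categorical variables only. The sketch below assumes the SASHELP.CARS sample data set and its Origin, Type and DriveTrain columns; the NOROW, NOCOL and NOPERCENT options are used here only to keep the cells compact.

```sas
/* One Type-by-DriveTrain table for each value of Origin */
proc freq data=sashelp.cars;
  tables origin*type*drivetrain / norow nocol nopercent;
run;
```

In a request of the form A*B*C, the first variable defines the strata, the second the rows and the last the columns.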
PROC CORR
› Correlation analysis of all numeric variables in the data set with each other.
› If variables are specified on the VAR statement, only their correlations with each other are presented.
Syntax
PROC CORR Data=XYZ;
RUN;
PROC CORR Data=XYZ;
Var v1 v2 v3…;
RUN;
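A short sketch of the second form, assuming the SASHELP.CARS sample data set; the three variables are chosen only for illustration.

```sas
/* Pearson correlations among three numeric variables */
proc corr data=sashelp.cars;
  var horsepower weight mpg_city;
run;
```

The output is a correlation matrix in which each cell also reports the p-value for the test that the correlation is zero.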
PROC REG
› Models the relationship between a scalar dependent variable and one or more explanatory variables.
› Syntax 1 for Simple Linear Regression
› Syntax 2 for Multiple Linear Regression
Syntax 1:
PROC REG Data=XYZ;
MODEL Var1=Var2;
OUTPUT OUT=res RESIDUAL=resid P=pred;
RUN;
OUTPUT statement options: OUT, RESIDUAL, P
› OUT: Names a new data set that stores the requested statistics
› RESIDUAL: Names the variable that holds the residual values
› P: Names the variable that holds the predicted values
Syntax 2:
PROC REG Data=XYZ;
MODEL Var1=Var2 Var3 …;
RUN;
MODEL statement options (placed after a slash, as in MODEL Var1=Var2 / clm cli;):
› clm prints 95% confidence intervals for the mean of each observation
› cli prints 95% prediction intervals
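The pieces of PROC REG described here (MODEL, the CLM and CLI options, and the OUTPUT statement) can be put together as follows. This is a sketch only, assuming the SASHELP.CLASS sample data set; the data set name reg_out and the variable names resid and pred are hypothetical.

```sas
/* Multiple linear regression with interval estimates and saved diagnostics */
proc reg data=sashelp.class;
  model weight = height age / clm cli;           /* confidence and prediction limits */
  output out=reg_out residual=resid p=pred;      /* save residuals and fitted values */
run;
quit;

/* Inspect the first few rows of the saved data set */
proc print data=reg_out(obs=5);
  var name weight pred resid;
run;
```

PROC REG is interactive, so the QUIT statement is what ends the procedure after RUN.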
PROC SQL
› SAS supports SQL queries inside SAS programs.
› Most of the ANSI SQL syntax is supported.
› PROC SQL is used to process the SQL statements.
› This procedure can
– return the result of an SQL query,
– create SAS tables & variables.
Syntax
PROC SQL;
SELECT Columns
FROM TABLE
WHERE Condition
GROUP BY Columns
;
QUIT;
Running SQL Commands
CREATING TABLES
PROC SQL;
CREATE TABLE EMPLOYEES AS
SELECT * FROM TEMP;
QUIT;
PROC PRINT data = EMPLOYEES;
RUN;
READING DATA
PROC SQL;
SELECT make, model, type, invoice, horsepower
FROM SASHELP.CARS;
QUIT;
WHERE CLAUSE
PROC SQL;
SELECT make, model, type, invoice, horsepower
FROM SASHELP.CARS
WHERE make = 'Audi' and Type = 'Sports';
QUIT;
UPDATING DATA
PROC SQL;
UPDATE EMPLOYEES2 SET SALARY=SALARY*1.25;
QUIT;
PROC PRINT data = EMPLOYEES2; RUN;
DELETING DATA
PROC SQL;
DELETE FROM EMPLOYEES2 WHERE SALARY > 900;
QUIT;
PROC PRINT data = EMPLOYEES2; RUN;
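The syntax template lists GROUP BY, but none of the commands above use it. The following sketch shows one possible aggregate query, assuming the SASHELP.CARS sample data set; the column aliases n_models and avg_hp are hypothetical names chosen for illustration.

```sas
/* Count sports models and average their horsepower, per make */
proc sql;
  select make,
         count(*)        as n_models,
         mean(horsepower) as avg_hp format=8.1
  from sashelp.cars
  where type = 'Sports'
  group by make
  order by avg_hp desc;
quit;
```

WHERE filters rows before grouping, and each make then contributes one summary row to the result.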
INTCK Function
› Counts number of Intervals between two dates or times.
› SYNTAX:
INTCK('Interval', From, To)
› INTERVAL may be
– 'YEAR', 'SEMIYEAR',
– 'MONTH', 'SEMIMONTH', 'QTR',
– 'DAY', 'WEEKDAY', 'TENDAY'.
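A small DATA step illustrates how INTCK counts. The dates below are made up for the example. Note that, by default, INTCK counts interval boundaries crossed rather than full elapsed periods, so two dates one day apart can still be one 'YEAR' apart.

```sas
/* INTCK counts the interval boundaries crossed between two dates */
data _null_;
  start  = '15JAN2023'd;
  finish = '10MAR2024'd;
  months = intck('month', start, finish);   /* 14 month boundaries crossed */
  years  = intck('year',  start, finish);   /* 1 year boundary crossed */
  /* One day apart, but the 1 January boundary lies in between */
  edge   = intck('year', '31DEC2023'd, '01JAN2024'd);   /* 1 */
  put months= years= edge=;
run;
```

The results are written to the SAS log by the PUT statement.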
PROC SURVEYSELECT
› Selects a sample from a population.
Syntax:
PROC SURVEYSELECT Data=XYZ <options>;
STRATA variables;
CONTROL variables;
SIZE variable;
ID variables;
RUN;
› Options:
› OUT=
– Output data set that contains the sample.
› METHOD=
– Sample selection method.
– Default is simple random sampling (METHOD=SRS) with no SIZE statement.
– With a SIZE statement, the default is probability proportional to size without replacement (METHOD=PPS).
› SAMPSIZE= Number of units to select for the sample.
› STRATA partitions the input data set into nonoverlapping groups.
› ID lists the variables from the input data set to be included in the output data set; without it, all variables are included.
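A stratified simple random sample can be sketched as below, assuming the SASHELP.SHOES sample data set. The output names shoes_sorted and shoes_sample, the sample size of 3 per stratum and the seed value are all illustrative assumptions. The data are sorted first because the STRATA statement expects the input to be ordered by the strata variables.

```sas
/* Sort by the stratification variable */
proc sort data=sashelp.shoes out=shoes_sorted;
  by region;
run;

/* Draw 3 rows at random from each region */
proc surveyselect data=shoes_sorted out=shoes_sample
                  method=srs sampsize=3 seed=12345;
  strata region;
  id region product sales;   /* variables to carry into the sample */
run;
```

Fixing SEED= makes the draw reproducible; leaving it out gives a different sample on each run.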