Learn Data Science Using SAS Studio: A Quick-Start Guide
By Engy Fouda
About this ebook
Do you want to create data analysis reports without writing a line of code? This book introduces SAS Studio, a browser-based data science product that is free for educational and non-commercial use. The power of SAS Studio comes from its visual point-and-click user interface, which generates SAS code. It is easier to learn SAS Studio than to learn R and Python to accomplish data cleaning, statistics, and visualization tasks.
The book includes a case study about analyzing the data required for predicting the results of presidential elections in the state of Maine for 2016 and 2020. In addition to the presidential elections, the book provides real-life examples including analyzing stocks, oil and gold prices, crime, marketing, and healthcare. You will see data science in action and how easy it is to perform complicated tasks and visualizations in SAS Studio.
You will learn, step by step, how to create visualizations, including maps. In most cases, you will not need a line of code as you work with the SAS Studio graphical user interface. The book includes explanations of the code that SAS Studio generates automatically. You will learn how to edit this code to perform more advanced tasks. The book introduces you to multiple SAS products such as SAS Viya, SAS Analytics, and SAS Visual Statistics.
What You Will Learn
- Become familiar with the SAS Studio IDE
- Understand essential visualizations
- Know the fundamental statistical analysis required in most data science and analytics reports
- Clean the most common data set problems
- Use linear regression for data prediction
- Write programs in SAS
- Get introduced to SAS Viya, which is more powerful than SAS Studio
Who This Book Is For
A general audience of people who are new to data science, students, and data analysts and scientists who are experienced but new to SAS. No programming or in-depth statistics knowledge is needed.
Part I Basics
© Engy Fouda 2020
E. Fouda, Learn Data Science Using SAS Studio, https://doi.org/10.1007/978-1-4842-6237-5_1
1. Data Science in Action
Engy Fouda, Hopewell Junction, NY, USA
In this chapter, we will introduce the case study of the book, which analyzes voters’ data in the state of Maine. It is based on a project I did at Harvard University in 2016 during my master’s degree. In fall 2016, the project for my course, A Practical Approach to Data Science, was to predict the presidential election results in every state. The project was under the guidance and supervision of Professor Larry Adams, who set the project milestones and requirements. I was responsible for forecasting Maine’s outcome for the 2016 and 2020 elections.
The project was done in two phases. The first was to predict the results for the 2016 election. After verifying our data and results against what actually happened in the election, the second phase started. It was to include the new data that was generated in 2016 and use it to predict the results of the year 2020. Therefore, some charts and exercises in this book include 2016 data. Whenever possible, I collected any related historic data. For the prediction, I used historic election data going back to 1960.
I defined voters’ groups by age, gender, education, demographics, and race. After studying the state from reliable academic sources, I identified issue categories like the economy, education, the environment, health care, and gun control.
Similarly, I listed the state’s issues that would influence the presidential election by using the county ballot topics. Using the voting patterns of each party since 1960, poll accuracy, and the electoral votes, I tried different prediction methods and algorithms, such as Monte Carlo and Bayes, and statistical tests, such as the t-test, chi-square, and others. Afterward, I had to compare my results to other forecast sites, like FiveThirtyEight. My prediction was correct for 2016.
This project was an exciting experience in which I converted cognitive features to numbers and crunched them to come up with results. Similarly, through other data science projects, I learned how to predict outcomes so as to drive decision making based upon measuring trends and studying patterns.
Data Science Process
The data science process starts with forming a question or hypothesis, then collecting relevant raw data, then cleaning and exploring that data, then modeling and evaluating, then deploying, visualizing, and communicating results in reports, as shown in Figure 1-1.
Figure 1-1. Data science process
Questions vary according to the field; for example:
- Politics: Will Trump win in Maine in 2016 and 2020?
- Facebook: How can you make people stay on Facebook longer?
- Medical: Is this tumor cancerous or not?
- Hospital Management: How can you shorten patients’ wait lines so as to increase patients’ satisfaction?
The second step is collecting raw data. For example, for the politics question (Will a particular candidate win in a certain state?), collecting all the voters’ information (age, race, education, income, gender, and industry) is a crucial step, as is collecting the ballot data and voting results over the years. The more historical data we have, the more accurate our predictions are likely to be. Furthermore, we should collect information on the population distribution throughout the years.
The third step is cleaning this raw data, from managing the missing values, outliers, repeated rows, and misspelled information, to adjusting the columns’ data types, unifying the format of the values, and so on.
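A few of these cleaning steps can be sketched in a handful of lines. The example below uses plain Python rather than SAS Studio, purely as an illustration; the rows, the column names, and the `clean` helper are all invented for this sketch and are not taken from the Maine data. It unifies the text format, converts the year of birth to a number, drops repeated rows, and sets aside rows with missing values.

```python
# Hypothetical raw rows; the column names and values are invented for illustration.
raw_rows = [
    {"first_name": " alice ", "year_of_birth": "1950", "county": "CUMBERLAND"},
    {"first_name": "Alice", "year_of_birth": "1950", "county": "Cumberland"},  # repeated row
    {"first_name": "Bob", "year_of_birth": "", "county": "york"},              # missing value
    {"first_name": "Carol", "year_of_birth": "1972", "county": "Kennebec"},
]


def clean(rows):
    """Return (complete_rows, incomplete_rows) after basic cleaning."""
    seen = set()
    complete, incomplete = [], []
    for row in rows:
        # Unify the format of text values: trim whitespace, use title case.
        name = row["first_name"].strip().title()
        county = row["county"].strip().title()
        # Adjust the data type: year of birth becomes an int, or None if missing.
        yob_text = row["year_of_birth"].strip()
        yob = int(yob_text) if yob_text.isdigit() else None

        # Drop repeated rows, comparing them after normalization.
        key = (name, yob, county)
        if key in seen:
            continue
        seen.add(key)

        record = {"first_name": name, "year_of_birth": yob, "county": county}
        # Set rows with missing values aside instead of silently dropping them.
        if yob is None:
            incomplete.append(record)
        else:
            complete.append(record)
    return complete, incomplete


complete_rows, incomplete_rows = clean(raw_rows)
print(len(complete_rows), "complete;", len(incomplete_rows), "with missing values")
```

Keeping the incomplete rows in a separate list, rather than deleting them, leaves the decision about how to treat missing values (drop, impute, or look up) to a later step.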
The fourth step is trying several models and comparing their results with each other, depending upon the problem’s nature. In the presidential election problem, I used Monte Carlo and Bayes algorithms.
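To give a feel for what a Monte Carlo approach looks like, here is a minimal sketch in Python. It is not the model used in the Maine project, whose details are not described here. It assumes a two-party race, a polled vote share, and a normally distributed polling error; the function name, the parameters, and the numbers are all hypothetical.

```python
import random


def simulate_win_probability(poll_share, poll_error, trials=10_000, seed=42):
    """Estimate the chance that a candidate's two-party vote share exceeds 50%.

    poll_share: polled two-party share for the candidate (0..1), a hypothetical input.
    poll_error: assumed standard deviation of the polling error.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    wins = 0
    for _ in range(trials):
        # Draw one plausible election outcome around the polled share.
        simulated_share = rng.gauss(poll_share, poll_error)
        if simulated_share > 0.5:
            wins += 1
    return wins / trials


# Invented numbers, not figures from the Maine project.
probability = simulate_win_probability(poll_share=0.48, poll_error=0.03)
print(f"Estimated win probability: {probability:.2f}")
```

The idea is to replace a single point prediction with many simulated elections and report the fraction of them the candidate wins. Comparing that fraction across models is one way to compare their results with each other.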
The fifth and final step is visualizing the results and communicating them in plain language in our reports. This step is the primary goal of the whole process because it delivers the predictions that answer the question that initiated the process.
Case Study: Presidential Elections in Maine
As I mentioned in the previous section, the data science process starts with a question. In this project, my question is: Will Donald Trump win in the state of Maine in the 2016 and 2020 presidential elections?
Population
The second step is collecting as much related data as possible. Therefore, I started with the population.
From information on the population distribution over Maine’s counties, found at the U.S. Census Bureau, I learned that it is not uniformly distributed. There are vast areas that are either unpopulated or that have only one person living in them. While the red dots in the south look small, more than 5,000 people live in each of them. Therefore, I should not be deceived by the maps distributed by the presidential campaigns or by the mainstream media.
The next logical step was to get the voters’ information. Some states publish their voter databases for free, and anyone can download them. However, in Maine, this was not the case. The state sold the voter databases to the political parties. So, I contacted the Secretary of State.
The office replied that to obtain voter files and updates from Maine’s Central Voter Registration system, the requesting person or entity must fall into one of the following five categories:
1. A candidate or person or entity working on a candidate’s campaign
2. Someone working for a party
3. A person or entity involved in a referendum campaign that will be on the ballot in Maine in the next statewide election
4. A person or entity involved in specific get-out-the-vote efforts in Maine (the efforts have to be identified, including name, location, and date of events in Maine)
5. An individual who has been elected or appointed to and currently serving in a municipal, county, state, or federal office, but only for use for the official’s authorized activities, not to turn over to another entity
The cost was based on the number of records obtained; the fee schedule is set out in Title 21-A, section 196-A. A statewide voter file, which contained almost one million records, cost $2,200.
After a few emails back and forth explaining that I needed the files for a research project and sending some verifications, the office kindly sent me, free of charge, a DVD with all the required information, with unneeded data such as last names hidden.
The first table on the DVD has the voters’ information and is shown in Figure 1-2. The columns are first name, year of birth, enrollment code, special designations, date of registration, congressional district, county ID, changed date, and date of last statewide election with VPH.
Figure 1-2. Voters’ information
The second table contains a registered and enrolled voters report, as in Figure 1-3. The columns of this table are the county name, municipality name, ward precinct, congressional district, state senate, county commissioner district, the party, and the total. The parties listed in the file are Democratic, Green Independent, Libertarian, Republican, and unenrolled.
Figure 1-3. Registered and enrolled voters report
This raw data was messy and contained many wrong values and outliers. For example, one voter’s age was recorded as 220 years, while his date of birth indicated that he was about 67 years old at that time. Some voters’ information was missing, and so on. Again, as mentioned earlier, always clean your data: handle outliers and missing data, adjust the data formatting, and explore your data.
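A conflict like the 220-versus-67 case can be caught by cross-checking two fields against each other. The sketch below, in Python rather than SAS Studio, recomputes each voter's expected age from the year of birth and flags rows where the recorded age disagrees. The `age` field, the reference year, the tolerance, and the sample rows are assumptions made for illustration, not the layout of the actual voter file.

```python
REFERENCE_YEAR = 2016  # assumed year in which the records were examined
TOLERANCE_YEARS = 1    # a birthday later in the year can shift the age by one


def find_age_mismatches(records):
    """Return records whose recorded age disagrees with their year of birth."""
    mismatches = []
    for record in records:
        expected_age = REFERENCE_YEAR - record["year_of_birth"]
        if abs(record["age"] - expected_age) > TOLERANCE_YEARS:
            # Keep the original fields and attach the recomputed age for review.
            mismatches.append({**record, "expected_age": expected_age})
    return mismatches


# Invented sample rows; the second mimics the 220-versus-67 conflict described above.
voters = [
    {"first_name": "Ann", "year_of_birth": 1980, "age": 36},
    {"first_name": "Joe", "year_of_birth": 1949, "age": 220},
]

for row in find_age_mismatches(voters):
    print(row["first_name"], "recorded:", row["age"], "expected:", row["expected_age"])
```

Flagging the row instead of overwriting the age keeps a record of which values were suspect, which helps when the same problem has to be explained in a report.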
In addition, you should collect as much historical data as you can. So, I started digging and collected as much data as I could find. From the United States Census Bureau, I downloaded more tables