Learn Data Science Using SAS Studio: A Quick-Start Guide
Ebook, 317 pages, 1 hour

About this ebook

Do you want to create data analysis reports without writing a line of code? This book introduces SAS Studio, a browser-based data science product that is free for educational and non-commercial purposes. The power of SAS Studio comes from its visual point-and-click user interface, which generates SAS code. It is easier to learn SAS Studio than to learn R and Python to accomplish data cleaning, statistics, and visualization tasks.

The book includes a case study about analyzing the data required for predicting the results of presidential elections in the state of Maine for 2016 and 2020. In addition to the presidential elections, the book provides real-life examples including analyzing stocks, oil and gold prices, crime, marketing, and healthcare. You will see data science in action and how easy it is to perform complicated tasks and visualizations in SAS Studio.

You will learn, step by step, how to create visualizations, including maps. In most cases, you will not need a line of code as you work with the SAS Studio graphical user interface. The book includes explanations of the code that SAS Studio generates automatically. You will learn how to edit this code to perform more advanced tasks. The book introduces you to multiple SAS products such as SAS Viya, SAS Analytics, and SAS Visual Statistics.


What You Will Learn

  • Become familiar with SAS Studio IDE
  • Understand essential visualizations
  • Know the fundamental statistical analysis required in most data science and analytics reports
  • Clean the most common data set problems
  • Use linear regression for data prediction
  • Write programs in SAS
  • Get introduced to SAS Viya, which is more powerful than SAS Studio


Who This Book Is For

A general audience of people who are new to data science, students, and data analysts and scientists who are experienced but new to SAS. No programming or in-depth statistics knowledge is needed.

Language: English
Publisher: Apress
Release date: Oct 1, 2020
ISBN: 9781484262375

    Book preview

    Learn Data Science Using SAS Studio - Engy Fouda

    Part I: Basics

    © Engy Fouda 2020

    E. Fouda, Learn Data Science Using SAS Studio, https://doi.org/10.1007/978-1-4842-6237-5_1

    1. Data Science in Action

    Engy Fouda

    Hopewell Junction, NY, USA

    In this chapter, we will introduce the case study of the book, which analyzes voters’ data in the state of Maine. It is based on a project I did at Harvard University in 2016 during my master’s degree. In fall 2016, the project for my A Practical Approach to Data Science course was to predict the presidential election results in every state. The project was under the guidance and supervision of Professor Larry Adams, who set the project milestones and requirements. I was responsible for forecasting Maine’s outcome for the 2016 and 2020 elections.

    The project was done in two phases. The first was to predict the results for the 2016 election. After verifying our data and results against what actually happened in the election, the second phase started. It was to include the new data that was generated in 2016 and use it to predict the results of the year 2020. Therefore, some charts and exercises in this book include 2016 data. Whenever possible, I collected any related historic data. For the prediction, I used historic election data going back to 1960.

    I defined voters’ groups by age, gender, education, demographics, and race. After studying the state from reliable academic sources, I identified issue categories like the economy, education, the environment, health care, and gun control.

    Similarly, I listed the state’s issues that would influence the presidential election by using the county ballot topics. Using the voting patterns of each party since 1960, poll accuracy, and the electoral votes, I tried different prediction methods and algorithms, such as Monte Carlo and Bayes, and statistical tests, such as the t-test, chi-square, and others. Afterward, I had to compare my results to other forecast sites, like FiveThirtyEight. My prediction was correct for 2016.

    This project was an exciting experience in which I converted cognitive features to numbers and crunched them to come up with results. Similarly, through other data science projects, I learned how to predict outcomes so as to drive decision making based upon measuring trends and studying patterns.

    Data Science Process

    The data science process starts with forming a question or hypothesis, then collecting relevant raw data, then cleaning and exploring that data, then modeling and evaluating, then deploying, visualizing, and communicating results in reports, as shown in Figure 1-1.

    Figure 1-1. Data science process

    Questions vary according to the field; for example:

    Politics: Will Trump win in Maine in 2016 and 2020?

    Facebook: How can you make people stay on Facebook longer?

    Medical: Is this tumor cancer or not?

    Hospital Management: How can you decrease patients’ wait lines so as to increase patients’ satisfaction?

    The second step is collecting raw data. Take the politics question as an example: Will a particular candidate win in a certain state?

    Collecting all the voters’ information—age, race, education, income, gender, and industry—is a crucial step, as is collecting the ballot data and voting results from over the years. The more historical data we have, the more accurate our predictions are. Furthermore, we should collect information on the population distribution throughout the years.

    The third step is cleaning this raw data, from managing the missing values, outliers, repeated rows, and misspelled information, to adjusting the columns’ data types, unifying the format of the values, and so on.
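
    The cleaning tasks listed above can be sketched in a few lines of code. The example below is written in plain Python rather than SAS, and it is only an illustration: the rows, the spelling lookup table, the 120-year age threshold, and the choice to fill gaps with the median are all assumptions, not the data or method used in the project.

```python
from statistics import median

# Hypothetical raw rows; every name and value here is invented for illustration.
raw_rows = [
    {"first_name": "Ann", "year_of_birth": "1950", "county": "cumberland "},
    {"first_name": "Ann", "year_of_birth": "1950", "county": "cumberland "},  # repeated row
    {"first_name": "Bob", "year_of_birth": "", "county": "York"},             # missing value
    {"first_name": "Cy", "year_of_birth": "1796", "county": "YORK"},          # outlier
    {"first_name": "Dee", "year_of_birth": "1985", "county": "Cumberlnd"},    # misspelling
]

SPELLING_FIXES = {"Cumberlnd": "Cumberland"}  # assumed lookup table
REFERENCE_YEAR = 2016
MAX_PLAUSIBLE_AGE = 120  # assumed threshold for an outlier


def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        # Unify the format of text values, then correct known misspellings.
        county = row["county"].strip().title()
        county = SPELLING_FIXES.get(county, county)

        # Adjust the data type: text to int, or None when the value is missing.
        text = row["year_of_birth"].strip()
        year = int(text) if text else None

        # Treat implausible ages as missing values.
        if year is not None and not 0 <= REFERENCE_YEAR - year <= MAX_PLAUSIBLE_AGE:
            year = None

        # Skip rows that repeat an earlier row after normalization.
        key = (row["first_name"].strip(), year, county)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"first_name": key[0], "year_of_birth": year, "county": county})

    # Fill missing years with the median of the known ones (one option among many).
    known = [r["year_of_birth"] for r in cleaned if r["year_of_birth"] is not None]
    fill = int(median(known))
    for r in cleaned:
        if r["year_of_birth"] is None:
            r["year_of_birth"] = fill
    return cleaned


cleaned_rows = clean(raw_rows)
```

    Whether to fill, flag, or drop a missing value depends on the question being asked; the median fill here only keeps the sketch short.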

    The fourth step is trying several models and comparing their results with each other, depending upon the problem’s nature. In the presidential election problem, I used Monte Carlo and Bayes algorithms.
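
    To give a feel for what a Monte Carlo approach looks like, here is a minimal Python sketch. It is not the model used in the project: the function name, the normal error model, the poll share of 0.48, and the error of 0.03 are all hypothetical values chosen for illustration.

```python
import random


def simulate_win_probability(poll_share, poll_error_sd, n_trials=10_000, seed=0):
    """Estimate the chance that a candidate's two-party vote share exceeds 50%.

    Each trial draws one plausible election-day share from a normal
    distribution centered on the poll average (an assumed error model).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = 0
    for _ in range(n_trials):
        simulated_share = rng.gauss(poll_share, poll_error_sd)
        if simulated_share > 0.5:
            wins += 1
    return wins / n_trials


# Hypothetical inputs: a 48% poll average with a 3-point standard error.
win_probability = simulate_win_probability(poll_share=0.48, poll_error_sd=0.03)
```

    The fraction of simulated elections that the candidate wins serves as the estimated win probability. Running several such models and comparing their outputs is the point of this step.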

    The fifth and final step is visualizing the results and communicating them in plain language in our reports. This step is the primary goal of the whole process because it delivers the predictions that answer the question that initiated the process.

    Case Study: Presidential Elections in Maine

    As I mentioned in the previous section, the data science process starts with a question. In this project, my question is: Will Donald Trump win in the state of Maine in the 2016 and 2020 presidential elections?

    Population

    The second step is collecting as much related data as possible. Therefore, I started with the population.

    From information on the population distribution over Maine’s counties, found at the U.S. Census Bureau, I learned that it is not uniformly distributed. There are vast areas that are either unpopulated or that have only one person living in them. While the red dots in the south look small, more than 5,000 people live in each of them. Therefore, I should not be deceived by the maps distributed by the presidential campaigns or by the mainstream media.

    The next logical step was to get the voters’ information. Some states publish their voter databases for free, and anyone can download them. However, in Maine, this was not the case. The state sold the voter databases to the political parties. So, I contacted the Secretary of State.

    The office replied that to obtain voter files and updates from Maine’s Central Voter Registration system, the requesting person or entity must fall into one of the following five categories:

    1. A candidate or person or entity working on a candidate’s campaign

    2. Someone working for a party

    3. A person or entity involved in a referendum campaign that will be on the ballot in Maine in the next statewide election

    4. A person or entity involved in specific get-out-the-vote efforts in Maine (the efforts have to be identified, including name, location, and date of events in Maine)

    5. An individual who has been elected or appointed to and currently serving in a municipal, county, state, or federal office, but only for use for the official’s authorized activities, not to turn over to another entity

    The cost was based on the number of records obtained; the fee schedule is set out in Title 21-A, section 196-A. A statewide voter file, which contained almost one million records, cost $2,200.

    After a few emails back and forth, in which I explained that I needed the files for a research project and sent some verification, the office kindly sent me a free DVD with all the required information and with unneeded data, such as last names, hidden.

    The first table on the DVD has the voters’ information and is shown in Figure 1-2. The columns are first name, year of birth, enrollment code, special designations, date of registration, congressional district, county ID, changed date, and date of last statewide election with VPH.

    Figure 1-2. Voters’ information

    The second table contains a registered and enrolled voters report, as in Figure 1-3. The columns of this table are the county name, municipality name, ward precinct, congressional district, state senate, county commissioner district, the party, and the total. The parties listed in the file are Democratic, Green Independent, Libertarian, Republican, and unenrolled.

    Figure 1-3. Registered and enrolled voters report

    This raw data was messy and contained many wrong values and outliers. For example, the recorded age of one voter was 220 years, while his date of birth indicated that he was about 67 years old at that time. Some voters’ information was missing, and so on. Again, as mentioned earlier, always clean your data: handle outliers and missing values, adjust the data formatting, and explore your data.
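
    A conflict like the 220-year-old voter can be caught with a simple cross-check between two fields. The Python sketch below is an illustration only: the field names `voter_id`, `year_of_birth`, and `age`, the sample records, and the one-year tolerance are assumptions, not the layout of the actual voter file.

```python
def find_age_conflicts(records, reference_year, tolerance=1):
    """Return records whose stated age disagrees with their year of birth.

    A tolerance of one year allows for voters whose birthday has not yet
    occurred in the reference year.
    """
    conflicts = []
    for record in records:
        expected_age = reference_year - record["year_of_birth"]
        if abs(record["age"] - expected_age) > tolerance:
            conflicts.append({**record, "expected_age": expected_age})
    return conflicts


# Hypothetical records; the first mirrors the kind of conflict described above.
sample = [
    {"voter_id": 1, "year_of_birth": 1949, "age": 220},
    {"voter_id": 2, "year_of_birth": 1980, "age": 36},
    {"voter_id": 3, "year_of_birth": 1990, "age": 25},
]
conflicts = find_age_conflicts(sample, reference_year=2016)
```

    Only the first record is flagged here. A check like this finds the conflict but does not say which field is wrong, so the flagged rows still need a rule or a manual review.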

    Beyond cleaning, you should also collect as much historical data as you can. So, I started digging and collected as much data as I could find. From the United States Census Bureau, I downloaded more tables
