
Getting Started With R and Data Science: Data Science Goals, Vectors, Functions, Univariate Analysis (Vector Analysis)

The document provides an introduction to using R for data science, emphasizing its importance in extracting meaningful information from raw data for business decision-making. It covers fundamental concepts such as vectors, functions, and univariate analysis, along with practical steps for handling data, including dealing with missing values (NAs) and performing summary statistics. The document also highlights the advantages of R over Excel for data analysis.

Uploaded by

nodariko.gachava

IPD

1. Getting Started with R and Data Science

Data Science Goals, Vectors, Functions, Univariate Analysis (Vector Analysis)


What does R (Data Science) have to do with business management or economics?
Or: Why do we have to learn this?

Raw Data

This is a part of a company's sales data. The data holds 49 variables (columns) and almost 50,000 observations (rows).
Raw data like the above typically does not directly support data-based decision-making or data-driven management.

The goal of data science is to draw meaningful information out of raw data; information that supports decision-making.

Data Science with R

What types of meaningful information can we get out of raw data?



We use R to get meaningful information out of raw data.

Why not Excel?


Vectors

Name an important characteristic of R Vectors:

Working with R Vectors:


• assigning values to a vector (creating a vector):
• example for an element-wise operation on a vector:
• get value at a certain position:
• logical subsetting (element-wise): # return the value if the condition at the position is TRUE
vector_name[condition] # element-wise execution
# also uses [] – position brackets
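The four operations listed above can be sketched as follows. This is a minimal illustration, not the original slide content; the vector name die_vec is borrowed from a later slide and the values are chosen only for demonstration.

```r
# Assigning values to a vector (creating a vector)
die_vec <- c(1, 2, 3, 4, 5, 6)

# Element-wise operation: every element is multiplied by 2
die_vec * 2            # 2 4 6 8 10 12

# Get the value at a certain position (R counts positions from 1)
die_vec[3]             # 3

# Logical subsetting: the condition is evaluated for every element,
# and only the values where it is TRUE are returned
die_vec[die_vec > 3]   # 4 5 6
```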


Important R Data Objects

Dimension One Data Type Multiple Data Types

1 Vector

R language: vector
Data science language: variable
Tabular Representation: column


Programming Concepts

What programming components can you find in the cartoon?

Source: Jeremy Keeshin, ReadWriteCode, p. 105



Programming Concepts

Input is: 4,5,6,7,8,9

Write the R command(s) to get the X * 2 output:
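One possible answer, as a sketch: store the input in a vector and multiply it by 2. The variable name x is an assumption taken from the "X * 2" wording; any valid name would work.

```r
# Input values from the slide
x <- c(4, 5, 6, 7, 8, 9)

# Element-wise multiplication: each input value is doubled
x * 2   # 8 10 12 14 16 18
```

Because R operates on vectors element by element, no loop is needed here.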

Source: Jeremy Keeshin, ReadWriteCode, p. 105


Data Objects and Functions

Data Objects (e.g. vector): store data elements (values)
vector_name

Code Objects (functions): store instructions, operate on data
functionName()
functionName(argument1=value, argument2=value, argument3=value, …)

Example: die_vec <- c(1,2,3,4,5,6)
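To make the distinction concrete, here is a small sketch in which a data object (the vector die_vec) is passed to code objects (functions). The functions length(), sum() and round() are standard base R functions chosen for illustration; they are not taken from the slide.

```r
# Data object: a vector that stores values
die_vec <- c(1, 2, 3, 4, 5, 6)

# Code objects: functions that operate on the data
length(die_vec)   # 6, the number of elements
sum(die_vec)      # 21, the sum of all elements

# Arguments can be passed by name: argument = value
round(x = 3.14159, digits = 2)   # 3.14
```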


Getting Meaningful Information out of Raw Data


"week1",605,1572,1454,2103,3579,3135,3971,
"week2",2470,2062,1884,1786,3694,3258,2690,
"week3",1996,2676,1565,1551,3497,3498,3652,
"week4",2999,2749,2522,2745,3284,3343,3142,
"week5",1987,2662,1584,2267,3546,3641,3539,
"week6",1275,2353,3287,3388,3753,3936,3173,
"week7",2235,2430,3333,2101,3825,4578,3879,
"week8",2926,1941,3353,2906,4018,3714,3656,
"week9",2634,3163,3551,2821,4304,4179,4237,
"week10",3249,3585,2066,3427,3765,3691,4259,
"week11",2191,3771,3254,3447,3524,4392,4335,
"week12",1650,1863,3491,2738,4638,4432,4183,
"week13",4050,3883,2915,2287,4157,4297,4757

Here is some raw data – daily sales revenues – from a small business.
We start with a univariate analysis, which means we look at only one variable: the raw sales data.

What type of meaningful information can we get out of this variable?


Univariate Summary Statistics with R


Step 1: Create the vector in R

- how many elements does the vector have?


- what vector type is it?
- what does the summary() command return?
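Step 1 could look like the sketch below. It assumes the raw data is typed in with c() exactly as shown on the earlier slide; the vector name sales_rev is the one used in Step 2.

```r
# Create the vector from the raw data (week labels mixed with daily revenues)
sales_rev <- c(
  "week1", 605, 1572, 1454, 2103, 3579, 3135, 3971,
  "week2", 2470, 2062, 1884, 1786, 3694, 3258, 2690,
  "week3", 1996, 2676, 1565, 1551, 3497, 3498, 3652,
  "week4", 2999, 2749, 2522, 2745, 3284, 3343, 3142,
  "week5", 1987, 2662, 1584, 2267, 3546, 3641, 3539,
  "week6", 1275, 2353, 3287, 3388, 3753, 3936, 3173,
  "week7", 2235, 2430, 3333, 2101, 3825, 4578, 3879,
  "week8", 2926, 1941, 3353, 2906, 4018, 3714, 3656,
  "week9", 2634, 3163, 3551, 2821, 4304, 4179, 4237,
  "week10", 3249, 3585, 2066, 3427, 3765, 3691, 4259,
  "week11", 2191, 3771, 3254, 3447, 3524, 4392, 4335,
  "week12", 1650, 1863, 3491, 2738, 4638, 4432, 4183,
  "week13", 4050, 3883, 2915, 2287, 4157, 4297, 4757
)

length(sales_rev)    # 104: 13 week labels + 91 daily values
typeof(sales_rev)    # "character": a vector holds one data type,
                     # so the numbers are coerced to text
summary(sales_rev)   # for a character vector: only Length, Class and Mode
```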


Univariate Summary Statistics with R


Step 2: Convert the character vector into a numerical vector

Statistical summary functions such as mean() cannot be meaningfully applied to a character vector.


→ we need to convert our character vector into a numerical vector

sales_rev_num <- as.numeric(sales_rev)

- does the command work?


- Why do we get a warning?
- What output do we get?
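A sketch of what happens during the conversion, using only the first two weeks of the data as a shortened excerpt (an assumption made to keep the example small):

```r
# Shortened excerpt of the raw data: weeks 1 and 2 only
sales_rev <- c("week1", 605, 1572, 1454, 2103, 3579, 3135, 3971,
               "week2", 2470, 2062, 1884, 1786, 3694, 3258, 2690)

# The command works, but R issues a warning:
# "NAs introduced by coercion"
sales_rev_num <- as.numeric(sales_rev)

# "week1" and "week2" cannot be read as numbers, so they become NA;
# all other elements are converted to numeric values
sales_rev_num
```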


NA Values - Missing Values


• Week information was replaced by NAs.
• NA is not a character value. (It does not have quotation marks.)
• In R, NA is a reserved keyword that indicates a missing value.
• When a character value cannot be interpreted as a number, the conversion to numeric turns it into NA. (The character value is lost!)
• The resulting vector contains numerical values and NAs (missing values).

Some functions in R work fine with NA values, others do not!
Some functions in R ignore rows with NA values!

mean(sales_rev_num)
median(sales_rev_num)
Output?

Before you start analyzing data, ALWAYS check if there are NA values.
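The effect of NA values on mean() and median() can be sketched with a short excerpt (week 1 only, an assumption for brevity). The na.rm argument is not mentioned on the slide; it is a standard argument of both functions and is shown here only as one possible way to handle NAs.

```r
# Week 1 after conversion: the week label has become NA
sales_rev_num <- c(NA, 605, 1572, 1454, 2103, 3579, 3135, 3971)

mean(sales_rev_num)     # NA: one missing value makes the result unknown
median(sales_rev_num)   # NA

# na.rm = TRUE tells the function to drop NAs before calculating
mean(sales_rev_num, na.rm = TRUE)     # about 2345.57
median(sales_rev_num, na.rm = TRUE)   # 2103
```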

Univariate Summary Statistics with R

Step 3: We check vector type and length again and run the summary statistics:

typeof(sales_rev_num)
str(sales_rev_num)
summary(sales_rev_num)
length(sales_rev_num)

What does the summary() function return?
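As a sketch of Step 3 on a short excerpt (week 1 only, an assumption for brevity), the four checks behave as follows on a numeric vector that still contains an NA:

```r
# Week 1 after conversion: the week label has become NA
sales_rev_num <- c(NA, 605, 1572, 1454, 2103, 3579, 3135, 3971)

typeof(sales_rev_num)   # "double": the vector is now numeric
str(sales_rev_num)      # compact overview: type, length and first values
length(sales_rev_num)   # 8: the NA still counts as an element

# For a numeric vector, summary() returns the minimum, 1st quartile,
# median, mean, 3rd quartile and maximum, plus the number of NAs
sales_summary <- summary(sales_rev_num)
sales_summary
```

The name sales_summary is hypothetical; storing the result is optional and only makes it easier to reuse.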


Univariate Summary Statistics with R

Median > Mean

How to interpret this?


- in terms of skewness?

- in terms of sales data:
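As a rule of thumb, a median above the mean is often read as a sign of a left-skewed (negatively skewed) distribution: a few unusually low values pull the mean down, while the median is barely affected. The sketch below illustrates this with made-up toy values; toy_sales is a hypothetical vector and not part of the lesson data.

```r
# Hypothetical toy data: four typical days and one very weak day
toy_sales <- c(500, 3000, 3200, 3400, 3600)

mean(toy_sales)     # 2740: pulled down by the single low value
median(toy_sales)   # 3200: the middle value, robust against the outlier

median(toy_sales) > mean(toy_sales)   # TRUE
```

For sales data, this pattern would suggest that most days are relatively strong and a small number of weak days lower the average.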

NAs - Missing Values

Before you start analyzing data, ALWAYS check if there are NA values.
Important functions:

• anyNA(x) # Checks whether a vector has any NA values


anyNA(sales_rev_num) # Result is TRUE if there is at least one NA value
[1] TRUE

• is.na(x) # Returns a logical vector (TRUE / FALSE)


# TRUE ≡ NA, FALSE ≡ value
is.na(sales_rev_num)
!is.na(sales_rev_num)

• Return the number of NAs / not NA elements

sum(is.na(sales_rev_num))
sum(!is.na(sales_rev_num))

The is.na() function can be used for logical subsetting of the vector.
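The functions above can be tried on a small excerpt. The vector below is a shortened, illustrative version of sales_rev_num (an assumption for brevity) with two NAs where week labels used to be.

```r
# Shortened excerpt with two missing values
sales_rev_num <- c(NA, 605, 1572, 1454, NA, 2470, 2062)

anyNA(sales_rev_num)         # TRUE: at least one NA is present

is.na(sales_rev_num)         # TRUE FALSE FALSE FALSE TRUE FALSE FALSE
!is.na(sales_rev_num)        # FALSE TRUE TRUE TRUE FALSE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is.na(sales_rev_num))    # 2 NAs
sum(!is.na(sales_rev_num))   # 5 non-NA elements
```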

NAs - Missing Values

Step 4: We remove the NAs out of our sales data vector

• We use the function is.na() and logical subsetting to achieve this

logical subsetting (element-wise): return the value if the condition at the position is TRUE

• vector_name[condition] #if condition is true, value at the position is returned

• Write the subsetting command that removes the NA values out of the vector.

• How many elements are left in the vector?
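One way to write the subsetting command, sketched on a shortened excerpt (an assumption for brevity). The result name sales_rev_clean is hypothetical; the slides do not name the cleaned vector.

```r
# Shortened excerpt with two missing values
sales_rev_num <- c(NA, 605, 1572, 1454, NA, 2470, 2062)

# !is.na() is TRUE wherever a real value exists,
# so only those positions are returned
sales_rev_clean <- sales_rev_num[!is.na(sales_rev_num)]

sales_rev_clean           # 605 1572 1454 2470 2062
length(sales_rev_clean)   # 5
anyNA(sales_rev_clean)    # FALSE
```

Applied to the full sales data, the same command would drop the 13 week labels that became NA and keep the 91 daily values.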

R Script
We save all steps in a script.
A script is a sequence of code lines that can be saved as an .R file.

Step 1: Create the vector in R


Step 2: Convert the character vector into a numerical vector
Step 3: We check vector type and length and run the summary statistics
Step 4: We remove the NAs out of our numerical sales data vector
Result: We have a clean numerical vector that can easily be analyzed.

• We comment our code.


• Test the script: run it and check that the vectors are created correctly and shown in the environment.
• Save your script as an .R file.

Always save your scripts.
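A commented script covering the four steps might look like the sketch below. It uses only the first two weeks of data to stay short, and the name sales_rev_clean is hypothetical; in your own script you would use the full data.

```r
# Univariate analysis of daily sales revenues

# Step 1: Create the vector (excerpt: weeks 1 and 2 only)
sales_rev <- c("week1", 605, 1572, 1454, 2103, 3579, 3135, 3971,
               "week2", 2470, 2062, 1884, 1786, 3694, 3258, 2690)

# Step 2: Convert the character vector into a numerical vector
# (week labels become NA; R warns about the coercion)
sales_rev_num <- as.numeric(sales_rev)

# Step 3: Check vector type and length, run the summary statistics
typeof(sales_rev_num)
length(sales_rev_num)
summary(sales_rev_num)

# Step 4: Remove the NAs with logical subsetting
sales_rev_clean <- sales_rev_num[!is.na(sales_rev_num)]

# Result: a clean numerical vector
summary(sales_rev_clean)
```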


Creating a 2nd Vector - Subtracting Taxes


• In Georgia there is a 20% income tax.

We create a 2nd vector out of the 1st that calculates for each day the sales revenue after tax deduction.

- write the R command:

- use the round() function to round the vector values to whole numbers.

Getting help in R:
?round()
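A possible solution, sketched on week 1 of the cleaned data. The vector names sales_rev_clean and sales_after_tax are hypothetical, and the sketch assumes the 20% is simply deducted from each day's revenue.

```r
# Week 1 of the cleaned sales data (excerpt)
sales_rev_clean <- c(605, 1572, 1454, 2103, 3579, 3135, 3971)

# After a 20% deduction, 80% of each daily revenue remains;
# the multiplication is applied element-wise
sales_after_tax <- sales_rev_clean * 0.8

# round() with its default digits = 0 rounds to whole numbers
sales_after_tax <- round(sales_after_tax)
sales_after_tax   # 484 1258 1163 1682 2863 2508 3177
```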

Summary

Summarize in your own words your key take-aways from the lesson.
Write it somewhere and copy to chat in a minute.

Summary

Write down the R functions that you learned today
