Getting Started With R and Data Science: Data Science Goals, Vectors, Functions, Univariate Analysis (Vector Analysis)
1. Getting Started with R and Data Science
IPD
Raw Data
This is part of a company's sales data. The data holds 49 variables (columns) and almost 50,000 observations (rows).
Raw data like this typically does not directly support data-based decision making or data-driven management.
The goal of data science is to extract meaningful information from raw data: information that supports decision-making.
We use R to get meaningful information out of raw data.
Vectors
1 Vector
R language: vector
Data science language: variable
Tabular representation: column
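The three names above describe the same thing. A minimal sketch, assuming made-up revenue values and the hypothetical names sales, sales_table and revenue_col, shows that a column taken out of a table is an ordinary vector:

```r
# Hypothetical example data; the values are made up for illustration.
sales <- c(120.5, 98.0, 143.2)  # a vector in R

# A data frame is a table whose columns are vectors (variables).
sales_table <- data.frame(day = c("Mon", "Tue", "Wed"), revenue = sales)

# Extracting one column gives back a plain vector.
revenue_col <- sales_table$revenue
is.vector(revenue_col)         # TRUE
identical(revenue_col, sales)  # TRUE
```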
Programming Concepts
Calling a function: functionName() or functionName(argument1 = value, argument2 = value, argument3 = value, …)
Assigning the result to a vector: vector_name <- functionName(…)
Example: die_vec <- c(1, 2, 3, 4, 5, 6)
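As a small illustration of these concepts, the sketch below builds the die vector with c() and passes it to a few base R functions. The variable rounded_pi is a hypothetical name used only to show named arguments.

```r
# c() is itself a function: it combines its arguments into one vector.
die_vec <- c(1, 2, 3, 4, 5, 6)

# Functions take the vector as their argument.
length(die_vec)  # 6
sum(die_vec)     # 21
mean(die_vec)    # 3.5

# Arguments can be passed by name (argument = value).
rounded_pi <- round(x = 3.14159, digits = 2)  # 3.14
```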
Here is some raw data – daily sales revenues – from a small business.
We start with a univariate analysis, which means we look at only one variable: the raw sales data.
Some functions in R work fine with NA values; others return NA or fail when NA values are present!
Some functions in R silently ignore elements with NA values!
mean(sales_rev_num)
median(sales_rev_num)
Output?
Before you start analyzing data, ALWAYS check if there are NA values.
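The effect of a single NA can be sketched as follows. The values in sales_rev_num here are a made-up stand-in, not the lesson's data, and the result names (mean_with_na and so on) are hypothetical. The na.rm argument is the standard base R way to tell mean() and median() to drop missing values.

```r
# Hypothetical stand-in for the sales vector (made-up numbers, one NA).
sales_rev_num <- c(100, 200, NA, 400)

# By default, one NA makes the result NA.
mean_with_na   <- mean(sales_rev_num)    # NA
median_with_na <- median(sales_rev_num)  # NA

# With na.rm = TRUE the missing value is removed before computing.
mean_no_na   <- mean(sales_rev_num, na.rm = TRUE)    # 233.3333
median_no_na <- median(sales_rev_num, na.rm = TRUE)  # 200

# A quick check for any missing values.
has_na <- anyNA(sales_rev_num)  # TRUE
```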
Step 3: We check vector type and length again and run the summary statistics:
typeof(sales_rev_num)
str(sales_rev_num)
summary(sales_rev_num)
length(sales_rev_num)
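To show what these four calls report, here is a sketch on a made-up stand-in vector (the real sales data is not reproduced here, so the values and the names vec_type, vec_length and vec_summary are assumptions):

```r
# Hypothetical stand-in for the sales vector (made-up numbers, one NA).
sales_rev_num <- c(100, 200, NA, 400)

vec_type   <- typeof(sales_rev_num)  # "double": numeric values are stored as doubles
vec_length <- length(sales_rev_num)  # 4: an NA still counts as an element

# str() prints a compact description of the type and the first values.
str(sales_rev_num)

# summary() returns min, quartiles, mean, max and the number of NA values.
vec_summary <- summary(sales_rev_num)
vec_summary
```

Note that summary() reports the NA count separately and computes the other statistics on the non-missing values.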
NAs - Missing Values
Before you start analyzing data, ALWAYS check if there are NA values.
Important functions:
sum(is.na(sales_rev_num))
sum(!is.na(sales_rev_num))
The is.na() function can be used for logical subsetting of the vector.
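The two counting commands work because is.na() returns a logical vector and sum() counts TRUE as 1. A minimal sketch on made-up values (na_flags, n_missing and n_present are hypothetical names):

```r
# Hypothetical stand-in for the sales vector (made-up numbers, two NAs).
sales_rev_num <- c(100, 200, NA, 400, NA)

# is.na() checks every element and returns TRUE where a value is missing.
na_flags <- is.na(sales_rev_num)  # FALSE FALSE TRUE FALSE TRUE

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values.
n_missing <- sum(na_flags)   # 2
n_present <- sum(!na_flags)  # 3
```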
NAs - Missing Values
Logical subsetting (element-wise): an element is returned if the condition at its position is TRUE.
• Write the subsetting command that removes the NA values from the vector.
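One possible way to apply logical subsetting is sketched below on made-up values. The names keep, sales_clean and high_days are hypothetical, and other solutions to the exercise exist.

```r
# Hypothetical stand-in for the sales vector (made-up numbers, one NA).
sales_rev_num <- c(100, 200, NA, 400)

# A logical vector with TRUE at every position that holds a value.
keep <- !is.na(sales_rev_num)  # TRUE TRUE FALSE TRUE

# Square brackets return only the elements where the condition is TRUE.
sales_clean <- sales_rev_num[keep]  # 100 200 400

# Any element-wise condition works the same way.
high_days <- sales_clean[sales_clean > 150]  # 200 400
```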
R Script
We save all steps in a script.
A script is a sequence of code lines that can be saved as a .R file.
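To illustrate that a script is just saved code lines, the sketch below writes three made-up lines to a temporary .R file and runs them with source(). In practice a script is usually typed and saved in an editor; the names script_lines and script_path are hypothetical.

```r
# Hypothetical script content: each element is one line of R code.
script_lines <- c(
  "sales_rev_num <- c(100, 200, NA, 400)",
  "n_missing <- sum(is.na(sales_rev_num))",
  "avg_rev <- mean(sales_rev_num, na.rm = TRUE)"
)

# Save the lines as a .R file (a temporary file, so nothing is overwritten).
script_path <- tempfile(fileext = ".R")
writeLines(script_lines, script_path)

# source() runs the saved script from top to bottom.
source(script_path)
n_missing  # 1
```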
We create a second vector from the first that holds, for each day, the sales revenue after tax deduction.
- Use the round() function to round the vector values to whole numbers.
Getting help in R:
?round
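A minimal sketch of this step, assuming made-up revenue values and a hypothetical flat tax rate of 20% (the lesson does not state the actual rate). Arithmetic on a vector is applied to every element, so no loop is needed.

```r
# Hypothetical stand-in for the sales vector (made-up numbers).
sales_rev_num <- c(100, 250, 333)

# Assumed tax rate, for illustration only.
tax_rate <- 0.20

# Vectorized arithmetic: the calculation runs once per element.
sales_after_tax <- sales_rev_num * (1 - tax_rate)  # 80.0 200.0 266.4

# round() with its default digits = 0 rounds to whole numbers.
sales_after_tax_rounded <- round(sales_after_tax)  # 80 200 266
```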
Summary
Summarize your key takeaways from the lesson in your own words.
Write them down and copy them to the chat in a minute.