01-R Basics
Financial Analytics
Sima Jannati
Data Science
What is Data Science?
Data science is the study of the generalizable extraction of knowledge from
data
Assessing whether new knowledge is actionable for decision making
i.e., predictive power as opposed to mere ability to explain the past
Requires an integrated skill set:
Problem formulation
Mathematics
Machine learning
Statistics
Databases
Optimization
…
It’s More than Statistics
Medication dispensed
Combinations?
Question:
How can you combine this information with other demographic characteristics?
Feature Construction
When data is large and multidimensional, it’s impossible to know a priori whether a
specific query is a good one
When predictive accuracy is the primary objective given massive data, the computer
can do what we cannot
It automates the earlier criteria (bold predictions, time-tested) on a large scale
Still need a clear understanding of the “story” behind the data to infer causality
If the data follows a “natural experiment,” it may be possible to extract a causal model that
could be used for intervention
Question: What is a natural experiment?
Requisite Skills
Fundamental skills fall into three broad classes:
1. Statistics:
Bayesian statistics, multivariate analysis, and econometrics
Fitting robust statistical models to data
2. Computer science:
Data structures, distributed and parallel computing, scripting languages, and
non-relational structures for big data
Since the 2008 financial crisis, market practitioners have been realizing that reliance on
… is no longer acceptable
The emerging new field of Analytics, also known as Data Science, is providing
Analytics is a practical and pragmatic approach where statistical rules and discrete
structures are automated on the datasets
Analytics has become the term used for describing the iterative process of proposing
models and finding how well the data fit the models, and how to predict future
outcomes from the models
What is Financial Analytics?
portfolios
R Basics
What is R?
R is an integrated suite of software for data manipulation, calculation, and
graphical display
It is built around a well-developed, sophisticated programming language, a dialect of “S”
It consists of:
Packages for data handling and storage
A collection of operators for calculations on arrays (matrices)
A large collection of statistical and data analysis tools
Sophisticated graphics facilities
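A minimal sketch of the array operators and built-in tools listed above. The matrix values are illustrative assumptions, not taken from the slides:

```r
# Illustrative values only; none of these numbers come from the slides.
m <- matrix(1:4, nrow = 2)   # 2 x 2 matrix, filled column by column

m_scaled  <- m * 2           # element-wise arithmetic on the whole array
m_prod    <- m %*% m         # matrix multiplication
col_means <- colMeans(m)     # one of R's built-in summary tools

print(m_prod)
print(col_means)
```

Note that `*` works element by element, while `%*%` is the matrix product.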
It is an environment
A fully planned and coherent system
As opposed to an incremental collection of very specific and inflexible
tools
How is R Organized?
R is open source
https://www.reddit.com/r/Rlanguage/
https://discuss.ropensci.org/
https://jumpingrivers.github.io/meetingsR/r-user-groups.html
Question!
5. Communication
Analysis is useless unless you can get others to understand it
Installing R and RStudio
Installing Base R
Go to https://r-project.org
CRAN: Comprehensive R Archive Network
Select: “To download R” on the second line of the text
You’ll see:
Install Base R
Select a mirror site (I chose WashU) and the following appears:
Select Download. Download the .exe file, execute it, and follow the directions
Install RStudio
Go to https://rstudio.com and select Download RStudio and the following appears:
Select Download under the free version
Select your operating system and download the installation file
Open RStudio
Open the RStudio GUI (Graphical User Interface)
The RStudio GUI Window
Upper left is an editor where you can enter, edit and save sets of
commands
Lower left is a command line where you can work interactively with R
that equals 10 and put the square root of (x+y)^3 in a third variable called
Z.
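One way this exercise might look at the command line. The value given to x is a hypothetical placeholder, because the first part of the exercise is not shown here; only y = 10 and the formula come from the text:

```r
x <- 6                 # hypothetical value; the value assigned to x in the exercise is not shown
y <- 10                # given in the exercise
Z <- sqrt((x + y)^3)   # square root of (x + y) cubed

print(Z)
```

R is case-sensitive, so Z and z would be two different variables.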
Tidy data is a standardized way to link a dataset's structure (i.e., its physical layout)
with its semantics (i.e., what it means)
Tidy datasets follow three basic rules:
While not always the easiest format for humans, it is the easiest format for
computers
Why Tidy Data?
Bottom line:
Standardizes how data is organized
Reduces errors
Facilitates vectorization
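A small sketch of what "facilitates vectorization" can mean in practice. The tickers, prices, and share counts are hypothetical values chosen for illustration:

```r
# Hypothetical data: one row per holding, one column per variable.
trades <- data.frame(
  ticker = c("AAA", "BBB", "CCC"),
  price  = c(10, 20, 30),
  shares = c(5, 2, 1)
)

# Because each variable is its own column, one expression
# computes the value of every holding at once, with no loop.
trades$value <- trades$price * trades$shares

print(trades)
```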
The Five Most Common Problems
While useful for tidying and eliminating inconsistencies, few data analysis tools work
directly with relational data, so we usually have to merge the datasets as needed.
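As a rough illustration of merging relational tables, base R's merge() can join two data frames on a shared key. The table names, column names, and values below are hypothetical:

```r
# Hypothetical tables: firm details in one, yearly returns in another.
firms <- data.frame(
  firm_id = c(1, 2),
  name    = c("Alpha", "Beta")
)
returns <- data.frame(
  firm_id = c(1, 1, 2),
  year    = c(2020, 2021, 2020),
  ret     = c(0.05, 0.02, -0.01)
)

# Join on the shared key so each return row carries its firm's name.
combined <- merge(returns, firms, by = "firm_id")

print(combined)
```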
Problem 5:
A single observational unit is stored in multiple tables
Question!
2. Subsetting
Tibbles are quite strict about subsetting
On a tibble, the [ operator always returns another tibble
In contrast, [ on a data frame sometimes returns a data frame and sometimes simply returns a vector
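The difference can be sketched as follows. The column names and values are illustrative, and the tibble part only runs if the tibble package happens to be installed:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

one_col  <- df[, "x"]                 # a single column collapses to a plain vector
two_cols <- df[, c("x", "y")]         # two columns stay a data frame
kept     <- df[, "x", drop = FALSE]   # drop = FALSE keeps the data frame

# Assumes the tibble package is available; skipped otherwise.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
  tb_col <- tb[, "x"]                 # still a tibble, even with one column
  print(class(tb_col))
}

print(class(one_col))
print(class(two_cols))
```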