01-R Basics

FINA 5376

Financial Analytics
Sima Jannati
Data Science
What is Data Science?
 Data science is the study of the generalizable extraction of knowledge from
data
 Assessing whether new knowledge is actionable for decision making
 i.e., predictive power as opposed to mere ability to explain the past
What is Data Science?
 Requires an integrated skill set:
 Problem formulation
 Mathematics
 Machine learning
 Statistics
 Databases
 Optimization
 …
It’s More than Statistics

 Data is increasingly heterogeneous and unstructured, emanating from complex networks
 Analysis requires integrating structured and unstructured data
 A complicated task requiring tools from computer science, linguistics, econometrics, sociology, and others
Scale Matters
 Traditional databases are inadequate for knowledge discovery
 Optimized for fast access and data summarization, given a query
 Not designed for pattern discovery
 Unexpected, actionable, and robust patterns in data
 The requirement of predictive accuracy on future observations is a prime consideration in data science
Data Volume and Implications
 Growing rapidly
 Storage capacity is virtually free, and most data is stored
 Using large datasets became practical in the 80’s and 90’s
 Explosion in software tools dedicated to leveraging transactional and behavioral data to explain and
predict
 Lesson from the 90’s
 Machine learning “works”, i.e., can detect subtle structure in data with few assumptions
 Downside: methods also pick up “noise”, with no way to distinguish from “signal”
Extracting Interesting Patterns is Nontrivial

 Consider a health care system transactions database
 Contact with a provider
 Medication dispensed
 Notes and observations
Extracting Interesting Patterns is Nontrivial
 Data is separated into prior history, diagnosis, outcome period (costs, complications,
etc.)
 Potential Questions:
 Are complications associated with certain prior meds?

 Combinations?

 Can we create “useful” features or aggregations?

 Given all dimensions, combinations of aggregations are virtually limitless, even in a relatively simple database
Example 1
 For example, let’s consider the MEPS database
 Link to data: https://meps.ahrq.gov/mepsweb/
 Large amount of data, unstructured, and spread across different files
 Question:
 Find information about medical prescriptions and download the data in a readable file
 What are some challenges you face?
Example 2

 As another example, consider the Census database
 Link to data: https://www.census.gov/
 Question:
 Find information about population income, nationally and regionally
 How can you combine this information with other demographic characteristics?
Feature Construction
 When data is large and multidimensional, it’s impossible to know a priori whether a
specific query is a good one
 When predictive accuracy is the primary objective given massive data, the computer
can do what we cannot
 It automates the earlier criteria (bold predictions, time-tested) on a large scale
 Still need a clear understanding of the “story” behind the data to infer causality
 If the data follows a “natural experiment,” it may be possible to extract a causal model that
could be used for intervention
 Question: What is a natural experiment?
Requisite Skills
 Fundamental skills fall into three broad classes:
1. Statistics
 Bayesian statistics, multivariate analysis, and econometrics
 Fitting robust statistical models to data

2. Computer science:
 Data structures, distributed and parallel computing, scripting languages, and non-relational structures for big data

3. Knowledge about correlation and causation
 Perhaps most importantly, the ability to formulate problems in ways that lead to effective solutions
To Sum Up
 At massive data scale, it may be possible to exploit observations that would not occur under controlled circumstances
 For example, in healthcare:
 Clinical trials exclude up to 80% of situations in which a drug might be prescribed
 This new landscape requires an integrated skill set as discussed earlier:
 Computer science
 Statistics
 Causal modelling
 Problem isomorphs and formulation
 Computational thinking
Financial Analytics
What is Analytics?

 Since the 2008 financial crisis, market practitioners have realized that reliance on models that are mathematically pure but fundamentally inaccurate is no longer acceptable
 The emerging field of Analytics, also known as Data Science, is providing computational intelligence to businesses in ways many had never envisioned

What is Analytics?

 Analytics is a practical and pragmatic approach in which statistical rules and discrete structures are applied to datasets in an automated way
 Analytics has become the term used for describing the iterative process of proposing
models and finding how well the data fit the models, and how to predict future
outcomes from the models
What is Financial Analytics?

 Financial analytics involves applying classic statistical models and computerized algorithms to financial market data and investment portfolios
R Basics
What is R?
 R is an integrated suite of software for data manipulation, calculation, and
graphical display
 It implements a well-developed, sophisticated programming language derived from S
 It consists of:
 Packages for data handling and storage
 A collection of operators for calculations on arrays (matrices)
 A large collection of statistical and data analysis tools
 Sophisticated graphics facilities
What is R?
 It is an environment
 A fully planned and coherent system
 As opposed to an incremental collection of very specific and inflexible tools
How is R Organized?
 R is open source

 The source code is available to all

 Installation packages are distributed through CRAN

 Base R includes the “bare bones” routines required to do simple tasks

 Hundreds of packages exist to perform a multitude of specialized tasks


Parallel Universes
 Packages and users are so numerous and diverse, users typically employ a specific subset
of packages sometimes known as “universes”
 Example:
 Tidyverse – An organized environment for data science
 xts – Extensible time series, a specialized environment for time series data analysis
 Tidyquant – Branch of Tidyverse designed specifically for quant finance
 Visualization universe – Dedicated entirely to graphics
 Flexibility – You can go back and forth between universes
How is R Organized?
 Packages are contributed by the user community

 “Good news” – someone has developed a package for virtually everything

 “Bad news” – too many choices and constant updating of packages

 “Good news” – the user community is very helpful


User Community Example
 https://community.rstudio.com/

 https://www.reddit.com/r/Rlanguage/

 https://discuss.ropensci.org/

 https://jumpingrivers.github.io/meetingsR/r-user-groups.html

 Question!

 I will open a discussion forum on Canvas

 Can you find and add other R communities there?


How is R Organized?
 Follow https://www.r-bloggers.com and https://blog.rstudio.org
 R vs. Stata:
 Open source versus paying large sums for licensing and updates

 User community versus paid help from corporate employees


Tasks In a Typical Data Science Project

1. Import the data into R
 Take data from a file, database, or web API and load it into an R “dataframe” or “tibble”

2. Wrangle the data
 Tidy the data – a good, time-saving investment
 Transform the data – narrow the data to the observations of interest, create new variables from existing ones, and calculate summary statistics
Tasks In a Typical Data Science Project
3. Visualization – the first engine of knowledge generation
 May show things that were not expected
 May suggest that you’re asking the wrong question, or that you need different data

4. Modelling – the second engine of knowledge generation
 Fundamentally mathematical, which scales well

5. Communication
 Analysis is useless unless you can get others to understand it
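The import–wrangle steps above can be sketched in R. The snippet below uses the built-in mtcars dataset rather than course data (which is not specified here), and assumes the dplyr and tibble packages are installed:

```r
# Steps 1-2 on a built-in dataset: import into a tibble, then
# narrow, create a variable, and summarize (illustrative only)
library(dplyr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model")   # step 1: load into a tibble

summary_tbl <- cars %>%
  filter(cyl >= 6) %>%                    # keep only the observations of interest
  mutate(kpl = mpg * 0.4251) %>%          # new variable from an existing one
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), n = n())  # summary statistics

summary_tbl
```

Each verb takes a tibble and returns a tibble, which is what makes the pipeline composable.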
Installing R and RStudio
Installing Base R
 Go to https://r-project.org
 CRAN: Comprehensive R Archive Network
 Select: “To download R” on the second line of the text
 You’ll see:
Install Base R
 Select a mirror site (I chose WashU) and the following appears:

 Select your operating system and the following appears:


Install R
 Select base

 Select Download. Download the .exe file, execute it, and follow the directions
Install RStudio
 Go to https://rstudio.com and select Download RStudio and the following appears:
Install RStudio
 Select Download under the free version
Install RStudio
 Select your operating system and download the installation file
Open RStudio
 Open the RStudio GUI (Graphical User Interface)
The RStudio GUI Window
 Upper left is an editor where you can enter, edit, and save sets of commands
 Lower left is a command line where you can work interactively with R
 Upper right describes the environment and lists existing objects
 Lower right is a space for displaying output
 e.g., plots, files, installed packages, and more


A Simple Math Check
 Create a variable named x with a value of 5, add it to a variable called y that equals 10, and put the square root of (x+y)^3 in a third variable called Z
 What is the natural logarithm of Z?
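One way to work this exercise at the R command line:

```r
x <- 5
y <- 10
Z <- sqrt((x + y)^3)   # square root of (x + y)^3
Z                      # about 58.095
log(Z)                 # natural logarithm, about 4.062
```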


Tidy Data
Tidy Data:
The Heart of Tidyverse
 The majority of data analysis work consists of cleaning and preparation
 Cleaning and preparation consists of diverse activities
 Outlier checking
 Date parsing
 Missing value handling
 Controlling for erroneous data
 We need a standard way to organize datasets for efficiency
Tidy Data
The Heart of Tidyverse

 Tidy is a standardized way to link dataset structure (i.e., its physical layout)
with semantics (i.e., what it means)
 Tidy datasets follow three basic rules:

1. Each variable must have its own column

2. Each observation must have its own row

3. Each value must have its own cell


Example
 Messy data:
Example
 Tidy data
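The slide’s messy and tidy tables are images and are not reproduced here; the sketch below shows the same idea on made-up GDP figures, using tidyr’s pivot_longer():

```r
library(tibble)
library(tidyr)

# Messy: the year columns hold values of a "year" variable
messy <- tibble(
  country = c("US", "UK"),
  `2020`  = c(21.1, 2.7),   # illustrative GDP figures
  `2021`  = c(23.3, 3.1)
)

# Tidy: one variable per column, one observation per row
tidy <- pivot_longer(messy, cols = c(`2020`, `2021`),
                     names_to = "year", values_to = "gdp")
tidy
```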
Why Tidy Data?
 Messy data requires different strategies to retrieve different variables
 Slow, and it invites errors

 Tidy data allows us to extract all values in a variable in a simple way


 Ensures that values of different variables from the same observation are always
paired
 Well suited to vectorized programming

 While not always the easiest format for humans, it is the easiest format for
computers
Why Tidy Data?

 Bottom line:
 Standardized

 Reducing errors

 Easiest format for computer processing

 Facilitates vectorization
The Five Most Common Problems

1. Column headers are values, not variable names

2. Multiple variables stored in one column

3. Variables are stored in both rows and columns

4. Multiple types of observational units are stored in the same table

5. A single observational unit is stored in multiple tables


Problem 1:
Column Headers are Values, Not Variable Names
 Data set has three variables:
 Religion
 Income
 Frequency
Column Headers are Values, Not Variable Names

The tidy version would look like this:
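The tidy table itself is an image on the slide; a sketch of the transformation, with hypothetical counts, looks like this:

```r
library(tibble)
library(tidyr)

# Illustrative fragment: income brackets appear as column headers,
# i.e., values are being used as variable names
pew <- tibble(
  religion  = c("Agnostic", "Atheist"),
  `<$10k`   = c(27, 12),    # hypothetical frequencies
  `$10-20k` = c(34, 27)
)

tidy_pew <- pivot_longer(pew, cols = -religion,
                         names_to = "income", values_to = "frequency")
tidy_pew  # three variables: religion, income, frequency
```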


Problem 2:
Multiple Variables in One Column
 Data set has five variables:
 Country
 Year
 Gender
 Age
 Cases
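This pattern can be untangled with pivot_longer() and a names_pattern that splits each packed column name into two variables; the counts below are made up:

```r
library(tibble)
library(tidyr)

# The column name "m014" packs two variables: gender ("m") and
# age group ("0-14"); similarly "f1524" is female, 15-24
tb <- tibble(
  country = "US",
  year    = 2020,
  m014    = 5,     # hypothetical case counts
  f1524   = 3
)

tidy_tb <- pivot_longer(tb, cols = c(m014, f1524),
                        names_to = c("gender", "age"),
                        names_pattern = "([mf])(.*)",
                        values_to = "cases")
tidy_tb  # five variables: country, year, gender, age, cases
```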
Problem 3:
Variables Stored in Rows and Columns
 This is an actual weather data table excerpt (d1 = day 1)
 Data set has four variables:
 id
 date
 tmax
 tmin
Problem 3:
Variables Stored in Rows and Columns

 The tidy version:
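Because tmax/tmin live in rows and days live in columns, tidying takes two steps: lengthen the day columns, then widen the element column. A sketch with made-up temperatures:

```r
library(tibble)
library(tidyr)

# Illustrative fragment of the weather table
weather <- tibble(
  id      = "MX17004",
  element = c("tmax", "tmin"),
  d1      = c(29.9, 13.8),   # hypothetical temperatures
  d2      = c(27.3, 14.4)
)

long <- pivot_longer(weather, cols = c(d1, d2),
                     names_to = "date", values_to = "value")
tidy_weather <- pivot_wider(long, names_from = element,
                            values_from = value)
tidy_weather  # four variables: id, date, tmax, tmin
```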


Problem 5:
A single observational unit is stored in multiple tables

 While useful for tidying and eliminating inconsistencies, few data analysis tools work
directly with relational data, so we usually have to merge the datasets as needed.
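Merging relational tables is typically done with a join; a minimal dplyr sketch on invented tables:

```r
library(tibble)
library(dplyr)

# One observational unit (a song's weekly chart position) split
# across two tables -- illustrative data
song <- tibble(id = c(1, 2), artist = c("A", "B"))
rank <- tibble(id = c(1, 1, 2), week = c(1, 2, 1), rank = c(87, 82, 91))

merged <- left_join(song, rank, by = "id")  # merge back for analysis
merged
```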
Problem 5:
A single observational unit is stored in multiple tables
Question!

 Find (or create!) an example of untidy data for problem 4


 Show a messy and tidy version of the data
Tibbles
What are Tibbles?

 We need a systematic, efficient way to condition data for analysis


 Most of these operations are conducted repetitively on a diverse variety of data sets
 A systematic approach saves time and reduces errors
 Most newer packages operate on tibbles
 It is a simple procedure to convert tibbles to regular data frames for use with older
packages
What are Tibbles?
 Tibbles are modernized data frames
 They retain features that work, and drop outdated ones,
 e.g., converting character vectors to factors.
 The tibble() function is a convenient way to create data frames, incorporating best practices for data frames
Interacting with Older Code
 Some older functions don’t work with tibbles
 Use as.data.frame() to convert a tibble back to a data.frame
 The incompatibility relates mostly to the use of the subsetting function [ , which
always returns another tibble in the tidy world
Example 1

library(tibble)

tb <- tibble(x = 1:5, y = 1, z = x^2 + y)

x <- is_tibble(tb)
x
## [1] TRUE
Example 1

Compare and contrast the following operations on a data.frame and an equivalent tibble:

df <- data.frame(abc = 1, xyz = "a")

df$x
df[, "xyz"]
df[, c("abc", "xyz")]
Example 2
df <- data.frame(abc = 1, xyz = "a")
df$x
## [1] "a"
df[, "xyz"]
## [1] "a"
df[, c("abc", "xyz")]
##   abc xyz
## 1   1   a

Note: df$x matches the xyz column by partial matching. In R versions before 4.0, stringsAsFactors = TRUE was the default, so the first two results print as factors ("Levels: a").
Example 3
tbl <- as_tibble(df)
tbl$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
Tibbles Features
 Tibbles never change an input’s type
 Tibbles never change variable names
 Tibbles never use row.names()
 The whole point of tidy data is to store variables in a consistent way, so tibbles never store a
variable as a special attribute.
Tibbles vs. Data Frames
1. Printing
 Printing a tibble shows only the first 10 rows and all the columns that fit on one screen
 It also prints an abbreviated description of the column type, using font styles and color for
highlighting

2. Subsetting
 Tibbles are quite strict about subsetting
 The function [ always returns another tibble
 In contrast, data frames sometimes return a data frame and sometimes simply return a vector
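The subsetting difference can be seen directly (assuming R ≥ 4.0, where character columns stay characters):

```r
library(tibble)

df  <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tbl <- as_tibble(df)

class(df[, "xyz"])   # data.frame drops to a plain character vector
class(tbl[, "xyz"])  # tibble stays a tibble
```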
