01-R Basics

FINA 5376

Financial Analytics
Sima Jannati
Data Science
What is Data Science?
 Data science is the study of the generalizable extraction of knowledge from
data
 Assessing whether new knowledge is actionable for decision making
 i.e., predictive power as opposed to mere ability to explain the past
What is Data Science?
 Requires an integrated skill set:
 Problem formulation
 Mathematics
 Machine learning
 Statistics
 Databases
 Optimization
 …
It’s More than Statistics

 Data is increasingly heterogeneous and unstructured, emanating from complex networks
 Analysis requires integrating structured and unstructured data
 A complicated task requiring tools from computer science, linguistics, econometrics, sociology, and others
Scale Matters
 Traditional databases are inadequate for knowledge discovery
 Optimized for fast access and data summarization, given a query
 Not designed for pattern discovery
 Unexpected, actionable, and robust patterns in data
 The requirement of predictive accuracy on future observations is a prime consideration in data science
Data Volume and Implications
 Growing rapidly
 Storage capacity is virtually free, and most data is stored
 Using large datasets became practical in the 80’s and 90’s
 Explosion in software tools dedicated to leveraging transactional and behavioral data to explain and
predict
 Lesson from the 90’s
 Machine learning “works”, i.e., can detect subtle structure in data with few assumptions
 Downside: methods also pick up “noise”, with no way to distinguish from “signal”
Extracting Interesting Patterns is Nontrivial

 Consider a health care system transactions database
 Contact with a provider
 Medication dispensed
 Notes and observations
Extracting Interesting Patterns is Nontrivial
 Data is separated into prior history, diagnosis, outcome period (costs, complications,
etc.)
 Potential Questions:
 Are complications associated with certain prior meds?

 Combinations?

 Can we create “useful” features or aggregations?

 Given all dimensions, combinations of aggregations are virtually limitless, even in a relatively simple database
Example 1
 For example, let’s consider the MEPS database
 Link to data: https://meps.ahrq.gov/mepsweb/
 Large amount of data, unstructured, and spread across different files
 Question:
 Find information about medical prescriptions and download the data in a readable file
 What are some challenges you face?
Example 2

 As another example, consider the Census database
 Link to data: https://www.census.gov/
 Question:
 Find information about population income, nationally and regionally
 How can you combine this information with other demographic characteristics?
Feature Construction
 When data is large and multidimensional, it’s impossible to know a priori whether a
specific query is a good one
 When predictive accuracy is the primary objective given massive data, the computer
can do what we cannot
 It automates the earlier criteria (bold predictions, time-tested) on a large scale
 Still need a clear understanding of the “story” behind the data to infer causality
 If the data follows a “natural experiment,” it may be possible to extract a causal model that
could be used for intervention
 Question: What is a natural experiment?
Requisite Skills
 Fundamental skills fall into three broad classes:
1. Statistics
 Bayesian statistics, multivariate analysis, and econometrics
 Fitting robust statistical models to data

2. Computer science:
 Data structures, distributed and parallel computing, scripting languages, and non-relational structures for big data

3. Knowledge about correlation and causation
 Perhaps most importantly, the ability to formulate problems in ways that lead to effective solutions
To Sum Up
 At massive data scale, it may be possible to exploit observations that would not occur under controlled circumstances
 For example, in healthcare:
 Clinical trials exclude up to 80% of situations in which a drug might be prescribed
 This new landscape requires an integrated skill set as discussed earlier:
 Computer science
 Statistics
 Causal modelling
 Problem isomorphs and formulation
 Computational thinking
Financial Analytics
What is Analytics?

 Since the 2008 financial crisis, market practitioners have realized that reliance on models that are mathematically pure but fundamentally inaccurate is no longer acceptable
 The emerging field of Analytics, also known as Data Science, is providing computational intelligence to businesses in ways many had never envisioned

What is Analytics?

 Analytics is a practical and pragmatic approach in which statistical rules and discrete structures are applied to datasets in an automated way
 Analytics has become the term used for describing the iterative process of proposing
models and finding how well the data fit the models, and how to predict future
outcomes from the models
What is Financial Analytics?

 Financial analytics involves applying classic statistical models and computerized algorithms to financial market data and investment portfolios
R Basics
What is R?
 R is an integrated suite of software for data manipulation, calculation, and
graphical display
 It implements a well-developed, sophisticated programming language derived from S
 It consists of:
 Packages for data handling and storage
 A collection of operators for calculations on arrays (matrices)
 A large collection of statistical and data analysis tools
 Sophisticated graphics facilities
What is R?
 It is an environment
 A fully planned and coherent system
 As opposed to an incremental collection of very specific and inflexible tools
How is R Organized?
 R is open source

 The source code is available to all

 Installation packages are distributed through CRAN

 Base R includes the “bare bones” routines required to do simple tasks

 Hundreds of packages exist to perform a multitude of specialized tasks


Parallel Universes
 Packages and users are so numerous and diverse, users typically employ a specific subset
of packages sometimes known as “universes”
 Example:
 Tidyverse – An organized environment for data science
 xts – Extensible time series, a specialized environment for time series data analysis
 Tidyquant – Branch of Tidyverse designed specifically for quant finance
 Visualization universe – Dedicated entirely to graphics
 Flexibility – You can go back and forth between universes
How is R Organized?
 Packages are contributed by the user community

 “Good news” – someone has developed a package for virtually everything

 “Bad news” – too many choices and constant updating of packages

 “Good news” – the user community is very helpful


User Community Example
 https://community.rstudio.com/

 https://www.reddit.com/r/Rlanguage/

 https://discuss.ropensci.org/

 https://jumpingrivers.github.io/meetingsR/r-user-groups.html

 Question!

 I will open a discussion forum on Canvas

 Can you find and add other R communities there?


How is R Organized?
 Follow https://www.r-bloggers.com and https://blog.rstudio.org
 R vs. Stata:
 Open source versus paying large sums for licensing and updates

 User community versus paid help from corporate employees


Tasks In a Typical Data Science Project

1. Import the data into R
 Take data from a file, database, or web API and load it into an R “dataframe” or “tibble”

2. Wrangle the data
 Tidy the data – a good, time-saving investment
 Transform the data – narrow the data to the observations of interest, create new variables from existing ones, and calculate summary statistics
Tasks In a Typical Data Science Project
3. Visualization – the first engine of knowledge generation
 May show things that were not expected
 May suggest that you’re asking the wrong question, or that you need different data

4. Modelling – the second engine of knowledge generation
 Fundamentally mathematical, which scales well

5. Communication
 Analysis is useless unless you can get others to understand it
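The import–wrangle steps above can be sketched in R. The snippet below uses the built-in mtcars dataset rather than course data (which is not specified here), and assumes the dplyr and tibble packages are installed:

```r
# Steps 1-2 on a built-in dataset: import into a tibble, then
# narrow, create a variable, and summarize (illustrative only)
library(dplyr)
library(tibble)

cars <- as_tibble(mtcars, rownames = "model")   # step 1: load into a tibble

summary_tbl <- cars %>%
  filter(cyl >= 6) %>%                    # keep only the observations of interest
  mutate(kpl = mpg * 0.4251) %>%          # new variable from an existing one
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), n = n())  # summary statistics

summary_tbl
```

Each verb takes a tibble and returns a tibble, which is what makes the pipeline composable.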
Installing R and RStudio
Installing Base R
 Go to https://r-project.org
 CRAN: Comprehensive R Archive Network
 Select: “To download R” on the second line of the text
 You’ll see:
Install Base R
 Select a mirror site (I chose WashU) and the following appears:

 Select your operating system and the following appears:


Install R
 Select base

 Select Download. Download the .exe file, execute it, and follow the directions
Install RStudio
 Go to https://rstudio.com and select Download RStudio and the following appears:
Install RStudio
 Select Download under the free version
Install RStudio
 Select your operating system and download the installation file
Open RStudio
 Open the RStudio GUI (Graphical User Interface)
The RStudio GUI Window
 Upper left is an editor where you can enter, edit, and save sets of commands
 Lower left is a command line where you can work interactively with R
 Upper right describes the environment and lists existing objects
 Lower right is a space for displaying output
 e.g., plots, files, installed packages, and more


A Simple Math Check
 Create a variable named x with a value of 5, add it to a variable called y that equals 10, and put the square root of (x+y)^3 in a third variable called Z
 What is the natural logarithm of Z?
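One way to work this exercise at the R command line:

```r
x <- 5
y <- 10
Z <- sqrt((x + y)^3)   # square root of (x + y)^3
Z                      # about 58.095
log(Z)                 # natural logarithm, about 4.062
```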


Tidy Data
Tidy Data:
The Heart of Tidyverse
 The majority of data analysis work consists of cleaning and preparation
 Cleaning and preparation consists of diverse activities
 Outlier checking
 Date parsing
 Missing value handling
 Controlling for erroneous data
 We need a standard way to organize datasets for efficiency
Tidy Data
The Heart of Tidyverse

 Tidy is a standardized way to link dataset structure (i.e., its physical layout)
with semantics (i.e., what it means)
 Tidy datasets follow three basic rules:

1. Each variable must have its own column

2. Each observation must have its own row

3. Each value must have its own cell


Example
 Messy data:
Example
 Tidy data
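The slide’s messy and tidy tables are images and are not reproduced here; the sketch below shows the same idea on made-up GDP figures, using tidyr’s pivot_longer():

```r
library(tibble)
library(tidyr)

# Messy: the year columns hold values of a "year" variable
messy <- tibble(
  country = c("US", "UK"),
  `2020`  = c(21.1, 2.7),   # illustrative GDP figures
  `2021`  = c(23.3, 3.1)
)

# Tidy: one variable per column, one observation per row
tidy <- pivot_longer(messy, cols = c(`2020`, `2021`),
                     names_to = "year", values_to = "gdp")
tidy
```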
Why Tidy Data?
 Messy data requires different strategies to retrieve different variables
 Slow, and it invites errors

 Tidy data allows us to extract all values in a variable in a simple way


 Ensures that values of different variables from the same observation are always
paired
 Well suited to vectorized programming

 While not always the easiest format for humans, it is the easiest format for
computers
Why Tidy Data?

 Bottom line:
 Standardized

 Reducing errors

 Easiest format for computer processing

 Facilitates vectorization
The Five Most Common Problems

1. Column headers are values, not variable names

2. Multiple variables stored in one column

3. Variables are stored in both rows and columns

4. Multiple types of observational units are stored in the same table

5. A single observational unit is stored in multiple tables


Problem 1:
Column Headers are Values, Not Variable Names
 Data set has three variables:
 Religion
 Income
 Frequency
Column Headers are Values, Not Variable Names

The tidy version would look like this:
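The tidy table itself is an image on the slide; a sketch of the transformation, with hypothetical counts, looks like this:

```r
library(tibble)
library(tidyr)

# Illustrative fragment: income brackets appear as column headers,
# i.e., values are being used as variable names
pew <- tibble(
  religion  = c("Agnostic", "Atheist"),
  `<$10k`   = c(27, 12),    # hypothetical frequencies
  `$10-20k` = c(34, 27)
)

tidy_pew <- pivot_longer(pew, cols = -religion,
                         names_to = "income", values_to = "frequency")
tidy_pew  # three variables: religion, income, frequency
```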


Problem 2:
Multiple Variables in One Column
 Data set has five variables:
 Country
 Year
 Gender
 Age
 Cases
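This pattern can be untangled with pivot_longer() and a names_pattern that splits each packed column name into two variables; the counts below are made up:

```r
library(tibble)
library(tidyr)

# The column name "m014" packs two variables: gender ("m") and
# age group ("0-14"); similarly "f1524" is female, 15-24
tb <- tibble(
  country = "US",
  year    = 2020,
  m014    = 5,     # hypothetical case counts
  f1524   = 3
)

tidy_tb <- pivot_longer(tb, cols = c(m014, f1524),
                        names_to = c("gender", "age"),
                        names_pattern = "([mf])(.*)",
                        values_to = "cases")
tidy_tb  # five variables: country, year, gender, age, cases
```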
Problem 3:
Variables Stored in Rows and Columns
 This is an actual weather data table excerpt (d1 = day 1)
 Data set has four variables:
 id
 date
 tmax
 tmin
Problem 3:
Variables Stored in Rows and Columns

 The tidy version:
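Because tmax/tmin live in rows and days live in columns, tidying takes two steps: lengthen the day columns, then widen the element column. A sketch with made-up temperatures:

```r
library(tibble)
library(tidyr)

# Illustrative fragment of the weather table
weather <- tibble(
  id      = "MX17004",
  element = c("tmax", "tmin"),
  d1      = c(29.9, 13.8),   # hypothetical temperatures
  d2      = c(27.3, 14.4)
)

long <- pivot_longer(weather, cols = c(d1, d2),
                     names_to = "date", values_to = "value")
tidy_weather <- pivot_wider(long, names_from = element,
                            values_from = value)
tidy_weather  # four variables: id, date, tmax, tmin
```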


Problem 5:
A single observational unit is stored in multiple tables

 While useful for tidying and eliminating inconsistencies, few data analysis tools work
directly with relational data, so we usually have to merge the datasets as needed.
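Merging relational tables is typically done with a join; a minimal dplyr sketch on invented tables:

```r
library(tibble)
library(dplyr)

# One observational unit (a song's weekly chart position) split
# across two tables -- illustrative data
song <- tibble(id = c(1, 2), artist = c("A", "B"))
rank <- tibble(id = c(1, 1, 2), week = c(1, 2, 1), rank = c(87, 82, 91))

merged <- left_join(song, rank, by = "id")  # merge back for analysis
merged
```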
Problem 5:
A single observational unit is stored in multiple tables
Question!

 Find (or create!) an example of untidy data for problem 4


 Show a messy and tidy version of the data
Tibbles
What are Tibbles?

 We need a systematic, efficient way to condition data for analysis


 Most of these operations are conducted repetitively on a diverse variety of data sets
 A systematic approach saves time and reduces errors
 Most newer packages operate on tibbles
 It is a simple procedure to convert tibbles to regular data frames for use with older
packages
What are Tibbles?
 Tibbles are modernized data frames
 They retain features that work, and drop outdated ones,
 e.g., converting character vectors to factors.
 The tibble() function is a convenient way to create data frames, incorporating best practices for data frames
Interacting with Older Code
 Some older functions don’t work with tibbles
 Use as.data.frame() to convert a tibble back to a data.frame
 The incompatibility relates mostly to the use of the subsetting function [ , which
always returns another tibble in the tidy world
Example 1

library(tibble)

tb <- tibble(x = 1:5, y = 1, z = x^2 + y)

x <- is_tibble(tb)
x
## [1] TRUE
Example 1

Compare and contrast the following operations on a data.frame and an equivalent tibble:

df <- data.frame(abc = 1, xyz = "a")

df$x
df[, "xyz"]
df[, c("abc", "xyz")]
Example 2
df <- data.frame(abc = 1, xyz = "a")
df$x
## [1] "a"
df[, "xyz"]
## [1] "a"
df[, c("abc", "xyz")]
##   abc xyz
## 1   1   a

Note: df$x matches the xyz column by partial matching. In R versions before 4.0, stringsAsFactors = TRUE was the default, so the first two results print as factors ("Levels: a").
Example 3
tbl <- as_tibble(df)
tbl$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
Tibbles Features
 Tibbles never change an input’s type
 Tibbles never change variable names
 Tibbles never use row.names()
 The whole point of tidy data is to store variables in a consistent way, so tibbles never store a
variable as a special attribute.
Tibbles vs. Data Frames
1. Printing
 Printing a tibble shows only the first 10 rows and all the columns that fit on one screen
 It also prints an abbreviated description of the column type, using font styles and color for
highlighting

2. Subsetting
 Tibbles are quite strict about subsetting
 The function [ always returns another tibble
 In contrast, data frames sometimes return a data frame and sometimes simply return a vector
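The subsetting difference can be seen directly (assuming R ≥ 4.0, where character columns stay characters):

```r
library(tibble)

df  <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tbl <- as_tibble(df)

class(df[, "xyz"])   # data.frame drops to a plain character vector
class(tbl[, "xyz"])  # tibble stays a tibble
```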
