Lecture 4 Post-Lecture

Download as pdf or txt
Download as pdf or txt
You are on page 1of 47

MF810 Advanced Programming

Data Structure and Algorithms

Lecture 4
Database and SQL

Jun Fan
[email protected]

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 1


Review

 Sorting Algorithm
 Asymptotic Bounding
 Data Abstraction
 Tidy Data
 Case Study: The Great Quant Meltdown

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 2


August 2007 Great Quant Meltdown

2007 GQM 2021 Archegos


• Multiple Leveraged Portfolios • Single Leveraged Portfolio
• Crowded Positions among • Concentrated Positions by one
Quant PMs PM
• Each PM holds 10% ADV of 1000 stocks • Archegos holds 1000% ADV of 8 stocks
• 20 Quant PMs hold similar stocks • Shared among 5 Brokers

• Global Liquidations among • Forced Liquidations by PBs


Quant funds
• Severe temporarily market • Strong correction of stock price
impact -> rebound quickly from prior market impact

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 3


August 2007 Great Quant Meltdown
Multi-day trading impact to your positions

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 4


Agenda

 Data Manipulation
 Relational Database
 Structured Querying Language
 Database Landscape
 Case Study: The WSB GameStop short sell frenzy

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 5


Tidy Data

Fundamentals of “tidy” data

Why?
• Tidying data is about structuring datasets to facilitate analysis
• Such datasets are easy to manipulate, model and visualize
• The format provides a well-defined structure and uniformity over your datasets
• Provides access to a number of tools that work well with tidy datasets
https://vita.had.co.nz/papers/tidy-data.pdf
Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 6
Grammar for data manipulation

Common Verbs for operating on “tidy” data

select: pick variables by their names


filter: choose rows that satisfy some criteria
mutate: create transformed or derived variables
join: combine two tables of data
summarize: collapse rows down to summaries
arrange: reorder the rows

group_by: verb to perform a summary operation on a grouping of rows

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 7


Grammar for data manipulation

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 8


Agenda

 Data Manipulation
 Relational Database
 Structured Querying Language
 Database Landscape
 Case Study: The WSB GameStop short sell frenzy

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 9


Commodity Futures

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 10


Chinook Database Schema

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 11


Agenda

 Data Manipulation
 Relational Database
 Structured Querying Language
 Database Landscape
 Case Study: The WSB GameStop short sell frenzy

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 12


Structured Querying Language (SQL)

SQL is a domain-specific language that enables us to perform set operations such as


union, intersect, etc to organize and retrieve information in relational databases. This
language has been expounded upon and used almost everywhere there is tabular data.

Operations supported by SQL adhere to the CRUD (create, read, update, delete)
paradigm. We will focus on the ”read” aspect.

While there is an ANSI SQL standard, the standard is both a superset of what is often
implemented and also insufficient to cover areas such as indexes and file storage.
Commercial vendors are not incentivized to fully standardize as it allows for vendor
lock-in.

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 13


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 14


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 15


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 16


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 17


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 18


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 19


Structured Querying Language (SQL)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 20


Agenda

 Data Manipulation
 Relational Database
 Structured Querying Language
 Database Landscape
 Case Study: The WSB GameStop short sell frenzy

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 21


Data Considerations: Introduction

‣ Structured vs unstructured data vs semi-structured


‣ Native file format? CSV, XLS, JSON, XML
‣ Size of the data set? GB? TB? PB?
‣ Frequency of data - Monthly? Daily? Hourly? Second?
‣ How available does the data need to be?
‣ Sensitivity to missing data
‣ Longevity? Do we need to retain data for audit purposes?
‣ Do I need to secure the data?
‣ How will the data be used? Batch vs interactive?
‣ What kind of computations will I be performing with this data?
‣ What is my budget? How much will it cost?

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 22


Data Representation

Image source: https://www.igneous.io/blog/structured-data-vs-unstructured-data

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 23


Data Considerations: Storage Formats

Storage Formats

Files – “Old-school” approach where files get a


name, tagged with metadata and organized
hierarchically in a series of folders/directories.
This format excels at handling relatively small
amounts of data.

Blocks - A block is a raw storage volume filled with files that have been split into equal size chunks of
data. Each block does not have associated metadata, rather the operating system allocates storage
for different applications and decides what goes into each block. Often used for databases, email
servers, RAID redundancy and virtual machines.

Objects - data is stored in isolated containers identified by a unique ID or hash. These objects can be
stored locally or remotely and very amenable to scaling. Often used for big data, web apps and
backups.

Lecture 3 – 2/1/2024 MF810 Advanced Programming – J Fan 24


Data Storage Landscape

Relational Databases
Most common and prevalent method of scalable storage for predominantly tabular data,
that has existed in some form or another since the 1960s. Traditional forms of usually
supported ACID properties (atomicity, consistency, isolation and durability). Modern forms
of these sometimes sacrifice one or more of these properties in exchange for some
optimization.

Traditional vendors who have adapted to distributed computing include Oracle,


MySQL, MS SQLServer, PostgreSQL.
Modern versions

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 25


Data Storage Landscape

NoSQL Databases
NoSQL or sometimes called ”Not only SQL” became popular with the advent of the
Web 2.0 movement. Emphasis is on simpler design and better horizontal scaling to
clusters. With NoSQL, unstructured schema-less data can be stored.

Main categories of NoSQL databases consist of:


• Key-Value
• Document
• Wide-column
• Graph

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 26


NoSQL Paradigm

NoSQL Databases (Key-Value Stores)


Key-value stores are usually simple databases that use an associative array to map a single
unique key to one and only one value in a collection. Often some form of dictionary or
hash-table and are usually cached in memory allowing fast query.

Popular flavors include

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 27


NoSQL Paradigm

NoSQL Databases (Key-Value Stores)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 28


NoSQL Paradigm

NoSQL Databases (Document Stores)


Document store databases use JSON, XML, or BSON documents to store data.
Fairly flexibly support for any type of data you want. Documents can have different
structures, but you can index common fields for faster performance.

Popular flavors include

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 29


NoSQL Paradigm

NoSQL Databases (Document Stores)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 30


NoSQL Paradigm

NoSQL Databases (Wide-Column Stores)


Wide-column stores use tables, rows and columns like relational databases, but the names
and format can vary from row-to-row.

Popular flavors include

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 31


NoSQL Paradigm

NoSQL Databases (Wide-Column Stores)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 32


CAP Theorem

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 33


SQL v.s. NoSQL

SQL NoSQL

Relational Non-Relational

Structured Data (tables) Un-Structured Data

Schema “Schema-less” or Dynamic

Vertical Scaled Horizontal Scaled

Limited data capacity Can handle very large data

Easier deployment “Harder” deployment

Co-exist in the foreseeable future (just like nerd and quant)

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 34


Data Storage Landscape

Distributed File Systems


A distributed file system allows files to be accessed using the same interfaces and
semantics as local files – for example, mounting/unmounting, listing directories,
read/write at byte boundaries, system's native permission model

Notable versions include Google FS and

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 35


Amazon Simple Storage Service (S3)

Storage Format: Object storage organized as buckets where each object is identified by a
bucket, key and version ID

Good for frequently accessed data due to its’ design for low latency and high throughput –
amenable to cloud applications, dynamic websites, content distribution, big data analytics,
etc. High fault tolerance and availability of data through replication. Scalability from the
perspective of being an object store.

Multiple tiers of service that specialize on different guarantees:


• Standard: default
• Standard Infrequent Access (IA): optimized for cost by splitting data across two stores
(one for frequent data and one for infrequent data access)
• One-Zone Infrequent Access: optimized for cost at the tradeoff of availability and
resiliency
• Glacier: designed for archiving, optimized for durability at the tradeoff for high latency

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 36


Agenda

 Data Manipulation
 Relational Database
 Structured Querying Language
 Database Landscape
 Case Study: The WSB GameStop short sell frenzy

Lecture 1 – 1/18/2024 MF810 Advanced Programming – J Fan 37


GameStop

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 38


GameStop

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 39


The Wolf of the WallStreetBets

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 40


The Wolf of the WallStreetBets

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 41


Robinhood

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 42


Buy Long and Sell Short

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 43


Short Squeeze

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 44


Timeline

• Short selling
• Cost of borrowing
• Limited upside risk, but unlimited downside risk
• GameStop was being short sold at more than 100% of its total free float
market share
• Crowd sourced strong buying triggered by users of the subreddit
r/wallstreetbets
• Short seller realized significant loss
• forced liquidating short positions by buying back shares
• raise stock price further
• GameStop price reached a max value of over $500 per share, nearly 30
times the $17.25 price at the beginning of the month.

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 45


Timeline

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 46


GameStop

Lecture 4 – 2/8/2024 MF810 Advanced Programming – J Fan 47

You might also like