DATA 1050 Cheatsheet
],
"linecolor":"rgba(0, 0, 0, 1.000)",
"tickcolor":"rgb(0, 0, 0)",
"tickfont":{
"color":"rgba(0, 0, 0, 1.000)",
"size":11
}
}
}
}

4 Tabular data formats include CSV and Parquet. CSV is a plain text format that uses commas to separate entries and newlines to separate rows. Parquet is a binary format (looks like gibberish if you interpret the bits of the file as plain text) which is faster and more space efficient.

to a different location in RAM and continue reading bytes from there.

5 CPU operations are synchronized by a clock generator, which fires about a billion times a second. Machine integer operations can be executed in 1-3 clock cycles, while more complex operations (like floating point division) can take more like 30 cycles.

6 When you write code in a compiled language (like C, C++, Rust, Go, Haskell, OCaml, etc.), you create an executable file to be directly executed by the computer. For programs written in Python, the executable is not the program you wrote but the Python runtime system. The Python runtime interprets your code and changes the way that it executes accordingly. Other languages that use runtimes include Julia, R, Java, C#, and JavaScript.

7 Many languages (Julia, Java, C#, JavaScript, and others) compile parts of your code to machine code as the program executes; this is called just-in-time compilation. Neither Python nor R is JIT-compiled unless you’re using a package for that purpose (like Numba) or a non-standard interpreter (like PyPy).

8 Interpreting code is typically much slower than executing compiled code (typically 5x-30x). Python, R, and MATLAB manage reasonable performance by connecting to compiled libraries (usually written in C, C++, or Fortran) for compute-intensive tasks. This is why vectorization is an important performance technique in these languages.

2 Jupyter is a popular development environment which provides researchers with tools for combining exposition and code into a single document called a Jupyter notebook. Under the hood, the contents of a Jupyter notebook file are a JSON string.

3 Jupyter supports many magic commands which are not part of the Python language but which allow us to do various convenient things. For example, the %%sql magic causes the contents of the cell to be interpreted as SQL code.

4 Jupyter has an edit mode for entering text in cells and a command mode for manipulating cells (for example, merging or deleting cells). If there’s a blinking cursor in a cell, the current mode is edit, and otherwise the current mode is command. Switching between modes is accomplished with the escape key (edit to command mode) and the enter key (command to edit mode).

5 Jupyter has many keyboard shortcuts which are worth learning. Cells are deleted in command mode with two strokes of the d key. You can highlight cells in command mode by holding shift and using your arrow keys, and you can merge the highlighted cells into a single cell using shift-m. Insertion of new cells is accomplished with either a (insert cell above) or b (insert cell below) in command mode. Cells can be switched between Markdown (m) and code (y) in command mode.

6 VS Code is a text editor with many features and extensions to support development in many languages, including Python. It has better support than Jupyter for working with multiple files, debugging (stepping through code), refactoring (changing the structure of your code), and version control.

Data Systems

1 Organizations use a wide variety of technologies to manage their data. Organizations’ concerns around data include how and where to store the data, how to access the data, how to perform calculations on the data, how to process the data, how and when to cache intermediate results, how to display the data to make it actionable, and many others.
2 Databases are used to store structured data, because they are designed to provide guarantees around data integrity and to provide rich access to the data.

3 Bucket storage in the cloud is useful for files which are large and are not structured enough to go in a database (e.g., image files, video files, PDF files).

4 A data warehouse is a data system in an organization which is highly structured and carefully curated. A data lake is a central but less structured and/or less curated repository of data collected by the organization.

5 Data ingestion, storage, cleaning, analytics, and UIs are often related in complex ways (not a simple pipeline):

The shell

1 Bytes stored on the hard drive are organized into files. Files are organized hierarchically into an arbitrarily nested collection of directories (also known as folders).

2 The operating system customarily handles each file according to its file type, which is customarily indicated by its file extension (like .pdf in resume.pdf).

3 You can interact programmatically with your file system using a program called a shell. On Unix or macOS, the shell is bash or a close relative.

4 Important shell commands include
• pwd - print the current working directory
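
As a rough Python counterpart to the file-system ideas above (nested directories, file extensions, and the working directory that pwd prints), here is a minimal sketch using the standard pathlib module. The directory and file names (projects, data1050, resume.pdf) and the describe helper are made up for illustration.

```python
from pathlib import Path
import tempfile

def describe(path: Path) -> dict:
    """Summarize where a file sits in the directory hierarchy."""
    return {
        "name": path.name,           # final component, e.g. "resume.pdf"
        "extension": path.suffix,    # conventional hint at the file type
        "parent": path.parent.name,  # directory that directly contains the file
    }

# Path.cwd() plays the role of `pwd`: it returns the current working directory.
working_directory = Path.cwd()
print("working directory:", working_directory)

# Build a small nested hierarchy inside a temporary directory (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "projects" / "data1050"
    nested.mkdir(parents=True)        # creates both levels of nesting
    resume = nested / "resume.pdf"
    resume.write_bytes(b"")           # an empty file is enough for the demo
    info = describe(resume)
    file_exists = resume.is_file()

print(info)
```

The extension is only a naming convention: the file written here is empty, yet it still reports .pdf as its suffix.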
8 Checking out a branch sets the state of your working directory to the state of the commit that the branch points to. To preserve any unsaved work in your working directory, do a git stash. Put that work back into your working directory later with git stash apply. You will also want to stash when you git pull to get the latest copy of your code from GitHub.

9 You can merge a branch into yours to bring in that branch’s changes (the ones added since the most recent common ancestor). Here’s what it looks like if we merge theirbranch into main:

[Figure: commit graphs before and after the merge, with branch labels main and theirbranch and commits labeled shared parent, my commit, my second commit, their commit, and merge commit.]

statement SELECT * FROM birds;, both SELECT and FROM are keywords.

6 Identifiers specify tables, columns, or other database objects (depending on context). birds is an identifier which specifies which table we’re selecting from.

7 Identifiers may be surrounded by double quotes to ensure they are not interpreted as keywords and to allow them to use otherwise disallowed characters (like whitespace).

8 String literals in SQL are enclosed in single quotes. Numeric literals can be entered like 4, 3.2, or 1.925e-3.

9 Queries use the SELECT keyword. The basic structure of a SELECT statement is

    SELECT [select_list] FROM [table_expression] [sort_specification];

The table expression is evaluated and then passed to the select list. The sort specification (if present) then processes the resulting rows before they are returned.

10 The table expression is an expression that returns a table, like a table name or another SELECT statement enclosed in parentheses.

11 The select list is a comma-separated list of value expressions, which may consist of column identifiers, constant literals, or expressions involving function calls and operators. In this context, the asterisk is a special character meaning "all columns".

12 Each value expression may be assigned a specific name using the AS keyword.

    SELECT
      common_name,
      LENGTH(common_name) AS name_length,
      victory_points + egg_capacity AS total_points
    FROM
      birds;

of equal value. If the table expression in a SELECT statement has been grouped, then each entry in the select list must be either a value that was grouped on or a call to an aggregate function (like SUM, AVG, MAX, MIN, or COUNT, which reduces a column of values to a single value).

    SELECT fruit,
           MAX(LENGTH(common_name)) AS max_name_length
    FROM birds
    GROUP BY fruit;

    common_name              fruit
    American Robin           1
    Cedar Waxwing            2
    Ash-Throated Flycatcher  1
    Southern Cassowary       2
    Common Nightingale       1

    ↓

    common_name              fruit
    American Robin           1
    Ash-Throated Flycatcher  1
    Common Nightingale       1
    Cedar Waxwing            2
    Southern Cassowary       2

    ↓

    fruit  max_name_length
    1      23
    2      18
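
The grouped query above can be tried without a Postgres instance. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in (an assumption; SQLite's dialect differs from Postgres in places, though LENGTH, MAX, and GROUP BY behave the same here). The function name is hypothetical, and the ORDER BY clause is added only to make the row order deterministic.

```python
import sqlite3

# Rows copied from the example table above: (common_name, fruit).
BIRDS = [
    ("American Robin", 1),
    ("Cedar Waxwing", 2),
    ("Ash-Throated Flycatcher", 1),
    ("Southern Cassowary", 2),
    ("Common Nightingale", 1),
]

def max_name_length_by_fruit(rows):
    """Load the rows into an in-memory table and run the GROUP BY query."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE birds (common_name TEXT, fruit INTEGER)")
        connection.executemany("INSERT INTO birds VALUES (?, ?)", rows)
        query = """
            SELECT fruit,
                   MAX(LENGTH(common_name)) AS max_name_length
            FROM birds
            GROUP BY fruit
            ORDER BY fruit;
        """
        return connection.execute(query).fetchall()
    finally:
        connection.close()

result = max_name_length_by_fruit(BIRDS)
print(result)  # [(1, 23), (2, 18)]
```

Each group of rows sharing a fruit value is reduced to a single output row, which is why every non-grouped column has to go through an aggregate function.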
bonus_cards

    name                   condition
    Passerine Specialist   wingspan ≤ 30
    Large Bird Specialist  wingspan > 65

    ↓
    SELECT * FROM birds, bonus_cards;

    Common Nightingale        european  23   Passerine Specialist   wingspan ≤ 30
    Sulphur-Crested Cockatoo  oceania   103  Large Bird Specialist  wingspan > 65

21 Cartesian products with restrictions are important enough to warrant their own syntax: [table1] JOIN [table2] ON [condition]

    SELECT * FROM birds JOIN bonus_cards
    ON wingspan <= 30 AND condition = 'wingspan ≤ 30'
    OR wingspan > 65 AND condition = 'wingspan > 65';

22 Joins come in several flavors:

• JOIN or INNER JOIN. Cartesian product followed by restriction.
• LEFT OUTER JOIN. Inner join followed by adding a single row for each row from the first table completely eliminated by the restriction. Those rows get NULL values for second-table fields.

    SELECT * FROM birds LEFT OUTER JOIN bonus_cards;

    common_name               set       wingspan  name                   condition
    Cedar Waxwing             core      25        Passerine Specialist   wingspan ≤ 30
    Ash-Throated Flycatcher   core      30        Passerine Specialist   wingspan ≤ 30
    Common Nightingale        european  23        Passerine Specialist   wingspan ≤ 30
    Sulphur-Crested Cockatoo  oceania   103       Large Bird Specialist  wingspan > 65
    American Robin            core      43        NULL                   NULL
    Southern Cassowary        oceania   NULL      NULL                   NULL

• RIGHT OUTER JOIN. Same but for eliminated rows from the second table.
• FULL OUTER JOIN. Same but for eliminated rows from either table.
• CROSS JOIN. Cartesian product with no restriction.
• NATURAL JOIN. Inner join on equality comparison of all pairs of identically named fields.

23 We can take a union of tuples in two relations (with the same field names) using the UNION operator. We can take a set difference using EXCEPT and the intersection using INTERSECT.

24 The syntax for a table literal is VALUES (row1), (row2), (row3); To add two rows manually:

    (SELECT common_name, "set" FROM birds)
    UNION
    (VALUES ('Western Tanager', 'core'),
            ('Scissor-Tailed Flycatcher', 'core'));

      wingspan = 0
    WHERE
      wingspan IS NULL;

3 To delete rows:

    DELETE FROM
      birds

• INTEGER/INT/INT4 signed four-byte integer
• DOUBLE PRECISION/FLOAT8 double precision floating-point number (8 bytes)
• REAL/FLOAT4 single precision floating-point number (4 bytes)
• BOOLEAN/BOOL logical Boolean (true/false)
• VARCHAR(n) variable-length character string (max n characters)
• TEXT variable-length character string
• DATE calendar date (year, month, day)
• MONEY currency amount
• NUMERIC [ (p, s) ] exact numeric of selectable precision
• TIMESTAMP date and time
• UUID universally unique identifier

3 To drop a table: DROP TABLE [table_name];

4 To remove all data from a table: TRUNCATE TABLE [table_name];

5 To add a column: ALTER TABLE [table_name] ADD [column_name column_type];

SQL Functions

1 Common SQL operators:

• AND, OR, NOT. Logical operators.
• <, >, <=, >=, =, <> (not equal). Comparison operators.
• IS NULL, IS NOT NULL. Null checks.
• LIKE, NOT LIKE. SQL-style pattern matching. Use _ for any single character and % for any sequence of zero or more characters. 'abc' LIKE '_b_' returns TRUE.
• ~, !~, ~*, !~*. Ordinary regular expression matching. ! for negation, * for case-insensitivity.

2 Arithmetic operators and functions:

    function                 description                  example                result
    mod(y, x)                remainder of y/x             mod(9,4)               1
    pi()                     π                            pi()                   3.14
    round(x)                 round to nearest integer     round(42.4)            42
    round(v, s)              round to s decimal places    round(42.4382, 2)      42.44
    sign(x)                  signum (-1, 0, +1)           sign(-8.4)             -1
    trunc(x)                 truncate toward zero         trunc(42.8)            42
    trunc(v, s)              truncate to s dec. places    trunc(42.4382, 2)      42.43
    width_bucket(x,b1,b2,n)  histogram bucket             width_bucket(1,-3,3,5) 4
    cos(x)                   cosine                       cos(1.05)              0.5
    acos(x)                  inverse cosine               acos(0.5)              1.05

    left(string, n)          first n characters           left('abcde',2)        ab
    lpad(string, n, char)    left pad                     lpad('5',3,'0')        005
    reverse(string)          reverse                      reverse('abc')         cba

SQL: Setup

1 Easiest way to create a free cloud Postgres instance: Go to supabase.io > Log in with GitHub > Create an Organization > Create a New Project > [wait a few minutes, and in the meantime add the line export DATABASE_PWD="your-pwd-here" to your bash profile] > Go into the new project > Settings (gear icon) > Database > Connection String (bottom) > PSQL > Copy.

2 macOS local installation: https://postgresapp.com/. Instructions on the landing page for finding your connection string. To install locally on Windows: https://www.postgresql.org/download/windows/.

3 To connect from a Python session, paste the connection string replacing [YOUR-PASSWORD] with {pwd}, like this:

    import os
    import sqlalchemy

    pwd = os.getenv("DATABASE_PWD")  # retrieve password set in your bash profile
    connection_string = (
        f"postgresql://postgres:{pwd}@"
        "db.bijsjfasiwdlfkjasdfot.supabase.co:5432/postgres"
    )  # should be your connection string instead
    engine = sqlalchemy.create_engine(connection_string)
    connection = engine.connect()
    sql = "SELECT * FROM pg_catalog.pg_tables LIMIT 10;"
    # wrap raw SQL in text(); SQLAlchemy 2.x no longer accepts plain strings here
    connection.execute(sqlalchemy.text(sql)).fetchall()

4 Create a new table in the database from a Pandas dataframe:

    import pandas as pd
    df = pd.read_csv("https://bit.ly/iris-dataset")
    df.to_sql("iris", con=engine)
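
To experiment with the join flavors described above without a Postgres instance, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (an assumption; SQLite's dialect is not identical to Postgres). The rows come from the birds and bonus_cards tables shown in this section. The run_join helper is hypothetical, the parentheses in the ON condition are added only for readability, and ORDER BY is added to make the row order deterministic.

```python
import sqlite3

# (common_name, "set", wingspan); None is stored as NULL.
BIRDS = [
    ("Cedar Waxwing", "core", 25),
    ("Ash-Throated Flycatcher", "core", 30),
    ("Common Nightingale", "european", 23),
    ("Sulphur-Crested Cockatoo", "oceania", 103),
    ("American Robin", "core", 43),
    ("Southern Cassowary", "oceania", None),
]
# (name, condition)
BONUS_CARDS = [
    ("Passerine Specialist", "wingspan ≤ 30"),
    ("Large Bird Specialist", "wingspan > 65"),
]

JOIN_CONDITION = """
    (wingspan <= 30 AND condition = 'wingspan ≤ 30')
    OR (wingspan > 65 AND condition = 'wingspan > 65')
"""

def run_join(join_keyword):
    """Join birds to bonus_cards with the given join flavor."""
    connection = sqlite3.connect(":memory:")
    try:
        # "set" is an SQL keyword, so the identifier is double-quoted.
        connection.execute(
            'CREATE TABLE birds (common_name TEXT, "set" TEXT, wingspan INTEGER)'
        )
        connection.execute("CREATE TABLE bonus_cards (name TEXT, condition TEXT)")
        connection.executemany("INSERT INTO birds VALUES (?, ?, ?)", BIRDS)
        connection.executemany("INSERT INTO bonus_cards VALUES (?, ?)", BONUS_CARDS)
        query = (
            f"SELECT common_name, name FROM birds {join_keyword} bonus_cards "
            f"ON {JOIN_CONDITION} ORDER BY common_name"
        )
        return connection.execute(query).fetchall()
    finally:
        connection.close()

inner_rows = run_join("JOIN")
left_rows = run_join("LEFT OUTER JOIN")

print(len(inner_rows), "rows from the inner join")
print(len(left_rows), "rows from the left outer join")
for row in left_rows:
    print(row)
```

The two birds that match no bonus card (American Robin, and Southern Cassowary with its NULL wingspan) appear only in the left outer join, with None in the second-table column.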
Document databases

1 Document databases store data in documents that are organized into collections. Each document is a dictionary of key-value pairs where the values may be numbers, strings, booleans, arrays, dictionaries, etc.

[Figure: two collections of JSON documents, Students and Standards. Each Students document has the keys "name" (e.g. "Rosalia Alexandra"), "id" (e.g. "B84222941"), "medals" (a dictionary with keys 1 through 4 and values like "gold" and "silver"), and "participation_scores" (an array of dictionaries like {"score": 10, "out_of": 10}). Each Standards document has the keys "id", "abbreviation" (e.g. "JULIA"), and "description" (e.g. "Write Julia code to solve simple algorithmic problems using conditionals, functions, arrays, dictionaries, and iteration.").]

2 Collections are analogous to relations in a relational database, while documents are analogous to rows.

3 Major differences from traditional relational databases:
• The values of a document may be nested (like an array of arrays of dictionaries, etc.).
• Designed to encourage storing data together which is accessed together,

• Geospatial. How data are situated geographically (points, lines, regions on a map).

4 A dimension variable is a variable which is used for grouping data. Usually categorical but can be continuous (like timestamps on a time series plot).

5 A measure variable is one that answers a "how much" question. Measure variables are the ones that make sense to aggregate (sum, average, count).

6 Dashboard tips:
• Context is king. Help the dashboard consumer appreciate the broader meaning of each number. Week over week changes and time series plots are helpful.
• Less is more. Dashboards that are too busy can be overwhelming. Put key performance indicators (KPIs) in big number charts in a prominent position.
• Use tables too. Not every chart has to be geometric. Tables are also a useful dashboard chart.
• Contrast. Ensure that your color scheme makes things easy to read (unlike the bottom left treemap).

7 Creating charts. Charts in Superset are produced by selecting a chart type and filling in its slots with names of variables from one of your SQL tables. For example, you supply the time column as well as the numeric column to plot on the vertical axis for a time series line plot. You can further customize by adding SQL query elements like WHERE clauses and grouping operations.

How programs run

7 In C, memory allocated on the heap must be explicitly freed when it is no longer needed by the program. In Python, the runtime identifies when the number of references to an object hits zero and frees the memory automatically.

Making Python fast

1 Interpreting code is slower than running compiled code. Therefore, performance-sensitive numeric computing in Python requires packages designed to address these shortcomings.

2 NumPy is the main such package. It provides multidimensional numeric arrays with operations that are implemented in an ahead-of-time (AOT) compiled C library and called from Python. Operations that can be conveniently vectorized are ideal for NumPy.

    import numpy as np
    sum(list(range(100_000))) # pure Python
    np.arange(100_000).sum() # NumPy; way faster

3 Numba provides JIT-compilation of select Python functions as a package within CPython. To use it, write a function involving numbers, booleans, and strings, using constructs like loops, conditionals, and NumPy arrays. Then call the function jit on that function:

    from numba import jit
    import numpy as np

    def f(x):
        while abs(x) > 1:
            x = x / 2
        return x
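
The gap between interpreted loops and compiled library code can be seen even without NumPy or Numba. The sketch below is an added illustration, not part of the course materials: it compares a hand-written Python loop with the built-in sum, which CPython implements in C. The function names and the repetition count are arbitrary choices, and the timings will vary by machine.

```python
import timeit

N = 100_000

def loop_sum(n):
    """Add up 0..n-1 with a loop executed step by step by the interpreter."""
    total = 0
    for k in range(n):
        total += k
    return total

def builtin_sum(n):
    """Same result, but the loop runs inside CPython's C implementation of sum."""
    return sum(range(n))

# Time 20 calls of each version; only the relative sizes are meaningful.
loop_seconds = timeit.timeit(lambda: loop_sum(N), number=20)
builtin_seconds = timeit.timeit(lambda: builtin_sum(N), number=20)

print(f"interpreted loop: {loop_seconds:.4f} s")
print(f"built-in sum:     {builtin_seconds:.4f} s")
```

On a typical CPython install the built-in version tends to finish sooner. Vectorized NumPy calls rely on the same idea of moving the loop out of the interpreter.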
DATA 1050 Cheatsheet · Samuel S. Watson
        len_p = 0
        n = 2
        while len_p < nb_primes:
            for i in p[:len_p]:
                if n % i == 0:
                    break
            else:
                p[len_p] = n
                len_p += 1
            n += 1

        result_as_list = [prime for prime in p[:len_p]]
        return result_as_list

[Figure: stack and heap during a call to fib. The stack holds a frame for fib and a frame for populate, each with a variable a referring to the address 0x0f1bb989. The heap holds the array stored at 0x0f1bb989, with contents 1 1 2 3 5 8 13 21 34 ...]

    def populate(a):
        for i in range(2, len(a)):
            a[i] = a[i-1] + a[i-2]

    def fib():
        a = np.ones(50, int)
        populate(a)
        return a
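
The diagram's point is that fib and populate each hold a reference to the same heap object rather than separate copies, and this can be checked directly. The sketch below uses a plain Python list in place of the NumPy array (a substitution made here to keep it dependency-free) and records id(a) inside both functions. The seen_ids list and the parameter n are hypothetical additions for the demonstration.

```python
seen_ids = []  # hypothetical helper: identity of `a` as observed in each frame

def populate(a):
    seen_ids.append(id(a))          # identity of the object as seen by populate
    for i in range(2, len(a)):
        a[i] = a[i - 1] + a[i - 2]  # mutates the caller's object in place

def fib(n=10):
    a = [1] * n                     # list stand-in for np.ones(n, int)
    seen_ids.append(id(a))          # identity of the object as seen by fib
    populate(a)                     # passes a reference, not a copy
    return a

values = fib()
print(values)                       # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print(seen_ids[0] == seen_ids[1])   # True: both frames refer to one object
```

Because populate receives a reference, it does not need to return anything: its writes are visible through fib's variable a.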
3 Types of charts:
• Time series. How data changes over time (line charts, time-series bar charts).
• Composition. How totals break down by category (pie charts, bar charts, tree maps).
• Distribution. How variables are distributed on the number line (histograms, box plots, horizon charts) or how two or more variables are distributed jointly (pivot tables, heatmaps, bubble charts).