
DATA 1050 Cheatsheet

1. All data stored on a computer is ultimately a sequence of bits that are given meaning based on specifications. Common data types include text, documents, images, video, audio, and structured data like XML and JSON. 2. Computers work by loading program instructions and data into memory (RAM) as sequences of bytes that are read and processed by the CPU according to the program instructions. The CPU reads and writes bytes to memory and other components via buses. 3. Programming languages have tools like Jupyter notebooks that make development more efficient by combining code and documentation in a single interactive environment. Common commands allow manipulating and searching files from the shell.

Uploaded by

Duong Dinh Khanh
Copyright
© All Rights Reserved

DATA 1050 Cheatsheet · Samuel S. Watson

Data

1 All data stored on a computer is ultimately a sequence of bits (0/1). These bits are endowed with meaning based on a specification.

2 Types of data include plain text, documents, images, video, audio, tabular data, and many others.

3 Common hierarchical (nested) data formats include XML and JSON:

XML:
    <svg version="1.1"
         baseProfile="full"
         width="300" height="200"
         xmlns="http://www.w3.org/2000/svg">
      <rect width="100%" height="100%" fill="red" />
      <circle cx="150" cy="100" r="80" fill="green" />
      <text x="150" y="125" font-size="60" fill="white">SVG</text>
    </svg>

JSON:
    {
      "layout": {
        "showlegend": false,
        "xaxis": {
          "range": [
            0.73,
            10.27
          ],
          "domain": [
            0.03619130941965587,
            0.9934383202099738
          ],
          "linecolor": "rgba(0, 0, 0, 1.000)",
          "tickcolor": "rgb(0, 0, 0)",
          "tickfont": {
            "color": "rgba(0, 0, 0, 1.000)",
            "size": 11
          }
        }
      }
    }

4 Tabular data formats include CSV and Parquet. CSV is a plain text format that uses commas to separate entries and newlines to separate rows. Parquet is a binary format (looks like gibberish if you interpret the bits of the file as plain text) which is faster and more space efficient.

Data Systems

1 Organizations use a wide variety of technologies to manage their data. Organizations’ concerns around data include how and where to store the data, how to access the data, how to perform calculations on the data, how to process the data and how and when to cache intermediate results, how to display the data to make it actionable, and many others.

2 Databases are used to store structured data, because they are designed to provide guarantees around data integrity and to provide rich access to the data.

3 Bucket storage in the cloud is useful for files which are large and are not structured enough to go in a database (e.g., image files, video files, PDF files).

4 A data warehouse is a data system in an organization which is highly structured and carefully curated. A data lake is a central but less structured and/or less curated repository of data collected by the organization.

5 Data ingestion, storage, cleaning, analytics, and UIs are often related in complex ways (not a simple pipeline):

[Figure: diagram linking Data Ingestion, Data Storage, Data Cleaning, Data Analytics, and User Interfaces]

How computers work

1 Main computer components: CPU (processing), RAM (temporary storage), hard drive (persistent storage).

2 RAM consists of a sequence of 8-bit chunks called bytes. The index of each byte is its address.

3 The computer executes a program by loading its bytes into consecutive addresses in RAM and then reading the bytes in sequence. The CPU may read data stored at an address by activating the enable wire while putting the bits of that address on the address bus, or write data by using the set wire. The bits are read or written via the data bus.

[Figure: CPU connected to RAM by an address bus, a data bus, an enable wire, and a set wire. RAM is drawn as address/byte pairs: 00000000 01011000, 00000001 10101010, 00000010 11011011, 00000011 10000111, 00000100 01000000, 00000101 11011011]

4 Bytes (or chunks of bytes) may represent CPU instructions, data (like integers, floats, or characters), or RAM addresses. Some instructions can tell the CPU to jump to a different location in RAM and continue reading bytes from there.

5 CPU operations are synchronized by a clock generator, which fires about a billion times a second. Machine integer operations can be executed in 1-3 clock cycles, while more complex operations (like floating point division) can take more like 30 cycles.

6 When you write code in a compiled language (like C, C++, Rust, Go, Haskell, OCaml, etc.), you create an executable file to be directly executed by the computer. For programs written in Python, the executable is not the program you wrote but the Python runtime system. The Python runtime interprets your code and changes the way that it executes accordingly. Other languages that use runtimes include Julia, R, Java, C#, and Javascript.

7 Many languages (Julia, Java, C#, Javascript, et al.) compile parts of your code to machine code as the program executes; this is called just-in-time compilation. Neither Python nor R is JIT-compiled unless you’re using a package for that purpose (like Numba) or a non-standard interpreter (like PyPy).

8 Interpreting code is typically much slower than executing compiled code (typically 5x-30x). Python, R, and MATLAB manage reasonable performance by connecting to compiled libraries (usually written in C, C++, or Fortran) for compute-intensive tasks. This is why vectorization is an important performance technique in these languages.

The shell

1 Bytes stored on the hard drive are organized into files. Files are organized hierarchically into an arbitrarily nested collection of directories (also known as folders).

2 The operating system customarily handles each file according to its file type, which is customarily indicated by its file extension (like .pdf in resume.pdf).

3 You can interact programmatically with your file system using a program called a shell. On Unix or macOS, the shell is bash or a close relative.

4 Important shell commands include
• pwd - print the current working directory
• cd - change current working directory
• ls - list the contents of the current directory
• tree - show contents of current working directory (recursively)
• cat - print the contents of a file
• head - print the first so many lines or characters of a file
• mv - move a file
• cp - copy a file
• touch - create a file or update its last-modified time
• curl - make a request to a URL
• wc - count words/lines/characters in a file
• grep - search for text in file contents
• code - open a file in VS Code

5 Set Bash variables like MY_FAVORITE_NUMBER=3. You can access a variable with a dollar sign, like echo $MY_FAVORITE_NUMBER.

6 Add the line
    export PATH="/Users/jovyan/anaconda3/bin:$PATH"
to your ~/.bash_profile file to add /Users/jovyan/anaconda3/bin to your PATH variable (if you want to be able to execute programs in that folder by name from the command line).

7 Piping. The output of a command like echo $PATH, which prints to the screen by default, may be redirected to a file using the operators > or >> or fed as input to another bash command on the same line using the pipe operator |.

8 Glob patterns. You can perform an action on many files by including an asterisk in the file name. For example, mv img*.png frames/ moves every file in the current directory whose name starts with img and ends with png into the frames subdirectory of the current directory.

Using Python

1 Tooling for a programming language refers to anything we can use to make the development experience more pleasant (more efficient, more interactive, less uncertain, etc.).

2 Jupyter is a popular development environment which provides researchers with tools for combining exposition and code into a single document called a Jupyter notebook. Under the hood, the file contents of a Jupyter notebook are a JSON string.

3 Jupyter supports many magic commands which are not part of the Python language but which allow us to do various convenient things. For example, the %%sql magic causes the contents of the cell to be interpreted as SQL code.

4 Jupyter has an edit mode for entering text in cells and a command mode for manipulating cells (for example, merging or deleting cells). If there’s a blinking cursor in a cell, the current mode is edit, and otherwise the current mode is command. Switching between modes is accomplished with the escape key (edit to command mode) and the enter key (command to edit mode).

5 Jupyter has many keyboard shortcuts which are worth learning. Cells are deleted in command mode with two strokes of the d key. You can highlight cells in command mode by holding shift and using your arrow keys, and you can merge the highlighted cells into a single cell using shift-m. Insertion of new cells is accomplished with either a (insert cell above) or b (insert cell below) in command mode. Cells can be switched between Markdown (m) and code (y) in command mode.

6 VS Code is a text editor with many features and extensions to support development in many languages, including Python. It has better support than Jupyter for working with multiple files, debugging (stepping through code), refactoring (changing the structure of your code), and version control.

7 You can do nearly everything in VS Code through the command palette (command+shift+p). Start typing words relevant to what you want to do and select the desired option.

8 Install the Python and Jupyter extensions from the Marketplace (left sidebar), and you can execute Python code (shift+enter, in a .py file), inspect variables (in the Jupyter panel that opens when you execute code), autocomplete variable names, debug (place a red dot in the gutter and then click the bug icon in the sidebar), and run your pytest tests.
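Item 2 of Using Python says that a Jupyter notebook file is a JSON string. The sketch below illustrates that idea with a hand-built, stripped-down notebook; the cell contents and the variable names (notebook, text, parsed, cell_types) are made up for illustration, and real notebooks carry more metadata than this.

```python
import json

# A hypothetical, minimal notebook: one Markdown cell and one code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Serialize to the JSON text that would be stored in a .ipynb file...
text = json.dumps(notebook, indent=1)

# ...and parse it back, as any tool reading the file would.
parsed = json.loads(text)
cell_types = [cell["cell_type"] for cell in parsed["cells"]]
print(cell_types)
```

Because the file is ordinary JSON, the shell tools listed above (cat, head, grep) also work on notebooks.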
Version Control with Git

1 Git is the main software that developers use to version control their code.

2 It works using a combination of a command line program (git) and a folder called .git in the top-level directory of each project being version controlled.

3 You create a new repo by doing git init in the desired directory. Then create a file, stage it by doing git add --all, and create your initial commit with git commit -m "initial commit".

4 Your version history consists of a collection of commits (snapshots of your project directory) which are connected via parent-child relationships.

5 Your changes go through a sequence of zones: files in your working directory are initially untracked by Git. Then you stage them with git add to prepare a tidy commit. Then you create a new commit in your version history with git commit -m "commit message". Lastly, you update GitHub’s copy of your version history with git push.

[Figure: working directory → (stage) → staging area → (commit) → local repository → (push) → remote repository]

6 You will receive code to set your remote repository to a particular repo on GitHub when you create that repo on GitHub. You can see the current remote URL with git remote -v.

7 A branch is a pointer to a particular commit. You start a new line of work by creating a new branch that points to the commit you want to start from, applying the desired changes, and making new commits.

8 Checking out a branch sets the state of your working directory to the state of the commit that the branch points to. To preserve any unsaved work in your working directory, do a git stash. Put that work back into your working directory later with git stash apply. You will also want to stash when you git pull to get the latest copy of your code from GitHub.

9 You can merge a branch into yours to bring in that branch’s changes (the ones added since the most recent common ancestor). Here’s what it looks like if we merge theirbranch into main:

[Figure: commit graph before and after the merge. Commits shown: initial commit, previous commit, shared parent, my commit, my second commit, their commit. Before the merge, main points to "my second commit" and theirbranch points to "their commit". After the merge, main points to a new merge commit.]

Relational data

1 A relation is a set of named tuples (with a common set of names) and can be visualized as a table with column headers. The relational data model represents data as a collection of relations.

2 Relational algebra is a collection of mathematical operations that may be performed on relations:
• Projection. Subsetting columns.
• Restriction. Subsetting rows based on a condition.
• Cartesian product. Forming every possible concatenation of a tuple from one relation with a tuple from a second relation.
• Sorting. Ordering tuples according to a condition.
• Grouping and aggregation. Applying an aggregation function to the values in a column, potentially after grouping the tuples in the relation (partitioning them according to a condition).
• Renaming. Changing the name of one of the fields in a relation (changing a column header, essentially).

SQL Queries (PostgreSQL)

1 SQL (Structured Query Language) is the standard language for performing the relational algebra operations on tables stored in a relational database.

2 SQL is declarative, meaning that we express the result we want to obtain, not the steps the system is supposed to take to achieve that result.

3 SQL input consists of a sequence of commands. A command is composed of a sequence of tokens and is terminated by a semicolon.

4 A token can be a keyword, an identifier, a literal, or a special character symbol. Tokens are separated by whitespace.

5 Keywords are reserved words in the language with special meaning. In the statement SELECT * FROM birds;, both SELECT and FROM are keywords.

6 Identifiers specify tables, columns, or other database objects (depending on context). birds is an identifier which specifies which table we’re selecting from.

7 Identifiers may be surrounded by double quotes to ensure they are not interpreted as keywords and to allow them to use otherwise disallowed characters (like whitespace).

8 String literals in SQL are enclosed in single quotes. Numeric literals can be entered like 4, 3.2, or 1.925e-3.

9 Queries use the SELECT keyword. The basic structure of a SELECT statement is
    SELECT [select_list] FROM [table_expression] [sort_specification];
The table expression is evaluated and then passed to the select list. The sort specification (if present) then processes the resulting rows before they are returned.

10 The table expression is an expression that returns a table, like a table name or another SELECT statement enclosed in parentheses.

11 The select list is a comma-separated list of value expressions, which may consist of column identifiers, constant literals, or expressions involving function calls and operators. In this context, the asterisk is a special character meaning "all columns".

12 Each value expression may be assigned a specific name using the AS keyword.
    SELECT
      common_name,
      LENGTH(common_name) AS name_length,
      victory_points + egg_capacity AS total_points
    FROM
      birds;

    common_name               victory_points   egg_capacity
    American Robin            1                4
    Cedar Waxwing             3                3
    Ash-Throated Flycatcher   4                4
    Southern Cassowary        4                4
    Common Nightingale        3                4
    ↓
    common_name               name_length   total_points
    American Robin            14            5
    Cedar Waxwing             13            6
    Ash-Throated Flycatcher   23            8
    Southern Cassowary        18            8
    Common Nightingale        18            7

13 The table expression may be modified by further clauses indicated by keywords like WHERE or GROUP BY or HAVING.

14 The sort specification is a clause of the form ORDER BY [value_expression] [ASC|DESC], where the value indicated by the value expression is evaluated for each row and used to perform the sort:
    SELECT
      *
    FROM
      birds
    WHERE
      "set" = 'core' AND wingspan > 25
    ORDER BY
      wingspan DESC;

    common_name               set        wingspan
    American Robin            core       43
    Cedar Waxwing             core       25
    Ash-Throated Flycatcher   core       30
    Southern Cassowary        oceania    NULL
    Common Nightingale        european   23
    ↓
    common_name               set    wingspan
    American Robin            core   43
    Ash-Throated Flycatcher   core   30

15 Grouping by a value expression partitions the tuples in a relation into groups of equal value. If the table expression in a SELECT statement has been grouped, then each entry in the select list must be either a value that was grouped on or a call to an aggregate function (like SUM, AVG, MAX, MIN, or COUNT, which reduces a column of values to a single value).
    SELECT fruit,
      MAX(LENGTH(common_name)) AS max_name_length
    FROM birds
    GROUP BY fruit;

    common_name               fruit
    American Robin            1
    Cedar Waxwing             2
    Ash-Throated Flycatcher   1
    Southern Cassowary        2
    Common Nightingale        1
    ↓
    common_name               fruit
    American Robin            1
    Ash-Throated Flycatcher   1
    Common Nightingale        1
    Cedar Waxwing             2
    Southern Cassowary        2
    ↓
    fruit   max_name_length
    1       23
    2       18

16 Filter results from a grouped and aggregated relation using a HAVING clause.

17 Use LIMIT [limit] OFFSET [offset] after an ORDER BY clause to return at most limit records beginning at index offset.

18 Name a temporary table using WITH. Example: select every card from whichever expansion set has the largest average egg capacity:
    WITH set_eggs AS (
      SELECT "set",
        AVG(egg_capacity) AS avg_eggs
      FROM birds
      GROUP BY "set"
      ORDER BY avg_eggs DESC LIMIT 1
    )
    SELECT * FROM birds
    WHERE "set" IN (SELECT "set" FROM set_eggs);
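The filtering, sorting, and grouping clauses from SQL Queries (items 13 to 15) can be tried without a database server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module and an in-memory SQLite database rather than PostgreSQL, and it loads the five example birds from the tables in this section. The variable names (conn, sorted_rows, grouped_rows) are illustrative only.

```python
import sqlite3

# In-memory SQLite database (a stand-in for PostgreSQL in this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE birds (common_name TEXT PRIMARY KEY, "set" TEXT, wingspan INTEGER)'
)
conn.executemany(
    "INSERT INTO birds VALUES (?, ?, ?)",
    [
        ("American Robin", "core", 43),
        ("Cedar Waxwing", "core", 25),
        ("Ash-Throated Flycatcher", "core", 30),
        ("Southern Cassowary", "oceania", None),  # None is stored as NULL
        ("Common Nightingale", "european", 23),
    ],
)

# Restriction (WHERE) followed by sorting (ORDER BY), as in item 14.
sorted_rows = conn.execute(
    """
    SELECT common_name, wingspan FROM birds
    WHERE "set" = 'core' AND wingspan > 25
    ORDER BY wingspan DESC;
    """
).fetchall()

# Grouping and aggregation: one output row per expansion set.
grouped_rows = conn.execute(
    """
    SELECT "set", COUNT(*) AS n_birds, MAX(wingspan) AS max_wingspan
    FROM birds
    GROUP BY "set"
    ORDER BY "set";
    """
).fetchall()

print(sorted_rows)
print(grouped_rows)
conn.close()
```

Note that "set" must stay double-quoted in both dialects because SET is a keyword, which is the situation item 7 describes.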
19 A comma-separated list of two relations denotes their Cartesian product. To look at every (bird card, bonus card) combination:

    birds
    common_name                set        wingspan
    American Robin             core       43
    Cedar Waxwing              core       25
    Ash-Throated Flycatcher    core       30
    Southern Cassowary         oceania    NULL
    Common Nightingale         european   23
    Sulphur-Crested Cockatoo   oceania    103

    bonus_cards
    name                    condition
    Passerine Specialist    wingspan ≤ 30
    Large Bird Specialist   wingspan > 65
    ↓
    SELECT * FROM birds, bonus_cards;

    common_name                set        wingspan   name                    condition
    American Robin             core       43         Passerine Specialist    wingspan ≤ 30
    Cedar Waxwing              core       25         Passerine Specialist    wingspan ≤ 30
    Ash-Throated Flycatcher    core       30         Passerine Specialist    wingspan ≤ 30
    Southern Cassowary         oceania    NULL       Passerine Specialist    wingspan ≤ 30
    Common Nightingale         european   23         Passerine Specialist    wingspan ≤ 30
    Sulphur-Crested Cockatoo   oceania    103        Passerine Specialist    wingspan ≤ 30
    American Robin             core       43         Large Bird Specialist   wingspan > 65
    Cedar Waxwing              core       25         Large Bird Specialist   wingspan > 65
    Ash-Throated Flycatcher    core       30         Large Bird Specialist   wingspan > 65
    Southern Cassowary         oceania    NULL       Large Bird Specialist   wingspan > 65
    Common Nightingale         european   23         Large Bird Specialist   wingspan > 65
    Sulphur-Crested Cockatoo   oceania    103        Large Bird Specialist   wingspan > 65

20 Cartesian products are usually combined with a WHERE clause. To find which (bird, bonus card) combinations actually yield bonuses:
    SELECT * FROM birds, bonus_cards
    WHERE wingspan <= 30 AND condition = 'wingspan ≤ 30'
    OR wingspan > 65 AND condition = 'wingspan > 65';

    common_name                set        wingspan   name                    condition
    Cedar Waxwing              core       25         Passerine Specialist    wingspan ≤ 30
    Ash-Throated Flycatcher    core       30         Passerine Specialist    wingspan ≤ 30
    Common Nightingale         european   23         Passerine Specialist    wingspan ≤ 30
    Sulphur-Crested Cockatoo   oceania    103        Large Bird Specialist   wingspan > 65

21 Cartesian products with restrictions are important enough to warrant their own syntax: [table1] JOIN [table2] ON [condition]
    SELECT * FROM birds JOIN bonus_cards
    ON wingspan <= 30 AND condition = 'wingspan ≤ 30'
    OR wingspan > 65 AND condition = 'wingspan > 65';

22 Joins come in several flavors:
• JOIN or INNER JOIN. Cartesian product followed by restriction.
• LEFT OUTER JOIN. Inner join followed by adding a single row for each row from the first table completely eliminated by the restriction. Those rows get NULL values for second-table fields.
    SELECT * FROM birds LEFT OUTER JOIN bonus_cards
    ON wingspan <= 30 AND condition = 'wingspan ≤ 30'
    OR wingspan > 65 AND condition = 'wingspan > 65';

    common_name                set        wingspan   name                    condition
    Cedar Waxwing              core       25         Passerine Specialist    wingspan ≤ 30
    Ash-Throated Flycatcher    core       30         Passerine Specialist    wingspan ≤ 30
    Common Nightingale         european   23         Passerine Specialist    wingspan ≤ 30
    Sulphur-Crested Cockatoo   oceania    103        Large Bird Specialist   wingspan > 65
    American Robin             core       43         NULL                    NULL
    Southern Cassowary         oceania    NULL       NULL                    NULL
• RIGHT OUTER JOIN. Same but for eliminated rows from the second table.
• FULL OUTER JOIN. Same but for eliminated rows from either table.
• CROSS JOIN. Cartesian product with no restriction.
• NATURAL JOIN. Inner join on equality comparison of all pairs of identically named fields.

23 We can take a union of tuples in two relations (with the same field names) using the UNION operator. We can take a set difference using EXCEPT and the intersection using INTERSECT.

24 The syntax for a table literal is VALUES (row1), (row2), (row3); To add two rows manually:
    (SELECT common_name, "set" FROM birds)
    UNION
    (VALUES ('Western Tanager', 'core'),
    ('Scissor-Tailed Flycatcher', 'core'));

SQL: Modifying Data

1 To add rows to a database:
    INSERT INTO
      birds(common_name, "set")
    VALUES
      ('Western Tanager', 'core'),
      ('Scissor-Tailed Flycatcher', 'core');

2 To update rows in a database:
    UPDATE
      birds
    SET
      wingspan = 0
    WHERE
      wingspan IS NULL;

3 To delete rows:
    DELETE FROM
      birds
    WHERE
      "set" NOT IN ('core', 'oceania', 'european');

SQL: Managing Tables

1 Creating a new table. To make a new table called birds with a text field common_name which will be used as a primary key, a text field set which is a foreign key for the name column in another table called expansions, and an integer field wingspan which should not be allowed to be negative:
    CREATE TABLE birds (
      common_name TEXT PRIMARY KEY,
      "set" TEXT REFERENCES expansions(name),
      wingspan INTEGER CHECK (wingspan >= 0)
    );

2 PostgreSQL data types include:
• BIGINT/INT8               signed eight-byte integer
• INTEGER/INT/INT4          signed four-byte integer
• DOUBLE PRECISION/FLOAT8   double precision floating-point number (8 bytes)
• REAL/FLOAT4               single precision floating-point number (4 bytes)
• BOOLEAN/BOOL              logical Boolean (true/false)
• VARCHAR(n)                variable-length character string (max n characters)
• TEXT                      variable-length character string
• DATE                      calendar date (year, month, day)
• MONEY                     currency amount
• NUMERIC [ (p, s) ]        exact numeric of selectable precision
• TIMESTAMP                 date and time
• UUID                      universally unique identifier

3 To drop a table: DROP TABLE [table_name];

4 To remove all data from a table: TRUNCATE TABLE [table_name];

5 To add a column: ALTER TABLE [table_name] ADD [column_name column_type];

SQL Functions

1 Common SQL operators:
• AND, OR, NOT. Logical operators.
• <, >, <=, >=, =, <> (not equal). Comparison operators.
• IS NULL, IS NOT NULL. Null checks.
• LIKE, NOT LIKE. SQL-style pattern matching. Use _ for any single character and % for any sequence of zero or more characters. 'abc' LIKE '_b_' returns TRUE.
• ~, !~, ~*, !~*. Ordinary regular expression matching. ! for negation, * for case-insensitivity.

2 Arithmetic operators and functions:

    Operator or function      Name                           Example                 Result
    +                         addition                       2 + 3                   5
    -                         subtraction                    2 - 3                   -1
    *                         multiplication                 2 * 3                   6
    /                         division                       4 / 2                   2
    %                         modulo (remainder)             5 % 4                   1
    ^                         exponentiation                 2.0 ^ 3.0               8
    |/                        square root                    |/ 25.0                 5
    !                         factorial                      5 !                     120
    @                         absolute value                 @ -5.0                  5
    abs(x)                    absolute value                 abs(-17.4)              17.4
    ceil(x)                   least integer not less than x  ceil(-42.8)             -42
    div(y, x)                 integer quotient               div(9,4)                2
    exp(x)                    exponential                    exp(1.0)                2.718
    floor(x)                  greatest integer not above x   floor(-42.8)            -43
    ln(x)                     natural logarithm              ln(2.0)                 0.693
    log(x)                    base 10 logarithm              log(100.0)              2
    log(b, x)                 logarithm to base b            log(2.0, 64.0)          6.0
    mod(y, x)                 remainder of y/x               mod(9,4)                1
    pi()                      π                              pi()                    3.14
    round(x)                  round to nearest integer       round(42.4)             42
    round(v, s)               round to s decimal places      round(42.4382, 2)       42.44
    sign(x)                   signum (-1, 0, +1)             sign(-8.4)              -1
    trunc(x)                  truncate toward zero           trunc(42.8)             42
    trunc(v, s)               truncate to s dec. places      trunc(42.4382, 2)       42.43
    width_bucket(x,b1,b2,n)   histogram bucket               width_bucket(1,-3,3,5)  4
    cos(x)                    cosine                         cos(1.05)               0.5
    acos(x)                   inverse cosine                 acos(0.5)               1.05

3 String operators and functions:

    Operator or function                                         Name                                  Example                                        Result
    string || string                                             String concatenation                  'Post' || 'greSQL'                             PostgreSQL
    lower(string)                                                Convert string to lower case          lower('TOM')                                   tom
    overlay(string placing string from int [for int])            Replace substring                     overlay('Txxxxas' placing 'hom' from 2 for 4)  Thomas
    position(substring in string)                                Location of specified substring       position('om' in 'Thomas')                     3
    substring(string [from int] [for int])                       Extract substring                     substring('Thomas' from 2 for 3)               hom
    substring(string from pattern)                               Extract substring matching pattern    substring('Thomas' from '...$')                mas
    trim([leading | trailing | both] [characters] from string)   Remove characters from ends           trim(both 'x' from 'xTomxx')                   Tom
    upper(string)                                                Convert string to upper case          upper('Tom')                                   TOM
    left(string, n)                                              first n characters                    left('abcde',2)                                ab
    lpad(string, n, char)                                        left pad                              lpad('5',3,'0')                                005
    reverse(string)                                              reverse                               reverse('abc')                                 cba

SQL: Setup

1 Easiest way to create a free cloud Postgres instance: Go to supabase.io > Log in with GitHub > Create an Organization > Create a New Project > [wait a few minutes, and in the meantime add the line export DATABASE_PWD="your-pwd-here" to your bash profile] > Go into the new project > Settings (gear icon) > Database > Connection String (bottom) > PSQL > Copy.

2 macOS local installation: https://postgresapp.com/. Instructions on the landing page for finding your connection string. To install locally on Windows: https://www.postgresql.org/download/windows/.

3 To connect from a Python session, paste the connection string replacing [YOUR-PASSWORD] with {pwd}, like this:
    import os
    import sqlalchemy

    # retrieve the password exported in your bash profile
    pwd = os.getenv("DATABASE_PWD")
    connection_string = (
        f"postgresql://postgres:{pwd}@"
        "db.bijsjfasiwdlfkjasdfot.supabase.co:5432/postgres"
    )  # should be your connection string instead
    engine = sqlalchemy.create_engine(connection_string)
    connection = engine.connect()
    sql = "SELECT * FROM pg_catalog.pg_tables LIMIT 10;"
    # wrap the raw SQL string in text() so recent SQLAlchemy versions accept it
    connection.execute(sqlalchemy.text(sql)).fetchall()

4 Create a new table in the database from a Pandas dataframe:
    import pandas as pd
    df = pd.read_csv("https://bit.ly/iris-dataset")
    df.to_sql("iris", con=engine)
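The difference between an inner join and a LEFT OUTER JOIN (SQL Queries, items 21 and 22) can be checked locally. This is a minimal sketch under stated assumptions: it swaps PostgreSQL for an in-memory SQLite database via Python's built-in sqlite3 module, reuses the birds and bonus_cards example rows, and introduces illustrative names (conn, join_condition, inner_rows, left_rows) that are not part of the cheatsheet.

```python
import sqlite3

# In-memory SQLite database (a stand-in for PostgreSQL in this sketch).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE birds (common_name TEXT, "set" TEXT, wingspan INTEGER)')
conn.execute('CREATE TABLE bonus_cards (name TEXT, "condition" TEXT)')
conn.executemany(
    "INSERT INTO birds VALUES (?, ?, ?)",
    [
        ("American Robin", "core", 43),
        ("Cedar Waxwing", "core", 25),
        ("Ash-Throated Flycatcher", "core", 30),
        ("Southern Cassowary", "oceania", None),  # NULL wingspan
        ("Common Nightingale", "european", 23),
        ("Sulphur-Crested Cockatoo", "oceania", 103),
    ],
)
conn.executemany(
    "INSERT INTO bonus_cards VALUES (?, ?)",
    [
        ("Passerine Specialist", "wingspan ≤ 30"),
        ("Large Bird Specialist", "wingspan > 65"),
    ],
)

# The same restriction is used for both joins.
join_condition = """
    (wingspan <= 30 AND "condition" = 'wingspan ≤ 30')
    OR (wingspan > 65 AND "condition" = 'wingspan > 65')
"""

# Inner join: only (bird, bonus card) pairs that satisfy the condition.
inner_rows = conn.execute(
    f"SELECT common_name, name FROM birds JOIN bonus_cards ON {join_condition} "
    "ORDER BY common_name;"
).fetchall()

# Left outer join: unmatched birds are kept, with NULL (None) for the card name.
left_rows = conn.execute(
    f"SELECT common_name, name FROM birds LEFT OUTER JOIN bonus_cards ON {join_condition} "
    "ORDER BY common_name;"
).fetchall()

print(len(inner_rows), "inner rows;", len(left_rows), "left outer rows")
conn.close()
```

The Southern Cassowary has a NULL wingspan, so both comparisons evaluate to NULL rather than TRUE; it never matches a bonus card and shows up only in the outer join.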
Document databases
• Geospatial. How data are situated geographically (points, lines, regions 7 In C, memory allocated on the heap must be explicitly freed when it is no longer
1 Document databases store data in documents that are organized into collec- on a map). needed by the program. In Python, the runtime identifies when the number of ref-
tions. Each document is a set of key-value dictionary where the values may be erences to an object hits zero and frees the memory automatically.
4 A dimension variable is a variable which is used for grouping data. Usually
numbers, strings, booleans, arrays, dictionaries, etc.
categorical but can be continuous (like timestamps on a time series plot).
5 A measure variable which is one that answers a ”how much” question. Measure Making Python fast
{
variables are the ones that make sense to aggregate (sum, average, count).
{"name": "Rosalia Alexandra",
{"name": "Rosalia Alexandra", 1 Interpreting code is slower than running compiled code. Therefore,
"id":
{"name":
"id":
"B84222941",
{"name":
"medals":
"Rosalia Alexandra",
"B84222941",
{ "Rosalia Alexandra",
{
{"id": 1, 6 Dashboard tips:
"id":
{"name":
"medals":
1: "id":
"B84222941",
"gold",
{"name":
"medals":
{ "Rosalia Alexandra",
"B84222941",
{ "Rosalia Alexandra",
{"id": 1,
"abbreviation":
{"id": 1, "JULIA", performance-sensitive numeric computing in Python requires packages designed
2:1: "id":
1:
"gold",
{"name":
"medals":
"gold", "B84222941",
{"name":
"id":
"gold", { "Rosalia Alexandra",
"B84222941",
"abbreviation":
{"id": 1,
"description":
{"id": 1, "Write
"abbreviation":
"JULIA",
Julia code
"JULIA",
3:2:2:
"medals":
"gold",
"silver",
1: "id":
"gold",
"name":
"medals":
"gold",
{ "Rosalia Alexandra",
"B84222941",
{"Rosalia Alexandra",
"description":
to "abbreviation":
solve
{"id":simple
"description":
{"id":
1,
"Write Julia code
algorithmic
"JULIA",
"Write Julia code • Context is king. Help the dashboard consumer appreciate the broader to address these shortcomings.
4:3:3: "silver",
1:
"gold"
2: "id":
"gold",
"medals":
"gold",
"id":
"silver",
1:
"B84222941",
{
"B84222941",
"gold",
to "abbreviation":
solve
problems simple
"description":
to using algorithmic
"JULIA",
1, conditionals,
"abbreviation":
solve simple "Write Julia code
algorithmic
"JULIA",
4: "gold"
2: "medals":
"gold",
"gold",{ problems "index":
"description":
using 1,
conditionals,
"Write Julia code
}, 3:"gold"
}, 4:4:3:
"silver",
2:1:
"medals":
"gold", { functions,
to"description":
"abbreviation":
solvearrays,
problems simple
using dictionaries,
algorithmic
"JULIA",
conditionals,
"Write Julia code meaning of each number. Week over week changes and time series plots
},
"silver",
4:3:3:
1: "gold",
"participation_scores":
"gold"
2: "gold",
1: "gold",
"gold",
"silver",
"participation_scores":
"gold"
2:
[
[10},
functions,
and "id":
problems
iteration."
functions,
and
"JULIA",
to"description":
solve arrays,
simple
using
[Figure: example JSON documents for the Standards and Students collections, with fields
including "description" ("Write Julia code to solve simple algorithmic problems using
functions, arrays, dictionaries, conditionals, and iteration."), "participation_scores"
(a list of {"score": ..., "out_of": ...} entries), and the values "gold" and "silver".]
are helpful.
• Less is more. Dashboards that are too busy can be overwhelming. Put key
performance indicators (KPIs) in big number charts in a prominent position.
• Use tables too. Not every chart has to be geometric. Tables are also a useful
dashboard chart.
2 NumPy is the main such package. It provides multidimensional numeric arrays
with operations that are implemented in an AOT-compiled C library and called from
Python. Operations that can be conveniently vectorized are ideal for NumPy.
import numpy as np
sum(list(range(100_000))) # pure Python
np.arange(100_000).sum() # NumPy; way faster
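To make the "way faster" claim measurable, here is a minimal timing sketch. The function names, the repetition count, and the explicit int64 dtype are assumptions added for illustration; the absolute timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

N = 100_000


def python_sum():
    # Pure Python: the interpreter adds one integer object at a time.
    return sum(range(N))


def numpy_sum():
    # NumPy: the loop runs inside compiled C code over a packed int64 array.
    return np.arange(N, dtype=np.int64).sum()


# Both approaches must agree on the answer.
assert python_sum() == int(numpy_sum())

# Time 100 repetitions of each; absolute numbers depend on the machine.
python_seconds = timeit.timeit(python_sum, number=100)
numpy_seconds = timeit.timeit(numpy_sum, number=100)
print(f"pure Python: {python_seconds:.4f} s, NumPy: {numpy_seconds:.4f} s")
```

The dtype is pinned to int64 so the sum cannot overflow on platforms where NumPy's default integer is 32 bits wide.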
3 Numba provides JIT-compilation of select Python functions as a package within
CPython. To use it, write a function involving numbers, booleans, and strings, using
constructs like loops, conditionals, and NumPy arrays. Then call the function jit
on that function:
from numba import jit
import numpy as np

def f(x):
    while abs(x) > 1:
        x = x / 2
    return x

f = jit(f)

def apply_f(A):
    return np.array([f(x) for x in A])

apply_f = jit(apply_f)

A = np.array([-3, 0.2, 314, 7.05])
apply_f(A)
4 Numba can only compile certain Python constructs and a few primitive types.
Use njit instead of jit to get an error if you try something that the compiler can’t
handle.
DATA 1050 Cheatsheet · Samuel S. Watson
2 Collections are analogous to relations in a relational database, while documents
are analogous to rows.
3 Major differences from traditional relational databases:
• The values of a document may be nested (like an array of arrays of dictionaries, etc.).
• Designed to encourage storing data together which is accessed together,
at the cost of denormalization (repeating the same data in multiple places
in the database). Joins are usually expensive.
• Data may be split up for hosting on multiple machines.
4 Document databases should be designed so that neither the number of collections
nor the contents of a single document are set up to grow indefinitely. Rather,
the database should scale by having an indefinitely large number of documents.
Same for relational databases: don’t grow your number of tables or your number of
columns indefinitely. Grow by having large numbers of rows.
Data Dashboards
1 Dashboard products like Tableau, PowerBI, or Apache Superset add a convenient
and powerful visualization layer to a SQL database.
• Contrast. Ensure that your color scheme makes things easy to read (unlike
the bottom left treemap).
7 Creating charts. Charts in Superset are produced by selecting a chart type and
filling in its slots with names of variables from one of your SQL tables. For example,
you supply the time column as well as the numeric column to plot on the vertical
axis for a time series line plot. You can further customize by adding SQL query
elements like WHERE clauses and grouping operations.
How programs run
1 To create a C program, you write the source code in a text file and then run a
compiler to produce an executable that you can run directly on your processor. We
say that C is ahead-of-time compiled (AOT).
2 To create a CPython program, you write the source code in a text file and then
run the Python executable on your machine, pointing it to your code. CPython interprets
the code, meaning that it executes the instructions directly without first compiling
functions to machine code. Your Python code is said to be interpreted.
3 You can, alternatively, run your Python program using PyPy, which (as it runs)
compiles your code incrementally into machine code for faster execution. We say
that such code is just-in-time compiled (JIT).
4 CPython and PyPy are examples of runtimes (or runtime systems, or runtime
environments).
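A script can ask which runtime is executing it, and can inspect the bytecode that the interpreter works through. The sketch below uses only the standard library; the function halve_until_small is a hypothetical toy, included just to have something to disassemble.

```python
import dis
import platform


def halve_until_small(x):
    """Toy function used only to have something to disassemble."""
    while abs(x) > 1:
        x = x / 2
    return x


# Name of the runtime executing this script, e.g. "CPython" or "PyPy".
runtime_name = platform.python_implementation()
print("Running on:", runtime_name)

# The function is stored as bytecode for the interpreter loop rather than
# as machine code. List the names of its bytecode operations.
opnames = [ins.opname for ins in dis.get_instructions(halve_until_small)]
print(len(opnames), "bytecode instructions, starting with", opnames[:3])
```

The exact instruction names vary between Python versions, so the sketch prints them rather than assuming any particular listing.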
5 Programs execute as a nested sequence of function calls. The variables local to
each function call are recorded in a stack frame. The stack frames are organized
into a stack which grows for each function call and which shrinks again when a
function’s execution completes.
6 Memory may also be allocated in a separate part of RAM called the heap. This is
especially useful for larger data structures, as it saves copying between stack frames.
Objects on the heap are identified by address.
[Figure: stack and heap. The stack frames for fib and populate each hold a @ 0x0f1bb989;
the heap holds the array 1 1 2 3 5 8 13 21 34 ... at address 0x0f1bb989.]
import numpy as np

def populate(a):
    for i in range(2, len(a)):
        a[i] = a[i-1] + a[i-2]

def fib():
    a = np.ones(50, int)
    populate(a)
    return a
5 Cython is quite similar to Numba, but AOT instead of JIT. As a result, Cython
requires special type annotations and is actually a different language than Python
(note that the array p is allocated on the stack and therefore can’t be very big):
%%cython
def primes(int nb_primes):
    cdef int n, i, len_p
    cdef int p[1000]

    if nb_primes > 1000:
        nb_primes = 1000

    len_p = 0
    n = 2
    while len_p < nb_primes:
        for i in p[:len_p]:
            if n % i == 0:
                break
        else:
            p[len_p] = n
            len_p += 1
        n += 1

    result_as_list = [prime for prime in p[:len_p]]
    return result_as_list
2 A dashboard displays one or more charts. A chart is a visualization of the results
of a query on your database.
3 Types of charts:
• Time series. How data changes over time (line charts, time-series bar
charts).
• Composition. How totals break down by category (pie charts, bar charts,
tree maps).
• Distribution. How variables are distributed on the number line (histograms,
box plots, horizon charts) or how two or more variables are distributed
jointly (pivot tables, heatmaps, bubble charts).
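Whatever its type, a chart is a plot of query results. Below is a minimal sketch of the kind of query that could sit behind a time-series chart, using Python's built-in sqlite3 module. The orders table, its columns, and its values are hypothetical, chosen only for illustration; a dashboard product would generate a comparable query from the chart's settings.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2024-01-01", "east", 10.0),
        ("2024-01-01", "west", 5.0),
        ("2024-01-02", "east", 7.5),
        ("2024-01-02", "west", 2.5),
        ("2024-01-03", "east", 4.0),
    ],
)

# Time column for the horizontal axis, aggregated numeric column for the
# vertical axis, with a WHERE filter and a GROUP BY.
query = """
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    WHERE region = 'east'
    GROUP BY order_date
    ORDER BY order_date
"""
chart_rows = connection.execute(query).fetchall()
for order_date, total_amount in chart_rows:
    print(order_date, total_amount)
connection.close()
```

Each returned row is one point on the line: the date gives the horizontal position and the summed amount gives the height.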
