Cheat Sheet
[…] a computation. Python and R: High-level languages vs. low-level languages; Interpreted vs. compiled languages; Efficiency: perform tasks much faster and more accurately; Scalability: adapt to growth or fluctuations in demand; Consistency: reduce variability and ensure reliable results. GUI (graphical user interface) vs. CLI (command line interface); IDE (Integrated development environment). Bugs are programming errors; Debugging is tracking down and correcting programming errors; Syntax errors: the code breaks language rules; Runtime errors: something exceptional (bad) happened; Semantic errors: the code doesn't do what you meant it to do. Value is one of the fundamental things that a program manipulates. int is shorthand for an integer; str is shorthand for a string; float is shorthand for a floating-point number. Variable is a name that refers to a value. Assignment statement creates a new variable and gives it a value. Assignment operator character: =. It is not an equals sign! Assignment operators link a name (left side) with a value (right side). Variable names: Can be arbitrarily long; Can contain numbers, letters and the _ underscore character; Must not begin with a number; Are case-sensitive (my_name and My_Name are different); Cannot be a Python keyword. Keywords (33) define the language's rules and structure.

Statement is an instruction that the Python interpreter can execute. Print statements give an output, whereas assignment statements do not. Expression is a type of statement or a building block of statements; it is a combination of values, variables, and operators that the Python interpreter evaluates. Evaluation of an expression produces a value. Evaluating an expression is not the same thing as printing a value. If the last statement that Python executes is an expression, then a Jupyter cell will print out the evaluation of that expression (a value). Expressions can appear on the right hand side of an assignment statement. A value all by itself is a simple expression, and so is a variable. An expression all by itself is a legal statement, but it doesn't "do" anything. Operators are special symbols that represent computations like addition and multiplication. Mathematical operators in Python: = assignment operator; + addition; - subtraction; * multiplication; / division; ** exponentiation; // integer division; % modulo. Rules of precedence: Parentheses; Exponentiation; Multiplication and Division have the same precedence; Addition and Subtraction have the same precedence. Operators with the same precedence are evaluated left to right (except **, which is evaluated right to left). String operators: + concatenation; * repetition. Input and Output: Programs need a way of receiving input and generating output. A simple way for a program to produce output: using a print statement. Python includes a built-in function called input that we can use to get keyboard input. Two types of comments: Inline comments start with # (everything after is ignored); Comment Blocks start and end with ''' (everything in between is ignored).

Sec2: Input: input & Output: print; Opening, reading from, & writing to files; Math: +, *, **, -, /, //, and %; Specialty string and list operators; Variables & Operators; Conditional Execution; Repetition: Loops. List is a new data type, it is a list of other values: write a comma-separated sequence of elements in square brackets. Each element is Python code for another value. Loop is a piece of code which may execute many times (or perhaps even not at all). One such loop is the for loop. Function is a named sequence of statements that performs a desired operation; can have input (or arguments) and output, just like a program. print function: print("Hello"). The print function performs the action of printing the input to the screen (program output) but does not return a value (function output). The range function is built in and returns an object of type range that we can iterate over. range(0, 3) ≈ [0, 1, 2]. The function len returns the length of a list or string, which is equal to the number of its elements. len([1, 2, 4, 6, 10]) → outputs 5; len("Hello, how are you?") → outputs 19. Bracket operator selects a single character from a string or a single element from a list. Lists are mutable, strings are not. Mutability is the ability of objects to change their values. Slice: We can take a subset of a string or list using the [n:m] operator. Remember that Python indexing starts at 0 and not 1! String methods: capitalize(); endswith(); find(); isalnum(); isalpha(); islower(). By reading and writing files, programs can save information between program runs. A text file is a file that contains printable characters and whitespace, organized into lines separated by newline characters. Comma-separated files (.csv) are examples of text files, but they can have any file extension (.txt, .dat, …).

Sec3: bool is a new data type with two possible values: True or False. We can produce boolean values with comparison operators, also known as relational operators. We might also combine boolean expressions using logical operators. Armed with boolean expressions, we can now use while loops. Flow of execution for a while loop: 1. Evaluate the condition, yielding True or False. 2. If the condition is False, exit the while statement and continue execution at the next statement (after the while loop code block). 3. If the condition is True, execute each of the statements in the body and then go back to step 1. Warning: the body of the loop should change the value of one or more variables so that eventually the condition becomes False and the loop terminates. Otherwise we'd get an infinite loop. One of the most powerful things about programming is being able to check conditions and change the behavior of the program accordingly. The simplest form of this is the if statement. The 2nd form of the if statement is alternative execution. Alternative execution allows for two possible sets of actions (or branches) and the condition determines which one gets executed. The 3rd variant is chained conditionals. Chained conditionals allow for more than two branches. elif is an abbreviation of "else if". Conditionals can also be nested within each other.

Sec4: Function is a named sequence of statements that performs a desired operation. This operation is specified in a function definition. Here are some built-in functions we've already used: print; input; range; open; close; write; read; readlines; type. Why use functions?: Modular programming separates a program's functionality into distinct, independent modules (e.g., functions) that act as "building blocks" for bigger programs. This makes it easier: to maintain code; to reuse code; to scale code; to collaborate; to debug and test code. A function must be defined before its first use. Execution always begins at the first statement of the program. Statements are executed one at a time, in order from top to bottom. Default parameters are values provided in a function definition that are used if no argument is passed during the function call. Syntax: Specified in the function header after the parameter name using the assignment operator (=). Advantages: 1. Makes function calls simpler by providing default values. 2. Allows flexibility and reusability of functions. Important Notes: Default parameters must come after non-default parameters in the function definition. Functions can call other functions. When a function calls itself, it's called recursion. It's useful, but if you don't get the stopping conditions right, your program can run forever! Functions can be defined within functions. Lambda functions are small anonymous functions (and very useful). String formatting is very useful when working with text data and writing out to files. Sometimes we need to generate random numbers to simulate possible outcomes or subsample our data. Technically, we'll be generating pseudo-random numbers because we can regenerate them given the same seed. import random # import our module; random.seed(42) # set the seed; random.randint(1, 10) # generate a random number.

Sec5: Set is an unordered collection of unique elements. Sets are mutable, but the elements within the set must be immutable (e.g., numbers, strings, tuples). Features: Uniqueness: Automatically removes duplicate elements; Unordered: Does not preserve the insertion order; Supports standard operations like union, intersection, difference, and symmetric difference. Dictionary is a collection of key-value pairs (it keeps insertion order since Python 3.7, but values are looked up by key, not by position). Keys must be unique and immutable (e.g., numbers, strings, tuples). Values can be any data type and can be duplicated. Features: Fast lookups, insertions, and deletions; Flexible and versatile for various data manipulation tasks. Object-oriented vs. Procedural Programming: Up to now we have been writing programs using a procedural programming paradigm. In procedural programming the focus is on writing functions or procedures which operate on data. We're now going to learn about object-oriented programming, where the focus is on the creation of objects which contain both data and functionality together. You will use both approaches in data analytics. Even if most of the code you write is procedural, you will still use objects via libraries. Class in essence defines a new data type. Class: A blueprint for creating objects (particular data structure). Object: An instance of a class. Attributes: Characteristics of an object (data). Methods: Functions that belong to an object. Why OOP in Data Analytics? Encapsulation: Group related data and functions together. Reusability: Use classes to create reusable code. Modularity: Divide complex problems into smaller, manageable pieces. Data Modeling: Easily model complex data structures. Advanced OOP: Inheritance: A way to form new classes using classes that have already been defined. Polymorphism: The ability to use a common interface for multiple forms (data types).

Sec6: Programming is all about breaking down complicated problems into small pieces. But we don't want to code everything from scratch, so we use modular programming. So far, functions and classes have helped us make code more modular. Python module: a bunch of related code saved in a file with the extension .py; Python package: a directory containing a collection of (related) modules. Library is an umbrella term referring to a reusable chunk of code. But in Python, people often assume that a library is a collection of packages (but it could just be one package). Python libraries / packages commonly used for data analysis: numpy; pandas; scipy; sklearn. NumPy Key Features: high-dimensional numerical data structures (ndarray); numerical calculations (think trigonometry, vectors, & matrix manipulation). Example operations: transpose; log; append. Pandas (Builds on NumPy) Key Features: data structures for labeled, numerical, time series, & relational data; data manipulation operations. Example operations: head; groupby; merge. SciPy (Builds on and is closely related to NumPy) Key Features: operations on sparse matrices; more advanced linear algebra; special mathematical operations; numerical optimization. Commonly used submodules: scipy.cluster; scipy.interpolate; scipy.linalg; scipy.optimize; scipy.sparse; scipy.stats. Scikit-Learn (Built on NumPy and SciPy) Key Features: machine learning methods. Machine learning uses: classification; regression; clustering; dimensionality reduction; model selection; preprocessing.

Sec7: Sorting is the process of ordering elements in a data structure. The default is to sort in ascending order, or from smallest to biggest. E.g.: sorted(clients) returns a new object; clients.sort() changes the object. Searching is the process of checking for or retrieving an element from any data structure where it is stored. "Marie" in clients; clients.index("Marie"); bio.find("Marie"). Sometimes simple searching is not powerful enough: we need to be able to find strings that match a pattern (or filter). Most letters and characters will simply match themselves. Metacharacters are the exception; they signal that some out-of-the-ordinary thing should be matched. Metacharacters: . ^ $ * + ? { } [ ] \ | ( ). Understanding how the metacharacters work is the hard part. Computational Efficiency: You can do the same thing multiple ways, but some ways take longer to run or use more memory than others. We can compute various summary statistics using base Python, NumPy, or Pandas. min(data), np.min(data), data.min() # pandas dataframe.

Sec8: Reading Excel files: install.packages("readxl"); library(readxl); df = read_excel("orders.xlsx") reads from the file orders.xlsx. Writing Excel files: install.packages("writexl"); library(writexl); write_xlsx(df, "orders.xlsx") writes to the file orders.xlsx. Vectors (aka atomic vectors): a 1-dimensional ordered collection of primitive variables of the same type; The type can be any primitive type. myVector = c(1, 2, 3). Matrices: a 2-dimensional ordered collection of primitive variables of the same type; The type can be any primitive type. myMatrix = matrix(1:12, nrow = 3, ncol = 4). Lists (aka recursive vectors): a 1-dimensional ordered collection of objects of possibly different types; A list allows a variety of types (possibly unrelated) stored under one name. thisList = list(myVector, "a", myMatrix, theirDF). Data frames: a 2-dimensional ordered collection of vectors; each column has the same length & different columns can have different data types. theirDF = data.frame(c(1, 2, 3), c("a", "b", "c")).

Sec9: Function is a named sequence of statements that performs a desired operation. This operation is specified in a function definition. Functions are a program's "molecules". Just like a program, a function can have: 0, 1, or more Inputs (aka arguments); 0 or 1 Output. The reshape2 package provides useful data reshaping functions. melt(), dcast(), and acast() are the most commonly used functions. melt() and dcast() are inverse functions. melt() and acast() are almost inverse functions. acast() & dcast() are similar, except acast() returns a matrix/array. apply() applies a function over the rows or columns of a matrix/array. lapply() applies a function to each element of a list/vector and returns a list. unlist() converts a list into a vector. The sapply() function tries to 1. Transform the result into a vector or a matrix 2. Assign meaningful names. sapply() fails to simplify (and returns a list) if it can't find a meaningful transformation. Even more dangerous is that sapply() can return unexpected types. The vapply() function performs like sapply() but: We can specify the return type using an additional argument; vapply() assigns the types specified. Gives an error if the type mismatches.
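
A few of the Python points on this page (slices, default parameters, sorted() vs. sort(), sets, seeded random numbers, and the class/attribute/method/inheritance vocabulary) can be checked with a minimal sketch. The names `greet`, `Client`, `VipClient` and the sample data are made up for illustration and are not from the course material.

```python
import random

# Slice: [n:m] starts at index n (counting from 0) and stops before index m.
word = "analytics"
first_three = word[0:3]

# Default parameter: used when the caller passes no argument for it.
def greet(name, greeting="Hello"):  # hypothetical helper
    return f"{greeting}, {name}!"

# sorted() returns a new list; list.sort() changes the list in place.
clients = ["Marie", "Ahmed", "Zoe"]  # made-up data
ordered_copy = sorted(clients)       # clients is still unsorted here
clients.sort()                       # now clients itself is sorted

# Sets remove duplicates and support union / intersection.
a = {1, 2, 2, 3}
b = {3, 4}
union_ab = a | b
common_ab = a & b

# Same seed, same pseudo-random number.
random.seed(42)
r1 = random.randint(1, 10)
random.seed(42)
r2 = random.randint(1, 10)

class Client:
    """Class = blueprint; each Client(...) call creates an object."""

    def __init__(self, name, orders):
        self.name = name      # attribute (data)
        self.orders = orders  # attribute (data)

    def total(self):          # method (function that belongs to the object)
        return sum(self.orders)

class VipClient(Client):      # inheritance: builds on an existing class
    def total(self):          # polymorphism: same interface, different behavior
        return super().total() - 5  # flat discount, purely illustrative

totals = [c.total() for c in (Client("Marie", [10, 20]), VipClient("Zoe", [10, 20]))]
```

Calling `greet("Marie")` uses the default greeting, while `greet("Marie", "Hi")` overrides it; both client objects answer the same `total()` call even though they belong to different classes.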
grepl(): grepl(pattern = regex, x = my_string) finds the regex pattern in my_string: TRUE when regex is in my_string and FALSE when regex is not in my_string. grep(): grep(pattern = regex, x = my_string) gives a vector of indices where regex is in my_string. grep() is the same as which(grepl()). sub(): sub(pattern = regex, replacement = newstr, x = my_string) replaces the first regex pattern with newstr in my_string. gsub(): gsub(pattern = regex, replacement = newstr, x = my_string) replaces all regex patterns with newstr in my_string.

Sec10&11: Tidyverse is an R package collection (Manipulates data; Designed for data science; Similar to pandas). All packages share: Design philosophy; Grammar; Data structures. The tidyverse core consists of: ggplot2; dplyr (Manipulates data); tidyr (Tidies messy data); readr; purrr; tibble; lubridate (For dates); stringr (For strings (character)); forcats (For categorical data). Tidyr: tidyr tidies messy data. Tidy data is rectangular data like a spreadsheet with columns, rows, and cells. Tidy data meets three conditions: Variables: Each column contains a single variable; Observations: Each row holds a single observation; Values: Each cell holds a single value after following the first two rules. if (!require(tidyr)) {install.packages("tidyr")}; library(tidyr). magrittr: Provides piping to facilitate readability & writability. The basic piping uses %>%; it also provides several specialized piping operators (such as %$% and %<>%). a %<>% b %>% c is shorthand for a = a %>% b %>% c; %$% is shorthand for with(). filter() picks rows based on name & conditions. select() picks columns based on name. Together select() & filter() help zoom into data. mutate() modifies existing column(s) or adds new column(s). summarize() transforms many rows into one row. filter() before summarize() can help focus your exploration. group_by() tells summarize() to summarize within groups (not the entire dataset).

Sec12: hist() Histograms; plot() Scatter, split plots; abline() Straight lines; barplot() Bar plots; dotchart() Dot plots; lines() Line plots. Bar Chart displays categorical variables; Histogram displays numeric variables. Bars have equal width, Length: the count/frequency. x-axis is counts; y-axis for vertical bars. The Area Principle: All areas in a graph should be proportional to the sizes of the data they represent. Read galton.csv: height = Table.read_table("galton.csv"), height = height.select(1, 2, 7). Draw overlaid histograms: height.hist(unit = "inch", bins = np.arange(55, 80, 2)).

Q1: Each of _, _abc, abc_123, gamma, abc123, local, alpha, beta is allowed as a Python variable name. None of global, 123abc, 123.abc, lambda, abc.123, #abc, my#, 123_abc is allowed as a Python variable name.

Q2: How many elements are in range(0,10)? 10.

Q3: "apple" < "Apple": F. In Unicode, the value of an uppercase letter is lower than that of the lowercase letter; (5 < 7) and (7 < 10): T; (1 < 1) or (3 > 2): T; (1 =< 1) and (3 => 2): Error; == is an operator & is used for comparison; = is an operator & is used for assignment. A nested if-statement is an if-statement containing other if-statements.

Q4: def is used to define a function in Python. my_function() is a correct way to call a function named my_function. A function in Python must have a return statement: F; Functions can accept parameters: T; You can call a function from within another function: T; In Python, the body of a function is defined by indentation: T; You can call a function before it is defined: F.

Q5: empty_set = set() is the correct way to construct an empty set in Python. The elements in a Python set are ordered: F; The elements in a Python dictionary are ordered: T; You can add a list as an element in a set: F; In Python, you can add a string as an element in a set: T; You can use a list as a key in a dictionary: F; You can use a list as a value in a dictionary: T; In a Python dictionary, the same value can occur for multiple different keys: T; Object is an instance of a custom data structure; Attribute is data associated with a custom data structure; Class is a custom data structure template; Method is a function associated with structured data.

Q6: What library would you want to import if you needed to merge two datasets based on a common column? Pandas; to clean and transform a large, structured dataset for analysis? Pandas; to use machine learning algorithms like linear regression, decision trees, or clustering? sklearn; to randomly shuffle a list of items? numpy (random); to manipulate a sparse matrix? scipy; to group data by an attribute? pandas; to calculate the mean and standard deviation of an array of numbers? pandas; to create a 3-dimensional array? numpy; to generate a random integer between two specified values? numpy (random); to perform linear algebra operations such as matrix multiplication? scipy.

Q7: What does \W in a regular expression match? [^a-zA-Z0-9_]; The difference between a+ and a*: a* can match no a characters, but a+ will need at least one for a match; Which method would you use to check if a pattern matches the beginning of a string? re.match(); Which method returns a list of all matches of a pattern in a string? re.findall(); What will give an equivalent result to the regular expression re.split(r'\s+', 'The quick brown fox')? 'The quick brown fox'.split(' '); Which Python module is used for working with regular expressions? re; What does the regular expression [a-z] match? Any single lowercase letter; What does the regular expression \d match? Any digit; In Python, what function would you use to search for the first match of a pattern anywhere within a string? re.search(); What does the caret symbol (^) in a regular expression signify? Beginning of a string.

Q8: What happens when you add a vector with 300 elements to a vector with 100 elements? The resulting vector has 300 elements; What happens when you add a vector with 300 elements to a vector with 200 elements? The resulting vector has 300 elements & R gives a warning; In any given row, all elements must be of the same type: F; All rows must have exactly the same number of elements: T; In any given column, all elements must be of the same type: T; All columns must have exactly the same number of elements: T; my_df[ , 1:3] selects all the rows and the 1st, 2nd, and 3rd columns. What is the result of merge(df1, df2) when df1 and df2 have the same numbers of rows and one or more columns in common? A new data frame with data frames df1 and df2 merged by common columns. What is the result of merge(df1, df2, by = column_names) when df1 and df2 have different numbers of rows and one or more columns in common? A new data frame with data frames df1 and df2 merged by the specified columns. All matrix elements must be of the same type: T; All rows must have the same number of elements: T; All columns must have the same number of elements: T; A matrix must have 2 dimensions: T; In R, which of the following must be true about rbind(df1, df2) to work properly on two data frames df1 and df2? df1 and df2 must have the same number of columns. In R, which of the following must be true about cbind(df1, df2) to work properly on two data frames df1 and df2? df1 and df2 must have the same number of rows & there's no restriction on the variable types.

Q9: A function can have: 0/1/2+ inputs & 0/1 outputs. What happens when you knit an RMD file with the following statement? install.packages("reshape2"): RStudio doesn't knit the file and informs you of an error; What is the primary function of sapply() in R? To apply a function to each element of a data structure; Which data structures can sapply() be used with? Vectors, Lists, Matrices, Data frames; vapply() gives more informative error messages and never fails silently: T; vapply() is more verbose than sapply(): T; sapply() is easy to use but can give unexpected errors when used in functions: T; lapply() returns a list data structure: T; vapply(fuqua, func, datatype) gives an error when the result is not the same type as datatype: T; R has three types of classes, they are S3, S4, R6. S3 is the most basic type of class in R, and some people would say it is not a real class. & R6 is the most robust type of class in R, and similar to classes in other languages.

Q10: Which of the following are part of tidyverse? readr, tidyr, ggplot2, purrr; forcats is best for processing categorical variables; separate() & unite() are opposites, pivot_longer() & pivot_wider() are opposites, gather() & spread() are opposites; pivot_longer() and gather() are similar, pivot_wider() and spread() are similar. Which of the following usually give you fewer columns than their input for sure? pivot_longer(), unite(), gather(); Consider the code below and pick the statements that must be true about the medals: dfcovid %>% select(Year, Country, Deaths) %>% filter((Country == "US") & (Year == 2020)): Year, Country, and Deaths are columns & filter() filters rows; Which of the following pieces of code gives you the summary statistics on deaths reported in dfcovid? dfcovid %>% filter(Country == "US") %>% group_by(Month) %>% summarize(totalDeaths = sum(Deaths), medianDeaths = median(Deaths)).

Q11: Which of the following are sources to learn and use libraries and packages? Use search engines (such as Google and Bing), With Generative AI (Gemini, ChatGPT, etc.), Read CRAN pages, Find code samples and modify them; What is the output of the following lines of code? myDate = ymd_hms("2026-09-07 12:01:02.03"); floor_date(myDate, "day"): "2026-09-07 UTC"; Character is the best data type to store a nominal categorical variable; Ordinal Categorical Variable is the best variable type for the Wong-Baker Pain scale; What is a possible prefix for stringr functions? (prefix means start with -- for example S, Sa, etc. are prefixes of Salman.) str_; str_subset() is an improvement over grep(); data.table is the fastest among data.frame/tibble/data.table. A data.table can have columns of different lengths: F; Can a data.table have row names? No, it can't.

Q12: Which of the following can you use for plotting on your laptop? Base R (plot(), hist()), R's ggplot2 (qplot(), ggplot()), Python's matplotlib (plot(), hist()), Python's plotly (scatter(), line()); Which of the following can throw an error in this R code? lines(x = density(df$col), type = "l", lwd = 2, col = "blue"): df doesn't have column named x. df$col contains NA values. type doesn't have "l". col doesn't have "blue". ggplot2 is not installed; In qplot(), which parameter can you use to differentiate multiple values of a categorical variable (such as marital_status) on the same plot? color; Which of the following is the best R package for combining multiple plots to store as one variable? gridExtra; Which of the following ggplot options fits a linear trendline on a scatter plot? geom_smooth(method = "lm"); Which of the following histograms is most likely to be deceptive? hist(x = df$col1, col = "orange", freq = FALSE, ylim = c(1, 5)); Which of the following are reasons to use matplotlib instead of plotly? Quality plot printouts to impress your boss, professors, and friends & Faster large visualizations; Which is the closest thing to Python's marker parameter in R's world? geom; The Area Principle dictates that all areas in a graph should be proportional to the sizes of the data they represent: T.

Sample Exam: A variable name in Python can start with a number: F; What function is used to take user input in Python? input(); What does the str() function do in Python? Converts a number to a string; Python has 33 keywords. How many of these keywords are allowed as variable names? 0; num1 = 10 num2 = 0 result = num1 / num2: Runtime; def my_function(): print("Hello, world!"): Syntax; Classes are not allocated to memory (memory is allocated when defining an object using the class): T (Classes define their objects' structure. They do not store data.); Two objects defined on the same class can have different methods but will have the same attributes: F; A new class can be defined as an extension of existing classes: T; Encapsulation can help hide class attributes from other users: T; All variables are objects: T; Which of the following best describes the following data.table code? DT[col_x %in% c("a","b"), list(mean(col_y)), by = col_x]: Calculate and print the mean of col_y when col_x is "a" as well as the mean of col_y when col_x is "b"; Which of the following are true about the syntax for Python regular expressions? re.findall('x+', mystr) returns all matches of at least one x in the string mystr: T; re.findall('x*', mystr) returns all matches of at least two xs in mystr: F (zero/more); re.search('$[yz] ', mystr) returns a match if the string mystr starts with y or z: F (^[yz] starts / [yz]$ ends).
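
The regular-expression answers in Q7 and the Sample Exam can be tried out with a short Python sketch. The sample strings are made up; note the deliberate double space in `text`, which shows that `str.split(' ')` only agrees with `re.split(r'\s+', ...)` when words are separated by exactly one space.

```python
import re

text = "The quick  brown fox"  # made-up sample with a double space

# re.match only looks at the beginning of the string; re.search scans anywhere.
match_at_start = re.match(r"quick", text)    # None: "quick" is not at position 0
found_anywhere = re.search(r"quick", text)   # a match object

# a+ needs at least one "a"; a* also matches the empty string.
plus_matches = re.findall(r"a+", "caaat at")
star_matches = re.findall(r"a*", "ba")

# \W matches anything that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", "a_b-c!")

# Splitting on runs of whitespace vs. splitting on a single space.
split_regex = re.split(r"\s+", text)
split_space = text.split(" ")   # keeps an empty string for the double space
split_plain = text.split()      # no argument: splits on whitespace runs

# ^[yz] means "starts with y or z"; [yz]$ means "ends with y or z".
starts_yz = bool(re.search(r"^[yz]", "zebra"))
ends_yz = bool(re.search(r"[yz]$", "lazy"))
misplaced_dollar = re.search(r"$[yz]", "zebra")  # None: nothing can follow the end
```

On a string with single spaces, all three splitting calls return the same list, which is the case the Q7 answer has in mind.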