Miuul Data Scientist Bootcamp CheatSheet Collections

This document is a cheat sheet collection for data science fundamentals. It covers Python basics (data structures, functions, comprehensions, loops), NumPy arrays, pandas, scikit-learn, data visualization with matplotlib and seaborn, MS SQL, and core probability. Each section gives quick references for syntax and usage, with short worked examples.


PYTHON BASICS

Data Structures

# Numbers
# Integer (int)
>>> x = 2

# Float
>>> x = 2.3

# Complex
>>> x = 2j + 1

# Numeric operations
>>> 2**3        # exponent
8
>>> 22 % 8      # modulus / remainder
6
>>> 22 // 8     # integer division
2
>>> 22 / 8      # division
2.75
>>> 3 * 3       # multiplication
9
>>> 5 - 2       # subtraction
3
>>> 2 + 2       # addition
4

# Strings
>>> txt_1 = 'Hello World!'
>>> txt_2 = "HELLO WORLD!"
>>> long_txt = """
Hello World,
Welcome DSMLBC!
"""

# String indexing & slicing
>>> txt_1[0]
'H'
>>> txt_1[-1]
'!'
>>> txt_1[1:4]
'ell'
>>> txt_1[:5]
'Hello'

# String methods
>>> len(txt_1)
12
>>> txt_1.upper()
'HELLO WORLD!'
>>> txt_2.lower()
'hello world!'
>>> txt_1.replace("World", "Era")
'Hello Era!'
>>> txt_1.split()
['Hello', 'World!']
>>> '  Hello World!  '.strip()
'Hello World!'
>>> 'hello world!'.capitalize()
'Hello world!'

# Booleans & comparisons
>>> a = 2
>>> b = 7
>>> a == b
False
>>> a != b
True
>>> a > b
False
>>> a >= b
False
>>> a < b
True
>>> a <= b
True
>>> (a > 1) & (b < 10)
True
>>> (a > 3) | (b > 5)
True
>>> a is b
False
>>> a is not b
True

Lists

>>> list_1 = [1, 2, 3, "a", "b"]
>>> list_2 = [True, [1, 2, 3]]
>>> numbers = [4, 2, 1, 3]

# Indexing & slicing
>>> list_1[3]
'a'
>>> list_2[-1]
[1, 2, 3]
>>> list_1[0:4]
[1, 2, 3, 'a']

# List methods
>>> list_1 + list_2
[1, 2, 3, 'a', 'b', True, [1, 2, 3]]
>>> len(list_1)
5
>>> list_1.append('c')        # list_1 is now [1, 2, 3, 'a', 'b', 'c']
>>> list_2.remove(True)       # list_2 is now [[1, 2, 3]]
>>> list_2.insert(1, False)   # insert at index 1
>>> numbers.sort()            # in place; numbers is now [1, 2, 3, 4]
>>> numbers.pop()             # removes and returns the last element

Tuples

>>> tuple_1 = ("john", "mark", 1, 2)   # named tuple_1 to avoid shadowing the built-in tuple
>>> tuple_1[0]
'john'
>>> tuple_1[1:3]
('mark', 1)
>>> len(tuple_1)
4

Dictionaries

>>> dict_1 = {"REG": "Regression",
...           "LOG": "Logistic Regression",
...           "CART": "Classification and Reg"}
>>> dict_2 = {"REG": ["RMSE", 10],
...           "LOG": ["MSE", 20],
...           "CART": ["SSE", 30]}

# Key-value methods
>>> dict_1.keys()
dict_keys(['REG', 'LOG', 'CART'])
>>> dict_1.values()
dict_values(['Regression', 'Logistic Regression', 'Classification and Reg'])
>>> dict_1.items()
dict_items([('REG', 'Regression'), ('LOG', 'Logistic Regression'), ('CART', 'Classification and Reg')])
>>> dict_1["REG"]
'Regression'
>>> dict_2["CART"][1]
30

Sets

>>> set_1 = {1, 2, 2, 3, 3, 3}           # {1, 2, 3}
>>> set_2 = set([3, 4, 4, 5, 5, 5, 6])   # {3, 4, 5, 6}

# Set methods
>>> set_1.difference(set_2)
{1, 2}
>>> set_2.difference(set_1)
{4, 5, 6}
>>> set_1.symmetric_difference(set_2)
{1, 2, 4, 5, 6}
>>> set_1.intersection(set_2)
{3}
>>> set_1.union(set_2)
{1, 2, 3, 4, 5, 6}
>>> set_1.isdisjoint(set_2)
False
>>> set_1.issubset(set_2)
False
>>> set_1.issuperset(set_2)
False
Comprehensions

# List comprehension syntax:
# [expression for item in iterable if condition]
>>> squares = [x**2 for x in range(1, 11)]
>>> print(squares)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

>>> evens = [f"{x}: even" for x in range(1, 10) if x % 2 == 0]
>>> print(evens)
['2: even', '4: even', '6: even', '8: even']

>>> even_odd = [f"{x}: even" if x % 2 == 0 else f"{x}: odd" for x in range(1, 5)]
>>> print(even_odd)
['1: odd', '2: even', '3: odd', '4: even']

# Dictionary comprehension syntax:
# {key_exp: value_exp for item in iterable if condition}
>>> dictionary = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> {k: v ** 2 for (k, v) in dictionary.items()}
{'a': 1, 'b': 4, 'c': 9, 'd': 16}
>>> {k.upper(): v for (k, v) in dictionary.items()}
{'A': 1, 'B': 2, 'C': 3, 'D': 4}
>>> {k.upper(): v * 2 for (k, v) in dictionary.items()}
{'A': 2, 'B': 4, 'C': 6, 'D': 8}

Loops

# For loop syntax:
# for <variable> in <iterable>:
#     <code block>
>>> students = ["John", "Mark", "Venessa"]
>>> for student in students:
...     print(student)
John
Mark
Venessa

>>> for index, student in enumerate(students):
...     print(index, student)
0 John
1 Mark
2 Venessa

# While loop syntax:
# while <condition>:
#     <code to execute while the condition is true>
>>> i = 1
>>> while i < 5:
...     print(i)
...     if i == 3:
...         break
...     i += 1
1
2
3

>>> i = 1
>>> while i < 5:
...     i += 1
...     if i == 3:
...         continue
...     print(i)
2
4
5

Conditions

>>> number = 3.14
>>> if number > 0:
...     print(f"{number} is positive.")
... elif number < 0:
...     print(f"{number} is negative.")
... else:
...     print(f"{number} is zero!")
3.14 is positive.

Functions

# Function_1
>>> def say_hi(name):
...     print(f'Merhaba {name}')   # "Merhaba" is Turkish for "Hello"

# Calling the function
>>> say_hi('Miuul')
Merhaba Miuul

# Function_2 (with a docstring)
>>> def summer(num_1, num_2):
...     """
...     Sum of two numbers.
...     Args:
...         num_1: int, float
...         num_2: int, float
...     Returns:
...         int, float
...     """
...     return num_1 + num_2

# Calling the function
>>> summer(3, 4)
7

# Function_3 (default and keyword arguments)
>>> def find_volume(length=1, width=1, depth=1):
...     print(f'Length = {length}')
...     print(f'Width = {width}')
...     print(f'Depth = {depth}')
...     volume = length * width * depth
...     return volume

# Calling the function
>>> find_volume(1, 2, 3)
Length = 1
Width = 2
Depth = 3
6
>>> find_volume(2, depth=3, width=4)
Length = 2
Width = 4
Depth = 3
24

Local & Global Variables

>>> list_store = [1, 2]          # global variable
>>> def add_element(a, b):
...     c = a * b                # local variable
...     list_store.append(c)
...     print(list_store)
>>> add_element(1, 9)
[1, 2, 9]

Built-in Functions

# lambda
>>> summer = lambda a, b: a + b
>>> summer(3, 5)
8

# map
>>> salaries = [1000, 2000, 3000, 4000, 5000]
>>> list(map(lambda x: x * 20 / 100 + x, salaries))
[1200.0, 2400.0, 3600.0, 4800.0, 6000.0]

# filter
>>> numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list(filter(lambda x: x % 2 == 0, numbers))
[2, 4, 6, 8, 10]

# reduce (must be imported from functools in Python 3)
>>> from functools import reduce
>>> reduce(lambda a, b: a + b, numbers)
55

# zip
>>> students = ["John", "Mark", "Venessa"]
>>> departments = ["mathematics", "statistics", "physics"]
>>> list(zip(students, departments))
[('John', 'mathematics'), ('Mark', 'statistics'), ('Venessa', 'physics')]
NumPy

>>> import numpy as np

Attributes of Arrays

>>> a = np.array([(1, 2, 3), (4, 5, 6)])
>>> a.ndim
2
>>> a.shape
(2, 3)
>>> a.size
6
>>> a.dtype
dtype('int32')
>>> a.astype(float)
array([[1., 2., 3.],
       [4., 5., 6.]])

Creating NumPy Arrays

# One-dimensional array
>>> np.array([1, 2, 3])
array([1, 2, 3])

# Two-dimensional array
>>> np.array([(1, 2, 3), (4, 5, 6)])
array([[1, 2, 3],
       [4, 5, 6]])

# Array of zeros with 3 elements
>>> np.zeros(3)
array([0., 0., 0.])

# 2x2 array filled with ones
>>> np.ones((2, 2))
array([[1., 1.],
       [1., 1.]])

# 3x3 identity matrix
>>> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

# 2x3 array filled with 4s
>>> np.full((2, 3), 4)
array([[4, 4, 4],
       [4, 4, 4]])

# 1D array of length 3 with random integers between 0 and 10
>>> np.random.randint(0, 10, size=3)
array([8, 2, 7])

# 2x3 array of random numbers between 0 and 1
>>> np.random.rand(2, 3)
array([[0.60738971, 0.84209229, 0.64491569],
       [0.50990228, 0.25552633, 0.56484642]])

# 2x3 array of random numbers from a normal distribution with
# mean 0 and standard deviation 1
>>> np.random.normal(0, 1, (2, 3))
array([[ 0.36527849, -2.48435406,  0.77739812],
       [ 0.07923544, -0.30833118,  0.32393125]])

# 1D array with values from 0 to 10 with a step of 3
>>> np.arange(0, 10, 3)
array([0, 3, 6, 9])

# 1D array with 3 evenly spaced values between 0 and 10
>>> np.linspace(0, 10, 3)
array([ 0.,  5., 10.])

Reshaping

>>> a = np.array([(1, 2, 3), (4, 5, 6)])
>>> a.flatten()
array([1, 2, 3, 4, 5, 6])

>>> a.resize((6, 1))    # in place; a is now shape (6, 1)
>>> a
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

>>> a = np.array([(1, 2, 3), (4, 5, 6)])
>>> a.T
array([[1, 4],
       [2, 5],
       [3, 6]])

>>> b = np.random.randint(1, 10, size=9)
>>> b
array([4, 7, 1, 8, 2, 3, 3, 7, 8])
>>> b.reshape(3, 3)
array([[4, 7, 1],
       [8, 2, 3],
       [3, 7, 8]])

Indexing / Fancy Indexing

>>> a = np.array([1, 2, 3, 4, 5])

# Access the first element of the array
>>> a[0]
1

# Access elements at indices 0, 2, and 4
>>> indices = np.array([0, 2, 4])
>>> a[indices]
array([1, 3, 5])

>>> b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Access the element at row 1, column 2 of the array
>>> b[1, 2]
6

Slicing

# Slice the array from index 2 to index 5 (exclusive)
>>> a[2:5]
array([3, 4, 5])

# Slice the array from index 1 to the end
>>> a[1:]
array([2, 3, 4, 5])

# Slice the array from the beginning to index 3 (exclusive)
>>> a[:3]
array([1, 2, 3])

# Slice the array in reverse order
>>> a[::-1]
array([5, 4, 3, 2, 1])

# Slice the first row
>>> b[0, :]
array([1, 2, 3])

# Slice the second column
>>> b[:, 1]
array([2, 5, 8])

# Slice the sub-array from row 1 to the end and columns 0 up to 2 (exclusive)
>>> b[1:, :2]
array([[4, 5],
       [7, 8]])

Combining

>>> a = np.array([1, 2, 3, 4, 5])
>>> b = np.array([6, 7, 8, 9, 10])

# Concatenating arrays horizontally (along columns)
>>> np.hstack((a, b))
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

# Concatenating arrays vertically (along rows)
>>> np.vstack((a, b))
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

# For 1D inputs, concatenate joins along the existing axis
>>> np.concatenate((a, b))
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

# Stack along a new axis
>>> np.stack((a, b))
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

Subsetting

>>> a = np.array([1, 2, 3, 4, 5])
>>> a[a > 2]
array([3, 4, 5])
>>> a[(a > 1) & (a < 4)]
array([2, 3])
>>> a[a != 3]
array([1, 2, 4, 5])
>>> np.where(a > 2, a, 0)
array([0, 0, 3, 4, 5])

Numerical Operations

# Vector operations
>>> a = np.array([1, 2, 3, 4, 5])
>>> b = np.array([6, 7, 8, 9, 10])

>>> a + b
array([ 7,  9, 11, 13, 15])
>>> a - b
array([-5, -5, -5, -5, -5])
>>> b / a
array([6.        , 3.5       , 2.66666667, 2.25      , 2.        ])
>>> b // a
array([6, 3, 2, 2, 2], dtype=int32)
>>> a * b
array([ 6, 14, 24, 36, 50])
>>> a ** b
array([      1,     128,    6561,  262144, 9765625], dtype=int32)

>>> np.exp(a)
array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003, 148.4131591 ])
>>> np.log1p(b)
array([1.94591015, 2.07944154, 2.19722458, 2.30258509, 2.39789527])
>>> np.sin(a)
array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427])
>>> np.sqrt(b)
array([2.44948975, 2.64575131, 2.82842712, 3.        , 3.16227766])
>>> np.ceil(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))
array([2., 3., 4., 5., 6., 8.])
>>> np.floor(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))
array([1., 2., 3., 4., 5., 7.])
>>> np.round(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))   # rounds .5 to the nearest even value
array([1., 2., 3., 4., 6., 8.])
>>> np.abs(np.array([-1, 2, -3, -4, 5, 6, -7]))
array([1, 2, 3, 4, 5, 6, 7])

Statistics

>>> a = np.array([1, 2, 3, 4, 5])
>>> np.mean(a)
3.0
>>> a.sum()
15
>>> a.max()
5
>>> a.min()
1
>>> np.var(a)
2.0
>>> np.std(a)
1.4142135623730951
>>> np.corrcoef(a)
1.0

Linear Algebra

# Solve the linear system
#   5*x0 +   x1 = 12
#     x0 + 3*x1 = 10
>>> a = np.array([[5, 1], [1, 3]])
>>> b = np.array([12, 10])
>>> np.linalg.solve(a, b)
array([1.85714286, 2.71428571])

# Solve the linear system
#   x0 + 2*x1 + 3*x2 = 10
# 4*x0 + 5*x1 + 6*x2 = 11
# 7*x0 + 8*x1 + 9*x2 = 12
>>> a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> b = np.array([10, 11, 12])
>>> np.linalg.solve(a, b)
array([-25.33333333,  41.66666667, -16.        ])
# Note: this matrix is singular, so depending on the NumPy/LAPACK
# version the call may instead raise LinAlgError, and any returned
# values are dominated by floating-point round-off.
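As a quick sanity check (a small sketch, not from the original sheet), a solution returned by np.linalg.solve can be verified by substituting it back into the system:

# Hedged sketch: verify a solution from np.linalg.solve
# (uses the well-conditioned 2x2 system above)
>>> a = np.array([[5, 1], [1, 3]])
>>> b = np.array([12, 10])
>>> x = np.linalg.solve(a, b)
>>> np.allclose(a @ x, b)    # True when a @ x reproduces b within tolerance
True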
PANDAS

>>> import pandas as pd

Importing & Exporting

# Read a CSV file and create a DataFrame
>>> pd.read_csv('filename.csv')

# Save the DataFrame to a CSV file
>>> df.to_csv('filename.csv')

# Read an Excel file and create a DataFrame
>>> pd.read_excel('filename.xlsx')

# Save the DataFrame to an Excel file
>>> df.to_excel('filename.xlsx')

Data Structures

# Create a Series
>>> s = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
>>> s
A    1
B    2
C    3
dtype: int64

>>> s.index
Index(['A', 'B', 'C'], dtype='object')
>>> s.dtype
dtype('int64')
>>> s.size
3
>>> s.ndim
1
>>> s.values
array([1, 2, 3], dtype=int64)

# Create a DataFrame
>>> df = pd.DataFrame({'ID': [1, 2, 3],
...                    'Name': ['Alex', 'Brian', 'David'],
...                    'Profession': ['DA', 'DE', 'DS']},
...                   index=['1 (One)', '2 (Two)', '3 (Three)'])
>>> df
           ID   Name Profession
1 (One)     1   Alex         DA
2 (Two)     2  Brian         DE
3 (Three)   3  David         DS

# Get the shape of the DataFrame
>>> df.shape
(3, 3)

# Get the column names of the DataFrame
>>> df.columns
Index(['ID', 'Name', 'Profession'], dtype='object')

# Get the index information of the DataFrame
>>> df.index
Index(['1 (One)', '2 (Two)', '3 (Three)'], dtype='object')

# Get the data types of the DataFrame columns
>>> df.dtypes
ID             int64
Name          object
Profession    object
dtype: object

# Check if there are any missing values in the DataFrame
>>> df.isnull().values.any()
False

# Get the count of missing values in each column
>>> df.isnull().sum()
ID            0
Name          0
Profession    0
dtype: int64

# Get the count of non-null values in each column
>>> df.count()
ID            3
Name          3
Profession    3
dtype: int64

# Generate descriptive statistics of the DataFrame (transposed)
>>> df.describe().T

Selection

# Select a specific row by its label
>>> df.loc['1 (One)']

# Select a specific row by its integer position
>>> df.iloc[1]

# Select a single column by its name
>>> df['ID']

# Select multiple columns by their names
>>> df[['ID', 'Profession']]

# Select a single column using label-based indexing
>>> df.loc[:, 'Name']

# Select a single column using integer-based indexing
>>> df.iloc[:, 1]

# Select a specific cell by its label-based row and column indices
>>> df.loc['2 (Two)', 'Name']

# Select a specific cell by its integer-based row and column indices
>>> df.iloc[1, 1]

# Select specific rows and columns using label-based indexing
>>> df.loc['1 (One)', ['ID', 'Profession']]

# Select specific rows and columns using integer-based indexing
# (iloc takes positions, not labels)
>>> df.iloc[1, [0, 2]]

Conditional Selection

# Select rows based on a condition
>>> df[df['ID'] > 1]

# Select rows based on multiple conditions using logical operators
>>> df[(df['ID'] > 2) & (df['Profession'] == 'DS')]

# Select rows where a column value is present in a given list
>>> df.loc[df['Name'].isin(['Alex', 'David'])]

Filtering & Sorting

# Filter columns using regular expression matching
>>> df.filter(regex=regex)

# Sort DataFrame by values in a specific column in descending order
>>> df.sort_values('ID', ascending=False)
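The regex above is a placeholder. A minimal sketch on the df defined above (the pattern itself is illustrative, not from the original sheet):

# Hedged sketch: keep only columns whose names match the pattern
>>> df.filter(regex='^Na')
            Name
1 (One)     Alex
2 (Two)    Brian
3 (Three)  David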
Editing

# Create a pivot table using 'col1' as index, 'col2' as columns, and
# 'col3' as values
>>> df.pivot(index='col1', columns='col2', values='col3')

# Reset the index of the DataFrame
>>> df.reset_index()

# Rename specific columns of the DataFrame (takes a mapping of
# old names to new names)
>>> df.rename(columns={'old_name': 'new_name'})

# Perform quantile-based discretization of values in 'col' into 3
# bins with custom labels
>>> pd.qcut(df['col'], 3, labels=qcut_labels)

# Perform value-based discretization of values in 'col' into custom
# bins with labels
>>> pd.cut(df['col'], bins=cut_bins, labels=cut_labels)
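cut_bins, cut_labels and qcut_labels above are placeholders. A small sketch with concrete values (the age column and labels are illustrative, not from the original sheet):

# Hedged sketch: discretizing an illustrative 'age' column
>>> ages = pd.DataFrame({'age': [5, 17, 25, 38, 61, 80]})
>>> pd.qcut(ages['age'], 3, labels=['young', 'middle', 'old'])    # equal-sized quantile bins
>>> pd.cut(ages['age'], bins=[0, 18, 65, 100],
...        labels=['child', 'adult', 'senior'])                   # fixed bin edges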

Deleting & Adding

# Drop rows with any missing values
>>> df.dropna()

# Drop columns with any missing values
>>> df.dropna(axis=1)

# Drop columns with fewer than n non-null values
>>> df.dropna(axis=1, thresh=n)

# Fill missing values with a specified value
>>> df.fillna(value)

Joining

# Merge operation
>>> df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
...                     'value1': [1, 2, 3, 4]})
>>> df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
...                     'value2': [5, 6, 7, 8]})
>>> pd.merge(df1, df2, on='key', how='inner')
  key  value1  value2
0   B       2       5
1   D       4       6

# Join operation (index-based)
>>> df3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
...                     'value3': ['apple', 'banana', 'cherry', 'date']})
>>> df4 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
...                     'value4': ['orange', 'grape', 'kiwi', 'lemon']})
>>> df3.set_index('key').join(df4.set_index('key'), how='inner')
     value3  value4
key
B    banana  orange
D      date   grape

# Concatenate operation
>>> df5 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
...                     'B': ['B0', 'B1', 'B2']})
>>> df6 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
...                     'B': ['B3', 'B4', 'B5']})
>>> pd.concat([df5, df6], axis=0)
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

Grouping & Aggregation

# Perform grouping operation on 'col' and obtain a GroupBy object
>>> df.groupby('col')

# Group by multiple columns ('col1' and 'col2') and calculate
# the mean of 'col3'
>>> df.groupby(['col1', 'col2']).agg({'col3': 'mean'})

# Group by multiple columns and calculate the mean of 'col3'
# and the count of 'col4'
>>> df.groupby(['col1', 'col2']).agg({'col3': 'mean', 'col4': 'count'})

# Group the DataFrame by the first level of the index
>>> df.groupby(level=0)

# An example
>>> df = pd.DataFrame({
...     'Name': ['John', 'Alice', 'John', 'Alice', 'Bob'],
...     'City': ['New York', 'Paris', 'London', 'Paris', 'London'],
...     'Age': [30, 25, 35, 28, 40],
...     'Salary': [50000, 60000, 55000, 45000, 70000]})
>>> df.groupby('City').agg({'Salary': 'mean', 'Age': 'max'})
           Salary  Age
City
London    62500.0   40
New York  50000.0   30
Paris     52500.0   28

Statistics

# Generate descriptive statistics of the DataFrame
>>> df.describe()

# Calculate the mean of each numeric column in the DataFrame
>>> df.mean()

# Calculate the correlation matrix of the DataFrame
>>> df.corr()

# Count the number of non-null values in each column of the DataFrame
>>> df.count()

# Find the maximum value in each column of the DataFrame
>>> df.max()

# Find the minimum value in each column of the DataFrame
>>> df.min()

# Calculate the median of each column in the DataFrame
>>> df.median()

# Calculate the standard deviation of each column in the DataFrame
>>> df.std()
SCIKIT-LEARN

Preprocessing

# Splitting data into train and test sets
from sklearn.model_selection import train_test_split

X = df[["Independent Variables"]]
y = df["Target Variable"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Handling Missing Values
from sklearn.impute import SimpleImputer, KNNImputer

# Dropping rows or columns
df.dropna(axis=0)
df.dropna(axis=1)

# Imputation (fit on the training set, then transform the test set
# to avoid leaking test-set information)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# K-Nearest Neighbors (KNN) imputation
knn_imputer = KNNImputer()
X_train_knn_imputed = knn_imputer.fit_transform(X_train)
X_test_knn_imputed = knn_imputer.transform(X_test)

# Handling Outliers
def suppress_outliers_iqr(df, col_name, q1=0.25, q3=0.75, multiplier=1.5):
    quartile1 = df[col_name].quantile(q1)
    quartile3 = df[col_name].quantile(q3)
    interquartile_range = quartile3 - quartile1
    lower_bound = quartile1 - multiplier * interquartile_range
    upper_bound = quartile3 + multiplier * interquartile_range
    # Suppress outliers by replacing them with the lower/upper bounds
    df.loc[df[col_name] < lower_bound, col_name] = lower_bound
    df.loc[df[col_name] > upper_bound, col_name] = upper_bound

# Example usage
suppress_outliers_iqr(dataframe, 'A')

# Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Numeric feature scaling (StandardScaler)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Numeric feature scaling (MinMaxScaler)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature Encoding
from sklearn.preprocessing import OneHotEncoder

# Categorical feature encoding (One-Hot Encoder)
encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

Supervised Learning Algorithms

# Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

# K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# CART
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Random Forests
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

# GBM
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

# XGBoost
!pip install xgboost
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

# LightGBM
!pip install lightgbm
from lightgbm import LGBMClassifier
model = LGBMClassifier()
model.fit(X_train, y_train)

# CatBoost
!pip install catboost
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)

Model Evaluation

# Classification metrics
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score,
                             classification_report, confusion_matrix)

y_pred = model.predict(X_test)

# Confusion Matrix
confusion_matrix(y_test, y_pred)

#                     Predicted
#                  False    True
# Actual  False      TN       FP
#         True       FN       TP

# Accuracy = ( TP + TN ) / ( TP + TN + FP + FN )
accuracy_score(y_test, y_pred)

# Precision = TP / ( TP + FP )
precision_score(y_test, y_pred)

# Recall = TP / ( TP + FN )
recall_score(y_test, y_pred)

# F1 Score = 2 * ( Precision * Recall ) / ( Precision + Recall )
f1_score(y_test, y_pred)

# AUC
roc_auc_score(y_test, y_pred)

# Classification Report
classification_report(y_test, y_pred)

# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

y_pred = model.predict(X_test)

# MAE
mean_absolute_error(y_test, y_pred)

# MSE
mean_squared_error(y_test, y_pred)

# RMSE
np.sqrt(mean_squared_error(y_test, y_pred))

# R-Square
r2_score(y_test, y_pred)

# Cross-Validation
cross_val_score(model, X, y, cv=5)

# Clustering metrics
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Adjusted Rand Index
adjusted_rand_score(y_true, y_pred)

# Homogeneity
homogeneity_score(y_true, y_pred)

# V-measure
v_measure_score(y_true, y_pred)

Unsupervised Learning Algorithms

# K-Means
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(df)

# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=5)
model.fit(df)

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df)

Model Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Logistic Regression parameter optimization
lr_param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
lr_grid_search = GridSearchCV(LogisticRegression(), lr_param_grid, cv=5)
lr_grid_search.fit(X_train, y_train)
lr_grid_search.best_params_   # Best parameters for Logistic Regression
lr_grid_search.best_score_    # Best score for Logistic Regression

# Decision Tree parameter optimization
dt_param_grid = {'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]}
dt_grid_search = GridSearchCV(DecisionTreeClassifier(), dt_param_grid, cv=5)
dt_grid_search.fit(X_train, y_train)
dt_grid_search.best_params_   # Best parameters for Decision Tree
dt_grid_search.best_score_    # Best score for Decision Tree

# Random Forest parameter optimization
rf_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
rf_random_search = RandomizedSearchCV(RandomForestClassifier(), rf_param_grid,
                                      n_iter=5, cv=5)
rf_random_search.fit(X_train, y_train)
rf_random_search.best_params_   # Best parameters for Random Forest
rf_random_search.best_score_    # Best score for Random Forest

# K-Nearest Neighbors parameter optimization
knn_param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
knn_random_search = RandomizedSearchCV(KNeighborsClassifier(), knn_param_grid,
                                       n_iter=5, cv=5)
knn_random_search.fit(X_train, y_train)
knn_random_search.best_params_   # Best parameters for K-Nearest Neighbors
knn_random_search.best_score_    # Best score for K-Nearest Neighbors

# GBM parameter optimization
gbm_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
gbm_grid_search = GridSearchCV(GradientBoostingClassifier(), gbm_param_grid, cv=5)
gbm_grid_search.fit(X_train, y_train)
gbm_grid_search.best_params_   # Best parameters for GBM
gbm_grid_search.best_score_    # Best score for GBM

# LightGBM parameter optimization
lgb_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
lgb_grid_search = GridSearchCV(LGBMClassifier(), lgb_param_grid, cv=5)
lgb_grid_search.fit(X_train, y_train)
lgb_grid_search.best_params_   # Best parameters for LightGBM
lgb_grid_search.best_score_    # Best score for LightGBM

# XGBoost parameter optimization
xgb_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
xgb_grid_search = GridSearchCV(XGBClassifier(), xgb_param_grid, cv=5)
xgb_grid_search.fit(X_train, y_train)
xgb_grid_search.best_params_   # Best parameters for XGBoost
xgb_grid_search.best_score_    # Best score for XGBoost

# CatBoost parameter optimization
cat_param_grid = {'iterations': [100, 200, 300], 'depth': [4, 6, 8]}
cat_random_search = RandomizedSearchCV(CatBoostClassifier(), cat_param_grid,
                                       n_iter=5, cv=5)
cat_random_search.fit(X_train, y_train)
cat_random_search.best_params_   # Best parameters for CatBoost
cat_random_search.best_score_    # Best score for CatBoost

# K-Means parameter optimization
kmeans_param_grid = {'n_clusters': [3, 5, 7], 'init': ['k-means++', 'random']}
kmeans_grid_search = GridSearchCV(KMeans(), kmeans_param_grid, cv=5)
kmeans_grid_search.fit(X_train)
kmeans_grid_search.best_params_   # Best parameters for K-Means
kmeans_grid_search.best_score_    # Best score for K-Means

# PCA parameter optimization
pca_param_grid = {'n_components': [2, 5, 10]}
pca_grid_search = GridSearchCV(PCA(), pca_param_grid, cv=5)
pca_grid_search.fit(X_train)
pca_grid_search.best_params_   # Best parameters for PCA
pca_grid_search.best_score_    # Best score for PCA
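After any of the searches above, the model refit on the full training set with the best parameters is available on the fitted search object. A brief sketch (using lr_grid_search from above):

# Hedged sketch: using the tuned model from a fitted search object
best_model = lr_grid_search.best_estimator_   # refit with the best parameters
y_pred = best_model.predict(X_test)           # predictions from the tuned model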
Data Visualization

Importing Libraries

import matplotlib.pyplot as plt
import seaborn as sns

Basic Line Plot

plt.plot(x, y, linestyle='-', color='b')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Example
# Sample dataset
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
revenue = [10000, 15000, 12000, 18000, 20000]

# Create the line plot
plt.plot(months, revenue, marker='o', linestyle='-', color='b')
plt.xlabel('Months')
plt.ylabel('Revenue ($)')
plt.title('Monthly Revenue')
plt.grid(True)
plt.show()

Bar Plot

plt.bar(x, y, color='b', width=0.5)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Horizontal bar
plt.barh(x, y)
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the average tip amount for each day of the week
avg_tip_by_day = tips.groupby('day')['tip'].mean()

# Create the bar plot
sns.barplot(x=avg_tip_by_day.index, y=avg_tip_by_day.values)
plt.xlabel('Day of the Week')
plt.ylabel('Average Tip Amount')
plt.title('Average Tip Amount by Day of the Week')
plt.show()

Box Plot

sns.boxplot(data=data, x=x, y=y, color=color)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Create the box plot with a different color
sns.boxplot(data=tips, x='day', y='total_bill', color='green')
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill Amount')
plt.title('Distribution of Total Bills by Day of the Week')
plt.show()

Heatmap

sns.heatmap(data, annot=True)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the correlation matrix (numeric columns only, since the
# dataset also contains categorical columns)
correlation_matrix = tips.corr(numeric_only=True)

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='YlGn')
plt.title('Correlation Heatmap')
plt.show()

Histogram

plt.hist(data, bins=10)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Create the histogram
sns.histplot(data=tips, x='tip', bins=10, color='green')
plt.xlabel('Tip Amount')
plt.ylabel('Count')
plt.title('Distribution of Tip Amounts')
plt.show()

Scatter Plot

plt.scatter(x, y, c='color')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

# Example
# Load the 'tips' dataset
df = sns.load_dataset("tips")

# Extract the total bill and tip amounts
total_bill = df["total_bill"].values
tip_amount = df["tip"].values

# Create a scatter plot
plt.scatter(total_bill, tip_amount, c='green')
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Scatter Plot: Tips Dataset')
plt.show()

Pie Chart

plt.pie(y, labels=labels, colors=colors)
plt.title('Title')
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the count of meals for each time of the day
meal_counts = tips['time'].value_counts()

# Specify custom labels and colors
labels = meal_counts.index
colors = ['darkgreen', 'aquamarine']

# Create the pie chart
plt.pie(meal_counts, labels=labels, colors=colors)
plt.title('Distribution of Meals by Time of the Day')
plt.show()

Multiple Subplots

fig, axs = plt.subplots(2, 2)
axs[0, 0].plot(x1, y1)
axs[0, 1].scatter(x2, y2)
axs[1, 0].bar(x3, height3)
axs[1, 1].hist(data4, bins=10)
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Prepare the data for plotting
x1 = tips['total_bill']
y1 = tips['tip']
x2 = tips['size']
y2 = tips['tip']
x3 = tips['day']
height3 = tips['total_bill']
data4 = tips['total_bill']

# Create the subplots
fig, axs = plt.subplots(2, 2)

# Plot 1: Line plot
axs[0, 0].plot(x1, y1)
axs[0, 0].set_xlabel('Total Bill')
axs[0, 0].set_ylabel('Tip')
axs[0, 0].set_title('Tip Amount vs Total Bill')

# Plot 2: Scatter plot
axs[0, 1].scatter(x2, y2)
axs[0, 1].set_xlabel('Party Size')
axs[0, 1].set_ylabel('Tip')
axs[0, 1].set_title('Tip Amount vs Party Size')

# Plot 3: Bar plot
axs[1, 0].bar(x3, height3)
axs[1, 0].set_xlabel('Day')
axs[1, 0].set_ylabel('Total Bill')
axs[1, 0].set_title('Total Bill by Day')

# Plot 4: Histogram
axs[1, 1].hist(data4, bins=10)
axs[1, 1].set_xlabel('Total Bill')
axs[1, 1].set_ylabel('Frequency')
axs[1, 1].set_title('Distribution of Total Bills')

# Adjust the layout and spacing
fig.tight_layout()

# Display the plot
plt.show()

Customizing Plots

plt.figure(figsize=(8, 6))
plt.plot(x, y, color='red', linestyle='--', linewidth=2,
         marker='o', markersize=6)
plt.xlabel('X-axis label', fontsize=12)
plt.ylabel('Y-axis label', fontsize=12)
plt.title('Title', fontsize=14)
plt.legend(['Legend'])
plt.grid(True)
plt.show()

# Example
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Prepare the data for plotting
x = tips['total_bill']
y = tips['tip']

# Create the plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, color='red', linestyle='--', linewidth=2,
         marker='o', markersize=6)
plt.xlabel('Total Bill', fontsize=12)
plt.ylabel('Tip', fontsize=12)
plt.title('Tip Amount vs Total Bill', fontsize=14)
plt.legend(['Tips'])
plt.grid(True)
plt.show()
MS SQL

Sample Data

1.1 Employees Table

EmployeeID | FirstName | LastName | DepartmentID | Salary
1          | John      | Doe      | 1            | 50000
2          | Jane      | Smith    | 2            | 60000
3          | Michael   | Johnson  | 1            | 55000
4          | Emily     | Williams | 3            | 52000
5          | William   | Brown    | 2            | 48000

1.2 Departments Table

DepartmentID | DepartmentName
1            | Sales
2            | Marketing
3            | HR
4            | Finance

Basic SQL Commands

-- Select all columns from a table
SELECT * FROM table_name;

-- Select specific columns from a table
SELECT column1, column2 FROM table_name;

-- Filter rows with a condition
SELECT * FROM table_name WHERE condition;

-- Insert a new row into a table
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);

-- Update existing rows in a table
UPDATE table_name SET column1 = value1,
column2 = value2 WHERE condition;

-- Delete rows from a table
DELETE FROM table_name WHERE condition;

Selecting Data

-- Select all employees
SELECT * FROM Employees;

-- Select specific columns from the Employees table
SELECT FirstName, LastName, Salary FROM Employees;

-- Filter employees in the Sales department
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID = 1;

-- Select employees with a salary greater than 52000
SELECT FirstName, LastName, Salary
FROM Employees WHERE Salary > 52000;
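The INSERT, UPDATE and DELETE templates under Basic SQL Commands can be applied to the sample Employees table; a hedged sketch with illustrative values (not part of the original sheet):

-- Hedged sketch: modifying the sample Employees table
INSERT INTO Employees (EmployeeID, FirstName, LastName, DepartmentID, Salary)
VALUES (6, 'Olivia', 'Davis', 4, 53000);

UPDATE Employees SET Salary = 51000
WHERE EmployeeID = 1;

DELETE FROM Employees WHERE EmployeeID = 6;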

Filtering

-- Using the LIKE operator for pattern matching
SELECT * FROM table_name WHERE column_name
LIKE 'pattern%';

-- Using the IN operator to filter by multiple values
SELECT * FROM table_name WHERE column_name
IN (value1, value2, value3);

-- Using the BETWEEN operator to filter within a range
SELECT * FROM table_name WHERE column_name
BETWEEN value1 AND value2;

-- Using the IS NULL operator to filter NULL values
SELECT * FROM table_name WHERE column_name IS NULL;

-- Example: employees with names starting with 'J'
SELECT FirstName, LastName FROM Employees
WHERE FirstName LIKE 'J%';

-- Example: employees from specific departments
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID IN (1, 2);

-- Example: employees with salaries between 50000 and 55000
SELECT FirstName, LastName FROM Employees
WHERE Salary BETWEEN 50000 AND 55000;

-- Example: employees without a department
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID IS NULL;

Aliases and Calculated Columns

-- Alias for column names
SELECT column_name AS alias_name FROM table_name;

-- Calculated columns in SELECT
SELECT column1, column2, column1 + column2
AS calculated_column FROM table_name;

-- Example: aliases
SELECT FirstName AS First, LastName AS Last
FROM Employees;

-- Example: calculated column
SELECT FirstName, LastName, Salary, Salary * 12
AS AnnualSalary FROM Employees;

Conditional Statements

-- Simple CASE expression
SELECT column_name, CASE
    WHEN column_name = value1 THEN 'Result 1'
    WHEN column_name = value2 THEN 'Result 2'
    ELSE 'Default Result'
END AS result FROM table_name;

-- Example: categorize employees based on salary
SELECT FirstName, LastName,
CASE
    WHEN Salary >= 55000 THEN 'High'
    WHEN Salary >= 50000 THEN 'Medium'
    ELSE 'Low'
END AS SalaryCategory FROM Employees;

Working with Dates

-- Get the current date and time
SELECT GETDATE();

-- Format dates
SELECT CONVERT(varchar, date_column, 103)
AS formatted_date FROM table_name;

-- Extract parts of a date
SELECT DATEPART(year, date_column) AS year
FROM table_name;

-- Example: current date
SELECT GETDATE() AS CurrentDate;

-- Example: format hire dates using CONVERT
SELECT FirstName, LastName,
CONVERT(varchar, HireDate, 103)
AS FormattedHireDate FROM Employees;

-- Example: extract the year from the HireDate
SELECT FirstName, LastName,
DATEPART(year, HireDate) AS HireYear
FROM Employees;

Aggregation Functions

-- Calculate the sum of a column
SELECT SUM(column_name) FROM table_name;

-- Calculate the average of a column
SELECT AVG(column_name) FROM table_name;

-- Get the maximum value from a column
SELECT MAX(column_name) FROM table_name;

-- Get the minimum value from a column
SELECT MIN(column_name) FROM table_name;

-- Count the number of rows in a table
SELECT COUNT(*) FROM table_name;

-- Example: total salary of all employees
SELECT SUM(Salary) AS TotalSalary FROM Employees;

-- Example: average salary
SELECT AVG(Salary) AS AverageSalary FROM Employees;

-- Example: highest salary
SELECT MAX(Salary) AS MaxSalary FROM Employees;

-- Example: number of employees in the Marketing department
SELECT COUNT(*) AS NumberOfEmployees
FROM Employees WHERE DepartmentID = 2;

Sorting and Grouping

-- Order rows in ascending order
SELECT * FROM table_name ORDER BY column_name ASC;

-- Order rows in descending order
SELECT * FROM table_name ORDER BY column_name DESC;

-- Group rows based on a column
SELECT column_name, COUNT(*) FROM table_name
GROUP BY column_name;

-- Example: order employees by salary in descending order
SELECT FirstName, LastName, Salary
FROM Employees ORDER BY Salary DESC;

-- Example: group employees by department and calculate the
-- total salary for each department
SELECT DepartmentID, SUM(Salary) AS TotalSalary
FROM Employees GROUP BY DepartmentID;

Joins

-- Inner Join
SELECT * FROM table1 INNER JOIN table2
ON table1.column_name = table2.column_name;

-- Left Join
SELECT * FROM table1 LEFT JOIN table2
ON table1.column_name = table2.column_name;

-- Right Join
SELECT * FROM table1 RIGHT JOIN table2
ON table1.column_name = table2.column_name;

-- Full Outer Join
SELECT * FROM table1 FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;

-- Example: Inner Join to get employee details along with
-- their department names
SELECT e.FirstName, e.LastName, d.DepartmentName
FROM Employees e INNER JOIN Departments d
ON e.DepartmentID = d.DepartmentID;

-- Example: Left Join to include all departments, even
-- those without employees
SELECT d.DepartmentName, e.FirstName, e.LastName
FROM Departments d LEFT JOIN Employees e
ON d.DepartmentID = e.DepartmentID;
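The RIGHT and FULL OUTER variants above are not exercised by the original examples; on the sample tables they would look like this (a hedged sketch):

-- Hedged sketch: Right Join keeps every department (the right table),
-- matching employees where they exist
SELECT e.FirstName, e.LastName, d.DepartmentName
FROM Employees e RIGHT JOIN Departments d
ON e.DepartmentID = d.DepartmentID;

-- Hedged sketch: Full Outer Join keeps unmatched rows from both sides
SELECT e.FirstName, e.LastName, d.DepartmentName
FROM Employees e FULL OUTER JOIN Departments d
ON e.DepartmentID = d.DepartmentID;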
Creating and Modifying Tables

-- Create a new table
CREATE TABLE table_name (
    column1 datatype1 constraint1,
    column2 datatype2 constraint2
);

-- Add a new column to an existing table
ALTER TABLE table_name ADD column_name datatype;

-- Modify an existing column
ALTER TABLE table_name ALTER COLUMN column_name new_datatype;

-- Add a primary key constraint
ALTER TABLE table_name ADD CONSTRAINT pk_constraint
PRIMARY KEY (column_name);

-- Example: create a new table
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100)
);

-- Example: add a new column to an existing table
ALTER TABLE Employees ADD Age INT;

-- Example: modify an existing column
ALTER TABLE Employees ALTER COLUMN Salary DECIMAL(10, 2);

-- Example: add a primary key constraint
ALTER TABLE Departments ADD CONSTRAINT PK_DepartmentID
PRIMARY KEY (DepartmentID);

Indexes

-- Create an index
CREATE INDEX index_name ON table_name(column_name);

-- Delete an index
DROP INDEX table_name.index_name;

-- Example: create an index
CREATE INDEX IDX_Employees_DepartmentID
ON Employees (DepartmentID);

-- Example: delete an index
DROP INDEX Employees.IDX_Employees_DepartmentID;
Probability

Probability Basics

Experiment : A process that results in an outcome.
Sample Space (S) : The set of all possible outcomes of an experiment.
Event (E) : A subset of the sample space.
Probability (P) : A measure of the likelihood of an event occurring.

Probability Axioms

Non-Negativity : For all events A, P(A) >= 0.
Additivity : For mutually exclusive events A and B,
P(A ∪ B) = P(A) + P(B).
Normalization : The probability of the entire sample space S is 1:
P(S) = 1.

Discrete Uniform Law (Classical Probability) : For equally likely outcomes,

    P(E) = (Number of favorable outcomes) / (Total number of outcomes)

Relative Frequency Probability :

    P(E) = (Frequency of E occurring) / (Total number of trials)

Conditional Probability P(E|F) : Probability of E given that F has
occurred. It is calculated as:

    P(E|F) = P(E ∩ F) / P(F),  where P(F) > 0

Bayes' Theorem

Bayes' Theorem helps us update our initial beliefs (prior
probabilities) based on new information (likelihood) to arrive at a
more accurate or informed estimate of the probability of an event
occurring (posterior probability). Mathematically, Bayes' Theorem is
stated as follows:

    P(E|F) = P(F|E) · P(E) / P(F)
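A small worked example (added for illustration; the numbers are assumed, not from the original sheet): suppose a disease D affects 1% of a population, a test detects it with probability 0.95, and it gives a false positive with probability 0.05. Then:

    P(D) = 0.01                          (prior: prevalence)
    P(+|D) = 0.95                        (likelihood: sensitivity)
    P(+) = 0.95·0.01 + 0.05·0.99 = 0.059 (total probability of a positive test)
    P(D|+) = 0.95·0.01 / 0.059 ≈ 0.161   (posterior: only ~16% despite the positive test)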
Random Variables

Random Variable : A random variable assigns a numerical value to each
outcome in a sample space. It can be discrete or continuous.

Expected Value : The expected value of a random variable represents
the average or mean value it takes over all possible outcomes. It
provides a measure of the central tendency of the distribution.

Variance : The variance of a random variable measures the extent to
which the values of the variable deviate from its expected value. It
quantifies the spread or dispersion of the distribution.

Discrete Random Variables

For a discrete random variable X, which takes on distinct values from
a finite or countable set, we have the following key concepts:

Probability Mass Function (PMF) : The PMF P(X = x) of a discrete
random variable X gives the probability that X takes on the value x.
It describes the distribution of probabilities across all possible
values of X.

Expected value of a discrete random variable X is calculated as the
sum of each value x weighted by its probability P(X = x):

    E(X) = Σ x · P(X = x)

Variance of a discrete random variable X is calculated as the average
of the squared differences between each value x and the expected
value E(X):

    Var(X) = Σ (x - E(X))² · P(X = x)

Continuous Random Variables

For a continuous random variable X, which can take on any value
within a certain range, we have the following key concepts:

Probability Density Function (PDF) : The PDF f(x) of a continuous
random variable X provides the relative likelihood that X falls
within a specific interval. Unlike discrete random variables, the
probability that X takes on a specific value is generally zero.
Instead, the probability is associated with intervals.

Expected value of a continuous random variable X is the integral of x
multiplied by the PDF f(x) over the entire range of X:

    E(X) = ∫ x · f(x) dx

Variance of a continuous random variable X is calculated similarly:

    Var(X) = ∫ (x - E(X))² · f(x) dx
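For a fair six-sided die (a worked example added for illustration), each face has probability 1/6, so the definitions above give:

    E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
    Var(X) = [(1-3.5)² + (2-3.5)² + ... + (6-3.5)²] / 6 = 35/12 ≈ 2.92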
Common Probability Distributions

Discrete Distributions

Bernoulli Distribution
The Bernoulli distribution models a random experiment with two
possible outcomes, usually labeled as "success" and "failure".
PMF : P(X = x) = p^x · (1-p)^(1-x), x ∈ {0, 1}
Parameters : p (probability of success)
Expected Value : E(X) = p
Variance : Var(X) = p(1-p)

Binomial Distribution
The binomial distribution describes the number of successes in a
fixed number of independent Bernoulli trials. Say you conduct n
trials and you are interested in how many times you will get a
success: the binomial distribution provides the probabilities of
getting 0, 1, 2, ..., n successes.
PMF : P(X = k) = C(n, k) · p^k · (1-p)^(n-k)
Parameters : n (number of trials), p (probability of success)
Expected Value : E(X) = n·p
Variance : Var(X) = n·p·(1-p)

Poisson Distribution
The Poisson distribution is a probability distribution that describes
the number of events that occur in a fixed interval of time or space,
given a certain average rate of occurrence.
PMF : P(X = k) = λ^k · e^(-λ) / k!
Parameters : λ (average rate of events)
Expected Value : E(X) = λ
Variance : Var(X) = λ

Continuous Distributions

Uniform Distribution
The uniform distribution is a probability distribution that describes
a situation where all outcomes in a given range are equally likely.
In other words, each possible value within the range has the same
probability of occurring.
PDF : f(x) = 1 / (b - a) for a <= x <= b
Parameters : a (lower bound), b (upper bound)
Expected Value : E(X) = (a + b) / 2
Variance : Var(X) = (b - a)² / 12

Normal (Gaussian) Distribution
The normal distribution, also known as the Gaussian distribution or
bell curve, is a fundamental concept in statistics and probability
theory. It describes a symmetrical probability distribution that is
characterized by its bell-shaped curve.
PDF : f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))
Parameters : μ (mean), σ (standard deviation)
Expected Value : E(X) = μ
Variance : Var(X) = σ²

Exponential Distribution
The exponential distribution models the time between successive
events in a process where events occur randomly and independently at
a constant average rate.
PDF : f(x) = λ · e^(-λx) for x >= 0
Parameters : λ (rate parameter)
Expected Value : E(X) = 1/λ
Variance : Var(X) = 1/λ²

Chi-Squared Distribution
The chi-squared distribution emerges from the sum of the squares of
independent standard normal random variables.
PDF : f(x) = x^(k/2 - 1) · e^(-x/2) / (2^(k/2) · Γ(k/2)) for x > 0
Parameters : k (degrees of freedom)
Expected Value : E(X) = k
Variance : Var(X) = 2k

Student's t-Distribution
The t-distribution models the variability of sample means when
drawing small samples from a population. It is characterized by its
bell-shaped curve, similar to the normal distribution.
PDF : f(t) = Γ((ν+1)/2) / (√(νπ) · Γ(ν/2)) · (1 + t²/ν)^(-(ν+1)/2)
Parameters : ν (degrees of freedom)
Expected Value : E(X) = 0 for ν > 1
Variance : Var(X) = ν / (ν - 2) for ν > 2

F-Distribution
The F-distribution models the ratio of two independent chi-squared
distributions divided by their respective degrees of freedom.
PDF : depends on the degrees of freedom
Parameters : d1 (numerator degrees of freedom),
             d2 (denominator degrees of freedom)
Expected Value : E(X) = d2 / (d2 - 2) for d2 > 2
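All of the distributions above are available in scipy.stats. A brief hedged sketch (scipy is an assumption here, since it is not used elsewhere on this sheet) evaluating a few of the quantities defined above:

# Hedged sketch: checking the formulas above with scipy.stats
from scipy import stats

stats.binom.pmf(k=3, n=10, p=0.5)    # P(X = 3) for Binomial(10, 0.5) ≈ 0.117
stats.poisson.pmf(k=2, mu=1.5)       # P(X = 2) for Poisson(λ = 1.5) ≈ 0.251
stats.norm.pdf(x=0, loc=0, scale=1)  # standard normal density at 0 ≈ 0.399
stats.expon.mean(scale=2)            # E(X) = 1/λ with λ = 0.5 -> 2.0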
Fundamental Results

The Law of Large Numbers

Weak Law of Large Numbers (WLLN) : The sample mean of a large number
of independent and identically distributed random variables
approaches the expected value.
Strong Law of Large Numbers (SLLN) : The sample mean converges almost
surely to the expected value.

# Import necessary libraries
import random
import matplotlib.pyplot as plt

# Function to simulate rolling a fair six-sided die
def roll_die():
    return random.randint(1, 6)

# Number of trials to perform for each sample size
num_trials = 10000

# List of sample sizes to investigate
sample_sizes = [10, 50, 100, 500, 1000, 5000]

# Iterate through each sample size
for sample_size in sample_sizes:
    trial_means = []
    # Perform num_trials trials for the current sample size
    for _ in range(num_trials):
        # Simulate rolling the die 'sample_size' times and calculate the mean
        sample = [roll_die() for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        trial_means.append(mean)
    # Create a histogram of trial means with 20 bins and add it to the plot
    plt.hist(trial_means, bins=20, alpha=0.5,
             label=f'Sample size: {sample_size}')

# Label the x and y axes and set the title of the plot
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Law of Large Numbers')
plt.legend()
plt.show()

The Central Limit Theorem

The distribution of the sum (or average) of a large number of
independent, identically distributed random variables approaches a
normal distribution.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Define the size of the population and generate a random
# population with an exponential distribution
population_size = 100000
population = np.random.exponential(scale=2, size=population_size)

# Define a list of sample sizes that we want to explore
sample_sizes = [10, 50, 100, 500]

# Iterate through each sample size
for sample_size in sample_sizes:
    # Generate 1000 random samples of the specified size from the population
    sample_means = [np.mean(np.random.choice(population, size=sample_size))
                    for _ in range(1000)]
    # Create a histogram of the sample means with 30 bins and add it to the plot
    plt.hist(sample_means, bins=30, alpha=0.5,
             label=f'Sample size: {sample_size}')

# Add labels and a title to the plot
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Central Limit Theorem')
plt.legend()
plt.show()
