Miuul Data Scientist Bootcamp CheatSheet Collections

PYTHON
Cheat Sheet

Data Structures

Numbers

# Integer (int)
>>> x = 2

# Float
>>> x = 2.3

# Complex
>>> x = 2j + 1

Operations

# Addition
>>> 2+2
4

# Subtraction
>>> 5-2
3

# Multiplication
>>> 3*3
9

# Division
>>> 22/8
2.75

# Integer Division
>>> 22//8
2

# Modulus / Remainder
>>> 22%8
6

# Exponent
>>> 2**3
8

Strings

>>> txt_1 = 'Hello World!'
'Hello World!'

>>> txt_2 = "HELLO WORLD!"
'HELLO WORLD!'

>>> long_txt = """
Hello World,
Welcome DSMLBC!
"""
'Hello World,\nWelcome DSMLBC!'

Indexing & Slicing

>>> txt_1[0]
'H'

>>> txt_1[-1]
'!'

>>> txt_1[1:4]
'ell'

>>> txt_1[:5]
'Hello'

String Methods

>>> len(txt_1)
12

>>> txt_1.upper()
'HELLO WORLD!'

>>> txt_1.lower()
'hello world!'

>>> txt_1.replace("World", "Era")
'Hello Era!'

>>> txt_1.split()
['Hello', 'World!']

>>> ' Hello World! '.strip()
'Hello World!'

>>> 'hello world!'.capitalize()
'Hello world!'

Boolean

>>> a = 2
>>> b = 7

>>> a == b
False

>>> a != b
True

>>> a > b
False

>>> a >= b
False

>>> (a > 1) & (b < 10)
True

>>> a is not b
True
Lists

>>> list_1 = [1, 2, 3, "a", "b"]
>>> list_2 = [True, [1, 2, 3]]
>>> numbers = [4, 2, 1, 3]

Indexing & Slicing

>>> list_1[3]
'a'

>>> list_2[-1]
[1, 2, 3]

>>> list_1[0:4]
[1, 2, 3, 'a']

>>> list_1 + list_2
[1, 2, 3, 'a', 'b', True, [1, 2, 3]]

List Methods

# Each example below starts from the lists defined above; the line under
# each call shows the list after the in-place change.

>>> len(list_1)
5

>>> list_1.append('c')
[1, 2, 3, 'a', 'b', 'c']

>>> list_2.insert(1, False)
[True, False, [1, 2, 3]]

>>> list_2.remove(True)
[[1, 2, 3]]

>>> numbers.sort()
[1, 2, 3, 4]

>>> numbers.pop()
[4, 2, 1]

Tuples

>>> tuple_1 = ("john", "mark", 1, 2)

>>> tuple_1[0]
'john'

>>> tuple_1[1:3]
('mark', 1)

>>> len(tuple_1)
4

Sets

>>> set_1 = {1, 2, 2, 3, 3, 3}
{1, 2, 3}

>>> set_2 = set([3, 4, 4, 5, 5, 5, 6])
{3, 4, 5, 6}

Set Methods

>>> set_1.difference(set_2)
{1, 2}

>>> set_1.intersection(set_2)
{3}

>>> set_1.issubset(set_2)
False

>>> set_1.issuperset(set_2)
False

Dictionaries

>>> dict_1 = {"REG": "Regression",
...           "LOG": "Logistic Regression",
...           "CART": "Classification and Reg"}

>>> dict_2 = {"REG": ["RMSE", 10],
...           "LOG": ["MSE", 20],
...           "CART": ["SSE", 30]}

Key - Value Methods

>>> dict_1.keys()
dict_keys(['REG', 'LOG', 'CART'])

>>> dict_1.values()
dict_values(['Regression', 'Logistic Regression', 'Classification and Reg'])

>>> dict_1["REG"]
'Regression'

>>> dict_2["CART"][1]
30
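The original sheet shows keys() and values(); items() (a standard dict method, added here as a small complement) walks both together, using the dict_1 defined above:

>>> dict_1.items()
dict_items([('REG', 'Regression'), ('LOG', 'Logistic Regression'), ('CART', 'Classification and Reg')])

>>> for key, value in dict_1.items():
...     print(key, value)
REG Regression
LOG Logistic Regression
CART Classification and Reg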
Comprehensions

List Comprehensions

# Syntax
[expression for item in iterable if condition]

>>> squares = [x**2 for x in range(1, 11)]
>>> print(squares)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

>>> evens = [f"{x}: even" for x in range(1, 10) if x % 2 == 0]
>>> print(evens)
['2: even', '4: even', '6: even', '8: even']

>>> even_odd = [f"{x}: even" if x % 2 == 0 else f"{x}: odd"
...             for x in range(1, 5)]
>>> print(even_odd)
['1: odd', '2: even', '3: odd', '4: even']

Dictionary Comprehensions

# Syntax
{key_exp: value_exp for item in iterable if condition}

>>> dictionary = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

>>> {k: v ** 2 for (k, v) in dictionary.items()}
{'a': 1, 'b': 4, 'c': 9, 'd': 16}

>>> {k.upper(): v for (k, v) in dictionary.items()}
{'A': 1, 'B': 2, 'C': 3, 'D': 4}

>>> {k.upper(): v * 2 for (k, v) in dictionary.items()}
{'A': 2, 'B': 4, 'C': 6, 'D': 8}

Loops

For Loop

# Syntax
for <variable> in <iterable>:
    <code block>

>>> students = ["John", "Mark", "Venessa"]

>>> for student in students:
...     print(student)
John
Mark
Venessa

>>> for index, student in enumerate(students):
...     print(index, student)
0 John
1 Mark
2 Venessa

While Loop

# Syntax
while <condition>:
    <code to execute while the condition is true>

>>> i = 1
>>> while i < 5:
...     print(i)
...     if i == 3:
...         break
...     i += 1
1
2
3

Conditional Statements

>>> number = 3.14
>>> if number > 0:
...     print(f"{number} is positive.")
... elif number < 0:
...     print(f"{number} is negative.")
... else:
...     print(f"{number} is zero!")
3.14 is positive.

Functions

# Function_1
>>> def say_hi(name):
...     print(f'Hello {name}')

# Calling function
>>> say_hi('Miuul')
Hello Miuul

# Function_2
>>> def summer(num_1, num_2):
...     """
...     Sum of two numbers.
...     Args:
...         num_1: int, float
...         num_2: int, float
...     Returns:
...         int, float
...     """
...     return num_1 + num_2

# Calling function
>>> summer(3, 4)
7

# Function_3
>>> def find_volume(length=1, width=1, depth=1):
...     print(f'Length = {length}')
...     print(f'Width = {width}')
...     print(f'Depth = {depth}')
...     volume = length * width * depth
...     return volume

# Calling function
>>> find_volume(1, 2, 3)
Length = 1
Width = 2
Depth = 3
6

>>> find_volume(2, depth=3, width=4)
Length = 2
Width = 4
Depth = 3
24

Local & Global Variables

>>> list_store = [1, 2]  # Global variable

>>> def add_element(a, b):
...     c = a * b  # Local variable
...     list_store.append(c)
...     print(list_store)

>>> add_element(1, 9)
[1, 2, 9]

Built-in Functions

# lambda
>>> summer = lambda a, b: a + b
>>> summer(3, 5)
8

# filter
>>> numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list(filter(lambda x: x % 2 == 0, numbers))
[2, 4, 6, 8, 10]

# zip
>>> students = ["John", "Mark", "Venessa"]
>>> departments = ["mathematics", "statistics", "physics"]
>>> list(zip(students, departments))
[('John', 'mathematics'), ('Mark', 'statistics'), ('Venessa', 'physics')]
NumPy
Cheat Sheet

import numpy as np

NumPy Arrays

# One-dimensional array
>>> np.array([1, 2, 3])
array([1, 2, 3])

# Two-dimensional array
>>> np.array([(1, 2, 3), (4, 5, 6)])
array([[1, 2, 3],
       [4, 5, 6]])

# Array of zeros with 3 elements
>>> np.zeros(3)
array([0., 0., 0.])

# 2x2 array filled with ones
>>> np.ones((2, 2))
array([[1., 1.],
       [1., 1.]])

# 3x3 identity matrix
>>> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

# 2x3 array filled with 4s
>>> np.full((2, 3), 4)
array([[4, 4, 4],
       [4, 4, 4]])

# 1D array of length 3 with random integers between 0 and 10
>>> np.random.randint(0, 10, size=3)
array([8, 2, 7])

# 2x3 array of random numbers from a normal distribution
# with mean 0 and standard deviation 1
>>> np.random.normal(0, 1, (2, 3))
array([[ 0.36527849, -2.48435406,  0.77739812],
       [ 0.07923544, -0.30833118,  0.32393125]])

# 1D array with 3 evenly spaced values between 0 and 10
>>> np.linspace(0, 10, 3)
array([ 0.,  5., 10.])

Attributes of Arrays

>>> a = np.array([(1, 2, 3), (4, 5, 6)])
array([[1, 2, 3],
       [4, 5, 6]])

>>> a.shape
(2, 3)

>>> a.size
6

>>> a.dtype
dtype('int32')

>>> a.astype(float)
array([[1., 2., 3.],
       [4., 5., 6.]])
Data Manipulation

>>> a.flatten()
array([1, 2, 3, 4, 5, 6])

>>> a.resize((6, 1))
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

>>> a.T
array([[1, 4],
       [2, 5],
       [3, 6]])

>>> b = np.random.randint(1, 10, size=9)
array([4, 7, 1, 8, 2, 3, 3, 7, 8])

>>> b.reshape(3, 3)
array([[4, 7, 1],
       [8, 2, 3],
       [3, 7, 8]])

Indexing / Fancy Indexing

>>> a = np.array([1, 2, 3, 4, 5])

>>> a[0]
1

# Access elements at indices 0, 2, and 4
>>> indices = np.array([0, 2, 4])
>>> a[indices]
array([1, 3, 5])

>>> b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

# Access the element at row 1, column 2 of the array
>>> b[1, 2]
6

Slicing

# Slice the array from index 2 up to index 5 (exclusive)
>>> a[2:5]
array([3, 4, 5])

# Slice the array from index 1 to the end
>>> a[1:]
array([2, 3, 4, 5])

# Slice the array from the beginning up to index 3 (exclusive)
>>> a[:3]
array([1, 2, 3])

# Slice the array in reverse order
>>> a[::-1]
array([5, 4, 3, 2, 1])

# Slice the first row
>>> b[0, :]
array([1, 2, 3])

# Slice the second column
>>> b[:, 1]
array([2, 5, 8])

# Slice the sub-array from row 1 to the end and columns 0 up to 2 (exclusive)
>>> b[1:, :2]
array([[4, 5],
       [7, 8]])

# Conditional selection
>>> a[a > 2]
array([3, 4, 5])

>>> a[(a > 1) & (a < 4)]
array([2, 3])

>>> np.where(a > 2, a, 0)
array([0, 0, 3, 4, 5])

Numerical Operations

>>> a = np.array([1, 2, 3, 4, 5])
>>> b = np.array([6, 7, 8, 9, 10])

>>> a + b
array([ 7,  9, 11, 13, 15])

>>> a - b
array([-5, -5, -5, -5, -5])

>>> a * b
array([ 6, 14, 24, 36, 50])

>>> b / a
array([6.        , 3.5       , 2.66666667, 2.25      , 2.        ])

>>> b // a
array([6, 3, 2, 2, 2], dtype=int32)

>>> a ** b
array([      1,     128,    6561,  262144, 9765625], dtype=int32)

>>> np.log1p(b)
array([1.94591015, 2.07944154, 2.19722458, 2.30258509, 2.39789527])

>>> np.sin(a)
array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427])

>>> np.sqrt(b)
array([2.44948975, 2.64575131, 2.82842712, 3.        , 3.16227766])

>>> np.ceil(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))
array([2., 3., 4., 5., 6., 8.])

>>> np.floor(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))
array([1., 2., 3., 4., 5., 7.])

>>> np.round(np.array([1.2, 2.3, 3.4, 4.5, 5.6, 7.6]))
array([1., 2., 3., 4., 6., 8.])

>>> np.abs(np.array([-1, 2, -3, -4, 5, 6, -7]))
array([1, 2, 3, 4, 5, 6, 7])
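The element-wise operations above pair two arrays of the same shape; NumPy also broadcasts a scalar across an array. A short illustration added here (not on the original sheet), reusing the a defined above:

# A scalar is broadcast to every element
>>> a + 10
array([11, 12, 13, 14, 15])

>>> a * 2
array([ 2,  4,  6,  8, 10])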
Statistics

>>> a = np.array([1, 2, 3, 4, 5])

>>> np.mean(a)
3.0

>>> a.min()
1

>>> np.var(a)
2.0

>>> np.std(a)
1.4142135623730951

>>> np.corrcoef(a)
1.0

Combining

>>> a = np.array([1, 2, 3, 4, 5])
>>> b = np.array([6, 7, 8, 9, 10])

# Concatenating arrays vertically (along rows)
>>> np.vstack((a, b))
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

>>> np.stack((a, b))
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])

Linear Algebra

# Solve the linear system
#   x0 + 2*x1 + 3*x2 = 10
# 4*x0 + 5*x1 + 6*x2 = 11
# 7*x0 + 8*x1 + 9*x2 = 12
>>> a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> b = np.array([10, 11, 12])

>>> np.linalg.solve(a, b)
array([-25.33333333,  41.66666667, -16.        ])
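A quick sanity check, not on the original sheet: multiply the coefficient matrix back by the solution and compare it with the right-hand side. (The 3x3 matrix above happens to be singular, so depending on the NumPy/LAPACK build, solve may instead raise LinAlgError; with a non-singular matrix the check below always applies.)

>>> x = np.linalg.solve(a, b)
>>> np.allclose(a @ x, b)
True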
PANDAS
Cheat Sheet

import pandas as pd

Importing & Exporting

# Read a CSV file and create a DataFrame
>>> pd.read_csv('filename.csv')

# Save the DataFrame to a CSV file
>>> df.to_csv('filename.csv')

# Read an Excel file and create a DataFrame
>>> pd.read_excel('filename.xlsx')

# Save the DataFrame to an Excel file
>>> df.to_excel('filename.xlsx')

Data Structures

Series

# Create a Series
>>> s = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
A    1
B    2
C    3
dtype: int64

>>> s.index
Index(['A', 'B', 'C'], dtype='object')

>>> s.dtype
dtype('int64')

>>> s.size
3

>>> s.ndim
1

>>> s.values
array([1, 2, 3], dtype=int64)
DataFrame

# Create a DataFrame
>>> df = pd.DataFrame({'ID': [1, 2, 3],
...                    'Name': ['Alex', 'Brian', 'David'],
...                    'Profession': ['DA', 'DE', 'DS']},
...                   index=['1 (One)', '2 (Two)', '3 (Three)'])

           ID   Name Profession
1 (One)     1   Alex         DA
2 (Two)     2  Brian         DE
3 (Three)   3  David         DS

# Get the shape of the DataFrame
>>> df.shape
(3, 3)

# Get the column names of the DataFrame
>>> df.columns
Index(['ID', 'Name', 'Profession'], dtype='object')

# Get the data types of the DataFrame columns
>>> df.dtypes
ID             int64
Name          object
Profession    object
dtype: object

# Check if there are any missing values in the DataFrame
>>> df.isnull().values.any()
False

# Get the count of missing values in each column
>>> df.isnull().sum()
ID            0
Name          0
Profession    0
dtype: int64

# Get the count of non-null values in each column
>>> df.count()
ID            3
Name          3
Profession    3
dtype: int64

# Generate descriptive statistics of the DataFrame (transposed)
>>> df.describe().T

Selection

Selecting Rows

# Select a specific row by its label
>>> df.loc['1 (One)']

# Select a specific row by its integer position
>>> df.iloc[1]

Selecting Columns

# Select a single column by its name
>>> df['ID']

# Select multiple columns by their names
>>> df[['ID', 'Profession']]

# Select a single column using label-based indexing
>>> df.loc[:, 'Name']

# Select a single column using integer-based indexing
>>> df.iloc[:, 1]

Selecting Rows and Columns

# Select a specific cell by its label-based row and column indices
>>> df.loc['2 (Two)', 'Name']

# Select a specific cell by its integer-based row and column indices
>>> df.iloc[1, 1]

# Select specific rows and columns using label-based indexing
>>> df.loc['1 (One)', ['ID', 'Profession']]

# Select specific rows and columns using integer-based indexing
>>> df.iloc[1, [0, 2]]
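Label-based selection and the boolean filters in the next section combine naturally; a short illustration on the df defined above (this combination is an addition, not on the original sheet):

# Rows where ID is greater than 1, but only the 'Name' column
>>> df.loc[df['ID'] > 1, 'Name']
2 (Two)      Brian
3 (Three)    David
Name: Name, dtype: object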
Filtering & Sorting

# Filter rows based on a condition
>>> df[df['ID'] > 1]

# Select rows based on multiple conditions using logical operators
>>> df[(df['ID'] > 2) & (df['Profession'] == 'DS')]

# Select rows where a column value is present in a given list
>>> df.loc[df['Name'].isin(['Alex', 'David'])]

# Filter columns using regular expression matching
>>> df.filter(regex=regex)

# Sort DataFrame by values in a specific column in descending order
>>> df.sort_values('ID', ascending=False)

Editing

# Perform quantile-based discretization of values in 'col' into 3 bins with custom labels
>>> pd.qcut(df['col'], 3, labels=qcut_labels)

# Perform value-based discretization of values in 'col' into custom bins with labels
>>> pd.cut(df['col'], bins=cut_bins, labels=cut_labels)

# Pivot the DataFrame using 'col1' as index, 'col2' as columns, and 'col3' as values
>>> df.pivot(index='col1', columns='col2', values='col3')

# Reset the index of the DataFrame
>>> df.reset_index()

# Rename specific columns of the DataFrame (rename_mapping is a dict of old -> new names)
>>> df.rename(columns=rename_mapping)

Deleting & Adding

# Drop rows with any missing values
>>> df.dropna()

# Drop columns with any missing values
>>> df.dropna(axis=1)

# Drop columns with fewer than n non-null values
>>> df.dropna(axis=1, thresh=n)

# Fill missing values with a specified value
>>> df.fillna(value)
Grouping & Aggregation

# Perform grouping operation on 'col' and obtain a GroupBy object
>>> df.groupby('col')

# Group by multiple columns ('col1' and 'col2') and calculate the mean of 'col3'
>>> df.groupby(['col1', 'col2']).agg({'col3': 'mean'})

# Group by multiple columns ('col1' and 'col2') and calculate the mean of 'col3' and the count of 'col4'
>>> df.groupby(['col1', 'col2']).agg({'col3': 'mean', 'col4': 'count'})

# Group the DataFrame by the first level of the index
>>> df.groupby(level=0)

# Example
>>> df = pd.DataFrame({
...     'Name': ['John', 'Alice', 'John', 'Alice', 'Bob'],
...     'City': ['New York', 'Paris', 'London', 'Paris', 'London'],
...     'Age': [30, 25, 35, 28, 40],
...     'Salary': [50000, 60000, 55000, 45000, 70000]})

    Name      City  Age  Salary
0   John  New York   30   50000
1  Alice     Paris   25   60000
2   John    London   35   55000
3  Alice     Paris   28   45000
4    Bob    London   40   70000

>>> df.groupby('City').agg({'Salary': 'mean', 'Age': 'max'})

           Salary  Age
City
London    62500.0   40
New York  50000.0   30
Paris     52500.0   28
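The grouped result is itself a DataFrame, so the sort_values call from the Filtering & Sorting section applies to it as well; a small follow-up to the example above (an addition, not on the original sheet):

>>> df.groupby('City').agg({'Salary': 'mean'}).sort_values('Salary', ascending=False)

           Salary
City
London    62500.0
Paris     52500.0
New York  50000.0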
Joining

# Merge operation
>>> df1 = pd.DataFrame({
...     'key': ['A', 'B', 'C', 'D'],
...     'value1': [1, 2, 3, 4]})
>>> df2 = pd.DataFrame({
...     'key': ['B', 'D', 'E', 'F'],
...     'value2': [5, 6, 7, 8]})

>>> pd.merge(df1, df2, on='key', how='inner')

  key  value1  value2
0   B       2       5
1   D       4       6

# Join operation on the index
>>> df3 = pd.DataFrame({
...     'key': ['A', 'B', 'C', 'D'],
...     'value3': ['apple', 'banana', 'cherry', 'date']})
>>> df4 = pd.DataFrame({
...     'key': ['B', 'D', 'E', 'F'],
...     'value4': ['orange', 'grape', 'kiwi', 'lemon']})

>>> df3.set_index('key').join(df4.set_index('key'), how='inner')

     value3  value4
key
B    banana  orange
D      date   grape

# Example for concatenate operation
>>> df5 = pd.DataFrame({
...     'A': ['A0', 'A1', 'A2'],
...     'B': ['B0', 'B1', 'B2']})
>>> df6 = pd.DataFrame({
...     'A': ['A3', 'A4', 'A5'],
...     'B': ['B3', 'B4', 'B5']})

>>> pd.concat([df5, df6], axis=0)

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

Statistics

# Generate descriptive statistics of the DataFrame
>>> df.describe()

# Calculate the mean of each column in the DataFrame
>>> df.mean()

# Calculate the median of each column in the DataFrame
>>> df.median()

# Calculate the standard deviation of each column in the DataFrame
>>> df.std()

# Calculate the correlation matrix of the DataFrame
>>> df.corr()

# Count the number of non-null values in each column of the DataFrame
>>> df.count()

# Find the maximum value in each column of the DataFrame
>>> df.max()

# Find the minimum value in each column of the DataFrame
>>> df.min()
SCIKIT-LEARN
Cheat Sheet

Preprocessing

Splitting data into train and test sets

from sklearn.model_selection import train_test_split

X = df[["Independent Variables"]]
y = df[["Target Variable"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Handling Missing Values

from sklearn.impute import SimpleImputer, KNNImputer

# Dropping rows or columns
df.dropna(axis=0)
df.dropna(axis=1)

# Imputation (fit on the training set, then transform the test set)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# K-Nearest Neighbors (KNN) Imputation
knn_imputer = KNNImputer()
X_train_knn_imputed = knn_imputer.fit_transform(X_train)
X_test_knn_imputed = knn_imputer.transform(X_test)

Handling Outliers

def suppress_outliers_iqr(df, col_name, q1=0.25, q3=0.75, multiplier=1.5):
    quartile1 = df[col_name].quantile(q1)
    quartile3 = df[col_name].quantile(q3)
    interquartile_range = quartile3 - quartile1
    lower_bound = quartile1 - multiplier * interquartile_range
    upper_bound = quartile3 + multiplier * interquartile_range

    # Suppress outliers by replacing them with the lower/upper bounds
    df.loc[df[col_name] < lower_bound, col_name] = lower_bound
    df.loc[df[col_name] > upper_bound, col_name] = upper_bound

# Example usage
suppress_outliers_iqr(dataframe, 'A')
Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Numeric feature scaling (StandardScaler)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Numeric feature scaling (MinMaxScaler)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Feature Encoding

from sklearn.preprocessing import OneHotEncoder

# Categorical feature encoding (One-Hot Encoder)
encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

Supervised Learning Algorithms

# Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

# K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# CART
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Random Forests
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

# GBM
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

# XGBoost
!pip install xgboost
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)

# LightGBM
!pip install lightgbm
from lightgbm import LGBMClassifier
model = LGBMClassifier()
model.fit(X_train, y_train)

# CatBoost
!pip install catboost
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.fit(X_train, y_train)
Unsupervised Learning Algorithms

# K-Means
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(df)

# Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=5)
model.fit(df)

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df)

Model Evaluation

Classification metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, \
    f1_score, roc_auc_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Confusion Matrix
confusion_matrix(y_test, y_pred)

                     Predicted Negative   Predicted Positive
Actual Negative              TN                   FP
Actual Positive              FN                   TP

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy_score(y_test, y_pred)

# Precision = TP / (TP + FP)
precision_score(y_test, y_pred)

# Recall = TP / (TP + FN)
recall_score(y_test, y_pred)

# F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
f1_score(y_test, y_pred)

# AUC
roc_auc_score(y_test, y_pred)

# Classification Report
classification_report(y_test, y_pred)
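To make the formulas above concrete, here is a tiny worked example with made-up labels (the numbers are illustrative, not from the original sheet):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat  = [1, 1, 1, 0, 1, 1, 0, 0]
# TP=3, FN=1, FP=2, TN=2

accuracy_score(y_true, y_hat)    # (3 + 2) / 8 = 0.625
precision_score(y_true, y_hat)   # 3 / (3 + 2) = 0.60
recall_score(y_true, y_hat)      # 3 / (3 + 1) = 0.75
f1_score(y_true, y_hat)          # 2 * 0.60 * 0.75 / (0.60 + 0.75) = 0.667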
Regression metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

y_pred = model.predict(X_test)

# MAE
mean_absolute_error(y_test, y_pred)

# MSE
mean_squared_error(y_test, y_pred)

# RMSE
np.sqrt(mean_squared_error(y_test, y_pred))

# R-Squared
r2_score(y_test, y_pred)
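The sheet imports cross_val_score alongside the regression metrics but does not show a call; a minimal sketch of how it is typically used (the estimator, data, and scoring choice here are assumptions, not from the original sheet):

# 5-fold cross-validated RMSE for a regression estimator
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores.mean())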
Clustering metrics

from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Adjusted Rand Index
adjusted_rand_score(y_true, y_pred)

# Homogeneity
homogeneity_score(y_true, y_pred)

# V-measure
v_measure_score(y_true, y_pred)

Model Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Logistic Regression parameter optimization
lr_param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
lr_grid_search = GridSearchCV(LogisticRegression(), lr_param_grid, cv=5)
lr_grid_search.fit(X_train, y_train)
lr_grid_search.best_params_   # Best parameters for Logistic Regression
lr_grid_search.best_score_    # Best score for Logistic Regression

# Decision Tree parameter optimization
dt_param_grid = {'max_depth': [None, 5, 10], 'min_samples_split': [2, 5, 10]}
dt_grid_search = GridSearchCV(DecisionTreeClassifier(), dt_param_grid, cv=5)
dt_grid_search.fit(X_train, y_train)
dt_grid_search.best_params_   # Best parameters for Decision Tree
dt_grid_search.best_score_    # Best score for Decision Tree

# Random Forest parameter optimization
rf_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
rf_random_search = RandomizedSearchCV(RandomForestClassifier(), rf_param_grid,
                                      n_iter=5, cv=5)
rf_random_search.fit(X_train, y_train)
rf_random_search.best_params_   # Best parameters for Random Forest
rf_random_search.best_score_    # Best score for Random Forest

# K-Nearest Neighbors parameter optimization
knn_param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
knn_random_search = RandomizedSearchCV(KNeighborsClassifier(), knn_param_grid,
                                       n_iter=5, cv=5)
knn_random_search.fit(X_train, y_train)
knn_random_search.best_params_   # Best parameters for K-Nearest Neighbors
knn_random_search.best_score_    # Best score for K-Nearest Neighbors

# GBM parameter optimization
gbm_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
gbm_grid_search = GridSearchCV(GradientBoostingClassifier(), gbm_param_grid, cv=5)
gbm_grid_search.fit(X_train, y_train)
gbm_grid_search.best_params_   # Best parameters for GBM
gbm_grid_search.best_score_    # Best score for GBM

# LightGBM parameter optimization
lgb_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
lgb_grid_search = GridSearchCV(LGBMClassifier(), lgb_param_grid, cv=5)
lgb_grid_search.fit(X_train, y_train)
lgb_grid_search.best_params_   # Best parameters for LightGBM
lgb_grid_search.best_score_    # Best score for LightGBM

# XGBoost parameter optimization
xgb_param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}
xgb_grid_search = GridSearchCV(XGBClassifier(), xgb_param_grid, cv=5)
xgb_grid_search.fit(X_train, y_train)
xgb_grid_search.best_params_   # Best parameters for XGBoost
xgb_grid_search.best_score_    # Best score for XGBoost

# CatBoost parameter optimization
cat_param_grid = {'iterations': [100, 200, 300], 'depth': [4, 6, 8]}
cat_random_search = RandomizedSearchCV(CatBoostClassifier(), cat_param_grid,
                                       n_iter=5, cv=5)
cat_random_search.fit(X_train, y_train)
cat_random_search.best_params_   # Best parameters for CatBoost
cat_random_search.best_score_    # Best score for CatBoost

# K-Means parameter optimization
kmeans_param_grid = {'n_clusters': [3, 5, 7], 'init': ['k-means++', 'random']}
kmeans_grid_search = GridSearchCV(KMeans(), kmeans_param_grid, cv=5)
kmeans_grid_search.fit(X_train)
kmeans_grid_search.best_params_   # Best parameters for K-Means
kmeans_grid_search.best_score_    # Best score for K-Means

# PCA parameter optimization
pca_param_grid = {'n_components': [2, 5, 10]}
pca_grid_search = GridSearchCV(PCA(), pca_param_grid, cv=5)
pca_grid_search.fit(X_train)
pca_grid_search.best_params_   # Best parameters for PCA
pca_grid_search.best_score_    # Best score for PCA
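After a search finishes, the refitted model is available for final predictions; a short sketch added here (not on the original sheet), using the logistic regression search above:

best_model = lr_grid_search.best_estimator_   # model refit on the full training set
y_pred = best_model.predict(X_test)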
Data Visualization
Cheat Sheet

Importing Libraries

import matplotlib.pyplot as plt
import seaborn as sns

Basic Line Plot

plt.plot(x, y, linestyle='-', color='b')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

Example

# Sample dataset
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
revenue = [10000, 15000, 12000, 18000, 20000]

# Create the line plot
plt.plot(months, revenue, marker='o', linestyle='-', color='b')
plt.xlabel('Months')
plt.ylabel('Revenue ($)')
plt.title('Monthly Revenue')
plt.grid(True)
plt.show()

Bar Plot

plt.bar(x, height)
plt.title('Title')
plt.show()

# Horizontal bar
plt.barh(x, y)
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the average tip amount for each day of the week
avg_tip_by_day = tips.groupby('day')['tip'].mean()

# Create the bar plot
sns.barplot(x=avg_tip_by_day.index, y=avg_tip_by_day.values)
plt.xlabel('Day of the Week')
plt.ylabel('Average Tip Amount')
plt.title('Average Tip Amount by Day of the Week')
plt.show()
Box Plot

sns.boxplot(data=df, x='x', y='y')
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Create the box plot with a different color
sns.boxplot(data=tips, x='day', y='total_bill', color='green')
plt.xlabel('Day of the Week')
plt.ylabel('Total Bill Amount')
plt.title('Distribution of Total Bills by Day of the Week')
plt.show()

Histogram

plt.hist(data, bins=10)
plt.title('Title')
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Create the histogram
sns.histplot(data=tips, x='tip', bins=10, color='green')
plt.xlabel('Tip Amount')
plt.ylabel('Count')
plt.title('Distribution of Tip Amounts')
plt.show()

Scatter Plot

plt.scatter(x, y, c='color')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

Example

# Load the 'tips' dataset from seaborn
df = sns.load_dataset("tips")

# Extract the total bill and tip amounts
total_bill = df["total_bill"].values
tip_amount = df["tip"].values

# Create a scatter plot
plt.scatter(total_bill, tip_amount, c='green')
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Scatter Plot: Tips Dataset')
plt.show()
Pie Chart

plt.pie(y, labels=labels, colors=colors)
plt.title('Title')
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the count of meals for each time of the day
meal_counts = tips['time'].value_counts()

# Specify custom labels and colors
labels = meal_counts.index
colors = ['darkgreen', 'aquamarine']

# Create the pie chart
plt.pie(meal_counts, labels=labels, colors=colors)
plt.title('Distribution of Meals by Time of the Day')
plt.show()

Heatmap

sns.heatmap(data, annot=True)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title')
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Calculate the correlation matrix (numeric columns only)
correlation_matrix = tips.corr(numeric_only=True)

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='YlGn')
plt.title('Correlation Heatmap')
plt.show()

Plot Customization

plt.figure(figsize=(8, 6))
plt.plot(x, y, color='red', linestyle='--', linewidth=2,
         marker='o', markersize=6)
plt.xlabel('X-axis label', fontsize=12)
plt.ylabel('Y-axis label', fontsize=12)
plt.title('Title', fontsize=14)
plt.legend(['Legend'])
plt.grid(True)
plt.show()

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Prepare the data for plotting
x = tips['total_bill']
y = tips['tip']

# Create the plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, color='red', linestyle='--', linewidth=2,
         marker='o', markersize=6)
plt.xlabel('Total Bill', fontsize=12)
plt.ylabel('Tip', fontsize=12)
plt.title('Tip Amount vs Total Bill', fontsize=14)
plt.legend(['Tips'])
plt.grid(True)
plt.show()

Multiple Subplots

Example

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Prepare the data for plotting
x1 = tips['total_bill']
y1 = tips['tip']
x2 = tips['size']
y2 = tips['tip']
x3 = tips['day']
height3 = tips['total_bill']
data4 = tips['total_bill']

# Create the subplots
fig, axs = plt.subplots(2, 2)

# Plot 1: Line plot
axs[0, 0].plot(x1, y1)
axs[0, 0].set_xlabel('Total Bill')
axs[0, 0].set_ylabel('Tip')
axs[0, 0].set_title('Tip Amount vs Total Bill')

# Plot 2: Scatter plot
axs[0, 1].scatter(x2, y2)
axs[0, 1].set_xlabel('Party Size')
axs[0, 1].set_ylabel('Tip')
axs[0, 1].set_title('Tip Amount vs Party Size')

# Plot 3: Bar plot
axs[1, 0].bar(x3, height3)
axs[1, 0].set_xlabel('Day')
axs[1, 0].set_ylabel('Total Bill')
axs[1, 0].set_title('Total Bill by Day')

# Plot 4: Histogram
axs[1, 1].hist(data4, bins=10)
axs[1, 1].set_xlabel('Total Bill')
axs[1, 1].set_ylabel('Frequency')
axs[1, 1].set_title('Distribution of Total Bills')

# Adjust the layout and spacing
fig.tight_layout()

# Display the plot
plt.show()
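Not on the original sheet: any of the figures above can also be written to disk before plt.show(); the filename and resolution below are arbitrary examples.

# Save the current figure to a file
plt.savefig('figure.png', dpi=300, bbox_inches='tight')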
MS SQL
Cheat Sheet

Sample Data

1.1 Employees Table

EmployeeID | FirstName | LastName | DepartmentID | Salary
---------- | --------- | -------- | ------------ | ------
1          | John      | Doe      | 1            | 50000
2          | Jane      | Smith    | 2            | 60000
3          | Michael   | Johnson  | 1            | 55000
4          | Emily     | Williams | 3            | 52000
5          | William   | Brown    | 2            | 48000

1.2 Departments Table

DepartmentID | DepartmentName
------------ | --------------
1            | Sales
2            | Marketing
3            | HR
4            | Finance

Basic SQL Commands

-- Select all columns from a table
SELECT * FROM table_name;

-- Select specific columns from a table
SELECT column1, column2 FROM table_name;

-- Filter rows with a condition
SELECT * FROM table_name WHERE condition;

-- Insert a new row into a table
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);

-- Update existing rows in a table
UPDATE table_name SET column1 = value1,
column2 = value2 WHERE condition;

-- Delete rows from a table
DELETE FROM table_name WHERE condition;
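Applied to the sample Employees table, the insert/update/delete syntax looks as follows; these particular rows and values are illustrative additions, not from the original sheet:

-- Insert a new employee
INSERT INTO Employees (EmployeeID, FirstName, LastName, DepartmentID, Salary)
VALUES (6, 'Laura', 'Davis', 4, 53000);

-- Give employee 5 a raise
UPDATE Employees SET Salary = 51000 WHERE EmployeeID = 5;

-- Remove the newly inserted employee
DELETE FROM Employees WHERE EmployeeID = 6;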
Selecting Data

-- Select all employees
SELECT * FROM Employees;

-- Select specific columns from the Employees table
SELECT FirstName, LastName, Salary FROM Employees;

-- Filter employees in the Sales department
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID = 1;

-- Select employees with a salary greater than 52000
SELECT FirstName, LastName, Salary
FROM Employees WHERE Salary > 52000;

Filtering

-- Using the LIKE operator for pattern matching
SELECT * FROM table_name WHERE column_name
LIKE 'pattern%';

-- Using the IN operator to filter by multiple values
SELECT * FROM table_name WHERE column_name
IN (value1, value2, value3);

-- Using the BETWEEN operator to filter within a range
SELECT * FROM table_name WHERE column_name
BETWEEN value1 AND value2;

-- Using the IS NULL operator to filter NULL values
SELECT * FROM table_name WHERE column_name IS NULL;

-- Using the LIKE operator to filter employees with names starting with 'J'
SELECT FirstName, LastName FROM Employees
WHERE FirstName LIKE 'J%';

-- Using the IN operator to filter employees from specific departments
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID IN (1, 2);

-- Using the BETWEEN operator to filter employees with salaries between 50000 and 55000
SELECT FirstName, LastName FROM Employees
WHERE Salary BETWEEN 50000 AND 55000;

-- Using the IS NULL operator to find employees without a department
SELECT FirstName, LastName FROM Employees
WHERE DepartmentID IS NULL;

Aliases and Calculated Columns

-- Alias for column names
SELECT column_name AS alias_name FROM table_name;

-- Calculated columns in SELECT
SELECT column1, column2, column1 + column2
AS calculated_column FROM table_name;

-- Alias for column names
SELECT FirstName AS First, LastName AS Last
FROM Employees;

-- Calculated columns in SELECT
SELECT FirstName, LastName, Salary, Salary * 12
AS AnnualSalary FROM Employees;

Conditional Statements

-- Simple CASE expression
SELECT column_name,
CASE WHEN column_name = value1 THEN 'Result 1'
     WHEN column_name = value2 THEN 'Result 2'
     ELSE 'Default Result'
END AS result FROM table_name;

-- Simple CASE expression to categorize employees based on salary
SELECT FirstName, LastName,
CASE
    WHEN Salary >= 55000 THEN 'High'
    WHEN Salary >= 50000 THEN 'Medium'
    ELSE 'Low'
END AS SalaryCategory FROM Employees;
Working with Dates

-- Get the current date and time
SELECT GETDATE();

-- Format dates
SELECT CONVERT(varchar, date_column, 103)
AS formatted_date FROM table_name;

-- Extract parts of a date
SELECT DATEPART(year, date_column) AS year
FROM table_name;

-- Get the current date
SELECT GETDATE() AS CurrentDate;

-- Format dates using CONVERT
SELECT FirstName, LastName,
CONVERT(varchar, HireDate, 103)
AS FormattedHireDate FROM Employees;

-- Extract the year from the HireDate
SELECT FirstName, LastName,
DATEPART(year, HireDate) AS HireYear
FROM Employees;

Aggregation Functions

-- Calculate the sum of a column
SELECT SUM(column_name) FROM table_name;

-- Calculate the average of a column
SELECT AVG(column_name) FROM table_name;

-- Get the maximum value from a column
SELECT MAX(column_name) FROM table_name;

-- Get the minimum value from a column
SELECT MIN(column_name) FROM table_name;

-- Count the number of rows in a table
SELECT COUNT(*) FROM table_name;

-- Calculate the total salary of all employees
SELECT SUM(Salary) AS TotalSalary FROM Employees;

-- Calculate the average salary
SELECT AVG(Salary) AS AverageSalary FROM Employees;

-- Get the highest salary
SELECT MAX(Salary) AS MaxSalary FROM Employees;

-- Count the number of employees in the Marketing department
SELECT COUNT(*) AS NumberOfEmployees
FROM Employees WHERE DepartmentID = 2;

Sorting and Grouping

-- Order rows in ascending order
SELECT * FROM table_name ORDER BY column_name ASC;

-- Order rows in descending order
SELECT * FROM table_name ORDER BY column_name DESC;

-- Group rows based on a column
SELECT column_name, COUNT(*) FROM table_name
GROUP BY column_name;

-- Order employees by salary in descending order
SELECT FirstName, LastName, Salary
FROM Employees ORDER BY Salary DESC;

-- Group employees by department and calculate the total salary for each department
SELECT DepartmentID, SUM(Salary) AS TotalSalary
FROM Employees GROUP BY DepartmentID;
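GROUP BY and ORDER BY compose directly; a small combined example on the sample data (an addition, not from the original sheet):

-- Departments ordered by their total salary, highest first
SELECT DepartmentID, SUM(Salary) AS TotalSalary
FROM Employees
GROUP BY DepartmentID
ORDER BY TotalSalary DESC;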
Joins

-- Inner Join
SELECT * FROM table1 INNER JOIN table2
ON table1.column_name = table2.column_name;

-- Left Join
SELECT * FROM table1 LEFT JOIN table2
ON table1.column_name = table2.column_name;

-- Right Join
SELECT * FROM table1 RIGHT JOIN table2
ON table1.column_name = table2.column_name;

-- Full Outer Join
SELECT * FROM table1 FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;

-- Inner Join to get employee details along with their department names
SELECT e.FirstName, e.LastName, d.DepartmentName
FROM Employees e INNER JOIN Departments d
ON e.DepartmentID = d.DepartmentID;

-- Left Join to include all departments, even those without employees
SELECT d.DepartmentName, e.FirstName, e.LastName
FROM Departments d LEFT JOIN Employees e
ON d.DepartmentID = e.DepartmentID;

Creating and Modifying Tables

-- Create a new table
CREATE TABLE table_name (
    column1 datatype1 constraint1,
    column2 datatype2 constraint2
);

-- Add a new column to an existing table
ALTER TABLE table_name ADD column_name datatype;

-- Modify an existing column
ALTER TABLE table_name ALTER COLUMN column_name datatype;

-- Add a primary key constraint
ALTER TABLE table_name ADD CONSTRAINT pk_constraint
PRIMARY KEY (column_name);

-- Create a new table
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100)
);

-- Add a new column to an existing table
ALTER TABLE Employees ADD Age INT;

-- Add a primary key constraint
ALTER TABLE Departments ADD CONSTRAINT PK_DepartmentID
PRIMARY KEY (DepartmentID);

Indexes

-- Create an index
CREATE INDEX index_name ON table_name(column_name);

-- Delete an index
DROP INDEX table_name.index_name;

-- Create an index
CREATE INDEX IDX_Employees_DepartmentID
ON Employees (DepartmentID);

-- Delete an index
DROP INDEX Employees.IDX_Employees_DepartmentID;
Probability
Cheat Sheet

Basics

Experiment: A process that results in an outcome.
Sample Space (S): The set of all possible outcomes of an experiment.

Probability Axioms

Non-Negativity: For all events A, P(A) >= 0.
Additivity: For mutually exclusive events A and B, P(A or B) = P(A) + P(B).
Normalization: The probability of the entire sample space S is 1: P(S) = 1.

Discrete Uniform Law (Classical Probability): For equally likely outcomes,

    P(E) = Number of favorable outcomes / Total number of outcomes

Relative Frequency Probability:

    P(E) = Frequency of E occurring / Total number of trials

Conditional Probability P(E|F): Probability of E given that F has occurred. It is calculated as:

    P(E | F) = P(E and F) / P(F),  where P(F) > 0

Bayes' Theorem

Bayes' Theorem helps us update our initial beliefs (prior probabilities) based on new information (likelihood) to arrive at a more accurate or informed estimate of the probability of an event occurring (posterior probability). Mathematically, Bayes' Theorem is stated as follows:

    P(E | F) = P(F | E) * P(E) / P(F)
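A small worked example of the formula (the numbers are illustrative, not from the original sheet): suppose a test flags a condition with likelihood P(F|E) = 0.9, the condition has prior P(E) = 0.01, and a positive result occurs with overall probability P(F) = 0.05. Then

    P(E | F) = 0.9 * 0.01 / 0.05 = 0.18

so even a positive result leaves only an 18% posterior probability.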
Random Variables

Expected Value: The expected value of a random variable represents the average or mean value it takes over all possible outcomes. It provides a measure of the central tendency of the distribution.

Variance: The variance of a random variable measures the extent to which the values of the variable deviate from its expected value. It quantifies the spread or dispersion of the distribution.

Discrete Random Variables

For a discrete random variable X, which takes on distinct values from a finite or countable set, we have the following key concepts:

Probability Mass Function (PMF): The PMF P(X = x) of a discrete random variable X gives the probability that X takes on the value x.

Expected value of a discrete random variable X is calculated as the sum of each value x weighted by its probability P(X = x):

    E(X) = sum over x of  x * P(X = x)

Variance of a discrete random variable X is the probability-weighted sum of squared deviations from E(X):

    Var(X) = sum over x of  (x - E(X))^2 * P(X = x)
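As a quick check of the sum formula (an illustration added here, matching the fair six-sided die used in the simulation later on this sheet), each face has probability 1/6, so

    E(X) = (1 + 2 + 3 + 4 + 5 + 6) * 1/6 = 3.5
    Var(X) = ((1-3.5)^2 + (2-3.5)^2 + ... + (6-3.5)^2) * 1/6 = 35/12, approximately 2.92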
Continuous Random Variables

For a continuous random variable X, which can take on any value within a certain range, we have the following key concepts:

Probability Density Function (PDF): The PDF f(x) of a continuous random variable X provides the relative likelihood that X falls within a specific interval. Unlike discrete random variables, the probability that X takes on a specific value is generally zero. Instead, the probability is associated with intervals.

Expected value of a continuous random variable X is the integral of x multiplied by the PDF f(x) over the entire range of X:

    E(X) = integral of  x * f(x) dx

Variance of a continuous random variable X is calculated similarly:

    Var(X) = integral of  (x - E(X))^2 * f(x) dx
Common Probability Distributions

Discrete Distributions

Bernoulli Distribution

The Bernoulli distribution models a random experiment with two possible outcomes, usually labeled as "success" and "failure".

    PMF: P(X = x) = p^x * (1 - p)^(1 - x),  x in {0, 1}
    Parameters: p (probability of success)
    Expected Value: E(X) = p
    Variance: Var(X) = p * (1 - p)

Binomial Distribution

The binomial distribution describes the number of successes (S) in a fixed number of independent Bernoulli trials.

    PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    Parameters: n (number of trials), p (probability of success)
    Expected Value: E(X) = n * p
    Variance: Var(X) = n * p * (1 - p)

Poisson Distribution

The Poisson distribution is a probability distribution that describes the number of events that occur in a fixed interval of time or space, given a certain average rate of occurrence.

    PMF: P(X = k) = (lambda^k * e^(-lambda)) / k!
    Parameters: lambda (average rate of events)
    Expected Value: E(X) = lambda
    Variance: Var(X) = lambda

Continuous Distributions

Uniform Distribution

    PDF: f(x) = 1 / (b - a)  for a <= x <= b
    Parameters: a (lower bound), b (upper bound)
    Expected Value: E(X) = (a + b) / 2
    Variance: Var(X) = (b - a)^2 / 12

Normal (Gaussian) Distribution

The normal distribution, also known as the Gaussian distribution or bell curve, is a fundamental concept in statistics and probability theory. It describes a symmetrical probability distribution that is characterized by its bell-shaped curve.

    PDF: f(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2))
    Parameters: mu (mean), sigma (standard deviation)
    Expected Value: E(X) = mu
    Variance: Var(X) = sigma^2

Exponential Distribution

The exponential distribution models the time between successive events in a process where events occur randomly and independently at a constant average rate.

    PDF: f(x) = lambda * e^(-lambda * x)  for x >= 0
    Parameters: lambda (rate parameter)
    Expected Value: E(X) = 1 / lambda
    Variance: Var(X) = 1 / lambda^2

Chi-Squared Distribution

The chi-squared distribution emerges from the sum of the squares of independent standard normal random variables.

    PDF: depends on the degrees of freedom k
    Parameters: k (degrees of freedom)
    Expected Value: E(X) = k
    Variance: Var(X) = 2k

F-Distribution

The F-distribution models the ratio of two independent chi-squared distributions divided by their respective degrees of freedom.

    PDF: depends on the degrees of freedom (d1, d2)
    Parameters: d1 (numerator degrees of freedom), d2 (denominator degrees of freedom)
    Expected Value: E(X) = d2 / (d2 - 2)  for d2 > 2
    Variance: defined for d2 > 4 and depends on d1 and d2
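To connect these distributions to code, NumPy can draw samples from each of them; a brief sketch added here (the parameter values are arbitrary choices, not from the original sheet):

import numpy as np

rng = np.random.default_rng()

rng.binomial(n=10, p=0.3, size=5)      # Binomial
rng.poisson(lam=4, size=5)             # Poisson
rng.uniform(low=0, high=1, size=5)     # Uniform
rng.normal(loc=0, scale=1, size=5)     # Normal
rng.exponential(scale=1.0, size=5)     # Exponential (scale = 1/lambda)
rng.chisquare(df=3, size=5)            # Chi-squared
rng.f(dfnum=5, dfden=10, size=5)       # F-distribution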
Fundamental Results

Strong Law of Large Numbers (SLLN): The sample mean converges almost surely to the expected value.

import random
import matplotlib.pyplot as plt

# Function to simulate rolling a fair six-sided die
def roll_die():
    return random.randint(1, 6)

# Number of trials to perform for each sample size
num_trials = 10000

# List of sample sizes to investigate
sample_sizes = [10, 50, 100, 500, 1000, 5000]

# Iterate through each sample size
for sample_size in sample_sizes:
    trial_means = []

    # Perform num_trials trials for the current sample size
    for _ in range(num_trials):
        # Simulate rolling the die 'sample_size' times and calculate the mean
        sample = [roll_die() for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        trial_means.append(mean)

    # Create a histogram of trial means with 20 bins and add it to the plot
    plt.hist(trial_means, bins=20, alpha=0.5,
             label=f'Sample size: {sample_size}')

# Label the x and y axes and set the title of the plot
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Law of Large Numbers')
plt.legend()
plt.show()

The Central Limit Theorem

The distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal distribution.

import numpy as np
import matplotlib.pyplot as plt

# 'population' is the array of values the samples are drawn from; its definition
# is not shown on the original sheet, so an arbitrary skewed population is assumed here
population = np.random.exponential(scale=1.0, size=100_000)

# Define a list of sample sizes that we want to explore
sample_sizes = [10, 50, 100, 500]

# Iterate through each sample size
for sample_size in sample_sizes:
    # Generate 1000 random samples of the specified size from the population
    sample_means = [np.mean(np.random.choice(population, size=sample_size))
                    for _ in range(1000)]

    # Create a histogram of the sample means with 30 bins and add it to the plot
    plt.hist(sample_means, bins=30, alpha=0.5,
             label=f'Sample size: {sample_size}')

# Add labels and a title to the plot
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.title('Central Limit Theorem')
plt.legend()
plt.show()