Python
Hamming distance
Manhattan distance
Euclidean distance: build a word-count vector for each text, then compute the distance between the two vectors
● Lower score/distance means higher similarity
FuzzyWuzzy: ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio
Bag of words
Vector embedding
● Calculate frequency count of each token
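A minimal sketch of the ideas above (assumed libraries: scikit-learn, numpy, fuzzywuzzy; the two texts are illustrative): bag-of-words counts, Euclidean distance between count vectors, and the FuzzyWuzzy ratios.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from fuzzywuzzy import fuzz

texts = ["the cat sat on the mat", "the cat sat on the hat"]

# Bag of words: frequency count of each token per text
counts = CountVectorizer().fit_transform(texts).toarray()

# Euclidean distance between the two count vectors (lower = more similar)
print("euclidean distance:", np.linalg.norm(counts[0] - counts[1]))

# FuzzyWuzzy string-similarity scores (higher = more similar)
print(fuzz.ratio(texts[0], texts[1]))
print(fuzz.partial_ratio(texts[0], texts[1]))
print(fuzz.token_sort_ratio(texts[0], texts[1]))
print(fuzz.token_set_ratio(texts[0], texts[1]))
print(fuzz.WRatio(texts[0], texts[1]))
```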
Variable type:
1. Numerical: age, transaction_amount
2. Categorical variable
● Nominal: order doesn’t matter: color, gender, state, city
● Ordinal: order matters: education qualification, salary as high medium low
Mean encoding
Example:
● Color (red, blue) & binary classification
● Red appears 4 times; in 3 of those rows the target is 1 and in the rest it is 0, so the encoding for red is sum(target)/count(red) = 3/4 (see the sketch below)
Binary encoding: 7 categories fit into 3 columns (111), 15 categories into 4 columns (1111)
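A minimal pandas sketch (column names and values are illustrative) of mean encoding for the color example above, plus a simple binary encoding of category codes.

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", "red", "red", "red", "blue", "blue", "blue"],
    "target": [1, 1, 1, 0, 0, 1, 0],
})

# Mean encoding: sum(target) / count(category) per category -> red = 3/4
df["color_mean_enc"] = df.groupby("color")["target"].transform("mean")

# Binary encoding: write each category code in binary (7 categories fit in 3 bits)
codes = df["color"].astype("category").cat.codes
df["color_bin"] = codes.apply(lambda c: format(int(c), "03b"))
print(df)
```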
Variable treatment
Lambda Function
● Creates an inline, single-expression function: s = lambda x: x*x, so s(2) = 2*2 = 4
● s = the lambda function, input = x, output = x*x
● Call the lambda function as s(var1)
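A tiny runnable version of the note above (the sort-key usage is an extra illustrative example).

```python
s = lambda x: x * x
print(s(2))   # 4
print(s(5))   # 25

# Lambdas are also handy inline, e.g. as a sort key
print(sorted([3, -1, 2], key=lambda x: abs(x)))  # [-1, 2, 3]
```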
Bias Variance
● Bias vs variance, overfit/underfit (see the numeric sketch after this list): actual value, expected (predicted) value, mean value
○ Bias = difference between the expected/average predicted value and the actual value
■ Low bias: predicted values are close to the target (actual) values
○ Variance = spread = how much the predictions vary around their own mean value
■ Low variance: predicted values are close to the mean of the predictions
○ Low bias / high variance - overfitting
○ High bias / low variance - underfitting
● overfit/underfit
○ Overfit
■ low bias, high variance
■ Model fits the training data very closely - but doesn't perform well on test/unseen data
■ Bagging can help
○ Underfit
■ low variance, High bias
■ Training dataset size is very low - not enough training data
■ Model performs poorly on both training and test data
■ Remove noise and add features; this reduces bias and increases variance
■ Boosting can help
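A minimal numeric sketch (numpy; the toy numbers are illustrative) of the bias/variance definitions above, for a single target value and several model predictions.

```python
import numpy as np

actual = 10.0
preds = np.array([8.0, 9.0, 12.0, 11.0])          # predictions from repeated fits

bias = preds.mean() - actual                        # low bias -> predictions near the actual value
variance = ((preds - preds.mean()) ** 2).mean()     # low variance -> predictions near their own mean

print("bias:", bias, "variance:", variance)
```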
L1 / LASSO: features that have no association with the target get a weight of exactly zero
● Penalty is the sum of the absolute values of the weights
● Robust to outliers
● L1 tries to shrink coefficients to exactly zero
● Useful for feature selection (see the sketch below)
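A minimal scikit-learn sketch (alpha and the toy data are illustrative) showing how L1/LASSO drives the coefficient of an uninformative feature to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)   # feature 2 is pure noise

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # the coefficient for the noise feature is shrunk to ~0.0
```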
K-fold cross validation: split the dataset into k folds; in each round train on k-1 folds and test on the remaining fold, so every fold is used as the test set once.
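A minimal scikit-learn sketch (the model and dataset are illustrative) of 5-fold cross validation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Each of the 5 folds is used once as the test set, the other 4 for training
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```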
Linear regression: used to predict numerical targets, e.g. price prediction
Logistic regression: used for classification problems, e.g. true or false, yes or no, rain or no rain
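A minimal scikit-learn sketch (toy data only) contrasting the two: linear regression for a numeric target, logistic regression for a yes/no target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)

# Linear regression: predict a numeric value (e.g. a price)
lin = LinearRegression().fit(X, 2.0 * X.ravel() + 1.0)
print(lin.predict([[12]]))        # ~25

# Logistic regression: predict a class (e.g. rain / no rain)
log = LogisticRegression().fit(X, (X.ravel() > 4).astype(int))
print(log.predict([[2], [8]]))    # [0 1]
```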
Regression model condition
● Creates a relationship between the dependent and independent variables and calculates a coefficient for each independent variable
Conditions
● There should be a linear relationship between the dependent & independent variables
● Feature variables should be linearly independent of each other (no multicollinearity)
● The sample should represent the population
● All variables should be normally distributed
● Outliers should be removed
Handle imbalance
● Oversampling - adding duplicate data points to the minority class: repetition, bootstrapping, SMOTE
○ Disadvantage: it creates duplicate data points, which leads to overfitting
○ K-fold cross validation should be done to reduce the chance of overfitting
○ SMOTE: adds synthetic data points close to existing minority points (via KNN), slightly different from the originals, so they are not exact duplicates
● Undersampling - remove data points from the majority class; used when the dataset is large enough, but we still lose some information with this method
● Apply an ensemble model, like Random Forest / an ensemble classifier / XGBoost
● Apply a suitable evaluation metric like F1 score
○ Cases: fraud detection (only 1-2% of transactions are fraud), airplane failure, whether a patient has cancer, terrorist or not
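A minimal sketch (assumes the imbalanced-learn package; the dataset and class weights are illustrative) of oversampling the minority class with SMOTE before fitting an ensemble model.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# ~2% minority class, similar to a fraud-detection setting
X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
print("before:", Counter(y))

# SMOTE creates synthetic minority points near existing ones (KNN-based)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
```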
Evaluation metrics (Accuracy, Precision, Recall, F1 Score)
○ Accuracy = correct/total
○ F1 = harmonic mean of recall & precision = 2*r*p/(r+p)
○ Precision
■ What proportion of positive identifications was actually correct?
■ TP/(TP+FP) - spam detection: predicting spam when it is actually not spam means we lose information - Type I error
○ Recall
■ What proportion of actual positives was identified correctly?
■ TP/(TP+FN) - cancer detection: predicting non-cancer when the patient actually has cancer means the patient may die - Type II error
● Spam detection - FP to be avoided (precision)
○ Avoid non-spam classified as spam
● Cancer detection - FN to be avoided (recall)
○ Avoid cancer classified as non-cancerous
● Guilty of crime - FP (precision)
○ Avoid non-guilty declared as guilty
Confusion matrix
● true positives (the model correctly predicted true)
● false positives (the model incorrectly predicted true)
● true negatives (the model correctly predicted false)
● false negatives (the model incorrectly predicted false)
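A minimal scikit-learn sketch (the labels are illustrative) computing the confusion matrix and accuracy / precision / recall / F1 from it.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy :", accuracy_score(y_true, y_pred))   # (tp + tn) / total
print("precision:", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall   :", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("f1       :", f1_score(y_true, y_pred))         # 2*p*r / (p + r)
```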
Feature scaling/re-scaling - when the value ranges differ widely across fields (e.g. age vs salary) - important for distance-based models like KMeans
● Standardization/Z Score normalization: feature rescaled using mean & STD, but no certain
range
○ (X - mean) / STD
○ Mean of transformed data is 0, std is 1
○ No certain range, robust to outlier
● Normalization/max min normalization - feature are rescaled - 0 to 1
○ (x - min) / (max - min)
● Standardization, having no fixed range, is more robust to outliers, whereas normalization squeezes even the outliers into the 0-1 range (which is not desirable)
● Perform standardization when we know the distribution; perform normalization when we don't know the distribution. When there are outliers we generally do not perform normalization
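A minimal scikit-learn sketch (the toy column is illustrative) of standardization vs min-max normalization on a feature containing an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[20], [25], [30], [35], [90]])   # 90 is an outlier

print(StandardScaler().fit_transform(ages).ravel())  # (x - mean) / std, no fixed range
print(MinMaxScaler().fit_transform(ages).ravel())    # (x - min) / (max - min), squeezed into 0-1
```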
Variance
● Tells how far the data points are spread from the mean value
Covariance
● Joint variability of two random variables: can take any range of values
Correlation
● Association between two variables: both increase, both decrease, one increases while the other decreases, or both stay constant
● Ranges between -1 and 1
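A minimal numpy sketch (the toy series are illustrative) contrasting covariance and correlation.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

print(np.cov(x, y)[0, 1])       # covariance: can take any value
print(np.corrcoef(x, y)[0, 1])  # correlation: always between -1 and 1 (here 1.0)
```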
Handling outliers: an outlier is an extremely large or small value relative to the rest of the dataset
● Box and whisker plots and histograms/distribution plots can reveal outliers
● Standardization (Z-score normalization) is robust to outliers since the rescaled feature has no fixed range (see Feature scaling above)
Word embedding
● Tokenize the text into words
● Embedding - each word is transformed into numerical values
● Sequences of word tokens are generated
● Words are transformed into high-dimensional vectors
● Unlike bag of words, which analyzes text/documents based only on word counts, embeddings account for the order of words
● Embeddings capture the contextual meaning & semantic relationships between words (e.g. teacher & professor)
● Words with similar meanings end up in similar dimensions & closer in distance, so their cosine similarity is higher
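A minimal numpy sketch of the cosine-similarity point above. The 3-d vectors are made up purely for illustration (real embeddings are typically 100-768 dimensional).

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

teacher   = np.array([0.90, 0.80, 0.10])   # hypothetical embedding vectors
professor = np.array([0.85, 0.75, 0.20])
banana    = np.array([0.10, 0.05, 0.90])

print(cosine(teacher, professor))   # close to 1 -> similar meaning
print(cosine(teacher, banana))      # much lower -> unrelated words
```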
Activation function
Sigmoid
● For binary classification problems
● Range 0 to 1
Softmax
● For multiclass classification problems
● Gives probabilities between 0 and 1 that sum to 1 across classes
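A minimal numpy sketch of the two activation functions above.

```python
import numpy as np

def sigmoid(z):
    # squashes a single score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # probabilities over classes, summing to 1
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 3.0])))
```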
TensorFlow
● Open-source deep learning library developed by Google; it can be used to compute various mathematical calculations in high-dimensional space.
○ I have used TensorFlow to tokenize text into words - subword-level tokens - marking the beginning of each word
○ Pre-trained RoBERTa model to generate output based on column names (training) & user text (prediction of SQL)
○ Limitation: maximum token sequence length of 512
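A minimal sketch of RoBERTa-style subword tokenization (this assumes the Hugging Face transformers package and is not necessarily the exact pipeline used above). Note the "Ġ" marker on tokens that begin a new word, and the fixed maximum sequence length.

```python
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("show total sales by region"))   # subword tokens, word starts marked with "Ġ"
print(tok.model_max_length)                          # 512 for roberta-base
```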
PyTesseract: OCR engine to extract text from images
ML step: https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-steps
Forecasting
Basics of time series forecasting: https://www.analyticsvidhya.com/blog/2022/06/time-series-forecasting-using-python/
Air Quality Forecast: Article & Notebook:
● https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775
● https://github.com/marcopeix/air-quality
Helpful site
SQL-Query
Intro
Brief summary +
Initially - PM, built CDP, customer segmentation, smart res system, deployed as well, REST API, Flask, Docker +
Different tools & technologies - AWS +
Later - worked for clients, ad hoc +
Built CS360 +
EA, Project, skill
What EA does?
● Express analytics - (B2B / B2C) - provides service to client
● Express Analytics created CDP - Customer Data Platform
● Stores data - applies various DS/ML models - predictive, segmentation, classification
Attribution Model
Objective: to find out which channel drives the most conversions (final transactions)
Objective: to find out how much revenue is actually generated by placing ads on a particular channel
Channels: google cpc, google organic, facebook, instagram, twitter, criteo, landing page
Attribution answer: we know the weight of each channel, i.e. how much each channel contributes to conversion
Business case: it can help close the gaps in budget allocation across media channels
● NLP based model - To convert natural language statement into sql query
○ NLP tools - numpy, pandas, nltk, lemmatization, text matching algorithm, text tokenization