Python

Univariate vs bivariate vs multivariate


● Univariate analysis looks at one variable, bivariate at two, and multivariate at more than two feature variables
● We can perform multivariate analysis to check relationships among several variables
● Example: correlation between two variables
● Check the standard deviation of each variable. If the standard deviations vary too much, we should standardize each
variable, as in the sketch below
   ○ Standardization (Z-score normalization): the feature is rescaled using its mean & STD, with no
   fixed range
   ○ Z = (X - mean) / STD
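A minimal sketch of Z-score standardization with sklearn's StandardScaler; the DataFrame and its column names are made up for illustration:

```python
# Z-score standardization sketch; the data and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47, 51, 62],
                   "salary": [30000, 45000, 80000, 120000, 95000]})

scaler = StandardScaler()                      # applies Z = (X - mean) / std per column
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.mean().round(2))                  # ~0 for each column
print(scaled.std(ddof=0).round(2))             # ~1 for each column
```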

Structured data vs Unstructured data


● Structured data: stored in an organized (usually tabular) way, so it is easy to analyze. Examples: tables, text fields
● Unstructured data: audio, video, chats, images
   ○ We need to do some EDA/preprocessing to convert the unstructured data into features so that an ML model can be
   applied

Supervised vs Unsupervised Learning


Supervised machine learning relies on labelled input and output training data, whereas unsupervised
learning processes unlabelled or raw data
● Regression, XGBoost, Random Forest, and SVM are supervised learning methods
● K-means is unsupervised

Similarity measures: measure the similarity between two texts/words


Methods: cosine similarity, Euclidean distance, Jaccard similarity, fuzzywuzzy, Hamming distance,
Manhattan distance
Cosine similarity: measures similarity between two texts; for non-negative count vectors it ranges from 0 to 1:
● Method: find the word occurrences in each document - transform each list into an n-dimensional vector -
calculate the cosine of the angle between these two n-dimensional vectors
● Cosine similarity score = cos(θ) = A·B / (||A|| ||B||)
● Count vectors (sklearn CountVectorizer) or tf-idf vectors (sklearn TfidfVectorizer), then cosine similarity
(sklearn cosine_similarity), as in the sketch below
● The two texts need not be of equal length
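A minimal sketch of cosine similarity over count and tf-idf vectors with sklearn; the example texts reuse the bag-of-words example that appears later in these notes:

```python
# Cosine similarity over word-count and tf-idf vectors; the texts are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Amardeep Kumar", "Amardeep Singh"]

counts = CountVectorizer().fit_transform(docs)    # bag-of-words counts
print(cosine_similarity(counts[0], counts[1]))    # [[0.5]] for this example

tfidf = TfidfVectorizer().fit_transform(docs)     # tf-idf weighted vectors
print(cosine_similarity(tfidf[0], tfidf[1]))
```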

Hamming distance: the number of positions at which two equal-length strings differ

Manhattan distance: the sum of the absolute differences between vector components

Euclidean distance: build a word-count vector for each text, then compute the straight-line distance between them
● A lower score/distance means higher similarity
Fuzzywuzzy: ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio (see the sketch below)
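A minimal sketch of the fuzzywuzzy scores listed above (this assumes the fuzzywuzzy package is installed; the newer thefuzz package exposes the same functions):

```python
# Fuzzy string-matching scores; the strings are made up.
from fuzzywuzzy import fuzz

a, b = "Amardeep Kumar", "Kumar Amardeep"

print(fuzz.ratio(a, b))              # plain edit-distance-based similarity (0-100)
print(fuzz.partial_ratio(a, b))      # best-matching substring
print(fuzz.token_sort_ratio(a, b))   # sorts tokens first, so word order is ignored
print(fuzz.token_set_ratio(a, b))    # ignores duplicate tokens and order
print(fuzz.WRatio(a, b))             # weighted combination of the above
```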

Levenshtein distance: the minimum number of edits (insertions, deletions, substitutions) needed to


convert one word into another. The two words need not be of equal length.
TF (term frequency): the count of occurrences of each word or token in a document
IDF (inverse document frequency): log of (total number of documents / number of documents containing the term); TF-IDF
multiplies the two, down-weighting words that appear in many documents

Bag of words

Text1: “Amardeep Kumar”


Text2: “Amardeep Singh”
● Unique tokens: “Amardeep” | “Kumar” | “Singh”
● Text1 counts: 1 | 1 | 0
● Text2 counts: 1 | 0 | 1
● Euclidean distance = sqrt( (1-1)**2 + (1-0)**2 + (0-1)**2 ) = sqrt(2)
● Cosine similarity = (1*1 + 1*0 + 0*1) / ( sqrt(1+1) * sqrt(1+1) ) = 1/2
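A minimal sketch reproducing the worked example above with numpy:

```python
# Bag-of-words vectors for the two texts; token order: ["Amardeep", "Kumar", "Singh"].
import numpy as np

text1 = np.array([1, 1, 0])
text2 = np.array([1, 0, 1])

euclidean = np.sqrt(np.sum((text1 - text2) ** 2))                          # sqrt(2)
cosine = text1 @ text2 / (np.linalg.norm(text1) * np.linalg.norm(text2))   # 0.5
print(euclidean, cosine)
```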

Vector embedding
● Calculate frequency count of each token

Variable type:
1. Numerical: age, transaction_amount
2. Categorical variable
● Nominal: order doesn’t matter: color, gender, state, city
● Ordinal: order matters: education qualification, salary as high medium low

Encoding: convert categorical variables to numerical ones so that they can be fed to an ML model


https://ai-ml-analytics.com/encoding/

OneHot encoding: for nominal categorical variable


● A new binary column is added for each category (or n-1 columns if one is dropped), so the dimensionality of the dataset
increases
● Example: color (red, blue, green), as in the sketch below
● Used for nominal variables
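A minimal sketch of one-hot encoding with pandas; the data is made up:

```python
# One-hot encoding of a nominal variable; the data is hypothetical.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# drop_first=True keeps n-1 dummy columns to avoid perfectly collinear features
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded)
```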

Label Encoding: For ordinal categorical variable


● Values are assigned based on rank
● Example: PhD > Master > BTech > High school
Target guided label encoding:
● For each category, a weight is computed from the target, e.g. the average target value
   within that category
● Ranks are assigned based on that weighted score
● Example: PhD > Master > BTech > High school, or PhD > Master > High school > BTech, depending on the target
   statistics (see the sketch below)
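A minimal sketch of ordinal label encoding and target-guided encoding with pandas; the data and the rank mapping are made up for illustration:

```python
# Ordinal label encoding and target-guided rank encoding; data is hypothetical.
import pandas as pd

df = pd.DataFrame({"education": ["btech", "phd", "master", "high school", "btech"],
                   "target":    [0,       1,     1,        0,             1]})

# Ordinal label encoding with an explicit rank mapping
rank = {"high school": 0, "btech": 1, "master": 2, "phd": 3}
df["education_label"] = df["education"].map(rank)

# Target-guided encoding: rank categories by the mean of the target
order = df.groupby("education")["target"].mean().sort_values().index
df["education_target_rank"] = df["education"].map({cat: i for i, cat in enumerate(order)})
print(df)
```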

Mean encoding

Example:
● Color (red, blue) & a binary classification target
● Red appears 4 times; for 3 of those rows the target is 1 and for the rest it is 0, so encoding(red) = sum(target)/count(red) = 3/4 (see the sketch below)
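A minimal sketch of mean (target) encoding with pandas; the data is made up:

```python
# Mean/target encoding: each category is replaced by the mean target within it.
import pandas as pd

df = pd.DataFrame({"color":  ["red", "red", "red", "red", "blue", "blue"],
                   "target": [1,     1,     1,     0,     1,      0]})

means = df.groupby("color")["target"].mean()    # red -> 0.75, blue -> 0.5
df["color_mean_enc"] = df["color"].map(means)
print(df)
```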

Binary encoding: 7 categories can fit into 3 binary columns (111), 15 categories into 4 columns (1111)

Fill missing values


1. Delete the rows with missing values if they are very few compared to the size of the dataset
2. Fill missing values with the mean, median, or mode if the column is numeric/continuous
● Works with continuous variables, but can introduce data leakage/misleading values
Fill with mean: when the data is numerical & the distribution is normal
Fill with median: when the distribution is skewed; the median is less sensitive to outliers than the mean
Fill with mode: suitable for both numerical and categorical data, when there is a small number of unique values or the
data is skewed
3. Fill missing values with the most frequent key, or add a new "missing" indicator variable (see the sketch below)
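A minimal sketch of simple missing-value imputation with pandas; the data is made up:

```python
# Median/mode imputation plus a missing-value indicator; data is hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":  [25, np.nan, 47, 51, np.nan],
                   "city": ["Pune", "Delhi", None, "Pune", "Pune"]})

df["age_was_missing"] = df["age"].isna()              # indicator, computed before filling
df["age"] = df["age"].fillna(df["age"].median())      # numeric: median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: most frequent value
print(df)
```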

How to check distribution of dataset


● Histogram / distribution plot
● If the distribution is skewed (left/right), the mean and median differ and the skewness is non-zero (see the sketch below)
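A minimal sketch of checking a distribution's shape numerically; the sample is synthetic:

```python
# Right-skewed synthetic sample: mean > median and skewness > 0.
import numpy as np
from scipy.stats import skew

data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

print(np.mean(data), np.median(data))   # mean exceeds the median for right skew
print(skew(data))                       # positive skewness indicates right skew
# visually: pd.Series(data).hist() would show the long right tail
```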

Mean vs median vs mode meaning


● Mean: the average of the values
● Median: the middle value when the data is arranged in ascending or descending order
● Mode: the most frequent value in the data

Variable treatment

Lambda Function
● Creates an inline, single-expression function: s = lambda x: x*x, so s(2) = 2*2 = 4
● s is the function name, the input is x, the output is x*x
● Call the lambda function as s(var1), as in the sketch below
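A minimal sketch of lambda usage:

```python
# Inline, single-expression function
square = lambda x: x * x
print(square(2))                              # 4

# Lambdas are handy as short callbacks, e.g. as a sort key
words = ["banana", "fig", "apple"]
print(sorted(words, key=lambda w: len(w)))    # ['fig', 'apple', 'banana']
```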

Tuple vs list vs array


● Tuple - immutable data structure - read-only - no addition/deletion/overwriting - faster and safer
● List and array - mutable - a list can store different data types, an array stores a single data type

RF/Xgboost/SVM - using sklearn library


sklearn.ensemble - RandomForestClassifier - n_estimators=100 - clf.fit(X, y)
● n_estimators = number of trees to grow
XGBoost library - max_depth (an integer, e.g. 6), eta=0.3 (learning rate), objective='binary:logistic'
● max_depth - maximum depth (height) of each tree
sklearn.svm - SVC - C=0.1, kernel='linear' or 'poly', gamma=1
train_test_split(features, target) - e.g. a 60/40 split (see the sketch below)
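A minimal sketch of training the three classifiers above; the data is synthetic and the hyperparameters are illustrative rather than tuned (the xgboost package is assumed to be installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "xgboost": XGBClassifier(max_depth=6, learning_rate=0.3, objective="binary:logistic"),
    "svm": SVC(C=0.1, kernel="linear", gamma=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # accuracy on the 40% hold-out split
```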

Bias Variance
● Bias vs variance, overfit/underfit: defined in terms of the actual value, the expected (average) prediction, and the
spread of predictions
   ○ Bias = expected (average) prediction - actual value: how far the predictions are, on average, from the truth
      ■ Low bias: predicted values are close to the target (actual) values
   ○ Variance = spread = average of (prediction - mean prediction)**2
      ■ Low variance: predicted values are close to the mean of the predictions
   ○ Low bias / high variance - overfitting
   ○ High bias / low variance - underfitting

● boosting and bagging - Ensemble learning


○ Bagging - merging the same type of predictions from parallel models = decreases variance = helps with overfitting
   ■ Each model receives equal weight
○ Boosting - merging predictions from sequentially trained models = decreases bias = helps with underfitting
   ■ Each model receives a weight based on its performance

● overfit/underfit
○ Overfit
   ■ Low bias, high variance
   ■ The model fits the training data very well but does not perform well on test/unseen
   data
   ■ Bagging can help
○ Underfit
   ■ High bias, low variance
   ■ The training dataset is too small or does not have enough features
   ■ The model performs poorly on both training and test data
   ■ Remove noise and add features: this reduces bias at the cost of some extra variance
   ■ Boosting can help

● How to overcome overfit/underfit issue


■ Bagging for overfitting & boosting for underfitting
   ● Ensemble learning: RF, bagging, boosting
■ Regularization - a technique to reduce error and avoid overfitting

Regularization is used to prevent the overfitting problem


● It adds a penalty term to the loss and reduces the magnitude of the coefficients

L1 (LASSO): features that have no association with the target get exactly
zero weight
● Penalty = sum of the absolute values of the weights
● Robust to outliers
● L1 tries to shrink coefficients to exactly zero
● Useful for feature selection

L2 (Ridge): all features get some (shrunken) weight


● Penalty = sum of the squares of the weights
● L2 tries to shrink coefficients evenly towards zero, but not exactly to zero
● Useful when all features carry some signal or are correlated
● Which one is computationally more expensive? Typically L1: the absolute-value penalty is not differentiable at zero
and has no closed-form solution, whereas Ridge does (see the sketch below)
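A minimal sketch comparing Lasso (L1) and Ridge (L2) coefficients on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: uninformative coefficients driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrunk but rarely exactly 0

print("lasso:", lasso.coef_.round(2))
print("ridge:", ridge.coef_.round(2))
```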

RandomForest Classifier vs XGBoost classifier


● Random Forest is a bagging model, so it merges the same type of predictions & reduces variance.
   ○ "Random" in RF: it is called a Random Forest because we use random subsets of the
   data and features, and we end up building a forest of many decision trees
● XGBoost is a sequential (boosting) model, so it reduces bias & helps with underfitting
   problems.
● When to stop growing a tree: during each stage of splitting, the cross-
   validation error is monitored. If the error does not decrease anymore,
   we stop growing the decision tree.

K-fold cross validation: split the dataset into k folds; in each round, train on k-1 folds and test on the remaining
fold, so every fold is used once as the test set (see the sketch below).
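A minimal sketch of 5-fold cross-validation with sklearn; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_val_score(clf, X, y, cv=5)   # train on 4 folds, test on 1, repeated 5 times
print(scores, scores.mean())
```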

Linear regression: used to predict targets with numerical values, e.g. price prediction
Logistic regression: used for classification problems - true or false, yes or no, rain or no rain
Regression model conditions
● The model creates a relationship between the dependent and independent variables and calculates a coefficient
for each independent variable
Conditions
● There should be a linear relationship between the dependent & independent variables
● The feature variables should be linearly independent (no multicollinearity)
● The sample should represent the population
● The residuals (errors) should be normally distributed
● Outliers should be removed

● R**2 (coefficient of determination, or R squared) vs adjusted R**2


○ R**2 tells the proportion of variance in the target variable explained by the feature variables
○ R**2 = 1 - RSS/TSS
   ■ RSS (residual sum of squares) = sum of (yi - ypred)**2
   ■ TSS (total sum of squares) = sum of (yi - ymean)**2
   ■ Reference: Numeracy, Maths and Statistics - Academic Skills Kit (ncl.ac.uk)
   ■ Adjusted R**2 = 1 - (1 - R**2)(n - 1)/(n - p - 1); it penalizes extra features, so it can compare models
   with different numbers of variables and sample sizes (a fairer measure of goodness of fit)

● What does the p-value signify?


   ○ It helps to understand whether there is a relationship between two variables.
   P < 0.05 (the usual significance threshold) suggests a relationship exists; the smaller the p-value, the more
   confident we can be that the relationship exists.
● Coefficient calculation
   ○ Suppose the equation of the best-fitted line is Y = aX + b; the sketch below shows how a and b are
   estimated with ordinary least squares
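A minimal sketch of the ordinary least squares slope and intercept for y = a*x + b; the data points are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b = y.mean() - a * x.mean()                                                # intercept
print(a, b)   # close to 2 and 0 for this made-up data
```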

Handle imbalance
● Oversampling - adding duplicate data points to the minority class: repetition, bootstrapping, SMOTE
   ○ Disadvantage: it creates duplicate data points, which can lead to overfitting
   ○ K-fold cross validation should be done to reduce the chance of overfitting
   ○ SMOTE: adds synthetic data points close to existing minority points (using KNN), slightly different from the original
   data points, so not exact duplicates (see the sketch below)
● Undersampling - remove data points from the majority class; used when the dataset is large,
   but we still lose some information with this method
● Apply an ensemble model, like a Random Forest / ensemble classifier / XGBoost model
● Apply a suitable evaluation metric like the F1 score
   ○ Cases: fraud detection (only 1-2% fraud), airplane failure, cancer diagnosis,
   terrorist detection
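A minimal sketch of SMOTE oversampling (this assumes the imbalanced-learn package is installed); the data is synthetic:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))                 # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))              # balanced via synthetic minority samples
```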
Evaluation metrics (Accuracy, Precision, Recall, F1 score) - see the sketch below
   ○ Accuracy = correct predictions / total predictions
   ○ F1 = harmonic mean of recall & precision = 2*P*R / (P + R)
   ○ Precision
      ■ What proportion of positive identifications was actually correct?
      ■ TP / (TP + FP) - spam detection - predicting spam when it is actually not spam makes the user lose information:
      Type 1 error
   ○ Recall
      ■ What proportion of actual positives was identified correctly?
      ■ TP / (TP + FN) - cancer detection - predicting non-cancer when it is actually cancer means the patient may die:
      Type 2 error
● Spam detection - FP must be avoided (optimize precision)
   ○ Avoid non-spam classified as spam
● Cancer detection - FN must be avoided (optimize recall)
   ○ Avoid cancer classified as non-cancerous
● Guilty of crime - FP must be avoided (optimize precision)
   ○ Avoid a non-guilty person being declared guilty
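A minimal sketch of these metrics with sklearn; the label vectors are made up:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]]
```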

Confusion matrix
● True positives (the model correctly predicted positive)
● False positives (the model incorrectly predicted positive)
● True negatives (the model correctly predicted negative)
● False negatives (the model incorrectly predicted negative)

Let’s say, objective: To detect spam


■ Type 1 error: FP: the model predicted spam, but on investigation it was
actually not spam: the user will lose information
■ Type 2 error: FN: the model predicted non-spam, but it was actually spam: annoying, but less harmful

Let’s say, objective: To detect cancer


■ Type 1 error: FP: the model predicted cancer, but on investigation it was
actually non-cancer: wrong medication, which may lead to illness
■ Type 2 error: FN: the model predicted non-cancer, but it was actually cancer: the doctor
won't prescribe medicine, so the patient may die

Is a Type 1 error more dangerous, or a Type 2?


● It depends on the situation: in spam detection Type 1 is more important to avoid, in cancer detection Type 2 is more
important to avoid

Feature scaling/re-scaling - needed when the ranges of different fields differ a lot (e.g. age vs salary) - important for K-means
● Standardization / Z-score normalization: the feature is rescaled using the mean & STD, with no fixed
   range
   ○ Z = (X - mean) / STD
   ○ The mean of the transformed data is 0 and the std is 1
   ○ No fixed range, more robust to outliers
● Normalization / min-max normalization - features are rescaled to the 0-1 range
   ○ X' = (X - min) / (max - min)
● Standardization, having no fixed range, is more robust to outliers, whereas normalization
   squeezes even the outliers into the 0-1 range (which is not good)
● Perform standardization when we know the distribution; perform normalization when we don't know
   the distribution. When there are outliers we generally do not perform normalization (see the sketch below)
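A minimal sketch contrasting standardization with min-max normalization on data containing one outlier; the numbers are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [12.0], [11.0], [13.0], [100.0]])   # 100 is an outlier

print(StandardScaler().fit_transform(x).ravel())   # mean 0, std 1, no fixed range
print(MinMaxScaler().fit_transform(x).ravel())     # squeezed into [0, 1]; the outlier dominates the scale
```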

Standard deviation vs variance


● Std (S) = sqrt(variance)
● Variance = S**2 = sum of (x - mean)**2 / n
   ○ The average of the squared deviations of each data point from the mean

● Both tell how far the data points are spread from the mean value

Covariance
● The joint variability of two random variables: it can take any range of values

Correlation
● The association between two variables: both increasing, both decreasing, one increasing while the other
   decreases, or no relationship
● Ranges between -1 and 1


Hyperparameter tuning: the process of selecting the best parameters to yield the desired output


Case of a Random Forest classifier
1. How many estimators (number of decision trees) should I use?
2. What should be the maximum allowable depth of each decision tree?
Method:
● Grid search: we define a list of possible values for each parameter. Grid search
   runs all combinations and automatically selects the parameters that yield the best result (see the sketch below)
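A minimal sketch of grid search over the two Random Forest hyperparameters above; the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```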

Normal distribution or gaussian distribution


● Bell-shaped, symmetrical curve on both sides; mean = median = mode
● For the standard normal distribution, the mean is 0 and the std is 1
● Formula: f(x) = (1 / (STD * sqrt(2*pi))) * exp(-(x - mean)**2 / (2 * STD**2))
Finding outliers
● IQR (interquartile range): Q1 = 25th percentile, Q2 = 50th percentile (the median), Q3 = 75th percentile; IQR = Q3 - Q1
   ○ Find the median (Q2); then find the median of the first half (Q1) and of the second half (Q3)
   ○ IQR = Q3 - Q1
   ○ Outliers lie beyond: < Q1 - 1.5*IQR or > Q3 + 1.5*IQR (see the sketch below)

● Box-and-whisker plots and histogram/distribution plots can also reveal outliers
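A minimal sketch of IQR-based outlier detection with pandas; the data is made up:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])   # 95 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])   # flags 95
```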

Handling outliers: an outlier is an extremely large or small value relative to the rest of the dataset
● Standardization / Z-score normalization: the feature is rescaled using the mean & STD, with no fixed range
   ○ Z = (X - mean) / STD
   ○ The mean of the transformed data is 0 and the std is 1
   ○ No fixed range, more robust to outliers than min-max scaling

Word embedding
● Tokenize the text into words
● Embedding - each word/token is transformed into numerical values
● A sequence of word tokens is generated
● Each word is transformed into a high-dimensional vector
● Analyzing text based only on word counts does not account for the order or meaning of words
● An embedding does account for the contextual meaning & semantic relationship between words (teacher &
   professor).
● It keeps words with similar meanings close together in the vector space, so their cosine
   similarity will be higher.

BERT: Bidirectional Encoder Representations from Transformers


● BERT: a state-of-the-art transformer NLP model
● It learns contextual relations between words or subwords in text
● BE: bidirectional encoder: it reads the whole sequence of words at once, not just left-to-right or right-to-left
● Embedding - a sequence of word/subword tokens is generated
● Encoder - takes the sequence of tokens as input - transforms it into vectors - passes them through the network - returns a
   vector for each token position
● Masked LM (MLM): before feeding in the sequence of words, 15% of the tokens are masked, and the model tries
   to predict the value of the masked tokens (matched against the original tokens), as in the sketch below
● Classification layer on top of the encoder: softmax for multi-class (probabilities from 0 to 1), sigmoid for binary (0/1)
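A minimal sketch of BERT's masked-language-model behaviour using the Hugging Face transformers library (assumes the package is installed and the model weights can be downloaded); the example sentence is made up:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context from both directions to predict the [MASK] token
for pred in unmasker("The doctor prescribed some [MASK] to the patient.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```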

Activation function
Sigmoid
● For Binary class classification problem
● Range 0 to 1
Softmax
● For multiclass classification problems
● Gives a probability for each class in the range 0 to 1, and the probabilities sum to 1

TensorFlow
● An open-source deep learning library developed by Google; it can be used to perform various
   mathematical computations on high-dimensional tensors.
   ○ I have used TensorFlow to tokenize text into words - subword-level tokens - marking the
   beginning of each word
   ○ A pre-trained RoBERTa model to generate output based on column names (training) & user
   text (prediction of SQL)
   ○ Limitation: 512-token maximum sequence length
PyTesseract: an OCR engine to extract text from images

Stemming: (caring -> car): removes affixes & gives a root form, which may not be a real word


Lemmatization: (caring -> care): understands the context & returns a meaningful base/dictionary word (see the sketch below)
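A minimal sketch contrasting the two with nltk (assumes nltk and its WordNet data are installed, e.g. via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' - crude affix stripping, not a dictionary word
print(lemmatizer.lemmatize("studies"))  # 'study' - meaningful base/dictionary form
```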

How would you do feature selection / dimensionality reduction? Correlation analysis, PCA


● PCA - recasts the features along the principal component axes (see the sketch below)
   ○ Create the covariance matrix: an n*n matrix (n = number of features)
   ○ Calculate the eigenvectors and eigenvalues - these identify the principal components
   ○ Order the principal axes by the variance they explain
   ○ Transform (project) the features onto the principal components
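A minimal sketch of PCA with sklearn; the data is synthetic and the number of components is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # a nearly collinear feature

X_std = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)             # project onto the top 3 principal components
print(pca.explained_variance_ratio_)
```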
What would you do if some columns are collinear? Drop one of the correlated columns, combine them, or use PCA / regularization (Ridge).

ML steps: https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-steps

Forecasting
Basics of time series forecasting: https://www.analyticsvidhya.com/blog/2022/06/time-series-forecasting-using-python/
Air quality forecast: article & notebook:
● https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775
● https://github.com/marcopeix/air-quality

Time series: Data has correlation with time


Trend: the general direction the data follows over its entire journey (increasing / decreasing /
irregular)
● The population of India has been increasing over time
Seasonality: a pattern in the data that repeats over a fixed interval of time.
● Sales on Zepto will be higher during weekends
● Sales of ice cream decrease during the winter months
Stationarity: constant mean and variance, and covariance that does not depend on time
Non-stationary: there is trend and/or seasonality in the data

ADF (augmented Dickey-Fuller) test: used to check the stationarity of the data


Naive approach: give full weight to the last observation, i.e. use it directly as the forecast
Moving average: the average of e.g. the last 3 months is used as the forecast for the next period
Exponential smoothing: gives weight to all historical data, but the most recent observations get the highest
weight and the oldest data the lowest
Holt linear model: accounts for trend; Holt-Winters additionally accounts for seasonality
Prophet model / MMM (marketing mix modeling) - e.g. to reallocate the budget among multiple channels
AR - an autoregressive (linear regression) model that predicts future values from past values only
ARMA - combines autoregressive (AR) and moving-average (MA) terms
ARIMA - ARMA plus differencing (the "I", integrated) to handle non-stationary data
SARIMA - ARIMA plus seasonal terms (see the sketch below)
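A minimal sketch of an ADF stationarity check and an ARIMA forecast with statsmodels; the series is synthetic and the (p, d, q) order is illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # random walk: non-stationary

stat, pvalue = adfuller(series)[:2]
print("ADF p-value:", round(pvalue, 3))        # p > 0.05 suggests non-stationarity

model = ARIMA(series, order=(1, 1, 1)).fit()   # d=1 differences the series once
print(model.forecast(steps=5))                 # forecast the next 5 points
```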

Helpful site
SQL-Query

Intro
Brief summary +
Initially - PM, built a CDP, customer segmentation, smart responder system, deployed as well: REST API,
Flask, Docker +
Different tools & technologies - AWS +
Later - worked for clients, ad-hoc requests +
Built CS360 +
EA, Project, skill
What does EA do?
● Express Analytics - (B2B / B2C) - provides services to clients
● Express Analytics created a CDP - Customer Data Platform
● It stores data and applies various DS/ML models - predictive, segmentation, classification

EA Sector & Model


● Retail (RFM, Lookalike), Finance (CLTV), Hospitality (feedback)
● Various data science models (RFM, Lookalike, Idres, query generator, VOCA, Smart responder)

Work summary in terms of technology


● NLP, classification problems, segmentation models, data preprocessing, data scraping, deployment
● Other - statistics, maths

Attribution Model
Objective: to find out which channel drives the most conversions (final transactions)
Objective: to find out how much revenue is actually generated by putting ads on a particular channel
Channels: Google CPC, Google organic, Facebook, Instagram, Twitter, Criteo, landing page
Attribution answer: we learn the weight of each channel, i.e. how much each channel contributes to conversion
Business case: it can help close the gap in budget allocation across media channels

Past Experience/Work/My role at EA - Briefly Explain/Roles and responsibilities?


● Worked on various projects and DS models
   ○ NLP-based project, customer segmentation & classification problems, DS model deployment,
   SQL
   ○ Internship - data preprocessing/wrangling, SQL, NER model
Data preprocessing/wrangling
   ■ Load and transform data
   ■ Identify missing values
   ■ Encode categorical variables (LabelEncoder)
   ■ Correct the data types
   ■ Calculate ROI

● NLP-based model - to convert natural language statements into SQL queries
   ○ NLP tools - numpy, pandas, nltk, lemmatization, text matching algorithms, text tokenization

● RFM and lookalike models - customer segmentation and classification problems


   ○ RFM - for segmentation - find customers likely to come back and make a purchase again
      ■ pandas, segmentation based on quartiles (qcut)
   ○ Lookalike - to find high-value customers - a classification problem
      ■ CLTV model, sklearn, RF, XGBoost, SVM, LR
● Docker and API management - responsible for Docker builds & DS model deployment on Linux
   ○ Basic shell commands: echo, vi, cat
● Hands-on SQL - basic commands like SELECT, WHERE, GROUP BY, HAVING, JOIN - to fetch, update, and join tables
● Attribution model - a resource (budget) allocation problem
Proficiency in Python? - NLP, classification problems, segmentation models, data preprocessing, data scraping
● Tools & tech - numpy, pandas, sklearn, plotly, nltk, lemmatization, stemming, spaCy NER
   ○ NLP - nltk, stemming (caring -> car), lemmatization (caring -> care), spaCy (vocabulary),
   fuzzywuzzy, Levenshtein distance, matching algorithms, spaCy NER, sklearn
