Python

Univariate vs bivariate vs multivariate


● Univariate analysis looks at one variable, bivariate at two, and multivariate at more than two feature variables
● We can perform multivariate analysis to check relationships among several variables
● Example: correlation between two variables
● Check the standard deviation of each variable. If the standard deviations vary too much, we should standardize each
variable, as in the sketch below
   ○ Standardization (Z-score normalization): the feature is rescaled using its mean & STD, with no
   fixed range
   ○ Z = (X - mean) / STD
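A minimal sketch of Z-score standardization with sklearn's StandardScaler; the DataFrame and its column names are made up for illustration:

```python
# Z-score standardization sketch; the data and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47, 51, 62],
                   "salary": [30000, 45000, 80000, 120000, 95000]})

scaler = StandardScaler()                      # applies Z = (X - mean) / std per column
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.mean().round(2))                  # ~0 for each column
print(scaled.std(ddof=0).round(2))             # ~1 for each column
```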

Structured data vs Unstructured data


● Structured data: stored in an organized (usually tabular) way, so it is easy to analyze. Examples: tables, text fields
● Unstructured data: audio, video, chats, images
   ○ We need to do some EDA/preprocessing to convert the unstructured data into features so that an ML model can be
   applied

Supervised vs Unsupervised Learning


Supervised machine learning relies on labelled input and output training data, whereas unsupervised
learning processes unlabelled or raw data
● Regression, XGBoost, Random Forest, and SVM are supervised learning methods
● K-means is unsupervised

Similarity measures: measure the similarity between two texts/words


Methods: cosine similarity, Euclidean distance, Jaccard similarity, fuzzywuzzy, Hamming distance,
Manhattan distance
Cosine similarity: measures similarity between two texts; for non-negative count vectors it ranges from 0 to 1:
● Method: find the word occurrences in each document - transform each list into an n-dimensional vector -
calculate the cosine of the angle between these two n-dimensional vectors
● Cosine similarity score = cos(θ) = A·B / (||A|| ||B||)
● Count vectors (sklearn CountVectorizer) or tf-idf vectors (sklearn TfidfVectorizer), then cosine similarity
(sklearn cosine_similarity), as in the sketch below
● The two texts need not be of equal length
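A minimal sketch of cosine similarity over count and tf-idf vectors with sklearn; the example texts reuse the bag-of-words example that appears later in these notes:

```python
# Cosine similarity over word-count and tf-idf vectors; the texts are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Amardeep Kumar", "Amardeep Singh"]

counts = CountVectorizer().fit_transform(docs)    # bag-of-words counts
print(cosine_similarity(counts[0], counts[1]))    # [[0.5]] for this example

tfidf = TfidfVectorizer().fit_transform(docs)     # tf-idf weighted vectors
print(cosine_similarity(tfidf[0], tfidf[1]))
```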

Hamming distance: the number of positions at which two equal-length strings differ

Manhattan distance: the sum of the absolute differences between vector components

Euclidean distance: build a word-count vector for each text, then compute the straight-line distance between them
● A lower score/distance means higher similarity
Fuzzywuzzy: ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio (see the sketch below)
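A minimal sketch of the fuzzywuzzy scores listed above (this assumes the fuzzywuzzy package is installed; the newer thefuzz package exposes the same functions):

```python
# Fuzzy string-matching scores; the strings are made up.
from fuzzywuzzy import fuzz

a, b = "Amardeep Kumar", "Kumar Amardeep"

print(fuzz.ratio(a, b))              # plain edit-distance-based similarity (0-100)
print(fuzz.partial_ratio(a, b))      # best-matching substring
print(fuzz.token_sort_ratio(a, b))   # sorts tokens first, so word order is ignored
print(fuzz.token_set_ratio(a, b))    # ignores duplicate tokens and order
print(fuzz.WRatio(a, b))             # weighted combination of the above
```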

Levenshtein distance: the minimum number of edits (insertions, deletions, substitutions) needed to


convert one word into another. The two words need not be of equal length.
TF (term frequency): the count of occurrences of each word or token in a document
IDF (inverse document frequency): log of (total number of documents / number of documents containing the term); TF-IDF
multiplies the two, down-weighting words that appear in many documents

Bag of words

Text1: “Amardeep Kumar”


Text2: “Amardeep Singh”
● Unique tokens: “Amardeep” | “Kumar” | “Singh”
● Text1 counts: 1 | 1 | 0
● Text2 counts: 1 | 0 | 1
● Euclidean distance = sqrt( (1-1)**2 + (1-0)**2 + (0-1)**2 ) = sqrt(2)
● Cosine similarity = (1*1 + 1*0 + 0*1) / ( sqrt(1+1) * sqrt(1+1) ) = 1/2
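A minimal sketch reproducing the worked example above with numpy:

```python
# Bag-of-words vectors for the two texts; token order: ["Amardeep", "Kumar", "Singh"].
import numpy as np

text1 = np.array([1, 1, 0])
text2 = np.array([1, 0, 1])

euclidean = np.sqrt(np.sum((text1 - text2) ** 2))                          # sqrt(2)
cosine = text1 @ text2 / (np.linalg.norm(text1) * np.linalg.norm(text2))   # 0.5
print(euclidean, cosine)
```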

Vector embedding
● Calculate frequency count of each token

Variable type:
1. Numerical: age, transaction_amount
2. Categorical variable
● Nominal: order doesn’t matter: color, gender, state, city
● Ordinal: order matters: education qualification, salary as high medium low

Encoding: convert categorical variables to numerical ones so that they can be fed to an ML model


https://ai-ml-analytics.com/encoding/

OneHot encoding: for nominal categorical variable


● A new binary column is added for each category (or n-1 columns if one is dropped), so the dimensionality of the dataset
increases
● Example: color (red, blue, green), as in the sketch below
● Used for nominal variables
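A minimal sketch of one-hot encoding with pandas; the data is made up:

```python
# One-hot encoding of a nominal variable; the data is hypothetical.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# drop_first=True keeps n-1 dummy columns to avoid perfectly collinear features
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded)
```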

Label Encoding: For ordinal categorical variable


● Values are assigned based on rank
● Example: PhD > Master > BTech > High school
Target guided label encoding:
● For each category, a weight is computed from the target, e.g. the average target value
   within that category
● Ranks are assigned based on that weighted score
● Example: PhD > Master > BTech > High school, or PhD > Master > High school > BTech, depending on the target
   statistics (see the sketch below)
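A minimal sketch of ordinal label encoding and target-guided encoding with pandas; the data and the rank mapping are made up for illustration:

```python
# Ordinal label encoding and target-guided rank encoding; data is hypothetical.
import pandas as pd

df = pd.DataFrame({"education": ["btech", "phd", "master", "high school", "btech"],
                   "target":    [0,       1,     1,        0,             1]})

# Ordinal label encoding with an explicit rank mapping
rank = {"high school": 0, "btech": 1, "master": 2, "phd": 3}
df["education_label"] = df["education"].map(rank)

# Target-guided encoding: rank categories by the mean of the target
order = df.groupby("education")["target"].mean().sort_values().index
df["education_target_rank"] = df["education"].map({cat: i for i, cat in enumerate(order)})
print(df)
```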

Mean encoding

Example:
● Color (red, blue) & a binary classification target
● Red appears 4 times; for 3 of those rows the target is 1 and for the rest it is 0, so encoding(red) = sum(target)/count(red) = 3/4 (see the sketch below)
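A minimal sketch of mean (target) encoding with pandas; the data is made up:

```python
# Mean/target encoding: each category is replaced by the mean target within it.
import pandas as pd

df = pd.DataFrame({"color":  ["red", "red", "red", "red", "blue", "blue"],
                   "target": [1,     1,     1,     0,     1,      0]})

means = df.groupby("color")["target"].mean()    # red -> 0.75, blue -> 0.5
df["color_mean_enc"] = df["color"].map(means)
print(df)
```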

Binary encoding: 7 categories can fit into 3 binary columns (111), 15 categories into 4 columns (1111)

Fill missing values


1. Delete the rows with missing values if they are very few compared to the size of the dataset
2. Fill missing values with the mean, median, or mode if the column is numeric/continuous
● Works with continuous variables, but can introduce data leakage/misleading values
Fill with mean: when the data is numerical & the distribution is normal
Fill with median: when the distribution is skewed; the median is less sensitive to outliers than the mean
Fill with mode: suitable for both numerical and categorical data, when there is a small number of unique values or the
data is skewed
3. Fill missing values with the most frequent key, or add a new "missing" indicator variable (see the sketch below)
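A minimal sketch of simple missing-value imputation with pandas; the data is made up:

```python
# Median/mode imputation plus a missing-value indicator; data is hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":  [25, np.nan, 47, 51, np.nan],
                   "city": ["Pune", "Delhi", None, "Pune", "Pune"]})

df["age_was_missing"] = df["age"].isna()              # indicator, computed before filling
df["age"] = df["age"].fillna(df["age"].median())      # numeric: median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: most frequent value
print(df)
```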

How to check distribution of dataset


● Histogram / distribution plot
● If the distribution is skewed (left/right), the mean and median differ and the skewness is non-zero (see the sketch below)
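A minimal sketch of checking a distribution's shape numerically; the sample is synthetic:

```python
# Right-skewed synthetic sample: mean > median and skewness > 0.
import numpy as np
from scipy.stats import skew

data = np.random.default_rng(0).exponential(scale=2.0, size=1000)

print(np.mean(data), np.median(data))   # mean exceeds the median for right skew
print(skew(data))                       # positive skewness indicates right skew
# visually: pd.Series(data).hist() would show the long right tail
```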

Mean vs median vs mode meaning


● Mean: the average of the values
● Median: the middle value when the data is arranged in ascending or descending order
● Mode: the most frequent value in the data

Variable treatment

Lambda Function
● Creates an inline, single-expression function: s = lambda x: x*x, so s(2) = 2*2 = 4
● s is the function name, the input is x, the output is x*x
● Call the lambda function as s(var1), as in the sketch below
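A minimal sketch of lambda usage:

```python
# Inline, single-expression function
square = lambda x: x * x
print(square(2))                              # 4

# Lambdas are handy as short callbacks, e.g. as a sort key
words = ["banana", "fig", "apple"]
print(sorted(words, key=lambda w: len(w)))    # ['fig', 'apple', 'banana']
```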

Tuple vs list vs array


● Tuple - immutable data structure - read-only - no addition/deletion/overwriting - faster and safer
● List and array - mutable - a list can store different data types, an array stores a single data type

RF/Xgboost/SVM - using sklearn library


sklearn.ensemble - RandomForestClassifier - n_estimators=100 - clf.fit(X, y)
● n_estimators = number of trees to grow
XGBoost library - max_depth (an integer, e.g. 6), eta=0.3 (learning rate), objective='binary:logistic'
● max_depth - maximum depth (height) of each tree
sklearn.svm - SVC - C=0.1, kernel='linear' or 'poly', gamma=1
train_test_split(features, target) - e.g. a 60/40 split (see the sketch below)
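A minimal sketch of training the three classifiers above; the data is synthetic and the hyperparameters are illustrative rather than tuned (the xgboost package is assumed to be installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "xgboost": XGBClassifier(max_depth=6, learning_rate=0.3, objective="binary:logistic"),
    "svm": SVC(C=0.1, kernel="linear", gamma=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # accuracy on the 40% hold-out split
```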

Bias Variance
● Bias vs variance, overfit/underfit: defined in terms of the actual value, the expected (average) prediction, and the
spread of predictions
   ○ Bias = expected (average) prediction - actual value: how far the predictions are, on average, from the truth
      ■ Low bias: predicted values are close to the target (actual) values
   ○ Variance = spread = average of (prediction - mean prediction)**2
      ■ Low variance: predicted values are close to the mean of the predictions
   ○ Low bias / high variance - overfitting
   ○ High bias / low variance - underfitting

● boosting and bagging - Ensemble learning


○ Bagging - merging the same type of predictions from parallel models = decreases variance = helps with overfitting
   ■ Each model receives equal weight
○ Boosting - merging predictions from sequentially trained models = decreases bias = helps with underfitting
   ■ Each model receives a weight based on its performance

● overfit/underfit
○ Overfit
   ■ Low bias, high variance
   ■ The model fits the training data very well but does not perform well on test/unseen
   data
   ■ Bagging can help
○ Underfit
   ■ High bias, low variance
   ■ The training dataset is too small or does not have enough features
   ■ The model performs poorly on both training and test data
   ■ Remove noise and add features: this reduces bias at the cost of some extra variance
   ■ Boosting can help

● How to overcome overfit/underfit issue


■ Bagging for overfitting & boosting for underfitting
   ● Ensemble learning: RF, bagging, boosting
■ Regularization - a technique to reduce error and avoid overfitting

Regularization is used to prevent the overfitting problem


● It adds a penalty term to the loss and reduces the magnitude of the coefficients

L1 (LASSO): features that have no association with the target get exactly
zero weight
● Penalty = sum of the absolute values of the weights
● Robust to outliers
● L1 tries to shrink coefficients to exactly zero
● Useful for feature selection

L2 (Ridge): all features get some (shrunken) weight


● Penalty = sum of the squares of the weights
● L2 tries to shrink coefficients evenly towards zero, but not exactly to zero
● Useful when all features carry some signal or are correlated
● Which one is computationally more expensive? Typically L1: the absolute-value penalty is not differentiable at zero
and has no closed-form solution, whereas Ridge does (see the sketch below)
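A minimal sketch comparing Lasso (L1) and Ridge (L2) coefficients on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: uninformative coefficients driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrunk but rarely exactly 0

print("lasso:", lasso.coef_.round(2))
print("ridge:", ridge.coef_.round(2))
```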

RandomForest Classifier vs XGBoost classifier


● Random Forest is a bagging model, so it merges the same type of predictions & reduces variance.
   ○ "Random" in RF: it is called a Random Forest because we use random subsets of the
   data and features, and we end up building a forest of many decision trees
● XGBoost is a sequential (boosting) model, so it reduces bias & helps with underfitting
   problems.
● When to stop growing a tree: during each stage of splitting, the cross-
   validation error is monitored. If the error does not decrease anymore,
   we stop growing the decision tree.

K-fold cross validation: split the dataset into k folds; in each round, train on k-1 folds and test on the remaining
fold, so every fold is used once as the test set (see the sketch below).
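A minimal sketch of 5-fold cross-validation with sklearn; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_val_score(clf, X, y, cv=5)   # train on 4 folds, test on 1, repeated 5 times
print(scores, scores.mean())
```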

Linear regression: used to predict targets with numerical values, e.g. price prediction
Logistic regression: used for classification problems - true or false, yes or no, rain or no rain
Regression model conditions
● The model creates a relationship between the dependent and independent variables and calculates a coefficient
for each independent variable
Conditions
● There should be a linear relationship between the dependent & independent variables
● The feature variables should be linearly independent (no multicollinearity)
● The sample should represent the population
● The residuals (errors) should be normally distributed
● Outliers should be removed

● R**2 (coefficient of determination, or R squared) vs adjusted R**2


○ R**2 tells the proportion of variance in the target variable explained by the feature variables
○ R**2 = 1 - RSS/TSS
   ■ RSS (residual sum of squares) = sum of (yi - ypred)**2
   ■ TSS (total sum of squares) = sum of (yi - ymean)**2
   ■ Reference: Numeracy, Maths and Statistics - Academic Skills Kit (ncl.ac.uk)
   ■ Adjusted R**2 = 1 - (1 - R**2)(n - 1)/(n - p - 1); it penalizes extra features, so it can compare models
   with different numbers of variables and sample sizes (a fairer measure of goodness of fit)

● What does the p-value signify?


   ○ It helps to understand whether there is a relationship between two variables.
   P < 0.05 (the usual significance threshold) suggests a relationship exists; the smaller the p-value, the more
   confident we can be that the relationship exists.
● Coefficient calculation
   ○ Suppose the equation of the best-fitted line is Y = aX + b; the sketch below shows how a and b are
   estimated with ordinary least squares
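A minimal sketch of the ordinary least squares slope and intercept for y = a*x + b; the data points are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b = y.mean() - a * x.mean()                                                # intercept
print(a, b)   # close to 2 and 0 for this made-up data
```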

Handle imbalance
● Oversampling - adding duplicate data points to the minority class: repetition, bootstrapping, SMOTE
   ○ Disadvantage: it creates duplicate data points, which can lead to overfitting
   ○ K-fold cross validation should be done to reduce the chance of overfitting
   ○ SMOTE: adds synthetic data points close to existing minority points (using KNN), slightly different from the original
   data points, so not exact duplicates (see the sketch below)
● Undersampling - remove data points from the majority class; used when the dataset is large,
   but we still lose some information with this method
● Apply an ensemble model, like a Random Forest / ensemble classifier / XGBoost model
● Apply a suitable evaluation metric like the F1 score
   ○ Cases: fraud detection (only 1-2% fraud), airplane failure, cancer diagnosis,
   terrorist detection
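A minimal sketch of SMOTE oversampling (this assumes the imbalanced-learn package is installed); the data is synthetic:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))                 # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))              # balanced via synthetic minority samples
```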
Evaluation metrics (Accuracy, Precision, Recall, F1 score) - see the sketch below
   ○ Accuracy = correct predictions / total predictions
   ○ F1 = harmonic mean of recall & precision = 2*P*R / (P + R)
   ○ Precision
      ■ What proportion of positive identifications was actually correct?
      ■ TP / (TP + FP) - spam detection - predicting spam when it is actually not spam makes the user lose information:
      Type 1 error
   ○ Recall
      ■ What proportion of actual positives was identified correctly?
      ■ TP / (TP + FN) - cancer detection - predicting non-cancer when it is actually cancer means the patient may die:
      Type 2 error
● Spam detection - FP must be avoided (optimize precision)
   ○ Avoid non-spam classified as spam
● Cancer detection - FN must be avoided (optimize recall)
   ○ Avoid cancer classified as non-cancerous
● Guilty of crime - FP must be avoided (optimize precision)
   ○ Avoid a non-guilty person being declared guilty
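A minimal sketch of these metrics with sklearn; the label vectors are made up:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]]
```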

Confusion matrix
● True positives (the model correctly predicted positive)
● False positives (the model incorrectly predicted positive)
● True negatives (the model correctly predicted negative)
● False negatives (the model incorrectly predicted negative)

Let’s say, objective: To detect spam


■ Type 1 error: FP: the model predicted spam, but on investigation it was
actually not spam: the user will lose information
■ Type 2 error: FN: the model predicted non-spam, but it was actually spam: annoying, but less harmful

Let’s say, objective: To detect cancer


■ Type 1 error: FP: the model predicted cancer, but on investigation it was
actually non-cancer: wrong medication, which may lead to illness
■ Type 2 error: FN: the model predicted non-cancer, but it was actually cancer: the doctor
won't prescribe medicine, so the patient may die

Is a Type 1 error more dangerous, or a Type 2?


● It depends on the situation: in spam detection Type 1 is more important to avoid, in cancer detection Type 2 is more
important to avoid

Feature scaling/re-scaling - needed when the ranges of different fields differ a lot (e.g. age vs salary) - important for K-means
● Standardization / Z-score normalization: the feature is rescaled using the mean & STD, with no fixed
   range
   ○ Z = (X - mean) / STD
   ○ The mean of the transformed data is 0 and the std is 1
   ○ No fixed range, more robust to outliers
● Normalization / min-max normalization - features are rescaled to the 0-1 range
   ○ X' = (X - min) / (max - min)
● Standardization, having no fixed range, is more robust to outliers, whereas normalization
   squeezes even the outliers into the 0-1 range (which is not good)
● Perform standardization when we know the distribution; perform normalization when we don't know
   the distribution. When there are outliers we generally do not perform normalization (see the sketch below)
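A minimal sketch contrasting standardization with min-max normalization on data containing one outlier; the numbers are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [12.0], [11.0], [13.0], [100.0]])   # 100 is an outlier

print(StandardScaler().fit_transform(x).ravel())   # mean 0, std 1, no fixed range
print(MinMaxScaler().fit_transform(x).ravel())     # squeezed into [0, 1]; the outlier dominates the scale
```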

Standard deviation vs variance


● Std (S) = sqrt(variance)
● Variance = S**2 = sum of (x - mean)**2 / n
   ○ The average of the squared deviations of each data point from the mean

● Both tell how far the data points are spread from the mean value

Covariance
● The joint variability of two random variables: it can take any range of values

Correlation
● The association between two variables: both increasing, both decreasing, one increasing while the other
   decreases, or no relationship
● Ranges between -1 and 1


Hyperparameter tuning: the process of selecting the best parameters to yield the desired output


Case of a Random Forest classifier
1. How many estimators (number of decision trees) should I use?
2. What should be the maximum allowable depth of each decision tree?
Method:
● Grid search: we define a list of possible values for each parameter. Grid search
   runs all combinations and automatically selects the parameters that yield the best result (see the sketch below)
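A minimal sketch of grid search over the two Random Forest hyperparameters above; the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100, 200],
              "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```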

Normal distribution or gaussian distribution


● Bell-shaped, symmetrical curve on both sides; mean = median = mode
● For the standard normal distribution, the mean is 0 and the std is 1
● Formula: f(x) = (1 / (STD * sqrt(2*pi))) * exp(-(x - mean)**2 / (2 * STD**2))
Finding outliers
● IQR (interquartile range): Q1 = 25th percentile, Q2 = 50th percentile (the median), Q3 = 75th percentile; IQR = Q3 - Q1
   ○ Find the median (Q2); then find the median of the first half (Q1) and of the second half (Q3)
   ○ IQR = Q3 - Q1
   ○ Outliers lie beyond: < Q1 - 1.5*IQR or > Q3 + 1.5*IQR (see the sketch below)

● Box-and-whisker plots and histogram/distribution plots can also reveal outliers
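A minimal sketch of IQR-based outlier detection with pandas; the data is made up:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])   # 95 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])   # flags 95
```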

Handling outliers: an outlier is an extremely large or small value relative to the rest of the dataset
● Standardization / Z-score normalization: the feature is rescaled using the mean & STD, with no fixed range
   ○ Z = (X - mean) / STD
   ○ The mean of the transformed data is 0 and the std is 1
   ○ No fixed range, more robust to outliers than min-max scaling

Word embedding
● Tokenize the text into words
● Embedding - each word/token is transformed into numerical values
● A sequence of word tokens is generated
● Each word is transformed into a high-dimensional vector
● Analyzing text based only on word counts does not account for the order or meaning of words
● An embedding does account for the contextual meaning & semantic relationship between words (teacher &
   professor).
● It keeps words with similar meanings close together in the vector space, so their cosine
   similarity will be higher.

BERT: Bidirectional Encoder Representations from Transformers


● BERT: a state-of-the-art transformer NLP model
● It learns contextual relations between words or subwords in text
● BE: bidirectional encoder: it reads the whole sequence of words at once, not just left-to-right or right-to-left
● Embedding - a sequence of word/subword tokens is generated
● Encoder - takes the sequence of tokens as input - transforms it into vectors - passes them through the network - returns a
   vector for each token position
● Masked LM (MLM): before feeding in the sequence of words, 15% of the tokens are masked, and the model tries
   to predict the value of the masked tokens (matched against the original tokens), as in the sketch below
● Classification layer on top of the encoder: softmax for multi-class (probabilities from 0 to 1), sigmoid for binary (0/1)
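A minimal sketch of BERT's masked-language-model behaviour using the Hugging Face transformers library (assumes the package is installed and the model weights can be downloaded); the example sentence is made up:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses context from both directions to predict the [MASK] token
for pred in unmasker("The doctor prescribed some [MASK] to the patient.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```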

Activation function
Sigmoid
● For Binary class classification problem
● Range 0 to 1
Softmax
● For multiclass classification problems
● Gives a probability for each class in the range 0 to 1, and the probabilities sum to 1

TensorFlow
● An open-source deep learning library developed by Google; it can be used to perform various
   mathematical computations on high-dimensional tensors.
   ○ I have used TensorFlow to tokenize text into words - subword-level tokens - marking the
   beginning of each word
   ○ A pre-trained RoBERTa model to generate output based on column names (training) & user
   text (prediction of SQL)
   ○ Limitation: 512-token maximum sequence length
PyTesseract: an OCR engine to extract text from images

Stemming: (caring -> car): removes affixes & gives a root form, which may not be a real word


Lemmatization: (caring -> care): understands the context & returns a meaningful base/dictionary word (see the sketch below)
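A minimal sketch contrasting the two with nltk (assumes nltk and its WordNet data are installed, e.g. via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' - crude affix stripping, not a dictionary word
print(lemmatizer.lemmatize("studies"))  # 'study' - meaningful base/dictionary form
```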

How would you do feature selection / dimensionality reduction? Correlation analysis, PCA


● PCA - recasts the features along the principal component axes (see the sketch below)
   ○ Create the covariance matrix: an n*n matrix (n = number of features)
   ○ Calculate the eigenvectors and eigenvalues - these identify the principal components
   ○ Order the principal axes by the variance they explain
   ○ Transform (project) the features onto the principal components
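A minimal sketch of PCA with sklearn; the data is synthetic and the number of components is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)   # a nearly collinear feature

X_std = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)             # project onto the top 3 principal components
print(pca.explained_variance_ratio_)
```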
What would you do if some columns are collinear? Drop one of the correlated columns, combine them, or use PCA / regularization (Ridge).

ML steps: https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-steps

Forecasting
Basics of time series forecasting: https://www.analyticsvidhya.com/blog/2022/06/time-series-forecasting-using-python/
Air quality forecast: article & notebook:
● https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775
● https://github.com/marcopeix/air-quality

Time series: Data has correlation with time


Trend: the general direction the data follows over its entire journey (increasing / decreasing /
irregular)
● The population of India has been increasing over time
Seasonality: a pattern in the data that repeats over a fixed interval of time.
● Sales on Zepto will be higher during weekends
● Sales of ice cream decrease during the winter months
Stationarity: constant mean and variance, and covariance that does not depend on time
Non-stationary: there is trend and/or seasonality in the data

ADF (augmented Dickey-Fuller) test: used to check the stationarity of the data


Naive approach: give full weight to the last observation, i.e. use it directly as the forecast
Moving average: the average of e.g. the last 3 months is used as the forecast for the next period
Exponential smoothing: gives weight to all historical data, but the most recent observations get the highest
weight and the oldest data the lowest
Holt linear model: accounts for trend; Holt-Winters additionally accounts for seasonality
Prophet model / MMM (marketing mix modeling) - e.g. to reallocate the budget among multiple channels
AR - an autoregressive (linear regression) model that predicts future values from past values only
ARMA - combines autoregressive (AR) and moving-average (MA) terms
ARIMA - ARMA plus differencing (the "I", integrated) to handle non-stationary data
SARIMA - ARIMA plus seasonal terms (see the sketch below)
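A minimal sketch of an ADF stationarity check and an ARIMA forecast with statsmodels; the series is synthetic and the (p, d, q) order is illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # random walk: non-stationary

stat, pvalue = adfuller(series)[:2]
print("ADF p-value:", round(pvalue, 3))        # p > 0.05 suggests non-stationarity

model = ARIMA(series, order=(1, 1, 1)).fit()   # d=1 differences the series once
print(model.forecast(steps=5))                 # forecast the next 5 points
```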

Helpful site
SQL-Query

Intro
Brief summary +
Initially - PM, built a CDP, customer segmentation, smart responder system, deployed as well: REST API,
Flask, Docker +
Different tools & technologies - AWS +
Later - worked for clients, ad-hoc requests +
Built CS360 +
EA, Project, skill
What does EA do?
● Express Analytics - (B2B / B2C) - provides services to clients
● Express Analytics created a CDP - Customer Data Platform
● It stores data and applies various DS/ML models - predictive, segmentation, classification

EA Sector & Model


● Retail (RFM, Lookalike), Finance (CLTV), Hospitality (feedback)
● Various data science models (RFM, Lookalike, Idres, query generator, VOCA, Smart responder)

Work summary in terms of technology


● NLP, classification problems, segmentation models, data preprocessing, data scraping, deployment
● Other - statistics, maths

Attribution Model
Objective: to find out which channel drives the most conversions (final transactions)
Objective: to find out how much revenue is actually generated by putting ads on a particular channel
Channels: Google CPC, Google organic, Facebook, Instagram, Twitter, Criteo, landing page
Attribution answer: we learn the weight of each channel, i.e. how much each channel contributes to conversion
Business case: it can help close the gap in budget allocation across media channels

Past Experience/Work/My role at EA - Briefly Explain/Roles and responsibilities?


● Worked on various projects and DS models
   ○ NLP-based project, customer segmentation & classification problems, DS model deployment,
   SQL
   ○ Internship - data preprocessing/wrangling, SQL, NER model
Data preprocessing/wrangling
   ■ Load and transform data
   ■ Identify missing values
   ■ Encode categorical variables (LabelEncoder)
   ■ Correct the data types
   ■ Calculate ROI

● NLP-based model - to convert natural language statements into SQL queries
   ○ NLP tools - numpy, pandas, nltk, lemmatization, text matching algorithms, text tokenization

● RFM and lookalike models - customer segmentation and classification problems


   ○ RFM - for segmentation - find customers likely to come back and make a purchase again
      ■ pandas, segmentation based on quartiles (qcut)
   ○ Lookalike - to find high-value customers - a classification problem
      ■ CLTV model, sklearn, RF, XGBoost, SVM, LR
● Docker and API management - responsible for Docker builds & DS model deployment on Linux
   ○ Basic shell commands: echo, vi, cat
● Hands-on SQL - basic commands like SELECT, WHERE, GROUP BY, HAVING, JOIN - to fetch, update, and join tables
● Attribution model - a resource (budget) allocation problem
Proficiency in Python? - NLP, classification problems, segmentation models, data preprocessing, data scraping
● Tools & tech - numpy, pandas, sklearn, plotly, nltk, lemmatization, stemming, spaCy NER
   ○ NLP - nltk, stemming (caring -> car), lemmatization (caring -> care), spaCy (vocabulary),
   fuzzywuzzy, Levenshtein distance, matching algorithms, spaCy NER, sklearn
