2 - Preprocessing
2
Today
3
Where we want to get to
Instance  Age  Account activity  Owns credit card  Churn
1         24   low               yes               no
2         21   medium            no                yes
3         42   high              yes               no
4         34   low               no                yes
5         19   medium            yes               yes
6         44   medium            no                no
7         31   high              yes               no
8         29   low               no                no
9         41   high              no                no
10        29   medium            yes               no
11        52   medium            no                yes
12        55   high              no                no
13        52   medium            yes               no
14        38   low               no                yes
4
The data set
Recall, a tabular data set (“structured data”):
Target (label, class, dependent variable, response variable) can also be present
Numeric, categorical, …
Majority of tabular modeling techniques assume a single table as input (a feature matrix)
5
Constructing a data set takes work
Merging different data sources
Levels of aggregation, e.g. household versus individual customer
Linking instance identifiers
Definition of target variable
Cleaning, preprocessing, featurization
6
Data Selection
7
Data types
Master data
  Relates to the core entities the company is working with
  E.g. customers, products, employees, suppliers, vendors
  Typically stored in operational databases and data warehouses (historical view)
Transactional data
  Timing, quantity and items
  E.g. POS data, credit card transactions, money transfers, web visits, etc.
  Will typically require a featurization step (see later)
External data
  Social media data (e.g. Facebook, Twitter, e.g. for sentiment analysis)
  Macroeconomic data (e.g. GDP, inflation)
  Weather data
  Competitor data
  Search data (e.g. Google Trends)
  Web scraped data
  Open data: external data that anyone can access, use and share
    Government data (e.g. Eurostat, OECD)
    Scientific data
8
Example: Google Trends
https://medium.com/dataminingapps-articles/forecasting-with-google-trends-114ab741bda4
9
Example: Google Street View
https://arxiv.org/ftp/arxiv/papers/1904/1904.05270.pdf
10
Data types
Tabular, imagery, time series, …
Small vs. big data
Metadata
Data that describes other data
Data about data
Data definitions
E.g. stored in DBMS catalog
Oftentimes lacking, but can help a great deal in understanding the data, and in feature extraction as well
11
Selection
As data mining can only uncover patterns actually present in the data, target data set must
be large/complete enough to contain these patterns
Make absolutely sure you’re not cheating by including explanatory variables which are too “perfectly”
correlated with the target
“Too good to be true”
Is this explanatory variable known before the target outcome, or only after it? Will it be available at the time your model is used, or only later?
Your finished model will not be able to look into the future!
12
Data Exploration
13
Exploration
Visual analytics as a means for initial exploration (EDA: exploratory data analysis)
Boxplots
Scatter plots
Histograms
Basic statistics
Correlation plots (!)
14
Exploration
Anscombe (1973): four data sets with near-identical summary statistics (means, variances, correlations, regression lines) but very different distributions, so visualize rather than rely on basic statistics alone
15
Data Cleaning
16
Cleaning
Consistency
17
Missing values
Many techniques cannot deal with missing values
Not applicable (credit card limit = NA if no credit card owned) versus unknown or not disclosed (age = NA)
Missing at random vs. missing not at random
Detection is easy
18
Missing values
“Detection is easy?”
19
Missing values
20
Missing values
21
Missing values
Common approach: delete rows with too many missing values, impute numeric features with the median and categorical features with the mode, and add a separate indicator column flagging whether the original value was missing
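A minimal pandas sketch of this common approach; the column names, toy values and the "too many missing" threshold are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [24, None, 42, None],
    "account_activity": ["low", "medium", None, "low"],
})

# 1. Delete rows with too many missing values (here: more than half)
df = df[df.isna().mean(axis=1) <= 0.5].copy()

# 2. Add indicator columns flagging the original missingness
for col in ["age", "account_activity"]:
    df[col + "_was_missing"] = df[col].isna().astype(int)

# 3. Impute: median for the numeric feature, mode for the categorical one
df["age"] = df["age"].fillna(df["age"].median())
df["account_activity"] = df["account_activity"].fillna(df["account_activity"].mode()[0])
```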
22
Intermezzo
Always keep the production setting in mind: new unseen instances can contain missing values as well
Don't impute with a newly computed median, but reuse the same "rules" as fitted on the training data
Ideally: do the same when working with validation data!
What if we've never observed a missing value for this feature before?
Use the original training data to construct an imputation
Consider rebuilding your model (monitoring!)
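A sketch of keeping imputation consistent between training and production with scikit-learn's SimpleImputer: the rule is fitted on the training data once and then reused unchanged on validation, test and new unseen instances (the toy arrays are assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[24.0], [21.0], [np.nan], [34.0]])
X_new = np.array([[np.nan], [29.0]])   # unseen instances arriving in production

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                   # learn the median from the training data only

X_train_imp = imputer.transform(X_train)
X_new_imp = imputer.transform(X_new)   # reuse the same median; do not refit on new data
```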
23
Outliers
Extreme observations (age = 241 or income = 3 million)
Treatment:
24
Outliers
25
Outliers
26
Outliers
27
Duplicate rows
Duplicate rows can be valid or invalid
Treat accordingly
28
Transformations: standardization
Standardization: rescale a feature to zero mean and unit variance (≈ N(0, 1) for Gaussian-distributed data)
Good/necessary for Gaussian-distributed data and for some techniques: SVMs, regression models, k-nearest neighbors, neural networks, everything working with Euclidean distance or similarity
Useless for other techniques: decision trees
$$x_{\text{new}} = \frac{x - \mu}{\sigma}$$
29
Transformations: normalization
Normalization: also called “feature scaling”
$$x_{\text{new}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
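A sketch of both transformations with scikit-learn scalers; as with imputation, the parameters (μ, σ, min, max) are fitted on the training data and reused on new data (the toy array is an assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[24.0], [21.0], [42.0], [34.0]])

# Standardization: (x - mu) / sigma
standardizer = StandardScaler().fit(X_train)
X_std = standardizer.transform(X_train)

# Normalization / feature scaling: (x - x_min) / (x_max - x_min)
normalizer = MinMaxScaler().fit(X_train)
X_norm = normalizer.transform(X_train)
```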
30
Transformations: categorization
Also called “coarse classification”, “classing”, “binning”, “grouping”
Continuous → nominal
31
Transformations: categorization
To treat outliers
To make final model more interpretable
To reduce curse of dimensionality following from high number of levels
To introduce non-linear effects into linear models
32
Transformations: dummyfication and other
encodings
Nominal → continuous
“Integer encoding”
Convert to:
account_activity_high = 1
account_activity_medium = 2
account_activity_low = 3
33
Transformations: dummyfication and other
encodings
Nominal → continuous
"Dummy encoding" (one-hot)
Convert to:
account_activity_high ∈ {0, 1}
account_activity_medium ∈ {0, 1}
account_activity_low ∈ {0, 1}
34
Transformations: dummyfication and other
encodings
Nominal → continuous
“Binary encoding”
Convert to:
account_activity_high = 1 → 01
account_activity_medium = 2 → 10
account_activity_low = 3 → 11
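A sketch of the three encodings on the account activity feature, using pandas and the category_encoders package that is linked further on in this deck (the toy data frame is an assumption):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"account_activity": ["high", "medium", "low", "medium"]})

# "Integer encoding": each level gets one integer code
integer_encoded = ce.OrdinalEncoder(cols=["account_activity"]).fit_transform(df)

# Dummy / one-hot encoding: one 0/1 column per level
dummy_encoded = pd.get_dummies(df, columns=["account_activity"])

# "Binary encoding": the integer code written out in bits, one 0/1 column per bit
binary_encoded = ce.BinaryEncoder(cols=["account_activity"]).fit_transform(df)
```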
35
Transformations: high-level categoricals
What if we have a categorical variable with too many levels? Dummy variables don't really solve our problem
Create a pivot table of the attribute versus target and compute the odds
Group variable values having similar odds
Example (attribute values versus the good/bad target):
48  cash  B
60  car   G
…   …     …
37
Transformations: odds based grouping
How to verify which option is better?
Consider the following example taken from the book Credit Scoring and Its Applications:
Attribute         Owner  Rent Unfurnished  Rent Furnished  Parents  Other  No Answer  Total
Goods             6000   1600              350             950      90     10         9000
Bads              300    400               140             100      50     10         1000
Goods/Bads odds   20     4                 2.5             9.5      1.8    1          9
1. Owners, Renters (Rent Unfurnished + Furnished), and Others (every other level)
2. Owners, Parents, and Others
38
Transformations: odds based grouping
Empirical frequencies for Option 1:
E.g. the expected number of good Owners, if the good/bad odds were the same as in the whole population, would be 6300/10000 × 9000/10000 × 10000 = 5670
Chi-square distance:
$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583$$
39
Transformations: odds based grouping
Chi-square distance Option 1:
$$\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583$$
In order to judge significance, the obtained chi-square statistic should (under independence) follow a chi-square distribution with k − 1 degrees of freedom, with k the number of levels (3 in our case). This can then be summarized by a p-value to see whether there is a statistically significant dependence or not.
The analogous computation for Option 2 (Owners, Parents, Others) yields a chi-square value of about 662. Since both options use 3 levels, we can directly compare 662 to 583, and since the former is bigger, conclude that Option 2 is the better coarse classification.
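The comparison can be reproduced programmatically; a sketch with scipy.stats.chi2_contingency on the observed goods/bads counts per grouped level (the grouping of levels follows the two options above):

```python
from scipy.stats import chi2_contingency

# Observed counts per coarse class: goods on the first row, bads on the second
option1 = [[6000, 1950, 1050],   # Owners, Renters, Others
           [300,   540,  160]]
option2 = [[6000,  950, 2050],   # Owners, Parents, Others
           [300,   100,  600]]

for name, table in [("Option 1", option1), ("Option 2", option2)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(name, round(chi2, 1), "df =", dof, "p =", p)
# Option 1 gives a chi-square of roughly 583, Option 2 roughly 662
```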
40
Weights of evidence encoding
Weights of evidence variables can be defined as follows:
$$WoE_{cat} = \ln\left(\frac{p_{c1,cat}}{p_{c2,cat}}\right)$$
$p_{c1,cat}$ = number of instances with class 1 in the category / total number of instances with class 1
$p_{c2,cat}$ = number of instances with class 2 in the category / total number of instances with class 2
41
Weights of evidence encoding
(Example WoE table with columns: Purpose, Total, Negatives, Positives, Distr. Neg., Distr. Pos., WoE)
Can also be used to screen variables: the information value $IV = \sum_{cat} (p_{c1,cat} - p_{c2,cat}) \times WoE_{cat}$; important variables typically have IV > 0.1
Note that we have now been using information coming from the target: so a proper train/test split is even more crucial!
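A sketch of computing WoE per category and the information value of a variable with pandas, on the training data only (the toy frame, the column names and the 0/1 target coding are assumptions):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "purpose": ["cash", "car", "cash", "car", "car", "cash"],
    "target": [1, 0, 1, 0, 1, 0],  # 1 = class 1, 0 = class 2
})

counts = pd.crosstab(train["purpose"], train["target"])
p_c1 = counts[1] / counts[1].sum()   # share of all class-1 instances per category
p_c2 = counts[0] / counts[0].sum()   # share of all class-2 instances per category

woe = np.log(p_c1 / p_c2)            # WoE per category
iv = ((p_c1 - p_c2) * woe).sum()     # information value of the variable

train["purpose_woe"] = train["purpose"].map(woe)  # WoE-encoded feature
```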
42
Weights of evidence encoding
Category boundaries
Optimize so as to maximize IV
https://cran.r-project.org/web/packages/smbinning/index.html
43
Weights of evidence encoding
Number of categories?
Trade-off
Fewer categories because of simplicity, interpretability and stability
More categories to keep predictive power
Laplace smoothing:
$$WoE_{cat} = \ln\left(\frac{p_{c1,cat} + n}{p_{c2,cat} + n}\right)$$
44
Weights of evidence encoding
45
Some other creative approaches
For geospatial data: group nearby communities together, spatial interpolation, …
First build a decision tree only on the one categorical or continuous variable and target
1-dimensional k-means clustering on a continuous variable to suggest groups
Probabilistic transformations and other “Kaggle”-tricks such as Leave One Out Mean (Owen Zhang)
Categorical embeddings
46
Some other creative approaches
category_encoders : https://contrib.scikit-learn.org/category_encoders/
categorical-encoding : https://github.com/alteryx/categorical_encoding
https://www.featurelabs.com/open/
https://github.com/alteryx/categorical_encoding/blob/main/guides/Categorical_Encoding_Methods.pdf
47
Hashing trick
The “hashing trick”: for categoricals with many levels and when it is expected that new
levels will occur in new instances
investment products”
48
Hashing trick
“Hash encoding”: a hash function is any function that can be used to map data of arbitrary
size to fixed-size values
Seemingly random
Designed so that collisions in the output space are rare
Used frequently in encryption / security
“One-way” function
E.g.:
h(“customer”) = 4, h(“desired”) = 2, h(“wanted”) = 37
Each hashed value switches on the corresponding position of a fixed-size vector:
index: 0 1 2 3 … 37 38 …
value: 0 0 1 0 … 1  0 …
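A minimal sketch of the hashing trick with a stable hash mapping any level, including levels never seen during training, to one of a fixed number of positions (the bucket count is an assumption; scikit-learn's FeatureHasher offers a ready-made alternative):

```python
import hashlib
import numpy as np

N_BUCKETS = 64  # fixed output dimension, chosen up front

def hash_bucket(value: str) -> int:
    # Stable (not process-seeded) hash of the level, mapped to a bucket index
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hash_encode(value: str) -> np.ndarray:
    vec = np.zeros(N_BUCKETS)
    vec[hash_bucket(value)] = 1.0   # switch on the hashed position
    return vec

vec = hash_encode("some brand-new product level")  # works for unseen levels too
```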
49
Embeddings
Other “embedding” approaches are possible as well in the textual domain
https://arxiv.org/pdf/1604.06737.pdf
50
Transformations: mathematical approaches
Logarithms, square roots, etc
51
Transformations: interaction variables
When no linear effect is present on x1 ∼ y and x2 ∼ y but there is one on f (x1 , x2 ) ∼ y
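A sketch of adding interaction features, either by hand or with scikit-learn's PolynomialFeatures; the product x1 * x2 is just one common choice of f(x1, x2):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# By hand: append the product x1 * x2 as an extra column
X_manual = np.column_stack([X, X[:, 0] * X[:, 1]])

# With scikit-learn: all pairwise interaction terms, no squares, no bias column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)  # columns: x1, x2, x1*x2
```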
52
Feature Engineering
53
Feature engineering
The goal of the transformations above: creating more informative features!
“The aim of feature engineering is to transform data set variables into features so as to help the analytical models achieve better performance in terms of either predictive performance, interpretability or both.”
54
Feature engineering
https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
55
Feature engineering: RFM features
Already popular since (Cullinan, 1977)
56
Feature engineering: RFM features
$\text{recency} = e^{-\gamma t}$
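RFM stands for recency, frequency and monetary value; a sketch of deriving all three from a transaction table with pandas, including the exponential recency transform above (the column names, reference date and γ are assumptions):

```python
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-01-20"]),
    "amount": [100.0, 40.0, 250.0],
})
reference_date = pd.Timestamp("2024-03-01")
gamma = 0.05  # decay rate of the recency transform

rfm = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)
days_since_last = (reference_date - rfm["last_purchase"]).dt.days
rfm["recency"] = np.exp(-gamma * days_since_last)  # recency = e^(-gamma * t)
```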
57
Feature engineering: time features
Capture information about time aspect by meaningful features
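A sketch of typical features derived from a timestamp with pandas; which ones are meaningful depends on the application, so the selection here is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-03-01 14:30", "2024-12-24 09:00"])})

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek          # 0 = Monday
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = (df["timestamp"].dt.dayofweek >= 5).astype(int)
```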
59
Feature engineering: deltas, trends, windows
Evolutions over time or differences between features are crucial in many settings!
Solution 1: Keep track of an instance through time and add in as separate rows (“panel data analysis”)
Solution 2: One point in time as “base”, add in relative features
60
Feature engineering: deltas, trends, windows
Absolute trends: $\frac{F_t - F_{t-x}}{x}$
Relative trends: $\frac{F_t - F_{t-x}}{F_{t-x}}$
Can be useful for size variables (e.g., asset size, loan amounts) and ratios
Beware of denominators equal to 0!
Can put higher weight on recent values
Extension: time series analysis
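A sketch of absolute and relative trend features over a window of x periods with pandas, guarding against zero denominators (the panel layout and window size are assumptions):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 1, 2, 3],
    "balance": [100.0, 120.0, 90.0, 0.0, 50.0, 60.0],
}).sort_values(["customer_id", "month"])

x = 1  # window length in periods
previous = panel.groupby("customer_id")["balance"].shift(x)

panel["abs_trend"] = (panel["balance"] - previous) / x
panel["rel_trend"] = (panel["balance"] - previous) / previous.replace(0, np.nan)
```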
61
Feature engineering: ordinal variables
Ordinal features have intrinsic ordering in their values (e.g., credit rating, debt seniority)
Thermometer coding
Progressively codes ordinal scale of variable
F1 F2 F3 F4
AAA 0 0 0 0
AA 1 0 0 0
A 1 1 0 0
B 1 1 1 0
C 1 1 1 1
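A sketch of thermometer coding for the rating scale in the table above; the numpy/pandas implementation itself is an assumption:

```python
import numpy as np
import pandas as pd

levels = ["AAA", "AA", "A", "B", "C"]                # ordered from best to worst
rank = {level: i for i, level in enumerate(levels)}  # AAA -> 0, ..., C -> 4

ratings = pd.Series(["AA", "C", "AAA"])
codes = ratings.map(rank).to_numpy()

# F1..F4: Fk = 1 once the rating has moved at least k steps down from AAA
thermometer = (codes[:, None] >= np.arange(1, len(levels))[None, :]).astype(int)
thermometer_df = pd.DataFrame(thermometer, columns=["F1", "F2", "F3", "F4"])
```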
62
Feature engineering: relational data
63
Feature engineering: relational data
dm : https://krlmlr.github.io/dm/
64
Featurization
How do we get a tabular data set out of something non-structured?
65
Featurization
featuretools : An open source python framework for
automated feature engineering,
https://www.featuretools.com/
stumpy : https://github.com/TDAmeritrade/stumpy
tsfresh : https://tsfresh.readthedocs.io/en/latest/
FeatureSelector : https://github.com/WillKoehrsen/feature-selector
OneBM : https://arxiv.org/abs/1706.00327
67
Feature Selection
68
Feature selection
Oftentimes, you end up with lots of features
69
Principal component analysis
http://setosa.io/ev/principal-component-analysis/
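A sketch of PCA as a dimensionality reduction step with scikit-learn; standardizing first matters because PCA is driven by variances (the random data and number of components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 instances, 5 features

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)     # project onto the top 2 principal components
print(pca.explained_variance_ratio_)     # share of variance retained per component
```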
70
Conclusion
71
More on feature engineering and dimensionality
reduction
Stepwise selection
Regularization
Variable importance
Clustering
t-SNE
UMAP
Features from text
Categorical embeddings
72
Sampling
Think carefully about the population on which the model that is going to be built using the sample will be used
Timing of the sample
How far back do I go to get my sample?
Trade-off: more data versus more recent data
The sample taken must be from a normal business period, to get as accurate a picture as possible of the target population
(Sampling will return in a different context later on, when we talk more about validation:
over/under/smart sampling)
73
Conclusion
Pre-processing: many steps and checks
Depends on technique being used later on, which might not yet be certain
Can you apply your pre-processing steps next month, on a future data set, in production?
I.e.: can you apply your pre-processing steps on the test set?
Time consuming! E.g. with SQL / pandas / Spark / …: join columns, remove columns, create aggregates, sort, order, split, …
Not the “fun” aspect of data science
Easy to introduce mistakes
74