
Data Preprocessing Steps for Machine Learning in Python (Part 1)

Learn with Nas
Published in Women in Technology · 14 min read · Sep 30, 2023


Data Preprocessing, also recognized as Data Preparation or Data Cleaning, encompasses the practice of identifying and rectifying erroneous or
misleading records within a dataset. This involves pinpointing flawed,
incomplete, or irrelevant segments of the data and subsequently modifying,
substituting, or eliminating this impure or coarse data [1]. Data
Preprocessing techniques have been adapted to train AI models, including
machine learning models. The techniques are generally used at the early
stages to ensure accurate results [2]. Please be aware that data preprocessing
is a comprehensive term covering a wide range of tasks, spanning from
formatting the data to creating features, all depending on the nature of your
AI project.

This preparatory phase not only enhances the overall quality of the data but
also streamlines the modelling process, ultimately leading to more reliable
and accurate predictive models. This article delves into the vital role that
Data Preprocessing plays in the context of Machine Learning, shedding light
on its various aspects and emphasizing its necessity for achieving
meaningful and impactful results.

Why is it important?
The significance of Data Preprocessing in Machine Learning cannot be
overstated, as it forms the cornerstone of any successful data analysis or
machine learning endeavour. In the realm of data-driven technologies, the
quality and suitability of data directly influence the outcomes and
effectiveness of machine learning models.

Data Preprocessing involves a series of steps such as:


1. Data Collection

2. Data Cleaning

3. Data Transformation

4. Feature Engineering: Scaling, Normalization and Standardization

5. Feature Selection

6. Handling Imbalanced Data

7. Encoding Categorical Features

8. Data Splitting

Step 1: Data Collection


The cornerstone of machine learning is rooted in data. Collecting data
involves gathering information aligned with the goals and objectives of your
AI project. If you feed subpar or low-quality data into your model, it will not
produce satisfactory outcomes. This holds true regardless of the model’s
complexity, the expertise of the data scientist, or the financial investment in
the project [3].

While some companies have been accumulating data for years, ensuring a
steady supply for machine learning, those lacking sufficient data can turn to
reference datasets available online to complete their AI projects. Discovering
new data and sharing it can be achieved through three methods:
collaborative analysis (DataHub), web (Google Fusion Tables, CKAN, Quandl,
and Data Market), and a combination of collaboration and web use (Kaggle).
Additionally, there are specialized data retrieval systems, including data
lakes (Google Data Search) and web-based platforms (WebTables) [3].

Suppose we have all the necessary data; we can proceed with creating a
dataset.

# import library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the data


df = pd.read_csv('data/credit_scoring_eng.csv')

df.head(5)

Result:
Data Description

children - number of children in the family

days_employed - number of days employed

dob_years - client's age in years

education - client's education level

education_id - education identifier

family_status - marital status

family_status_id - marital status identifier

gender - client's gender

income_type - type of employment

debt - whether the client has a loan debt

total_income - monthly income

purpose - purpose of the loan application

Step 2: Data Cleaning


This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers and duplicates. Various techniques can be
used for data cleaning, such as imputation, removal or transformation [4].
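
Before applying these steps to the credit-scoring dataset, here is a minimal sketch of what those three options typically look like in pandas; the DataFrame raw_df and its values are hypothetical and used only for illustration:

import pandas as pd
import numpy as np

# hypothetical example data with a missing value and an extreme income
raw_df = pd.DataFrame({'age': [25, 32, np.nan, 41],
                       'income': [30000, 45000, 52000, 9000000]})

# imputation: fill the missing age with the column median
raw_df['age'] = raw_df['age'].fillna(raw_df['age'].median())

# removal: drop any rows that still contain missing values
raw_df = raw_df.dropna()

# transformation: cap extreme incomes at the 95th percentile
raw_df['income'] = raw_df['income'].clip(upper=raw_df['income'].quantile(0.95))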

Implementing in Python:
2-a: Handling missing values

# check the dataset information


df.info()

Result:

Findings:
There are missing values in the days_employed and total_income columns, since these columns contain fewer non-null rows than the expected 21,525.

# check the percentage


df.isna().sum() / len(df)

Result:

Findings:

The missing-value percentage for both columns is around 10%.

# Visualizing Missing Data using a seaborn heatmap.
plt.figure(figsize=(10,6))
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

Result:

Findings:

1. The missing values form a pattern: clients with the job types 'student' and 'unemployed' do not have any income, so the 'days_employed' and 'total_income' columns are left empty for them.
2. This conclusion is reinforced by the pattern shown in the seaborn
heatmap, indicating that when the value in the ‘days_employed’ column is
missing, the data in the same row for ‘total_income’ is also missing
(symmetrical).

3. Since the missing values are only present in the ‘days_employed’ and
‘total_income’ columns, and both of these columns have float data types,
which fall under the Numeric/Ratio category, the missing data will be filled
using statistical calculations (such as Mean, Median).

4. The median is chosen to fill in the missing values because, unlike the mean, it is robust to outliers [5].

# function to fill in missing values using median
def data_imputation(data, column_grouping, column_selected):
    # Parameter meaning
    # data => The name of the dataframe to be processed
    # column_grouping => The column used to group values and take the median
    # column_selected => The column in which we will fill its NaN values

    # Get unique category groups
    group = data[column_grouping].unique()

    # Loop through each value in the group category
    for value in group:
        # get the median of the non-missing values within this group
        median = data.loc[(data[column_grouping]==value) & ~(data[column_selected].isna()), column_selected].median()

        # change missing values in this group to the group median
        data.loc[(data[column_grouping]==value) & (data[column_selected].isna()), column_selected] = median

    # Return the dataframe after filling the missing values
    return data

# apply the function to 'total_income' column
df = data_imputation(data=df, column_grouping='age_category', column_selected='total_income')

# apply the function to 'days_employed' column
df = data_imputation(data=df, column_grouping='age_category', column_selected='days_employed')

# check the dataset information again
df.info()
Result:
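
As a side note, the same group-wise median imputation can be written more compactly with pandas' groupby().transform(); a minimal sketch, assuming the age_category column used above already exists:

# equivalent group-wise median imputation using transform()
for col in ['total_income', 'days_employed']:
    df[col] = df[col].fillna(df.groupby('age_category')[col].transform('median'))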

2-b: Handling outliers


# check outlier in children column
sns.boxplot(df['children'])

# check statistical data in children column


df['children'].describe()
Findings:

1. Based on the statistical data above, I will replace the value 20 with the
value 2, assuming it was an input error.

2. I will remove the minus sign (-), assuming it was an input error

# replace the value 20 with the value 2


condition_children = df['children']==20
df['children'] = df['children'].mask(condition_children, 2)

# remove minus sign


df['children'] = abs(df['children'])
# check outliers in days_employed column
sns.boxplot(df['days_employed'])

# check the percentage of rows with negative values or extreme outliers
len(df.loc[(df['days_employed'] < 0) | (df['days_employed'] > 200000)]) / len(df)

Result:

0.8990011614401858

Findings:

1. There are 2 issues identified in the ‘days_employed’ column:

Too many digits after the decimal point.

Existence of negative values and outliers, with a high percentage of rows having these conditions (approximately 89%).

2. The steps to solve these issues are as follows:

Remove the minus sign (-).

Perform rounding.

Replace the outlier values.

# remove minus sign (-), assuming it was an input error


df['days_employed'] = abs(df['days_employed'])
# round
df['days_employed'] = round(df['days_employed'],0)

# check data distribution


df['days_employed'].describe()

Result:

Findings:

The mean does not represent the data well because it is distorted by the outliers. Therefore, the outliers will be replaced with the median value.

# Replace outlier with median
condition_de = (df['days_employed'] > 200000) & (df['days_employed'].notnull())
df['days_employed'] = df['days_employed'].mask(condition_de, df['days_employed'].median())

# verify the result


sns.boxplot(df['days_employed'])

Result:
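
The threshold of 200,000 days used above was chosen by inspecting the data. A more general alternative for flagging outliers is the interquartile range (IQR) rule; the sketch below illustrates it on the same column and is an addition to the article's approach, not the method used above:

# IQR-based outlier fences for days_employed (alternative approach)
q1 = df['days_employed'].quantile(0.25)
q3 = df['days_employed'].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# share of rows falling outside the IQR fences
outlier_mask = (df['days_employed'] < lower_bound) | (df['days_employed'] > upper_bound)
print('Share of IQR outliers:', outlier_mask.mean())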

2-c: Handling Duplicates


Findings

1. There are 72 identified duplicate data entries.

2. These duplicate data entries will be removed, and the index will be reset.

# remove duplicate data and do reset index


df = df.drop_duplicates().reset_index(drop=True)
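
To double-check the cleanup, the duplicate count can be recomputed with duplicated(); after drop_duplicates() it should be zero. A minimal sketch:

# verify that no fully duplicated rows remain after the cleanup
print('Remaining duplicates:', df.duplicated().sum())  # expected: 0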

You can visit my GitHub account to access the complete code related to the
above example:

Step 3: Data Transformation


Data transformation involves technically converting data from one format,
standard, or structure to another, without changing the dataset’s content.
This is typically done to prepare the data for consumption by an application
or user, or to enhance data quality. The specifics of data transformation can
vary based on the techniques employed. In this article, I will utilize a data
aggregation technique for data transformation [6].

Data aggregation is a method used to present data in a summarized form.


Given the likelihood of data originating from diverse sources, combining all
incoming data into a cohesive description is the essence of data aggregation.
This facet of data processing holds significance as it hinges on the quality
and quantity of the data at hand. An illustrative example of this process is
generating an annual sales report by consolidating quarterly or monthly data
[7].
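
As a tiny illustration of that idea, the sketch below aggregates hypothetical monthly sales figures into an annual report; the DataFrame sales_df and its columns are invented for this example:

import pandas as pd

# hypothetical monthly sales records
sales_df = pd.DataFrame({
    'year':  [2022, 2022, 2022, 2023, 2023],
    'month': [10, 11, 12, 1, 2],
    'sales': [120, 95, 180, 60, 75],
})

# aggregate the monthly figures into annual totals
annual_sales = sales_df.groupby('year')['sales'].sum().reset_index()
print(annual_sales)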

There are many ways to aggregate data in Pandas, including:

a. Utilizing the groupby() function: Grouping involves breaking down a dataset into smaller subsets depending on specific variables. This approach
is widely employed for data exploration and analysis purposes. The pandas’
groupby() function is highly versatile and allows for the grouping of data
based on one or more columns. By using the groupby() function, we can
group data according to selected variables and subsequently apply a range of
aggregation functions to these groups [8].
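
For instance, a single groupby() call can apply several aggregation functions at once; a minimal self-contained sketch with invented game-sales data:

import pandas as pd

# hypothetical game sales data
games_df = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'XOne', 'XOne', 'PC'],
    'total_sales': [1.2, 0.8, 0.5, 1.1, 0.3],
})

# group by platform and apply several aggregation functions at once
summary = games_df.groupby('platform')['total_sales'].agg(['count', 'mean', 'sum'])
print(summary)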
Implementing in Python:

We will use groupby() to analyze whether user reviews and professional (critic) reviews influence platform sales:

# preparing the dataset
# (the trailing columns and the aggregation are reconstructed, as the original lines are cut off)
top2_ref_df = reference_df.groupby(['platform', 'name'])[['total_sales', 'critic_score', 'user_score']].sum().reset_index()
top2_ref_df = top2_ref_df[['name', 'platform', 'total_sales', 'critic_score', 'user_score']]
top2_ref_df

b. Using the pivot_table() function: We've explored how the GroupBy concept enables us to investigate connections within a dataset. A pivot table is a
similar operation commonly encountered in spreadsheet software and other
programs working with tabular data. When using a pivot table, input data in
a column-wise format is organized into a two-dimensional table, offering a
multidimensional summary of the information. Distinguishing between
pivot tables and GroupBy can be confusing at times. It’s helpful to consider
pivot tables as essentially a multidimensional form of GroupBy aggregation.
In other words, you perform the split-apply-combine process, but in this
case, both the splitting and combining occur not along a one-dimensional
index, but across a two-dimensional grid [9].
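
Put differently, a pivot table arranges one grouping variable along the rows and another along the columns; a minimal self-contained sketch with invented data:

import pandas as pd

# hypothetical sales data with a region dimension
games_df = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'XOne', 'XOne'],
    'region':   ['NA', 'EU', 'NA', 'EU'],
    'sales':    [1.2, 0.8, 0.5, 1.1],
})

# pivot: platforms as rows, regions as columns, summed sales in the cells
pivot = pd.pivot_table(data=games_df, index='platform', columns='region',
                       values='sales', aggfunc='sum')
print(pivot)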

Implementing in Python:

In order to analyze the top 5 platforms in the NA, EU, and JP regions and to visualize the variation in market share from one region to another:

# data aggregation of sales for each platform in NA, EU, JP regions
# (the regional sales column names below are assumed, as the original line is cut off)
agg_selected_region_platform = pd.pivot_table(data=reference_df, index='platform',
                                              values=['na_sales', 'eu_sales', 'jp_sales'],
                                              aggfunc='sum')
agg_selected_region_platform

We can visualize the data:

# visualizing the sales for each platform in NA, EU, JP regions
plt.figure(figsize=(20,6))
plt.title('Distribution of game sales on each platform in the EU, NA, and JP regions')
sns.lineplot(data=agg_selected_region_platform)
plt.show()

You can visit my GitHub account to access the complete code related to the
above example:


Step 4: Feature Engineering: Scaling, Normalization and Standardization

Feature engineering constitutes a pivotal stage in the creation of accurate
and efficient machine learning models. A significant facet of feature
engineering involves scaling, normalization, and standardization,
encompassing the alteration of data to enhance its suitability for modeling.
Employing these methods can enhance model accuracy, mitigate the
influence of outliers, and ensure uniformity in data scale. This article delves
into the fundamentals of scaling, normalization, and standardization [10].

Feature Scaling
Feature scaling is a crucial step in data preprocessing, aiming to standardize
the values of features or variables within a dataset to a uniform scale. The
primary objective is to ensure that all features have a fair influence on the
model, avoiding the dominance of features with higher values. The necessity
for feature scaling arises when working with datasets that encompass
features having diverse ranges, units of measurement, or orders of
magnitude. In such scenarios, discrepancies in feature values can introduce
bias in model performance or hinder the learning process. Through the
application of feature scaling, the features in a dataset can be harmonized to
a consistent scale, simplifying the construction of precise and efficient
machine learning models. Scaling promotes meaningful feature
comparisons, enhances model convergence, and prevents specific features
from dominating others solely based on their magnitude [10].

Why Should We Use Feature Scaling?


Certain machine learning algorithms exhibit sensitivity to feature scaling,
whereas others remain mostly unaffected by it. Let’s delve into a detailed
examination of this aspect.

1. Gradient Descent Based Algorithms

Machine learning algorithms that use gradient descent as an optimization technique (like linear regression, logistic regression, etc.) require data to be scaled [10].

2. Distance-Based Algorithms
Algorithms based on distance metrics, such as K-nearest neighbors (KNN),
K-means clustering, and support vector machines (SVM), are highly
influenced by the range of features. This is because these algorithms rely on
calculating distances between data points to ascertain their similarity [10].

Implementing in Python:

A function will be created to calculate the distance using the k-nearest neighbors algorithm based on two distance metrics: Euclidean and Manhattan. We will then compare the distance results on both unscaled and scaled data.

the df dataset:
# import library
import sklearn.neighbors

# function for calculating kNN distance
def get_knn(df, n, k, metric):

    """
    Display k nearest neighbors:
    param df: Pandas DataFrame used to find similar objects within it
    param n: number of the object for which k nearest neighbors are sought
    param k: number of k nearest neighbors to be displayed
    param metric: name of the distance metric
    """

    nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=metric)
    nbrs.fit(df[feature_names])
    nbrs_distances, nbrs_indices = nbrs.kneighbors([df.iloc[n][feature_names]], k, return_distance=True)

    df_res = pd.concat([
        df.iloc[nbrs_indices[0]],
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
    ], axis=1)

    return df_res

Using unscaled data (df):

# euclidean metric - unscaled data


get_knn(df, 1, 50, 'euclidean')

Result:
# manhattan metric - unscaled data
get_knn(df, 1, 50, 'manhattan')
Findings:

When using unscaled data, the results are the same for both distance metrics: the object at index 1 has the same nearest neighbors (indices 3920, 4948, 2528, 3593) under both Euclidean and Manhattan distance.

Using scaled data:

For instance, age and income have different scales (age = years, income =
dollars), hence data scaling is necessary.

MaxAbsScaler is utilized to scale the data by its maximum absolute value; that is, each observation is divided by the maximum absolute value of the variable. The result of this transformation is a distribution whose values roughly vary within the range of -1 to 1.

# scaling the data using MaxAbsScaler
import sklearn.preprocessing

feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())
df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

the df_scaled dataset:

# euclidean metric - scaled data


get_knn(df_scaled, 1, 10, 'euclidean')
# manhattan metric - scaled data
get_knn(df_scaled, 1, 10, 'manhattan')

The question is: does unscaled data affect the kNN algorithm? If it does, how does it affect it?

Yes. When the data is not scaled, the results are the same regardless of the metric used, because the feature with the largest values dominates the distance calculation. As a result, the nearest neighbors found may be inaccurate due to the differences in scale between the columns.

In calculations, it is important to maintain a consistent scale as much as possible. For example, age and income have different scales (age = years, income = dollars).

You can visit my GitHub account to access the complete code related to the
above example:


Normalization
Normalization, a data preprocessing approach, standardizes feature values
within a dataset to a consistent scale. This is carried out to streamline data
analysis and modeling, mitigating the influence of disparate scales on
machine learning model accuracy [10].

Implementing in Python:

# import library
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
std = MinMaxScaler().fit(X_train)

# transform training data
X_train_std = std.transform(X_train)

# transform testing data
X_test_std = std.transform(X_test)
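
For reference, MinMaxScaler implements the familiar min-max formula, x_scaled = (x - x_min) / (x_max - x_min), which maps each feature to the range [0, 1]. A minimal sketch of the equivalent manual calculation on a hypothetical DataFrame sample_df (invented for illustration):

import pandas as pd

# hypothetical toy data
sample_df = pd.DataFrame({'age': [23, 35, 47, 59],
                          'income': [20000, 35000, 52000, 90000]})

# manual min-max normalization: (x - min) / (max - min), column by column
sample_scaled = (sample_df - sample_df.min()) / (sample_df.max() - sample_df.min())
print(sample_scaled)  # every column now lies in the range [0, 1]

Note also that in the snippet above the scaler is fit on the training data only and then reused to transform the test data; fitting it on the full dataset would leak information from the test set into training.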

Standardization
Standardization, a form of scaling, involves centering values around the
mean and adjusting the standard deviation to one unit. Consequently, the
attribute’s mean becomes zero, and the resulting distribution maintains a
unit standard deviation [10].
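
In formula form, standardization computes z = (x - mean) / standard deviation for every value. A minimal sketch on a hypothetical Series (invented for illustration):

import pandas as pd

# hypothetical toy data
income = pd.Series([20000, 35000, 52000, 90000])

# manual standardization: subtract the mean, divide by the standard deviation
income_std = (income - income.mean()) / income.std()
print(round(income_std.mean(), 10), round(income_std.std(), 10))  # mean ~ 0, std = 1

(Pandas uses the sample standard deviation by default, whereas scikit-learn's StandardScaler uses the population standard deviation, so the two can differ very slightly.)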

Now, let’s proceed with utilizing scikit-learn’s StandardScaler for


standardizing features. This process involves eliminating the mean and
adjusting the scale to unit variance, ultimately resulting in a mean of 0 and a
standard deviation of 1. This aligns the data with a standard normal
distribution. [11].

Implementing in Python:

We will compare the metric results before and after applying StandardScaler.

Before standardizing the features:

# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

beforeScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
beforeScaling_lr.fit(features_train, target_train)
# predict using validation set
y_predict_valid_lr = beforeScaling_lr.predict(features_valid)
# measuring probability using validation set
y_probability_valid_lr = beforeScaling_lr.predict_proba(features_valid)[:, 1]
# evaluate performance using the F1 score and the AUC-ROC score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Standardizing the features:

# standardizing the features using StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

features_train[df_numerical] = scaler.fit_transform(features_train[df_numerical])
features_valid[df_numerical] = scaler.transform(features_valid[df_numerical])
features_test[df_numerical] = scaler.transform(features_test[df_numerical])

After standardizing the features:

afterScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
afterScaling_lr.fit(features_train, target_train)
# predict using validation set
y_predict_valid_lr = afterScaling_lr.predict(features_valid)
# measuring probability using validation set
y_probability_valid_lr = afterScaling_lr.predict_proba(features_valid)[:, 1]
# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Findings:

Both the F1 score and the AUC-ROC score increased after standardizing the features.

You can visit my GitHub account to access the complete code related to the
above example:

In Part 2, I will delve into topics such as Feature Selection, Handling Imbalanced Data, Encoding Categorical Features, and Data Splitting. Keep an eye out for this continuation, where we'll explore these essential steps in detail!

References:

1. Shaomin Wu, A review on coarse warranty data and analysis (2013)

2. George Lawton, Data Preprocessing (2022)
https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing

3. Yuliia Kniazieva, What is Data Collection in Machine Learning (2022)
https://labelyourdata.com/articles/data-collection-methods-AI

4. Deepak Jain, Data Preprocessing in Data Mining (2023)
https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

5. https://stats.stackexchange.com/questions/143700/which-is-better-replacement-by-mean-and-replacement-by-median

6. Chiradeep BasuMallick, What Is Data Transformation? Types, Tools, and Importance (2022)
https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/

7. Data Science Wizards, Introduction to Data Transformation (2023)
medium.com

8. Jake VanderPlas, Aggregation and Grouping, Python Data Science Handbook
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html

9. Jake VanderPlas, Pivot Tables, Python Data Science Handbook
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html

10. Aniruddha Bhandari, Feature Engineering: Scaling, Normalization and Standardization (2023)
www.analyticsvidhya.com

11. Scikit-learn documentation, StandardScaler
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

12. Nate Rosidi, Advanced Feature Selection Techniques for Machine Learning Models (2023)
www.kdnuggets.com
