Better Data Science | Make Synthetic Datasets with Python
● Library imports
● rcParams is only here for plot styling
In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
Make a synthetic dataset
● 1,000 data points described by 2 features
● Perfect (50:50) class distribution
● Binary target variable; each class forms a single cluster
● Make sure to use random_state=42 if you want reproducible results
In [2]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
# 5 random rows
df.sample(5)
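As a quick sanity check (an extra snippet, not part of the original notebook), you can confirm the shape and the roughly even class counts:

# Extra sanity check: dataset shape and class counts
print(X.shape)         # expected: (1000, 2)
print(np.bincount(y))  # expected: roughly equal counts per class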
Visualization
● The plot() function visualizes a synthetic dataset:
In [3]:
def plot(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = '', save: bool = False,
         figname: str = 'figure.png'):
    plt.figure(figsize=(14, 7))
    # Plot each class separately so it gets its own color and legend entry
    plt.scatter(x=df[df[y] == 0][x1], y=df[df[y] == 0][x2], label='y = 0')
    plt.scatter(x=df[df[y] == 1][x1], y=df[df[y] == 1][x2], label='y = 1')
    plt.title(title, fontsize=20)
    plt.legend()
    # Optionally save the figure to disk before showing it
    if save:
        plt.savefig(figname, dpi=300, bbox_inches='tight', pad_inches=0)
    plt.show()
In [4]:
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes')
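The plot() helper above also accepts save and figname arguments, so you can write the figure to disk in the same call (a usage sketch; the file name is arbitrary):

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes',
     save=True, figname='two_classes.png')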
Adding noise
● You can use the flip_y parameter to add noise
● From the docs:
○ The fraction of samples whose class is assigned randomly. Larger
values introduce noise in the labels and make the classification
task harder. Note that the default setting flip_y > 0 might lead to
less than n_classes in y in some cases.
In [5]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Added noise')
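One way to see the effect of the noisy labels is to cross-validate a simple model (a sketch using a classifier that is not part of the original notebook); with flip_y=0.15 the achievable accuracy is typically noticeably lower than on noise-free data:

# Hypothetical illustration: randomly flipped labels cap the achievable accuracy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(f'Mean CV accuracy with flip_y=0.15: {scores.mean():.3f}')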
Add class imbalance
● A perfect (50:50) class distribution is rarely the case in practice
● You can use the weights parameter to control the distribution
○ weights=[0.95] assigns roughly 95% of the samples to the y = 0 class, leaving only about 5% for y = 1
In [6]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 1)')
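To confirm the resulting split (an extra line, not in the original notebook):

# Roughly 95% of rows should belong to y = 0 and about 5% to y = 1
print(df['y'].value_counts(normalize=True))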
● You can do the opposite:
In [7]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 0)')
Make classification task easier/harder
● You can play around with the class_sep parameter to adjust class separation
● The higher the value, the more separated the classes are
In [8]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification easier')
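Values below the default class_sep=1.0 push the clusters closer together and make the task harder. A sketch along the same lines (the value 0.5 is an arbitrary choice, not from the original notebook):

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=0.5,  # below the default of 1.0, so the classes overlap more
    random_state=42
)
df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification harder')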