Generate Test Datasets for Machine Learning
Last Updated :
11 Apr, 2023
Whenever we think of machine learning, the first thing that comes to mind is a dataset. Many datasets are available on sites such as Kaggle, but it is sometimes useful to generate your own. A synthetic dataset gives you control over its size, noise level, and structure, which makes it convenient for testing and comparing machine-learning models. In this article, we will generate random datasets using the sklearn.datasets module in Python.
Generate test datasets for Classification:
Binary Classification
Example 1: make_circles() generates 2D binary classification data as a large circle containing a smaller one, so the two classes are separated by a circular decision boundary.
Python3
# Import necessary libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
# Generate 2d classification dataset
X, y = make_circles(n_samples=200, shuffle=True,
                    noise=0.1, random_state=42)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_circles()
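The circular layout can also be checked numerically. The following is a minimal sketch, assuming noise is left at its default so that every point lies exactly on its circle; factor=0.5 is an illustrative choice and is not used in the example above.

```python
import numpy as np
from sklearn.datasets import make_circles

# No noise argument: points sit exactly on the two circles.
# factor=0.5 (illustrative) sets the inner circle's radius relative to the outer one.
X, y = make_circles(n_samples=200, factor=0.5, random_state=42)

# Distance of every point from the origin
radius = np.linalg.norm(X, axis=1)

outer_radius = radius[y == 0].mean()  # label 0 is the outer circle
inner_radius = radius[y == 1].mean()  # label 1 is the inner circle
print(outer_radius, inner_radius)
```

Because the classes differ only in their distance from the origin, a straight line cannot separate them, which is why this dataset is often used to test nonlinear classifiers.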
Example 2: make_moons() generates 2D binary classification data shaped as two interleaving half circles.
Python3
# Import the necessary libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# Generate 2d classification dataset
X, y = make_moons(n_samples=500, shuffle=True,
                  noise=0.15, random_state=42)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_moons()
Multi-Class Classification
Example 1: make_blobs() generates isotropic Gaussian blobs of points. The data can be used for multi-class classification as well as for clustering.
Python3
# Import the necessary libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2, random_state=23)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_blobs()
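make_blobs() also accepts explicit center coordinates instead of a count. The sketch below assumes three hypothetical centers and a small cluster_std (both chosen only for illustration) and checks that each label's points average out close to the corresponding center.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical blob centers, chosen only for illustration
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])

X, y = make_blobs(n_samples=600, centers=centers,
                  cluster_std=0.5, random_state=23)

# Points with label k are drawn around centers[k]
blob_means = np.array([X[y == k].mean(axis=0) for k in range(len(centers))])
print(np.round(blob_means, 2))
```

Fixing the centers and the spread this way makes it easy to control how much the classes overlap.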
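Example 2: make_classification() requires its parameters to be consistent: n_informative + n_redundant + n_repeated must not exceed n_features, and n_classes * n_clusters_per_class must not exceed 2**n_informative. When shuffle=False, the informative, redundant and repeated features occupy the columns X[:, :n_informative + n_redundant + n_repeated].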
Python3
# Import the necessary libraries
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
# Generate 2d classification dataset
X, y = make_classification(n_samples=100,
                           n_features=2,
                           n_redundant=0,
                           n_informative=2,
                           n_repeated=0,
                           n_classes=3,
                           n_clusters_per_class=1)
# Plot the generated datasets
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
Output:
make_classification()
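The relationship between n_classes, n_clusters_per_class and n_informative can be seen by trying an inconsistent combination. This is a minimal sketch: the first call mirrors the example above (with a random_state added here for reproducibility), and the second deliberately asks for more clusters than two informative features can hold.

```python
from sklearn.datasets import make_classification

# Valid: 3 classes * 1 cluster per class = 3 <= 2**2 = 4
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1, random_state=23)
n_labels = len(set(y.tolist()))
print(X.shape, n_labels)

# Invalid: 3 classes * 2 clusters per class = 6 > 2**2 = 4
try:
    make_classification(n_samples=100, n_features=2, n_informative=2,
                        n_redundant=0, n_repeated=0, n_classes=3,
                        n_clusters_per_class=2, random_state=23)
    raised = False
except ValueError as err:
    raised = True
    print("Rejected:", err)
```

Raising n_informative (and n_features with it) is the usual way to make room for more classes or clusters.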
Example 3: make_multilabel_classification() generates a random multi-label classification dataset, in which each sample can carry several labels at once.
Python3
# Import necessary libraries
from sklearn.datasets import make_multilabel_classification
import pandas as pd
import matplotlib.pyplot as plt
# Generate 2d multi-label classification dataset
X, y = make_multilabel_classification(n_samples=500, n_features=2,
                                      n_classes=2, n_labels=2,
                                      allow_unlabeled=True,
                                      random_state=23)
# Create a pandas DataFrame from the generated dataset
df = pd.concat([pd.DataFrame(X, columns=['X1', 'X2']),
                pd.DataFrame(y, columns=['Label1', 'Label2'])],
               axis=1)
# print() works in plain scripts; display() exists only in notebooks
print(df.head())
# Plot the generated datasets
plt.scatter(df['X1'], df['X2'], c=df['Label1'])
plt.show()
Output:
     X1    X2  Label1  Label2
0  14.0  34.0       0       1
1  30.0  22.0       1       1
2  29.0  19.0       1       1
3  21.0  19.0       1       1
4  16.0  32.0       0       1
make_multilabel_classification()
Generate test datasets for Regression:
Example 1: Generate a 1-dimensional feature and target for linear regression using make_regression()
Python3
# Import necessary libraries
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# Generate 1d Regression dataset
X, y = make_regression(n_samples=50, n_features=1, noise=20, random_state=23)
# Plot the generated datasets
plt.scatter(X, y)
plt.show()
Output:
make_regression()
Example 2: make_sparse_uncorrelated() generates a regression problem with uncorrelated features, of which only the first four influence the target.
Python3
# Import necessary libraries
from sklearn.datasets import make_sparse_uncorrelated
import matplotlib.pyplot as plt
# Generate a regression dataset with 4 features
X, y = make_sparse_uncorrelated(n_samples=100, n_features=4, random_state=23)
# Plot the target against each feature in a 2x2 grid
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
Output:
make_sparse_uncorrelated()
Example 3: make_friedman2() generates a nonlinear regression problem with four features.
Python3
# Import necessary libraries
from sklearn.datasets import make_friedman2
import matplotlib.pyplot as plt
# Generate a regression dataset with 4 features
X, y = make_friedman2(n_samples=100, random_state=23)
# Plot the target against each feature in a 2x2 grid
plt.figure(figsize=(12, 10))
for i in range(4):
    plt.subplot(2, 2, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel('X' + str(i + 1))
    plt.ylabel('Y')
plt.show()
Output:
make_friedman2()
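The target produced by make_friedman2() follows a fixed closed-form expression of the four features. The sketch below recomputes it by hand, assuming noise=0.0 so that the generated target and the formula agree exactly.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# noise=0.0 so the target is a deterministic function of the features
X, y = make_friedman2(n_samples=100, noise=0.0, random_state=23)

# y = sqrt(X0^2 + (X1 * X2 - 1 / (X1 * X3))^2)
y_formula = np.sqrt(
    X[:, 0] ** 2 + (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) ** 2
)

matches = np.allclose(y, y_formula)
print(X.shape, matches)
```

With a non-zero noise value, Gaussian noise is added on top of this expression, so the two arrays would no longer match exactly.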