Top Inbuilt Datasets in Scikit-Learn Library

Last Updated : 06 Aug, 2024

Scikit-Learn is one of the most popular Python libraries for machine learning. It ships with several inbuilt datasets that are well suited to practising and experimenting with different algorithms. These datasets cover a range of applications, from simple classification tasks to more complex regression problems.

In this article, we will learn about some of the top inbuilt datasets in the Scikit-Learn library.

Iris Dataset
Diabetes Dataset
Digits Dataset
Linnerud Dataset
Wine Dataset
MNIST Dataset

Each of these datasets is described below:

Iris Dataset

It is one of the most famous datasets in machine learning. It consists of 150 samples of iris flowers, with each sample containing four features: sepal length, sepal width, petal length, and petal width. The task is to classify these samples into one of three species: Iris Setosa, Iris Versicolor, or Iris Virginica.

Dataset Structure

Features: 4 (sepal length, sepal width, petal length, petal width)
Target: 3 classes (Setosa, Versicolor, Virginica)
Samples: 150

Loading the Dataset

We can load the Iris dataset using Scikit-Learn’s load_iris() function.

Python

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
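
As a quick sanity check, the loaded object can be inspected to confirm the structure listed above. This is a minimal sketch that only reads attributes of the object returned by load_iris(); the printed shapes follow from the 150 samples and 4 features described earlier.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Shapes of the feature matrix and the target vector
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Human-readable names stored alongside the arrays
print(iris.feature_names)
print(list(iris.target_names))
```

The integer labels in iris.target index into iris.target_names, so label 0 corresponds to the first name in that list.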

Exploring the Data

After loading the dataset, it is important to explore its structure. We can inspect the feature names, target names, and even visualize the data using scatter plots.

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris # Import the function to load the Iris dataset

iris = load_iris() # Load the Iris dataset
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target] # Map class indices to species names for a readable legend
sns.pairplot(df, hue='species')
plt.show()


Diabetes Dataset

This dataset is designed for regression tasks. It consists of 442 samples with 10 features each: age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements. The feature values returned by load_diabetes() are already mean-centered and scaled. The target is a quantitative measure of disease progression one year after baseline.

Dataset Structure

Features: 10
Target Variable: Disease progression score
Total Samples: 442

Loading the Dataset

We can load the Diabetes dataset using the following command:

Python

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
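
Since this dataset targets regression, a small baseline model shows how X and y are typically used. The following is a minimal sketch, assuming an ordinary least-squares model and an 80/20 split; the split ratio and the random_state value are illustrative choices, not something prescribed by the dataset.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# return_X_y=True returns the feature matrix and target directly
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for evaluation (illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a linear baseline and report R^2 on the held-out samples
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.2f}")
```

A linear model is only a starting point here; the score gives a reference value against which other regressors can be compared.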

Exploring the Data

A pair plot helps us understand the distribution of the data and the relationships between the features and the target variable.

Python

from sklearn.datasets import load_diabetes # Import the function to load the diabetes dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

diabetes = load_diabetes() # Load the diabetes dataset
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['progression'] = diabetes.target
sns.pairplot(df)
plt.show()


Digits Dataset

This dataset is a collection of 8x8 pixel images of handwritten digits from 0 to 9. Each image is represented as a flattened 64-feature vector. The goal is to classify each image into one of the ten digit classes.

Dataset Structure

Features: 64 (each representing a pixel in the 8x8 image)
Target Classes: 10 (digits 0-9)
Total Samples: 1,797

Loading the Dataset

We can load the Digits dataset using:

Python

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data
y = digits.target
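
The relationship between the 8x8 images and the 64-feature vectors can be checked directly. This is a minimal sketch that compares digits.images with digits.data; the variable names flattened and same are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# The same samples are exposed in two layouts
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

# Flattening each 8x8 image row by row reproduces the feature vectors
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)
```

Use digits.data when training a model and digits.images when plotting.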

Visualizing the Data

We can visualize the images to better understand the dataset.

Python

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits  # Import the function to load the digits dataset

digits = load_digits()  # Load the digits dataset

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(digits.images[i], cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Digit: {digits.target[i]}")
plt.show()


Linnerud Dataset

This dataset is a multivariate regression dataset that includes three exercise variables (chin-ups, sit-ups, and jumps) and three physiological measurements (weight, waist, and pulse). The goal is to predict the physiological measurements based on the exercise data.

Dataset Structure

Features: 3 (Chin-ups, Sit-ups, Jumps)
Targets: 3 (Weight, Waist, Pulse)
Total Samples: 20

Loading the Dataset

We can load the Linnerud dataset using:

Python

from sklearn.datasets import load_linnerud

linnerud = load_linnerud()
X = linnerud.data
y = linnerud.target
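
Because the target has three columns, this dataset is a convenient way to try multi-output regression. The following is a minimal sketch, assuming a plain linear model fitted on all 20 samples; with so few samples it illustrates the shapes involved and is not a meaningful evaluation.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
X, y = linnerud.data, linnerud.target

# Column names of the exercise features and physiological targets
print(list(linnerud.feature_names))
print(list(linnerud.target_names))

# LinearRegression accepts a 2D target and predicts all three outputs at once
model = LinearRegression().fit(X, y)
pred = model.predict(X)
print(pred.shape)  # (20, 3)
```

Each row of pred holds one predicted value per target column, in the same order as linnerud.target_names.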

Exploring the Data

Given its small size, we can visualize the whole Linnerud dataset in a single pair plot:

Python

from sklearn.datasets import load_linnerud  # Import the function to load the Linnerud dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

linnerud = load_linnerud()  # Load the Linnerud dataset
X = linnerud.data  # Extract the feature data
y = linnerud.target  # Extract the target data

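df_features = pd.DataFrame(X, columns=linnerud.feature_names)
df_targets = pd.DataFrame(y, columns=linnerud.target_names)

# Combine features and targets so the plot covers all six variables
df = pd.concat([df_features, df_targets], axis=1)

sns.pairplot(df)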
plt.show()


Wine Dataset

This dataset consists of chemical analysis results for wines grown in a specific region of Italy. The task is to classify the wine samples into one of three cultivars based on their chemical properties.

Dataset Structure

Features: 13 (chemical properties)
Target Classes: 3 (different wine cultivars)
Total Samples: 178

Loading the Dataset

We can load the Wine dataset using:

Python

from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
y = wine.target
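
To see how the chemical features can be used for classification, a small cross-validated baseline is sketched below. It assumes a pipeline of StandardScaler followed by LogisticRegression with 5-fold cross-validation; the choice of model, the number of folds, and max_iter=1000 are illustrative assumptions rather than a recommended setup.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The features are on very different scales, so standardize before fitting
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy (one score per fold)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```

Putting the scaler inside the pipeline means it is refitted on the training folds only, which avoids leaking information from the validation fold.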

Exploring the Data

It is useful to see how these features relate to the target classes.

Python

from sklearn.datasets import load_wine  # Import the function to load the wine dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

wine = load_wine()  # Load the wine dataset
X = wine.data  # Extract the feature data
y = wine.target  # Extract the target data

df = pd.DataFrame(X, columns=wine.feature_names)  # Build a DataFrame with named feature columns
df['cultivar'] = y

sns.pairplot(df, hue='cultivar')
plt.show()


MNIST Dataset

This dataset is a large database of handwritten digits used extensively in the field of machine learning and computer vision. It contains 70,000 28x28 pixel grayscale images of digits, with the goal being to classify them into their respective digit classes (0-9).

Dataset Structure

Features: 784 (28x28 images flattened into a vector)
Target Classes: 10 (digits 0-9)
Total Samples: 70,000

Loading the Dataset

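Unlike the datasets above, MNIST is not bundled with Scikit-Learn. It is downloaded from OpenML using fetch_openml(), which needs an internet connection on first use and caches the result locally: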

Python

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
X = mnist.data
y = mnist.target.astype('int')

Visualizing the Data

We can visualize some of the digit images to gain insight into the dataset:

Python

import matplotlib.pyplot as plt # Plotting library
from sklearn.datasets import fetch_openml # Import the function to fetch datasets

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1, as_frame=True)
X, y = mnist['data'], mnist['target'] 

# Plot the first 10 images
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(X.iloc[i].values.reshape(28, 28), cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Digit: {y.iloc[i]}")
plt.show()


Conclusion

The Scikit-Learn library includes a collection of inbuilt datasets that are valuable for learning and experimenting with various machine learning techniques. Each dataset serves a distinct purpose, whether for practising classification, regression, or clustering algorithms. These datasets are used not only by beginners but also by experienced practitioners to test new algorithms and approaches. By working with datasets such as Iris, Diabetes, and Wine, we can grasp the fundamentals of data exploration and model building.
