Top Inbuilt DataSets in Scikit-Learn Library
Last Updated: 06 Aug, 2024
Scikit-Learn is one of the most popular libraries of Python for machine learning. This library comes equipped with various inbuilt datasets perfect for practising and experimenting with different algorithms. These datasets cover a range of applications, from simple classification tasks to more complex regression problems.
In this article, we will learn about some of the top inbuilt datasets in the Scikit-Learn library.
Top Inbuilt DataSets in Scikit-Learn Library
Some Top Inbuilt Datasets are mentioned below:
Iris Dataset
It is one of the most famous datasets in machine learning. It consists of 150 samples of iris flowers, with each sample containing four features: sepal length, sepal width, petal length, and petal width. The task is to classify these samples into one of three species: Iris Setosa, Iris Versicolor, or Iris Virginica.
Dataset Structure
- Features: 4 (sepal length, sepal width, petal length, petal width)
- Target: 3 classes (Setosa, Versicolor, Virginica)
- Samples: 150
Loading the Dataset
We can easily load the Iris dataset using Scikit-Learn’s load_iris() function.
Python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
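The object returned by load_iris() carries metadata alongside the arrays. As a minimal sketch (the variable name first_species is only illustrative), the shapes, feature names, and class names can be checked before any plotting or modelling:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Shapes of the feature matrix and the label vector
print(iris.data.shape)
print(iris.target.shape)

# Human-readable column names and class names
print(iris.feature_names)
print([str(name) for name in iris.target_names])

# Map the integer labels (0, 1, 2) of the first five samples to species names
first_species = [str(iris.target_names[label]) for label in iris.target[:5]]
print(first_species)
```

The integer labels in iris.target index directly into iris.target_names, which is handy when a plot legend should show species names instead of numbers.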
Exploring the Data
After loading the dataset, it is important to explore its structure. We can inspect the feature names, target names, and even visualize the data using scatter plots.
Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris # Import the function to load the Iris dataset
iris = load_iris() # Load the Iris dataset
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
sns.pairplot(df, hue='species')
plt.show()
Output:
Iris Dataset
Diabetes Dataset
This dataset is designed for regression tasks. It consists of 442 samples with 10 features each, representing different factors like age, sex, body mass index (BMI), blood pressure, and blood serum measurements. The target is a quantitative measure of disease progression one year after baseline.
Dataset Structure
- Features: 10
- Target Variable: Disease progression score
- Total Samples: 442
Loading the Dataset
We can load the Diabetes dataset using the following command:
Python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
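One detail worth knowing is that, by default, the feature columns returned by load_diabetes() are already mean-centered and scaled so that the squared values in each column sum to 1. The sketch below checks this numerically; the variable names and tolerances are illustrative choices, not part of the library.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# return_X_y=True returns the feature matrix and target directly
X, y = load_diabetes(return_X_y=True)

# Each column should have mean ~0 and sum of squares ~1
col_means = X.mean(axis=0)
col_sq_sums = (X ** 2).sum(axis=0)

print(np.allclose(col_means, 0.0, atol=1e-6))
print(np.allclose(col_sq_sums, 1.0, atol=1e-6))

# The target is left on its original scale
print(y.min(), y.max())
```

Because of this preprocessing, values such as age or BMI appear as small numbers around zero instead of in their original units.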
Exploring the Data
A pair plot shows the distribution of each feature and how the features relate to one another and to the target variable.
Python
from sklearn.datasets import load_diabetes # Import the function to load the diabetes dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
diabetes = load_diabetes() # Load the diabetes dataset
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['progression'] = diabetes.target
sns.pairplot(df)
plt.show()
Output:
Diabetes Dataset
Digits Dataset
This dataset is a collection of 8x8 pixel images of handwritten digits from 0 to 9. Each image is represented as a flattened 64-feature vector. The goal is to classify each image into one of the ten digit classes.
Dataset Structure
- Features: 64 (each representing a pixel in the 8x8 image)
- Target Classes: 10 (digits 0-9)
- Total Samples: 1,797
Loading the Dataset
We can load the Digits dataset using:
Python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
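The same pixels are available in two layouts: digits.images keeps the 8x8 grid, while digits.data holds the flattened 64-value rows. A short sketch (variable names are illustrative) confirming that one is a reshape of the other:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# 2-D image layout versus flattened feature layout
print(digits.images.shape)
print(digits.data.shape)

# Flatten each 8x8 image into a 64-element row and compare with digits.data
flattened = digits.images.reshape(len(digits.images), -1)
same = np.array_equal(flattened, digits.data)
print(same)

# Range of pixel intensities
print(digits.data.min(), digits.data.max())
```

Use digits.images for plotting and digits.data as model input.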
Visualizing the Data
We can visualize the images to better understand the dataset.
Python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits # Import the function to load the digits dataset
digits = load_digits() # Load the digits dataset
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(digits.images[i], cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Digit: {digits.target[i]}")
plt.show()
Output:
Digits Dataset
Linnerud Dataset
This dataset is a multivariate regression dataset that includes three exercise variables (chin-ups, sit-ups, and jumps) and three physiological measurements (weight, waist, and pulse). The goal is to predict the physiological measurements based on the exercise data.
Dataset Structure
- Features: 3 (Chin-ups, Sit-ups, Jumps)
- Targets: 3 (Weight, Waist, Pulse)
- Total Samples: 20
Loading the Dataset
We can load the Linnerud dataset using:
Python
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
X = linnerud.data
y = linnerud.target
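Since the target has three columns, this is a multi-output regression problem. The sketch below fits a single LinearRegression model to all three targets at once. The choice of LinearRegression is an assumption made for illustration, and with only 20 samples the fit is a demonstration of the API, not a meaningful evaluation.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
X, y = linnerud.data, linnerud.target

# Names of the exercise features and the physiological targets
print(linnerud.feature_names)
print(linnerud.target_names)

# LinearRegression accepts a 2-D target and fits one set of coefficients per column
model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# One predicted (Weight, Waist, Pulse) triple per sample
print(predictions.shape)
```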
Exploring the Data
Given its small size, we can visualize the relationships among the Linnerud target variables with a pair plot:
Python
from sklearn.datasets import load_linnerud # Import the function to load the Linnerud dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
linnerud = load_linnerud() # Load the Linnerud dataset
X = linnerud.data # Extract the feature data
y = linnerud.target # Extract the target data
df = pd.DataFrame(X, columns=linnerud.feature_names)
df_targets = pd.DataFrame(y, columns=linnerud.target_names)
sns.pairplot(df_targets)
plt.show()
Output:
Linnerud Dataset
Wine Dataset
This dataset consists of chemical analysis results for wines grown in a specific region of Italy. The task is to classify the wine samples into one of three cultivars based on their chemical properties.
Dataset Structure
- Features: 13 (chemical properties)
- Target Classes: 3 (different wine cultivars)
- Total Samples: 178
Loading the Dataset
We can load the Wine dataset using:
Python
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
y = wine.target
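Before training a classifier it helps to know how the 178 samples are distributed across the three cultivars. A minimal sketch using np.bincount (the variable name class_counts is illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples belong to each integer class label
class_counts = np.bincount(wine.target)

for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples")
```

The classes are not perfectly balanced, so a stratified train/test split is a reasonable default when evaluating models on this dataset.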
Exploring the Data
It is useful to see how these features relate to the target classes.
Python
from sklearn.datasets import load_wine # Import the function to load the wine dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
wine = load_wine() # Load the wine dataset
X = wine.data # Extract the feature data
y = wine.target # Extract the target data
df = pd.DataFrame(X, columns=wine.feature_names) # Build a DataFrame of the features
df['cultivar'] = y
sns.pairplot(df, hue='cultivar')
plt.show()
Output:
Wine Dataset
MNIST Dataset
This dataset is a large database of handwritten digits used extensively in the field of machine learning and computer vision. It contains 70,000 28x28 pixel grayscale images of digits, with the goal being to classify them into their respective digit classes (0-9).
Dataset Structure
- Features: 784 (28x28 images flattened into a vector)
- Target Classes: 10 (digits 0-9)
- Total Samples: 70,000
Loading the Dataset
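Unlike the datasets above, MNIST is not bundled with Scikit-Learn. It is downloaded from OpenML the first time fetch_openml() is called and cached locally afterwards: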
Python
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
X = mnist.data
y = mnist.target.astype('int')
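MNIST is commonly split by position, with the first 60,000 rows used for training and the remaining 10,000 for testing. The helper below, split_by_position, is a hypothetical name introduced for this sketch. It is demonstrated on a small synthetic array so that it runs without the download.

```python
import numpy as np

def split_by_position(X, y, n_train=60_000):
    """Split X and y into leading (train) and trailing (test) parts by row position."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Small synthetic stand-in for the real data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_train, X_test, y_train, y_test = split_by_position(X_demo, y_demo, n_train=6)
print(X_train.shape, X_test.shape)

# With the real dataset (downloads from OpenML on first use):
# from sklearn.datasets import fetch_openml
# X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
# X_train, X_test, y_train, y_test = split_by_position(X, y.astype(int))
```

A random split with train_test_split is an equally valid choice; the positional split simply keeps results comparable with work that uses the conventional partition.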
Visualizing the Data
We can visualize some of the digit images to get a feel for the dataset:
Python
import matplotlib.pyplot as plt # Import the matplotlib library and give it the alias 'plt'
from sklearn.datasets import fetch_openml # Import the function to fetch datasets
# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1, as_frame=True)
X, y = mnist['data'], mnist['target']
# Plot the first 10 images
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flatten()):
    ax.imshow(X.iloc[i].values.reshape(28, 28), cmap='gray')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Digit: {y.iloc[i]}")
plt.show()
Output:
MNIST Dataset
Conclusion
Scikit-Learn ships with a collection of inbuilt datasets that are useful for learning and experimenting with machine learning techniques. Each dataset serves a distinct purpose, whether for practising classification or regression. These datasets are used not only by beginners but also by experienced practitioners to test new algorithms and approaches. Working with datasets such as Iris, Diabetes, and Wine helps build the fundamentals of data exploration and model building.