How to use datasets.fetch_mldata() in sklearn - Python?
Last Updated: 23 Sep, 2021
mldata.org does not enforce a convention for storing data or naming the columns in a data set. The default behavior of sklearn.datasets.fetch_mldata() works well with the most common cases:
- The data values are stored in the column 'data', and the target values in the column 'label'.
- Alternatively, the first column stores the target values, and the second stores the data values.
- The data array is stored as features by samples (n_features x n_samples), so it needs to be transposed to match the sklearn standard of samples by features.
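The transposition mentioned in the last case can be illustrated without downloading anything. The following is a minimal sketch in plain Python; the tiny `stored` table and the `transpose` helper are made up for illustration and are not part of scikit-learn. With NumPy arrays, the same axis swap is written as `array.T`.

```python
# Hypothetical data set with 3 features and 2 samples,
# stored as features x samples (one row per feature).
stored = [
    [5.1, 4.9],  # feature 0 for sample 0 and sample 1
    [3.5, 3.0],  # feature 1
    [1.4, 1.4],  # feature 2
]


def transpose(rows):
    """Swap the two axes of a rectangular list of lists."""
    return [list(column) for column in zip(*rows)]


# After transposing, each row describes one sample.
samples = transpose(stored)

print(len(stored), len(stored[0]))    # 3 2
print(len(samples), len(samples[0]))  # 2 3
```

After the swap, each row holds all feature values of one sample, which is the samples-by-features layout that sklearn estimators expect.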
fetch_mldata() fetches a machine learning data set by name. If the file is not already available locally, it is downloaded automatically from mldata.org.
The function is provided by the sklearn.datasets package: sklearn.datasets.fetch_mldata()
Syntax: sklearn.datasets.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None)
Parameters:
- dataname: (str) The name of the data set on mldata.org, e.g. "Iris", "mnist", "leukemia", etc.
- target_name: (optional, default: 'label') The name or index of the column containing the target values.
- data_name: (optional, default: 'data') The name or index of the column containing the data.
- transpose_data: (optional, default: True) If True, the loaded data array is transposed.
- data_home: (optional, default: None) The download and cache folder for the data sets. By default, all sklearn data is stored in '~/scikit_learn_data' subfolders.
Returns: data (Bunch), a dictionary-like object. The interesting attributes are: 'data', the data to learn; 'target', the classification labels; 'DESCR', the description of the data set; and 'COL_NAMES', the original names of the data set columns.
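The data_home parameter listed above decides where downloaded files are cached. The following is a minimal sketch of how such a default could be resolved, assuming only the '~/scikit_learn_data' default stated in the parameter list. The helper name `resolve_data_home` is hypothetical and is not the scikit-learn implementation.

```python
import os


def resolve_data_home(data_home=None):
    """Return the cache folder path (hypothetical helper, for illustration)."""
    if data_home is None:
        # Fall back to the default folder named in the parameter list.
        data_home = os.path.join("~", "scikit_learn_data")
    # Expand a leading "~" to the current user's home directory.
    return os.path.expanduser(data_home)


print(resolve_data_home())                 # default cache folder
print(resolve_data_home("~/my_datasets"))  # custom cache folder
```

Passing an explicit folder is useful when the home directory is read-only or when several projects should share one cache.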
Let's see the examples:
Example 1: Load the 'iris' dataset from mldata with transposition disabled, so the array keeps its stored features-by-samples layout.
Python3
# import the fetch_mldata function
from sklearn.datasets.mldata import fetch_mldata

# load the data without transposing it,
# keeping the features-by-samples layout
iris = fetch_mldata('iris', transpose_data=False)

# print only the shape of the data array
# instead of the whole Bunch
print(iris.data.shape)
Output:
(4, 150)
Example 2: Load the MNIST digit recognition dataset from mldata.
Python3
# import fetch_mldata function
from sklearn.datasets.mldata import fetch_mldata
# load data
mnist = fetch_mldata('MNIST original')
# mnist data is very large
# so print the shape of data
print(mnist.data.shape)
Output:
(70000, 784)
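The shape (70000, 784) means 70000 samples with 784 features each, where 784 = 28 x 28 pixels of one digit image. The following is a minimal sketch of turning one flat row back into a 28 x 28 grid. It uses a zero-filled stand-in row instead of real MNIST data, assumes the pixels are stored row by row, and the helper `to_grid` is hypothetical.

```python
# Stand-in for one row of mnist.data: 784 pixel values (all zero here).
flat_row = [0] * 784
side = 28  # 28 * 28 == 784


def to_grid(flat, width):
    """Split a flat sequence into consecutive rows of `width` items."""
    if len(flat) % width != 0:
        raise ValueError("length is not a multiple of width")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


image = to_grid(flat_row, side)
print(len(image), len(image[0]))  # 28 28
```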
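Note: This post applies to scikit-learn version 0.19. fetch_mldata() was deprecated in version 0.20 and removed in version 0.22; later releases provide sklearn.datasets.fetch_openml() for downloading data sets.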