Practical 2 51

The document shows how to load and summarize datasets from scikit-learn in Python. It loads the iris dataset, which contains measurements of iris flowers, and prints the feature names, the data matrix, the target labels, and the dataset description. It also shows how to download a dataset directly from the OpenML repository, using the mice protein dataset as an example.

Uploaded by

Royal Empire

20BECE30058

SCIKIT LEARN

Importing Datasets from sklearn package


import sklearn

from sklearn import datasets

dir(datasets) #---lists every name in sklearn.datasets: loaders, fetchers, generators and helper functions

'_california_housing', '_covtype',
'_kddcup99',
'_lfw', '_olivetti_faces',
'_openml', '_rcv1',
'_samples_generator',
'_species_distributions',
'_svmlight_format_fast',
'_svmlight_format_io',
'_twenty_newsgroups',
'clear_data_home',
'dump_svmlight_file',
'fetch_20newsgroups',
'fetch_20newsgroups_vectorized',
'fetch_california_housing',
'fetch_covtype', 'fetch_kddcup99',
'fetch_lfw_pairs',
'fetch_lfw_people',
'fetch_olivetti_faces',
'fetch_openml',
'fetch_rcv1',
'fetch_species_distributions',
'get_data_home', 'load_boston',
'load_breast_cancer',
'load_diabetes', 'load_digits',
'load_files', 'load_iris',
'load_linnerud',
'load_sample_image',
'load_sample_images',
'load_svmlight_file',
'load_svmlight_files', 'load_wine',
'make_biclusters', 'make_blobs',
'make_checkerboard', 'make_circles',
'make_classification',
'make_friedman1', 'make_friedman2',
'make_friedman3',
'make_gaussian_quantiles',
'make_hastie_10_2',
'make_low_rank_matrix',
'make_moons',
'make_multilabel_classification',
'make_regression', 'make_s_curve',
'make_sparse_coded_signal',
'make_sparse_spd_matrix',
'make_sparse_uncorrelated',
'make_spd_matrix',
'make_swiss_roll']
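
The listing above follows a naming convention: `load_*`, `fetch_*` and `make_*`. As a minimal sketch (the variable names `public`, `loaders`, `fetchers` and `generators` are illustrative, not from the original notebook), the names can be grouped by prefix instead of being read as one long list:

```python
from sklearn import datasets

# Keep public names only; this skips private modules such as '_openml'.
public = [name for name in dir(datasets) if not name.startswith("_")]

# Group the names by the prefixes visible in the listing above.
# load_*  : mostly small bundled datasets, plus file loaders such as load_svmlight_file
# fetch_* : datasets that are downloaded on first use
# make_*  : generators for synthetic data
loaders = [n for n in public if n.startswith("load_")]
fetchers = [n for n in public if n.startswith("fetch_")]
generators = [n for n in public if n.startswith("make_")]

print("load_* :", loaders)
print("fetch_*:", fetchers)
print("make_* :", generators)
```

The exact contents of each group depend on the installed scikit-learn version.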

https://colab.research.google.com/drive/1qlsKPFMXrWckO3ZoHbBXFniCCWmNLuvG 1
Load Dataset

i=datasets.load_iris() #---Load the dataset


print(type(i)) #---the loader returns a Bunch object
print(i) #---prints every field of the Bunch

<class 'sklearn.utils.Bunch'>
{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
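
Printing the whole Bunch produces a very long output. A shorter way to see what it holds, sketched here with the variable name `iris` instead of `i`, is to list its keys and the array shapes:

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch behaves like a dictionary whose keys are also available as attributes.
print(sorted(iris.keys()))

print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target.shape)  # (150,): one label per sample

# Key access and attribute access return the same underlying array.
print(iris["data"] is iris.data)
```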

features=i.feature_names #---fetch the feature names or the column names
print(features)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
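
The feature names can serve as column labels. The following sketch assumes pandas is installed (it is not imported anywhere in the original notebook) and builds a labelled table from the raw matrix; the name `df` is illustrative:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Attach the feature names to the columns of the raw feature matrix.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.head())  # first five rows with named columns
```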

print(i.data) #---print the loaded dataset feature matrix

[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
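
Instead of scanning the full matrix, a per-feature summary is often easier to read. This is a minimal sketch; the variable names and the choice of the sample standard deviation (`ddof=1`) are assumptions made for illustration:

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

print(X.shape, X.dtype)

# Column-wise statistics: axis=0 reduces over the samples.
mins = X.min(axis=0)
maxs = X.max(axis=0)
means = X.mean(axis=0)
stds = X.std(axis=0, ddof=1)  # sample standard deviation (assumed convention)

for name, low, high, mean, std in zip(iris.feature_names, mins, maxs, means, stds):
    print(f"{name:<20} min={low:.1f} max={high:.1f} mean={mean:.2f} sd={std:.2f}")
```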

target=i.target #---gets the labels associated with the data points


print(target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
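
The label vector can be summarized by counting how often each class occurs. A short sketch using NumPy (the name `counts` is illustrative):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# bincount counts the occurrences of each non-negative integer label.
counts = np.bincount(iris.target)
print(counts)  # [50 50 50]
```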

print(i.target_names) #--displays the target names associated with values 0 and 1 and 2

['setosa' 'versicolor' 'virginica']
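
Since the position of each name matches its integer label, the labels can be translated into species names by indexing. A minimal sketch (the name `species` is illustrative):

```python
from sklearn import datasets

iris = datasets.load_iris()

# target_names is a NumPy array, so indexing it with the integer labels
# replaces every 0/1/2 with the matching species name.
species = iris.target_names[iris.target]

print(species[0], species[50], species[100])  # setosa versicolor virginica
```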

print(i.DESCR) #--gives all the detailed description about the dataset


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is
taken from Fisher's paper. Note that it's the same as in R, but not as in
the UCI Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field
and is referenced frequently to this day. (See Duda & Hart, for example.)
The data set contains 3 classes of 50 instances each, where each class
refers to a type of iris plant. One class is linearly separable from the
other 2; the latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene
     Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
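
The "Class Correlation" column in the description can be approximated by correlating each feature with the integer class label. This is only a sketch: it assumes the column is a plain Pearson correlation against the labels 0, 1, 2, and the recomputed values may differ slightly from the printed table.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Pearson correlation between each feature column and the class label.
# ASSUMPTION: this mirrors how the table's "Class Correlation" was computed.
class_corr = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

for name, r in zip(iris.feature_names, class_corr):
    print(f"{name:<20} r={r:+.4f}")
```

The two petal measurements correlate most strongly with the class, which matches the "(high!)" remarks in the description.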

Download any dataset from openml repository

from sklearn.datasets import fetch_openml


mice=fetch_openml(name='miceprotein',version=4)
mice

{'data':       DYRK1A_N   ITSN1_N    BDNF_N     NR1_N    NR2A_N    pAKT_N   pBRAF_N  \
0     0.503644  0.747193  0.430175  2.816329  5.990152  0.218830  0.177565
1     0.514617  0.689064  0.411770  2.789514  5.685038  0.211636  0.172817
2     0.509183  0.730247  0.418309  2.687201  5.622059  0.209011  0.175722
3     0.442107  0.617076  0.358626  2.466947  4.979503  0.222886  0.176463
4     0.434940  0.617430  0.358802  2.365785  4.718679  0.213106  0.173627
...        ...       ...       ...       ...       ...       ...       ...
1075  0.254860  0.463591  0.254860  2.092082  2.600035  0.211736  0.171262
1076  0.272198  0.474163  0.251638  2.161390  2.801492  0.251274  0.182496
1077  0.228700  0.395179  0.234118  1.733184  2.220852  0.220665  0.161435
1078  0.221242  0.412894  0.243974  1.876347  2.384088  0.208897  0.173623
1079  0.302626  0.461059  0.256564  2.092790  2.594348  0.251001  0.191811

      pCAMKII_N   pCREB_N    pELK_N  ...     SHH_N     BAD_N  BCL2_N  \
0      2.373744  0.232224  1.750936  ...  0.188852  0.122652     NaN
1      2.292150  0.226972  1.596377  ...  0.200404  0.116682     NaN
2      2.283337  0.230247  1.561316  ...  0.193685  0.118508     NaN
3      2.152301  0.207004  1.595086  ...  0.192112  0.132781     NaN
4      2.134014  0.192158  1.504230  ...  0.205604  0.129954     NaN
...         ...       ...       ...  ...       ...       ...     ...
1075   2.483740  0.207317  1.057971  ...  0.275547  0.190483     NaN
1076   2.512737  0.216339  1.081150  ...  0.283207  0.190463     NaN
1077   1.989723  0.185164  0.884342  ...  0.290843  0.216682     NaN
1078   2.086028  0.192044  0.922595  ...  0.306701  0.222263     NaN
1079   2.361816  0.223632  1.064085  ...  0.292330  0.227606     NaN

         pS6_N   pCFOS_N     SYP_N  H3AcK18_N    EGR1_N  H3MeK4_N    CaNA_N
0     0.106305  0.108336  0.427099   0.114783  0.131790  0.128186  1.675652
1     0.106592  0.104315  0.441581   0.111974  0.135103  0.131119  1.743610
2     0.108303  0.106219  0.435777   0.111883  0.133362  0.127431  1.926427
3     0.103184  0.111262  0.391691   0.130405  0.147444  0.146901  1.700563
4     0.104784  0.110694  0.434154   0.118481  0.140314  0.148380  1.839730
...        ...       ...       ...        ...       ...       ...       ...
1075  0.115806  0.183324  0.374088   0.318782  0.204660  0.328327  1.364823
1076  0.113614  0.175674  0.375259   0.325639  0.200415  0.293435  1.364478
1077  0.118948  0.158296  0.422121   0.321306  0.229193  0.355213  1.430825
1078  0.125295  0.196296  0.397676   0.335936  0.251317  0.365353  1.404031
1079  0.118899  0.187556  0.420347   0.335062  0.252995  0.365278  1.370999

[1080 rows x 77 columns], 'target': 0       c-CS-m
1       c-CS-m
2       c-CS-m
3       c-CS-m
4       c-CS-m
         ...
1075    t-SC-s
1076    t-SC-s
1077    t-SC-s
1078    t-SC-s
1079    t-SC-s
Name: class, Length: 1080, dtype: category
Categories (8, object): ['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], 'frame':       DYRK1A_N   ITSN1_N    BDNF_N     NR1_N    NR2A_N    pAKT_N   pBRAF_N  \
0     0.503644  0.747193  0.430175  2.816329  5.990152  0.218830  0.177565
1     0.514617  0.689064  0.411770  2.789514  5.685038  0.211636  0.172817
2     0.509183  0.730247  0.418309  2.687201  5.622059  0.209011  0.175722
3     0.442107  0.617076  0.358626  2.466947  4.979503  0.222886  0.176463
4     0.434940  0.617430  0.358802  2.365785  4.718679  0.213106  0.173627
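
The output above contains NaN entries (for example in the BCL2_N column), so a count of missing values per column is a natural next step. The sketch below is hedged: `load_mice` and `missing_report` are hypothetical helper names, pandas is assumed to be installed, and the download is wrapped in a function so that nothing is fetched until it is called.

```python
import pandas as pd
from sklearn.datasets import fetch_openml


def load_mice():
    """Download the mice protein dataset (needs network access on first call)."""
    # as_frame=True requests pandas objects; name and version match the call above.
    return fetch_openml(name="miceprotein", version=4, as_frame=True)


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the NaN count per column, largest first, omitting complete columns."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Usage (commented out because it triggers a download):
# mice = load_mice()
# print(mice.data.shape)
# print(mice.target.value_counts())
# print(missing_report(mice.data).head())
```

Keeping the report as a separate function means it can be applied to any DataFrame, not only to the mice protein data.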
