Data Science Using Scikit-Learn

Several Python libraries offer solid implementations of a range of machine learning algorithms. One of the best known is
Scikit-Learn, a package that provides versions of a large number of standard algorithms. Scikit-Learn is characterized
by a clean, uniform, and streamlined API, as well as by useful and complete online documentation.
Data Representation in Scikit-Learn
Machine learning is about generating models from data: for that reason, we will start by discussing how data can be
represented so that it can be understood by the computer. The best way to think about data within Scikit-Learn is in
terms of tables of data.
Data as table
A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the
columns represent quantities associated with each of these elements. For example, consider the Iris dataset,
famously analyzed by Ronald Fisher in 1936. We can download this dataset in the form of a Pandas DataFrame using
the Seaborn library:
In[1]: import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

Out[1]: sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Here each row of the data refers to a single observed flower, and the number of rows is the total number of
flowers in the dataset. In general, we will refer to the rows of the matrix as samples, and the number of rows as
n_samples.
Each column of the data refers to a particular quantitative piece of information that describes each sample. In general,
we will refer to the columns of the matrix as features, and the number of columns as n_features.
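The shape of the table gives both quantities directly. The following is a minimal sketch, assuming scikit-learn's bundled copy of the Iris data (load_iris with as_frame=True) in place of the Seaborn download above so that it runs offline; the column names in this copy differ from Seaborn's.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for sns.load_dataset('iris')
iris_bunch = load_iris(as_frame=True)
features = iris_bunch.data  # Pandas DataFrame holding the four measurements

# Rows are samples, columns are features
n_samples, n_features = features.shape
print(n_samples, n_features)
```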

Features matrix
This table layout makes clear that the information can be thought of as a two-dimensional numerical array or
matrix, which we will call the features matrix. By convention, this features matrix is often stored in a variable named
X.
The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained
in a NumPy array or a Pandas DataFrame, though some Scikit-Learn models also accept SciPy sparse matrices. The
samples (i.e., rows) always refer to the individual objects described by the dataset.
For example, a sample might be a flower, a person, a document, an image, a sound file, a video, an astronomical
object, or anything else we can describe with a set of quantitative measurements. The features (i.e., columns) always
refer to the distinct observations that describe each sample in a quantitative manner. Features are generally
real-valued, but may be Boolean or discrete-valued in some cases.
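As a rough illustration of the containers mentioned above, the sketch below builds one small features matrix as a NumPy array and as a SciPy sparse matrix. The values are hypothetical toy data, not taken from the Iris dataset.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 samples (rows) described by 4 features (columns)
X_dense = np.array([
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
])

# The same matrix stored in SciPy's compressed sparse row format
X_sparse = sparse.csr_matrix(X_dense)

print(X_dense.shape)   # (3, 4)
print(X_sparse.shape)  # (3, 4)
print(X_sparse.nnz)    # 4 stored nonzero values
```

Both containers report the same [n_samples, n_features] shape; the sparse form only stores the nonzero entries, which matters when most features are zero.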
Target array
In addition to the features matrix X, we also generally work with a label or target array, which by convention we will
usually call y. The target array is usually one-dimensional, with length n_samples, and is generally contained in a
NumPy array or Pandas Series.
The target array can have continuous numerical values or discrete classes/labels. While some Scikit-Learn estimators
do handle multiple target values in the form of a two-dimensional [n_samples, n_targets] target array, we will
generally be working with the typical case of a one-dimensional target array.
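The difference between the two target shapes can be sketched as follows. The data here is randomly generated for illustration only, and LinearRegression is chosen simply as one estimator that accepts both a one-dimensional and a two-dimensional target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 20 samples with 3 features each
rng = np.random.RandomState(0)
X = rng.rand(20, 3)

# One-dimensional target of length n_samples
y_single = X @ np.array([1.0, 2.0, 3.0])
# Two-dimensional target with shape [n_samples, n_targets]
y_multi = np.column_stack([y_single, -y_single])

pred_single = LinearRegression().fit(X, y_single).predict(X)
pred_multi = LinearRegression().fit(X, y_multi).predict(X)

print(pred_single.shape)  # (20,)
print(pred_multi.shape)   # (20, 2)
```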

For example, in the preceding data, we may wish to construct a model that can predict the species of a flower
based on the other measurements; in this case, the species column would be considered the target array.
In[3]: X_iris = iris.drop('species', axis=1)
X_iris.shape
Out[3]: (150, 4)
In[4]: y_iris = iris['species']
y_iris.shape
Out[4]: (150,)

Basics of the API
Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with the desired values.
3. Arrange the data into a features matrix and target vector, following the discussion above.
4. Fit the model to the data by calling the fit() method of the model instance.
5. Apply the model to new data:
   a. For supervised learning, we often predict labels for unknown data using the predict() method.
   b. For unsupervised learning, we often transform or infer properties of the data using the transform() or
      predict() method.
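The steps above can be sketched end to end as follows. This is one possible illustration, not the only way to apply the API: GaussianNB and PCA are arbitrary choices of estimator, the 30-sample hold-out split is an assumption, and scikit-learn's bundled Iris data is used so the example runs offline.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # step 1: choose a class of model

# Step 2: choose hyperparameters by instantiating the class (defaults here)
model = GaussianNB()

# Step 3: arrange the data into a features matrix X and a target vector y
X, y = load_iris(return_X_y=True)
# Assumption: hold out 30 samples to play the role of "new" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0
)

# Step 4: fit the model to the data
model.fit(X_train, y_train)

# Step 5 (supervised): predict labels for the held-out data
y_pred = model.predict(X_test)

# Step 5 (unsupervised): fit without a target, then transform the data
pca = PCA(n_components=2)
pca.fit(X)
X_2d = pca.transform(X)

print(y_pred.shape, X_2d.shape)
```

Note that the unsupervised estimator is fit on X alone, with no target array, while the supervised one needs both X and y.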