
You may load a dataset as follows:

from sklearn.datasets import load_svmlight_file


X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")

You may also load two (or more) datasets at once:

from sklearn.datasets import load_svmlight_files

X_train, y_train, X_test, y_test = load_svmlight_files(
    ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))

In this case, X_train and X_test are guaranteed to have the same number of
features. Another way to achieve the same result is to fix the number of features:

X_test, y_test = load_svmlight_file(
    "/path/to/test_dataset.txt", n_features=X_train.shape[1])

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
[[4.    2.   ]
 [6.    3.666...]
 [7.    6.   ]]
The SimpleImputer class also supports sparse matrices:

import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]
Note that this format is not meant to be used to implicitly store missing values in
the matrix, because doing so would densify it at transform time. Missing values
encoded by 0 must be used with dense input, as in the sketch below.
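For the dense case, a minimal sketch of imputing values encoded as 0 (the array values here are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# dense input in which 0 marks a missing value (illustrative data)
X = np.array([[0, 2], [6, 0], [7, 6]])
imp = SimpleImputer(missing_values=0, strategy='mean')
print(imp.fit_transform(X))
[[6.5 2. ]
 [6.  4. ]
 [7.  6. ]]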

The SimpleImputer class also supports categorical data represented as string values
or pandas categoricals when using the 'most_frequent' or 'constant' strategy:

import pandas as pd
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]
For another example on usage, see Imputing missing values before building an
estimator.

Data Pre-Processing with Sklearn using Standard and MinMax Scaler


Data scaling is a data preprocessing step for numerical features. Many machine
learning algorithms, such as gradient-descent-based methods, the KNN algorithm, and
linear and logistic regression, require data scaling to produce good results. Various
scalers are defined for this purpose. This article concentrates on the Standard Scaler
and the Min-Max Scaler: what they mean and how they are implemented using the
built-in functions that scikit-learn provides.

Apart from the scaler classes themselves, the following methods are used to achieve
the functionality:

The fit(data) method computes the mean and standard deviation of a given feature so
that they can be used later for scaling.
The transform(data) method performs the scaling using the mean and standard deviation
computed by the fit() method.
The fit_transform() method does both fit and transform in a single call (see the sketch after this list).
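To make the fit/transform distinction concrete, here is a minimal sketch (with made-up train and test arrays) of learning the statistics on training data and reusing them on test data:

from sklearn.preprocessing import StandardScaler

# illustrative data: the statistics are learned from the training set only
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 25.0]]

scaler = StandardScaler()
scaler.fit(train)                    # fit(): compute mean and std dev of each training column
print(scaler.transform(test))        # transform(): scale new data using the training statistics
print(scaler.fit_transform(train))   # fit_transform(): both steps in a single call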
Standard Scaler
Standard Scaler produces a standardized distribution with zero mean and unit variance
(standard deviation of one). It standardizes each feature by subtracting the feature's
mean and then dividing the result by the feature's standard deviation.

The standard scaling is calculated as:

z = (x - u) / s

Where,

z is the scaled value,
x is the value to be scaled,
u is the mean of the training samples, and
s is the standard deviation of the training samples (a quick numerical check of this formula is sketched below).
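As a quick check, the following sketch computes z = (x - u) / s by hand on the example data used later in this article and compares it with StandardScaler's result:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[11, 2], [3, 7], [0, 10], [11, 8]], dtype=float)

# manual computation: z = (x - u) / s, column by column
u = data.mean(axis=0)    # per-feature mean
s = data.std(axis=0)     # per-feature (population) standard deviation, as used by StandardScaler
z_manual = (data - u) / s

# StandardScaler produces the same values
z_sklearn = StandardScaler().fit_transform(data)
print(np.allclose(z_manual, z_sklearn))   # True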
Sklearn preprocessing provides the StandardScaler() class to achieve this in just a
few steps.

Syntax: class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

Parameters:

copy: If False, scaling is done in place. If True, a copy is created instead of scaling in place.
with_mean: If True, data is centered before scaling (set it to False for sparse input, as sketched below).
with_std: If True, data is scaled to unit variance.
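As an illustration of the with_mean parameter, the following sketch (with made-up sparse data) scales a sparse matrix by its standard deviations only, since centering would densify it:

import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# sparse input cannot be centered without densifying it, so only the scaling step is applied
X_sparse = sp.csr_matrix([[1.0, 0.0], [0.0, 3.0], [2.0, 0.0]])
scaler = StandardScaler(with_mean=False)
print(scaler.fit_transform(X_sparse).toarray())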

Approach:

Import module
Create data
Compute required values
Print processed data
Example:

# import module
from sklearn.preprocessing import StandardScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# compute required values
scaler = StandardScaler()
model = scaler.fit(data)
scaled_data = model.transform(data)

# print scaled data
print(scaled_data)
Output:

[[ 0.97596444 -1.61155897]
 [-0.66776515  0.08481889]
 [-1.28416374  1.10264561]
 [ 0.97596444  0.42409446]]

MinMax Scaler
There is another way of scaling data, in which the minimum of a feature is mapped to
zero and the maximum to one. MinMax Scaler shrinks the data to a given range, usually
0 to 1. It transforms the data by scaling each feature to that range, rescaling the
values without changing the shape of the original distribution.

The MinMax scaling is done using:

x_std = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
x_scaled = x_std * (max - min) + min

Where,

min, max = feature_range,
x.min(axis=0) is the per-feature minimum value, and
x.max(axis=0) is the per-feature maximum value (a quick numerical check of these formulas is sketched below).
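As a quick check, the following sketch computes the scaled values by hand on the example data used later in this article, using the default feature_range of (0, 1), and compares them with MinMaxScaler's result:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[11, 2], [3, 7], [0, 10], [11, 8]], dtype=float)

# manual computation; with the default feature_range (0, 1), x_scaled equals x_std
x_std = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
x_scaled = x_std * (1 - 0) + 0

print(np.allclose(x_scaled, MinMaxScaler().fit_transform(data)))   # True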
Sklearn preprocessing defines the MinMaxScaler() class to achieve this.

Syntax: class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)

Parameters:

feature_range: Desired range of the scaled data. The default range returned by
MinMaxScaler is 0 to 1. The range is provided as a tuple (min, max).
copy: If False, scaling is done in place. If True, a copy is created instead of
scaling in place.
clip: If True, scaled data is clipped to the provided feature range (demonstrated in the sketch below).
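As an illustration of the clip parameter, here is a minimal sketch with made-up one-column data, in which a test value above the training maximum is capped at the top of the feature range:

from sklearn.preprocessing import MinMaxScaler

# illustrative one-column data; the scaler learns min=0 and max=10 from the training set
train = [[0.0], [10.0]]
scaler = MinMaxScaler(feature_range=(0, 1), clip=True)
scaler.fit(train)

# 15 lies above the training maximum; with clip=True the result is capped at 1
print(scaler.transform([[5.0], [15.0]]))   # [[0.5] [1. ]]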

Approach:

Import module
Create data
Scale data
Print scaled data
Example:

# import module
from sklearn.preprocessing import MinMaxScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# scale features
scaler = MinMaxScaler()
model = scaler.fit(data)
scaled_data = model.transform(data)

# print scaled features
print(scaled_data)

Output:

[[1.         0.        ]
 [0.27272727 0.625     ]
 [0.         1.        ]
 [1.         0.75      ]]
