
Binning

Binning is a data preprocessing method used for smoothing or handling noisy data by grouping sorted values into bins and replacing them with representative values. There are various smoothing techniques such as bin means, bin medians, and bin boundaries, which can improve the accuracy of predictive models. Additionally, the document discusses the use of pandas functions like 'cut' and 'qcut' for creating bins based on specified intervals or quantiles.
Binning

Binning is used to smooth data or to handle noisy data. In this method, the
data is first sorted, and the sorted values are then distributed into a number
of buckets, or bins. Because binning methods consult the neighbourhood of each
value, they perform local smoothing. There are three approaches to smoothing:
by bin means, by bin medians, and by bin boundaries.

Data binning (or bucketing) groups data into bins (or buckets): it replaces
the values that fall within a small interval with a single representative
value for that interval. Binning sometimes improves the accuracy of
predictive models.

Data binning is a type of data preprocessing, which also includes handling
missing values, formatting, normalization and standardization.

Binning can be applied to convert numeric values to categorical values or to
sample (quantise) numeric values.

• Converting numeric values to categorical values includes binning by
distance and binning by frequency.
• Reducing numeric values includes quantisation (or sampling).
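As a rough illustration of the two conversions above, the sketch below bins a small set of made-up values by distance (equal-width intervals) and by frequency (equally populated groups), and then quantises them by replacing each value with the midpoint of its interval. The sample values, the choice of three bins and the use of the midpoint as the representative value are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical, skewed sample values (not taken from any dataset in this document)
values = np.array([1, 2, 3, 4, 5, 6, 10, 20, 31])
n_bins = 3

# Binning by distance: split [min, max] into equal-width intervals
edges = np.linspace(values.min(), values.max(), n_bins + 1)
# Use only the interior edges so that the maximum lands in the last bin
by_distance = np.digitize(values, edges[1:-1])

# Binning by frequency: split the sorted values into equally sized groups
by_frequency = np.array_split(np.sort(values), n_bins)

# Quantisation: replace each value with the midpoint of its interval
midpoints = (edges[:-1] + edges[1:]) / 2
quantised = midpoints[by_distance]

print("bin index by distance:", by_distance)
print("groups by frequency:  ", [g.tolist() for g in by_frequency])
print("quantised values:     ", quantised)
```

With skewed data like this, binning by distance puts most values in the first bin, while binning by frequency keeps the groups the same size but gives them very different widths.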

Smoothing by bin means: each value in a bin is replaced by the mean value of
the bin.

Smoothing by bin medians: each value in a bin is replaced by the median value
of the bin.

Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries. Each value in the bin is then replaced
by the closest boundary value.

Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition using equal frequency approach:

- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34

Smoothing by bin means (bin means 9, 22.75 and 29.25, rounded to the nearest integer):

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Smoothing by bin median (bin medians 8.5, 22.5 and 28.5, rounded up):

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
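The worked example above can be reproduced with a short NumPy sketch. It assumes the equal-frequency partition into three bins of four values, and it resolves ties in boundary smoothing towards the lower boundary (no tie occurs in this data). The code prints the exact means and medians (for example 22.75 and 22.5 for Bin 2), whereas the lists above show them rounded to integers.

```python
import numpy as np

# Sorted price data from the worked example
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Equal-frequency partition: 3 bins (rows) of 4 values each
bins = prices.reshape(3, 4)
bin_size = bins.shape[1]

# Smoothing by bin means: every value becomes the mean of its bin
by_mean = np.repeat(bins.mean(axis=1, keepdims=True), bin_size, axis=1)

# Smoothing by bin medians: every value becomes the median of its bin
by_median = np.repeat(np.median(bins, axis=1, keepdims=True), bin_size, axis=1)

# Smoothing by bin boundaries: every value becomes the nearer of the bin's
# minimum (first column) and maximum (last column); ties go to the minimum
low = bins[:, :1]
high = bins[:, -1:]
by_boundary = np.where(bins - low <= high - bins, low, high)

print("Bin means:\n", by_mean)
print("Bin medians:\n", by_median)
print("Bin boundaries:\n", by_boundary)
```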

import numpy as np
import pandas as pd

# Load the data, keep the first 30 turnover values and sort them
df = pd.read_csv('nse50_data.csv')
data = df['Turnover (Rs. Cr)']
data = data[:30]
data = np.sort(data)
print(data)

Output (tail of the sorted array):

       12199.98, 12211.18, 12290.16, 12528.8 , 12649.4 , 12834.85,
       13320.2 , 13520.01, 13591.3 , 13676.58, 13709.57, 13837.03,
       13931.15, 14006.48, 14105.94, 14440.17, 14716.66, 14744.56,
       14932.51, 15203.09, 15787.28, 15944.45, 20187.98, 21595.33])

b1=np.zeros((10,3))
b2=np.zeros((10,3))
b3=np.zeros((10,3))

Compute the mean bin as follows:

# Each group of 3 consecutive sorted values forms one bin (one row of b1)
for i in range(0, 30, 3):
    k = i // 3
    mean = (data[i] + data[i+1] + data[i+2]) / 3
    for j in range(3):
        b1[k, j] = mean

print("-----------------Mean Bin:----------------- \n", b1)

-----------------Mean Bin:-----------------
 [[10696.98666667 10696.98666667 10696.98666667]
 [11674.27666667 11674.27666667 11674.27666667]
 [12233.77333333 12233.77333333 12233.77333333]
 [12671.01666667 12671.01666667 12671.01666667]
 [13477.17       13477.17       13477.17      ]
 [13741.06       13741.06       13741.06      ]
 [14014.52333333 14014.52333333 14014.52333333]
 [14633.79666667 14633.79666667 14633.79666667]
 [15307.62666667 15307.62666667 15307.62666667]
 [19242.58666667 19242.58666667 19242.58666667]]

Compute the median bin as follows:

# With 3 sorted values per bin, the median is the middle value
for i in range(0, 30, 3):
    k = i // 3
    for j in range(3):
        b2[k, j] = data[i+1]

print("-----------------Median Bin :----------------- \n", b2)

Compute the boundary bin as follows:

# Replace each value with the nearer of its bin's minimum (data[i]) and
# maximum (data[i+2]); ties go to the maximum
for i in range(0, 30, 3):
    k = i // 3
    for j in range(3):
        if (data[i+j] - data[i]) < (data[i+2] - data[i+j]):
            b3[k, j] = data[i]
        else:
            b3[k, j] = data[i+2]

print("-----------------Boundary Bin:----------------- \n", b3)

Output:

-----------------Boundary Bin:-----------------
 [[10388.69 10858.35 10858.35]
 [10896.89 12113.53 12113.53]
 [12199.98 12199.98 12290.16]
 [12528.8  12528.8  12834.85]
 [13320.2  13591.3  13591.3 ]
 [13676.58 13676.58 13837.03]
 [13931.15 13931.15 14105.94]
 [14440.17 14744.56 14744.56]
 [14932.51 14932.51 15787.28]
 [15944.45 21595.33 21595.33]]

pd.cut()
The ‘cut’ function can be used in broadly two ways: we can specify the number
of bins and let pandas calculate equal-width bins for us, or we can manually
specify the bin edges.

import pandas as pd

df = pd.read_csv('/home/student/Desktop/IPL.csv')
print(df)

# Bin SOLD PRICE into three labelled grades using explicit bin edges
df['Salaries Group'] = pd.cut(df['SOLD PRICE'],
                              bins=[0, 50000, 150000, 1800000],
                              labels=['grade1', 'grade2', 'grade3'])
print(df)

Instead of getting the intervals back, we can specify the ‘labels’ parameter
as a list for better analysis.
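The two ways of calling cut can be compared on a small series. This is only a sketch: the prices below are hypothetical stand-ins for the SOLD PRICE column, since the contents of IPL.csv are not shown here.

```python
import pandas as pd

# Hypothetical sold prices standing in for df['SOLD PRICE']
prices = pd.Series([20000, 50000, 75000, 150000, 400000, 1800000],
                   name="SOLD PRICE")

# Way 1: give only the number of bins; pandas computes 3 equal-width intervals
equal_width = pd.cut(prices, bins=3)

# Way 2: give explicit bin edges, plus labels instead of interval objects.
# Intervals are closed on the right by default, so 50000 falls in grade1.
graded = pd.cut(prices,
                bins=[0, 50000, 150000, 1800000],
                labels=["grade1", "grade2", "grade3"])

print(pd.DataFrame({"price": prices,
                    "equal_width": equal_width,
                    "graded": graded}))
```

Because a single very large price stretches the range, the equal-width call puts five of the six values in the first interval, while the explicit edges spread them two per grade.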

pd.qcut()

qcut (quantile cut) differs from cut in that the number of elements in each
bin is roughly the same, at the cost of differently sized interval widths.
In cut, by contrast, the bins have equal width when we specify a number of
bins (for example bins=3), but the number of elements in each bin or group
is uneven. cut is also useful when you know the interval ranges and the bins
in advance.

For example, when binning an ‘age’ column, we know that infants are between 0
and 1 years old, 1-12 years are kids, 13-19 are teenagers, 20-60 are
working-class grown-ups, and 60+ are senior citizens. So we can set
bins=[0, 1, 12, 19, 60, 140] and labels=[‘infant’, ‘kid’, ‘teenager’,
‘grownup’, ‘senior citizen’]. In qcut, when we specify q=5, we are telling
pandas to cut the Year column into 5 equal quantiles, i.e. the 0-20%, 20-40%,
40-60%, 60-80% and 80-100% buckets/bins.
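The contrast can be sketched on the age example. The ages below are hypothetical; the bin edges and labels are the ones given in the text.

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([0.5, 3, 10, 15, 18, 25, 40, 59, 61, 85])

# cut: bin edges are chosen from domain knowledge, so bin counts are uneven
age_group = pd.cut(ages,
                   bins=[0, 1, 12, 19, 60, 140],
                   labels=["infant", "kid", "teenager", "grownup", "senior citizen"])

# qcut: q=5 asks for five quantile bins, so counts are (roughly) equal
# and pandas chooses the edges itself
age_quintile = pd.qcut(ages, q=5)

print(age_group.value_counts(sort=False))
print(age_quintile.value_counts(sort=False))
```

Here cut yields group sizes of 1, 2, 2, 3 and 2, whereas qcut places exactly two ages in each of its five intervals.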

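import pandas as pd
import numpy as np

df = pd.read_csv('/home/student/Desktop/IPL.csv')
print(df)

# Inspect the distinct auction years before binning
print(np.array(sorted(df['AUCTION YEAR'].unique())))

# Cut AUCTION YEAR into 5 quantile bins with descriptive labels
df['Yr_qcut'] = pd.qcut(df['AUCTION YEAR'], q=5,
                        labels=['oldest', 'not so old', 'medium', 'newer', 'latest'])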

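ERROR:

ValueError: Bin edges must be unique: array([2008., 2008., 2008., 2009., 2011., 2011.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

The error occurs because several quantiles fall on the same year, so the
computed bin edges are repeated.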
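A minimal sketch of two ways around this error follows. The years series is hypothetical, chosen so that its quantile edges match those in the message above. The first option is the duplicates='drop' argument that the message suggests. The second, ranking the values before calling qcut, is a common workaround that the original does not mention.

```python
import pandas as pd

# Hypothetical auction years, concentrated on a few distinct values
years = pd.Series([2008] * 5 + [2009] * 2 + [2011] * 3)

# Plain qcut fails: the 0%, 20% and 40% quantiles are all 2008
try:
    pd.qcut(years, q=5)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Option 1: drop the repeated edges. Fewer than 5 bins remain, so a list of
# 5 labels would no longer fit; the interval bins are kept instead.
dropped = pd.qcut(years, q=5, duplicates="drop")
print(dropped.value_counts(sort=False))

# Option 2 (assumption, not from the original): rank the values first so that
# every edge is unique. This keeps 5 labelled bins of equal size, but rows
# with the same year can end up in different bins.
ranked = pd.qcut(years.rank(method="first"), q=5,
                 labels=["oldest", "not so old", "medium", "newer", "latest"])
print(ranked.value_counts(sort=False))
```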

import pandas as pd

# create DataFrame
df = pd.DataFrame({'points': [4, 4, 7, 8, 12, 13, 15, 18, 22, 23, 23, 25],
                   'assists': [2, 5, 4, 7, 7, 8, 5, 4, 5, 11, 13, 8],
                   'rebounds': [7, 7, 4, 6, 3, 8, 9, 9, 12, 11, 8, 9]})

print(df)

# Bin points into five quantile-based groups labelled A to E
df['points_bin'] = pd.qcut(df['points'], q=[0, .2, .4, .6, .8, 1],
                           labels=['A', 'B', 'C', 'D', 'E'])
print(df)
