R22 Data Science Using Python Lab Manual
Descriptive statistics summarize your dataset, painting a picture of its properties. These
properties include various central tendency and variability measures, distribution
properties, outlier detection, and other information. Unlike inferential statistics, descriptive
statistics only describe your dataset’s characteristics and do not attempt to generalize from
a sample to a population.
Using a single function, Excel can calculate a set of descriptive statistics for your dataset. This
experiment is also a useful introduction to interpreting descriptive statistics even if Excel isn't your
primary statistical software package.
This experiment provides step-by-step instructions for using Excel to calculate
descriptive statistics for your data. Importantly, it will also show you how to interpret the
results, determine which statistics are most applicable to your data, and help you navigate some
of the lesser-known values.
Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On
the Data tab, look for Data Analysis, as shown below.
Step-by-Step Instructions for Filling in Excel’s Descriptive Statistics Box
1. Under Input Range, select the range for the variables that you want to analyze. You can
include multiple variables as long as they form a contiguous block. While you can
explore more than one variable, the analysis assesses each variable in a univariate
manner (i.e., no correlation).
2. In Grouped By, choose how your variables are organized. I always include one variable
per column as this format is standard across software. Alternatively, you can include one
variable per row.
3. Check the Labels in first row checkbox if you have meaningful variable names in row
1. This option makes the output easier to interpret.
4. In Output options, choose where you want Excel to display the results.
5. Check the Summary statistics box to display most of the descriptive statistics (central
tendency, dispersion, distribution properties, sum, and count).
6. Check the Confidence Level for Mean box to display a confidence interval for the
mean. Enter the confidence level. 95% is usually a good value. For more information
about confidence levels, read my post about confidence intervals.
7. Check Kth Largest and Kth Smallest to display a high and low value. If you enter 1,
Excel displays the highest and lowest values. If you enter 2, it shows the 2nd highest and
lowest values. Etc.
8. Click OK.
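Since the later experiments use Python, the same summary statistics can also be computed with pandas. A minimal sketch (the file name data.csv and the column Sales are placeholders, not part of the Excel exercise):

import pandas as pd

df = pd.read_csv("data.csv")                 # placeholder file name
print(df["Sales"].describe())                # count, mean, std, min, quartiles, max
print("Skewness:", df["Sales"].skew())       # distribution shape
print("Kurtosis:", df["Sales"].kurt())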
2. Apply pivot table of Excel to perform data analysis.
Data analysis on a large set of data is quite often necessary and important. It involves
summarizing the data, obtaining the needed values and presenting the results.
Excel provides the PivotTable to enable you to summarize thousands of data values easily and
quickly and obtain the required results.
Consider the following table of sales data. From this data, you might have to summarize
total sales region wise, month wise, or salesperson wise. The easy way to handle these tasks is
to create a PivotTable that you can dynamically modify to summarize the results the way you
want.
Creating PivotTable
To create PivotTables, ensure the first row has headers.
Click the table.
Click the INSERT tab on the Ribbon.
Click PivotTable in the Tables group. The PivotTable dialog box appears.
As you can see in the dialog box, you can use either a Table or Range from the current
workbook or use an external data source.
In the Table / Range Box, type the table name.
Click New Worksheet to tell Excel where to keep the PivotTable.
Click OK.
A Blank PivotTable and a PivotTable fields list appear.
Recommended PivotTables
In case you are new to PivotTables or you do not know which fields to select from the data, you
can use the Recommended PivotTables that Excel provides.
Click the data table.
Click the INSERT tab.
Click on Recommended PivotTables in the Tables group. The Recommended
PivotTables dialog box appears.
In the recommended PivotTables dialog box, the possible customized PivotTables that suit your
data are displayed.
Click each of the PivotTable options to see the preview on the right side.
Click the PivotTable Sum of Order Amount by Salesperson and month.
Click OK. The selected PivotTable appears on a new worksheet. You can observe the
PivotTable fields that were selected in the PivotTable Fields list.
PivotTable Fields
The headers in your data table will appear as the fields in the PivotTable.
You can select / deselect them to instantly change your PivotTable to display only the
information you want and in a way that you want. For example, if you want to display the
account information instead of order amount information, deselect Order Amount and select
Account.
PivotTable Areas
You can even change the Layout of your PivotTable instantly. You can use the PivotTable
Areas to accomplish this.
In PivotTable areas, you can choose −
What fields to display as rows
What fields to display as columns
How to summarize your data
Filters for any of the fields
When to update your PivotTable Layout
o You can update it instantly as you drag the fields across areas, or
o You can defer the update and get it updated only when you click on UPDATE
An instant update helps you to play around with the different Layouts and pick the one that
suits your report requirement. You can just drag the fields across these areas and observe the
PivotTable layout as you do it.
In the PivotTable Areas, in rows, click region and drag it below salesperson such that it looks
as follows −
Note − You can clearly observe that the layout with the nesting order – Region and then
Salesperson – yields a better and more compact report than the one with the nesting order –
Salesperson and then Region. If Salesperson represents more than one area and you need
to summarize the sales by Salesperson, then the second layout would be the better option.
Filters
You can assign a Filter to one of the fields so that you can dynamically change the PivotTable
based on the values of that field.
Drag Region from Rows to Filters in the PivotTable Areas.
The filter with the label Region appears above the PivotTable (in case you do not have
empty rows above your PivotTable, the PivotTable gets pushed down to make space for the filter).
Check the option Select Multiple Items. Check boxes appear for all the values.
Select South and West and deselect the other values and click OK.
The data pertaining to South and West Regions only will be summarized as shown in the screen
shot given below −
You can see that next to the Filter Region, Multiple Items is displayed, indicating that you
have selected more than one item. However, how many items and / or which items are selected
is not known from the report that is displayed. In such a case, using Slicers is a better option for
filtering.
Slicers
You can use Slicers to have a better clarity on which items the data was filtered.
Click ANALYZE under PIVOTTABLE TOOLS on the Ribbon.
Click Insert Slicer in the Filter group. The Insert Slicers box appears. It contains all the
fields from your data.
Select the fields Region and month. Click OK.
Slicers for each of the selected fields appear with all the values selected by default. Slicer Tools
appear on the Ribbon to work on the Slicer settings, look and feel.
Summarizing Values by other Calculations
In the examples so far, you have seen summarizing values by Sum. However, you can use other
calculations also if necessary.
In the PivotTable Fields List
Select the Field Account.
Unselect the Field Order Amount.
Drag the field Account to Summarizing Values area. By default, Sum of Account will be
displayed.
Click the arrow on the right side of the box.
In the drop-down that appears, click Value Field Settings.
The Value Field Settings box appears. Several types of calculations appear as a list under
Summarize value field by −
Select Count in the list.
The Custom Name automatically changes to Count of Account. Click OK.
The PivotTable summarizes the Account values by Count.
PivotTable Tools
Follow the steps given below to learn to use the PivotTable Tools.
Select the PivotTable.
The following PivotTable Tools appear on the Ribbon −
ANALYZE
DESIGN
ANALYZE
Some of the ANALYZE Ribbon commands are −
Set PivotTable Options
Value Field Settings for the selected Field
Expand Field
Collapse Field
Insert Slicer
Insert Timeline
Refresh Data
Change Data Source
Move PivotTable
Solve Order (If there are more calculations)
PivotChart
DESIGN
Some of the DESIGN Ribbon commands are −
PivotTable Layout
o Options for Sub Totals
o Options for Grand Totals
o Report Layout Forms
o Options for Blank Rows
PivotTable Style Options
PivotTable Styles
Expanding and Collapsing Field
You can either expand or collapse all items of a selected field in two ways −
By selecting the Expand (+) or Collapse (−) symbol to the left of the selected field.
By clicking Expand Field or Collapse Field on the ANALYZE Ribbon.
Using the Expand or Collapse symbol to the left of the selected field
Select the cell containing East in the PivotTable.
Click the Collapse symbol (−) to the left of East.
All the items under East will be collapsed. The Collapse symbol to the left of East changes to
the Expand symbol (+).
You can observe that only the items below East are collapsed. The rest of the PivotTable items
are as they are.
Click the Expand symbol to the left of East. All the items below East will be displayed.
Using ANALYZE on the Ribbon
You can collapse or expand all items in the PivotTable at once with the Expand Field and
Collapse Field commands on the Ribbon.
Click the cell containing East in the PivotTable.
Click the ANALYZE tab on the Ribbon.
Click Collapse Field in the Active Field group.
All the items of the field East in the PivotTable will collapse.
Click Expand Field in the Active Field group. All the items will be displayed.
Blank rows will be displayed after each value of the Region field.
You can insert blank rows from the DESIGN tab also.
Hover the mouse over the PivotTable Styles. A preview of the style on which the mouse
is placed will appear.
Select the Style that suits your report.
PivotTable in Outline Form with the selected Style will be displayed.
Timeline in PivotTables
To understand how to use the Timeline, consider the following example wherein the sales data of
various items is given salesperson-wise and location-wise. There are a total of 1891 rows of data.
Create a PivotTable from this Range with −
Location and Salesperson in Rows in that order
Product in Columns
Sum of Amount in Summarizing values
Click Date and click OK. The Timeline appears, and the Timeline Tools appear on
the Ribbon.
3. Perform the following operations using Numpy
i) Basic Operations on NumPy
ii) Computations on numpy’s Arrays
NumPy, just like SciPy, Scikit-Learn and Pandas, is one of the packages that you just
can't miss when you're learning data science, mainly because this library provides you with an
array data structure that holds some benefits over Python lists, such as being more compact,
giving faster access when reading and writing items, and being more convenient and more efficient.
For an array a holding the values 0 to 19 (for example a = np.arange(20)),
b = a.reshape(4, 5)
b would become:
[[ 0,  1,  2,  3,  4],
 [ 5,  6,  7,  8,  9],
 [10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19]]
3. Converting any data type to a NumPy array
Use np.asarray. For example:
a = [(1, 2), [3, 4, (5, 6)], (6, 7, 8)]
b = np.asarray(a)
b:
array([(1, 2), list([3, 4, (5, 6)]), (6, 7, 8)], dtype=object)
np.linspace(start, stop, num=50, endpoint=bool_value, retstep=bool_value)
endpoint specifies whether you want the stop value to be included, and retstep tells whether you
would also like the step value returned. num is the number of values to be returned, where 50 is the default.
For example, np.linspace(1, 2, num=5, endpoint=False, retstep=True) means: return 5 values starting at 1
and ending before 2, and also return the step size. The output would be:
(array([1. , 1.2, 1.4, 1.6, 1.8]), 0.2)   # tuple of the numpy array and the step size
arange:
np.arange(start=where_to_start, stop=where_to_stop, step=step_size)
If only one number is provided as an argument, it is treated as the stop; if two are provided,
they are assumed to be the start and the stop. Notice the spelling: arange, not arrange.
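For example:
np.arange(5)         # array([0, 1, 2, 3, 4]); only the stop was given
np.arange(2, 10, 2)  # array([2, 4, 6, 8]); start, stop and step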
13. How to create a copy of a NumPy array
Use np.copy.
y = np.array([[1, 3], [5, 6]])
x = np.copy(y)
If x[0][0] = 1000, then
x is [[1000, 3], [5, 6]] while y is unchanged: [[1, 3], [5, 6]]
Transpose: for x = np.array([[1, 2], [3, 4]]), x.T is [[1, 3], [2, 4]]
np.ravel
x = np.array([[1, 2, 3], [4, 5, 6]])
x.ravel() produces
array([1, 2, 3, 4, 5, 6])
np.swapaxes
x = np.array([[1, 2], [3, 4]]); x.shape is (2, 2), and
np.swapaxes(x, 0, 1) will produce
array([[1, 3],
       [2, 4]])
For a 5-element array whose first entry is 0 (for example x = np.array([0, 1, 2, 3, 4])),
x.astype(bool) will produce
array([False, True, True, True, True])
It is important to note that x has shape (5,), so x.nonzero() returns only first-axis indices. If x were, say,
x = np.array([[0, 1], [3, 5]])
x.nonzero() would produce (array([0, 1, 1]), array([1, 0, 1])), so the indices are actually (0, 1), (1, 0), (1, 1).
If you wanted to count the number of ones in x, you could just do
(x == 1).astype(np.int16).sum()
It should output 1
a:
[[ 1,  2,  3],
 [ 4,  8, 16]]
b = np.array([5, 6, 11]).reshape(-1, 1)
b:
[[ 5],
 [ 6],
 [11]]
np.dot(a, b) produces
[[ 50],
 [244]]
Just like any dot product of a matrix with a column vector would produce.
The dot product of a row vector with a column vector will produce:
if a is array([[1, 2, 3, 4]])
and b is:
array([[4],
[5],
[6],
[7]])
np.dot(a, b) gives: array([[60]])
a's shape was (1, 4) and b's shape was (4, 1), so the result will have shape (1, 1).
If one of the rows or columns is contiguous, it is easier to use a mixed index: x[[0, 2], 0:2] produces array([[2, 4], [7, 8]])
Example :
[[ 1, 2, 3],
[ 4, 2, 5]]
Here,
rank = 2 (as it is 2-dimensional or it has 2 axes)
first dimension(axis) length = 2, second dimension has length = 3
overall shape can be expressed as: (2, 3)
3. Array Indexing: Knowing the basics of array indexing is important for analysing and
manipulating the array object. NumPy offers many ways to do array indexing.
Slicing: Just like lists in python, NumPy arrays can be sliced. As arrays can be
multidimensional, you need to specify a slice for each dimension of the array.
Integer array indexing: In this method, lists are passed for indexing for each
dimension. One to one mapping of corresponding elements is done to construct a new
arbitrary array.
Boolean array indexing: This method is used when we want to pick elements from
array which satisfy some condition.
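A small sketch illustrating the three indexing styles (the array values here are chosen only for illustration):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Slicing: first two rows, columns 1 and 2
print(arr[:2, 1:3])                 # [[2 3] [5 6]]
# Integer array indexing: elements (0,0), (1,2) and (2,1)
print(arr[[0, 1, 2], [0, 2, 1]])    # [1 6 8]
# Boolean array indexing: elements greater than 5
print(arr[arr > 5])                 # [6 7 8 9]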
import numpy as np
a = np.array([1, 2, 5, 3])
# add 1 to every element
print("Adding 1 to every element:", a + 1)
print("Subtracting 3 from each element:", a - 3)
print("Multiplying each element by 10:", a * 10)
print("Squaring each element:", a ** 2)
a *= 2   # double each element in place
print("Doubled each element of original array:", a)
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print("Original array:\n", a)
print("Transpose of array:\n", a.T)
Output :
Adding 1 to every element: [2 3 6 4]
Subtracting 3 from each element: [-2 -1 2 0]
Multiplying each element by 10: [10 20 50 30]
Squaring each element: [ 1 4 25 9]
Doubled each element of original array: [ 2 4 10 6]
Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:
[[1 3 9]
[2 4 6]
[3 5 0]]
4. Write a program to find patterns in the given data using regular expressions by taking
the data from text file.
A Regular Expression (RegEx) is a special sequence of characters that uses a search
pattern to find a string or set of strings. It can detect the presence or absence of a text by
matching it with a particular pattern, and it can also split a pattern into one or more sub-patterns.
Python provides the re module, which supports the use of regex in Python. Its primary
function is to offer a search, where it takes a regular expression and a string. Here, it either
returns the first match or else None.
Before starting with the Python regex module let’s see how to actually write regex
using metacharacters or special sequences.
MetaCharacters
To understand regular expressions, metacharacters are important; they are used by the
functions of the re module. The main metacharacters are:
\  - Drops the special meaning of the character following it
[] - Represents a character class
^  - Matches the beginning of the string
$  - Matches the end of the string
.  - Matches any character except newline
|  - Means OR (matches any of the patterns it separates)
?  - Matches zero or one occurrence of the preceding regex
*  - Matches zero or more occurrences of the preceding regex
+  - Matches one or more occurrences of the preceding regex
{} - Indicates the number of occurrences of the preceding regex
() - Encloses a group of regex
Special Sequences
Special sequences do not match the actual characters in the string; instead, they specify the
location in the search string where the match must occur. They make it easier to write
commonly used patterns.
List of special sequences
\A - Matches if the string begins with the given character - \Afor matches "for geeks" and "for the world"
\b - Matches the given pattern at the beginning or end of a word
\B - Opposite of \b: the pattern must not be at the start or end of a word - \Bge matches "together" and "forge"
\d - Matches any decimal digit; equivalent to the class [0-9] - \d matches "123" and "gee1"
\D - Matches any non-digit character; equivalent to the class [^0-9] - \D matches "geeks" and "geek1"
\Z - Matches if the string ends with the given pattern - ab\Z matches "abcdab" and "abababab"
re.findall()
Return all non-overlapping matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found.
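The snippet that produces the output below is not included in this excerpt; a minimal sketch that yields the same result is:

import re

string = "Hello my Number is 123456789 and my friend's number is 987654321"
# \d+ matches one or more consecutive digits
match = re.findall(r'\d+', string)
print(match)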
Output
['123456789', '987654321']
Match Object
A Match object contains all the information about the search and the result and if there is no
match found then None will be returned. Let’s see some of the commonly used methods and
attributes of the match object.
Getting the string and the regex
match.re attribute returns the regular expression passed, and match.string attribute returns the
string passed.
Example: Getting the string and the regex of the matched object
import re
s = "Welcome to GeeksForGeeks"
# here x is the match object
res = re.search(r"\bG", s)
print(res.re)
print(res.string)
Output
re.compile('\\bG')
Welcome to GeeksForGeeks
Getting index of matched object
start() method returns the starting index of the matched substring
end() method returns the ending index of the matched substring
span() method returns a tuple containing the starting and the ending index of the
matched substring
Example: Getting index of matched object
import re
s = "Welcome to GeeksForGeeks"
res = re.search(r"\bGee", s)
print(res.start())
print(res.end())
print(res.span())
Output
11
14
(11, 14)
s = "Welcome to GeeksForGeeks"
# here x is the match object
res = re.search(r"\D{2} t", s)
print(res.group())
Output
me t
In the above example, our pattern matches two non-digit characters that are followed by a space,
and that space is followed by a t.
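Since this experiment asks for patterns to be found in data taken from a text file, a hedged sketch of that step (the file name and the patterns are placeholders):

import re

with open('sample.txt', 'r') as f:        # 'sample.txt' is a placeholder file name
    text = f.read()

# extract e-mail-like substrings and all runs of digits from the file
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
numbers = re.findall(r'\d+', text)
print(emails)
print(numbers)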
Pandas is an open-source library that is made mainly for working with relational or labeled
data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy library.
Pandas is fast and it has high performance & productivity for users.
Getting Started
After the pandas have been installed into the system, you need to import the library. This
module is generally imported as:
import pandas as pd
Here, pd is referred to as an alias for Pandas. However, it is not necessary to import the
library using an alias; it just helps in writing less code every time a method or
property is called.
Pandas generally provide two data structures for manipulating data, They are:
Series
DataFrame
Series:
Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas
Series is essentially a single column of an Excel sheet. Labels need not be unique but must be of a
hashable type. The object supports both integer- and label-based indexing and provides a host
of methods for performing operations involving the index.
# Creating an empty series
import pandas as pd
import numpy as np
ser = pd.Series()
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
Output:
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object
Basic Operations:
These are the basic operations that we can perform on a dataset after we have loaded it into our
dataframe object.
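The studyTonight_df frame used in the snippets below is not constructed in this excerpt. Based on the columns referenced later (Fruit, Weight, Price), a plausible sketch is shown here (the values are assumptions); the same data dictionary is reused later when studyTonight_df2 is built with an extra Kind column.

import pandas as pd

data = {'Fruit': ['Apple', 'Banana', 'Mango', 'Orange'],
        'Weight': [120, 150, 200, 130],
        'Price': [30, 10, 60, 25]}
studyTonight_df = pd.DataFrame(data)
print(studyTonight_df)

# using the head function to get the first two entries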
studyTonight_df.head(2)
# using the tail function to get last two entries
studyTonight_df.tail(2)
Output:
Another way to access columns is by calling the column name as an attribute, as shown
below:
studyTonight_df.Fruit
Accessing Rows in a DataFrame:
Using the .loc[] function we can access the row-index name which is passed in as a parameter,
for example:
studyTonight_df.loc[2]
Output:
studyTonight_df2 = pd.DataFrame(data,
columns=['Fruit','Weight','Price','Kind'])
print(studyTonight_df2)
The column we just added, called Kind, didn't exist in our data frame before. Thus there are no
values corresponding to this. Therefore our dataframe reads this as a missing value and places
a NaN under the Kind column. Below is the output for the above code:
If we want to assign something to this column, we can attempt to assign a constant value for all
the rows. To do this, just select the column as shown below, and make it equal to some constant
value.
studyTonight_df2['Kind'] = 'Round'
print(studyTonight_df2)
As we can see in our output below, all the values corresponding to the column Kind have been
changed to the value Round.
A series can be mapped onto a dataframe column. This further proves the point that a
DataFrame is a combination of multiple Series.
st_ser = pd.Series(["Round", "Long", "Round", "Oval-ish"])
Let's map this series with our column Kind:
studyTonight_df2['Kind'] = st_ser
print(studyTonight_df2)
For this we will get the following output:
Hierarchical Indexing:
The index is like an address; it is how any data point across the dataframe or series can be
accessed. Rows and columns both have indexes: row indices are called the index, and for
columns it is the column names.
Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more than one column
as the index. In this example, we are going to use the homelessness.csv file.
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
Output:
In the above data frame, no index has been set. Printing df.columns gives:
Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members', 'state_pop'],
      dtype='object')
To make a column the index, we use the set_index() function of pandas. If we want
to make one column the index, we can simply pass the name of the column as a string to
set_index(). If we want to do multi-indexing or hierarchical indexing, we pass the list of
column names to set_index().
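A short sketch of both forms, using the homelessness.csv columns shown above:

# single-column index
df_single = df.set_index('state')
print(df_single.head())

# hierarchical (multi-level) index on region and state
df_multi = df.set_index(['region', 'state'])
print(df_multi.head())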
Output:
Now the dataframe is using Hierarchical Indexing or multi-indexing.
Let us now create two different DataFrames and perform the merging operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
To implement this in code, you'll use concat() and pass it a list of DataFrames that you
want to concatenate. Code for this task would look like this:
concatenated = pandas.concat([df1, df2])
Note: This example assumes that your column names are the same. If your column
names are different while concatenating along rows (axis 0), then by default the columns will
also be added, and NaN values will be filled in as applicable.
What if instead you wanted to perform a concatenation along columns? First, take a look
at a visual representation of this operation:
To accomplish this, you’ll use a concat() call like you did above, but you also will need to pass
the axis parameter with a value of 1:
concatenated = pandas.concat([df1, df2], axis=1)
Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation operations along
an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))
Suppose we wanted to associate specific keys with each of the pieces of the chopped up
DataFrame. We can do this by using the keys argument −
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))
If the resultant object has to follow its own indexing, set ignore_index to True.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))
Observe, the index changes completely and the Keys are also overridden.
If two objects need to be added along axis=1, then the new columns will be appended.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))
Concatenating Using append
A useful shortcut to concat are the append instance methods on Series and DataFrame. These
methods actually predated concat. They concatenate along axis=0, namely the index −
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one.append(two))
Time Series
Pandas provides a robust tool for working with time series data, especially in the financial
sector. While working with time series data, we frequently come across the following −
Generating sequence of time
Convert the time series to different frequencies
Pandas provides a relatively compact and self-contained set of tools for performing the above
tasks.
Get Current Time
pd.Timestamp.now() gives you the current date and time. (Older pandas versions used
pd.datetime.now(), which has since been removed.)
import pandas as pd
print(pd.Timestamp.now())
Its output is as follows −
2017-05-11 06:10:13.393147
Create a TimeStamp
Time-stamped data is the most basic type of timeseries data that associates values with points in
time. For pandas objects, it means using the points in time. Let’s take an example −
import pandas as pd
print(pd.Timestamp('2017-03-01'))
Its output is as follows −
2017-03-01 00:00:00
It is also possible to convert integer or float epoch times. The default unit for these is
nanoseconds (since these are how Timestamps are stored). However, often epochs are stored in
another unit which can be specified. Let’s take another example
import pandas as pd
print(pd.Timestamp(1587687255, unit='s'))
import pandas as pd
print(pd.date_range("11:00", "13:30", freq="30min").time)
Its output is as follows −
[datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0)
datetime.time(12, 30) datetime.time(13, 0) datetime.time(13,
30)]
Converting to Timestamps
To convert a Series or list-like object of date-like objects, for example strings, epochs, or a
mixture, you can use the to_datetime function. When passed a Series, this returns a Series (with the
same index), while a list-like is converted to a DatetimeIndex. Take a look at the following
example −
import pandas as pd
print(pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None])))
Reshaping :
Stack
In [1]:
import numpy as np
import pandas as pd
In [9]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo'],
['one', 'two', 'one', 'two',
'one', 'two']]))
In [10]:
index = pd.MultiIndex.from_tuples(tuples, names=['first',
'second'])
In [11]:
df = pd.DataFrame(np.random.randn(6, 2), index=index,
columns=['M', 'N'])
In [12]:df2 = df[:4]
In [13]:df2
Out[13]:
[df2: the first four rows of df, a DataFrame with the MultiIndex (first, second) =
(bar, one), (bar, two), (baz, one), (baz, two) and two columns M and N of random values]
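The stacking step itself is missing from this excerpt; presumably it was something like:

stacked = df2.stack()   # compresses the columns M and N into an extra index level
stacked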
Out[16]:
[output with columns M and N indexed by (first, second); the corresponding In[16] call is not
shown in the source and the values are random]
In [17]: stacked.unstack(1)
Out[17]:
[a DataFrame in which the second index level (one/two) has been unstacked into columns]
In [18]: stacked.unstack(0)
Out[18]:
[a DataFrame in which the first index level (bar/baz) has been unstacked into columns]
Pivot tables
In [19]: df = pd.DataFrame({'M': ['one', 'one', 'two', 'three']
* 2,
'N': ['A', 'B'] * 4,
'O': ['foo', 'foo', 'bar', 'bar'] * 2,
'P': np.random.randn(8),
'Q': np.random.randn(8)})
In [20]: df
Out[20]:
[an 8-row DataFrame with columns M, N, O, P and Q; P and Q hold random values]
You can produce pivot tables from this data very easily:
In [23]:pd.pivot_table(df, values='P', index=['M', 'N'],
columns=['O'])
Out[23]:
[a pivot table of the mean of P, indexed by (M, N), with one column per value of O (bar, foo);
combinations that have no matching rows show NaN]
Pivoting
The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation; multiple values will result in a
MultiIndex in the columns.
Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)
Parameters:
index : Column to use to make the new frame's index. If None, uses the existing index. (string or object, optional)
columns : Column to use to make the new frame's columns. (string or object, required)
values : Column(s) to use for populating the new frame's values. If not specified, all remaining
columns will be used and the result will have hierarchically indexed columns. (string, object or a list of the previous, optional)
Returns: DataFrame
Returns reshaped DataFrame.
Raises: ValueError when there are any index, columns combinations with multiple values.
Use DataFrame.pivot_table when you need to aggregate.
Example:
The pandas.pivot(index, columns, values) function produces a pivot table based on 3 columns of
the DataFrame. It uses unique values from index / columns and fills with values.
Parameters:
index[ndarray] : Labels to use to make new frame’s index
columns[ndarray] : Labels to use to make new frame’s columns
values[ndarray] : Values to use for populating new frame’s values
Returns: Reshaped DataFrame
Exception: ValueError raised if there are any duplicates.
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina'],
'B': ['Masters', 'Graduate', 'Graduate'],
'C': [27, 23, 21]})
df
# value is a list
df.pivot(index ='A', columns ='B', values =['C', 'A'])
Raise ValueError when there are any index, columns combinations with multiple values.
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'John', 'Mina'],
'B': ['Masters', 'Masters', 'Graduate'],
'C': [27, 23, 21]})
df.pivot('A', 'B', 'C')
ValueError: Index contains duplicate entries, cannot reshape
In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Out[2]:
['Peter', 'Paul', 'Mary', 'Guido']
This is perhaps sufficient to work with some data, but it will break if there are any missing
values. For example:
In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
In [4]:
import pandas as pd
names = pd.Series(data)
names
Out[4]:
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object
We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
In [5]:
names.str.capitalize()
Out[5]:
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
Using tab completion on this str attribute will list all the vectorized string methods available to
Pandas.
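The monte Series used in the following examples is not defined in this excerpt; from the outputs it is evidently:

import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])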
Notice that these have various return values. Some, like lower(), return a series of strings:
In [7]:
monte.str.lower()
Out[7]:
0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object
Still others return lists or other compound values for each element:
In [10]:
monte.str.split()
Out[10]:
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of
each string element, and follow some of the API conventions of Python's built-in re module:
Method Description
match() Call re.match() on each element, returning a boolean.
extract() Call re.match() on each element, returning matched groups as strings.
findall() Call re.findall() on each element
replace() Replace occurrences of pattern with some other string
contains() Call re.search() on each element, returning a boolean
count() Count occurrences of pattern
split() Equivalent to str.split(), but accepts regexps
rsplit() Equivalent to str.rsplit(), but accepts regexps
With these, you can do a wide range of interesting operations. For example, we can extract the
first name from each by asking for a contiguous group of characters at the beginning of each
element:
In [11]:
monte.str.extract('([A-Za-z]+)', expand=False)
Out[11]:
0 Graham
1 John
2 Terry
3 Eric
4 Terry
5 Michael
dtype: object
Or we can do something more complicated, like finding all names that start and end with a
consonant, making use of the start-of-string (^) and end-of-string ($) regular expression
characters:
In [12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Out[12]:
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object
The ability to concisely apply regular expressions across Series or Dataframe entries opens up
many possibilities for analysis and cleaning of data.
6. Gather information from different sources like CSV, Excel, JSON
When you start any project that directly or indirectly deals with data, the first and foremost
thing you would do is search for a dataset. Now gathering data could be done in various ways,
either using web scraping, a private dataset from a client, or a public dataset downloaded from
sources like GitHub, universities, kaggle, quandl, etc.
This data might be in an Excel file or saved with .csv, .txt, JSON, etc. file extension. The data
could be qualitative or quantitative. The data type could vary depending on the kind of problem
you plan to solve.
Text files are one of the most common file formats to store data, and Python makes it very easy to
read them. Python provides the open() function to read files; it takes the file path and the file access
mode as its parameters. For reading a text file, the file access mode is 'r'.
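A minimal sketch of these calls (the file name sample.txt is a placeholder):

# open the file in read mode
f = open('sample.txt', 'r')
print(f.read())          # read the whole file
f.close()

# reopen the file for the partial reads discussed below
f = open('sample.txt', 'r')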
The read() function imported all the data in the file in its original structured form.
print(f.read(10))
By providing a number to the read() function, we were able to extract the specified number of characters.
print(f.readline())
Using readline(), only a single line from the text file was extracted.
print(f.readlines())
Here, the readlines() function extracted all of the text file's data as a list of lines.
Reading CSV Files in Python
A CSV (Comma Separated Values) file is the most common type of file that a data scientist
will ever work with. These files use a "," as a delimiter to separate the values, and each row sits
on its own line.
These files are useful for transferring data from one application to another and are probably the reason
why they are so commonplace in the world of data science. If you look at them in Notepad, you will
see the values simply separated by commas.
The Pandas library makes it very easy to read CSV files using the read_csv() function:
# import pandas
import pandas as pd
df = pd.read_csv(r'./Importing files/Products.csv')
# display DataFrame
df
But CSV can run into problems if the values themselves contain commas. This can be overcome by using
different delimiters to separate information in the file, like '\t' or ';'. These can also be
imported with the read_csv() function by specifying the delimiter in the delimiter parameter, as shown below:
import pandas as pd
df = pd.read_csv(r'./Importing
files/Employee.txt',delimiter='\t')
df
Pandas has a very handy function called read_excel() to read Excel files:
df = pd.read_excel(r'./Importing files/World_city.xlsx')
# print values
df
But an Excel file can contain multiple sheets, right? So how can we access them?
For this, we can use the Pandas’ ExcelFile() function to print the names of all the sheets in the
file:
xl = pd.ExcelFile(r'./Importing files/World_city.xlsx')
xl.sheet_names
After doing that, we can easily read data from any sheet we wish by providing its name to the sheet_name parameter:
df = pd.read_excel(r'./Importing files/World_city.xlsx',sheet_name='Europe')
df
And voila!
JSON (JavaScript Object Notation) files are lightweight and human-readable files to store and
exchange data. It is easy for machines to parse and generate these files, and they are based on the
JavaScript programming language.
JSON files store data within {}, similar to how a dictionary stores it in Python. But their major
benefit is that they are language-independent, meaning they can be used with any programming
language.
Python provides a json module to read JSON files. You can read JSON files just like simple
text files. However, the read function, in this case, is replaced by the json.load() function, which
returns a JSON dictionary.
Once you have done that, you can easily convert it into a Pandas dataframe using the
pandas.DataFrame() function:
import json
# 'file' is a file object returned by open() on the JSON file (the path is not shown in the source)
data = json.load(file)
# json dictionary
print(type(data))
# loading into a DataFrame
df_json = pd.DataFrame(data)
df_json
But you can even load the JSON file directly into a dataframe using
df = pd.read_json(path)
df
7. Gather the required web information using Web Scrapping
Web scraping is an automatic method to obtain large amounts of data from websites.
Most of this data is unstructured data in an HTML format which is then converted into
structured data in a spreadsheet or a database so that it can be used in various applications.
For this example, we are going to scrape the Flipkart website to extract the Price, Name, and Rating
of laptops. The URL for this page is
https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.
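The scraping snippet below relies on imports and empty result lists that are not shown in this excerpt; a hedged sketch of that setup:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2"
products = []   # list to store the names of the products
prices = []     # list to store the prices of the products
ratings = []    # list to store the ratings of the products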
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
driver.get(url)   # open the URL (this step is implied by the text below)
Now that we have written the code to open the URL, it’s time to extract the data from the
website. As mentioned earlier, the data we want to extract is nested in <div> tags. So, find the
div tags with those respective class-names, extract the data and store the data in a variable.
Refer the code below:
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    # the price extraction line is missing in the source; the class name below is a placeholder
    price = a.find('div', attrs={'class': 'price-class'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
Now, run the whole code again.
A file named "products.csv" is created, and this file contains the extracted data.
8. Write a Python program to do the following operations:
a) Loading data from CSV file
b) Compute the basic statistics of given data - shape, no. of columns, mean
c) Splitting a data frame on values of categorical variables
d) Visualize data using Scatter plot
RESOURCES:
a) Python 3.7.0
b) Install: pip installer, Pandas library
PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)
PROGRAM LOGIC:
a) Loading data from CSV file
# loading a csv file
import pandas as pd
pd.read_csv("P:/python/newfile.csv")
b) Compute the basic statistics of given data - shape, no. of columns, mean
# shape
a = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
print('shape :', a.shape)
# no of columns
cols = len(a.axes[1])
print('no of columns:', cols)
# mean of a column ('Age' assumed from the print label below)
m = a['Age'].mean()
print('mean of Age:', m)
b)
shape: (4, 3)
no. of columns:3
mean:87.5
c)
before:
student rollno marks address
0 a1 121 98 hyderabad,ts
1 a2 122 82 Warangal,ts
2 a3 123 92 Adilabad,ts
3 a4 124 78 medak,ts
After:
student rollno marks district state
0 a1 121 98 hyderabad ts
1 a2 122 82 Warangal ts
2 a3 123 92 Adilabad ts
3 a4 124 78 medak ts
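The code for step (c) is not shown in this excerpt; a sketch that matches the before/after output above (splitting the address column on the comma, assuming the student table is loaded in df):

# split 'address' into 'district' and 'state' on the comma
df[['district', 'state']] = df['address'].str.split(',', expand=True)
df = df.drop(columns=['address'])
print(df)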
d)
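For step (d), a minimal scatter-plot sketch (the columns plotted here are an assumption based on the student data above):

import matplotlib.pyplot as plt

plt.scatter(df['rollno'], df['marks'])
plt.xlabel('rollno')
plt.ylabel('marks')
plt.show()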
9. Write a python program to impute missing values with various techniques on given
dataset.
a) Remove rows/ attributes
b) Replace with mean or mode
c) Write a python program to perform transformation of data using Discretization
(Binning) and normalization (MinMaxScaler or MaxAbsScaler) on given dataset.
RESOURCES: a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)
data.replace(to_replace=np.nan, value=-99)   # replace missing values with a constant
# Remove rows / attributes: see the fuller sketch below
mean_y = np.mean(ys)   # mean of a column, used for mean imputation
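A fuller sketch of the two imputation options, assuming a DataFrame named data that contains missing values (the column names Age and City are placeholders):

import pandas as pd
import numpy as np

# a) remove rows (or columns/attributes) containing missing values
data_rows_dropped = data.dropna(axis=0)
data_cols_dropped = data.dropna(axis=1)

# b) replace missing values with the mean (numeric) or the mode (categorical)
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['City'] = data['City'].fillna(data['City'].mode()[0])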
Equal width (or distance) binning: The simplest binning approach is to partition the
range of the variable into k equal-width intervals. The interval width is simply the range
[A, B] of the variable divided by k:
w = (B - A) / k
Thus, the ith interval range will be [A + (i-1)w, A + iw] where i = 1, 2, 3, ..., k.
Skewed data cannot be handled well by this method.
Equal depth (or frequency) binning : In equal-frequency binning we divide the range [A,
B] of the variable into intervals that contain (approximately) equal number of points; equal
frequency may not be possible due to repeated values.
Example:
Sorted data for price(in dollar) : 2, 6, 7, 9, 13, 20, 21, 25, 30
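With three equal-frequency bins, the nine prices above fall into [2, 6, 7], [9, 13, 20] and [21, 25, 30]. A short pandas sketch of both binning styles:

import pandas as pd

prices = pd.Series([2, 6, 7, 9, 13, 20, 21, 25, 30])
# equal-width bins: each interval spans (30 - 2) / 3 units
print(pd.cut(prices, bins=3))
# equal-depth (equal-frequency) bins: each bin holds roughly 3 values
print(pd.qcut(prices, q=3))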
import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn import datasets, linear_model, metrics
# take one of the 4 columns (index 1) of the data set
# (the setup lines below are reconstructed from the references to a and b in the excerpt)
dataset = load_iris()
a = dataset.data
b = np.zeros(150)
for i in range(150):
    b[i] = a[i, 1]
b = np.sort(b)   # sort the values before binning (needed to reproduce the output below)
# create bins
bin1 = np.zeros((30, 5))   # bin means
bin2 = np.zeros((30, 5))   # bin boundaries
bin3 = np.zeros((30, 5))   # bin medians
# Bin mean
for i in range(0, 150, 5):
    k = int(i / 5)
    mean = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5
    for j in range(5):
        bin1[k, j] = mean
print("Bin Mean: \n", bin1)
# Bin boundaries
for i in range(0, 150, 5):
    k = int(i / 5)
    for j in range(5):
        if (b[i+j] - b[i]) < (b[i+4] - b[i+j]):
            bin2[k, j] = b[i]
        else:
            bin2[k, j] = b[i+4]
print("Bin Boundaries: \n", bin2)
# Bin median
for i in range(0, 150, 5):
    k = int(i / 5)
    for j in range(5):
        bin3[k, j] = b[i+2]
print("Bin Median: \n", bin3)
Sample output (corresponding rows of Bin Mean, Bin Boundaries and Bin Median):
[3.5  3.5  3.5  3.5  3.5 ]   [3.5 3.5 3.5 3.5 3.5]   [3.5 3.5 3.5 3.5 3.5]
[3.58 3.58 3.58 3.58 3.58]   [3.5 3.6 3.6 3.6 3.6]   [3.6 3.6 3.6 3.6 3.6]
[3.74 3.74 3.74 3.74 3.74]   [3.7 3.7 3.7 3.8 3.8]   [3.7 3.7 3.7 3.7 3.7]
[3.82 3.82 3.82 3.82 3.82]   [3.8 3.8 3.8 3.8 3.9]   [3.8 3.8 3.8 3.8 3.8]
[4.12 4.12 4.12 4.12 4.12]]  [3.9 3.9 3.9 4.4 4.4]]  [4.1 4.1 4.1 4.1 4.1]]
OUTPUT
MinMaxScaler(copy=True, feature_range=(0, 1))
data:
[ 1. 18.]
Transformed data:
[[0. 0.]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]
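The code that produces this output is not included in the excerpt. The transformed values match the standard scikit-learn example, so a hedged sketch (the input data below is an assumption) is:

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]   # assumed input
scaler = MinMaxScaler()
print(scaler.fit(data))          # MinMaxScaler(...)
print(scaler.transform(data))    # the scaled array shown above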
Types Of Classification
There are two main types of classification:
Binary Classification – sorts data on the basis of discrete or non-continuous values
(usually two values). For example, a medical test may sort patients into those that have a
specific disease versus those that do not.
Multi-class Classification – sorts data into three or more classes. For example, medical
profiling that sorts patients into those with kidney, liver, lung, or bladder infection
symptoms.
# Import dataset:
url = "iris.csv"
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Predict y data with classifier:
y_predict = classifier.predict(X_test)
# Print results:
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))
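The fragments above omit the loading, splitting, scaling and classifier-creation steps. A hedged end-to-end sketch, assuming iris.csv holds four feature columns plus a class column and using a k-nearest-neighbours classifier (the classifier type is not stated in the source):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

url = "iris.csv"
dataset = pd.read_csv(url)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

y_predict = classifier.predict(X_test)
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))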
Data labeling is the process of assigning labels to subsets of data based on its
characteristics. Data labeling takes unlabeled datasets and augments each piece of data with
informative labels or tags.
Most commonly, data is annotated with a text label. However, there are many use cases
for labeling data with other types of labels. Labels provide context for data ranging from images
to audio recordings to x-rays, and more.
4. Click on the Import button to import your data from various sources.
Once the data is imported, you can scroll down the page and preview it.
You can now choose among the many options to finish setup for your specific project.
12. Perform Data Visualization with Python
Data Visualization is the presentation of data in a graphical format. It helps people understand
the significance of data by summarizing and presenting huge amounts of data in a simple and
easy-to-understand format, and it helps communicate information clearly and effectively.
Consider this given Data-set for which we will be plotting different charts :
1. Histogram :
The histogram represents the frequency of occurrence of specific phenomena which lie within
a specific range of values and arranged in consecutive and fixed intervals.
In the code below, histograms are plotted for Age, Income and Sales, so the plots in the output
show how frequently the values of each attribute occur.
import pandas as pd
import matplotlib.pyplot as plt
# create 2D array of table given above
data = [['E001', 'M', 34, 123, 'Normal', 350],
['E002', 'F', 40, 114, 'Overweight', 450],
['E003', 'F', 37, 135, 'Obesity', 169],
['E004', 'M', 30, 139, 'Underweight', 189],
['E005', 'F', 44, 117, 'Underweight', 183],
['E006', 'M', 36, 121, 'Normal', 80],
['E007', 'M', 32, 133, 'Obesity', 166],
['E008', 'F', 26, 140, 'Normal', 120],
['E009', 'M', 32, 133, 'Normal', 75],
['E010', 'M', 36, 133, 'Underweight', 40] ]
# dataframe created with
# the above data array
df = pd.DataFrame(data, columns = ['EMPID', 'Gender',
'Age', 'Sales',
'BMI', 'Income'] )
# create histogram for numeric data
df.hist()
# show plot
plt.show()
Output :
2. Column Chart :
A column chart is used to show a comparison among different attributes, or it can show a
comparison of items over time.
# Dataframe of previous code is used here
# Plot the bar chart for numeric values
# a comparison will be shown between
# all 3 age, income, sales
df.plot.bar()
# plot between 2 attributes
plt.bar(df['Age'], df['Sales'])
plt.xlabel("Age")
plt.ylabel("Sales")
plt.show()
Output :
4. Pie Chart :
A pie chart shows a static number and how categories represent part of a whole, i.e. the
composition of something. A pie chart represents numbers as percentages, and the total sum of
all segments needs to equal 100%.
plt.pie(df['Age'], labels=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
        autopct='%1.1f%%', shadow=True)
plt.show()
plt.pie(df['Income'], labels=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
        autopct='%1.1f%%', shadow=True)
plt.show()
5. Scatter plot :
A scatter chart shows the relationship between two different variables and it can reveal the
distribution trends. It should be used when there are many different data points, and you want
to highlight similarities in the data set. This is useful when looking for outliers and for
understanding the distribution of your data.
# scatter plot between income and age
plt.scatter(df['Income'], df['Age'])
plt.show()
13. Write a python program to load the dataset and understand the input data:
Load data, describe the given data and identify missing, outlier data items
Perform Univariate, Segmented Univariate and Bivariate analysis
Identify any derived metrics for the given data.
Find correlation among all attributes
Visualize correlation matrix
RESOURCES:
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)
PROGRAM LOGIC:
a) Load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
#describe the given data
print(df.describe())
#Display first 10 rows of data
print(df.head(10))
#Missing values
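The remaining steps of this experiment (missing values, outliers, correlation and its visualization) are not shown in this excerpt; a hedged sketch continuing with the same df (the column Glucose is used only as an example):

# count missing values per column
print(df.isnull().sum())

# simple outlier check with the IQR rule on one numeric column
q1, q3 = df['Glucose'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Glucose'] < q1 - 1.5 * iqr) | (df['Glucose'] > q3 + 1.5 * iqr)]
print(outliers)

# correlation among all attributes
corr = df.corr()
print(corr)

# visualize the correlation matrix
import matplotlib.pyplot as plt
plt.matshow(corr)
plt.colorbar()
plt.show()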
14. Perform Encoding categorical features on given dataset.
The performance of a machine learning model not only depends on the model and the
hyperparameters but also on how we process and feed different types of variables to the model.
Since most machine learning models only accept numerical variables, preprocessing the
categorical variables becomes a necessary step. We need to convert these categorical variables
to numbers such that the model is able to understand and extract valuable information.
A typical data scientist spends 70 – 80% of his time cleaning and preparing the data.
And converting categorical data is an unavoidable activity. It not only elevates the model
quality but also helps in better feature engineering. Now the question is, how do we proceed?
Which categorical data encoding method should we use?
While encoding Nominal data, we have to consider the presence or absence of a feature.
In such a case, no notion of order is present. For example, the city a person lives in. For the
data, it is important to retain where a person lives. Here, We do not have any order or sequence.
It is equal if a person lives in Delhi or Bangalore.
For encoding categorical data, we have a python package category_encoders. The following
code helps you install easily.
pip install category_encoders
In Label encoding, each label is converted into an integer value. We will create a variable that
contains the categories representing the education qualification of a person.
import category_encoders as ce
import pandas as pd
train_df=pd.DataFrame({'Degree':['High
school','Masters','Diploma','Bachelors','Bachelors','Masters','P
hd','High school','High school']})
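The encoding step itself is not shown here; with category_encoders, a label/ordinal encoding sketch would be:

# create an ordinal (label) encoder object for the Degree column
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True)
# fit and transform the data
train_df_encoded = encoder.fit_transform(train_df)
print(train_df_encoded)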
One Hot Encoding
We use this categorical data encoding technique when the features are nominal(do not have any
order). In one hot encoding, for each level of a categorical feature, we create a new variable.
Each category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the
absence, and 1 represents the presence of that category.
These newly created binary features are known as Dummy variables. The number of dummy
variables depends on the levels present in the categorical variable. This might sound
complicated. Let us take an example to understand this better. Suppose we have a dataset with a
category animal, having different animals like Dog, Cat, Sheep, Cow, Lion. Now we have to
one-hot encode this data.
After encoding, in the second table, we have dummy variables each representing a category in
the feature Animal. Now for each category that is present, we have 1 in the column of that
category and 0 for the others. Let’s see how to implement a one-hot encoding in python.
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hydrabad','Chennai','Bangalore','Delhi','Hydrabad','Bangalore','Delhi'
]})
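The creation of the encoder and the transformation are not shown in this excerpt; a sketch using category_encoders:

# create a one-hot encoder object for the City column
encoder = ce.OneHotEncoder(cols=['City'], return_df=True, use_cat_names=True)
data_encoded = encoder.fit_transform(data)
print(data_encoded)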
Dummy Encoding
Dummy coding scheme is similar to one-hot encoding. This categorical data encoding method
transforms the categorical variable into a set of binary variables (also known as dummy
variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary
variables. The dummy encoding is a small improvement over one-hot-encoding. Dummy
encoding uses N-1 features to represent N labels/categories.
To understand this better let’s see the image below. Here we are coding the same data using
both one-hot encoding and dummy encoding techniques. While one-hot uses 3 variables to
represent the data whereas dummy encoding uses 2 variables to code 3 categories.
#Original Data
data
#encode the data
data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded
Here using drop_first argument, we are representing the first label Bangalore using 0.
Effect Encoding:
This encoding technique is also known as Deviation Encoding or Sum
Encoding. Effect encoding is almost similar to dummy encoding, with a little difference. In
dummy coding, we use 0 and 1 to represent the data but in effect encoding, we use three values
i.e. 1,0, and -1.
The row containing only 0s in dummy encoding is encoded as -1 in effect encoding. In
the dummy encoding example, the city Bangalore at index 4 was encoded as 0 0 0 0, whereas in
effect encoding it is represented by -1 -1 -1 -1.
#Original Data
data
# Create the encoder object (a Sum/effect encoder is assumed here; the source does not show which one was created)
encoder = ce.SumEncoder(cols=['City'], verbose=False)
encoder.fit_transform(data)
Effect encoding is an advanced technique. In case you are interested to know more about effect
encoding, refer to this interesting paper.
Hash Encoder
To understand Hash encoding it is necessary to know about hashing. Hashing is the
transformation of arbitrary size input in the form of a fixed-size value. We use hashing
algorithms to perform hashing operations i.e to generate the hash value of an input. Further,
hashing is a one-way process, in other words, one can not generate original input from the hash
representation.
Hashing has several applications like data retrieval, checking data corruption, and in data
encryption also. We have multiple hash functions available for example Message Digest (MD,
MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the Hash encoder represents categorical features using the new
dimensions. Here, the user can fix the number of dimensions after transformation
using n_component argument. Here is what I mean – A feature with 5 categories can be
represented using N new features similarly, a feature with 100 categories can also be
transformed using N new features.
By default, the Hashing encoder uses the md5 hashing algorithm but a user can pass any
algorithm of his choice.
import category_encoders as ce
import pandas as pd
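A hedged sketch of the hash encoder, reusing a City column like the earlier examples and fixing the number of output dimensions to 7 for illustration:

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad']})
# create the hash encoder with a fixed number of output components
encoder = ce.HashingEncoder(cols='City', n_components=7)
data_encoded = encoder.fit_transform(data)
print(data_encoded)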
Since hashing transforms the data into fewer dimensions, it may lead to loss of information.
Another issue faced by the hashing encoder is collision: since a large number of features
are depicted in fewer dimensions, multiple values can be represented by the same hash
value.
Moreover, hashing encoders have been very successful in some Kaggle competitions. It is great
to try if the dataset has high cardinality features.
Binary Encoding
Binary encoding is a combination of Hash encoding and one-hot encoding. In this encoding
scheme, the categorical feature is first converted into numerical using an ordinal encoder. Then
the numbers are transformed in the binary number. After that binary value is split into different
columns.
Binary encoding works really well when there are a high number of categories. For example the
cities in a country where a company supplies its products.
#Import the libraries
import category_encoders as ce
import pandas as pd
#Original Data
data
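The data frame and the encoder are not constructed in this excerpt; a sketch assuming the same kind of City column as in the earlier examples:

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai',
                              'Bangalore', 'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})
# create a binary encoder object for the City column
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)
data_encoded = encoder.fit_transform(data)
print(data_encoded)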
Binary encoding is a memory-efficient encoding scheme as it uses fewer features than one-hot
encoding. Further, It reduces the curse of dimensionality for data with high cardinality.
Base N Encoding
Before diving into BaseN encoding let’s first try to understand what is Base here?
In the numeral system, the Base or the radix is the number of digits or a combination of digits
and letters used to represent the numbers. The most common base we use in our life is 10 or
decimal system as here we use 10 unique digits i.e 0 to 9 to represent all the numbers. Another
widely used system is binary i.e. the base is 2. It uses 0 and 1 i.e 2 digits to express all the
numbers.
For Binary encoding, the Base is 2 which means it converts the numerical values of a category
into its respective Binary form. If you want to change the Base of encoding scheme you may
use Base N encoder. In the case when categories are more and binary encoding is not able to
handle the dimensionality then we can use a larger base such as 4 or 8.
#Import the libraries
import category_encoders as ce
import pandas as pd
#Create the dataframe
data=pd.DataFrame({'City':
['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})
#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)
#Original Data
data
#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded
In the above example, I have used base 5, also known as the quinary system. It is similar to the
Binary encoding example: while Binary encoding represents the same data with 4 new features,
BaseN encoding uses only 3 new variables.
Hence the BaseN encoding technique further reduces the number of features required to represent
the data efficiently and improves memory usage. The default Base for Base N is 2, which is
equivalent to Binary Encoding.
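The same logarithmic intuition explains why a larger base needs fewer columns. A quick sketch (100 categories is again an illustrative figure, and the exact counts from BaseNEncoder may differ slightly by implementation):
import math
#Approximate columns needed for 100 categories at different bases
n_categories = 100
for base in (2, 4, 5, 8):
    print(base, math.ceil(math.log(n_categories + 1, base)))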
Target Encoding
Target encoding is a Bayesian encoding technique.
Bayesian encoders use information from dependent/target variables to encode the categorical
data.
In target encoding, we calculate the mean of the target variable for each category and replace
the category variable with that mean value. In the case of categorical target variables, the
posterior probability of the target replaces each category.
#import the libraries
import pandas as pd
import category_encoders as ce
#Original Data
data
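Only the imports of this example survive above, so here is a minimal sketch of target encoding with category_encoders (the 'class'/'Marks' values are made-up illustrations, not the manual's original table). The groupby line shows the per-category target means; note that TargetEncoder smooths each category mean toward the overall mean, so the encoded values are close to, but not exactly equal to, the raw means.
#import the libraries
import pandas as pd
import category_encoders as ce
#Illustrative data: a categorical feature and a numeric target
data=pd.DataFrame({'class':['A','B','C','B','C','A','A','A'],
                   'Marks':[50,30,70,80,45,97,80,68]})
#Per-category mean of the target, computed directly with pandas
print(data.groupby('class')['Marks'].mean())
#Create an object for Target encoding, then fit and transform the feature
encoder=ce.TargetEncoder(cols=['class'])
data_encoded=encoder.fit_transform(data['class'],data['Marks'])
data_encoded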
We perform Target encoding on the training data only and encode the test data using the results
obtained from the training dataset. Although it is a very efficient coding system, it has the
following issues that can deteriorate model performance:
1. It can lead to target leakage or overfitting. To address overfitting we can use different
techniques (see the sketch after this list):
    1. In leave-one-out encoding, the current row's target value is excluded when computing the
    category mean for that row, which avoids leaking the row's own target.
    2. In another method, we may introduce some Gaussian noise into the target statistics.
    The amount of this noise is a hyperparameter of the model.
2. The second issue we may face is an uneven distribution of categories between the train and test
data. In such a case, the categories may take extreme values, so the target mean for each
category is blended with the marginal (overall) mean of the target.
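As a hedged illustration of the leave-one-out and Gaussian-noise remedies mentioned above, category_encoders provides a LeaveOneOutEncoder whose sigma parameter injects Gaussian noise into the training-time encoding (the data and the sigma value below are illustrative, not from the manual):
#Leave-one-out encoding with optional Gaussian noise to reduce target leakage
import pandas as pd
import category_encoders as ce
data=pd.DataFrame({'class':['A','B','C','B','C','A','A','A'],
                   'Marks':[50,30,70,80,45,97,80,68]})
#sigma adds Gaussian noise during training; 0.05 is an illustrative value
encoder=ce.LeaveOneOutEncoder(cols=['class'],sigma=0.05)
data_encoded=encoder.fit_transform(data['class'],data['Marks'])
data_encoded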
15. Perform Simple Linear Regression using Data Analysis Toolbox of Excel and
Python to interpret the regression table.
In statistical modeling, regression analysis is used to estimate the relationships between two or
more variables:
Dependent variable (aka criterion variable) is the main factor you are trying to
understand and predict.
Independent variables (aka explanatory variables, or predictors) are the factors that
might influence the dependent variable.
Regression analysis helps you understand how the dependent variable changes when one of
the independent variables varies, and it allows you to determine mathematically which of those
variables really has an impact.
The three main methods to perform linear regression analysis in Excel are:
Regression tool included with Analysis ToolPak
Scatter chart with a trendline
Linear regression formula
Below you will find the detailed instructions on using each method.
How to do linear regression in Excel with Analysis ToolPak
This example shows how to run regression in Excel by using a special tool included with the
Analysis ToolPak add-in.
First enable the Analysis ToolPak add-in; this will add the Data Analysis tools to the Data tab of
your Excel ribbon. With the Analysis ToolPak enabled, carry out these steps to perform regression
analysis in Excel:
1. On the Data tab, in the Analysis group, click the Data Analysis button.
Select Regression and click OK.
In the Regression dialog box, configure the following settings:
o Select the Input Y Range, which is your dependent variable. In our case, it's
umbrella sales (C1:C25).
o Select the Input X Range, i.e. your independent variable. In this example, it's
the average monthly rainfall (B1:B25).
If you are building a multiple regression model, select two or more adjacent columns with
different independent variables.
o Check the Labels box if there are headers at the top of your X and Y ranges.
o Choose your preferred Output option, a new worksheet in our case.
2. Optionally, select the Residuals checkbox to get the difference between the predicted
and actual values.
Click OK and observe the regression analysis output created by Excel.
As you may notice, the regression equation Excel has created for us is the same as the linear
regression formula we built based on the Coefficients output.
3. Switch to the Fill & Line tab and customize the line to your liking. For example, you can
choose a different line color and use a solid line instead of a dashed line (select Solid line in
the Dash type box):
Still, you may want to make a few more improvements:
Drag the equation wherever you see fit.
Add axes titles (Chart Elements button > Axis Titles).
If your data points start in the middle of the horizontal and/or vertical axis like in this
example, you may want to get rid of the excessive white space. The following tip explains how
to do this: Scale the chart axes to reduce white space.
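Since the experiment title also calls for Python, here is a minimal sketch that produces a comparable regression table using pandas and statsmodels (an assumed choice of library, as the manual does not specify one; the rainfall and umbrella-sales figures are illustrative, not the workbook's actual B1:C25 values):
#Simple linear regression in Python with statsmodels
import pandas as pd
import statsmodels.api as sm
#Illustrative data: average monthly rainfall (X) and umbrella sales (Y)
df = pd.DataFrame({
    'Rainfall': [82, 92, 104, 110, 120, 130, 135, 142, 150, 160, 170, 182],
    'Umbrellas': [15, 18, 22, 23, 27, 30, 31, 33, 36, 38, 41, 45],
})
X = sm.add_constant(df['Rainfall'])   # adds the intercept term
model = sm.OLS(df['Umbrellas'], X).fit()
#The summary mirrors Excel's output: R Square, Standard Error, the ANOVA F-test, and the Coefficients table
print(model.summary())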
16. Perform Multiple Linear Regression using Data Analysis Toolbox of Excel and Python
to interpret the regression table.
Let’s take a practical look at modeling a Multiple Regression model for the Gross Domestic
Product (GDP) of a country.
The goal is to show you how to run multiple Regression in Excel and interpret the output, not to
teach you how to set up model assumptions or choose the most appropriate variables.
Now that we have this out of the way and expectations are set, let’s open Excel and get started!
Sourcing our data
For this exercise, we will obtain public data from Eurostat, the statistics database of the European
Commission. All the relevant source data is within the model file for your convenience, which you
can download below. I have also kept the links to the source tables if you want to explore further.
The EU dataset gives us information for all member states of the union. As a massive fan of
Agatha Christie’s Hercule Poirot, let’s direct our attention to Belgium.
As you can see in the table below, we have nineteen observations of our target variable (GDP),
as well as our three predictor variables:
X1 — Education Spend in mil.;
X2 — Unemployment Rate as % of the Labor Force;
X3 — Employee compensation in mil.
Even before we run our regression model, we notice some dependencies in our data. Looking at
the development over the periods, we can assume that GDP increases together with Education
Spend and Employee Compensation.
Go to the Data tab, and on the right you will see the Data Analysis tool within the Analyze
section.
Run it and pick Regression from all the options. Note, we use the same menu for both simple
(single) and multiple linear regression models.
Now it’s time to set some ranges and settings.
The Y Range will include our dependent variable, GDP. And in the X Range, we will select all X
variable columns. Please note that this is the same as running a single linear regression; the only
difference is that we choose multiple columns for the X Range.
Remember that Excel requires that all X variables are in adjacent columns.
As I have selected the column titles, it is crucial to mark the Labels checkbox. A 95%
confidence interval is appropriate in most financial analysis scenarios, so we will not change this.
You can then consider placing the data on the same sheet or a new one. A new worksheet usually
works best, as the tool inserts quite a lot of data.
The information we got out of Excel’s Data Analysis module starts with the Regression
Statistics.
R Square is the most important among those, so we can start by looking at it. Specifically, we
should look at Adjusted R Square in our case, as we have more than one X variable. It gives us
an idea of the overall goodness of the fit.
An Adjusted R Square of 0.98 means our regression model can explain around 98% of the
variation of the dependent variable Y (GDP) around the average value of the observations (the
mean of our sample). In other words, about 98% of the variability in Y is captured by our model's
predictions ŷ (y-hat). Such a high value would usually indicate there might be some issue with our
model. We will continue with it here, but a too-high R Square can be problematic in a real-life
scenario. I suggest you read the article on Statistics by Jim to learn why too good is not always
right in terms of R Square.
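If you want to verify Excel's Adjusted R Square by hand, it can be computed from R Square, the number of observations n, and the number of predictors k. A quick sketch using the figures from this example (0.98 is taken as an illustrative R Square):
#Adjusted R Square from R Square, n observations and k predictors
r_square = 0.98   # illustrative value
n, k = 19, 3      # 19 observations and 3 X variables in this example
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
print(round(adj_r_square, 4))   # about 0.976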
The Standard Error gives us an estimate of the standard deviation of the error (residuals).
Generally, if the coefficient is large compared to the standard error, it is probably statistically
significant.
The Analysis of Variance section is something we often skip when modeling Regression.
However, it can provide valuable insights, and it’s worth taking a look at. You can read more
about running an ANOVA test and see an example model in our dedicated article.
This table gives us an overall test of significance on the regression parameters.
The ANOVA table’s F column gives us the overall F-test of the null hypothesis that all
coefficients are equal to zero. The alternative hypothesis is that at least one of the coefficients is
not equal to zero. The Significance F column shows us the p-value for the F-test. As it is lower
than the significance level of 0.05 (at our chosen confidence level of 95%), we can reject the null
hypothesis that all coefficients are equal to zero. This means our regression parameters are
jointly statistically significant.
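As with the simple regression experiment, the same kind of multiple regression table can be produced in Python. The sketch below again assumes statsmodels, and the predictor values are synthetic stand-ins generated only for illustration, not the Eurostat figures from the workbook:
#Multiple linear regression in Python with statsmodels
import numpy as np
import pandas as pd
import statsmodels.api as sm
#Synthetic stand-in data for the three predictors and GDP (illustrative only)
rng = np.random.default_rng(0)
n = 19
df = pd.DataFrame({
    'EducationSpend': np.linspace(9000, 14000, n) + rng.normal(0, 150, n),
    'UnemploymentRate': np.linspace(8.5, 5.5, n) + rng.normal(0, 0.2, n),
    'EmployeeCompensation': np.linspace(150000, 210000, n) + rng.normal(0, 2000, n),
})
df['GDP'] = (20 + 3 * df['EducationSpend'] + 1.5 * df['EmployeeCompensation']
             - 2000 * df['UnemploymentRate'] + rng.normal(0, 3000, n))
#Fit GDP against all three X variables at once, mirroring the multiple-column X Range in Excel
X = sm.add_constant(df[['EducationSpend', 'UnemploymentRate', 'EmployeeCompensation']])
model = sm.OLS(df['GDP'], X).fit()
#The summary includes Adjusted R Square, the overall F-test, and the coefficient table
print(model.summary())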