R22 Data Science Using Python Lab Manual

1. Perform Descriptive statistics of given dataset using Data Analysis Toolbox of Excel.

Descriptive statistics summarize your dataset, painting a picture of its properties. These
properties include various central tendency and variability measures, distribution
properties, outlier detection, and other information. Unlike inferential statistics, descriptive
statistics only describe your dataset’s characteristics and do not attempt to generalize from
a sample to a population.
Using a single function, Excel can calculate a set of descriptive statistics for your dataset. This
experiment is also a good introduction to interpreting descriptive statistics, even if Excel isn’t your
primary statistical software package.
This experiment provides step-by-step instructions for using Excel to calculate
descriptive statistics for your data. Importantly, it will also show you how to interpret the
results, determine which statistics are most applicable to your data, and help you navigate some
of the lesser-known values.
Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On
the Data tab, look for Data Analysis, as shown below.

If you don’t see Data Analysis, install that ToolPak.

Descriptive Statistics in Excel


Let’s start with a caveat: use descriptive statistics together with graphs. The statistical
output contains numbers that describe the properties of your data. While these numbers provide useful
information, charts are often more intuitive. The best practice is to use graphs and statistical
output together to maximize your understanding, so plot histograms of the variables in this dataset
alongside the numerical summary.
For this example, we’ll assess two variables, the height and weight of preteen girls. I
collected these data during a real experiment. To use this feature in Excel, arrange your data in
columns or rows. I have my data in columns, as shown in the snippet below.
Download the Excel file that contains the data for this example: HeightWeight.
In Excel, click Data Analysis on the Data tab, as shown above. In the Data Analysis popup,
choose Descriptive Statistics, and then follow the steps below.

Step-by-Step Instructions for Filling in Excel’s Descriptive Statistics Box
1. Under Input Range, select the range for the variables that you want to analyze. You can
include multiple variables as long as they form a contiguous block. While you can
explore more than one variable, the analysis assesses each variable in a univariate
manner (i.e., no correlation).
2. In Grouped By, choose how your variables are organized. I always include one variable
per column as this format is standard across software. Alternatively, you can include one
variable per row.
3. Check the Labels in first row checkbox if you have meaningful variable names in row
1. This option makes the output easier to interpret.
4. In Output options, choose where you want Excel to display the results.
5. Check the Summary statistics box to display most of the descriptive statistics (central
tendency, dispersion, distribution properties, sum, and count).
6. Check the Confidence Level for Mean box to display a confidence interval for the
mean. Enter the confidence level. 95% is usually a good value. For more information
about confidence levels, read my post about confidence intervals.
7. Check Kth Largest and Kth Smallest to display a high and low value. If you enter 1,
Excel displays the highest and lowest values. If you enter 2, it shows the 2nd highest and
lowest values. Etc.
8. Click OK.
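The same summary can also be reproduced in Python with pandas, which is useful for cross-checking the Excel output. The snippet below is only a minimal sketch: it assumes the example data has been saved as HeightWeight.xlsx with columns named Height and Weight (file name and column labels are assumptions for illustration), and it uses scipy.stats for the confidence interval of the mean.

import pandas as pd
import scipy.stats as st

df = pd.read_excel("HeightWeight.xlsx")      # or pd.read_csv(...) for a CSV export

# Central tendency, dispersion and distribution shape
print(df.describe())                         # count, mean, std, min, quartiles, max
print(df.skew())                             # skewness of each column
print(df.kurt())                             # kurtosis of each column

# 95% confidence interval for the mean of each column
for col in df.columns:
    values = df[col].dropna()
    mean = values.mean()
    sem = st.sem(values)
    low, high = st.t.interval(0.95, len(values) - 1, loc=mean, scale=sem)
    print(col, "mean:", round(mean, 2), "95% CI:", (round(low, 2), round(high, 2)))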

2. Apply pivot table of Excel to perform data analysis.
Data analysis on a large set of data is quite often necessary and important. It involves
summarizing the data, obtaining the needed values and presenting the results.
Excel provides PivotTable to enable you to summarize thousands of data values easily and
quickly and obtain the required results.
Consider the following table of sales data. From this data, you might have to summarize
total sales region wise, month wise, or salesperson wise. The easy way to handle these tasks is
to create a PivotTable that you can dynamically modify to summarize the results the way you
want.

Creating PivotTable
To create PivotTables, ensure the first row has headers.
 Click the table.
 Click the INSERT tab on the Ribbon.
 Click PivotTable in the Tables group. The PivotTable dialog box appears.

As you can see in the dialog box, you can use either a Table or Range from the current
workbook or use an external data source.
 In the Table / Range Box, type the table name.
 Click New Worksheet to tell Excel where to keep the PivotTable.
 Click OK.
A Blank PivotTable and a PivotTable fields list appear.

Recommended PivotTables
In case you are new to PivotTables or you do not know which fields to select from the data, you
can use the Recommended PivotTables that Excel provides.
 Click the data table.
 Click the INSERT tab.
 Click on Recommended PivotTables in the Tables group. The Recommended
PivotTables dialog box appears.

In the recommended PivotTables dialog box, the possible customized PivotTables that suit your
data are displayed.
 Click each of the PivotTable options to see the preview on the right side.
 Click the PivotTable Sum of Order Amount by Salesperson and month.

Click OK. The selected PivotTable appears on a new worksheet. You can observe the
PivotTable fields that were selected in the PivotTable Fields list.

PivotTable Fields
The headers in your data table will appear as the fields in the PivotTable.

You can select / deselect them to instantly change your PivotTable to display only the
information you want and in a way that you want. For example, if you want to display the
account information instead of order amount information, deselect Order Amount and select
Account.

PivotTable Areas
You can even change the Layout of your PivotTable instantly. You can use the PivotTable
Areas to accomplish this.

In PivotTable areas, you can choose −
 What fields to display as rows
 What fields to display as columns
 How to summarize your data
 Filters for any of the fields
 When to update your PivotTable Layout
o You can update it instantly as you drag the fields across areas, or
o You can defer the update and get it updated only when you click on UPDATE

An instant update helps you to play around with the different Layouts and pick the one that
suits your report requirement. You can just drag the fields across these areas and observe the
PivotTable layout as you do it.

Nesting in the PivotTable


If you have more than one field in any of the areas, then nesting happens in the order you place
the fields in that area. You can change the order by dragging the fields and observe how nesting
changes. In the above layout options, you can observe that
 Months are in columns.
 Region and salesperson in rows in that order. i.e. salesperson values are nested under
region values.
 Summarizing is by Sum of Order Amount.
 No filters are chosen.
The resulting PivotTable is as follows −

In the PivotTable Areas, in rows, click region and drag it below salesperson such that it looks
as follows −

The nesting order changes and the resulting PivotTable is as follows −

Note − You can clearly observe that the layout with the nesting order – Region and then
Salesperson – yields a better and more compact report than the one with the nesting order –
Salesperson and then Region. However, if a Salesperson covers more than one region and you need
to summarize the sales by Salesperson, then the second layout would be the better option.
Filters
You can assign a Filter to one of the fields so that you can dynamically change the PivotTable
based on the values of that field.
Drag Region from Rows to Filters in the PivotTable Areas.

The filter with the label Region appears above the PivotTable (if you do not have
empty rows above your PivotTable, the PivotTable is pushed down to make space for the filter).

You can see that −


 Salesperson values appear in rows.
 Month values appear in columns.
 Region Filter appears on the top with default selected as ALL.
 Summarizing value is Sum of Order Amount
o Sum of Order Amount Salesperson-wise appears in the column Grand Total
o Sum of Order Amount Month-wise appears in the row Grand Total
Click the arrow in the box to the right of the filter region. A drop-down list with the values of
the field region appears.

 Check the option Select Multiple Items. Check boxes appear for all the values.
 Select South and West and deselect the other values and click OK.

The data pertaining to South and West Regions only will be summarized as shown in the screen
shot given below −

You can see that next to the Filter Region, Multiple Items is displayed, indicating that you
have selected more than one item. However, how many items and / or which items are selected
is not known from the report that is displayed. In such a case, using Slicers is a better option for
filtering.

Slicers
You can use Slicers to have a better clarity on which items the data was filtered.

 Click ANALYZE under PIVOTTABLE TOOLS on the Ribbon.
 Click Insert Slicer in the Filter group. The Insert Slicers box appears. It contains all the
fields from your data.
 Select the fields Region and month. Click OK.

Slicers for each of the selected fields appear with all the values selected by default. Slicer Tools
appear on the Ribbon to work on the Slicer settings, look and feel.

 Select South and West in the Slicer for Region.


 Select February and March in the Slicer for month.
 Keep Ctrl key pressed while selecting multiple values in a Slicer.
Selected items in the Slicers are highlighted. PivotTable with summarized values for the
selected items will be displayed.

Summarizing Values by other Calculations
In the examples so far, you have seen summarizing values by Sum. However, you can use other
calculations also if necessary.
In the PivotTable Fields List
 Select the Field Account.
 Unselect the Field Order Amount.

 Drag the field Account to Summarizing Values area. By default, Sum of Account will be
displayed.
 Click the arrow on the right side of the box.
 In the drop-down that appears, click Value Field Settings.

The Value Field Settings box appears. Several types of calculations appear as a list under
Summarize value field by −
 Select Count in the list.
 The Custom Name automatically changes to Count of Account. Click OK.
The PivotTable summarizes the Account values by Count.
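The same kind of summarization can be reproduced in Python with pandas.pivot_table(). The sketch below is only illustrative: it assumes the sales table has been exported to a file such as sales.csv with columns named Region, Salesperson, Month, Order Amount and Account (the file name and exact column labels are assumptions).

import pandas as pd

df = pd.read_csv("sales.csv")    # assumed export of the sales data

# Sum of Order Amount with Region and Salesperson nested in rows and Month in columns
pivot_sum = pd.pivot_table(df, values="Order Amount",
                           index=["Region", "Salesperson"],
                           columns="Month", aggfunc="sum")
print(pivot_sum)

# Count of Account, analogous to changing Value Field Settings to Count
pivot_count = pd.pivot_table(df, values="Account",
                             index=["Region", "Salesperson"],
                             columns="Month", aggfunc="count")
print(pivot_count)

# Restricting to the South and West regions, similar to the Region filter / slicer
south_west = df[df["Region"].isin(["South", "West"])]
print(pd.pivot_table(south_west, values="Order Amount",
                     index="Salesperson", columns="Month", aggfunc="sum"))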

PivotTable Tools
Follow the steps given below to learn to use the PivotTable Tools.
 Select the PivotTable.
The following PivotTable Tools appear on the Ribbon −
 ANALYZE
 DESIGN

ANALYZE
Some of the ANALYZE Ribbon commands are −
 Set PivotTable Options
 Value Field Settings for the selected Field
 Expand Field
 Collapse Field
 Insert Slicer
 Insert Timeline
 Refresh Data
 Change Data Source
 Move PivotTable
 Solve Order (If there are more calculations)
 PivotChart

DESIGN
Some of the DESIGN Ribbon commands are −
 PivotTable Layout
o Options for Sub Totals
o Options for Grand Totals
o Report Layout Forms
o Options for Blank Rows
 PivotTable Style Options
 PivotTable Styles
Expanding and Collapsing Field
You can either expand or collapse all items of a selected field in two ways −
 By selecting the expand (+) or collapse (−) symbol to the left of the selected field.
 By clicking the Expand Field or Collapse Field on the ANALYZE Ribbon.
By selecting the Expand symbol or Collapse symbol to the left of the selected field
 Select the cell containing East in the PivotTable.
 Click on the Collapse symbol to the left of East.

All the items under East will be collapsed. The Collapse symbol (−) to the left of East changes to
the Expand symbol (+).

You can observe that only the items below East are collapsed. The rest of the PivotTable items
are as they are.
Click the Expand symbol to the left of East. All the items below East will be displayed.
Using ANALYZE on the Ribbon
You can collapse or expand all items in the PivotTable at once with the Expand Field and
Collapse Field commands on the Ribbon.
 Click the cell containing East in the PivotTable.
 Click the ANALYZE tab on the Ribbon.
 Click Collapse Field in the Active Field group.

All the items of the field East in the PivotTable will collapse.

Click Expand Field in the Active Field group.

All the items will be displayed.

Report Presentation Styles


You can choose the presentation style for your PivotTable, since you will typically include it in a
report. Select a style that fits the rest of your presentation or report. However, do not get
carried away with styling: a report that clearly shows the results is always
better than a colorful one that does not highlight the important data points.
 Click East in the PivotTable.
 Click ANALYZE.
 Click Field Settings in Active Field group. The Field Settings dialog box appears.
 Click the Layout & Print tab.
 Check Insert blank line after each item label.

Blank rows will be displayed after each value of the Region field.
You can insert blank rows from the DESIGN tab also.

 Click the DESIGN tab.


 Click Report Layout in Layout group.
 Select Show in Outline Form in the drop-down list.

 Hover the mouse over the PivotTable Styles. A preview of the style on which the mouse
is placed will appear.
 Select the Style that suits your report.
PivotTable in Outline Form with the selected Style will be displayed.

Timeline in PivotTables
To understand how to use a Timeline, consider the following example wherein the sales data of
various items is given salesperson-wise and location-wise. There are a total of 1891 rows of data.
Create a PivotTable from this Range with −
 Location and Salesperson in Rows in that order
 Product in Columns
 Sum of Amount in Summarizing values

 Click the PivotTable.


 Click INSERT tab.
 Click Timeline in Filters group. The Insert Timelines appears.

Click Date and click OK. The Timeline dialog box appears and the Timeline Tools appear on
the Ribbon.

 In Timeline dialog box, select MONTHS.


 From the drop-down list select QUARTERS.
 Click 2014 Q2.
 Keep the Shift key pressed and drag to 2014 Q4.
Timeline is selected to Q2 – Q4 2014.
PivotTable is filtered to this Timeline.

3. Perform the following operations using Numpy
i) Basic Operations on NumPy
ii) Computations on numpy’s Arrays

NumPy is, just like SciPy, Scikit-Learn, Pandas, etc. one of the packages that you just
can’t miss when you’re learning data science, mainly because this library provides you with an
array data structure that holds some benefits over Python lists, such as: being more compact,
faster access in reading and writing items, being more convenient and more efficient.

(i) Basic Operations on Numpy:


1. Converting a list to n-dimensional NumPy array
numpy_array = np.array(list_to_convert)

2. Use of np.newaxis and np.reshape


np.newaxis is used to create new dimensions of size 1. For eg
a = [1,2,3,4,5] is a list
a_numpy = np.array(a)
If you print a_numpy.shape, you get (5,). In order to make this a column vector or a row vector,
one could do
col_vector = a_numpy[:,np.newaxis] ####shape is (5,1) now
row_vector = a_numpy[np.newaxis,:] ####shape is (1,5) now
Similarly, np.reshape can be used to reshape any array. For eg:
a = np.arange(0, 15) ####array of numbers from 0 to 14

b = a.reshape(3,5)

b would become:
[[ 0,  1,  2,  3,  4],
 [ 5,  6,  7,  8,  9],
 [10, 11, 12, 13, 14]]

3. Converting any data type to NumPy array
Use np.asarray. For eg
a = [(1,2), [3,4,(5,6)], (6,7,8)]
b = np.asarray(a, dtype=object)
b:
array([(1, 2), list([3, 4, (5, 6)]), (6, 7, 8)], dtype=object)

4. Get an n-dimensional array of zeros.


a = np.zeros(shape, dtype=type_of_zeros)
The dtype can be int or float, as required.
eg.
a = np.zeros((3,4), dtype = np.float16)

5. Get an n-dimensional array of ones.


Similar to np.zeros:
a = np.ones((3,4), dtype=np.int32)

6. np.full and np.empty


np.full is used to get an array filled with one specific value, while np.empty creates an
array without initializing its entries (so it contains whatever arbitrary values happen to be in memory). For eg.

1. np.full(shape_as_tuple, value_to_fill, dtype=type_you_want)
a = np.full((2,3), 1, dtype=np.float16)
a would be:
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float16)

2. np.empty(shape_as_tuple, dtype=int)
a = np.empty((2,2), dtype=np.int16)
a would be:
array([[25824, 25701],
       [ 2606,  8224]], dtype=int16)
The integers here are arbitrary.

7. Getting an array of evenly spaced values with np.arange and np.linspace


Both can be used to create an array with evenly spaced elements.
linspace:

np.linspace(start, stop, num=50, endpoint=bool_value, retstep=bool_value)
endpoint specifies whether you want the stop value to be included, and retstep tells whether you
would like to know the step value. 'num' is the number of values to be returned, where 50 is the default.
Eg, np.linspace(1, 2, num=5, endpoint=False, retstep=True)
This means: return 5 values starting at 1 and ending before 2, and also return the step size. The output would be:
(array([1. , 1.2, 1.4, 1.6, 1.8]), 0.2) ##### tuple of NumPy array and step size

arange:
np.arange(start=where_to_start,stop=where_to_stop,step=step_size)
If only one number is provided as an argument, it’s treated to be a stop and if 2 are provided,
they are assumed to be the start and the stop. Notice the spelling here.

8. Finding the shape of the NumPy array


array.shape

9. Knowing the dimensions of the NumPy array


x = np.array([1,2,3])
x.ndim will produce 1

10. Finding the number of elements in the NumPy array


x = np.ones((3,2,4),dtype=np.int16)
x.size will produce 24

11. Get the memory space occupied by an n-dimensional array


x.nbytes
Output will be 24 * (memory occupied by a 16-bit integer) = 24 * 2 = 48

12. Finding the data type of elements in the NumPy array


x = np.ones((2,3), dtype=np.int16)
x.dtype will produce dtype('int16')
It works better when elements in the array are of one type otherwise typecasting happens and result
may be difficult to interpret.

13. How to create a copy of NumPy array
Use np.copy
y = np.array([[1,3],[5,6]])
x = np.copy(y)
If,
x[0][0] = 1000
Then,

x is:                 y is:
[[1000, 3],           [[1, 3],
 [   5, 6]]            [5, 6]]

14. Get transpose of an n-d array


Use array_name.T
x = np.array([[1,2],[3,4]])

x is:          x.T is:
[[1, 2],       [[1, 3],
 [3, 4]]        [2, 4]]

15. Flatten an n-d array to get a one-dimensional array


Use np.reshape and np.ravel:
np.reshape: This is really a nice and sweet trick. While reshaping if you provide -1 as one of the
dimensions, it’s inferred from the no. of elements. For eg. for an array of size (1,3,4) if it’s
reshaped to (-1,2,2), then the first dimension’s length is calculated to be 3 . So,

If x is:             Then x.reshape(-1) produces:
[[1, 2, 3],          array([1, 2, 3, 4, 5, 9])
 [4, 5, 9]]

np.ravel
x = np.array([[1, 2, 3], [4, 5, 6]])
x.ravel() produces
array([1, 2, 3, 4, 5, 6])

16. Change axes of an n-d array or swap dimensions


Use np.moveaxis and np.swapaxes.
x = np.ones((3,4,5))
np.moveaxis(x,axes_to_move_as_list, destination_axes_as_list)

The conversion is not in place so don’t forget to store it in another variable.

np.swapaxes:
x = np.array([[1,2],[3,4]])
x.shape is (2,2), and

x is:          np.swapaxes(x,0,1) will produce:
[[1, 2],       [[1, 3],
 [3, 4]]        [2, 4]]

17. Convert NumPy array to list


x = np.array([[3,4,5,9],[2,6,8,0]])
y = x.tolist()
y will be
[[3, 4, 5, 9], [2, 6, 8, 0]]
The NumPy docs mention that using list(x) will also work if x is a 1-d array.

18. Change the data type of elements in the NumPy array.


Use ndarray.astype
x = np.array([0,1,2.0,3.0,4.2], dtype=np.float32)
x.astype(np.int16) will produce
array([0, 1, 2, 3, 4], dtype=int16)

x.astype(bool) will produce
array([False, True, True, True, True])

19. Get indices of non-zero elements


Use n-dim_array.nonzero()
x = np.array([0,1,2.0,3.0,4.2], dtype=np.float32)
x.nonzero() will produce
(array([1, 2, 3, 4]),)

It's important to note that x has shape (5,), so only the first-axis indices are returned. If x were, say,
x = np.array([[0,1],[3,5]])

x.nonzero() would produce (array([0, 1, 1]), array([1, 0, 1])). So, the indices are actually (0,1), (1,0), (1,1).

20. Sort NumPy array


Use np.ndarray.sort(axis=axis_you_want_to_sort_by)
x = np.array([[4,3],[3,2]])
x is:
[[4, 3],
 [3, 2]]
x.sort(axis=1)   # sort each row
[[3, 4],
 [2, 3]]
x.sort(axis=0)   # sort each column (applied to the original array)
[[3, 2],
 [4, 3]]

21. Compare NumPy arrays to values


Comparing will produce NumPy n-dimensional arrays of boolean type. For eg
x = np.array([[0,1],[2,3]])
x == 1 will produce
array([[False,  True],
       [False, False]])

If you want to count the number of ones in x, you could just do
(x==1).astype(np.int16).sum()

It should output 1

22. Multiply two NumPy matrices


Use numpy.matmul to take the matrix product of 2-D matrices:

a = np.eye(2)   # identity matrix of size 2

a is:
[[1., 0.],
 [0., 1.]]

b = np.array([[1,2],[3,4]])
b is:
[[1, 2],
 [3, 4]]

np.matmul(a,b) will give:
[[1., 2.],
 [3., 4.]]

If we supply a 1-D array, then the output can be very different as broadcasting will be used. We
discuss that below. Also, there is another function called np.multiply which performs element-to-
element multiplication. For the previous two matrices, the output of np.multiply(a,b) would be:
[[1., 0.],
 [0., 4.]]

23. Dot product of two arrays


np.dot(matrix1, matrix2)
a = np.array([[1,2,3],[4,8,16]])

a is:
[[ 1,  2,  3],
 [ 4,  8, 16]]

b = np.array([5,6,11]).reshape(-1,1)
b is:
[[ 5],
 [ 6],
 [11]]

np.dot(a,b) produces:
[[ 50],
 [244]]
Just like any dot product of a matrix with a column vector would produce.

The dot product of a row vector with a column vector will produce:
if a is array([[1, 2, 3, 4]])
and b is:
array([[4],
       [5],
       [6],
       [7]])
then np.dot(a,b) gives: array([[60]])
a's shape was (1,4) and b's shape was (4,1), so the result has shape (1,1).

24. Get cross-product of two numpy vectors


Recall the vector cross-product from physics. It's the direction of the torque taken about a point.
x = [1,2,3]
y = [4,5,6]
z = np.cross(x, y)
z is:
array([-3,  6, -3])

25. Getting gradient of an array


Use np.gradient. NumPy calculates the gradient using central differences for interior points and one-sided differences at the edges.
x = np.array([5, 10, 14, 17, 19, 26], dtype=np.float16)
np.gradient(x) will be:
array([5. , 4.5, 3.5, 2.5, 4.5, 7. ], dtype=float16)

26. How to slice NumPy array?


For a single element:
x[r][c] where r, c are the row and column numbers of the element.

For slicing more than one element, if x is:
[[2, 4, 9],
 [3, 1, 5],
 [7, 8, 0]]
and you want 2, 4 and 7, 8, then do x[list_of_rows, list_of_cols], which would be
x[[0,0,2,2], [0,1,0,1]] and produces array([2, 4, 7, 8])

If one of the rows or cols ranges is contiguous, it's easier to do:
x[[0,2], 0:2] produces
array([[2, 4],
       [7, 8]])

ii) Computations on numpy’s Arrays


NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays.

1. Arrays in NumPy: NumPy’s main object is the homogeneous multidimensional array.


 It is a table of elements (usually numbers), all of the same type, indexed by a tuple of
positive integers.
 In NumPy dimensions are called axes. The number of axes is rank.
 NumPy’s array class is called ndarray. It is also known by the alias array.

Example :
[[ 1, 2, 3],
[ 4, 2, 5]]

Here,
rank = 2 (as it is 2-dimensional or it has 2 axes)
first dimension(axis) length = 2, second dimension has length = 3
overall shape can be expressed as: (2, 3)

2. Array creation: There are various ways to create arrays in NumPy.


 For example, you can create an array from a regular Python list or tuple using
the array function. The type of the resulting array is deduced from the type of the elements in
the sequences.
 Often, the elements of an array are originally unknown, but its size is known. Hence,
NumPy offers several functions to create arrays with initial placeholder content. These
minimize the necessity of growing arrays, an expensive operation.
For example: np.zeros, np.ones, np.full, np.empty, etc.
 To create sequences of numbers, NumPy provides a function analogous to range that
returns arrays instead of lists.
 arange: returns evenly spaced values within a given interval. step size is specified.
 linspace: returns evenly spaced values within a given interval. num no. of elements
are returned.
 Reshaping array: We can use reshape method to reshape an array. Consider an array
with shape (a1, a2, a3, …, aN). We can reshape and convert it into another array with shape
(b1, b2, b3, …, bM). The only required condition is:
a1 x a2 x a3 … x aN = b1 x b2 x b3 … x bM . (i.e original size of array remains unchanged.)
 Flatten array: We can use flatten method to get a copy of array collapsed into one
dimension. It accepts order argument. Default value is ‘C’ (for row-major order). Use ‘F’ for
column major order.

Note: Type of array can be explicitly defined while creating array.
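The creation routines listed above can be tried together in a short, self-contained snippet (the values below are arbitrary and only for illustration):

import numpy as np

a = np.array([[1, 2, 4], [5, 8, 7]], dtype='float')  # from a nested list, with explicit dtype
b = np.zeros((3, 4))                                 # 3x4 array of zeros
c = np.full((3, 3), 6, dtype='complex')              # 3x3 array filled with the constant 6
d = np.arange(0, 30, 5)                              # [0 5 10 15 20 25], step size specified
e = np.linspace(0, 5, 10)                            # 10 evenly spaced values between 0 and 5
f = np.arange(8).reshape(2, 2, 2)                    # reshape: 8 elements -> shape (2, 2, 2)
g = f.flatten()                                      # copy of f collapsed into one dimension
print(a, b, c, d, e, f, g, sep="\n")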

3. Array Indexing: Knowing the basics of array indexing is important for analysing and
manipulating the array object. NumPy offers many ways to do array indexing.
 Slicing: Just like lists in python, NumPy arrays can be sliced. As arrays can be
multidimensional, you need to specify a slice for each dimension of the array.
 Integer array indexing: In this method, lists are passed for indexing for each
dimension. One to one mapping of corresponding elements is done to construct a new
arbitrary array.
 Boolean array indexing: This method is used when we want to pick elements from
array which satisfy some condition.
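A short sketch illustrating the three indexing styles described above (the array values are arbitrary):

import numpy as np

arr = np.array([[-1,   2,  0, 4],
                [ 4, -0.5, 6, 0],
                [2.6,  0,  7, 8],
                [ 3,  -7,  4, 2]])

# Slicing: first 2 rows and alternate columns 0 and 2
print(arr[:2, ::2])                      # [[-1.  0.] [ 4.  6.]]

# Integer array indexing: elements at (0,3), (1,2), (2,1), (3,0)
print(arr[[0, 1, 2, 3], [3, 2, 1, 0]])   # [4. 6. 0. 3.]

# Boolean array indexing: elements greater than 0
print(arr[arr > 0])                      # [2.  4.  4.  6.  2.6 7.  8.  3.  4.  2.]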

4. Basic operations: Plethora of built-in arithmetic functions are provided in NumPy.


 Operations on single array: We can use overloaded arithmetic operators to do
element-wise operation on array to create a new array. In case of +=, -=, *= operators, the
existing array is modified.
# Python program to demonstrate
# basic operations on single array
import numpy as np

a = np.array([1, 2, 5, 3])

# add 1 to every element
print ("Adding 1 to every element:", a+1)

# subtract 3 from each element


print ("Subtracting 3 from each element:", a-3)

# multiply each element by 10


print ("Multiplying each element by 10:", a*10)
# square each element
print ("Squaring each element:", a**2)

# modify existing array


a *= 2
print ("Doubled each element of original array:", a)

# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])

print ("\nOriginal array:\n", a)


print ("Transpose of array:\n", a.T)

Output :
Adding 1 to every element: [2 3 6 4]
Subtracting 3 from each element: [-2 -1 2 0]
Multiplying each element by 10: [10 20 50 30]
Squaring each element: [ 1 4 25 9]
Doubled each element of original array: [ 2 4 10 6]

Original array:
[[1 2 3]
[3 4 5]
[9 6 0]]
Transpose of array:

[[1 3 9]
[2 4 6]
[3 5 0]]

4. Write a program to find patterns in the given data using regular expressions by taking
the data from text file.
A Regular Expressions (RegEx) is a special sequence of characters that uses a search
pattern to find a string or set of strings. It can detect the presence or absence of a text by
matching with a particular pattern, and also can split a pattern into one or more sub-patterns.

Python provides a re module that supports the use of regex in Python. Its primary
function is to offer a search, where it takes a regular expression and a string. Here, it either
returns the first match or else none.

Before starting with the Python regex module let’s see how to actually write regex
using metacharacters or special sequences.

MetaCharacters
To understand regular expressions, metacharacters are useful and important; they are used in
the functions of the re module. Below is the list of metacharacters.

MetaCharacters Description

\ Used to drop the special meaning of character following it

[] Represent a character class


^ Matches the beginning

$ Matches the end

. Matches any character except newline

| Means OR (matches any of the characters separated by it)

? Matches zero or one occurrence

* Any number of occurrences (including 0 occurrences)

+ One or more occurrences

{} Indicate the number of occurrences of a preceding regex to match.

() Enclose a group of Regex

Special Sequences
Special sequences do not match for the actual character in the string instead it tells the
specific location in the search string where the match must occur. It makes it easier to write
commonly used patterns.
List of special sequences

Special Sequence    Description and Examples

\A     Matches if the string begins with the given character.
       Example: \Afor matches "for geeks", "for the world".

\b     Matches if the word begins or ends with the given character. \b(string) will check for the
       beginning of the word and (string)\b will check for the ending of the word.
       Example: \bge matches "geeks", "get".

\B     The opposite of \b, i.e. the string should not start or end with the given regex.
       Example: \Bge matches "together", "forge".

\d     Matches any decimal digit; this is equivalent to the set class [0-9].
       Example: \d matches "123", "gee1".

\D     Matches any non-digit character; this is equivalent to the set class [^0-9].
       Example: \D matches "geeks", "geek1".

\s     Matches any whitespace character.
       Example: \s matches "gee ks", "a bc a".

\S     Matches any non-whitespace character.
       Example: \S matches "a bd", "abcd".

\w     Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
       Example: \w matches "123", "geeKs4".

\W     Matches any non-alphanumeric character.
       Example: \W matches ">$", "gee<>".

\Z     Matches if the string ends with the given regex.
       Example: ab\Z matches "abcdab", "abababab".
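The metacharacters and special sequences above can be tried out quickly with re.findall() and re.search(); the sample string below is chosen only for illustration.

import re

text = "geeks for geeks 123 get together"

print(re.findall(r'\d', text))        # ['1', '2', '3']: every decimal digit (\d)
print(re.findall(r'\bge\w*', text))   # ['geeks', 'geeks', 'get']: words beginning with 'ge' (\b, \w)
print(re.findall(r'ge+', text))       # 'g' followed by one or more 'e' (+)
print(re.search(r'^geeks', text))     # ^ anchors the match to the beginning of the string
print(re.search(r'together$', text))  # $ anchors the match to the end of the string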

Regex Module in Python


Python has a module named re that is used for regular expressions in Python. We can import
this module by using the import statement.
Example: Importing re module in Python
import re
Let’s see various functions provided by this module to work with regex in Python.

re.findall()
Return all non-overlapping matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found.

Example: Finding all occurrences of a pattern


# A Python program to demonstrate working of
# findall()
import re

# A sample text string where regular expression


# is searched.
string = """Hello my Number is 123456789 and
my friend's number is 987654321"""

# A sample regular expression to find digits.


regex = r'\d+'

match = re.findall(regex, string)


print(match)

Output
['123456789', '987654321']

Match Object
A Match object contains all the information about the search and the result and if there is no
match found then None will be returned. Let’s see some of the commonly used methods and
attributes of the match object.
Getting the string and the regex
match.re attribute returns the regular expression passed and match.string attribute returns the
string passed.
Example: Getting the string and the regex of the matched object
import re
s = "Welcome to GeeksForGeeks"
# here x is the match object
res = re.search(r"\bG", s)
print(res.re)
print(res.string)
Output
re.compile('\\bG')
Welcome to GeeksForGeeks

Getting index of matched object
 start() method returns the starting index of the matched substring
 end() method returns the ending index of the matched substring
 span() method returns a tuple containing the starting and the ending index of the
matched substring
Example: Getting index of matched object
import re

s = "Welcome to GeeksForGeeks"

# here x is the match object

res = re.search(r"\bGee", s)

print(res.start())
print(res.end())
print(res.span())

Output
11
14
(11, 14)

Getting matched substring


group() method returns the part of the string for which the patterns match. See the below
example for a better understanding.
Example: Getting matched substring
import re

s = "Welcome to GeeksForGeeks"
# here x is the match object
res = re.search(r"\D{2} t", s)
print(res.group())
Output
me t
In the above example, the pattern matches exactly two non-digit characters followed by a space
and then the letter t.
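Finally, to match the experiment statement (finding patterns in data taken from a text file), the following sketch reads a file and extracts a few common patterns with re.findall(). The file name data.txt and the exact patterns are assumptions made for illustration.

import re

# data.txt is an assumed input file containing free-form text
with open("data.txt", "r") as f:
    text = f.read()

phone_numbers = re.findall(r'\d{10}', text)            # ten-digit numbers
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)  # simple e-mail address pattern
dates = re.findall(r'\d{2}-\d{2}-\d{4}', text)         # dates in dd-mm-yyyy form

print("Phone numbers:", phone_numbers)
print("Email ids:", emails)
print("Dates:", dates)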

5. Perform the following operation on dataframes using Pandas


i) Basic Operation on DataFrames using Pandas
ii) Hierarchical Indexing
iii) Combining and Merging Datasets,
iv) Merging on Index,
v) Concatenate, Combining with overlap,
vi) Reshaping,
vii) Pivoting.
viii) Vectorized String Operations

Pandas is an open-source library that is made mainly for working with relational or labeled
data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy library.
Pandas is fast and it has high performance & productivity for users.

Getting Started
After the pandas have been installed into the system, you need to import the library. This
module is generally imported as:
import pandas as pd

Here, pd is referred to as an alias for Pandas. However, it is not necessary to import the
library using the alias; it just helps in writing less code every time a method or
property is called.

Pandas generally provides two data structures for manipulating data. They are:
 Series
 DataFrame
Series:
Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called indexes. Pandas
Series is nothing but a column in an excel sheet. Labels need not be unique but must be a
hashable type. The object supports both integer and label-based indexing and provides a host
of methods for performing operations involving the index.

Note: For more information, refer to Python | Pandas Series


Creating a Series
In the real world, a Pandas Series will be created by loading the datasets from existing
storage, storage can be SQL Database, CSV file, an Excel file. Pandas Series can be created
from the lists, dictionary, and from a scalar value etc.
Example:
import pandas as pd
import numpy as np

# Creating empty series
ser = pd.Series()
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)

Output:
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object
Note: For more information, refer to Creating a Pandas Series

Basic Operations:
These are the basic operations that we can perform on a dataset after we have loaded it into our
dataframe object.

Find Last and First rows of the DataFrame:


To access the first and last few rows of the DataFrame, we use the .head() and .tail() functions. If
used without any parameters, these functions return the first 5 or the last 5 rows
respectively. But if we pass an integer as a parameter, then that number of rows is shown. For example,
# using dictionary to create a dataframe
data = {'Fruit': ['Apple','Banana','Orange','Mango'], 'Weight':
[200,150,300,250], 'Price':[90,40,100,50]}
studyTonight_df = pd.DataFrame(data)
# using the head function to get first two entries

studyTonight_df.head(2)
# using the tail function to get last two entries
studyTonight_df.tail(2)

Output:

Accessing Columns in a DataFrame:


We can access the individual columns which make up the data frame. For doing that we use
square brackets just like we do in case of array and specify the name of the column in the square
brackets. For example, if we have to get the values stored in the column Weight in the above
dataframe, we can do so using the following code:
studyTonight_df['Weight']
This will give us the output:

Another way to access columns is by calling the column name as an attribute, as shown
below:
studyTonight_df.Fruit

Accessing Rows in a DataFrame:
Using the .loc[] indexer we can access a row by the index label that is passed in as a parameter,
for example:
studyTonight_df.loc[2]
Output:

Various Assignments and Operations on a DataFrame:


To demonstrate the role of NaN in our DataFrame, we will be adding a column that has no
values in our data frame. To do this, we will be using the columns parameter in
the DataFrame() function, and pass a list of column names.
data = {'Fruit': ['Apple','Banana','Orange','Mango'], 'Weight':
[200,150,300,250], 'Price':[90,40,100,50]}

studyTonight_df2 = pd.DataFrame(data,
columns=['Fruit','Weight','Price','Kind'])

print(studyTonight_df2)
The column we just added, called Kind, didn't exist in our data frame before. Thus there are no
values corresponding to this. Therefore our dataframe reads this as a missing value and places
a NaN under the Kind column. Below is the output for the above code:

If we want to assign something to this column, we can attempt to assign a constant value for all
the rows. To do this, just select the column as shown below, and make it equal to some constant
value.
studyTonight_df2['Kind'] = 'Round'
print(studyTonight_df2)
As we can see in our output below, all the values corresponding to the column Kind have been
changed to the value Round.

A series can be mapped onto a dataframe column. This further proves the point that a
DataFrame is a combination of multiple Series.
st_ser = pd.Series(["Round", "Long", "Round", "Oval-ish"])
Let's map this series with our column Kind:
studyTonight_df2['Kind'] = st_ser
print(studyTonight_df2)

For this we will get the following output:

Hierarchical Indexing:
The index is like an address: it is how any data point across the data frame or series can be
accessed. Rows and columns both have indexes; row indices are collectively called the index,
and for columns the index is the set of column names.

Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more than one column as
the index. In this example, we are going to use the homelessness.csv file.
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
Output:

In the above data frame, there is no indexing.

Columns in the Dataframe:


# using the pandas columns attribute.
col = df.columns
print(col)

Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')

To make a column an index, we use the set_index() function of pandas. If we want
to make one column an index, we can simply pass the name of the column as a string to
set_index(). If we want to do multi-indexing or hierarchical indexing, we pass the list of
column names to set_index().

Below Code demonstrates Hierarchical Indexing in pandas:


# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])
# we can sort the data by using sort_index()
df_ind3.sort_index()
print(df_ind3.head(10))

Output:

Now the dataframe is using Hierarchical Indexing or multi-indexing.

Combining and Merging Datasets


Pandas provides a single function, merge, as the entry point for all standard database join
operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None,
right_on=None,
left_index=False, right_index=False, sort=True)
 Here, we have used the following parameters −
 left − A DataFrame object.
 right − Another DataFrame object.
 on − Columns (names) to join on. Must be found in both the left and right DataFrame
objects.
 left_on − Columns from the left DataFrame to use as keys. Can either be column names
or arrays with length equal to the length of the DataFrame.
 right_on − Columns from the right DataFrame to use as keys. Can either be column
names or arrays with length equal to the length of the DataFrame.
 left_index − If True, use the index (row labels) from the left DataFrame as its join
key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels
must match the number of join keys from the right DataFrame.
 right_index − Same usage as left_index for the right DataFrame.
 how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner.
 sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to
True, setting to False will improve the performance substantially in many cases.

Let us now create two different DataFrames and perform the merging operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)

Its output is as follows −


Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5

Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
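The listing above only prints the two DataFrames; a minimal sketch of the actual join calls (continuing with left and right as defined above) is shown below. The last call also merges on the index via left_index/right_index, which covers the Merging on Index case from the experiment list.

# merge on a common column
print(pd.merge(left, right, on='id'))                       # inner join on id
print(pd.merge(left, right, on='subject_id', how='left'))   # left join on subject_id
print(pd.merge(left, right, on='subject_id', how='outer'))  # outer join keeps all keys

# merging on index: use the row labels of both frames as the join keys
left_i = left.set_index('id')
right_i = right.set_index('id')
print(pd.merge(left_i, right_i, left_index=True, right_index=True,
               suffixes=('_left', '_right')))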

Pandas concat(): Combining Data Across Rows or Columns


Concatenation is a bit different from the merging techniques you saw above. With
merging, you can expect the resulting dataset to have rows from the parent datasets mixed in
together, often based on some commonality. Depending on the type of merge, you might also
lose rows that don’t have matches in the other dataset.
With concatenation, your datasets are just stitched together along an axis — either
the row axis or column axis. Visually, a concatenation with no parameters along rows would
look like this:

To implement this in code, you’ll use concat() and pass it a list of DataFrames that you
want to concatenate. Code for this task would look like this:
concatenated = pandas.concat([df1, df2])

Note: This example assumes that your column names are the same. If your column
names are different while concatenating along rows (axis 0), then by default the columns will
also be added, and NaN values will be filled in as applicable.

What if instead you wanted to perform a concatenation along columns? First, take a look
at a visual representation of this operation:

To accomplish this, you’ll use a concat() call like you did above, but you also will need to pass
the axis parameter with a value of 1:
concatenated = pandas.concat([df1, df2], axis=1)

Concatenating Objects
The concat function does all of the heavy lifting of performing concatenation operations along
an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))

Its output is as follows −


Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5

Suppose we wanted to associate specific keys with each of the pieces of the chopped up
DataFrame. We can do this by using the keys argument −

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))

Its output is as follows −


x 1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
y 1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5

The index of the resultant is duplicated; each index is repeated.

If the resultant object has to follow its own indexing, set ignore_index to True.

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))

Its output is as follows −

Marks_scored Name subject_id


0 98 Alex sub1
1 90 Amy sub2
2 87 Allen sub4
3 69 Alice sub6
4 78 Ayoung sub5
5 89 Billy sub2
6 80 Brian sub4
7 79 Bran sub3
8 97 Bryce sub6
9 88 Betty sub5

Observe, the index changes completely and the Keys are also overridden.

If two objects need to be added along axis=1, then the new columns will be appended.

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))

Its output is as follows −


   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5

Concatenating Using append
A useful shortcut to concat is the append instance method on Series and DataFrame. These
methods actually predated concat; they concatenate along axis=0, namely the index. (In recent
pandas versions append has been removed, so pd.concat should be used instead.)
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one.append(two))

Its output is as follows −


Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
The append function can take multiple objects as well −
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(one.append([two,one,two]))

Its output is as follows −


Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5

Time Series
Pandas provides a robust tool for working with time series data, especially in the financial
sector. While working with time series data, we frequently come across the following −
 Generating sequence of time
 Convert the time series to different frequencies
Pandas provides a relatively compact and self-contained set of tools for performing the above
tasks.
Get Current Time
pd.Timestamp.now() gives you the current date and time.

import pandas as pd
print(pd.Timestamp.now())
Its output is as follows −
2017-05-11 06:10:13.393147

Create a TimeStamp
Time-stamped data is the most basic type of timeseries data that associates values with points in
time. For pandas objects, it means using the points in time. Let’s take an example −

import pandas as pd
print(pd.Timestamp('2017-03-01'))
Its output is as follows −
2017-03-01 00:00:00
It is also possible to convert integer or float epoch times. The default unit for these is
nanoseconds (since these are how Timestamps are stored). However, often epochs are stored in
another unit which can be specified. Let’s take another example

import pandas as pd
print(pd.Timestamp(1587687255, unit='s'))

Its output is as follows −


2020-04-24 00:14:15

Create a Range of Time

import pandas as pd
print(pd.date_range("11:00", "13:30", freq="30min").time)
Its output is as follows −
[datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0)
datetime.time(12, 30) datetime.time(13, 0) datetime.time(13,
30)]

Change the Frequency of Time


import pandas as pd
print(pd.date_range("11:00", "13:30", freq="H").time)

Its output is as follows −


[datetime.time(11, 0) datetime.time(12, 0) datetime.time(13, 0)]

Converting to Timestamps
To convert a Series or list-like object of date-like objects, for example strings, epochs, or a
mixture, you can use the to_datetime function. When passed, this returns a Series (with the
same index), while a list-like is converted to a DatetimeIndex. Take a look at the following
example −

import pandas as pd
print(pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None])))

Its output is as follows −


0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]

NaT means Not a Time (equivalent to NaN)

Let’s take another example.


import pandas as pd
print(pd.to_datetime(['2005/11/23', '2010.12.31', None]))
Its output is as follows −
DatetimeIndex(['2005-11-23', '2010-12-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
Reshaping
Pandas data reshaping transforms the structure of a table or vector to make it suitable for further
data analysis. Pandas uses multiple methods to reshape the dataframe and series.

Reshaping :
Stack
In [1]:
import numpy as np
import pandas as pd
In [9]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo'],
['one', 'two', 'one', 'two',
'one', 'two']]))
In [10]:
index = pd.MultiIndex.from_tuples(tuples, names=['first',
'second'])
In [11]:
df = pd.DataFrame(np.random.randn(6, 2), index=index,
columns=['M', 'N'])
In [12]: df2 = df[:4]
In [13]: df2
Out[13]:

                     M         N
first second
bar   one    -1.043213 -0.491027
      two    -0.646234 -2.849286
baz   one     0.731973  0.637868
      two    -1.012989  0.554054

The stack() method “compresses” a level in the DataFrame’s columns.


In [14]:stacked = df2.stack()
In [15]: stacked
Out[15]:
first second
bar one M -1.043213
N -0.491027
two M -0.646234
N -2.849286
baz one M 0.731973
N 0.637868
two M -1.012989
N 0.554054
dtype: float64
In [16]:stacked.unstack()

Out[16]:

M N

first second

bar one -1.043213 -0.491027

two -0.646234 -2.849286

baz one 0.731973 0.637868

two -1.012989 0.554054

In [17]:stacked.unstack(1)

Out[17]:

second one two

first

bar M -1.043213 -0.646234

N -0.491027 -2.849286

baz M 0.731973 -1.012989

N 0.637868 0.554054

In [18]:stacked.unstack(0)
Out[18]:

first          bar       baz
second
one    M -1.043213  0.731973
       N -0.491027  0.637868
two    M -0.646234 -1.012989
       N -2.849286  0.554054

Pivot tables
In [19]: df = pd.DataFrame({'M': ['one', 'one', 'two', 'three']
* 2,
'N': ['A', 'B'] * 4,
'O': ['foo', 'foo', 'bar', 'bar'] * 2,
'P': np.random.randn(8),
'Q': np.random.randn(8)})
In [20]: df
Out[20]:

M N O P Q

0 one A foo 0.596137 0.462104

1 one B foo 0.598470 0.012078

2 two A bar 0.223138 -1.541724

3 three B bar 1.573414 -0.468205

4 one A foo -0.562009 -0.893717

5 one B foo -1.022035 -0.879408

6 two A bar 1.061792 1.702140

7 three B bar 0.109434 1.156828



You can produce pivot tables from this data very easily:
In [23]:pd.pivot_table(df, values='P', index=['M', 'N'],
columns=['O'])

Out[23]:

O              bar       foo
M     N
one   A        NaN  0.017064
      B        NaN -0.211783
three B   0.841424       NaN
two   A   0.642465       NaN

Pivoting
The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation, multiple values will result in a
MultiIndex in the columns.
Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)
Parameters:

Name       Description                                                Type/Default Value        Required/Optional

index      Column to use to make the new frame's index.               string or object          Optional
           If None, uses the existing index.

columns    Column to use to make the new frame's columns.             string or object          Required

values     Column(s) to use for populating the new frame's values.    string, object or a       Optional
           If not specified, all remaining columns will be used       list of the previous
           and the result will have hierarchically indexed columns.

Returns: DataFrame
Returns reshaped DataFrame.
Raises: ValueError when there are any index, columns combinations with multiple values.
Use DataFrame.pivot_table when you need to aggregate.
Example:
pandas.pivot(index, columns, values) function produces pivot table based on 3 columns of
the DataFrame. Uses unique values from index / columns and fills with values.
Parameters:
index[ndarray] : Labels to use to make new frame’s index
columns[ndarray] : Labels to use to make new frame’s columns
values[ndarray] : Values to use for populating new frame’s values
Returns: Reshaped DataFrame
Exception: ValueError raised if there are any duplicates.

# Create a simple dataframe

# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina'],
'B': ['Masters', 'Graduate', 'Graduate'],
'C': [27, 23, 21]})

df

# values can be an object or a list


df.pivot(index='A', columns='B', values='C')

# value is a list
df.pivot(index ='A', columns ='B', values =['C', 'A'])

Raise ValueError when there are any index, columns combinations with multiple values.
# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'A': ['John', 'John', 'Mina'],
'B': ['Masters', 'Masters', 'Graduate'],
'C': [27, 23, 21]})
df.pivot(index='A', columns='B', values='C')
ValueError: Index contains duplicate entries, cannot reshape

Vectorized String Operations


One strength of Python is its relative ease in handling and manipulating string data. Pandas
builds on this and provides a comprehensive set of vectorized string operations that become an
essential piece of the type of munging required when working with (read: cleaning up) real-
world data. In this section, we'll walk through some of the Pandas string operations, and then
take a look at using them to partially clean up a very messy dataset of recipes collected from the
Internet.

Introducing Pandas String Operations


We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations
so that we can easily and quickly perform the same operation on many array elements. For
example:
In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x*2
Out[1]:
array([ 4, 6, 10, 14, 22, 26])

This vectorization of operations simplifies the syntax of operating on arrays of data: we no


longer have to worry about the size or shape of the array, but just about what operation we want
done. For arrays of strings, NumPy does not provide such simple access, and thus you're stuck
using a more verbose loop syntax:

In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Out[2]:

['Peter', 'Paul', 'Mary', 'Guido']
This is perhaps sufficient to work with some data, but it will break if there are any missing
values. For example:
In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'


Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:

In [4]:
import pandas as pd
names = pd.Series(data)
names
Out[4]:
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
In [5]:
names.str.capitalize()
Out[5]:
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
Using tab completion on this str attribute will list all the vectorized string methods available to
Pandas.

Tables of Pandas String Methods


If you have a good understanding of string manipulation in Python, most of Pandas string
syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we
will start with that here, before diving deeper into a few of the subtleties. The examples in this
section use the following series of names:
In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
Methods similar to Python string methods
Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method.
Here is a list of Pandas str methods that mirror Python string methods:

len()     lower()      translate()   islower()
ljust()   upper()      startswith()  isupper()
rjust()   find()       endswith()    isnumeric()
center()  rfind()      isalnum()     isdecimal()
zfill()   index()      isalpha()     split()
strip()   rindex()     isdigit()     rsplit()
rstrip()  capitalize() isspace()     partition()
lstrip()  swapcase()   istitle()     rpartition()

Notice that these have various return values. Some, like lower(), return a series of strings:
In [7]:
monte.str.lower()
Out[7]:
0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object

But some others return numbers:


In [8]:
monte.str.len()
Out[8]:
0 14
1 11
2 13
3 9
4 11
5 13
dtype: int64
Or Boolean values:
In [9]:
monte.str.startswith('T')
Out[9]:
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool

Still others return lists or other compound values for each element:
In [10]:
monte.str.split()
Out[10]:
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of
each string element, and follow some of the API conventions of Python's built-in re module:
Method Description
match() Call re.match() on each element, returning a boolean.
extract() Call re.match() on each element, returning matched groups as strings.
findall() Call re.findall() on each element
replace() Replace occurrences of pattern with some other string
contains() Call re.search() on each element, returning a boolean
count() Count occurrences of pattern
split() Equivalent to str.split(), but accepts regexps
rsplit() Equivalent to str.rsplit(), but accepts regexps
With these, you can do a wide range of interesting operations. For example, we can extract the
first name from each by asking for a contiguous group of characters at the beginning of each
element:
In [11]:
monte.str.extract('([A-Za-z]+)', expand=False)
Out[11]:
0 Graham
1 John
2 Terry
3 Eric
4 Terry
5 Michael
dtype: object
Or we can do something more complicated, like finding all names that start and end with a
consonant, making use of the start-of-string (^) and end-of-string ($) regular expression
characters:
In [12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Out[12]:
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object

The ability to concisely apply regular expressions across Series or Dataframe entries opens up
many possibilities for analysis and cleaning of data.
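As one more small illustration (a sketch using the same monte series), the vectorized split() and get() methods can be chained to pull out, for example, the last word of each entry:

# split each name on whitespace, then take the last element of each resulting list
monte.str.split().str.get(-1)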

6. Gather information from different sources like CSV, Excel, JSON

When you start any project that directly or indirectly deals with data, the first and foremost
thing you would do is search for a dataset. Now gathering data could be done in various ways,
either using web scraping, a private dataset from a client, or a public dataset downloaded from
sources like GitHub, universities, kaggle, quandl, etc.

This data might be in an Excel file or saved with .csv, .txt, JSON, etc. file extension. The data
could be qualitative or quantitative. The data type could vary depending on the kind of problem
you plan to solve.

Reading Text Files in Python

Text files are one of the most common file formats to store data. Python makes it very easy to

read data from text files.

Python provides the open() function to read files that take in the file path and the file access

mode as its parameters. For reading a text file, the file access mode is ‘r’. I have mentioned the

other access modes below:

 ‘w’ – writing to a file


 ‘r+’ or ‘w+’ – read and write to a file
 ‘a’ – appending to an already existing file
 ‘a+’ – append to a file after reading

# read text file
with open(r'./Importing files/Analytics Vidhya.txt', 'r') as f:
    print(f.read())

The read() function imported all the data in the file, preserving its original structure.

# read text file
with open(r'./Importing files/Analytics Vidhya.txt', 'r') as f:
    print(f.read(10))

By providing a number in the read() function, we were able to extract the specified amount of

bytes from the file.

# read text file
with open(r'./Importing files/Analytics Vidhya.txt', 'r') as f:
    print(f.readline())

Using readline(), only a single line from the text file was extracted.

# read text file
with open(r'./Importing files/Analytics Vidhya.txt', 'r') as f:
    print(f.readlines())

Here, the readlines() function extracted all the text file data as a list, where each element is one line of the file.
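The other access modes listed earlier work the same way; here is a small sketch (using a hypothetical file called notes.txt) of writing and appending:

# 'w' creates the file or overwrites it if it already exists
with open(r'./Importing files/notes.txt', 'w') as f:
    f.write("first line\n")

# 'a' appends to the end of the existing file
with open(r'./Importing files/notes.txt', 'a') as f:
    f.write("second line\n")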

Reading CSV Files in Python

A CSV (or Comma Separated Value) file is the most common type of file that a data scientist

will ever work with. These files use a “,” as a delimiter to separate the values and each row in a

CSV file is a data record.

These are useful to transfer data from one application to another and are probably the reason

why they are so commonplace in the world of data science. If you look at them in the Notepad,

you will notice that the values are separated by commas:

The Pandas library makes it very easy to read CSV files using the read_csv() function:

# import pandas

import pandas as pd

# read csv file into a DataFrame

df = pd.read_csv(r'./Importing files/Products.csv')

# display DataFrame

df

But CSV can run into problems if the values contain commas. This can be overcome by using

different delimiters to separate information in the file, like ‘\t’ or ‘;’, etc. These can also be

imported with the read_csv() function by specifying the delimiter in the parameter value as

shown below while reading a TSV (Tab Separated Values) file:

import pandas as pd

df = pd.read_csv(r'./Importing files/Employee.txt', delimiter='\t')

df

Reading Excel Files in Python

Pandas has a very handy function called read_excel() to read Excel files:

# read Excel file into a DataFrame

df = pd.read_excel(r'./Importing files/World_city.xlsx')

# print values

df

But an Excel file can contain multiple sheets, right? So how can we access them?

For this, we can use the Pandas’ ExcelFile() function to print the names of all the sheets in the

file:

# read Excel sheets in pandas

xl = pd.ExcelFile(r'./Importing files/World_city.xlsx')

# print sheet name

xl.sheet_names

After doing that, we can easily read data from any sheet we wish by providing its name in

the sheet_name parameter in the read_excel() function:

# read Europe sheet

df = pd.read_excel(r'./Importing files/World_city.xlsx',sheet_name='Europe')

df

And voila!

Working with JSON Files in Python

JSON (JavaScript Object Notation) files are a lightweight and human-readable format for storing and exchanging data. These files are easy for machines to parse and generate, and they are based on the JavaScript programming language.

JSON files store data within {} similar to how a dictionary stores it in Python. But their major

benefit is that they are language-independent, meaning they can be used with any programming

language – be it Python, C or even Java!

Python provides a json module to read JSON files. You can read JSON files just like simple

text files. However, the read function, in this case, is replaced by json.load() function that

returns a JSON dictionary.

Once you have done that, you can easily convert it into a Pandas dataframe using

the pandas.DataFrame() function:

import json

# open json file

with open('./Importing files/sample_json.json', 'r') as file:
    data = json.load(file)

# json dictionary

print(type(data))
# loading into a DataFrame

df_json = pd.DataFrame(data)

df_json

But you can even load the JSON file directly into a dataframe using

the pandas.read_json() function as shown below:

# reading directly into a DataFrame using pd.read_json()

path = './Importing files/sample_json.json'

df = pd.read_json(path)

df

7. Gather the required web information using Web Scrapping
Web scraping is an automatic method to obtain large amounts of data from websites.
Most of this data is unstructured data in an HTML format which is then converted into
structured data in a spreadsheet or a database so that it can be used in various applications.

Libraries used for Web Scraping


As we know, Python has various applications and there are different libraries for different
purposes. In our further demonstration, we will be using the following libraries:
 Selenium: Selenium is a web testing library. It is used to automate browser activities.
 BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML
documents. It creates parse trees that are helpful for extracting the data easily.
 Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract
the data and store it in the desired format.

Web Scraping Example : Scraping Flipkart Website


Pre-requisites:
 Python 2.x or Python 3.x with Selenium, BeautifulSoup, pandas libraries installed
 Google-chrome browser
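If these libraries are not installed yet, they can usually be added with pip (these are the standard PyPI package names):

pip install selenium beautifulsoup4 pandas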

Let’s get started!


Step 1: Find the URL that you want to scrape

For this example, we are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops. The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.

Step 2: Inspecting the Page


The data is usually nested in tags. So, we inspect the page to see, under which tag the data we
want to scrape is nested. To inspect the page, just right click on the element and click on
“Inspect”.
When you click on the “Inspect” tab, you will see a “Browser Inspector Box” open.

Step 3: Find the data you want to extract


Let’s extract the Price, Name, and Rating, each of which is nested in its own “div” tag.

Step 4: Write the code


First, let’s create a Python file. To do this, open the terminal in Ubuntu and type gedit <your file
name> with .py extension.
I am going to name my file “web-s”. Here’s the command:
gedit web-s.py
Now, let’s write our code in this file.
First, let us import all the necessary libraries:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
To configure webdriver to use Chrome browser, we have to set the path to chromedriver
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")

Refer to the code below to open the URL:

products = []  # List to store name of the product
prices = []    # List to store price of the product
ratings = []   # List to store rating of the product

driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")

Now that we have written the code to open the URL, it’s time to extract the data from the
website. As mentioned earlier, the data we want to extract is nested in <div> tags. So, find the
div tags with those respective class-names, extract the data and store the data in a variable.
Refer the code below:

content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

Step 5: Run the code and extract the data


To run the code, use the below command:
python web-s.py

Step 6: Store the data in a required format


After extracting the data, you might want to store it in a format. This format varies depending
on your requirement. For this example, we will store the extracted data in a CSV (Comma
Separated Value) format. To do this, add the following lines to the code:

df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
Now, run the whole code again.
A file named “products.csv” is created, and this file contains the extracted data.

8. Write a Python program to do the following operations:
a) Loading data from CSV file
b) Compute the basic statistics of given data - shape, no. of columns, mean
c) Splitting a data frame on values of categorical variables
d) Visualize data using Scatter plot

RESOURCES:
a) Python 3.7.0
b) Install: pip installer, Pandas library

PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)

PROGRAM LOGIC:
a) Loading data from CSV file
# loading csv file
import pandas as pd
pd.read_csv("P:/python/newfile.csv")

b) Compute the basic statistics of given data - shape, no. of columns, mean
# shape
a = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
print('shape :', a.shape)

# no of columns
cols = len(a.axes[1])
print('no of columns:', cols)

# mean of data
m = a["Age"].mean()
print('mean of Age:', m)

c) Splitting a data frame on values of categorical variables
# adding data
a['address'] = ["hyderabad,ts", "Warangal,ts", "Adilabad,ts", "medak,ts"]
# splitting dataframe
a_split = a['address'].str.split(',', 1)
a['district'] = a_split.str.get(0)
a['state'] = a_split.str.get(1)
del(a['address'])

d) Visualize data using Scatter plot


# Visualize data using scatter plot
import matplotlib.pyplot as plt
a.plot.scatter(x='marks', y='rollno', c='Blue')
plt.show()
INPUT/OUTPUT:
a)
student rollno marks
0 a1 121 98
1 a2 122 82
2 a3 123 92
3 a4 124 78

b)
shape: (4, 3)
no. of columns:3
mean:87.5

c)
before:
  student  rollno  marks       address
0      a1     121     98  hyderabad,ts
1      a2     122     82   Warangal,ts
2      a3     123     92   Adilabad,ts
3      a4     124     78      medak,ts

After:
  student  rollno  marks   district state
0      a1     121     98  hyderabad    ts
1      a2     122     82   Warangal    ts
2      a3     123     92   Adilabad    ts
3      a4     124     78      medak    ts

d)

9. Write a python program to impute missing values with various techniques on given
dataset.
a) Remove rows/ attributes
b) Replace with mean or mode
c) Write a python program to perform transformation of data using Discretization
(Binning) and normalization (MinMaxScaler or MaxAbsScaler) on given dataset.
RESOURCES: a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library

PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)

# filling missing value using fillna()
df.fillna(0)

# filling a missing value with the previous value
df.fillna(method='pad')

# filling a null value with the next one
df.fillna(method='bfill')

# filling null values in a single column using fillna()
data["Gender"].fillna("No Gender", inplace=True)

# replace NaN values in the dataframe with value -99
data.replace(to_replace=np.nan, value=-99)
# Remove rows/ attributes

# using dropna() to remove rows having at least one NaN
df.dropna()
# using dropna() to remove rows with all values NaN
df.dropna(how='all')
# using dropna() to remove columns having at least one NaN
df.dropna(axis=1)
# Replace with mean or mode
mean_y = np.mean(ys)
# e.g. (assuming a numeric column 'Age' and a categorical column 'Gender' in data)
data["Age"].fillna(data["Age"].mean(), inplace=True)
data["Gender"].fillna(data["Gender"].mode()[0], inplace=True)

# Perform transformation of data using Discretization (Binning)


Binning can also be used as a discretization technique. Discretization refers to the
process of converting or partitioning continuous attributes, features or variables to
discretized or nominal attributes/ features/ variables/ intervals.

For example, attribute values can be discretized by applying equal-width or equal-


frequency binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively. Then the continuous
values can be converted to a nominal or discretized value which is same as the value of
their corresponding bin.

There are basically two types of binning approaches –

Equal width (or distance) binning: The simplest binning approach is to partition the range of the variable into k equal-width intervals. The interval width is simply the range [A, B] of the variable divided by k: w = (B - A) / k. Thus, the i-th interval range will be [A + (i-1)w, A + iw], where i = 1, 2, 3, ..., k. Skewed data cannot be handled well by this method.
Equal depth (or frequency) binning: In equal-frequency binning we divide the range [A, B] of the variable into intervals that contain (approximately) an equal number of points; exactly equal frequencies may not be possible due to repeated values.

Example:
Sorted data for price(in dollar) : 2, 6, 7, 9, 13, 20, 21, 25, 30
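For instance, equal-frequency binning of these nine prices into three bins gives Bin 1: 2, 6, 7; Bin 2: 9, 13, 20; Bin 3: 21, 25, 30. Smoothing by bin means would then replace the values with 5, 5, 5; 14, 14, 14; and roughly 25.33, 25.33, 25.33, while smoothing by bin boundaries replaces each value with whichever of the bin's minimum or maximum is closer.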

import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn import datasets, linear_model, metrics

# load iris data set
dataset = load_iris()
a = dataset.data
b = np.zeros(150)

# take the 1st column among the 4 columns of the data set
for i in range(150):
    b[i] = a[i, 1]

b = np.sort(b)  # sort the array

# create bins
bin1 = np.zeros((30, 5))
bin2 = np.zeros((30, 5))
bin3 = np.zeros((30, 5))

# Bin mean
for i in range(0, 150, 5):
    k = int(i / 5)
    mean = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5
    for j in range(5):
        bin1[k, j] = mean
print("Bin Mean: \n", bin1)

# Bin boundaries
for i in range(0, 150, 5):
    k = int(i / 5)
    for j in range(5):
        if (b[i+j] - b[i]) < (b[i+4] - b[i+j]):
            bin2[k, j] = b[i]
        else:
            bin2[k, j] = b[i+4]
print("Bin Boundaries: \n", bin2)

# Bin median
for i in range(0, 150, 5):
    k = int(i / 5)
    for j in range(5):
        bin3[k, j] = b[i+2]
print("Bin Median: \n", bin3)

Bin Mean: Bin Boundaries: Bin Median:


[[2.18 2.18 2.18 2.18 2.18] [[2. 2.3 2.3 2.3 2.3] [[2.2 2.2 2.2 2.2 2.2]
[2.34 2.34 2.34 2.34 2.34] [2.3 2.3 2.3 2.4 2.4] [2.3 2.3 2.3 2.3 2.3]
[2.48 2.48 2.48 2.48 2.48] [2.4 2.5 2.5 2.5 2.5] [2.5 2.5 2.5 2.5 2.5]
[2.52 2.52 2.52 2.52 2.52] [2.5 2.5 2.5 2.5 2.6] [2.5 2.5 2.5 2.5 2.5]
[2.62 2.62 2.62 2.62 2.62] [2.6 2.6 2.6 2.6 2.7] [2.6 2.6 2.6 2.6 2.6]
[2.7 2.7 2.7 2.7 2.7 ] [2.7 2.7 2.7 2.7 2.7] [2.7 2.7 2.7 2.7 2.7]
[2.74 2.74 2.74 2.74 2.74] [2.7 2.7 2.7 2.8 2.8] [2.7 2.7 2.7 2.7 2.7]
[2.8 2.8 2.8 2.8 2.8 ] [2.8 2.8 2.8 2.8 2.8] [2.8 2.8 2.8 2.8 2.8]
[2.8 2.8 2.8 2.8 2.8 ] [2.8 2.8 2.8 2.8 2.8] [2.8 2.8 2.8 2.8 2.8]
[2.86 2.86 2.86 2.86 2.86] [2.8 2.8 2.9 2.9 2.9] [2.9 2.9 2.9 2.9 2.9]
[2.9 2.9 2.9 2.9 2.9 ] [2.9 2.9 2.9 2.9 2.9] [2.9 2.9 2.9 2.9 2.9]
[2.96 2.96 2.96 2.96 2.96] [2.9 2.9 3. 3. 3. ] [3. 3. 3. 3. 3. ]
[3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ]
[3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ]
[3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ]
[3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ] [3. 3. 3. 3. 3. ]
[3.04 3.04 3.04 3.04 3.04] [3. 3. 3. 3.1 3.1] [3. 3. 3. 3. 3. ]
[3.1 3.1 3.1 3.1 3.1 ] [3.1 3.1 3.1 3.1 3.1] [3.1 3.1 3.1 3.1 3.1]
[3.12 3.12 3.12 3.12 3.12] [3.1 3.1 3.1 3.1 3.2] [3.1 3.1 3.1 3.1 3.1]
[3.2 3.2 3.2 3.2 3.2 ] [3.2 3.2 3.2 3.2 3.2] [3.2 3.2 3.2 3.2 3.2]
[3.2 3.2 3.2 3.2 3.2 ] [3.2 3.2 3.2 3.2 3.2] [3.2 3.2 3.2 3.2 3.2]
[3.26 3.26 3.26 3.26 3.26] [3.2 3.2 3.3 3.3 3.3] [3.3 3.3 3.3 3.3 3.3]
[3.34 3.34 3.34 3.34 3.34] [3.3 3.3 3.3 3.4 3.4] [3.3 3.3 3.3 3.3 3.3]
[3.4 3.4 3.4 3.4 3.4 ] [3.4 3.4 3.4 3.4 3.4] [3.4 3.4 3.4 3.4 3.4]
[3.4 3.4 3.4 3.4 3.4 ] [3.4 3.4 3.4 3.4 3.4] [3.4 3.4 3.4 3.4 3.4]

[3.5 3.5 3.5 3.5 3.5 ] [3.5 3.5 3.5 3.5 3.5] [3.5 3.5 3.5 3.5 3.5]
[3.58 3.58 3.58 3.58 3.58] [3.5 3.6 3.6 3.6 3.6] [3.6 3.6 3.6 3.6 3.6]
[3.74 3.74 3.74 3.74 3.74] [3.7 3.7 3.7 3.8 3.8] [3.7 3.7 3.7 3.7 3.7]
[3.82 3.82 3.82 3.82 3.82] [3.8 3.8 3.8 3.8 3.9] [3.8 3.8 3.8 3.8 3.8]
[4.12 4.12 4.12 4.12 4.12]] [3.9 3.9 3.9 4.4 4.4]] [4.1 4.1 4.1 4.1 4.1]]

# Perform transformation of data using normalization (MinMaxScaler or MaxAbsScaler)


on given dataset.

In preprocessing, standardization of data is one of the transformation tasks. Standardization means scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

Example to scale a toy data matrix to the [0, 1] range:


from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))
print("data max:\n", scaler.data_max_)
print("Transformed data:\n", scaler.transform(data))

OUTPUT
MinMaxScaler(copy=True, feature_range=(0, 1))
data max:
[ 1. 18.]
Transformed data:
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
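A similar sketch with MaxAbsScaler, which scales each feature by its maximum absolute value so the transformed data lies in the [-1, 1] range, on the same toy matrix:

from sklearn.preprocessing import MaxAbsScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MaxAbsScaler()
# each column is divided by its maximum absolute value (here 1 and 18)
print(scaler.fit_transform(data))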

10. Perform the Categorization of dataset


Classification in supervised Machine Learning (ML) is the process of predicting the class or
category of data based on predefined classes of data that have been ‘labeled’.
 Labeled data is data that has already been classified
 Unlabeled data is data that has not yet been labeled

Types Of Classification
There are two main types of classification:
 Binary Classification – sorts data on the basis of discrete or non-continuous values
(usually two values). For example, a medical test may sort patients into those that have a
specific disease versus those that do not.
 Multi-class Classification – sorts data into three or more classes. For example, medical
profiling that sorts patients into those with kidney, liver, lung, or bladder infection
symptoms.

How to Run a Classification Task with K-Nearest Neighbour


In this example, the KNN classifier is used to train data and run classification tasks.

# Import libraries and classes required for this example:


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Import dataset:
url = "iris.csv"

# Assign column names to dataset:


names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-
width', 'Class']

# Convert dataset to a pandas dataframe:


dataset = pd.read_csv(url, names=names)

# Use head() function to return the first 5 rows:


dataset.head()
# Assign values to the X and y variables:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Split dataset into random train and test subsets:


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20)

# Standardize features by removing mean and scaling to unit variance:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Use the KNN classifier to fit data:


classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predict y data with classifier:
y_predict = classifier.predict(X_test)

# Print results:
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))
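As a small follow-up sketch, the fitted classifier can also predict the class of a new, hypothetical measurement; remember to scale it with the same scaler first:

# classify one new flower (sepal-length, sepal-width, petal-length, petal-width are example values)
new_sample = [[5.1, 3.5, 1.4, 0.2]]
print(classifier.predict(scaler.transform(new_sample)))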

11. Perform proper data labelling operation on dataset.

Data labeling is the process of assigning labels to subsets of data based on its
characteristics. Data labeling takes unlabeled datasets and augments each piece of data with
informative labels or tags.
Most commonly, data is annotated with a text label. However, there are many use cases
for labeling data with other types of labels. Labels provide context for data ranging from images
to audio recordings to x-rays, and more.

Data Labeling Procedure


While data has traditionally been labeled manually, the process is slow and resource-
intensive. Instead, ML models or algorithms can be used to automatically label data by
first training them on a subset of data that has been labeled manually.

How To Use Label Studio To Automatically Label Data


One automated labeling tool is Label Studio, an open source Python tool that lets you label
various data types including text, images, audio, videos, and time series.

1. To install Label Studio, open a command window or terminal, and enter:


pip install -U label-studio
or
python -m pip install -U label-studio

2. To create a labeling project, run the following command:

label-studio init <project_name>


Once the project has been created, you will receive a message stating:
Label Studio has been successfully initialized. Check project states in .\<project_name>
Start the server: label-studio start .\<project_name>

3. To start the project run the following command:

label-studio start .\<project-name>


or
label-studio start <project-name>

The project will automatically load in your web browser at http://localhost:8080/welcome

4. Click on the Import button to import your data from various sources.

Once the data is imported, you can scroll down the page and preview it.

5. In the menu, click on Settings to continue:

You can now choose among the many options to finish setup for your specific project.

12. Perform Data Visualization with Python
Data Visualization is the presentation of data in graphical format. It helps people understand
the significance of data by summarizing and presenting huge amount of data in a simple and
easy-to-understand format and helps communicate information clearly and effectively.
Consider this given Data-set for which we will be plotting different charts :

Different Types of Charts for Analyzing & Presenting Data

1. Histogram :
The histogram represents the frequency of occurrence of specific phenomena which lie within a specific range of values and are arranged in consecutive and fixed intervals.

In the code below, histograms are plotted for Age, Income, and Sales. These plots in the output show the frequency of each unique value for each attribute.

# import pandas and matplotlib

import pandas as pd
import matplotlib.pyplot as plt
# create 2D array of table given above
data = [['E001', 'M', 34, 123, 'Normal', 350],
['E002', 'F', 40, 114, 'Overweight', 450],
['E003', 'F', 37, 135, 'Obesity', 169],
['E004', 'M', 30, 139, 'Underweight', 189],
['E005', 'F', 44, 117, 'Underweight', 183],
['E006', 'M', 36, 121, 'Normal', 80],
['E007', 'M', 32, 133, 'Obesity', 166],
['E008', 'F', 26, 140, 'Normal', 120],
['E009', 'M', 32, 133, 'Normal', 75],
['E010', 'M', 36, 133, 'Underweight', 40] ]
# dataframe created with
# the above data array
df = pd.DataFrame(data, columns = ['EMPID', 'Gender',
'Age', 'Sales',
'BMI', 'Income'] )
# create histogram for numeric data
df.hist()
# show plot
plt.show()

Output :

2. Column Chart :
A column chart is used to show a comparison among different attributes, or it can show a
comparison of items over time.
# Dataframe of previous code is used here
# Plot the bar chart for numeric values
# a comparison will be shown between
# all 3 age, income, sales
df.plot.bar()
# plot between 2 attributes
plt.bar(df['Age'], df['Sales'])
plt.xlabel("Age")
plt.ylabel("Sales")
plt.show()

Output :

3. Box plot chart :


A box plot is a graphical representation of statistical data based on the minimum, first
quartile, median, third quartile, and maximum. The term “box plot” comes from the fact that
the graph looks like a rectangle with lines extending from the top and bottom. Because of the
extending lines, this type of graph is sometimes called a box-and-whisker plot.
# For each numeric attribute of dataframe
df.plot.box()
# individual attribute box plot
plt.boxplot(df['Income'])
plt.show()
Output :

4. Pie Chart :
A pie chart shows how categories represent parts of a whole, that is, the composition of something. A pie chart represents numbers as percentages, and the total sum of all segments needs to equal 100%.
plt.pie(df['Age'], labels = {"A", "B", "C",
"D", "E", "F",
"G", "H", "I", "J"},
autopct ='% 1.1f %%', shadow = True)
plt.show()
plt.pie(df['Income'], labels = {"A", "B", "C",
"D", "E", "F",
"G", "H", "I", "J"},

autopct ='% 1.1f %%', shadow = True)


plt.show()
plt.pie(df['Sales'], labels = {"A", "B", "C",
"D", "E", "F",
"G", "H", "I", "J"},
autopct ='% 1.1f %%', shadow = True)
plt.show()
Output :

5. Scatter plot :
A scatter chart shows the relationship between two different variables and it can reveal the distribution trends. It should be used when there are many different data points, and you want to highlight similarities in the data set. This is useful when looking for outliers and for understanding the distribution of your data.
# scatter plot between income and age
plt.scatter(df['Income'], df['Age'])
plt.show()

# scatter plot between income and sales
plt.scatter(df['Income'], df['Sales'])
plt.show()

# scatter plot between sales and age
plt.scatter(df['Sales'], df['Age'])
plt.show()
Output :

13. Write a python program to load the dataset and understand the input data:
 Load data, describe the given data and identify missing, outlier data items
 Perform Univariate, Segmented Univariate and Bivariate analysis
 Identify any derived metrics for the given data.
 Find correlation among all attributes
 Visualize correlation matrix

RESOURCES:
a) Python 3.7.0
b) Install: pip installer, pandas, SciPy library
PROCEDURE:
1. Create: Open a new file in Python shell, write a program and save the program with .py
extension.
2. Execute: Go to Run -> Run module (F5)
PROGRAM LOGIC:
a) Load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Reading the dataset in a dataframe using Pandas
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
# describe the given data
print(df.describe())
# Display first 10 rows of data
print(df.head(10))
#Missing values

In Pandas missing data is represented by two values:


None: None is a Python singleton object that is often used for missing data in Python code.
NaN :NaN (an acronym for Not a Number), is a special floating-point value recognized by all
systems
 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()
# identify missing items
print(df.isnull())
#outlier data items
Methods:
Z-score method
Modified Z-score method
IQR method
# Z-score function (using numpy) to detect the outliers
import numpy as np

def outliers_z_score(ys):
    threshold = 3
    mean_y = np.mean(ys)
    stdev_y = np.std(ys)
    z_scores = [(y - mean_y) / stdev_y for y in ys]
    return np.where(np.abs(z_scores) > threshold)

b) Find correlation among all attributes


# importing pandas as pd
import pandas as pd

# Making data frame from the csv file
df = pd.read_csv("nba.csv")

# Printing the first 10 rows of the data frame for visualization
df[:10]

# To find the correlation among columns using the 'pearson' method
df.corr(method='pearson')

# using the 'kendall' method
df.corr(method='kendall')
c) Visualize correlation matrix
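A minimal sketch for this step (assuming matplotlib is installed and df is the dataframe used in part b) that draws the correlation matrix as a colour-coded heatmap:

import matplotlib.pyplot as plt

corr = df.corr(method='pearson')
plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()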
INPUT/OUTPUT:
import pandas as pd
df = pd.read_csv("C:/Users/admin/Documents/diabetes.csv")
print(df.describe())
print(df.head(10))

14. Perform Encoding categorical features on given dataset.
The performance of a machine learning model not only depends on the model and the
hyperparameters but also on how we process and feed different types of variables to the model.
Since most machine learning models only accept numerical variables, preprocessing the
categorical variables becomes a necessary step. We need to convert these categorical variables
to numbers such that the model is able to understand and extract valuable information.

A typical data scientist spends 70 - 80% of his time cleaning and preparing the data, and converting categorical data is an unavoidable activity. It not only elevates the model quality but also helps in better feature engineering. Now the question is, how do we proceed? Which categorical data encoding method should we use?

What is categorical data?


Since we are going to be working on categorical variables in this article, here is a quick
refresher on the same with a couple of examples. Categorical variables are usually represented
as ‘strings’ or ‘categories’ and are finite in number. Here are a few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
4. The grades of a student: A+, A, B+, B, B- etc.
In the above examples, the variables only have definite possible values. Further, we can see
there are two kinds of categorical data-
 Ordinal Data: The categories have an inherent order
 Nominal Data: The categories do not have an inherent order
In Ordinal data, while encoding, one should retain the information regarding the order in which
the category is provided. As in the above example, the highest degree a person possesses gives vital information about his qualification. The degree is an important feature to decide whether a
person is suitable for a post or not.

While encoding Nominal data, we have to consider the presence or absence of a feature.
In such a case, no notion of order is present. For example, the city a person lives in. For the
data, it is important to retain where a person lives. Here, We do not have any order or sequence.
It is equal if a person lives in Delhi or Bangalore.
For encoding categorical data, we have a python package category_encoders. The following
code helps you install easily.
pip install category_encoders

Label Encoding or Ordinal Encoding


We use this categorical data encoding technique when the categorical feature is ordinal. In this
case, retaining the order is important. Hence encoding should reflect the sequence.

In Label encoding, each label is converted into an integer value. We will create a variable that
contains the categories representing the education qualification of a person.

import category_encoders as ce
import pandas as pd

train_df = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'Bachelors',
                                    'Masters', 'Phd', 'High school', 'High school']})

# create object of OrdinalEncoder
encoder = ce.OrdinalEncoder(cols=['Degree'], return_df=True,
                            mapping=[{'col': 'Degree',
                                      'mapping': {'None': 0, 'High school': 1, 'Diploma': 2,
                                                  'Bachelors': 3, 'Masters': 4, 'Phd': 5}}])

# Original data
train_df

# fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)

One Hot Encoding
We use this categorical data encoding technique when the features are nominal(do not have any
order). In one hot encoding, for each level of a categorical feature, we create a new variable.
Each category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the
absence, and 1 represents the presence of that category.
These newly created binary features are known as Dummy variables. The number of dummy
variables depends on the levels present in the categorical variable. This might sound
complicated. Let us take an example to understand this better. Suppose we have a dataset with a
category animal, having different animals like Dog, Cat, Sheep, Cow, Lion. Now we have to
one-hot encode this data.

After encoding, in the second table, we have dummy variables each representing a category in
the feature Animal. Now for each category that is present, we have 1 in the column of that
category and 0 for the others. Let’s see how to implement a one-hot encoding in python.

import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Bangalore', 'Delhi']})

# Create object for one-hot encoding
encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan', return_df=True, use_cat_names=True)

# Original Data
data

# Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

Dummy Encoding

Dummy coding scheme is similar to one-hot encoding. This categorical data encoding method
transforms the categorical variable into a set of binary variables (also known as dummy
variables). In the case of one-hot encoding, for N categories in a variable, it uses N binary
variables. The dummy encoding is a small improvement over one-hot-encoding. Dummy
encoding uses N-1 features to represent N labels/categories.
To understand this better let’s see the image below. Here we are coding the same data using
both one-hot encoding and dummy encoding techniques. While one-hot uses 3 variables to
represent the data whereas dummy encoding uses 2 variables to code 3 categories.

Let us implement it in python.


import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

# Original Data
data

# encode the data
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded

Here using drop_first argument, we are representing the first label Bangalore using 0.

Drawbacks of One-Hot and Dummy Encoding


One hot encoder and dummy encoder are two powerful and effective encoding schemes. They
are also very popular among the data scientists, But may not be as effective when-
1. A large number of levels are present in data. If there are multiple categories in a feature
variable in such a case we need a similar number of dummy variables to encode the data. For
example, a column with 30 different values will require 30 new variables for coding.
2. If we have multiple categorical features in the dataset, a similar situation will occur, and again we will end up with several binary features, each representing a categorical feature and its multiple categories, e.g. a dataset having 10 or more categorical columns.
In both the above cases, these two encoding schemes introduce sparsity in the dataset i.e several
columns having 0s and a few of them having 1s. In other words, it creates multiple dummy
features in the dataset without adding much information.
Also, they might lead to a Dummy variable trap. It is a phenomenon where features are
highly correlated. That means using the other variables, we can easily predict the value of a
variable.
Due to the massive increase in the dataset, coding slows down the learning of the model
along with deteriorating the overall performance that ultimately makes the model
computationally expensive. Further, while using tree-based models these encodings are not an
optimum choice.

Effect Encoding:
This encoding technique is also known as Deviation Encoding or Sum
Encoding. Effect encoding is almost similar to dummy encoding, with a little difference. In
dummy coding, we use 0 and 1 to represent the data but in effect encoding, we use three values
i.e. 1,0, and -1.
The row containing only 0s in dummy encoding is encoded as -1 in effect encoding. In
the dummy encoding example, the city Bangalore at index 4 was encoded as 0000. Whereas in
effect encoding it is represented by -1-1-1-1.

Let us see how we implement it in python-


import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
encoder = ce.sum_coding.SumEncoder(cols='City', verbose=False)

# Original Data
data

encoder.fit_transform(data)

Effect encoding is an advanced technique. In case you are interested to know more about effect
encoding, refer to this interesting paper.

Hash Encoder
To understand Hash encoding it is necessary to know about hashing. Hashing is the
transformation of arbitrary size input in the form of a fixed-size value. We use hashing
algorithms to perform hashing operations i.e to generate the hash value of an input. Further,
hashing is a one-way process, in other words, one can not generate original input from the hash
representation.
Hashing has several applications like data retrieval, checking data corruption, and in data
encryption also. We have multiple hash functions available for example Message Digest (MD,
MD2, MD5), Secure Hash Function (SHA0, SHA1, SHA2), and many more.
Just like one-hot encoding, the Hash encoder represents categorical features using the new
dimensions. Here, the user can fix the number of dimensions after transformation
using n_component argument. Here is what I mean – A feature with 5 categories can be
represented using N new features similarly, a feature with 100 categories can also be
transformed using N new features.
By default, the Hashing encoder uses the md5 hashing algorithm but a user can pass any
algorithm of his choice.
import category_encoders as ce
import pandas as pd

# Create the dataframe
data = pd.DataFrame({'Month': ['January', 'April', 'March', 'April', 'February',
                               'June', 'July', 'June', 'September']})

# Create object for hash encoder
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# Fit and Transform Data
encoder.fit_transform(data)

Since Hashing transforms the data in lesser dimensions, it may lead to loss of information.
Another issue faced by hashing encoder is the collision. Since here, a large number of features
are depicted into lesser dimensions, hence multiple values can be represented by the same hash
value, this is known as a collision.
Moreover, hashing encoders have been very successful in some Kaggle competitions. It is great
to try if the dataset has high cardinality features.

Binary Encoding
Binary encoding is a combination of Hash encoding and one-hot encoding. In this encoding scheme, the categorical feature is first converted into numerical form using an ordinal encoder. Then the numbers are transformed into binary numbers. After that, the binary value is split into different columns.
Binary encoding works really well when there are a high number of categories. For example the
cities in a country where a company supplies its products.
# Import the libraries
import category_encoders as ce
import pandas as pd

# Create the Dataframe
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})

# Create object for binary encoding
encoder = ce.BinaryEncoder(cols=['City'], return_df=True)

# Original Data
data

# Fit and Transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

Binary encoding is a memory-efficient encoding scheme as it uses fewer features than one-hot
encoding. Further, It reduces the curse of dimensionality for data with high cardinality.

Base N Encoding
Before diving into BaseN encoding let’s first try to understand what is Base here?
In the numeral system, the Base or the radix is the number of digits or a combination of digits
and letters used to represent the numbers. The most common base we use in our life is 10 or
decimal system as here we use 10 unique digits i.e 0 to 9 to represent all the numbers. Another
widely used system is binary i.e. the base is 2. It uses 0 and 1 i.e 2 digits to express all the
numbers.
For Binary encoding, the Base is 2 which means it converts the numerical values of a category
into its respective Binary form. If you want to change the Base of encoding scheme you may
use Base N encoder. In the case when categories are more and binary encoding is not able to
handle the dimensionality then we can use a larger base such as 4 or 8.
# Import the libraries
import category_encoders as ce
import pandas as pd

# Create the dataframe
data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
                              'Delhi', 'Hyderabad', 'Mumbai', 'Agra']})

# Create an object for Base N Encoding
encoder = ce.BaseNEncoder(cols=['City'], return_df=True, base=5)

# Original Data
data

# Fit and Transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

In the above example, I have used base 5 also known as the Quinary system. It is similar to the
example of Binary encoding. While Binary encoding represents the same data by 4 new features
the BaseN encoding uses only 3 new variables.
Hence the BaseN encoding technique further reduces the number of features required to efficiently represent the data and improves memory usage. The default Base for Base N is 2, which is equivalent to Binary Encoding.

Target Encoding
Target encoding is a Bayesian encoding technique. Bayesian encoders use information from the dependent/target variable to encode the categorical data.
In target encoding, we calculate the mean of the target variable for each category and replace
the category variable with the mean value. In the case of the categorical target variables, the
posterior probability of the target replaces each category.
# import the libraries
import pandas as pd
import category_encoders as ce

# Create the Dataframe
data = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                     'Marks': [50, 30, 70, 80, 45, 97, 80, 68]})

# Create target encoding object
encoder = ce.TargetEncoder(cols='class')

# Original Data
data

# Fit and Transform Train Data
encoder.fit_transform(data['class'], data['Marks'])

We perform Target encoding on the train data only and code the test data using the results obtained from the training dataset. Although it is a very efficient coding system, it has the following issues, which can deteriorate model performance:
1. It can lead to target leakage or overfitting. To address overfitting we can use different
techniques.
1. In the leave one out encoding, the current target value is reduced from the overall
mean of the target to avoid leakage.
2. In another method, we may introduce some Gaussian noise in the target statistics.
The value of this noise is hyperparameter to the model.
2. The second issue, we may face is the improper distribution of categories in train and test
data. In such a case, the categories may assume extreme values. Therefore the target means for
the category are mixed with the marginal mean of the target.
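As a small sketch of the first remedy, the category_encoders package also provides a leave-one-out encoder, which is used the same way as the target encoder above:

import category_encoders as ce

# excludes the current row's target value when computing the category mean, reducing leakage
encoder = ce.LeaveOneOutEncoder(cols=['class'])
encoder.fit_transform(data['class'], data['Marks'])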

15. Perform Simple Linear Regression using Data Analysis Toolbox of Excel and
Python to interpret the regression table.
In statistical modeling, regression analysis is used to estimate the relationships between two or
more variables:
 Dependent variable (aka criterion variable) is the main factor you are trying to
understand and predict.
 Independent variables (aka explanatory variables, or predictors) are the factors that
might influence the dependent variable.

Regression analysis helps you understand how the dependent variable changes when one of
the independent variables varies and allows to mathematically determine which of those
variables really has an impact.

The three main methods to perform linear regression analysis in Excel are:
 Regression tool included with Analysis ToolPak
 Scatter chart with a trendline
 Linear regression formula

Below you will find the detailed instructions on using each method.
How to do linear regression in Excel with Analysis ToolPak
This example shows how to run regression in Excel by using a special tool included with the
Analysis ToolPak add-in.

Enable the Analysis ToolPak add-in


Analysis ToolPak is available in all versions of Excel 2019 to 2003 but is not enabled by
default. So, you need to turn it on manually. Here's how:
1. In your Excel, click File > Options.
2. In the Excel Options dialog box, select Add-ins on the left sidebar, make sure Excel
Add-ins is selected in the Manage box, and click Go.
In the Add-ins dialog box, tick off Analysis Toolpak, and click OK:

This will add the Data Analysis tools to the Data tab of your Excel ribbon.

Run regression analysis


In this example, we are going to do a simple linear regression in Excel. What we have is a list of
average monthly rainfall for the last 24 months in column B, which is our independent variable
(predictor), and the number of umbrellas sold in column C, which is the dependent variable. Of
course, there are many other factors that can affect sales, but for now we focus only on these
two variables:

With Analysis Toolpak added enabled, carry out these steps to perform regression analysis in
Excel:
1. On the Data tab, in the Analysis group, click the Data Analysis button.
Select Regression and click OK.
In the Regression dialog box, configure the following settings:
o Select the Input Y Range, which is your dependent variable. In our case, it's
umbrella sales (C1:C25).
o Select the Input X Range, i.e. your independent variable. In this example, it's
the average monthly rainfall (B1:B25).

If you are building a multiple regression model, select two or more adjacent columns with
different independent variables.
o Check the Labels box if there are headers at the top of your X and Y ranges.
o Choose your preferred Output option, a new worksheet in our case.
2. Optionally, select the Residuals checkbox to get the difference between the predicted
and actual values.
Click OK and observe the regression analysis output created by Excel.

Interpret regression analysis output


As you have just seen, running regression in Excel is easy because all calculations are performed automatically. The interpretation of the results is a bit trickier because you need to
know what is behind each number. Below you will find a breakdown of 4 major parts of the
regression analysis output.

Regression analysis output: Summary Output


This part tells you how well the calculated linear regression equation fits your source data.

Here's what each piece of information means:


Multiple R. It is the Correlation Coefficient that measures the strength of a linear relationship
between two variables. The correlation coefficient can be any value between -1 and 1, and
its absolute value indicates the relationship strength. The larger the absolute value, the stronger
the relationship:
 1 means a strong positive relationship
 -1 means a strong negative relationship
 0 means no relationship at all
R Square. It is the Coefficient of Determination, which is used as an indicator of the goodness of fit. It shows how many points fall on the regression line. The R2 value is calculated from the total sum of squares; more precisely, it is the sum of the squared deviations of the original data from the mean.
In our example, R2 is 0.91 (rounded to 2 digits), which is fairly good. It means that 91% of our values fit the regression analysis model. In other words, 91% of the dependent variable values (y-values) are explained by the independent variables (x-values). Generally, an R Squared of 95% or more is considered a good fit.
 Adjusted R Square. It is the R square adjusted for the number of independent variable
in the model. You will want to use this value instead of R square for multiple regression
analysis.
 Standard Error. It is another goodness-of-fit measure that shows the precision of your
regression analysis - the smaller the number, the more certain you can be about your
regression equation. While R2 represents the percentage of the dependent variables
variance that is explained by the model, Standard Error is an absolute measure that
shows the average distance that the data points fall from the regression line.
 Observations. It is simply the number of observations in your model.

How to make a linear regression graph in Excel


If you need to quickly visualize the relationship between the two variables, draw a linear
regression chart. That's very easy! Here's how:
1. Select the two columns with your data, including headers.
2. On the Insert tab, in the Charts group, click the Scatter chart icon, and select
the Scatter thumbnail (the first one):
Now, we need to draw the least squares regression line. To have it done, right click on any point
and choose Add Trendline… from the context menu.
On the right pane, select the Linear trendline shape and, optionally, check Display Equation
on Chart to get your regression formula:

As you may notice, the regression equation Excel has created for us is the same as the linear
regression formula we built based on the Coefficients output.
3. Switch to the Fill & Line tab and customize the line to your liking. For example, you can
choose a different line color and use a solid line instead of a dashed line (select Solid line in
the Dash type box):

Still, you may want to make a few more improvements:
 Drag the equation wherever you see fit.
 Add axes titles (Chart Elements button > Axis Titles).
 If your data points start in the middle of the horizontal and/or vertical axis like in this
example, you may want to get rid of the excessive white space. The following tip explains how
to do this: Scale the chart axes to reduce white space.
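The experiment also calls for Python; here is a minimal sketch using scipy.stats.linregress (the rainfall and sales numbers below are placeholders, not the worked example's actual data), whose output corresponds to the coefficients, R Square and standard error in Excel's summary:

from scipy import stats

rainfall = [82, 92, 99, 104, 115, 123]   # hypothetical monthly rainfall values
umbrellas = [15, 18, 20, 22, 24, 27]     # hypothetical umbrellas sold

result = stats.linregress(rainfall, umbrellas)
print("Intercept:", result.intercept)
print("Slope:", result.slope)
print("R Square:", result.rvalue ** 2)
print("Standard error of slope:", result.stderr)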

16. Perform Multiple Linear Regression using Data Analysis Toolbox of Excel and Python
to interpret the regression table.
Let’s take a practical look at modeling a Multiple Regression model for the Gross Domestic
Product (GDP) of a country.
The goal is to show you how to run multiple Regression in Excel and interpret the output, not to teach about setting up our model assumptions and choosing the most appropriate variables.
Now that we have this out of the way and expectations are set, let’s open Excel and get started!
Sourcing our data
We will obtain public data from Eurostat, the statistics database for the European Commission
for this exercise. All the relevant source data is within the model file for your convenience,
which you can download below. I have also kept the links to the source tables to explore further
if you want.
The EU dataset gives us information for all member states of the union. As a massive fan of
Agatha Christie’s Hercule Poirot, let’s direct our attention to Belgium.
As you can see in the table below, we have nineteen observations of our target variable (GDP),
as well as our three predictor variables:
 X1 — Education Spend in mil.;
 X2 — Unemployment Rate as % of the Labor Force;

 X3 — Employee compensation in mil.

Even before we run our regression model, we notice some dependencies in our data. Looking at
the development over the periods, we can assume that GDP increases together with Education
Spend and Employee Compensation.

Running a Multiple Linear Regression


There are ways to calculate all the relevant statistics in Excel using formulas. But it’s much
easier with the Data Analysis Tool Pack, which you can enable from the Developer Tab -> Excel
Add-ins.

Look to the Data tab, and on the right, you will see the Data Analysis tool within the Analyze
section.

Run it and pick Regression from all the options. Note, we use the same menu for both simple
(single) and multiple linear regression models.
Now it’s time to set some ranges and settings.

The Y Range will include our dependent variable, GDP. And in the X Range, we will select all X
variable columns. Please, note that this is the same as running a single linear regression, the only
difference being that we choose multiple columns for X Range.

Remember that Excel requires that all X variables are in adjacent columns.
As I have selected the column Titles, it is crucial to mark the checkbox for Labels. A 95%
confidence interval is appropriate in most financial analysis scenarios, so we will not change this.

You can then consider placing the data on the same sheet or a new one. A new worksheet usually
works best, as the tool inserts quite a lot of data.

Evaluating the Regression Results


Now that we have our Summary Output from Excel let’s explore our regression model further.

The information we got out of Excel’s Data Analysis module starts with the Regression
Statistics.

R Square is the most important among those, so we can start by looking at it. Specifically, we
should look at Adjusted R Square in our case, as we have more than one X variable. It gives us
an idea of the overall goodness of the fit.

An adjusted R Square of 0.98 means our regression model can explain around 98% of the
variation of the dependent variable Y (GDP) around the average value of the observations (the
mean of our sample). In other words, 98% of the variability in ŷ (y-hat, our dependent variable
predictions) is captured by our model. Such a high value would usually indicate there might be
some issue with our model. We will continue with our model, but a too-high R Squared can be
problematic in a real-life scenario. I suggest you read this article on Statistics by Jim, to learn
why too good is not always right in terms of R Square.

The Standard Error gives us an estimate of the standard deviation of the error (residuals).
Generally, if the coefficient is large compared to the standard error, it is probably statistically
significant.

Analysis of Variance (ANOVA)

The Analysis of Variance section is something we often skip when modeling Regression.
However, it can provide valuable insights, and it’s worth taking a look at. You can read more
about running an ANOVA test and see an example model in our dedicated article.
This table gives us an overall test of significance on the regression parameters.

The ANOVA table’s F column gives us the overall F-test of the null hypothesis that all
coefficients are equal to zero. The alternative hypothesis is that at least one of the coefficients is
not equal to zero. The Significance F column shows us the p-value for the F-test. As it is lower
than the significance level of 0.05 (at our chosen confidence level of 95%), we can reject the null
hypothesis that all coefficients are equal to zero. This means our regression parameters are jointly statistically significant.
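For the Python half of this experiment, a minimal sketch with statsmodels (the column names below are illustrative and would need to match the actual dataframe) produces a summary table containing the same Regression Statistics, ANOVA and coefficient sections discussed above:

import pandas as pd
import statsmodels.api as sm

# df is assumed to hold the nineteen observations with the columns named below (illustrative names)
X = df[['EducationSpend', 'UnemploymentRate', 'EmployeeCompensation']]
X = sm.add_constant(X)   # add the intercept term
y = df['GDP']

model = sm.OLS(y, X).fit()
print(model.summary())   # R Square, ANOVA (F-test) and coefficient table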

