
Application Based Programming in Python

© 2023 Aptech Limited


All rights reserved.
No part of this book may be reproduced or copied in any form or by any means
– graphic, electronic, or mechanical, including photocopying, recording, taping,
or storing in information retrieval system or sent or transferred without the prior
written permission of copyright owner Aptech Limited.
All trademarks acknowledged.

APTECH LIMITED
Contact E-mail: [email protected]
Edition 1 – 2023
Preface

Python is a simple yet effective programming language that is widely used to
create different applications for the Web. It has a very simple and flexible syntax
that makes writing code easy. It supports procedural, functional, and object-
oriented programming techniques. This makes it a popular choice for creating
Web, Analytics, and Machine learning applications. In this Learner’s Guide, you
will learn the basics of Python programming to create client-server and Web
applications. You will also learn to use the scikit-learn library to enable machine
learning and create prediction applications in Python. You will learn to write your
own Python programs that apply machine-learning techniques to real-world
problems.

This book is the result of a concentrated effort of the Design Team, which is
continuously striving to bring you the best and the latest in Information Technology.
The process of design has been a part of the ISO 9001 certification for Aptech-IT
Division, Education Support Services. As part of Aptech’s quality drive, this team
does intensive research and curriculum enrichment to keep it in line with industry
trends.

We will be glad to receive your suggestions.

Design Team
Table of Contents

Sessions

Session 1 – Basics of Python
Session 2 – Programming Language Constructs
Session 3 – Introduction to Functions
Session 4 – Functions, Modules, and Packages
Session 5 – File Handling and Exception Handling
Session 6 – Regular Expressions and Tokenization of Text in Python
Session 7 – Web Development Using Flask
Session 8 – Web Scraping in Python
Session 9 – GUI Programming in Tkinter
Session 10 – Database Integration with Python
Session 11 – Data Handling Using Pandas
Session 12 – Data Visualization in Python
Session 13 – Data Processing Techniques in Machine Learning
Session 14 – Introduction to Different Types of Machine Learning in Python
Session 15 – Regression Techniques for Machine Learning Using Python
Appendix A – Blossom Buddy – An Application in Python
Appendix B – Case Studies

V1.0 © Aptech Limited


Learning Objectives

In this session, students will learn to:

⮚ Explain what Python is
⮚ List Python keywords, identifiers, and the rules for naming identifiers
⮚ Explain statements, comments, and indentation
⮚ Describe how to create a variable and assign values to variables

Today, Python is one of the most widely used programming languages. It is
used to develop a wide range of applications, from simple Web applications,
network programs, scientific computing, and games to data analytics, data
science, and machine learning. More and more developers are adopting Python
because its code is easy to write and run.
This session will provide an overview of Python. It will cover the concept of
keywords and identifiers, including the rules for naming identifiers. The session
will explore statements, including simple and multi-line statements. It will also
cover comments (single-line, multi-line, and inline) and indentation.
The session will conclude with the details of using variables in Python code.

1.1 Introduction to Python

Python is an interpreted, object-oriented, and high-level programming
language. It was developed by Guido van Rossum at the National Research
Institute for Mathematics and Computer Science in the late eighties and
early nineties. It draws influences from languages such as C, C++, and Unix
shell.

Python is known for its simplicity, efficiency, and readability. Its elegant syntax,
which is similar to English sentences, makes it easier to write code and allows
developers to focus on solving logical errors in the code rather than focusing
on syntax errors. Python also requires programmers to write less code
compared to many other programming languages. It supports modularity and code
reuse. This code simplicity reduces the overhead of code maintenance.
Python also facilitates rapid application development across multiple
platforms.

The Python interpreter and comprehensive standard library are freely available
in both source and binary formats, enabling widespread distribution across
major platforms.

Key features of Python include:

Python supports functional, structured, and object-oriented programming
paradigms. It can be used as a scripting language or compiled to bytecode
for building large applications. Python offers high-level dynamic data types,
automatic garbage collection, and seamless integration with other
languages.

With its versatility and compatibility, Python finds extensive use in Web
development, software development, mathematics, and system scripting. Its
simple syntax and support for procedural, object-oriented, and functional
programming make it a preferred choice for developers.

1.1.1 Python IDE

A Python Integrated Development Environment (IDE) is a specialized software
application that facilitates software development specifically for Python. It
provides a comprehensive set of tools and features for writing, testing, and
debugging Python code.

Python IDEs enhance productivity by offering features including code editing
with syntax highlighting, code completion, and automatic indentation. They
also provide integrated debugging capabilities, enabling developers to step
through their code and examine variables and data structures.

Moreover, Python IDEs often include features such as version control
integration, project management, and support for third-party libraries and
frameworks. This helps streamline the development process and enables
efficient collaboration.

Some popular Python IDEs are:


● PyCharm
● Visual Studio Code with Python extension
● Spyder
● Integrated Development and Learning Environment (IDLE)

Integrated Development and Learning Environment (IDLE) is the
default IDE bundled with Python.

1.1.2 Jupyter Notebook

Jupyter Notebook is a free, open-source, and interactive Web application that
can be used to run code and view the output for different programming
languages including Python, PHP, R, and C#. It helps users create and share
documents containing live code, equations, visualizations, and narrative text.

This versatile tool combines code execution, text, and visualizations in a single
document called a notebook. These notebooks consist of cells that can be
executed individually, allowing users to experiment with code, visualize data,
and document their workflow seamlessly.

Jupyter Notebook supports various programming languages through different
kernels. For Python, it uses the IPython kernel. It also supports other languages
such as R, Julia, and Scala, enabling multi-language support within a single
notebook.

Jupyter Notebook facilitates:

● Creation and sharing of reproducible research
● Data analysis
● Machine learning models
● Data visualizations

Jupyter Notebook is a popular tool among data scientists,
researchers, and educators.

1.2 Installing Python and Jupyter Notebook

To install Python, perform these steps:

1. To download Python, go to https://www.python.org/downloads/.
The downloads page appears as shown in Figure 1.1.

Figure 1.1: Downloads Page

The site automatically detects the operating system and displays the
appropriate version to download.
2. Under Download the latest version for Windows, click Download Python
3.11.4.
The .exe file is downloaded.

The Python version can be different when downloading because it is
open source and is updated on a regular basis. Download the
version that is available.

3. After the download is complete, run the .exe file.
The installation wizard launches.
4. On the Install Python 3.11.4 page of the installation wizard, perform these
tasks (Refer to Figure 1.2):
● Ensure that the Use admin privileges when installing py.exe check
box is selected.
● Select the Add python.exe to PATH check box.
● Click Install Now.

Figure 1.2: Install Python Page of the Wizard

5. Wait for the installation to complete.
After the installation is complete, the Setup was successful page is
displayed as shown in Figure 1.3.

Figure 1.3: Setup was successful Page of the Wizard

6. Click Close.
7. To start using Python, from the Start menu, open Command Prompt.
8. To check the version of Python installed, at the command prompt, type
the command as:
python --version

9. Press Enter.
The installed version is displayed as shown in Figure 1.4.

Figure 1.4: Python Version Displayed

10. To install Jupyter Notebook, at the command prompt, type the
command as:

pip install jupyter notebook

11. Press Enter.

The installation process begins as shown in Figure 1.5.

Figure 1.5: Installing Jupyter Notebook

12. Wait for the installation to complete and the prompt to be displayed.
13. To open Jupyter Notebook, at the command prompt, type the
command as:

jupyter notebook

14. Press Enter.

Jupyter Notebook opens in the browser as shown in Figure 1.6.

Figure 1.6: Jupyter Notebook

15. To create a Python notebook in Jupyter Notebook, on the top-right, click
the New drop-down, and click Python 3 (ipykernel), as shown in Figure
1.7.

Figure 1.7: Creating Python File

A new notebook is created and displayed as shown in Figure 1.8.

Figure 1.8: Python Notebook

16. To execute a Python command, in the box, type the command as:

print("This is Jupyter Notebook")

Figure 1.9 shows the command entered in the notebook.

Figure 1.9: Command Entered in Notebook

17. On the toolbar, click Run.

The output is displayed as shown in Figure 1.10.

Figure 1.10: Output of the Command

18. To save the file as MyFirstNotebook, on the top, click Untitled.

The Rename Notebook dialog box opens as shown in Figure 1.11.

Figure 1.11: The Rename Notebook Dialog Box

19. In the Enter a new notebook name box, type MyFirstNotebook.


20. Click Rename.

The renamed file appears in the list on the Jupyter homepage as shown
in Figure 1.12.

Figure 1.12: New Jupyter Notebook Added to List

To open the notebook at a later point in time, click the file in the list.
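
The version check from step 8 can also be repeated inside a notebook cell, which confirms which interpreter the notebook is actually using. The following is a minimal sketch that relies only on the standard sys module; the variable names are illustrative.

```python
import sys

# sys.version_info exposes the interpreter version as named fields.
major = sys.version_info.major
minor = sys.version_info.minor

# Build a short, readable version label such as "Python 3.11".
version_text = f"Python {major}.{minor}"
print(version_text)
```

The value printed depends on the version that was installed, so it may differ from the one shown in the figures.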

1.3 Keywords

Python keywords are fundamental building blocks of any Python program.
These keywords are reserved words that have specific meanings in the language.
You cannot use them as identifiers, such as variable names or function names. Python
keywords are always available and do not require importing into the code.

Python keywords are distinct from Python's built-in functions and
types, which are also readily available but are less restrictive in their
usage.

Assigning values to Python keywords is NOT allowed and will result in a
SyntaxError. While it is technically possible to assign values to the names of
built-in functions or types, doing so hides the original built-in and is not
recommended.

Python keywords are categorized into different groups based on their usage,
which helps organize and understand their purpose. Table 1.1 describes the
categories of keywords.

| Category | Description | Keywords |
| --- | --- | --- |
| Import | These keywords are used to import modules and packages into a Python program. | import, from, as |
| Value | These keywords represent single values that can be assigned to variables. | True, False, None |
| Control Flow | These keywords are used to control the flow of program execution based on specified conditions. | if, elif, else, for, while, break, continue |
| Operator | These keywords are used to specify operations to be performed on values. | and, or, not, in, is |
| Returning | These keywords are used to return values from functions and methods. | return, yield |
| Structure | These keywords are used to define the structure of a Python program. | def, class, with, as, pass, lambda |
| Exception Handling | These keywords are used to handle exceptions in a Python program. | try, except, raise, finally, else, assert |
| Variable Handling | These keywords are used to manage variables in a Python program. | del, global, nonlocal |
| Asynchronous Programming | These keywords are used for asynchronous programming in Python. | async, await |

Table 1.1: Categories of Keywords

Note that the keywords may change over time as Python evolves.
It is always recommended to refer to official documentation or
trusted sources for the most up-to-date information on Python
keywords and their usage.
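
Because the keyword list depends on the Python version, it can be checked directly from code. The following is a small sketch using the standard keyword module; the names reserved_words and so on are illustrative.

```python
import keyword

# kwlist holds the reserved words of the Python version that is running.
reserved_words = keyword.kwlist
print(len(reserved_words), "keywords in this version")

# iskeyword() reports whether a given name is reserved.
for_is_reserved = keyword.iskeyword("for")      # a keyword
print_is_reserved = keyword.iskeyword("print")  # a built-in function, not a keyword
print(for_is_reserved, print_is_reserved)
```

The count printed can vary between Python releases, which is why the sketch does not assume a fixed number.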

1.4 Identifiers

Identifiers are unique names used to differentiate various programming
elements. In Python, an identifier is used to name variables, functions, classes,
modules, and other objects. It provides a unique name for each entity,
improving code readability and understanding. However, certain rules must be
followed when choosing identifiers.

Rules for Naming Identifiers in Python

There are some rules that the users must follow when specifying names for
identifiers to avoid errors in code. Figure 1.13 shows these rules.

Figure 1.13: Identifier Naming Rules

Although Python does not limit the length of an identifier name, it is
recommended to keep names short so that they are easy to remember
and use within the code. The Python Enhancement Proposal 8 (PEP-8) style guide
limits each line of code to 79 characters, which is one more reason to avoid very
long names. In addition, use meaningful names so that it is easy to identify what
information the identifiers are storing. In Python, class names usually start with
uppercase letters and other identifiers start with lowercase letters.

Table 1.2 lists some examples of valid and invalid identifiers.

| Valid Identifiers | Invalid Identifiers |
| --- | --- |
| _myID | 321myID |
| _ | myID@321 |
| myID321 | 321987 |
| myid | |
| MYID | |

Table 1.2: Identifiers
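
The entries in Table 1.2 can be checked programmatically. The sketch below combines the built-in str.isidentifier() method with the standard keyword module; the helper name is_valid_identifier is hypothetical and is introduced here only for illustration.

```python
import keyword


def is_valid_identifier(name):
    """Return True if name follows the identifier rules and is not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Names taken from Table 1.2, plus the keyword "try" for comparison.
candidates = ["_myID", "_", "myID321", "myid", "MYID",
              "321myID", "myID@321", "321987", "try"]

for candidate in candidates:
    print(candidate, is_valid_identifier(candidate))
```

The keyword check is required because str.isidentifier() alone accepts reserved words such as try.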

1.5 Variables

When writing a program, the values to be used in the program or the values
that are calculated in the program are stored in the memory. How can one
access these values from memory? To do that, names are assigned to the
locations where each of these values is stored. These names are known as
variables.

Generally, in programming languages, variables must be declared before they
are used in the program. However, in Python, users are not required to explicitly
declare a variable. This is because Python creates a variable in the computer
memory if you assign a value to a variable that is not declared. The value
assigned decides the type of variable. For example, suppose that the user
wants to store the name and marks of a student. To do so, the user can use the
variables as shown in Code Snippet 1.

Code Snippet 1:

student_name = "Larrissa"
marks = 90

In this case, two variables are created, student_name and marks.

Unlike many other programming languages, variables in Python are not required to
be declared with the type of data they can store. This is because, in Python,
the type belongs to the value, so a variable can refer to a value of a different
type after being set. For example, in this case,
the student_name variable stores a string, while the marks variable stores
a numerical value.
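
A short sketch of this behavior follows. It reuses the marks variable from Code Snippet 1 and the built-in type() function; the other names are illustrative.

```python
marks = 90
first_type = type(marks).__name__   # type name while marks holds a number

marks = "ninety"
second_type = type(marks).__name__  # type name after a string is assigned

print(first_type, second_type)      # int str
```

Reusing one name for values of different types is allowed, but it can make code harder to follow, so it is usually avoided.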

However, Python does allow variables to be declared with specific types using
casting.
For example, consider these variables as shown in Code Snippet 2:

Code Snippet 2:

student_name = str("Larrissa")
roll_no = int(3)
marks = float(90.5)

In this case, student_name will store a string value “Larrissa”; roll_no will
store an integer value 3 and marks will store a floating-point value 90.5. Here,
string, integer, and floating point are data types that specify the type of data
stored in the variables.
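
Casting is most useful when a value has to be converted from one type to another. The following is a minimal sketch using the built-in int(), float(), and str() functions; the variable names are illustrative.

```python
marks_text = "90"            # text, for example as read from a file

marks = int(marks_text)      # string to integer
average = float(marks)       # integer to floating point
label = str(marks)           # integer back to string
truncated = int(90.9)        # int() drops the fractional part

print(marks, average, label, truncated)  # 90 90.0 90 90
```

Note that int() applied to a floating-point value truncates rather than rounds.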

Python also allows multiple variables to be assigned simultaneously. For
example, consider the variables shown in the code.

marks1, marks2, marks3 = 90, 75, 85

In this case, variables will be assigned values in the order specified. That is,
marks1 will be assigned the value 90; marks2 will be assigned the value 75,
and marks3 will be assigned the value 85.

Python also allows the same value to be assigned to multiple variables. For
example, consider the variables shown in the code:

marks1 = marks2 = marks3 = 90

In this case, all three variables, marks1, marks2, and marks3, will be assigned
the value 90.
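
The two forms of multiple assignment can be compared in a short sketch. Assigning several values uses a comma-separated list on both sides, while assigning one value to several names uses the chained form name1 = name2 = value; a single number on the right of a comma-separated target list cannot be unpacked and raises a TypeError. The names grade1, grade2, grade3, and marks_sum are illustrative.

```python
# Several values assigned in order to several variables.
marks1, marks2, marks3 = 90, 75, 85
marks_sum = marks1 + marks2 + marks3
print(marks_sum)  # 250

# One value assigned to several variables with chained assignment.
grade1 = grade2 = grade3 = 90
print(grade1, grade2, grade3)  # 90 90 90

# A single value cannot be unpacked into three names.
try:
    a, b, c = 90
    unpack_failed = False
except TypeError:
    unpack_failed = True
print(unpack_failed)  # True
```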

The rules defined for naming identifiers apply to names specified for
variables.

Scope of Variables

The scope of variables specifies where in the program those variables are
accessible. Variables can have a local scope or a global scope. A variable
with global scope is accessible throughout the entire program, while local
variables are accessible only within the function where they
are assigned.

For example, consider the code in Figure 1.14. When executed in Jupyter
Notebook, this code gives an error.

Figure 1.14: Scope of Variables

This is because in this case, student_name is a global variable as it is not
introduced inside any function. So, this variable is accessible anywhere within
the program, including inside functions. However, the variables
marks1, marks2, and total are assigned within the function. So, these are
local variables, which are accessible only within the function
where they are introduced. They are not accessible outside the function.
Once the function finishes executing, the local variables are
destroyed and no longer available. Therefore, the variables marks1, marks2,
and total are not available to the last print statement.

To run the code successfully, replace the last print statement with the
function call shown here:

total_marks()

Execute the code once again. The output will appear as shown in Figure 1.15.

Figure 1.15: Output of Updated Code
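
The scope behavior described in this section can be sketched as follows. This is not the code from Figure 1.14; it is an assumed reconstruction that reuses the names student_name, marks1, marks2, total, and total_marks from the text, with illustrative values.

```python
student_name = "Larrissa"   # global: assigned outside any function


def total_marks():
    # marks1, marks2, and total are local to this function.
    marks1 = 90
    marks2 = 75
    total = marks1 + marks2
    # The global variable is readable inside the function.
    print(student_name, "scored", total)
    return total


result = total_marks()

# Outside the function, the local name total does not exist.
try:
    print(total)
except NameError as error:
    print("NameError:", error)
```

Returning the value, as done here with result, is the usual way to make a locally computed value available to the rest of the program.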

1.6 Basic Functions – input and print

Consider a scenario in which the sum of two values has to be calculated and
printed. The two values to use must be specified by the user. In this case, the
developer can use the input function to get the values from the user. After
the calculation, the developer can use the print function to display the
output on the screen.

The syntax for the input function is:

input (prompt)

In this syntax, prompt refers to the message that will be displayed for the user
to enter the required value. The value entered by the user at the prompt is
always returned as a string. To convert it into a number, a conversion function
such as int (for whole numbers) or float (for decimal numbers) must be used.

The syntax for print function is:

print (msg)

In this syntax, msg refers to the message or values to be printed on the screen.

Consider an example that takes two numbers from the user, adds them, and
displays the sum on the screen. Figure 1.15 shows the code for this example.
When this code is run in Jupyter Notebook, the prompt for the first number
appears as shown in Figure 1.16.

Figure 1.16: Prompt for First Number

When the user enters the first number as 10 and presses Enter, the prompt for
the second number appears as shown in Figure 1.17.

Figure 1.17: Prompt for Second Number

Consider that the user enters the second number as 10 and presses Enter.

The output of the code is displayed as shown in Figure 1.18.

Figure 1.18: Output of Addition of Two Numbers
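
The addition example can be sketched as shown below. This is an assumed version, not the code from the figure. So that the sketch runs without waiting for keyboard input, the two values are set directly and the input calls are shown as comments; the helper name add_numbers and the prompt texts are hypothetical.

```python
def add_numbers(first_text, second_text):
    """Convert two text values to integers and return their sum."""
    # input() returns text, so each value is converted before adding.
    return int(first_text) + int(second_text)


# In a notebook, the values would normally come from the user:
#   first_text = input("Enter the first number: ")
#   second_text = input("Enter the second number: ")
first_text = "10"
second_text = "10"

print("Sum:", add_numbers(first_text, second_text))  # Sum: 20

# Without conversion, + joins the two strings instead of adding them.
print(first_text + second_text)  # 1010
```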

1.7 Statements

Every line of code written in Python is considered a statement. A statement is
an instruction to the interpreter for execution. Every statement ends with a
NEWLINE token.

Simple statements in Python are those lines of code that are written on a single
line. In Python, a simple statement can be a line of code that assigns a value
to a variable. For example, consider the code in Code Snippet 3.

Code Snippet 3:

student_name = "Larrissa"
marks = 90

In this case, there are two simple statements. The first statement assigns the
value Larrissa to the variable student_name and the second statement
assigns the value 90 to the variable marks.

A statement can also be split to occupy multiple lines. These types of
statements are called multi-line statements. A line of code can be split across
multiple lines in two ways: implicitly, by enclosing the expression in
parentheses (), square brackets [], or curly braces {}, or explicitly, by ending
each line with the continuation character backslash \.

For example, consider the code in Code Snippet 4.

Code Snippet 4:

# Explicit continuation with a backslash
intro_msg = "Welcome to a \
world of opportunities that \
help you shape your life"

# Implicit continuation inside parentheses
intro_msg = ("Welcome to a "
             "world of opportunities that "
             "help you shape your life")

Table 1.3 lists the different types of statements that are used in Python.

| Type of Statement | Description | Example |
| --- | --- | --- |
| Assignment | Used to assign values to variables | marks = 75 |
| Conditional | Used to specify a condition that decides whether to execute or skip a block of code | if marks > 70: print("pass") |
| Looping | Used to execute a block of code repeatedly, as long as a condition holds or for a defined number of times | while marks > 70: print("Well done!!") |
| Import | Used to import a module from another Python file | import math |
| Comment | Used to add comments for the line of code | #Variable assignment |
| Function definition | Used to define functions that perform specific tasks | def add(a, b): total = a + b |
| Pass | Used to add empty blocks of code | if marks == 70: pass |
| Raise | Used to raise exception to indicate errors | raise ValueError("Invalid value") |
| Expression | Used to calculate values by performing the operation specified in the expression on the values or variables specified in the expression | total_marks = 70 + 85; total = marks1 + marks2 |

Table 1.3: Types of Statements
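
The fragments in Table 1.3 can be combined into one small runnable program. The sketch below is illustrative only; the names checked_sqrt, result, and attempts are hypothetical, and the loop uses a counter so that it stops after three passes.

```python
import math                      # import statement


def add(a, b):                   # function definition
    total = a + b                # assignment using an expression
    return total


def checked_sqrt(value):
    if value < 0:
        raise ValueError("Invalid value")   # raise statement
    return math.sqrt(value)


marks = 75                       # assignment
result = "fail"
if marks > 70:                   # conditional statement
    result = "pass"
if marks == 70:
    pass                         # pass: placeholder for an empty block

attempts = 0
while attempts < 3:              # looping statement
    attempts += 1

print(add(70, 85), result, attempts, checked_sqrt(16))
```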

1.8 Comments

A Python program can include thousands of lines of code. Understanding this
code requires a developer to go through all the lines of code multiple times.
This can be a tedious and time-consuming task. Also, even after going through
the code, there can be errors in understanding. To ensure that the code is
understandable, developers can add comments that explain the purpose of
a line or block of code. These comments are ignored by the interpreter when
executing the code. The comments only help developers make the code
more understandable. Comments can also be used to disable specific lines or
blocks of code when testing or debugging code.

Python supports three types of comments: single-line, multi-line, and inline.

Comments are denoted by the hash (#) symbol. For multi-line comments,
each line of the comment must begin with #. Triple quotes, single (''') or
double ("""), are also commonly used for multi-line comments.
These symbols are used at the beginning and the end of the multi-line
comment.

Triple single and double quotes are also used to enclose text values in
the code. The text within the quotes will be considered as a value if it
is assigned to a variable. If the text within the quotes is not assigned
to any variable, it is a string expression whose value is discarded, so it
behaves like a comment.

An example of how to use various types of comments in code is shown in Figure
1.19.

Figure 1.19: Comments
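
The comment styles described in this section can be sketched as follows. This is an illustrative example rather than the code from the figure; the variable names are assumptions.

```python
# Single-line comment: compute the area of a rectangle.
length = 4
width = 3  # inline comment: placed after the code on the same line

# Multi-line comment written as consecutive
# lines that each begin with the hash symbol.
area = length * width

"""
Triple-quoted text that is not assigned to a variable.
Python evaluates it as a string and discards the result.
"""

# Triple-quoted text assigned to a variable is an ordinary string value.
note = """First line of the note
second line of the note"""

print(area)  # 12
print(note)
```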

PEP-8 recommends that the comments are concise and focused. The
comments can contain a maximum of 72 characters for inline and
single-line comments. Multiple-line comments must be used if the
length of the comment exceeds this limit.

1.9 Indentation

Most programming languages use curly braces to indicate the start and end
of a block of code, even if the code is not indented. However, Python does
not use any kind of braces. It uses indentation to indicate the start and end
of code blocks. If the indentation is not correctly followed, the lines of code
will not be executed in the required order, leading to incorrect results.

Developers have the flexibility to choose the number of spaces for
indentation. A minimum of one whitespace is required. It is crucial that
the same indentation be followed throughout the program. If the indentation
is not consistent, Python raises an IndentationError and the code does not run.
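
The effect of inconsistent indentation can be observed safely by compiling a piece of source text with the built-in compile() function instead of running a broken file. In the sketch below, the source string and the names source and outcome are illustrative; the second statement is indented differently from the first, so it matches no enclosing block.

```python
# Source text with inconsistent indentation inside the if block.
source = (
    "if True:\n"
    "        x = 1\n"
    "    y = 2\n"
)

try:
    compile(source, "<example>", "exec")
    outcome = "compiled"
except IndentationError as error:
    # IndentationError is a subclass of SyntaxError.
    outcome = "IndentationError"
    print("IndentationError:", error.msg)

print(outcome)
```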

Some considerations for code indentation are:

Here is an example to understand the code indentation and grouping of
statements:

Statement 1
Statement 2
<code block 1>
    Statement 1a
    Statement 1b
    Statement 1c
    <code block 2>
        Statement 2a
        Statement 2b
        Statement 2c
        <code block 3>
            Statement 3a
            Statement 3b
            Statement 3c
<code block 4>
    Statement 4a
    Statement 4b
    Statement 4c
Statement 3
Statement 4

Here, Statement 1, Statement 2, Statement 3, and Statement 4 are a
part of the outermost code block. Statement 1a, Statement 1b, and
Statement 1c are a part of code block 1. So, these statements are
indented by four whitespaces. Then, code block 2 is a part of code block
1, so it is indented by four whitespaces. Statement 2a, Statement 2b, and
Statement 2c are a part of code block 2. So, these statements are
indented by eight whitespaces. Similarly, the statements in code block 3 are
indented by 12 whitespaces. Next, code block 4 is a part of the outermost
block, so it is indented by four whitespaces.
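
The same nesting can be seen in real Python code. The sketch below is illustrative; the function name classify and its labels are hypothetical. The function body is indented by four spaces, the loop body by eight, and the body of each branch by 12.

```python
def classify(marks_list):
    # Function body: indented by four spaces.
    labels = []
    for marks in marks_list:
        # Loop body: indented by eight spaces.
        if marks > 70:
            # Branch body: indented by 12 spaces.
            labels.append("pass")
        else:
            labels.append("retry")
    return labels


# Back at the outermost level: no indentation.
print(classify([90, 65, 85]))  # ['pass', 'retry', 'pass']
```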

Proper indentation of code clearly indicates the hierarchy and grouping of
statements. This provides a structure to the code, which enhances the
appearance and readability of the code. It also eliminates the necessity for
using braces, brackets, or other delimiters, which in turn eliminates errors
related to missing opening or closing braces/brackets. Using consistent
indentation throughout the code also helps in easy debugging.

1.10 Summary

● Python is an interpreted, object-oriented, and high-level programming
language known for its simplicity, efficiency, and readability.
● Jupyter Notebook is a free, open-source, and interactive Web
application that can be used to run Python code and view the output.
● Python keywords are reserved words used to execute specific
commands and hence, cannot be used as identifiers.
● Identifiers are names of Python objects such as variables, functions,
classes, and modules.
● Variables are names given to memory locations where the values are
stored.
● Variables can be global or local in scope, with global variables
accessible throughout the program and local variables limited to their
respective functions.
● Statements in Python are instructions that can be executed by the
Python interpreter.
● Comments help make the code more understandable and are ignored
by the interpreter. Comments can be single-line, multi-line, or inline.
● In Python, indentation is used to define code blocks and grouping of
statements.

Test Your Knowledge

1. Which of the following options are valid identifiers in Python?

a. _1add_
b. 7sub
c. Mul12B
d. try

2. Which of the following statements in Python is used as a placeholder that
indicates no operation?

a. return
b. continue
c. break
d. pass

3. Which of the following options are the implicit line continuations used to
split a statement in Python?

a. /
b. ( )
c. [ ]
d. { }

4. Which of the following statements is not true about indentation in Python?

a. Whitespace is used for indentation
b. Statements with the same level of indentation belong to the same
code block
c. The default indentation spaces are four spaces
d. Indentation is permitted on the first line of Python code

5. Which of the following statements are true about variables in Python?

a. Python does not require variables to be declared
b. The variable names in Python are case-sensitive
c. The value stored in a variable cannot be changed during
program execution
d. The variable declaration in the code amount$toRs=500 is a valid
variable declaration in Python

Answers to Test Your Knowledge

1 a, c
2 d
3 b, c, d
4 d
5 a, b

Try it Yourself

1. Install Python.
2. Install and open Jupyter Notebook.
3. Create a Python notebook. Consider that you have declared two
variables num1 as 18 and num2 as 10. Perform the given tasks:
a. Write a single-line comment on “Python program to multiply
two numbers”.
b. Write an expression statement in Python to multiply the two
variables num1 and num2 and add a constant value, 100. Store the
result of the expression statement in a variable, product_result.
c. Write an inline comment for the expression statement created in
b.
d. Display the result, product_result.

4. Create a Python notebook. Consider the text given:
“Python keywords are fundamental building blocks of
any Python program. In Python, keywords are reserved
words that have specific meanings and purposes.”
Perform the given tasks:
a. Add the given text as a multiline comment.
b. Assign the given text to a variable Python_multi and use
continuation character backslash \ to split the lines across multiple
lines. Print the text, Python_multi.
c. Assign the given text to a variable Python_line using triple quotes
and print the text, Python_line.

SESSION 2
Programming Language Constructs

Learning Objectives

In this session, students will be able to:

⮚ Explain the data types in Python
⮚ Describe the working of the operators
⮚ Explain the working of control flow statements

When writing code, developers use variables to store values. Several types of
values can be stored in the variables, such as whole numbers, decimal
numbers, text, dates, and so on. These are termed as data types. Python
supports several data types. The values stored in these variables are used to
perform different types of operations, such as mathematical, logical,
comparison, and so on. Python supports a host of operators that can be used
to perform these operations. Like other programming languages, Python also
supports flow control statements. These statements are used to determine
which code to execute or skip, based on specific conditions.

In this session, the different data types in Python will be covered.
The session will also discuss operators and their applications. Finally, the session
will cover the flow control statements and their significance.

2.1 Data Types

Data types help to determine the type of value or data a variable can hold.
Each type of data requires varying space in the memory. A data type helps
reserve the required space for a value in the memory.

Python supports many data types, including numeric, string, sequence, mapping
(dictionary), Boolean, and set types.

In many programming languages, when declaring a variable, the data type for
the variable must also be specified. However, Python behaves differently. The
data type is not required to be specified for the variables; Python identifies the
data type based on the value stored in the variable. Therefore, care must be
taken when assigning values to a variable. If values of different types are
assigned to one variable in turn, the last value assigned determines the data
type. Python also allows a value to be converted to a specific data type, if
required, using built-in functions such as int(), float(), and str().
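
The behavior described above can be sketched as follows. This is a minimal illustration; the variable names item_count and price are assumptions chosen for the example.

```python
# Python infers the type from the assigned value; no declaration is needed.
item_count = 25
print(type(item_count))   # <class 'int'>

# Assigning a value of another type changes the type the name refers to.
item_count = "twenty-five"
print(type(item_count))   # <class 'str'>

# A built-in function can convert a value to a specific type explicitly.
price = float("19.99")
print(type(price))        # <class 'float'>
```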

2.1.1 Numeric Data Type

Numeric data types are used for variables in which the developer wants to
store numbers. These numbers can be integers, floating-point numbers, or
complex numbers.

To store these different types of numbers, Python supports three numeric data
types: int, float, and complex.

int Data Type

This data type is used to store integers in variables. Integers are whole numbers.
They can be positive, negative, or zero, and they have no fractional part. Some
examples of integers are 5, 25, 300, -80, and -60. No declaration is required:
when an integer value is assigned to a variable, Python treats the value as an
object of the built-in int type.

Some instances where this data type may find its use are:

● Storing the count of items in the inventory, the quantities or volume of
  items, or the count of the number of times a loop must be executed
● Storing index numbers of elements in a string or list
● Storing the X and Y coordinates of the points on a graph

Figure 2.1 shows the usage of the int data type in Python.

Figure 2.1: Usage of int Data Type

The function type(variable_name) is used to determine the data
type of the specified variable.
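
A minimal sketch of working with int values and the type() function; the variable names are illustrative assumptions.

```python
items_in_stock = 300   # count of items in the inventory
adjustment = -80       # integers can also be negative

print(type(items_in_stock))          # <class 'int'>
print(items_in_stock + adjustment)   # 220
```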

float Data Type

This data type is used to store numbers that have a decimal point. These
numbers can be positive, negative, or zero. Some examples of decimal numbers
are 5.5, 25.363E2, 300.50, -80.75, and -60.35. The letter ‘E’ indicates
scientific notation; for example, 25.363E2 means 25.363 multiplied by 10 to the
power of 2, that is, 2536.3. When such a value is assigned to a variable, Python
treats it as an object of the built-in float type. Note that float values are
stored in binary with about 15 to 17 significant decimal digits, so some
decimal fractions are only approximated.

A floating-point number may find its use in:

● Scientific calculations where accuracy to many decimal places is
  essential
● Financial applications where the numbers may be big, and the interest
  calculation must be accurate

Figure 2.2 shows the usage of the float data type.

Figure 2.2: Usage of float Data Type
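
A short sketch of float values, including scientific notation and the approximation that binary floating point introduces. The variable names are assumptions made for illustration.

```python
unit_price = 300.50
scaled = 25.363E2          # scientific notation: 25.363 * 10**2

print(type(unit_price))    # <class 'float'>
print(scaled)              # 2536.3

# Some decimal fractions cannot be represented exactly in binary.
print(0.1 + 0.2 == 0.3)    # False
```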

complex Data Type

Complex numbers have two parts: one real and the other imaginary. They can
be represented as p+/-nj, where p and n are real numbers and j is the
imaginary unit. Examples of complex numbers are 3-8j, 9j, -5j, and 2-21j.
When such a value is assigned to a variable, Python treats it as an object of
the built-in complex type.

Complex numbers find their application in fields such as quantum physics and
light theory.

Figure 2.3 shows the usage of complex numbers in Python.

Figure 2.3: Usage of complex Data Type
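
A minimal sketch of complex values; the names z and w are assumptions for illustration.

```python
z = 3 - 8j
print(type(z))        # <class 'complex'>
print(z.real)         # 3.0
print(z.imag)         # -8.0

w = complex(2, -21)   # equivalent to writing 2-21j
print(z + w)          # (5-29j)
```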

2.1.2 String Data Type

String is a text data type that is used to represent a sequence of characters.
It can be used to represent data such as names, addresses, messages, and
comments. A few examples of strings are "Larissa", "123 Main Avenue,
somecity, 12345", "Welcome to the world of learning", and "Welcome to
the world of computers". In Python, strings are objects of the built-in str
type; no declaration is required. String values are enclosed inside single or
double quotation marks. Figure 2.4 shows the usage of the str data type.

Figure 2.4: Usage of str Data Type
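
A brief sketch of string values written with single and double quotation marks; the variable names are illustrative assumptions.

```python
name = 'Larissa'
address = "123 Main Avenue, somecity, 12345"

print(type(name))     # <class 'str'>
print(name[0])        # L  (the first character has index 0)
print(len(name))      # 7
print(name + " lives at " + address)
```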

2.1.3 Sequence Data Type

A sequence is an ordered group of items whose elements can be accessed using
their positions. Python allows the items in a sequence to be of the same type
or of different types. To access an item, use the name of the sequence along
with the index that represents the position of the element. In Python, the
index of a sequence begins from 0. Python supports three types of sequences:

List

A list is a collection of items that can be referenced using a single name. It is
ordered. Items can be added, modified, or removed from a list at any time. A
list allows duplicate values to be present. It can have items of the same data
type or different data types, such as int, float, and str in a single list.
Lists are represented by square brackets []. An example is a shopping list
such as ["apple", "cabbage", "tomatoes", "potatoes"] or a wish list. A list can
also be used to hold a record such as employee details that include name, age,
address, and phone number. For example, ["Larissa Smith", 45, "123 Main
Avenue, somecity, 12345", "123-456-789"].

Figure 2.5 shows the usage of lists in Python code.

Figure 2.5: Usage of List in Python
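
A minimal sketch of creating and changing a list; the list name and the replacement item are assumptions for illustration.

```python
shopping = ["apple", "cabbage", "tomatoes", "potatoes"]
print(shopping[0])            # apple (the first item has index 0)

shopping.append("apple")      # duplicates are allowed
shopping[1] = "carrot"        # modify an item in place
shopping.remove("potatoes")   # remove an item by value
print(shopping)               # ['apple', 'carrot', 'tomatoes', 'apple']
```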

Tuple

Tuples are similar to lists and can be used to store ordered collections of
items. The difference between the two is that the items of a tuple cannot be
changed once the tuple is created. Tuples allow duplicate values to be stored.
Tuples are represented by parentheses (). Examples of tuples are ("apple",
"cabbage", "tomatoes", "potatoes") and ("Larissa Smith", 45,
"123 Main Avenue, somecity, 12345", "123-456-789").

Figure 2.6 shows the usage of a tuple in Python.

Figure 2.6: Usage of Tuple in Python
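
A short sketch showing that tuple items can be read but not reassigned; the tuple contents are taken from the example above and the variable name is an assumption.

```python
employee = ("Larissa Smith", 45, "123-456-789")
print(employee[1])    # 45

try:
    employee[1] = 46  # tuples do not support item assignment
except TypeError as error:
    print("Cannot modify a tuple:", error)
```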

Range

Range is a data type that represents an immutable sequence of integers. It is
usually used in loops to keep track of an increasing or decreasing counter.
By default, the sequence starts from 0 and stops before the specified value.
For example, range(10) represents the integers from 0 to 9.

Figure 2.7 shows the usage of Range in Python.

Figure 2.7: Usage of Range in Python
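
A minimal sketch of range() with one, two, and three arguments; converting to a list is done here only to display the values, and the variable names are assumptions.

```python
first_ten = list(range(10))        # start defaults to 0, stop is excluded
evens = list(range(2, 11, 2))      # start, stop, step
countdown = list(range(5, 0, -1))  # a negative step counts down

print(first_ten)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(evens)       # [2, 4, 6, 8, 10]
print(countdown)   # [5, 4, 3, 2, 1]
```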

2.1.4 Mapping (Dictionary) Data Type

Mapping data type is used for variables that store data in key-value pairs.
Python supports only one built-in mapping data type, which is dictionary. An
example of dictionary data is {"name": "Larissa", "age": 45, "address": "123
Main Avenue, somecity, 12345", "phone": "123-456-789"}. In this
example, name, age, address, and phone are the keys. The data Larissa, 45,
123 Main Avenue, somecity, 12345, and 123-456-789 are the values.
Dictionaries are created using curly braces {} or the built-in dict() function.

Dictionaries are changeable (or mutable) and, from Python 3.7 onward, preserve
the order in which items are inserted. Keys must be unique without any
duplicates; values can be of any type.

Figure 2.8 shows the usage of a dictionary in Python.

Figure 2.8: Usage of Dictionary Data Type
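
A brief sketch of reading, modifying, and adding dictionary entries; the dictionary name and the added key are assumptions for illustration.

```python
person = {"name": "Larissa", "age": 45, "phone": "123-456-789"}
print(person["name"])        # Larissa

person["age"] = 46           # change the value of an existing key
person["city"] = "somecity"  # add a new key-value pair
print(list(person.keys()))   # ['name', 'age', 'phone', 'city']
```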

2.1.5 Boolean Data Type

A variable of Boolean data type can take two values: True or False. It
represents the truth of the given expression. For example, 100 > 200 is False,
but 100 < 200 is True. In Python, Boolean values are objects of the built-in
bool type.

Figure 2.9 shows the usage of the Boolean data type.

Figure 2.9: Usage of Boolean Data Type
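
A minimal sketch of Boolean values produced by comparisons; the variable names are assumptions.

```python
is_greater = 100 > 200
is_smaller = 100 < 200

print(is_greater)         # False
print(is_smaller)         # True
print(type(is_greater))   # <class 'bool'>
```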

2.1.6 Set

Similar to a list or a tuple, a set is a collection of multiple items stored in
a single variable. Unlike them, a set is unordered, so its items cannot be
accessed by index. A set is defined using curly braces {}. Examples
of a set can be {"brinjal", "cabbage", "tomato", "potato"} and
{"Larissa Smith", 45, "123 Main Avenue, somecity, 12345", "123-456-789"}.

After a set is created, the items in the set cannot be changed. However, items
can be added to or removed from a set. The set cannot have duplicate
values. If there is a duplicate value, only one value is considered. Note that 1
and True are considered as duplicate values. Similarly, 0 and False are
considered as duplicates. A set can be created from a list, tuple, or string
using the built-in set() function.

Figure 2.10 shows the usage of sets in Python.

Figure 2.10: Usage of Set Data Type
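
A short sketch of set behavior, including duplicate handling; sorted() is used only to print the unordered items in a predictable order, and the names are assumptions.

```python
vegetables = {"brinjal", "cabbage", "tomato", "cabbage"}
print(len(vegetables))         # 3 (the duplicate is kept only once)

vegetables.add("potato")
vegetables.discard("brinjal")
print(sorted(vegetables))      # ['cabbage', 'potato', 'tomato']

mixed = {1, True, 0, False}    # 1/True and 0/False count as duplicates
print(len(mixed))              # 2

letters = set("hello")         # build a set from a string
print(sorted(letters))         # ['e', 'h', 'l', 'o']
```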

2.2 Operators

Operators help to perform specific operations on variables and values, known
as operands. As in any programming language, Python supports several
operators as shown in Figure 2.11.

Figure 2.11: Operators in Python

2.2.1 Arithmetic Operators

Arithmetic operators are used to perform mathematical operations such as
addition, subtraction, multiplication, and division. These operators take two
operands, each of which can be a value or a variable with a value assigned.
Table 2.1 lists the various arithmetic operators.

Operator  Description
+         Used to add two operands
-         Used to subtract two operands
*         Used to multiply two operands
/         Used to divide the two operands
%         Used to divide two operands and return the remainder
**        Used to raise the first operand to the power of the second operand
//        Used to divide two operands and round the result down to the
          nearest integer

Table 2.1: Arithmetic Operators

Figure 2.12 shows examples of arithmetic operators.

Figure 2.12: Using Arithmetic Operators
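
The operators in Table 2.1 can be sketched as follows; the operand values 17 and 5 are assumptions chosen so that every operator gives a distinct result.

```python
a, b = 17, 5

print(a + b)      # 22
print(a - b)      # 12
print(a * b)      # 85
print(a / b)      # 3.4  (division always returns a float)
print(a % b)      # 2    (remainder)
print(a ** 2)     # 289  (17 to the power of 2)
print(a // b)     # 3    (rounded down)
print(-17 // 5)   # -4   (rounding is toward negative infinity)
```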

2.2.2 Comparison Operators

Comparison operators are used to compare two operands to determine if
both operands are equal or if one operand is greater than or less than the
other operand. The comparison result is returned as a Boolean value: True or
False. Table 2.2 shows the various comparison operators.

Operator  Description
==        This operator checks whether both operands are equal. It returns
          True if they are equal and False if they are not equal.
!=        This operator checks whether the operands are not equal. It returns
          True if they are not equal and False if they are equal.
>         This operator checks whether the first operand is greater than the
          second operand. It returns True if the first operand is greater
          than the second operand; else it returns False.
<         This operator checks whether the first operand is less than the
          second operand. It returns True if the first operand is less than
          the second operand; else it returns False.
>=        This operator checks whether the first operand is greater than or
          equal to the second operand. It returns True if the first operand
          is greater than or equal to the second operand; else it returns
          False.
<=        This operator checks whether the first operand is less than or
          equal to the second operand. It returns True if the first operand
          is less than or equal to the second operand; else it returns False.

Table 2.2: Comparison Operators

Figure 2.13 shows examples of the comparison operators.

Figure 2.13: Using Comparison Operators
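
A minimal sketch of the comparison operators; the operand values are assumptions for illustration.

```python
a, b = 100, 200

print(a == b)   # False
print(a != b)   # True
print(a > b)    # False
print(a < b)    # True
print(a >= 100) # True
print(b <= 150) # False
```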

2.2.3 Logical Operators

Logical operators work on logical expressions to return a Boolean value as
a result. The and and or operators take two expressions, while the not
operator takes one. The three logical operators in Python are listed in
Table 2.3.

Operator  Description
and       Performs an AND operation and returns True only if both the
          logical expressions evaluate to True
or        Performs an OR operation and returns True if any of the logical
          expressions evaluate to True
not       Reverses the result of the operand

Table 2.3: Logical Operators

Figure 2.14 shows examples of logical operators.

Figure 2.14: Using Logical Operators
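
A short sketch of the logical operators; the variables marks and attendance and their thresholds are hypothetical.

```python
marks, attendance = 85, 70

both_ok = marks > 80 and attendance > 75     # one condition fails
either_ok = marks > 80 or attendance > 75    # one condition holds
not_high = not marks > 80                    # reverses True to False

print(both_ok)     # False
print(either_ok)   # True
print(not_high)    # False
```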

2.2.4 Bitwise Operators

Bitwise operators perform operations on integers. The operands are treated as
binary numbers, the operation is performed on the corresponding bits, and the
result is returned as an integer.

Table 2.4 shows the list of bitwise operators in Python.

Operator  Description                                   Example
                                                        Decimal       Binary
&         Performs an AND operation on the binary       9             1001
          bits and returns 1 only if the bits of        5             0101
          both the operands are 1                       9 & 5 = 1     0001
|         Performs an OR operation on the binary        9             1001
          bits and returns 1 if the bits of any of      5             0101
          the operands is 1                             9 | 5 = 13    1101
~         Inverts the bits of the specified operand     5             0101
                                                        ~5 = -6       1010 (two's complement)
^         Performs an Exclusive OR (XOR) operation on   9             1001
          the binary bits and returns 1 if the          5             0101
          corresponding bits of the operands are        9 ^ 5 = 12    1100
          different but returns 0 if the
          corresponding bits are the same
>>        Shifts the bits of the first operand to the   9             1001
          right by the number of bits specified by      9 >> 1 = 4    0100
          the second operand
<<        Shifts the bits of the first operand to the   1             0001
          left by the number of bits specified by       1 << 2 = 4    0100
          the second operand

Table 2.4: Bitwise Operators

For the Bitwise NOT (~) operator, when the bits are inverted, the sign
will also be inverted. For an integer x, the result is -(x + 1). So, a
positive integer will result in a negative integer; for example, ~5 is -6.

Figure 2.15 shows examples of bitwise operators.

Figure 2.15: Using Bitwise Operators
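
The examples in Table 2.4 can be reproduced with the following sketch. The use of format() to display the binary form is an addition made here for illustration.

```python
a, b = 9, 5
print(format(a, "04b"), format(b, "04b"))   # 1001 0101

print(a & b)    # 1   (0001)
print(a | b)    # 13  (1101)
print(a ^ b)    # 12  (1100)
print(~b)       # -6  (same as -(5 + 1))
print(a >> 1)   # 4   (0100)
print(1 << 2)   # 4   (0100)
```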

2.2.5 Assignment Operators

You can use assignment operators to assign values to variables. These values
can also be computed values resulting from an arithmetic or logical operation.
Table 2.5 lists various assignment operators available in Python.

Operator  Description
=         Assigns the value specified on the right of the operator to the
          variable specified on the left
+=        Adds the operands specified on the right and left and assigns the
          result to the variable on the left
-=        Subtracts the operand specified on the right of the operator from
          the operand specified on the left and assigns the result to the
          variable on the left
*=        Multiplies the operands specified on the right and left and assigns
          the result to the variable on the left
/=        Divides the operand specified on the left of the operator by the
          operand specified on the right and assigns the result to the
          variable on the left
**=       Raises the operand specified on the left of the operator to the
          power of the operand specified on the right and assigns the result
          to the variable on the left
//=       Divides the operand specified on the left of the operator by the
          operand specified on the right, rounds down the result, and then
          assigns the result to the variable on the left
%=        Divides the operand specified on the left of the operator by the
          operand specified on the right and then assigns the remainder to
          the variable on the left

Table 2.5: Assignment Operators

Figure 2.16 shows examples of assignment operators.

Figure 2.16: Using Assignment Operators
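
A minimal sketch that applies the operators in Table 2.5 one after another to a single variable; the variable name and starting value are assumptions.

```python
total = 10
total += 5     # 15
total -= 3     # 12
total *= 2     # 24
total /= 4     # 6.0  (division produces a float)
total **= 2    # 36.0
total //= 5    # 7.0
total %= 4     # 3.0
print(total)   # 3.0
```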

Python also includes bitwise assignment operators. These operators are listed
in Table 2.6.

Operator  Description
&=        Performs bitwise AND operation on the operands and assigns the
          result to the variable on the left
|=        Performs bitwise OR operation on the operands and assigns the
          result to the variable on the left
^=        Performs bitwise XOR operation on the operands and assigns the
          result to the variable on the left
>>=       Shifts the bits of the variable on the left to the right by the
          number of bits specified on the right and assigns the result to
          the variable on the left
<<=       Shifts the bits of the variable on the left to the left by the
          number of bits specified on the right and assigns the result to
          the variable on the left

Table 2.6: Bitwise Assignment Operators

Figure 2.17 shows examples of bitwise assignment operators.

Figure 2.17: Using Bitwise Assignment Operators
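
A short sketch of the bitwise assignment operators applied in sequence; the variable name flags and the operand values are assumptions.

```python
flags = 9      # 1001
flags &= 5     # 1001 & 0101 = 0001 -> 1
flags |= 4     # 0001 | 0100 = 0101 -> 5
flags ^= 1     # 0101 ^ 0001 = 0100 -> 4
flags <<= 2    # 0100 shifted left by 2  -> 16
flags >>= 3    # 10000 shifted right by 3 -> 2
print(flags)   # 2
```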

2.2.6 Membership Operators

These operators are used to determine whether a value or variable exists or
does not exist in a given sequence or collection, such as a string, tuple,
list, set, or dictionary. Table 2.7 shows the list of membership operators.

Operator  Description
in        This operator checks whether the specified value or variable exists
          in the given sequence. It returns True if the value or variable
          exists in the sequence; else, it returns False.
not in    This operator checks whether the specified value or variable does
          not exist in the given sequence. It returns True if the value or
          variable does not exist in the sequence; else, it returns False.

Table 2.7: Membership Operators

Figure 2.18 shows examples of membership operators.

Figure 2.18: Using Membership Operators
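
A minimal sketch of the membership operators on a list, a string, and a dictionary; the sample data is an assumption. For a dictionary, the check applies to the keys.

```python
num_list = [11, 15, 21]
person = {"name": "Larissa"}

print(15 in num_list)         # True
print(12 not in num_list)     # True
print("py" in "python")       # True  (substring check)
print("name" in person)       # True  (keys are checked)
print("Larissa" in person)    # False (values are not checked)
```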

2.2.7 Identity Operators

These operators compare the memory location of the two specified values or
variables to determine whether they are the same or not. Table 2.8 lists two
identity operators supported in Python.

Operator  Description
is        This operator checks whether the memory location referred to by
          the variables specified on the right and left is the same. It
          returns True if the referred memory location is the same; else, it
          returns False.
is not    This operator checks whether the memory location referred to by
          the variables specified on the right and left is not the same. It
          returns True if the referred memory location is not the same;
          else, it returns False.

Table 2.8: Identity Operators

Figure 2.19 shows examples of identity operators.

Figure 2.19: Using Identity Operators
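
A short sketch contrasting identity (is) with equality (==); the variable names are assumptions.

```python
first = [1, 2, 3]
second = first        # both names refer to the same list object
third = [1, 2, 3]     # equal contents, but a separate object

print(first is second)       # True
print(first is third)        # False
print(first == third)        # True  (values are equal)
print(first is not third)    # True
```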

2.3 Control Flow Statements

A program usually executes the code from top to bottom. If developers want
to alter the flow of execution or skip some lines of code depending on
specified conditions, they can use control flow statements. Python supports
three types of control flow statements: conditional, iterative, and transfer
statements.

2.3.1 Conditional Statements

As the name suggests, conditional statements specify a condition and alter
the program execution based on whether the condition is met or not met. If
the condition is met, a set of code is executed and if the condition is not met,
a different set of code is executed.

Let us take a look at the various conditional statements.

if Statement

The if statement is used when a particular set of code must be executed only
if the specified condition is met; else the set of code must be ignored. The
syntax for the if statement is:

if <condition>:
    <code to be executed if condition is met>

Consider that the developer wants to display the message “Excellent” only
if the student has scored more than 90 marks. The developer can use the code
as shown in Figure 2.20.

Figure 2.20: Using the if Statement

In Figure 2.20, the first print statement is executed because the condition a >
90 is met. However, the second print statement is not executed because the
condition a < 90 is not met.
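
The behavior described for the two conditions can be sketched as follows. The score value of 95, the second message, and the shown list are assumptions made for illustration.

```python
a = 95
shown = []

if a > 90:
    print("Excellent")            # executed: the condition is met
    shown.append("Excellent")

if a < 90:
    print("Keep practicing")      # skipped: the condition is not met
    shown.append("Keep practicing")
```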

if-else Statement

This statement is an extension of the if statement. In the if statement, a block
of code is executed if the condition is met. What if the developer wants to
execute another block of code if the condition is not met? In this case, the
developer can use the if-else statement. The syntax for the if-else
statement is:

if <condition>:
    <code to be executed if condition is met>
else:
    <code to be executed if condition is not met>

Figure 2.21 shows an example of an if-else statement.

Figure 2.21: Using the if-else Statement

In the code shown in Figure 2.21, the print statement associated with the if
statement is ignored because the condition is not met. Therefore, the print
statement associated with the else statement is executed to give the result.
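
A minimal sketch of an if-else statement in which the else block runs; the marks value and the messages are assumptions.

```python
marks = 75

if marks > 90:
    result = "Excellent"
else:
    result = "Keep practicing"   # runs because marks > 90 is False

print(result)   # Keep practicing
```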

if-elif-else Statement

This statement is a variation of the if-else statement. With the if-else
statement, developers can check for one condition and execute blocks of
code based on whether the condition is met or not. What if the developer
wants to execute different blocks of code based on different conditions? In
this case, developers can use the if-elif-else statement. The syntax of the
if-elif-else statement is:

if <condition1>:
    <code to be executed if condition1 is met>
elif <condition2>:
    <code to be executed if condition2 is met>
elif <condition3>:
    <code to be executed if condition3 is met>
...
else:
    <code to be executed if none of the conditions are met>

Figure 2.22 shows an example of the if-elif-else statement.

Figure 2.22: Using the if-elif-else Statement
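
A short sketch of an if-elif-else chain. The grade boundaries used here are hypothetical and are not taken from the source. Conditions are checked from top to bottom, and only the first matching block runs.

```python
marks = 75

if marks >= 90:
    grade = "A"
elif marks >= 70:
    grade = "B"      # first condition that is met
elif marks >= 50:
    grade = "C"
else:
    grade = "F"

print(marks, grade)  # 75 B
```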

match-case Statement

The match-case statement, available from Python 3.10 onward, is similar to the
if-elif-else statement. That is, it checks for multiple conditions and executes
the relevant block of code. The only difference is that it compares the value
of one expression against multiple values. For example, if the Math score of a
student is checked, the developer can execute different sets of code for
different scores. The syntax for the match-case statement is:

match <expression>:
    case <value1>:
        <code to be executed if the expression matches value1>
    case <value2>:
        <code to be executed if the expression matches value2>
    case <value3>:
        <code to be executed if the expression matches value3>
    ...
    case _:
        <code to be executed if none of the case values match>

Figure 2.23 shows an example of the match-case statement.

Figure 2.23: Using the match-case Statement

2.3.2 Iterative Statements

An iterative statement is used when a block of code must be executed
repeatedly, either once for each item in a sequence or as long as a condition
holds. These statements are also called loops.

for Statement

The for statement is used to repeat a set of code for each value in a sequence
such as a list, tuple, range, set, or dictionary. For example, consider that there
is a list of numbers and the developer must identify the even numbers from the
list and print them. In this case, the developer must take the first number, divide
the number by 2, and check if the remainder is zero. If the remainder is zero,
the number must be printed. The same process must be repeated for the
second number and so on until all the numbers in the list are checked. In this
case, the developer can use the for statement.

The syntax for the for statement is:

for value in sequence:
    <set of code to be executed>

Figure 2.24 shows an example of the for statement.

Figure 2.24: Using the for Statement
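
The even-number check described above can be sketched as follows; the list contents are assumptions.

```python
numbers = [11, 24, 37, 40, 58]
even_numbers = []

for number in numbers:
    if number % 2 == 0:        # remainder zero means the number is even
        print(number)
        even_numbers.append(number)
```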

while Statement

The while statement is used to repeat a set of code statements as long as the
specified condition evaluates to true. The loop stops when the specified
condition evaluates to false. For example, consider that a developer wants to
identify the odd and even numbers between 0 and 10 to print appropriate
messages. The developer can use the while statement to execute the code
repeatedly as long as the value of the condition variable is less than or equal
to 10.

The syntax for the while statement is:

while <condition>:
    <code to be executed>

The specified condition must evaluate to false at some point in the code
execution. Otherwise, the code will just go on executing and the execution will
not come out of the loop, thereby creating an infinite loop. The loop must then
be stopped manually, resulting in the program not executing completely.

Figure 2.25 shows an example of the while statement.

Figure 2.25: Using the while Statement
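
A minimal sketch of the odd and even example using a while loop; the variable names are assumptions. Incrementing the counter inside the loop is what eventually makes the condition false.

```python
number = 0
labels = []

while number <= 10:
    label = "even" if number % 2 == 0 else "odd"
    print(number, "is", label)
    labels.append(label)
    number += 1    # without this line the loop would never end
```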

2.3.3 Transfer Statements

While a program is executing, transfer of control to some other part of the
program may be required based on the occurrence of some event. In such
cases, transfer statements are used. These statements can be used with
iterative or loop statements and functions.

break Statement

If a developer wants to terminate a loop when a particular value is reached,
the developer can use the break statement. When the break statement is
encountered, the program passes the execution to the next line of code after
the loop.

For example, consider that the developer is using a for loop for iterating
through a list of student names and printing Hello <student name> for each
student name. The developer wants to terminate the loop when it
encounters the name, Ben. To do so, the developer can use the break
statement, as shown in Figure 2.26.

Figure 2.26: Using the break Statement
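
The scenario described above can be sketched as follows; the names other than Ben, and their order, are assumptions.

```python
students = ["Angela", "Tom", "Ben", "Sara"]
greeted = []

for student in students:
    if student == "Ben":
        break                  # leave the loop immediately
    print("Hello", student)
    greeted.append(student)

print("Loop ended")            # first line executed after the loop
```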

continue Statement

If a developer has created a loop to iterate through a sequence and wants to
skip some values, the developer can use the continue statement. When a
program encounters the continue statement, it skips the rest of the current
iteration and continues the loop with the next value.

For example, consider that the developer is using a for loop for iterating
through a list of student names and printing Hello <student name> for each
student name. The developer wants to skip the name Angela from the list and
proceed with the next name. To do so, the developer can use the continue
statement, as shown in Figure 2.27.

Figure 2.27: Using the continue Statement
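
A minimal sketch of the continue scenario; the names other than Angela are assumptions.

```python
students = ["Angela", "Tom", "Ben", "Sara"]
greeted = []

for student in students:
    if student == "Angela":
        continue               # skip this name, move to the next one
    print("Hello", student)
    greeted.append(student)
```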

pass Statement

If the developer wants to reserve some space in the program code to add the
required code sometime later, the developer can use the pass statement. The
pass statement does nothing when executed; it ensures that no syntax error
occurs and the code runs successfully even if a block of code has not been
completed. For example, consider that the developer is planning to include an
if statement but will do it at a later point in time. In this case, the
developer can use the pass statement as shown in Figure 2.28.

Figure 2.28: Using the pass Statement
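
A short sketch of pass used as a placeholder; the variable names are assumptions. An if block with no statement at all would be a syntax error, so pass keeps the code valid until the block is written.

```python
marks = 95

if marks > 90:
    pass    # placeholder: the real code will be added later

reached_end = True
print("Program continues after the empty block")
```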

2.4 Summary

⮚ Data types help to determine the type of value or data a variable can
hold.
⮚ Data types in Python can be categorized into numeric, string, sequence,
dictionary, boolean, and set.
⮚ Operators help to perform specific operations on variables and values,
known as operands.
⮚ Arithmetic, comparison, logical, bitwise, assignment, membership, and
identity are some of the operators.
⮚ Control flow statements help in controlling the flow of program
execution.
⮚ The control flow statements supported by Python are conditional,
iterative, and transfer statements.
⮚ The break, continue, and pass statements are used to transfer the
program execution to some part of the program based on the
occurrence of some event.

Test Your Knowledge

1. Which of the following datatypes in Python are ordered collections of
   elements?

a. Tuple
b. Set
c. List
d. Dictionary

2. Which of the following datatypes in Python do not allow duplicate
   values to be stored?

a. Tuple
b. Set
c. List
d. Dictionary

3. What is the output of the given code?

number = [11, 23, 56, 98, 78]
number[1] = "integer"
print(number)

a. ['integer', 11, 23, 56, 98, 78]
b. ['integer', 23, 56, 98, 78]
c. [11, 'integer', 56, 98, 78]
d. [11, 'integer', 23, 56, 98, 78]

4. Which of the following codes can be inserted on line 3 to get the output
as True?
1 num_list = [11, 15, 21, 29, 50, 70]
2 number = 12
3 # insert code here

a. print(number not in num_list)
b. print(number in num_list)
c. print(number is num_list)
d. print(number is not num_list)

5. What is the output of the given code?
num = 1
for num in range(1, 4):
    print(num * 2, end=' ')

a. 2 4 6 8
b. 4 6
c. 2 8
d. 2 4 6

Answers to Test Your Knowledge

1 a, c, d
2 b, d
3 c
4 a, d
5 d

Try it Yourself

Open Jupyter Notebook and perform the following tasks:

1. Write Python code to create two lists, list1 and list2, where
list1 = ["car", "cycle", "bus", "car", "scooter"] and list2 is
empty. Add the elements from list1 to list2, such that list2
contains only the unique elements (remove duplicates). Print the
elements of list2.

2. Write a Python code to create tuple T1 which consists of numbers 2, 4,
6, and a list L1. L1 contains the numbers 1, 3, and 5. Modify L1 in the
tuple such that the tuple consists of elements 2, 4, 6, [1, 4, 5]. Print the
tuple T1.

3. Write a Python code to create a dictionary named emp_dict with the
key: value pair as:
Name: Robert
Age: 45
Designation: Software Engineer
Skills: [C, Java, Ruby]
Modify a value in the dictionary for the key Designation as
“Project Lead” and append the skill value “Python” in the Skills
list. Print the dictionary emp_dict.

4. In a class, the grade for the student in an examination is awarded based
on the marks as given in Table 2.9.

Table 2.9: Grades

Write a Python code to create a list named mark_list with the marks
35, 75, 86, and 98. Use conditional and iterative statements to print the
grade of the students based on the marks given in the mark_list. The
output should consist of mark and grade.

5. Consider the Code Snippet:
if product_result > 300:
    pass
else:
    print(product_result/10)
What is the output of the code if:
(i) product_result=250
(ii) product_result=480

6. Find the output of two Code Snippets given here:

(i).
var_num = 1
while var_num < 5:
    var_num = var_num+1
    if var_num == 2:
        continue
    print(var_num)

(ii).
var_num = 0
while var_num < 5:
    var_num = var_num+1
    if var_num == 2:
        break
    print(var_num)

SESSION 3
Introduction to Functions

Learning Objectives

In this session, students will learn to:

⮚ Describe functions and their types
⮚ Explain how to create and call a function
⮚ Describe variables and their scopes in a function
⮚ Explain parameters and arguments
⮚ Explain arguments and their types

A function in a programming language is a named block of code that
performs a specific task. By using functions, developers can break down a
large program into smaller, more manageable pieces that can be reused. For
example, in a banking application, before initiating fund transfers to another
account and processing withdrawals, a minimum balance check must be
performed. Instead of repeating the snippet of code that checks for minimum
balance for every action, it can be placed inside a function and reused. This
function can be called from any part of the program.

This session will provide an overview of functions and their types. It will also
cover the syntax and structure of functions. It will explain how to create
and call a function in Python. This session will also cover the parameters of
functions. Finally, it will describe the different types of arguments available
in Python.

3.1 Functions

A function is a set of instructions that enables developers to reuse, organize,
and structure the code. Functions:

● Make the code more readable and comprehensible
● Increase code reusability and thereby reduce code duplication
● Improve code efficiency and performance

There are two types of functions in Python.

3.1.1 Built-in Function

Python provides a set of pre-defined functions that are readily available to
developers. Such functions are called built-in functions. Built-in functions do not
require any additional imports or external libraries. Table 3.1 lists some of the
built-in functions available in Python.

Function  Description                               Example
Name
abs       Returns the absolute value, ignores       abs(-3.2) returns 3.2
          the sign of the number
bool      Returns the Boolean value of the          bool(3>1) returns True
          parameter
filter    Returns filtered values based on          list(filter(check_for_number,
          pre-defined instructions                  ['a', 'b', 1, 2])) returns [1, 2],
                                                    where check_for_number is a
                                                    user-defined function which
                                                    checks for numbers
help      Provides interactive help about           help(object) returns interactive
          classes, functions, and objects           documentation of object
id        Returns the identity of the object        id('object') returns the unique id
                                                    of the string, object
input     Prompts for user input, reads it, and     userInput = input("Enter a value")
          returns it as a string                    print("userInput = :", userInput)
                                                    If the user enters Hello as input,
                                                    this code prints
                                                    userInput = : Hello
int       Returns the integer value of the given    int(9.2) returns 9
          value
max       Returns the maximum of the given values   max(1,3,6,0) returns 6
min       Returns the minimum of the given values   min(1,3,6,0) returns 0

Table 3.1: Built-in Functions
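
Some of the built-in functions in Table 3.1 can be tried with the following sketch. Note that function names in Python are case-sensitive and are written in lowercase. The helper check_for_number is a hypothetical user-defined function written here only for the filter example.

```python
print(abs(-3.2))          # 3.2
print(bool(3 > 1))        # True
print(int(9.2))           # 9
print(max(1, 3, 6, 0))    # 6
print(min(1, 3, 6, 0))    # 0

def check_for_number(value):
    """Hypothetical helper: True for int or float values."""
    return isinstance(value, (int, float))

# filter() returns an iterator, so list() is used to see the values.
numbers_only = list(filter(check_for_number, ["a", "b", 1, 2]))
print(numbers_only)       # [1, 2]
```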

3.1.2 User-defined Function

Apart from the built-in functions provided by Python, developers can create
customized functions that perform specific tasks. Such functions are called
user-defined functions. Python provides the def keyword to define a
user-defined function. The syntax to create a function is:

def function_name(parameter1, parameter2):
    """docstring"""
    function_body
    return statement

● function_name: It is the user-defined name of the function.
● docstring: It is a string literal written as the first statement of a
function; it can hold a short description of the function. It is optional.
● parameter: The names listed within parentheses in the function definition
are called parameters; they receive the values passed to the function. Any
number of parameters can be specified. Parameters are optional.
● function_body: The function body includes tasks that are to be performed
by the function. The statements in the function body are indented.
● return statement: The return statement passes the values obtained as a
result of the execution of the function, to the caller. The program returns
the control to the caller when it encounters a return statement.

The colon ( : ) after the closing parenthesis of the parameter list
indicates the beginning of the function body.
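
The parts of a function definition can be sketched with the minimum balance check mentioned at the start of this session. The function name, parameter names, and amounts are hypothetical.

```python
def has_minimum_balance(balance, minimum_balance):
    """Return True if balance is at least minimum_balance."""
    result = balance >= minimum_balance   # function body
    return result                         # value passed back to the caller

print(has_minimum_balance(1200, 500))     # True
print(has_minimum_balance(300, 500))      # False
print(has_minimum_balance.__doc__)        # prints the docstring
```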

3.2 Creating and Calling Functions

User-defined functions are an important feature of Python programming. These
functions can be created once and utilized many times by invoking their
names from any part of the program. User-defined functions can be created
with or without parameters and can return zero or more values upon
execution.

3.2.1 Function Without Parameters

Let us look at how a simple Python function named message is created. This
function displays the text 'Welcome to Basic Python programming lab'.
The function is called or invoked from another part of the program. The
function call statement includes only the name of the function followed by
parentheses. This function call does not pass any parameters to the function.

Figure 3.1 shows the code and the output to create and call a function.

Figure 3.1: Execution of Function Without Parameters
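
The function described above can be sketched as follows. This is a minimal
reconstruction based on the description; the exact code in the figure may
differ.

```python
def message():
    """Display a welcome message."""
    print('Welcome to Basic Python programming lab')

# The call uses only the function name followed by empty parentheses.
message()
```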

3.2.2 Function with Parameters

Functions in Python can be made more versatile when used with parameters.
Parameters are variables or values passed to a function when calling a
function. The code inside the function uses these parameters to accomplish
the desired task.

Let us look at how a function with parameters is created, called, and
executed. Consider a function named course_func that takes two
parameters, name and course_name, and prints them. The function definition
statement includes the def keyword followed by the function name and the
list of parameters enclosed within parentheses. The actual values for these
parameters are provided within parentheses in the function call. The number
of values in the function call must match the number of parameters in the
definition. Python does not check the types of the values at call time, so
the caller must pass values that the function body can work with.

Figure 3.2 shows the code and output of a parameterized function.

Figure 3.2: Execution of Function with Parameters

In this example, the function call statement passes the values 'Richard' and
'Basic Python and Machine Learning' to the function. The name parameter
is assigned the value ‘Richard’ and the course_name parameter is assigned
the value ‘Basic Python and Machine Learning’.
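
A minimal sketch of such a parameterized function is given here. The wording
of the printed lines is an assumption; only the function name, the parameter
names, and the values passed come from the description.

```python
def course_func(name, course_name):
    # The parameters receive the values supplied in the function call.
    print('Name:', name)
    print('Course:', course_name)

course_func('Richard', 'Basic Python and Machine Learning')
```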

3.2.3 Function with Parameters and Return Value

After executing the block of code inside a function, the return statement
passes the values obtained as the result of the execution of the function to the
caller.

Let us look at how a function with a return value is created, called, and
executed. Consider a function named multiply with a, b, and c as its
parameters. This function calculates the product of a, b, and c and stores the
result in a variable named product. It uses the return statement to pass the
value of product to its caller.

Figure 3.3 shows the code and the output that returns a value.

Figure 3.3: Execution of Function with Return Value

In this example, the function call statement passes three arguments to the
function. These arguments provide the values of a, b, and c as 2, 4, and 5,
respectively. The function calculates the product of these values and passes
the calculated value back to the caller using the return statement.
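
The multiply function described above can be sketched as follows. The name of
the variable that receives the result in the caller (result) is an assumption.

```python
def multiply(a, b, c):
    product = a * b * c
    return product          # hand the value back to the caller

result = multiply(2, 4, 5)
print('The product is', result)   # The product is 40
```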

The return statement in a function is optional. An empty return
statement terminates the execution of the function and returns
None to the caller. Multiple values can also be returned from a
function. In such cases, the return values are separated by commas
and reach the caller as a tuple.
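
The following sketch illustrates both points. The function names greet and
sum_and_product are hypothetical and chosen only for this illustration.

```python
def greet(name):
    if not name:
        return              # empty return: the caller receives None
    print('Hello,', name)

def sum_and_product(a, b):
    return a + b, a * b     # the two values are packed into one tuple

nothing = greet('')
total, product = sum_and_product(3, 4)   # the tuple is unpacked here
print(nothing, total, product)           # None 7 12
```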

3.3 Scope of Variables in a Function

Scope of a variable refers to the area of a program where the variable can be
referenced and accessed. Two types of variables in Python with respect to
their scope in a function are:

● Local variable
● Global variable

3.3.1 Local Variable

Variables that are declared inside a function are visible and accessible only
inside the function. Such variables are called local variables. They cannot be
accessed from outside the function. Local variables are created when the
function is called and are destroyed when the function completes execution.

The lifetime of a variable refers to the period during which the variable is stored
in the memory. The value stored in a local variable gets destroyed when the
function completes execution. Each time a function is called, the local
variables take a new value depending on the arguments passed to the
function and the code block inside the function.

Consider the function named func_Add which accepts input1 as a
parameter. In the function body, two local variables, namely input2 and sum1,
are declared. This function calculates the sum of input1 and input2 and
stores the result in sum1. The function prints the value of sum1. The main code
block tries to calculate the difference between input1 and input2.

Figure 3.4 shows the Code Snippet and its execution.

Figure 3.4: Error Thrown While Accessing a Local Variable Outside its Scope

When the code in the main program tries to access the input2 variable
outside the func_Add function, an error is thrown. Figure 3.4 shows that the
program can access input2 inside the function, whereas the program throws
an error when it tries to access input2 outside the function.
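
The behavior can be reproduced with a sketch along these lines. The values 10
and 3 are assumptions, and the try block is added only so that the example
runs to completion instead of stopping at the error.

```python
def func_Add(input1):
    input2 = 3                      # local variable
    sum1 = input1 + input2          # local variable
    print('Sum inside the function:', sum1)

input1 = 10
func_Add(input1)

caught = False
try:
    print(input1 - input2)          # input2 does not exist out here
except NameError as error:
    caught = True
    print('Error:', error)
```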

3.3.2 Global Variable

Variables that are declared outside all functions, in the main code block,
are called global variables. These variables can be accessed from any part of
the program. Consider the function named func_Add1 which is similar to
func_Add except that the variable named input2 is declared outside the
function.

Figure 3.5 shows the code and the output of the program that uses a global
variable.

Figure 3.5: Function Accessing a Global Variable

Since input2 is declared outside the function, it is a global variable. Observe
that the code blocks inside and outside the function can access input2.

If a variable with the same name as a global variable is assigned inside a
function, Python creates a separate local variable that hides the global
variable within that function. The global variable itself remains unchanged.
In the previous example, consider that input2 is declared both outside the
function and inside the function with different values.

Figure 3.6 shows the code and its execution.

Figure 3.6: Global Variable Overridden by Local Variable

Here, the value of input2 is considered as 3 while calculating the sum of
input1 and input2 inside the function. However, the value of input2 is
considered as 10 while calculating their difference in the main block. This
shows that the locally declared value takes precedence only inside the
function and leaves the globally declared value unchanged.
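
A sketch of this situation is given here. The values 3 and 10 for input2 come
from the description, the value 15 for input1 is an assumption, and the
function returns the sum (in addition to printing it) only so that the result
can be checked.

```python
input2 = 10                          # global variable

def func_Add1(input1):
    input2 = 3                       # local variable that hides the global one
    sum1 = input1 + input2
    print('Sum inside the function:', sum1)
    return sum1

input1 = 15
func_Add1(input1)                                          # uses input2 = 3
print('Difference in the main block:', input1 - input2)   # uses input2 = 10
```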

3.4 Arguments

An argument is a variable or value that is passed to a function in a function
call. One or more arguments can be passed to a function. They are the
communication points between the main program and the function.
Parameters are placeholders for arguments inside the function. They take up
the values of the arguments inside the function. Both parameters and
arguments are enclosed within parentheses. Figure 3.7 explains the difference
between arguments and parameters.

Figure 3.7: Arguments and parameters

In Figure 3.7, the main program calls the MyFunc function with the arguments
arg1, arg2, ... argN. The control is transferred to the function to
execute the code block inside it. Inside the function, the parameters
paramtr1, paramtr2, paramtr3, ... paramtrN receive the values of the
arguments. paramtr1 takes the value of arg1, paramtr2 takes the value of
arg2, and so on. The four types of arguments in Python are default,
positional, keyword, and variable length arguments.

3.4.1 Default Arguments

Default arguments are used to provide a default value for a parameter in
cases where the argument is not passed during a function call. If the caller
does not provide a value for a parameter with a default value, the function
will use the default value defined in the function definition. Default arguments
are used in scenarios where a function is frequently called and the values of
certain arguments remain constant across all those calls.

For example, consider that an organization recruits new employees to a
certain department. When details of these employees are entered into the
database, all of them will carry the same department name. In such cases, the
department parameter may be given a default value in the function definition.
If required, this default argument can be overridden. In this scenario, consider
that a few employees are recruited for a different department. Here, the same
function can be called by passing the relevant department name while calling
the function. The default values get overridden by the values provided by the
arguments when the function is invoked. Code Snippet 1 lists the function code
to perform this:

Code Snippet 1:

def MyFunc_Default(empid, empname, department='Research'):
    print("The employee details are ", empid, ",",
          empname, ",", department)

Based on the requirement, the function call for this function can either specify
or not specify the department name in the argument list. If value for the
argument is not specified, the function will use the default value, ‘Research’,
defined in the function definition. If a value is specified, the function will use the
specified value. Code Snippet 2 lists the code to create function calls for both
cases:

Code Snippet 2:

MyFunc_Default(1, 'George Samuel')
MyFunc_Default(2, 'David Thomson', 'Tools')

The first call statement passes the value of arguments for the first two
parameters and does not pass the value for the department. The function
takes the default value, 'Research'. The second call statement passes the
values for all three parameters. The default argument 'Research' is overridden
by 'Tools'. Figure 3.8 shows the output of this program.

Figure 3.8: Default Arguments

All the parameters that follow a parameter with a default value must
also have default values. That is, developers cannot give a default
value to a parameter in the middle of the parameter list unless every
parameter after it also has one. This is because the arguments in a
function call are always mapped in sequence from the first parameter.
Any parameters at the end of the list for which no argument is
specified take their default values.

3.4.2 Positional Arguments

The order of arguments in a function call must match the order of
parameters in the function definition. Arguments that are matched to
parameters by their position in this way are called positional arguments. If
there is a mismatch in the order of arguments between the function definition
and function call, the function might provide incorrect outputs. Consider a
function named myFunction that takes the name and age of a person as input
parameters and displays this information. Code Snippet 3 lists the function
code to perform this:

Code Snippet 3:

def myFunction(name, age):
    print("My name is ", name, ", my age is", age)

When myFunction is called with the arguments 'Robert Shoemaker' and 32, in
that order, the function displays the expected output.

If the order of arguments is changed in the function call, then the function
displays incorrect output. Code Snippet 4 lists the code for both function calls:

Code Snippet 4:

myFunction('Robert Shoemaker', 32)
myFunction(32, 'Robert Shoemaker')

Figure 3.9 shows the outputs of both function calls. It can be observed that
the output of the second function call is not as expected.

Figure 3.9: Positional Arguments

3.4.3 Keyword Arguments

Keyword arguments are used to provide values of arguments in the function
call using their respective parameter names. The arguments and their values
can be considered as a key-value pair. The arguments are assigned values
with the help of the = sign. Keyword arguments are also called named
arguments. Keyword arguments can be in any order, but must follow the
non-keyword arguments.

Consider that the function named myFunction takes two parameters as input:
name and age of a person. This function is called with keyword arguments
where the values of the parameters are specified using their respective names.
The value 'Clayman Saw' is passed for the name parameter and 32 is passed
for the age parameter.

Figure 3.10 shows the code for the function and its output.

Figure 3.10: Keyword Arguments

If the keyword argument is placed before a non-keyword argument, the
program throws an error as shown in Figure 3.11.

Figure 3.11: Error – Positional Argument Follows Keyword Argument
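
The valid and invalid forms of a keyword call can be sketched as follows. The
function body mirrors Code Snippet 3; the invalid call is left as a comment
because Python rejects it before the program starts running.

```python
def myFunction(name, age):
    print("My name is", name, ", my age is", age)

myFunction(name='Clayman Saw', age=32)   # keywords in definition order
myFunction(age=32, name='Clayman Saw')   # keywords in a different order
myFunction('Clayman Saw', age=32)        # positional first, then keyword

# myFunction(name='Clayman Saw', 32)
# SyntaxError: positional argument follows keyword argument
```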

3.4.4 Variable Length Arguments

Variable length arguments are used in cases where the number of
arguments is not known upfront. Here, the function can take any number of
arguments. As the number of arguments is arbitrary, arguments of this type
are also called arbitrary arguments.

Two types of arbitrary arguments are:

● Arbitrary positional arguments
● Arbitrary keyword arguments

Arbitrary Positional Arguments

Positional arguments whose count is unknown in advance are known as
arbitrary positional arguments. Internally, the arguments are stored in the
form of a tuple. Consider that in a school, students can choose the number of
optional subjects as per their wish. The developer wishes to calculate the
total marks and average marks of each student using a function. While calling
the function, the number of arguments passed to the function will vary from
student to student depending on the number of optional subjects that they
have chosen. Arbitrary positional arguments are the best option to be used in
this case. The parameter in the function definition is defined with an
asterisk symbol preceding it.

Figure 3.12 shows the function code and its output.

Figure 3.12: Arbitrary Positional Arguments

The marks function takes subject as an arbitrary parameter. It is marked by a
preceding asterisk. The first function call statement passes two arguments and
the second function call statement passes three arguments.
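
A sketch of such a marks function is given here. The mark values are
assumptions, and the function returns the total and average (in addition to
printing them) only so that the results can be checked.

```python
def marks(*subject):
    # subject is a tuple holding every positional argument passed in
    total = sum(subject)
    average = total / len(subject)
    print('Marks:', subject, 'Total:', total, 'Average:', average)
    return total, average

marks(80, 90)        # two optional subjects
marks(70, 85, 95)    # three optional subjects
```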

Arbitrary Keyword Arguments

Keyword arguments whose count is not known in advance are called arbitrary
keyword arguments. The argument in the function definition is defined with two
asterisk symbols preceding it. Consider the example in the previous topic of
arbitrary positional arguments. Let the developer pass keyword arguments to
the marks function.

Figure 3.13 shows the code and the output of this function.

Figure 3.13: Arbitrary Keyword Arguments

Note that the values for subject name and marks are assigned in key-value
pair in the function call. In the first function call, the name ‘Catherine’ is
passed with three subjects and their respective marks. In the second function
call, the name ‘Josephine’ is passed with four subjects and their respective
marks.
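
The idea can be sketched as follows. The subject names, the mark values, and
the choice to pass the student name as a regular first parameter are
assumptions; the function returns the total only so that it can be checked.

```python
def marks(name, **subjects):
    # subjects is a dictionary built from the keyword arguments
    print('Student:', name)
    for subject, score in subjects.items():
        print(' ', subject, '=', score)
    return sum(subjects.values())

marks('Catherine', maths=90, physics=85, chemistry=80)
marks('Josephine', maths=75, physics=88, chemistry=92, biology=81)
```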

3.5 Summary

⮚ Functions are snippets of code that perform a specific task.
⮚ Python provides built-in functions and user-defined functions.
⮚ User-defined functions can be created with or without parameters.
⮚ Variables that are declared inside a code block and have limited
visibility are called local variables.
⮚ Variables that are declared in the main code block and can be
accessed throughout the program are called global variables.
⮚ Python provides four types of arguments namely default, keyword,
positional, and variable length.

Test Your Knowledge

1. Which of the following statements are true about functions?

a. A function begins with the statement def followed by function
name and parentheses
b. The starting code of a function is a colon ( : )
c. Curly brackets are allowed for the start and end of a function
d. The return statement marks the end of a function

2. Which of the following functions with default arguments are valid?

a. def fun1(x, y, z=4):
b. def fun1(x=2, y, z=4):
c. def fun1(x, y=6, z=4):
d. def fun1(x=8, y=6, z):

3. What is the output of the given code?

def show(**kwargs):
    for i in kwargs:
        print(i)
show(name="Alexander", age=20)

a. name
age
b. Alexander
20
c. name=Alexander
age=20
d. name=Alexander

4. What is the output of the given code?

def fun1(age):
    return age + 5
fun1(5)
print(age)

a. 5
b. 10
c. NameError
d. TypeError

5. What is the output of the given code?

def fun1_num(x):
    x=x+1
    x=100
    print(x)
    x=x+1
fun1_num(200)

a. 201
b. 101
c. 202
d. 100

Answers to Test Your Knowledge

1 a, b, d
2 a, c
3 a
4 c
5 d

Try it Yourself

Open Jupyter Notebook and perform the given tasks:

1. Write a Python function to add the numbers: 5, 3, 8, and 4. Pass these
numbers as arguments to the function.

2. Write a Python function that accepts three values from the console. The
function must return the difference between the first two values and the
product of the last two values.

3. Write a Python function max_num that accepts three numbers as
arguments and returns the largest number among the three.

4. Write a Python function to find and display only those words which have
more than four characters from the list named words. Consider that the
list words = ["Python", "Java", "Ruby", "Perl", "JavaScript"].

5. Consider the given code snippets and give their output:

(i)
x=20
def outer_fun():
    x = 10
    def inner_fun():
        y = 20
        print(x + y)
    inner_fun()
    print(x)
outer_fun()

(ii)
def function1(var1,var2=5):
    var3=var1*var2
    return var3
var1=3
print(function1(var1=5,var2=6))
print(function1(var1,var2=6))
print(function1(var1,var2=3))

SESSION 4
Functions, Modules, and Packages

Learning Objectives

In this session, students will learn to:

⮚ Describe how to use recursive functions
⮚ Explain the significance of lambda functions
⮚ Describe and create modules
⮚ Identify how to define and import packages

Functions differ in the way they are defined. For example, some functions can
call themselves until a specific condition is met. Such functions are termed
recursive functions. Certain concise functions can be anonymous with a single
expression in their body. This type of function is called a Lambda function.
Functions can be part of programs or modules. Modules are collections of
related variables, classes, and functions that can be imported as and when
necessary. Modules are often grouped along with similar functions and
distributed as packages. This makes it easier for developers to import a
package and use the specific function only when required.

In this session, students will learn about the construction and usage of recursive
functions. The significance and usage of lambda functions will be discussed in
this session. Details about creating modules and packages will also be
discussed.

4.1 Recursive Functions

A very effective way of solving problems is by breaking them down into similar
smaller subproblems. This technique is efficient in searching, sorting, and
traversing data structures. Even mathematical series can be calculated using
this breaking-down technique. While doing so, to further simplify the problem,
the same technique is applied repeatedly until the problem cannot be further
broken down.

To understand this better, consider calculating the solution for the series:
a^n + a^(n-1) + a^(n-2) + a^(n-3) + a^(n-4) + ... + a^1, given the value of a
and n. The solution to this problem can be calculated as shown in Figure 4.1:

Figure 4.1: Solution for a Series

This is a classic case of recursion. As seen in Figure 4.1, the sum is broken
down into a value (the nth term) plus the sum of the series up to the previous
term (n-1). In each step, the series is broken down one step further. This
repetition of calculation can be coded using recursion. The recursion ends
when the power becomes one.

Let us now see the general syntax for coding recursion in Python.

def function_name(parameter):
    function_body
    if <condition>:
        return statement
    else:
        return function_name(parameter)

Note that the function definition of a recursive function is similar to a normal
function. A recursive function either calls itself or returns the control back to the
caller based on a condition.

Figure 4.2 depicts the flow of control in a recursive function.

Figure 4.2: Flow of Control in a Recursive Function

The main program calls the recursive function. Inside the function, after the
processing steps are executed in the function body, a condition is checked. If
the condition is not satisfied, then the function calls itself with a reduced set of
input. Otherwise, the control returns to the caller. The function keeps calling
itself until the condition is satisfied.

Figure 4.3 shows the code and the output of the sum of the series example.

Figure 4.3: Recursive Function for a Series

Figure 4.4 shows the working of this recursive function. Let S represent the
function SeriesSum.

Figure 4.4: Working of the Recursive Function - SeriesSum

First, the main program makes the function call, SeriesSum(4,6). Then, the
recursive function SeriesSum calls itself five more times to complete the
calculation. Finally, the control returns to the main program and the answer is
printed.
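
A minimal sketch of SeriesSum, written from the description of the series, is
given here. The parameter names a and n and the use of n == 1 as the stopping
condition are assumptions consistent with the text.

```python
def SeriesSum(a, n):
    if n == 1:
        return a                              # stopping condition: power is one
    return a ** n + SeriesSum(a, n - 1)       # nth term + sum of the rest

print(SeriesSum(4, 6))   # 4096 + 1024 + 256 + 64 + 16 + 4 = 5460
```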

So, should a function always have a name? Not necessarily.

4.2 Lambda Function

A lambda function is an anonymous function. Its body is limited to a single
expression. It can take any number of arguments, but it returns the value of
only one expression. The expression can run through multiple lines if it is
enclosed in parentheses. Like regular functions, lambda functions can read
global variables. The syntax of a lambda function is:

lambda arguments: expression

● Arguments: comma-separated list of arguments used in the expression
● Expression: expression that uses the arguments in the list before colon(:)

When a lambda function is called, the expression is evaluated, and the
resultant value is returned.

Let us create a lambda function to multiply two numbers. Figure 4.5 shows the
working of this lambda function in Jupyter environment.

Figure 4.5: Lambda Function

Note that (lambda x,y:x*y) is the lambda function. (4,5) after the
function is the set of values passed to call the function. Thus, a lambda
function can be called and executed immediately where it is defined.

It is possible to assign a name to this lambda function. Let us change the
multiply lambda function and assign it to a variable, say, product.

Figure 4.6 shows the working of this lambda function in Jupyter environment.

Figure 4.6: Lambda Function with a Name

Here, the variable product is assigned the lambda function itself, which
takes x and y as parameters. When the lambda function is invoked by
calling product(4,5), the expression is evaluated and the result is printed.
Note that there is no explicit return statement in a lambda function.
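
Both ways of using the multiply lambda function can be sketched as follows.
This is a reconstruction from the description, so the exact code in the
figures may differ.

```python
# Called immediately, without a name
print((lambda x, y: x * y)(4, 5))     # 20

# Assigned to a variable and called through that name
product = lambda x, y: x * y
print(product(4, 5))                  # 20
```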

Code Snippet 1 shows the code for a regular function equivalent to the
product lambda function.

Code Snippet 1:

def Product(x,y):
    return x*y
print('The product of 4, 5 =', Product(4,5))

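Lambda functions reduce the lines of code for short, single-expression tasks.
They are most useful when passed as an argument to another function, such as
filter, map, or reduce, because the logic is written right where it is used.
A lambda function does not run faster than an equivalent function created
with def.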

4.2.1 Lambda with Filter Function

The filter function takes two arguments. The first argument is the condition
to be checked and the second argument is a sequence from which data
must be filtered. The filter function then returns the values that satisfy
the condition. In Python 3, the result is an iterator that can be converted to
a list or passed to a function such as sorted.

filter(condition, sequence)

For example, the condition can be a lambda function that checks for the
divisibility of its argument by 3. This allows filtering the values in the sequence
which are divisible by 3.

Figure 4.7 shows the code and the output of the filter function that takes a
lambda function as an argument.

Figure 4.7: Lambda Function as Argument to Filter Function

% is the modulo operator that gives the remainder when x is divided by 3. If
the remainder is zero, it means that x is perfectly divisible by 3. The numbers
that are divisible by 3 are filtered from the list, Given_list. After filtering,
the resultant numbers are sorted and printed. The lambda function which checks
for divisibility is used as an argument in the filter function.
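
A sketch of this filter call is given here. The contents of Given_list are
assumptions chosen for illustration.

```python
Given_list = [12, 7, 9, 20, 3, 14, 27]

# Keep only the values for which the lambda returns True, then sort them.
result = sorted(filter(lambda x: x % 3 == 0, Given_list))
print(result)    # [3, 9, 12, 27]
```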

4.2.2 Lambda with Map Function

The map function uses two arguments. The first argument is the action to be
performed on each value of a sequence and the second argument is the
sequence.

map(functionality, sequence)

Functionality can be a lambda function that performs a certain calculation on
each value in the list.

Figure 4.8 shows the code and the output of the map function that takes a
lambda function as an argument.

Figure 4.8: Lambda Function as Argument to Map Function

The expression x*3 in the lambda function gives the product of x and 3. These
resulting numbers are added to a list, result, and the result is then printed.
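
A sketch of this map call follows. The input list is an assumption; list() is
used because map returns an iterator in Python 3.

```python
given_list = [1, 2, 3, 4]

# Apply the lambda to every value and collect the results in a list.
result = list(map(lambda x: x * 3, given_list))
print(result)    # [3, 6, 9, 12]
```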

4.2.3 Lambda with Reduce Function

The reduce function takes two arguments. The first argument is a function
that combines two values and the second argument is the sequence. The reduce
function applies this function cumulatively to the values of the sequence,
from left to right, and reduces the sequence to a single value.

reduce(functionality, sequence)

Functionality is typically a lambda function that takes two arguments, the
result accumulated so far and the next value in the sequence, and returns
their combined value. The reduce function must be imported from the
functools module.

Figure 4.9 shows the code and the output of the reduce function that takes a
lambda function as an argument.

Figure 4.9: Lambda Function as Argument to Reduce Function

The lambda function takes two arguments x and y and multiplies them. When
the lambda function executes, the first two numbers are multiplied by each
other. Similarly, the next number is taken and multiplied by the product
previously obtained. The final product is stored in result. The print statement
displays the result.
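
The multiplication described above can be sketched as follows. The input
list is an assumption chosen for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# ((((1 * 2) * 3) * 4) * 5): each step multiplies the running product
# by the next number in the list.
result = reduce(lambda x, y: x * y, numbers)
print(result)    # 120
```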

4.3 Overview of Modules

A module can be considered as a code library that can be reused. Modules
contain functions, variables, and classes that can be used in other Python
programs. Modules help in easy maintenance of the code by breaking the
code into separate files. Two types of modules offered by Python are:

● Built-in module
● User-defined module

4.3.1 Built-in Module

Python offers a rich library support that can be imported readily into the code.
These ready-to-use modules are called built-in modules. They are part of the
Python standard library and are available with Python installation. datetime,
math, and sys are some of the built-in modules.

Importing a Module

The built-in modules offer a variety of functionalities such as file handling, string
manipulation, date and time handling, and mathematical operations.
Developers must import these modules to make them available in their
programs. The syntax to import all the functions in a module is:

import <modulename>

Consider that the developer wants to make use of the sqrt function that
calculates the square root of a number. This function is available in the math
module. Figure 4.10 shows the code to import and utilize this function in a
program.

Figure 4.10: Importing a Module

The command import math imports all the functions of the math module. The
code math.sqrt utilizes the sqrt function in the math module.

Importing Specific Functions of a Module

Python offers the provision to import certain functions alone from a module
without importing the entire module. The syntax to import a specific function
from a module is:

from <modulename> import <functionname>

Consider that the developer wants to import the gcd function from the math
module. Figure 4.11 shows the code to import and utilize the gcd function in a
program.

Figure 4.11: Import a Specific Function

The gcd function takes two numbers as arguments and calculates the greatest
common divisor of the two numbers.
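
The two import forms covered so far can be compared in a short sketch. The
numbers passed to the functions are assumptions chosen for illustration.

```python
import math              # whole module: names need the math. prefix
from math import gcd     # one function: no prefix needed

print(math.sqrt(16))     # 4.0
print(gcd(48, 18))       # 6
```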

Importing Multiple Modules

Certain programming situations demand the use of functions from several
modules. Importing them together equips the developer with a rich library of
functions. The syntax to import multiple modules is:

import <modulename1>, <modulename2>, ...

Consider that the developer wants to import two modules—math and random
to calculate the area of a shape. The shape for which area must be calculated
is chosen at random by the choice function in the random module.

Figure 4.12 shows the code to import multiple modules using a single import
command.

Figure 4.12: Importing Multiple Modules

The pi constant and the pow function in the math module are used to obtain
the value of pi and to calculate the power of a number, respectively. The
choice function in the random module picks a random value from the provided
list.
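
A sketch of such a program is given here. The helper function area, the two
shape names, and the size value are assumptions made for illustration; the
code in the figure may be organized differently.

```python
import math, random

def area(shape, size):
    # size is the radius of a circle or the side of a square
    if shape == 'circle':
        return math.pi * math.pow(size, 2)
    return math.pow(size, 2)

shape = random.choice(['circle', 'square'])   # picked at random on each run
print(shape, 'area:', area(shape, 3))
```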

Renaming a Module During Import

For ease of referring to a module in the code, the module can be referred to
using another name. The syntax for renaming the module is:

import <modulename> as <newname>

Consider that the developer wants to import and rename the random module
as rand. Figure 4.13 shows the code to perform this.

Figure 4.13: Import with Renaming

Here, the random module is imported as rand. Any further reference to the
random module must be made by addressing it as rand. The randint function
of the random module picks a number in the specified range.

Importing all Functions of a Module

Python provides another method to import all the functions of a module. The
first method to import a module using the import <modulename> command
was covered earlier in this session. The syntax to import all the functions in a
module is:

from <modulename> import *

Both these methods make all the functions of the module available. With the
asterisk (*) method, the developer does not have to mention the name of the
module as a prefix for each reference to a function. However, the imported
names can clash with names already defined in the program, so this method is
best kept for short scripts and interactive sessions. Consider that the
developer wants to use several functions from the math module. Figure 4.14
shows the code to perform this.

Figure 4.14: Importing all Functions of a Module

Note that the developer does not have to refer to any of the functions with the
module name as its prefix. For example, the factorial function is referred to
as factorial and not as math.factorial.
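
A short sketch of this style of import follows. The choice of functions and
values is an assumption; note that this form of import is allowed only at the
top level of a program, not inside a function.

```python
from math import *

# No math. prefix is needed for any of these calls.
print(factorial(5))   # 120
print(sqrt(49))       # 7.0
print(floor(7.8))     # 7
```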

4.3.2 User-defined Module

Python offers the facility for developers to create their own modules. Such
modules are called user-defined modules. Developers can include functions,
classes, and variables inside these modules.

Creating a Module

Consider that the developer wants to create a module named
math_calculation with a few mathematical functions (FindMax, FindMin, and
FindDigitSum) inside the module. Code Snippet 2 shows the code inside the
module.

Code Snippet 2:

def FindMax( x, y ):
    if x > y:
        return x
    return y

def FindMin( x, y ):
    if x < y:
        return x
    return y

def FindDigitSum(n):
    total = 0
    for digit in str(n):
        total += int(digit)
    return total

This code must be saved as math_calculation.py in the same folder as the
main program and imported into the main program. Figure 4.15 shows the
code to perform this.

Figure 4.15: Creating a User-defined Module

Variables in a Module

Modules consist of not only functions but also variables such as lists,
dictionaries, and objects. Once defined in a module, these variables are
readily available and can be used in a program.

Consider that the developer wants to include a list of variables in a module
and import it into the main program. Code Snippet 3 shows the code inside
the module.

Code Snippet 3:

carlist = ["Ford", "Tesla", "Mercedes-Benz", "BMW"]
bikelist = ["Honda", "Harley-Davidson", "Ducati"]

This code must be saved as vehiclelist.py in the same folder as the main
program and imported into the main program. Figure 4.16 shows the code to
perform this.

Figure 4.16: Accessing Variables in a Module

Reloading a Module

A module once imported may undergo changes in its source. In such a
scenario, the updated change must be reflected in the main code. To do this,
the reload function is used. This function reloads the imported
module into the main program. The syntax to reload a module is:

importlib.reload(<modulename>)

The modulename parameter is the name of the module that was already
imported and must be reloaded. The reload function is found in the
importlib module. Hence, the importlib module must be imported into the
main program.

Consider that the developer wants to reload the math_calculation module.
Figure 4.17 shows the code to perform this.

Figure 4.17: Reloading a Module
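
A reload can be sketched as follows. This is an illustration under stated assumptions and not the code from the figure: greeting_demo and its MESSAGE variable are hypothetical names, and the module file is written to a temporary folder so that the sketch can change the source and observe the effect.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module file created only for this sketch.
module_dir = tempfile.mkdtemp()
module_file = Path(module_dir, "greeting_demo.py")
module_file.write_text("MESSAGE = 'first version'\n")
sys.path.insert(0, module_dir)
sys.dont_write_bytecode = True  # keep the demo free of cached bytecode

import greeting_demo
before = greeting_demo.MESSAGE

# Change the module source on disk after it was imported.
module_file.write_text("MESSAGE = 'second version'\n")

# A plain import would keep the old module; reload reads the source again.
greeting_demo = importlib.reload(greeting_demo)
after = greeting_demo.MESSAGE
print(before, "->", after)
```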

4.3.3 Dir Function

The dir function lists the names of all the attributes and methods of an
object. The syntax of this function is:

dir(<objectname>)

The objectname parameter is optional. If the name of the object is not
included, the names defined in the current scope are listed. Note that the dir
function lists only the names of the attributes and does not provide their
values. Figure 4.18 shows the usage of the dir function.

Figure 4.18: Dir Function

Here, the object name is not specified. Hence, the dir function lists all the
names defined in the current scope. This function can be used to inspect
objects while debugging programs.
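
The two ways of calling dir can be sketched as follows. The helper function show_local_names and its variables are hypothetical and exist only to give dir a small scope to report.

```python
import math

# With an argument, dir returns the attribute names of that object.
math_names = dir(math)
print("factorial" in math_names)

# The same works for any object, such as a list.
list_names = dir([])
print("append" in list_names)

# Without an argument, dir returns the names in the current scope.
def show_local_names():
    counter = 0
    label = "demo"
    return dir()

local_names = show_local_names()
print(local_names)
```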

4.4 Packages

Modules in Python can contain any number of variables, functions, and classes.
Related modules are often grouped together as packages. A package can be
thought of as a folder that contains several files. Each module is stored as a
file under the folder for the package.

Packages are usually created for specific purposes. For example, a package
named Employeedata_package could deal with information about employees of an
organization. All modules, functions, and variables declared inside the
package help in retrieving or storing data pertaining to employees. Creating
packages in Python improves code readability and scalability.

4.4.1 Creating a Package

Consider a package named Calpackage. Suppose this package contains two
modules: Largethree.py and EvenOdd.py. Each of these modules contains two
functions as given in Figure 4.19.

Figure 4.19: Structure of Calpackage

The first step is to create a folder named Calpackage. Inside this folder, an
empty file named __init__.py is created. This is to indicate that Calpackage
is a package and all modules under this package can be imported. Module files
Largethree.py and EvenOdd.py are created and stored under the Calpackage
folder.

Figure 4.20 shows the code for the module, Largethree.py. The function,
maximum_two, takes two arguments and returns the larger of the two. The
function, maximum_three, takes three arguments and returns the largest among
the three.

Figure 4.20: Module Largethree.py Under Calpackage

Figure 4.21 shows the code for the module, EvenOdd.py.

Figure 4.21: Module EvenOdd.py under Calpackage

Figure 4.22 shows the folder structure of the created package and the files
under it.

Figure 4.22: Folder Structure of Calpackage

4.4.2 Importing Modules from a Package

Once a package is created, the functions in the modules that are part of the
package can be executed by importing the modules from the package.

The syntax to import a module from a package is:

from <packagename> import <modulename>

Python looks for Calpackage in the folder from which the main program runs.
Thus, to import the modules from Calpackage, the main program must be located
outside the Calpackage folder, in the folder that contains it.

Figure 4.23 shows the code to import the Largethree module from the
Calpackage. After importing the module, the code uses the maximum_three
function from the module.

Figure 4.23: Import Largethree Module

Similarly, even_num and odd_num functions can be executed by importing the
EvenOdd module from Calpackage.
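
The complete flow of creating and importing the package can be sketched as follows. The function bodies are assumptions, because the original code appears only in the figures. The sketch also builds the folder inside a temporary location and adds it to sys.path so that it can run anywhere. In normal use, the Calpackage folder sits next to the main program.

```python
import sys
import tempfile
from pathlib import Path

# Build the package layout described above inside a temporary folder.
root = Path(tempfile.mkdtemp())
package_dir = root / "Calpackage"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # empty marker file

# Assumed function bodies for the two modules.
(package_dir / "Largethree.py").write_text(
    "def maximum_two(a, b):\n"
    "    return a if a > b else b\n"
    "\n"
    "def maximum_three(a, b, c):\n"
    "    return maximum_two(maximum_two(a, b), c)\n"
)
(package_dir / "EvenOdd.py").write_text(
    "def even_num(n):\n"
    "    return n % 2 == 0\n"
    "\n"
    "def odd_num(n):\n"
    "    return n % 2 != 0\n"
)

# The main program must be able to see the folder that contains Calpackage.
sys.path.insert(0, str(root))

from Calpackage import Largethree
from Calpackage import EvenOdd

largest = Largethree.maximum_three(14, 92, 37)
is_even = EvenOdd.even_num(10)
is_odd = EvenOdd.odd_num(10)
print(largest, is_even, is_odd)
```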

4.5 Summary

⮚ Recursive functions call themselves repeatedly based on a condition.
⮚ Lambda functions are anonymous functions that are easy to create. They
return the value of a single expression.
⮚ Modules are files that may have one or more functions and variables. They
are stored with the extension, .py.
⮚ When programs import modules, the functions and variables inside the
modules become accessible to the programs.
⮚ Packages include one or more modules within. A file named __init__.py is
included in the package to indicate that this folder is a package.

Test Your Knowledge

1. Which of the following statements are true about lambda functions?

a. It is limited to a single statement
b. It returns a single expression even though it can take multiple
arguments
c. It is an anonymous function
d. It can access global variables

2. Consider the given code:

li = [13,28,29,37,38,41]
# insert code
print(final_list)

Which option should be inserted in the given code to get the output as
[13, 29, 37, 41]?

a. final_list = list(map(lambda x: (x % 2 != 0), li))
b. final_list = list(filter(lambda x: (x % 2 != 0), li))
c. final_list = list(reduce(lambda x: (x % 2 != 0), li))
d. final_list = list(lookup(lambda x: (x % 2 != 0), li))

3. Consider that you have created a function Num_product(x,y) inside a
module numcal.py. Which of the following options must be inserted into
the given code to get the output as 900?

import numcal
# insert code

a. print("Product is:",Num_product(20,45))
b. print("Product is:",numcal.Num_product(20))
c. print("Product is:",Num_product(20,45).numcal)
d. print("Product is:",numcal.Num_product(20,45))

4. Consider that you have created a list “colors = ["White",
"Yellow", "Pink", "Green"]” inside a module colorslist.py.
Which of the following options must be inserted into the given code to
get the output as “Pink”?

import colorslist
# insert code
print("my color is:", likecolor)

a. likecolor=colorslist.colors[2]
b. likecolor=colorslist.colors(3)
c. likecolor=colors[2]
d. likecolor=colors.colorslist[3]

5. Which of the following files is used to mark a directory as a Python
package?

a. __inits__.py
b. __init_core__.py
c. __init__.py
d. __core_init__.py

Answers to Test Your Knowledge

1 a, b, c
2 b
3 d
4 a
5 c

Try it Yourself

Open Jupyter notebook and perform the given tasks:

1. Write a Python lambda function to filter a list of integers named list1
into two separate lists, one containing positive integers and the other
containing negative integers. Given that list1=[5, 8, -3, 6, -1,
-5, 9, 11, -19].

2. Write a Python lambda function to sort a list of dictionaries by color.
Given that the original list of dictionaries is: [{'flower': 'Rose',
'color': 'Red'}, {'flower': 'lily', 'color': 'White'},
{'flower': 'Rose', 'color': 'Red'}, {'flower': 'Daisy',
'color': 'Blue'}].

3. Write a Python lambda function to multiply two lists named num_list1
and num_list2 using map function. Given that num_list1=[5, 7.8,
9, 11, 10.5] and num_list2=[3, 8, 4.5, 7, 11].

4. Write a Python lambda function to count the total number of
occurrences of 2 in the given list occ_list. Given that occ_list=[1,
2, 4, 2, 6, 1, 4, 7, 2, 8, 3, 9, 1, 3, 6, 2, 3, 2].

5. Import a function called shuffle from the module named random and
shuffle the elements in the given lists. Given that num_list=[1, 3, 5,
9] and word_list = ['rose', 'lily', 'jasmine', 'daisy'].

6. Create a package named tempackage. Create a module called
temp_conversion.py in tempackage. Write two functions inside the
module: Celsius_Fahren and Fahren_Celsius. The Celsius_Fahren
function converts the temperature from Celsius to Fahrenheit and the
Fahren_Celsius function converts the temperature from Fahrenheit to
Celsius. Import the temp_conversion module from tempackage into the
main program where the values for the functions are passed as
arguments and display the results for the respective functions.

Learning Objectives

In this session, students will learn to:

⮚ Identify file access modes and operations in Python
⮚ Describe file handling methods
⮚ Explain exception handling mechanisms

File and exception handling are two important aspects of any programming
language. File handling allows you to read and write data from and to files,
such as text files, XLS files, CSV files, and JSON files. Exception handling
allows you to handle errors that occur during the execution of a program, such
as dividing a number by zero or opening a file that does not exist. In this
session, you will learn how to use the built-in functions and keywords for
file and exception handling in Python.

5.1 File Input and Output in Python

A file is a specifically identified area on a secondary storage medium that
stores data permanently so that it can be accessed later.

Files are stored on a computer in binary form, which is a collection of 0s and
1s. As a result, each file is just a collection of bytes that are stored one
after the other. Data files primarily come in two types: text files and binary
files. Any text editor can be used to open a text file since it is made up of
human-readable characters. Binary files, on the other hand, contain characters
and symbols that are not human-readable. Therefore, you must use special
software to read information from the binary files.

Input and output of files in Python depend on the type of file. There are two
types of files in Python:

Text File:
• Contains character data
• Commonly uses the .txt extension

Binary File:
• Is capable of storing images, videos, audio, and other binary data
• Commonly uses the .bin extension

When opening a file, you must specify the mode in which you wish to open it.
Table 5.1 lists various access options available when opening a file in Python.

Mode   Description
x      Creates a new file and opens it for writing; throws an error if a
       file with the same name already exists
r      Opens an existing file for reading only, with the pointer at the
       beginning
w      Opens a file for writing only, replacing the existing content
t      Opens a file in text mode (the default)
b      Opens a file in binary mode
a      Opens a file in the append mode to add content while retaining the
       existing content
r+     Opens an existing file for reading and writing, with the pointer at
       the beginning
rb     Opens an existing file for reading only in binary format, with the
       pointer at the beginning
rb+    Opens an existing file for reading and writing in binary format,
       with the pointer at the beginning
w+     Opens a file for writing and reading, replacing the existing content
wb     Opens a file for writing only in binary format
wb+    Opens a file in binary format for writing and reading, replacing the
       existing content
a+     Opens a file to append and read data
ab     Opens a file in binary format in the append mode
ab+    Opens a file in binary format to append and read data

Table 5.1: File Access Modes

For all the write access modes, if a file with the same name exists, the file is
overwritten; otherwise, a new file is created. For all the append access modes,
if no file exists with the name provided, a new file is created.

5.1.1 Performing File Operations

The programming language Python is frequently used for data analytics and
includes several built-in file operations. Major file operations include
creating a file, writing data into the file, reading data from the file, and
closing the file when no longer required.

• Create a file

In Python, you can create a file in two ways:

• Create an empty text file using the x access mode.
• Use access mode w to create a new file and write content to it.

You can create a new file without importing any modules using the open
built-in method. The syntax for opening a file is:

<filepointer> = open('file_Path', 'access_mode')

The filepointer is a variable that refers to the file object returned by the
open method. The file object keeps track of the position of the cursor in the
opened file. The initial position is 0 when a file is opened for reading or
writing. The file path can be of one of two types:

Absolute Path   The whole directory list required to find the file is
                contained in an absolute path.

Relative Path   The location of the file is given relative to the current
                working directory in a relative path.

To create a file, you must pass the file name and access mode to the open
method. The purpose of opening a file is specified through the access
mode.

Table 5.2 lists the access modes for creating a file.

File Mode   Meaning
w           Creates a new file and opens it for writing content. If an
            existing file name is provided, the file contents are removed.
x           Creates a new file and opens it for data entry. If the
            specified file name already exists, then Python throws the
            FileExistsError exception.
a           Opens a file in append mode and adds new content at the end.
            If the file does not exist, a new file is created.
b           Opens the file in binary format. It is combined with w, x,
            or a.
t           Opens the file in text format, which is the default. It is
            combined with w, x, or a.

Table 5.2: Access Modes for Creating a File

Figure 5.1 shows the Code Snippet to open a file using the x access mode.
Note that the file object returned by the open method is stored in a file
pointer variable. This variable is used to perform file operations until the
file is closed. The close method uses the file pointer variable, fp.

Figure 5.1: Create an Empty Text File

Figure 5.2 shows the folder structure which lists the newly created file.

Figure 5.2: Empty Text File

If no explicit path or directory location has been given, then the file is
created in the current working directory, which is usually the folder from
which the software or script is run. Such a path is termed the relative path.

Figure 5.3 shows the code to create a file using the w access mode. The
string ‘Welcome to Python programming Lab’ is written to the file using
the write method. Finally, the file is closed.

Figure 5.3: Create and Write Content to File

Figure 5.4 shows the contents of the newly created text file.

Figure 5.4: Contents of welcomepython.txt

In the code, the open method is used to open a file named welcomepython.txt
in the w access mode. Once the file is open, the write method is used to add
contents to the file through the file pointer variable, fp.

• Open File in Read Mode

You can open a file in read-only mode using the r access mode in the open
method. This will allow you only to view the contents of the file. For
example, the command to open the welcomepython.txt file in read-only mode is:

fp = open("welcomepython.txt", "r")

Once a file is open, the read method can be used to return the desired number
of characters, or the entire contents of the file if no number is given.

Figure 5.5 shows the code to open a file in read-only mode and display its
contents.

Figure 5.5: Opening a File in Read Mode
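
The create, write, and read steps described so far can be sketched together as follows. This is an illustration and not the code from the figures. The file is placed in a temporary folder, which is an assumption made so that the sketch does not depend on the current working directory.

```python
import os
import tempfile

folder = tempfile.mkdtemp()  # temporary folder used only for this sketch
path = os.path.join(folder, "welcomepython.txt")

# x creates an empty file and fails if the file already exists.
fp = open(path, "x")
fp.close()
try:
    open(path, "x")
    already_exists = False
except FileExistsError:
    already_exists = True

# w opens the same file for writing and discards any old content.
fp = open(path, "w")
fp.write("Welcome to Python programming Lab")
fp.close()

# r opens the file for reading only.
fp = open(path, "r")
content = fp.read()
fp.close()
print(already_exists, content)
```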

In the code, the read method is used to read the contents of the file and
display it using the print method.

• Open File in Write or Append Mode

You can use the open method with the w access mode to open a file and write
data into it. The file will start with the cursor or file pointer at the
beginning of the file. If the file already exists, then its contents will be
erased before writing the new data.

Once a file is open, the write method can be used to add data to it. Stream
position and the file access mode determine where the newly added text will
be placed in the file. For example, with the access mode as w, the new text
will be added at the start of the file after its original content is erased.
If the access mode is a, then the text will be added at the end of the file,
which is the current stream position.

Consider the welcomepython.txt file that has already been created and
contains the text ‘Welcome to Python programming Lab’. Figure 5.6 shows the
code and the output to open the welcomepython.txt file in write mode and
replace its contents.

Figure 5.6: Opening a File in Write Mode
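
The difference between the w and a access modes can be sketched as follows. This is a minimal illustration that uses a temporary folder (an assumption of the sketch) and the sample sentences used in this session.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "welcomepython.txt")

fp = open(path, "w")
fp.write("Welcome to Python programming Lab")
fp.close()

# Opening the file again with w erases the earlier text.
fp = open(path, "w")
fp.write("Today's session is File Handling in Python.")
fp.close()

# Opening the file with a keeps the text and adds to the end.
fp = open(path, "a")
fp.write(" Next session is Exception handling in Python.")
fp.close()

fp = open(path, "r")
content = fp.read()
fp.close()
print(content)
```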

Now, consider the welcomepython.txt file that has the text, ‘Today's session
is File Handling in Python.’ Let us add the text ‘ Next session is Exception
handling in Python.’ at the end of this file. Figure 5.7 shows the code and
the output to append the new text.

Figure 5.7: Opening a File in Append Mode

• Open a File Using with Statement

The with statement along with the open method can be used to open a file.
The generic syntax to use the with statement is:

with open('file_Path', 'access_mode') as <filepointer>:
    <statements that use the file>

Main benefits of opening a file with the with statement are:

• The with statement encapsulates frequent setup and cleanup operations,
  which makes exception handling easier.
• The file opened using the with statement is automatically closed when
  exiting the block.
• The automatic file closure also ensures that all associated resources are
  released.

Consider the welcomepython.txt file with the contents:

Today's session is File Handling in Python. Next session is Exception
handling in Python.

Let us copy the same contents to another file named sessionpython.txt.

Figure 5.8 shows the code and the output to create a new file with the same
text.

Figure 5.8: Opening a File Using with Statement
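
Copying one file to another with the with statement can be sketched as follows. This is an illustration and not the code from the figure. The temporary folder and the variable names source and target are assumptions of the sketch.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
source_path = os.path.join(folder, "welcomepython.txt")
target_path = os.path.join(folder, "sessionpython.txt")

# Prepare the source file.
with open(source_path, "w") as fp:
    fp.write("Today's session is File Handling in Python.")

# Open both files in one with statement and copy the contents.
with open(source_path, "r") as source, open(target_path, "w") as target:
    target.write(source.read())

# Both files are closed automatically once the block ends.
files_closed = source.closed and target.closed

with open(target_path, "r") as fp:
    copied = fp.read()
print(files_closed, copied)
```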

The with open statement opens the welcomepython.txt file in read mode and
copies the contents of this file into the sessionpython.txt file.

• Open a File for Multiple Operations

By adding + to an access mode in Python, you can open a file to carry out
more than one kind of operation. Both reading and writing options in the file
are enabled when you use the r+ access mode.

Figure 5.9 shows the code and the output to open the sessionpython.txt file in r+
mode and add more content to it.

Figure 5.9: Opening a File for Multiple Operations

Note that the code uses the seek method to move to the starting point of
the file.
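
Reading and writing through a single r+ file handle can be sketched as follows. The file location and the sample text are assumptions made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sessionpython.txt")
with open(path, "w") as fp:
    fp.write("File Handling")

with open(path, "r+") as fp:
    original = fp.read()      # reading leaves the pointer at the end
    fp.write(" in Python")    # so this text is added at the end
    fp.seek(0)                # move back to the starting point
    updated = fp.read()
print(original, "|", updated)
```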

5.1.2 File Handling Methods

Python offers a wide range of methods that you can use with the filepointer
handle to manipulate the file object.

• readlines Method

Each line in the file is returned as a list item by the readlines method. To
restrict the number of lines returned, you can use the hint option.

The syntax for the readlines method is:

<filepointer>.readlines(hint)

Table 5.3 lists the parameter for the readlines method.

Parameter   Description
hint        Specifies the approximate number of bytes or characters to be
            read. This parameter is optional. Once the total size of the
            lines read so far exceeds the hint, no more lines will be
            returned. The default value for this parameter is -1,
            indicating all lines in the file will be returned.

Table 5.3: Parameter for readlines Method
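
The effect of the hint parameter can be sketched as follows. The file name and its three lines are assumptions made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as fp:
    fp.write("line one\nline two\nline three\n")

# Without a hint, every line is returned, including the line endings.
with open(path, "r") as fp:
    all_lines = fp.readlines()

# The first line alone already exceeds a hint of 5, so reading stops there.
with open(path, "r") as fp:
    first_lines = fp.readlines(5)
print(all_lines, first_lines)
```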

Figure 5.10 shows the code and the output for the readlines method.

Figure 5.10: Read Data Using readlines Method

• writelines Method

Python offers the writelines method to write several strings to a file. To
use the writelines method, you must supply an iterable object such as a list
or a tuple that contains several strings. The writelines method does not add
line separators, so each string must end with \n if it is to appear on its
own line. Here, the writelines method is used to save a list's entries as
lines in a file.
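
The behavior of writelines with and without line endings can be sketched as follows. The file names and list contents are assumptions made for this illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
lines_path = os.path.join(folder, "flowers.txt")
joined_path = os.path.join(folder, "joined.txt")

# Each string carries its own line ending.
with open(lines_path, "w") as fp:
    fp.writelines(["rose\n", "lily\n", "jasmine\n"])

# Without line endings, the strings are written back to back.
with open(joined_path, "w") as fp:
    fp.writelines(("rose", "lily"))

with open(lines_path, "r") as fp:
    lines_content = fp.read()
with open(joined_path, "r") as fp:
    joined_content = fp.read()
print(repr(lines_content), repr(joined_content))
```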

The syntax for the writelines method is:

<filepointer>.writelines(list)

Figure 5.11 shows the code and the output for the writelines method.

Figure 5.11: Write a List Using writelines Method

• truncate Method

The truncate method in Python allows you to resize a file to the specified
number of bytes. If the number of bytes is not specified, then by default the
file will be truncated at the current position of the cursor. That is, the
contents of the file after the current position of the cursor will be removed
from the file.

The syntax for the truncate method is:

<filepointer>.truncate(size)

Figure 5.12 shows the code and the output for the truncate method.

Figure 5.12: Resizing the File Using truncate Method
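
Truncating a file to 20 bytes can be sketched as follows. This is an illustration and not the code from the figure. The file contents and the temporary folder are assumptions of the sketch.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pythonlab.txt")
with open(path, "w") as fp:
    fp.write("Welcome to Python programming Lab")

# r+ keeps the existing content so that it can be resized.
with open(path, "r+") as fp:
    new_size = fp.truncate(20)

with open(path, "r") as fp:
    content = fp.read()
print(new_size, repr(content))
```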

The file pythonlab.txt is truncated to 20 bytes.

• tell Method

The tell method returns the numerical value of the current position of the
file pointer.

The syntax for the tell method is:

<filepointer>.tell()

Figure 5.13 shows the code and the output for the tell method.

Figure 5.13: Knowing the Current File Position Using tell Method

• seek Method

The seek method lets you change the current position in a file stream. This
method can be used to go to a specified position in the file. Once done, the
contents of the file from that position can be read or new data can be
written to the current position. Alternatively, the file can also be
truncated from the current position.

The seek method moves the file pointer to the specified location and returns
the new position. The offset parameter provides the location to which the
file pointer must move. The syntax for the seek method is:

<filepointer>.seek(offset)

Figure 5.14 shows the code and the output for the seek method.

Figure 5.14: Changing the Current File Position Using seek Method
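
The tell and seek methods can be sketched together as follows. The file name and the sample text are assumptions made for this illustration. The text contains only plain English characters, so each character occupies one position.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "position_demo.txt")
with open(path, "w") as fp:
    fp.write("Welcome to Python")

with open(path, "r") as fp:
    start = fp.tell()            # position right after opening
    first_word = fp.read(7)      # reads "Welcome"
    after_read = fp.tell()       # position after reading 7 characters
    new_position = fp.seek(11)   # jump to the word "Python"
    rest = fp.read()
print(start, first_word, after_read, new_position, rest)
```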

5.1.3 Pickling Module in Python

Python is an object-oriented programming language that treats everything,
such as lists, tuples, or dictionaries, as objects. So, Python needs to offer
a technique to retain the state of an object. For example, consider adding
items to a wish list of an online purchasing application. This application
must remember the items that you have added to the wish list during all your
further logins to the application. The items will be removed from the list
only when you do so. Here, the wish list is an object whose state the
application has to remember. Python offers the pickle module to help you
preserve an object structure with data.

The main function of the pickle module is to serialize and de-serialize
Python object structures. When you must move Python objects from one system
to another, pickling and unpickling become crucial. Figure 5.15 depicts the
process of pickling and unpickling.

Figure 5.15: Pickling and Unpickling

Pickling or serialization is the process in which a Python object structure
is serialized into byte streams of data. These byte streams stored in binary
files can then be transported anywhere over the network and stored in any
secondary storage medium.

Unpickling or deserialization is the process of recovering the original
Python objects from the pickle files that store byte streams. Most Python
object structures can be serialized and deserialized with the pickle module.

The pickle module offers two methods: dump to write binary data and load to
read binary data.

• dump Method

To use the dump method, you must import the pickle module. You can open a
file in binary write mode and dump data into it. The dump method converts the
data object provided into binary streams before storing it in the file
pointed by the file object.

The syntax of the dump method is:

pickle.dump(data_object, file_object)

Here, file_object is the file handle and data_object is the object that must
be written to it.

Figure 5.16 shows the code to create a file with a list object serialized
into a byte stream.

Figure 5.16: Dumping a List Using dump Method

• load Method

The pickle module contains the load method to read data from byte streams and
return it as Python objects. You can open a file in binary read mode and read
data from it.

The syntax of the load method is:

data_object = pickle.load(file_object)

Here, the unpickled Python object is placed in a variable called data_object
after being loaded from the file accessed through the file handle
file_object.

Figure 5.17 shows the code to open a file, read the binary stream from it,
and deserialize the binary streams to store in a list object.

Figure 5.17: Loading a List from a File Using load Method
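
A full pickling and unpickling round trip can be sketched as follows. This is an illustration and not the code from the figures. The wish list contents and the file name are assumptions of the sketch.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wishlist.bin")
wish_list = ["headphones", "backpack", "notebook"]

# Pickling: open the file in binary write mode and dump the object.
with open(path, "wb") as file_object:
    pickle.dump(wish_list, file_object)

# Unpickling: open the file in binary read mode and load the object.
with open(path, "rb") as file_object:
    restored = pickle.load(file_object)

print(restored)
```

The restored list has the same contents as the original list, but it is a separate object in memory.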

5.2 Introduction to Exception Handling

Similar to other programming languages, Python offers a robust exception
handling mechanism to handle erroneous situations that may arise during
program execution. Exceptions are objects that represent an error that has
occurred during the execution of a program. Examples of exceptions include
ZeroDivisionError or ModuleNotFoundError. ZeroDivisionError occurs when a
number is divided by zero. ModuleNotFoundError occurs when an imported module
is not available in the currently installed version of Python.

Figure 5.18 shows the relation between the Exception class and the
BaseException class in Python.

Figure 5.18: Exception and BaseException Classes

Exceptions in Python are objects that derive from the BaseException class.
Every exception object in Python includes the type of error, a message that
clearly states information about the error, and the state of the object when the
error occurred.

Python offers the try and except blocks to catch and handle exceptions. The
syntax for exception handling is:

try:
    # <statements that may throw an exception>
except <exception_name>:
    # <statements to handle the exception>

Here, the try block includes the statements that may throw an error during
execution. If such an error occurs, then the statements in the except block
will be executed to handle the error. You can display a user-friendly message
or specify what to do when a particular error occurs. If you do not specify
any exception name in the except block, then the same code will be executed
for any exception that occurs.

For example, if the try block contains a statement that divides a number and
the divisor evaluates to zero, then the ZeroDivisionError occurs. Figure 5.19
shows the code and the output for where the ZeroDivisionError exception
occurs.

Figure 5.19: Code Throwing ZeroDivisionError Exception

To handle this exception, a user-friendly message can be added to the except
block, and the division statement can be added to the try block. Figure 5.20
shows the code and the output for handling the ZeroDivisionError exception.

Figure 5.20: Code Handling ZeroDivisionError Exception
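
Handling the ZeroDivisionError exception can be sketched as follows. The function name safe_divide and the choice to return None after an error are assumptions made for this illustration.

```python
def safe_divide(numerator, denominator):
    try:
        # This statement throws ZeroDivisionError if denominator is 0.
        return numerator / denominator
    except ZeroDivisionError:
        # Show a user-friendly message instead of stopping the program.
        print("Can't divide by zero")
        return None

normal_result = safe_divide(10, 2)
failed_result = safe_divide(10, 0)
print(normal_result, failed_result)
```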

Python offers a wide range of built-in exceptions that you can use to catch
and handle common errors that occur widely. However, you can also build a
custom exception and handle it. Such an exception is called a user-defined
exception.

5.2.1 Built-in Exceptions

Some common runtime exceptions developers get include:

• File that does not exist being read
• A list item outside the available index being read
• Divisor becomes zero in a division

Such common exceptions can be easily handled by the Python built-in
exceptions. In Python, the built-in exceptions are exception classes that are
derived from the BaseException class. Table 5.4 lists various built-in
exceptions offered by Python.

Exception            Description
AssertionError       Occurs to indicate failure of an assert statement
AttributeError       Occurs to indicate failure of an attribute reference
                     or assignment
EOFError             Occurs to indicate the end-of-file condition when
                     accepting data using the input function
FloatingPointError   Occurs to indicate failure of a floating-point
                     operation
GeneratorExit        Occurs when a generator’s close method is called
ImportError          Occurs when an imported module does not exist
IndexError           Occurs if the index of a sequence is out of range
KeyError             Occurs if a key is not found in a dictionary
KeyboardInterrupt    Occurs when the user hits the interrupt key, Ctrl+C
                     or Delete
MemoryError          Occurs when an operation runs out of memory
NameError            Occurs if a variable does not exist in the local or
                     global scope
OSError              Occurs if an operating system operation causes an
                     error
ReferenceError       Occurs when a weak reference proxy is used to access
                     an object that has been garbage collected

Table 5.4: Built-in Exceptions

The ZeroDivisionError demonstrated in the example is a built-in exception.

5.2.2 Catching Multiple Exceptions

Given the wide range of built-in exceptions, Python facilitates the handling of
multiple exceptions for a single try block. However, if an exception occurs, only
one exception will be handled.

Consider an example where the user enters marks obtained by a student and
the total number of subjects. The code calculates the average marks obtained
by the student using the total number of subjects given by the user as the
denominator value. Code Snippet 1 shows the code for doing this.

Code Snippet 1:
try:
    sub1 = int(input("Enter value of subject1:"))
    sub2 = int(input("Enter value of subject2:"))
    num_sub = int(input("Enter total number of subjects:"))
    avg = (sub1+sub2) / num_sub
    print("Average", avge)
except NameError:
    print("NameError occurred. Please define the variable or functions correctly.")
except ZeroDivisionError:
    print("Can't divide by zero")

Figure 5.21 shows the output of code in Code Snippet 1 with specific inputs.

Figure 5.21: Catching Multiple Exceptions, NameError

Since variable avg is incorrectly written as avge in the code, even if the user
enters proper data, NameError occurs. This exception is caught by the first
except block that says except NameError: and the message “NameError
occurred. Please define the variable or functions correctly.” is
displayed. The code can be corrected by modifying print("Average",
avge) as print("Average", avg).

Now, if the input provided for the number of subjects is zero, then the next
exception is caught as shown in Figure 5.22.

Figure 5.22: Catching Multiple Exceptions, ZeroDivisionError

The denominator is zero as the total number of subjects is entered as zero. Thus,
the try block throws the ZeroDivisionError exception, which is handled by
displaying the message "Can't divide by zero" in the second except block.
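
Several except blocks attached to one try block can also be sketched without keyboard input, as follows. The function average_marks, its string arguments, and the use of ValueError for non-numeric input are assumptions made for this illustration. Only the first matching except block runs.

```python
def average_marks(sub1, sub2, num_sub):
    try:
        total = int(sub1) + int(sub2)   # int may throw ValueError
        return total / int(num_sub)     # division may throw ZeroDivisionError
    except ValueError:
        return "Marks and subject count must be whole numbers"
    except ZeroDivisionError:
        return "Can't divide by zero"

valid = average_marks("78", "86", "2")
bad_text = average_marks("78", "eighty", "2")
zero_subjects = average_marks("78", "86", "0")
print(valid, bad_text, zero_subjects, sep="\n")
```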

5.2.3 Using try with else Block

Sometimes you may want to execute a specific set of code only if no exception
occurred. This is where the else block is useful. Figure 5.23 depicts the
flow of execution in the exception blocks.

Figure 5.23: Exception Blocks

If an exception occurs, then the else block will not be executed. If no
exception occurs, then the else block will be executed after the try block
completes. After executing the relevant except or else block code, the next
line of code after the try statement will be executed. The syntax for the try
with the else block is:

try:
    <statements that may throw an exception>
except <exception_name>:
    <statements to handle the exception>
else:
    <statements to be executed if there is no exception>

Consider the same example discussed for multiple except blocks. Code
Snippet 2 shows the code with the addition of the else block.

Code Snippet 2:

try:
    sub1 = int(input("Enter value of subject1:"))
    sub2 = int(input("Enter value of subject2:"))
    num_sub = int(input("Enter total number of subjects:"))
    avg = (sub1+sub2) / num_sub
    print("Average", avg)
except NameError:
    print("NameError occurred. Please define the variable or functions correctly")
except ZeroDivisionError:
    print("Can't divide by zero")
else:
    print("Program executed successfully")

Figure 5.24 shows the output for the execution of the else block.

Figure 5.24: Using else Block with try Block

5.2.4 Using try with finally Block

The resources used in the try and the except blocks must be freed before the
execution continues to the next line of code after the try, except, and else
blocks. Such a block of code that must be executed whether an exception
occurred or not can be added to the finally block. Figure 5.25 shows the flow
of execution within the exception blocks.

Figure 5.25: finally Block

The finally block is executed after the try, except, and else blocks. If an
exception is not handled by the code, then the unhandled exception is
re-thrown after the execution of the finally block.

The syntax for the try, except, else, and finally block is:

try:
    <statements that may throw an exception>
except <exception_name>:
    <statements to handle the exception>
else:
    <statements to be executed if there is no exception>
finally:
    <statements to be executed whether there is an exception or not>

Consider the same example discussed for the try, except, and else blocks.
Code Snippet 3 shows the code with the addition of the finally block.

Code Snippet 3:
try:
    sub1 = int(input("Enter value of subject1:"))
    sub2 = int(input("Enter value of subject2:"))
    num_sub = int(input("Enter total number of subjects:"))
    avg = (sub1+sub2) / num_sub
    print("Average", avg)
except NameError:
    print("NameError occurred. Please define the variable or functions correctly")
except ZeroDivisionError:
    print("Can't divide by zero")
else:
    print("Program executed successfully")
finally:
    print("try-except block is completed.")

Figure 5.26 shows the output for the execution of the finally block after an
exception occurs.

Figure 5.26: Using finally Block with try and except Blocks

The finally block will be executed even if an exception has not occurred.
Figure 5.27 shows the output for the execution of else and finally blocks.

Figure 5.27: Using finally Block with try and else Blocks

The finally block will be executed even if an unhandled exception occurs.
Figure 5.28 shows the output in which the finally block runs before the
unhandled exception is raised.

Figure 5.28: Using finally Block with an Unhandled Exception

Note that the unhandled exception is thrown after the execution of the
finally block.
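
The three cases above (no exception, handled exception, and unhandled exception) can be sketched without user input as follows. This is a minimal illustration; the function name divide and its log parameter are hypothetical and are not part of the earlier snippets.

```python
def divide(total, count, log):
    """Hypothetical helper: divide total by count and record which blocks run."""
    try:
        log.append("try")
        return total / count
    except ZeroDivisionError:
        log.append("except")
        return None
    finally:
        # Runs on every path: normal return, handled error, or unhandled error.
        log.append("finally")


events_ok = []
result_ok = divide(10, 2, events_ok)         # no exception occurs

events_zero = []
result_zero = divide(10, 0, events_zero)     # ZeroDivisionError is handled

events_type = []
try:
    divide("10", 2, events_type)             # TypeError is not handled in divide
except TypeError:
    events_type.append("propagated")         # reached only after finally has run

print(events_ok)      # ['try', 'finally']
print(events_zero)    # ['try', 'except', 'finally']
print(events_type)    # ['try', 'finally', 'propagated']
```

In each list, "finally" is recorded before control leaves the function, which matches the order shown in the figures.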

5.2.5 Raising Exceptions

Often when validating user input, you might want to show a custom error
message in case of inappropriate input. For example, consider that in the
previous example, you must restrict the number of subjects entered to two. If
the user enters any integer value other than two, you want to raise an
exception that displays the message “Total number of subjects should
be 2”. The raise statement lets you raise an exception. The syntax for the
raise statement is:

raise <Exception_Class>(<value>)

The raise statement has a single argument that indicates an exception should
be raised. This argument can either be a class that is inherited from the
Exception class or an exception object.
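
The two forms of the argument can be sketched as follows. This is an illustrative example only; the function name check_subject_count is hypothetical.

```python
def check_subject_count(num_sub):
    """Hypothetical validator: only two subjects are allowed."""
    if num_sub != 2:
        # Raising an exception object lets you attach a message to it.
        raise ValueError("Total number of subjects should be 2")
    return num_sub


try:
    check_subject_count(3)
except ValueError as err:
    message = str(err)        # the text that was passed to ValueError

try:
    # Raising the class itself is also allowed; Python creates the
    # exception object without any arguments.
    raise ValueError
except ValueError as err:
    bare_args = err.args      # empty tuple, because no value was supplied

print(message)
print(bare_args)
```

Raising an object is usually the more useful form, because the handler can display the message that accompanies the exception.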

Code Snippet 4 shows the code that raises an exception if the user entered
any number other than two.

Code Snippet 4:

def average(sub1, sub2, num_sub):
    try:
        if num_sub != 2:
            raise ValueError(num_sub)
        avg = (sub1 + sub2) / num_sub
        return avg
    except ValueError:
        print('Total number of subjects should be 2')

print('David Richard average:', average(78, 86, 2))
print('\n')
print('Rita Faraday:', average(89, 84, 1))
print('\n')
print('Adam Smith average:', average(67, 87, 3))

The code creates a function named average. This function accepts three
parameters. The first two parameters are the subject marks and the third
parameter is the number of subjects. Figure 5.29 shows the output for this code.

Figure 5.29: Raising Exceptions

The first call to the function executes normally without raising the exception.
The second and the third calls to the function raise the exception and display
the error message.

5.2.6 User-defined Exceptions

You can build a custom exception and use it to validate data. Consider that
you want to validate that the subject marks entered by the user are positive
values. If a user enters a negative value, then you can raise a custom
exception such as InvalidMarksError. Such exceptions defined by the
developer and raised through code are called user-defined exceptions or
custom exceptions.

To define a custom exception, you must define a new class. The new custom
exception class should derive, directly or indirectly, from the Exception
class rather than from the BaseException class.

The syntax to define a custom exception is:

class <CustomError>(Exception):
    <statements>
    pass

try:
    ...
except CustomError:
    ...

Note that the user-defined CustomError class inherits from the Exception
class. The pass statement is a placeholder that does nothing; it is required
only when the class body would otherwise be empty.
Code Snippet 5 shows the code for a custom exception. This exception is raised
if the user enters a negative value for the marks.

Code Snippet 5:

class InvalidMarksError(Exception):
    "Raised when a negative value is entered for the marks"
    pass

try:
    sub1 = int(input("Enter value of subject1:"))
    sub2 = int(input("Enter value of subject2:"))

    # Test each mark separately; the bitwise | operator hides the intent.
    if sub1 < 0 or sub2 < 0:
        raise InvalidMarksError
    else:
        print("Valid numbers")

except InvalidMarksError:
    print("Exception occurred: Invalid Mark")

Figure 5.30 shows the output for this code.

Figure 5.30: User-defined Exceptions
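
A custom exception can also carry data about the error. The following sketch is one possible variation, not the form used in Code Snippet 5: the marks attribute and the validate_marks helper are assumptions introduced here for illustration.

```python
class InvalidMarksError(Exception):
    """Raised when a negative value is supplied for the marks."""

    def __init__(self, marks):
        # Keep the offending value so that a handler can report it.
        super().__init__(f"Marks cannot be negative: {marks}")
        self.marks = marks


def validate_marks(*marks):
    """Hypothetical helper: return the marks unchanged if none is negative."""
    for value in marks:
        if value < 0:
            raise InvalidMarksError(value)
    return marks


try:
    validate_marks(78, -5)
except InvalidMarksError as err:
    report = f"Exception occurred: {err} (value={err.marks})"

print(report)
```

Because the class derives from Exception, an except Exception handler would also catch it, while the more specific except InvalidMarksError handler keeps the validation error separate from other failures.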

5.3 Summary

• A file is a specifically identified area on a secondary storage medium
  that stores data permanently so that it can be accessed later.
• Files can be opened in different modes, such as read, write, and append
  modes.
• Python offers a wide range of file handling methods such as readlines,
  writelines, truncate, tell, and seek. You can use these methods with the
  file pointer (handle) to manipulate the file object.
• The pickle module is used to serialize and de-serialize Python object
  structures.
• Every exception object in Python includes the type of the error,
  information about the error, and the state of the object when the error
  occurred.
• Exception handling in Python is done using the try, except, else, and
  finally blocks.
• Even though Python includes a wide range of built-in exceptions, you
  can write and raise a custom exception in Python.

Test Your Knowledge

1. Which of the following file modes allows you to open a file in binary mode
   to write and read data, replacing the current data?

   a. wb
   b. wb+
   c. ab+
   d. ab

2. Which of the following options allows you to create an empty text file
   named f1.txt, provided f1.txt does not exist?

   a. fp = create('f1.txt', 'w')
   b. fp = open('f1.txt', 'b')
   c. fp = open('f1.txt', 'x')
   d. fp = create('f1.txt', 't')

3. Which of the following statements are true about opening a file using the
   with statement?

   a. The syntax of with statement is with open(__file__, accessmode)
   b. The with statement performs frequent setup and cleanup operations
   c. A file is not closed automatically when exiting the block
   d. All resources associated with the file are not released when exiting
      the block

4. Which of the following options is the correct syntax of the seek method?

   a. file.seek(list)
   b. file.seek(hint)
   c. file.seek(size)
   d. file.seek(offset)

5. Which of the following blocks in exception handling is used to deallocate
   the system resources?

   a. try
   b. except
   c. finally
   d. else

Answers to Test Your Knowledge

1 b
2 c
3 a, b
4 d
5 c

Try it Yourself

Open Jupyter notebook and perform the following file-handling tasks:

1. Write a Python function to create a new empty text file named filego.txt
   and write the given content into the file.
   i. Go (also called Golang or Go language) is an open-source programming
      language used for general purposes.
   ii. Go is a simple and efficient language that is well-suited for
      concurrency and fast development.

2. Write a Python function to display the entire text of filego.txt.

3. Write a Python function to append the text given to filego.txt and
   display the updated file content.
   Golang is widely used in a wide range of applications such as Web
   development, cloud computing, network programming, and scientific
   computing.

4. Write a Python function to count the number of lines in filego.txt
   and display the number of lines in the file.

5. Write a Python function to count the number of words in filego.txt
   and display the number of words in the file.

Open Jupyter notebook and perform the following exception handling tasks:

1. Consider that you are trying to open a file named gofile.txt in read
   mode. However, the file does not exist. Use exception handling in Python
   to handle the FileNotFoundError exception if the file does not exist.

2. Consider that you are accessing an element list1[5] from the list list1,
   which is out of range. Use exception handling in Python to handle the
   IndexError exception. The list is list1 = [11, 77, 88, 33, 66].

3. Create a user-defined exception BalanceInadequateError. Raise this
   exception if withdraw_amt is greater than the current_balance amount.

Learning Objectives

In this session, students will learn to:

• Identify the benefits of regular expressions
• Explain character classes, sequences, and metacharacters in Python
  regular expressions
• Describe various methods in the regular expression (re) module
• Explain tokenization using regular expressions in Python

Regular expressions in programming languages serve as powerful tools for
pattern matching, searching, and manipulating text. Developers can use
these expressions for performing complex operations on strings, making it easier
to extract and process specific patterns or structures within textual data.

This session will provide you with a solid understanding of regular expressions,
their practical usage in Python, and the ability to apply them effectively for
text manipulation and pattern matching.

6.1 Regular Expressions in Python

Regular expressions, often called regex, serve as a powerful tool for defining
patterns used in searching, manipulating, and replacing strings. With regex,
you can efficiently match and find specific strings or sets of strings using a
specialized syntax known as a pattern.

By leveraging regex, you can save considerable time in various scenarios that
involve text processing and manipulation such as:

• Searching and replacing text in files
• Validating text input such as passwords and email addresses
• Renaming hundreds of files at a time

So, what are regular expressions made of? How are they used? Let us explore.

Regular expressions contain ordinary and special characters. Ordinary
characters denote themselves, such as letters (a-z or A-Z) and digits (0-9).
Special characters hold special meanings, such as escape sequences or
metacharacters. Many of them involve the backslash, which Python also uses for
escape sequences in ordinary string literals. Hence, when writing regular
expressions, it is recommended to use raw strings, denoted by the prefix r,
such as r'expression'. In a raw string, Python does not process backslash
escape sequences, so a sequence such as \b or \n reaches the regular
expression engine exactly as written instead of being converted first. The
pattern is therefore represented as-is, without any unintended
transformations.
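
The effect of the r prefix can be sketched as follows. The sample text and variable names are illustrative assumptions.

```python
import re

plain = "\bcat\b"    # Python converts each \b into a backspace character
raw = r"\bcat\b"     # the backslashes survive, so the regex engine sees \b

text = "cat concat cat."

plain_hits = re.findall(plain, text)   # looks for backspace + "cat" + backspace
raw_hits = re.findall(raw, text)       # looks for the whole word "cat"

print(len(plain), len(raw))    # 5 7
print(plain_hits)              # []
print(raw_hits)                # ['cat', 'cat']
```

Without the r prefix the pattern silently searches for backspace characters, which is why raw strings are the safer habit for regex patterns.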

Regular expressions contain:

• Metacharacters
• Special sequences
• Character classes

6.1.1 Metacharacters in Regular expressions

In a regular expression, most ordinary characters, such as A or p, function as
simple patterns that match themselves. For instance, if the pattern is "Python is
Rocking", it will match the string exactly as it is. This means that by
concatenating ordinary characters, you can create regular expressions that
match specific strings based on the desired pattern.

However, if you want to locate a string starting with a specific set of characters
or a string that contains a given pattern, then you must use metacharacters. In
Python, metacharacters are special characters that impact the interpretation
of regular expressions that include them. They do not match themselves, but
rather signify specific rules. Special characters such as |, +, and * are
considered metacharacters, also referred to as operators, signs, or symbols in
regular expressions.

Each metacharacter has its own purpose. Together, they let you express
patterns, such as repetition, position, and sets of alternatives, that
ordinary characters alone cannot describe.

Table 6.1 lists some of the metacharacters.

Metacharacter  Name             Description
.              Dot              Matches any one character except a newline
^              Caret            Matches the pattern at the start of the string
$              Dollar           Matches the pattern at the end of the string
*              Asterisk         Matches zero or more repetitions of the preceding pattern
+              Plus             Matches one or more repetitions of the preceding pattern
?              Question Mark    Matches zero or one repetition of the preceding pattern
[]             Square Brackets  Matches any single character from the set of characters given in the brackets

Table 6.1: Metacharacters in Python Regular Expressions

For example, the pattern [abc] will match either a, or b, or c.
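
The metacharacters in Table 6.1 can be tried out with a short sketch such as the following. The sample string and variable names are assumptions chosen for illustration, and the re.findall function used here is described later in this session.

```python
import re

text = "cat cot coat ct"

dot = re.findall(r"c.t", text)          # exactly one character between c and t
star = re.findall(r"co*t", text)        # zero or more o characters
plus = re.findall(r"co+t", text)        # one or more o characters
optional = re.findall(r"coa?t", text)   # the a is optional
char_set = re.findall(r"c[ao]t", text)  # either a or o between c and t
at_start = re.findall(r"^c\w+", text)   # only at the start of the string
at_end = re.findall(r"c\w+$", text)     # only at the end of the string

print(dot)        # ['cat', 'cot']
print(star)       # ['cot', 'ct']
print(plus)       # ['cot']
print(optional)   # ['cot', 'coat']
print(char_set)   # ['cat', 'cot']
print(at_start)   # ['cat']
print(at_end)     # ['ct']
```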

6.1.2 Special Sequences in Regular Expressions

Special sequences in regular expressions represent predefined character
classes that have distinct meanings. These sequences simplify the usage of
common patterns by providing a convenient way to represent them in a more
straightforward and intuitive manner.

Table 6.2 lists some of the special sequences in Python regular expressions with
their purpose. These special sequences are formed using a \ (backslash)
followed by a character as shown in the table.

Special Sequence  Description
\A                Matches the pattern only at the start of the string
\Z                Matches the pattern only at the end of the string
\d                Matches any digit; equivalent to [0-9]
\D                Matches any non-digit; equivalent to [^0-9]
\s                Matches any whitespace character; equivalent to [ \t\n\v\r\f]
\w                Matches any word character (alphanumeric or underscore); shorthand for [a-zA-Z_0-9]
\W                Matches any non-word character; shorthand for [^a-zA-Z_0-9]
\b                Matches the empty string, but only at the beginning or end of a word (a word boundary), where a word character is [a-zA-Z0-9_]
\B                Matches the empty string only when it is not at the beginning or end of a word; opposite of \b

Table 6.2: Special Sequences in Regular Expressions
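
A few of these sequences can be seen in action in the following sketch. The sample strings and variable names are illustrative assumptions, and the re.findall and re.search functions used here are described later in this session.

```python
import re

text = "Order 66 shipped on 2023-05-04"

digits = re.findall(r"\d+", text)                          # runs of digits
separators = re.findall(r"\D+", "30-11-1995")              # runs of non-digits
words = re.findall(r"\w+", "snake_case and CamelCase!")    # runs of word characters
spaces = re.findall(r"\s", "a b\tc\n")                     # single whitespace characters
whole_word_span = re.search(r"\bis\b", "This is it").span()  # "is" as a whole word
starts_with_order = re.search(r"\AOrder", text) is not None
ends_with_04 = re.search(r"04\Z", text) is not None

print(digits)            # ['66', '2023', '05', '04']
print(separators)        # ['-', '-']
print(words)             # ['snake_case', 'and', 'CamelCase']
print(spaces)            # [' ', '\t', '\n']
print(whole_word_span)   # (5, 7)
```

Note that \bis\b skips the letters "is" inside "This", because there is no word boundary between the h and the i.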

6.1.3 Character Classes in Regular Expressions

In Python, character classes in regular expressions are represented as sets of
characters enclosed within square brackets []. These classes allow matching
a range of characters or specific sets. For instance, [a-z] signifies matching
any lowercase letter from 'a' to 'z'. Table 6.3 lists some of the frequently used
character classes within regular expression patterns.

Character Class  Description
[abc]            Matches the letter a or b or c
[0-9]            Matches any digit from 0 to 9, including 0 and 9
[a-z]            Matches any lowercase letter from a to z, including a and z
[A-Z]            Matches any UPPERCASE letter from A to Z, including A and Z
[a-zA-Z]         Matches any lowercase or UPPERCASE letter, including a, A, z, and Z
[a-zA-Z0-9_]     Matches any alphanumeric character and underscore

Table 6.3: Character Classes in Regular Expressions

6.2 Python Methods That Use Regular Expressions

In Python, developers need to import the re module to use regular expressions
in their program. The re module is a Python built-in module that offers
comprehensive functionality for working with patterns and regular expressions.
It equips developers with all the necessary functions to effectively handle and
manipulate patterns and expressions within their Python programs.

The syntax to import the re module is:

import re

When a string passed to one of its functions is not a valid regular
expression, or another error occurs during compilation or matching, the re
module raises the re.error exception. Developers can catch this exception to
handle errors encountered within regular expression operations.
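
As a minimal sketch of this error handling, the following hypothetical helper, is_valid_pattern, reports whether a pattern can be compiled. It relies on the re.compile function, which is described later in this session.

```python
import re


def is_valid_pattern(pattern):
    """Hypothetical helper: report whether `pattern` compiles as a regex."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for malformed patterns, such as an unclosed character class.
        return False
    return True


ok = is_valid_pattern(r"[a-z]+")
bad = is_valid_pattern(r"[a-z")    # the closing bracket is missing

print(ok, bad)    # True False
```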

The re module contains regular expression functions, which are a set of
predefined operations that enable pattern matching and manipulation within
text to handle and process textual data. Some of the functions in this module
are:

• findall
• search
• split
• match
• compile
• finditer

The functions in the re module must be supplied with arguments such as the
search pattern, the string to be searched and optional flags. These optional
flags provide additional control when applying regular expression patterns. The
use of optional flags allows for the utilization of different features and syntax
variations. This enables developers to customize the behavior of regular
expression operations based on their specific requirements.

Consider that you want to search for a word within a string using regular
expressions. You can include the re.I flag (where I stands for ignore case) as
an argument to the search method. This flag enables case-insensitive
searching, allowing the pattern to match regardless of the letter case used in
the string.

Table 6.4 lists some of the optional flags available for regular expression
methods in Python.

Flag  Long Syntax    Meaning
re.I  re.IGNORECASE  Performs case-insensitive matching
re.M  re.MULTILINE   Used with the metacharacters ^ and $. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line (immediately after each newline), and $ matches at the end of the string and at the end of each line (immediately before each newline)
re.S  re.DOTALL      Makes the dot (.) match any character at all, including a newline. Without this flag, the dot matches any character except a newline

Table 6.4: Python Regex Flags
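
The effect of each flag can be compared side by side, as in the following sketch. The sample text and variable names are assumptions for illustration; the last line also shows two flags combined with the bitwise OR (|) operator.

```python
import re

text = "Java is popular.\njava runs on the JVM."

case_sensitive = re.findall(r"java", text)              # exact case only
ignore_case = re.findall(r"java", text, re.I)           # any letter case

first_line_only = re.findall(r"^\w+", text)             # ^ matches at string start
every_line = re.findall(r"^\w+", text, re.M)            # ^ matches at each line start

dot_default = re.search(r"popular.+", text).group()     # the dot stops at the newline
dot_all = re.search(r"popular.+", text, re.S).group()   # the dot also matches the newline

combined = re.findall(r"^java", text, re.I | re.M)      # two flags at once

print(case_sensitive)    # ['java']
print(ignore_case)       # ['Java', 'java']
print(first_line_only)   # ['Java']
print(every_line)        # ['Java', 'java']
print(combined)          # ['Java', 'java']
```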

6.2.1 findall Method

The re.findall method from the re module lets you search the target string
with a regex pattern. This method scans the entire string and retrieves all the
matches encountered, returning them as a list for further processing and
analysis. The syntax for the findall method is:

re.findall(pattern, string, flags=0)

In this syntax:
• pattern is the regular expression pattern to find in the string
• string is the target string
• flags refers to optional regex flags, the default of which is 0, indicating
  that no flags are applied

The re.findall method scans the target string from left to right, as specified
in the regular expression pattern, and returns all matches encountered in the
order they were found. Consider that you want to find all the numbers in a
string. The regex pattern that you will use to do this is \d+.

In this regular expression:

• \d is a special regex sequence that matches any digit from 0 to 9 in a
  target string.
• + is a metacharacter that indicates there can be one or more digits.

Code Snippet 1 shows the code that uses the findall method to search for
all the numbers in the given string.

Code Snippet 1:

import re

string1 = ("Python is a high-level, general-purpose programming language. "
           "Python first appeared on 20 February 1991.")

result = re.findall(r"\d+", string1)

print("Numbers present in the string are:")
print(result)

Figure 6.1 shows output of the code in Code Snippet 1.

Figure 6.1: Finding all Numbers in a Given String

The call re.findall(r"\d+", string1) in Code Snippet 1 does not include the
flags argument. The output displays the two numbers as a list of strings.

Consider another example where you want to match three-letter words at the
beginning of each new line. The regex pattern that you will use to do this is
^\w{3}, which matches the first three word characters of a line. In this
regular expression:
• \w matches any word character: lowercase and uppercase letters, the digits
  0 to 9, and the underscore character. This is equivalent to the character
  class [a-zA-Z0-9_]. You can use either \w or [a-zA-Z0-9_].
• ^ matches a pattern exclusively at the beginning of a string, irrespective
  of whether the string contains the newline character. Therefore, if the
  string contains newline characters, you must use the re.M flag to search for
  the pattern at the beginning of each new line.
• {3} specifies that the preceding element must be matched exactly three
  times.

Code Snippet 2 shows the code that uses the findall method to match any
three-letter word at the beginning of each new line.

Code Snippet 2:

import re

str1 = ("PHP is an open-source programming language created in 1990.\n"
        "SQL is a powerful tool for accessing and manipulating data.")

result = re.findall(r"^\w{3}", str1, re.M)

print(result)

Figure 6.2 shows the output of the code in Code Snippet 2.

Figure 6.2: Matching a Three-Letter Word

Code Snippet 2 displays, as a list, the two three-letter words that appear at
the beginning of each line.

You can use the findall method to search for a sequence that starts with a
specific text followed by zero or more characters. To do this, you must make
use of the asterisk (*) metacharacter. In Python, when * appears within a
pattern, it signifies that the preceding expression or character should repeat
0 or more times, seeking as many repetitions as possible. This behavior is
known as greedy repetition.

Code Snippet 3 shows the code that uses the findall method to match a
sequence that starts with re followed by zero or more characters.

Code Snippet 3:

import re

txt = "Rose is red in color. Read newspaper daily."

x = re.findall(r're.* ', txt)

print(x)

Figure 6.3 shows the output of the code in Code Snippet 3.

Figure 6.3: Searching a String Followed by Many Characters

The code in Code Snippet 3 displays the text starting from the first ‘re’ up
to the last whitespace character in the string, because the greedy .* matches
as many characters as possible.

You can modify the findall method call in Code Snippet 3 to
findall(r're.*? ', txt, re.I). The ? after the * makes the repetition
non-greedy, so each match ends at the first space that follows ‘re’ instead of
the last one. This allows you to find all the occurrences of the characters
‘re’ followed by an arbitrary number of characters with a space at the end.
The case-insensitive flag re.I allows you to find all occurrences of the
characters ‘re’ irrespective of their case.

Figure 6.4 shows the output of the modified code.

Figure 6.4: Searching a String That Starts with a Specific Text
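
The difference between greedy repetition (.*) and non-greedy repetition (.*?) can be sketched with a second, self-contained example. The sample string and variable names are assumptions for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* extends to the last > in the string, giving one long match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first > it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```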

6.2.2 finditer Method

The re.finditer method functions similarly to re.findall by scanning a
string for matches based on a regular expression pattern. However, instead of
returning a list, it produces an iterator that yields match objects. This
iterator provides a convenient way to extract all the matches found in the
string by iterating over the objects sequentially. The scanning process occurs
from left to right.

The syntax for the finditer method is:

re.finditer(pattern, string, flags=0)

The finditer method returns an iterator that yields match objects if the
search is successful. In cases where no match is found, the method still returns
an iterator, but it does not yield any match objects.

Consider that you want to search for all lowercase vowels in a given string.
The regex pattern to do this is r'[aeiou]'. Code Snippet 4 shows the code that
uses the finditer method to find all the vowels in the string
'Computer Languages'.

Code Snippet 4:

import re

s = 'Computer Languages'

vowelmatch = re.finditer(r'[aeiou]', s)

for vowel in vowelmatch:
    print(vowel)

Figure 6.5 shows the output of Code Snippet 4.

Figure 6.5: Searching for all Vowels in a String
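
Each object yielded by the iterator is a match object, so you can read the matched text and its position instead of printing the object itself. The following is a minimal sketch; the variable names positions and no_matches are assumptions.

```python
import re

s = 'Computer Languages'

# Collect the matched vowel and its start index from every match object.
positions = [(m.group(), m.start()) for m in re.finditer(r'[aeiou]', s)]

# When nothing matches, the iterator is simply empty.
no_matches = list(re.finditer(r'[xyz]', s))

print(positions)
print(no_matches)   # []
```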

6.2.3 search Method

The re.search method in Python regex searches for occurrences of a regex
pattern within the entire target string. If a match is found, it returns a match
object instance that represents the first occurrence of the pattern. Thus, this
method stops execution as soon as it finds the first match.

The syntax for the search method is:

re.search(pattern, string, flags=0)

The search method returns a match object that gives access to:
• The start and end index of the successful match, available as a tuple
  through the span method
• The actual matching value, which you can retrieve using the group method

If the search method is unable to find the desired pattern or if the pattern does
not exist within the target string, the method returns None.
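
The two outcomes of a search can be sketched as follows. The sample text and variable names are assumptions for illustration.

```python
import re

text = "Java was first released in 1995."

found = re.search(r"\d{4}", text)      # a match object for the first four digits
missing = re.search(r"Python", text)   # None, because the word is absent

if found is not None:
    year = found.group()       # the matched text
    year_span = found.span()   # (start index, end index) of the match

print(year, year_span)   # 1995 (27, 31)
print(missing)           # None
```

Checking the result against None before calling group or span avoids an error when the pattern is not found.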

Multiple flags can be utilized in regular expressions, offering various
functionalities. For instance, the re.I flag enables case-insensitive searching.
These flags can be combined using the bitwise OR (|) operator to achieve
desired behavior in regular expression operations.

Because the search method stops at the first match, finding every occurrence
of a pattern requires the findall or finditer method. Consider again that you
want to search for all lowercase vowels in a given string. The regex pattern
to do this is r'[aeiou]'. Code Snippet 5 shows the code that uses the
finditer method to find all the vowels in the string 'Computer Languages'.

Code Snippet 5:

import re

s = 'Computer Languages'

vowelmatch = re.finditer(r'[aeiou]', s)

for vowel in vowelmatch:
    print(vowel)

Figure 6.6 shows the output of Code Snippet 5.

Figure 6.6: Searching for all Vowels in a String

Consider an example to search for a word in the given string. Code Snippet 6
lists the code to search for 'Java' in the given string: 'Among the programming
languages, JAVA is a high-level and class-based language. Java
is an object-oriented program.'. The re.I or re.IGNORECASE flag can
be used with the search method to enable case-insensitive searching of the
regex pattern. Because the search method stops at the first match, the word
found here is JAVA.

Code Snippet 6:

import re

string2 = ("Among the programming languages, JAVA is a high-level and "
           "class-based language. Java is an object-oriented program.")

result = re.search(r"Java", string2, re.I)

print("Matching word:", result.group())

Figure 6.7 shows the output of the code in Code Snippet 6.

Figure 6.7: Searching for a Word in the Given String

The group method of the match object is used to display the word found. To
match any character instead of a specific word, you can use the ‘.’
metacharacter in the regular expression. Code Snippet 7 shows the code for
searching any character in a given string.

Code Snippet 7:

import re

strn = ("SQL is a powerful tool for accessing and manipulating data. \n"
        "Java was first released in 1995.")

result = re.search(r'.', strn)

print(result.group())

Figure 6.8 shows the output of Code Snippet 7.

Figure 6.8: Searching for any Character

The code matches and returns the first character. If you add a ‘+’ sign to the
‘.’ metacharacter as in re.search(r'.+', strn), then the search will be
made for one or more repetitions of the same pattern. Here, the pattern is any
character except a newline, so the match extends to the end of the first
line. Figure 6.9 shows the output for the same.

Figure 6.9: Searching for all Characters Excluding Newline

If you want the ‘.’ to match the newline character as well, use
the re.DOTALL or re.S flag as an argument to the search method. Thus, the
search method call will be re.search(r'.+', strn, re.S). Figure 6.10
shows the output for the same.

Figure 6.10: Searching for all Characters Including Newline

6.2.4 match Method

The re.match method in Python specifically searches for a regex pattern at
the beginning of the target string. If a match is found, it returns a match object;
otherwise, it returns None. The syntax for the match method is:

re.match(pattern, string, flags=0)

When the regular expression pattern matches zero or more characters at the
beginning of the string, the re.match method returns a match object. This
match object contains information about the starting and ending positions of
the match, as well as the actual matched value.

Consider that you want to find the four-letter word at the beginning of the
given string. Code Snippet 8 shows the code for this search.

Code Snippet 8:

import re

# Named str1 so that the built-in str type is not shadowed.
str1 = "Java was first released in 1995"

result = re.match(r"\w{4}", str1)

print("Match object: ", result)

if result is not None:
    print("Match word: ", result.group())

Figure 6.11 shows the output of Code Snippet 8.

Figure 6.11: Searching for a Four-Letter Word

The match method returns None if there is no match. In that case, calling the
group method raises an AttributeError, because None has no group method.
Therefore, the code in Code Snippet 8 checks that result holds a match object
before calling the group method to display the located word.
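
The difference between the match and search methods can be sketched with the same sample string. The variable names are assumptions for illustration.

```python
import re

text = "Java was first released in 1995"

# match looks only at the beginning of the string, so the year is not found.
at_start = re.match(r"\d{4}", text)

# search scans the whole string and finds the year.
anywhere = re.search(r"\d{4}", text)

year = anywhere.group() if anywhere is not None else None

print(at_start)   # None
print(year)       # 1995
```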

6.2.5 sub Method

Python regex provides the sub and subn methods, which allow for searching
and replacing patterns in a string. With these methods, you can replace one
or more occurrences of a regex pattern in the target string with a specified
substitute string. The syntax for the sub method is:

re.sub(pattern, replacement, string, count=0, flags=0)

In this syntax:
• pattern is the regular expression pattern to find in the string
• replacement is the string that is to be inserted at every occurrence of
  the matched pattern
• string is the target string
• count is the maximum number of occurrences to be replaced, the default
  of which is zero, indicating that all the occurrences will be replaced
• flags refers to optional regex flags, the default of which is zero,
  indicating that no flags are applied

Pattern, replacement, and string are essential arguments in the search and
replace operations using regular expressions. However, count and flags are
optional arguments that can be utilized for additional customization.

The sub method returns a new string by replacing the occurrences of the
pattern in the original string with the specified replacement string. If the pattern
is not found, the original string is returned without any changes.

Consider that you want to replace every whitespace character in a string with
a colon (:). Code Snippet 9 shows the code for this replacement.

Code Snippet 9:

import re

target_str = "Java was first released in 1995."

res_str = re.sub(r"\s", ":", target_str)

print(res_str)

Figure 6.12 shows the output of Code Snippet 9.

Figure 6.12: Replacing Whitespaces with Colon

If you change the sub method to subn, the method returns a tuple that contains
the new string and the number of replacements made. Figure 6.13 shows the
output of Code Snippet 9 after changing
res_str = re.sub(r"\s", ":", target_str) to
res_str = re.subn(r"\s", ":", target_str).

Figure 6.13: Displaying Number of Replacements
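
The optional count argument and the value returned by subn can be sketched as follows. The variable names are assumptions for illustration.

```python
import re

target_str = "Java was first released in 1995."

# count limits how many occurrences are replaced.
first_two = re.sub(r"\s", ":", target_str, count=2)

# When the pattern is not found, the original string is returned unchanged.
unchanged = re.sub(r"\d{5}", "#", target_str)

# subn returns both the new string and the number of replacements.
new_str, replacements = re.subn(r"\s", ":", target_str)

print(first_two)               # Java:was:first released in 1995.
print(new_str, replacements)   # Java:was:first:released:in:1995. 5
```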

6.2.6 compile Method

The compile method in Python converts a regular expression pattern, specified
as a string, into a regex pattern object, re.Pattern. You can then use this
pattern object to search for matches within various target strings using regex
methods such as match or search.

By compiling a regular expression into a regex object using re.compile, you
can efficiently search for the same pattern in multiple target strings without
rewriting the pattern each time. Thus, the pattern matching operations
become easy and more efficient. The syntax for the compile method is:

re.compile(pattern, flags=0)

Consider that you want to locate any four consecutive digits in the target
string. Code Snippet 10 shows the code for achieving this.

Code Snippet 10:

import re

str1 = ("Java was first released in 1995.\n"
        "PHP is an open-source programming language created in 1990.")

string_pattern = r"\d{4}"
regex_pattern = re.compile(string_pattern)
print(type(regex_pattern))

result = regex_pattern.findall(str1)
print(result)

Figure 6.14 shows the output of Code Snippet 10.

Figure 6.14: Compiling a Regex Pattern into Pattern Object
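
Reusing one compiled pattern across several target strings can be sketched as follows. The list of sentences reuses sample strings from this session; the variable names are assumptions.

```python
import re

# Compile the pattern once.
year_pattern = re.compile(r"\d{4}")

sentences = [
    "Java was first released in 1995.",
    "PHP is an open-source programming language created in 1990.",
    "SQL is a powerful tool for accessing and manipulating data.",
]

# Apply the same pattern object to every sentence.
years = [year_pattern.findall(sentence) for sentence in sentences]

print(years)   # [['1995'], ['1990'], []]
```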

6.2.7 split Method

The split method from the re module allows you to split the string into
substrings based on occurrences of the regex pattern. This split operation results
in a list of separated substrings. The syntax for the split method is:

re.split(pattern, string, maxsplit=0, flags=0)

In this syntax:
• pattern is the regular expression pattern on which to split the target
  string
• string is the target string which must be split
• maxsplit is the maximum number of splits to perform, the default of which
  is zero, indicating that the string is split at every occurrence
• flags refers to the regex flags, the default of which is zero, indicating
  that no flags are applied

The regular expression pattern and the target string are required parameters
when using the re.split method. However, the maxsplit and flags are
optional parameters that can be used to customize the splitting behaviour.

The re.split method divides the target string at each match of the regular
expression pattern and returns the resulting substrings as a list. If the
pattern is not found in the target string, then the method returns the unsplit
string itself as the only element of the resulting list.

When a regular expression provided in the re.split method contains capturing
parentheses, the text of all the groups in the pattern is also included in the
resulting list.
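
The effect of capturing parentheses on the split result can be sketched as follows. The sample date and variable names are assumptions for illustration.

```python
import re

date = "30-11-1995"

# Without a group, the separators are discarded.
without_group = re.split(r"-", date)

# With a capturing group, each matched separator is kept in the list.
with_group = re.split(r"(-)", date)

print(without_group)   # ['30', '11', '1995']
print(with_group)      # ['30', '-', '11', '-', '1995']
```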

Consider that you want to split the given string into a list of words, using
whitespace as the separator. Here, you must use the \s special sequence in the
regex along with the + metacharacter. Adding the + symbol will split the target
string on one or more occurrences of the whitespace characters. Code
Snippet 11 shows the code to achieve this.

Code Snippet 11:

import re

str1 = "PHP is an open-source programming language created in 1990."

listofWords = re.split(r"\s+", str1)

print(listofWords)

Figure 6.15 shows the output of Code Snippet 11.

Figure 6.15: Using the split method with \s+ Sequence

Suppose you have a date and you want to split it into day, month, and year.
In this case, you can use the \D special sequence to match any non-digit
character. Code Snippet 12 shows the code to achieve this.

Code Snippet 12:

import re

target_string = "30-11-1995"

result = re.split(r"\D", target_string, maxsplit=1)


print(result)

result = re.split(r"\D", target_string, maxsplit=2)


print(result)

The code uses the maxsplit parameter to show the different result in each
case. Figure 6.16 shows the output of the code in Code Snippet 12.

Figure 6.16: Using the split method with \D Sequence

A string can be split in multiple ways using different regex patterns. Consider that you want to split the given string into words using the \s+ pattern and the [\b\W\b]+ pattern. In each of these cases, the resulting list will be different.

Outside square brackets, the \b special sequence matches the empty string at the boundary of a word. Inside a character class such as [\b\W\b], however, \b stands for the backspace character, so for ordinary text this pattern behaves like \W+. The \W special sequence matches any character that is not a letter, digit, or underscore.

Code Snippet 13 shows the code to split the given string on runs of non-word characters, resulting in a list of word tokens. It also shows the same target string split around the whitespace characters.
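The meaning of \b inside square brackets can be verified with a small sketch. The sample sentence is an assumption chosen for illustration.

```python
import re

sample = "Java is a high-level, Class-based language."

# Inside [...], \b is the backspace character, not a word boundary.
backspaces = re.findall(r"[\b]", "a\bb")
print(backspaces == ["\x08"])

# For text without backspace characters, both patterns split identically.
with_b = re.split(r"[\b\W\b]+", sample)
without_b = re.split(r"\W+", sample)
print(with_b == without_b)

# The trailing '' appears because the sentence ends with a separator.
print(without_b)
```

If the empty string at the end is not wanted, it can be dropped with a list comprehension such as [w for w in without_b if w].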

Code Snippet 13:

import re

str1 = ("Java is a high-level, Class-based, object-oriented "
        "programming language.\nJava was first released in 1995.")

result = re.split(r"[\b\W\b]+", str1)


print("Words with multiple boundary delimiters:", result)

listofWords = re.split(r"\s+", str1)


print("\nWords split using whitespace: ", listofWords)

Figure 6.17 shows the output of Code Snippet 13.

Figure 6.17: Using the split method with Two Different Patterns

6.3 Tokenization in Python

Splitting up a phrase, sentence, paragraph, or text document into smaller units is known as tokenization. The small units of text are called tokens. These tokens can represent individual words, terms, or other meaningful units within the text.

6.3.1 Word Tokenization

Python supports word tokenization using the split method or the findall
method. However, a limitation of using the split method for word tokenization
is that it does not treat punctuation marks as individual tokens.

Code Snippet 14 shows the code to split the given string into multiple words
using the split method and the findall method.

Code Snippet 14:

import re

text = ("PHP is an open-source programming language created "
        "in 1990. SQL is a powerful tool for accessing and "
        "manipulating data. Java was first released in 1995. When "
        "you combine all these three together to create an "
        "application, you can easily leverage their individual "
        "strengths.")

result = re.split(r"[\b\W\b]+", text)

print("Word tokenization using split method:\n", result)

tokens = re.findall(r'\w+|[\.|\!]+', text)


print("\nWord tokenization using findall method:\n",
tokens)

Figure 6.18 shows the output of the code in Code Snippet 14.

Figure 6.18: Using the split method and findall Method

Notice the difference in the output of both methods. Word tokenization using the findall method also includes the punctuation marks as separate items in the list. In the \w+|[\.|\!]+ pattern, the pipe (|) symbol outside the square brackets separates two alternatives: \w+ matches one or more word characters, and [\.|\!]+ matches one or more dots or exclamation marks. Inside the square brackets, the pipe is treated as a literal character, so the character class could be written more simply as [.!].
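As a variation, the sketch below keeps every punctuation mark as its own token, not only dots and exclamation marks. The pattern and the sample sentence are assumptions made for illustration and are not part of Code Snippet 14.

```python
import re

sample = "Hello, world! It's 2023."

# \w+      a run of word characters (letters, digits, underscore)
# [^\w\s]  any single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2023', '.']
```

Note that the apostrophe splits "It's" into three tokens; whether that is desirable depends on the application.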

6.3.2 Sentence Tokenization

You can use the split method of the pattern object returned by the compile method to tokenize the given text into individual sentences. Code Snippet 15 shows the code to split the given string into multiple sentences.

Code Snippet 15:

import re

text = ("PHP is an open-source programming language created "
        "in 1990. SQL is a powerful tool for accessing and "
        "manipulating data. Java was first released in 1995. When "
        "you combine all these three together to create an "
        "application, you can easily leverage their individual "
        "strengths.")

sentences = re.compile('[.!?] ').split(text)

i = 1
for sen in sentences:
    print(i, '. ', sen)
    i = i + 1

Figure 6.19 shows the output of Code Snippet 15.

Figure 6.19: Splitting the Given String into Sentences

An advantage of splitting with a regular expression is the ability to specify multiple separators simultaneously. This gives you more control compared to the built-in string split method, which accepts only a single fixed separator. The pattern '[.!?] ' passed to the compile method indicates that the text is split wherever a period, exclamation mark, or question mark is followed by a space. The code displays the list of sentences by iterating through it with a simple for loop.
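Splitting on '[.!?] ' removes the punctuation mark that ends each sentence. One possible alternative, sketched here with an assumed sample string, uses a lookbehind so that the punctuation stays attached to its sentence.

```python
import re

sample = "PHP was created in 1990. SQL is powerful! Is Java older?"

# (?<=[.!?]) is a lookbehind: it requires ., ! or ? just before the split
# point but does not consume it, so only the whitespace is removed.
sentences = re.split(r"(?<=[.!?])\s+", sample)

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

This is still a simple heuristic. Abbreviations such as "Dr." followed by a space would also be treated as sentence endings.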

6.4 Summary

• Regular expressions serve as powerful tools for pattern matching, searching, and manipulating text.
• Regular expressions contain metacharacters, special sequences, and character classes.
• Metacharacters are special characters that impact the interpretation of the regular expressions that include them.
• Character classes are represented as sets of characters enclosed within square brackets [] that allow matching a range of specific sets of characters.
• Special sequences represent predefined character classes that have distinct meanings.
• Python has a built-in re module that offers comprehensive functionality for working with patterns and regular expressions.
• Some of the functions in the re module include findall, search, split, match, compile, and finditer.
• Splitting up a phrase, sentence, paragraph, or text document into smaller units is known as tokenization.
• Python supports tokenization using the split and findall methods from the re module.

Test Your Knowledge

1. Which of the following metacharacters is used to match patterns at the end of the string?

a. #
b. $
c. ^
d. *

2. Which of the following special sequences is used to match any non-digit?

a. \d
b. \dg
c. \D
d. \Z

3. Which of the following regex patterns will split the target string on the
occurrence of one or more whitespace characters?

a. \w+
b. \s+
c. \W+
d. \S+

4. Which of the following methods returns the first occurrence of a pattern match within the target string?

a. re.search
b. re.match
c. re.sub
d. re.subn

5. Which of the following methods in Python is utilized to convert a regular expression pattern, specified as a string, into a regex pattern object?

a. re.split
b. re.interpret
c. re.compile
d. re.compute

Answers to Test Your Knowledge

1 b
2 c
3 b
4 a
5 c

Try it Yourself

Open Jupyter notebook and perform the given tasks using regular
expressions:

a. Consider the given string str = "Franklin ate 10 chocolates whereas the CHOCOLATES ate by David are 10 times more than Franklin". Write a Python program to replace all occurrences of "10" with "ten" in the given string, str.

b. Using the string str, write a Python program to replace all occurrences of the word "chocolates" with "candies" irrespective of case.

c. Using the string str, write a Python program to check whether the given string str contains the word "times".

d. Consider the string str2 = "Python programming language". Write a Python program that returns a word containing 'y' in the given string str2.

e. Consider the given string str3 = "colors: red, white, blue, orange, green". Write a Python program to extract all characters after the occurrence of the character ":" from the string str3.

f. Consider the given list list_color = ['red', 'white', 'blue', 'black', 'pink']. Write a Python program to filter all elements starting with "b" and ending with "k".

g. Consider the given string str4 = 'cat on325the88wall'. Write a Python program to split the given string str4 based on the consecutive sequence of digits or whitespace characters. The output must be ['cat', 'on', '325', 'the', '88', 'wall'].

h. Consider the given string str5 = ' GO language runs faster, compiles quicker, and allows for shorter software development lifecycles.'. Write a Python program to replace all the occurrences of spaces, commas, or dots in the string str5 with a semicolon.

Learning Objectives

In this session, students will learn to:

• Explain Web development using Flask
• List the steps for Flask setup and installation
• Explain the process of building a Flask application

Flask is a compact and lightweight Python Web framework that facilitates the
development of Web applications with the help of useful tools and features. A
Web framework is a collection of libraries and tools that enables you to quickly
create Web applications without the need for writing code from scratch. In this
session, you will learn how to create Web applications in Python using the Flask
framework.

7.1 Introduction to Web Development Using Flask

Web development is about creating and building applications that run on an Intranet or on the Internet. Python offers many Web frameworks for developing Web applications. One such framework is Flask.

Python Web application development adheres to the Web Server Gateway Interface (WSGI) standard. WSGI is a common standard that describes how a Web server communicates with Python Web applications. Werkzeug is a WSGI toolkit that implements request and response objects along with other utility functions, and thereby facilitates Web application development. Jinja2 is a Python templating engine. Flask uses it to render HTML templates, filling in dynamic values on the server before the resulting page is sent to the browser.

As Flask is a micro-framework, it does not include built-in support for functionalities such as database access and Web form validation. However, it supports many extensions or external packages using which you can add the required functionality to your Flask applications. For example, you can use Flask extensions to add support for Web form validation or connecting to a database.
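To make the WSGI idea concrete, the sketch below writes a bare WSGI application without Flask and calls it by hand, the way a Web server would. It uses only the wsgiref package from the standard library. The names application, start_response, and captured are illustrative choices, not something Flask requires you to write.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A bare WSGI application: the server calls this once per request."""
    body = b"Hello from a plain WSGI application"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]

# Build a minimal fake request, as a WSGI server would for a real one.
environ = {}
setup_testing_defaults(environ)

captured = {}

def start_response(status, headers):
    # A real server would send these to the browser; here they are stored.
    captured["status"] = status
    captured["headers"] = headers

response_body = b"".join(application(environ, start_response))
print(captured["status"], response_body)
```

A Flask application object is itself a WSGI application of this shape. Flask and Werkzeug take care of the environ dictionary and the start_response call so that view functions can simply return a string.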

7.2 Flask Setup and Installation

To install Flask, use Python 3.8 or a higher version. It is recommended that you create virtual environments to isolate and manage multiple Python projects that use different versions of Python libraries.

Each virtual environment acts as an independent group of Python libraries for the respective Python project. Therefore, a Flask framework installed in one virtual environment will not affect the Flask framework installed in another virtual environment.

There are various tools for creating virtual environments. One such tool is virtualenv. To use this tool, you must first install it.

• Install virtualenv Tool

You can install the virtualenv tool from the command prompt. Depending on how Python is installed, this command may require administrator rights for execution. The command to install virtualenv on Windows is:

pip install virtualenv

On Linux or Mac OS, type sudo before pip as in:

sudo pip install virtualenv

Figure 7.1 shows the installation of virtualenv on Windows.

Figure 7.1: Installing virtualenv

• Create Virtual Environment

After you have installed virtualenv, create a virtual environment. To do this, perform the steps given:

At the Windows command prompt, create a directory for your Python project. Navigate to the newly created directory.

Figure 7.2 shows the commands to create a directory named flask_app and make it the current directory.

Figure 7.2: Creating Project Directory

Create the virtual environment by running the virtualenv command within the newly created project directory. The syntax is:

virtualenv [virtual directory]

In the syntax, [virtual directory] is the name that you provide for the
virtual environment being created.

Figure 7.3 shows the creation of a virtual environment named env within the
flask_app directory. Note that the virtual environment creation creates a
root directory named env and other installation sub-directories. It generates
a batch file named activate.bat in the path env\Scripts.

Figure 7.3: Creating Virtual Environment

• Activate Virtual Environment

Once you have created the virtual environment, you must activate it. To activate the virtual environment, run the batch file activate.bat.

Figure 7.4 shows the activation of env on Windows by running the activate.bat file from the env\Scripts folder.

Figure 7.4: Activating Virtual Environment

• Install Flask

Once the virtual environment is activated, the next step is to install Flask. To install Flask, run the given command within the virtual environment.

pip install Flask

Figure 7.5 shows the installation of Flask within env using the pip package installer.

Figure 7.5: Installing Flask
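After these steps, a quick check can confirm that the interpreter is running inside the virtual environment and that it can find Flask. The sketch below relies only on the standard library. The function names in_virtual_environment and flask_is_installed are hypothetical helpers written for this illustration.

```python
import importlib.util
import sys

def in_virtual_environment():
    """Return True when the interpreter runs inside a virtual environment."""
    # Recent virtualenv releases (and the built-in venv module) point
    # sys.prefix at the environment folder and keep the original
    # installation in sys.base_prefix. Older virtualenv releases set
    # sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix

def flask_is_installed():
    """Return True when this interpreter can locate the flask package."""
    return importlib.util.find_spec("flask") is not None

print("Inside a virtual environment:", in_virtual_environment())
print("Flask is installed:", flask_is_installed())
```

If the first line prints False, the environment was probably not activated in the current command prompt window.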


7.3 Building a Flask Application

Consider that you want to build a sample Web application in Python using Flask
that displays a message on the browser.

7.3.1 Create a Sample Flask Application

To create a sample Web application using Flask, perform the steps:

1. Type the code given in Code Snippet 1 in a text editor such as Notepad:

Code Snippet 1:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def fun_print():
    return "Website created using Python Flask"

if __name__ == '__main__':
    app.run(debug=True)

2. Save the file as sample_app.py in your respective project directory. For example, C:\Users\Linda\flask_app. Then, close the file.

3. From the command prompt, run sample_app.py using the given command.

python sample_app.py

Figure 7.6 shows the execution of the sample_app.py application and the
displayed output.

Figure 7.6: Running the Flask Application

The output displays several bits of information. These include:
• The name of the application you are currently running.
• The environment in which the application is being run.
• The status of the debug mode. In this example, the code passes debug=True, so the debug mode is on.
• The Uniform Resource Locator (URL) of the application. This example uses http://127.0.0.1:5000/, where 127.0.0.1 is the IP address of the localhost, and 5000 is the port number.

4. Finally, verify that your application is running. To do this, open a Web browser and type your application's URL in the address bar. For example, type http://127.0.0.1:5000 as shown in Figure 7.7. The string "Website created using Python Flask" is displayed as a response. This indicates that your Flask application is operational.

Figure 7.7: Response from Flask Application

Now, let us understand the code given in Code Snippet 1.

Code Snippet 1:
• Imports the Flask class from the flask package.

from flask import Flask

• Invokes the Flask constructor to create an instance of the Flask application by passing the current module's name (__name__) as a parameter to the constructor. The code uses the newly created Flask instance, app, to handle Hypertext Transfer Protocol (HTTP) requests and responses.

app = Flask(__name__)

• Uses the @app.route('/') decorator to bind the URL / to the fun_print function. This implies that fun_print will respond to HTTP requests for the URL / by returning a string, which Flask sends as the HTTP response. When users open the home page of the Web server in a Web browser, the browser displays the returned string.

@app.route('/')
def fun_print():
    return "Website created using Python Flask"

Note that the Flask class's route function is a decorator that binds a URL to a Python function. The route function registers the Python function as a view function. Flask in turn converts the return value of the view function into an HTTP response to respond to the incoming requests.

The complete syntax for the route function is:

app.route(rule, **options)

In the syntax, the rule parameter represents the URL string to be bound to the Python function. options represents keyword arguments that are passed on to the underlying Rule object.

• Uses the run method to launch the application on the local development server. By setting debug=True, the code enables the debug mode to display detailed error messages if the application encounters any error.

if __name__ == '__main__':
    app.run(debug=True)

The complete syntax for the run method is:

app.run(host, port, debug, **options)

All the parameters in the run method are optional. Table 7.1 describes the parameters of the run method.

Parameter   Description
host        A string value that specifies the host name to listen on.
            127.0.0.1 (localhost) is the default. Set the host to
            '0.0.0.0' to make the server publicly accessible.
port        An int value that specifies the Web server's port
            number. The default port is 5000.
debug       A bool value that enables or disables debug mode. By
            default, debug mode is off.
options     Additional keyword arguments that are forwarded to the
            underlying Werkzeug server.

Table 7.1: Run Method Parameters

To close the application:

• Close the browser window.
• Press Ctrl+C in the command prompt window to stop Flask from serving the current application, sample_app.py. This will cause the command prompt to return to the prompt.

7.3.2 Using Variables in Flask


By including variable sections in the rule parameter of the route function,
you can pass arguments to the associated function. In this way, you can pass
multiple variables to the associated function and build URLs dynamically.
The syntax for including a variable section as part of the rule parameter is:

<variable_name>

Optionally, you can use a converter to specify the type of the argument as
in <converter:variable_name>. Table 7.2 lists the converter types that can
be included in the URL generation.

Converter   Description
string      accepts any text without a slash
int         accepts positive integers
float       accepts positive floating-point values
path        accepts any text including slashes
uuid        accepts UUID strings

Table 7.2: Converter Types


The default type is string. For example, Code Snippet 2 creates a dynamic URL by including a variable section, <labname>, as part of the rule URL string '/welcome/<labname>'. The fun_welcome function receives the value passed in the labname variable as a keyword argument and creates a customized message based on the argument. The function then returns the customized message as an HTTP response.

Code Snippet 2:

from flask import Flask

app = Flask(__name__)

@app.route('/welcome/<labname>')
def fun_welcome(labname):
    return "Welcome to %s Lab" % labname

if __name__ == '__main__':
    app.run(debug=True)

Save the file as welcomelab.py. From the command prompt, run the welcomelab.py Web application using the command: python welcomelab.py.

When you type the URL http://localhost:5000/welcome/Python in the browser, the string value "Python" in the URL is passed to the fun_welcome function, and the "Welcome to Python Lab" message is displayed.

Figure 7.8 shows the outcome of Code Snippet 2.

Figure 7.8: Outcome of Code Snippet 2
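The int and path converters from Table 7.2 can be tried in the same way. The following is a minimal sketch, assuming Flask is installed. The routes /session/<int:number> and /files/<path:filepath> and their view functions are hypothetical names invented for this illustration. The sketch uses Flask's test client, so no development server has to be started.

```python
from flask import Flask

app = Flask(__name__)

@app.route('/session/<int:number>')
def fun_session_number(number):
    # number arrives as an int, so arithmetic works without conversion
    return "Next session is %d" % (number + 1)

@app.route('/files/<path:filepath>')
def fun_file(filepath):
    # the path converter keeps the slashes inside the value
    return "Requested file: %s" % filepath

# The test client sends requests to the application directly.
client = app.test_client()
session_text = client.get('/session/6').get_data(as_text=True)
file_text = client.get('/files/notes/week1.txt').get_data(as_text=True)
bad_status = client.get('/session/six').status_code

print(session_text)   # Next session is 7
print(file_text)      # Requested file: notes/week1.txt
print(bad_status)     # 404, because "six" does not match the int converter
```

A value that does not fit the converter never reaches the view function. Flask answers with a 404 Not Found response instead.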

7.3.3 URL Building in Flask


You can use the url_for function to generate URLs dynamically for a particular function. The first parameter of the url_for function accepts the name of the view function for which the URL is being generated. The remaining keyword arguments supply the values for the variable parts of the URL rule.
The syntax for the url_for function is:

url_for('<function_name>', <key>=<value>)

Consider that a library has three logins namely, Student, Faculty, and Guest.
Code Snippet 3 demonstrates the use of url_for function.

Code Snippet 3:

from flask import Flask, redirect, url_for

app = Flask(__name__)

@app.route('/guest/<gid>')
def fun_guest(gid):
    return ("Welcome to the Department of Computer Science "
            "Library. You have logged in as Guest with id: %s" % gid)

@app.route('/student/<sid>')
def fun_student(sid):
    return ("Welcome to the Department of Computer Science "
            "Library. You have logged in as Student with id: %s" % sid)

@app.route('/faculty/<fid>')
def fun_faculty(fid):
    return ("Welcome to the Department of Computer Science "
            "Library. You have logged in as Faculty with id: %s" % fid)

@app.route('/Library/<login>/<id>')
def fun_library(login, id):
    # Redirect to the view function that matches the login type
    if login == 'guest':
        return redirect(url_for('fun_guest', gid=id))
    elif login == 'student':
        return redirect(url_for('fun_student', sid=id))
    elif login == 'faculty':
        return redirect(url_for('fun_faculty', fid=id))
    # A view function must always return a response
    return "Unknown login type: %s" % login, 404

if __name__ == '__main__':
    app.run(debug=True)

You can save the file as library_login.py and run the Web application
using the command:

python library_login.py

Figure 7.9 shows the output of this command.

Figure 7.9: Execution of Code Snippet 3

Figure 7.10, Figure 7.11, and Figure 7.12 show the outcome of Code Snippet 3.

Type the URL as http://127.0.0.1:5000/guest/G256 in the browser where the login is guest and gid is G256.

Figure 7.10: Library with Login as guest and gid as G256

Type the URL as http://127.0.0.1:5000/student/S456 in the browser where the login is student and sid is S456.

Figure 7.11: Library with Login as student

Type the URL as http://127.0.0.1:5000/faculty/F784 in the browser where the login is faculty and fid is F784.

Figure 7.12: Library with Login as faculty and fid as F784

In the code, the fun_library(login, id) function accepts the login and id values as its arguments from the URL.

Based on the login value, the if-elif condition in the fun_library function redirects the browser to the URL of the corresponding view function, fun_guest, fun_student, or fun_faculty. The url_for function builds that URL, and the id value is passed to it as the gid, sid, or fid keyword argument. For example, a request for /Library/guest/G256 is redirected to /guest/G256.
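To see what url_for produces without opening a browser, the function can be called inside a test request context. The sketch below assumes Flask is installed and reuses only the fun_student view from Code Snippet 3. The page keyword argument is a made-up extra value added for illustration.

```python
from flask import Flask, url_for

app = Flask(__name__)

@app.route('/student/<sid>')
def fun_student(sid):
    return "Student %s" % sid

# url_for needs an application or request context; test_request_context
# provides one without a running server.
with app.test_request_context():
    student_url = url_for('fun_student', sid='S456')
    # Keyword arguments that are not part of the rule become query parameters.
    student_url_with_query = url_for('fun_student', sid='S456', page=2)

print(student_url)              # /student/S456
print(student_url_with_query)   # /student/S456?page=2
```

Because the URL is built from the function name, the rule string can later be changed in one place without editing every link that points to it.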

7.3.4 Using Hypertext Markup Language (HTML) Templates


The Web applications created so far displayed plain text messages without using HTML. Let us now focus on how to render an HTML file from a Python Flask script.

The render_template function from Flask allows the use of the Jinja2 template engine. This engine helps in rendering a Web page from a Python script using HTML templates.

Consider a Flask script named app_welcome.py that contains a view function called homepage linked to the URL '/'. In the homepage function, you can use the render_template function to return the welcome.html file as the response from the view function.

Code Snippet 4a shows the code for app_welcome.py.

Code Snippet 4a:

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def homepage():
    return render_template('welcome.html')

if __name__ == "__main__":
    app.run(debug=True)

Code Snippet 4b shows the code for welcome.html.

Code Snippet 4b:

<html>
<head>
<style>
h1 {
border: 2px #eee solid;
color: brown;
text-align: center;
padding: 10px;
}
</style>
</head>
<body>
<h1>Welcome to Python Programming Lab</h1>
</body>
</html>

When you execute the app_welcome.py script:

• Flask will search for the HTML template in the templates folder under the same folder in which the Python script is stored. In this case, it is the flask_app\templates folder.

• You must import the render_template function to apply the HTML template to a Web page.

Figure 7.13 shows the output of the execution of app_welcome.py.

Figure 7.13: Execution of Code Snippet 4

Figure 7.14 shows the output of typing the URL http://127.0.0.1:5000/ in the browser.

Figure 7.14: Output of Code Snippet 4

7.3.5 HTTP Methods in Flask


The World Wide Web's data communication is built on the HTTP protocol. This
protocol defines a variety of data retrieval techniques from a given URL.

Table 7.3 lists the various HTTP methods available to transfer data between
HTML pages.

Method    Description
GET       Requests data from the server. Any data sent with the
          request is appended to the URL and is therefore visible.
HEAD      Functions similar to the GET method. However, the
          response does not have a body.
POST      Sends data, such as HTML form data, to the server in the
          body of the request.
PUT       Replaces the target resource with newly uploaded content.
DELETE    Removes the target resource specified by a URL.

Table 7.3: HTTP Methods Used in Flask

When you use an HTML template to render a page, the template can contain placeholders for variables and expressions that are substituted with their actual values. The Flask script supplies these values when it renders the template. By default, a route in Flask responds only to HTTP GET requests. However, this can be changed by passing a list of methods in the methods argument of the route decorator.

Let us write a Flask script that retrieves user input from a Web form on the welcome.html page. The form submits the input using the HTTP POST method, and the data retrieved is then displayed on the sessiondisplay.html page.

Here,

• app_welcome.py is the script file kept in the flask_app folder.
• The HTML file welcome.html is kept in the templates folder. This file accepts the user input and submits it using the HTTP POST method.
• The HTML file sessiondisplay.html is kept within the templates folder. This file displays the user input retrieved and passed on by the Flask script.

Modify the code in app_welcome.py script from Code Snippet 4a as given in Code Snippet 5a.

Code Snippet 5a:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def homepage():
    return render_template('welcome.html')

@app.route("/sessiondisplay", methods=['POST', 'GET'])
def fun_session():
    # cs stays None when the page is requested without form data
    cs = None
    if request.method == 'POST':
        cs = request.form.get('cursession')
    return render_template('sessiondisplay.html', csession=cs)

if __name__ == "__main__":
    app.run(debug=True)

In this case, the route accepts both POST and GET requests. The request object from Flask indicates which method was used through request.method and exposes the submitted form data through request.form. The request object has to be imported as given in Code Snippet 5a.
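The effect of the methods list can be observed with Flask's test client. This is a minimal sketch, assuming Flask is installed. The /echo and /readonly routes and their view functions are hypothetical and exist only for this illustration.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route('/echo', methods=['GET', 'POST'])
def fun_echo():
    if request.method == 'POST':
        # form fields arrive in the body of the request
        return "POST received: %s" % request.form.get('cursession')
    # query parameters arrive as part of the URL
    return "GET received: %s" % request.args.get('cursession')

@app.route('/readonly')   # no methods list, so POST is not accepted
def fun_readonly():
    return "read only"

client = app.test_client()
post_text = client.post('/echo', data={'cursession': 'Flask'}).get_data(as_text=True)
get_text = client.get('/echo?cursession=Regex').get_data(as_text=True)
rejected = client.post('/readonly').status_code

print(post_text)   # POST received: Flask
print(get_text)    # GET received: Regex
print(rejected)    # 405 (Method Not Allowed)
```

Form data sent with POST is read from request.form, whereas values placed in the URL of a GET request are read from request.args.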

Code Snippet 5b shows the code for the welcome.html page.

Code Snippet 5b:

<html>
<head>
<title>Welcome</title>
<style>
h1 {
border: 2px #eee solid;
color: brown;
text-align: center;
padding: 10px;
}
</style>
</head>
<body>
<form action="/sessiondisplay" method="post">
<h1>Welcome to Python Programming Lab</h1>
<div style="text-align:center">
<label>Enter the current Python programming
session:</label>
<input type="text" name="cursession">
<input type="submit" value="Submit">
</div>
</form>
</body>
</html>

Code Snippet 5c shows the code for the sessiondisplay.html page.

Code Snippet 5c:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Current Session</title>
<style>
h3 {
font-family: arial;
font-size: 16pt;
color: blue;
text-align: center;
}
</style>
</head>
<body>
<h3>Current Session in Python Programming Lab
is:{{csession}}</h3>
</body>
</html>

Here,
• The form in the welcome.html file uses the HTTP POST method and specifies /sessiondisplay as its action. When the form is submitted, the browser sends the form data to the /sessiondisplay URL.
• A session name is accepted and passed on to the Flask script through the input field named cursession. By passing this name as a parameter to the request.form.get function, you can retrieve the value entered in the form. The value received is stored in a variable, cs.
• The value of this cs variable is shown in the placeholder for csession in the sessiondisplay.html file.
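The placeholder substitution can also be tried without creating a templates folder. The sketch below assumes Flask is installed and uses the render_template_string function with an inline string in place of the sessiondisplay.html file. The PAGE constant is a hypothetical stand-in for that file.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline stand-in for a template file such as sessiondisplay.html
PAGE = "<h3>Current Session in Python Programming Lab is: {{ csession }}</h3>"

with app.test_request_context():
    html = render_template_string(PAGE, csession="Flask")
    # Jinja2 escapes HTML special characters in substituted values here
    escaped = render_template_string(PAGE, csession="<b>Flask</b>")

print(html)
print(escaped)
```

The second call shows that markup typed by a user is escaped and displayed as text instead of being interpreted by the browser.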

Execute the command python app_welcome.py in the command prompt. This will start the Web application.

When you type the URL http://127.0.0.1:5000/ in a browser window, the welcome.html page is rendered as shown in Figure 7.15.

Figure 7.15: Outcome of Code Snippet 5b

Enter the session name and click Submit. The sessiondisplay.html page is
rendered as shown in Figure 7.16.

Figure 7.16: Outcome of Code Snippet 5c

7.4 Create a Web Application

Let us now create a small Web application for an online bookshelf, BooksOnly.
There are three categories of books available in this library as shown in Figure
7.17.

Figure 7.17: BooksOnly Web Application

This Web application contains Python scripts including Flask scripts, HTML files,
images, and Cascading Style Sheet (CSS) files. You must create the application
folder structure as shown in Figure 7.18 which contains various files as listed.

Figure 7.18: Web Application Folder Structure

Flask can also be used to deliver static files such as text, Portable Document Format (PDF), audio, video, and image files. By default, Flask serves these files from a folder named static inside the application folder. Here, in the images folder under static, you will store the book images as in /static/images/book1.jpeg. You can then use relative file paths to link to these static files. However, it is recommended that you construct URL references to static files using the url_for function. To reference a static file, pass the endpoint name static and the keyword parameter filename=, followed by the path of your static file within the static folder, to the url_for function.

For example, to refer to an image from the Static folder use the code given
in Code Snippet 6.

Code Snippet 6:

<img src="{{url_for('static', filename='example_image.png')}}">
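The URLs that url_for generates for static files can be inspected from Python as well. This is a small sketch, assuming Flask is installed and the default static folder name is used. The file names are the ones used by the BooksOnly pages.

```python
from flask import Flask, url_for

app = Flask(__name__)

with app.test_request_context():
    image_url = url_for('static', filename='images/book1.jpeg')
    css_url = url_for('static', filename='css/book.css')

print(image_url)   # /static/images/book1.jpeg
print(css_url)     # /static/css/book.css
```

The filename value is a path relative to the static folder, which is why sub-folders such as images and css appear inside it.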

Figure 7.19 shows the site map of the BooksOnly application.

Figure 7.19: Site Map

The books.html page is the homepage of the application. Code Snippet 7 lists the code for the books.html page.

Code Snippet 7:

<html>
<head>
<link rel="stylesheet" href="{{url_for('static',
filename='css/book.css')}}">
</head>
<body>
<form action="/book_cat" method="post">
<h1>Welcome to the Bookshelf</h1>
<div class="content">
<div class="left-col">
<img src="{{url_for('static',
filename='images/book1.jpeg')}}" class=image />
</div>
<div class="right-col">
<button type="submit" value="submit">BOOKSONLY</button>
</div>
</div>
</form>
</body>
</html>

The code in Code Snippet 7 uses:

• A style sheet, book.css, to render the books.html page.

• An image, book1.jpeg, from the images folder to show an open book image in the center of the page.

• A button, BooksOnly, which when clicked will take you to the next page, bookcat.html. This is achieved by submitting the HTML form, which sends an HTTP POST request to the URL given in its action attribute.

Next, let us design the books category page, bookcat.html.

Code Snippet 8 lists the code for the bookcat.html page.
Code Snippet 8:

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="{{url_for('static',
filename='css/bookcat.css')}}">
</head>
<body>
<div class="container">
<div class="right">
<h1>Book Categories</h1>
<table align="center" style="width:50%">
<form action="/book_thriller" method="post">
<tr>
<td align="center"><a
href="{{url_for('fun_thrill')}}">Thriller</a></td>
</tr>
</form>
<form action="/book_fiction" method="post">
<tr>
<td align="center">Fiction</td>
</tr>
</form>
<form action="/book_adult" method="post">
<tr>
<td align="center">Young Adults</td>
</tr>
</form>
</table>
</div>
<div class="left">
<img src="{{url_for('static', filename='images/book2.png')}}"
class=image /> </div>
</div>
</body>
</html>

Here, the code uses:

• A style sheet, bookcat.css, to render the bookcat.html page.

• An image, book2.png, from the images folder to show a book image.

• The url_for function to create the link to the fun_thrill view function. This function will be defined later in the Python script.

The next page is the book_thriller.html page. Code Snippet 9 lists the
code for this page.

Code Snippet 9:

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="{{url_for('static',
filename='css/bookstyle.css')}}">
</head>
<body>
<form action="/book_back" method="post">
<div class="content">
<div class="container">

<div class="column1">
<h2>Thriller</h2>
<table class="center">
<tr>
<th>Image</th>
<th>Title</th>
<th>Author</th>
<th>Price</th>
<th>Age group</th>
<th>Description</th>
</tr>
<tr>
<td><img src="{{url_for('static',
filename='images/secret.png')}}" class=image /> </td>
<td>The Secret of The House On The Hill</td>
<td>Julia Harris</td>
<td>$10</td>
<td>15+</td>
<td style="width: 500px;height: 100px">Emily
Mayweather, a 24 year old florist, was now on the search for
a house to call 'home'. After, finally having saved enough to
live in their own home, Emily and her husband, Clarke, sought
house after house to find the perfect one. But, what if, the
perfect one wasn't so perfect after all? Uncovering the
secrets of her beautiful home might take some time, but Emily
was prepared to tackle all obstacles head on! Or, was
she?</td>
</tr>
<tr>
<td><img src="{{url_for('static',
filename='images/death.png')}}" class=image /> </td>
<td>Death And Me</td>
<td>Noah Wilson</td>
<td>$16.73</td>
<td>16+</td>
<td style="width: 500px;height: 100px">Estella
Jackson, now a college student, visits her therapist weekly
for her appointments. On the outside, it might seem like she
was a regular student with maybe some issues in her life that
she wanted to sort out, but that was very off the mark.
Estella had been visiting Jeanette, her therapist since she
was 14. It was a part of her routine after the summer
vacation of the year she turned 14, after all. How the nerve-
wracking adventure trip she was to go on changed into her
worst nightmare, she would never have known. But, reliving
it? One would think that would be way worse. Maybe, the safe
environment of Jenny's office was what made it better? Oh,
who was she kidding? It sucked either way.</td>
</tr>
<tr>
<td><img src="{{url_for('static',
filename='images/frozenlake.png')}}" class=image /> </td>
<td>The Frozen Lake By The Woods</td>
<td>Daniel Brown</td>
<td>$12</td>
<td>12+</td>
<td style="width: 500px;height: 100px">Drew Martinez,
and her whole family had planned a beautiful vacation in a
huge cabin next to woods and a beautiful lake for two weeks
during their vacation. 16 year old Drew, her siblings, and
cousins, all in one house, it was going to be great, wasn't
it? No. Nope. No way. Huge dramas and fights aside, the
relaxing vacation was soon turned into a living nightmare
when her younger brother, Luke, fell into the lake while
skating on thinner ice and found a dead body in the lake. And
leaving a bunch of kids with a 'real adventure dancing in
front of them' as Luke says, has never been great idea. Drew
didn't want to do this, but do they ever listen to her?
Nope.</td>
</tr>
</table>
</div>
</div>
</div>
<p><a href="{{url_for('fun_back')}}">Back to Book
Categories</a></p>
</form>
</body>
</html>

The book_thriller.html page uses:

• A table to display information such as image, title, author, price, age
group, and a short description for three books in the thriller category.

• A style sheet, bookstyle.css, to render the book_thriller.html
page.

• A hyperlink, Back to Book Categories, which when clicked takes
you to the books category page, bookcat.html. The link is generated
with url_for('fun_back') and, like any hyperlink, sends an HTTP GET
request to the /book_back route.

Similarly, you can easily create the HTML pages for the other two categories.
Let us now create the binding Python script that will make the application work.
Code Snippet 10 lists the code for the app_books.py script file.

Code Snippet 10:

from flask import Flask, render_template, request

app = Flask(__name__)


@app.route('/')
def homepage():
    # Welcome page with the BOOKSONLY button
    return render_template('books.html')


@app.route("/book_cat", methods=['POST', 'GET'])
def fun_cat():
    if request.method == 'POST':
        # Reached by submitting the form in books.html
        return render_template('bookcat.html')
    # A direct GET request (typing the URL) shows the same page
    return render_template('bookcat.html')


@app.route("/book_back")
def fun_back():
    return render_template('bookcat.html')


@app.route("/book_thriller")
def fun_thrill():
    return render_template('book_thriller.html')


if __name__ == "__main__":
    # debug=True reloads the application automatically after code changes
    app.run(debug=True)

In this script:

• The render_template function is imported to render the respective
HTML pages in the view functions, such as fun_back and fun_thrill.

• The request object is imported so that the fun_cat view function can
check whether it was called with the HTTP POST or GET method.

You can similarly add two more functions, fun_fiction and
fun_adult, to this script to include the book_fiction.html and
book_adult.html pages.
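
A minimal sketch of these two additional view functions is given here. It
assumes that the templates book_fiction.html and book_adult.html exist in
the templates folder and that the routes are named /book_fiction and
/book_adult, as in the form actions of Code Snippet 8. A fresh Flask
object is created only to keep the sketch self-contained; in the
application, the two functions would be added to app_books.py before the
if __name__ == "__main__" block.

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route("/book_fiction")
def fun_fiction():
    # Assumes templates/book_fiction.html exists
    return render_template("book_fiction.html")


@app.route("/book_adult")
def fun_adult():
    # Assumes templates/book_adult.html exists
    return render_template("book_adult.html")


# List the URL rules registered so far (Flask's built-in static route is included)
registered_rules = sorted(rule.rule for rule in app.url_map.iter_rules())
print(registered_rules)
```

To make the Fiction and Young Adults entries in bookcat.html clickable,
they would also need hyperlinks built with url_for('fun_fiction') and
url_for('fun_adult'), in the same way as the Thriller entry.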

Now, the application is ready. However, before viewing the output let us add
three CSS pages to the Book_details\static\css folder as mentioned
earlier in Figure 7.19.

Code Snippet 11 shows the book.css code for styling the books.html page.

Code Snippet 11:

.image {
display: block;
margin-left: auto;
margin-right: auto;
width: 40%;
}
button {
background-color: #05c46b;
border: 1px solid #777;
border-radius: 2px;
font-family: inherit;
vertical-align: middle;
font-size: 36px;
display: block;
width: 100%;
}
.content {
width: 100%;
position: absolute;
top: 10%;
}

.left-col {

margin-top: 5%;
}

.right-col {
float: right;
margin-right: 10%;
margin-top: -5%;
display: flex;
align-items: center;
}
body {
background-color: lightyellow;
}
h1 {
border: 2px #eee solid;
color: brown;
text-align: center;
padding: 10px;
}

Code Snippet 12 shows the bookcat.css code for styling the bookcat.html
page.

Code Snippet 12:

table, th, td {
border: 1px solid;
font-size: 30px;
}

body {
background-color: lightblue;
}



h1 {
border: 2px #eee solid;
color: brown;
text-align: center;
padding: 10px;
}
.image {
display: block;
margin-left: auto;
margin-right: auto;
width: 50%;
}
.container {
height: 75%;
overflow: hidden;
}
.right {
width: 800px;
float: right;
margin-top: 100px;
}
.left {
margin-top: 125px;
width: auto;
overflow: hidden;
}

Code Snippet 13 shows the bookstyle.css code for styling the
book_thriller.html page.

Code Snippet 13:

.container {
display: grid;
grid-template-columns: 100px 150px 200px;
text-align: center;
}
.column1 {
float: left;
width: 90%;
padding: 10px;
height: auto;
background-color: lightblue;
}



body {
background-color: grey;
}
.content {
width: 100%;
position: absolute;
top: 10%;
}
.center {
display: table;
height: 100%;
width: 100%;
}
h2 {
border: 2px #eee solid;
color: Red;
text-align: center;
padding: 10px;
}

All the files are ready in their respective folders. To run the application, create
the virtual environment and run the activate.bat script in command prompt.
Run the command python app_books.py as shown in Figure 7.20 to start the
application.

Figure 7.20: Execution of Python app_books.py

Now, open a browser and type in the URL http://127.0.0.1:5000/ as shown
in Figure 7.21.



Figure 7.21: Welcome Page

Click the BOOKSONLY button to open the Book Categories page shown in
Figure 7.22.

Figure 7.22: Book Categories Page



To view the Thriller page, click the Thriller link. The Thriller category page
appears as shown in Figure 7.23.

Figure 7.23: Thriller Page

Click the Back to Book Categories link to navigate to the Book Categories
page as shown in Figure 7.22.



7.5 Summary

• Flask is a Python Web framework that facilitates the development of
Web applications.
• Python Web application development adheres to WSGI.
• Separate virtual environments must be created to isolate different Web
applications in Flask.
• virtualenv is a tool that helps you create a virtual environment.
• The Flask object must be imported from the flask package.
• The app.route decorator function binds a URL to the view function
with which it is associated.
• The app.run method is used to launch the application on the local
server.
• You must include variables in the rule parameter of the route function
to pass arguments to the associated view function.
• The render_template auxiliary function from Flask returns a view file as
response from the view function.
• A route can be made to respond to multiple HTTP methods by passing a
list of methods as an argument to the route decorator.



Test Your Knowledge

1. What does the WSGI protocol stand for?

a. World Server Gateway Interface


b. Web System Gateway Interface
c. Web Server Gateway Interface
d. Wide Server Gateway Interface

2. Which of the following options is the default port number on which the
Flask application runs?

a. 127
b. 187
c. 500
d. 5000

3. Which of the following functions enables developers to build and
generate URLs on a Flask application?

a. url_for
b. for_url
c. url_flask
d. url_forFlask

4. Which of the following lines of code makes the application reload
automatically after each code modification?

a. app.run(debug = True)
b. app.route(debug = True)
c. app.route(DEBUG)
d. app.run()

5. Which of the following assets are stored in the static folder during Web
development in Python?

a. CSS files
b. JavaScript files
c. Image files
d. HTML files



Answers to Test Your Knowledge

1 c
2 d
3 a
4 a
5 a, b, c



Try it Yourself

1. In the command prompt, install the virtual environment and activate it.
Within the activated environment, install Flask.

2. Create a simple Web application in Python to display the information in
the browser as “Welcome to Python programming lab!”.

3. Consider that you have created a Web application in Python Flask for a
school where the students and staff can log in. Write a Python program
to generate dynamic URL bindings in Flask.” On the other hand, if the
user executes the application as localhost/staff, then the
application should redirect to a function to display the message “This
is staff login.”.

4. Create a simple Web application in Python Flask to get the user’s
information such as Name, Age, Qualification, and Email and to display
this information on a new page. Design a Registration page in HTML to
get the details of users and click Submit. When the users click the Submit
button the application should get the typed information using the HTTP
Post method and display the information on a new page.

5. Add the book_fiction.html and book_adult.html pages to the
Templates folder in the BooksOnly Web application. Apply the
bookstyle.css file to these pages. Add code to the app_books.py
script to define the fun_fiction and fun_adult view functions. These
functions should redirect to the book_fiction.html and
book_adult.html pages from the bookcat.html.



Learning Objectives

In this session, students will learn to:


⮚ Describe Web scraping using Python
⮚ List the rules of Web scraping
⮚ Explain the Web scraping libraries in Python
⮚ Explain the process of implementing Web scraping using Python

Web scraping is the process of extracting data from Web pages for various
purposes that include data collection, price monitoring, and academic
research. Python provides several libraries that can be used for Web Scraping.
This session will provide an overview of Web scraping along with the rules to be
followed while Web scraping. It will explore various Web scraping libraries, their
installation, and the process of implementing Web scraping in Python.



8.1 Overview of Web Scraping

Web scraping is a method used to extract data from Web pages. Web
scraping is also known as Web harvesting and Web data mining. Data from
news Websites, social networking platforms, and e-commerce Websites can
be collected using Web scraping. Web scraping helps in acquiring market
data, monitoring competitor prices, and gathering data for research. Though
these processes can be done manually, automation helps in faster retrieval of
data.

Two broad categories of Web scraping are:

• Static Web scraping: A method of data extraction from Websites whose
content does not change frequently is called static Web scraping.

• Dynamic Web scraping: A method of data extraction from Websites whose
content changes frequently is called dynamic Web scraping. These Web pages
may use JavaScript to load or update content.

8.1.1 Web Scraping in Python

Python, JavaScript, C++, Java, and Perl are some of the languages available
for writing code to automate Web scraping. Python is one of the most
preferred languages for Web scraping due to its usability, extensive
collection of libraries, and simple syntax that makes the task of scraping
easier. Data science, corporate intelligence, and investigative reporting
are some of the areas that benefit greatly from Web scraping.

Gathering data from commercial or statistical Websites with incredibly dense
numbers, variables, and data is difficult. Python makes use of a package
called Pandas to assist developers in converting gathered data into concrete
and practical information. The collected information can be converted into
various file formats including Comma Separated Values (CSV) and JavaScript
Object Notation (JSON).
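
As a rough illustration of these two formats, the following sketch converts
a small list of records into CSV and JSON text. Note the assumptions: it
uses only the csv and json modules of the Python standard library in place
of Pandas, and the records are typed in by hand from the Thriller page of
Session 7 rather than scraped.

```python
import csv
import io
import json

# Records typed in by hand for illustration, as a scraper might collect them
books = [
    {"title": "Death And Me", "author": "Noah Wilson", "price": 16.73},
    {"title": "The Frozen Lake By The Woods", "author": "Daniel Brown", "price": 12.0},
]

# CSV: one header row followed by one row per record
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "author", "price"])
writer.writeheader()
writer.writerows(books)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects
json_text = json.dumps(books, indent=2)

print(csv_text)
print(json_text)
```

Writing csv_text or json_text to a file with open() would produce a .csv or
.json file that other tools can read.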

Some of the libraries used for Web scraping are Scrapy, Beautiful Soup, and
Requests. These libraries provide tools for data extraction that are incredibly
quick and effective.



8.1.2 Legality of Web Scraping

The legality of Web scraping is determined by various factors that include the
location of the scraper, the terms of service of the Website, and the purpose
of scraping. A developer must analyze all these factors and then proceed with
Web scraping if it complies with applicable laws and terms of service.

8.1.3 Guidelines for Web Scraping

When scraping Web pages, the ethical and legal factors that must be
considered are:
● Terms of service: To make sure that scraping is not forbidden, it is vital to
check the terms of service of the Website to be scraped. While some
Websites forbid scraping, others might allow it with restrictions. Public
Websites may allow scraping while private ones do not as they may
have sensitive data.
● Copyright: Copyright regulations must be followed when a Website
holds documents or other content protected by copyright. Ways to
comply include requesting permission from the copyright holder or
making fair use of the material.
● Effect on Website performance: The Website's performance may be
affected if it is scraped frequently. The Website's servers may become
overloaded or experience other issues if it is excessively scraped.

The robots.txt file is a text file placed on the Website by the Webmasters
and Website owners. This file provides instructions regarding permission for
scraping and the frequency at which scraping is allowed. Web scraping
activities must be performed ethically by adhering to guidelines.
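
The rules in a robots.txt file can be checked from Python before scraping.
The following is a minimal sketch that uses the urllib.robotparser module
of the standard library. The robots.txt content, the example.com URLs, and
the user agent name ExampleScraper are made up for illustration; for a real
Website, the file would be loaded from the site itself with the set_url and
read methods instead of being parsed from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of downloading a real file
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# ExampleScraper is a made-up user agent name
allowed_public = parser.can_fetch("ExampleScraper", "http://example.com/books/thriller")
allowed_private = parser.can_fetch("ExampleScraper", "http://example.com/private/data")
delay_seconds = parser.crawl_delay("ExampleScraper")

print(allowed_public)    # may this page be scraped?
print(allowed_private)   # pages under /private/ are disallowed by the sample rules
print(delay_seconds)     # requested pause between two requests, in seconds
```

A scraper that respects these rules would skip any URL for which can_fetch
returns False and wait at least the crawl delay between requests.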



8.2 Libraries for Web Scraping

Libraries are collections of ready-made modules and functions that can be
imported and reused. The popular Web scraping libraries in Python are:

• Beautiful Soup
• Requests
• lxml
• Scrapy
• Selenium

8.2.1 Beautiful Soup

Beautiful Soup is a Python library used for processing Hypertext Markup
Language (HTML) and Extensible Markup Language (XML) documents. It is an
extremely flexible library that can be used to pull information from a variety of
Web pages. It is simple to use, making it easier for the scrapers to search for
specific data and extract information from Web pages. Beautiful Soup is well
suited for scraping static HTML content.

8.2.2 Requests

Requests is a HyperText Transfer Protocol (HTTP) library that enables users to
send HTTP requests to gather information from Web sources. It is an extremely
strong and adaptable tool that can be used to fetch information from
Websites in the form of HTML or JSON. However, Requests only fetches the
content and does not parse HTML. You must use libraries such as Beautiful
Soup, Scrapy, or lxml to subsequently parse and process the HTML fetched
using Requests.



8.2.3 lxml

lxml is a simple yet effective Python library used for parsing HTML and XML
documents. lxml works best in combination with Requests to scrape data from
Web pages. It also enables usage of Cascading Style Sheets (CSS) and XML
Path Language (XPath) selectors to retrieve data from HTML and XML. It is best
suited for scraping large volumes of structured data and complex
documents.

8.2.4 Scrapy

Web crawling refers to the process where bots browse the World Wide Web to
discover and index Web pages. Scrapy is a framework made for Web crawling
and Web scraping. Web scraping tasks include making HTTP requests, indexing
links, fetching data, and processing the fetched data. The strength and
scalability offered by Scrapy make it ideal for performing various Web scraping
tasks.

8.2.5 Selenium

Selenium is a powerful tool that enables programmatic control of a Web
browser for tasks that include Web testing and Web scraping. Data from
Websites that require user input such as login pages and shopping carts can
be scraped using Selenium. It is best suited for scraping dynamic Websites
that use JavaScript to load and update content.

8.3 Python Web Scraping Implementation

Consider that you want to extract data from the book_thriller.html Web
page that was built using Code Snippet 9 of Session 7. The extracted data must
be stored in an Excel file. Let us use the libraries, Requests and Beautiful
Soup, to scrape the content.

8.3.1 Installing Requests and Beautiful Soup

To install requests, in the Command Prompt, type the command as:

pip install requests



The command executes as shown in Figure 8.1.

Figure 8.1: Installation of Request

Similarly, to install beautifulsoup4, run the command as:

pip install beautifulsoup4

8.3.2 Steps in Web Scraping

A Web scraper is a specialized program created to extract data swiftly and
efficiently from one or more Websites. Depending on the applications, Web
scrapers come in a wide range of designs and levels of complexity.

Steps involved in Web scraping are:



Identify the URL of the Web Page to be Scraped

The first step of Web scraping is to identify the URL of the page to be scraped.
To identify the URL of the Web page to be scraped:

1. Open the browser and navigate to the Web page which is to be
scraped. In this case, it is the 'Thriller' Web page as shown in Figure
8.2.

Figure 8.2: Thriller Web Page

2. Copy the URL of this Web page from the Address bar as shown in Figure
8.3.

Figure 8.3: URL of the Thriller Web Page



Inspect the Page

Having identified the URL of the Web page to be scraped, the next step is to
inspect the page. Inspection helps you to understand the structure of the HTML
file. In an HTML file, the data is layered within tags. Inspection of the source
code of the HTML file must be done to locate the tag which holds the required
information.

To inspect the source code of a Web page:

1. Right-click the Web page.


2. In the popup menu that appears, select the Inspect option as shown in
Figure 8.4.

Figure 8.4: Inspect Option



The source code of the Web page appears as shown in Figure 8.5.

Figure 8.5: Source Code of the Web Page

Write the Code

After examining the HTML source, the next step is to write code to extract the
data. The Requests library of Python offers a variety of predefined functions
that facilitate interaction with Web pages by means of HTTP requests, including
get, post, put, patch, and head requests. HTTP requests can be used to fetch
data from a specified URL or to push data to a server. The get method is
specifically used to fetch information.
Steps to be performed to accomplish this task are:

1. The get method of the Requests library helps to get data from the server
using the URL. Code Snippet 1 lists the code to get data from the
http://127.0.0.1:5000/book_thriller Web page.



Code Snippet 1:

import requests

# Send an HTTP GET request to the Thriller page
res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)          # status of the request
print(res_obj.content)  # raw HTML of the page as bytes

Figure 8.6 shows the code and the result of the execution of the code.

Figure 8.6: Data Received from the Web Page

2. Before parsing the raw HTML into meaningful information, confirm that
the page was fetched successfully. Code Snippet 2 prints the response
object, which shows the status of the request, followed by the raw HTML
of the page.

Code Snippet 2:

import requests

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)          # status of the request
print(res_obj.content)  # raw HTML of the page as bytes

The command executes as shown in Figure 8.7. The status of this request is
displayed as <Response [200]>. The number 200 signifies that the request
operation was successful.
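
The meaning of a status number can be looked up with the HTTPStatus
enumeration of the standard library, as the following small sketch shows.
The describe_status helper is a made-up name for illustration. In a
Requests response object, the number itself is available as the
status_code attribute.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard phrase for an HTTP status code and whether it means success."""
    phrase = HTTPStatus(code).phrase
    is_success = 200 <= code < 300   # the 2xx range signals success
    return phrase, is_success


for code in (200, 404, 500):
    print(code, describe_status(code))
```

A scraper would typically continue to the parsing step only when the
status is in the 2xx range.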



Figure 8.7: Execution of get Method

Extract and Store the Information

The HTML form of the Web page that was fetched in the previous step must be
arranged in a legible form. The prettify function of the Beautiful Soup library
helps in formatting the HTML document by adding proper indentations,
resulting in a more readable structure. Code Snippet 3 shows the code for
prettifying the HTML.

Code Snippet 3:

import requests
from bs4 import BeautifulSoup

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)
# Parse the raw HTML with Python's built-in HTML parser
soup = BeautifulSoup(res_obj.content, 'html.parser')
# Print the parsed HTML with proper indentation
print(soup.prettify())

The BeautifulSoup class is called with the content of the response object and
the 'html.parser' string, and the result is stored in the soup object. Execution
of the prettify function on the soup object makes the HTML readable. Figure
8.8 shows the output of executing the prettify method.



Figure 8.8: HTML in Readable Form

You can observe from Figure 8.8 that the data is stored in an HTML table,
arranged in rows and columns. The next step is to extract each row from the
HTML table. The find function and the find_all function help in achieving
this. Code Snippet 4 shows the code to extract the rows from the HTML code.

Code Snippet 4:

import requests
from bs4 import BeautifulSoup

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)
soup = BeautifulSoup(res_obj.content, 'html.parser')
# find returns the first <table>; find_all returns every <tr> inside it
result = soup.find('table').find_all("tr")
print(result)

The <th> tag refers to a header cell of the HTML table, <tr> tag refers to a
row in the HTML table and <td> tag refers to a table cell that contains data.
The code in Code Snippet 4 finds all <tr> tags and extracts each row from the
HTML table. Figure 8.9 shows the output of execution of this code.



Figure 8.9: Extracting Row of the Table in HTML

After extracting data from each row of the HTML table, the data in each cell
must be read. This is achieved by iterating over each row and reading each
cell in the row. Scraping is done from the second row as the first row contains
only the headers. The code in Code Snippet 5 performs this task. In this code,
result is a sequence object which holds all the rows. The for loop iterates
over each row, reads all the cells in the row, and prints the data in each cell.

Code Snippet 5:

import requests
from bs4 import BeautifulSoup

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)
soup = BeautifulSoup(res_obj.content, 'html.parser')
result = soup.find('table').find_all("tr")

# result[1:] skips the first row, which contains only the headers
for row in result[1:]:
    cells = row.find_all(['td'])
    print(cells)

The code executes as shown in Figure 8.10.



Figure 8.10: Data Extracted from Each Cell

It can be noted from Figure 8.10 that the data is extracted with tags. The data
must be cleaned up to remove these tags, unwanted spaces, and images.
Code Snippet 6 picks and prints only the text in the cells. It skips the first
cell of each row, which contains only the image.

Code Snippet 6:

import requests
from bs4 import BeautifulSoup

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)
soup = BeautifulSoup(res_obj.content, 'html.parser')
result = soup.find('table').find_all("tr")

for row in result[1:]:
    cells = row.find_all(['td'])
    # cells[1:] skips the image cell; strip=True removes surrounding spaces
    celltext = [cell.get_text(strip=True) for cell in cells[1:]]
    print(celltext)

Figure 8.11 shows the execution of this code.

Figure 8.11: Scraping Only the Texts
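
The extraction steps covered so far can also be tried without a running
Flask server by parsing an HTML string directly. The following is a minimal
sketch under that assumption: SAMPLE_HTML is a shortened, hand-written
stand-in for the Thriller page, not the page itself, and Beautiful Soup
must be installed.

```python
from bs4 import BeautifulSoup

# Hand-written, shortened stand-in for the Thriller page
SAMPLE_HTML = """
<table class="center">
  <tr><th>Image</th><th>Title</th><th>Author</th><th>Price</th></tr>
  <tr>
    <td><img src="/static/images/death.png" class="image" /></td>
    <td>Death And Me</td><td>Noah Wilson</td><td>$16.73</td>
  </tr>
  <tr>
    <td><img src="/static/images/frozenlake.png" class="image" /></td>
    <td> The Frozen Lake By The Woods </td><td>Daniel Brown</td><td>$12</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find("table").find_all("tr")

books = []
for row in rows[1:]:                 # skip the header row
    cells = row.find_all("td")
    # skip the first cell (image only) and strip surrounding spaces
    books.append([cell.get_text(strip=True) for cell in cells[1:]])

print(books)
```

Because the parsing logic does not depend on where the HTML came from, the
same loop works on res_obj.content once the request succeeds.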



Store Data in a File

The code in Code Snippet 6 extracted the data in the cells and printed it on
the screen. The data extracted from the Web page must be written into a file.
To store the data in an Excel file, the Openpyxl library must be installed and
imported. A few lines of code must be introduced to Code Snippet 6 to store
the extracted data into an Excel file. The lines of code that must be included
are:

• import openpyxl – To import the Openpyxl library.
• excel=openpyxl.Workbook() – To create a new blank workbook
object, call the Workbook class of the Openpyxl library.
• sheet=excel.active – To get the active sheet of the Excel workbook,
use the active property of the workbook object.
• sheet.title="Thriller-books" – To set the name or title of the
worksheet, use the title attribute.
• sheet.append(['Title', 'Author','Price','Age group',
'Description']) – To add the column headers as the first row of the
Excel sheet, use the append method.
• sheet.append(celltext) – To append the scraped data to the
sheet, use the append method.
• excel.save("thriller.xlsx") – To save the workbook as an Excel file,
use the save method.

Code Snippet 7 has these steps included in bold along with the code in Code
Snippet 6. To install the Openpyxl library, in the command prompt, run the
command as shown:

pip install openpyxl

The command executes as shown in Figure 8.12.



Figure 8.12: Installing the Openpyxl Library

Code Snippet 7:

import openpyxl
import requests
from bs4 import BeautifulSoup

# Create a new workbook and name its active sheet
excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = "Thriller-books"
# Column headers form the first row of the sheet
sheet.append(['Title', 'Author', 'Price', 'Age group', 'Description'])

res_obj = requests.get('http://127.0.0.1:5000/book_thriller')
print(res_obj)
soup = BeautifulSoup(res_obj.content, 'html.parser')
result = soup.find('table').find_all("tr")

for row in result[1:]:
    cells = row.find_all(['td'])
    celltext = [cell.get_text(strip=True) for cell in cells[1:]]
    # Each book becomes one row in the sheet
    sheet.append(celltext)

excel.save("thriller.xlsx")
print("Successfully Scraped and saved the contents")

The code is executed as shown in Figure 8.13.



Figure 8.13: Saving the Scraped Data in the Excel File

The Excel file gets created in the current working directory. To find the current
working directory, in the Jupyter notebook, run the command given in Code
Snippet 8.

Code Snippet 8:

import os
os.getcwd()

The command is executed as shown in Figure 8.14.

Figure 8.14: Locate the Current Working Directory

Note that the current working directory is C:\Program Files
(x86)\Python\Scripts\. Figure 8.15 shows the thriller.xlsx file is found
in the current working directory.
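
The location of the saved file can also be worked out in code. The
following sketch uses the pathlib module of the standard library as an
alternative to os.getcwd and builds the absolute path at which a relative
file name such as thriller.xlsx would be saved. The variable names are
chosen for illustration only.

```python
from pathlib import Path

# Current working directory, equivalent to os.getcwd() but as a Path object
current_dir = Path.cwd()

# Absolute path at which the relative name "thriller.xlsx" would be saved
output_path = current_dir / "thriller.xlsx"

print(current_dir)
print(output_path)
print(output_path.exists())   # True only if the file has already been created
```

Passing a full path of your choice to excel.save, instead of a bare file
name, stores the workbook in that folder regardless of the current working
directory.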



Figure 8.15: Thriller.xlsx File

Open the thriller.xlsx file to see the scraped data from the Web Page.
Figure 8.16 shows the content of the thriller.xlsx file.

Figure 8.16: Contents of the Excel File



8.4 Summary

⮚ Web scraping is the method of obtaining data from Web pages.
⮚ The simple syntax and extensive libraries of Python make it ideal for Web
scraping.
⮚ The terms of service of the Website, reason for scraping, and methods
used for scraping determine the legality of scraping.
⮚ The most well-known Python libraries used for Web scraping are Beautiful
Soup, Requests, lxml, Scrapy, and Selenium.
⮚ Web scraping includes identifying the URL of the Web page, inspecting
the page, writing code to extract data, and storing the data in the
required format.



Test Your Knowledge

1. Which of the following Web scraping libraries enables users to send HTTP
requests to gather information from Web sources?

a. Beautiful Soup
b. Requests
c. HTTP Request
d. lxml

2. Which of the following Web scraping libraries can efficiently crawl
Websites and extract structured data from their pages?

a. Selenium
b. Scrapy
c. Beautiful Soup
d. Playwright

3. Which of the following statements are true about Web scraping in


Python?

a. It is a method of retrieving data from Websites


b. The terms of service of the Website determine whether Web
scraping is legal or not
c. The get method of the Requests object must be used to retrieve
data from the Web page
d. Static Web scraping is a method of data extraction from Websites
that changes frequently

4. Which line of code is used to retrieve information from the given server
using a given URL?

a. res_obj = requests.get(url)
b. res_obj = get_requests(url)
c. res_obj = get_httprequest(url)
d. res_obj = httprequests.get(url)

5. Which of the following is a Python library that is used to read from an
Excel file or write to an Excel file?

a. Pythonpyxl
b. Openpyexel
c. Openpyxl
d. Pythonexcel



Answers to Test Your Knowledge

1 b
2 b
3 a, b, c
4 a
5 c



Try it Yourself

1. Install the Requests and Beautiful Soup libraries.

2. Use the URL, https://www.javatpoint.com/wikipedia-module-in-python to
extract the contents.

a. Navigate to the given Website and inspect the page to scrape its
contents.
b. Write Python code to retrieve information from the given URL using
the Requests library.
c. Write Python code to create a Beautiful Soup object to parse the
raw HTML code which is obtained using the Requests library.
d. Write Python code to extract the Web page title without the tag.
e. Write Python code to extract and display the header tags h2
and h3.
f. Write Python code to store the extracted contents namely, the
title and the header tag contents h2 and h3 into an Excel file,
header.xlsx.



Learning Objectives

In this session, students will learn to:

⮚ Describe Tkinter and its purpose


⮚ Explain widgets and their properties
⮚ Describe the ways to create widgets
⮚ Explain the process of creating simple applications using widgets

Graphical User Interface (GUI) based software development makes
interaction with the end user easy. It enables quick and easy creation of
user-friendly applications as it supports readily available interface
elements such as text boxes, list boxes, check boxes, and so on. Many tools are
available for developing GUI-based software applications. Python provides a
library called Tkinter which aids in software application development using GUI.

This session aims to familiarize you with application development using Tkinter.
It introduces the various widgets available as part of the Tkinter library along
with their properties. It also explains the process of creating a simple
application using the Tkinter widgets.



9.1 What is Tkinter?

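Tkinter is the standard GUI toolkit for Python. It is part of the Python
standard library and is included with the official Python installers for
Windows and macOS, so it usually requires no separate installation. Tkinter
is not installed through pip. If it is missing, which can happen on some
Linux distributions, install it with the system package manager. For
example, on Debian or Ubuntu, use the command:

sudo apt-get install python3-tk

The library should be imported into the code to make use of its features. You
can then create instances of the objects in the library to facilitate rapid
application development. Tkinter is used to create desktop applications
rather than Web applications.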
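
Whether Tkinter is available in a given Python installation can be checked
from Python itself. The following is a small sketch of such a check. The
variable name tk_version is chosen for illustration.

```python
try:
    import tkinter
    tk_version = tkinter.TkVersion   # version of the underlying Tk toolkit
except ImportError:
    tk_version = None                # Tkinter is not available in this installation

if tk_version is None:
    print("Tkinter is not installed")
else:
    print("Tkinter is available, Tk version", tk_version)
```

Alternatively, running the command python -m tkinter in the command prompt
opens a small test window if Tkinter is installed correctly.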

9.2 Widgets

Widgets are controls which make the presentation in GUI possible. They are
similar to elements in a Hypertext Markup Language (HTML) page. You can use
widgets in your applications to facilitate users to enter details, make choices,
or read some information. Thus, widgets make the end-user experience with
the application very smooth and easy. Radio buttons, text boxes, and lists are
some examples of widgets. Figure 9.1 lists some of the widgets available in
Python.

Figure 9.1: Widgets in Python



Table 9.1 lists some of the widgets and their purpose.

Widget          Description
Button          Displays a button that performs the specified action when
                clicked
Canvas          Provides space to draw shapes, such as lines, ovals,
                polygons, and rectangles in applications
Checkbutton     Allows the user to select multiple options from a list of
                options
Entry           Accepts a single line of text from a user
Frame           Is a container widget to organize other widgets
Label           Provides a caption, as text or an image, for other widgets
Listbox         Provides a list of options to a user
Menubutton      Displays menus in applications
Menu            Provides the commands that are contained inside a
                Menubutton
Message         Displays multi-line, non-editable text to the user
Radiobutton     Displays a list of options from which the user can select
                one
Scale           Provides a slider widget
Scrollbar       Adds scrolling capability to various widgets, such as list
                boxes
Text            Displays text in multiple lines
Toplevel        Provides a separate window container
Spinbox         Is a variant of the standard Tkinter Entry widget, which
                can be used to select from a fixed number of values
PanedWindow     Is a container that may contain any number of panes,
                arranged horizontally or vertically
LabelFrame      Is a container for complex window layouts
messagebox      Is a module (tkinter.messagebox) that displays message
                boxes in applications

Table 9.1: Widgets and their Purposes



The syntax for creating a widget on the screen is:

Widget_name = WidgetName(container, options)

Here,

⮚ Widget_name is a variable name for the widget.
⮚ WidgetName is the widget class, such as Label, Checkbutton, or
  Radiobutton.
⮚ container is the parent window or frame in which the widget will be
  created.
⮚ options are keyword arguments for configuring the widget, such as the
  text to display on a label.

All Tkinter widgets share some common properties and methods. The color,
font, and dimensions of a widget can be set using these common properties.
Tkinter provides three geometry managers, each available as a method on
every widget:

⮚ pack
⮚ grid
⮚ place

These geometry methods position the widgets as specified on the container.

9.2.1 Label

The Label widget in Tkinter serves the purpose of displaying text and images.
A label on a product carries information about the product such as its price,
date of manufacturing, and so on. Similarly, a label widget displays information
on the screen. For example, you can use a label to display the current time on
a screen.



The syntax for creating a Label is:

Label_name = Label(root, options)

Label_name is the name of the label created. Label is the class used to
create the label and root is the window that holds the label. Table 9.2 lists
the options that can be configured for a Label widget.

Option          Description
anchor          Controls the position of text in the widget, where the
                default value is CENTER
bd              Sets the border width of the widget, where the default is
                2 pixels
bitmap          Displays the specified bitmap on the label
bg              Sets the background color of the widget
cursor          Specifies the type of cursor to show when the mouse is
                moved over the label
fg              Specifies the foreground color of the text written inside
                the label
font            Specifies the font of the text inside the label
height          Specifies the height of the label
image           Indicates the image shown on the label
justify         Specifies the alignment of multiple lines of text in the
                label
padx            Indicates the horizontal padding of the widget
pady            Indicates the vertical padding of the widget
relief          Indicates the 3D style of the border
text            Sets the text to display
textvariable    Links the label to a variable so that updating the
                variable updates the label text
underline       Underlines the character at the specified index of the
                text
width           Sets the width of the widget
wraplength      Specifies the maximum line length, in screen units, beyond
                which the text is wrapped

Table 9.2: Label Widget Options



Consider a simple GUI application that displays a welcome message to the
user. To do this:

1. Import the Tkinter library under a short alias so that it can be used in
the rest of the code. You can do this using the code:

import tkinter as tk

2. Create a GUI application main window. You can do this with the code:

root = tk.Tk()

Here, root is the container variable in which you will create a Label
widget to display the welcome message.

3. Create the Label widget and place it in the container root using the
code:

message = tk.Label(root, text="Hello, World!")
message.pack()

Here, message is the widget name, root is the container, and the text
option of the widget is set to the string to be displayed. Tkinter uses
geometry managers to position the widgets on the container. Here, the pack
method is used to place the label on the container.

4. Enter the main event loop to display the message. This event loop keeps
the window open and responsive until you manually close it. The code to
enter the main event loop is:

root.mainloop()

Code Snippet 1 shows the complete code to create a simple GUI application
to display a welcome message.



Code Snippet 1:

import tkinter as tk
root = tk.Tk()
message = tk.Label(root, text="Hello, World!")
message.pack()
root.mainloop()

You can use Jupyter Notebook to run this code. Figure 9.2 shows the output
of execution of the code in Code Snippet 1.

Figure 9.2: Label Widget

In this example, the pack method does not include any options. However, you
can include options with the pack method to change the display of the
widgets. Table 9.3 lists some of the options that can be used with the pack
method.

Options     Description
fill        Stretches the widget horizontally (X), vertically (Y), or in
            both directions (BOTH), where the default is NONE
side        Specifies the side of the container against which the widget
            is placed: TOP (the default), BOTTOM, LEFT, or RIGHT
expand      Specifies if the widget should expand to fill the extra space
            available

Table 9.3: pack Method Options
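The options in Table 9.3 can be combined in a single call to pack. The
following is a minimal sketch; the names PACK_DEMO and run_pack_demo, the
label texts, and the colors are illustrative assumptions. The string values
"top", "left", "x", "y", and "both" are the values behind the Tkinter
constants TOP, LEFT, X, Y, and BOTH.

```python
# Each entry: (label text, background color, keyword options passed to pack).
PACK_DEMO = [
    ("Top, fills width", "lightblue", {"side": "top", "fill": "x"}),
    ("Left, fills height", "lightgreen", {"side": "left", "fill": "y"}),
    ("Remaining space", "khaki",
     {"side": "left", "fill": "both", "expand": True}),
]


def run_pack_demo():
    """Open a window that packs one label for each entry in PACK_DEMO."""
    import tkinter as tk  # imported here so PACK_DEMO can be read without a display

    root = tk.Tk()
    root.geometry("300x200")
    for text, color, options in PACK_DEMO:
        # ** unpacks the dictionary into keyword arguments such as side="top".
        tk.Label(root, text=text, bg=color).pack(**options)
    root.mainloop()
```

Calling run_pack_demo() opens the window. Resizing it shows the effect of
expand: only the third label grows to take up the extra space.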

Figure 9.3 shows the output of the code if message.pack() in Code Snippet 1
is replaced with message.pack(ipadx=50, ipady=50, expand=True). The ipadx
and ipady options specify the internal padding added inside the widget,
around its text, in the horizontal and vertical directions respectively.



Figure 9.3: Expand Option
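The textvariable option listed in Table 9.2 lets a label follow the value of
a variable. The following is a minimal sketch of a label that shows the
current time; the function names format_clock and run_clock are illustrative
assumptions.

```python
import time


def format_clock(struct_time):
    """Format a time.struct_time as HH:MM:SS for display in a label."""
    return time.strftime("%H:%M:%S", struct_time)


def run_clock():
    """Open a window whose label shows the current time, updated every second."""
    import tkinter as tk  # imported here so format_clock works without a display

    root = tk.Tk()
    now = tk.StringVar(value=format_clock(time.localtime()))
    # The label redraws itself whenever the linked variable changes.
    tk.Label(root, textvariable=now, font=("Helvetica", 24),
             padx=20, pady=20).pack()

    def tick():
        now.set(format_clock(time.localtime()))
        root.after(1000, tick)  # schedule the next update in 1000 ms

    tick()
    root.mainloop()
```

Calling run_clock() opens the window. The label is created only once; all
later changes go through the StringVar.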

9.2.2 Frames

Frames are rectangular containers for other widgets. They facilitate the
arrangement of widgets in an orderly manner.

Frame also acts as a base on which more complex, composite widgets are
built. It is similar to the div tag in HTML, which defines a division on a
Webpage that contains other HTML elements.

The syntax for creating a frame is:

Frame_name = Frame(root, options)

Frame_name is the name of the frame created. Frame is the class used to
create the frame. root is the window that holds the frame. The options in
Table 9.2 that do not relate to text or images, such as bg, bd, cursor,
height, width, padx, pady, and relief, are valid for the Frame widget too.
Table 9.4 lists further options that can be configured for a Frame widget.

Option                 Description
highlightbackground    Specifies the color of the focus highlight when the
                       frame does not have the focus
highlightthickness     Specifies the thickness of the focus highlight
                       around the border
relief                 Specifies the type of border of the frame, which is
                       FLAT by default
highlightcolor         Specifies the color of the focus highlight when the
                       frame has the focus

Table 9.4: List of options for Frame Widget

Consider that you want to divide the screen into two halves horizontally: top
and bottom. The bottom portion must be further subdivided into two vertical
halves: left and right. The frame arrangement will appear as in Figure 9.4.



Figure 9.4: Frame Arrangement

Code Snippet 2 lists the code to divide the frames as shown in Figure 9.4 with
each frame displaying a text in different backgrounds.

Code Snippet 2:

import tkinter as tk

root = tk.Tk()

# Top frame
tframe = tk.Frame(root)
tframe.pack()

# Bottom frame, divided into a left frame and a right frame
bframe = tk.Frame(root)
bframe.pack(side=tk.BOTTOM)
lframe = tk.Frame(bframe)
lframe.pack(side=tk.LEFT)
rframe = tk.Frame(bframe)
rframe.pack()

tmessage = tk.Label(tframe, text="This is the first frame", fg="brown")
tmessage.pack()
bmessage = tk.Label(lframe, text="This is the second frame.", bg="red")
bmessage.pack()
dmessage = tk.Label(rframe, text="This is the third frame.", bg="blue",
                    fg="white")
dmessage.pack()

root.mainloop()

Figure 9.5 shows the output of execution of the code in Code Snippet 2.



Figure 9.5: Frame Widget

9.2.3 Checkbutton

The Checkbutton widget allows the user to select one or more options from a
list of options. For example, users may choose one or more hobbies from a
list of hobbies when entering their personal details.
The syntax of the Checkbutton widget is:

Checkbutton_name = Checkbutton(root, options)

Checkbutton_name stands for the name of the widget. root is the parent
window. You can use many options to configure the Checkbutton widget and
these options are written as comma-separated key-value pairs. Table 9.5 lists
some of the Checkbutton widget options.

Option                Description
activebackground      Indicates the background color of the Checkbutton
                      when it is under the cursor
command               Specifies the function to call when the state of the
                      Checkbutton is changed
activeforeground      Indicates the foreground color of the Checkbutton
                      when it is under the cursor
disabledforeground    Indicates the foreground color of the text when the
                      Checkbutton is disabled
justify               Indicates how multiple lines of text are aligned
variable              Specifies the control variable that tracks the state
                      of the Checkbutton
offvalue              Sets the value of the control variable in the OFF
                      state, which is 0 by default
onvalue               Sets the value of the control variable in the ON
                      state, which is 1 by default
state                 Represents the state of the Checkbutton. The default
                      value is NORMAL. This can be set to DISABLED to make
                      the Checkbutton unresponsive.
selectcolor           Indicates the color of the Checkbutton when it is
                      set; the default depends on the platform and Tk
                      version
selectimage           Specifies the image shown on the Checkbutton when it
                      is set

Table 9.5: Checkbutton Widget Options

Table 9.6 lists some of the methods associated with the Checkbutton widget.

Method      Description
invoke      Calls the function associated with the Checkbutton, as if the
            user had clicked it
select      Turns on the Checkbutton
deselect    Turns off the Checkbutton
toggle      Switches the Checkbutton between the ON and OFF states
flash       Flashes the Checkbutton between active and normal colors

Table 9.6: Checkbutton Widget Methods

Consider that in the frame layout of Code Snippet 2, you want to display a
list of hobbies in the first frame and a button in the second frame. When the
user selects hobbies and clicks the button, the selected hobbies must be
displayed in the third frame. Code Snippet 3 provides the code to create the
desired interface using the Checkbutton widget.



Code Snippet 3:

import tkinter as tk

root = tk.Tk()
tframe = tk.Frame(root)
tframe.pack()
bframe = tk.Frame(root)
bframe.pack(side=tk.BOTTOM)
lframe = tk.Frame(bframe)
lframe.pack(side=tk.LEFT)
rframe = tk.Frame(bframe)
rframe.pack()

def show_hobbies():
    # Remove the message left by any earlier click
    for widget in rframe.winfo_children():
        widget.destroy()
    if read.get() and song.get():
        text = "Reading books and Listening to songs"
    elif song.get():
        text = "Listening to songs"
    elif read.get():
        text = "Reading books"
    else:
        return  # Nothing is selected, so there is nothing to display
    dmessage = tk.Label(rframe, text=text, bg="blue", fg="white")
    dmessage.pack()

read = tk.IntVar()
song = tk.IntVar()
# grid returns None, so create each widget first and then place it
label1 = tk.Checkbutton(tframe, text="Reading books", variable=read,
                        onvalue=1, offvalue=0)
label1.grid(column=0, row=0, sticky=tk.W)
label2 = tk.Checkbutton(tframe, text="Listening to songs", variable=song,
                        onvalue=1, offvalue=0)
label2.grid(column=1, row=0, sticky=tk.W)
button1 = tk.Button(lframe, text="Click to display selected hobbies.",
                    bg="red", padx=5, pady=5, command=show_hobbies)
button1.pack()

root.mainloop()



Figure 9.6 shows the output of execution of the code in Code Snippet 3.

Figure 9.6: Checkbutton Widget

You can select either or both hobbies and then click the button. Figures 9.7,
9.8, and 9.9 show the output for selection of different hobbies.

Figure 9.7: Reading Books Selected

Figure 9.8: Listening to Songs Selected

Figure 9.9: Both Hobbies Selected

In Code Snippet 3:

⮚ The variable, onvalue, and offvalue options are used to define the
  Checkbutton widgets.
⮚ A function named show_hobbies is used in the command option of the
  Button widget to display the selected hobbies when the Button is
  clicked.
⮚ In the show_hobbies function, the get method of the integer variable is
  used to see if the Checkbutton is selected.
⮚ The if..elif construct is used to display different texts according to
  the selections.
⮚ Instead of pack, the grid method is used to arrange the Checkbutton
  widgets on the frame. Here, a grid with a single row and two columns is
  used to place one Checkbutton widget in each column. The indices of rows
  and columns in a grid start from 0. Table 9.7 lists the options available
  for the grid method.



Options       Description
column        Specifies the column number in which the widget is to be
              placed
row           Specifies the row number in which the widget is to be placed
columnspan    Specifies the number of columns that the widget spans
rowspan       Specifies the number of rows that the widget spans
sticky        Specifies the sides of the cell to which the widget sticks:
              N for North, S for South, E for East, W for West, NE for
              Northeast, NW for Northwest, SE for Southeast, and SW for
              Southwest

Table 9.7: grid Method Options
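The grid options in Table 9.7 can be sketched with a layout that is computed
from a list instead of being written out cell by cell. The following is a
minimal sketch; the names HOBBIES, COLUMNS, grid_position, describe, and
run_hobby_grid are illustrative assumptions, and the two extra hobbies are
sample data only.

```python
HOBBIES = ["Reading books", "Listening to songs", "Gardening", "Cooking"]
COLUMNS = 2


def grid_position(index, columns=COLUMNS):
    """Return the (row, column) cell for the item at the given list index."""
    return divmod(index, columns)


def describe(selected):
    """Join the selected hobby names, or report that nothing was chosen."""
    return " and ".join(selected) if selected else "No hobby selected"


def run_hobby_grid():
    """Open a window with one Checkbutton per hobby, arranged in a grid."""
    import tkinter as tk  # imported here so the helpers work without a display

    root = tk.Tk()
    flags = []
    for i, hobby in enumerate(HOBBIES):
        var = tk.IntVar()
        row, column = grid_position(i)
        tk.Checkbutton(root, text=hobby, variable=var).grid(
            row=row, column=column, sticky=tk.W)
        flags.append(var)

    result = tk.StringVar()

    def show():
        chosen = [h for h, v in zip(HOBBIES, flags) if v.get()]
        result.set(describe(chosen))

    last_row = grid_position(len(HOBBIES) - 1)[0]
    # columnspan lets the button and the result label stretch over both columns.
    tk.Button(root, text="Show", command=show).grid(
        row=last_row + 1, column=0, columnspan=COLUMNS)
    tk.Label(root, textvariable=result).grid(
        row=last_row + 2, column=0, columnspan=COLUMNS)
    root.mainloop()
```

Calling run_hobby_grid() opens the window. Adding an entry to HOBBIES adds a
Checkbutton without any change to the layout code.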

9.2.4 Radiobutton

Unlike Checkbutton, Radiobutton allows the users to choose only one option
from the available options. For example, you can ask the users to choose one
of the age group ranges from the list of age groups when entering their
personal details.

The syntax of the Radiobutton widget is:

Radiobutton_name = Radiobutton(root, options)

Radiobutton_name stands for the name of the widget. root is the parent
window. You can use many options to configure the Radiobutton widget and
these options are written as comma-separated key-value pairs. Table 9.8 lists
some of the options available for the Radiobutton Widget.

Options        Description
borderwidth    Represents the size of the border
image          Sets an image for the widget
value          Specifies the value assigned to the control variable when
               this option is chosen

Table 9.8: Options of Radiobutton Widget



The Radiobutton widget supports the invoke, select, deselect, and flash
methods listed in Table 9.6 for the Checkbutton widget. It does not have a
toggle method.

Consider that in the frame layout of Code Snippet 2, you want to display a
list of age groups in the first frame and a button in the second frame. When
the user selects an age group and clicks the button, an appropriate message
must be displayed. If < 18 is selected, the message must be ‘Child or Young
Adult’; otherwise, the message must be ‘Adult’. This message must be
displayed in the third frame. Code Snippet 4 provides the code to create the
desired interface using the Radiobutton widget.

Code Snippet 4:

import tkinter as tk

root = tk.Tk()
tframe = tk.Frame(root)
tframe.pack()
bframe = tk.Frame(root)
bframe.pack(side=tk.BOTTOM)
lframe = tk.Frame(bframe)
lframe.pack(side=tk.LEFT)
rframe = tk.Frame(bframe)
rframe.pack()

def show_ages():
    if age.get() == 1:
        text = "Child or Young Adult"
    elif age.get() == 2:
        text = "Adult"
    else:
        return  # No age group is selected yet
    # Clear any earlier message before displaying the new one
    for widgets in rframe.winfo_children():
        widgets.destroy()
    dmessage = tk.Label(rframe, text=text, bg="blue", fg="white")
    dmessage.pack()

age = tk.IntVar()
# grid returns None, so create each widget first and then place it
label1 = tk.Radiobutton(tframe, text="< 18", variable=age, value=1)
label1.grid(column=0, row=0, sticky=tk.W)
label2 = tk.Radiobutton(tframe, text=">= 18", variable=age, value=2)
label2.grid(column=1, row=0, sticky=tk.W)
button1 = tk.Button(lframe, text="Click to display selected age group.",
                    bg="red", padx=5, pady=5, command=show_ages)
button1.pack()

root.mainloop()

Figure 9.10 shows the output of execution of the code in Code Snippet 4.

Figure 9.10: Radiobutton Widget

You can select only one age group and then click the button. Figures 9.11 and
9.12 show the output for selection of different age groups.

Figure 9.11: Less than Eighteen Selected

Figure 9.12: Greater than or Equal to Eighteen Selected

In Code Snippet 4:

⮚ The variable and value options are used to define the Radiobutton
  widgets.
⮚ A function named show_ages is used in the command option of the
  Button widget to display the appropriate message according to the
  selected age group when the Button is clicked.
⮚ In the show_ages function:
  o The if..elif construct is used to display different texts according
    to the selections.
  o The winfo_children method is used to fetch all the widgets in a
    frame. The destroy method of each widget is then used to clear all
    the widgets in the frame.
  o The get method of the integer variable is used to read the
    Radiobutton value. Based on the value, an appropriate message
    is displayed.
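An alternative to the if..elif construct and the destroy calls in Code
Snippet 4 is to look the message up in a dictionary and change the text of a
single existing label. The following is a minimal sketch of that variation,
not the approach used in Code Snippet 4; the names AGE_MESSAGES, message_for,
and run_age_demo and the fallback text are illustrative assumptions.

```python
# Maps the value option of each Radiobutton to the message to display.
AGE_MESSAGES = {1: "Child or Young Adult", 2: "Adult"}


def message_for(value):
    """Return the message for a Radiobutton value, or a prompt if none is set."""
    return AGE_MESSAGES.get(value, "Please select an age group")


def run_age_demo():
    """Open a window with two Radiobuttons and a label that is updated in place."""
    import tkinter as tk  # imported here so message_for works without a display

    root = tk.Tk()
    age = tk.IntVar()  # stays 0 until the user picks an option
    tk.Radiobutton(root, text="< 18", variable=age, value=1).grid(
        row=0, column=0, sticky=tk.W)
    tk.Radiobutton(root, text=">= 18", variable=age, value=2).grid(
        row=0, column=1, sticky=tk.W)
    result = tk.Label(root, bg="blue", fg="white")

    def show():
        # config changes the text of the existing label; nothing is destroyed.
        result.config(text=message_for(age.get()))

    tk.Button(root, text="Show", command=show).grid(
        row=1, column=0, columnspan=2)
    result.grid(row=2, column=0, columnspan=2)
    root.mainloop()
```

Calling run_age_demo() opens the window. Supporting a third age group would
need one more dictionary entry and one more Radiobutton.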

9.2.5 Entry Widget

The Entry widget is used to accept inputs such as name, email address, and
so on from the users. It is a small text box which allows the user to type in a
single line of input. The style of entering the input text can be changed by
options.

The syntax of the Entry widget is:

Entry_name = Entry(root, options)

Entry_name stands for the name of the widget. root is the parent window
which holds the Entry widget. Table 9.9 lists some of the options available for
the Entry widget.

Options              Description
exportselection      Controls whether selected text is automatically
                     copied to the clipboard; the default value is 1, and
                     setting it to 0 disables this
selectbackground     Sets the background color of the selected text
selectforeground     Sets the font color of the selected text
selectborderwidth    Sets the width of the border for the selected text
width                Sets the width of the widget in characters
textvariable         Links the widget to a StringVar through which the
                     current text can be read or set
show                 Displays the specified character, such as *, in place
                     of each character typed
xscrollcommand       Links the widget to a horizontal scroll bar, which is
                     useful if the entered text is longer than the widget
insertbackground     Sets the color of the insertion cursor

Table 9.9: Options for Entry Widget



Table 9.10 lists some of the methods available for the Entry widget.

Methods                       Description
delete(first, last=None)      Deletes the specified characters inside the
                              widget
get()                         Fetches the Entry widget's current text as a
                              string
icursor(index)                Sets the insertion cursor just before the
                              character at the specified index
index(index)                  Returns the numeric position that corresponds
                              to the specified index
select_clear()                Clears the selection in the Entry widget
select_present()              Checks if some text is selected in the widget
insert(index, s)              Inserts the specified string s before the
                              character at the specified index
select_adjust(index)          Adjusts the selection so that it includes the
                              character at the specified index
select_from(index)            Sets the selection anchor to the character
                              specified by the index
select_range(start, end)      Selects the characters in the specified range
select_to(index)              Selects all the characters between the anchor
                              and the specified index
xview(index)                  Scrolls the widget horizontally so that the
                              character at the specified index is at the
                              left edge
xview_scroll(number, what)    Scrolls the Entry widget horizontally, where
                              number specifies the number of units to be
                              moved and what specifies whether by tk.UNITS
                              or tk.PAGES

Table 9.10: Methods for Entry Widget
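A few of the options and methods in Tables 9.9 and 9.10 can be sketched with
a masked input box. The following is a minimal sketch; the names is_blank and
run_entry_demo and the sample text are illustrative assumptions.

```python
def is_blank(text):
    """Return True if the text is empty or contains only whitespace."""
    return not text.strip()


def run_entry_demo():
    """Open a window with a masked Entry, a Check button, and a Clear button."""
    import tkinter as tk  # imported here so is_blank works without a display

    root = tk.Tk()
    entry = tk.Entry(root, show="*", width=25)  # every character is shown as *
    entry.insert(0, "secret")                   # insert text at index 0
    entry.pack(padx=10, pady=10)
    status = tk.Label(root)
    status.pack()

    def check():
        text = entry.get()  # get returns the real text, not the mask
        if is_blank(text):
            status.config(text="Nothing entered")
        else:
            status.config(text="Characters entered: %d" % len(text))

    def clear():
        entry.delete(0, tk.END)  # delete everything from index 0 to the end

    tk.Button(root, text="Check", command=check).pack(
        side=tk.LEFT, padx=10, pady=10)
    tk.Button(root, text="Clear", command=clear).pack(side=tk.LEFT)
    root.mainloop()
```

Calling run_entry_demo() opens the window. The mask set by show affects only
what is displayed; get still returns the text that was typed.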

Consider that you want to display two labels: Name and e-mail. Against these
labels, you want to accept user inputs for name and e-mail. You also want to
include a button, which when clicked will display the name and e-mail entered
by the user. Code Snippet 5 shows the code to accept user input using the
Entry widget.



Code Snippet 5:

import tkinter as tk

frame = tk.Tk()
frame.geometry("400x250")
bframe = tk.Frame(frame)
bframe.pack(side=tk.BOTTOM)

def show_info():
    # Clear any earlier message before displaying the new one
    for widgets in bframe.winfo_children():
        widgets.destroy()
    dmessage = tk.Label(bframe, text="Name: " + namevar.get() +
                        " ; e-mail: " + emailvar.get(), bg="blue",
                        fg="white")
    dmessage.pack()

namevar = tk.StringVar()
emailvar = tk.StringVar()
# place returns None, so create each widget first and then position it
name = tk.Label(frame, text="Name")
name.place(x=30, y=50)
nameentry = tk.Entry(frame, textvariable=namevar)
nameentry.place(x=90, y=50)
email = tk.Label(frame, text="e-mail")
email.place(x=30, y=130)
emailentry = tk.Entry(frame, textvariable=emailvar)
emailentry.place(x=90, y=130)
button1 = tk.Button(frame, text="Click to display the name and e-mail.",
                    bg="red", padx=5, pady=5, command=show_info)
button1.place(x=30, y=180)

frame.mainloop()



Figure 9.13 shows the output of execution of the code in Code Snippet 5.

Figure 9.13: Entry Widget

You can enter the name and e-mail information and then click the button.
Figure 9.14 shows the output with the entered information.

Figure 9.14: User Information Entered in Entry Widget

In Code Snippet 5:

⮚ The textvariable option is used to define the Entry widgets.
⮚ A function named show_info is used in the command option of the
  Button widget to display the information entered when the Button is
  clicked.
⮚ The get method of the string variables is used to get the details
  entered.
⮚ The geometry method of the main window (named frame in the code) is
  used to define the overall area to be used.
⮚ The widgets are placed in the main window using the place method. This
  method accepts an x and a y value to position the widget on the main
  window. Table 9.11 lists the options available for the place method.

Options                Description
x, y                   Define the horizontal and vertical offset in pixels
height, width          Set the height and width of the widget in pixels
anchor                 Specifies which part of the widget, such as NW or
                       SE, is positioned at the given coordinates
relx, rely             Specify floating point numbers between 0.0 and 1.0
                       that are used as the horizontal and vertical
                       offset, as a fraction of the width and height of
                       the parent window
relheight, relwidth    Specify floating point numbers between 0.0 and 1.0
                       that set the height and width of the widget as a
                       fraction of the height and width of the parent
                       window

Table 9.11: place Method Options
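The relative options in Table 9.11 can be sketched as follows. The helper
shows the arithmetic behind relx and rely, and the window shows widgets that
keep their relative position when the window is resized. This is a minimal
sketch; the names relative_offset and run_place_demo are illustrative
assumptions, and the helper only approximates the calculation that Tkinter
performs internally.

```python
def relative_offset(relx, rely, parent_width, parent_height):
    """Approximate the pixel offset that place derives from relx and rely."""
    return round(relx * parent_width), round(rely * parent_height)


def run_place_demo():
    """Open a window with a centered label and a full-width bar at the bottom."""
    import tkinter as tk  # imported here so relative_offset works without a display

    root = tk.Tk()
    root.geometry("400x250")
    # anchor=CENTER puts the middle of the label, not its top-left corner,
    # at the point halfway across and halfway down the window.
    tk.Label(root, text="Centered", bg="yellow").place(
        relx=0.5, rely=0.5, anchor=tk.CENTER)
    # relwidth=1.0 makes the bar as wide as the window, whatever its size.
    tk.Label(root, text="Status bar", bg="gray").place(
        relx=0.0, rely=1.0, relwidth=1.0, anchor=tk.SW)
    root.mainloop()
```

For a 400x250 window, relative_offset(0.5, 0.5, 400, 250) gives the point at
which the first label is centered. Calling run_place_demo() opens the window.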

9.2.6 Canvas Widget

The Canvas widget helps in creating graphical images on the screen. Shapes
such as circles, rectangles, or octagons can be drawn on a Canvas widget.
Canvas is used to draw complex layouts, charts, and plots.

The syntax of the Canvas widget is:

Canvas_name = Canvas(root, options)

In the syntax, Canvas_name is the name of the canvas, and the root parameter
denotes the parent window. You can change the layout of the canvas using
many options that are written as comma-separated pairs of key-values.

Table 9.12 lists some of the options available for the Canvas widget.

Options             Description
confine             Prevents the canvas from being scrolled outside the
                    scroll region
xscrollcommand      Links the canvas to a horizontal scroll bar
yscrollcommand      Links the canvas to a vertical scroll bar
scrollregion        Specifies the scroll area of the canvas using the
                    coordinates
xscrollincrement    Sets the step size used for horizontal scrolling
yscrollincrement    Sets the step size used for vertical scrolling

Table 9.12: Options for Canvas Widget

Consider the example in Code Snippet 5. Let us add two lines, one above and
one beneath the Button widget. Code Snippet 6 adds the lines as intended
using the Canvas widget.

Code Snippet 6:

import tkinter as tk

frame = tk.Tk()
frame.geometry("350x250")
bframe = tk.Frame(frame)
bframe.pack(side=tk.BOTTOM)
canvas = tk.Canvas(frame, bg="brown", height=250, width=350)

def show_info():
    # Clear any earlier message before displaying the new one
    for widgets in bframe.winfo_children():
        widgets.destroy()
    dmessage = tk.Label(bframe, text="Name: " + namevar.get() +
                        " ; e-mail: " + emailvar.get(), bg="blue",
                        fg="white")
    dmessage.pack()

namevar = tk.StringVar()
emailvar = tk.StringVar()

# place returns None, so create each widget first and then position it
name = tk.Label(frame, text="Name")
name.place(x=80, y=50)
nameentry = tk.Entry(frame, textvariable=namevar)
nameentry.place(x=140, y=50)
email = tk.Label(frame, text="e-mail")
email.place(x=80, y=130)
emailentry = tk.Entry(frame, textvariable=emailvar)
emailentry.place(x=140, y=130)

# Draw one line above and one line beneath the button
line1 = canvas.create_line(20, 170, 340, 170, fill='white')
button1 = tk.Button(frame, text="Click to display the name and e-mail.",
                    bg="green", fg="white", padx=5, pady=5,
                    command=show_info)
button1.place(x=60, y=180)
line2 = canvas.create_line(20, 230, 340, 230, fill='white')
canvas.pack()

frame.mainloop()

Figure 9.15 shows the output of execution of the code in Code Snippet 6.

Figure 9.15: Canvas Widget

You can enter the name and e-mail information and then click the button.
The output with the entered information will appear beneath the canvas.

In Code Snippet 6:

⮚ The Canvas widget is used as a drawing surface for the two lines shown
  in Figure 9.15. The create_line method of the Canvas widget is used to
  draw the lines.
⮚ A function named show_info is used in the command option of the
  Button widget to display the information entered when the Button is
  clicked.
⮚ The get method of the string variables is used to get the details
  entered.
⮚ The geometry method of the main window (named frame in the code) is
  used to define the overall area to be used.
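Besides create_line, the Canvas widget provides methods such as
create_rectangle, create_oval, and create_text. An oval is defined by the
corners of its bounding box, so a small helper is useful for drawing a
circle from its center and radius. The following is a minimal sketch; the
names circle_bbox and run_canvas_demo and the coordinates are illustrative
assumptions.

```python
def circle_bbox(cx, cy, radius):
    """Return (x0, y0, x1, y1), the bounding box of a circle, for create_oval."""
    return (cx - radius, cy - radius, cx + radius, cy + radius)


def run_canvas_demo():
    """Open a window with a rectangle, a filled circle, and text on a canvas."""
    import tkinter as tk  # imported here so circle_bbox works without a display

    root = tk.Tk()
    canvas = tk.Canvas(root, width=300, height=200, bg="white")
    canvas.pack()
    # Items drawn later appear on top of items drawn earlier.
    canvas.create_rectangle(20, 20, 280, 180, outline="brown", width=2)
    canvas.create_oval(*circle_bbox(150, 100, 60), fill="yellow",
                       outline="black")
    canvas.create_text(150, 100, text="Welcome",
                       font=("Helvetica", 12, "bold"))
    root.mainloop()
```

Calling run_canvas_demo() opens the window. Canvas coordinates start at the
top-left corner, with x increasing to the right and y increasing downward.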

9.3 Creating Simple Applications Using Tkinter

Using the widgets discussed so far, you can easily develop a simple application
that takes user input and displays the data.

9.3.1 Example 1

Example 1 takes user inputs through widgets such as Entry and Button. The
input entered is then displayed on the canvas as shown in Figure 9.16. The user
must provide the name and author of the book. When the user clicks the Book
Font color button, the color palette appears. The user can choose a color
for the Book Name from this color palette. Similarly, the user can choose a color
for the Author Name by clicking the Author Font color button.

Figure 9.16: Output of Example 1



Code Snippet 7 shows the code for this example.

Code Snippet 7:

import tkinter as tk
from tkinter import colorchooser

root = tk.Tk()
root.geometry('500x500')
root.config(bg='#345')
canvas = tk.Canvas(root, height=400, width=400, bg="#fff")
canvas.create_rectangle(30, 30, 300, 180, outline="#fb0", fill="#fb0")
name_var = tk.StringVar()
author_var = tk.StringVar()

def get_name():
    e_name = name_var.get()
    # askcolor returns an (rgb, hex) pair; index 1 is the hex color string
    color1 = colorchooser.askcolor()[1]
    canvas.create_text(150, 50, text=e_name, fill=color1,
                       font='Helvetica 12 bold')

def get_author():
    e_author = author_var.get()
    color2 = colorchooser.askcolor()[1]
    canvas.create_text(180, 130, text="by", fill="black",
                       font='Helvetica 10 bold')
    canvas.create_text(210, 150, text=e_author, fill=color2,
                       font='Helvetica 10 bold')

# place returns None, so create each widget first and then position it
name = tk.Label(canvas, text="Book Name")
name.place(x=20, y=250)
author = tk.Label(canvas, text="Author Name")
author.place(x=20, y=280)
namebutton = tk.Button(canvas, text="Book Font color", command=get_name)
namebutton.place(x=80, y=320)
authorbutton = tk.Button(canvas, text="Author Font color",
                         command=get_author)
authorbutton.place(x=180, y=320)
e1 = tk.Entry(canvas, textvariable=name_var)
e1.place(x=100, y=250)
e2 = tk.Entry(canvas, textvariable=author_var)
e2.place(x=100, y=280)
canvas.pack()

root.mainloop()



In Code Snippet 7:

⮚ The code uses widgets such as Canvas, Label, Entry, and Button.
⮚ The Canvas widget holds the Entry widgets for taking the names of the
  book and author from the user.
⮚ The Button widget is used to pick the font color and display the
  entered text in the desired font color.
⮚ The functions, get_name and get_author, take input from the Entry
  widgets, retrieve the chosen colors from the colorchooser interface,
  and display the output in the upper part of the screen.
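The askcolor function returns a pair: the color as an RGB tuple and the same
color as a hexadecimal string. If the user cancels the dialog, both items
are None. The following is a minimal sketch of handling that case by falling
back to a default color; the names chosen_hex and run_color_demo and the
sample text are illustrative assumptions.

```python
def chosen_hex(result, default="#000000"):
    """Return the hex string from an askcolor result, or default on cancel."""
    hex_color = result[1]
    return hex_color if hex_color else default


def run_color_demo():
    """Open a window whose button changes the font color of a label."""
    import tkinter as tk  # imported here so chosen_hex works without a display
    from tkinter import colorchooser

    root = tk.Tk()
    sample = tk.Label(root, text="Sample text",
                      font=("Helvetica", 14, "bold"))
    sample.pack(padx=20, pady=10)

    def pick():
        result = colorchooser.askcolor(title="Choose a font color")
        # Keep the current color if the dialog was cancelled.
        sample.config(fg=chosen_hex(result, default=sample.cget("fg")))

    tk.Button(root, text="Font color", command=pick).pack(pady=10)
    root.mainloop()
```

Calling run_color_demo() opens the window. Closing the color dialog without
choosing a color leaves the label unchanged.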

9.3.2 Example 2

Example 2 takes input from Entry, Checkbutton, and Radiobutton widgets.
The input is then displayed on the console. Figure 9.17 shows the screen
design of the application.

Figure 9.17: Screen Design for Example 2

In this example, the user's name (Entry widget), gender (Radiobutton
widget), and qualification (Checkbutton widget) are taken as input. When the
user provides the input and clicks the submit button, the details entered
are printed on the console. Code Snippet 8 shows the code for the
application.



Code Snippet 8:

import tkinter as tk

root = tk.Tk()
root.geometry("400x280")
# place returns None, so create each widget first and then position it
name = tk.Label(root, text="Name")
name.place(x=30, y=50)
gender = tk.Label(root, text="Gender")
gender.place(x=30, y=100)
qualification = tk.Label(root, text="Qualification")
qualification.place(x=30, y=150)
chk1 = tk.IntVar()
chk2 = tk.IntVar()
radio = tk.IntVar()
name_var = tk.StringVar()

def get_value():
    e_name = name_var.get()
    e_gender = radio.get()
    e_quali1 = chk1.get()
    e_quali2 = chk2.get()

    print("Name:" + e_name)
    # radio stays 0 until a gender is selected, so test both values
    if e_gender == 1:
        print("Gender: Male")
    elif e_gender == 2:
        print("Gender: Female")
    if e_quali1 == 1 and e_quali2 == 1:
        print("Qualification: BS, MS")
    elif e_quali1 == 1:
        print("Qualification: BS")
    elif e_quali2 == 1:
        print("Qualification: MS")

e1 = tk.Entry(root, textvariable=name_var)
e1.place(x=80, y=50)
e2 = tk.Radiobutton(root, variable=radio, text='Male', value=1)
e2.place(x=80, y=100)
e3 = tk.Radiobutton(root, variable=radio, text='Female', value=2)
e3.place(x=150, y=100)
e4 = tk.Checkbutton(root, text="BS", variable=chk1, onvalue=1, offvalue=0)
e4.place(x=130, y=150)
e5 = tk.Checkbutton(root, text="MS", variable=chk2, onvalue=1, offvalue=0)
e5.place(x=200, y=150)
e6 = tk.Button(root, text="Submit", command=get_value)
e6.place(x=100, y=200)
root.mainloop()

In Code Snippet 8:

⮚ The get_value function is called when the submit button is clicked.
⮚ The get_value function takes inputs from the input widgets and displays
  them on the console.
⮚ The print function prints the output onto the console.

Figure 9.18 shows the output of the code with inputs entered in various widgets.

Figure 9.18: Screen Design for Example 2 With Input

Figure 9.19 shows the console output.

Figure 9.19: Console Output of Example 2
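The formatting logic of Example 2 can be kept separate from the widgets,
which makes it possible to check the output without opening a window. The
following is a minimal sketch of that idea, not the code used in Code
Snippet 8; the names GENDERS and format_details and the fallback texts are
illustrative assumptions.

```python
# Maps the value option of each Radiobutton to the text to print.
GENDERS = {1: "Male", 2: "Female"}


def format_details(name, gender_value, has_bs, has_ms):
    """Return the lines to print for the values read from the widgets."""
    degrees = [d for d, flag in (("BS", has_bs), ("MS", has_ms)) if flag]
    return [
        "Name: " + name,
        "Gender: " + GENDERS.get(gender_value, "Not specified"),
        "Qualification: " + (", ".join(degrees) if degrees else "None"),
    ]


# Sample values as they would be read with the get method of each variable.
print("\n".join(format_details("Ann", 2, 1, 1)))
```

Inside get_value, the same function could be called with name_var.get(),
radio.get(), chk1.get(), and chk2.get() as arguments.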



9.4 Summary

⮚ GUI applications make the user experience with the application smooth.
⮚ Tkinter is a library to create a GUI-based application in Python.
⮚ Tkinter provides many widgets such as Label, Entry, Checkbutton,
  Radiobutton, and Canvas.
⮚ Each widget has its own properties, which can be specified as options
  when the widget is created.
⮚ Methods can be invoked to make some changes in the property values.
⮚ pack, grid, and place are the geometry manager methods that facilitate
  positioning of the widgets on the frame.



Test Your Knowledge

1. Which of the following statements are true about Tkinter?

a. It is an inbuilt Python module
b. It can be used only to create GUI-based Web applications
c. It is a standard library in Python used for creating Graphical User
Interface applications
d. It provides various controls, such as buttons, labels, scrollbars, radio
buttons, and text boxes that are used in a GUI application

2. Which of the following widgets is used to get a single-line text from the
user?

a. Entry
b. Frame
c. Canvas
d. Label

3. Which of the following methods are used to control the layout of an
application with geometry managers?

a. pack
b. place
c. layout
d. grid

4. Which of the following widgets is used to draw any object on the
application window or frame in Tkinter?

a. Draw
b. Canvas
c. Frame
d. Custom

5. Which of the following widgets works similar to a container and arranges
the position of other widgets?

a. Button
b. Canvas
c. Frame
d. Container



Answers to Test Your Knowledge

1 a, c, d
2 a
3 a, b, d
4 b
5 c



Try it Yourself

Use Tkinter to perform the given tasks:

1. Write a Python GUI program to enter your name in a text box. When a
button is clicked, your name should be displayed with a welcome
message.

2. Write a Python GUI program to create a simple calculator to perform the
calculation of addition, subtraction, multiplication, and division. Use a
grid to design the layout of the calculator.

3. Use the place method to design sample patient details such as
Patient_id, Patient_name, age, gender, and admission_type. Use
Entry widgets for Patient_id, Patient_name, and age. Use
Radiobutton widgets for gender and admission_type
(Inpatient/Outpatient). Get the details and display the information
at the bottom of the window.

4. Write a Python GUI program to create an oval shape using a Canvas
widget and fill the oval shape with yellow color. Place a Button widget
beneath the oval shape. When the button is clicked, display the
welcome text inside the oval shape.



Learning Objectives

In this session, students will learn to:


 Explain the concept of database connectivity in Python
 Describe the steps involved in connecting Python with the MySQL
database
 Explain the procedure to create a database and a table in MySQL using
the command-line interface of MySQL
 Describe the steps to design an application to insert, update, and delete
records from a database using Tkinter

Python can be used to build data-driven applications by connecting Python
to databases. The libraries and modules in Python facilitate connecting to and
interacting with databases. This session will provide an overview of connecting
Python to the MySQL database. It will cover the installations required before
setting up the connection. The session will show how to create a database and
a table using the command-line interface of MySQL. The session will also
describe how to design a student information database using Tkinter and
perform operations to insert, fetch, update, and delete records from the table.



10.1 Introduction to Database Connectivity with Python

Python can be connected to databases to create database programs that
serve various domains, including Web development, data analysis,
healthcare, E-commerce, and education. Python has several libraries and
frameworks that simplify database connectivity. It supports relational
databases such as MySQL, SQLite, and Oracle, as well as NoSQL databases
such as MongoDB. The flexibility and robustness of Python, combined with its
extensive support for database connectivity, make it a versatile choice for
interacting with database systems. Python can be connected to SQLite using
the sqlite3 module. PyMongo, MongoDB's official Python driver, is used to
create a connection between Python and MongoDB.
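
As a brief illustration of the SQLite route mentioned above, the following
sketch uses the sqlite3 module from the Python standard library with an
in-memory database. The table name demo_student and its columns are
hypothetical and are chosen only for this example.

```python
import sqlite3

def sqlite_demo():
    # ":memory:" creates a temporary database that lives only in RAM.
    conn = sqlite3.connect(":memory:")
    try:
        cursor = conn.cursor()
        cursor.execute(
            "CREATE TABLE demo_student (stud_id TEXT PRIMARY KEY, name TEXT)"
        )
        # sqlite3 uses ? as the placeholder for query parameters.
        cursor.execute(
            "INSERT INTO demo_student VALUES (?, ?)",
            ("1001", "Richard Franklin"),
        )
        conn.commit()
        cursor.execute("SELECT stud_id, name FROM demo_student")
        return cursor.fetchall()
    finally:
        conn.close()

rows = sqlite_demo()
print(rows)
```

Note that sqlite3 uses ? as the parameter placeholder, whereas the MySQL
connector used in the rest of this session uses %s.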

In this session, you will learn how to connect Python with MySQL.

10.2 Connecting Python with MySQL Database

To connect Python with MySQL, use the MySQL Connector Python module that
allows Python to connect with the MySQL database server.

MySQL Connector Python provides an interface for Python applications to
communicate with the MySQL database server. It enables Python to perform
tasks such as querying, inserting, updating, and deleting data in the database.
The prerequisite for connecting MySQL with Python is to install the MySQL
database.

10.2.1 Installing MySQL

To install MySQL:

1. To download MySQL, open the browser and navigate to the Web
page, https://dev.mysql.com/downloads/installer/.
2. Click the Download button of the
(mysql-installer-community-8.0.34.0.msi) version as shown in Figure 10.1.



Figure 10.1: MySQL Installer

The MySQL Community Downloads window appears as shown in
Figure 10.2.

3. Select No thanks, just start my download.

Figure 10.2: MySQL Community Downloads

The installation file is downloaded.



The MySQL version can be different when downloading because it is
open source and is updated on a regular basis. Download the
version that is available on the Web page.

4. Run the downloaded .msi file. The installation wizard launches.


5. On the MySQL Installer window of the installation wizard, under
Please select the Setup Type that suits your use case, select Full as
shown in Figure 10.3. Click Next.

Figure 10.3: Choosing a Setup Type

The Full option installs all the available products such as
MySQL Server, MySQL Shell, MySQL Router, MySQL
Workbench, MySQL Connectors, documentation, samples,
and examples.

The installer checks if the machine where MySQL is being installed
contains the prerequisite software for each product. If the machine
does not contain the required software, then the Check
Requirements window is displayed. If the Check Requirements
window is displayed, then perform steps 6 and 7; otherwise, go to step 8.

6. In the Check Requirements window, click the Execute button.
The MySQL Installer dialog box appears with the One or more
product requirements have not been satisfied message.
7. Click Yes.



The Installation window appears with the list of products that will be
downloaded and installed to the machine as shown in Figure 10.4.

8. Click the Execute button.

Figure 10.4: Installation Window

9. After the completion of installation, in the Installation page, click the
Next button.
The Product Configuration window appears as shown in Figure 10.5.
10. Click the Next button.

Figure 10.5: Product Configuration Window



The Type and Networking window appears.
11. In the Type and Networking window, retain the default options and
click the Next button as shown in Figure 10.6.

Figure 10.6: Type and Networking Window

The Authentication Method page appears.


12. Retain the default option and click the Next button as shown in
Figure 10.7.



Figure 10.7: Authentication Method Page

The Accounts and Roles window appears.

13. In the Accounts and Roles window, in the MySQL Root Password and
Repeat Password text boxes, enter the password. Click the Next
button as shown in Figure 10.8.

Figure 10.8: Accounts and Roles Window

The Windows Service window appears.



14. Retain the default option and click the Next button as shown in
Figure 10.9.

Figure 10.9: Windows Service Window

The Service File Permissions window appears.


15. Retain the default option and click the Next button.
The Apply Configuration window appears.
16. Click the Execute button as shown in Figure 10.10.

Figure 10.10: Apply Configuration Window



17. After configuration is completed, click the Finish button.
18. In the Product Configuration window that appears, click Next.
19. In the Connect To Server window that appears, enter the Username,
Password, and click Check as shown in Figure 10.11.

Figure 10.11: Connect To Server Window

20. After the check is completed, click the Next button.


21. In the Apply Configuration window, click the Execute button.
22. After the configuration is completed, click Finish as shown in Figure
10.12.



Figure 10.12: Apply Configuration Window

23. In the Product Configuration window that appears, click Next.


24. In the Installation Complete window that appears, click Finish as
shown in Figure 10.13.

Figure 10.13: Installation Complete Window



10.2.2 Installing MySQL Connector Module

The Python MySQL database connector is a module that enables Python scripts
to interact with data stored in a MySQL database.

To download and install the database connector, open the Command Prompt
and type the command as:

pip install mysql-connector-python

The command gets executed as shown in Figure 10.14.

Figure 10.14: Installing Database Connector

10.2.3 Establish a Connection between Python and MySQL

To perform operations such as creating a database, creating a table,
querying, and updating tables in the MySQL database using Python, connect
Python to MySQL. Code Snippet 1 shows the code to connect Python with
MySQL. It also creates a database named Studentdb.

Code Snippet 1:

import mysql.connector
conn = mysql.connector.connect(user='root',
password='mysql123', host='localhost')
cursor = conn.cursor()
cursor.execute("CREATE DATABASE Studentdb")
print("Database created Successfully")
conn.close()



The import mysql.connector statement imports the MySQL connector module. The
connect method of the MySQL connector module creates a connection with
the MySQL server. The syntax of the connect method is:

connect(host = <hostname>, user = <username>, passwd = <password>,
database = <database>)

Table 10.1 describes the arguments of the connect method.

Argument    Description
user        The username of the account that interacts with the MySQL
            server.
passwd      The password that authenticates the user. This is the
            password that was set during MySQL installation. The
            connector also accepts the name password for this argument,
            which is the form used in the code snippets of this session.
host        The server name or IP address on which the MySQL server is
            running.
database    The name of the database to which the user wants to
            connect. This argument can be omitted, as in Code Snippet 1.

Table 10.1: Arguments of the connect Method

The cursor method creates a MySQLCursor object that allows you to interact
with the database. The execute method is used to execute Structured
Query Language (SQL) statements. Here, the statement that creates the
Studentdb database is executed. The close method is used to close the
connection. Save the code in Code Snippet 1 in a file, for example
create_db.py. Open the Command Prompt and run the command as:

python create_db.py

The command executes as shown in Figure 10.15.

Figure 10.15: Creating Database
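
Running Code Snippet 1 a second time fails because the Studentdb database
already exists. The following is a minimal sketch of one way to make the step
repeatable, using CREATE DATABASE IF NOT EXISTS. The helper names
create_database and connect_to_server are hypothetical and are not part of
the connector; the credentials are the sample values used in this session.

```python
import re

def create_database(connect, name="Studentdb"):
    """Create the database only if it does not exist yet.

    connect is any zero-argument callable that returns a connection.
    """
    # Identifiers cannot be passed as query parameters, so the name is
    # checked against a conservative pattern before it is put into the SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"Unsafe database name: {name!r}")
    conn = connect()
    try:
        cursor = conn.cursor()
        cursor.execute(f"CREATE DATABASE IF NOT EXISTS {name}")
    finally:
        # Close the connection even if the statement fails.
        conn.close()

def connect_to_server():
    # Imported here so that create_database can be used without the
    # connector being loaded at import time.
    import mysql.connector
    return mysql.connector.connect(user="root", password="mysql123",
                                   host="localhost")

# Usage (requires a running MySQL server):
# create_database(connect_to_server)
```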



10.2.4 View Databases
To view the databases that exist in the MySQL server:
1. To connect to the command-line interface of MySQL, in the Command
Prompt, run the command as:
mysql -u root -p

2. You will be prompted to enter the password. Enter the password given
when installing the MySQL server as shown in Figure 10.16.

Figure 10.16: Command-line Interface of MySQL

3. To view the databases in MySQL, run the command as:
show databases;

The command executes as shown in Figure 10.17.

Figure 10.17: MySQL Prompt



10.2.5 Create a Table in Database Using Python
Let us create a table in the Studentdb database. Code Snippet 2 will create
a table named student_info with four columns namely Stud_id, Name, Age,
and City.

Code Snippet 2:

import mysql.connector

conn = mysql.connector.connect(user='root',
password='mysql123', host='localhost',
database='studentdb')
cursor = conn.cursor()
sql ='''CREATE TABLE IF NOT EXISTS student_info(
Stud_id CHAR(10) NOT NULL, Name CHAR(20),
Age INT,
City CHAR(80)
)'''
cursor.execute(sql)
print("Table successfully Created")
conn.close()

Save the code in Code Snippet 2 in a file, create_table.py. In the Command
Prompt, run the command as:

python create_table.py

The command executes as shown in Figure 10.18.

Figure 10.18: Creating Table in a Database

10.2.6 View Table


To view the student_info table:

1. To use the Studentdb database, at the MySQL prompt, run the
command as:
Use studentdb



The command executes as shown in Figure 10.19.

Figure 10.19: Opening a Database

2. To view the tables in the studentdb database, run the commands as:
Show tables;

describe student_info;
The commands execute as shown in Figure 10.20.

Figure 10.20: Viewing a Table
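
The same check can also be made from Python by sending the SHOW TABLES and
DESCRIBE statements through a cursor. This is a sketch only; the function
names list_tables and column_names are hypothetical, and the table name is
assumed to be a trusted constant because it is placed directly into the SQL
text.

```python
def list_tables(cursor):
    # Each row returned by SHOW TABLES holds one table name.
    cursor.execute("SHOW TABLES")
    return [row[0] for row in cursor.fetchall()]

def column_names(cursor, table="student_info"):
    # The first field of each DESCRIBE row is the column name.
    cursor.execute(f"DESCRIBE {table}")
    return [row[0] for row in cursor.fetchall()]

# Usage (requires a running MySQL server and the connector module):
# import mysql.connector
# conn = mysql.connector.connect(user="root", password="mysql123",
#                                host="localhost", database="studentdb")
# cursor = conn.cursor()
# print(list_tables(cursor))
# print(column_names(cursor))
# conn.close()
```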

10.3 Design an Application Using Tkinter

Let us design an application using Tkinter that stores student information in the
student_info table. The Tkinter Entry widget can be used to obtain student
details such as student ID, name, age, and city. The Button widget can be
used to perform operations such as create, read, update, and delete on the
student information. Code Snippet 3 lists the code for this application.



Code Snippet 3:

import tkinter as tk
from tkinter import ttk, messagebox
import mysql.connector
from tkinter import *

root = tk.Tk()
root.geometry("800x500")

# Title and instructions
tk.Label(root, text="Student Information", fg="green",
         font=("Calibri", 30)).place(x=300, y=5)
tk.Label(root, text="If you want to update or delete a record,",
         fg="black", font=("Calibri", 12)).place(x=300, y=60)
tk.Label(root, text="then enter the Student ID and click the Read button.",
         fg="black", font=("Calibri", 12)).place(x=300, y=80)

# Labels for the four fields
tk.Label(root, text="Student ID").place(x=10, y=10)
tk.Label(root, text="Name").place(x=10, y=50)
tk.Label(root, text="Age").place(x=10, y=90)
tk.Label(root, text="City").place(x=10, y=130)

# Entry widgets for the four fields
e1 = tk.Entry(root)
e1.place(x=140, y=10)

e2 = tk.Entry(root)
e2.place(x=140, y=50)

e3 = tk.Entry(root)
e3.place(x=140, y=90)

e4 = tk.Entry(root)
e4.place(x=140, y=130)

# Buttons (no functionality is attached yet)
tk.Button(root, text="Add", height=3, width=13).place(x=30, y=160)
tk.Button(root, text="Read", height=3, width=13).place(x=140, y=160)
tk.Button(root, text="Update", height=3, width=13).place(x=250, y=160)
tk.Button(root, text="Delete", height=3, width=13).place(x=360, y=160)
tk.Button(root, text="Refresh", height=3, width=13).place(x=470, y=160)

# Treeview that lists the records in tabular form
cols = ('Student_id', 'Name', 'Age', 'City')
listdisplay = ttk.Treeview(root, columns=cols, show='headings')

for col in cols:
    listdisplay.heading(col, text=col)
listdisplay.place(x=10, y=250)

root.mainloop()

The tkinter.ttk module is used to create graphical user
interfaces with improved appearance. The Treeview widget is
used to display data in tabular form.

Save the code in Code Snippet 3 in a file, student_design.py. In the
Command Prompt, run the command as:

python student_design.py

The command executes as shown in Figure 10.21.

Figure 10.21: Student Information



The code in Code Snippet 3 designs and launches the frame with widgets. Let
us add functionality to the widgets in the frame in succession. The
functionalities include:

 Adding a record
 Reading a record
 Updating a record
 Deleting a record
 Refreshing the frame

Add a Record:

To insert records into the student_info table, a user-defined function named
Add must be added to Code Snippet 3. The code for the Add function is shown
in Code Snippet 4.

Code Snippet 4:

def Add():
    # Read the values entered in the form
    stuid = e1.get()
    stuname = e2.get()
    stuage = e3.get()
    city = e4.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("INSERT INTO student_info (Stud_id, Name, Age, City) "
               "VALUES (%s, %s, %s, %s)")
        val = (stuid, stuname, stuage, city)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record inserted successfully.")
        # Clear the form for the next entry
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        # Undo the pending change if the statement fails
        print(e)
        mysqldb.rollback()
    mysqldb.close()



Also, the line of code pertaining to the Add button in Code Snippet 3 must be
edited to include command = Add as shown in the code:

tk.Button(root, text="Add", command=Add, height=3,
          width=13).place(x=30, y=160)

The Add function in Code Snippet 4 gets the data that the user inputs in the
form. This data is stored in the student_info table using the INSERT
statement.
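
The values read from the Entry widgets are plain strings, so it can help to
check them before they reach the INSERT statement. The following sketch shows
one possible check. The function name validate_student is hypothetical; the
length limits are taken from the column definitions in Code Snippet 2, and
treating Age as a required whole number is an assumption made for this
example.

```python
def validate_student(stuid, name, age, city):
    """Return cleaned values or raise ValueError with a readable message."""
    stuid, name, city = stuid.strip(), name.strip(), city.strip()
    if not stuid:
        raise ValueError("Student ID is required.")
    if len(stuid) > 10:      # Stud_id is CHAR(10)
        raise ValueError("Student ID must be at most 10 characters.")
    if len(name) > 20:       # Name is CHAR(20)
        raise ValueError("Name must be at most 20 characters.")
    if len(city) > 80:       # City is CHAR(80)
        raise ValueError("City must be at most 80 characters.")
    try:
        age_value = int(age)  # Age is INT
    except ValueError:
        raise ValueError("Age must be a whole number.") from None
    if age_value < 0:
        raise ValueError("Age cannot be negative.")
    return (stuid, name, age_value, city)

print(validate_student(" 1001 ", "Richard Franklin", "14", "Boston"))
```

Inside Add, the returned tuple could be passed as the val argument of
mycursor.execute, and the ValueError message could be shown with
messagebox.showerror.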

Delete a Record:

To delete records from the student_info table, a user-defined function
named Delete must be added to Code Snippet 3. The code for the Delete
function is shown in Code Snippet 5.

Code Snippet 5:

def Delete():
    # Only the Student ID is required to identify the record
    stuid = e1.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = "DELETE FROM student_info WHERE Stud_id = %s"
        val = (stuid,)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record Deleted successfully.")
        # Clear the form
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        # Undo the pending change if the statement fails
        print(e)
        mysqldb.rollback()
    mysqldb.close()



Also, the line of code pertaining to the Delete button in Code Snippet 3 must
be edited to include command = Delete as shown in the code.

tk.Button(root, text="Delete", command=Delete, height=3,
          width=13).place(x=360, y=160)

The Delete function in Code Snippet 5 gets the Student ID of the student
whose record must be deleted. The user inputs Student ID in the form. The
record is deleted from the student_info table using the DELETE statement.
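
A DELETE statement succeeds even when no row matches the given Student ID, so
the success message alone does not prove that a record was removed. The
following sketch shows one way to detect this by reading the rowcount
attribute of the cursor after the statement runs. The function name
delete_student is hypothetical.

```python
def delete_student(conn, stud_id):
    """Delete one student and return the number of rows removed."""
    cursor = conn.cursor()
    cursor.execute("DELETE FROM student_info WHERE Stud_id = %s", (stud_id,))
    conn.commit()
    # rowcount is 0 when no record had the given Student ID.
    return cursor.rowcount

# Usage inside the Delete function (mysqldb is the open connection):
# if delete_student(mysqldb, stuid) == 0:
#     messagebox.showinfo("information", "No record found for this ID.")
```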

Update a Record:

To update a record in the student_info table, a user-defined function
named Update must be added to Code Snippet 3. The code for the Update
function is shown in Code Snippet 6.

Code Snippet 6:

def Update():
    # Read the values entered in the form
    stuid = e1.get()
    stuname = e2.get()
    stuage = e3.get()
    city = e4.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("UPDATE student_info SET Name = %s, Age = %s, City = %s "
               "WHERE Stud_id = %s")
        val = (stuname, stuage, city, stuid)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record Updated successfully.")
        # Clear the form
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        # Undo the pending change if the statement fails
        print(e)
        mysqldb.rollback()
    mysqldb.close()

Also, the line of code pertaining to the Update button in Code Snippet 3 must
be edited to include command = Update as shown in the code.

tk.Button(root, text="Update", command=Update, height=3,
          width=13).place(x=250, y=160)

The Update function in Code Snippet 6 gets the Student ID, Name, Age, and
City of the student whose record must be updated. The user inputs this data
in the form. The record is updated in the student_info table using the UPDATE
statement.

Read a Record:

To retrieve information from the student_info table, a user-defined function
named Read must be added to Code Snippet 3. The code for the Read
function is shown in Code Snippet 7.

Code Snippet 7:

def Read():
    stuid = e1.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("SELECT Stud_id, Name, Age, City FROM student_info "
               "WHERE Stud_id = %s")
        val = (stuid,)
        mycursor.execute(sql, val)
        records = mycursor.fetchone()
        if records is None:
            # fetchone returns None when no row matches the Student ID
            messagebox.showinfo("information",
                                "No record found for this Student ID.")
        else:
            # Clear all fields before filling them with the fetched values
            e1.delete(0, END)
            e2.delete(0, END)
            e3.delete(0, END)
            e4.delete(0, END)
            e1.insert(0, records[0])
            e2.insert(0, records[1])
            e3.insert(0, records[2])
            e4.insert(0, records[3])
        e1.focus_set()
    except Exception as e:
        print(e)
    mysqldb.close()

Also, the line of code pertaining to the Read button in Code Snippet 3 must
be edited to include command = Read as shown in the code.

tk.Button(root, text="Read",command =
Read,height=3, width= 13).place(x=140, y=160)

The Read function in Code Snippet 7 gets the Stud_id of the student whose
record must be fetched. The user inputs the Stud_id in the form. The record is
fetched from the student_info table using the SELECT statement and
displayed on the screen.
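
The lookup itself can also be kept separate from the widgets, which makes it
easier to reuse and to test without opening a window. The following is a
sketch under that assumption; the function name fetch_student is
hypothetical. It returns None when no record matches, which is also what the
fetchone method returns in that case.

```python
def fetch_student(cursor, stud_id):
    """Return the matching record as a dictionary, or None if not found."""
    cursor.execute(
        "SELECT Stud_id, Name, Age, City FROM student_info "
        "WHERE Stud_id = %s",
        (stud_id,),
    )
    row = cursor.fetchone()
    if row is None:
        return None
    # Pair each column name with the value at the same position.
    return dict(zip(("Stud_id", "Name", "Age", "City"), row))

# Usage inside the Read function (mycursor is the open cursor):
# student = fetch_student(mycursor, e1.get())
```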

Refresh the Frame:

When the frame is launched, all the records in the student_info table are
displayed in the Treeview widget at the lower half of the frame. After an
update or delete operation is performed on the student_info table, clicking
the Refresh button will fetch the updated records from the table. The show
function in Code Snippet 8 takes care of this.

Code Snippet 8:

def show():
    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    # Remove the rows currently shown in the Treeview
    for child in listdisplay.get_children():
        listdisplay.delete(child)

    mycursor.execute("SELECT Stud_id, Name, Age, City FROM student_info")
    records = mycursor.fetchall()

    # Insert one Treeview row for each record
    for (Stud_id, Name, Age, City) in records:
        listdisplay.insert("", "end", values=(Stud_id, Name, Age, City))
    mysqldb.close()
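
Each function in Code Snippets 4 to 8 repeats the same steps: connect, create
a cursor, commit or roll back, and close. The following sketch shows one way
to collect those steps in a single context manager. The name db_cursor is
hypothetical, and connect stands for any zero-argument callable that returns
a connection, such as a small wrapper around mysql.connector.connect.

```python
from contextlib import contextmanager

@contextmanager
def db_cursor(connect):
    """Yield a cursor; commit on success, roll back on error, always close."""
    conn = connect()
    try:
        cursor = conn.cursor()
        yield cursor
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

# Usage (requires a running MySQL server and the connector module):
# def connect():
#     import mysql.connector
#     return mysql.connector.connect(host="localhost", user="root",
#                                    password="mysql123",
#                                    database="studentdb")
#
# with db_cursor(connect) as cursor:
#     cursor.execute("DELETE FROM student_info WHERE Stud_id = %s", ("1001",))
```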



Code Snippet 9 is the final program that includes creation of widgets and
adding functionalities to those widgets. Update the student_design.py file
with this code.

Code Snippet 9:

import tkinter as tk
from tkinter import ttk, messagebox
import mysql.connector
from tkinter import *

def Add():
    stuid = e1.get()
    stuname = e2.get()
    stuage = e3.get()
    city = e4.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("INSERT INTO student_info (Stud_id, Name, Age, City) "
               "VALUES (%s, %s, %s, %s)")
        val = (stuid, stuname, stuage, city)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record inserted successfully.")
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        print(e)
        mysqldb.rollback()
    mysqldb.close()

def Read():
    stuid = e1.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("SELECT Stud_id, Name, Age, City FROM student_info "
               "WHERE Stud_id = %s")
        val = (stuid,)
        mycursor.execute(sql, val)
        records = mycursor.fetchone()
        if records is None:
            messagebox.showinfo("information",
                                "No record found for this Student ID.")
        else:
            e1.delete(0, END)
            e2.delete(0, END)
            e3.delete(0, END)
            e4.delete(0, END)
            e1.insert(0, records[0])
            e2.insert(0, records[1])
            e3.insert(0, records[2])
            e4.insert(0, records[3])
        e1.focus_set()
    except Exception as e:
        print(e)
    mysqldb.close()

def Update():
    stuid = e1.get()
    stuname = e2.get()
    stuage = e3.get()
    city = e4.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = ("UPDATE student_info SET Name = %s, Age = %s, City = %s "
               "WHERE Stud_id = %s")
        val = (stuname, stuage, city, stuid)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record Updated successfully.")
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        print(e)
        mysqldb.rollback()
    mysqldb.close()

def Delete():
    stuid = e1.get()

    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    try:
        sql = "DELETE FROM student_info WHERE Stud_id = %s"
        val = (stuid,)
        mycursor.execute(sql, val)
        mysqldb.commit()
        messagebox.showinfo("information",
                            "Student Record Deleted successfully.")
        e1.delete(0, END)
        e2.delete(0, END)
        e3.delete(0, END)
        e4.delete(0, END)
        e1.focus_set()
    except Exception as e:
        print(e)
        mysqldb.rollback()
    mysqldb.close()

def show():
    mysqldb = mysql.connector.connect(host="localhost", user="root",
                                      password="mysql123",
                                      database="studentdb")
    mycursor = mysqldb.cursor()

    for child in listdisplay.get_children():
        listdisplay.delete(child)

    mycursor.execute("SELECT Stud_id, Name, Age, City FROM student_info")
    records = mycursor.fetchall()

    for (Stud_id, Name, Age, City) in records:
        listdisplay.insert("", "end", values=(Stud_id, Name, Age, City))
    mysqldb.close()



root = tk.Tk()
root.geometry("800x500")

tk.Label(root, text="Student Information", fg="green",
         font=("Calibri", 30)).place(x=300, y=5)
tk.Label(root, text="If you want to update or delete a record,",
         fg="black", font=("Calibri", 12)).place(x=300, y=60)
tk.Label(root, text="then enter the Student ID and click the Read button.",
         fg="black", font=("Calibri", 12)).place(x=300, y=80)
tk.Label(root, text="Student ID").place(x=10, y=10)
tk.Label(root, text="Name").place(x=10, y=50)
tk.Label(root, text="Age").place(x=10, y=90)
tk.Label(root, text="City").place(x=10, y=130)

e1 = tk.Entry(root)
e1.place(x=140, y=10)
e1.focus_set()

e2 = tk.Entry(root)
e2.place(x=140, y=50)

e3 = tk.Entry(root)
e3.place(x=140, y=90)

e4 = tk.Entry(root)
e4.place(x=140, y=130)

tk.Button(root, text="Add", command=Add, height=3,
          width=13).place(x=30, y=160)
tk.Button(root, text="Read", command=Read, height=3,
          width=13).place(x=140, y=160)
tk.Button(root, text="Update", command=Update, height=3,
          width=13).place(x=250, y=160)
tk.Button(root, text="Delete", command=Delete, height=3,
          width=13).place(x=360, y=160)
tk.Button(root, text="Refresh", command=show, height=3,
          width=13).place(x=470, y=160)

cols = ('Student_id', 'Name', 'Age', 'City')
listdisplay = ttk.Treeview(root, columns=cols, show='headings')

for col in cols:
    listdisplay.heading(col, text=col)
listdisplay.place(x=10, y=250)

# Load the existing records when the frame is launched
show()
root.mainloop()

In the Command Prompt, run the command as:

python student_design.py

In the page that appears, insert 1001 in the Student ID field, Richard
Franklin in the Name field, 14 in the Age field, and Boston in the City field.
Then, click the Add button as shown in Figure 10.22.

Figure 10.22: Add a Record

A message box appears with the message “Student Record inserted
successfully”. Click OK. Click Refresh to view the inserted record as shown
in Figure 10.23.



Figure 10.23: Refresh After Adding a Record

Similarly, insert a few more records into the student_info table. To fetch a
record, enter the Student ID and click Read as shown in Figure 10.24.

Figure 10.24: Read a Record

The fetched information will be displayed as shown in Figure 10.25.



Figure 10.25: Fetched Record

The fields in this record can be updated by editing the values displayed. For
example, let us update the Age as 14. After updating the value, click the
Update button. Then, click Refresh to view the updated record as shown in
Figure 10.26.

Figure 10.26: Updated Records

To delete a record, enter the Student ID of the record as shown in Figure
10.27.



Figure 10.27: Delete a Record

Click the Delete button. Then, click Refresh to verify that the record is deleted
as shown in Figure 10.28.

Figure 10.28: After Deletion



10.4 Summary

 Python supports connectivity to relational databases such as MySQL,
SQLite, and Oracle, as well as NoSQL databases such as MongoDB.
 The MySQL Connector Python module allows Python to connect with MySQL.
 After establishing the connection with MySQL, operations such as
create, read, update, and delete can be performed on the database.
 These operations can be performed using the command-line interface
of MySQL or by creating an application using Tkinter.



Test Your Knowledge

1. Which of the following connectors is used to connect MySQL with Python?

a. Python connector
b. MySQL Connector
c. Connector MySQL Python
d. SQL Python connector

2. Which of the following methods of the mysql.connector module is used to
create a connection between the MySQL database and Python?

a. connection
b. pythonConnect
c. connect
d. mysqlConnect

3. Which of the following commands is used to install the
mysql.connector module?

a. pip install mysql-connector-python
b. pip install mysql-connector
c. pip install mysql-connect
d. pip install connect-mysql

4. Consider the following code:

1 import mysql.connector
2 conn = mysql.connector.connect(user='root',
password='mysql123', host='localhost')
3 cursor = conn.cursor()
4 # insert code here
5 conn.close()

Which line of code must be inserted on line 4 to create a database
named Employee?

a. conn.execute("CREATE DATABASE Employee")
b. cursor.execute.db("CREATE DATABASE Employee")
c. db.cursor.execute("CREATE DATABASE Employee")
d. cursor.execute("CREATE DATABASE Employee")



5. Which of the following is the correct syntax to bind a frame widget?

a. frame.bind(event, event handler)


b. bind(frame, event handler)
c. bind(event, frame)
d. bind.frame(event, event handler)



Answers to Test Your Knowledge

1 b
2 c
3 a
4 d
5 a



Try it Yourself

Use MySQL and Tkinter to perform the following tasks:

1. To connect Python with MySQL, install MySQL and the MySQL Connector
module.
2. Using MySQL Connector, write Python code to create a database named
EmpDB.
3. Using MySQL Connector, write Python code to create a table named
Emp_info in the EmpDB database with columns Emp_id, Emp_name,
Designation, Basic_pay, and Net_pay. Emp_id is the primary key.
4. Write Python code to design an Employee Registration application using
Tkinter to get the information such as Employee id, Employee Name,
Designation, and Basic pay using Entry widgets. Add another Entry
widget for the label Net Pay to display the net salary of the employee.
5. Write Python code to place two buttons labeled ADD and NETPAY.
The ADD button is used to add the employee records into the Emp_info
table. The NETPAY button is used to calculate the net salary of the
employee as: Net salary = Basic pay + incentive
 If the basic pay of the employee is <$1500, then incentive = $300.
 If the basic pay of the employee is >$1500 and <$3000, then
incentive = $800.
 If the basic pay of the employee is >$3000 and <$5000, then
incentive = $1200.
 If the basic pay of the employee is >$5000, then incentive = $1500.
6. Write Python code to get all the information of the employee other than
net_pay value in the widgets. When the user clicks the NETPAY button
the code has to calculate the net salary of the employee using the given
criteria. The calculated net pay value must be displayed in the Entry
widget of the label Net Pay.
7. Write Python code to add the employee details given in Table 10.2 into
the Emp_info table. Calculate Net_pay and add all the fields into the
Emp_info table.

Table 10.2: Employee Details



Learning Objectives

In this session, students will learn to:

⮚ Describe the purpose of Pandas


⮚ Explain the two data structures provided by Pandas
⮚ Describe commonly used Pandas data manipulation and text
processing functions

Data handling is the process of performing different operations on data, such
as loading, cleaning, transforming, analyzing, and visualizing. Some of the
general purposes of data handling are:
 To load, explore, and present data in different formats such as charts,
graphs, and plots.
 To analyze data for deriving useful information and drawing conclusions.
 To ensure the integrity of the research data.
 To present data in an organized way for ease of interpretation.



Pandas, an open-source library built on Python, is one of the most popular and
powerful data handling and manipulation tools. In this session, you will learn
how to handle data using Pandas.

11.1 Introduction to Pandas

Pandas is a Python package that includes various functions to help you work
with data sets. The different types of data that you can handle using Pandas
include textual data, numerical data, boolean data, datetime, tabular data,
time series, matrices, arrays, and more. The various tasks that you can perform
using Pandas are:
 Loading data from different sources such as Excel, Comma-Separated
Values (CSV), SQL, HTML, and JSON.
 Handling missing data by filling, replacing, or dropping them.
 Merging and joining data from different data sources.
 Reshaping and pivoting data for further data processing or data
summarization.
 Grouping and aggregating data.
 Performing statistical operations on grouped data, such as mean,
median, and standard deviation.
 Presenting data through plots, graphs, and charts.
 Formatting the display of data by applying styles.

To use the classes and functions available in the Pandas library, you must first
install Pandas on your system and then import the library into your code.

11.1.1 Install and Import Pandas

To install Pandas, you can use the pip command as in:

pip install pandas

This command launches the pip installer package, which in turn downloads
the packages and files required to run Pandas. Figure 11.1 shows the
installation of Pandas using the pip installer.


Figure 11.1: Installation of Pandas

Once Pandas is downloaded and installed, it is ready for use. Next, you must
import it in your code. The syntax for importing pandas is:

import pandas as <alias name>

The given syntax imports Pandas with an alias name. Accessing Pandas
functions using its alias name makes your code concise and readable. For
example, the given code imports Pandas with an alias name pd.

import pandas as pd

This way, whenever you want to access one of the functions in Pandas,
instead of typing pandas.<function_name>, you can type
pd.<function_name>.


11.1.2 Types of Data Structures in Pandas

A data structure is a specialized format to arrange and organize data in
storage so that retrieval of data becomes easy. Pandas provides two types of
data structures to store and retrieve data. They are:

• Series: A Series is a one-dimensional labeled array that holds data of
any type such as strings and integers.

• DataFrame: A DataFrame is a two-dimensional array such as a table or a
spreadsheet that stores data of different types in rows and columns.
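
The difference between the two data structures can be sketched in a few lines. The following is only an illustration: the product names and prices are made-up sample values and do not come from the course files.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
prices = pd.Series([15.5, 20.0, 12.25], name="Price")
products = pd.DataFrame({
    "Product": ["Mouse", "Keyboard", "Cable"],
    "Price": [15.5, 20.0, 12.25],
})

print(prices.ndim)     # number of dimensions of the Series
print(products.ndim)   # number of dimensions of the DataFrame
print(products.shape)  # (number of rows, number of columns)

# Selecting a single column of a DataFrame gives back a Series.
price_column = products["Price"]
print(type(price_column).__name__)
```

The ndim attribute reports one dimension for the Series and two for the DataFrame, and each column of a DataFrame is itself a Series.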

11.2 Creating Objects

To use the Series or DataFrame data structures available in Pandas, you
must first create an instance or object of these data structures. You can then
invoke the required functions available for these data structures using the
created object.

11.2.1 Creating Pandas Series

As already discussed, a Series is a one-dimensional array that can hold
homogeneous data of any type. Figure 11.2 shows a few examples of the
Series data structure. In the example, the first row shows a Series of integer
data, while the second and third rows each show a Series of string data.


Figure 11.2: Examples of Series Data Structure

The syntax for creating a Series is:

pandas.Series(data, index, dtype, copy)

In the syntax, data, index, dtype, and copy are the parameters of the Series
function. Table 11.1 describes these parameters.

Parameter    Description
data         Contains the data to be stored in the Series object. The data
             can be a list, an array, a scalar value, or a dictionary.
index        Labels that identify the values in the Series. The index must
             be of the same length as data. Labels are usually unique,
             although Pandas does not enforce this.
             If not specified, index is set to a range of integers from 0
             to n-1, where n is the length of the data. If data is a
             dictionary, the keys of the dictionary are used as the index
             by default.
dtype        The type of data the Series holds. If not specified, Pandas
             infers the type from the given data.
copy         A boolean value that indicates whether to copy the input
             data. The default value is False.

Table 11.1: Parameters of pandas.Series Function
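
The index and dtype parameters from Table 11.1 can be tried out with a short sketch. The labels and stock counts used here are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical stock counts, labeled by item name instead of 0, 1, 2.
stock = pd.Series([12, 7, 30], index=["pen", "ink", "pad"], dtype="float64")
print(stock)

# A value can be looked up by its label.
print(stock["ink"])

# The dtype argument overrides the type that Pandas would otherwise infer.
print(stock.dtype)
```

Without the dtype argument, Pandas would infer an integer type for this data. Passing float64 stores the same values as floating-point numbers.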

You can create an object of Series from a list, a dictionary, an array, or a
scalar value.

• Creating Pandas Series from Lists

Code Snippet 1 shows the creation of a Series object from a list. The code
passes a list of string values as the data argument and creates an object
named color. The code then displays the values stored in color to the
standard output device. The code lets Pandas create a default index for
the values in the list.

Code Snippet 1:

import pandas as pd
color = pd.Series(['red', 'blue', 'green', 'yellow', 'White'])
print(color)

Figure 11.3 shows the output of Code Snippet 1. Note that the output
specifies the index or the labels for each of the values in the list. By default,
the index is set to a range of integers from 0 to 4. In addition, the output also
displays the inferred data type of color, which in this case is object.

Figure 11.3: Output of Code Snippet 1

• Creating Pandas Series from Dictionaries

A dictionary is a collection of key-value pairs. Code Snippet 2 shows the
creation of a Series object from a dictionary. The code first creates a
dictionary object named user_dict that maps string keys to integer values. It
then creates a Series object named user_ser from user_dict and prints the
data stored in user_ser to the standard output device.

Code Snippet 2:

import pandas as pd
# The name user_dict is used so that the built-in dict type is not shadowed.
user_dict = {'Richard': 15,
             'David': 20,
             'William': 25}
user_ser = pd.Series(user_dict)
print(user_ser)

Figure 11.4 shows the output of Code Snippet 2. Note that the inferred data
type of user_ser is displayed as int64 in the output.

Figure 11.4: Output of Code Snippet 2

• Creating Pandas Series from Scalars

A scalar is a single value. To create a Series object that repeats a scalar
value, pass the index argument to the pandas.Series function by
specifying the labels for the data. Pandas then creates a Series object by
repeating the scalar value to match the length of the specified index. For
example, Code Snippet 3 creates a Series object from the scalar value 50 by
passing a list of five integers as the index argument.

Code Snippet 3:

import pandas as pd
num_ser = pd.Series(50, index=[1, 2, 3, 4, 5])
print(num_ser)

Figure 11.5 shows the output of Code Snippet 3. Note that the output repeats
the scalar value 50 five times, matching the length of the specified index.


Figure 11.5: Output of Code Snippet 3

11.2.2 Creating Pandas DataFrame

As already explained, a DataFrame is a two-dimensional data structure
containing rows and columns. The syntax for creating a DataFrame is:

pandas.DataFrame(data, index, columns, dtype, copy)

In the syntax, data, index, columns, dtype, and copy are the parameters of
DataFrame function. Table 11.2 describes these parameters.

Parameter    Description
data         Specifies the data to be stored in the DataFrame object. The
             data can be a collection such as an ndarray (an n-dimensional
             array), a list, a Series, a dictionary, or another DataFrame.
             Data held in Microsoft Excel files, CSV files, SQL tables, or
             SQL result sets is loaded with reader functions such as
             pd.read_excel, pd.read_csv, and pd.read_sql instead.
index        Specifies the row labels of the DataFrame. If not specified,
             index defaults to range(n), where n is the length of the data.
columns      Specifies the column labels of the DataFrame. If not
             specified, columns default to range(m), where m is the number
             of columns in the data. If the data is a dictionary, the keys
             are used as column labels, by default.
dtype        Specifies the type of data to be stored in the DataFrame. If
             not specified, Pandas infers the type from the given data.
copy         A boolean value that indicates whether to copy the input
             data. In recent Pandas versions, the default value is None,
             which copies dictionary data but does not copy DataFrame or
             two-dimensional ndarray input.

Table 11.2: Parameters of pandas.DataFrame Function

You can create DataFrames from different data sources such as lists,
dictionaries, CSV files, Excel files, and SQL tables.
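
As a minimal sketch of the index and columns parameters described in Table 11.2, the following code labels both the rows and the columns of a small DataFrame. The product names, prices, and ratings are hypothetical.

```python
import pandas as pd

# Hypothetical rows: each inner list holds a price and a rating.
rows = [[1200, 4.5], [35, 4.1], [80, 3.9]]

df = pd.DataFrame(
    rows,
    index=["Laptop", "Mouse", "Keyboard"],  # row labels
    columns=["Price", "Rating"],            # column labels
)
print(df)
print(df.index.tolist())
print(df.columns.tolist())
```

If the index argument is left out, the rows are labeled 0, 1, and 2 instead of the product names.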

• Creating DataFrames from Lists

You can create a DataFrame object from a single list, multiple lists, or a list
of lists using the pandas.DataFrame function. For example, Code Snippet 4
creates a DataFrame from a list of lists. The code passes a list containing three
lists representing names and genders of people as the data argument. As
each inner list contains two values, the code passes two labels, namely Name
and Gender, as the columns argument. The code then prints the created
DataFrame object to the standard output device.

Code Snippet 4:

import pandas as pd
data = [['Richard', 'M'], ['Emy', 'F'], ['Adam', 'M']]
df = pd.DataFrame(data, columns=['Name', 'Gender'])
print(df)

Figure 11.6 shows the output of Code Snippet 4. You can see that the output
displays the data stored in the DataFrame object with the specified column
labels and the default row labels in a tabular format.

Figure 11.6: Output of Code Snippet 4

• Creating DataFrames from Dictionaries

There are different ways to create a DataFrame from a dictionary. Code
Snippet 5 shows how to create a DataFrame object from a dictionary of lists
containing names and genders of people. The code creates a dictionary
object named data with two keys, Name and Gender. It associates these keys
with lists of string values representing names and genders of three people. The
code then creates a DataFrame object named df from the dictionary, using
the pd.DataFrame function and prints the DataFrame to the standard output
device.


Code Snippet 5:

import pandas as pd
data = {'Name': ['Richard', 'Emy', 'Adam'],
        'Gender': ['M', 'F', 'M']}
df = pd.DataFrame(data)
print(df)

Figure 11.7 shows the output of Code Snippet 5. The output shows the column
labels, the default row labels, and the values in a tabular format. Note that the
keys of the dictionary are used as column labels and their corresponding
values are displayed in the respective columns.

Figure 11.7: Output of Code Snippet 5

• Creating DataFrames from CSV Files

CSV files are comma-separated text files. You can create a DataFrame object
from a CSV file using the pd.read_csv function. This function accepts the path
or URL of a CSV file as an argument, reads the CSV file, and returns its contents
as a DataFrame object.

Code Snippet 6 demonstrates the use of the pd.read_csv function. Before
running the code, download the Product_sales.csv file present under Course
Files on OnlineVarsity and upload it to the current working directory. This is
required because the code passes only the file name to the read_csv function,
which then looks for the file in the current working directory. The code ends
with the name of the newly created DataFrame object, df, so that the
notebook displays the loaded contents on the screen.

Code Snippet 6:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df

Figure 11.8 shows the output of Code Snippet 6. The output shows the contents
of the Product_sales.csv file that was loaded into the DataFrame object.
The output indicates that:
• The CSV file has eight columns: Product, Brands, Description, Sale Price
in $, Marked Price in $, Number of Ratings, Number of Reviews,
and Star Ratings.
• The index of the DataFrame runs from 0 to n-1, where n is the number of
rows in the CSV file.
• The value in each cell is a string, an integer, or a float depending on the
data type of the respective column.

Figure 11.8: Output of Code Snippet 6
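
If the course file is not at hand, the behavior of read_csv can still be tried on CSV text held in memory. The sketch below assumes a small made-up CSV with three columns. The io.StringIO object from the Python standard library lets read_csv treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line holds the column names.
csv_text = """Product,Brands,Sale Price in $
Earphones,Sony,25
Mouse,Dell,12
"""

# io.StringIO wraps the string in a file-like object that read_csv can read.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # data type inferred for each column
```

The first line of the CSV text becomes the column labels, and the rows receive the default integer index starting from 0.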

11.3 Viewing Data

Pandas provides various methods to help you view the data stored in its data
structures. For example, you can use the head and tail methods of the
Series and DataFrame classes to view a small sample of the respective
objects. Alternatively, to view the statistical information of the data, you can
use the describe method.

11.3.1 Viewing Top Rows

To view the first n rows of a Series or a DataFrame, use the head method. By
default, the head method returns the first five rows. To view a different
number of rows, you can pass the required integer as an argument to the head
method.

The syntax for retrieving the first n rows from a DataFrame is:

pandas.DataFrame.head(<n>)

The syntax for retrieving the first n rows from a Series is:

pandas.Series.head(<n>)

Code Snippet 7 demonstrates the purpose of the head method using a
DataFrame. The code reads the contents of the Product_sales.csv file using
the pd.read_csv function and creates a DataFrame object. It then uses the
df.head method to return the first five rows of the object and prints the
returned object on the screen.

Code Snippet 7:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
data_top = df.head()
data_top

Figure 11.9 shows the output of Code Snippet 7. The output shows the first five
rows of the Product_sales.csv file with all the columns and the default index
starting from 0.

Figure 11.9: Output of Code Snippet 7

11.3.2 Viewing Bottom Rows

To view the last n rows of a Series or a DataFrame, use the tail method. By
default, the tail method returns the last five rows. To view a different number
of rows, you can pass the required integer as an argument to the tail
method.

The syntax for retrieving the last n rows from a DataFrame is:

pandas.DataFrame.tail(<n>)

The syntax for retrieving the last n rows from a Series is:

pandas.Series.tail(<n>)

Code Snippet 8 demonstrates the purpose of the tail method using a
DataFrame. The code reads the contents of the Product_sales.csv file using
the pd.read_csv function into a DataFrame object. It then uses the df.tail
method to return the last five rows of the object and prints the returned
object on the screen.

Code Snippet 8:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
data_bottom = df.tail()
data_bottom

Figure 11.10 shows the output of Code Snippet 8. The output shows the last five
rows of the Product_sales.csv file from index 9 to 13 with all the columns.

Figure 11.10: Output of Code Snippet 8
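
Both head and tail accept the number of rows as an argument. The following sketch uses a small made-up DataFrame of eight rows to show this.

```python
import pandas as pd

# Hypothetical data: eight items labeled A to H with quantities 0 to 7.
df = pd.DataFrame({"Item": list("ABCDEFGH"), "Qty": range(8)})

first_three = df.head(3)  # first three rows
last_two = df.tail(2)     # last two rows

print(first_three)
print(last_two)
```

Note that the rows returned by tail keep their original index labels, which are 6 and 7 in this case.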


11.3.3 Viewing Statistical Data

To view the statistical data of a DataFrame or a Series, you can use the
describe method. This method returns a summary of the statistical information
of the Series or DataFrame provided. For numerical data, the summary
includes the count, mean, standard deviation, minimum, and maximum of the
values, along with the 25th, 50th, and 75th percentiles.

The syntax for retrieving the statistical values of a DataFrame is:

pandas.DataFrame.describe()

The syntax for retrieving the statistical values of a Series is:

pandas.Series.describe()

Let us look at an example code that demonstrates the use of the describe
method using a DataFrame object. Code Snippet 9 creates a DataFrame
object called df from the Product_sales.csv file and calls the describe
method to obtain the summary statistics of df. It then prints the summary
statistics on the screen.

Code Snippet 9:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
print(df.describe())

Figure 11.11 shows the output of Code Snippet 9. The output shows the statistics
of only those columns that are of numerical data type and does not include
Product, Brands, and Description columns. This is because the data type of
these columns is not numerical.


Figure 11.11: Output of Code Snippet 9
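
The same behavior can be reproduced on a small made-up DataFrame. The sketch below also passes include="all", an option of describe that adds the non-numerical columns to the summary.

```python
import pandas as pd

# Hypothetical data with one text column and one numerical column.
df = pd.DataFrame({
    "Product": ["Mouse", "Keyboard", "Cable", "Webcam"],
    "Price": [10.0, 20.0, 30.0, 40.0],
})

summary = df.describe()  # numerical columns only
print(summary)

# include="all" also summarizes the text column (count, unique, top, freq).
print(df.describe(include="all"))
```

Individual statistics can be read from the summary by label, for example summary.loc["mean", "Price"].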

11.4 Manipulating Data

As you recall, Pandas allows you to manipulate data to modify, transform, or
analyze the data. Some examples of data manipulation that you can perform
using data structures in Pandas are:
• Adding new columns to a DataFrame
• Applying arithmetic operations on a DataFrame
• Sorting the columns or rows of a Series or a DataFrame
• Truncating data from a Series or a DataFrame
• Filtering rows or columns of a DataFrame
• Handling missing data

11.4.1 Adding Columns

To add a column to a DataFrame, specify the new column name between the
[] brackets on the left side of the assignment operator. Then, specify the value
to be assigned to the column on the right side of the assignment operator. The
value can be a list, a dictionary, a Series or another DataFrame, or even a
result of an arithmetic operation.

Let us now look at an example code that demonstrates how to add a new
column to a DataFrame. The newly added column holds a value computed
from arithmetic operations performed on the existing columns. For the sample
code, let us consider the data in the Product_sales.csv file and create a
DataFrame from this CSV file. Code Snippet 10 shows the sample code. The
code adds a column called Discount to the existing DataFrame object, df,
using the [] brackets. The code performs arithmetic operations on the
Marked Price in $ and Sale Price in $ columns to calculate the
percentage of discount for each product. It then assigns the computed value
to the Discount column. The percentage of discount is calculated by
subtracting the Sale Price in $ from the Marked Price in $, multiplying by
100, and dividing by the Marked Price in $. Finally, the code calls df to
display the DataFrame on the standard output device.

Code Snippet 10:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
# The expression is wrapped in parentheses so that it can span several lines.
df['Discount'] = (
    (df['Marked Price in $'] - df['Sale Price in $']) * 100
    / df['Marked Price in $']
)
df

Figure 11.12 shows the output of Code Snippet 10. Note that the DataFrame
has an additional column, Discount that shows the discount percentage for
each product.

Figure 11.12: Output of Code Snippet 10
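
The same technique works with other kinds of values on the right side of the assignment. The following sketch uses a small made-up DataFrame and adds three columns: one computed from existing columns, one from a scalar, and one from a list.

```python
import pandas as pd

# Hypothetical prices for two products.
df = pd.DataFrame({
    "Marked Price in $": [100.0, 50.0],
    "Sale Price in $": [80.0, 45.0],
})

# Column computed from existing columns.
df["Discount"] = (
    (df["Marked Price in $"] - df["Sale Price in $"]) * 100
    / df["Marked Price in $"]
)

# A scalar is repeated for every row.
df["Currency"] = "USD"

# A list must have one value per row.
df["In Stock"] = [True, False]

print(df)
```

Assigning to a column name that already exists replaces the values of that column instead of adding a new one.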


11.4.2 Applying Arithmetic Operations

There may be scenarios where you want to add a scalar value to the values
in a Series or column of a DataFrame object. For example, consider that a
DataFrame holds a column Sale Price for each product, and you want to
increase the values in Sale Price by a constant for all products. Pandas
allows you to perform such basic arithmetic operations on a Series or a
DataFrame by providing various functions. Some of these functions are add,
sub, mul, and div.

Code Snippet 11 demonstrates the use of the add function. The code reads
Product_sales.csv into a DataFrame object df. It then uses the df['Sale
Price in $'].add(20) function to add the scalar value of 20 to the values
in the column Sale Price in $ of the DataFrame object df. It displays df
with the updated values in the column Sale Price in $.

Code Snippet 11:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df['Sale Price in $']= df['Sale Price in $'].add(20)
df

Figure 11.13 shows the output of Code Snippet 11. Note that the values in Sale
Price in $ are incremented by 20 for each product.


Figure 11.13: Output of Code Snippet 11
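
The sub, mul, and div functions work in the same way as add. The following sketch applies them to a made-up Series of prices.

```python
import pandas as pd

# Hypothetical prices.
prices = pd.Series([100.0, 250.0, 40.0])

reduced = prices.sub(10)  # same result as prices - 10
doubled = prices.mul(2)   # same result as prices * 2
quarter = prices.div(4)   # same result as prices / 4

print(reduced)
print(doubled)
print(quarter)
```

Each function returns a new Series and leaves the original prices unchanged.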

11.4.3 Sorting

Arranging the data in ascending or descending order is called sorting.
Generally, you use the sort_values method to arrange the data in a
DataFrame or Series according to the specified order. The syntax for the
sort_values method is:

pandas.DataFrame.sort_values(by, axis=0, ascending=True,
inplace=False, kind='quicksort', na_position='last')

pandas.Series.sort_values(axis=0, ascending=True,
inplace=False, kind='quicksort', na_position='last')

Note that only the DataFrame version accepts the by parameter, because a
Series has a single set of values to sort.

Table 11.3 describes the parameters of the sort_values method.

Parameter    Description
by           A string or a list of strings that specifies the labels to be
             sorted. If axis is 0 or 'index', then the labels may contain
             column names. If axis is 1 or 'columns', then the labels may
             contain index names. This parameter applies to DataFrames
             only.
axis         An integer or a string that specifies the axis to be sorted.
             The default value is 0 or 'index', which indicates sorting by
             rows. If the value is 1 or 'columns', then the data is sorted
             by columns.
ascending    A boolean or a list of booleans specifying the order of
             sorting. The default value is True.
inplace      A boolean that indicates whether to modify the original
             DataFrame or return a new one. The default value is False.
kind         A string that specifies the sorting algorithm to be used. The
             values that can be specified for this parameter are
             'heapsort', 'mergesort', or 'quicksort'. The default value is
             'quicksort'. For DataFrames, this parameter is used only when
             sorting on a single column or label.
na_position  A string that accepts two values, 'first' or 'last', indicating
             the position of null values in the sorted order. The default
             value is 'last', which means the missing values are placed at
             the end of the sorted order.

Table 11.3: Parameters of sort_values method

Code Snippet 12 demonstrates the use of the sort_values method. The code
reads the data from Product_sales.csv into a DataFrame object df. It then
uses the sort_values method to sort the DataFrame by the Star ratings
column in ascending order. The code modifies the original DataFrame to the
sorted order by specifying True for inplace. In addition, it specifies 'last' for
na_position to place any null values represented by Not a Number (NaN) at
the end of the sorted order. The code then calls df to display the DataFrame
on the standard output device.

Code Snippet 12:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df.sort_values('Star ratings', axis = 0, ascending = True,
inplace = True, na_position ='last')

df

Figure 11.14 shows the output of Code Snippet 12. Note that the DataFrame
df is sorted by Star ratings. As you can see, the row with the index 0 is
displayed last because the value of Star ratings is NaN, which
indicates missing or null data.


Figure 11.14: Output of Code Snippet 12
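
The other parameters of sort_values can be explored with a short sketch. The code below uses a small made-up DataFrame to sort in descending order with missing values placed first, and then to sort by two columns at once.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the Rating column.
df = pd.DataFrame({
    "Brand": ["B", "A", "A", "C"],
    "Rating": [4.2, None, 3.8, 4.9],
})

# Descending order, with missing ratings placed at the top.
by_rating = df.sort_values("Rating", ascending=False, na_position="first")
print(by_rating)

# Sort by Brand (ascending), then by Rating (descending) within each brand.
by_both = df.sort_values(["Brand", "Rating"], ascending=[True, False])
print(by_both)
```

As inplace is left at its default value of False, both calls return new DataFrames and df keeps its original order.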

11.4.4 Truncating Data

Truncating removes the data that falls before or after the specified index
labels. You can truncate rows or columns of data from a DataFrame or a
Series using the truncate method. The syntax for the truncate method is:

pandas.DataFrame.truncate(before, after, axis, copy)

pandas.Series.truncate(before, after, axis, copy)

Table 11.4 describes the parameters of the truncate method.

Parameter    Description
before       Specifies the index value to truncate all rows before it.
after        Specifies the index value to truncate all rows after it.
axis         Specifies the axis to truncate along. The values can be:
             • 0 or 'index'
             • 1 or 'columns'
             The default value is 0 or 'index'.
copy         Specifies whether to return a copy of the truncated data or
             not. The default value is True.

Table 11.4: Parameters of truncate Method

Code Snippet 13 demonstrates the use of the truncate method. The code
reads the data from Product_sales.csv into a DataFrame object df. It then
calls the df.truncate method and sets the before and after parameters of
the method to 5 and 9, respectively. This indicates that all the rows before the
index label 5 and all the rows after the index label 9 should be removed from
the resultant DataFrame. As the code does not specify any value for the axis
parameter, the default value of 0 or 'index' is used, so the truncation is done
along the index, that is, by rows. Similarly, as the code does not specify the
copy parameter, the default value of True is used, and a copy of the
truncated section is returned as a new DataFrame object. The code assigns
this newly returned DataFrame object to the variable newdf. It then calls
newdf to display the DataFrame on the standard output device.

Code Snippet 13:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
newdf = df.truncate(before=5, after=9)
newdf

Figure 11.15 shows the output of Code Snippet 13. As you can see, the output
shows newdf, which contains only the rows with the index labels 5 through 9
of the original DataFrame df. The rows before and after this range are
removed.

Figure 11.15: Output of Code Snippet 13
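
The truncate method works on a Series in the same way. The sketch below uses a made-up Series of six values and shows that both boundary labels are kept in the result.

```python
import pandas as pd

# Hypothetical readings with the index labels 0 to 5.
temps = pd.Series([21, 23, 22, 25, 24, 20], index=[0, 1, 2, 3, 4, 5])

# Keep the labels 2 through 4; both boundaries are included.
middle = temps.truncate(before=2, after=4)
print(middle)

# Only one boundary may be given; this keeps the labels 4 and 5.
end_part = temps.truncate(before=4)
print(end_part)
```

The original Series temps still holds all six values after both calls.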

11.4.5 Filtering Data

Filtering data refers to extracting the required subset of data from an entire
dataset. To filter data in a data set, you can read the data into a DataFrame
object and use the pandas.DataFrame.loc indexer. The loc indexer allows
you to access a group of rows and columns by using labels. Unlike a method,
loc takes its input in square brackets.

The syntax for loc is:

pandas.DataFrame.loc[<label>]

In general, the input value of loc can be a single label, a list, or an array of
labels. An example of a single label could be 6 or 'Product', where 6 is
interpreted as a label of the index and not as an integer position.

Let us consider Code Snippet 14 that demonstrates loc. Code Snippet 14
reads Product_sales.csv into a DataFrame called df. The code sets
index_col to 0, indicating that the first column of the CSV file should be
used as the row labels of df. It then uses df.loc to extract the rows that
have the index or row label Earphones and return the rows as a DataFrame
object. The code assigns the newly returned DataFrame object to the variable
find_rec. Finally, the code calls find_rec to print the DataFrame on the
standard output device.

The index_col parameter in read_csv specifies which column to
use as the row labels of the DataFrame. By default, Pandas assigns
an integer index to each row of the DataFrame, starting from 0.

Code Snippet 14:

import pandas as pd
df = pd.read_csv('Product_sales.csv', index_col =0)
find_rec = df.loc["Earphones"]
find_rec

Figure 11.16 shows the output of Code Snippet 14. As you can see, the output
contains the details about the product, Earphones.

Figure 11.16: Output of Code Snippet 14
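
Besides a single row label, loc also accepts a row label together with a column label, lists of labels, and a boolean condition. The following sketch uses a small made-up DataFrame whose index holds product names.

```python
import pandas as pd

# Hypothetical data; the product names are used as row labels.
df = pd.DataFrame(
    {"Brand": ["Sony", "Dell", "Sony"], "Price": [25, 12, 60]},
    index=["Earphones", "Mouse", "Speaker"],
)

# A row label and a column label select a single value.
one_cell = df.loc["Mouse", "Price"]

# Lists of labels select several rows and columns.
some_rows = df.loc[["Earphones", "Speaker"], ["Price"]]

# A boolean condition keeps only the rows for which it is True.
sony_only = df.loc[df["Brand"] == "Sony"]

print(one_cell)
print(some_rows)
print(sony_only)
```

Filtering with a boolean condition is often the most convenient form, because the rows are chosen by their values and not by their labels.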


11.4.6 Handling Missing Data

A dataset might contain missing data or null values when there is no
information or value available for one or more columns. As missing data can
affect the validity of the data analysis and the results, it is imperative to
identify and handle missing data appropriately.

In Pandas, NaN and None represent missing or null values, where NaN is the
default missing value marker. The reasons for using NaN as the default missing
value marker are for computational speed and ease of use with different
data types such as floating point, boolean, or integer.

Pandas provides you with various functions to detect, fill, or drop NaN values
in a DataFrame or a Series.

• Detecting Missing Values

The functions to detect missing or NaN values in a Series or a DataFrame are:

o isnull: This function returns a boolean Series or DataFrame object
indicating whether each value in the respective data structure is null or
not. It returns True for null values and False for non-null values.
o notnull: This function also returns a boolean Series or DataFrame
object. However, it indicates whether each value in the respective data
structure is non-null. That is, this function returns True for non-null
values and False for null values.

As the isnull and notnull functions are logical opposites of each other, let
us look at a sample code demonstrating only the isnull function. Let us
consider the data in the Product_sales.csv file in which the Star Ratings for
the Earphones of Sony brand is NaN. Figure 11.17 shows the data in the
Product_sales.csv file highlighting the NaN value for Star Ratings in the
row with index 0.


Figure 11.17: Product_sales.csv File Highlighting the NaN Value

Code Snippet 15 reads this CSV file into a DataFrame object df and prints df
to the standard output device. It then calls the df.isnull function to return a
boolean DataFrame object that shows whether each value in the original
DataFrame object df is null or not. As the code does not assign the output
DataFrame to any variable and the call is the last expression in the cell, the
notebook displays the returned DataFrame directly.

Code Snippet 15:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
# print is used here because a notebook displays only the last expression of a cell.
print(df)
df.isnull()

Figure 11.18 shows the boolean DataFrame that was returned by the
df.isnull function. You can see that this output DataFrame has the same
number of rows and columns as that of the original DataFrame object df,
except that the values are either True or False. Note that the row with index
0 has the value True for Star Ratings because the value of this column is
NaN in the original DataFrame.


Figure 11.18: Output of Code Snippet 15
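
A boolean DataFrame is hard to read when the data set is large. A common follow-up, sketched below on made-up data, is to count the missing values in each column with sum and to use notnull to keep only the rows that have a value.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the Rating column.
df = pd.DataFrame({
    "Product": ["Earphones", "Mouse", "Cable"],
    "Rating": [None, 4.1, None],
})

# True counts as 1, so sum gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# notnull returns True for the rows that have a rating.
rows_with_rating = df[df["Rating"].notnull()]
print(rows_with_rating)
```

Here the count is 2 for the Rating column and 0 for the Product column.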

• Filling/Dropping Missing Values

Once you have identified that there are missing values in a DataFrame or a
Series, you can either fill those missing values or drop them. The functions to
fill or drop missing values are:
o fillna: This function replaces missing values with some other values
and returns the object with missing values filled.
o dropna: This function removes rows or columns with missing values
and returns the object with NaN entries dropped from it.

Code Snippet 16 demonstrates the use of the DataFrame.fillna function.
The code reads the Product_sales.csv file into a DataFrame object df. It
then calls the df.fillna function to replace all missing values in df with 0. The
df.fillna function returns a new DataFrame object with NaN entries filled with
the value 0.0.

Code Snippet 16:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df.fillna(0)

Figure 11.19 shows the output of Code Snippet 16 that displays the new
DataFrame object returned by the df.fillna function. Note that the NaN
value for Star Ratings in index 0 is replaced with 0.0. The replaced values
are highlighted in Figure 11.19.

Figure 11.19: Output of Code Snippet 16

Code Snippet 17 demonstrates the use of the DataFrame.dropna function.
The code reads the Product_sales.csv file into a DataFrame object df. It
then invokes the df.dropna function to drop the entries in df that have NaN
values. The df.dropna function returns a new DataFrame object with the NaN
entries dropped.

Code Snippet 17:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df.dropna()

Figure 11.20 shows the output of Code Snippet 17. The output DataFrame does
not have the row with index 0 as it has dropped this row due to the existence
of a missing value or NaN in Star Ratings. Similarly, the rows with index 3
and 9 are also dropped.


Figure 11.20: Output of Code Snippet 17
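
Filling every gap with 0 or dropping every incomplete row is not always suitable. As a sketch on made-up data, the code below fills each column with a different value by passing a dictionary to fillna, and drops rows only when a chosen column is missing by passing subset to dropna.

```python
import pandas as pd

# Hypothetical data with one gap in each numerical column.
df = pd.DataFrame({
    "Product": ["Earphones", "Mouse", "Cable"],
    "Rating": [None, 4.0, 3.0],
    "Reviews": [10.0, None, 5.0],
})

# Fill Rating with the column mean and Reviews with 0.
filled = df.fillna({"Rating": df["Rating"].mean(), "Reviews": 0})
print(filled)

# Drop a row only if its Rating is missing.
kept = df.dropna(subset=["Rating"])
print(kept)
```

Both functions return new DataFrames, so df still contains its NaN values afterwards.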

11.5 Working with Text Data

Manipulating text data is essential for normalizing, cleaning, analyzing, or
matching text data. Pandas provides some inbuilt string processing methods
that are accessed via the str attribute.

These functions operate on text data in Series or Index objects and can
perform tasks such as converting case, replacing strings, and removing
whitespaces.

• Converting Case

Pandas provides the str.lower and str.upper functions to convert text data
to lowercase and uppercase, respectively. Suppose you have a DataFrame
object df with a column Brands that contains string values. You can use the
df["Brands"].str.upper() command to get a new Series with all brands
in uppercase. Code Snippet 18 demonstrates this.

Code Snippet 18:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df["Brands"]= df["Brands"].str.upper()
df

The code creates a DataFrame df from Product_sales.csv. It then calls the
df["Brands"].str.upper() function to convert the string values in the
column Brands to uppercase. The str.upper function returns a new Series
object with the same index as the original data, but with all characters in
uppercase. The code assigns the resultant Series object to the same column
Brands of the DataFrame object df, replacing the original values with the
uppercase values. It then calls df to display the DataFrame on the standard
output device. Figure 11.21 shows the output of Code Snippet 18. Note that all
brands in df are changed to uppercase.

Figure 11.21: Output of Code Snippet 18

• Replacing Strings

Pandas provides the str.replace function to replace string values in a
Series or column in a DataFrame. For example, Code Snippet 19 replaces
earphones with Headphones in the Product column of the DataFrame
object df using the str.replace function.

Code Snippet 19:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
df["Product"]= df["Product"].str.replace("earphones",
"Headphones", case = False)
df

The boolean value False for the case parameter indicates that the
replacement is not case sensitive. The code assigns the resultant Series object
back to the same Product column of the DataFrame object df. It then calls
df to display it to the standard output device.

Figure 11.22 shows the output of Code Snippet 19. Note that the rows with index
0, 5, and 6 in the Product column, which previously had the value Earphones,
now have the value Headphones.

Figure 11.22: Output of Code Snippet 19

• Removing Whitespaces

Whitespaces in text data can make the data inconsistent or inaccurate. They
might also cause unexpected results when filtering, sorting, or merging data.
The functions to remove whitespaces from a Series or column in a DataFrame
are:
o str.lstrip: This function removes leading whitespaces from the
string.
o str.rstrip: This function removes trailing whitespaces from the
string.
o str.strip: This function removes both leading and trailing
whitespaces from the string.

Code Snippet 20 demonstrates the use of the strip function. The code
creates the DataFrame object df from Product_sales.csv and prints the
Product column in df to the standard output device. It then calls the
df["Product"].str.strip() function to remove the leading and trailing
spaces from the string values in the Product column. The str.strip function
returns a new Series object with the same index as the original data, but with
all the leading and trailing whitespaces removed. The code assigns the
resultant Series object to the same Product column of the DataFrame object
df. It then calls print(df.Product) to display the new values in the Product
column on the standard output device.

Code Snippet 20:

import pandas as pd
df = pd.read_csv('Product_sales.csv')
print(df.Product)
df['Product'] = df['Product'].str.strip()
print(df.Product)

Figure 11.23 and Figure 11.24 show the output of Code Snippet 20. Figure 11.23
shows the values in the Product column in the original DataFrame object df
with whitespaces. Figure 11.24 shows the updated df, with the leading and
trailing whitespaces removed from the values in the Product column.

Figure 11.23: Output of Code Snippet 20 with Whitespaces



Figure 11.24: Output of Code Snippet 20 Without Whitespaces
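
To compare the three functions side by side, the following is a minimal sketch that assumes a made-up Series of product names with extra spaces. The values are illustrative only.

```python
import pandas as pd

# Hypothetical values with extra spaces on the left, the right, or both.
products = pd.Series(["  Earphones", "Speaker  ", "  Smart watch  "])

left = products.str.lstrip()    # removes leading whitespace only
right = products.str.rstrip()   # removes trailing whitespace only
both = products.str.strip()     # removes leading and trailing whitespace

print(left.tolist())
print(right.tolist())
print(both.tolist())
```

Note that the space inside Smart watch is kept in all three results, because these functions only act on the ends of each string.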



11.6 Summary

 Pandas is an open-source Python library used for data handling and
manipulation.
 The two data structures that Pandas provides are Series and
DataFrame.
 Series are one-dimensional arrays that can hold homogeneous data of
any type.
 DataFrames are two-dimensional data structures containing rows and
columns.
 You can create objects of Series or DataFrame from a list, a dictionary,
an array, or a scalar value using the pandas.Series or
pandas.DataFrame functions, respectively.
 Some of the methods that allow you to view the data stored in Pandas
data structures are head, tail, and describe.
 Some of the Pandas functions that allow you to perform basic arithmetic
operations on a Series or a DataFrame are add, sub, mul, and div.
 The other functions that are used for sorting, truncating, or extracting
data are, respectively, sort_values, truncate, and loc.
 The Pandas functions to handle missing values are isnull, notnull,
fillna, and dropna.
 Some of the Pandas string processing methods that are accessed via
the str attribute are str.lower, str.upper, str.replace, and
str.strip.



Test Your Knowledge

1. Which of the following data structures in Pandas is a two-dimensional
array with heterogeneous data?

a. Series
b. Panel
c. DataFrame
d. FrameSeries

2. Which of the following methods is used to return the bottom n rows of a
DataFrame?

a. bottom
b. tail
c. row_bottom
d. tail_row

3. Which of the following Pandas methods is used to view some basic
statistical details such as percentile, mean, and standard deviation of
numeric values?

a. stat
b. percentile
c. statistics
d. describe

4. Which of the following functions replaces all the NaN values with a
specified value, 99?

a. fillnull
b. fillnan
c. fillna
d. fillNaN

5. Which of the following functions are used to check whether a value is
NaN or not?

a. isnull
b. isnan
c. notnull
d. notnan



Answers to Test Your Knowledge

1 c
2 b
3 d
4 c
5 a, c



Try it Yourself

1. Create a file named Student_marks.csv with the data given in Figure
11.25.

Figure 11.25: Student_marks.csv File

2. Use Pandas in Jupyter notebook to perform the given tasks:

a. Write a Python program to create a DataFrame object from the
Student_marks.csv file and display the DataFrame.

b. Write a Python program to get the first three rows and the last four
rows from the DataFrame.

c. Write a Python program to select only the rows where the student
hobby is Dancing.

d. Write a Python program to select only the rows where the Java
marks are missing.

e. Write a Python program to replace all the NaN values with 0.

f. Write a Python program to add a column Average by calculating
the average of all three subject marks for all the students.

g. Write a Python program to sort the names in ascending order and
average marks in descending order.

h. Write a Python program to replace the gender Male with M and
Female with F.



Learning Objectives

In this session, students will learn to:

⮚ Explain the concept of data visualization
⮚ Describe the Python libraries that are used for data visualization
⮚ Explain various data visualization techniques available in Python

In today's data-driven world, where vast amounts of information are generated
every day, the ability to convey patterns, trends, and correlations visually is
vital. Python offers a rich ecosystem of powerful libraries that enable data
professionals, analysts, and researchers to create interactive and engaging
visualizations. These visualizations not only enhance our understanding of data
but also help in making informed decisions. In addition, they help to share the
findings with diverse audiences, bridging the gap between raw data and
actionable insights.

This session will provide an overview of data visualization and its importance in
conveying information effectively to a wide audience. It will explain the Python
libraries such as Matplotlib and Seaborn used for data visualization. Finally, it
will describe different types of data visualization techniques available in
Python, including bar charts, scatter plots, line graphs, and histograms.



12.1 Introduction to Data Visualization

Consider that you are a marketing analyst tasked with understanding customer
behavior for an e-commerce Website. You have been provided with a massive
spreadsheet containing purchase history, demographics, and Website
interaction data for thousands of customers. Trying to decipher insights from
this tabular data will be challenging.

Tables and Comma Separated Value (CSV) files provide only raw data. To
identify meaningful patterns, trends, and relationships or to detect any
correlations between Website engagement and purchasing decisions, you
require a tool that can represent the data pictorially. Data visualization is the
process where raw data is transformed into visual representations.

12.1.1 What Is Data Visualization?

Data visualization falls within the sphere of data analysis and involves creating
visual depictions of information. These visuals such as pictures, maps, and
graphs help to convey insights of data in a clear way. Through data
visualization, you can quickly grasp an overview of any information. This visual
approach helps the human brain in processing and comprehending the data
provided.

Data visualization is a powerful technique for datasets of any size, but it is
most useful when dealing with large datasets where manual processing is
impractical. In cases where examining each piece of data individually is
tedious, visualization becomes an essential tool for understanding the
data.

12.2 Python Libraries for Data Visualization

Python provides a range of plotting libraries, including prominent names like
Matplotlib and Seaborn, along with various other data visualization packages.
These libraries come equipped with diverse features that facilitate the creation
of informative, personalized, and visually appealing plots. Their collective aim
is to present data in a simple and impactful manner.

12.2.1 Matplotlib

Matplotlib is a Python visualization library that can be used for
generating two-dimensional (2D) and three-dimensional (3D) plots. Built on
the NumPy library, it can be used in Python scripts, IPython
shells, Jupyter notebooks, and Web application servers. Its
toolkit includes an array of graph and plot types such as scatter, line, bar,
power spectra, error charts, and histogram.

Pyplot, an essential component of Matplotlib, simplifies the process of creating
plots by offering functions to manage line styles, font characteristics, and
axis formatting.

Plot styles offered by Pyplot are:

● Line
● Histogram
● Scatter
● 3D
● Image
● Contour
● Polar

12.2.2 Seaborn

Seaborn stands out as an exceptional Python library tailored for the graphical
representation of statistical data. Seaborn offers an array of color palettes and
visually pleasing styles and helps to create enhanced statistical plots in Python.

Furthermore, Seaborn integrates seamlessly with Pandas data structures. This
makes it easy to switch between different visual representations of a
variable, which improves comprehension of the supplied dataset.

Widely embraced for data science and machine learning, the Python Seaborn
library builds upon the visualization capabilities of Matplotlib.

12.2.3 Working with Matplotlib

Matplotlib is an open-source Python library that can be installed using a
Python package manager such as pip. To install Matplotlib, type the following
command in the Command Prompt:
pip install matplotlib

The command gets executed as shown in Figure 12.1.



Figure 12.1: Installation of Matplotlib Using pip
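
One simple way to confirm that the installation worked is to import the library and print its version. This is a minimal sketch; the version number printed depends on what pip installed.

```python
import matplotlib

# A successful import and a printed version string indicate
# that Matplotlib is available to the Python interpreter.
print(matplotlib.__version__)
```

If the import fails with a ModuleNotFoundError, Matplotlib is not installed for the Python interpreter being used.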

For a given set of data points, Matplotlib provides the option to create various
types of plots. You can create a line plot, custom marker line plot, plot with
custom markers and line styles, or an advanced line plot with labels and a grid.

Basic Line Plot

Let us generate a basic line plot for a set of data points. Consider that the
scores of five students—Adam, Richard, William, Emy, and Linda—in a test
are 86, 90, 79, 78, and 96, respectively. Code Snippet 1 generates a basic line
plot for this data.

Code Snippet 1:

import matplotlib.pyplot as plt
import numpy as np

x_pt = np.array(['Adam', 'Richard', 'William', 'Emy', 'Linda'])
y_pt = np.array([86, 90, 79, 78, 96])
plt.plot(x_pt, y_pt)
plt.show()

In Code Snippet 1:

 The import statements import the Matplotlib and NumPy libraries
into the application.
 The x_pt and y_pt statements store the data for the x-axis and y-axis
in NumPy arrays.
 The plot function is the fundamental function used for drawing
points or markers in a diagram. It helps to create a wide range of 2D
plots.
 The syntax of the plot function is:

plot([x], y, [fmt], [**kwargs])

 x and y denote horizontal and vertical axis coordinates. The x-values are
optional. If they are omitted, Matplotlib automatically uses the indices
0, 1, 2, and so on as the x-values.
 fmt defines basic formatting such as marker style, line style, and
color.
 kwargs can be used to provide additional keyword arguments to
further customize the plot.
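
The parts of this syntax can be seen together in a short example. The following is a minimal sketch that reuses the five scores from Code Snippet 1, omits the x-values, and assumes an fmt string and a linewidth keyword argument chosen only for illustration.

```python
import matplotlib.pyplot as plt

scores = [86, 90, 79, 78, 96]

# Only y-values are passed, so Matplotlib supplies 0, 1, 2, ... as x-values.
# 'o--r' is an fmt string: circle markers, dashed line, red color.
# linewidth is an additional keyword argument (part of **kwargs).
(line,) = plt.plot(scores, 'o--r', linewidth=2)

# Show the x-values that Matplotlib generated automatically.
print(list(line.get_xdata()))
```

When running this as a script, add plt.show() at the end to display the figure.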

Figure 12.2 displays the output of the code in Code Snippet 1. The code will
generate a basic line plot that connects the given names on the x-axis with
their corresponding scores on the y-axis. It helps to visualize the relationship
between the names and the scores.

Figure 12.2: Basic Line Plot

Customize a Plot

You can customize various aspects of a line plot using both the fmt string and
keyword arguments in Matplotlib.



Custom Marker Line Plot

Markers are shapes or symbols that highlight the data points in a plot. The
marker style can be set using the marker keyword argument or the marker
character in the fmt string.

Table 12.1 shows the available marker references.

Marker    Description
o         Circle
*         Star
.         Point
,         Pixel
x         X
X         X (filled)
+         Plus

Table 12.1: Marker References

Let us generate a custom marker line plot for the same set of data points used
in Code Snippet 1. The code in Code Snippet 2 generates a line plot for this
data with an asterisk (*) as the marker.

Code Snippet 2:

import matplotlib.pyplot as plt
import numpy as np

x_pt = np.array(['Adam', 'Richard', 'William', 'Emy', 'Linda'])
y_pt = np.array([86, 90, 79, 78, 96])
plt.plot(x_pt, y_pt, marker = '*')
plt.show()

In Code Snippet 2, the marker = '*' keyword argument sets the marker style to
asterisk. Figure 12.3 displays the output of the code.



Figure 12.3: Asterisk Marker Line Plot

Custom Line Style Plot

Line style refers to the appearance of the line that connects the data points in
a plot. Line style can be customized by setting the type of line or changing the
color of line in the fmt string. Table 12.2 shows the available options for type of
line.

Type of Line    Description
-               Solid line
:               Dotted line
--              Dashed line
-.              Dashed/dotted line

Table 12.2: Type of Line

Table 12.3 shows the available options for color of line.

Color Syntax    Description
r               Red
g               Green
b               Blue
c               Cyan
m               Magenta
y               Yellow
k               Black
w               White

Table 12.3: Color of Line

Let us generate a plot with customized marker and line for the same data
points used in the previous sections. The code in Code Snippet 3 generates a
line plot for this data with a circular green marker and dotted green line.

Code Snippet 3:

import matplotlib.pyplot as plt
import numpy as np

x_pt = np.array(['Adam', 'Richard', 'William', 'Emy', 'Linda'])
y_pt = np.array([86, 90, 79, 78, 96])
plt.plot(x_pt, y_pt, 'o:g')
plt.show()

In Code Snippet 3, the marker and line style are customized using 'o:g'. Here,
'o' sets the marker style to circle, ':' sets the line style to dotted line, and 'g'
sets the color to green. Figure 12.4 displays the output of the code.

Figure 12.4: Custom Marker and Line Style Plot



Additionally, you can specify the marker size and marker edge color.
 The size of the markers can be specified using the markersize
keyword argument or its abbreviated version, ms.
 The color of the markers' edges can be designated using the
markeredgecolor keyword argument or its abbreviated form, mec.

Let us generate a plot with a customized marker and line for the data points
(2, 3) and (6, 8). Suppose that you want green asterisk markers of size 20 with
red edges, connected by a green dotted line. The code in Code Snippet 4
generates a line plot with the specified customizations.

Code Snippet 4:

import matplotlib.pyplot as plt
import numpy as np

x_pt = np.array([2, 6])
y_pt = np.array([3, 8])

plt.plot(x_pt, y_pt, '*:g', ms = 20, mec = 'r')
plt.show()

In Code Snippet 4, ms = 20 specifies that the marker size is 20 and mec = 'r'
specifies that the marker edge color is red. Figure 12.5 shows the output
of the code.

Figure 12.5: Custom Marker Size and Edge Color



Line Plot with Labels, Titles, and Grid

Pyplot provides options for enhancing your plots. The options include:

● The xlabel and ylabel functions assign labels to the x-axis and
y-axis.
● The title function sets a title for your plot.
● The grid function adds grid lines to your plot to enhance visual
clarity.

Let us generate a plot where student names are plotted on the x-axis and their
respective marks on the y-axis. The data points are represented using
diamond-shaped markers, connected by dashed lines in red color. The plot is
enhanced with labels for x and y axes, a title indicating Marksheet and grid
lines for improved visual clarity. The code in Code Snippet 5 generates a plot
with these specifications.

Code Snippet 5:

import matplotlib.pyplot as plt
import numpy as np

x = np.array(['Adam', 'Richard', 'William', 'Emy', 'Linda'])
y = np.array([86, 90, 79, 78, 96])

plt.plot(x, y, 'D--r')

plt.xlabel("Students")
plt.ylabel("Marks")
plt.title("Marksheet")

plt.grid()

plt.show()

In Code Snippet 5:
 The plt.xlabel("Students") statement labels the x-axis as 'Students'.
 The plt.ylabel("Marks") statement labels the y-axis as 'Marks'.
 The plt.title("Marksheet") statement sets the title of the plot to
'Marksheet'.
 The plt.grid() statement adds grid lines to the plot.

Figure 12.6 displays the output of the code.

Figure 12.6: Line Plot with Labels, Titles, and Grid

12.3 Data Visualization in Python Using Matplotlib

Charts are one of the most common tools used for data visualization. You can
create a variety of charts in Python using Matplotlib. Download the
Product_sales.csv file from Onlinevarsity. Refer to this file for the examples
on various charts.

12.3.1 Line Charts

Line charts are used to represent the relation between two datasets, X and Y,
on different axes.

Let us compare the sale prices and marked prices of the leading five brands of
a product with high star ratings. You can plot two lines on the same chart to
facilitate a comparative analysis of their prices. Code Snippet 6 lists the code
for creating a line chart to compare the sale prices and marked prices of the
brands.



Code Snippet 6:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Product_sales.csv')
dp = df.head()
brand = dp["Brands"]
saleprice = dp['Sale Price in $']
mrpprice = dp['Marked Price in $']
plt.xlabel("Brands")
plt.ylabel("Price in $")
plt.plot(brand, saleprice, marker = '*', label = "Sale Price in $")
plt.plot(brand, mrpprice, marker = 'D', label = "Marked Price in $")
plt.legend()
plt.show()

Figure 12.7 displays the output of the code.

In Code Snippet 6:

 Data is taken from a CSV file named Product_sales.csv.
 The code reads the CSV file, extracts the first five rows using the head
function, and focuses on the Brands, Sale Price in $, and Marked Price
in $ columns.
 The code then creates a line plot with brand names on the x-axis and
prices in dollars on the y-axis. The sale prices are marked with asterisk
markers, while the marked prices are marked with diamond markers.
Both lines are labeled accordingly. The plot includes a legend.



Figure 12.7: Line Chart
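
The link between the label argument and the legend can be checked without the CSV file. The following is a minimal sketch that assumes three made-up brands and prices; none of the values come from Product_sales.csv.

```python
import matplotlib.pyplot as plt

# Hypothetical brands and prices, used only for illustration.
brands = ['Brand A', 'Brand B', 'Brand C']
sale_price = [40, 55, 30]
marked_price = [50, 60, 45]

plt.figure()  # start a new, empty figure
plt.plot(brands, sale_price, marker='*', label='Sale Price in $')
plt.plot(brands, marked_price, marker='D', label='Marked Price in $')
plt.xlabel('Brands')
plt.ylabel('Price in $')
plt.legend()  # builds the legend from the label of each line

# Read back the labels that the legend uses.
handles, labels = plt.gca().get_legend_handles_labels()
print(labels)
```

A line plotted without a label argument does not get an entry in the legend, so each series that should appear there requires its own label.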

12.3.2 Bar Chart

Primarily, a bar chart is used to illustrate the relationship between numeric and
categorical values. A bar graph serves to contrast discrete categories. One
axis of the chart lists the categories and the other axis represents the
corresponding value or count for each category.

A bar chart, also referred to as a bar graph, presents categorical data using
rectangular bars. The charts can be created in vertical or horizontal
orientations. The dimensions of the bars including heights or lengths are
proportional to the values that they represent.

Let us generate a bar chart to compare the discount percentage of each
brand of products. Code Snippet 7 lists the code for creating a bar chart.



Code Snippet 7:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Product_sales.csv')
df['Discount'] = (df['Marked Price in $'] - df['Sale Price in $']) * 100 / df['Marked Price in $']
x_pt = df["Brands"]
y_pt = df['Discount']
plt.barh(x_pt, y_pt)
plt.xlabel("Discount %")
plt.ylabel("Brands")
plt.show()

Figure 12.8 displays the output of the code.

In Code Snippet 7:
 Data is taken from a CSV file named Product_sales.csv.
 The code reads the CSV file, calculates the percentage discount for
each product, and then generates a horizontal bar chart using the
plt.barh function.
 The y-axis represents the different brands of products, the x-axis indicates
the percentage discount, and the length of each bar corresponds to a
brand's discount percentage.

Figure 12.8: Bar Chart
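
The discount calculation can be worked through with small numbers. The following is a minimal sketch that assumes three made-up brands and prices; the values are illustrative and are not taken from Product_sales.csv.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical prices, used only for illustration.
df = pd.DataFrame({
    'Brands': ['Brand A', 'Brand B', 'Brand C'],
    'Marked Price in $': [100.0, 80.0, 50.0],
    'Sale Price in $': [80.0, 60.0, 45.0],
})

# Discount % = (marked price - sale price) * 100 / marked price
df['Discount'] = (
    (df['Marked Price in $'] - df['Sale Price in $']) * 100 / df['Marked Price in $']
)
print(df['Discount'].tolist())  # 20%, 25%, and 10% discount

# barh draws horizontal bars: brands on the y-axis, discount on the x-axis.
plt.barh(df['Brands'], df['Discount'])
plt.xlabel('Discount %')
plt.ylabel('Brands')
```

For example, Brand A is marked at $100 and sold at $80, so its discount is (100 - 80) * 100 / 100 = 20 percent.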



12.3.3 Scatter Plot

Scatter plots illustrate the relationships between two variables. They use dots to
plot and represent various data points. These plots are particularly useful for
showcasing relationships between numeric variables, as they position data
points along both horizontal and vertical axes. This provides insight into the
degree of influence one variable exerts on another.

Let us generate a scatter plot to show the relationship between the sale price
and the marked price of each brand of products. Code Snippet 8 shows the
code for creating a scatter plot.

Code Snippet 8:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('Product_sales.csv')
saleprice = df['Sale Price in $']
mrpprice = df['Marked Price in $']
plt.xlabel("Marked Price in $")
plt.ylabel("Sale Price in $")
plt.grid()
# The first argument is plotted on the x-axis, so it must match the x-axis label.
plt.scatter(mrpprice, saleprice)
plt.show()

Figure 12.9 displays the output of the code.

In Code Snippet 8:
 Data is taken from a CSV file named Product_sales.csv.
 The code reads the CSV file and creates a scatter plot to visualize the
relationship between Sale Price in $ and Marked Price in $ values.
 The code adds labels and gridlines for a better understanding of the
data distribution.



Figure 12.9: Scatter Plot
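
In a scatter plot, the order of the arguments decides which variable goes on which axis. The following is a minimal sketch that assumes a few made-up prices and reads the plotted points back to show the (x, y) pairs.

```python
import matplotlib.pyplot as plt

# Hypothetical prices, used only for illustration.
marked_price = [50, 60, 45, 120]
sale_price = [40, 55, 30, 99]

# The first argument goes on the x-axis and the second on the y-axis,
# so the axis labels must follow the same order.
points = plt.scatter(marked_price, sale_price)
plt.xlabel('Marked Price in $')
plt.ylabel('Sale Price in $')
plt.grid()

# Each row of get_offsets() is one (x, y) point.
print(points.get_offsets().tolist())
```

Swapping the two arguments without swapping the labels would produce a plot whose axes are labeled incorrectly.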

12.3.4 Histogram

A histogram is a kind of bar graph that provides an approximate depiction of
the distribution pattern within numerical data. It provides an estimate of the
probability distribution of a continuous variable.

Steps to generate a histogram are:

Bins are defined as consecutive, non-overlapping ranges of a given variable.



You can use the matplotlib.pyplot.hist function to plot histograms. This
function undertakes the computation and visual representation of the
histogram based on the input data x.

Table 12.4 displays the parameters for the matplotlib.pyplot.hist function.

Parameter    Description
x            The input data, given as an array or a sequence of arrays
bins         Optional parameter that can be an integer, a sequence,
             or 'auto'

Table 12.4: Parameters of a Histogram

Let us generate a histogram to show the Star Ratings obtained by brands
grouped under specific ranges. The bins are defined as a sequence:
4.0-4.1, 4.1-4.2, 4.2-4.3, 4.3-4.4, 4.4-4.5, 4.5-4.6, 4.6-4.7, 4.7-4.8,
and 4.8-4.9. Code Snippet 9 lists the code for creating a histogram.

Code Snippet 9:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Product_sales.csv')
plt.xlabel("Star Ratings Range")
plt.ylabel("Count")
plt.hist(df['Star Ratings'],bins =
[4,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8,4.9])

Figure 12.10 displays the output of the code.

In Code Snippet 9:
 Data is taken from a CSV file named Product_sales.csv.
 The code reads the CSV file. It takes the data in the Star Ratings
column from the DataFrame and divides it into specified bins, ranging
from 4 to 4.9 with small increments (0.1).
 The histogram depicts the frequency or count of star ratings falling within
each bin.



Figure 12.10: Histogram
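
The hist function also returns the counts that it draws, which makes the binning easy to inspect. The following is a minimal sketch that assumes seven made-up star ratings and three wider bins; the values are illustrative and are not taken from Product_sales.csv.

```python
import matplotlib.pyplot as plt

# Hypothetical star ratings, used only for illustration.
ratings = [4.0, 4.1, 4.1, 4.3, 4.5, 4.5, 4.5]
bin_edges = [4.0, 4.2, 4.4, 4.6]  # three bins: 4.0-4.2, 4.2-4.4, 4.4-4.6

# hist returns the count in each bin, the bin edges, and the drawn bars.
counts, edges, bars = plt.hist(ratings, bins=bin_edges)
plt.xlabel('Star Ratings Range')
plt.ylabel('Count')

print(counts.tolist())  # counts per bin: 3, 1, and 3
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both edges.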



12.4 Summary

⮚ Data visualization involves creating visual representations such as graphs
and charts to present data insights clearly.
⮚ Matplotlib is a versatile plotting library integrated with Python, IPython,
Jupyter, and Web servers. It supports various plot styles, including line,
histogram, scatter, 3D, image, contour, and polar.
⮚ Seaborn enhances the visual appeal of statistical data representation
using color palettes and styles to create aesthetically pleasing plots.
⮚ Matplotlib can be installed using package managers such as pip or conda.
⮚ Different plot types can be generated, such as basic line plots, custom
marker line plots, and advanced plots with labels, titles, and gridlines.
⮚ Matplotlib offers various types of data visualization, including Line Charts,
Bar Charts, Scatter Plots, and Histograms.
⮚ Line Charts display relationships between two datasets using lines.
⮚ Bar Charts represent relationships between numeric and categorical
values using bars.
⮚ Scatter Plots illustrate relationships between variables using dots.
⮚ Histograms depict the distribution pattern of continuous variables using
bins.



Test Your Knowledge

1. You have a dataset containing the monthly temperature data of
different cities. You want to visualize the temperature trends over the
months using a line chart. Which function from Matplotlib would you use
to plot the data?

a. plt.plot
b. plt.scatter
c. plt.bar
d. plt.hist

2. You are analyzing the performance of students in a school and want to
visualize the average scores of different subjects. The following Python
code snippet is used to create the expected visualization:

import pandas as pd
import matplotlib.pyplot as plt
data = {
    'Subjects': ['Math', 'Science', 'History', 'English', 'Geography'],
    'Average_Score': [78, 92, 85, 75, 88]
}
df = pd.DataFrame(data)
x_pts = df["Subjects"]
y_pts = df['Average_Score']
plt.barh(x_pts, y_pts, color='skyblue')
plt.xlabel("Average Score")
plt.ylabel("Subjects")
plt.title("Average Scores of Students")
plt.show()

What type of visualization is being created with this code?

a. Scatter Plot
b. Histogram
c. Line Chart
d. Horizontal Bar Chart



3. You are analyzing the correlation between the age and height of a
group of individuals. To visualize this relationship, you decide to create a
scatter plot using Python. Which of the following options correctly
completes the given Code Snippet to create the scatter plot?

import pandas as pd
import matplotlib.pyplot as plt
data = {
    'Age': [25, 30, 35, 40, 45],
    'Height': [160, 170, 165, 175, 180]
}
data_df = pd.DataFrame(data)
# Missing code: Complete the code to create a scatter plot
# Display the scatter plot
plt.show()

a. plt.scatter(data_df['Height'], data_df['Age'])
plt.xlabel("Height (cm)")
plt.ylabel("Age (years)")
plt.title("Height vs Age")

b. plt.plot(data_df['Age'], data_df['Height'])
plt.xlabel("Age (years)")
plt.ylabel("Height (cm)")
plt.title("Height vs Age")

c. plt.scatter(data_df['Age'], data_df['Height'])
plt.xlabel("Age (years)")
plt.ylabel("Height (cm)")
plt.title("Age vs Height")

d. plt.plot(data_df['Height'], data_df['Age'])
plt.xlabel("Height (cm)")
plt.ylabel("Age (years)")
plt.title("Height vs Age")



4. You are analyzing the age of marathon participants using a dataset. To
represent the age distribution, you plan to use a histogram with
consecutive, non-overlapping age intervals. What is the purpose of a
histogram in this context?

a. Displaying gender distribution
b. Representing ages with a pie chart
c. Estimating age intervals
d. Visualizing age distribution

5. You want to customize your line plot with circle data points connected
using dotted/dashed lines. Which of the following options specifies the
correct fmt string to plot that line in blue color?

a. 'c-:b'
b. 'c-:B'
c. 'o-.b'
d. 'o-.B'



Answers to Test Your Knowledge

1 a
2 d
3 c
4 d
5 c



Try it Yourself

1. You have been tasked with analyzing a dataset containing information
about hospital workers, including their positions and salaries. Your goal is
to visualize the distribution of salaries among different positions using
Pandas and Matplotlib libraries in Python and display them in a bar
chart. Create a file named Hospital_worker_data.csv using the dataset
given in Table 12.5.

Table 12.5: Data for Hospital_worker_data.csv

a. Import the Libraries.
b. Load the Dataset from the CSV file.
c. Display the first few rows of the dataset.
d. Extract the Position and Salary columns and store them in
variables named positions and salaries, respectively.
e. Create a bar chart.
f. Choose the blue color for the bars.
g. Provide labels to the x-axis as Position and y-axis as Salary.
Add a title to the plot as Salary Distribution of Hospital
Workers by Position.
h. Ensure proper spacing and layout of the plot elements.
i. Display the Plot.

2. You have a dataset containing information about several fashion
brands, their top sales products, and the colors associated with those
products. Your task is to generate a line chart that displays the top sales
product for each brand. The top sales product of each brand will be
represented as a marker on the chart and the marker color will
correspond to the color of the product. Create the Fashion_sales.csv
file using the dataset given in Table 12.6.



Table 12.6: Data for Fashion_sales.csv

a. Import the Libraries.
b. Load the Dataset from the CSV file.
c. Create a line chart with Brand on the x-axis and Top Sales
Product and Color on the y-axis.
d. Represent the Top Sales Product data using * markers on the
line chart.
e. Display the Color data using D markers on the same line chart.
f. Provide a label for each data series to differentiate between Top
Sales Product and Color.
g. Add a legend to the chart.
h. Set appropriate labels for the x-axis (Brands) and y-axis (Top
Sales Product and Color).
i. Include a title for the chart as Top sales products and colors
for trending fashion brands.
j. Ensure that all elements of the chart are legible and arranged
properly.

3. You work for a furniture retail company that offers a wide range of
furniture products such as tables, chairs, sofas, and cabinets. The
company has a dataset named Furniture_sales.csv containing
information about the marked price of different furniture items. Create
the Furniture_sales.csv file using the dataset given in Table 12.7.

Table 12.7: Data for Furniture_sales.csv

a. Import the Libraries.
b. Load the Dataset from the CSV file.
c. Extract the Furniture and Price in $ columns from the CSV file.
d. Create a scatter plot with Furniture on the x-axis and Price in
$ on the y-axis.
e. Set appropriate labels for the x-axis and y-axis.
f. Add a title to the plot as Price of the furniture items.
g. Display the data points using red circles.
h. Display a legend on the plot with a label.
i. Display a grid.
j. Display the plot.



Learning Objectives

In this session, students will learn to:

⮚ Describe machine learning
⮚ Explain the importance of various data processing techniques
⮚ Describe the Python libraries to process data for machine learning

Machine learning (ML) is a branch of Artificial Intelligence. A regular computer
program tells a computer what to do, whereas ML allows computers to learn
from data by employing ML algorithms. ML algorithms are mathematical
methods that can find patterns in the data. Various ML models are created by
using ML algorithms. These ML models are used to analyze large amounts of
data, identify patterns, and make predictions or decisions based on the data.
As data is the crux of any ML model, ensuring data quality is crucial for the
performance and accuracy of these models. Therefore, data processing,
which involves sourcing, cleaning, transforming, and formatting the data to
make it suitable for analysis, is an essential step in ML. In this session, you will
learn about the data processing techniques in ML using Python.



13.1 Data Processing Techniques in ML

Data collected from different sources, such as surveys or Web pages, is raw
and comes in different formats, such as text, numbers, and images. For
example, suppose you want to analyze the academic performance of
students in a particular university. In that case, you may have to collect raw
data from various sources, such as faculty feedback and performance reports
of students. The raw data can be in different formats such as Portable
Document Format (PDF) files and online forms. Also, the raw data may contain
missing values, errors, inconsistencies, and duplicate values, which make the
data difficult to comprehend. To make the raw data useful for analysis, you
must process it by:
 Filling or dropping the missing values.
 Removing the errors, inconsistencies, and duplicate values.
 Filtering or grouping the data by relevant criteria.
 Formatting the data into one single format.

Thus, data processing is the technique of converting raw data into a
meaningful and useful form for ease and accuracy of analysis and processing
by ML models.

Steps involved in data processing are:



13.2 Loading and Exploring the Dataset for ML with Python

Loading and exploring the dataset is the preliminary step performed in ML. This
step is important because it helps to:
 Discover the distribution and trends of the data.
 Identify the shape or structure of the data (such as number of rows and
columns it has).
 View information about the data.
 Identify missing values.
 Identify duplicates.
 Identify categorical and numerical columns.
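
Several of these checks can be carried out with a few Pandas calls. The following is a minimal sketch that assumes a small made-up DataFrame; the column names Gender, Hours_Attended, and Passed are hypothetical and are not taken from any course file.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    'Gender': ['F', 'M', 'M', 'F'],
    'Hours_Attended': [30.0, 25.0, 25.0, None],
    'Passed': ['Yes', 'No', 'No', 'Yes'],
})

missing_per_column = df.isnull().sum()        # missing values in each column
duplicate_rows = int(df.duplicated().sum())   # rows that repeat an earlier row
numeric_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column.to_dict())
print(duplicate_rows)
print(numeric_cols, categorical_cols)
```

Here, the third row repeats the second row, so one duplicate is reported, and the Hours_Attended column has one missing value.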

To load and explore data for ML, you can use Pandas. For example, let us
consider a dataset that contains the gender-wise academic performance of
students in various courses. The data is based on the number of hours the
students attended each course and the number of hours the students
prepared for each course. Download the student_success.csv dataset file
present under Course Files on OnlineVarsity. Figure 13.1 shows the dataset in
student_success.csv opened in Excel. To analyze this dataset, you must
load the dataset into a Pandas DataFrame and use the functions available in
the Pandas library to perform the analysis.



Figure 13.1: student_success.csv File in Excel

Code Snippet 1 imports the pandas library and loads the data in
student_success.csv into a DataFrame object called df. Before running the
code, upload the downloaded student_success.csv file to the current working
directory. The last line of the code evaluates df so that the DataFrame is
displayed on the standard output device.

Code Snippet 1:

import pandas as pd
df=pd.read_csv('student_success.csv')
df

Figure 13.2 shows the DataFrame df printed on the screen. Note that the
output is displayed in a readable and formatted way. This can help in
inspecting and exploring the data visually to identify any issues or patterns in
the data.
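The loading step can also be tried without the course file. The following is a minimal sketch that reads a few rows from an in-memory string. The rows are made up for illustration and are not the contents of student_success.csv; only the column names follow the ones used in this session, and sample_df is a hypothetical variable name.

```python
import io
import pandas as pd

# Hypothetical rows for illustration only; not the contents of student_success.csv.
csv_text = """Gender,Courses,Hours_attended,Hours_prepared,Success_exam
Male,Arts,20,10,Y
Female,Science,25,,Yes
Male,Commerce,,8,N
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns only the first n rows, which is useful for large files.
print(sample_df.head(2))
print(sample_df.shape)
```

Empty fields in the text are read as NaN, which is how missing values appear in the DataFrame.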



Figure 13.2: student_success.csv File Loaded into Dataframe df

13.2.1 Identifying the Shape or Structure of the Data

After loading the dataset into a DataFrame, you can identify the shape or
dimension of the dataset by using the pandas.DataFrame.shape property.
This property returns the number of rows and columns of the dataset
representing its dimensionality. The code to display the size of a DataFrame is:

df.shape

Figure 13.3 shows the output of the code displaying a tuple containing the
number of rows and number of columns in the dataset in df. The output
indicates that the DataFrame df has 21 rows and 5 columns.



Figure 13.3: Dimension of DataFrame df

13.2.2 Viewing Information About the Dataset

To view brief information about the dataset, you can use the
pandas.DataFrame.info method. This method prints a concise summary of
the DataFrame. The summary first lists the range index and the number of
columns of the DataFrame. It then presents the count of non-null values in each
column and the data type of each column in a table format. Finally, the
summary includes the memory usage of the DataFrame. The code to
summarize the DataFrame is:

df.info()

Figure 13.4 shows the output of the code displaying information about the
dataset stored in the DataFrame df.

Figure 13.4: Concise Summary of DataFrame df

13.2.3 Identifying Missing Values

To improve the quality and accuracy of the data for ML, it is important to
identify missing values in the data. This will give an understanding of how the
missing values are distributed and related to other variables. The sum method
can be used on the result of the isnull method to display the number of
missing values in each column of the DataFrame df. The return value of the
sum method is a Series object containing the column names and
corresponding counts of null values.



The code to find the number of null values in each column is:

df.isnull().sum()

Figure 13.5 shows the output of the code. The output indicates that there are
two null values in each of the columns, Hours_attended and
Hours_prepared, respectively. There are three null values in Success_exam.
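Besides the count per column, it can help to see the share of missing values and the affected rows. The following is a minimal sketch on made-up data; sample_df and the values in it are assumptions for illustration and are not taken from student_success.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    'Hours_attended': [20.0, np.nan, 25.0, 18.0],
    'Success_exam': ['Y', None, None, 'N'],
})

null_counts = sample_df.isnull().sum()    # number of nulls per column
null_share = sample_df.isnull().mean()    # fraction of nulls per column
# Rows that contain at least one null value in any column.
rows_with_nulls = sample_df[sample_df.isnull().any(axis=1)]

print(null_counts)
print(null_share)
print(rows_with_nulls)
```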

Figure 13.5: Count of Null Values in DataFrame df

13.2.4 Identifying Duplicated Values

It is necessary to check for duplicated values in datasets to reduce their size
and complexity and to eliminate redundancy. To check for duplicated rows
in the DataFrame df, use the pandas.DataFrame.duplicated method. This
method returns a Boolean Series indicating whether each row in the dataset
is a duplicate or not. By default, for each set of duplicated rows, this method
marks the first occurrence as False and the subsequent occurrences as True.
The code to locate duplicates is:

df.duplicated()

Figure 13.6 shows the Boolean Series returned by the df.duplicated
method. The output indicates that the rows with indexes 19 and 20 have
duplicated values.
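The effect of the default behavior can be seen on a small example. The following sketch uses made-up rows (sample_df is a hypothetical name) and also shows the keep=False option of the duplicated method, which marks every member of a duplicate group rather than only the later occurrences.

```python
import pandas as pd

# Hypothetical data for illustration only: rows 0, 2, and 3 are identical.
sample_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male'],
    'Courses': ['Arts', 'Science', 'Arts', 'Arts'],
})

first_kept = sample_df.duplicated()            # default keep='first'
all_marked = sample_df.duplicated(keep=False)  # mark every row of a duplicate group

print(first_kept.tolist())
print(all_marked.tolist())
print(first_kept.sum())   # number of rows that drop_duplicates would remove
```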



Figure 13.6: Duplicated Rows

13.2.5 Identifying Categorical and Numerical Columns

Categorical data are data that can be classified into different groups based
on their features or attributes, such as gender, courses, and results. They are
textual in nature. The default type of text data in Pandas is object. Identifying
categorical data is necessary because they must be encoded or transformed
into numerical values. This will enable ML models to perform mathematical
operations on the data, learn from the data, and make predictions.

Code Snippet 2 finds the categorical and numerical columns in the DataFrame
df. The code uses a list comprehension to iterate over the columns in df and
assigns the list of columns whose dtype is object to the variable
categorical_col. It then iterates over the columns in df again and assigns the
list of columns that are not of object data type to the variable numerical_col.

Code Snippet 2:

categorical_col = [col for col in df.columns
                   if df[col].dtype == 'object']
print('Categorical columns are :', categorical_col)
numerical_col = [col for col in df.columns
                 if df[col].dtype != 'object']
print('Numerical columns are :', numerical_col)



Figure 13.7 shows the output of Code Snippet 2 displaying the list of categorical
columns and numerical columns in the DataFrame df.
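An alternative to comparing dtypes by hand is the pandas.DataFrame.select_dtypes method, which filters columns by data type. The following is a minimal sketch on made-up data; it is not part of Code Snippet 2, and sample_df, numerical_cols, and categorical_cols are hypothetical names.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'Courses': ['Arts', 'Science'],
    'Hours_attended': [20.0, 25.0],
    'Hours_prepared': [10.0, 12.0],
})

# 'number' matches all numeric dtypes (integers and floats).
numerical_cols = sample_df.select_dtypes(include='number').columns.tolist()
# Everything that is not numeric is treated as categorical in this sketch.
categorical_cols = sample_df.select_dtypes(exclude='number').columns.tolist()

print('Categorical columns are :', categorical_cols)
print('Numerical columns are :', numerical_cols)
```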

Figure 13.7: Columns in DataFrame df

13.2.6 Identifying Unique Data in Categorical Columns

Identifying unique data in categorical columns is necessary for choosing the
appropriate encoding method for the data. To find the count of unique values
in each of the categorical columns in the DataFrame df, use the nunique
method on df[categorical_col]. categorical_col is the variable used in
Code Snippet 2 that holds the list of categorical columns in df. The nunique
method returns a Series object containing the count of unique values in each
of the three columns: Gender, Courses, and Success_exam, with the column
name as the index. It excludes the NaN values by default. The code to count
the unique values is:

df[categorical_col].nunique()

Figure 13.8 shows the output of the code.

Figure 13.8: Number of Unique Values in df


Table 13.1 describes the possible values for each categorical column in the
DataFrame df.

Column         Number of Unique Values   Possible Values
Gender         2                         Male, Female
Courses        3                         Commerce, Science, Arts
Success_exam   4                         Yes, No, Y, N

Table 13.1: Possible Values in df Columns
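The possible values listed in such a table can be obtained with the unique and value_counts methods of a Series. The following sketch uses a made-up column with the same four spellings; the data and the name sample_df are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    'Success_exam': ['Y', 'Yes', 'N', 'No', None, 'Y'],
})

# Distinct non-null values in the column.
unique_values = sample_df['Success_exam'].dropna().unique()
# Frequency of each value; dropna=False also counts the missing entries.
counts = sample_df['Success_exam'].value_counts(dropna=False)

print(unique_values)
print(counts)
```

Seeing both Y and Yes in the list is a hint that the column needs to be made consistent before encoding.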



13.3 Data Cleaning

After loading and exploring the data, the next step is to clean the data. Some
of the tasks involved in data cleaning are:

 Removing Duplicates
 Modifying Data for Consistency
 Handling Outliers
 Handling Missing Data
 Splitting the Dataset into Dependent and Independent Variables

13.3.1 Removing Duplicates

To remove duplicates from the dataset in a DataFrame, you can use the
pandas.DataFrame.drop_duplicates method. This method returns the
DataFrame object after removing the duplicates.

Code Snippet 3 removes duplicates from the DataFrame df by calling the
df.drop_duplicates method. It then assigns the returned DataFrame back
to df and prints df to the standard output device.

Code Snippet 3:

df=df.drop_duplicates()
df

Figure 13.9 shows the output of Code Snippet 3. Note that the rows with indexes
19 and 20 are removed from the DataFrame df.



Figure 13.9: DataFrame df with Duplicates Removed

13.3.2 Modifying Data for Consistency

To make the data in a dataset consistent and usable for different ML models,
it is necessary to have a single representation for values in columns in a dataset.
For example, refer to the DataFrame df shown in Figure 13.9. The values in
column Success_exam in the DataFrame df represent the success or failure
status of students in various courses. However, this column has different kinds
of entries for indicating success and failure. For example, while the row with
index 0 has an entry Y for indicating success, the row with index 2 has the entry
as Yes for the same status. Similar is the case for the failure status too. The row
with index 1 has the entry as No for indicating failure, whereas the row with
index 4 has the entry as N for the same.

To ensure that the success and failure statuses are uniformly represented, Code
Snippet 4 replaces all Yes entries with Y and all No entries with N in column
Success_exam using df["Success_exam"].str.replace('Yes','Y') and
df["Success_exam"].str.replace('No','N'), respectively. The code
assigns the resultant Series objects back to df["Success_exam"]. It then
uses df["Success_exam"].fillna('Y') to fill the NaN entries with 'Y' and
assigns the result back to df["Success_exam"]. The result is assigned back
because calling fillna with inplace=True on a selected column may not update
df in recent versions of Pandas. The last line of the code evaluates df to
display the updated DataFrame on the standard output device.

Code Snippet 4:

df["Success_exam"] = df["Success_exam"].str.replace('Yes', 'Y')
df["Success_exam"] = df["Success_exam"].str.replace('No', 'N')
df["Success_exam"] = df["Success_exam"].fillna('Y')
df

Figure 13.10 shows the output of Code Snippet 4. Note that the DataFrame df
has the replaced and consistent set of values in Success_exam.
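Note that str.replace substitutes text inside each string, so it would also change a longer value that merely contains Yes or No. When only whole values should change, a dictionary passed to the Series.replace method is one option. The following is a minimal sketch on a made-up Series; status, mapping, and cleaned are hypothetical names.

```python
import pandas as pd

# Hypothetical values for illustration only.
status = pd.Series(['Y', 'Yes', 'No', 'N', None])

# Each key is matched against the whole cell value, not a substring.
mapping = {'Yes': 'Y', 'No': 'N'}
cleaned = status.replace(mapping)

print(cleaned.tolist())
```

The missing entry is left untouched here, so it can still be handled separately.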

Figure 13.10: Success_exam with Consistent Set of Values



13.3.3 Handling Outliers

Outliers are extreme values in a dataset that differ extensively from other data
points within that dataset. Outliers can distort the result of the statistical analysis
and thus might make accurate predictions difficult to arrive at. Therefore, they
should be identified and removed from the dataset. Outliers are generally
identified for numerical columns.

One of the ways to identify outliers is by using the Interquartile Range (IQR)
method. This method divides the sorted values of a column into four equal
parts by using three cut points called quartiles. The first quartile Q1 is the
25th percentile of the data, which is the middle number between the smallest
number and the median of the dataset. The second quartile Q2 is the 50th
percentile of the data, which is the median of the dataset. The third quartile
Q3, which is the 75th percentile of the data, is the middle value between the
median and the highest value of the dataset. The interquartile range is the
difference between the third quartile and the first quartile. The steps to identify
and remove the outliers are:

1. Calculate the IQR by finding the difference between the third quartile and
the first quartile.
IQR = Q3 − Q1
2. Calculate the lower bound by subtracting 1.5 times the IQR from Q1.
lower bound = Q1 − 1.5 × IQR
3. Calculate the upper bound by adding 1.5 times the IQR to Q3.
upper bound = Q3 + 1.5 × IQR
4. Identify the outliers, which are data points that are either greater than the
upper bound or less than the lower bound.
5. Remove the outliers.

To calculate the first quartile and the third quartile for a column in a
DataFrame, you can use the pandas.Series.quantile method. This method
calculates the value at a given quantile in a Series object or in a column of
a DataFrame.
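The arithmetic of the IQR method can be followed on a small made-up Series. The values below are assumptions chosen for illustration and are not taken from student_success.csv. Note that the quantile method uses linear interpolation by default, so Q1 and Q3 may fall between two data points.

```python
import pandas as pd

# Hypothetical values for illustration only; 180 is an obvious extreme value.
hours = pd.Series([10, 12, 14, 16, 18, 20, 22, 180])

q1 = hours.quantile(0.25)        # 13.5
q3 = hours.quantile(0.75)        # 20.5
iqr = q3 - q1                    # 7.0
lower_bound = q1 - 1.5 * iqr     # 13.5 - 10.5 = 3.0
upper_bound = q3 + 1.5 * iqr     # 20.5 + 10.5 = 31.0

# Keep only the values that fall outside the two bounds.
outlier_values = hours[(hours < lower_bound) | (hours > upper_bound)]

print(q1, q3, iqr, lower_bound, upper_bound)
print(outlier_values.tolist())
```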



Identifying and Removing Outliers in df

As the DataFrame df contains two numerical columns, Hours_attended and
Hours_prepared, it is necessary to find the outliers in these columns and
remove them if they are present.

Code Snippet 5 shows the code to calculate the IQR, lower bound, and upper
bound for the Hours_attended column. The code uses the
df['Hours_attended'].quantile method to calculate the Q1 value at the
0.25 quantile and the Q3 value at the 0.75 quantile for Hours_attended in df.
The code calculates the IQR by subtracting Q1 from Q3. It calculates the lower
bound and upper bound by using the respective formulas and prints these
values to the standard output device.

Code Snippet 5:

q1 = df['Hours_attended'].quantile(0.25)
q3 = df['Hours_attended'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print('Lower Bound=', lower_bound, 'Upper Bound=', upper_bound)

Figure 13.11 shows the output of Code Snippet 5 and displays the lower and
upper bounds for the column Hours_attended in df.

Figure 13.11: Bounds of the Hours_attended Column

Code Snippet 6 shows the code to remove the outliers from the column
Hours_attended in df. The code retrieves the rows in df that have
Hours_attended values that are either greater than the upper bound or less
than the lower bound by using the df.loc indexer. The code stores the
retrieved rows in a new DataFrame called outliers. It then calls the
df.drop(outliers.index) method to remove the rows in df that
have the same index as the rows in outliers. The drop method returns a new
DataFrame with the outliers removed, which the code assigns back to df. The
last line of the code evaluates df to display the DataFrame on the standard
output device.



Code Snippet 6:

outliers = df.loc[(df['Hours_attended'] < lower_bound) |
                  (df['Hours_attended'] > upper_bound)]
df = df.drop(outliers.index)
df

Figure 13.12 shows the output of Code Snippet 6. Note that the output shows
the DataFrame df with the row with index 22 removed, which had an outlier
value of 180 in the column Hours_attended.

Figure 13.12: Outliers Removed from Hours_attended

Code Snippet 7 shows the code to calculate the IQR, upper bound, and lower
bound for the column Hours_prepared in df. The code uses the
df['Hours_prepared'].quantile method to calculate the Q1 value at the 0.25
quantile and the Q3 value at the 0.75 quantile for Hours_prepared in df. The
code calculates the IQR by subtracting Q1 from Q3. It calculates the lower bound
and upper bound and prints these values to the standard output device.



Code Snippet 7:

q1 = df['Hours_prepared'].quantile(0.25)
q3 = df['Hours_prepared'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print('Lower bound for hours prepared=', lower_bound,
      'Upper bound for hours prepared=', upper_bound)

Figure 13.13 shows the lower bound and upper bound for Hours_prepared.

Figure 13.13: Bounds for Hours_prepared Column

Code Snippet 8 shows the code to remove the outliers from the column
Hours_prepared in df. The code retrieves the rows in df that have
Hours_prepared values that are either greater than the upper bound or less
than the lower bound by using the df.loc indexer. The code stores the
retrieved rows in a new DataFrame called outliers. It then calls the
df.drop(outliers.index) method to remove the rows in df that
have the same index as the rows in outliers. The drop method returns a new
DataFrame with the outliers removed, which the code assigns back to df. The
last line of the code evaluates df to display the DataFrame on the standard
output device.

Code Snippet 8:

outliers = df.loc[(df['Hours_prepared'] < lower_bound) |
                  (df['Hours_prepared'] > upper_bound)]
df = df.drop(outliers.index)
df

Figure 13.14 shows the output of Code Snippet 8. Note that the output shows
the DataFrame df with the same number of original rows. This indicates that
the column Hours_prepared did not have any outliers.
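Because the same steps are repeated for each numerical column, they can be wrapped in a small helper function. The following is a minimal sketch, not the method used in the code snippets of this session: remove_iqr_outliers is a hypothetical function name, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(frame, column, factor=1.5):
    """Return a copy of frame without the rows whose value in column
    lies outside the IQR bounds."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr
    is_outlier = (frame[column] < lower_bound) | (frame[column] > upper_bound)
    # Comparisons with NaN are False, so rows with missing values are kept
    # and can be handled in a later step.
    return frame[~is_outlier].copy()

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({
    'Hours_attended': [10, 12, 14, 16, 18, 20, 22, 180, np.nan],
    'Hours_prepared': [3, 4, 5, 6, 7, 8, 9, 10, 11],
})

for col in ['Hours_attended', 'Hours_prepared']:
    sample_df = remove_iqr_outliers(sample_df, col)

print(sample_df)
```

In this sketch, only the row with 180 hours attended is removed; the row with the missing value remains.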



Figure 13.14: Same Number of Original Rows

13.3.4 Handling Missing Values

Missing data in a dataset can be handled by either removing the missing data
or replacing the data with some other values. The replacement values can be
a string or numerical constant, or a derived value, such as the mean, mode, or
median of the column.

Replacing Missing Values in df

There are two null values in each of the columns, Hours_attended, and
Hours_prepared. There were three null values in Success_exam. Code
Snippet 4 replaced the null values in Success_exam with Y to make the column
values consistent.

Code Snippet 9 shows the code to replace the null values in Hours_attended
and Hours_prepared. The code replaces the NaN values in these columns with
the mean of the respective column. The code calls
df['Hours_attended'].fillna(df['Hours_attended'].mean()) to
replace the missing values in the Hours_attended column with the mean
value of that column and assigns the result back to the column. Similarly, it
calls df['Hours_prepared'].fillna(df['Hours_prepared'].mean()) to
replace the missing values in the Hours_prepared column with the mean
value of that column. The results are assigned back because calling fillna
with inplace=True on a selected column may not update df in recent versions
of Pandas.

Code Snippet 9:

df['Hours_attended'] = df['Hours_attended'].fillna(
    df['Hours_attended'].mean())
df['Hours_prepared'] = df['Hours_prepared'].fillna(
    df['Hours_prepared'].mean())
df

Figure 13.15 shows the output of Code Snippet 9. Note that the rows with NaN
entries in Hours_attended and Hours_prepared are replaced with the mean
values of the respective column data.
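The choice between the mean and the median as a replacement value matters when a column contains extreme values. The following is a minimal sketch on a made-up Series; the values and the names hours, filled_mean, and filled_median are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; 100.0 pulls the mean upwards.
hours = pd.Series([10.0, 20.0, np.nan, 30.0, 100.0])

# mean() and median() skip NaN values by default.
filled_mean = hours.fillna(hours.mean())      # mean of 10, 20, 30, 100 is 40.0
filled_median = hours.fillna(hours.median())  # median of 10, 20, 30, 100 is 25.0

print(filled_mean.tolist())
print(filled_median.tolist())
```

The median is less affected by the extreme value, which is one reason outliers are handled before the missing values are filled with the mean.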

Figure 13.15: Missing Values Replaced with the Mean Values



13.3.5 Splitting the Dataset into Independent and Dependent Variables

One of the important aspects in ML is to identify the dependent and
independent variables to establish the logical connection between the cause
and the effect of an event. The dependent variables represent the target
variable, and the independent variables represent the input variables. The
independent or input variables are manipulated to see their effect on the
dependent variable. The dependent or target variables are observed to
examine how they respond to the change in the independent variable. Thus,
the independent variables are the presumed causes that are used as input for
the ML model. The dependent variables are the presumed effects that are
used as output for the ML model.

Separating the Independent and Dependent Variables in df

Let us examine the columns in df to split them into dependent and
independent variables. The column Success_exam in df is dependent on the
other columns in the DataFrame such as the gender of the students and the
course they attended. The amount of time the students attended the course
and the amount of time they prepared for the course also affect the success
of the exam. Therefore, Success_exam is a dependent variable. The columns
Gender, Courses, Hours_attended, and Hours_prepared are independent
variables.

Code Snippet 10 retrieves the values from the columns that are the
independent variables in df and stores them as a two-dimensional array called
x. The code first selects the Gender, Courses, Hours_attended, and
Hours_prepared columns from df. It then uses the values attribute to retrieve
the values of the selected columns as a numpy array. The code assigns the
numpy array to the variable x and prints x to the standard output device.

Code Snippet 10:

x = df[['Gender', 'Courses', 'Hours_attended',
        'Hours_prepared']].values
x

Figure 13.16 shows the value of numpy array x. Note that the array has the same
number of rows as df with four columns corresponding to the selected
independent variables.



Figure 13.16: Value of x Containing Independent Variables

Code Snippet 11 selects the column Success_exam from df, which is the
dependent variable in df. The code retrieves the values of the column and
stores them as a two-dimensional array called y. The code prints y to the
standard output device.

Code Snippet 11:

y=df[['Success_exam']].values
y

Figure 13.17 shows the value of y, which is a two-dimensional array containing
the same number of rows as df. The array contains a single column with
the values of the dependent variable Success_exam.
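Whether the result is two-dimensional or one-dimensional depends on how the column is selected. The following is a minimal sketch on made-up data; sample_df, y_2d, and y_1d are hypothetical names.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample_df = pd.DataFrame({'Success_exam': ['Y', 'N', 'Y']})

# Selecting with a list of column names gives a DataFrame, so .values is 2-D.
y_2d = sample_df[['Success_exam']].values
# Selecting with a single column name gives a Series, so the array is 1-D.
y_1d = sample_df['Success_exam'].to_numpy()

print(y_2d.shape)   # (3, 1)
print(y_1d.shape)   # (3,)
```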



Figure 13.17: Value of y Containing Dependent Variable

13.4 Encoding Data

Encoding is the process of converting data from one form to another form for
the purpose of storage or processing. In ML, the categorical data should be
converted into numerical data by using appropriate encoding methods. The
reason is that categorical data are text values, whereas ML models perform
mathematical operations that require numerical input.

There are several ways to encode categorical data. Two of the most used
methods are as follows:
 One-hot encoding
 Label encoding

13.4.1 One-hot Encoding


In one-hot encoding, each unique categorical value in a column is converted
into a separate binary column that holds a value of 1 or 0. For
example, in the DataFrame df, the column Courses has three unique
categories, Arts, Science, and Commerce. These categories are converted
into three columns, one for each category. A sample with the Arts category
will have the binary value [1,0,0]. Figure 13.18 illustrates
the one-hot encoding process.
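The idea can be tried quickly with the pandas.get_dummies function, which is a Pandas alternative to the scikit-learn encoder used later in this session. The following is a minimal sketch on made-up values; courses and one_hot are hypothetical names.

```python
import pandas as pd

# Hypothetical values for illustration only.
courses = pd.Series(['Arts', 'Science', 'Commerce', 'Arts'], name='Courses')

# One column per category, in sorted order: Arts, Commerce, Science.
# dtype=int gives 1 and 0 instead of True and False.
one_hot = pd.get_dummies(courses, dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of its category, so the first row (Arts) becomes [1, 0, 0].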



Figure 13.18: One-hot Encoding

13.4.2 Label Encoding

In label encoding, each unique category in a column is assigned an integer
value. For example, each of the unique values in the Courses column will be
assigned a unique integer value as illustrated in Figure 13.19.

Figure 13.19: Label Encoding

Although this is the simplest method, the disadvantage is that ML algorithms
may misinterpret the assigned integer values as having an ordered
relationship when none exists. Therefore, it is recommended to use one-hot
encoding for variables that have more than two unique values.
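The integer assignment can be seen with the LabelEncoder class of scikit-learn. The following is a minimal sketch that assumes scikit-learn is already installed; the list of courses is made up for illustration, and encoder and codes are hypothetical names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for illustration only.
courses = ['Science', 'Arts', 'Commerce', 'Arts']

encoder = LabelEncoder()
codes = encoder.fit_transform(courses)

# classes_ holds the categories in sorted order; the position is the label.
print(list(encoder.classes_))    # ['Arts', 'Commerce', 'Science']
print(codes.tolist())            # [2, 0, 1, 0]
# inverse_transform maps the integers back to the original categories.
print(list(encoder.inverse_transform([0, 1, 2])))
```

The labels 0, 1, and 2 follow alphabetical order only; a model may still read them as Arts < Commerce < Science, which is the misinterpretation described above.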

13.4.3 Encoding Categorical Data in DataFrame df

The categorical columns in the DataFrame df are Gender, Courses, and
Success_exam. The choice of the encoding method depends on the
characteristics of the categorical values in these columns. The one-hot
encoding method is used when a column has more than two unique values.
Label encoding can be used for columns that have only two unique values.
For example, the columns Gender and Success_exam contain only two values,
Male or Female and Y or N, respectively. You can use the label encoding
method for these columns. The column Courses contains three values: Arts,
Science, or Commerce. You can use the one-hot encoding method for this column.



Install Scikit-learn Library
To encode categorical columns, install the scikit-learn library. The code to
install scikit-learn using the pip installer package is:

pip install scikit-learn

The scikit-learn library is a Python module for machine learning. It has a
preprocessing module, sklearn.preprocessing, that contains several utility
functions and encoding classes to convert categorical data into numerical data.
Apply One-hot Encoding
Code Snippet 12 shows the code to encode the categorical variable Courses
in the numpy array x using the OneHotEncoder class from the
sklearn.preprocessing module. The code:

1. Imports the LabelEncoder and OneHotEncoder classes from the
sklearn.preprocessing module. The LabelEncoder class encodes
categorical variables as numerical values, while the OneHotEncoder
class encodes categorical variables as one-hot numeric arrays.
2. Creates an instance of the OneHotEncoder class and assigns it to a
variable called onehot_encoder.
3. Uses the onehot_encoder.fit_transform method to encode the
second column of x which contains the categorical values of Courses
as a one-hot numeric array. To do this, the code:
i. Slices the second column of x using x[:,1] and reshapes it into a
two-dimensional array with one column using the reshape
method. This is done to make the array compatible with the
fit_transform method. The fit_transform method learns the
unique categories in the column and transforms them into binary
vectors. The output is a sparse array that contains the one-hot
encoded values.
ii. Uses the toarray method of the onehot_encoder object to
convert the sparse array output into a dense array. Converting the
sparse array to a dense array is necessary to pass the encoded
data to other functions that require a dense array as input.
iii. Assigns the dense array to the variable feature_course. This
variable contains a two-dimensional array with one column for
each unique value in the Courses column.
4. Appends feature_course to the end of x along the column axis. To do
this, the code:
i. Imports the numpy library as np.

ii. Uses the np.append method to add the columns in
feature_course to x along the column axis (axis=1). The
np.append method returns a new array with all the columns from
both feature_course and x.
iii. Assigns the new array to x, overwriting the original value of x with
the new array.
iv. Uses np.delete(x, [1], axis=1) to remove the original second
column of x that contained the categorical values of Courses.
v. Assigns the result of the np.delete method to x, overwriting the
original value of x with the new array that has the second column
removed.
Code Snippet 12:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
onehot_encoder = OneHotEncoder()
feature_course = onehot_encoder.fit_transform(
    x[:, 1].reshape(-1, 1)).toarray()
import numpy as np
x = np.append(x, feature_course, axis=1)
x = np.delete(x, [1], axis=1)
x

Figure 13.20 shows the numpy array x with each of the unique categories in its
second column Courses transformed into a binary vector of three columns
appended to its end.

Figure 13.20: Value of x with One-hot Encoding



Apply Label Encoding
Code Snippet 13 shows the code that applies label encoding to the first
column of x, which contains the values of the categorical variable Gender. The
code:
1. Creates an instance of the LabelEncoder class and assigns it to the
variable labelencoder_x.
2. Uses labelencoder_x.fit_transform(x[:,0]) to fit and
transform the values of the first column of x into numerical labels. This
means assigning a unique number between 0 and n_classes - 1 to each
category in that column, where n_classes is the number of unique values
in that column.
3. Assigns the result of the labelencoder_x.fit_transform method to
the first column of x, which means overwriting the original values of
x[:,0] with the numerical labels.
Code Snippet 13:

labelencoder_x = LabelEncoder()
x[:,0] = labelencoder_x.fit_transform(x[:,0])
x

Figure 13.21 shows the output of the execution of Code Snippet 13. Note that
the first column is label-encoded with numerical values 0 or 1.

Figure 13.21: Value of x with Label Encoding



Code Snippet 14 shows the code that applies label encoding to fit and
transform the values of the categorical variable Success_exam in y.

The code:
1. Creates an instance of the LabelEncoder class and assigns it to the
variable labelencoder_y.
2. Uses the numpy.ravel method to flatten the two-dimensional array y of
shape (n_samples, 1) into a one-dimensional array of shape (n_samples,),
where n_samples is the number of samples in the input array. The reason is
that LabelEncoder expects a one-dimensional array of labels, and some of
the ML models also require a one-dimensional array as the target variable.
3. Uses labelencoder_y.fit_transform(y.ravel()) to fit and transform the
flattened values into a one-dimensional array of numerical labels.
4. Assigns the resulting one-dimensional array to y, thus overwriting the
original values of y with the numerical labels.
Code Snippet 14:

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y.ravel())
y

Figure 13.22 shows the values of Success_exam in the dependent variable y
converted into numerical labels.

Figure 13.22: Value of y with Label Encoding

13.5 Splitting Dataset as Training and Test

After encoding the data, the next step is to split the data as training and test
sets. The reason is that training and testing the ML model on the same data
might make the model learn only specific patterns rather than the relationships
between the underlying data. This can lead to the poor performance of the
ML model when it is provided with new data that has different patterns.
Splitting the data into training and test sets allows us to simulate the
performance of the model on new data and evaluate how well the model
generalizes to new data. The training set contains data that ML models learn
from. The test set contains data on which ML models are evaluated.



13.5.1 Splitting the Data in x and y as Training and Test

Code Snippet 15 shows the code to split the independent and dependent
variables, x and y, into training and test sets. The code:
1. Imports the train_test_split function from the
sklearn.model_selection module. The train_test_split function
is a tool for splitting data into training and test sets.
2. Uses the train_test_split function that takes an input array, x, and
an output array, y, as arguments and returns four subsets: x_train,
x_test, y_train, and y_test. These subsets represent the input and
output variables for the training and test sets, respectively.
3. Specifies the test_size parameter of the train_test_split function
as 0.2. This means that 20% of the data will be used for the test set and
80% for the training set.
4. Specifies the random_state parameter of the train_test_split
function as 0, which means that the data will be shuffled before splitting
in a reproducible way. This ensures that the results are consistent across
different runs of the code.
Code Snippet 15:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
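The effect of test_size and random_state can be checked on a small made-up dataset. The following is a minimal sketch; x_demo, y_demo, and the other variable names are hypothetical. It also passes the optional stratify parameter, which is not used in Code Snippet 15 and is shown here only as an assumption about what may be useful when the classes must keep the same proportions in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration only: 10 samples, 2 features, 2 classes.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.2,      # 20% of 10 samples -> 2 test samples
    random_state=0,     # fixed seed -> same split on every run
    stratify=y_demo)    # keep the class proportions in both sets

print(x_tr.shape, x_te.shape)
print(y_tr, y_te)
```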

The code to print the value of x_train to the standard output device is:

x_train

Figure 13.23 shows the data in x_train.



Figure 13.23: Data in x_train

The code to print the value of x_test to the standard output device is:

x_test

Figure 13.24 shows the data in x_test.

Figure 13.24: Data in x_test

The code to print the value of y_train to the standard output device is:

y_train

Figure 13.25 shows the data in y_train.

Figure 13.25: Data in y_train

The code to print the value of y_test to the standard output device is:

y_test



Figure 13.26 shows the data in y_test.

Figure 13.26: Data in y_test

13.6 Feature Scaling

Feature scaling is a technique to tune the values of independent variables in
a dataset to a similar range or scale. This can help to improve the performance
of machine learning models that are sensitive to the size or variation of these
variables. For example, observe the dataset in x_train and x_test shown in
Figures 13.23 and 13.24. The values in the second and third columns that
represent Hours_attended and Hours_prepared vary from 8.0 to 30.0 and 3.0
to 15.0, respectively. These two variables do not have the same scale. This can
lead to poor performance of ML models because many of the models are
based on Euclidean distance. The Euclidean distance is the straight-line
distance between two points in a space. ML models use Euclidean distance to
measure the similarity between two recorded observations, where a smaller
distance indicates greater similarity. If the features differ in scale, the
feature with the larger range dominates the distance. This distorts the
measure of similarity and affects the performance of ML algorithms.
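
The effect of scale on the Euclidean distance can be illustrated with three made-up points. In the following sketch, the points a, b, and c are hypothetical students described by (Hours_attended, Hours_prepared), and dividing by the range is a simple rescaling used here only for illustration; it is not the standardization applied later in this session.

```python
import numpy as np

# Hypothetical students as (Hours_attended, Hours_prepared).
a = np.array([8.0, 3.0])
b = np.array([30.0, 3.0])    # differs from a by the full range of hours attended
c = np.array([8.0, 15.0])    # differs from a by the full range of hours prepared

dist_ab = np.linalg.norm(a - b)   # 22.0
dist_ac = np.linalg.norm(a - c)   # 12.0

# After dividing each feature by its range, both differences count equally.
ranges = np.array([30.0 - 8.0, 15.0 - 3.0])
scaled_ab = np.linalg.norm((a - b) / ranges)   # 1.0
scaled_ac = np.linalg.norm((a - c) / ranges)   # 1.0

print(dist_ab, dist_ac)
print(scaled_ab, scaled_ac)
```

Before rescaling, b appears almost twice as far from a as c does, only because Hours_attended covers a wider range of values.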
13.6.1 Apply Feature Scaling to x_train and x_test

Code Snippet 16 applies feature scaling to preprocess the training and test
sets of the input variables using the same scaling parameters. This is done to
ensure that the training and test sets have a similar range and distribution. The
code:
1. Imports the StandardScaler class from the sklearn.preprocessing
module, which is a tool for standardizing features by removing the mean
and scaling to unit variance.
2. Creates an instance of the StandardScaler class and assigns it to the
variable sc_x.
3. Uses sc_x.fit_transform(x_train) to fit and transform the x_train
array. This method calculates the mean and standard deviation of each
feature in x_train. It then uses them to center and scale each feature
value by subtracting the mean and dividing by the standard deviation.
The result is a new array of standardized features that has zero mean
and unit variance along each column and it is assigned back to
x_train.
4. Uses sc_x.transform(x_test) to transform the x_test array. This
method uses the same mean and standard deviation that it calculated
from x_train and applies them to center and scale each feature value
in x_test. The result is a new array of standardized features that is on
the same scale as x_train, and it is assigned to x_test. Because the
scaling parameters come from x_train, the columns of x_test generally
do not have a mean of exactly zero or a variance of exactly one.

Code Snippet 16:

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
x_train
x_test

Figure 13.27 shows the feature scaled array of x_train.

Figure 13.27: Feature Scaled x_train

Figure 13.28 shows the feature scaled array of x_test. Note that both the
arrays have similar range and distribution.

Figure 13.28: Feature Scaled x_test
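
The calculation behind standardization can be sketched by hand for a single feature. This is a minimal illustration with hypothetical values and not the internal code of StandardScaler. It uses the population standard deviation, which is what StandardScaler uses, and it learns the parameters from the training values only.

```python
import statistics

# Hypothetical single feature, split into training and test values
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# Learn the scaling parameters from the training values only
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation

# Apply the same parameters to both sets
train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]

print(train_scaled)  # centered on 0
print(test_scaled)   # scaled with the training mean and deviation
```

The scaled training values have a mean of zero, whereas the scaled test values do not have to, because they were never used to compute the parameters.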

13.7 Saving the Training and Test Sets

As the training and test sets are the basis for training, testing, and improving
the ML models, it is necessary to save these datasets. By saving the datasets,
they can be reused for various purposes. Datasets can be compared with
other datasets, updated with new data, or shared among researchers and
developers who work on related ML models.
13.7.1 Save x_train, x_test, y_train, and y_test
To save the training and test sets, you can use the Python library joblib.
joblib is well suited to storing Python objects that contain large numpy arrays.

Code Snippet 17 shows the code to save the training and test sets using the
joblib.dump function. The code first imports the joblib library. It then uses
the joblib.dump function to serialize the x_train, x_test, y_train, and
y_test numpy array objects. The serialized objects are saved as files in the
current working directory with the names filextrain.pkl, filextest.pkl,
fileytrain.pkl, and fileytest.pkl, respectively.

Code Snippet 17:

import joblib
joblib.dump(x_train,'filextrain.pkl')
joblib.dump(x_test,'filextest.pkl')
joblib.dump(y_train,'fileytrain.pkl')
joblib.dump(y_test,'fileytest.pkl')

The extension .pkl denotes pickle files that are used to store
Python objects. The numpy array objects in the .pkl files can be
later loaded into the memory using the joblib.load function.
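
A save-and-load round trip can be sketched as follows. The array and the file name sample.pkl are hypothetical, and a temporary directory is used so that no existing file is overwritten.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical array standing in for x_train
sample = np.array([[0.5, -1.2], [1.5, 0.3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.pkl')
    joblib.dump(sample, path)      # serialize the array to disk
    restored = joblib.load(path)   # read it back into memory

# The restored array holds the same values as the original
print(np.array_equal(sample, restored))
```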

13.8 Summary

⮚ ML algorithms are mathematical methods that can find patterns in the
data and various ML models are developed by using ML algorithms.
⮚ Data processing is collecting data and making it suitable to enable ML
models to observe and learn patterns.
⮚ Loading, cleaning, encoding, splitting, and scaling the data are some
of the steps involved in data processing.
⮚ Filling or dropping missing and duplicated values from datasets helps to
improve the performance of ML models.
⮚ Outliers in a dataset should be removed to enable the ML model to
make accurate predictions.
⮚ Categorical data are text data that are classified based on their
features or attributes.
⮚ The two most used encoding methods for encoding categorical data
are one-hot encoding and label encoding.
⮚ The scikit-learn library, which is a Python module, provides various
functions for ML.
⮚ Independent and dependent variables are used as the input and
output for the ML model, respectively.
⮚ Splitting the data into training and test sets allows you to evaluate how
well the model generalizes to new data.
⮚ Feature scaling is a technique that can help to improve the
performance of machine learning models that are sensitive to the size or
variation of independent variables.
⮚ Saving the datasets as .pkl files enables you to reuse the datasets for
later use.

Test Your Knowledge

1. You are working on a machine learning project that requires you to
examine customer feedback data. The dataset includes both text and
numerical data. What data processing steps should be taken before
training your model to prepare the dataset for modeling?

a. Loading and exploring the dataset
b. Encoding categorical data
c. Feature scaling
d. Cleaning the dataset

2. You gathered data on stock market trends for your machine learning
project. The dataset includes both historical prices and categorical
data. To use this data effectively for training your model, what step
should you perform after loading and exploring the dataset?

a. Cleaning the dataset
b. Encoding categorical data
c. Feature scaling
d. Splitting dataset into training and test set

3. Why is data preprocessing important in the machine learning pipeline?

a. It helps in making data more complex
b. It makes the dataset smaller in size
c. It prepares the data for use in building and training ML models
d. It creates visualizations of the data

4. In the context of data processing, what does encoding categorical
data refer to?

a. Creating videos from raw data
b. Converting categorical data into numerical form
c. Removing all missing values from the dataset
d. Splitting the dataset into training and test sets

5. Which of the following is NOT a typical issue with real-world data that
necessitates data preprocessing?

a. Noisy data
b. Missing values
c. Unusable format
d. Overfitting

Answers to Test Your Knowledge

1 d
2 b
3 c
4 b
5 d

Try it Yourself

1. As a data scientist, you are engaged by a leading online retailer to
develop a predictive model that can forecast whether a user is inclined
to make a purchase. Your task is to create a dependable model that
assists in enhancing marketing strategies and propelling revenue
expansion. Create a Purchase_Item.csv using the dataset given in
Table 13.2.

Table 13.2: Data for Purchase_Item.csv

a. Import the necessary libraries and import the dataset.
b. Load the dataset.
c. Check the shape of the dataset.
d. Use the info() method to view information about the dataset.
e. Identify missing values in the dataset and decide how to handle
them.
f. Identify and remove duplicate rows from the dataset.
g. Preprocess the Purchase column, replacing One with O and Zero
with Z.
h. Detect outliers in the Age column using the IQR method and
handle them.
i. Handle missing values in the Purchase_History and Age
columns.
j. Prepare the data for training by selecting relevant features and
encoding categorical variables.
k. Split the data into training and testing sets.
l. Standardize the features.
m. Save the preprocessed training and testing sets using joblib.

Learning Objectives

In this session, students will learn to:

⮚ Describe different types of Machine Learning (ML)
⮚ Explain the logistic regression method of supervised learning
⮚ Describe ways to implement the K-means clustering technique of
unsupervised learning
⮚ Explain reinforcement learning

ML enables a computer to learn to perform tasks better over a period of time
without being explicitly programmed to do so. By making use of machine learning
algorithms, human intervention in maintaining systems can be reduced and
human error can be avoided. Accurate output can be achieved under
diverse circumstances, and the time taken to do a task can be considerably
reduced. For example, an online shopping mart can analyze its past few
years of purchase data for a product to predict its future trend. Over a period
of time, this predictive analysis can be improved using machine learning
algorithms.

This session will focus on different types of ML along with different algorithms
associated with each of these types.

14.1 Introduction to Different Types of Machine Learning Techniques

ML can be implemented in different ways based on the complexity of the
model to be built. Learning can be based on past experiences, data
recordings, or inferences from actions and results over a period. Depending on
the problem to be solved, there are three broad types of Machine Learning:
supervised learning, unsupervised learning, and reinforcement learning.

14.2 Supervised Learning

As the name suggests, supervised learning refers to learning in the presence of
an instructor. The instructor here corresponds to labeled datasets, in which
the correct output for each input is already known. Based on the information
provided through the datasets, the machine learns and improves. What it has
learned is then applied to new inputs to obtain the best result.

Figure 14.1 depicts the concept of supervised learning.

Figure 14.1: Supervised Learning

A dataset of animals is stored in the machine. Suppose, a new input in the form
of a cow arrives. Based on the data stored, the machine thinks if the input is a
cat, a dog, a snake, or a cow, and then decides that it is a cow. This can be
considered as a form of supervised learning.

Supervised learning finds its use in many places including healthcare, finance,
and educational sectors. It relies on stored input data for which the output is
known and on new data for which the output must be determined. A mapping
between the input and the possible output is made to predict the result
accurately. The supervised learning algorithm tries to establish a relationship
between input and output and to draw meaningful insights from the mapping.

The effectiveness of supervised learning depends on the volume and
accuracy of the datasets provided to the machine. Accurate insights or
predictions can be made from accurate data.

The two types of supervised learning are Regression Learning and Classification
Learning.

14.2.1 Regression Learning

The regression learning model tries to find the relationship between the
independent variables and the dependent variable, and predicts a dependent
outcome that is typically continuous. For example, consider the data for
meteorological prediction of rains, which depends on factors such as cloud
density, cloud movement, wind, humidity, sea breeze, and season of the year.
The occurrence of rain depends on these factors, which are the independent
variables. By analyzing the past data of these independent variables, a
prediction of rain can be made.

Similarly, sales in a store can be predicted based on past experiences and
analyzing the relationship between sales and advertisement. Based on the
money spent on advertising and the sales history of previous years, the sales
prediction for the year 2023 must be made. In this case, regression learning
method of prediction is the most suitable learning method. Table 14.1 shows
the money spent on advertisements and the sales earned over a few years.

Year    Advertisement Cost    Sales
2019    $90                   $1,000
2020    $110                  $1,300
2021    $150                  $1,600
2022    $180                  $1,900
2023    $200                  ?

Table 14.1: Advertisement Cost and Sales
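
One way to fill the missing value in Table 14.1 is to fit a straight line through the four known points. The sketch below assumes a simple linear relationship between advertisement cost and sales, which the text does not specify, and uses ordinary least squares written out by hand.

```python
# Data from Table 14.1 (2019 to 2022)
ad_cost = [90, 110, 150, 180]
sales = [1000, 1300, 1600, 1900]

n = len(ad_cost)
mean_x = sum(ad_cost) / n
mean_y = sum(sales) / n

# Ordinary least squares for a single independent variable
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_cost, sales))
variance = sum((x - mean_x) ** 2 for x in ad_cost)
slope = covariance / variance
intercept = mean_y - slope * mean_x

# Predict the 2023 sales for an advertisement cost of $200
predicted_2023 = intercept + slope * 200
print(round(predicted_2023, 2))
```

Under this assumption, the predicted sales for 2023 are roughly $2,094. A different regression model could give a different estimate.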

14.2.2 Classification Learning

In a classification type learning model, data is classified into categories.
The machine compares the input against the known categories and tries to assign
the input object to one of them. Suppose classification of animals, birds, or
human beings is stored in the machine. The algorithm tries to match the features
of the input, such as images or colors, against the categories stored already.
Figure 14.2 depicts the concept of classification learning.

Figure 14.2: Classification Learning

The two models of classification learning are binary classification and multiclass
classification.

Binary Classification Model

In the binary classification, the total number of categories or classes is always
two. Based on the input provided, the output must be one of the two cases.
Consider a prediction system that identifies the gender of a person visiting a
religious place. The given input should fall under one of the two categories:
Male or Female.

Figure 14.3 shows the possibilities of output for a binary classification prediction
model.

Figure 14.3: Binary Classification Model of Prediction

Multiclass Classification Model

In the multiclass classification model, the number of output classes is
more than two. The learning model analyzes the input and assigns the
input to one of the many classes present. Figure 14.4 shows the prediction
system where the input is assigned to one of the categories: Male,
Female, Child, or Elder. The input should fit into one of these four categories.

Such a model is useful in a prediction system where preference for a product
must be predicted. For example, a prediction system may be required to predict
the gender and age group of people who prefer a particular flavor of
ice cream. The categories may be classified as shown in Figure 14.4. For a
group of people, the algorithm can help in predicting the target
audience for the particular flavor of ice cream.

Figure 14.4: Multiclass Classification Prediction Model

Some of the models of classification learning algorithms are Logistic Regression,
Naive Bayes, K-Nearest Neighbors, Decision Tree, and Support Vector
Machines. Let us see an example of implementation of Supervised Learning
under Logistic Regression.

14.2.3 Supervised Learning Using Logistic Regression

Logistic regression is a type of binary classification model that determines
the probability of an input falling into one of two categories. Here, the idea
is to categorize the dependent variable into one of the two discrete outcome
values, 0 or 1. However, instead of giving the output as either 0 or 1
directly, the logistic regression algorithm returns a probability ranging from
0 to 1. Thus, the output resembles an ‘S’ shape in a graph instead of
resembling a straight line. The flat extremes of the curve, near 0 and near 1,
help in predicting the category into which the input fits.
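
The S-shaped curve comes from the logistic (sigmoid) function. The sketch below evaluates it for a few hypothetical input scores and converts each probability into a class label. The threshold of 0.5 is a common default and is an assumption here.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores, from strongly negative to strongly positive
for z in (-6, -2, 0, 2, 6):
    probability = sigmoid(z)
    label = 1 if probability >= 0.5 else 0  # assumed threshold of 0.5
    print(z, round(probability, 3), label)
```

Large negative scores give probabilities close to 0, large positive scores give probabilities close to 1, and a score of 0 gives exactly 0.5.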

Logistic Regression is an important machine learning algorithm that is used to
solve classification problems. It is capable of giving probabilities to classify
new data using continuous or discrete datasets. It can be used to classify the
observations using different types of data. It determines the effective variable
that can be used for the classification.

Steps involved in logistic regression model are:

1. Training the Model
2. Predicting the Test Data
3. Finding the Accuracy of the Model

To demonstrate supervised learning using logistic regression, let us predict
whether students will succeed in their exam or not based on past data.

• Training

The training and test data are taken in the form of pickle files with extension
.pkl. Download the filextrain.pkl file present under Course Files on
Onlinevarsity. This file contains the training data of independent variables.
Code Snippet 1 shows how to load this data.

Code Snippet 1:
import joblib
x_train = joblib.load('filextrain.pkl')
x_train

Figure 14.5 shows the output of this code.

Figure 14.5: Training Data for Independent Variables

Similarly, Code Snippet 2 loads the training data for the
dependent variables (y_train).

Code Snippet 2:
import joblib
y_train = joblib.load('fileytrain.pkl')
y_train

Figure 14.6 shows the output of this code.

Figure 14.6: Training Data for Dependent Variables

Code Snippet 3 shows how to load the test data for the independent variables
(x_test).

Code Snippet 3:
import joblib
x_test = joblib.load('filextest.pkl')
x_test

Figure 14.7 shows the output of this code.

Figure 14.7: Test Data for Independent Variables

Code Snippet 4 shows how to load the test data for the dependent variables
(y_test).

Code Snippet 4:
import joblib
y_test = joblib.load('fileytest.pkl')
y_test

Figure 14.8 shows the output of this code.

Figure 14.8: Test Data for Dependent Variables

The data loaded can now be used to train the model. The
LogisticRegression class, which belongs to the sklearn library, provides
methods to train the model. Code Snippet 5 shows the code to train the
model. The training data x_train and y_train are passed as arguments to the
fit method of the classifier_model object.

Code Snippet 5:

from sklearn.linear_model import LogisticRegression


classifier_model=LogisticRegression(random_state=0)
classifier_model.fit(x_train,y_train)

Figure 14.9 shows the output of this code.

Figure 14.9: Training the Model

The random_state parameter of the LogisticRegression class seeds the random
number generator that some solvers use to shuffle the data during training.
Setting random_state to a fixed value, such as 0, makes the training results
reproducible across runs. It does not reduce prediction errors by itself. With
those solvers, a different random_state value may lead to slightly different
results.

• Predicting the Test Data

The model, classifier_model, is trained with the training data of independent
variables (x_train) and dependent variables (y_train). The model must now be
tested for prediction. Prediction means that the model is fed with the test data
of independent variables (x_test) and asked to predict the result, which is
the dependent variable (y_pred).

Code Snippet 6 shows the creation of prediction array. The predict function
helps in predicting the values of the dependent variables (y_pred) based on
the values of the independent variables (x_test). The values of y_pred are
displayed.

Code Snippet 6:

y_pred=classifier_model.predict(x_test)
y_pred

Figure 14.10 shows the output of Code Snippet 6.

Figure 14.10: Generating Prediction Data

The accuracy of the model is determined by the percentage of values in y_pred
that match the values in y_test. For example, suppose the predictions that the
model produces for the test data of independent variables match 80% of the test
data for dependent variables. In this case, the model is 80% accurate. If all
the values match, then the model is 100% accurate. If none of the values match,
then the model is 0% accurate.

The next step is to evaluate the model. Metrics are used to evaluate the results
of prediction against the actual values. Here, y_test is the actual values and
y_pred is the predicted values. To test the accuracy of the predicted model,
the predicted values must be compared with the actual values. Figure 14.11
depicts the process flow of determining the accuracy of the prediction model.

Figure 14.11: Accuracy of the Model

• Finding the Accuracy of the Model

Model accuracy can be determined using a confusion matrix or an accuracy
score. A confusion matrix is a table that compares the actual values with the
predicted values. Figure 14.12 shows the confusion matrix interpretation.

Figure 14.12: Confusion Matrix

Table 14.2 lists the meaning of the cells in the matrix.

Value            Meaning
True Positive    Actual and predicted values are both 1, which means the
                 model predicted positive and the actual value is positive
True Negative    Actual and predicted values are both 0, which means the
                 model predicted negative and the actual value is negative
False Positive   Actual value is 0 and predicted value is 1, which means the
                 model predicted positive but the actual value is negative
False Negative   Actual value is 1 and predicted value is 0, which means the
                 model predicted negative but the actual value is positive

Table 14.2: Confusion Matrix – Values and Meaning

The confusion_matrix function takes the actual values (y_test) and the
predicted values (y_pred) of the dependent variable as input to generate the
confusion matrix. Code Snippet 7 helps in creating the confusion matrix.

Code Snippet 7:
from sklearn.metrics import confusion_matrix

# Compare the actual values with the predicted values
confmat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix : \n", confmat)

Figure 14.13 shows the confusion matrix.

Figure 14.13: Confusion Matrix

To interpret the accuracy from the confusion matrix, you must add the true
positive and true negative scores. In the confusion matrix in Figure 14.13, the
true positive and true negative cells each have the value 2. Thus, the total
score is 4 (2+2). The total number of values in the y_pred and y_test arrays is
4. Therefore, the prediction model is 100% accurate. Note that the rest of the
cells in the matrix have the value 0 because none of the predictions is wrong.

Suppose the confusion matrix values are 2, 1, 1, and 0 instead of 2, 0, 0, and 2.
In this case, the prediction is 50% accurate because the sum of true positive
and true negative values is 2 (2+0). Therefore, the % is 2/4 * 100 = 50%.
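
The two cases above can be reproduced by counting the four outcomes directly. The label lists in this sketch are hypothetical and are chosen only so that the counts match the two matrices discussed. The helper functions are illustrative and are not part of sklearn.

```python
def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual values."""
    tn, fp, fn, tp = confusion_counts(actual, predicted)
    return (tp + tn) / (tn + fp + fn + tp)

# Case 1: every prediction is correct
perfect = accuracy([0, 0, 1, 1], [0, 0, 1, 1])

# Case 2: two true negatives, one false positive, one false negative
half = accuracy([0, 0, 0, 1], [0, 0, 1, 0])

print(perfect, half)
```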

The accuracy score helps in finding the correctness fraction of the prediction
made by the model. Code Snippet 8 helps in finding the accuracy of the
prediction model. The accuracy_score function takes two arguments, y_test
and y_pred. The actual and prediction values are compared, and the
accuracy score is arrived at based on the similarities and differences.

Code Snippet 8:

from sklearn.metrics import accuracy_score


print ("Accuracy of our model : ", accuracy_score(y_test,
y_pred))

Figure 14.14 shows the output of the code.

Figure 14.14: Accuracy Score of the Model

The accuracy score shows that the model is 100% correct.

14.3 Unsupervised Learning

In the unsupervised learning method, the input data is unlabeled or
unclassified. Since the model is not given any classified data or input, it has to
identify the classes based on the inputs provided. The model then classifies the
inputs based on appearance, patterns, and behavior.

Figure 14.15 shows the classification of random population data.

Figure 14.15: Classification of Data

In Figure 14.15, the classification is done by the model itself, in contrast to
supervised learning where classified data is fed to the model. The unsupervised
learning model aims to create clusters or subgroups. There may be many clusters,
but the model aims to minimize the distance between each data point and the
centroid, or central point, of its cluster. The two types of unsupervised
learning are Clustering and Association.

14.3.1 Clustering

Clustering involves dividing a group into subgroups based on similarities, so
that the members of each subgroup are homogeneous. Each subgroup differs from
the others in some ways. This classification helps businesses in targeting
their audiences.

Once classified, the attributes and trends of the groups are learned by the
machine itself without any external inputs or input-output mapping. The
clustering algorithm takes the raw data and classifies it into clusters or
subgroups based on patterns, appearances, and behavior. Figure 14.15 shows
the clusters formed from a general population. Clusters formed include Males,
Females, Children, Seniors, and Physically Challenged people.

Business entities target the clusters based on their preferences. For example,
children prefer toys and gaming instruments. Women prefer clothing and
accessories.

14.3.2 Association

Associative learning is learning about the behavior of the clusters. Here, the
clusters and their behavior are studied. Inferring the preferences of a cluster
helps in predicting the preferences of a similar cluster. For example, when
the target customers are children, their buying preferences are studied. When
a customer from a similar cluster arrives, products are displayed on the
assumption that the customer is likely to prefer similar products.

Figure 14.16 shows the preferences of different customers.

Figure 14.16: Associative Behavior of Clusters

Customer 1 and Customer 2 bought chocolates, toys, and gadgets. The model
associates this to Customer 3 and predicts purchase preferences.

Some of the unsupervised machine learning algorithms are K-means clustering,
hierarchical clustering, anomaly detection methods, and the Apriori algorithm
for association.

14.3.3 Unsupervised Learning Using K-means Clustering

In this learning model, the number of clusters, K, is chosen before the K-means
algorithm runs. The algorithm groups n observations into K clusters. For each
value of K, a different set of clusters is created. The aim is to find the K
value that gives the optimal result, that is, compact clusters that are clearly
separated from one another. This clustering algorithm is very useful in image
segmentation, customer segmentation, species clustering, anomaly
detection, and clustering languages.
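
The core of K-means is a loop that alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is a minimal illustration with hypothetical points and hand-picked starting centroids. The KMeans class in sklearn chooses its starting centroids differently, so this is not its actual implementation.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate the assignment and update steps of K-means."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)

        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster in place

        if new_centroids == centroids:
            break  # centroids stopped moving, so the loop has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two visible groups, and K = 2
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
start = [(0.0, 0.0), (10.0, 10.0)]  # assumed starting centroids

final_centroids, final_clusters = k_means(points, start)
print(final_centroids)
print(final_clusters)
```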

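Steps involved in the K-means algorithm are: load the dataset, get the
independent variables, find K by the elbow method, scale the data, train the
model, and visualize the clusters.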

To implement the K-means algorithm, download the monthlysaving.csv
dataset file present under Course Files on Onlinevarsity. It has details such as
gender, age, monthly income, and savings.

• Load the Dataset


The monthlysaving.csv file must be loaded into the dataframe. Code
Snippet 9 shows the code for loading the dataset.
Code Snippet 9:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# The file name must match the downloaded file exactly
df = pd.read_csv('monthlysaving.csv')
df

Figure 14.17 shows the output of this code.

Figure 14.17: Monthly Savings Data Loaded


• Get Independent Variables
The monthly income($) and Savings score(1-10) columns are
independent variables. Other columns have less impact on finding the number
of clusters. The code in Code Snippet 10 helps to segregate the independent
variables.
Code Snippet 10:

x = df.iloc[:, [2,3]].values
x

Figure 14.18 shows the output of the segregated independent variables.

Figure 14.18: Independent Variable Values

• Find K by the Elbow Method


The value of K defines the number of clusters. One way to find the optimal
number of clusters for K-means algorithm is to use the elbow method. The
elbow method plots the sum of squared distances (SSD) of each point to its
closest cluster center against the number of clusters. The SSD decreases as the
number of clusters increases, but the rate of decrease slows down at some
point. The optimal number of clusters is where the plot bends or forms an
‘elbow’. This is the point at which dividing the data into further clusters must be
stopped.
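
The shape of the elbow curve can be seen from a small hand calculation. In the sketch below, the data and the groupings for each K are hypothetical and hand-picked rather than produced by the K-means algorithm. The ssd function is an illustrative helper.

```python
def ssd(clusters):
    """Sum of squared distances of each value to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((value - center) ** 2 for value in cluster)
    return total

# Hypothetical one-dimensional data with two natural groups
groupings = {
    1: [[1, 2, 3, 11, 12, 13]],
    2: [[1, 2, 3], [11, 12, 13]],
    3: [[1, 2], [3], [11, 12, 13]],
}

for k, clusters in groupings.items():
    print(k, ssd(clusters))
```

The SSD falls from 154 to 4 when K goes from 1 to 2, but only from 4 to 2.5 when K goes from 2 to 3. The bend at K=2 is the elbow for this data.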

The KMeans class has a fit method that trains the model and an inertia_
attribute that holds the SSD of the trained model. Code Snippet 11 calls fit for
each number of clusters, uses the append method of the Python list to collect
the inertia values, and then plots them on a graph. The graph can be observed
to arrive at the elbow point.

Code Snippet 11:

from sklearn.cluster import KMeans

inertias = []
for i in range(1, 10):
    # Train a model with i clusters and record its SSD (inertia)
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(x)
    inertias.append(kmeans.inertia_)

# Plot the SSD against the number of clusters
plt.plot(range(1, 10), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

Figure 14.19 shows the graph and the elbow point.

Figure 14.19: Plotting of Points in Graph with Elbow Method

In Figure 14.19, the elbow point is at 3 and so K=3. Thus, there must be three
major cluster types.

• Scale the Data Using Feature Scaling


Scaling brings the data into a common range so that features can be compared.
Feature scaling is a technique that transforms the values of different features
into a common range, such as [0, 1] or [-1, 1]. This helps to avoid features with
large values dominating the model and improves the performance of the K-means
clustering algorithm. Feature scaling can be done in different ways,
such as min-max normalization, standardization, or robust scaling.
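
Min-max normalization maps the smallest value of a feature to 0 and the largest to 1. The sketch below shows the calculation for one feature. The income values are hypothetical, and the function assumes that the values are not all equal.

```python
def min_max_scale(values):
    """Rescale values linearly so that min becomes 0 and max becomes 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical monthly incomes in dollars
incomes = [2000, 3500, 5000, 8000]
print(min_max_scale(incomes))
```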

Code Snippet 12 shows the code to scale the data using the min-max
normalization. The fit_transform method in the MinMaxScaler class does
the scaling of data.

Code Snippet 12:

from sklearn.preprocessing import MinMaxScaler

# Scale each column of x to the range [0, 1]
ms = MinMaxScaler()
x = ms.fit_transform(x)
x

Figure 14.20 shows the output for the code.

Figure 14.20: Scaling the Data

• Train the Model

The model must be trained with the scaled data. Code Snippet 13 shows the
code to train the model.

Code Snippet 13:

kmeans = KMeans(n_clusters=3)
y_predict= kmeans.fit_predict(x)
y_predict

Figure 14.21 shows the output for the code.

Figure 14.21: Train the Model

• Visualize the Cluster


The three values 0, 1, and 2 in the array show that there are three clusters. They
are plotted on the graph in red, blue, and green. The yellow-colored
circle is the centroid. Code Snippet 14 contains the code to visualize these
clusters as a scatter plot.

Code Snippet 14:
plt.scatter(x[y_predict==0, 0],
x[y_predict==0, 1], s=100, c='red',
label ='Cluster 1')
plt.scatter(x[y_predict==1, 0],
x[y_predict==1, 1], s=100, c='blue',
label ='Cluster 2')
plt.scatter(x[y_predict==2, 0],
x[y_predict==2, 1], s=100, c='green',
label ='Cluster 3')
plt.scatter(kmeans.cluster_centers_[:,
0], kmeans.cluster_centers_[:, 1],
s=300, c='yellow', label = 'Centroids')
plt.title('Customers')
plt.xlabel('Monthly earnings($)')
plt.ylabel('Savings Score(1-10)')
plt.legend()
plt.show()

Figure 14.22 shows the clusters and the centroids which are the center point
for each cluster.

Figure 14.22: Visualize the Cluster

14.4 Reinforcement Learning

Reinforcement Learning (RL) is an autonomous, self-learning method in
machine learning in which the system learns by trial and error. RL is
inspired by the way humans and animals learn from their own actions and the
rewards or penalties they receive. Reinforcement learning agents interact with
an environment and learn an optimal policy that maximizes a long-term
objective. Some examples of reinforcement learning applications are
self-driving cars, game-playing, robotics, and recommendation systems. Figure
14.23 shows a simple maze game that uses reinforcement learning. In this
game, the dog has to find the best path to reach the bone without being
stopped by the hurdle, which is a fence.

Figure 14.23: Reinforcement Learning

RL is the science of decision-making. It is about learning what to do in a given
situation. In RL, the model is not given a prepared dataset as in supervised and
unsupervised learning. The objective is to maximize the outcome in the given
environment. Each positive step towards the goal is rewarded and each negative
step is given a penalty. The outcomes are analyzed to determine if the decisions
made were correct, neutral, or incorrect. Based on the analysis, the feedback
is taken and the subsequent actions that must be taken are decided.

The input starts from the initial start point of the model. The output can be
various ways to achieve the target. The model gets trained by learning or
unlearning the path based on the output. Learning and unlearning occurs
continuously. The solution is based on the optimum result.

Reinforcement learning algorithms find their place in AI applications and
gaming applications. The commonly used reinforcement learning algorithms
are Q-Learning, State-Action-Reward-State-Action (SARSA), and Deep
Q-Network (DQN).
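
As an illustration of trial-and-error learning, the sketch below applies Q-Learning to a simplified, hypothetical version of the maze: a corridor of five positions with the dog starting at position 0 and the bone at position 4. The reward values, learning rate, discount factor, and number of episodes are all assumptions chosen for the example.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

N_STATES = 5            # positions 0 to 4; the bone is at position 4
GOAL = N_STATES - 1
ACTIONS = (-1, +1)      # index 0 moves left, index 1 moves right
ALPHA, GAMMA = 0.5, 0.9  # assumed learning rate and discount factor

# Q-table: one row per position, one value per action
q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(300):                     # episodes
    state = 0
    for _ in range(200):                 # step limit per episode
        action = random.randrange(2)     # explore with purely random moves
        next_state = min(max(state + ACTIONS[action], 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0

        # Q-Learning update: move the estimate toward the reward
        # plus the discounted best value of the next position
        best_next = 0.0 if next_state == GOAL else max(q[next_state])
        q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

        state = next_state
        if state == GOAL:
            break

# Greedy policy: the better-valued action at each non-goal position
policy = ['left' if row[0] > row[1] else 'right' for row in q[:GOAL]]
print(policy)
```

Although the dog moves at random during training, the values it accumulates favor moving right at every position, which is the shortest path to the bone.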

The two types of reinforcement learning are positive reinforcement learning
and negative reinforcement learning.

14.4.1 Positive Reinforcement Learning

In positive reinforcement learning, the model is rewarded when an action leads
to a desirable outcome. The reward strengthens that behavior and makes the model
more likely to repeat it. It has a positive effect on the
model. In Figure 14.23, the dog decides the path to take to reach its goal. The
path shown by the green arrow is the correct path.

Positive reinforcement learning increases the performance of the model. The
learning sustains the result for a long time. However, too much positive
reinforcement increases the load on the model and the performance may be
diminished.

14.4.2 Negative Reinforcement Learning

In negative reinforcement learning, the model learns from undesirable outcomes,
and this also increases the strength of the model. A typical example of negative
reinforcement is leaving home for work early to avoid traffic jams. In Figure
14.23, the path to the hurdle is shown in red, and the model learns not to take
this route to achieve the target.

Negative reinforcement helps in learning by removing the undesirable
consequences. It provides protection to a minimum standard of outcome.
However, this minimum standard of outcome is learned to be the maximum
output by the model. Hence, optimum level may not be reached as the model
gets satisfied with the minimum or safe output.

14.5 Summary

⮚ Supervised learning uses labeled datasets to classify or predict new data.
⮚ Regression and classification are the two types of supervised learning.
⮚ Classification learning has two types: binary and multiclass
classification.
⮚ Unsupervised learning makes the machine learn on its own from unlabeled
data by identifying the class under which each data point falls.
⮚ Clustering and associative learning are the two types of unsupervised
learning.
⮚ Reinforcement learning is a method by which the machine learns by trial
and error.
⮚ Positive and negative reinforcement are the two types of reinforcement
learning.

Test Your Knowledge

1. A data scientist is trying to build a system that can tell if an online
payment is fake or real. Past data exists which clearly demarcates the
fake payments from the real payments. What type of machine learning
technique should the data scientist use for this project?

a. Unsupervised learning
b. Reinforcement learning
c. Supervised learning
d. Semi-supervised learning

2. A researcher is examining a dataset including animal photos. The
researcher is unsure of the species of each animal shown in the
photographs. The idea is to group together similar animals based on their
visual characteristics. What type of unsupervised learning problem is the
researcher working on in this scenario?

a. Clustering
b. Regression
c. Classification
d. Reinforcement learning

3. What is the difference between binary and multiclass classification?

a. Binary classification involves only one class and multiclass involves
multiple classes
b. Binary classification uses regression algorithms and multiclass uses
decision trees
c. Binary classification predicts continuous values and multiclass
predicts categorical values
d. Binary classification has one or two output categories and
multiclass has several

4. What does logistic regression predict in classification?

a. Continuous output values


b. Categorical output variables
c. Complex model relationships
d. Unlabeled data



5. What is the purpose of the training data in supervised learning?

a. To test the accuracy of the machine


b. To validate the training process of the algorithm
c. To provide a reference for the machine to learn from labeled
examples
d. To make predictions on new data directly



Answers to Test Your Knowledge

1 c
2 a
3 d
4 b
5 c



Try it Yourself

1. You are a data scientist working for an agriculture company that
specializes in tree care and management. Your company has collected
data on various trees in their care, including details about the age and
height of each tree. Your task is to gain insights into the distribution of
trees across different clusters, which can assist the company in
optimizing its tree care strategies. Create a Tree_Information.csv
using the dataset given in Table 14.3.

Table 14.3: Data for Tree_Information.csv

a. Import the necessary libraries and import the dataset.


b. Load the dataset.
c. Extract independent variables.
d. Find the optimal number of clusters in K Means using the elbow
method.
e. Apply feature scaling using Min-Max scaling.
f. Train the model using the K Means algorithm.
g. Visualize the cluster using different colors.

2. As a data scientist, you are engaged by a leading online retailer to
develop a predictive model that can forecast whether a user is inclined
to make a purchase. Your task is to create a dependable model that
assists in enhancing marketing strategies and propelling revenue
expansion. Create a Purchase_Item.csv using the dataset given in
Table 14.4.

Table 14.4: Data for Purchase_Item.csv

a. Import necessary libraries and import the dataset.


b. Load the dataset.
c. Load the preprocessed training and testing data.
d. Build a logistic regression model.
e. Use the trained model to make predictions.
f. Compute and print the confusion matrix.
g. Calculate and print the accuracy of the model.



Learning Objectives

In this session, students will learn to:

 Describe the regression techniques in Python for machine learning


 Explain different types of regression techniques
 Describe how to implement the Linear Regression technique

Regression is a method of finding the relationship between two or more factors
that affect a statistical outcome. Regression analysis finds its use in the field
of Machine Learning (ML) where forecasting or prediction is essential. The
technique includes finding the dependent and independent variables in a
dataset and establishing a relationship between them, which helps in
forecasting. A typical example is finding the relationship between age,
gender, and height in children, where, as age increases, the height also
increases. Here, age and gender are independent factors, whereas height is a
dependent factor.



15.1 Introduction to Regression Techniques in Python

Regression techniques can be implemented using Python. For this purpose,
libraries such as sklearn and pandas are available in Python. These libraries
provide several functions that can be used to read data, classify data,
manipulate data, analyze data, and model data. These functions can be used to
develop algorithms and workflows to train the machine. Once the machine is
trained, it can help businesses with predictions or forecasting, which are
crucial for making informed investment or disinvestment decisions.

Forecasting finds its use in many fields including medicine, technology,
economics, and social sciences.

Different types of regression techniques that can be implemented using Python
are:

15.2 Linear Regression

The word ‘Linear’ comes from the word, ‘Line’. In this method, the data points
for the independent and dependent variables are plotted on the X and Y axes,
respectively, in a graph. Then, a line is drawn to find the best possible
relationship between the two variables. The line is drawn in such a way that
the data points are as close to the line as possible.



The mathematical representation of linear regression is:

y = a0 + a1x + E

Table 15.1 describes the variables in this representation.

Variable    Meaning
y           Dependent variable (value on the line of regression)
a0          Intercept of the line
a1          Linear regression coefficient (slope of the line)
x           Independent variable
E           Random error

Table 15.1: Variables of Linear Regression

Figure 15.1 helps in understanding the line of regression. Here, y is the
dependent variable and x is the independent variable. a0 is the intercept, and
a1 is the coefficient of x, which can be a negative or positive number. The
coefficient is an important factor in defining the slope of the line. It
determines how much y changes for each unit change in x. E is the error
factor, which is possible in any statistical analysis.

Figure 15.1: Linear Regression



The aim is to minimize the distance between the data points and the line of
regression. Here, the coefficient plays an important role. For each
coefficient, the slope is different. The aim is to find the optimal coefficient
so that the best-fit line is achieved. A cost function, which measures the error
between the predicted and actual values, helps in determining the best-fit
line. The best-fit line is the one that minimizes this error.
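The idea of choosing the intercept and coefficient that minimize a cost function can be sketched as follows. This is a minimal illustration, assuming mean squared error as the cost function and the standard closed-form least-squares estimates; the data values and the helper name cost are hypothetical and are not taken from this session.

```python
import numpy as np

# Hypothetical sample data: x is the independent variable, y the dependent one
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def cost(a0, a1, x, y):
    """Mean squared error between the line a0 + a1*x and the observed y."""
    return np.mean((y - (a0 + a1 * x)) ** 2)

# Closed-form least-squares estimates of the coefficient (slope) and intercept
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print("Intercept a0:", a0)
print("Coefficient a1:", a1)
print("Cost of the best-fit line:", cost(a0, a1, x, y))
print("Cost of another line (a0=0, a1=1):", cost(0.0, 1.0, x, y))
```

Any other choice of a0 and a1 gives a cost that is equal to or higher than the cost of the least-squares line on the same data.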

The metrics that are commonly used to evaluate linear regression are:

Two types of linear regression are:

15.3 Polynomial Regression

When there is no linear relationship between the input and output variables,
the graph obtained is not a line but a curve. In such cases, linear regression is
not enough to capture the complexity of the relationship between the
variables. This is where polynomial regression comes into use. For example, the
number of likes that a post gets does not change linearly with time. It could
happen that the number of likes is high in the beginning and then starts
reducing. It could also happen that the likes are few in the beginning,
increase as the days go by, then start reducing, and finally become steady. In
such cases, polynomial regression will provide a better prediction than linear
regression.

The mathematical representation of polynomial regression is:

y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n

Table 15.2 describes the variables in this representation.

Variable    Meaning
y           Line of Regression
b0          0th coefficient
b1          1st coefficient
bn          nth coefficient
x           Input variable

Table 15.2: Variables of Polynomial Regression

Figure 15.2 shows a model of the polynomial regression curve. The trend is not
linear. It goes downward in the initial stage and then picks up momentum after
some time. The coefficients decide the path of the curve. The relationship
between the dependent and independent variables defines the degree of the
polynomial (denoted by n). A higher degree of the polynomial can be used to fit
the data more closely. However, if the degree is too high, the polynomial
regression curve can overfit the data. This, in turn, will result in new data
not being represented correctly.
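The effect of the degree of the polynomial can be sketched with NumPy. This is only an illustration under assumptions: the likes-per-day values are hypothetical, and np.polyfit is used here as one convenient way to fit a polynomial; the session itself does not prescribe a particular function.

```python
import numpy as np

# Hypothetical data: likes per day rise, peak, and then fall
days = np.arange(1, 11, dtype=float)
likes = np.array([5, 12, 20, 26, 30, 31, 29, 24, 17, 8], dtype=float)

def fit_and_score(degree):
    """Fit a polynomial of the given degree and return its coefficients and MSE."""
    coeffs = np.polyfit(days, likes, degree)   # highest power first
    predicted = np.polyval(coeffs, days)
    mse = np.mean((likes - predicted) ** 2)
    return coeffs, mse

_, mse_linear = fit_and_score(1)
_, mse_quadratic = fit_and_score(2)

print("MSE with a straight line (degree 1):", mse_linear)
print("MSE with a curve (degree 2):", mse_quadratic)
```

The error on the training data keeps falling as the degree is raised, which is why a held-out test set is needed to notice overfitting.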



Figure 15.2: Polynomial Regression Curve

There are a few evaluation metrics that assess the accuracy of the model.
These metrics provide insights into how the model fits the data. These insights
help in comparing different models.

Some of the metrics used for the evaluation of polynomial regression are:



15.4 Decision-Tree Regression

A decision-tree is a graphical representation for getting all the possible
solutions to a problem or a decision based on given conditions. It mimics how
humans think when making a decision.

A decision tree starts with a question posed on a dataset. Based on the answer
to the question, which is usually Yes or No, the dataset is split into smaller units.
Then, questions are posed to these smaller units, which are further split into
smaller units. This process continues until a decision is taken.

Figure 15.3 shows the structure of the decision-tree.

Figure 15.3: Decision-tree Regression

In Figure 15.3, the root node is indicated by the blue box. This node is also
known as the parent node. Branches or child nodes, indicated by the orange
boxes, stem out of the root, based on the decision taken. These child nodes
may again become a parent node for decisions to be taken from there,
thereby, creating sub-trees. The leaf nodes are the end nodes, and they are
indicated by the green boxes. These are the outcomes or results of the decision
tree.

The algorithm used to implement the decision tree is called the Classification
and Regression Tree (CART) algorithm. The two methods used to arrive at the
decision in a decision tree are Splitting and Pruning. Splitting is the process of
dividing the root node into child nodes and splitting child nodes into further
sub-nodes until a decision is reached. Pruning is the process of removing
unwanted branches from the tree.
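Splitting and pruning can be sketched with DecisionTreeRegressor from sklearn. This is a minimal sketch under assumptions: the data is randomly generated for illustration, and cost-complexity pruning (the ccp_alpha parameter) is chosen here as one pruning approach that sklearn offers; the session does not name a specific pruning method.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data generated for illustration
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=80)

# Splitting only: the tree keeps splitting until the leaves cannot be split further
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Splitting followed by pruning: branches that add little value are removed
pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=0.01).fit(X, y)

print("Leaf nodes without pruning:", full_tree.get_n_leaves())
print("Leaf nodes with pruning:", pruned_tree.get_n_leaves())
```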

Steps involved in CART are:

Attribute Selection Measures (ASM)

The accuracy of the decision-tree regression lies in the selection of the right
attribute at the root node or at the various child nodes of the tree. The selection
of the attribute is complex because a dataset can have n attributes. Out of
these n attributes not all of them will be important to the decision that must be
made. Randomly selecting an attribute can lead to bad results.

Therefore, to aid the selection of the right attribute, some solutions have been
identified. These solutions include:

Using these solutions, the value for each attribute can be calculated. Based
on these values, the attributes can be sorted. Then, the attributes can be
placed in the tree in descending order of their values. That is, the attribute with
the highest value can be placed at the root node and the lower values can
be placed at different levels of child nodes. In case of information gain, the
attributes are assumed to be categorical, while in case of Gini index, the
attributes are assumed to be continuous.
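The calculation of a value for each attribute can be sketched in plain Python. This is an illustrative sketch, assuming the usual definitions of entropy, information gain, and Gini index; the toy dataset, the attribute names, and the helper functions are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy dataset with two categorical attributes
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
root = max(gains, key=gains.get)   # attribute with the highest value goes to the root

print("Information gain per attribute:", gains)
print("Attribute chosen for the root node:", root)
print("Gini index of the labels:", gini(labels))
```

In this toy data, outlook separates the labels perfectly while windy does not separate them at all, so outlook is placed at the root node.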

15.5 Random-Forest Regression

Random-forest regression employs several decision trees to make predictions.
It splits the dataset into multiple subsets and creates a decision-tree for each
subset. It then takes the outcome of each tree and averages them to make a
prediction. (In classification, the prediction is instead made on the basis of
the majority of the outcomes.) Combining trees in this way generally improves
the accuracy of the predictions.

In general, the more decision-trees used, the more stable the prediction
becomes, and adding trees does not by itself increase the risk of overfitting.
The concept of random-forest regression can be depicted as shown in Figure 15.4.

Figure 15.4: Random-forest Regression

As shown in Figure 15.4, in random-forest regression, each decision-tree is
built independently of the others. Therefore, the result of each decision-tree
is independent of the others. Because the trees are independent, they can be
trained in parallel, which helps keep the training time down. This regression
method provides accurate predictions with large datasets.

There are two phases in the implementation of random-forest regression. The
first one is to combine the decision-trees to create a random forest. The
second is to combine the results from each decision-tree to arrive at the final
prediction.
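The two phases can be sketched with RandomForestRegressor from sklearn. This is a minimal sketch on randomly generated, hypothetical data; it shows that the prediction of the forest matches the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data generated for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# Phase 1: build a forest of 20 independent decision trees
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Phase 2: combine the result of each tree by averaging
X_new = np.array([[2.5], [7.0]])
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_new)

print("Average of the individual trees:", manual_average)
print("Prediction of the forest:", forest_prediction)
```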



15.6 Support-vector Machine Algorithm

The Support-Vector Machine (SVM) algorithm is one of the most popular
Supervised Learning algorithms. SVM aims at creating a best-fit boundary line
between two classes of data. For example, in a dataset of four-wheelers, it will
create a boundary line between the features that identify sedans and those
that identify SUVs. This boundary segregates an n-dimensional dataset into
classes so that a new data point is placed in the correct category. This best
decision boundary is called a hyperplane. A hyperplane divides the classes of
data so as to minimize the errors.

Figure 15.5 shows how data is classified using hyperplanes.

Figure 15.5: Support Vector Machine

In Figure 15.5, there are two classes, one indicated by the blue circles and the
other indicated by the green circles. When a new data point comes in, it can be
placed in the blue or green group based on its attributes. The hyperplane
maintains the maximum distance from both classes. The support vectors are the
data points of each class that lie closest to the hyperplane, and they define
the boundaries on both sides of it. The distance between the hyperplane and the
support vectors is called the margin. The distance between the two boundaries
defined by the support vectors is the maximum margin. The blue circle on the
side of the green circles is called an outlier. In this figure, the SVM
algorithm does not take the outlier into consideration when placing the
hyperplane. The aim of the SVM algorithm is to maximize the margin between the
hyperplane and the support vectors.
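The hyperplane, support vectors, and margin can be sketched with the SVC class from sklearn. This is an illustrative sketch, assuming a linear kernel and a small hypothetical two-class dataset; the expression 2 / ||w|| is the standard width of the margin for a linear SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data: class 0 near (1.5, 1.5), class 1 near (6.5, 6.5)
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [6, 6], [6, 7], [7, 6], [7, 7]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A linear kernel gives a straight-line hyperplane in two dimensions
model = SVC(kernel="linear", C=1.0).fit(X, y)

w = model.coef_[0]                        # orientation of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two boundaries

# Place two new data points into one of the classes
prediction = model.predict([[2.0, 3.0], [6.5, 5.5]])

print("Support vectors:\n", model.support_vectors_)
print("Margin width:", margin_width)
print("Predicted classes for the new points:", prediction)
```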

15.7 Example of Implementation of Linear Regression

Here is an example of how linear regression is implemented in Python. This
example aims at studying the relationship between temperature and the sales
of ice cream.

Figure 15.6 shows sales data for the first 15 days of a summer month at a
particular ice cream shop.

Figure 15.6: Ice Cream Sales Data

Consider that this data is stored in a CSV file named sales_icecream.csv,
which is stored in the C:\SalesData folder. Download the
sales_icecream.csv file present under Course Files on Onlinevarsity.

Step 1: The first step is to load the data. The data in the CSV file is loaded by
executing the code in Code Snippet 1.



Code Snippet 1:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# The raw string (r"...") keeps the backslashes in the Windows path as they are
df = pd.read_csv(r"C:\SalesData\sales_icecream.csv")
df

In this code, the pandas, numpy, and matplotlib libraries are imported and
the data is read into the dataframe (df) using the function read_csv.

Figure 15.7 shows the output of Code Snippet 1.

Figure 15.7: Output of Code Snippet 1

Step 2: The next step is to pre-process the data, where the data is cleaned of
duplicate data, missing data, and outliers.

In the sales data used for this example, there are no missing or duplicate data.
There are also no outliers. So, proceed to the next step.
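The checks for missing and duplicate data described in Step 2 can be sketched with pandas. This is a minimal sketch on a small hypothetical DataFrame that deliberately contains one missing value and one duplicate row; it is not the data in sales_icecream.csv, and outlier detection is not covered here.

```python
import pandas as pd

# Hypothetical data with one missing value and one duplicate row
df_check = pd.DataFrame({
    "Temperature (Celsius)": [30, 32, 32, None, 35],
    "Ice-cream Sales (kg)": [20, 24, 24, 26, 30],
})

missing_per_column = df_check.isnull().sum()   # count of missing values in each column
duplicate_rows = df_check.duplicated().sum()   # count of rows that repeat an earlier row

# Remove the duplicate rows and the rows that have missing values
cleaned = df_check.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(cleaned)
```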

Step 3: The third step is to separate the dependent and independent variables.
In this case, Temperature (Celsius) is the independent variable and
Ice-cream Sales (kg) is the dependent variable. The variables are
separated by executing the code in Code Snippet 2.

Code Snippet 2:

x = df[['Temperature (Celsius)']]
y = df['Ice-cream Sales (kg)']
x

In this code, the values of Temperature (Celsius) are loaded into variable
x and the values of Ice-cream Sales (kg) are loaded into variable y. Then,
the values stored in variable x are printed.

Figure 15.8 shows the output of Code Snippet 2.

Figure 15.8: Output of Code Snippet 2

Step 4: After the data is separated into independent and dependent variables,
it should be split for training and testing purposes. Generally, the data is split in
the ratio of 70:30, where 70% of the data is used for training purposes and 30%
data is used for testing purposes. The data can be split into the required ratio
using Code Snippet 3.



Code Snippet 3:

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size = 0.3, random_state = 0)
x_train

In this code, four variables are being created, two for the training dataset and
two for the testing dataset. The testing dataset size is specified as 0.3, which
refers to 30%. The remaining will be used for the training dataset. Then, the
training dataset of the x variable is printed.

Figure 15.9 shows the output of Code Snippet 3.

Figure 15.9: Output of Code Snippet 3

Step 5: Next, the model must be trained with dependent and independent
data. The function fit() of LinearRegression helps in training the model
with data.
The model is trained using the code in Code Snippet 4.
Code Snippet 4:

from sklearn.linear_model import LinearRegression


sim_linear = LinearRegression()
sim_linear.fit(x_train, y_train)
print("Intercept: ", sim_linear.intercept_)
print("Coefficient: ", sim_linear.coef_)

This code trains the model and prints the intercept and coefficient. In the linear
equation, y = a0 + a1x + E, a0 is the intercept, a1 is the coefficient, and E is
the error factor, which is included in any statistical prediction.

Figure 15.10 shows the output of Code Snippet 4.

Figure 15.10: Output of Code Snippet 4

Step 6: The next step is to predict using the test data. Now that the model is
trained with the training data, it can be used to predict using the data in
x_test, which is the independent testing dataset. Prediction can be made using
the code in Code Snippet 5.

Code Snippet 5:

y_pred= sim_linear.predict(x_test)
sim_linear_diff = pd.DataFrame({'Actual value': y_test,
'Predicted value': y_pred})
sim_linear_diff

This code makes the prediction and stores the results of the prediction in the
y_pred variable. This variable is then compared with the data in the y_test
variable, which is the actual test data for the dependent variable. In this way,
the accuracy of the model is tested.

Figure 15.11 shows the output of Code Snippet 5.

Figure 15.11: Output of Code Snippet 5

Step 7: Next, the data must be plotted to find the best-fit line of regression. This
can be done using the code in Code Snippet 6.



Code Snippet 6:

plt.scatter(x_test,y_test,s=40,color='green')
plt.scatter(x_test,y_pred,s=40,color='red')
plt.plot(x_test, y_pred, 'blue')
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Ice-cream Sales (kg)')
plt.show()

Figure 15.12 shows the output of Code Snippet 6.

Figure 15.12: Output of Code Snippet 6

The code plots the test data in green dots and the predicted values in red dots.
The blue line is the line of regression, which is also called the line of best fit.

There are five data points in green. There seem to be only four data
points in red because two of the predicted values are the same and
hence, their dots overlap.



Step 8: Having tested the model, it must now be evaluated based on the test
data and the predicted data that it has produced. This evaluation can be
done using the code in Code Snippet 7.

Code Snippet 7:

from sklearn import metrics

mAbsErr = metrics.mean_absolute_error(y_test, y_pred)
mSqaErr = metrics.mean_squared_error(y_test, y_pred)
# score() returns R squared as a fraction; here it is computed on the full dataset (x, y)
print('R squared: {:.2f}'.format(sim_linear.score(x, y) * 100))
print('Mean Absolute Error:', mAbsErr)
print('Mean Square Error:', mSqaErr)

Figure 15.13 shows the output of Code Snippet 7.

Figure 15.13: Output of Code Snippet 7

The R squared value of 87.30 indicates that the model explains about 87% of
the variation in ice-cream sales.

The Mean Absolute Error value is the average absolute difference between the
predicted values and the actual values. This value should be low. The lower the
Mean Absolute Error value, the more accurate is the prediction made by the model.
A perfect predictor will have a Mean Absolute Error of 0.
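How these evaluation values are calculated can be sketched with NumPy. This is an illustrative sketch using the standard formulas; the actual and predicted values are hypothetical and are not the output of the ice-cream model.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))    # Mean Absolute Error
mse = np.mean(errors ** 2)       # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error

# R squared: 1 minus (unexplained variation / total variation)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R squared:", r2)
```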



15.8 Summary

 Regression is a method of finding the relationship between two or more
factors that affect a statistical outcome.
 Regression is used for business forecasting or predictions.
 To implement the regression technique, Python provides libraries such as
sklearn and pandas.
 Various types of regression techniques include Linear, Polynomial,
Decision-tree, Random-forest, and Support-vector machine.
 In Linear regression, there is a line, which is sloped and that defines the
relationship between factors.
 In Polynomial regression, the relationship between factors is not straight
and hence, the graph obtained may be curved.
 Decision-tree regression imitates how a human mind thinks when a
decision must be made.
 Random-forest regression takes the outcome from multiple decision
trees and averages the outcomes to arrive at the best possible solution.
 Support-vector machine is based on classifying the data and drawing
boundaries.



Test Your Knowledge

1. Linear Regression comes under which technique?

a. Supervised learning
b. Reinforcement learning
c. Unsupervised learning
d. Semi-supervised learning

2. Which type of Linear Regression uses a single independent variable to
predict the value of a numerical dependent variable?

a. Simple Linear Regression


b. Multiple Linear Regression
c. Cost function optimization
d. R Squared (R2)

3. You are a data scientist working for a car dealership. Your job is to
analyze the factors that influence the prices of used cars in your region.
You have collected data on various attributes such as the age of the
car, mileage, brand, and the number of previous owners. You intend to
use linear regression to build a model that predicts the prices of used
cars based on these attributes. What type of linear regression would be
most appropriate for your analysis?

a. Simple Linear Regression


b. Multiple Linear Regression
c. Ridge Regression
d. Logistic Regression

4. Which evaluation metric for polynomial regression measures the
proportion of the variance in the dependent variable that is explained
by the independent variable(s) in the model?

a. Mean Squared Error (MSE)


b. Root Mean Squared Error (RMSE)
c. R-squared (R2) Score
d. Mean Absolute Error (MAE)

5. Which of the following is not an Attribute Selection Measure used in
Decision-Trees?

a. Entropy
b. Information gain
c. Gini index
d. Random selection



Answers to Test Your Knowledge

1 a
2 a
3 b
4 c
5 d



Try it Yourself

1. You are an agricultural researcher working with a group of farmers. These
farmers are keen to optimize their crop yields. To do this, they want to
understand the relationship between the amount of fertilizer used and
the corresponding water requirements. Your task is to build a predictive
model to estimate the water requirements based on the amount of
fertilizer applied. Create a Fertilization_information.csv using the
dataset given in Table 15.3.

Table 15.3: Data for Fertilization_information.csv

a. Import the necessary libraries and import the dataset.


b. Load the dataset.
c. Apply preprocessing techniques.
d. Use Fertilizer as the input feature (X) and Crop_Yield as the
target variable(Y).
e. Split the dataset into training and testing sets in a 70:30 ratio.
f. Create a linear regression model.
g. Train the linear regression model.
h. Find the intercept and coefficients of the linear regression model.
i. Use the trained model to make predictions.
j. Compare the actual and predicted value.
k. Find a line of best fit that represents the best approximation of a
scatter plot of data points.
l. Calculate and print the R-squared value, Mean Absolute Error
(MAE), and Mean Squared Error (MSE).



Appendix - A



Here is a Python application, Blossom Buddy, that provides gardening tips
based on certain inputs. This application uses:
 MySQL to store the database in the backend.
 Hypertext Markup Language (HTML) with Cascading Style Sheets (CSS) in
the frontend.
 Python to build the interaction between the database and the frontend.

Figure 1 shows the Welcome screen of the Blossom Buddy application.

Figure 1: Welcome Screen


Figure 2 depicts the input and output of the application.

Figure 2: Input and Output

Figure 3 shows that the application takes input from the user through the
Welcome page. The user enters the information on the Welcome page and
clicks the Get Gardening Tips button. The application uses the data entered to
fetch gardening tips from the plant_data database. Then, the gardening tips
are displayed on the Tips page.

Figure 3: Blossom Buddy Application

plant_data Database
The plant_data database has a single table named plant. Table 1 shows the
structure of the plant table. This table also has an id column as the primary
key.

Column Name            Datatype       Description
Plant_Name             VARCHAR(50)    Name of the plant
Location               VARCHAR(50)    The location in which the plant grows
                                      (Indoor plant/Outdoor plant)
Climate                VARCHAR(50)    The climate that suits the plant
                                      (Dry/Humid/Mild/Moderate/Mediterranean/Hot/Warm)
Watering_Reminder      VARCHAR(50)    Text that indicates how frequently the plant
                                      must be watered
Fertilizing_Reminder   VARCHAR(50)    Text that indicates how frequently the plant
                                      must be fertilized
Other_Care_Reminder    VARCHAR(50)    Text that indicates how to care for the plant
Tips                   VARCHAR(255)   Tips for healthy growth of the plant

Table 1: plant Table Structure

Table 2 shows the data for the plant table.

Table 2: plant Data

Getting the Backend Database Ready

Let us create the plant_data database and the plant table in MySQL. Then,
the data in Table 2 must be inserted into the plant table.

1. Open MySQL Workbench.
2. Create the plant_data database in MySQL using the command:

CREATE DATABASE plant_data;

3. Create the plant table using the command in Code Snippet 1.

Code Snippet 1:

CREATE TABLE plant (
id INT AUTO_INCREMENT PRIMARY KEY,
Plant_Name VARCHAR(50),
Location VARCHAR(50),
Climate VARCHAR(50),
Watering_Reminder VARCHAR(50),
Fertilizing_Reminder VARCHAR(50),
Other_Care_Reminder VARCHAR(50),
-- 255 characters, because the tips inserted later are longer than 50 characters
Tips VARCHAR(255)
);



4. Insert rows into the plant table using the commands in Code Snippet
2.

Code Snippet 2:

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Philodendron', 'Indoor plant', 'Moderate', 'Every 7 days',
    'Once a month', 'Trim brown leaves',
    'Keep your Philodendron in bright, indirect light, water when the top inch of soil is dry, and maintain moderate humidity for optimal growth');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Pothos', 'Indoor plant', 'Humid', 'Every 5 days',
    'Every 2 weeks', 'Wipe leaves',
    'Pothos plants thrive in indirect light, require infrequent watering, and are easy to care for');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('ZZ Plant', 'Indoor plant', 'Dry', 'Every 14 days',
    'Once in 2 months', 'Avoid overwatering',
    'Allow the soil to dry out completely between waterings to prevent overwatering');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Rose', 'Indoor plant', 'Moderate', 'Every 5 days',
    'Once a month', 'Trim long stems',
    'Plant roses in well-drained soil with plenty of sunlight and regular pruning for healthy growth and vibrant blooms');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Lavender', 'Outdoor plant', 'Mild', 'Every 3 days',
    'Every 2 weeks', 'Deadhead blooms',
    'Plant lavender in well-drained soil with plenty of sunlight for optimal growth and fragrance');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Zinnia', 'Outdoor plant', 'Warm', 'Every 2 days',
    'Once a month', 'Support tall stems',
    'Plant Zinnias in well-draining soil, provide full sun, and deadhead spent blooms for continuous colorful flowers');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Basil', 'Outdoor plant', 'Hot', 'Every day',
    'Every 2 weeks', 'Prune suckers',
    'Plant basil in well-draining soil with plenty of sunlight and water it consistently to keep the soil moist but not waterlogged');

INSERT INTO plant (Plant_Name, Location, Climate, Watering_Reminder,
    Fertilizing_Reminder, Other_Care_Reminder, Tips)
VALUES ('Rosemary', 'Outdoor plant', 'Mediterranean', 'Every 4 days',
    'Once a month', 'Trim after flowering',
    'Plant rosemary in well-draining soil and provide plenty of sunlight for optimal growth');

5. To view data in the plant table, use the command:

SELECT * FROM plant;

Creating the Frontend Pages


This application has two HTML pages: Welcome.html and Tips.html.
The Welcome.html page is the home screen of the application that takes input
from the user. This page is styled using the styles.css.

Code Snippet 3 lists the code for Welcome.html page. This page uses a text
box for Location and drop-down lists for Climate and Preferred plants.



Code Snippet 3:

<!DOCTYPE html>
<html>
<head>
<title>Gardening Application</title>
<link rel="stylesheet" href="{{url_for('static',
filename='CSS/styles.css')}}">

</head>
<body>
<div>
<h1>Welcome to Blossom Buddy!</h1>
<fieldset>
<legend>Enter Your Plant Info:</legend>
<form action="/tips" method="post">

<label for="location">Location:</label>
<!--Placeholder is used for Location-->
<input type="text" name="location" id="location"
placeholder="Indoor plant or Outdoor plant"
pattern="^[A-Za-z\s]+$"
title="Please use letters and spaces only">
<br>
<!-- <label for="climate">Climate:</label>
<input type="text" name="climate" id="climate">
<br> -->
<!--Drop down used for Climate-->
<label for="climate">Climate:</label>
<select name="climate" id="climate">
<option value="Select">Select climate</option>
<option value="Dry">Dry</option>
<option value="Hot">Hot</option>
<option value="Humid">Humid</option>



<option
value="Mediterranean">Mediterranean</option>
<option value="Mild">Mild</option>
<option value="Moderate">Moderate</option>
<option value="Warm">Warm</option>
</select>

<br>

<!-- <label for="plants">Preferred plants:</label>


<input type="text" name="plants" id="plants"> <br>
-->
<!--Drop down used for preferred plants name-->
<label for="plants">Preferred plants:</label>
<select name="plants" id="plants">
<option value="Select">Select plant
options</option>
<option
value="Philodendron">Philodendron</option>
<option value="Pothos">Pothos</option>
<option value="ZZ Plant">ZZ Plant</option>
<option value="Rose">Rose</option>
<option value="Lavender">Lavender</option>
<option value="Zinnia">Zinnia</option>
<option value="Basil">Basil</option>
<option value="Rosemary">Rosemary</option>
</select>
<br>
<button type="submit" class="btn btn-primary">Get
Gardening Tips</button>
</form>
</fieldset>
</div>

<script>
</script>
</body>
</html>



Code Snippet 4 lists the code for the style sheet, styles.css.

Code Snippet 4:

body {
align-items: center;
background: url("https://cdn.pixabay.com/photo/2016/02/13/16/34/flowers-1198159_1280.jpg");
background-attachment: fixed;
background-repeat: no-repeat;
background-size: cover;
display: grid;
margin: 0;
place-items: center;
width: 300px; /* Set the width of the container */
height: 200px; /* Set the height of the container */
background-position: right; /* Crop from the right side */
}

fieldset {
/* background-color: #bfe5c6; */
font-size: 20px;
width: 550px;
margin-left: 500px;
margin-bottom: 5px;
}

legend {
background-color: rgb(209, 216, 211);
color: rgb(80, 3, 66);
padding: 10px 20px;
margin: 10px 20px;
}

input,select {
margin: 30px;
width: 200px;
height: 30px;
}

div {
color: rgb(97, 7, 85);
width: 550px;
height: 200px;
border: 1px;
padding:150px;
float: right;
margin: 2px;
}
.form {

display: flex;
align-items: center;
justify-content: space-between;
flex-direction: column;
padding: 0 3rem;
height: 100%;
text-align: center;
font-size: 40px;

}
h1{
align-items: center;
margin-left: 500px;
width: 400px;
}
label {

display: inline-block;
width: 150px;
}

button{
font-size: 20px;
position: absolute;
background-color:#3b0338;
color: #fff;
border:none;
border-radius:15px;
padding:15px;
min-height:10px;
min-width: 100px;
}

input.button{
float: right;
}

The Tips.html page displays the gardening tips that the application fetches
from the database based on the input provided. The data displayed includes
Plant_Name, Tips, Watering_Reminder, and Fertilizing_Reminder.
Code Snippet 5 lists the code for the Tips.html page, which also includes its
style sheet.
Code Snippet 5:
<!DOCTYPE html>
<html>
<head>
<title>Gardening Tips</title>
</head>
<style>

body{
/*For background*/
background:
url("https://cdn.pixabay.com/photo/2012/03/02/00/37/background-20823_1280.jpg");
background-attachment: fixed;
background-repeat: no-repeat;
background-size: cover;
display: grid;
height: 100%;
margin: 0;
place-items: center;
font-size: 20px;
}

.table_div{
/*Container for the table that shows the information fetched from MySQL*/
border: 1px;
padding:150px;
float: right;
margin: 5px;
margin-top: 50px;
}

table, th, td {
border: 1px solid white;
border-collapse: collapse;
}
th{
background-color: #96D4D4;
text-align: center;
}
td{
background-color: #d7dbdb;
text-align: center;
}
table.center {
margin-left: auto;
margin-right: auto;
}
/*For the Caption*/
.caption_table{
margin-bottom: 40px;
font-size: 30px;
}

</style>

<body>
<div class="table_div">
<table style="border:1px solid black;margin-left:auto;margin-right:auto;">
<caption class="caption_table">Growing Green: A Guide to Plant Care 🌿</caption>
<tr>
<th>Plant name</th>
<th>Tips</th>
<th>Watering_Reminder</th>
<th>Fertilizing_Reminder</th>

</tr>
{% for tip in plantdet %}

<tr>
<td>{{ tip.Plant_Name }}</td>
<td>{{ tip.Tips }}</td>
<td>{{ tip.Watering_Reminder}}</td>
<td>{{ tip.Fertilizing_Reminder }}</td>

</tr>
{% endfor %}

</table>

</div>
</body>
</html>

Linking the Front-end and Back-end


The Blossom Buddy application uses the input entered on the Welcome.html
page to query the plant_data database. The matching data is then displayed
on the Tips.html page. This process starts when a user provides the input on
the Welcome.html page and clicks the Get Gardening Tips button. Code Snippet 6
lists the Python script, BlossomBuddy.py, which implements this functionality
using the Flask framework.
Code Snippet 6:
import re
from flask import Flask, request, render_template
from flask_mysqldb import MySQL
import MySQLdb.cursors
import nltk
from nltk.tokenize import word_tokenize

app = Flask(__name__)

# Connect to MySQL database
app.config['MYSQL_HOST'] = 'localhost'
app.config['MYSQL_USER'] = 'root'
app.config['MYSQL_PASSWORD'] = 'De!!9373'
app.config['MYSQL_DB'] = 'plant_data'

mysql = MySQL(app)

# Download NLTK data (you only need to do this once)
nltk.download('punkt')

# Define a regular expression pattern for valid location
location_pattern = r'^[A-Za-z\s,]+$'  # Adjust this pattern as needed


@app.route("/", methods=["GET", "POST"])
def home():
    return render_template("Welcome.html")


@app.route("/tips", methods=["GET", "POST"])
def tips():
    if request.method == "POST":
        # Default to an empty string so that a missing field fails validation
        location = request.form.get("location", "")
        climate = request.form.get("climate")
        plant_name = request.form.get("plants")

        # Validate the location using the regular expression pattern
        if not re.match(location_pattern, location):
            error_message = ("Invalid location format. Please "
                             "use letters, spaces, and commas only.")
            return render_template("Welcome.html",
                                   error_message=error_message)

        try:
            cursor = mysql.connection.cursor(MySQLdb.cursors.DictCursor)
            # The %s placeholders keep user input out of the SQL text
            cursor.execute(
                'SELECT Plant_Name, Watering_Reminder, '
                'Fertilizing_Reminder, Other_Care_Reminder, Tips FROM plant '
                'WHERE Location = %s AND Climate = %s AND Plant_Name = %s',
                (location, climate, plant_name,)
            )

            plantdet = cursor.fetchall()
            cursor.close()

            # Check if there are matching records in the database
            if len(plantdet) > 0:
                # Tokenize the Fertilizing_Reminder field
                tokenized_plantdet = []
                for plant in plantdet:
                    fertilizing_reminder = plant['Fertilizing_Reminder']
                    fertilizing_reminder_tokens = word_tokenize(fertilizing_reminder)
                    plant['Fertilizing_Reminder'] = fertilizing_reminder_tokens
                    tokenized_plantdet.append(plant)

                # Render the Tips.html page with the tokenized data
                return render_template("Tips.html",
                                       plantdet=tokenized_plantdet)
            else:
                # No matching records found in the database,
                # so display an error message on the Welcome page
                error_message = ("No gardening tips found for "
                                 "the specified input.")
                return render_template("Welcome.html",
                                       error_message=error_message)

        except Exception as e:
            # Handle database errors (e.g., connection error)
            error_message = f"Database error: {str(e)}"
            return render_template("Welcome.html",
                                   error_message=error_message)

    else:
        # Handle the case when the form is not submitted
        return render_template("Welcome.html")


if __name__ == "__main__":
    app.run(debug=True)

The code in Code Snippet 6 uses:


• A regular expression on Location to check that the input has the correct format.
• Tokenization on Fertilizing_Reminder to separate the text into words.
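
To make these two techniques concrete, here is a minimal sketch that is not part of BlossomBuddy.py. The pattern is the location_pattern from Code Snippet 6. The helper names is_valid_location and simple_tokenize are hypothetical, and simple_tokenize is only a simplified regular-expression stand-in for word_tokenize so that the sketch runs without NLTK; its output can differ from NLTK's for some inputs. The sample sentence is made up for illustration.

```python
import re

# Same pattern as location_pattern in Code Snippet 6.
location_pattern = r'^[A-Za-z\s,]+$'


def is_valid_location(text):
    """Return True when text holds only letters, whitespace, and commas."""
    return re.match(location_pattern, text) is not None


def simple_tokenize(text):
    """Simplified stand-in for nltk's word_tokenize (an assumption, not
    NLTK's algorithm): each run of word characters becomes a token and
    each punctuation mark becomes a token of its own."""
    return re.findall(r"\w+|[^\w\s]", text)


print(is_valid_location("Indoor plant"))    # letters and spaces only
print(is_valid_location("Indoor plant 2"))  # contains a digit
print(simple_tokenize("Fertilize once a month."))
```

Because the result of tokenization is a list, the Fertilizing_Reminder value passed to the Tips.html template is a list of words rather than a single string.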

Before running the code, replace the password in the following line of the
Python script with the password of your own MySQL server.
app.config['MYSQL_PASSWORD'] = 'De!!9373'
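
As an alternative to editing the password inside the script, the settings could be read from environment variables. The following is a sketch under stated assumptions, not something the application already does: the function name load_mysql_settings and the BLOSSOM_DB_* variable names are hypothetical, while the keys and default values mirror the app.config lines in Code Snippet 6.

```python
import os


def load_mysql_settings(environ=os.environ):
    """Build the MySQL settings from environment variables.

    The BLOSSOM_DB_* names are hypothetical; the defaults mirror the
    values used in Code Snippet 6.
    """
    return {
        "MYSQL_HOST": environ.get("BLOSSOM_DB_HOST", "localhost"),
        "MYSQL_USER": environ.get("BLOSSOM_DB_USER", "root"),
        "MYSQL_PASSWORD": environ.get("BLOSSOM_DB_PASSWORD", ""),
        "MYSQL_DB": environ.get("BLOSSOM_DB_NAME", "plant_data"),
    }


# Demonstration with a plain dictionary standing in for the environment
settings = load_mysql_settings({"BLOSSOM_DB_PASSWORD": "example-password"})
print(settings["MYSQL_HOST"], settings["MYSQL_DB"])
```

In the Flask script, the returned dictionary could be applied with app.config.update(load_mysql_settings()) in place of the four app.config assignments.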

Figure 4 shows the folder structure for the application. Place the HTML files in
the templates folder. Place the style sheet, styles.css, in the static folder.
The Python script remains in the root folder of the application.

Figure 4: Folder Structure for the Application
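
The layout can also be checked from Python before starting the application. The sketch below is an optional helper, not part of the application; the function name missing_files is hypothetical, and the file list is taken from the folder description above. The demonstration builds a throwaway folder so that it does not depend on where the project is stored.

```python
from pathlib import Path
import tempfile

# Expected layout, taken from the folder description for Figure 4.
EXPECTED_FILES = [
    "BlossomBuddy.py",
    "templates/Welcome.html",
    "templates/Tips.html",
    "static/styles.css",
]


def missing_files(root):
    """Return the expected files that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration: create every file except the style sheet
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "templates").mkdir()
    (demo_root / "static").mkdir()
    for name in EXPECTED_FILES[:-1]:
        (demo_root / name).touch()
    print(missing_files(demo_root))  # reports the missing style sheet
```

Calling missing_files(".") from the root folder of the application returns an empty list when every file is in place.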
Run the Application
To execute the application, use the command:

python BlossomBuddy.py

The Welcome screen appears as shown in Figure 5.

Figure 5: Welcome.html Page


Provide the inputs for Location as Indoor plant, Climate as Moderate, and
Preferred plants as Philodendron. Click Get Gardening Tips. The Tips.html
page appears as shown in Figure 6.

Figure 6: Tips.html Page
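
The lookup behind this page can be tried without a MySQL server. The sketch below is an illustration only: it swaps MySQL for Python's built-in sqlite3 module with an in-memory database, so the placeholders are ? instead of %s. The table and column names follow the query in Code Snippet 6, the function name find_tips is hypothetical, and the single sample row contains made-up placeholder text, not data from the plant_data database.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL plant_data database
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by column name
conn.execute(
    "CREATE TABLE plant (Plant_Name TEXT, Location TEXT, Climate TEXT, "
    "Watering_Reminder TEXT, Fertilizing_Reminder TEXT, "
    "Other_Care_Reminder TEXT, Tips TEXT)"
)
# Made-up sample row, for illustration only
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Philodendron", "Indoor plant", "Moderate", "sample watering text",
     "sample fertilizing text", "sample care text", "sample tip text"),
)


def find_tips(location, climate, plant_name):
    """Run the same three-condition lookup as the tips() route."""
    cursor = conn.execute(
        "SELECT Plant_Name, Watering_Reminder, Fertilizing_Reminder, "
        "Other_Care_Reminder, Tips FROM plant "
        "WHERE Location = ? AND Climate = ? AND Plant_Name = ?",
        (location, climate, plant_name),
    )
    return [dict(row) for row in cursor.fetchall()]


print(find_tips("Indoor plant", "Moderate", "Philodendron"))
print(find_tips("Outdoor plant", "Dry", "Rose"))  # no matching row
```

All three values must match a row for it to be returned, which is why a combination that is not stored in the table leads to the "No gardening tips found" message on the Welcome page.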
This is how simple it is to create a Web application in Python using Flask.

Appendix - B

Sr. No.   Case Studies
1. Use MySQL Connector to connect Python with MySQL and create a
database, Patient. Create a table named Patient_info in this
database. The columns in the table are Patient_id, Patient_name,
Age, Gender, Admission_type, Date_of_enquiry, and Problem.
Patient_id is the primary key of the Patient_info table.

a. Create a GUI application in Python using the Tkinter library. The
application is designed to collect information about the
patients who consult the doctors in the hospital. The information
to be collected includes Patient_id, Patient_name, Age, Gender,
Admission_type (In/Out), Date_of_enquiry, and Problem.
b. Use Entry widgets for Patient_id, Patient_name, Age,
Date_of_enquiry, and Problem. Use Radiobutton widgets for
Gender and Admission_type.
c. Place five buttons, namely Add record, Update record, Delete
record, View In-patient, and View Out-patient. Write Python
code to perform each of the button actions, that is, adding,
updating, deleting, viewing in-patient, and viewing out-patient
details using the Patient_info table.
d. Use the Add record button to add the patient details given in
Table 3 into the Patient_info table.

Table 3: Patient Enquiry Details

e. Use the View In-patient and View Out-patient buttons to display
all the records with Admission_type as IN and OUT,
respectively.
2. Create a simple Web application in Python using Flask to retrieve the
patient details from MySQL and display them on a Web page.
a. Design a Web page for the Hospital Information system where
the end-user can view the details of either in-patients or
out-patients. On the first page of the Hospital Information system,
create two links: In-patient and Out-patient.
b. When the In-patient link is clicked, the application should
redirect to a new page. This new page should display the
details of patients with Admission_type as IN from the
Patient_info table, in the form of an HTML table. Similarly,
when the Out-patient link is clicked, the application should
redirect to a new page that displays the details of out-patients.
c. Inspect the In-patient details page to prepare for scraping its contents.
d. Using the libraries Requests and Beautiful Soup, retrieve the
information from the In-patient details page and store it in an
Excel file, patient_in.xlsx.
3. Create the heart_patient.csv file with the data given in Table 4.

Table 4: Patient Disease Details

Train a Logistic Regression classification model on the given dataset
and predict whether a patient has a chance of heart disease.

a. Handle missing values in the dataset and remove duplicate
rows from the dataset.
b. Prepare the data for training by selecting relevant features
and encoding the categorical variables.
c. Split the dataset into training and testing sets (7:3).

d. Perform feature scaling and build the Logistic Regression
model using the training data.
e. Make predictions on the test data using the trained model.
f. Evaluate the model using the confusion matrix and find its
accuracy.
4. Create the patient_detail.csv file with the data given in Table 5.

Table 5: Patient Personal Details

Find the optimal number of clusters and apply the K-Means clustering
algorithm to the given dataset to group the patients into different
clusters.
a. Load the dataset and extract the independent variables.
b. Find the optimal number of clusters in K-Means using the
elbow method.
c. Apply feature scaling using Min-Max scaling.
d. Train the model using the K-Means algorithm.
e. Visualize the clusters using different colors.
