2025 - Course Kit & Lesson Plan - Business Analytics For Decision Making

In conclusion, ethics in business analytics is about balancing the potential benefits of data-driven decision-making with the responsibility to protect privacy, avoid bias, and ensure transparency and fairness.
Organizations must adopt ethical frameworks and practices to ensure that their analytics initiatives create
positive value without causing harm to individuals, communities, or society at large.

UNIT II

Introduction to Python
Python was developed by Guido van Rossum in the year 1991.
Python is a high level programming language that combines features of procedural programming languages like
C and object oriented programming languages like Java.

FEATURES OF PYTHON
Simple
Python is a simple programming language because it uses English-like sentences in its programs.
Easy to learn
Python uses very few keywords. Its programs use very simple structure.
Open source
Python can be freely downloaded from www.python.org website. Its source code can be read, modified and
can be used in programs as desired by the programmers.
High level language
High level languages use English words to develop programs. These are easy to learn and use. Like COBOL,
PHP or Java, Python also uses English words in its programs and hence it is called high level programming
language.
Dynamically typed
In Python, we need not declare the variables. Depending on the value stored in a variable, the Python
interpreter internally infers its datatype.
Platform independent
Python programs are not dependent on any particular computer or operating system. We can use Python
on Unix, Linux, Windows, Macintosh, Solaris, OS/2, Amiga, AROS, AS/400, etc., that is, on almost all
operating systems. This makes Python an ideal programming language for any network or the Internet.
Portable
When a program yields same result on any computer in the world, then it is called a portable program.
Python programs will give same result since they are platform independent.
Procedure and Object oriented
Python is a procedure oriented as well as object oriented programming language. In procedure oriented
programming languages (e.g. C and Pascal), the programs are built using functions and procedures. But in
object oriented languages (e.g. C++ and Java), the programs use classes and objects.
An object is anything that exists physically in the real world. An object contains behavior. This behavior is
represented by its properties (or attributes) and actions. Properties are represented by variables and actions
are performed by methods. So, an object contains variables and methods.
A class represents common behavior of a group of objects. It also contains variables and methods. But a
class does not exist physically.
A class can be imagined as a model for creating objects. An object is an instance (physical form) of a class.
Interpreted
First, Python compiler translates the Python program into an intermediate code called byte code. This byte
code is then executed by PVM. Inside the PVM, an interpreter converts the byte code instructions into
machine code so that the processor will understand and run that machine code.
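This compile-then-interpret pipeline can be observed with the standard dis module, which disassembles the byte code of a function (a small illustrative sketch; the exact instructions printed vary between Python versions):

```python
import dis

def add(a, b):
    return a + b

# Disassemble add() to see the byte code instructions that the
# interpreter inside the PVM executes.
dis.dis(add)

# The compiled byte code object is stored on the function itself.
print(type(add.__code__))
```

Running this shows low-level instructions such as the binary-add and return operations that the PVM executes one by one.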
Extensible
There are other flavors of Python where programs from other languages can be integrated into Python. For
example, Jython is useful to integrate Java code into Python programs and run on JVM (Java Virtual
Machine). Similarly IronPython is useful to integrate .NET programs and libraries into Python programs and
run on CLR (Common Language Runtime).

Embeddable
Several applications are already developed in Python which can be integrated into other programming
languages like C, C++, Delphi, PHP, Java and .NET. It means programmers can use these applications for
their advantage in various software projects.
Huge library
Python has a big library that contains modules which can be used on any Operating system.
Scripting language
A scripting language uses an interpreter to translate the source code into machine code on the fly (while
running). Generally, scripting languages perform supporting tasks for a bigger application or software.
Python is considered a scripting language as it is interpreted and it is used on the Internet to support other
software.

Database connectivity
A database represents software that stores and manipulates data. Python provides interfaces to connect its
programs to all major databases like Oracle, Sybase, SQL Server or MySql.
Scalable
A program would be scalable if it could be moved to another Operating system or hardware and take full
advantage of the new environment in terms of performance.
Core Libraries in Python
The huge library of Python contains several small applications (or small packages) which are already
developed and immediately available to programmers. These libraries are called ‘batteries included’. Some
interesting batteries or packages are given here:
• argparse is a package that represents command-line parsing library.
• boto is Amazon web services library.
• CherryPy is a Object-oriented HTTP framework.
• cryptography offers cryptographic techniques for the programmers
• Fiona reads and writes geospatial data files
• jellyfish is a library for doing approximate and phonetic matching of strings.
• matplotlib is a library for producing plots, charts and other data visualizations.
• mysql-connector-python is a driver written in Python to connect to MySQL database.
• numpy is a package for processing arrays of single or multidimensional type.
• pandas is a package for powerful data structures for data analysis, time series and statistics.
• Pillow is a Python imaging library.
• pyquery represents jquery-like library for Python.
• scipy is the scientific library to do scientific and engineering calculations.
• Sphinx is the Python documentation generator.
• sympy is a package for Computer algebra system (CAS) in Python.
• w3lib is a library of web related functions.
• whoosh contains fast and pure Python full text indexing, search and spell checking library.

To know the entire list of packages included in Python, one can visit:
https://www.pythonanywhere.com/batteries_included/

Python Virtual Machine: PVM


A Python program contains source code (first.py) that is first compiled by Python compiler to produce byte
code (first.pyc). This byte code is given to Python Virtual Machine (PVM) which converts the byte code to
machine code. This machine code is run by the processor and finally the results are produced.

Python Virtual Machine (PVM) is a software that contains an interpreter that converts the byte code into
machine code.
PVM is most often called Python interpreter. The PVM of PyPy contains a compiler in addition to the
interpreter. This compiler is called Just In Time (JIT) compiler which is useful to speed up execution of the
Python program.

Memory management by PVM


Memory allocation and deallocation are done by PVM during runtime. Entire memory is allocated on heap.
We know that the actual memory (RAM) for any program is allocated by the underlying Operating system.
On the top of the Operating system, a raw memory allocator oversees whether enough memory is available
to it for storing the objects (ex: integers, strings, functions, lists, modules etc). On top of the raw memory
allocator, several object-specific allocators operate on the same heap. These memory allocators implement
different types of memory management policies depending on the type of the object. For example, an integer
number should be stored in memory in one way and a string should be stored in a different way. Similarly,
tuples and dictionaries should be stored differently again. These issues are taken care of by the
object-specific memory allocators.
Garbage collection
A module represents Python code that performs specific task. Garbage collector is a module in Python that is
useful to delete objects from memory which are not used in the program. The module that represents the
garbage collector is named gc. The garbage collector, in its simplest form, maintains a count for each object
of how many times that object is referenced (or used). When an object is referenced twice, its reference
count will be 2. As long as an object's count is nonzero, it is being used in the program and hence the
garbage collector will not remove it from memory. When an object is found with a reference count of 0, the
garbage collector understands that the object is no longer used by the program and hence it can be deleted
from memory. The memory allocated for that object is then deallocated or freed.
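This reference counting can be observed with the standard sys module (a small sketch; sys.getrefcount reports one extra reference because its own argument temporarily references the object):

```python
import sys
import gc

lst = [10, 20, 30]
alias = lst                   # a second reference to the same list object
print(sys.getrefcount(lst))   # count includes lst, alias and the argument

del alias                     # drop one reference; the list still exists
print(lst)

# The gc module exposes the garbage collector, e.g. for cyclic garbage.
print(gc.isenabled())         # the collector is enabled by default
```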
Frozen Binaries
When a software is developed in Python, there are two ways to provide the software to the end user. The first
way is to provide the .pyc files to the user. The user will install PVM in his computer and run the byte code
instructions of the .pyc files.
The other way is to provide the .pyc files, PVM along with necessary Python library. In this method, all
the .pyc files, related Python library and PVM will be converted into a single executable file (generally
with .exe extension) so that the user can directly execute that file by double clicking on it. In this way,
converting the Python programs into true executables is called frozen binaries. But frozen binaries will have
more size than that of simple .pyc files since they contain PVM and library files also.
For creating frozen binaries, we need to use third-party software. For example, py2exe is a tool that
produces frozen binaries for the Windows operating system. We can use pyinstaller for Unix or Linux.
Freeze is another program, from the Python organization, to generate frozen binaries for Unix.
Jupyter Notebook
Jupyter is an IDE that is popular among Python developers and Data Scientists. It is a part of Anaconda
platform which is a collection of tools and IDEs. By installing Anaconda, we can have Jupyter IDE available
to us. When we install Anaconda, it comes with a copy of Python software along with important packages
like numpy, scikit-learn, scipy, pandas, matplotlib, etc. Also, it has 2 popular IDEs: Spyder and Jupyter.
Anaconda is liked by Data Scientists because of its capability of handling huge volumes of data very
quickly and efficiently. In this section, first we will install the Anaconda platform and then see how to work with
Jupyter Notebook.

HOW TO INSTALL AND USE JUPYTER NOTEBOOK


Step 1) Open the Anaconda official website in a browser. At the right side top corner, click on 'Free
Download' and, below it, 'Skip registration' to download without providing a mail id.

Step 2) It downloads a file like “Anaconda3-2024.10-1-Windows-x86_64.exe”. Double click on it. Then


Anaconda Setup will start execution. Click on “Next” button.
Step 3) Then click on “Next” button to continue. When it displays “License Agreement”, click on “I Agree”
button.

Step 4) Then click on ‘Just Me’ radio button for installing your individual copy.
Step 5) It will show a default directory to install. Click on ‘Next’.

Step 6) In the next screen, select the checkbox ‘Create start menu shortcuts’. Also, unselect other
checkboxes.
Step 7) The installation starts in the next screen. We should wait for the installation to complete.

Step 8) When the installation completes, click on “Next” .


Step 9) In the next screen, click on “Next”.

Step 10) In the final screen, do not check the checkboxes and then click on “Finish”.
Note: Once the installation is completed, we can find a new folder by the name "Anaconda3 (64-bit)" created
in the Windows 10 applications list, which can be seen by pressing the Windows "Start" button. When we click
on this folder, we can find several icons including "Jupyter Notebook" and "Spyder".

USING JUPYTER NOTEBOOK


Step 1) Click on the "Start" button on the Windows task bar and select the "Anaconda3" folder. In that, click
on the "Jupyter Notebook" link. First, a black window opens where the Jupyter server runs. Minimize this
window but do not close it. After that, Jupyter opens in the browser and displays the initial screen
(Home Page).
Step 2) On the Home Page, click on the "New" button and select "Python 3" to create a new notebook.
Step 3) It opens a new page. Click on “Untitled” at the top of the page and enter a new name for your
program. Then click on “Rename” button.
Step 4) Type the program code in cell and click on “Run” to run the code of the current cell. The current cell
being edited is shown in green box.

Step 5) We can enter code in the next cell and so on. In this manner, we can run the program as blocks of
code, one block at a time. When input is required, it will wait for your input to enter, as shown in the
following screen. The blue box around the cell indicates command mode.

Step 6) Type the program in the cells and run each cell to see the results produced by each cell.
Note: To save the program, click on Floppy symbol below the “File” menu. Click on “Insert” to insert a new
cell either above or below the current cell. The programs in Jupyter are saved with the extension “.ipynb”
which indicates Interactive Python Notebook file. This file stores the program and other contents in the form
of JSON (JavaScript Object Notation). Click on ‘Logout’ to terminate Jupyter. Then close the server window
also.

Step 7) To reopen the program, first enter into Jupyter Notebook Home Page. In the “Files” tab, find out the
program named “first.ipynb” and click on it to open it in another page.

Step 8) Similarly, to delete the file, first select it and then click on the Delete Bin symbol.
RUNNING A PYTHON PROGRAM
Running a Python program can be done from 3 environments:
1. Command line window
2. IDLE graphics window
3. System prompt

To get help, type help().


Type topics, FUNCTIONS or modules to see the available help topics.
Press <ENTER> on an empty line to quit.

In IDLE window, click on help -> ‘Python Docs’ or F1 button to get documentation help.
Save a Python program in IDLE and reopen it and run it.
COMMENTS (2 types)
# single line comments
""" or ''' multi line comments

Docstrings
If we write strings inside """ or ''' and if these strings are written as first statements in a module, function,
class or a method, then these strings are called documentation strings or docstrings. These docstrings are
useful to create an API documentation file from a Python program. An API (Application Programming
Interface) documentation file is a text file or html file that contains description of all the features of a
software, language or a product.
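For example, a docstring written as the first statement of a function is stored in its __doc__ attribute, and this is the text that help() displays:

```python
def square(n):
    """Return the square of n."""   # this first string is the docstring
    return n * n

print(square.__doc__)    # Return the square of n.
print(square(4))         # 16
```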

DATATYPES
A datatype represents the type of data stored into a variable (or memory).
Built-in datatypes
The built-in datatypes are of 5 types:

• None Type
• Numeric types
• Sequences
• Sets
• Mappings

None type: an object that does not contain any value.


Numeric types: int, float, complex.
Boolean type: bool.
Sequences: str, bytes, bytearray, list, tuple, range.

int type: represents integers like 12, 100, -55.


float type: represents float numbers like 55.3, 25e3.
complex type: represents complex numbers like 3+5j or 3-10.5J. Complex numbers will be in the form of
a+bj or a+bJ. Here ‘a’ is called real part and ‘b’ is called ‘imaginary part’ and ‘j’ or ‘J’ indicates √-1.

NOTE:
Binary numbers are represented by a prefix 0b or 0B. Ex: 0b10011001
Hexadecimal numbers are represented by a prefix 0x or 0X. Ex: 0X11f9c
Octal numbers are represented by a prefix 0o or 0O. Ex: 0o145.

bool type: represents any of the two boolean values, True or False.
Ex: a = 10>5 # here a is treated as bool type variable.
print(a) #displays True
NOTE:
1. To convert a float number into integer, we can use int() function. Ex: int(num)
2. To convert an integer into float, we can use float() function.
3. bin() converts a number into binary. Ex: bin(num)
4. oct() converts a number into octal.
5. hex() converts a number into hexadecimal.
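These conversion functions can be tried directly:

```python
num = 25.7
print(int(num))     # 25   (fractional part truncated)
print(float(10))    # 10.0
print(bin(25))      # 0b11001
print(oct(25))      # 0o31
print(hex(25))      # 0x19
```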
STRINGS
str datatype: represents string datatype. A string is enclosed in single quotes or double quotes.
Ex: s1 = "Welcome"
s2 = 'Welcome'
A string occupying multiple lines can be enclosed in triple single quotes or triple double quotes.
Ex: s1 = '''This is a special training on
Python programming that
gives insights into Python language.
'''
To display a string that itself contains single quotes:
Ex: s2 = """This is a book 'on Core Python' programming"""
To find length of a string, use len() function.
Ex: s3 = 'Core Python'
n = len(s3)
print(n) -> 11
We can do indexing, slicing and repetition of strings.
Ex: s = “Welcome to Core Python”
print(s) -> Welcome to Core Python
print(s[0]) -> W
print(s[0:7]) -> Welcome
print(s[:7]) -> Welcome
print(s[1:7:2]) -> ecm
print(s[-1]) -> n
print(s[-3:-1]) -> ho
print(s[1]*3) -> eee
print(s*2) -> Welcome to Core PythonWelcome to Core Python
Remove spaces using rstrip(), lstrip(), strip() methods.
Ex: name = " Vijay Kumar "
print(name.strip())
We can find substring position in a string using find() method. It returns -1 if not found.

Ex: n = str.find(sub, 0, len(str))


We can count number of substrings in a string using count() method. Returns 0 if not found.
Ex: n = str.count(sub)
We can replace a string s1 with another string s2 in a main string using replace() method.
Ex: str.replace(s1, s2)
We can change the case of a string using upper(), lower(), title() methods.
Ex: str.upper()
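These string methods can be combined in a short runnable example:

```python
s = "Core Python Programming"
print(s.find("Python"))           # 5, the index where the substring starts
print(s.find("Java"))             # -1, substring not found
print(s.count("o"))               # 3
print(s.replace("Core", "Pure"))  # Pure Python Programming
print(s.upper())                  # CORE PYTHON PROGRAMMING
print("  Vijay Kumar  ".strip())  # Vijay Kumar
```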

CHARACTERS
There is no datatype to represent a single character in Python. Characters are part of str datatype.
Ex:
str = "Hello"
print(str[0])
H
for i in str: print(i)
H
e
l
l
o

BYTES AND BYTEARRAY


bytes datatype: represents a group of integers in the range 0 to 255, just like an array. The
elements of bytes type cannot be modified.
Ex: arr = [10, 20, 55, 100, 99]
x = bytes(arr)
for i in x:
print(i)
10
20
55
100
99

bytearray datatype: same as bytes type but its elements can be modified.
arr = [10,20,55,100,99]
x=bytearray(arr)
x[0]=11
x[1]=21
for i in x: print(i)

11
21
55
100
99
NOTE:
Indexing, slicing and repetition are all possible on bytes and bytearray objects; the restriction is only that the elements of a bytes object cannot be modified.

LISTS
A list is similar to an array that can store a group of elements. A list can store different types of elements and
can grow dynamically in memory. A list is represented by square braces [ ]. List elements can be modified.
Ex:
lst = [10, 20, 'Ajay', -99.5]
print(lst[2])
Ajay
To create an empty list.
lst = [] # then we can append elements to this list as lst.append('Vinay')
NOTE:
Indexing, slicing and repetition are possible on lists.
print(lst[1])
20
print(lst[-3:-1])
[20, 'Ajay']
lst = lst*2
print(lst)
[10, 20, 'Ajay', -99.5, 10, 20, 'Ajay', -99.5]
We can use len() function to find the no. of elements in the list.
n = len(lst) -> 4
The del statement deletes an element at a particular position.
del lst[1] -> deletes 20
remove() will remove a particular element. clear() will delete all elements from the list.
lst.remove('Ajay')
lst.clear()
We can update the list elements by assignment.
lst[0] = 'Vinod'
lst[1:3] = 10, 15
max() and min() functions return the biggest and smallest elements.
max(lst)
min(lst)

TUPLES
A tuple is similar to a list but its elements cannot be modified. A tuple is represented by parentheses ( ).
Indexing, slicing and repetition are possible on tuples also.
Ex:
tpl=( ) # creates an empty tuple
tpl=(10, ) # with only one element – comma needed after the element
tpl = (10, 20, -30, "Raju")
print(tpl)
(10, 20, -30, 'Raju')
tpl[0]=-11 # error
print(tpl[0:2])
(10, 20)
tpl = tpl*2
print(tpl)
(10, 20, -30, 'Raju', 10, 20, -30, 'Raju')
NOTE: len(), count(), index(), max(), min() functions are same in case of tuples also.
We cannot use append(), extend(), insert(), remove(), clear() methods on tuples.
To sort the elements of a tuple, we can use the sorted() function. Since a tuple cannot be modified, sorted()
returns the sorted elements as a new list.
sorted(tpl) # returns all elements in ascending order
sorted(tpl, reverse=True) # returns all elements in descending order
To convert a list into a tuple, we can use the tuple() function.
tpl = tuple(lst)

RANGE DATATYPE
range represents a sequence of numbers. The numbers in the range cannot be modified. Generally, range is
used to repeat a for loop for a specified number of times.
Ex: we can create a range object that stores from 0 to 4 as:
r = range(5)
print(r[0]) -> 0
for i in r: print(i)
0
1
2
3
4
Ex: we can also mention step value as:
r = range(0, 10, 2)
for i in r: print(i)
0
2
4
6
8
r1 = range(50, 40, -2)
for i in r1: print(i)
50
48
46
44
42
SETS
A set datatype represents an unordered collection of elements. A set does not accept duplicate elements,
whereas a list accepts duplicate elements. A set is written using curly braces { }. Elements can be added to
or removed from a set.
s = {1, 2, 3, "Vijaya"}
print(s)
{1, 2, 3, 'Vijaya'}
NOTE: Indexing, slicing and repetition are not allowed in case of a set.
To add elements into a set, we should use update() method as:
s.update([4, 5])
print(s)
{1, 2, 3, 4, 5, 'Vijaya'}
To remove elements from a set, we can use remove() method as:
s.remove(5)
print(s)
{1, 2, 3, 4, 'Vijaya'}
A frozenset datatype is same as set type but its elements cannot be modified.
Ex:
s = {1, 2, -1, 'Akhil'} -> this is a set
s1 = frozenset(s) -> convert it into frozenset
for i in s1: print(i)
1
2
Akhil
-1
NOTE: update() or remove() methods will not work on frozenset.
MAPPING DATATYPES
A map indicates elements in the form of key – value pairs. When key is given, we can retrieve the associated
value. A dict datatype (dictionary) is an example for a ‘map’.
d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
print(d)
{10: 'kamal', 11: 'Subbu', 12: 'Sanjana'}
keys() method gives keys and values() method returns values from a dictionary.
k = d.keys()
for i in k: print(i)
10
11
12
for i in d.values(): print(i)
kamal
Subbu
Sanjana
To retrieve the value associated with a given key, we can use:
Ex: d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
d[10] gives ‘kamal’
To create an empty dictionary, we can use as:
d = {}
Later, we can store the keys and values into d, as:
d[10] = 'Kamal'
d[11] = 'Pranav'
We can update the value of a key, as: d[key] = newvalue.
Ex: d[10] = 'Subhash'
We can delete a key and corresponding value, using del function.
Ex: del d[11] will delete a key with 11 and its value also.
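These dictionary operations can be combined; the items() method is also handy for looping over key-value pairs together (a short sketch):

```python
d = {10: 'kamal', 11: 'Subbu', 12: 'Sanjana'}

d[13] = 'Ravi'       # add a new key-value pair
d[10] = 'Subhash'    # update the value of an existing key
del d[11]            # delete key 11 along with its value

# items() yields (key, value) pairs.
for key, value in d.items():
    print(key, value)

print(d.get(99, 'not found'))   # get() avoids an error for a missing key
```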
PYTHON AUTOMATICALLY KNOWS ABOUT THE DATATYPE
The datatype of the variable is decided depending on the value assigned. To know the datatype of the
variable, we can use type() function.
Ex:
x = 15 #int type
print(type(x))
<class 'int'>
x = 'A' #str type
print(type(x))
<class 'str'>
x = 1.5 #float type
print(type(x))
<class 'float'>
x = "Hello" #str type
print(type(x))
<class 'str'>
x = [1,2,3,4]
print(type(x))
<class 'list'>
x = (1,2,3,4)
print(type(x))
<class 'tuple'>
x = {1,2,3,4}
print(type(x))
<class 'set'>

Literals in Python
A literal is a constant value that is stored into a variable in a program.
a = 15
Here, ‘a’ is the variable into which the constant value ‘15’ is stored. Hence, the value 15 is called ‘literal’.
Since 15 indicates integer value, it is called ‘integer literal’.
Ex: a = 'Srinu' → here 'Srinu' is called a string literal.
Ex: a = True → here, True is called a Boolean type literal.
User-defined datatypes
The datatypes which are created by the programmers are called ‘user-defined’ datatypes. For example, an
array, a class, or a module is user-defined datatypes. We will discuss about these datatypes in the later
chapters.
Constants in Python
A constant is similar to a variable but its value is not supposed to be modified or changed in the course of
the program execution. For example, the value of pi (approximately 22/7) is a constant. Python does not
enforce constants; by convention they are written in capital letters, as PI.

Identifiers and Reserved words


An identifier is a name that is given to a variable or function or class etc. Identifiers can include letters,
numbers, and the underscore character ( _ ). They should always start with a nonnumeric character. Special
symbols such as ?, #, $, %, and @ are not allowed in identifiers. Some examples for identifiers are salary,
name11, gross_income, etc.
Reserved words are the words which are already reserved for some particular purpose in the Python
language. The names of these reserved words should not be used as identifiers. The following are the
reserved words available in Python 3 (exec and print were reserved only in Python 2; in Python 3 print is an
ordinary built-in function):
False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
OPERATORS
A symbol that performs an operation.
An operator acts on variables or values
that are called ‘operands’.
Arithmetic operators
They perform basic arithmetic operations.
a = 13, b = 5
Operator   Meaning                                          Example   Result
+          Addition operator.                               a+b       18
-          Subtraction operator.                            a-b       8
*          Multiplication operator.                         a*b       65
/          Division operator.                               a/b       2.6
//         Floor division operator. Gives the quotient      a//b      2
           without the fractional part.
%          Modulus operator. Gives the remainder of         a%b       3
           division.
**         Exponent operator. a ** b gives the value of     a**b      371293
           a to the power of b.
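The results in the table can be verified by running the operators directly:

```python
a, b = 13, 5
print(a + b)    # 18
print(a - b)    # 8
print(a * b)    # 65
print(a / b)    # 2.6  (true division always returns a float)
print(a // b)   # 2    (floor division drops the fractional part)
print(a % b)    # 3
print(a ** b)   # 371293
```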

Assignment operators
To assign a right side value to a left side variable.
Operator   Example    Meaning
=          z = x+y    Assignment operator, i.e. x+y is stored into z.
+=         z += x     Addition assignment operator, i.e. z = z+x.
-=         z -= x     Subtraction assignment operator, i.e. z = z-x.
*=         z *= x     Multiplication assignment operator, i.e. z = z*x.
/=         z /= x     Division assignment operator, i.e. z = z/x.
%=         z %= x     Modulus assignment operator, i.e. z = z%x.
**=        z **= y    Exponentiation assignment operator, i.e. z = z**y.
//=        z //= y    Floor division assignment operator, i.e. z = z//y.

Ex:
a=b=c=5
print(a,b,c)
5 5 5
a,b,c=1,2,'Hello'
print(a,b,c)
1 2 Hello
x = [10,11,12]
a,b,c = 1.5, x, -1
print(a,b,c)
1.5 [10, 11, 12] -1

Unary minus operator


Converts +ve value into negative and vice versa.
Relational operators
Relational operators are used to compare two quantities. They return either True or False (bool datatype).
Ex:
a, b = 1, 2
print(a>b)
False

Ex:
1<2<3<4 will give True
1<2>3<4 will give False
Logical operators
Logical operators are useful to construct compound conditions. A compound condition is a combination of
more than one simple condition. 0 is False, any other number is True.

x = 1, y = 2
Operator   Example   Meaning                                           Result
and        x and y   And operator. If x is False, it returns x,        2
                     otherwise it returns y.
or         x or y    Or operator. If x is False, it returns y,         1
                     otherwise it returns x.
not        not x     Not operator. If x is False, it returns True;     False
                     if x is True, it returns False.
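Since and and or return one of their operands rather than a plain True/False, the results in the table can be checked like this:

```python
x, y = 1, 2
print(x and y)     # 2  (x is truthy, so the second operand is returned)
print(x or y)      # 1  (x is truthy, so x itself is returned)
print(not x)       # False

zero = 0
print(zero and y)  # 0  (zero is falsy, so it is returned as-is)
print(zero or y)   # 2
```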

Ex:
x = 1; y = 2; z = 3
if x < y or y > z:
    print('Yes')
else:
    print('No')
# displays Yes

Boolean operators
Boolean operators act upon ‘bool’ type values and they provide ‘bool’ type result. So the result will be again
either True or False.
x = True, y = False
Operator   Example   Meaning                                           Result
and        x and y   Boolean and operator. If both x and y are True,   False
                     it returns True, otherwise False.
or         x or y    Boolean or operator. If either x or y is True,    True
                     it returns True, else False.
not        not x     Boolean not operator. If x is True, it returns    False
                     False, else True.
INPUT AND OUTPUT
print() function for output
Example                                                  Output
print()                                                  Blank line
print("Hai")                                             Hai
print("This is the \nfirst line")                        This is the
                                                         first line
print("This is the \\nfirst line")                       This is the \nfirst line
print('Hai'*3)                                           HaiHaiHai
print('City='+"Hyderabad")                               City=Hyderabad
a, b = 1, 2
print(a, b)                                              1 2
print(a, b, sep=",")                                     1,2
print(a, b, sep='-----')                                 1-----2
print("Hello")                                           Hello
print("Dear")                                            Dear
print("Hello", end='')                                   HelloDear
print("Dear", end='')
a = 2
print('You typed', a, 'as input')                        You typed 2 as input
%i, %f, %c, %s can be used as format strings.
name = 'Linda'; sal = 12000.50
print('Hai', name, 'Your salary is', sal)                Hai Linda Your salary is 12000.5
print('Hai %s, Your salary is %.2f' % (name, sal))       Hai Linda, Your salary is 12000.50
print('Hai {}, Your salary is {}'.format(name, sal))     Hai Linda, Your salary is 12000.5
print('Hai {0}, Your salary is {1}'.format(name, sal))   Hai Linda, Your salary is 12000.5
print('Hai {1}, Your salary is {0}'.format(name, sal))   Hai 12000.5, Your salary is Linda

input() function for accepting keyboard input


Example
str = input()
str = input('Enter your name= ')
a = int(input('Enter int number: '))
a = float(input('Enter a float number: '))
a,b,c = [int(x) for x in input("Enter three numbers: ").split()]
a,b,c = [int(x) for x in input('Enter a,b,c: ').split(',')]
a,b,c = [x for x in input('Enter 3 strings: ').split(',')]
lst = [float(x) for x in input().split(',')]
lst = eval(input('Enter a list: '))
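Since input() simply returns a string, the same parsing patterns can be tried on a fixed string without the keyboard:

```python
line = "10,20,30"                 # what input() might have returned

a, b, c = [int(x) for x in line.split(',')]
print(a, b, c)                    # 10 20 30
print(a + b + c)                  # 60

values = [float(x) for x in "1.5,2.5,3.5".split(',')]
print(values)                     # [1.5, 2.5, 3.5]
```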

Formal and actual arguments


When a function is defined, it may have some parameters. These parameters are useful to receive values
from outside of the function. They are called ‘formal arguments’. When we call the function, we should pass
data or values to the function. These values are called ‘actual arguments’. In the following code, ‘a’ and ‘b’
are formal arguments and ‘x’ and ‘y’ are actual arguments.
def sum(a, b): # a, b are formal arguments
    c = a+b
    print(c)
# call the function
x=10; y=15
sum(x, y) # x, y are actual arguments
The actual arguments used in a function call are of 4 types:
□ Positional arguments
□ Keyword arguments
□ Default arguments
□ Variable length arguments

Positional arguments
These are the arguments passed to a function in correct positional order. Here, the number of arguments and
their positions in the function definition should match exactly with the number and position of the argument
in the function call
def attach(s1, s2): # function definition
    print(s1 + s2)
attach('New', 'York') # positional arguments
Keyword arguments
Keyword arguments are arguments that identify the parameters by their names.
def grocery(item, price): # function definition
    print(item, price)
grocery(item='Sugar', price=50.75) # keyword arguments
Default arguments
We can mention some default value for the function parameters in the definition.
def grocery(item, price=40.00): # default argument is price
    print(item, price)
grocery(item='Sugar') # default value for price is used

Variable length arguments


A variable length argument is an argument that can accept any number of values. The variable length
argument is written with a ‘ * ‘ symbol before it in the function definition, as:
def add(farg, *args): # *args can take 0 or more values
add(5, 10)
add(5, 10, 20, 30)
Here, 'farg' is the formal argument and '*args' represents the variable length argument. We can pass 0 or
more values to '*args' and it will store them all in a tuple.
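The add() function above can be completed into a runnable sketch that shows *args collecting the extra values into a tuple:

```python
def add(farg, *args):
    # args is a tuple holding the remaining values (possibly empty)
    total = farg
    for n in args:
        total += n
    print('sum =', total)

add(5, 10)           # sum = 15
add(5, 10, 20, 30)   # sum = 65
```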
Function decorators
A decorator is a function that accepts a function as parameter and returns a function. A decorator takes the
result of a function, modifies the result and returns it. Thus decorators are useful to perform some additional
processing required by a function.
1. We should define a decorator function with another function name as parameter.
def decor(fun):
2. We should define a function inside the decorator function. This function actually modifies or decorates the
value of the function passed to the decorator function.
def decor(fun):
    def inner():
        value = fun() # access value returned by fun()
        return value+2 # increase the value by 2
    return inner # return the inner function
3. Return the inner function that has processed or decorated the value. In our example, in the last statement,
we are returning the inner function using the return statement. With this, the decorator is complete.
The next question is how to use the decorator. Once a decorator is created, it can be used for any function to
decorate or process its result. For example, let us take num() function that returns some value, e.g. 10.
def num():
    return 10
Now, we should call decor() function by passing num() function name as:
result_fun = decor(num)
So, ‘result_fun’ indicates the resultant function. Call this function and print the result, as:
print(result_fun())
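Python also provides the @ symbol to apply a decorator at the point where the decorated function is defined; this is equivalent to the manual decor(num) call above:

```python
def decor(fun):
    def inner():
        value = fun()       # call the original function
        return value + 2    # decorate its result
    return inner

@decor                      # same effect as: num = decor(num)
def num():
    return 10

print(num())                # 12
```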

ARRAYS
To work with arrays, we use numpy (numerical python) package.
For complete help on numpy: https://docs.scipy.org/doc/numpy/reference/
An array is an object that stores a group of elements (or values) of the same datatype. A numpy array has a
fixed size once created; operations that appear to resize an array actually return a new array.
NOTE: We can use for loops to display the individual elements of the array.
To work with numpy, we should import that module, as:
import numpy
import numpy as np
from numpy import *
Single dimensional (or 1D ) arrays
A 1D array contains one row or one column of elements. For example, the marks of a student in 5 subjects.
Creating single dimensional arrays
Creating arrays in numpy can be done in several ways. Some of the important ways are:
• Using array() function
• Using linspace() function
• Using logspace() function
• Using arange() function
• Using zeros() and ones() functions.

Creating 1D array using array()


To create a 1D array, we should use array() method that accepts list of elements.
Ex: arr = numpy.array([1,2,3,4,5])
Creating 1D array using linspace()
linspace() function is used to create an array with evenly spaced points between a starting point and ending
point. The form of the linspace() is:
linspace(start, stop, n)
‘start’ represents the starting element and ‘stop’ represents the ending element. ‘n’ is an integer that
represents the number of parts the elements should be divided. If ‘n’ is omitted, then it is taken as 50. Let us
take one example to understand this.
a = linspace(0, 10, 5)
In the above statement, we are creating an array ‘a’ with starting element 0 and ending element 10. This
range is divided into 5 equal parts and hence the points will be 0, 2.5, 5, 7.5 and 10. These elements are
stored into ‘a’. Please remember the starting and ending elements 0 and 10 are included.

Creating arrays using logspace


logspace() function is similar to linspace(). The linspace() produces the evenly spaced points. Similarly,
logspace() produces evenly spaced points on a logarithmically spaced scale. logspace is used in the
following format:
logspace(start, stop, n)
The logspace() starts at a value which is 10 power of ‘start’ and ends at a value which is 10 power of ‘stop’.
If ‘n’ is not specified, then its value is taken as 50. For example, if we write:
a = logspace(1, 4, 5)
This function produces values starting from 10^1 and ending at 10^4. These values are divided into 5 equal points and
those points are stored into the array ‘a’.
Creating 1D arrays using arange() function
The arange() function in numpy works like the range() function in Python, but it returns an array. The arange() function is used in the
following format:
arange(start, stop, stepsize)
This creates an array with a group of elements from ‘start’ to one element prior to ‘stop’ in steps of
‘stepsize’. If the ‘stepsize’ is omitted, then it is taken as 1. If the ‘start’ is omitted, then it is taken as 0. For
example,
arange(10)
will produce an array with elements 0 to 9.
arange(5, 10, 2)
will produce an array with elements: 5,7,9.

Creating arrays using zeros() and ones() functions


We can use zeros() function to create an array with all zeros. The ones() function is useful to create an array
with all 1s. They are written in the following format:
zeros(n, datatype)
ones(n, datatype)
where ‘n’ represents the number of elements. The ‘datatype’ argument is optional; if we do not specify
the ‘datatype’, then the default datatype used by numpy is ‘float’. See the examples:
zeros(5)
This will create an array with 5 elements all are zeros, as: [ 0. 0. 0. 0. 0. ]. If we want this array in integer
format, we can use ‘int’ as datatype, as:
zeros(5, int)
this will create an array as: [ 0 0 0 0 0 ].
If we use ones() function, it will create an array with all elements 1. For example,
ones(5, float)
will create an array with 5 float elements, all 1s: [ 1. 1. 1. 1. 1. ].
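The creation functions described above can be tried together; a short sketch (using the `np` alias form of the import):

```python
import numpy as np

a = np.linspace(0, 10, 5)   # [ 0.   2.5  5.   7.5 10. ]
b = np.logspace(1, 4, 4)    # [   10.   100.  1000. 10000.]
c = np.arange(5, 10, 2)     # [5 7 9]
d = np.zeros(5, int)        # [0 0 0 0 0]
e = np.ones(5)              # [1. 1. 1. 1. 1.]  (default dtype is float)
print(a, b, c, d, e, sep="\n")
```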

Arithmetic operations on arrays


Arithmetic operators like +, -, *, /, // and % act on an array element-wise. For example, arr+5 adds 5 to every
element.
Ex:
import numpy
arr = numpy.array([10, 20, 30])
arr+5
Important mathematical functions in numpy:
concatenate([a, b]) : joins the arrays a and b and returns the resultant array.
sqrt(arr)           : calculates the square root of each element in the array ‘arr’.
power(arr, n)       : raises each element in the array ‘arr’ to the power of ‘n’.
exp(arr)            : calculates the exponential of each element in the array ‘arr’.
sum(arr)            : returns the sum of all the elements in the array ‘arr’.
prod(arr)           : returns the product of all the elements in the array ‘arr’.
min(arr)            : returns the smallest element in the array ‘arr’.
max(arr)            : returns the biggest element in the array ‘arr’.
mean(arr)           : returns the mean (average) of all elements in the array ‘arr’.
median(arr)         : returns the median of all elements in the array ‘arr’.
std(arr)            : gives the standard deviation of the elements in the array ‘arr’.
argmin(arr)         : gives the index of the smallest element; counting starts from 0.

Ex:
numpy.sort(arr)
numpy.max(arr)
numpy.sqrt(arr)
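A few of these functions sketched on a small sample array:

```python
import numpy as np

arr = np.array([4, 9, 16])
print(np.sqrt(arr))                            # [2. 3. 4.]
print(np.sum(arr))                             # 29
print(np.min(arr), np.max(arr))                # 4 16
print(np.argmin(arr))                          # 0 (index of smallest element)
print(np.concatenate([arr, np.array([25])]))   # [ 4  9 16 25]
```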
Aliasing the arrays
If ‘a’ is an array, we can assign it to ‘b’, as:
b=a
This is a simple assignment that does not make any new copy of the array ‘a’. It means, ‘b’ is not a new
array and memory is not allocated to ‘b’. Also, elements from ‘a’ are not copied into ‘b’ since there is no
memory for ‘b’. Then how to understand this assignment statement? We should understand that we are
giving a new name ‘b’ to the same array referred by ‘a’. It means the names ‘a’ and ‘b’ are referencing same
array. This is called ‘aliasing’.
‘Aliasing’ is not ‘copying’. Aliasing means giving another name to the existing object. Hence, any
modifications to the alias object will reflect in the existing object and vice versa.
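A short sketch of aliasing:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a             # aliasing: 'b' is just another name for the same array
b[0] = 99         # modifying through 'b' ...
print(a)          # [99  2  3]  ... is reflected in 'a'
print(a is b)     # True: both names refer to the same object
```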
Viewing and Copying arrays
We can create another array that looks like an existing array by using the view() method. The new array object
is distinct, but it shares the same underlying data (memory) as the existing array. If the
newly created array is modified, the original array will also be modified, since the elements in both arrays
are like mirror images of the same data.
We can create a view of ‘a’ as:
b = a.view()
Viewing is a form of copying, but it is called ‘shallow copying’: the elements in the view, when
modified, will also modify the elements in the original array. So, both the arrays will act as one and the same.
Suppose we want both the arrays to be independent and modifying one array should not affect another array,
we should go for ‘deep copying’. This is done with the help of copy() method. This method makes a
complete copy of an existing array and its elements. When the newly created array is modified, it will not
affect the existing array or vice versa. There will not be any connection between the elements of the two
arrays.
We can create a copy of ’a’ as:
b = a.copy()
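The difference between view() (shallow copy) and copy() (deep copy) can be sketched as:

```python
import numpy as np

a = np.array([1, 2, 3])

v = a.view()      # shallow copy: new array object, same underlying data
v[0] = 99
print(a)          # [99  2  3] -> change through the view affects 'a'

c = a.copy()      # deep copy: completely independent data
c[1] = 55
print(a)          # [99  2  3] -> 'a' is unaffected
print(c)          # [99 55  3]
```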
Multi-dimensional arrays (2D, 3D, etc)
They represent more than one row and more than one column of elements. For example, marks obtained by a
group of students each in five subjects.
Creating multi-dimensional arrays
We can create multi dimensional arrays in the following ways:
• Using array() function
• Using ones() and zeros() functions
• Using eye() function
• Using reshape() function discussed earlier

Using array() function


To create a 2D array, we can pass a list of lists to the array() function.
Ex: a = array([[1, 2, 3], [4, 5, 6]])
ones() and zeros() functions
The ones() function is useful to create a 2D array with several rows and columns where all the elements will
be taken as 1. The format of this function is:
ones((r, c), dtype)
Here, ‘r’ represents the number of rows and ‘c’ represents the number of columns. ‘dtype’ represents the
datatype of the elements in the array. For example,
a = ones((3, 4), float)
will create a 2D array with 3 rows and 4 columns and the datatype is taken as float. If ‘dtype’ is omitted,
then the default datatype taken will be ‘float’. Now, if we display ‘a’, we can see the array as:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
The decimal point after each element represents that the elements are float type.
Just like ones() function, we can also use zeros() function to create a 2D array with elements filled with
zeros. Suppose, we write:
b = zeros((3,4), int)
Then a 2D array with 3 rows and 4 columns will be created where all elements will be 0s, as shown below:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
eye() function
The eye() function creates a 2D array and fills the elements in the diagonal with 1s. The general format of
using this function is:
eye(n, dtype=datatype)
This will create an array with ‘n’ rows and ‘n’ columns. The default datatype is ‘float’. For example, eye(3)
will create a 3x3 array and fills the diagonal elements with 1s as shown below:
a = eye(3)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Indexing and slicing in 2D arrays
Ex:
import numpy as np
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr[0] gives 0th row -> [1,2,3]
arr[1] gives 1st row -> [4,5,6]
arr[0,1] gives 0th row, 1st column element -> 2
arr[2,1] gives 2nd row, 1st column element -> 8
a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
a[0:2, 0:3] -> 0th row to 1st row, 0th column to 2nd column
array([[1, 2, 3],
[5, 6, 7]])
a[1:3, 2:] -> 1st row to 2nd row, 2nd column to last column
array([[ 7, 8],
[11, 12]])

INTRODUCTION TO OOPS

Procedure oriented approach


The main and sub tasks are represented by functions and procedures.
Ex: C, Pascal, FORTRAN.
Object oriented approach
The main and sub tasks are represented by classes.
Ex: C++, Java, Python.
Differences between POA and OOA
1. POA: There is no code reusability; for every new task, the programmer needs to develop a new function.
   OOA: We can create sub classes to existing classes and reuse them.
2. POA: One function may call another function, so debugging becomes difficult as we have to check every function.
   OOA: Every class is independent and hence it can be debugged without disturbing other classes.
3. POA: This approach is not modeled on human life, hence learning and using it is difficult.
   OOA: Modeled on human life, hence easy to understand and handle.
4. POA: Programmers lose control when the code size grows to between 10,000 and 1,00,000 lines; hence not suitable for bigger and complex projects.
   OOA: Suitable for handling bigger and complex projects.

Features of OOPS
1. classes and objects
2. encapsulation
3. abstraction
4. inheritance
5. polymorphism

Classes and objects


An object is anything that really exists in the world. An object's behavior consists of attributes (represented
by variables) and actions (represented by methods).
A group of objects having same behavior belong to same class or category.
A class is a model for creating objects. An object exists physically but a class does not exist physically.
Class also contains variables and methods.
Def: A class is a specification of behavior of a group of objects.
Def: An object is an instance (physical form) of a class.

Self variable
‘self’ is a default variable that contains the memory address of the instance of the current class. So, we can
use ‘self’ to refer to all the instance variables and instance methods.
Constructor
A constructor is a special method that is used to initialize the instance variables of a class. In the constructor,
we create the instance variables and initialize them with some starting values. The first parameter of the
constructor will be ‘self’ variable that contains the memory address of the instance.
A constructor may or may not have parameters.
Ex:
def __init__(self):   # default constructor
    self.name = 'Vishnu'
    self.marks = 900
Ex:
def __init__(self, n='', m=0):   # parameterized constructor with 2 parameters
    self.name = n
    self.marks = m
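A complete class using the parameterized constructor above (the class name Student is an assumption for illustration):

```python
class Student:
    def __init__(self, n='', m=0):   # parameterized constructor
        self.name = n                # instance variables created and initialized
        self.marks = m

    def display(self):               # instance method: first parameter is self
        print(self.name, self.marks)

s1 = Student('Vishnu', 900)   # constructor is called automatically
s2 = Student()                # no arguments: defaults '' and 0 are used
s1.display()                  # Vishnu 900
```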

Types of variables
The variables which are written inside a class are of 2 types:
• Instance variables
• Class variables or Static variables
Instance variables are the variables whose separate copy is created in every instance (or object). Instance
variables are defined and initialized using a constructor with ‘self’ parameter. Also, to access instance
variables, we need instance methods with ‘self’ as first parameter. Instance variables can be accessed as:
obj.var
Unlike instance variables, class variables are the variables whose single copy is available to all the instances
of the class. If we modify the copy of class variable in an instance, it will modify all the copies in the other
instances. A class method contains first parameter by default as ‘cls’ with which we can access the class
variables. For example, to refer to the class variable ‘x’, we can use ‘cls.x’.
NOTE: class variables are also called ‘static variables’. class methods are marked with the decorator
@classmethod .
NOTE: class variables can be accessed as obj.var or classname.var; instance variables only as obj.var.
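A sketch contrasting the two kinds of variables (the class name Sample is an assumption):

```python
class Sample:
    x = 10                      # class (static) variable: single shared copy

    def __init__(self):
        self.y = 20             # instance variable: separate copy per instance

    @classmethod
    def modify(cls, value):     # class method: first parameter is 'cls'
        cls.x = value           # access the class variable as cls.x

s1 = Sample()
s2 = Sample()
Sample.modify(55)
print(s1.x, s2.x)   # 55 55 -> the single copy is shared by all instances
s1.y = 99
print(s1.y, s2.y)   # 99 20 -> each instance has its own copy
```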

Namespaces
A namespace represents a memory block where names are mapped (or linked) to objects. A class maintains
its own namespace, called ‘class namespace’. In the class namespace, the names are mapped to class
variables. Similarly, every instance (object) will have its own name space, called ‘instance namespace’. In
the instance namespace, the names are mapped to instance variables.
When we modify a class variable in the class namespace, its modified value is available to all instances.
When we modify a class variable in the instance namespace, then it is confined to only that instance. Its
modified value will not be available to other instances.
Types of methods
By this time, we got some knowledge about the methods written in a class. The purpose of a method is to
process the variables provided in the class or in the method. We already know that the variables declared in
the class are called class variables (or static variables) and the variables declared in the constructor are called
instance variables. We can classify the methods in the following 3 types:

Instance methods
(a) Accessor methods
(b) Mutator methods
Class methods
Static methods

An instance method acts on instance variables. There are two types of instance methods:
1. Accessor methods: they only read the instance variables without modifying them. They are also called getter
methods.
2. Mutator methods: they not only read but also modify the instance variables. They are also called setter
methods.
PROGRAMS
4. Create getter and setter methods for a Manager with name and salary instance variables.
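A sketch of program 4 (accessor and mutator methods for a Manager; the sample values are assumptions):

```python
class Manager:
    def __init__(self, name='', salary=0.0):
        self.name = name
        self.salary = salary

    # mutator (setter) methods: modify the instance variables
    def set_name(self, name):
        self.name = name

    def set_salary(self, salary):
        self.salary = salary

    # accessor (getter) methods: only read the instance variables
    def get_name(self):
        return self.name

    def get_salary(self):
        return self.salary

m = Manager()
m.set_name('Kiran')
m.set_salary(50000.0)
print(m.get_name(), m.get_salary())   # Kiran 50000.0
```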
Static methods
We need static methods when some processing is related to the class but does not require the class or its
instances to perform any work. For example, setting environment variables, counting the number of
instances of the class, or changing an attribute in another class are tasks related to a class; such tasks
are handled by static methods. Static methods are written with the decorator @staticmethod above them. Static
methods are called in the form of classname.method().
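One of the tasks mentioned above, counting the number of instances of a class, can be sketched with a static method:

```python
class MyClass:
    n = 0                       # class variable acting as an instance counter

    def __init__(self):
        MyClass.n += 1          # incremented each time an instance is created

    @staticmethod
    def num_instances():        # no 'self' or 'cls' parameter
        print('Instances created:', MyClass.n)

a = MyClass()
b = MyClass()
c = MyClass()
MyClass.num_instances()         # called as classname.method() -> Instances created: 3
```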
Inner classes
Writing a class within another class is called inner class or nested class. For example, if we write class B
inside class A, then B is called inner class or nested class. Inner classes are useful when we want to sub
group the data of a class.

Encapsulation
Bundling up of data and methods as a single unit is called ‘encapsulation’. A class is an example for
encapsulation.
Abstraction
Hiding unnecessary data from the user is called ‘abstraction’. By default all the members of a class are
‘public’ in Python. So they are available outside the class. To make a variable private, we use double
underscore before the variable. Then it cannot be accessed from outside of the class. To access it from
outside the class, we should use: obj._Classname__var. This is called name mangling.
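A short sketch of a private variable and name mangling (the Account class and its variable are assumptions):

```python
class Account:
    def __init__(self):
        self.__balance = 1000       # double underscore makes it private

a = Account()
# print(a.__balance)                # would raise AttributeError
print(a._Account__balance)          # name mangling: obj._Classname__var -> 1000
```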

Inheritance
Creating new classes from existing classes in such a way that all the features of the existing classes are
available to the newly created classes – is called ‘inheritance’. The existing class is called ‘base class’ or
‘super class’. The newly created class is called ‘sub class’ or ‘derived class’.
Sub class object contains a copy of the super class object. The advantage of inheritance is ‘reusability’ of
code. This increases the overall performance of the organization.
Syntax: class Subclass(Baseclass):

Constructors in inheritance
In the previous programs, we derived the Student class from the Teacher class. All the methods, and the
variables in those methods, of the Teacher class (base class) are accessible to the Student class (sub
class). The constructors of the base class are also accessible to the sub class.

When the programmer writes a constructor in the sub class, then only the sub class constructor will get
executed. In this case, super class constructor is not executed. That means, the sub class constructor is
replacing the super class constructor. This is called constructor overriding.

super() method
super() is a built-in method which is useful to call the super class constructor or methods from the sub class.
super().__init__() # call super class constructor
super().__init__(arguments) # call super class constructor and pass arguments
super().method() # call super class method
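A sketch using the Teacher and Student classes mentioned in the text:

```python
class Teacher:
    def __init__(self, name):
        self.name = name

    def display(self):
        print('Name:', self.name)

class Student(Teacher):
    def __init__(self, name, marks):
        super().__init__(name)   # call super class constructor and pass arguments
        self.marks = marks

    def display(self):
        super().display()        # call super class method
        print('Marks:', self.marks)

s = Student('Vishnu', 900)
s.display()
```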
Types of inheritance
There are two types:
1. Single inheritance: deriving sub class from a single super class.
Syntax: class Subclass(Baseclass):
2. Multiple inheritance: deriving sub class from more than one super class.
Syntax: class Subclass(Baseclass1, Baseclass2, … ):
NOTE: ‘object’ is the super class for all classes in Python.
Polymorphism
poly + morphos = many + forms
If something exists in various forms, it is called ‘Polymorphism’. If an operator or method performs various
tasks, it is called polymorphism.
Ex:
Duck typing: Calling a method on any object without knowing the type (class) of the object.
Operator overloading: same operator performing more than one task.
Method overloading: same method performing more than one task.
Method overriding: executing only sub class method in the place of super class method.
ABSTRACT CLASSES AND INTERFACES
An abstract method is a method whose action is redefined in the sub classes as per the requirement of the
objects. Generally abstract methods are written without body since their body will be defined in the sub
classes
anyhow. But it is possible to write an abstract method with body also. To mark a method as abstract, we
should use the decorator @abstractmethod. On the other hand, a concrete method is a method with body.
An abstract class is a class that generally contains some abstract methods. PVM cannot create objects to an
abstract class.
Once an abstract class is written, we should create sub classes and all the abstract methods should be
implemented (body should be written) in the sub classes. Then, it is possible to create objects to the sub
classes.
A meta class is a class that defines the behavior of other classes. Any abstract class should be derived from
the meta class ABC that belongs to ‘abc’ module. So import this module, as:
from abc import ABC, abstractmethod
(or) from abc import *
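A minimal sketch of an abstract class with one abstract method (the Shape and Circle names are assumptions):

```python
from abc import ABC, abstractmethod

class Shape(ABC):                 # abstract class derived from the meta class ABC
    @abstractmethod
    def area(self):               # abstract method: body given in sub classes
        pass

class Circle(Shape):              # sub class implements all abstract methods
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3.14 * self.r * self.r

c = Circle(2)                     # objects can be created only for the sub class
print(c.area())                   # 12.56
# Shape() would raise TypeError, since PVM cannot create objects to an abstract class
```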

Interfaces in Python
We learned that an abstract class is a class which contains some abstract methods as well as concrete
methods also. Imagine there is a class that contains only abstract methods and there are no concrete methods.
It becomes an interface. This means an interface is an abstract class but it contains only abstract methods.
None of the methods in the interface will have body. Only method headers will be written in the interface.
So an interface can be defined as a specification of method headers. Since, we write only abstract methods in
the interface, there is possibility for providing different implementations (body) for those abstract methods
depending on the requirements of objects. In Python, we have to use abstract classes as interfaces.
Since an interface contains methods without body, it is not possible to create objects to an interface. In this
case, we can create sub classes where we can implement all the methods of the interface. Since the sub
classes will have all the methods with body, it is possible to create objects to the sub classes. The flexibility
lies in the fact that every sub class can provide its own implementation for the abstract methods of the
interface.

EXCEPTIONS
An exception is a runtime error which can be handled by the programmer. That means if the programmer can
guess an error in the program and he can do something to eliminate the harm caused by that error, then it is
called an ‘exception’. If the programmer cannot do anything in case of an error, then it is called an ‘error’
and not an exception.
All exceptions are represented as classes in Python. The exceptions which are already available in Python
are called ‘built-in’ exceptions. The base class for all built-in exceptions is the ‘BaseException’ class. From the
BaseException class, the sub class ‘Exception’ is derived. All errors are defined as sub classes of
‘Exception’, and all warnings are defined as sub classes of the ‘Warning’ class (itself a sub class of Exception).
An error should be compulsorily handled, otherwise the program will not execute. A warning represents a
caution and, even if it is not handled, the program will still execute. So, warnings can
be neglected but errors cannot be neglected.
Just like the exceptions which are already available in Python language, a programmer can also create his
own exceptions, called ‘user-defined’ exceptions. When the programmer wants to create his own exception
class, he should derive his class from ‘Exception’ class and not from ‘BaseException’ class. In the Figure,
we are showing important classes available in Exception hierarchy.

Exception handling
The purpose of handling the errors is to make the program robust. The word ‘robust’ means ‘strong’. A
robust program does not terminate in the middle. Also, when there is an error in the program, it will display
an appropriate message to the user and continue execution. Designing the programs in this way is needed in
any software development. To handle exceptions, the programmer should perform the following 3 tasks:
Step 1: The programmer should observe the statements in his program where there may be a possibility of
exceptions. Such statements should be written inside a ‘try’ block. A try block looks like as follows:
try:
statements
The greatness of try block is that even if some exception arises inside it, the program will not be terminated.
When PVM understands that there is an exception, it jumps into an ‘except’ block.
Step 2: The programmer should write the ‘except’ block where he should display the exception details to the
user. This helps the user to understand that there is some error in the program. The programmer should also
display a message regarding what can be done to avoid this error. Except block looks like as follows:
except exceptionname:
statements # these statements form handler
The statements written inside an except block are called ‘handlers’ since they handle the situation when the
exception occurs.
Step 3: Lastly, the programmer should perform clean up actions like closing the files and terminating any
other processes which are running. The programmer should write this code in the finally block. Finally block
looks like as follows:

finally:
statements
The specialty of finally block is that the statements inside the finally block are executed irrespective of
whether there is an exception or not. This ensures that all the opened files are properly closed and all the
running processes are properly terminated. So, the data in the files will not be corrupted and the user is at the
safe-side.

However, the complete exception handling syntax will be in the following format:
try:
statements
except Exception1:
handler1
except Exception2:
handler2
else:
statements
finally:
statements
‘try’ block contains the statements where there may be one or more exceptions. The subsequent ‘except’
blocks handle these exceptions. When ‘Exception1’ occurs, the ‘handler1’ statements are executed; when
‘Exception2’ occurs, the ‘handler2’ statements are executed, and so forth. If no exception is raised, the
statements inside the ‘else’ block are executed. Whether an exception occurs or not, the code
inside the ‘finally’ block is always executed. The following points are noteworthy:

• A single try block can be followed by several except blocks.


• Multiple except blocks can be used to handle multiple exceptions.
• We cannot write except blocks without a try block.
• We can write a try block without any except blocks.
• else block and finally blocks are not compulsory.
• When there is no exception, else block is executed after try block.
• finally block is always executed.
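A runnable sketch of the complete format above, using division by zero as the example exception:

```python
try:
    a, b = 10, 0
    result = a / b            # raises ZeroDivisionError
except ZeroDivisionError:
    print('Please do not divide by zero')
    result = None             # handler: recover with a safe value
else:
    print('Division successful:', result)   # runs only when no exception occurs
finally:
    print('finally block: always executed')
```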

FILES IN PYTHON
A file represents storage of data. A file stores data permanently so that it is available to all the programs.
Types of files in Python
In Python, there are 2 types of files. They are:
• Text files
• Binary files

Text files store the data in the form of characters. For example, if we store employee name “Ganesh”, it will
be stored as 6 characters and the employee salary 8900.75 is stored as 7 characters. Normally, text files are
used to store characters or strings.
Binary files store entire data in the form of bytes, i.e. a group of 8 bits each. For example, a character is
stored as a byte and an integer is stored in the form of 8 bytes (on a 64 bit machine). When the data is
retrieved from the binary file, the programmer can retrieve the data as bytes. Binary files can be used to store
text, images, audio and video.
Opening a file
We should use open() function to open a file. This function accepts ‘filename’ and ‘open mode’ in which to
open the file.
filehandler = open("file name", "open mode", buffering)
Ex: f = open("myfile.txt", "w")
Here, the ‘file name’ represents a name on which the data is stored. We can use any name to reflect the
actual data. For example, we can use ‘empdata’ as file name to represent the employee data. The file ‘open
mode’ represents the purpose of opening the file. The following table specifies the file open modes and their
meanings.
File open modes and their descriptions:
w  : To write data into a file. If any data is already present in the file, it is deleted and the new data is stored.
r  : To read data from the file. The file pointer is positioned at the beginning of the file.
a  : To append data to the file. Appending means adding at the end of the existing data. The file pointer is placed at the end of the file. If the file does not exist, a new file is created for writing data.
w+ : To write and read data of a file. The previous data in the file is deleted.
r+ : To read and write data into a file. The previous data in the file is not deleted. The file pointer is placed at the beginning of the file.
a+ : To append and read data of a file. The file pointer is at the end of the file if the file exists; if the file does not exist, a new file is created for reading and writing.
x  : To open the file in exclusive creation mode. The file creation fails if the file already exists.
The above Table represents file open modes for text files. If we attach ‘b’ for them, they represent modes for
binary files. For example, wb, rb, ab, w+b, r+b, a+b are the modes for binary files.
A buffer represents a temporary block of memory. ‘buffering’ is an optional integer used to set the size of
the buffer for the file. If we do not mention any buffering integer, then the default buffer size used is 4096 or
8192 bytes.
Closing a file
A file which is opened should be closed using close() method as:
f.close()
Files with characters
To write a group of characters (string), we use: f.write(str)
To read a group of characters (string), we use: str = f.read()
PROGRAMS
25. Create a file and store a group of chars.
26. Read the chars from the file.
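A sketch of programs 25 and 26 (the file name chars.txt and its contents are assumptions):

```python
# Program 25: create a file and store a group of characters
f = open('chars.txt', 'w')
f.write('Python is easy')
f.close()

# Program 26: read the characters back from the file
f = open('chars.txt', 'r')
text = f.read()
f.close()
print(text)   # Python is easy
```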
Files with strings
To write a group of strings into a file, we need a loop that repeats: f.write(str+”\n”)
To read all strings from a file, we can use: str = f.read()
Knowing whether a file exists or not
The operating system (os) module has a sub module by the name ‘path’ that contains a method isfile(). This
method can be used to know whether a file that we are opening really exists or not. For example,
os.path.isfile(fname) gives True if the file exists otherwise False. We can use it as:
import os, sys
if os.path.isfile(fname):   # if file exists,
    f = open(fname, 'r')    # open it
else:
    print(fname + ' does not exist')
    sys.exit()              # terminate the program
with statement
‘with’ statement can be used while opening a file. The advantage of with statement is that it will take care of
closing a file which is opened by it. Hence, we need not close the file explicitly. In case of an exception also,
‘with’ statement will close the file before the exception is handled. The format of using ‘with’ is:
with open(“filename”, “openmode”) as fileobject:
Ex: writing into a file
# with statement to open a file
with open('sample.txt', 'w') as f:
    f.write('I am a learner\n')
    f.write('Python is attractive\n')
Ex: reading from a file
# using with statement to open a file
with open('sample.txt', 'r') as f:
    for line in f:
        print(line)

DATA ANALYSIS USING PANDAS

Data Science
To work with data science, we need the following packages to be installed:
C:\> pip install pandas
C:\> pip install xlrd    # to extract data from Excel sheets (note: newer pandas versions use openpyxl for .xlsx files)
C:\> pip install matplotlib
Data plays an important role in our lives. For example, a chain of hospitals holds data related to medical
reports and prescriptions of their patients. A bank holds thousands of customers' transaction details. Share
market data represents minute-to-minute changes in share values. In this way, the entire world revolves
around huge amounts of data.
Every piece of data is precious as it may affect the business organization which is using that data. So, we
need some mechanism to store that data. Moreover, data may come from various sources. For example in a
business organization, we may get data from Sales department, Purchase department, Production department,
etc. Such data is stored in a system called ‘data warehouse’. We can imagine data warehouse as a central
repository of integrated data from different sources.
Once the data is stored, we should be able to retrieve it based on some criteria. A business company
may want to know how much it spent in the last 6 months on purchasing raw material, or
how many items were found defective in its production unit. Such answers cannot be easily retrieved from the huge
data available in the data warehouse. We have to retrieve the data as per the needs of the business
organization. This is called data analysis or data analytics where the data that is retrieved will be analyzed to
answer the questions raised by the management of the organization. A person who does data analysis is
called ‘data analyst’.

Once the data is analyzed, it is the duty of the IT professional to present the results in the form of pictures or
graphs so that the management will be able to understand it easily. Such graphs will also help them to
forecast the future of their company. This is called data visualization. The primary goal of data visualization
is to communicate information clearly and efficiently using statistical graphs, plots and diagrams.
Data science is a term used for techniques to extract information from the data warehouse, analyze them and
present the necessary data to the business organization in order to arrive at important conclusions and
decisions. A person who is involved in this work is called ‘data scientist’. We can find important differences
between the roles of data scientist and data analyst in following table:
Data Scientist vs Data Analyst
1. A data scientist formulates the questions that will help a business organization and then proceeds to solve them; a data analyst receives questions from the business team and provides answers to them.
2. A data scientist has strong data visualization skills and the ability to convert data into a business story; a data analyst simply analyzes the data and provides the information requested by the team.
3. A data scientist needs perfection in mathematics, statistics and programming languages like Python and R; a data analyst needs perfection in data warehousing, big data concepts, SQL and business intelligence.
4. A data scientist estimates unknown information from the known data; a data analyst looks at the known data from a new perspective.

Please see the following sample data in the excel file: empdata.xlsx.
CREATING DATA FRAMES
Data frames can be created from CSV files, Excel files, Python dictionaries, lists of tuples, JSON data, etc.
Creating data frame from .csv file
>>> import pandas as pd
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99

Creating data frame from .xlsx file


>>> df1 = pd.read_excel("f:\\python\\PANDAS\\empdata.xlsx", sheet_name="Sheet1")
>>> df1
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-20
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2000-09-10
4 1005 Laxmi Prasanna 12000.75 2000-10-08
5 1006 Anant Nag 9999.99 1999-09-09

Creating data frame from a dictionary


>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
"ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant
Nag"],
"sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
"doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}
>>> df2 = pd.DataFrame(empdata)
>>> df2
doj empid ename sal
0 10-10-2000 1001 Ganesh Rao 10000.00
1 3-20-2002 1002 Anil Kumar 23000.50
2 3-3-2002 1003 Gaurav Gupta 18000.33
3 9-10-2000 1004 Hema Chandra 16500.50
4 10-8-2000 1005 Laxmi Prasanna 12000.75
5 9-9-1999 1006 Anant Nag 9999.99
Creating data frame from a list of tuples
>>> empdata = [(1001, 'Ganesh Rao', 10000.00, '10-10-2000'),
(1002, 'Anil Kumar', 23000.50, '3-20-2002'),
(1003, 'Gaurav Gupta', 18000.33, '03-03-2002'),
(1004, 'Hema Chandra', 16500.50, '10-09-2000'),
(1005, 'Laxmi Prasanna', 12000.75, '08-10-2000'),
(1006, 'Anant Nag', 9999.99, '09-09-1999')]
>>> df3 = pd.DataFrame(empdata, columns=["eno", "ename", "sal", "doj"])
>>> df3
eno ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-2000
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-2002
3 1004 Hema Chandra 16500.50 10-09-2000
4 1005 Laxmi Prasanna 12000.75 08-10-2000
5 1006 Anant Nag 9999.99 09-09-1999
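A data frame can be created from JSON-style data as well: a list of dictionaries, where each dictionary becomes one row. A minimal sketch (the records below are sample values, not the empdata file):

```python
import pandas as pd

# Each dictionary becomes one row; keys become column names
records = [
    {"empid": 1001, "ename": "Ganesh Rao", "sal": 10000.00},
    {"empid": 1002, "ename": "Anil Kumar", "sal": 23000.50},
]
df4 = pd.DataFrame(records)
print(df4)
```

The same shape is produced by pd.read_json() when the JSON file contains a list of record objects.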
BASIC OPERATIONS ON DATAFRAMES
(Data analysis)
For all operations please refer to:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
df = pd.read_csv("f:\\python\\PANDAS\\empdata.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
1. To know the no. of rows and cols – shape
>>> df.shape
(6, 4)
>>> r, c = df.shape
>>> r
6
2. To display the first or last 5 rows – head(), tail()
>>> df.head()
>>> df.tail()
To display the first 2 or last 2 rows
>>> df.head(2)
>>> df.tail(2)
3. Displaying range of rows – df[2:5]
To display rows at index positions 2 to 4 (the end of the slice is exclusive):
>>> df[2:5]
To display all rows:
>>> df[:]
>>> df
4. To display column names – df.columns
>>> df.columns
Index(['empid', 'ename', 'sal', 'doj'], dtype='object')
5. To display column data – df.colname or df['colname']
>>> df.empid (or)
>>> df['empid']
>>> df.sal (or)
>>> df['sal']
6. To display multiple column data – df[[list of colnames]]
>>> df[['empid', 'ename']]
empid ename
0 1001 Ganesh Rao
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
5 1006 Anant Nag
7. Finding maximum and minimum – max() and min()
>>> df['sal'].max()
23000.5
>>> df['sal'].min()
9999.9899999999998
8. To display statistical information on numerical cols – describe()
>>> df.describe()
empid sal
count 6.000000 6.000000
mean 1003.500000 14917.011667
std 1.870829 5181.037711
min 1001.000000 9999.990000
25% 1002.250000 10500.187500
50% 1003.500000 14250.625000
75% 1004.750000 17625.372500
max 1006.000000 23000.500000
9. Show all rows with a condition
To display all rows where sal>10000
>>> df[df.sal>10000]
empid ename sal doj
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
To retrieve the row where salary is maximum
>>> df[df.sal == df.sal.max()]
empid ename sal doj
1 1002 Anil Kumar 23000.5 3-20-2002
10. To display only selected columns of rows matching a condition
>>> df[['empid', 'ename']][df.sal>10000]
empid ename
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
11. To know the index range - index
>>> df.index
RangeIndex(start=0, stop=6, step=1)
12. To make a column the index – set_index()
>>> df1 = df.set_index('empid')
(or) to modify the same Data Frame:
>>> df.set_index('empid', inplace=True)
>>> df
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 3-20-2002
1003 Gaurav Gupta 18000.33 03-03-02
1004 Hema Chandra 16500.50 10-09-00
1005 Laxmi Prasanna 12000.75 08-10-00
1006 Anant Nag 9999.99 09-09-99
NOTE: Now it is possible to search on empid value using loc[].
>>> df.loc[1004]
ename Hema Chandra
sal 16500.5
doj 10-09-00
Name: 1004, dtype: object
13. To reset the index back – reset_index()
>>> df.reset_index(inplace=True)
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
HANDLING MISSING DATA
Read .csv file data into Data Frame
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata1.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 NaN 18000.33 03-03-02
3 1004 Hema Chandra NaN NaN
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
To set the empid as index – set_index()
>>> df.set_index('empid', inplace=True)
>>> df
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 NaN 18000.33 03-03-02
1004 Hema Chandra NaN NaN
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To fill the NaN values by 0 – fillna(0)
>>> df1 = df.fillna(0)
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 0 18000.33 03-03-02
1004 Hema Chandra 0.00 0
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To fill each column with a different value – fillna(dictionary)
>>> df1 = df.fillna({'ename': 'Name missing', 'sal': 0.0, 'doj':'00-00-00'})
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 Name missing 18000.33 03-03-02
1004 Hema Chandra 0.00 00-00-00
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To delete all rows with NaN values – dropna()
>>> df1 = df.dropna()
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
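dropna() also accepts parameters that control which rows are dropped. A small sketch with invented values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"ename": ["A", np.nan, "C"],
                   "sal": [10000.0, 18000.0, np.nan]})

# how='all' drops only rows where every value is NaN
all_nan_dropped = df.dropna(how='all')

# subset=['sal'] considers only the 'sal' column when deciding what to drop
sal_dropped = df.dropna(subset=['sal'])
print(sal_dropped)
```

Here no row is entirely NaN, so how='all' keeps all three rows, while subset=['sal'] drops only the row whose salary is missing.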
SORTING THE DATA
Read .csv file data into a Data Frame, instructing pandas to parse ‘doj’ as a date field
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata2.csv", parse_dates=['doj'])
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
To sort on a column – sort_values(colname)
>>> df1 = df.sort_values('doj')
>>> df1
empid ename sal doj
5 1006 Anant Nag 9999.99 1999-09-09
4 1005 Laxmi Prasanna 12000.75 2000-08-10
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03

3 1004 Hema Chandra 16500.50 2002-03-03


NOTE: To sort in descending order:
>>> df1 = df.sort_values('doj', ascending=False)
To sort multiple columns differently – sort_values(by =[], ascending = [])
To sort on ‘doj’ in descending order and in that on ‘sal’ in ascending order:
>>> df1 = df.sort_values(by=['doj', 'sal'], ascending=[False, True])
>>> df1
empid ename sal doj
3 1004 Hema Chandra 16500.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
1 1002 Anil Kumar 23000.50 2002-03-03
0 1001 Ganesh Rao 10000.00 2000-10-10
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09

Data Wrangling
Data wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw
data into a format that is suitable for analysis. Python is a popular language for data wrangling due to its
powerful libraries and tools. Below is an overview of the key steps and libraries used in data wrangling with
Python:
Key Steps in Data Wrangling
1. Data Collection: Gather data from various sources (e.g., CSV files, databases, APIs, web scraping).
2. Data Cleaning: Handle missing values, remove duplicates, correct inconsistencies, and fix errors.
3. Data Transformation: Reshape, aggregate, or filter data to make it suitable for analysis.
4. Data Integration: Combine data from multiple sources.
5. Data Validation: Ensure data quality and consistency.
6. Data Export: Save the cleaned and transformed data into a usable format (e.g., CSV, Excel,
database).
Python Libraries for Data Wrangling
1. Pandas: The most widely used library for data manipulation and analysis.
o Key features: DataFrames, handling missing data, merging datasets, reshaping data.
o Example: import pandas as pd
2. NumPy: Used for numerical computations and handling arrays.
o Example: import numpy as np
3. OpenPyXL: For working with Excel files.
o Example: from openpyxl import Workbook
4. SQLAlchemy: For interacting with databases.
o Example: from sqlalchemy import create_engine
5. BeautifulSoup and Requests: For web scraping and collecting data from websites.
o Examples: from bs4 import BeautifulSoup; import requests
6. PySpark: For handling large-scale data wrangling tasks in distributed environments.

Common Data Wrangling Tasks in Python


Loading Data
import pandas as pd
# Load CSV file
df = pd.read_csv('data.csv')

# Load Excel file


df = pd.read_excel('data.xlsx')

# Load data from a database


from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table_name', engine)
Inspecting Data
# View the first few rows
print(df.head())

# Get summary statistics


print(df.describe())

# Check for missing values


print(df.isnull().sum())

# Check data types


print(df.dtypes)

Handling Missing Data


# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values with a specific value


df_filled = df.fillna(0)

# Fill missing values with the mean


df['column_name'].fillna(df['column_name'].mean(), inplace=True)

Data Transformation
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Filter rows based on a condition


df_filtered = df[df['column_name'] > 10]

# Apply a function to a column


df['new_column'] = df['column_name'].apply(lambda x: x * 2)

# Group by and aggregate


df_grouped = df.groupby('category').agg({'value': 'sum'})
Merging Data
# Merge two DataFrames
df_merged = pd.merge(df1, df2, on='key_column', how='inner')

# Concatenate DataFrames
df_concat = pd.concat([df1, df2], axis=0)
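The merge and concat calls above assume that df1 and df2 already exist. A self-contained sketch with toy employee data:

```python
import pandas as pd

emp = pd.DataFrame({"empid": [1001, 1002], "ename": ["Ganesh", "Anil"]})
pay = pd.DataFrame({"empid": [1001, 1002], "sal": [10000.0, 23000.5]})

# Inner join on the shared key column: rows are matched by empid
merged = pd.merge(emp, pay, on="empid", how="inner")
print(merged)
```

With how='left', how='right', or how='outer' instead, unmatched keys would be kept and filled with NaN.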

Exporting Data
# Save the cleaned data to a CSV file (filename is illustrative)
df.to_csv('cleaned_data.csv', index=False)
Visualizing Data
DATA VISUALIZATION USING MATPLOTLIB
Complete reference is available at:
https://matplotlib.org/api/pyplot_summary.html
CREATE DATAFRAME FROM DICTIONARY
>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
"ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant
Nag"],
"sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
"doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}
>>> import pandas as pd
>>> df = pd.DataFrame(empdata)
TAKE ONLY THE COLUMNS TO PLOT
>>> x = df['empid']
>>> y = df['sal']
DRAW THE BAR GRAPH
A bar chart shows data in the form of bars. It is useful for comparing values.
>>> import matplotlib.pyplot as plt
>>> plt.bar(x,y)
<Container object of 6 artists>
>>> plt.xlabel('employee id nos')
Text(0.5,0,'employee id nos')
>>> plt.ylabel('employee salaries')
Text(0,0.5,'employee salaries')
>>> plt.title('XYZ COMPANY')
Text(0.5,1,'XYZ COMPANY')
>>> plt.legend()
>>> plt.show()

CREATING BAR GRAPHS FROM MORE THAN 1 DATA FRAMES


For example, we can plot the empid and salaries from 2 departments: Sales team and Production team.
import matplotlib.pyplot as plt
x = [1001, 1002, 1003, 1004, 1005, 1006]
y = [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99]
x1 = [1010, 1011, 1012, 1013, 1014, 1015]
y1 = [5000, 6000, 4500.00, 12000, 9000, 10000]
plt.bar(x,y, label='Sales dept', color='red')
plt.bar(x1,y1, label='Production dept', color='green')
plt.xlabel('emp id')
plt.ylabel('salaries')
plt.title('XYZ COMPANY')
plt.legend()
plt.show()
CREATING HISTOGRAM
A histogram shows the distribution of values. It is similar to a bar graph, but it shows values grouped
into bins or intervals.
NOTE: histtype : {‘bar’, ‘barstacked’, ‘step’, ‘stepfilled’}
import matplotlib.pyplot as plt
emp_ages = [22,45,30,60,60,56,60,45,43,43,50,40,34,33,25,19]
bins = [0,10,20,30,40,50,60]
plt.hist(emp_ages, bins, histtype='bar', rwidth=0.8, color='cyan')
plt.xlabel('employee ages')
plt.ylabel('No. of employees')
plt.title('XYZ COMPANY')
plt.legend()
plt.show()
CREATING A PIE CHART
A pie chart shows a circle divided into sectors, each representing a proportion of the whole.
To display no. of employees of different departments in a company.
import matplotlib.pyplot as plt
slices = [50, 20, 15, 15]
depts = ['Sales', 'Production', 'HR', 'Finance']
cols = ['cyan', 'magenta', 'blue', 'red']
plt.pie(slices, labels=depts, colors=cols, startangle=90, shadow=True,
explode= (0, 0.2, 0, 0), autopct='%.1f%%')
plt.title('XYZ COMPANY')
plt.show()
Feature Engineering and Selection
Feature engineering and feature selection are critical steps in the machine learning pipeline. They involve
creating new features from raw data and selecting the most relevant features to improve model performance.
Python provides powerful libraries and tools to perform these tasks effectively.

Feature Engineering
Feature engineering is the process of creating new features or transforming existing ones to better represent
the underlying problem and improve model performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill with mean, median, or mode.
o Use advanced techniques like KNN imputation.
df['column'].fillna(df['column'].mean(), inplace=True)

Encoding Categorical Variables:


• One-Hot Encoding: Convert categorical variables into binary columns.
• Label Encoding: Convert categories into numerical labels.
# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_column'] = le.fit_transform(df['category_column'])
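The effect of one-hot encoding is easiest to see on concrete data. A minimal sketch with an invented 'dept' column:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["Sales", "HR", "Sales"]})

# get_dummies creates one binary column per distinct category
encoded = pd.get_dummies(df, columns=["dept"])
print(encoded.columns.tolist())
```

Each row now carries a 1 in exactly one of the dept_* columns, which is what distance-based and linear models expect.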
Scaling and Normalization:
• Standardization: Scale features to have zero mean and unit variance.
• Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1).
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df['scaled_column'] = minmax_scaler.fit_transform(df[['column']])

Creating Interaction Features:


• Combine two or more features to create new ones.
df['interaction_feature'] = df['feature1'] * df['feature2']

Binning:
• Convert continuous variables into discrete bins.
df['binned_column'] = pd.cut(df['continuous_column'], bins=5)
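Binning can be demonstrated on concrete values. A small sketch using invented ages and explicit bin edges:

```python
import pandas as pd

ages = pd.Series([22, 45, 30, 60, 19])

# Cut into three labelled age groups; intervals are right-closed by default
groups = pd.cut(ages, bins=[0, 30, 50, 70], labels=["young", "middle", "senior"])
print(groups.tolist())
```

Note that 30 falls in the "young" bin because the interval (0, 30] includes its right edge.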

Date/Time Feature Extraction:


• Extract useful information from date/time columns (e.g., day, month, year).
df['year'] = pd.to_datetime(df['date_column']).dt.year
df['month'] = pd.to_datetime(df['date_column']).dt.month
Text Feature Engineering:
• Tokenization, TF-IDF, word embeddings, etc.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['text_column'])

Feature Selection
Feature selection involves identifying and selecting the most relevant features to improve model
performance and reduce overfitting.
Common Techniques for Feature Selection
1. Filter Methods:
o Use statistical measures to select features.
o Examples: Correlation, Chi-Square, Mutual Information.
# Correlation-based feature selection
correlation_matrix = df.corr()
relevant_features = correlation_matrix['target'].abs().sort_values(ascending=False)
Wrapper Methods:
• Use a machine learning model to evaluate feature subsets.
• Examples: Recursive Feature Elimination (RFE).
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)
selected_features = X.columns[rfe.support_]
Embedded Methods:
• Features are selected as part of the model training process.
• Examples: Lasso Regression, Decision Trees.
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
Dimensionality Reduction:
• Reduce the number of features while preserving information.
• Examples: PCA, t-SNE.
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
Feature Engineering and Selection Workflow
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('data.csv')

# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)

# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])

# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Feature Selection
X = df.drop('target', axis=1)
y = df['target']

# Select top 10 features using ANOVA F-test


selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features)
Feature Extraction and Engineering
Feature extraction and feature engineering are essential steps in preparing data for machine learning models.
While they are closely related, they serve slightly different purposes:
• Feature Engineering: Creating new features or transforming existing ones to better represent the
underlying problem.
• Feature Extraction: Reducing the dimensionality of data by extracting the most important
information from raw data.
Both processes aim to improve model performance, reduce overfitting, and make the data more interpretable.
Feature Extraction
Feature extraction is often used when dealing with high-dimensional data (e.g., images, text, or signals) to
reduce the number of features while retaining the most important information.
Common Techniques for Feature Extraction
Principal Component Analysis (PCA):
o Reduces dimensionality by projecting data onto orthogonal axes (principal components).
Linear Discriminant Analysis (LDA):
• Reduces dimensionality while preserving class separability (useful for supervised learning).

t-SNE (t-Distributed Stochastic Neighbor Embedding):


• Reduces dimensionality for visualization (not suitable for feature extraction in models).
Autoencoders:
• Neural networks used for unsupervised dimensionality reduction.
Text Feature Extraction:
• Convert text into numerical features using techniques like Bag of Words, TF-IDF, or word
embeddings.
Image Feature Extraction:
• Use pre-trained models (e.g., VGG, ResNet) or techniques like Histogram of Oriented Gradients
(HOG) to extract features from images.

Feature Engineering
Feature engineering involves creating new features or transforming existing ones to improve model
performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill missing values with mean, median, mode, or use advanced techniques like KNN
imputation.
Encoding Categorical Variables:
• One-Hot Encoding: Convert categorical variables into binary columns.
• Label Encoding: Convert categories into numerical labels.
Scaling and Normalization:
• Standardization: Scale features to have zero mean and unit variance.
• Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1).
Creating Interaction Features:
• Combine two or more features to create new ones.
Binning:
• Convert continuous variables into discrete bins.
Date/Time Feature Extraction:
• Extract useful information from date/time columns (e.g., day, month, year).
Polynomial Features:
• Create polynomial combinations of features.
Example: Feature Extraction and Engineering Workflow
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load data
df = pd.read_csv('data.csv')

# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)

# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])

# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Feature Extraction
X = df.drop('target', axis=1)
y = df['target']

# Apply PCA for dimensionality reduction


pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Display explained variance


print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Feature Engineering on Numeric Data, Categorical Data, Text Data, & Image Data
Feature engineering is the process of transforming raw data into meaningful features that improve the
performance of machine learning models. The techniques used depend on the type of data (numeric,
categorical, text, or image). Below is a detailed guide on feature engineering for each type of data:

Feature Engineering for Numeric Data


Numeric data is the most common type of data in machine learning. Feature engineering for numeric data
involves scaling, transforming, and creating new features.
Common Techniques
Scaling and Normalization:
o Standardization: Scale features to have zero mean and unit variance.
Log Transformation:
• Reduce skewness in data.
Binning:
• Convert continuous variables into discrete bins.
Polynomial Features:
• Create polynomial combinations of features.
Interaction Features:
• Combine two or more numeric features.
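The numeric techniques listed above can be sketched on a toy salary column (all values invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sal": [10000.0, 23000.5, 18000.33]})

# Log transform to reduce skewness (log1p handles zero values safely)
df["log_sal"] = np.log1p(df["sal"])

# Interaction feature: product of two numeric columns
df["bonus"] = [500.0, 800.0, 600.0]
df["sal_x_bonus"] = df["sal"] * df["bonus"]
print(df.head())
```
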

Feature Engineering for Categorical Data


Categorical data represents discrete values (e.g., gender, color). Feature engineering for categorical data
involves encoding and creating new features.
Common Techniques
One-Hot Encoding:
o Convert categorical variables into binary columns.
Label Encoding:
• Convert categories into numerical labels.
Target Encoding:
• Encode categories based on the target variable (mean of the target for each category).
Frequency Encoding:
• Encode categories based on their frequency in the dataset.
Creating Interaction Features:
• Combine categorical and numeric features.
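Frequency encoding can be written directly with value_counts() and map(). A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["Sales", "HR", "Sales", "Sales"]})

# Replace each category with how often it occurs in the column
freq = df["dept"].value_counts()
df["dept_freq"] = df["dept"].map(freq)
print(df)
```
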

Feature Engineering for Text Data


Text data requires converting unstructured text into structured numerical features.
Common Techniques
Bag of Words (BoW):
o Represent text as a vector of word frequencies.
TF-IDF (Term Frequency-Inverse Document Frequency):
• Weigh words based on their importance in the document and corpus.
Word Embeddings:
• Use pre-trained models like Word2Vec, GloVe, or FastText to represent words as dense vectors.
Text Cleaning:
• Remove stopwords, punctuation, and perform stemming/lemmatization.
N-Grams:
• Capture sequences of words (e.g., bigrams, trigrams).
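A Bag of Words representation can be illustrated without any library, using plain word counts (a minimal sketch; in practice one would use scikit-learn's CountVectorizer or TfidfVectorizer):

```python
from collections import Counter

docs = ["data drives business", "business analytics uses data"]

# Build a sorted vocabulary, then count word occurrences per document
vocab = sorted({word for doc in docs for word in doc.split()})
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length numeric vector over the shared vocabulary, which is the structured form a model needs.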
Feature Engineering for Image Data
Image data requires extracting meaningful features from pixel values.
Common Techniques
Resizing and Normalization:
o Resize images to a fixed size and normalize pixel values.
Feature Extraction Using Pre-trained Models:
• Use models like VGG, ResNet, or Inception to extract features.
Edge Detection:
• Use techniques like Canny edge detection to highlight edges.
Histogram of Oriented Gradients (HOG):
• Extract features based on gradient orientations.
Color Histograms:
• Extract color distribution features.
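The simplest of these, a color histogram, can be sketched on a synthetic image array (random pixels stand in for a real image file):

```python
import numpy as np

# Synthetic 8x8 RGB image with pixel values in 0-255
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3))

# One 8-bin histogram per channel -> a 24-value feature vector
features = np.concatenate(
    [np.histogram(image[:, :, c], bins=8, range=(0, 256))[0] for c in range(3)]
)
print(features)
```
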

Feature Scaling and Feature Selection are two important preprocessing techniques in machine learning and
data analysis. They play a crucial role in improving model performance, reducing computational complexity,
and ensuring better interpretability of the data.

Feature Scaling
Feature scaling is the process of normalizing or standardizing the range of independent variables (features)
in the dataset. This is particularly important for algorithms that are sensitive to the magnitude of the data,
such as distance-based algorithms or gradient descent-based optimization.
Why is Feature Scaling Important?
• Ensures that all features contribute equally to the model.
• Improves convergence speed for optimization algorithms (e.g., gradient descent).
• Prevents features with larger magnitudes from dominating those with smaller magnitudes.
Common Techniques for Feature Scaling
1. Normalization (Min-Max Scaling):
o Scales features to a fixed range, usually [0, 1].
o Formula: X_scaled = (X − X_min) / (X_max − X_min)
o Suitable for algorithms like neural networks and k-nearest neighbors (KNN).
2. Standardization (Z-score Normalization):
o Scales features to have a mean of 0 and a standard deviation of 1.
o Formula: X_scaled = (X − μ) / σ, where μ is the mean and σ is the standard deviation.
o Suitable for algorithms like linear regression, logistic regression, and support vector machines
(SVM).
3. Robust Scaling:
o Uses the median and interquartile range (IQR) to scale features, making it robust to outliers.
o Formula: X_scaled = (X − median) / IQR
4. Max Abs Scaling:
o Scales each feature by its maximum absolute value.
o Suitable for sparse data.
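The normalization and standardization formulas above can be verified directly with NumPy (a sketch on invented values):

```python
import numpy as np

x = np.array([10000.0, 23000.5, 18000.33, 12000.75])

# Min-max scaling: (X - Xmin) / (Xmax - Xmin), lands in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (X - mean) / std, gives mean 0 and standard deviation 1
x_std = (x - x.mean()) / x.std()
print(x_minmax.round(3), x_std.round(3))
```

These are the same computations scikit-learn's MinMaxScaler and StandardScaler perform column by column.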
Feature Selection
Feature selection is the process of selecting a subset of relevant features (variables) to use in model
construction. It helps reduce overfitting, improve model interpretability, and decrease computational costs.
Why is Feature Selection Important?
• Reduces the dimensionality of the dataset, which can improve model performance.
• Removes irrelevant or redundant features, reducing noise.
• Speeds up training and inference times.
Common Techniques for Feature Selection
1. Filter Methods:
o Select features based on statistical measures (e.g., correlation, mutual information, chi-
square).
o Examples:
▪ Correlation coefficient for linear relationships.
▪ Mutual information for non-linear relationships.
▪ Chi-square test for categorical features.
2. Wrapper Methods:
o Use a machine learning model to evaluate the performance of subsets of features.
o Examples:
▪ Forward Selection: Start with no features and add one at a time.
▪ Backward Elimination: Start with all features and remove one at a time.
▪ Recursive Feature Elimination (RFE): Recursively removes the least important
features.
3. Embedded Methods:
o Perform feature selection as part of the model training process.
o Examples:
▪ Lasso (L1 regularization): Penalizes less important features by shrinking their
coefficients to zero.
▪ Ridge (L2 regularization): Reduces the impact of less important features but does not
eliminate them.
▪ Tree-based methods: Feature importance scores from decision trees, random forests,
or gradient boosting.
4. Dimensionality Reduction:
o Transform features into a lower-dimensional space.
o Examples:
▪ Principal Component Analysis (PCA): Reduces dimensions while preserving variance.
▪ Linear Discriminant Analysis (LDA): Reduces dimensions while preserving class
separability.
▪ t-SNE and UMAP: Non-linear dimensionality reduction for visualization.

Key Differences Between Feature Scaling and Feature Selection


Aspect        Feature Scaling                                     Feature Selection
Purpose       Normalize/standardize feature values                Select the most relevant features
Impact        Improves algorithm performance and speed            Reduces dimensionality and overfitting
Techniques    Normalization, standardization, robust scaling      Filter, wrapper, embedded methods, PCA
When to Use   Required for distance-based or gradient-based       Useful for high-dimensional datasets
              algorithms
UNIT III


Building Machine Learning Models:


In today’s rapidly evolving business landscape, data has become one of the most valuable
assets. Machine learning (ML), a subset of artificial intelligence (AI), is revolutionizing how
businesses derive insights, optimize processes, and enhance decision-making. Understanding
how to build and apply machine learning models is not only a technical skill but also a
strategic advantage.
Machine learning is the process of developing algorithms that enable computers to learn from
data and improve their performance over time without being explicitly programmed. Unlike
traditional programming, where rules are hard-coded, ML models identify patterns and
relationships in data, enabling them to make predictions or decisions.
In a business context, machine learning applications range from predictive analytics and
customer segmentation to supply chain optimization and fraud detection. For Management
students, understanding ML is vital for leveraging data-driven strategies to create value.

Steps to Build Machine Learning Models


Building an ML model involves a structured process. Here’s an overview of the key steps:

1. Define the Problem


The first step in any machine learning project is identifying the business problem. Clearly
define what you want to achieve—whether it's predicting customer churn, recommending
products, or optimizing pricing strategies. This step is crucial for aligning ML efforts with
business objectives.
Example: A retail business might aim to predict which customers are likely to leave
their loyalty program in the next six months.

2. Collect and Prepare Data


Data is the foundation of machine learning. Begin by gathering relevant data from various
sources such as databases, CRM systems, or third-party providers. The data must be cleaned
and pre-processed to handle missing values, remove outliers, and normalize formats.
Example: For customer churn prediction, you might collect data on purchase history,
demographics, and engagement metrics.

3. Select the Right Algorithm


Choosing the right machine learning algorithm depends on the problem type:
• Supervised Learning: Used for labelled data (e.g., predicting sales revenue).

You might also like