EBOOK - Python Crash Course For Data Analysis
AI PUBLISHING
© Copyright 2019 by AI Publishing
All rights reserved.
First Printing, 2019
Edited by AI Publishing
Ebook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7330426-4-2
ISBN-10: 1-7330426-4-4
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part
of the content within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for
educational and entertainment purposes only. No warranties of
any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical,
or professional advice. Please consult a licensed professional before
attempting any techniques outlined in this book.
https://www.aispublishing.net/python-crash-course-da
About the Publisher
If you are also eager to learn the A to Z of AI and Data Sciences
and have no clue where to start, AI Publishing is the place
to go.
While you can read this book without picking up your laptop,
we highly recommend you experiment with the practical part.
1. Getting Started
1.1. Installing Python
1.2. Package Managers and Repositories
1.2.1. PyPI and PIP
1.2.2. Anaconda Repo and Conda
1.2.3. Anaconda Repo and Conda
Why Python?
First, Python is a high-level, multi-purpose, interpreted language.
It is well established and focuses on code readability and ease
of use. Furthermore, there is a huge number of packages
and libraries available for Python, from scientific ones to Big
Data specific ones. All these libraries, combined with Python’s
gentle learning curve, make the language an incredible tool with
great versatility.
Requirements
This box lists all requirements that must be met before
proceeding to the next topics. Generally, it works as a
checklist, to see if everything is ready before a tutorial.
Further Readings
Here you will be pointed to some external reference or source
that serves as additional content about the specific
topic being studied. In general, it consists of package
documentation and cheat sheets.
Hands-on Time
Here you will be pointed to an external file to train and test all
the knowledge acquired about a tool that has been studied.
Generally, these files are Jupyter notebooks (.ipynb), Python
(.py) files or documents (.pdf).
§§ Miniconda
§§ Anaconda Distribution
10 | G e t t i n g S ta r t e d
§§ PIP Commands
The most common commands used to manage packages
using pip are listed below.
o Install Packages
To install any package from the PyPI, you just need to type the
command below:
pip install package_name
o Search Packages
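Search packages available at PyPI (PyPI has since disabled the
search API behind this command, so it may fail on current
installations):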
pip search search_query
o Uninstall Packages
If you want to remove a package:
pip uninstall package_name
o Search Packages
Search packages available at PyPI:
conda search search_query
o Uninstall Packages
If you want to remove an installed package:
conda remove package_name
o Create an Environment
Create a new environment:
conda create --name environment_name
o Activate Environment
Activate a specific environment, on OS X/Ubuntu:
source activate environment_name
On Windows:
activate environment_name
o List Environments
List all environments:
conda env list
o Remove an Environment
If you want to remove an environment:
conda env remove --name environment_name
§§ Interface
At first glance, it seems like just another CLI. Initially, it shows
basic information about the installed Python version.
§§ Input Field
The input field is represented by the triple ‘>’ symbol. It is
where the commands/script should be typed.
>>>
Hands-on Time – Using Python REPL
While reading the next topics, keep the Python REPL
open. Execute all the operators/commands presented to
confirm their functionality yourself. Feel free to experiment
with other commands or combinations.
D ata A n a ly s i s using Python | 19
2.10.3. Comments
Any text starting with "#" is completely ignored by the interpreter, up to the end of the line.
>>> # This is just a comment
§§ Strings
In Python, strings are expressed in double ("...") or single ('...')
quotes. If you want to use a quote character inside your string, you
can escape it using \, or use a different kind of quote in its definition.
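As a small illustration of the two options just described, the sketch below defines the same text twice, once by escaping the quote and once by switching the kind of quote. The variable names are made up for this example.

```python
# Escaping a single quote inside a single-quoted string
escaped = 'String\'s'

# Using double quotes in the definition avoids the escape entirely
alternate = "String's"

# Both definitions produce exactly the same string value
print(escaped == alternate)
print(alternate)
```

Either form is fine; switching the outer quote usually reads more easily than escaping.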
20 | Python for D ata A n a ly s i s - B a s i c s
PYTHON REPL:
>>> str(42) # the string representation of integer 42
'42'
>>> str(.5) # the string representation of float 0.5
'0.5'
§§ None Type
The None keyword represents the NoneType in Python. It is a
type that represents the absence of a value. In general, it is used
to show that a function did not return any value.
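A minimal sketch of that behaviour, assuming a hypothetical helper named log_message that prints but has no return statement:

```python
def log_message(text):
    # Hypothetical helper: it prints, but never returns anything explicitly
    print(text)

# Calling a function without a return statement gives back None
result = log_message("hello")

print(result is None)          # identity check against None
print(type(None).__name__)     # the name of the type of None
```

Comparing with `is None` is the usual way to test for a missing value.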
PYTHON REPL:
>>> h = 'Hello' # the string value "Hello" is bound to name h
>>> H = 'Hi' # the string value "Hi" is bound to name H
>>> a = b = 2 # a and b are equal to 2
>>> c, d = 3, 4 # c and d are equal to 3 and 4, respectively
PYTHON REPL:
>>> "hello" * 3
'hellohellohello'
>>> "ABC" + "DEF"
'ABCDEF'
This means that any expression will follow this order for
resolution: first the order in the Precedence Table, then the order
in the Associativity Table.
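The resolution order can be checked with a few small expressions. The values below are chosen only for illustration; the sketch relies on the standard Python rules that * binds tighter than +, that ** groups from right to left, and that - groups from left to right.

```python
# Precedence: the product is evaluated before the sum
p = 1 + 2 * 5          # same as 1 + (2 * 5)

# Parentheses override the default precedence
q = (1 + 2) * 5

# Associativity: ** groups from right to left
r = 2 ** 3 ** 2        # same as 2 ** (3 ** 2)

# Associativity: - groups from left to right
s = 10 - 4 - 3         # same as (10 - 4) - 3

print(p, q, r, s)
```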
§§ Self-Increment
It is often necessary to increment a variable. There
is a small shortcut to perform this using the operators seen
before. The shortcut operations and their equivalent representations
are shown below.
Operator    Example    Equivalent Representation
+=          x += 2     x = x + 2
-=          x -= 3     x = x - 3
*=          x *= 4     x = x * 4
/=          x /= 5     x = x / 5
%=          x %= 6     x = x % 6
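The shortcuts can be tried in sequence on a single variable. The starting values below are arbitrary and only serve the illustration.

```python
x = 10
x += 2     # x = x + 2
x -= 3     # x = x - 3
x *= 4     # x = x * 4
x /= 5     # x = x / 5 (true division always gives a float)

y = 20
y %= 6     # y = y % 6 (remainder of the division)

print(x, y)
```

Note that /= turns an integer variable into a float, even when the division is exact.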
§§ Boolean Logic
When you want to combine multiple Boolean logic operations,
you can use the and, or and not keywords. When used with Boolean
values, the operator and returns True only when both operands are
True, the operator or returns True when at least one is True, and
not inverts the Boolean value.
PYTHON REPL:
>>> 1 > 0 and 2 <= 4 # True and True is True
True
>>> 0 == 1 or 5 < 3 # False or False is False
False
>>> not True # Inverts bool value
False
2.3.1. Lists
As defined previously, lists are ordered, mutable sequences of
values. They can contain multiple types of data. To define a
list, you just need to write a sequence of comma-separated
values between square brackets.
§§ Defining Lists
Square brackets are used in list definitions.
PYTHON REPL:
>>> a = [] # Empty list
>>> a = list() # Empty list
>>> b = [1, 2, 3, 4, 5, 6] # List with numeric values
>>> c = [42, 'hi', True] # List with mixed types
§§ Function list()
Beyond creating an empty list, you can also create a list from
existing objects such as strings.
PYTHON REPL:
>>> d = 'ABCDEF' # String
>>> list(d) # Converted to list
['A', 'B', 'C', 'D', 'E', 'F']
§§ Indexing
Each value of the list is accessible by an index. The index
represents the position of the value in the list, starting at position
0. Additionally, negative indexes can be used to count from the
end of the list. As lists are mutable, the index can also be used
to change the values.
PYTHON REPL:
>>> b[0] # Accessing first value of b
1
>>> b[-1] # Accessing last value of b
6
>>> b[-3] # Accessing 3rd value from the right of b
4
>>> c[2] = False # Modifying the 3rd value of c
>>> c
[42, 'hi', False]
§§ Slicing
Sometimes, instead of accessing a single value from a list,
you may want to select a sub-list. For this, there is the
slicing operation. Generally, slicing is used as [start:end],
resulting in the values from the start position up to the end
position; the end position is excluded from the resulting sub-list.
If omitted, the start defaults to 0 and the end to the length of the
list. Furthermore, slicing can be used to change multiple values at
once. The examples below show how slicing works.
PYTHON REPL:
>>> b[0:2] # Values from position 0 to 2 (excluded)
[1, 2]
>>> b[-3:-1] # Values from -3 to -1 (excluded)
[4, 5]
>>> b[:-2] # Values from start to -2 (excluded)
[1, 2, 3, 4]
>>> c[1:] # All values from index 1 until the end
['hi', False]
>>> c[:] # All values
[42, 'hi', False]
>>> c[0:2] = ['a', 'b'] # Change multiple values with slicing
>>> c
['a', 'b', False]
In the image above, slicing [-4:-1] returns [12, 13, 14] and [0:2]
returns [11, 12].
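The two results quoted from the figure can be reproduced in code. The list below is an assumption: five consecutive values starting at 11, which is the shortest list consistent with both slices.

```python
# Assumed list behind the figure (not given explicitly in the text)
values = [11, 12, 13, 14, 15]

# From the 4th position counting from the end up to, but excluding, the last
middle = values[-4:-1]

# From position 0 up to, but excluding, position 2
first_two = values[0:2]

print(middle)      # [12, 13, 14]
print(first_two)   # [11, 12]
```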
PYTHON REPL:
>>> [1, 2, 3] + [4, 5, 6] # Concatenate lists together
[1, 2, 3, 4, 5, 6]
>>> [1, 2, 1] * 2 # Repeat List values
[1, 2, 1, 1, 2, 1]
>>> li = [1, 2, 3]
PYTHON REPL:
>>> li = [1, 2, 3]
>>> li.append(4) # Append a value to the list end
>>> li
[1, 2, 3, 4]
2.3.2. Tuples
Tuples are immutable and can contain multiple types of data.
Therefore, they are generally used to store static data.
§§ Defining Tuples
Parentheses are used to define tuples. To differentiate a tuple
with a single value from a simple parenthesized expression, it must
contain a trailing comma. Examples:
PYTHON REPL:
>>> a = () # Empty tuple
>>> t = (1,) # Tuple with one value
>>> b = (1, 2, 3, 4, 5, 6) # Tuple with numeric values
>>> c = (42, 'hi', True) # Tuple with mixed types
§§ Function tuple()
Like lists, you can also create tuples from existing objects
such as strings or lists.
PYTHON REPL:
>>> d = 'ABC' # String
>>> e = [1, 2, 3] # List
>>> tuple(d) # Converted to tuple
('A', 'B', 'C')
>>> tuple(e) # Converted to tuple
(1, 2, 3)
§§ Indexing
Just like lists, tuples can be indexed, and the same indexing
rules apply, except that their values cannot be modified. Examples:
PYTHON REPL:
>>> b[0] # Accessing first value of b
1
>>> b[-1] # Accessing last value of b
6
>>> b[-3] # Accessing 3rd value from the right of b
4
>>> b[2] = 0 # Error: tuples are immutable
TypeError: 'tuple' object does not support item assignment
§§ Slicing
Tuples support slicing too, although slices cannot be used to
change their values.
PYTHON REPL:
>>> b[0:2] # Values from position 0 to 2 (excluded)
(1, 2)
>>> b[-3:-1] # Values from -3 to -1 (excluded)
(4, 5)
>>> b[:-2] # Values from start to -2 (excluded)
(1, 2, 3, 4)
>>> c[1:] # All values from index 1 until the end
('hi', True)
>>> c[:] # All values
(42, 'hi', True)
PYTHON REPL:
>>> (1, 2, 3) + (4, 5, 6) # New concatenated tuple
(1, 2, 3, 4, 5, 6)
>>> (1,2) * 2 # New repeated tuple
(1, 2, 1, 2)
PYTHON REPL:
>>> a, b, c = (42,), [1, 1], "abc"
>>> len(a)
1
>>> len(b)
2
>>> len(c)
3
2.4. Modules
The Python REPL is great for testing small code snippets and
language syntax. However, once it is closed, all the variables
and operations defined are lost. Therefore, to write longer
programs and save the results, a text editor is necessary. In the
editor, you can create a new file that contains the instructions
and variable definitions; this file is called a script. Python
scripts are saved with the .py extension.
§§ Running Scripts
Consider a script called myscript.py. In order to run this script,
follow these instructions:
1. Open your system terminal.
2. Navigate to the script folder (where the myscript.py
file is located).
3. Run the command:
python myscript.py
Try to run the examples below. Paste their code in the text
editor and save them as the files example1.py and example2.py. Then
follow the instructions above to see the code outputs.
example1.py
a = 5
print(a)
OUTPUT
5
example2.py
b = 21 * 2
print(b)
OUTPUT
42
§§ Indentation
Indentation describes the spaces between the left margin
and the start of the text. In Python, indentation means that the
code belongs to a block of code; in other words, it indicates
whether a line is in the same block or not. Commonly, 2 or 4
white spaces or 1 tab can be used for indentation. However,
it is highly recommended that only one kind be used
throughout the code, either white spaces or tabs.
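A short sketch of how indentation groups lines into a block, using 4 white spaces per level. The variable names and values are hypothetical.

```python
temperature = 30   # hypothetical value for this sketch
messages = []

if temperature > 25:
    # These two lines share the same indentation, so they form one block
    messages.append("warm")
    messages.append("drink water")

# Back at the left margin: this line is outside the block and always runs
messages.append("done")

print(messages)
```

Changing temperature to 20 would skip both indented lines, while the last append would still run.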
§§ Multiple Conditions
Sometimes we need to test various conditions. For this, we can
chain multiple if…else statements. Additionally, a shorter form
of else if can be used: elif.
if a > 3: a = 5
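As an illustration of chaining conditions with elif, here is a minimal sketch. The function name describe and its messages are made up for this example.

```python
def describe(a):
    # Conditions are tested from top to bottom; only the first match runs
    if a > 3:
        return "greater than 3"
    elif a == 3:
        return "equal to 3"
    else:
        return "less than 3"

print(describe(5))
print(describe(3))
print(describe(0))
```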
§§ Nested Statements
In order to test multiple dependent conditions, if statements
can be nested together, resulting in various blocks of code.
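A minimal sketch of nested statements, where the inner condition is only tested when the outer one holds. The function name classify is hypothetical.

```python
def classify(n):
    # Outer condition: is the number positive?
    if n > 0:
        # Inner block: only reached for positive numbers
        if n % 2 == 0:
            return "positive even"
        else:
            return "positive odd"
    else:
        return "not positive"

print(classify(4))
print(classify(7))
print(classify(-2))
```

Each nesting level adds one more level of indentation, which is what delimits the blocks.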
2.6. Loops
There are two basic loops in Python, for and while. Both are
used to loop through a block of code but have different use cases.
2.6.1. While
The while loop repeats a block of code while a given expression
is evaluated as True. The condition is tested before each
execution.
while EXPRESSION:
    # Code executed while True
    # More code executed while True
In the examples above we can see how the while loop works. After
each execution of the code block, the expression is re-evaluated to
check whether the block will run again.
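A minimal sketch of that re-evaluation, using a hypothetical counter that is increased until the expression becomes False:

```python
count = 0
printed = []

# The expression count < 3 is tested before every pass through the block
while count < 3:
    printed.append(count)
    count += 1   # without this line the expression would stay True forever

print(printed)   # [0, 1, 2]
print(count)     # 3
```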
2.6.2. For
When you need finer control over the total number of
executions, the for loop is used. In Python, the for loop uses a
During each pass through the loop, a new value from the
sequence is passed to the variable VALUE, until the sequence
end is reached. Pay attention to the use of the in keyword
before the operator.
§§ Iterators
Simply put, an iterator corresponds to a sequence of elements.
In Python, lists, tuples and strings are examples of iterable
objects. If you want to create a sequence of integers to iterate
over, the range function can be used.
INPUT
# FOR example 1
l = ["a", "b", "c"] # or tuple
for i in l:
    print(i)
OUTPUT
a
b
c
INPUT
# FOR example 2
s = "hi"
for c in s:
    print(c)
OUTPUT
h
i
§§ Range Function
This built-in function creates a sequence of integers. It can be
used with 1, 2 or 3 arguments. With one argument, the sequence
starts from 0 and runs until the given argument (excluded) with a
step of 1. With two arguments, the first corresponds to the
initial (included) value of the sequence and the second to the
final (excluded) value. Finally, three arguments work like
two, but the third argument corresponds to the increment
size. Check the examples below.
INPUT
# Using range
l = [1, "a", -2, 42]
for i in range(len(l)):
    print(l[i])
OUTPUT
1
a
-2
42
INPUT
# Using the sequence directly
l = [1, "a", -2, 42]
for v in l:
    print(v)
OUTPUT
1
a
-2
42
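The three call forms of range described above can be sketched as follows. The ranges are converted to lists only so that their contents can be printed; the numbers are arbitrary.

```python
# One argument: from 0 up to, but excluding, 4
one_arg = list(range(4))

# Two arguments: from 2 (included) up to 6 (excluded)
two_args = list(range(2, 6))

# Three arguments: from 1 up to 10 (excluded), in increments of 3
three_args = list(range(1, 10, 3))

print(one_arg)      # [0, 1, 2, 3]
print(two_args)     # [2, 3, 4, 5]
print(three_args)   # [1, 4, 7]
```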
Even though this is an Easter egg, all these tips are valuable,
and any programmer should keep them in mind when creating
new projects.
Exercises
Answer the questions below, then check your responses
using the Python REPL.
>>> 1e-3
>>> 2
>>> 3.
>>> 5 > 2
>>> "String\'s" ( )
>>> "HelloWorld' ( )
>>> -3 * 1
>>> 5 % 3
>>> 2 + 3 * 3
>>> True + 3
>>> 3 ** False
>>> type(3 / 3)
>>> type(3. + 2)
>>> '123' * 2
>>> 2 - 2 / 4
>>> (2 - 2) / 4
>>> 2 ** 2 ** 4
>>> 3 ** False
>>> 3 % 5 + (2 ** (6 / 3))
>>> a = b = 3
>>> c, d = 1, 2
>>> a + c == d * b - 2
>>> s = "a"
>>> s *= 3
>>> s + "b"
>>> a = 0
>>> b = c = 42
>>> b /= 2
>>> b != 21 or c/b == 2
>>> b = False
>>> c = not b
>>> a = [1]
>>> a * 11
>>> a = [1, 2]
>>> b = [3, 4]
>>> a + b
a) >>> li[3]
b) >>> li[-2]
c) >>> li[:-3]
d) >>> li[-5:]
>>> a = (1,2,3)
>>> a[3]
>>> b = [5, 6, 7, 8]
>>> b[-5]
>>> a = (1, 2)
>>> b = (3, 4)
>>> c = a / b
>>> a = (1,2,3)
>>> b = (1,2,3)
>>> a + b
>>> a = (1,2,3)
>>> b = (1,2,3)
>>> a * b
Examples:
§§ Break
The break statement is used to leave the innermost loop.
Primarily, it is used in conjunction with an if statement that looks
for a specific condition to leave the loop.
EXAMPLE 1 - WHILE
# Print all values until c=2
c = 0
while True:
    if c == 2:
        break
    c += 1
    print(c)
OUTPUT
1
2
EXAMPLE 2 - FOR
# Print if find white space
m = "abc def"
for c in m:
    if c == " ":
        print("White space found!")
        break
OUTPUT
White space found!
§§ Continue
The continue statement ignores all the remaining lines of
code in the block and continues to the next iteration of the
innermost loop. This behavior is useful when you want to
“ignore” a loop iteration, but still want to continue iterating.
EXAMPLE - FOR
# Print all values in range(6), except 1 and 5
for n in range(6):
    if n == 1 or n == 5:
        continue
    print(n)
OUTPUT
0
2
3
4
§§ Else
The else keyword can also be used in loops. In while loops, it
is executed when the expression evaluates to False. In for
loops, it is executed when the end of the iterator is reached.
EXAMPLE 1 - WHILE
# Add 2 and print final value
c = 0
# Print when c >= 3
while c < 3:
    c += 2
    print("Add 2")
else:
    print(c)
OUTPUT
Add 2
Add 2
4
EXAMPLE 2 - FOR
# Square and print when done
s = [3, 5]
# Iterate until the end of s
for i in s:
    print(i ** 2)
else:
    print("Iterator ended! ")
OUTPUT
9
25
Iterator ended!
48 | Python for D ata A n a ly s i s - A dva n c e d
3.2.1. Sets
The definition of sets can be borrowed from mathematics: “A
set is a well-defined collection of distinct objects”¹. In Python,
sets are used to represent an unordered collection without
duplicates. Therefore, they are used to test membership and
remove duplicates. Simple set operations are supported: union,
intersection, difference and symmetric difference. Since a
set is an unordered container, values cannot be accessed by
indexing. Additionally, sets have their own operations, invoked
through methods. To define a set, you just need to write a
sequence of comma-separated values between curly brackets.
§§ Defining Sets
Curly brackets are used in set definitions.
PYTHON REPL:
>>> a = set() # Empty set
>>> b = {1, 2, 3, 4, 5, 6} # Set with numeric values
>>> c = {42, "hi", True} # Set with mixed types
>>> {1, 2, 2, 2, 1} # Duplicates are removed
{1, 2}
§§ Function set()
Like lists and tuples, you can also create sets from existing
objects such as strings, lists and tuples.
PYTHON REPL:
>>> s = "abibliophobia" # String
>>> li = [2, 3, 5, 5, 5, 2, 3, 3] # List
>>> tu = (0, 1, 1, 0, 1, 0, 0, 0) # Tuple
>>> set(s) # Unique letters in word
{'a', 'b', 'h', 'i', 'l', 'o', 'p'}
>>> set(li) # Unique elements of li
{2, 3, 5}
>>> set(tu) # Unique elements of tu
{0, 1}
§§ Add or Remove
Adds or removes a value from the set.
PYTHON REPL:
>>> s = {2, 3, 5}
>>> s.add(1) # Add value 1 to set
>>> s
{1, 2, 3, 5}
>>> s.add(3) # Nothing happens, value already in the set
>>> s
{1, 2, 3, 5}
>>> s.remove(3) # Remove value 3
>>> s
{1, 2, 5}
§§ Union
Returns a new set resulting from the union of 2 sets.
PYTHON REPL:
>>> s1, s2 = {2, 3, 5}, {2, 4, 1, 6, 5}
>>> s1.union(s2) # Union of sets
{1, 2, 3, 4, 5, 6}
>>> s1 | s2 # Equivalent
{1, 2, 3, 4, 5, 6}
§§ Intersection
Returns a new set resulting from the intersection of both sets.
PYTHON REPL:
>>> s1, s2 = {2, 3, 5}, {2, 4, 1, 6, 5}
>>> s1.intersection(s2) # Intersection of sets
{2, 5}
>>> s1 & s2 # Equivalent
{2, 5}
§§ Difference
Returns a new set resulting from the difference of sets.
PYTHON REPL:
>>> s1, s2 = {2, 3, 5}, {2, 4, 1, 6, 5}
>>> s1.difference(s2) # Difference s1 and s2
{3}
>>> s1 - s2 # Equivalent
{3}
>>> s2.difference(s1) # Difference s2 and s1
{1, 4, 6}
>>> s2 - s1 # Equivalent
{1, 4, 6}
§§ Symmetric Difference
Returns a new set resulting from the symmetric difference of
sets. Basically, it returns all the elements that are in only one
of the sets. It can be thought of as the difference between the
union and the intersection of the sets.
PYTHON REPL:
>>> s1, s2 = {2, 3, 5}, {2, 4, 1, 6, 5}
>>> s1.symmetric_difference(s2) # Symmetric diff s1 and s2
{1, 3, 4, 6}
>>> s2 ^ s1 # Equivalent
{1, 3, 4, 6}
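The description of the symmetric difference as "union minus intersection" can be checked directly. The sketch below reuses the same two sets; the variable names by_operator and by_definition are introduced only for this example.

```python
s1, s2 = {2, 3, 5}, {2, 4, 1, 6, 5}

# Symmetric difference through the ^ operator
by_operator = s1 ^ s2

# The same result, built as the union minus the intersection
by_definition = (s1 | s2) - (s1 & s2)

print(by_operator == by_definition)
print(sorted(by_operator))   # sorted only to get a stable printing order
```

The values 2 and 5 appear in both sets, so they are the ones left out of the result.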
§§ Set Comprehension
Sets can also be built with comprehensions, in a similar manner to list comprehension.
EXAMPLE - Set Comprehension
# Set comprehension - Get all consonants of a word
sc = {x for x in "spatulate" if x not in "aeiou"}
print(sc)
OUTPUT
{'l', 'p', 's', 't'}
3.2.2. Dictionaries
Dictionaries are a powerful data type present in Python. They
can store indexed values like lists, but the indexes are not
a range of integers; instead, they are unique keys. Therefore,
dictionaries are a set of key:value pairs. Unlike lists, dictionaries
are not indexed by position; since Python 3.7 they preserve the
insertion order of their keys, while earlier versions did not
guarantee any order. The keys can be any immutable object such
as strings, tuples, integers or float numbers.
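A short sketch of the rule about keys: immutable objects such as tuples and numbers work as keys, while a mutable object such as a list is rejected. The dictionary names and contents are made up for this example.

```python
# Tuples are immutable, so they are valid keys
point_names = {(0, 0): "origin", (1, 0): "unit x"}

# Integers and floats are valid keys too
by_number = {1: "int key", 2.5: "float key"}

error_name = None
try:
    # Lists are mutable, so using one as a key raises an error
    bad = {[0, 0]: "origin"}
except TypeError as err:
    error_name = type(err).__name__

print(point_names[(0, 0)])
print(by_number[2.5])
print(error_name)
```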
§§ Defining Dictionaries
Curly brackets and colons are used in an explicit dictionary
definition.
PYTHON REPL:
>>> a = {} # Empty dictionary
>>> a = dict() # Empty dictionary
>>> b = {1: 1, 2: 2, 3: 3} # Explicit definitions
>>> c = {42: 2, "hi": 1}
>>> d = {"A": [1, 2, 3], "B": 2}
>>> e = dict(k1=1, k2=2) # Considered as string keys
PYTHON REPL:
>>> d = {} # Empty dict
>>> d["a"] = [1, 2] # Assigning list to key "a"
>>> d[2] = "Hello" # Assigning string to key 2
>>> d
{'a': [1, 2], 2: 'Hello'}
>>> d[2] = "123" # Overriding value in key 2
>>> d
{'a': [1, 2], 2: '123'}
>>> d["b"] # Access invalid key
KeyError: 'b'
PYTHON REPL:
>>> d = {'a': [1, 2], 2: '123'} # Same as previous example
>>> d.keys() # List current keys
dict_keys(['a', 2])
>>> d.values() # List current values
dict_values([[1, 2], '123'])
§§ Dictionaries in Loops
In general, you want to know the key:value pairs of a dictionary
when iterating. The items method returns both the key and the value
of each pair when used in for loops.
INPUT
# Iterating over key:value pairs
d = dict(a=1, b=2, c=3)
for k, v in d.items():
    print(k)
    print(v)
OUTPUT
a
1
b
2
c
3
§§ Dictionary Comprehension
Dictionaries can also be built with comprehensions, as in list comprehension.
EXAMPLE - Dict Comprehension
# Dict of cubes(values) of number(key) 1 to 4
dc = {x: x ** 3 for x in range(1, 5) }
print(dc)
OUTPUT
{1: 1, 2: 8, 3: 27, 4: 64}
3.3. Functions
We have already used multiple functions built into the Python
programming language, such as print, len or type. Generally,
a function is defined to represent a set of instructions that
In the first example, the function returns a set with all the
vowels present in the string. In the second example, a
dictionary with each letter and its number of occurrences is returned.
With the functions defined, you can apply the same logic to
different inputs.
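Two functions of the kind described here could look like the sketch below. The name letter_count and the body of the second function are assumptions made for illustration, not the book's original listing.

```python
def vowels(s):
    """Return the set of vowels present in the string s."""
    r = set()
    for c in s:
        if c in "aeiou":
            r.add(c)
    return r


def letter_count(s):
    """Return a dictionary mapping each character of s to its occurrences."""
    counts = {}
    for c in s:
        # get returns 0 the first time a character is seen
        counts[c] = counts.get(c, 0) + 1
    return counts


# The same logic applied to different inputs
print(sorted(vowels("banana")))
print(sorted(vowels("education")))
print(letter_count("banana"))
```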
§§ Function Docstrings
This is a way to describe the functionality of your function.
It comes immediately after the colon in the function definition,
as text between triple quotes (""" or ''').
Example - Docstrings
def add(a, b):
    """
    This is a function docstring; it is where your function is
    described.
    e.g. This function adds a and b.
    """
    return a + b
§§ Passing Arguments
When calling functions, there are two ways to pass arguments:
positional arguments and keyword arguments. With the first
method, the order in which the parameters appear
in the definition of the function is followed. With the second,
the order does not matter, but each argument
must be passed together with its parameter name in the format
parameter=argument. Consider the function defined below
and the examples.
Function Definition
def test(a, b, c):
    print('Test Function')
Example 1 - Positional
test(1, 2, 3)
Example 2 - Keyword
test(c=3, b=2, a=1)
Example 3 - Combined
test(1, 2, c=3)
All these function calls are equivalent; only the way the
arguments are passed changes.
§§ Function Arguments
A Python function can have four types of arguments:
o Required Arguments: When the argument is mandatory
to the execution of the function. If the argument is not
passed an error is shown.
o Default Arguments: In this case, this function works
even if the argument is not passed. When this occurs, a
default value is used.
o Keyword Arguments: A way to pass arguments by the
parameter's name.
o Variable Number of Arguments: A function can accept an
unknown number of positional arguments and keyword
arguments. In this case, extra positional arguments are
stored in the tuple args and extra keyword arguments
in the dictionary kwargs. This must be specified in
the function definition.
Consider the given function: the parameters *args and
**kwargs work as placeholders for the extra positional and
keyword arguments passed, respectively.
Function Definition
def sum_values(a, b, c=2, *args, **kwargs):
    # Required arguments
    result = a + b
    # Default argument
    result += c
    # Variable arguments
    for v in args:
        result += v
    # Variable keyword arguments
    for k, v in kwargs.items():
        result += v
    # Show extra positional and keyword args
    print(args)
    print(kwargs)
    return result
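A few calls make the four kinds of arguments concrete. The sketch below repeats the function in a compact form, without the two print calls, so that it runs on its own; the argument values and the keyword name x are arbitrary.

```python
def sum_values(a, b, c=2, *args, **kwargs):
    # Compact version of the definition above, without the print calls
    result = a + b + c
    for v in args:
        result += v
    for v in kwargs.values():
        result += v
    return result


only_required = sum_values(1, 2)               # c falls back to its default, 2
with_default = sum_values(1, 2, 3)             # c is overridden by position
extra_positional = sum_values(1, 2, 3, 4, 5)   # 4 and 5 are collected in args
extra_keyword = sum_values(1, 2, 3, 4, x=10)   # x=10 is collected in kwargs

print(only_required, with_default, extra_positional, extra_keyword)
```

Leaving out a or b raises a TypeError, because required arguments have no fallback value.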
§§ Unpacking Arguments
You can also use tuples and dictionaries to pass parameters to
functions. A single asterisk is used to perform the unpacking
for tuple (or list) and double for dictionaries.
Function Definition
def test(a, b):
    print('Test Function')
Example 1 - *args
t = (1, 2)
# Same as test(1, 2)
test(*t)
Example 2 - **kwargs
d = {"a": 1, "b": 2}
# Same as test(a=1, b=2)
test(**d)
Example 3 - Combined
t = (1, 2)
# Same as test(1, 2, 1, 2)
test(1, 2, *t)
Example 4 - Combined
d = dict(f=1, g=2)
# Same as test(1, b=2, f=1, g=2)
test(1, b=2, **d)
Note that Examples 3 and 4 pass more arguments than test(a, b)
accepts, so they only run against a function defined with *args
and **kwargs.
§§ Lambda Expressions
Lambda expressions are small anonymous functions. They are
generally used to pass small functions as arguments. Their overall
format is shown below.
lambda INPUTS: OUTPUTS
value of x. In this case, each x is a string inside the list obj and
the output is the last letter of that string. We are then
sorting the initial list by the last letter of the strings. This is a
perfect use case for lambda functions: when you need a simple
function to be passed as an argument.
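A sketch of the sorting described here, under stated assumptions: the contents of obj are made up, and the built-in sorted function is used, which returns a new list instead of changing obj in place.

```python
# Hypothetical list of strings
obj = ["banana", "kiwi", "fig", "plum"]

# key receives each string x and returns its last letter, x[-1]
by_last_letter = sorted(obj, key=lambda x: x[-1])

print(by_last_letter)   # ordered by the letters a, g, i, m
```

Writing the key as a lambda avoids defining a separate named function that would only be used once.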
3.4. Classes
Like most of the widely used programming languages, Python
supports object-oriented programming. It is a way to design
computer programs made of interacting objects. In Python,
classes are the recipe for creating objects.
§§ Creating Classes
Classes can be created using the class keyword followed by the
class name and a colon. Generally, the first method of a class is
the __init__ function. The overall format is shown below.
class CLASS_NAME:
    # Attributes and methods
    def __init__(self, P1, P2):
        # Init code
        pass
    def m1(self):
        print('Method executed! ')
§§ Multiple Instances
You can create multiple instances of the same class, each with
its own attributes. The self parameter refers to the current
instance of the class. It is used to access variables and methods
of the class within the class definition.
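The class used in the next examples is not shown in this section. The sketch below is one possible definition, reconstructed only from the outputs printed by those examples; the attribute names a and b and the returned values are assumptions.

```python
class AddMultCalc:
    def __init__(self, a, b):
        # self refers to the instance being created; each one keeps its own a and b
        self.a = a
        self.b = b

    def add(self):
        result = self.a + self.b
        print(f"{self.a}+{self.b}={result}")
        return result   # returned here only so the result can be reused

    def mult(self):
        result = self.a * self.b
        print(f"{self.a}x{self.b}={result}")
        return result


# Two independent instances of the same class
first = AddMultCalc(2, 3)
second = AddMultCalc(4, 5)
total = first.add()       # prints 2+3=5
product = second.mult()   # prints 4x5=20
```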
c1 = AddMultCalc(5, 11)
c1.add()
c2 = AddMultCalc(-2, 5)
c2.mult()
OUTPUT
5+11=16
-2x5=-10
§§ Changing Attributes
You can modify attributes of already created instances.
Consider the same class defined previously.
Example 2 – Modify Attributes: Instance
c3 = AddMultCalc(-1, 4)
c3.add()
c3.a = -4
c3.add()
OUTPUT
-1+4=3
-4+4=0
def vowels(s):
    v = "aeiou"
    r = set()
    # For each character c
    for c in s:
        # Check if it is a vowel
        if c in v:
            r.add(c)
    return r
§§ Direct Import
When using this method, all classes and functions in the
module will be available through the module namespace. In
other words, the functions and classes are available using the
module name followed by ".".
PYTHON REPL:
>>> import module
>>> module.vowels("Hello World!")
{'e', 'o'}
§§ Partial Import
You can also import only specific functions/classes from the
module. In this case, the import format is changed to from
MODULE_NAME import FUNCTION/CLASS_NAME. No additional
namespace is created to invoke the function/class.
PYTHON REPL:
>>> from module import ShowAddClass
>>> s = ShowAddClass(2, 3)
>>> s.add()
5
You can use * to import all classes and functions available
in the module without the namespace restriction.
PYTHON REPL:
>>> from module import *
>>> s = ShowAddClass(2, 3)
>>> s.add()
5
>>> vowels("Hello")
{'e', 'o'}
§§ Using Alias
An alias can be used when importing an entire module or specific
functions. In general, this is done to avoid namespace conflicts
or to shorten the module name. The alias is created
with the keyword as.
PYTHON REPL:
>>> import module as m
>>> s = m.ShowAddClass(1, -1)
>>> s.add()
0
PYTHON REPL:
>>> vowels = "aeiou"
>>> from module import vowels as fun_vowels
>>> fun_vowels(vowels)
{'a', 'e', 'i', 'o', 'u'}
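The same import forms can be tried without creating a module file, using the standard library module math in place of the module file used above. This is only a stand-in chosen so that the sketch runs as is.

```python
import math                    # direct import: names live under the math namespace
from math import sqrt          # partial import: sqrt is available directly
import math as m               # alias for the entire module
from math import floor as fl   # alias for a single function

direct = math.sqrt(16)
partial = sqrt(16)
aliased = m.sqrt(16)
renamed = fl(2.7)

print(direct, partial, aliased, renamed)
```

All four forms reach the same underlying functions; only the name used to call them changes.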
if __name__ == "__main__":
    import sys
    v = vowels(sys.argv[1])
    print(v)
Exercises
Answer the questions below, then check your responses
using the Python REPL or by creating and executing a .py file.
e) break
stop
terminate
continue
What’s the output of the example below?
i = 1
while True:
print(i)
i += 1
if i == 42:
break
print(i)
f) 40
41
42
43
What’s the output of the example below?
s = “acbdefgh”
m = “”
for c in s:
if c in “aeiou”:
continue
m += c
print(m)
g) abcdefgh
ae
cbdfgh
fgh
What’s the output of the example below?
c = 0
while c < 1:
c += 1
print(c)
if c == 1:
break
else:
print(“Else executed!”)
h) 1
1
Else executed
Else executed
Nothing is printed
i) Iterating...
Iteration over!
Iterating...
Iteration over!
Nothing is printed
Are the list comprehensions defined below valid? Mark as
True or False.
j) lc = [a for b in range(5)] ( )
k) lc = [a for a in range(5) if a < 2] ( )
l) lc = [b**2 for b in range(3)] ( )
m) lc = [[(a, b) for a in range(5)] for b in range(3)]
( )
What is the equivalent of this for loop as a list comprehension?
sqrt = []
for v in range(5, 10):
    sqrt.append(v ** .5)
v) sc = [c for c in "abracadabra"]
w) sc = [c for c in "abracadabra" if c == "a"]
x) sc = [c for c in "abracadabra" if c != "a"]
y) sc = [c for c in "brcdbrf"]
Consider the sets defined below, what are the results of the
operations?
s1 = {1, 2, "B", 4, 5}
s2 = {1, "A", "B", "C", 5}
z) >>> s1 & s2
>>> s1 ^ s2
>>> s2 | s2
>>> s1 - s2
>>> s2 - s1
>>> s2 & s1
>>> s2 ^ s1
Given the function definition, what are the results of the
function calls below?
def mult_values(a, b, c=.5, *args, **kwargs):
    result = a * b
    result *= c
    for v in args:
        result *= v
    for k, v in kwargs.items():
        result *= v
    print(args)
    print(kwargs)
    return result
def area(self):
a = self.pi * r ** 2
print(‘Area:’ + str(a))
def circumference(self):
c = self.pi * r * 2
print(‘ Circumference:’ + str(c))
def area(self):
a = self.pi * r ** 2
print(‘Area:’ + str(a))
def circumference(self):
c = self.pi * r * 2
print(‘ Circumference:’ + str(c))
if __name__ == "__main__":
    c = Circle(1/3.14)
    c.area()
ss) import module.py ( )
tt) in module import Circle ( )
uu) from module import add_area ( )
vv) from module import * ( )
ww) import module as c ( )
xx) from module Circle as C ( )
yy) from module import Circle as C ( )
4.1. IPython
IPython, also known as Interactive Python, is a capable toolkit
that allows you to use Python interactively. It has two
main components: an interactive Python shell interface, and
Jupyter kernels.
- Session logging
- Access to system Shell
- Support for Python’s debugger and profiler
Now, let’s dive into each of these components and see how
these features come to life.
To run the IPython shell, you just need to run the command
below in your system console.
ipython
§§ Interface
§§ Help
84 | IPython and J u p y t e r N ot e b o o k s
You can type “?” after an accessible object any time you
want more details about it.
§§ Code Completion
You can press the “TAB” key at any time to trigger code
completion.
§§ Syntax Highlight
§§ Magic Commands
§§ Dashboard
§§ Notebook Editor
Now you are ready to edit your own notebook, but first let’s
get familiar with the interface items listed above.
1. This is the notebook cell; it is the simplest component of
a notebook.
2. This drop-down menu alternates the kind of the
selected cell. Each cell can have one of three types:
Code, Markdown and Raw.
3. This is the button that adds more cells to the notebook.
4. This button executes the current cell and selects the
next one.
Ok, after this brief description of the main interface elements,
we can start creating our notebook.
§§ Cell Basics
Each of the three cell types has two possible modes,
Command and Edit.
Command Mode: The cell’s left edge is blue and typing sends
commands to the notebook. If in edit mode, you can change
to command mode with the “ESC” key.
Edit Mode: The cell’s left edge is green and there is a small grey
pencil icon in the top right corner; typing in this mode
edits the content of the cell. This mode can be entered by double
clicking a markdown cell or single clicking a code/raw cell. If
in command mode, you can change to edit mode with the
“Enter” key.
“Shift+Enter”: Run the current cell and move to the next one.
Code Cells: Code cells can execute Python code. The value of the
last expression in a cell, if not assigned to a variable, is shown as
the output of the cell. Code cells have an “In [ ]” label on their
left, indicating that they are input cells; the number inside the
brackets reveals the order of execution.
5.1. Numpy
Numpy is one of the most famous and widely used Python
packages for efficient data processing. Its main object is the
multi-dimensional array: ndarray. Some algorithms can gain a
considerable performance increase by using the array class
offered by the numpy library. Additionally, the Scipy ecosystem
of software is built on top of it to provide various scientific
and engineering methods.
§§ One-dimensional Arrays
The type can be given to the np.array() command with the dtype
keyword. Additionally, the type and dimension of the created
array can be checked using the dtype and ndim attributes,
respectively.
IPYTHON SHELL:
>>> import numpy as np # Numpy with its common alias
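As a minimal sketch of the dtype keyword and the dtype and ndim attributes described above (the variable names scalar and vector are illustrative and do not come from the original listing):

```python
import numpy as np

# 0-dimensional array: a single value
scalar = np.array(5)

# 1-dimensional array with an explicit element type
vector = np.array([1, 2, 3], dtype=float)

print(scalar.ndim)    # number of dimensions of the single-value array
print(vector.ndim)    # number of dimensions of the list-like array
print(vector.dtype)   # element type requested with the dtype keyword
```

Passing dtype at creation avoids a later conversion when the values must be floats.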
§§ Multi-dimensional Arrays
Nested array-like objects are used to construct the
dimensions of the array. You can think of a multidimensional
array as a set of arrays of the previous dimension. For instance,
we have seen that a 0-dimensional array corresponds to a single
value, so a 1-dimensional array is a set of 0-dimensional
arrays, a 2-dimensional array is a set of 1-dimensional
arrays, and so on. This concept is illustrated in the table below.
Dimensions   Description
0            Single value
1            Multiple single values (List)
2            Multiple Lists of values (Matrix)
3            Multiple Matrices of values (Cube)
4            Collection of Cubes
...          ...
The same logic follows for more than 3 dimensions: you can
think of a 4-dimensional array as a collection of Cube arrays.
However, beyond 3 dimensions arrays are not easy to illustrate
or visualize. The attributes shape and size are useful
for multi-dimensional arrays. The first returns the size of the
array in each dimension; the second returns the total number
of elements in the array.
IPYTHON SHELL:
>>> a = np.array([[1, 2, 1],   # Create an array from nested lists
...               [3, 4, 3]])
>>> a
array([[1, 2, 1],
[3, 4, 3]])
>>> a.ndim
2
>>> a.shape # Shape of the 2-dim array
(2, 3)
>>> a.size
6
>>> b = np.array((((1, 2),     # Create an array from nested tuples
...                (3, 4)),
...               ((5, 6),
...                (7, 8))))
>>> b
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
>>> b.ndim
3
>>> b.shape # Shape of the 3-dim array
(2, 2, 2)
>>> b.size
8
>>> c = np.array([[1, 2], [1]], dtype=object)   # Inconsistent number of columns
>>> c                          # (recent numpy versions require dtype=object here)
array([list([1, 2]), list([1])], dtype=object)
IPYTHON SHELL:
>>> np.ones((2, 2)) # Array of 1s with shape (2,2)
array([[1., 1.],
[1., 1.]])
§§ Reshaping Arrays
Created arrays can be reshaped with the reshape method. The
only restriction is that the new shape must have the same
number of elements as the original array.
IPYTHON SHELL:
>>> r = np.arange(0, 12, 2)    # Values from 0 to 10 with step 2
>>> r.reshape((3, 2))          # Reshape to (3, 2)
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
>>> r.reshape((2, -1))         # -1 lets numpy infer the second dimension
array([[ 0,  2,  4],
       [ 6,  8, 10]])
§§ Appending to Arrays
Unlike lists, numpy arrays have a fixed size. Therefore,
to append a value to an array, a new array is created and the
values are copied. This can be done with the append function,
which accepts values or other arrays. For large arrays, this is a
costly operation and should be avoided. A good practice is to
create the array with extra space and fill it.
IPYTHON SHELL:
>>> ar
array([0, 1, 2, 3, 4])
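The following is a small sketch of both approaches, assuming the same ar array shown above. The buffer size n and the fill value -1 are hypothetical choices made only for illustration.

```python
import numpy as np

ar = np.arange(5)                 # array([0, 1, 2, 3, 4])

# np.append returns a new array; the original is left unchanged
ar_plus = np.append(ar, 5)        # append a single value
ar_more = np.append(ar, [5, 6])   # append another array-like

# Alternative: allocate the final size once and fill it in place
n = 8                             # hypothetical final size
buf = np.zeros(n, dtype=ar.dtype)
buf[:ar.size] = ar                # copy the existing values
buf[ar.size:] = -1                # placeholder for values that arrive later
```

With preallocation the data is copied once, instead of once per appended value.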
IPYTHON SHELL:
>>> ar1 = np.zeros((2,2)) # 2x2 with 0s
>>> ar2 = np.ones((2,2)) # 2x2 with 1s
>>> np.vstack((ar1, ar2)) # Combine on first axis
array([[0., 0.],
[0., 0.],
[1., 1.],
[1., 1.]])
[[0., 0.],
[0., 0.]]])
>>> ar4.shape
(2, 2, 2)
§§ One-dimensional Arrays
Slicing, indexing and iterating over one-dimensional arrays
work the same way as on normal Python lists.
The same logic can be used to change values in the array.
IPYTHON SHELL:
>>> ar = np.linspace(0, 2, 5)  # Sequence from 0 to 2 with 5 values
>>> ar[3]                      # Indexing
1.5
>>> ar[0:2]                    # Slicing
array([0. , 0.5])
>>> ar[-1]                     # Indexing from the end
2.0
>>> ar[-1] = 5                 # Modifying
>>> for i in ar:               # Iterating
...     print(i)
0.0
0.5
1.0
1.5
5.0
§§ Multi-dimensional Arrays
Indexing arrays with multiple dimensions is done with a tuple
holding one value for each dimension. If an index value is
omitted, it is treated as a complete slice, which is equivalent to
“:”. Additionally, “…” can be used in the indexes to represent
as many “:” as needed. Iterating over a multidimensional array
is always performed along the first dimension. Numpy can also
iterate over each element with the np.nditer function, but this
function treats the values as read-only by default. In general,
it is easier and more intuitive to iterate over the first dimension
of a multidimensional array.
IPYTHON SHELL:
>>> mat = np.arange(9).reshape((3,3))   # 3x3 Matrix from 0 to 8
>>> mat
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> mat[1, 1] # Indexing
4
>>> mat[1, :] # Slicing
array([3, 4, 5])
>>> mat[1] # Equivalent
array([3, 4, 5])
>>> mat[0, :]
array([0, 1, 2])
>>> mat[0, ...] # Equivalent
array([0, 1, 2])
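The two iteration styles mentioned above can be sketched as follows. This is an illustrative example, not part of the original listing; the names row_sums and doubled are assumptions.

```python
import numpy as np

mat = np.arange(9).reshape((3, 3))

# Iterating over a multi-dimensional array walks the first dimension (rows)
row_sums = [int(row.sum()) for row in mat]

# np.nditer visits every element; writing requires the 'readwrite' flag
doubled = mat.copy()
with np.nditer(doubled, op_flags=['readwrite']) as it:
    for x in it:
        x[...] = 2 * x            # assign in place through the iterator

print(row_sums)
print(doubled)
```

For a simple element-wise change such as this one, the expression mat * 2 gives the same result without an explicit loop.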
§§ Boolean Indexing
Numpy arrays also allow Boolean indexing. True and False indicate
whether each value will be returned or not, respectively. You can
use boolean indexing to filter arrays, for example to
get the values above or below a given threshold. This
is done with comparison operations between
arrays and values. Multiple conditions can be combined to build
more advanced filters: each condition is wrapped in parentheses
and joined with the operators & (and) and | (or).
IPYTHON SHELL:
>>> mat = np.arange(9).reshape((3,3))   # 3x3 Matrix from 0 to 8
>>> index = mat > 3 # Indexes
>>> index
array([[False, False, False],
[False, True, True],
[ True, True, True]])
>>> mat[index] # Only values above 3
array([4, 5, 6, 7, 8])
>>> mat[mat > 3] # Equivalent
array([4, 5, 6, 7, 8])
>>> mat[(mat>3) & (mat<6)] # Multiple comparison
array([4, 5])
>>> mat[(mat>3) | (mat<1)] # Multiple comparison
array([0, 4, 5, 6, 7, 8])
IPYTHON SHELL:
>>> a = np.array([1, 2, 3], dtype=float)
>>> b = np.array([-1, 1, 3], dtype=float)
>>> a + b # Addition
array([0., 3., 6.])
>>> a - b # Subtraction
array([2., 1., 0.])
>>> a * b # Multiplication
array([-1.,  2.,  9.])
>>> a / b # Division
array([-1., 2., 1.])
>>> a//b # Integer division
array([-1., 2., 1.])
>>> a % b # Modulus
array([-0., 0., 0.])
>>> a ** b # Power
array([ 1., 2., 27.])
IPYTHON SHELL:
>>> a = np.array([[1, 0], [2, -1]], dtype=float)
>>> b = np.array([[1, 1], [0, 1]], dtype=float)
5.4.3. Broadcasting
Broadcasting is a feature that allows great flexibility.
In short, broadcasting is the ability to perform operations
between arrays that do not have exactly the same size or shape. It
is based on two rules:
1. If an array has fewer dimensions, a ‘1’ is prepended
to the shape of this array until both arrays have the
same number of dimensions.
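A short sketch of broadcasting in practice follows. It relies on the second rule as documented by numpy, namely that a dimension of size 1 is stretched to match the size of the other array along that dimension. The arrays a, b and col are illustrative.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))   # shape (2, 3)
b = np.array([10, 20, 30])         # shape (3,)

# b is treated as shape (1, 3), then stretched along the first axis
c = a + b                          # result has shape (2, 3)

col = np.array([[100], [200]])     # shape (2, 1)
d = a + col                        # the size-1 axis is stretched to 3

# Shapes that cannot be aligned raise a ValueError
try:
    a + np.array([1, 2])           # shape (2,) against (2, 3)
    failed = False
except ValueError:
    failed = True
```

The stretching is only conceptual: numpy does not physically copy the smaller array.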
IPYTHON SHELL:
>>> np.random.seed(42) # Replicability
>>> a = np.random.randn(50) # 50 values from normal distribution
>>> a.mean() # Mean
-0.28385754369506705
>>> a.std() # Standard Deviation
0.8801815954843186
>>> a.max() # Max value
1.8522781845089378
>>> a.min() # Min value
-1.9596701238797756
>>> a.argmax() # Index of max value
22
>>> a.sum() # Sum of values
-14.192877184753353
6.1. Pandas
Pandas provides fast, simple and flexible functions and
structures to manipulate data easily and intuitively. It is
highly suited for tabular data that can be easily expressed in
its fundamental data structures, Series and DataFrame. Those
structures are the base classes of the package, and through
them pandas provides multiple functionalities:
· Handling missing data;
· Data insertion/deletion;
· Powerful grouping, indexing and combining;
· Easily handle external files.
Those capabilities make pandas one of the most important
frameworks for anyone working with data in Python.
6.3.1. Series
Series is the Pandas one-dimensional structure. It is composed of
two collections: the index and the data itself. The index holds
the labels of the values in a Series; if it is not provided, a
continuous integer index is created automatically. You can think
of a Series as a handy combination of a numpy array and the
key-based assignment capabilities of a Python dictionary.
§§ Creating Series
Multiple kinds of objects can be used to create a Series with the
pd.Series function, and the command usage stays the same for
all of them. The index size should match the size of the data
passed. Some numpy array attributes are also present in
the Series object, and the numpy array equivalent of the Series is
available through the values attribute.
IPYTHON SHELL:
>>> import pandas as pd # Pandas with its common alias
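As an illustrative sketch of the creation options described above (the data and the names s_list, s_arr and s_dict are assumptions, not taken from the original listing):

```python
import numpy as np
import pandas as pd

# From a list: an integer index 0..n-1 is created automatically
s_list = pd.Series([10, 20, 30])

# From a numpy array with explicit labels (index length must match the data)
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=['a', 'b', 'c'])

# From a dictionary: the keys become the index
s_dict = pd.Series({'x': 1, 'y': 2})

print(s_arr.values)   # underlying numpy array
print(s_arr.shape)    # numpy-like attribute
print(s_arr['b'])     # label-based access, like a dictionary
```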
IPYTHON SHELL:
>>> import numpy as np
>>> d = pd.Series([-2, -1, np.nan, 1, 2])
>>> d.mean() # Mean method
0.0
>>> d + 1 # Broadcasting
0 -1.0
1 0.0
2 NaN
3 2.0
4 3.0
dtype: float64
>>> d ** 2
0 4.0
1 1.0
2 NaN
3 1.0
4 4.0
dtype: float64
IPYTHON SHELL:
>>> e = pd.Series([42, 1, 0, -2], index=['a', 'b', 'c', 'd'])
>>> e['b'] = 101.1 # Assignment
>>> e['b']
101.1
§§ Unique Features
Series also have some unique features that have no
direct parallel in what has been seen so far. For instance, the
describe method returns multiple summary statistics
about the data, and the isin method checks whether the
values are in another list-like object. The name attribute can be
used to briefly describe the data in the Series, and you
can change the dtype with the astype method. These
efficient methods are what make Pandas excel at
data manipulation.
IPYTHON SHELL:
>>> f = pd.Series([np.nan, 2, 3, 5, 7, np.nan], name='primes')
>>> f.describe() # Statistical description
count 4.000000
mean 4.250000
std 2.217356
min 2.000000
25% 2.750000
50% 4.000000
75% 5.500000
max 7.000000
Name: primes, dtype: float64
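The isin and astype methods mentioned above can be sketched on the same Series. The values tested with isin are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

f = pd.Series([np.nan, 2, 3, 5, 7, np.nan], name='primes')

mask = f.isin([2, 7])            # True where the value is in the list
selected = f[mask]               # boolean indexing with the mask

# NaN cannot be stored in a plain integer dtype, so drop it before converting
as_int = f.dropna().astype(int)

print(f.name)
print(selected)
print(as_int)
```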
§§ Advanced Types
Beyond the basic numpy dtypes, Pandas Series support
highly useful ones such as datetime, string and categorical.
Once converted, the data can be easily accessed with the proper
accessor: dt, str and cat. Using these accessors you can
efficiently modify or inspect the data in the series with built-in
functions.
IPYTHON SHELL:
>>> g = pd.Series(['apple', 'pen', 'pen', 'apple', 'pen'],
...               name='fruits', dtype='category')
>>> g.cat.categories # Access Categories
Index(['apple', 'pen'], dtype='object')
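A brief sketch of the three accessors follows. The dates and words used here are made-up sample data, chosen only to show one attribute or method per accessor.

```python
import pandas as pd

# dt accessor on datetime data
dates = pd.Series(pd.to_datetime(['2019-01-15', '2019-06-30']))
years = dates.dt.year
months = dates.dt.month

# str accessor on string data
words = pd.Series(['apple', 'pen'])
upper = words.str.upper()
has_pp = words.str.contains('pp')

# cat accessor on categorical data
g = pd.Series(['apple', 'pen', 'pen'], dtype='category')
codes = g.cat.codes               # integer code of each category
```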
6.3.2. DataFrame
Now that we understand the overall capabilities of Pandas
Series, we can simply define a DataFrame as a collection
of Series. Think of it as a tabular (table-like) data structure.
§§ Creating DataFrame
As with Series, the creation function accepts multiple types
of data as input. Beyond the index argument, DataFrame has
the columns argument that can be passed during creation. The
data can be passed as a dictionary with each key storing a list,
a list of series, a list of lists, etc.
IPYTHON SHELL:
>>> dl = {'A': [1, 3], 'B': [2, 4]}
>>> pd.DataFrame(dl) # Dictionary of lists
A B
0 1 2
1 3 4
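Two of the other input types can be sketched as follows, assuming the same values as the dictionary example. The row labels r1 and r2 are hypothetical.

```python
import pandas as pd

# List of lists: each inner list is one row
df_rows = pd.DataFrame([[1, 2], [3, 4]],
                       columns=['A', 'B'],
                       index=['r1', 'r2'])

# Dictionary of Series: each Series becomes a column
df_series = pd.DataFrame({'A': pd.Series([1, 3]),
                          'B': pd.Series([2, 4])})

print(df_rows)
print(df_series)
```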
IPYTHON SHELL:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [1, 6, 9]})
>>> df
a b c
0 1 2 1
1 2 4 6
2 3 6 9
>>> df + 1 # Broadcasting
a b c
0 2 3 2
1 3 5 7
2 4 7 10
§§ Displaying Values
Pandas DataFrames have two methods to display the beginning and
end of the tabular data: head and tail. This is useful when
dealing with large amounts of data.
IPYTHON SHELL:
>>> df = pd.DataFrame({'A': range(100),
...                    'B': np.linspace(-5, 5, 100), 'C': 0})
>>> df.head() # Default first 5 rows
A B C
0 0 -5.000000 0
1 1 -4.898990 0
2 2 -4.797980 0
3 3 -4.696970 0
4 4 -4.595960 0
IPYTHON SHELL:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [1, 6, 9]},
...                   index=['i', 'j', 'k'])
>>> df
a b c
i 1 2 1
j 2 4 6
k 3 6 9
>>> df.loc['i'] # Access all columns of row i by name
a 1
b 2
c 1
Name: i, dtype: int64
§§ Boolean Indexing
Pandas also supports Boolean indexing. It works like
boolean indexing on numpy arrays: True and False
indicate whether the value will be returned or not, respectively.
IPYTHON SHELL:
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [1, 6, 9]},
...                   index=['i', 'j', 'k'])
>>> df[(df.a > 1) & (df.b < 5)] # rows where a > 1 and b < 5
a b c
j 2 4 6
>>> df[(df.c <2) | (df.b < 5)] # rows where c < 2 or b < 5
a b c
i 1 2 1
j 2 4 6
§§ Filter
This method selects the labels of the specified axis that match
the given conditions. Therefore, it does not filter the content
of the data, only the labels of the indexes.
IPYTHON SHELL:
>>> df = pd.DataFrame({'ABC': [1, 1, 2], 'BCD': [0, 1, 3],
...                    'CDE': [2, 1, 2]}, index=['dog', 'cat', 'rabbit'])
>>> df.filter(items=["ABC", "CDE"]) # Select specified columns
ABC CDE
dog 1 2
cat 1 1
rabbit 2 2
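Besides items, the filter method also accepts the like and regex arguments, and axis chooses between row and column labels. A small sketch on the same DataFrame, with patterns chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'ABC': [1, 1, 2], 'BCD': [0, 1, 3], 'CDE': [2, 1, 2]},
                  index=['dog', 'cat', 'rabbit'])

by_like = df.filter(like='BC')        # column names containing 'BC'
by_regex = df.filter(regex='E$')      # column names ending in 'E'
rows = df.filter(like='a', axis=0)    # row labels containing 'a'

print(by_like)
print(by_regex)
print(rows)
```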
6.4.1. Merge
The function pd.merge can combine two DataFrames into one.
In general, this function is used with these arguments: the first
IPYTHON SHELL:
>>> df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [42, 11, 25]})
>>> df2 = pd.DataFrame({"col1": [1, 2, 4], "col3": [np.nan, 22, 51]})
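A possible way to merge these two DataFrames is sketched below, using the on and how arguments of pd.merge. Joining on col1 is an assumption based on the column the two frames share.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [42, 11, 25]})
df2 = pd.DataFrame({'col1': [1, 2, 4], 'col3': [np.nan, 22, 51]})

# Keep only keys present in both frames
inner = pd.merge(df1, df2, on='col1', how='inner')

# Keep keys from either frame; missing values become NaN
outer = pd.merge(df1, df2, on='col1', how='outer')

# Keep every row of the left frame
left = pd.merge(df1, df2, on='col1', how='left')

print(inner)
print(outer)
print(left)
```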
6.4.2. Concatenate
Once again, this is like concatenating numpy arrays. Consider
the examples shown in the IPython shell below.
IPYTHON SHELL:
>>> c1 = pd.DataFrame({"A": [1, 2, 3], "B": [42, 11, 25]})
>>> c2 = pd.DataFrame({"C": [1, 2, 4], "D": [np.nan, 22, 51]})
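As a sketch of how these two frames could be combined with pd.concat (the result names stacked, side and reindexed are illustrative):

```python
import numpy as np
import pandas as pd

c1 = pd.DataFrame({'A': [1, 2, 3], 'B': [42, 11, 25]})
c2 = pd.DataFrame({'C': [1, 2, 4], 'D': [np.nan, 22, 51]})

# axis=0 (default): rows are stacked, columns missing in a frame become NaN
stacked = pd.concat([c1, c2])

# axis=1: columns are placed side by side, aligned on the index
side = pd.concat([c1, c2], axis=1)

# ignore_index=True rebuilds a continuous integer index
reindexed = pd.concat([c1, c2], ignore_index=True)

print(stacked.shape, side.shape)
```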
6.4.3. Grouping
Often you want to apply aggregation functions
to groups of values. Using groupby you can group
large amounts of data and perform operations on each group.
Grouping allows the calculation of statistics such as mean,
median, mode, standard deviation, etc. Consider the DataFrame df
below.
IPYTHON SHELL:
>>> df = pd.DataFrame({"breed": ["Labrador", "Labrador", "Bulldog",
...                              "Labrador", "Beagle", "Bulldog"],
...                    "height": [57, 60, 40, 58, 36, 38],
...                    "weight": [30, 29, 24, 34, 11, 22]})
>>> df.groupby("breed").mean() # Average height and weight
height weight
breed
Beagle 36.000000 11.0
Bulldog 39.000000 23.0
Labrador 58.333333 31.0
>>> df.groupby("breed").min() # Min height and weight
height weight
breed
Beagle 36 11
Bulldog 38 22
Labrador 57 29
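Several statistics can also be computed in one call with the agg method. The sketch below reuses the same data; the particular choice of functions is only an example.

```python
import pandas as pd

df = pd.DataFrame({'breed': ['Labrador', 'Labrador', 'Bulldog',
                             'Labrador', 'Beagle', 'Bulldog'],
                   'height': [57, 60, 40, 58, 36, 38],
                   'weight': [30, 29, 24, 34, 11, 22]})

# Several statistics for one column
stats = df.groupby('breed')['height'].agg(['mean', 'median', 'std'])

# A different function per column
per_col = df.groupby('breed').agg({'height': 'max', 'weight': 'mean'})

print(stats)
print(per_col)
```

Note that the standard deviation of a group with a single row, such as Beagle here, is NaN.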
A    B      C    D
            1    a
2    3.141  1    b
            2
4    1.618  3
     1.141  5
6           8
IPYTHON SHELL:
>>> df = pd.DataFrame({"A": [np.nan, 2, np.nan, 4, np.nan, 6],
...                    "B": [np.nan, 3.141, np.nan, 1.618, 1.141, np.nan],
...                    "C": [1, 1, 2, 3, 5, 8],
...                    "D": ["a", "b", np.nan, np.nan, np.nan, np.nan]})
A B C D
1 2 3 1
2 3 4 2
1 2 3 1
1 2 3 1
4 3 2 4
IPYTHON SHELL:
>>> df = pd.DataFrame({"A": [1, 2, 1, 1, 4], "B": [2, 3, 2, 2, 3],
...                    "C": [3, 4, 3, 3, 2], "D": [1, 2, 1, 1, 4]})
· Distribution;
· Composition;
· Comparison;
· Relationship.
We will go through the most common graphs on different
packages and their purposes. To run any of the examples,
consider that these packages were already imported. Any
extra import is listed in the example itself.
§§ EXAMPLES
EXAMPLES – MATPLOTLIB, PANDAS AND SEABORN
# Data
data = np.random.normal(size=1000)
# Matplotlib
plt.hist(data)
plt.show()
# Pandas
data_series = pd.Series(data)
data_series.hist()
plt.show()
# Seaborn
sns.histplot(data, kde=True)  # distplot is deprecated in recent seaborn versions
plt.show()
OUTPUT - SEABORN
# Matplotlib
plt.boxplot(data)
plt.show()
# Pandas
data_series = pd.Series(data)
data_series.plot(kind='box')
plt.show()
# Seaborn
sns.swarmplot(data)
plt.show()
sns.violinplot(data)
plt.show()
sns.boxenplot(data)
plt.show()
OUTPUTS - SEABORN
values of the second variable for each entry in the data. It has
some variations, such as hexbin and estimated density plots
where the frequency of the points in a region is represented by
the color intensity. Once again, the simple scatter plot can be
easily displayed with matplotlib and pandas, and the fancier
equivalents hexbin and estimated density using seaborn.
This type of plot is where the advantages of dynamic plots
with bokeh become especially useful. For example, dynamic
scatter plots make outliers easy to detect and
inspect.
EXAMPLES –
MATPLOTLIB, PANDAS, SEABORN AND BOKEH
# Data
data = np.random.normal(size=(1000, 2))
# Matplotlib
plt.scatter(data[:, 0], data[:, 1])
plt.show()
# Pandas
data_df = pd.DataFrame(data, columns=['a', 'b'])
data_df.plot(x='a', y='b', kind='scatter')
plt.show()
# Seaborn
sns.jointplot(x=data[:, 0], y=data[:, 1], kind='hex')
plt.show()
sns.jointplot(x=data[:, 0], y=data[:, 1], kind='kde')
plt.show()
# Bokeh
p = figure()
p.circle(data[:, 0], data[:, 1])
show(p)
OUTPUTS - SEABORN
OUTPUTS - BOKEH
EXAMPLES –
MATPLOTLIB, PANDAS, SEABORN AND BOKEH
# Data
fruits = ["apple", "banana", "grape", "strawberry", "papaya"]
data = np.random.choice(fruits, size=100)
count_values = pd.Series(data).value_counts()
# Matplotlib
plt.bar(count_values.index, count_values.values)
plt.show()
# Pandas
count_values.plot(kind='bar')
plt.show()
# Seaborn
sns.barplot(x=count_values.index, y=count_values.values)
plt.show()
# Bokeh
p = figure(x_range=list(count_values.index))
p.vbar(x=list(count_values.index), top=count_values.values, width=.9)
show(p)
OUTPUT – MATPLOTLIB
OUTPUT - PANDAS
OUTPUTS - SEABORN
OUTPUTS - BOKEH
# Matplotlib
plt.pie(count_values.values, labels=count_values.index)
plt.show()
# Pandas
count_values.plot(kind='pie')
plt.show()
OUTPUT –
MATPLOTLIB AND PANDAS (SAME OUTPUT )
EXAMPLES –
MATPLOTLIB, PANDAS, SEABORN AND BOKEH
# Data
x = np.arange(100)
y = np.random.normal(size=100) + 5*np.sin(x/20)
# Matplotlib
plt.plot(x, y)
plt.show()
# Pandas
pd.Series(y).plot()
plt.show()
# Seaborn
sns.lineplot(x=x, y=y)
plt.show()
# Bokeh
p = figure()
p.line(x, y)
show(p)
OUTPUT - BOKEH
EXAMPLES – SEABORN
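# Data
fruits = ["apple", "banana", "grape", "strawberry", "papaya"]
stage = ["ripe", "unripe", "rotten"]
# Assumed reconstruction: the listing does not show how ct is built;
# a cross-tabulation of random samples is used here for illustration
fruit_data = np.random.choice(fruits, size=100)
stage_data = np.random.choice(stage, size=100)
ct = pd.crosstab(fruit_data, stage_data)
# Seaborn
sns.heatmap(ct)
plt.show()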
OUTPUT –
MATPLOTLIB AND PANDAS (SAME OUTPUT )