
Data Analysis of Visualization

CHAPTER -1 Preliminaries

Why is Python used for data analysis?


Python is widely used for data analysis for several reasons:

1. Extensive Libraries: Python offers a rich ecosystem of libraries and tools that are
well-suited for data analysis. The most prominent library for data analysis in Python is
pandas, which provides data structures and functions for efficient data manipulation and
analysis. Additionally, there are libraries like NumPy for numerical computations,
Matplotlib and Seaborn for data visualization, and Scikit-Learn for machine learning.

2. Data Handling Capabilities: Python, with its pandas library, provides data structures
like DataFrames and Series that are designed for handling and manipulating tabular
data. These data structures make it easy to clean, reshape, and transform data for
analysis.

3. Versatility: Python is a versatile language that can be used for various tasks in data
analysis, including data cleaning, exploration, visualization, statistical analysis, and
machine learning. This versatility allows data analysts and data scientists to work within
a single language for their entire workflow.

4. Open Source Community: Python's open-source nature has led to a vast and active
community of developers who have created and continue to maintain a wide range of
data analysis libraries. This open-source ecosystem ensures that the tools available for
data analysis are constantly evolving and improving.

5. Accessibility: Python is known for its simple and readable syntax, making it
accessible to individuals from diverse backgrounds, including those who may not have a
strong programming background. This ease of use is especially beneficial for data
analysts who may not be primarily programmers.

6. Integration: Python can easily integrate with other programming languages and
tools, making it a versatile choice for data analysis projects. It can interface with
databases, web APIs, and other data sources, making it suitable for various data
collection and extraction tasks.

7. Strong Data Visualization: Python has excellent data visualization libraries such as
Matplotlib, Seaborn, and Plotly, which allow analysts to create informative and
aesthetically pleasing charts and graphs to communicate their findings effectively.

8. Machine Learning Capabilities: Python offers machine learning libraries such as
Scikit-Learn, TensorFlow, and PyTorch, making it an ideal choice for building predictive
models and performing advanced data analysis tasks.

9. Active Development and Community Support: Python has a robust and active
development community, ensuring that it stays up-to-date with the latest developments
in data analysis and technology.

10. Reproducibility: Python's ecosystem includes tools like Jupyter notebooks, which
are widely used in data analysis. Jupyter notebooks facilitate reproducible research by
allowing analysts to document their analysis process and share it with others.

These reasons, among others, have made Python a popular and powerful choice for
data analysis, and it continues to be the go-to language for many data professionals
and researchers.
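As a minimal illustration of this ecosystem (the table below is made up for the example), pandas can build, group, and aggregate tabular data in a few lines, with no explicit loops:

```python
import pandas as pd

# A small, made-up table of the kind pandas is designed to manipulate
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi'],
                   'sales': [250, 300, 150]})

# Group and aggregate in one expression -- no explicit loops needed
totals = df.groupby('city')['sales'].sum()
print(totals)
```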

Why not use Python?


While Python is a versatile and widely-used programming language, there are situations
and contexts where it may not be the best choice. Here are a few reasons not to use
Python in certain scenarios:

1. Performance: Python is an interpreted language, which means it can be slower than
compiled languages like C or C++. This performance gap can be a significant issue for
computationally intensive tasks or real-time applications where speed is critical.

2. GIL (Global Interpreter Lock): Python has a Global Interpreter Lock, which can limit
its ability to utilize multiple CPU cores effectively. This limitation can make Python less
suitable for parallel processing or multithreaded applications.

3. Resource Intensive: Python may consume more memory than some other
languages due to its dynamic typing and extensive standard libraries. This can be a
problem when dealing with large datasets or in resource-constrained environments.
4. Mobile App Development: If you are developing mobile apps for iOS or Android,
Python might not be the best choice. While there are frameworks like Kivy and
BeeWare, they are less popular compared to native languages like Swift (for iOS) and
Java/Kotlin (for Android).

5. Low-level System Programming: Python is not suitable for low-level system
programming or writing device drivers. Languages like C and C++ are better suited for
these tasks.

6. Lack of Strong Static Typing: Python is dynamically typed, which can lead to
type-related errors only being discovered at runtime. In situations where type safety and
strict control are crucial, a statically-typed language like Java or C# might be a better
choice.

7. Dependency Management: Python's dependency management can sometimes lead
to compatibility issues and version conflicts. Tools like virtual environments and package
managers like pip can help mitigate this, but it can still be a source of frustration.

8. Learning Curve: Python may not be the best choice if you are working in a field
where another language is more commonly used, or if your team already has expertise
in a different language. Learning a new language can be time-consuming.

9. Limited GUI Development: While there are GUI development libraries like Tkinter
and PyQt for Python, if your primary focus is building rich graphical user interfaces
(GUIs), languages like Java or C# may offer more extensive and platform-specific
options.

10. Specialized Domains: In some specialized fields, there might be languages better
suited to the specific needs of that domain. For example, R is a popular language for
statistics and data science, and MATLAB is widely used in engineering and scientific
computing.

It's important to consider the specific requirements of your project and the strengths and
weaknesses of Python and other programming languages before making a decision.
Often, the choice of language depends on the specific problem you are trying to solve
and the ecosystem and libraries available for that problem domain.

Essential Python Libraries


1. Numpy
2. Pandas
3. Matplotlib
4. IPython and Jupyter
5. SciPy
6. Scikit-learn
7. Statsmodels

CHAPTER -2 Python Language Basics, IPython, and Jupyter Notebooks

The Python Interpreter


Python is an interpreted language. The Python interpreter runs a program by executing
one statement at a time. The standard interactive Python interpreter can be invoked
on the command line with the python command:

$ python
Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) [GCC 4.8.2
20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 5
>>> print(a)
5

Type exit() or press Ctrl-D to exit the program.

To run a Python program from the command line:


$ python hello_world.py

IPython basics

Running the IPython Shell


$ ipython
Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) Type
"copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
To run a Jupyter notebook from the command line:

$ jupyter notebook

To run a script inside the IPython session:

%run filename

Tab Completion
While entering expressions in the shell, pressing the Tab key will search the
namespace for any variables (objects, functions, etc.) matching the characters you have
typed so far:
In [1]: an_apple = 27
In [2]: an_example = 42
In [3]: an<Tab>
an_apple   and   an_example   any

You can also complete methods and attributes on any object after typing a period:
In [3]: b = [1, 2, 3]
In [4]: b.<Tab>
b.append b.count b.insert b.reverse
b.clear b.extend b.pop b.sort
b.copy b.index b.remove

The same goes for modules:


In [1]: import datetime
In [2]: datetime.<Tab>
datetime.date datetime.MAXYEAR datetime.timedelta datetime.datetime
datetime.MINYEAR datetime.timezone datetime.datetime_CAPI datetime.time
datetime.tzinfo

Introspection
Using a question mark (?) before or after a variable will display some general
information about the object:
In [8]: b = [1, 2, 3]
In [9]: b?
Type: list
String Form:[1, 2, 3]
Length: 3
Docstring: list() -> new empty list
list(iterable) -> new list initialized from iterable's items
If the object is a function or instance method, the docstring, if defined, will also be
shown.
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b

Then using ? shows us the docstring:


In [11]: add_numbers?
Signature: add_numbers(a, b)
Docstring: Add two numbers together
Returns
-------
the_sum : type of arguments
File: <ipython-input>
Type: function

Using ?? will also show the function’s source code if possible:


In [12]: add_numbers??
Signature: add_numbers(a, b)
Source:
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b
File: <python-input-9-6a548a216e27>
Type: function

The %run Command


You can run any file as a Python program inside the environment of your IPython
session using the %run command.
Simple script stored in ipython_script_test.py:
def f(x, y, z):
    return (x + y) / z

a = 5
b = 6
c = 7.5
result = f(a, b, c)

You can execute this by passing the filename to %run:


In [14]: %run ipython_script_test.py

Executing Code from the Clipboard


The most foolproof methods are the %paste and %cpaste magic functions.

%paste takes whatever text is in the clipboard and executes it as a single block in the
shell:

In [17]: %paste
x = 5
y = 7
if x > 5:
    x += 1
    y = 8
## -- End pasted text --

%cpaste is similar, except that it gives you a special prompt for pasting code into:
In [18]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:x = 5
:y = 7
:if x > 5:
: x += 1
:
:y=8
:-
With the %cpaste block, you have the freedom to paste as much code as you like
before executing it. You might decide to use %cpaste in order to look at the pasted code
before executing it. If you accidentally paste the wrong code, you can break out of the
%cpaste prompt by pressing Ctrl-C.

About Magic Commands


IPython’s special commands (which are not built into Python itself) are known as
“magic” commands.

Magic functions can be used by default without the percent sign, as long as no variable
is defined with the same name as the magic function in question. This feature is
called automagic and can be enabled or disabled with %automagic.
Matplotlib Integration
The %matplotlib magic function configures its integration with the IPython shell or
Jupyter notebook.

In the IPython shell, running %matplotlib sets up the integration so you can create
multiple plot windows without interfering with the console session:
In [26]: %matplotlib
Using matplotlib backend: Qt4Agg

In Jupyter, the command is a little different (Figure 2-6):


In [26]: %matplotlib inline

Python Language Basics

Language Semantics

1. Indentation, not braces: Python uses whitespace indentation to structure code, rather than braces.

2. Everything is an object : An important characteristic of the Python language is


the consistency of its object model. Every number, string, data structure, function,
class, module, and so on exists in the Python interpreter in its own “box,” which is
referred to as a Python object. Each object has an associated type (e.g., string or
function) and internal data. In practice this makes the language very flexible, as
even functions can be treated like any other object.
3. Comments: a comment in Python begins with # (single-line comment); triple-quoted
strings (""" ... """) are often used as multiline comments.
4. Functions, variables, and argument passing: a function is a block of code that
can be called in the program when required. Variables act as containers that
store values, while parameters are the variables defined in a function that
receive the argument values supplied by the caller at run time.

def add(a, b):
    return a + b

add(13, 2)

In the above example, add is the function, a and b are the parameters, and 13 and 2
are the argument values passed.
5. Dynamic references, strong types: Python is considered a strongly typed
language, which means that every object has a specific type (or class), and
implicit conversions will occur only in certain obvious circumstances.

A string cannot be added to an integer; attempting to do so raises a TypeError.


isinstance can accept a tuple of types if you want to check that an object’s type is
among those present in the tuple.
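A short sketch of both forms of the check:

```python
a = 5
b = 4.5

# Check against a single type
print(isinstance(a, int))            # True

# Or pass a tuple of types to check whether the object's type
# is among those present in the tuple
print(isinstance(a, (int, float)))   # True
print(isinstance(b, (int, float)))   # True
```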

6. Attributes and methods: Python objects have both attributes (other objects stored
"inside" the object) and methods (functions attached to the object), and both can be
accessed using obj.attribute_name.

In other languages, accessing objects by name is often referred to as “reflection.”


7. Duck typing: often you do not care about the type of an object but rather
only whether it has certain methods or behavior. This is sometimes called "duck
typing."

For example, you can verify that an object is iterable if it implements the iterator
protocol. For many objects, this means it has a __iter__ "magic method," though an
alternative and better way to check is to try using the iter function:
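A sketch of that iter-based check, wrapped in a small helper function:

```python
def isiterable(obj):
    """Return True if obj supports iteration (duck typing via iter)."""
    try:
        iter(obj)
        return True
    except TypeError:  # object is not iterable
        return False

print(isiterable('a string'))  # True
print(isiterable([1, 2, 3]))   # True
print(isiterable(5))           # False
```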

8. Import module: the import statement is used to access code defined in another file (a module).

Suppose there is a file named "sum.py" containing:

def add(a, b):
    return a + b

It can then be used from another file:

import sum
a = sum.add(13, 2)
print(a)

9. Binary operators and comparisons:


The == and is operators differ in the following ways:

== operator: it compares only the values of the two operands, not their memory addresses.

Example: a=[1,2,3,4]
b=[1,2,3,4]
print(a==b)
Output: True

is operator: in addition to comparing values, it compares the identities (memory
addresses) of the two operands and returns True only when both operands refer to the
same object.
Example: a = [1, 2, 3, 4]
b = [1, 2, 3, 4]
print(a is b)
Output: False  # because a and b refer to two different objects

Example: a=[1,2,3,4]
b=a
print(a is b)
Output: True

10. Scalar Types: Python, along with its standard library, has a small set of built-in
types for handling numerical data, strings, Boolean (True or False) values, and
dates and times. These "single value" types are sometimes called scalar types.

11. Numeric Types: there are two primary numeric types in Python:
● Integer (int)
● Floating point (float)

12. Strings: string literals in Python can be written with single quotes (' '), double quotes (" "), or triple quotes (""" """), the last of which allows multiline strings.


c = """
This is a longer string that
spans multiple lines
"""
In [55]: c.count('\n')
Out[55]: 3
13. Boolean: the Boolean values in Python are True and False.
14. Type Casting: the conversion of a value from one data type to another, e.g. int('5') or float(3).
15. None: None is the null value in Python.

16. DateTime: datetime is an immutable data type.

Given a datetime instance, you can extract the equivalent date and time objects by
calling methods on the datetime of the same name:

Replacing the minute and second fields with zero:
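A short sketch of both operations:

```python
from datetime import datetime

dt = datetime(2011, 10, 29, 20, 30, 21)

# Extract the equivalent date and time objects
print(dt.date())  # 2011-10-29
print(dt.time())  # 20:30:21

# datetime is immutable, so replace() returns a *new* object
dt_hour = dt.replace(minute=0, second=0)
print(dt_hour)    # 2011-10-29 20:00:00
```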


17. Control Flow: if, elif, and else statements; for and while loops; and the continue,
break, and pass statements.

18. Ternary expressions: a ternary expression in Python allows you to combine an
if-else block that produces a value into a single line or expression. The syntax for this
in Python is:
value = true-expr if condition else false-expr
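For example:

```python
x = 5
value = 'Non-negative' if x >= 0 else 'Negative'
print(value)  # Non-negative
```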
CH-3 Built-in Data Structures, Functions, and Files

Data Structures and Sequences

1. Tuple
A tuple is a fixed-length, immutable sequence of Python objects.
In [1]: tup = 4, 5, 6
In [2]: tup
Out[2]: (4, 5, 6)

While the objects stored in a tuple may be mutable themselves, once the tuple is
created it’s not possible to modify which object is stored in each slot:

If an object inside a tuple is mutable, such as a list, you can modify it in-place:
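A sketch of both behaviors:

```python
tup = tuple(['foo', [1, 2], True])

# Rebinding a slot fails, because tuples are immutable
try:
    tup[2] = False
except TypeError:
    print('cannot assign to a tuple slot')

# But a mutable object stored inside the tuple can be changed in place
tup[1].append(3)
print(tup)  # ('foo', [1, 2, 3], True)
```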

Unpacking tuples
If you try to assign to a tuple-like expression of variables, Python will attempt to unpack
the value on the right hand side of the equals sign:
In [15]: tup = (4, 5, 6)
In [16]: a, b, c = tup
In [17]: b
Out[17]: 5
Even sequences with nested tuples can be unpacked:
In [18]: tup = 4, 5, (6, 7)
In [19]: a, b, (c, d) = tup
In [20]: d
Out[20]: 7

The count method counts the number of occurrences of a value:


In [34]: a = (1, 2, 2, 2, 3, 4, 2)
In [35]: a.count(2)
Out[35]: 4

2. List
In contrast with tuples, lists are variable-length and their contents can be modified
in-place. You can define them using square brackets [] or using the list type function:
In [36]: a_list = [2, 3, 7, None]

Adding and removing elements


Concatenating and combining lists

Note that list concatenation by addition is a comparatively expensive operation since a
new list must be created and the objects copied over. Using extend to append elements
to an existing list, especially if you are building up a large list, is usually preferable.
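The two subsections above can be sketched as follows (the element values are arbitrary):

```python
b_list = ['foo', 'bar', 'baz']

# Adding and removing elements
b_list.append('dwarf')    # add to the end
b_list.insert(1, 'red')   # insert at a specific position (costlier than append)
popped = b_list.pop(2)    # remove and return the element at index 2 ('bar')
b_list.remove('foo')      # remove the first occurrence of a value
print(b_list)             # ['red', 'baz', 'dwarf']

# Concatenating and combining lists
combined = [4, None, 'foo'] + [7, 8, (2, 3)]  # creates a new list
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])                      # modifies x in place
print(combined == x)                          # True
```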

Sorting
In [61]: a = [7, 2, 5, 1, 3]
In [62]: a.sort()
In [63]: a
Out[63]: [1, 2, 3, 5, 7]

In [64]: b = ['saw', 'small', 'He', 'foxes', 'six']


In [65]: b.sort(key=len)

In [66]: b
Out[66]: ['He', 'saw', 'six', 'small', 'foxes']
Binary search and maintaining a sorted list
The built-in bisect module implements binary search and insertion into a sorted list.
bisect.bisect finds the location where an element should be inserted to keep it sorted,
while bisect.insort actually inserts the element into that location:
In [67]: import bisect
In [68]: c = [1, 2, 2, 2, 3, 4, 7]
In [69]: bisect.bisect(c, 2)
Out[69]: 4
In [70]: bisect.bisect(c, 5)
Out[70]: 6

In [71]: bisect.insort(c, 6)
In [72]: c
Out[72]: [1, 2, 2, 2, 3, 4, 6, 7]

The bisect module functions do not check whether the list is sorted, as doing so would
be computationally expensive. Thus, using them with an unsorted list will succeed
without error but may lead to incorrect results.

Built-in Sequence Functions

1. Enumerate
The enumerate function in Python is used to iterate over an iterable (such as a
list, tuple, or string) while keeping track of both the index and the value of each item. It
returns a tuple of the form (index, value). This can be very useful when you need to
know both the position and the value of items in a sequence. Here's an easy example to
illustrate how it works:

fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']

for index, fruit in enumerate(fruits):
    print(f"Index {index}: {fruit}")

In this example:
We have a list called fruits containing several fruit names.
We use a for loop to iterate over the fruits list.
We apply the enumerate function to the fruits list, which returns an iterable of tuples,
where each tuple consists of an index and a value from the original list.
In each iteration, the index variable contains the index (starting from 0), and the fruit
variable contains the value (the name of a fruit).
The output of this code will be:

Index 0: apple
Index 1: banana
Index 2: cherry
Index 3: date
Index 4: elderberry
As you can see, the enumerate function simplifies the process of iterating over a list or
other iterables while keeping track of the position of each item. This is especially useful
in situations where you need to perform specific actions or access both the index and
the value of the items in the iterable.

2. Sorted
The sorted function returns a new sorted list from the elements of any sequence:
In [87]: sorted([7, 1, 2, 6, 0, 3, 2])
Out[87]: [0, 1, 2, 2, 3, 6, 7]
In [88]: sorted('horse race')
Out[88]: [' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']
The sorted function accepts the same arguments as the sort method on lists.

3. Zip
zip “pairs” up the elements of a number of lists, tuples, or other sequences to create a
list of tuples:
In [89]: seq1 = ['foo', 'bar', 'baz']
In [90]: seq2 = ['one', 'two', 'three']
In [91]: zipped = zip(seq1, seq2)
In [92]: list(zipped)
Out[92]: [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

zip can take an arbitrary number of sequences, and the number of elements it produces
is determined by the shortest sequence:
In [93]: seq3 = [False, True]
In [94]: list(zip(seq1, seq2, seq3))
Out[94]: [('foo', 'one', False), ('bar', 'two', True)]

4. Reversed
Reversed iterates over the elements of a sequence in reverse order:
In [100]: list(reversed(range(10)))
Out[100]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Keep in mind that reversed is a generator.

5. Dictionary
A more common name for it is hash map or associative array. It is a flexibly sized
collection of key-value pairs, where key and value are Python objects.

In [102]: d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}

In [103]: d1
Out[103]: {'a': 'some value', 'b': [1, 2, 3, 4]}

In [104]: d1[7] = 'an integer'

In [118]: list(d1.values())
Out[118]: ['some value', [1, 2, 3, 4], 'an integer']

6. Set
A set is an unordered collection of unique elements.
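A short sketch of set behavior:

```python
# Duplicates are discarded when a set is created
print(set([2, 2, 2, 1, 3, 3]))  # {1, 2, 3}

a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

print(a | b)  # union: {1, 2, 3, 4, 5, 6, 7, 8}
print(a & b)  # intersection: {3, 4, 5}
```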
List, Set, and Dict Comprehensions

List Comprehension:

numbers = [1, 2, 3, 4, 5]
squared_numbers = [x ** 2 for x in numbers]
print(squared_numbers) # Output: [1, 4, 9, 16, 25]

Set Comprehension:

numbers = [1, 2, 2, 3, 4, 4, 5]
unique_evens = {x for x in numbers if x % 2 == 0}
print(unique_evens) # Output: {2, 4}

Dictionary Comprehension:

numbers = [1, 2, 3, 4, 5]
squared_dict = {x: x ** 2 for x in numbers}
print(squared_dict) # Output: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

Functions

If you anticipate needing to repeat the same or very similar code more than once, it may
be worth writing a reusable function. Functions can also help make your code more
readable by giving a name to a group of Python statements.

Namespaces, Scope, and Local Functions


Functions can access variables in two different scopes: global and local. An alternative
and more descriptive name describing a variable scope in Python is a namespace. Any
variables that are assigned within a function by default are assigned to the local
namespace. The local namespace is created when the function is called and
immediately populated by the function's arguments. After the function is finished, the
local namespace is destroyed (with some exceptions that are outside the purview of
this chapter).
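A sketch of the two scopes (the variable names are illustrative):

```python
a = []

def func():
    # b is assigned inside the function, so it lives in the local
    # namespace and is discarded when the function returns
    b = []
    for i in range(5):
        b.append(i)

def func2():
    # a is never assigned here, only mutated, so Python resolves it
    # in the enclosing (global) namespace
    for i in range(5):
        a.append(i)

func()
func2()
print(a)  # [0, 1, 2, 3, 4]
```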

Anonymous (Lambda) Functions


Python has support for so-called anonymous or lambda functions, which are a way of
writing functions consisting of a single statement, the result of which is the return value.

def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2

One reason lambda functions are called anonymous functions is that, unlike functions
declared with the def keyword, the function object itself is never given an explicit
__name__ attribute.

Currying: Partial Argument Application


Currying is computer science jargon (named after the mathematician Haskell Curry)
that means deriving new functions from existing ones by partial argument application.

Currying:

Currying is the process of converting a function that takes multiple arguments into a
series of unary (single-argument) functions. Each of these unary functions takes one
argument and returns a new function that expects the next argument. This process
continues until all the arguments are supplied, and the final function returns the result.

Here's an example of currying in Python:

def add(x):
    def add_x(y):
        return x + y
    return add_x

# Usage
add_five = add(5)
result = add_five(3) # Calls add_x(3), which returns 5 + 3 = 8

In this example, the add function takes one argument x, and it returns another function
add_x that takes a single argument y. When we call add(5), it returns a function that can
add 5 to any number we pass to it

Partial Argument Application:

Partial argument application is a related concept. It allows you to fix a subset of the
arguments of a function, creating a new function with the remaining arguments. This is
useful for creating specialized functions based on a more general one.

def multiply(x, y):
    return x * y

# Using the functools.partial to create a specialized function


from functools import partial

double = partial(multiply, 2) # Fixing the first argument as 2

# Usage
result = double(5) # Calls multiply(2, 5), which returns 2 * 5 = 10

In this example, we use functools.partial to create a new function double from the
multiply function by fixing the first argument as 2. This creates a specialized function
that doubles any number passed to it.

Generators

Generators in Python are a way to create iterable sequences of data without storing the
entire sequence in memory. They are especially useful when dealing with large datasets
or when you want to generate values on-the-fly. Generators use the yield keyword to
produce values one at a time and remember their state between calls. Here's an easy
example to help you understand how generators work:

def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

# Create a generator object


counter = count_up_to(5)

# Use a for loop to iterate over the generator


for number in counter:
print(number)

The output of the code will be:

1
2
3
4
5

In this example, count_up_to is a generator function. It takes an argument n and yields
values from 1 up to n. When you create a generator object by calling count_up_to(5), it
doesn't immediately execute the function. Instead, it sets up a generator that
remembers its state.

When you loop through the generator using a for loop, each time you encounter the
yield keyword, the function's state is saved, and the yielded value is returned. The
function can then continue from where it left off. This allows you to generate and
process values one at a time, which is memory-efficient.

itertools module

The standard library itertools module has a collection of generators for many common
data algorithms.
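For example, itertools.groupby collects consecutive elements of a sequence that share the same key:

```python
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# groupby only merges *consecutive* elements with equal keys,
# which is why 'Albert' starts a second 'A' group
grouped = [(letter, list(group))
           for letter, group in itertools.groupby(names, first_letter)]
print(grouped)
# [('A', ['Alan', 'Adam']), ('W', ['Wes', 'Will']),
#  ('A', ['Albert']), ('S', ['Steven'])]
```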

Errors and Exception Handling

try:
    numerator = int(input("Enter the numerator: "))
    denominator = int(input("Enter the denominator: "))
    result = numerator / denominator
    print("Result:", result)
except ZeroDivisionError:
    print("Error: Division by zero is not allowed.")
except ValueError:
    print("Error: Please enter valid integer values.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
CH-4 NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python.

One of the reasons NumPy is so important for numerical computations in Python is
that it is designed for efficiency on large arrays of data. There are a number of
reasons for this:
● NumPy internally stores data in a contiguous block of memory, independent of
other built-in Python objects. NumPy’s library of algorithms written in the C
language can operate on this memory without any type checking or other
overhead. NumPy arrays also use much less memory than built-in Python
sequences.
● NumPy operations perform complex computations on entire arrays without the
need for Python for loops.

In [7]: import numpy as np


In [8]: my_arr = np.arange(1000000)
In [9]: my_list = list(range(1000000))

In [10]: %time for _ in range(10): my_arr2 = my_arr * 2
CPU times: user 20 ms, sys: 50 ms, total: 70 ms
Wall time: 72.4 ms

In [11]: %time for _ in range(10): my_list2 = [x * 2 for x in my_list]
CPU times: user 760 ms, sys: 290 ms, total: 1.05 s
Wall time: 1.05 s

This code compares the execution times of two operations in Python:

● Using a NumPy array (my_arr) and multiplying it by 2 in a loop.


● Using a regular Python list (my_list) and creating a new list by multiplying each
element by 2 in a list comprehension.
Let's break down the results:

For the NumPy array (my_arr):


● The %time magic command is used to measure the execution time.
● The loop runs 10 times, and in each iteration, the NumPy array is multiplied by 2
(my_arr * 2).
● The "CPU times" information shows that it took 20 ms in user time and 50 ms in
system time, with a total time of 70 ms.
● The "Wall time" (the actual time elapsed) is 72.4 ms.

For the Python list (my_list):


● Similar to the NumPy case, the loop runs 10 times, but in each iteration, a new list
is created using a list comprehension ([x * 2 for x in my_list]).
● The "CPU times" information shows that it took 760 ms in user time and 290 ms
in system time, with a total time of 1.05 s.
● The "Wall time" (the actual time elapsed) is also 1.05 s.
Explanation:
● NumPy is a powerful library for numerical and array computations in Python. It is
highly optimized for such operations and can take advantage of low-level
optimizations and parallel processing. As a result, multiplying the entire NumPy
array by 2 is much faster, and the operation is vectorized, meaning it's performed
element-wise in an efficient manner. This is why the "CPU times" and "Wall time"
are significantly lower in the NumPy case compared to the list comprehension.
● When you use a list comprehension on a regular Python list, the operation is not
as optimized as the NumPy operation. Each element in the list is accessed
individually and multiplied by 2, resulting in a slower operation. This is why the
"CPU times" and "Wall time" are significantly higher for the list comprehension.

In summary, NumPy is designed for efficient numerical operations, and it's highly
recommended for tasks involving large datasets and numerical computations. Regular
Python lists are more versatile but are generally slower for such operations.

The NumPy ndarray: A Multidimensional Array Object


An ndarray (N-dimensional array) is a generic multidimensional container for homogeneous
data; that is, all of the elements must be the same type.

In [12]: import numpy as np  # Generate some random data

In [13]: data = np.random.randn(2, 3)

In [14]: data
Out[14]:
array([[-0.2047,  0.4789, -0.5194],
       [-0.5557,  1.9658,  1.3934]])

In [15]: data * 10
Out[15]:
array([[ -2.0471,   4.7894,  -5.1944],
       [ -5.5573,  19.6578,  13.9341]])

In [16]: data + data
Out[16]:
array([[-0.4094,  0.9579, -1.0389],
       [-1.1115,  3.9316,  2.7868]])
Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an
object describing the data type of the array:
In [17]: data.shape
Out[17]: (2, 3)
In [18]: data.dtype
Out[18]: dtype('float64')

Difference between array, NumPy array, and ndarray


Let's clarify the differences between "array," "NumPy array," and "ndarray" with easy
examples and exceptions:

1. Array:
● An "array" is a generic term referring to a data structure that stores a collection of
elements. Arrays can be found in various programming languages, but the term is
not specific to Python or NumPy.
● Arrays can be of different types, including lists, tuples, or built-in arrays like those
in Python's `array` module.
● Arrays can store elements of mixed data types, and they don't necessarily
support advanced operations like element-wise vectorization.
Example:
my_array = [1, 2, 3, 4, 5]
Exception: Arrays in different programming languages or modules may have their own
characteristics and limitations, so the behavior of an "array" can vary.

2. NumPy Array:
● A "NumPy array" specifically refers to the array-like data structure provided by the
NumPy library.
● NumPy arrays are homogeneous, meaning they contain elements of the same
data type.
● They are highly efficient for numerical and scientific computations due to
optimized memory handling and support for vectorized operations.
Example:
import numpy as np
my_numpy_array = np.array([1, 2, 3, 4, 5])

Exception: NumPy arrays are well-suited for numerical operations, but they may not be
the best choice when you need heterogeneous data structures or when compatibility
with non-NumPy code is required.
3.ndarray:
● "ndarray" stands for "n-dimensional array" and is synonymous with "NumPy array."
It's a key data structure in NumPy, designed for handling multi-dimensional data.
● An ndarray can have any number of dimensions, making it suitable for working
with matrices, tensors, and higher-dimensional data.
● The term "ndarray" emphasizes the multi-dimensional aspect of NumPy arrays.
Example:
import numpy as np
my_ndarray = np.array([[1, 2, 3], [4, 5, 6]])

Exception: The term "ndarray" is often used interchangeably with "NumPy array" within
the context of NumPy. There's no significant practical difference between the two.

In summary, "array" is a broad term encompassing various data structures, while
"NumPy array" and "ndarray" are specific to the NumPy library.

Creating ndarrays
In [19]: data1 = [6, 7.5, 8, 0, 1]
In [20]: arr1 = np.array(data1)
In [21]: arr1
Out[21]: array([ 6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a
multidimensional array:
In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [23]: arr2 = np.array(data2)
In [24]: arr2
Out[24]: array([[1, 2, 3, 4], [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape attributes:
In [25]: arr2.ndim
Out[25]: 2
In [26]: arr2.shape
Out[26]: (2, 4)

It's not safe to assume that np.empty will return an array of all zeros. In some cases, it
may return uninitialized "garbage" values.
Data Types for ndarrays
You can explicitly convert or cast an array from one dtype to another using ndarray's
astype method. Calling astype always creates a new array (a copy of the data), even if
the new dtype is the same as the old dtype.
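A minimal sketch of astype in action (the array values here are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)

floats = arr.astype(np.float64)  # cast integers to floats
same = arr.astype(arr.dtype)     # same dtype, but still a fresh copy

print(floats.dtype)  # float64
print(same is arr)   # False: astype always returns a new array
```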

Arithmetic with NumPy Arrays


Arrays are important because they enable you to express batch operations on data
without writing any for loops. NumPy users call this vectorization.
Operations between differently sized arrays are called broadcasting.
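A small sketch of both ideas, assuming nothing beyond NumPy itself:

```python
import numpy as np

a = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # shape (2, 3)
row = np.array([10., 20., 30.])   # shape (3,)

# Vectorization: batch arithmetic without an explicit for loop
doubled = a * 2

# Broadcasting: the (3,) row is stretched across each row of the (2, 3) array
shifted = a + row

print(doubled)
print(shifted)
```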
Basic Indexing and Slicing

Array slices are views on the original array rather than copies. If you assign
arr_slice = arr[5:8] and then change values in arr_slice, the mutations are reflected
in the original array arr.

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly
copy the array—for example, arr[5:8].copy().
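The view-versus-copy distinction can be sketched like this:

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[5:8]   # a view on arr, not a copy
arr_slice[:] = 99      # mutating the view mutates arr as well
print(arr[5:8])        # [99 99 99]

arr2 = np.arange(10)
safe = arr2[5:8].copy()  # an explicit copy
safe[:] = 99
print(arr2[5:8])         # [5 6 7] -- the original is untouched
```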
Indexing with slices

Boolean Indexing
The ~ operator can be useful when you want to invert a general condition:
Selecting data from an array by boolean indexing always creates a copy of the data,
even if the returned array is unchanged. The Python keywords and and or do not work
with boolean arrays; use & (and) and | (or) instead.
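A short sketch of boolean indexing with &, |, and ~ (the example data is illustrative):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
data = np.arange(8).reshape(4, 2)

# Combine conditions with & and |, never the Python keywords and/or
mask = (names == 'Bob') | (names == 'Will')
selected = data[mask]   # always a copy of the data

# ~ inverts a general condition
others = data[~(names == 'Bob')]

print(selected)
print(others)
```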

Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
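For instance (a sketch; the shapes are arbitrary):

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

# Passing an integer array selects rows in the given order
rows = arr[[3, 0]]

# Passing two integer arrays selects individual elements (1, 0) and (2, 1)
elems = arr[[1, 2], [0, 1]]

print(rows)
print(elems)  # [3 7]
```

Unlike slicing, fancy indexing always copies the data into a new array.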
Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the
underlying data without copying anything. Arrays have the transpose method and also
the special T attribute:
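A brief sketch; because .T returns a view, writes through it flow back to the original:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

print(arr.T.shape)   # (3, 2): rows and columns swapped, no data copied

arr.T[0, 1] = 99     # .T is a view, so this writes into arr
print(arr[1, 0])     # 99
```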
Universal Functions: Fast Element-Wise Array Functions
A universal function, or ufunc, is a function that performs element-wise operations on
data in ndarrays. You can think of them as fast vectorized wrappers for simple functions
that take one or more scalar values and produce one or more scalar results.
These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays
(thus, binary ufuncs) and return a single array as the result:

Ufuncs accept an optional out argument that allows them to operate in-place on arrays:
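A sketch of a unary ufunc, a binary ufunc, and the out argument:

```python
import numpy as np

x = np.array([1., 4., 9.])
y = np.array([2., 3., 8.])

print(np.sqrt(x))        # unary ufunc: element-wise square root
print(np.maximum(x, y))  # binary ufunc: element-wise maximum of two arrays

out = np.zeros(3)
np.add(x, y, out=out)    # out= writes the result in place instead of allocating
print(out)
```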
Array-Oriented Programming with Arrays
Using NumPy arrays enables you to express many kinds of data processing tasks as
concise array expressions that might otherwise require writing loops. This practice of
replacing explicit loops with array expressions is commonly referred to as
vectorization. In general, vectorized array operations will often be one or two (or more) orders of
magnitude faster than their pure Python equivalents, with the biggest impact in any kind
of numerical computations.
Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array or about
the data along an axis are accessible as methods of the array class. You can use
aggregations (often called reductions) like sum, mean, and std (standard deviation) either by
calling the array instance method or using the top-level NumPy function.
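For example (illustrative data):

```python
import numpy as np

arr = np.array([[1., 2.],
                [3., 4.]])

print(arr.mean())       # 2.5 -- reduction over the whole array
print(arr.sum(axis=0))  # [4. 6.] -- reduce down the rows (column sums)
print(arr.sum(axis=1))  # [3. 7.] -- reduce across the columns (row sums)
print(np.mean(arr))     # the top-level NumPy function is equivalent
```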
Methods for Boolean Arrays
Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus,
sum is often used as a means of counting True values in a boolean array:
There are two additional methods, any and all, useful especially for boolean arrays. any
tests whether one or more values in an array is True, while all checks if every value is
True

These methods also work with non-boolean arrays, where non-zero elements evaluate
to True.
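A compact sketch of sum, any, and all on boolean (and non-boolean) arrays:

```python
import numpy as np

arr = np.array([-1.2, 0.0, 3.4, 2.1])
bools = arr > 0

print(bools.sum())  # 2 -- True coerces to 1, so sum counts the True values
print(bools.any())  # True  -- at least one value is True
print(bools.all())  # False -- not every value is True
print(arr.any())    # True  -- non-zero elements evaluate to True
```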

Sorting

The top-level method np.sort returns a sorted copy of an array instead of modifying the
array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it
and select the value at a particular rank.
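A sketch of both points (the choice of the 40% quantile is arbitrary):

```python
import numpy as np

arr = np.array([3., 1., 2., 5., 4.])

sorted_copy = np.sort(arr)  # top-level np.sort returns a sorted copy
arr.sort()                  # the instance method sorts in place

# Quick-and-dirty quantile: sort, then pick the value at the desired rank
q40 = arr[int(0.4 * len(arr))]
print(sorted_copy)  # [1. 2. 3. 4. 5.]
print(q40)          # 3.0
```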
Unique and Other Set Logic

Another function, np.in1d, tests membership of the values in one array in another,
returning a boolean array:
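For example:

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])

print(np.unique(values))           # sorted unique values: [0 2 3 5 6]
print(np.in1d(values, [2, 3, 6]))  # boolean membership test per element
```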

File Input and Output with Arrays


np.save and np.load are the two workhorse functions for efficiently saving and loading
array data on disk. Arrays are saved by default in an uncompressed raw binary format
with file extension .npy.
You can save multiple arrays in an uncompressed archive using np.savez, passing the
arrays as keyword arguments:

When loading an .npz file, you get back a dict-like object that loads the individual arrays
lazily:
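A round-trip sketch, written to a temporary directory so it is safe to run anywhere:

```python
import os
import tempfile

import numpy as np

arr = np.arange(5)
path = os.path.join(tempfile.mkdtemp(), 'demo')

np.save(path, arr)                # writes demo.npy
loaded = np.load(path + '.npy')

np.savez(path, a=arr, b=arr * 2)  # writes demo.npz; kwargs name the arrays
arch = np.load(path + '.npz')     # dict-like; arrays load lazily on access
print(loaded)
print(arch['b'])
```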

Linear Algebra

The @ symbol (as of Python 3.5) also works as an infix operator that performs matrix
multiplication:
numpy.linalg has a standard set of matrix decompositions and things like inverse and
determinant.

The expression X.T.dot(X) computes the dot product of X with its transpose X.T.
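A sketch of these operations on a small 2x2 matrix:

```python
import numpy as np

X = np.array([[1., 2.],
              [3., 4.]])
y = np.array([[5.],
              [6.]])

print(X @ y)             # matrix multiplication via the @ operator
print(X.T.dot(X))        # dot product of X with its transpose

print(np.linalg.det(X))  # determinant: -2.0
inv = np.linalg.inv(X)   # matrix inverse
print(X @ inv)           # approximately the identity matrix
```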
CH-5 Getting Started with pandas
How to import pandas:
In [1]: import pandas as pd
In [2]: from pandas import Series, DataFrame

Introduction to pandas Data Structures


The two workhorse data structures in pandas are Series and DataFrame.
Series
A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.

In [11]: obj = pd.Series([4, 7, -5, 3])


In [12]: obj
Out[12]:
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default one
consisting of the integers 0 through N - 1 (where N is the length of the data) is created.
In [13]: obj.values
Out[13]: array([ 4, 7, -5, 3])
In [14]: obj.index # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)

In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])


In [16]: obj2
Out[16]:
d    4
b    7
a   -5
c    3
dtype: int64
In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array,
scalar multiplication, or applying math functions, will preserve the index-value link:

In [20]: obj2['d'] = 6

In [21]: obj2[obj2 > 0]
Out[21]:
d    6
b    7
c    3
dtype: int64
In [22]: obj2 * 2
Out[22]:
d    12
b    14
a   -10
c     6
dtype: int64
In [23]: np.exp(obj2)
Out[23]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64

In [24]: 'b' in obj2


Out[24]: True
In [25]: 'e' in obj2
Out[25]: False

You can convert a Python dictionary into a pandas Series:


In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [28]: obj3
Out[28]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64

When you are only passing a dict, the index in the resulting Series will have the dict’s
keys in sorted order. You can override this by passing the dict keys in the order you want
them to appear in the resulting Series:
In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [30]: obj4 = pd.Series(sdata, index=states)
In [31]: obj4
Out[31]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In pandas, NaN (Not a Number) represents missing values, and the isnull and notnull
functions are used to detect them.
In [32]: pd.isnull(obj4)
Out[32]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [33]: pd.notnull(obj4)
Out[33]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
Series also has these as instance methods:
In [34]: obj4.isnull()
Out[34]:
California True
Ohio False
Oregon False
Texas False
dtype: bool

A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:
In [35]: obj3
Out[35]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [36]: obj4
Out[36]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [37]: obj3 + obj4
Out[37]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:
In [38]: obj4.name = 'population'
In [39]: obj4.index.name = 'state'
In [40]: obj4
Out[40]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64

A Series’s index can be altered in-place by assignment:


In [41]: obj
Out[41]:
0    4
1    7
2   -5
3    3
dtype: int64
In [42]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [43]: obj
Out[43]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64

DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dict of Series
all sharing the same index. The data is stored as one or more two-dimensional blocks
rather than a list, dict, or some other collection of one-dimensional arrays.

The simplest way to create a DataFrame is from a dict of equal-length lists or NumPy arrays:


data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002,
2001, 2002, 2003], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series, and
the columns are placed in sorted order:
In [45]: frame
Out[45]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003

For large DataFrames, the head method selects only the first five rows:
In [46]: frame.head()
Out[46]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002

If you specify a sequence of columns, the DataFrame’s columns will be arranged in that
order:
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2

If you pass a column that isn’t contained in the dict, it will appear with missing values in
the result:
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
....: index=['one', 'two', 'three', 'four',
....: 'five', 'six'])
In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN

In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by


attribute:
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64

Rows can also be retrieved by position or name with the special loc attribute (much
more on this later):
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column could
be assigned a scalar value or an array of values:
In [54]: frame2['debt'] = 16.5
In [55]: frame2
Out[55]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5

In [56]: frame2['debt'] = np.arange(6.)


In [57]: frame2
Out[57]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0

When you are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the
DataFrame’s index, inserting missing values in any holes:
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [59]: frame2['debt'] = val
In [60]: frame2
Out[60]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN

Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict. As an example of del, I first add a new column of boolean
values where the state column equals 'Ohio':
In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2
Out[62]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False

The del method can then be used to remove this column:


In [63]: del frame2['eastern']
In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
Another common form of data is a nested dict of dicts:
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3
Out[67]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

You can transpose the DataFrame (swap rows and columns) with similar syntax to a
NumPy array:
In [68]: frame3.T
Out[68]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6

In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])


Out[69]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN

In [70]: pdata = {'Ohio': frame3['Ohio'][:-1], ....: 'Nevada': frame3['Nevada'][:2]}


In [71]: pd.DataFrame(pdata)
Out[71]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7

In [72]: frame3.index.name = 'year'; frame3.columns.name = 'state'


In [73]: frame3
Out[73]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

As with Series, the values attribute returns the data contained in the DataFrame as a
two-dimensional ndarray:
In [74]: frame3.values
Out[74]: array([[ nan, 1.5], [ 2.4, 1.7], [ 2.9, 3.6]])

In [75]: frame2.values
Out[75]: array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)

Index Object

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names).
In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [77]: index = obj.index
In [78]: index
Out[78]: Index(['a', 'b', 'c'], dtype='object')
In [79]: index[1:]
Out[79]: Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:
index[1] = 'd' # TypeError
Immutability makes it safer to share Index objects among data structures:
In [80]: labels = pd.Index(np.arange(3))
In [81]: labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')
In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [83]: obj2
Out[83]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [84]: obj2.index is labels
Out[84]: True
In addition to being array-like, an Index also behaves like a fixed-size set:
In [85]: frame3
Out[85]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [86]: frame3.columns
Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [87]: 'Ohio' in frame3.columns
Out[87]: True
In [88]: 2003 in frame3.index
Out[88]: False

Unlike Python sets, a pandas Index can contain duplicate labels:


In [89]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Selections with duplicate labels will select all occurrences of that label.
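For example:

```python
import pandas as pd

obj = pd.Series(range(4), index=['foo', 'foo', 'bar', 'bar'])

print(obj['foo'])  # both 'foo' entries come back as a Series
print(obj['bar'])
```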
Essential Functionality
Reindexing
An important method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index.
Consider an example:
In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [92]: obj
Out[92]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:
In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [94]: obj2
Out[94]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as ffill, which forward-fills the values:
In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [96]: obj3
Out[96]:
0 blue
2 purple
4 yellow
dtype: object
In [97]: obj3.reindex(range(6), method='ffill')
Out[97]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed
only a sequence, it reindexes the rows in the result:
In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [99]: frame
Out[99]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [101]: frame2
Out[101]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

The columns can be reindexed with the columns keyword:


In [102]: states = ['Texas', 'Utah', 'California']
In [103]: frame.reindex(columns=states)
Out[103]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8

You can reindex more succinctly by label-indexing with loc, and many users prefer to
use it exclusively:
In [104]: frame.loc[['a', 'b', 'c', 'd'], states]
Out[104]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0

Dropping entries from an Axis


Dropping one or more entries from an axis is easy if you already have an index array or
list without those entries. Since that can require a bit of munging and set logic, the drop
method instead returns a new object with the indicated value or values deleted from an axis:
In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [106]: obj
Out[106]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [107]: new_obj = obj.drop('c')
In [108]: new_obj
Out[108]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [109]: obj.drop(['d', 'c'])
Out[109]:
a 0.0
b 1.0
e 4.0
dtype: float64

In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),


.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [111]: data
Out[111]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

In [112]: data.drop(['Colorado', 'Ohio'])


Out[112]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15

You can drop values from the columns by passing axis=1 or axis='columns':
In [113]: data.drop('two', axis=1)
Out[113]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [114]: data.drop(['two', 'four'], axis='columns')
Out[114]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14

Many functions, like drop, which modify the size or shape of a Series or DataFrame, can
manipulate an object in-place without returning a new object:
In [115]: obj.drop('c', inplace=True)
In [116]: obj
Out[116]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64

Indexing, Selection, and Filtering


In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [118]: obj
Out[118]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
In [120]: obj[1]
Out[120]: 1.0
In [121]: obj[2:4]
Out[121]:
c 2.0
d 3.0
dtype: float64
In [122]: obj[['b', 'a', 'd']]
Out[122]:
b 1.0
a 0.0
d 3.0
dtype: float64
In [123]: obj[[1, 3]]
Out[123]:
b 1.0
d 3.0
dtype: float64
In [124]: obj[obj < 2]
Out[124]:
a 0.0
b 1.0
dtype: float64
Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:
In [125]: obj['b':'c']
Out[125]:
b 1.0
c 2.0
dtype: float64
Setting using these methods modifies the corresponding section of the Series:
In [126]: obj['b':'c'] = 5
In [127]: obj
Out[127]:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [129]: data
Out[129]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [130]: data['two']
Out[130]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
In [131]: data[['three', 'one']]
Out[131]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12

Indexing like this has a few special cases. First, slicing or selecting data with a boolean
array:
In [132]: data[:2]
Out[132]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [133]: data[data['three'] > 5]
Out[133]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [134]: data < 5
Out[134]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [135]: data[data < 5] = 0
In [136]: data
Out[136]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Selection with loc and iloc


They enable you to select a subset of the rows and columns from a DataFrame with
NumPy-like notation using either axis labels (loc) or integers (iloc).

In [137]: data.loc['Colorado', ['two', 'three']]


Out[137]:
two 5
three 6
Name: Colorado, dtype: int64
In [138]: data.iloc[2, [3, 0, 1]]
Out[138]:
four 11
one 8
two 9
Name: Utah, dtype: int64
In [139]: data.iloc[2]
Out[139]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
In [140]: data.iloc[[1, 2], [3, 0, 1]]
Out[140]:
four one two
Colorado 7 0 5
Utah 11 8 9

In [141]: data.loc[:'Utah', 'two']


Out[141]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
In [142]: data.iloc[:, :3][data.three > 5]
Out[142]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14

Integer Indexes
ser = pd.Series(np.arange(3.))
ser
ser[-1]  # ambiguous: raises an error with an integer index
In this case, pandas could “fall back” on integer indexing, but it's difficult to do this in
general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but
inferring what the user wants (label-based indexing or position-based) is difficult:
In [144]: ser
Out[144]:
0 0.0
1 1.0
2 2.0
dtype: float64

In [145]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])


In [146]: ser2[-1]
Out[146]: 2.0

In [147]: ser[:1]
Out[147]:
0 0.0
dtype: float64
In [148]: ser.loc[:1]
Out[148]:
0 0.0
1 1.0
dtype: float64
In [149]: ser.iloc[:1]
Out[149]:
0 0.0
dtype: float64

Arithmetic and Data Alignment


In [150]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [151]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], .....: index=['a', 'c', 'e', 'f', 'g'])
In [152]: s1
Out[152]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
In [153]: s2
Out[153]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding these together yields:
In [154]: s1 + s2
Out[154]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64

In [155]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),


.....: index=['Ohio', 'Texas', 'Colorado'])
In [156]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [157]: df1
Out[157]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
In [158]: df2
Out[158]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0

Adding these together returns a DataFrame whose index and columns are the unions of
the ones in each DataFrame:
In [159]: df1 + df2
Out[159]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

In [160]: df1 = pd.DataFrame({'A': [1, 2]})


In [161]: df2 = pd.DataFrame({'B': [3, 4]})
In [162]: df1
Out[162]:
A
0 1
1 2
In [163]: df2
Out[163]:
B
0 3
1 4
In [164]: df1 - df2
Out[164]:
A B
0 NaN NaN
1 NaN NaN

Arithmetic methods with fill values,


In [165]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
.....: columns=list('abcd'))
In [166]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
.....: columns=list('abcde'))
In [167]: df2.loc[1, 'b'] = np.nan
In [168]: df1
Out[168]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [169]: df2
Out[169]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0

Adding these together results in NA values in the locations that don’t overlap:
In [170]: df1 + df2
Out[170]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Using the add method on df1, I pass df2 and an argument to fill_value:
In [171]: df1.add(df2, fill_value=0)
Out[171]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0

Each Series and DataFrame arithmetic method has a counterpart, starting with the
letter r, that has its arguments flipped. So these two statements are equivalent:
In [172]: 1 / df1
Out[172]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
In [173]: df1.rdiv(1)
Out[173]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:
In [174]: df1.reindex(columns=df2.columns, fill_value=0)
Out[174]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0

Operations between DataFrame and Series
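The notes leave this section without an example; a minimal sketch of the default
behavior (the Series index is matched on the DataFrame's columns, broadcasting down
the rows) with illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.).reshape(2, 3), columns=list('bde'))
series = frame.iloc[0]

# Default: match the Series index against the columns, broadcast down the rows
print(frame - series)

# To match on the row index and broadcast across columns, use a method + axis
col = frame['d']
print(frame.sub(col, axis='index'))
```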


Function Application and Mapping

Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame's apply method does exactly this (here frame is a DataFrame
of random numbers with columns ['b', 'd', 'e'] and index ['Utah', 'Ohio', 'Texas', 'Oregon']):
In [193]: f = lambda x: x.max() - x.min()
In [194]: frame.apply(f)
Out[194]:
b 1.802165
d 1.684034
e 2.689627
dtype: float64

If you pass axis='columns' to apply, the function will be invoked once per row instead:
In [195]: frame.apply(f, axis='columns')
Out[195]:
Utah 0.998382
Ohio 2.521511
Texas 0.676115
Oregon 2.542656
dtype: float64

In [196]: def f(x):


.....: return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [197]: frame.apply(f)
Out[197]:
b d e
min -0.555730 0.281746 -1.296221
max 1.246435 1.965781 1.393406

Sorting and Ranking


Ranking assigns ranks from one through the number of valid data points in an array.
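For example, rank breaks ties by assigning each group the mean rank by default:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2])

print(obj.rank())                # the two 7s tie for ranks 4 and 5 -> 4.5 each
print(obj.rank(method='first'))  # break ties by order of appearance
```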
Axis Indexes with Duplicate Labels
Summarizing and Computing Descriptive Statistics
Unique Values, Value Counts, and Membership
In [251]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
The first function is unique, which gives you an array of the unique values in a Series:
In [252]: uniques = obj.unique()
In [253]: uniques
Out[253]: array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are not necessarily returned in sorted order, but can be sorted after
the fact if needed (uniques.sort()). Relatedly, value_counts computes a Series containing
value frequencies:
In [254]: obj.value_counts()
Out[254]:
c 3
a 3
b 2
d 1
dtype: int64
value_counts is also available as a top-level pandas method that can be used with any
array or sequence:
In [255]: pd.value_counts(obj.values, sort=False)
Out[255]:
a 3
b 2
c 3
d 1
dtype: int64
isin performs a vectorized set membership check and can be useful in filtering a dataset
down to a subset of values in a Series or column in a DataFrame:
In [256]: obj
Out[256]:
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
In [257]: mask = obj.isin(['b', 'c'])
In [258]: mask
Out[258]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
In [259]: obj[mask]
Out[259]:
0 c
5 b
6 b
7 c
8 c
dtype: object

Related to isin is the Index.get_indexer method, which gives you an index array from an
array of possibly non-distinct values into another array of distinct values:
In [260]: to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
In [261]: unique_vals = pd.Series(['c', 'b', 'a'])
In [262]: pd.Index(unique_vals).get_indexer(to_match)
Out[262]: array([0, 2, 1, 1, 0, 2])

In [263]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],


.....: 'Qu2': [2, 3, 1, 2, 3],
.....: 'Qu3': [1, 5, 2, 4, 4]})
In [264]: data
Out[264]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
In [265]: result = data.apply(pd.value_counts).fillna(0)
In [266]: result
Out[266]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0

In the result above, the row labels are the distinct values occurring across all of the
columns, and each column holds the count of that value in the corresponding original
column. For example, in Qu1 the value 1 appears once, while 3 and 4 each appear twice.
