Data Analysis of Visualization: CHAPTER - 1 Preliminaries
1. Extensive Libraries: Python offers a rich ecosystem of libraries and tools that are
well-suited for data analysis. The most prominent library for data analysis in Python is
pandas, which provides data structures and functions for efficient data manipulation and
analysis. Additionally, there are libraries like NumPy for numerical computations,
Matplotlib and Seaborn for data visualization, and Scikit-Learn for machine learning.
2. Data Handling Capabilities: Python, with its pandas library, provides data structures
like DataFrames and Series that are designed for handling and manipulating tabular
data. These data structures make it easy to clean, reshape, and transform data for
analysis.
3. Versatility: Python is a versatile language that can be used for various tasks in data
analysis, including data cleaning, exploration, visualization, statistical analysis, and
machine learning. This versatility allows data analysts and data scientists to work within
a single language for their entire workflow.
4. Open Source Community: Python's open-source nature has led to a vast and active
community of developers who have created and continue to maintain a wide range of
data analysis libraries. This open-source ecosystem ensures that the tools available for
data analysis are constantly evolving and improving.
5. Accessibility: Python is known for its simple and readable syntax, making it
accessible to individuals from diverse backgrounds, including those who may not have a
strong programming background. This ease of use is especially beneficial for data
analysts who may not be primarily programmers.
6. Integration: Python can easily integrate with other programming languages and
tools, making it a versatile choice for data analysis projects. It can interface with
databases, web APIs, and other data sources, making it suitable for various data
collection and extraction tasks.
7. Strong Data Visualization: Python has excellent data visualization libraries such as
Matplotlib, Seaborn, and Plotly, which allow analysts to create informative and
aesthetically pleasing charts and graphs to communicate their findings effectively.
8. Active Development and Community Support: Python has a robust and active
development community, ensuring that it stays up-to-date with the latest developments
in data analysis and technology.
9. Reproducibility: Python's ecosystem includes tools like Jupyter notebooks, which
are widely used in data analysis. Jupyter notebooks facilitate reproducible research by
allowing analysts to document their analysis process and share it with others.
These reasons, among others, have made Python a popular and powerful choice for
data analysis, and it continues to be the go-to language for many data professionals
and researchers.
Despite these strengths, Python has some limitations:
1. GIL (Global Interpreter Lock): Python has a Global Interpreter Lock, which can limit
its ability to utilize multiple CPU cores effectively. This limitation can make Python less
suitable for parallel processing or multithreaded applications.
2. Resource Intensive: Python may consume more memory than some other
languages due to its dynamic typing and extensive standard libraries. This can be a
problem when dealing with large datasets or in resource-constrained environments.
3. Mobile App Development: If you are developing mobile apps for iOS or Android,
Python might not be the best choice. While there are frameworks like Kivy and
BeeWare, they are less popular compared to native languages like Swift (for iOS) and
Java/Kotlin (for Android).
4. Lack of Strong Static Typing: Python is dynamically typed, which can lead to
type-related errors only being discovered at runtime. In situations where type safety and
strict control are crucial, a statically typed language like Java or C# might be a better
choice.
5. Learning Curve: Python may not be the best choice if you are working in a field
where another language is more commonly used, or if your team already has expertise
in a different language. Learning a new language can be time-consuming.
6. Limited GUI Development: While there are GUI development libraries like Tkinter
and PyQt for Python, if your primary focus is building rich graphical user interfaces
(GUIs), languages like Java or C# may offer more extensive and platform-specific
options.
7. Specialized Domains: In some specialized fields, there might be languages better
suited to the specific needs of that domain. For example, R is a popular language for
statistics and data science, and MATLAB is widely used in engineering and scientific
computing.
It's important to consider the specific requirements of your project and the strengths and
weaknesses of Python and other programming languages before making a decision.
Often, the choice of language depends on the specific problem you are trying to solve
and the ecosystem and libraries available for that problem domain.
$ python
Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) [GCC 4.8.2
20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 5
>>> print(a)
5
IPython basics
Tab Completion
While entering expressions in the shell, pressing the Tab key will search the
namespace for any variables (objects, functions, etc.) matching the characters you have
typed so far:
In [1]: an_apple = 27
In [2]: an_example = 42
In [3]: an<Tab>
an_apple   and   an_example   any
You can also complete methods and attributes on any object after typing a period:
In [3]: b = [1, 2, 3]
In [4]: b.
b.append b.count b.insert b.reverse
b.clear b.extend b.pop b.sort
b.copy b.index b.remove
Introspection
Using a question mark (?) before or after a variable will display some general
information about the object:
In [8]: b = [1, 2, 3]
In [9]: b?
Type: list
String Form:[1, 2, 3]
Length: 3
Docstring: list() -> new empty list
list(iterable) -> new list initialized from iterable's items
If the object is a function or instance method, the docstring, if defined, will also be
shown.
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b
%paste takes whatever text is in the clipboard and executes it as a single block in the
shell:
In [17]: %paste
x = 5
y = 7
if x > 5:
    x += 1

    y = 8
## -- End pasted text --
%cpaste is similar, except that it gives you a special prompt for pasting code into:
In [18]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:x = 5
:y = 7
:if x > 5:
:    x += 1
:
:    y = 8
:--
With the %cpaste block, you have the freedom to paste as much code as you like
before executing it. You might decide to use %cpaste in order to look at the pasted code
before executing it. If you accidentally paste the wrong code, you can break out of the
%cpaste prompt by pressing Ctrl-C.
Magic functions can be used by default without the percent sign, as long as no vari‐
able is defined with the same name as the magic function in question. This feature is
called automagic and can be enabled or disabled with %automagic.
Matplotlib Integration
The %matplotlib magic function configures its integration with the IPython shell or
Jupyter notebook.
In the IPython shell, running %matplotlib sets up the integration so you can create
multiple plot windows without interfering with the console session:
In [26]: %matplotlib
Using matplotlib backend: Qt4Agg
Language Semantics
def add(a, b):
    return a + b

add(13, 2)
6. Attributes and methods: Python objects have both attributes and methods, and both
can be accessed using obj.attribute_name.
For example, you can verify that an object is iterable if it implements the iterator
protocol. For many objects, this means it has an __iter__ “magic method,” though an
alternative and better way to check is to try using the iter function:
import sum  # assumes a user-defined module sum.py that provides an add function
a = sum.add(13, 2)
print(a)
== operator: compares the values of the two operands, not their memory addresses.
Example:
a = [1, 2, 3, 4]
b = [1, 2, 3, 4]
print(a == b)
Output: True
is operator: compares the identities (memory addresses) of the two operands and
returns True only when both operands refer to the same object.
Example:
a = [1, 2, 3, 4]
b = [1, 2, 3, 4]
print(a is b)
Output: False  # the two lists are distinct objects at different addresses
Example:
a = [1, 2, 3, 4]
b = a
print(a is b)
Output: True
10. Scalar Types: Python along with its standard library has a small set of built-in
types for handling numerical data, strings, boolean (True or False) values, and
dates and time. These “single value” types are sometimes called scalar types.
11. Numeric Types: There are two primary numeric types in Python:
● Integer
● Floating point
Given a datetime instance, you can extract the equivalent date and time objects by
calling methods on the datetime of the same name:
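A minimal sketch (using an arbitrary timestamp):

```python
from datetime import datetime

# A datetime bundles a date and a time; extract each with the
# same-named methods date() and time()
dt = datetime(2011, 10, 29, 20, 30, 21)
d = dt.date()   # datetime.date(2011, 10, 29)
t = dt.time()   # datetime.time(20, 30, 21)
print(d, t)
```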
1. Tuple
A tuple is a fixed-length, immutable sequence of Python objects.
In [1]: tup = 4, 5, 6
In [2]: tup
Out[2]: (4, 5, 6)
While the objects stored in a tuple may be mutable themselves, once the tuple is
created it’s not possible to modify which object is stored in each slot:
If an object inside a tuple is mutable, such as a list, you can modify it in-place:
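A brief illustration of both points:

```python
tup = ('foo', [1, 2], True)

# Reassigning a slot fails: the tuple itself is immutable
try:
    tup[2] = False
except TypeError as e:
    print("cannot modify slot:", e)

# But a mutable object stored inside the tuple can be modified in place
tup[1].append(3)
print(tup)   # ('foo', [1, 2, 3], True)
```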
Unpacking tuples
If you try to assign to a tuple-like expression of variables, Python will attempt to unpack
the value on the right hand side of the equals sign:
In [15]: tup = (4, 5, 6)
In [16]: a, b, c = tup
In [17]: b
Out[17]: 5
Even sequences with nested tuples can be unpacked:
In [18]: tup = 4, 5, (6, 7)
In [19]: a, b, (c, d) = tup
In [20]: d
Out[20]: 7
2. List
In contrast with tuples, lists are variable-length and their contents can be modified
in-place. You can define them using square brackets [] or using the list type function:
In [36]: a_list = [2, 3, 7, None]
Sorting
In [61]: a = [7, 2, 5, 1, 3]
In [62]: a.sort()
In [63]: a
Out[63]: [1, 2, 3, 5, 7]
The bisect module implements binary search and insertion into a sorted list;
bisect.insort inserts an element while keeping the list sorted:
In [69]: import bisect
In [70]: c = [1, 2, 2, 2, 3, 4, 7]
In [71]: bisect.insort(c, 6)
In [72]: c
Out[72]: [1, 2, 2, 2, 3, 4, 6, 7]
The bisect module functions do not check whether the list is sorted, as doing so would
be computationally expensive. Thus, using them with an unsorted list will succeed
without error but may lead to incorrect results.
1. Enumerate
The enumerate function in Python is used to iterate over an iterable (such as a
list, tuple, or string) while keeping track of both the index and the value of each item. It
returns a tuple of the form (index, value). This can be very useful when you need to
know both the position and the value of items in a sequence. Here's an easy example to
illustrate how it works:
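The example (reconstructed here to match the output shown below):

```python
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']

# enumerate yields (index, value) pairs as we iterate
for index, fruit in enumerate(fruits):
    print(f"Index {index}: {fruit}")
```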
In this example:
We have a list called fruits containing several fruit names.
We use a for loop to iterate over the fruits list.
We apply the enumerate function to the fruits list, which returns an iterable of tuples,
where each tuple consists of an index and a value from the original list.
In each iteration, the index variable contains the index (starting from 0), and the fruit
variable contains the value (the name of a fruit).
The output of this code will be:
Index 0: apple
Index 1: banana
Index 2: cherry
Index 3: date
Index 4: elderberry
As you can see, the enumerate function simplifies the process of iterating over a list or
other iterables while keeping track of the position of each item. This is especially useful
in situations where you need to perform specific actions or access both the index and
the value of the items in the iterable.
2. Sorted
The sorted function returns a new sorted list from the elements of any sequence:
In [87]: sorted([7, 1, 2, 6, 0, 3, 2])
Out[87]: [0, 1, 2, 2, 3, 6, 7]
In [88]: sorted('horse race')
Out[88]: [' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']
The sorted function accepts the same arguments as the sort method on lists.
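For instance, both accept an optional key function (a small sketch):

```python
words = ['banana', 'Cherry', 'apple']

# Default ordering compares raw strings, so uppercase sorts first
print(sorted(words))                 # ['Cherry', 'apple', 'banana']

# A key function gives case-insensitive ordering
print(sorted(words, key=str.lower))  # ['apple', 'banana', 'Cherry']
```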
3. Zip
zip “pairs” up the elements of a number of lists, tuples, or other sequences to create a
list of tuples:
In [89]: seq1 = ['foo', 'bar', 'baz']
In [90]: seq2 = ['one', 'two', 'three']
In [91]: zipped = zip(seq1, seq2)
In [92]: list(zipped)
Out[92]: [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
zip can take an arbitrary number of sequences, and the number of elements it
produces is determined by the shortest sequence:
In [93]: seq3 = [False, True]
In [94]: list(zip(seq1, seq2, seq3))
Out[94]: [('foo', 'one', False), ('bar', 'two', True)]
4. Reversed
Reversed iterates over the elements of a sequence in reverse order:
In [100]: list(reversed(range(10)))
Out[100]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
5. Dictionary
More common names for the dictionary are hash map or associative array. It is a
flexibly sized collection of key-value pairs, where key and value are Python objects.
In [117]: d1 = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
In [118]: list(d1.values())
Out[118]: ['some value', [1, 2, 3, 4], 'an integer']
6. Set
A set is an unordered collection of unique elements.
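A set can be created with the set function or a set literal with curly braces; duplicates
are discarded:

```python
# set() removes duplicates from any iterable
s = set([2, 2, 2, 1, 3, 3])
print(s)                 # {1, 2, 3}

# The literal form is equivalent
print({1, 2, 3} == s)    # True
```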
List, Set, and Dict Comprehensions
List Comprehension:
numbers = [1, 2, 3, 4, 5]
squared_numbers = [x ** 2 for x in numbers]
print(squared_numbers) # Output: [1, 4, 9, 16, 25]
Set Comprehension:
numbers = [1, 2, 2, 3, 4, 4, 5]
unique_evens = {x for x in numbers if x % 2 == 0}
print(unique_evens) # Output: {2, 4}
Dictionary Comprehension:
numbers = [1, 2, 3, 4, 5]
squared_dict = {x: x ** 2 for x in numbers}
print(squared_dict) # Output: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
Functions
If you anticipate needing to repeat the same or very similar code more than once, it may
be worth writing a reusable function. Functions can also help make your code more
readable by giving a name to a group of Python statements.
def short_function(x):
return x * 2
equiv_anon = lambda x: x * 2
One reason lambda functions are called anonymous functions is that, unlike functions
declared with the def keyword, the function object itself is never given an explicit
__name__ attribute.
Currying:
Currying is the process of converting a function that takes multiple arguments into a
series of unary (single-argument) functions. Each of these unary functions takes one
argument and returns a new function that expects the next argument. This process
continues until all the arguments are supplied, and the final function returns the result.
def add(x):
def add_x(y):
return x + y
return add_x
# Usage
add_five = add(5)
result = add_five(3) # Calls add_x(3), which returns 5 + 3 = 8
In this example, the add function takes one argument x, and it returns another function
add_x that takes a single argument y. When we call add(5), it returns a function that can
add 5 to any number we pass to it
Partial argument application is a related concept. It allows you to fix a subset of the
arguments of a function, creating a new function with the remaining arguments. This is
useful for creating specialized functions based on a more general one.
from functools import partial

def multiply(x, y):
    return x * y

double = partial(multiply, 2)  # fix the first argument as 2

# Usage
result = double(5)  # Calls multiply(2, 5), which returns 2 * 5 = 10
In this example, we use functools.partial to create a new function double from the
multiply function by fixing the first argument as 2. This creates a specialized function
that doubles any number passed to it.
Generators
Generators in Python are a way to create iterable sequences of data without storing the
entire sequence in memory. They are especially useful when dealing with large datasets
or when you want to generate values on-the-fly. Generators use the yield keyword to
produce values one at a time and remember their state between calls. Here's an easy
example to help you understand how generators work:
def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

for number in count_up_to(5):
    print(number)
1
2
3
4
5
When you loop through the generator using a for loop, each time you encounter the
yield keyword, the function's state is saved, and the yielded value is returned. The
function can then continue from where it left off. This allows you to generate and
process values one at a time, which is memory-efficient.
itertools module
The standard library itertools module has a collection of generators for many common
data algorithms.
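For instance, itertools.groupby takes a sequence and a key function and groups
consecutive elements that share the same key (a small sketch):

```python
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# groupby only groups *consecutive* keys, so 'Albert' forms its own group
for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))
```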
Errors and Exception Handling
try:
    numerator = int(input("Enter the numerator: "))
    denominator = int(input("Enter the denominator: "))
    result = numerator / denominator
    print("Result:", result)
except ZeroDivisionError:
    print("Error: Division by zero is not allowed.")
except ValueError:
    print("Error: Please enter valid integer values.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
CH-4 NumPy Basics: Arrays and Vectorized Computation
A common benchmark compares the execution times of two equivalent operations in
Python: doubling a NumPy array versus doubling each element of a plain Python list.
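A rough sketch of such a comparison (exact timings vary by machine, so no expected
numbers are shown):

```python
import time
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

start = time.perf_counter()
arr2 = my_arr * 2                 # vectorized: one C-level loop
numpy_seconds = time.perf_counter() - start

start = time.perf_counter()
list2 = [x * 2 for x in my_list]  # interpreted Python loop
list_seconds = time.perf_counter() - start

print(f"NumPy: {numpy_seconds:.6f}s  list: {list_seconds:.6f}s")
```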
In summary, NumPy is designed for efficient numerical operations, and it's highly
recommended for tasks involving large datasets and numerical computations. Regular
Python lists are more versatile but are generally slower for such operations.
1. Array:
● An "array" is a generic term referring to a data structure that stores a collection of
elements. Arrays can be found in various programming languages, but the term is
not specific to Python or NumPy.
● Arrays can be of different types, including lists, tuples, or built-in arrays like those
in Python's `array` module.
● Arrays can store elements of mixed data types, and they don't necessarily
support advanced operations like element-wise vectorization.
Example:
my_array = [1, 2, 3, 4, 5]
Caveat: Arrays in different programming languages or modules may have their own
characteristics and limitations, so the behavior of an "array" can vary.
2.NumPy Array:
● A "NumPy array" specifically refers to the array-like data structure provided by the
NumPy library.
● NumPy arrays are homogeneous, meaning they contain elements of the same
data type.
● They are highly efficient for numerical and scientific computations due to
optimized memory handling and support for vectorized operations.
Example:
import numpy as np
my_numpy_array = np.array([1, 2, 3, 4, 5])
Caveat: NumPy arrays are well-suited for numerical operations, but they may not be
the best choice when you need heterogeneous data structures or when compatibility
with non-NumPy code is required.
3.ndarray:
● "ndarray" stands for "n-dimensional array" and is synonymous with "NumPy array."
It's a key data structure in NumPy, designed for handling multi-dimensional data.
● An ndarray can have any number of dimensions, making it suitable for working
with matrices, tensors, and higher-dimensional data.
● The term "ndarray" emphasizes the multi-dimensional aspect of NumPy arrays.
Example:
import numpy as np
my_ndarray = np.array([[1, 2, 3], [4, 5, 6]])
Caveat: The term "ndarray" is often used interchangeably with "NumPy array" within
the context of NumPy. There's no significant practical difference between the two.
Creating ndarrays
In [19]: data1 = [6, 7.5, 8, 0, 1]
In [20]: arr1 = np.array(data1)
In [21]: arr1
Out[21]: array([ 6. , 7.5, 8. , 0. , 1. ])
Nested sequences, like a list of equal-length lists, will be converted into a
multidimensional array:
In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [23]: arr2 = np.array(data2)
In [24]: arr2
Out[24]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape
inferred from the data. We can confirm this by inspecting the ndim and shape attributes:
It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it
may return uninitialized “garbage” values
Data Types for ndarrays
You can explicitly convert or cast an array from one dtype to another using ndarray’s
astype method
Calling astype always creates a new array (a copy of the data), even if the new dtype is
the same as the old dtype.
Now, when I change values in arr_slice, the mutations are reflected in the original array
arr:
If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly
copy the array—for example, arr[5:8].copy().
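A small sketch of the view-versus-copy behavior described above:

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[5:8]        # a view onto arr, not a copy
arr_slice[1] = 12345        # mutates the original array
print(arr)                  # position 6 now holds 12345

copied = arr[5:8].copy()    # an explicit copy
copied[0] = -1              # does not affect arr
print(arr[5])
```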
Indexing with slices
Boolean Indexing
The ~ operator can be useful when you want to invert a general condition:
Selecting data from an array by boolean indexing always creates a copy of the data,
even if the returned array is unchanged. The Python keywords and and or do not work
with boolean arrays. Use & (and) and | (or) instead.
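These rules can be sketched as follows:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
data = np.arange(8).reshape(4, 2)

# Combine conditions with | and &, never the keywords `or`/`and`
mask = (names == 'Bob') | (names == 'Will')
print(data[mask])                # rows 0, 2, 3 (always a copy)

# ~ inverts a condition
print(data[~(names == 'Bob')])   # rows 1, 2
```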
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
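A small sketch:

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)

# Passing a list of integers selects rows in the given order
print(arr[[4, 3, 0, 6]])

# Negative indices select rows from the end
print(arr[[-3, -5]])
```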
Transposing Arrays and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the
underlying data without copying anything. Arrays have the transpose method and also
the special T attribute:
Universal Functions: Fast Element-Wise Array Functions
A universal function, or ufunc, is a function that performs element-wise operations on
data in ndarrays. You can think of them as fast vectorized wrappers for simple functions
that take one or more scalar values and produce one or more scalar results.
Ufuncs that transform a single array element-wise, like sqrt or exp, are referred to as
unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs)
and return a single array as the result:
Ufuncs accept an optional out argument that allows them to operate in-place on arrays:
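A small sketch of unary and binary ufuncs and the out argument:

```python
import numpy as np

arr = np.arange(5, dtype=float)

print(np.sqrt(arr))                 # unary ufunc: element-wise square root
print(np.maximum(arr, arr[::-1]))   # binary ufunc: element-wise maximum

out = np.empty_like(arr)
np.sqrt(arr, out=out)               # write the result into an existing array
print(out)
```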
Array-Oriented Programming with Arrays
Using NumPy arrays enables you to express many kinds of data processing tasks as
concise array expressions that might otherwise require writing loops. This practice of
replacing explicit loops with array expressions is commonly referred to as
vectorization. In general, vectorized array operations will often be one or two (or more)
orders of magnitude faster than their pure Python equivalents, with the biggest impact
in any kind of numerical computations.
Mathematical and Statistical Methods
A set of mathematical functions that compute statistics about an entire array or about
the data along an axis are accessible as methods of the array class. You can use
aggregations (often called reductions) like sum, mean, and std (standard deviation)
either by calling the array instance method or using the top-level NumPy function.
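A small sketch of both calling styles and the axis argument:

```python
import numpy as np

arr = np.arange(12, dtype=float).reshape(3, 4)

print(arr.mean())          # grand mean over all elements
print(arr.mean(axis=1))    # one mean per row
print(arr.sum(axis=0))     # one sum per column
print(np.std(arr))         # same statistic via the top-level function
```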
Methods for Boolean Arrays
Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus,
sum is often used as a means of counting True values in a boolean array:
There are two additional methods, any and all, useful especially for boolean arrays. any
tests whether one or more values in an array is True, while all checks if every value is
True
These methods also work with non-boolean arrays, where non-zero elements evaluate
to True.
Sorting
The top-level method np.sort returns a sorted copy of an array instead of modifying the
array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it
and select the value at a particular rank
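A sketch of that quick-and-dirty quantile trick (seeded here so the result is
reproducible; the seed is an assumption, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded generator (an assumption)
large_arr = rng.standard_normal(1000)
large_arr.sort()                     # the ndarray method sorts in-place

quantile_5 = large_arr[int(0.05 * len(large_arr))]
print(quantile_5)                    # roughly the 5% quantile

sorted_copy = np.sort(large_arr[::-1])  # np.sort returns a sorted copy instead
```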
Unique and Other Set Logic
Another function, np.in1d, tests membership of the values in one array in another,
returning a boolean array:
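A small sketch; newer NumPy releases prefer the spelling np.isin, which behaves the
same as np.in1d on one-dimensional input:

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])
mask = np.isin(values, [2, 3, 6])   # np.in1d is the older spelling
print(mask)
print(values[mask])                 # the matching elements: [6 3 2 6]
```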
When loading an .npz file, you get back a dict-like object that loads the individual arrays
lazily:
Linear Algebra
The @ symbol (as of Python 3.5) also works as an infix operator that performs matrix
multiplication:
numpy.linalg has a standard set of matrix decompositions and things like inverse and
determinant.
The expression X.T.dot(X) computes the dot product of X.T (the transpose of X) with X.
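A small sketch of both spellings of matrix multiplication:

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])
y = np.array([[ 6., 23.],
              [-1.,  7.],
              [ 8.,  9.]])

print(x @ y)              # same as x.dot(y) or np.dot(x, y)
print(x.T.dot(x).shape)   # (3, 3): transpose times original
```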
CH-5 Getting Started with pandas
How to import pandas:
In [1]: import pandas as pd
In [2]: from pandas import Series, DataFrame
The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default one
consisting of the integers 0 through N - 1 (where N is the length of the data) is created.
In [11]: obj = pd.Series([4, 7, -5, 3])
In [13]: obj.values
Out[13]: array([ 4, 7, -5, 3])
In [14]: obj.index # like range(4)
Out[14]: RangeIndex(start=0, stop=4, step=1)
Using NumPy functions or NumPy-like operations, such as filtering with a boolean array,
scalar multiplication, or applying math functions, will preserve the index-value link:
When you are only passing a dict, the index in the resulting Series will have the dict’s
keys in sorted order. You can override this by passing the dict keys in the order you want
them to appear in the resulting Series:
In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [30]: obj4 = pd.Series(sdata, index=states)
In [31]: obj4
Out[31]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In pandas, NaN (not a number) represents missing values, and the isnull and notnull
functions are used to detect missing data.
In [32]: pd.isnull(obj4)
Out[32]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [33]: pd.notnull(obj4)
Out[33]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
Series also has these as instance methods:
In [34]: obj4.isnull()
Out[34]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:
In [35]: obj3
Out[35]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [36]: obj4
Out[36]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [37]: obj3 + obj4
Out[37]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:
In [38]: obj4.name = 'population'
In [39]: obj4.index.name = 'state'
In [40]: obj4
Out[40]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and column index; it can be thought of as a dict of Series
all sharing the same index. The data is stored as one or more two-dimensional blocks
rather than a list, dict, or some other collection of one-dimensional arrays.
The resulting DataFrame will have its index assigned automatically as with Series, and
the columns are placed in sorted order:
In [43]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
   ....:         'year': [2000, 2001, 2002, 2001, 2002, 2003],
   ....:         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [44]: frame = pd.DataFrame(data)
In [45]: frame
Out[45]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003
For large DataFrames, the head method selects only the first five rows:
In [46]: frame.head()
Out[46]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
If you specify a sequence of columns, the DataFrame’s columns will be arranged in that
order:
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
If you pass a column that isn’t contained in the dict, it will appear with missing values in
the result:
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
....: index=['one', 'two', 'three', 'four',
....: 'five', 'six'])
In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
Rows can also be retrieved by position or name with the special loc attribute (much
more on this later):
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
Columns can be modified by assignment. For example, the empty 'debt' column could
be assigned a scalar value or an array of values:
In [54]: frame2['debt'] = 16.5
In [55]: frame2
Out[55]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
When you are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the
DataFrame’s index, inserting missing values in any holes:
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [59]: frame2['debt'] = val
In [60]: frame2
Out[60]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict. As an example of del, I first add a new column of boolean
values where the state column equals 'Ohio':
In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2
Out[62]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
You can transpose the DataFrame (swap rows and columns) with similar syntax to a
NumPy array:
In [68]: frame3.T
Out[68]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
As with Series, the values attribute returns the data contained in the DataFrame as a
two-dimensional ndarray:
In [74]: frame3.values
Out[74]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
In [75]: frame2.values
Out[75]: array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
Index Object
pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names).
In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [77]: index = obj.index
In [78]: index
Out[78]: Index(['a', 'b', 'c'], dtype='object')
In [79]: index[1:]
Out[79]: Index(['b', 'c'], dtype='object')
Index objects are immutable and thus can’t be modified by the user:
index[1] = 'd' # TypeError
Immutability makes it safer to share Index objects among data structures:
In [80]: labels = pd.Index(np.arange(3))
In [81]: labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')
In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [83]: obj2
Out[83]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [84]: obj2.index is labels
Out[84]: True
In addition to being array-like, an Index also behaves like a fixed-size set:
In [85]: frame3
Out[85]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [86]: frame3.columns
Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [87]: 'Ohio' in frame3.columns
Out[87]: True
In [88]: 2003 in frame3.index
Out[88]: False
Selections with duplicate labels will select all occurrences of that label.
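A small sketch of selection with duplicate labels:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj['a'])   # a duplicated label returns a Series with both rows
print(obj['c'])   # a unique label returns a scalar
```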
Essential Functionality
Reindexing
An important method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index.
Consider an example:
In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [92]: obj
Out[92]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, intro‐
ducing missing values if any index values were not already present:
In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [94]: obj2
Out[94]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as ffill, which forward-fills the values:
In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [96]: obj3
Out[96]:
0 blue
2 purple
4 yellow
dtype: object
In [97]: obj3.reindex(range(6), method='ffill')
Out[97]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
With DataFrame, reindex can alter either the (row) index, columns, or both. When passed
only a sequence, it reindexes the rows in the result:
In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [99]: frame
Out[99]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [101]: frame2
Out[101]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
You can reindex more succinctly by label-indexing with loc, and many users prefer to use
it exclusively. Here states is a new column ordering, ['Texas', 'Utah', 'California']:
In [104]: frame.loc[['a', 'b', 'c', 'd'], states]
Out[104]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
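The data DataFrame used in the drop and indexing examples that follow is not constructed in this excerpt; a definition consistent with the outputs shown (an assumption) would be:

```python
import numpy as np
import pandas as pd

# Reconstructed setup (not shown in the excerpt): a 4x4 integer DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
```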
You can drop values from the columns by passing axis=1 or axis='columns':
In [113]: data.drop('two', axis=1)
Out[113]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [114]: data.drop(['two', 'four'], axis='columns')
Out[114]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
Many functions, like drop, that modify the size or shape of a Series or DataFrame can
manipulate an object in place without returning a new object. Be careful with inplace,
as it destroys any data that is dropped:
In [115]: obj.drop('c', inplace=True)
In [116]: obj
Out[116]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Indexing into a DataFrame with [] has a few special cases. First, slicing with an integer
range or selecting data with a boolean array:
In [132]: data[:2]
Out[132]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [133]: data[data['three'] > 5]
Out[133]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [134]: data < 5
Out[134]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [135]: data[data < 5] = 0
In [136]: data
Out[136]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Integer Indexes
ser = pd.Series(np.arange(3.))
ser
ser[-1]  # ambiguous with an integer index; this raises an error
In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in
general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but
inferring what the user wants (label-based indexing or position-based) is difficult:
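As the paragraph above notes, position-based access is unambiguous; a minimal sketch using iloc:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# ser[-1] would be ambiguous here (is -1 a label or a position?),
# but iloc is always positional, so this is unambiguous
print(ser.iloc[-1])  # 2.0
```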
In [144]: ser
Out[144]:
0 0.0
1 1.0
2 2.0
dtype: float64
In [147]: ser[:1]
Out[147]:
0 0.0
dtype: float64
In [148]: ser.loc[:1]
Out[148]:
0 0.0
1 1.0
dtype: float64
In [149]: ser.iloc[:1]
Out[149]:
0 0.0
dtype: float64
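The df1 and df2 being added in the next example are not defined in this excerpt; a construction consistent with the output (an assumption) would be:

```python
import numpy as np
import pandas as pd

# Reconstructed operands: only labels present in BOTH objects produce
# numbers in the sum; everything else becomes NaN
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
```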
Adding these together returns a DataFrame whose index and columns are the unions of
the ones in each DataFrame:
In [159]: df1 + df2
Out[159]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
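Likewise, the second pair of df1 and df2 is not shown; a reconstruction consistent with both the plain + output and the later add(..., fill_value=0) result would be:

```python
import numpy as np
import pandas as pd

# Reconstructed operands sharing columns a-d; df2 has an extra column e
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan  # one missing value inside the overlap
```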
Adding these together results in NA values in the locations that don’t overlap:
In [170]: df1 + df2
Out[170]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Using the add method on df1, I pass df2 and an argument to fill_value:
In [171]: df1.add(df2, fill_value=0)
Out[171]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
pandas offers a full set of Series and DataFrame methods for arithmetic. Each of them has a
counterpart, starting with the letter r, that has its arguments reversed. So these two
statements are equivalent:
In [172]: 1 / df1
Out[172]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
In [173]: df1.rdiv(1)
Out[173]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:
In [174]: df1.reindex(columns=df2.columns, fill_value=0)
Out[174]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
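The frame and f used with apply below are not defined in this excerpt; one plausible setup (an assumption; the data is random, so the exact numbers will differ) is a 4 x 3 DataFrame of normal draws and a function returning the range of each row:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: random data over the same row labels as the output
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def f(x):
    # Range (max minus min) of whatever Series apply passes in
    return x.max() - x.min()

print(frame.apply(f, axis='columns'))  # one value per row
```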
If you pass axis='columns' to apply, the function will be invoked once per row instead:
In [195]: frame.apply(f, axis='columns')
Out[195]:
Utah 0.998382
Ohio 2.521511
Texas 0.676115
Oregon 2.542656
dtype: float64
Related to isin is the Index.get_indexer method, which gives you an index array from an
array of possibly non-distinct values into another array of distinct values:
In [260]: to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
In [261]: unique_vals = pd.Series(['c', 'b', 'a'])
In [262]: pd.Index(unique_vals).get_indexer(to_match)
Out[262]: array([0, 2, 1, 1, 0, 2])
The per-column value counts above read as value -> count:
Qu1: 1 -> 1, 3 -> 2, 4 -> 2
Qu2: 1 -> 1, 2 -> 2, 3 -> 2
Qu3: 1 -> 1, 2 -> 2, 4 -> 2, 5 -> 1
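Counts of this shape are what apply combined with value_counts produces; a hypothetical input (not the original data) illustrating the pattern:

```python
import pandas as pd

# Hypothetical survey-style data
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count how often each value occurs in each column; values absent
# from a column become NaN, which fillna(0) replaces with zero
result = data.apply(lambda s: s.value_counts()).fillna(0)
print(result)
```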