Dsintro RST
.. _dsintro:
{{ header }}
************************
Intro to data structures
************************
.. ipython:: python
import numpy as np
import pandas as pd
We'll give a brief intro to the data structures, then consider all of the broad
categories of functionality and methods in separate sections.
.. _basics.series:
Series
------
.. code-block:: python
s = pd.Series(data, index=index)
Here, ``data`` can be many different things:
* a Python dict
* an ndarray
* a scalar value (like 5)
The passed **index** is a list of axis labels. This separates into a few
cases depending on what **data** is:
**From ndarray**
.. ipython:: python
pd.Series(np.random.randn(5))
.. note::
**From dict**
.. ipython:: python
.. ipython:: python
.. note::
NaN (not a number) is the standard missing data marker used in pandas.
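The dict and scalar cases can be sketched as follows. The labels and values here are illustrative assumptions, not data from this guide. The point is that an explicit index pulls values out by label, and any label absent from the dict comes back as NaN.

```python
import numpy as np
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}  # illustrative data only

# Without an index, the Series follows the dict's insertion order.
from_dict = pd.Series(d)

# With an index, values are looked up by label; "d" is absent, so it is NaN.
reindexed = pd.Series(d, index=["b", "c", "d", "a"])

# A scalar is repeated to match the length of the index.
from_scalar = pd.Series(5.0, index=["a", "b", "c"])

print(reindexed)
```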
.. ipython:: python
Series is ndarray-like
~~~~~~~~~~~~~~~~~~~~~~
.. ipython:: python
s.iloc[0]
s.iloc[:3]
s[s > s.median()]
s.iloc[[4, 3, 1]]
np.exp(s)
.. note::
s.dtype
.. ipython:: python
s.array
Accessing the array can be useful when you need to do some operation without the
index (to disable :ref:`automatic alignment <dsintro.alignment>`, for example).
.. ipython:: python
s.to_numpy()
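As a rough sketch of the difference between the two accessors, assuming a small float Series built just for illustration: ``Series.array`` returns a pandas extension array, while ``Series.to_numpy()`` returns a plain ndarray that carries no labels.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])  # illustrative data

backing = s.array    # a pandas ExtensionArray wrapping the data
raw = s.to_numpy()   # a plain NumPy ndarray

# Working on the ndarray is purely positional: no label alignment happens.
reversed_sum = raw + raw[::-1]
print(type(backing).__name__, type(raw).__name__, reversed_sum)
```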
Series is dict-like
~~~~~~~~~~~~~~~~~~~
A :class:`Series` is also like a fixed-size dict in that you can get and set
values by index label:
.. ipython:: python
s["a"]
s["e"] = 12.0
s
"e" in s
"f" in s
.. ipython:: python
:okexcept:
s["f"]
Using the :meth:`Series.get` method, a missing label returns ``None`` or the
specified default:
.. ipython:: python
s.get("f")
s.get("f", np.nan)
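The two lookup styles can be contrasted in one place. This is a minimal sketch with made-up labels: indexing with a missing label raises ``KeyError``, whereas ``Series.get`` falls back to ``None`` or to the default you supply.

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])  # illustrative data

# Bracket lookup raises for a label that is not in the index.
try:
    s["z"]
    found = True
except KeyError:
    found = False

missing = s.get("z")          # no default given
fallback = s.get("z", -1.0)   # explicit default
print(found, missing, fallback)
```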
When working with raw NumPy arrays, looping through value-by-value is usually
not necessary. The same is true when working with :class:`Series` in pandas.
:class:`Series` can also be passed into most NumPy methods expecting an ndarray.
.. ipython:: python
s + s
s * 2
np.exp(s)
.. ipython:: python
s.iloc[1:] + s.iloc[:-1]
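To make the alignment visible, here is a small sketch with explicit labels (the data is an assumption for illustration). The result index is the union of both operands' labels, and a label present in only one operand yields NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# s.iloc[1:] has labels b, c, d; s.iloc[:-1] has labels a, b, c.
total = s.iloc[1:] + s.iloc[:-1]

# "a" and "d" each appear in only one operand, so they are NaN.
cleaned = total.dropna()
print(total)
```

Dropping the unmatched labels with ``dropna`` is one option; whether that is appropriate depends on whether the missing labels carry meaning in your data.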
.. note::
Name attribute
~~~~~~~~~~~~~~
.. _dsintro.name_attribute:
.. ipython:: python
s = pd.Series(np.random.randn(5), name="something")
s
s.name
.. ipython:: python
s2 = s.rename("different")
s2.name
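A short sketch of how the name travels with the data, using illustrative values: ``rename`` returns a new object and leaves the original name untouched, and the name becomes the column label when the Series is turned into a frame.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")  # illustrative data
s2 = s.rename("different")

# The name is used as the column label in the resulting DataFrame.
frame = s2.to_frame()
print(s.name, s2.name, list(frame.columns))
```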
.. _basics.dataframe:
DataFrame
---------
Along with the data, you can optionally pass **index** (row labels) and
**columns** (column labels) arguments. If you pass an index and / or columns,
you are guaranteeing the index and / or columns of the resulting
DataFrame. Thus, a dict of Series plus a specific index will discard all data
not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data
based on common sense rules.
The resulting **index** will be the **union** of the indexes of the various
Series. If there are any nested dicts, these will first be converted to
Series. If no columns are passed, the columns will be the ordered list of dict
keys.
.. ipython:: python
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df
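The effect of passing ``index`` and ``columns`` alongside a dict of Series can be sketched like this. The dict mirrors the one used above; the label ``"three"`` is a deliberately absent column chosen for illustration.

```python
import numpy as np
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# Only the requested rows are kept, in the requested order.
subset = pd.DataFrame(d, index=["d", "b", "a"])

# Requested columns override the dict keys; "three" has no data, so it is all NaN.
relabeled = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(subset)
print(relabeled)
```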
The row and column labels can be accessed respectively by accessing the
**index** and **columns** attributes:
.. note::
When a particular set of columns is passed along with a dict of data, the
passed columns override the keys in the dict.
.. ipython:: python
df.index
df.columns
All ndarrays must share the same length. If an index is passed, it must
also be the same length as the arrays. If no index is passed, the
result will be ``range(n)``, where ``n`` is the array length.
.. ipython:: python
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)
pd.DataFrame(d, index=["a", "b", "c", "d"])
.. ipython:: python
pd.DataFrame(data)
pd.DataFrame(data, index=["first", "second"])
pd.DataFrame(data, columns=["C", "A", "B"])
.. note::
.. _basics.dataframe.from_list_of_dicts:
From a list of dicts
~~~~~~~~~~~~~~~~~~~~
.. ipython:: python
.. _basics.dataframe.from_dict_of_tuples:
.. ipython:: python
pd.DataFrame(
{
("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
}
)
.. _basics.dataframe.from_series:
From a Series
~~~~~~~~~~~~~
The result will be a DataFrame with the same index as the input Series, and
with one column whose name is the original name of the Series (only if no other
column name provided).
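A minimal sketch of this rule, assuming a small named Series created for the example. ``Series.to_frame`` is used here to show one way of supplying a different column name.

```python
import pandas as pd

ser = pd.Series(range(3), index=list("abc"), name="ser")  # illustrative data

# The Series name becomes the single column label; the index is preserved.
named = pd.DataFrame(ser)

# to_frame accepts an explicit column name instead.
renamed = ser.to_frame(name="other")
print(named)
```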
.. ipython:: python
.. _basics.dataframe.from_list_namedtuples:
The field names of the first ``namedtuple`` in the list determine the columns
of the :class:`DataFrame`. The remaining namedtuples (or tuples) are simply
unpacked and their values are fed into the rows of the :class:`DataFrame`. If
any of those
tuples is shorter than the first ``namedtuple`` then the later columns in the
corresponding row are marked as missing values. If any are longer than the
first ``namedtuple``, a ``ValueError`` is raised.
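The shorter-tuple case can be sketched as follows. ``Point`` and ``Point3D`` are hypothetical names defined only for this example.

```python
from collections import namedtuple

import pandas as pd

Point = namedtuple("Point", "x y")
Point3D = namedtuple("Point3D", "x y z")

# The first element has three fields, so the frame gets columns x, y, z.
# The last element is shorter, so its "z" entry is filled with a missing value.
df = pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])
print(df)
```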
.. ipython:: python
from collections import namedtuple
.. _basics.dataframe.from_list_dataclasses:
Note that all values in the list should be dataclasses; mixing
types in the list results in a ``TypeError``.
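A minimal sketch of the homogeneous case, where ``Point`` is a hypothetical dataclass built with ``dataclasses.make_dataclass`` purely for illustration:

```python
from dataclasses import make_dataclass

import pandas as pd

Point = make_dataclass("Point", [("x", int), ("y", int)])

# Every element is a Point, so the field names become the columns.
df = pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
print(df)
```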
.. ipython:: python
**Missing data**
Alternate constructors
~~~~~~~~~~~~~~~~~~~~~~
.. _basics.dataframe.from_dict:
**DataFrame.from_dict**
.. ipython:: python
If you pass ``orient='index'``, the keys will be the row labels. In this
case, you can also pass the desired column names:
.. ipython:: python
pd.DataFrame.from_dict(
dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
orient="index",
columns=["one", "two", "three"],
)
.. _basics.dataframe.from_records:
**DataFrame.from_records**
.. ipython:: python
data
pd.DataFrame.from_records(data, index="C")
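For a self-contained illustration, the same call can be sketched with a list of tuples instead of a structured array. The records and column names below are assumptions made for this example; ``index="C"`` moves that field out of the columns and into the row labels.

```python
import pandas as pd

records = [(1, 2.0, "Hello"), (2, 3.0, "World")]  # illustrative records

# Name the fields with ``columns``, then promote field "C" to the index.
df = pd.DataFrame.from_records(records, columns=["A", "B", "C"], index="C")
print(df)
```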
.. _basics.dataframe.sel_add_del:
.. ipython:: python
df["one"]
df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2
df
.. ipython:: python
del df["two"]
three = df.pop("three")
df
.. ipython:: python
df["foo"] = "bar"
df
When inserting a :class:`Series` that does not have the same index as
the :class:`DataFrame`, it will be conformed to the DataFrame's index:
.. ipython:: python
df["one_trunc"] = df["one"][:2]
df
You can insert raw ndarrays but their length must match the length of the
DataFrame's index.
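The length requirement can be sketched as below, using a small frame and the column names ``"arr"`` and ``"bad"`` as illustrative assumptions. An array of the wrong length is rejected with a ``ValueError`` and no column is added.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Same length as the index: assigned positionally, no alignment involved.
df["arr"] = np.arange(3)

# Different length: pandas refuses the assignment.
try:
    df["bad"] = np.arange(5)
    raised = False
except ValueError:
    raised = True
print(df, raised)
```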
.. ipython:: python
.. _dsintro.chained_assignment:
Inspired by `dplyr's
<https://dplyr.tidyverse.org/reference/mutate.html>`__
``mutate`` verb, DataFrame has an :meth:`~pandas.DataFrame.assign`
method that allows you to easily create new columns that are potentially
derived from existing columns.
.. ipython:: python
iris = pd.read_csv("data/iris.data")
iris.head()
iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"]).head()
.. ipython:: python
.. ipython:: python
@savefig basics_assign.png
(
iris.query("SepalLength > 5")
.assign(
SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
PetalRatio=lambda x: x.PetalWidth / x.PetalLength,
)
.plot(kind="scatter", x="SepalRatio", y="PetalRatio")
)
.. ipython:: python
In the second expression, ``x['C']`` will refer to the newly created column,
which is equal to ``dfa['A'] + dfa['B']``.
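Dependent assignment can be sketched as follows, assuming a hypothetical frame ``dfa`` with integer columns ``A`` and ``B``. Keyword arguments to ``assign`` are evaluated in order, so a later callable can see a column created by an earlier one.

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  # illustrative data

# "C" is created first; the callable for "D" then reads x["C"].
out = dfa.assign(
    C=lambda x: x["A"] + x["B"],
    D=lambda x: x["A"] + x["C"],
)
print(out)
```

Note that ``assign`` returns a new frame; ``dfa`` itself is left without ``C`` and ``D``.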
Indexing / selection
~~~~~~~~~~~~~~~~~~~~
The basics of indexing are as follows:
.. csv-table::
:header: "Operation", "Syntax", "Result"
:widths: 30, 20, 10
Row selection, for example, returns a :class:`Series` whose index is the
columns of the :class:`DataFrame`:
.. ipython:: python
df.loc["b"]
df.iloc[2]
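The basic selection forms can be compared side by side in a short sketch. The frame below is an illustrative assumption; the comments note the kind of object each form returns.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

col = df["one"]               # select a column -> Series
row_by_label = df.loc["b"]    # select a row by label -> Series
row_by_pos = df.iloc[2]       # select a row by integer position -> Series
sliced = df[0:2]              # slice rows -> DataFrame
masked = df[df["one"] > 1]    # select rows by boolean vector -> DataFrame
print(row_by_label)
```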
.. ipython:: python
.. ipython:: python
df - df.iloc[0]
For explicit control over the matching and broadcasting behavior, see the
section on :ref:`flexible binary operations <basics.binop>`.
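As a sketch of what explicit control looks like, the ``DataFrame.sub`` method takes an ``axis`` argument that chooses which labels the Series is matched against. The frame here is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# Default operator behavior: the Series index is matched to the columns,
# so the first row is subtracted from every row.
by_row = df - df.iloc[0]
same = df.sub(df.iloc[0], axis="columns")

# Matching on the row index instead subtracts column "x" from every column.
by_col = df.sub(df["x"], axis="index")
print(by_col)
```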
.. ipython:: python
df * 5 + 2
1 / df
df ** 4
.. _dsintro.boolean:
.. ipython:: python
Transposing
~~~~~~~~~~~
.. ipython:: python
.. _dsintro.numpy_interop:
.. ipython:: python
np.exp(df)
np.asarray(df)
.. ipython:: python
When multiple :class:`Series` are passed to a ufunc, they are aligned before
performing the operation.
Like other parts of the library, pandas will automatically align labeled inputs
as part of a ufunc with multiple inputs. For example, using :meth:`numpy.remainder`
on two :class:`Series` with differently ordered labels will align before the
operation.
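A minimal sketch of that alignment, with two small Series whose labels match but are ordered differently (the values are illustrative). The remainder is computed label by label, not position by position.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])

# Aligned by label: a -> 1 % 3, b -> 2 % 1, c -> 3 % 5
result = np.remainder(ser1, ser2)
print(result)
```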
.. ipython:: python
As usual, the union of the two indices is taken, and non-overlapping values are
filled with missing values.
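The union behavior can be sketched with two Series that only partly overlap. The labels and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser3 = pd.Series([2, 4, 6], index=["b", "c", "d"])

# "a" exists only in ser1 and "d" only in ser3, so both come out as NaN.
result = np.remainder(ser1, ser3)
print(result)
```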
.. ipython:: python
.. ipython:: python
np.maximum(ser, idx)
Console display
~~~~~~~~~~~~~~~
.. ipython:: python
:suppress:
.. ipython:: python
baseball = pd.read_csv("data/baseball.csv")
print(baseball)
baseball.info()
.. ipython:: python
:suppress:
:okwarning:
# restore GlobalPrintConfig
pd.reset_option(r"^display\.")
.. ipython:: python
print(baseball.iloc[-20:, :12].to_string())
.. ipython:: python
pd.DataFrame(np.random.randn(3, 12))
You can change how much to print on a single row by setting the ``display.width``
option:
.. ipython:: python
pd.set_option("display.width", 40) # default is 80
pd.DataFrame(np.random.randn(3, 12))
You can adjust the max width of the individual columns by setting
``display.max_colwidth``:
.. ipython:: python
datafile = {
"filename": ["filename_01", "filename_02"],
"path": [
"media/user_name/storage/folder_01/filename_01",
"media/user_name/storage/folder_02/filename_02",
],
}
pd.set_option("display.max_colwidth", 30)
pd.DataFrame(datafile)
pd.set_option("display.max_colwidth", 100)
pd.DataFrame(datafile)
.. ipython:: python
:suppress:
pd.reset_option("display.width")
pd.reset_option("display.max_colwidth")
You can also disable the wrapping of wide DataFrames across multiple lines via
the ``expand_frame_repr`` option. This will print the table in one block.
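One way to compare the two behaviors without changing global settings is ``pd.option_context``, which restores the previous values on exit. This is a sketch under stated assumptions: the frame is illustrative, and ``display.width`` and ``display.max_columns`` are pinned so the result does not depend on the terminal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(36.0).reshape(3, 12))  # illustrative wide frame

# Wrapping enabled: the frame is split into several chunks of columns.
with pd.option_context(
    "display.width", 40,
    "display.max_columns", 20,
    "display.expand_frame_repr", True,
):
    wrapped = repr(df)

# Wrapping disabled: header plus three rows, printed as one block.
with pd.option_context(
    "display.width", 40,
    "display.max_columns", 20,
    "display.expand_frame_repr", False,
):
    one_block = repr(df)

print(one_block)
```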
.. ipython:: python
.. code-block:: ipython