Handout Pandas

 API reference  DataFrame  pandas.

DataFrame

pandas.DataFrame
class pandas.DataFrame(data=None, index=None, columns=None,
dtype=None, copy=None) [source]

Two-dimensional, size-mutable, potentially heterogeneous tabular data.


Data structure also contains labeled axes (rows and columns). Arithmetic operations align
on both row and column labels. Can be thought of as a dict-like container for Series
objects. The primary pandas data structure.
Parameters:
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is
a dict, column order follows insertion-order. If a dict contains Series which have
an index defined, it is aligned by its index. This alignment also occurs if data is a
Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.
If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing
information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data does not have them,
defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will
perform column selection instead.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
Copy data from inputs. For dict data, the default of None behaves like
copy=True . For DataFrame or 2d ndarray input, the default of None behaves
like copy=False . If data is a dict containing one or more Series (possibly of
different dtypes), copy=False will ensure that these inputs are not copied.
 Changed in version 1.3.0.
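As a quick sketch of the copy semantics described above (the variable names `arr`, `shared`, and `copied` are illustrative, not from the pandas docs): constructing from a 2d ndarray shares the underlying buffer by default, while `copy=True` forces an independent copy.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
shared = pd.DataFrame(arr)             # copy=None acts like copy=False for a 2d ndarray
copied = pd.DataFrame(arr, copy=True)  # force an independent copy

arr[0, 0] = 99                         # mutate the source array in place
# `shared` reflects the change; `copied` keeps the original value
```

This is why `copy=True` is a safer default when the source array may be mutated elsewhere.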

 See also
DataFrame.from_records
Constructor from tuples, also record arrays.
DataFrame.from_dict
From dicts of Series, arrays, or dicts.
read_csv
Read a comma-separated values (csv) file into DataFrame.
read_table
Read general delimited file into DataFrame.
read_clipboard
Read text from clipboard into DataFrame.

Notes
Please reference the User Guide for more information.
Examples
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4

Notice that the inferred dtype is int64.


>>> df.dtypes
col1 int64
col2 int64
dtype: object

To enforce a single dtype:


>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object

Constructing DataFrame from a dictionary including Series:


>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
col1 col2
0 0 NaN
1 1 NaN
2 2 2.0
3 3 3.0

Constructing DataFrame from numpy ndarray:


>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9

Constructing DataFrame from a numpy ndarray that has labeled columns:


>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
c a
0 3 1
1 6 4
2 9 7

Constructing DataFrame from dataclass:


>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
x y
0 0 0
1 0 3
2 2 3

Constructing DataFrame from Series/DataFrame:


>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
0
a 1
c 3

>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])


>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
x
a 1
c 3

Attributes
T The transpose of the DataFrame.
at Access a single value for a row/column label pair.
attrs Dictionary of global attributes of this dataset.
axes Return a list representing the axes of the DataFrame.
columns The column labels of the DataFrame.
dtypes Return the dtypes in the DataFrame.
empty Indicator whether Series/DataFrame is empty.
flags Get the properties associated with this pandas object.
iat Access a single value for a row/column pair by integer position.
iloc Purely integer-location based indexing for selection by position.
index The index (row labels) of the DataFrame.
loc Access a group of rows and columns by label(s) or a boolean array.
ndim Return an int representing the number of axes / array dimensions.
shape Return a tuple representing the dimensionality of the DataFrame.
size Return an int representing the number of elements in this object.
 API reference  DataFrame  pandas.DataFrame.loc

pandas.DataFrame.loc
property DataFrame.loc [source]

Access a group of rows and columns by label(s) or a boolean array.


.loc[] is primarily label based, but may also be used with a boolean array.

Allowed inputs are:


A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and
never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'] .
A slice object with labels, e.g. 'a':'f' .
 Warning
Note that contrary to usual Python slices, both the start and the stop are
included.

A boolean array of the same length as the axis being sliced, e.g. [True, False,
True] .

An alignable boolean Series. The index of the key will be aligned before masking.
An alignable Index. The Index of the returned selection will be the input.
A callable function with one argument (the calling Series or DataFrame) and that
returns valid output for indexing (one of the above)
See more at Selection by Label.
Raises:
KeyError
If any items are not found.
IndexingError
If an indexed key is passed and its index is unalignable to the frame index.
 See also
DataFrame.at
Access a single value for a row/column label pair.
DataFrame.iloc
Access group of rows and columns by integer position(s).
DataFrame.xs
Returns a cross-section (row(s) or column(s)) from the Series/DataFrame.
Series.loc
Access group of values using labels.

Examples
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8

Single label. Note this returns the row as a Series.


>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64

List of labels. Note using [[]] returns a DataFrame.


>>> df.loc[['viper', 'sidewinder']]
max_speed shield
viper 4 5
sidewinder 7 8

Single label for row and column


>>> df.loc['cobra', 'shield']
2



Slice with labels for row and single label for column. As mentioned above, note that both
the start and stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra 1
viper 4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis


>>> df.loc[[False, False, True]]
max_speed shield
sidewinder 7 8

Alignable boolean Series:


>>> df.loc[pd.Series([False, True, False],
... index=['viper', 'sidewinder', 'cobra'])]
max_speed shield
sidewinder 7 8

Index (same behavior as df.reindex )


>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
max_speed shield
foo
cobra 1 2
viper 4 5

Conditional that returns a boolean Series


>>> df.loc[df['shield'] > 6]
max_speed shield
sidewinder 7 8

Conditional that returns a boolean Series with column labels specified


>>> df.loc[df['shield'] > 6, ['max_speed']]
max_speed
sidewinder 7

Multiple conditional using & that returns a boolean Series


>>> df.loc[(df['max_speed'] > 1) & (df['shield'] < 8)]
max_speed shield
viper 4 5

Multiple conditional using | that returns a boolean Series


>>> df.loc[(df['max_speed'] > 4) | (df['shield'] < 5)]
max_speed shield
cobra 1 2
sidewinder 7 8

Please ensure that each condition is wrapped in parentheses () . See the user guide for
more details and explanations of Boolean indexing.
 Note
If you find yourself using 3 or more conditionals in .loc[] , consider using
advanced indexing.
See below for using .loc[] on MultiIndex DataFrames.
Callable that returns a boolean Series
>>> df.loc[lambda df: df['shield'] == 8]
max_speed shield
sidewinder 7 8

Setting values
Set value for all items matching the list of labels
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
max_speed shield
cobra 1 2
viper 4 50
sidewinder 7 50

Set value for an entire row


>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50

Set value for an entire column


>>> df.loc[:, 'max_speed'] = 30
>>> df
max_speed shield
cobra 30 10
viper 30 50
sidewinder 30 50

Set value for rows matching callable condition


>>> df.loc[df['shield'] > 35] = 0
>>> df
max_speed shield
cobra 30 10
viper 0 0
sidewinder 0 0

Add value matching location


>>> df.loc["viper", "shield"] += 5
>>> df
max_speed shield
cobra 30 10
viper 0 5
sidewinder 0 0

Setting using a Series or a DataFrame sets the values matching the index labels, not
the index positions.
>>> shuffled_df = df.loc[["viper", "cobra", "sidewinder"]]
>>> df.loc[:] += shuffled_df
>>> df
max_speed shield
cobra 60 20
viper 0 10
sidewinder 0 0

Getting values on a DataFrame with an index that has integer labels


Another example using integers for the index
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
max_speed shield
7 1 2
8 4 5
9 7 8

Slice with integer labels for rows. As mentioned above, note that both the start and stop of
the slice are included.
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8

Getting values with a MultiIndex


A number of examples using a DataFrame with a MultiIndex
>>> tuples = [
... ('cobra', 'mark i'), ('cobra', 'mark ii'),
... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
... ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
... [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36

Single label. Note this returns a DataFrame with a single index.


>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4

Single index tuple. Note this returns a Series.


>>> df.loc[('cobra', 'mark ii')]
max_speed 0
shield 4
Name: (cobra, mark ii), dtype: int64

Single label for row and column. Similar to passing in a tuple, this returns a Series.
>>> df.loc['cobra', 'mark i']
max_speed 12
shield 2
Name: (cobra, mark i), dtype: int64

Single tuple. Note using [[]] returns a DataFrame.


>>> df.loc[[('cobra', 'mark ii')]]
max_speed shield
cobra mark ii 0 4

Single tuple for the index with a single label for the column
>>> df.loc[('cobra', 'mark i'), 'shield']
2

Slice from index tuple to single label


>>> df.loc[('cobra', 'mark i'):'viper']
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36

Slice from index tuple to index tuple


>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4

viper mark ii 7 1

Please see the user guide for more details and explanations of advanced indexing.
 API reference  DataFrame  pandas.DataF...

pandas.DataFrame.iloc
property DataFrame.iloc [source]

Purely integer-location based indexing for selection by position.


 Deprecated since version 2.2.0: Returning a tuple from a callable is deprecated.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may
also be used with a boolean array.
Allowed inputs are:
An integer, e.g. 5 .
A list or array of integers, e.g. [4, 3, 0] .
A slice object with ints, e.g. 1:7 .
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that
returns valid output for indexing (one of the above). This is useful in method chains,
when you don’t have a reference to the calling object, but would like to base your
selection on some value.
A tuple of row and column indexes. The tuple elements consist of one of the above
inputs, e.g. (0, 1) .
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice
indexers which allow out-of-bounds indexing (this conforms with python/numpy slice
semantics).
See more at Selection by Position.
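The out-of-bounds behavior described above can be sketched with a toy frame (the variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

tail = df.iloc[1:10]   # slice indexers may run past the end: returns rows 1 and 2
try:
    df.iloc[10]        # an out-of-bounds scalar indexer raises IndexError
    caught = False
except IndexError:
    caught = True
```

This mirrors Python/NumPy semantics, where `lst[1:10]` is legal on a 3-element list but `lst[10]` is not.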



 See also
DataFrame.iat
Fast integer location scalar accessor.
DataFrame.loc
Purely label-location based indexer for selection by label.
Series.iloc
Purely integer-location based indexing for selection by position.

Examples
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
... {'a': 100, 'b': 200, 'c': 300, 'd': 400},
... {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
>>> df = pd.DataFrame(mydict)
>>> df
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000

Indexing just the rows


With a scalar integer.
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64

With a list of integers.


>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>

>>> df.iloc[[0, 1]]


a b c d
0 1 2 3 4
1 100 200 300 400

With a slice object.


>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000

With a boolean mask the same length as the index.


>>> df.iloc[[True, False, True]]
a b c d
0 1 2 3 4
2 1000 2000 3000 4000

With a callable, useful in method chains. The x passed to the lambda is the DataFrame
being sliced. This selects the rows whose index label is even.
>>> df.iloc[lambda x: x.index % 2 == 0]
a b c d
0 1 2 3 4
2 1000 2000 3000 4000

Indexing both axes


You can mix the indexer types for the index and columns. Use : to select the entire axis.
With scalar integers.
>>> df.iloc[0, 1]
2

With lists of integers.


>>> df.iloc[[0, 2], [1, 3]]
b d
0 2 4
2 2000 4000

With slice objects.


>>> df.iloc[1:3, 0:3]
a b c
1 100 200 300
2 1000 2000 3000

With a boolean array whose length matches the columns.


>>> df.iloc[:, [True, False, True, False]]
a c
0 1 3
1 100 300
2 1000 3000

With a callable function that expects the Series or DataFrame.


>>> df.iloc[:, lambda df: [0, 2]]
a c
0 1 3
1 100 300
2 1000 3000
© 2024, pandas via NumFOCUS, Inc. Hosted by OVHcloud. Built with the PyData Sphinx Theme 0.14.4. Created using Sphinx 8.0.2.
 API reference  Input/output  pandas.DataF...

pandas.DataFrame.to_pickle
DataFrame.to_pickle(path, *, compression='infer', protocol=5, storage_options=None) [source]

Pickle (serialize) object to file.


Parameters:
path : str, path object, or file-like object
String, path object (implementing os.PathLike[str] ), or file-like object
implementing a binary write() function. File path where the pickled object will
be stored.
compression : str or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like,
then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’,
‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None
for no compression. Can also be a dict with key 'method' set to one of
{ 'zip' , 'gzip' , 'bz2' , 'zstd' , 'xz' , 'tar' } and other key-value pairs
are forwarded to zipfile.ZipFile , gzip.GzipFile , bz2.BZ2File ,
zstandard.ZstdCompressor , lzma.LZMAFile or tarfile.TarFile ,
respectively. As an example, the following could be passed for faster
compression and to create a reproducible gzip archive:
compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1} .

 Added in version 1.5.0: Added support for .tar files.


protocol : int
Int which indicates which protocol should be used by the pickler, default
HIGHEST_PROTOCOL (see [1] paragraph 12.1.2). The possible values are 0, 1, 2,
3, 4, 5. A negative value for the protocol parameter is equivalent to setting its
value to HIGHEST_PROTOCOL.
[1] https://docs.python.org/3/library/pickle.html.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host,
port, username, password, etc. For HTTP(S) URLs the key-value pairs are
forwarded to urllib.request.Request as header options. For other URLs (e.g.
starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to
fsspec.open . Please see fsspec and urllib for more details, and for more
examples on storage options refer here.
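The dict form of `compression` described above can be exercised in a short round trip; the temporary path used here is illustrative:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})

# write a reproducible gzip pickle: compresslevel and mtime are forwarded to gzip.GzipFile
path = os.path.join(tempfile.mkdtemp(), "frame.pkl.gz")
df.to_pickle(path, compression={"method": "gzip", "compresslevel": 1, "mtime": 1})

restored = pd.read_pickle(path)  # compression is inferred from the .gz suffix
```

Pinning `mtime` makes the archive byte-for-byte reproducible, which is useful for cached artifacts.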
 See also
read_pickle
Load pickled pandas object (or any object) from file.
DataFrame.to_hdf
Write DataFrame to an HDF5 file.
DataFrame.to_sql
Write DataFrame to a SQL database.
DataFrame.to_parquet
Write a DataFrame to the binary parquet format.

Examples
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
>>> original_df.to_pickle("./dummy.pkl")

>>> unpickled_df = pd.read_pickle("./dummy.pkl")


>>> unpickled_df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9

 API reference  DataFrame  pandas.DataF...

pandas.DataFrame.sort_values
DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) [source]

Sort by the values along either axis.

Parameters:
by : str or list of str
Name or list of names to sort by.
If axis is 0 or ‘index’, then by may contain index levels and/or column labels.
If axis is 1 or ‘columns’, then by may contain column levels and/or index labels.
axis : “{0 or ‘index’, 1 or ‘columns’}”, default 0
Axis to be sorted.
ascending : bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list
of bools, must match the length of the by.
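The list form of ascending can be sketched as follows (toy data, illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B"], "col2": [2, 1, 9]})

# one ascending flag per entry in `by`: col1 ascending, col2 descending within ties
out = df.sort_values(by=["col1", "col2"], ascending=[True, False])
```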
inplace : bool, default False
If True, perform operation in-place.
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’
Choice of sorting algorithm. See also numpy.sort() for more information.
mergesort and stable are the only stable algorithms. For DataFrames, this option
is only applied when sorting on a single column or label.
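A small sketch of what stability means for tied keys (names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"grp": [2, 1, 2, 1], "tag": ["a", "b", "c", "d"]})

# kind='stable' keeps the original relative order of rows with equal sort keys:
# within grp=1 the rows keep order 'b' then 'd', within grp=2 'a' then 'c'
out = df.sort_values(by="grp", kind="stable")
```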
na_position : {‘first’, ‘last’}, default ‘last’
Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
key : callable, optional
Apply the key function to the values before sorting. This is similar to the key
argument in the builtin sorted() function, with the notable difference that this
key function should be vectorized. It should expect a Series and return a
Series with the same shape as the input. It will be applied to each column in by
independently.
Returns:
DataFrame or None
DataFrame with sorted values or None if inplace=True .
 See also
DataFrame.sort_index
Sort a DataFrame by the index.
Series.sort_values
Similar method for a Series.

Examples
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... 'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F

Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D

Sort by multiple columns


>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3 col4
1 A 1 1 B
0 A 2 0 a
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D

Sort Descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
3 NaN 8 4 D

Putting NAs first


>>> df.sort_values(by='col1', ascending=False, na_position='first')
col1 col2 col3 col4
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B

Sorting with a key function


>>> df.sort_values(by='col4', key=lambda col: col.str.lower())
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F

Natural sort with the key argument, using the natsort <https://github.com/SethMMorton/natsort> package.
>>> df = pd.DataFrame({
... "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
... "value": [10, 20, 30, 40, 50]
... })
>>> df
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
>>> from natsort import index_natsorted
>>> df.sort_values(
... by="time",
... key=lambda x: np.argsort(index_natsorted(df["time"]))
... )
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20

 API reference  DataFrame  pandas.DataF...

pandas.DataFrame.groupby
DataFrame.groupby(by=None, axis=<no_default>, level=None,
as_index=True, sort=True, group_keys=True, observed=<no_default>,
dropna=True) [source]

Group DataFrame using a mapper or by a Series of columns.


A groupby operation involves some combination of splitting the object, applying a
function, and combining the results. This can be used to group large amounts of data and
compute operations on these groups.
Parameters:
by : mapping, function, label, pd.Grouper or list of such
Used to determine the groups for the groupby. If by is a function, it’s called on
each value of the object’s index. If a dict or Series is passed, the Series or dict
VALUES will be used to determine the groups (the Series’ values are first
aligned; see .align() method). If a list or ndarray of length equal to the
selected axis is passed (see the groupby user guide), the values are used as-is
to determine the groups. A label or list of labels may be passed to group by the
columns in self . Notice that a tuple is interpreted as a (single) key.
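The function form of by can be sketched as follows (the grouping-by-first-character rule is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4]}, index=["ax", "ay", "bx", "by"])

# `by` as a function: it is called on each index label,
# here grouping rows by the label's first character
out = df.groupby(lambda label: label[0]).sum()
```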
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Split along rows (0) or columns (1). For Series this parameter is unused and
defaults to 0.
 Deprecated since version 2.1.0: Will be removed and behave like axis=0
in a future version. For axis=1 , do frame.T.groupby(...) instead.
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do
not specify both by and level .
as_index : bool, default True
Return object with group labels as the index. Only relevant for DataFrame input.
as_index=False is effectively “SQL-style” grouped output. This argument has no
effect on filtrations (see the filtrations in the user guide), such as head() ,
tail() , nth() and in transformations (see the transformations in the user
guide).
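The “SQL-style” output mentioned above can be sketched like this (toy data matching the Examples below):

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380., 370., 24., 26.]})

# as_index=False keeps the group labels as a regular column
# instead of moving them into the result's index
out = df.groupby("Animal", as_index=False)["Max Speed"].mean()
```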
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not
influence the order of observations within each group. Groupby preserves the
order of rows within each group. If False, the groups will appear in the same
order as they did in the original DataFrame. This argument has no effect on
filtrations (see the filtrations in the user guide), such as head() , tail() ,
nth() and in transformations (see the transformations in the user guide).

 Changed in version 2.0.0: Specifying sort=False with an ordered


categorical grouper will no longer sort the values.
group_keys : bool, default True
When calling apply and the by argument produces a like-indexed (i.e. a
transform) result, add group keys to index to identify pieces. By default group
keys are not included when the result’s index (and column) labels match the
inputs, and are included otherwise.
 Changed in version 1.5.0: Warns that group_keys will no longer be
ignored when the result from apply is a like-indexed Series or DataFrame.
Specify group_keys explicitly to include the group keys or not.

 Changed in version 2.0.0: group_keys now defaults to True .


observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show
observed values for categorical groupers. If False: show all values for categorical
groupers.
 Deprecated since version 2.1.0: The default value will change to True in
a future version of pandas.
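A minimal sketch of the observed parameter with a categorical grouper (the category names are illustrative):

```python
import pandas as pd

# 'b' is a declared category but never appears in the data
cat = pd.Categorical(["a", "a"], categories=["a", "b"])
df = pd.DataFrame({"key": cat, "val": [1, 2]})

only_seen = df.groupby("key", observed=True).sum()   # just the observed category 'a'
all_cats = df.groupby("key", observed=False).sum()   # also the unobserved 'b'
```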
dropna : bool, default True
If True, and if group keys contain NA values, NA values together with row/column
will be dropped. If False, NA values will also be treated as the key in groups.
Returns:
pandas.api.typing.DataFrameGroupBy
Returns a groupby object that contains information about the groups.
 See also
resample
Convenience method for frequency conversion and resampling of time
series.

Notes
See the user guide for more detailed usage and examples, including splitting an object into
groups, iterating through groups, selecting a group, aggregation, and more.
Examples
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
... 'Parrot', 'Parrot'],
... 'Max Speed': [380., 370., 24., 26.]})
>>> df
Animal Max Speed
0 Falcon 380.0
1 Falcon 370.0
2 Parrot 24.0
3 Parrot 26.0
>>> df.groupby(['Animal']).mean()
Max Speed
Animal
Falcon 375.0
Parrot 25.0

Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
... ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))

>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                    index=index)
>>> df
Max Speed
Animal Type
Falcon Captive 390.0
Wild 350.0
Parrot Captive 30.0
Wild 20.0
>>> df.groupby(level=0).mean()
Max Speed
Animal
Falcon 370.0
Parrot 25.0
>>> df.groupby(level="Type").mean()
Max Speed
Type
Captive 210.0
Wild 185.0

We can also choose to include NA in group keys or not by setting dropna parameter, the
default setting is True.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5

>>> df.groupby(by=["b"], dropna=False).sum()


a c
b
1.0 2 3
2.0 2 5
NaN 1 4

>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by="a").sum()
b c
a
a 13.0 13.0
b 12.3 123.0



>>> df.groupby(by="a", dropna=False).sum()
b c
a
a 13.0 13.0
b 12.3 123.0
NaN 12.3 33.0

When using .apply() , use group_keys to include or exclude the group keys. The
group_keys argument defaults to True (include).

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',


... 'Parrot', 'Parrot'],
... 'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True)[['Max Speed']].apply(lambda x: x)
Max Speed
Animal
Falcon 0 380.0
1 370.0
Parrot 2 24.0
3 26.0

>>> df.groupby("Animal", group_keys=False)[['Max Speed']].apply(lambda x: x)


Max Speed
0 380.0
1 370.0
2 24.0
3 26.0

 API reference  DataFrame  pandas.DataF...

pandas.DataFrame.drop_duplicates
DataFrame.drop_duplicates(subset=None, *, keep='first',

inplace=False, ignore_index=False) [source]

Return DataFrame with duplicate rows removed.


Considering certain columns is optional. Indexes, including time indexes are ignored.
Parameters:
subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the
columns.
keep : {‘first’, ‘last’, False }, default ‘first’
Determines which duplicates (if any) to keep.
‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

inplace : bool, default False


Whether to modify the DataFrame rather than creating a new one.
ignore_index : bool, default False
If True , the resulting axis will be labeled 0, 1, …, n - 1.
Returns:
DataFrame or None
DataFrame with duplicates removed or None if inplace=True .
 See also
DataFrame.value_counts
Count unique combinations of columns.

Examples


Consider dataset containing ramen rating.
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0

By default, it removes duplicate rows based on all columns.


>>> df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0

To remove duplicates on specific column(s), use subset .


>>> df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5

To remove duplicates and keep last occurrences, use keep .


>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
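The remaining two options, keep=False and ignore_index, can be sketched on the same ramen data:

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# keep=False drops every member of each duplicated group,
# leaving only the ('Indomie', 'cup') row, which has no duplicate
no_dupes = df.drop_duplicates(subset=["brand", "style"], keep=False)

# ignore_index=True relabels the surviving rows 0, 1, ...
relabeled = df.drop_duplicates(subset=["brand"], ignore_index=True)
```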

 API reference  DataFrame  pandas.DataF...

pandas.DataFrame.drop
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') [source]

Drop specified labels from rows or columns.


Remove rows or columns by specifying label names and corresponding axis, or by directly
specifying index or column names. When using a multi-index, labels on different levels can
be removed by specifying the level. See the user guide for more information about the
now unused levels.
Parameters:
labels : single label or list-like
Index or column labels to drop. A tuple will be used as a single label and not
treated as a list-like.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index : single label or list-like
Alternative to specifying axis ( labels, axis=0 is equivalent to
index=labels ).

columns : single label or list-like


Alternative to specifying axis ( labels, axis=1 is equivalent to
columns=labels ).

level : int or level name, optional


For MultiIndex, level from which the labels will be removed.
inplace : bool, default False
If False, return a copy. Otherwise, do operation in place and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and only existing labels are dropped.
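A quick sketch of errors='ignore' (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})

# 'B' exists and is dropped; the missing label 'Z' is silently skipped
out = df.drop(columns=["B", "Z"], errors="ignore")
```

With the default errors='raise', the same call would raise a KeyError because of 'Z'.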
Returns:
DataFrame or None
DataFrame with the specified index or column labels removed, or None if inplace=True.
Raises:
KeyError
If any of the labels is not found in the selected axis.
 See also
DataFrame.loc
Label-location based indexer for selection by label.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data
are missing.
DataFrame.drop_duplicates
Return DataFrame with duplicate rows removed, optionally only considering
certain columns.
Series.drop
Return Series with specified index labels removed.

Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
... columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

Drop columns
>>> df.drop(['B', 'C'], axis=1)
A D
0 0 3
1 4 7
2 8 11

>>> df.drop(columns=['B', 'C'])


A D
0 0 3
1 4 7
2 8 11

Drop a row by index


>>> df.drop([0, 1])
A B C D
2 8 9 10 11

Drop columns and/or rows of MultiIndex DataFrame


>>> midx = pd.MultiIndex(levels=[['llama', 'cow', 'falcon'],
... ['speed', 'weight', 'length']],
... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
... [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
... [250, 150], [1.5, 0.8], [320, 250],
... [1, 0.8], [0.3, 0.2]])
>>> df
big small
llama speed 45.0 30.0
weight 200.0 100.0
length 1.5 1.0
cow speed 30.0 20.0
weight 250.0 150.0
length 1.5 0.8
falcon speed 320.0 250.0
weight 1.0 0.8
length 0.3 0.2

Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the
combination 'falcon' and 'weight' , which deletes only the corresponding row
>>> df.drop(index=('falcon', 'weight'))
big small
llama speed 45.0 30.0
weight 200.0 100.0
length 1.5 1.0
cow speed 30.0 20.0
weight 250.0 150.0
length 1.5 0.8
falcon speed 320.0 250.0
length 0.3 0.2

>>> df.drop(index='cow', columns='small')


big
llama speed 45.0
weight 200.0
length 1.5
falcon speed 320.0
weight 1.0
length 0.3

>>> df.drop(index='length', level=1)


big small
llama speed 45.0 30.0
weight 200.0 100.0
cow speed 30.0 20.0
weight 250.0 150.0
falcon speed 320.0 250.0
weight 1.0 0.8
