Pandas Basics
pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labelled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis/manipulation
tool available in any language.
pandas is well suited for many different kinds of data:
Tabular data with heterogeneously-typed columns, as in an SQL table or Excel
spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data
Arbitrary matrix data (homogeneously typed or heterogeneous) with row and
column labels
Any other form of observational / statistical data sets. The data need not be
labelled at all to be placed into a pandas data structure
Data structures of pandas
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Series is a container for scalars. We would
like to be able to insert and remove objects from these containers in
a dictionary-like fashion.
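The dictionary-like behaviour described above can be sketched as follows. This is a minimal illustration, not the only way to do it; the "pears" column is a made-up name used purely for the example.

```python
import pandas as pd

# A small DataFrame: each column is a Series, each cell a scalar.
fruit = pd.DataFrame({"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]})

# Insert a column the same way you would set a dict key.
# "pears" is a hypothetical column added for illustration.
fruit["pears"] = [1, 0, 4, 2]

# Looking up a key returns the column as a Series.
apples = fruit["apples"]

# Remove columns like dict entries: del discards the column,
# pop removes it and returns it as a Series.
del fruit["oranges"]
pears = fruit.pop("pears")

print(list(fruit.columns))
```

Because the DataFrame behaves like a dict of Series, the usual dict idioms (assignment by key, del, pop) carry over directly.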
Install and import
import pandas as pd
Core components of pandas: Series and DataFrames
The two primary components of pandas are the Series and the DataFrame.
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
purchases
Index in DataFrame
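# Label the rows with customer names (names other than 'June' are illustrative)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
purchases.loc['June']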
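A short sketch of how the index affects row lookup. It assumes the apples and oranges data from earlier; the customer names other than 'June' are hypothetical labels chosen for the example.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}

# Without an explicit index, rows are labelled 0, 1, 2, 3.
default_idx = pd.DataFrame(data)
print(default_idx.index)

# Hypothetical customer names as row labels; only 'June' comes from the text.
labelled = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

# .loc looks a row up by its label, .iloc by its integer position.
june_by_label = labelled.loc["June"]
june_by_position = labelled.iloc[0]
print(june_by_label)
```

Note that .loc['June'] only works once 'June' is an actual row label; on the default integer index it raises a KeyError.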
Example:
import pandas as pd
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})
# I’m only interested in the data in the Age column; selecting it returns a Series
df["Age"]
Create a Series:
ages = pd.Series([22, 35, 58], name="Age")
ages
It’s quite simple to load data from various file formats into a
DataFrame. In the following examples we'll keep using our apples
and oranges data, but this time it's coming from various files.
pandas supports many different file formats and data sources out of the
box (csv, excel, sql, json, parquet, ...), each with a reader function
prefixed read_*.
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
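A self-contained sketch of loading CSV data. Since purchases.csv is not included here, the example reads from an in-memory string instead; the file contents (including the customer names) are assumed for illustration.

```python
import io
import pandas as pd

# Stand-in for purchases.csv; the contents are assumed for illustration.
csv_text = """,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2
"""

# Read with defaults: the unnamed first column becomes a regular column
# and the rows get a default integer index.
df_plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_plain.columns.tolist())
print(df_indexed.loc["June"])
```

CSV files carry no index information of their own, so if the first column holds row labels it has to be designated with index_col when reading.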
Most important DataFrame operations
The first thing to do when opening a new dataset is print out a few
rows to keep as a visual reference. We accomplish this with .head():
movies_df.head()
movies_df.tail(2)
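To see what .head() and .tail() return, here is a small sketch on a made-up DataFrame. The movies_df below is a hypothetical stand-in with invented titles and years, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with eight rows.
movies_df = pd.DataFrame({
    "title": [f"Movie {i}" for i in range(1, 9)],
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
})

first_rows = movies_df.head()     # first 5 rows by default
first_three = movies_df.head(3)   # pass a number to change the row count
last_two = movies_df.tail(2)      # last 2 rows

print(first_rows)
print(last_two)
```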
Getting info about your data
.info() should be one of the very first commands you run after loading
your data:
movies_df.info()
movies_df.shape
movies_df.describe()
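A minimal sketch of the three inspection tools on a made-up DataFrame. The column names and values are hypothetical; one rating is deliberately missing so the non-null counts differ.

```python
import pandas as pd

# Hypothetical stand-in for movies_df; one rating is missing on purpose.
movies_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [7.5, 8.1, None, 6.4],
    "votes": [1200, 560, 90, 3100],
})

# .info() prints column dtypes and non-null counts.
movies_df.info()

# .shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = movies_df.shape

# .describe() summarises numeric columns only by default.
summary = movies_df.describe()
print(summary.loc["count"])
```

The count row of .describe() excludes missing values, which makes it a quick way to spot columns with gaps.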
Handling duplicates
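# Create duplicates by concatenating the DataFrame with itself
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
# drop_duplicates() returns a new DataFrame with the repeated rows removed
temp_df = temp_df.drop_duplicates()
temp_df.shape
# Equivalent in-place form: modifies temp_df and returns None
temp_df.drop_duplicates(inplace=True)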
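The effect of dropping duplicates is easiest to see on a tiny example. This sketch uses a hypothetical three-row movies_df and also shows the keep argument of drop_duplicates, which controls which copy of each repeated row survives.

```python
import pandas as pd

# Hypothetical stand-in for movies_df.
movies_df = pd.DataFrame({"title": ["A", "B", "C"], "year": [2014, 2015, 2016]})

# Stack the frame on top of itself so every row appears twice.
temp_df = pd.concat([movies_df, movies_df])
doubled_shape = temp_df.shape

# Default keep="first": keep the first occurrence of each row.
deduped = temp_df.drop_duplicates()

# keep="last" keeps the last occurrence instead.
keep_last = temp_df.drop_duplicates(keep="last")

# keep=False drops every row that has a duplicate.
drop_all = temp_df.drop_duplicates(keep=False)

print(doubled_shape, deduped.shape, keep_last.shape, drop_all.shape)
```

Since every row here is duplicated, keep=False leaves an empty DataFrame.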
Column cleanup
Many times datasets will have verbose column names with symbols,
upper and lowercase words, spaces, and typos. To make selecting
data by column name easier we can spend a little time cleaning up
their names.
Here's how to print the column names of our dataset:
movies_df.columns
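Two common ways to clean up column names are sketched below. The verbose column names and the values are hypothetical examples of the kind of names described above, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with verbose column names.
movies_df = pd.DataFrame({
    "Runtime (Minutes)": [100, 120],
    "Revenue (Millions)": [10.0, 20.0],
    "Rating": [7.0, 8.0],
})

# Option 1: rename selected columns with a mapping of old name -> new name.
renamed = movies_df.rename(columns={
    "Runtime (Minutes)": "runtime",
    "Revenue (Millions)": "revenue_millions",
})

# Option 2: replace all names at once, here by lowercasing each one.
lowered = movies_df.copy()
lowered.columns = [col.lower() for col in lowered.columns]

print(list(renamed.columns))
print(list(lowered.columns))
```

rename returns a new DataFrame and leaves the original untouched, while assigning to .columns changes the names on the object itself.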