Python Pandas
Overview
Pandas is an open-source Python library that provides powerful, flexible, and easy-to-use data
structures for data manipulation and analysis. It is built on top of NumPy and is widely used in
data science, machine learning, and analytics workflows. The name "pandas" is derived from
"panel data," a term used in econometrics.
1. Data Structures:
o Series: A one-dimensional labeled array that can hold data of any type (integer,
string, float, etc.).
o DataFrame: A two-dimensional labeled data structure, similar to a table in a
database or an Excel spreadsheet.
2. Data Handling:
o Represents missing data with NaN/NA markers and provides methods such as isna(), fillna(), and dropna() to detect, fill, or drop it.
o Supports a wide range of data formats: CSV, Excel, SQL databases, JSON,
Parquet, etc.
o Reads and writes these formats through paired functions and methods such as read_csv() and to_csv().
3. Data Manipulation:
o Powerful functions for filtering, sorting, grouping, merging, pivoting, and
reshaping data.
o Built-in support for time series data, including date range generation and
frequency conversion.
4. Indexing & Selection:
o Label-based indexing with .loc[] and integer-position-based indexing with .iloc[].
o Hierarchical indexing (MultiIndex) for working with higher-dimensional data in a Series or DataFrame.
5. Performance:
o Performance-critical internals are written in C and Cython.
o Supports vectorized operations, which act on whole columns at once and avoid slow Python-level loops.
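The five feature groups above can be sketched in one short session. This is a minimal illustration, not a complete tour: the fruit data and all variable names are made up for the example, and the CSV round trip goes through an in-memory string rather than a file on disk.

```python
import io

import numpy as np
import pandas as pd

# 1. Data structures: a labeled 1-D Series and a 2-D DataFrame (hypothetical data).
prices = pd.Series([1.5, 2.0, np.nan], index=["apple", "pear", "plum"], name="price")
sales = pd.DataFrame(
    {
        "fruit": ["apple", "pear", "apple", "plum"],
        "units": [10, 4, 6, 3],
        "day": pd.date_range("2024-01-01", periods=4, freq="D"),
    }
)

# 2. Data handling: count and fill missing values, then round-trip through CSV text.
n_missing = int(prices.isna().sum())
filled = prices.fillna(0.0)
csv_text = sales.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

# 3. Data manipulation: group and aggregate, then merge with a second table.
units_per_fruit = sales.groupby("fruit")["units"].sum()
price_table = filled.rename_axis("fruit").reset_index()  # columns: fruit, price
merged = sales.merge(price_table, on="fruit")

# Time series support: resample daily rows into 2-day totals.
two_day_units = sales.set_index("day")["units"].resample("2D").sum()

# 4. Indexing and selection: .loc selects by label, .iloc by integer position.
pear_price = prices.loc["pear"]
first_price = prices.iloc[0]

# 5. Performance: vectorized arithmetic over whole columns, no explicit loop.
merged["revenue"] = merged["units"] * merged["price"]

print(units_per_fruit)
print(two_day_units)
print(merged[["fruit", "units", "price", "revenue"]])
```

Filling the missing price with 0.0 is only a placeholder choice for the sketch. In real data the appropriate fill value, or whether to drop the row instead, depends on the analysis.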
Basic Example
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)

# Filtering data
print(df[df['Age'] > 28])
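The filter in this example works through a boolean mask: the comparison produces a Series of True/False values, and indexing with it keeps only the True rows. The sketch below rebuilds the same DataFrame and shows the mask on its own, plus one way to combine conditions. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "Paris", "London"],
    }
)

# The comparison yields a boolean Series aligned with the DataFrame's index.
is_over_28 = df["Age"] > 28

# Combine conditions with & (and) or | (or); each comparison needs parentheses.
over_28_not_paris = df[is_over_28 & (df["City"] != "Paris")]

# .loc can filter rows and pick a column in a single step.
names_over_28 = df.loc[is_over_28, "Name"].tolist()

print(is_over_28)
print(over_28_not_paris)
print(names_over_28)
```

Python's `and`/`or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.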
Use Cases
• Data Cleaning and Preparation: Handling missing values, duplicates, and data type
conversions.
• Exploratory Data Analysis (EDA): Summarizing data using statistics and visualizations
(with libraries like Matplotlib or Seaborn).
• Time Series Analysis: Managing date-time data for financial, weather, or scientific time series.
• Data Transformation: Aggregating, merging, and reshaping datasets for modeling.
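As a rough sketch of how these use cases fit together, the example below cleans a small table, summarizes it, aggregates it by day, and reshapes it. The temperature records are hypothetical, and filling the missing reading with the column mean is just one possible choice made for illustration.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing value,
# and numbers stored as strings.
raw = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "London", "London", "Paris"],
        "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "temp_c": ["5.0", "5.0", "7.5", None, "6.0"],
    }
)

# Data cleaning: drop duplicates, convert types, fill the missing reading.
clean = raw.drop_duplicates().copy()
clean["date"] = pd.to_datetime(clean["date"])
clean["temp_c"] = pd.to_numeric(clean["temp_c"])
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].mean())

# Exploratory analysis: summary statistics for the numeric column.
summary = clean["temp_c"].describe()

# Time series analysis: mean temperature per calendar day.
daily_mean = clean.set_index("date")["temp_c"].resample("D").mean()

# Data transformation: one row per city, one column per date.
wide = clean.pivot(index="city", columns="date", values="temp_c")

print(summary)
print(daily_mean)
print(wide)
```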