Data Engineer Interview 1740985064

This document is a cheatsheet for Exploratory Data Analysis (EDA) using Pandas, covering data loading, overview, missing values, duplicates, summary statistics, value counts, correlation, grouping, data visualization, cleaning, transformation, datetime analysis, filtering, handling outliers, and pivot tables. It provides concise code snippets for each EDA task, making it a quick reference for data analysts. Additionally, it includes links for career guidance and certification courses.

Uploaded by

Saba Hussien

Pandas EDA (Exploratory Data Analysis) Cheatsheet
📂 Data Loading
pd.read_csv(path): Reads a CSV file
pd.read_excel(path, sheet_name="Sheet1"): Reads an Excel file
pd.read_sql(query, con): Reads the result of a SQL query (or a table name) into a DataFrame
pd.read_json(path): Reads a JSON file
pd.read_html(url): Reads tables from an HTML page
pd.read_parquet(path): Reads a Parquet file
df.to_csv("output.csv", index=False): Saves DataFrame to a CSV file
df.to_excel("output.xlsx", index=False): Saves DataFrame to an Excel file
df.to_json("output.json"): Saves DataFrame to a JSON file
df.to_parquet("output.parquet"): Saves DataFrame to a Parquet file
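A minimal round trip through the loader/saver pair, using a tiny invented DataFrame:

```python
import pandas as pd

# Tiny illustrative DataFrame (sample data, not from the source).
df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [91, 85]})

# index=False keeps the RangeIndex out of the file, so the
# file round-trips back to an identical frame.
df.to_csv("output.csv", index=False)
loaded = pd.read_csv("output.csv")
print(loaded.shape)  # (2, 2)
```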
🔎 Data Overview
df.head(n): Displays first n rows (default 5)
df.tail(n): Displays last n rows (default 5)
df.shape: Returns (rows, columns)
df.info(): Displays column data types & memory usage
df.columns: Lists all column names
df.index: Displays the index range
df.dtypes: Shows the data type of each column
df.describe(): Summary statistics for numerical columns
df.describe(include="all"): Summary statistics for all columns
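A quick sketch of the overview calls on a made-up three-row frame:

```python
import pandas as pd

# Small sample frame (hypothetical values) to exercise the overview calls.
df = pd.DataFrame({"city": ["Cairo", "Lima", "Oslo"], "temp": [30, 18, 5]})

print(df.shape)                        # (3, 2)
print(list(df.columns))                # ['city', 'temp']
print(df["temp"].describe()["mean"])   # ≈ 17.67
```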
🔍 Checking Missing Values
df.isnull().sum(): Counts missing values in each column
df.isna().sum(): Same as isnull()
df[df.isnull().any(axis=1)]: Displays rows with missing values
df.dropna(): Removes rows with missing values
df.fillna(value): Replaces missing values with a specified value
df.fillna(df.median(numeric_only=True)): Fills missing values with the column median
df.interpolate(): Performs linear interpolation to fill NaN
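A small sketch of the fill strategies on an invented column with one gap:

```python
import numpy as np
import pandas as pd

# Sample column with one missing value (illustrative data).
df = pd.DataFrame({"sales": [10.0, np.nan, 30.0]})

print(df["sales"].isnull().sum())            # 1

# Median of the non-null values is 20.0, so both strategies
# happen to produce the same result here.
filled = df["sales"].fillna(df["sales"].median())
print(filled.tolist())                       # [10.0, 20.0, 30.0]

interp = df["sales"].interpolate()           # linear fill between neighbors
print(interp.tolist())                       # [10.0, 20.0, 30.0]
```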

📊 Checking Duplicates
df.duplicated(): Returns a Boolean Series marking duplicate rows
df[df.duplicated()]: Displays duplicate rows
df.drop_duplicates(): Removes duplicate rows
📊 Summary Statistics
df.mean(): Mean of numerical columns
df.median(): Median of numerical columns
df.mode(): Mode of each column
df.std(): Standard deviation of numerical columns
df.var(): Variance of numerical columns
df.min(): Minimum value of each column
df.max(): Maximum value of each column
df.count(): Count of non-null values per column
df.nunique(): Number of unique values per column
(On frames that mix numeric and non-numeric columns, pass numeric_only=True to the numeric aggregations in recent pandas.)
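The statistics above on a short invented Series (std is the sample standard deviation, ddof=1, by default):

```python
import pandas as pd

s = pd.Series([2, 4, 4, 6])  # hypothetical numeric column

print(s.mean())     # 4.0
print(s.median())   # 4.0
print(s.std())      # sample standard deviation (ddof=1)
print(s.nunique())  # 3
```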
📊 Value Counts & Distributions
df["column"].value_counts(): Counts occurrences of each unique value
df["column"].value_counts(normalize=True): Normalized value counts (proportions)
df["column"].unique(): Lists unique values
df["column"].nunique(): Number of unique values
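A quick sketch with invented categories; normalize=True returns fractions that sum to 1:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "red"])  # illustrative categories

print(s.value_counts().to_dict())             # {'red': 3, 'blue': 1}
print(s.value_counts(normalize=True)["red"])  # 0.75
print(s.nunique())                            # 2
```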

📊 Correlation & Covariance
df.corr(): Correlation matrix (Pearson by default; pass numeric_only=True if the frame has non-numeric columns)
df.corr(method="kendall"): Kendall correlation
df.corr(method="spearman"): Spearman correlation
df.cov(): Covariance matrix
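A sketch on two perfectly linearly related made-up columns, where every correlation method gives 1:

```python
import pandas as pd

# y = 2 * x exactly, so Pearson and Spearman correlation are both 1.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

print(round(df.corr().loc["x", "y"], 6))                   # 1.0
print(round(df.corr(method="spearman").loc["x", "y"], 6))  # 1.0
print(df.cov().loc["x", "y"])                              # sample covariance of x and y
```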
📊 Grouping & Aggregation
df.groupby("column")["value"].mean(): Groups by column and gets the mean
df.groupby("column")["value"].agg(["sum", "count", "mean"]): Aggregates multiple stats
df.pivot_table(values="sales", index="category", aggfunc="sum"): Pivot table
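The grouping calls above on a few invented sales rows:

```python
import pandas as pd

# Hypothetical rows: two in category A, one in B.
df = pd.DataFrame({
    "category": ["A", "A", "B"],
    "value": [10, 30, 5],
})

print(df.groupby("category")["value"].mean().to_dict())  # {'A': 20.0, 'B': 5.0}

# agg with a list of names returns one column per statistic.
agg = df.groupby("category")["value"].agg(["sum", "count", "mean"])
print(agg.loc["A", "sum"])                               # 40
```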
📈 Data Visualization (Quick Plots)
df.hist(figsize=(10, 5)): Histograms for numerical columns
df.boxplot(figsize=(10, 5)): Box plot for outlier detection
df["column"].plot(kind="hist"): Histogram for a single column
df["column"].plot(kind="box"): Box plot for a single column
df.plot(kind="scatter", x="col1", y="col2"): Scatter plot
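These plot calls require matplotlib to be installed. A minimal scatter-plot sketch on invented data, using the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to file, no window
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [2, 3, 5, 4]})  # sample points

# df.plot returns a matplotlib Axes, so the figure can be saved or styled.
ax = df.plot(kind="scatter", x="col1", y="col2")
ax.figure.savefig("scatter.png")
```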
🧹 Data Cleaning & Transformation
df["column"].str.lower(): Converts text to lowercase
df["column"].str.upper(): Converts text to uppercase
df["column"].str.strip(): Removes leading/trailing spaces
df["column"].str.replace("old", "new"): Replaces text
df["column"].astype("int"): Converts column to integer type
df["column"] = pd.to_datetime(df["column"]): Converts column to datetime
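The string methods chain naturally; a sketch on invented messy strings:

```python
import pandas as pd

df = pd.DataFrame({"name": ["  Alice ", "BOB"]})  # messy sample strings

# .str methods return a new Series, so they can be chained.
cleaned = df["name"].str.strip().str.lower()
print(cleaned.tolist())                            # ['alice', 'bob']
print(cleaned.str.replace("bob", "rob").tolist())  # ['alice', 'rob']
```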
🕰️ DateTime Analysis
df["date_column"].dt.year: Extracts year
df["date_column"].dt.month: Extracts month
df["date_column"].dt.day: Extracts day
df["date_column"].dt.weekday: Extracts weekday (Monday=0)
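The .dt accessor only works after the column is converted to datetime; a sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-15", "2024-02-29"]})  # sample date strings
df["date"] = pd.to_datetime(df["date"])  # convert before using .dt

print(df["date"].dt.year.tolist())     # [2024, 2024]
print(df["date"].dt.month.tolist())    # [1, 2]
print(df["date"].dt.weekday.tolist())  # Monday=0 ... Sunday=6
```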
🔍 Data Filtering & Selection
df.loc[condition]: Filters rows based on a boolean condition
df.query('condition'): Filters data using a query string
df.iloc[start:end]: Selects rows by position (inclusive start, exclusive end)
df[df["column"] > value]: Filters rows where column values exceed a specified value
df[df["column"].isin([value1, value2])]: Filters rows where the column matches any of the specified values
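Boolean masks, query strings, and isin on an invented frame; the first two are equivalent ways to express the same filter:

```python
import pandas as pd

# Hypothetical cities with populations in millions.
df = pd.DataFrame({"city": ["Cairo", "Lima", "Oslo"], "pop": [9, 10, 1]})

print(df[df["pop"] > 5]["city"].tolist())     # ['Cairo', 'Lima']
print(df.query("pop > 5")["city"].tolist())   # same filter as a query string

# isin keeps rows in the original order, not the order of the list.
print(df[df["city"].isin(["Oslo", "Lima"])]["city"].tolist())  # ['Lima', 'Oslo']
```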
🔍 Handling Outliers
Using Z-Score
from scipy import stats
z_scores = stats.zscore(df["column"]): Computes Z-scores for a column
df = df[(z_scores < 3) & (z_scores > -3)]: Filters out outliers with Z-scores above 3 or below -3
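The same filter can be written with pandas alone, which avoids the scipy dependency; ddof=0 here matches scipy.stats.zscore's default (population standard deviation). The sample values are invented:

```python
import pandas as pd

# Invented sample: twelve typical values and one extreme one.
s = pd.Series([10] * 12 + [100])

# Pandas-only z-score; ddof=0 matches scipy.stats.zscore's default.
z = (s - s.mean()) / s.std(ddof=0)
filtered = s[z.abs() < 3]
print(len(filtered))   # 12 — the extreme value is dropped
```

Note that with only a handful of rows a single outlier can never reach |z| = 3 (the maximum possible z-score in a sample of n is (n-1)/sqrt(n)), so this threshold is only meaningful on reasonably sized data.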
📊 Pivot Tables & Cross Tabulation
pd.pivot_table(df, values='value', index='row_group', columns='column_group', aggfunc='sum'): Creates a pivot table with aggregation
pd.crosstab(df['column1'], df['column2'], margins=True): Creates a cross-tabulation of two columns (with row/column totals)
df.pivot_table(values="value", index="category", aggfunc=["sum", "mean", "std"]): Multiple aggregation functions in a pivot table
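A pivot table and a cross-tab on the same invented sales rows; pivot_table aggregates values, crosstab counts rows:

```python
import pandas as pd

# Hypothetical sales rows (one region/product combination is missing).
df = pd.DataFrame({
    "region": ["N", "N", "S"],
    "product": ["x", "y", "x"],
    "sales": [100, 50, 70],
})

pt = pd.pivot_table(df, values="sales", index="region",
                    columns="product", aggfunc="sum")
print(int(pt.loc["N", "x"]))  # 100 (the missing S/y cell becomes NaN)

ct = pd.crosstab(df["region"], df["product"], margins=True)
print(ct.loc["All", "All"])   # 3 — total row count
```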
FOR CAREER GUIDANCE, CHECK OUT OUR PAGE
www.nityacloudtech.com

Follow Us on LinkedIn: Aditya Chandak

Free SQL Interview Preparation:
https://topmate.io/nitya_cloudtech/1403841

Data Analyst Certification:
https://nityacloudtech.com/pages/courses/NCT_Courses

Data Engineer Certification:
https://nityacloudtech.com/pages/courses/NCT_Courses

Artificial Intelligence Certification:
https://nityacloudtech.com/pages/courses/NCT_Courses

Register for Free AI Workshop:
https://nityacloudtech.com/pages/placement_training/AI_MLMasterClass
