Data Engineer Interview
EDA (Exploratory Data Analysis) Cheatsheet
📂 Data Loading
pd.read_csv(path): Reads a CSV file
pd.read_excel(path, sheet_name="Sheet1"): Reads an Excel file
pd.read_sql(query, conn): Reads a SQL query result or table via a database connection
pd.read_json(path): Reads a JSON file
pd.read_html(url): Reads tables from an HTML page
pd.read_parquet(path): Reads a Parquet file
df.to_csv("output.csv", index=False): Saves DataFrame to a CSV file
df.to_excel("output.xlsx", index=False): Saves DataFrame to an Excel file
df.to_json("output.json"): Saves DataFrame to a JSON file
df.to_parquet("output.parquet"): Saves DataFrame to a Parquet file
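A minimal loading/saving sketch (the file data.csv, the SQLite database example.db, and its users table are hypothetical, only for illustration; to_parquet needs pyarrow or fastparquet installed):

import sqlite3
import pandas as pd

# Read a local CSV file (assumed to exist)
df = pd.read_csv("data.csv")

# Read the result of a SQL query through a database connection
conn = sqlite3.connect("example.db")
users = pd.read_sql("SELECT * FROM users", conn)
conn.close()

# Write the DataFrame back out in two formats
df.to_csv("output.csv", index=False)
df.to_parquet("output.parquet")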
🔎 Data Overview
df.head(n): Displays first n rows (default 5)
df.tail(n): Displays last n rows (default 5)
df.shape: Returns (rows, columns)
df.info(): Displays column data types & memory usage
df.columns: Lists all column names
df.index: Displays index range
df.dtypes: Shows data types of each column
df.describe(): Summary statistics for numerical columns
df.describe(include="all"): Summary statistics for all columns
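Quick overview sketch on a small made-up DataFrame:

import pandas as pd

# Tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "age": [34, 29, 41, None],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

print(df.shape)                    # (4, 3)
print(df.dtypes)                   # data type of each column
df.info()                          # column types and memory usage
print(df.head(2))                  # first two rows
print(df.describe(include="all"))  # stats for numeric and non-numeric columns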
🔍 Checking Missing Values
df.isnull().sum(): Counts missing values in each column
df.isna().sum(): Same as isnull()
df[df.isnull().any(axis=1)]: Displays rows with missing values
df.dropna(): Removes rows with missing values
df.fillna(value): Replaces missing values with a specified value
df.fillna(df.median(numeric_only=True)): Fills missing values in numeric columns with the column median
df.interpolate(): Performs linear interpolation to fill NaN
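Missing-value handling sketch (the temp and city columns are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.5, np.nan, 25.0],
    "city": ["Pune", "Pune", None, "Delhi", "Delhi"],
})

print(df.isnull().sum())            # missing count per column
print(df[df.isnull().any(axis=1)])  # rows containing any NaN

filled = df.fillna({"city": "Unknown"})        # fill one column with a fixed value
filled["temp"] = filled["temp"].interpolate()  # linear interpolation for numeric gaps
dropped = df.dropna()                          # or drop incomplete rows entirely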
📊 Checking Duplicates
df.duplicated(): Returns a Boolean Series for duplicate rows
df[df.duplicated()]: Displays duplicate rows
df.drop_duplicates(): Removes duplicate rows
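Duplicate-check sketch on a made-up table with one repeated row:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Alice", "Bob", "Bob", "Cara"],
})

print(df.duplicated())          # True only for the second "Bob" row
print(df[df.duplicated()])      # show the duplicate rows
deduped = df.drop_duplicates()  # keep the first occurrence of each row
print(deduped.shape)            # (3, 2)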
📊 Summary Statistics
df.mean(): Mean of numerical columns
df.median(): Median of numerical columns
df.mode(): Mode of each column
df.std(): Standard deviation of numerical columns
df.var(): Variance of numerical columns
df.min(): Minimum value of each column
df.max(): Maximum value of each column
df.count(): Count of non-null values per column
df.nunique(): Number of unique values per column
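Summary-statistics sketch (made-up sales data; on mixed-type DataFrames, recent pandas versions need numeric_only=True for mean/median/std/var):

import pandas as pd

df = pd.DataFrame({
    "region": ["N", "S", "N", "E"],
    "units": [10, 15, 10, 7],
    "price": [2.5, 3.0, 2.5, 4.0],
})

print(df.mean(numeric_only=True))  # mean of units and price only
print(df.std(numeric_only=True))   # standard deviation of numeric columns
print(df.min())                    # per-column minimum (works for strings too)
print(df.count())                  # non-null count per column
print(df.nunique())                # unique values per column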
📊 Value Counts & Distributions
df["column"].value_counts(): Counts
occurrences of each unique value
df["column"].value_counts(normalize=Tr
ue): Normalized value counts
(percentage)
df["column"].unique(): Lists unique values
df["column"].nunique(): Number of
unique values
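Value-count sketch on a made-up survey column:

import pandas as pd

df = pd.DataFrame({"answer": ["yes", "no", "yes", "yes", "maybe"]})

print(df["answer"].value_counts())                # yes: 3, no: 1, maybe: 1
print(df["answer"].value_counts(normalize=True))  # proportions: 0.6, 0.2, 0.2
print(df["answer"].unique())                      # ['yes' 'no' 'maybe']
print(df["answer"].nunique())                     # 3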
Follow Us on LinkedIn: Aditya Chandak
Free SQL Interview Preparation: https://topmate.io/nitya_cloudtech/1403841