BASF Interview QA
1. How do you build a data pipeline end to end?
To build a data pipeline, I start by understanding the business requirements and identifying the data sources.
Then, I ingest the data using tools like Azure Data Factory or Databricks Auto Loader. I store the data in
Delta Lake for reliability and ACID compliance. Transformation is handled using PySpark in Databricks,
including cleaning, standardizing, and joining data. I schedule the pipeline using Databricks Workflows and
monitor it via Azure Monitor. I apply best practices like modular coding, schema validation, and error handling
throughout.
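As a rough sketch of the ingestion step (assuming a Databricks runtime where spark and Auto Loader are available; the paths and the bronze.orders table below are placeholders):

raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/orders/"
checkpoint_path = "abfss://meta@mystorageaccount.dfs.core.windows.net/checkpoints/orders/"

# Auto Loader incrementally picks up new files and lands them in a Delta table.
(spark.readStream
    .format("cloudFiles")                                # Databricks Auto Loader
    .option("cloudFiles.format", "json")                 # format of the incoming raw files
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                          # process available files, then stop
    .toTable("bronze.orders"))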
2. How do you handle schema drift or changes in source data?
I implement schema validation checks at the ingestion layer to detect changes early. For optional or new
columns, I apply dynamic schema inference where appropriate. If schema drift is a recurring issue, I work
with upstream teams to enforce data contracts. I also log and alert deviations using built-in monitoring tools.
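A minimal version of such an ingestion-time check (the expected_schema, landing path, and column names are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

expected_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("supplier_id", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("delivery_date", DateType(), True),
])

incoming = spark.read.json("/mnt/raw/orders/")        # hypothetical landing path

# Compare column sets so new or missing columns are caught and alerted on early.
expected_cols = {f.name for f in expected_schema.fields}
incoming_cols = set(incoming.columns)
new_cols = incoming_cols - expected_cols
missing_cols = expected_cols - incoming_cols
if new_cols or missing_cols:
    raise ValueError(f"Schema drift detected: new={new_cols}, missing={missing_cols}")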
3. Why do you use Delta Lake, and what benefits does it provide?
Delta Lake provides ACID transactions, schema enforcement, time travel, and versioning, which makes it
ideal for production data pipelines. It also allows for efficient updates and deletes, which traditional data lakes
struggle with. This improves data quality, reliability, and pipeline resilience.
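For example, an upsert and a time-travel read in Delta might look like this (table and column names are placeholders; assumes the delta-spark package on Databricks):

from delta.tables import DeltaTable

updates_df = spark.table("bronze.shipment_updates")   # hypothetical incoming changes
target = DeltaTable.forName(spark, "silver.shipments")

# MERGE gives efficient updates/inserts that plain Parquet data lakes struggle with.
(target.alias("t")
    .merge(updates_df.alias("u"), "t.shipment_id = u.shipment_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: query the table as of an earlier version for audits or rollback.
previous = spark.sql("SELECT * FROM silver.shipments VERSION AS OF 3")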
4. How do you optimize performance when working with large datasets in Spark?
I optimize performance by using partitioning, caching intermediate datasets, minimizing shuffles, and tuning
the cluster size. I also ensure data types are appropriate and avoid wide transformations when possible.
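A few of these moves in code (table names are illustrative; suppliers is assumed to be a small dimension table):

from pyspark.sql import functions as F

orders = spark.table("silver.orders")
suppliers = spark.table("silver.suppliers")

# Broadcast the small side to avoid a shuffle-heavy sort-merge join.
enriched = orders.join(F.broadcast(suppliers), "supplier_id")

# Cache an intermediate result that is reused by several downstream steps.
enriched.cache()

# Partition the output by a commonly filtered column so reads can prune files.
(enriched.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("gold.orders_enriched"))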
5. What components of Azure have you used for data engineering or ML workflows?
I've worked with Azure Data Factory for ETL orchestration, Azure Blob Storage for raw data, Azure SQL DB
for structured outputs, Azure Key Vault for credentials, and Azure Databricks for transformation and ML workloads.
6. How do you orchestrate and schedule your data pipelines?
I use Azure Data Factory pipelines or Databricks Workflows depending on the use case. For ADF, I define
linked services, datasets, and activities for each step. For Databricks Workflows, I configure jobs and
clusters, and link notebooks with dependencies. Monitoring and alerting are set up via Azure Monitor.
7. How do you handle security and access control in your data platform?
I implement role-based access control (RBAC) using Azure Active Directory. I restrict workspace and cluster
access by role. I use secrets stored in Azure Key Vault and referenced securely in notebooks. I also enable
audit logging so that access and changes can be traced.
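For example, pulling a secret inside a notebook (the scope, key, server, and table names are placeholders; the secret scope is assumed to be backed by Key Vault):

# Databricks secret scope backed by Azure Key Vault; nothing sensitive is hard-coded.
jdbc_password = dbutils.secrets.get(scope="kv-backed-scope", key="sql-db-password")

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=analytics"
df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.shipments")
    .option("user", "etl_user")
    .option("password", jdbc_password)
    .load())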
8. Explain how you would deploy a data pipeline that pulls from SAP and writes into a data lake.
I'd use Azure Data Factory to connect to SAP (e.g., via OData or SAP connector) and extract the required
data. I'd perform preliminary validation, land it in Azure Blob Storage, and use Databricks for further
transformation. Final outputs would be stored in Delta format in ADLS, and monitored with alerts for any
ingestion failures.
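The Databricks step after ADF lands the SAP extract could look roughly like this (container, path, and column names are placeholders):

from pyspark.sql import functions as F

raw = spark.read.parquet("wasbs://landing@mystorage.blob.core.windows.net/sap/shipments/")

# Basic validation and an audit column before promoting the data.
validated = (raw
    .filter(F.col("shipment_id").isNotNull())
    .withColumn("load_ts", F.current_timestamp()))

(validated.write
    .format("delta")
    .mode("append")
    .save("abfss://curated@mydatalake.dfs.core.windows.net/delta/sap_shipments/"))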
9. Write a SQL query to find the top 5 suppliers by total delay in shipments.
-- Assumes shipment_data has supplier_id and a delay_days column.
SELECT supplier_id, SUM(delay_days) AS total_delay
FROM shipment_data
GROUP BY supplier_id
ORDER BY total_delay DESC
LIMIT 5;
10. How do you write efficient and maintainable Python code for data processing?
I use vectorized operations with Pandas or PySpark, avoid unnecessary loops, and modularize code into
reusable functions. I also log key steps and handle exceptions gracefully. For large data, I prefer PySpark.
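A small illustration of vectorized versus loop-based Pandas (the sample data is made up):

import pandas as pd

df = pd.DataFrame({"qty": [10, 5, 8], "unit_price": [2.5, 4.0, 1.2]})

# Vectorized: one expression over whole columns, no explicit row loop.
df["line_total"] = df["qty"] * df["unit_price"]

# Loop-based equivalent (slower and more error-prone on large frames):
# totals = [row.qty * row.unit_price for row in df.itertuples()]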
11. Have you used Pandas vs PySpark? When do you use each?
Yes. I use Pandas for small to medium datasets that fit in memory, typically for local prototyping. I use
PySpark for large datasets that require distributed processing, especially when running in Databricks or on
big data clusters.
12. How do you handle null values and data validation in Python?
I use functions like .fillna(), .dropna(), or custom imputation logic. For validation, I use assertions, schema
checks, and row-level filtering. In PySpark, I define schemas explicitly and use filters or when/otherwise logic
to clean data.
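For illustration, both styles side by side (column names and sample values are made up):

import pandas as pd
from pyspark.sql import functions as F

# Pandas: impute a default and drop rows missing a required field.
pdf = pd.DataFrame({"qty": [10, None, 8], "status": ["OK", None, "OK"]})
pdf["qty"] = pdf["qty"].fillna(0)
pdf = pdf.dropna(subset=["status"])

# PySpark: when/otherwise to standardize values, plus a filter to drop bad rows.
sdf = spark.createDataFrame([(10, "OK"), (None, None), (8, "OK")], ["qty", "status"])
sdf = (sdf
    .withColumn("qty", F.when(F.col("qty").isNull(), 0).otherwise(F.col("qty")))
    .filter(F.col("status").isNotNull()))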
13. How would you prepare supply chain data to be used in a GenAI application?
I'd clean and structure the data into context-rich formats. For example, convert supplier delivery logs into
natural language summaries. I'd use semantic tagging to label entities and relationships (e.g., vendor ->
product -> delay_reason). I'd also consider embedding structured summaries for retrieval-augmented
generation (RAG).
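A toy example of turning one structured delivery record into a context-rich sentence that could later be embedded (the fields and values are made up):

record = {
    "supplier": "Acme Chemicals",
    "product": "Solvent X",
    "delay_days": 4,
    "delay_reason": "port congestion",
}

# Natural-language summary suitable for embedding and retrieval.
summary = (
    f"Supplier {record['supplier']} delivered {record['product']} "
    f"{record['delay_days']} days late due to {record['delay_reason']}."
)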
14. What's your understanding of semantic layers and ontologies in data platforms?
A semantic layer abstracts raw data into business-friendly terms (e.g., "Total Delivered Quantity" instead of
qty_delivd). Ontologies define the relationships between data entities, like how a vendor relates to a shipment
or invoice. These are crucial for GenAI and self-serve BI tools to interpret data meaningfully.
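A very small sketch of a semantic mapping layered over raw columns (the raw names and the shipments table are illustrative):

semantic_map = {
    "qty_delivd": "total_delivered_quantity",
    "vend_id": "vendor",
    "dlv_dt": "delivery_date",
}

shipments = spark.table("silver.shipments")
for raw_name, business_name in semantic_map.items():
    if raw_name in shipments.columns:
        shipments = shipments.withColumnRenamed(raw_name, business_name)

# Expose the business-friendly names as a view for BI or GenAI consumers.
shipments.createOrReplaceTempView("v_shipments_semantic")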
15. What KPIs would you track in a supply chain analytics dashboard?
On-time delivery rate, supplier reliability, inventory turnover, stockout rate, average shipment delay, demand
forecast accuracy, and fill rate. I'd visualize these by supplier, region, and product category.
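Two of these KPIs computed per supplier, as a sketch (the gold.shipments table and delay_days column are assumptions):

from pyspark.sql import functions as F

shipments = spark.table("gold.shipments")

kpis = (shipments
    .groupBy("supplier_id")
    .agg(
        # Share of shipments with zero or negative delay = on-time delivery rate.
        F.avg(F.when(F.col("delay_days") <= 0, 1).otherwise(0)).alias("on_time_delivery_rate"),
        F.avg("delay_days").alias("avg_shipment_delay_days"),
    ))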
16. How would you build a forecasting model for inventory demand?
I'd aggregate historical sales/order data, perform time series decomposition, and apply models like ARIMA,
Prophet, or LSTM based on seasonality and volume. I'd also include external factors like lead times and holidays.
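A minimal Prophet sketch for weekly demand (assumes the prophet package and a history file with ds/y columns; both are placeholders):

import pandas as pd
from prophet import Prophet

history = pd.read_csv("demand_history.csv", parse_dates=["ds"])   # hypothetical input

model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.add_country_holidays(country_name="DE")      # holidays as an external factor
model.fit(history)

future = model.make_future_dataframe(periods=12, freq="W")
forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]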
17. Have you ever helped solve a supply chain issue using data?
Yes, at PETRONAS I worked on a project that monitored upstream tank levels in real time. Our data solution
helped flag abnormal consumption patterns, leading to faster restocking decisions and preventing potential
shutdowns.
18. How do you work in agile teams with product owners and data scientists?
I attend sprint planning and daily stand-ups, contribute to story grooming, and ensure my deliverables (e.g.,
pipelines, transformations) align with the product vision. I often collaborate with data scientists to shape
the data and features their models need.
19. Have you used Azure DevOps boards or GitHub Actions for CI/CD?
Yes, I've used Azure DevOps Boards for tracking tasks and GitHub Actions for automating testing and
deployment of notebooks and pipelines. I've also worked with Git branching strategies for collaborative
development.
20. Tell us about a time you had to deliver something under pressure or with limited data.
Once I was asked to build a backfill pipeline within 2 days for missing EPH-OPU data. With incomplete
documentation, I reverse-engineered available logs, validated key fields with SMEs, and delivered an
automated solution that filled a 3-week gap successfully. This reduced reporting delays, and the approach
was later reused for similar gaps.