BASF Interview Q&A - Muhammad Fitri Afiq Bin Mohd Rashid

BASF Data Engineer for Supply Chains - Mock Q&A

1. Walk me through how you build a data pipeline end-to-end.

To build a data pipeline, I start by understanding the business requirements and identifying the data sources. Then, I ingest the data using tools like Azure Data Factory or Databricks Auto Loader. I store the data in Delta Lake for reliability and ACID compliance. Transformation is handled using PySpark in Databricks, including cleaning, standardizing, and joining data. I schedule the pipeline using Databricks Workflows and monitor it via Azure Monitor. I apply best practices like modular coding, schema validation, and error handling throughout.
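
A minimal sketch of the ingestion step, assuming a Databricks environment with Auto Loader and Delta Lake; the storage paths and file format are hypothetical:

# Minimal sketch: incrementally ingest raw files with Auto Loader and land them in Delta.
# Assumes Databricks; paths, container names, and file format are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/shipments/"       # hypothetical
bronze_path = "abfss://bronze@mystorageaccount.dfs.core.windows.net/shipments/" # hypothetical

(spark.readStream
    .format("cloudFiles")                                   # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", bronze_path + "_schema")
    .load(raw_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", bronze_path + "_checkpoint")
    .trigger(availableNow=True)                             # process all new files, then stop
    .start(bronze_path))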

2. How do you handle schema changes or data drift in production pipelines?

I implement schema validation checks at the ingestion layer to detect changes early. For optional or new columns, I apply dynamic schema inference where appropriate. If schema drift is a recurring issue, I work with upstream teams to enforce data contracts. I also log and alert on deviations using built-in monitoring tools.
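
A minimal sketch of a schema check at the ingestion layer; the expected schema, column names, and table path are hypothetical:

# Minimal sketch: compare the incoming Delta table's columns against a hand-written contract.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.getOrCreate()

expected_schema = StructType([
    StructField("supplier_id", StringType(), True),
    StructField("delay_days", IntegerType(), True),
    StructField("ship_date", DateType(), True),
])

incoming = spark.read.format("delta").load("/mnt/bronze/shipments")  # hypothetical path

# Flag any columns that appeared or disappeared relative to the contract.
expected_cols = {f.name for f in expected_schema.fields}
actual_cols = set(incoming.columns)
drifted = (actual_cols - expected_cols) | (expected_cols - actual_cols)
if drifted:
    raise ValueError(f"Schema drift detected in columns: {sorted(drifted)}")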

3. What are the benefits of using Delta Lake in Databricks?

Delta Lake provides ACID transactions, schema enforcement, time travel, and versioning, which makes it ideal for production data pipelines. It also allows for efficient updates and deletes, which traditional data lakes struggle with. This improves data quality, reliability, and pipeline resilience.
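
A minimal sketch of an upsert and time travel on a Delta table, assuming a Databricks runtime (or delta-spark installed); table paths and the join key are hypothetical:

# Minimal sketch: MERGE an increment into a Delta table, then read an earlier version.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/mnt/silver/shipments")                 # hypothetical path
updates = spark.read.format("delta").load("/mnt/bronze/shipments_increment")

# Efficient in-place update/insert, which plain Parquet data lakes cannot do.
(target.alias("t")
    .merge(updates.alias("u"), "t.shipment_id = u.shipment_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it looked at an earlier version.
previous = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/silver/shipments")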

4. How do you optimize performance when working with large datasets in Spark?

I optimize performance by using partitioning, caching intermediate datasets, minimizing shuffles, and tuning the cluster size. I also ensure data types are appropriate and avoid wide transformations when possible. Broadcasting small lookup tables is another trick I use to speed up joins.
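
A minimal sketch of a broadcast join and caching; the table paths and join key are hypothetical:

# Minimal sketch: broadcast a small dimension table to avoid shuffling the large fact table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

shipments = spark.read.format("delta").load("/mnt/silver/shipments")   # large fact table (hypothetical)
suppliers = spark.read.format("delta").load("/mnt/silver/suppliers")   # small lookup table (hypothetical)

# Broadcasting ships the small table to every executor, so the join needs no shuffle on the large side.
joined = shipments.join(broadcast(suppliers), "supplier_id")

# Cache an intermediate result that is reused by several downstream steps.
joined.cache()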

5. What components of Azure have you used for data engineering or ML workflows?

I've worked with Azure Data Factory for ETL orchestration, Azure Blob Storage for raw data, Azure SQL DB for structured outputs, Azure Key Vault for credentials, and Azure Databricks for transformation and ML modeling. I also use Azure DevOps for CI/CD.


6. How do you orchestrate data pipelines in Azure?

I use Azure Data Factory pipelines or Databricks Workflows depending on the use case. For ADF, I define linked services, datasets, and activities for each step. For Databricks Workflows, I configure jobs and clusters, and link notebooks with dependencies. Monitoring and alerting are set up via Azure Monitor.

7. How do you handle access control and security in Databricks?

I implement role-based access control (RBAC) using Azure Active Directory. I restrict workspace and cluster access by role. I use secrets stored in Azure Key Vault and referenced securely in notebooks. I also enable audit logging and review cluster policies regularly.
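
A minimal sketch of referencing a Key Vault-backed secret from a Databricks notebook; the scope name, key name, and JDBC connection details are hypothetical, and spark plus dbutils are assumed to be available in the workspace:

# Minimal sketch: fetch a credential from a Key Vault-backed secret scope, never hard-coding it.
jdbc_password = dbutils.secrets.get(scope="kv-supplychain", key="sqldb-password")  # hypothetical scope/key

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=supplychain"  # hypothetical
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.shipments")
      .option("user", "etl_user")
      .option("password", jdbc_password)
      .load())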

8. Explain how you would deploy a data pipeline that pulls from SAP and writes into a data lake.

I'd use Azure Data Factory to connect to SAP (e.g., via OData or the SAP connector) and extract the required data. I'd perform preliminary validation, land it in Azure Blob Storage, and use Databricks for further transformation. Final outputs would be stored in Delta format in ADLS, and monitored with alerts for any ingestion failures.

9. Write a SQL query to find the top 5 suppliers by total delay in shipments.

SELECT supplier_id, SUM(delay_days) AS total_delay
FROM shipment_data
GROUP BY supplier_id
ORDER BY total_delay DESC
LIMIT 5;

10. How do you write efficient Python code for ETL?

I use vectorized operations with Pandas or PySpark, avoid unnecessary loops, and modularize code into reusable functions. I also log key steps and handle exceptions gracefully. For large data, I prefer PySpark over Pandas because of its distributed processing.
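
A minimal sketch contrasting a vectorized transformation with a row loop in Pandas; the column names and values are hypothetical:

# Minimal sketch: vectorized column arithmetic instead of iterating over rows.
import pandas as pd

df = pd.DataFrame({"qty_ordered": [100, 250, 80], "qty_delivered": [90, 250, 60]})

# Vectorized: operates on whole columns at once.
df["fill_rate"] = df["qty_delivered"] / df["qty_ordered"]

# Equivalent loop-based version, much slower at scale:
# df["fill_rate"] = [d / o for d, o in zip(df["qty_delivered"], df["qty_ordered"])]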

11. Have you used Pandas vs PySpark? When do you use each?

Yes. I use Pandas for small to medium datasets that fit in memory, typically for local prototyping. I use PySpark for large datasets that require distributed processing, especially when running in Databricks or on big data clusters.

12. How do you handle null values and data validation in Python?

I use functions like .fillna(), .dropna(), or custom imputation logic. For validation, I use assertions, schema checks, and row-level filtering. In PySpark, I define schemas explicitly and use filters or when/otherwise logic to clean data.
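
A minimal sketch of null handling and validation in PySpark; the table path, column names, and imputation rule are hypothetical:

# Minimal sketch: null out invalid values, impute missing ones, and drop rows missing the business key.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("/mnt/bronze/shipments")  # hypothetical path

cleaned = (df
    # Treat negative delays as invalid and null them out.
    .withColumn("delay_days",
                F.when(F.col("delay_days") < 0, F.lit(None)).otherwise(F.col("delay_days")))
    # Impute missing delay values with 0.
    .fillna({"delay_days": 0})
    # Drop rows that are missing the business key.
    .dropna(subset=["supplier_id"]))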

13. How would you prepare supply chain data to be used in a GenAI application?

I'd clean and structure the data into context-rich formats. For example, convert supplier delivery logs into natural language summaries. I'd use semantic tagging to label entities and relationships (e.g., vendor -> product -> delay_reason). I'd also consider embedding structured summaries for retrieval-augmented generation if using LLMs.
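
A minimal sketch of turning one structured delivery record into a natural-language summary that could later be embedded for retrieval; the field names and values are hypothetical:

# Minimal sketch: render a delivery log row as a sentence for embedding/RAG.
def summarize_delivery(row: dict) -> str:
    return (
        f"Supplier {row['supplier_name']} delivered {row['qty_delivered']} units "
        f"of {row['product']} on {row['delivery_date']}, "
        f"{row['delay_days']} days late, reason: {row['delay_reason']}."
    )

record = {
    "supplier_name": "Acme Chemicals", "product": "Polymer X",
    "qty_delivered": 500, "delivery_date": "2024-03-01",
    "delay_days": 3, "delay_reason": "port congestion",
}
print(summarize_delivery(record))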

14. What's your understanding of semantic layers and ontologies in data platforms?

A semantic layer abstracts raw data into business-friendly terms (e.g., "Total Delivered Quantity" instead of qty_delivd). Ontologies define the relationships between data entities, like how a vendor relates to a shipment or invoice. These are crucial for GenAI and self-serve BI tools to interpret data meaningfully.

15. What KPIs would you track in a supply chain analytics dashboard?

On-time delivery rate, supplier reliability, inventory turnover, stockout rate, average shipment delay, demand forecast accuracy, and fill rate. I'd visualize these by supplier, region, and product category.

16. How would you build a forecasting model for inventory demand?

I'd aggregate historical sales/order data, perform time series decomposition, and apply models like ARIMA, Prophet, or LSTM based on seasonality and volume. I'd also include external factors like lead time, holidays, and promotions as features.
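
A minimal sketch of a demand forecast with Prophet, assuming a hypothetical daily-demand file with columns ds (date) and y (quantity):

# Minimal sketch: fit Prophet on historical daily demand and forecast 90 days ahead.
import pandas as pd
from prophet import Prophet

history = pd.read_csv("daily_demand.csv")        # hypothetical file with columns ds, y
history["ds"] = pd.to_datetime(history["ds"])

model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
# External drivers such as promotions could be added as regressors, e.g. model.add_regressor("promo")
model.fit(history)

future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())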

17. Have you ever helped solve a supply chain issue using data?

Yes, at PETRONAS I worked on a project that monitored upstream tank levels in real time. Our data solution helped flag abnormal consumption patterns, leading to faster restocking decisions and preventing potential shutdowns.

18. How do you work in agile teams with product owners and data scientists?

I attend sprint planning and daily stand-ups, contribute to story grooming, and ensure my deliverables (e.g., pipelines, transformations) align with the product vision. I often collaborate with data scientists to shape datasets that match model needs and iterate based on feedback.

19. Have you used Azure DevOps boards or GitHub Actions for CI/CD?

Yes, I've used Azure DevOps Boards for tracking tasks and GitHub Actions for automating testing and deployment of notebooks and pipelines. I've also worked with Git branching strategies for collaborative development.

20. Tell us about a time you had to deliver something under pressure or with limited data.

Once I was asked to build a backfill pipeline within 2 days for missing EPH-OPU data. With incomplete documentation, I reverse-engineered available logs, validated key fields with SMEs, and delivered an automated solution that successfully filled a 3-week gap. This reduced reporting delays and was later repurposed for anomaly detection.
