Data Engineering
Corporate data engineering will make you a millionaire in 10 years, and this is how I think
today's computer science students could achieve the same.
𝗦𝘁𝗲𝗽 𝟭: 𝗦𝗤𝗟
- Basic SQL Syntax
- DDL, DML, DCL
- Joins & Subqueries
- Views & Indexes
- CTEs & Window Functions
𝗦𝘁𝗲𝗽 𝟮: 𝗣𝘆𝘁𝗵𝗼𝗻
- Fundamentals
- Numpy
- Pandas
𝗦𝘁𝗲𝗽 𝟯: 𝗣𝘆𝘀𝗽𝗮𝗿𝗸
- RDD
- DataFrames
- Datasets
- Spark Streaming
- Optimization techniques
• SQL - https://lnkd.in/gV_5EFtE
• Python - https://lnkd.in/dt_-2-Uj
• Pyspark - https://lnkd.in/gtCdub-V
• Airflow - https://lnkd.in/guebuHJ7
• Kafka - https://lnkd.in/gVZUT52s
• Azure Cloud - https://lnkd.in/gwc3By9h
• Google Cloud - https://lnkd.in/gV_5EFtE
• AWS - https://lnkd.in/gJeUGfjS
• Projects - https://lnkd.in/gcpsNtnw
5+ years of experience in software engineering, with a focus on platform engineering or
cloud-native application development.
Core Requirements
• Deep knowledge of the Hadoop ecosystem, such as HDFS, Hive, MapReduce, and Presto.
• Advanced knowledge of complex software design, distributed system design, design patterns,
data structures and algorithms.
• Excellent data analytics skills and ability to explore and identify data issues.
• Ability to explain complex subjects in layman’s terms.
• Experience with a distributed version control system such as Git.
• Familiarity with continuous integration/deployment processes and tools such as Jenkins and
Maven.
• Familiarity with public cloud technologies on Google Cloud Platform, especially BigQuery, GCS,
and Dataproc.
• Experience with ETL pipelines.
• Experience in the advertising domain.
• Familiarity with workflow management systems such as Airflow or Oozie.
• Experience with enterprise monitoring and alerting solutions such as Prometheus, Graphite,
Alertmanager, and Splunk.
𝗦𝗤𝗟
- How would you write a query to calculate a cumulative sum or running total within a specific
partition in SQL?
- How do window functions differ from aggregate functions, and when would you use them?
- How do you identify and remove duplicate records in SQL without using temporary tables?
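A minimal sketch of the first and third questions, runnable with Python's built-in sqlite3 (window functions need SQLite 3.25+); the orders table and its columns are illustrative:

import sqlite3

# Hypothetical orders table with a deliberate duplicate row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-01', 100), (1, '2024-01-05', 50),
        (2, '2024-01-02', 75),  (2, '2024-01-02', 75);
""")

# Q1: cumulative sum within a partition via a window function.
for row in conn.execute("""
    SELECT customer_id, order_date, amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total
    FROM orders
"""):
    print(row)

# Q3: dedupe without a temp table -- ROW_NUMBER() keeps the first
# row of each duplicate group and deletes the rest.
conn.execute("""
    DELETE FROM orders WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id, order_date, amount
                       ORDER BY rowid) AS rn
            FROM orders)
        WHERE rn > 1)
""")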
𝗣𝘆𝘁𝗵𝗼𝗻
- How do you manage memory efficiently when processing large files in Python?
- What are Python decorators, and how would you use them to optimize reusable code in ETL
processes?
- How do you use Python’s built-in logging module to capture detailed error and audit logs?
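A compact sketch touching all three questions: iterating a file object streams one line at a time (flat memory on huge files), a decorator wraps any ETL step with reusable timing, and the built-in logging module records the audit trail. Function and path names are hypothetical.

import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("etl")

def timed(func):
    # Reusable decorator: logs each call's duration without touching its body.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            log.info("%s took %.3fs", func.__name__,
                     time.perf_counter() - start)
    return wrapper

@timed
def count_lines(path):
    # Lazy iteration reads one line at a time, so memory stays flat
    # even for multi-GB inputs.
    total = 0
    with open(path, encoding="utf-8") as fh:
        for _ in fh:
            total += 1
    return total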
𝗣𝘆𝘀𝗽𝗮𝗿𝗸
- How would you handle skewed data in a Spark job to prevent performance issues?
- What is the difference between SparkSession and SparkContext? When should each be
used?
- How do you handle backpressure in Spark Streaming applications to manage load
effectively?
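For the skew question, the classic remedy is key salting. A hedged sketch, assuming a running SparkSession (which, per the second question, is the unified entry point that wraps the older SparkContext as spark.sparkContext); the events/users data is made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

N = 8  # number of salt buckets; tune to the observed skew

# One hot key (user_id 1) to simulate skew.
events = spark.createDataFrame([(1, "click")] * 100 + [(2, "view")],
                               ["user_id", "action"])
users = spark.createDataFrame([(1, "alice"), (2, "bob")],
                              ["user_id", "name"])

# Salt the skewed side: a random suffix spreads the hot key over N partitions.
salted_events = events.withColumn(
    "salted_key",
    F.concat_ws("_", "user_id", (F.rand() * N).cast("int")))

# Explode the small side across every salt value so all salted keys match.
salts = spark.range(N).withColumnRenamed("id", "salt")
salted_users = (users.crossJoin(salts)
                     .withColumn("salted_key",
                                 F.concat_ws("_", "user_id", "salt"))
                     .drop("user_id"))

salted_events.join(salted_users, "salted_key") \
             .groupBy("name").count().show()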
𝗔𝘇𝘂𝗿𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀
- How do you configure cluster autoscaling in Databricks, and when should it be used?
- How do you implement data versioning in Delta Lake tables within Databricks?
- How would you monitor and optimize Databricks job performance metrics?
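For the Delta Lake versioning question: every write to a Delta table creates a new version automatically, which you can query back (time travel). A minimal sketch, assuming a Spark runtime with Delta Lake available (on Databricks, spark already exists) and a hypothetical table path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta/customers"  # hypothetical location

# Two writes produce versions 0 and 1.
spark.createDataFrame([(1, "alice")], ["id", "name"]) \
     .write.format("delta").mode("overwrite").save(path)
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: read an earlier version by number (or use timestampAsOf).
spark.read.format("delta").option("versionAsOf", 0).load(path).show()

# Audit the full version history.
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show(truncate=False)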
𝗖𝗜/𝗖𝗗
- What are blue-green deployments, and how would you use them for ETL jobs?
- How do you implement rollback mechanisms in CI/CD pipelines for data integration
processes?
- What strategies do you use to handle schema evolution in data pipelines as part of CI/CD?
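For the schema-evolution question, one common CI gate is a compatibility check that fails the build on breaking changes before a pipeline deploys. A minimal sketch with illustrative schemas: additions pass, drops and type changes fail.

# Hypothetical current vs. proposed schemas (column name -> type).
CURRENT = {"id": "bigint", "email": "string", "created_at": "timestamp"}
PROPOSED = {"id": "bigint", "email": "string",
            "created_at": "timestamp", "country": "string"}

def breaking_changes(current: dict, proposed: dict) -> list:
    problems = []
    for col, dtype in current.items():
        if col not in proposed:
            problems.append(f"column dropped: {col}")
        elif proposed[col] != dtype:
            problems.append(f"type changed: {col} {dtype} -> {proposed[col]}")
    return problems

issues = breaking_changes(CURRENT, PROPOSED)
if issues:
    raise SystemExit("schema check failed: " + "; ".join(issues))
print("schema change is backward-compatible")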
𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗶𝗻𝗴
- How do you optimize join operations in a data warehouse to improve query performance?
- What is a slowly changing dimension (SCD), and what are different ways to implement it in a
data warehouse?
- How do surrogate keys benefit data warehouse design over natural keys?
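For the SCD question, a hedged sketch of SCD Type 2 using Delta Lake's MERGE INTO; all table and column names (dim_customer, stg_customer, is_current, and so on) are assumptions, and a real pipeline would compare more than a single attribute:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: expire the current row for any customer whose address changed.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING stg_customer AS s
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHEN MATCHED AND d.address <> s.address THEN
      UPDATE SET is_current = false, end_date = current_date()
""")

# Step 2: insert a fresh 'current' row for new and just-expired customers.
spark.sql("""
    INSERT INTO dim_customer
    SELECT s.customer_id, s.address, current_date() AS start_date,
           NULL AS end_date, true AS is_current
    FROM stg_customer s
    LEFT JOIN dim_customer d
      ON d.customer_id = s.customer_id AND d.is_current = true
    WHERE d.customer_id IS NULL
""")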
𝗗𝗮𝘁𝗮 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴
- How do you decide between a star schema and a snowflake schema for a data warehouse?
Provide examples of scenarios where each is ideal.
- What is dimensional modeling, and how does it differ from entity-relationship modeling in
terms of use cases?
- How do you handle one-to-many relationships in a dimensional model to ensure efficient
querying?
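To make the star-vs-snowflake trade-off concrete, a minimal star schema in SQLite DDL (all names illustrative): dimensions carry surrogate keys, the natural key survives only as an attribute, and category stays inline on dim_product; a snowflake design would normalize it out into its own dim_category table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key    INTEGER PRIMARY KEY,  -- surrogate key
        full_date   TEXT,
        year        INTEGER,
        month       INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,  -- surrogate key
        product_id  TEXT,                 -- natural/business key kept as attribute
        name        TEXT,
        category    TEXT                  -- inline = star; own table = snowflake
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
""")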
Experience creating solutions that incorporate machine learning algorithms and models, using
Python with data engineering libraries and tools
You have developed server-side Java and Python applications using mainstream
libraries and frameworks, including the Spring framework, Pandas, SciPy, PySpark, and
Pydantic
Current cloud technology experience with AWS
Experience integrating with async messaging, logging, or queues, such as Kafka,
RabbitMQ, SQS, NATS
You collaborate as a hands-on team member developing a significant commercial
software project in Java and Python
Software development experience building and testing applications following secure
coding practices. Preferred additional experience includes building systems for
financial services or other tightly regulated businesses and working with security
and privacy compliance regimes (GDPR, CCPA, ISO 27001, PCI, HIPAA, etc.)
Responsibilities
Design business-critical data models that power business decisions. Ensure
data quality, consistency, and accuracy.
Design, build, and maintain scalable, robust and reliable data pipelines for internal
stakeholders and customers.
Deliver data products that our customers can use, including data warehouse sharing
and embedded analytics.
Help develop a mature product analytics capability within the company and empower
data-driven decisions.
Contribute to the broader Data Analytics community at Zip to influence tooling and
standards to improve culture and productivity.
Qualifications
15+ years of experience in software development, focusing on big data processing, real-time
serving, and distributed low-latency systems
Expert in multiple distributed technologies (e.g., Spark/Storm, Kafka, key-value stores,
caching, Solr, Druid)
Proficient in Scala or Java and in full-stack application development.
Deep knowledge of the Hadoop ecosystem, such as HDFS, Hive, MapReduce, and Presto.
Advanced knowledge of complex software design, distributed system design, design
patterns, data structures and algorithms.
Experience working as a machine learning engineer, collaborating closely with data
scientists.
Experience working with ML frameworks like TensorFlow and ML feature engineering.
Experience in one or more public cloud technologies such as GCP or Azure.
Excellent debugging and problem-solving capability.
Experience in working in large teams using CI/CD and agile methodologies.
Nice to Haves
Domain expertise in Ad Tech systems.
Experience working with financial applications.