Lecture 3 Data Engineering Concepts, Processes, and Tools
Sharing top billing on the list of data science capabilities, machine learning and artificial
intelligence are not just buzzwords: many organizations are eager to adopt them. But before
building intelligent products, you need to gather and prepare the data that fuels AI. A separate
discipline called data engineering lays the necessary groundwork for analytics projects.
Tasks related to it occupy the first three layers of the data science hierarchy of needs
suggested by Monica Rogati.
Within a large organization, there are usually many different types of operations management
software (e.g., ERP, CRM, production systems), each containing databases with varied
information. In addition, data can be stored as separate files or pulled in real time from
external sources such as IoT devices. Having data scattered across different formats prevents
the organization from seeing a clear picture of its business state and running analytics.
Data engineering addresses this problem step by step.
Data ingestion collects data from its various sources and moves it into a centralized
repository where it can be processed.
Data transformation adjusts the disparate data to the needs of end users. It involves removing
errors and duplicates, normalizing the data, and converting it into the needed format.
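As a quick illustration, a transformation step might look like the following sketch in Python
with pandas; the column names and values are invented for the example.

import pandas as pd

# Hypothetical raw records pulled from two overlapping source systems.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["A@Example.com", "b@example.com", "b@example.com", "c@example.com"],
    "signup_date": ["2024-01-05", "2024-01-09", "2024-01-09", "2024-02-11"],
})

# Remove exact duplicates introduced by overlapping extracts.
clean = raw.drop_duplicates().copy()

# Normalize values so every consumer sees one consistent representation.
clean["email"] = clean["email"].str.lower()

# Convert date strings into a proper datetime type for downstream use.
clean["signup_date"] = pd.to_datetime(clean["signup_date"])

print(clean)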
Data serving delivers transformed data to end users — a BI platform, dashboard, or data
science team.
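A minimal sketch of what serving can look like, assuming SQLite as a stand-in for a real data
warehouse (the table and file names are invented):

import sqlite3
import pandas as pd

# Illustrative transformed data that is ready to be served.
clean = pd.DataFrame({
    "customer_id": [101, 102],
    "email": ["a@example.com", "b@example.com"],
})

# Persist the data where a BI platform or dashboard can query it.
conn = sqlite3.connect("analytics.db")
clean.to_sql("customers", conn, if_exists="replace", index=False)

# An analyst, dashboard, or data science team then reads it with plain SQL.
print(pd.read_sql("SELECT customer_id, email FROM customers", conn))
conn.close()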
Data flow orchestration provides visibility into the data engineering process, ensuring that
all tasks are successfully completed. It coordinates and continuously tracks data workflows to
detect and fix data quality and performance issues.
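One widely used open-source orchestrator is Apache Airflow. The sketch below shows a minimal
DAG that runs the three steps in order and lets Airflow track each task's status; the DAG id,
schedule, and task bodies are assumptions for illustration (Airflow 2.x API):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    pass  # pull data from source systems

def transform():
    pass  # remove errors and duplicates, normalize, convert formats

def serve():
    pass  # deliver the transformed data to the BI layer

with DAG(
    dag_id="customer_data_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    serve_task = PythonOperator(task_id="serve", python_callable=serve)

    # Declare the order of the steps; Airflow runs them in sequence
    # and reports each task's success or failure.
    ingest_task >> transform_task >> serve_task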
The mechanism that automates the ingestion, transformation, and serving steps of the data
engineering process is known as a data pipeline.
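In its simplest form, such a pipeline is just a program that runs the steps in sequence on a
schedule; a self-contained sketch (with invented data and names) follows:

import sqlite3
import pandas as pd

def ingest() -> pd.DataFrame:
    # Stand-in for reading from an ERP, CRM, file, or IoT stream.
    return pd.DataFrame({
        "customer_id": [101, 102, 102],
        "email": ["A@Example.com", "b@example.com", "b@example.com"],
    })

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Deduplicate and normalize, as described above.
    clean = raw.drop_duplicates().copy()
    clean["email"] = clean["email"].str.lower()
    return clean

def serve(clean: pd.DataFrame) -> None:
    # Load the result where end users can reach it.
    conn = sqlite3.connect("analytics.db")
    clean.to_sql("customers", conn, if_exists="replace", index=False)
    conn.close()

def run_pipeline() -> None:
    serve(transform(ingest()))  # ingestion -> transformation -> serving

if __name__ == "__main__":
    run_pipeline()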