Data Engineering Quick Reference

Databases
┣ Relational Database : A database that stores data in tables with a defined
schema

┣ NoSQL Database : A database that does not use the traditional relational
database model

┣ SQL : A language used to interact with relational databases

┣ MongoDB : A popular NoSQL database that stores data in JSON-like documents

┣ Cassandra : A popular NoSQL database that is designed for high scalability
and availability

┣ Redis : An in-memory key-value store used for caching and other
high-performance use cases

┗ Amazon RDS : A managed relational database service provided by AWS
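
The difference between a table with a defined schema and a JSON-like document can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in relational database; the `users` table, its columns, and the `user_doc` dictionary are hypothetical and do not come from the original reference.

```python
import sqlite3

# In-memory relational database; the "users" table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, name, country) VALUES (?, ?, ?)",
    [(1, "Ada", "UK"), (2, "Linus", "FI"), (3, "Grace", "US")],
)

# SQL queries rely on the schema declared above.
rows = conn.execute(
    "SELECT name FROM users WHERE country = ? ORDER BY name", ("UK",)
).fetchall()
print(rows)

# A document store keeps the same entity as a JSON-like document instead;
# fields may be nested and may differ from one document to the next.
user_doc = {"_id": 1, "name": "Ada", "country": "UK", "tags": ["math", "engines"]}
print(user_doc["tags"])
```

The relational version rejects rows that do not fit the declared columns, while the document version trades that guarantee for flexibility.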

Data Warehousing
┣ Data Warehouse : A large, centralized repository of data from various
sources used for business intelligence and decision-making

┣ OLAP : Online Analytical Processing, used for analyzing data from a data
warehouse

┣ Star Schema : A type of data model used in data warehousing that consists
of a central fact table surrounded by dimension tables

┣ Snowflake Schema : A variation of the star schema that uses normalized
dimension tables

┣ Slowly Changing Dimensions (SCD) : A technique used for managing changes to
dimensional data over time

┣ ETL : Extract, Transform, Load, the process of moving data from source
systems into a data warehouse

┗ Amazon Redshift : A cloud-based data warehousing service provided by AWS
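
A star schema and a typical OLAP-style aggregation over it might look like the sketch below. It uses sqlite3 only so that the example runs anywhere; the table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table that references two dimension tables (hypothetical names).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# Analytical query: join the fact table to its dimensions, then aggregate.
totals = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(totals)
```

In a snowflake schema, `dim_product` would itself be split further, for example into a separate category table referenced by key.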

BY: Waleed Mousa


Big Data Technologies
┣ Hadoop : An open-source framework for distributed storage and processing
of large data sets

┣ Spark : An open-source distributed computing system used for big data
processing and analytics

┣ Hive : A data warehousing system built on top of Hadoop for querying and
analysis of large data sets

┣ Pig : A high-level platform for creating MapReduce programs used for
large-scale data processing

┣ MapReduce : A programming model for processing large data sets across
clusters of computers

┣ Impala : A distributed SQL query engine for processing big data sets
stored in Hadoop

┣ Kafka : A distributed streaming platform used for building real-time data
pipelines and streaming applications

┗ Amazon EMR : A managed big data processing service provided by AWS
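
The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single Python process, so it shows only the map, shuffle, and reduce phases, not the distribution across a cluster that Hadoop provides; all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a real framework is free to run them on different machines.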

Data Processing
┣ Data Pipeline : A set of processes used to extract, transform, and load
data from various sources into a destination system

┣ ETL Tools : Tools used to automate the extraction, transformation, and
loading of data

┣ Apache Airflow : An open-source platform used for creating, scheduling, and
monitoring data pipelines

┣ AWS Glue : A fully managed ETL service provided by AWS

┣ Talend : A popular open-source ETL tool used for data integration and
management

┗ Data Governance : The process of managing the availability, usability,
integrity, and security of data
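
A data pipeline in the extract, transform, load sense can be sketched with nothing but the standard library. The sample CSV, the `orders` table, and the cleaning rules below are assumptions chosen for illustration; tools such as Airflow or AWS Glue would schedule and run steps like these rather than replace them.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: the second row is missing its amount.
RAW_CSV = "order_id,amount,currency\n1,10.50,usd\n2,,usd\n3,7.25,eur\n"

def extract(text):
    """Extract: parse the source into dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, cast types, normalize currency codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes each one testable on its own and easy to map onto individual tasks in an orchestrator.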

Data Streaming
┣ Data Stream : A continuous flow of data that is processed as it arrives, in
real time or near-real time

┣ Apache Kafka : A distributed streaming platform used for building
real-time data pipelines and streaming applications

┣ Kinesis : A fully managed data streaming service provided by AWS

┣ Flume : A distributed system for collecting, aggregating, and moving large
amounts of log data from different sources to a centralized data store

┣ Spark Streaming : An extension of the Spark API used for processing
real-time data streams

┗ Flink : An open-source distributed stream processing framework used for
real-time data processing
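
Stream processors typically group an unbounded flow of events into windows before aggregating. The sketch below imitates a tumbling (fixed-size, non-overlapping) window with a plain Python generator. It assumes events arrive in timestamp order and it skips empty windows, both simplifications that engines such as Flink or Spark Streaming do not require; the event source and function names are hypothetical.

```python
from collections import Counter

def event_stream():
    """Hypothetical source yielding (timestamp_seconds, key) events in time order."""
    yield from [
        (0, "click"), (3, "view"), (9, "click"),
        (12, "click"), (19, "view"), (21, "view"),
    ]

def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size window closes."""
    current_window = None
    counts = Counter()
    for timestamp, key in events:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Flush the last, still-open window when the source ends.
        yield current_window * window_seconds, dict(counts)

results = list(tumbling_window_counts(event_stream(), 10))
print(results)
```

Results are emitted as soon as a window closes, so a consumer can act on them while later events are still arriving.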

Data Visualization
┣ Tableau : A popular data visualization tool used for creating interactive
dashboards and reports

┣ Power BI : A business analytics service provided by Microsoft used for
creating interactive visualizations and reports

┣ D3.js : A JavaScript library used for creating interactive data
visualizations in the browser

┣ ggplot2 : A popular data visualization package for R

┗ matplotlib : A popular data visualization package for Python

Cloud Technologies
┣ AWS : Amazon Web Services, a cloud computing platform provided by Amazon

┣ Azure : A cloud computing platform provided by Microsoft

┣ GCP : Google Cloud Platform, a cloud computing platform provided by Google

┣ Docker : A containerization platform used for packaging and deploying
applications

┗ Kubernetes : An open-source container orchestration platform used for
automating the deployment, scaling, and management of containerized
applications

Data Governance
┣ Data Security : The process of ensuring data privacy and confidentiality

┣ Data Quality : The process of ensuring data accuracy, consistency, and
completeness

┣ Data Lineage : The process of tracking data from its source to its
destination

┣ Data Discovery : The process of identifying data assets and their
relationships

┗ Data Stewardship : The process of managing data assets and their use
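
Data quality rules are often automated as simple checks that run before data is published. The sketch below counts completeness and uniqueness violations in a batch of records; the `quality_report` function, the field names, and the sample records are hypothetical, and real rule sets are usually much broader.

```python
def quality_report(records, required_fields):
    """Count simple completeness and uniqueness violations in a batch."""
    missing = sum(
        1
        for record in records
        if any(record.get(field) in (None, "") for field in required_fields)
    )
    ids = [record.get("id") for record in records]
    duplicates = len(ids) - len(set(ids))
    return {
        "rows": len(records),
        "missing_required": missing,
        "duplicate_ids": duplicates,
    }

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},               # incomplete: empty required field
    {"id": 2, "email": "b@example.com"},  # duplicate id
]
report = quality_report(records, ["id", "email"])
print(report)
```

A pipeline could refuse to load a batch, or raise an alert, when any of these counts exceeds an agreed threshold.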

Data Modeling
┣ Entity-Relationship Model : A data modeling technique used to represent
the relationships between entities in a system

┣ Dimensional Modeling : A data modeling technique used in data warehousing
for creating optimized data structures

┣ Data Flow Diagrams : Diagrams that represent the flow of data through a
system

┣ UML : Unified Modeling Language, a standardized language used for
object-oriented modeling

┗ ERD Tools : Tools used for creating entity-relationship diagrams and other
data modeling diagrams

Data Integration
┣ Data Federation : The process of combining data from multiple sources into
a single virtual view

┣ Data Replication : The process of copying data from one database to
another, often in near-real time

┣ Data Synchronization : The process of ensuring that data is consistent
across multiple systems

┣ Extract, Load, Transform (ELT) : A data integration approach where data is
extracted from source systems, loaded in raw form into the target system, and
then transformed inside that target system

┗ Change Data Capture (CDC) : A data integration technique where changes in
source systems are captured and propagated to target systems in near-real time
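
The idea behind change data capture can be shown by comparing two snapshots of a table and replaying the differences on a replica. Note the swap: production CDC tools usually read the database's transaction log instead of diffing snapshots, so this is a simplified stand-in, and the function names and sample rows are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and emit change events."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

def apply_changes(target, changes):
    """Propagate captured change events to a target copy."""
    for operation, key, row in changes:
        if operation == "delete":
            target.pop(key, None)
        else:
            target[key] = row
    return target

before = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Grace"}}

changes = capture_changes(before, after)
replica = apply_changes(dict(before), changes)
print(changes)
print(replica == after)
```

Only the changed rows travel to the target, which is what lets CDC keep systems aligned without repeated full copies.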

Data Architecture
┣ Data Lake : A storage repository that holds a vast amount of raw data
(structured, semi-structured, or unstructured) in its native format

┣ Data Mart : A subset of a data warehouse that is designed for a specific
business function or department

┣ Data Hub : A centralized repository of data that serves as a single source
of truth for an organization

┣ Data Virtualization : A data integration technique that allows data to be
accessed and manipulated in real time without copying or moving it

┗ Master Data Management (MDM) : The process of creating and maintaining a
single, trusted view of key business data

Machine Learning
┣ Supervised Learning : A type of machine learning where the algorithm is
trained on labeled data

┣ Unsupervised Learning : A type of machine learning where the algorithm is
trained on unlabeled data

┣ Reinforcement Learning : A type of machine learning where the algorithm
learns from feedback in an environment

┣ Deep Learning : A type of machine learning that uses neural networks to
model complex relationships in data

┣ TensorFlow : An open-source machine learning framework developed by Google

┣ PyTorch : An open-source machine learning framework originally developed by
Facebook (now Meta)

┗ Scikit-learn : A popular machine learning library for Python
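
Supervised learning in its simplest form is fitting a function to labeled examples and then predicting on new inputs. The sketch below fits a straight line by ordinary least squares in plain Python; the `fit_line` helper and the toy data are assumptions for illustration, and libraries such as scikit-learn provide far more general estimators.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labeled training data: inputs xs with known targets ys (here y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the target for an unseen input
print(slope, intercept, prediction)
```

The labels in `ys` are what make this supervised; an unsupervised method would receive only `xs` and look for structure in it.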

Data Science
┣ Statistical Analysis : The process of analyzing data to uncover
relationships and patterns

┣ Data Exploration : The process of identifying patterns and trends in data

┣ Predictive Modeling : The process of using data to make predictions about
future events

┣ Time Series Analysis : The process of analyzing data that is collected
over time

┣ Spatial Analysis : The process of analyzing data that is related to
geographic locations

┣ Data Visualization : The process of representing data graphically

┗ Data Mining : The process of discovering patterns and relationships in
large datasets
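
A common first step in time series analysis is smoothing the series to expose its trend. The sketch below computes a simple trailing moving average; the `moving_average` function and the `daily_orders` values are made up for illustration.

```python
def moving_average(values, window):
    """Return the trailing moving average for each full window of values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

daily_orders = [10, 12, 11, 15, 20, 18]  # hypothetical daily counts
smoothed = moving_average(daily_orders, 3)
print(smoothed)
```

The output is shorter than the input by `window - 1` points, because the first full window only completes at the third observation.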

Programming Languages
┣ Python : A popular programming language used for data engineering and
machine learning

┣ Java : A popular programming language used for building enterprise-level
applications and big data technologies

┣ Scala : A programming language used for building big data technologies and
data streaming applications

┣ SQL : A language used for interacting with relational databases

┗ R : A programming language used for statistical computing and data
analysis


Cloud Computing Services


┣ EC2 : Elastic Compute Cloud, a virtual server service provided by AWS

┣ S3 : Simple Storage Service, a scalable object storage service provided by
AWS

┣ Lambda : A serverless compute service provided by AWS

┣ CloudFormation : A service provided by AWS for modeling and setting up
cloud resources

┣ Azure VM : A virtual machine provided by Azure

┣ Azure Blob Storage : A scalable object storage service provided by Azure

┣ Azure Functions : A serverless compute service provided by Azure

┣ Azure Resource Manager : A service provided by Azure for modeling and
setting up cloud resources

┣ GCE : Google Compute Engine, a virtual machine provided by GCP

┣ Cloud Storage : A scalable object storage service provided by GCP

┣ Cloud Functions : A serverless compute service provided by GCP

┗ Cloud Deployment Manager : A service provided by GCP for modeling and
setting up cloud resources


Resources
┣ Data Engineering with Python by Paul Crickard III

┣ Designing Data-Intensive Applications by Martin Kleppmann

┣ Data Engineering Cookbook by Andreas Kretz

┣ Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax

┗ AWS Certified Data Analytics Study Guide by Richard Wentk

Useful Technologies
┣ Apache Airflow : A platform used for creating, scheduling, and monitoring
data pipelines

┣ Apache Kafka : A distributed streaming platform used for building
real-time data pipelines and streaming applications

┣ Spark : An open-source distributed computing system used for big data
processing and analytics

┣ Docker : A containerization platform used for packaging and deploying
applications

┗ Kubernetes : An open-source container orchestration platform used for
automating the deployment, scaling, and management of containerized
applications
