Basic Terms of DATA ENGINEERING
Hadoop: An open-source framework for distributed storage and processing of large data sets using a
cluster of computers.
Spark: A fast and general-purpose cluster computing system for big data processing, providing high-level
APIs in Java, Scala, Python, and R.
MapReduce: A programming model for processing and generating large datasets that parallelizes the
computation across a distributed cluster.
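The model is easiest to see in the classic word-count example. The sketch below is a single-process Python illustration of the map, shuffle, and reduce phases; Hadoop runs these same phases in parallel across a cluster.

```python
from collections import defaultdict

# Illustrative only: a single-process sketch of the MapReduce model,
# not the Hadoop engine itself.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values (here, by summing counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```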
Apache Kafka: A distributed streaming platform that enables the handling of real-time data feeds and
event processing.
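Conceptually, Kafka decouples producers from consumers through an ordered log of events. The toy sketch below uses an in-memory queue to stand in for a topic; real Kafka is distributed, partitioned, and durable, so this shows only the flow of events, not the system.

```python
from queue import Queue

# Conceptual sketch only: a queue standing in for a Kafka topic, with a
# producer appending events and a consumer reading them in order.
topic = Queue()

def produce(events):
    """Producer: append events to the end of the topic."""
    for e in events:
        topic.put(e)

def consume():
    """Consumer: read events off the topic in arrival order."""
    out = []
    while not topic.empty():
        out.append(topic.get())
    return out

produce([{"event": "click", "user": 1}, {"event": "view", "user": 2}])
consumed = consume()
print(len(consumed))  # 2
```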
SQL: Structured Query Language used for managing and querying relational databases.
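A minimal relational example, using Python's built-in sqlite3 module (the table and data are made up):

```python
import sqlite3

# An in-memory relational database; the orders table is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Standard SQL: aggregate order totals per customer.
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 37.5), ('bob', 12.5)]
```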
NoSQL Databases: Databases that provide a mechanism for storage and retrieval of data that is modeled
in ways other than the tabular relations used in relational databases.
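For instance, in a document-oriented NoSQL store each record is free-form JSON rather than a fixed-schema row. The sketch below models that with plain Python dicts (the field names are illustrative): note the two records in the same collection carry different fields.

```python
# Illustrative document-model records: no fixed schema, so each
# document can have its own fields and nesting.
users = [
    {"_id": 1, "name": "alice", "tags": ["admin", "eng"]},
    {"_id": 2, "name": "bob", "address": {"city": "Pune"}},
]

# Query by traversing documents rather than joining tables;
# missing fields are handled with a default instead of a NULL column.
admins = [u["name"] for u in users if "admin" in u.get("tags", [])]
print(admins)  # ['alice']
```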
Data Warehousing: A process for collecting, managing, and storing large volumes of data from different
sources for business intelligence and reporting.
ETL (Extract, Transform, Load): A process of extracting data from source systems, transforming it into a
more usable format, and loading it into a target system.
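A toy end-to-end ETL run over in-memory data (in practice the source would be files or a database, and the target a warehouse):

```python
import csv
import io
import json

# Extract: read rows from the CSV "source" (here, an in-memory string).
raw = "name,amount\nalice,10\nbob,oops\nalice,5\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and drop rows that fail validation.
clean = []
for row in rows:
    try:
        clean.append({"name": row["name"], "amount": int(row["amount"])})
    except ValueError:
        continue  # "oops" is not a valid amount, so that row is skipped

# Load: serialize the clean records as JSON lines for the target system.
target = [json.dumps(rec) for rec in clean]
print(len(target))  # 2
```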
Machine Learning: A subset of artificial intelligence that focuses on the development of algorithms
allowing systems to learn and make predictions or decisions based on data.
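As a minimal example of "learning from data", the sketch below fits a one-variable linear model with the closed-form least-squares formulas and uses it to predict a new value (the data points are made up):

```python
# Fit y ~ slope * x + intercept by ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for a single feature.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# "Decision based on data": predict y for an unseen input x = 5.
prediction = slope * 5.0 + intercept
print(round(slope, 2))  # 1.96
```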
Data Modeling: The process of creating a visual representation of data structures, relationships, and
rules to improve understanding and communication within an organization.
Data Governance: The overall management of the availability, usability, integrity, and security of data
used in an enterprise.
Hadoop Ecosystem Components
HDFS (Hadoop Distributed File System): A distributed file system designed to store vast amounts of data
across multiple machines, providing high fault tolerance.
MapReduce: A programming model and processing engine for distributed computing on large datasets.
It divides tasks into smaller sub-tasks, processes them in parallel, and combines the results.
YARN (Yet Another Resource Negotiator): A resource management layer that manages and schedules
resources across the cluster, allowing different applications to share resources efficiently.
Hadoop Common: A set of utilities, libraries, and APIs that provide support for Hadoop modules. It
includes tools for managing and interacting with Hadoop clusters.
Hive: A data warehousing and SQL-like query language tool built on top of Hadoop. It facilitates querying
and managing large datasets stored in Hadoop.
Pig: A high-level platform and scripting language built on top of Hadoop to simplify the development of
complex data processing tasks using a scripting language called Pig Latin.
HBase: A distributed, scalable, and NoSQL database that provides real-time read and write access to
large datasets. It is modeled after Google's Bigtable.
Spark: While not exclusive to Hadoop, Apache Spark is often integrated into Hadoop clusters. It's a fast,
in-memory data processing engine with elegant and expressive development APIs.
Oozie: A workflow scheduler system for managing Hadoop jobs. It allows users to define workflows to
coordinate the execution of Hadoop jobs.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data into HDFS.
AWS Data Engineering Services
Amazon S3 (Simple Storage Service): A scalable object storage service designed to store and retrieve any
amount of data. It is commonly used for data storage in various data engineering workflows.
Amazon Redshift: A fully managed, petabyte-scale data warehouse service that enables high-
performance analysis using standard SQL queries.
AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and
load data for analysis. It automatically generates ETL code.
AWS Data Pipeline: A web service for orchestrating and automating the movement and transformation
of data between different AWS services and on-premises data sources.
Amazon EMR (Elastic MapReduce): A cloud-based big data platform that enables processing of vast
amounts of data quickly using popular frameworks such as Apache Spark and Hadoop.
AWS Lambda: A serverless computing service that allows running code without provisioning or
managing servers. It can be used to build scalable and cost-effective data processing pipelines.
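A Python Lambda function is simply a handler with the standard (event, context) signature, which makes it easy to invoke locally for testing. The event shape below is illustrative, not a fixed AWS format.

```python
import json

# A minimal Lambda-style handler sketch; (event, context) is the standard
# Python Lambda interface. The "records" event shape is made up for the demo.
def handler(event, context=None):
    records = event.get("records", [])
    total = sum(r.get("value", 0) for r in records)
    return {"statusCode": 200, "body": json.dumps({"total": total})}

# Local invocation with a sample event (no AWS account needed).
resp = handler({"records": [{"value": 3}, {"value": 4}]})
print(resp["statusCode"])  # 200
```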
Amazon Kinesis: A platform for streaming data on AWS, offering services like Kinesis Data Streams for
real-time data streaming and Kinesis Data Firehose for loading streaming data to data stores.
Amazon DynamoDB: A fully managed NoSQL database service that provides fast and predictable
performance with seamless scalability. It is suitable for handling large amounts of data.
Amazon Athena: A serverless query service that enables querying data stored in Amazon S3 using
standard SQL. It eliminates the need for setting up and managing a database.
AWS Glue DataBrew: A visual data preparation tool that makes it easy to clean, normalize, and
transform data for analytics and machine learning.
Amazon RDS (Relational Database Service): A fully managed relational database service that supports
multiple database engines, making it easier to set up, operate, and scale a relational database.
Amazon QuickSight: A fully managed business intelligence (BI) service that makes it easy to create and
visualize interactive dashboards in the AWS Cloud.
AWS Glue Elastic Views: A serverless service that allows you to easily create materialized views that
combine and replicate data across multiple data stores.
AWS Step Functions: A serverless function orchestrator that allows you to build and coordinate multiple
AWS services into serverless workflows for your applications.
Amazon Aurora: A MySQL and PostgreSQL-compatible relational database built for the cloud that
combines the performance and availability of high-end commercial databases with the simplicity and
cost-effectiveness of open-source databases.
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service): A fully managed service that makes it easy to deploy, secure, and scale OpenSearch and Elasticsearch clusters for log and event data analysis.
AWS DataSync: A service that simplifies, automates, and accelerates moving data between on-premises
storage and Amazon S3 or Amazon EFS.
AWS Lake Formation: A service that makes it easy to set up, secure, and manage a data lake, allowing
you to centrally define and enforce security, governance, and auditing policies.
AWS DMS (Database Migration Service): A service that makes it easier to migrate relational databases,
data warehouses, NoSQL databases, and other types of data stores.
Azure Data Engineering Services
Azure Blob Storage: A scalable and secure object storage solution for the cloud, suitable for storing and
serving large amounts of unstructured data.
Azure Synapse Analytics (formerly SQL Data Warehouse): A cloud-based analytics service that brings
together big data and data warehousing for fast querying and analysis.
Azure Data Factory: A cloud-based data integration service that allows you to create data-driven
workflows for orchestrating and automating data movement and data transformation.
Azure Databricks: An Apache Spark-based analytics platform that accelerates big data analytics and
provides collaborative and interactive tools for data scientists and engineers.
Azure HDInsight: A fully managed cloud service that makes it easy to process large amounts of data
using popular open-source frameworks such as Apache Hadoop, Spark, Hive, and more.
Azure SQL Database: A fully managed relational database service that provides high availability, security,
and scalability for applications.
Azure Stream Analytics: A real-time analytics service designed for stream processing on large amounts
of fast-streaming data.
Azure Cosmos DB: A globally distributed, multi-model database service designed for highly responsive
and scalable applications.
Azure Data Lake Storage: A scalable and secure data lake solution for big data analytics, capable of
handling massive amounts of data in various formats.
Azure Logic Apps: A service that allows you to automate workflows and integrate services, applications,
and systems across on-premises and cloud environments.
Azure Data Explorer (Kusto): A fast and highly scalable data exploration service for querying and analyzing large volumes of diverse data in real time using the Kusto Query Language (KQL).
Azure Data Share: A service that facilitates secure sharing of data between organizations, departments,
or different Azure subscriptions.
Delta Lake (Azure Databricks): An optimized storage layer that brings ACID transactions to Apache Spark and big data workloads.
Azure Data Catalog: A fully managed service that serves as a centralized repository for metadata and
data discovery, making it easier to find, understand, and use data.
Azure Data Box: A family of secure, ruggedized appliances that help you transfer large amounts of data
to and from Azure when network-based methods are not feasible.
Azure Time Series Insights: A fully managed analytics, storage, and visualization service for analyzing
time-series data from IoT devices.
Azure Data Lake Analytics: A distributed analytics service that allows you to run big data analytics
without managing the infrastructure.
Azure Event Hubs: A scalable and real-time event processing service that can ingest and process large
volumes of events from various sources.
Azure Cognitive Search: A fully managed search-as-a-service solution that allows you to quickly add a
robust search capability to your applications.
Azure Database Migration Service: A service that simplifies the migration of on-premises databases to Azure, supporting various source and target database engines.
Azure Data Box Edge: An appliance that combines edge compute capabilities with data transfer, allowing
processing at the edge before transferring data to Azure.
Azure Blockchain Service: A fully managed blockchain service that simplifies the formation,
management, and governance of consortium blockchain networks.
Azure Purview: A unified data governance service that helps you discover and understand the data you
have across your organization.
Azure Machine Learning Studio: A collaborative, drag-and-drop tool for building, testing, and deploying
machine learning models.
Google Cloud Platform (GCP) Data Engineering Services
BigQuery: A fully managed, serverless data warehouse that enables super-fast SQL queries using the
processing power of Google's infrastructure.
Cloud Dataflow: A fully managed stream and batch processing service for real-time data processing and
analytics.
Cloud Dataprep: A cloud service for exploring, cleaning, and preparing structured and unstructured data
for analysis, visualization, and machine learning.
Cloud Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache
Hadoop clusters.
Cloud Composer: A fully managed workflow orchestration service that allows you to author, schedule,
and monitor pipelines using Apache Airflow.
Cloud Spanner: A globally distributed, horizontally scalable, and strongly consistent database service
designed for both operational and analytical workloads.
Cloud SQL: A fully managed relational database service that supports various database engines,
including MySQL, PostgreSQL, and SQL Server.
Cloud Pub/Sub: A messaging service that enables the creation of event-driven systems and real-time
analytics.
Cloud Storage Transfer Service: A service that allows you to securely and efficiently transfer large
amounts of data from on-premises or other cloud providers to Google Cloud Storage.
Cloud Datastore: A NoSQL document database for building web, mobile, and server applications.
Cloud Bigtable: A fully managed, highly scalable NoSQL database service designed to handle large
amounts of data with low-latency access.
Cloud AutoML: A suite of machine learning products that enables developers with limited machine
learning expertise to train high-quality custom models.
Cloud IoT Core: A fully managed service that allows you to securely connect and manage IoT devices at
scale.
Data Studio (now Looker Studio): A free business intelligence and data visualization tool that allows users to create interactive reports and dashboards.
Looker (now part of Google Cloud): A data exploration and business intelligence platform that allows organizations to create and share reports and dashboards.
Vertex AI: A platform that unifies the capabilities of the Google Cloud AI portfolio, providing tools for
building, deploying, and managing machine learning models.
Cloud AI Platform: A managed service for building, training, and deploying machine learning models
using popular frameworks like TensorFlow and scikit-learn.
Cloud Identity and Access Management (IAM): A centralized access control service for managing user
permissions and roles across GCP services.