Basic Terms of DATA ENGINEERING

The document provides an overview of key terms and concepts in data engineering across various platforms, including Big Data, AWS, Azure, and GCP. It covers essential technologies and services such as Hadoop, Spark, Amazon S3, Azure Synapse Analytics, and Google BigQuery, among others. Each term is briefly defined to highlight its significance in the field of data engineering.

Uploaded by samreddy022

DATA ENGINEERING

Basic terms in BIG DATA ENGINEERING


Big Data: Large and complex datasets that exceed the capabilities of traditional data processing
applications.

Hadoop: An open-source framework for distributed storage and processing of large data sets using a
cluster of computers.

Spark: A fast and general-purpose cluster computing system for big data processing, providing high-level
APIs in Java, Scala, Python, and R.

MapReduce: A programming model for processing and generating large datasets that parallelizes the
computation across a distributed cluster.

Apache Kafka: A distributed streaming platform that enables the handling of real-time data feeds and
event processing.

SQL: Structured Query Language used for managing and querying relational databases.
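As a rough illustration of querying relational data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name `orders`, its columns, and the helper `total_spend_by_customer` are assumptions made up for this example.

```python
import sqlite3


def total_spend_by_customer(orders):
    """Load (customer, amount) rows into an in-memory table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, chosen only for this illustration.
        conn.execute(
            "CREATE TABLE orders (customer TEXT NOT NULL, amount REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", orders
        )
        # GROUP BY collapses the rows of each customer into one summed row.
        rows = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders GROUP BY customer ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    sample = [("alice", 20.0), ("bob", 5.5), ("alice", 4.5)]
    print(total_spend_by_customer(sample))
```

The same SELECT statement would run largely unchanged on most relational engines; only the connection code is specific to SQLite.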

NoSQL Databases: Databases that provide a mechanism for storage and retrieval of data that is modeled
in ways other than the tabular relations used in relational databases.

Data Warehousing: A process for collecting, managing, and storing large volumes of data from different
sources for business intelligence and reporting.

ETL (Extract, Transform, Load): A process of extracting data from source systems, transforming it into a
more usable format, and loading it into a target system.
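The three ETL stages can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the CSV sample, the field names, and the functions `extract`, `transform`, and `load` are all hypothetical, and the "target system" is just a Python list.

```python
import csv
import io

# Hypothetical source data: a small CSV export with messy values.
RAW_CSV = """name,signup_date,country
 Alice ,2024-01-05,us
Bob,2024-02-11,DE
,2024-03-01,fr
"""


def extract(text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalize country codes, drop rows without a name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue  # skip incomplete records
        cleaned.append(
            {
                "name": name,
                "signup_date": row["signup_date"],
                "country": row["country"].strip().upper(),
            }
        )
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the target store and report how many were written."""
    target.extend(rows)
    return len(rows)


if __name__ == "__main__":
    warehouse = []  # stand-in for a real target system
    written = load(transform(extract(RAW_CSV)), warehouse)
    print(written, warehouse)
```

Keeping each stage as a separate function makes it easy to test the transformation rules without touching the source or the target.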

Machine Learning: A subset of artificial intelligence that focuses on the development of algorithms
allowing systems to learn and make predictions or decisions based on data.

Data Modeling: The process of creating a visual representation of data structures, relationships, and
rules to improve understanding and communication within an organization.

Data Governance: The overall management of the availability, usability, integrity, and security of data
used in an enterprise.

HDFS (Hadoop Distributed File System): A distributed file system designed to store vast amounts of data
across multiple machines, providing high fault tolerance.

MapReduce: A programming model and processing engine for distributed computing on large datasets.
It divides tasks into smaller sub-tasks, processes them in parallel, and combines the results.
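The divide, process-in-parallel, and combine steps can be imitated on a single machine. The word count below is a toy sketch of the programming model only: it uses a thread pool instead of a cluster, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions for this example, not Hadoop APIs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across the mapped chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run the map step over the chunks concurrently, then shuffle and reduce."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters", "data data"]))
```

In a real cluster the chunks would be file blocks on different machines and the shuffle would move data over the network, but the shape of the computation is the same.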

YARN (Yet Another Resource Negotiator): A resource management layer that manages and schedules
resources across the cluster, allowing different applications to share resources efficiently.

Hadoop Common: A set of utilities, libraries, and APIs that provide support for Hadoop modules. It
includes tools for managing and interacting with Hadoop clusters.

Hive: A data warehouse tool built on top of Hadoop that provides a SQL-like query language (HiveQL). It
facilitates querying and managing large datasets stored in Hadoop.

Pig: A high-level platform built on top of Hadoop that simplifies the development of complex data
processing tasks through a scripting language called Pig Latin.

HBase: A distributed, scalable NoSQL database that provides real-time read and write access to
large datasets. It is modeled after Google's Bigtable.

Spark: While not exclusive to Hadoop, Apache Spark is often integrated into Hadoop clusters. It's a fast,
in-memory data processing engine with elegant and expressive development APIs.

Oozie: A workflow scheduler system for managing Hadoop jobs. It allows users to define workflows to
coordinate the execution of Hadoop jobs.

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data into HDFS.

Basic terms in AWS DATA ENGINEERING

Amazon S3 (Simple Storage Service): A scalable object storage service designed to store and retrieve any
amount of data. It is commonly used for data storage in various data engineering workflows.

Amazon Redshift: A fully managed, petabyte-scale data warehouse service that enables high-
performance analysis using standard SQL queries.

AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and
load data for analysis. It automatically generates ETL code.

AWS Data Pipeline: A web service for orchestrating and automating the movement and transformation
of data between different AWS services and on-premises data sources.

Amazon EMR (Elastic MapReduce): A cloud-based big data platform that enables processing of vast
amounts of data quickly using popular frameworks such as Apache Spark and Hadoop.

AWS Lambda: A serverless computing service that allows running code without provisioning or
managing servers. It can be used to build scalable and cost-effective data processing pipelines.

Amazon Kinesis: A platform for streaming data on AWS, offering services like Kinesis Data Streams for
real-time data streaming and Kinesis Data Firehose for loading streaming data to data stores.

Amazon DynamoDB: A fully managed NoSQL database service that provides fast and predictable
performance with seamless scalability. It is suitable for handling large amounts of data.

Amazon Athena: A serverless query service that enables querying data stored in Amazon S3 using
standard SQL. It eliminates the need for setting up and managing a database.

AWS Glue DataBrew: A visual data preparation tool that makes it easy to clean, normalize, and
transform data for analytics and machine learning.

Amazon RDS (Relational Database Service): A fully managed relational database service that supports
multiple database engines, making it easier to set up, operate, and scale a relational database.

Amazon QuickSight: A fully managed business intelligence (BI) service that makes it easy to create and
visualize interactive dashboards in the AWS Cloud.

AWS Glue Elastic Views: A serverless service that allows you to easily create materialized views that
combine and replicate data across multiple data stores.

AWS Step Functions: A serverless function orchestrator that allows you to build and coordinate multiple
AWS services into serverless workflows for your applications.

Amazon Aurora: A MySQL and PostgreSQL-compatible relational database built for the cloud that
combines the performance and availability of high-end commercial databases with the simplicity and
cost-effectiveness of open-source databases.

Amazon OpenSearch Service (formerly Amazon Elasticsearch Service): A fully managed service that makes
it easy to deploy, secure, and scale OpenSearch and Elasticsearch clusters for log and event data analysis.

AWS DataSync: A service that simplifies, automates, and accelerates moving data between on-premises
storage and Amazon S3 or Amazon EFS.

AWS Lake Formation: A service that makes it easy to set up, secure, and manage a data lake, allowing
you to centrally define and enforce security, governance, and auditing policies.

AWS DMS (Database Migration Service): A service that makes it easier to migrate relational databases,
data warehouses, NoSQL databases, and other types of data stores.

Basic terms in AZURE DATA ENGINEERING

Azure Blob Storage: A scalable and secure object storage solution for the cloud, suitable for storing and
serving large amounts of unstructured data.

Azure Synapse Analytics (formerly SQL Data Warehouse): A cloud-based analytics service that brings
together big data and data warehousing for fast querying and analysis.

Azure Data Factory: A cloud-based data integration service that allows you to create data-driven
workflows for orchestrating and automating data movement and data transformation.

Azure Databricks: An Apache Spark-based analytics platform that accelerates big data analytics and
provides collaborative and interactive tools for data scientists and engineers.

Azure HDInsight: A fully managed cloud service that makes it easy to process large amounts of data
using popular open-source frameworks such as Apache Hadoop, Spark, Hive, and more.

Azure SQL Database: A fully managed relational database service that provides high availability, security,
and scalability for applications.

Azure Stream Analytics: A real-time analytics service designed for stream processing on large amounts
of fast-streaming data.

Azure Cosmos DB: A globally distributed, multi-model database service designed for highly responsive
and scalable applications.

Azure Data Lake Storage: A scalable and secure data lake solution for big data analytics, capable of
handling massive amounts of data in various formats.

Azure Logic Apps: A service that allows you to automate workflows and integrate services, applications,
and systems across on-premises and cloud environments.

Azure Data Explorer: A fast and highly scalable data exploration service for querying and analyzing large
volumes of diverse data.

Azure Data Share: A service that facilitates secure sharing of data between organizations, departments,
or different Azure subscriptions.

Azure Databricks Delta Lake: An optimized storage layer that brings ACID transactions to Apache Spark
and big data workloads.

Azure Data Catalog: A fully managed service that serves as a centralized repository for metadata and
data discovery, making it easier to find, understand, and use data.

Azure SQL Data Warehouse: The earlier name of what is now the dedicated SQL pool in Azure Synapse
Analytics, a cloud-based data warehouse service that allows you to scale, pause, and resume data
warehouse processing power independently of storage.

Azure Data Explorer (Kusto): A fast and scalable data exploration service for real-time analysis of large
datasets using a powerful query language known as Kusto Query Language (KQL).

Azure Logic Apps: A serverless platform for building and automating workflows that integrate with
various services and systems.

Azure Data Box: A family of secure, ruggedized appliances that help you transfer large amounts of data
to and from Azure when network-based methods are not feasible.

Azure Time Series Insights: A fully managed analytics, storage, and visualization service for analyzing
time-series data from IoT devices.

Azure Data Lake Analytics: A distributed analytics service that allows you to run big data analytics
without managing the infrastructure.

Azure Event Hubs: A scalable and real-time event processing service that can ingest and process large
volumes of events from various sources.

Azure Data Share: A service that enables organizations to share data across different Azure subscriptions
and with external organizations securely.

Azure Data Catalog: A fully managed service that serves as a central repository for metadata and
facilitates data discovery, understanding, and usage.

Azure AI Search (formerly Azure Cognitive Search): A fully managed search-as-a-service solution that allows you to quickly add a
robust search capability to your applications.

Azure Database Migration Service: A service that simplifies the migration of on-premises databases to Azure,
supporting various source and target database engines.

Azure Data Box Edge (now Azure Stack Edge): An appliance that combines edge compute capabilities with data transfer, allowing
processing at the edge before transferring data to Azure.

Azure Blockchain Service: A fully managed blockchain service that simplified the formation,
management, and governance of consortium blockchain networks. Microsoft retired the service in 2021.

Microsoft Purview (formerly Azure Purview): A unified data governance service that helps you discover and understand the data you
have across your organization.

Azure Machine Learning Studio: A collaborative, drag-and-drop tool for building, testing, and deploying
machine learning models.

Basic terms in GCP DATA ENGINEERING


Google Cloud Storage: A scalable and durable object storage solution for the cloud, suitable for storing
and retrieving large amounts of unstructured data.

BigQuery: A fully managed, serverless data warehouse that enables super-fast SQL queries using the
processing power of Google's infrastructure.

Cloud Dataflow: A fully managed stream and batch processing service for real-time data processing and
analytics.

Cloud Dataprep: A cloud service for exploring, cleaning, and preparing structured and unstructured data
for analysis, visualization, and machine learning.

Cloud Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache
Hadoop clusters.

Cloud Composer: A fully managed workflow orchestration service that allows you to author, schedule,
and monitor pipelines using Apache Airflow.
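Workflow orchestrators of this kind run tasks in an order that respects their dependencies. The sketch below shows that core idea with Python's standard graphlib module. It is not the Airflow or Cloud Composer API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks):
    """Run each task only after all of its upstream tasks have finished.

    dependencies maps a task name to the set of task names it depends on;
    tasks maps a task name to a zero-argument callable.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # execute the task
        order.append(name)
    return order


if __name__ == "__main__":
    log = []
    # Hypothetical three-step pipeline: extract, then transform, then load.
    tasks = {
        "extract": lambda: log.append("extracted"),
        "transform": lambda: log.append("transformed"),
        "load": lambda: log.append("loaded"),
    }
    dependencies = {"transform": {"extract"}, "load": {"transform"}}
    print(run_pipeline(dependencies, tasks))
    print(log)
```

A managed orchestrator adds scheduling, retries, and monitoring on top of this ordering logic.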

Cloud Spanner: A globally distributed, horizontally scalable, and strongly consistent database service
designed for both operational and analytical workloads.

Cloud SQL: A fully managed relational database service that supports various database engines,
including MySQL, PostgreSQL, and SQL Server.

Cloud Pub/Sub: A messaging service that enables the creation of event-driven systems and real-time
analytics.

Cloud Storage Transfer Service: A service that allows you to securely and efficiently transfer large
amounts of data from on-premises or other cloud providers to Google Cloud Storage.

Cloud Datastore: A NoSQL document database for building web, mobile, and server applications.

Cloud Bigtable: A fully managed, highly scalable NoSQL database service designed to handle large
amounts of data with low-latency access.

Cloud AutoML: A suite of machine learning products that enables developers with limited machine
learning expertise to train high-quality custom models.

Cloud IoT Core: A fully managed service for securely connecting and managing IoT devices at scale.
Google retired the service in August 2023.

Looker Studio (formerly Data Studio): A free business intelligence and data visualization tool that allows
users to create interactive reports and dashboards.

Looker (now part of Google Cloud): A data exploration and business intelligence platform that allows
organizations to create and share reports and dashboards.

Vertex AI: A platform that unifies the capabilities of the Google Cloud AI portfolio, providing tools for
building, deploying, and managing machine learning models.

Cloud Composer: A fully managed workflow orchestration service based on Apache Airflow that helps
automate, schedule, and manage complex data workflows.

Cloud AI Platform: A managed service for building, training, and deploying machine learning models
using popular frameworks like TensorFlow and scikit-learn. Its capabilities have since been folded into
Vertex AI.

Cloud Identity and Access Management (IAM): A centralized access control service for managing user
permissions and roles across GCP services.
