Basic Terms of DATA ENGINEERING
Hadoop: An open-source framework for distributed storage and processing of large data sets using a
cluster of computers.
Spark: A fast and general-purpose cluster computing system for big data processing, providing high-level
APIs in Java, Scala, Python, and R.
MapReduce: A programming model for processing and generating large datasets that parallelizes the
computation across a distributed cluster.
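The model is easiest to see in the classic word-count example. The sketch below is a single-process Python illustration of the map, shuffle, and reduce phases; Hadoop runs these same phases in parallel across a cluster.

```python
from collections import defaultdict

# Illustrative only: a single-process sketch of the MapReduce model,
# not the Hadoop engine itself.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values (here, by summing counts)."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```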
Apache Kafka: A distributed streaming platform that enables the handling of real-time data feeds and
event processing.
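Conceptually, Kafka decouples producers from consumers through an ordered log of events. The toy sketch below uses an in-memory queue to stand in for a topic; real Kafka is distributed, partitioned, and durable, so this shows only the flow of events, not the system.

```python
from queue import Queue

# Conceptual sketch only: a queue standing in for a Kafka topic, with a
# producer appending events and a consumer reading them in order.
topic = Queue()

def produce(events):
    """Producer: append events to the end of the topic."""
    for e in events:
        topic.put(e)

def consume():
    """Consumer: read events off the topic in arrival order."""
    out = []
    while not topic.empty():
        out.append(topic.get())
    return out

produce([{"event": "click", "user": 1}, {"event": "view", "user": 2}])
consumed = consume()
print(len(consumed))  # 2
```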
SQL: Structured Query Language used for managing and querying relational databases.
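A minimal relational example, using Python's built-in sqlite3 module (the table and data are made up):

```python
import sqlite3

# An in-memory relational database; the orders table is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Standard SQL: aggregate order totals per customer.
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 37.5), ('bob', 12.5)]
```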
NoSQL Databases: Databases that provide a mechanism for storage and retrieval of data that is modeled
in ways other than the tabular relations used in relational databases.
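For instance, in a document-oriented NoSQL store each record is free-form JSON rather than a fixed-schema row. The sketch below models that with plain Python dicts (the field names are illustrative): note the two records in the same collection carry different fields.

```python
# Illustrative document-model records: no fixed schema, so each
# document can have its own fields and nesting.
users = [
    {"_id": 1, "name": "alice", "tags": ["admin", "eng"]},
    {"_id": 2, "name": "bob", "address": {"city": "Pune"}},
]

# Query by traversing documents rather than joining tables;
# missing fields are handled with a default instead of a NULL column.
admins = [u["name"] for u in users if "admin" in u.get("tags", [])]
print(admins)  # ['alice']
```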
Data Warehousing: A process for collecting, managing, and storing large volumes of data from different
sources for business intelligence and reporting.
ETL (Extract, Transform, Load): A process of extracting data from source systems, transforming it into a
more usable format, and loading it into a target system.
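A toy end-to-end ETL run over in-memory data (in practice the source would be files or a database, and the target a warehouse):

```python
import csv
import io
import json

# Extract: read rows from the CSV "source" (here, an in-memory string).
raw = "name,amount\nalice,10\nbob,oops\nalice,5\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and drop rows that fail validation.
clean = []
for row in rows:
    try:
        clean.append({"name": row["name"], "amount": int(row["amount"])})
    except ValueError:
        continue  # "oops" is not a valid amount, so that row is skipped

# Load: serialize the clean records as JSON lines for the target system.
target = [json.dumps(rec) for rec in clean]
print(len(target))  # 2
```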
Machine Learning: A subset of artificial intelligence that focuses on the development of algorithms
allowing systems to learn and make predictions or decisions based on data.
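As a minimal example of "learning from data", the sketch below fits a one-variable linear model with the closed-form least-squares formulas and uses it to predict a new value (the data points are made up):

```python
# Fit y ~ slope * x + intercept by ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for a single feature.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# "Decision based on data": predict y for an unseen input x = 5.
prediction = slope * 5.0 + intercept
print(round(slope, 2))  # 1.96
```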
Data Modeling: The process of creating a visual representation of data structures, relationships, and
rules to improve understanding and communication within an organization.
Data Governance: The overall management of the availability, usability, integrity, and security of data
used in an enterprise.
Hadoop Ecosystem Components
HDFS (Hadoop Distributed File System): A distributed file system designed to store vast amounts of data
across multiple machines, providing high fault tolerance.
MapReduce: A programming model and processing engine for distributed computing on large datasets.
It divides tasks into smaller sub-tasks, processes them in parallel, and combines the results.
YARN (Yet Another Resource Negotiator): A resource management layer that manages and schedules
resources across the cluster, allowing different applications to share resources efficiently.
Hadoop Common: A set of utilities, libraries, and APIs that provide support for Hadoop modules. It
includes tools for managing and interacting with Hadoop clusters.
Hive: A data warehousing and SQL-like query language tool built on top of Hadoop. It facilitates querying
and managing large datasets stored in Hadoop.
Pig: A high-level platform and scripting language built on top of Hadoop to simplify the development of
complex data processing tasks using a scripting language called Pig Latin.
HBase: A distributed, scalable, and NoSQL database that provides real-time read and write access to
large datasets. It is modeled after Google's Bigtable.
Spark: While not exclusive to Hadoop, Apache Spark is often integrated into Hadoop clusters. It's a fast,
in-memory data processing engine with elegant and expressive development APIs.
Oozie: A workflow scheduler system for managing Hadoop jobs. It allows users to define workflows to
coordinate the execution of Hadoop jobs.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of log data into HDFS.
AWS Data Engineering Services
Amazon S3 (Simple Storage Service): A scalable object storage service designed to store and retrieve any
amount of data. It is commonly used for data storage in various data engineering workflows.
Amazon Redshift: A fully managed, petabyte-scale data warehouse service that enables high-
performance analysis using standard SQL queries.
AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and
load data for analysis. It automatically generates ETL code.
AWS Data Pipeline: A web service for orchestrating and automating the movement and transformation
of data between different AWS services and on-premises data sources.
Amazon EMR (Elastic MapReduce): A cloud-based big data platform that enables processing of vast
amounts of data quickly using popular frameworks such as Apache Spark and Hadoop.
AWS Lambda: A serverless computing service that allows running code without provisioning or
managing servers. It can be used to build scalable and cost-effective data processing pipelines.
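A Python Lambda function is simply a handler with the standard (event, context) signature, which makes it easy to invoke locally for testing. The event shape below is illustrative, not a fixed AWS format.

```python
import json

# A minimal Lambda-style handler sketch; (event, context) is the standard
# Python Lambda interface. The "records" event shape is made up for the demo.
def handler(event, context=None):
    records = event.get("records", [])
    total = sum(r.get("value", 0) for r in records)
    return {"statusCode": 200, "body": json.dumps({"total": total})}

# Local invocation with a sample event (no AWS account needed).
resp = handler({"records": [{"value": 3}, {"value": 4}]})
print(resp["statusCode"])  # 200
```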
Amazon Kinesis: A platform for streaming data on AWS, offering services like Kinesis Data Streams for
real-time data streaming and Kinesis Data Firehose for loading streaming data to data stores.
Amazon DynamoDB: A fully managed NoSQL database service that provides fast and predictable
performance with seamless scalability. It is suitable for handling large amounts of data.
Amazon Athena: A serverless query service that enables querying data stored in Amazon S3 using
standard SQL. It eliminates the need for setting up and managing a database.
AWS Glue DataBrew: A visual data preparation tool that makes it easy to clean, normalize, and
transform data for analytics and machine learning.
Amazon RDS (Relational Database Service): A fully managed relational database service that supports
multiple database engines, making it easier to set up, operate, and scale a relational database.
Amazon QuickSight: A fully managed business intelligence (BI) service that makes it easy to create and
visualize interactive dashboards in the AWS Cloud.
AWS Glue Elastic Views: A serverless service that allows you to easily create materialized views that
combine and replicate data across multiple data stores.
AWS Step Functions: A serverless function orchestrator that allows you to build and coordinate multiple
AWS services into serverless workflows for your applications.
Amazon Aurora: A MySQL and PostgreSQL-compatible relational database built for the cloud that
combines the performance and availability of high-end commercial databases with the simplicity and
cost-effectiveness of open-source databases.
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service): A fully managed service that makes it easy to deploy, secure, and scale OpenSearch and Elasticsearch clusters for log and event data analysis.
AWS DataSync: A service that simplifies, automates, and accelerates moving data between on-premises
storage and Amazon S3 or Amazon EFS.
AWS Lake Formation: A service that makes it easy to set up, secure, and manage a data lake, allowing
you to centrally define and enforce security, governance, and auditing policies.
AWS DMS (Database Migration Service): A service that makes it easier to migrate relational databases,
data warehouses, NoSQL databases, and other types of data stores.
Azure Data Engineering Services
Azure Blob Storage: A scalable and secure object storage solution for the cloud, suitable for storing and
serving large amounts of unstructured data.
Azure Synapse Analytics (formerly SQL Data Warehouse): A cloud-based analytics service that brings
together big data and data warehousing for fast querying and analysis.
Azure Data Factory: A cloud-based data integration service that allows you to create data-driven
workflows for orchestrating and automating data movement and data transformation.
Azure Databricks: An Apache Spark-based analytics platform that accelerates big data analytics and
provides collaborative and interactive tools for data scientists and engineers.
Azure HDInsight: A fully managed cloud service that makes it easy to process large amounts of data
using popular open-source frameworks such as Apache Hadoop, Spark, Hive, and more.
Azure SQL Database: A fully managed relational database service that provides high availability, security,
and scalability for applications.
Azure Stream Analytics: A real-time analytics service designed for stream processing on large amounts
of fast-streaming data.
Azure Cosmos DB: A globally distributed, multi-model database service designed for highly responsive
and scalable applications.
Azure Data Lake Storage: A scalable and secure data lake solution for big data analytics, capable of
handling massive amounts of data in various formats.
Azure Logic Apps: A service that allows you to automate workflows and integrate services, applications,
and systems across on-premises and cloud environments.
Azure Data Explorer (Kusto): A fast and highly scalable data exploration service for querying and analyzing large volumes of diverse data in real time using the Kusto Query Language (KQL).
Azure Data Share: A service that facilitates secure sharing of data between organizations, departments,
or different Azure subscriptions.
Delta Lake (Azure Databricks): An optimized storage layer that brings ACID transactions to Apache Spark and big data workloads.
Azure Data Catalog: A fully managed service that serves as a centralized repository for metadata and
data discovery, making it easier to find, understand, and use data.
Azure Data Box: A family of secure, ruggedized appliances that help you transfer large amounts of data
to and from Azure when network-based methods are not feasible.
Azure Time Series Insights: A fully managed analytics, storage, and visualization service for analyzing
time-series data from IoT devices.
Azure Data Lake Analytics: A distributed analytics service that allows you to run big data analytics
without managing the infrastructure.
Azure Event Hubs: A scalable and real-time event processing service that can ingest and process large
volumes of events from various sources.
Azure Cognitive Search: A fully managed search-as-a-service solution that allows you to quickly add a
robust search capability to your applications.
Azure Database Migration Service: A service that simplifies the migration of on-premises databases to Azure, supporting various source and target database engines.
Azure Data Box Edge: An appliance that combines edge compute capabilities with data transfer, allowing
processing at the edge before transferring data to Azure.
Azure Blockchain Service: A fully managed blockchain service that simplifies the formation,
management, and governance of consortium blockchain networks.
Azure Purview: A unified data governance service that helps you discover and understand the data you
have across your organization.
Azure Machine Learning Studio: A collaborative, drag-and-drop tool for building, testing, and deploying
machine learning models.
Google Cloud Platform (GCP) Data Engineering Services
BigQuery: A fully managed, serverless data warehouse that enables super-fast SQL queries using the
processing power of Google's infrastructure.
Cloud Dataflow: A fully managed stream and batch processing service for real-time data processing and
analytics.
Cloud Dataprep: A cloud service for exploring, cleaning, and preparing structured and unstructured data
for analysis, visualization, and machine learning.
Cloud Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache
Hadoop clusters.
Cloud Composer: A fully managed workflow orchestration service that allows you to author, schedule,
and monitor pipelines using Apache Airflow.
Cloud Spanner: A globally distributed, horizontally scalable, and strongly consistent database service
designed for both operational and analytical workloads.
Cloud SQL: A fully managed relational database service that supports various database engines,
including MySQL, PostgreSQL, and SQL Server.
Cloud Pub/Sub: A messaging service that enables the creation of event-driven systems and real-time
analytics.
Cloud Storage Transfer Service: A service that allows you to securely and efficiently transfer large
amounts of data from on-premises or other cloud providers to Google Cloud Storage.
Cloud Datastore: A NoSQL document database for building web, mobile, and server applications.
Cloud Bigtable: A fully managed, highly scalable NoSQL database service designed to handle large
amounts of data with low-latency access.
Cloud AutoML: A suite of machine learning products that enables developers with limited machine
learning expertise to train high-quality custom models.
Cloud IoT Core: A fully managed service that allows you to securely connect and manage IoT devices at
scale.
Data Studio (now Looker Studio): A free business intelligence and data visualization tool that allows users to create interactive reports and dashboards.
Looker (now part of Google Cloud): A data exploration and business intelligence platform that allows organizations to create and share reports and dashboards.
Vertex AI: A platform that unifies the capabilities of the Google Cloud AI portfolio, providing tools for
building, deploying, and managing machine learning models.
Cloud AI Platform: A managed service for building, training, and deploying machine learning models
using popular frameworks like TensorFlow and scikit-learn.
Cloud Identity and Access Management (IAM): A centralized access control service for managing user
permissions and roles across GCP services.