Module 5: Data Engineering
Khwaish Shahani, Vivekanand Education Society's Institute of Technology
Uploaded 09 December 2024
[Q1] Write a Short Note on Data Engineering
[1] Data engineering is the discipline of designing, building, and managing the
infrastructure required to collect, store, and analyze large volumes of data.
[2] It enables organizations to transform raw data into useful insights and is the backbone
of data science, machine learning, and business intelligence.
[3] Data engineering is a set of operations to make data available and usable to data
scientists, data analysts, business intelligence (BI) developers, and other specialists
within an organization.
[4] It takes dedicated experts – data engineers – to design and build systems for gathering
and storing data at scale, as well as preparing it for further analysis.
Data Engineer
[1] A data engineer is a professional who prepares and manages big data that is then
analyzed by data analysts and scientists.
[2] They are responsible for designing, building, integrating, and maintaining data from
several sources, thus designing the infrastructure for the data collected in the
database.
Data Scientist
A data scientist makes use of machine learning, deep learning, and inferential
modeling techniques to find correlations between data and create predictive models,
on the basis of which they can develop recommendation systems useful for the business.
Data Analyst
A data analyst is responsible for:
screening and cleaning/polishing the raw data collected;
data preparation;
understanding business metrics and problems;
visualization of data through reports and graphs;
identification of trends and useful suggestions to aid strategic business decisions.
[Q5] Data Scientist vs Data Engineer vs Data Analyst
Roles and Responsibilities

Data Analyst:
preprocessing and data gathering;
emphasis on representing data via reporting and visualization;
responsible for statistical analysis and data interpretation;
ensures data acquisition and maintenance;
optimizes statistical efficiency and quality.

Data Engineer:
develops, tests, and maintains architectures;
understands programming and its complexity;
deploys ML and statistical models;
builds pipelines for various ETL operations;
ensures data accuracy and flexibility.

Data Scientist:
responsible for developing operating models;
carries out data analytics and optimization using deep learning and machine learning;
involved in strategic planning for data analytics;
integrates data and performs ad-hoc analysis;
fills the gap between customers and stakeholders.
[Q2] Write a Short Note on the Data Engineering Process
The data engineering process covers a sequence of tasks that turn a large amount of raw
data into a practical product meeting the needs of analysts, data scientists, machine
learning engineers, and others.
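The sequence of tasks described above can be sketched as a minimal extract–transform–load (ETL) pipeline. This is an illustrative sketch only; the function names, field names, and raw source here are hypothetical, not part of any specific tool.

```python
import csv
import io

def extract(raw_csv: str) -> list:
    """Extract: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(records: list) -> list:
    """Transform: clean types and drop incomplete rows."""
    cleaned = []
    for r in records:
        if not r.get("amount"):  # skip rows missing a required field
            continue
        cleaned.append({"product": r["product"].strip().lower(),
                        "amount": float(r["amount"])})
    return cleaned

def load(records: list, sink: list) -> None:
    """Load: write the cleaned records to a storage sink."""
    sink.extend(records)

# Hypothetical raw source with one incomplete row
raw = "product,amount\nWidget,10.5\nGadget,\nWidget,4.5\n"
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # the incomplete Gadget row is dropped during transform
```

In a real pipeline each stage would typically read from and write to external systems (databases, object storage, message queues) rather than in-memory lists, but the extract → transform → load structure is the same.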
[Q3] Write a Short Note on Data Ingestion Techniques
Data ingestion involves collecting large amounts of raw data from various sources
into one place and then processing it later. Batch ingestion processes data in
accumulated chunks, while stream ingestion processes events as they arrive; the
two are often combined.
Use Cases: When both real-time insights and historical data analysis are required (e.g.,
in recommendation systems, social media analytics, and fraud detection).
Tools: Hadoop (batch processing) + Apache Kafka (stream processing).
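The contrast between the batch and stream styles that Hadoop and Kafka represent can be sketched without either tool, using plain Python. This is a conceptual sketch, not Hadoop or Kafka API code; the event shape is hypothetical.

```python
def batch_ingest(source, batch_size):
    """Batch style: accumulate events and hand them off in chunks."""
    batch = []
    for event in source:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def stream_ingest(source, handler):
    """Stream style: process each event individually as it arrives."""
    for event in source:
        handler(event)

events = [{"id": i, "clicks": i * 2} for i in range(5)]

batches = list(batch_ingest(events, batch_size=2))
print(len(batches))  # 3 batches: sizes 2, 2, 1

seen = []
stream_ingest(events, seen.append)
print(len(seen))  # 5 events handled one at a time
```

A hybrid ("lambda"-style) setup runs both paths over the same events: the stream path serves real-time insights while the batch path feeds historical analysis.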
Best Practices for Data Ingestion
Validate Data Quality:
Apply checks to ensure the quality and validity of incoming data. Common issues such
as missing values, duplicates, or incorrect data formats should be addressed during
ingestion.
Tools like Apache NiFi and Talend can help automate validation and transformation
during ingestion.
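The three common issues above (missing values, duplicates, incorrect formats) can be screened with a simple validation gate. A minimal sketch, assuming hypothetical records with `id` and `amount` fields:

```python
def validate(records):
    """Split incoming records into valid rows and rejects,
    checking for missing values, duplicate ids, and bad formats."""
    valid, rejects, seen_ids = [], [], set()
    for r in records:
        if r.get("id") is None or r.get("amount") in (None, ""):
            rejects.append(r)          # missing value
        elif r["id"] in seen_ids:
            rejects.append(r)          # duplicate
        else:
            try:
                r = {**r, "amount": float(r["amount"])}  # format check
            except (TypeError, ValueError):
                rejects.append(r)      # incorrect data format
                continue
            seen_ids.add(r["id"])
            valid.append(r)
    return valid, rejects

rows = [{"id": 1, "amount": "9.5"},
        {"id": 1, "amount": "9.5"},   # duplicate id
        {"id": 2, "amount": ""},      # missing value
        {"id": 3, "amount": "oops"}]  # bad format
good, bad = validate(rows)
print(len(good), len(bad))  # 1 3
```

In practice the rejects would be routed to a quarantine table or dead-letter queue for inspection rather than silently dropped.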
Scalability:
Use scalable solutions that can handle data volume growth over time, especially with
increasing data sources and higher data velocity.
Consider using cloud-based storage solutions (e.g., Amazon S3, Google Cloud Storage)
for dynamic scaling capabilities.
Incremental Ingestion:
Rather than ingesting entire datasets repeatedly, use techniques that ingest only the
new or updated data (delta loads). This is particularly useful for batch ingestion and
significantly reduces resource usage.
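A common way to implement delta loads is a high-watermark check: record the latest change timestamp seen on each run and ingest only rows newer than it. A minimal sketch with a hypothetical `updated_at` column:

```python
from datetime import datetime

# Hypothetical source table with an updated_at column
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]

def delta_load(rows, watermark):
    """Ingest only rows changed since the last successful run,
    and advance the watermark for the next run."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

watermark = datetime(2024, 1, 3)  # persisted after the previous run
fresh, watermark = delta_load(source, watermark)
print([r["id"] for r in fresh])  # [2, 3] -- only new or updated rows
```

The watermark itself must be stored durably (e.g., in a metadata table) and only advanced after the load succeeds, so a failed run can be safely retried.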
Metadata Management:
Maintain clear metadata around the ingestion process, such as data source details,
ingestion timestamps, and transformations applied. This makes the ingestion pipeline
more transparent and easier to troubleshoot.
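The metadata fields listed above can be captured as a small record per ingestion run. This sketch is illustrative; the source name, transformation labels, and field choices are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest_metadata(source_name, payload):
    """Build a metadata record for one ingestion run:
    source details, timestamp, size, checksum, and applied transforms."""
    return {
        "source": source_name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "row_count": payload.count(b"\n"),
        "checksum": hashlib.sha256(payload).hexdigest(),
        "transformations": ["strip_whitespace", "lowercase_keys"],
    }

meta = ingest_metadata("crm_export", b"a,b\n1,2\n3,4\n")
print(json.dumps(meta, indent=2))
```

Storing such records in a metadata table lets you answer "where did this data come from, when, and what was done to it?" when troubleshooting the pipeline.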
[Q4] Write a Short Note on Data Storage Management
Data storage is an essential component of any data architecture, and two of the most
common solutions for storing and managing large volumes of data are data lakes and
data warehouses.
Although both serve to store data, they differ significantly in terms of structure, use
cases, and functionality.
Data Lake
A data lake is a centralized repository that allows you to store vast amounts
of raw data in its original format, whether structured, semi-structured, or
unstructured.
The idea behind a data lake is to provide a flexible environment for storing
data without requiring upfront structuring or processing.
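This "store raw now, structure later" idea is often called schema on read. A minimal sketch using the local filesystem to stand in for a lake (the partition layout and payload are hypothetical):

```python
import json
import tempfile
from pathlib import Path

# A data lake keeps raw data in its original format; structure is
# imposed only when the data is read ("schema on read").
lake = Path(tempfile.mkdtemp())

# Land raw, untransformed payloads under source/date partitions
raw = '{"user": "a", "clicks": 3}\n{"user": "b", "clicks": 7}\n'
path = lake / "clickstream" / "2024-12-09" / "events.json"
path.parent.mkdir(parents=True)
path.write_text(raw)

# Structure is applied later, at analysis time
records = [json.loads(line) for line in path.read_text().splitlines()]
print(sum(r["clicks"] for r in records))  # 10
```

In production the lake would typically be object storage such as Amazon S3 or Google Cloud Storage rather than a local directory, but the write-raw/parse-on-read pattern is the same.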
Data Warehouse
A data warehouse, in contrast, is a centralized repository for structured,
processed data that has already been cleaned and modeled for a specific purpose,
typically reporting and analysis.
OLAP and OLAP Cubes
OLAP, or Online Analytical Processing, refers to a computing approach that allows
users to analyze multidimensional data. An OLAP cube organizes measures along
multiple dimensions, and such structures enable efficient processing and advanced
analysis of vast amounts of varied data.
For example, a sales department report would include dimensions such as product,
region, sales representative, sales amount, and month.
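A cube over the sales example above can be sketched in plain Python: facts carry dimension values plus a measure, and typical OLAP operations like roll-up (aggregate along chosen dimensions) and slice (fix one dimension) become simple aggregations. This is a conceptual sketch, not a real OLAP engine; the sample data is hypothetical.

```python
from collections import defaultdict

# Sales facts with some of the dimensions mentioned above
sales = [
    {"product": "Widget", "region": "North", "month": "Jan", "amount": 100},
    {"product": "Widget", "region": "South", "month": "Jan", "amount": 150},
    {"product": "Gadget", "region": "North", "month": "Feb", "amount": 200},
]

def roll_up(facts, *dims):
    """Aggregate the sales measure along the chosen dimensions."""
    cube = defaultdict(float)
    for f in facts:
        cube[tuple(f[d] for d in dims)] += f["amount"]
    return dict(cube)

# Roll-up by product (collapsing region and month)
print(roll_up(sales, "product"))  # {('Widget',): 250.0, ('Gadget',): 200.0}

# Slice: fix one dimension (month = Jan), then aggregate the rest
jan = [f for f in sales if f["month"] == "Jan"]
print(roll_up(jan, "region"))     # {('North',): 100.0, ('South',): 150.0}
```

Real OLAP engines precompute and index these aggregates across all dimension combinations, which is what makes interactive analysis of large warehouses fast.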