Data Engineering Notes
Agenda:
Data
Data Pipeline
Data Storage
Big Data
Types of Data
- Categorical
- Numerical
  - Continuous
    - Ratio: measurable with a true zero (e.g., height)
Data Pipeline
Data Storage for Analytics
| | Data Lake | Database | Data Warehouse |
| --- | --- | --- | --- |
| Cost | $ | $$ | $$$ |
| Use Cases | Exploratory analysis; Machine learning; Data mining; Data science research | Real-time data processing; High transactional throughput; Strong data consistency | Reporting; Analytics; BI |
Data Integrity
A → Atomicity: The entire transaction takes place at once or doesn't happen at all
C → Consistency: The database must be consistent before and after the transaction
I → Isolation: Multiple transactions occur independently, without interference
D → Durability: The changes of a successful transaction persist even if a system failure occurs
Cloud Computing
The 6 V’s
- Volume: refers to the quantity of data produced or gathered (Gartner). Storage and processing needs must be addressed.
- Variety: the complexity of the data; different data types and sources. Data engineers need tools that handle a variety of data formats in different locations.
- Velocity: the speed at which data is generated and processed; the frequency of generation, handling, recording, and publishing; the shift from batch processing to online processing.
- Veracity (data quality): origin; reliability of source and previous processing; volatility; validity.
- Valence: implies connectedness; two data items are connected when there is some relationship between them. Valence = data connections / total number of possible connections.
- Value: the ultimate goal of data science and engineering. Value refers to the valuable insights gained from the ability to investigate and identify new patterns and trends from high-volume cross-platform systems.
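A quick worked example of the valence ratio (the numbers here are invented for illustration): among $n = 5$ data items there are $\binom{5}{2} = 10$ possible pairwise connections, so 3 actual connections give

$$\text{valence} = \frac{\text{data connections}}{\text{possible connections}} = \frac{3}{10} = 0.3$$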
Agenda:
Data Architecture
5. Cost optimization
6. Sustainability
Hadoop Used when the data volume exceeds the available memory; ideal for data exploration, filtration, sampling, and summarization. Components include HDFS, MapReduce, and YARN
Apache Spark Faster alternative to MapReduce that can handle batch and real-time data and is flexible to work with HDFS and Cassandra
Apache Cassandra Processes structured data with fault-tolerance on cloud infrastructure and commodity hardware
MongoDB NoSQL database management system that stores data in flexible, JSON-like documents, making it easy to handle and scale large volumes of diverse and unstructured data
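A minimal sketch of MongoDB's document model using the pymongo driver (assumes a MongoDB instance reachable on localhost; the database, collection, and field names are invented):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Documents are flexible, JSON-like records: no predefined schema required.
db.dogs.insert_one({"name": "Rex", "city": "Leeds", "hobbies": ["fetch"]})
db.dogs.insert_one({"name": "Ada", "vaccinated": True})  # different fields are fine

print(db.dogs.find_one({"name": "Rex"}))
```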
Data Legislation
The AI Act
The AI Act is a proposed European law on Artificial Intelligence, the first law on AI by a major regulator anywhere.
Example: Subliminal Manipulation
An inaudible sound is played in truck drivers' cabins to push them to drive longer than healthy and safe. AI is used to find the frequency maximising this effect on drivers, resulting in physical/psychological harm.
Example: General Purpose Social Scoring
An AI system identifies at-risk children in need of social care based on insignificant or irrelevant social 'misbehaviour' of parents.
Example: Remote biometric identification for law enforcement purposes in publicly accessible spaces (with exceptions)
All faces captured live by video cameras are checked, in real time, against a database to identify a terrorist.
- The AI Act requires high-quality, unbiased data for training AI systems, which necessitates robust data governance practices to ensure data accuracy, consistency, and fairness in big data environments.
- Organizations must maintain transparency about the data sources and processing methods used in AI models, leading to more rigorous documentation and auditing of big data processes.
- The AI Act enforces strict data privacy and security measures affecting how big data is collected, stored, and processed, ensuring compliance with privacy regulations.
Lecture 3: Data Models 1
Agenda:
Data sources
Constraints
Big Data can contain sensitive and valuable business information (assets)
Financial records
Medical History
Educational Records
Job History
The principle of least privilege requires that a person or system only be given the privileges and data necessary to complete the immediate tasks required of them, and nothing more.
Migrating to the cloud is not a security guarantee; it follows a shared responsibility model:
Provider → responsible for securing the underlying cloud infrastructure
User → responsible for securing the applications and systems in the cloud
Data Security
Data security protects digital information in the data pipeline from unauthorized access, corruption, or theft
Security measures are complemented by administrative access controls & organizational policies and procedures
Encryption
Data resiliency
Backups
Employee education
Data Encryption
Cryptographic Key → A string of characters (mathematically generated) that is fed to a cryptographic algorithm
(encryption/decryption) to secure data
Encryption using a single shared key is known as symmetric encryption; encryption using a key pair (public/private) is asymmetric encryption
Authentication → assumes only authorized users have the private keys in asymmetric encryption
```mermaid
sequenceDiagram
    participant Client
    participant Server
    Client->>Server: request public key
    Server-->>Client: public key
    Client->>Server: data encrypted with public key
```
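As a concrete illustration, a minimal symmetric-encryption sketch in Python, assuming the third-party `cryptography` package is installed (the message and variable names are invented):

```python
from cryptography.fernet import Fernet

# Symmetric encryption: the same key both encrypts and decrypts.
key = Fernet.generate_key()  # mathematically generated key string
cipher = Fernet(key)

token = cipher.encrypt(b"customer record #42")  # ciphertext, safe to store
plaintext = cipher.decrypt(token)               # requires the same key

assert plaintext == b"customer record #42"
```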
Data governance policies and procedures can then be established.
| | Characteristics | Resides in | Generated by | Typical applications | Examples |
| --- | --- | --- | --- | --- | --- |
| Structured Data | Predefined data models; usually text-only; easy to search | Relational databases; data warehouses | Humans or machines | Airline reservation systems; inventory control; CRM systems; ERP systems | Dates; phone numbers; social security numbers; credit card numbers; customer names; addresses; product names and numbers; transaction information |
| Unstructured Data | No predefined data model; may include text, images, sound, video, or other formats; difficult to search | Applications; NoSQL databases; data warehouses; data lakes | Humans or machines | Word processing; presentation software; email clients; tools for viewing or editing media | Text files; reports; email messages; audio files; video files; images; surveillance imagery |
Data Constraints
| Constraint | Example |
| --- | --- |
| Value | Age ≥ 18 |
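A value constraint like this can be enforced directly by the database; a minimal sketch using Python's built-in sqlite3 (the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A CHECK constraint enforces the value rule Age >= 18 at the database level.
conn.execute("CREATE TABLE applicants (name TEXT, age INTEGER CHECK (age >= 18))")

conn.execute("INSERT INTO applicants VALUES ('Ada', 32)")      # accepted
try:
    conn.execute("INSERT INTO applicants VALUES ('Tim', 15)")  # rejected
except sqlite3.IntegrityError as err:
    print("Constraint violated:", err)
```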
Agenda:
Definitions
UML Diagrams
Information Systems
A system has a boundary, outside of which are external entities (people, systems) that interact with the system in focus
Within this system are components (systems, people) that implement that system’s behaviour
These interactions and internal system behaviours involve data being exchanged, transformed and stored
Any system exists within a broader context that needs to be understood and the system should be defined in terms of
how it delivers functionality and services within this context, usually in the form of requirements
Levels of Data Modelling
Conceptual: What problems are involved with the business and require immediate solutions? Captures entities, attributes, and relationships.
Logical: What should the collection look like? Captures entities, attributes, key groups, primary keys, foreign keys, and their relationships to each other.
Physical: Specific to a DBMS. Includes table structures, column names, column data types, primary keys, column constraints, and relationships between tables.
Agenda:
```mermaid
---
title: The process for creating a DFD
---
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#0a0a0a',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#fff',
      'lineColor': '#fff',
      'tertiaryTextColor':'#fff'
    }
  }
}%%
graph LR
  step1["`Identify business data objects`"]
  step2["`Identify processes`"]
  step3["`Identify external entities`"]
  step4["`Tie diagram together`"]
  step1 --> step2 --> step3 --> step4
```
ERD Elements
Entity: A class of persons, places, objects, events, or concepts about which we need to capture and store data.
Attribute: A descriptive property or characteristic of an entity. Also known as an element, property, or field.
Cardinality: The minimum and maximum number of occurrences of one entity that may be related to a single occurrence of the other entity.
Cardinality Interpretation:
Exactly one (one and only one):
Minimum Instances: 1
Maximum Instances: 1
Zero or one:
Minimum Instances: 0
Maximum Instances: 1
One or more:
Minimum Instances: 1
Maximum Instances: many (>1)
Zero, one, or more:
Minimum Instances: 0
Maximum Instances: many (>1)
Agenda:
Systems Development
Hardware
Software
Data
Procedures
People
A system development lifecycle (SDLC) is a framework describing a process for understanding, planning, building,
testing, and deploying an information system
Example Roles
sequence diagrams
component diagrams
A linear approach describes a sequence of tasks that are completed in order, only moving to the next step once the previous step is complete (e.g. waterfall, V-model, incremental):
- breaks down the problem into distinct stages, each with a clear purpose
- everything is agreed in advance of being used, with no need to revisit later

An evolutionary approach evolves the solution through progressive versions, each more complete than the last, and often uses a prototyping approach to development (e.g. iterative, spiral):
- early delivery of value to the customer, either working versions or knowledge of project risk
- copes well with complex requirements that are fast changing or uncertain
Model Types
Incremental Life Cycle, Iterative Life Cycle, Boehm's Spiral Life Cycle

Factors in choosing a model:
- Complexity of problem
- Team experience
- Stability of requirements
- Customer involvement
- Uniqueness
[ERD example: employees, employee_roles, jobs, and states tables; attributes include employee_id, name, home_state, and state_code, with home_state referencing state_code in the states table]
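The recoverable structure of that diagram can be expressed as DDL; a minimal sketch using Python's built-in sqlite3, where the column types and exact keys are assumptions rather than details taken from the original figure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (
    state_code TEXT PRIMARY KEY,
    name       TEXT
);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT,
    home_state  TEXT REFERENCES states(state_code)  -- foreign key to states
);
CREATE TABLE jobs (
    job_id INTEGER PRIMARY KEY,
    name   TEXT
);
-- Associative table resolving a many-to-many between employees and jobs.
CREATE TABLE employee_roles (
    employee_id INTEGER REFERENCES employees(employee_id),
    job_id      INTEGER REFERENCES jobs(job_id),
    PRIMARY KEY (employee_id, job_id)
);
""")
```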
Lecture 8: RDBMS & SQL
Agenda:
ACID
SQL
Example Queries
ACID
ACID is a database design principle which defines how transactions are managed, specifically in a relational database.

- Atomicity: All operations will always succeed or fail completely; no partial transactions.
- Consistency: Ensures that the database will always remain in a consistent state by ensuring that only data that conforms to the constraints of the database schema can be written to the database.
- Isolation: Ensures that the results of a transaction are not visible to other operations until it is complete.
- Durability: Ensures that the results of an operation are permanent. Once a transaction has been committed, it cannot be rolled back, irrespective of any system failure.
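A minimal sketch of atomicity and durability using Python's built-in sqlite3 (the table and amounts are invented); either both statements commit or neither does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    # One logical transaction: move 50 from account 1 to account 2.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()      # durability: changes persist once committed
except sqlite3.Error:
    conn.rollback()    # atomicity: on failure, neither update survives
```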
CRUD Operations
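CRUD (Create, Read, Update, Delete) maps onto SQL's INSERT, SELECT, UPDATE, and DELETE; a minimal sketch with Python's built-in sqlite3 (the table and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("INSERT INTO pets (name) VALUES ('Rex')")          # Create
rows = conn.execute("SELECT id, name FROM pets").fetchall()     # Read
conn.execute("UPDATE pets SET name = 'Rexy' WHERE id = 1")      # Update
conn.execute("DELETE FROM pets WHERE id = 1")                   # Delete
conn.commit()
```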
Agenda:
Characteristics of NoSQL
Characteristics of NoSQL
| Characteristic | Description |
| --- | --- |
| Schema-less Data Model | Allows storage of data without a predefined schema, enabling flexibility in handling diverse and evolving data types. |
| Scale Out Rather Than Scale Up | Supports horizontal scaling by adding more nodes to the database cluster, as opposed to upgrading a single node's hardware. |
| Highly Available | Built on cluster-based technologies that ensure fault tolerance and high availability by replicating data across multiple nodes. |
| Lower Operational Costs | Often based on open-source platforms with no licensing fees and designed to run on cost-effective commodity hardware. |
| Eventual Consistency | Ensures that while data may not be immediately consistent across nodes after a write, it will eventually reach a consistent state. |
| BASE, Not ACID | Prioritizes availability and scalability (BASE model) over strict consistency (ACID model), with databases designed to eventually reach consistency. |
| API-Driven Data Access | Data access is typically through APIs, including RESTful APIs, with some databases offering SQL-like query capabilities. |
| Auto Sharding and Replication | Automatically partitions (shards) data across multiple nodes and replicates it to ensure high availability and support horizontal scaling. |
| Integrated Caching | Comes with built-in caching mechanisms, reducing the need for external caching solutions like Memcached. |
| Distributed Query Support | Maintains consistent query performance across multiple shards in a distributed environment. |
| Polyglot Persistence | Supports using multiple storage technologies (NoSQL and RDBMS) within the same application, allowing for a flexible approach to data persistence. |
| Aggregate-Focused | Stores de-normalized, aggregated data to eliminate the need for complex joins and mappings, although graph databases are an exception to this approach. |
CAP Theorem
A distributed data store can provide at most two of the following three guarantees at once: Consistency, Availability, and Partition tolerance.
BASE Principle
BASE is a database design principle based on the CAP theorem and leveraged by database systems that use
distributed technology.
basically available → database will always acknowledge a client’s request, either in the form of the requested
data or a success/failure notification
soft state → database may be in an inconsistent state when data is read; thus, the results may change if the
same data is requested again
eventual consistency → state in which reads by different clients, immediately following a write to the
database, may not return consistent results. Database only attains consistency once the changes have been
propagated to all nodes
Types of NoSQL databases:
- Key-value
- Document
- Column-family
- Graph
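To give a rough feel for the four data models, here is a sketch using plain Python literals (the records themselves are invented for illustration):

```python
# Key-value: an opaque value looked up by its key.
kv = {"session:42": "eyJ1c2VyIjoiYWRhIn0="}

# Document: self-describing, nested, schema-less records.
doc = {"_id": 1, "name": "Rex", "hobbies": ["fetch", "naps"]}

# Column-family: rows hold sparse sets of columns, grouped by family.
cf = {"row1": {"profile": {"name": "Rex"}, "stats": {"visits": 3}}}

# Graph: nodes plus explicitly stored relationships (edges).
graph = {"nodes": ["Rex", "Ada"], "edges": [("Ada", "owns", "Rex")]}
```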
Agenda:
Intro to GCP
Services
Coupons
ML Use Cases
Cloud Computing
Benefits:
scalability
flexibility
cost savings
global reach
Definition: Cloud computing is explained as getting tasks done using someone else's computers. Specifically,
using Google Cloud means utilizing Google's computers.
Capabilities: Google Cloud enables developers to build and host applications, store data, and analyze data using
Google's scalable and reliable infrastructure.
Overview: Google's data centers, located worldwide, house the computing, storage, and networking resources
that power Google's services like Search, Gmail, and YouTube.
Developer Access: Google Cloud shares these resources with developers, allowing them to build and run
applications on Google’s infrastructure.
Retail Example: A scenario is presented where a retailer needs to manage inventory, pricing, and demand
across thousands of stores. Handling seasonal spikes, especially during holidays, is highlighted as a significant
challenge.
On-Premise vs. Cloud: Managing an on-premise database would be costly and inefficient due to the need for
provisioning additional hardware. Conversely, using Google Cloud's managed services, like Cloud Spanner, allows
for efficient scaling and cost management.
1. Running Applications
Compute Engine: Provides virtual machines that run in Google data centers.
Cloud Run: Enables the deployment of containerized applications on a fully managed serverless platform.
App Engine: Supports the deployment of highly scalable web apps and back-end services.
2. Storing Data
Cloud Storage: Ideal for unstructured data such as images, videos, and audio files.
Cloud SQL: Offers managed versions of MySQL, Postgres, and SQL Server, allowing for familiar relational
database management without the hassle of self-management.
Cloud Firestore: A NoSQL, document-based, real-time database, popular in scenarios where up-to-date data is
crucial, like gaming.
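A minimal sketch of both storage services using Google's Python client libraries (assumes the `google-cloud-storage` and `google-cloud-firestore` packages are installed and credentials are configured; the bucket, collection, and field names are invented):

```python
from google.cloud import storage, firestore

# Cloud Storage: unstructured objects such as images, video, and audio.
bucket = storage.Client().bucket("my-example-bucket")
bucket.blob("photos/rex.jpg").upload_from_filename("rex.jpg")

# Cloud Firestore: NoSQL documents, kept up to date in real time.
db = firestore.Client()
db.collection("dogs").document("rex").set({"name": "Rex", "city": "Leeds"})
```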
3. Machine Learning and AI
Vision AI: Provides an API for image analysis, including object detection, landmark recognition, text extraction, and more.
Cloud Natural Language: Analyzes text to extract information about entities, sentiment, syntax, and content categorization.
Vertex AI: Google Cloud's platform for building, deploying, and managing custom machine learning models.
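A sketch of Vision AI's label detection via the `google-cloud-vision` Python client (the file name is invented; assumes the package is installed and credentials are configured):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("profile_photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Ask Vision AI which objects appear in the photo.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```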
Example use case, combining these services:
Cloud Firestore: Stores profile information such as dog names, locations, and hobbies.
Vision AI: Analyzes profile photos to detect objects like balls, stuffed animals, and bones, providing data insights.
Cloud Run: Deploys the application to the web, allowing for automatic scaling as the user base grows.