20 - 04 - 2024 Cheatsheet


Lecture 1.

Data Engineering: development, operation and maintenance of on-premise, cloud or hybrid data infrastructure to support downstream machine learning, analytics and reporting applications.
Databases: OLTP, row-based, CRUD, many concurrent accesses, writing and updating data in a transactional environment. Data warehouse (DW): OLAP, column-based, analytics for a small user base, read-heavy operations. Data mart: transformed, smaller DW for business units. Serverless pipeline: compute and storage resources are provisioned and managed by the cloud provider; users do not need to manage servers, infrastructure or scaling.
Lecture 2. Data format considerations: text vs binary (ease of use); splittable (for scalability and for distributed file systems); compressible (relevant for text formats; binary formats already have compression by definition); verbosity; supported data types; schema enforcement; schema evolution support; row- vs column-based storage; ecosystem support.
Text formats. CSV/TSV: splittable. XML: verbose, not splittable, DTD schema validation. JSON: compact, not splittable, compressible, heterogeneous arrays and objects as dictionaries. Binary formats (serialization converts complex data types to a stream of bytes for transfer). AVRO: row-based, splittable, compressible, schema evolution, JSON-defined schema. PARQUET: column-based (more compressible) format of the Hadoop ecosystem for analytics queries; write-once-read-many, splittable, schema evolution, restricted to batch processing.
Memory hierarchy: register, CPU cache, RAM, SSD, HDD, object storage, archival storage.
(Schema design example: Design 1 stores transaction info; Design 2 does not store quantity info and loses transaction info.)
Move data: t_md = size of data × (1 / speed of computation + 1 / network speed). Move compute: t_mc = size of program / network speed + size of data / (number of nodes × speed of computation).
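A back-of-the-envelope sketch of the two formulas above; all figures are illustrative assumptions, not values from the lecture.

```python
# Back-of-the-envelope comparison of moving data vs. moving compute.
# All numbers below are illustrative assumptions, not lecture values.

data_size = 1e12        # bytes of input data (1 TB)
program_size = 1e6      # bytes of program code (1 MB)
network_speed = 1e9     # bytes/second over the network
compute_speed = 5e9     # bytes/second processed per node
num_nodes = 100         # nodes available where the data already lives

# Move data: ship the data to one compute node, then process it there.
t_move_data = data_size * (1 / compute_speed + 1 / network_speed)

# Move compute: ship the program to the nodes; each processes its share.
t_move_compute = program_size / network_speed + data_size / (num_nodes * compute_speed)

print(f"move data:    {t_move_data:.1f} s")
print(f"move compute: {t_move_compute:.3f} s")
```

With these numbers moving compute wins by orders of magnitude, which is the point the lecture makes about scheduling work where the data already lives.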

Lecture 3. Owner entity and weak entity: one-to-many relationship; the weak entity must have total participation.
Each B-tree node fits into a disk block or page, allowing efficient retrieval of an entire node with a single disk read. A B-tree stores data for efficient retrieval in block-oriented storage; high spatial locality and pre-fetching improve I/O efficiency. B-trees are used in DBMS to implement indexing structures and store keys in sorted order.
In B+ trees, only the leaf nodes contain pointers to data records; the internal nodes contain only keys. B+ trees are more storage efficient and support range queries.
Query processing: SQL query -> parse & rewrite query -> logical query plan (relational algebra tree) -> physical query plan -> query execution -> disk. Statistics about the data inform query optimisation. Equivalence rules in relational algebra generate logically equivalent query plans: commutativity, associativity, and pushing predicates to be applied first, e.g. R⋈S = S⋈R and R∪S = S∪R. The Selinger algorithm finds the optimal join sequence in a left-deep join tree using dynamic programming: O(2^n) < O(n!).
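A minimal sketch of the dynamic-programming idea behind Selinger-style left-deep join ordering. The relation names, cardinalities and cost model are made-up placeholders, not the lecture's.

```python
from itertools import combinations

# Toy Selinger-style DP over left-deep join orders.
cards = {"R": 1000, "S": 5000, "T": 200}   # assumed relation cardinalities

def join_cost(left_card, right_name):
    # Placeholder cost model: proportional to the product of cardinalities.
    return left_card * cards[right_name]

# best[subset] = (cost, estimated output cardinality, join order)
best = {frozenset([r]): (0, c, [r]) for r, c in cards.items()}

for size in range(2, len(cards) + 1):
    for subset in combinations(cards, size):
        s = frozenset(subset)
        candidates = []
        for last in s:                    # left-deep: 'last' joins the rest
            rest = s - {last}
            cost, card, order = best[rest]
            candidates.append((cost + join_cost(card, last),
                               card * cards[last] // 10,   # crude size estimate
                               order + [last]))
        best[s] = min(candidates, key=lambda t: t[0])      # cheapest plan per subset

cost, _, order = best[frozenset(cards)]
print("best left-deep order:", " JOIN ".join(order), "cost:", cost)
```

The DP table has one entry per subset of relations, which is where the O(2^n) bound (versus O(n!) for enumerating all orders) comes from.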

Lecture 4. ACID (SQL): Atomicity: all or nothing. Consistency: consistent with constraints. Isolation: all transactions run in isolation. Durability:
committed transactions persist. BASE (Distributed, NoSQL): Basic Availability (multiple servers): availability over consistency. Soft State: temporary
inconsistency. Eventual Consistency.
CAP: Consistency: reading the same data from any node. Availability: no downtime. Partition Tolerance: the system functions even when communication among servers is unreliable. NoSQL systems must provide 'P' because they are distributed to meet high scalability requirements.
Master-slave architecture: focuses on maintaining consistency; only the master node handles write requests. Failure of the master node leads to application downtime, and while reads are scalable, writes are not. Master-slave architecture with sharding: data is partitioned into groups of keys, so a single point of failure becomes multiple points of failure. A common means of increasing availability when a master node fails is to employ a master failover protocol.

Challenges: store big data fault-tolerantly, process big data in parallel, manage continuously evolving schema and metadata for semi-structured and unstructured data. Internet data is massive, sparse and (semi-)unstructured; DBMS are optimised for dense, largely uniform, structured data.

RDBMS: for dense, structured data and designed for a "single" machine. NoSQL: schema flexibility, ACID -> BASE, async inserts & updates, scalable, distributed and massively parallel computing, consistency -> availability, lower cost, limited query capabilities (cannot join), no standardization.

Lecture 4. Log-structured merge (LSM) tree storage structure: high write throughput, append-only, with very rare random reads.
Memtable: key-value pairs sorted by key. When updating a value, the LSM tree appends another record; there is no need to search for the key.
During a read, find the most recent key-value pair.
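A toy sketch of the append-on-write, read-the-latest idea above. This is a deliberate simplification: a real LSM tree also flushes the memtable to sorted on-disk segments and compacts them.

```python
class TinyMemtable:
    """Toy append-only memtable: writes append; reads return the latest value."""

    def __init__(self):
        self._log = []          # append-only (key, seqno, value) records
        self._seq = 0

    def put(self, key, value):
        self._seq += 1
        self._log.append((key, self._seq, value))   # no search needed on write

    def get(self, key):
        # Read path: find the most recent record for this key.
        for k, _, v in reversed(self._log):
            if k == key:
                return v
        return None

mt = TinyMemtable()
mt.put("user:1", "alice")
mt.put("user:1", "alicia")      # an update is just another append
print(mt.get("user:1"))         # -> "alicia" (most recent record wins)
```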
Key-value: schemaless, stores data as hashtables; keys need to be hashable (AP databases); Redis is an in-memory example. Columnar: high performance on aggregation queries and better compression, as each column has the same type. Hybrid of RDBMS and key-value: values are stored in groups of zero or more columns (column families), but in column order, and are queried by matching keys. Google BigTable is good for high-velocity data where capturing the data quickly comes first. Graph: collection of nodes and edges which can be labelled to narrow searches; each node and edge can have any number of attributes. Use cases: key-player identification, PageRank, community detection, label propagation, path finding, cycle detection. Document: a document is a loosely structured set of key/value pairs; documents are treated as a whole and addressed in the database via a unique key; documents are schema-free. Comparison between RDBMS and document stores: Tables -> Collections, Rows -> Documents, Columns -> Key-value pairs. Examples: MongoDB, CouchDB, Firestore.
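A minimal document-store sketch using the pymongo client. The broker of choice is an assumption on my part: it assumes a MongoDB server on localhost:27017 and the pymongo package, and the database/collection names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["lecture_demo"]          # database
orders = db["orders"]                # collection ~ table; documents ~ rows

# Documents are schema-free: these two documents have different fields.
orders.insert_one({"order_id": 1, "user": "alice", "items": ["pen", "book"]})
orders.insert_one({"order_id": 2, "user": "bob", "total": 9.5})

# Documents are addressed and queried as a whole via keys in the document.
print(orders.find_one({"user": "alice"}))
```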

Lecture 5. Data Pipeline & Orchestration. Workflow management with Airflow: define tasks and dependencies; scalable, as it schedules, distributes and executes tasks across worker nodes; monitors job status; accessibility of log files; ease of deployment of workflow changes; handles errors and failures; tasks can pass parameters downstream; integrates with other infrastructure. Data pipelines for streaming data: Lambda architecture: duplicate code, data quality issues, added complexity, two distributed systems. Kappa architecture: simplified, single codebase, single processing path for data -> improved consistency, ease of maintenance.
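A minimal Airflow DAG sketch showing tasks and their dependency, as described above. The DAG id, task names and schedule are made-up examples; it assumes Airflow 2.x with the standard PythonOperator.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("write data to the warehouse")

# Define the workflow: tasks plus their dependencies, scheduled daily.
with DAG(
    dag_id="example_etl",                 # placeholder name
    start_date=datetime(2024, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_load                   # load runs only after extract succeeds
```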

Twitter case study: high scale and throughput in real-time processing lead to data loss and inaccuracies, exacerbated by back pressure. Back pressure is where data is produced faster than it can be consumed, leading to overload and congestion.

Lecture 6. Cloud computing: use of hosted services such as data storage, servers, databases, networking and software
over the internet. Cloud computing benefits: pay for what you use, more flexibility, more scalability, no server space
required, better data security, provides disaster recovery, automatic software updates, teams can collaborate from widespread locations, data can be accessed and shared anywhere over the internet, rapid implementation.
Symmetric multi-processing (SMP) on a shared-memory architecture: not easily scalable, as memory and I/O bus width are bottlenecks. Massively parallel processing (MPP) on a shared-disk architecture: SMP clusters share a common disk and need a coordination protocol for cache coherency. Sharding on a shared-nothing architecture: data is partitioned across multiple computation nodes. Shared-nothing has become the dominant design. Scalability: data is horizontally partitioned across nodes, each node responsible for the rows on its local disk; cheap commodity hardware. However, heterogeneous workloads (high I/O, light compute vs low I/O, heavy compute) and membership changes (node failures, system resize) lead to significant performance impact, as computing resources are used for data shuffling rather than serving requests, and it is hard to keep the system online during upgrades. Multi-cluster shared-data architecture: S3 for data storage. Cloud services layer: manages virtual warehouses, queries, transactions, all metadata, access control information and usage stats. Virtual warehouse: each can contain one or more computation servers; different virtual warehouse sizes allocate different numbers of machines. Sharding: chop the data into multiple pieces; no cache-coherence problem anymore.
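A tiny sketch of chopping data into pieces by key. Hash-based routing is an assumed example (the lecture does not prescribe a partitioning scheme), and the node names are placeholders.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]   # placeholder shard nodes

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it; each node owns one hash bucket."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for k in ["user:1", "user:2", "order:99"]:
    print(k, "->", shard_for(k))
```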

(Figure annotations: hardware separation of memory space, so there is no coherence problem if all requests are read-only; shared-disk nodes are connected to the same hard disk.)
GCP example (BigQuery): column-oriented storage for OLAP use cases. Capacitor (file format) uses run-length encoding to encode sparse data. Encoding is not trivial: reordering rows is an NP-complete problem, and not all columns are equal; encoding columns with long strings provides more gain, some columns are more likely to be selected in a query, and some are more likely to be used in a filter. Colossus (distributed file system): sharding and geo-replication (the same data in different regions, and in different zones within the same region) for failover and serving purposes. Storage optimisation (partitioning): for large amounts of data with a low number of distinct values; partitions are >1GB and are stored in the same data block. Storage optimisation (clustering): lower-level data rearrangement; data is sorted for faster selection and filtering, partitioned into different blocks and stored based on the clustering columns.
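A minimal run-length encoding sketch of the kind of compression a Capacitor-style columnar format relies on; illustrative only, the real encoding is far more involved.

```python
def rle_encode(column):
    """Run-length encode a column as [(value, run_length), ...]."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

# Sparse or low-cardinality columns compress very well with RLE,
# especially if rows are ordered so equal values sit next to each other.
country = ["SG", "SG", "SG", "SG", "US", "US", "SG", "SG", "SG"]
print(rle_encode(country))   # [('SG', 4), ('US', 2), ('SG', 3)]
```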
Query processing (Dremel): high-speed data transfer between Colossus and Dremel; data moves from Colossus to Dremel, not compute. Workers process data and store intermediate results in a shuffle layer, then other workers fetch data from the shuffle. Steps of executing a query: API request management -> lexing and parsing SQL -> query planning (catalog resolution) -> query execution (scheduling and dynamic planning) -> finalise results.
(Figure: example table with columns Creation_date | Tags | Title, partitioned by creation_date and clustered by tags.)

Viz: Clarity (use labels), Accuracy, Efficiency (communicate effectively).

Types of charts: line chart for time series, scatter plot to see relationships, histogram/density plot for distributions, bar chart for categorical data, heatmap to visualise a matrix of values, pairplots to visualise pairwise relationships.
Statistical data analysis. Numerical data: distribution histogram. Categorical data: cardinality & unique counts. Textual data: number of unique tokens, document frequency to penalise term frequency.

Correlation analysis: numeric vs numeric: pairwise plot, correlation heatmap, linear regression; categorical vs categorical: cardinality and unique counts, information gain; numerical vs categorical: density plot.
Outliers: boxplot, outside 1.5 IQR, z-score > 3, violin plot.
Machine Learning: F-beta score (0-1), a weighted harmonic mean of precision and recall; F1 weights them equally, F2 weights recall more heavily.
Image generation: diffusion models are able to use a prompt, unlike GANs.
Linear regression: use L2 loss, take small steps in the negative gradient direction.
Logistic regression uses an activation function: sigmoid(y) = 1 / (1 + e^(-y)).
Input textual data: bag of words (order and structure discarded; a large vocabulary leads to a large and sparse table), word2vec (similar words are close together), transformer (multiple latent representations, computation can be parallelized).
Supervised: classification, regression, unstructured output. Unsupervised: association rules (basket analysis), clustering, dimensionality reduction.
Computer vision: classification -> semantic segmentation -> object detection (multiple objects) -> instance segmentation (object detection then semantic segmentation).
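A small numpy sketch of the gradient-step idea above, applied to logistic regression with the sigmoid activation. The data is synthetic and the learning rate is a made-up choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1                                  # small step size

for _ in range(500):
    p = sigmoid(X @ w + b)                # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)       # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                      # small step in the negative gradient direction
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("train accuracy:", round(accuracy, 3))
```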

Hadoop MapReduce is a software framework to process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data into independent chunks which are processed by the map tasks in parallel; the outputs of the maps are sorted and then input to the reduce tasks. Both the input and the output of the job are stored in HDFS, which runs on the same set of nodes, so the framework can effectively schedule tasks on the nodes where data is already present. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
MAPPER: the key is what the data will be grouped on and the value is the information pertinent to the analysis in the reducer. REDUCER: takes the grouped data as input and runs a reduce function once per key grouping; the function is passed the key and an iterator over all of the values associated with that key. The data can be aggregated, filtered and combined.
3 features: fault tolerance, data movement and lifecycle management. The ResourceManager (RM) arbitrates resources among all the applications in the system. The NodeManager is a per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and sending heartbeats to the RM. The RM has 2 components: Scheduler (SC) and ApplicationsManager (AM). The SC allocates resources to the various running applications. The AM is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the SC, tracking their status and monitoring progress. Client submits application -> RM starts the ApplicationMaster and allocates containers to it on a needs basis -> NodeManagers send heartbeats to the RM.
HDFS: fault-tolerant distributed file system run on commodity hardware. Moving computation > moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. HDFS cluster: 1 NameNode, the master server, manages the filesystem namespace and metadata and regulates access to files by clients; data never flows through the NameNode. In addition, there are many DataNodes, usually one per node in the cluster. Internally, a file is split into one or more blocks (64MB or 128MB); blocks are stored in a set of DataNodes and replicated for fault tolerance. The NameNode gets a heartbeat and a Blockreport from the DataNodes in the cluster.

MapReduce for PageRank: R(t) = αHR(t-1) + (1-α)s. The damping factor (α) represents the probability that a user will continue clicking on links versus jumping to a random page. The mapper emits (child X, share of Y's rank) for each child X of page Y; the reducer groups all contributions to a child X together and sums the shares.
MapReduce for k-means clustering: the mapper emits (key: nearest centroid, value: (point, 1)); the reducer groups all keys with the same centroid, sums up the points and counts, and divides to find the new mean, which is the new centroid. Broadcast the new means to each machine on the cluster.
MapReduce for linear regression: the mapper receives a subset (batch) of the data and computes a partial gradient. Reduce phase: gradients from all the mappers are aggregated to compute the total gradient. Update phase: update the weights using the total gradient computed from all subsets.
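A plain-Python sketch of one k-means MapReduce iteration as described above, with in-memory stand-ins for the map, shuffle and reduce phases; this is not Hadoop API code, and the points and centroids are made-up toy data.

```python
from collections import defaultdict

points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (0.9, 1.1)]
centroids = {0: (1.0, 1.0), 1: (9.0, 9.0)}   # current centroids, broadcast to all mappers

def mapper(point):
    # Emit (nearest centroid id, (point, 1)).
    cid = min(centroids,
              key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
    return cid, (point, 1)

def reducer(cid, values):
    # Sum the points and counts for one centroid, then divide to get the new mean.
    sx = sum(p[0] for p, _ in values)
    sy = sum(p[1] for p, _ in values)
    n = sum(c for _, c in values)
    return cid, (sx / n, sy / n)

# "Shuffle": group mapper output by key, then reduce per key.
groups = defaultdict(list)
for pt in points:
    cid, value = mapper(pt)
    groups[cid].append(value)

new_centroids = dict(reducer(cid, vals) for cid, vals in groups.items())
print(new_centroids)   # broadcast these as the centroids for the next iteration
```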

MapReduce runtime concerns: orchestration of distributed computation, handling scheduling of tasks, handling data distribution, handling synchronisation, handling errors and faults (detect worker failures and restart jobs); everything happens on top of a distributed file system.

5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value. Data stream processing deals with velocity. A data stream is a real-time, continuous, ordered sequence of structured records: massive volume, impossible to store locally, and insights must be generated in real time via continuous queries (CQ).
Structure: an infinite sequence of items with the same structure, an explicit or implicit timestamp, and a physical or logical representation of time. Applications: transactional (logs interactions), measurement (monitors the evolution of entity states). A Data Stream Management System (DSMS) manages continuous data streams and executes continuous queries that are permanently installed; the stream query processor produces new results as long as new data arrive at the system. Scratch space is temporary storage used as a buffer where new data joins the queue to be processed. With stream algorithms, generate insights by going over the data once; as the data grows linearly, the summary statistics should grow sub-linearly, at most logarithmically.

Heavy hitters (potential candidates for further investigation). s-heavy hitters: count(e) at time t > s * (length of the stream so far). Maintain a structure containing at most [ceiling(1/s) - 1] s-heavy-hitter elements. False positives are okay, but no false negatives. (Assumes the data is stationary.)
Window-based algorithms: windows based on an ordering attribute, on record counts (count-based windows), or on explicit markers.

DBMS vs DSMS:
Data nature: persistent, relatively static data vs read-only (append-only) data; a DSMS processes data to generate insights, not to modify the streams.
Storage: "unbounded" disk store (theoretically load the entire dataset into memory and generate insights) vs bounded main memory.
Access pattern: random access vs sequential access.
Computation model: query-driven processing model (pull-based) vs data-driven processing model (push-based).
Query answer: one-time query vs continuous query.
Update frequency: relatively low update rate vs possibly multi-GB arrival rate.
Transactions: ACID properties vs no transaction management.
Data precision: assumes precise data vs stale/imprecise data.

Lossy counting. N: current length of the stream; s in (0, 1): support threshold; ε in (0, 1): error. Output: elements with counter values exceeding (s - ε) * N. Rule of thumb: ε = 0.1 * s.
Guarantees: frequencies are underestimated by at most ε*N; no false negatives; false positives have true frequency at least (s - ε) * N; all elements exceeding frequency s*N will be output.
Example: the user is interested in identifying all items whose frequency is at least 0.1%, so s = 0.1%; assume ε = 0.01%. Guarantee 1: all elements with frequency > 0.1% will be returned. Guarantee 2: no element with frequency below 0.09% will be returned; false positives between 0.09% and 0.1% might or might not be output. Guarantee 3: all estimated frequencies are less than their true frequencies by at most 0.01% * N.
Lossy counting steps: (step 1) divide the stream into windows of size w = ceiling(1/ε). (Step 2) Go through the elements: if a counter exists, increase it by one; if not, create one and initialise it to one. (Step 3) At a window boundary, decrement all counters by 1; if a counter reaches zero, drop it.

Sticky sampling. A counting algorithm using a sampling approach: probabilistic sampling decides whether a counter for a distinct element is created; once a counter exists for an element, every future instance of that element is counted. N: current length of the stream; s in (0, 1): support threshold; ε in (0, 1): error; δ in (0, 1): probability of failure.
Guarantees: all items with true frequency exceeding s*N are output; no false negatives; no item whose true frequency is less than (s - ε) * N is output; estimated frequencies are less than the true frequencies by at most ε*N. The guarantees are the same as for lossy counting, except for the small probability δ that it fails to provide correct answers.
Sticky sampling steps: (step 1) dynamic window size, doubling with each new window, where t = (1/ε) * log(1/(s * δ)); W0 = 2t and Wk = 2^k * t for k >= 1. (Step 2) Go through the elements: if a counter exists, increase it; if not, create a counter with probability 1/r and initialise it to one. The sampling rate r starts at 1 and grows in proportion to the window size W. (Step 3) At a window boundary, go through each counter and repeatedly toss a coin: for each unsuccessful toss, decrement the counter; on a successful toss, move on to the next counter. If a counter becomes zero, drop it. Decrements happen probabilistically, so heavy hitters have a higher chance of staying in the dictionary.
Comparing lossy counting and sticky sampling on Unique and Zipf distributions (where the frequency of an item is inversely proportional to its rank in the frequency table): "sticky" refers to its tendency to remember every unique element that gets sampled, while lossy counting is good at pruning low-frequency elements quickly. For highly skewed data, both algorithms require much less space than their worst-case bounds.
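A compact sketch of the lossy counting steps above (window size, increment at each arrival, decrement at window boundaries). The stream, s and ε values are toy examples.

```python
import math

def lossy_counting(stream, epsilon):
    """Approximate counts; frequencies are underestimated by at most epsilon*N."""
    w = math.ceil(1 / epsilon)        # step 1: window size
    counters = {}
    for i, item in enumerate(stream, start=1):
        counters[item] = counters.get(item, 0) + 1     # step 2: increment or create
        if i % w == 0:                                  # step 3: window boundary
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]                   # prune low-frequency elements
    return counters

stream = list("aababcabadaaabb")
epsilon, s = 0.2, 0.3                                   # toy parameters
counts = lossy_counting(stream, epsilon)

# Output: elements whose counter exceeds (s - epsilon) * N.
print({k: v for k, v in counts.items() if v > (s - epsilon) * len(stream)})
```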

Kafka: a distributed event streaming platform that sends data from source to destination. Apache Storm: a computation graph where nodes are individual computations; nodes send data between one another in the form of tuples. A stream is an unbounded sequence of tuples. A spout is the source of a stream: it reads data from an external data source and emits tuples. Bolts subscribe to streams and act as stream transformers: they process incoming streams and produce new ones. A topology is deployed over a cluster of computation resources. Parallelism is achieved by running multiple replicas of the same spout/bolt.
Storm architecture: the MasterNode runs Nimbus, a central job master to which topologies are submitted. It is in charge of scheduling, job orchestration, communication and fault tolerance, distributes code around the cluster and assigns tasks to worker nodes. WorkerNodes are where applications are executed. Each WorkerNode runs a supervisor; each supervisor listens for work assigned to its worker node by Nimbus and starts/stops worker processes.
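A minimal producer/consumer sketch for sending events through Kafka. Using the third-party kafka-python client is my assumption (the lecture does not prescribe a client library), and the broker address and topic name are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"          # placeholder broker address
TOPIC = "clicks"                   # placeholder topic name

# Producer: push events from a source into the stream.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"click-{i}".encode())
producer.flush()

# Consumer: read the unbounded stream of records from the topic.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value.decode())
```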
Microprocessor speed (vertical scaling) stagnated: thermal barrier, system bottleneck, material limits

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework
and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of
nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth across the cluster.

Map-reduce for PageRank, Kmeans clustering & Gradient Descent


In the PageRank algorithm, the damping factor is typically denoted by the symbol α. The damping factor represents the probability that a user will continue clicking on links versus jumping to a random page.
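A plain-Python sketch of one PageRank MapReduce iteration, with in-memory stand-ins for the map, shuffle and reduce phases. The toy link graph and α = 0.85 are illustrative assumptions.

```python
from collections import defaultdict

alpha = 0.85                                         # assumed damping factor
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}    # toy link graph
ranks = {page: 1.0 / len(graph) for page in graph}

def mapper(page, out_links):
    # Emit (child, share of this page's rank) for each outgoing link.
    share = ranks[page] / len(out_links)
    return [(child, share) for child in out_links]

def reducer(page, shares):
    # Sum the incoming shares, then apply the damping factor.
    return page, (1 - alpha) / len(graph) + alpha * sum(shares)

# "Shuffle": group mapper output by destination page, then reduce per page.
grouped = defaultdict(list)
for page, links in graph.items():
    for child, share in mapper(page, links):
        grouped[child].append(share)

ranks = dict(reducer(page, shares) for page, shares in grouped.items())
print(ranks)   # ranks after one iteration; repeat until convergence
```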

Apache Hadoop includes three modules:


Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: Parallel processing of large data sets.

MAPPER: The key is what the data will be grouped on and the value is the information pertinent to
the analysis in the reducer.
REDUCER: The reducer takes the grouped data as input and runs a reduce function once per key
grouping. The function is passed the key and an iterator over all of the values associated with that
key. A wide range of processing can happen in this function, as we’ll see in many of our patterns. The
data can be aggregated, filtered, and combined in a number of ways.
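A classic word-count sketch of the mapper/reducer contract described above: the mapper emits (key, value) pairs, and the reducer is called once per key with an iterator over that key's values. These are plain-Python stand-ins, not Hadoop Streaming code.

```python
from collections import defaultdict

def mapper(line):
    # Key = the thing we group on (the word); value = info for the reducer (a count of 1).
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Called once per key grouping with all of that key's values.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group mapper output by key, then run the reducer per key.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

print(dict(reducer(w, c) for w, c in grouped.items()))
```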

Three features managed by Hadoop: Fault tolerance, data movement and lifecycle management

YARN: resource management and job scheduling layer of Hadoop.

ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and sending heartbeats to the ResourceManager/Scheduler.

The ResourceManager has two main components: Scheduler and ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications. The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

HDFS is a fault-tolerant distributed file system designed to run on commodity hardware. Moving computation is cheaper than moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks (64MB or 128MB) and these blocks are stored in a set of DataNodes. The blocks of a file are replicated for fault tolerance. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
