20 - 04 - 2024 Cheatsheet


Lecture 1.

Data Engineering: development, operation and maintenance of on-premise, cloud or hybrid data infrastructure to support downstream machine learning, analytics and reporting applications.
Databases: OLTP, row-based, CRUD, many concurrent accesses, writing and updating data in a transactional environment. Data warehouse (DW): OLAP, column-based, analytics for a small user base, read-heavy operations. Data mart: transformed, smaller DW for business units. Serverless pipeline: compute and storage resources are provisioned and managed by the cloud provider; users do not need to manage servers, infrastructure or scaling.
Lecture 2. Data format considerations: text vs binary (ease of use); splittable (for scalability and for distributed file systems); compressible (relevant for text formats; binary formats already have compression by definition); verbosity; supported data types; schema enforcement; schema evolution support; row- vs column-based storage; ecosystem support.
Text formats. CSV/TSV: splittable. XML: verbose, not splittable, DTD schema validation. JSON: compact, not splittable, compressible, heterogeneous arrays and objects as dictionaries. Binary formats (serialization converts complex data types to a stream of bytes for transfer). AVRO: row-based, splittable, compressible, schema evolution, JSON-defined schema. PARQUET: column-based (more compressible) format of the Hadoop ecosystem for analytics queries; write-once-read-many, splittable, schema evolution, restricted to batch processing.
Memory hierarchy: register, CPU cache, RAM, SSD, HDD, object storage, archival storage.
(Schema design example: Design 1 stores transaction info; Design 2 does not store quantity info and loses transaction info.)
Move data: t_md = size of data × (1 / speed of computation + 1 / network speed). Move compute: t_mc = size of program / network speed + size of data / (number of nodes × speed of computation).
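A back-of-the-envelope sketch of the two formulas above; all figures are illustrative assumptions, not values from the lecture.

```python
# Back-of-the-envelope comparison of moving data vs. moving compute.
# All numbers below are illustrative assumptions, not lecture values.

data_size = 1e12        # bytes of input data (1 TB)
program_size = 1e6      # bytes of program code (1 MB)
network_speed = 1e9     # bytes/second over the network
compute_speed = 5e9     # bytes/second processed per node
num_nodes = 100         # nodes available where the data already lives

# Move data: ship the data to one compute node, then process it there.
t_move_data = data_size * (1 / compute_speed + 1 / network_speed)

# Move compute: ship the program to the nodes; each processes its share.
t_move_compute = program_size / network_speed + data_size / (num_nodes * compute_speed)

print(f"move data:    {t_move_data:.1f} s")
print(f"move compute: {t_move_compute:.3f} s")
```

With these numbers moving compute wins by orders of magnitude, which is the point the lecture makes about scheduling work where the data already lives.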

Lecture 3. Owner entity and weak entity: one-to-many relationship; the weak entity must have total participation.
Each B-tree node fits into a disk block or page, allowing efficient retrieval of an entire node with a single disk read. A B-tree stores data for efficient retrieval in block-oriented storage; high spatial locality and pre-fetching improve I/O efficiency. B-trees are used in DBMS to implement indexing structures and store keys in sorted order.
In B+ trees, only the leaf nodes contain pointers to data records; the internal nodes contain only keys. B+ trees are more storage efficient and support range queries.
Query processing: SQL query -> parse & rewrite query -> logical query plan (relational algebra tree) -> physical query plan -> query execution -> disk. Statistics about the data inform query optimisation. Equivalence rules in relational algebra generate logically equivalent query plans: commutativity, associativity, and pushing predicates to be applied first, e.g. R⋈S = S⋈R and R∪S = S∪R. The Selinger algorithm finds the optimal join sequence in a left-deep join tree using dynamic programming: O(2^n) < O(n!).
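A minimal sketch of the dynamic-programming idea behind Selinger-style left-deep join ordering. The relation names, cardinalities and cost model are made-up placeholders, not the lecture's.

```python
from itertools import combinations

# Toy Selinger-style DP over left-deep join orders.
cards = {"R": 1000, "S": 5000, "T": 200}   # assumed relation cardinalities

def join_cost(left_card, right_name):
    # Placeholder cost model: proportional to the product of cardinalities.
    return left_card * cards[right_name]

# best[subset] = (cost, estimated output cardinality, join order)
best = {frozenset([r]): (0, c, [r]) for r, c in cards.items()}

for size in range(2, len(cards) + 1):
    for subset in combinations(cards, size):
        s = frozenset(subset)
        candidates = []
        for last in s:                    # left-deep: 'last' joins the rest
            rest = s - {last}
            cost, card, order = best[rest]
            candidates.append((cost + join_cost(card, last),
                               card * cards[last] // 10,   # crude size estimate
                               order + [last]))
        best[s] = min(candidates, key=lambda t: t[0])      # cheapest plan per subset

cost, _, order = best[frozenset(cards)]
print("best left-deep order:", " JOIN ".join(order), "cost:", cost)
```

The DP table has one entry per subset of relations, which is where the O(2^n) bound (versus O(n!) for enumerating all orders) comes from.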

Lecture 4. ACID (SQL): Atomicity: all or nothing. Consistency: consistent with constraints. Isolation: all transactions run in isolation. Durability:
committed transactions persist. BASE (Distributed, NoSQL): Basic Availability (multiple servers): availability over consistency. Soft State: temporary
inconsistency. Eventual Consistency.
CAP: Consistency: reading the same data from any node. Availability: no downtime. Partition Tolerance: the system functions even when communication among servers is unreliable. NoSQL systems must provide 'P' because they are distributed to meet high scalability requirements.
Master-slave architecture: focuses on maintaining consistency; only the master node handles write requests. Failure of the master node leads to application downtime, and while reads are scalable, writes are not. Master-slave architecture with sharding: data is partitioned into groups of keys, so a single point of failure becomes multiple points of failure. A common means of increasing availability when a master node fails is to employ a master failover protocol.

Challenges: store big data fault-tolerantly, process big data in parallel, manage continuously evolving schema and metadata for semi-structured and unstructured data. Internet data is massive, sparse and (semi-)unstructured; DBMS are optimised for dense, largely uniform, structured data.

RDBMS: for dense, structured data and designed for a "single" machine. NoSQL: schema flexibility, ACID -> BASE, async inserts & updates, scalable, distributed and massively parallel computing, consistency -> availability, lower cost, limited query capabilities (cannot join), no standardization.

Lecture 4. Log-structured merge (LSM) tree storage structure: high write throughput, append-only, with very rare random reads.
Memtable: key-value pairs sorted by key. When updating a value, the LSM tree appends another record; there is no need to search for the key.
During a read, find the most recent key-value pair.
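A toy sketch of the append-on-write, read-the-latest idea above. This is a deliberate simplification: a real LSM tree also flushes the memtable to sorted on-disk segments and compacts them.

```python
class TinyMemtable:
    """Toy append-only memtable: writes append; reads return the latest value."""

    def __init__(self):
        self._log = []          # append-only (key, seqno, value) records
        self._seq = 0

    def put(self, key, value):
        self._seq += 1
        self._log.append((key, self._seq, value))   # no search needed on write

    def get(self, key):
        # Read path: find the most recent record for this key.
        for k, _, v in reversed(self._log):
            if k == key:
                return v
        return None

mt = TinyMemtable()
mt.put("user:1", "alice")
mt.put("user:1", "alicia")      # an update is just another append
print(mt.get("user:1"))         # -> "alicia" (most recent record wins)
```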
Key-value: schemaless, stores data as hashtables; keys need to be hashable (AP databases); Redis is an in-memory example. Columnar: high performance on aggregation queries and better compression, as each column has the same type. Hybrid of RDBMS and key-value: values are stored in groups of zero or more columns (column families), but in column order, and are queried by matching keys. Google BigTable is good for high-velocity data where capturing the data quickly comes first. Graph: collection of nodes and edges which can be labelled to narrow searches; each node and edge can have any number of attributes. Use cases: key-player identification, PageRank, community detection, label propagation, path finding, cycle detection. Document: a document is a loosely structured set of key/value pairs; documents are treated as a whole and addressed in the database via a unique key; documents are schema-free. Comparison between RDBMS and document stores: Tables -> Collections, Rows -> Documents, Columns -> Key-value pairs. Examples: MongoDB, CouchDB, Firestore.
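A minimal document-store sketch using the pymongo client. The broker of choice is an assumption on my part: it assumes a MongoDB server on localhost:27017 and the pymongo package, and the database/collection names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["lecture_demo"]          # database
orders = db["orders"]                # collection ~ table; documents ~ rows

# Documents are schema-free: these two documents have different fields.
orders.insert_one({"order_id": 1, "user": "alice", "items": ["pen", "book"]})
orders.insert_one({"order_id": 2, "user": "bob", "total": 9.5})

# Documents are addressed and queried as a whole via keys in the document.
print(orders.find_one({"user": "alice"}))
```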

Lecture 5. Data Pipeline & Orchestration. Workflow management with Airflow: define tasks and dependencies; scalable, as it schedules, distributes and executes tasks across worker nodes; monitors job status; accessibility of log files; ease of deployment of workflow changes; handles errors and failures; tasks can pass parameters downstream; integrates with other infrastructure. Data pipelines for streaming data: Lambda architecture: duplicate code, data quality issues, added complexity, two distributed systems. Kappa architecture: simplified, single codebase, single processing path for data -> improved consistency, ease of maintenance.
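A minimal Airflow DAG sketch showing tasks and their dependency, as described above. The DAG id, task names and schedule are made-up examples; it assumes Airflow 2.x with the standard PythonOperator.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("write data to the warehouse")

# Define the workflow: tasks plus their dependencies, scheduled daily.
with DAG(
    dag_id="example_etl",                 # placeholder name
    start_date=datetime(2024, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_load                   # load runs only after extract succeeds
```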

Twitter case study: high scale and throughput in real-time processing lead to data loss and inaccuracies, exacerbated by back pressure. Back pressure is where data is produced faster than it can be consumed, leading to overload and congestion.

Lecture 6. Cloud computing: use of hosted services such as data storage, servers, databases, networking and software
over the internet. Cloud computing benefits: pay for what you use, more flexibility, more scalability, no server space
required, better data security, provides disaster recovery, automatic software updates, teams can collaborate from widespread locations, data can be accessed and shared anywhere over the internet, rapid implementation.
Symmetric multi-processing (SMP) on a shared-memory architecture: not easily scalable, as memory and I/O bus width are bottlenecks. Massively parallel processing (MPP) on a shared-disk architecture: SMP clusters share a common disk and need a coordination protocol for cache coherency. Sharding on a shared-nothing architecture: data is partitioned across multiple computation nodes. Shared-nothing has become the dominant design. Scalability: data is horizontally partitioned across nodes, each node responsible for the rows on its local disk; cheap commodity hardware. However, heterogeneous workloads (high I/O, light compute vs low I/O, heavy compute) and membership changes (node failures, system resize) lead to significant performance impact, as computing resources are used for data shuffling rather than serving requests, and it is hard to keep the system online during upgrades. Multi-cluster shared-data architecture: S3 for data storage. Cloud services layer: manages virtual warehouses, queries, transactions, all metadata, access control information and usage stats. Virtual warehouse: each can contain one or more computation servers; different virtual warehouse sizes allocate different numbers of machines. Sharding: chop the data into multiple pieces; no cache-coherence problem anymore.
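A tiny sketch of chopping data into pieces by key. Hash-based routing is an assumed example (the lecture does not prescribe a partitioning scheme), and the node names are placeholders.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]   # placeholder shard nodes

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it; each node owns one hash bucket."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for k in ["user:1", "user:2", "order:99"]:
    print(k, "->", shard_for(k))
```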

(Figure annotations: hardware separation of memory space, so there is no coherence problem if all requests are read-only; shared-disk nodes are connected to the same hard disk.)
GCP example (BigQuery): column-oriented storage for OLAP use cases. Capacitor (file format) uses run-length encoding to encode sparse data. Encoding is not trivial: reordering rows is an NP-complete problem, and not all columns are equal; encoding columns with long strings provides more gain, some columns are more likely to be selected in a query, and some are more likely to be used in a filter. Colossus (distributed file system): sharding and geo-replication (the same data in different regions, and in different zones within the same region) for failover and serving purposes. Storage optimisation (partitioning): for large amounts of data with a low number of distinct values; partitions are >1GB and are stored in the same data block. Storage optimisation (clustering): lower-level data rearrangement; data is sorted for faster selection and filtering, partitioned into different blocks and stored based on the clustering columns.
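A minimal run-length encoding sketch of the kind of compression a Capacitor-style columnar format relies on; illustrative only, the real encoding is far more involved.

```python
def rle_encode(column):
    """Run-length encode a column as [(value, run_length), ...]."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

# Sparse or low-cardinality columns compress very well with RLE,
# especially if rows are ordered so equal values sit next to each other.
country = ["SG", "SG", "SG", "SG", "US", "US", "SG", "SG", "SG"]
print(rle_encode(country))   # [('SG', 4), ('US', 2), ('SG', 3)]
```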
Query processing (Dremel): high-speed data transfer between Colossus and Dremel; data moves from Colossus to Dremel, not compute. Workers process data and store intermediate results in a shuffle layer, then other workers fetch data from the shuffle. Steps of executing a query: API request management -> lexing and parsing SQL -> query planning (catalog resolution) -> query execution (scheduling and dynamic planning) -> finalise results.
(Figure: example table with columns Creation_date | Tags | Title, partitioned by creation_date and clustered by tags.)

Viz: Clarity (use labels), Accuracy, Efficiency (communicate effectively).

Types of charts: line chart for time series, scatter plot to see relationships, histogram/density plot for distributions, bar chart for categorical data, heatmap to visualise a matrix of values, pairplots to visualise pairwise relationships.
Statistical data analysis. Numerical data: distribution histogram. Categorical data: cardinality & unique counts. Textual data: number of unique tokens, document frequency to penalise term frequency.

Correlation analysis: numeric vs numeric: pairwise plot, correlation heatmap, linear regression; categorical vs categorical: cardinality and unique counts, information gain; numerical vs categorical: density plot.
Outliers: boxplot, outside 1.5 IQR, z-score > 3, violin plot.
Machine Learning: F-beta score (0-1), a weighted harmonic mean of precision and recall; F1 weights them equally, F2 weights recall more heavily.
Image generation: diffusion models are able to use a prompt, unlike GANs.
Linear regression: use L2 loss, take small steps in the negative gradient direction.
Logistic regression uses an activation function: sigmoid(y) = 1 / (1 + e^(-y)).
Input textual data: bag of words (order and structure discarded; a large vocabulary leads to a large and sparse table), word2vec (similar words are close together), transformer (multiple latent representations, computation can be parallelized).
Supervised: classification, regression, unstructured output. Unsupervised: association rules (basket analysis), clustering, dimensionality reduction.
Computer vision: classification -> semantic segmentation -> object detection (multiple objects) -> instance segmentation (object detection then semantic segmentation).
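A small numpy sketch of the gradient-step idea above, applied to logistic regression with the sigmoid activation. The data is synthetic and the learning rate is a made-up choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1                                  # small step size

for _ in range(500):
    p = sigmoid(X @ w + b)                # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)       # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                      # small step in the negative gradient direction
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("train accuracy:", round(accuracy, 3))
```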

Hadoop MapReduce is a software framework to process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data into independent chunks which are processed by the map tasks in parallel; the outputs of the maps are sorted and then input to the reduce tasks. Both the input and the output of the job are stored in HDFS, which runs on the same set of nodes, so the framework can effectively schedule tasks on the nodes where data is already present. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
MAPPER: the key is what the data will be grouped on and the value is the information pertinent to the analysis in the reducer. REDUCER: takes the grouped data as input and runs a reduce function once per key grouping; the function is passed the key and an iterator over all of the values associated with that key. The data can be aggregated, filtered and combined.
3 features: fault tolerance, data movement and lifecycle management. The ResourceManager (RM) arbitrates resources among all the applications in the system. The NodeManager is a per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and sending heartbeats to the RM. The RM has 2 components: Scheduler (SC) and ApplicationsManager (AM). The SC allocates resources to the various running applications. The AM is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the SC, tracking their status and monitoring progress. Client submits application -> RM starts the ApplicationMaster and allocates containers to it on a needs basis -> NodeManagers send heartbeats to the RM.
HDFS: fault-tolerant distributed file system run on commodity hardware. Moving computation > moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. HDFS cluster: 1 NameNode, the master server, manages the filesystem namespace and metadata and regulates access to files by clients; data never flows through the NameNode. In addition, there are many DataNodes, usually one per node in the cluster. Internally, a file is split into one or more blocks (64MB or 128MB); blocks are stored in a set of DataNodes and replicated for fault tolerance. The NameNode gets a heartbeat and a Blockreport from the DataNodes in the cluster.

MapReduce for PageRank: R(t) = αHR(t-1) + (1-α)s. The damping factor (α) represents the probability that a user will continue clicking on links versus jumping to a random page. The mapper emits (child X, share of Y's rank) for each child X of page Y; the reducer groups all contributions to a child X together and sums the shares.
MapReduce for k-means clustering: the mapper emits (key: nearest centroid, value: (point, 1)); the reducer groups all keys with the same centroid, sums up the points and counts, and divides to find the new mean, which is the new centroid. Broadcast the new means to each machine on the cluster.
MapReduce for linear regression: the mapper receives a subset (batch) of the data and computes a partial gradient. Reduce phase: gradients from all the mappers are aggregated to compute the total gradient. Update phase: update the weights using the total gradient computed from all subsets.
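A plain-Python sketch of one k-means MapReduce iteration as described above, with in-memory stand-ins for the map, shuffle and reduce phases; this is not Hadoop API code, and the points and centroids are made-up toy data.

```python
from collections import defaultdict

points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (0.9, 1.1)]
centroids = {0: (1.0, 1.0), 1: (9.0, 9.0)}   # current centroids, broadcast to all mappers

def mapper(point):
    # Emit (nearest centroid id, (point, 1)).
    cid = min(centroids,
              key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
    return cid, (point, 1)

def reducer(cid, values):
    # Sum the points and counts for one centroid, then divide to get the new mean.
    sx = sum(p[0] for p, _ in values)
    sy = sum(p[1] for p, _ in values)
    n = sum(c for _, c in values)
    return cid, (sx / n, sy / n)

# "Shuffle": group mapper output by key, then reduce per key.
groups = defaultdict(list)
for pt in points:
    cid, value = mapper(pt)
    groups[cid].append(value)

new_centroids = dict(reducer(cid, vals) for cid, vals in groups.items())
print(new_centroids)   # broadcast these as the centroids for the next iteration
```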

MapReduce runtime concerns: orchestration of distributed computation, handling scheduling of tasks, handling data distribution, handling synchronisation, handling errors and faults (detect worker failures and restart jobs); everything happens on top of a distributed file system.

5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value. Data stream processing deals with velocity. A data stream is a real-time, continuous, ordered sequence of structured records: massive volume, impossible to store locally, and insights must be generated in real time via continuous queries (CQ).
Structure: an infinite sequence of items with the same structure, an explicit or implicit timestamp, and a physical or logical representation of time. Applications: transactional (logs interactions), measurement (monitors the evolution of entity states). A Data Stream Management System (DSMS) manages continuous data streams and executes continuous queries that are permanently installed; the stream query processor produces new results as long as new data arrive at the system. Scratch space is temporary storage used as a buffer where new data joins the queue to be processed. With stream algorithms, generate insights by going over the data once; as the data grows linearly, the summary statistics should grow sub-linearly, at most logarithmically.

Heavy hitters (potential candidates for further investigation). s-heavy hitters: count(e) at time t > s * (length of the stream so far). Maintain a structure containing at most [ceiling(1/s) - 1] s-heavy-hitter elements. False positives are okay, but no false negatives. (Assumes the data is stationary.)
Window-based algorithms: windows based on an ordering attribute, on record counts (count-based windows), or on explicit markers.

DBMS vs DSMS:
Data nature: persistent, relatively static data vs read-only (append-only) data; a DSMS processes data to generate insights, not to modify the streams.
Storage: "unbounded" disk store (theoretically load the entire dataset into memory and generate insights) vs bounded main memory.
Access pattern: random access vs sequential access.
Computation model: query-driven processing model (pull-based) vs data-driven processing model (push-based).
Query answer: one-time query vs continuous query.
Update frequency: relatively low update rate vs possibly multi-GB arrival rate.
Transactions: ACID properties vs no transaction management.
Data precision: assumes precise data vs stale/imprecise data.

Lossy counting. N: current length of the stream; s in (0, 1): support threshold; ε in (0, 1): error. Output: elements with counter values exceeding (s - ε) * N. Rule of thumb: ε = 0.1 * s.
Guarantees: frequencies are underestimated by at most ε*N; no false negatives; false positives have true frequency at least (s - ε) * N; all elements exceeding frequency s*N will be output.
Example: the user is interested in identifying all items whose frequency is at least 0.1%, so s = 0.1%; assume ε = 0.01%. Guarantee 1: all elements with frequency > 0.1% will be returned. Guarantee 2: no element with frequency below 0.09% will be returned; false positives between 0.09% and 0.1% might or might not be output. Guarantee 3: all estimated frequencies are less than their true frequencies by at most 0.01% * N.
Lossy counting steps: (step 1) divide the stream into windows of size w = ceiling(1/ε). (Step 2) Go through the elements: if a counter exists, increase it by one; if not, create one and initialise it to one. (Step 3) At a window boundary, decrement all counters by 1; if a counter reaches zero, drop it.

Sticky sampling. A counting algorithm using a sampling approach: probabilistic sampling decides whether a counter for a distinct element is created; once a counter exists for an element, every future instance of that element is counted. N: current length of the stream; s in (0, 1): support threshold; ε in (0, 1): error; δ in (0, 1): probability of failure.
Guarantees: all items with true frequency exceeding s*N are output; no false negatives; no item whose true frequency is less than (s - ε) * N is output; estimated frequencies are less than the true frequencies by at most ε*N. The guarantees are the same as for lossy counting, except for the small probability δ that it fails to provide correct answers.
Sticky sampling steps: (step 1) dynamic window size, doubling with each new window, where t = (1/ε) * log(1/(s * δ)); W0 = 2t and Wk = 2^k * t for k >= 1. (Step 2) Go through the elements: if a counter exists, increase it; if not, create a counter with probability 1/r and initialise it to one. The sampling rate r starts at 1 and grows in proportion to the window size W. (Step 3) At a window boundary, go through each counter and repeatedly toss a coin: for each unsuccessful toss, decrement the counter; on a successful toss, move on to the next counter. If a counter becomes zero, drop it. Decrements happen probabilistically, so heavy hitters have a higher chance of staying in the dictionary.
Comparing lossy counting and sticky sampling on Unique and Zipf distributions (where the frequency of an item is inversely proportional to its rank in the frequency table): "sticky" refers to its tendency to remember every unique element that gets sampled, while lossy counting is good at pruning low-frequency elements quickly. For highly skewed data, both algorithms require much less space than their worst-case bounds.
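A compact sketch of the lossy counting steps above (window size, increment at each arrival, decrement at window boundaries). The stream, s and ε values are toy examples.

```python
import math

def lossy_counting(stream, epsilon):
    """Approximate counts; frequencies are underestimated by at most epsilon*N."""
    w = math.ceil(1 / epsilon)        # step 1: window size
    counters = {}
    for i, item in enumerate(stream, start=1):
        counters[item] = counters.get(item, 0) + 1     # step 2: increment or create
        if i % w == 0:                                  # step 3: window boundary
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]                   # prune low-frequency elements
    return counters

stream = list("aababcabadaaabb")
epsilon, s = 0.2, 0.3                                   # toy parameters
counts = lossy_counting(stream, epsilon)

# Output: elements whose counter exceeds (s - epsilon) * N.
print({k: v for k, v in counts.items() if v > (s - epsilon) * len(stream)})
```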

Kafka: a distributed event streaming platform that sends data from source to destination. Apache Storm: a computation graph where nodes are individual computations; nodes send data between one another in the form of tuples. A stream is an unbounded sequence of tuples. A spout is the source of a stream: it reads data from an external data source and emits tuples. Bolts subscribe to streams and act as stream transformers: they process incoming streams and produce new ones. A topology is deployed over a cluster of computation resources. Parallelism is achieved by running multiple replicas of the same spout/bolt.
Storm architecture: the MasterNode runs Nimbus, a central job master to which topologies are submitted. It is in charge of scheduling, job orchestration, communication and fault tolerance, distributes code around the cluster and assigns tasks to worker nodes. WorkerNodes are where applications are executed. Each WorkerNode runs a supervisor; each supervisor listens for work assigned to its worker node by Nimbus and starts/stops worker processes.
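A minimal producer/consumer sketch for sending events through Kafka. Using the third-party kafka-python client is my assumption (the lecture does not prescribe a client library), and the broker address and topic name are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"          # placeholder broker address
TOPIC = "clicks"                   # placeholder topic name

# Producer: push events from a source into the stream.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"click-{i}".encode())
producer.flush()

# Consumer: read the unbounded stream of records from the topic.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value.decode())
```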
Microprocessor speed (vertical scaling) stagnated: thermal barrier, system bottleneck, material limits

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework
and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of
nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth across the cluster.

Map-reduce for PageRank, Kmeans clustering & Gradient Descent


In the PageRank algorithm, the damping factor is typically denoted by the symbol α. The damping factor represents the probability that a user will continue clicking on links versus jumping to a random page.
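A plain-Python sketch of one PageRank MapReduce iteration, with in-memory stand-ins for the map, shuffle and reduce phases. The toy link graph and α = 0.85 are illustrative assumptions.

```python
from collections import defaultdict

alpha = 0.85                                         # assumed damping factor
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}    # toy link graph
ranks = {page: 1.0 / len(graph) for page in graph}

def mapper(page, out_links):
    # Emit (child, share of this page's rank) for each outgoing link.
    share = ranks[page] / len(out_links)
    return [(child, share) for child in out_links]

def reducer(page, shares):
    # Sum the incoming shares, then apply the damping factor.
    return page, (1 - alpha) / len(graph) + alpha * sum(shares)

# "Shuffle": group mapper output by destination page, then reduce per page.
grouped = defaultdict(list)
for page, links in graph.items():
    for child, share in mapper(page, links):
        grouped[child].append(share)

ranks = dict(reducer(page, shares) for page, shares in grouped.items())
print(ranks)   # ranks after one iteration; repeat until convergence
```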

Apache Hadoop includes three modules:


Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: Parallel processing of large data sets.

MAPPER: The key is what the data will be grouped on and the value is the information pertinent to
the analysis in the reducer.
REDUCER: The reducer takes the grouped data as input and runs a reduce function once per key
grouping. The function is passed the key and an iterator over all of the values associated with that
key. A wide range of processing can happen in this function, as we’ll see in many of our patterns. The
data can be aggregated, filtered, and combined in a number of ways.
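A classic word-count sketch of the mapper/reducer contract described above: the mapper emits (key, value) pairs, and the reducer is called once per key with an iterator over that key's values. These are plain-Python stand-ins, not Hadoop Streaming code.

```python
from collections import defaultdict

def mapper(line):
    # Key = the thing we group on (the word); value = info for the reducer (a count of 1).
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Called once per key grouping with all of that key's values.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group mapper output by key, then run the reducer per key.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

print(dict(reducer(w, c) for w, c in grouped.items()))
```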

Three features managed by Hadoop: Fault tolerance, data movement and lifecycle management

YARN: resource management and job scheduling layer of Hadoop.

ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and sending heartbeats to the ResourceManager/Scheduler.

The ResourceManager has two main components: Scheduler and ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications. The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

HDFS is a fault-tolerant distributed file system designed to run on commodity hardware. Moving computation is cheaper than moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks (64MB or 128MB) and these blocks are stored in a set of DataNodes. The blocks of a file are replicated for fault tolerance. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
