GCP Technologies

Google Cloud Storage is one of the most important and widely used services of GCP. It provides blob-type storage for unstructured data such as audio, video, and images. Datastore and Firestore provide NoSQL database services for transactional, non-relational data with strong consistency. BigQuery is the data warehouse service that provides SQL capabilities for analysis on large datasets. Dataflow is used for processing streaming data in flight using directed acyclic graphs.


Products | Storage Type | Corresponding technologies

Google Cloud Storage | Blob storage | Media storage (audio/video/images)
Cloud SQL | OLTP | MySQL / PostgreSQL
Cloud Spanner | OLTP | Horizontally scalable relational database (Google proprietary)
Datastore | NoSQL (document / key-value pairs) | MongoDB
Bigtable | NoSQL (wide column, key-value pairs) | HBase
BigQuery | OLAP data warehouse | Hive
Memcache | Cache storage | RAM (Memcached)

Let's start by talking about some simple use cases:


You have data coming in for long term storage, which might be used for analysis, logging, reporting, long term legal archives: Google Cloud Storage
You have transactional data, which is relational in nature, coming in to keep record of transactions, data in GBs. Since the data is transactional you want strong consistency, that is, records should be consistent across the database: Cloud SQL
You have transactional data, which is relational in nature, coming in to keep record of transactions, data > 10 TB. Since the data is transactional you want strong consistency, that is, records should be consistent across the database: Cloud Spanner
You have transactional data, which is NOT relational in nature, coming in to keep record of transactions, with consistency, that is, records should be consistent across the database: Cloud Datastore/Firestore
(Firestore is the next version of Datastore, so both are essentially the same thing; use Firestore in case the data comes from mobile apps)

What is ACID? In case you have transactional data coming in, you need to ensure that the databases are ACID in nature, that is:
1) Atomicity
2) Consistency
3) Isolation
4) Durability
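To make the four properties concrete, here is a minimal transaction sketch; it uses Python's built-in sqlite3 purely for illustration (with Cloud SQL you would connect through a MySQL or PostgreSQL driver instead), and the accounts table and amounts are made up:

    import sqlite3

    # Illustrative only: sqlite3 stands in for a relational database; the
    # transaction idea is the same with Cloud SQL (MySQL/PostgreSQL).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()

    try:
        # Atomicity: both updates succeed together or not at all
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        conn.commit()          # Durability: once committed, the change survives
    except sqlite3.Error:
        conn.rollback()        # Consistency: a failed transfer leaves balances untouched

    print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 70), (2, 80)]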
You have data coming in petabytes, it is getting streamed in, relational in nature, and you will have to do analysis on top of that, or create reports with this data: BigQuery
You have high volume, structured, but non-relational data, like time series streamed in from IoT devices etc. The dataset can become huge, but you have no idea how big, you might have to handle a lot of writes and multiple updates in the database, and this data will be used for analysis: Bigtable
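A minimal sketch of writing one time-series cell to Bigtable with the Python client; the project, instance, table and column family names below are placeholders, not real resources:

    import datetime, time
    from google.cloud import bigtable
    from google.cloud.bigtable import column_family

    client = bigtable.Client(project="my-project", admin=True)   # placeholder project
    instance = client.instance("iot-instance")
    table = instance.table("sensor-readings")

    # Create the table with one column family if it does not exist yet
    if not table.exists():
        table.create(column_families={"metrics": column_family.MaxVersionsGCRule(1)})

    # Row keys for time series are usually "entity#timestamp" so related rows sort together
    row_key = f"sensor42#{int(time.time())}".encode()
    row = table.direct_row(row_key)
    row.set_cell("metrics", b"temperature", b"21.5",
                 timestamp=datetime.datetime.utcnow())
    row.commit()   # single-row writes are atomic in Bigtable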

You have a cost constraint, data getting streamed in, non-relational data, but you have just launched your web app/game, so you aren't sure how popular it will be, it might not even take off, and this is all transaction data with no analysis required; data gets stored in: Datastore (it has an excellent scale-down facility, really strong consistency in the database, ACID transactions, and daily free quota limits; see the sketch below)
Relational data coming in, streamed in, need to create reports: Get the data into BQ, create reports in Data Studio. Data Studio can NOT be linked with Bigtable
Bigtable and Cloud Spanner are very expensive, but can store huge quantities of data
Datastore is really cheap, but can't store relational data, has no link with reporting, and is not used for analytics
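A minimal Datastore sketch with the Python client, showing how a schemaless entity is written and read back; the project id, kind and properties are made-up examples:

    from google.cloud import datastore

    client = datastore.Client(project="my-project")   # placeholder project id

    # Entities are schemaless: each one can carry different properties
    key = client.key("GameScore")                      # incomplete key, id assigned on put
    entity = datastore.Entity(key=key)
    entity.update({"player": "neha", "score": 120, "level": 3})
    client.put(entity)                                 # ACID write, strongly consistent by key

    # Read it back by key (strongly consistent lookup)
    print(client.get(entity.key))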

Google Cloud Storage: Cheapest and highly available, but only for unstructured data. If you need cheap storage and are not sure about the usage of the data right now, store it here. Compliance data gets stored here
BigQuery is as cheap as Cloud Storage when it comes to data warehousing, so it is an excellent case for streaming data and batch storage data that is relational in nature, structured data which will be used for analysis and reporting
Any limitation of BQ? Yes, it can't store data where the row size is more than 10 MB
Any limitation of Cloud SQL: Max 30 TB data size; anything above that, if it's relational, transactional data, put it in Cloud Spanner. And Cloud SQL is available only at regional level
Any limitation of Cloud Spanner: It's Google's proprietary technology, data is sparsely populated, globally consistent, petabytes of capacity, but not for analysis
Cloud SQL supports both PostgreSQL and MySQL
Cloud Spanner is ideal for financial systems
Datastore has good integration with Google App Engine
Any issue with Bigtable? Can't store relational data, isn't used to store transactional data, best for time series kind of data, to analyze, and is very expensive. Data is stored sparsely
Datastore: The best part is that it has a free tier; also you are charged on the basis of the quantity of data returned, not the quantity of data queried
BQ: Charged on the basis of the quantity of data scanned (processed) by your query, not how much data is returned
BigQuery is also Google's proprietary database, columnar storage, with SQL capabilities (see the query sketch below)
Bigtable is used by Google to power its Search, Gmail, Maps etc. It supports BOTH streaming and batch data transfers
In Datastore, data is stored on the basis of key-value pairs in entities. There is no fixed schema, and every row can be of a different schema and different size
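To make the BigQuery notes above concrete, here is a minimal query sketch with the Python client; it assumes the public bigquery-public-data.samples.shakespeare table and a placeholder project id:

    from google.cloud import bigquery

    # Billing is based on bytes scanned by the query, so select only what you need
    client = bigquery.Client(project="my-project")     # placeholder project id

    query = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    for row in client.query(query):                    # runs the job and iterates results
        print(row.corpus, row.total_words)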
Use case (one per product, in the order of the table above)
Static websites, storing images, video or audio data (unstructured data): Google Cloud Storage
OLTP, transactional relational data with strong consistency: Cloud SQL
OLTP, transactional relational data with strong consistency, with a HUGE dataset: Cloud Spanner
Transactional data with strong consistency, something that can easily scale down to 0: Datastore
Transactional data with strong consistency, something that can easily scale up: Bigtable
SQL based data warehouse: BigQuery
RAM attached to VMs, very expensive, very fast, but gets deleted when the machine restarts: Memcache

GCS: Google Cloud Storage is one of the most important and widely used services of GCP
Storage classes: Multi-Regional, Regional, Nearline, Coldline, and Archive
Multiregional: For frequently accessed data, stored in multiple regions for better availability and redundancy
Regional: For frequently accessed data, but not as highly available as multiregional
Nearline: For data which is not accessed as frequently, maybe once a month. Storage is cheaper, access is more expensive. 30 days minimum storage
Coldline: For data that is accessed once a year, for archival storage, maybe like for audits etc. Storage is cheapest, access is more expensive, 90 days minimum storage
Archival: Lowest cost for data archiving, 365 days minimum
You can store an object of size 5TB at max
Data that is accessed frequently is called hot data
GCS data is stored in buckets
Data is stored in the form of objects, and each object has its own URL
Every GCS bucket has to be stored with a GLOBALLY unique name
When the same piece of data is stored again within GCS, it replaces the earlier object
You can switch on versioning if you want to store multiple versions of data
Each version of the data is stored within the bucket and costs the same per object
You can put lifecycle management rules on the bucket, where you can specify to delete objects after a specific period of time
Lifecycle management rules are supplied to GCS as a JSON file (see the sketch below)
You can't get deleted objects back in GCS, deletion is permanent
It has strong global consistency
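A minimal sketch of the bucket operations described above (upload, versioning, a lifecycle delete rule) using the Python client; the project, bucket and object names are placeholders, and bucket names must be globally unique:

    from google.cloud import storage

    client = storage.Client(project="my-project")          # placeholder project id
    bucket = client.bucket("my-unique-media-bucket")       # placeholder bucket name

    # Upload an object; every object gets its own URL under the bucket
    blob = bucket.blob("videos/intro.mp4")
    blob.upload_from_filename("intro.mp4")

    # Turn on versioning so re-uploading the same name keeps older generations
    bucket.versioning_enabled = True
    bucket.patch()

    # Lifecycle rule: delete objects 365 days after creation
    bucket.add_lifecycle_delete_rule(age=365)
    bucket.patch()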

Refer to the image below to see the availability SLA of storage options


Products | Type | Corresponding technologies | Use case
Dataproc | Managed Hadoop offering | Hadoop, Spark | Making data processing easier and faster through processing on clusters
Dataflow | In-flight data ETL offering | Apache Beam | Serverless streaming and batch processing platform
BigQuery | Data warehouse providing SQL capabilities to query | Hive | OLAP based, highly scalable, multi-cloud data warehouse
Cloud Pub/Sub | Helps ingest streaming live data and then hand it over to other GCP tools | Kafka | Messaging and event driven service for data ingestion and processing
Cloud Datalab | Tool based on Jupyter to write your own ML algorithms | Jupyter Notebook | Integrated tool for data exploration, analysis, visualization and machine learning
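A minimal sketch of publishing a message to Cloud Pub/Sub with the Python client, to show the ingestion side of the table above; the project id, topic name and message contents are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "clickstream")   # placeholder names

    # Publish a message; subscribers (e.g. a Dataflow pipeline) pull it downstream
    future = publisher.publish(topic_path,
                               data=b'{"user": "neha", "action": "login"}',
                               source="web")          # attributes are plain strings
    print(future.result())                            # server-assigned message id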

NOT very important from exam perspective, just read:


Dataflow in some more detail
You have real time data streaming in, which needs to be processed in flight: Dataflow
Think about it this way: a graph is created, a graph of all the processes that will act upon the data, which will be ingested from a source and then written to a sink: Dataflow
Dataflow is based on Apache Beam
All transformations within Dataflow are done with the help of DAGs (Directed Acyclic Graphs)
Some important terms that you need to know about Dataflow: pcollection, transforms and pipeline
Think of a DAG as a flowchart; data needs to pass through the whole flowchart to be processed. The only difference is that flowcharts run linearly, whereas a DAG doesn't necessarily work linearly
A pipeline is a DAG as a whole, repeatable jobs from start to finish; look at figure one to see one pipeline representation
A transform takes one or more pcollections, performs the processing function that you provide on the elements of that pcollection, and produces the output pcollection
Driver: Defines the computation DAG
Runner: Executes the DAG on the backend
Transforms never change the input pcollection; they receive the input pcollection and make changes in the output pcollection
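A minimal runnable sketch of these terms using the Apache Beam Python SDK on the local DirectRunner (on Dataflow you would add DataflowRunner pipeline options); the sample data is made up:

    import apache_beam as beam

    # The whole "with" block is the pipeline (the DAG as a whole)
    with beam.Pipeline() as pipeline:
        events = pipeline | "Source" >> beam.Create(
            ["gcs,10", "bq,25", "gcs,5"])                        # input PCollection

        totals = (
            events
            | "Parse" >> beam.Map(lambda line: (line.split(",")[0],
                                                 int(line.split(",")[1])))
            | "SumPerKey" >> beam.CombinePerKey(sum)             # transform -> new PCollection
        )

        totals | "Sink" >> beam.Map(print)                       # write to a sink (stdout here)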

Fig1: Data Flow Pipeline


Cloud Dataproc: A managed Hadoop offering, supports Spark. It includes Hive and Pig as well
No ops required: Just create cluster, use it and then turn it off
Don't store data in HDFS since it becomes expensive
It's ideal for moving existing code in GCP
Create clusters using compute engine VMs
Need at least 1 master and 2 worker nodes
You can use preemptible machines, but not as a master node
Remember that preemptible machines have to be released within 30 seconds whenever GCP wants them back, and they will definitely be reclaimed within 24 hours
Use preemptible machines for operations, not for storing data
You can't make cluster only with preemptible machines
The minimum persistent disk that you can have is 100 GB
Initialization scripts can be on git, or GCS
Can work with Dataproc from the Command Line Interface, the GCP Console, or programmatically (see the sketch below)
You can run as root
Clusters can be easily scaled up or down, even while jobs are running

Operations of scaling are:
1) Add workers
2) Remove workers
3) Add HDFS storage
For high availability clusters, use 3 master nodes rather than 1; they run Apache ZooKeeper for automatic failover
You can also have single node clusters, with 1 node acting as both master and worker, but it can't be preemptible; it can be used for learning
Dataproc jobs do not restart on failure; you can optionally change this for streaming and long running jobs
It has connectors for both BQ and Bigtable
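A minimal sketch of creating a Dataproc cluster programmatically with the Python client, matching the 1 master + 2 workers minimum above; the project id, region and cluster name are placeholders:

    from google.cloud import dataproc_v1

    project_id, region = "my-project", "us-central1"        # placeholder values
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "exam-prep-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            # preemptible/secondary workers could be added here, but never as the master
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    operation.result()   # blocks until the cluster is running; delete it when jobs finish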