GCP Technologies
What is ACID? If you have transactional data coming in, you need to ensure that the database is ACID compliant:
1) Atomicity
2) Consistency
3) Isolation
4) Durability
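To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than any GCP service. The accounts table, the names and the transfer helper are made up for illustration. A failed transfer is rolled back as a whole, so no partial update is left behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit above (atomicity).
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails, so the debit is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Only the first transfer is visible afterwards; the second one leaves both balances exactly as they were.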
You have data coming in at petabyte scale, it is streamed in, relational in nature, and you will have to do analysis on top of it or create reports with this data: BigQuery
You have high-volume, structured but non-relational data, like time series streamed in from IoT devices etc. The dataset can become huge, but you have no idea how big; you might get a lot of writes and multiple updates in the database, and this data will be used for analysis: Bigtable
You have a cost constraint, data is streamed in and is non-relational, but you have just launched your web app/game, so you aren't sure how popular it will be (it might not even take off); this is all transactional data and no analysis is required: Datastore (it scales down very well, has really strong consistency, ACID transactions, and daily free quota limits)
Relational data coming in, streamed in, and you need to create reports: get the data into BQ and create reports in Data Studio. Data Studio can NOT be linked with Bigtable
Bigtable and Cloud Spanner are very expensive, but can store huge quantities of data
Datastore is really cheap, but can't store relational data, has no link with reporting, and is not used for analytics
Google Cloud Storage: Cheapest and highly available storage, but only for unstructured data. If you need cheap storage and are not sure about the usage of the data right now, store it here. Compliance data gets stored here
BigQuery is as cheap as Cloud Storage when it comes to data warehousing, so it is an excellent choice for streaming data and batch storage data that is relational in nature, i.e. structured data which will be used for analysis and reporting
Any limitation of BQ? Yes, it can't store data where the row size is more than 10 MB
Any limitation of Cloud SQL? Max 30 TB data size; anything above that, if it's relational, transactional data, put it in Cloud Spanner. A
Any limitation of Cloud Spanner? It's Google's proprietary technology, data is sparsely populated, globally consistent, petabyte
Cloud SQL supports both PostgreSQL and MySQL
Cloud Spanner is ideal for financial systems
Datastore has good integration with Google App Engine
Any issue with Bigtable? It can't store relational data, isn't used to store transactional data, is best for time series kind of data, to analyze, and is very expensive. Data is stored sparsely
Datastore: Best part is that it has a free tier; also, you are charged on the basis of the quantity of data returned, not the quantity of data queried
BQ: On-demand queries are charged on the basis of the quantity of data processed (scanned) by your query, not the quantity of data returned
BigQuery is also Google's proprietary database: columnar storage with SQL capabilities
Bigtable is used by Google to power its Search, Gmail, Maps etc. It supports BOTH streaming and batch data transfers
In Datastore, data is stored as key-value pairs in entities. There is no fixed schema, and every row can have a different schema
Use case
Static websites, storing images, video or audio data (unstructured data)
OLTP, transactional relational data with strong consistency
OLTP, transactional relational data with strong consistency, with a HUGE dataset
Transactional data with strong consistency, something that can easily scale down to 0
Transactional data with strong consistency, something that can easily scale up
SQL-based data warehouse
RAM attached to VMs, very expensive, very fast, but gets deleted when the machine restarts
tions, data > 10TBs. Since data is transactional so, you want strong consistency, that is records should be
nsactions, consistency, that is records should be consistent across database: Cloud Datastore/Filestore
's from mobile apps)
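The selection rules in these notes can be sketched as a small decision function. This is a hypothetical helper that only encodes the rules of thumb written above (including the 30 TB Cloud SQL limit quoted in the notes); it is not a GCP API, and real decisions depend on more factors such as cost and latency.

```python
def choose_storage(relational, transactional, needs_analytics,
                   size_tb=None, structured=True):
    """Suggest a storage product using the rules of thumb from these notes.

    All parameter names are illustrative assumptions, not GCP terminology.
    """
    if not structured:
        # Images, video, audio, archives, compliance data.
        return "Cloud Storage"
    if relational:
        if needs_analytics:
            # Relational data used for analysis and reporting.
            return "BigQuery"
        # Relational, transactional data: switch to Spanner past the
        # Cloud SQL size limit quoted in the notes (30 TB).
        if size_tb is not None and size_tb > 30:
            return "Cloud Spanner"
        return "Cloud SQL"
    # Structured but non-relational data.
    if needs_analytics:
        # High-volume time series / IoT data that will be analyzed.
        return "Bigtable"
    # Transactional app data that must scale down cheaply.
    return "Datastore"


print(choose_storage(relational=True, transactional=False, needs_analytics=True))
print(choose_storage(relational=False, transactional=True, needs_analytics=False))
```

The transactional flag is kept in the signature for readability even though, under these simplified rules, the relational and analytics flags already decide the outcome.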
Refer to the image below to see the availability SLA of storage options
nd redundancy
Kafka: Messaging and event-driven service for data ingestion and processing
Jupyter Notebook: Integrated tool for data exploration, analysis, visualization and machine learning
that flowcharts run linearly, whereas a DAG doesn't necessarily work linearly
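The difference between a linear flow and a DAG can be illustrated with Python's standard graphlib module. This is a minimal sketch with made-up step names; it is not tied to any GCP service. Steps whose dependencies are all finished become ready together, so two branches can run side by side instead of one after the other.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dag = {
    "read": set(),
    "clean": {"read"},
    "count_words": {"clean"},
    "count_lines": {"clean"},
    "write_report": {"count_words", "count_lines"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()

stages = []
while sorter.is_active():
    # Every step returned here has all of its dependencies satisfied,
    # so the steps within one stage could run in parallel.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

The third stage contains two steps at once, which a strictly linear flowchart could not express.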
tation
that PCollection, and produce the output PCollection
Cloud Dataproc: A managed Hadoop offering that supports Spark. It includes Hive and Pig as well
No ops required: Just create cluster, use it and then turn it off
Don't store data in HDFS since it becomes expensive
It's ideal for moving existing code to GCP
Create clusters using compute engine VMs
Need at least 1 master and 2 worker nodes
You can use preemptible machines, but not as a master node
Remember that preemptible machines have to be released in 30 seconds, whenever GCP wants it back, and it will definitely be
Use preemptible machines for operations, not for storing data
You can't make cluster only with preemptible machines
The minimum persistent disk size you can have is 100 GB
Initialization scripts can be stored on Git or GCS
You can work with Dataproc from the command-line interface, the GCP Console, or programmatically
You can run as root
Clusters can be easily scaled up or down, even while jobs are running
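The cluster rules listed above can be checked with a small validator. This is a hypothetical sketch: ClusterSpec and validate are invented names, not part of the Dataproc API, and the sketch assumes that the 2 required workers must be regular (non-preemptible) machines, which is how the notes read.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Illustrative cluster description; field names are assumptions."""
    masters: int
    workers: int                    # regular (non-preemptible) workers
    preemptible_workers: int = 0
    master_preemptible: bool = False
    disk_gb: int = 100


def validate(spec):
    """Return a list of rule violations based on the notes above."""
    problems = []
    if spec.masters < 1:
        problems.append("need at least 1 master node")
    if spec.workers < 2:
        # Also covers the rule that a cluster can't be preemptible-only.
        problems.append("need at least 2 regular worker nodes")
    if spec.master_preemptible:
        problems.append("the master node can't be preemptible")
    if spec.disk_gb < 100:
        problems.append("persistent disk must be at least 100 GB")
    return problems


ok = ClusterSpec(masters=1, workers=2, preemptible_workers=4)
bad = ClusterSpec(masters=1, workers=0, preemptible_workers=4, disk_gb=50)
print(validate(ok))
print(validate(bad))
```

Preemptible workers are accepted in any number as long as the regular workers are present, matching the advice to use them for processing and not for storing data.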