GCP Q and A
1. What is GCP Compute Engine, and how does it differ from App Engine?
Key Differences:
Compute Engine = Full control over VMs; you manage the infrastructure.
App Engine = Managed service for running apps; Google manages the infrastructure.
VM Instances: Virtual machines you can configure with desired resources (CPU,
RAM, OS, etc.).
Machine Types: Specifies the type and size of the VM (e.g., standard, compute-
optimized).
Persistent Disks: Block storage that is attached to VMs for data persistence.
Images: Pre-configured OS or application environments used to launch VMs.
Instance Groups: Groups of VMs that can be managed together for scalability.
Networks and Firewalls: Configures networking rules to control traffic to/from
VMs.
Snapshots: Backups of VM disks that can be used for recovery.
1. Standard Machine Types (N1, N2, N2D): Balanced for general workloads (CPU and
memory).
2. Compute-Optimized Machine Types (C2): High-performance VMs for CPU-
intensive tasks.
3. Memory-Optimized Machine Types (M1, M2): High-memory configurations for
memory-intensive tasks.
4. Accelerator-Optimized Machine Types (A2): VMs with GPUs for machine learning
and graphics processing.
5. Custom Machine Types: Custom-configured VMs where you define the exact
amount of CPU and memory.
Preemptible VMs:
o Short-lived, temporary VMs that Google can shut down with a 30-second
warning.
o They are cheaper (costs about 70-80% less).
o Ideal for fault-tolerant workloads (e.g., batch processing).
Regular VMs:
o Can run indefinitely until stopped or terminated.
o You are billed for the time the VM is running.
o Suitable for persistent and mission-critical workloads.
Persistent Disks:
o Compute Engine uses persistent disks to provide durable storage.
o Standard Persistent Disks: Block storage for general-purpose workloads.
o SSD Persistent Disks: Faster storage for high-performance workloads.
o Persistent disks are independent of VMs. Data remains even if the VM is
stopped.
o You can resize or attach multiple disks to a single VM.
Local SSDs: Temporary storage that provides high-speed storage but data is lost
when the VM is stopped.
You can also use Google Cloud Storage for object-based storage, but it’s more suited
for large files and blobs.
6. How do you create a VM instance in GCP Compute Engine using the Console and
gcloud CLI?
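In the Console, go to Compute Engine > VM instances > Create instance and fill in the name, region/zone, machine type, and boot disk. A minimal gcloud sketch (the instance name, zone, machine type, and image are placeholders):
gcloud compute instances create my-vm \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --image-family=debian-12 \
    --image-project=debian-cloud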
9. How do you create and use custom machine types in Compute Engine?
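A minimal gcloud sketch that defines a custom machine type with 4 vCPUs and 8 GB of memory (all names and values are placeholders):
gcloud compute instances create my-custom-vm \
    --zone=us-central1-a \
    --custom-cpu=4 \
    --custom-memory=8GB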
10. What are startup scripts in Compute Engine, and how do you use them?
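Startup scripts are commands that run automatically when the VM boots, usually passed through instance metadata. A sketch with a hypothetical script that installs nginx (you can also point to a local file with --metadata-from-file startup-script=startup.sh):
gcloud compute instances create my-vm \
    --zone=us-central1-a \
    --metadata=startup-script='#! /bin/bash
apt-get update
apt-get install -y nginx'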
12. What are the different network tiers available in Compute Engine?
1. Premium Tier:
o Best performance and lowest latency for services.
o Uses Google’s global network infrastructure.
o Ideal for latency-sensitive applications like video streaming or gaming.
2. Standard Tier:
o More cost-effective but provides higher latency and lower performance.
o Traffic routes through public internet and not Google’s private network.
o Suitable for non-latency-sensitive applications.
14. How do you configure firewall rules for a Compute Engine VM?
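A minimal gcloud sketch that allows inbound HTTP (TCP port 80) from anywhere to VMs carrying the http-server network tag (rule name, network, and tag are placeholders):
gcloud compute firewall-rules create allow-http \
    --network=default \
    --allow=tcp:80 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=http-server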
15. How does Compute Engine support VPN and hybrid connectivity?
1. Cloud VPN:
o Google Cloud’s Cloud VPN allows you to securely connect your on-premises
network to your GCP VPC over an IPsec VPN tunnel.
o This is typically used for hybrid cloud setups where you have workloads both
on-premises and in Google Cloud.
2. Cloud Interconnect:
o Dedicated Interconnect provides a private, high-performance connection
between your on-premises network and Google Cloud.
o Partner Interconnect allows you to connect through a service provider for
lower-cost options.
3. Hybrid Connectivity:
o Hybrid Cloud setups can connect on-premises systems to Google Cloud
using Cloud VPN or Cloud Interconnect for high availability, low latency,
and secure communication.
4. Peering:
o You can also connect multiple VPCs (on Google Cloud or between regions)
using VPC Peering for inter-project and cross-region communication.
18. What are the different types of load balancers supported by Compute Engine?
19. What is the difference between managed and unmanaged instance groups?
21. How do committed use discounts (CUDs) help reduce costs in Compute Engine?
1. Committed Use Discounts (CUDs) offer significant savings when you commit to
using Compute Engine resources for a 1-year or 3-year term.
2. You can receive up to 70% off the standard pricing for Compute Engine instances.
3. The savings are based on:
o Machine types (e.g., n1-standard, e2-series).
o Regions where the instances are deployed.
4. Flexible Discounts:
o CUDs can be applied to both vCPUs and RAM resources.
o You can combine CUDs with other discounts, like sustained use discounts.
22. How do you monitor the performance of Compute Engine instances?
23. How can you reduce VM costs using sustained use discounts?
1. Sustained Use Discounts are automatically applied when you use a VM instance for
a significant portion of the month (over 25% of the time).
2. The longer the instance runs in a given month, the larger the discount:
o Discount starts after the first 25% usage.
o The discount increases progressively based on usage from 25% to 100%.
3. How to Benefit:
o If your VMs run continuously for most of the month, they will automatically
get discounts of up to 30%.
o This is automatically applied, and you don’t need to do anything extra.
4. Example:
o If you run an instance for 100% of the month, you may receive a 30%
discount on the usage for that instance.
25. What are Shielded VMs, and how do they enhance security in Compute Engine?
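Shielded VM features (Secure Boot, vTPM, and integrity monitoring) protect against boot- and kernel-level tampering and can be enabled at creation time; a minimal gcloud sketch, assuming an image that supports Shielded VM (most current public images do):
gcloud compute instances create my-shielded-vm \
    --zone=us-central1-a \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring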
1. What is GCP Cloud Storage, and how does it differ from Persistent Disk?
Key Differences:
Cloud Storage: Object storage for unstructured data, accessible globally via HTTP.
Persistent Disk: Block storage for VM instances, used within the Compute Engine
environment.
1. Standard:
o Designed for frequent access.
o Best for data that is actively used and changed frequently.
o Low latency and high throughput.
2. Nearline:
o Designed for infrequent access (accessed once a month or less).
o Lower cost than Standard storage but higher retrieval costs.
o Ideal for backup and long-term storage.
3. Coldline:
o Designed for rare access (accessed roughly once a quarter or less).
o Even lower cost than Nearline but higher retrieval costs.
o Ideal for archival storage or disaster recovery.
4. Archive:
o Designed for long-term archival storage.
o Lowest-cost storage with the highest retrieval costs.
o Best for data that is never accessed or rarely accessed.
3. How does Cloud Storage ensure data durability and availability?
1. Durability:
o Google Cloud Storage provides 99.999999999% (11 9's) durability over a
given year.
o It automatically replicates data across multiple availability zones to ensure
durability and prevent data loss due to failures.
o Uses erasure coding and replication to safeguard against data loss.
2. Availability:
o Data in Cloud Storage is available 24/7 with high uptime.
o For multi-regional storage, it replicates data across multiple regions to provide
high availability in case of regional failures.
o For regional storage, it replicates data across multiple zones within a region
for availability and fault tolerance.
A storage bucket is a container in Google Cloud Storage where you store your
objects (files).
Buckets serve as the top-level organizational unit for storing data.
Key features:
o Buckets are globally unique.
o Each bucket has a specific location (region or multi-region) and storage class.
o Objects within the bucket can be managed with access controls, lifecycle
policies, and other configurations.
Example:
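A minimal sketch of creating a bucket with gsutil (the name, location, and storage class are placeholders; bucket names must be globally unique):
gsutil mb -l us-central1 -c standard gs://my-globally-unique-bucket-name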
Bucket Versioning allows you to store and access previous versions of objects in a
bucket.
o When enabled, every time an object is overwritten or deleted, the previous
version is retained.
o You can restore previous versions of objects if needed.
How it helps in data recovery:
o It provides a safety net for accidental overwrites or deletions.
o You can easily revert to an earlier version of a file, ensuring data integrity and
recovery from mistakes.
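Versioning is enabled per bucket; a minimal gsutil sketch (the bucket name is a placeholder):
gsutil versioning set on gs://your-bucket-name
gsutil versioning get gs://your-bucket-name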
Lifecycle Rules allow you to automatically manage your objects based on their age or
other conditions. They help in reducing costs and automating data management tasks,
such as archiving or deletion.
Types of Lifecycle Rules:
o Set Storage Class: Move objects to a different storage class (e.g., from
Standard to Coldline) based on age.
o Delete Objects: Delete objects after a specified number of days.
o Custom Rules: Set conditions like time-based or age-based actions.
How to Configure Lifecycle Rules:
o Using the GCP Console:
1. Go to Cloud Storage and open the bucket.
2. Click on Lifecycle tab and click Add a Rule.
3. Define the condition (e.g., age of object) and the action (e.g., move to
Coldline).
4. Save the rule.
o Using gcloud CLI: You can create a lifecycle configuration file and apply it
with gsutil:
{
"rule": [
{
"action": {"type": "Delete"},
"condition": {"age": 365}
}
]
}
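To apply this configuration (assuming it is saved locally as lifecycle.json):
gsutil lifecycle set lifecycle.json gs://your-bucket-name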
10. How do you delete a bucket in Cloud Storage, and what are the prerequisites?
Prerequisites:
o The bucket must be empty before it can be deleted.
o If the bucket contains objects or versions, you need to delete them first.
Using the GCP Console:
1. Go to Cloud Storage and open the bucket.
2. Click on Delete at the top of the page.
3. Confirm the bucket name and click Delete.
Using gcloud CLI: You can delete a bucket with the following command:
gsutil rb gs://your-bucket-name
If the bucket is not empty, you can first delete the objects:
gsutil -m rm -r gs://your-bucket-name/**
Then, delete the bucket:
gsutil rb gs://your-bucket-name
11. What are signed URLs, and how do they work in Cloud Storage?
Signed URLs are temporary, time-limited URLs that grant access to private objects
in Cloud Storage without requiring authentication.
How they work:
o You can generate a signed URL for an object to give a user or application
temporary access (read or write).
o The URL is signed with a secret key and includes an expiration timestamp.
Once the expiration is reached, the URL becomes invalid.
o Typically used for sharing private objects with specific users or allowing
access to objects without needing them to authenticate with Google Cloud.
How to generate a signed URL:
o Use the gsutil signurl or gcloud commands or the Storage API.
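A minimal gsutil sketch, assuming a service account key file is available locally (path, duration, and object are placeholders):
gsutil signurl -d 1h /path/to/service-account-key.json gs://your-bucket-name/your-object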
12. How do you grant access to specific users for a Cloud Storage bucket?
Bucket-level (uniform) access vs. object-level (fine-grained) access:
o Use Case: Bucket-level access suits a simple setup where you want a consistent access model for all objects in a bucket; object-level access offers more granular control, useful when different permissions are needed for different objects.
o Typical Roles: roles/storage.legacyBucketReader (bucket-level) vs. roles/storage.objectViewer and roles/storage.objectAdmin (object-level).
1. List all versions of the object (including deleted ones):
gsutil ls -a gs://your-bucket/your-object
2. Restore the object: To restore a deleted version, copy it back to the same
bucket:
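A sketch of the restore step, using a generation number taken from the gsutil ls -a output above:
gsutil cp gs://your-bucket/your-object#<GENERATION_NUMBER> gs://your-bucket/your-object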
17. What is the difference between customer-managed encryption keys (CMEK) and
Google-managed encryption keys?
Google-managed Encryption Keys (GMEK):
o Key Control: Google controls and rotates the keys.
o Use Case: Suitable for most users where ease of use is a priority.
Customer-managed Encryption Keys (CMEK):
o Key Control: The customer has control over key creation, rotation, and revocation.
o Use Case: Suitable for users with specific compliance or security requirements.
18. How do you restrict access to a Cloud Storage bucket using IAM roles?
IAM Roles in Cloud Storage are used to control who can access the resources and
what actions they can perform.
Steps to Restrict Access:
1. Go to IAM & Admin in the Google Cloud Console.
2. Select Add to add a new member (user, service account, or group).
3. Choose the role you want to assign (e.g., roles/storage.objectViewer for read
access or roles/storage.objectAdmin for full control).
4. Specify the resource (bucket) the role applies to.
Common IAM Roles:
o roles/storage.admin: Full access to the Cloud Storage resources.
o roles/storage.objectViewer: Read-only access to objects.
o roles/storage.objectCreator: Permission to upload objects.
Bucket-Level Permissions: By setting roles at the bucket level, you restrict users to
only access the resources within that bucket.
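An equivalent gsutil sketch for granting a single user read access to one bucket (the email address and bucket name are placeholders):
gsutil iam ch user:jane@example.com:roles/storage.objectViewer gs://your-bucket-name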
19. How do you set up VPC Service Controls to protect Cloud Storage data?
VPC Service Controls help protect data by creating a security perimeter around GCP
services, including Cloud Storage, to prevent data exfiltration.
Steps to Set Up VPC Service Controls:
1. Go to the VPC Service Controls page in the Google Cloud Console.
2. Click Create Perimeter.
3. Define the perimeter by selecting the services (such as Cloud Storage) and
resources (e.g., projects or buckets) you want to protect.
4. Set access levels to define which services and resources can access the data
within the perimeter.
5. Enable Access Context Manager to define the policies governing access to
the perimeter.
Use Case:
o This helps prevent data leakage and unauthorized access from outside the
defined perimeter, especially in scenarios involving sensitive data.
20. What is the purpose of the Storage Transfer Service in GCP?
Storage Transfer Service is a fully managed service that allows you to transfer data
into Cloud Storage from:
o On-premises systems (local storage, file servers).
o Other cloud providers (AWS S3, Azure Blob Storage).
o Another Cloud Storage bucket.
Key Features:
o Supports bulk transfers of large datasets.
o Can schedule transfers for regular data migration.
o Allows for filtering, pre-transfer checks, and transfer verification.
Use Cases:
o Migrating large datasets to Cloud Storage from various sources.
o Synchronizing data between Cloud Storage buckets.
o Automating regular backups from on-premises or other cloud systems.
21. How do you optimize Cloud Storage costs using lifecycle policies?
Lifecycle Policies in Cloud Storage allow you to automatically manage the transition
and deletion of objects based on their age or other criteria.
Key features:
o Object Transition: Automatically move objects to a less expensive storage
class (e.g., from STANDARD to NEARLINE or COLDLINE) after a certain
period.
o Object Deletion: Automatically delete objects after a set time (e.g., delete
files older than 1 year).
How to set up a lifecycle policy:
1. Go to Cloud Storage and select the bucket.
2. In the bucket details, go to the Lifecycle tab.
3. Add a rule based on conditions (e.g., age of the object or storage class).
Example use case:
o Archive old data in Coldline after 30 days.
o Delete objects that are older than 365 days.
This helps save costs by automatically managing the storage class and deleting unused data.
22. What is the effect of enabling Requester Pays on a Cloud Storage bucket?
Requester Pays enables a Cloud Storage bucket to charge the requester (not the
bucket owner) for access and operations on the objects in the bucket.
Effects:
o The requester (e.g., users or services accessing the bucket) is billed for the
data retrieval, storage operations, and network egress costs.
o The bucket owner is not charged for these operations.
How it works:
o When the requester tries to access an object in the bucket, the Requester Pays
feature requires them to specify the billing project that will cover the costs.
Use case:
o Commonly used when data is shared publicly, but the bucket owner wants to
avoid paying for access requests.
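A minimal gsutil sketch: the owner enables the feature, and a requester then supplies a billing project with the global -u flag (names are placeholders):
gsutil requesterpays set on gs://your-bucket-name
gsutil -u <BILLING_PROJECT_ID> cp gs://your-bucket-name/some-object .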
23. How do you monitor and analyze Cloud Storage usage and access logs?
Cloud Storage Usage and Access Logs can be analyzed using several tools:
1. Cloud Audit Logs:
Logs all admin activity (e.g., changes to IAM roles or bucket
settings).
Available in the Cloud Logging interface.
2. Storage Access Logs (Access Transparency):
Can be enabled for bucket access logs to track detailed data access
operations (e.g., GET, PUT, DELETE requests).
Logs include IP address, requestor identity, and response status.
Enable logging in the Bucket Details page.
3. Cloud Monitoring:
Use Cloud Monitoring to set up dashboards and alerts for bucket
usage, storage class changes, and network egress.
4. BigQuery:
Export Cloud Storage logs to BigQuery for detailed analysis and
long-term storage.
24. What are the advantages of using Cloud Storage over traditional file storage
systems?
o Cost: Cloud Storage lets you pay only for what you use with flexible pricing, whereas traditional file storage carries fixed hardware and operational costs.
o Backup and Redundancy: Cloud Storage offers built-in redundancy with geo-replication, whereas traditional systems often require manual backup solutions.
25. How does Cloud CDN work with Cloud Storage to improve performance?
Cloud CDN (Content Delivery Network) caches content closer to users, reducing
latency and improving performance when accessing Cloud Storage objects.
How it works with Cloud Storage:
1. When a user requests an object stored in Cloud Storage, Cloud CDN caches
the object at edge locations closer to the user.
2. Future requests are served directly from the edge cache, improving
performance by reducing latency and load on the origin bucket.
3. Cloud CDN uses cache keys (like URL paths) to determine when to fetch new
data from Cloud Storage or serve cached content.
Benefits:
o Reduced Latency: By caching content closer to the user, the access time
decreases significantly.
o Cost Savings: Reduces egress traffic from Cloud Storage, as Cloud CDN
serves cached content.
o Global Reach: Ensures fast content delivery across different regions.
How to set it up:
1. Create a Cloud Storage bucket to store content.
2. Enable Cloud CDN for the Cloud Storage bucket through the Google Cloud Load Balancer (using the bucket as a backend bucket).
1. What is GCP BigQuery, and how does it differ from traditional databases?
1. Projects:
o A project is a top-level container for organizing resources and managing
permissions.
2. Datasets:
o A dataset is a container within a project that holds tables, views, and other
resources.
3. Tables:
o Tables are collections of structured data in rows and columns. These can be
queried using SQL.
4. Views:
o Views are virtual tables that contain the result of a query. They don’t store
data, but you can reference them like tables.
5. Jobs:
o Jobs represent tasks like running queries or loading data. They are tracked and
can be monitored in the BigQuery Console.
6. Query Engine:
o The query engine executes SQL queries on BigQuery, using distributed
computing to process large datasets.
3. What does storage and compute separation mean in BigQuery?
Storage and compute separation means that BigQuery's storage and compute
resources are decoupled, allowing them to scale independently.
o Storage: BigQuery stores your data in columnar format in the cloud. Storage
is automatically scaled to accommodate large datasets, and you pay only for
the amount of data stored.
o Compute: Compute resources are provisioned dynamically to run queries, and
you pay only for the compute resources used during query execution. These
resources scale based on query complexity and data volume.
Advantages:
o Cost Efficiency: You are only billed for the storage you use and the compute
you consume during queries.
o Scalability: BigQuery can scale storage and compute independently,
optimizing resource usage and performance.
1. Native Tables:
o Native tables are the default type of tables in BigQuery. They store data in a
columnar format and are fully managed by BigQuery.
o They can be created via the BigQuery Console, gcloud CLI, or SQL.
2. External Tables:
o External tables are used to reference data stored outside of BigQuery, such as
in Google Cloud Storage or Google Sheets. The data remains in its original
location, but you can query it as if it were a table in BigQuery.
3. Partitioned Tables:
o Partitioned tables divide data into partitions based on a timestamp or
integer column, which improves query performance and cost by limiting the
data scanned in queries.
o Types of partitions: Date partitioning, Integer range partitioning.
4. Clustered Tables:
o Clustered tables allow data within a table to be organized and stored based on
one or more columns, improving the efficiency of queries that filter by these
columns.
5. Materialized Views:
o Materialized views are like views but store the results of queries for faster
subsequent access. They help optimize performance by precomputing complex
queries.
1. BigQuery Console:
o You can load data through the BigQuery Console by selecting a dataset,
clicking on "Create Table," and specifying the source (e.g., CSV, JSON, or
Avro).
2. gcloud Command-Line Tool:
o Use the bq command-line tool (bq load) to load data; see the example commands under question 7 below.
3. BigQuery API:
o Use the BigQuery REST API to programmatically load data. This is useful
for automating data loading in scripts.
4. Cloud Storage:
o Upload data from Google Cloud Storage using BigQuery's Cloud Storage
integration. This method is particularly useful for loading large datasets.
5. Streaming Data:
o BigQuery supports streaming data using the BigQuery Streaming API. You
can continuously stream data into tables in near real-time.
6. Data Transfer Service:
o Use BigQuery Data Transfer Service to load data from supported sources
like Google Analytics, Google Ads, or other third-party services.
7. How do you load CSV, JSON, and Avro files into BigQuery?
1. CSV Files:
o In the BigQuery Console:
Select Create Table, choose CSV as the file format, and specify the
file location (either from Cloud Storage or local files).
You can specify options such as field delimiter, header row, and skip
leading rows.
o Example bq command:
bq load --autodetect --source_format=CSV dataset_name.table_name gs://bucket_name/file.csv
2. JSON Files:
o Choose JSON (newline delimited) as the file format, and supply a schema or use --autodetect.
o Example bq command:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON dataset_name.table_name gs://bucket_name/file.json
3. Avro Files:
o Choose Avro as the file format when creating the table.
o BigQuery will automatically infer the schema from the Avro file.
o Example bq command:
bq load --source_format=AVRO dataset_name.table_name gs://bucket_name/file.avro
Cost Efficiency: Partitioning reduces cost by scanning only the relevant partitions, while clustering reduces query time by making filtering more efficient.
External Tables allow BigQuery to query data stored outside of BigQuery without
needing to load it into BigQuery's internal storage.
Key Purposes:
o Access Data in Cloud Storage: External tables allow you to reference and
query data stored in Google Cloud Storage (e.g., CSV, JSON, or Parquet
files) directly from BigQuery.
o Access Data in Google Sheets: You can create an external table that
references a Google Sheet.
o Reduced Storage Cost: External tables avoid the cost of storing data in
BigQuery. Data is queried directly from external locations.
Benefits:
o Query external data: Perform SQL queries on data stored outside of
BigQuery, without the need to import it.
o Save on storage: Use external tables when you have large datasets stored
externally and want to avoid importing them.
Example of using an external table with Cloud Storage:
bq mk --external_table_definition=gs://bucket_name/*.csv
dataset_name.external_table
Execution Plan: When you run a query in BigQuery, it first generates an execution
plan based on the query's SQL and the underlying data structure.
Distributed Processing: BigQuery uses a distributed architecture where the query
is broken into smaller tasks and distributed across many worker nodes (compute
resources). Each worker node processes a portion of the data in parallel.
Columnar Storage: BigQuery stores data in a columnar format, so only the relevant
columns needed for the query are read, improving query efficiency.
Data Shuffling: For operations like JOINs and GROUP BY, BigQuery performs
data shuffling to ensure the relevant data is brought together for processing.
Optimized Query Execution: BigQuery optimizes queries by analyzing the schema,
data distribution, and query pattern. It applies techniques like column pruning and
partition pruning to minimize the amount of data processed.
Cost: BigQuery charges based on the amount of data processed, so optimizing queries
to minimize data scans is important.
12. What are the best practices to optimize query performance in BigQuery?
Query Caching in BigQuery allows BigQuery to cache the results of a query for
24 hours.
How it works:
o When you run a query for the first time, BigQuery stores the results in a cache.
o If the exact same query is run within 24 hours, BigQuery will return the
cached results instead of re-executing the query, improving performance and
reducing costs.
Benefits:
o Faster Results: Cached queries return results faster as the computation is
skipped.
o Cost Savings: Cached results reduce the amount of data scanned and therefore
lower costs for repeated queries.
Limitations:
o The cache is invalidated if the dataset or table data changes, or if the query is
modified (even slightly).
14. What is the difference between interactive and batch queries in BigQuery?
Cost: Both are billed at the same on-demand rate per byte processed; batch queries simply queue until idle resources are available, so they do not count against concurrent query limits.
Use Case: Interactive queries are ideal for ad-hoc queries when quick results are needed; batch queries are ideal for large data processing that can tolerate delays.
Interactive Queries are typically used when you need immediate results, like
running exploratory queries or during debugging.
Batch Queries are better suited for large-scale data processing, like ETL jobs,
where delays in response time are acceptable.
15. How does partition pruning work in BigQuery?
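Partition pruning means BigQuery skips partitions that cannot match the filter on the partitioning column, so only the relevant partitions are scanned (and billed). A sketch with hypothetical project, table, and column names, where only one week of partitions is read:
bq query --use_legacy_sql=false '
SELECT user_id, event_type
FROM `my_project.my_dataset.events`
WHERE event_date BETWEEN "2024-01-01" AND "2024-01-07"'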
19. What is dynamic data masking in BigQuery, and how does it enhance security?
Dynamic data masking is a security feature in BigQuery that enables the masking of
sensitive data in query results, providing a way to hide certain values without actually
modifying the underlying data.
How it works:
o Masking is applied dynamically based on the user’s role or identity. For
example, a user with a specific role might see a full email address, while
another user may only see a masked email (e.g., ***@domain.com).
Implementation:
o You can define masking policies on specific columns in BigQuery. These
policies determine how data should be masked for users who do not have
access to sensitive information.
o Masking is defined at the column level using a user-defined function (UDF)
or using BigQuery policy tags.
Benefits:
o Improved security: Sensitive information is protected from unauthorized
users.
o Regulatory compliance: Helps meet compliance requirements like GDPR or
HIPAA by masking sensitive data.
21. What is BigQuery ML, and how does it enable machine learning in SQL?
BigQuery ML (BigQuery Machine Learning) is a feature that allows you to build and
train machine learning models directly within BigQuery using SQL queries. It
simplifies machine learning workflows by eliminating the need to move data out of
BigQuery for training models.
How it works:
o You can use SQL statements to create and train models (e.g., linear
regression, logistic regression, k-means clustering).
o Models are stored in BigQuery tables and can be queried just like any other
table.
Key Benefits:
o No need for separate ML tools: You can stay within the BigQuery
environment.
o Ease of use: Uses SQL syntax, so no need for complex programming or
expertise in Python, R, etc.
o Seamless integration: BigQuery ML can easily integrate with your data
stored in BigQuery, reducing data movement and time spent on preprocessing.
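A minimal sketch of training and querying a model with SQL through the bq CLI (dataset, table, and column names are hypothetical):
bq query --use_legacy_sql=false '
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS(model_type="logistic_reg", input_label_cols=["churned"]) AS
SELECT tenure, monthly_charges, churned FROM `my_dataset.customers`'

bq query --use_legacy_sql=false '
SELECT * FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
  (SELECT tenure, monthly_charges FROM `my_dataset.new_customers`))'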
22. How do you perform ETL operations using BigQuery?
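One common pattern is ELT entirely inside BigQuery: load raw data, then transform it with SQL into curated tables. A sketch with hypothetical bucket, dataset, and table names:
bq load --autodetect --source_format=CSV my_dataset.raw_events gs://my-bucket/events/*.csv
bq query --use_legacy_sql=false --destination_table=my_dataset.daily_events '
SELECT DATE(event_timestamp) AS event_date, COUNT(*) AS events
FROM `my_dataset.raw_events`
GROUP BY event_date'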
23. What are BigQuery User-Defined Functions (UDFs), and when should you use
them?
UDFs are custom functions written in SQL or JavaScript that you can use in
BigQuery queries to encapsulate reusable logic.
Types of UDFs:
o SQL UDFs: You write functions using SQL expressions to encapsulate
reusable logic.
o JavaScript UDFs: You write custom functions using JavaScript that can be
executed as part of BigQuery queries.
When to use them:
o Custom Logic: When you need to apply complex transformations that aren't
covered by BigQuery's built-in functions.
o Reusability: When the logic is used across multiple queries and you want to
centralize the function for easier maintenance.
o Performance: JavaScript UDFs can help when dealing with complex string
manipulations or other complex operations that aren’t efficient with standard
SQL.
Considerations: UDFs can impact performance due to the additional overhead of
invoking custom code, so they should be used judiciously.
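A small sketch of a temporary SQL UDF run through the bq CLI (function and table names are hypothetical):
bq query --use_legacy_sql=false '
CREATE TEMP FUNCTION normalize_email(email STRING) AS (LOWER(TRIM(email)));
SELECT normalize_email(email) AS clean_email FROM `my_dataset.users`'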
24. How do you handle JSON and nested data structures in BigQuery?
BigQuery provides native support for nested and repeated data types, allowing you
to work with JSON-like data.
Handling JSON:
o You can use BigQuery’s JSON functions to parse and query JSON-formatted
data directly.
JSON_EXTRACT, JSON_EXTRACT_SCALAR for extracting data
from JSON strings.
JSON_EXTRACT_ARRAY to extract array data.
Handling Nested Structures:
o BigQuery supports STRUCT (for nested objects) and ARRAY (for repeated
data), both of which are similar to JSON objects and arrays.
o Accessing nested fields: Use dot notation to access nested fields, for example,
person.name or address.city.
o Flattening Nested Data: To work with nested data, you can use UNNEST to
flatten arrays or extract nested data for further processing.
o Example SQL:
SELECT
name,
JSON_EXTRACT(json_field, '$.address.city') AS city
FROM
`project.dataset.table`;
Window functions allow you to perform calculations across rows that are related to
the current row. Unlike aggregate functions, window functions do not collapse rows
but instead return a value for each row in the result set.
Syntax:
SELECT
column1,
column2,
ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2
DESC) AS row_num
FROM
`project.dataset.table`;
26. How does BigQuery pricing work for storage and queries?
Storage Pricing:
o BigQuery charges for data storage based on the amount of data stored in
tables and datasets.
o Active storage: Data stored in a table that has been modified within the last 90
days.
o Long-term storage: Data stored in a table that hasn't been modified for 90
days or more, which is cheaper.
Query Pricing:
o On-demand queries: BigQuery charges based on the amount of data
processed by each query (per byte). This includes scanning the tables and
datasets in the query.
o Flat-rate pricing: An alternative to on-demand pricing, where you pay a fixed
monthly fee for a certain amount of query processing capacity (via
reservations).
Other Costs:
o Streaming Inserts: Charges for data inserted using the streaming API.
o Storage Transfer: Charges for transferring data into BigQuery from external
sources like Cloud Storage.
1. On-Demand Pricing:
o You pay for the data processed by queries (per byte).
o Ideal for unpredictable or low-volume workloads, as you pay only for what
you use.
2. Flat-Rate (Reservation) Pricing:
o You pay a fixed monthly fee for a certain number of slots (compute capacity)
for query execution.
o Ideal for high-volume or consistent workloads as it provides predictable
pricing.
Slot Reservations: You can reserve slots (compute resources) that BigQuery uses to
process queries.
o Dedicated Reservations: You get dedicated query capacity.
o Flexible Slots: Can be used across multiple projects or teams.
o Capacity Commitments: A commitment to pay for a specific number of slots
for 1 or 3 years.
Audit Logs: BigQuery integrates with Cloud Audit Logs to track user activities and
API calls.
Key Audit Logs:
o Data Access Logs: Records who accessed specific BigQuery resources
(tables, datasets).
o Admin Activity Logs: Tracks changes to resources (dataset, table
creation/deletion).
o System Event Logs: Tracks internal events that affect BigQuery performance.
Monitoring:
o You can use Cloud Logging to analyze audit logs and track queries that
consume excessive resources, user access patterns, and identify potential
performance bottlenecks.
o Use the logs to monitor query execution time, resource usage, and errors.
1. Optimize Queries:
o Limit data scanned: Only select the required columns (SELECT column1,
column2) and apply filters (WHERE clauses).
o Use Partitioning and Clustering: Partition tables by date or other criteria to
reduce the amount of data scanned in queries.
o Avoid SELECT *: Only select the columns you need to minimize the amount of data read.
2. Leverage Flat-Rate Pricing:
o If you have a steady workload, switch from on-demand pricing to flat-rate
pricing (slot reservations) to lock in predictable costs.
3. Optimize Storage:
o Long-term storage: For rarely accessed data, use long-term storage to take
advantage of cheaper pricing.
o Use partitioned tables: This reduces the storage cost by optimizing data
access.
4. Data Transfer Efficiency:
o Minimize the number of data transfers to/from BigQuery (e.g., by using
external tables or pushing data directly from Cloud Storage).
5. Use Materialized Views:
o Use materialized views for frequently accessed aggregated data. This helps
avoid recalculating the same queries repeatedly.
6. Use Query Caching:
o If you run the same query multiple times, BigQuery caches the results and will
not reprocess the data. This can significantly reduce query costs if the cached
result is available.
7. Limit the use of Streaming Inserts:
o Streaming data into BigQuery is more expensive than loading data in bulk.
Use batch loading when possible to reduce costs.
8. Automate Data Expiry:
o Set lifecycle management policies to automatically delete or archive outdated
data in BigQuery tables.
DATAPROC
1. What is GCP Dataproc, and how does it differ from traditional Hadoop clusters?
GCP Dataproc is a fully managed cloud service for running Apache Hadoop,
Apache Spark, and other big data tools in Google Cloud. It allows users to easily
create, manage, and scale clusters for processing big data workloads.
Differences from Traditional Hadoop Clusters:
o Managed Service: Dataproc is fully managed, meaning Google takes care of
cluster provisioning, scaling, and maintenance, whereas traditional Hadoop
clusters require manual setup and management.
o Cost-Effective: Dataproc enables on-demand pricing, where you only pay for
the compute resources you use, unlike on-prem Hadoop clusters that incur
upfront hardware costs and maintenance.
o Scalability: Dataproc clusters can scale up and down dynamically based on
workload demands, making them more flexible than static on-prem clusters.
o Integration with GCP: Dataproc integrates seamlessly with GCP services
like BigQuery, Cloud Storage, and Dataflow, offering a cloud-native
approach, while traditional Hadoop clusters have limited cloud integration.
Cluster: A collection of virtual machines (VMs) running Hadoop, Spark, and other
big data tools. A Dataproc cluster can consist of multiple nodes and is used to process
data.
Master Node: The main node that controls the cluster, running the Hadoop Resource
Manager and Spark Driver.
Worker Nodes: These nodes run tasks and store data as part of the cluster’s
distributed file system.
Dataproc Jobs: The tasks (such as Spark, Hadoop, Hive, etc.) that you run on the
cluster.
Dataproc API: Used to interact with Dataproc clusters programmatically, enabling
you to create, manage, and submit jobs to clusters.
Cloud Storage: Often used for storing data to be processed in Dataproc, as well as
output data.
Hadoop Distributed File System (HDFS): Dataproc clusters can use Google Cloud
Storage as the primary storage layer or use HDFS if needed.
3. How does Dataproc integrate with other GCP services like BigQuery and Cloud
Storage?
4. What are the main advantages of using Dataproc over on-prem Hadoop?
Fully Managed: Dataproc removes the need for managing the infrastructure,
patching, and maintenance associated with on-prem Hadoop clusters. Google handles
cluster setup, upgrades, and scaling automatically.
Cost Efficiency: Dataproc offers a pay-as-you-go model, meaning you only pay for
the compute and storage you use, unlike on-prem Hadoop clusters that have fixed
costs for hardware and maintenance.
Scalability: Dataproc allows you to quickly scale up or down based on the workload,
unlike on-prem clusters which may require significant investment and time to scale.
Integration with Google Cloud: Dataproc seamlessly integrates with other Google
Cloud services (BigQuery, Cloud Storage, etc.), making it easier to store and process
data without complex network configurations.
Speed of Deployment: You can create and configure a Dataproc cluster in a matter of
minutes, unlike setting up an on-prem Hadoop cluster which can take days or even
weeks.
Security: Dataproc clusters are secured with Google's Identity and Access
Management (IAM) and Cloud Security features, providing better security controls
than on-prem systems.
5. What are the different cluster modes available in Dataproc?
6. How do you create a Dataproc cluster using the GCP Console and gcloud CLI?
Standard Cluster:
o This is the default cluster type with one master node and multiple worker
nodes.
o It is suitable for general big data processing workloads.
o Failover: No failover for the master node; if it goes down, the cluster will be
unavailable.
High-Availability (HA) Cluster:
o This cluster type has multiple master nodes (at least 3), providing high
availability.
o It ensures that if one master node fails, the others can take over, preventing
cluster downtime.
o This is ideal for production workloads that cannot afford interruptions.
Single Node Cluster:
o This type consists of only one node which serves as both the master and
worker.
o It is useful for small-scale, experimental, or testing workloads.
o There is no redundancy, and it is not recommended for production workloads.
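For question 6, the Console flow is Dataproc > Create cluster with the same settings; a minimal gcloud sketch (name, region, machine types, and worker count are placeholders):
bash
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4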
bash
gcloud dataproc autoscaling-policies import <POLICY_NAME> \
    --region <REGION> \
    --source autoscaling-policy.yaml
The YAML file defines the worker limits (minimum and maximum instances) and the cooldown period. You then attach the policy when creating the cluster:
gcloud dataproc clusters create <CLUSTER_NAME> \
    --region <REGION> \
    --autoscaling-policy <POLICY_NAME>
This ensures that the cluster will automatically scale the number of workers based on
the workload.
9. What are the different machine types you can use in a Dataproc cluster?
Machine types are specified for both master and worker nodes when creating a
Dataproc cluster. Common options include:
o Standard machine types (e.g., n1-standard-1, n1-standard-2): General-
purpose machine types with a balanced combination of CPU and memory.
o High CPU machine types (e.g., n1-highcpu-2): Machines optimized for CPU-
intensive workloads.
o High memory machine types (e.g., n1-highmem-4): Machines with a higher
amount of memory for memory-intensive tasks.
o Preemptible VMs: Cost-effective, short-lived instances that can be terminated
by Google at any time. These are cheaper but may not be suitable for long-
running tasks.
o Custom machine types: You can specify custom configurations for both CPU
and memory to meet specific workload needs.
You can submit a PySpark job to a Dataproc cluster using the gcloud CLI or the Dataproc
API.
bash
gcloud dataproc jobs submit pyspark <JOB_FILE_PATH> \
--cluster <CLUSTER_NAME> \
--region <REGION>
You can run Hive or Spark SQL queries in Dataproc by using gcloud CLI, Dataproc API,
or directly via Dataproc clusters.
bash
gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> \
--region <REGION> \
--execute "SELECT * FROM <TABLE_NAME>;"
bash
gcloud dataproc jobs submit spark-sql --cluster <CLUSTER_NAME> \
--region <REGION> \
--execute "SELECT * FROM <TABLE_NAME>;"
Dataproc Workflow Templates are predefined collections of Dataproc jobs that can
be executed in a sequence. These templates help automate complex data processing
pipelines in a repeatable way.
How they help:
1. Automation: Workflows can automate multi-step data processing pipelines
involving multiple jobs (e.g., Spark, Hive, Hadoop).
2. Consistency: Templates ensure that the same job sequence is executed
consistently across different environments.
3. Efficiency: Reduce manual intervention by defining the steps in advance and
executing them with a single command.
Creating a Workflow Template: You can create and manage workflow templates
using the gcloud CLI:
bash
gcloud dataproc workflow-templates create <TEMPLATE_NAME> \
--region <REGION>
After creating a template, you can add jobs to it and run the workflow.
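A sketch of attaching a managed cluster, adding a PySpark job, and running the workflow (template name, bucket, and file are placeholders):
bash
gcloud dataproc workflow-templates set-managed-cluster <TEMPLATE_NAME> \
    --region <REGION> \
    --cluster-name my-workflow-cluster
gcloud dataproc workflow-templates add-job pyspark gs://<BUCKET_NAME>/jobs/etl.py \
    --workflow-template <TEMPLATE_NAME> \
    --step-id etl-step \
    --region <REGION>
gcloud dataproc workflow-templates instantiate <TEMPLATE_NAME> \
    --region <REGION>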
14. How do you integrate Dataproc with Apache Airflow for job orchestration?
python
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

default_args = {'start_date': days_ago(1)}

# Define the DAG
with DAG('dataproc_dag', default_args=default_args, schedule_interval=None) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id='create_dataproc_cluster',
        cluster_name='my-cluster',
        project_id='my-project',
        region='us-central1',
        # cluster_config omitted here for brevity; supply one for real workloads
    )
    submit_job = DataprocSubmitJobOperator(
        task_id='submit_spark_job',
        project_id='my-project',
        region='us-central1',
        job={
            'reference': {'project_id': 'my-project'},
            'placement': {'cluster_name': 'my-cluster'},
            'spark_job': {
                'main_class': 'org.apache.spark.examples.SparkPi',
                'jar_file_uris': ['file:///usr/lib/spark/examples/jars/spark-examples.jar'],
            },
        },
    )
    delete_cluster = DataprocDeleteClusterOperator(
        task_id='delete_dataproc_cluster',
        cluster_name='my-cluster',
        project_id='my-project',
        region='us-central1',
    )

    # Create the cluster, run the job, then tear the cluster down
    create_cluster >> submit_job >> delete_cluster
15. How does Dataproc handle ephemeral clusters for cost optimization?
Ephemeral clusters are short-lived clusters that are created for specific workloads
and then deleted after the job is completed.
Cost Optimization:
1. No ongoing costs: You only pay for the resources used during the lifetime of
the cluster. Since the cluster is deleted after the job completes, you avoid
paying for idle resources.
2. Faster provisioning: Dataproc allows you to spin up clusters quickly, so you
can perform your tasks efficiently without long setup times.
3. Autoscaling: Dataproc can automatically scale the number of worker nodes
based on job requirements, ensuring that resources are used effectively and
costs are minimized.
How to use ephemeral clusters:
1. You can create ephemeral clusters using the gcloud CLI or Dataproc API.
2. After the job completes, the cluster can be deleted automatically (for example, by running it as part of a workflow template or by setting the --max-idle / --max-age scheduled-deletion flags), minimizing cost.
16. What are the IAM roles required to manage a Dataproc cluster?
To manage a Dataproc cluster, users need specific IAM roles that provide access to
resources. The key roles for managing Dataproc clusters are:
o roles/dataproc.admin: Full control over Dataproc clusters, jobs, and workflow templates.
o roles/dataproc.editor: Create and modify clusters and submit jobs.
o roles/dataproc.viewer: Read-only access to clusters, jobs, and configurations.
o roles/dataproc.worker: Granted to the cluster's service account so that cluster VMs can access GCP resources on its behalf.
17. How do you enable Kerberos authentication on a Dataproc cluster?
Kerberos authentication can be enabled when creating a Dataproc cluster to provide secure communication between services.
Steps:
1. Enable Kerberos during cluster creation using the gcloud CLI or Dataproc
UI:
bash
gcloud dataproc clusters create <CLUSTER_NAME> \
    --region <REGION> \
    --enable-kerberos \
    --kerberos-root-principal-password-uri gs://<BUCKET_NAME>/kerberos-root-password.encrypted \
    --kerberos-kms-key <KMS_KEY>
2. Kerberos settings:
    --enable-kerberos: Enables Kerberos (Hadoop secure mode) on the cluster; Dataproc sets up an on-cluster KDC and realm automatically.
    --kerberos-root-principal-password-uri: Cloud Storage location of the KMS-encrypted password for the Kerberos root principal.
    --kerberos-kms-key: The Cloud KMS key used to decrypt that password.
    Advanced settings (custom realm, cross-realm trust with an external KDC or admin server) are supplied via a YAML file passed with --kerberos-config-file.
3. Client setup:
Ensure your clients are configured to authenticate with Kerberos.
18. What security features does Dataproc provide for data encryption?
Dataproc offers several security features to ensure data encryption, both at rest and in transit:
1. Encryption at Rest:
o Data is encrypted by default using Google-managed encryption keys
(GMEK) for all data stored on Google Cloud services.
o You can also use Customer-managed encryption keys (CMEK) for more
control over the encryption keys.
2. Encryption in Transit:
o Dataproc uses TLS (Transport Layer Security) to secure data in transit
between cluster nodes and between the cluster and external resources (e.g.,
GCS).
3. Encryption for Storage:
o Dataproc uses encryption for persistent disk storage, including both the
operating system and data disks.
o You can control encryption for storage on the Dataproc cluster using CMEK.
4. Secure Cluster Access:
o Dataproc integrates with Identity-Aware Proxy (IAP) to control access to the
cluster and ensure secure authentication.
19. How do you restrict access to a Dataproc cluster using VPC Service Controls?
VPC Service Controls allow you to secure access to resources like Dataproc clusters
by creating a security perimeter to isolate them from unauthorized networks.
Steps to Restrict Access:
1. Create a VPC Service Controls perimeter: This perimeter defines a
boundary around your Dataproc cluster, protecting it from external access.
bash
gcloud access-context-manager perimeters create <PERIMETER_NAME> \
--resources=projects/<PROJECT_ID> \
--restricted-services=dataproc.googleapis.com
2. Associate the perimeter with the Dataproc cluster: Ensure your Dataproc
cluster is part of the security perimeter.
3. Allow access only from trusted sources: By using VPC Service Controls,
you can ensure that only services and users within the defined perimeter can
access Dataproc clusters.
20. How does Dataproc handle data security when processing sensitive information?
21. What are the best practices for optimizing Spark performance in Dataproc?
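For question 21, common levers include right-sizing executors, storing data in columnar formats such as Parquet, caching DataFrames that are reused, and minimizing shuffles. Per-job Spark settings can be passed with --properties; a sketch with placeholder values:
bash
gcloud dataproc jobs submit spark --cluster <CLUSTER_NAME> \
    --region <REGION> \
    --class org.example.MyJob \
    --jars gs://<BUCKET_NAME>/jars/my-job.jar \
    --properties spark.executor.cores=4,spark.executor.memory=8g,spark.sql.shuffle.partitions=200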
22. How does Dataproc pricing work, and what factors impact the cost?
1. Cluster Resources:
o You are charged for the VM instances (vCPUs and memory) in your cluster,
as well as for the persistent disks attached to the VM instances.
o Preemptible VMs cost less but can be terminated at any time.
2. Cluster Uptime:
o You pay for the time your Dataproc cluster is running, including the time
spent on provisioning, even if idle.
3. Data Processing:
o If you're using Cloud Storage for input/output data, there may be additional
costs for data transfer.
4. Storage:
o Costs for GCS storage depend on the amount of data you store and the
storage class used (e.g., Standard, Nearline, Coldline).
5. Data Transfer:
o Transferring data into or out of GCP services might incur network egress
fees.
6. Additional Services:
o Costs may arise if you are using additional GCP services (e.g., BigQuery for
data querying, Cloud Logging for logs).
7. Autoscaling:
o Autoscaling dynamically adjusts the number of nodes in your cluster, which
impacts cost based on resource usage.
24. What are initialization actions in Dataproc, and how do you use them?
Initialization actions are custom scripts that run when Dataproc clusters are created.
They are typically used to install software packages, configure system settings, or
customize the cluster environment.
Use cases:
1. Install Custom Software: Install non-default software like Python packages,
Java libraries, etc.
2. Cluster Configuration: Set up system configurations such as environment
variables, user accounts, etc.
3. Security Setup: Configure encryption settings, authentication, or Kerberos
setup.
How to use:
o During cluster creation, specify the initialization script with the --
initialization-actions flag.
Example:
bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--region <REGION> \
--initialization-actions gs://<BUCKET_NAME>/scripts/install.sh
o The script must be hosted in Google Cloud Storage (GCS); Dataproc downloads and runs it on each cluster node during creation.
25. How does Dataproc handle preemptible VMs for cost optimization?
Preemptible VMs are short-lived, low-cost VMs offered by Google Cloud, designed
for cost optimization in scenarios where job interruption is acceptable.
1. Cost Savings:
o Preemptible VMs cost significantly less than regular VMs (up to 80%
cheaper).
2. Short Lifecycle:
o These VMs can be terminated by Google Cloud with little notice
(approximately 30 seconds). They are ideal for batch jobs that can tolerate
interruptions.
3. Cluster Autoscaling:
o Dataproc can automatically add preemptible VMs to a cluster when using
autoscaling. This helps in reducing costs without compromising the overall
processing power.
4. Use with Spark Jobs:
o If using Spark, preemptible VMs can be employed for worker nodes, and the
master node can run on regular VMs to ensure job stability.
5. Reconfiguration:
o Use preemptible VMs for large-scale, parallel processing tasks (e.g., ETL)
where short-term disruption will not significantly affect overall job
completion.
DATAFLOW
1. What is GCP Dataflow, and how does it differ from Dataproc?
GCP Dataflow is a fully managed service for stream and batch data processing that
enables the execution of data processing pipelines built using the Apache Beam
programming model. It allows for the transformation of large-scale data in a flexible,
cost-effective manner with automatic scaling.
Difference from Dataproc:
1. Dataflow is designed primarily for streaming and batch data processing
with Apache Beam, while Dataproc is tailored for Hadoop/Spark-based big
data processing.
2. Dataflow abstracts infrastructure management, while Dataproc provides
greater control over clusters and configurations.
3. Dataflow focuses on pipeline execution with automated scaling and
management, whereas Dataproc requires manual handling of cluster sizes and
scaling.
2. What are the key advantages of using Dataflow over traditional ETL tools?
1. Scalability:
o Dataflow handles large datasets efficiently by automatically scaling resources
based on pipeline needs, reducing the complexity of scaling in traditional ETL
tools.
2. Fully Managed:
o No need to manage infrastructure or clusters. Dataflow abstracts the
infrastructure layer, providing a serverless environment for running pipelines.
3. Unified Batch and Stream Processing:
o It supports both batch and streaming data processing, making it flexible for
different use cases.
4. Real-Time Data Processing:
o Dataflow is optimized for real-time data processing, which is a challenge for
traditional ETL tools.
5. Cost Efficiency:
o It uses Google Cloud's auto-scaling capabilities to manage compute
resources efficiently, providing a cost-effective solution.
6. Integration with GCP Services:
o Dataflow integrates well with other GCP services like BigQuery, Cloud
Storage, Pub/Sub, and Cloud Spanner.
3. What is the Apache Beam framework, and how does it relate to Dataflow?
1. Stream Mode:
o In streaming mode, Dataflow processes continuous, real-time data, allowing
pipelines to ingest, process, and output data in near real-time.
o This mode is used for processing data that arrives continuously, such as real-
time logs, IoT sensor data, and event streams.
2. Batch Mode:
o In batch mode, Dataflow processes a fixed amount of data (usually from
Cloud Storage, BigQuery, etc.) in batches. The processing occurs after the
entire dataset is collected.
o This mode is ideal for periodic tasks like ETL processes, data aggregations,
and transformations on historical data.
Batch Processing:
1. Dataflow processes fixed datasets in batch mode, reading data from sources
like Cloud Storage or BigQuery.
2. It handles transformations, aggregations, filtering, and data enrichment in one
go, processing data at scheduled intervals.
3. Batch jobs are typically run at regular intervals (e.g., daily or hourly),
processing the accumulated data within a time window.
Streaming Processing:
1. Dataflow allows the processing of real-time data streams, processing records
as they arrive.
2. The pipeline processes data in small windows, allowing updates to be
processed immediately as new events arrive.
3. Streaming pipelines handle low-latency, real-time analytics, and event-driven
applications, such as log processing, monitoring, and dynamic ETL tasks.
Both batch and streaming modes are implemented using the Apache Beam model, which
allows you to write code that is agnostic to the processing mode.
6. What are the main components of a Dataflow pipeline?
1. PCollection:
o A distributed dataset, either bounded (batch) or unbounded (streaming). It
holds the data that flows through the pipeline.
2. Transforms:
o Operations that process or manipulate data in PCollections. Common
transforms include ParDo (for processing each element), GroupByKey (for
aggregation), and Windowing (for managing time-based data).
3. Pipeline:
o A pipeline defines the sequence of transforms applied to data. It is the top-
level object that represents the entire data processing workflow.
4. I/O (Input/Output):
o Data is read from and written to external systems using I/O connectors. These
include sources like Cloud Storage, BigQuery, Pub/Sub, and sinks where
data is written.
5. Windows:
o For streaming data, data can be divided into windows based on time or other
criteria. This allows for processing data in time-bound chunks.
6. Execution Engine:
o The execution engine (like Dataflow in GCP) is responsible for executing the
pipeline, handling resource provisioning, and scaling.
7. How do you create and deploy a Dataflow pipeline using Apache Beam?
To create and deploy a Dataflow pipeline using Apache Beam, follow these steps:
python
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
4. Deploy to Dataflow:
o Submit the pipeline using the Dataflow runner:
For Python: python my_pipeline.py --runner DataflowRunner --project <PROJECT_ID> --region <REGION> --temp_location gs://<BUCKET_NAME>/temp
For Java: mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=DataflowRunner --project=<PROJECT_ID> --region=<REGION> --gcpTempLocation=gs://<BUCKET_NAME>/temp"
5. Monitor and manage the pipeline:
o Use GCP Console to monitor job status, manage logs, and troubleshoot.
PCollection (short for parallel collection) is the main data structure in Apache
Beam. It represents a distributed, immutable collection of data that can be processed
in parallel across a cluster.
How it works:
1. Bounded PCollection: Represents a finite dataset (like a file, database query).
2. Unbounded PCollection: Represents an infinite dataset (like real-time
streaming data).
3. Data in a PCollection can be transformed and processed by various Beam
transforms.
4. PCollections flow through the pipeline, getting processed by various stages
(transforms) before being written to sinks.
Bounded Data:
o Data that has a clear beginning and end, typically handled in batch mode.
o Examples: Logs, daily file uploads, database snapshots.
o Dataflow processes it as a finite dataset, and once processed, the job
completes.
Unbounded Data:
o Data that continuously arrives, typically handled in streaming mode.
o Examples: Event logs, IoT sensor data, real-time social media feeds.
o Dataflow processes it as a continuous stream of data, often using windowing
and triggering strategies to manage time-based operations.
In Dataflow, bounded data is processed in batches, while unbounded data requires careful
handling using windowing and triggers to manage real-time data processing.
10. What are the different windowing strategies in Dataflow?
Windowing in Dataflow allows you to group unbounded data into finite windows for
processing. The common windowing strategies are:
1. Fixed Windows:
o Divides the stream into fixed-size time intervals (e.g., every 5 minutes).
o Example: Grouping logs in 10-minute windows.
2. Sliding Windows:
o Similar to fixed windows, but with an overlap. Data can belong to multiple
windows.
o Example: A 10-minute window that slides every 5 minutes.
3. Session Windows:
o Used for grouping data based on session gaps (periods of inactivity).
o Example: Grouping user activity logs, where a session is considered as a
period of activity separated by inactivity longer than a threshold.
4. Global Windows:
o All data is treated as a single window, often with custom triggering or filtering
logic to control when the window is processed.
o Example: Summing data over an entire day.
5. Custom Windows:
o You can define custom windowing logic based on your specific needs (e.g.,
event-based windowing).
Each windowing strategy helps organize and process streaming data in manageable chunks
for aggregation, filtering, or other computations.
11. What is the difference between ParDo, GroupByKey, and Combine in Apache
Beam?
1. ParDo:
o ParDo (Parallel Do) is a transform that applies a function to each element in
a PCollection.
o It is used for element-wise processing where each input element can be
mapped to zero or more output elements.
o Example: Apply a function to each element to process data or transform the
data into different formats.
Example:
python
pcollection | 'TransformData' >> beam.ParDo(MyDoFn())
2. GroupByKey:
o GroupByKey is used to group elements by their key. It is typically used after
applying a key-value pair transformation.
o It groups the input data by the key so that elements with the same key are
combined.
o Example: Aggregate data, like summing values by a key.
Example:
python
pcollection | 'GroupByKey' >> beam.GroupByKey()
3. Combine:
o Combine is used to combine elements in a PCollection based on a function.
It is used for aggregation or reducing operations like sum, average, etc.
o The difference between GroupByKey and Combine is that Combine
operates on the values associated with the keys and performs a reduction
operation.
Example:
python
pcollection | 'SumValues' >> beam.CombinePerKey(sum)
Example (Python):
python
class MyDoFn(beam.DoFn):
def process(self, element):
# Apply custom logic to the element
yield element * 2
Example:
python
pcollection | 'ApplyCustomTransformation' >> beam.ParDo(MyDoFn())
1. Side Input:
o A side input is an additional input that is used by a DoFn for reading static
data that does not change during processing (e.g., lookup tables).
o Side inputs can be broadcasted to each worker, and they can be used in
element-wise processing (e.g., applying a filter or enrichment to a stream of
events).
Example: Using side input to enrich data with additional reference data.
python
side_input = p | 'CreateSideInput' >> beam.Create([10, 20, 30])
pcollection | 'ProcessWithSideInput' >> beam.ParDo(
    MyDoFn(), side_input=beam.pvalue.AsList(side_input))
2. Side Output:
o A side output is used to output data from a transform that doesn’t fit into the
main pipeline. It's useful when you want to split or handle data differently
based on certain conditions (e.g., filtering).
o You can emit data from the main output and the side output for further
processing.
Example:
python
class MyDoFn(beam.DoFn):
    def process(self, element):
        if element % 2 == 0:
            yield element  # goes to the main output
        else:
            # routed to the 'odd' side output (apply the DoFn with .with_outputs('odd'))
            yield beam.pvalue.TaggedOutput('odd', element)
14. How do you handle late-arriving data in Dataflow streaming pipelines?
In streaming pipelines, late-arriving data refers to data that arrives after the window has
already been closed. There are several ways to handle late data:
1. Allowed Lateness:
o Define an allowed lateness period during which late data is still accepted and
processed. Once the lateness period is over, late data will be discarded.
Example:
python
windowed_data | 'ApplyWindow' >> beam.WindowInto(
    beam.window.FixedWindows(60),
    allowed_lateness=beam.window.Duration(seconds=300))  # 5 minutes
2. Watermarking:
o Watermarks are used to track the progress of event-time processing. Late data
that arrives after the watermark can either be discarded or handled depending
on the allowed lateness.
3. Late Data Handling (Custom Handling):
o You can implement custom logic to handle late data (e.g., buffering it for later
processing or sending it to a dead-letter queue).
15. What are watermarks in Dataflow, and how do they affect event time processing?
A watermark is Dataflow's estimate of how far event time has progressed. Once the
watermark passes the end of a window, the window's default trigger can fire, and data that
arrives behind the watermark is treated as late and is kept or dropped according to the
allowed lateness.
Example:
python
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

pcollection | 'WindowData' >> beam.WindowInto(
    beam.window.FixedWindows(60),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=beam.window.Duration(seconds=300))  # 5 minutes
Example:
python
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
Writing to BigQuery:
o Dataflow can also write processed data back into BigQuery using the
WriteToBigQuery transform.
o Dataflow can handle the schema and append data into BigQuery tables.
Example:
python
from apache_beam.io.gcp.bigquery import WriteToBigQuery
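As a minimal sketch tying the two imports above into one pipeline, the project, dataset, table names, query, and schema below are assumptions (on Dataflow you would also supply a temp/GCS location for the BigQuery export):
python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery

with beam.Pipeline() as p:
    (p
     | 'ReadFromBQ' >> ReadFromBigQuery(
           query='SELECT name, value FROM `my-project.my_dataset.source_table`',
           use_standard_sql=True)
     | 'Double' >> beam.Map(lambda row: {'name': row['name'], 'value': row['value'] * 2})
     | 'WriteToBQ' >> WriteToBigQuery(
           'my-project:my_dataset.result_table',
           schema='name:STRING,value:INTEGER',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))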
17. How can you use Dataflow to read and write data from Cloud Storage?
Example:
python
from apache_beam.io import ReadFromText
Example:
python
from apache_beam.io import WriteToText
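For example, a small batch pipeline that reads text files from one bucket and writes results to another could look like the following sketch; the bucket paths and the transformation are assumptions:
python
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
    (p
     | 'ReadFromGCS' >> ReadFromText('gs://my-input-bucket/logs/*.txt')
     | 'ToUpper' >> beam.Map(lambda line: line.upper())
     | 'WriteToGCS' >> WriteToText('gs://my-output-bucket/results/output',
                                   file_name_suffix='.txt'))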
Example:
python
from apache_beam.io.gcp.pubsub import ReadFromPubSub
Writing to Pub/Sub:
o Dataflow can also publish processed data back into a Pub/Sub topic using the
WriteToPubSub transform.
Example:
python
from apache_beam.io.gcp.pubsub import WriteToPubSub
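A minimal streaming sketch combining both transforms; the topic names are assumptions, and streaming=True is required because Pub/Sub sources are unbounded:
python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> ReadFromPubSub(topic='projects/my-project/topics/input-topic')
     | 'Uppercase' >> beam.Map(lambda msg: msg.decode('utf-8').upper().encode('utf-8'))
     | 'WriteToPubSub' >> WriteToPubSub(topic='projects/my-project/topics/output-topic'))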
19. How does Dataflow interact with Cloud SQL and Spanner?
Cloud SQL:
o Dataflow can connect to Cloud SQL (e.g., MySQL, PostgreSQL) using JDBC
to read or write data.
o You can use the JdbcIO transform to read from and write to Cloud SQL
databases.
python
from apache_beam.io.jdbc import ReadFromJdbc
python
from apache_beam.io.gcp.spanner import ReadFromSpanner
20. How do you implement Dataflow with Bigtable for large-scale data processing?
Example:
python
from apache_beam.io.gcp.bigtable import ReadFromBigtable
Writing to Bigtable:
o Dataflow can write results back to Bigtable using the WriteToBigtable
transform.
Example:
python
from apache_beam.io.gcp.bigtable import WriteToBigtable
Automatic Scaling:
o Dataflow automatically adjusts the number of workers (VM instances) based
on the load and resource requirements of your pipeline.
o It scales up the number of workers when the pipeline is processing more data
and scales down when the load decreases.
o Autoscaling is useful for handling varying data volumes without manual
intervention.
Key Points:
o Dataflow uses Dynamic Worker Scaling to determine the optimal number of
workers based on real-time demand.
o Worker Types (e.g., standard, preemptible) can be configured for cost
optimization.
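As a sketch, the scaling behaviour described above can be bounded when the pipeline is launched by passing standard Dataflow pipeline options; the project, region, and bucket values below are assumptions:
python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--autoscaling_algorithm=THROUGHPUT_BASED',  # enable throughput-based autoscaling
    '--max_num_workers=10',                      # upper bound for the worker pool
])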
22. What are the best practices for optimizing Dataflow pipeline performance?
Backpressure Handling:
o Dataflow automatically handles backpressure in streaming pipelines by
buffering data and adjusting the flow of data to workers.
o When the data is ingested faster than it can be processed, Dataflow applies the
following mechanisms:
Dynamic Work Rebalancing: Dataflow dynamically balances the
work across workers to prevent overloading.
Windowing & Triggers: Use windowing to group data into
manageable chunks, preventing excessive memory usage.
Batching and Throttling: Dataflow can buffer or throttle incoming
data to avoid overloading the system.
25. What are the key cost factors in Dataflow, and how can you reduce them?
1. Worker Types:
o Standard vs Preemptible Workers: Use preemptible workers to reduce
costs significantly, but with the trade-off of the possibility of interruption.
o Worker Size: Adjust the number of CPUs and memory allocated to each
worker to match your pipeline's needs.
2. Pipeline Duration:
o Long-running pipelines incur more costs. Optimize your pipeline to run
efficiently and complete tasks in a shorter time.
3. Autoscaling:
o Enable autoscaling to automatically scale up or down based on load,
preventing over-provisioning and unnecessary costs.
4. Batch vs Streaming:
o Streaming pipelines might incur higher costs due to continuous processing. If
possible, consider switching to batch processing for cost savings.
5. Data Shuffling:
o Reduce shuffling operations in your pipeline, as they can cause high network
usage and increased costs.
6. I/O Operations:
o Minimize frequent reading and writing to Cloud Storage, BigQuery, or other
external systems, as this can increase costs.
7. Use Dataflow Templates:
o Templates enable reuse, preventing the need for new pipelines and saving on
development and execution costs.
8. Monitor and Analyze:
o Regularly monitor and analyze pipeline performance using the Dataflow UI
and Stackdriver to detect inefficiencies and reduce unnecessary compute time.
1. Automatic Recovery:
o If a worker fails, Dataflow automatically reassigns its tasks to other workers,
ensuring that the pipeline continues running.
2. Checkpointing and Watermarks:
o Dataflow uses watermarks to track event time and checkpointing to store
intermediate results, which ensures that data can be reprocessed if a failure
occurs.
3. Bundle-Level Retries:
o Dataflow processes data in bundles that can be retried independently, so a failed
work item can be re-executed without affecting the rest of the pipeline.
4. Retries:
o If tasks fail due to temporary issues (e.g., transient network issues), Dataflow
retries them automatically.
5. Dynamic Work Rebalancing:
o Dataflow continuously monitors the system and redistributes work when
necessary to maintain optimal performance.
28. What is exactly-once processing in Dataflow, and how is it achieved?
Exactly-Once Processing:
o Ensures that data is processed once and only once, even in the case of system
failures or retries.
Achieved Using:
1. Transactional Data: Dataflow uses Transactional Insertion for writing to sinks like
BigQuery or Cloud Storage, ensuring each record is processed only once.
2. Idempotent Writes: The pipeline is designed to perform idempotent writes, meaning
that repeated attempts to process the same data don’t change the result.
3. External State Management: It uses External State (e.g., BigQuery or Cloud
Spanner) to ensure that only new data is processed and avoids reprocessing old data.
4. Watermarks and Timers: Timers help in defining when data is considered
"processed" and used for triggering actions. Watermarks manage late data arrivals.
1. Google-Managed Encryption:
o By default, all data processed by Dataflow is encrypted using Google-
managed encryption keys (GMEK) during both transit and at rest.
2. Customer-Managed Encryption Keys (CMEK):
o You can use CMEK for more control over the encryption process. To enable
CMEK:
Create a Cloud Key Management Service (KMS) key.
Configure your Dataflow pipeline to use that key when reading from
and writing to Cloud Storage or BigQuery.
3. Dataflow Job Encryption:
o You can specify encryption for the entire Dataflow job when creating it via
the gcloud CLI or the Dataflow UI by providing the KMS key used for
encryption.
4. SSL Encryption:
o All data transferred over the network during Dataflow processing is encrypted
in transit using SSL/TLS by default.
30. How does Dataflow handle retries and failures in streaming pipelines?
1. Automatic Retries:
o Dataflow retries tasks that fail due to transient issues. This includes retries for
worker failures, network errors, or other temporary issues.
2. Backoff Strategy:
o Dataflow uses an exponential backoff strategy for retries, increasing the
delay between successive retries to avoid overwhelming the system.
3. Dead-letter Policy:
o In some cases, if retries fail repeatedly, Dataflow can route the failed data to a
dead-letter queue, where it can be analyzed or retried manually.
4. Error Handling in Transformations:
o You can define custom error handling logic within your Apache Beam
pipeline, allowing you to process or log errors in a specific manner.
5. Dynamic Work Rebalancing:
o If some tasks fail, Dataflow can reassign the tasks to healthy workers,
balancing the load and ensuring the pipeline continues running.
6. Event Time Processing:
o In streaming pipelines, watermarks and windowing ensure that late data is
processed correctly and that the pipeline can handle delayed events without
disrupting the overall flow.
CLOUD COMPOSER
1. What is GCP Cloud Composer, and how does it relate to Apache Airflow?
Cloud Composer:
o Cloud Composer is a fully managed workflow orchestration service provided
by Google Cloud, built on top of Apache Airflow. It allows users to automate,
schedule, and monitor data workflows in the cloud.
Relation to Apache Airflow:
o Cloud Composer leverages Apache Airflow as its core framework for
creating, managing, and scheduling workflows. However, Cloud Composer
abstracts the infrastructure management tasks and integrates it tightly with
GCP services, making it easier to use for users in Google Cloud environments.
2. What are the key advantages of using Cloud Composer over a self-managed Airflow
setup?
python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2025, 1, 1),
}

# Illustrative DAG definition using the default arguments above
dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily')
start = DummyOperator(task_id='start', dag=dag)
1. Action Operators:
o BashOperator: Executes a bash command.
o PythonOperator: Executes a Python function.
o EmailOperator: Sends an email.
o DummyOperator: Does nothing but serves as a placeholder.
2. Transfer Operators:
o S3ToGCSOperator: Transfers data from S3 to GCS.
o GCSToS3Operator: Transfers data from GCS to S3.
o BigQueryOperator: Runs queries in BigQuery.
o PostgresOperator: Executes SQL commands in PostgreSQL.
3. Sensor Operators:
o FileSensor: Waits for a file to appear in a specific location.
o HttpSensor: Waits for a response from a web server.
4. Branch Operator:
o BranchPythonOperator: Branches the workflow into multiple paths based on
a condition.
5. SubDagOperator:
o Used for executing sub-DAGs, allowing you to nest workflows within other
workflows.
8. How do you set up dependencies between tasks in a DAG?
In Airflow, you can set task dependencies using the shift operators (>> and <<) or by using
the set_downstream() and set_upstream() methods.
python
task1 >> task2 # task1 will run before task2
task2 << task3 # task2 will run after task3
python
task1.set_downstream(task2)
task3.set_upstream(task2)
This ensures that tasks will run in the desired order, respecting their dependencies.
9. What is the difference between task retries and task SLA in Airflow?
1. Task Retries:
o Purpose: Allows a task to be retried a certain number of times in case it fails.
Airflow will attempt to re-execute the task after a failure.
o Key Properties:
retries: Number of retry attempts (default is 0).
retry_delay: Delay between retries (e.g., timedelta(minutes=5)).
Example:
python
task = PythonOperator(
    task_id='my_task',
    retries=3,
    retry_delay=timedelta(minutes=10),
    python_callable=my_function,
    dag=dag,
)
2. Task SLA (Service Level Agreement):
o Purpose: Defines the maximum time a task is expected to take. If the task runs longer
than its SLA, Airflow records an SLA miss and can send alert emails, but the task itself
keeps running.
Example:
python
task = PythonOperator(
    task_id='my_task',
    sla=timedelta(hours=2),
    python_callable=my_function,
    dag=dag,
)
10. How do you parameterize DAGs using Airflow Variables and XComs?
1. Airflow Variables:
o Purpose: Airflow Variables allow you to store and retrieve dynamic values
for DAGs, which can be used for parameterization.
o Usage:
Set variables using the UI or CLI.
Retrieve variables in a DAG with Variable.get('variable_name').
Example:
python
from airflow.models import Variable
value = Variable.get("my_variable")
o You can also provide default values and use them within your tasks.
2. XComs (Cross-Communication):
o Purpose: XComs allow tasks to exchange data with each other. A task can
push a value to XCom, which can be pulled by other tasks.
o Usage:
Use xcom_push to send data from one task to another.
Use xcom_pull to retrieve the data in a downstream task.
Example:
python
# Pushing a value from inside task1's python_callable (via the task instance)
def push_value(ti):
    ti.xcom_push(key='my_key', value='some_value')

# Pulling the value from inside task2's python_callable
def pull_value(ti):
    value = ti.xcom_pull(task_ids='task1', key='my_key')
11. How does Cloud Composer integrate with BigQuery?
python
from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator

bigquery_query = BigQueryOperator(
    task_id='run_bigquery_query',
    sql='SELECT * FROM `project.dataset.table` LIMIT 1000',
    destination_dataset_table='project.dataset.result_table',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,  # backtick-quoted table names require standard SQL
    dag=dag
)
1. DataflowOperator:
o Cloud Composer provides a DataflowPythonOperator (and other Dataflow
operators) that allows you to trigger Dataflow pipelines from within a DAG.
Example:
python
from airflow.providers.google.cloud.operators.dataflow import DataflowPythonOperator

trigger_dataflow = DataflowPythonOperator(
    task_id='trigger_dataflow_pipeline',
    py_file='gs://your-bucket/dataflow_script.py',
    job_name='dataflow-job',
    location='us-central1',
    options={
        'input': 'gs://input-data/*.csv',
        'output': 'gs://output-data/result'
    },
    dag=dag
)
13. How can you use Cloud Composer to move data between Cloud Storage and
BigQuery?
1. GCSToBigQueryOperator:
o This operator (formerly GoogleCloudStorageToBigQueryOperator) loads data from
Cloud Storage (CSV, JSON, etc.) into BigQuery.
Example:
python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

load_gcs_to_bq = GCSToBigQueryOperator(
    task_id='load_gcs_to_bq',
    bucket='your-bucket-name',
    source_objects=['path/to/your/file.csv'],
    destination_project_dataset_table='project.dataset.table',
    skip_leading_rows=1,
    field_delimiter=',',
    source_format='CSV',
    dag=dag
)
2. BigQueryToGCSOperator:
o This operator (formerly BigQueryToCloudStorageOperator) exports data from BigQuery
to Cloud Storage.
Example:
python
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

export_bq_to_gcs = BigQueryToGCSOperator(
    task_id='export_bq_to_gcs',
    source_project_dataset_table='project.dataset.table',
    destination_cloud_storage_uris=['gs://your-bucket/data/*.csv'],
    export_format='CSV',
    field_delimiter=',',
    compression='NONE',
    dag=dag
)
14. How do you use Pub/Sub with Cloud Composer for event-driven workflows?
1. PubSub Operators:
o You can use PubSub operators to trigger Cloud Composer workflows based
on events. You can use the PubSubPullSensor to monitor a topic for new
messages or trigger tasks in response to incoming Pub/Sub messages.
2. PubSubPullSensor:
o This sensor waits for messages on a Pub/Sub topic, and when a message is
available, it can trigger subsequent tasks.
Example:
python
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor

wait_for_pubsub_message = PubSubPullSensor(
    task_id='wait_for_message',
    project_id='your-gcp-project',
    subscription='your-subscription-name',
    max_messages=1,
    timeout=300,       # Timeout in seconds
    poke_interval=30,  # Check every 30 seconds
    dag=dag
)
3. Triggering Tasks:
o After the sensor detects a message, you can trigger other tasks in the DAG
based on the content of the message.
15. How does Cloud Composer connect to external APIs and databases?
python
from airflow.providers.http.operators.http import SimpleHttpOperator

call_external_api = SimpleHttpOperator(
    task_id='call_api',
    method='GET',
    http_conn_id='external_api_connection',
    endpoint='api/v1/data',
    headers={"Content-Type": "application/json"},
    dag=dag
)
Operator:
o An Operator is an abstraction used to define a task in a Directed Acyclic
Graph (DAG). It encapsulates the logic to perform specific actions (e.g.,
running a query in BigQuery, executing a bash command).
o Operators are used to specify what tasks should be executed within a DAG.
Hook:
o A Hook is an abstraction that provides the interface for interacting with
external systems. Hooks manage connections and simplify communication
with services like databases, APIs, or cloud services.
o Hooks are often used inside Operators to handle connections to external
systems (e.g., a PostgresHook or BigQueryHook).
Difference:
Operators execute tasks, while Hooks help establish connections and handle the
interaction with external services.
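A small sketch of that relationship, using a PostgresHook inside a PythonOperator callable; the connection ID, table name, and task IDs are assumptions:
python
from airflow.operators.python_operator import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def count_rows():
    # The hook manages the connection defined in the Airflow UI (Admin > Connections)
    hook = PostgresHook(postgres_conn_id='postgres_conn_id')
    records = hook.get_records('SELECT COUNT(*) FROM my_table')
    print(f"Row count: {records[0][0]}")

count_task = PythonOperator(
    task_id='count_rows_task',
    python_callable=count_rows,
    dag=dag,
)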
17. How do you use BigQueryOperator in Airflow?
python
from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator

run_bigquery_query = BigQueryOperator(
    task_id='run_bigquery_query',
    sql='SELECT * FROM `project.dataset.table` WHERE column = "value"',
    destination_dataset_table='project.dataset.result_table',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,  # backtick-quoted table names require standard SQL
    dag=dag
)
Example usage:
python
from airflow.operators.python_operator import PythonOperator

def my_python_function():
    print("This is a Python function executed by Airflow")

python_task = PythonOperator(
    task_id='run_python_task',
    python_callable=my_python_function,
    dag=dag
)
python
from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    task_id='run_bash_command',
    bash_command='echo "Hello, World!"',
    dag=dag
)
Comparison of BigQueryOperator and PostgresOperator:
Common Usage:
o BigQueryOperator: Executing SQL queries on BigQuery or managing datasets.
o PostgresOperator: Executing SELECT/INSERT/UPDATE queries on PostgreSQL.
Connection:
o BigQueryOperator: Requires BigQueryHook to manage the connection to BigQuery.
o PostgresOperator: Requires PostgresHook to manage the database connection.
Example of PostgresOperator:
python
from airflow.providers.postgres.operators.postgres import PostgresOperator

postgres_task = PostgresOperator(
    task_id='run_postgres_query',
    sql='SELECT * FROM my_table WHERE column = %s',
    parameters=('value',),
    postgres_conn_id='postgres_conn_id',
    dag=dag
)
Example of BigQueryOperator:
python
from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator

bigquery_task = BigQueryOperator(
    task_id='run_bigquery_query',
    sql='SELECT * FROM `project.dataset.table` WHERE column = "value"',
    destination_dataset_table='project.dataset.result_table',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,  # backtick-quoted table names require standard SQL
    dag=dag
)
Cloud Composer scales workloads through Airflow's native scheduler and workers:
1. Horizontal scaling: You can adjust the number of workers (VMs) in your Cloud
Composer environment to handle increasing or decreasing workload demands. This is
done via the Google Cloud Console or by adjusting the environment settings.
2. Dynamic scaling: Cloud Composer automatically scales the resources in response to
the number of tasks in the DAG, based on the number of worker nodes and the
worker size.
3. Autoscaling: Cloud Composer uses the Kubernetes Engine (if enabled) to manage
the scaling of resources dynamically, depending on the tasks that need to be
processed. If demand increases, Cloud Composer adds more resources; when the
demand decreases, it reduces resources.
22. What are the best practices for optimizing DAG performance in Cloud Composer?
Here are some best practices to optimize DAG performance in Cloud Composer:
1. Task parallelism: Maximize parallelism by ensuring that tasks that don’t depend on
each other can run concurrently. Use task_concurrency and max_active_runs to
control parallelism at the task and DAG levels.
2. Task retries and failure handling: Set appropriate retries, backoff strategies, and
timeouts for tasks to avoid excessive retries and resource usage. Use retries,
retry_delay, and max_retry_delay parameters.
3. Split large tasks: Break down large tasks into smaller sub-tasks or processes to make
them easier to manage and scale.
4. Use XCom for passing data: Use XCom to pass small amounts of data between
tasks rather than large datasets.
5. Optimized connections: Ensure your external connections (e.g., databases, APIs) are
optimized for high throughput and low latency.
2. Task Groups: Use task groups to logically group tasks and manage complex DAGs.
Task groups can help keep the DAG visualized better and easier to understand.
3. Autoscaling: Cloud Composer can automatically scale the number of workers based
on the tasks in the DAG, ensuring that resources are allocated efficiently and tasks are
completed in a timely manner.
4. Resource Limits: You can set resource limits for tasks and workers. For example,
you can configure the CPU, memory, and storage for the workers to ensure optimal
allocation.
25. How can you reduce costs when using Cloud Composer?
1. Use minimal worker instances: Scale down the number of worker nodes when
possible, especially if your DAGs do not require heavy processing. Utilize
autoscaling to dynamically adjust resources based on the workload.
2. Use Preemptible VMs: For non-critical tasks, use preemptible VMs to reduce the
cost of running Airflow workers.
3. Optimize DAG structure: Reduce the complexity and number of tasks in your DAGs
to avoid overprovisioning resources.
4. Choose appropriate machine types: Select smaller machine types for the workers,
depending on the workload requirements. This helps save costs while maintaining
adequate performance.
5. Limit resource allocation: Set limits on CPU and memory for your workers and
tasks, ensuring that they are not overprovisioned.
6. Use efficient operators: Use optimized operators that reduce the workload on Cloud
Composer, such as using BigQueryOperator directly instead of custom Python tasks
that perform similar operations.
7. Monitor usage: Regularly monitor Cloud Composer usage and optimize based on
performance data and task completion times.
1. Project Owner or Editor: These roles provide full access to Cloud Composer
resources.
2. Cloud Composer Admin (roles/composer.admin): Grants permissions to create,
update, and delete Cloud Composer environments.
3. Cloud Composer Worker (roles/composer.worker): Required for users who need
to manage workflows running within the Cloud Composer environment.
4. Cloud Composer Viewer (roles/composer.viewer): Grants read-only access to
Cloud Composer resources.
5. Service Account User (roles/iam.serviceAccountUser): Required for tasks that
interact with Google Cloud services using service accounts.
6. BigQuery Data Viewer (roles/bigquery.dataViewer): If interacting with BigQuery,
this role is necessary to read data from BigQuery.
7. Storage Object Viewer (roles/storage.objectViewer): If interacting with Cloud
Storage, this role provides read access to storage objects.
27. How do you set up logging and monitoring for Cloud Composer?
Logging and monitoring in Cloud Composer can be set up using the following tools:
1. Cloud Logging:
o Airflow task logs and Cloud Composer environment logs are written to Cloud Logging,
where they can be searched, filtered, and exported.
2. Cloud Monitoring:
o Cloud Composer integrates with Cloud Monitoring to monitor resource
utilization and task performance.
o You can set up alert policies for critical errors, failed tasks, or performance
degradation.
o Use Stackdriver Monitoring to track resource usage such as CPU and
memory.
3. Airflow UI:
o The Airflow web interface offers built-in monitoring, allowing you to view
DAG execution status, logs, task dependencies, and more.
4. Custom Monitoring:
o You can integrate custom monitoring into your DAGs using Cloud
Monitoring APIs or by pushing custom metrics to Cloud Monitoring.
28. How do you secure DAGs and sensitive information in Cloud Composer?
3. Airflow Connections:
o Use Airflow’s Connections feature to securely manage credentials for
databases, APIs, and other services. Store these credentials securely rather
than in plaintext in your DAG code.
4. Environment Encryption:
o Enable encryption at rest for your Cloud Composer environment to protect
data from unauthorized access.
6. Audit Logs:
o Enable audit logging to track access to Cloud Composer and ensure that all
actions related to DAG execution and configuration changes are logged and
monitored.
29. What happens when a DAG fails, and how do you retry failed tasks?
1. Task Failure:
o A task failure is recorded in the Airflow UI, where you can inspect the logs to
diagnose the issue.
o The DAG run status will reflect the failure, and dependent tasks will not be
executed unless the failure is resolved.
2. Retrying Failed Tasks:
o Airflow supports task retries. When a task fails, you can configure it to retry
based on the retries and retry_delay parameters.
o You can specify the maximum number of retries using the retries argument in
the task operator.
o The retry_delay argument specifies the delay between retries.
o Failed tasks can also be retried manually via the Airflow UI or using the
Airflow CLI.
o If using PythonOperator or custom operators, you can implement custom
retry logic using retry_exponential_backoff.
30. How does Cloud Composer ensure high availability and reliability?
1. Regional Deployment:
o Cloud Composer environments are deployed regionally across multiple
availability zones to ensure redundancy in case of failures.
2. Multiple Workers and Scheduler:
o Airflow uses multiple workers to process tasks, ensuring that if one worker
fails, others can take over.
o The Airflow Scheduler is highly available and can run across multiple
instances for fault tolerance.
3. Autoscaling:
o Cloud Composer can scale workers dynamically based on task demand. This
ensures that resources are always available for task execution, even during
high workloads.
4. Preemptible VMs for Cost-Effective Redundancy:
o Cloud Composer can use preemptible VMs as part of its worker pool, which
are more cost-effective and replaceable during high availability situations.
5. Backup and Restore:
o Cloud Composer integrates with Google Cloud Backup and Disaster
Recovery mechanisms to allow for backup and restore of your Airflow
environment in case of unexpected failures.
6. Error Handling and Retries:
o Airflow’s native retry mechanisms ensure that transient failures are
automatically retried, improving the reliability of task execution.
7. Monitoring and Alerts:
o Cloud Composer integrates with Cloud Monitoring to provide alerts for
failures, resource shortages, or other issues. Proactive monitoring helps ensure
service reliability.
IAM
GCP IAM (Identity and Access Management) is a framework that allows you to control
access to Google Cloud resources by specifying who can perform what actions on which
resources. IAM is important because it helps organizations enforce the principle of least
privilege, ensuring that only authorized users and services have access to sensitive resources,
while maintaining security and compliance.
Security: Helps prevent unauthorized access and ensures that users can only perform
actions that they are authorized to.
Compliance: Supports auditing and access control policies for meeting regulatory
requirements.
Granular Access: Provides fine-grained access control to Cloud resources.
Scalability: Enables managing permissions for large organizations with multiple
users, roles, and projects.
1. Identities: Represents entities that need access to resources (users, groups, service
accounts, or Google groups).
2. Roles: Define a collection of permissions that are granted to identities. Roles can be
assigned to users, groups, or service accounts.
3. Permissions: Specific actions allowed on a resource (e.g., compute.instances.start or
storage.objects.create).
4. Policies: Policies are the bindings that associate identities with roles. Policies define
the access granted to identities.
5. Audit Logs: Logs that track all IAM-related activities, such as who performed an
action and when.
4. How does GCP IAM differ from traditional role-based access control (RBAC)?
While both GCP IAM and traditional RBAC systems manage access using roles, there are
some key differences:
Service Accounts: GCP IAM uses service accounts to grant permissions to virtual
machines, apps, and services, whereas traditional RBAC does not typically include
service accounts and focuses on user roles.
1. Primitive Roles:
o Owner: Full control over all resources (including billing, project settings,
etc.).
o Editor: Can modify resources, but cannot manage roles and permissions.
o Viewer: Can view resources but cannot modify them.
2. Predefined Roles:
o These roles are specific to a Google Cloud service and grant granular
permissions based on the actions needed for that service (e.g.,
roles/storage.objectViewer for reading objects in Cloud Storage).
3. Custom Roles:
o Custom roles allow you to define a set of specific permissions tailored to your
use case. You can create custom roles to grant only the permissions needed for
a specific task or resource.
6. What is the difference between primitive, predefined, and custom roles in IAM?
Primitive Roles: Basic roles that apply to the entire Google Cloud project: Owner, Editor,
and Viewer. They grant broad permissions across all services within a project but lack
fine-grained control.
Predefined Roles: Roles specific to Google Cloud services that grant granular permissions
based on the actions needed for that service (e.g., roles/storage.objectViewer for reading
Cloud Storage objects). Predefined roles are more fine-grained than primitive roles.
Custom Roles: Roles you create with a specific set of permissions tailored to your needs.
You can combine different permissions and assign them to users or service accounts based
on business requirements. Custom roles offer the highest level of granularity.
bash
gcloud projects add-iam-policy-binding PROJECT_ID \
--member='user:USER_EMAIL' --role='ROLE_NAME'
The Principle of Least Privilege (PoLP) is the practice of giving users, groups, or service
accounts the minimum permissions necessary to perform their tasks, and no more. This
minimizes the potential attack surface, limits the impact of security breaches, and helps
ensure compliance with security best practices.
Assigning the most restrictive roles that meet the needs of the user or service account.
Avoiding broad roles like Owner or Editor unless absolutely necessary.
Using custom roles when predefined roles grant more permissions than needed.
9. How do IAM policies inherit permissions across the GCP resource hierarchy?
IAM policies are inherited across the GCP resource hierarchy, which is structured as:
If a role is granted at the Organization level, all folders, projects, and resources
within that organization inherit those permissions.
If a role is granted at the Folder level, it will be inherited by all projects and
resources under that folder.
Permissions granted at the Project level apply to all resources within the project.
Permissions are additive down the hierarchy: a role granted at a lower level (e.g., project
or resource) adds to the permissions inherited from higher levels but cannot remove them,
so a user's effective access is the union of the policies at every level.
10. What happens when a user has multiple IAM roles assigned at different levels?
When a user has multiple IAM roles assigned at different levels, permissions are additive:
Permissions from all roles (across the project, folder, and organization levels) are
combined to determine the user's total permissions.
IAM allow policies cannot deny access: one role cannot revoke what another grants.
Access is only blocked by separate IAM Deny policies or organization policy constraints,
which take precedence over role grants.
Roles can be assigned at different levels (organization, folder, project), and the effective
access is the combination of everything granted at every level.
In summary, the user gets the union of all permissions from their roles, unless an explicit
IAM Deny policy or organization policy restriction applies.
11. What is a service account, and how is it different from a regular user account?
Service Account: A service account is a special type of Google Cloud identity used
by applications or virtual machines (VMs) to interact with Google Cloud services.
Service accounts are typically used for non-human access to resources, such as
running automated tasks or managing cloud resources through applications.
Regular User Account: A regular user account represents a human user and is
associated with an individual’s Google account (e.g., Gmail). Users access Google
Cloud resources directly via this account and are granted roles and permissions for
managing resources.
Differences: Service accounts authenticate with keys or Workload Identity rather than
passwords, belong to a project rather than a person, and are intended for programmatic
access, whereas regular user accounts represent people who sign in interactively.
bash
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
--display-name "Service Account Display Name"
3. Assign roles:
o You can assign roles to the service account either during creation or after by
using the gcloud CLI or the Console.
bash
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.admin"
13. What are service account keys, and how should they be securely managed?
Service Account Keys: A service account key is a set of credentials (in JSON or P12
format) that allows an application or VM to authenticate as a service account. These
keys are used to grant access to resources that the service account has been authorized
for.
Secure Management:
o Avoid storing keys in source control or any publicly accessible location.
o Use Google Cloud Secret Manager to store keys securely.
o Rotate keys periodically and disable old keys.
o Avoid creating unnecessary keys, and grant service accounts only the
least-privileged roles they need.
o When possible, use Workload Identity Federation instead of managing keys
manually to reduce the risk of compromised credentials.
You can use service accounts for authentication by creating a key for the service account and
configuring your application to use this key for authentication.
bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"
3. Use Libraries:
o For example, in Python, use the google-auth library to load the key file and build
credentials:
python
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/keyfile.json")
Workload identity federation supports better governance and compliance, especially for
hybrid and multi-cloud architectures.
16. What are IAM conditions, and how do they help in fine-grained access control?
IAM Conditions: IAM Conditions are an advanced feature in Google Cloud IAM
that allow you to define conditional access based on attributes like resource names,
request times, or the user's IP address. Conditions are added to IAM policies to
enforce rules about when and how a permission is granted.
Fine-Grained Access Control:
o With conditions, you can create more granular access control, such as
granting access only to specific resources, or enforcing policies based on the
request's context.
o Example: You can allow users to access Cloud Storage objects only during
business hours or restrict access to certain resources based on the user's
location.
json
{
  "condition": {
    "title": "Allow access during business hours",
    "expression": "request.time >= timestamp('2023-01-01T09:00:00Z') && request.time <= timestamp('2023-01-01T17:00:00Z')"
  }
}
17. How do you use IAM policies to enforce organization-wide security best practices?
To enforce organization-wide security best practices, you can use IAM policies at the
organization or folder level, which apply to all projects under that organization or folder.
Best Practices:
o Principle of Least Privilege: Grant only the minimum permissions needed for
users and service accounts to perform their tasks.
o Use Predefined Roles: Prefer predefined IAM roles instead of creating
custom roles, to ensure they follow best practices and least privilege.
o Enforce Strong Authentication: Use multi-factor authentication (MFA)
for all users accessing critical resources.
o Use Organization Policies: Apply restrictions on resource creation and access
at the organizational level to ensure security standards are consistently
followed.
o Regular Audits: Continuously review IAM roles and policies, ensuring that
users only have the access they need.
You can also use Google Cloud's IAM Recommender to get suggestions on how to adjust
roles and permissions to reduce over-provisioning.
18. How does IAM logging work, and how can it help in auditing?
IAM Logging: IAM logging is managed through Cloud Audit Logs. Google Cloud
records all IAM-related activities, such as role assignments, policy changes, and
permission grants, in the Audit Logs. These logs are stored in Cloud Logging
(formerly Stackdriver), which can then be used for analysis, monitoring, and auditing.
Audit Logs Types:
o Admin Activity Logs: Logs actions that modify resources or configuration
(e.g., role assignments, policy changes).
o Data Access Logs: Logs access to sensitive resources (if enabled).
o System Event Logs: Logs generated by the system, such as resource
provisioning.
Benefits:
o Tracking Changes: Monitor and track who made changes to IAM roles or
permissions, which can be helpful in identifying unauthorized access or
security breaches.
o Compliance: Useful for compliance audits, ensuring that only authorized
users have performed specific tasks.
o Forensics: In case of a security breach, IAM logs help trace the events that led
to the breach, assisting in post-event analysis.
You can use Cloud Logging queries to filter and analyze specific IAM actions or role
changes.
19. What are organization policies, and how do they interact with IAM roles?
Organization Policies: Organization policies are a set of rules that define constraints
on resource usage across your GCP organization. They are used to enforce
governance, security, and compliance across all projects within the organization.
Interaction with IAM:
o Enforce Security Standards: Organization policies can restrict certain IAM
roles from being assigned, ensuring that users can only get roles that comply
with the organization's security requirements.
o Limit Resource Creation: Organization policies can prevent certain actions
(e.g., restricting the creation of resources in specific regions or enforcing the
use of certain types of encryption).
o Prevent Overly Permissive Roles: You can use organization policies to
prevent the assignment of overly permissive roles (e.g., roles/owner).
20. How does IAM support multi-cloud or hybrid cloud access control?
IAM in Google Cloud supports multi-cloud and hybrid cloud environments in several ways:
Identity Federation: IAM allows identity federation, so users from other cloud
providers (e.g., AWS, Azure) or on-premise identity systems (e.g., LDAP) can
authenticate and access Google Cloud resources via Workload Identity Federation.
Cloud Identity: Using Cloud Identity or Google Workspace, organizations can
manage users across multiple environments (on-premise, Google Cloud, and external
cloud platforms) with a single identity system.
Cross-cloud Permissions: With Cross-Cloud Identity and IAM roles, you can
assign consistent roles across different cloud platforms to give users seamless access
to resources in Google Cloud and other clouds, using standard security policies.
Multi-cloud Management: IAM also integrates with tools like Anthos (GCP’s
multi-cloud platform) to provide access control and centralized identity management
across Google Cloud, on-premises, and other cloud environments.
By using IAM's federated identities and integrated role management, you can ensure
consistent, secure access to cloud resources across hybrid and multi-cloud architectures.
21. How do you check and troubleshoot IAM access issues?
IAM Recommender is a tool in GCP that provides recommendations for roles that
should be assigned or revoked, based on usage patterns and the Principle of Least
Privilege.
How to Use IAM Recommender:
o It helps identify over-provisioned IAM roles (e.g., when a user has excessive
permissions that are not being used) and suggests role minimization.
o You can view recommendations in the IAM & Admin section of the GCP
Console under Recommender.
o Automated Recommendations: The tool automatically recommends roles
based on the actions that users perform, helping you assign only the necessary
roles to users and service accounts.
o Review Recommendations: Go through the IAM Recommender's suggestions
and either accept or reject the recommendations based on your security
policies.
23. What is the difference between IAM roles and Cloud Identity Groups?
IAM Roles:
o IAM roles define permissions for accessing and performing operations on
resources within Google Cloud.
o There are predefined, custom, and primitive IAM roles that assign a specific
set of permissions to a user, group, or service account.
o IAM roles are used to grant access to specific resources within GCP (e.g.,
BigQuery, Cloud Storage).
Cloud Identity Groups:
o Cloud Identity Groups are a way of organizing users in your organization.
They allow you to group users for collaborative purposes or to simplify role
assignments.
o You can create groups in Google Cloud Identity or Google Workspace, and
then assign IAM roles to those groups.
o The main difference is that Cloud Identity Groups help manage users, while
IAM roles provide the actual access to cloud resources.
In summary, Cloud Identity Groups help with grouping users for easier management, while
IAM roles manage what resources the grouped users can access.
24. How do you revoke a user’s access when they leave the organization?
Here are some best practices for managing IAM permissions in large organizations:
PUB/SUB
Pub/Sub decouples the sender (publisher) from the receiver (subscriber), making it scalable
and reliable for event-driven architectures, microservices, and real-time data pipelines.
How it works:
Topics:
o A topic is a named resource to which messages are sent by publishers.
o Topics act as message channels that carry the messages sent by publishers.
Subscriptions:
o A subscription represents the link between a topic and a subscriber. A
subscriber receives messages from the subscription.
o Subscriptions can be of two types:
Pull subscriptions: The subscriber explicitly pulls messages from the
subscription.
Push subscriptions: The subscriber automatically receives messages
pushed by Pub/Sub to a configured endpoint.
Messages:
o Messages are the data that is sent by the publisher to a topic. Messages are
typically payloads (data) in a structured or unstructured format, and they can
have optional attributes for additional context.
Publisher:
o The publisher is an application or service that sends messages to a topic.
Subscriber:
o The subscriber is an application or service that receives messages from a
subscription.
3. How does Pub/Sub differ from traditional message queues like RabbitMQ or Kafka?
Here’s a comparison of Pub/Sub with traditional message queues like RabbitMQ and
Kafka:
1. At-most-once:
o Pub/Sub delivers each message at most once. If a message is not successfully
delivered to a subscriber, it will not be retried.
2. At-least-once:
o Pub/Sub ensures that each message is delivered at least once. This model
guarantees delivery, even in cases where retries are necessary, but it can lead
to duplicates.
o It’s the default delivery model in Pub/Sub.
3. Exactly-once:
o Pub/Sub guarantees that each message is delivered exactly once, preventing
both missing and duplicate messages. However, this requires a higher level of
overhead to track and manage message state.
o Exactly-once delivery is also available natively for pull subscriptions (within a
region), and Dataflow pipelines that integrate with Pub/Sub can deduplicate
messages to achieve the same guarantee during processing.
5. What is the difference between a topic and a subscription in Pub/Sub?
Topic:
o A topic is a named resource where publishers send messages.
o Topics act as message channels.
o A publisher pushes messages to a topic, but the topic does not handle message
delivery to subscribers directly.
Subscription:
o A subscription is an entity that allows subscribers to receive messages from
a topic.
o The subscription is the mechanism that links a topic to a subscriber.
o A subscription can be pull (where the subscriber requests messages) or push
(where Pub/Sub pushes messages to the subscriber endpoint).
In short, a topic is where messages are sent, while a subscription is where messages are
received by subscribers.
To publish messages to a Pub/Sub topic, you can use the Google Cloud SDK, client
libraries, or the Pub/Sub API. Here’s a basic workflow for publishing messages:
1. Create a Publisher Client: Initialize a publisher client for the specific topic using the
client library for your programming language (e.g., Python, Java, Go, etc.).
2. Create the Topic: If the topic doesn’t exist, you can create it using the gcloud
command or API.
3. Publish the Message: Send the message to the topic using the publish() method of the
publisher client.
python
from google.cloud import pubsub_v1
# Message to be published
data = "Hello, Pub/Sub!"
You can also use Google Cloud Console to send test messages manually.
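Putting those steps together, a minimal publisher sketch looks like the following; the project and topic IDs are placeholders:
python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

# Pub/Sub payloads must be bytes
future = publisher.publish(topic_path, "Hello, Pub/Sub!".encode('utf-8'))
print(f"Published message ID: {future.result()}")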
7. What are the different types of subscriptions in Pub/Sub?
1. Pull Subscription:
o Definition: In this model, the subscriber application explicitly "pulls"
messages from the subscription.
o Use case: This is useful when you need control over message consumption
and can manage the processing rate manually.
o Process: The subscriber repeatedly requests messages using the pull() method.
2. Push Subscription:
o Definition: In this model, Pub/Sub pushes messages to a subscriber's endpoint
(e.g., HTTP endpoint).
o Use case: This is useful when you want Pub/Sub to automatically deliver
messages to your application without polling.
o Process: The subscriber configures a URL endpoint, and Pub/Sub pushes the
message to that URL.
Pub/Sub guarantees message ordering by using message ordering keys. Here’s how it
works:
Key points:
In practice, message ordering is often used in scenarios like processing events sequentially
(e.g., financial transactions or time-sensitive data).
Message retention in Pub/Sub refers to how long messages are retained in a subscription
before they are deleted if not acknowledged by a subscriber.
By default, messages are retained for 7 days from the time they are published to a
topic, even if they are not acknowledged by the subscriber.
Retention Period:
o Default: 7 days.
o You cannot extend the retention period beyond 7 days.
After the retention period ends, the unacknowledged messages are discarded, and no
longer available for the subscriber.
Acknowledge Mechanism: Messages are kept in the subscription as long as they
remain unacknowledged. Once a message is acknowledged by the subscriber, it is
deleted.
Retention Configuration: Because the retention window is capped at 7 days, you control
retention in practice by making sure subscribers acknowledge messages before the period
expires.
Message Retention: Messages are stored until they are acknowledged by a subscriber
(up to 7 days).
Redelivery on Failure: If a subscriber does not acknowledge (ACK) a message,
Pub/Sub retries and redelivers it until it is successfully processed.
Dead-Letter Topics (DLTs): If messages repeatedly fail to be acknowledged, they
can be redirected to a dead-letter topic (DLT) for further analysis.
Message Acknowledgment
python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = 'projects/my-project/subscriptions/my-subscription'

def callback(message):
    print(f"Received message: {message.data}")
    message.ack()  # Acknowledges the message

subscriber.subscribe(subscription_path, callback=callback)
Dead-Letter Topics (DLTs)
sh
gcloud pubsub subscriptions update my-subscription \
--dead-letter-topic=projects/my-project/topics/my-dlt \
--max-delivery-attempts=5
13. How does Pub/Sub handle duplicate messages?
Pub/Sub provides at-least-once delivery, so duplicate messages can occur due to:
Network failures
Subscriber processing delays
Redelivery due to unacknowledged messages
✅ Example of Deduplication
python
processed_messages = set()

def callback(message):
    if message.message_id not in processed_messages:
        print(f"Processing message: {message.data}")
        processed_messages.add(message.message_id)
        message.ack()  # Acknowledge message
    else:
        print("Duplicate message detected, ignoring.")
Flow control prevents subscribers from being overwhelmed by too many messages. It
helps:
python
flow_control = pubsub_v1.types.FlowControl(
    max_messages=10, max_bytes=1024 * 1024)
2. Batch Processing
o Process messages in batches instead of one by one.
3. Auto-Scaling
o Use multiple subscriber instances for higher throughput.
python
subscriber.subscribe(
subscription_path,
callback=callback,
flow_control=pubsub_v1.types.FlowControl(max_messages=5)
)
Pub/Sub is designed for horizontal scaling and can handle millions of messages per second.
Key scaling features:
sh
gcloud pubsub subscriptions create my-scaled-subscription \
--topic=my-topic \
--ack-deadline=20
This command creates a subscription with a longer ACK deadline, allowing parallel
workers to process messages efficiently.
GCP IAM controls who can publish and subscribe to a Pub/Sub topic. You can use IAM
roles and policies to restrict access to Pub/Sub resources.
Pub/Sub encrypts messages at rest and in transit using multiple layers of encryption.
Encryption at Rest:
o All messages stored by Pub/Sub are encrypted at rest by default using Google-managed
keys.
Encryption in Transit:
o Messages are encrypted using TLS (Transport Layer Security) when transmitted
between publishers, Pub/Sub, and subscribers.
Customer-Managed Encryption Keys (CMEK):
o Users can enable Cloud KMS to manage their own encryption keys instead of using
Google-managed encryption, ensuring greater control over encryption and key rotation
policies.
sh
gcloud pubsub topics create my-topic \
  --topic-encryption-key=projects/my-project/locations/global/keyRings/my-keyring/cryptoKeys/my-key
✅ Best Practices
18. How can you ensure that only authorized publishers and subscribers interact with a
topic?
To enforce strict access control, you should use IAM policies, VPC Service Controls, and
private access.
✅ Best Practices
✔️ Audit IAM permissions using Cloud Audit Logs.
✔️ Disable anonymous access by ensuring no public roles are assigned.
✔️ Enable CMEK for additional encryption security.
19. What is VPC Service Controls, and how does it help secure Pub/Sub
communication?
VPC Service Controls (VPC-SC) is a security perimeter around Google Cloud services
like Pub/Sub.
It prevents unauthorized data exfiltration from within a controlled network.
sh
gcloud access-context-manager perimeters create my-perimeter \
--title="My Security Perimeter" \
--resources=projects/my-project \
--restricted-services=pubsub.googleapis.com
✅ Best Practices
sh
gcloud kms keyrings create my-keyring --location global
gcloud kms keys create my-key \
--location global \
--keyring my-keyring \
--purpose encryption
sh
gcloud pubsub topics create my-topic \
  --topic-encryption-key=projects/my-project/locations/global/keyRings/my-keyring/cryptoKeys/my-key
sh
gcloud kms keys add-iam-policy-binding my-key \
--location global \
--keyring my-keyring \
--member=serviceAccount:[email protected] \
--role=roles/cloudkms.cryptoKeyEncrypterDecrypter
Customer-Managed Encryption Allows users to manage their own encryption keys via
(CMEK) Cloud KMS.
VPC Service Controls Prevents unauthorized external access and data leaks.
21. How does Pub/Sub integrate with Cloud Functions for event-driven architectures?
Pub/Sub triggers Cloud Functions when a message is published to a topic, making it ideal for
event-driven architectures.
sh
gcloud pubsub topics create my-topic
sh
gcloud functions deploy my-function \
--runtime python310 \
--trigger-topic my-topic \
--entry-point my_handler_function \
--region us-central1
python
import base64
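The stub above only imports base64; a minimal function body that the deploy command could point at is sketched below (the entry point name matches the --entry-point flag used earlier, and the print logic is purely illustrative):
python
import base64

def my_handler_function(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    payload = base64.b64decode(event['data']).decode('utf-8')
    print(f"Received Pub/Sub message: {payload}")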
22. How can you use Pub/Sub with Dataflow for real-time stream processing?
Pub/Sub integrates with Apache Beam (Dataflow) to ingest, process, and transform
streaming data in real-time.
Architecture
Steps to Integrate
sh
gcloud pubsub topics create my-streaming-topic
gcloud pubsub subscriptions create my-sub --topic my-streaming-topic
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class PrintMessage(beam.DoFn):
    def process(self, element):
        print(f"Received: {element}")

pipeline_options = PipelineOptions(streaming=True)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (pipeline
     | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-streaming-topic")
     | "Process Message" >> beam.ParDo(PrintMessage()))
Use Cases
sh
gcloud pubsub topics create my-workflow-topic
python
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor
from airflow import DAG
from datetime import datetime
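Building on those imports, a minimal DAG that waits for a Pub/Sub message before continuing might look like this sketch; the DAG ID, project, and subscription names are assumptions:
python
from airflow import DAG
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor
from datetime import datetime

with DAG('pubsub_triggered_dag', start_date=datetime(2025, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    wait_for_message = PubSubPullSensor(
        task_id='wait_for_message',
        project_id='your-gcp-project',
        subscription='your-subscription-name',
        max_messages=1,
        poke_interval=30,
    )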
Use Cloud Functions to receive a message and trigger a DAG using Airflow REST
API.
Cloud Function calls:
sh
gcloud composer environments run my-composer \
  --location us-central1 dags trigger -- my_dag_id   # Airflow 2 CLI (use trigger_dag on Airflow 1)
Use Cases
24. How does Pub/Sub integrate with BigQuery for real-time analytics?
Pub/Sub can stream data into BigQuery using Dataflow or BigQuery's Pub/Sub
Subscription.
Method 1: Using Dataflow for Real-time Streaming
sh
gcloud pubsub topics create my-bigquery-stream
sh
bq mk --table my_project:my_dataset.my_table \
  id:STRING,event_time:TIMESTAMP,message:STRING
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read Pub/Sub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-bigquery-stream")
     | "Transform" >> beam.Map(lambda msg: {"id": "123",
                                            "event_time": "2024-01-01T00:00:00Z",
                                            "message": msg.decode("utf-8")})
     | "Write to BQ" >> WriteToBigQuery("my_project:my_dataset.my_table",
                                        create_disposition="CREATE_IF_NEEDED"))
Method 2: BigQuery Subscription (Without Dataflow)
sh
gcloud pubsub subscriptions create my-bq-sub \
  --topic=my-bigquery-stream \
  --bigquery-table=my_project:my_dataset.my_table \
  --use-topic-schema
Use Cases
Instead of service account keys, use Workload Identity for Kubernetes service
accounts.
sh
gcloud iam service-accounts add-iam-policy-binding \
  [email protected] \
  --member="serviceAccount:my-project.svc.id.goog[gke-namespace/gke-service]" \
  --role="roles/iam.workloadIdentityUser"
The Google service account itself is then granted roles/pubsub.subscriber on the
subscription or project.
2. Use Pull Subscriptions Instead of Push
python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = "projects/my-project/subscriptions/my-gke-sub"

def callback(message):
    print(f"Received: {message.data}")
    message.ack()

subscriber.subscribe(subscription_path, callback=callback)
3. Use Horizontal Pod Autoscaler (HPA) for Scaling
yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pubsub-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pubsub-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
      target:
        type: AverageValue
        averageValue: 100
CLOUD SPANNER
1. What is GCP Cloud Spanner, and how does it differ from traditional relational
databases?
Automatic Sharding: Cloud Spanner shards data automatically (managed by Google),
whereas traditional relational databases require manual partitioning.
5. How does Cloud Spanner ensure strong consistency across multiple regions?
🔹 Example: A bank transfer between two regions (e.g., USA → Europe) will either fully
commit or rollback, ensuring no partial transactions occur.
6. What is the difference between an instance, database, and table in Cloud Spanner?
Instance: The allocation of compute and storage capacity (nodes or processing units) in a
chosen regional or multi-regional configuration; it can host multiple databases.
Database: A container within an instance that holds the schema (tables, indexes) and the
data itself.
Table: The schema object inside a database that stores rows, defined with a primary key.
7. How does Cloud Spanner store and distribute data across multiple nodes?
✅ 1. Automatic Data Sharding – Data is split into splits (shards) and distributed across
nodes.
✅ 2. Paxos Protocol – Ensures strong consistency across replicas.
✅ 3. Read/Write Leaders – Each shard has a leader replica handling writes.
✅ 4. Multi-Region Replication – Data is replicated across different regions to ensure
availability.
✅ 5. Colossus Storage – Google’s distributed file system manages persistent storage.
🔹 Example: If a table has 10 million rows, Spanner automatically partitions the data and
assigns it across multiple nodes.
8. What is the purpose of interleaved tables in Cloud Spanner?
✅ Benefits:
✅ Example:
sql
CREATE TABLE Customers (
CustomerID STRING(36) NOT NULL,
Name STRING(100),
) PRIMARY KEY (CustomerID);
Impact:
🔹 Best Practice: Use interleaved tables when parent-child relationships have a 1-to-many
association.
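The example above shows only the parent table; the child table is what carries the INTERLEAVE clause. A minimal sketch of both DDL statements applied through the Python client's update_ddl call (instance, database, and column names are placeholders, and the tables are assumed not to exist yet):
python
# Sketch: create a parent table and an interleaved child table (names are placeholders)
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerID STRING(36) NOT NULL,
         Name STRING(100),
       ) PRIMARY KEY (CustomerID)""",
    """CREATE TABLE Orders (
         CustomerID STRING(36) NOT NULL,
         OrderID STRING(36) NOT NULL,
         OrderDate TIMESTAMP,
       ) PRIMARY KEY (CustomerID, OrderID),
         INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result()  # wait for the schema change to complete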
10. What is the difference between primary keys and foreign keys in Cloud Spanner?
sql
CREATE TABLE Customers (
CustomerID STRING(36) NOT NULL,
Name STRING(100),
) PRIMARY KEY (CustomerID);
🔹 Key Differences:
Primary Key vs Foreign Key: A primary key uniquely identifies a row; a foreign key enforces relationships between tables.
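For contrast, a foreign key is declared as a constraint on the referencing table rather than by interleaving. A minimal DDL sketch applied through the same Python client (table, column, and constraint names are placeholders):
python
# Sketch: an Orders table that references Customers through a foreign key (names are placeholders)
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
operation = database.update_ddl([
    """CREATE TABLE Orders (
         OrderID STRING(36) NOT NULL,
         CustomerID STRING(36) NOT NULL,
         CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
           REFERENCES Customers (CustomerID)
       ) PRIMARY KEY (OrderID)"""
])
operation.result()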
Cloud Spanner uses Google Standard SQL, which is similar to ANSI SQL but includes
additional Spanner-specific features.
✅ Supported Features
✅ Example Query
sql
SELECT CustomerID, Name
FROM Customers
WHERE Name LIKE 'A%';
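The same query can be run programmatically with a read-only snapshot. A minimal sketch, assuming placeholder instance and database names:
python
# Sketch: run a Google Standard SQL query against Spanner (names are placeholders)
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT CustomerID, Name FROM Customers WHERE Name LIKE 'A%'"
    )
    for customer_id, name in rows:
        print(customer_id, name)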
Transaction Type Description Example
Read-Only: No locks, strong consistency (e.g., SELECT * FROM Orders;)
Read-Write: Uses locking and commits atomically (e.g., the fund-transfer transaction below)
python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

def transfer_funds(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A123'"
    )
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B456'"
    )

database.run_in_transaction(transfer_funds)
13. What is the difference between strong consistency and eventual consistency in Cloud
Spanner?
Consistency Model Description Use Case
Strong Consistency: Every read returns the most recently committed data. Use for financial transactions and other critical reads.
Stale Reads (Bounded Staleness): Reads may lag by a configured bound (e.g., 10 seconds) in exchange for lower latency. Use for dashboards, analytics, and non-critical reads.
✅ Example:
✅ Read Operations
✅ Write Operations
sql
SELECT * FROM Orders AS OF TIMESTAMP CURRENT_TIMESTAMP - INTERVAL 10 SECOND;
🔹 Best Practice: Use stale reads to reduce query latency for non-critical workloads.
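In the Python client, the staleness bound is usually supplied as a snapshot option rather than in the SQL text itself. A minimal sketch of a 10-second exact-staleness read (instance and database names are placeholders):
python
# Sketch: a stale read using an exact-staleness timestamp bound (names are placeholders)
import datetime
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
with database.snapshot(exact_staleness=datetime.timedelta(seconds=10)) as snapshot:
    for row in snapshot.execute_sql("SELECT * FROM Orders"):
        print(row)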
✅ How it Works
🔹 Why is it Important?
Feature Description
Read/Write Operations: Reads from multiple replicas, writes through the Paxos leader.
TrueTime API Prevents stale reads & write conflicts using atomic clocks.
Cloud Spanner scales horizontally by sharding data across multiple nodes and using Paxos-
based replication for consistency.
Automatic Sharding: Data is split into splits (shards) and distributed across nodes.
Compute and Storage Separation: Nodes handle queries and transactions, while
storage scales independently.
Multi-Region Replication: Ensures high availability and low-latency reads.
Load-Based Rebalancing: Spanner automatically moves data between nodes when
load increases.
🔹 Example: If a table reaches high read/write throughput, Spanner splits the data into
smaller chunks and distributes them to different nodes dynamically.
17. What are best practices for optimizing performance in Cloud Spanner?
Use Stale Reads: Reduce read latency by allowing slightly older data.
Avoid Large Transactions: Keep transactions small to reduce contention.
sql
SELECT * FROM Orders AS OF TIMESTAMP CURRENT_TIMESTAMP - INTERVAL 5 SECOND;
Benefit Description
Faster Query Performance: Improves query speed for non-primary-key lookups.
Optimized Sorting & Joins: Queries using indexed columns execute faster.
sql
CREATE INDEX idx_email ON Users(Email);
SELECT * FROM Users WHERE Email = '[email protected]';
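If the optimizer does not choose the index on its own, it can be referenced explicitly with a FORCE_INDEX hint. A minimal sketch using a query parameter (index name, table, and email value are placeholders):
python
# Sketch: query through a secondary index with a FORCE_INDEX hint (names and values are placeholders)
from google.cloud import spanner

database = spanner.Client().instance("my-instance").database("my-database")
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT * FROM Users@{FORCE_INDEX=idx_email} WHERE Email = @email",
        params={"email": "alice@example.com"},
        param_types={"email": spanner.param_types.STRING},
    )
    for row in rows:
        print(row)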
Method Description
Geographical Load Distribution: Spanner routes queries to the closest replica for low latency.
sql
SELECT * FROM Orders AS OF TIMESTAMP CURRENT_TIMESTAMP - INTERVAL 10 SECOND;
✅ Best Practice:
Feature Description
Performance Optimization: Use interleaved tables, secondary indexes, and batch writes.
Secondary Index Benefits: Improves query speed and reduces full table scans.
21. How does IAM work in Cloud Spanner for access control?
GCP IAM (Identity and Access Management) controls who can access Cloud Spanner and
what actions they can perform.
sh
gcloud spanner databases add-iam-policy-binding my-database \
  --instance=my-instance \
  --member=user:[email protected] \
  --role=roles/spanner.databaseReader
✅ Best Practices:
22. How can you monitor and audit queries in Cloud Spanner?
Tool Purpose
sql
EXPLAIN SELECT * FROM Orders WHERE CustomerID = '12345';
✅ This helps find slow queries and optimize them with indexes.
sh
gcloud logging read "resource.type=spanner_instance"
✅ Best Practices:
23. What are the pricing factors for Cloud Spanner, and how can costs be optimized?
Compute Nodes: Charged per node-hour; optimize by scaling nodes dynamically based on load.
sh
gcloud spanner instances update my-instance \
--autoscaling-config=enabled
Cloud Spanner encrypts all data at rest and in transit by default using Google-managed
encryption keys.
sh
gcloud kms keyrings create my-keyring --location=us-central1
gcloud kms keys create my-key --location=us-central1 \
--keyring=my-keyring --purpose=encryption
sh
gcloud projects add-iam-policy-binding my-project \
--member=serviceAccount:[email protected] \
--role=roles/cloudkms.cryptoKeyEncrypterDecrypter
sh
gcloud spanner instances create my-instance \
  --config=regional-us-central1 \
  --encryption-key=projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key
✅ Best Practices:
Task Command
Delete a Backup: gcloud spanner backups delete my-backup --instance=my-instance
✅ Best Practices:
Feature Description
Monitoring & Auditing Uses Cloud Logging, Audit Logs, and Query Execution Plans.
✅ Cloud Data Fusion is a fully managed, cloud-native ETL (Extract, Transform, Load)
and ELT service built on CDAP (Cask Data Application Platform). It enables users to
design, deploy, and manage data pipelines visually using a drag-and-drop UI.
🔹 How it Works:
1⃣ Data Ingestion → Reads data from various sources (e.g., BigQuery, Cloud Storage,
MySQL).
2️⃣ Data Transformation → Applies transformations using built-in plugins (e.g., joins,
aggregations).
3️⃣ Data Loading → Writes processed data to targets (e.g., BigQuery, Pub/Sub, GCS, Cloud
SQL).
✅ Key Features:
2. What are the key benefits of using Data Fusion over traditional ETL tools?
Ease of Use: Data Fusion provides a drag-and-drop UI for pipeline building; traditional ETL tools require coding and scripting.
Security: Data Fusion offers IAM, VPC Service Controls, and CMEK for encryption; traditional ETL tools have limited built-in security features.
✅ Why Use Data Fusion? → Scalable, serverless, easy to use, and deeply integrated
with GCP!
3. What is the underlying technology that powers Cloud Data Fusion?
✅ Cloud Data Fusion is powered by CDAP (Cask Data Application Platform), an open-
source data integration framework.
🔹 Architecture Overview:
1⃣ User designs ETL workflows using the Data Fusion UI.
2️⃣ CDAP converts the workflow into a Spark or Dataflow job.
3️⃣ The job is executed on Cloud Dataproc (for batch) or Dataflow (for streaming).
4️⃣ Processed data is written to BigQuery, GCS, or Cloud SQL.
Programming Model: No-code UI (drag & drop) versus code-based development (Python, Java, SQL).
✅ When to Use?
Basic edition: Runs on shared infrastructure with limited scalability; suited to small-scale ETL workloads.
✅ How to Choose?
Feature Details
✅ Cloud Data Fusion consists of several key components that work together to enable data
integration, transformation, and orchestration.
Component Description
7. What is the role of CDAP (Cask Data Application Platform) in Data Fusion?
✅ CDAP (Cask Data Application Platform) is the core engine of Cloud Data Fusion that
provides the platform for building and executing data pipelines.
✅ Data Fusion transforms and orchestrates data using a visual pipeline approach.
✅ A Pipeline in Data Fusion is a workflow that ingests, transforms, and loads data from
various sources to destinations.
🔹 Structure of a Pipeline
Stage Description
Source: Reads data from BigQuery, Cloud Storage, Pub/Sub, MySQL, Kafka, etc.
Transform: Applies joins, filters, aggregations, type conversions, and business logic.
Sink: Writes the processed data to destinations such as BigQuery, Cloud Storage, Cloud SQL, or Pub/Sub.
🔹 Types of Pipelines
1⃣ Batch Pipelines → Used for scheduled ETL jobs (e.g., moving data from GCS to
BigQuery).
2️⃣ Streaming Pipelines → Processes real-time data using Pub/Sub and Dataflow.
3️⃣ Hybrid Pipelines → Combines both batch & streaming for complex workflows.
✅ Wrangler, Plugins, and Connectors are essential components for data transformation
and integration in Data Fusion.
Component Description
Wrangler: A data preparation tool for cleaning, filtering, and reshaping data before processing.
🔹 Wrangler Features
🔹 Types of Plugins
🔹 Types of Connectors
Feature Description
Add a Source:
Drag and drop source plugins (e.g., BigQuery, Cloud Storage, Pub/Sub) onto the canvas.
Configure the source properties (e.g., file path, query, or subscription).
4️⃣ Apply Transformations:
Add Wrangler or transform plugins (e.g., joins, filters, aggregations) as needed.
5️⃣ Add a Sink:
Drag and drop sink plugins (e.g., BigQuery, GCS, Cloud SQL).
Configure the destination details (e.g., table name, file path).
✅ To schedule and automate a pipeline in Data Fusion, use the following methods:
You can schedule pipelines by setting up cron-like jobs using Cloud Scheduler.
Configure Cloud Scheduler to trigger your Data Fusion pipeline via an HTTP
request.
Trigger pipelines programmatically via the Data Fusion (CDAP) REST API by sending a POST request to start a deployed pipeline, as sketched below.
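A minimal sketch of that REST call, assuming a deployed batch pipeline and placeholder values for the instance's CDAP API endpoint and pipeline name; the caller's credentials need the appropriate Data Fusion permissions:
python
# Hypothetical sketch: start a deployed Data Fusion batch pipeline via the CDAP REST API.
# CDAP_ENDPOINT and PIPELINE_NAME are placeholders for your instance and pipeline.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = "https://my-instance-my-project-dot-usc1.datafusion.googleusercontent.com/api"
PIPELINE_NAME = "gcs_to_bq_pipeline"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# Deployed batch pipelines run as the DataPipelineWorkflow program of a CDAP app
url = (
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/"
    f"{PIPELINE_NAME}/workflows/DataPipelineWorkflow/start"
)
response = session.post(url)
response.raise_for_status()
print("Pipeline start requested, HTTP status:", response.status_code)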
13. What are the different types of pipeline triggers in Data Fusion?
Pipelines can be triggered manually from the Pipeline Studio interface or via the
REST API.
✅ Cloud Data Fusion handles schema evolution to ensure that changes in data structure
don’t break existing pipelines.
1⃣ Schema Discovery:
When reading data from sources like BigQuery, Cloud Storage, or JDBC, Data
Fusion can automatically infer the schema (column names, types).
Additive Schema Evolution: Data Fusion supports adding new fields to the schema
without impacting existing data pipelines.
Schema Validation: For non-additive changes (e.g., type changes), you can
configure schema validation to ensure compatibility before processing.
In Wrangler, users can manually modify schemas, rename columns, or change data
types if required.
Transformation plugins handle dynamic schema changes during the ETL process.
✅ Handling real-time streaming data in Cloud Data Fusion involves using Streaming
Pipelines and integrating with other GCP services.
Streaming plugins (like Pub/Sub Source, Dataflow for processing, and BigQuery
Sink) allow for real-time transformations and loading.
Data Fusion pipelines support at-least-once delivery and retries for real-time
streaming data.
You can configure dead-letter queues for unprocessed messages in case of failures.
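For the dead-letter queues mentioned above, the policy is attached to the underlying Pub/Sub subscription that feeds the pipeline. A minimal sketch with the google-cloud-pubsub admin client (project, topic, and subscription names are placeholders):
python
# Sketch: create a subscription with a dead-letter topic (names are placeholders)
from google.cloud import pubsub_v1

project = "my-project"
subscriber = pubsub_v1.SubscriberClient()

subscription_path = subscriber.subscription_path(project, "my-stream-sub")
topic_path = f"projects/{project}/topics/my-stream-topic"
dead_letter_topic = f"projects/{project}/topics/my-dead-letter-topic"

# The Pub/Sub service agent also needs publish rights on the dead-letter topic
# and subscribe rights on this subscription for dead-lettering to work.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "dead_letter_policy": {
            "dead_letter_topic": dead_letter_topic,
            "max_delivery_attempts": 5,
        },
    }
)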
Feature Description
Scheduling Pipelines: Use Cloud Scheduler, the Data Fusion UI, or the API for automated scheduling.
Real-Time Streaming: Create streaming pipelines and integrate with Pub/Sub and Dataflow for low-latency processing.
You can configure BigQuery as a source to read data into Data Fusion pipelines.
Use the BigQuery Source plugin to run a SQL query or specify a table for data
extraction.
Data Fusion can apply transformations like filtering, aggregation, and data reshaping
before writing the output to BigQuery.
For automated operations, use BigQuery Operators in Data Fusion to run queries,
create tables, or manage datasets.
17. How can Data Fusion be used to move data from Cloud Storage to BigQuery?
1⃣ Create a Pipeline:
2️⃣ Add a Source:
Add a Cloud Storage Source plugin to the pipeline to read data from files stored in GCS (e.g., CSV, JSON, Avro).
3️⃣ Apply Transformations (Optional):
Clean or reshape the data with Wrangler or transform plugins if needed.
4️⃣ Add a Sink:
Add a BigQuery Sink plugin to write the transformed data into a BigQuery table.
Configure parameters such as dataset name, table name, and write disposition (e.g., overwrite or append).
5️⃣ Deploy and Run:
After validation, deploy and run the pipeline to move the data from Cloud Storage to BigQuery.
18. How does Data Fusion work with Pub/Sub for streaming ingestion?
1⃣ Pub/Sub Source:
Use the Pub/Sub Source plugin in Data Fusion to consume data from Pub/Sub
topics.
Data Fusion can receive messages continuously in real-time.
Process the incoming streaming data using transformations (e.g., filter, aggregate)
within the pipeline.
For real-time processing, Data Fusion utilizes Dataflow to scale the pipeline,
ensuring the efficient handling of streaming data.
After processing, the pipeline can write the data to BigQuery, Cloud Storage, or
another destination.
Pipelines are triggered based on Pub/Sub events, so Data Fusion executes tasks
automatically as new messages arrive.
19. How do you use Data Fusion with Cloud Spanner and Cloud SQL?
✅ Cloud Data Fusion supports integrations with Cloud Spanner and Cloud SQL for ETL
operations:
1⃣ Cloud Spanner:
Source: Use the Cloud Spanner Source plugin to read data from Cloud Spanner
tables.
Sink: Use the Cloud Spanner Sink plugin to write processed data into Cloud
Spanner tables.
You can perform data transformations before writing to Cloud Spanner.
2️⃣ Cloud SQL:
Source: Use the Cloud SQL Source plugin to extract data from Cloud SQL databases (e.g., MySQL, PostgreSQL).
Sink: Use the Cloud SQL Sink plugin to load transformed data into Cloud SQL
tables.
You can apply ETL transformations to data while ingesting it into Cloud SQL.
Perform necessary data transformations between Cloud Spanner/Cloud SQL and other
systems (e.g., BigQuery, Cloud Storage).
20. How do you implement hybrid and multi-cloud data movement using Data Fusion?
✅ Hybrid and multi-cloud data movement using Data Fusion is possible by integrating
with different cloud environments:
1⃣ On-Premises Integration:
Use Data Fusion Hybrid for integrating with on-premises data systems.
Data Fusion supports connectors to on-premises databases and services, ensuring
seamless integration with hybrid cloud setups.
2️⃣ Multi-Cloud Integration:
Data Fusion supports moving data between Google Cloud, AWS, and Azure.
Use the Cloud Storage connectors and BigQuery connectors to transfer data across different cloud platforms.
Cloud Spanner, Cloud SQL, and Cloud Pub/Sub allow you to interact with cloud-
native databases across different cloud environments.
Create data pipelines that span multiple cloud platforms, using Data Fusion to
orchestrate workflows and apply transformations to ensure consistency and
reliability.
Cloud Storage: Cloud Storage Source Plugin and BigQuery Sink Plugin; moves data from Cloud Storage to BigQuery for analytics.
Cloud Spanner: Cloud Spanner Source Plugin and Cloud Spanner Sink Plugin; reads and writes data from/to Cloud Spanner.
Cloud SQL: Cloud SQL Source Plugin and Cloud SQL Sink Plugin; extracts and loads data from/to Cloud SQL.
1⃣ Use Parallelism:
Increase parallelism by configuring batch sizes and splitting data efficiently. This
allows the pipeline to process data in parallel, improving throughput.
2️⃣ Minimize Unnecessary Transformations:
Use Wrangler and pre-built transformations for optimized data processing. Avoid unnecessary transformations and ensure they are only applied when needed.
3️⃣ Optimize Batch Sizes:
Tune batch sizes for sources and sinks to handle data in optimal chunks. Large batches reduce the overhead of reading/writing data multiple times.
4️⃣ Tune Streaming Pipelines:
For streaming pipelines, leverage Dataflow for scalable processing. Adjust the window size and trigger intervals to balance speed and accuracy.
5️⃣ Use High-Performance Connectors:
Use high-performance connectors like BigQuery, Cloud Storage, and Pub/Sub for faster data transfers and integration.
6️⃣ Profile and Monitor:
Use monitoring tools to profile and analyze the execution times of each transformation step to identify bottlenecks.
22. What IAM roles and permissions are required for Data Fusion?
✅ IAM roles and permissions for Cloud Data Fusion depend on the required actions:
Data Fusion Admin: Full access to Data Fusion, including creating, managing, and
deploying pipelines.
Data Fusion Developer: Can create, edit, and deploy pipelines, but without full
admin access.
Data Fusion Viewer: Read-only access to view pipelines and monitoring data.
Cloud Data Fusion Service Account: Required for running pipelines. Needs roles
like:
o roles/datafusion.admin
o roles/storage.objectAdmin (for GCS access)
o roles/bigquery.dataEditor (for BigQuery access)
o roles/pubsub.subscriber (for Pub/Sub access)
23. How does Data Fusion ensure data security and encryption?
1⃣ Encryption at Rest:
Data Fusion uses Google Cloud encryption to ensure data is encrypted while stored
on disk. This includes encryption for data in BigQuery, Cloud Storage, Spanner,
etc.
2️⃣ Encryption in Transit:
All data transferred between services (e.g., Cloud Storage, BigQuery) and Cloud Data Fusion is encrypted using TLS (Transport Layer Security) to prevent unauthorized access.
3️⃣ IAM Access Control:
Access control through IAM ensures that only authorized users and services can access data in Data Fusion.
4️⃣ Secure External Connectivity:
For external systems, Data Fusion allows using secure OAuth, API keys, and VPC Service Controls for encrypted and secure communication.
5️⃣ Data Masking and Audit Logging:
Sensitive data can be masked or encrypted during the transformation process, and you can enable audit logging to track who accessed or modified data.
24. How do you monitor and debug issues in Data Fusion pipelines?
1⃣ Pipeline Monitoring:
Use the Pipeline Monitoring UI to monitor pipeline execution, check for failed jobs,
and inspect logs.
You can visualize step-by-step pipeline performance, identify slow steps, and
troubleshoot data flow.
2️⃣ Cloud Logging:
Cloud Logging (formerly Stackdriver) integrates with Data Fusion, so pipeline logs are stored centrally for logging and debugging.
Check logs for errors or exceptions during execution to diagnose issues.
Configure alerts and notifications in Data Fusion to be notified when a pipeline fails
or encounters issues.
Track metrics like job success rates, data processed, and latency for each pipeline
stage to identify bottlenecks.
Enable debug mode for pipelines to view detailed logs and error messages for each
transformation or operation step.
25. What are the key cost factors in Data Fusion, and how can they be optimized?
Cloud Storage costs depend on the amount of data processed and stored by Data
Fusion pipelines.
Minimize storage costs by cleaning up unnecessary data and optimizing data
partitions.
Egress charges: Moving data across regions (e.g., from Cloud Storage to BigQuery)
can incur egress costs. Optimize data movement by performing transformations in the
same region.
Running pipelines too frequently or with inefficient scheduling can increase costs.
Ensure that pipelines are scheduled appropriately based on the business needs.
Use small batch sizes, right-size clusters, and selective execution to minimize
overuse of resources.
Cost Optimization: Reduce pipeline execution and storage costs, and optimize data transfer.