The document outlines the details of an internship at Niveus Solutions, focusing on cloud engineering services and technologies such as BigQuery and Dataplex. It covers the internship duration, objectives, and implementation steps for integrating data management solutions. Key takeaways emphasize the importance of data governance, metadata management, and automation in enhancing data ecosystems.
Internship
Internship Details
PRESENTED BY: P Alekhya (4NM21IS100)
Faculty: Ms. Anusha N., Assistant Professor Gd-III
Topics for Discussion
o Internship Certificate and Company Details
o Internship Duration
o Introduction
o Objectives
o Technologies Used
o Problem Definition
o Implementation
o Conclusion
o References
Internship Certificate and Company Details
o Niveus Solutions, founded in 2013, is a cloud engineering services company.
o Specializing in application, infrastructure, and data modernization, cloud consulting, and security, Niveus became an exclusive partner of Google Cloud India in 2019.
Internship Duration
o Internship Duration: Dec 20, 2024 – Feb 21, 2025
o Work Mode: Work from Office in Mangalore
o Role: Customer Engineer Intern
Introduction
Data Warehouse: A centralized system for storing and analyzing large volumes of structured data, used for reporting and business intelligence.
BigQuery: Google Cloud’s serverless and highly scalable data warehouse that enables fast SQL queries and built-in machine learning.
Dataplex: A Google Cloud tool for unified data management and governance across data lakes and warehouses, integrating seamlessly with BigQuery.
Technologies Used
The following technologies are utilized:
BigQuery - A fully managed data warehouse.
Dataplex - A data fabric solution for governance and security.
Docker - Containerization platform for deploying applications.
DataHub - A metadata platform for data governance.
Google Cloud Run Functions - Managed service to deploy and run containerized applications.
BigQuery & Schema Overview
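A minimal sketch of a BigQuery table definition with a structured schema, partitioning, and clustering. It only renders the `CREATE TABLE` statement as text and does not call BigQuery. The table `my_project.sales.orders`, its columns, and the helper `create_table_ddl` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: str               # a BigQuery type such as STRING, NUMERIC, TIMESTAMP
    mode: str = "NULLABLE"  # "REQUIRED" is rendered as NOT NULL


def create_table_ddl(table, columns, partition_by, cluster_by):
    """Render a BigQuery CREATE TABLE statement with partitioning and clustering."""
    col_lines = []
    for col in columns:
        suffix = " NOT NULL" if col.mode == "REQUIRED" else ""
        col_lines.append(f"  {col.name} {col.type}{suffix}")
    return (
        f"CREATE TABLE `{table}` (\n"
        + ",\n".join(col_lines)
        + "\n)\n"
        + f"PARTITION BY {partition_by}\n"
        + f"CLUSTER BY {', '.join(cluster_by)}"
    )


# Hypothetical table and columns, used only for illustration.
ORDERS_COLUMNS = [
    Column("order_id", "STRING", "REQUIRED"),
    Column("customer_id", "STRING"),
    Column("amount", "NUMERIC"),
    Column("created_at", "TIMESTAMP", "REQUIRED"),
]

ddl = create_table_ddl(
    "my_project.sales.orders",
    ORDERS_COLUMNS,
    partition_by="DATE(created_at)",
    cluster_by=["customer_id"],
)
print(ddl)
```

Partitioning by day lets queries that filter on `created_at` scan fewer partitions, and clustering on `customer_id` co-locates rows that are usually filtered or grouped together.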
BigQuery is a serverless data warehouse that allows users to:
Store and query large datasets efficiently.
Define structured schemas with columns and data types.
Use partitioning and clustering for performance optimization.
Dataplex Overview
Dataplex is a unified data management solution that helps manage data across multiple storage systems.
Provides automated governance and security policies.
Enables data discovery and quality checks across warehouses and lakes.
Data Lineage
Data lineage tracks the origin, transformations, and movement of data.
Helps in understanding dependencies across datasets.
Enhances compliance and auditability in data governance.
Dataplex provides lineage tracking via BigQuery and Data Catalog integration.
Data Profiling
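As a rough illustration of what a profiling pass computes, the sketch below summarizes each column of a small in-memory table: null counts, distinct values, range, and most frequent value. It is plain Python, not the Dataplex implementation. The sample rows and the helpers `profile_column` and `profile_table` are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, range, top value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    nulls = len(values) - len(non_null)
    return {
        "rows": len(values),
        "nulls": nulls,
        "null_ratio": nulls / len(values) if values else 0.0,
        "distinct": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows, used only for illustration.
SAMPLE_ROWS = [
    {"order_id": "A1", "amount": 120, "city": "Mangalore"},
    {"order_id": "A2", "amount": None, "city": "Mangalore"},
    {"order_id": "A3", "amount": 75, "city": "Udupi"},
    {"order_id": "A4", "amount": 300, "city": None},
]

profile = profile_table(SAMPLE_ROWS)
for column, stats in profile.items():
    print(column, stats)
```

A high `null_ratio` or an unexpected `min`/`max` is the kind of signal that would prompt a data quality rule on that column.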
Data profiling involves analyzing datasets to extract useful insights.
Identifies data distributions, missing values, and anomalies.
Helps in improving data quality and consistency.
Enables metadata-driven decision making for data governance.
Data Quality Scan
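To make the rule-and-threshold idea concrete, here is a minimal sketch in plain Python. Each rule is a row-level check plus the minimum fraction of rows that must pass. This is an illustration of the concept, not how Dataplex runs its scans. The rows, the rule names, and the helpers `Rule`, `run_scan`, and `unique_rule` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when a row passes
    threshold: float               # minimum fraction of rows that must pass


def run_scan(rows, rules):
    """Evaluate each rule over all rows and compare the pass ratio to its threshold."""
    results = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row))
        ratio = passed / len(rows) if rows else 1.0
        results[rule.name] = {"pass_ratio": ratio, "ok": ratio >= rule.threshold}
    return results


def unique_rule(rows, column, name):
    """Build a uniqueness rule: a row passes if its value appears exactly once."""
    seen = {}
    for row in rows:
        seen[row[column]] = seen.get(row[column], 0) + 1
    return Rule(name, lambda row: seen[row[column]] == 1, threshold=1.0)


# Hypothetical rows with one missing value, one duplicate key, one negative amount.
ROWS = [
    {"order_id": "A1", "amount": 120},
    {"order_id": "A2", "amount": None},
    {"order_id": "A2", "amount": 75},
    {"order_id": "A4", "amount": -5},
]

RULES = [
    Rule("amount_not_null", lambda r: r["amount"] is not None, threshold=0.9),
    Rule("amount_non_negative",
         lambda r: r["amount"] is None or r["amount"] >= 0, threshold=0.7),
    unique_rule(ROWS, "order_id", "order_id_unique"),
]

report = run_scan(ROWS, RULES)
for name, outcome in report.items():
    print(name, outcome)
```

Separating the check from its threshold lets the same rule be strict on one table and tolerant on another.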
A data quality scan ensures clean, reliable, and accurate data.
Uses rules and thresholds to detect errors in datasets.
Validates schema consistency and checks for duplicates and missing values.
Provides automated monitoring for continuous data integrity.
Scheduled Query
Scheduled queries automate data transformations and reporting.
Run SQL queries at predefined intervals.
Optimize ETL workflows without manual intervention.
Improve data availability by keeping tables updated.
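A scheduled query is essentially a SQL statement plus a schedule and a destination. The sketch below assembles such a configuration as a plain dictionary. The field names mirror my understanding of the BigQuery Data Transfer Service scheduled-query settings and should be checked against the current documentation. The SQL, the dataset `reporting`, the table `daily_sales`, and the helper `scheduled_query_config` are hypothetical. Nothing here contacts Google Cloud.

```python
import json

# Hypothetical daily aggregation; table and column names are placeholders.
# @run_date is the run-date parameter that scheduled queries make available.
DAILY_SALES_SQL = """
SELECT
  DATE(created_at) AS order_date,
  customer_id,
  SUM(amount) AS total_amount
FROM `my_project.sales.orders`
WHERE DATE(created_at) = DATE_SUB(@run_date, INTERVAL 1 DAY)
GROUP BY order_date, customer_id
""".strip()


def scheduled_query_config(display_name, dataset_id, table_template, query, schedule):
    """Return a dict shaped like a scheduled-query transfer configuration."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return {
        "display_name": display_name,
        "data_source_id": "scheduled_query",
        "destination_dataset_id": dataset_id,
        "schedule": schedule,
        "params": {
            "query": query,
            "destination_table_name_template": table_template,
            "write_disposition": "WRITE_APPEND",
        },
    }


config = scheduled_query_config(
    display_name="daily_sales_rollup",
    dataset_id="reporting",
    table_template="daily_sales",
    query=DAILY_SALES_SQL,
    schedule="every 24 hours",
)
print(json.dumps(config, indent=2))
```

Appending one day of results per run keeps the reporting table current without reprocessing history.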
DataHub Overview
DataHub is a modern metadata platform for data governance.
Provides real-time metadata search and discovery.
Enables data lineage tracking and impact analysis.
Supports metadata ingestion from multiple sources, including BigQuery, MySQL, and Kafka.
Features of DataHub
Key features of DataHub include:
Metadata Search - Locate datasets instantly.
Data Lineage - Understand data dependencies.
Access Control - Manage permissions and security.
Schema Evolution - Track changes over time.
Data Ingestion - Connect multiple data sources.
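Lineage and impact analysis both come down to walking a dependency graph. Downstream traversal answers "what breaks if this dataset changes?" and upstream traversal answers "where did this data come from?". The sketch below shows the idea in plain Python. It does not use the DataHub API, and the dataset names and edges are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream dataset, downstream dataset).
EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("raw.customers", "staging.customers_clean"),
    ("staging.orders_clean", "reporting.daily_sales"),
    ("staging.customers_clean", "reporting.daily_sales"),
    ("reporting.daily_sales", "dashboards.revenue"),
]


def build_graph(edges):
    """Index the edges in both directions for upstream and downstream lookups."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, neighbors):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)

# Impact analysis: everything affected by a change to raw.orders.
print(sorted(reachable("raw.orders", downstream)))

# Lineage: every source that feeds reporting.daily_sales.
print(sorted(reachable("reporting.daily_sales", upstream)))
```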
Implementation
Steps to integrate BigQuery datasets into DataHub:
1. Configure Dataplex for data governance and profiling.
2. Set up BigQuery datasets and scheduled queries.
3. Deploy DataHub using Docker and ingestion pipelines.
4. Automate data lineage and quality scans.
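Step 3 can be sketched as an ingestion recipe: a source section that points at BigQuery and a sink section that points at the DataHub server. The example below builds that structure as a Python dictionary and prints it as JSON. It is a sketch under assumptions: the project ID `my_project` and the server URL `http://localhost:8080` are placeholders, and the config keys (`project_ids`, `include_table_lineage`, `profiling`) follow my reading of the DataHub BigQuery source and should be verified against the installed DataHub version.

```python
import json


def bigquery_recipe(project_id, datahub_url):
    """Build a DataHub ingestion recipe for one BigQuery project (assumed keys)."""
    return {
        "source": {
            "type": "bigquery",
            "config": {
                "project_ids": [project_id],      # placeholder project
                "include_table_lineage": True,    # assumed option name
                "profiling": {"enabled": True},   # assumed option name
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_url},
        },
    }


recipe = bigquery_recipe("my_project", "http://localhost:8080")
print(json.dumps(recipe, indent=2))
```

Recipes are normally written as YAML files and run with `datahub ingest -c <recipe file>` once a DataHub instance is up, for example one started locally with `datahub docker quickstart`.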
Google Cloud Run Functions
Cloud Run hosts containerized services that can query BigQuery using its client libraries.
It enables on-demand data processing or analytics via HTTP triggers (e.g., from apps or schedulers).
It is ideal for automated workflows, real-time dashboards, or integrating BigQuery insights into APIs.
Conclusion
Key takeaways:
1. Dataplex enhances data governance and security.
2. DataHub improves metadata management and discovery.
3. Integration of both leads to a more efficient data ecosystem.
4. Querying BigQuery datasets from Cloud Run functions enables on-demand processing and API integration.
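The last takeaway, an HTTP-triggered function that queries BigQuery, can be sketched as follows. To keep the example runnable without Google Cloud credentials, the query function is injected and replaced by a stand-in. In a real deployment it would presumably wrap the BigQuery client library (for example `google.cloud.bigquery.Client().query(...)` with query parameters). The table name, the `city` parameter, and the helpers `make_handler` and `fake_run_query` are hypothetical.

```python
import json


def make_handler(run_query):
    """Build an HTTP-style handler around an injected query function."""
    def handler(params):
        city = params.get("city")
        if not city:
            return 400, json.dumps({"error": "missing 'city' parameter"})
        # A named parameter (@city) keeps user input out of the SQL text.
        sql = (
            "SELECT city, SUM(amount) AS total_amount "
            "FROM `my_project.sales.orders` "
            "WHERE city = @city GROUP BY city"
        )
        rows = run_query(sql, {"city": city})
        return 200, json.dumps({"rows": rows})
    return handler


def fake_run_query(sql, query_params):
    """Stand-in for BigQuery so the sketch runs without credentials."""
    data = {"Mangalore": 420, "Udupi": 75}
    city = query_params["city"]
    return [{"city": city, "total_amount": data[city]}] if city in data else []


handler = make_handler(fake_run_query)
print(handler({"city": "Mangalore"}))
print(handler({}))
```

Injecting `run_query` keeps the request handling testable on its own and confines the BigQuery dependency to one function.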