
Azure Databricks Mind Map (Detailed)

**1. CORE PLATFORM & ARCHITECTURE**


1.1. **Unified Analytics Platform**
1.1.1. Built on Apache Spark
1.1.2. Optimized for Azure Cloud
1.1.3. Collaboration-focused (Notebooks, Workspaces)
1.2. **Databricks Runtime (DBR)**
1.2.1. Standard DBR (includes Spark, Scala, Python, R, common libraries)
1.2.2. DBR for Machine Learning (ML) (includes MLflow, TensorFlow, PyTorch,
Keras, XGBoost, etc.)
1.2.3. DBR for Genomics (specialized libraries)
1.2.4. Photon Engine (native vectorized query engine written in C++; faster
than the standard Spark engine)
1.3. **Workspace**
1.3.1. Centralized environment for users
1.3.2. Folders & Objects (Notebooks, Libraries, Experiments, Models,
Dashboards, Queries, Alerts)
1.3.3. Access Control (Permissions on objects)
1.3.4. Repos (Git integration for version control)
1.4. **Compute Resources**
1.4.1. **Clusters**
1.4.1.1. All-Purpose Clusters (Interactive, development, ad-hoc
analysis)
1.4.1.1.1. Configuration (Node types, Autoscaling, Autotermination,
Spark config)
1.4.1.1.2. Libraries (Cluster-scoped, Notebook-scoped)
1.4.1.1.3. Init Scripts
1.4.1.1.4. Cluster Policies (restrict configurations)
1.4.1.1.5. Pools (reduce cluster start times by maintaining idle
instances)
1.4.1.2. Job Clusters (Automated workloads, cost-effective)
1.4.1.2.1. Tied to a specific job
1.4.1.2.2. Terminates when job completes
1.4.1.3. High Concurrency Clusters (shared by many users; fine-grained
resource sharing; SQL, Python, R)
1.4.1.3.1. Table Access Control (Legacy)
1.4.1.3.2. Credential Passthrough (Legacy)
1.4.2. **SQL Warehouses (formerly SQL Endpoints)**
1.4.2.1. Optimized for SQL workloads (BI, Analytics)
1.4.2.2. Types:
1.4.2.2.1. Classic (Managed by Databricks in your account)
1.4.2.2.2. Pro (More features, better performance)
1.4.2.2.3. Serverless (Managed by Databricks in Databricks'
account, instant start-up)
1.4.2.3. Sizing (T-Shirt sizes)
1.4.2.4. Auto-scaling & Auto-stop
1.4.2.5. Load Balancing
1.4.2.6. Photon Enabled by default
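The cluster options above (node types, autoscaling, auto-termination, tags) map directly onto the payload of the Clusters REST API. A minimal sketch, assuming a hypothetical cluster name and team tag; the DBR version and VM type shown are illustrative values, not recommendations:

```python
# Hedged sketch: a cluster spec for the Databricks Clusters API
# (POST /api/2.0/clusters/create). All concrete values are illustrative.

def cluster_spec(name, min_workers=2, max_workers=8):
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",   # a Databricks Runtime version
        "node_type_id": "Standard_DS3_v2",     # an Azure VM type
        "autoscale": {"min_workers": min_workers,
                      "max_workers": max_workers},
        "autotermination_minutes": 30,         # terminate when idle
        "custom_tags": {"team": "analytics"},  # supports cost tracking by tag
    }
```

Cluster Policies can then restrict which of these fields users may set, which is the main lever for controlling cost.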

**2. DATA MANAGEMENT & STORAGE**


2.1. **Delta Lake** (Open-source storage layer bringing reliability to data
lakes)
2.1.1. ACID Transactions
2.1.2. Schema Enforcement & Evolution
2.1.3. Time Travel (Data Versioning, Rollbacks)
2.1.4. Unified Batch & Streaming
2.1.5. Upserts & Deletes (MERGE command)
2.1.6. OPTIMIZE & Z-ORDER (Performance tuning)
2.1.7. Vacuum (Clean up old files)
2.1.8. Change Data Feed (CDF)
2.1.9. Generated Columns
2.1.10. Constraints (NOT NULL, CHECK)
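The Delta Lake features above are mostly exercised through SQL. A hedged sketch of the common operations, with the SparkSession passed in so the code stays runnable outside a notebook; the table and column names (`events`, `updates`, `id`) are hypothetical:

```python
# Hedged sketch of routine Delta Lake operations, issued as SQL through a
# SparkSession (in a Databricks notebook, `spark` is predefined).

def delta_roundtrip(spark, table="events"):
    # Upserts & deletes: apply staged changes with MERGE
    spark.sql(f"""
        MERGE INTO {table} AS t
        USING updates AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
    # Time travel: query an earlier version of the table
    spark.sql(f"SELECT * FROM {table} VERSION AS OF 12")
    # Performance tuning: compact files and co-locate data on a filter column
    spark.sql(f"OPTIMIZE {table} ZORDER BY (id)")
    # Clean up files older than the retention window (7 days here)
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
```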
2.2. **Unity Catalog** (Fine-grained governance for data & AI on the Lakehouse)
2.2.1. Centralized Metadata Store (Metastore)
2.2.2. Data Discovery & Search
2.2.3. Fine-grained Access Control (SQL GRANT/REVOKE on catalogs, schemas,
tables, views, functions)
2.2.4. Data Lineage (Column-level)
2.2.5. Auditing
2.2.6. Data Sharing (Delta Sharing integration)
2.2.7. Privileges: `USE CATALOG`, `USE SCHEMA`, `SELECT`, `MODIFY`, `CREATE
TABLE`, etc.
2.2.8. Managed vs. External Tables
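The privileges listed in 2.2.7 compose into a typical read-only grant set. A sketch assuming a hypothetical group `analysts` and hypothetical `main.sales.orders` objects; each statement would be executed via `spark.sql(...)` or the SQL editor:

```python
# Hedged sketch: minimal Unity Catalog grants for read access to one table.
# Principal, catalog, schema, and table names are all placeholders.

def read_grants(principal="analysts", catalog="main",
                schema="sales", table="orders"):
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]
```

Note that `SELECT` alone is not enough: the principal also needs `USE CATALOG` and `USE SCHEMA` on the containing objects.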
2.3. **Data Sources & Sinks**
2.3.1. Azure Data Lake Storage (ADLS Gen2) - Primary
2.3.2. Azure Blob Storage
2.3.3. Azure SQL Database / Managed Instance
2.3.4. Azure Synapse Analytics
2.3.5. Azure Cosmos DB
2.3.6. Azure Event Hubs / Kafka
2.3.7. Other Cloud Storages (S3, GCS via connectors)
2.3.8. JDBC/ODBC sources
2.3.9. Various file formats (Parquet, ORC, CSV, JSON, Avro, Text, Binary)
2.4. **DBFS (Databricks File System)**
2.4.1. Abstraction layer over cloud storage
2.4.2. Mounting cloud storage (ADLS Gen2, Blob)
2.4.3. Staging files, libraries, init scripts
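Mounting ADLS Gen2 (2.4.2) is usually done with a service principal over OAuth. A hedged sketch with every identifier a placeholder; `dbutils` exists only inside a notebook, so it is passed in here:

```python
# Hedged sketch: mount an ADLS Gen2 container on DBFS using a service
# principal. Account, container, app, and tenant IDs are placeholders; the
# client secret is read from a secret scope rather than hard-coded.

def mount_adls(dbutils, account="mystorageacct", container="raw",
               client_id="<app-id>", tenant_id="<tenant-id>",
               secret_scope="kv-scope", secret_key="sp-secret"):
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope=secret_scope, key=secret_key),
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    dbutils.fs.mount(
        source=f"abfss://{container}@{account}.dfs.core.windows.net/",
        mount_point=f"/mnt/{container}",
        extra_configs=configs,
    )
```

With Unity Catalog, external locations and volumes are the preferred replacement for mounts, but mounts remain common in existing workspaces.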

**3. WORKLOADS & USE CASES**


3.1. **Data Engineering (ETL/ELT)**
3.1.1. Apache Spark APIs (DataFrame, SQL)
3.1.2. Delta Live Tables (DLT)
3.1.2.1. Declarative ETL pipelines
3.1.2.2. Data quality checks (expectations)
3.1.2.3. Automatic dependency management
3.1.2.4. Continuous or triggered execution
3.1.2.5. Auto-scaling & error handling
3.1.3. Structured Streaming (Real-time data processing)
3.1.4. Job Orchestration (Databricks Jobs, Azure Data Factory)
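The declarative style of Delta Live Tables (3.1.2) can be sketched as follows. Since `import dlt` works only inside a DLT pipeline, the definitions are wrapped in a function that receives the module; the table names, expectation, and path are hypothetical:

```python
# Hedged sketch of a two-table DLT pipeline. In a real pipeline notebook the
# decorated functions sit at module level after `import dlt`; dependencies
# between tables are declared with dlt.read(...) and resolved automatically.

def define_pipeline(dlt, spark):
    @dlt.table(comment="Raw events loaded from cloud storage")
    def raw_events():
        return spark.read.format("json").load("/mnt/raw/events")

    @dlt.table(comment="Events that passed basic quality checks")
    @dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # quality expectation
    def clean_events():
        return dlt.read("raw_events")
```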
3.2. **Machine Learning & AI**
3.2.1. **MLflow Integration (End-to-end MLOps)**
3.2.1.1. MLflow Tracking (Log parameters, metrics, artifacts)
3.2.1.2. MLflow Projects (Package code for reproducibility)
3.2.1.3. MLflow Models (Package models for serving, manage flavors)
3.2.1.4. MLflow Model Registry (Centralized model store, versioning,
staging)
3.2.2. **Feature Store**
3.2.2.1. Centralized repository for features
3.2.2.2. Feature discovery, sharing, reuse
3.2.2.3. Online & Offline serving
3.2.2.4. Automatic feature computation
3.2.3. **Model Training**
3.2.3.1. Distributed training (Horovod, spark-tensorflow-distributor)
3.2.3.2. Hyperparameter tuning (Hyperopt)
3.2.3.3. AutoML (Automated model selection and tuning)
3.2.4. **Model Serving**
3.2.4.1. Serverless Real-Time Inference (Managed, auto-scaling
endpoints)
3.2.4.2. Classic MLflow Model Serving (Cluster-based)
3.2.4.3. Batch inference (using Spark jobs)
3.2.5. Popular ML Libraries (TensorFlow, PyTorch, scikit-learn, XGBoost,
LightGBM)
3.2.6. Responsible AI (Bias detection, explainability - e.g., SHAP)
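MLflow Tracking (3.2.1.1) revolves around runs that record parameters, metrics, and artifacts. A minimal sketch with the `mlflow` module injected so it stays runnable anywhere; the run name, parameters, and metric value are made up for illustration:

```python
# Hedged sketch of MLflow experiment tracking. In a Databricks notebook you
# would `import mlflow` directly; runs then appear in the workspace Experiments UI.

def log_training_run(mlflow, n_estimators=200, max_depth=6):
    with mlflow.start_run(run_name="xgboost-baseline"):
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        # ... train the model here ...
        mlflow.log_metric("val_auc", 0.87)  # illustrative value
```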
3.3. **Data Science & Analytics**
3.3.1. Interactive Notebooks (Python, Scala, SQL, R)
3.3.2. Data Visualization (built-in, Matplotlib, Seaborn, Plotly)
3.3.3. Pandas API on Spark (formerly Koalas; familiar Pandas syntax on
distributed data)
3.4. **Databricks SQL (Data Warehousing & BI)**
3.4.1. SQL Editor (Querying, visualizations, dashboards)
3.4.2. SQL Warehouses (Compute for SQL queries)
3.4.3. Dashboards & Visualizations
3.4.4. Alerts (Notify on data conditions)
3.4.5. Query History & Profiles
3.4.6. BI Tool Connectors (Power BI, Tableau, Looker, etc.)
3.4.7. ANSI SQL compliance

**4. DEVELOPMENT & INTERFACES**


4.1. **Notebooks**
4.1.1. Multi-language support (Python, SQL, Scala, R)
4.1.2. Co-authoring & commenting
4.1.3. Built-in visualizations
4.1.4. Version control (internal, Git via Repos)
4.1.5. Parameterization (Widgets)
4.1.6. Dashboarding from notebooks
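Parameterization via widgets (4.1.5) lets a scheduled job pass values into a notebook. A hedged sketch with `dbutils` injected; the widget name, default, and table are hypothetical:

```python
# Hedged sketch: a text widget supplying a query parameter. In a notebook,
# dbutils.widgets.text(name, default, label) renders an input at the top of
# the page, and Jobs can override the value at run time.

def run_with_params(dbutils):
    dbutils.widgets.text("start_date", "2024-01-01", "Start date")
    start_date = dbutils.widgets.get("start_date")
    return f"SELECT * FROM sales WHERE order_date >= '{start_date}'"
```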
4.2. **Databricks Repos**
4.2.1. Git integration (Azure DevOps, GitHub, GitLab, Bitbucket)
4.2.2. Branching, committing, pulling, merging
4.2.3. CI/CD for notebooks and code
4.3. **Databricks CLI**
4.3.1. Command-line interface for managing workspace, clusters, jobs, etc.
4.4. **REST APIs**
4.4.1. Programmatic control over Databricks resources
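All REST calls share the same shape: workspace URL, bearer token, JSON body. A sketch that builds (but does not send) such a request; the host and token shown in the usage note are placeholders:

```python
import json
import urllib.request

# Hedged sketch: construct a request against the Databricks REST API 2.0.
# Sending it requires a real workspace hostname and a personal access token.

def databricks_request(host, token, path, payload=None):
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/{path}",
        data=data,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST" if payload is not None else "GET",
    )
    # To execute: urllib.request.urlopen(req)
```

For example, `databricks_request("adb-123.azuredatabricks.net", token, "clusters/list")` would list clusters; in practice the Databricks SDK or CLI wraps these calls for you.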
4.5. **IDE Connectors**
4.5.1. Databricks Connect (Run Spark code from IDEs like VS Code, IntelliJ,
PyCharm)
4.5.2. VS Code Extension
4.6. **Libraries Management**
4.6.1. Cluster-installed libraries
4.6.2. Notebook-scoped libraries (%pip, %conda)
4.6.3. Workspace libraries
4.6.4. Custom JARs, Python eggs/wheels

**5. MANAGEMENT, OPERATIONS & GOVERNANCE**


5.1. **Security**
5.1.1. **Identity & Access Management (IAM)**
5.1.1.1. Azure Active Directory (AAD) Integration (SSO, User
Provisioning via SCIM)
5.1.1.2. Role-Based Access Control (RBAC) (Workspace admins, users,
service principals)
5.1.1.3. Table Access Control (TALC - Legacy, for Hive metastore on
High Concurrency clusters)
5.1.1.4. Unity Catalog for fine-grained data access
5.1.1.5. Secrets Management (Databricks-backed, Azure Key Vault-backed)
5.1.2. **Network Security**
5.1.2.1. VNet Injection (Deploy Databricks in your VNet)
5.1.2.2. No Public IP (NPIP) / Private Link (for front-end and back-end
connections)
5.1.2.3. Network Security Groups (NSGs)
5.1.2.4. Firewall (UDRs for egress control)
5.1.3. **Data Encryption**
5.1.3.1. At Rest (DBFS with platform-managed keys or customer-managed
keys for workspace storage)
5.1.3.2. In Transit (TLS/SSL)
5.1.3.3. Managed Disks Encryption (Platform-managed or CMK)
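Secrets management (5.1.1.5) keeps credentials out of notebooks. A hedged sketch of reading a password from an Azure Key Vault-backed secret scope for a JDBC connection; scope, key, server, and login names are all placeholders, and `dbutils` is injected because it exists only in a notebook:

```python
# Hedged sketch: build JDBC options with the password pulled from a secret
# scope instead of being hard-coded. With a Key Vault-backed scope, the value
# is fetched from Azure Key Vault at read time.

def jdbc_options(dbutils, server="myserver.database.windows.net",
                 database="sales"):
    password = dbutils.secrets.get(scope="akv-scope", key="sql-password")
    return {
        "url": f"jdbc:sqlserver://{server}:1433;database={database}",
        "user": "etl_user",    # hypothetical SQL login
        "password": password,  # never commit or print this value
    }
```

Note that secret values are redacted if echoed in notebook output.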
5.2. **Monitoring & Logging**
5.2.1. Spark UI
5.2.2. Ganglia Metrics (for older DBRs) / Databricks Metrics
5.2.3. Driver & Worker Logs
5.2.4. Azure Monitor Integration (Diagnostic logs, metrics)
5.2.5. Audit Logs (Workspace activities, Unity Catalog events)
5.3. **Cost Management**
5.3.1. Databricks Units (DBUs) - pricing based on VM type and DBR
5.3.2. Cluster auto-scaling & auto-termination
5.3.3. Job clusters vs. All-purpose clusters
5.3.4. Spot instances
5.3.5. Cluster policies (to control costs)
5.3.6. Cost tracking using Tags
5.3.7. Azure Cost Management integration
5.4. **Automation & Orchestration**
5.4.1. Databricks Jobs (Scheduling notebooks, JARs, Python scripts)
5.4.1.1. Triggers (Scheduled, Continuous, File Arrival, API)
5.4.1.2. Task Orchestration (Linear, DAGs)
5.4.1.3. Repair and Rerun
5.4.1.4. Job parameters
5.4.1.5. Notifications (Email, Webhooks)
5.4.2. Azure Data Factory (ADF) Integration
5.4.3. Terraform / ARM Templates for IaC
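The Jobs features above (task DAGs, schedules, notifications) correspond to fields in the Jobs API 2.1 payload. A sketch with hypothetical notebook paths, task keys, and recipient address; the cron expression is illustrative:

```python
# Hedged sketch: a two-task job definition for the Jobs API 2.1
# (POST /api/2.1/jobs/create). The depends_on entry forms a simple DAG.

def job_spec():
    return {
        "name": "nightly-etl",
        "tasks": [
            {"task_key": "ingest",
             "notebook_task": {"notebook_path": "/Repos/etl/ingest"}},
            {"task_key": "transform",
             "depends_on": [{"task_key": "ingest"}],
             "notebook_task": {"notebook_path": "/Repos/etl/transform"}},
        ],
        "schedule": {"quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
                     "timezone_id": "UTC"},
        "email_notifications": {"on_failure": ["data-team@example.com"]},
    }
```

The same structure can be expressed in Terraform (`databricks_job`) for the IaC approach mentioned in 5.4.3.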
5.5. **Compliance & Certifications**
5.5.1. HIPAA, SOC 2 Type II, ISO 27001, PCI DSS, etc. (Check Azure
compliance docs)

**6. INTEGRATION WITH AZURE ECOSYSTEM**


6.1. Azure Data Lake Storage (ADLS Gen2)
6.2. Azure Blob Storage
6.3. Azure Active Directory (AAD)
6.4. Azure Monitor
6.5. Azure Key Vault
6.6. Azure Data Factory (ADF)
6.7. Azure Synapse Analytics
6.8. Power BI
6.9. Azure Machine Learning
6.10. Azure Event Hubs & IoT Hub
6.11. Azure DevOps (for CI/CD)
6.12. Azure Private Link

**7. PRICING & TIERS**


7.1. **Pricing Tiers**
7.1.1. Standard
7.1.2. Premium (includes RBAC, Audit Logs, Unity Catalog, etc.)
7.1.3. Enterprise (Enhanced security options)
7.2. **DBU Consumption** (Varies by VM type, DBR type, Photon usage)
7.3. **Compute Costs** (Azure VMs)
7.4. **Storage Costs** (ADLS Gen2, Blob)
7.5. **SQL Warehouse Pricing** (DBUs based on size and type - Classic, Pro,
Serverless)
7.6. Pay-as-you-go vs. Reserved Instances / Pre-purchase plans

**8. KEY BENEFITS**


8.1. Increased Productivity
8.2. Performance & Scalability
8.3. Unified Data & AI Platform
8.4. Collaboration
8.5. Open Source Foundation (Spark, Delta Lake, MLflow)
8.6. Enterprise-grade Security & Governance
8.7. Simplified MLOps
8.8. Cost Efficiency (when managed well)

**9. CONSIDERATIONS & BEST PRACTICES**


9.1. Cost Optimization Strategies
9.2. Security Best Practices (VNet, IAM, Key Vault)
9.3. Performance Tuning (Partitioning, Z-Ordering, Caching, Photon)
9.4. Choosing Correct Cluster Types & Sizes
9.5. Effective Use of Unity Catalog for Governance
9.6. CI/CD Implementation for Notebooks and Jobs
9.7. Monitoring and Alerting Setup
9.8. Training & Skill Development for Teams
