Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"
Data Warehouse -> Data Lake -> Data Lakehouse.

Data architectures have evolved over time, depending on the type of workload they need to serve. Today, more and more organizations are considering and adopting open lakehouse architectures.

The reason is actually quite simple! To start with:
✅ customers have the flexibility to store data in open storage formats (table + file)
✅ every component is modular, which means the flexibility to bring in the best tools/software
✅ customers own and control their cloud storage (such as an S3 bucket, MinIO, etc.)
✅ they can work on the same data with multiple compute engines (BI, streaming, ML use cases)

These aspects have resonated with orgs struggling with problems like:
❌ increasing storage & compute costs
❌ being unable to manage multiple data copies
❌ needing to maintain a 2-tier architecture (data warehouse + data lake), among other pains

The modularity ("de-bundled database") of a lakehouse is probably one of the most attractive reasons to adopt one. It lets you stay flexible and select the best component for your use case, with all the benefits of scalable, low-cost storage and data management services such as compaction, clustering, and cleaning.

For example:
- Best of Compute: you can use a compute engine that performs well for your use case (Spark for distributed ETL, Flink for stream processing, maybe DuckDB/Daft for single-node workloads)
- Open Table Format: a choice of open table formats (Apache Hudi, Apache Iceberg, Delta Lake) for transactional capabilities and open storage
- Catalog: depending on your ecosystem & integrations, you can work with AWS Glue, Unity Catalog, etc.

Now, while table formats have opened the door to openness, it is important to recognize that an 'open' data architecture needs more than just open table formats.
It requires interoperability across formats (Apache XTable (Incubating)), and it requires catalogs and the compute services for table management (clustering, compaction, and cleaning) to be open as well. These are factors that cannot be ignored as we head into the next phase of lakehouses.

Read more about it in my blog (link in comments).

#dataengineering #softwareengineering