What is a Data Lakehouse? Definition, Benefits & Features

Source: eweek.com

A data lakehouse is a hybrid data management architecture that combines the best features of a data lake and a data warehouse into one data management solution.

A data lake is a centralized repository that allows you to store large amounts of data in its native, raw format. A data warehouse, on the other hand, is a repository that stores structured and semi-structured data from multiple sources for analysis and reporting purposes.

A data lakehouse aims to bridge the gap between these two approaches by merging the flexibility, scale, and low cost of a data lake with the performance and ACID (atomicity, consistency, isolation, durability) transactions of a data warehouse. This enables business intelligence and analytics on all data in a single platform.

 

What is a lakehouse?

A data lakehouse is a modern data architecture that creates a single platform by combining the key benefits of data lakes (large repositories of raw data in its original form) and data warehouses (organized sets of structured data). Specifically, data lakehouses enable organizations to use low-cost storage to store large amounts of raw data while providing structure and data management functions.

Historically, data warehouses and data lakes had to be implemented as separate, siloed architectures to avoid overloading the underlying systems and creating contention for the same resources. Companies used data warehouses to store structured data for business intelligence (BI) and reporting and data lakes to store unstructured and semi-structured data for machine learning (ML) workloads. But this approach required data to be regularly shifted between the two separate systems when data from either architecture needed to be processed together, creating complexity, higher costs, and issues around data freshness, duplication, and consistency. 

Data lakehouses aim to break down these silos and deliver the flexibility, scalability, and agility needed to ensure your data generates value for your business, rather than inefficiencies.

Data lakehouse vs. data lake vs. data warehouse

The term “data lakehouse” merges two types of existing data repositories: the data warehouse and the data lake. So, what exactly are the differences when it comes to a data lakehouse vs. data lake vs. data warehouse? 

Data warehouses

Data warehouses provide fast access to data and SQL compatibility for business users who need to generate reports and insights for decision-making. All data must go through an ETL (extract, transform, load) phase: it is optimized into a specific format, or schema, based on the use case before it is loaded, which supports high-performance queries and data integrity. However, this approach limits flexible access to the data and creates additional costs if data needs to be moved around for future use.
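
To make this schema-on-write pattern concrete, here is a minimal sketch in Python. It uses pandas and SQLite purely as stand-ins for an ETL tool and a warehouse; the table and column names are illustrative, not taken from any specific product.

```python
# Schema-on-write sketch: data is typed and shaped to a fixed schema *before*
# it is loaded, so queries against the warehouse table are fast and consistent.
# SQLite stands in for the warehouse; all names are illustrative.
import sqlite3
import pandas as pd

# Extract: raw records pulled from an operational source
raw = pd.DataFrame([
    {"order_id": "A-1", "amount": "19.99", "ts": "2024-01-05T10:00:00Z"},
    {"order_id": "A-2", "amount": "5.00",  "ts": "2024-01-05T11:30:00Z"},
])

# Transform: enforce the target schema (types, derived columns) up front
orders = pd.DataFrame({
    "order_id": raw["order_id"],
    "amount": raw["amount"].astype(float),
    "order_date": pd.to_datetime(raw["ts"]).dt.date.astype(str),
})

# Load: write into the structured, query-optimized store
con = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", con, if_exists="replace", index=False)
print(pd.read_sql("SELECT order_date, SUM(amount) AS revenue "
                  "FROM fact_orders GROUP BY order_date", con))
```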

Data lakes

Data lakes store large amounts of unstructured and structured data in its native format. Unlike in a data warehouse, the data is processed, cleaned, and transformed at analysis time rather than before loading, which enables faster ingestion and makes data lakes ideal for big data processing, machine learning, or predictive analytics. However, they require data science expertise, which limits the set of people who can use the data, and if they're not properly maintained, data quality can deteriorate over time. Data lakes also make real-time queries more challenging: because the data is unprocessed, it may still need to be cleaned, processed, ingested, and integrated before it can be used.
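
By contrast with the warehouse sketch above, a lake applies structure only when the data is read. The sketch below shows the schema-on-read pattern, with plain JSON files on local disk standing in for object storage; the paths and fields are illustrative.

```python
# Schema-on-read sketch: raw records land in the "lake" untouched, and typing
# and cleaning happen only at analysis time. Local files stand in for cheap
# object storage; paths and fields are illustrative.
import json
import pathlib
import pandas as pd

lake = pathlib.Path("lake/events")
lake.mkdir(parents=True, exist_ok=True)

# Ingest: dump records as-is, with no upfront transformation or schema checks
(lake / "2024-01-05.json").write_text(json.dumps([
    {"order_id": "A-1", "amount": "19.99", "ts": "2024-01-05T10:00:00Z"},
    {"order_id": "A-2", "amount": None,    "ts": "2024-01-05T11:30:00Z"},
]))

# Analyze: structure and data quality are dealt with at read time
events = pd.concat(pd.read_json(p) for p in lake.glob("*.json"))
events["amount"] = pd.to_numeric(events["amount"], errors="coerce")
print(events.dropna(subset=["amount"]))
```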

Data lakehouse

A data lakehouse merges these two approaches to create a single structure that allows you to access and leverage data for many different purposes, from BI to data science to machine learning. In other words, a data lakehouse captures all of your organization’s unstructured, structured, and semi-structured data and stores it on low-cost storage while providing the capabilities for all users to organize and explore data according to their needs. 

Data lakehouse features

The key data lakehouse features include: 

  • A single, low-cost data store for all data types (structured, semi-structured, and unstructured) 
  • Data management features to apply schemas, enforce data governance, and support ETL processes and data cleansing
  • Transaction support for ACID (atomicity, consistency, isolation, and durability) properties to ensure data consistency when multiple users concurrently read and write data 
  • Standardized storage formats that can be used in multiple software programs
  • End-to-end streaming to support real-time ingestion of data and insight generation 
  • Separate compute and storage resources to ensure scalability for a diverse set of workloads

  • Direct access for BI apps to the source data in the lakehouse to reduce data duplication (a brief sketch of several of these features follows below)
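
The following sketch illustrates two of these features, standardized storage formats and compute that is separate from storage, assuming pandas (with a Parquet engine such as pyarrow) and DuckDB are installed; the file and column names are illustrative.

```python
# One engine writes an open, standardized format (Parquet); a completely
# independent engine (DuckDB, standing in for any SQL/BI-facing compute)
# queries the same files in place, with no copy into a proprietary store.
import duckdb
import pandas as pd

# "Storage": persist data as an open columnar format on cheap storage
pd.DataFrame(
    {"order_id": ["A-1", "A-2", "A-3"], "amount": [19.99, 5.00, 42.00]}
).to_parquet("lake_orders.parquet")

# "Compute": a separate engine attaches SQL directly to the stored files
print(duckdb.sql(
    "SELECT COUNT(*) AS orders, SUM(amount) AS revenue "
    "FROM 'lake_orders.parquet'"
).df())
```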

How does a data lakehouse work?

To understand how a data lakehouse works, it's important to consider what it's trying to achieve. Data lakehouses aim to centralize disparate data sources and simplify engineering efforts so that everyone in your organization can be a data user. 

A data lakehouse uses the same low-cost cloud object storage as data lakes, providing on-demand storage that is easy to provision and scale. Like a data lake, it can capture and store large volumes of all data types in raw form. The lakehouse then integrates metadata layers over this store to provide warehouse-like capabilities, such as structured schemas, support for ACID transactions, data governance, and other data management and optimization features.
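
As a concrete illustration of that metadata layer, the sketch below uses the open source deltalake Python package; Delta Lake is only one example of a lakehouse table format (Apache Iceberg and Apache Hudi are others), and the table path and data are illustrative.

```python
# Metadata-layer sketch: plain Parquet data files plus a transaction log
# (_delta_log) living side by side in ordinary file/object storage give the
# lake warehouse-like properties such as atomic, versioned commits.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Each write is an atomic commit recorded in the table's transaction log
write_deltalake("lakehouse/orders",
                pd.DataFrame({"order_id": ["A-1", "A-2"], "amount": [19.99, 5.00]}),
                mode="append")
write_deltalake("lakehouse/orders",
                pd.DataFrame({"order_id": ["A-3"], "amount": [42.00]}),
                mode="append")

table = DeltaTable("lakehouse/orders")
print(table.version())    # latest committed version (1 after two commits)
print(table.to_pandas())  # readers always see a consistent snapshot
```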

Challenges of using a data lakehouse

The data lakehouse is still a relatively new architecture, so some of the biggest challenges stem from the fact that it is still evolving and best practices are still being defined by early adopters. 

In addition, data lakehouses are complex to build from the ground up. In most cases, you'll need to either opt for an out-of-the-box data lakehouse solution or use a platform like Google Cloud that offers all the components needed to support an open lakehouse architecture.

Layers of data lakehouse architecture

A data lakehouse architecture consists of the following layers: 

 

  • Storage layer: The storage layer is the data lake layer for all of your raw data, usually a low-cost object store for all your unstructured, structured, and semi-structured datasets. It’s decoupled from computing resources so compute can scale independently. 
  • Staging layer: The staging layer is the metadata layer that sits on top of your data lake layer. It provides a detailed catalog about all the data objects in storage, enabling you to apply data management features, such as schema enforcement, ACID properties, indexing, caching, and access control.
  • Semantic layer: The semantic layer, also known as the lakehouse layer, exposes all of your data for use; users can connect client apps and analytics tools to access and leverage the data for experimentation and business intelligence.

Data lakehouse examples

There are several existing data lakehouse examples, including Databricks Lakehouse Platform and Amazon Redshift Spectrum. However, as technologies have matured and data lakehouse adoption has increased, implementations have shifted away from coupling lakehouse components to a specific data lake. 

For example, the Google Cloud approach has been to unify the core capabilities of enterprise data operations, data lakes, and data warehouses. This implementation places BigQuery’s storage and compute power at the heart of the data lakehouse architecture. You can then apply a unified governance approach and other warehouse-like capabilities using Dataplex and Analytics Hub. 

BigQuery is not only integrated with the Google Cloud ecosystem; it also allows you to use partner and open source technologies to bring the best of lake and warehouse capabilities together in a single system.
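
As a rough illustration of that pattern, the sketch below uses the google-cloud-bigquery Python client to define an external table over Parquet files in Cloud Storage and query them with standard SQL. The project, dataset, and bucket names are hypothetical placeholders, and this shows a plain external table rather than any specific managed feature.

```python
# Sketch: query lake-resident Parquet files in Cloud Storage through BigQuery
# without first loading them into native warehouse storage. All resource
# names (project, dataset, bucket) are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-lake-bucket/orders/*.parquet"]

table = bigquery.Table("my-project.lakehouse_demo.orders")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Warehouse-style SQL directly over the files in object storage
rows = client.query(
    "SELECT order_id, amount FROM `my-project.lakehouse_demo.orders` LIMIT 10"
).result()
for row in rows:
    print(row.order_id, row.amount)
```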

We are continuing to build on this approach with the release of BigLake, now in Preview, a unified storage engine that simplifies data access across data warehouses and data lakes. You can apply fine-grained access control and accelerate query performance across distributed data.