Google BigLake

As data volumes grow, data is increasingly spread across data lakes and data warehouses. Because these stores have separate management stacks, they tend to become silos, each serving only specific use cases. As use cases multiply, however, they need data regardless of where it is stored, which is difficult when siloed warehouses and lakes support different capabilities. One workaround is to set up and manage data movement infrastructure, but that quickly becomes unscalable as use cases and data volumes grow. Fortunately, Google Cloud offers a solution, BigLake, which aims to mitigate this.

What is Google BigLake?

BigLake is a storage engine that lets organizations unify their data lakes and warehouses and use that data for analytics at scale. It does this by providing uniform, fine-grained access control and improved query performance across multi-cloud storage and open formats. Organizations can therefore break down data silos without incurring the cost of setting up and managing data movement infrastructure or writing data movement jobs.

BigLake supports interoperability between data warehouses and lakes. With BigLake, organizations can store a single copy of data and make it uniformly accessible to the engines that query it, including Google Cloud engines such as BigQuery and Vertex AI and open-source engines such as Spark, Presto, Trino, and Hive. Keeping a single copy of data across warehouses and lakes avoids duplication and makes analytics and deriving insights from distributed data far more practical. Organizations are also free to choose the best analytics tools, whether open source or cloud native, so whatever your data scientists' preferred tool is, they get a consistent experience when accessing BigLake tables.
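For a concrete sense of what defining such a table involves: in BigQuery, a BigLake table is created with a CREATE EXTERNAL TABLE statement that references a connection resource. Below is a minimal sketch in Python that assembles that DDL; the project, dataset, connection, and bucket names are hypothetical placeholders.

```python
def biglake_table_ddl(table, connection, uris, file_format="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for a BigLake table.

    A BigLake table references data in object storage through a
    connection resource, so engines query it in place rather than
    copying it into warehouse-managed storage.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}]);"
    )

# Hypothetical names, for illustration only.
ddl = biglake_table_ddl(
    table="my-project.lake.orders",
    connection="my-project.us.lake-connection",
    uris=["gs://my-bucket/orders/*.parquet"],
)
print(ddl)
```

Once the table exists, BigQuery, Spark, Trino, and the other engines all see the same single copy of the underlying files.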

How secure is Google BigLake?

BigLake enables organizations to maintain security across distributed data. Through its fine-grained security controls, data administrators can grant access at the row and column level rather than at the file level, and these controls apply across open-source engines such as Apache Spark and Trino. The security model comprises three roles, each with different IAM permissions: data lake administrators, data warehouse administrators, and data analysts. Data administrators centrally manage security policies, which are enforced across the query engines by the API interface built into BigLake's connectors.
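Row-level restrictions of this kind are expressed in BigQuery with a CREATE ROW ACCESS POLICY statement. The sketch below assembles one in Python; the policy, table, and group names are hypothetical, and the point is only to show the shape of such a policy.

```python
def row_access_policy_ddl(policy, table, grantees, filter_expr):
    """Build a CREATE ROW ACCESS POLICY statement.

    A row access policy restricts which rows a principal can read;
    the policy is defined once on the table and enforced for every
    engine that queries it through BigLake's connectors.
    """
    grantee_list = ", ".join(f"'{g}'" for g in grantees)
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f"GRANT TO ({grantee_list})\n"
        f"FILTER USING ({filter_expr});"
    )

# Hypothetical example: the US analysts group sees only US rows.
policy = row_access_policy_ddl(
    policy="us_rows_only",
    table="my-project.lake.orders",
    grantees=["group:us-analysts@example.com"],
    filter_expr="country = 'US'",
)
print(policy)
```

Because the policy lives with the table rather than with any one engine, an analyst querying from Spark sees the same filtered rows as one querying from the BigQuery console.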

What does Google BigLake offer?

BigLake doesn’t just offer unified analytics: organizations using it also get unified governance and management at scale. This comes from its integration with Dataplex, which adds centralized policy and metadata management, lifecycle management, and data organization to BigLake. This unified data governance extends across multi-cloud tables, including those defined over Azure Data Lake Storage Gen2 and Amazon S3.

BigLake also brings performance acceleration to data lakes. This is powered by BigQuery infrastructure, which lets queries execute efficiently, much as they would in a data warehouse. And because BigLake is built on open standards, it supports open file formats such as JSON, ORC, Parquet, and Avro.
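When defining a table over files in one of these formats, the format is passed as an option in the table definition. A small, hypothetical helper sketching the mapping from file extension to that option value:

```python
import os

# Open file formats mentioned above, mapped to the `format` option
# value used when defining a BigLake table over files in storage.
OPEN_FORMATS = {
    ".json": "JSON",
    ".orc": "ORC",
    ".parquet": "PARQUET",
    ".avro": "AVRO",
}

def detect_format(path):
    """Pick the table format option from a file path's extension."""
    _, ext = os.path.splitext(path)
    try:
        return OPEN_FORMATS[ext.lower()]
    except KeyError:
        raise ValueError(f"unsupported file format: {ext!r}")

fmt = detect_format("gs://my-bucket/orders/part-0001.parquet")
print(fmt)  # PARQUET
```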

If you’re already using BigQuery, you can create BigLake tables from the BigQuery console. If you’re using open-source engines running on Dataproc, you can use the notebook experience to create, discover, and query BigLake tables; the same applies to self-managed deployments.
