Data Lineage
What is data lineage?
Data lineage is the tracing of data from its source through the transformations it undergoes to reach its target. The process begins when data enters a data system, covers the manipulations (such as extractions or queries) applied along the way, and ends with the final output. It creates links between one or more source entities and the target entities in a data warehouse, thereby enabling traceability. For this reason, it is vital to data governance.
A data lineage system can be either active or passive. Active data lineage systems are intended for non-language data operations, such as running an Apache Beam pipeline or an Apache Spark job on Hadoop. Passive systems are better suited to language-based data operations, since they support SQL.
Passive systems are ideal for data warehouses such as BigQuery. Data lineage systems serve as repositories, hosting records of data lineage and providing insight into those records via a visual interface or API. Such systems also enable the automatic extraction of lineage metadata from the warehouse. GCP handles vast volumes of data, and data lineage is an integral component of data engineering and analytics on the platform. So, why is it essential for these two processes?
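To make the repository idea concrete, here is a minimal in-memory sketch of a lineage store. All class, method, and entity names are illustrative assumptions, not a real GCP API; production systems (for example, lineage tooling around BigQuery) persist these records and expose them through an API or visual interface.

```python
from collections import defaultdict

class LineageRepository:
    """Hypothetical in-memory lineage repository (illustrative only)."""

    def __init__(self):
        # Maps each target entity to the source entities it was derived
        # from, together with the transformation that produced it.
        self._upstream = defaultdict(list)

    def record(self, source, target, transformation):
        """Store one lineage link: source --transformation--> target."""
        self._upstream[target].append((source, transformation))

    def upstream_of(self, target):
        """Return the direct sources and transformations behind a target."""
        return list(self._upstream[target])

# Example: a raw file loaded into a staging table, then aggregated.
repo = LineageRepository()
repo.record("gcs://raw/events.csv", "bq://staging.events", "load job")
repo.record("bq://staging.events", "bq://marts.daily_revenue", "SQL aggregation")
print(repo.upstream_of("bq://marts.daily_revenue"))
```

Each `record` call corresponds to one link between a source and a target entity; querying `upstream_of` is the traceability the article describes.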
Why is data lineage important for data engineering and analytics on GCP?
Data lineage enables easy detection of the root cause of data inconsistencies within a system or between multiple systems during analytics on GCP. Suppose BigQuery data shows that a given customer is eligible for a committed use discount, whereas Cloud Storage data shows that the same customer is still on a free trial of a Compute Engine service. Data lineage provides clear insight into the data pipeline, from this output back to its source, so you can perform a root-cause analysis and determine the origin of the inconsistency.
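The root-cause analysis above amounts to walking the lineage graph backwards from the inconsistent output until you reach entities with no upstream parents. This sketch assumes a simple dictionary-based graph; the entity names are hypothetical.

```python
def trace_to_sources(upstream, entity):
    """Walk a lineage graph from an output entity back to its root sources.

    `upstream` maps each entity to the list of entities it was derived
    from. Entities with no upstream parents are the root sources where
    an inconsistency investigation should start.
    """
    roots, stack, seen = [], [entity], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # avoid revisiting shared upstream entities
        seen.add(node)
        parents = upstream.get(node, [])
        if not parents:
            roots.append(node)  # no parents: this is a root source
        else:
            stack.extend(parents)
    return roots

# Example: trace the discount-eligibility figure back to its origins.
lineage = {
    "bq://reports.discount_eligibility": ["bq://staging.billing"],
    "bq://staging.billing": ["gcs://raw/billing.csv", "gcs://raw/trials.csv"],
}
print(trace_to_sources(lineage, "bq://reports.discount_eligibility"))
```

The traversal surfaces the two raw files feeding the report, which can then be checked directly for the conflicting customer record.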
By enabling traceability, data lineage makes it possible to enforce data access policies in the warehouse. For example, since a passive data lineage system supports SQL, ZetaSQL can be used to lay out the rules that determine which parties can access a given piece of data in the data warehouse (BigQuery). This prevents unauthorized data access.
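The access-policy idea can be sketched as a simple allow-list check. This is a hypothetical illustration in Python; in practice BigQuery enforces such rules natively via IAM and row- or column-level access policies, and the principals and table names below are made up.

```python
def can_access(policies, principal, entity):
    """Return True if the principal is allow-listed for the entity.

    `policies` maps an entity (e.g. a BigQuery table) to the set of
    principals permitted to read it. Illustrative sketch only.
    """
    return principal in policies.get(entity, set())

# Example policy: only one analyst may read the PII table.
policies = {"bq://marts.pii_customers": {"analyst@example.com"}}
print(can_access(policies, "analyst@example.com", "bq://marts.pii_customers"))  # True
print(can_access(policies, "intern@example.com", "bq://marts.pii_customers"))   # False
```

Lineage matters here because knowing which downstream tables derive from a sensitive source lets you apply the same policy to all of them.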
Data lineage also dictates the blueprint for developing serverless data pipelines on GCP. Data engineers build pipelines that define the data path from ingestion via Cloud Pub/Sub, through storage, to output and subsequently analytics in BigQuery. Data platforms must adhere to data governance regulations, such as the General Data Protection Regulation (GDPR), and these regulations are applied at different stages of the data pipeline.
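One lightweight way to capture lineage along such a pipeline is to annotate each record as it passes through a stage. This is a sketch under assumed names (the stage labels, topic, and table are hypothetical); a real pipeline would emit this metadata to a lineage repository rather than carry it inline with the payload.

```python
from datetime import datetime, timezone

def annotate_stage(record, stage, note=""):
    """Append a lineage entry for one pipeline stage to a record."""
    record.setdefault("_lineage", []).append({
        "stage": stage,                                  # which step ran
        "note": note,                                    # free-form detail
        "at": datetime.now(timezone.utc).isoformat(),    # when it ran
    })
    return record

# Example: an event flowing from Pub/Sub ingestion to a BigQuery load.
event = {"user_id": 42, "amount_usd": 9.99}
event = annotate_stage(event, "pubsub_ingest", "topic: billing-events")
event = annotate_stage(event, "transform", "normalized currency")
event = annotate_stage(event, "bigquery_load", "table: marts.revenue")
print([e["stage"] for e in event["_lineage"]])
# → ['pubsub_ingest', 'transform', 'bigquery_load']
```

The ordered stage trail is exactly what an auditor needs to verify that a regulation such as GDPR was applied at the right step.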
In summary, data lineage keeps data engineering and analytics on GCP well within the rules regulating data management and governance. These are highly regulated processes, especially for platforms that handle highly sensitive customer data (such as credit card or phone numbers). Data lineage also simplifies correcting errors that may arise from data engineering and analytics, despite the large volumes of data handled on GCP.