Friday, December 16, 2022

14 Questions to Ask When Evaluating Data Lineage – Atlan


Looking for a data lineage tool? These are the key “gotchas” and features you should be asking about.

Data lineage is often a mess.

Think of it like knitting a blanket. There are threads coming and going from every direction, far too many to count. All of these have to come together perfectly in an intricate pattern. If you get it right, it’s art. If even one thing gets out of line, it’s chaos.

Lineage is hard to get right because of the sheer number of variables at play: data flowing from a variety of sources (both ever-changing old ones and the latest new ones), transformations at every stage, complex language involved in naming and describing data assets, different styles for writing data logic and code, and much more.

As difficult as this is, we can’t give up. Lineage is indispensable in the data team toolbox, revealing data flows and powering critical use cases like impact analysis, root cause analysis, governance, and compliance.

Here are 14 questions to ask during your search for the right data lineage tool to fully assess its depth (number of unique sources supported), breadth (number of fields or objects supported for each source), and utility (ability to power insights and actions across diverse data personas).

By Shane Gibson (@shagility)

1. Does it automatically parse SQL queries?

Many data platforms provide a lineage API, so it’s easy for any lineage system to ingest and use lineage from these sources. However, not every platform does this. Automatic SQL parsing is crucial to plug these gaps and ensure that your lineage is complete, covering all data sources, processes, and assets.

If you only parse SQL at the warehouse layer, SQL queries from sources without native query history (e.g. relational databases like PostgreSQL and MySQL) will slip through the cracks.

Look for the ability to read a dump of SQL queries from source systems that don’t possess a “query history” feature.

2. Which types of SQL statements are supported?

To avoid gaps in your lineage, it’s important to parse and register lineage from various types of SQL statements:

  • CREATE TABLE
  • CREATE TABLE AS SELECT
  • CREATE VIEW
  • MERGE
  • INSERT INTO
  • UPDATE

Most SQL parsers support SQL CREATE and, in some cases, MERGE statements. However, many don’t support INSERT INTO and UPDATE statements. These account for most transformations in data warehouses, so they’re crucial for full lineage coverage.

Look for lineage that can also parse MERGE, INSERT INTO, and UPDATE statements.
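
To make the coverage question concrete, here is a minimal Python sketch of table-level lineage extraction for the statement types listed above. It is a toy built on regular expressions, not how any particular lineage product works: the function name `extract_table_lineage`, the patterns, and the table names are all assumptions made for illustration, and a real tool would use a full SQL parser that handles subqueries, CTEs, quoting, and dialect differences.

```python
import re

# Toy patterns (assumptions for illustration): each pairs a statement type
# with a regex that captures the target table at the start of the statement.
# A plain CREATE TABLE (no AS SELECT) is skipped because it has no sources.
TARGET_PATTERNS = [
    ("CREATE TABLE AS SELECT",
     re.compile(r"^\s*CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+([\w.]+)\s+AS\b", re.I)),
    ("CREATE VIEW",
     re.compile(r"^\s*CREATE\s+(?:OR\s+REPLACE\s+)?VIEW\s+([\w.]+)\s+AS\b", re.I)),
    ("INSERT INTO", re.compile(r"^\s*INSERT\s+INTO\s+([\w.]+)", re.I)),
    ("MERGE", re.compile(r"^\s*MERGE\s+INTO\s+([\w.]+)", re.I)),
    ("UPDATE", re.compile(r"^\s*UPDATE\s+([\w.]+)", re.I)),
]

# Any identifier that directly follows FROM, JOIN, or USING counts as a source.
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN|USING)\s+([\w.]+)", re.I)


def extract_table_lineage(sql):
    """Return (statement_type, target, sorted_sources), or None if unsupported."""
    for kind, pattern in TARGET_PATTERNS:
        match = pattern.search(sql)
        if match:
            target = match.group(1).lower()
            sources = {name.lower() for name in SOURCE_PATTERN.findall(sql)}
            sources.discard(target)  # a table is not its own upstream
            return kind, target, sorted(sources)
    return None


if __name__ == "__main__":
    statements = [
        "CREATE TABLE analytics.orders_daily AS SELECT * FROM raw.orders",
        "INSERT INTO mart.sales SELECT * FROM staging.sales",
        "UPDATE mart.sales SET region = r.name FROM ref.regions r",
        "MERGE INTO mart.customers t USING staging.customers s ON t.id = s.id",
    ]
    for statement in statements:
        print(extract_table_lineage(statement))
```

A parser that handled only the CREATE patterns would return nothing for the last three statements, which is exactly the kind of lineage gap this question is meant to expose.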

3. Does the lineage API support programmatic lineage creation and retrieval?

The data ecosystem is constantly evolving, and new data sources are emerging all the time. Programmatically processing lineage from unsupported sources (via an open API) is key to scaling lineage without worrying about which new platforms you can and can’t adopt.

Look for two key features:

  • Ability to retrieve and create lineage programmatically via an API.
  • Ability to publish and retrieve table and column-level lineage across any object type.
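
As a rough illustration of what “publish and retrieve” could look like, the Python sketch below models a lineage store in memory. Everything in it is hypothetical: `AssetRef`, `LineageStore`, and their methods are names invented for this example, not the API of any real lineage product. The point is only that one interface can serve both table-level and column-level edges across arbitrary object types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssetRef:
    """Identifies an asset, or one of its columns/fields when `column` is set."""
    asset_type: str                # e.g. "table", "dashboard", "salesforce_object"
    name: str
    column: Optional[str] = None   # None means the edge is at the asset level


class LineageStore:
    """Hypothetical in-memory stand-in for a lineage API."""

    def __init__(self):
        self._upstream = defaultdict(set)
        self._downstream = defaultdict(set)

    def publish(self, source, target, process):
        """Register one lineage edge, labeled with the process that created it."""
        self._downstream[source].add((target, process))
        self._upstream[target].add((source, process))

    def upstream_of(self, asset):
        """Return the direct upstream assets of `asset` as a set."""
        return {src for src, _ in self._upstream.get(asset, set())}

    def downstream_of(self, asset):
        """Return the direct downstream assets of `asset` as a set."""
        return {dst for dst, _ in self._downstream.get(asset, set())}


if __name__ == "__main__":
    store = LineageStore()
    # Table-level edge from a source the tool does not natively support.
    store.publish(AssetRef("table", "raw.orders"),
                  AssetRef("table", "analytics.orders_daily"),
                  process="nightly_job")
    # Column-level edge published through the same call.
    store.publish(AssetRef("table", "raw.orders", "amount"),
                  AssetRef("table", "analytics.orders_daily", "revenue"),
                  process="nightly_job")
    print(store.upstream_of(AssetRef("table", "analytics.orders_daily", "revenue")))
```

When evaluating a real API, the useful check is whether both granularities and all of your object types fit through the same create and retrieve calls, as they do in this sketch.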

4. Is lineage a native capability, or is it provided by an external partnership?

In lineage, dealing with edge cases is the norm, and new edge cases often require custom solutions or support from your lineage vendor.

Lineage is often packaged as part of a larger catalog or anomaly detection product. Sometimes this lineage is natively available and supported by the product’s team. However, sometimes it comes via an external partnership, which can lead to slower support and fixes.

Look for three key features:

  • Whether the lineage capability is natively supported or externally provided.
  • Whether the product’s team has direct control over the lineage development.
  • Clear SLAs for support and an engineering dependency matrix (if there is an external dependency).

5. How is it future-proofed against changes in the modern data stack?

Data transformation tools and processes are always evolving. Even as customers switch from legacy stacks to the latest data tools, lineage should always stay reliable.

Pulling lineage farther away from a data source (e.g. from within the transformation process) can lead to problems if the source system changes. Pulling lineage from as close to the source as possible is generally safer and more future-proof.

Look for lineage that pulls from a source system’s query history (e.g. natively from Snowflake) rather than integrating with a downstream transformation tool or process.

6. Does it have cloud-native flexibility to scale up SQL parsing demands?

In lineage, it’s easy to end up with large-scale SQL parsing demands. (We’ve personally seen customers with over one million queries per day.) Parsing these queries takes significant computational resources, so it’s important that your lineage can keep up.

Cloud-native products use the latest design patterns and microservice architectures popularized by companies like Netflix to scale on demand. Beware of platforms that weren’t built for the cloud or have legacy tech debt: they will be hard to maintain, leading to performance problems as your lineage scales.

Look for modern, cloud-native architecture that supports SQL parsing at scale.

7. Does it offer lineage down to the column level?

Table-level lineage is considered “table stakes”, but column-level lineage should be too. It’s crucial for a range of use cases:

  • Tracing sensitive data classifications for transformed PII data
  • Impact analysis from things like schema changes
  • Root cause analysis, e.g. investigating why a dashboard looks off by tracing a BI field to upstream columns in the data warehouse

Without the ability to dive into granular column or field lineage, data engineers and analysts may miss key depth during their investigations.

Look for two key features:

  • Native column-level experience in the UI, including viewing graph linkages at the column level.
  • Support for MERGE, INSERT INTO, and UPDATE SQL statements, which are key for column-level transformations.
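
The root cause analysis use case above can be sketched as a walk over column-level edges. The Python example below is a minimal illustration under assumed data: the `UPSTREAM` mapping and every field name in it are made up, and a real tool would read these edges from its lineage graph rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical column-level edges: each key is a downstream field and each
# value lists the upstream columns it is derived from.
UPSTREAM = {
    "dashboard.revenue_tile.total_revenue": ["mart.sales.revenue"],
    "mart.sales.revenue": ["staging.orders.amount", "staging.orders.discount"],
    "staging.orders.amount": ["raw.orders.amount"],
    "staging.orders.discount": ["raw.orders.discount_pct"],
}


def trace_upstream(field, upstream=UPSTREAM):
    """Breadth-first walk: map every upstream column to its hop distance."""
    distances = {}
    queue = deque([(field, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in distances:  # skip columns already reached
                distances[parent] = depth + 1
                queue.append((parent, depth + 1))
    return distances


if __name__ == "__main__":
    # Start from the BI field that looks wrong and list candidate culprits,
    # nearest first.
    hops = trace_upstream("dashboard.revenue_tile.total_revenue")
    for column, distance in sorted(hops.items(), key=lambda item: item[1]):
        print(distance, column)
```

With only table-level edges, the same walk would return whole tables, leaving the analyst to guess which of their columns actually feed the broken field.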

8. Does it automatically connect upstream SQL sources with downstream BI assets?

Often, the goal of lineage is to identify why something in the last mile doesn’t look right. As the stewards of company data, data engineering teams are responsible for making sure the data that feeds end-user assets is trustworthy and reliable. When this fails, lineage is a crucial diagnostic tool.

Not all lineage will natively connect to your chosen BI tool (e.g. Looker, Tableau, Power BI, etc.). Some rely on time-consuming manual scripts and asset pushing, even for major BI tools.

Look for either native connectors or automated scripts that automatically connect to your BI tool of choice.

9. Does it support field-level lineage for BI dashboards?

Anyone doing root cause analysis needs to dive into an incorrect field (i.e. dimension, measure, calculated field, etc.) in the dashboard, and work backwards to zero in on the upstream fields or columns that are broken. This is only possible with field-level lineage for the BI tool.

Field-level lineage is also crucial for impact analysis. If a data engineer is trying to make a schema change, they need to understand the specific downstream columns and fields that will be affected, not just which dashboards will be affected in some unspecified way.

Some platforms support lineage for a few fields, but don’t go deep with BI fields that are crucial for these types of analysis.

Look for two key features:

  • Coverage of both column-level lineage for SQL sources and BI field-level lineage.
  • Whether your BI tool’s objects are supported and exposed in lineage. (E.g. in Looker, will lineage cover all the fields/objects you care about, such as Dashboards, Looks, Explores, Tiles, Fields, and Views?)

10. Can it create upstream lineage with data in Salesforce (at both the object and field levels)?

We often hear that Salesforce is the “Wild West” and no one knows what is happening with that data in the ETL pipeline. However, opening up Salesforce (and other important SaaS source systems) can be a game-changer for helping data and business teams to collaborate. Impact analysis is a major use case here, since Salesforce fields get changed all the time and wreak havoc downstream.

If it’s available, make sure to investigate the depth of Salesforce lineage. Some lineage starts at the storage layer (i.e. data warehouse, lake, etc.). Can lineage be generated upstream of the storage layer for SaaS tools like Salesforce?

If so, how deep does it go? Some systems can’t go down to Salesforce object and field levels, which are crucial for making downstream lineage useful and understanding context for downstream data assets.

Look for object and field-level lineage from Salesforce down to the data warehouse layer.

11. Does it natively integrate with modern data integration tools?

Building lineage upstream of a data warehouse is hard. Doing so at scale, especially if you draw data from multiple source systems, is even harder.

If you follow an ELT approach, it’s important that your lineage can connect with modern data integration tools like Fivetran. This lets you build upstream lineage, creating true end-to-end lineage and showing what happens with data before it enters the storage layer.

Look for whether it natively connects to Fivetran or other modern data integration tools.

12. Can it combine with Databricks and generate lineage for Spark jobs?

Spark lineage is difficult to generate. But if you use Databricks, this is key to unlocking visibility into your transformations and creating usable lineage to help data scientists, engineers, and analysts with ML and analytics workloads in Databricks.

Look for two key features:

  • Whether it ingests lineage from Databricks’ Unity Catalog API (which includes Spark, Scala, and SQL).
  • Whether it supports field-level lineage in BI tools downstream of Databricks.

13. Does it incorporate other types of metadata to give more context for assets in the lineage graph?

In isolation, lineage only tells part of the story and, therefore, only provides part of the value. Lineage becomes actionable when it’s combined with key metadata and context:

  • Operational metadata: How and when were assets orchestrated?
  • Quality and anomaly metadata: What state are the assets in? Are they reliable?
  • Business/semantic metadata: How do the assets link to key business terms or KPIs?
  • Owner and expert metadata: Who should you contact or collaborate with during troubleshooting?
  • Social metadata: What’s the human context for this asset, e.g. relevant Slack discussions or Jira tickets about the asset? This is what machines alone will miss.

Often lineage graphs appear as yet another siloed view. Without the other metadata for these assets, it can be hard to put lineage in context.

Look for three key features:

  • Openness: An “open by design”, extensible platform where you can harvest data and metadata from any source via APIs (including custom-built connectors).
  • Flexibility: Support for a range of technical, operational, anomaly/quality, and business/semantic metadata from these sources.
  • Personalization: A personalized data experience, where each persona sees the metadata that’s right for them, rather than drowning in all the metadata.

14. Can it be used not just to investigate issues, but also to drive action programmatically?

In addition to enabling data people’s work, lineage can enable automated system actions and workflows.

For example, if an upstream table has data quality issues, it’s important to automatically add announcements to downstream BI dashboards. This keeps business users from creating “Garbage In, Garbage Out” analysis, and saves data analysts and engineers from manually sending alerts or warnings.

Some platforms don’t have the underlying architecture and scalability to perform automated actions based on lineage.

Look for open APIs, the ability to build or customize automated workflows, and the ability to read metadata-change events and trigger changes in linked assets across the lineage graph.
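
The announcement example above could be wired up roughly as follows. This Python sketch is illustrative only: the event shape, the `DOWNSTREAM` mapping, the `dashboard.` naming convention, and the `handle_quality_event` function are all assumptions, standing in for whatever events and APIs a given platform actually exposes.

```python
from collections import deque

# Hypothetical asset-level lineage: each key lists its direct downstream assets.
DOWNSTREAM = {
    "raw.orders": ["mart.sales"],
    "mart.sales": ["dashboard.revenue", "dashboard.forecast"],
    "raw.customers": ["dashboard.churn"],
}


def handle_quality_event(event, downstream, announcements):
    """On a failed quality check, attach a warning to every downstream dashboard.

    `event` is an assumed dict with "asset", "status", and "check" keys.
    `announcements` maps dashboard name to message and is updated in place.
    Returns the sorted list of dashboards that received an announcement.
    """
    if event["status"] != "failed":
        return []
    seen = {event["asset"]}
    queue = deque([event["asset"]])
    notified = []
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            if child.startswith("dashboard."):  # assumed naming convention
                announcements[child] = (
                    f"Upstream issue in {event['asset']}: {event['check']}"
                )
                notified.append(child)
    return sorted(notified)


if __name__ == "__main__":
    announcements = {}
    event = {"asset": "raw.orders", "status": "failed", "check": "null rate too high"}
    print(handle_quality_event(event, DOWNSTREAM, announcements))
    print(announcements)
```

The traversal itself is simple. What matters when evaluating a platform is whether it emits events like this one and lets an external workflow write the announcement back onto the dashboard asset.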



This blog was co-written with Mark Pavletich (Director of Sales Engineering) and Swaminathan Kumar (Strategy & Intelligence).


Header photo: Crawford Jolly on Unsplash.

This blog was originally published on Towards Data Science.




