Sunday, October 15, 2023

Educating ChatGPT on Data Lakehouse


As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics. ChatGPT is an excellent resource for gaining high-level insights and building awareness of any technology. However, caution is necessary when delving deeper into a particular technology. ChatGPT is trained on historical data, and depending on how one phrases a question, it may offer inaccurate or misleading information.

I took the free version of ChatGPT on a test drive (in March 2023) and asked some simple questions about the data lakehouse and its components. Here are some responses that weren’t exactly right, and our explanation of where and why it went wrong. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good.

I thought this was a fairly comprehensive list. The one key component that is missing is a common, shared table format that can be used by all analytic services accessing the lakehouse data. When implementing a data lakehouse, the table format is a critical piece because it acts as an abstraction layer, making it easy for any engine or tool to access all the structured and unstructured data in the lakehouse, concurrently. The table format provides the structure that is missing in a data lake, using a schema or metadata definition, to bring it closer to a data warehouse. Some of the popular table formats are Apache Iceberg, Delta Lake, Hudi, and Hive ACID.
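To make the "abstraction layer" idea concrete, here is a minimal sketch in Python. It is a toy model only, not the API of Apache Iceberg, Delta Lake, Hudi, or Hive ACID: the `ToyTable` and `Snapshot` classes, the column names, and the `s3://bucket/...` paths are all hypothetical. It shows one shared metadata definition (a schema plus a list of data files per snapshot) answering the question "what should I read?" for any engine that asks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files that belong to it."""
    snapshot_id: int
    data_files: tuple


@dataclass
class ToyTable:
    """Hypothetical, simplified table metadata; not a real table-format API."""
    name: str
    schema: dict  # column name -> type name
    snapshots: list = field(default_factory=list)

    def commit(self, new_files):
        """Record a new snapshot that adds files to the current file set."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def scan_plan(self, snapshot_id=None):
        """Return the schema and the files an engine should read."""
        if not self.snapshots:
            return self.schema, ()
        if snapshot_id is None:
            snap = self.snapshots[-1]  # latest committed state
        else:
            snap = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return self.schema, snap.data_files


# Hypothetical table and file paths, for illustration only.
orders = ToyTable("orders", {"order_id": "long", "amount": "double"})
orders.commit(["s3://bucket/orders/part-0001.parquet"])
orders.commit(["s3://bucket/orders/part-0002.parquet"])

# Two different "engines" consult the same metadata layer.
sql_schema, sql_files = orders.scan_plan()               # latest snapshot
ml_schema, ml_files = orders.scan_plan(snapshot_id=1)    # an earlier snapshot
```

Both callers get the same schema and a consistent file list without knowing how the files are laid out in storage, which is the role the shared table format plays in a lakehouse. Real table formats add much more (transactions, partitioning, schema evolution) that this sketch leaves out.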

Also, the data lake layer is not limited to cloud object stores. Many companies still have massive amounts of data on premises, and data lakehouses are not limited to public clouds. They can be built on premises or as hybrid deployments leveraging private clouds, HDFS stores, or Apache Ozone.

At Cloudera, we also provide machine learning as part of our lakehouse, so data scientists get easy access to reliable data in the data lakehouse to quickly launch new machine learning projects and build and deploy new models for advanced analytics.

I like how ChatGPT started this answer, but it quickly jumps into features and even gives an incorrect response on the feature comparison. Features are not the only way of deciding which is the better table format. The choice also depends on compatibility, openness, versatility, and other factors that will guarantee broader usage for different data users, guarantee security and governance, and future-proof your architecture.

Here is a high-level feature comparison chart if you want to go into the details of what is available on Delta Lake versus Apache Iceberg.

 

This response is a little dangerous because of its incorrectness, and it demonstrates why I feel these tools are not ready for deeper analysis. At first glance it may look like a reasonable response, but its premise is wrong, which makes you doubt the entire response and other responses as well. Saying “Delta Lake is built on top of Apache Iceberg” is incorrect, as the two are completely different, unrelated table formats, and one has nothing to do with the inception of the other. They were created by different organizations to solve common data problems.

 

I’m impressed that ChatGPT got this one right, although it made a few mistakes with our product names and missed a few that are critical for a lakehouse implementation.

CDP’s components that support a data lakehouse architecture include:

  1. Apache Iceberg table format, which is integrated into CDP to provide structure to the massive amounts of structured and unstructured data in your data lake.
  2. Data services, including a cloud native data warehouse called CDW, a data engineering service called CDE, a data streaming service called data in motion, and a machine learning service called CML.
  3. Cloudera Shared Data Experience (SDX), which provides a unified data catalog with automatic data profilers, unified security, and unified governance over all your data in both the public and private cloud.

ChatGPT is a great tool for getting a high-level view of new technologies, but I’d say use it carefully, validate its responses, and use it only for the awareness stage of the buying cycle. As you go into the consideration or comparison stage, it’s not reliable yet.

Also, answers on ChatGPT keep updating, so hopefully it corrects itself before you read this blog.

To learn more about Cloudera’s lakehouse, visit the webpage, and if you are ready to get started, watch the Cloudera Now demo.


