
From Hive Tables to Iceberg Tables: Problem-Free


Introduction

For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges when using Hive tables because of its antiquated directory-based table format. Some of the common issues include constrained schema evolution, static partitioning of data, and long planning times caused by S3 directory listings.

Apache Iceberg is a modern table format that not only addresses these problems but also offers additional features like time travel, partition evolution, table versioning, schema evolution, strong consistency guarantees, object store file layout (the ability to distribute the files of a single logical partition across many prefixes to avoid object store throttling), hidden partitioning (users don't need to be intimately aware of partitioning), and more. Therefore, the Apache Iceberg table format is poised to replace the traditional Hive table format in the coming years.

However, as there are already 25 million terabytes of data stored in the Hive table format, migrating existing Hive tables to the Iceberg table format is necessary for performance and cost. Depending on the size and usage patterns of the data, several different strategies could be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases. While these instructions are carried out for Cloudera Data Platform (CDP), Cloudera Data Engineering (CDE), and Cloudera Data Warehouse (CDW), one can extrapolate them easily to other services and other use cases as well.

There are a few scenarios that one might encounter. One or more of these use cases might fit your workload, and you might be able to mix and match the potential solutions provided to suit your needs. They are meant to be a general guide. In all of the use cases we are trying to migrate a table named "events."

Approach 1

You have the ability to stop your clients from writing to the respective Hive table for the duration of your migration. This is ideal because it might mean that you don't have to modify any of your client code. Sometimes this is the only choice available if you have hundreds of clients that can potentially write to a table. It could be much easier to simply stop all those jobs rather than allowing them to continue during the migration process.

In-place table migration

Solution 1A: using Spark's migrate procedure

Iceberg's Spark extensions provide a built-in procedure called "migrate" to migrate an existing table from the Hive table format to the Iceberg table format. They also provide a "snapshot" procedure that creates an Iceberg table with a different name on top of the same underlying data. You can first create a snapshot table, run sanity checks on it, and verify that everything is in order.

Once you are satisfied you can drop the snapshot table and proceed with the migration using the migrate procedure. Keep in mind that the migrate procedure creates a backup table named "events__BACKUP__." As of this writing, the "__BACKUP__" suffix is hardcoded. There is an effort underway to let the user pass a custom backup suffix in the future.
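To make this concrete, here is a minimal sketch of that flow in Spark SQL. It assumes a Spark session whose catalog (called "spark_catalog" here) is configured for Iceberg and backed by your Hive Metastore, and a database named "db"; adjust the catalog, database, and table names to your environment.

    -- Create a trial Iceberg table over the same underlying data, under a different name.
    CALL spark_catalog.system.snapshot('db.events', 'db.events_snapshot');

    -- Run your sanity checks against db.events_snapshot, then remove it.
    DROP TABLE db.events_snapshot;

    -- Migrate the Hive table in place; the original is preserved as db.events__BACKUP__.
    CALL spark_catalog.system.migrate('db.events');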

Keep in mind that neither the migrate nor the snapshot procedure modifies the underlying data: they perform an in-place migration. They only read the underlying data (not even a full read; they just read the Parquet headers) and create the corresponding Iceberg metadata files. Since the underlying data files are not modified, you may not be able to take full advantage of Iceberg's benefits right away. You can optimize your table now or at a later stage using the "rewrite_data_files" procedure. This will be discussed in a later blog. Now let's discuss the pros and cons of this approach.
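For reference, the basic invocation of that procedure in Spark SQL looks like the following sketch; the compaction options and strategies are left for the follow-up blog.

    -- Compact small data files in the migrated table (default options).
    CALL spark_catalog.system.rewrite_data_files(table => 'db.events');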

PROS:

  • Migration can be done in stages: first do the migration and then carry out the optimization later using the rewrite_data_files procedure (blog to follow).
  • Relatively fast, as the underlying data files are kept in place. You don't have to worry about creating a temporary table and swapping it later; the procedure does that for you atomically once the migration is done.
  • Since a Hive backup is available, you can revert the change entirely by dropping the newly created Iceberg table and renaming the Hive backup table (__backup__) to its original name.

CONS:

  • If the underlying data is not optimized, or has a lot of small files, those disadvantages could be carried forward to the Iceberg table as well. Query engines (Impala, Hive, Spark) might mitigate some of these problems by using Iceberg's metadata files, but the underlying data file locations will not change. So if the prefixes of the file paths are common across multiple files, you may continue to suffer from S3 throttling (see Object Store File Layout for how to configure it properly).
  • In CDP we only support migrating external tables; Hive managed tables cannot be migrated.
  • The underlying file format for the table has to be one of Avro, ORC, or Parquet.

Note: There is also a SparkAction in the Java API.

Solution 1B: using Hive's "ALTER TABLE" command

Cloudera implemented an easy way to do the migration in Hive. All you have to do is alter the table properties to set the storage handler to "HiveIcebergStorageHandler."
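For example, the change might look like the following in Hive. This is a sketch: confirm the exact storage handler class and any additional required table properties against your CDP version's documentation before running it.

    ALTER TABLE events
    SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');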

The pros and cons of this approach are essentially the same as Solution 1A. The migration is done in place and the underlying data files are not modified. Hive creates Iceberg's metadata files for the same exact table.

Shadow table migration

Solution 1C: using the CTAS statement

This solution is the most generic, and it could potentially be used with any processing engine (Spark/Hive/Impala) that supports SQL-like syntax.
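For example, in Spark SQL the shadow table could be created with a CTAS statement like the one below. This is a sketch: the database name "db", the partition column "event_ts", and the S3 path are illustrative, and the location clause is included so that a later rename does not move the underlying data (see below).

    CREATE TABLE db.iceberg_events
    USING iceberg
    PARTITIONED BY (days(event_ts))                         -- illustrative partition spec
    LOCATION 's3://my-bucket/warehouse/iceberg_events'      -- illustrative path
    TBLPROPERTIES ('write.object-storage.enabled' = 'true')
    AS SELECT * FROM db.events;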

You can run basic sanity checks on the data to see whether the newly created table is sound.
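For example, a basic check might compare row counts between the two tables, in addition to any domain-specific validation you normally run:

    SELECT
      (SELECT COUNT(*) FROM db.events)         AS hive_row_count,
      (SELECT COUNT(*) FROM db.iceberg_events) AS iceberg_row_count;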

Once you are satisfied with your sanity checking, you can rename your "events" table to "backup_events" and then rename "iceberg_events" to "events." Keep in mind that in some cases the rename operation can trigger a directory rename of the underlying data directory. If that is the case and your underlying data store is an object store like S3, it will trigger a full copy of your data and could be very expensive. If the location clause was specified while creating the Iceberg table, then renaming the Iceberg table will not cause the underlying data files to move; the name changes only in the Hive Metastore. The same applies to Hive tables as well: if your original Hive table was not created with the location clause specified, then the rename to the backup table will trigger a directory rename. In that case, if your filesystem is object store based, it might be best to drop the backup table altogether instead of renaming it. Given the nuances around table renames, it is important to test with dummy tables in your system and check that you are seeing the desired behavior before you perform these operations on critical tables.
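A minimal sketch of the swap, assuming both tables live in a database named "db" and that you have already verified the rename behavior on dummy tables:

    ALTER TABLE db.events RENAME TO db.backup_events;
    ALTER TABLE db.iceberg_events RENAME TO db.events;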

You can drop your "backup_events" table if you wish.

Your clients can now resume their read/write operations on "events," and they don't even need to know that the underlying table format has changed. Now let's discuss the pros and cons of this approach.

PROS:

  • The newly created data is well optimized for Iceberg and will be distributed well.
  • Any existing small files will be coalesced automatically.
  • Common procedure across all of the engines.
  • The newly created data files can take advantage of Iceberg's Object Store File Layout, so that the file paths have different prefixes, reducing object store throttling. Please see the linked documentation to learn how to take advantage of this feature.
  • This approach is not necessarily limited to migrating a Hive table. One could use the same approach to migrate tables from any processing engine or table format, like Delta, Hudi, etc.
  • You can change the data format, say from "orc" to "parquet."

CONS:

  • This will trigger a full read and write of the data, and it might be an expensive operation.
  • Your entire data set will be duplicated, so you need to have sufficient storage space available. This shouldn't be a problem in a public cloud backed by an object store.

Approach 2

You don't have the luxury of a long downtime to do your migration. You want to let your clients or jobs continue writing data to the table. This requires some planning and testing, but it is possible with some caveats. Here is one way you can do it with Spark; you can potentially extrapolate the ideas presented here to other engines.

  • Create an Iceberg table with the desired properties. Keep in mind that you have to keep the partitioning scheme the same for this to work correctly (see the sketch after this list).
  • Modify your clients or jobs to write to both tables, so that they write to both the "iceberg_events" table and the "events" table. For now, they only read from the "events" table. Capture the timestamp from which your clients started writing to both tables.
  • Programmatically list all of the files in the Hive table that were inserted before the timestamp you captured in step 2.
  • Add all of the files captured in step 3 to the Iceberg table using the "add_files" procedure, which simply registers the files with your Iceberg table. You might also be able to take advantage of your table's partitioning scheme to skip step 3 entirely and add files to your newly created Iceberg table using the "add_files" procedure (see the sketch after this list).

  • If you don't have access to Spark, you might simply read each of the files listed in step 3 and insert them into "iceberg_events."
  • Once you have successfully added all of the data files, you can stop your clients from reading/writing to the old "events" table and switch them to the new "iceberg_events" table.
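Here is a minimal Spark SQL sketch of steps 1 and 4. The schema, the partition column "event_date," and the cut-over partition value are illustrative; the add_files procedure only registers the existing Hive data files with the Iceberg table and does not rewrite them.

    -- Step 1: create the Iceberg table with the same partitioning scheme as the Hive table.
    CREATE TABLE db.iceberg_events (
      id         BIGINT,
      payload    STRING,
      event_date DATE
    )
    USING iceberg
    PARTITIONED BY (event_date);

    -- Step 4: register the existing Hive data files with the Iceberg table.
    -- The optional partition_filter argument lets you add only the partitions written
    -- before the cut-over, which is how step 3 can be skipped for date-partitioned tables.
    CALL spark_catalog.system.add_files(
      table => 'db.iceberg_events',
      source_table => 'db.events',
      partition_filter => map('event_date', '2023-10-01')   -- illustrative cut-over partition
    );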

Some caveats and notes

  • In step 2, you can control which tables your clients/jobs need to write to using a flag that can be fetched from external sources like environment variables, a database (like Redis) pointer, properties files, etc. That way you only need to modify your client/job code once and don't have to keep modifying it for each step.
  • In step 2, you are capturing a timestamp that will be used to determine the files needed for step 3; this could be affected by clock drift on your nodes. So you might want to sync all of your nodes before you start the migration process.
  • If your table is partitioned by date and time (as most real-world data is partitioned), meaning all new incoming data goes to a new partition, then you can program your clients to start writing to both tables from a specific date and time. That way you only need to worry about adding the data from the old table ("events") to the new table ("iceberg_events") up to that date and time, and you can take advantage of your partitioning scheme and skip step 3 entirely. This is the approach that should be used whenever possible.

Conclusion

Any large migration is tough and has to be thought through carefully. Thankfully, as discussed above, there are multiple strategies at our disposal to do it effectively depending on your use case. If you have the ability to stop all of your jobs while the migration is happening, it is relatively straightforward; but if you want to migrate with minimal to no downtime, that requires some planning and careful thinking about your data layout. You can use a combination of the above approaches to best suit your needs.

To learn more:

  1. For more on table migration, please refer to the respective online documentation for Cloudera Data Warehouse (CDW) and Cloudera Data Engineering (CDE).
  2. Watch our webinar Supercharge Your Analytics with Open Data Lakehouse Powered by Apache Iceberg. It includes a live demo recording of Iceberg capabilities.
  3. Try Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML) by signing up for a 60-day trial, or test drive CDP. You can also schedule a demo by clicking here, or if you are interested in chatting about Apache Iceberg in CDP, contact your account team.


