Thursday, November 2, 2023
HomeBig DataTips on how to improve your Hive tables to Unity Catalog

Tips on how to improve your Hive tables to Unity Catalog


On this weblog we’ll reveal with examples, how one can seamlessly improve your Hive metastore (HMS)* tables to Unity Catalog (UC) utilizing totally different methodologies relying on the variations of HMS tables being upgraded.

*Be aware: Hive metastore may very well be your default or exterior metastore and even AWS Glue Information Catalog. For simplicity, we’ve used the time period “Hive metastore” all through this doc

Earlier than we dive into the small print, allow us to have a look at the steps we’d take for the improve –

  1. Assess – On this step, you’ll consider the present HMS tables recognized for improve in order that we will decide the appropriate strategy for improve. This step is mentioned on this weblog.
  2. Create – On this step, you create the required UC property comparable to, Metastore, Catalog, Schema, Storage Credentials, Exterior Areas. For particulars discuss with the documentations – AWS, Azure, GCP
  3. Improve – On this step, you’ll comply with the steering to improve the tables from HMS to UC. This step is mentioned on this weblog.
  4. Grant – On this step, you will want to supply grants on the newly upgraded UC tables to principals, in order that they’ll entry the UC tables. For element discuss with the documentations – AWS, Azure, GCP

Unity Catalog, now usually accessible on all three cloud platforms (AWS, Azure, and GCP), simplifies safety and governance of your information with the next key options:

  • Outline as soon as, safe in every single place: Unity Catalog presents a single place to manage information entry insurance policies that apply throughout all workspaces.
  • Requirements-compliant safety mannequin: Unity Catalog’s safety mannequin is predicated on normal ANSI SQL and permits directors to grant permissions of their current information lake utilizing acquainted syntax, on the stage of catalogs, databases (additionally referred to as schemas), tables, and views.
  • Constructed-in auditing and lineage: Unity Catalog mechanically captures user-level audit logs that file entry to your information. Unity Catalog additionally captures lineage information that tracks how information property are created and used throughout all languages.
  • Information discovery: Unity Catalog allows you to tag and doc information property, and supplies a search interface to assist information customers discover information.
  • System tables (Public Preview): Unity Catalog allows you to simply entry and question your account’s operational information, together with audit logs, billable utilization, and lineage.
  • Information Sharing: Delta Sharing is an open protocol developed by Databricks for safe information sharing with different organizations whatever the computing platforms they use. Databricks has constructed Delta Sharing into its Unity Catalog information governance platform, enabling a Databricks consumer, referred to as an information supplier, to share information with an individual or group outdoors of their group, referred to as an information recipient.

All these wealthy options which can be found with Unity Catalog (UC) out of the field are not available in your Hive metastore at the moment and would take an enormous quantity of your assets to construct floor up. Moreover, as most (if not all) of the newer Databricks options, comparable to, Lakehouse Monitoring, Lakehouse Federation, LakehouseIQ, are constructed on, ruled by and wishes Unity Catalog as a prerequisite to operate, delaying upgradation of your information property to UC from HMS would restrict your potential to reap the benefits of these newer product options.

Therefore, one query that involves thoughts is how will you simply improve tables registered in your current Hive metastore to the Unity Catalog metastore with the intention to reap the benefits of all of the wealthy options Unity Catalog presents. On this weblog, we’ll stroll you thru concerns, methodologies with examples for upgrading your HMS desk to UC.

Improve Issues and Conditions

On this part we evaluation concerns for improve earlier than we dive deeper into the Improve Methodologies within the subsequent part.

Improve Issues

Variations of Hive Metastore tables is one such consideration. Hive Metastore tables, deemed for improve to Unity Catalog, may have been created with mixing and matching varieties for every parameter proven within the desk beneath. For instance, one may have created a CSV Managed Desk utilizing DBFS root location or a Parquet Exterior desk on Amazon S3 location.This part describes the parameters based mostly on which totally different variation of the tables may have been created in your

Parameter

Variation

Desk Identification Information

Desk Sort

MANAGED

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Sort”. It ought to say “MANAGED”.

EXTERNAL

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Sort”. It ought to say “EXTERNAL”.

Information Storage Location

DBFS Root storage location

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Location”. It ought to begin with “dbfs:/consumer/hive/warehouse/”

DBFS Mounted Cloud Object Storage

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Location”. It ought to begin with “dbfs:/mnt/”

Immediately specifying cloud storage Location (comparable to S3://, abfss:// or gs://)

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify worth of subject “Location”. It ought to begin with “S3://” or “abfss://” or “gs://”

Desk file format and interface

File codecs comparable to Delta, Parquet, Avro

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Supplier”. It ought to say for instance ”delta”, “parquet”.

Interface comparable to Hive SerDe interface

Run desc prolonged hive_metastore.<schema identify>.<desk identify> and verify the worth of the sphere “Supplier”. It ought to say ”hive”.

Relying on the variations of the parameters talked about above, the adopted improve methodologies may fluctuate. Particulars are mentioned within the Improve Methodologies part beneath.

One other level must be thought-about earlier than you begin the improve of HMS tables to UC in Azure Databricks:

For AZURE Cloud – Tables at present saved on Blob storage (wasb) or ADLS gen 1 (adl) must be upgraded to ADLS gen 2 (abfs). In any other case it’s going to elevate an error if you happen to attempt to use unsupported Azure cloud storage with Unity Catalog.

Error instance: Desk will not be eligible for an improve from Hive Metastore to Unity Catalog. Motive: Unsupported file system scheme wasbs.

Improve Conditions

Earlier than beginning the improve course of, the storage credentials and exterior places must be created as proven within the steps beneath.

  1. Create Storage Credential(s) with entry to the goal cloud storage.
  2. Create Exterior Location(s) pointing to the goal cloud storage utilizing the storage credential(s).
    • The Exterior Areas are used for creating UC Exterior Tables, Managed Catalogs, or Managed schemas.

Improve Methodologies

On this part we present you all of the totally different improve choices within the type of a matrix. We additionally use diagrams to point out the steps concerned in upgrading.

There are two main strategies for upgrading, utilizing SYNC (for supported eventualities) or utilizing information replication (the place SYNC will not be supported).

  • Utilizing SYNC – For all of the supported eventualities (as proven within the Improve Matrix part beneath) use SYNC to improve HMS tables to UC. Utilizing SYNC means that you can improve tables with out information replication
  • Utilizing Information Replication – For all unsupported eventualities (as proven within the Improve Matrix part beneath) use both Create Desk As Choose (CTAS) or DEEP CLONE*. This methodology would require information replication

*Be aware – Think about using deep clone for HMS Parquet and Delta tables to repeat the info and improve tables in UC from HMS. Use Create Desk As Choose (CTAS) for different file codecs.

The diagrams beneath describes the improve steps for every methodology. To know which methodology to make use of to your improve use case, discuss with the Improve Matrix part beneath.

Pictorial Illustration of improve

Diagram 1 – Upgrading HMS tables to UC utilizing SYNC (with out information replication)

HMS tables

Diagram Keys:

  1. HMS Managed and exterior tables retailer information as a listing of information on cloud object storage
  2. SYNC Command is used to improve desk metadata from HMS to UC. Goal UC tables are Exterior regardless of the supply HMS desk varieties.
  3. No Information is copied when the SYNC command is used for upgrading tables from HMS to UC. Similar underlying cloud storage location (utilized by the supply HMS desk) is referred to by the goal UC Exterior desk.
  4. A storage credential represents an authentication and authorization mechanism for accessing information saved in your cloud tenant.
  5. An exterior location is an object that mixes a cloud storage path with a storage credential that authorizes entry to the cloud storage path.

Diagram 2 – Upgrading HMS tables to UC with information replication

HMS tables

Diagram Keys:

  1. HMS Managed and exterior tables retailer information as a listing of information on DBFS Root Storage Location.
  2. CTAS or Deep Clone creates the UC goal desk metadata from the HMS desk. One can select to improve to an exterior or managed desk regardless of the HMS desk kind.
  3. CTAS or Deep Clone copies information from DBFS root storage to focus on cloud storage.
  4. A storage credential represents an authentication and authorization mechanism for accessing information saved in your cloud tenant.
  5. An exterior location is an object that mixes a cloud storage path with a storage credential that authorizes entry to the cloud storage path.

Improve Matrix

Beneath desk showcases the totally different prospects of Upgrading HMS tables to UC tables. For every situation, we offer steps that you may comply with for the improve.

HMS Storage Format utilizing DBFS Root Storage

Ex.

HMS Desk Sort 

Description of HMS Desk Sort

Instance of HMS Desk

Goal UC TableType

Goal UC Information File Format

Improve Methodology

1

Managed

The info information for the managed tables reside inside DBFS Root (the default location for the Databricks managed HMS database).

%sql

create desk if not exists hive_metastore.hmsdb_upgrade_db.people_parquet
utilizing parquet
as choose * from parquet.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.parquet/`restrict 100;

Exterior Or Managed

Delta is the popular file format for each Managed and Exterior Tables. Exterior tables do assist non-delta file codecs.1

CTAS or Deep Clone

2

Exterior

This implies the info information for the Exterior tables reside inside DBFS Root. The desk definition has the “Location” clause which makes the desk exterior.

%sql

create desk if not exists hive_metastore.hmsdb_upgrade_db.people_parquet
utilizing parquet
location “dbfs:/consumer/hive/warehouse/hmsdb_upgrade_db.db/people_parquet”
as
choose * from parquet.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.parquet/`restrict 100;

Exterior Or Managed

Delta is the popular file format for each Managed and Exterior Tables. Exterior tables do assist non-delta file codecs. 1

CTAS or Deep Clone

1. Be aware – Ideally change it to Delta whereas Upgrading with CTAS.

HMS Hive SerDe desk

Ex.

H MS Desk Sort 

Description of HMS Desk Sort

Instance of HMS Desk

Goal UC TableType

Goal UC Information File Format

Improve Methodology

3

Hive SerDe Exterior or Managed 2

These are the tables created utilizing the Hive SerDe interface. Seek advice from this hyperlink to study extra about hive tables on databricks.

%sql

CREATE TABLE if not exists hive_metastore.hmsdb_upgrade_db.parquetExample (id int, identify string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
STORED AS INPUTFORMAT ‘org.apache.hadoop.mapred.SequenceFileInputFormat’
OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat’
location “s3://databricks-dkushari/sync-test/parquetexample”;

Exterior Or Managed

Delta is the popular file format for each Managed and Exterior Tables. Exterior tables do assist non-delta file codecs. 3

CTAS or Deep Clone

2. Be aware – regardless of the underlying storage format, hive SerDe follows the identical improve path..
3. Be aware – Ideally change it to Delta while you’re doing the improve utilizing CTAS
.

HMS Storage Format utilizing DBFS Mounted Storage

Ex.

HMS Desk Sort 

Description of HMS Desk Sort

Instance of HMS Desk

Goal UC TableType

Goal UC Information File Format

Improve Methodology

4

Managed

That is when the dad or mum database has its location set to exterior paths, e.g., a mounted path from the item retailer. The desk is created with no location clause and desk information is saved beneath that default database path.

%sql
create database if not exists hive_metastore.hmsdb_upgrade_db location “dbfs:/mnt/test-mnt/hmsdb_upgrade_db/”;

createtableifnotexists hive_metastore.hmsdb_upgrade_db.people_delta
as
choose * from delta.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.delta`restrict 100;

Exterior

As that of the HMS supply information file format

  1. Run Sync to create UC Exterior desk
  2. Convert HMS Managed to HMS Exterior (code supplied in Appendix beneath)
  3. Drop HMS desk in spite of everything dependencies are resolved, so there isn’t any technique to entry the info
  4. Unmount the mount level in spite of everything dependencies are resolved in order that there isn’t any technique to entry the info utilizing mount factors 4

5

Managed

Managed

Delta

CTAS or Deep Clone

6

Exterior

The desk is created with a location clause and a path specifying a mounted path from a cloud object retailer. 

%sql
create database if not exists hive_metastore.hmsdb_upgrade_db location “dbfs:/mnt/test-mnt/hmsdb_upgrade_db/”;

create desk if not exists hive_metastore.hmsdb_upgrade_db.people_delta
location “dbfs:/mnt/test-mnt/hmsdb_upgrade_db/people_delta”
as
choose * from delta.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.delta`restrict 100;

Exterior

As that of the HMS supply information file format

  1. Run Sync (weblog) to create UC Exterior desk
  2. Drop the HMS desk in spite of everything dependencies are resolved in order that there isn’t any technique to entry the info
  3. Unmount the mount level in spite of everything dependencies are resolved so there isn’t any technique to entry the info

7

Exterior

Managed

Delta

CTAS or Deep Clone

4. Be aware – Be sure that the HMS desk is dropped individually after conversion to an exterior desk. If the HMS database/schema was outlined with a location and if the database is dropped with the cascade possibility, then the underlying information can be misplaced and the upgraded UC tables will lose the info..

HMS Storage Format utilizing Cloud Object Storage

Ex.

HMS Desk Sort 

Description of HMS Desk Sort

Instance of HMS Desk

Goal UC TableType

Goal UC Information File Format

Improve Methodology

8

Managed

The dad or mum database has its location set to exterior paths, e.g., a cloud object retailer. The desk is created with no location clause and desk information is saved beneath that default database path.

%sql

create database if not exists hive_metastore.hmsdb_upgrade_db location “s3://databricks-dkushari/hmsdb_upgrade_db/”;

create desk if not exists hive_metastore.hmsdb_upgrade_db.people_delta
as
choose * from delta.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.delta`restrict 100;

Extern

al

As of the supply information file format

  1. Run Sync to create UC Exterior desk
  2. Convert HMS Managed to HMS Exterior (code supplied beneath)
  3. Drop HMS desk in spite of everything dependencies are resolved in order that there isn’t any technique to entry the info 4

9

Managed

Managed

Delta

CTAS or Deep Clone

10

Exterior

The desk is created with a location clause and a path specifying a cloud object retailer. 

%sql

create desk if not exists hive_metastore.hmsdb_upgrade_db.people_delta
location “s3://databricks-dkushari/hmsdb_upgrade_db/people_delta”
as
choose * from delta.`dbfs:/databricks-datasets/learning-spark-v2/folks/people-10m.delta`restrict 100;

Exterior

As of the supply information file format

  1. Run Sync (weblog) to create UC Exterior desk
  2. Drop the HMS desk in spite of everything dependencies are resolved in order that there isn’t any technique to entry the info

11

Exterior

Managed

Delta

CTAS or Deep Clone

Examples of improve

On this part, we’re offering a Databricks Pocket book with examples for every situation mentioned above.

Conclusion

On this weblog, we’ve proven how one can improve your Hive metastore tables to Unity Catalog metastore. Please discuss with the Pocket book to strive totally different improve choices. It’s also possible to discuss with the Demo Middle to get began with automating the improve course of. To automate upgrading Hive Metastore tables to Unity Catalog we suggest you employ this Databricks Lab repository.

Improve your tables to Unity Catalog at the moment and profit from unified governance options. After Upgrading to UC, you may drop Hive metastore schemas and tables if you happen to not want them. Dropping an exterior desk doesn’t modify the info information in your cloud tenant. Take precautions (as described on this weblog) whereas dropping managed tables or schemas with managed tables.

Appendix

import org.apache.spark.sql.catalyst.catalog.{CatalogTable,
CatalogTableType}
import org.apache.spark.sql.catalyst.TableIdentifier
val tableName = "desk"
val dbName = "dbname"
val oldTable: CatalogTable = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName,
Some(dbName)))
val alteredTable: CatalogTable = oldTable.copy(tableType = 
CatalogTableType.EXTERNAL)
spark.sessionState.catalog.alterTable(alteredTable)



Supply hyperlink

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments