Recently, we introduced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for large analytic datasets. It enables multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala, to access and analyze data in simple, familiar SQL tables.
In this blog post, we will share how Cloudera Stream Processing (CSP) integrates with Apache Iceberg and how you can use the SQL Stream Builder (SSB) interface in CSP to create stateful stream processing jobs using SQL. This allows you to maximize the use of streaming data at scale. We will explore how to create catalogs and tables and show examples of how to write data to and read data from these Iceberg tables. Currently, Iceberg support in CSP is in technical preview mode.
The CSP engine is powered by Apache Flink, which is a best-in-class processing engine for stateful streaming pipelines. Let's take a look at which features of the Iceberg specification are supported:
As shown in the table above, Flink supports a wide range of features, with the following limitations:
- No DDL support for hidden partitioning
- Altering a table is only possible for table properties (no schema/partition evolution)
- Flink SQL does not support inspecting metadata tables
- No watermark support
CSP currently supports the v1 format features; v2 format support is coming soon.
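To make the second limitation concrete, here is a minimal sketch (the table name is hypothetical): changing table properties from Flink SQL works, while schema changes such as adding a column have to be performed from another engine.

```sql
-- Supported: altering table properties
-- ('write.format.default' is a standard Iceberg table property)
ALTER TABLE `iceberg_hive_example` SET ('write.format.default' = 'parquet');

-- Not supported from Flink SQL: schema evolution, e.g.
-- ALTER TABLE `iceberg_hive_example` ADD (`new_col` STRING);
-- would need to be done from another engine such as Spark or Hive
```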
SQL Stream Builder integration
Hive Metastore
To use the Hive Metastore with Iceberg in SSB, the first step is to register a Hive catalog, which we can do using the UI:
In the Project Explorer, open the Data Sources folder and right-click Catalog, which brings up the context menu.
Clicking "New Catalog" opens the catalog creation modal window.
To register a Hive catalog, enter any unique name for the catalog in SSB. The Catalog Type should be set to Hive. The Default Database is an optional field, so we can leave it empty for now.
The CM Host field is only available in the CDP Public Cloud version of SSB, because the streaming analytics cluster templates do not include Hive; in order to work with Hive, we will need another cluster in the same environment that uses a template with the Hive component. To provide the CM host, copy the FQDN of the node where Cloudera Manager is running. This information can be obtained from the Cloudera Management Console by first selecting the Data Hub cluster that has Hive installed and belongs to the same environment, then going to the Nodes tab:
Look for the node marked "CM Server" on the right side of the table. After the form is filled out, click Validate and then the Create button to register the new catalog.
In the next example, we will explore how to create a table using the Iceberg connector and the Hive Metastore.
Let's create our new table:
CREATE TABLE `ssb`.`ssb_default`.`iceberg_hive_example` (
  `column_int` INT,
  `column_str` VARCHAR(2147483647)
) WITH (
  'connector' = 'iceberg',
  'catalog-database' = 'default',
  'catalog-type' = 'hive',
  'catalog-name' = 'hive-catalog',
  'ssb-hive-catalog' = 'your-hive-data-source',
  'engine.hive.enabled' = 'true'
)
As we can see in the code snippet, SSB provides a custom convenience property, ssb-hive-catalog, to simplify configuring Hive. Without this property, we would need to know the hive-conf location on the server, or the thrift URI and warehouse path. The value of this property should be the name of the previously registered Hive catalog. By providing this option, SSB automatically configures all the required Hive-specific properties, and if it is an external cluster (in the case of CDP Public Cloud) it also downloads the Hive configuration files from the other cluster. The catalog-database property defines the Iceberg database name in the backend catalog, which by default uses the default Flink database ("default_database"). The catalog-name is a user-specified string that is used internally by the connector when creating the underlying Iceberg catalog. This option is required, as the connector does not provide a default value.
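For comparison, a table definition without the ssb-hive-catalog shortcut might look like the following sketch; the table name, thrift URI, and warehouse path are placeholders that would have to match your own Hive Metastore.

```sql
CREATE TABLE `iceberg_hive_example_manual` (
  `column_int` INT,
  `column_str` STRING
) WITH (
  'connector' = 'iceberg',
  'catalog-type' = 'hive',
  'catalog-name' = 'hive-catalog',
  'catalog-database' = 'default',
  -- placeholders: point these at your own Hive Metastore
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 'hdfs://namenode:8020/warehouse'
)
```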
After the table is created, we can insert and query data using familiar SQL syntax:
INSERT INTO `iceberg_hive_example` VALUES (1, 'a');
SELECT * FROM `iceberg_hive_example`;
Querying data using Time Travel:
SELECT * FROM `iceberg_hive_example` /*+OPTIONS('as-of-timestamp'='1674475871165')*/;
Or:
SELECT * FROM `iceberg_hive_example` /*+OPTIONS('snapshot-id'='901544054824878350')*/
In streaming mode, we have the following capabilities available:
We can read all the records from the current snapshot, and then read incremental data starting from that snapshot:
SELECT * FROM `iceberg_hive_example` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
Additionally, we can read all incremental data starting from the provided snapshot-id (records from this snapshot will be excluded):
SELECT * FROM `iceberg_hive_example` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987')*/ ;
Conclusion
We have covered how to access the power of Apache Iceberg in SQL Stream Builder, along with its possibilities and limitations in Flink. We also explored how to create and access Iceberg tables using a Hive catalog, and the convenience options SSB provides to facilitate the integration, so you can spend less time on configuration and focus more on the data.
Anybody can try out SSB using the Stream Processing Community Edition (CSP-CE). CE makes developing stream processors easy, from your desktop or any other development node. Analysts, data scientists, and developers can now evaluate new features, develop SQL-based stream processors locally using SQL Stream Builder powered by Flink, and develop Kafka Consumers/Producers and Kafka Connect Connectors, all locally, before moving to production in CDP.