Tuesday, February 14, 2023

What Is a Dimensional Model and How to Implement It on the Databricks Lakehouse Platform


Oracle is a well-known technology for hosting Enterprise Data Warehouse solutions. However, many customers like Optum and the U.S. Citizenship and Immigration Services chose to migrate to the Databricks Lakehouse Platform to leverage the power of data, analytics, and AI in a single platform at scale and to deliver business value faster. For example, Optum’s on-premises Oracle-based data warehouse system struggled to quickly process and analyze the data. With Azure Databricks, they have improved data pipeline performance by 2x, enabling faster delivery of results to hospitals and saving them millions of dollars in potentially lost revenue.

Migrating from Oracle to Databricks involves several steps; among the most critical are migrating the data, converting the code, and modernizing reports.

In this blog post, we will focus on converting proprietary PL/SQL code to open-standard Python code, taking advantage of PySpark for ETL workloads and of Databricks SQL for data analytics workloads.

The challenge of converting PL/SQL to PySpark

As the need for building data pipelines and ETL grew, every database needed a programming-language wrapper to pass parameters and handle datasets programmatically. Instead of using open source standards like Python, most databases created their own proprietary languages. PL/SQL is Oracle’s set of programming-language extensions on top of SQL. It augments the SQL language with procedural elements (parameters, variables, datasets as cursors, conditional statements, loops, exception blocks and so on). This proprietary language and its extensions were developed over time and have their own specifics that can make them difficult to convert to a standard, widely used, full-blown open source programming language like Python.

An example of this is the supplied PL/SQL packages (DBMS_ or UTL_ packages, etc.) and the user-defined types that can be used as column types (objects or collections defined as a column type), which makes migration quite complex. These Oracle-specific features and numerous others must be considered during code conversion to Apache Spark™.

Many organizations have created ETL data processing jobs by writing PL/SQL procedures and functions wrapped into packages that run against an Oracle database. You can convert these PL/SQL jobs to open source Python and Spark and run them in Databricks notebooks or Delta Live Tables, without any of the complexity of PL/SQL, on modern Databricks on-demand serverless compute.

Migrate PL/SQL code to PySpark for your ETL pipelines

An ETL process is mostly used for:

  • Ingesting data from multiple sources
  • Validating, cleansing and transforming data
  • Loading data into various data layers (bronze, silver, gold / Operational Data Store, Data Warehouse, Data Marts … depending on the data architecture)

In Oracle databases, PL/SQL is usually used to validate/transform the data in place. Depending on the Oracle database architecture, data moves through various containers, which could be a user schema and/or a pluggable database.
Here is an example of a typical Oracle database implementation that supports a Data Warehouse using PL/SQL for ETL.

Typical Oracle database implementation that supports a Data Warehouse using PL/SQL for ETL.

Moving from Oracle and PL/SQL to the Databricks Lakehouse leverages several key capabilities:

  • PySpark provides a standard library in Python with the capability to process various data sources at scale directly into the ODS, without having to materialize a table in the staging area. This can be done with a Python notebook scheduled regularly with Databricks Workflows.
  • Delta Live Tables delivers the capability to implement the whole ETL pipeline in either a Python or SQL notebook, with data quality controls (Manage data quality with Delta Live Tables), and can process data either in batch or streaming (Process streaming data with Delta Live Tables); a minimal code sketch follows the pipeline example below.
Delta Live Tables pipeline example
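
To make the second point concrete, here is a minimal Delta Live Tables sketch in Python. The source path, table names, and quality rule are hypothetical placeholders rather than part of the original example; a real pipeline would point at your own landing zone and business rules.

import dlt
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw files with Auto Loader (path is a placeholder)
@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders")
    )

# Silver: enforce a simple data quality expectation and standardize a column
@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn(
        "order_ts", F.to_timestamp("order_ts")
    )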

Regardless of the feature used, PL/SQL logic is migrated to Python code or SQL. For example, PL/SQL functions are translated into PySpark and called directly, or wrapped in a Python user-defined function (see this link on how to use Python UDFs in Delta Live Tables: Delta Live Tables cookbook).
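
For instance, a small PL/SQL helper could become a plain Python function, called directly in driver code or registered as a PySpark UDF for use in DataFrame transformations. The function name, logic, and column below are illustrative only:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Plain Python function standing in for a migrated PL/SQL helper (illustrative logic)
def format_prod_code(prod_id: int) -> str:
    return f"PRD-{prod_id:06d}"

# Register it as a UDF so it can be applied to a DataFrame column
format_prod_code_udf = F.udf(format_prod_code, StringType())

df = spark.table("products").withColumn("prod_code", format_prod_code_udf("prod_id"))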

Migrate PL/SQL code to Databricks SQL or Python UDFs

Databricks SQL is used to run many SQL workloads, one of which is running analytics queries on data hosted on the lakehouse. These analytics queries can require certain functions to be executed on those tables (data redaction, etc.).
In Oracle, these functions are usually implemented as PL/SQL functions or packaged functions, and they are migrated as part of the process.

Python UDFs on Databricks SQL extend traditional SQL workloads with the functionality of the Python language.

PL/SQL code migration samples

This section is dedicated to a few examples of code migration from Oracle PL/SQL to Databricks. Based on best practices and our recommendations, each example is tied to an implementation choice (PySpark in an ETL process, Python UDFs in a Databricks SQL analytics workload).

Dynamic cursors using the DBMS_SQL supplied Oracle package

In Oracle, a cursor is a pointer to a private SQL area that stores information about processing a specific SELECT or DML statement. A cursor that is constructed and managed by the Oracle kernel through PL/SQL is an implicit cursor. A cursor that you construct and manage yourself is an explicit cursor.

In Oracle, cursors can be parameterized by using dynamic strings for the SQL statement, but this approach can lead to SQL injection issues, which is why it is better to use the DBMS_SQL supplied PL/SQL package or EXECUTE IMMEDIATE statements to build dynamic statements. A cursor can very easily be converted to a Spark DataFrame.

The following example shows how to transform dynamic SQL statements built with an Oracle-supplied PL/SQL package into PySpark.

Here is the PL/SQL code in Oracle.


create or replace function get_prod_name(in_pid in number) return VARCHAR2
as
    sql_stmt        VARCHAR2(256);
    l_i_cursor_id   INTEGER;
    l_n_rowcount    NUMBER;
    l_vc_prod_name  VARCHAR2(30);
BEGIN
    sql_stmt := 'select prod_name from products where prod_id=:pid FETCH NEXT 1 ROWS ONLY';
    l_i_cursor_id := dbms_sql.open_cursor;

    dbms_sql.parse(l_i_cursor_id, sql_stmt, dbms_sql.native);

    dbms_sql.bind_variable(l_i_cursor_id, 'pid', in_pid);

    dbms_sql.define_column(l_i_cursor_id, 1, l_vc_prod_name, 30);

    l_n_rowcount := dbms_sql.execute_and_fetch(l_i_cursor_id);

    dbms_sql.column_value(l_i_cursor_id, 1, l_vc_prod_name);
    dbms_sql.close_cursor(l_i_cursor_id);

    return l_vc_prod_name;
END;
/

Here is the code that performs the same functionality in PySpark.


def get_prod_name(in_pid: int) -> str:
  sql_stmt = """select PROD_NAME
     from products
     where PROD_ID = {}
     LIMIT 1""".format(in_pid)
  ret = spark.sql(sql_stmt).collect()[0]["PROD_NAME"]
  return ret

# Function call
print(get_prod_name(20))
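
Note that building the statement with string formatting reintroduces the SQL injection concern that the DBMS_SQL bind variable avoided. A sketch of a safer equivalent, using the same table and column names but the DataFrame API instead of string interpolation, could look like this:

from pyspark.sql import functions as F

def get_prod_name_df(in_pid: int) -> str:
    # Filter with a column expression rather than concatenating the value into SQL text
    rows = (
        spark.table("products")
        .where(F.col("PROD_ID") == in_pid)
        .select("PROD_NAME")
        .limit(1)
        .collect()
    )
    return rows[0]["PROD_NAME"] if rows else None

print(get_prod_name_df(20))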

Collections migration to Python

In Oracle PL/SQL, many collection and record types exist.

| Collection Type | Number of Elements | Index Type | Dense or Sparse |
| Associative array (or index-by table) | Unspecified | String or Integer | Either |
| VARRAY (variable-size array) | Specified | Integer | Always dense |
| Nested table | Unspecified | Integer | Starts dense, can become sparse |

These can be migrated to Python constructs, regardless of whether they are executed as tasks using PySpark for ETL purposes or as Python UDFs in Databricks SQL.

An associative array is a set of key-value pairs. Each key is a unique index, used to locate the associated value with the syntax variable_name(index).

The data type of the index can be either a string type (VARCHAR2, VARCHAR, STRING, or LONG) or PLS_INTEGER. Indexes are stored in sort order, not creation order. The best way to migrate an associative array to Python (or PySpark) is to use a dictionary.

Here is what the code looks like in PL/SQL:


DECLARE
  -- Associative array indexed by string:

  TYPE population IS TABLE OF NUMBER
    INDEX BY VARCHAR2(64);

  city_population  population;
  i  VARCHAR2(64);
BEGIN
  -- Add elements (key-value pairs) to associative array:

  city_population('Smallville')  := 2000;
  city_population('Midland')     := 750000;
  city_population('Megalopolis') := 1000000;

  -- Change value associated with key 'Smallville':

  city_population('Smallville') := 2001;

  -- Print associative array:

  i := city_population.FIRST;  -- Get first element of array

  WHILE i IS NOT NULL LOOP
    DBMS_OUTPUT.PUT_LINE
      ('Population of ' || i || ' is ' || city_population(i));
    i := city_population.NEXT(i);  -- Get next element of array
  END LOOP;
END;
/

Below is an example of how to convert associative arrays from PL/SQL into Python:


# Declare a dictionary and add elements
city_population = {'Smallville': 2000, 'Midland': 750000, 'Megalopolis': 1000000}

# Modify an element
city_population['Smallville'] = 2001

# Get the first element of the dictionary
elt = list(city_population.keys())[0]

# Print dictionary contents
for k, v in city_population.items():
  print(f"Population of {k} is {v}")

Data redaction

At the semantic layer of a data warehouse, it is often necessary to redact sensitive data. Functions are very often used to implement this data redaction process.

In an Oracle database, you can use the Advanced Security Option for Data Redaction or PL/SQL code that implements the redaction. Both of these methods can be handled by our migration teams, but if the source database uses PL/SQL for this, the best solution is to use Python UDFs in Databricks SQL.

Python UDFs allow users to write Python code and invoke it through a SQL function in an easy, secure and fully governed manner, bringing the power of Python to Databricks SQL.

In the following example, we translate a PL/SQL function that redacts product names when the list price is greater than 100, using the Python UDF feature.

The code in PL/SQL is as below:


CREATE OR REPLACE FUNCTION simple_redaction(
  input VARCHAR2,
  price NUMBER)
return varchar2
as
BEGIN
  IF (price > 100) THEN
    RETURN SUBSTR(input,0,1)||'*****'||SUBSTR(input,-1);
  ELSE
    RETURN input;
  END IF;
END;
/

SQL> select simple_redaction(prod_name, list_price) as
     r_prod_name, list_price
     from product;

R_PROD_NAME		       LIST_PRICE
------------------------------ ----------
product 1			     10.5
p*****2 			    103.1
product 3			     5.99
product 4			    12.35
product 5			       35
e*****t 			     1400

The Python UDF is as below:


create or replace function simple_redaction(_input string,
                 _price float)
returns STRING
language python
as $$
if _price > 100:
  return (_input[0] + "*****" + _input[-1])
else:
  return _input
$$;



select simple_redaction(prod_name, list_price) as r_prod_name, list_price from product;

--------------------------------
| r_prod_name	| list_price   |
--------------------------------
| product 1	| 10.5         |
| p*****2	| 103.1        |
| product 3	| 5.99         | 
| product 4	| 12.35        |
| product 5	| 35.0         |
| e*****t	| 1400.0       |
--------------------------------

Planning your PL/SQL migration

Databricks and our SI/consulting partners can help you with a detailed technical migration assessment, which includes your target architecture, a technical assessment of your existing code (such as the number of objects to be migrated and their overall complexity classification), and technical approaches to data, code and report modernization. Our customers can execute the migration in-house manually, or accelerate it by using automated code conversion from PL/SQL to PySpark.

Automated migration approach

The Data Warehouse Migration practice at Databricks is flourishing, and we have several ISV and Consulting/SI partners who can assist with EDW migrations. Data ingestion partners like Fivetran, Qlik, and Arcion can help migrate the data in real time using CDC from Oracle to Databricks, and low-code/code-optional ETL partners like Matillion and Prophecy can also help if stored procedures need to be converted to visual ETL mappings. See the full list of our ISV partners here.

With the help of legacy platform assessment tools and automatic code conversion accelerators, Databricks Professional Services and several of our authorized Migrations Brickbuilder SI partners can also migrate PL/SQL code quickly and effectively to native Databricks notebooks.

Here is one example of an automated code conversion demo from PL/SQL to PySpark by BladeBridge, our ISV conversion partner.

LeapLogic is another partner that also has automated assessment and code converters from various EDWs to Databricks. Here is a demo of their Oracle conversion tool to Databricks.

Most consulting/SI partners use similar automated conversion tools, unless it is a full modernization and redesign.

Whether you choose to modernize your legacy Oracle EDW platform in-house or with the help of a consulting partner, Databricks migration specialists and the professional services team are here to help you along the way.
Please see this EDW Migration page for more information and partner migration options.

Feel free to reach out to the Databricks team for a customized Oracle migration assessment.

Get started on migrating your first pieces of code to Databricks

Try Databricks free for 14 days.



