This post is co-written with Ramesh Daddala, Jitendra Kumar Dash, and Pavan Kumar Bijja from Bristol Myers Squibb.
Bristol Myers Squibb (BMS) is a global biopharmaceutical company whose mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases. BMS is consistently innovating, achieving significant clinical and regulatory successes. In collaboration with AWS, BMS identified a business need to migrate and modernize their custom extract, transform, and load (ETL) platform to a native AWS solution to reduce the complexity, resources, and investment required to upgrade when new Spark, Python, or AWS Glue versions are released. In addition to using native managed AWS services that BMS didn't need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine. AWS Glue Studio is a graphical interface that makes it straightforward to create, run, and monitor ETL jobs in AWS Glue. Offering this service reduced BMS's operational maintenance and cost, and gave business users the flexibility to perform ETL jobs with ease.
For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Although this framework met their ETL objectives, it was difficult to maintain and upgrade. BMS's EDLS platform hosts over 5,000 jobs and is growing at 15% year over year (YoY). Every time a newer version of Apache Spark (and the corresponding AWS Glue version) was released, upgrading existing ETL jobs required significant operational support and time-consuming manual changes. Manually upgrading, testing, and deploying over 5,000 jobs every few quarters was time consuming, error prone, costly, and not sustainable. Because another release of the EDLS framework was pending, BMS decided to evaluate alternate managed solutions to reduce their operational and upgrade challenges.
In this post, we share how BMS plans to modernize their ETL platform with AWS Glue Studio, building on the success of a proof of concept targeting BMS's ETL platform.
Solution overview
This solution addresses BMS's EDLS requirements: overcome the challenges of a custom-built ETL framework that required frequent maintenance and component upgrades (each with extensive testing cycles), avoid complexity, and reduce the overall cost of the underlying infrastructure validated in the proof of concept. BMS had the following goals:
- Develop ETL jobs using visual workflows provided by the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a low-code environment that allows you to compose data transformation workflows, seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine, and inspect the schema and data results in each step of the job.
- Migrate over 5,000 existing ETL jobs to native AWS Glue Studio in an automated and scalable manner.
EDLS job steps and metadata
Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework. Each job step performs one of the following ETL functions:
- File ingest – File ingestion enables you to ingest or list files from multiple file sources, like Amazon Simple Storage Service (Amazon S3), SFTP, and more. The metadata holds configurations for the file ingestion step to connect to Amazon S3 or SFTP endpoints and ingest files to the target location. It retrieves the specified files and the available metadata to show on the UI.
- Data quality check – The data quality module enables you to perform quality checks on a huge volume of data and generate reports that describe and validate the data quality. The data quality step uses an EDLS-ingested source object from Amazon S3 and runs one to many data conformance checks that are configured by the tenant.
- Data transform join – This is one of the submodules of the data transform module that can perform joins between datasets using a custom SQL query based on the metadata configuration.
- Database ingest – The database ingestion step is one of the important service components in EDLS, which enables you to obtain and import the desired data from a database and export it to a specific file in the location of your choice.
- Data transform – The data transform module performs various data transformations against the source data using JSON-driven rules. Each data transform capability has its own JSON rule and, based on the specific JSON rule you provide, EDLS performs the data transformation on the files available in the Amazon S3 location (a purely illustrative rule sketch follows this list).
- Data persistence – The data persistence module is one of the important service components in EDLS, which enables you to obtain the desired data from the source and persist it to an Amazon Relational Database Service (Amazon RDS) database.
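The actual EDLS rule schema isn't shown in this post, but a column-level transformation rule of this style might look like the following, expressed here as a Python dict; every key, value, and path is hypothetical:

```python
# Hypothetical EDLS-style JSON transform rule; the real EDLS schema is not
# documented in this post, so all keys, values, and paths are illustrative.
sample_transform_rule = {
    "ruleType": "column_transform",
    "source": "s3://example-bucket/landing/orders/",   # hypothetical S3 location
    "transformations": [
        {"column": "order_date", "operation": "to_date", "format": "yyyy-MM-dd"},
        {"column": "amount", "operation": "cast", "targetType": "decimal(18,2)"},
    ],
    "target": "s3://example-bucket/curated/orders/",   # hypothetical S3 location
}
```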
The metadata corresponding to each job step includes ingest sources, transformation rules, data quality checks, and data destinations stored in an RDS instance.
Migration utility
The solution involves building a Python utility that reads EDLS metadata from the RDS database and translates each job step into an equivalent AWS Glue Studio visual editor JSON node representation.
AWS Glue Studio provides two types of transforms:
- AWS Glue-native transforms – These are available to all users and are managed by AWS Glue.
- Custom visual transforms – This functionality allows you to upload custom-built transforms for use in AWS Glue Studio. Custom visual transforms extend the managed transforms, enabling you to search for and use transforms from the AWS Glue Studio interface.
The following is a high-level diagram depicting the sequence flow of migrating a BMS EDLS job to an AWS Glue Studio visual editor job.
Migrating BMS EDLS jobs to AWS Glue Studio includes the following steps:
- The Python utility reads existing metadata from the EDLS metadata database.
- For each job step type, based on the job metadata, the Python utility selects either the native AWS Glue transform, if available, or a custom-built visual transform (when the native functionality is missing).
- The Python utility parses the dependency information from the metadata and builds a JSON object representing the visual workflow as a directed acyclic graph (DAG).
- The JSON object is sent to the AWS Glue API, creating the AWS Glue ETL job. These jobs are visually represented in the AWS Glue Studio visual editor as a series of sources, transforms (native and custom), and targets.
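The following is a minimal sketch of this translation loop, assuming a MySQL-compatible metadata store; the table and column names, the step-type mapping, and the make_native_node and make_custom_transform_node helpers are hypothetical, not BMS's actual code:

```python
# Illustrative sketch of the metadata-to-node translation; the EDLS schema,
# step-type mapping, and helper functions are assumptions.
import json

import pymysql  # assuming a MySQL-compatible RDS metadata database

# Hypothetical mapping from EDLS step types to native AWS Glue Studio transforms
NATIVE_STEP_TYPES = {"data_transform_join": "Join", "data_quality_check": "EvaluateDataQuality"}

def build_nodes(job_id, conn):
    """Translate the steps of one EDLS job into Glue Studio JSON nodes."""
    nodes = {}
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(
            "SELECT step_id, step_type, config_json FROM edls_job_steps "
            "WHERE job_id = %s ORDER BY step_order",
            (job_id,),
        )
        for step in cur.fetchall():
            config = json.loads(step["config_json"])
            if step["step_type"] in NATIVE_STEP_TYPES:
                # A native transform exists; emit the corresponding node
                node = make_native_node(step["step_type"], config)            # hypothetical helper
            else:
                # Fall back to a custom visual transform node
                node = make_custom_transform_node(step["step_type"], config)  # hypothetical helper
            nodes[step["step_id"]] = node
    return nodes
```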
Sample ETL job generation using AWS Glue Studio
The following flow diagram depicts a sample ETL job that incrementally ingests the source RDBMS data in AWS Glue based on modified timestamps using a custom SQL query and merges it into the target data on Amazon S3.
The preceding ETL flow can be represented in the AWS Glue Studio visual editor using a combination of native and custom visual transforms.
Custom visual transform for incremental ingestion
Post POC, BMS and AWS identified the need for custom transforms to run a subset of jobs from their existing EDLS service where AWS Glue Studio functionality is not a natural fit. The BMS team's requirement was to ingest data from various databases without relying on the existence of transaction logs or a specific schema, so AWS Database Migration Service (AWS DMS) wasn't an option for them. AWS Glue Studio provides the native SQL query visual transform, where a custom SQL query can be used to transform the source data. However, in order to query the source database table based on a modified timestamp column to retrieve new and changed records since the last ETL run, the previous timestamp column state needs to be persisted so it can be used in the current ETL run. This needs to be a recurring process, and it must also be abstracted across various RDBMS sources, including Oracle, MySQL, Microsoft SQL Server, SAP HANA, and more.
AWS Glue provides a job bookmark feature to track data that has already been processed during a previous ETL run. An AWS Glue job bookmark supports one or more columns as the bookmark keys to determine new and processed data, and it requires that the keys are sequentially increasing or decreasing without gaps. Although this works for many incremental load use cases, the requirement is to ingest data from different sources without relying on any specific schema, so we didn't use an AWS Glue job bookmark in this use case.
The SQL-based incremental ingestion pull can be developed in a generic way using a custom visual transform, illustrated here with a sample incremental ingestion job from a MySQL database. The incremental data is merged into the target Amazon S3 location in Apache Hudi format using an upsert write operation.
In the following example, we use the MySQL data source node to define the connection, but the DynamicFrame of the data source itself is not used. The custom transform node (DB incremental ingestion) acts as the source for reading the data incrementally using the custom SQL query and the previously persisted timestamp from the last ingestion.
The transform accepts as input parameters the preconfigured AWS Glue connection name, database type, table name, and custom SQL (with a parameterized timestamp field).
The following is a minimal sketch of the visual transform Python code, assuming a DynamoDB table persists the last ingested timestamp; the state table name and the SQL placeholder are assumptions:
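```python
# Minimal sketch of the "DB incremental ingestion" custom visual transform.
# The DynamoDB state table name and the {last_ingested_ts} SQL placeholder
# are assumptions; adapt them to your own state-tracking convention.
import boto3
from awsglue import DynamicFrame

def db_incremental_ingestion(self, connection_name, database_type, table_name, custom_sql):
    glue_ctx = self.glue_ctx

    # Look up the timestamp persisted by the previous run (hypothetical table name)
    state_table = boto3.resource("dynamodb").Table("edls_ingestion_state")
    item = state_table.get_item(Key={"table_name": table_name}).get("Item", {})
    last_ts = item.get("last_ingested_ts", "1900-01-01 00:00:00")

    # Resolve JDBC settings from the preconfigured AWS Glue connection;
    # database_type could select a vendor-specific driver (omitted in this sketch)
    jdbc_conf = glue_ctx.extract_jdbc_conf(connection_name)

    # Substitute the persisted timestamp into the parameterized custom SQL
    query = custom_sql.format(last_ingested_ts=last_ts)

    df = (
        glue_ctx.spark_session.read.format("jdbc")
        .option("url", jdbc_conf["url"])
        .option("user", jdbc_conf["user"])
        .option("password", jdbc_conf["password"])
        .option("query", query)
        .load()
    )
    return DynamicFrame.fromDF(df, glue_ctx, f"{table_name}_incremental")

# Custom visual transforms are exposed as DynamicFrame methods
DynamicFrame.db_incremental_ingestion = db_incremental_ingestion
```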
To merge the source data into the Amazon S3 target, you can use a data lake framework like Apache Hudi or Apache Iceberg, both of which are natively supported in AWS Glue 3.0 and later.
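For example, the incremental DynamicFrame could be upserted to Amazon S3 in Hudi format along the following lines; this is a sketch only, where the table name, record key, precombine field, and S3 path are assumptions, and the Glue job needs the --datalake-formats hudi job parameter:

```python
# Sketch of a Hudi upsert of the incremental data; field names and the
# S3 path are illustrative, not BMS's actual configuration.
hudi_options = {
    "hoodie.table.name": "orders",                               # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",       # hypothetical
    "hoodie.datasource.write.precombine.field": "modified_ts",   # hypothetical
}

(
    incremental_dyf.toDF()            # DynamicFrame returned by the custom transform
    .write.format("hudi")
    .options(**hudi_options)
    .mode("append")                   # append mode performs the upsert with Hudi
    .save("s3://example-bucket/curated/orders/")
)
```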
You can also use Amazon EventBridge to detect the final AWS Glue job state change and update the last ingested timestamp in the Amazon DynamoDB table accordingly.
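The following is a minimal sketch of such a rule created with Boto3; the event pattern fields are the ones AWS Glue emits, while the rule name is illustrative, and a rule target (such as an AWS Lambda function) would perform the DynamoDB update:

```python
# Sketch: match successful AWS Glue job runs with an EventBridge rule.
# "aws.glue" / "Glue Job State Change" are the fields Glue emits;
# the rule name is hypothetical.
import json

import boto3

events = boto3.client("events")
events.put_rule(
    Name="edls-glue-job-succeeded",
    EventPattern=json.dumps(
        {
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {"state": ["SUCCEEDED"]},
        }
    ),
)
# A rule target (for example, a Lambda function) would then write the
# job's last ingested timestamp back to the DynamoDB state table.
```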
Build the AWS Glue Studio job using the AWS SDK for Python (Boto3) and the AWS Glue API
For the sample ETL flow and the corresponding AWS Glue Studio ETL job shown earlier, the underlying CodeGenConfigurationNode struct (part of the AWS Glue job definition, retrieved using the AWS Command Line Interface (AWS CLI) command aws glue get-job --job-name <jobname>) is represented as a JSON object, as sketched in the following code:
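The node IDs, names, and values in this abbreviated sketch are illustrative (expressed here as a Python dict), and the real definition contains additional fields omitted for brevity:

```python
# Abbreviated sketch of a CodeGenConfigurationNodes map for the sample job;
# node IDs, names, paths, and values are illustrative, with many fields omitted.
codegen_nodes = {
    "node-1": {
        "MySQLCatalogSource": {
            "Name": "MySQL source",
            "Database": "edls_db",    # hypothetical catalog database
            "Table": "orders",        # hypothetical table
        }
    },
    "node-2": {
        "DynamicTransform": {         # the custom visual transform node
            "Name": "DB incremental ingestion",
            "TransformName": "db_incremental_ingestion",
            "FunctionName": "db_incremental_ingestion",
            "Path": "s3://example-bucket/transforms/db_incremental_ingestion.py",  # hypothetical
            "Inputs": ["node-1"],
            "Parameters": [],         # connection name, table, custom SQL, and so on
        }
    },
    "node-3": {
        "S3HudiDirectTarget": {
            "Name": "S3 Hudi target",
            "Inputs": ["node-2"],
            "Path": "s3://example-bucket/curated/orders/",  # hypothetical
            "Format": "hudi",
            "Compression": "snappy",
        }
    },
}
```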
The JSON object (the ETL job DAG) represented in the CodeGenConfigurationNode is generated through a series of native and custom transforms with their respective input parameter arrays. This can be achieved using Python JSON encoders that serialize the class objects to JSON, then creating the AWS Glue Studio visual editor job using the Boto3 library and the AWS Glue API.
Inputs required to configure the AWS Glue transforms are sourced from the EDLS job metadata database. The Python utility reads the metadata information, parses it, and configures the nodes automatically.
The order and sequencing of the nodes is sourced from the EDLS job metadata, with one node becoming the input to one or more downstream nodes to build the DAG flow.
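A minimal sketch of the job creation call, assuming the codegen_nodes map from the earlier example and an existing IAM role, where the job name, role ARN, and script location are hypothetical placeholders:

```python
# Sketch: create the AWS Glue Studio visual job from the generated DAG.
# Job name, role ARN, and script location are hypothetical placeholders.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="edls-migrated-orders-job",
    Role="arn:aws:iam::111122223333:role/ExampleGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/edls-migrated-orders-job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    # Passing the node map keeps the job editable in the visual editor
    CodeGenConfigurationNodes=codegen_nodes,
)
```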
Benefits of the solution
The migration path will help BMS achieve their core objectives of decomposing their existing custom ETL framework into modular, visually configurable, less complex, and easily manageable pipelines using visual ETL components. The utility aids the migration of the legacy ETL pipelines to native AWS Glue Studio jobs in an automated and scalable manner.
With consistent out-of-the-box visual ETL transforms in the AWS Glue Studio interface, BMS will be able to build sophisticated data pipelines without having to write code.
The custom visual transforms will extend AWS Glue Studio capabilities and fulfill several of the BMS ETL requirements where the native transforms lack that functionality. Custom transforms will help define, reuse, and share business-specific ETL logic among all the teams. The solution increases consistency between teams and keeps the ETL pipelines up to date by minimizing duplicate effort and code.
With minor modifications, the migration utility can be reused to automate migration of pipelines during future AWS Glue version upgrades.
Conclusion
The successful outcome of this proof of concept has shown that migrating over 5,000 jobs from BMS's custom application to native AWS services can deliver significant productivity gains and cost savings. By moving to AWS, BMS will be able to reduce the effort required to support AWS Glue, improve DevOps delivery, and save an estimated 58% on AWS Glue spend.
These results are very promising, and BMS is excited to embark on the next phase of the migration. We believe that this project will have a positive impact on BMS's business and help us achieve our strategic goals.
About the authors
Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services. Siva leads customer engagements related to real-time streaming solutions, data lakes, and analytics using open source and AWS services. Siva enjoys listening to music and likes to spend time with his family.
Dan Gibbar is a Senior Engagement Manager at AWS Professional Services. Dan leads healthcare and life science engagements, collaborating with customers and partners to deliver outcomes. Dan enjoys the outdoors, attempting triathlons, music, and spending time with family.
Shrinath Parikh is a Senior Cloud Data Architect with AWS. He works with customers around the globe to assist them with their data analytics, data lake, data lakehouse, serverless, governance, and NoSQL use cases. In his off time, Shrinath enjoys traveling, spending time with family, and learning and building new tools using cutting-edge technologies.
Ramesh Daddala is an Associate Director at BMS. Ramesh leads enterprise data engineering engagements related to Enterprise Data Lake Services (EDLS), collaborating with data partners to deliver and support enterprise data engineering and ML capabilities. Ramesh enjoys the outdoors and traveling, and likes to spend time with family.
Jitendra Kumar Dash is a Senior Cloud Architect at BMS with expertise in hybrid cloud services, infrastructure engineering, DevOps, data engineering, and data analytics solutions. He is passionate about food, sports, and travel.
Pavan Kumar Bijja is a Senior Data Engineer at BMS. Pavan provides data engineering and analytical services to the BMS Commercial domain using enterprise capabilities. Pavan leads enterprise metadata capabilities at BMS. Pavan loves spending time with his family and playing badminton and cricket.
Shovan Kanjilal is a Senior Data Lake Architect working with strategic accounts in AWS Professional Services. Shovan works with customers to design data and machine learning solutions on AWS.