
Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink


Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights and create business opportunities. In this post, we discuss the challenges relational databases present when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center's performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. A key performance indicator (KPI) of interest for a call center on a near-real-time platform could be the number of calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from the call center streams. These metrics help agents improve their call handle time and also help reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformation through stored procedures, with materialized views to curate datasets and generate insights, is a well-known pattern with relational databases. However, because data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation of certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of base tables that are constantly changing due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.
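To make the contention concrete, the following is a simplified sketch of the legacy refresh cycle, assuming a PostgreSQL-style database; the table and view names are hypothetical:

-- Hypothetical legacy refresh cycle, scheduled every few seconds
TRUNCATE TABLE call_center_events;                      -- keep only the latest events
INSERT INTO call_center_events
    SELECT * FROM staging_call_center_feed;             -- load the newest stream batch
REFRESH MATERIALIZED VIEW CONCURRENTLY calls_in_queue;  -- recompute the curated insight

Because TRUNCATE takes an exclusive lock on the base table and every refresh re-reads it, running this cycle every few seconds is exactly what produces the locking and snapshot isolation problems described above.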

The following diagram shows the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use it to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of them using Amazon Managed Service for Apache Flink to generate the KPIs the business requires, delivered to Amazon OpenSearch Service. You can analyze streaming data interactively using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio in near-real time. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources into Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers' questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls get into the right hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automated Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite doesn't route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer's call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time (the target time threshold). It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have multiple states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue, or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkStates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The frequency of the dimension dataset is twice a day, whereas the fact dataset gets loaded every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load the metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has the metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  2. Create a Lambda function on the Lambda console to load the metadata file from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json
    
    # Create AWS service clients
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = "<Kinesis Stream Name>"
        
    def lambda_handler(event, context):
      
      # Read the metadata file from Amazon S3
      object_key = "<S3 File Name.json>"  
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()
      
      # Decode the S3 object and parse the JSON array
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      records = json.loads(decoded_data)
      
      # Write each metadata record to the Kinesis data stream,
      # partitioned by the record's OrgID value
      for record in records:
        response = kinesis_client.put_record(
          StreamName=stream_name,
          Data=json.dumps(record),
          PartitionKey=str(record["OrgID"]))

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the fact data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

Event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event occurred. Processing time (PROCTIME) refers to the system time of the machine that is running the respective operation.
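For comparison, a processing-time attribute is declared as a computed column, as in the following minimal sketch (the table name is hypothetical, and the datagen connector is used only to make the example self-contained). The fact table DDL for this step follows the sketch.

%flink.ssql
CREATE TABLE proctime_example (
AgentId INT,
proc_time AS PROCTIME() -- processing-time attribute derived from the machine's system clock
)
WITH (
'connector' = 'datagen'
);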

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkStates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS Region, for example us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);
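To confirm that events are arriving, you can run a quick interactive query against the new table and watch rows from the KDG appear in the notebook:

%flink.ssql(type=update)
SELECT * FROM <Fact Table Name>;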

  2. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time AS CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS Region, for example us-east-1>',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  3. Create a versioned view to extract the latest version of the metadata table values to be joined with the fact table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID, OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table Name>)
    WHERE rownum = 1;

  4. Join the fact and versioned metadata tables on OrgID and create a view that provides the total calls in the queue in each organization in a span of every 5 seconds, for further reporting. Additionally, create a tumble window of 5 seconds to receive the final output every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end, metadata.OrgName, streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, COUNT(QueueId) AS CallsInQueue
      FROM TABLE(
        TUMBLE(TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) AS streamtable
    JOIN
        versioned_metadata metadata
        ON metadata.OrgID = streamtable.OrgID;

  5. Now you can run the following query against the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv WHERE jv.OrgName = 'Prem Hub';

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize them in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

DROP TABLE IF EXISTS active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID INT,
OrgName VARCHAR,
CallsInQueue BIGINT,
Agent_cnt BIGINT,
max_handle_time BIGINT,
min_handle_time BIGINT,
max_wait_time BIGINT,
min_wait_time BIGINT
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = '<Amazon OpenSearch host name>',
'index' = 'active_call_queue',
'username' = '<username>',
'password' = '<password>'
);
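Creating the sink table alone doesn't move any data; a continuous INSERT statement populates the index. The following is a minimal sketch built on the fact table and versioned_metadata view defined earlier. The additional aggregates (agent count and handle/wait time extremes) are illustrative, and depending on your connector version you may need a PRIMARY KEY on the sink table to accept updating results:

%flink.ssql(type=update)

INSERT INTO active_call_queue
SELECT s.window_start, s.window_end, s.OrgID, m.OrgName,
    s.CallsInQueue, s.Agent_cnt,
    s.max_handle_time, s.min_handle_time,
    s.max_wait_time, s.min_wait_time
FROM (
    SELECT window_start, window_end, OrgID,
        COUNT(QueueId) AS CallsInQueue,
        COUNT(DISTINCT AgentId) AS Agent_cnt,
        CAST(MAX(HandleTime) AS BIGINT) AS max_handle_time,
        CAST(MIN(HandleTime) AS BIGINT) AS min_handle_time,
        CAST(MAX(WaitingTime) AS BIGINT) AS max_wait_time,
        CAST(MIN(WaitingTime) AS BIGINT) AS min_wait_time
    FROM TABLE(
        TUMBLE(TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
    GROUP BY window_start, window_end, OrgID) AS s
JOIN versioned_metadata m
    ON m.OrgID = s.OrgID;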

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed how to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed the challenges relational databases present when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with a deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community on architecting and building cost-effective, secure, and scalable solutions driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.


