
Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg


Backtesting is a process used in quantitative finance to evaluate trading strategies using historical data. It helps traders determine the potential profitability of a strategy and identify any risks associated with it, enabling them to optimize it for better performance.

Index rebalancing arbitrage takes advantage of short-term price discrepancies resulting from ETF managers' efforts to minimize index tracking error. Major market indexes, such as the S&P 500, are subject to periodic inclusions and exclusions for reasons beyond the scope of this post (for an example, refer to CoStar Group, Invitation Homes Set to Join S&P 500; Others to Join S&P 100, S&P MidCap 400, and S&P SmallCap 600). The arbitrage trade seeks to profit by going long on stocks added to an index and shorting those that are removed, with the goal of generating profit from these price differences.

In this post, we look into the process of using backtesting to evaluate the performance of an index arbitrage profitability strategy. We specifically explore how Amazon EMR and the newly developed Apache Iceberg branching and tagging feature can address the challenge of look-ahead bias in backtesting. This enables a more accurate evaluation of the performance of the index arbitrage profitability strategy.

Terminology

Let's first discuss some of the terminology used in this post:

  • Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. Amazon Simple Storage Service (Amazon S3) is a popular cloud-based object storage service that can be used as the foundation for building a data lake.
  • Apache Iceberg – Apache Iceberg is an open-source table format designed to provide efficient, scalable, and secure access to large datasets. It provides features such as ACID transactions on top of Amazon S3-based data lakes, schema evolution, partition evolution, and data versioning. With scalable metadata indexing, Apache Iceberg is able to deliver performant queries to a variety of engines, such as Spark and Athena, by reducing planning time.
  • Look-ahead bias – This is a common challenge in backtesting that occurs when future information is inadvertently included in the historical data used to test a trading strategy, leading to overly optimistic results.
  • Iceberg tags – The Iceberg branching and tagging feature allows users to tag specific snapshots of their data tables with meaningful labels using SQL syntax or the Iceberg library, which correspond to specific events notable to internal investment teams. This, combined with Iceberg's time travel functionality, ensures that accurate data enters the research pipeline and guards it from hard-to-detect problems such as look-ahead bias.

Testing scope

For our testing purposes, consider the following example, in which a change to the S&P Dow Jones Indices is announced on September 2, 2022, becomes effective on September 19, 2022, and doesn't become observable in the ETF holdings data that we will be using in the experiment until September 30, 2022. We use Iceberg tags to label market data snapshots in order to avoid look-ahead bias in the research data lake, which will enable us to test various trade entry and exit scenarios and assess the respective profitability of each.

Experiment

As part of our experiment, we utilize a paid, third-party data provider API to identify SPY ETF holdings changes and construct a portfolio. Our model portfolio will buy stocks that are added to the index, known as going long, and will sell an equivalent amount of stocks removed from the index, known as going short.

We test short-term holding periods, such as 1 day and 1, 2, 3, or 4 weeks, because we assume that the rebalancing effect is very short-lived and that new information, such as macroeconomics, will drive performance beyond the studied time horizons. Lastly, we simulate different entry points for this trade:

  • Market open the day after announcement day (AD+1)
  • Market close of effective date (ED0)
  • Market open the day after ETF holdings registered the change (MD+1)

Research data lake

To run our experiment, we used the following research data lake environment.

As shown in the architecture diagram, the research data lake is built on Amazon S3 and managed using Apache Iceberg, an open table format that brings the reliability and simplicity of relational database management system (RDBMS) tables to data lakes. To avoid look-ahead bias in backtesting, it's essential to create snapshots of the data at different points in time. However, managing and organizing these snapshots can be challenging, especially when dealing with a large volume of data.

This is where the tagging feature in Apache Iceberg comes in handy. With tagging, researchers can create differently named snapshots of market data and track changes over time. For example, they can create a snapshot of the data at the end of each trading day and tag it with the date and any relevant market conditions.

By using tags to organize the snapshots, researchers can easily query and analyze the data based on specific market conditions or events, without having to worry about the specific dates of the data. This can be particularly useful when conducting research that isn't time-sensitive or when looking for trends over long periods of time.

Additionally, the tagging feature can also help with other aspects of data management, such as data retention for GDPR compliance, and maintaining lineages of the table via different branches. Researchers can use Apache Iceberg tagging to ensure the integrity and accuracy of their data while also simplifying data management.
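For illustration, the following minimal sketch shows this pattern in Spark SQL with the Iceberg extensions enabled; it references the etf_holdings table created later in this post, and the tag name, branch name, and retention periods are illustrative assumptions:

# Illustrative: tag the current snapshot as an end-of-day view and keep it for a year
spark.sql("""
ALTER TABLE glue_catalog.quant.etf_holdings
CREATE TAG EOD_2022_09_02 RETAIN 365 DAYS
""")

# Illustrative: maintain a separate branch for ongoing ingestion or audit work
spark.sql("""
ALTER TABLE glue_catalog.quant.etf_holdings
CREATE BRANCH ingest_audit RETAIN 30 DAYS
""")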

Prerequisites

To follow along with this walkthrough, you must have the following:

  • An AWS account with an IAM role that has sufficient access to provision the required resources.
  • To comply with licensing considerations, we cannot provide a sample of the ETF constituents data. Therefore, it must be purchased separately for dataset onboarding purposes.

Solution overview

To set up and test this experiment, we complete the following high-level steps:

  1. Create an S3 bucket.
  2. Load the dataset into Amazon S3. For this post, the ETF data referred to was obtained via an API call through a third-party provider, but you can also consider the following options:
    1. You can use the following prescriptive guidance, which describes how to automate data ingestion from various data providers into a data lake in Amazon S3 using AWS Data Exchange.
    2. You can also use AWS Data Exchange to select from a range of third-party dataset providers. It simplifies the use of data files, tables, and APIs for your specific needs.
    3. Lastly, you can also refer to the following post on how to use AWS Data Exchange for Amazon S3 to access data from a provider bucket: Analyzing impact of regulatory reform on the stock market using AWS and Refinitiv data.
  3. Create an EMR cluster. You can use this Getting Started with EMR tutorial, or use the AWS CDK stack we used to deploy an EMR on EKS environment with a custom managed endpoint.
  4. Create an EMR notebook using EMR Studio. For our testing environment, we used a custom-built Docker image, which includes Iceberg v1.3. For instructions on attaching a cluster to a Workspace, refer to Attach a cluster to a Workspace.
  5. Configure a Spark session. You can follow along via the following sample notebook; a sketch of an Iceberg-enabled session configuration is shown after this list.
  6. Create an Iceberg table and load the test data from Amazon S3 into the table.
  7. Tag this data to preserve a snapshot of it.
  8. Perform updates to our test data and tag the updated dataset.
  9. Run simulated backtesting on our test data to find the most profitable entry point for a trade.
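The exact session settings depend on your EMR deployment and the sample notebook, but the following minimal sketch shows an Iceberg-enabled Spark session backed by the AWS Glue Data Catalog. The catalog name (glue_catalog) matches the tables used later in this post, while the application name and warehouse path are placeholder assumptions.

# Minimal sketch of an Iceberg-enabled Spark session using the AWS Glue Data Catalog.
# The warehouse path and application name are placeholders; align them with your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("quant-research")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://substitute_your_bucket/warehouse/")
    .getOrCreate()
)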

Create the experiment environment

We can get up and running with Iceberg by creating a table via Spark SQL from an existing view, as shown in the following code:

spark.sql("""
CREATE TABLE glue_catalog.quant.etf_holdings 
USING iceberg OPTIONS ('format-version'='2') 
LOCATION 's3://substitute_your_bucket/etf_holdings/' 
AS SELECT * FROM 2022Q1
""")
spark.sql("""
SELECT symbol, date, acceptanceTime, status
FROM glue_catalog.quant.etf_holdings
""").show()

+------+----------+-------------------+------+
|symbol|      date|     acceptanceTime|status|
+------+----------+-------------------+------+
|   HON|2022-03-31|2022-05-27 13:54:03|   new|
|   DFS|2022-03-31|2022-05-27 13:54:03|   new|
|   FMC|2022-03-31|2022-05-27 13:54:03|   new|
|  NDSN|2022-03-31|2022-05-27 13:54:03|   new|
|   CRL|2022-03-31|2022-05-27 13:54:03|   new|
|  EPAM|2022-03-31|2022-05-27 13:54:03|   new|
|  CSCO|2022-03-31|2022-05-27 13:54:03|   new|
|   ALB|2022-03-31|2022-05-27 13:54:03|   new|
|   AIZ|2022-03-31|2022-05-27 13:54:03|   new|
|   CRM|2022-03-31|2022-05-27 13:54:03|   new|
|  PENN|2022-03-31|2022-05-27 13:54:03|   new|
|  INTU|2022-03-31|2022-05-27 13:54:03|   new|
|   DOW|2022-03-31|2022-05-27 13:54:03|   new|
|   LHX|2022-03-31|2022-05-27 13:54:03|   new|
|   BLK|2022-03-31|2022-05-27 13:54:03|   new|
|  ZBRA|2022-03-31|2022-05-27 13:54:03|   new|
|   UPS|2022-03-31|2022-05-27 13:54:03|   new|
|    DG|2022-03-31|2022-05-27 13:54:03|   new|
|  DISH|2022-03-31|2022-05-27 13:54:03|   new|
|      |2022-03-31|2022-05-27 13:54:03|   new|
+------+----------+-------------------+------+

Now that we've created an Iceberg table, we can use it for investment research. One of the key features of Iceberg is its support for scalable data versioning. This means we can easily track changes to our data and roll back to previous versions without making additional copies. Because this data gets updated periodically, we want to be able to create named snapshots of the data so that quant traders have easy access to consistent snapshots of data that have their own retention policy. In this case, let's tag the dataset to indicate that it represents the ETF holdings data as of Q1 2022:

spark.sql("""
ALTER TABLE glue_catalog.quant.etf_holdings CREATE TAG Q1_2022
""")

As we move forward in time and new data becomes available by Q3, we may need to update existing datasets to reflect these changes. In the following example, we first use an UPDATE statement to mark the stocks as expired in the existing ETF holdings dataset. Then we use the MERGE INTO statement based on matching conditions such as ISIN code. If a match isn't found between the existing dataset and the new dataset, the new data will be inserted as new records in the table and the status code will be set to 'new' for those records. Similarly, if the existing dataset has stocks that aren't present in the new dataset, those records will remain expired with a status code of 'expired'. Finally, for records where a match is found, the data in the existing dataset will be updated with the data from the new dataset, and the record will have an unchanged status code. With Iceberg's support for efficient data versioning and transactional consistency, we can be confident that our data updates will be applied correctly and without data corruption.

spark.sql("""
UPDATE glue_catalog.quant.etf_holdings
SET status = 'expired'
""")
spark.sql("""
MERGE INTO glue_catalog.quant.etf_holdings t
USING (SELECT * FROM 2022Q3) s
ON t.isin = s.isin
WHEN MATCHED THEN
    UPDATE SET t.acceptanceTime = s.acceptanceTime,
               t.date = s.date,
               t.balance = s.balance,
               t.valUsd = s.valUsd,
               t.pctVal = s.pctVal,
               t.status = 'unchanged'
WHEN NOT MATCHED THEN INSERT *
""")

Because we now have a new version of the data, we use Iceberg tagging to provide isolation for each new version of the data. In this case, we tag it as Q3_2022 and allow quant traders and other users to work on this snapshot of the data without being affected by ongoing updates to the pipeline:

spark.sql("""
ALTER TABLE glue_catalog.quant.etf_holdings CREATE TAG Q3_2022""")

This makes it very easy to see which stocks are being added and deleted. We can use Iceberg's time travel feature to read the data at a given quarterly tag. First, let's look at which stocks are added to the index; these are the rows that are in the Q3 snapshot but not in the Q1 snapshot. Then we look at which stocks are removed; these are the rows that are in the Q1 snapshot but not in the Q3 snapshot:

spark.sql("""
SELECT symbol, isin, acceptanceTime, date 
FROM glue_catalog.quant.etf_holdings 
VERSION AS OF 'Q3_2022' EXCEPT 
SELECT symbol, isin, acceptanceTime, date 
FROM glue_catalog.quant.etf_holdings 
VERSION AS OF 'Q1_2022'
""").show()

+------+------------+-------------------+----------+
|symbol|        isin|     acceptanceTime|      date|
+------+------------+-------------------+----------+
|   CPT|US1331311027|2022-11-28 15:50:55|2022-09-30|
|  CSGP|US22160N1090|2022-11-28 15:50:55|2022-09-30|
|  EMBC|US29082K1051|2022-11-28 15:50:55|2022-09-30|
|  INVH|US46187W1071|2022-11-28 15:50:55|2022-09-30|
|     J|US46982L1089|2022-11-28 15:50:55|2022-09-30|
|   KDP|US49271V1008|2022-11-28 15:50:55|2022-09-30|
|    ON|US6821891057|2022-11-28 15:50:55|2022-09-30|
|  VICI|US9256521090|2022-11-28 15:50:55|2022-09-30|
|   WBD|US9344231041|2022-11-28 15:50:55|2022-09-30|
+------+------------+-------------------+----------+

spark.sql("""
SELECT symbol, isin, acceptanceTime, date 
FROM glue_catalog.quant.etf_holdings 
VERSION AS OF 'Q1_2022' EXCEPT 
SELECT symbol, isin, acceptanceTime, date 
FROM glue_catalog.quant.etf_holdings 
VERSION AS OF 'Q3_2022'
""").show()

+------+------------+-------------------+----------+
|symbol|        isin|     acceptanceTime|      date|
+------+------------+-------------------+----------+
|  PENN|US7075691094|2022-05-27 13:54:03|2022-03-31|
|    UA|US9043112062|2022-05-27 13:54:03|2022-03-31|
|   UAA|US9043111072|2022-05-27 13:54:03|2022-03-31|
|   LTP|US7127041058|2022-05-27 13:54:03|2022-03-31|
| DISCA|US25470F1049|2022-05-27 13:54:03|2022-03-31|
|  CERN|US1567821046|2022-05-27 13:54:03|2022-03-31|
|  IPGP|US44980X1090|2022-05-27 13:54:03|2022-03-31|
|      |US25470F3029|2022-05-27 13:54:03|2022-03-31|
|     J|US4698141078|2022-05-27 13:54:03|2022-03-31|
|   PVH|US6936561009|2022-05-27 13:54:03|2022-03-31|
+------+------------+-------------------+----------+

Now we use the delta obtained in the preceding code to backtest the following strategy. As part of the index rebalancing arbitrage process, we go long on stocks that are added to the index and short stocks that are removed from the index, and we test this strategy for both the effective date and announcement date. As a proof of concept from the two different lists, we picked PVH and PENN as removed stocks, and CSGP and INVH as added stocks.

To follow along with the examples below, you will need to use the notebook provided in the Quant Research example GitHub repository.

(Figure: Cumulative returns comparison.)
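The backtest below expects daily open and close price frames (historical_open_prices and historical_close_prices) sharing an index (historical_prices_pd). The notebook in the repository builds these from the research data lake; as a stand-in, the following sketch assumes the yfinance package and pulls the same four symbols, with a date range chosen to cover the simulated entry and exit points:

# Stand-in price loader, assuming the yfinance package; the repository notebook
# sources these frames from the research data lake instead.
import yfinance as yf

symbols = ["PENN", "PVH", "INVH", "CSGP"]
raw = yf.download(symbols, start="2022-08-01", end="2022-11-15")

# Keep the column order aligned with the order sizes defined in the backtest below
historical_open_prices = raw["Open"][symbols]
historical_close_prices = raw["Close"][symbols]
historical_prices_pd = historical_close_prices  # shared daily index for the entry/exit masks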

import numpy as np
import pandas as pd
import vectorbt as vbt

def backtest(entry_point="2022-09-02", exit_point="2022-10-31"):
    open_position = (historical_prices_pd.index == entry_point)
    close_position = (historical_prices_pd.index == exit_point)

    CASH = 100000
    COMMPERC = 0.000

    symbol_cols = pd.Index(['PENN', 'PVH', 'INVH', 'CSGP'], name="symbol")
    order_size = pd.DataFrame(index=historical_prices_pd.index, columns=symbol_cols)
    order_size['PENN'] = np.nan
    order_size['PVH'] = np.nan
    order_size['INVH'] = np.nan
    order_size['CSGP'] = np.nan

    # short
    order_size.loc[open_position, 'PENN'] = -10
    order_size.loc[close_position, 'PENN'] = 0

    order_size.loc[open_position, 'PVH'] = -10
    order_size.loc[close_position, 'PVH'] = 0

    # long
    order_size.loc[open_position, 'INVH'] = 10
    order_size.loc[close_position, 'INVH'] = 0

    order_size.loc[open_position, 'CSGP'] = 10
    order_size.loc[close_position, 'CSGP'] = 0

    # Execute on the next bar
    order_size = order_size.vbt.fshift(1)

    portfolio = vbt.Portfolio.from_orders(
            historical_close_prices,  # current close as reference price
            size=order_size,
            price=historical_open_prices,  # current open as execution price
            size_type="targetpercent",
            val_price=historical_close_prices.vbt.fshift(1),  # previous close as group valuation price
            init_cash=CASH,
            allow_partial=False,
            fees=COMMPERC,
            direction='both',
            cash_sharing=True,  # share capital between assets in the same group
            group_by=True,  # all columns belong to the same group
            call_seq='auto',  # sell before buying
            freq='d'  # index frequency for annualization
    )
    return portfolio

portfolio = backtest('2022-09-02', '2022-10-31')

portfolio.orders.records_readable.head(20)

The following table represents the portfolio orders records:

Order Id  Column        Timestamp   Size          Price  Fees  Side
0         (PENN, PENN)  2022-09-06  31948.881789  31.66  0.0   Sell
1         (PVH, PVH)    2022-09-06  18321.729571  55.15  0.0   Sell
2         (INVH, INVH)  2022-09-06  27419.797094  38.20  0.0   Buy
3         (CSGP, CSGP)  2022-09-06  14106.361969  75.00  0.0   Buy
4         (CSGP, CSGP)  2022-11-01  14106.361969  83.70  0.0   Sell
5         (INVH, INVH)  2022-11-01  27419.797094  31.94  0.0   Sell
6         (PVH, PVH)    2022-11-01  18321.729571  52.95  0.0   Buy
7         (PENN, PENN)  2022-11-01  31948.881789  34.09  0.0   Buy
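To produce the comparison discussed in the next section, you can sweep the backtest over candidate entry dates and holding periods and collect the group Sharpe ratio of each run. The following is a minimal sketch under stated assumptions: the entry dates are derived from the announcement, effective, and holdings-observation dates in this post, the holding periods are illustrative, and exits that land on non-trading days would need to be shifted to the next trading day.

# Illustrative sweep; entry dates follow AD+1, ED0, and MD+1 from this post,
# holding periods are examples, and calendar handling is simplified.
entry_points = {"AD+1": "2022-09-06", "ED0": "2022-09-19", "MD+1": "2022-10-03"}
holding_periods = [1, 5, 10, 15, 20]  # business days

sharpe = {}
for label, entry in entry_points.items():
    for days in holding_periods:
        exit_date = (pd.Timestamp(entry) + pd.tseries.offsets.BDay(days)).strftime("%Y-%m-%d")
        pf = backtest(entry, exit_date)
        sharpe[(label, days)] = pf.sharpe_ratio()

print(pd.Series(sharpe).unstack())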

Experimentation findings

The following table shows Sharpe ratios for various holding periods and two different trade entry points: announcement and effective dates.

(Table: Sharpe ratios for each holding period and entry point.)

The data suggests that the effective date is the most profitable entry point across most holding periods, whereas the announcement date is an effective entry point for short-term holding periods (5 calendar days, 2 business days). Because the results are obtained from testing a single event, this isn't statistically significant enough to accept or reject a hypothesis that index rebalancing events can be used to generate consistent alpha. The infrastructure we used for our testing can be used to run the same experiment required to do hypothesis testing at scale, but index constituents data isn't readily available.

Conclusion

In this post, we demonstrated how the use of backtesting and the Apache Iceberg tagging feature can provide valuable insights into the performance of index arbitrage profitability strategies. By using a scalable Amazon EMR on Amazon EKS stack, researchers can easily handle the entire investment research lifecycle, from data collection to backtesting. Additionally, the Iceberg tagging feature can help address the challenge of look-ahead bias, while also providing benefits such as data retention control for GDPR compliance and maintaining lineage of the table via different branches. The experiment findings demonstrate the effectiveness of this approach in evaluating the performance of index arbitrage strategies and can serve as a useful guide for researchers in the finance industry.


About the Authors

Boris Litvin is a Principal Solution Architect, responsible for financial services industry innovation. He is a former quant and FinTech founder, and is passionate about systematic investing.

Guy Bachar is a Solutions Architect at AWS, based in New York. He accompanies greenfield customers and helps them get started on their cloud journey with AWS. He is passionate about identity, security, and unified communications.

Noam Ouaknine is a Technical Account Manager at AWS, based in Florida. He helps enterprise customers develop and achieve their long-term strategy through technical guidance and proactive planning.

Sercan Karaoglu is a Senior Solutions Architect, specialized in capital markets. He is a former data engineer and is passionate about quantitative investment research.

Jack Ye is a software engineer on the Athena Data Lake and Storage team. He is an Apache Iceberg committer and PMC member.

Amogh Jahagirdar is a Software Engineer on the Athena Data Lake team. He is an Apache Iceberg committer.


