How To Pick A Good Cointegrating Pair


Introduction

A time series is considered stationary if its probability distribution does not change over time. If the price series of a security is stationary, then it is a suitable candidate for a mean-reversion trading strategy. However, most security price series are not stationary: they tend to follow a lognormal random walk, drifting farther and farther away from their initial value.
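
To make the distinction concrete, here is a minimal sketch (my own illustration, not part of the article’s code) that simulates a lognormal random walk alongside a stationary, mean-reverting series:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000
shocks = np.random.normal(0, 0.01, n)

# Lognormal random walk: the log-price accumulates shocks and drifts away from 100
random_walk = 100 * np.exp(np.cumsum(shocks))

# Stationary series: an AR(1) process pulled back towards 100 at speed 0.1
stationary = np.empty(n)
stationary[0] = 100.0
for t in range(1, n):
    stationary[t] = stationary[t-1] + 0.1 * (100.0 - stationary[t-1]) + shocks[t]

plt.plot(random_walk, label='Lognormal random walk')
plt.plot(stationary, label='Stationary (mean-reverting) series')
plt.legend()

The random walk wanders arbitrarily far from its starting value, while the stationary series keeps crossing back through its mean, which is exactly the behavior a mean-reversion strategy needs.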

We need to find a pair of securities such that a combination of the two, e.g. buying one security and shorting the other, is stationary. Two securities that form a stationary, or cointegrating, pair often come from the same industry group, such as The Coca-Cola Company and PepsiCo. In this article, we illustrate how to pick a good cointegrating pair by applying the augmented Dickey-Fuller test to the spread between two securities to check for cointegration.

Step-by-step

We will proceed as follows:

  1. Determine The Pairs: We present the security pairs to analyze.
  2. Prepare The Data: We pull and process securities’ open-high-low-close-volume (OHLCV) data.
  3. Calculate The Spread: We apply the ordinary least squares (OLS) method to calculate the spread between two securities.
  4. Check For Cointegration: We use the augmented Dickey-Fuller test to check if two securities form a stationary or cointegrating pair.

You can find the code at https://github.com/DinodC/cointegrating-pair.

Determine The Pairs

Below are the pairs of securities which we will check for cointegration:

1. Gold

Gold-themed exchange traded funds (ETF):

  • VanEck Vectors Gold Miners ETF (GDX): ETF which tracks a basket of gold-mining companies.
  • SPDR Gold Shares (GLD): ETF which replicates the price of gold bullion.

2. Fast Food

Companies serving fast food:

  • McDonald’s Corporation (MCD): Fast food company which gave the whole world classics like Big Mac, Hot Fudge Sundae, and Happy Meal.
  • YUM! Brands, Inc. (YUM): Fast food company which operates Taco Bell, KFC and Pizza Hut.

3. Cryptocurrencies

Digital currencies:

  • Bitcoin USD (BTC-USD): A decentralized cryptocurrency that can be sent from user to user on the peer-to-peer bitcoin network established in 2009.
  • Ethereum USD (ETH-USD): An open source, public, blockchain-based distributed computing platform and operating system released in 2015.

Prepare The Data

In this section, we illustrate how to download and prepare the securities’ price series.
We pull the securities’ historical OHLCV data from Yahoo Finance,
select the adjusted close prices for each security, and collect them in a new DataFrame object.

Import packages

import pandas as pd
from pandas import DataFrame
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import matplotlib.pyplot as plt

Magic

%matplotlib inline

Set tickers list

tickers = ['GDX', 'GLD', 'MCD', 'YUM', 'BTC-USD', 'ETH-USD']
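
The loading loop below reads CSV files downloaded from Yahoo Finance. As an aside, and purely as a suggestion of mine (the original repository ships the CSV files), the same files could be produced programmatically with the yfinance package; depending on the package version, the exact column layout may differ:

import yfinance as yf

# Save each ticker's daily OHLCV history in the CSV layout expected by the loop below
for i in tickers:
    yf.download(i, start='2015-08-06', end='2019-07-13').to_csv(i + '.csv')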

Pull OHLCV data

# Initialize list of DataFrames
df_list = []

# Load DataFrames
for i in tickers:

    # Load data
    df = pd.read_csv(i + '.csv', index_col=0, parse_dates=True)    

    # Set multi-level columns
    df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']])

    # Update list
    df_list.append(df)

# Merge DataFrames
data = pd.concat(df_list, axis=1, join='inner')

# Drop NaNs
data.dropna(inplace=True)

Inspect OHLCV data

data.head()

                  GDX
             Open   High    Low  Close  Adj Close     Volume
Date
2015-08-06  13.21  13.69  13.11  13.36  13.033523   69121200
2015-08-07  13.42  13.85  13.33  13.40  13.072546   50618200
2015-08-10  13.57  14.29  13.36  14.27  13.921287   91376800
2015-08-11  14.44  14.53  13.94  14.53  14.174931   53731900
2015-08-12  14.81  15.53  14.78  15.52  15.140740  123217200

(Only the GDX columns are shown; the DataFrame holds the same six OHLCV columns for every ticker.)

data.tail()

                   GDX
                  Open       High        Low      Close  Adj Close    Volume
Date
2019-07-08   25.450001  25.610001  25.209999  25.420000  25.420000  40606100
2019-07-09   25.330000  25.660000  25.209999  25.650000  25.650000  37529700
2019-07-10   26.020000  26.230000  25.770000  26.200001  26.200001  56454300
2019-07-11   26.129999  26.280001  25.719999  25.940001  25.940001  54013400
2019-07-12   26.000000  26.250000  25.870001  26.209999  26.209999  31795200

Select adjusted close prices

# Initialize dictionary of adjusted close
close_dict = {}

# Update dictionary
for i in tickers:
    close_dict[i] = data[i]['Adj Close']

# Create DataFrame
close = pd.DataFrame(close_dict)

Inspect adjusted close prices

close.head()

                  GDX         GLD        MCD        YUM     BTC-USD  ETH-USD
Date
2015-08-06  13.033523  104.389999  89.038742  57.964733  277.890015    3.000
2015-08-07  13.072546  104.650002  88.653343  57.859062  258.600006    1.200
2015-08-10  13.921287  105.720001  89.074577  57.997757  269.029999    0.990
2015-08-11  14.174931  106.260002  88.554787  55.171165  267.660004    1.288
2015-08-12  15.140740  107.750000  88.079796  53.282372  263.440002    1.885

close.tail()

                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Consider the training set from 2018 to present

training = close['2018-01-01':'2020-01-01'].copy()

Inspect training set

training.head()

                  GDX         GLD         MCD        YUM       BTC-USD      ETH-USD
Date
2018-01-02  23.694632  125.150002  166.895370  79.503891  14754.129883   861.969971
2018-01-03  23.445948  124.820000  166.192001  79.435699  15156.620117   941.099976
2018-01-04  23.595158  125.459999  167.357834  80.244370  15180.080078   944.830017
2018-01-05  23.545422  125.330002  167.695084  80.712036  16954.779297   967.130005
2018-01-08  23.296738  125.309998  167.579422  80.848442  14976.169922  1136.109985

training.tail()

                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Calculate the number of pairs

# Each consecutive pair of tickers forms one candidate pair
no_pairs = len(tickers) // 2

Plot the adjusted close prices

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    # Primary axis
    color = 'tab:blue'
    ax1 = plt.subplot(3, 1, i+1)
    plt.plot(training[tickers[2*i]], color=color)
    ax1.set_ylabel('Adjusted Close Price of ' + tickers[2*i], color=color)
    ax1.tick_params(labelcolor=color)

    # Secondary axis 
    color = 'tab:orange'
    ax2 = ax1.twinx()
    plt.plot(training[tickers[2*i+1]], color=color)
    ax2.set_ylabel('Adjusted Close Price of ' + tickers[2*i+1], color=color)
    ax2.tick_params(labelcolor=color)

    # Both axis
    plt.xlim([training.index[0], training.index[-1]])
    plt.title('Adjusted Close Prices of ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Calculate The Spread

In this section, we calculate the spread between the securities in each pair. We apply the OLS method to estimate the hedge ratio between the two securities. We then standardize the spread by subtracting its mean and scaling by its standard deviation.
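
In equations, with y_t and x_t the adjusted close prices of the two securities and \beta the OLS hedge ratio, the spread and its standardized version (the z-score) are

\displaystyle Spread_t = y_t - \beta x_t, \qquad z_t = \frac{Spread_t - \mu_{Spread}}{\sigma_{Spread}}.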

Calculate the spread between each pair

# Initialize the spread list
spread_list = []

for i in range(no_pairs):
    # Run an OLS regression (without an intercept) between the two securities
    model = sm.regression.linear_model.OLS(training[tickers[2*i]], training[tickers[2*i+1]])

    # The regression slope is the hedge ratio
    results = model.fit()
    hedge_ratio = results.params[0]

    # Calculate the spread
    spread = training[tickers[2*i]] - hedge_ratio * training[tickers[2*i+1]]

    # Mean and standard deviation of the spread
    spread_mean = spread.mean()
    spread_std = spread.std()

    # Standardize the spread
    z_score = (spread - spread_mean) / spread_std

    # Update the spread list
    spread_list.append(z_score)

Plot the spread

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    plt.subplot(3, 1, i+1)
    plt.plot(spread_list[i])
    plt.xlim([spread_list[i].index[0], spread_list[i].index[-1]])
    plt.ylim([-3, 3])
    plt.title('Spread between ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Check For Cointegration

In this section, we test if two securities form a stationary or cointegrating pair.
We use the augmented Dickey-Fuller (ADF) test, where:

  1. The null hypothesis is that a unit root is present in the spread series, i.e. it is non-stationary.
  2. The alternative hypothesis is that no unit root is present, i.e. the spread series is stationary.
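
Concretely, the ADF test fits a regression of the following form to the spread and tests the null \lambda = 0 (unit root) against the alternative \lambda < 0 (mean reversion):

\displaystyle \Delta y_t = \mu + \lambda y_{t-1} + \sum_{k=1}^{p} \beta_k \Delta y_{t-k} + \epsilon_t.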

Run cointegration check using augmented Dickey-Fuller test

# Initialize stats
stats_list = []

for i in range(len(spread_list)):

    # ADF test
    stats = adfuller(spread_list[i])

    # Update stats
    stats_list.append(stats)
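
As a cross-check, and as an addition of mine that is not in the original code, statsmodels also ships an Engle-Granger cointegration test, coint, which wraps this regress-then-ADF procedure in a single call:

from statsmodels.tsa.stattools import coint

# Engle-Granger test on the raw price series of the first pair (GDX and GLD);
# the null hypothesis is that the two series are NOT cointegrated
t_stat, p_value, crit_values = coint(training[tickers[0]], training[tickers[1]])
print(p_value)  # a small p-value rejects the null of no cointegration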

Set the pairs

# Initialize pairs
pairs = []

for i in range(no_pairs):
    # Update pairs
    pairs.append(tickers[2*i] + '/' + tickers[2*i+1])

Create stats DataFrame

# Initialize dict
stats_dict = {}

for i in range(no_pairs):

    # Update dict
    stats_dict[pairs[i]] = [stats_list[i][0],
                            stats_list[i][1],
                            stats_list[i][4]['1%'], stats_list[i][4]['5%'], stats_list[i][4]['10%']]

# Create DataFrame
stats_df = pd.DataFrame(stats_dict,
                          index=['ADF Statistic', 'P-value', '1%', '5%', '10%'])

Inspect

stats_df
                 GDX/GLD   MCD/YUM  BTC-USD/ETH-USD
ADF Statistic  -3.386075 -3.072452        -2.161601
P-value         0.011443  0.028660         0.220476
1%             -3.448344 -3.447815        -3.448344
5%             -2.869469 -2.869237        -2.869469
10%            -2.570994 -2.570870        -2.570994

Remarks:

  1. For the spread between GDX and GLD, the ADF statistic of -3.39 is lower than the 1% critical value of -3.45, so we can reject the null hypothesis of a unit root at the 1% significance level: the spread between GDX and GLD is stationary.
  2. For the spread between MCD and YUM, the ADF statistic of -3.07 lies between the 1% critical value of -3.45 and the 5% critical value of -2.87, so we can reject the null hypothesis at the 5% significance level: the spread between MCD and YUM is stationary.
  3. For the spread between BTC-USD and ETH-USD, the ADF statistic of -2.16 is higher than all the critical values, so we cannot reject the null hypothesis: there is no evidence that the spread between BTC-USD and ETH-USD is stationary.

Conclusion

In this article, we demonstrated how to form a good cointegrating pair of securities. We used the OLS method to determine the hedge ratio between securities, and the ADF test to check the spread for stationarity. The results suggest that cointegrating pairs can be formed within gold securities (GDX and GLD) and fast food securities (MCD and YUM), but not within cryptocurrencies (BTC-USD and ETH-USD).


Backtesting A Trading Strategy Part 3

How To Backtest A Mean-reverting Trading Strategy Using Python


Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts:

  1. Determining the universe of securities where we will invest in (e.g. equity or fixed income? US or emerging markets?).
  2. Gathering historical data for the universe of securities.
  3. Implementing a trading strategy using the historical data collected.

In the previous articles, I illustrated the first two steps of the backtesting process: determining the universe of stocks and collecting historical data for each constituent. In this article, I will illustrate the last step of the process: implementing a mean-reverting trading strategy.

A Mean-reverting Trading Strategy

We implement a mean-reverting trading strategy based on Khandani and Lo. The idea is to buy the previous day’s “losers” and sell the previous day’s “winners”. Stocks which underperform the market average are classified as “losers”, while stocks which outperform the market average are classified as “winners”.

For each stock i, we calculate the weight w_{i, t} at time t as

\displaystyle w_{i, t} = - \frac{1}{N} \left( r_{i, t-1} - r_{Market, t-1} \right).

where N is the total number of stocks in the investment universe. The market return at time t-1 is calculated as

\displaystyle r_{Market, t-1} = \frac{1}{N} \sum_{i=1}^{N} r_{i, t-1},

and the return of stock i at time t-1 is calculated as

\displaystyle r_{i, t-1} = \frac{S_{i, t-1} - S_{i, t-2}}{S_{i, t-2}},

where S_{i, t-1} is the spot price of stock i at time t-1.
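
As a quick sanity check of the weighting scheme, here is a toy example of mine with N = 3 hypothetical stocks and made-up returns:

import numpy as np

# Yesterday's returns for N = 3 hypothetical stocks
r = np.array([0.02, -0.01, 0.005])
N = len(r)

# Equally weighted market return: (0.02 - 0.01 + 0.005) / 3 = 0.005
r_market = r.mean()

# Contrarian weights: yesterday's winner is shorted, yesterday's loser is bought
w = -(r - r_market) / N
print(w)  # [-0.005  0.005  0.   ]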

Backtesting A Trading Strategy

Step By Step

  1. Load S&P indices historical data.
  2. Implement a mean-reverting trading strategy.
  3. Calculate the trading strategy’s performance using the Sharpe ratio.

You can find the code at https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

Load S&P Historical Data

Set an id for each index

id = ['sp500', 'sp400', 'sp600']

Create a dictionary to map each id to a tickers file

input_file = {'sp500': 'sp500_data.pickle',
              'sp400': 'sp400_data.pickle',
              'sp600': 'sp600_data.pickle'}

Create a dictionary to map each id to a S&P historical data

sp500_data = pd.DataFrame()
sp400_data = pd.DataFrame()
sp600_data = pd.DataFrame()
sp_data = {'sp500': sp500_data,
           'sp400': sp400_data,
           'sp600': sp600_data}

Retrieve S&P historical data

for i in id:
    # Load data
    with open(input_file[i], 'rb') as f:
        sp_data[i] = pickle.load(f)
    f.close()

    # Select close prices
    sp_data[i] = sp_data[i].close

Implement A Mean-reverting Trading Strategy

Create a dictionary to map each id to a S&P returns data

sp500_returns = DataFrame()
sp400_returns = DataFrame()
sp600_returns = DataFrame()
sp_returns = {'sp500': sp500_returns,
              'sp400': sp400_returns,
              'sp600': sp600_returns}

Create a dictionary to map each id to a S&P market returns data

sp500_market_returns = Series()
sp400_market_returns = Series()
sp600_market_returns = Series()
sp_market_returns = {'sp500': sp500_market_returns,
                     'sp400': sp400_market_returns,
                     'sp600': sp600_market_returns}

Create a dictionary to map each id to a trading strategy weighting

sp500_weights = DataFrame()
sp400_weights = DataFrame()
sp600_weights = DataFrame()
sp_weights = {'sp500': sp500_weights,
              'sp400': sp400_weights,
              'sp600': sp600_weights}

Create a dictionary to map each id to a trading strategy pnl per stock

sp500_pnl = DataFrame()
sp400_pnl = DataFrame()
sp600_pnl = DataFrame()
sp_pnl = {'sp500': sp500_pnl,
          'sp400': sp400_pnl,
          'sp600': sp600_pnl}

Create a dictionary to map each id to a trading strategy pnl

sp500_pnl_net = Series()
sp400_pnl_net = Series()
sp600_pnl_net = Series()
sp_pnl_net = {'sp500': sp500_pnl_net,
              'sp400': sp400_pnl_net,
              'sp600': sp600_pnl_net}

Implement the mean-reverting trading strategy on the stock universes: the S&P 500, S&P MidCap 400, and S&P SmallCap 600 indices

for i in id:
    # Calculate the returns
    sp_returns[i] = sp_data[i].pct_change()

    # Calculate the equally weighted market returns
    sp_market_returns[i] = sp_returns[i].mean(axis='columns')

    # Calculate the weights of each stock
    sp_weights[i] = - (sp_returns[i].sub(sp_market_returns[i], axis='index')).div(sp_data[i].count(axis='columns'), axis='index')

    # Adjust the weights to 0 if price or return is NaN
    sp_weights[i][sp_data[i].isna() | sp_data[i].shift(periods=1).isna()] = 0

    # Calculate the daily pnl
    # Idea is to buy yesterday's losers, and sell yesterday's winners
    sp_pnl[i] = (sp_weights[i].shift(periods=1)).mul(sp_returns[i], axis='index')
    sp_pnl_net[i] = sp_pnl[i].sum(axis='columns')

Calculate The Performance Metrics Of A Trading Strategy

In this section, I will provide a brief review of the Sharpe ratio, a popular measure of a trading strategy’s performance. The Sharpe ratio is a special case of the more general Information ratio (IR), so I will present the IR first, followed by the Sharpe ratio.

Information Ratio

The Information ratio measures the excess returns of a trading strategy over a benchmark, such as the S&P 500 index. The IR scales the excess returns by the standard deviation of the strategy’s returns to measure the consistency of the trading strategy. We calculate the Information ratio using the formula

\displaystyle IR = \frac{r_{Strategy} - r_{Benchmark}}{\sigma_{Strategy}}.

In practice, we usually calculate the annualized IR as follows (see the sketch after this list):

  1. Calculate the average and standard deviation of daily returns
  2. Annualize the two metrics
  3. Compute the annualized return of the benchmark
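
A minimal sketch of that calculation, assuming daily return series and the usual 252-trading-day convention (the function name is my own), following the formula above:

import numpy as np
import pandas as pd

def information_ratio(strategy_returns: pd.Series, benchmark_returns: pd.Series) -> float:
    # Annualized excess return of the strategy over the benchmark
    excess = 252 * (strategy_returns.mean() - benchmark_returns.mean())
    # Annualized standard deviation of the strategy's daily returns
    vol = np.sqrt(252) * strategy_returns.std()
    return excess / vol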

Sharpe Ratio

The Sharpe ratio is a special case of the Information ratio where the benchmark is set to the risk-free rate. It allows the decomposition of a trading strategy’s profits and losses into risky and risk-free parts. The Sharpe ratio is popular because it facilitates comparison of trading strategies evaluated against different benchmarks.

How To Calculate The Sharpe Ratio

The Sharpe ratio calculation depends on the trading strategy deployed.

Long Only Strategies

For long-only trading strategies, we calculate the Sharpe ratio as

\displaystyle Sharpe = \frac{r_{Long} - r_{Risk-free}}{\sigma_{Long}},

where the r_{Risk-free} is usually obtained from the treasury yield curve.

Long And Short Strategies

For long and short strategies holding equal amounts of capital on both sides, also known as “dollar-neutral” strategies, we have a simpler formula for calculating the Sharpe ratio

\displaystyle Sharpe = \frac{r_{Long-Short}}{\sigma_{Long-Short}}.

The r_{Risk-free} term disappears from the equation because the cash received from the short positions earns the risk-free rate.

Performance Metrics Of A Mean-reverting Strategy

Divide the periods of observation

period = ['2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

Create a dictionary to map each id to a trading performance

sp500_performance = pd.DataFrame()
sp400_performance = pd.DataFrame()
sp600_performance = pd.DataFrame()
sp_performance = {'sp500': sp500_performance,
                  'sp400': sp400_performance,
                  'sp600': sp600_performance}

Calculate the trading strategy’s annualized Sharpe ratio, average daily returns and standard deviation

for i in id:
    # Initialize performance measures
    avg_returns = []
    std_returns = []
    sharpe = []

    # Calculate performance measures
    for j in range(len(period) - 1):
        # Period of observation
        start = period[j]
        end = period[j + 1]

        # Calculate average daily returns
        avg_returns.append(sp_pnl_net[i][start:end].mean())

        # Calculate standard deviation of daily returns
        std_returns.append(sp_pnl_net[i][start:end].std())

        # Calculate Sharpe ratio
        sharpe.append(np.sqrt(252) * avg_returns[j] / std_returns[j])

    # Update performance measures DataFrame
    sp_performance[i] = pd.DataFrame({'Avg Daily Returns': avg_returns,
                                      'Std Daily Returns': std_returns,
                                      'Sharpe Ratio': sharpe},
                                    index=['2014', '2015', '2016', '2017', '2018'])

Comparison Of The Mean-Reverting Strategy Using Different Stock Universes

Average Of Daily Returns

sp_avg_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Avg Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_avg_returns

       S&P 500   S&P 400  S&P 600
2014   0.0008%    0.001%  0.0015%
2015     -0.0%  -0.0007%  0.0004%
2016  -0.0001%  -0.0004%   0.001%
2017   0.0008%   0.0009%  0.0016%
2018   0.0004%   0.0004%  0.0004%
Standard Deviation Of Daily Returns

sp_std_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Std Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_std_returns

      S&P 500  S&P 400  S&P 600
2014  0.0052%  0.0095%  0.0134%
2015  0.0029%  0.0075%  0.0074%
2016  0.0068%  0.0111%  0.0125%
2017  0.0084%  0.0112%  0.0106%
2018  0.0037%   0.004%  0.0049%
Sharpe Ratio

sp_sharpe = pd.DataFrame({'S&P 500': sp_performance['sp500']['Sharpe Ratio'],
                          'S&P 400': sp_performance['sp400']['Sharpe Ratio'],
                          'S&P 600': sp_performance['sp600']['Sharpe Ratio']})
sp_sharpe

       S&P 500   S&P 400   S&P 600
2014  2.390120  1.710263  1.750619
2015 -0.136800 -1.462981  0.792153
2016 -0.319971 -0.571037  1.213862
2017  1.535340  1.259538  2.348185
2018  1.813091  1.471477  1.328564

The mean-reverting strategy’s Sharpe ratio is highest on the S&P 600 index, which is composed of small-cap stocks. The inverse relationship between profitability and market capitalization suggests that inefficiencies are more abundant in small-cap stocks. However, be mindful that small-cap stocks are less liquid and carry higher transaction costs when traded.

Profit And Losses

Plot the cumulative pnl of the mean-reverting strategy by investment universe

plt.figure(figsize=[20, 10])

plt.plot(sp_pnl_net['sp500'].cumsum())
plt.plot(sp_pnl_net['sp400'].cumsum())
plt.plot(sp_pnl_net['sp600'].cumsum())
plt.title('Cumulative PnL')
plt.legend(id)

Improvements To The Backtesting Process

The backtesting process measures the performance of our trading strategy using historical data. We hope that a trading strategy’s past performance will carry over into the future; unfortunately, no such guarantee exists.

Our focus should center on making the backtesting process as close to reality as possible by including transaction costs and by correcting for the following:

  1. Data-snooping bias: applying a model that has been overfitted to the historical data used in the backtest.
  2. Look-ahead bias: using future data, or data not yet available at the time of the trading decision (see the sketch after this list).
  3. Survivorship bias: excluding from the investment universe the stocks of companies that went bankrupt, merged, or were acquired.
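
Look-ahead bias, for example, is avoided mechanically in the strategy above by lagging the weights with shift(periods=1), so that day t’s pnl uses only information available on day t-1. A minimal sketch of the pattern, with made-up numbers:

import pandas as pd

# Hypothetical daily returns for one stock
returns = pd.Series([0.01, -0.02, 0.015, 0.0])

# A contrarian signal computed from the same day's return
weights = -returns

# Wrong: applying today's weight to today's return peeks into the future
pnl_lookahead = (weights * returns).sum()

# Right: lag the weights so day t trades on day t-1 information only
pnl_lagged = (weights.shift(periods=1) * returns).sum()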

Conclusion

In this article, we implemented a mean-reverting trading strategy and backtested it on our universe of stocks: the S&P 500, S&P MidCap 400, and S&P SmallCap 600 indices. The mean-reverting trading strategy performed best on the S&P 600 index, which is composed of small-capitalization stocks. In the next articles, I will illustrate improvements to the backtesting process by including transaction costs and correcting for potential biases.