How To Pick A Good Cointegrating Pair


Introduction

A time series is considered stationary if its probability distribution does not change over time. If the price series of a security is stationary, then it is a suitable candidate for a mean-reversion trading strategy. However, most security price series are not stationary: they tend to follow a lognormal random walk and drift farther and farther away from their initial value.

We need to find a pair of securities such that a combination of the two, e.g. holding one security long and the other short, is stationary. Two securities that form a stationary or cointegrating pair are often from the same industry group, such as the Coca-Cola Company and PepsiCo. In this article, we illustrate how to pick a good cointegrating pair by applying the augmented Dickey-Fuller test to security pairs to check for cointegration.
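Before diving in, here is a minimal, self-contained sketch (using simulated data, not the securities above) of what the augmented Dickey-Fuller test reports for a non-stationary random walk versus a mean-reverting series; the test on the actual spreads follows later in the article.

import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(42)

# A lognormal random walk: non-stationary, it drifts away from its starting value
random_walk = np.exp(np.cumsum(0.01 * np.random.randn(1000)))

# A mean-reverting AR(1) series: stationary around zero
stationary = np.zeros(1000)
for t in range(1, 1000):
    stationary[t] = 0.5 * stationary[t - 1] + np.random.randn()

print(adfuller(random_walk)[1])  # typically a large p-value: cannot reject the unit root
print(adfuller(stationary)[1])   # p-value near zero: reject the unit root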

Step-by-step

We will proceed as follows:

  1. Determine The Pairs: We present the security pairs to analyze.
  2. Prepare The Data: We pull and process securities’ open-high-low-close-volume (OHLCV) data.
  3. Calculate The Spread: We apply the ordinary least squares (OLS) method to calculate the spread between two securities.
  4. Check For Cointegration: We use the augmented Dickey-Fuller test to check if two securities form a stationary or cointegrating pair.

You can find the code on https://github.com/DinodC/cointegrating-pair.

Determine The Pairs

Below are the pairs of securities which we will check for cointegration:

1. Gold

Gold-themed exchange traded funds (ETF):

  • VanEck Vectors Gold Miners ETF (GDX): ETF which tracks a basket of gold-mining companies.
  • SPDR Gold Shares (GLD): ETF which replicates the price of gold bullion.

2. Fast Food

Companies serving fast food:

  • McDonald’s Corporation (MCD): Fast food company which gave the whole world classics like Big Mac, Hot Fudge Sundae, and Happy Meal.
  • YUM! Brands, Inc. (YUM): Fast food company which operates Taco Bell, KFC and Pizza Hut.

3. Cryptocurrencies

Digital currencies:

  • Bitcoin USD (BTC-USD): A decentralized cryptocurrency that can be sent from user to user on the peer-to-peer bitcoin network established in 2009.
  • Ethereum USD (ETH-USD): An open source, public, blockchain-based distributed computing platform and operating system released in 2015.

Prepare The Data

In this section, we illustrate the download and preparation of the securities’ price series. We pull the securities’ historical OHLCV data from Yahoo Finance, select the adjusted close prices of each security, and create a new DataFrame object.

Import packages

import pandas as pd
from pandas import DataFrame
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import matplotlib.pyplot as plt

Magic

%matplotlib inline

Set tickers list

tickers = ['GDX', 'GLD', 'MCD', 'YUM', 'BTC-USD', 'ETH-USD']

Pull OHLCV data

# Initialize list of DataFrames
df_list = []

# Load DataFrames
for i in tickers:

    # Load data
    df = pd.read_csv(i + '.csv', index_col=0, parse_dates=True)    

    # Set multi-level columns
    df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']])

    # Update list
    df_list.append(df)

# Merge DataFrames
data = pd.concat(df_list, axis=1, join='inner')

# Drop NaNs
data.dropna(inplace=True)

Inspect OHLCV data

data.head()
                           GDX
             Open   High    Low  Close  Adj Close     Volume
Date
2015-08-06  13.21  13.69  13.11  13.36  13.033523   69121200
2015-08-07  13.42  13.85  13.33  13.40  13.072546   50618200
2015-08-10  13.57  14.29  13.36  14.27  13.921287   91376800
2015-08-11  14.44  14.53  13.94  14.53  14.174931   53731900
2015-08-12  14.81  15.53  14.78  15.52  15.140740  123217200
data.tail()
                                         GDX
                  Open       High        Low      Close  Adj Close    Volume
Date
2019-07-08  25.450001  25.610001  25.209999  25.420000  25.420000  40606100
2019-07-09  25.330000  25.660000  25.209999  25.650000  25.650000  37529700
2019-07-10  26.020000  26.230000  25.770000  26.200001  26.200001  56454300
2019-07-11  26.129999  26.280001  25.719999  25.940001  25.940001  54013400
2019-07-12  26.000000  26.250000  25.870001  26.209999  26.209999  31795200

Select adjusted close prices

# Initialize dictionary of adjusted close
close_dict = {}

# Update dictionary
for i in tickers:
    close_dict[i] = data[i]['Adj Close']

# Create DataFrame
close = pd.DataFrame(close_dict)

Inspect adjusted close prices

close.head()
                  GDX         GLD        MCD        YUM     BTC-USD  ETH-USD
Date
2015-08-06  13.033523  104.389999  89.038742  57.964733  277.890015    3.000
2015-08-07  13.072546  104.650002  88.653343  57.859062  258.600006    1.200
2015-08-10  13.921287  105.720001  89.074577  57.997757  269.029999    0.990
2015-08-11  14.174931  106.260002  88.554787  55.171165  267.660004    1.288
2015-08-12  15.140740  107.750000  88.079796  53.282372  263.440002    1.885
close.tail()
                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Consider the training set from 2018 to present

training = close['2018-01-01':'2020-01-01'].copy()

Inspect training set

training.head()
                  GDX         GLD         MCD        YUM       BTC-USD      ETH-USD
Date
2018-01-02  23.694632  125.150002  166.895370  79.503891  14754.129883   861.969971
2018-01-03  23.445948  124.820000  166.192001  79.435699  15156.620117   941.099976
2018-01-04  23.595158  125.459999  167.357834  80.244370  15180.080078   944.830017
2018-01-05  23.545422  125.330002  167.695084  80.712036  16954.779297   967.130005
2018-01-08  23.296738  125.309998  167.579422  80.848442  14976.169922  1136.109985
training.tail()
                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Calculate the number of pairs

no_pairs = round(0.5 * len(tickers))

Plot the adjusted close prices

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    # Primary axis
    color = 'tab:blue'
    ax1 = plt.subplot(3, 1, i+1)
    plt.plot(training[tickers[2*i]], color=color)
    ax1.set_ylabel('Adjusted Close Price of ' + tickers[2*i], color=color)
    ax1.tick_params(labelcolor=color)

    # Secondary axis 
    color = 'tab:orange'
    ax2 = ax1.twinx()
    plt.plot(training[tickers[2*i+1]], color=color)
    ax2.set_ylabel('Adjusted Close Price of ' + tickers[2*i+1], color=color)
    ax2.tick_params(labelcolor=color)

    # Both axis
    plt.xlim([training.index[0], training.index[-1]])
    plt.title('Adjusted Close Prices of ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Calculate The Spread

In this section, we calculate the spread between the securities in each pair. We apply the OLS method to calculate the hedge ratio between the two securities. We then standardize the spread by subtracting its mean and scaling by its standard deviation.

Calculate the spread between each pair

# Initialize the spread list
spread_list = []

for i in range(no_pairs):
    # Run an OLS regression between the pairs
    model = sm.regression.linear_model.OLS(training[tickers[2*i]], training[tickers[2*i+1]])

    # Calculate the hedge ratio
    results = model.fit()
    hedge_ratio = results.params[0]

    # Calculate the spread
    spread = training[tickers[2*i]] - hedge_ratio * training[tickers[2*i+1]]

    # Mean and standard deviation of the spread
    spread_mean = spread.mean()
    spread_std = spread.std()

    # Standardize the spread
    z_score = (spread - spread_mean) / spread_std

    # Update the spread list
    spread_list.append(z_score)

Plot the spread

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    plt.subplot(3, 1, i+1)
    plt.plot(spread_list[i])
    plt.xlim([spread.index[0], spread.index[-1]])
    plt.ylim([-3, 3])
    plt.title('Spread between ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Check For Cointegration

In this section, we test whether two securities form a stationary or cointegrating pair. We use the augmented Dickey-Fuller (ADF) test, where:

  1. The null hypothesis is that a unit root is present in the price series, i.e. it is non-stationary.
  2. The alternative hypothesis is that no unit root is present in the price series, i.e. it is stationary.

Run cointegration check using augmented Dickey-Fuller test

# Initialize stats
stats_list = []

for i in range(len(spread_list)):

    # ADF test
    stats = adfuller(spread_list[i])

    # Update stats
    stats_list.append(stats)

Set the pairs

# Initialize pairs
pairs = []

for i in range(no_pairs):
    # Update pairs
    pairs.append(tickers[2*i] + '/' + tickers[2*i+1])

Create stats DataFrame

# Initialize dict
stats_dict = {}

for i in range(no_pairs):

    # Update dict
    stats_dict[pairs[i]] = [stats_list[i][0],
                            stats_list[i][1],
                            stats_list[i][4]['1%'], stats_list[i][4]['5%'], stats_list[i][4]['10%']]

# Create DataFrame
stats_df = pd.DataFrame(stats_dict,
                          index=['ADF Statistic', 'P-value', '1%', '5%', '10%'])

Inspect

stats_df
                 GDX/GLD    MCD/YUM  BTC-USD/ETH-USD
ADF Statistic  -3.386075  -3.072452        -2.161601
P-value         0.011443   0.028660         0.220476
1%             -3.448344  -3.447815        -3.448344
5%             -2.869469  -2.869237        -2.869469
10%            -2.570994  -2.570870        -2.570994

Remarks:

  1. For the spread between GDX and GLD, the ADF statistic of -3.39 lies between the 1% critical value of -3.45 and the 5% critical value of -2.87 (p-value ≈ 0.011), so we can reject the null hypothesis of a unit root at the 5% significance level but not quite at the 1% level: there is a better than 95% probability that the spread between GDX and GLD is stationary.
  2. For the spread between MCD and YUM, the ADF statistic of -3.07 also lies between the 1% and 5% critical values (p-value ≈ 0.029), so there is likewise a better than 95% probability that the spread between MCD and YUM is stationary.
  3. For the spread between BTC-USD and ETH-USD, the ADF statistic of -2.16 is higher than all three critical values (p-value ≈ 0.22), so we cannot reject the null hypothesis: the spread between BTC-USD and ETH-USD is not stationary.
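To make the comparison against the critical values explicit, here is a small helper sketch; it only repackages the adfuller output already computed above, and the function name adf_summary is ours, not part of statsmodels.

def adf_summary(series):
    # Run the ADF test and report which critical values the statistic clears
    stat, p_value, _, _, crit, _ = adfuller(series)
    print('ADF statistic: {:.2f}, p-value: {:.4f}'.format(stat, p_value))
    for level in ['1%', '5%', '10%']:
        if stat < crit[level]:
            print('Reject the unit root at the {} level (critical value {:.2f})'.format(level, crit[level]))
        else:
            print('Cannot reject the unit root at the {} level (critical value {:.2f})'.format(level, crit[level]))

# Example: adf_summary(spread_list[0]) for the GDX/GLD spread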

Conclusion

In this article, we demonstrated how to pick a good cointegrating pair of securities. We used the OLS method to determine the hedge ratio between securities, and the ADF test to check for stationarity. The results suggest that cointegrating pairs can be formed within gold (GDX and GLD) and fast food (MCD and YUM) securities, while a cointegrating pair could not be formed from the cryptocurrencies considered (BTC-USD and ETH-USD).


Investing On ETFs Using The Kelly Formula Part 2

How To Calculate The Kelly Formula Using Python


Introduction

The main problem of an investor is identifying profitable trading strategies. To solve this, he uses fundamental and quantitative techniques to pick winning trading systems. The next problem concerning the investor is how to optimally allocate his capital to the different winning strategies. A solution is to use the Kelly formula to calculate the amount of capital to deploy to each of the securities or trading systems before him.

In this article, I will present the Kelly formula, a popular system used in both sports betting and investing. I will use the ETF dataset sourced in the previous article to calculate the Kelly leverages, or fractions, for each ETF on the list. I will then analyze the performance metrics of each ETF group under the different leverages considered: “unconstrained”, “full” and “half” Kelly.

The Kelly Formula

The Kelly formula (or Kelly criterion) was introduced by Bell Labs researcher John Kelly in his 1956 paper “A New Interpretation Of Information Rate”. He proposed a betting system that maximizes the expected logarithm of wealth, which is equivalent to maximizing the expected compounded growth rate. Kelly demonstrated the formula’s application in horse racing, where the bet size per horse is adjusted by the bettor’s informed estimate of the horse’s chance of winning.

The Kelly formula was popularized by mathematician, blackjack researcher and quantitative hedge fund pioneer Edward Thorp in his book “Beat the Dealer”. In the 1960s, he combined a card-counting system with the Kelly criterion to beat dealers at the blackjack tables of Las Vegas. Later, Thorp used statistical analysis and the Kelly formula to beat other investment firms on Wall Street.

The Kelly Formula In Investing

In this section, we follow Thorp’s paper presenting the Kelly formula’s application in securities investing.

Case 1: Single Security

The objective is to maximize the growth rate g of a portfolio composed of a single security

\displaystyle \max_{f} g(f),

where the growth rate g is

\displaystyle g(f) = r_{Risk-free} + f m - \frac{1}{2} f^2 s^2,

and f is the fraction of capital we allocate to the security.

Assuming that the security follows a Normal distribution with mean m and variance s^2, the optimal allocation f^* is

\displaystyle f^* = \frac{m}{s^2},

and the optimum growth rate is

\displaystyle g(f^*) = r_{Risk-free} + \frac{1}{2} f^{*2} s^2.
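As a concrete sketch of the single-security case, with simulated daily returns (the numbers below are made up purely for illustration, annualizing with 250 trading days as in the code later in the article):

import numpy as np

np.random.seed(0)
returns = 0.0004 + 0.01 * np.random.randn(1000)  # hypothetical daily returns

risk_free = 0.025                                # assumed annual risk-free rate
excess_returns = returns - risk_free / 250       # daily excess returns
m = excess_returns.mean() * 250                  # annualized mean excess return
s2 = returns.var() * 250                         # annualized variance of returns
f_star = m / s2                                  # optimal allocation f* = m / s^2
g_star = risk_free + 0.5 * f_star ** 2 * s2      # optimum growth rate g(f*)
print(f_star, g_star)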

Case 2: Multiple Securities

We extend the previous case to maximize the growth rate g of a portfolio composed of multiple securities

\displaystyle \max_{F} g(F),

where the growth rate g is

\displaystyle g(F) = r_{Risk-free} + F^T M - \frac{1}{2} F^T C F,

and F is a vector containing fractions of capital we allocate to each security

\displaystyle F = \left( f_1, f_2, ..., f_n \right).

Assuming that each security i follows a Normal distribution, M is the vector containing the mean excess return of each security

\displaystyle M = \left( m_1, m_2, ..., m_n \right),

and C is the covariance matrix of the returns of securities i and j

\displaystyle C = \begin{pmatrix}  c_{11} & c_{12} & ... & c_{1n} \\  c_{21} & c_{22} & ... & c_{2n} \\  \vdots & \vdots & \ddots & \vdots \\  c_{n1} & c_{n2} & ... & c_{nn} \\  \end{pmatrix}.

The optimal allocation F^* is

\displaystyle F^* = C^{-1} M,

and the optimum growth rate is

\displaystyle g(F^*) = r_{Risk-free} + \frac{1}{2} F^{*T} C F^{*}.
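A minimal numerical sketch of the multi-security case, using a made-up two-security mean excess return vector and covariance matrix:

import numpy as np

M = np.array([0.05, 0.03])            # annualized mean excess returns (made up)
C = np.array([[0.04, 0.01],
              [0.01, 0.09]])          # annualized covariance matrix (made up)
F_star = np.linalg.solve(C, M)        # solves C F* = M, i.e. F* = C^{-1} M
risk_free = 0.025
g_star = risk_free + 0.5 * F_star @ C @ F_star  # optimum growth rate g(F*)
print(F_star, g_star)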

Step By Step

  1. Calculate the Kelly leverages and performance metrics.
  2. Analysis of the results.

You can find the code on https://github.com/DinodC/investing-etf-kelly.

Calculate the Kelly Leverages And Performance Metrics

In this section, we calculate the compounded growth rate and Sharpe ratio under the following:

  1. “Unconstrained” Kelly leverages using the formula above.
  2. “Full” Kelly fractions with the constraint of maximum leverage set to 4.
  3. “Half” Kelly leverages with the same constraint as the full Kelly.

Note that practitioners recommend using half Kelly over full Kelly to counter possible input errors.
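One way to see the appeal of the half Kelly, sketched here in the single-security case: scale the optimal fraction by a constant c, i.e. set f = c f^*. Substituting into the growth rate above gives

\displaystyle g(c f^*) - r_{Risk-free} = \left( c - \frac{1}{2} c^2 \right) f^{*2} s^2,

so the half Kelly (c = 1/2) retains three quarters of the maximum excess growth rate while halving the portfolio volatility, which leaves a margin of safety against errors in the estimates of m and s^2.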

Import packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

Set the group keys

groups = ['us bonds',
          'us stocks',
          'intl bonds',
          'intl stocks',
          'sectors']

Set the input files

input = {'us bonds': 'etf_us_bonds.pickle', 
         'us stocks': 'etf_us_stocks.pickle',
         'intl bonds': 'etf_intl_bonds.pickle',
         'intl stocks': 'etf_intl_stocks.pickle',
         'sectors': 'etf_sectors.pickle'}

Create the output dictionary

output = {'us bonds': {},
          'us stocks': {},
          'intl bonds': {},
          'intl stocks': {},
          'sectors': {}}

Set the following parameters:

  1. Risk-free rate is assumed to be 2.5%.
  2. Maximum leverage is set to 4, which is in line with what US-regulated brokers allow.

risk_free = 0.025
max_leverage = 4.00

Calculate the Kelly leverages for each portfolio

for i in groups:    

    # Load file
    with open(input[i], 'rb') as f:
        close = pickle.load(f)
    f.close()

    # Daily returns for the past 6 months
    returns = close[-120:].pct_change()

    # Excess daily returns
    excess_returns = returns - risk_free / 250

    # Mean excess daily returns annualized
    M = excess_returns.mean() * 250

    # Covariance of daily returns
    C = returns.cov() * 250

    # Kelly leverage
    F = np.matmul(np.linalg.inv(C), M)

    # Adjust leverage
    adj_leverage = max_leverage / np.sum(np.abs(F))

    # Leverages for unconstrained, full and half Kelly
    Kelly = [F, 
             adj_leverage * F, 
             0.5 * adj_leverage * F] 

    Kelly_id = ['unconstrained',
                'full',
                'half']

    for k in range(len(Kelly)):
        # Growth rate 
        g = risk_free + np.matmul(np.transpose(Kelly[k]), M) - 0.5 * np.matmul(np.matmul(np.transpose(Kelly[k]), C), Kelly[k])
        # Sharpe ratio
        sharpe = np.sqrt(np.matmul(np.matmul(np.transpose(Kelly[k]), C), Kelly[k]))

        # Updated outputs: F, g and sharpe
        output[i][Kelly_id[k]] = Kelly[k]
        output[i][Kelly_id[k] + ' growth'] = g
        output[i][Kelly_id[k] + ' sharpe'] = sharpe

    # Update output
    output[i]['tickers'] = list(close.columns)

Analysis Of The Results

Vanguard US Bond ETFs

The Kelly leverages for US bond ETFs are

group = 'us bonds'

pd.DataFrame({'Unconstrained': output[group]['unconstrained'],
              'Full Kelly': output[group]['full'],
              'Half Kelly': output[group]['half']},
             index=output[group]['tickers'])
      Unconstrained  Full Kelly  Half Kelly
EDV       12.156232    0.033435    0.016717
BIV      116.324525    0.319941    0.159971
VGIT     -94.226520   -0.259162   -0.129581
BLV       46.831168    0.128805    0.064403
VGLT     -21.563687   -0.059309   -0.029655
VMBS     -63.459954   -0.174542   -0.087271
BSV     -204.894055   -0.563545   -0.281772
VTIP     -90.945714   -0.250139   -0.125069
VGSH     -85.628605   -0.235515   -0.117757
BND      -46.758134   -0.128604   -0.064302
VCIT     212.425357    0.584259    0.292129
VCLT     -71.501762   -0.196660   -0.098330
VCSH     387.607771    1.066084    0.533042

The growth rate and Sharpe ratio are

pd.DataFrame({'Unconstrained': [output[group]['unconstrained growth'], output[group]['unconstrained sharpe']],
              'Full Kelly': [output[group]['full growth'], output[group]['full sharpe']],
              'Half Kelly': [output[group]['half growth'], output[group]['half sharpe']]},
             index=['Growth Rate', 'Sharpe Ratio'])
              Unconstrained  Full Kelly  Half Kelly
Growth Rate       17.121054    0.118913    0.071989
Sharpe Ratio       5.847402    0.016083    0.008041

Applying the Kelly formula to US bond ETFs yields the following remarks:

  1. The unconstrained Kelly proposes significant leverages, e.g. 388 for VCSH and -205 for BSV.
  2. With the maximum leverage of 4, the full and half Kelly produce unsatisfactory Sharpe ratios (<1).

Vanguard US Stock ETFs

The Kelly leverages for US stock ETFs are

group = 'us stocks'

pd.DataFrame({'Unconstrained': output[group]['unconstrained'],
              'Full Kelly': output[group]['full'],
              'Half Kelly': output[group]['half']},
             index=output[group]['tickers'])
     Unconstrained  Full Kelly  Half Kelly
VIG      11.583435    0.041141    0.020571
VUG     282.321606    1.002729    0.501365
VYM     -20.176372   -0.071661   -0.035830
VV      -89.070723   -0.316355   -0.158177
MGC      45.122143    0.160262    0.080131
MGK    -215.370435   -0.764937   -0.382468
MGV      39.437699    0.140072    0.070036
VOO    -147.455077   -0.523720   -0.261860
VTI     104.591684    0.371481    0.185741
VTV     -22.634360   -0.080391   -0.040196
VXF      -0.051763   -0.000184   -0.000092
VO       64.870150    0.230401    0.115201
VOT      10.176984    0.036146    0.018073
VOE      -0.257829   -0.000916   -0.000458
VB        7.060804    0.025078    0.012539
VBK     -34.794622   -0.123581   -0.061790
VBR     -31.236998   -0.110945   -0.055473

The growth rate and Sharpe ratio are

pd.DataFrame({'Unconstrained': [output[group]['unconstrained growth'], output[group]['unconstrained sharpe']],
              'Full Kelly': [output[group]['full growth'], output[group]['full sharpe']],
              'Half Kelly': [output[group]['half growth'], output[group]['half sharpe']]},
             index=['Growth Rate', 'Sharpe Ratio'])
              Unconstrained  Full Kelly  Half Kelly
Growth Rate        7.125607    0.075349    0.050197
Sharpe Ratio       3.768450    0.013385    0.006692

Applying the Kelly formula to US stock ETFs yields the following remarks:

  1. The unconstrained Kelly proposes significant leverages, e.g. 282 for VUG and -215 for MGK.
  2. With the maximum leverage of 4, the full and half Kelly produce unsatisfactory Sharpe ratios (<1).

Vanguard International Bond ETFs

The Kelly leverages for international bond ETFs are

group = 'intl bonds'

pd.DataFrame({'Unconstrained': output[group]['unconstrained'],
              'Full Kelly': output[group]['full'],
              'Half Kelly': output[group]['half']},
             index=output[group]['tickers'])
      Unconstrained  Full Kelly  Half Kelly
BNDX     210.493973    2.705469    1.352734
VWOB     100.718599    1.294531    0.647266

The growth rate and Sharpe ratio are

pd.DataFrame({'Unconstrained': [output[group]['unconstrained growth'], output[group]['unconstrained sharpe']],
              'Full Kelly': [output[group]['full growth'], output[group]['full sharpe']],
              'Half Kelly': [output[group]['half growth'], output[group]['half sharpe']]},
             index=['Growth Rate', 'Sharpe Ratio'])
              Unconstrained  Full Kelly  Half Kelly
Growth Rate       16.462221    0.444818    0.235588
Sharpe Ratio       5.733624    0.073694    0.036847

Applying the Kelly formula to international bond ETFs yields the following remarks:

  1. The unconstrained Kelly proposes significant leverages, e.g. 210 for BNDX and 101 for VWOB.
  2. With the maximum leverage of 4, the full and half Kelly produce unsatisfactory Sharpe ratios (<1).

Vanguard International Stock ETFs

The Kelly leverages for international stock ETFs are

group = 'intl stocks'

pd.DataFrame({'Unconstrained': output[group]['unconstrained'],
              'Full Kelly': output[group]['full'],
              'Half Kelly': output[group]['half']},
             index=output[group]['tickers'])
      Unconstrained  Full Kelly  Half Kelly
VT        11.873599    0.173985    0.086992
VEU       56.446795    0.827119    0.413560
VSS       -6.710062   -0.098323   -0.049162
VEA       47.170226    0.691189    0.345595
VGK      -10.031345   -0.146990   -0.073495
VPL      -52.156336   -0.764251   -0.382125
VNQI      26.673947    0.390855    0.195428
VXUS     -59.953430   -0.878502   -0.439251
VWO        1.964445    0.028785    0.014393

The growth rate and Sharpe ratio are

pd.DataFrame({'Unconstrained': [output[group]['unconstrained growth'], output[group]['unconstrained sharpe']],
              'Full Kelly': [output[group]['full growth'], output[group]['full sharpe']],
              'Half Kelly': [output[group]['half growth'], output[group]['half sharpe']]},
             index=['Growth Rate', 'Sharpe Ratio'])
              Unconstrained  Full Kelly  Half Kelly
Growth Rate        3.502137    0.126155    0.075764
Sharpe Ratio       2.637096    0.038642    0.019321

Applying the Kelly formula to international stock ETFs yields the following remarks:

  1. The unconstrained Kelly proposes relatively smaller leverages (compared to US bond and US stock ETFs) and a Sharpe ratio of 2.64.
  2. With the maximum leverage of 4, the full and half Kelly produce unsatisfactory Sharpe ratios (<1).

Vanguard Sector ETFs

The Kelly leverages for sector ETFs are

group = 'sectors'

pd.DataFrame({'Unconstrained': output[group]['unconstrained'],
              'Full Kelly': output[group]['full'],
              'Half Kelly': output[group]['half']},
             index=output[group]['tickers'])
     Unconstrained  Full Kelly  Half Kelly
VOX       1.219117    0.050629    0.025314
VCR      -8.332042   -0.346022   -0.173011
VDC      -2.505247   -0.104041   -0.052020
VDE     -11.760600   -0.488407   -0.244203
VFH      17.209040    0.714675    0.357338
VHT     -19.652473   -0.816149   -0.408074
VIS       3.842607    0.159580    0.079790
VGT      14.855284    0.616926    0.308463
VAW       0.396723    0.016476    0.008238
VNQ      12.386444    0.514397    0.257199
VPU       4.158522    0.172700    0.086350

The growth rate and Sharpe ratio are

pd.DataFrame({'Unconstrained': [output[group]['unconstrained growth'], output[group]['unconstrained sharpe']],
              'Full Kelly': [output[group]['full growth'], output[group]['full sharpe']],
              'Half Kelly': [output[group]['half growth'], output[group]['half sharpe']]},
             index=['Growth Rate', 'Sharpe Ratio'])
              Unconstrained  Full Kelly  Half Kelly
Growth Rate        6.271559    0.533054    0.281720
Sharpe Ratio       3.534560    0.146787    0.073393

Applying the Kelly formula to sector ETFs yields the following remarks:

  1. The unconstrained Kelly proposes relatively small leverages (compared to US bond and US stock ETFs) and a Sharpe ratio of 3.53.
  2. With the maximum leverage of 4, the full and half Kelly produce unsatisfactory Sharpe ratios (<1).

Conclusion

In this article, I reviewed the Kelly formula, which is a solution to the investor’s second problem of capital allocation. I demonstrated how to use the Kelly criterion to calculate ETF fractions per Vanguard fund group. The unconstrained Kelly generates impressive performance but requires significant amounts of leverage. The full and half Kelly suggest reasonable levels of leverage but produce unsatisfactory performance.

Investing On ETFs Using The Kelly Formula Part 1

How To Retrieve ETF Data Using Python


Introduction

The main problem of an investor is identifying profitable trading strategies. To solve this, he uses fundamental and quantitative techniques to pick winning trading systems. The next problem concerning the investor is how to optimally allocate his capital to the different winning strategies. A solution is to use the Kelly formula to calculate the amount of capital to deploy to each of the securities or trading systems before him.

In this article, I will source the Vanguard ETF list from their website using the Python package Selenium. I will then use the list to pull historical Open-High-Low-Close-Volume (OHLCV) data using the Python package pandas-datareader. The ETF data collected in this article will be used in the next article to present a solution to the investor’s capital allocation problem using the Kelly formula.

Exchange Traded Fund (ETF)

ETFs are traded on major stock exchanges like the New York Stock Exchange and Nasdaq. You can buy and sell them using online brokers such as TD Ameritrade or Interactive Brokers. An ETF is composed of tens, hundreds or sometimes even thousands of stocks (or bonds) in a single fund.

They have grown in popularity due to the following:

  1. Low costs: ETFs have an average expense ratio of 0.27% per year.
  2. Diversification: They provide a diversified basket of stocks instead of buying a single stock.
  3. Alternative themes: ETFs enable investors to gain exposure to niche industries like water and solar technology firms.

Vanguard

In this article, I will be using ETFs managed by Vanguard, so here is a brief overview of the firm. Vanguard was established in 1975 by Jack Bogle, the father of index investing, and offers both actively and passively managed funds. As of January 2019, it has USD 5.2 trillion in assets under management (AUM).

Vanguard ETFs

Vanguard ETFs are grouped into:

  1. US Bond ETFs
  2. US Stock ETFs
  3. International Bond ETFs
  4. International Stock ETFs
  5. Sector ETFs

Step By Step

  1. Source Vanguard ETFs list of tickers.
  2. Source Vanguard ETFs historical OHLCV data.

You can find the code on https://github.com/DinodC/investing-etf-kelly.

Source Vanguard ETFs List Of Tickers

In this section, we retrieve the tickers of every ETF listed on Vanguard’s website. We use Selenium, a web browser automation tool, to help us retrieve the ETF tickers. We propose two approaches, automated and non-automated, for sourcing the ETF ticker data from the website.

Selenium

Selenium automates web browsers. It was originally conceived for automating web application testing, and it allows you to open a browser and script actions like clicking buttons and completing forms.

I propose two ways of using Selenium to scrape the ETF ticker data:

  1. An automated approach: We automate navigation of Vanguard’s web site by going from the home page to the ETF list page containing the tickers.
  2. A non-automated approach: We go directly to the ETF list page containing the tickers.

Import packages

import pandas as pd
from pandas import Series, DataFrame
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas_datareader as web
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

Automated Approach

Code for scraping ETF tickers from Vanguard’s home page using the first approach

# Option
option = webdriver.ChromeOptions()
option.add_argument('--incognito')

# Create a Chrome webdriver 
driver = webdriver.Chrome('/Applications/chromedriver', options=option)

# Get to the Vanguard home page
driver.get('https://investor.vanguard.com/home/')

# Set delay in seconds
delay = 10 

# Head to the next page by clicking on 'Investing'
try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.LINK_TEXT, 'Investing')))
    driver.find_element_by_link_text('Investing').click()
except TimeoutException:
    print("Timed out waiting for page to load.")
    driver.quit()

# Head to the next page by clicking on 'Vanguard ETFs'
try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.LINK_TEXT, 'Vanguard ETFs')))
    driver.find_element_by_link_text('Vanguard ETFs').click()
except TimeoutException:
    print("Timed out waiting for page to load.")
    driver.quit()

# Head to the next page by clicking on 'Browse a list of Vanguard ETFs'
try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.LINK_TEXT, 'Browse a list of Vanguard ETFs')))
    driver.find_element_by_partial_link_text('Browse a list of Vanguard ETFs').click()
except TimeoutException:
    print("Timed out waiting for page to load.")
    driver.quit()

# Source the Vanguard ETF names and tickers
try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'productEntry')))
    content = driver.find_elements_by_class_name('productEntry')
except TimeoutException:
    print("Timed out waiting for page to load.")
    driver.quit()

Non-automated Approach

Code for scraping ETF tickers from Vanguard’s ETF list page using the second approach

# Option
option = webdriver.ChromeOptions()
option.add_argument('--incognito')

# Create a Chrome webdriver 
driver = webdriver.Chrome('/Applications/chromedriver', options=option)

# Get to the Vanguard ETF list page
driver.get('https://investor.vanguard.com/etf/list#/etf/asset-class/month-end-returns')

# Set delay in seconds
delay = 10

try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'productEntry')))
    # Source the Vanguard ETF names and tickers
    content = driver.find_elements_by_class_name('productEntry')
except TimeoutException:
    print("Timed out waiting for page to load.")
    driver.quit()

Vanguard ETFs

Inspect the content collected

len(content)
118
for i in content:
    print(i.text)
Extended Duration Treasury ETF
Vanguard Select Funds
Intermediate-Term Bond ETF
Intermediate-Term Treasury ETF
Vanguard Select Funds
Long-Term Bond ETF
Long-Term Treasury ETF
Mortgage-Backed Securities ETF
Vanguard Select Funds
Short-Term Bond ETF
Vanguard Select Funds
Short-Term Inflation-Protected Securities ETF
Short-Term Treasury ETF
Vanguard Select Funds
Total Bond Market ETF
Intermediate-Term Corporate Bond ETF
Long-Term Corporate Bond ETF
Short-Term Corporate Bond ETF
Total Corporate Bond ETF
Vanguard Select Funds
Tax-Exempt Bond ETF
Dividend Appreciation ETF
ESG U.S. Stock ETF
Growth ETF
High Dividend Yield ETF
Large-Cap ETF
Mega Cap ETF
Mega Cap Growth ETF
Mega Cap Value ETF
Vanguard Select Funds
S&P 500 ETF
Vanguard Select Funds
Total Stock Market ETF
Value ETF
Vanguard Select Funds
Extended Market ETF
Mid-Cap ETF
Mid-Cap Growth ETF
Mid-Cap Value ETF
Small-Cap ETF
Small-Cap Growth ETF
Small-Cap Value ETF
Total World Bond ETF
Vanguard Select Funds
Total International Bond ETF
Emerging Markets Government Bond ETF
Total World Stock ETF
ESG International Stock ETF
FTSE All-World ex-US ETF
FTSE All-World ex-US Small-Cap ETF
Vanguard Select Funds
FTSE Developed Markets ETF
FTSE Europe ETF
FTSE Pacific ETF
Global ex-U.S. Real Estate ETF
International Dividend Appreciation ETF
International High Dividend Yield ETF
Vanguard Select Funds
Total International Stock ETF
Vanguard Select Funds
FTSE Emerging Markets ETF
Communication Services ETF
Consumer Discretionary ETF
Consumer Staples ETF
Energy ETF
Financials ETF
Health Care ETF
Industrials ETF
Information Technology ETF
Materials ETF
Real Estate ETF
Utilities ETF
EDV Bond - Long-term Government 0.07% $126.45 $0.43
increased
0.34%
increased
2.64%A
30 day
6/13/2019 12.29% 14.61% 7.58% 8.82% 8.04% (12/06/2007)
BIV Bond - Inter-term Investment 0.07% $85.76 $0.01
increased
0.01%
increased
2.67%A
30 day
6/13/2019 6.80% 8.34% 3.13% 4.97% 5.07% (04/03/2007)
VGIT Bond - Inter-term Government 0.07% $65.62 -$0.04
decreased
-0.06%
decreased
2.06%A
30 day
6/13/2019 4.40% 6.75% 2.21% — 2.95% (11/19/2009)
BLV Bond - Long-term Treasury 0.07% $95.15 $0.12
increased
0.12%
increased
3.53%A
30 day
6/13/2019 10.52% 10.28% 5.13% 7.55% 6.86% (04/03/2007)
VGLT Bond - Long-term Government 0.07% $80.92 $0.17
increased
0.20%
increased
2.60%A
30 day
6/13/2019 9.57% 11.83% 5.40% — 6.32% (11/19/2009)
VMBS Bond - Inter-term Government 0.07% $52.55 -$0.01
decreased
-0.02%
decreased
2.83%J
30 day
6/13/2019 3.34% 5.16% 2.29% — 2.80% (11/19/2009)
BSV Bond - Short-term Investment 0.07% $80.25 -$0.04
decreased
-0.05%
decreased
2.26%A
30 day
6/13/2019 3.11% 4.59% 1.58% 2.02% 2.77% (04/03/2007)
VTIP Bond - Short-term Government 0.06% $48.99 -$0.24
decreased
-0.49%
decreased
0.44%E,G
30 day
6/13/2019 2.59% 2.76% 0.79% — 0.66% (10/12/2012)
VGSH Bond - Short-term Government 0.07% $60.78 -$0.02
decreased
-0.03%
decreased
2.08%A
30 day
6/13/2019 2.11% 3.45% 1.05% — 0.99% (11/19/2009)
BND Bond - Inter-term Investment 0.035% $82.39 $0.00
0.00%
2.69%A
30 day
6/13/2019 5.26% 6.56% 2.65% 3.68% 4.08% (04/03/2007)
VCIT Bond - Inter-term Investment 0.07% $88.55 -$0.02
decreased
-0.02%
decreased
3.41%A
30 day
6/13/2019 8.50% 8.76% 3.73% — 5.46% (11/19/2009)
VCLT Bond - Long-term Investment 0.07% $94.06 $0.11
increased
0.12%
increased
4.26%A
30 day
6/13/2019 12.35% 9.65% 5.06% — 7.16% (11/19/2009)
VCSH Bond - Short-term Investment 0.07% $80.25 $0.01
increased
0.01%
increased
2.80%A
30 day
6/13/2019 4.20% 5.28% 2.21% — 2.98% (11/19/2009)
VTC Bond - Inter-term Investment 0.07% $85.27 $0.04
increased
0.05%
increased
3.25%B
30 day
6/13/2019 8.17% 7.59% — — 3.22% (11/07/2017)
VTEB Bond - Long-term National Muni 0.08% $52.93 $0.03
increased
0.05%
increased
2.01%A
30 day
6/13/2019 4.70% 6.31% — — 3.51% (08/21/2015)
VIG Stock - Large-Cap Blend 0.06% $113.47 -$0.92
decreased
-0.80%
decreased
1.97%B
30 day
5/31/2019 16.96% 8.66% 9.32% 12.97% 8.29% (04/21/2006)
ESGV Stock - Large-Cap Blend 0.12% $50.43 $0.04
increased
0.07%
increased
1.74%B
30 day
5/31/2019 17.65% — — — -3.52% (09/18/2018)
VUG Stock - Large-Cap Growth 0.04% $161.29 $0.53
increased
0.33%
increased
1.23%B
30 day
5/31/2019 20.53% 4.63% 11.06% 15.07% 8.81% (01/26/2004)
VYM Stock - Large-Cap Value 0.06% $85.97 -$0.85
decreased
-0.97%
decreased
3.50%B
30 day
5/31/2019 11.98% 1.63% 8.03% 13.36% 7.28% (11/10/2006)
VV Stock - Large-Cap Blend 0.04% $132.90 $0.08
increased
0.06%
increased
2.01%B
30 day
5/31/2019 16.48% 3.75% 9.55% 13.92% 8.32% (01/27/2004)
MGC Stock - Large-Cap Blend 0.07% $100.06 $0.08
increased
0.08%
increased
2.06%B
30 day
5/31/2019 15.88% 4.23% 9.86% 13.81% 8.10% (12/17/2007)
MGK Stock - Large-Cap Growth 0.07% $127.98 $0.55
increased
0.43%
increased
1.34%B
30 day
5/31/2019 19.82% 3.98% 11.45% 15.15% 9.69% (12/17/2007)
MGV Stock - Large-Cap Value 0.07% $79.52 -$0.22
decreased
-0.27%
decreased
2.82%B
30 day
5/31/2019 12.28% 4.03% 8.42% 12.60% 6.59% (12/17/2007)
VOO Stock - Large-Cap Blend 0.03% $265.73 $0.10
increased
0.04%
increased
2.07%B
30 day
5/31/2019 16.36% 3.75% 9.63% — 13.49% (09/07/2010)
VTI Stock - Large-Cap Blend 0.03% $147.19 -$0.43
decreased
-0.29%
decreased
1.97%B
30 day
5/31/2019 16.44% 2.57% 9.25% 13.98% 6.88% (05/24/2001)
VTV Stock - Large-Cap Value 0.04% $109.43 -$0.28
decreased
-0.26%
decreased
2.83%B
30 day
5/31/2019 12.62% 2.76% 8.18% 12.85% 7.66% (01/26/2004)
VXF Stock - Mid-Cap Blend 0.07% $116.17 $0.18
increased
0.16%
increased
1.46%B
30 day
5/31/2019 17.07% -3.64% 7.33% 14.00% 9.09% (12/27/2001)
VO Stock - Mid-Cap Blend 0.04% $164.20 -$0.20
decreased
-0.12%
decreased
1.62%B
30 day
5/31/2019 19.36% 1.63% 7.98% 14.37% 9.18% (01/26/2004)
VOT Stock - Mid-Cap Growth 0.07% $146.99 -$0.04
decreased
-0.02%
decreased
0.85%B
30 day
5/31/2019 23.15% 6.34% 9.48% 14.64% 8.88% (08/17/2006)
VOE Stock - Mid-Cap Value 0.07% $109.31 -$0.25
decreased
-0.23%
decreased
2.40%B
30 day
5/31/2019 15.37% -3.12% 6.43% 13.95% 7.86% (08/17/2006)
VB Stock - Small-Cap Blend 0.05% $153.73 $0.62
increased
0.40%
increased
1.65%B
30 day
5/31/2019 16.94% -3.78% 7.30% 14.15% 8.71% (01/26/2004)
VBK Stock - Small-Cap Growth 0.07% $182.81 $1.66
increased
0.91%
increased
0.80%B
30 day
5/31/2019 21.68% 0.63% 8.67% 14.89% 9.09% (01/26/2004)
VBR Stock - Small-Cap Value 0.07% $128.12 -$0.16
decreased
-0.12%
decreased
2.36%B
30 day
5/31/2019 12.87% -7.52% 6.11% 13.33% 8.18% (01/26/2004)
BNDW International 0.09% $78.18 $0.01
increased
0.01%
increased
1.89%B
30 day
6/13/2019 5.38% — — — 6.12% (09/04/2018)
BNDX International 0.09% $56.91 $0.00
0.00%
0.74%A
30 day
6/13/2019 5.41% 6.89% 4.09% — 4.05% (05/31/2013)
VWOB International 0.30% $79.00 $0.05
increased
0.06%
increased
4.95%A
30 day
6/13/2019 8.08% 7.35% 3.97% — 4.26% (05/31/2013)
VT International 0.09% $73.38 -$0.51
decreased
-0.69%
decreased
— 13.45% -1.93% 5.45% 9.54% 5.55% (06/24/2008)
VSGX International 0.15% $48.92 $0.07
increased
0.14%
increased
— 10.17% — — — -4.05% (09/18/2018)
VEU International 0.09% $49.35 -$0.56
decreased
-1.12%
decreased
— 9.94% -6.26% 1.59% 5.90% 2.47% (03/02/2007)
VSS International 0.12% $102.05 -$0.94
decreased
-0.91%
decreased
— 8.72% -13.13% 1.15% 7.15% 9.61% (04/02/2009)
VEA International 0.05% $40.45 -$0.44
decreased
-1.08%
decreased
— 10.72% -7.04% 1.57% 6.33% 1.08% (07/20/2007)
VGK International 0.09% $53.21 -$1.06
decreased
-1.95%
decreased
— 12.08% -6.02% 0.26% 6.32% 4.01% (03/04/2005)
VPL International 0.09% $64.26 -$0.41
decreased
-0.63%
decreased
— 7.08% -9.98% 3.47% 6.35% 4.27% (03/04/2005)
VNQI International 0.12% $57.99 -$0.37
decreased
-0.63%
decreased
— 11.51% -0.60% 4.09% — 5.61% (11/01/2010)
VIGI International 0.25% $65.79 -$0.43
decreased
-0.65%
decreased
— 15.25% -0.63% — — 9.75% (02/25/2016)
VYMI International 0.32% $59.46 -$1.03
decreased
-1.69%
decreased
— 8.32% -4.25% — — 8.75% (02/25/2016)
VXUS International 0.09% $51.65 $0.05
increased
0.10%
increased
— 9.83% -6.98% 1.54% — 3.03% (01/26/2011)
VWO International 0.12% $40.65 -$0.13
decreased
-0.32%
decreased
— 7.52% -6.47% 1.91% 4.72% 5.92% (03/04/2005)
VOX Stock - Sector 0.10% $86.19 $0.86
increased
1.00%
increased
1.07%B
30 day
5/31/2019 16.59% 2.94% 2.17% 8.89% 6.65% (09/23/2004)
VCR Stock - Sector 0.10% $178.29 $0.19
increased
0.11%
increased
1.33%B
30 day
5/31/2019 18.79% 2.31% 10.96% 18.11% 9.44% (01/26/2004)
VDC Stock - Sector 0.10% $150.75 -$0.45
decreased
-0.29%
decreased
2.77%B
30 day
5/31/2019 15.55% 13.12% 6.90% 12.32% 9.32% (01/26/2004)
VDE Stock - Sector 0.10% $81.86 $0.74
increased
0.91%
increased
3.67%B
30 day
5/31/2019 6.96% -22.79% -7.86% 2.75% 5.18% (09/23/2004)
VFH Stock - Sector 0.10% $67.57 -$0.66
decreased
-0.96%
decreased
2.35%B
30 day
5/31/2019 14.50% -3.53% 9.64% 12.42% 3.81% (01/26/2004)
VHT Stock - Sector 0.10% $172.04 $1.29
increased
0.75%
increased
1.53%B
30 day
5/31/2019 7.89% 5.71% 10.06% 15.57% 9.43% (01/26/2004)
VIS Stock - Sector 0.10% $141.22 -$0.56
decreased
-0.39%
decreased
1.73%B
30 day
5/31/2019 17.90% -1.58% 7.54% 14.44% 8.93% (09/23/2004)
VGT Stock - Sector 0.10% $205.62 $0.25
increased
0.12%
increased
1.34%B
30 day
5/31/2019 23.68% 7.61% 17.18% 18.10% 10.10% (01/26/2004)
VAW Stock - Sector 0.10% $124.76 -$1.01
decreased
-0.80%
decreased
2.13%B
30 day
5/31/2019 13.13% -11.39% 3.11% 9.92% 7.70% (01/26/2004)
VNQ Stock - Sector 0.12% $90.68 $0.94
increased
1.05%
increased
note 22.43% 14.91% 7.58% 14.92% 8.72% (09/23/2004)
VPU Stock - Sector 0.10% $134.60 -$0.55
decreased
-0.40%
decreased
3.21%B
30 day
5/31/2019 15.09% 17.11% 10.36% 12.69% 10.07% (01/26/2004)

We notice the following structure:

  1. Elements 0-58 are the names of the ETFs.
  2. Elements 59-117 are the tickers, asset class, etc. of the ETFs.

Create lists to hold ETF names and tickers

# Initialize the lists
etf_names = []
etf_tickers = []

for i in range(0, 59):
    # Get the names
    try:
        etf_names.append(content[i].text.split('\n')[1])
    except IndexError:
        etf_names.append(content[i].text)

    # Get the tickers
    etf_tickers.append(content[i + 59].text.split()[0])

Create a DataFrame mapping an ETF’s name to its ticker

etf_map = pd.DataFrame({'Name': etf_names,
                        'Ticker': etf_tickers})

Vanguard ETFs By Group

Create a DataFrame mapping an ETF’s name to its ticker, by group.

etf_us_bonds_map = etf_map.iloc[0:15]
etf_us_stocks_map = etf_map.iloc[15:33]
etf_intl_bonds_map = etf_map.iloc[33:36]
etf_intl_stocks_map = etf_map.iloc[36:48]
etf_sectors_map = etf_map.iloc[48:]

Source Vanguard ETFs Historical OHLCV Data

Set the parameters

  1. Source is set to the Investors Exchange.
  2. Start and end dates are set to span roughly five years.

source = 'iex'
start = '2014-01-01'
end = '2020-01-01'

Vanguard ETFs

Pull historical OHLCV data from the Investors Exchange

# List of tickers
etf_list = etf_map['Ticker'].tolist()

# Initialize data
etf_data = web.DataReader(etf_list[0], source, start, end)
# Convert index to datetime
etf_data.index = pd.to_datetime(etf_data.index)

for i in range(1, len(etf_list)):
    # Pull new data
    temp_data = web.DataReader(etf_list[i], source, start, end)

    # Update data
    etf_data = pd.concat([etf_data, temp_data], axis=1, sort=True)

# Set multi-level columns
etf_data.columns = pd.MultiIndex.from_product([etf_list, temp_data.columns])

Inspect

etf_data.head()
               open     high      low    close  volume
date
2014-06-12  85.2790  86.4896  84.7732  86.2077   34454
2014-06-13  85.6853  86.9375  85.6853  86.2824   36060
2014-06-16  86.4233  86.7762  86.0253  86.5809   72137
2014-06-17  86.3404  86.3404  85.5029  85.7336   55781
2014-06-18  85.9258  86.6058  85.6771  86.3902   15027

5 rows × 295 columns

etf_data.tail()
              open      high       low   close  volume
date
2019-06-05  125.13  125.5000  124.4900  124.60  301695
2019-06-06  125.85  126.5848  124.8075  125.26  137576
2019-06-07  126.65  126.9993  126.1300  126.56   88292
2019-06-10  125.42  125.4200  124.8400  125.20  153210
2019-06-11  125.12  125.8000  125.0900  125.48  187176

5 rows × 295 columns

Vanguard ETFs By Group

Create DataFrame objects containing close prices of ETFs by group.

# Create list of Vanguard ETF groups
groups = ['us bonds',
          'us stocks',
          'intl bonds',
          'intl stocks',
          'sectors']

# Create a dictionary of Vanguard ETF maps
maps = {'us bonds': etf_us_bonds_map,
        'us stocks': etf_us_stocks_map,
        'intl bonds': etf_intl_bonds_map,
        'intl stocks': etf_intl_stocks_map,
        'sectors': etf_sectors_map}

# Initialize Vanguard ETF close dictionary
close = {'us bonds': pd.DataFrame(),
         'us stocks': pd.DataFrame(),
         'intl bonds': pd.DataFrame(),
         'intl stocks': pd.DataFrame(),
         'sectors': pd.DataFrame()}

# Update Vanguard ETF close dictionary
for i in groups:
    # Set tickers
    tickers = maps[i]['Ticker'].tolist()

    res = {}
    for j in range(len(tickers)):
        res[tickers[j]] = etf_data[tickers[j]].close

    # Update close prices 
    close[i] = pd.DataFrame(res)

Vanguard US Bond ETFs

etf_us_bonds_close = close['us bonds']

Inspect

etf_us_bonds_close.head()
                EDV      BIV     VGIT      BLV     VGLT
date
2014-06-12  86.2077  72.3372  58.3027  72.9509  61.2526
2014-06-13  86.2824  72.1999  58.2387  72.9960  61.3021
2014-06-16  86.5809  72.2257  58.1847  73.0862  61.3876
2014-06-17  85.7336  71.9426  58.0467  72.6678  61.0080
2014-06-18  86.3902  72.2857  58.2103  73.0698  61.2963
etf_us_bonds_close.tail()
               EDV    BIV   VGIT    BLV   VGLT
date
2019-06-05  124.60  85.47  65.50  94.15  80.02
2019-06-06  125.26  85.40  65.44  94.50  80.27
2019-06-07  126.56  85.68  65.61  95.18  80.92
2019-06-10  125.20  85.40  65.40  94.63  80.20
2019-06-11  125.48  85.39  65.36  94.57  80.29

The VTC and VTEB close prices are all NaN, so we drop these funds.

etf_us_bonds_close.drop(['VTC', 'VTEB'], axis='columns', inplace=True)

Vanguard US Stock ETFs

etf_us_stocks_close = close['us stocks']

Inspect

etf_us_stocks_close.head()
                VIG  ESGV      VUG      VYM       VV
date
2014-06-12  69.5502   NaN  91.2609  56.2715  80.5827
2014-06-13  69.6492   NaN  91.4579  56.5027  80.8184
2014-06-16  69.6852   NaN  91.5235  56.6654  80.9272
2014-06-17  69.8382   NaN  91.7017  56.7596  81.1175
2014-06-18  70.5041   NaN  92.4053  57.1450  81.7520
etf_us_stocks_close.tail()
               VIG  ESGV     VUG    VYM      VV
date
2019-06-05  112.07   NaN  156.30  85.44  129.98
2019-06-06  112.72   NaN  157.51  86.00  130.77
2019-06-07  113.85   NaN  159.85  86.37  132.09
2019-06-10  114.09   NaN  160.82  86.55  132.76
2019-06-11  113.77   NaN  160.72  86.72  132.70

The ESGV close prices are all NaN, so we drop this fund.

etf_us_stocks_close.drop('ESGV', axis='columns', inplace=True)

Vanguard International Bond ETFs

etf_intl_bonds_close = close['intl bonds']

Inspect

etf_intl_bonds_close.head()
            BNDW     BNDX     VWOB
date
2014-06-12   NaN  46.3357  63.9766
2014-06-13   NaN  46.3538  63.9528
2014-06-16   NaN  46.3357  63.8257
2014-06-17   NaN  46.2995  63.7462
2014-06-18   NaN  46.3448  63.7383
etf_intl_bonds_close.tail()
            BNDW   BNDX    VWOB
date
2019-06-05   NaN  56.61  78.370
2019-06-06   NaN  56.67  78.640
2019-06-07   NaN  56.82  78.870
2019-06-10   NaN  56.71  78.960
2019-06-11   NaN  56.73  79.075

The BNDW close prices are all NaN, so we drop this fund.

etf_intl_bonds_close.drop('BNDW', axis='columns', inplace=True)

Vanguard International Stock ETFs

etf_intl_stocks_close = close['intl stocks']

Inspect

etf_intl_stocks_close.head()
                 VT  VSGX      VEU      VSS      VEA
date
2014-06-12  55.1091   NaN  45.4997  96.7475  37.0462
2014-06-13  55.2604   NaN  45.5516  96.6253  37.0376
2014-06-16  55.2248   NaN  45.5256  96.3723  37.0635
2014-06-17  55.3317   NaN  45.4217  96.1280  37.0203
2014-06-18  55.6521   NaN  45.8633  96.9656  37.3402
etf_intl_stocks_close.tail()
               VT  VSGX    VEU     VSS    VEA
date
2019-06-05  72.77   NaN  49.43  102.31  40.50
2019-06-06  73.11   NaN  49.61  102.40  40.67
2019-06-07  73.85   NaN  50.13  103.50  41.15
2019-06-10  74.16   NaN  50.34  103.83  41.26
2019-06-11  74.37   NaN  50.68  104.21  41.46

The VSGX, VIGI, and VYMI close prices are all NaN, so we drop these funds.

etf_intl_stocks_close.drop(['VSGX', 'VIGI', 'VYMI'], axis='columns', inplace=True)

Vanguard Sector ETFs

etf_sectors_close = close['sectors']

Inspect

etf_sectors_close.head()
                VOX       VCR       VDC       VDE      VFH
date
2014-06-12  73.2093   99.4362  101.4614  121.4008  41.7458
2014-06-13  73.8089   99.5073  101.4878  122.6118  41.7458
2014-06-16  73.8688   99.8709  101.8401  123.2038  41.5917
2014-06-17  74.1943  100.2812  101.9634  123.0545  42.0086
2014-06-18  74.8623  100.9804  103.0908  123.9834  42.2352
etf_sectors_close.tail()
              VOX     VCR     VDC    VDE    VFH
date
2019-06-05  83.15  171.09  147.27  80.19  67.76
2019-06-06  83.43  171.70  148.60  81.49  67.98
2019-06-07  84.47  174.18  149.93  81.82  67.81
2019-06-10  84.47  175.76  149.98  81.99  68.37
2019-06-11  84.61  176.42  150.73  82.12  68.40

Plot Of Vanguard ETFs By Group

plt.figure(figsize=[20, 40])

# US Bonds
plt.subplot(5, 1, 1)
for i in etf_us_bonds_close.columns.tolist():
    plt.plot(etf_us_bonds_close[i])    
plt.title('Vanguard US Bonds ETFs - Close')
plt.legend(etf_us_bonds_close.columns.tolist(), loc=2)
plt.xlim(etf_data.index[0], etf_data.index[-1])

# US Stocks
plt.subplot(5, 1, 2)
for i in etf_us_stocks_close.columns.tolist():
    plt.plot(etf_us_stocks_close[i])    
plt.title('Vanguard US Stocks ETFs - Close')
plt.legend(etf_us_stocks_close.columns.tolist(), loc=2)
plt.xlim(etf_data.index[0], etf_data.index[-1])

# Intl Bonds
plt.subplot(5, 1, 3)
for i in etf_intl_bonds_close.columns.tolist():
    plt.plot(etf_intl_bonds_close[i])    
plt.title('Vanguard International Bond ETFs - Close')
plt.legend(etf_intl_bonds_close.columns.tolist(), loc=2)
plt.xlim(etf_data.index[0], etf_data.index[-1])

# Intl Stocks
plt.subplot(5, 1, 4)
for i in etf_intl_stocks_close.columns.tolist():
    plt.plot(etf_intl_stocks_close[i])    
plt.title('Vanguard International Stock ETFs - Close')
plt.legend(etf_intl_stocks_close.columns.tolist(), loc=2)
plt.xlim(etf_data.index[0], etf_data.index[-1])

# Sectors
plt.subplot(5, 1, 5)
for i in etf_sectors_close.columns.tolist():
    plt.plot(etf_sectors_close[i])
plt.title('Vanguard Sector ETFs - Close')
plt.legend(etf_sectors_close.columns.tolist(), loc=2)
plt.xlim(etf_data.index[0], etf_data.index[-1])

Save The Data

Create a list of output files

output = ['etf_us_bonds.pickle', 
          'etf_us_stocks.pickle', 
          'etf_intl_bonds.pickle', 
          'etf_intl_stocks.pickle', 
          'etf_sectors.pickle']

Create a list of ETF data

data = [etf_us_bonds_close, 
        etf_us_stocks_close,
        etf_intl_bonds_close,
        etf_intl_stocks_close,
        etf_sectors_close]

Pickle

for i in range(len(output)):
    with open(output[i], 'wb') as f:
        pickle.dump(data[i], f)

    f.close()

Conclusion

In this article, I showed how to scrape the ETF list from Vanguard’s web page, and how to pull historical OHLCV data from IEX. I used the Python packages Selenium and pandas-datareader for web scraping and sourcing the datasets, respectively. In the next article, I use the data to demonstrate the application of the Kelly formula in investing.

Pair-trading With Cryptocurrencies Part 2

How To Implement A Pair-trading Strategy Using Python


Introduction

Cryptocurrencies (or cryptos) have captured the world’s imagination over the last couple of years: from their spectacular rise in 2017, to their equally hard fall in 2018, to their most recent revival this year. Cryptocurrencies and the blockchain technology powering them are regarded as disruptive forces, proposing an alternative decentralized monetary system. They have also gathered their fair share of criticism. Cryptocurrencies are accused of facilitating the transfer of “dirty money” between criminals. They provided mind-boggling returns a couple of years back and earned the reputation of being a “get rich quick” scheme.

In this article, we turn our attention to cryptocurrencies as investment vehicles sharing common characteristics with other asset classes like FX, commodities and precious metals. Cryptocurrencies, like fiat currencies, allow for easy transfer of wealth between individuals and businesses. They are similar to commodities with regard to the limited supply they offer. Cryptocurrencies have also drawn comparisons to gold, which is viewed as a “safe-haven” investment.

This is the second part of the article, where I present how to implement a trading strategy using cryptocurrency pairs and backtest it using the dataset collected in the previous article.

Implementing A Pair-trading Strategy

We implement a trading strategy inspired by E. Chan’s pair-trading strategy involving gold and gold miner ETFs. After money is injected into traditional markets like equities and fixed income, it eventually overflows into alternative asset classes like cryptocurrencies. We suppose that money flowing into cryptocurrencies will drive their prices together.

The relative price movement between two cryptocurrencies is measured by the difference in their prices, or spread. If the spread is mean-reverting, then we can buy (sell) the spread when it is low (high). We have the following rules for our pair-trading strategy (a minimal sketch of these rules in code follows the list):

  • Enter a long position on the spread if it falls below -1 standard deviation (STD).
  • Enter a short position on the spread if it rises above 1 STD.
  • Exit any open position if the spread is between -0.5 and 0.5 STD.
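Here is that sketch: a small function that maps a standardized spread to positions using the thresholds above. The name rules_to_positions is ours, and z_score is assumed to be a pandas Series; the full implementation further below also holds the two legs separately and deducts transaction costs.

import numpy as np
import pandas as pd

def rules_to_positions(z_score, entry=1.0, exit=0.5):
    positions = pd.Series(np.nan, index=z_score.index)
    positions[z_score <= -entry] = 1        # long the spread when it is cheap
    positions[z_score >= entry] = -1        # short the spread when it is rich
    positions[np.abs(z_score) <= exit] = 0  # flat once the spread normalizes
    return positions.ffill()                # carry positions forward until an exit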

Step By Step

  1. Load the historical data.
  2. Implement and backtest the trading strategy.
  3. Analyze the performance of the pair-trading strategy.

You can find the code on https://github.com/DinodC/pair-trading-cryptos.

Import packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline

Load Data

Load the historical data

with open('data_coinmarketcap.pickle', 'rb') as f:
    data = pickle.load(f)

f.close()

Create close prices DataFrame object

close = pd.DataFrame({'BTC': data.BTC.Close,
                      'ETH': data.ETH.Close,
                      'LTC': data.LTC.Close,
                      'XRP': data.XRP.Close})

Create a list of identifiers for the Bitcoin, Ethereum, Litecoin and Ripple (versus USD) price series.

id = ['BTC', 'ETH', 'LTC', 'XRP']

Pair-trading Strategy Implementation

Before implementing the pair-trading strategy, recall from a previous article the pitfalls to look out for when backtesting:

  1. Data-snooping bias
  2. Look-ahead bias
  3. Transaction costs

I will review these pitfalls and propose solutions to mitigate their effect on the backtesting results.

Correcting For Data-snooping Bias

Data-snooping is the use of data analysis to find patterns that appear statistically significant when in fact they are not. It can originate from overfitting the model to a dataset by increasing the number of its parameters. Data-snooping can also come from optimizing trading decisions, e.g. choosing entry and exit thresholds.

To mitigate data-snooping bias, we can do the following:

  1. Out-of-sample testing: Out-of-sample testing involves separating historical data into a training and testing set. We optimize the model’s parameters using the training set, and validate it on the testing or out-of-sample set.
  2. Paper trading: Paper trading shares the same idea as out-of-sample testing in separating the dataset. We optimize the model’s parameters using the historical data, and run the strategy on incoming live data.

In this article, we apply out-of-sample testing by dividing the data into two equal sets.

# Define training set
training_start = '2015-08-07'
training_end = '2017-08-01'
training_set = close[training_start:training_end]

# Define testing set
testing_start = '2017-08-02'
testing_end = '2019-05-21'
testing_set = close[testing_start:testing_end]

Correcting For Look-ahead Bias

Look-ahead bias is the use of information not yet available at the moment the trade was recommended. An example is calculating the trading signal based on the day’s highest or lowest price. As the day’s high and low are only known at the close, we correct for this by lagging the positions by one day.

Another way to mitigate look-ahead bias is the following check:

  1. Backtest using historical data and save the positions.
  2. Backtest using truncated historical data, e.g. 10 days, and save the positions.
  3. Truncate the positions from point 1, e.g. 10 days, compare with the positions from point 2.

Note that the implementation of this check is just after the backtesting.

Including Transaction Costs

Transaction costs make the backtest more realistic. They deflate a trading strategy’s performance and include the following:

  1. Slippage: the difference between your expected price and the executed price.
  2. Commission: a service fee charged by your broker for handling your orders.
  3. Market impact: the price effect you cause each time you send a large order to the market.

In this article, we assume a round-trip transaction cost of 10 bps per trade.

round_trip = 0.0010

Implementation

We define the entry and exit thresholds

entry_threshold = 1
exit_threshold = 0.5

We implement the pair-trading strategy

# Set crypto 1 to BTC
crypto_1 = id[0]

# Initialize output
output = {id[1]: {},
          id[2]: {}, 
          id[3]: {}}

for i in range(1, len(id)):

    # Set crypto 2 to ETH, LTC and XRP
    crypto_2 = id[i]

    # Calculate the hedge ratio using the training set
    model = sm.OLS(training_set[crypto_1], training_set[crypto_2])
    result = model.fit()
    hedge_ratio = result.params[crypto_2]

    # Calculate the spread
    spread = close[crypto_1] - hedge_ratio * close[crypto_2]
    # Mean of the spread on the training set
    spread_mean = spread[training_start:training_end].mean()
    # Standard deviation of the spread calculated on the training set
    spread_std = spread[training_start:training_end].std()
    # Z-score of the spread
    z_score = (spread - spread_mean) / spread_std

    # Implement pair trading strategy
    # Create masks for long, short and exit positions
    longs = (z_score <= -entry_threshold)
    shorts = (z_score >= entry_threshold)
    exits = (np.abs(z_score) <= exit_threshold)
    # Initialize the positions
    positions = pd.DataFrame({crypto_1: np.nan * pd.Series(range(len(z_score))),
                              crypto_2: np.nan * pd.Series(range(len(z_score)))},
                             index=z_score.index)
    # Update the positions
    [positions[crypto_1][longs], positions[crypto_2][longs]] = [1, -1]
    [positions[crypto_1][shorts], positions[crypto_2][shorts]] = [-1, 1]
    [positions[crypto_1][exits], positions[crypto_2][exits]] = [0, 0]
    # Carry forward the positions except when there is an exit
    positions.fillna(method='ffill', inplace=True)
    # Lag the positions to the next day because we base calculations on close
    positions = positions.shift(periods=1)
    
    # Calculate the performance
    # Initialize the returns
    returns = pd.DataFrame({crypto_1: close[crypto_1],
                            crypto_2: close[crypto_2]})
    # Update the returns
    returns = returns.pct_change()
    # Calculate the pnl
    pnl = returns * positions

    # Calculate transaction costs
    # Create a mask to indicate changes in position
    mask = (~np.isnan(positions.BTC) & (positions.BTC - positions.BTC.shift(periods=1)).astype(bool))
    # mask = (~np.isnan(positions.BTC) & (positions.BTC != positions.BTC.shift(periods=1)))
    # Create a transaction costs Series
    tc = pd.Series(np.zeros(len(mask)), index=mask.index)
    tc[mask] = - round_trip
    
    # Update pnl DataFrame
    pnl['TC'] = tc
    # Calculate net pnl
    pnl_net = pnl.sum(axis='columns')
    
    # Calculate the Sharpe ratio under the training set
    sharpe_training = np.sqrt(252) * pnl_net[training_start:training_end].mean() / pnl_net[training_start:training_end].std()
    # Calculate the Sharpe ratio under the testing set
    sharpe_testing = np.sqrt(252) * pnl_net[testing_start:testing_end].mean() / pnl_net[testing_start:testing_end].std()
            
    # Generate the output
    # Gather data
    data = {'spread': z_score,
            'positions': positions,
            'pnl': pnl_net,
            'sharpe training': sharpe_training,
            'sharpe testing': sharpe_testing,
           }
    # Update the output
    output.update({crypto_2: data})
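
The look-ahead bias check below loads a file named output_truncated.pickle. A minimal sketch of how such a file could be produced, assuming the loop above is re-run on a truncated price history (e.g. close[:-100]) so that output holds the truncated results:

import pickle

# Save the truncated-run backtest output for the look-ahead bias check below
with open('output_truncated.pickle', 'wb') as f:
    pickle.dump(output, f)

f.close()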

Check For Look-ahead Bias

Load the output generated from the truncated historical data

with open('output_truncated.pickle', 'rb') as f:
    output_truncated = pickle.load(f)

f.close()
positions_truncated_eth = output_truncated['ETH']['positions'].dropna()
positions_truncated_ltc = output_truncated['LTC']['positions'].dropna()
positions_truncated_xrp = output_truncated['XRP']['positions'].dropna()

Truncate the positions calculated from the full dataset

positions_eth = output['ETH']['positions'][:-100].dropna()
positions_ltc = output['LTC']['positions'][:-100].dropna()
positions_xrp = output['XRP']['positions'][:-100].dropna()

Compare the two sets of positions

mask = positions_eth.eq(positions_truncated_eth)
mask.ETH.value_counts()
True    1222
Name: ETH, dtype: int64
mask = positions_ltc.eq(positions_truncated_ltc)
mask.LTC.value_counts()
True    2124
Name: LTC, dtype: int64
mask = positions_xrp.eq(positions_truncated_xrp)
mask.XRP.value_counts()
True    2026
Name: XRP, dtype: int64

Performance Analysis

Spread

Recall that the spread between BTC and other cryptocurrencies provides the signal of our trading strategy.

Plot the spread

plt.figure(figsize=[20, 20])

# Spread BTC & ETH
# Training set
plt.subplot(3, 2, 1)
plt.plot(output['ETH']['spread'][training_start:training_end])
plt.title('Spread BTC & ETH - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-10, 40)
# Testing set
plt.subplot(3, 2, 2)
color = 'tab:orange'
plt.plot(output['ETH']['spread'][testing_start:testing_end], color=color)
plt.title('Spread BTC & ETH - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-10, 40)

# Spread BTC & LTC
# Training set
plt.subplot(3, 2, 3)
plt.plot(output['LTC']['spread'][training_start:training_end])
plt.title('Spread BTC & LTC - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-30, 40)
# Testing set
plt.subplot(3, 2, 4)
color = 'tab:orange'
plt.plot(output['LTC']['spread'][testing_start:testing_end], color=color)
plt.title('Spread BTC & LTC - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-30, 40)

# Spread BTC & XRP
# Training set
plt.subplot(3, 2, 5)
plt.plot(output['XRP']['spread'][training_start:training_end])
plt.title('Spread BTC & XRP - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-50, 50)
# Testing set
plt.subplot(3, 2, 6)
color = 'tab:orange'
plt.plot(output['XRP']['spread'][testing_start:testing_end], color=color)
plt.title('Spread BTC & XRP - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-50, 50)

Remarks:

  1. On the training set, the spread between BTC and the other cryptocurrencies is flat except from January to July 2017, when it exhibited mean-reversion.
  2. On the testing set, the spread between BTC and the other cryptocurrencies shot up at the end of 2017 before falling at the start of 2018.
  3. Comparing the spread under the training and testing sets, there seem to be two different underlying processes, i.e. a regime change.

Sharpe Ratio

Create a Sharpe ratio table

sharpe = pd.DataFrame({'BTC & ETH': [output['ETH']['sharpe training'], output['ETH']['sharpe testing']],
                       'BTC & LTC': [output['LTC']['sharpe training'], output['LTC']['sharpe testing']],
                       'BTC & XRP': [output['XRP']['sharpe training'], output['XRP']['sharpe testing']]},
                      index=pd.MultiIndex.from_product([['Sharpe Ratio'], ['Training Set', 'Testing Set']]))
sharpe
               BTC & ETH    BTC & LTC    BTC & XRP
Training Set    1.817663     1.175902     1.272301
Testing Set    -0.200709     0.782145     0.480900

Remarks:

  1. On the training set, the trading pairs BTC & ETH, BTC & LTC and BTC & XRP generated good Sharpe ratios, higher than 1.
  2. On the testing set, the pair-trading strategy generated lower Sharpe ratios than on the training set.
  3. We limited data-snooping bias, but performance still degraded on the testing set due to the regime change.

PnL

Plot the cumulative pnl

plt.figure(figsize=[20, 5])

plt.subplot(1, 2, 1)
plt.plot(output['ETH']['pnl'].cumsum()[training_start:training_end])
plt.plot(output['LTC']['pnl'].cumsum()[training_start:training_end])
plt.plot(output['XRP']['pnl'].cumsum()[training_start:training_end])
plt.title('Cumulative PnL Under The Training Set')
plt.legend(['BTC & ETH', 'BTC & LTC', 'BTC & XRP'])
plt.ylim(-1, 10)

plt.subplot(1, 2, 2)
plt.plot(output['ETH']['pnl'].cumsum()[testing_start:testing_end])
plt.plot(output['LTC']['pnl'].cumsum()[testing_start:testing_end])
plt.plot(output['XRP']['pnl'].cumsum()[testing_start:testing_end])
plt.title('Cumulative PnL Under The Testing Set')
plt.legend(['BTC & ETH', 'BTC & LTC', 'BTC & XRP'])
plt.ylim(-1, 10)

Remarks:

  1. On the training set, the trading pairs BTC & ETH, BTC & LTC and BTC & XRP made profits from January to July 2017.
  2. On the testing set, the pair-trading strategy generated small profits during the fall in January 2018.
  3. The lower profits in the testing set are due to the regime change observed in the spread.

Conclusion

In this article, we implemented and backtested a pair-trading strategy using cryptocurrencies such as BTC, ETH, LTC and XRP. We limited data-snooping in our backtest by applying out-of-sample testing. We checked for look-ahead bias by comparing the positions generated from the full and the truncated historical data. Transaction costs were included in the backtest to make it closer to reality.

Pair-trading With Cryptocurrencies Part 1

How To Scrape Cryptocurrency Data Using Python

Photo by the author

Introduction

Cryptocurrencies (or cryptos) have captured the world’s imagination over the last couple of years: from their spectacular rise in 2017, to their equally hard fall in 2018, to their most recent revival this year. Cryptocurrencies and the blockchain technology powering them are regarded as disruptive forces, proposing an alternative decentralized monetary system. They have also gathered their fair share of criticism. Cryptocurrencies are accused of facilitating the transfer of “dirty money” between criminals. They provided mind-boggling returns a couple of years back and earned the reputation of being a “get rich quick” scheme.

In this article, we turn our attention to cryptocurrencies as investment vehicles sharing common characteristics with other asset classes like FX, commodities and precious metals. Cryptocurrencies, like fiat currencies, allow for easy transfer of wealth between individuals and businesses. They are similar to commodities with regard to their limited supply. Cryptocurrencies have also drawn comparisons to gold, which is viewed as a “safe-haven” investment.

This article is divided into two parts. In this first article, I will illustrate how to source cryptocurrency historical data. For the next article, I will demonstrate how to backtest a pair-trading strategy using the cryptocurrency dataset collected here.

Retrieving Cryptocurrencies Historical Data

We retrieve data mainly through databases, APIs and web pages. As I don’t have access to a database providing cryptocurrency historical data, I will use APIs and scrape web pages to retrieve the cryptocurrency data necessary for backtesting purposes.

Step By Step

  1. Source data via APIs.
  2. Retrieve data via scraping web pages.

You can find the code on https://github.com/DinodC/pair-trading-cryptos.

Via APIs

APIs provide a convenient way to source data from providers, exchanges and other platforms.

1. Quandl

Quandl is a financial, economic and alternative data provider. It specializes in distributing alternative data to investment firms, arguing that non-traditional data is an untapped source of alpha. Quandl sources data from various providers such as exchanges, tech firms and other companies. It has free and premium datasets to choose from.

To pull data from Quandl, you need to:

  1. Register with Quandl to get an API key; it is free of charge.
  2. Pull data from Quandl using their API directly, or via Python, R or Excel.
  3. If you use Python, then you need to install the Quandl Python package quandl or pandas-datareader.

Here we use the quandl package to retrieve cryptocurrency data from the Bitfinex and GDAX datasets.

Bitfinex

Bitfinex is one of the largest cryptocurrency trading platforms in the world, and is headquartered in Hong Kong. It provides an extensive list of crypto-to-crypto and crypto-to-fiat datasets. Data collected using quandl’s get function comes in a convenient DataFrame object. A drawback is the limited OHLCV (Open-High-Low-Close-Volume) historical data, which spans only 2 years.

# Quandl Python package
import quandl

# Set API key
quandl.ApiConfig.api_key = 'Your API Authentication Key'

# Set parameters
start = '2009-01-01'
exchange = 'BITFINEX'
id = 'ETHUSD'

data = quandl.get(exchange + '/' + id, start_date=start)

GDAX (Global Digital Asset Exchange)

GDAX is a popular cryptocurrency exchange headquartered in San Francisco, California. It was formerly known as Coinbase Exchange. OHLCV historical data pulled from GDAX spans 3 years, which is longer than the Bitfinex datasets. Data collected using quandl’s get function comes in a convenient DataFrame object. A drawback is the limited number of cryptocurrency datasets compared to Bitfinex.

# Quandl Python package
import quandl

# Set API key
quandl.ApiConfig.api_key = 'Your API Authentication Key'

# Set parameters
start = '2009-01-01'
exchange = 'GDAX'
id = 'ETH_USD'

data = quandl.get(exchange + '/' + id, start_date=start)

2. CoinAPI

CoinAPI is a one-stop shop providing cryptocurrency data from various exchanges such as GDAX, Huobi, Kraken, etc. It provides data via their API and other channels as well. Here, I will use CoinAPI’s free registration to retrieve cryptocurrency data.

To pull data from CoinAPI, you need to:

  1. Register with CoinAPI to get an API key for authenticating requests.
  2. Pull data from CoinAPI using their API directly, or via languages such as Python, Java and C#.
  3. If you use Python, then you need to install the Python package requests to source data from CoinAPI.

CoinAPI provides a rich list of cryptocurrencies and exchanges to collect data from. Note that under the free subscription, requests are limited to 100 per day. Also, historical OHLCV data is limited to 100 data points per request. CoinAPI returns data in JSON format.

import requests

start = '2009-01-01'
exchange = 'BITSTAMP'
id = 'ETH'
period = '1DAY'

url = 'https://rest.coinapi.io/v1/ohlcv/' + exchange + '_SPOT_' + id + '_USD/history?period_id=' + period + '&time_start=' + start + 'T00:00:00'
headers = {'X-CoinAPI-Key' : 'Your API Authentication Key'}
response = requests.get(url, headers=headers)
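
The response body is JSON. Below is a minimal sketch of turning it into a DataFrame, assuming the request above succeeded and that the records carry the usual CoinAPI OHLCV field names (e.g. time_period_start, price_close):

import pandas as pd

# Parse the JSON payload (a list of OHLCV records) into a DataFrame
ohlcv = pd.DataFrame(response.json())

# Index the rows by the period start time
ohlcv['time_period_start'] = pd.to_datetime(ohlcv['time_period_start'])
ohlcv = ohlcv.set_index('time_period_start')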

Via Scraping

Scraping provides a straightforward way of sourcing data from a web page.

CoinMarketCap

CoinMarketCap provides cryptocurrency data by pulling information from different exchanges. It queries cryptocurrency data from the exchanges, processes and delivers it through their website. CoinMarketCap has certain guidelines for cryptocurrencies and their exchanges to be included on the platform.

You can pull data from CoinMarketCap via their API but their free registration doesn’t include historical data. I will illustrate step-by-step how to retrieve historical OHLCV by scraping their website. Python package BeautifulSoup is used to collect data from CoinMarketCap.

CoinMarketCap provides an extensive list of crypto-to-crypto and crypto-to-fiat data. Cryptocurrency OHLCV historical data spans more than 5 years for BTCUSD, LTCUSD and XRPUSD. A limitation from scraping CoinMarketCap site is data type inconsistency for volume and market cap data.

Scrape Cryptocurrency Data From CoinMarketCap

We scrape data as follows:

  1. Identify the cryptocurrencies and their corresponding URL.
  2. Scrape data from the URLs using Python packages requests and BeautifulSoup.

Note that I scraped data for four of the five largest cryptocurrencies by market capitalization (Bitcoin, Ethereum, Litecoin and Ripple). I excluded Bitcoin Cash, the remaining one, because it has limited historical data.

Import packages

from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import pickle
%matplotlib inline

Identify Cryptocurrencies

Set cryptocurrency ids

id = ['BTC', 'ETH', 'XRP', 'LTC']

Create a dictionary mapping cryptocurrency ids to their URLs

url = {'BTC': 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190531',
       'ETH': 'https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20130428&end=20190531',
       'XRP': 'https://coinmarketcap.com/currencies/ripple/historical-data/?start=20130428&end=20190531',
       'LTC': 'https://coinmarketcap.com/currencies/litecoin/historical-data/?start=20130428&end=20190531'}

Scrape Data

Before pulling data, recall that I identified an inconsistency in the volume and market cap data types:

  1. There are both str and float types in the volume and market cap Series.
  2. I created the function my_convert, shown below, to correct for this.

def my_convert(string):
    # Convert a scraped volume or market cap entry to a float

    res = 0.0

    if string == 'NaN':
        # Keep missing entries as NaN
        res = float('nan')
    elif string == '-':
        # CoinMarketCap displays '-' when no data is available
        res = 0.0
    elif type(string) == str:
        # Remove thousands separators before converting to float
        res = string.replace(',', '')
        res = float(res)
    else:
        # Entry is already numeric
        res = float(string)

    return res

To scrape cryptocurrency historical data, we run the script below

# Create a list for concat
data_list = []

for i in id:

    # Retrieve data from URL
    response = get(url[i])

    # Create soup
    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Find OHLCV table
    ohlcv_table = html_soup.find_all('tr', class_='text-right')

    # Store in a dictionary
    data = {}
    for j in range(len(ohlcv_table)):
        ohlcv_row = ohlcv_table[j].text.split('\n')
        data[ohlcv_row[1]] = [float(ohlcv_row[2]), # Open
                              float(ohlcv_row[3]), # High
                              float(ohlcv_row[4]), # Low
                              float(ohlcv_row[5]), # Close
                              my_convert(ohlcv_row[6]), # Volume
                              my_convert(ohlcv_row[7])] # Market cap

    # Store in a DataFrame
    data_df = pd.DataFrame.from_dict(data, orient='index')

    # Rename columns
    data_df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']])

    # Convert index to DateTimeIndex
    data_df.index = pd.to_datetime(data_df.index)

    # Append to list
    data_list.append(data_df)

# Create DataFrame containing all the cryptos
crypto = pd.concat(data_list, axis=1)  

Check the collected data

crypto.head()
              Open      High      Low       Close     Volume    Market Cap
2013-04-28    135.30    135.98    132.10    134.21    0.0       1.488567e+09
2013-04-29    134.44    147.49    134.00    144.54    0.0       1.603769e+09
2013-04-30    144.00    146.93    134.05    139.00    0.0       1.542813e+09
2013-05-01    139.00    139.89    107.72    116.99    0.0       1.298955e+09
2013-05-02    116.38    125.60    92.28     105.21    0.0       1.168517e+09

5 rows × 24 columns

crypto.tail()
              Open       High       Low        Close      Volume          Market Cap
2019-05-27    8674.07    8907.17    8668.70    8805.78    2.794984e+10    1.560938e+11
2019-05-28    8802.76    8807.02    8634.72    8719.96    2.422692e+10    1.545902e+11
2019-05-29    8718.59    8755.85    8482.73    8659.49    2.347348e+10    1.535370e+11
2019-05-30    8661.76    9008.31    8221.27    8319.47    2.924653e+10    1.475251e+11
2019-05-31    8320.29    8586.66    8172.55    8574.50    2.536519e+10    1.520599e+11

5 rows × 24 columns

Plot the crypto-to-fiat close prices

plt.figure(figsize=[20, 20])

# Subplot 1
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 1)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# ETH
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.ETH.Close, color=color)
ax2.set_ylabel('Price of ETH in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and ETHUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

# Subplot 2
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 2)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# LTC
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.LTC.Close, color=color)
ax2.set_ylabel('Price of LTC in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and LTCUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

# Subplot 3
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 3)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# XRP
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.XRP.Close, color=color)
ax2.set_ylabel('Price of XRP in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and XRPUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

plt.tight_layout() # Right y label is not clipped

Note that ETHUSD prices are only available from July 2015 because Ethereum was released around this time.

Pickle the data

with open('data_coinmarketcap.pickle', 'wb') as f:
    pickle.dump(crypto, f)

f.close()

Conclusion

In this article, we identified sources of cryptocurrency historical data, namely APIs and web pages to scrape. Using APIs provides a convenient way of retrieving data; a drawback of using them, particularly under their free subscription plans, is the limited historical data available. Scraping websites requires a bit more effort than using APIs, but offers more extensive datasets. In the next article, we will use the historical data collected here to backtest a trading strategy using cryptocurrency pairs.

Backtesting A Trading Strategy Part 3

How To Backtest A Mean-reverting Trading Strategy Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities where we will invest in (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In the previous articles, I illustrated the first two steps in the backtesting process of determining the universe of stocks, and collecting historical data for each constituent. In this article, I will illustrate the last step of the process: implementing a mean-reverting trading strategy.

A Mean-reverting Trading Strategy

We implement a mean-reverting trading strategy based on Khandani and Lo. The idea is to buy the previous day’s “losers”, and sell the previous day’s “winners”. Stocks which underperform the market average are classified as “losers”; while stocks which outperform the market average are classified as “winners”.

For each stock i, we calculate the weight w_{i, t} at time t as

\displaystyle w_{i, t} = - \frac{1}{N} \left( r_{i, t-1} - r_{Market, t-1} \right).

where N is the total number of stocks in the investment universe. The market return at time t-1 is calculated as

\displaystyle r_{Market, t-1} = \frac{1}{N} \sum_{i=1}^{N} r_{i, t-1},

and the return of stock i at time t-1 is calculated as

\displaystyle r_{i, t-1} = \frac{S_{i, t-1} - S_{i, t-2}}{S_{i, t-2}},

where S_{i, t-1} is the spot price of stock i at time t-1 .
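
As a toy illustration of the weighting scheme (hypothetical numbers, three stocks), the sketch below shorts yesterday's winners and buys yesterday's losers:

import numpy as np

# Hypothetical previous-day returns of a three-stock universe
r_prev = np.array([0.01, 0.02, -0.03])

# Equally weighted market return
r_market = r_prev.mean()  # 0.0

# Weights: negative for yesterday's winners, positive for yesterday's losers
w = -(r_prev - r_market) / len(r_prev)  # [-0.0033, -0.0067, 0.01]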

Backtesting A Trading Strategy

Step By Step

  1. Load S&P indices historical data.
  2. Implement a mean-reverting trading strategy.
  3. Calculate the trading strategy’s performance using the Sharpe ratio.

You can find the code on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

Load S&P Historical Data

Set an id for each index

id = ['sp500', 'sp400', 'sp600']

Create a dictionary to map each id to a tickers file

input_file = {'sp500': 'sp500_data.pickle',
              'sp400': 'sp400_data.pickle',
              'sp600': 'sp600_data.pickle'}

Create a dictionary to map each id to a S&P historical data

sp500_data = pd.DataFrame()
sp400_data = pd.DataFrame()
sp600_data = pd.DataFrame()
sp_data = {'sp500': sp500_data,
           'sp400': sp400_data,
           'sp600': sp600_data}

Retrieve S&P historical data

for i in id:
    # Load data
    with open(input_file[i], 'rb') as f:
        sp_data[i] = pickle.load(f)
    f.close()

    # Select close prices
    sp_data[i] = sp_data[i].close

Implement A Mean-reverting Trading Strategy

Create a dictionary to map each id to a S&P returns data

sp500_returns = DataFrame()
sp400_returns = DataFrame()
sp600_returns = DataFrame()
sp_returns = {'sp500': sp500_returns,
              'sp400': sp400_returns,
              'sp600': sp600_returns}

Create a dictionary to map each id to a S&P market returns data

sp500_market_returns = Series()
sp400_market_returns = Series()
sp600_market_returns = Series()
sp_market_returns = {'sp500': sp500_market_returns,
                     'sp400': sp400_market_returns,
                     'sp600': sp600_market_returns}

Create a dictionary to map each id to a trading strategy weighting

sp500_weights = DataFrame()
sp400_weights = DataFrame()
sp600_weights = DataFrame()
sp_weights = {'sp500': sp500_weights,
              'sp400': sp400_weights,
              'sp600': sp600_weights}

Create a dictionary to map each id to a trading strategy pnl per stock

sp500_pnl = DataFrame()
sp400_pnl = DataFrame()
sp600_pnl = DataFrame()
sp_pnl = {'sp500': sp500_pnl,
          'sp400': sp400_pnl,
          'sp600': sp600_pnl}

Create a dictionary to map each id to a trading strategy pnl

sp500_pnl_net = Series()
sp400_pnl_net = Series()
sp600_pnl_net = Series()
sp_pnl_net = {'sp500': sp500_pnl_net,
              'sp400': sp400_pnl_net,
              'sp600': sp600_pnl_net}

Implement the mean-reverting trading strategy on the stock universes: the S&P 500, S&P MidCap 400, and S&P SmallCap 600 indices

for i in id:
    # Calculate the returns
    sp_returns[i] = sp_data[i].pct_change()

    # Calculate the equally weighted market returns
    sp_market_returns[i] = sp_returns[i].mean(axis='columns')

    # Calculate the weights of each stock
    sp_weights[i] = - (sp_returns[i].sub(sp_market_returns[i], axis='index')).div(sp_data[i].count(axis='columns'), axis='index')

    # Adjust the weights to 0 if price or return is NaN
    sp_weights[i][sp_data[i].isna() | sp_data[i].shift(periods=1).isna()] = 0

    # Calculate the daily pnl
    # Idea is to buy yesterday's losers, and sell yesterday's winners
    sp_pnl[i] = (sp_weights[i].shift(periods=1)).mul(sp_returns[i], axis='index')
    sp_pnl_net[i] = sp_pnl[i].sum(axis='columns')

Calculate The Performance Metrics Of A Trading Strategy

In this section, I will provide a brief review of the Sharpe ratio, which is a popular measure of a trading strategy’s performance. The Sharpe ratio is a special case of the more generic Information ratio (IR). Consequently, I will present the IR first, followed by the Sharpe ratio.

Information Ratio

The Information ratio measures the excess returns of a trading strategy over a benchmark, such as the S&P 500 index. The IR scales the excess returns with the standard deviation to measure the consistency of the trading strategy. We calculate the Information ratio using the formula

\displaystyle IR = \frac{r_{Strategy} - r_{Benchmark}}{\sigma_{Strategy}}.

In practice, we usually calculate the annualized IR as follows (see the sketch below):

  1. Calculate the average and standard deviation of the strategy’s daily returns in excess of the benchmark.
  2. Annualize the two metrics: multiply the average by 252 and the standard deviation by the square root of 252.
  3. Divide the annualized excess return by the annualized standard deviation, which reduces to the square root of 252 times the ratio of the daily average to the daily standard deviation.
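
A minimal sketch of this calculation, using hypothetical daily return series for the strategy and the benchmark and assuming 252 trading days per year:

import numpy as np

# Hypothetical daily returns of the strategy and the benchmark
strategy = np.array([0.0012, -0.0005, 0.0008, 0.0015, -0.0003])
benchmark = np.array([0.0010, -0.0002, 0.0005, 0.0010, 0.0001])

# Daily excess returns over the benchmark
excess = strategy - benchmark

# Annualized Information ratio
ir = np.sqrt(252) * excess.mean() / excess.std()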

Sharpe Ratio

The Sharpe ratio is a special case of the Information ratio where the benchmark is set to the risk-free rate. It allows for decomposition of a trading strategy’s profit and losses into risky and risk-free parts. The Sharpe ratio is popular because it facilitates comparison of different trading strategies using different benchmarks.

How To Calculate The Sharpe Ratio

The Sharpe ratio calculation depends on the trading strategy deployed.

Long Only Strategies

For long-only trading strategies, we calculate the Sharpe ratio as

\displaystyle Sharpe = \frac{r_{Long} - r_{Risk-free}}{\sigma_{Long}},

where the r_{Risk-free} is usually obtained from the treasury yield curve.

Long And Short Strategies

For long and short strategies holding equal amount of capital on both positions, also known as “dollar-neutral”, we have a simpler formula for calculating the Sharpe ratio

\displaystyle Sharpe = \frac{r_{Long-Short}}{\sigma_{Long-Short}}.

The r_{Risk-free} term disappears from the equation because the financing cost of the long positions is offset by the interest earned on the cash received from the short positions.
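
A minimal sketch of this calculation for a daily PnL series (hypothetical numbers; the same annualization with the square root of 252 is applied to the backtest PnL later in this article):

import numpy as np

# Hypothetical daily returns of a dollar-neutral long-short strategy
pnl_daily = np.array([0.002, -0.001, 0.0015, 0.0005, -0.0008])

# Annualized Sharpe ratio, assuming 252 trading days per year
sharpe = np.sqrt(252) * pnl_daily.mean() / pnl_daily.std()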

Performance Metrics Of A Mean-reverting Strategy

Divide the periods of observation

period = ['2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

Create a dictionary to map each id to a trading performance

sp500_performance = pd.DataFrame()
sp400_performance = pd.DataFrame()
sp600_performance = pd.DataFrame()
sp_performance = {'sp500': sp500_performance,
                  'sp400': sp400_performance,
                  'sp600': sp600_performance}

Calculate the trading strategy’s annualized Sharpe ratio, average daily returns and standard deviation

for i in id:
    # Initialize performance measures
    avg_returns = []
    std_returns = []
    sharpe = []

    # Calculate performance measures
    for j in range(len(period) - 1):
        # Period of observation
        start = period[j]
        end = period[j + 1]

        # Calculate average daily returns
        avg_returns.append(sp_pnl_net[i][start:end].mean())

        # Calculate standard deviation of daily returns
        std_returns.append(sp_pnl_net[i][start:end].std())

        # Calculate Sharpe ratio
        sharpe.append(np.sqrt(252) * avg_returns[j] / std_returns[j])

    # Update performance measures DataFrame
    sp_performance[i] = pd.DataFrame({'Avg Daily Returns': avg_returns,
                                      'Std Daily Returns': std_returns,
                                      'Sharpe Ratio': sharpe},
                                    index=['2014', '2015', '2016', '2017', '2018'])

Comparison Of The Mean-Reverting Strategy Using Different Stock Universes

Average Of Daily Returns

sp_avg_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Avg Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_avg_returns

        S&P 500     S&P 400     S&P 600
2014    0.0008%     0.001%      0.0015%
2015    -0.0%       -0.0007%    0.0004%
2016    -0.0001%    -0.0004%    0.001%
2017    0.0008%     0.0009%     0.0016%
2018    0.0004%     0.0004%     0.0004%

Standard Deviation Of Daily Returns

sp_std_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Std Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_std_returns

        S&P 500    S&P 400    S&P 600
2014    0.0052%    0.0095%    0.0134%
2015    0.0029%    0.0075%    0.0074%
2016    0.0068%    0.0111%    0.0125%
2017    0.0084%    0.0112%    0.0106%
2018    0.0037%    0.004%     0.0049%

Sharpe Ratio

sp_sharpe = pd.DataFrame({'S&P 500': sp_performance['sp500']['Sharpe Ratio'],
                          'S&P 400': sp_performance['sp400']['Sharpe Ratio'],
                          'S&P 600': sp_performance['sp600']['Sharpe Ratio']})
sp_sharpe

        S&P 500      S&P 400      S&P 600
2014    2.390120     1.710263     1.750619
2015    -0.136800    -1.462981    0.792153
2016    -0.319971    -0.571037    1.213862
2017    1.535340     1.259538     2.348185
2018    1.813091     1.471477     1.328564

The mean-reverting strategy’s Sharpe ratio is highest on the S&P 600 index, which is composed of small-cap stocks. The inverse relationship between profitability and market capitalization suggests that inefficiencies are more abundant in small-cap stocks. However, be mindful that small-capitalization stocks are less liquid instruments and carry higher transaction costs when traded.

Profit And Losses

Plot the cumulative pnl of the mean-reverting strategy by investment universe

plt.figure(figsize=[20, 10])

plt.plot(sp_pnl_net['sp500'].cumsum())
plt.plot(sp_pnl_net['sp400'].cumsum())
plt.plot(sp_pnl_net['sp600'].cumsum())
plt.title('Cumulative PnL')
plt.legend(id)

Improvements To The Backtesting Process

The backtesting process measures the performance of our trading strategy using historical data. We hope that the past performance of a trading strategy will carry over into the future. Unfortunately, no such guarantee exists.

Our focus should center on making the backtesting process as close to reality as possible by including transaction costs and correcting for the following:

  1. Data-snooping bias is the application of an overfitted model in the trading strategy.
  2. Look-ahead bias is the use of future data, or data not yet available, in the investment strategy.
  3. Survivorship bias is the absence from the investment universe of stocks belonging to companies that went bankrupt, merged or were acquired.

Conclusion

In this article, we implemented a mean-reverting trading strategy and backtested it on our universe of stocks – the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. The mean-reverting trading strategy performed best on the S&P 600 index, which is composed of small-capitalization stocks. In the next articles, I will illustrate improvements to the backtesting process by including transaction costs and correcting for potential biases.

Backtesting A Trading Strategy Part 2

How To Retrieve S&P Constituents Historical Data Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities where we will invest in (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In the previous article, I illustrated the first step in the backtesting process of determining the universe of stocks, namely the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. In this article, I will discuss the second step of the backtesting process of collecting historical data for each constituent of the universe of stocks.

Retrieving S&P Constituents Historical Data

Step By Step

  1. Load the S&P tickers which were gathered from the previous article.
  2. Collect the S&P constituents’ 5-year historical data using Python package pandas-datareader from the Investors Exchange (IEX).

You can find the code below on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import pandas as pd
from pandas import Series, DataFrame
import pickle
import pandas_datareader.data as web

S&P Constituents Tickers

In this section, we load the lists pickled from the last article.

Set an id for each index

id = ['sp500', 'sp400', 'sp600']

Create a dictionary to map each id to a tickers file

input_file = {'sp500': 'sp500_barchart.pickle',
              'sp400': 'sp400_barchart.pickle', 
              'sp600': 'sp600_barchart.pickle'} 

Define a dictionary to map each id to a tickers list

sp500_tickers = []
sp400_tickers = []
sp600_tickers = []
sp_tickers = {'sp500': sp500_tickers,
              'sp400': sp400_tickers,
              'sp600': sp600_tickers}

Fill the tickers lists

for i in input_file:
    with open(input_file[i], 'rb') as f:

        # Update tickers list        
        sp_tickers[i] = pickle.load(f)

        # Sort tickers list
        sp_tickers[i].sort()

    f.close()

S&P Constituents Historical Data

Define dictionary of historical data

sp500_data = pd.DataFrame()
sp400_data = pd.DataFrame()
sp600_data = pd.DataFrame()
sp_data = {'sp500': sp500_data,
           'sp400': sp400_data,
           'sp600': sp600_data}

Set the start and end dates of the historical data

start_date = '2014-01-01'
end_date = '2020-01-01'

Set the data source to the Investors Exchange (IEX)

source = 'iex'

Create a dictionary to map each id to an output file

output_file = {'sp500': 'sp500_data.pickle',
               'sp400': 'sp400_data.pickle',
               'sp600': 'sp600_data.pickle'}

Retrieve historical data for each constituent of each S&P index

for i in output_file:

    # Retrieve historical data 
    # Note that we limit each request to fewer than 100 tickers because DataReader gives an error when the number of tickers > 100
    data1 = web.DataReader(sp_tickers[i][:98], source, start_date, end_date)
    data2 = web.DataReader(sp_tickers[i][98:198], source, start_date, end_date)
    data3 = web.DataReader(sp_tickers[i][198:298], source, start_date, end_date)
    data4 = web.DataReader(sp_tickers[i][298:398], source, start_date, end_date)
    data5 = web.DataReader(sp_tickers[i][398:498], source, start_date, end_date)
    if i == 'sp400':
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5], axis=1, sort=True)
    elif i == 'sp500':
        data6 = web.DataReader(sp_tickers[i][498:], source, start_date, end_date)
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5, data6], axis=1, sort=True)
    elif i == 'sp600':
        data6 = web.DataReader(sp_tickers[i][498:598], source, start_date, end_date)
        data7 = web.DataReader(sp_tickers[i][598:], source, start_date, end_date)
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5, data6, data7], axis=1, sort=True)
    else:
        pass

    # Convert index to datetime
    sp_data[i].index = pd.to_datetime(sp_data[i].index)

    # Save historical data to file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_data[i], f)
    f.close()

Constituents Close Prices

S&P 500 Index

Look at the dimensions of our DataFrame

sp_data['sp500'].close.shape
(1258, 505)

Check the first rows

sp_data['sp500'].close.head()
Symbols       A          AAL        AAP         AAPL       ABBV
date
2014-06-11    40.1199    40.2868    125.0694    86.0249    45.0769
2014-06-12    39.7726    38.2958    123.2252    84.5860    44.6031
2014-06-13    39.8407    38.4672    123.7407    83.6603    45.0187
2014-06-16    39.7181    39.1150    124.0086    84.5035    44.8857
2014-06-17    40.0927    39.8867    124.9411    84.3935    45.1351

5 rows × 505 columns

Check the end rows

sp_data['sp500'].close.tail()
Symbols       A        AAL      AAP       AAPL      ABBV
date
2019-06-04    67.95    29.12    154.61    179.64    76.75
2019-06-05    68.35    30.36    154.61    182.54    77.06
2019-06-06    69.16    30.38    154.90    185.22    77.07
2019-06-07    69.52    30.92    155.35    190.15    77.43
2019-06-10    70.29    30.76    153.52    192.58    76.95

5 rows × 505 columns

Descriptive stats

sp_data['sp500'].close.describe()
Symbols    A              AAL            AAP            AAPL           ABBV
count      1258.000000    1258.000000    1258.000000    1258.000000    1258.000000
mean       52.010924      41.207727      145.418612     135.094399     66.653293
std        13.866577      6.366236       24.128339      38.127616      17.721243
min        32.258600      24.539800      79.168700      82.743800      42.066600
25%        39.393500      36.586125      132.519725     103.622650     53.170800
50%        46.097600      40.843300      150.457700     119.473800     58.470800
75%        65.591850      46.119900      161.899875     167.841200     82.894300
max        81.940000      57.586600      199.159900     229.392000     116.445400

8 rows × 505 columns

S&P MidCap 400 Index

Look at the dimensions of our DataFrame

sp_data['sp400'].close.shape
(1128, 400)

Check the first rows

sp_data['sp400'].close.head()
Symbols       AAN    ACC    ACHC    ACIW    ACM
2014-05-07    NaN    NaN    NaN     NaN     NaN
2014-05-08    NaN    NaN    NaN     NaN     NaN
2014-05-09    NaN    NaN    NaN     NaN     NaN
2014-05-12    NaN    NaN    NaN     NaN     NaN
2014-05-13    NaN    NaN    NaN     NaN     NaN

5 rows × 400 columns

Check the end rows

sp_data['sp400'].close.tail()
Symbols       AAN      ACC      ACHC     ACIW     ACM
2019-06-04    54.88    46.17    33.72    32.10    33.17
2019-06-05    54.97    46.97    33.30    32.26    33.29
2019-06-06    54.72    47.02    33.20    32.40    33.18
2019-06-07    55.72    47.17    33.91    32.34    33.47
2019-06-10    59.23    46.81    33.81    32.75    33.45

5 rows × 400 columns

Descriptive stats

sp_data['sp400'].close.describe()
Symbols    AAN            ACC            ACHC           ACIW           ACM
count      1258.000000    1258.000000    1258.000000    1258.000000    1258.000000
mean       35.344696      39.074002      48.806781      22.656060      32.409610
std        9.825081       4.946453       13.903160      4.065761       3.178291
min        20.118600      27.793900      24.750000      13.617500      23.150000
25%        25.957675      34.718300      38.355000      19.470000      30.370000
50%        34.792300      39.727900      46.525000      22.275000      32.465000
75%        43.259125      43.422450      59.717500      24.240000      34.537500
max        59.230000      48.353300      82.970000      35.520000      40.130000

8 rows × 400 columns

S&P SmallCap 600 Index

Look at the dimensions of our DataFrame

sp_data['sp600'].close.shape
(1116, 601)

Check the first rows

sp_data['sp600'].close.head()
Symbols       AAOI     AAON       AAT        AAWW     AAXN
2014-06-10    NaN      NaN        NaN        NaN      NaN
2014-06-11    22.33    20.7695    30.1070    38.07    13.91
2014-06-12    22.30    20.5904    29.9566    37.05    14.03
2014-06-13    21.86    20.3792    29.9654    37.29    13.87
2014-06-16    22.71    20.5712    30.0273    36.95    13.78

5 rows × 601 columns

Check the end rows

sp_data['sp600'].close.tail()
Symbols       AAOI    AAON     AAT      AAWW     AAXN
date
2019-06-04    9.19    47.80    45.11    39.70    67.63
2019-06-05    9.23    48.08    46.39    38.62    68.00
2019-06-06    9.30    48.18    46.39    37.99    67.97
2019-06-07    9.50    48.85    46.61    39.88    69.18
2019-06-10    9.60    48.64    46.51    40.30    71.91

5 rows × 601 columns

Descriptive stats

sp_data['sp600'].close.describe()
Symbols    AAOI           AAON           AAT            AAWW           AAXN
count      1258.000000    1258.000000    1258.000000    1258.000000    1258.000000
mean       26.523188      29.595272      37.383077      49.431387      31.595862
std        17.116913      7.763096       3.595458       10.156758      15.917380
min        8.380000       16.287400      29.352300      31.400000      10.500000
25%        14.810000      22.305875      35.384150      40.970000      22.285000
50%        19.845000      30.684600      37.508550      49.150000      25.175000
75%        33.927500      35.473400      39.254350      56.542500      38.340000
max        99.610000      51.631400      46.770000      74.000000      74.890000

8 rows × 601 columns

Summary

In this article, we retrieved historical data for every constituent in our universe of stocks – the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. The 5-year historical data is relatively straightforward to obtain, and is provided for free by the Investors Exchange. In the next article, we implement a simple trading strategy, and backtest it using the historical data collected.

Backtesting A Trading Strategy Part 1

How To Scrape S&P Constituents Tickers Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities where we will invest in (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In this article, I will discuss the initial step in the backtesting process: determining the universe of securities. If we focus our attention on trading US equities, then a natural choice is the Standard and Poor’s 500 Index which is composed of shares of the 500 largest companies in the United States. The S&P 500 also provides the most liquid stocks. We can also consider the S&P MidCap 400 and S&P SmallCap 600 indices.

Standard & Poor’s Dow Jones Indices

This section provides a comparison of the different S&P indices which can be considered as possible universe of stocks.

S&P 500 Index

The S&P 500 index (or S&P 500) is a market capitalization-weighted index of the 500 largest US companies. It is viewed as the best gauge of large-cap US equity market. The S&P 500 has 505 constituents with a median capitalization of USD 22.3B.

S&P MidCap 400 Index

The S&P 400 index (or S&P 400) is a market capitalization-weighted index. It serves as a benchmark for mid-cap US equity market. The S&P 400 has 400 constituents with a median capitalization of USD 4.2B.

S&P SmallCap 600 Index

The S&P 600 index (or S&P 600) is a market capitalization-weighted index. It serves as a benchmark for small-cap US equity market. The S&P 600 has 601 constituents with a median capitalization of USD 1.2B.

After identifying potential candidate universes of stocks, we need to collect the list of constituents for each candidate universe. The list of constituents is not available on the official S&P Dow Jones Indices website; the constituents are only provided to product subscribers. We therefore need to find alternative data providers. After a quick search on Google, two candidates are available: Wikipedia and Barchart. Wikipedia provides the S&P constituents in the form of an HTML table, which we will retrieve using the Python package BeautifulSoup for web scraping. Barchart provides the S&P constituents as convenient downloadable CSV files; you just need to create a basic account with them, which fortunately is free.

Scraping S&P Constituents Tickers

Step By Step

  1. Collect the S&P tickers from Wikipedia, and then from Barchart.
  2. Compare the S&P symbols from the two providers.

You can find the code below on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import pandas as pd
import requests as re
from bs4 import BeautifulSoup
import pickle

From Wikipedia

Set an id for each index

id = ['sp500', 'sp400', 'sp1000']

Set the Wikipedia pages we want to scrape S&P indices data from. Note that a page for the S&P 600 list of constituents does not exist. However, we can deduce the list from the S&P 1000, which is just the combination of the S&P 400 and S&P 600.

input_file = {'sp500': 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
              'sp400': 'https://en.wikipedia.org/wiki/List_of_S%26P_400_companies',
              'sp1000': 'https://en.wikipedia.org/wiki/List_of_S%26P_1000_companies'}

Define the files we want to store extracted data to

output_file = {'sp500': 'sp500_wikipedia.pickle',
               'sp400': 'sp400_wikipedia.pickle',
               'sp1000': 'sp1000_wikipedia.pickle'}

Define the S&P constituents lists from Wikipedia

sp500_wikipedia = []
sp400_wikipedia = []
sp1000_wikipedia = []
sp_wikipedia = {'sp500': sp500_wikipedia,
                'sp400': sp400_wikipedia,
                'sp1000': sp1000_wikipedia}

The code below scrapes data using Python package BeautifulSoup, and saves the extracted data using Python package pickle

for i in input_file:

    # Get URL    
    r = re.get(input_file[i])

    # Create a soup object 
    soup = BeautifulSoup(r.text, 'html.parser')

    # Find S&P constituents table
    table = soup.find('table', attrs={'class': 'wikitable sortable'})

    # Get the rows containing the tickers
    tickers = table.find_all('a', attrs={'class': 'external text'})
    # find_all returns tickers and SEC filings, get tickers only
    tickers = tickers[::2]

    # Create a list containing the tickers
    for j in range(len(tickers)):
        sp_wikipedia[i].append(tickers[j].text)

    # Save the list to a file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_wikipedia[i], f)
    f.close()

Check the number of constituents, it should be equal to 505

len(sp500_wikipedia)
505

Check the number of constituents, it should be equal to 400

len(sp400_wikipedia)
400

Check the number of constituents, it should be equal to 1001

len(sp1000_wikipedia)
1001

Create a list of S&P 600 constituents given that the S&P 1000 index is the sum of S&P 400 and S&P 600 indices

sp600_wikipedia = list(set(sp1000_wikipedia) - set(sp400_wikipedia))

Check the number of constituents, it should be equal to 601

len(sp600_wikipedia)
598

In total, the Wikipedia tickers sum up to only 598, while the S&P Dow Jones Indices website indicates that there should be 601. The missing tickers, 3 in total, could be due to a timing difference in updating the S&P 400 and S&P 1000 lists.

From Barchart

id = ['sp500', 'sp400', 'sp600']

We download the files below in CSV format from Barchart. Note that you need to sign up first, free of charge, before getting access.

input_file = {'sp500': 's&p-500-index-05-04-2019.csv',
              'sp400': 'sp-400-index-05-04-2019.csv',
              'sp600': 'sp-600-index-05-04-2019.csv'}

Define the files we want to store extracted data to

output_file = {'sp500': 'sp500_barchart.pickle',
               'sp400': 'sp400_barchart.pickle',
               'sp600': 'sp600_barchart.pickle'}

Define the S&P constituents lists from Barchart

sp500_barchart = []
sp400_barchart = []
sp600_barchart = []
sp_barchart = {'sp500': sp500_barchart,
              'sp400': sp400_barchart,
              'sp600': sp600_barchart}

The code below reads the data from the csv file, stores it to a DataFrame object, and saves the extracted information using pickle

for i in input_file:

    # Read data to a DataFrame
    data = pd.read_csv(input_file[i])
    # Exclude the last line since it does not contain a ticker
    data = data[:-1]

    # Create a list containing the tickers
    for j in range(len(data['Symbol'])):
        sp_barchart[i].append(data['Symbol'].iloc[j])

    # Save the list to a file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_barchart[i], f)
    f.close()

Check the number of constituents, it should be equal to 505 according to S&P Dow Jones Indices website

len(sp500_barchart)
505

Check the number of constituents, it should be equal to 400 according to S&P Dow Jones Indices website

len(sp400_barchart)
400

Check the number of constituents, it should be equal to 601 according to S&P Dow Jones Indices website

len(sp600_barchart)
601

Comparison Between Wikipedia and Barchart

S&P 500 Index

Sort the lists

sp500_wikipedia.sort()
sp500_barchart.sort()

Compare the first 10 tickers

sp500_wikipedia[:10]
['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']
sp500_barchart[:10]
['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp500_wikipedia) - set(sp500_barchart))
diff_wikipedia_barchart
[]

There is no difference between the Wikipedia and Barchart lists.

S&P MidCap 400 Index

Sort the lists

sp400_wikipedia.sort()
sp400_barchart.sort()

Compare the first 10 tickers

sp400_wikipedia[:10]
['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'AHL']
sp400_barchart[:10]
['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'ALE']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp400_wikipedia) - set(sp400_barchart))
diff_wikipedia_barchart[:10]
['LPNT', 'NBR', 'AHL', 'DRQ', 'SPN', 'WAB', 'DNB', 'EGN', 'JKHY', 'GPOR']
len(diff_wikipedia_barchart)
28

Now, compare all the tickers by calculating the difference between Barchart and Wikipedia

diff_barchart_wikipedia = list(set(sp400_barchart) - set(sp400_wikipedia))
diff_barchart_wikipedia[:10]
['TREX', 'PEB', 'NGVT', 'XPO', 'CVET', 'PRSP', 'BHF', 'REZI', 'AMED', 'SRCL']
len(diff_barchart_wikipedia)
28

The difference between the two providers, Wikipedia and Barchart, is 28 tickers, which suggests that one of them has a more up-to-date list.

S&P SmallCap 600 Index

Sort the lists

sp600_wikipedia.sort()
sp600_barchart.sort()

Compare the first 10 tickers

sp600_wikipedia[:10]
['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACET', 'ACLS']
sp600_barchart[:10]
['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACA', 'ACLS']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp600_wikipedia) - set(sp600_barchart))
diff_wikipedia_barchart[:10]
['GNBC', 'TREX', 'CLD', 'WCG', 'QHC', 'NGVT', 'FTK', 'DSW', 'PRSP', 'AMED']
len(diff_wikipedia_barchart)
51

Now, compare all the tickers by calculating the difference between Barchart and Wikipedia

diff_barchart_wikipedia = list(set(sp600_barchart) - set(sp600_wikipedia))
diff_barchart_wikipedia[:10]
['TCMD', 'IIPR', 'NBR', 'CSII', 'CCS', 'DRQ', 'DBI', 'SPN', 'TRHC', 'LPI']
len(diff_barchart_wikipedia)
54

In total, the Wikipedia constituents sum up to only 598, while the Barchart constituents sum up to 601 (complete):

  1. As previously noted, there are 3 tickers missing from the Wikipedia list, which could be due to a timing difference in updating the S&P 400 and S&P 1000 lists.
  2. The difference between the two providers, Wikipedia and Barchart, is 51 tickers, which suggests that one of them has a more up-to-date list.

Summary

In this article, we identified candidate universes of securities, namely the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. We retrieved the constituents of each index from alternative data providers, namely Wikipedia and Barchart. The list of tickers for the S&P 500 is consistent between the two sources, while the lists of tickers for the S&P MidCap 400 and S&P SmallCap 600 are not identical. There also seems to be an inconsistency between the components of the S&P 400 and S&P 1000 indices on Wikipedia. As a result, we will use the S&P constituents from Barchart in the next article, where we will retrieve historical data for every ticker in every index.