Backtesting A Trading Strategy Part 3

How To Backtest A Mean-reverting Trading Strategy Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities in which we will invest (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In the previous articles, I illustrated the first two steps in the backtesting process: determining the universe of stocks and collecting historical data for each constituent. In this article, I will illustrate the last step of the process: implementing a mean-reverting trading strategy.

A Mean-reverting Trading Strategy

We implement a mean-reverting trading strategy based on Khandani and Lo. The idea is to buy the previous day’s “losers” and sell the previous day’s “winners”. Stocks that underperform the market average are classified as “losers”, while stocks that outperform the market average are classified as “winners”.

For each stock i, we calculate its weight w_{i, t} at time t as

\displaystyle w_{i, t} = - \frac{1}{N} \left( r_{i, t-1} - r_{Market, t-1} \right).

where N is the total number of stocks in the investment universe. The market return at time t-1 is calculated as

\displaystyle r_{Market, t-1} = \frac{1}{N} \sum_{i=1}^{N} r_{i, t-1},

and the return of stock i at time t-1 is calculated as

\displaystyle r_{i, t-1} = \frac{S_{i, t-1} - S_{i, t-2}}{S_{i, t-2}},

where S_{i, t-1} is the spot price of stock i at time t-1.
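To make the weighting scheme concrete, below is a minimal toy example with made-up returns for N = 3 stocks (purely illustrative, not part of the backtest):

import pandas as pd

# Yesterday's returns for a toy universe of N = 3 stocks (made-up numbers)
returns = pd.Series({'A': 0.02, 'B': -0.01, 'C': 0.005})

# Equally weighted market return: (0.02 - 0.01 + 0.005) / 3 = 0.005
market_return = returns.mean()

# Weight formula: w_i = -(r_i - r_market) / N
weights = -(returns - market_return) / len(returns)

print(weights)
# A   -0.005  (outperformed the market, so we sell the "winner")
# B    0.005  (underperformed the market, so we buy the "loser")
# C    0.000  (in line with the market, so we take no position)

Note that the weights sum to zero by construction, which makes the strategy dollar-neutral.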

Backtesting A Trading Strategy

Step By Step

  1. Load S&P indices historical data.
  2. Implement a mean-reverting trading strategy.
  3. Calculate the trading strategy’s performance using the Sharpe ratio.

You can find the code on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

Load S&P Historical Data

Set an id for each index

id = ['sp500', 'sp400', 'sp600']

Create a dictionary to map each id to a tickers file

input_file = {'sp500': 'sp500_data.pickle',
              'sp400': 'sp400_data.pickle',
              'sp600': 'sp600_data.pickle'}

Create a dictionary to map each id to a S&P historical data

sp500_data = pd.DataFrame()
sp400_data = pd.DataFrame()
sp600_data = pd.DataFrame()
sp_data = {'sp500': sp500_data,
           'sp400': sp400_data,
           'sp600': sp600_data}

Retrieve S&P historical data

for i in id:
    # Load data
    with open(input_file[i], 'rb') as f:
        sp_data[i] = pickle.load(f)

    # Select close prices
    sp_data[i] = sp_data[i].close

Implement A Mean-reverting Trading Strategy

Create a dictionary to map each id to a S&P returns data

sp500_returns = DataFrame()
sp400_returns = DataFrame()
sp600_returns = DataFrame()
sp_returns = {'sp500': sp500_returns,
              'sp400': sp400_returns,
              'sp600': sp600_returns}

Create a dictionary to map each id to a S&P market returns data

sp500_market_returns = Series()
sp400_market_returns = Series()
sp600_market_returns = Series()
sp_market_returns = {'sp500': sp500_market_returns,
                     'sp400': sp400_market_returns,
                     'sp600': sp600_market_returns}

Create a dictionary to map each id to a trading strategy weighting

sp500_weights = DataFrame()
sp400_weights = DataFrame()
sp600_weights = DataFrame()
sp_weights = {'sp500': sp500_weights,
              'sp400': sp400_weights,
              'sp600': sp600_weights}

Create a dictionary to map each id to a trading strategy pnl per stock

sp500_pnl = DataFrame()
sp400_pnl = DataFrame()
sp600_pnl = DataFrame()
sp_pnl = {'sp500': sp500_pnl,
          'sp400': sp400_pnl,
          'sp600': sp600_pnl}

Create a dictionary to map each id to a trading strategy pnl

sp500_pnl_net = Series()
sp400_pnl_net = Series()
sp600_pnl_net = Series()
sp_pnl_net = {'sp500': sp500_pnl_net,
              'sp400': sp400_pnl_net,
              'sp600': sp600_pnl_net}

Implement the mean-reverting trading strategy on the three stock universes: the S&P 500, S&P MidCap 400, and S&P SmallCap 600 indices

for i in id:
    # Calculate the returns
    sp_returns[i] = sp_data[i].pct_change()

    # Calculate the equally weighted market returns
    sp_market_returns[i] = sp_returns[i].mean(axis='columns')

    # Calculate the weights of each stock
    sp_weights[i] = - (sp_returns[i].sub(sp_market_returns[i], axis='index')).div(sp_data[i].count(axis='columns'), axis='index')

    # Adjust the weights to 0 if price or return is NaN
    sp_weights[i][sp_data[i].isna() | sp_data[i].shift(periods=1).isna()] = 0

    # Calculate the daily pnl
    # Idea is to buy yesterday's losers, and sell yesterday's winners
    sp_pnl[i] = (sp_weights[i].shift(periods=1)).mul(sp_returns[i], axis='index')
    sp_pnl_net[i] = sp_pnl[i].sum(axis='columns')
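As a quick sanity check, the weights should sum to (approximately) zero on each day, confirming that the strategy is dollar-neutral by construction; small deviations can appear on days affected by the NaN adjustment above:

# Maximum absolute daily net exposure; should be close to zero
print(sp_weights['sp500'].sum(axis='columns').abs().max())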

Calculate The Performance Metrics Of A Trading Strategy

In this section, I will provide a brief review of the Sharpe ratio, a popular measure of a trading strategy’s performance. The Sharpe ratio is a special case of the more general Information ratio (IR), so I will present the IR first, followed by the Sharpe ratio.

Information Ratio

The Information ratio measures the excess returns of a trading strategy over a benchmark, such as the S&P 500 index. The IR scales the excess returns with the standard deviation to measure the consistency of the trading strategy. We calculate the Information ratio using the formula

\displaystyle IR = \frac{r_{Strategy} - r_{Benchmark}}{\sigma_{Strategy}}.

In practice, we usually calculate the annualized IR from daily data as follows (a sketch follows the list):

  1. Calculate the average and standard deviation of the strategy’s daily returns, and the average daily return of the benchmark.
  2. Annualize the metrics: multiply the averages by 252 (the number of trading days in a year), and the standard deviation by the square root of 252.
  3. Divide the annualized excess return of the strategy over the benchmark by the annualized standard deviation.
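Here is a minimal sketch of that calculation, assuming the strategy and benchmark daily returns are aligned pandas Series and a 252-trading-day year:

import numpy as np

def annualized_ir(strategy_returns, benchmark_returns, periods=252):
    # Annualized excess return: average daily excess return scaled by 252
    excess = periods * (strategy_returns.mean() - benchmark_returns.mean())
    # Annualized volatility: daily standard deviation scaled by sqrt(252)
    vol = np.sqrt(periods) * strategy_returns.std()
    return excess / vol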

Sharpe Ratio

The Sharpe ratio is a special case of the Information ratio where the benchmark is set to the risk-free rate. It allows us to decompose a trading strategy’s profit and losses into a risky part and a risk-free part. The Sharpe ratio is popular because it facilitates the comparison of trading strategies that would otherwise be measured against different benchmarks.

How To Calculate The Sharpe Ratio

The Sharpe ratio calculation depends on the trading strategy deployed.

Long Only Strategies

For long-only trading strategies, we calculate the Sharpe ratio as

\displaystyle Sharpe = \frac{r_{Long} - r_{Risk-free}}{\sigma_{Long}},

where the r_{Risk-free} is usually obtained from the treasury yield curve.

Long And Short Strategies

For long and short strategies holding equal amounts of capital in the long and short positions, also known as “dollar-neutral” strategies, the Sharpe ratio formula simplifies to

\displaystyle Sharpe = \frac{r_{Long-Short}}{\sigma_{Long-Short}}.

The r_{Risk-free} term disappears from the equation because the cash received from the short positions earns the same risk-free rate.
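A minimal sketch for a dollar-neutral strategy, assuming the daily PnL is a pandas Series; this is exactly the sqrt(252) * mean / std calculation used in the performance loop below:

import numpy as np

def annualized_sharpe(daily_pnl, periods=252):
    # Annualized Sharpe ratio of a dollar-neutral strategy: no risk-free term
    return np.sqrt(periods) * daily_pnl.mean() / daily_pnl.std()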

Performance Metrics Of A Mean-reverting Strategy

Divide the observation window into yearly periods

period = ['2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

Create a dictionary to map each id to a trading performance

sp500_performance = pd.DataFrame()
sp400_performance = pd.DataFrame()
sp600_performance = pd.DataFrame()
sp_performance = {'sp500': sp500_performance,
                  'sp400': sp400_performance,
                  'sp600': sp600_performance}

Calculate the trading strategy’s annualized Sharpe ratio, average daily returns and standard deviation

for i in id:
    # Initialize performance measures
    avg_returns = []
    std_returns = []
    sharpe = []

    # Calculate performance measures
    for j in range(len(period) - 1):
        # Period of observation
        start = period[j]
        end = period[j + 1]

        # Calculate average daily returns
        avg_returns.append(sp_pnl_net[i][start:end].mean())

        # Calculate standard deviation of daily returns
        std_returns.append(sp_pnl_net[i][start:end].std())

        # Calculate Sharpe ratio
        sharpe.append(np.sqrt(252) * avg_returns[j] / std_returns[j])

    # Update performance measures DataFrame
    sp_performance[i] = pd.DataFrame({'Avg Daily Returns': avg_returns,
                                      'Std Daily Returns': std_returns,
                                      'Sharpe Ratio': sharpe},
                                    index=['2014', '2015', '2016', '2017', '2018'])

Comparison Of The Mean-Reverting Strategy Using Different Stock Universes

Average Of Daily Returns

sp_avg_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Avg Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Avg Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_avg_returns

       S&P 500   S&P 400   S&P 600
2014   0.0008%    0.001%   0.0015%
2015     -0.0%  -0.0007%   0.0004%
2016  -0.0001%  -0.0004%    0.001%
2017   0.0008%   0.0009%   0.0016%
2018   0.0004%   0.0004%   0.0004%
Standard Deviation Of Daily Returns

sp_std_returns = pd.DataFrame({'S&P 500': (sp_performance['sp500']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 400': (sp_performance['sp400']['Std Daily Returns'] * 100).round(4).astype('str') + '%',
                               'S&P 600': (sp_performance['sp600']['Std Daily Returns'] * 100).round(4).astype('str') + '%'})
sp_std_returns

      S&P 500  S&P 400  S&P 600
2014  0.0052%  0.0095%  0.0134%
2015  0.0029%  0.0075%  0.0074%
2016  0.0068%  0.0111%  0.0125%
2017  0.0084%  0.0112%  0.0106%
2018  0.0037%   0.004%  0.0049%
Sharpe Ratio

sp_sharpe = pd.DataFrame({'S&P 500': sp_performance['sp500']['Sharpe Ratio'],
                          'S&P 400': sp_performance['sp400']['Sharpe Ratio'],
                          'S&P 600': sp_performance['sp600']['Sharpe Ratio']})
sp_sharpe

       S&P 500   S&P 400   S&P 600
2014  2.390120  1.710263  1.750619
2015 -0.136800 -1.462981  0.792153
2016 -0.319971 -0.571037  1.213862
2017  1.535340  1.259538  2.348185
2018  1.813091  1.471477  1.328564

The mean-reverting strategy’s Sharpe ratio is highest on the S&P 600 index, which is composed of small-cap stocks. The inverse relationship between profitability and market capitalization suggests that inefficiencies are more abundant in small-cap stocks. However, be mindful that small-cap stocks are less liquid instruments and carry higher transaction costs when traded.
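As a rough illustration of the impact of transaction costs, we could charge a fixed cost proportional to the daily turnover; the 5 bps rate below is an assumption for the sketch, not a calibrated estimate:

# Assumed cost of 5 bps (0.05%) per unit of notional traded
cost_rate = 0.0005

# Daily turnover: total absolute change in the weights
turnover = (sp_weights['sp600'] - sp_weights['sp600'].shift(periods=1)).abs().sum(axis='columns')

# Net PnL after deducting the assumed transaction costs
sp600_pnl_after_costs = sp_pnl_net['sp600'] - cost_rate * turnover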

Profit And Losses

Plot the cumulative pnl of the mean-reverting strategy by investment universe

plt.figure(figsize=[20, 10])

plt.plot(sp_pnl_net['sp500'].cumsum())
plt.plot(sp_pnl_net['sp400'].cumsum())
plt.plot(sp_pnl_net['sp600'].cumsum())
plt.title('Cumulative PnL')
plt.legend(id)

Improvements To The Backtesting Process

The backtesting process measures the performance of our trading strategy using historical data. We hope that the past performance of a trading strategy will carry over into the future. Unfortunately, no such guarantee exists.

Our focus should center on rendering the backtesting process as close to reality as possible by including transaction costs, and by correcting for the following biases:

  1. Data-snooping bias: applying an overfitted model in the trading strategy.
  2. Look-ahead bias: using future data, or data not yet available, in the investment strategy (see the sketch after this list).
  3. Survivorship bias: excluding from the investment universe the stocks of companies that went bankrupt, or were merged or acquired.
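As a minimal sketch of how look-ahead bias can creep in, compare the strategy PnL above with an invalid variant that forgets to lag the weights; the biased version lets day-t positions depend on day-t returns, which were not yet observable when the positions were taken:

# Valid: positions formed from yesterday's information (as in the strategy above)
pnl_valid = sp_weights['sp500'].shift(periods=1).mul(sp_returns['sp500']).sum(axis='columns')

# Invalid: positions formed from same-day returns (look-ahead bias)
pnl_biased = sp_weights['sp500'].mul(sp_returns['sp500']).sum(axis='columns')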

Conclusion

In this article, we implemented a mean-reverting trading strategy and backtested it on our universe of stocks: the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. The mean-reverting trading strategy performed best on the S&P 600 index, which is composed of small-capitalization stocks. In the next articles, I will illustrate improvements to the backtesting process by including transaction costs and correcting for potential biases.


Backtesting A Trading Strategy Part 2

How To Retrieve S&P Constituents Historical Data Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities in which we will invest (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In the previous article, I illustrated the first step in the backtesting process: determining the universe of stocks, namely the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. In this article, I will discuss the second step: collecting historical data for each constituent of the universe of stocks.

Retrieving S&P Constituents Historical Data

Step By Step

  1. Load the S&P tickers which were gathered from the previous article.
  2. Collect the S&P constituents’ 5-year historical data from the Investors Exchange (IEX) using the Python package pandas-datareader.

You can find the code below on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import pandas as pd
from pandas import Series, DataFrame
import pickle
import pandas_datareader.data as web

S&P Constituents Tickers

In this section, we load the lists pickled from the last article.

Set an id for each index

id = ['sp500', 'sp400', 'sp600']

Create a dictionary to map each id to a tickers file

input_file = {'sp500': 'sp500_barchart.pickle',
              'sp400': 'sp400_barchart.pickle', 
              'sp600': 'sp600_barchart.pickle'} 

Define a dictionary to map each id to a tickers list

sp500_tickers = []
sp400_tickers = []
sp600_tickers = []
sp_tickers = {'sp500': sp500_tickers,
              'sp400': sp400_tickers,
              'sp600': sp600_tickers}

Fill the tickers lists

for i in input_file:
    with open(input_file[i], 'rb') as f:

        # Update tickers list
        sp_tickers[i] = pickle.load(f)

        # Sort tickers list
        sp_tickers[i].sort()

S&P Constituents Historical Data

Define dictionary of historical data

sp500_data = pd.DataFrame()
sp400_data = pd.DataFrame()
sp600_data = pd.DataFrame()
sp_data = {'sp500': sp500_data,
           'sp400': sp400_data,
           'sp600': sp600_data}

Set the start and end dates of the historical data

start_date = '2014-01-01'
end_date = '2020-01-01'

Set Investors Exchange (IEX) as the data source

source = 'iex'
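Note that newer versions of pandas-datareader require an IEX Cloud API token, read from the IEX_API_KEY environment variable; the token below is a placeholder, replace it with your own:

import os

# Placeholder token (assumption about your setup); use your own IEX Cloud key
os.environ['IEX_API_KEY'] = 'pk_your_token_here'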

Create a dictionary to map each id to an output file

output_file = {'sp500': 'sp500_data.pickle',
               'sp400': 'sp400_data.pickle',
               'sp600': 'sp600_data.pickle'}

Retrieve historical data for each constituent of each S&P index

for i in output_file:

    # Retrieve historical data in chunks
    # Note: DataReader returns an error when passed more than 100 tickers at a time
    data1 = web.DataReader(sp_tickers[i][:98], source, start_date, end_date)
    data2 = web.DataReader(sp_tickers[i][98:198], source, start_date, end_date)
    data3 = web.DataReader(sp_tickers[i][198:298], source, start_date, end_date)
    data4 = web.DataReader(sp_tickers[i][298:398], source, start_date, end_date)
    data5 = web.DataReader(sp_tickers[i][398:498], source, start_date, end_date)
    if i == 'sp400':
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5], axis=1, sort=True)
    elif i == 'sp500':
        data6 = web.DataReader(sp_tickers[i][498:], source, start_date, end_date)
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5, data6], axis=1, sort=True)
    elif i == 'sp600':
        data6 = web.DataReader(sp_tickers[i][498:598], source, start_date, end_date)
        data7 = web.DataReader(sp_tickers[i][598:], source, start_date, end_date)
        # Concatenate historical data
        sp_data[i] = pd.concat([data1, data2, data3, data4, data5, data6, data7], axis=1, sort=True)

    # Convert index to datetime
    sp_data[i].index = pd.to_datetime(sp_data[i].index)

    # Save historical data to file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_data[i], f)
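A more compact alternative to the hard-coded data1..data7 chunks above is a small helper that downloads any list of tickers in chunks of fewer than 100; download_in_chunks is a hypothetical name, a sketch rather than part of the original code:

def download_in_chunks(tickers, source, start, end, chunk_size=99):
    # Download at most chunk_size tickers per request, then concatenate
    frames = []
    for k in range(0, len(tickers), chunk_size):
        frames.append(web.DataReader(tickers[k:k + chunk_size], source, start, end))
    return pd.concat(frames, axis=1, sort=True)

# Usage: sp_data[i] = download_in_chunks(sp_tickers[i], source, start_date, end_date)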

Constituents Close Prices

S&P 500 Index

Look at the dimensions of our DataFrame

sp_data['sp500'].close.shape
(1258, 505)

Check the first rows

sp_data['sp500'].close.head()
Symbols           A      AAL       AAP     AAPL     ABBV
date
2014-06-11  40.1199  40.2868  125.0694  86.0249  45.0769
2014-06-12  39.7726  38.2958  123.2252  84.5860  44.6031
2014-06-13  39.8407  38.4672  123.7407  83.6603  45.0187
2014-06-16  39.7181  39.1150  124.0086  84.5035  44.8857
2014-06-17  40.0927  39.8867  124.9411  84.3935  45.1351

5 rows × 505 columns

Check the end rows

sp_data['sp500'].close.tail()
Symbols         A    AAL     AAP    AAPL   ABBV
date
2019-06-04  67.95  29.12  154.61  179.64  76.75
2019-06-05  68.35  30.36  154.61  182.54  77.06
2019-06-06  69.16  30.38  154.90  185.22  77.07
2019-06-07  69.52  30.92  155.35  190.15  77.43
2019-06-10  70.29  30.76  153.52  192.58  76.95

5 rows × 505 columns

Descriptive stats

sp_data['sp500'].close.describe()
Symbols            A          AAL          AAP         AAPL         ABBV
count    1258.000000  1258.000000  1258.000000  1258.000000  1258.000000
mean       52.010924    41.207727   145.418612   135.094399    66.653293
std        13.866577     6.366236    24.128339    38.127616    17.721243
min        32.258600    24.539800    79.168700    82.743800    42.066600
25%        39.393500    36.586125   132.519725   103.622650    53.170800
50%        46.097600    40.843300   150.457700   119.473800    58.470800
75%        65.591850    46.119900   161.899875   167.841200    82.894300
max        81.940000    57.586600   199.159900   229.392000   116.445400

8 rows × 505 columns

S&P MidCap 400 Index

Look at the dimensions of our DataFrame

sp_data['sp400'].close.shape
(1128, 400)

Check the first rows

sp_data['sp400'].close.head()
Symbols     AAN  ACC  ACHC  ACIW  ACM
2014-05-07  NaN  NaN   NaN   NaN  NaN
2014-05-08  NaN  NaN   NaN   NaN  NaN
2014-05-09  NaN  NaN   NaN   NaN  NaN
2014-05-12  NaN  NaN   NaN   NaN  NaN
2014-05-13  NaN  NaN   NaN   NaN  NaN

5 rows × 400 columns

Check the end rows

sp_data['sp400'].close.tail()
Symbols       AAN    ACC   ACHC   ACIW    ACM
2019-06-04  54.88  46.17  33.72  32.10  33.17
2019-06-05  54.97  46.97  33.30  32.26  33.29
2019-06-06  54.72  47.02  33.20  32.40  33.18
2019-06-07  55.72  47.17  33.91  32.34  33.47
2019-06-10  59.23  46.81  33.81  32.75  33.45

5 rows × 400 columns

Descriptive stats

sp_data['sp400'].close.describe()
Symbols          AAN          ACC         ACHC         ACIW          ACM
count    1258.000000  1258.000000  1258.000000  1258.000000  1258.000000
mean       35.344696    39.074002    48.806781    22.656060    32.409610
std         9.825081     4.946453    13.903160     4.065761     3.178291
min        20.118600    27.793900    24.750000    13.617500    23.150000
25%        25.957675    34.718300    38.355000    19.470000    30.370000
50%        34.792300    39.727900    46.525000    22.275000    32.465000
75%        43.259125    43.422450    59.717500    24.240000    34.537500
max        59.230000    48.353300    82.970000    35.520000    40.130000

8 rows × 400 columns

S&P SmallCap 600 Index

Look at the dimensions of our DataFrame

sp_data['sp600'].close.shape
(1116, 601)

Check the first rows

sp_data['sp600'].close.head()
Symbols      AAOI     AAON      AAT   AAWW   AAXN
2014-06-10    NaN      NaN      NaN    NaN    NaN
2014-06-11  22.33  20.7695  30.1070  38.07  13.91
2014-06-12  22.30  20.5904  29.9566  37.05  14.03
2014-06-13  21.86  20.3792  29.9654  37.29  13.87
2014-06-16  22.71  20.5712  30.0273  36.95  13.78

5 rows × 601 columns

Check the end rows

sp_data['sp600'].close.tail()
Symbols     AAOI   AAON    AAT   AAWW   AAXN
date
2019-06-04  9.19  47.80  45.11  39.70  67.63
2019-06-05  9.23  48.08  46.39  38.62  68.00
2019-06-06  9.30  48.18  46.39  37.99  67.97
2019-06-07  9.50  48.85  46.61  39.88  69.18
2019-06-10  9.60  48.64  46.51  40.30  71.91

5 rows × 601 columns

Descriptive stats

sp_data['sp600'].close.describe()
Symbols         AAOI         AAON          AAT         AAWW         AAXN
count    1258.000000  1258.000000  1258.000000  1258.000000  1258.000000
mean       26.523188    29.595272    37.383077    49.431387    31.595862
std        17.116913     7.763096     3.595458    10.156758    15.917380
min         8.380000    16.287400    29.352300    31.400000    10.500000
25%        14.810000    22.305875    35.384150    40.970000    22.285000
50%        19.845000    30.684600    37.508550    49.150000    25.175000
75%        33.927500    35.473400    39.254350    56.542500    38.340000
max        99.610000    51.631400    46.770000    74.000000    74.890000

8 rows × 601 columns

Summary

In this article, we retrieved historical data for every constituent in our universe of stocks: the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. The 5-year historical data is relatively straightforward to obtain, and is provided for free by the Investors Exchange. In the next article, we will implement a simple trading strategy and backtest it using the historical data collected.

Backtesting A Trading Strategy Part 1

How To Scrape S&P Constituents Tickers Using Python

Photo by the author

Introduction

Backtesting is a tool to measure the performance of a trading strategy using historical data. The backtesting process consists of three parts: 1. determining the universe of securities in which we will invest (e.g. equity or fixed income? US or emerging markets?); 2. gathering historical data for the universe of securities; and 3. implementing a trading strategy using the historical data collected.

In this article, I will discuss the initial step in the backtesting process: determining the universe of securities. If we focus our attention on trading US equities, then a natural choice is the Standard and Poor’s 500 Index, which is composed of shares of the 500 largest companies in the United States. The S&P 500 also contains the most liquid stocks. We can also consider the S&P MidCap 400 and S&P SmallCap 600 indices.

Standard & Poor’s Dow Jones Indices

This section provides a comparison of the different S&P indices which can be considered as possible universes of stocks.

S&P 500 Index

The S&P 500 index (or S&P 500) is a market capitalization-weighted index of the 500 largest US companies. It is viewed as the best gauge of the large-cap US equity market. The S&P 500 has 505 constituents with a median capitalization of USD 22.3B.

S&P MidCap 400 Index

The S&P 400 index (or S&P 400) is a market capitalization-weighted index. It serves as a benchmark for the mid-cap US equity market. The S&P 400 has 400 constituents with a median capitalization of USD 4.2B.

S&P SmallCap 600 Index

The S&P 600 index (or S&P 600) is a market capitalization-weighted index. It serves as a benchmark for the small-cap US equity market. The S&P 600 has 601 constituents with a median capitalization of USD 1.2B.

After identifying potential stock universe candidates, we need to collect the list of constituents for each one. The list of constituents is not available on the official S&P Dow Jones Indices website; the constituents are only provided to product subscribers. We therefore need to find alternative data providers. After a quick search on Google, two candidates are available: Wikipedia and Barchart. Wikipedia provides the S&P constituents in the form of an HTML table, which we will need to retrieve using the Python package BeautifulSoup for web scraping. Barchart provides the S&P constituents as convenient downloadable CSV files; you just need to create a basic account with them, which fortunately is free.

Scraping S&P Constituents Tickers

Step By Step

  1. Collect the S&P tickers from Wikipedia, and then from Barchart.
  2. Compare the S&P symbols from the two providers.

You can find the code below on https://github.com/DinodC/backtesting-trading-strategy.

Import packages

import pandas as pd
import requests
from bs4 import BeautifulSoup
import pickle

From Wikipedia

Set an id for each index

id = ['sp500', 'sp400', 'sp1000']

Set the pages we want to scrape S&P indices data from on Wikipedia. Note that a page for the S&P 600 list of constituents does not exist; however, we can deduce the list from the S&P 1000, which is just the combination of the S&P 400 and S&P 600.

input_file = {'sp500': 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',
              'sp400': 'https://en.wikipedia.org/wiki/List_of_S%26P_400_companies',
              'sp1000': 'https://en.wikipedia.org/wiki/List_of_S%26P_1000_companies'}

Define the files we want to store extracted data to

output_file = {'sp500': 'sp500_wikipedia.pickle',
               'sp400': 'sp400_wikipedia.pickle',
               'sp1000': 'sp1000_wikipedia.pickle'}

Define the S&P constituents lists from Wikipedia

sp500_wikipedia = []
sp400_wikipedia = []
sp1000_wikipedia = []
sp_wikipedia = {'sp500': sp500_wikipedia,
                'sp400': sp400_wikipedia,
                'sp1000': sp1000_wikipedia}

The code below scrapes data using the Python package BeautifulSoup, and saves the extracted data using the Python package pickle

for i in input_file:

    # Get URL
    r = requests.get(input_file[i])

    # Create a soup object
    soup = BeautifulSoup(r.text, 'html.parser')

    # Find S&P constituents table
    table = soup.find('table', attrs={'class': 'wikitable sortable'})

    # Get the rows containing the tickers
    tickers = table.find_all('a', attrs={'class': 'external text'})
    # find_all returns tickers and SEC filings, keep the tickers only
    tickers = tickers[::2]

    # Create a list containing the tickers
    for j in range(len(tickers)):
        sp_wikipedia[i].append(tickers[j].text)

    # Save the list to a file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_wikipedia[i], f)

Check the number of constituents; it should be equal to 505

len(sp500_wikipedia)
505

Check the number of constituents; it should be equal to 400

len(sp400_wikipedia)
400

Check the number of constituents; it should be equal to 1001

len(sp1000_wikipedia)
1001

Create a list of S&P 600 constituents, given that the S&P 1000 index is the combination of the S&P 400 and S&P 600 indices

sp600_wikipedia = list(set(sp1000_wikipedia) - set(sp400_wikipedia))

Check the number of constituents; it should be equal to 601

len(sp600_wikipedia)
598

In total, the Wikipedia tickers sum to only 598, while the S&P Dow Jones Indices website indicates that there should be 601. The 3 missing tickers could be due to a timing difference in updating the S&P 400 and S&P 1000 lists.

From Barchart

id = ['sp500', 'sp400', 'sp600']

We download the files below in CSV format from Barchart. Note that you need to sign up first, free of charge, before getting access.

input_file = {'sp500': 's&p-500-index-05-04-2019.csv',
              'sp400': 'sp-400-index-05-04-2019.csv',
              'sp600': 'sp-600-index-05-04-2019.csv'}

Define the files we want to store extracted data to

output_file = {'sp500': 'sp500_barchart.pickle',
               'sp400': 'sp400_barchart.pickle',
               'sp600': 'sp600_barchart.pickle'}

Define the S&P constituents lists from Barchart

sp500_barchart = []
sp400_barchart = []
sp600_barchart = []
sp_barchart = {'sp500': sp500_barchart,
               'sp400': sp400_barchart,
               'sp600': sp600_barchart}

The code below reads the data from the CSV files, stores it in a DataFrame object, and saves the extracted information using pickle

for i in input_file:

    # Read data into a DataFrame
    data = pd.read_csv(input_file[i])
    # Exclude the last line since it does not contain a ticker
    data = data[:-1]

    # Create a list containing the tickers
    for j in range(len(data['Symbol'])):
        sp_barchart[i].append(data['Symbol'].iloc[j])

    # Save the list to a file
    with open(output_file[i], 'wb') as f:
        pickle.dump(sp_barchart[i], f)

Check the number of constituents; it should be equal to 505 according to the S&P Dow Jones Indices website

len(sp500_barchart)
505

Check the number of constituents; it should be equal to 400 according to the S&P Dow Jones Indices website

len(sp400_barchart)
400

Check the number of constituents; it should be equal to 601 according to the S&P Dow Jones Indices website

len(sp600_barchart)
601

Comparison Between Wikipedia and Barchart

S&P 500 Index

Sort the lists

sp500_wikipedia.sort()
sp500_barchart.sort()

Compare the first 10 tickers

sp500_wikipedia[:10]
['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']
sp500_barchart[:10]
['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABMD', 'ABT', 'ACN', 'ADBE']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp500_wikipedia) - set(sp500_barchart))
diff_wikipedia_barchart
[]

There is no difference between the Wikipedia and Barchart lists.

S&P MidCap 400 Index

Sort the lists

sp400_wikipedia.sort()
sp400_barchart.sort()

Compare the first 10 tickers

sp400_wikipedia[:10]
['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'AHL']
sp400_barchart[:10]
['AAN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'AEO', 'AFG', 'AGCO', 'ALE']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp400_wikipedia) - set(sp400_barchart))
diff_wikipedia_barchart[:10]
['LPNT', 'NBR', 'AHL', 'DRQ', 'SPN', 'WAB', 'DNB', 'EGN', 'JKHY', 'GPOR']
len(diff_wikipedia_barchart)
28

Now, compare all the tickers by calculating the difference between Barchart and Wikipedia

diff_barchart_wikipedia = list(set(sp400_barchart) - set(sp400_wikipedia))
diff_barchart_wikipedia[:10]
['TREX', 'PEB', 'NGVT', 'XPO', 'CVET', 'PRSP', 'BHF', 'REZI', 'AMED', 'SRCL']
len(diff_barchart_wikipedia)
28

The difference between the two providers, Wikipedia and Barchart, is 28 tickers, which suggests that one of the two has a more up-to-date list.

S&P SmallCap 600 Index

Sort the lists

sp600_wikipedia.sort()
sp600_barchart.sort()

Compare the first 10 tickers

sp600_wikipedia[:10]
['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACET', 'ACLS']
sp600_barchart[:10]
['AAOI', 'AAON', 'AAT', 'AAWW', 'AAXN', 'ABCB', 'ABG', 'ABM', 'ACA', 'ACLS']

Compare all the tickers by calculating the difference between Wikipedia and Barchart

diff_wikipedia_barchart = list(set(sp600_wikipedia) - set(sp600_barchart))
diff_wikipedia_barchart[:10]
['GNBC', 'TREX', 'CLD', 'WCG', 'QHC', 'NGVT', 'FTK', 'DSW', 'PRSP', 'AMED']
len(diff_wikipedia_barchart)
51

Now, compare all the tickers by calculating the difference between Barchart and Wikipedia

diff_barchart_wikipedia = list(set(sp600_barchart) - set(sp600_wikipedia))
diff_barchart_wikipedia[:10]
['TCMD', 'IIPR', 'NBR', 'CSII', 'CCS', 'DRQ', 'DBI', 'SPN', 'TRHC', 'LPI']
len(diff_barchart_wikipedia)
54

In total, the Wikipedia constituents sum to only 598, while the Barchart constituents sum to 601 (complete):

  1. As previously noted, 3 tickers are missing from the Wikipedia list, which could be due to a timing difference in updating the S&P 400 and S&P 1000 lists.
  2. The difference between the two providers, Wikipedia and Barchart, is around 51 tickers, which suggests that one of the two has a more up-to-date list.

Summary

In this article, we identified candidate universes of securities: the S&P 500, S&P MidCap 400 and S&P SmallCap 600 indices. We retrieved the constituents of each index from alternative data providers, namely Wikipedia and Barchart. The list of tickers for the S&P 500 is consistent between the two sources, while the lists for the S&P MidCap 400 and S&P SmallCap 600 are not identical. There seems to be an inconsistency between the components of the S&P 400 and S&P 1000 indices from Wikipedia. As a result, we will use the S&P constituents from Barchart in the next article, where we will retrieve historical data for every ticker in each index.