Principal Component Analysis As A Factor Model


Introduction

Principal component analysis (PCA) is a statistical technique which enjoys applications in image processing and quantitative finance. In this article, we focus on the latter, in particular using PCA as a multi-factor model of portfolio returns. We use the multi-factor model to design a momentum trading strategy, backtest it under different investment universes, and present the performance results.

The project is shared on my online repository https://github.com/DinodC/pca-factor-model.

Import packages

import numpy as np
import pandas as pd
import pickle
import statsmodels.api as sm
import matplotlib.pyplot as plt

Magic

%matplotlib inline

Data Collection

In this section, we collect the S&P constituents’ historical data from a previous project https://quant-trading.blog/2019/06/24/backtesting-a-trading-strategy-part-2/.

Set S&P Index keys

keys = ['sp500',
        'sp400',
        'sp600']

Initialize S&P indices close data

close = {}

Pull S&P indices close data

for i in keys: 
    # Load OHLCV data
    with open(i + '_data.pickle', 'rb') as f:
        data = pickle.load(f)

    # Update close prices data (the with statement closes the file automatically)
    close[i] = data.close.loc['2014-06-12':]

Inspect S&P 500 index constituents data

close['sp500'].head()
Symbols           A      AAL       AAP     AAPL     ABBV  ...
date
2014-06-12  39.7726  38.2958  123.2252  84.5860  44.6031  ...
2014-06-13  39.8407  38.4672  123.7407  83.6603  45.0187  ...
2014-06-16  39.7181  39.1150  124.0086  84.5035  44.8857  ...
2014-06-17  40.0927  39.8867  124.9411  84.3935  45.1351  ...
2014-06-18  40.3651  40.6392  128.6809  84.4852  45.3595  ...

5 rows × 505 columns

close['sp500'].tail()
Symbols         A    AAL     AAP    AAPL   ABBV  ...
date
2019-06-03  66.99  27.20  153.17  173.30  75.70  ...
2019-06-04  67.95  29.12  154.61  179.64  76.75  ...
2019-06-05  68.35  30.36  154.61  182.54  77.06  ...
2019-06-06  69.16  30.38  154.90  185.22  77.07  ...
2019-06-07  69.52  30.92  155.35  190.15  77.43  ...

5 rows × 505 columns

close['sp500'].describe()
Symbols            A          AAL          AAP         AAPL         ABBV  ...
count    1257.000000  1257.000000  1257.000000  1257.000000  1257.000000  ...
mean       52.020384    41.208459   145.434801   135.133436    66.670458  ...
std        13.868035     6.368716    24.131106    38.117633    17.717831  ...
min        32.258600    24.539800    79.168700    82.743800    42.066600  ...
25%        39.391500    36.574400   132.526400   103.637500    53.175300  ...
50%        46.123400    40.847900   150.479700   119.476200    58.475100  ...
75%        65.603800    46.127100   161.921100   167.857700    82.898700  ...
max        81.940000    57.586600   199.159900   229.392000   116.445400  ...

8 rows × 505 columns

close['sp500'].shape
(1257, 505)

Fill NaNs with previous observation

close['sp500'].fillna(method='ffill', inplace=True)
close['sp400'].fillna(method='ffill', inplace=True)
close['sp600'].fillna(method='ffill', inplace=True)

Initialize daily returns

returns = {}

Calculate daily returns

for i in keys:
    returns[i] = close[i].pct_change()

Momentum Strategy Implementation

In this section, we present the model (factor model) and tool (principal component analysis) used in implementing the momentum trading strategy.

1. Factor Models

Factor models use economic (e.g. interest rates), fundamental (e.g. price-to-earnings ratio), and statistical (e.g. principal component analysis) factors to explain asset prices and returns. Fama and French designed the three-factor model, which extends the capital asset pricing model (CAPM) with size and value factors. The general framework, the arbitrage pricing theory (APT) developed by Stephen Ross, allows for an arbitrary number of factors.
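In its simplest linear form, a factor model decomposes each asset's return into a weighted sum of a few factor returns plus an idiosyncratic term. Below is a minimal sketch of this decomposition using synthetic numbers; the dimensions, loadings and factor returns are illustrative assumptions, not estimates from the data used in this article.

import numpy as np

np.random.seed(0)

n_assets, n_factors = 4, 2

alpha = np.zeros(n_assets)                            # asset-specific drift
beta = np.random.normal(size=(n_assets, n_factors))   # factor loadings
f = np.random.normal(scale=0.01, size=n_factors)      # one day of factor returns
eps = np.random.normal(scale=0.005, size=n_assets)    # idiosyncratic returns

# APT-style decomposition: r_i = alpha_i + sum_j beta_ij * f_j + eps_i
r = alpha + beta @ f + eps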

2. Principal Component Analysis (PCA)

Principal component analysis is a statistical procedure for finding patterns in high-dimensional data. PCA allows you to compress high-dimensional data by reducing the number of dimensions without losing much information. In quantitative finance, principal component analysis is applied to interest-rate modeling and portfolio analysis, among other problems.

PCA implementation has the following steps:

  1. Pull data
  2. Adjust data by subtracting the mean
  3. Calculate the covariance matrix of the adjusted data
  4. Calculate the eigenvectors and eigenvalues of the covariance matrix
  5. Choose the most significant components, which explain around 95% of the variance in the data

Note that in this article we are only interested in applying principal component analysis; for a complete treatment, you can start with https://en.wikipedia.org/wiki/Principal_component_analysis#Applications. A minimal sketch of the steps above follows.
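The sketch below illustrates the steps on a window of the S&P 500 daily returns computed earlier; the window size and the 95% threshold are assumptions for the example only, separate from the strategy parameters set later.

# 1. Take a window of daily returns with complete data
R = returns['sp500'].iloc[1:251].dropna(axis='columns')

# 2. Adjust the data by subtracting the mean
R_adj = R - R.mean()

# 3. Covariance matrix of the adjusted data
cov = R_adj.cov()

# 4. Eigenvalues and eigenvectors of the (symmetric) covariance matrix, in descending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# 5. Keep the components explaining around 95% of the variance
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
n_components = np.searchsorted(explained, 0.95) + 1
components = eigenvectors[:, :n_components]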

3. Momentum Trading Strategy

Momentum trading strategies profit from the continuation of current market trends. The momentum trading strategy proposed below assumes that factor returns have momentum. The idea is to go long (short) the winning (losing) stocks, i.e. those with the highest (lowest) expected returns according to the factors.

Before implementation, we set the following parameters:

  1. Lookback period
  2. Number of significant factors
  3. Number of winning and losing stocks to pick

lookback = 250
number_of_factors = 5
top_n = 50

Initialize the trading positions

positions = {}

for i in keys:
    # Update positions
    positions[i] = pd.DataFrame(np.zeros((returns[i].shape[0], returns[i].shape[1])),
                                 index=returns[i].index,
                                 columns=returns[i].columns
                                )

Implementation

for i in keys:
    for j in range(lookback + 1, len(close[i])):
        # Select the lookback window of daily returns
        R = returns[i].iloc[j - lookback + 1:j, :]

        # Avoid daily returns with NaNs
        has_data = (R.count() == max(R.count()))
        has_data_list = list(R.columns[has_data])
        R = R.loc[:, has_data_list]

        # Calculate the mean of the daily returns
        R_mean = R.mean()

        # Calculate the adjusted daily returns
        R_adj = R.sub(R_mean)

        # Calculate the covariance matrix
        cov = R_adj.cov()

        # Calculate the eigenvalues (B) and eigenvectors (X) of the covariance matrix
        B, X = np.linalg.eigh(cov)

        # Sort the eigenvalues and eigenvectors in descending order of eigenvalue
        order = np.argsort(B)[::-1]
        B = B[order]
        X = X[:, order]

        # Retain only the most significant factors
        X = X[:, :number_of_factors]

        # OLS: regress the latest adjusted returns on the retained factor loadings
        model = sm.OLS(R_adj.iloc[-1], X)
        results = model.fit()
        b = results.params

        # Calculate the expected returns
        R_exp = R_mean.add(np.matmul(X, b))

        # Momentum strategy: short the expected losers, long the expected winners
        shorts = R_exp.sort_values()[:top_n].index
        longs = R_exp.sort_values()[-top_n:].index
        positions[i].loc[positions[i].index[j], shorts] = -1
        positions[i].loc[positions[i].index[j], longs] = 1

Remarks:

  1. The investment universes used in backtesting are the S&P 500, S&P 400 MidCap and S&P 600 SmallCap indices.
  2. The ordinary least squares (OLS) method is used to calculate the stocks’ expected returns from the significant factors.
  3. Each winning (losing) stock receives an equal long (short) position of one unit; this could be improved by weighting positions by rank, as sketched below.
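As a rough illustration of the last remark, rank-based weights could replace the equal unit positions. The toy expected returns below are made-up numbers for the example only, and scaling each side of the book to sum to one is also an assumption.

import pandas as pd

# Toy expected returns for six stocks (illustrative numbers only)
R_exp = pd.Series([0.020, -0.010, 0.015, -0.030, 0.005, -0.020],
                  index=['S1', 'S2', 'S3', 'S4', 'S5', 'S6'])
top_n = 2

shorts = R_exp.sort_values()[:top_n]     # lowest expected returns
longs = R_exp.sort_values()[-top_n:]     # highest expected returns

# Rank-based weights: the more extreme the expected return, the larger the weight,
# scaled to sum to one on each side of the book
short_weights = -shorts.rank(ascending=False) / shorts.rank(ascending=False).sum()
long_weights = longs.rank() / longs.rank().sum()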

Performance Analysis

In this section, we present the performance of the momentum trading strategy based on principal component analysis.

Shift the positions by one day because the signals are computed from close prices; trading on the same day's close would introduce look-ahead bias

for i in keys:
    positions[i] = positions[i].shift(periods=1)

Calculate the daily PnL of the momentum strategy

pnl_strat = {}
avg_pnl_strat = {}

for i in keys:
    # Daily pnl
    pnl_strat[i] = (positions[i].mul(returns[i])).sum(axis='columns')
    # Annualized average pnl of the momentum strategy
    avg_pnl_strat[i] = pnl_strat[i].mean() * 250

Annualized average PnL of the momentum strategy under different investment universes, i.e. the S&P 500, S&P 400 and S&P 600 indices

pd.DataFrame(avg_pnl_strat,
            index=['Average PnL'],
            )
                sp500     sp400     sp600
Average PnL -0.033961 -0.491665 -1.941058

Remark: the annualized average PnL of the momentum strategy is negative regardless of the investment universe used.

Plot the cumulative PnL of the momentum trading strategy

# Set size
plt.figure(figsize=(20, 10))

for i in range(len(keys)):
    plt.plot(pnl_strat[keys[i]].cumsum())

plt.xlim(pnl_strat['sp500'].index[0], pnl_strat['sp500'].index[-1])
# plt.ylim(-10, 5)

# Set title and legend
plt.title('Cumulative PnL Of Momentum Trading Strategy')
plt.legend(keys)

Remarks:

  1. From July 2015 to January 2017, the momentum trading strategy generated negative PnL.
  2. From July 2017 to January 2019, the momentum trading strategy turned it around and generated positive PnL.
  3. From January 2019, the momentum trading strategy continued to generate positive PnL for the S&P 500 and S&P 400 MidCap indices only.

Conclusion

In this article, we implemented a momentum trading strategy based on principal component analysis. The momentum trading strategy generated positive PnL from July 2017 to January 2019, and negative PnL from July 2015 to July 2017. One way to enhance the current momentum trading strategy is to add entry and exit rules based on the expected profitability of the trading system.


How To Pick A Good Cointegrating Pair


Introduction

A time series is considered stationary if its probability distribution does not change over time. If the price series of a security were stationary, it would be a suitable candidate for a mean-reversion trading strategy. However, most security price series are not stationary: they appear to follow a lognormal random walk, drifting farther and farther away from their initial value.

Instead, we look for a pair of securities such that a combination of the two is stationary, e.g. a portfolio that is long one security and short the other. Two securities that form a stationary or cointegrating pair often come from the same industry group, such as Coca-Cola Company and PepsiCo. In this article, we illustrate how to pick a good cointegrating pair by applying the augmented Dickey-Fuller test to the spread of each candidate pair; a minimal sketch of the idea on synthetic data follows.
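The sketch below mirrors the procedure applied to real data in the rest of the article: estimate a hedge ratio by OLS, form the spread, and test the spread for stationarity with the augmented Dickey-Fuller test. The synthetic series, the seed, and the absence of an intercept are assumptions for illustration only.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

np.random.seed(42)

# Two synthetic price series built around a common random walk,
# so a long-short combination of the two is stationary by construction
x = 50.0 + np.cumsum(np.random.normal(size=1000))
y = 0.5 * x + np.random.normal(size=1000)

# Hedge ratio from an OLS regression of y on x (no intercept)
hedge_ratio = sm.OLS(y, x).fit().params[0]

# Spread of the combination: long one unit of y, short hedge_ratio units of x
spread = y - hedge_ratio * x

# A small p-value suggests the spread is stationary, i.e. the pair cointegrates
print(adfuller(spread)[1])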

Step-by-step

We will proceed as follows:

  1. Determine The Pairs: We present the security pairs to analyze.
  2. Prepare The Data: We pull and process securities’ open-high-low-close-volume (OHLCV) data.
  3. Calculate The Spread: We apply the ordinary least squares (OLS) method to calculate the spread between two securities.
  4. Check For Cointegration: We use the augmented Dickey-Fuller test to check if two securities form a stationary or cointegrating pair.

You can find the code on https://github.com/DinodC/cointegrating-pair.

Determine The Pairs

Below are the pairs of securities which we will check for cointegration:

1. Gold

Gold-themed exchange traded funds (ETF):

  • VanEck Vectors Gold Miners ETF (GDX): ETF which tracks a basket of gold-mining companies.
  • SPDR Gold Shares (GLD): ETF which replicates the price of gold bullion.

2. Fast Food

Companies serving fast food:

  • McDonald’s Corporation (MCD): Fast food company which gave the whole world classics like Big Mac, Hot Fudge Sundae, and Happy Meal.
  • YUM! Brands, Inc. (YUM): Fast food company which operates Taco Bell, KFC and Pizza Hut.

3. Cryptocurrencies

Digital currencies:

  • Bitcoin USD (BTC-USD): A decentralized cryptocurrency that can be sent from user to user on the peer-to-peer bitcoin network established in 2009.
  • Ethereum USD (ETH-USD): An open source, public, blockchain-based distributed computing platform and operating system released in 2015.

Prepare The Data

In this section, we illustrate the download and preparation of the securities’ price series.
We pull the securities’ historical OHLCV data from Yahoo Finance.
We select the adjusted close prices for each security and collect them in a new DataFrame object.

Import packages

import pandas as pd
from pandas import DataFrame
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import matplotlib.pyplot as plt

Magic

%matplotlib inline

Set tickers list

tickers = ['GDX', 'GLD', 'MCD', 'YUM', 'BTC-USD', 'ETH-USD']

Pull OHLCV data

# Initialize list of DataFrames
df_list = []

# Load DataFrames
for i in tickers:

    # Load data
    df = pd.read_csv(i + '.csv', index_col=0, parse_dates=True)    

    # Set multi-level columns
    df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']])

    # Update list
    df_list.append(df)

# Merge DataFrames
data = pd.concat(df_list, axis=1, join='inner')

# Drop NaNs
data.dropna(inplace=True)

Inspect OHLCV data

data.head()
              GDX
             Open   High    Low  Close  Adj Close     Volume  ...
Date
2015-08-06  13.21  13.69  13.11  13.36  13.033523   69121200  ...
2015-08-07  13.42  13.85  13.33  13.40  13.072546   50618200  ...
2015-08-10  13.57  14.29  13.36  14.27  13.921287   91376800  ...
2015-08-11  14.44  14.53  13.94  14.53  14.174931   53731900  ...
2015-08-12  14.81  15.53  14.78  15.52  15.140740  123217200  ...

data.tail()
                  GDX
                 Open       High        Low      Close  Adj Close    Volume  ...
Date
2019-07-08  25.450001  25.610001  25.209999  25.420000  25.420000  40606100  ...
2019-07-09  25.330000  25.660000  25.209999  25.650000  25.650000  37529700  ...
2019-07-10  26.020000  26.230000  25.770000  26.200001  26.200001  56454300  ...
2019-07-11  26.129999  26.280001  25.719999  25.940001  25.940001  54013400  ...
2019-07-12  26.000000  26.250000  25.870001  26.209999  26.209999  31795200  ...

Select adjusted close prices

# Initialize dictionary of adjusted close
close_dict = {}

# Update dictionary
for i in tickers:
    close_dict[i] = data[i]['Adj Close']

# Create DataFrame
close = pd.DataFrame(close_dict)

Inspect adjusted close prices

close.head()
                  GDX         GLD        MCD        YUM     BTC-USD  ETH-USD
Date
2015-08-06  13.033523  104.389999  89.038742  57.964733  277.890015    3.000
2015-08-07  13.072546  104.650002  88.653343  57.859062  258.600006    1.200
2015-08-10  13.921287  105.720001  89.074577  57.997757  269.029999    0.990
2015-08-11  14.174931  106.260002  88.554787  55.171165  267.660004    1.288
2015-08-12  15.140740  107.750000  88.079796  53.282372  263.440002    1.885

close.tail()
                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Consider the training set from 2018 to present

training = close['2018-01-01':'2020-01-01'].copy()

Inspect training set

training.head()
                  GDX         GLD         MCD        YUM       BTC-USD      ETH-USD
Date
2018-01-02  23.694632  125.150002  166.895370  79.503891  14754.129883   861.969971
2018-01-03  23.445948  124.820000  166.192001  79.435699  15156.620117   941.099976
2018-01-04  23.595158  125.459999  167.357834  80.244370  15180.080078   944.830017
2018-01-05  23.545422  125.330002  167.695084  80.712036  16954.779297   967.130005
2018-01-08  23.296738  125.309998  167.579422  80.848442  14976.169922  1136.109985

training.tail()
                  GDX         GLD         MCD         YUM       BTC-USD     ETH-USD
Date
2019-07-08  25.420000  131.289993  212.160004  110.050003  12567.019531  307.890015
2019-07-09  25.650000  131.750000  212.089996  110.489998  12099.120117  288.640015
2019-07-10  26.200001  133.830002  213.000000  110.980003  11343.120117  268.559998
2019-07-11  25.940001  132.699997  212.690002  111.500000  11797.370117  275.410004
2019-07-12  26.209999  133.529999  212.990005  111.050003  11363.969727  268.940002

Calculate the number of pairs

no_pairs = round(0.5 * len(tickers))

Plot the adjusted close prices

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    # Primary axis
    color = 'tab:blue'
    ax1 = plt.subplot(3, 1, i+1)
    plt.plot(training[tickers[2*i]], color=color)
    ax1.set_ylabel('Adjusted Close Price of ' + tickers[2*i], color=color)
    ax1.tick_params(labelcolor=color)

    # Secondary axis 
    color = 'tab:orange'
    ax2 = ax1.twinx()
    plt.plot(training[tickers[2*i+1]], color=color)
    ax2.set_ylabel('Adjusted Close Price of ' + tickers[2*i+1], color=color)
    ax2.tick_params(labelcolor=color)

    # Both axis
    plt.xlim([training.index[0], training.index[-1]])
    plt.title('Adjusted Close Prices of ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Calculate The Spread

In this section, we calculate the spread between the securities in each pair. We apply the OLS method to calculate the hedge ratio between the two securities. We then standardize the spread by subtracting its mean and scaling by its standard deviation.

Calculate the spread between each pair

# Initialize the spread list
spread_list = []

for i in range(no_pairs):
    # Run an OLS regression between the pairs
    model = sm.regression.linear_model.OLS(training[tickers[2*i]], training[tickers[2*i+1]])

    # Calculate the hedge ratio
    results = model.fit()
    hedge_ratio = results.params[0]

    # Calculate the spread
    spread = training[tickers[2*i]] - hedge_ratio * training[tickers[2*i+1]]

    # Mean and standard deviation of the spread
    spread_mean = spread.mean()
    spread_std = spread.std()

    # Standardize the spread
    z_score = (spread - spread_mean) / spread_std

    # Update the spread list
    spread_list.append(z_score)

Plot the spread

plt.figure(figsize=(20, 20))

for i in range(no_pairs):
    plt.subplot(3, 1, i+1)
    plt.plot(spread_list[i])
    plt.xlim([spread_list[i].index[0], spread_list[i].index[-1]])
    plt.ylim([-3, 3])
    plt.title('Spread between ' + tickers[2*i] + ' and ' + tickers[2*i+1])

Check For Cointegration

In this section, we test whether two securities form a stationary or cointegrating pair.
We apply the augmented Dickey-Fuller (ADF) test to each spread, where:

  1. The null hypothesis is that a unit root is present in the spread series, i.e. it is non-stationary.
  2. The alternative hypothesis is that no unit root is present, i.e. the spread is stationary (see the sketch of the decision rule below).
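In code, the decision rule boils down to comparing the p-value returned by adfuller with a chosen significance level (equivalently, comparing the test statistic with the critical values). The helper below is a minimal sketch; its name and the 5% threshold are assumptions for illustration.

from statsmodels.tsa.stattools import adfuller

def is_stationary(series, significance=0.05):
    # Reject the unit-root null (non-stationarity) if the p-value is below the threshold
    statistic, p_value, *rest = adfuller(series)
    return p_value < significance

For example, is_stationary(spread_list[0]) would apply this rule to the GDX/GLD spread computed above.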

Run cointegration check using augmented Dickey-Fuller test

# Initialize stats
stats_list = []

for i in range(len(spread_list)):

    # ADF test
    stats = adfuller(spread_list[i])

    # Update stats
    stats_list.append(stats)

Set the pairs

# Initialize pairs
pairs = []

for i in range(no_pairs):
    # Update pairs
    pairs.append(tickers[2*i] + '/' + tickers[2*i+1])

Create stats DataFrame

# Initialize dict
stats_dict = {}

for i in range(no_pairs):

    # Update dict
    stats_dict[pairs[i]] = [stats_list[i][0],
                            stats_list[i][1],
                            stats_list[i][4]['1%'], stats_list[i][4]['5%'], stats_list[i][4]['10%']]

# Create DataFrame
stats_df = pd.DataFrame(stats_dict,
                          index=['ADF Statistic', 'P-value', '1%', '5%', '10%'])

Inspect

stats_df
                 GDX/GLD    MCD/YUM  BTC-USD/ETH-USD
ADF Statistic  -3.386075  -3.072452        -2.161601
P-value         0.011443   0.028660         0.220476
1%             -3.448344  -3.447815        -3.448344
5%             -2.869469  -2.869237        -2.869469
10%            -2.570994  -2.570870        -2.570994

Remarks:

  1. For the spread between GDX and GLD, the ADF statistic of -3.39 lies between the 1% critical value of -3.45 and the 5% critical value of -2.87, so we can reject the null hypothesis at the 5% level (p-value of about 0.01): there is a better than 95% probability that the spread between GDX and GLD is stationary.
  2. For the spread between MCD and YUM, the ADF statistic of -3.07 also lies between the 1% and 5% critical values, so we can reject the null hypothesis at the 5% level (p-value of about 0.03): there is a better than 95% probability that the spread between MCD and YUM is stationary.
  3. For the spread between BTC-USD and ETH-USD, the ADF statistic of -2.16 is higher than even the 10% critical value of -2.57, so we cannot reject the null hypothesis: the spread between BTC-USD and ETH-USD is not stationary.

Conclusion

In this article, we demonstrated how to pick a good cointegrating pair of securities. We used the OLS method to determine the hedge ratio between securities, and the ADF test to check the spread for stationarity. The results suggest that cointegrating pairs can be formed within the gold (GDX and GLD) and fast food (MCD and YUM) securities, but not within the cryptocurrencies (BTC-USD and ETH-USD).