# How To Pick A Good Cointegrating Pair

## Introduction

A time series is considered stationary if its probability distribution does not change over time. If the price series of a security is stationary, then it would be a suitable candidate for a mean-reversion trading strategy. However, most security price series are not stationary: they seem to follow a lognormal random walk; and drift farther and farther away from the initial value.

We need to find a pair of securities such that the combination of the two is stationary, e.g. buying a security and shorting another. Two securities that form a stationary or cointegrating pair are often from the same industry group such as Coca-Cola Company and PepsiCo. In this article, we illustrate how to pick a good cointegrating pair by applying the augmented Dickey-Fuller test to security pairs to check for cointegration.

## Step-by-step

We will proceed as follows:

1. Determine The Pairs: We present the security pairs to analyze.
2. Prepare The Data: We pull and process securities’ open-high-low-close-volume (OHLCV) data.
3. Calculate The Spread: We apply the ordinary least squares (OLS) method to calculate the spread between two securities.
4. Check For Cointegration: We use the augmented Dickey-Fuller test to check if two securities form a stationary or cointegrating pair.

You can find the code on https://github.com/DinodC/cointegrating-pair.

## Determine The Pairs

Below are the pairs of securities which we will check for cointegration:

### 1. Gold

• VanEck Vectors Gold Miners ETF (GDX): ETF which tracks a basket of gold-mining companies.
• SPDR Gold Shares (GLD): ETF which replicates the price of gold bullion.

### 2. Fast Food

Companies serving fast food:

• McDonald’s Corporation (MCD): Fast food company which gave the whole world classics like Big Mac, Hot Fudge Sundae, and Happy Meal.
• YUM! Brands, Inc. (YUM): Fast food company which operates Taco Bell, KFC and Pizza Hut.

### 3. Cryptocurrencies

Digital currencies:

• Bitcoin USD (BTC-USD): A decentralized cryptocurrency that can be sent from user to user on the peer-to-peer bitcoin network established in 2009.
• Ethereum USD (ETH-USD): An open source, public, blockchain-based distributed computing platform and operating system released in 2015.

## Prepare The Data

In this section, we illustrate download and preparation of securities’ price series.
We pull the securities’ historical OHLCV data from Yahoo Finance.
We select the adjusted close prices for each security and create a new Dataframe object.

Import packages

```import pandas as pd
from pandas import DataFrame
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Magic

```%matplotlib inline
```

Set tickers list

```tickers = ['GDX', 'GLD', 'MCD', 'YUM', 'BTC-USD', 'ETH-USD']
```

Pull OHLCV data

```# Initialize list of DataFrames
df_list = []

for i in tickers:

df = pd.read_csv(i + '.csv', index_col=0, parse_dates=True)

# Set multi-level columns
df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']])

# Update list
df_list.append(df)

# Merge DataFrames
data = pd.concat(df_list, axis=1, join='inner')

# Drop NaNs
data.dropna(inplace=True)
```

Inspect OHLCV data

```data.head()
```
GDX
Date
2015-08-0613.2113.6913.1113.3613.03352369121200
2015-08-0713.4213.8513.3313.4013.07254650618200
2015-08-1013.5714.2913.3614.2713.92128791376800
2015-08-1114.4414.5313.9414.5314.17493153731900
2015-08-1214.8115.5314.7815.5215.140740123217200
```data.tail()
```
GDX
Date
2019-07-0825.45000125.61000125.20999925.42000025.42000040606100
2019-07-0925.33000025.66000025.20999925.65000025.65000037529700
2019-07-1026.02000026.23000025.77000026.20000126.20000156454300
2019-07-1126.12999926.28000125.71999925.94000125.94000154013400
2019-07-1226.00000026.25000025.87000126.20999926.20999931795200

```# Initialize dictionary of adjusted close
close_dict = {}

# Update dictionary
for i in tickers:

# Create DataFrame
close = pd.DataFrame(close_dict)
```

```close.head()
```
GDXGLDMCDYUMBTC-USDETH-USD
Date
2015-08-0613.033523104.38999989.03874257.964733277.8900153.000
2015-08-0713.072546104.65000288.65334357.859062258.6000061.200
2015-08-1013.921287105.72000189.07457757.997757269.0299990.990
2015-08-1114.174931106.26000288.55478755.171165267.6600041.288
2015-08-1215.140740107.75000088.07979653.282372263.4400021.885
```close.tail()
```
GDXGLDMCDYUMBTC-USDETH-USD
Date
2019-07-0825.420000131.289993212.160004110.05000312567.019531307.890015
2019-07-0925.650000131.750000212.089996110.48999812099.120117288.640015
2019-07-1026.200001133.830002213.000000110.98000311343.120117268.559998
2019-07-1125.940001132.699997212.690002111.50000011797.370117275.410004
2019-07-1226.209999133.529999212.990005111.05000311363.969727268.940002

Consider the training set from 2018 to present

```training = close['2018-01-01':'2020-01-01'].copy()
```

Inspect training set

```training.head()
```
GDXGLDMCDYUMBTC-USDETH-USD
Date
2018-01-0223.694632125.150002166.89537079.50389114754.129883861.969971
2018-01-0323.445948124.820000166.19200179.43569915156.620117941.099976
2018-01-0423.595158125.459999167.35783480.24437015180.080078944.830017
2018-01-0523.545422125.330002167.69508480.71203616954.779297967.130005
2018-01-0823.296738125.309998167.57942280.84844214976.1699221136.109985
```training.tail()
```
GDXGLDMCDYUMBTC-USDETH-USD
Date
2019-07-0825.420000131.289993212.160004110.05000312567.019531307.890015
2019-07-0925.650000131.750000212.089996110.48999812099.120117288.640015
2019-07-1026.200001133.830002213.000000110.98000311343.120117268.559998
2019-07-1125.940001132.699997212.690002111.50000011797.370117275.410004
2019-07-1226.209999133.529999212.990005111.05000311363.969727268.940002

Calculate the number of pairs

```no_pairs = round(0.5 * len(tickers))
```

```plt.figure(figsize=(20, 20))

for i in range(no_pairs):
# Primary axis
color = 'tab:blue'
ax1 = plt.subplot(3, 1, i+1)
plt.plot(training[tickers[2*i]], color=color)
ax1.set_ylabel('Adjusted Close Price of ' + tickers[2*i], color=color)
ax1.tick_params(labelcolor=color)

# Secondary axis
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(training[tickers[2*i+1]], color=color)
ax2.set_ylabel('Adjusted Close Price of ' + tickers[2*i+1], color=color)
ax2.tick_params(labelcolor=color)

# Both axis
plt.xlim([training.index[0], training.index[-1]])
plt.title('Adjusted Close Prices of ' + tickers[2*i] + ' and ' + tickers[2*i+1])
```

In this section, we calculate the spread between the securities. We apply the OLS method between the securities to calculate for the hedge ratio. We standardize the spread by subtracting the mean and scaling by the standard deviation of the spread.

Calculate the spread between each pair

```# Initialize the spread list

for i in range(no_pairs):
# Run an OLS regression between the pairs
model = sm.regression.linear_model.OLS(training[tickers[2*i]], training[tickers[2*i+1]])

# Calculate the hedge ratio
results = model.fit()
hedge_ratio = results.params[0]

spread = training[tickers[2*i]] - hedge_ratio * training[tickers[2*i+1]]

# Mean and standard deviation of the spread

```

```plt.figure(figsize=(20, 20))

for i in range(no_pairs):
plt.subplot(3, 1, i+1)
plt.ylim([-3, 3])
plt.title('Spread between ' + tickers[2*i] + ' and ' + tickers[2*i+1])
```

## Check For Cointegration

In this section, we test if two securities form a stationary or cointegrating pair.
We use the augmented Dickey-Fuller (ADF) test where we have the following:

1. The null hypothesis is that a unit root is present in the price series, it is non-stationary.
2. The alternative is that unit root is not present in the prices series, it is stationary.

Run cointegration check using augmented Dickey-Fuller test

```# Initialize stats
stats_list = []

# Update stats
stats_list.append(stats)
```

Set the pairs

```# Initialize pairs
pairs = []

for i in range(no_pairs):
# Update pairs
pairs.append(tickers[2*i] + '/' + tickers[2*i+1])
```

Create stats DataFrame

```# Initialize dict
stats_dict = {}

for i in range(no_pairs):

# Update dict
stats_dict[pairs[i]] = [stats_list[i][0],
stats_list[i][1],
stats_list[i][4]['1%'], stats_list[i][4]['5%'], stats_list[i][4]['10%']]

# Create DataFrame
stats_df = pd.DataFrame(stats_dict,
index=['ADF Statistic', 'P-value', '1%', '5%', '10%'])
```

Inspect

```stats_df
```
GDX/GLDMCD/YUMBTC-USD/ETH-USD
P-value0.0114430.0286600.220476
1%-3.448344-3.447815-3.448344
5%-2.869469-2.869237-2.869469
10%-2.570994-2.570870-2.570994

Remarks:

1. For the spread between GDX and GLD, the ADF statistic is -3.39 which is lower than the 1% critical value -3.45, which means that there is a better than 99% probability that the spread between GDX and GLD is stationary.
2. For the spread between MCD and YUM, the ADF statistic is -3.07 is between the 1% critical value -3.45 and 5% critical value of -2.87, which means that there is a better than 95% probability that the spread between MCD and YUM is stationary.
3. For the spread between BTC-USD and ETH-USD, the ADF statistic is -2.16 which is higher than the critical values, which means that the spread between BTC-USD and ETH-USD is not stationary.

## Conclusion

In this article, we demonstrated how to form a a good cointegrating pair of securities. We used the OLS method to determine the hedge ratio between securities; and the ADF test to check for stationarity. The results suggest the following: cointegraing pairs could be formed within gold (GDX and GLD) and fast food securities (MCD and YUM); and cointegrating pairs could not be formed within cryptocurrencies (BTC-USD and ETH-USD).

# Pair-trading With Cryptocurrencies Part 2

## Introduction

Cryptocurrencies (or cryptos) captured the world’s imagination during the last couple of years. From their spectacular rise in 2017, to their equally hard fall in 2018, and to their most recent revival this year. Cryptocurrencies and the blockchain technology powering them are regarded as disruptive forces, proposing an alternative decentralized monetary system. They have also gathered their fair share of criticism. Cryptocurrencies are accused of facilitating transfer of “dirty money” between criminals. They provided mind-boggling returns a couple years back and earned the reputation of being a “get rich quick” scheme.

In this article, we turn our attention to cryptocurrencies as investment vehicles sharing common characteristics with other asset classes like FX, commodities and precious metals. Cryptocurrencies, like fiat currencies, allow for easy transfer of wealth between individuals and businesses. They are similar to commodities with regards to the limited supply they offer. Cryptocurrencies have drawn comparisons to gold which is viewed as a “safe-haven” investment.

This is the second part of the article where I will present how to implement a trading strategy using cryptocurrency pairs, and backtest it using the collected dataset in the previous article.

We implement a trading strategy inspired from E. Chan’s pair-trading strategy involving gold and gold miners ETFs. After money is injected to traditional markets like equity and fixed-income, it will eventually overflow into alternative asset classes like cryptocurrencies. We suppose that money flowing to cryptocurrencies will drive their prices together.

The price movement between two cryptocurrencies is measured by the difference in their prices or spread.
We have the following rules for our pair-trading strategy:

• Enter a long position on the spread if it is below -1 STD.
• Enter a short position on the spread if it is above 1 STD.
• Exit the position if the spread is greater than -0.5 and less than 0.5 STD.

### Step By Step

2. Implement and backtest the trading strategy.
3. Analyze the performance of the pair-trading strategy.

You can find the code on https://github.com/DinodC/pair-trading-cryptos.

Import packages

```import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import pickle
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline
```

```with open('data_coinmarketcap.pickle', 'rb') as f:

f.close()
```

Create close prices DataFrame object

```close = pd.DataFrame({'BTC': data.BTC.Close,
'ETH': data.ETH.Close,
'LTC': data.LTC.Close,
'XRP': data.XRP.Close})
```

Create an id for Bitcoin, Ethereum, Litecoin and Ripple cryptocurrencies to USD.

```id = ['BTC', 'ETH', 'LTC', 'XRP']
```

Before implementing the pair-trading strategy, recall in a previous article the things to look out for when backtesting:

1. Data-snooping bias
3. Transaction costs

I will review these pitfalls and propose solutions to mitigate their effect on the backtesting results.

### Correcting For Data-snooping Bias

Data-snooping is the use of data analysis to find statistically significant patterns when in fact they are not. It can originate from overfitting the model to a dataset by increasing the number of its parameters. Data-snooping can also come from optimizing trading decisions e.g. choosing entry and exit thresholds.

To mitigate data-snooping bias, we can do the following:

1. Out-of-sample testing: Out-of-sample testing involves separating historical data into a training and testing set. We optimize the model’s parameters using the training set, and validate it on the testing or out-of-sample set.
2. Paper trading: Paper trading share the same idea as out-of-sample testing in separating the dataset. We optimize the model’s parameters using the historical data, and run it using incoming actual data.

In this article, we apply out-of-sample testing by dividing the data into two equal sets.

```# Define training set
training_start = '2015-08-07'
training_end = '2017-08-01'
training_set = close[training_start:training_end]

# Define testing set
testing_start = '2017-08-02'
testing_end = '2019-05-21'
testing_set = close[testing_start:testing_end]
```

Look-ahead bias is the use of information not yet available at the moment the trade was recommended. An example is calculating the trading signal based on the day’s highest or lowest price. As the day’s high and low are only available at close, we correct this by lagging our historical data.

Another way to mitigate look-ahead bias is the following check:

1. Backtest using historical data and save the positions.
2. Backtest using truncated historical data, e.g. 10 days, and save the positions.
3. Truncate the positions from point 1, e.g. 10 days, compare with the positions from point 2.

Note that the implementation of this check is just after the backtesting.

### Including Transaction Costs

Transaction costs make the backtest more realistic.
They deflate a trading strategy’s performance by including the following:

1. Slippage is the difference between your expected price and the executed price.
2. Commission fee is a service charge extracted by your broker for handling your purchases.
3. Market impact is the price effect you generate each time you pass a considerable order to the market.

```round_trip = 0.0010
```

### Implementation

We define the entry and exit thresholds

```entry_threshold = 1
exit_threshold = 0.5
```

```# Set crypto 1 to BTC
crypto_1 = id[0]

# Initialize output
output = {id[1]: {},
id[2]: {},
id[3]: {}}

for i in range(1, len(id)):

# Set crypto 2 to ETH, LTC and XRP
crypto_2 = id[i]

# Calculate the hedge ratio using the training set
model = sm.OLS(training_set[crypto_1], training_set[crypto_2])
result = model.fit()
hedge_ratio = result.params[crypto_2]

spread = close[crypto_1] - hedge_ratio * close[crypto_2]
# Mean of the spread on the training set
# Standard deviation of the spread calculated on the training set

# Create masks for long, short and exit positions
longs = (z_score <= -entry_threshold)
shorts = (z_score &gt;= entry_threshold)
exits = (np.abs(z_score) <= exit_threshold)
# Initialize the positions
positions = pd.DataFrame({crypto_1: np.nan * pd.Series(range(len(z_score))),
crypto_2: np.nan * pd.Series(range(len(z_score)))},
index=z_score.index)
# Update the positions
[positions[crypto_1][longs], positions[crypto_2][longs]] = [1, -1]
[positions[crypto_1][shorts], positions[crypto_2][shorts]] = [-1, 1]
[positions[crypto_1][exits], positions[crypto_2][exits]] = [0, 0]
# Carry forward the positions except when there is an exit
positions.fillna(method='ffill', inplace=True)
# Lag the positions to the next day because we base calculations on close
positions = positions.shift(periods=1)

# Calculate the performance
# Initialize the returns
returns = pd.DataFrame({crypto_1: close[crypto_1],
crypto_2: close[crypto_2]})
# Update the returns
returns = returns.pct_change()
# Calculate the pnl
pnl = returns * positions

# Calculate transaction costs
# Create a mask to indicate changes in position
mask = (~np.isnan(positions.BTC) &amp; (positions.BTC - positions.BTC.shift(periods=1)).astype(bool))
# mask = (~np.isnan(positions.BTC) &amp; (positions.BTC != positions.BTC.shift(periods=1)))
# Create a transaction costs Series

# Update pnl DataFrame
pnl['TC'] = tc
# Calculate net pnl
pnl_net = pnl.sum(axis='columns')

# Calculate the Sharpe ratio under the training set
sharpe_training = np.sqrt(252) * pnl_net[training_start:training_end].mean() / pnl_net[training_start:training_end].std()
# Calculate the Sharpe ratio under the testing set
sharpe_testing = np.sqrt(252) * pnl_net[testing_start:testing_end].mean() / pnl_net[testing_start:testing_end].std()

# Generate the output
# Gather data
'positions': positions,
'pnl': pnl_net,
'sharpe training': sharpe_training,
'sharpe testing': sharpe_testing,
}
# Update the output
output.update({crypto_2: data})
```

Load the output generated from the truncated historical data

```with open('output_truncated.pickle', 'rb') as f:

f.close()
```
```positions_truncated_eth = output_truncated['ETH']['positions'].dropna()
positions_truncated_ltc = output_truncated['LTC']['positions'].dropna()
positions_truncated_xrp = output_truncated['XRP']['positions'].dropna()
```

Truncate the positions calculated from the full dataset

```positions_eth = output['ETH']['positions'][:-100].dropna()
positions_ltc = output['LTC']['positions'][:-100].dropna()
positions_xrp = output['XRP']['positions'][:-100].dropna()
```

Compare the two sets of positions

```mask = positions_eth.eq(positions_truncated_eth)
```
```True    1222
Name: ETH, dtype: int64
```
```mask = positions_ltc.eq(positions_truncated_ltc)
```
```True    2124
Name: LTC, dtype: int64
```
```mask = positions_xrp.eq(positions_truncated_xrp)
```
```True    2026
Name: XRP, dtype: int64
```

## Performance Analysis

Recall that the spread between BTC and other cryptocurrencies provides the signal of our trading strategy.

```plt.figure(figsize=[20, 20])

# Training set
plt.subplot(3, 2, 1)
plt.title('Spread BTC & ETH - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-10, 40)
# Testing set
plt.subplot(3, 2, 2)
color = 'tab:orange'
plt.title('Spread BTC & ETH - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-10, 40)

# Training set
plt.subplot(3, 2, 3)
plt.title('Spread BTC & LTC - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-30, 40)
# Testing set
plt.subplot(3, 2, 4)
color = 'tab:orange'
plt.title('Spread BTC & LTC - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-30, 40)

# Training set
plt.subplot(3, 2, 5)
plt.title('Spread BTC & XRP - Training Set')
plt.xlim(training_start, training_end)
plt.ylim(-50, 50)
# Testing set
plt.subplot(3, 2, 6)
color = 'tab:orange'
plt.title('Spread BTC & XRP - Testing Set')
plt.xlim(testing_start, testing_end)
plt.ylim(-50, 50)
```
```(-50, 50)
```

Remarks:

1. On the training set, the spread between BTC and other cryptocurrencies is flat except for January to July 2017 where it exhibited mean-reversion.
2. On the testing set, the spread between BTC and other cryptocurrencies shot up at the end of 2017 before falling down at the start of 2018.
3. Comparing the spread under the training and testing sets, there seem to be two different underlying processes i.e. a regime change.

### Sharpe Ratio

Create a Sharpe ratio table

```sharpe = pd.DataFrame({'BTC & ETH': [output['ETH']['sharpe training'], output['ETH']['sharpe testing']],
'BTC & LTC': [output['LTC']['sharpe training'], output['LTC']['sharpe testing']],
'BTC & XRP': [output['XRP']['sharpe training'], output['XRP']['sharpe testing']]},
index=pd.MultiIndex.from_product([['Sharpe Ratio'], ['Training Set', 'Testing Set']]))
```
```sharpe
```
BTC & ETHBTC & LTCBTC & XRP
Training Set1.8176631.1759021.272301
Testing Set-0.2007090.7821450.480900

Remarks:

1. On the training set, trading pairs BTC & ETH, BTC & LTC and BTC & XRP generated good Sharpe ratios, higher than 1.
2. On the testing set, the pair-trading strategy generated lower Sharpe ratios than the training set.
3. We limited data-snooping bias but performance degraded in the testing set due regime change.

### PnL

Plot the cumulative pnl

```plt.figure(figsize=[20, 5])

plt.subplot(1, 2, 1)
plt.plot(output['ETH']['pnl'].cumsum()[training_start:training_end])
plt.plot(output['LTC']['pnl'].cumsum()[training_start:training_end])
plt.plot(output['XRP']['pnl'].cumsum()[training_start:training_end])
plt.title('Cumulative PnL Under The Training Set')
plt.legend(['BTC & ETH', 'BTC & LTC', 'BTC & XRP'])
plt.ylim(-1, 10)

plt.subplot(1, 2, 2)
plt.plot(output['ETH']['pnl'].cumsum()[testing_start:testing_end])
plt.plot(output['LTC']['pnl'].cumsum()[testing_start:testing_end])
plt.plot(output['XRP']['pnl'].cumsum()[testing_start:testing_end])
plt.title('Cumulative PnL Under The Testing Set')
plt.legend(['BTC & ETH', 'BTC & LTC', 'BTC & XRP'])
plt.ylim(-1, 10)
```
```(-1, 10)
```

Remarks:

1. On the training set, trading pairs BTC & ETH, BTC & LTC and BTC & XRP made profits from January to July 2017.
2. On the test set, pair-trading strategy generated small profits during fall in January 2018.
3. Lower profits in the testing set is due to regime change observed in the spread.

## Conclusion

In this article, we implemented and backtested a pair-trading strategy using cryptocurrencies such as BTC, ETH, LTC and XRP. We limited data-snooping in our backtest by applying out-of-sample testing. We mitigated look-ahead bias by truncating the historical data. Transaction costs were included in the backtest to make it closer to reality.

# Pair-trading With Cryptocurrencies Part 1

## Introduction

Cryptocurrencies (or cryptos) captured the world’s imagination during the last couple of years. From their spectacular rise in 2017, to their equally hard fall in 2018, and to their most recent revival this year. Cryptocurrencies and the blockchain technology powering them are regarded as disruptive forces, proposing an alternative decentralized monetary system. They have also gathered their fair share of criticism. Cryptocurrencies are accused of facilitating transfer of “dirty money” between criminals. They provided mind-boggling returns a couple years back and earned the reputation of being a “get rich quick” scheme.

In this article, we turn our attention to cryptocurrencies as investment vehicles sharing common characteristics with other asset classes like FX, commodities and precious metals. Cryptocurrencies, like fiat currencies, allow for easy transfer of wealth between individuals and businesses. They are similar to commodities with regards to the limited supply they offer. Cryptocurrencies have drawn comparisons to gold which is viewed as a “safe-haven” investment.

This article is divided into two parts. In this first article, I will illustrate how to source cryptocurrency historical data. For the next article, I will demonstrate how to backtest a pair-trading strategy using the cryptocurrency dataset collected here.

## Retrieving Cryptocurrencies Historical Data

We retrieve data mainly through databases, APIs and web pages. As I don’t have access to a database providing cryptocurrency historical data. I will use APIs and scrape web pages to retrieve cryptocurrency data necessary for backtesting purposes.

### Step By Step

1. Source data via APIs.
2. Retrieve data via scraping web pages.

You can find the code on https://github.com/DinodC/pair-trading-cryptos.

## Via APIs

APIs provide a convenient way to source data from providers, exchanges and other platforms.

### 1. Quandl

Quandl is a financial, economic and alternative data provider. It specializes in distributing alternative data to investment firms, arguing that non-traditional data is an untapped source of alpha. Quandl sources data from various providers such as exchanges, tech firms and other companies. It has free and premium datasets to choose from.

To pull data from Quandl, you need to:

1. Register with Quandl to get an API key, it is free of charge.
2. Pull data from Quandl using their API, or you can use Python, R and Excel.
3. If you use Python, then you need to install Quandl Python package quandl or pandas-datareader.

Here we use quandl package to retrieve cryptocurrency data from Bitfinex and GDAX datasets.

### Bitfinex

Bitfinex is one of the largest cryptocurrency trading platforms in the world, and is headquartered in Hong Kong. It provides an extensive list of crypto-to-crypto and crypto-to-fiat datasets. Collected data using quandl’s get function comes in a convenient DataFrame object. A drawback is the limited OHLCV (Open-High-Low-Close-Volume) historical data spanning 2 years only.

```# Quandl Python package
import quandl

# Set API key
quandl.ApiConfig.api_key = 'Your API Authentication Key'

# Set parameters
start = '2009-01-01'
exchange = 'BITFINEX'
id = 'ETHUSD'

data = quandl.get(exchange + '/' + id, start_date=start)
```

### GDAX (Global Digital Asset Exchange)

GDAX is a popular cryptocurrency exchange headquartered in San Francisco, California. It was formerly known as Coinbase Exchange. Pulled OHLCV historical data from GDAX spans 3y, which is longer than Bitfinex datasets. Collected data using quandl’s get function comes in a convenient DataFrame object. A drawback is the limited number of cryptocurrency datasets compared to Bitfinex.

```# Quandl Python package
import quandl

# Set API key
quandl.ApiConfig.api_key = 'Your API Authentication Key'

# Set parameters
start = '2009-01-01'
exchange = 'GDAX'
id = 'ETH_USD'

data = quandl.get(exchange + '/' + id, start_date=start)
```

### 2. CoinAPI

CoinAPI is a one-stop shop providing cryptocurrency data from various exchanges such as GDAX, Huobi, Kraken, etc. It provides data via their API and other channels as well. Here, I will use CoinAPI’s free registration to retrieve cryptocurrency data.

To pull data from CoinAPI, you need to:

1. Register with CoinAPI to get an API key for authenticating requests.
2. Pull data from CoinAPI using their API, or you can use Python, Java, C#, and etc.
3. If you use Python, then you need to install Python package requests to source data from CoinAPI.

CoinAPI provides a rich list of cryptocurrencies and exchanges to collect data from. Note that under the free subscription, requests are limited to 100 per day. Also, historical OHLCV data per request is limited to a 100 data points. CoinAPI returns data in JSON format.

```import requests

start = '2009-01-01'
exchange = 'BITSTAMP'
id = 'ETH'
period = '1DAY'

url = 'https://rest.coinapi.io/v1/ohlcv/' + exchange + '_SPOT_' + id + '_USD/history?period_id=' + period + '&time_start=' + start + 'T00:00:00'
```

## Via Scraping

Scraping provides a straightforward way of sourcing data from a web page.

### CoinMarketCap

CoinMarketCap provides cryptocurrency data by pulling information from different exchanges. It queries cryptocurrency data from the exchanges, processes and delivers it through their website. CoinMarketCap has certain guidelines for cryptocurrencies and their exchanges to be included on the platform.

You can pull data from CoinMarketCap via their API but their free registration doesn’t include historical data. I will illustrate step-by-step how to retrieve historical OHLCV by scraping their website. Python package BeautifulSoup is used to collect data from CoinMarketCap.

CoinMarketCap provides an extensive list of crypto-to-crypto and crypto-to-fiat data. Cryptocurrency OHLCV historical data spans more than 5 years for BTCUSD, LTCUSD and XRPUSD. A limitation from scraping CoinMarketCap site is data type inconsistency for volume and market cap data.

### Scrape Cryptocurrency Data From CoinMarketCap

We scrape data as follows:

1. Identify the cryptocurrencies and their corresponding URL.
2. Scrape data from the URLs using Python packages requests and BeautifulSoup.

Note that I scraped data for the 5 largest cryptocurrencies (Bitcoin, Ethereum, Litecoin and Ripple) as per market capitalization. I excluded Bitcoin Cash because it has limited historical data.

Import packages

```from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import pickle
%matplotlib inline
```

### Identify Cryptocurrencies

Set cryptocurrency ids

```id = ['BTC', 'ETH', 'XRP', 'LTC']
```

Create a dictionary mapping cryptocurrency ids to their URLs

```url = {'BTC': 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190531',
'ETH': 'https://coinmarketcap.com/currencies/ethereum/historical-data/?start=20130428&end=20190531',
'XRP': 'https://coinmarketcap.com/currencies/ripple/historical-data/?start=20130428&end=20190531',
'LTC': 'https://coinmarketcap.com/currencies/litecoin/historical-data/?start=20130428&end=20190531'}
```

### Scrape Data

Before pulling data, recall that I identified an inconsistency on volume and market cap data type:

1. There are str and float types in the volume and market cap Series.
2. I created function my_convert to correct for this.
```def my_convert(string):

res = 0.0

if string == 'NaN':
res = string
elif string == '-':
res = 0.0
elif type(string) == str:
res = string.replace(',', '')
res = float(res)
else:
pass

return res
```

To scrape cryptocurrency historical data, we run the script below

```# Create a list for concat
data_list = []

for i in id:

# Retrieve data from URL
response = get(url[i])

# Create soup
html_soup = BeautifulSoup(response.text, 'html.parser')

# Find OHLCV table
ohlcv_table = html_soup.find_all('tr', class_='text-right')

# Store in a dictionary
data = {}
for j in range(len(ohlcv_table)):
ohlcv_row = ohlcv_table[j].text.split('\n')
data[ohlcv_row[1]] = [float(ohlcv_row[2]), # Open
float(ohlcv_row[3]), # High
float(ohlcv_row[4]), # Low
float(ohlcv_row[5]), # Close
my_convert(ohlcv_row[6]), # Volume
my_convert(ohlcv_row[7])] # Market cap

# Store in a DataFrame
data_df = pd.DataFrame.from_dict(data, orient='index')

# Rename columns
data_df.columns = pd.MultiIndex.from_product([[i], ['Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']])

# Convert index to DateTimeIndex
data_df.index = pd.to_datetime(data_df.index)

# Append to list
data_list.append(data_df)

# Create DataFrame containing all the cryptos
crypto = pd.concat(data_list, axis=1)
```

Check the collected data

```crypto.head()
```
OpenHighLowCloseVolumeMarket Cap
2013-04-28135.30135.98132.10134.210.01.488567e+09
2013-04-29134.44147.49134.00144.540.01.603769e+09
2013-04-30144.00146.93134.05139.000.01.542813e+09
2013-05-01139.00139.89107.72116.990.01.298955e+09
2013-05-02116.38125.6092.28105.210.01.168517e+09

5 rows × 24 columns

```crypto.tail()
```
OpenHighLowCloseVolumeMarket Cap
2019-05-278674.078907.178668.708805.782.794984e+101.560938e+11
2019-05-288802.768807.028634.728719.962.422692e+101.545902e+11
2019-05-298718.598755.858482.738659.492.347348e+101.535370e+11
2019-05-308661.769008.318221.278319.472.924653e+101.475251e+11
2019-05-318320.298586.668172.558574.502.536519e+101.520599e+11

5 rows × 24 columns

Plot the crypto-to-fiat close prices

```plt.figure(figsize=[20, 20])

# Subplot 1
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 1)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# ETH
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.ETH.Close, color=color)
ax2.set_ylabel('Price of ETH in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and ETHUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

# Subplot 2
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 2)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# LTC
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.LTC.Close, color=color)
ax2.set_ylabel('Price of LTC in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and LTCUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

# Subplot 3
# BTC
color = 'tab:blue'
ax1 = plt.subplot(3, 1, 3)
plt.plot(crypto.BTC.Close, color=color)
ax1.set_ylabel('Price of BTC in USD', color=color)
ax1.tick_params(labelcolor=color)
# ETH
color = 'tab:orange'
ax2 = ax1.twinx()
plt.plot(crypto.XRP.Close, color=color)
ax2.set_ylabel('Price of XRP in USD', color=color)
ax2.tick_params(labelcolor=color)
# Title
plt.title('BTCUSD and XRPUSD - Close')
plt.xlim('2014-01-01', '2019-06-01')

plt.tight_layout() # Right y label is not clipped
```

Note that ETHUSD prices are only available from July 2015 because Ethereum was released around this time.

Pickle the data

```with open('data_coinmarketcap.pickle', 'wb') as f:
pickle.dump(crypto, f)

f.close()
```

## Conclusion

In this article, we identified sources of cryptocurrency historical data such as using APIs and scraping web pages. Using APIs provides a convenient way of retrieving data. A drawback using them, particularly under their free subscription plan, is the limited historical data collected. Scraping websites requires a bit more effort than using APIs, but offers more extensive datasets. In the next article, we will use the historical data collected here to backtest a trading strategy using cryptocurrency pairs.