- Choosing A Sector (Financial Services)
- Cointegration
- Choosing A Pair
- Risk (and Return) Factor Neutrality
- Moving Averages
- Moving Average Z-Scores
- Creating Buy and Sell Signals
- Breaking Down Buy and Sell Signals
- Simulating Our Trading Strategy
- Cointegration Period Simulation Results
- Benchmarking Cointegration Period Performance
- Out-Of-Sample Simulation Results
Pairs trading is a market-neutral, mean-reversion trading strategy that involves matching long and short positions in highly correlated securities. In addition to correlation, pairs trading requires a general understanding of stationarity, cointegration, and the risk factors affecting each stock within a pair.
In part 1, we will go through the process of choosing a target sector, locating a cointegrated pair, and then testing the price ratio and spread of the pair for stationarity.
In part 2, we will design a relative value trading strategy that uses rolling averages to take advantage of deviations from the "true" mean of the pair's price ratio.
In part 3, we will construct a function that simulates our trading algorithm and then explore the results in light of other pairs / benchmarks.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import coint, adfuller
from statsmodels.tools.tools import add_constant
from statsmodels.regression.linear_model import OLS
import yfinance as yf
import matplotlib.pyplot as plt
One of our main goals when picking our pair is to eliminate as much asymmetrical risk as possible. Because the strategy is long / short neutral at all times, it doesn't matter whether a factor exposure pushes prices up or down. What does matter is that the risk factors affect both stocks in a nearly identical fashion. A large difference in driving factors could easily 'break' the cointegration of the pair or destroy the stationarity of the spread / price ratio; this is where the majority of risk in a pairs trading strategy comes from.
To illustrate this point, let's take the fictional example of a pair of pharmaceutical stocks with identical stables of drugs. Let's say that they are in fact cointegrated and that their price ratio is stationary (we will explain this later, but for now, just know that they pass the statistical tests for being a 'pair'). Since these stocks are functionally identical, we can capitalize on deviations from their expected mean and make profits when the pair returns to its 'true' price ratio (after all, they are identical, so they should move together, be it up or down). But if one of the companies suddenly develops a miracle cancer cure, the returns of the two are no longer driven by identical factors. The previous price ratio now indicates nothing, and our algorithm would simply keep shorting the stock (and losing money), expecting that Miracle Drug Co. should eventually return to its expected price (based on its original relationship with the other stock). This extreme example should drive home the point: the main risks in pairs trading are not from traditional risk exposure (i.e. factors that will make the stock price go down), but from asymmetric risk exposure (i.e. factors that will cause the stocks to diverge in the long term). We will discuss this idea more in the 'Risk (and Return) Factor Neutrality' section.
Based on this analysis, I decided on the financial services industry*, as its constituents have extremely similar structures, risk factors, and revenue streams--perhaps more so than any other sector.
*Other apt choices could be: Commodities (namely Energy, Metals, and some Utilities), Real Estate, and Airlines.
*Some sectors that I would steer clear of: Biotech / Pharmaceuticals (differing drug lineups / R&D prospects), Aerospace & Defense (very contract-dependent), and Consumer Discretionary (exposure differences are very hard to quantify due to branding).
First, let's grab a bunch of financial tickers and pull price data.
Tickers = ['JPM', 'BAC', 'WFC', 'C', 'BLK', 'GS', 'MS', 'SCHW', 'AXP', 'USB', 'TFC', 'PNC', 'BK', 'COF', 'STT', 'FITB']
data = yf.download(Tickers, start="2013-01-01", end="2018-01-01")
df = data["Adj Close"].dropna(how="all")
df.tail()
| Date | AXP | BAC | BK | BLK | C | COF | FITB | GS | JPM | MS | PNC | SCHW | STT | TFC | USB | WFC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-22 | 93.623421 | 27.800545 | 49.638512 | 469.876221 | 68.443298 | 95.404030 | 27.162580 | 243.448608 | 97.019302 | 48.409340 | 130.877167 | 49.719582 | 88.843521 | 44.947334 | 48.584419 | 55.097980 |
| 2017-12-26 | 93.462212 | 27.707504 | 49.463863 | 471.344208 | 67.799568 | 94.853447 | 27.021654 | 242.273499 | 96.631035 | 48.179775 | 129.784454 | 49.519096 | 88.398994 | 44.582050 | 48.182972 | 54.722012 |
| 2017-12-27 | 93.993210 | 27.660984 | 49.491440 | 471.894745 | 67.899315 | 94.692070 | 27.004042 | 240.609604 | 96.811623 | 48.271610 | 129.856064 | 49.499992 | 88.643929 | 44.528591 | 48.236488 | 54.560883 |
| 2017-12-28 | 94.533676 | 27.726112 | 49.748825 | 475.252777 | 68.071587 | 95.214180 | 27.127993 | 241.126678 | 97.326309 | 48.345070 | 130.474136 | 49.681389 | 89.115685 | 44.653328 | 48.397976 | 54.874187 |
| 2017-12-29 | 94.163895 | 27.465597 | 49.509827 | 471.335144 | 67.464111 | 94.530678 | 26.862379 | 239.490952 | 96.558807 | 48.179775 | 129.238052 | 49.041752 | 88.933456 | 44.296955 | 48.066048 | 54.310234 |
What is cointegration?
From Paul Wilmott's Frequently Asked Questions in Quantitative Finance:
Short Answer:
Two time series are cointegrated if a linear combination has constant mean and standard deviation. In other words, the two series never stray too far from one another. Cointegration is a useful technique for studying relationships in multivariate time series, and provides a sound methodology for modelling both long-run and short-run dynamics in a financial system.
Example:
Suppose you have two stocks S1 and S2 and you find that S1 − 3 S2 is stationary, so that this combination never strays too far from its mean. If one day this ‘spread’ is particularly large then you would have sound statistical reasons for thinking the spread might shortly reduce, giving you a possible source of statistical arbitrage profit. This can be the basis for pairs trading.
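To make this concrete, here is a minimal synthetic sketch (made-up series, not our stocks, using the imports above) of two non-stationary series whose linear combination is stationary:
rng = np.random.default_rng(0)
common_walk = np.cumsum(rng.normal(0, 1, 1000))
#S1 and S2 each track a shared random walk plus independent stationary noise
S1 = 3 * common_walk + rng.normal(0, 1, 1000) + 50
S2 = common_walk + rng.normal(0, 1, 1000) + 20
print(adfuller(S1)[1], adfuller(S2)[1])
#each series alone fails the stationarity test (large p-values)...
print(adfuller(S1 - 3 * S2)[1])
#...but the combination S1 - 3*S2 is stationary (tiny p-value), so the pair is cointegrated
print(coint(S1, S2)[1])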
With that in mind, we will need to define a function which takes our stock price dataframe and outputs a cointegration matrix, alongside any pairs meeting our significance threshold of 0.02.
For this task, we will be using statsmodels to import the function which will perform an augmented Engle-Granger two-step cointegration test.
def cointegration_checker(stock_dataframe):
    cointegrated_pairs = []
    k = stock_dataframe.shape[1]
    #getting the number of columns (tickers)
    p_values = np.ones((k, k))
    keys = stock_dataframe.keys()
    #getting the column names (tickers) from df
    for i in range(k):
        for j in range(i+1, k):
            Asset_1 = stock_dataframe[keys[i]]
            Asset_2 = stock_dataframe[keys[j]]
            #iterating through the df and testing cointegration for all pairs of tickers
            Coint_Test = coint(Asset_1, Asset_2)
            pvalue = Coint_Test[1]
            #statsmodels coint returns the p-value (our primary concern) at index 1
            p_values[i, j] = pvalue
            #p-value matrix where the output of the coint test is stored at the (i, j) index
            if pvalue < 0.02:
                cointegrated_pairs.append((keys[i], keys[j]))
    return p_values, cointegrated_pairs
Using the cointegration checker defined above, we will construct a matrix of possible pairs for easier visualization.
pvalues, pairs = cointegration_checker(df)
fig, ax = plt.subplots(figsize=(25, 10))
im = ax.imshow(pvalues <= .02)
ax.set_xticks(np.arange(len(Tickers)))
ax.set_yticks(np.arange(len(Tickers)))
ax.set_xticklabels(Tickers)
ax.set_yticklabels(Tickers)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
ax.set_title("Cointegration Matrix Of Financial Services Stocks")
plt.show()
print(pairs)
[('C', 'STT'), ('MS', 'STT'), ('SCHW', 'TFC')]
After analyzing the pairs, it appears that the best match is Citigroup and State Street Corp.
Let's plot the prices and run a standalone cointegration test to make sure everything looks normal.
plt.figure(figsize=(12,8))
df['STT'].plot(color = 'black')
df['C'].plot(color = 'royalblue')
plt.legend(['State Street', 'Citigroup'], prop={'size': 15})
plt.ylabel('Price In U.S. Dollars')
plt.show()
_, pvalue, _ = coint(df['STT'][0:252], df['C'][0:252])
#standalone Engle-Granger test restricted to the first trading year (~252 sessions)
print("The pvalue of the augmented Engle-Granger two-step cointegration test between $C and $STT is", round(pvalue,5))
The pvalue of the augmented Engle-Granger two-step cointegration test between $C and $STT is 0.17249
Although the standalone test over the first trading year alone is not significant (shorter windows are noisier), the full 2013-2018 sample passed our 0.02 threshold above, so combined with the price chart it is reasonable to say that STT and C are both cointegrated and correlated.
Next, let's plot the price ratio and price spread and then test them for stationarity.
But first, what is stationarity?
From NIST/SEMATECH e-Handbook of Statistical Methods:
A stationary process has the property that the mean, variance and autocorrelation structure do not change over time. Stationarity can be defined in precise mathematical terms, but for our purpose we mean a flat looking series, without trend, constant variance over time, a constant autocorrelation structure over time and no periodic fluctuations (seasonality).
A common assumption in many time series techniques is that the data are stationary.
Note: When calculating the spread, we will use linear regression to get a coefficient for the linear combination of our securities.
Asset_1 = df['STT']
Asset_2 = df['C']
#calculating the spread
Asset_1 = add_constant(Asset_1)
results = OLS(Asset_2, Asset_1).fit()
coef = results.params['STT']
Asset_1 = Asset_1['STT']
spread = Asset_2 - (coef * Asset_1)
spread.plot(figsize=(15,10), color = 'black')
plt.xlim('2013-01-01', '2018-01-01')
plt.axhline(spread.mean(), color='red', linestyle=':')
plt.legend(['Spread Between C and STT', 'Mean'], prop={'size': 12});
plt.title("Spread Between Citigroup and State Street Corporation")
plt.show()
Next, let's calculate the price ratio. The price ratio, in this case, is simply computed as higher / lower (State Street / Citigroup).
price_ratio = Asset_1/Asset_2
price_ratio.plot(figsize=(15,10), color = 'black')
plt.axhline(price_ratio.mean(), color='royalblue', linestyle='dashed')
plt.xlim('2013-01-01', '2018-01-01')
plt.legend(['Price Ratio Of C and STT', 'Mean'], prop={'size': 12});
plt.title("Price Ratio Between Citigroup and State Street Corporation")
plt.ylabel('Price Ratio')
plt.show()
To check whether the price ratio is stationary, we will use an Augmented Dickey–Fuller test implemented in statsmodels.
Background: the Augmented Dickey–Fuller test tests the null hypothesis that a unit root is present in a time series sample. In this case, the alternative hypothesis is that no unit root exists (indicating that the time series is stationary).
Unit Root:
A unit root (also called a unit root process or a difference stationary process) is a stochastic trend in a time series, sometimes called a “random walk with drift”; If a time series has a unit root, it shows a systematic pattern that is unpredictable.
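As a quick illustration of what the ADF test detects, consider a synthetic stationary series versus a synthetic random walk (made-up data, purely for intuition):
rng = np.random.default_rng(1)
noise = rng.normal(size=1000)
#white noise: stationary, no unit root
walk = np.cumsum(rng.normal(size=1000))
#a random walk contains a unit root
print(adfuller(noise)[1])
#near-zero p-value: reject the unit root, the series is stationary
print(adfuller(walk)[1])
#large p-value: cannot reject the unit root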
Now, let's compute and compare the p-values of both the price spread and price ratio.
price_ratio_pvalue = round(adfuller(price_ratio)[1],5)
spread_pvalue = round(adfuller(spread)[1],5)
(price_ratio_pvalue, spread_pvalue)
(7e-05, 0.00035)
Based on both the graph of the price ratio and the results of our Augmented Dickey–Fuller test, we can say with reasonable certainty that the price ratio of Citigroup and State Street is stationary and will likely fluctuate around a mean of 1.35.
The price ratio p-value is the lower of the two, so we will use the price ratio as the stationary series on which to base our mean-reversion trading strategy.
While analyzing the absolute price ratio of the two stocks is important, we should also consider examining a normalized version of the price ratio. To do this, we will convert the dataframe of price ratios into a dataframe of price ratio Z-scores.
This will simplify our decision making by allowing us to quantify the relative 'unusualness' of the price ratio at the present time. For instance, a price ratio Z-score of +0.5 might not be that interesting, but a Z-score of +2.5 would be a strong indicator that our higher stock (STT) is likely overvalued.
price_ratio_z_score = (price_ratio - price_ratio.mean()) / np.std(price_ratio)
# Z-score formula: (observed value - sample mean) / sample standard deviation
price_ratio_z_score.plot(figsize=(15,10), color = 'black')
plt.axhline(price_ratio_z_score.mean(), color = 'darkgrey' )
plt.axhline(1, color='tomato', linestyle='dashed')
plt.axhline(2, color='darkred', alpha = .4)
plt.axhline(-1, color='limegreen', linestyle='dashed')
plt.axhline(-2, color='darkgreen', alpha = .4)
plt.xlim('2013-01-01', '2018-01-01')
plt.legend(['Price Ratio Z-Score Of Citigroup and State Street', 'Mean', 'Z=+1', 'Z=+2', 'Z=-1', 'Z=-2' ]);
plt.ylabel('Z Score')
plt.show()
From the graph we can see that nearly all large divergences will converge back to the approximate mean; this is what we were hoping to see.
On top of examining the cointegration and stationarity of the pair, we should take a deep dive into the factors that drive each stock's returns. We will use beta to examine the relationship of our pair with various sectors / factors. Traditionally, beta is used in the context of a stock's risk relationship to the general market and is defined as:
$ Beta = \frac{Covariance(Asset_1, Benchmark)}{Variance(Benchmark)} $
In this project, we will compute the beta of both our stocks against various benchmarks, then compare the results to assess factor neutrality. Ideally, we would like to see that both stocks have identical betas across all exposure factors. On the other hand, a high factor beta ratio^ would indicate the pair is more susceptible to that factor, which would threaten the stationarity of the price ratio. For example, if one stock in the pair were excessively correlated to a factor / sector (say the U.S. Technology Sector), and U.S. Tech performed extremely well for an extended period of time, the stocks (and their price ratio) would likely diverge; this is the same concept that we discussed in our 'choosing a sector' section.
In the next code block, I pulled price data for some ETFs which will serve as benchmarks for their respective categories. This list is by no means exhaustive, but aims to cover the major factors / sectors.
^Note that we are taking the ratio rather than the absolute difference to account for low beta values. 0.2 - 0.1 is a small absolute difference but a large percentage difference.
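As a sanity check on the regression approach below, beta can also be computed straight from the definition; this helper (beta_from_covariance is my own hypothetical name, not part of the original code) is a minimal sketch:
def beta_from_covariance(asset_prices, benchmark_prices):
    #converting prices to simple daily returns
    asset_returns = asset_prices.pct_change().dropna()
    benchmark_returns = benchmark_prices.pct_change().dropna()
    #beta = Cov(asset, benchmark) / Var(benchmark), using matching sample (ddof=1) estimators
    covariance = np.cov(asset_returns, benchmark_returns)[0, 1]
    return covariance / np.var(benchmark_returns, ddof=1)
#e.g. beta_from_covariance(factor_df['STT'], factor_df['SPY']) should closely match the regression beta computed below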
Factor_Tickers = ['MTUM', 'QUAL', 'IQLT', 'VTV', 'QQQ', 'SIZE', 'SPY', 'XLK', 'IXN','VNQ','ITB', 'XLF', 'STT', 'C']
Factor_Dictionary = {'SPY': 'Market', 'MTUM': 'Momentum', 'IQLT': 'International Quality', 'QQQ': 'Large Cap',
'QUAL': 'Quality', 'SIZE': 'Size (Small / Medium)', 'VTV': 'Value', 'XLK': 'US Technology',
'IXN':'Global Technology', 'VNQ': 'US Real Estate', 'ITB': 'US HomeBuilding', 'XLF': 'Financial Services'}
factor_data = yf.download(Factor_Tickers, start="2016-01-01", end="2018-01-01")
factor_df = factor_data["Adj Close"].dropna(how="all")
factor_df.head()
| Date | C | IQLT | ITB | IXN | MTUM | QQQ | QUAL | SIZE | SPY | STT | VNQ | VTV | XLF | XLK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-04 | 45.273872 | 21.832926 | 25.608055 | 91.230064 | 67.328712 | 104.816719 | 58.056442 | 57.771698 | 181.917801 | 56.995235 | 63.383949 | 70.094666 | 17.105494 | 39.188770 |
| 2016-01-05 | 45.034794 | 22.524345 | 25.676241 | 90.963463 | 67.647499 | 104.634834 | 58.211578 | 57.771698 | 182.225494 | 57.082722 | 64.607063 | 70.320999 | 17.171368 | 39.086788 |
| 2016-01-06 | 44.379551 | 21.478350 | 25.082066 | 89.544685 | 67.291229 | 103.629753 | 57.518047 | 57.460899 | 179.926880 | 55.184052 | 64.430031 | 69.319992 | 16.907867 | 38.604694 |
| 2016-01-07 | 42.112766 | 21.123779 | 24.039818 | 87.021355 | 65.791069 | 100.384743 | 56.149208 | 56.473656 | 175.610184 | 52.690403 | 63.174740 | 67.701019 | 16.432096 | 37.464352 |
| 2016-01-08 | 40.846539 | 20.884438 | 23.504084 | 86.297684 | 65.162895 | 99.561531 | 55.510433 | 55.797226 | 173.682556 | 51.316708 | 62.329830 | 66.900253 | 16.175928 | 37.167686 |
Next, let's define a function for calculating beta. We will use a simple regression model to compute the beta of our stocks relative to our chosen benchmarks.
def beta_calculation(y, x):
    #converting prices to daily percentage returns and dropping the first (NaN) row
    x = x.pct_change().iloc[1:].values
    y = y.pct_change().iloc[1:].values
    #adding an intercept column (add_constant was imported directly above, not as sm.add_constant)
    x = add_constant(x)
    model = OLS(y, x).fit()
    #the slope coefficient (index 1) is our beta
    return model.params[1]
for i in Factor_Tickers[:-2]:
    #STT and C are the last two tickers, so they are excluded from the benchmark loop
    factor_ratio = beta_calculation(factor_df['C'], factor_df[i]) / beta_calculation(factor_df['STT'], factor_df[i])
    print("The exposure of the pair to {} is {}".format(Factor_Dictionary[i], round(factor_ratio, 2)))
The exposure of the pair to Momentum is 1.13
The exposure of the pair to Quality is 1.11
The exposure of the pair to International Quality is 1.07
The exposure of the pair to Value is 1.17
The exposure of the pair to Large Cap is 1.06
The exposure of the pair to Size (Small / Medium) is 1.14
The exposure of the pair to Market is 1.13
The exposure of the pair to US Technology is 1.06
The exposure of the pair to Global Technology is 1.07
The exposure of the pair to US Real Estate is 1.18
The exposure of the pair to US HomeBuilding is 1.16
The exposure of the pair to Financial Services is 1.22
These ratios, while not extremely high, are still larger than we would like. An easy way to mitigate some of this risk in live production would be to re-test the cointegration every day using sliding windows of varying shorter durations (say 3 months, 6 months, and 1 year); a rough sketch of this idea follows below. Obviously, if this strategy were put into production we would build out a much more comprehensive risk model, but for now I'm content to say that State Street Corporation (STT) and Citigroup (C) are a suitable pair to formulate a trading strategy with.
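A minimal sketch of that rolling re-check (the 126-day window and ~monthly step here are arbitrary assumptions):
def rolling_cointegration_pvalues(s1, s2, window=126, step=21):
    #re-run the Engle-Granger test over a trailing window, stepping forward roughly monthly
    pvals = {}
    for end in range(window, len(s1), step):
        pvals[s1.index[end].strftime('%Y-%m-%d')] = coint(s1.iloc[end-window:end], s2.iloc[end-window:end])[1]
    return pd.Series(pvals)
#e.g. rolling_cointegration_pvalues(df['STT'], df['C']).plot() to watch for p-values drifting above our threshold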
Since this trading strategy is, at its core, a mean-reversion strategy, we can use rolling averages to inform our trading decisions.
Let's start by plotting the 10 and 60 day moving (rolling) averages against the actual price ratio.
I arbitrarily chose 10 and 60 day windows for our short and long periods. Hypothetically, we could assess various short / long window lengths and then test the statistical significance of our results for a more robust implementation (see the sketch below), but I will leave a full treatment for future work.
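One rough way to scan candidate windows (scan_windows is my own hypothetical helper; scoring a window pair by the ADF p-value of its rolling z-score is just one possible criterion):
def scan_windows(price_ratio, short_windows, long_windows):
    results = {}
    for short_w in short_windows:
        for long_w in long_windows:
            if short_w >= long_w:
                continue
            short_ma = price_ratio.rolling(window=short_w).mean()
            long_ma = price_ratio.rolling(window=long_w).mean()
            z = (short_ma - long_ma) / price_ratio.rolling(window=long_w).std()
            #a lower ADF p-value means stronger evidence that the rolling z-score mean-reverts
            results[(short_w, long_w)] = adfuller(z.dropna())[1]
    return results
#e.g. sorted(scan_windows(price_ratio, [5, 10, 20], [40, 60, 90]).items(), key=lambda kv: kv[1])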
#price ratio = ($STT / $C)
price_ratio = Asset_1/Asset_2
price_ratio_10D_MAVG = price_ratio.rolling(window=10).mean()
price_ratio_60D_MAVG = price_ratio.rolling(window=60).mean()
plt.figure(figsize=(15, 10))
price_ratio.plot(color = 'black')
price_ratio_10D_MAVG.plot(color = 'magenta', linewidth=2, alpha = .8)
price_ratio_60D_MAVG.plot(color = 'b', linewidth=3)
plt.axhline(price_ratio.mean(), color = 'darkgrey', linestyle = 'dashed' )
plt.legend(['Actual Price Ratio', '10d Ratio Moving Avg', '60d Ratio Moving Avg', 'Price Ratio Mean'])
plt.ylabel('(Moving) Price Ratio')
plt.show()
Similar to our analysis of the composite price ratio, let's examine the rolling z-score of our moving averages to determine whether we could use z-scores as trading indicators.
STD_60 = price_ratio.rolling(window=60).std()
Rolling_Z_Score = ((price_ratio_10D_MAVG - price_ratio_60D_MAVG)/STD_60)
Rolling_Z_Score.plot(figsize=(15,10), color = 'black')
#plt.xlim('2013-01-01', '2018-01-01')
plt.axhline(0, color='black')
plt.axhline(1, color='tomato', linestyle='dashed')
plt.axhline(-1, color='limegreen', linestyle='dashed')
plt.legend(['Rolling Z-Score', 'Mean', 'Z=+1', 'Z=-1'])
plt.title("Rolling Z-Score of Price Ratio Between C and STT")
plt.ylabel('Z Score')
plt.show()
The rolling z-scores appear to consistently drift away from the mean and then revert back to it.
Accordingly, we should design a strategy that 'buys' the ratio when it is too low and 'sells' the ratio when it is too high.
The rules for this strategy are simple:
- We want to 'buy' the ratio when the rolling z-score is more than 1.25 SD below the mean.
- We want to 'sell' the ratio when the rolling z-score is more than 1.25 SD above the mean.
That's it!
To accomplish this, we will make two copies of the price ratio dataframe and then enact the following modifications:
- We want to 'sell' the ratio when the rolling z-score is more than 1.25 SD above the mean, so we will set all values below +1.25 SD equal to zero, leaving us with the desired sell dates.
- We want to 'buy' the ratio when the rolling z-score is more than 1.25 SD below the mean, so we will set all values above -1.25 SD equal to zero, leaving us with the desired buy dates.
price_ratio.plot(figsize=(15,10), color = 'black', linewidth = 2)
buy = price_ratio.copy()
sell = price_ratio.copy()
buy[Rolling_Z_Score > -1.25] = 0
buy.plot(color='g', linestyle='None', marker='^')
sell[Rolling_Z_Score < 1.25] = 0
sell.plot(color='r', linestyle='None', marker='v')
plt.axis(('2013-04-01','2018-01-01', 1, 1.65))
plt.legend(['Price Ratio', 'Buy Signal', 'Sell Signal'])
plt.show()
But what exactly does it mean to 'buy' or 'sell' a price ratio?
Again, the rules are simple:
- 'Buying' the ratio means you buy the higher stock (STT) and sell the lower stock (C)
- 'Selling' the ratio means you sell the higher stock (STT) and buy the lower stock (C)
To visualize this, we will translate our price ratio buy / sell signals into actions for each stock.
plt.figure(figsize=(15,10))
Asset_1 = df['STT']
Asset_2 = df['C']
Asset_1.plot(color='black', linewidth=2)
Asset_2.plot(color='royalblue', linewidth=2)
Buy_The_Ratio = 0*Asset_1.copy()
Sell_The_Ratio = 0*Asset_1.copy()
#creating blank dfs for our order instructions
# 'buying' the ratio means you buy the higher stock (STT) and sell the lower stock (C)
Buy_The_Ratio[buy!=0] = Asset_1[buy!=0]
Buy_The_Ratio[sell!=0] = Asset_2[sell!=0]
# 'selling' the ratio means you sell the higher stock (STT) and buy the lower stock (C)
Sell_The_Ratio[buy!=0] = Asset_2[buy!=0]
Sell_The_Ratio[sell!=0] = Asset_1[sell!=0]
Buy_The_Ratio.plot(color='g', linestyle='None', marker='^')
Sell_The_Ratio.plot(color='r', linestyle='None', marker='v')
plt.legend(['STT', 'C', 'Buy Signal', 'Sell Signal'])
plt.ylim([20,95])
plt.show()
By now, the gist of our trading strategy should be pretty clear.
Let's make a function to simulate our strategy in action.
In order:
- First, we will calculate the price ratio and rolling average z-score of our two assets.
- Then, we will use if statements to execute trades when the price ratio z-score exceeds ±1.25 SD; this entails keeping track of the total position sizes (for leverage constraints) and the prices that shares of each stock were bought / sold for, so that we can calculate profit later.
- After opening a trade, we will track the position value every day and close the position if it exceeds our stop loss.
- Finally, we will close our positions when the price ratio z-score is back within normal range (±0.5 SD).
- Along the way, we will also keep track of a multitude of variables to analyze our performance. Among these, the number of trades, trade dates, trade gains / losses, and overall profit are all recorded for high / low trades individually.
Note: 'high' and 'low' in the following variable declarations refer to trades made when the price ratio z-score is high (+1.25 SD) or low (-1.25 SD).
Ex. ratio_high_sell = short selling Asset_1 when the price ratio z-score was high (+1.25 SD)
Ex. profit_low = profit from trades made when the price ratio z-score was low (-1.25 SD)
def trading_simulation(Asset1, Asset2, window1, window2, leverage, stop_loss):
    #Asset1 and Asset2 should be adjusted close price data
    #window2 should exceed window1 in length
    price_ratio = Asset1/Asset2
    moving_average1 = price_ratio.rolling(window=window1).mean()
    moving_average2 = price_ratio.rolling(window=window2).mean()
    std = price_ratio.rolling(window=window2).std()
    z_score = (moving_average1 - moving_average2)/std
    #calculating the z-score with moving averages as shown in previous sections
    profit, profit_high, profit_low = 0, 0, 0
    ratio_high_sell, ratio_high_buy, ratio_low_buy, ratio_low_sell = 0, 0, 0, 0
    low_trade_total, high_trade_total = 0, 0
    count_high, count_low = 0, 0
    old_profit_high, old_profit_low = 0, 0
    high_hit_rate, low_hit_rate = [], []
    low_dic, high_dic = {}, {}
    Asset1_shares, Asset2_shares = 0, 0
    open_trade = 0
    potential_loss_high, potential_loss_low = 0, 0
    for i in range(len(price_ratio)):
        #using .iloc for positional access so the code works cleanly on date-indexed Series
        if z_score.iloc[i] > 1.25 and open_trade == 0:
            #if the price ratio z-score is high we will 'sell' the ratio
            #calculating the maximum number of shares for each stock; positions are weighted equally
            Asset1_shares = (leverage // 2) // Asset1.iloc[i]
            Asset2_shares = (leverage // 2) // Asset2.iloc[i]
            #selling the ratio means you sell the higher stock (STT) and buy the lower stock (C)
            ratio_high_sell = Asset1.iloc[i] * Asset1_shares
            ratio_high_buy = Asset2.iloc[i] * Asset2_shares
            #tracking the number of trades for profit calculation and overall frequency
            count_high += 1
            high_trade_total += 1
            open_trade = 1
        elif z_score.iloc[i] < -1.25 and open_trade == 0:
            #if the price ratio z-score is low we will 'buy' the ratio
            #calculating the maximum number of shares for each stock; positions are weighted equally
            Asset1_shares = (leverage // 2) // Asset1.iloc[i]
            Asset2_shares = (leverage // 2) // Asset2.iloc[i]
            #'buying' the ratio means you buy the higher stock (STT) and sell the lower stock (C)
            ratio_low_buy = Asset1.iloc[i] * Asset1_shares
            ratio_low_sell = Asset2.iloc[i] * Asset2_shares
            count_low += 1
            low_trade_total += 1
            open_trade = 1
        if open_trade == 1 and abs(z_score.iloc[i]) > .5:
            #tracking the current mark-to-market profit / loss of the open high and low positions
            potential_loss_high = (ratio_high_sell - (Asset1.iloc[i] * Asset1_shares * count_high)) + ((Asset2.iloc[i] * Asset2_shares * count_high) - ratio_high_buy)
            potential_loss_low = ((Asset1.iloc[i] * Asset1_shares * count_low) - ratio_low_buy) + (ratio_low_sell - (Asset2.iloc[i] * Asset2_shares * count_low))
            if potential_loss_high < -stop_loss or potential_loss_low < -stop_loss:
                #if potential losses exceed the stop loss then we cut the positions
                old_profit_high = profit_high
                old_profit_low = profit_low
                profit_high += ratio_high_sell - (Asset1.iloc[i] * Asset1_shares * count_high)
                profit_high += (Asset2.iloc[i] * Asset2_shares * count_high) - ratio_high_buy
                if (profit_high - old_profit_high) != 0:
                    high_hit_rate.append(profit_high - old_profit_high)
                    high_dic[Asset1.index[i].strftime('%Y-%m-%d')] = (profit_high - old_profit_high)
                profit_low += (Asset1.iloc[i] * Asset1_shares * count_low) - ratio_low_buy
                profit_low += ratio_low_sell - (Asset2.iloc[i] * Asset2_shares * count_low)
                if (profit_low - old_profit_low) != 0:
                    low_hit_rate.append(profit_low - old_profit_low)
                    low_dic[Asset1.index[i].strftime('%Y-%m-%d')] = (profit_low - old_profit_low)
                ratio_high_sell, ratio_high_buy, ratio_low_buy, ratio_low_sell = 0, 0, 0, 0
                count_high, count_low = 0, 0
                open_trade = 0
        elif abs(z_score.iloc[i]) < .5:
            #once the z-score has returned to 'normal' we will close our positions
            #tracking the previous profit level so that we can calculate changes
            old_profit_high = profit_high
            old_profit_low = profit_low
            profit_high += ratio_high_sell - (Asset1.iloc[i] * Asset1_shares * count_high)
            profit_high += (Asset2.iloc[i] * Asset2_shares * count_high) - ratio_high_buy
            #profit is derived from (shorted share price - current price) + (current share price - initial long share price)
            if (profit_high - old_profit_high) != 0:
                #tracking profit from high trades for metrics
                high_hit_rate.append(profit_high - old_profit_high)
                high_dic[Asset1.index[i].strftime('%Y-%m-%d')] = (profit_high - old_profit_high)
            profit_low += (Asset1.iloc[i] * Asset1_shares * count_low) - ratio_low_buy
            profit_low += ratio_low_sell - (Asset2.iloc[i] * Asset2_shares * count_low)
            if (profit_low - old_profit_low) != 0:
                #tracking profit from low trades for metrics
                low_hit_rate.append(profit_low - old_profit_low)
                low_dic[Asset1.index[i].strftime('%Y-%m-%d')] = (profit_low - old_profit_low)
            #clearing all positions
            ratio_high_sell, ratio_high_buy, ratio_low_buy, ratio_low_sell = 0, 0, 0, 0
            count_high, count_low = 0, 0
            open_trade = 0
    profit = profit_low + profit_high
    high_biggest_loss = min(high_hit_rate)
    high_biggest_gain = max(high_hit_rate)
    low_biggest_loss = min(low_hit_rate)
    low_biggest_gain = max(low_hit_rate)
    #converting the per-trade P&L lists into hit rates (percentage of profitable trades)
    high_hit_rate = (len([x for x in high_hit_rate if x > 0]) / len(high_hit_rate)) * 100
    low_hit_rate = (len([x for x in low_hit_rate if x > 0]) / len(low_hit_rate)) * 100
    print("The total profit between {} and {} was ${:.2f}".format(Asset1.index[0].strftime('%Y-%m-%d'), Asset1.index[-1].strftime('%Y-%m-%d'), profit))
    print("{} trades where the price ratio was high were executed; they were profitable {:.2f}% of the time, with a largest single gain of ${:.2f}, and a largest single loss of ${:.2f}"
          .format(high_trade_total, high_hit_rate, high_biggest_gain, high_biggest_loss))
    print("{} trades where the price ratio was low were executed; they were profitable {:.2f}% of the time, with a largest single gain of ${:.2f}, and a largest single loss of ${:.2f}"
          .format(low_trade_total, low_hit_rate, low_biggest_gain, low_biggest_loss))
    trades_dic = {**high_dic, **low_dic}
    #building a running account value series from the date-sorted closed trades
    total = leverage
    tracker = []
    for key, value in sorted(trades_dic.items()):
        total += value
        tracker.append(total)
    growth_tracker = pd.DataFrame({'Date': sorted(list(trades_dic.keys())), 'Cumulative Value': tracker})
    growth_tracker = growth_tracker.set_index('Date')
    return ("High Trades", high_dic), ("Low Trades", low_dic), growth_tracker, ("Total Profit:", profit)
Now that our function is defined, let's pull data for our pair and test the results in-sample and out-of-sample.
Note:
- In-sample refers to the period in which we tested for cointegration (2013-2018), and out-of-sample refers to the period (2018-2021) where we don't know whether the pair is cointegrated.
- The results for 2013-2018 are subject to forward-looking bias, as we know that the pair was cointegrated during that time. The point of simulating trading during this period is twofold:
  - to make sure our strategy works when we know that cointegration exists
  - to illustrate a rough baseline of exactly how profitable pairs trading can be when cointegration holds.
We will also be importing 5-year T-Note yield data to calculate the return for a comparable "riskless" asset over the same time period, and additional random stocks to verify the validity of our strategy.
In-Sample Tickers:
Coint_Tickers = [ 'C', 'STT', '^FVX', 'GOOG', 'AAPL', 'T','FB']
test_data = yf.download(Coint_Tickers, start="2013-01-01", end="2018-01-01")
test_df = test_data["Adj Close"].dropna(how="all")
test_df.tail()
| Date | AAPL | C | FB | GOOG | STT | T | ^FVX |
|---|---|---|---|---|---|---|---|
| 2017-12-22 | 42.040852 | 68.443298 | 177.199997 | 1060.119995 | 88.843521 | 31.276472 | 2.251 |
| 2017-12-26 | 40.974274 | 67.799568 | 175.990005 | 1056.739990 | 88.398994 | 31.372862 | 2.237 |
| 2017-12-27 | 40.981483 | 67.899315 | 177.619995 | 1049.369995 | 88.643929 | 31.236317 | 2.199 |
| 2017-12-28 | 41.096786 | 68.071587 | 177.919998 | 1048.140015 | 89.115685 | 31.469244 | 2.234 |
| 2017-12-29 | 40.652374 | 67.464111 | 176.460007 | 1046.400024 | 88.933456 | 31.228283 | 2.206 |
Out-Of-Sample Tickers:
Test_Tickers = [ 'C', 'STT', '^FVX', 'GOOG', 'AAPL', 'T','FB']
test2_data = yf.download(Test_Tickers, start="2018-01-01", end="2021-04-01")
test2_df = test2_data["Adj Close"].dropna(how="all")
test2_df.tail()
| Date | AAPL | C | FB | GOOG | STT | T | ^FVX |
|---|---|---|---|---|---|---|---|
| 2021-03-26 | 121.209999 | 73.019997 | 283.019989 | 2035.550049 | 85.372528 | 29.800423 | 0.856 |
| 2021-03-29 | 121.389999 | 71.580002 | 290.820007 | 2055.949951 | 84.815910 | 30.046221 | 0.891 |
| 2021-03-30 | 119.900002 | 72.959999 | 288.000000 | 2055.540039 | 85.790001 | 30.213362 | 0.910 |
| 2021-03-31 | 122.150002 | 72.750000 | 294.529999 | 2068.629883 | 84.010002 | 29.761097 | 0.938 |
| 2021-04-01 | NaN | NaN | NaN | NaN | NaN | NaN | 0.907 |
Let's take a look at how our strategy did when tested over a period where we know that cointegration holds.
#'in-sample'
trading_simulation(test_df['STT'], test_df['C'], 5, 40, 100000, 5000)
The total profit between 2013-01-02 and 2017-12-29 was $38477.66
17 trades where the price ratio was high were executed; they were profitable 81.25% of the time, with a largest single gain of $3745.52, and a largest single loss of $-1475.22
18 trades where the price ratio was low were executed; they were profitable 88.89% of the time, with a largest single gain of $3236.18, and a largest single loss of $-883.84
(('High Trades',
{'2013-07-17': -1475.2200775146484,
'2013-09-10': 1995.166820526123,
'2013-11-19': 815.9623947143555,
'2014-02-05': 2537.277099609375,
'2014-04-14': 3745.5196647644043,
'2014-07-18': -60.61799621582031,
'2014-08-15': 858.8240661621094,
'2014-12-08': 1208.4047966003418,
'2015-02-18': -386.6806640625,
'2015-09-22': 1368.127784729004,
'2016-02-01': 2723.5733795166016,
'2016-03-03': 2457.1830978393555,
'2016-09-09': 857.4477920532227,
'2016-10-06': 894.4264259338379,
'2017-01-30': 2565.181930541992,
'2017-05-16': 714.2291488647461}),
('Low Trades',
{'2013-03-22': 1931.1374740600586,
'2013-05-22': 588.3607406616211,
'2014-03-25': 1200.0155868530273,
'2014-09-16': 1068.4920768737793,
'2014-10-28': 1974.4127807617188,
'2015-03-25': 450.3178176879883,
'2015-07-10': 29.29065704345703,
'2015-08-11': 220.76516342163086,
'2015-11-05': -61.36859893798828,
'2016-01-19': 2350.3444061279297,
'2016-05-03': 676.1742210388184,
'2016-07-22': 3236.1774864196777,
'2016-11-04': 210.63382720947266,
'2017-01-09': -883.8449096679688,
'2017-03-24': 1136.3276138305664,
'2017-09-08': 1328.011100769043,
'2017-10-13': 1422.9088287353516,
'2017-11-20': 780.6961975097656}),
Cumulative Value
Date
2013-03-22 101931.137474
2013-05-22 102519.498215
2013-07-17 101044.278137
2013-09-10 103039.444958
2013-11-19 103855.407352
2014-02-05 106392.684452
2014-03-25 107592.700039
2014-04-14 111338.219704
2014-07-18 111277.601707
2014-08-15 112136.425774
2014-09-16 113204.917850
2014-10-28 115179.330631
2014-12-08 116387.735428
2015-02-18 116001.054764
2015-03-25 116451.372581
2015-07-10 116480.663239
2015-08-11 116701.428402
2015-09-22 118069.556187
2015-11-05 118008.187588
2016-01-19 120358.531994
2016-02-01 123082.105373
2016-03-03 125539.288471
2016-05-03 126215.462692
2016-07-22 129451.640179
2016-09-09 130309.087971
2016-10-06 131203.514397
2016-11-04 131414.148224
2017-01-09 130530.303314
2017-01-30 133095.485245
2017-03-24 134231.812859
2017-05-16 134946.042007
2017-09-08 136274.053108
2017-10-13 137696.961937
2017-11-20 138477.658134,
('Total Profit:', 38477.65813446045))
Now, let's take a look at some random pairs run through the same simulation.
If our strategy has merit, then we should see that random combinations of stocks perform poorly in both total profit and hit rate.
null = trading_simulation(test_df['T'], test_df['STT'], 5, 40, 100000, 5000)
The total profit between 2013-01-02 and 2017-12-29 was $-3465.31
20 trades where the price ratio was high were executed; they were profitable 68.42% of the time, with a largest single gain of $5870.19, and a largest single loss of $-5361.00
21 trades where the price ratio was low were executed; they were profitable 57.14% of the time, with a largest single gain of $5096.20, and a largest single loss of $-5940.68
null = trading_simulation(test_df['GOOG'], test_df['C'], 5, 40, 100000, 5000)
The total profit between 2013-01-02 and 2017-12-29 was $1861.14
23 trades where the price ratio was high were executed; they were profitable 65.22% of the time, with a largest single gain of $4612.42, and a largest single loss of $-7868.41
19 trades where the price ratio was low were executed; they were profitable 68.42% of the time, with a largest single gain of $4472.66, and a largest single loss of $-5593.68
null = trading_simulation(test_df['AAPL'], test_df['FB'], 5, 40, 100000, 5000)
The total profit between 2013-01-02 and 2017-12-29 was $-12377.18
18 trades where the price ratio was high were executed; they were profitable 66.67% of the time, with a largest single gain of $6120.14, and a largest single loss of $-5880.20
26 trades where the price ratio was low were executed; they were profitable 57.69% of the time, with a largest single gain of $4899.44, and a largest single loss of $-15665.97
All things considered, the strategy looks pretty good.
Let's plot our cumulative returns against a Five Year Treasury Note (denoted ^FVX or TMUBMUSD05Y) and compare the results.
To calculate the holding period return, we will create a function which calculates the cumulative value of the T-Note and its coupon payments every 6 months.
**Recall that as yield to maturity increases, bond price decreases.
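For reference, the function below rests on the standard bond pricing identity (present value of the coupon annuity plus present value of the face amount):
$ P = \frac{C}{y}\left(1 - \frac{1}{(1+y)^{n}}\right) + \frac{F}{(1+y)^{n}} $
where $C$ is the coupon payment per period, $y$ the yield per period, $n$ the number of periods remaining, and $F$ the face value. The face value line in the code solves this identity for $F$ given our investment amount.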
def bond_calculation(YTM, duration, coupon_rate, frequency, investment):
    #this is an implementation of the standard bond pricing formula which splits bond PV into PV of face + future cash flows
    YTM = YTM.dropna() / 100
    #converting ^FVX into decimal rates and dropping null values; the data is quoted in percent (2.5 = 2.5%)
    coupon = ((coupon_rate / 100) / frequency) * investment
    #calculating coupon payments based on frequency and desired investment
    total, coupon_payment = 0, 0
    #solving the pricing identity for the face value implied by our investment (using the yield at the first re-pricing date)
    face = -(((coupon / YTM.iloc[126]) * (1 - (1 / (1 + YTM.iloc[126]) ** duration)) - investment) * ((1 + YTM.iloc[126]) ** duration))
    price_dic = {YTM.index[0].strftime('%Y-%m-%d'): investment}
    for i in range(126, len(YTM), 126):
        #re-pricing the note roughly every six months (126 trading days) at the prevailing yield
        price = (coupon / YTM.iloc[i]) * (1 - (1 / (1 + YTM.iloc[i]) ** duration)) + (face / (1 + YTM.iloc[i]) ** duration)
        coupon_payment += coupon
        total = price + coupon_payment
        duration -= (1 / frequency)
        price_dic[YTM.index[i].strftime('%Y-%m-%d')] = total
    value_tracker = pd.DataFrame({'Date': list(price_dic.keys()), 'Cumulative Value': list(price_dic.values())})
    value_tracker = value_tracker.set_index('Date')
    return value_tracker
Let's see how the T-Note fared:
bond_calculation(test_df['^FVX'], 5, .75, 2, 100000)
| Date | Cumulative Value |
|---|---|
| 2013-01-02 | 100000.000000 |
| 2013-07-05 | 100375.000000 |
| 2014-01-07 | 101013.425198 |
| 2014-07-10 | 102140.152216 |
| 2015-01-12 | 104068.601604 |
| 2015-07-14 | 104140.780633 |
| 2016-01-12 | 105488.553963 |
| 2016-07-13 | 107458.311692 |
| 2017-01-13 | 106917.646151 |
| 2017-07-18 | 108148.229154 |
Let's plot the cumulative value for both ventures and see how they stack up.
trading_simulation(test_df['STT'], test_df['C'], 5, 40, 100000, 5000)[2].plot(figsize=(15,10),
title= 'Hypothetical Growth of $100,000 in Pairs Trading Strategy', color = 'r')
(bond_calculation(test_df['^FVX'], 5, .75, 2,100000 )).plot(figsize=(15,10),
title= 'Hypothetical Growth of $100,000 in 5 Year T-Note Held to Maturity', color = 'black' );
The total profit between 2013-01-02 and 2017-12-29 was $38477.66
17 trades where the price ratio was high were executed; they were profitable 81.25% of the time, with a largest single gain of $3745.52, and a largest single loss of $-1475.22
18 trades where the price ratio was low were executed; they were profitable 88.89% of the time, with a largest single gain of $3236.18, and a largest single loss of $-883.84
From the looks of things, our strategy outearns its 'risk free' counterpart by ~30% over 5 years.
The results look excellent, but we need to remember two things:
- Our strategy is not truly as 'risk free' as a treasury note.
- Forward-looking bias: we already knew the pair would be cointegrated over the whole time period.
Nonetheless, it's usually a good idea to test a strategy in a period where you know the desired outcome; if the algorithm doesn't succeed when conditions are known and ideal, it definitely shouldn't see live trials.
At a later date, I may return to this project and see if I can better quantify the amount of alpha vs. risk that was actually taken on to earn these returns.
Now let's see how our strategy fared when tested in a period where we do not know its cointegration status.
#'out-of-sample'
null = trading_simulation(test2_df['STT'], test2_df['C'], 5, 40, 100000,5000)
The total profit between 2018-01-02 and 2021-04-01 was $-20067.23
13 trades where the price ratio was high were executed; they were profitable 76.92% of the time, with a largest single gain of $4212.73, and a largest single loss of $-5909.60
15 trades where the price ratio was low were executed; they were profitable 46.67% of the time, with a largest single gain of $3560.34, and a largest single loss of $-6177.59
Ouch!
I was extremely surprised by the results here, especially given the promise of the pair, but I think the cause will become clear when we examine the price graphs from 2018-2021.
Side Note: I hope this graph amuses you as much as it amused me when I first saw it.
plt.figure(figsize=(15,10))
#concatenating the in-sample and out-of-sample data so both periods appear on one chart
full_df = pd.concat([test_df, test2_df])
diff = (full_df['STT'] - full_df['C'])
full_df['STT'].plot(color = 'black')
full_df['C'].plot(color = 'royalblue')
diff.plot(color = 'midnightblue')
plt.legend(['State Street', 'Citigroup', 'Nominal Price Difference'], prop={'size': 12}, loc = 'lower left')
plt.axvline("2018-01-01", color='black')
plt.axvspan('2013-01-01', '2018-01-01', color='green', alpha=0.2)
plt.axvspan('2018-01-01', '2021-04-01', color='red', alpha=0.2)
plt.text("2014-11-01", 100, 'training period', fontsize=15, style='italic')
plt.text("2019-03-01", 100, 'testing period', fontsize=15, style='italic')
plt.text("2018-09-01", 25, 'Cointegration P-Value = .55', fontsize=15, style='italic')
plt.text("2014-05-01", 25, 'Cointegration P-Value = .001', fontsize=15, style='italic')
plt.ylabel('Price In U.S. Dollars')
plt.show()
This graph explains in part why every brokerage has a generic risk warning that goes something like:
"All investing is subject to risk, including the possible loss of the money you invest."
Five years of near-perfect cointegration and correlation, and, hilariously, it all falls apart right after we stop testing for it.
I looked into the factors that might have affected State Street's performance during 2018 and found the event that likely broke its cointegration with Citigroup on Wikipedia:
In 2018, State Street completed its acquisition of Charles River Development, a Burlington, Massachusetts provider of investment management software. The deal closed October 1, 2018, at a cost of approximately $2.6 billion that will be financed by the suspension of share repurchases and the issuing of common and preferred equity. News of the acquisition led to a drop in State Street shares of nearly 10% with share prices remaining flat since the purchase.
While it is disappointing that this strategy did not work in its current iteration, I have plans to test out more sectors, improve trading operations, and ultimately run a similar strategy on a live paper trading platform; more on these in the final section.
- Since no pairs are truly identical, I would like to try out this strategy with a large number of pairs from various sectors to mitigate some asymmetric risk.
- I would like to test the efficacy of Kalman filters in place of moving averages (a rough sketch follows below).
- Devising a way to better quantify risk so that we can see our strategy's true (or at least more truthful) alpha.
- Converting the code (in particular the trading simulation) to be more modular, readable, and concise.
- Ultimately, I would like to implement a relative value trading strategy based on this project in production.
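On the Kalman filter idea: here is a minimal sketch of a scalar filter that tracks a slowly drifting 'true' price ratio (the process / measurement variances are arbitrary assumptions that would need tuning, and kalman_mean_estimate is my own hypothetical name):
def kalman_mean_estimate(observations, process_var=1e-5, measurement_var=1e-3):
    #scalar Kalman filter: the hidden 'true' ratio follows a random walk, observations are noisy readings of it
    estimates = np.zeros(len(observations))
    estimate, error = observations[0], 1.0
    for t, observed in enumerate(observations):
        #predict step: the drifting mean gains uncertainty each day
        error += process_var
        #update step: blend the prediction with the new observation
        gain = error / (error + measurement_var)
        estimate += gain * (observed - estimate)
        error *= (1 - gain)
        estimates[t] = estimate
    return estimates
#e.g. plot kalman_mean_estimate(price_ratio.values) against the 10 / 60 day moving averages to compare smoothers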