Readers of Cross Sectional Volatility raised numerous questions on implementing the trading signal derived from cross sectional volatility. Towards exploring and visualizing these questions, an implementation in R is considered here.

Begin by assuming a frame exists named ret which contains a sequence of log returns (columns are stocks, rows are seconds), measured at second intervals, with $T + H$ observations. Define values for the key model constants:

N <- 5		# number of stocks
T <- 50		# number of seconds in &quot;past&quot;
H <- 60		# number of seconds in &quot;future&quot;
k <- 4		# number of PCA eigenvectors to use


For simulation sake, suitable values drawn from a normal distribution can be constructed as:

ret <- data.frame(replicate(N, log(1 + rnorm(T + H)/100)))


Given returns in ret, construct $X$, $M$, $Y$, per p. 26; X is the “past” returns over which principal components are calculated, with length T ($T \times N$); future are the “future” returns over which prediction is generated, with length H ($H \times N$); M is the average return for each stock ($1 \times N$), and Y is the de-meaned “past” ($T \times N$):

N <- length(ret)
X <- as.matrix(ret[1:T,])
future <- ret[(T+1):(T+H),]
M <- colMeans(X)
Y <- scale(X, scale=FALSE)


Visualize log returns and cumulative log returns via plotting the following:

plot.ts(X)
par(mfrow=c(3,2))
for (i in c(1:N)) {
lab <- paste('X', i);
plot(cumsum(X[,i]), type='l', main=lab, ylab='', xlab='')
}


Providing charts which look like familiar medium-frequency equities:

Next, perform PCA and construct $D$ from the top $k$ eigenvectors in $\phi$ projected onto $Y$, via SVD (as proposed on p. 58). Recall the eigenvectors are columns in rotation, thus [,c(1:k)] selects the top-$k$:

phi <- prcomp(X)$rotation[,c(1:k)] D <- Y %*% phi  A summary on prcomp(X) reveals: Importance of components: PC1 PC2 PC3 PC4 PC5 Standard deviation 0.01137 0.009963 0.009274 0.008849 0.008132 Proportion of Variance 0.28177 0.216240 0.187370 0.170580 0.144040 Cumulative Proportion 0.28177 0.498010 0.685380 0.855960 1.000000  Thus, the top $k$ principal components explain 85.6% of variance, leaving the remaining 14.4% to denoise. Given $D$, calculate the matrix of $\beta$ (which is $N \times k$) from the long-horizon regressions for each stock, iterating the summation from $(t + 1)$ to $(T + H)$, per p. 28 (which due to lacking rigor, has several possible interpretations): B <- c() for (i in c(1:N)) # iterate over stocks { hsum <- future[1,i] Dsum <- D[T,] for (j in c(2:H))&nbsp;&nbsp;&nbsp;# generate rows by walking up the horizon { hsum <- rbind(hsum, sapply(data.frame(future[c(1:j),i]), sum)) Dsum <- rbind(Dsum, sapply(data.frame(D[c((T-j+1):T),]), sum)) } B <- cbind(B, lm(hsum ~ Dsum)$coefficient[c(1:k+1)])
}


Plot hsum and Dsum for a given stock:

via:

par(mfrow=c(3,3))
plot(hsum, type='l')
for (i in c(1:k)) {
lab=paste("Dsum[",i,"]");
plot(Dsum[,i], type='l', ylab=lab)
}


And, OLS residuals for all N stocks (which exhibit serial correlation, but do look mostly stationary):

Generate accumulated estimated $\hat{S}$ and actual returns $R$, using $D$ and $M$ (p. 29):

Dhat <- colSums(D[c((T-H):T),])
S <- Dhat %*% B
Shat <- as.vector(S + M)
R <- as.vector(colSums(future))


Where each is a vector, with one accumulated horizon estimate per stock. Inequality comparison of estimated and actual generates the trading signal:

signal <- R > Shat
cat('Shat:',  Shat, "\n")
cat('R:', R, "\n")


With results:

Shat: 0.02340865 -0.00604718 -0.03142925 0.02960236 0.04318216 R: 0.04080749 -0.002699089 -0.05541317 0.02654577 0.04163165 Signal: SELL SELL BUY BUY BUY

Thus, signal for the current second: sell short stocks 1 and 2; buy stocks 3, 4, and 5.

33 Comments leave one →
1. March 8, 2011 12:55 am

Hi,

have you tried backtesting this technique on daily data and universe be e.g. S&P 500, EUR/USD, Bond ETF and Silver? I would be interested to know if it is worth implementing on not very high-frequency data, but the daily ones.

Jozef

March 8, 2011 1:01 am

@jozef: I have not done so, but am fairly skeptical for daily.

2. March 8, 2011 12:56 am

By universe I meant only S&P 500 index (not included stocks), thus I suggest 4 assets in my comment above.

March 8, 2011 1:10 am

@jozef: while admittedly not having performed the analysis, unclear whether there is economic intuition to justify strong principal components underlying such a small asset set (covariance matrix is likely to be very noisy); with the possible exception of macro factors, such as business cycle or monetary policy, which would a priori seem to have little potential for driving daily convergence.

• March 8, 2011 1:39 am

Hi, seems reasonable, that the asset set is very heterogenous. What about stocks from the same sector, but on daily data? I was more interested to know whether this technique is feasable on daily data.

3. March 8, 2011 1:39 am

sorry, have not spotted your answer about daily data. thanks for prompt reaction. Jozef

4. Breno Neri permalink
March 8, 2011 6:38 am

Well, if we have buys and sells for with a random data, there is something wrong with this strategy, no?

March 8, 2011 9:46 am

@Breno: nothing wrong, that is intended behavior: the model generates a continuous sequence of long/short directional signals (one per second), presuming a position for every stock at all times; thus, the model is naturally suited to supply liquidity when run via limit orders.

If execution discretion is desired, the model can be overlaid with an execution confidence signal (for example, by evaluating a fitness statistic from the long-horizon regressions).

March 8, 2011 1:52 pm

You might want to define X before calling plot.ts(X) in your code.

(Hi BN!)

March 8, 2011 2:11 pm

@Matthew: good point; I will reorder the prose to clarify.

March 8, 2011 3:00 pm

The code as is fails on line

Dsum <- rbind(Dsum, sapply(data.frame(D[c((T-j+1):T),]), sum))

when j=52 using your parameters because (T-j+1) <0

March 8, 2011 4:07 pm

Yeah, it’s better to make H equal T in this example.

March 8, 2011 4:14 pm

@Matthew: yeah, good call; the invariant (H <= T) must be satisfied for long-horizon, which I did not satisfy in my toy example. If I get time, I will update post accordingly.

March 8, 2011 5:31 pm

Quantivity,

I’m very glad to see the increase in activity and the posts, as always, are very interesting. It’s very kind of you to post the example in R. Please keep up the great work.

Trey

March 8, 2011 11:23 pm

March 9, 2011 3:30 pm

So many posts are very inspiring, and some “live” examples in R are quite nice way to do a bit of experimenting and research. Thanks again for covering gray areas of CSV so much.

March 10, 2011 2:56 am

There is small bug in the code.
Y <- X – M doesn't really produce a centered by colums matrix (you can easily check by running colMeans(Y), it would not return 0 values).
Instead, it could be possible to use something like this:
Y <- scale(X, scale=FALSE)

March 10, 2011 10:22 am

@Aleksey: agreed, thanks for calling that out; post updated.

March 17, 2011 1:52 pm

Quantivity,

I appreciated your thoughtful quantitative trading book list in a previous post. Can you please recommend one portfolio analysis book that would cover several methods from simple to complex as well as address optimization?

Much Appreciated,
Jeff

March 17, 2011 6:32 pm

@Jeff: please clarify what type of portfolio analysis?

March 17, 2011 8:27 pm

I was looking for methods for optimizing a collection of return streams derived from multiple strategies applied to multiple markets and even across multiple asset classes. The end result of such an analysis would be optimal weights for the various return streams to arrive at a particular portfolio return/risk/volatility profile.

Many Thanks,
Jeff

March 18, 2011 11:30 am

FYI, in the code you guys are running regression over not centered future returns and then add means of historical returns to Shat. Probably need to add one more line to center future returns as well? Also, if we center future returns, shouldn’t we add M*H – since we sum H future returns with a mean of M?

March 18, 2011 3:19 pm

I suppose the regression should be be run over future returns centered around the historical mean. Otherwise, you can’t know mean of the future returns without actually “looking” into the future.

March 20, 2011 2:38 pm

@Aleksey: recall “future” is with respect to the model trading clock, which is in the past for wall clock time. So, there is no data snooping concern.

March 20, 2011 4:45 pm

@Alexander: implementation above is consistent with the thesis, which states “ran a regression, estimated by OLS, on the future accumulated log returns with the last sum of H-period dimensionally reduced returns in the principal component space” (p. 27). Whether using centered future returns is a preferable alternative is a different question; indeed, one worth considering.

The future cannot be defined as centered as that would result in signal being miscalculated, due to centering inconsistency: inequality on centered (R) versus non-centered (S^) returns. If you want to regress centered future returns, then hsum should be summed over a centered future:

 centerfuture <- scale(future, scale=FALSE) ... hsum <- centerfuture[1,i] ... for (j in c(2:H)) {    hsum <- rbind(hsum, sapply(data.frame(centerfuture[c(1:j),i]), sum))    ... 

Finally, S^ requires inverting the centering performed on Y, and thus is appropriate to add back the mean H (as opposed to M*H). See the bottom of page 28 in the thesis for further explanation.

March 26, 2011 9:22 am

There is no any assumptions about distribution of returns and all this mdoel does – trying to predict. Hence, if instead of ret <- data.frame(replicate(N, log(1 + rnorm(T + H)/100))) I put ret print(Shat)
[1] 90.98732 90.88485 90.87899 90.98547 90.87325
> print(R)
[1] 120.0778 119.9253 119.9823 120.0589 119.9198

Now, if we only do centering of future returns we get:

> print(Shat)
[1] 2.973299 2.997521 2.996584 2.976858 2.957039
> print(R)
[1] 120.0274 120.0547 120.0358 120.0382 120.0044

Now, if we add H*M instead of M in the end we get:

> print(Shat)
[1] 120.0223 119.8222 120.0676 120.1109 119.9371
> print(R)
[1] 119.9645 119.8452 120.0216 120.0628 119.9399

Last result seems the best to me…

March 26, 2011 9:24 am

in the above post some text didnt fit: I put ret <- data.frame(replicate(N, 3.0+log(1 + rnorm(T + H)/100))) instead of current code.

March 26, 2011 7:07 pm

@Alexander: can you clarify what fitness metric you are using to assess “best”?

Also, can you elaborate on rationale for use of “3.0 + log(…)” for returns generation?

March 26, 2011 9:01 pm

issue I am trying to point out here is that something is wrong with the prediction model. It happens when returns have non zero means. I bump returns high enough so the issue can be clearly seen, hence 3.0+log.

Better means smaller prediction error produced.

April 2, 2011 2:08 am

Alexander – no, I don’t think adding H*M is correct. You can actually “debug” the prediction model very easily, that’s how I did that:
define about 5 linear functions, like y(x) = a + kx, where for every function you have different a and k coefficients.
Then compute log differences, separate into three intervals and feed that as an input into this program.
Because these functions are linear, PCA will find one main component, and OLS will estimate necessary coefficients matrix B to “restore” future values of the function. Then by comparing computed output and real value you can see if there is any problem with the prediction model.

April 6, 2011 8:32 pm

For one time series, did you do that?:

x – norm(0,1)

y=a+k*x — simulate y N times

Feed log(y_t)-log(y_{t-1}) as returns?