Scratching that itch from ifelse

Okay, as I wrote yesterday, ifelse is rather slow, at least compared to working in C++. As my current project uses ifelse rather a lot, I decided to write a small utility function. Expecting that I will collect a number of similar functions over time, I made a package out of it and posted it on GitHub: https://github.com/ojessen/ojUtils

I get a speedup of roughly 30 to 55 times across all four target types tested below (numeric, character, integer, and logical).

Feedback and corrections greatly appreciated.

Thanks to the people at Travis for providing a free CI server that works directly with GitHub. This is of course a tiny example, but it is good to know that the workflow can be set up in about five minutes.

And thanks to Romain François for showing some Rcpp sugar:
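His snippet did not make it into this post, so here is a rough reconstruction of what the sugar version could look like (the name ifelseSugarC and the numeric-only signature are my guesses; the actual ifelseC in ojUtils dispatches on the target type):

require(Rcpp)
cppFunction('NumericVector ifelseSugarC(LogicalVector test, NumericVector yes, NumericVector no) {
  // Rcpp sugar ships a vectorized ifelse() that compiles down to a single C++ loop
  return ifelse(test, yes, no);
}')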

Some data:

require(ojUtils)
## Loading required package: ojUtils
require(microbenchmark)
## Loading required package: microbenchmark
test = sample(c(TRUE, FALSE), size = 1e5, replace = TRUE)
yes = runif(1e5)
no = runif(1e5)

microbenchmark(ifelse(test, yes, no), ifelseC(test, yes, no))
## Loading required package: Rcpp
## Unit: microseconds
##                    expr   min      lq  median      uq    max neval
##   ifelse(test, yes, no) 31925 33404.8 34065.1 58083.5  71891   100
##  ifelseC(test, yes, no)   620   647.5   721.8   817.7 209254   100
test = sample(c(TRUE, FALSE), size = 1e5, replace = TRUE)
yes = rep("a", 1e5)
no = rep("b", 1e5)

microbenchmark(ifelse(test, yes, no), ifelseC(test, yes, no))
## Unit: milliseconds
##                    expr    min     lq median     uq   max neval
##   ifelse(test, yes, no) 57.313 58.763 59.626 72.435 87.92   100
##  ifelseC(test, yes, no)  1.747  1.837  1.926  2.749 29.56   100
test = sample(c(TRUE, FALSE), size = 1e5, replace = TRUE)
yes = rep(1L, 1e5)
no = rep(2L, 1e5)

microbenchmark(ifelse(test, yes, no), ifelseC(test, yes, no))
## Unit: microseconds
##                    expr     min      lq  median      uq   max neval
##   ifelse(test, yes, no) 30747.6 31868.5 32274.8 32829.0 59412   100
##  ifelseC(test, yes, no)   453.7   548.9   581.5   646.2 27575   100
test = sample(c(TRUE, FALSE), size = 1e5, replace = TRUE)
yes = rep(T, 1e5)
no = rep(F, 1e5)

microbenchmark(ifelse(test, yes, no), ifelseC(test, yes, no))
## Unit: microseconds
##                    expr     min      lq  median      uq   max neval
##   ifelse(test, yes, no) 29331.2 31167.3 31719.7 32455.3 60589   100
##  ifelseC(test, yes, no)   460.1   537.1   566.8   640.7 27118   100
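
Before trusting the numbers, a quick sanity check one might run (my addition, assuming ifelseC matches base ifelse on NA-free input):

stopifnot(all(ifelse(test, yes, no) == ifelseC(test, yes, no)))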

Comparing ifelse with a C++ for loop with ifs

I am currently reading a bit about Rcpp and its potential for speeding up R. I found one unexpected example in a lecture by Hadley Wickham:

require(Rcpp)
signR <- function(x) {
  if (x > 0) {
    1
  } else if (x == 0) {
    0
  } else {
    -1
  }
}

cppFunction('int signC(int x) {
  if (x > 0) {
    return 1;
  } else if (x == 0) {
    return 0;
  } else {
    return -1;
  }
}')

require(microbenchmark)
microbenchmark(signC(rnorm(1)), signR(rnorm(1)), times = 1e5)
## Unit: microseconds
##             expr   min    lq median   uq  max neval
##  signC(rnorm(1)) 2.832 3.186  3.540 3.54 4130 1e+05
##  signR(rnorm(1)) 2.478 3.186  3.186 3.54 2641 1e+05

As expected, the two versions perform nearly identically. Now for the surprise: I changed the scalar version of signC into a vectorized version:

library(Rcpp)

cppFunction('IntegerVector signCVec(NumericVector x) {
  int n = x.size();
  IntegerVector out(n);
  for (int i = 0; i < n; i++) {
    if (x[i] > 0) {
      out[i] = 1;
    } else if (x[i] == 0) {
      out[i] = 0;
    } else {
      out[i] = -1;
    }
  }
  return out;
}')

signRVec <- function(x) {
  ifelse(x > 0, 1, ifelse(x == 0, 0, -1))
}

Now I would have expected these two functions to perform rather similarly as well, but see for yourself:

x = rnorm(1e6)

microbenchmark(signCVec(x), signRVec(x), times = 10)
## Unit: milliseconds
##         expr    min      lq  median      uq     max neval
##  signCVec(x)   8.07   8.103   8.311   8.761   8.952    10
##  signRVec(x) 571.91 581.988 607.664 620.322 743.546    10

Wow: roughly a 70-fold speedup using Rcpp, going by the medians.
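
As an aside, a good part of ifelse()'s cost comes from allocating full-length intermediate vectors and copying attributes; even in pure R, assigning into a pre-allocated result via logical indexing closes much of the gap. A sketch (my addition, not part of the benchmarks above):

signRIdx <- function(x) {
  # pre-allocate the result and fill it by logical indexing,
  # avoiding the nested ifelse() copies
  out <- integer(length(x))
  out[x > 0] <- 1L
  out[x < 0] <- -1L
  out
}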

The effect of buying stocks at the peak

Today my tabloid indicator went off once again, i.e. the insight that you should say goodbye to stocks as soon as the tabloid press calls for investing in them. A case in point is the following article from Manager Magazin:

The DAX stands at just under 10,000 points. High time to say goodbye to a few stock-market fallacies. For example, the one that you should only buy stocks when they are cheap.

To test this claim, the following strategies will be compared:
1. The buy-high strategy: every 400 trading days (roughly two years), buy at the highest point of the DAX within that window and sell after 10 years. The returns of the individual trades are averaged and represent the strategy's yield.
2. The buy-low strategy: every 400 trading days (roughly two years), buy at the lowest point of the DAX within that window and sell after 10 years. The returns of the individual trades are averaged and represent the strategy's yield.

Looking at the DAX's history, I already have my first doubts that the buy-high strategy could come out ahead.
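
The DAX series presumably comes from Yahoo Finance via quantmod; the loading chunk is not shown in the post, so this is my reconstruction (the start date is an assumption):

require(quantmod)
getSymbols("^GDAXI", from = "1990-01-01")  # Yahoo ticker for the DAX

The plot below would then be something like plot(Cl(GDAXI)).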

## [1] "GDAXI"

[Plot: DAX closing prices]

First, we identify the entry and exit days for the two strategies.

# Window start days: one every 400 trading days, leaving 2000 days (~10 years) for the holding period
seq_times = seq(from = 1, to = nrow(GDAXI) - 2000, by = 400)
index(GDAXI[seq_times])
##  [1] "1990-11-26" "1992-07-10" "1994-02-09" "1995-09-12" "1997-04-18"
##  [6] "1998-11-23" "2000-06-23" "2002-01-23" "2003-08-21" "2005-03-16"
high_points = low_points = numeric(length(seq_times) - 1)

# Within each 400-day window, find the day of the highest and the lowest close (column 4)
for(point in 2:length(seq_times)){
  high_points[point-1] = seq_times[point-1] + which.max(GDAXI[(seq_times[point-1]):seq_times[point], 4]) - 1
  low_points[point-1] = seq_times[point-1] + which.min(GDAXI[(seq_times[point-1]):seq_times[point], 4]) - 1
}

# Exit 2000 trading days (~10 years) later, capped at the end of the sample
high_points_exit = pmin(high_points + 2000, nrow(GDAXI))
low_points_exit = pmin(low_points + 2000, nrow(GDAXI))

data.frame(high_points = index(GDAXI[high_points]),
           high_points_exit = index(GDAXI[high_points_exit]),
           low_points = index(GDAXI[low_points]),
           low_points_exit = index(GDAXI[low_points_exit])
           )
##   high_points high_points_exit low_points low_points_exit
## 1  1992-05-25       2000-05-11 1991-01-16      1999-01-12
## 2  1994-01-03       2001-12-10 1992-10-06      2000-09-19
## 3  1995-09-06       2003-08-15 1995-03-28      2003-03-11
## 4  1997-03-11       2005-02-08 1995-10-27      2003-10-06
## 5  1998-07-20       2006-06-01 1997-04-22      2005-03-18
## 6  2000-03-07       2008-01-21 1998-12-14      2006-10-26
## 7  2000-07-20       2008-06-04 2001-09-21      2009-08-03
## 8  2002-03-19       2010-01-26 2003-03-12      2011-01-11
## 9  2005-03-07       2012-12-28 2003-09-30      2011-07-28

Next, we compute the log returns and their averages.

high_returns = log(as.numeric(GDAXI[high_points_exit, 4])) - log(as.numeric(GDAXI[high_points, 4]))
low_returns = log(as.numeric(GDAXI[low_points_exit, 4])) - log(as.numeric(GDAXI[low_points, 4]))

mean(high_returns)
## [1] 0.3453371
mean(low_returns)
## [1] 1.095384
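
Translated from log returns into simple returns (my conversion of the averages above):

exp(mean(high_returns)) - 1  # ~0.41, i.e. about 41% over the 10 years
exp(mean(low_returns)) - 1   # ~1.99, i.e. about 199% over the 10 years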

As the plot also shows, the buy-low strategy delivers fairly reliable returns of more than 100 percent over the 10-year horizon, while the buy-high strategy shows a very mixed performance.

[Plot: 10-year returns of the buy-high and buy-low strategies]

One can of course object that you never know in advance when the high or the low has been reached. For comparison, the result of buying on the middle day of each 400-day window is therefore shown as well.

# Buy on the middle day of each 400-day window, exit 2000 trading days later
seq_mean = round((seq_times[-length(seq_times)] + seq_times[-1])/2, 0)
seq_mean_exit = pmin(seq_mean + 2000, nrow(GDAXI))
mean_returns = log(as.numeric(GDAXI[seq_mean_exit, 4])) - log(as.numeric(GDAXI[seq_mean, 4]))

mean(high_returns)
## [1] 0.3453371
mean(low_returns)
## [1] 1.095384
mean(mean_returns)
## [1] 0.5344595
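
In simple-return terms (again my conversion):

exp(mean(mean_returns)) - 1  # ~0.71, i.e. about 71% over the 10 years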

The result of this strategy is closer to that of investing at the peak, but on average it achieves a higher return.
[Plot: 10-year returns of the mid-window strategy]

Shoutout for Atlassian

In the last two days I was able to download and install a great stack of software from Atlassian, which would theoretically allow a small team of 5-10 people to organize software development on enterprise-grade tools:

  • JIRA for issue tracking and planning, with support for agile project management
  • Stash for hosting and integrating your Git repositories
  • Crucible + FishEye for code review
  • Bamboo for continuous integration
  • Confluence for documentation

As a novice to server administration, I was able to install the software, set the packages up for integrated work, and import my projects from GitHub, including issues, tags and so on, in about 10 working hours. Using the software is rather easy, maybe with the exception of the CI server, which is probably more a sign of my lack of knowledge in this area, and of a lack of templates in the R domain. And the price is a steal: for each package there is a starter licence covering 5-10 seats that sets you back $10, the proceeds of which Atlassian donates. So, if you are thinking of starting a startup, you could probably do worse than locking yourself into the Atlassian stack.

SAS on the road

I like some of the ideas about what SAS might need that truck for:

https://twitter.com/kjhealy/status/423829146113155072