Lehrstuhl fuer Rechnerorientierte Statistik und Datenanalyse Uni Augsburg  
Institut Forschung Lehre Mitarbeiter News Allgemeines Research Staff Software
TWIX

About TWIX

printable form

About TWIX (Trees WIth eXtra splits)
What is TWIX?
        

TWIX is a binary-split decision tree algorithm for classification and data mining developed by Sergej Potapov, Martin Theus (University of Augsburg) and Simon Urbanek (AT&T). TWIX stands for Trees WIth eXtra splits.

TWIX is quite similar to Leo Breiman's CART algorithm. The major differences are:

  • The number of splits and the method for selecting variables can be controlled by the user.
  • If more than one split per level is selected, a forest of classification trees is returned. The best model can be chosen automatically with the help of the score function or by looking at various diagnostic plots.
  • Forests can also be used for aggregation to improve prediction accuracy.
  • If a computing cluster is available, the classification tree forest can be built in parallel.

The advantages of TWIX are:

  • better accuracy for single trees and for aggregation than traditional CART (rpart) and Bagging on many UCI repository datasets.
  • the best TWIX trees are more stable towards variations in the training set.
  • good alternative tree models can be found, that describe the training data set well.

TWIX is now oficially released and can be obtained from CRAN - install.packages("TWIX") in R will do the trick.

What's new?

2006/30/06 - TWIX 0.2-1 released and TWIX is now available on CRAN.

Examples

The following examples descibe the most important paramters of TWIX and illustrate the use of the R-package.
> library(TWIX)
Lade nötiges Paket: rpart

# load sample data & create test and training samples
> data(olives)
> i <- sample(572,150)
> ic <- setdiff(1:572,i)
> training <- olives[ic,]
> test <- olives[i,]

# now the tree ensemble is computed
# (takes a couple of seconds depending on your hardware ...)

> Tree1 <- TWIX(Region ~ .,
data = training[,1:9],
test.data = test,
topN = c(9,2),
method = "local")
n = 32
Deviance and TIC of the best TWIX-tree: 718.6426 0.9509263
Deviance and TIC of the greedy tree(Nr.17): 724.3516 0.8747074
The TWIX call returns the number of generated trees (usually far less trees are created than the parameter topN would allow, because one of the stopping criteria  permits a further recursion of the algorithm) and the deviance and TIC (Twix Information Criteria, which scores the generated trees) of the best TWIX tree and the greedy tree. Note, that the greedy tree not necessarily will be identically with the result of rpart, since rpart does various optimizations "behind the scene" which are not exactly the same in TWIX.
> plot(Tree1)
Warning message:
test data not supplied in: plot.TWIX(Tree1)
deviance

If no test data is specified, the training deviance are plotted, the training deviance is shown in a histogram - the red line is the devianve of the greedy tree.
> plot(Tree1, type="ccr")

# mark the tree selected by the TIC
> y <- predict(Tree1,newdata=test,sq=1,ccr=TRUE)$CCR
> x <- predict(Tree1,newdata=training,sq=1,ccr=TRUE)$CCR

> points(x,y, col=3, pch="+", cex=3)
ccr

The option type="ccr" shows a bubble plot of the CCRs of the tree ensemble. Again, the greedy tree is colored in red. Note, how the TIC picked the almost optimal tree for the test data!
Almost all functions that access a tree ensemble in TWIX use the parameter seq, which has the lenght of the number of tree in the ensemble and is used as index vector. The trees are sorted according to their TIC.

Parameters:
  • topN

    topN is a sequence of integers that defines the number of alternative splits per level. topN = c(9, 2) specifies 9 alternatives at the root level and 2 alternative splits at each node of the second level. In total this could mean 9^2^0 * 2^2^1 = 81 * 4 = 324 trees in total.

  • method


  • topn.method


Papers and Talk

  1. Martin Theus, Sergej Potapov and Simon Urbanek (2005). TWIX. (Talk at the internal seminar week of the Augsburg stats department in Sion 2005 on "exhaustive tree ensembles".)
  2. Martin Theus, Sergej Potapov and Simon Urbanek (2006). TWIX. (A more up-to-date talk on TWIX, given at the 3rd Ensemble Workshop in Munich 2006.)
  3. Sergej Potapov (2006), Analyse von Bäumen. (Diploma Thesis.)

If you want to run TWIX in parallel, please see the Andrew D. Martin's pages for details or Sergej Potapov (2005), "R and Parallel Computing" (german).


Last updated: 2006/10/15 mt


[Institut] [Forschung] [Lehre] [Mitarbeiter] [News] [Allgemeines] [Software]