| |
TWIX is a binary-split decision tree algorithm for
classification and data mining developed by Sergej Potapov, Martin
Theus (University of Augsburg) and Simon Urbanek (AT&T). TWIX
stands for Trees WIth eXtra
splits.
TWIX is quite similar to Leo Breiman's CART algorithm. The
major differences are:
- The number of splits and the method for selecting variables
can be controlled by the user.
- If more than one split per level is selected, a forest of
classification trees is returned. The best model can be chosen
automatically with the help of the score function or by looking at
various diagnostic plots.
- Forests can also be used for aggregation to improve
prediction accuracy.
- If a computing cluster is available, the classification
tree forest can be built in parallel.
The advantages of TWIX are:
- better accuracy for single trees and for aggregation than
traditional CART (rpart)
and Bagging on many UCI repository
datasets.
- the best TWIX trees are more stable towards variations in
the training set.
- good alternative tree models can be found, that describe
the training data set well.
TWIX is now oficially released and can be obtained from
CRAN - install.packages("TWIX")
in R will do the trick.
|
|
The following examples descibe
the most important paramters of TWIX and illustrate the use of the
R-package.
> library(TWIX) Lade nötiges Paket: rpart
# load sample data & create test and training samples > data(olives) > i <- sample(572,150) > ic <- setdiff(1:572,i) > training <- olives[ic,] > test <- olives[i,]
# now the tree ensemble is computed # (takes a couple of seconds depending on your hardware ...)
> Tree1 <- TWIX(Region ~ ., data = training[,1:9], test.data = test, topN = c(9,2), method = "local") n = 32 Deviance and TIC of the best TWIX-tree: 718.6426 0.9509263 Deviance and TIC of the greedy tree(Nr.17): 724.3516 0.8747074
The TWIX call returns the number of generated trees (usually far less
trees are created than the parameter topN would allow, because one
of the stopping criteria permits a further recursion of the
algorithm) and the deviance and TIC (Twix Information Criteria, which
scores the generated trees) of the best TWIX tree and the greedy tree.
Note, that the greedy tree not necessarily will be identically with the
result of rpart, since rpart does various
optimizations "behind the scene" which are not exactly the same in TWIX.
> plot(Tree1) Warning message: test data not supplied in: plot.TWIX(Tree1)
If no test data is specified, the training deviance are plotted, the
training deviance is shown in a histogram - the red line is the
devianve of the greedy tree.
> plot(Tree1, type="ccr")
# mark the tree selected by the TIC > y <- predict(Tree1,newdata=test,sq=1,ccr=TRUE)$CCR > x <- predict(Tree1,newdata=training,sq=1,ccr=TRUE)$CCR
> points(x,y, col=3, pch="+", cex=3)
The option type="ccr" shows a bubble plot of the CCRs of the tree
ensemble. Again, the greedy tree is colored in red. Note, how the TIC
picked the almost optimal tree for the test data!
Almost all functions that access a tree ensemble in TWIX use the
parameter seq, which has
the lenght of the number of tree in the ensemble and is used as index
vector. The trees are sorted according to their TIC.
Parameters:
- topN
topN is a sequence of integers that defines the number of alternative
splits per level. topN = c(9, 2)
specifies 9 alternatives at the root level and 2 alternative splits at
each node of the second level. In total this could mean 9^2^0 * 2^2^1 =
81 * 4 = 324 trees in total.
- method
- topn.method
|
|
- Martin Theus, Sergej Potapov and Simon Urbanek (2005). TWIX. (Talk
at the internal seminar week of the Augsburg stats department in Sion
2005 on "exhaustive tree ensembles".)
- Martin Theus, Sergej Potapov and Simon Urbanek (2006). TWIX. (A more
up-to-date talk on TWIX, given at the 3rd Ensemble Workshop in Munich
2006.)
- Sergej Potapov (2006), Analyse von Bäumen. (Diploma
Thesis.)
If you want to run TWIX in parallel, please see the Andrew D. Martin's pages
for details or Sergej Potapov (2005), "R and Parallel Computing" (german).
Last updated: 2006/10/15 mt
|