Interactive Modelling of Categorical Data


This page is available in English only.

Abstract

Exploratory and interactive methods are increasingly accepted in data analysis. It should be both possible and desirable to extend these methods to modelling and model finding.

Based on graphical models and some new graphics for loglinear models we introduce a first approach for categorical data as implemented in the software TURNER.

Keywords

Interactive Modelling; Exploratory Data Analysis; Graphical Models; (Bayesian) Model Averaging; TURNER.

 

1 Introduction

 

Nowadays exploratory data analysis and interactive graphical methods are well accepted, especially in data mining and data warehousing. Made practical by powerful computers, these methods should not be taken as the only way to work with data, but they serve as a rather useful 'hypothesis generator', especially if nobody really knows what's going on.

Interactive graphical methods enable the user to gain 'insight' into his/her data and to get an idea of the 'true underlying relationships'.

But not so with modelling. Frozen at the level of years ago, modelling is still a highly complicated and unintuitive matter. Of course very sophisticated models have been developed, and computers enable the calculation of thousands of such models in a split second. But building models, not to speak of reasonable models, is still no easier than, say, 10 or 20 or 50 or 150 years ago. Persuading a computer to calculate the desired model is a nontrivial task, and hardly any non-statistician can hope to set up even the simplest cases correctly. And to gain insight into the models, that is, their goodness-of-fit, their interpretation and implications, all the reasons why we model at all, we still have to browse through cryptic ASCII output; thanks to computers we can now at least have thousands of these.

It should be not only possible but desirable to apply interactive graphical methods to the modelling process as well as to the resulting models. We need computer interfaces which are easy to use and save our nerves for the real problems. And we need methods to 'look into' single models to understand how they work with our specific data, and methods to compare models, so that we can choose one or even a few from a whole range of candidates rather than helplessly staring at thousands of meaningless numbers. There are quite a few methods available for determining which one model to choose, but none of these really meets these needs. Stepwise algorithms are almost all that can be done, but usually the user has no overview of the alternatives and the 'interesting' branchings (as opposed to the branchings which lack any serious alternative, maybe even regardless of the criterion used). Second, for stepwise algorithms you have to use one single number as a criterion for branching. Whether you use AIC, Mallows' Cp, the likelihood ratio statistic G², or something else, a single number cannot convey all the information to be considered within a whole model, and the user cannot estimate the impact of changing the criterion.

If we restrict ourselves to the modelling of discrete data via loglinear models, a first approach to solving these problems is at hand: graphical models. But graphical models in their classical sense are more a visualization of the model's formula than of the operation of the model on the data. Section 2 will show how to extend graphical models to cover this. In section 3 we will introduce another visualization of models, so-called interaction lattices. Combining and displaying our results on single models with incomplete model trees (section 4), we suggest an approach which might be a first extension of the classical modelling process towards the methods and strategies of interactive data analysis. Section 5 gives some preliminary conclusions and ideas for future improvement and implementation.

 

2 Interactive Graphical Models

 

Graphical models as outlined in Whittaker (1991) can be used to visualize loglinear models, at least graphical hierarchical loglinear models. Since interpretation is considered to be important, hierarchy is not a real constraint: non-hierarchical models are at best difficult to interpret. Being bound to graphical models is more annoying, but there are first steps towards improvement (see my diploma thesis), and the use of interaction lattices (section 3) avoids this problem entirely (of course we will have other problems then).

FIGURE 1: Left: the classical graph for the model ME, MG, MP. Right: the same model with fuzzy edges. It is easy to see that the interaction ME is highly significant while MG is not. This information is data dependent and not available from the graph on the left, which is only a representation of the model formula.

But such model graphs only reflect the presence or absence of interactions, which is a visualization of the formula of the model (see fig. 1, left side). When working with real data one is concerned with the ability of the model to describe reality in some sense, the goodness-of-fit of the model. Many efforts have been made to describe the goodness-of-fit of a model with one single number, for example Read & Cressie (1988). But why not overlay the graphical representation of the formula, the model graph, with data dependent information about the goodness-of-fit or the significance of the model elements? This could be done by introducing fuzzy edges.

In a graphical model, edges, which represent interaction terms of the formula, are either present or not. But when modelling data, interaction terms are more or less significant. It would be possible to draw an edge whenever the corresponding interaction term is in the model and to let the linewidth denote the significance of this interaction. The goodness-of-fit can then be read from the graphic, see fig. 1, right side.

This opens the door for a lot of new problems to enter. For example, the significance of edges (interactions) isn't uniformly interesting. A change of the p-value from 0.6 to 0.8 isn't worth mentioning, while a change from 0.06 to 0.04 seems much more important. So it sounds reasonable to apply some sort of nonlinear scaling to the edges, which could be made interactively accessible by displaying the weight function and manipulating it with the mouse (exactly like changing the shape of Bézier splines in any graphics package).
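One possible weight function can be sketched in Python. Everything here is an assumption made for illustration: the name `edge_weight`, the sigmoid-on-log-scale form, and the parameters `alpha` and `steepness` do not come from TURNER. In an interactive tool the user would reshape this curve with the mouse instead of setting parameters.

```python
def edge_weight(p_value, alpha=0.05, steepness=2.0):
    """Map a p-value in [0, 1] to an edge weight in (0, 1].

    Hypothetical weight function: a sigmoid in log(p) centred on alpha.
    It reacts strongly to changes near alpha and hardly at all to
    changes among large p-values.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return 1.0 / (1.0 + (p_value / alpha) ** steepness)


# Weight change for 0.06 -> 0.04 versus 0.6 -> 0.8.
near_alpha = edge_weight(0.04) - edge_weight(0.06)
far_from_alpha = edge_weight(0.6) - edge_weight(0.8)
```

With these defaults the step from 0.06 to 0.04 changes the weight far more than the step from 0.6 to 0.8, which is the behaviour asked for above.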

Furthermore, the number of distinguishable linewidths on a computer screen is limited. If there are huge differences to be displayed, all not-quite-so-huge differences will look almost indistinguishable. This can be solved by introducing upper and lower bounds for the displayed weights. All edges with weights below the lower bound will be displayed uniformly with the minimum linewidth and all above the upper bound with the maximum linewidth available, while every weight in between will be discretised to fit the number of distinguishable lines, let's say ten. But then we need warnings (for example by using red color for the edges) to distinguish an edge that is displayed at its true weight from one that is 'out of range'.
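The clipping and discretising step might look as follows. This is a minimal sketch: the function name `linewidth_level` and its default bounds are hypothetical, and the returned flag only says that a warning (such as a red edge) would be appropriate.

```python
def linewidth_level(weight, lower=0.25, upper=0.75, levels=10):
    """Return (level, out_of_range) for an edge weight.

    level is an integer in 1..levels. out_of_range is True when the
    weight lies outside [lower, upper] and was therefore clipped, so
    the display can warn the user.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    out_of_range = weight < lower or weight > upper
    clipped = min(max(weight, lower), upper)
    fraction = (clipped - lower) / (upper - lower)
    # The upper bound itself falls into the top level.
    level = 1 + min(int(fraction * levels), levels - 1)
    return level, out_of_range
```

For example, `linewidth_level(0.5)` gives a mid-range level without a warning, while `linewidth_level(1.0)` gives the maximum level together with the out-of-range flag.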

Lots of interactive methods can be added for further improvement, such as interrogation (see fig. 2) and linking & highlighting.

FIGURE 2: An example for the interrogation of model elements. Interrogation of variables could lead to displaying the marginal distribution in a bar chart, which reveals the number and names of levels and the distribution within these levels.

This will enable the user to 'communicate' with the model and ask questions about the elements of this model. Otherwise distracting additional information can be retrieved on demand and the possibility of switching back and forth between alternate displays will improve insight into the implications of the model, very similar to the process of learning about an unknown object in your hand: you first of all want to have a look at it from all sides.

If implemented consistently, this interface will replace the classical command line, so that even non-statisticians can easily build models by dragging variables in and out and by selecting, adding or removing edges. In this way the intuitive elements of the graph spare the user from shuffling through formulas and allow orientation on the symbolic level of the graph instead of dealing with command-line expressions.
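To illustrate how a drawn graph could be turned back into a model formula: for a graphical model the generators are the maximal cliques of the graph. The sketch below uses brute-force enumeration, which is adequate only for a handful of variables. The function name and the input format are assumptions for illustration, not TURNER's interface.

```python
from itertools import combinations


def generators_from_graph(variables, edges):
    """Return the generators of the graphical model for a graph.

    variables: iterable of single-letter variable names.
    edges: iterable of pairs of variable names.
    The generators are the maximal cliques, found here by checking
    every subset of variables (fine for small graphs only).
    """
    adjacent = {frozenset(edge) for edge in edges}
    cliques = [
        set(subset)
        for size in range(1, len(variables) + 1)
        for subset in combinations(sorted(variables), size)
        if all(frozenset(pair) in adjacent
               for pair in combinations(subset, 2))
    ]
    maximal = [c for c in cliques if not any(c < other for other in cliques)]
    return sorted("".join(sorted(c)) for c in maximal)


star = generators_from_graph("GPEM", [("M", "E"), ("M", "G"), ("M", "P")])
```

For the graph of fig. 1 this yields the three two-way interactions of the model ME, MG, MP (written with letters in alphabetical order).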

Another possibility for visualizing loglinear models is given in the following section.

 

3 Interaction Lattices

 

Interaction lattices are graphs whose nodes represent all possible interaction terms for the given variables and whose edges connect interaction terms that are adjacent in the hierarchy of the interactions (see fig. 3).
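A minimal sketch of such a lattice as a data structure follows. It assumes that 'adjacent' means that one term contains exactly one variable more than the other, and it leaves out the constant term; both are assumptions, since the figure is not reproduced here.

```python
from itertools import combinations


def interaction_lattice(variables):
    """Build the nodes and edges of an interaction lattice.

    Nodes are all non-empty subsets of the variables (as frozensets).
    Each edge joins a term to a term with exactly one more variable.
    """
    variables = sorted(variables)
    nodes = [
        frozenset(subset)
        for size in range(1, len(variables) + 1)
        for subset in combinations(variables, size)
    ]
    edges = [
        (term, term | {extra})
        for term in nodes
        for extra in variables
        if extra not in term
    ]
    return nodes, edges


nodes, edges = interaction_lattice("GPEM")
```

For the four variables G, P, E, M this gives 15 nodes, from the four main effects up to the single four-way interaction.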

FIGURE 3: An interaction lattice for the four variables G, P, E, M. Highlighted is the model GPE, PEM.

The main advantage of interaction lattices is (besides their application to multiple models, of course) that all models can be visualized, even non-hierarchical ones. Even if we exclude the non-hierarchical models because they are difficult to interpret, all hierarchical models can be displayed, including the non-graphical ones.

A model is displayed by highlighting the corresponding interaction nodes. Since interaction lattices of the same set of variables are congruent, model comparison is as easy as overlaying the two (or even more) lattices, preferably using multiple colors for the different models.

Another feature of interaction lattices is the possibility of weighted overlaying of models, which opens ways of visualising the results of (Bayesian) model averaging. Each of the models is assigned a weight (or a Bayesian probability of 'being the true underlying model'). Visually adding these models by overlaying the interaction lattices cannot be used to interpret the added weights on each interaction. But it gives quite a good idea of what the uncertainty within models looks like, and gives an at-a-glance impression of which interactions are present in all models, in almost all models or in only a few of the models.

Of course, lots of interactive features have to be added here, too, to be able to work with these displays. Weights of models should be interactively accessible and changeable, interrogation and warnings have to be present, and via highlighting & linking it is possible to 'translate' the interaction lattice into other model views such as graphical models, whenever possible.
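Highlighting and overlaying can be sketched as plain set operations. The helper names are hypothetical, variables are assumed to have single-letter names, and a hierarchical model is assumed to be given by its generators, as in 'GPE, PEM'.

```python
from itertools import combinations


def model_terms(generators):
    """Expand generators into all interaction terms of a hierarchical model.

    These are the lattice nodes that would be highlighted.
    """
    return {
        frozenset(subset)
        for generator in generators
        for size in range(1, len(generator) + 1)
        for subset in combinations(sorted(generator), size)
    }


def compare_models(generators_a, generators_b):
    """Split the highlighted nodes of two overlaid models into three groups."""
    a, b = model_terms(generators_a), model_terms(generators_b)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


overlay = compare_models(["GPE", "PEM"], ["ME", "MG", "MP"])
```

In a display, the three groups could each be drawn in a color of their own.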
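A weighted overlay might be computed as follows. This is a sketch under assumptions: the function names are hypothetical, the model weights are taken as given, and they are normalised to sum to one. As noted above, the resulting numbers are meant to drive the display (for example node intensity) rather than to be interpreted on their own.

```python
from collections import defaultdict
from itertools import combinations


def hierarchical_terms(generators):
    """All interaction terms of a hierarchical model, letters sorted."""
    return {
        "".join(subset)
        for generator in generators
        for size in range(1, len(generator) + 1)
        for subset in combinations(sorted(generator), size)
    }


def overlay_weights(weighted_models):
    """Sum the normalised model weights on every interaction term.

    weighted_models: list of (generators, weight) pairs.
    Returns a dict mapping each term to the share of total weight
    carried by the models that contain it.
    """
    total = sum(weight for _, weight in weighted_models)
    if total <= 0:
        raise ValueError("total weight must be positive")
    shares = defaultdict(float)
    for generators, weight in weighted_models:
        for term in hierarchical_terms(generators):
            shares[term] += weight / total
    return dict(shares)


shares = overlay_weights([
    (["GPE", "PEM"], 0.5),
    (["ME", "MG", "MP"], 0.25),
    (["GP", "EM"], 0.25),
])
```

In this made-up example the interaction EM is present in all three models, while GM appears only in the second.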

 

4 Model Trees

 

In a modern data analysis, definitely more than one single model has to be considered. If you are lucky, you can determine with certainty the one and only 'true' model. But even then you have to step through multiple models until you finally come up with your 'best' model (and everybody who is less lucky will end up with 'a few' reasonable models).

Because of the sheer size of the problem, there is no way of checking whether you have found the best model according to any given criterion. All you can get is a local optimum. Displaying the space of all models is futile, but why not display the subset of all (currently) analysed models?

Two models are directly comparable if the interaction terms of one model are a proper subset of the interaction terms of the other (then the difference in G² is asymptotically chi-squared distributed and can be used to judge whether the extra gain in fit is significant). So all analysed models can be arranged in a (not necessarily connected) tree. (Strictly speaking it is not a tree but only a graph, since there can be undirected cycles. With directed edges, from simpler to more complex models, it contains no directed cycles and can be read like a tree, which corresponds much better to the idea of modelling step by step.) This visualises the 'path' through the set of all models the analyst has taken.

If a stepwise algorithm has been used, each branch taken is based on some optimality criterion. But then a 'second-best' branch exists as well. Suppose we let the computer step through all models we visited, calculate in every step not only 'the best' solution but also the second-best, and display the second-best solution as a branch not taken in our model tree. We then get an overview of how crucial the decisions of the algorithm have been, if we (1) flag all second-best branchings with their optimality in percent of the best solution and (2) evaluate whether the second-best alternative leads back to the main path in a reasonably small number of steps (let's say immediately, i.e. one step) or not.

Now we get an idea of where 'almost taken' alternatives do not head back to the main path immediately. These might be good starting points for evaluating other paths, which might or might not lead to interesting models not yet found.
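The arrangement of analysed models and the G² comparison can be sketched as follows. All names are hypothetical. Each model is represented by its set of interaction terms, and `chi2_sf` is a small standard-library helper for the upper tail of the chi-squared distribution with integer degrees of freedom, used here only to avoid an external dependency.

```python
import math


def model_graph(models):
    """Directed edges (simpler, more complex) between analysed models.

    models: dict mapping a model name to its set of interaction terms.
    An edge is kept only if no other analysed model lies strictly
    between the two, so the graph shows step-by-step refinements.
    """
    return [
        (a, b)
        for a, terms_a in models.items()
        for b, terms_b in models.items()
        if terms_a < terms_b
        and not any(terms_a < t < terms_b for t in models.values())
    ]


def chi2_sf(x, df):
    """P(X > x) for a chi-squared variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def g2_test(g2_simple, df_simple, g2_complex, df_complex):
    """p-value for the G² difference between two nested models."""
    return chi2_sf(g2_simple - g2_complex, df_simple - df_complex)


base = {"G", "P", "E", "M"}
analysed = {
    "independence": base,
    "ME": base | {"EM"},
    "ME, MG": base | {"EM", "GM"},
    "MP": base | {"MP"},
}
edges = model_graph(analysed)
```

In this made-up example the models 'ME' and 'MP' are not directly comparable, so they sit on different branches below 'independence'.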
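The two flags can be sketched for a forward selection that adds one interaction term per step. This is an illustration under assumptions: the function name, the input format and the notion of a 'gain' (any criterion where larger is better and the best gain is positive) are hypothetical, and 'rejoining' is taken to mean reaching a model on the main path by adding one further term.

```python
def flag_second_best(main_path, step_gains):
    """Describe the branch not taken at every step of a forward selection.

    main_path: list of term sets, one per visited model, in order.
    step_gains: for each step, a dict mapping every candidate term to
    its gain; the term with the largest gain was the one taken.
    """
    flags = []
    for step, gains in enumerate(step_gains):
        ranked = sorted(gains, key=gains.get, reverse=True)
        if len(ranked) < 2:
            continue  # no alternative existed at this step
        best, second = ranked[0], ranked[1]
        alternative = main_path[step] | {second}
        # Does adding one more term lead to a model on the main path?
        rejoins = any(
            alternative < later and len(later - alternative) == 1
            for later in main_path[step + 1:]
        )
        flags.append({
            "step": step,
            "taken": best,
            "not_taken": second,
            "percent_of_best": 100.0 * gains[second] / gains[best],
            "rejoins_in_one_step": rejoins,
        })
    return flags


base = frozenset({"G", "P", "E", "M"})
path = [base, base | {"EM"}, base | {"EM", "MP"}, base | {"EM", "MP", "GP"}]
gains = [
    {"EM": 40.0, "MP": 30.0, "GM": 2.0},
    {"MP": 20.0, "GP": 5.0},
    {"GP": 8.0, "GM": 6.0},
]
flags = flag_second_best(path, gains)
```

In this made-up run the last step is the interesting one: the alternative GM was almost as good as GP, and it does not lead back to the main path.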

In a nutshell, this comes down to the attempt to unify the strategic oversight of a weak calculator, the human user, with the computing power of a partner stunningly ignorant in decisions that require foresight, the computer.

 

5 Conclusion

 

Visualization of models and modelling processes seems to be a reasonable extension. Since interactive methods can only be tested in an interactive environment like a computer, we implemented the first steps of the methods described above in TURNER for Macintosh computers.

Once this approach is accepted, various additions and extensions are to be considered, such as directed graphs, displaying distances between models in the model trees, or even introducing continuous variables and more general model classes.

But why all the effort, why all the new methods and further complications? The introduction of interactive modelling serves two purposes. First, it provides a new and easy-to-use tool for the experienced statistician. Second, interactive modelling is much more intuitive and understandable than the classical way of typing formulas and analysing text output. This enables non-statisticians to build models or, even more interestingly, enables experts in other fields (such as the field the data come from) to follow the modelling process carried out by a professional statistician, so that their specific knowledge can be used within the analysis. This avoids the usual dilemma: the statistician knows all about the models but nothing about the problem, while the expert knows his/her data and all the related problems and has lots of 'expert knowledge', but cannot support the statistician because he/she simply doesn't know what's going on.

 

References



Stephan.Lauer@Math.Uni-Augsburg.DE, September '98