Organizer: Tony Babinec, tony@spss.com
Speakers
1:30 p.m.
Hybrid CART-Logit and CART-Neural Nets for Classification and Regression
Dan Steinberg,
Salford Systems and San Diego State University
with Nicholas Scott Cardell
CART and logistic regression are among the most frequently used tools for classification and response-probability modeling. Both perform well, and often excellently, in a variety of data analysis situations. However, because the two methods have quite different strengths and weaknesses, it is natural to investigate whether some combination might prove superior to either used separately. In particular, CART excels at detecting local, complex data structure, while logit excels at detecting linear and global structure. The hybrid model introduced here contains pure CART and pure logit or neural network (NN) models as special cases, and straightforward statistical tests of the hybrid alternative against the null of a CART-only, logit-only, or NN-only model are presented. The method differs considerably from previous hybridization experiments in that the logistic component is not run within CART child nodes and its covariate set includes a complete summary of the CART tree. Monte Carlo tests are reported and several real-world data mining examples are discussed.
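One way to read the hybrid described above is sketched here. This is an illustration under stated assumptions, not the authors' specification: the abstract does not give the functional form or name the tests. Assume a fitted CART tree with K terminal nodes, let d_k(x) = 1 when x falls in terminal node k (and 0 otherwise), and let x hold the p original covariates without an intercept, since the node indicators sum to one. The symbols gamma_k, beta, and the likelihood-ratio statistic are all assumptions made for this sketch.

```latex
% Hybrid logit with terminal-node indicators as extra covariates (assumed form)
\operatorname{logit} P(y = 1 \mid x)
  = \sum_{k=1}^{K} \gamma_k \, d_k(x) + x^{\top} \beta

% Special cases:
%   \beta = 0                          one fitted probability per terminal node (CART only)
%   \gamma_1 = \cdots = \gamma_K       ordinary logit with a common intercept (logit only)

% Assumed likelihood-ratio statistic for either restriction,
% treating the tree as fixed:
\mathrm{LR} = 2 \left( \ell_{\text{hybrid}} - \ell_{\text{restricted}} \right)
% with p restrictions for \beta = 0 and K - 1 restrictions for equal \gamma_k
```

In this reading, the terminal-node indicators serve as the "complete summary of the CART tree", and the logistic component is fitted once on the full sample rather than separately within each child node.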
2:00 p.m.
Hybrid and Sequential Modeling with Trees
Thomas W. Miller,
A.C. Nielsen Center for Marketing Research, University of Wisconsin-Madison
Most analysts employ hybrid modeling methods, combinations of statistical tools and techniques. Empowered by fast computers and interactive software, analysts choose from a wide range of methods including generalized linear models, nonlinear models, neural networks, smoothing, spline-based, and tree-structured methods. Data visualization tools, dynamic graphics, and diagnostics help analysts to specify models and to make appropriate variable transformations. Model selection criteria, such as Akaike's information criterion (AIC), help analysts to make decisions about alternative models.
The practice of modeling typically involves a sequence of modeling steps. Analysts use some statistical tools and techniques before others. Some methods make more sense to use in the early stages of data screening, variable definition, variable selection, and model specification. Other methods make more sense in later stages of model checking and model selection. Often difficult to describe, the practice of modeling is complex, relying upon expert judgment at various stages.
As modern, data-adaptive technologies become more widely available, we see their application within hybrid and sequential modeling frameworks. One possible sequence involves using classification trees to select explanatory variables, followed by the fitting of logistic regression models. The regression analog of this involves using regression trees to select explanatory variables, followed by the fitting of linear regression models.
Our research evaluates alternative modeling practices using statistical simulations and bootstrap techniques. We consider modeling practices that can be described as a sequence of simple modeling steps, such as trees followed by traditional methods, versus traditional stepwise methods. We consider practices that can be automated or partially automated by computer.
The objective of our research is to explore the implications of hybrid and sequential modeling practices. What effects do hybrid and sequential modeling practices have upon estimation bias and variance? What statistical tools and techniques work well in combination with other tools and techniques? How can we identify good modeling practices?
2:30 p.m.
A Comparison of Classifiers
Wei-Yin Loh,
Department of Statistics, University of Wisconsin - Madison
Hyunjoong Kim, Worcester Polytechnic Institute
Yu-Shan Shih,
Department of Mathematics, National Chung Cheng University
Keywords: classification tree, neural network, discriminant analysis
Thirty-four classification algorithms are compared on thirty-two datasets in terms of misclassification error rate, training time, and classifier complexity. The algorithms cover a wide spectrum, from classification trees and neural networks to classical and modern statistical techniques. The spline-based POLYCLASS algorithm has the lowest mean error rate, but it is not statistically significantly different from that of many others, including linear discriminant analysis, polytomous logistic regression, CART, QUEST, C4.5, and a new classification tree with multiway splits. In general, classification trees with linear splits tend to have lower error rates than trees with univariate splits. The training times of the classifiers range from seconds to days, with POLYCLASS one of the slowest and linear discriminant analysis and C4.5 among the fastest. Tree complexity also varies greatly among classification trees, with some algorithms frequently yielding trees with extremely large numbers of leaves.