Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs
Introduction
In fitting regression models, data analysts are often faced with many predictor variables which may influence the outcome. Strategies for the selection of variables are used to identify a subset of ‘important’ predictors. Difficulties associated with strategies such as sequential procedures (e.g. stepwise or backward procedures) or all-subset selection with different optimization criteria (e.g. the Akaike (AIC) or Bayesian (BIC) information criteria) are overfitting, underfitting, biased estimates of the regression parameters of the final model and a lack of reproducibility of the regression parameters in new data (Miller, 1990); some approaches to investigating these issues by resampling methods are discussed in Sauerbrei (1999). Although subject-matter knowledge should guide selection, some variables will inevitably be chosen mainly by statistical principles, typically by p-values for including or excluding variables. The definition of a ‘best’ strategy to produce a model which has good predictive properties in new data is difficult. A model which fits the current data set well may be too data driven to give adequate predictive accuracy in other settings.
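As a rough illustration of a sequential procedure (a Python sketch of generic backward elimination, not the SAS, Stata or R programs described in this paper; the 1.96 cut-off on approximate z-statistics stands in for the usual p-value criterion, and all names are our own):

```python
import numpy as np

def backward_eliminate(X, y, names, z_crit=1.96):
    """Repeatedly drop the least significant column until every
    remaining coefficient has |z| >= z_crit (OLS, no intercept,
    normal approximation to the t-statistics)."""
    keep = list(range(X.shape[1]))
    while keep:
        Xk = X[:, keep]
        XtX_inv = np.linalg.inv(Xk.T @ Xk)
        beta = XtX_inv @ Xk.T @ y
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (len(y) - len(keep))
        se = np.sqrt(np.diag(XtX_inv) * sigma2)
        z = np.abs(beta / se)
        worst = int(np.argmin(z))
        if z[worst] >= z_crit:
            break                      # all remaining terms significant
        keep.pop(worst)                # remove the weakest predictor
    return [names[j] for j in keep]

# toy data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=(2, 200))
y = 2.0 * x1 + rng.normal(size=200)
selected = backward_eliminate(np.column_stack([x1, x2]), y, ["x1", "x2"])
```

On such data the truly influential predictor is retained, while the noise variable is usually (though, as the text notes, not reproducibly) eliminated.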
A second obstacle to model building is how to deal with non-linearity in the relationship between outcome and a continuous or ordered predictor. The traditional assumption of linearity may be incorrect, leading to a misspecified final model in which a relevant variable may not be included because its true relationship with outcome is non-monotonic, or in which the assumed functional form differs substantially from the unknown true form. Alternatively, continuous predictors may be converted into categorical variables by grouping into two or more categories. With dichotomization, considerable variability may be subsumed within each group. The implicit model is unrealistic, since individuals close to but on opposite sides of the cutpoint have very similar rather than very different outcomes. The arbitrariness of the choice of cutpoint may encourage a search for a value which gives the most ‘satisfactory’ result. Taken to extremes, all possible cutpoints may be tried and the value which maximizes statistical significance may be chosen. Because of multiple testing, the overall Type I error rate will be around 40% rather than the nominal 5% (Altman et al., 1994; Miller and Siegmund, 1982; Lausen and Schumacher, 1996). The cutpoint chosen will have a wide confidence interval and will have no substantive meaning. Crucially, the difference in outcome between the two groups will be overestimated and the confidence interval will be too narrow.
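The inflation from ‘optimal’ cutpoint searching is easy to reproduce by simulation. The sketch below is a simplified illustration of our own (two-sample z-tests at nine decile cutpoints, not the cited authors' calculation); even with this restricted search, the familywise rejection rate under the null is already several times the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim = 100, 400
hits = 0
for _ in range(n_sim):
    x = rng.uniform(size=n)          # continuous predictor
    y = rng.normal(size=n)           # outcome unrelated to x (null is true)
    cuts = np.quantile(x, np.linspace(0.1, 0.9, 9))
    for c in cuts:                   # scan candidate cutpoints
        lo, hi = y[x <= c], y[x > c]
        z = (hi.mean() - lo.mean()) / np.sqrt(
            hi.var(ddof=1) / len(hi) + lo.var(ddof=1) / len(lo))
        if abs(z) > 1.96:            # nominal two-sided 5% test
            hits += 1                # at least one 'significant' cutpoint
            break
overall_rate = hits / n_sim          # familywise Type I error estimate
```

Scanning all possible cutpoints, as described in the text, inflates the error rate further still, to around 40%.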
An alternative approach is to keep the variable continuous and to allow some form of non-linearity. Instead of using quadratic or cubic polynomials, a general family of parametric models has been proposed by Royston and Altman (1994), based on so-called fractional polynomial (FP) functions. Here, usually one or two terms of the form βx^p are fitted, the exponents p being chosen from the small predefined set S = {−2, −1, −0.5, 0, 0.5, 1, 2, 3} of integer and non-integer values, with x^0 denoting ln x. Although only a small number of functions is considered (besides no transformation, p = 1, the set yields 7 transformations for FPs of degree 1 (FP1) and 36 for FPs of degree 2 (FP2)), FP functions provide a rich class of possible functional forms leading to a satisfactory fit to the data in many situations. Royston and Altman (1994) dealt mainly with the case of a single predictor, but they also suggested and illustrated an algorithm for fitting FPs in multivariable models. By combining backward elimination (BE) with the search for the most suitable FP transformation for continuous predictors, Sauerbrei and Royston (1999) proposed modifications to this multivariable FP (MFP) procedure. A further extension of the MFP procedure aims to reflect basic knowledge of the types of relationships to be expected between certain predictors and the outcome. Stability investigations with bootstrap resampling have shown that MFP can find stable models, despite the considerable flexibility of the family of FPs and the consequent risk of overfitting when several variables are considered (Royston and Sauerbrei, 2003).
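The counts of candidate transformations quoted above can be checked directly from the standard power set of Royston and Altman (1994); a short Python sketch (FP2 allows a power to be repeated):

```python
from itertools import combinations_with_replacement

# standard FP power set; by convention x**0 is read as ln(x)
S = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

fp1 = [(p,) for p in S]                            # degree-1 models
fp2 = list(combinations_with_replacement(S, 2))    # degree-2, repeats allowed

n_fp1_nonlinear = sum(1 for m in fp1 if m != (1,)) # transformations besides linear
n_fp2 = len(fp2)
```

This recovers the 7 non-linear FP1 transformations and the 36 FP2 models mentioned in the text.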
A program in Stata (Royston and Ambler, 1999) has been available for several years, and in Stata 8 MFP is now a standard procedure. Recently, we developed programs in SAS and R. With all the programs, modelling can be done for the linear regression model, the logistic regression model and the Cox model for censored survival times. In Stata, many additional types of models are available (see Section 6.1).
The aims of the paper are to describe the MFP algorithm, to introduce the new SAS and R programs, to show advantages of the approach and to demonstrate in detail how it simultaneously selects variables and determines functional relationships for continuous predictors. For illustration, we will explain the steps taken to select the final model in a multiple regression analysis using the SAS macro. We will show that our approach finds a significant non-linear effect which would have been missed by assuming a linear relationship. In a second example, analysed in the framework of a Cox regression model for survival data, we will again illustrate the necessity of searching systematically for possible non-linear effects of continuous predictors. We will also present a new approach to presenting FP functions in a simple way. Furthermore, we will briefly discuss differences between our Stata and R programs.
Section snippets
Fractional polynomials
Suppose that we have an outcome variable, a single continuous covariate x, and a suitable regression model relating them. Our starting point is the straight-line model β₁x (for easier notation we suppress the constant term β₀). Often this will be an adequate description of the relationship, but other models must be investigated for possible improvements in fit. A simple extension of the straight line is the power transformation model β₁x^p. This model has often been used by practitioners
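A degree-m FP evaluates terms x^p under two conventions: x^0 means ln x, and a repeated power p contributes a further factor of ln x (so powers (p, p) give β₁x^p + β₂x^p ln x). A minimal evaluator, as an illustrative sketch rather than the published software, might look like:

```python
import math

def fp_terms(x, powers):
    """FP basis terms at x > 0, with x**0 -> ln(x) and each
    repetition of a power multiplying by a further ln(x)."""
    terms, seen = [], {}
    for p in powers:
        t = math.log(x) if p == 0 else x ** p
        reps = seen.get(p, 0)
        terms.append(t * math.log(x) ** reps)
        seen[p] = reps + 1
    return terms

def fp_value(x, powers, betas):
    """Linear predictor of an FP with given powers and coefficients."""
    return sum(b * t for b, t in zip(betas, fp_terms(x, powers)))
```

For example, powers (0.5, 0.5) produce the basis (√x, √x ln x), and powers (1,) reduce to the straight-line term.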
Multivariable fractional polynomials: the MFP algorithm
In many areas of application, several predictors or confounders must be handled simultaneously. The aim is to include in a final model only variables with an influence on the outcome. For continuous variables the functional form must be determined. An approach to building such a model is described by Sauerbrei and Royston (1999), who illustrated its use in two examples to obtain a prognostic and a diagnostic model in which several continuous and categorical predictors were considered. Backward
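For a single continuous x in a linear model, the function-selection step applied within MFP can be sketched as a closed test comparing the best FP2 model against the null, linear, and best FP1 models on deviance differences. The Python sketch below is an illustration under our own simplifications (OLS deviances, hard-coded 5% chi-square critical values for 4, 3 and 2 degrees of freedom), not the authors' code:

```python
import numpy as np
from itertools import combinations_with_replacement

S = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # FP power set; x**0 read as ln(x)

def basis(x, powers):
    cols, seen = [np.ones_like(x)], {}
    for p in powers:
        t = np.log(x) if p == 0 else x ** p
        r = seen.get(p, 0)
        cols.append(t * np.log(x) ** r)   # repeated power adds a ln(x) factor
        seen[p] = r + 1
    return np.column_stack(cols)

def deviance(x, y, powers):
    X = basis(x, powers)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    return len(y) * np.log(rss / len(y))  # Gaussian deviance up to a constant

def select_function(x, y):
    """Closed test at nominal 5%: best FP2 vs null (4 df), vs linear
    (3 df), vs best FP1 (2 df). Returns None (drop x) or a power tuple."""
    best_fp1 = min(((p,) for p in S), key=lambda pw: deviance(x, y, pw))
    best_fp2 = min(combinations_with_replacement(S, 2),
                   key=lambda pw: deviance(x, y, pw))
    d2 = deviance(x, y, best_fp2)
    if deviance(x, y, ()) - d2 < 9.488:    # not better than omitting x
        return None
    if deviance(x, y, (1,)) - d2 < 7.815:  # linearity not rejected
        return (1,)
    if deviance(x, y, best_fp1) - d2 < 5.991:
        return best_fp1
    return best_fp2

rng = np.random.default_rng(2)
x = rng.uniform(0.5, 3.0, 300)
y = np.log(x) + 0.1 * rng.normal(size=300)  # truly logarithmic effect
chosen = select_function(x, y)
```

In the full MFP algorithm this step is cycled over all continuous predictors, with the other variables' current terms held fixed, until the selected model stabilizes.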
Linear regression model
To exemplify some issues in multiple regression analysis, Johnson (1996) described explorations of a dataset comprising a response variable (the estimated percentage of body fat) and 13 continuous covariates (AGE, WEIGHT, HEIGHT and 10 body circumference measurements) in 252 men. The aim was to predict percentage body fat from the covariates. The study was originally reported in the sports medicine literature by Penrose et al. (1985). The dataset is available at //lib.stat.cmu.edu/datasets/bodyfat
Presentation of FP functions
Although FP functions are mathematically simple, presenting the model in the usual way, through the estimated β's and the transformed values of the covariate x, gives no impression of what is most relevant, namely the estimated function and its uncertainty at particular values of x. From a substantive point of view, the β's are not interpretable. The first step is to plot the fitted function against x. Plots shown in this paper are all from multivariable models and show the partial
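To plot a fitted FP function with its uncertainty, one evaluates the linear predictor and its pointwise standard error over a grid of x values. A hedged numpy sketch (single-covariate OLS for simplicity, without repeated powers; the 1.96 normal quantile and all names are our choices, not the paper's):

```python
import numpy as np

def fitted_curve(x, y, powers, grid):
    """OLS fit of y on an FP basis of x; returns fitted values and
    pointwise 95% confidence bands evaluated on `grid`."""
    def basis(v):
        cols = [np.ones_like(v)]
        for p in powers:
            cols.append(np.log(v) if p == 0 else v ** p)
        return np.column_stack(cols)
    X = basis(x)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    G = basis(grid)
    fit = G @ beta
    # pointwise variance: diag(G V G'), V = sigma2 * (X'X)^-1
    se = np.sqrt(np.einsum('ij,jk,ik->i', G, XtX_inv, G) * sigma2)
    return fit, fit - 1.96 * se, fit + 1.96 * se

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 5.0, 200)
y = np.sqrt(x) + 0.2 * rng.normal(size=200)
grid = np.linspace(1.0, 5.0, 50)
fit, lo, hi = fitted_curve(x, y, (0.5,), grid)  # FP1 with power 0.5
```

The three arrays can then be passed to any plotting routine to display the curve and its band against x.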
Stata
The MFP algorithm is incorporated in Stata 8 as a standard part of the package and is therefore documented and supported by StataCorp (2003). Earlier implementations were published in the Stata Technical Bulletin for versions 5, 6 and 7 of Stata.
Although the syntax of the mfp command in Stata is quite different from that of the SAS mfp8 macro, all of the features of the SAS implementation are available in Stata; indeed, the facilities of mfp in Stata are a superset of those in SAS. In summary,
Discussion
We have shown that the multivariable FP procedure described by Royston and Altman (1994) and extended by Sauerbrei and Royston (1999) to combine backward elimination with a systematic search for possible non-linear effects can be used to construct simple parametric regression models. The procedure overcomes the serious problem of arbitrary categorization and in many of the datasets we have examined, it has sufficient power to detect non-linearities which should be accommodated in the final
Acknowledgements
We would like to thank Gareth Ambler and Carina Ortseifen for developing the first versions of the R and SAS programs, respectively. Furthermore, we would like to thank Elena Pashko for her assistance in preparing the manuscript.
References (30)
- Lausen, B., Schumacher, M., Evaluating the effect of optimized cut-off values in the assessment of prognostic factors. Comput. Statist. Data Anal. (1996)
- Altman, D.G., et al., The dangers of using optimal cutpoints in the evaluation of prognostic factors. J. Nat. Cancer Inst. (1994)
- Ambler, G., Royston, P., Fractional polynomial model selection procedures: investigation of type I error rate. J. Statist. Comput. Simulation (2001)
- Austin, P.C., Brunner, L.J., Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Statist. Med. (2004)
- Becher, H., The concept of residual confounding in regression models and some applications. Statist. Med. (1992)
- Dales, L.G., Ury, H.K., An improper use of statistical significance testing in study covariables. Internat. J. Epidemiol. (1978)
- Holländer, N., Schumacher, M., 2005. Estimating the functional form of continuous covariates effect on survival time....
- Johnson, R.W., Fitting percentage of body fat to simple body measurements. J. Statist. Education (1996)
- Mantel, N., Why stepdown procedures in variable selection. Technometrics (1970)
- Marcus, R., Peritz, E., Gabriel, K.R., On closed test procedures with special reference to ordered analysis of variance. Biometrika (1976)