Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs

https://doi.org/10.1016/j.csda.2005.07.015

Abstract

In fitting regression models, data analysts are often faced with many predictor variables which may influence the outcome. Several strategies for selection of variables to identify a subset of ‘important’ predictors have been available for many years. A further issue in model building is how to deal with non-linearity in the relationship between outcome and a continuous predictor. Traditionally, for such predictors either a linear functional relationship or a step function after grouping is assumed. However, the assumption of linearity may be incorrect, leading to a misspecified final model. For multivariable model building, a systematic approach to investigate possible non-linear functional relationships, based on fractional polynomials combined with backward elimination, was proposed recently. So far, a program has been available only in Stata, which has certainly prevented a more general application of this useful procedure. The approach will be introduced, its advantages will be shown in two examples, a new approach to presenting FP functions will be illustrated and a macro in SAS will be briefly introduced. Differences from the Stata and R programs are noted.

Introduction

In fitting regression models, data analysts are often faced with many predictor variables which may influence the outcome. Strategies for selection of variables are used to identify a subset of ‘important’ predictors. Difficulties associated with strategies such as sequential procedures (e.g. stepwise or backward procedures) or all-subset selection with different optimization criteria (e.g. Akaike (AIC) or Bayesian (BIC) information criteria) are overfitting, underfitting, biased estimates of the regression parameters of the final model and a lack of reproducibility of the regression parameters in new data (Miller, 1990); some approaches to investigate these issues by resampling methods are discussed in Sauerbrei (1999). Although subject-matter knowledge should guide selection, some variables will inevitably be chosen mainly by statistical principles, typically by P-values for including or excluding variables. The definition of a ‘best’ strategy to produce a model which has good predictive properties in new data is difficult. A model which fits the current data set well may be too data-driven to give adequate predictive accuracy in other settings.

A second obstacle to model building is how to deal with non-linearity in the relationship between outcome and a continuous or ordered predictor. The traditional assumption of linearity may be incorrect, leading to a misspecified final model in which a relevant variable may not be included because its true relationship with outcome is non-monotonic, or in which the assumed functional form differs substantially from the unknown true form. Alternatively, continuous predictors may be converted into categorical variables by grouping into two or more categories. With dichotomization, considerable variability may be subsumed within each group. The implicit model is unrealistic, since individuals close to but on opposite sides of the cutpoint have very similar rather than very different outcomes. The arbitrariness of the choice of cutpoint may encourage a search for a value which gives the most ‘satisfactory’ result. Taken to extremes, all possible cutpoints may be tried and the value which maximizes statistical significance may be chosen. Because of multiple testing, the overall Type I error rate will be around 40% rather than the nominal 5% (Altman et al., 1994, Miller and Siegmund, 1982, Lausen and Schumacher, 1996). The cutpoint chosen will have a wide confidence interval and will have no substantive meaning. Crucially, the difference in outcome between the two groups will be overestimated and the confidence interval will be too narrow.
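The inflation of the Type I error rate by a search over cutpoints can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken from the text: a continuous outcome that is independent of the predictor, 100 observations, a two-sample t statistic with pooled variance, and candidate cutpoints restricted to the central 80% of the ordered predictor values. The function names and defaults are hypothetical. The exact rejection rate depends on these choices, so the sketch indicates the direction and rough size of the problem rather than reproducing the figure quoted above.

```python
import math
import random


def cutpoint_t_statistics(x, y, lower=0.1, upper=0.9):
    """Return (max |t| over all candidate cutpoints, |t| at the median cutpoint).

    Observations are split into a 'low' and a 'high' group at every
    cutpoint that leaves between lower*n and upper*n observations in
    the low group; a pooled-variance two-sample t statistic is computed
    for each split.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ys = [y[i] for i in order]
    total = sum(ys)
    total_sq = sum(v * v for v in ys)
    best = 0.0
    at_median = 0.0
    s = 0.0   # running sum of y in the low group
    sq = 0.0  # running sum of y squared in the low group
    for k in range(1, n):  # k = size of the low group
        s += ys[k - 1]
        sq += ys[k - 1] ** 2
        if k < lower * n or k > upper * n:
            continue
        n1, n2 = k, n - k
        m1, m2 = s / n1, (total - s) / n2
        ss1 = sq - n1 * m1 * m1
        ss2 = (total_sq - sq) - n2 * m2 * m2
        pooled_var = (ss1 + ss2) / (n - 2)
        t = abs(m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
        best = max(best, t)
        if k == n // 2:
            at_median = t
    return best, at_median


def simulate(n_sim=500, n=100, crit=1.984, seed=1):
    """Estimate rejection rates under the null hypothesis of no effect.

    crit is the two-sided 5% critical value of the t distribution with
    98 degrees of freedom (assumed here for n = 100).
    """
    rng = random.Random(seed)
    hits_optimal = 0
    hits_median = 0
    for _ in range(n_sim):
        x = [rng.random() for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of x
        best, at_median = cutpoint_t_statistics(x, y)
        hits_optimal += best > crit
        hits_median += at_median > crit
    return hits_optimal / n_sim, hits_median / n_sim


rate_optimal, rate_median = simulate()
print("rejection rate, 'optimal' cutpoint:", rate_optimal)
print("rejection rate, prespecified median cutpoint:", rate_median)
```

With a prespecified cutpoint the rejection rate stays close to the nominal level, whereas choosing the cutpoint with the largest test statistic rejects the true null hypothesis far more often.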

An alternative approach is to keep the variable continuous and to allow some form of non-linearity. Instead of using quadratic or cubic polynomials, a general family of parametric models based on so-called fractional polynomial (FP) functions was proposed by Royston and Altman (1994). Here, usually one or two terms of the form X^p are fitted, the exponents p being chosen from a small predefined set S of integer and non-integer values. Although only a small number of functions is considered (besides no transformation (p=1), the set S includes 7 transformations for FPs of degree 1 (FP1) and 36 for FPs of degree 2 (FP2)), FP functions provide a rich class of possible functional forms, leading to a satisfactory fit to the data in many situations. Royston and Altman (1994) dealt mainly with the case of a single predictor, but they also suggested and illustrated an algorithm for fitting FPs in multivariable models. By combining backward elimination (BE) with the search for the most suitable FP transformation for continuous predictors, Sauerbrei and Royston (1999) proposed modifications to this multivariable FP (MFP) procedure. A further extension of the MFP procedure aims to reflect basic knowledge of the types of relationships to be expected between certain predictors and the outcome. Stability investigations with bootstrap resampling have shown that MFP can find stable models, despite the considerable flexibility of the family of FPs and the consequent risk of overfitting when several variables are considered (Royston and Sauerbrei, 2003).
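The counts of 7 and 36 candidate functions can be checked by enumeration. The sketch below assumes the power set S = {−2, −1, −0.5, 0, 0.5, 1, 2, 3} given by Royston and Altman (1994), which is not listed in the text above, together with their conventions that a power of 0 denotes log X and that a repeated power p yields the terms X^p and X^p log X. The function names are hypothetical, and the selection of the best FP1 power by the smallest residual sum of squares in a simple least-squares fit is shown only as an illustration for a normal-errors model; it is not the SAS, Stata or R implementation.

```python
import math
from itertools import combinations_with_replacement

# Assumed power set (Royston and Altman, 1994); not listed in the text.
S = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_terms(x, powers):
    """Return the FP basis terms for a single value x > 0.

    A power of 0 denotes log(x). A repeated power multiplies the
    previous term by log(x), e.g. (p, p) gives x**p and x**p * log(x).
    """
    terms = []
    previous_power = None
    for p in sorted(powers):
        if p == previous_power:
            term = terms[-1] * math.log(x)
        else:
            term = math.log(x) if p == 0 else x ** p
        terms.append(term)
        previous_power = p
    return terms


def rss_fp1(xs, ys, p):
    """Residual sum of squares of the least-squares fit y = b0 + b1 * x^(p)."""
    z = [fp_terms(x, (p,))[0] for x in xs]
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(ys) / n
    szz = sum((zi - z_mean) ** 2 for zi in z)
    szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, ys))
    b1 = szy / szz
    b0 = y_mean - b1 * z_mean
    return sum((yi - b0 - b1 * zi) ** 2 for zi, yi in zip(z, ys))


def best_fp1(xs, ys):
    """Return the power in S whose FP1 model has the smallest RSS."""
    return min(S, key=lambda p: rss_fp1(xs, ys, p))


fp1_models = [(p,) for p in S]
fp2_models = list(combinations_with_replacement(S, 2))

print("FP1 transformations besides p = 1:", len(fp1_models) - 1)
print("FP2 power combinations:", len(fp2_models))

# Artificial, noise-free data following a log relationship.
xs = [0.5 + 0.25 * i for i in range(40)]
ys = [2.0 + 3.0 * math.log(x) for x in xs]
print("selected FP1 power:", best_fp1(xs, ys))
```

The 36 FP2 models arise as 28 pairs of distinct powers plus 8 pairs of repeated powers. For the artificial data, the search should select the power 0, i.e. the log transformation.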

A program in Stata (Royston and Ambler, 1999) has been available for several years, and in Stata 8 MFP is now a standard procedure. Recently, we developed programs in SAS and R. With all the programs, modelling can be done for the linear regression model, the logistic regression model and the Cox model for censored survival times. In Stata, many additional types of models are available (see Section 6.1).

The aims of the paper are to describe the MFP algorithm, introduce the new SAS and R programs, show advantages of the approach and demonstrate in detail how it simultaneously selects variables and determines functional relationships for continuous predictors. For illustration, we will explain several steps in selecting the final model in a multiple regression analysis using the SAS macro. We will show that our approach finds a significant non-linear effect which would have been missed by assuming a linear relationship. In a second example, analysed in the framework of a Cox regression model for survival data, we will also illustrate the necessity of searching systematically for possible non-linear effects of continuous predictors. We will also present a new, simple way of presenting FP functions. Furthermore, we will briefly discuss how the SAS macro differs from our Stata and R programs.

Section snippets

Fractional polynomials

Suppose that we have an outcome variable, a single continuous covariate X, and a suitable regression model relating them. Our starting point is the straight line model, β1X (for easier notation we will suppress the constant term, β0). Often this will be an adequate description of the relationship, but other models must be investigated for possible improvements in fit. A simple extension of the straight line is a power transformation model, β1X^p. This model has often been used by practitioners

Multivariable fractional polynomials: the MFP algorithm

In many areas of application, several predictors or confounders must be handled simultaneously. The aim is to include in a final model only variables with influence on the outcome. For continuous variables the functional form must be determined. An approach to building such a model is described by Sauerbrei and Royston (1999), who illustrated its use in two examples to obtain a prognostic and a diagnostic model where several continuous and categorical predictors were considered. Backward

Linear regression model

To exemplify some issues in multiple regression analysis, Johnson (1996) described explorations of a dataset comprising a response variable (the estimated percentage of body fat) and 13 continuous covariates (AGE, WEIGHT, HEIGHT and 10 body circumference measurements) in 252 men. The aim was to predict percentage body fat from the covariates. The study was originally reported in the sports medicine literature by Penrose et al. (1985). The dataset is available at //lib.stat.cmu.edu/datasets/bodyfat

Presentation of FP functions

Although FP functions are mathematically simple, presenting the model in the usual way, through the estimated β's and transformed values of the covariate X, gives no impression of what is most relevant, namely the estimated function and its uncertainty at particular values of X. From a substantive point of view, the β's are not interpretable. The first step is to plot the fitted function against X. Plots shown in this paper are all from multivariable models and show the partial

Stata

The MFP algorithm is incorporated in Stata 8 as a standard part of the package and is therefore documented and supported by Stata Corp (2003). Earlier implementations were published in the Stata Technical Journal for versions 5, 6 and 7 of Stata.

Although the syntax of the mfp command in Stata is quite different from that in SAS, all of the features of the SAS mfp8 command are available in Stata. The facilities of mfp in Stata are a superset of those found in the SAS implementation. In summary,

Discussion

We have shown that the multivariable FP procedure described by Royston and Altman (1994) and extended by Sauerbrei and Royston (1999) to combine backward elimination with a systematic search for possible non-linear effects can be used to construct simple parametric regression models. The procedure overcomes the serious problem of arbitrary categorization and in many of the datasets we have examined, it has sufficient power to detect non-linearities which should be accommodated in the final

Acknowledgements

We would like to thank Gareth Ambler and Carina Ortseifen for developing the first versions of the R and SAS programs, respectively. Furthermore, we would like to thank Elena Pashko for her assistance in preparing the manuscript.

References (30)

  • B. Lausen et al.

    Evaluating the effect of optimized cut-off values in the assessment of prognostic factors

    Comput. Statist. Data Anal.

    (1996)
  • D.G. Altman et al.

    The dangers of using optimal cutpoints in the evaluation of prognostic factors

    J. Nat. Cancer Inst.

    (1994)
  • G. Ambler et al.

    Fractional polynomial model selection procedures: investigation of type I error rate

    J. Statist. Simulation Comput.

    (2001)
  • P. Austin et al.

    Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses

    Statist. Med.

    (2004)
  • H. Becher

    The concept of residual confounding in regression models and some applications

    Statist. Med.

    (1992)
  • L.G. Dales et al.

    An improper use of statistical significance testing in study covariables

    Internat. J. Epidemiol.

    (1978)
  • Holländer, N., Schumacher, M., 2005. Estimating the functional form of continuous covariates effect on survival time....
  • R.W. Johnson

    Fitting percentage of body fat to simple body measurements

    J. Statist. Education

    (1996)
  • N. Mantel

    Why stepdown procedures in variable selection

    Technometrics

    (1970)
  • R. Marcus et al.

    On closed test procedures with special reference to ordered analysis of variance

    Biometrika

    (1976)
  • Meier-Hirmer, C., Ortseifen, C., Sauerbrei, W., 2003. Multivariable fractional polynomials in SAS—an algorithm for...
  • R.M. Mickey et al.

    The impact of confounder selection criteria on effect estimation

    Amer. J. Epidemiol.

    (1989)
  • A.J. Miller

    Subset Selection in Regression

    (1990)
  • R. Miller et al.

    Maximally selected chi-square statistics

    Biometrics

    (1982)
  • K.W. Penrose et al.

    Generalized body composition prediction equation for men using simple measurement techniques

    Med. Sci. Sports Exercise

    (1985)