Diploma d'Estudis Avançats (Diploma of Advanced Studies) - Doctoral programme in Statistics, Data Analysis and Biostatistics. 2008. Supervisors: Josep Fortiana and Jordi Alonso
Objective: Health status measures usually have an asymmetric distribution and a high
percentage of respondents with the best possible score (ceiling effect), especially when they are
assessed in the general population. Several methods that take the ceiling effect into account have
been proposed to model this type of variable: tobit models, Censored Least Absolute
Deviations (CLAD) models and two-part models, among others. The objective of this work
was to describe the tobit model and compare it with the Ordinary Least Squares (OLS) model,
which ignores the ceiling effect.
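For reference, a tobit model with a ceiling is commonly written as below. The notation is an assumption introduced here for illustration and is not taken from the original work: c is the highest possible score, y* the latent (uncensored) outcome, and phi and Phi the standard normal density and distribution function.

```latex
% Latent linear model, observed only up to the ceiling c
y_i^{*} = x_i^{\top}\beta + \varepsilon_i, \qquad
\varepsilon_i \sim N(0, \sigma^{2}), \qquad
y_i = \min\left(y_i^{*},\, c\right)

% Log-likelihood: density term for scores below the ceiling,
% tail-probability term for scores at the ceiling
\ell(\beta, \sigma) =
  \sum_{i:\, y_i < c} \log\left[ \frac{1}{\sigma}\,
      \phi\!\left( \frac{y_i - x_i^{\top}\beta}{\sigma} \right) \right]
  + \sum_{i:\, y_i = c} \log\left[ 1 -
      \Phi\!\left( \frac{c - x_i^{\top}\beta}{\sigma} \right) \right]
```

OLS, by contrast, minimises the sum of squared residuals and treats every score equal to c as an exact observation.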
Methods: Two data sets were used to compare the two models: a) real data
from the European Study of Mental Disorders (ESEMeD), used to model the
EQ5D index, one of the most commonly used utility measures for the evaluation of health
status; and b) simulated data. Cross-validation was used to compare the
predicted values of the tobit and OLS models. The following measures of prediction error were
compared: the percentage of absolute error (R1), the percentage of squared error (R2), the Mean
Squared Error (MSE) and the Mean Absolute Prediction Error (MAPE). Simulated datasets were
created for different values of the error variance and different percentages of individuals at the
ceiling. The coefficient estimates, the percentage of explained variance and the
plots of residuals versus predicted values obtained under each model were compared.
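The comparison described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the original work: the function names, the single-covariate design, the simulation settings and the use of the expected observed score as the tobit prediction are all hypothetical choices. Only MSE and MAPE are computed, because the abstract does not give the definitions of R1 and R2.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def simulate(n, beta, sigma, ceiling, rng):
    """Draw a latent linear outcome and censor it from above at `ceiling`."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])  # intercept + one covariate (assumed design)
    y_latent = X @ beta + rng.normal(scale=sigma, size=n)
    return X, np.minimum(y_latent, ceiling)


def fit_ols(X, y):
    """Least-squares coefficients, ignoring the ceiling."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def fit_tobit(X, y, ceiling):
    """Maximum-likelihood tobit fit with right-censoring at `ceiling`."""
    at_ceiling = y >= ceiling

    def neg_loglik(params):
        beta, sigma = params[:-1], np.exp(params[-1])  # log-sigma keeps sigma > 0
        mu = X @ beta
        ll_obs = norm.logpdf(y[~at_ceiling], loc=mu[~at_ceiling], scale=sigma)
        ll_cens = norm.logsf(ceiling, loc=mu[at_ceiling], scale=sigma)
        return -(ll_obs.sum() + ll_cens.sum())

    start = np.append(fit_ols(X, y), np.log(y.std()))  # OLS as starting values
    res = minimize(neg_loglik, start, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])


def tobit_predict(X, beta, sigma, ceiling):
    """Expected observed score E[min(y*, ceiling) | x] under the fitted tobit."""
    mu = X @ beta
    z = (ceiling - mu) / sigma
    return ceiling * norm.sf(z) + mu * norm.cdf(z) - sigma * norm.pdf(z)


def cross_validate(X, y, ceiling, k, rng):
    """k-fold cross-validated MSE and mean absolute prediction error per model."""
    idx = rng.permutation(len(y))
    errors = {"ols": [], "tobit": []}
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        b_ols = fit_ols(X[train], y[train])
        b_tob, s_tob = fit_tobit(X[train], y[train], ceiling)
        errors["ols"].append(y[test] - X[test] @ b_ols)
        errors["tobit"].append(
            y[test] - tobit_predict(X[test], b_tob, s_tob, ceiling)
        )
    summary = {}
    for name, parts in errors.items():
        e = np.concatenate(parts)
        summary[name] = {"MSE": np.mean(e ** 2), "MAPE": np.mean(np.abs(e))}
    return summary


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative settings only; they are not the values used in the study.
    X, y = simulate(2000, np.array([0.0, 1.0]), 1.0, 0.5, rng)
    print("share at ceiling:", np.mean(y >= 0.5))
    print("OLS coefficients:", fit_ols(X, y))
    print("tobit coefficients, sigma:", fit_tobit(X, y, 0.5))
    print(cross_validate(X, y, 0.5, 5, rng))
```

Repeating the call to `simulate` over a grid of `sigma` and `ceiling` values would mimic the design of varying the error variance and the percentage of individuals at the ceiling.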
Results: In the ESEMeD data, the predicted values obtained with the
OLS model and those obtained with the tobit models were very similar. The regression
coefficients of the linear model were consistently smaller than those of the tobit model. In the
simulation study, when the error variance was small (s=1), the tobit model
gave unbiased coefficient estimates and accurate predicted values, especially when
the percentage of individuals with the highest possible score was small. However, when the
error variance was greater (s=10 or s=20), the percentage of explained variance for the tobit
model and its predicted values were more similar to those obtained with an OLS model.
Conclusions: The proportion of variability accounted for by the models and the percentage of
individuals with the highest possible score have an important effect on the performance of the
tobit model relative to the linear model.