Evaluating the equivalence of different formulations of the scaled Brier score

Background

The Brier Score is a composite measure of discrimination and calibration for a prediction model. The Brier Score is defined as

\[ BS = \frac{1}{N} \sum (y_i - \hat y_i)^2, \]

where \(N\) is the number of observations, \(y_i\) is the observed outcome, either 0 or 1, and \(\hat y_i\) is the predicted probability for the \(i\)th observation. Let’s create an R function to calculate the Brier Score:

brier_score <- function(obs, pred) {
  mean((obs - pred)^2)
}

The scaled Brier Score accounts for the event rate and provides an immediate comparison to an uninformative model that is equivalent to “just guess the event rate.” An intuitive way to define the scaled Brier Score (also called the “Brier skill score”) is

\[ BS_{scaled} = 1 - \frac{BS}{BS_{max}}, \]

where \(BS_{max} = \frac{1}{N} \sum (y_i - \bar y_i)^2\) and \(\bar y_i\) is the event rate among the observed outcome.

My confusion

This formulation of the scaled Brier Score makes intuitive sense to me and is how I go about calculating it in practice. However, two other distinct formulations have been proposed for calculating \(BS_{max}\) that — at least to the limits of my algebraic skills – differ. Thus, here I proposed a numeric investigation of these different definitions to see if they are indeed equivalent.

Definition 1

This is the intuitive definition to which I am accustomed, and is made explicit here: https://www.ncbi.nlm.nih.gov/pubmed/29713202

\[ BS_{scaled} = 1 - \frac{\frac{1}{N} \sum (y_i - \hat y_i)^2}{\frac{1}{N} \sum (y_i - \bar y_i)^2}. \]

Let’s create an R function to calculate this value.

scaled_brier_score_1 <- function(obs, pred) {
  1 - (brier_score(obs, pred) / brier_score(obs, mean(obs)))
}

Definition 2

A second formulation of the scaled Brier Score is defined with a slightly different definition of \(BS_{max}\), which is in this case described in https://www.ncbi.nlm.nih.gov/pubmed/20010215

\[ BS_{max} = \bar p \times (1 - \bar p). \]

Let’s create an R function to calculate this measure.

scaled_brier_score_2 <- function(obs, pred) {
  1 - (brier_score(obs, pred) / (mean(obs) * (1 - mean(obs))))
}

Definition 3

A third formulation of the scaled Brier Score is defined with a slightly different definition of \(BS_{max}\), which is in this case described in https://www.ncbi.nlm.nih.gov/pubmed/22961910

\[ BS_{max} = \bar p \times (1 - \bar p)^2 + (1 - \bar p) \times \bar p^2. \]

Let’s create an R function to calculate this measure.

scaled_brier_score_3 <- function(obs, pred) {
  1 - (brier_score(obs, pred) / (mean(obs) * (1 - mean(obs))^2 + (1 - mean(obs)) * mean(obs)^2))
}

Build a model

Let’s build a sample model based on the UCI abalone data (https://archive.ics.uci.edu/ml/datasets/Abalone).

# get data
df <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data')
names(df) <- c('sex', 'length', 'diameter', 'height', 'weight_whole', 
               'weight_shucked', 'weight_viscera', 'weight_shell', 'rings')
# inspect data
knitr::kable(head(df))
sex length diameter height weight_whole weight_shucked weight_viscera weight_shell rings
M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.330 20
# Let's predict whether or not an abalone will have > 10 rings
m1 <- glm(I(rings > 10) ~ ., data = df, family = binomial)
preds_m1 <- predict(m1, type = 'response')

# And another model with severe class imablance
m2 <- glm(I(rings > 3) ~ ., data = df, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
preds_m2 <- predict(m2, type = 'response')

Score the model

# ---------- Model 1
# Calculate the Brier Score
brier_score(df$rings > 10, preds_m1)
## [1] 0.1479862
# Calculate each type of scaled Brier Score
scaled_brier_score_1(df$rings > 10, preds_m1)
## [1] 0.3462507
scaled_brier_score_2(df$rings > 10, preds_m1)
## [1] 0.3462507
scaled_brier_score_3(df$rings > 10, preds_m1)
## [1] 0.3462507
# ---------- Model 2
# Calculate the Brier Score
brier_score(df$rings > 3, preds_m2)
## [1] 0.002690905
# Calculate each type of scaled Brier Score
scaled_brier_score_1(df$rings > 3, preds_m2)
## [1] 0.3362851
scaled_brier_score_2(df$rings > 3, preds_m2)
## [1] 0.3362851
scaled_brier_score_3(df$rings > 3, preds_m2)
## [1] 0.3362851
Assistant Professor of Medicine and Informatics

I am a pulmonary and critical care physician at the University of Pennsylvania Perelman School of Medicine, and core faculty member at the Palliative and Advanced Illness Research (PAIR) Center. My work seeks to translate innovations in artificial intelligence methods into bedside clinical decision support systems that alleviate uncertainty in critical clinical decisions. My research interests include clinical informatics, artificial intelligence, natural language processing, machine learning, population health, and pragmatic trials.