Empirical calibration
Martijn J. Schuemie, Marc A. Suchard
In observational studies, there is always the possibility that an effect size estimate is biased. This can be true even for advanced, well-thought-out study designs, because of unmeasured or unmodeled confounding.
Negative controls (test hypotheses where the exposure is not believed to cause the outcome) can be used to detect the potential for bias in a study, and with enough negative controls we can start to estimate the systematic error distribution inherent in an observational analysis. We can then use this estimated distribution to compute a calibrated p-value, which reflects the probability of observing an effect size estimate
when the null hypothesis (of no effect) is true, taking both systematic and random error into account.
In this document we will use an example study to illustrate how this can be done using the EmpiricalCalibration R package.
In the example, we will try to answer the question whether
sertraline (an SSRI) causes GI bleeding. We use a Self-Controlled Case Series (SCCS) design, and have applied this to a large insurance claims database.
The results from this study are available in the package, and can be loaded using the data() command:
library(EmpiricalCalibration)
data(sccs)
drugOfInterest <- sccs[sccs$groundTruth == 1, ]
drugOfInterest
##     drugName groundTruth     logRr    seLogRr
## 1 Sertraline           1 0.7326235 0.07371708
Here we see that the effect size estimate for sertraline is 2.1 (the exponent of the log relative risk 0.7326), with a p-value that is so small R rounds it to 0.
Negative controls are drug-outcome pairs where we believe the drug does not cause (or prevent) the outcome.
In other words, we believe the true effect size to be a relative risk of 1. We would prefer our negative controls to have some resemblance to our hypothesis of interest (in our example sertraline - GI bleed), and we therefore typically pick negative controls where the outcome is the same (exposure controls) or the exposure is the same (outcome controls). In this example, we have opted for exposure controls, and have identified a set of drugs not believed to cause GI bleed. We have executed exactly the same analysis for these exposures, resulting in a set of effect size estimates, one per negative control:
data(sccs)
negatives <- sccs[sccs$groundTruth == 0, ]
head(negatives)
##            drugName groundTruth     logRr   seLogRr
##                               0 0.4339021 0.7617538
##                               0 0.6363184 0.1839892
##                               0 0.9297549 0.2979802
##                               0 1.6919273 0.5955222
##                               0 0.5261691 0.1967199
## 8  Prochlorperazine           0 0.8581890 0.1308460
Plot negative control effect sizes
We can start by creating a forest plot of our negative controls:
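For example, the package's plotForest function can produce this plot; the call below is a sketch, assuming the estimates, their standard errors, and the drug names are passed in that order:

# Forest plot of the negative control estimates with their 95%
# confidence intervals:
plotForest(negatives$logRr, negatives$seLogRr, negatives$drugName)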
(Figure: forest plot of the negative control estimates.)
Here we see that many negative controls have a confidence interval that does not include a relative risk of 1
(orange lines), certainly more than the expected 5%. This indicates the analysis has systematic error.
Empirical null distribution
Fitting the null distribution
We can use the negative controls to estimate the systematic error distribution. We assume the distribution is
a Gaussian distribution, which we have found to give good performance in the past.
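A sketch of this step, using the package's fitNull function on the negative control estimates:

# Fit a Gaussian systematic error distribution to the negative control
# estimates (on the log relative risk scale):
null <- fitNull(negatives$logRr, negatives$seLogRr)
null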
## Estimated null distribution
We see that the mean of our distribution is greater than 0, indicating the analysis is positively biased. We also
see the standard deviation is greater than 0.25, indicating there is considerable variability in the systematic error from one estimate to the next.
Evaluating the calibration
To evaluate whether our estimate of the systematic error distribution is a good one, we can test whether
the calibrated p-value is truly calibrated, meaning the fraction of negative controls with a p-value below alpha is approximately the same as alpha:
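The package's plotCalibration function produces this evaluation; the call below is a sketch:

# For a range of alpha values, plot the fraction of negative controls
# with calibrated and uncalibrated p-values below alpha:
plotCalibration(negatives$logRr, negatives$seLogRr)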
(Figure: calibration plot showing, for each alpha, the fraction of negative controls with p < alpha, for both calibrated and uncalibrated p-values.)
This method uses a leave-one-out design: for every negative control, the null distribution is fitted using all
other negative controls, and the calibrated p-value for that negative control is computed.
In the graph we see that the calibrated p-value is much closer to the diagonal than the uncalibrated p-value.
Plotting the null distribution
We can create a graphical representation of the null distribution, together with the negative controls used to
estimate that distribution:
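A sketch of this step, using the package's plotCalibrationEffect function with only the negative control estimates:

# Plot the negative controls against the fitted null distribution:
plotCalibrationEffect(negatives$logRr, negatives$seLogRr)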
In this graph, the blue dots represent the negative controls. Any estimates below the gray dashed lines will have a traditional p-value below .05. In contrast, only estimates that fall within the orange areas will have a calibrated p-value below .05.
Calibrating the p-value
We can now use the estimated null distribution to compute the calibrated p-value for our drug of interest:
p <- calibrateP(null, drugOfInterest$logRr, drugOfInterest$seLogRr)
p
In this case, the calibrated p-value is 0.84, meaning we have very little confidence we can reject the null hypothesis.
Plotting the null distribution
A visual representation of the calibration makes it clear why we are no longer certain we can reject the null hypothesis:
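The same plotCalibrationEffect sketch applies, now also passing the estimate for the drug of interest (assumed here to be accepted as a third and fourth argument):

# Add the drug of interest to the calibration plot:
plotCalibrationEffect(negatives$logRr, negatives$seLogRr,
                      drugOfInterest$logRr, drugOfInterest$seLogRr)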
In this plot we see that, even though the drug of interest (the yellow diamond) has a high relative risk, it is indistinguishable from our negative controls.
Computing the credible interval
Depending on how much information we have in terms of the number of negative controls, or the precision of those negative controls, we will be more or less certain about the parameters of the null distribution and therefore about the calibrated p-value. To estimate our uncertainty we can compute the 95% credible interval using Markov Chain Monte Carlo (MCMC). We can apply the fitMcmcNull function for this purpose:
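A minimal sketch of this step:

# Fit the null distribution using MCMC, so we also obtain credible
# intervals for its parameters:
null <- fitMcmcNull(negatives$logRr, negatives$seLogRr)
null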
## Estimated null distribution (using MCMC)
##
##           Estimate lower .95 upper .95
## Precision 12.59629
##
## Acceptance rate: 0.320767923207679
We see that there is uncertainty around the estimates of the mean and precision (= 1/SD^2), as expressed
in the 95% credible intervals. This uncertainty can be reduced by either increasing the number of negative controls, or by increasing the power for the existing controls (e.g. by waiting for more data to accumulate).
The acceptance rate of the MCMC seems reasonable (ideal values are typically between 0.2 and 0.6), but we
can investigate the trace just to be sure:
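The package's plotMcmcTrace function can draw the trace; a minimal sketch:

# Plot the trace of the MCMC samples for the mean and precision:
plotMcmcTrace(null)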
For both variables the trace should look like 'random noise', as is the case above. When we see auto-correlation, meaning that one value of the trace depends on the previous value of the trace, the MCMC might not be reliable and we should not trust the 95% credible interval.
We can use the new null object to compute the calibrated p-value as well as the 95% credible interval:
p <- calibrateP(null, drugOfInterest$logRr, drugOfInterest$seLogRr)
p
## 1 0.8289433 0.5494193 0.9898567
Note that there is uncertainty around the calibrated p-value as expressed in the 95% credible interval.
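Citation

Citation information for this package can be obtained within R:

citation("EmpiricalCalibration")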
## To cite EmpiricalCalibration in publications use:
##
##   Schuemie MJ, Ryan PB, DuMouchel W, Suchard MA and Madigan D (2014).
##   "Interpreting observational studies: why empirical calibration is
##   needed to correct p-values." _Statistics in Medicine_, *33*(2),
##   pp. 209-218. <URL:
##   http://onlinelibrary.wiley.com/doi/10.1002/sim.5925/abstract>.
##
## A BibTeX entry for LaTeX users is
##
##   @Article{,
##     author = {M. J. Schuemie and P. B. Ryan and W. DuMouchel and M. A. Suchard and D. Madigan},
##     title = {Interpreting observational studies: why empirical calibration is needed to correct p-values},
##     journal = {Statistics in Medicine},
##     volume = {33},
##     number = {2},
##     pages = {209-218},
##     year = {2014},
##   }
Source: https://cran.rstudio.com/web/packages/EmpiricalCalibration/vignettes/EmpiricalCalibrationVignette.pdf