A general coverage of quantiles, the ECDF (Empirical Cumulative Distribution Function) and confidence intervals (CIs) around quantiles
This is one of my “background for practitioners” blog posts, not one of my geeky ones. It will lead into another about two specific functions I’ve added to the CECPfuns R package. However, this one will cross-link to entries in the free online OMbook glossary.
Quantiles are important and very useful but seriously underused. They are important both because they help us describe distributions of data and because they can help us map between an individual’s data and collective data: one of the crucial themes in the mental health and therapy evidence base.
I recommend that you read my post about ECDFs before going further if you don’t feel familiar with quantiles: the ECDF (Empirical Cumulative Distribution Function) is a great introduction to quantiles.
Quick terminological point: quantiles are pretty much the same thing as percentiles and centiles, and deciles, the median and the lower and upper quartiles are all specific quantiles. I’ll come back to that.
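To make those names concrete straight away, here is a minimal base R sketch (a toy example of my own, not part of the pipeline used below):
set.seed(1)
x <- rnorm(500) # arbitrary toy sample
quantile(x, .5)  # the .5 quantile = the 50th percentile/centile = the median
quantile(x, .25) # the .25 quantile = the 25th centile = the lower quartile
quantile(x, seq(.1, .9, .1)) # the deciles: the .1, .2, ..., .9 quantiles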
Quantiles tell us about the location of observed values within a distribution of data. Let’s start with the Gaussian distribution (often called, a bit misleadingly, the “Normal”; note the capital letter: it’s not normal in any everyday sense of that word!). This faceted plot shows three simulated samples from the Gaussian distribution. The sample sizes are 100, 1,000 and 10,000.
set.seed(12345) # set seed to get the same data regardless of platform and occasion
# rnorm(10000) %>%
#   as_tibble() %>%
#   mutate(n = 10000) %>%
#   rename(score = value) -> tibGauss10k
#
# rnorm(1000) %>%
#   as_tibble() %>%
#   mutate(n = 1000) %>%
#   rename(score = value) -> tibGauss1k
#
# rnorm(100) %>%
#   as_tibble() %>%
#   mutate(n = 100) %>%
#   rename(score = value) -> tibGauss100
#
# bind_rows(tibGauss100,
#           tibGauss1k,
#           tibGauss10k) -> tibGaussAll
### much more tidyverse way of doing that
c(100, 1000, 10000) %>% # set your sample sizes
  as_tibble() %>%
  rename(n = value) %>%
  ### now you are going to generate samples per value of n so rowwise()
  rowwise() %>%
  mutate(score = list(rnorm(n))) %>%
  ungroup() %>% # ungroup() overrides the grouping from rowwise(), then unnest to get individual values
  unnest_longer(score) -> tibGauss
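A quick sanity check that the pipeline gave one sample of each intended size:
### count rows per value of n: should be 100, 1,000 and 10,000
tibGauss %>%
  count(n)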
### get sample statistics
tibGauss %>%
  group_by(n) %>%
  summarise(min = min(score),
            median = median(score),
            mean = mean(score),
            sd = sd(score),
            lquart = quantile(score, .25),
            uquart = quantile(score, .75),
            max = max(score),
            ### and bootstrap CI for the mean (could have used the parametric derivation as this is true Gaussian but I couldn't remember it!)
            CI = list(getBootCImean(score, verbose = FALSE))) %>%
  unnest_wider(CI) -> tibGaussStats
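As an aside, the parametric derivation alluded to in that comment is just the t-distribution interval for the mean; here is a minimal sketch of it for the same data (it should agree closely with the bootstrap CI for these Gaussian data, and matches what t.test() reports):
### parametric (t-based) 95% CI for the mean of each sample
tibGauss %>%
  group_by(n) %>%
  summarise(mean = mean(score),
            LCLmeanT = mean(score) - qt(.975, n() - 1) * sd(score) / sqrt(n()),
            UCLmeanT = mean(score) + qt(.975, n() - 1) * sd(score) / sqrt(n()))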
ggplot(data = tibGauss,
       aes(x = score)) +
  facet_wrap(facets = vars(n),
             nrow = 1) +
  geom_histogram(aes(y = after_stat(density))) +
  ylim(c(0, .6)) +
  ylab("Density") + # the y axis is density, not count, because of after_stat(density)
  ggtitle("Faceted histogram for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000")
Here are the same data as boxplots, overlaid with jittered points for the actual observations. This brings us to the simplest quantiles.
ggplot(data = tibGauss,
       aes(x = 1, y = score)) +
  facet_wrap(facets = vars(n),
             nrow = 1) +
  geom_boxplot(notch = TRUE,
               varwidth = TRUE,
               fill = "grey80") +
  geom_jitter(width = .35, height = 0,
              alpha = .05,
              colour = "grey40") +
  ### add the sample means in red, as promised in the subtitle
  geom_point(data = tibGaussStats,
             aes(x = 1, y = mean),
             colour = "red",
             size = 2) +
  ylab("Scores") +
  ggtitle("Faceted boxplot with jittered observations for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000, sample mean in red")
The boxplot uses three quantiles to describe the box: the median locates the line, the “waist”, across the box, and the quartiles fix the lower and upper limits of the box. What are these quantiles? The median is the score (not necessarily a value present in the data) that bisects the data into two equal-sized halves, so it is the value such that half the observed values lie below it and half lie above it. The lower quartile is the value such that a quarter of the observed values lie below it and three quarters above it; the upper quartile is the value such that three quarters of the sample lie below it and one quarter above it.
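A tiny toy illustration of those definitions (arbitrary made-up values; note that R's default quantile() method interpolates between observations):
x <- c(2, 4, 5, 7, 8, 10, 12, 15, 18, 20, 23) # eleven sorted values
median(x)        # 10: five values lie below it and five above it
quantile(x, .25) # 6: a quarter of the values lie below this (interpolated)
quantile(x, .75) # 16.5: three quarters of the values lie below this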
So we can now start to tie these names together.
Quantile | Quartile | Percentile (a.k.a. centile) |
---|---|---|
.25 | lower quartile (Q1) | 25% |
.50 | median (sometimes called the second quartile, Q2) | 50% |
.75 | upper quartile (Q3) | 75% |
Now we get to the ECDF which, like the histogram, violin plot and boxplot, is another way to describe a distribution of observed data. This takes us into the richness of quantiles.
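Before the plots, it is worth knowing that base R has ecdf(), which returns the ECDF of a vector as a function you can then evaluate at any score. A minimal sketch using the n = 1,000 sample:
### the ECDF as a function: proportion of the sample scoring below a value
F1000 <- tibGauss %>%
  filter(n == 1000) %>%
  pull(score) %>%
  ecdf()
F1000(0)          # should be close to .5 for Gaussian data centred on zero
F1000(qnorm(.75)) # should be close to .75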
tibGaussStats %>%
  select(n, min, lquart, median, uquart, max) %>%
  pivot_longer(cols = min:max) %>%
  rename(Quantile = name) %>%
  mutate(Quantile = ordered(Quantile,
                            levels = c("min", "lquart", "median", "uquart", "max"),
                            labels = c("Min", "Lower quartile", "Median", "Upper quartile", "Max"))) -> tibGaussStatsLong
ggplot(data = tibGauss,
       aes(x = score)) +
  facet_wrap(facets = vars(n),
             ncol = 1) +
  stat_ecdf() +
  geom_vline(data = tibGaussStatsLong,
             aes(xintercept = value, colour = Quantile)) +
  geom_text(data = tibGaussStatsLong,
            aes(label = round(value, 2),
                x = value,
                y = .28),
            nudge_x = -.04,
            hjust = 1) +
  ylab("Proportion of the sample scoring below the value") +
  ggtitle("Faceted ECDF plot for three samples from Gaussian distribution",
          subtitle = "Sample sizes 100, 1,000 and 10,000, quantiles shown as coloured lines with their values.")
I’ve changed to faceting by rows here instead of columns to give a better spread on the plot. The ECDF plots, on the y axis, the proportion of the sample scoring below each value on the x axis.
A few pretty obvious comments on the impact of sample size, assuming the classical model of random sampling from an infinitely large population: these impacts are visible in all those distribution plots above and in this table comparing the observed quantiles with their population values.
tibGaussStats %>%
  select(n, lquart, median, uquart) %>%
  pivot_longer(cols = -n) %>%
  mutate(value = round(value, 4),
         proportion = case_when(
           name == "lquart" ~ .25,
           name == "median" ~ .5,
           name == "uquart" ~ .75),
         popVal = qnorm(proportion),
         popVal = round(popVal, 4)) %>%
  pivot_wider(names_from = n, names_prefix = "n", values_from = value) %>%
  flextable() %>%
  autofit()
name | proportion | popVal | n100 | n1000 | n10000 |
---|---|---|---|---|---|
lquart | 0.25 | -0.6745 | -0.5901 | -0.6359 | -0.6652 |
median | 0.50 | 0.0000 | 0.4837 | 0.0080 | -0.0019 |
uquart | 0.75 | 0.6745 | 0.9004 | 0.6453 | 0.6576 |
It can be seen there that the observed values for the quantiles get closer to the population values the larger the sample.
So, as we can see above, any observed quantile value, like any sample statistic, will take a different value in the next sample under any real sampling process, whether truly random (something I suspect only happens in simulations) or not. That means that, like any sample statistic, an observed quantile value can be given a confidence interval around it, and this CI will be narrower the larger the sample size.
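To see that sampling variation directly, here is a minimal simulation: 1,000 fresh samples of n = 100 from the same standard Gaussian population, keeping the observed median of each.
### how much do observed medians bounce around the population value of zero?
set.seed(12345)
sampleMedians <- replicate(1000, median(rnorm(100)))
quantile(sampleMedians, c(.025, .975)) # central 95% of the 1,000 observed medians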
This brings us to the fact that there are various ways of computing this confidence interval. The R package quantileCI by Michael Höhle (https://github.com/hoehleatsu/quantileCI) gives three methods, including a bootstrap method. They’re all non-parametric, i.e. they make no assumptions about the shape of the distribution of the scores for which you computed the quantiles. From a bit of reading led by that package it seems to me that the Nyblom method is probably best (and the differences between the methods are unlikely to cause us any headaches with typical MH/therapy score data). As ever, the confidence interval gives a range around the observed value of a sample statistic that should include the population value in a given proportion of samples; the proportion usually used is 95%, i.e. a 95% confidence interval.
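To show the general idea behind the bootstrap option mentioned above, here is a hand-rolled percentile bootstrap CI for a quantile (my own simplified sketch, not the quantileCI implementation):
### percentile bootstrap CI for a quantile (simplified sketch)
bootQuantileCI <- function(x, p = .5, B = 2000, conf = .95) {
  ### quantile of each of B resamples drawn with replacement
  bootQuantiles <- replicate(B, quantile(sample(x, replace = TRUE), p, names = FALSE))
  alpha <- 1 - conf
  ### the CI is the central (conf * 100)% of the bootstrap quantiles
  quantile(bootQuantiles, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}
set.seed(12345)
bootQuantileCI(rnorm(1000), p = .25) # should bracket qnorm(.25), i.e. -0.674
With that sketched, here are the Nyblom CIs for our sample data.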
tibGauss %>%
  group_by(n) %>%
  summarise(lquartCI = list(quantileCI::quantile_confint_nyblom(score, .25)),
            medianCI = list(quantileCI::quantile_confint_nyblom(score, .5)),
            uquartCI = list(quantileCI::quantile_confint_nyblom(score, .75))) %>%
  unnest_wider(lquartCI, names_sep = ":") %>%
  unnest_wider(medianCI, names_sep = ":") %>%
  unnest_wider(uquartCI, names_sep = ":") %>%
  rename(lquartLCL = `lquartCI:1`,
         lquartUCL = `lquartCI:2`,
         medianLCL = `medianCI:1`,
         medianUCL = `medianCI:2`,
         uquartLCL = `uquartCI:1`,
         uquartUCL = `uquartCI:2`) %>%
  mutate(lquartCI = paste0(round(lquartLCL, 2), " to ", round(lquartUCL, 2)),
         medianCI = paste0(round(medianLCL, 2), " to ", round(medianUCL, 2)),
         uquartCI = paste0(round(uquartLCL, 2), " to ", round(uquartUCL, 2))) %>%
  left_join(tibGaussStats, by = "n") %>%
  mutate(lquart = round(lquart, 2),
         median = round(median, 2),
         uquart = round(uquart, 2)) %>%
  select(-c(min, mean, sd, max:UCLmean)) -> tmpTib
tmpTib %>%
  select(n, lquart, lquartCI, median, medianCI, uquart, uquartCI) %>%
  flextable() %>%
  autofit()
n | lquart | lquartCI | median | medianCI | uquart | uquartCI |
---|---|---|---|---|---|---|
100 | -0.59 | -0.88 to -0.31 | 0.48 | -0.01 to 0.61 | 0.90 | 0.71 to 1.44 |
1,000 | -0.64 | -0.72 to -0.55 | 0.01 | -0.06 to 0.08 | 0.65 | 0.58 to 0.74 |
10,000 | -0.67 | -0.69 to -0.64 | 0.00 | -0.03 to 0.02 | 0.66 | 0.63 to 0.69 |
That shows the observed values of the quartiles and the median for the three samples. It’s very clear that the intervals get narrower as the sample size increases.
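For the curious: nonparametric quantile CIs like these rest on order statistics. The number of observations falling below the population quantile follows a Binomial(n, p) distribution, so the interval limits are particular sorted observations. Here is a simplified sketch of that logic (the Nyblom method refines the coverage, so these ranks will not exactly reproduce quantile_confint_nyblom()):
### order-statistic CI for a quantile (simplified sketch, not the Nyblom method)
orderStatQuantileCI <- function(x, p = .5, conf = .95) {
  n <- length(x)
  alpha <- 1 - conf
  lo <- max(qbinom(alpha / 2, n, p), 1)         # rank giving the lower limit
  hi <- min(qbinom(1 - alpha / 2, n, p) + 1, n) # rank giving the upper limit
  sort(x)[c(lo, hi)]
}
set.seed(12345)
orderStatQuantileCI(rnorm(100), p = .5)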
I’ve added the function plotQuantileCIsfromDat to the CECPfuns R package. It creates the plots below, which I like: they show any requested quantiles, here the quartiles and median, with the ECDF from the data, and plot the confidence intervals for those quantiles. Here are the plots for those three quantiles for the three simulated samples we’ve been using so far.
tibGauss %>%
  filter(n == 100) %>%
  pull(score) -> tmpVec
plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot = FALSE, returnPlot = TRUE) -> tmpPlot100
tibGauss %>%
  filter(n == 1000) %>%
  pull(score) -> tmpVec
plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot = FALSE, returnPlot = TRUE) -> tmpPlot1000
tibGauss %>%
  filter(n == 10000) %>%
  pull(score) -> tmpVec
plotQuantileCIsfromDat(tmpVec, vecQuantiles = c(.25, .5, .75), addAnnotation = FALSE, printPlot = FALSE, returnPlot = TRUE) -> tmpPlot10000
library(patchwork)
### standardise the x axis ranges so the three plots are comparable
tmpPlot100 +
  xlim(c(-4, 4)) -> tmpPlot100
tmpPlot1000 +
  xlim(c(-4, 4)) -> tmpPlot1000
tmpPlot10000 +
  xlim(c(-4, 4)) -> tmpPlot10000
### stack the three plots vertically using patchwork's "/" operator
tmpPlot100 /
  tmpPlot1000 /
  tmpPlot10000
Unsurprisingly, those plots show that the three quantiles are well separated, with non-overlapping confidence intervals even for n = 100, and they show how the confidence intervals tighten as n increases.
I hope this was a fairly clear introduction to putting confidence intervals around observed quantiles.