This is a preliminary experiment in putting my lectures up on the Web.
For the more visual lecturing I do, i.e. the more stats/research oriented,
I tend to create a series of overheads and a handout using Microsoft Powerpoint.
Unfortunately it appears to be rather difficult to get good quality conversion
of that format to HTML. You can get the text across via RTF and the Microsoft
Internet Assistant for Word for Windows 6.0, but graphics can only be saved a
slide at a time (and the whole slide, including headings) to .WMF format and
then converted to .GIF using (in my case) HiJaak Pro. The results are awful.
However, here, for those who may want to amuse themselves, or perhaps even
teach themselves some statistics, is the material from the two lectures I
give on the biennial Guildford revision course for psychiatrists about to sit
their Part II M.R.C.Psych. exams. I'd be amused to receive feedback to: me
[if your HTML browser doesn't support "mailto:" then
use a mail package and send to: C.Evans@sghms.ac.uk]
I'll work on Microsoft and the HiJaak people to see if I can improve on the
quality of the graphics and will be looking into the feasibility of putting
things up using the Adobe Acrobat .PDF format instead of HTML in the near future.
Epidemiology, stats. & research methods for the M.R.C.Psych.
- Differential diagnoses:
- Phobic anxiety
- Overvalued ideas
- Frank delusional ideation
- Treatment
- Exposure and response prevention
Relationship between sample & population
Sadly this file has converted very badly to GIF format. It shows the relationships between observations (the single woman on the left!), samples (the gaggle in the middle) made up of observations and, at least in theory,
drawn at random from a population (e.g. British adults!)
Sample parameters and population parameters
- Samples give you data
- Descriptive sample statistics summarise those data, e.g.:
- rate (for binary variables)
- mean
- median, quartiles & centiles
- standard deviation & variance
- Two groups of statistics:
- "central tendency" or "location"
- scatter about that central tendency
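For anyone who wants to play with these, here is a minimal Python sketch (the scores are invented for illustration, not from the lectures) computing the common measures of location and scatter:

    import numpy as np

    # Invented example: HRSD-style depression scores from a small sample
    scores = np.array([12, 15, 9, 22, 18, 14, 17, 11, 20, 16])

    # "Central tendency" or "location"
    print("mean      =", scores.mean())
    print("median    =", np.median(scores))
    print("quartiles =", np.percentile(scores, [25, 75]))

    # Scatter about that central tendency
    print("variance  =", scores.var(ddof=1))   # sample variance
    print("s.d.      =", scores.std(ddof=1))   # sample standard deviation

    # Rate for a binary variable (invented example: proportion scoring >= 15)
    print("rate      =", (scores >= 15).mean())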
Sampling and confidence intervals
- you have a sample parameter (typically a mean or a rate)
- this is a "best guess" estimate of the population
parameter
- sample size determines how precisely you estimate the "true"
value for the population
- larger sample => greater precision
- confidence intervals indicate precision
N.B. the difference between a C.I. and the variance or s.d.:
the s.d. is a description of the scatter of observations in
the sample, whereas the C.I. is an estimate of the location
of a parameter in the population from which the sample
was taken
n = 100     mean = 15.5   s.d. = 5.1   95% C.I. = 14.5 to 16.5
n = 1000    mean = 15.5   s.d. = 5.1   95% C.I. = 15.2 to 15.8
n = 10000   mean = 15.5   s.d. = 5.1   95% C.I. = 15.4 to 15.6
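Those intervals follow from the usual large-sample formula, mean ± 1.96 × s.d./√n. A short Python sketch (using the mean and s.d. from the table above) that reproduces the narrowing:

    import numpy as np

    mean, sd = 15.5, 5.1  # figures from the table above

    for n in (100, 1000, 10000):
        half_width = 1.96 * sd / np.sqrt(n)   # 1.96 times the standard error
        print(f"n={n:6d}  95% C.I. = {mean - half_width:.1f} to {mean + half_width:.1f}")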
Confidence intervals continued
- 95% confidence interval will span the "true" population
value of the parameter in 95% of experiments provided ...
- ... that the assumptions underlying the calculation are correct
- not possible to say for any 95% C.I. that it contains
the "true" population value
95% C.I. for sample rate of .1
n 95% C.I.
10 .0025 to .45
20 .012 to .32
40 .028 to .24
50 .033 to .22
70 .041 to .20
100 .049 to .18
200 .062 to .15
500 .075 to .13
1,000 .082 to .12
5,000 .092 to .11
10,000 .094 to .11
100,000 .098 to .102
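The slide doesn't state how these intervals were computed; an exact (Clopper-Pearson) binomial interval gives very similar figures, and the Python sketch below shows that calculation via the beta-quantile form:

    from scipy.stats import beta

    rate = 0.1
    for n in (10, 20, 40, 50, 70, 100, 200, 500, 1000, 5000, 10000, 100000):
        k = round(rate * n)                    # number of "positives" observed
        lower = beta.ppf(0.025, k, n - k + 1) if k > 0 else 0.0
        upper = beta.ppf(0.975, k + 1, n - k) if k < n else 1.0
        print(f"n={n:7d}  95% C.I. = {lower:.4f} to {upper:.4f}")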
95% C.I. for sample rate of .1
This should show the last set of figures as a plot. The axes have been completely lost in translation but I hope it still gives
a flavour of the narrowing of the 95% confidence interval as the sample size increases (from Left to Right).
95% C.I. for a mean - simulation of a classical psychopharmacology
study ("experiment")
This converted even less satisfactorily than the last two graphics. It shows how the pool of depressed subjects is split into two samples
(by randomisation), their depression scores are noted, and then, double-blind, one group gets the active compound and the other
placebo for a suitable period, after which their depression scores are noted again. The dependent variable to be analysed
will be the difference between the two scores, so a negative value will indicate success: a reduction in depression.
Simulation of a two group comparison study
- measure of interest is four week drop in HRSD rating
- equal sized groups (not paired)
- independent observations
- advantage to active compound: -5
- both groups equal s.d. 7
- 200 such studies simulated for ...
n = 10
This has converted so badly it seems scarcely worth the candle. It is supposed to show
two samples, each of n=10: one a group of drops representing the
active compound, the other the placebo, with the samples generated on the
assumptions shown in the definition above. This particular
run of the simulation gives rise to the 95% confidence intervals below.
- Active compound: -26.0 to -18.0
- Placebo: -15.0 to -7.0
- Difference: -16.6 to -6.0
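Here is a Python sketch of one run of such a simulation, using the assumptions above (advantage of -5 to the active compound, s.d. 7 in both groups, independent observations); the placebo "true" mean drop of -10 is an invented illustration value, and the t-based intervals are my choice rather than necessarily the method used for the slides:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Assumptions from the slide: advantage of -5 to the active compound,
    # s.d. 7 in both groups, equal-sized groups, independent observations.
    # The placebo "true" mean drop of -10 is an invented illustration value.
    n = 10
    placebo = rng.normal(-10, 7, n)
    active = rng.normal(-10 - 5, 7, n)

    def ci_mean(x, level=0.95):
        """t-based 95% C.I. for a single group mean."""
        se = x.std(ddof=1) / np.sqrt(len(x))
        t = stats.t.ppf(0.5 + level / 2, len(x) - 1)
        return x.mean() - t * se, x.mean() + t * se

    print("Active compound:", ci_mean(active))
    print("Placebo:        ", ci_mean(placebo))

    # 95% C.I. for the difference between the (independent) group means,
    # assuming equal population variances (pooled s.d.)
    pooled_var = ((n - 1) * active.var(ddof=1) + (n - 1) * placebo.var(ddof=1)) / (2 * n - 2)
    se_diff = np.sqrt(pooled_var * (1 / n + 1 / n))
    t = stats.t.ppf(0.975, 2 * n - 2)
    diff = active.mean() - placebo.mean()
    print("Difference:     ", (diff - t * se_diff, diff + t * se_diff))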
30 simulations
n = 10
This shows the confidence intervals for a number of runs of the simulation.
n = 100
Another terrible conversion. It shows the two samples now with n=100
but still with the same population model. This run of the simulation
gives rise to the 95% confidence intervals below.
- Active compound: -20.0 to -17.3
- Placebo: -14.3 to -11.8
- Difference: -7.5 to -3.8
30 simulations
n = 100
Confidence intervals and their assumptions
Confidence intervals are calculated on a model of the study
with assumptions
- for rates these assumptions are:
- random sampling from large (infinite) population(s)
- independent observations
- constant probabilities within each group
- for means these assumptions are (for pt.II):
- random sampling from large (infinite) population(s)
- independent observations
- Normal or Gaussian distributions
- equal population variance within each group
(for C.I. of difference between independent groups)
Erotomania & the erotic, delusional transference
- delusions are falsely derived, unshakeable beliefs
- the content of an erotic transference (and erotomania) is
apparently libidinal, sexual love
- ... but is often the cover for envy and hatred
- psychiatry has just such a relationship with statistical tests
and the p value
- ... not that they're not great in a reality oriented relationship!
Inferential statistics "hypothesis testing", "tests"
- this branch of statistics is:
- clear
- rigorous
- objective
- potentially enlightening
- dichotomous
- it is not:
- a means of transmuting uncertainty to certainty
- immune to abuse
- complete and balanced (as it's usually (ab)used)
Assumptions underpinning inferential statistics
Test the probability of getting results as interesting as
you did given:
- "null" hypothesis
- random sampling from large (infinite) population
- independent observations
- other assumptions if dealing with means not rates
If this is less than a certain level (chosen in advance) ...
... the result is declared "significant"
- the pre-chosen level is the alpha or "type I error rate"
... i.e. the probability of deciding that an effect is "significant"
given that the null hypothesis is true for the population
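A quick simulation illustrates the point: with the null hypothesis true (both groups drawn from the same Normal population; the population values here are invented), roughly 5% of unpaired t-tests come out "significant" at alpha = .05:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    alpha, n, n_studies = 0.05, 30, 10000

    significant = 0
    for _ in range(n_studies):
        # Null hypothesis true: both groups drawn from the same population
        a = rng.normal(0, 7, n)
        b = rng.normal(0, 7, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            significant += 1

    print("proportion declared significant:", significant / n_studies)  # ~ 0.05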
The meaning of "significant" or "p < .05"
The probability of a result as marked as this was lower than
1 in 20 if the null hypothesis (and other assumptions) were true
- this doesn't mean:
- it's clinically interesting
- it's managerially interesting
- it's theoretically or philosophically interesting
- you can say anything for certain about your next patient
- however it is:
- a completely logical way to decide significance in terms of
likelihoods
- and other ways of doing this are even more complex
and contentious!
Null hypotheses
- Remember "statisticians do it backwards
(but beautifully)"
- therefore the null hypothesis is (depending on the test) that:
- the population mean is zero
- the population change is zero
- the difference between the population means is zero
- there is no association (no correlation) in the population
- it is a model of the population not the sample
- it's part of a set of assumptions forming a model that can
be tested against the sample data
So what's an "alternative hypothesis"?
- The alternative hypothesis is the logical complement to the
null hypothesis
- ... that there is
- a population mean other than zero
- population change other than zero
- a difference between population means other than zero
- an association (correlation) in the population that is other
than zero
- whereas there is only one quantitative null hypothesis ...
- ... there are many possible quantitative alternative hypotheses
contained within the one logical alternative hypothesis
Alternative hypotheses, type II error rates and statistical
power
- There is one type I error rate for a study
- but there are an infinite no. of possible type II error rates
- ... the risk of deciding "non-significant"
- ... when the null hypothesis was not true
- ... for each of the infinite number of true population effects
Given a small study and/or a weak "true" population effect,
the type II error rate is often very high (.9 for many
published "NS" results)
- i.e. a non-significant result does not prove there is no population
effect
Statistical power
Statistical power is the probability
- given a certain "true" population effect
- and all the other assumptions of the test
- that you would have found a significant result
- given your sample size(s)
Statistical power is:
- (1 - type II error rate)
- or (1 - beta)
As statistical power has generally been neglected, confidence
intervals are replacing p values as they show how large
an effect you could be missing
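Power can be estimated by simulation under the same model as the psychopharmacology simulation earlier (true advantage -5, s.d. 7 in each group); with those figures the sketch below gives a power of roughly a third for n = 10 per group but close to certainty for n = 100:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    advantage, sd, alpha, n_studies = -5, 7, 0.05, 5000

    for n in (10, 100):
        significant = 0
        for _ in range(n_studies):
            placebo = rng.normal(0, sd, n)
            active = rng.normal(advantage, sd, n)
            if stats.ttest_ind(active, placebo).pvalue < alpha:
                significant += 1
        # power = 1 - beta: the chance of declaring "significant" given this true effect
        print(f"n = {n:3d} per group: estimated power = {significant / n_studies:.2f}")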
Confidence intervals and "significance" testing
Historically two very different approaches to the problem and
there was much acrimonious argument between protagonists of the
two approaches
- but generally mathematically complementary
- as a 95% confidence interval that doesn't embrace a zero effect
is equivalent to a significant effect
- ... when the assumptions are the same (true for most tests
on means)
Research frames likely in the exam
- One study group (and one measure)
- One study group (and two measures)
- how much associated/correlated?
- Differences between groups (one measure)
- The remainder: "multivariate statistics"
A linked set of questions
"How many psychiatric registrars are depressed?"
- Prevalence
- always think of as a ratio: number affected / number at risk
- point prevalence
- period prevalence
- Incidence
- always a ratio but now of onsets: number of new cases (or relapses) / number at risk
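A trivial Python sketch of the two ratios (the figures are invented for illustration, not real survey data):

    # Invented illustration figures, not real survey data
    n_at_risk = 250           # psychiatric registrars surveyed
    n_affected_today = 30     # depressed at the time of the survey
    n_new_cases_in_year = 20  # new onsets (or relapses) over one year

    point_prevalence = n_affected_today / n_at_risk     # number affected / number at risk
    annual_incidence = n_new_cases_in_year / n_at_risk  # number of new cases / number at risk

    print("point prevalence =", point_prevalence)
    print("annual incidence =", annual_incidence)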
Quality of measurement - 1 scaling
- nominal scaling
- ordinal scaling
- interval scaling
- ratio scaling
(N.B. Binary measurements are not easy to classify)
Quality of measurement - 2
Qualities of a binary measure of caseness
- qualities that are independent of prevalence
- sensitivity
- specificity
- more clinically useful qualities that are dependent on prevalence
- positive predictive value
- negative predictive value
- indications of lack of contamination
The four screening numbers: a,b,c,d
"True" status | | Score on test | |
| -ve | +ve | total |
non-case | a | b | a+b |
case | c | d | c+d |
total | a+c | b+d | a+b+c+d = n
|
a = true negatives b = false positives
c = false negatives d = true positives
sensitivity = d/(c+d) specificity = a/(a+b)
PPV = d/(b+d) NPV = a/(a+c)
PPV = "Positive predictive value", probability a positive test result will be a case
NPV = "Negative predictive value", probability a negative test result will be a non-case
Reliability
Evidence that the measure is not contaminated by random variation
- stability across time (if the latent variable is supposed
to be stable)
- "test-retest reliability"
- stability across observers
- "inter-rater reliability"
- stability across subunits of the measure
- "internal reliability"
Validity
- Evidence of criterion related validity
- concurrent validity
- predictive validity
- Evidence of content validity
- assumes you have definite agreement about content
- includes "face validity"
- Evidence of construct validity
- is it located within a theoretical model ...
- ... aspects of which can be measured ...
- ... and do the measurements fit the theoretical model?
The "multiple tests" issue
- If the null hypothesis is true and you do:
- one test: type I error rate is alpha = .05
- two tests: rate is 1 - (1 - alpha)^2 = .098
- three tests: rate is 1 - (1 - alpha)^3 = .143
- five tests: rate is 1 - (1 - alpha)^5 = .226
- ten tests: rate is 1 - (1 - alpha)^10 = .401
- twenty tests: rate is 1 - (1 - alpha)^20 = .642
- thirty tests: rate is 1 - (1 - alpha)^30 = .785
- fifty tests: rate is 1 - (1 - alpha)^50 = .923
- ... so just don't tell the reader about the non-significant
tests you conducted!
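Those figures follow from the familywise error formula 1 - (1 - alpha)^k for k independent tests; a one-line check in Python:

    alpha = 0.05
    for k in (1, 2, 3, 5, 10, 20, 30, 50):
        # probability of at least one "significant" result in k independent tests
        # when the null hypothesis is true for all of them
        print(f"{k:2d} tests: familywise type I error rate = {1 - (1 - alpha) ** k:.3f}")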
Distributional assumptions
For M.R.C.Psych. purposes there are two different classes
of statistical test
- those involving distributional assumptions, a.k.a. "parametric"
tests and based on Normal or Gaussian distributions
- those not involving distributional assumptions, a.k.a. "nonparametric",
applying automatically to counts & rates, or based on converting
scores to ranks
Parametric vs. non-parametric
- Parametric tests give better statistical power than non-parametric
for similar sample sizes
- but non-parametric tests are not dependent on distributional assumptions
- choice between the two best taken after consultation with
expert statistician
Matching "tests" to problems
- If it's a count, apply Chi squared test
- paired tests:
- if it's roughly Gaussian, apply paired t-test
- if not, Wilcoxon test
- unpaired tests of difference between two groups
- Gaussian and similar variances in each group (or similar sample
sizes) - unpaired t-test (or ANOVA)
- if not, Mann-Whitney test
- tests of differences between >2 groups
- Gaussian ... - ANOVA
- if not, Kruskal-Wallis test
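As a sketch of how that decision list maps onto standard software, the Python fragment below uses real scipy function names but invented illustration data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    # Invented illustration data: pre/post scores for one group, and three independent groups
    pre, post = rng.normal(20, 5, 15), rng.normal(16, 5, 15)
    group_a, group_b, group_c = rng.normal(20, 5, 15), rng.normal(16, 5, 15), rng.normal(18, 5, 15)

    # Counts: Chi squared test on a 2x2 table of invented frequencies
    chi2, p, dof, expected = stats.chi2_contingency([[30, 10], [20, 25]])
    print("chi squared: p =", p)

    # Paired: t-test if roughly Gaussian, Wilcoxon if not
    # Unpaired, two groups: t-test (Gaussian, similar variances), Mann-Whitney if not
    # More than two groups: ANOVA if Gaussian, Kruskal-Wallis if not
    results = [("paired t", stats.ttest_rel(pre, post)),
               ("Wilcoxon", stats.wilcoxon(pre, post)),
               ("unpaired t", stats.ttest_ind(group_a, group_b)),
               ("Mann-Whitney", stats.mannwhitneyu(group_a, group_b)),
               ("ANOVA", stats.f_oneway(group_a, group_b, group_c)),
               ("Kruskal-Wallis", stats.kruskal(group_a, group_b, group_c))]
    for name, res in results:
        print(f"{name}: p = {res.pvalue:.3f}")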
Matching "tests" to problems (contd.)
- if it's an association or correlation
- between counts - Chi squared and related phi
- ... or Kappa (particularly for looking at inter-rater
agreement)
- between Gaussian variables (similar variances not necessary)
- Pearson correlation coefficient
- between non-Gaussian variables - Spearman or Kendall correlation
coefficients (the latter better if there are many ties, which is often the case
on short ordinal ratings)
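And similarly for association and agreement (again, real scipy/scikit-learn function names, invented illustration data):

    import numpy as np
    from scipy import stats
    from sklearn.metrics import cohen_kappa_score  # one readily available kappa implementation

    rng = np.random.default_rng(5)
    # Invented illustration data
    x = rng.normal(0, 1, 50)
    y = 0.5 * x + rng.normal(0, 1, 50)              # two roughly Gaussian variables
    rater_1 = rng.integers(0, 2, 50)                # two raters' binary "case" judgements
    rater_2 = np.where(rng.random(50) < 0.8, rater_1, 1 - rater_1)

    print("Pearson :", stats.pearsonr(x, y))        # Gaussian variables
    print("Spearman:", stats.spearmanr(x, y))       # rank-based, non-Gaussian variables
    print("Kendall :", stats.kendalltau(x, y))      # better with many ties (short ordinal ratings)
    print("Kappa   :", cohen_kappa_score(rater_1, rater_2))  # inter-rater agreement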
"Robustness" of statistical tests
- There are two main distributional assumptions:
- Gaussian shape
- with the same variance in each group (for >1 group)
- parametric tests are often robust (i.e. continue to give about
the right type I and II error rates) for distributions that are
only roughly Gaussian
- many things can be transformed to "roughly" Gaussian,
e.g. by taking logs (pH is one example)
- parametric tests are also often robust to differing variances
provided sample sizes in the groups are similar
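A small sketch of the log-transform point: skewed, lognormal-looking data (invented here) become roughly Gaussian once logged, as the skewness figures show:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    # Invented illustration: a positively skewed, lognormal-looking variable
    # (e.g. something like a hydrogen ion concentration, of which pH is the log)
    raw = rng.lognormal(mean=0, sigma=0.8, size=500)
    logged = np.log(raw)

    print("skewness of raw data   :", stats.skew(raw))     # well away from 0
    print("skewness of logged data:", stats.skew(logged))  # close to 0, roughly Gaussian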
Multivariate statistics
The one fairly unproblematic example is internal reliability (coefficient
alpha): the proportion of common variance in a set of items
- Others are:
- factor analysis
- cluster analysis
- multivariate ANOVA (MANOVA)
- problems are:
- need large sample sizes
- not as robust to distribution as univariate statistics
- they're complex, easy to "blind with science"!
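Coefficient alpha is straightforward to compute from the standard formula; a Python sketch with an invented matrix of item scores:

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha: items is an (n_subjects, n_items) array of scores."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()   # sum of the item variances
        total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Invented illustration: 6 subjects' scores on a 4-item scale
    scores = np.array([[3, 4, 3, 4],
                       [2, 2, 3, 2],
                       [4, 5, 4, 5],
                       [1, 2, 1, 2],
                       [3, 3, 4, 3],
                       [5, 4, 5, 5]])
    print("coefficient alpha =", cronbach_alpha(scores))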