Monday, 27 November 2017

Reproducibility and phonics: necessary but not sufficient

Over a hotel breakfast at an unfeasibly early hour (I'm a clock mutant) I saw two things on Twitter that appeared totally unrelated but which captured my interest for similar reasons.

The two topics were the phonics wars and the reproducibility crisis. For those of you who don't work on children's reading, the idea of phonics wars may seem weid. But sadly, there we have it: those in charge of the education of young minds locked in battle over how to teach children to read. Andrew Old (@oldandrewuk), an exasperated teacher, sounded off this week about 'phonics denialists', who are vehemently opposed to phonics instrution, despite a mountain of evidence indicating this is an important aspect of teaching children to read. He analysed three particular arguments used to defend an anti-phonics stance. I won't summarise the whole piece, as you can read what Andrew says in his blogpost. Rather, I just want to note one of the points that struck a chord with me. It's the argument that 'There's more to phonics than just decoding'. As Andrew points out, those who say this want to imply that those who teach phonics don't want to do anything else.
'In this fantasy, phonics denialists are the only people saving children from 8 hours a day, sat in rows, being drilled in learning letter combinations from a chalkboard while being banned from seeing a book or an illustration.'
This is nonsense: see, for instance, this interview with my colleague Kate Nation, who explains how phonics knowledge is necessary but not sufficient for competent reading.

So what has this got to do with reproducibility in science? Well, another of my favourite colleagues, Dick Passingham, started a little discussion on Twitter - in response to a tweet about a Radiolab piece on replication. Dick is someone I enjoy listening to because he is a fount of intelligence and common sense, but on this occasion, what he said made me a tad irritated:

This has elements of the 'more to phonics than just decoding' style of argument. Of course scientists need to know more than how to make their research reproducible. They need to be able to explore, to develop new theories and to see how to interpret the unexpected. But it really isn't an either/or. Just as phonics is necessary but not sufficient for learning to read, so are reproducible practices necessary but not sufficient for doing good science. Just as phonics denialists depicts phonics advocates as turning children into bored zombies who hate books, those trying to fix reproducibility problems are portrayed as wanting to suppress creative geniuses and turn the process of doing research into a tedious and mechanical exercise. The winds of change that are blowing through psychology won't stop researchers being creative, but they will force them to test their ideas more rigorously before going public.

For those, like Dick, who was trained to do rigorous science from the outset, the focus on reproducibiity may seem like a distraction from the important stuff. But the incentive structure has changed dramatically in recent decades with the rewards favouring the over-hyped sensational result over the careful, thoughful science that he favours. The result is an enormous amount of waste - of resources, of time and careers. So I'm not going to stop 'obsessing about the reproducibility crisis.' As I replied rather sourly to Dick:

Friday, 24 November 2017

ANOVA, t-tests and regression: different ways of showing the same thing

Intuitive explanations of statistical concepts for novices #2

In my last post, I gave a brief explainer of what the term 'Analysis of variance' actually means – essentially you are comparing how much variation in a measure is associated with a group effect and how much with within-group variation.

The use of t-tests and ANOVA by psychologists is something of a historical artefact. These methods have been taught to generations of researchers in their basic statistics training, and they do the business for many basic experimental designs. Many statisticians, however, prefer variants of regression analysis. The point of this post is to explain that, if you are just comparing two groups, all three methods – ANOVA, t-test and linear regression – are equivalent. None of this is new but it is often confusing to beginners.

Anyone learning basic statistics probably started out with the t-test. This is a simple way of comparing the means of two groups, and, just like ANOVA, it looks at how big that mean difference is relative to the variation within the groups. You can't conclude anything by knowing that group A has a mean score of 40 and group B has a mean score of 44. You need to know how much overlap there is in the scores of people in the two groups, and that is related to how variable they are. If scores in group A range from to 38 to 42 and those in group B range from 43 to 45 we have a massive difference with no overlap between groups – and we don't really need to do any statistics! But if group A ranges from 20 to 60 and group B ranges from 25 to 65, then a 2-point difference in means is not going to excite us. The t-test gives a statistic that reflects how big the mean difference is relative to the within-group variation.  What many people don't realise is that the t-test is computationally equivalent to the ANOVA. If you square the value of t from a t-test, you get the F-ratio*.

Figure 1: Simulated data from experiments A, B, and C.  Mean differences for two intervention groups are the same in all three experiments, but within-group variance differs

Now let's look at regression. Consider Figure 1. This is similar to the figure from my last post, showing three experiments with similar mean differences between groups, but very different within-group variance. These could be, for instance, scores out of 80 on a vocabulary test. Regression analysis focuses on the slope of the line between the two means, shown in black, which is referred to as b. If you've learned about regression, you'll probably have been taught about it in the context of two continuous variables, X and Y, where the slope b, tells you how much change there is in Y for every unit change in X. But if we have just two groups, b is equivalent to the difference in means.

So, how can it be that regression is equivalent to ANOVA, if the slopes are the same for A, B and C? The answer is that, just as illustrated above, we can't interpret b unless we know about the variation within each group. Typically, when you run a regression analysis, the output includes a t-value that is derived by dividing b by a measure known as the standard error, which is an index of the variation within groups.

An alternative way to show how it works is to transform data from the three experiments to be on the same scale, in a way that takes into account the within-group variation. We achieve this by transforming the data into z-scores. All three experiments now have the same overall mean (0) and standard deviation (1). Figure 2 shows the transformed data – and you see that after the data have been rescaled in this way, the y-axis now ranges from -3 to +3, and the slope is considerably larger for Experiment C than Experiment A. The slope for z- transformed data is known as beta, or the standardized regression coefficient.

Figure 2: Same data as from Figure 1, converted to z-scores


The goal of this blogpost is to give an intuitive understanding of the relationship between ANOVA, t-tests and regression, so I am avoiding algebra as far as possible. The key point is when you are comparing two groups, t and F are different ways of representing the ratio between variation between groups and variation within groups, and t can be converted into F by simply squaring the value. You can derive t from linear regression by dividing the b or beta by its standard error - and this is automatically done by most stats programmes. If you are nerdy enough to want to use algebra to transform beta into F, or to see how Figures 1 and 2 were created, see the script Rftest_with_t_and_b.r here.

How do you choose which statistics to do? For a simple two-group comparison it really doesn't matter and you may prefer to use the method that is most likely to be familiar to your readers. The t-test has the advantage of being well-known – and most stats packages also allow you to make an adjustment to the t-value which is useful if the variances in your two groups are different. The main advantage of ANOVA is that it works when you have more than two groups. Regression is even more flexible, and can be extended in numerous ways, which is why it is often preferred.

Further explanations can be found here:

*It might not be exactly the same if your software does an adjustment for unequal variances between groups, but it should be close. It is identical if no correction is done.

Monday, 20 November 2017

How Analysis of Variance Works

Intuitive explanations of statistical concepts for novices #1

Lots of people use Analysis of Variance (Anova) without really understanding how it works, so I thought I'd have a go at explaining the basics in an intuitive fashion.

Consider three experiments, A, B and C, each of which compares the impact of an intervention on an outcome measure. The three experiments each have 20 people in a control group and 20 in an intervention group. Figure 1 shows the individual scores on an outcome measure for the two groups as blobs, and the mean score for each group as a dotted black line.

Figure 1: Simulated data from 3 intervention studies

In terms of average scores of control and intervention groups, the three groups look very similar, with the intervention group about .4 to .5 points higher than the control group. But we can't interpret this difference without having an idea of how variable scores are in the two groups.

For experiment A, there is considerable variation within each group, that swamps the average difference between the groups. In contrast, for experiment C, the scores within each group are tightly packed. Group B is somewhere in between.

If you enter these data into a one-way Anova, with group as a between-subjects factor, you get out a F-ratio, which can then be evaluated in terms of a p-value which gives the probability of obtaining such an extreme result if there is really no impact of the intervention. As you will see, the F-ratios are very different for A, B, and C, even though the group mean differences are the same. And in terms of the conventional .05 level of significance, the result from experiment A is not significant, experiment C is significant at the .001 level, and experiment B shows a trend (p = .051).

So how is the F-ratio computed? It just involves computing a number that reflects the ratio between the variance of the means of the groups, and the average variance within each group. When we just have two groups, as here, the first value just reflects how far away the two group means are from the overall mean. This is the Between Groups term, which is just the Variance of the two means multiplied by the number in each group (20). That will be similar for A, B and C, because the means for the two groups are similar and the numbers in each group are the same.

But the Within Groups term will differ substantially for A, B, and C, because it is computed as the average variance for the two groups. The F-ratio is obtained by just dividing the between groups term by the within groups term. If the within groups term is big, F is small, and vice versa.

The R script used to generate Figure 1 can be found here: https://github.com/oscci/intervention/blob/master/Rftest.R

PS. 20/11/2017. Thanks to Jan Vanhove for providing code to show means rather than medians in Fig 1. 

Friday, 3 November 2017

Prisons, developmental language disorder, and base rates

There's been some interesting discussion on Twitter about the high rate of developmental language disorder (DLD) in the prison population. Some studies give an estimate as high as 50 percent (Anderson et al, 2016), and this has prompted calls for speech-language therapy services to be involved in the working with offenders. Work by Pam Snow and others has documented the difficulties of navigating the justice system if your understanding and ability to express yourself are limited.

This is important work, but I have worried from time to time about the potential for misunderstanding. In particular, if you are a parent of a child with DLD, should you be alarmed at the prospect that your offspring will be incarcerated? So I wanted to give a brief explainer that offers some reassurance.

The simplest way to explain it is to think about gender. I've been delving into the latest national statistics for this post, and found that the UK prison population this year contained 82,314 men, but a mere 4,013 women. That's a staggering difference, but we don't conclude that because most criminals are men, therefore most men are criminals. This is because we have to take into account base rates: the proportion of the general population who are in prison. Another set of government statistics estimates the UK population as around 64.6 million, about half of whom are male, and 81% are adults. So a relatively small proportion of the adult population is in prison, and the numbers of non-criminal men vastly outnumber the number of criminal men.

I did similar sums for DLD, using data from Norbury et al (2016) to estimate a population prevalence of 7% in adult males, and plugging in that relatively high figure of 50% of prisoners with DLD. The figures look like this.


Numbers (in thousands) assuming 7% prevalence of DLD and 50% DLD in prisoners*
As you can see, according to this scenario, the probability of going to prison is much greater for those with DLD than for those without DLD (2.24% DLD vs 0.17% without DLD), but the absolute probability is still very low – 98% of those with DLD will not be incarcerated.

The so-called base rate fallacy is a common error in logical reasoning. It seems natural to conclude that if A is associated with B, then B must be associated with A. Statistically, that is true, but if A is extremely rare, then the likelihood of B given A can be considerably less than the likelihood of A given B.

So I don't think therefore that we need to seek explanations for the apparent inconsistency that's being flagged up on Twitter between rates of incarceration in studies of those with DLD, vs rates of DLD in those who are incarcerated. It could just be the consequence of the low base rate of incarceration.

References
Anderson et al (2016) Language impairments among youth offenders: A systematic review. Children and Youth Services Review, 65, 195-203.

Norbury, C. F.,  et al. (2016). The impact of nonverbal ability on prevalence and clinical presentation of language disorder: evidence from a population study. Journal of Child Psychology and Psychiatry, 57, 1247-1257.

*An R script for generating this figure can be found here.


Postscript - 4th November 2017 
The Twitter discussion has continued and drawn attention to further sources of information on rates of language and related problems in prison populations. Happy to add these here if people can send sources:

Talbot, J. (2008). No One Knows: Report and Final Recommendations. Report by Prison Reform Trust.  

House of Commons Justice Committee (2016) The Treatment of Young Adults in the Criminal Justice System.  Report HC 169.