Monday, 1 October 2012

Data from the phonics screen: a worryingly abnormal distribution


The new phonics screening test for children has been highly controversial.  I’ve been surprised at the amount of hostility engendered by the idea of testing children’s knowledge of how letters and sounds go together. There’s plenty of evidence that this is a foundational skill for reading, and poor ability to do phonics is a good predictor of later reading problems. So while I can see there are aspects of the implementation of the phonics screen that could be improved,  I don’t buy arguments that it will ‘confuse’ children, or prevent them reading for meaning.

I discovered today that some early data on the phonics screen had recently been published by the Department for Education, and my inner nerd was immediately stimulated to visit the website and download the tables.  What I found was both surprising and disturbing.

Most of the results are presented in terms of proportions of children ‘passing’ the screen, i.e. scoring 32 or more. There are tables showing how this proportion varies with gender, ethnic background, language background, and provision of free school meals. But I was more interested in raw scores: after all, a cutoff of 32 is pretty arbitrary. I wanted to see the range and distribution of scores.  I found just one table showing the relevant data, subdivided by gender, and I have plotted the results here.
Data from Table 4, Additional Tables 2, SFR21/2012, Department for Education (weblink above)

Those of you who are also statistics nerds will immediately see something very odd, but other readers may need a bit more explanation. When you have a test like the phonics test, where each item is scored right or wrong and the number of correct items is totalled up, you’d normally expect to get a continuous distribution of scores. That is to say, the number of children obtaining a given score should increase gradually up to some point corresponding to the most typical score (the mode), and then gradually decline again. If the test is pretty easy, you may get a ceiling effect, i.e. the mode may be at or close to the maximum score, so you will see a peak at the right-hand side of the plot, with a long straggly tail of lower scores. There may also be a ‘bump’ at the left-hand edge of the distribution, corresponding to those children who can’t read at all – a so-called ‘floor’ effect. That’s evident in the scores for boys.

But there’s also something else: a sudden upswing in the distribution, just at the ‘pass’ mark. Okay, you might think, that’s because the clever people at the DfE have devised the phonics test that way, so that 31 of the items are really easy, and most children can read them, but then they suddenly get much harder. Well, that seems unlikely, and it would be a rather odd way to develop a test, but it’s not impossible.

The really unbelievable bit is the distribution of scores just above and below the cutoff. What you can see is that for both boys and girls, fewer children score 31 than 30, in contrast to the general upward trend that was seen for lower scores. Then there’s a sudden leap, so that about five times as many children score 32 as score 31. But then there’s another dip: fewer children score 33 than 32. Overall, there’s a kind of ‘scalloped’ pattern to the distribution of scores above 32, which is exactly the kind of distribution you’d expect if a score of 32 were acting as a kind of ‘floor’.
But, of course, 32 is not the test floor.

This is so striking, and so abnormal, that I fear it provides clear-cut evidence that the data have been manipulated, so that children whose scores would put them just one or two points below the magic cutoff of 32 have been given the benefit of the doubt, and had their scores nudged up above cutoff.

This is most unlikely to indicate a problem inherent in the test itself. It looks like human bias that arises when people know there is a cutoff and, for whatever reason, are reluctant to have children score below that cutoff.  As one who is basically in favour of phonics testing, I’m sorry to put another cat among the educational pigeons, but on the basis of this evidence, I do query whether these data can be trusted.
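To see why this pattern points to score-nudging rather than test design, here is a minimal simulation sketch in Python. All the parameters (the number of children, the ability distribution, the nudging probabilities) are illustrative assumptions of mine, not estimates from the DfE data: smooth binomial-like scores show no discontinuity at 32, but giving near-misses the benefit of the doubt reproduces the dip at 30–31 and the spike at exactly 32.

```python
import random
from collections import Counter

random.seed(1)

N = 100_000   # simulated children (illustrative, not the real cohort size)
ITEMS = 40    # items on the phonics screen
CUTOFF = 32   # published 'pass' threshold

def raw_score():
    # each child has an ability p: the chance of getting any one item right
    p = min(1.0, max(0.0, random.gauss(0.85, 0.12)))
    return sum(random.random() < p for _ in range(ITEMS))

scores = [raw_score() for _ in range(N)]

# hypothetical nudging: scorers push near-misses up to exactly 32,
# with children closer to the cutoff more likely to be nudged
NUDGE_PROB = {CUTOFF - 1: 0.9, CUTOFF - 2: 0.5}
nudged = [CUTOFF if random.random() < NUDGE_PROB.get(s, 0.0) else s
          for s in scores]

before, after = Counter(scores), Counter(nudged)
for s in range(28, 36):
    print(f"{s:2d}  raw={before[s]:6d}  nudged={after[s]:6d}")
```

Under these assumptions the nudged counts show exactly the scalloping in the published table: a dip at 31, a large spike at 32, and a fall back at 33, while the raw counts rise smoothly through the same range.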

16 comments:

  1. For any of the 32+ portion of that graph to make sense, there would have to be an exponential rise from ~25 up to 40. (Assuming the top scores haven't been manipulated.) Is that what you'd expect to see?

    ReplyDelete
  2. For all those who won't believe that a simple graph can tell you enough to conclude that the data has been manipulated, I should say that I examined the graph before I read your conclusion and my interpretation was just the same as yours: some children who should score below 32 have been put at 32 or 33 (possibly as early as collection time, not necessarily after the fact).
    This is another nice illustration of how much a graph can tell, and how important it is to always plot your data before drawing any conclusion.

    ReplyDelete
  3. Franck, it will have been at collection time as the DfE published the minimum standard threshold in advance of testing.

    ReplyDelete
  4. @Dorothy: Well spotted. People with a stake in the outcome (teachers and school heads) can't be trusted to administer these tests. Sad, but I guess they need to be computerised, and the speech samples scored independently.

    While very sad, this is consistent with other analyses of what happens when teachers are tested: they cheat (one example linked below, but there are many).

    @Steve Jones: Yes, most kids master the skills of reading, including both sounding out and building a basic sight vocabulary, so the distribution on tests of simple words and nonwords is not normal, but negatively skewed (long left tail). But you can literally see the chunk of 29-31 scores sculpted out and bolted on as 32-34 here...

    @Franck Ramus: Seems most likely the scores were enhanced at collection time by teachers "giving the benefit of the doubt". If it's happened post collection, that would be IMHO worse. But both are of course inexcusable.

    http://www.freakonomics.com/2011/07/06/massive-teacher-cheating-scandal-uncovered-in-atlanta/

    ReplyDelete
  5. I'm no expert on teaching phonics, and have watched the debate as an interested layperson, but I would say that this convincing piece of forensic statistical analysis backs up some of the criticisms of the phonics screen. In a nutshell: it's not seen as a screen, but as a pass/fail test, and a high-stakes one at that. Hence, teachers see it not as a useful diagnostic tool, but as a threat. That may or may not be accurate, but it surely reflects teachers' perceptions of the educational climate and culture that they are working in.

    ReplyDelete
  6. @tim: Yes, most kids master the skills of reading, including both sounding out and building a basic sight vocabulary, so the distribution on tests of simple words and nonwords is not normal, but negatively skewed (long left tail).

    Perhaps, but what's equally interesting here is that the median of the distribution falls almost exactly on the 'pass mark' of 32 - statistically it's 32.27, and the deviation in the individual medians for boys and girls is close to 0.5 for both (boys downwards, girls upwards).

    That rather suggests that we have a normalised test, in addition to the data manipulation, which suggests it's also likely that the test does, in fact, incorporate a sharp increase in difficulty at the pass/fail boundary.

    ReplyDelete
  7. @Deevybee, @Steve Jones, @Franck Ramus and @Unity.

    This seems unfair on teachers. I am aware that when my children were going through primary school they were assessed in year one for literacy problems, and those who struggled were given quite extensive help. This help would be "cut off" when the child reached an adequate standard. These interventions took place in year 1. Most state infant and junior schools have nurseries and Early Years forms, and the school will have had three or four years of experience of their children prior to the screen. They will know who needs, and will benefit from, intervention.

    You, @Steve Jones, @Franck Ramus and @Unity are clear that you think the data has been manipulated. To conclude this you must discount the possibility that professional teachers have done their job, identified struggling children and provided targeted help during the Early Years and KS1 stages. It would be helpful if you explained why you have all discounted this possibility.



    Robin Cousins

    ReplyDelete
  8. If teachers spot children with special needs and provide appropriate help before the phonics screen is administered (and I hope many of them do), then the effect is to shift the left part of the distribution rightwards. This would not be expected to produce the blips around the 32 cut-off.
    Unless the data reported reflect a second administration of the screen, after a first administration that triggered appropriate interventions?

    ReplyDelete
  9. In my experience the system is not quite so neat and tidy. Within the school there tends to be an ad hoc collection of helpers including special needs teachers, sencos, teaching assistants, and parents, who take an interest in literacy. There appears to be a general sense of what constitutes under-performance in reading, which is not the same as special needs. Again, from what I've seen, the kids are aware that they have been selected for help and don't like the stigma, even at the age of six or seven. The upshot is that a reasonable number of children can be helped over the line into a "normal" category. Then there is a social explanation as to why there is a lump just over the pass line and the progress does not continue. Schools seem honest and very adaptive organisations with an acute sense of where assessment boundaries lie. My belief would be that children rather than numbers have been "manipulated" over the 32 cut-off. I agree that the sharpness of the rise at 32 is curious but the shape of the graph is very much as I would expect. I certainly would not expect continuous or parametric data from a screen on something about which the schools are so sensitive.

    Robin Cousins

    ReplyDelete
  10. That's what I call a "Student's wtf-distribution".

    In fairness, it could come about through some really strangely constructed test items - you can imagine that if some of the items were actually measuring the same thing, you might have fewer children who got 1 right than who got 0 or 2... in theory, but some kind of manipulation looks much more likely.

    ReplyDelete
  11. Actually that looks a little bit like the distribution of scores in an oral reading test in children who are learning to read a regularly spelled language. We've published data on this in our 2000 paper. There's a bump at 0 and then a bump at the top end. Children either know little about how to decode words or they've got the trick and can do all or almost all the words.

    So it looks like children learning to read an irregular orthography still do this - they either can't decode regularly spelled words that well, or they can decode almost all of them (and most of them can decode almost all of them). I'm not too surprised the middle isn't completely flat as it is in the regular orthography, as the children are used to being "fooled" by known words.

    The blips are probably due to bad data collection etc. I'd agree.

    ReplyDelete
  12. Thanks for a great analysis. I think the issue is with the subjective nature of these types of tests: I have run decoding tasks with lots of kids and I have been struck by how difficult it can be to a) make out what some kids say and b) decide whether the pronunciation is entirely correct. I don't necessarily think that teachers are being disingenuous, but they would rather give children the benefit of the doubt. And of course they are also under pressure from the government in terms of reaching certain pre-set standards. This makes for poor data collection, but what do we expect? Teachers are teachers, not researchers.

    ReplyDelete
  13. Might it also be something to do with the number of words and non-words in the test? Lots of children -- many of them able readers -- struggled more with the non-words than they did actual words. Might the unusual distribution of the graph reflect that in some way?


    ReplyDelete
  14. Anonymous. Several people have made a similar point, but I think they're just wrong. Given that there is variation from child to child, I can't see any way that you would get a peak at the cutoff point with a drop on either side. It would be good to see raw data on individual items, but I would bet a very large sum of money on this not being an item effect.
    See also similar effect on GCSE scoring http://blogs.ft.com/ftdata/2012/11/02/english-gcse-and-ofqual/#axzz2B3oGpUsZ

    ReplyDelete
  15. Was this a data set from a first small run of a draft version of the test? Such a bizarre distribution would surely have led to an examiner's facepalm and a swift redesign?

    ReplyDelete