Tuesday, 29 May 2012

The problem with performance measures (e.g. REF and NSS)

Increasingly, in all walks of life and all fields of work, performance now has to be quantified; we must all be given a number that identifies how good we are at whatever we do.

For academics we have the Research Excellence Framework (REF) exercise, which aims to measure the research performance of individual departments and whole universities, the National Student Survey (NSS), which measures the student experience, and numerous newspaper league tables. The REF (formerly the Research Assessment Exercise, RAE) is a particularly influential performance indicator, since it directly determines the distribution of one of the major funding sources for UK universities. It attracts a fair amount of comment as a result. See, for example, these informative posts calling the REF a time-wasting monster (Dorothy Bishop) and defending it (Andrew Derrington). Those articles focus predominantly on whether the REF is good value for money (since it costs tens of millions of pounds to run).

I see a different problem with the REF, and with all the other performance measures currently in use: we simply have no idea how well the measures quantify what we want to know. All scientists know that in order to declare one treatment 'better' than another, we need not only a measure of its performance but also an understanding of how good that measure is. None of us, beyond about first-year undergraduate level, would dream of presenting a mean without an error bar. Nor would we report values to more decimal places than we can actually measure. And we would certainly explain in our journal articles the caveats that should stop people over-interpreting our data.
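The practice described above can be made concrete with a toy example. This is an illustrative sketch using made-up scores (not REF data): report a mean together with its standard error, to a sensible number of decimal places.

```python
import math

# Made-up illustrative scores; the point is the reporting style,
# not the numbers themselves.
scores = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2, 2.7]

n = len(scores)
mean = sum(scores) / n
# Sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
sem = sd / math.sqrt(n)  # standard error of the mean

# A mean without an error bar is half a result; report both,
# and only to the precision the data support.
print(f"mean = {mean:.2f} +/- {sem:.2f} (SEM, n = {n})")
```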

Yet the REF, the NSS, and all of the newspaper league tables seem to be exempt from these things that all psychologists would consider good practice. So how do they 'measure up'? Are they reliable and precise? Do they even measure the thing they claim to [ed: 'valid', for the scientists in the readership]?

The authors of the posts above don't seem concerned about whether the REF is an accurate measure. Dorothy Bishop's blog asserts that the REF yields no surprises (which would make it pointless, but not inaccurate). Really? The top four or five might be obvious, but as you look down the list, do you really find all those entries unsurprising? I don't want to name names, but I find it surprising how high some institutions appear, and similarly how low others are.

If I look at the tables of departments within my own field, rather than at the institutional ranks, I am very surprised indeed. I don't have any evidence beyond my own experience that they are inaccurate or noisy (to quantify that we would need to conduct the exercise twice in quick succession with a new panel, which never happens). But I certainly have my doubts about whether a REF rank of 10 is actually different from a REF rank of 15, say, or even 20. I mean 'significantly different', to use the same bar that we set for our scientific reporting. The question turns out to be incredibly important given the financial impact. In the newspaper league tables, which are created on much smaller budgets, my own department's ranking changes enormously year on year without any dramatic changes to the department or course.
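The worry about nearby ranks can be illustrated with a minimal simulation, under entirely made-up assumptions: 30 departments with evenly spaced 'true' quality, where each assessment observes that quality plus independent noise. Running the 'exercise' twice, as suggested above, shows how much the ranks can move between runs even though nothing real has changed.

```python
import random

random.seed(1)

N_DEPTS = 30    # assumed number of departments in the table
NOISE_SD = 0.5  # assumed assessment noise, in quality-score units

# Evenly spaced hypothetical "true" quality scores
true_quality = [d / 10 for d in range(N_DEPTS)]

def run_assessment():
    """One noisy assessment: observed score = true quality + noise."""
    observed = [(q + random.gauss(0, NOISE_SD), dept)
                for dept, q in enumerate(true_quality)]
    # Rank 1 = best (highest observed score)
    order = sorted(observed, reverse=True)
    return {dept: r + 1 for r, (_, dept) in enumerate(order)}

# Run the "exercise" twice and compare each department's rank.
first = run_assessment()
second = run_assessment()
shifts = [abs(first[d] - second[d]) for d in range(N_DEPTS)]
print("largest rank change between runs:", max(shifts))
print("mean rank change:", sum(shifts) / N_DEPTS)
```

With noise of this (assumed) size, mid-table departments swap places freely between runs; only the extremes are stable, which is exactly the pattern of "the top four or five are obvious" followed by doubt further down.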

Ok, so these measures might be somewhat noisy, but do they at least measure the thing we want? That isn't clear either. In the case of the REF we have no measure to compare against other than our own personal judgements. And if the two disagree, I suppose we have to decide that "the REF was based on more data than my opinion", so it wins. Indeed, possibly the reason Dorothy finds the tables unsurprising is that she has more experience (she certainly does). Without a gold standard, or any other metric at all for that matter, how do we decide whether the REF is measuring the 'right' thing? We don't. We just hope. And when you hear that some universities are drafting in specialists to craft the documents for them, while other departments leave the job to whoever is their Director of Research, I suspect what we might be measuring is not how good an institution is at research, but how much they pay their public relations people: how well they convince people of their worth and, possibly, how willing they are to 'accentuate' their assets.

Andrew Derrington (aka Russell Dean) points out that REF (and RAE) "seem to have earned a high degree of trust." I'm not sure who trusts it so much (possibly just the senior academic managers?) but even if it is trusted we know that doesn't mean it's good, right? It could well be the simple problem that people put far too much faith in a number when given one. Even highly competent scientists.

I think it's far from clear that the REF is measuring what we want, and even less that it's doing so accurately. But I should add that I don't have a better idea. I'm not saying we should throw the baby out with the bathwater. Maybe I'm just saying be careful about trying to interpret what a baby says.

Hmm, possibly I over-extended with the baby metaphor.


  1. I think you make a good point here. I also think that concerns about the reliability and validity of the measures are one reason why the assessments get ever more complex: 'impact' is now included because it's realised that the importance of work can't be measured by publications alone. People want the system to be fair and so they try harder and harder to improve the breadth and precision of the measures, but they're doomed because you can't reduce a complex multifactorial system to a number. And at the end of the day, it comes down to the subjective judgement of committees. Andrew D points out we need some system to distribute funds and asks what alternative system would be better. I have to say I hanker after the old pre-RAE system, but it's true that it also contained arbitrary decisions and was less transparent than the REF, where the rules are at least explicit. But it was highly efficient - and perhaps no less unfair than the REF.
    I don't think there were that many surprises in the last round, but agree they sometimes happen and when there ARE discrepancies between outcomes and people's expectations, then most people conclude that it's likely to reflect better gamesmanship, not better scholarship, in the institutions that get surprisingly good results.

  2. Though the ideas and considerations evoked by actively thinking about impact can be useful (formation of explicit end-goals and research direction), the REF's impact section is as-yet poorly defined. It also encourages more concrete, shorter-term research as opposed to innovative work, which has been shown to result in demonstrably higher impact despite initially appearing as somewhat of a 'gamble' (Azoulay, Zivin, and Manso, 2011).

    1. Allowing people to talk about the impact their work has had above and beyond the paper citations is a good thing, as Dorothy points out above. And at least, in this instance, we are talking about impacts that have occurred, unlike in grant applications where we have to discuss impacts that might occur (which appears to be fairly close to science fiction writing).
      Does it cause people to go for the quick study? I'm not sure. It might cause scientists to work on simple or sexy topics, rather than some important topics that are difficult for the general public to understand or engage with. But if it's just that a scientist can get credit for creating something that turned out to be useful as well as scientifically novel, then that's surely good.

  3. On behalf of Andrew Derrington (cruelly bounced by blogger):

    I think it is easy to be imprecise when talking about the RAE and the REF. I think we should distinguish three different things.

    - The RAE itself, which is a way of assessing the quality and quantity of research in UK universities. RAE scores are based mostly on the quality of publications, and partially on assessments of the research environment and esteem. The REF will discard the esteem component and introduce a component based on impact. The last RAE was in 2008 and cost £12M; it will be replaced by the REF in 2013.
    - League tables and rankings based on RAE results.
    - All of the activities in universities that prepare for, and respond to, the RAE and now the REF. HEFCE commissioned a survey that estimated the cost of these at nearly £50 million for the last RAE.

    I think that most arguments against the RAE and the REF are based on the bad qualities of the league tables and on the apparently excessive and counterproductive preparations for the REF and responses to the RAE in some universities.

    I think that it would be a tragedy if we moved away from a system in which we assess university research mainly by the quality of its outputs.

    1. Yes, I didn't distinguish between the different measures. But on this issue they are all the same: /none/ tell you about their confidence intervals. We know a lot, statistically, about how to measure such things, even for unusual tools like this. For example, MORI's political polls tell you the expected error in their measurements. I believe the REF could do the same. Or, at least, highlight that there /is/ measurement error.

  4. The construction of the REF somewhat reminds me of the creation of the first intelligence tests. Inclusion of items on those tests was based on the principle that kids who were deemed intelligent by others should come out intelligent on the test; test construction was atheoretical, with items largely included or excluded to fit the shape of the curve.
    The REF seems quite similar to that: institutions we *think* are top should come out on top, and if they don't then the measurement is deemed inaccurate. If we are to have a system that is probably not reliable, perhaps not valid, constructed on somewhat arbitrary grounds, and mainly used to confirm hunches we have about institutions, then perhaps we could look for a process that at least takes less time and money.
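On the polling comparison made in one of the replies above: the 'expected error' a poll attaches to its figures comes from a standard formula. Here is the generic textbook calculation for a sample proportion, with illustrative numbers (1,000 respondents, 40% support); this is not any particular pollster's actual method.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p
    from n respondents (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Illustrative poll: 1,000 respondents, 40% support.
moe = margin_of_error(0.40, 1000)
print(f"40% +/- {moe * 100:.1f} percentage points")  # about +/- 3 points
```

Nothing stops a table of assessment scores carrying an analogous "plus or minus" figure; the point of the replies above is simply that none of them do.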