David Read blew my cover on Sunday night with a tweet mentioning my averaging of Likert data in our recent work on badges. If there is ever a way to get educational researchers to look up from their sherry on a Sunday evening, this is it.
Hadn't realised it before, but @seerymk advocates averaging of Likert scores in his excellent paper on digital badges: https://t.co/Zz8XJP8M5o. You can't argue with MKS, @PaulDuckmanton and @aw_mckinley! Certainly seems valid in this exploratory case.
— David Read (@lowlevelpanic) February 4, 2018
Averaging Likert scales is fraught with problems. The main issue is that Likert responses are ordinal, meaning that when you reply to a rating by selecting 1, 2, 3, 4 or 5, these are labels. Their appearance as numbers doesn’t make them numbers, and Likevangels note correctly that unlike the numbers 1, 2, 3… the spaces between the labels 1, 2, 3… do not have to be the same. In other words, if I ask you to rate how much you enjoyed the new season of Endeavour and give you options 1, 2, 3, 4, 5, where 1 is not at all and 5 is very much so, you might choose 4. But that might be because, while it was near perfect TV, it wasn’t quite: there were a few things that bothered you (including that new Fancy chap), so you are holding back from a 5. If you could, you might say 4.9…
But someone else might say, well, it was just better than average, but only just, mind. That new fella wasn’t a great actor but, hell, it is Endeavour, so they put 4, but really, if they could, they would say 3.5.
So both respondents are choosing 4, but the range of thought represented by that 4 is quite broad.
I can’t dispute this argument, but my own feeling on the matter is that this is a problem with Likert scales rather than a problem with their subsequent analysis. Totting up all of the responses in each category, we would still get two responses in the ‘4’ column, and those two responses would still represent quite a broad range of sentiments. Also, while I understand the ordinal argument, I do feel that, on balance, when respondents are asked to select between 1 and 5, there is an implied scale built in. One could of course emphasise the scale by adding in more points, but how many would be needed before the ordinal issue dissipates? A scale of 1 to 10? 1 – 100? Of course you could be very clever by doing what Bretz does with the Meaningful Learning in the Lab questionnaire and ask students to use a sliding scale which returns a number (Qualtrics allows for this more unusual question type). Regardless, it is still a rating influenced by the perception of the respondent.
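To make the difference between totting up and averaging concrete, here is a minimal sketch. The two sets of responses are made up purely for illustration (they are not data from the badges paper), and `tally` is a hypothetical helper name. It shows two groups with the same mean but very different spreads across the five categories, which is why it helps to look at the counts alongside any average.

```python
from collections import Counter
from statistics import mean


def tally(responses, scale=range(1, 6)):
    """Count how many responses fall on each point of the scale."""
    counts = Counter(responses)
    return {point: counts.get(point, 0) for point in scale}


# Made-up responses for illustration only, not data from the badges paper.
lukewarm = [3, 3, 3, 3, 3, 3]  # everyone sits in the middle
divided = [1, 1, 1, 5, 5, 5]   # half loved it, half hated it

for name, responses in (("lukewarm", lukewarm), ("divided", divided)):
    # Both groups share a mean, but the tallies tell different stories.
    print(name, "mean:", mean(responses), "tally:", tally(responses))
```

The average alone cannot tell these two groups apart; the tally can.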
Our badges paper tried to avoid being led by data by first exploring how the responses shifted in a pre-post questionnaire, so as to get some “sense” of the changes qualitatively. We saw a large decrease in 1s and 2s, and a large increase in 4s and 5s. Perhaps it is enough to say that; we followed the lead of Towns, on whose methodology we based our own, in performing a pre-post comparison with a t-test. But as with any statistic, the devil is in the detail; the statistic is just the summary. Conan Doyle wrote that “You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician.”
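As a rough sketch of the two steps described above (look at how the counts shift, then summarise with a t-test), the following uses made-up pre and post ratings. It assumes each position in the two lists is the same student answering before and after, so it computes a paired t statistic; whether the paper's own comparison was paired is not something this post states, and `paired_t` is a hypothetical helper, not the paper's code.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for matched pre/post scores."""
    if len(pre) != len(post):
        raise ValueError("pre and post must hold one pair per respondent")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    if standard_error == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / standard_error, n - 1


# Made-up ratings for illustration only, not data from the badges paper.
pre = [1, 2, 2, 3, 2, 1, 3, 2]
post = [4, 4, 5, 4, 3, 4, 5, 4]

# Step 1: the qualitative "sense" of the shift, category by category.
print("pre counts: ", sorted(Counter(pre).items()))
print("post counts:", sorted(Counter(post).items()))

# Step 2: the summary statistic, which treats the labels as numbers.
t_stat, dof = paired_t(pre, post)
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

The counts show where the responses moved; the t statistic only says that, on average, they did.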
There is a bigger problem with Likert scales. They are just so darned easy to ask. It’s easy to dream up lots of things we want to know and stick them in a Likert question. Did you enjoy Endeavour? Average response: 4.2. It’s easy. But what does it tell us? It depends on what the respondent’s perception of “enjoy” is. Recently I’ve been testing out word associations rather than Likert. I want to know how students feel at a particular moment in laboratory work. Rather than asking them how nervous they feel or how confident they feel, I ask them to choose a word from a selection covering a range of possible feelings. It’s not ideal, but it’s a move away from a survey made up of a long list of Likert questions.