
Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale

Aaron Bangor, Philip Kortum, and James Miller

Journal of Usability Studies, Volume 4, Issue 3, May 2009, pp. 114-123



Discussion

The finding that the adjective rating scale very closely matches the SUS scale suggests that it is a useful tool for providing a subjective label for an individual study's mean SUS score. Given the strength of the correlation, it may be tempting to consider using the single-question adjective rating alone, in place of the SUS. Certainly, administering a single-item instrument would be more efficient, and the result would be an easy-to-interpret metric that could be quickly shared within the product team. However, there are several reasons why using a single-item scale alone may not be the best course. First, in the absence of objective measures, such as task success rates or time-on-task, we cannot adequately determine whether the SUS or the adjective rating scale is the more accurate metric. Indeed, anecdotal evidence from our lab suggests that a test participant may provide a favorable SUS score yet fail to complete the tasks being tested; the reverse has also been observed. Collecting this kind of corroborating data is an effort we will be undertaking in future studies.

Second, psychometric theory suggests that multiple questions are generally superior to a single question. Many studies have found that multiple-question surveys tend to yield more reliable results than single-question surveys. For example, in a study of overall job satisfaction, Oshagbemi (1999) found that single-item measures tended to produce higher job satisfaction scores than comparable multi-question surveys did. Because specific elements of dissatisfaction could not be addressed individually, the single-question survey tended to dilute measures of dissatisfaction. In another study, respondents were asked to estimate their intake of fish products. One survey asked them to estimate intake for 71 different fish items, while another asked a single question about their overall fish intake. Respondents using the single-question survey underestimated their fish intake by approximately 50% (Mina, Fritschi, & Knuiman, 2007). These studies seem to indicate the superiority of multiple-item questionnaires.

Other research, however, indicates that single-item surveys can produce results similar to those found with multiple-item surveys. For example, in a study that measured workers' focus of attention while on the job, no differences were found between single and multiple measures (Gardner, Cummings, Dunham, & Pierce, 1998). Similarly, Bergkvist and Rossiter (2007) found that the correlation between consumers' attitudes toward specific brands and advertisements was the same regardless of whether single- or multiple-item questionnaires were used.

Based on these disparate results, how do we determine whether using the adjective rating scale alone might be appropriate? The key lies in understanding whether the construct of usability is a concrete singular object as defined by Rossiter (2002). For a construct to be concrete, all of the users must understand what object is being rated. In the case of these usability studies, that is a reasonable assumption, because a single item was presented to the user for evaluation. For an object to be considered singular, it must be homogeneous: a single item rather than a collection of separate but related items. If an item is concrete singular, then a single-item questionnaire can be used; if it is not, multiple-item questionnaires should be used. Because different parts of an interface may be judged differently (e.g., the main navigation vs. the help system), we believe that the items tested in usability assessments are not necessarily singular. Because the interfaces are not always singular, as defined by Rossiter (2002), relying on a single-item questionnaire alone is inadvisable.

Another note of caution regarding the single adjective scale is the observation that OK might be too variable for use in this context. In this study, OK had the highest variance of the seven adjectives. It is striking, though, that its mean score (50.9 out of 100) falls at the SUS scale's mid-point, which matches previous research on adjective ratings (Babbitt & Nystrom, 1989) that lists OK as a mid-point value between Neutral and Average. However, participants may have believed OK to mean that something is acceptable. In fact, some project team members have taken a score of OK to mean that the usability of the product is satisfactory and no improvements are needed, even though scores within the OK range were clearly deficient in terms of perceived usability.

It seems clear that the term OK is probably not appropriate for this adjective rating scale. Not only is its meaning too variable, but it may also give the intended audience for SUS scores the mistaken impression that an OK score is satisfactory in some way. Based on other, established rating scales (Babbitt & Nystrom, 1989), we believe that the terms fair or so-so would still map to a mid-point value on the scale while more appropriately connoting an overall level of usability that is not fully acceptable.

Because of the questions about how accurately the adjectives map to SUS scores, we are also considering testing a different scale. As described earlier, we have found that a useful analog for conveying a study's mean SUS score to others involved in the product development process is the traditional school grading scale (i.e., 90-100 = A, 80-89 = B, etc.) (Bangor, Kortum, & Miller, 2008). This has strong face validity for our existing data insofar as a score of 70 has traditionally meant passing, and our data show that the average study mean is about 70. We had earlier proposed a set of acceptability ranges (Bangor, Kortum, & Miller, 2008) to help practitioners determine whether a given SUS score indicates an acceptable interface. The grading scale also matches these acceptability ranges quite well. Figure 4 shows how the adjective ratings compare to both the school grading scale and the acceptability ranges.

Figure 4. A comparison of the adjective ratings, acceptability scores, and school grading scales, in relation to the average SUS score
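To make the grading-scale analog concrete, the sketch below (in Python) shows how a study's mean SUS score might be converted to a letter grade. The 90-100 = A and 80-89 = B breakpoints follow the example above; the C, D, and F cutoffs are assumptions drawn from the conventional school scale (with 70 as the traditional passing mark), and the function name is purely illustrative rather than part of any published tool.

    def sus_to_letter_grade(sus_score: float) -> str:
        """Map a study's mean SUS score (0-100) onto the school grading scale."""
        # 90-100 = A and 80-89 = B follow the article's example; the remaining
        # cutoffs (70-79 = C, 60-69 = D, below 60 = F) are assumed from the
        # conventional US grading scale and are not specified in the article.
        if not 0 <= sus_score <= 100:
            raise ValueError("SUS scores range from 0 to 100")
        if sus_score >= 90:
            return "A"
        if sus_score >= 80:
            return "B"
        if sus_score >= 70:
            return "C"  # about the average study mean reported here
        if sus_score >= 60:
            return "D"
        return "F"

    # Example: a study mean of about 70 maps to a C.
    print(sus_to_letter_grade(70.0))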

Finally, regardless of whether words or letter grades are used for such a scale, we believe that the result from a single-item score should be considered complementary to the SUS score, and that the two should be used together to create a clearer picture of the product's overall usability.

The work presented here suggests several lines of future research needed to further understand both the SUS and the use of an additional single-question rating scale. First and foremost, data collection will continue with the mid-point adjective replaced by one that carries a stronger neutral connotation than the current term OK. With this substitution, we will also include a letter grade scale so that users themselves determine the grade assignment, rather than having to rely on the anecdotal evidence presented to date. One virtue of the letter grade approach is that the participant could be asked verbally to assign a letter grade prior to presentation of the SUS. This would help remove the letter grade from the context of the SUS questions and perhaps increase the degree of independence between the two measures. We hypothesize that users may be less reluctant to give low or failing grades to poor interfaces because of their extensive exposure to this familiar scale in other domains. We believe that users may have self-generated reference points across the entire letter grade scale and, because of their previous exposure, could be more willing to use the full scale. If this is true, it may prove to be a valuable extension of the SUS and help solve the range restriction issue that is prevalent in SUS scores. If the letter grade score does indeed prove reliable and useful, further investigations will need to focus on whether such a single-score assessment might be sufficient. One important element of these investigations will be examining how the SUS, the seven-point adjective rating scale, and the letter grade scale relate to objective measures of usability such as time-on-task and task success rates.
