
Response Interpolation and Scale Sensitivity: Evidence Against 5-Point Scales

Kraig Finstad

Journal of Usability Studies, Volume 5, Issue 3, May 2010, pp. 104-110


Introduction

The Information Technology department at Intel® Corporation has employed the System Usability Scale (SUS) for the subjective component of some of its internal usability evaluations. The SUS is a 10-item, 5-point Likert scale anchored with 1 = Strongly Disagree and 5 = Strongly Agree, and is used to evaluate a system’s usability in a relatively quick and reliable fashion (Brooke, 1996).
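For readers unfamiliar with the instrument, the conventional scoring procedure described by Brooke (1996) converts the ten 1-5 responses into a single 0-100 score. The following is a minimal sketch of that calculation in Python; the function name and example responses are illustrative only.

def sus_score(responses):
    """Compute a SUS score (0-100) from ten 1-5 Likert responses.

    Odd-numbered items are positively worded (contribution = response - 1);
    even-numbered items are negatively worded (contribution = 5 - response).
    The summed contributions are multiplied by 2.5 to yield a 0-100 score.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2]))  # 82.5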

The SUS can be administered electronically, which is common in post-deployment situations where the researcher wants to conduct a usability evaluation with a large base of system users. During the system development phase, it may be administered manually, e.g., during usability testing or other validation activities. It is in these situations, where a facilitator elicits verbal responses or the participant responds with pen and paper, that otherwise hidden logistics issues may become apparent. Finstad (2006) noted that the language of the SUS does not lend itself well to electronic distribution in international settings. Another issue that has emerged is the insensitivity of 5-point Likert items, as evidenced by response interpolation. In the course of responding to the SUS, participants will not always conform to the boundaries set by the scaling. For example, instead of responding with discrete values such as 3 or 4, a participant may respond with 3.5 verbally or make a mark on a survey sheet between 3 and 4. This interpolation may also be implicit, e.g., saying “between 3 and 4” with no exact value. From a scoring perspective, the administrator has a number of options, such as requesting that participants limit their responses to discrete integers. This puts the burden on participants to conform to an item that does not reflect their true intended responses. The administrator might also leave the responses as-is and introduce decimal values into an otherwise integer scoring system. In the case of implicit interpolation, an administrator might specify a value, e.g., assuming 3.5 to be a fair evaluation of what the respondent means by “between 3 and 4.” Additionally, the administrator might force an integer value by rounding the score to the more conservative (i.e., neutral-leaning) side of the item, in this case 3. Note that in this example, information is lost by not using the respondent’s actual data. Even more data are lost with the most conservative option: discarding the response entirely. In any case, without insisting that the respondent choose a discrete value (and thereby forcing data loss), differences will emerge between such a manually administered scale and an electronic one (e.g., equipped with radio buttons) that will not accept interpolated values.
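The scoring options described above can be made concrete. The short Python sketch below is purely illustrative (the function names are ours, not part of the SUS or of this study) and shows three ways an administrator might handle an interpolated response such as 3.5 on a 1-5 item: keep the decimal value, round toward the neutral point, or discard the response.

import math

def keep_decimal(response):
    # Option 1: accept the interpolated value as-is (3.5 stays 3.5),
    # introducing decimals into an otherwise integer scoring system.
    return response

def round_toward_neutral(response, neutral=3):
    # Option 2: force an integer by rounding toward the neutral point,
    # so an interpolated 3.5 is conservatively scored as 3.
    if response > neutral:
        return math.floor(response)
    if response < neutral:
        return math.ceil(response)
    return neutral

def discard(response):
    # Option 3: the most conservative choice, dropping the response entirely.
    return None

print(round_toward_neutral(3.5))  # 3
print(round_toward_neutral(2.5))  # 3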

The issue of data lost in this fashion, i.e., unrecorded due to the mismatch of the item to the respondent’s true subjective rating, has been touched upon in previous research. Russell and Bobko (1992) found that 5-point Likert items were too coarse a method for accurately gathering data on moderator effects. Instead, items approximating a more continuous distribution dramatically increased effect sizes as detected by moderated regression analysis. Essentially, the 5-point items were unable to capture the subtle degrees of measure the participants wanted to express. While some may argue that simpler items are motivated by potential issues with reliability, Cummins and Gullone (2000) made a case for higher-valued Likert items based on a lack of empirical evidence that expanded-choice Likert items are less reliable. Their final recommendation was a move toward 10-point items, because reliability and validity are not adversely affected by this expansion. Higher-order scales beyond this, however, can present complications. Nunnally (1978) also argued for higher-order scales based on reliability: adding scale steps produces a rapid increase in reliability, but the gains begin to plateau at around 7 steps and offer little further increase beyond 11. Preston and Colman (2000) found that respondent test/retest reliability suffered in scales with more than 10 options. However, there are also arguments that 7-point items may be optimal. Miller (1956, p. 4) noted that “psychologists have been using seven-point rating scales for a long time, on the intuitive basis that trying to rate into finer categories does not really add much to the usefulness of the ratings.” Lewis (1993) found that 7-point scales resulted in stronger correlations with t-test results. Diefenbach, Weinstein, and O’Reilly (1993) investigated a range of Likert items, including 2-point, 5-point, 7-point, 9-point, 11-point, 12-point, and percentage (100-point) varieties. Subjective evaluations were measured, namely how easy the items were to use and how accurate they were perceived to be, i.e., the match between the items and the participant’s true evaluation. Quantitatively, the Likert items were evaluated via a booklet of questions about personal health risks, the scaled responses to which were compared to the participants’ rankings of 12 health risks at the beginning of the study. The 7-point item scale emerged as the best overall: 7-point items produced among the best direct ranking matches and were reported by participants as being the most accurate and the easiest to use. For comparison, the 100-point item scale performed well in direct ranking matches and test/retest reliability, but did not reach the 7-point item’s high marks for ease of use and accuracy. The 5-point item scale was slightly poorer than the 7-point item scale on all criteria, and significantly worse on subjective opinions. Essentially, it was shown that “No scale performed significantly better than the seven-point verbal category scale on any criterion” in the two studies conducted (Diefenbach et al., 1993, p. 189).

At a more general level, a comprehensive review of response alternatives was undertaken by Cox (1980). The review covered information theory and metric approaches as the most prevalent means for determining the optimal number of responses in an item. From information theory come the concepts of bits (binary units) and channel capacity (Hmax), a monotonically increasing measure of the maximum amount of information in an item. An associated measure is H(y), response information, which indicates how much information is obtained by the responses to an item (Cox, 1980). H(y) has been empirically shown to increase, although at a slower rate than Hmax, as more response alternatives are made available. Although experiments organized around this information-theoretic approach have not provided conclusive evidence regarding the optimal number of item alternatives (Cox, 1980), some concepts have proven useful in metric approaches. One example is channel capacity, which, usually through correlational reliability analysis, corresponds to the maximum variation that can be accounted for (r²). Like Hmax, r² increases monotonically and demonstrates that smaller sets of response alternatives return less information. Symonds’ work on reliability (as cited by Cox, 1980, p. 407) led him to conclude that seven was the optimal number of alternatives for items. At the end of the review, Cox concluded that the ideal number of item alternatives seemed to be centered on seven, with some situations calling for as few as five or as many as nine. Also of importance was the finding that an odd number of alternatives, i.e., one allowing for a neutral response, was preferable (Cox, 1980).
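Assuming the standard Shannon definitions (this sketch is an illustration, not part of Cox’s review), Hmax for an item with k equally likely alternatives is log2(k) bits, and H(y) is the entropy of the observed response distribution, which can never exceed Hmax:

import math
from collections import Counter

def h_max(num_alternatives):
    # Channel capacity Hmax in bits: the maximum information an item can
    # carry, reached only when all alternatives are used equally often.
    return math.log2(num_alternatives)

def h_y(responses):
    # Response information H(y) in bits: the entropy of the observed
    # distribution of responses to an item.
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(h_max(5))                             # about 2.32 bits for a 5-point item
print(h_max(7))                             # about 2.81 bits for a 7-point item
print(h_y([3, 3, 4, 4, 4, 5, 2, 3, 4, 5]))  # about 1.85 bits, below h_max(5)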

For the purposes of this investigation, one finding in particular stands out. Osgood, Suci, and Tannenbaum (1957) reported that, in the course of running studies with a variety of response alternative possibilities, seven emerged as their top choice. With 9-point items, the three discriminative steps on either side of the neutral option (and between the anchors) were used at consistently low frequencies. With 5-point Likert items, participants were irritated by the categorical nature of the options. Prior to the advent of electronic survey methods distributed without a facilitator, this may not have presented much of a logistical problem: a facilitator can remind participants of the constraints of the instrument and make decisions about coding in the analysis phase of a study based on a participant’s response. In an electronic setting, survey responses commonly take the form of radio button controls for each number. When participants are confronted with a set of discrete options that are not aligned with their true subjective evaluation, data loss occurs because the instrument is not sensitive enough. For example, a response intended to be 3.5 loses half a point of data as the participant is forced to choose either 3 or 4. Consequently, perhaps the ideal Likert item is the one that gathers just the right amount of information (i.e., is as compact and easy to administer as possible) without causing respondents to interpolate in manually administered surveys or alter their choices in electronic ones. It is from this perspective that the following experiment was developed.
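One way to make the data-loss argument concrete is to treat the forced choice as quantization error: the distance between a respondent’s intended rating and the nearest option the instrument will accept. The hypothetical Python sketch below illustrates the half-point example given above; the function names are ours.

def nearest_point(intended, scale_points):
    # The discrete option a radio-button item forces the respondent to pick
    # (ties resolve to the first, i.e., lower, option here).
    return min(scale_points, key=lambda p: abs(p - intended))

def data_lost(intended, scale_points):
    # Distance between the intended rating and the recorded response.
    return abs(nearest_point(intended, scale_points) - intended)

five_point = range(1, 6)
print(nearest_point(3.5, five_point))  # 3
print(data_lost(3.5, five_point))      # 0.5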
