
The Effect of Experience on System Usability Scale Ratings

Sam McLellan, Andrew Muddimer, and S. Camille Peres

Journal of Usability Studies, Volume 7, Issue 2, February 2012, pp. 56 - 67

Method and Process

The following sections discuss the method, evaluation measures, participants, and results of our study.

Method

The System Usability Scale (SUS) is a simple, widely used 10-statement survey developed by John Brooke while at Digital Equipment Corporation in the 1980s as a “quick-and-dirty” subjective measure of system usability. The tool asks users to rate their level of agreement or disagreement with 10 statements (half worded positively, half negatively) about the software under review. For reporting results, we used a scoring template that turns the raw individual survey ratings across multiple users of a specific software product into a single SUS score based on Brooke’s standard scoring method: each statement rating is converted to a common 0-4 scale, and the sum of the converted ratings is multiplied by 2.5 to yield a score that can range from 0 to 100. We used such tools for reviews regardless of whether we were looking at interface designs or implementations.
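To illustrate Brooke’s scoring rule described above, the following sketch (a minimal, hypothetical helper written for this illustration, not the actual scoring template we used) converts one respondent’s ten 1-5 ratings into a single 0-100 SUS score.

```python
# Sketch of Brooke's standard SUS scoring: odd-numbered (positively worded)
# statements contribute (rating - 1), even-numbered (negatively worded)
# statements contribute (5 - rating), and the sum of the ten 0-4
# contributions is multiplied by 2.5.

def sus_score(ratings):
    """Compute a 0-100 SUS score from ten 1-5 Likert ratings, ordered by statement."""
    if len(ratings) != 10 or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("Expected ten ratings, each between 1 and 5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0, 2, ... are statements 1, 3, ...
        for i, r in enumerate(ratings)
    ]
    return sum(contributions) * 2.5

# A respondent who strongly agrees with every positive statement and
# strongly disagrees with every negative one scores 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # -> 100.0
```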

The results of our study were from one 2009 testing cycle for two related products from the same suite: one with a Web-based frontend and the other, a desktop application. The SUS questionnaire was administered by one of our product commercialization teams and the associated deployment team—teams responsible for conducting internal testing and training or coordinating external prerelease (beta) testing with customers. The SUS was given to users at the end of an iteration period, which could last one week or much longer.

The SUS surveys were provided in English for these tests. Because both the internal and external user populations include non-native English speakers from many countries, we asked users up front to tell us if any part of the survey instrument was unclear or confusing, and we examined individual user scores after the test for any potential problems resulting from misunderstanding or inadvertent miscues.

The SUS survey included requests for demographic information from users: their name, their company, their job role, the software being evaluated, the software version, date of the user’s evaluation, duration of the evaluation, and the user’s experience using the software. The survey then provided the following 10 standard statements with 5 response options (5-point Likert scale with anchors for Strongly agree and Strongly disagree):

  1. I think that I would like to use this system frequently
  2. I found the system unnecessarily complex
  3. I thought the system was easy to use
  4. I think that I would need the support of a technical person to be able to use this system
  5. I found the various functions in this system were well integrated
  6. I thought there was too much inconsistency in this system
  7. I would imagine that most people would learn to use this system very quickly
  8. I found the system very cumbersome to use
  9. I felt very confident using the system
  10. I needed to learn a lot of things before I could get going with this system

Measure

Typically, to evaluate the SUS responses, we look at the means and standard deviations of the user responses for a specific system. We then color code individual responses in the scoring template to help visualize positive, neutral, and negative responses, accounting for the alternating positive-then-negative makeup of the statements.

For cases with responses that were remarkably high or low, we contacted users directly and reviewed their responses with them to confirm their intentions were correctly captured. Figure 1 shows a representative example: one user (User Ev.5) had an overall SUS rating far lower than all others on the same product. The user had responded as if all statements were positively worded, despite prior instructions on filling out the survey.

We used a simple color-coding scheme so that we could more easily compare ratings for an individual user or across users. For positively worded statements, where a higher number means a higher rating, we assigned green to 5 or 4, yellow to 3, and orange to 2 or 1. For negatively worded statements, the color codes were reversed: orange for 5 or 4, yellow for 3, and green for 2 or 1. As seen in Figure 1, every other statement for this user has an orange color code indicating a negative rating, a possible sign that the user forgot to reverse their responses for the negatively worded statements. This is consistent with Sauro’s observation that users sometimes “think one thing but respond incorrectly” (2011a, p. 102). Sauro and Lewis found that approximately 13% of SUS questionnaires likely contain mistakes (2011); similarly, 11% of our SUS questionnaires likely contained mistakes. In cases where we suspected a SUS score was in error, we contacted the individual users to go over their responses with them. In the case shown below, the user’s positive overall comment praising the product’s usability also made us question the user’s SUS score, and this was verified when we spoke to the user over the phone later.
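The color coding and the miscue check can be sketched as follows. The function names and the flagging heuristic below are illustrative assumptions of ours, not the spreadsheet template we actually used; they simply encode the scheme described above.

```python
# Color coding: positive statements (1, 3, 5, 7, 9) map 4-5 -> green,
# 3 -> yellow, 1-2 -> orange; the mapping is reversed for the negative
# statements (2, 4, 6, 8, 10).

def color_code(statement_number, rating):
    positive = statement_number % 2 == 1
    if rating >= 4:
        return "green" if positive else "orange"
    if rating == 3:
        return "yellow"
    return "orange" if positive else "green"

def looks_like_miscue(ratings):
    """Heuristic flag (our assumption): the respondent agreed strongly with
    every statement, positive and negative alike, as if all statements were
    positively worded."""
    return all(r >= 4 for r in ratings)

# The kind of pattern seen in Figure 1: uniformly high agreement, which
# renders as alternating green/orange and yields a misleadingly low score.
ratings = [5, 5, 4, 5, 5, 4, 5, 5, 4, 5]
print([color_code(i + 1, r) for i, r in enumerate(ratings)])
print(looks_like_miscue(ratings))  # -> True: worth a follow-up call
```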

Figure 1. Example of user miscue in SUS scoring instrument

Participants

Participants were actual users of our software. A total of 262 users responded, 190 for the first product and 72 for the second. Prior to familiarizing themselves with and using the new product software versions, all users were asked to identify themselves as one of the following:

  1. Someone who had never used the software before
  2. Someone who had some experience using the software
  3. Someone who had extensive experience using the software
Figure 2 shows the experience level of the users tested. Approximately the same number of users from different locations were given the SUS after a set period of training and subsequent testing and use of the product.

Figure 2. Number of users in each experience level for both products

Results

A 3 x 2 (Experience: Extensive, Some, Never; Product: One, Two) between-subjects factorial ANOVA was conducted to determine the effects of experience and product type on usability ratings. As seen in Figure 3, SUS scores increased with experience level, and this effect was significant, F(2, 256) = 15.98, p < 0.001, η² = 0.11. There was no main effect of product, F(1, 256) = 3.57, p = 0.06, nor was there an interaction between product and experience, F(2, 256) = 0.46, p = 0.63. Table 1 provides the results of Tukey’s HSD pairwise comparisons for post-hoc analysis. This table shows that the Extensive group had higher ratings than both the Never and Some groups (both p < 0.001), and that there was no significant difference between the Some and Never groups (p = 0.117).

Table 1. Results of Pairwise Comparison for Three Different Experience Levels

Figure 3. SUS scores across products and experience levels: There was a main effect of experience but no effect of product or interaction between experience and product. Error bars represent the 95% confidence interval.
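An analysis of this kind could be reproduced, for example, with pandas and statsmodels. The sketch below is not our analysis script; the file name and column names are assumptions made for illustration, with one row per respondent.

```python
# Sketch of a 3 x 2 between-subjects ANOVA and Tukey HSD follow-up.
# Assumes columns: sus (0-100 score), experience ("Never", "Some",
# "Extensive"), and product ("One", "Two").
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("sus_responses.csv")  # hypothetical file of per-user SUS scores

# Two-way ANOVA: main effects of experience and product, plus their interaction.
model = smf.ols("sus ~ C(experience) * C(product)", data=df).fit()
print(anova_lm(model, typ=2))

# Post-hoc pairwise comparison of the three experience levels (Tukey's HSD).
print(pairwise_tukeyhsd(df["sus"], df["experience"]))
```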

 

Previous | Next