
Rent a Car in Just 0, 60, 240 or 1,217 Seconds? Comparative Usability Measurement, CUE-8

Rolf Molich, Jarinee Chattratichart, Veronica Hinkle, Janne Jul Jensen, Jurek Kirakowski, Jeff Sauro, Tomer Sharon, Brian Traynor

Journal of Usability Studies, Volume 6, Issue 1, November 2010, pp. 8 - 24

The following sections discuss why CUE-8 is not a scientific experiment, the measurement approaches, computing time-on-task, reporting uncertainty in time-on-task, qualitative results, reproducibility of results, participant profiles, satisfaction measurements, handling failed tasks, productivity, contaminated data, measuring time-on-task, and the usability of a remote tool.

CUE-8 Is Not a Scientific Experiment

From previous CUE studies (Molich, 2009) it is known that there is a wide range of approaches by professional teams when undertaking qualitative evaluations. The main motivation for CUE-8 was to see to what extent there was variation in approach to quantitative measurement.

The teams who participated were essentially a convenience sample. They were not recruited specifically to provide either best practices or state-of-the-art measurement techniques. For this reason we abandoned one of our original goals, to investigate whether usability measurements are reproducible. Nevertheless, there was a good mixture of qualifications between teams within CUE-8. Although the range of variation was wide, it could well be wider if a more systematic random sample of teams over the world was taken, but such a comprehensive systematic study would most likely be cost prohibitive.

The results from CUE-8 cannot therefore be generalised or summated into averages with sampling confidence intervals to produce overall trends. Methodological purity of this kind is not accessible in the real world. What we present is essentially 15 separate case studies showing 15 different approaches to the quantitative measurement and reporting of time, performance, and satisfaction. The benefit of CUE-8 is that in this area of quantitative evaluation, we can comment on what appear to us to be the strengths and weaknesses within each case study and present them as takeaways.

Measurement Approaches

Team C discouraged participants from thinking aloud. Teams K and N, on the other hand, explicitly asked participants to think aloud. At the workshop it was discussed whether thinking aloud increases or decreases total time-on-task. Some argued that the mental workload increases when thinking aloud, thus causing additional problems to arise. It may also impact task completion time, as participants tend to occasionally pause their task solving to elaborate on an issue. Others argued that thinking aloud forces participants to consider their moves more carefully, thus decreasing time-on-task. This would be an interesting variable to investigate further.

A takeaway from the study was that instructions and tasks must be precise and exhaustive for unmoderated, quantitative studies because there is no moderator to correct misunderstandings. Even in moderated studies the moderator should not have to intervene because this influences task time. Unfortunately, this often means that tasks get quite long, so participants don't read all of the instructions or tasks. For example, in order to provide the necessary details, tasks 1 and 4 became so wordy that some participants overlooked information in the tasks. Some teams declared measurements from misinterpreted tasks invalid; others did not report their procedures. Instructions and tasks should be tested carefully in pilot tests.

Our study shows that task order is critical; there is a substantial learning effect, as shown by the two teams who repeated tasks 1 and 2 after tasks 1-5.

A few teams presented the tasks in random order. At the workshop it was pointed out that this conflicts with a reasonable business workflow; task 1 or 2 should be first because the vast majority of users would start with one of these tasks.

Computing Time-on-Task

There is substantial agreement within the measurement community that measures such as time-on-task are not normally distributed because it is common to observe a positive skew in such data, that is, there is a sharp rise from the start to the center point of the distribution but a long tail back from the center to the end. Under such conditions, the mean is a poor indicator of the center of a distribution. The median or geometric mean is often used as a substitute for the mean for heavily skewed distributions (Sauro, 2009). Using the median censors data or discards extreme observations.
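As a rough illustration of why the mean can mislead on positively skewed data, the following Python sketch compares the mean, median, and geometric mean on a small set of task times. The values in `times` are hypothetical and are not CUE-8 data.

```python
import statistics

# Hypothetical task times in seconds (not CUE-8 data); one slow
# participant produces the long right tail described above.
times = [95, 110, 120, 135, 150, 180, 240, 1217]

mean_time = statistics.mean(times)
median_time = statistics.median(times)
# The geometric mean equals exp(mean(log(t))); it requires positive values.
geo_mean_time = statistics.geometric_mean(times)

print(f"mean={mean_time:.1f}s median={median_time:.1f}s "
      f"geometric mean={geo_mean_time:.1f}s")
```

With these made-up values the single slow participant pulls the mean well above both the median and the geometric mean, which is the effect the paragraph above describes.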

There are, as alternatives, a variety of statistical techniques that will "correct" a skewed distribution in order to make it symmetrical and therefore amenable to summary using means and standard deviations. Teams F and G used such an approach. The rest reported time-on-task the way it is usually reported in the HCI literature: untransformed data are the norm.

Reporting Uncertainty in Time-on-Task

At the workshop it was argued that usability practitioners mislead their stakeholders if they do not report confidence intervals. Understanding the variability in point estimates from small samples is important in understanding the limits of small sample studies. Confidence intervals are the best way to describe both the location and precision of the estimate, although the mathematical techniques of computing confidence intervals on sample distributions from non-normal populations are still a matter of controversy in the statistical literature.

In order to compare teams' confidence intervals, all teams must meet the same screening criteria for participants. As discussed in the Participant Profiles section, this was not the case in this study where convenience samples were often employed.

If the sample on which the measures were taken is from a normally distributed population, the mean is a useful measure of the average tendency of the data, and the variance is a useful measure of variability of the data. The confidence interval is a statistic that is derived from the computation of the variance and also assumes normality of population distribution.

Because time-on-task is not normally distributed, means, variances, and confidence intervals derived from variances are possibly misleading ways of estimating average tendency and variability. There are a number of ways around this, as our teams demonstrated: some teams used medians (which are not sensitive to the ends of distributions), and others used a transformation that would "normalize" the distributions mathematically (Sauro, 2009).
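The transformation approach can be sketched as follows, assuming a natural-log transformation; which transformation each team actually used is not stated here. The function name `log_confidence_interval`, the sample data, and the choice to pass the t critical value in as a parameter are all assumptions made for illustration.

```python
import math
import statistics

def log_confidence_interval(times, t_crit):
    """Return (geometric mean, lower, upper) from a CI computed on log times.

    t_crit is the two-sided Student t critical value for len(times) - 1
    degrees of freedom; the caller supplies it from a t table.
    """
    logs = [math.log(t) for t in times]
    n = len(logs)
    mean_log = statistics.mean(logs)
    std_err = statistics.stdev(logs) / math.sqrt(n)
    half_width = t_crit * std_err
    # Back-transform to seconds; the interval is asymmetric around the center.
    return (math.exp(mean_log),
            math.exp(mean_log - half_width),
            math.exp(mean_log + half_width))

# Hypothetical task times in seconds (not CUE-8 data).
times = [95, 110, 120, 135, 150, 180, 240, 1217]
# 2.365 is the tabulated 95% two-sided t value for 7 degrees of freedom.
center, lower, upper = log_confidence_interval(times, t_crit=2.365)
print(f"geometric mean={center:.0f}s, 95% CI=[{lower:.0f}, {upper:.0f}]s")
```

The back-transformed interval is centered on the geometric mean rather than the arithmetic mean, and its upper arm is longer than its lower arm, which mirrors the positive skew of the raw times.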

Qualitative Results

As shown in Table 2, four of the reports included 10-20 qualitative usability findings. This seems to strike a useful balance between reporting informally a few quantitative findings, which 6 teams did, and reporting a high number of qualitative findings, which teams L and N did (68 and 79 findings, respectively).

Reproducibility of Results

Did the teams get the same results? The answer is no, but the reported measurements from several teams—sometimes a majority—agree quite well as you can see from Figure 2.

Eyeballing shows that the results from six teams (A, E, F, J, K, and M) were in reasonable agreement for all five tasks. Two more teams (B and L) agreed with the six teams for all tasks except task 1. Two teams (D and O) agreed with the majority for three tasks. On the other hand, five teams mostly reported diverging results. Teams H and N consistently diverged from the other teams.

We examined the overlap between confidence intervals for any pair of teams for each task. Overlap scores were computed based on the overlap between confidence intervals in percent. For example, for task 1 team B reported the confidence interval [103, 163], while team E reported [145, 258]. Team B's overlap score is (163-145)/(163-103) = 30%, while team E's comparable score is (163-145)/(258-145) = 16%. The total overlap score for a team is the sum of the team's 14*5=70 overlap scores for the 14 other teams and 5 tasks. This scoring method favors teams that reported narrow confidence intervals that overlap with the wider intervals of many other teams, which seems fair. Team L did best by this scoring method, followed by teams J and M. See the complete results in Figure 3.
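The pairwise overlap score from the worked example can be sketched in Python as follows. The function name `overlap_score` is hypothetical, and clamping non-overlapping intervals to zero is an assumption; the text only shows a partially overlapping pair.

```python
def overlap_score(own, other):
    """Share of the `own` interval covered by `other`, as a fraction 0..1.

    Intervals are (low, high) tuples. Clamping disjoint intervals to 0
    is an assumption; the text only shows a partially overlapping pair.
    """
    own_low, own_high = own
    other_low, other_high = other
    shared = min(own_high, other_high) - max(own_low, other_low)
    return max(0.0, shared) / (own_high - own_low)

team_b = (103, 163)  # task 1 interval reported by team B
team_e = (145, 258)  # task 1 interval reported by team E

print(f"B vs E: {overlap_score(team_b, team_e):.0%}")  # 30%
print(f"E vs B: {overlap_score(team_e, team_b):.0%}")  # 16%
```

Because the shared width is divided by the team's own interval width, the score is not symmetric: the same 18-second overlap counts for more against team B's narrow interval than against team E's wide one.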

Figure 3

Figure 3. Overlap between findings: This scoring attempts to quantify the eyeballing "Yeah, I think the overlap between teams J, K, L, and M is pretty good."

An analysis of the teams’ approaches reveals the following sources for diverging results:

Participant Profiles

Recruiting was an important reason that some teams reported diverging results. Not all teams seemed to use strict participant screening criteria; some used convenience samples.

The following are examples of questionable recruiting:

Because we had no contact with Budget, it was not technically possible to recruit people who were actually visiting the site.

Satisfaction Measurements

As with many of the metrics collected, there was variability in the SUS scores the teams reported. The SUS scores are shown in Figure 4. An analysis of overall variance shows that there is a statistical difference between SUS scores, F(6,451)=6.73, p <.01, which can be attributed to variation between teams. There are two groups of scores that differ significantly from each other (between each group), but which do not differ statistically within each group.

One cluster of four teams (B, K, L, and G) generated SUS scores within 7-10% of each other (73, 77, 78, and 80). The other cluster of three teams (M, H, and P) generated scores within 4-6% of each other (62, 65, and 66). Table 3 shows the mean and standard deviation for the teams' SUS scores in the two clusters.

Figure 4

Figure 4. Reported SUS scores for the seven teams that used the original SUS measurement

Table 3. Mean SUS Scores Organized by Clusters of Scores, Standard Deviation (SD)

Table 3

Other researchers (Bangor, Kortum, & Miller, 2008) have pointed out that SUS has been shown to be positively skewed with an "average" score for websites and web applications of approximately 68 out of 100 (with a standard deviation of 21.5). The average SUS score for the first cluster of 77.1 suggests Budget.com is a better-than-average website, falling in the 66th percentile of the Bangor et al. dataset. The average SUS score for the second cluster of 64.5 suggests Budget.com is a worse-than-average website, falling in the 43rd percentile of the Bangor et al. dataset.
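The article does not say how these percentiles were derived. As a rough plausibility check only, the sketch below treats the benchmark as a normal distribution with the quoted mean and standard deviation, which is an assumption (the paragraph above notes that SUS scores are skewed). The name `sus_percentile` is hypothetical.

```python
from statistics import NormalDist

# Benchmark values quoted above from Bangor et al.; treating the benchmark
# as normally distributed is an assumption made only for this sketch.
benchmark = NormalDist(mu=68, sigma=21.5)

def sus_percentile(score):
    """Approximate percentile (0..100) of a SUS score against the benchmark."""
    return 100 * benchmark.cdf(score)

for cluster_mean in (77.1, 64.5):
    print(f"SUS {cluster_mean}: about {sus_percentile(cluster_mean):.1f}th percentile")
```

Under this approximation both cluster means land within about one percentile point of the figures quoted above.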

It is unclear whether or not the differences observed in the SUS scores are a reflection of SUS being inadequate for measuring websites. It is likely that many of the observed differences occurred due to the different participants and evaluation procedures used by the teams.

The SUS scores can be contrasted with the score from the one team who used WAMMI (team A). The Budget.com WAMMI Global User Satisfaction score was in the 38th percentile, suggesting that it is a below-average website (the industry average is at the 50th percentile). Because there was only a single WAMMI data point, it is not possible to know how much more or less WAMMI scores would fluctuate compared to the SUS scores.

Handling Failed Tasks

Ten out of 15 teams chose to include time-on-task for tasks where participants gave up or obtained an incorrect answer in their calculations of mean or median time-on-task. For the failed tasks they used the time until the participant gave up. At the workshop, it was successfully argued that these figures are incompatible. The time until a participant gives up or finds an incorrect result is irrelevant for time-on-task; reported time-on-task should include only data from successfully completed tasks. Failed times are still useful as their own metric called average time to task failure. If you only report one measure, then report task completion times and exclude failure.

Some teams argued that three separate results are of equal importance: time for successful completion, success rate (or failure rate), and disaster rate—that is, the percentage of participants who arrive at a result they believe in but which is incorrect.
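A minimal sketch of reporting these results separately is shown below. The observation format, the outcome labels, and the function name `summarize` are assumptions made for illustration, and the data are hypothetical rather than taken from CUE-8.

```python
import statistics

# Hypothetical observations (not CUE-8 data): (seconds, outcome), where
# outcome is "success", "gave_up", or "wrong_answer". A "wrong_answer" is a
# result the participant believed in but which was incorrect (a "disaster").
observations = [
    (120, "success"), (150, "success"), (300, "gave_up"),
    (90, "wrong_answer"), (180, "success"),
]

def summarize(observations):
    """Summarize outcomes; assumes at least one success and one failure."""
    success_times = [t for t, outcome in observations if outcome == "success"]
    failure_times = [t for t, outcome in observations if outcome != "success"]
    disasters = [t for t, outcome in observations if outcome == "wrong_answer"]
    total = len(observations)
    return {
        # Time-on-task uses successful completions only.
        "median_time_on_task": statistics.median(success_times),
        # Failed attempts are reported as their own metric.
        "mean_time_to_failure": statistics.mean(failure_times),
        "success_rate": len(success_times) / total,
        "disaster_rate": len(disasters) / total,
    }

summary = summarize(observations)
print(summary)
```

Keeping failure times out of the time-on-task figure follows the workshop conclusion above, while the separate time-to-failure and disaster-rate entries preserve the information that would otherwise be lost.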

It is useful to differentiate between task success and failure in satisfaction results. For instance one may report that "X% of participants who successfully completed task 2 gave a satisfaction score higher than Y%."

Some teams opted for using a binary code for task success, whereas other teams used error-based percentages to classify success (0/50/100% or even 0/25/50/75/100%).

Productivity—Team Hours Spent per Participant

In examining the average times spent on moderated versus unmoderated or hybrid studies (60 hours vs. 37 hours), there was a surprising amount of overhead for unmoderated tests.

Productivity varied remarkably from 4 minutes per participant (team L) to 9 hours per participant (team B). Median productivity was 2:22 hours per participant.

Unmoderated studies pay off when the number of participants gets large. For example, team L ran 313 unmoderated participants in 21 hours, which is about 4 minutes per participant, whereas team D ran 14 unmoderated participants in 28 hours, which is about 2 hours per participant, similar to the moderated test ratio. Team L had both the largest number of participants (313) and the fewest person-hours used (21). This impressive performance, however, came at a cost as described in the next section.
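The per-participant figures above follow from a simple ratio, sketched here with the numbers quoted for teams L and D. The function name `hours_per_participant` is hypothetical.

```python
def hours_per_participant(team_hours, participants):
    """Team hours spent divided by the number of participants."""
    return team_hours / participants

# Figures quoted above for two unmoderated studies.
team_l = hours_per_participant(team_hours=21, participants=313)
team_d = hours_per_participant(team_hours=28, participants=14)

print(f"Team L: {team_l * 60:.1f} minutes per participant")
print(f"Team D: {team_d:.1f} hours per participant")
```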

Cleaning Contaminated Data, or Killing the Ugly Ducklings

Teams who used unmoderated sessions all reported some unrealistic measurements.

Table 2, row "Minimum time," shows that few observed participants were able to complete the rental task in anywhere near 60 seconds. Teams agreed that it was impossible even for an expert who had practiced extensively to carry out the reservation task (task 1) in less than 50 seconds. Yet, team H reported a minimum time of 0 seconds for successful completion of this task; 22 of their 57 measurements were below 50 seconds. Team L reported a minimum time of 18 seconds for successful completion of the same task; 6 of their 305 measurements for this task were below 50 seconds.
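One way to apply the 50-second floor the teams agreed on is sketched below. The function name `flag_implausibly_fast` and the sample times are hypothetical, and the sketch flags measurements for review rather than deleting them, which is a design choice made here and not a procedure reported by any team.

```python
def flag_implausibly_fast(times, floor_seconds=50):
    """Split times into (plausible, flagged) using a minimum-time floor.

    The 50-second default is the floor the teams agreed on for task 1;
    flagged values are kept for review rather than silently dropped.
    """
    plausible = [t for t in times if t >= floor_seconds]
    flagged = [t for t in times if t < floor_seconds]
    return plausible, flagged

# Hypothetical task 1 times in seconds (not CUE-8 data).
task1_times = [0, 18, 47, 62, 131, 240, 1017]
plausible, flagged = flag_implausibly_fast(task1_times)
print(f"{len(flagged)} of {len(task1_times)} measurements flagged: {flagged}")
```

A floor of this kind only catches measurements that are too fast; it says nothing about whether the remaining values, including very slow ones, are valid.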

Some of the teams decided to discard measurements that appeared to be too fast or too slow; in other words, they decided to "kill the ugly ducklings." For example, the cleaning procedure used by team L was to delete

It is not clear from the reports how teams came up with criteria for discarding measurements. Apparently, the criteria were based on common sense rather than experience or systematic studies. Team M used click path records to check measurements that looked suspicious; they also discarded measurements where "the test tool must have encountered a technical problem capturing page views and clicks across all tasks."

The teams hypothesized that participants had either guessed or pursued other tasks during the measurement period. However, by discarding data based solely on face value, teams admitted that their data were contaminated in unknown ways. It could then be argued that other data that appeared valid at first glance were equally contaminated. Example: Team F analyzed the data from their unmoderated videos and found measurements that appeared realistic but were invalid. They also found a highly suspicious measurement where the participant used almost 17 minutes to complete the rental task, which turned out to be perfectly valid; the participant looked for discounts on the website and eventually found a substantial discount that no one else discovered.

Measuring Time-on-Task

In both moderated and unmoderated testing it is difficult to compensate for the time used by the participant to read the task multiple times while solving the task. In unmoderated testing it is difficult to judge if the participant has found the correct answer unless the study includes video recordings or click maps, which may take considerable time to analyze. Multiple-choice questions are an option, which was used by some teams as shown in Figure 5. However, some of the answers changed during the period where the measurements occurred, making all choices incorrect, and some participants might have been able to guess the right answer from the multiple-choice list.

At the workshop, team M argued that automated unmoderated tools such as Userzoom, which they used, also allowed them to validate answers by URL (reaching a certain page as a way to validate the successful or unsuccessful completion of a task). Team M argued that they had more issues with moderated sessions where moderators and participants were discussing issues while the clock was measuring time-on-task.

Figure 5

Figure 5. Multiple-choice answers for tasks 1 and 2 used by team L for determining if their participants had obtained the right answer in unmoderated sessions. For task 1 participants were asked to find the label of the button that performed the rental; the correct answer is "rent it!" For task 2 the answer varied. Most often the rental price was in the $176-$200 range, but on some days it was more than $200.

The discussion at the workshop concluded that it is not completely clear what you are measuring as time-on-task in an unmoderated session: Time to complete task? Time to comprehend and complete task? Time to comprehend and complete task plus time to remember to click the "Task finished" button? Time to complete test task and other parallel tasks? Also, irrelevant overhead varied considerably from participant to participant. Some grasped a task almost instantly, some printed the task, and some referred back to the task description again and again. In moderated sessions, in comparison, the moderator ensures some consistency in measuring time-on-task.

Usability of Remote Tool

The ease of use of the remote tool, the clarity of the instructions, etc., has a considerable impact on unmoderated participants' performance. For example, one of the teams used a tool that hid the website when participants indicated that they had completed a task; this made it unrealistically difficult for participants to answer the follow-up questions that checked whether or not the task was completed correctly.
