The Magazine of the Usability Professionals' Association
By Tom Tullis
In our user experience team at Fidelity Investments, we’ve conducted more than forty unmoderated remote usability tests over the past five years. We use them as an adjunct to traditional lab tests and remote, moderated usability tests. We’ve found that unmoderated remote tests reveal usability differences between design solutions that typical lab tests generally don’t detect. The advantage of the unmoderated remote tests lies in the sheer number of participants. We usually have at least 500 participants in just a few days when we can use our own employees as participants in these tests, and it’s not uncommon to have over 1,000 participants. When performing evaluations with panels of our customers, we commonly have at least 200 participants in a week. Samples of this size yield a large amount of data. We routinely get statistically significant differences in task completion rates, task times, and subjective ratings when comparing alternative designs. Even what appears to be a minor design difference (e.g., a different phrase to describe a single link on a website) can yield significant differences in usability measures.
The best way to describe unmoderated remote usability tests is with an example, so I devised a test comparing two Apollo space program websites: the official NASA site (Figure 1) and the Wikipedia site (Figure 2).
Figure 1: Apollo program home page on NASA
Figure 2: Apollo program home page on Wikipedia
Participants in the study were randomly assigned to use only one of these sites. Most of the unmoderated remote studies I’ve conducted use this “between-subjects” design, where each participant uses only one of the alternatives being tested.
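Random assignment in a between-subjects design can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name `assign_site` and the fixed seed are assumptions made for the example.

```python
import random

# The two conditions compared in this study.
SITES = ["NASA", "Wikipedia"]


def assign_site(rng: random.Random) -> str:
    """Assign a participant to exactly one site (between-subjects)."""
    return rng.choice(SITES)


# A fixed seed is used only so the example is repeatable (an assumption).
rng = random.Random(2008)
assignments = [assign_site(rng) for _ in range(10)]
print(assignments)
```

With simple random assignment like this, the groups usually end up similar in size but not identical.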
The next step was to develop tasks for the participants. I developed a set of candidate tasks before studying either site based on my own knowledge of the Apollo program. I then eliminated any tasks that I couldn't find the answer to on both sites. That left nine tasks:
The best tasks have clearly defined correct answers. In this study, the participants chose the answer to each question from a dropdown list. We’ve also used free-form text entry for answers, but the results are more challenging to analyze.
We design most of our unmoderated remote usability studies so that most participants can complete them in under thirty minutes. One way to keep the time down is to randomly select a smaller number of tasks from the full set. Across many participants, this gives us good task coverage while minimizing each participant’s time. We gave each participant four randomly selected tasks out of the full set of nine, presented in a random order to minimize order effects.
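Selecting four of the nine tasks and presenting them in random order could look like the following minimal sketch. The task numbering and the helper name `pick_tasks` are hypothetical; only the numbers four and nine come from the study.

```python
import random

ALL_TASKS = list(range(1, 10))  # task numbers 1 through 9
TASKS_PER_PARTICIPANT = 4


def pick_tasks(rng: random.Random) -> list[int]:
    """Return four distinct tasks in random order.

    random.sample draws without replacement and returns the items in
    selection order, so the result is already randomly ordered.
    """
    return rng.sample(ALL_TASKS, TASKS_PER_PARTICIPANT)


rng = random.Random()  # unseeded: each participant gets a fresh draw
print(pick_tasks(rng))
```

Across many participants, each task is drawn about equally often, which is what gives good coverage of the full set.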
When a potential participant went to the starting page (http://www.webusabilitystudy.com/Apollo/), an overview of the study was displayed. When the user clicked "Next," a set of instructions was shown. As explained in those instructions, when the user clicked "Begin Study," two windows opened, filling the screen (Figure 3).
Figure 3: Screen and window configuration for an unmoderated remote usability study.
The small window at the top presents the tasks to perform; the larger window presents one of the two sites being evaluated. The users were free to use any of the features of the site; however, they were instructed not to use any other sites to find the answers (e.g., Google).
Each task included a dropdown list of possible answers, including "None of the above" and "Give Up." Three to six other options were listed, one of which was the correct answer to the question. We required the user to select an answer (which could be "Give Up") to continue to the next task. The participant was also asked to rate the task on a 5-point scale ranging from "Very Difficult" to "Very Easy." We automatically recorded the time required to select an answer for each task, as well as the answer given.
After attempting all four tasks, we asked the participant to rate the site on two seven-point scales, Ease of Finding Information and Visual Appeal, each of which had an associated comment field.
We vary these rating scales from one study to another depending on the sites being tested and the study goals. We followed with two open-ended questions about any aspects of the website they found particularly challenging or frustrating, and any they thought were particularly effective or intuitive. We use these questions in most of our usability studies.
We also modified the System Usability Scale (SUS) to help evaluate websites. The original version of SUS was developed by John Brooke while working at Digital Equipment Corporation in 1986. We instructed participants to select the response that best describes their overall reactions to the website using each of ten rating scales (e.g., “I found this website unnecessarily complex,” or “I felt very confident using this website.”) Each statement was presented along with a 5-point scale of "Strongly Disagree" to "Strongly Agree"; half of the statements were positive and half negative.
The main purpose of the study was to illustrate the testing technique, not to seriously evaluate these particular sites. We posted a link to the unmoderated remote study on several usability-related email lists, and collected data from March 11 - 20, 2008. Many of the participants in the study work in the usability field or a related field, so they can't be considered a random sample.
A total of 192 people began the study and 130 (68 percent) completed the tasks in some manner. Undoubtedly, some people simply wanted to see what the online study looked like and were not really interested in taking it.
One of the challenges with unmoderated remote studies is identifying participants who are not really performing the tasks but simply clicking through them, answering randomly or choosing "Give Up." They might only be curious about the tasks or want to enter the drawing. In studies like this, about 10 percent of the participants usually fall into this category.
To identify these participants, I first completed all nine of the tasks myself several times using both sites, having first studied the sites to find exactly where the answers were. The best time I was able to achieve was an average of thirty seconds per task. I then eliminated thirteen participants (10 percent) who had an average time per task of less than thirty seconds, bringing the total number of participants to 117. Of those, fifty-six used the NASA site and sixty-one used the Wikipedia site.
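The screening rule above can be expressed as a simple filter. In this sketch the thirty-second threshold is taken from the text, while the participant IDs and task times are invented purely for illustration.

```python
from statistics import mean

# Threshold from the article: the best average time achieved by someone
# who already knew where every answer was.
MIN_PLAUSIBLE_SECONDS = 30.0


def is_plausible(task_times: list[float]) -> bool:
    """Keep a participant unless their average task time is under the threshold."""
    return mean(task_times) >= MIN_PLAUSIBLE_SECONDS


# Hypothetical data: seconds spent on each of four tasks.
participants = {
    "p01": [45.0, 62.5, 38.0, 51.0],
    "p02": [4.0, 6.5, 3.0, 5.5],      # clicking through
    "p03": [30.0, 30.0, 30.0, 30.0],  # exactly at the threshold, kept
}

kept = {pid: times for pid, times in participants.items() if is_plausible(times)}
print(sorted(kept))
```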
The basic findings of the study were that users of the Wikipedia site:
One way to see an overall view of the task data for each site is to convert the accuracy, time, and rating data to percentages and then average those together. This provides an “overall usability score” for each task that gives equal weight to speed, accuracy, and task ease rating (Figure 4). With this score, if a given task had perfect accuracy, the fastest time, and a perfect rating of task ease, it would get an overall score of 100 percent. These results clearly show that Tasks 3 and 7 were the easiest, especially for the Wikipedia site, and Tasks 4 and 8 were among the most difficult.
Figure 4: Average usability scores for each task and site, with equal weighting for accuracy, speed and task ease.
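One way such a combined score could be computed is sketched below. The article does not spell out how each measure was converted to a percentage, so two conversions here are assumptions: speed is expressed as the fastest time divided by the task's mean time, and the 5-point ease rating is rescaled linearly so that 1 maps to 0 percent and 5 maps to 100 percent. The function name `overall_score` is also hypothetical.

```python
def overall_score(
    accuracy: float,      # proportion of correct answers, 0.0 to 1.0
    mean_time: float,     # mean task time in seconds for this task and site
    fastest_time: float,  # fastest mean time observed across tasks and sites
    mean_rating: float,   # mean task ease rating on the 5-point scale
    rating_min: float = 1.0,
    rating_max: float = 5.0,
) -> float:
    """Average accuracy, speed, and ease rating, each as a percentage."""
    accuracy_pct = accuracy * 100.0
    # Assumption: the fastest time scores 100 percent, slower times score less.
    speed_pct = fastest_time / mean_time * 100.0
    # Assumption: linear rescaling of the rating to 0-100 percent.
    rating_pct = (mean_rating - rating_min) / (rating_max - rating_min) * 100.0
    return (accuracy_pct + speed_pct + rating_pct) / 3.0


# Perfect accuracy, the fastest time, and a perfect ease rating.
print(overall_score(1.0, 40.0, 40.0, 5.0))
# Half the answers correct, twice the fastest time, a mid-scale rating.
print(overall_score(0.5, 80.0, 40.0, 3.0))
```

Because the three components are simply averaged, each one carries equal weight, which matches the equal weighting described for Figure 4.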
After attempting their four tasks, the participants were asked to rate the site they had just used on two scales: Ease of Finding Information and Visual Appeal. The Wikipedia site received a significantly better rating for Ease of Finding Information (p<.01), while the NASA site received a marginally better rating for Visual Appeal (p=.06).
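The article does not say which statistical test produced these p-values. As one possible approach, a two-sided permutation test on the difference in mean ratings needs only the standard library. Everything in this sketch is an assumption for illustration, including the function name `permutation_p_value` and the rating data, which are made up and are not the study's results.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Repeatedly shuffles the pooled ratings, splits them into two groups of
    the original sizes, and counts how often the shuffled difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical 7-point ratings for two sites.
site_a = [6, 7, 5, 6, 7, 6, 5, 7]
site_b = [3, 4, 2, 5, 3, 4, 3, 2]
print(permutation_p_value(site_a, site_b))
```

A small p-value indicates that a difference this large would rarely arise if the site labels were assigned at random.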
The final part of the study was the System Usability Scale (SUS), which consists of ten rating scales. A single SUS score was calculated for each participant by combining the ratings on the ten scales such that the best possible score is 100 and the worst is 0. Think of the SUS score as a percentage of the maximum possible score. The Wikipedia site received a significantly better SUS rating than the NASA site (64 vs. 40, p<.00001).
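The standard SUS scoring procedure can be sketched as follows. This assumes the usual item order from Brooke's questionnaire, in which odd-numbered statements are positively worded and even-numbered statements are negatively worded; the modified wording used in this study is assumed to keep that order. The function name `sus_score` is hypothetical.

```python
def sus_score(responses: list[int]) -> float:
    """Compute a SUS score (0 to 100) from ten ratings on a 1-5 scale.

    Ratings run from 1 (Strongly Disagree) to 5 (Strongly Agree).
    Positive items (odd-numbered) contribute rating - 1.
    Negative items (even-numbered) contribute 5 - rating.
    The sum of contributions (0 to 40) is multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1
        else:
            total += 5 - rating
    return total * 2.5


# Strongest agreement with positive items, strongest disagreement with negative ones.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))
# All neutral responses.
print(sus_score([3] * 10))
```

Averaging these per-participant scores within each site gives site-level figures of the kind reported above.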
The study yielded a rich set of verbatim comments. The NASA site received 132 individual comments from the various open-ended questions while the Wikipedia site received 135. Some of these comments were distinctly negative (e.g., for the NASA site: “Search on this site is next to useless”) while others were quite positive (e.g., for the Wikipedia site: “The outlines for the pages were helpful in locating a specific section of the site to find the desired information.”)
The performance data, subjective ratings, and verbatim comments can be used to help identify usability issues within the test site. Verbatim comments often provide clues indicating why tasks for a given site yield particularly low success rates, long task times, or poor task ratings.
The primary strength of an unmoderated remote usability study is the potential for collecting data from a large number of participants in a short period of time. Since they all participate “in parallel” on the web, the number of participants is mainly limited by your resourcefulness in recruiting. Larger numbers provide additional advantages:
Unmoderated remote usability studies are especially good at enabling comparisons between alternative designs. We’ve performed these studies where we simultaneously compared up to ten different designs. In just a few days we were able to test these designs with a large number of users and quickly identify the most promising designs.
Unmoderated remote usability studies aren’t always appropriate. Some of the limitations of the technique follow:
You need to be able to develop tasks that have relatively well-defined end states. Tasks like “find explicit information about this” work well.
Early exploratory studies, where you want to have an ongoing dialog with the participants about what they’re doing, are obviously not well suited to an unmoderated remote approach.
Unmoderated remote usability tests will never completely replace traditional moderated usability tests. A moderated test, with direct observation and the potential for interaction with the participant if needed, provides a much richer set of qualitative data from each session. But an unmoderated remote test can provide a surprisingly powerful set of data from a large number of users that often compensates for the lack of direct observation and interaction.
About the Author:
Tom Tullis is senior vice president of user insight at Fidelity Investments. With more than thirty years of experience in human factors and usability, he has published over fifty papers in numerous technical journals. He is the co-author, with Bill Albert, of the recent book Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics.
This article was originally printed in User Experience Magazine, Volume 7, Issue 3, 2008.
© Usability Professionals' Association