
A Commentary on “How To Specify the Participant Group Size for Usability Studies: A Practitioner’s Guide” by Macefield

Rolf Molich

Journal of Usability Studies, Volume 5, Issue 3, May 2010, pp. 124 - 128

Introduction

Today, no usability conference seems to be complete without one or more heated debates on participant group sizes for usability studies (Bevan, 2003; Molich, Bachmann, Biesterfeldt, & Quesenbery, 2010). In this respect, Macefield's recent article on group sizes for usability tests is timely (Macefield, 2009). However, when reading Macefield's article, I found that a number of important issues were not addressed. With this critique, I would like to draw readers' attention to some of these issues.

Relevant Real-World Data

We must base discussions of topics such as the number of test participants on relevant real-world data. We should not base them solely on analyses of simple systems, such as those of Faulkner (2003) and Woolrych and Cockton (2001), or on studies conducted with undergraduates, such as Woolrych and Cockton's (2001), unless, of course, it can be proven that undergraduates perform in a way that is comparable to usability professionals. My personal experience from teaching introductory usability engineering classes at the Technical University of Denmark is that only the best 10-20% of my students hand in reports that are at a professional level.

In 1998, I started a series of Comparative Usability Evaluations (CUE) to provide real-world data about how usability testing is carried out in practice. The essential characteristic of a CUE study is that a number of organizations (commercial and academic) involved in usability work agree to evaluate the same product or service, report their evaluation results anonymously, and discuss their results at a workshop. Usually, 12-17 professional teams participate. Teams conduct their studies independently and in parallel using their favorite evaluation approach, which is most often "think aloud" usability testing or expert review.

CUE-1 to CUE-6 focused mainly on qualitative usability evaluation methods, such as think-aloud testing, expert reviews, and heuristic inspections. CUE-7 focused on usability recommendations. CUE-8 focused on usability measurement. An overview of the eight CUE studies and their results is available at http://www.dialogdesign.dk/cue.html (Molich, 2010).

The CUE-2, CUE-4, CUE-5, and CUE-6 studies reported remarkably similar overall results. As an example, consider the key results from the CUE-4 study (Molich & Dumas, 2008), in which 17 teams analyzed the usability of the website for the Hotel Pennsylvania in New York. The most important of these findings are discussed in the sections below.

The Total Number of Usability Issues Is Close to Infinite

The CUE studies show that it is impossible, or at least infeasible, to find all usability issues in a realistic website or product because the number is huge, most likely in the thousands. This has important implications for discussions of participant group size. Because you can't find all problems anyway, go for a small number of participants and use them to drive a useful iterative process in which you pick the low-hanging fruit in each cycle.
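
For context, group-size recommendations are often derived from a simple binomial discovery model, in which a problem with per-participant detection probability p is found by at least one of n participants with probability 1 - (1 - p)^n. The minimal sketch below (in Python, with purely illustrative detection probabilities that are not CUE data) shows why a small group quickly picks the low-hanging fruit while rare problems keep escaping:

```python
# Binomial discovery model: expected share of problems found by at least
# one of n participants, given a per-participant detection probability p.
# The probabilities 0.50 and 0.05 are illustrative assumptions.

def proportion_found(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (1, 3, 5, 10, 20):
    easy = proportion_found(0.50, n)   # frequent, "low-hanging" problems
    rare = proportion_found(0.05, n)   # rare problems
    print(f"n={n:2d}: easy {easy:.0%} found, rare {rare:.0%} found")
```

With five participants, a problem that affects every second user is almost certainly observed (about 97%), while a problem with a 5% detection probability usually is not (about 23%), which is consistent with favoring small, iterative test rounds over one large study.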

It could be argued that most industry usability tests are conducted by just one team. While the CUE studies show that a single team finds only a small fraction of all the problems, a small group of participants probably finds most of the problems that the one team will find. A discussion of participant group size is therefore not complete without at least mentioning that varying the number of evaluators will affect results considerably, and probably more than varying the participant group size will.

The "infinite" number of usability issues also has important implications for many of the studies of participant group size because they assume that the total number of issues is known.
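
To see why this assumption matters, consider how a "known total" would have to be estimated in practice. One estimator sometimes borrowed from ecology for this purpose is Lincoln-Petersen capture-recapture; the sketch below (with invented counts, not CUE data) shows how quickly the estimated total grows when two independent evaluations barely overlap, as is typical of the CUE results:

```python
# Lincoln-Petersen capture-recapture estimate of the total problem count,
# based on the overlap between two independent evaluations of the same
# product. All counts here are invented for illustration.

def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    if overlap == 0:
        return float("inf")  # disjoint reports: the estimate is unbounded
    return n1 * n2 / overlap

# Two teams each report 40 problems. With 20 problems in common the
# estimated total is 80; with only 2 in common it is 800.
for overlap in (20, 10, 4, 2):
    total = lincoln_petersen(40, 40, overlap)
    print(f"overlap={overlap:2d}: estimated total = {total:.0f}")
```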

The Group Size Depends on the Purpose

A discussion of the participant group size only makes sense if the purpose of the usability test is known, as shown in Table 1.

Table 1. Optimal participant group size for various purposes of a usability test based on the CUE studies and the author's experience.


Table 1 shows that the total number of participants required to find all usability issues, or even all serious usability problems, is unknown at this time ("infinite"). What we do know is that in four CUE studies that each involved 9-17 teams, at least 60% of the issues were reported by single teams only. We conjecture that if we had done additional testing with, for example, 20 more teams and hundreds of additional participants, several hundred additional issues would have been discovered and reported. Only a few of the reported issues were invalid, for example, not reproducible or in conflict with commonly accepted usability definitions.
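
The 60% figure is simple bookkeeping over the merged issue list: count the issues whose set of reporting teams has size one. A minimal sketch, using a made-up mapping from issue IDs to reporting teams (none of this is actual CUE data):

```python
# Share of issues reported by exactly one team, computed from a mapping
# of issue IDs to the set of teams reporting each issue. Invented data.

reports = {
    "issue-01": {"A"},
    "issue-02": {"A", "B", "C"},
    "issue-03": {"D"},
    "issue-04": {"B"},
    "issue-05": {"C", "D"},
}

unique = sum(1 for teams in reports.values() if len(teams) == 1)
share = unique / len(reports)
print(f"{unique} of {len(reports)} issues ({share:.0%}) reported by one team only")
```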

Macefield makes a similar point that group size should depend on the purpose of the test, for example in his Figure 1. But he fails to make the important points that tests can be run for political purposes and that finding all problems is infeasible for most real-world systems.

For usability measurements, the group size depends on the desired level of confidence. Computing the level of confidence is difficult because measurement results are not normally distributed (Sauro, 2010).
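
One common remedy for the skewed distributions typical of measurements such as task times is to compute a t-based confidence interval on the log-transformed values and transform the endpoints back, giving an interval around the geometric mean. The sketch below illustrates this general approach; the sample times are invented, and this is not necessarily the exact procedure Sauro (2010) describes:

```python
# 95% confidence interval for skewed task-time data via log-transformation.
# The times below are invented for illustration.

import math
from statistics import mean, stdev
from scipy import stats

times = [38, 45, 52, 61, 74, 90, 112, 130, 155, 210]  # task times in seconds

logs = [math.log(t) for t in times]
n = len(logs)
m, s = mean(logs), stdev(logs)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
half = t_crit * s / math.sqrt(n)

low, high = math.exp(m - half), math.exp(m + half)
print(f"geometric mean: {math.exp(m):.0f} s, 95% CI [{low:.0f}, {high:.0f}] s")
```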

If Test Quality Is Poor, Group Size Doesn't Matter

If an evaluator uses poor test methodology, the results will be poor irrespective of the participant group size. Results of usability tests depend considerably on the evaluator (Jacobsen & Hertzum, 2001). This evaluator effect has been confirmed by the CUE studies. The CUE studies and the practical experience of this author point to one of the reasons why the evaluator effect exists: test quality is sometimes a problem. In other words, at least some usability tests are conducted with poor use of the "think aloud" methodology, particularly in test planning, facilitation, and reporting.

The author's personal experience from certification of usability professionals in usability testing indicates that about 50% of professional evaluators make one or more severe errors in their test planning, facilitation, or reporting.

Conclusions and Practitioner's Takeaway

The following were the main findings in this article:

The total number of usability issues in a realistic website or product is close to infinite, so attempting to find all of them is infeasible; a small participant group that drives a useful iterative cycle is a better investment.

The optimal participant group size depends on the purpose of the usability test.

If test quality is poor, the participant group size doesn't matter; the evaluator effect probably affects results more than the participant group size does.

References
