How To Specify the Participant Group Size for Usability Studies: A Practitioner’s Guide
Journal of Usability Studies, Volume 5, Issue 1, Nov 2009, pp. 34 - 45
Studies Related to Problem Discovery
Many commercial usability studies are concerned with problem discovery in interfaces, and here practitioners need to keep in mind two important and interrelated facts.
First, unlike widgets and people, it is not always easy to objectively define and/or identify a problem. This is primarily because, as pointed out by Caulton (2001), problems are a function of the interaction and do not necessarily constitute a static feature of the interface. So a feature of the system may constitute a problem for one user but not another and, similarly, it may constitute a problem for a user on one day but not the next. Problems also arise from rich and complex interrelationships between features so it is not always easy to “pin them down.” In summary, problems with interfaces are often fuzzy and subjective in nature. Indeed, these properties of problems are one reason why there is so much controversy as to what statistical methods and thinking best apply to these studies.
Second, an important goal of these studies is typically to rank the severity of problems. Put another way, simple enumeration of problems (and analysis on that basis) would not typically be a useful exercise within these studies. Yet such ranking is an issue that is not well addressed in current research literature (although it is often mentioned, e.g., Faulkner, 2003). A possible reason for this is that ranking problems is a complex and highly subjective matter. There may even be disagreement within a study team as to what mechanism and heuristics should be used to rank problems. Similarly, practitioners often disagree as to whether a feature of the system constitutes a problem at all.
Problem discovery level and context criticality
Table 1 is an abstract from Faulkner (2003) showing how, based on a large number of studies, various participant group sizes (“No. Users” column) probably influence the problem discovery level that a study will achieve. If we accept this advice, we can simply specify the group size according to the probable mean and/or minimum level of problem discovery we are seeking.
Table 1. Abstract from Faulkner (2003)

This leaves the challenge of how to determine what problem discovery level is appropriate for a particular study. There are some factors to aid us in meeting this challenge; we can easily argue that high(er) problem detection levels are desirable in the following contexts:
- work in highly secure environments e.g., the military
- work involving safety critical applications e.g., air traffic control and the emergency services
- where the socio-economic or political stakes are high e.g., with governmental applications
- work with enterprise critical applications where the financial stakes are high e.g., on-line banking and major e-commerce systems
- when a previous study, using a small(er) participant group size, has yielded suspect or inconclusive results
In conjunction with these factors, we should also carefully consider the implications of undiscovered problems remaining in the interface after the study, and what opportunities there will be to fix these later in the system development lifecycle (SDLC).
To summarize here, the optimal group size depends greatly on what problem discovery level we are seeking and, in turn, this should be driven by the study’s context.
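The idea of a probable mean and minimum discovery level for a given group size can be illustrated with a small resampling sketch. The code below is only an illustration under stated assumptions: the participant data are synthetic, the function names (`simulate_discovery`, `synthetic_participants`) are hypothetical, and the detection probabilities are invented, so the printed figures say nothing about any real study and are not a reconstruction of Table 1.

```python
import random


def simulate_discovery(detects, group_size, trials=2000, seed=0):
    """Resample groups and return (min, mean) share of known problems found.

    detects: list of sets, one per participant, holding the ids of the
    problems that participant encountered (hypothetical data here).
    """
    rng = random.Random(seed)
    all_problems = set().union(*detects)  # problems found by anyone at all
    shares = []
    for _ in range(trials):
        group = rng.sample(detects, group_size)  # sample without replacement
        found = set().union(*group)
        shares.append(len(found) / len(all_problems))
    return min(shares), sum(shares) / len(shares)


def synthetic_participants(n_participants=60, n_problems=40, seed=1):
    """Build made-up data: each problem gets its own detection probability."""
    rng = random.Random(seed)
    probs = [rng.uniform(0.05, 0.6) for _ in range(n_problems)]
    return [
        {i for i, p in enumerate(probs) if rng.random() < p}
        for _ in range(n_participants)
    ]


if __name__ == "__main__":
    data = synthetic_participants()
    for size in (5, 10, 15, 20):
        worst, mean = simulate_discovery(data, size)
        print(f"{size:>2} users: min {worst:.0%}, mean {mean:.0%}")
```

In a sketch like this, the gap between the minimum and the mean is the quantity to watch: a small group can look adequate on average while an unlucky draw of participants still misses a large share of the problems, which matters most in the high-criticality contexts listed above.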
Complexity of the study
Another key reason why we must be careful not to overgeneralize advice concerning study group sizes relates to the complexity of a study. For example, Hudson (2001) and Spool and Schroeder (2001) have criticized the advice in Nielsen (2000) that five participants is optimal for these studies because this advice is underpinned by relatively simple studies utilizing quite closed/specific tasks. By contrast, Spool and Schroeder (2001) conducted more complex studies, utilizing very open tasks, and found that five participants would probably discover only 35% of the problems in an interface. Similarly, Caulton (2001) and Woolrych and Cockton (2001) attacked Nielsen’s advice on the basis that he had grossly underestimated the impact of variation across individual participants within a particular study.
Taking this into account, it is argued here that the optimal group size should be influenced by the study’s complexity, with larger numbers of participants being required for more complex studies.
This leads us to the challenge of assessing a study’s complexity and, again, there are factors to aid us here. It is easy to argue that a study’s complexity typically increases along with increases in the following factors:
- scope of the system(s) being used
- complexity of the system(s) being used
- (potential) pervasiveness of the system
- scope, complexity, and openness of the task(s) being performed
- number and complexity of the metrics being used
- degree of diversity across the facilitators being used
- (potential) degree of diversity across the target user group
- degree of diversity across the study participants
- degree of potential for contaminating experimental effects in the study
- degree to which the study participants fail to reflect the target user group, particularly in terms of what relevant knowledge they will bring to the interactions
Another key factor here is the nature and volume of any training that the target user group would be given on the system, and which must then be reflected in the study design. Studies requiring such training are common with many non-pervasive systems (e.g., call centre applications or accounting systems) and this has the potential to increase a study’s complexity because any variation in the training input can easily become a contaminating experimental effect. On the other hand, if the training input is consistent and well reflects the training actually used for the target users, we can argue that this decreases complexity because the study participants should well reflect the target users in terms of what relevant knowledge they will bring to the interactions.
These factors can also be used as criteria to help determine the relevance of particular research literature, i.e., it is preferable that practitioners are informed by literature underpinned by studies of similar complexity to the study they are designing.
To summarize here, there is no “one size fits all” figure for the optimal group size for usability studies related to problem discovery. Rather, this should be influenced by the study’s context and complexity. Further, practitioners should accept these studies will inevitably involve a degree of subjectivity and that any numeric values that result are indicative. Similarly, they should view these studies as being formative and diagnostic exercises rather than (quasi) experiments designed to give objective answers. Indeed, it could be argued that the considerable volume of research literature that seeks to apply statistical methods to this type of study is not as important as some might think, particularly given that this literature has (understandably) little to offer as to how statistical methods might account for problems of differing severity.
However, there is the following advice from the research community that is useful to consider here:
- At the low end of the range, Virzi (1992) argued that the optimal group size in terms of commercial cost-benefits may be as low as three participants. At the high end, Perfetti and Landesman (2002) argued that 20 participants are appropriate for many commercial studies.
- As already pointed out in this article, the popular advice from Nielsen and Landauer (1993) and Nielsen (2000) is that five participants will probably discover 80% of the problems. This advice has been criticized because it was underpinned by relatively simple studies. Even if this criticism is accepted, the advice remains valid for the many commercial usability studies that are also relatively simple in nature.
- Research by Faulkner (2003) found that a group size of 10 participants will probably reveal a minimum of 82% of the problems. This is an attractive minimum figure but we should keep in mind that this research was also underpinned by relatively simple studies.
- The research by Turner et al. (2006) implies that a group size of seven participants may be optimal, even where the study is quite complex in nature.
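The figures quoted above can be compared using the cumulative discovery model commonly associated with Virzi (1992) and Nielsen and Landauer (1993), in which the expected share of problems found by n participants is 1 - (1 - p)^n. The sketch below is an illustration only: the formula is not stated in this article, and it assumes that every problem is detected independently with the same probability p by every participant, which is exactly the simplification that critics such as Caulton (2001) and Woolrych and Cockton (2001) challenge. The function names are hypothetical.

```python
import math


def expected_discovery(p, n):
    """Expected share of problems found by n participants, assuming each
    problem is detected independently with the same probability p."""
    return 1 - (1 - p) ** n


def implied_p(share, n):
    """Per-participant detection probability implied by a reported share
    of problems found with n participants (inverse of the model above)."""
    return 1 - (1 - share) ** (1 / n)


def participants_needed(p, target):
    """Smallest n whose expected discovery reaches the target share."""
    if not (0 < p < 1 and 0 < target < 1):
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p))


if __name__ == "__main__":
    # Shares reported for five participants in the literature cited above.
    for label, share in (("simple tasks", 0.80), ("open tasks", 0.35)):
        p = implied_p(share, 5)
        n = participants_needed(p, 0.90)
        print(f"{label}: implied p = {p:.3f}, n for 90% discovery = {n}")
```

Under this simplified model, the same five participants imply very different per-participant detection probabilities for simple and for open tasks, which is one way of seeing why the recommended group size should grow with study complexity. The model's output should be treated as indicative rather than as an objective answer.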
Studies related to problem discovery in early conceptual prototypes
Usability practitioners often need to study novel interface design concepts. These range from new types of control to whole new interface paradigms. Most of these studies involve an early conceptual prototype and are worthy of special consideration here for the following reasons:
- These studies are typically interested primarily in discovering severe usability problems (“show stoppers”) at an early stage so that we do not waste resources refining design concepts that are ultimately unviable.
- Because the conceptual prototypes are produced early in the SDLC, they are more likely to contain errors than would be the case with more mature prototypes or working systems. These may be technical errors (bugs) or articulator errors (the way in which a concept works).
- Interfaces exploiting novel design concepts typically present significantly greater usability challenges for users than is the case for more conventional interface designs. This is because the novelty, by its very nature, limits the usefulness of any existing (tacit) knowledge that the user has of operating interfaces (e.g., Macefield, 2005, 2007; Raskin, 1994; Sasse, 1993, 1997).
Given this, it is easy to argue that these prototypes are likely to contain more (severe) usability problems than systems exploiting more conventional interface design concepts. In turn, it is easy to argue that this significantly increases the likelihood that fewer study participants will be required to discover these problems. Therefore, we can argue that with studies involving early conceptual prototypes, the degree of novelty is inversely proportional to the number of participants that are likely to be required.
Another factor that drives the optimal group size for this type of study towards the lower end of the range is that early conceptual prototypes are typically quite low fidelity and very limited in scope. This is primarily to mitigate the risk of expending resources on developing unviable design concepts. As a consequence, these prototypes are typically capable of supporting only simple/constrained tasks. These studies are therefore often relatively simple in nature, which makes the advice from e.g., Nielsen (2000) to use small study group sizes particularly relevant here (because Nielsen’s advice is underpinned by relatively simple studies).
To summarize here, it is easy to argue that for most studies related to problem discovery a group size of 3-20 participants is valid, with 5-10 participants being a sensible baseline range, and that the group size should be increased along with the study’s complexity and the criticality of its context. In the case of studies related to problem discovery in early conceptual prototypes, there are typically factors that drive the optimal group size towards the lower end of this range.
