An international peer-reviewed journal

The Combined Walkthrough: Measuring Behavioral, Affective, and Cognitive Information in Usability Testing

Timo Partala and Riitta Kangaskorte

Journal of Usability Studies, Volume 5, Issue 1, Nov 2009, pp. 21 - 33

Current usability evaluation methods can be divided into three categories: methods for usability testing, usability inspection, and inquiry. In usability testing, a product or service is evaluated by testing it with test users. Usability testing has been a central activity in the field of human-computer interaction (HCI) for almost two decades. It has had a substantial practical influence on the development of computing systems. Different usability-related professions now employ thousands of usability professionals worldwide. Some of the methods for usability inspection include heuristic evaluation (Nielsen, 1994) and the cognitive walkthrough method (Wharton, Rieman, Lewis, & Polson, 1994). Usability inquiry methods include, for example, questionnaires, surveys, and focus groups.

Usability cannot be directly measured (Nielsen & Levy, 2003), but it has been studied by measuring various usability parameters and metrics. Nielsen (1994) presented the famous model consisting of five usability parameters: learnability, efficiency, memorability, error avoidance, and subjective satisfaction. Another well-known model, presented in ISO 9241-11: Guidance on usability (1998), consists of the concepts of effectiveness, efficiency, and subjective satisfaction. In this model, effectiveness has been defined as the accuracy and completeness with which users accomplish their goals. Measures of effectiveness include, for example, quality of interaction outcome and error rates. Efficiency was defined as the relation between effectiveness and the resources used in achieving the task goals. Efficiency indicators include task completion times and learning times. Subjective satisfaction was defined as the user’s comfort with and attitudes toward the use of the system. Satisfaction is typically measured using evaluation forms, for example, using questions from the Software Usability Measurement Inventory (SUMI) (Kirakowski, 1996), which also represents one of the first attempts to include affective factors in a usability evaluation. An important problem in studying subjective satisfaction using questionnaires is the great number of different methods used. In his review of 180 studies of computing system usability, published in core human-computer interaction journals and proceedings, Hornbæk (2006) identified only 12 studies that had used standard questionnaires.
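The behavioral indicators named above (task completion, task times, and error rates) can be illustrated with a minimal Python sketch. ISO 9241-11 does not prescribe particular formulas, so the record layout (`TaskAttempt`), the function names, and the sample data below are all assumptions made for illustration only, not the measures used in this experiment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskAttempt:
    """One participant's attempt at one task (hypothetical record layout)."""
    completed: bool   # did the participant reach the task goal?
    seconds: float    # time spent on the task
    errors: int       # number of errors observed during the attempt


def completion_rate(attempts):
    """Effectiveness indicator: share of attempts that reached the goal."""
    return sum(a.completed for a in attempts) / len(attempts)


def mean_completion_time(attempts):
    """Efficiency indicator: mean time over successful attempts only.

    Returns None when no attempt succeeded, because a mean time for
    completed tasks is undefined in that case.
    """
    times = [a.seconds for a in attempts if a.completed]
    return mean(times) if times else None


def mean_errors(attempts):
    """Effectiveness indicator: mean number of errors per attempt."""
    return mean(a.errors for a in attempts)


# Made-up data for four participants attempting the same task.
attempts = [
    TaskAttempt(completed=True, seconds=40.0, errors=0),
    TaskAttempt(completed=True, seconds=60.0, errors=2),
    TaskAttempt(completed=False, seconds=120.0, errors=4),
    TaskAttempt(completed=True, seconds=50.0, errors=0),
]

print(completion_rate(attempts))       # share of successful attempts
print(mean_completion_time(attempts))  # seconds, successful attempts only
print(mean_errors(attempts))           # errors per attempt
```

Restricting the time measure to successful attempts is one possible design choice; other studies report times for all attempts or combine time and completion into a single ratio.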

Many current usability evaluation methods concentrate on producing information related to one particular viewpoint only. For example, the cognitive walkthrough has become an important usability inspection method, but the information produced is limited to the cognitive challenges that a user interface might have. However, the need for producing information from different perspectives has already been acknowledged. For example, Frøkjær, Hertzum, and Hornbæk (2000) stressed that different usability criteria (e.g., measures of effectiveness, efficiency, and satisfaction) should be brought into usability evaluations and each criterion should be examined separately when the usability of a product is evaluated. Another important trend is broadening the measurement of subjective satisfaction so that the focus shifts toward the measurement of the users’ experienced emotions. In an early study, Edwardson (1998) concluded that “It may indeed be far more useful to measure and understand customer happiness and customer anger as the primary exemplars of consumer experience rather than satisfaction” (p. 11). According to Dillon (2001), affects cover elements related to attitudes, emotions, and mental events. The existence of such phenomena has not been sufficiently taken into account in usability research. Besides measuring user satisfaction, it should be studied whether the user is frustrated, annoyed, insecure, or trusting. Focusing on the users’ emotions shifts the focus from whether the users can use an application to whether they want to use it.

In the late 1990s, the research field of affective computing started to emerge. Affective computing was defined as computing that relates to, arises from, or deliberately influences emotion or other affective phenomena (Picard, 1997). In this field, the main focus has been on the construction of systems with capabilities for recognizing affect. However, physiology-based measures for evaluating affective interactions have also been developed in this field (Partala & Surakka, 2003, 2004; Partala, Surakka, & Vanhala, 2006; Ward, 2005) utilizing, for example, the user’s facial expressions or eye pupil size. During the current decade, more general methods for affective evaluation of HCI have started to emerge in the field of user experience. The field of user experience highlights the user’s holistic experience that results from the use of technology. Hassenzahl, Platz, Burmester, and Lehner (2000) presented the hedonic quality scale. They defined hedonic quality as the pleasure-giving quality of a system. Many psychometric assessments of users’ responses to computing systems have focused on positive affective constructs (Hassenzahl & Tractinsky, 2006). This is consistent with the studies reported by Hornbæk (2006). In his review of 180 studies of computing system usability, Hornbæk identified 70 measures of specific attitudes that had been assessed by self-report. Of these, only 13 addressed explicitly negative emotional or physiological states. While positive affect has been found to have many kinds of positive consequences on cognition, for example, enabling more effective decision making (Isen, 2006), such unidirectional measures provide only a partial view of the user’s emotions. In fact, understanding the causes of negative experiences may be more important in order to further develop the tested system iteratively based on the test results.

Lately, emotions have been studied in interactive contexts, for example, by Mahlke and Thüring (2007) and Hazlett and Benedek (2007). These studies have indicated that variations in emotions, especially in terms of emotional valence ranging from positive to negative emotions, play an important role in interacting with technology. However, the new methods developed in this field have been largely independent of previous developments in the area of usability. Practical approaches for extending traditional usability testing methods with methods for studying experiential, especially affective aspects, have still been sparse.

There is substantial evidence from psychological research that affective experiences can be effectively organized using a dimensional model. Factor analyses have confirmed that three dimensions are enough to account for most of the variance in affective experiences. Currently, the most commonly used dimensional model of emotions, consisting of valence, arousal, and dominance, was presented by Bradley and Lang (1994). Of these three scales, the valence dimension ranges from negative to neutral to positive affect, while the arousal dimension ranges from very calm to neutral to highly aroused emotion. The dominance dimension varies from the sense of being controlled to neutral and to the sense of being in control (e.g., of a particular situation or an event). Valence and arousal are the most fundamental and commonly studied dimensions. The dominance dimension accounts for much less variation in semantic evaluation of emotions and it is consequently less often used in empirical studies. Bradley and Lang (1994) presented an easy-to-use pictorial method called Self-Assessment Manikin (SAM) for affective self-reports. Using their method, the participant chooses a picture that best represents his or her affective state on each scale. For example, the valence scale ranges from negative (an unhappy face) to neutral (a neutral face) to positive (a happy face) emotion. In addition to SAM, a non-pictorial version of their method has also been used successfully in many basic research experiments (e.g., Partala & Surakka, 2003). This method is based on the semantic differential method (Osgood, 1952) that connects scaled measurement of attitudes with the connotative meaning of words and has a long history of successful use in various fields. Using this method, the participants evaluate their emotional experiences on valence and arousal scales with emotional words used as anchors (e.g., on a 1-9 arousal scale: 1 = very calm, 5 = neutral, 9 = very highly aroused). 
These or similar methods have been used in user experience research, for example in Partala and Surakka (2004); Mahlke, Minge, and Thüring (2006); and Mahlke and Thüring (2007). Although they have not been widely used in the field of usability testing, they could offer fast and reliable methods for affective self-reports.
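As a minimal sketch of how such self-reports could be recorded and summarized, the Python example below stores one valence and one arousal rating per task on 1-9 scales and averages them per task. The arousal anchors follow the example given above; the valence anchor wording, the names `AffectRating`, `mean_ratings`, and `negative_tasks`, and the sample ratings are assumptions introduced for illustration, not part of the published method.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

SCALE_MIN, NEUTRAL, SCALE_MAX = 1, 5, 9


@dataclass(frozen=True)
class AffectRating:
    """One post-task self-report on two 9-point scales."""
    task: str
    valence: int  # assumed anchors: 1 = very negative, 5 = neutral, 9 = very positive
    arousal: int  # 1 = very calm, 5 = neutral, 9 = very highly aroused

    def __post_init__(self):
        # Reject ratings that fall outside the 1-9 scale.
        for name in ("valence", "arousal"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name} must be in {SCALE_MIN}..{SCALE_MAX}, got {value}")


def mean_ratings(ratings):
    """Return {task: (mean valence, mean arousal)} over all participants."""
    by_task = defaultdict(list)
    for r in ratings:
        by_task[r.task].append(r)
    return {
        task: (mean(r.valence for r in rs), mean(r.arousal for r in rs))
        for task, rs in by_task.items()
    }


def negative_tasks(ratings):
    """Tasks whose mean valence falls below the neutral midpoint."""
    return [task for task, (v, _) in mean_ratings(ratings).items() if v < NEUTRAL]


# Made-up ratings from two participants on two hypothetical tasks.
ratings = [
    AffectRating("search", valence=3, arousal=7),
    AffectRating("search", valence=4, arousal=6),
    AffectRating("checkout", valence=7, arousal=4),
    AffectRating("checkout", valence=8, arousal=5),
]

print(mean_ratings(ratings))
print(negative_tasks(ratings))
```

Flagging tasks with a mean valence below the neutral midpoint is only one simple way to point an evaluator toward tasks that may deserve a closer look.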

Hornbæk (2006) suggested that the subjective measures of usability typically concern the users’ perception of or attitudes toward the interface, the interaction, or the outcome. He also suggested the following important future challenges in the area of usability:

This experiment addressed all three challenges pointed out by Hornbæk (2006). While we suggested that measurement of the users’ subjective affective responses can be useful in evaluating traditional interaction (e.g., with graphical user interfaces), we also acknowledged that it would be most suitable for the evaluation of products, in which influencing the users’ emotions is part of the intended user experience (e.g., interactive media products or entertainment applications). Based on their research, Sauro and Dumas (2009) suggested that post-task (cf. post-test) questions can be valuable additions to usability tests by providing additional diagnostic information. In this experiment, we used post-task questions on a semantic differential scale for studying the participants’ affective valence and arousal related to each task.

Some researchers, such as Matera, Costabile, Garzotto, and Paolini (2002), have suggested combining expert evaluation and usability testing. In this experiment, we presented an approach that also applies a version of a popular usability inspection method, the cognitive walkthrough, to usability testing. In the original cognitive walkthrough, the evaluator goes through a task phase by phase, answering a set of questions in each phase (e.g., “Will the user notice that the correct action is available?”). In this paper, we proposed an approach in which the evaluator observed the actual use of a system, detected potential problems in the system usage, and retrospectively asked the participant the cognitive walkthrough questions about each detected potential usability problem.

In this paper, we described our experiences of using a method that combined measuring information about the participants’ behavior, affects, and cognitive processes during human-computer interaction. For measuring behavior, traditional measures such as task times and task completion rates were used. For affective evaluations on valence and arousal, we used a non-pictorial version of Bradley and Lang’s SAM method (Bradley & Lang, 1994). The previously mentioned interactive version of cognitive walkthrough was used for studying the participants’ cognition. The ideas underlying the methods used in this experiment were first presented by Partala (2002). In this experiment, the methods were further developed and tested in practice for the first time.
