The Combined Walkthrough: Measuring Behavioral, Affective, and Cognitive Information in Usability Testing

Timo Partala and Riitta Kangaskorte

Journal of Usability Studies, Volume 5, Issue 1, Nov 2009, pp. 21-33

The following sections discuss the participants, equipment, materials and tasks, procedure, and data analysis used in this experiment.


Participants

Sixteen volunteers (seven females and nine males, mean age 40.3 years, range 30-51 years) took part in the experiment. The participants were unaware of the purpose of the experiment on arrival.


Equipment

The tests were run on a Fujitsu Siemens Amilo D7830 computer with a display resolution of 1024 x 768 pixels. The system was fast enough to run the software and play all the video and sound clips in real time. Volume was kept at a constant comfortable level throughout the experiment. The participants used a regular mouse to control the interactive media software and viewed the display from a distance of about 50 cm. The computer display and the participants’ comments were recorded with a video camera for later analysis.

Materials and Tasks

In this experiment, the object of evaluation was CD-Facta, an interactive multimedia encyclopedia in Finnish, the native language of the participants. In this encyclopedia, a particular piece of information or media file could typically be found by navigating through 3-5 levels of the multimedia product hierarchy. Any piece of information could be reached through 3-6 different paths in the hierarchy. Consequently, the participants had several ways to accomplish the task goals. The main screen of the product offered four ways to start looking for information: themes, articles by topic, a map-based interface, and a multimedia gallery. In addition to textual articles, the encyclopedia contained audio and video samples, and the user interface contained pictures and icons.

The test participants were presented with seven information retrieval tasks. Each task required a different action chain through the interactive software, so that completing earlier tasks would not help in subsequent tasks. The tasks were designed to be of approximately similar complexity and to take about 2 minutes each on average. One task contained a highly positive audio element (an engaging sports commentary ending in a victory), one task contained a highly negative audio element (traditional old women’s crying songs), one task contained a highly positive video element (an exciting portrayal of wild animals), and one task contained a highly negative video element (an air raid during a war). Completing the other three tasks did not necessarily involve any audio or video media elements. The tasks are presented in Table 1.

Table 1. The Experimental Tasks (Translated from Finnish)

Audio was played back to the participants at a constant comfortable volume level. The video elements were embedded in the encyclopedia and the resolution of the video display areas was 320 x 240 pixels.


Procedure

In this research, an experimental usability testing method developed by the authors was tested. The working title of the method was the combined walkthrough, because it is partly based on the cognitive walkthrough method and because it aims to combine expert evaluation with laboratory testing, as well as measurements of behavioral, affective, and cognitive information.

The participants were tested one at a time. The sessions were carried out in a quiet laboratory and lasted about an hour for each participant. The participant was first seated at a computer desk and asked to fill in a demographic data form. After that, the researcher read the test instructions aloud. The participants were made aware that task times were measured, and they were instructed to complete the tasks without any unnecessary breaks. The participants were told that the post-hoc questions (based on the cognitive walkthrough) should preferably be answered with a short yes or no, but that the answers could be extended with more detailed comments if necessary.

In the evaluation of affective experiences, methods and instructions typically used in basic research (e.g., Partala & Surakka, 2003) were used. Nine-point rating scales for valence and arousal were shown and explained to the participant with examples of rating affective experiences. The participants were told to try to get through the tasks independently without thinking aloud. They were instructed to tell the researcher the answer to the task question when they thought they had found the answer. If the answer was incorrect, the researcher quickly indicated that to the participant, who continued solving the task.

The participants were then presented with the seven information retrieval tasks, one at a time, in a randomized order that was different for each participant. During each task, the researcher observed the participant and made notes about phases in which the participant chose an incorrect action or had difficulty finding the right action. More specifically, the researcher paid special attention to two typical situations: the participant took a wrong action and started following, or stayed on, a wrong path without returning immediately, or the participant spent a notably long time trying to find the correct function on a screen.
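The randomization described above can be sketched in a few lines of Python. This is only an illustration, not the procedure the authors actually used: the task numbering (1-7), the seed value, and the function name `assign_task_orders` are assumptions made for the example. The sketch redraws an order whenever it repeats an earlier one, so that every participant receives a different sequence.

```python
import random

# Hypothetical task identifiers: the seven tasks, numbered 1-7 for illustration.
TASK_IDS = [1, 2, 3, 4, 5, 6, 7]


def assign_task_orders(n_participants, seed):
    """Return {participant_id: task order}, with no two orders identical.

    The seed makes the assignment reproducible; it is an assumption of this
    sketch and not something reported in the study.
    """
    rng = random.Random(seed)
    used = set()
    orders = {}
    for participant_id in range(1, n_participants + 1):
        while True:
            # rng.sample with k == len(...) yields a random permutation.
            order = tuple(rng.sample(TASK_IDS, k=len(TASK_IDS)))
            if order not in used:
                break
        used.add(order)
        orders[participant_id] = list(order)
    return orders


orders = assign_task_orders(n_participants=16, seed=2009)
```

With seven tasks there are 5,040 possible orders, so the redraw loop rarely triggers for sixteen participants; it is there only to guarantee distinct orders.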

If the participant had not found the correct answer within four minutes (twice the expected average task time), the task was classified as incomplete. The tasks were designed so that participants who had not completed a task in four minutes were typically stuck in a problematic situation that they could not resolve by themselves. In these cases, the time measurements were no longer valid, but the participants still completed the task so that they could fill in the questionnaire on affective experiences. To avoid long experimental sessions, the researcher then gave the participant hints about the next correct move if the participant was lost in the interface.
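The four-minute rule can be expressed as a small classification step. The sketch below is illustrative only: the function name, the return format, and the treatment of a time of exactly 240 seconds as still complete are assumptions, since the text does not specify the boundary case.

```python
TIME_LIMIT_S = 4 * 60  # four-minute cutoff, twice the expected 2-minute task time


def classify_trial(seconds_to_correct_answer):
    """Classify one trial and decide whether its task time is usable.

    seconds_to_correct_answer: elapsed seconds when the correct answer was
    given, or None if it was not found without hints.
    Returns (status, valid_task_time); the time is None for incomplete tasks
    because hints may have been given after the cutoff.
    """
    if (seconds_to_correct_answer is not None
            and seconds_to_correct_answer <= TIME_LIMIT_S):
        return "complete", seconds_to_correct_answer
    return "incomplete", None


print(classify_trial(95.0))   # finished well within the limit
print(classify_trial(310.0))  # finished only after the cutoff
```

Keeping the status and the usable time together makes it harder to include an invalid time in later task-time statistics by accident.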

After finishing the task (that is, after stating the correct answer aloud), the participant rated his or her overall user experience related to the task by filling in an affective experience rating form. The form contained a 9-point rating scale for both valence and arousal. On the valence scale, number 1 indicated a very negative affective experience, number 5 indicated a neutral experience, and number 9 indicated a very positive affective experience. On the arousal scale, number 1 indicated a very calm experience, number 5 indicated a neutral experience, and number 9 indicated a very highly aroused experience. If the task just completed contained an audio or video media element, the participants also rated their experienced valence and arousal in response to this media element (the media element was played once more before this evaluation). Special care was taken to ensure that the participants understood that the target of the first evaluation was the holistic affective user experience related to the interaction with the system when completing the task, and that the target of the second evaluation was the experience evoked by the particular media element alone.
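One way to record these ratings is a small validated data structure, sketched below. The class and field names are hypothetical and chosen for this example; only the 1-9 range and the split between the overall rating and the optional media-element rating come from the text.

```python
from dataclasses import dataclass
from typing import Optional

SCALE_MIN, SCALE_MAX = 1, 9  # 9-point scales; 5 is the neutral midpoint


@dataclass(frozen=True)
class AffectRating:
    """A single valence/arousal rating on the 9-point scales."""
    valence: int  # 1 = very negative, 9 = very positive
    arousal: int  # 1 = very calm, 9 = very highly aroused

    def __post_init__(self):
        for name in ("valence", "arousal"):
            value = getattr(self, name)
            if not isinstance(value, int) or not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name} must be an integer from 1 to 9, got {value!r}")


@dataclass(frozen=True)
class TaskRatings:
    """Ratings collected after one task."""
    overall: AffectRating                 # holistic experience of the interaction
    media: Optional[AffectRating] = None  # only for tasks with an audio or video element


# Example: a mildly positive overall experience, strongly negative media element.
example = TaskRatings(overall=AffectRating(6, 4), media=AffectRating(2, 7))
```

Storing the two ratings as separate fields mirrors the distinction the participants were asked to make between the interaction as a whole and the media element alone.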

Before moving on to the next task, the researcher and the participant interactively examined the problematic situations that the researcher had detected during the task. For this purpose, a method was developed based on the cognitive walkthrough method (Wharton et al., 1994). The researcher and the participant together revisited each detected problematic point using the software. Out of the four original questions of the cognitive walkthrough, we selected the three that were the most appropriate in the context of usability testing, and we changed their wording so that the researcher could ask them directly of the participant. The following were the three questions asked:

The original question “Will the user try to achieve the right effect?” was left out, because the tasks in the current experiment were straightforward and the possibilities for misunderstanding the tasks were minimized. The researcher wrote down the participant’s answers and asked further questions if the participant’s initial answers were ambiguous. The trials were video recorded and analyzed afterwards in order to understand the nature of usability problems found and to evaluate the performance of the researcher.

Data Analysis

The differences between the categories of subjective ratings of emotional experiences were analyzed using Friedman tests and Wilcoxon matched-pairs signed-ranks tests. These tests were selected because the rating data called for nonparametric methods. Spearman correlations were used to analyze the relationships between the different usability indicators. Data from one task out of a total of 112 (0.9%) had to be dropped from the analysis due to missing data.
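As an illustration of the correlation analysis, the following Python sketch computes a Spearman rank correlation using only the standard library (ranks with ties averaged, then a Pearson correlation on the ranks). It is not the authors' analysis script, the function names are chosen for this example, and the sample values are made up purely to exercise the code. In practice, SciPy's `scipy.stats` module offers `friedmanchisquare`, `wilcoxon`, and `spearmanr` for the three analyses named above.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (NOT data from the study):
task_times_s = [80, 95, 120, 150, 200]
valence = [8, 7, 7, 4, 2]
rho = spearman_rho(task_times_s, valence)

# Size of the data set described in the text: 16 participants x 7 tasks.
n_trials = 16 * 7
dropped_pct = round(100 * 1 / n_trials, 1)
```

The last two lines reproduce the arithmetic behind the reported figures: 16 participants times 7 tasks gives 112 task performances, and one dropped task corresponds to about 0.9% of them.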

