Journal of Usability Studies
An international peer-reviewed journal

Discourse Variations Between Usability Tests and Usability Reports

Erin Friess

Journal of Usability Studies, Volume 6, Issue 3, May 2011, pp. 102 - 116



This group engaged in do-it-yourself usability testing in that the designers who developed the document for the postal service were the same people who conducted the usability tests. Although every usability test was video recorded, to the best of my knowledge none of the videos were reviewed by any member of the group (including the evaluators) after the test. Therefore, the only people who had access to the information revealed during a test were the two evaluators who moderated and observed it.

These evaluators were thus placed in a knowledge-broker position—certainly a position of potential power. The evaluators of the usability tests determined what information from the tests was presented to the group and what information was omitted. From this gatekeeper position, the evaluators had the capacity to influence the outcome of the document by restricting, altering, or overemphasizing information gathered from UT in their oral reports. I do not mean to suggest that alterations to the findings were made maliciously or even consciously by the reporting evaluators. Nonetheless, the findings in the oral reports often had changed from, or had no basis in, the usability sessions. In what follows, I explore possible explanations as to why the findings presented in the oral reports did not always accurately represent what occurred in the UT.

Confirmation Bias in Oral Reports

Either intentionally or unintentionally, the evaluators appeared to seek out confirmation for issues they had previously identified. For example, in his oral report on P2’s session, Tom mentioned on three separate occasions that P2 “liked” a particular chart. Indeed, P2 did say at one point in the study, “I love this chart,” but P2 never mentioned that particular chart again. At the large group meeting immediately prior to the one with these oral reports, a fellow team member had questioned Tom as to whether this particular chart was intuitive to the needs of the user. Tom’s repeated statements that P2 “liked” this chart may be residual arguments from that prior discussion or an attempt to save face in front of the group. While the information was correctly transferred from the usability study, the selection of that finding and the repeated mention of it were perhaps misplaced.

In another instance, Tom mentioned that P2 did not like the font on the measuring pages. As discussed previously, P2 never outright said that he did not like the font, but instead said that he found it “interesting” and that he had “never seen it before.” Tom parlayed that into a statement that P2 “didn’t like the font on the measuring pages,” which may or may not be accurate. However, the reported finding that P2 didn’t like the font supports a claim that Tom had made in many prior meetings that the font on the measuring pages was “inappropriate” and “at least needs its leading opened up.” By interpreting the finding as negative rather than a neutral statement, Tom lends credence to his ongoing argument to change the font. I am not suggesting that Tom intentionally misread the data to support his personal argument. However, it may be that Tom perhaps unintentionally appropriated the equivocal statement from P2 to support an issue that clearly meant quite a lot to him.

Additionally, Ericka mentioned that P1 “wanted page numbers at the bottom,” which was an accurately reported finding, but one that also supported Ericka’s ongoing claim that the tops of the pages were too cluttered to also hold page numbers. Further, Laura mentioned that P3 had “a little bit of trouble navigating the information at the beginning,” possibly because the “TOC ...was four pages in.” Again, this was accurate from the usability study, but it also supported a concern Laura had raised in a previous meeting that the table of contents wasn’t in the proper spot for quick skimming and scanning.

Ultimately, of the findings presented in the oral reports that had some basis (accurate or not) in the usability sessions, approximately one third supported an issue that the evaluator had previously raised in the group meetings. Therefore, it may be that, instead of listening neutrally for global concerns throughout the testing, the evaluators were keeping a keen ear open during the evaluations for statements and actions that supported their own beliefs.

This notion of confirmation and evaluator bias is nothing new; indeed, in 1620 in the Novum Organum, Francis Bacon said, “The human understanding, once it has adopted opinions, either because they were already accepted and believed, or because it likes them, draws everything else to support and agree with them” (1994, p. 57). According to Bacon, and countless others since the 17th century, it is in our nature to seek support for our beliefs and opinions rather than to identify notions that contradict our own. These evaluators therefore seem to have faced issues similar to those described by Jacobsen, Hertzum, and John: “evaluators may be biased toward the problems they originally detected themselves” (1998, p. 1338). The evaluators sometimes accurately represented the findings to support their cause, and at other times seemingly “massaged” the findings to match their causes. Tom, for example, appears to massage his findings: his clinging to P2’s statement that P2 liked a chart, despite the singular mention by P2, and his reporting that P2 didn’t like a font, despite P2 saying only that it was “interesting,” suggest that in some way Tom was looking for data within the scope of the evaluation sessions that supported his ongoing claims.

Conversely, Laura’s report that P3 had trouble with navigation due to the location of the table of contents, and Ericka’s report of P1’s dislike of the page numbers at the top of the page, did in fact support their ongoing claims, but also accurately conveyed the findings from the usability sessions. Laura’s and Ericka’s findings were subsequently retested, and changes were eventually made to the document reflecting both the findings from their usability sessions and, ultimately, the claims they had made prior to the evaluation sessions. While it may be that the evaluators were trolling the UT sessions for findings they could employ for their own causes, it may also be that they were acting as advocates for these issues. Though there was no separately delineated design team and usability team, the design team did have specialists: chart specialists, organization and navigation specialists, icon and image specialists, writing specialists, printing specialists, color specialists, and many others (though, despite their specialties, individual team members often served on several teams at once). It may be that the team members were capturing relevant data, but doing so in a capacity that advocated for their issue. Such advocacy was appropriate and useful when based on accurate findings (as in these examples by Laura and Ericka), but can be potentially detrimental to the success of the document when based on inaccurate, or “wishful,” findings (as in these examples by Tom). Again, I do not believe that Tom’s reports were necessarily intended to deceive, but Tom appropriated what he wanted to see in the usability session and then reported those perhaps misappropriated results to the group.

Bias in What’s Omitted in the Usability Reports

While the evaluators appeared to highlight findings from UT that supported their predefined and ongoing design concerns, they also appeared to selectively omit findings that could be harmful to those concerns. The omission of findings from UT sessions was not necessarily problematic, because “a too-long list of problem[s] can be overwhelming to the recipients of the feedback” (Hoegh, Nielsen, Overgaard, Pedersen, & Stage, 2006, p. 177), and because the desire for concise reports and the desire to explain results clearly and completely may be incompatible (Theofanos & Quesenbery, 2005). Indeed, only about 25% of the findings from the UT made their way (either accurately or inaccurately) into the oral reports. What is potentially problematic, however, is that, although each evaluator had the opportunity, at no time did the evaluators present findings to the group that ran counter to a claim the evaluators had made in a previous meeting. In other words, though the language or the actions of the UT participant could provide evidence against a previously stated claim, no evaluator presented that information in the oral report. For example, in a meeting two weeks prior to these oral reports, Laura suggested that a particular picture be used on the title page of the document because it is “funny, open, and inviting. It kinda says that everyone likes mail.” However, in her UT, P3 pointed directly at the photo and said, “This photo is scary.” Laura did not follow up with P3 to determine why P3 thought the photo was “scary,” nor did she tell the group that P3 thought the photo was scary, possibly because the finding would undermine her suggestion that the photo was appropriate for the document.

In another example, the team that developed potential titles for the document advocated the use of “A Household Guide to Mailing” out of several possibilities. In that meeting, a month before the results of the usability evaluations were discussed, Ericka said, “[‘A Household Guide to Mailing’], you know, conveys a couple of ideas, like that it’s a household guide, that anyone, anywhere can manage this guide because everybody’s got a household of some kind, and the second idea is that it really distinguishes it from the small business mailers and the print houses, ‘cause this isn’t going to help them out.” In Ericka’s usability session, P1 said, “Okay, so this is a Household Guide to mailing. I’m not exactly sure what that means, what household means, but I guess that’s just me.” Again, P1’s statement in the usability session casts some doubt on the certainty with which Ericka advocated “A Household Guide to Mailing.” However, Ericka did not question P1 about her statement, nor did she include this finding in the oral report.

Each of the evaluators heard statements or observed actions in their usability sessions that did not support a claim they had made previously, but none of them reported those findings to the group. It may be that the evaluators consciously sought to omit the offending findings from the report, but it may be something less deliberate. Perhaps these evaluators, knowing they had a limited time frame in which to present the results of the evaluation, decided (rightly or wrongly) that the finding that contradicted their claim was not one of the more pressing issues revealed by the test. Indeed, previous studies have shown that even expert evaluators who are not conducting do-it-yourself evaluations, and who thus have less potential for bias, have difficulty agreeing on the identification of usability problems, the severity of those problems, and the quality of their recommendations (Jacobsen, Hertzum, & John, 1998; Molich et al., 2004; Molich, Jeffries, & Dumas, 2007). These novice evaluators certainly had the same struggles (perhaps more so) as expert evaluators. It may be that, given that the stated goal of the evaluation was to determine the degree of success in navigating the document, findings that did not relate to navigation were not given high priority in the oral report. However, of the 26 findings presented in the oral reports that stemmed in some way from the usability evaluations, only 6 (or 23.1%) dealt in some way (even highly tangentially) with navigation. Therefore, 76.9% of the reported findings dealt with something other than the primary issue of navigation, yet none of those reported findings ever contradicted a claim made previously by an evaluator. Given that this group so definitively declined to present evidence from usability testing that ran counter to their previously stated opinions, more research is needed to determine how influential personal belief and the desire to save face are in groups conducting do-it-yourself usability testing.
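The percentages follow directly from the counts reported above; a minimal sketch (Python, for illustration only, using only the counts given in the text) confirms the arithmetic:

```python
# Counts reported in the text: 26 oral-report findings traceable to the
# usability sessions, 6 of which related (even tangentially) to navigation.
total_findings = 26
navigation_related = 6

nav_pct = navigation_related / total_findings * 100
other_pct = 100 - nav_pct

print(f"{nav_pct:.1f}% navigation-related")    # 23.1% navigation-related
print(f"{other_pct:.1f}% other issues")        # 76.9% other issues
```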

Biases in Client Desires

One perhaps unexpected area of potential bias came from the USPS clients. In a meeting a few weeks before this round of evaluation, several stakeholders from USPS headquarters stated that they were not going to pay for an index in a document so small. As one stakeholder said, “If you need an index for a 40-page book, then there’s something wrong with the book.” While the designers countered their clients’ assertion by claiming that people read documents in “a multitude of ways, none no better than the other,” the clients repeated that they did not want to pay for the additional 1-2 pages of the index. Therefore, in the subsequent meeting, the project manager told the designers that “sometimes sacrifices must be made” and that the index was being taken out of the prototype.

In all three of the evaluations observed for this study, the usability participant mentioned at least once that he or she wanted an index. However, the desire for an index was not mentioned by any of the evaluators during the oral reports. It may be that these evaluators declined to include this finding in their reports because both the clients and their project manager had made it clear that they did not want an index in the document. Reluctance to present a recommendation that flies in the face of what the client specifically requested may have chilled the evaluators’ willingness to report these findings. Perhaps if the evaluators had been more experienced, or had felt as though they were in a position of power, they might have broached the unpopular idea of including an index. This reluctance to give bad or undesirable news to superiors has been well documented (Sproull & Kiesler, 1986; Tesser & Rosen, 1975; Winsor, 1988). Interestingly, after the next round of testing, one evaluator mentioned that his usability participant said, “if it doesn’t have an index, I’m not using it.” At that point, many of the previous evaluators commented that their participants had also requested an index. Ultimately, the version of the document that went to press had an index.

Poor Interpretation Skills

Further, it may also be that these evaluators were not particularly adept at analyzing and interpreting the UT sessions. The protocol, the collection of data, and the lack of metrics may not have been ideal, but they were certainly reflective of real-world (particularly novice real-world) do-it-yourself usability practice. The lack of experience on the part of the evaluators may explain why they clung to sound-bite findings—findings that were relatively easy to repeat but that may not reveal the most critical problems or the most pressing usability issues. Indeed, these evaluators were fairly good at accurately representing the sound bites, but struggled to represent more complex interpretations in their oral reports. It may be, for example, that Tom truly interpreted P2’s statement that he found a font “interesting” as an indicator that P2 did not like the font, while other evaluators might have interpreted that statement differently. Studies on the evaluator effect have shown that evaluators vary greatly in problem detection and severity rating, and that the effect exists for both novices and experts (Hertzum & Jacobsen, 2001). This variation is not surprising given that user-based UT typically involves subjective assessments by UT participants, followed by subjective assessments of those assessments by the evaluator to determine what is important for the document. It may be that, just as a variety of evaluators will show high variation in their findings, individual evaluators may show great variation in the findings they (consciously or unconsciously) choose to include, omit, or alter. As Nørgaard and Hornbæk noted, while there is a plethora of information on how to create and conduct a usability test, there is “scarce advice” regarding how to analyze the information received from usability tests (2006, p. 216).
Given that the “how” of UT analysis isn’t the primary goal of UT textbooks and professional guides, the fact that these novice evaluators struggled to accurately reproduce the findings from the UT sessions in their oral reports may simply be a product of attempting to process the findings without a clearly identified system of analysis. Without such a system, these evaluators may have interpreted the findings in ways they thought were appropriate, even when those interpretations ran counter to the statements made in the UT session.

