JUS - Journal of Usability Studies
An international peer-reviewed journal

Heuristic Evaluation Quality Score (HEQS): Defining Heuristic Expertise

Shazeeye Kirmani

Journal of Usability Studies, Volume 4, Issue 1, November 2008, pp. 31-48



Overall results indicated that the group found an average HEQS% of 8% in 1 hour, ranging from 2% to 17% (see Figure 4). This compares with previous studies in which evaluators found 24% and 25% (the highest in both cases being 38%) in 2 hours (Kirmani & Rajasekaran, 2007). The numbers are slightly higher for the previous case studies because the average heuristic experience there was higher (more than 30 months) than that of the contestant group (23.7 months). The contestant group also included 4 contestants who had never been exposed to heuristic evaluations. Evaluators can be compared, and their performance studied, by looking at the issues identified by UI parameter (see Figure 2) and by severity (see Figure 3). For example, Evaluator 3 found twice as many interaction design issues, three times as many content issues, and five times as many navigation issues as Evaluator 18. However, Evaluator 18 found five times as many showstoppers as Evaluator 3, indicating that Evaluator 3 is good at covering the breadth of issues (across UI parameters) while Evaluator 18 is good at catching severe issues.

Figure 2. HE skills based on UI parameters.

Figure 3. HE skills based on severity.

Figure 4. HEQS%.

What is the average expertise of heuristic evaluators?

The average HEQS% is 8% for evaluations of 1 hour conducted by a group of evaluators with an average heuristic experience of 2 years and an average usability experience of 2.5 years.
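The article does not reproduce the HEQS formula in this section, but the idea of scoring an evaluator against a weighted benchmark issue list can be sketched as follows. The severity weights and the evaluator's issue counts below are illustrative assumptions for the sketch, not the actual weights defined in Kirmani & Rajasekaran (2007); the benchmark counts are taken from Table 4.

```python
# A minimal sketch of an HEQS%-style score: the evaluator's weighted
# issue score as a percentage of the weighted score of the full
# benchmark list. The weights here are ASSUMED for illustration.
SEVERITY_WEIGHTS = {"showstopper": 4, "major": 2, "irritant": 1}

def heqs_percent(found, benchmark):
    """Weighted score of an evaluator's issues as a percentage of the
    weighted score of the complete benchmark issue list."""
    score = sum(SEVERITY_WEIGHTS[sev] for sev in found)
    total = sum(SEVERITY_WEIGHTS[sev] for sev in benchmark)
    return 100.0 * score / total

# Benchmark: 21 showstoppers, 153 major issues, 33 irritants (Table 4).
benchmark = ["showstopper"] * 21 + ["major"] * 153 + ["irritant"] * 33

# A hypothetical evaluator who found 2 showstoppers, 10 major issues,
# and 3 irritants in the 1-hour session:
found = ["showstopper"] * 2 + ["major"] * 10 + ["irritant"] * 3
print(f"HEQS% = {heqs_percent(found, benchmark):.1f}%")
```

With these assumed weights the hypothetical evaluator scores close to the 8% group average reported above; changing the weights changes the scale of the score but not the relative ranking of evaluators with the same issue mix.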

What are the factors affecting heuristic evaluation expertise?

The following factors affect heuristic evaluation expertise (see Table 3):

- Usability experience
- Heuristic experience

The following factors do not affect heuristic evaluation expertise:

- Gender
- Age
- Domain experience
- Self rating and confidence to win

This study did not shed light on site complexity or previous experience on the site. Future research should look at more complex examples and prior experience with the site.

Table 3. Correlation Analysis of Demographic Data.
Parameter | Range | Average | Median | Significant/Not (at significance level of 0.1)
Gender | Female, Male | -- | -- | Not significant
Age | 20 - 34 years | 28.4 years | 29 years | Not significant
Usability Experience | 0 - 144 months | 30.7 months | 15 months | Significant
Heuristic Experience | 0 - 120 months | 23.7 months | 13 months | Significant
Domain Experience | 0 - 24 months | 4.2 months | 0 months | Not significant
Self rating and Confidence to win | 1 - 5 (5 being "I will absolutely win") | 4.1 | 4 | Not significant
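The significance tests behind Table 3 can be illustrated with a Pearson correlation and its t statistic, checked against the two-tailed critical value at the 0.1 level the table uses. The data below are hypothetical, invented for the sketch; they are not the study's raw data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def t_statistic(r, n):
    """t statistic for testing whether r differs from zero,
    with n paired observations (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical data: heuristic experience (months) vs. HEQS% for
# 8 evaluators, spanning roughly the ranges reported in Table 3.
experience = [0, 6, 12, 18, 24, 36, 60, 120]
heqs_pct = [2, 3, 5, 7, 8, 10, 13, 17]

r = pearson_r(experience, heqs_pct)
t = t_statistic(r, len(experience))
# Two-tailed critical t for df = 6 at the 0.1 level is about 1.94.
print(f"r = {r:.2f}, t = {t:.2f}, significant = {abs(t) > 1.94}")
```

In practice a statistics library (e.g., SciPy's pearsonr, which returns the p-value directly) would replace the hand-rolled t test; the pure-Python version is shown only to make each step of the computation explicit.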

What level of expertise is required for one to conduct a heuristic evaluation?

Expertise can be divided into three levels:

Any evaluator who identifies issues that improve the usability of an application is better than none, but I recommend choosing 3-5 above-average or exceptional evaluators to obtain a high-quality evaluation.

Improving severity and UI parameter categorization

From the inter-rater reliability in Table 4 we see that evaluators categorize showstoppers consistently but do not consistently categorize major issues and irritants.

Table 4. Inter-rater Reliability for Issues Based on Severity.
Benchmark | Showstopper | Major Issue | Irritant | Non-issue
Number of unique issues | 21 | 153 | 33 | 49
Complete consensus | 91% | 66% | 55% | 86%
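One plausible reading of "complete consensus" is the percentage of issues on which every judge assigned the same category; that interpretation, an assumption here, can be computed as follows with hypothetical ratings.

```python
def complete_consensus(ratings_per_issue):
    """Percentage of issues on which every rater assigned the same
    category (one plausible reading of 'complete consensus')."""
    unanimous = sum(1 for ratings in ratings_per_issue
                    if len(set(ratings)) == 1)
    return 100.0 * unanimous / len(ratings_per_issue)

# Hypothetical ratings: 4 issues, each categorized by 3 judges.
issues = [
    ["showstopper", "showstopper", "showstopper"],  # unanimous
    ["major", "major", "irritant"],                 # disagreement
    ["irritant", "irritant", "irritant"],           # unanimous
    ["major", "irritant", "non-issue"],             # disagreement
]
print(f"Complete consensus: {complete_consensus(issues):.0f}%")  # 50%
```

The same computation applies per severity level (Table 4) or per UI parameter (Table 7) by filtering the issue list to one category first.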

The current descriptions of severity (Nielsen, 1994) are shown in Table 5.

Table 5. Current Descriptions of Severity.
Severity | Description
Showstopper | A catastrophic issue that prevents users from using the site effectively and hinders users from accomplishing their goals.
Major Issue | An issue that causes a waste of time and increases the learning or error rates.
Irritant | A minor cosmetic or consistency issue that slows users down slightly. It minimally violates the usability guidelines.

After judging the competition, notes on categorization were compared and we arrived at the following grid, which improves categorization by adding two dimensions: the user and the environment (see Table 6).

Table 6. Revised Severity Ratings.
Severity | About the issue | Different users | Different environments | Yes/No*
Showstopper | Does the issue stop you from completing the task? | Can colorblind users interpret a colorful graph to complete a task? | Does the issue create an unstable environment? |
Showstopper example | The "Submit" button is not working and hinders users from sending their forms. | If colors are the only form of communicating critical data to complete an online transaction, colorblind users are forced to abandon the task. | For a healthcare site it is critical that advice given pertains to the conditions chosen. Incorrect association can cause harm. |
Major Issue | Does the issue cause you a major waste of time, increase your learning, increase the error rate, or violate a major consistency guideline? | Does the issue increase errors for older adults? Does the issue increase learning for all users? | Does the issue create an environment with higher error or learning rates? |
Major issue example | Using an "X" as an icon to zoom out breaks user mental models and increases errors considerably, especially in an environment where close is also denoted by an "X". | A low contrast between the font and the background can increase error rates for older adults. | Providing smaller than usual buttons on a mobile interface, where people are always moving, can increase error rates considerably. |
Irritant | Does the issue involve a cosmetic error, slow you down slightly, or violate a minor consistency guideline? | Does the site lack visual appeal for teenagers? | Does the issue create an environment that slows you down slightly? |
Irritant example | The label is "Symptom" when it should be plural, as the list contains many symptoms. | If the colors are not young and vibrant (e.g., pink and yellow) for a site catering to teenagers, that is a cosmetic issue. | If you are checking symptoms for your daughter, changing the content to cater to a different environment (third person) is helpful. |

*Answering positively to one or more questions is Yes.

From the inter-rater reliability in Table 7 we see that evaluators are not consistently categorizing information architecture issues.

Table 7. Inter-rater Reliability for Issues Based on UI Parameters.
Benchmark | Information Architecture | Navigation | Labeling | Other | Visual Design | Interaction Design | Content | Functionality
Number of unique issues | 13 | 20 | 16 | 12 | 52 | 52 | 33 | 9
Complete consensus | 70% | 85% | 100% | 92% | 80% | 92% | 88% | 78%

This could be due to the poor labeling of the group, as Information Architecture in usability circles denotes structure and organization, navigation, and labeling. Hence, we have decided to re-label it Structure and Organization (see Table 8).

Table 8. Revised UI Parameter.
Current UI Parameter | Current Description | Redefined UI Parameter | New Description
Information Architecture | Accurate structuring of information into groups best matching the mental model of users. | Structure and Organization | Accurate structuring of information into groups best matching the mental model of users.

Limitations of this study

This study and its results are limited by the small sample size used. Generalizing these results will require many more competitions with larger and more diverse samples.
