Heuristic Evaluation Quality Score (HEQS): Defining Heuristic Expertise
Journal of Usability Studies, Volume 4, Issue 1, November 2008, pp. 31-48
Results
Overall, the group found an average HEQS% of 8% in 1 hour, ranging from 2% to 17% (see Figure 4). This compares with previous studies of evaluators finding 24% and 25% (the highest in both cases being 38%) in 2 hours (Kirmani & Rajasekaran, 2007). The numbers are slightly higher for the previous case studies because the average heuristic experience was higher (more than 30 months) than that of the contestant group (23.7 months). The contestant group also included 4 contestants who had never been exposed to heuristic evaluations.

Evaluators can be compared, and their performance studied, by looking at the issues they identified by UI parameter (see Figure 2) and by severity (see Figure 3). For example, Evaluator 3 found twice as many interaction design issues, three times as many content issues, and five times as many navigation issues as Evaluator 18. However, Evaluator 18 found five times as many showstoppers as Evaluator 3, indicating that Evaluator 3 is good at covering the breadth of issues (across UI parameters) while Evaluator 18 is good at covering severe issues.

Figure 2. HE skills based on UI parameters.

Figure 3. HE skills based on severity.

Figure 4. HEQS%.
What is the average expertise of heuristic evaluators?
The average HEQS% is 8% for evaluations of 1 hour conducted by a group of evaluators with an average heuristic experience of 2 years and an average usability experience of 2.5 years.
What are the factors affecting heuristic evaluation expertise?
The following factors affect heuristic evaluation expertise:
- Usability experience: The relationship between usability experience and heuristic evaluation expertise is significant (see Table 3). Thirty percent of the variation in heuristic evaluation expertise is related to usability experience. The more usability experience an evaluator has, the better the quality of the evaluation.
- Heuristic experience: The relationship between heuristic experience and heuristic evaluation expertise is significant. Seventeen percent of the variation in heuristic evaluation expertise is related to heuristic experience. The more heuristic experience an evaluator has, the better the quality of the evaluation.
- Domain experience: Domain experience did not significantly impact expertise in this study. This could be due to the non-technical nature of the website: identifying conditions for a set of symptoms is understood worldwide and does not require a lot of learning. Other studies, however, have shown that domain experts are better evaluators (Anthanasis & Andreas, 2001).
- Training: Training does impact heuristic evaluation expertise. The quality of the evaluation improves with training; a 48.4% improvement was seen in a study conducted on a group of 26 evaluators (Kirmani & Rajasekaran, 2007).
The following factors do not affect heuristic evaluation expertise:
- Age: Age does not affect heuristic evaluation expertise.
- Gender: Gender does not affect heuristic evaluation expertise.
- Self rating: Rating or proclaiming oneself an expert does not correlate with heuristic expertise. Eighty-five percent of 20 contestants felt confident of winning the competition and rated themselves 4 or higher on a scale of 5.
This study did not shed light on site complexity or previous experience on the site. Future research should look at more complex examples and prior experience with the site.
| Parameter | Range | Average | Median | Significant/Not (at significance level of 0.1) |
|---|---|---|---|---|
| Gender | Female, Male | -- | -- | |
| Age | 20 - 34 years | 28.4 years | 29 years | |
| Usability Experience | 0 - 144 months | 30.7 months | 15 months | Significant |
| Heuristic Experience | 0 - 120 months | 23.7 months | 13 months | Significant |
| Domain Experience | 0 - 24 months | 4.2 months | 0 | |
| Self rating and Confidence to win | 1 - 5 (5 being "I will absolutely win") | 4.1 | 4 | |
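The "percent of the variation" figures reported for usability and heuristic experience can be read as coefficients of determination (r squared). That reading is an assumption, since the statistic is not named here. The following minimal Python sketch shows how such a figure could be computed; the function name `r_squared` is hypothetical and the sample values are made up for illustration, not taken from the study.

```python
from math import sqrt


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the sequences has zero variance")
    r = sxy / sqrt(sxx * syy)
    return r * r


if __name__ == "__main__":
    # Hypothetical values for illustration only (NOT the study's data).
    experience_months = [0, 6, 12, 24, 48, 96]
    heqs_percent = [3, 5, 4, 9, 8, 14]
    share = r_squared(experience_months, heqs_percent)
    print(f"Share of variation related to experience: {share:.0%}")
```

Under this reading, a result of 0.30 would correspond to the thirty percent reported for usability experience.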
What level of expertise is required for one to conduct a heuristic evaluation?
Expertise can be divided into three levels:
- Below average evaluators: Evaluators finding an HEQS% of less than 8% are below average performers.
- Above average evaluators: Evaluators finding an HEQS% of 8% or more are above average performers.
- Exceptional evaluators: Evaluators finding an HEQS% of 15% or more are exceptional performers. The 15% threshold was arrived at by selecting the top 5% of evaluators, given that the highest performers have been identifying an HEQS% of 17% - 19% in 1 hour.
Any evaluator who identifies issues that improve the usability of an application is better than none, but I recommend choosing 3-5 above average or exceptional evaluators to obtain an evaluation of high quality.
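The three levels above can be sketched as a small Python function. This is an illustration only: the function name `classify_evaluator` is hypothetical, and it assumes that an evaluator who meets both the 8% and the 15% thresholds is reported at the higher level.

```python
def classify_evaluator(heqs_percent: float) -> str:
    """Map a 1-hour HEQS% to one of the three expertise levels."""
    if heqs_percent < 0:
        raise ValueError("HEQS% cannot be negative")
    # Assumption: the highest matching level wins, so 15% or more is
    # reported as exceptional rather than merely above average.
    if heqs_percent >= 15:
        return "exceptional"
    if heqs_percent >= 8:
        return "above average"
    return "below average"


if __name__ == "__main__":
    # The lowest, average, and highest HEQS% reported for the contestant group.
    for score in (2, 8, 17):
        print(score, classify_evaluator(score))
```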
Improving severity and UI parameter categorization
From the inter-rater reliability in Table 4 we see that evaluators can categorize showstoppers consistently but are not consistently categorizing major issues and irritants.
| Benchmark | Showstopper | Major Issue | Irritant | Non-issue |
|---|---|---|---|---|
| Number of unique issues | 21 | 153 | 33 | 49 |
| Complete consensus | 91% | 66% | 55% | 86% |
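A consensus rate of this kind could be computed as sketched below. The sketch assumes that "complete consensus" means every evaluator assigned the same category as the benchmark; that definition, the function name `consensus_by_benchmark`, and the sample ratings are all hypothetical.

```python
from collections import defaultdict


def consensus_by_benchmark(issues):
    """Percent of issues per benchmark category on which all raters agreed.

    `issues` is an iterable of (benchmark_category, rater_labels) pairs.
    """
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for benchmark, labels in issues:
        totals[benchmark] += 1
        # Assumption: consensus requires every label to match the benchmark.
        if labels and all(label == benchmark for label in labels):
            agreed[benchmark] += 1
    return {b: 100.0 * agreed[b] / totals[b] for b in totals}


if __name__ == "__main__":
    # Made-up ratings for illustration (NOT the study's data).
    sample = [
        ("Showstopper", ["Showstopper", "Showstopper"]),
        ("Major Issue", ["Major Issue", "Irritant"]),
        ("Major Issue", ["Major Issue", "Major Issue"]),
    ]
    print(consensus_by_benchmark(sample))
```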
The current descriptions of severity (Nielsen, 1994) are shown in Table 5.
| Severity | Description |
|---|---|
| Showstopper | A catastrophic issue that prevents users from using the site effectively and hinders users from accomplishing their goals. |
| Major Issue | An issue that causes a waste of time and increases the learning or error rates. |
| Irritant | A minor cosmetic or consistency issue that slows users down slightly. It minimally violates the usability guidelines. |
After judging the competition, we compared notes on categorization and arrived at the following grid, which improves categorization by adding two dimensions: the user and the environment (see Table 6).
| Severity | About the issue | Different users | Different environments | Yes/No* |
|---|---|---|---|---|
| Showstopper | Does the issue stop you from completing the task? | Can colorblind users interpret a colorful graph to complete a task? | Does the issue create an unstable environment? | |
| Showstopper Example | The “Submit” button is not working and hinders users from sending their forms. | If colors are the only form of communicating critical data to complete an online transaction, colorblind users are forced to abandon the task. | For a healthcare site it is critical that advice given pertains to the conditions chosen. Incorrect association can cause harm. | |
| Major Issue | Does the issue cause you a major waste of time, increase your learning, increase the error rate, or violate a major consistency guideline? | Does the issue increase errors for older adults? Does the issue increase learning for all users? | Does the issue create an environment with a higher error or learning rates? | |
| Major Issue Example | Using an “X” as an icon to zoom out breaks user mental models and increases errors considerably, especially in an environment where close is also denoted by an “X”. | A low contrast between font and the background can cause an increase in error rates for older adults. | Providing smaller than usual buttons on a mobile interface where people are always moving can increase error rates considerably. | |
| Irritant | Does the issue involve a cosmetic error, slow you down slightly, or violate a minor consistency guideline? | Does the site not have visual appeal to teenagers? | Does the issue create an environment that slows you down slightly? | |
| Irritant Example | The label is “Symptom” when it should be plural because there are many symptoms. | If the colors are not young and vibrant (e.g., pink and yellow) for a site catering to teenagers, it is a cosmetic error. | If you are checking symptoms for your daughter, changing the content to cater to a different environment (third person) is helpful. | |
*Answering positively to one or more questions is Yes.
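One way to apply the grid is sketched below in Python. It follows the footnote (a severity row counts as Yes if any of its questions is answered positively) and adds two assumptions that are not stated in the article: rows are checked from most to least severe, and an issue with no Yes rows is treated as a non-issue. The names `SEVERITY_ORDER` and `categorize` are hypothetical.

```python
# Assumption: rows are checked from most severe to least severe.
SEVERITY_ORDER = ["Showstopper", "Major Issue", "Irritant"]


def categorize(answers):
    """Return the most severe row with at least one positive answer.

    `answers` maps a severity name to three booleans, one per column:
    about the issue, different users, different environments.
    """
    for severity in SEVERITY_ORDER:
        # Per the footnote, one or more positive answers means Yes.
        if any(answers.get(severity, ())):
            return severity
    # Assumption: no positive answers at any level means a non-issue.
    return "Non-issue"


if __name__ == "__main__":
    low_contrast_text = {
        "Showstopper": [False, False, False],
        "Major Issue": [False, True, False],  # raises errors for older adults
        "Irritant": [True, False, False],
    }
    print(categorize(low_contrast_text))
```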
From the inter-rater reliability in Table 7 we see that evaluators are not consistently categorizing information architecture issues.
| Benchmark | Information Architecture | Navigation | Labeling | Other | Visual Design | Interaction Design | Content | Functionality |
|---|---|---|---|---|---|---|---|---|
| Number of unique issues | 13 | 20 | 16 | 12 | 52 | 52 | 33 | 9 |
| Complete consensus | 70% | 85% | 100% | 92% | 80% | 92% | 88% | 78% |
This could be due to poor labeling of the group: in usability circles, Information Architecture denotes structure and organization, navigation, and labeling. Hence, we have decided to relabel it as Structure and Organization (see Table 8).
| Current UI Parameter | Current Description | Redefined UI Parameter | New Description |
|---|---|---|---|
| Information Architecture | Accurate structuring of information into groups best matching the mental model of users. | Structure and Organization | Accurate structuring of information into groups best matching the mental model of users. |
Limitations of this study
This study and its results are limited by the small sample size used. Generalizing these results will require many more competitions with a larger and more diverse sample.
