# Making Energy Savings Easier: Usability Metrics for Thermostats

Daniel Perry, Cecilia Aragon, Alan Meier, Therese Peffer, and Marco Pritoni

Journal of Usability Studies, Volume 6, Issue 4, August 2011, pp. 226 - 244

## Article Contents

### Description of the Metrics

In developing our metrics, we had two primary aims in mind. First, we wished to capture the unique elements of user behavior when interacting with PTs, for example, how users deal with a constrained number of buttons/controls that require the implementation of multiple system modes or with small screens mounted in inconvenient locations. Second, we wished to develop metrics that would be acceptable to thermostat manufacturers and provide them with some choice in how they record and measure usability. A usability metric must itself be usable to facilitate widespread adoption. A common drawback of many usability metrics involving task duration or number of steps to complete a task is that the value of the metric is unbounded and varies from task to task. This creates a difficulty: the metric cannot be compared on an absolute scale from one task or device to another. This challenge was recognized by Sauro and Kindlund (2005) when they devised their SUM metric. The NIST standard efficiency metric, however, does have the drawback of being unbounded (NIST, 2001).

An unbounded metric would be difficult to use in a program such as EnergyStar™ run by the EPA. The EPA and manufacturers need to define a single measure of usability to facilitate consumer understanding and to create an absolute scale of usability that is not dependent on arbitrary task length.

Additionally, our four metrics each take different inputs, all of which are highly correlated. This gives manufacturers several options for selecting the metric most appropriate to their available resources and testing environment, whether in a usability lab or remotely.

#### Metric Development

In order to create such metrics, we decided to utilize the logistic function (Verhulst, 1838):

$$f(x) = \frac{1}{1 + e^{-x}}$$

The logistic function is a sigmoid curve, commonly used in a variety of applications across multiple fields. It is often employed for the normalization of data, as it maps the real line to the interval (0, 1). Thus an unbounded domain can be mapped to a finite range. Because our data was non-negative but had an unbounded upper limit, and because higher task durations or path lengths were “worse” than lower ones, we chose a variant of the logistic function

$$g(x) = \frac{2}{1 + e^{x}}$$

that maps $[0, \infty)$ to the interval (0, 1]. In other words, a shorter time on task or path length is mapped to a value close to 1, and a longer time or path length is mapped to a value closer to 0.
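For illustration, the two mappings just described can be sketched as follows (the function names are ours, not from the study):

```python
import math

def logistic(x):
    """Standard logistic function: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decreasing_logistic(x):
    """Variant for non-negative data: maps [0, inf) to (0, 1],
    giving 1 at x = 0 and approaching 0 as x grows."""
    return 2.0 / (1.0 + math.exp(x))

print(logistic(0))             # 0.5
print(decreasing_logistic(0))  # 1.0
print(decreasing_logistic(10)) # close to 0
```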

Additionally, we wished to account for success rates on a per-trial basis (where a task “trial” is a single instance of a participant performing a task on a thermostat model, also sometimes called a “task observation”) rather than averaging over all trials of a given task. In order to accomplish this, we incorporated the task completion or success rate variable, *s*, directly into our primary equation, which we called the “*M*” statistic. The *M* statistic is calculated for each metric *i* as follows on a per-trial basis:

$$M_i = s \cdot \frac{2}{1 + e^{x_i / k_i}}$$

where

*s* = success rate for the trial

$x_i$ = distinguishing variable for metric *i*

$k_i$ = empirically determined scaling constant for metric *i*

Note that $M_i$ will always be normalized between 0 and 1. The distinguishing variables for each metric will be defined later in this section.

The success rate variable, *s*, also always falls between 0 and 1. It can be a binary variable (where *s* = 1 if the task is completed and 0 otherwise), have multiple values for partial success (e.g., if the task has several subparts that can be completed successfully), or be a continuous variable that measures percentage of task completion. For the purposes of the metrics evaluated in this paper, *s* is always either a binary (*s* = 0 or 1) or a trinary variable (*s* = 0, 0.5, or 1).

Note that the *M*-statistic combines time on task with success of the trial in an intuitive manner: If the task is not completed, so that *s* = 0, the value of the *M*-statistic is 0. Intuitively, this means that if the task was not completed, it should not matter how long the user spent attempting it; it is still a failure. If, on the other hand, the task is completed successfully, then the time on task (or other distinguishing variable such as path length) weighs into the *M*-statistic. For example, a shorter task duration will yield a higher value of *M*, a longer task duration will yield a lower value of *M*, and an uncompleted task will set *M* = 0.
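As a minimal sketch, the per-trial computation described above can be written as follows, assuming the decreasing-logistic normalization $2/(1 + e^{x/k})$ (the function name and example values are ours):

```python
import math

def m_statistic(s, x, k):
    """Per-trial M-statistic: the success rate s (in [0, 1]) scales a
    decreasing logistic transform of the distinguishing variable x
    (e.g., time on task in seconds) with scaling constant k."""
    return s * 2.0 / (1.0 + math.exp(x / k))

# A failed trial scores 0 no matter how long the user spent:
print(m_statistic(0, 120, 50))  # 0.0
# A fast, successful trial scores close to 1:
print(m_statistic(1, 10, 50))   # ~0.90
```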

The distinguishing variable in the *M*-statistic equation, *x _{i}*, is defined differently for each of the four metrics. The metrics are named

*Time & Success, Path Length, Button Mash*, and

*Confusion*. Note that “good” values of each of these metrics are close to 1, and “bad” values are close to 0. In addition, an empirically determined scaling factor,

*k*, was incorporated into each metric to maximize data dispersion. Because the metric values changed based on units chosen (hours vs. minutes, for example), we selected constant k-values empirically in order that the data would spread evenly over the entire 0-1 range and enable straightforward comparison of the metrics.

Finally, to compute the value of the metric, the *M*-statistic is computed over all trials and all tasks for a particular device model. These values are then averaged to produce the final metric value. The four metrics and their distinguishing variables are described in detail below.
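The average-over-trials step can be sketched as follows (same assumed logistic form as above; the trial data are hypothetical):

```python
import math

def metric_value(trials, k):
    """Final metric for one device model: the per-trial M-statistic is
    computed for every (s, x) pair across all tasks, then averaged."""
    ms = [s * 2.0 / (1.0 + math.exp(x / k)) for s, x in trials]
    return sum(ms) / len(ms)

# Hypothetical Time & Success trials as (s, t-in-seconds) pairs, k = 50:
trials = [(1, 10), (0, 120), (1, 50)]
print(round(metric_value(trials, 50), 3))  # 0.479
```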

#### Time & Success

For the Time & Success metric, the distinguishing variable was the time on task, *t*, measured in seconds. Starting time commenced when users were told by the experimenter to begin. End time was defined as the point at which subjects either verbally confirmed they had completed the task or verbally confirmed that they were unable to complete the given task.

$$M_1 = s \cdot \frac{2}{1 + e^{t / k_1}}$$

where

*t* = time for subject to complete the task (seconds)

$k_1$ = 50 (empirically determined constant)
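As a worked example (assuming the per-trial form $M_1 = s \cdot \frac{2}{1 + e^{t/k_1}}$ with $k_1 = 50$; the times are hypothetical): a successful trial ($s = 1$) completed in $t = 30$ seconds yields $M_1 = 2/(1 + e^{0.6}) \approx 0.71$, a successful trial taking $t = 150$ seconds yields $M_1 = 2/(1 + e^{3}) \approx 0.09$, and a failed trial yields $M_1 = 0$ regardless of its duration.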

#### Path Length

For the Path Length metric, the minimum path length, *m*, was defined as the shortest function path (e.g., series of button presses if the device had buttons) that a user could invoke to successfully accomplish a given task. Whenever possible, this path was determined by using the path given in the device user’s manual. The actual number of functions (e.g., buttons, actions such as opening cover) used, *f*, was calculated as the number of functions the user attempted while trying to complete a task. This included actions that were not successful, such as when a user attempted to press an area of the device that was not touch sensitive.

$$M_2 = s \cdot \frac{2}{1 + e^{(f - m) / k_2}}$$

where

*f* = number of buttons (functions) user actually acted upon

*m* = minimum number of buttons (functions) required to complete the task

$k_2$ = 5 (empirically determined constant)

#### Button Mash Effect

The determining variable for this metric was the sum of the number of times the user attempted to interact with the interface without actually changing the state or programming of the device. This number was also termed interaction errors or actions with no effect, $a_{ne}$. We named this metric the Button Mash effect due to the manner in which the mental state of the user at times appeared to mirror the common gaming phenomenon known as “button mashing,” in which a gamer, often a novice, presses any or all buttons possible in a frenetic attempt to affect their progress in the game (Murphy, 2004). The interaction errors similarly reflect users’ lack of understanding of how functions on the device or screen would affect their progress in a task.

$$M_3 = s \cdot \frac{2}{1 + e^{a_{ne} / k_3}}$$

where

$a_{ne}$ = number of actions with no effect

$k_3$ = 5 (empirically determined constant)

#### Confusion

The distinguishing variable of the Confusion metric was the total number of hesitations, *h*, that users incurred over the course of a task. A hesitation was defined to consist of a pause or stop in user interaction for three seconds or longer. A pause was considered an indication that the user was uncertain of the next steps to complete the task.

$$M_4 = s \cdot \frac{2}{1 + e^{h / k_4}}$$

where

*h* = total count of user hesitations of ≥ 3 seconds

$k_4$ = 2 (empirically determined constant)
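Tallying hesitations from a log of interaction timestamps can be sketched as follows (the timestamps and function name are hypothetical):

```python
def count_hesitations(event_times, threshold=3.0):
    """Count pauses of at least `threshold` seconds between consecutive
    user interactions, given timestamps in seconds from task start."""
    return sum(1 for a, b in zip(event_times, event_times[1:])
               if b - a >= threshold)

# Gaps here are 4.0s, 2.0s, 5.0s, and 1.5s; only the first and third count.
print(count_hesitations([0.0, 4.0, 6.0, 11.0, 12.5]))  # 2
```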