The Magazine of the Usability Professionals' Association
By Norbert Elliot, Robert Barat, and Kamal Joshi
Gloria Rorty, the star of this story, was curious by nature. In college, she had taken courses in art history and basic statistics. When the Web became integrated into everyday life, she found she was interested in the interstices as much as the threads. And she found herself now in a network especially suited to her education: website design, remote assessment, and quantitative analysis. She sharpened her pencil and flipped open her notebook. Here it was: her big chance.
As a usability specialist at Aesthetics.com, a company specializing in the design of websites for clients in the visual arts, Gloria had been assigned to a new client and she decided to make the most of an assessment opportunity. Executives at Postmodern and Such, a retail firm specializing in art supplies, had commissioned a redesign of their company's website. Gloria had met with the leaders and realized that these MBA types were going to want evidence that the new site served its target audience better than had the old. "When you present the changes," one had said, "be sure to tell us what the potential users thought about the original and redesigned sites. The basic numbers are good to have around."
The basic numbers, she knew, would also help her to justify the changes she was making so that this site, as well as future sites, could be imagined as a set of variables over which she had control. She had a theoretical model for an ideal site for the client, and she wanted to measure the variables that would yield that ideal.
Drawing on her current web design experience, she began with a model that would allow her to examine these relationships.
Gloria had chosen five variable sets as predictors of an effective site design: primary aesthetic variables, secondary aesthetic variables, elemental aesthetic variables, efficacy conditions, and recurrence. Because she knew that these variables had to be unpacked, she delineated each into component parts. So, for example, the primary aesthetic variables were broken down into conditions of hierarchy and proportion, efficacy was understood as the user's ability to perform an information retrieval task within the site, and recurrence was defined as the user's desire to return to the site.
Each of the variables was also understood to be associated with the holistic impression of the site. Hence, the five predictor variables (X) were—if all went well—going to be effectively associated with an outcome variable (Y), the holistic score. The holistic score would demonstrate that the new site served its target audience.
To assure that her variables would be understood by the users in her study, Gloria prepared a series of Likert-scale questions. Defining "scale" as a user judgment based on harmonious comparisons of visual elements (such as logo design), her first Likert statement was as follows: "The site is designed so that the scale of the visual elements is consistent throughout the site." Potential responses would fall between "very strongly agree" (scored 6) and "very strongly disagree" (scored 1).
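The scoring scheme is simple enough to tabulate. In this sketch, only the endpoint labels ("very strongly agree" = 6, "very strongly disagree" = 1) come from the article; the four intermediate labels are assumed to follow the usual six-point pattern.

```python
# Scoring map for the six-point Likert scale. Only the endpoint labels
# appear in the article; the four intermediate labels are assumptions.
LIKERT_SCORES = {
    "very strongly agree": 6,
    "strongly agree": 5,
    "agree": 4,
    "disagree": 3,
    "strongly disagree": 2,
    "very strongly disagree": 1,
}

def score_response(label: str) -> int:
    """Convert a verbal Likert response to its numeric score."""
    return LIKERT_SCORES[label.strip().lower()]
```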
As an adjunct instructor at her city's college, Gloria had access to two distance learning classes of seminar students who could serve as potential users of the original and redesigned sites. She wanted to use her students as reviewers but realized that their impressions of what merited a "very strongly agree" response would differ vastly, as might the other awarded scores. She therefore needed to train them to make similar judgments according to her interpretation of the scale and its values.
Figure 1. Effective site design: a variable model
Gloria taught her seminar in a distance learning format, so she could easily employ screen recording software with voiceover to tour various aspects of the sites before and after redesign. As she toured the original and revised sites, she demonstrated which site features would receive, for instance, a score of 6 on a targeted dependent variable and which would receive a score of 2. She also prepared a brief written description of each score level from 6 to 1. The online seminar reached consensus through conversation about evaluative differences and similarities.
She planned a system in which the two sites would be reviewed twice, with each set of scores recorded independently of the other. That is, thirty-one students in one graduate class, after watching her training video, would use the Likert scale and award pre-redesign scores. Then, twenty-seven different students in the other graduate class would use the same scale and award post-redesign scores. The total score range would thus run from 12 (the highest score for each of the variables in Figure 1) to 2 (the lowest score). If any pair of scores differed by more than one point (a score of 6, for example, paired with a score of 4), that pair would be considered discrepant, and she would resolve the discrepancy herself through adjudication by validating either the higher or the lower score. Such decision rules, she realized, were part of pursuing reliability. Without that elusive sense of inter-user agreement, she knew that her statements regarding post-redesign improvement could be questioned.
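The decision rule above can be sketched in a few lines. Everything here is a reconstruction from the prose: two 1-6 ratings of the same variable sum to a 2-12 score, a gap of more than one point flags the pair as discrepant, and the adjudicator's validated rating is assumed, as one plausible reading, to count for both readings.

```python
def is_discrepant(first: int, second: int) -> bool:
    """A pair of 1-6 ratings is discrepant when it differs by more than one point."""
    return abs(first - second) > 1

def pair_score(first: int, second: int, validated=None) -> int:
    """Sum two ratings into a 2-12 score.

    For a discrepant pair, an adjudicator validates either the higher or
    the lower rating; here that validated rating is assumed to stand in
    for both readings (one plausible reading of the decision rule).
    """
    if is_discrepant(first, second):
        if validated not in (first, second):
            raise ValueError("discrepant pair needs the higher or lower rating validated")
        return 2 * validated
    return first + second
```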
When the scores came back (the entire evaluative process had taken a week), she was thrilled to see that the level of agreement was quite good: one quarter of the evaluations needed no adjudication at all, and 53 percent needed adjudication on only one or two traits. With this measure of inter-user agreement in hand, she was ready for the first stage of analysis.
Gloria began with a basic question: could statistics show that the site had been effectively redesigned?
Using the SPSS 15.0 statistical software, she began her analysis of the pre-redesign (n = 31) and post-redesign (n = 27) scores. She ran an independent-samples t-test (a test of whether the mean scores of two groups differ by more than chance would allow) to see whether the difference between the pre-redesign and post-redesign sites reached statistical significance at p < .05, the conventional threshold. She prepared Table 1 to demonstrate these measures. As she looked at the table, an empirical argument that would justify the redesign of the client's site, she realized that hierarchy, proportion, color, efficacy, and holism had achieved statistically significant differences. The post-redesign means were significantly higher than the pre-redesign means, good news for both Aesthetics.com and Postmodern and Such. Indeed, in four of the five cases the probability of observing a difference that large by chance alone was less than 1 percent.
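The test she ran in SPSS can be reproduced from scratch. The sketch below computes the classic pooled-variance (Student's) form of the independent-samples t statistic; SPSS also reports a Welch variant, and the article does not say which she read off.

```python
import math

def independent_t(sample_a, sample_b):
    """Student's independent-samples t statistic with pooled variance.

    Returns (t, degrees_of_freedom); the p value would then be read
    from the t distribution with that many degrees of freedom.
    """
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    # Pool the two variances, weighted by their degrees of freedom
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2
```

With her group sizes, the test has 31 + 27 − 2 = 56 degrees of freedom.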
When a significant difference was not achieved, she reasoned, there could have been two causes: the site was in solid shape to begin with, or her explanation of the variables was not clear enough for users to discern a design shift.
Table 1: Pre-redesign (n = 31) and post-redesign (n = 27) scores: means and tests of significance.

| Variable | Mean (Pre / Post) | Range (Pre / Post) | Std. Dev. (Pre / Post) | t test | p | Statistically Significant? |
|---|---|---|---|---|---|---|
| 1.0 Primary Aesthetic Variables | | | | | | |
| 1.1 Hierarchy | 7.61 / 9.51 | 12, 4 / 11, 6 | 1.33 / 1.93 | 4.22 | <.0001 | Yes |
| 1.2 Proportion | 8.12 / 9.33 | 11, 6 / 12, 7 | 1.45 / 1.68 | −2.92 | .005 | Yes |
| 2.0 Secondary Aesthetic Variables | | | | | | |
| 2.1 Scale | 7.45 / 8.25 | 10, 2 / 12, 2 | 1.87 / 2.44 | 1.42 | .16107 | No |
| 2.2 Tempo | 8.48 / 9.51 | 12, 2 / 12, 4 | 2.33 / 2.31 | −1.96 | .0964 | No |
| 3.0 Elemental Aesthetic Variables | | | | | | |
| 3.1 Color | 8.51 / 9.85 | 11, 2 / 12, 5 | – / 1.89 | −2.28 | .0264 | Yes |
| 3.2 Writing | 7.41 / 8.11 | 9, 6 / 12, 2 | 1.05 / 2.3 | 1.4 | .1396 | No |
| 4.0 Efficacy | 6.45 / 9.62 | 12, 2 / 12, 7 | 3.45 / 1.36 | −4.48 | <.0001 | Yes |
| 5.0 Recurrence | 8.32 / 9.03 | 12, 6 / 12, 4 | 1.34 / 2.05 | −1.57 | .1225 | No |
| 6.0 Holism | 7.8 / 9.25 | 10, 4 / 11, 8 | 1.66 / 1.09 | −3.87 | .0003 | Yes |
Was the model sound? A regression model would allow her to see whether the five predictor variables (X) were associated with the outcome variable (Y), the holistic score. The pre-redesign data failed to yield a coherent model: the all-purpose F test (1.70), another basic formula for comparing groups, fell short of the critical value and of the desired significance level of .05 (p = .153). The post-redesign model, however, cohered remarkably well, accounting for 74 percent of the variance through the five variables and reaching significance at p = .001.
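The two figures she read off the regression output, the F test and the share of variance explained, are related by a standard formula. A minimal sketch, assuming k predictors and n observations (her post-redesign model had k = 5 and n = 27); `r_squared` is shown for the single-predictor case:

```python
def r_squared(x, y):
    """Squared Pearson correlation: share of variance in y explained by one x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def model_f(r2, n, k):
    """Overall F statistic for a regression with k predictors and n observations."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))
```

Plugging in her reported values, `model_f(0.74, 27, 5)` comes to roughly 12, comfortably above the critical value for F(5, 21), which is consistent with the p = .001 she reports.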
In traditional usability studies, a single holistic impression, along with a task analysis or two, was often all that resulted. Gloria realized she was onto something important here: a model that could yield solid statistical evidence that her redesign was effective.
Questions, nevertheless, remained. The number of reviewers was small, just barely reaching the thirty or so evaluators needed to justify her model; with fewer, her statistical tests would break down. But time and expense constraints made it impossible to add more reviewers.
She returned to her spreadsheet and decided to use resampling statistics. (Devised by the late Julian Simon, professor of business administration at the University of Maryland, resampling exploits the computational power of the desktop computer to draw repeated random samples from the observed data.) Would the differences from pre-redesign to post-redesign remain statistically significant if there were, for example, 20,000 reviewers? That is, if the scores of the thirty-one reviewers in the pre-redesign sample and the twenty-seven reviewers in the post-redesign sample were tossed into a wicker basket and redrawn 20,000 times in random combinations, would the differences between the pre-redesign and post-redesign scores remain?
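The wicker-basket idea can be sketched as a permutation test in a few lines of Python: pool the scores, reshuffle them into two groups of the original sizes, and report the share of reshuffles whose mean difference is at least as large as the observed one. Simon's method and the SAS implementation she used differ in details, so this is an illustration, not a reproduction.

```python
import random

def permutation_p(pre, post, n_resamples=20000, seed=1):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(post) / len(post) - sum(pre) / len(pre))
    pooled = list(pre) + list(post)
    n_pre = len(pre)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        group_a = pooled[:n_pre]
        group_b = pooled[n_pre:]
        diff = abs(sum(group_b) / len(group_b) - sum(group_a) / len(group_a))
        if diff >= observed:
            extreme += 1
    # p value: share of reshuffles at least as extreme as the observed split
    return extreme / n_resamples
```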
After running this new simulation in SAS 9.2, Gloria created Table 2.
Table 2: Pre-redesign and post-redesign scores: significance with 20,000 samples.

| Variable | Traditional Analysis p (original sample) | Statistically Significant? | Resampling Analysis p | Statistically Significant? |
|---|---|---|---|---|
| 1.0 Primary Aesthetic Variables | | | | |
| 1.1 Hierarchy | <.0001 | Yes | .0005 | Yes |
| 1.2 Proportion | .005 | Yes | .0388 | Yes |
| 2.0 Secondary Aesthetic Variables | | | | |
| 2.1 Scale | .16107 | No | .6625 | No |
| 2.2 Tempo | .0964 | No | .4691 | No |
| 3.0 Elemental Aesthetic Variables | | | | |
| 3.1 Color | .0264 | Yes | .165 | No |
| 3.2 Writing | .1396 | No | .6034 | No |
| 4.0 Efficacy | <.0001 | Yes | .0005 | Yes |
| 5.0 Recurrence | .1225 | No | .5554 | No |
| 6.0 Holism | .0003 | Yes | .0022 | Yes |
Resampling confirmed that her results held up despite the limited sample: the estimates from her small group of reviewers matched those from the 20,000 observations resampled with replacement. Only the color variable failed to achieve a statistically significant difference under resampling conditions.
One final set of questions remained for Gloria: what were the relationships among these variables? Since the model she had built was complex, how could she better understand the hidden relationships between the holistic score and the other five variables?
She returned to her spreadsheet and decided to run a procedure employing neural network analysis. Widely used for a variety of engineering and financial applications, neural networks are metaphorically patterned after the learning functions of the human brain: the program learns patterns and applies them to new situations. Once an acceptable neural network structure has been trained and tested, Gloria realized, it could then be used to estimate the relative importance of each independent variable to the outcome variable.
Gloria uploaded the spreadsheet into her NeuroSolutions software and ran the data twelve times for both the pre-redesign and post-redesign scores. The network was trained until the data converged; that is, the readjustment of the weights (the training) was set to stop when the cumulative error between the predicted and desired output values dropped below 0.01. As with the regression model, the five independent variables (X) were treated as inputs associated with the dependent variable (Y), the holistic output. What was going on, Gloria wanted to know, inside the black box of relationships within her model?
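NeuroSolutions' sensitivity report is proprietary, but the underlying idea, estimating each input's relative contribution by perturbing it and watching the output move, can be sketched generically. In this sketch, `predict` stands in for the trained network; the linear stand-in model in the usage note is illustrative only.

```python
def input_sensitivity(predict, rows, delta=0.1):
    """Estimate each input's relative contribution to the model output.

    Perturbs one input at a time by `delta`, accumulates the absolute
    change in the prediction across all rows, and normalizes the totals
    to percentages (a one-at-a-time sensitivity analysis).
    """
    n_features = len(rows[0])
    effects = [0.0] * n_features
    for row in rows:
        base = predict(row)
        for j in range(n_features):
            bumped = list(row)
            bumped[j] += delta  # nudge one input, hold the rest fixed
            effects[j] += abs(predict(bumped) - base)
    total = sum(effects)
    return [100.0 * e / total for e in effects]
```

For a linear stand-in model such as `lambda x: 3 * x[0] + x[1]`, the procedure attributes roughly 75 percent of the output movement to the first input and 25 percent to the second, matching the 3:1 weights.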
Gloria created one final table in which she recorded the results of the analysis.
The neural network analysis suggested that the contributions of the variables to the model were uneven, and that lack of uniformity pointed to an erratic pre-redesign site, which was, of course, why the client had come to her in the first place. She could now demonstrate to her client that the original site had needed redesign. In addition, the estimated standard deviations in the pre-redesign model were large, yet another sign of erratic design. Truly, the original site was in bad shape.
In contrast, just as the regression model had demonstrated coherence in the post-redesign analysis, the neural network analysis revealed that the new site was more cohesive. The wildly uneven estimated mean contributions had disappeared, and the tighter standard deviations suggested that each independent variable now contributed evenly to the holistic impact of the redesign.
Table 3: Estimated contribution of input variables to output variable (Holism)

| Input Variables | Pre-redesign Mean | Pre-redesign Std. Dev. | Post-redesign Mean | Post-redesign Std. Dev. |
|---|---|---|---|---|
| 1.0 Primary Aesthetic Variables | | | | |
| 1.1 Hierarchy | 6.6 | 4 | 12.6 | 4.2 |
| 1.2 Proportion | – | 4.8 | 11 | 2.1 |
| 2.0 Secondary Aesthetic Variables | | | | |
| 2.1 Scale | 11.1 | 3.5 | 13 | 1.9 |
| 2.2 Tempo | 14.2 | 3.8 | 13.8 | 2.8 |
| 3.0 Elemental Aesthetic Variables | | | | |
| 3.1 Color | 10.7 | 3.1 | 14.5 | 2.6 |
| 3.2 Writing | 10.4 | 3.6 | 12.2 | 2.2 |
| 4.0 Efficacy | 14.7 | 3.3 | 11.5 | 4.2 |
| 5.0 Recurrence | 8.9 | 4.6 | 11.5 | 2.5 |
It had been a long day, but there was one final step: the design of her argument, for which she used a model based on the work of Stephen E. Toulmin.
While her sample had been small, a qualification of her claim, she had employed resampling statistics to demonstrate that the model held under the most demanding conditions: the original differences remained in all but a single case. To further warrant her evidence, she had used neural network analysis to demonstrate that the regression model, one accounting for 74 percent of the variance and significant at p = .001, held with reduced standard deviations under post-redesign conditions.
But, beyond the analysis for the present client, Gloria knew there would be additional benefits. A new database had been built, and she could add to it as she designed more sites, strengthened her techniques, and refined her model. She also knew that she now had a method of dealing with the inherent problem of small sample sizes in empirical usability studies. And she had an analytic technique that could be constantly trained and updated to test and refine her models as they grew. She was using the machine to help her think.
That was, after all, the idea.
About the Authors: The authors are all researchers at New Jersey Institute of Technology. Norbert Elliot is professor of English in the Department of Humanities. Robert Barat is professor of Chemical Engineering in the Otto York Department of Chemical Engineering. Kamal Joshi is database manager in the Office of Institutional Research and Planning.
Usability Professionals' Association
promoting usability concepts and techniques worldwide
User Experience Magazine is by and about usability professionals, featuring significant and unique articles dealing with the broad field of usability and the user experience.
http://www.usabilityprofessionals.org/upa_publications/user_experience/
This article was originally printed in User Experience Magazine, Volume 7, Issue 3, 2008.
http://www.usabilityprofessionals.org/upa_publications/past_issues/20083.html.
© Usability Professionals' Association
Contact UPA at http://www.usabilityprofessionals.org/about_upa/contact_upa.html