JUS - Journal of Usability Studies
An international peer-reviewed journal

Conducting Iterative Usability Testing on a Web Site: Challenges and Benefits

Jennifer C. Romano Bergstrom, Erica L. Olmsted-Hawala, Jennifer M. Chen, and Elizabeth D. Murphy

Journal of Usability Studies, Volume 7, Issue 1, November 2011, pp. 9 - 30


Project Structure and Procedure

Four successive usability tests were conducted over an 18-month period; each test was tied to a corresponding development cycle of the new AFF. Given the scale of the project, which spanned multiple years, delivery of functionality was divided into three separate cycles. See Table 1.

Table 1. Development Cycle and Usability Iterations


See Figure 1 for a timeline of the project. Iteration 1 (Conceptual Design) was a low-fidelity usability test of an early conceptual design that was represented on paper. Iteration 2 (Cycle 1) tested a design of slightly higher fidelity that was presented as static images on a computer screen. The user interface was semi-functional in Iteration 3 (Cycle 2) as it presented participants with some clickable elements. Iteration 4 (Cycle 3) was even more functional with all elements clickable but with fewer data sets loaded into the application than the live site. In each iteration, we evaluated the user interface of the new AFF Web site by examining participants’ success, satisfaction, and understanding of the site, as measured by their performance on tasks and self-rated satisfaction.

Prior to beginning the usability tests, the usability team met with the AFF team to discuss the test plan for the iterative tests and to create a set of participant tasks that could be used across all iterations. Our objective was to create realistic tasks that people typically attempt on the AFF Web site. The AFF team had ideas about what tasks participants should perform, but their proposed tasks seemed only to test whether the user interface worked in the way the programmers had intended, and the wording of the proposed task scenarios told participants too much about what they should do on the site. For example, most of the AFF team's suggested tasks contained precise terminology that could easily be found in labels and links on the Web site. Because the purpose of usability testing is to allow participants to interact with an interface naturally and freely and to test what users typically do on a site, it took some time to develop tasks that succinctly examined typical activities without "giving away" too much information to participants.1


Figure 1. Project Timeline: I1 = Iteration 1, I2 = Iteration 2, I3 = Iteration 3, I4 = Iteration 4

Although we intended to use the same tasks throughout the iterations, it was important to test whatever functionality became available, and only certain data sets had been uploaded to the site at any given time. Thus, beginning in Iteration 2 and continuing throughout the iterations, tasks had to be tweaked and new ones created to test the new functionality with the data sets that were available. We knew that tasks should remain as similar as possible to allow comparison across iterations, but given the iterative nature of the software development cycle, data was often unavailable or limited, and with each iteration providing new or updated functionality, it was not always possible for the AFF team to focus on the same tasks from one testing cycle to the next. When they gave us the screen shots they had developed, we adjusted the tasks to fit them. In hindsight, we should have worked more closely with the AFF team and encouraged them to load data that we could use in our tasks, so that the tasks would change little from one iteration to the next; for example, the geography, year, and topics should have stayed constant to make comparisons across iterations more reliable. In future testing, we plan to set this "consistency standard" with designers and developers before they create screens to test: if only one data set is to be loaded, it should be the same data set that was available in an earlier round of testing. However, it is unrealistic to expect that one set of tasks will remain relevant as more functionality is added and as the design changes in response to earlier iterations. Keeping a few common tasks as others are replaced is a realistic expectation.
See Table 2 for tasks (and accuracy, as detailed below) that were repeated across iterations.

Table 2. Mean Accuracy for Repeated Tasks, Across Usability Studies


Tasks (accuracy reported for each of Iteration 1, Iteration 2, Iteration 3, and Iteration 4):

1. Imagine that you are thinking about moving to Virginia and you want to do extensive research on that area before moving. A friend has recommended this American FactFinder site to you. Here is the Main page. How would you start your search?

2. Your friend recommended this American FactFinder site to you. Look for as much information as possible in California and Texas, including education, income, children, families, language, poverty, and elderly.*

3. You decide that there is just way too much information here and you want to narrow your results to just California. What would you do?

4. You are interested in information about your sister's neighborhood. You want to get as much information as you can about her home and the area that she lives in. She lives at 4237 Peapod Lane, Fairfax, VA 22030. How would you find all the available information about her neighborhood?

5. You are doing a report on education in the United States and want to know how many men in California and Texas were White and college educated in 2005.**

5a. Is there a way to visualize this information?

6. You've already done a search on place of birth by sex in the United States. You are now looking at a table of your results. You would like to see a map of all males by birth location, specifically in Florida. What would you do?

7. You are currently looking at a map of males in poverty. How would you view a map of the same information but for females?***

8. You want to change the colors on the map to fit better with the presentation you will be giving. How do you do this?

9. How would you add Alaska and Hawaii to this table?****

10. You don't want to see the payroll information. What would you do to simplify these results?***

11. You decide that payroll is important for your project. How would you get that information back on the screen?***

12. How would you make a map of your results?

13. How would you zoom in to include only Florida on your map?****

14. You want to see a map of Sarasota, FL, but you don't know where it is. How would you find Sarasota?****

15. Now you decide that it is late and you want to go home. You plan to come back tomorrow and know you will want to access the same exact search results. What would you do?

* In Iteration 1, wording was slightly different.
** In Iteration 2, Nevada was used instead of California and Texas, and the year was changed from 2005 to 2006.
*** In Iteration 4, wording was slightly different.
**** In Iteration 4, different states were used.

In Iteration 1, testing took place over four days, and all members of the AFF team attended some of the sessions (across all iterations, attendance ranged from one to five AFF members). Our results from the first round of testing reaffirmed feedback from other parts of the Census Bureau and from stakeholders, and they led to an overhaul of the site in which the design was scaled back. The AFF team also had a number of staff changes, including the addition of a seasoned expert who had previously contracted at the Census Bureau and was brought back for this project. Together, the overhaul and the staff changes led to a six-month break in the proposed iterative cycle.

In Iteration 2, testing took longer than it had in Iteration 1 because it was difficult and time-consuming to find and recruit experts with the experience we needed. We tested over the course of one month and, as with Iteration 1, some members of the AFF team attended all sessions. We sent a preliminary report to the AFF team eight days later and met with them two weeks after that to recap findings and plan the next test. During that period, the developers worked on the back end of the site and on more Web pages and functionality.

Iteration 3 took place in two parts. First, we held nine sessions with participants whom we recruited, and some members of the AFF team attended the sessions. Second, a three-day conference took place at the Census Bureau one week later, and its attendees were avid AFF users (and thus the experts we were seeking). We learned about this conference from a person unrelated to the project and seized the opportunity to work with these users: we recruited four additional experts from the conference to take part in Iteration 3 testing, and they tested the user interface in one day. Members of the AFF team were unable to attend these sessions due to the short notice, but we decided it was important to include these experts in our sample. Their results were added to the final report and confirmed what we had seen with the previous experts.

In Iteration 4, testing took place over a period of two weeks, and as with Iterations 1 and 2, members of the AFF team attended all sessions.

Novice participants were recruited for all iterations via the Census Bureau Human Factors and Usability Research Group's database. The database holds information about potential and past study participants and is maintained solely by staff of the Human Factors and Usability Research Group. Information about study participants includes their age, education, level of familiarity with Census Bureau sites and surveys, and level of computer experience. All novice participants reported being unfamiliar with the AFF Web site and having at least one year of computer and Internet experience. Experts were (a) Census Bureau employees who reported using AFF regularly but were not involved in the AFF redesign, (b) Census Bureau Call Center and State Data Center employees who assisted the public with finding information on the AFF site, and (c) graduate students in the Washington, DC, area who reported using AFF regularly as part of their studies. See Table 3 for participant demographics for each study, Table 4 for accuracy and satisfaction across all iterations, and Table 2 for accuracy for repeated tasks.

Table 3. Participants’ Self-Reported Mean (and Range) Demographics for Each Usability Study


Tasks were designed to expose participants to the AFF user interface without leading them in a step-by-step fashion.  For each participant, the test administrator rated each task completion as a success or a failure.  A success involved the participant’s successful navigation of the user interface and identification of the correct piece of information on the Web site based on the task objective.  If the participant struggled to find the information but eventually arrived at the correct response, this effort was marked as a success.  A failure was recorded when the user interface presented obstacles to the participant’s attempts to identify or find the correct piece of information, and thus the participant did not achieve the task objective.  The average accuracy score is reported in two different ways: (a) mean accuracy across the participants and (b) mean accuracy across the tasks.
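The two averaging schemes described above agree when every participant attempts every task, but they can diverge when the data are unbalanced (e.g., a participant skips a task). A minimal sketch of both computations, using made-up success/failure scores rather than the study's actual data:

```python
# Success (1) / failure (0) per participant per attempted task.
# Illustrative scores only -- not the study's actual results.
scores = {
    "P1": {"task1": 1, "task2": 0, "task3": 1},
    "P2": {"task1": 1},                          # P2 attempted only task1
    "P3": {"task1": 0, "task2": 0, "task3": 1},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# (a) Mean accuracy across participants: each participant's own
# success rate, averaged over participants.
per_participant = [mean(tasks.values()) for tasks in scores.values()]
accuracy_by_participant = mean(per_participant)

# (b) Mean accuracy across tasks: each task's success rate over the
# participants who attempted it, averaged over tasks.
task_ids = sorted({t for tasks in scores.values() for t in tasks})
per_task = [
    mean(tasks[t] for tasks in scores.values() if t in tasks)
    for t in task_ids
]
accuracy_by_task = mean(per_task)

print(round(accuracy_by_participant, 3))
print(round(accuracy_by_task, 3))
```

With unbalanced data like this, the two means differ because scheme (a) weights each participant equally while scheme (b) weights each task equally, which is why the report states both.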

Table 4. Mean Accuracy and Mean (Standard Deviation) Satisfaction, Across All Participants, Across All Usability Studies


Participants began by reading each task aloud.  While participants were completing each task, the test administrator encouraged them to think aloud using the communicative think-aloud protocol (Boren & Ramey, 2000; Olmsted-Hawala, Murphy, Hawala, & Ashenfelter, 2010).  When the participants found the answer they were seeking, they told the test administrator what it was, and the task ended.    

After completing the usability session, each participant indicated his/her satisfaction with various aspects of the Web site using a tailored, 10-item satisfaction questionnaire (displayed in Figure 2), which was loosely based on the Questionnaire for User Interaction Satisfaction (QUIS; Chin, Diehl, & Norman, 1988).  The test administrator then asked participants debriefing questions about specific aspects of the site and/or about specific things the participant said or did during the session.  Upon completion of the entire session, participants received monetary compensation.  Each session was audio and video recorded.

Observers from the AFF team watched the usability tests on a television screen and computer monitor in a separate room. They did not interact directly with the participants, but they had the opportunity to ask questions of them: observers wrote questions for participants to answer during the debriefing, and the test administrator asked the participants those questions. At the end of each session, the test administrator and observers discussed the findings from that session and compared them to findings from other sessions. Early on, we realized that the development team was benefiting from watching participants struggle with their prototypes, and together we discussed the issues participants were having and how to resolve them. Development team members attended consistently throughout the iterations: each member came to at least a few sessions, and at least one team member was present at every session. This attendance contributed to the team members' commitment and collaboration.



1See Olmsted-Hawala, Romano Bergstrom, & Murphy (under review) for details about the communication between the design-and-development team and the usability team.


Figure 2. Satisfaction Questionnaire (based on the QUIS; Chin et al., 1988)

