The Need for a Standardized Set of Usability Metrics
Arnold M. Lund
U S WEST Advanced Technologies
One topic that consistently stimulates active discussion on bulletin boards frequented by human factors professionals is the question of metrics for assessing usability. Practitioners, especially in usability programs that are just starting within corporations, regularly seek usability metrics that can be used to evaluate new products and systems. When a posting is made, the response is usually immediate and heated. Some participants in the discussion point to existing tools designed to assess the usability of traditional software systems. Others argue that metrics only serve to hide the important issues. There is sometimes a suspicion that some may be afraid to have their work objectively evaluated and compared with the work of others. Still others point to or defend one of the many different definitions of usability. There is usually give and take on the "objectivity" of subjective ratings versus the virtue of measurements of time and errors. For many of us, the state of the art is that there are many potential benefits to having a valid set of usability metrics, but the kinds of metrics that would be useful do not exist.
Within industry, the benefits of having reliable, valid metrics for usability are obvious. Intelligent decisions about the level of resources to invest in improving usability can be made based on the level of usability desired, and these levels can be specified in product or system requirements. Products can be compared based on their usability across time, and usability can be measured to determine whether a new product has a competitive advantage in this area. Usability should be more likely to be highlighted in advertising, since testing data with external validity will stand behind the advertisements. Purchasing organizations such as the government can specify a level of usability as part of requests for proposals and in requirements. From the perspective of cost-justifying usability, it should be possible over time to demonstrate the relationship between various levels of usability and product revenues and customer loyalty.
Within academic and industrial research environments, the ability of many areas of research to influence our field is being limited by our lack of metrics. A generally agreed-upon set of dependent variables that can be compared across experiments is what has allowed science to move engineering ahead. In our field, when the term usability is used, it is used in a variety of ways, and when it is measured, it is operationalized with similar variation. The ever-increasing set of user interface methodologies cannot be compared because there is no common dependent variable for comparing them. While some of the methodologies address ease of use, many do not address other aspects of usability that are important for successful design. What is needed is a taxonomy of user interface methodologies, a way of organizing them based on the aspects of usable design that they address most effectively (Lund, in press). Laboratory studies that demonstrate that a new design technique or guideline has value do not quantify that value in a way that lets practitioners determine whether to invest the resources to incorporate it into their designs. Research that results in the design guidelines that drive standards sometimes fails to have impact because the guidelines cannot be associated with reliable, valid metrics to determine conformance, and there are no effective tools for managing their interactions.
Background
Fortunately, there is a variety of work that suggests such metrics are possible. The SUMI (Kirakowski and Corbett, 1993) and QUIS (Chin, Diehl, and Norman, 1988) questionnaires are well suited to traditional, non-entertainment-oriented software applications (e.g., running on PCs), and provide a wealth of diagnostic data for identifying usability problems. These questionnaires appear to contain items that would be relevant to measuring usability as a dependent variable, but are not in themselves generalizable across domains. Schwartz and Seifert (1996) reported a study designed to assess what users themselves identify as being relevant to usability, and found two important, inter-related factors: ease of use and usefulness. I subsequently conducted a series of studies designed to build an instrument for measuring these factors. An initial study was conducted to rough in a useful set of items using consumer products. Subsequent studies found that both the factors and many of the items contributing to them were reliably identified when measuring the usability of software systems used in business settings, voice messaging systems tested in a laboratory setting, and customer attitudes about products when tested in the field (see, for example, the use of the scales in Figure 1). This early work suggested that usability scales could be built that would meet standard psychometric criteria and that would apply across a diverse set of human interfaces. Interestingly, the factors and many of the items loading on the factors have also emerged in research in the MIS area studying client satisfaction (Davis, 1989; Adams, Nelson, and Todd, 1992). Recently, Morris and Dillon (1997) have described a model based on acceptance theory that demonstrates a theoretical basis for relating software usability to usage patterns. The same factors have also been found in research studying technology diffusion (e.g., Moore and Benbasat, 1991).
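One commonly used psychometric criterion for a scale is internal consistency, often summarized by Cronbach's alpha. The sketch below is an illustration only and is not taken from the studies described above: the function name, the 7-point rating format, and the sample ratings are all hypothetical, and alpha is just one of several criteria a scale would need to meet.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for one scale.

    ratings: list of respondents, each a list of scores on the same
    k items (hypothetical layout, e.g. 1-7 agreement ratings).
    """
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    if any(len(r) != k for r in ratings):
        raise ValueError("every respondent must rate every item")

    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(col) for col in zip(*ratings))
    # Sample variance of each respondent's total score on the scale.
    total_variance = variance([sum(r) for r in ratings])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical ratings: 5 respondents x 4 "ease of use" items.
example_ratings = [
    [6, 7, 6, 5],
    [3, 4, 3, 3],
    [5, 5, 6, 5],
    [2, 3, 2, 3],
    [7, 6, 7, 6],
]
print(round(cronbach_alpha(example_ratings), 3))
```

Values closer to 1 indicate that the items vary together across respondents, which is one sign that they are measuring a single underlying factor.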
Proposal
What is needed is research to build reliable and valid scales for measuring products on the dimensions of usability. The dimensions should be defined from the user's perspective, and derived from their interaction with the variety of user interfaces that people encounter each day. These interfaces could include hardware, personal computer software, interactive telephony applications, written instructions, and perhaps even technologically supported service centers. The applications should include the variety of task domains that people typically encounter (Lund, 1994), including office applications, information retrieval, entertainment, and shopping. The goal should be to create an instrument that is capable of becoming a default if not a formal standard, and that is capable of evolving as our field evolves. The scales would then serve as common dependent variables that could be used to advance our discipline through the improved integration of research and the linking of that research to practice.
References
Figure 1. Comparison of interfaces on two usability dimensions and user satisfaction.