How do you do information retrieval evaluation?

Several information retrieval evaluation methodologies and measures have been developed (Harter & Hert, 1997). But one of the first challenges in evaluating information retrieval is deciding upon the criteria to be used to judge success. Historically, the two primary criteria that have been used are “precision” and “recall” (Arms, 2000). Precision is the proportion of retrieved documents that are relevant to the information an individual is seeking (i.e., that meet the requirements of their search). If a scholar is searching for information about the relationship between President Theodore Roosevelt and teddy bears, precision would be high if most of the documents retrieved are directly related to this topic, and not to President Roosevelt's role in the development of the Panama Canal. Recall is the proportion of all the relevant documents in the collection that are actually retrieved. Recall is much more difficult to estimate than precision because few digital libraries are cataloged in such a way that all the possibly relevant documents are known or can be identified.

In any case, Arms (2000) points out that the precision and recall criteria were originally defined in terms of evaluating a single search. Few users search that way anymore because one of the major benefits of digital libraries is that they enable “interactive searching,” whereby a user employs multiple iterative strategies (e.g., searching for a specific topic combined with delimiters, followed by browsing, followed by a new search with more specific search terms). Hence, evaluating information retrieval is much more difficult because it will usually involve evaluating interactive search sessions rather than simple, one-time searches. So how should you proceed with respect to conducting an information retrieval evaluation of your digital library?
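As a concrete illustration of the two definitions above, the following minimal Python sketch computes precision and recall for a single search. The function name, the document identifiers, and the sample data are hypothetical, and the sketch assumes that the full set of relevant documents is already known, which, as noted above, is rarely the case in a real digital library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: document ids returned by the search
    relevant:  all document ids in the collection judged relevant
               (assumed to be known, which is rare in practice)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant documents that were retrieved

    # Guard against empty sets so the function never divides by zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical search: 4 documents retrieved, 5 relevant documents exist,
# and 2 documents appear in both sets.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7", "d8", "d9"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case half of what was retrieved is relevant (precision 0.50), but only two of the five relevant documents were found (recall 0.40).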
The first issue you should consider is what kinds of decisions you hope to inform and then what kinds of information the evaluation must provide to influence those decisions. Suppose you and your colleagues are facing a decision about whether the metadata approach you are using in your digital library should be replaced with another one or abandoned altogether. You could conduct an information retrieval evaluation focused on the performance of your specific system using either external standards or data from real users. Alternatively, you could compare your system with other information retrieval systems. If you decide to focus on the performance of your system based on its use by real users, you will need to identify criteria and standards. The limits of traditional criteria such as precision and recall are obvious when evaluating today's complex interactive searches, but new criteria are evolving, e.g., search cost. Search cost is calculated in terms of the time and money that a user expends before reaching a satisfactory conclusion to an information retrieval session. Another important criterion could be relevance, i.e., how relevant a user judges the results of using the information retrieval system to be. User satisfaction is another possible criterion upon which to base your evaluation of information retrieval. In this case, user satisfaction would be viewed as equivalent to effectiveness. Methods of collecting data would include qualitative methods such as observations, interviews, open-ended questionnaires, and even think-aloud protocols similar to those used in usability testing. Qualitative approaches are necessary when the criteria are as subjective and situational as relevance and satisfaction.
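The search cost criterion described above can be tallied in a few lines of Python. This is only a sketch under stated assumptions: the `SearchStep` record, its field names, and the sample session are hypothetical, and time and money are reported separately because the text does not say how the two should be combined into one figure.

```python
from dataclasses import dataclass


@dataclass
class SearchStep:
    """One step of an interactive search session (hypothetical record)."""
    action: str
    seconds: float     # time the user spent on this step
    fee: float = 0.0   # money the user spent on this step, if any


def search_cost(steps):
    """Return (total_seconds, total_fee) for a whole session."""
    total_seconds = sum(step.seconds for step in steps)
    total_fee = sum(step.fee for step in steps)
    return total_seconds, total_fee


# Hypothetical session: a search, some browsing, then a refined search
# that incurs a charge.
session = [
    SearchStep("topic search with delimiters", 45.0),
    SearchStep("browse results", 120.0),
    SearchStep("refined search", 30.0, fee=1.50),
]

total_seconds, total_fee = search_cost(session)
print(f"time spent: {total_seconds:.0f} s, money spent: {total_fee:.2f}")
```

Recording the steps individually, rather than only the totals, keeps the data usable for looking at where in a session users spend the most time.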
On the other hand, if you wish to compare the information retrieval of your digital library with external standards, the Text REtrieval Conference (TREC) (http://trec.nist.gov/) has several databases available for use in comparison evaluations. Indeed, TREC is one of the main venues for discovering the latest information about research in the information retrieval community. Two of TREC's main goals are: (a) increasing the speed of transfer of technology from research into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems, and (b) increasing the availability of appropriate evaluation techniques for use by industry and academia. TREC is a unique evaluation community in that it has developed test collections (a set of documents, topics, query questions, and corresponding relevance judgments) and evaluation software that are available to the research community and other organizations so that any developers can evaluate their own retrieval systems at any time. If you choose to do a comparison study, you will need to decide which system to use for comparison purposes. Do you wish to compare the information retrieval effectiveness of your digital library with another digital library or with some sort of standardized database? In the first instance, you might wish to compare the information retrieval results of your digital library with the search results obtained using a popular search engine such as Google. Instead of focusing only on qualitative user-based evaluations or standards-based performance evaluations, Wu and Sonnenwald (1999) promote the concept of multiple-methods approaches that would blend user studies with systems-oriented studies. This might involve extending the criteria beyond the relevance and satisfaction perceptions of individual users to external factors such as the effects on research, productivity, and decision-making (Saracevic, 1995).
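To show how a test collection of the kind described above supports a systems-oriented evaluation, here is a minimal Python sketch that scores one system's results against relevance judgments for several topics. It is a simplified illustration, not TREC's file formats or its evaluation software: the dictionary layout, the function name `evaluate_run`, and all topic and document identifiers are assumptions made for the example.

```python
def evaluate_run(run, judgments):
    """Score a system's results against relevance judgments.

    run:       {topic_id: list of document ids the system retrieved}
    judgments: {topic_id: set of document ids judged relevant}
    Returns (per_topic, mean_precision, mean_recall), where per_topic
    maps each judged topic to its (precision, recall) pair.
    """
    per_topic = {}
    for topic, relevant in judgments.items():
        retrieved = set(run.get(topic, []))  # a missing topic counts as no results
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        per_topic[topic] = (precision, recall)

    n = len(per_topic)
    mean_precision = sum(p for p, _ in per_topic.values()) / n if n else 0.0
    mean_recall = sum(r for _, r in per_topic.values()) / n if n else 0.0
    return per_topic, mean_precision, mean_recall


# Hypothetical test collection with two topics and one system's results.
judgments = {
    "t1": {"d1", "d2"},
    "t2": {"d5", "d6", "d7", "d8"},
}
run = {
    "t1": ["d1", "d3"],
    "t2": ["d5", "d6"],
}

per_topic, mean_precision, mean_recall = evaluate_run(run, judgments)
print(per_topic)
print(f"mean precision={mean_precision:.2f} mean recall={mean_recall:.2f}")
```

Running the same function on the results of a second system, against the same judgments, would give a like-for-like comparison of the two, which is the point of sharing a test collection.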