Metadata harvesting for content-based distributed information retrieval

Simeoni, Fabio; Yakici, Murat; Neely, Steve; Crestani, Fabio

doi:10.1002/asi.20694

Journal article

Simeoni, Fabio University of Strathclyde, Glasgow, United Kingdom
Yakici, Murat University of Strathclyde, Glasgow, United Kingdom
Neely, Steve University College Dublin, Ireland
Crestani, Fabio Facoltà di scienze informatiche, Università della Svizzera italiana, Switzerland

27.02.2007

Published in:

Journal of the American Society for Information Science and Technology. - Wiley. - 2008, vol. 59, no. 1, p. 12-24

We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralisation of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative’s protocol for metadata harvesting, the approach occupies the middle ground between content crawling and distributed retrieval. As in crawling, some data moves towards the retrieval process, but it is statistics about the content rather than the content itself; this allows more efficient use of network resources and a wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision whilst promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralised retrieval without giving up cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies for deploying it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol which supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multi-model content-based retrieval of distributed file collections.

Language

English

Classification

Computer science and technology

License

License undefined

Open access status

green

Identifiers

RERO DOC 11235
DOI 10.1002/asi.20694
ARK ark:/12658/srd1318109

Persistent URL

https://n2t.net/ark:/12658/srd1318109

Statistics

Document views: 191

File downloads (full text): 269