This conceptual model for repository federation is based on the following documents:
- CORDRA: http://cordra.net/docs/ , http://cordra.net/information/publications
- Mellon Report: Augmenting Interoperability across Scholarly Repositories
- ISO 2146. Information and documentation—Registry Services for Libraries and Related Organisations
- Research Libraries Group: Trusted Digital Repositories, Attributes and Responsibilities
Collection
In the most general sense, as defined in ISO 2146, a collection is any aggregation of content items, physical or digital.
A repository “contains” collections: it provides access to them, manages them, etc. So a repository defines a repository collection of items.
A collection can be a proper subset of a repository collection; a superset; or it can partly overlap. This is completely open, and the distinction between repository and collection in scope over content can lead to a consistency problem in managing collections through repositories.
Possible examples of collections include:
- All items in the Jack repository.
- All items in the Jack repository of type audio-visual.
- All items in the Jack repository authored by Jill.
- All items in the Jack repository about bioethics.
- All items in any repository in the Jerome federation.
- All items in the Jerome federation about bioethics.
- All items in the Jerome federation whose author name begins with Y and whose repository name begins with C.
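One way to read these examples is that a collection is a selection criterion over items, which may or may not coincide with the boundary of a single repository. The following Python sketch illustrates that reading only; the `Item` fields and the sample records are hypothetical and are not part of the model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Item:
    # Hypothetical item record; the field names are assumptions for illustration.
    title: str
    author: str
    subject: str
    item_type: str
    repository: str
    federation: str


# A collection is modelled here as a predicate over items.
Collection = Callable[[Item], bool]


def members(collection: Collection, items: Iterable[Item]) -> List[Item]:
    """Return the items that belong to the collection, in input order."""
    return [item for item in items if collection(item)]


ITEMS = [
    Item("Consent in trials", "Jill", "bioethics", "text", "Jack", "Jerome"),
    Item("Lecture recording", "Yusuf", "physics", "audio-visual", "Jack", "Jerome"),
    Item("Gene editing debate", "Yara", "bioethics", "text", "Clio", "Jerome"),
]


def jack_by_jill(item: Item) -> bool:
    # "All items in the Jack repository authored by Jill": a subset of one repository.
    return item.repository == "Jack" and item.author == "Jill"


def jerome_bioethics(item: Item) -> bool:
    # "All items in the Jerome federation about bioethics": spans repositories.
    return item.federation == "Jerome" and item.subject == "bioethics"


print([item.title for item in members(jack_by_jill, ITEMS)])
print([item.title for item in members(jerome_bioethics, ITEMS)])
```

The second collection draws on two repositories at once, which is the situation in which collection scope and repository scope diverge.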
Parties
Parties are defined by ISO 2146 as all actors involved in the repository space. The relations parties enter into are not enumerated by ISO 2146; this model postulates that they fall into three classes: acquiring (data is created); curating (data is managed and enriched); and publishing (data is made available outside).
- Acquiring: e.g. Legally own; procure; generate item
- Curating: e.g. Generate metadata, Manage item (CRUD)
- Publishing: e.g. Manage access, Manage Service, Use Service
Since under ISO 2146 a service is exclusively an interface to the outside world, services are only managed as publishing activities. Curating the content through interaction with individual items is a distinct responsibility from publishing the content (possibly as a single collection) through a repository.
Managed Collections
Managed collection refers to a collection which is accountably curated by a party or parties, according to a policy. Managing the publishing of the collection is independent of the collection being managed; a collection manager may insist on their own more restrictive policies, attaching to the publishing service provided by the repository.
If a repository contains only one collection, and policy entrusts all responsibilities for curating that collection with the repository manager, then the repository manager is the collection manager. In more complex cases, a repository can contain several independently managed collections, each with their own publishing policies and potentially custom publishing services, as well as other items managed by the repository manager. So the repository collection is a superset of the managed collections the repository houses. (Collections can also be unmanaged, as occurs with Internet content; but that is atypical of federating repositories.)
A repository attaches to a collection of items a service, which Stores and Exposes those items (as a single digital object). A managed collection cannot override those services; if it does, it becomes its own repository. But it can override repository-internal activities, even when those activities change how the outside world interacts with the items (e.g. metadata generation, access policy). If the repository already has policies and standards for those activities, the managed collection must comply with these to remain part of the repository.
Trustworthiness
One of the major motivations for using repositories as opposed to just the Internet is the assurance of quality content: content is presumed added to a repository only after it has been appropriately vetted. This invokes a notion of trustworthiness, which can be decomposed into dependability (the consumer is assured that the referent will satisfy a given attribute consistently over a non-trivial length of time) and accountability (the consumer has information made available to them to establish that the attribute is satisfied at any time).
Trustworthiness of the content of repositories is the primary kind of trustworthiness which users evaluate repositories by; but this involves curating responsibilities, as defined above, and is therefore outside the scope of the FRED project. In this regard, trustworthiness has scope over authenticity, provenance, and scholarliness, applying both to content itself, and to the metadata describing the content.
There are also trustworthiness metrics independent of the repository content, and to do with the accessibility and accountability of content as delivered through a repository. This involves publishing responsibilities, as defined above. Since repositories are the mechanism through which publishing occurs, the trustworthiness of content publishing is in scope for the FRED project, so long as the publishing takes place through a repository federation.
The trustworthiness metrics for the publishing of content are involved in the definition of a trusted repository advanced by the Research Libraries Group (RLG). The design choice made by FRED of a central registry-based model of federation is motivated by such considerations.
For reference, the trustworthiness criteria established by the RLG for publishing content are enumerated, along with our classification of the criteria:
Dependable Publishing Services:
- accept responsibility for the long-term maintenance of digital resources on behalf of its depositors and for the benefit of current and future users;
- have an organizational system that supports not only long-term viability of the repository, but also the digital information for which it has responsibility;
- demonstrate fiscal responsibility and sustainability;
- design its system(s) in accordance with commonly accepted conventions and standards to ensure the ongoing management, access, and security of materials deposited within it;
- be depended upon to carry out its long-term responsibilities to depositors and users openly and explicitly;
- establish methodologies for system evaluation that meet community expectations of trustworthiness.
These criteria ensure that the repository continues to provide a consistent service over a non-trivial period of time.
Accountable Publishing Services:
- have policies, practices, and performance that can be audited and measured.
Operational Publishing Responsibilities:
- determines, either by itself or with others, the users that make up its designated community, which should be able to understand the information provided;
- ensures that the information to be preserved is “independently understandable” to the designated community; that is, that the community can understand the information without needing the assistance of experts;
- follows documented policies and procedures that ensure the information is preserved against all reasonable contingencies and enables the information to be disseminated as authenticated copies of the original or as traceable to the original;
- makes the preserved information available to the designated community;
- works closely with the repository’s designated community to advocate the use of good and (where possible) standard practice in the creation of digital resources; this may include an outreach program for potential depositors.
These criteria motivate the following design choices:
- Dependable Publishing Services: requirement of service level agreement between participating repository and federation; use of persistent identifiers; preference for federation-level provision as more sustainable fiscally and institutionally; use of established data transport standards.
- Accountable Publishing Services: all the transactions of a repository should be logged and monitored, so they can be queried.
- Operational Publishing Responsibilities: orientation of content provision to community defined by discipline rather than institutional affiliation; consistent presentation of content regardless of source repository; coordination of consistent creation and preservation policies; authoritative duplication of content.
These design choices in turn are best realised through a central service point representing content provision from a disciplinarily coherent but institutionally disparate set of repositories.
Repository
A repository is a managed collection, presented to consumers through a publishing service as a single digital object: specifically, a data source. In other words, a repository is a system (prototypically a software system) used to maintain and publish a Collection. In terms of ISO 2146 (which does not define repositories as distinct from registries), a repository is a minimal registry, consisting of one collection, one party (the collection curator), one service (publish), and one activity (curate). A managed repository further involves a party which takes responsibility for the repository as a single digital object (i.e. the service interface of the repository). The activities which that party undertakes are publishing activities, including the development and maintenance of access policy.
A repository can include multiple collections, each curated by a distinct party. But it can only have one party accountable for managing its publishing service.
The notion of a publishing service does not require public access to content, but merely consistent presentation of the collection(s) as a single digital object. If a repository is archival-only, the only party with access to the repository through the publishing service may be the repository manager.
Federation
There are two models of repository federation, according to the locus of content discovery. In both models, discovery is assumed to depend on metadata which resides in participating repositories. Both models also presume some digital representation of the repositories in a data source coordinating the federation, which we term a federator. The federator is a repository which contains representations of repositories (federatees).
In a centralised model, metadata is gathered from participating repositories into a central metadata source. Requests for discovery are transacted on the metadata in the central metadata source. Given the results, users can be redirected for content delivery to the participating repositories. The federator coordinates the population of the central metadata source, and the redirection to participating repositories may be mediated through the federator. The federator and the central metadata source may be the same data source.
In a decentralised model, metadata is not gathered into a central data source. The federator redirects any requests for discovery to each of the participating repositories, and the requests as well as the delivery of data are transacted there.
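The difference between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the repository names, the record layout, and the substring matching are hypothetical stand-ins for real metadata and real discovery services.

```python
from typing import Dict, List

# Hypothetical participating repositories: name -> {item id -> metadata text}.
REPOSITORIES: Dict[str, Dict[str, str]] = {
    "Jack": {"jack:1": "bioethics consent", "jack:2": "physics lecture"},
    "Clio": {"clio:7": "bioethics gene editing"},
}


def search_repository(name: str, term: str) -> List[str]:
    """Discovery transacted at a single participating repository."""
    records = REPOSITORIES[name]
    return sorted(item_id for item_id, text in records.items() if term in text)


def build_central_metadata_source() -> Dict[str, str]:
    """Centralised model: gather metadata from every federatee in advance."""
    central: Dict[str, str] = {}
    for records in REPOSITORIES.values():
        central.update(records)
    return central


def discover_centralised(central: Dict[str, str], term: str) -> List[str]:
    """Discovery is transacted on the central metadata source only."""
    return sorted(item_id for item_id, text in central.items() if term in text)


def discover_decentralised(roster: List[str], term: str) -> List[str]:
    """Decentralised model: the request is redirected to each listed repository."""
    hits: List[str] = []
    for name in roster:
        hits.extend(search_repository(name, term))
    return sorted(hits)


central = build_central_metadata_source()
print(discover_centralised(central, "bioethics"))
print(discover_decentralised(["Jack", "Clio"], "bioethics"))
```

Both calls find the same items here; what differs is where the query runs. The centralised result is only as current as the last gathering of metadata, whereas the decentralised result depends on every listed repository answering at query time.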
Registry, Directory, Dropbox and Roster
Both the federator and the central metadata source are data sources containing representations of digital objects and the repositories that contain them. They thus match the ISO 2146 definition of a registry as a collection of registry objects—collections, activities, parties, and services. (A registry is a second order collection.)
We use here a more restrictive sense of registry, as a (second order) data source providing accountability for its content through the provision of provenance and authority metadata. We allow for lesser degrees of trustworthiness in data sources, from registry through directory, dropbox, and roster. The federator and central metadata sources need not have high trustworthiness in order to function, but the FRED project recommends such trustworthiness, and describes the federator and the central metadata source as the repository registry and the metadata registry (CORDRA: Master Catalogue). (The FRED Service Usage Model also allows a distinct collection registry; the CORDRA model additionally provides for a System Registry presenting the semantic model instantiated in the federation.)
A registry under this model’s definition is a managed collection of repositories and/or of federated metadata. Managing the registry as a collection has a strong policy component at the acquisition phase: a repository is only added to the federation through a contractual arrangement. This arrangement includes assurances of the federatee’s trustworthiness in providing a publishing service (SLA): the trustworthiness of the federation is conditional on the trustworthiness of its participants (the centralised metadata must be up to date). The arrangement includes a policy on the format of the metadata, which also ensures trustworthiness for the federator (the metadata looks the same whatever its source, so queries on the registry will behave consistently).
The less trustworthy types of second order data source are not described here in any detail. A directory is a managed collection of repositories and/or of federated metadata, but makes no contractual requirement for adding a repository to the federation. (The federation is open and “third-party”, and makes no guarantee of the trustworthiness of its publishing service.) A dropbox is an unmanaged collection of repositories and/or of federated metadata, and no systematic attempt is made to expose or record authority metadata; so the accountability and provenance of content is not guaranteed. A roster of repositories is the minimum infrastructure required by a decentralised federation: it is merely a listing of participating repositories, with no requirement of policy compliance.
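The four types can be told apart by a small number of yes/no attributes. The Python sketch below is one possible reading of the descriptions above, not a definition from FRED or ISO 2146; the attribute names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondOrderSource:
    # Attribute names are assumptions distilled from the descriptions above.
    managed: bool            # curated accountably under a policy
    contractual_entry: bool  # repositories join only under an agreement (SLA)
    listing_only: bool       # holds nothing beyond a list of repositories


def classify(source: SecondOrderSource) -> str:
    """Map the attributes to one of the four types of second order data source."""
    if source.listing_only:
        return "roster"
    if not source.managed:
        return "dropbox"
    return "registry" if source.contractual_entry else "directory"


print(classify(SecondOrderSource(managed=True, contractual_entry=True, listing_only=False)))
print(classify(SecondOrderSource(managed=True, contractual_entry=False, listing_only=False)))
print(classify(SecondOrderSource(managed=False, contractual_entry=False, listing_only=False)))
print(classify(SecondOrderSource(managed=False, contractual_entry=False, listing_only=True)))
```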
Federating Repository
In the context of federating repositories, a (participating) repository is conceived of as a federatee: a system actor that enters into federation transactions with a federator, such as harvesting. Note that repositories rather than collections are what enter into federations, since participation in a federation is conditional on the ability to publish content to end users, a characteristic which defines repositories rather than collections.
According to the CORDRA model, a repository can store registered objects, give an interface for retrieval, and have standard formats allowing for data import/export. For it to share content with a central metadata registry through harvesting or depositing also means, at the policy level, that a repository has an associated manager, capable of entering into a relation of trust with a registry. This imposes the following requirements on repositories:
- Declare standard formats for its content objects, for data exchange
- Have an interface for content retrieval (CORDRA Expose): either push (deposit) or pull (harvest)
- Be a single digital object (as defined in CORDRA)
- Corollary: Appear as a single digital object to outside actors (this does not rule out aggregations of repositories, which allows federations of federations)
- Store content
- Enable its content to be registered
To enable content to be registered, the repository MUST:
- Be managed as a single object
- Have a manager actor able to enter into and honour a contractual arrangement (SLA) with the registry.
- Per SLA, expose agreed information to registry
Accordingly, the following qualities of repositories are required for a repository to enter into a well-defined federation, such as FRED defines:
- Unitary: the repository appears to outside actors as a single system, through well-defined interfaces
- Managed: for any repository policy the repository has a single actor (which can be a committee) responsible for enacting it, as a single point of accountability. (The policy actors can be coordinated by a single umbrella actor, who is the overall repository manager; this is not mandatory but it is desirable, to provide a single point of contact.)
- Federable: the repository is capable of entering into a federation through a federator. This has at least three components:
  - Technically Federable: the repository supports the service expressions required by the registry for federation (whether as push metadata; pull metadata; or for decentralised federations, redirect discovery request).
  - Managerially Federable: the manager of the repository is capable and prepared to enter into a contractual arrangement with the distinct manager of the registry, to comply with the terms of the federation. Managed is a prerequisite for Managerially Federable, since a single manager actor responsible for the repository can legitimately enter into the necessary negotiations.
  - Activatable = Access Federable: the content required for federation can be activated (made exposed) by the repository manager, compliant with repository and federation policies.
Managerially- and Access Federable are matters of policy; they distinguish between policy at the repository level and policy at the item level.
- Encoding: the repository can export its content items through an agreed exchange format to some user. This may include archival use, and need not mean that the repository is exposed outside the local domain.
- Accessioning: the repository provides a protocol to allow retrieval of items. So the identifiers that the registry returns for discovery of an item are actionable.
Access need not be immediate, through say a hyperlink to content: it can be a PO Box address or a shopping cart interface (which provides delayed accessioning). It is enough that the process of accessioning (retrieval) can be initiated. If on the other hand, no access protocol is provided, the registry merely records that the item exists and has certain metadata, but gives no indication of where the item can be retrieved.
FRED expects that all these qualities are fulfilled by participating repositories in its centralised federation model. The repository enters into federation as a single system. (This is how a federation of federations can be built: the participating federation is presented to the federation of federations as a single system, through its central metadata source.) The repository has a known party who is responsible for it to the federation, and who is prepared to establish a relation of trust with the federation. The repository must have the appropriate services and permissions defined to share its metadata with a central metadata registry. Once content on the participating repository has been discovered through the central metadata registry, the discovery service must provide accessioning for the user to retrieve the content from the participating repository; and the participating repository must be capable of packaging the content for delivery to the user. Neither FRED nor CORDRA requires that accessioning be immediate (e.g. through a hyperlink on a resolvable identifier).
Note that not all repositories fulfil all these qualities, and indeed not all federations require them. For instance, a repository may participate in a federation without providing either encoding or accessioning for its content. In that case the end user can only retrieve descriptions of the content through metadata, and not the content itself.
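The qualities listed above can be treated as a checklist applied to a candidate repository. The Python sketch below is illustrative only: the field names are assumptions, and the one dependency it encodes is the stated rule that Managed is a prerequisite for Managerially Federable.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RepositoryQualities:
    # One flag per quality listed above; the field names are illustrative.
    unitary: bool
    managed: bool
    technically_federable: bool
    managerially_federable: bool
    access_federable: bool
    encoding: bool
    accessioning: bool


def missing_qualities(q: RepositoryQualities) -> List[str]:
    """List the qualities a repository lacks for a fully specified federation."""
    missing = [f.name for f in fields(q) if not getattr(q, f.name)]
    # Managed is a prerequisite for Managerially Federable.
    if not q.managed and "managerially_federable" not in missing:
        missing.append("managerially_federable")
    return missing


# A repository that shares metadata but offers neither encoding nor accessioning.
metadata_only = RepositoryQualities(True, True, True, True, True, False, False)
# A repository claiming it can sign agreements without having a manager.
unmanaged = RepositoryQualities(True, False, True, True, True, True, True)

print(missing_qualities(metadata_only))
print(missing_qualities(unmanaged))
```

The first case corresponds to the scenario just described, in which end users can retrieve descriptions of content but not the content itself.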
Federation Data
In order to participate in a centralised federation, a repository must expose some of its data to be registered in the metadata registry. This includes but is not limited to content metadata; it also includes access information about the item itself, which is resource metadata—item ID, item location, item access policy. There is a mismatch between the data the federatee exposes to the federator, and the data the federator requires from the federatee for the federation to work.
The federatee (the participating repository) exposes leviable data, which can potentially be ingested into the registry. The federator only requires levied data to be exposed for ingesting into the metadata registry by the federatee. Leviable data is not restricted to metadata descriptions of individual content items, and can also include metadata about the overall collection, or the publishing service (i.e. the repository). The levied data for an item should be a subset of the leviable data (allowing for combinations of item and collection metadata not made explicit in the source repository), although other scenarios are conceivable.
Levied data is transformed internally by the federator into the formats it actually needs for compliance with the federation. For a content item, this is registration data: the data required for registration of an item conformant to the federation. The responsibility for the transformation and preparation of registration data may lie with either the federator or the federatee, depending on community practice.
Finally, data may be exposed by the federation as federation metadata, in order to provide federated repository discovery. This is reexposed data, since it is based on data the participating repository had initially exposed to the registry. Typically, this will be a subset of registration data, which the federation has determined is fit to be exposed for discovery. Not all registration data need be reexposed. Levied data can be exposed independently by policy if it is expedient, e.g. to provide direct access to it if the end user is interested, rather than sending them to the repository, or to support value-added discovery beyond what the registration data allows (reserve metadata). But the reexposed data is what the registry has contractually committed to exposing. (Leviable data which has not actually been levied, on the other hand, is not available to the federator to reexpose, and must be accessed directly from the participating repository.)
This brings about the following workflow:
- A repository exposes some data (and metadata). It may do so by arrangement with a registry, but a registry could also levy metadata from an open access repository.
- A registry levies data from a repository (either through push or pull); it ingests the exposed data it needs to make the federation effective. Such a data levy is essential to the notion of a centralised federation.
- The data levied (ingested for the purposes of a federation) may not conform to the requirements the federation imposes on registration data. In that case, the levied data must still be adequate to allow such conforming data to be generated, and this generation is undertaken, typically as a transformation operation.
- The registry reexposes data to provide federated repository discovery, typically as a subset of the registration data.
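The four kinds of data in this workflow (leviable, levied, registration, reexposed) can be traced through a minimal Python sketch. The field names, the transformation rules, and the example URL are hypothetical; an actual federation would fix them by policy.

```python
from typing import Dict

# Hypothetical field names; a real federation would agree these by policy.
LEVIED_FIELDS = {"id", "title", "subject", "location", "rights_note"}
REEXPOSED_FIELDS = {"identifier", "title", "subject"}


def levy(leviable: Dict[str, str]) -> Dict[str, str]:
    """Ingest only those exposed fields the federation requires."""
    return {k: v for k, v in leviable.items() if k in LEVIED_FIELDS}


def to_registration(levied: Dict[str, str], repository: str) -> Dict[str, str]:
    """Transform levied data into the (assumed) registration format."""
    return {
        "identifier": f"{repository}:{levied['id']}",
        "title": levied["title"].strip(),
        "subject": levied["subject"].lower(),
        "locator": levied["location"],
        "source_repository": repository,
    }


def reexpose(registration: Dict[str, str]) -> Dict[str, str]:
    """Expose a subset of the registration data for federated discovery."""
    return {k: v for k, v in registration.items() if k in REEXPOSED_FIELDS}


leviable = {
    "id": "42",
    "title": " Consent in trials ",
    "subject": "Bioethics",
    "location": "https://repository.example/items/42",
    "rights_note": "open",
    "file_size": "1048576",  # exposed by the repository but never levied
}

levied = levy(leviable)
registration = to_registration(levied, "jack")
print(reexpose(registration))
```

In this sketch the transformation is performed on the federator side; as noted above, community practice may equally place it with the federatee.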
Levied data (what the repository supplies) would ideally be the same as registration data (what the registry needs); but that may not happen. Along the way, the metadata can be augmented from other sources:
- Leviable data describing an item may be augmented with data about the collection or repository.
- Levied data about an item may be augmented with data about the federation or registry.
- Registration data may be augmented with value-added metadata contributed within the federation.
The final augmentation of data is unrelated to the content or presentation of data from the federatees. Such value-added data includes annotations and rankings. It may have been generated within the registry system, or by third parties. But the association of such value-added data with the content item occurs in the context of the federation, not the source repository. Like registration data, such value-added data needs to be activated (=made exposable) explicitly to be available to end users.
The FRED model assumes that registration data includes both content metadata (describing the content of the item, and exploited in discovery on the metadata registry), and resource metadata (describing the item itself as a [digital] object). Resource metadata crucially includes address data: data that facilitates access to content data, through a direct, indexical relation. This includes both identifiers and locators. Without address data, the federation cannot redirect requests to obtain discovered content back to the participating repositories. If only resource metadata is registered, the federation cannot enable discovery on the central metadata registry; but it might still allow decentralised federation discovery.
CORDRA differentiates global identifiers from local locators; if the identifier is globally resolvable, the intermediate locator providing a service of access to the item is superfluous. FRED presumes all content is identified through identifiers which are resolvable by any end user of the federation, through whichever compatible system they access the federation. The intent is that access to content is not restricted to particular systems.
Managed collections are within scope of repository federations, and have their own metadata, including their own address data. The federation registers information about items directly, since individual item metadata is searchable on the registry whether it belongs to a managed collection or not. The retrieval of an item in a managed collection, however, may be mediated through that collection’s access policy. So the resolution of the item has to respect that policy, though this functionality should already be provided by the hosting repository.
Acquiring content
Federating data is a process logically subsequent to acquiring data, although there are similarities between the two processes: both involve moving data into a data source. A data object must be in a federatee repository before data about it can be levied for federation. Moreover, data may have to be reviewed before it is published through a repository, whether the repository is a federatee (i.e. a normal repository) or a federator. So the following sequence of events occurs:
| Business Process | Event Type | Data moves from | Data moves to | Actor initiating |
| --- | --- | --- | --- | --- |
| Create | Acquire | (none) | Author’s data source | Creator |
| Submit | Acquire | Author’s data source | Federatee | Creator or Federatee Manager |
| Transform | Curate | Federatee | Federatee | Federatee or Federatee Manager |
| Review | Curate | Federatee | Federatee | Federatee Manager |
| Publish | Publish | Federatee | Federatee | Federatee Manager |
| Harvest | Acquire | Federatee | Federator | Federatee (push) or Federator (pull) |
| Transform | Curate | Federator | Federator | Federator or Federator Manager |
| Review | Curate | Federator | Federator | Federator Manager |
| Publish | Publish | Federator | Federator | Federator Manager |
What happens to data when it is entered into a federatee repository is similar to what happens when it is entered into a federator repository. In both cases, the data is transformed to conform to the target data source’s format requirements (optional); reviewed for quality assurance (optional); and made available to end users of the repository. There are three main differences between the respective workflows:
- Federation often does not involve a review process: the federation trusts the quality assurance processes of the source repositories.
- The data involved in federation is typically metadata describing content objects, and facilitating its discovery. The data acquired by the federatee is in the first instance the content object itself.
- The initial process of getting the data into the target data source is different: federation can be substantially automated, whereas getting the data into a normal repository is typically triggered by the creator of the content.
Because of the manual triggering of getting content into a normal repository, and the greater emphasis on the review processes, this procedure is modelled separately, with the following vocabulary:
The overall workflow involved in getting content into a normal repository is Deposit. The deposit workflow involves three stages:
- a Submit event, of the creator requesting that a content object be considered for inclusion in a repository. (The content object is assumed already to exist.)
- a series of Transformations, in which the content object is changed to conform to the repository requirements (including generating metadata)
- a Review process, in which the content object is evaluated for inclusion in the repository
If the object is reviewed successfully, a Publication workflow can be initiated for the object: the object is thereby discovered, disseminated and delivered through the repository. (It is discovered through exposing metadata to search or browse; it is delivered by fulfilling requests to obtain the object; and it is disseminated by making the object accessible to other users or systems, including through harvesting and re-ingesting.)
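The deposit stages and the hand-over to publication can be pictured as a small state machine. The Python sketch below is an illustration under assumptions: the stage names and the permitted transitions are one plausible reading of the workflow, and a real repository may skip or repeat stages.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "submitted"
    TRANSFORMED = "transformed"
    REVIEWED = "reviewed"
    PUBLISHED = "published"
    REJECTED = "rejected"


# Allowed transitions; the stage names are assumptions for this sketch.
TRANSITIONS = {
    Stage.SUBMITTED: {Stage.TRANSFORMED},
    # A series of transformations may be applied before review.
    Stage.TRANSFORMED: {Stage.TRANSFORMED, Stage.REVIEWED, Stage.REJECTED},
    # Publication can start only after a successful review.
    Stage.REVIEWED: {Stage.PUBLISHED},
    Stage.PUBLISHED: set(),
    Stage.REJECTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to the target stage, refusing transitions the workflow does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.SUBMITTED
for next_stage in (Stage.TRANSFORMED, Stage.TRANSFORMED, Stage.REVIEWED, Stage.PUBLISHED):
    stage = advance(stage, next_stage)
print(stage.value)
```

Attempting `advance(Stage.SUBMITTED, Stage.PUBLISHED)` raises a `ValueError`, reflecting that publication is initiated only for objects that have been reviewed successfully.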
The deposit workflow is undertaken in order to include something in a repository, whether for reasons of safekeeping, legislative obligation, or accessibility. The workflow is undertaken only with respect to the target repository: the quality assurance in the review is set by the target repository, and the publication is specific to that repository. Any quality assurance undertaken in the federation, and the availability of the object through the federation, are negotiated separately in the processes of establishing and maintaining the federation: they are not involved in the deposit workflow itself.
The data objects and metadata required for a submit event may be bundled together as a submission package, e.g. in a schema like METS. The data object and metadata generated for publication, at the conclusion of the deposit workflow, may be bundled together as a dissemination package. Packaging allows the endpoints of the deposit workflow to be integrated with other systems as services, without manual intervention. This is necessary if the publication data source is distinct from the ingest data source: the dissemination package is what populates the publication data source, through an exchange of metadata between repositories. Preserving such metadata in a consistent format is also important in establishing object provenance.
The deposit workflow enhances the content object with more information (e.g. publishing policies, suitability for publication, reviewer comments, search keywords), so the submission package (the process input) will usually not be identical to the dissemination package (the process output). The submission package may contain suggestions about how the content object is to be enhanced through curatorial or publication metadata (e.g. suggested embargo period, hierarchical relation with other content objects); but the decision on that metadata is undertaken during the course of the deposit workflow within the repository system.
After a party submits an object to a repository, the repository ingests the object. This means that the object is moved into the control domain of the repository (and out of the control domain of the creator), and is then processed to conform to the repository’s data requirements. So the ingest function combines moving the object into the repository system, and transforming the object. Submitting and ingesting the object are typically coupled tightly, but this need not be the case:
- A creator may submit content to a single front-end for a range of repositories or content types (“virtual loading dock”), and some selection process then decides where the content should be ingested for further processing.
- The repository may need to verify any claims made in the submission package before acting on them (e.g. the creator may suggest the content is of type X, but it is actually of type Y). For that reason, content identifiers with meaning contributed by the creator should not determine how the content will be processed by the repository; the claim should be made as explicit metadata in the submission package, which can be verified independently.
- The repository may need to be aware of content before it is ingested, e.g. to avoid duplication later on.
- Content may be aggregated or disaggregated on publication (e.g. a journal issue may be broken up into individual articles).
The transformations the object can undergo in ingestion include unpackaging, identifier generation, virus checking, checksum generation, metadata generation, as well as transformations of file formats, schemata, vocabularies, and encodings. The transformations can be quite complex; for instance, sending metadata to a discovery service, extracting metadata, or creating derivative objects.
Content may be ingested from one repository into another, and this need not occur in the context of a deposit workflow. Ingest can apply to:
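Two of these transformations, identifier generation and checksum generation, are simple enough to sketch with the Python standard library. The record layout is an assumption for illustration. The identifier is deliberately opaque, and the creator's claim about content type is carried as explicit metadata that can be verified later, rather than being encoded in the identifier.

```python
import hashlib
import uuid
from typing import Dict


def ingest(content: bytes, claimed_type: str) -> Dict[str, str]:
    """Generate basic ingest metadata for a submitted content object."""
    return {
        # Opaque identifier carrying no meaning contributed by the creator.
        "identifier": uuid.uuid4().hex,
        # Fixity information, usable later to detect corruption or duplicates.
        "checksum_sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": str(len(content)),
        # The creator's claim is recorded as metadata, still to be verified.
        "claimed_type": claimed_type,
    }


record = ingest(b"example content object", "text/plain")
print(sorted(record))
```

Because the checksum depends only on the content, ingesting the same bytes twice yields the same checksum under two different identifiers, which gives the repository one way to notice duplicate submissions.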
- Periodic harvesting of metadata from a federatee into a federator.
- Archiving content from a publication repository allowing immediate delivery, to an archival repository which does not allow immediate delivery.
One or more data sources may be involved in the deposit workflow. Before submission, the object may be located on the creator’s data source, or in no data source at all; but it cannot be located on a data source within the target repository system.
Once ingested, the content object may reside on the same data source throughout the workflow, and be published without changing location. Alternatively, the object may be placed into a temporary data source while it is undergoing transformation or review: it is only moved to its final destination on publication. This depends on how publication is realised in the repository: the repository may house published and unpublished objects on a single data source, selectively releasing objects for external access; or else it may have a dedicated publication data source, on which all objects housed are available for external use. (So a repository may combine several data sources architecturally, but as noted is presented to external users as only a single digital object—the publication data source.)
Appropriate copy
An appropriate copy service selects for delivery one out of several instances of a resource, according to user and other contextual parameters. If an identifier has multiple resolutions, an appropriate copy service can select the most appropriate instance of the thing identified, out of the instances nominated by each possible resolution.
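A minimal Python sketch of such a selection follows. The single contextual parameter used here (the user's institution), the `Instance` fields, and the example URLs are all assumptions chosen for illustration; a real service would weigh further parameters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    # One resolution of an identifier; the field names are hypothetical.
    url: str
    institution: Optional[str]  # None means the copy is open to any user


def appropriate_copy(instances: List[Instance],
                     user_institution: Optional[str]) -> Optional[Instance]:
    """Prefer a copy held for the user's institution, else fall back to an open copy."""
    for instance in instances:
        if instance.institution is not None and instance.institution == user_institution:
            return instance
    for instance in instances:
        if instance.institution is None:
            return instance
    return None


RESOLUTIONS = [
    Instance("https://publisher.example/article/9", "Jack University"),
    Instance("https://archive.example/open/9", None),
]

print(appropriate_copy(RESOLUTIONS, "Jack University").url)
print(appropriate_copy(RESOLUTIONS, "Clio College").url)
```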
http://www.fred.usq.edu.au