This conceptual model for repository federation is based on the following documents:

Collection

In the most general sense, as defined in ISO 2146, a collection is any aggregation of content items, physical or digital.

A repository “contains” collections: it provides access to them, manages them, etc. So a repository defines a repository collection of items.

A collection can be a proper subset of a repository collection, a superset of it, or partly overlap with it. The relation is left completely open, and the difference in scope between a repository and a collection can lead to consistency problems when collections are managed through repositories.

Possible examples of collections include:

Parties

Parties are defined by ISO 2146 as all actors involved in the repository space. The relations parties enter into are not enumerated by ISO 2146; this model postulates that they fall into three classes: acquiring (data is created); curating (data is managed and enriched); and publishing (data is made available outside).

Since under ISO 2146 a service is exclusively an interface to the outside world, services are only managed as publishing activities. Curating the content through interaction with individual items is a distinct responsibility from publishing the content (possibly as a single collection) through a repository.

Managed Collections

Managed collection refers to a collection which is accountably curated by a party or parties, according to a policy. Managing the publishing of the collection is independent of the collection being managed; a collection manager may insist on their own more restrictive policies, attaching to the publishing service provided by the repository.

If a repository contains only one collection, and policy entrusts all responsibilities for curating that collection with the repository manager, then the repository manager is the collection manager. In more complex cases, a repository can contain several independently managed collections, each with their own publishing policies and potentially custom publishing services, as well as other items managed by the repository manager. So the repository collection is a superset of the managed collections the repository houses. (Collections can also be unmanaged, as occurs with Internet content; but that is atypical of federating repositories.)

A repository attaches to a collection of items a service, which Stores and Exposes those items (as a single digital object). A managed collection cannot override those services; if it does, it becomes its own repository. But it can override repository-internal activities, even when those activities change how the outside world interacts with the items (e.g. metadata generation, access policy). If the repository already has policies and standards for those activities, the managed collection must comply with these to remain part of the repository.

Trustworthiness

One of the major motivations for using repositories as opposed to just the Internet is the assurance of quality content: content is presumed added to a repository only after it has been appropriately vetted. This invokes a notion of trustworthiness, which can be decomposed into dependability (the consumer is assured that the referent will satisfy that attribute consistently over a non-trivial length of time) and accountability (the consumer has information made available to them to establish that the referent is satisfied at any time).

Trustworthiness of the content of repositories is the primary kind of trustworthiness which users evaluate repositories by; but this involves curating responsibilities, as defined above, and is therefore outside the scope of the FRED project. In this regard, trustworthiness has scope over authenticity, provenance, and scholarliness, applying both to content itself, and to the metadata describing the content.

There are also trustworthiness metrics independent of the repository content, and to do with the accessibility and accountability of content as delivered through a repository. This involves publishing responsibilities, as defined above. Since repositories are the mechanism through which publishing occurs, the trustworthiness of content publishing is in scope for the FRED project, so long as the publishing takes place through a repository federation.

The trustworthiness metrics for the publishing of content are involved in the definition of a trusted repository advanced by the Research Libraries Group (RLG). The design choice made by FRED of a central registry-based model of federation is motivated by such considerations.

For reference, the trustworthiness criteria established by the RLG for publishing content are enumerated, along with our classification of the criteria:

Dependable Publishing Services:

These criteria ensure that the repository continues to provide a consistent service over a non-trivial period of time.

Accountable Publishing Services:

Operational Publishing Responsibilities:

These criteria motivate the following design choices:

These design choices in turn are best realised through a central service point representing content provision from a disciplinarily coherent but institutionally disparate set of repositories.

Repository

A repository is a managed collection, presented to consumers through a publishing service as a single digital object, specifically a data source. In other words, a repository is a system (prototypically a software system) used to maintain and publish a Collection. In terms of ISO 2146 (which does not define repositories as distinct from registries), a repository is a minimal registry, consisting of one collection, one party (the collection curator), one service (publish), and one activity (curate). A managed repository further involves a party which takes responsibility for the repository as a single digital object (i.e. the service interface of the repository). The activities which that party undertakes are publishing activities, including the development and maintenance of access policy.

A repository can include multiple collections, each curated by a distinct party. But it can only have one party accountable for managing its publishing service.

The notion of a publishing service does not require public access to content, but merely consistent presentation of the collection(s) as a single digital object. If a repository is archival-only, the only party with access to the repository through the publishing service may be the repository manager.

Federation

There are two models of repository federation, according to the locus of content discovery. In both models, discovery is assumed to depend on metadata which resides in participating repositories. Both models also presume some digital representation of the repositories in a data source coordinating the federation, which we term a federator. The federator is a repository which contains representations of repositories (federatees).

In a centralised model, metadata is gathered from participating repositories into a central metadata source. Requests for discovery are transacted on the metadata in the central metadata source. Given the results, users can be redirected for content delivery to the participating repositories. The federator coordinates the population of the central metadata source, and the redirection to participating repositories may be mediated through the federator. The federator and the central metadata source may be the same data source.

In a decentralised model, metadata is not gathered into a central data source. The federator redirects any requests for discovery to each of the participating repositories, and the requests as well as the delivery of data are transacted there.
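The difference between the two models can be sketched in Python. This is a minimal illustration under stated assumptions, not part of FRED or CORDRA: the `Repository` class, its `search` method, the function names, and the record layout are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical federatee holding its own metadata records."""
    name: str
    records: dict = field(default_factory=dict)  # item id -> title

    def search(self, term: str) -> list:
        """Local discovery: ids of records whose title mentions the term."""
        return sorted(i for i, t in self.records.items() if term.lower() in t.lower())


def harvest(repositories: list) -> dict:
    """Centralised model: gather metadata into a central metadata source."""
    central = {}
    for repo in repositories:
        for item_id, title in repo.records.items():
            central[(repo.name, item_id)] = title
    return central


def discover_centralised(central: dict, term: str) -> list:
    """Query the central source; each hit names the federatee to redirect to."""
    return sorted(key for key, title in central.items() if term.lower() in title.lower())


def discover_decentralised(repositories: list, term: str) -> list:
    """Decentralised model: forward the request to every participating repository."""
    return sorted((repo.name, item_id) for repo in repositories for item_id in repo.search(term))


repos = [
    Repository("alpha", {"a1": "Soil survey data", "a2": "Rainfall records"}),
    Repository("beta", {"b1": "Soil chemistry thesis"}),
]
central_source = harvest(repos)
centralised_hits = discover_centralised(central_source, "soil")
decentralised_hits = discover_decentralised(repos, "soil")
```

In this sketch both models return the same hits while the central source is up to date; they differ in where the query is transacted and therefore in what the federator must store.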

Registry, Directory, Dropbox and Roster

Both the federator and the central metadata source are data sources containing representations of digital objects and the repositories that contain them. They thus match the ISO 2146 definition of a registry as a collection of registry objects—collections, activities, parties, and services. (A registry is a second order collection.)

We use here a more restrictive sense of registry, as a (second order) data source providing accountability for its content through the provision of provenance and authority metadata. We allow for lesser degrees of trustworthiness in data sources, from registry through directory, dropbox, and roster. The federator and central metadata sources need not have high trustworthiness in order to function, but the FRED project recommends such trustworthiness, and describes the federator and the central metadata source as the repository registry and the metadata registry (CORDRA: Master Catalogue). (The FRED Service Usage Model also allows a distinct collection registry; the CORDRA model additionally provides for a System Registry presenting the semantic model instantiated in the federation.)

A registry under this model’s definition is a managed collection of repositories and/or of federated metadata. Managing the registry as a collection has a strong policy component at the acquisition phase: a repository is only added to the federation through a contractual arrangement. This arrangement includes assurances of the federatee’s trustworthiness in providing a publishing service (SLA): the trustworthiness of the federation is conditional on the trustworthiness of its participants (the centralised metadata must be up to date). The arrangement includes a policy on the format of the metadata, which also ensures trustworthiness for the federator (the metadata looks the same whatever its source, so queries on the registry will behave consistently).

The less trustworthy types of second order data source are not described here in any detail. A directory is a managed collection of repositories and/or of federated metadata, but makes no contractual requirement for adding a repository to the federation. (The federation is open and “third-party”, and makes no guarantee of the trustworthiness of its publishing service.) A dropbox is an unmanaged collection of repositories and/or of federated metadata, and no systematic attempt is made to expose or record authority metadata; so the accountability and provenance of content is not guaranteed. A roster of repositories is the minimum infrastructure required by a decentralised federation: it is merely a listing of participating repositories, with no requirement of policy compliance.
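The four kinds of second order data source can be told apart by a few yes/no properties. The following Python sketch encodes one reading of the definitions above. The `classify` function and its three parameters are hypothetical names introduced for illustration, and treating "listing only" as the defining property of a roster is an assumption of this sketch.

```python
from enum import Enum


class SourceType(Enum):
    REGISTRY = "registry"
    DIRECTORY = "directory"
    DROPBOX = "dropbox"
    ROSTER = "roster"


def classify(managed: bool, contractual: bool, listing_only: bool) -> SourceType:
    """Classify a second order data source (assumed reading of the text).

    managed      -- the source is curated as a managed collection
    contractual  -- repositories join only through a contractual arrangement
    listing_only -- the source merely lists participating repositories
    """
    if listing_only:
        return SourceType.ROSTER      # minimum for a decentralised federation
    if not managed:
        return SourceType.DROPBOX     # unmanaged, no authority metadata
    if contractual:
        return SourceType.REGISTRY    # managed and contractually accountable
    return SourceType.DIRECTORY       # managed, but open to any repository


fred_recommendation = classify(managed=True, contractual=True, listing_only=False)
```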

Federating Repository

In the context of federating repositories, a (participating) repository is conceived of as a federatee: a system actor that enters into federation transactions with a federator, such as harvesting. Note that repositories rather than collections are what enter into federations, since participation in a federation is conditional on the ability to publish content to end users, a characteristic which defines repositories rather than collections.

According to the CORDRA model, a repository can store registered objects, give an interface for retrieval, and have standard formats allowing for data import/export. For it to share content with a central metadata registry through harvesting or depositing also means, at the policy level, that a repository has an associated manager, capable of entering into a relation of trust with a registry. This imposes the following requirements on repositories:

To enable content to be registered, the repository MUST:

Accordingly, the following qualities of repositories are required for a repository to enter into a well-defined federation, such as FRED defines:

Managerially- and Access Federable are matters of policy; they distinguish between policy at the repository level and policy at the item level.

Access need not be immediate, through say a hyperlink to content: it can be a PO Box address or a shopping cart interface (which provides delayed accessioning). It is enough that the process of accessioning (retrieval) can be initiated. If on the other hand, no access protocol is provided, the registry merely records that the item exists and has certain metadata, but gives no indication of where the item can be retrieved.

FRED expects that all these qualities are fulfilled by participating repositories in its centralised federation model. The repository enters into federation as a single system. (This is how a federation of federations can be built: the participating federation is presented to the federation of federations as a single system, through its central metadata source.) The repository has a known party who is responsible for it to the federation, and who is prepared to establish a relation of trust with the federation. The repository must have the appropriate services and permissions defined to share its metadata with a central metadata registry. Once content on the participating repository has been discovered through the central metadata registry, the discovery service must provide accessioning for the user to retrieve the content from the participating repository; and the participating repository must be capable of packaging the content for delivery to the user. Neither FRED nor CORDRA requires that accessioning be immediate (e.g. through a hyperlink on a resolvable identifier).

Note that not all repositories fulfil all these qualities, and indeed not all federations require them. For instance, a repository may participate in a federation without providing either encoding or accessioning for its content. In that case the end user can only retrieve descriptions of the content through metadata, and not the content itself.

Federation Data

In order to participate in a centralised federation, a repository must expose some of its data to be registered in the metadata registry. This includes but is not limited to content metadata; it also includes access information about the item itself, which is resource metadata—item ID, item location, item access policy. There is a mismatch between the data the federatee exposes to the federator, and the data the federator requires from the federatee for the federation to work.

The federatee (the participating repository) exposes leviable data, which can potentially be ingested into the registry. The federator only requires levied data to be exposed for ingesting into the metadata registry by the federatee. Leviable data is not restricted to metadata descriptions of individual content items, and can also include metadata about the overall collection, or the publishing service (i.e. the repository). The levied data for an item should be a subset of the leviable data (allowing for combinations of item and collection metadata not made explicit in the source repository), although other scenarios are conceivable.

Levied data is transformed internally by the federator into the formats it actually needs for compliance with the federation. For a content item, this is registration data: the data required for registration of an item conformant to the federation. The responsibility for the transformation and preparation of registration data may lie with either the federator or the federatee, depending on community practice.

Finally, data may be exposed by the federation as federation metadata, in order to provide federated repository discovery. This is reexposed data, since it is based on data the participating repository had initially exposed to the registry. Typically, this will be a subset of registration data, which the federation has determined is fit to be exposed for discovery. Not all registration data need be reexposed. Levied data can be exposed independently by policy if it is expedient, e.g. to provide direct access to it if the end user is interested, rather than sending them to the repository, or to support value-added discovery beyond what the registration data allows (reserve metadata). But the reexposed data is what the registry has contractually committed to exposing. (Leviable data which has not actually been levied, on the other hand, is not available to the federator to reexpose, and must be accessed directly from the participating repository.)
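The relation between leviable, levied, registration, and reexposed data can be sketched as a chain of small Python functions. This is an illustration only: the function names, the field names, and the particular transformation (lower-casing keys and trimming values) are assumptions of the sketch, not requirements of FRED.

```python
def levy(leviable: dict, required: set) -> dict:
    """Take from the leviable data only the fields the federator requires."""
    missing = required - set(leviable)
    if missing:
        raise ValueError(f"federatee does not expose: {sorted(missing)}")
    return {name: leviable[name] for name in sorted(required)}


def register(levied: dict) -> dict:
    """Transform levied data into registration data.

    Assumed transformation: lower-case field names, trimmed string values.
    """
    return {name.lower(): str(value).strip() for name, value in levied.items()}


def reexpose(registration: dict, exposable: set) -> dict:
    """Expose only the registration fields the federation has approved."""
    return {name: value for name, value in registration.items() if name in exposable}


# Hypothetical item record as exposed by a federatee.
leviable = {
    "Title": " Soil survey ",
    "Creator": "A. Author",
    "Identifier": "item-42",
    "Notes": "internal",  # leviable, but never levied in this example
}
levied = levy(leviable, {"Title", "Creator", "Identifier"})
registration = register(levied)
reexposed = reexpose(registration, {"title", "identifier"})
```

Here `Notes` stays with the federatee, `creator` is registered but not reexposed, and only `title` and `identifier` reach end users of the federation.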

This brings about the following workflow:

Levy data (what the repository has) would ideally be the same as registration data (what the registry needs); but that may not happen. Along the way, the metadata can be augmented from other sources:

The final augmentation of data is unrelated to the content or presentation of data from the federatees. Such value-added data includes annotations and rankings. It may have been generated within the registry system, or by third parties. But the association of such value-added data with the content item occurs in the context of the federation, not the source repository. Like registration data, such value-added data needs to be activated (=made exposable) explicitly to be available to end users.

The FRED model assumes that registration data includes both content metadata (describing the content of the item, and exploited in discovery on the metadata registry), and resource metadata (describing the item itself as a [digital] object). Resource metadata crucially includes address data: data that facilitates access to content data, through a direct, indexical relation. This includes both identifiers and locators. Without address data, the federation cannot redirect requests to obtain discovered content back to the participating repositories. If only resource metadata is registered, the federation cannot enable discovery on the central metadata registry; but it might still allow decentralised federation discovery.
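The dependency of federation functions on the two kinds of metadata can be made explicit in a short Python sketch. The dictionary layout (`content_metadata`, `resource_metadata`, `identifier`, `locator`) and the function name are hypothetical, chosen only to mirror the terms used above.

```python
def discovery_options(registration: dict) -> dict:
    """Report which federation functions a registration record can support."""
    content = registration.get("content_metadata") or {}
    resource = registration.get("resource_metadata") or {}
    # Address data is an identifier or a locator within the resource metadata.
    has_address = bool(resource.get("identifier") or resource.get("locator"))
    return {
        # Discovery on the central metadata registry needs content metadata.
        "central_discovery": bool(content),
        # Redirecting to the participating repository needs address data.
        "redirect_to_repository": has_address,
    }


full_record = {
    "content_metadata": {"title": "Soil survey"},
    "resource_metadata": {"identifier": "item-42"},
}
resource_only = {"resource_metadata": {"locator": "repository-a/item-42"}}

full_options = discovery_options(full_record)
resource_only_options = discovery_options(resource_only)
```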

CORDRA differentiates global identifiers from local locators; if the identifier is globally resolvable, the intermediate locator providing a service of access to the item is superfluous. FRED presumes all content is identified through identifiers which are resolvable by any end user of the federation, through whichever compatible system they access the federation. The intent is that access to content is not restricted to particular systems.

Managed collections are within scope of repository federations, and have their own metadata, including their own address data. The federation registers information about items directly, since individual item metadata is searchable on the registry whether it belongs to a managed collection or not. The retrieval of an item in a managed collection, however, may be mediated through that collection’s access policy. So the resolution of the item has to respect that policy, though this functionality should already be provided by the hosting repository.

Acquiring content

Federating data is a process logically subsequent to acquiring data, although there are similarities between the two processes: both involve moving data into a data source. A data object must be in a federatee repository before data about it can be levied for federation. Moreover, data may have to be reviewed before it is published through a repository, whether the repository is a federatee (i.e. a normal repository) or a federator. So the following sequence of events occurs:


| Business Process | Event Type | Data moves from | Data moves to | Actor initiating |
| --- | --- | --- | --- | --- |
| Create | Acquire | | Author’s data source | Creator |
| Submit | Acquire | Author’s data source | Federatee | Creator or Federatee Manager |
| Transform | Curate | Federatee | Federatee | Federatee or Federatee Manager |
| Review | Curate | Federatee | Federatee | Federatee Manager |
| Publish | Publish | Federatee | Federatee | Federatee Manager |
| Harvest | Acquire | Federatee | Federator | Federatee (push) or Federator (pull) |
| Transform | Curate | Federator | Federator | Federator or Federator Manager |
| Review | Curate | Federator | Federator | Federator Manager |
| Publish | Publish | Federator | Federator | Federator Manager |
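The sequence of events in the table can also be written down as data, which makes it possible to check that each event picks up the data where the previous one left it. The Python sketch below is illustrative only: the `Event` type and helper functions are hypothetical, and the Create event is assumed to have no source data source because the data does not yet exist.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    process: str
    event_type: str
    source: Optional[str]  # data source the data moves from
    target: str            # data source the data moves to


WORKFLOW = [
    Event("Create", "Acquire", None, "Author's data source"),  # assumed: no source
    Event("Submit", "Acquire", "Author's data source", "Federatee"),
    Event("Transform", "Curate", "Federatee", "Federatee"),
    Event("Review", "Curate", "Federatee", "Federatee"),
    Event("Publish", "Publish", "Federatee", "Federatee"),
    Event("Harvest", "Acquire", "Federatee", "Federator"),
    Event("Transform", "Curate", "Federator", "Federator"),
    Event("Review", "Curate", "Federator", "Federator"),
    Event("Publish", "Publish", "Federator", "Federator"),
]


def is_continuous(events: list) -> bool:
    """True if every event starts where the preceding event ended."""
    return all(later.source == earlier.target for earlier, later in zip(events, events[1:]))


def internal_stages(events: list, data_source: str) -> list:
    """Processes that act on data already held by the given data source."""
    return [e.process for e in events if e.source == data_source and e.target == data_source]


federatee_stages = internal_stages(WORKFLOW, "Federatee")
federator_stages = internal_stages(WORKFLOW, "Federator")
```

In this encoding the federatee and the federator go through the same internal stages once the data has been acquired, which is the parallel between the two workflows.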

What happens to data when it is entered into a federatee repository is similar to what happens when it is entered into a federator repository. In both cases, the data is transformed to conform to the target data source’s format requirements (optional); reviewed for quality assurance (optional); and made available to end users of the repository. There are three main differences between the respective workflows:

Because of the manual triggering of getting content into a normal repository, and the greater emphasis on the review processes, this procedure is modelled separately, with the following vocabulary:

The overall workflow involved in getting content into a normal repository is Deposit. The deposit workflow involves three stages:

If the object is reviewed successfully, a Publication workflow can be initiated for the object: the object is thereby discovered, disseminated and delivered through the repository. (It is discovered through exposing metadata to search or browse; it is delivered by fulfilling requests to obtain the object; and it is disseminated by making the object accessible to other users or systems, including through harvesting and re-ingesting.)

The deposit workflow is undertaken in order to include something in a repository, whether for reasons of safekeeping, legislative obligation, or accessibility. The workflow is undertaken only with respect to the target repository: the quality assurance in the review is set by the target repository, and the publication is specific to that repository. Any quality assurance undertaken in the federation, and the availability of the object through the federation, are negotiated separately in the processes of establishing and maintaining the federation: they are not involved in the deposit workflow itself.

The data objects and metadata required for a submit event may be bundled together as a submission package, e.g. in a schema like METS. The data object and metadata generated for publication, at the conclusion of the deposit workflow, may be bundled together as a dissemination package. Packaging allows the endpoints of the deposit workflow to be integrated with other systems as services, without manual intervention. This is necessary if the publication data source is distinct from the ingest data source: the dissemination package is what populates the publication data source, through an exchange of metadata between repositories. Preserving such metadata in a consistent format is also important in establishing object provenance.

The deposit workflow enhances the content object with more information (e.g. publishing policies, suitability for publication, reviewer comments, search keywords), so the submission package (the process input) will usually not be identical to the dissemination package (the process output). The submission package may contain suggestions about how the content object is to be enhanced through curatorial or publication metadata (e.g. suggested embargo period, hierarchical relation with other content objects); but the decision on that metadata is undertaken during the course of the deposit workflow within the repository system.

When a party submits an object to a repository, the repository then ingests the object. This means that the object is moved into the control domain of the repository (and out of the control domain of the creator), and is then processed to conform to the repository’s data requirements. So the ingest function combines moving the object into the repository system, and transforming the object. Submitting and ingesting the object are typically coupled tightly, but this need not be the case:

The transformations the object can undergo in ingestion include unpackaging, identifier generation, virus checking, checksum generation, metadata generation, as well as transformations of file formats, schemata, vocabularies, and encodings. The transformations can be quite complex: for instance, sending metadata to a discovery service, extracting metadata, or creating derivative objects.
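Two of the simpler ingest transformations, identifier generation and checksum generation, can be sketched with the Python standard library. This is a minimal sketch under assumptions: the `ingest` function, the choice of UUID-based identifiers and SHA-256 checksums, and the field names are illustrative and are not prescribed by FRED or CORDRA.

```python
import hashlib
import uuid


def ingest(payload: bytes, submitted_metadata: dict) -> dict:
    """Return the repository's record for a submitted object.

    The submitted metadata is copied, so the submission itself is left
    unchanged, and generated metadata is added to the copy.
    """
    record = dict(submitted_metadata)
    # Identifier generation (assumed scheme: a random UUID expressed as a URN).
    record["identifier"] = f"urn:uuid:{uuid.uuid4()}"
    # Checksum generation, so later copies can be verified against the original.
    record["checksum_sha256"] = hashlib.sha256(payload).hexdigest()
    # Simple generated metadata.
    record["size_bytes"] = len(payload)
    return record


submission = {"title": "Soil survey"}
record = ingest(b"example content", submission)
```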

Content may be ingested from one repository into another, and this need not occur in the context of a deposit workflow. Ingest can apply to:

One or more data sources may be involved in the deposit workflow. Before submission, the object may be located on the creator’s data source, or in no data source at all; but it cannot be located on a data source within the target repository system.

Once ingested, the content object may reside on the same data source throughout the workflow, and be published without changing location. Alternatively, the object may be placed into a temporary data source while it is undergoing transformation or review: it is only moved to its final destination on publication. This depends on how publication is realised in the repository: the repository may house published and unpublished objects on a single data source, selectively releasing objects for external access; or else it may have a dedicated publication data source, on which all objects housed are available for external use. (So a repository may combine several data sources architecturally, but as noted is presented to external users as only a single digital object—the publication data source.)

Appropriate copy

An appropriate copy service selects for delivery one out of several instances of a resource, according to user and other contextual parameters. If an identifier has multiple resolutions, an appropriate copy service can choose to resolve to the most appropriate instance of the thing identified, out of the instances nominated by each possible resolution.
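A very small appropriate copy service might look like the following Python sketch. The selection rule (prefer a copy held by the user's own institution, then a copy in the preferred format), the `Instance` fields, and the example URLs are all assumptions made for illustration; a real service would apply whatever contextual parameters its federation defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    url: str
    institution: str  # institution whose users may access it, or "open"
    format: str


def appropriate_copy(instances: list, user_institution: str,
                     preferred_format: Optional[str] = None) -> Optional[Instance]:
    """Pick one accessible instance for this user, or None if there is none."""
    accessible = [i for i in instances if i.institution in (user_institution, "open")]
    if not accessible:
        return None

    def rank(instance: Instance) -> tuple:
        # False sorts before True: own-institution copies first, then format match.
        return (instance.institution != user_institution,
                instance.format != preferred_format)

    return min(accessible, key=rank)


# Hypothetical resolutions of one identifier.
copies = [
    Instance("https://example.org/open/item-42.html", "open", "html"),
    Instance("https://example.org/uni-a/item-42.pdf", "uni-a", "pdf"),
    Instance("https://example.org/uni-b/item-42.pdf", "uni-b", "pdf"),
]
for_uni_a = appropriate_copy(copies, "uni-a")
for_visitor = appropriate_copy(copies, "uni-c", preferred_format="pdf")
```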