Policy-Based Data Management: The Future of Reproducible, Data-Driven Research

There is a growing trend to place strict policies on the preservation and dissemination of research data produced with public funding. As an example, the Engineering and Physical Sciences Research Council (EPSRC) in the U.K. has published a comprehensive description of its expectations, together with clarifications intended to help institutions comply with these policies.

The required EPSRC policies can be analyzed in terms of computer-actionable rules that are automatically enforced by the data management system. For each property required for preservation and dissemination, rules can be defined that ensure compliance over time. For instance, to ensure that data are accurate, reliable, complete and authentic, a data management system should be able to generate a list of rules to be enforced, define the required storage locations, define storage procedures that generate archival information packages, maintain checksums and manage replication.
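
As a concrete illustration of what such a rule might look like outside of any particular rule engine, the following Python sketch checksums a newly ingested file, copies it to a second storage location and verifies the copy. The storage paths and the choice of SHA-256 are assumptions made for the example, not part of the EPSRC guidance.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical storage locations; a real deployment would use the storage
# resources configured in its data management system.
PRIMARY_STORE = Path("/data/primary")
REPLICA_STORE = Path("/data/replica")

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum for later integrity verification."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def on_ingest(path: Path) -> dict:
    """Sketch of an integrity rule: checksum the new file, replicate it to a
    second location and verify that the replica matches the original."""
    checksum = sha256_of(path)
    replica = REPLICA_STORE / path.relative_to(PRIMARY_STORE)
    replica.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, replica)
    if sha256_of(replica) != checksum:
        raise RuntimeError(f"replica of {path} failed verification")
    # State information that a policy-based system would retain as metadata.
    return {"checksum": checksum, "replicas": [str(path), str(replica)]}
```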

To ensure that the data are citable and attributable, rules are needed to support versioning of files, assignment of globally unique identifiers (GUIDs) and access to files through a URL. To ensure that data are identifiable, retrievable and available, rules can be used to organize the data in a collection hierarchy and to enforce retention and disposition strategies.
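
A rule supporting citation and attribution could, for instance, mint a globally unique identifier the first time a file is registered, increment a version number on each subsequent registration and derive a resolvable URL from both. The sketch below is a minimal illustration; the base URL and the in-memory catalogue are placeholders for whatever identifier service and metadata store an institution actually uses.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical landing-page prefix used to resolve identifiers.
BASE_URL = "https://data.example.org/objects"

def register_version(catalog: dict, logical_path: str) -> dict:
    """Assign a GUID on first registration, bump the version number on each
    later registration, and return citation metadata for the file."""
    entry = catalog.setdefault(logical_path,
                               {"guid": str(uuid.uuid4()), "version": 0})
    entry["version"] += 1
    entry["registered"] = datetime.now(timezone.utc).isoformat()
    entry["url"] = f"{BASE_URL}/{entry['guid']}?version={entry['version']}"
    return entry

# An in-memory dictionary stands in for the system's metadata catalogue.
catalog: dict = {}
print(register_version(catalog, "/projects/alpha/results.csv"))
```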

Rules are also required to secure data against loss and degradation by assigning an owner to each file, defining access controls for public and shared use, verifying the number of replicas and the integrity of the data, and tracking the chain of custody. In addition, rules help ensure compliance with legal obligations, ethical responsibilities and the rules of funding bodies by documenting restrictions, generating usage and storage cost reports, tracking staff expertise, tracking management approval of the rules that are enforced and tracking changes to policies.
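
The chain-of-custody requirement can be pictured as a small amount of state kept alongside each object, as in the following sketch. The permission names and the idea of recording who approved each change are illustrative assumptions rather than a prescribed scheme.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ManagedObject:
    """Minimal state a custody-tracking rule might maintain for each file."""
    logical_path: str
    owner: str
    acl: dict = field(default_factory=dict)        # user -> "read" | "write" | "own"
    audit_log: list = field(default_factory=list)  # chain-of-custody events

    def grant(self, user: str, permission: str, approved_by: str) -> None:
        """Change an access control entry and record who authorized it."""
        self.acl[user] = permission
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "action": f"grant {permission} to {user}",
            "approved_by": approved_by,
        })

obj = ManagedObject("/projects/alpha/results.csv", owner="alice")
obj.grant("public", "read", approved_by="data_steward")
print(obj.audit_log)
```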

As data sets increase in size, number and complexity, the need for automation in data management will become more apparent. The Integrated Rule-Oriented Data System (iRODS) is a flexible, open-source product that can be used to enable this automation, providing both enforcement of policies and a reporting infrastructure useful to research institutions, funding organizations and individual researchers.

As with all compliance initiatives, there must first be a willingness on the part of researchers to engage in collection formation. A collection provides a context for data, recording provenance and descriptive information. Collections also enable management of standard data formats, standard metadata and standard access mechanisms. Generally followed guidelines should define both the data structure and the metadata structure. The automated policies should include format checking and metadata extraction, so that an external database fully describing the retained data sets can be developed. The researcher, the host institution and the funding organization can each be provided a report indicating that requirements have been met and that the data are in a form that can be disseminated. This report can serve as a useful feedback tool, letting the researcher know that the data are in a usable form and giving the funding agency the information it needs to confirm that the management criteria have been met.
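
A minimal sketch of such ingestion-time checking might look like the following, assuming (purely for illustration) that the agreed format is CSV and that the external database is a local SQLite file. A production profile would validate far more than the file extension.

```python
import csv
import sqlite3
from pathlib import Path

def check_and_describe(path: Path) -> dict:
    """Verify that a file follows the agreed format (CSV, for this example)
    and extract basic descriptive metadata from it."""
    if path.suffix.lower() != ".csv":
        raise ValueError(f"{path} does not match the agreed format")
    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        raise ValueError(f"{path} is empty")
    return {"path": str(path), "columns": len(rows[0]), "records": len(rows) - 1}

def catalogue(meta: dict, db: str = "catalogue.db") -> None:
    """Record the extracted metadata in an external database describing
    the retained data sets."""
    with sqlite3.connect(db) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS datasets "
                     "(path TEXT, columns INTEGER, records INTEGER)")
        conn.execute("INSERT INTO datasets VALUES (?, ?, ?)",
                     (meta["path"], meta["columns"], meta["records"]))
```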

Policy-based data management systems rely on the formation of a consensus on the desired properties of a collection. This is needed for all types of data management applications. Given a desired property (e.g., integrity), management policies can be defined for each collection; an example would be replication of all files and generation of checksums for verifying integrity. The policies control the execution of procedures that are implemented as computer-actionable workflows. The workflows chain together explicit tasks that perform the data replication and generate a checksum (see Figure 1). Each task may generate state information that is saved as metadata attributes on the digital objects in the collection. Users interact with the system through clients, and actions specified by a client are trapped at policy enforcement points. Validation of assessment criteria is typically done through automatic execution of periodic policies. This makes it possible to enforce management policies, automate administrative tasks and automate verification of collection properties.
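
The interplay of enforcement points, chained tasks and state metadata can be sketched generically as below. The task bodies are placeholders that record illustrative state rather than invoking real microservices, and the enforcement-point name is an assumption for the example.

```python
# Task bodies are placeholders; a real rule engine would invoke procedures
# that actually replicate the data and compute checksums.
def replicate(obj: dict) -> dict:
    return {"replica_count": 2}

def verify_checksum(obj: dict) -> dict:
    return {"checksum_verified": True}

# A policy maps an enforcement point to the ordered tasks of its workflow.
POLICIES = {"post_ingest": [replicate, verify_checksum]}

def enforce(point: str, obj: dict) -> dict:
    """Trap an action at a policy enforcement point, run the chained tasks
    and save the state each task reports as metadata on the digital object."""
    for task in POLICIES.get(point, []):
        obj.setdefault("metadata", {}).update(task(obj))
    return obj

print(enforce("post_ingest", {"logical_path": "/projects/alpha/results.csv"}))
```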

Figure 1 ‒ Data ingestion, verification of format adherence and disposition, including persistent state information and policy definitions. The entire management workflow can be accomplished without human intervention.

There is a general expectation that published research papers should include a statement describing how, and on what terms, supporting research data may be accessed. The automated data management system should be able to capture this statement and translate it into a form that regulates the required access rights. The researcher should also be able to request an automated report whenever the data have been accessed; such a report could be very useful in developing additional funding requests. The access rights may be complex depending on the nature of the regulated data, and there may be legal or ethical reasons for limiting access. The automated data management system could present an access requirements checklist to a potential user, which might require nothing more than the authentication of credentials or might also include the acceptance of a non-disclosure agreement. In all cases the controlling institution would also receive a report, so that legal requirements can be met.
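
One way to picture such a checklist is the sketch below, in which access is granted only when credentials have been authenticated and, for restricted data, a non-disclosure agreement has been accepted, with every successful access appended to a report. The specific requirements shown are assumptions for the example, not a prescribed policy.

```python
from datetime import datetime, timezone

access_events: list = []   # accumulated for the researcher's access report

def request_access(user: dict, dataset: dict) -> bool:
    """Walk a simple access-requirements checklist: authenticated credentials,
    plus an accepted non-disclosure agreement for restricted data."""
    if not user.get("authenticated"):
        return False
    if dataset.get("restricted") and not user.get("nda_accepted"):
        return False
    # Record the access so that a report can later be generated for the
    # researcher and the controlling institution.
    access_events.append({
        "who": user["name"],
        "what": dataset["name"],
        "when": datetime.now(timezone.utc).isoformat(),
    })
    return True

print(request_access({"name": "bob", "authenticated": True, "nda_accepted": True},
                     {"name": "trial_results", "restricted": True}))
```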

Research organizations are required to publish provenance metadata and descriptions of the publications they hold so that others know the data exist and can be accessed. This process can be fully automated, with information about the nature of the data, how they were generated and why they were published extracted upon ingestion into the archive. The ability to automate the generation of the desired metadata is essential when collections contain millions of files. For example, the generation of a robust digital object identifier can be controlled by a policy applied on ingestion of each file into a collection and managed by the policy-based system.
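
An ingestion policy could assemble the published provenance entry along the following lines. The identifier shown is a locally minted placeholder standing in for a registered DOI, and the field names are illustrative assumptions.

```python
import uuid
from datetime import datetime, timezone

def provenance_record(path: str, generated_by: str, reason: str) -> dict:
    """Build the provenance entry an ingestion policy could publish for a file:
    what the data are, how they were generated and why they were published."""
    return {
        "identifier": f"example:{uuid.uuid4()}",  # placeholder, not a registered DOI
        "path": path,
        "generated_by": generated_by,
        "published_because": reason,
        "ingested": datetime.now(timezone.utc).isoformat(),
    }

print(provenance_record("/projects/alpha/results.csv",
                        generated_by="simulation run described in the paper",
                        reason="funder dissemination requirement"))
```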

Data retention is a critical part of data curation and management, and the EPSRC has mandated that data from funded research be maintained online for 10 years after the date of publication. Retaining accessible data for such an extended duration is nontrivial and requires an infrastructure of hardware and software capable of the task. Object storage is one technology being used in research environments today as a storage resource for iRODS. An object storage system can execute policies that dictate geographical data replication and placement. The policies can be specified by iRODS and determined from metadata extracted upon ingestion. Once a policy is attached to a data object, it is maintained regardless of site failures, network failures or storage element failures. Each object is protected from silent data corruption caused by storage element failures through evaluation of a checksum, which is verified each time the data are served.

The policy-based data management system supported by iRODS can run periodic data checks and can generate a report whenever a data set reaches its defined retention limit. iRODS can also be used to execute policies that extend retention. For example, a researcher may publish follow-on work that references a previous work; if the follow-on work must be retained for 10 years, the referenced data may also need to be retained. Again, these decisions can be made without administrative intervention once the overarching master policies have been defined. Policies can also specify appropriate disposition of the data once the retention period is over: the data may be moved to an archive, submitted to another collection or deleted if better data sets have become available.
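
The retention logic described above reduces to a small decision that a periodic policy could apply to every data set, sketched below. The ten-year window follows the EPSRC mandate, while the reference-based extension and the disposition outcome are simplified stand-ins for institution-specific policies.

```python
from datetime import date, timedelta
from typing import Optional

RETENTION = timedelta(days=3653)   # roughly ten years

def retention_action(published: date, referenced_by_retained_work: bool,
                     today: Optional[date] = None) -> str:
    """Decide what a periodic retention check should do with a data set:
    keep it, extend its retention because later retained work references it,
    or hand it to the configured disposition policy."""
    today = today or date.today()
    if referenced_by_retained_work:
        return "extend retention"
    if today - published < RETENTION:
        return "retain"
    return "apply disposition policy"

print(retention_action(date(2012, 5, 1), referenced_by_retained_work=False))
```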

While the iRODS policy-based data management system has thus far been described as a compliance tool, an important and useful consequence is that it can also become a feedback tool. As mentioned earlier, it is possible to generate a number of automated reports that are based on time or triggered by an action, such as a specific data access. When research data are published, it may be years before they are cited in a related publication, and the citation is often one of the most useful forms of feedback for the researcher and the funding agency. iRODS reports can shorten this feedback loop, allowing researchers to know that their work is being accessed and how it is being used. This type of feedback can enable collaboration in a much shorter timeframe, permitting researchers with similar interests to identify one another. Projects such as the NSF DataBridge initiative are developing the capabilities needed to correlate access information with analysis procedures and data ownership. Data owners can then determine whether their data sets are being used correctly by examining the types of analyses being performed.

The end goal is reproducible, data-driven research. It should be possible for a researcher to publish data sets along with the workflows used to analyze them. A second researcher should be able to re-execute a workflow and obtain the same result. Similarly, a third researcher should be able to change the input parameters of the workflow, compare analyses and save the new version of the results. A data management system that provides this capability must manage not only data sets but also workflows. The iRODS data grid provides this capability through workflow structured objects, in which workflows can be saved and shared. The provenance of each execution of a workflow is tracked, enabling the correlation of input files and output files with the version of the workflow that was executed.
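
A skeletal version of this bookkeeping is shown below: the workflow is identified by a hash of its text, and each execution records the inputs, outputs and parameters against that version. This is a sketch of the general idea under those assumptions, not of the iRODS workflow structured object format.

```python
import hashlib
from datetime import datetime, timezone

def workflow_version(script_text: str) -> str:
    """Identify a workflow by a hash of its text, so every execution can be
    tied to the exact version that produced the outputs."""
    return hashlib.sha256(script_text.encode()).hexdigest()[:12]

def record_execution(script_text: str, inputs: list, outputs: list,
                     parameters: dict) -> dict:
    """Provenance entry correlating input files, output files and parameters
    with the workflow version, so another researcher can re-execute or vary it."""
    return {
        "workflow_version": workflow_version(script_text),
        "inputs": inputs,
        "outputs": outputs,
        "parameters": parameters,
        "executed": datetime.now(timezone.utc).isoformat(),
    }

print(record_execution("compute_mean(dataset)", ["raw.csv"], ["mean.txt"],
                       {"threshold": 0.5}))
```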

The preservation and dissemination of publicly funded data sets should also include the preservation and dissemination of the policies that are used to manage those data sets. Preservation policies define the properties of the preserved collection. Thus, the policies can be evaluated by users of the archive to determine compliance with archival principles of authenticity, integrity, chain of custody and original arrangement. Just as managers of archives can then make assertions about their enforcement of data management plans, users of archives can evaluate the effectiveness of the policies to decide whether the data are trustworthy.

Additional reading

  1. Clarifications of EPSRC expectations on research data management; www.epsrc.ac.uk.
  2. The Integrated Rule-Oriented Data System (iRODS); http://irods.org.

Dave Fellinger is chief scientist, DataDirect Networks, 2929 Patrick Henry Dr., Santa Clara, Calif. 95054, U.S.A.; tel.: 818-700-4000; www.ddn.com. Dr. Reagan W. Moore is professor in the School of Information and Library Science, UNC at Chapel Hill and Renaissance Computing Institute, Chapel Hill, N.C. Dr. Hao Xu is research scientist with the Data Intensive CyberEnvironments Center, UNC at Chapel Hill.