
Digital Curation & Preservation: At what cost?

It continues to astound me on this MLIS course how many ideas, theories and practices blindly press ahead with the supposed ‘advancement’ of the industry without ever addressing fundamental questions about the underlying nature, impact and value of the work being undertaken. It also amazes me how information studies academics continue to theorise while passively ignoring the poststructuralist theory that has been informing many other disciplines, uninterrupted, for the last 50 years. Reading Helen Shenton’s work has left me no less bemused.

Digital curation and preservation takes as its starting point the mantra ‘we must preserve’ without ever asking whether it is right, or indeed valuable, to preserve. Poststructuralism has worked hard to ensure that history and culture are not controlled as homogeneous entities, but digital curation is now threatening to undo much of that good work. Poststructuralism is a theory of language that denies that words are static, culture-building objects, and instead views language as a highly dispersive, subjective and heterogeneous experience. It is the theory that underlies so many of our achievements over the last 50 years: it led to the feminist movement, to the reconceptualisation of history as a discipline, and to the dismantling of periodisation in literature.

With real-world artifacts we still have the potential to make new discoveries about the past. However, with born-digital objects, which may have a lifespan of as little as 25 years, we will not have the same capacity to rewrite the past through new discoveries. As a result, the digital curators of today are essentially the historians of tomorrow. The files that they choose to save will create a static history that cannot be questioned in the future. Howard Zinn, a postmodern historian, argued that history has traditionally been written by those who win wars. Digital curation, funded as it is by governments and private organisations, is in danger of destroying the culture it aims to preserve in what could become a Big Brother-like scenario.


Helen Shenton’s work in ‘Virtual reunification, virtual preservation and enhanced conservation’ focuses on the digitisation of dispersed works. It is in many ways a hugely interesting project, but its real underlying value needs to be called into question. It is disturbing that Shenton’s work has ‘reunification’ as its goal. This word summons forth a whole litany of other terms — ‘empire’, ‘colonisation’, ‘power’, ‘race’, ‘slavery’ and ‘control’, to name but a few. It inherently references imperialism at a time when the breaking apart of the United Kingdom has become a real possibility in the near future. The fact that some important texts exist in a dispersed format is in itself culturally significant, because it is indicative of the breaking apart of empire itself. Bringing these texts together has the potential to create a false narrative and a homogeneous cultural discourse, and in this sense Shenton, like many of her contemporary information professionals, uses an outmoded form of structuralism to inform her ideas. She argues, in relation to the Sinaiticus Project, that it requires ‘the production of an historical account of the document’ that needs to be objective. The very idea that a homogeneous, ‘objective’ narrative is being added to these documents is a regulating process that ignores the lessons the arts have learned through poststructuralism. Structuralism is also implicit in the layer of information, in the form of digital links, placed over the manuscripts, which again asserts control and authority over the material. Shenton has not stopped to ask what the cost of such a project might be. Nor has she asked why the British Library feels it has the right to oversee the reunification of material from different cultures around the world.

The British Library is not only collecting material; it is seeking to play a role in culture building. I thought the function of a library was to provide non-judgemental access to information. Shenton talks about ‘enhancing’ culture through diplomacy, insofar as cultural diplomacy can play a role in international relations. This shows that there is an implicit and dangerous politics behind these preservation projects. Questions need to be posed about for whom the British Library is attempting to play a role in international relations, and to what end. The project seems to go beyond simply collecting material; it ‘uses’ material to re-tell an old story of empire. It feeds into an attempt by governments to create and control fake grand narratives. Howard Zinn’s principle of postmodern history was a way of challenging power by telling history through dispersed narratives. Shenton’s digitisation project risks cutting off avenues to the past for us here in the present and, more dangerously, for people in the future. It poses the danger of manipulating information in ways that reassert a new kind of imperialism, a new homogeneity of information, and an oppressive future in which subjectivity is no longer valued.


Building Speculative Future Capital into Digital Curation and Embracing the Changing ‘Signs’ in Metadata


The following essay discusses the case study ‘Using the DCC Lifecycle Model to Curate a Gene Expression Database’. There is little doubting the benefit of such a project. Gene expression in early human foetal development forms the foundation of how all human life develops, so studying it can help scientists and medical professionals to better understand human growth in relation to the contracting of disease both early and later in life. Recent research indicates that most genetic diseases begin in this early stage of human development, so an archive that allows scientists to draw upon previous gene samples and subsequent experimental results is invaluable in understanding where disease comes from and, by extension, how to prevent and cure it. However, digital curation, no matter how much planning and policy outlining is involved, and no matter how valuable the collection is, is ultimately at the mercy of financial sponsors. Therefore, while it is important to plan through the full lifecycle of a project, it is just as important to build ideas for ongoing future funding into the project and, if possible, either to make the project profitable or to suggest ways in which it could generate revenue based on possible discoveries. In this case, clear guidelines have to be established about the future rights of the project, even if funding is taken over by private enterprises interested in potential discoveries that could be made from the use of those datasets. In this sense, the Gene Expression Database runs into some problems that could hinder its long-term sustainability.
The case study focuses quite heavily on the technical challenges of the project, as well as on the cycle from creator to end-user, or designated community. However, some of the technical plans are undermined by a lack of clear financial planning, while the designated community needs clearer policies about future human infrastructure and about how the representation information and metadata may evolve beyond the working lives of the current creators and users.

Firstly, the case study provides a strong focus on the technical processes and applications that will be needed to provide long-term security and accessibility for the gene expression database. The study emphasises this point (O’Donaghue & van Hemert 2009, p. 58): “One of the main concerns on the informatics side of the design study is how to curate this resource over the long term. DGEMap will not be a simple archive of images, but rather a constantly changing project with several types of research output and digital assets that will require both coordination and preservation.” The results of the experiments will be processed in local databases, where raw digital images will be created, cleaned using Photoshop, and have representation information and metadata added, before being transferred to searchable databases online. The datasets will then also need to be archived and stored in a long-term digital repository. DGEMap, in this sense (O’Donaghue & van Hemert 2009, p. 59), “comprises two constantly changing databases and a large quantity of images that need to be transformed and mapped before being submitted to one of those databases”. The aim of this essay is not to describe the technical processes in detail, other than to say that the curator has decided to use open source programmes (i.e. MySQL, DRAMBORA, AONSII and SIARD) in order to facilitate later open access to the databases and to ensure consistency across all platforms. They will also apply the OAIS model for the same reason, in conjunction with MISHFISHIE, METS and PREMIS.
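The ingest path described above — a raw image cleaned, paired with representation information and metadata, then submitted onward — can be sketched in a few lines. This is purely illustrative: the class and field names below are my own inventions, not DGEMap’s actual schema, and the example stands in for the OAIS notion of preparing a submission package.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the ingest step: a cleaned image is bundled with
# the representation information and descriptive metadata that future
# users will need to interpret it, before submission to the databases.

@dataclass
class SubmissionPackage:
    image_file: str
    representation_info: dict              # e.g. file format, rendering needs
    metadata: dict = field(default_factory=dict)

def prepare_submission(image_file: str, fmt: str, creator: str, stage: str):
    """Bundle a cleaned image with interpretive information (an OAIS-style
    submission package). All parameter names here are illustrative."""
    rep_info = {"format": fmt, "rendering": "any viewer supporting " + fmt}
    meta = {"creator": creator, "developmental_stage": stage}
    return SubmissionPackage(image_file, rep_info, meta)

pkg = prepare_submission("section_042.tif", "TIFF", "lab_a", "CS16")
```

The point of the sketch is simply that representation information travels with the object from the moment of creation, rather than being reconstructed later.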

It is surprising to find, amid such detailed technical planning, one glaring shortcoming in the case study. That shortcoming arises in relation to the project’s funding and its budget, over which the curators of DGEMap do not appear to have full control. The project has funding for ten years from the European Union; however, the curators envision the project having to be sustained for an indefinite period. Moreover, the actual budget that has been allocated is never discussed in the case study, which prevents readers from fully understanding the policies of the project. For example, DGEMap (O’Donaghue & van Hemert 2009, p. 64) propose the use of “the Dark Archive In The Sunshine State because it is intended for back-end use to other systems, so while having no public interface, it can be used in conjunction with other access systems which adds to the protection of the data”. However, they (O’Donaghue & van Hemert 2009, p. 64) go on to add that “one problem with this usage is the exorbitant costs which may not be sustainable over the long-term. DGEMap propose investigating other storage options further”. It is clear why they want to use DAITSS, but it is worrying that they do not have the budget to use the most suitable storage facility even for a ten-year period, let alone for their indefinite storage needs. The questions that remain over funding call into question the longevity of the project once the EU comes to the end of its funding obligations. This uncertainty is naturally exacerbated by the fact that the EU can undergo considerable economic fluctuations, which can further disrupt long-term funding commitments.
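It is worth noting that an “indefinite” storage commitment is not necessarily unpriceable. A minimal sketch, using entirely invented figures (the case study publishes none), shows how a perpetual commitment can be estimated as a finite endowment when annual storage costs fall faster than money is discounted:

```python
# Illustrative only: the figures are invented, since the case study does
# not publish DGEMap's budget. The point is that indefinite storage can
# still be priced if per-year costs decline over time.

def storage_endowment(annual_cost: float, cost_decline: float,
                      discount_rate: float, years: int = 100) -> float:
    """Sum of discounted future storage costs over a long horizon."""
    total = 0.0
    for t in range(years):
        cost_t = annual_cost * ((1 - cost_decline) ** t)  # costs keep falling
        total += cost_t / ((1 + discount_rate) ** t)      # discount to today
    return total

# e.g. 50,000/year today, costs falling 15% per year, 5% discount rate
endowment = storage_endowment(50_000, 0.15, 0.05)  # roughly 262,500
```

Writing even a rough calculation of this kind into the project’s policies would let the curators state a concrete figure to prospective funders, rather than leaving “indefinite” as an open-ended liability.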

One might argue that a curator cannot plan for an indefinite period, and that if one were to attempt to do so, the project would never begin in the first place. However, because data is going to be constantly added to the database, the case study could do more to emphasise the ongoing conceptualisation of the project. The first cycle of data has already been conceptualised and funding is in place for the initial stages. However, the DCC Lifecycle Model is not a one-cycle model, something that the curators of DGEMap identify but never build upon. They (O’Donaghue & van Hemert 2009, p. 68) accept that as future users access the data and use it to perform new analyses, the data is transformed and re-entered into the start of the cycle again. It is the argument of this essay that this ongoing reconceptualisation gives the curators space to continually update potential future funding partners on the achievements of the project. Each experiment builds upon previous experiments, and each new knowledge set brings the project closer to discoveries that could lead to the development of new treatments for disease. Each time a reconceptualisation happens, these discoveries, or steps towards discoveries, could be capitalised on to continue adding scientific and monetary value to the project, with a view to acquiring future funding. The DCC Lifecycle Model allows for this reconceptualisation, and it needs to be written into the policies of DGEMap.


The second area of interest in this case study lies in its explication of policies around human infrastructure. Obviously, there is going to be an ever-evolving body of participants in this project as new creators join. These creators are initially responsible for adding representation information to the digital files, and this information is later checked by curators with the aim of maintaining consistent linguistic labelling. In this sense, the policies of DGEMap aim to control human infrastructure to ensure consistency. This is why they use PREMIS: to ensure the information remains readable by the community, which means the metadata will need to correspond to knowledge in the designated community. However, this writer believes that any attempt at enforcing a static nature onto language is destructive to the necessary evolution of such projects. Contemporary linguistic theory has been demonstrating since the 1960s that language is far from static. On the contrary, language is an ever-changing, malleable condition in any discipline. Within the realm of science, the labels that we use to signify meaning can play a role in promoting creativity in research and experimentation methods. Because the datasets are stored in two locations, there is always going to be a static account of the results and the images. Allowing creators, who enter the project with new, evolving perspectives and, by extension, new and more relevant linguistic codes, to develop the language used in the metadata in a natural way can only speed up the discovery process. In this sense, words do not come into existence from a vacuum, but are an ever-evolving chain of signifiers that can add meaningfully to the growing body of knowledge being stored. Again, there is plenty of scope within the policies of DGEMap to develop in this more open and organic way.
For example, when referring to the monitoring of the designated community, the case study (O’Donaghue & van Hemert 2009, p. 68) admits: “Effectively, DGEMap would be harnessing the knowledge of its designated community to increase the use and importance of the public database, allowing an even better resource to develop over time.” There is a sense that ‘monitoring’ is too distant a term, and that more control of the project needs to be handed over to the creators and end-users who are developing it.
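The ‘chain of signifiers’ argument above can be made concrete. Rather than freezing the vocabulary, a curation policy could record each change of term as a link in a chain, so that metadata written by earlier creators remains resolvable to whatever term the current designated community uses. The class and the example terms below are hypothetical, not drawn from DGEMap’s actual vocabulary:

```python
# A hypothetical sketch: each superseded term points to its successor,
# so old metadata stays interpretable while the community's language
# is free to evolve.

class EvolvingVocabulary:
    def __init__(self):
        self.successor = {}          # old term -> newer term

    def supersede(self, old_term: str, new_term: str) -> None:
        """Record that the community now prefers new_term."""
        self.successor[old_term] = new_term

    def current(self, term: str) -> str:
        """Follow the chain of signifiers to the term in current use."""
        while term in self.successor:
            term = self.successor[term]
        return term

vocab = EvolvingVocabulary()
vocab.supersede("gene expression site", "expression domain")
vocab.supersede("expression domain", "expression pattern")
```

A structure of this kind would let curators verify consistency without enforcing stasis: the archive keeps a fixed record of what was written, while the mapping layer evolves with the community.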

In conclusion, this essay has examined two important features of the case study on gene expression data storage (DGEMap), in ways that might allow the curators of that project to develop their policies more robustly. The first part of the essay examined ways in which budget constraints and long-term funding uncertainty can undermine even the most careful technical and conceptual planning. The essay suggested that, by more fully utilising and emphasising the ongoing conceptualisation of the new datasets entering the lifecycle of the project, the curators can build in a framework that allows the project to continually court new funding partners beyond the ten-year EU funding base. The second part of the essay cast some doubt on the project’s attempt to control its human infrastructure in a way that may hinder the development of new discoveries. It suggested that digital curators need to understand contemporary linguistic theory more fully, which would encourage them to embrace the unavoidable evolution of language and so add even greater momentum to the discovery process. This is especially true given that curators aim to store information for future generations: if there is a linguistic disconnect between end-users and original creators, the data may become misinterpreted or even linguistically obsolete.


O’Donaghue, Jean & van Hemert, Jano I. (2009), ‘Using the DCC Lifecycle Model to Curate a Gene Expression Database: A Case Study’, The International Journal of Digital Curation, Vol. 4, No. 3.