“Before data-stewarding practices can be expertly developed we need to understand the nature of the data to begin with, and how it is expected to change in the coming years.”
The idea of the “data deluge” has been looming over the life sciences for the last 10 years. Advancing technologies keep improving the speed and quality with which large data sets can be collected. Far from being something to fear, this ramping up of data collection can, if handled correctly, drive the speed of discovery in systems research.
Handling this change is a key challenge for the life sciences, and requires forward thinking to develop the right platforms and techniques for collecting, annotating and storing data throughout its lifecycle. There are five characteristics that should be understood for this process:
Volume – data size – the amount of data that can be produced in a given amount of time varies between methods and tends to increase year on year. The highest volume of data tends to come directly from machines in a raw format. The volume often reduces during post-processing to a final form. Understanding these changes ensures that appropriate storage, preservation, and access solutions can be used. It is also important to understand how much raw data to preserve.
Velocity – speed of change – how fast the data is replaced. With improving technologies, some types of datasets can become obsolete quickly. Knowing a timescale for obsolescence allows stored datasets to be managed appropriately over the long term, dictating if and when to scrap the old.
Variety – different forms of data storage – data types can be collected using different methods and technologies. These usually allow a specialised way of collecting the most valid data for any given study. Understanding the range of these techniques, and how the final data is used and, over the long term, re-used or repurposed, is very important for valid downstream use.
Veracity – uncertainty of data – describes how messy the data is, and therefore how much it can be trusted for study. High-throughput data often has to compromise on qualities that may be useful (e.g. full quantification for metabolomics data).
Value – how useful is it for the investigation at hand?
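As a rough illustration of how these characteristics could feed into a data-management plan, the sketch below records them for one data type and estimates how much storage would be held each year. It is only a sketch under stated assumptions: the `DatasetProfile` record, the `projected_storage_gb` function, the constant growth rate, and every number in the example are hypothetical and do not come from the meeting report.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical record of the five characteristics for one data type."""
    name: str
    raw_gb_per_year: float      # Volume: raw output straight from the machine
    processed_fraction: float   # Volume: size left after post-processing, as a share of raw
    annual_growth: float        # Volume: assumed yearly growth in output (1.0 = doubling)
    obsolete_after_years: int   # Velocity: assumed time before a dataset is scrapped
    formats: tuple              # Variety: storage formats in use
    veracity_note: str          # Veracity: known limits on trustworthiness
    value_note: str             # Value: usefulness for the investigation at hand


def projected_storage_gb(profile, years, keep_raw=True):
    """Estimate the storage held at the end of each year, in GB.

    Assumes constant growth, and that datasets older than
    `obsolete_after_years` are scrapped.
    """
    per_year = []
    totals = []
    for year in range(years):
        raw = profile.raw_gb_per_year * (1 + profile.annual_growth) ** year
        stored = raw * profile.processed_fraction
        if keep_raw:
            stored += raw
        per_year.append(stored)
        # Only datasets that are not yet obsolete still occupy storage.
        first_live = max(0, year + 1 - profile.obsolete_after_years)
        totals.append(sum(per_year[first_live:]))
    return totals


# Invented example values, for illustration only.
example = DatasetProfile(
    name="example-metabolomics",
    raw_gb_per_year=100.0,
    processed_fraction=0.5,
    annual_growth=1.0,
    obsolete_after_years=2,
    formats=("raw vendor files", "processed tables"),
    veracity_note="not fully quantified",
    value_note="useful for screening studies",
)

print(projected_storage_gb(example, years=3))                   # [150.0, 450.0, 900.0]
print(projected_storage_gb(example, years=3, keep_raw=False))   # [50.0, 150.0, 300.0]
```

Comparing the two printed lists shows how strongly the decision to preserve raw data affects the storage that needs to be planned for.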
As a community, we held a joint meeting in March 2014, bringing together biomedical sciences research infrastructures (BMS RIs) covering genomics, proteomics, imaging, metabolomics, and clinical data. Over the two-day meeting, teams of experts for each data type brainstormed current data characteristics and those predicted for the future. The result was a report that both infrastructures and researchers can use as a basis for developing data-management plans. The findings are still useful two years on.
Bernard Marr (2015). Big Data. John Wiley & Sons, 1st edition.
Written by Natalie Stanford of FAIRDOM, and edited by Steffi Suhr from the BioMedBridges Infrastructure and EBI.