Webinars | Steven Wiley
Lead Scientist for Systems Biology,
Environmental Molecular Sciences Laboratory,
Pacific Northwest National Laboratory,
Richland, WA 99352 USA.
Dr. Wiley was one of the pioneers of computational biology. In the early 1980s he published one of the first computer simulations of cell dynamics, as well as a number of free and commercial scientific data analysis programs for the Apple II computer. He helped establish one of the first servers and e-mail networks at the University of Utah Medical Center and developed some of their first data management programs. In the 1990s, he created several commercial scientific image acquisition and analysis programs for Bio-Rad Laboratories. During this time he also headed a research group dedicated to building models of cell signaling pathways in breast cancer, which were parameterized using molecular and imaging data. In 2000, he left the University of Utah as a Full Professor and joined Pacific Northwest National Laboratory to build a Systems Biology program that leverages the Laboratory’s analytical and computational technologies to solve important problems in biology. Currently, he is Lead Scientist in Systems Biology for the Environmental Molecular Sciences Laboratory at PNNL, a US Department of Energy scientific user facility, where he leads a team developing a scalable data capture, management and sharing system for data generated by a wide variety of different instruments.
1st February 2016 (14:00 – 14:45 GMT) : “Future-proofing your data: working to ensure that your work will survive the connected world”
Sign up for future webinars.
Biology is increasingly becoming a data-driven science in which high-throughput analytical platforms generate data at an ever-increasing rate. Traditionally, the primary mechanism for communicating biological information was the journal article, which is essentially a description of how scientific groups interpret their own data, with a few anecdotal data examples thrown in for support. In the future, however, the data itself will be the most useful output of primary scientific research. To enable this transformation, data must be available in a form that is easily discoverable, with sufficient metadata to permit quality assessment, normalization and integration. It is also highly likely that future biological data analysis systems will exploit NoSQL database systems for scalability. Thus, to ensure future use of currently generated biological data, there should be a clear migration path to these future systems.

We have explored what is needed to ensure compatibility with these future systems by using the integration of high-throughput genomics, transcriptomics and proteomics data with a NoSQL system as a use case. From this work, we have been able to define minimal metadata standards that permitted data normalization and integration across different sample types and experimental conditions. We have also defined a flexible data framework, using unique sample IDs as key values, that is compatible with both relational and Hadoop/HBase systems. We have found that the types of metadata needed for data reuse and integration are highly dependent on the specific target user. We also found that essential metadata were distributed across a wide variety of different primary data files, requiring multiple mechanisms, interfaces and processes for their capture.
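The idea of a flat, sample-ID-keyed record that maps onto both relational and HBase-style stores can be sketched as follows. This is a minimal illustration under assumed field names (instrument, organism, condition, data_uri); the actual metadata standard developed by the group is not reproduced here.

```python
import json

def make_record(sample_id, instrument, organism, condition, data_uri):
    """Build a flat metadata record whose key value is the sample ID.

    The same flat structure maps onto a relational row (sample_id as
    the primary key) or an HBase-style row (sample_id as the row key,
    the remaining fields as column qualifiers).
    """
    return {
        "sample_id": sample_id,    # shared key value across all platforms
        "instrument": instrument,  # supports quality assessment
        "organism": organism,
        "condition": condition,    # supports normalization across experiments
        "data_uri": data_uri,      # link back to the primary data file
    }

def integrate(tagged_records):
    """Group per-platform records under their shared sample ID."""
    merged = {}
    for platform, rec in tagged_records:
        merged.setdefault(rec["sample_id"], {})[platform] = rec
    return merged

# Two records from different analytical platforms, joined on one sample ID.
genomics = make_record("S001", "sequencer", "E. coli", "heat shock",
                       "file://genomics/S001.fastq")
proteomics = make_record("S001", "mass spec", "E. coli", "heat shock",
                         "file://proteomics/S001.raw")
combined = integrate([("genomics", genomics), ("proteomics", proteomics)])
print(json.dumps(sorted(combined["S001"])))  # → ["genomics", "proteomics"]
```

Because every field hangs off the sample ID rather than off a rigid table schema, new platforms or metadata fields can be added without restructuring existing rows, which is what makes the same records portable between relational and Hadoop/HBase back ends.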
Our experience suggests that, because of the distributed and multidisciplinary nature of biological data generation and analysis, multiple types of software systems and interfaces will be required for data capture and dissemination. Integrating and reusing those data, however, will require the adoption of a universal metadata framework that is linked to the associated primary data files and/or data repositories.