CASE STUDY - UMIT Fights Cancer with Data Integration, Mining, and Analysis
Commentary by Dr. Bernhard Pfeifer, Associate Professor, University for Health Sciences, Medical Informatics and Technology
The IMGuS project
Prostate cancer is the most frequent tumor type in males and the second most frequent cause of male death. The IMGuS (Institute for Medical Genomics Research and Systems Biology) project aims to apply high-throughput data processing to identify molecular signatures that allow the stratification of patients who are susceptible to curative treatment of prostate cancer and who need treatment.
A key participant in the IMGuS project, the University for Health Sciences, Medical Informatics and Technology (UMIT), based in Hall (Austria), manages the technical infrastructure and the life science data warehouse part of the project, in coordination with five other research groups.
Data processing is key to cancer research
“A large part of cancer research today consists of data processing and statistical analysis,” explains Dr. Bernhard Tilg, professor and board member at UMIT Institute of Biomedical Engineering. “The goal of these projects is to identify molecular signatures associated with certain types of tumors, so that efficient and non-intrusive diagnostic mechanisms can be designed. Some cancer treatments have high success rates, when the disease is diagnosed in time, but the key problem remains the diagnostic.”
“We use data integration to combine several different data sources to perform advanced analysis and statistics on the whole set,” clarifies Dr. Bernhard Pfeifer, associate professor at UMIT Institute of Biomedical Engineering. “And because of the amount of data the high throughput sources create, an automated approach is mandatory. We looked at a number of data integration solutions, both proprietary and open source, and settled on Talend’s solutions because of their flexibility, openness, and high performance.”
Indeed, it is critical for the project that the chosen data integration solution not only work with all data sources, but also be able to integrate specific data processing approaches; for example, since various medical devices deliver data in different formats, preprocessing of this data is required. Talend’s open architecture allowed UMIT to develop specific components to access and process this data.
The PostgreSQL-based LINDA data warehouse, which is the basis for the statistical analysis of the IMGuS project data, is loaded in two stages. The first stage, dubbed electronic data capture, or EDC, centralizes data from all the different sources: patient samples, reference medical data, genome cartography, etc. “The complexity of the electronic data capture stage is very high,” explains Pfeifer. “Not only are the data providers very diverse—five different universities and research centers—but the formats vary vastly: very large CSV files, high resolution images, RDBMS, XML data, etc.”
Administrative data is also loaded at this stage: patient demographics, information about the biological source a certain sample comes from (tissue, serum, etc.), or information on the data source in which the information is stored.
The second loading stage reconciles, transforms, cleanses, and enriches the data contained in the EDC and loads the LINDA data warehouse. “At this stage, we need to bring in reference data from external providers: medical publications, legacy systems, reference medical databases. Talend’s native support of Web services and XML brings tremendous value to the project,” Pfeifer says. “It allows very easily to parse and to cross reference external sources of data, reducing greatly the time it would otherwise take to enrich the data warehouse.”
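The two loading stages described above can be illustrated with a minimal Python sketch. It is not the UMIT implementation, which is built with Talend jobs and custom components that the text does not detail. The sample records, field names, and the REFERENCE lookup are all hypothetical stand-ins chosen for illustration: stage one centralizes records from differently formatted sources as they are, and stage two cleanses, transforms, and enriches them before they would be written to the warehouse.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; the real IMGuS sources and schemas are not given in the text.
CSV_SOURCE = "sample_id,patient_id,marker,value\nS1,P1,PSA, 4.2 \nS2,P2,PSA,\n"
XML_SOURCE = "<samples><sample id='S3' patient='P1' marker='PSA' value='6.1'/></samples>"
# Stand-in for an external reference source (assumption, not the project's reference data).
REFERENCE = {"PSA": "prostate-specific antigen"}


def capture(csv_text, xml_text):
    """Stage 1 (EDC): centralize records from differently formatted sources, unchanged."""
    staged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        staged.append({"source": "csv", **row})
    for node in ET.fromstring(xml_text).iter("sample"):
        staged.append({
            "source": "xml",
            "sample_id": node.get("id"),
            "patient_id": node.get("patient"),
            "marker": node.get("marker"),
            "value": node.get("value"),
        })
    return staged


def load_warehouse(staged, reference):
    """Stage 2: cleanse, transform, and enrich staged records into warehouse rows."""
    rows = []
    for rec in staged:
        raw = (rec["value"] or "").strip()
        if not raw:
            continue  # cleanse: skip records that carry no measurement
        rows.append({
            "sample_id": rec["sample_id"],
            "patient_id": rec["patient_id"],
            "marker": rec["marker"],
            "marker_name": reference.get(rec["marker"], "unknown"),  # enrich
            "value": float(raw),  # transform: text to number
        })
    return rows


staged = capture(CSV_SOURCE, XML_SOURCE)
warehouse = load_warehouse(staged, REFERENCE)
```

Keeping the capture stage free of any cleansing means the staged records remain a faithful copy of what each provider delivered, so the second stage can be rerun or changed without going back to the sources.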
The data warehouse is refreshed every night, so researchers work with current data when they use ad hoc query and data mining tools or apply advanced statistical models to extract the data relevant to their research.
“UMIT/biomed relies entirely on Talend’s solutions for all data integration needs,” concludes Pfeifer. “We have high hopes that the IMGuS project will contribute to the reduction of prostate cancer mortality rates, and data integration is a critical part of this project. Talend is helping us save lives!”