Managing Pathology Research Data in an AI World

Tissue diagnostics and biomarker analytics are the keystones of cancer discovery research. However, delivering on the promise of personalized medicine requires multiple data sources to be integrated and analyzed, and large volumes of tissue samples to be managed and analyzed effectively.
 

To accomplish this, imaging and pathology informatics support will be essential to moving forward in an increasingly complex and multifaceted medical research environment.
 

There are several problems that a research platform needs to address if digital pathology is to prove an effective tool for drug and biomarker discovery studies.
 

  • Shared archive: Virtual Slides – high quality images produced using a scanner – are typically several gigabytes in size. In the past, sharing these image files with colleagues has proven problematic due to their size and the lack of IT infrastructure supporting fast sharing and viewing of these images. Sharing tissue image archives across centers encourages multisite collaboration.
  • Image file formats: In research, users often need a viewer that can read many different image file formats, so that pathologists and lab managers can be trained on a single viewing software platform.
  • Multi-modality data management: Collating and organizing data from multiple data sources brings its own challenges. Clinical histories, diagnostic classes, clinico-pathological data, image analysis data, molecular profiling, genomic and epidemiological data may all be stored in different files and across various locations. Spreadsheets, CSV/TSV files, LIMS, and in-house databases and applications are incapable of managing the quantity and variety of data that needs to be captured as part of a typical study.
  • Image analytics integration: Image analysis tools can quantify and qualify tissue cells and cell structures in a rapid and consistent manner. However, a range of image analysis applications are available – from commercial vendors, as well as in-house and open source solutions. Managing slides that are used in multiple applications and the data produced by their algorithms can be difficult, due to limited interoperability.
  • TMA management: Tissue microarrays (TMAs) provide the means for high-throughput analysis of multiple tissues and cells. The issues involved in collating, organizing and associating data with whole sections of tissue are multiplied with a TMA, as a single TMA slide may contain 200 cores, each potentially representing a different patient. A TMA study of a single 20x10 block, with five stains, three scoring criteria per stain and two reviewing pathologists, will result in six thousand ‘scores’ (200 x 5 x 3 x 2 = 6,000). Mapping these scores to different patients, comparing scores and results, and identifying trends across different cohorts will be time-consuming and prone to human error. Data mining tools, detailed in the next section, are required to solve the challenges of conducting large biomarker investigative studies.
  • Cross study data management: Given the range and volume of data mentioned in the points above, and the high quantity of slides that may be part of a typical study, organizing slides in different ways while maintaining a link with the aforementioned data will prove challenging. Slides may need to be included as part of several studies. Studies may also be organized in many different ways.
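
The TMA arithmetic in the list above can be made concrete with a short sketch. It is only an illustration: the stain labels, criteria names, reviewer identifiers and the core-to-patient map are hypothetical, and it assumes every core comes from a different patient.

```python
from itertools import product

# Hypothetical layout: one 20 x 10 TMA block, i.e. 200 cores.
ROWS, COLS = 20, 10
STAINS = ["stain_1", "stain_2", "stain_3", "stain_4", "stain_5"]
CRITERIA = ["intensity", "percent_positive", "pattern"]  # illustrative names
PATHOLOGISTS = ["reviewer_a", "reviewer_b"]


def expected_score_count(rows, cols, stains, criteria, reviewers):
    """Number of individual scores a fully scored TMA study produces."""
    return rows * cols * len(stains) * len(criteria) * len(reviewers)


def score_keys(rows, cols, stains, criteria, reviewers):
    """Yield one key per score so each score can be tied back to its core."""
    cores = product(range(1, rows + 1), range(1, cols + 1))
    for (row, col), stain, criterion, reviewer in product(
        cores, stains, criteria, reviewers
    ):
        yield (row, col, stain, criterion, reviewer)


# Hypothetical core-to-patient map: here each core is a different patient.
core_to_patient = {
    (r, c): f"patient_{(r - 1) * COLS + c:03d}"
    for r in range(1, ROWS + 1)
    for c in range(1, COLS + 1)
}

total = expected_score_count(ROWS, COLS, STAINS, CRITERIA, PATHOLOGISTS)
keys = list(score_keys(ROWS, COLS, STAINS, CRITERIA, PATHOLOGISTS))
scores_per_patient = total // len(core_to_patient)
print(total, scores_per_patient)  # 6000 30
```

Keying every score by core position, stain, criterion and reviewer is one way to keep the link back to the patient explicit, rather than relying on row order in a spreadsheet.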

Computational Pathology and AI – a rapidly growing discipline

 

The use of Artificial Intelligence (AI) and, in particular, Deep Learning in research is a growing area of interest. Management of pathology images is important in the development, validation and use of these algorithms, particularly within a research workflow. As with other disciplines, a large part of building AI algorithms involves curation of the images and annotations alongside any associated metadata. A recent publication by Campanella et al. in Nature Medicine [1] used over 44,000 slides, illustrating the need to manage datasets (images and associated data) of this magnitude.

When an algorithm is created, this data must be used along with metadata to test hypotheses. All of this means that ad-hoc methods of managing the high-resolution images inherent in digital pathology and their associated metadata and other derived data will inevitably cause difficulties in validating and reproducing large-scale studies. In this blog, we will attempt to highlight some of the challenges of such studies, and how they may be addressed through use of pathology image management solutions such as Xplore. [Not for diagnostic, monitoring or therapeutic purposes or in any other manner for regular medical practice]

Data management challenges in building and applying AI algorithms

[Figure: Xplore images diagram]

Cohort Selection


The fundamental step in identifying a biomarker is to select the cohort of slides which will be used for the study. This requires the ability to ingest digital images in various formats, store them and select those which are appropriate for a research study. These images also carry associated metadata, such as clinico-pathological data, data source and antibody. This metadata must be reliably associated with the images in the cohort, because misalignment is a real risk when the two are held in separate databases or, as is often the case, spreadsheets and CSV files.
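
As a rough illustration of the alignment problem described above, the sketch below joins image records to metadata by a shared slide identifier and reports anything that cannot be matched. The field names (`slide_id`, `antibody`, `source`) and the `build_cohort` helper are hypothetical and do not represent the API of any particular image management system.

```python
def build_cohort(images, metadata, required_fields, criteria):
    """Join image records to metadata by slide ID and select a cohort.

    Returns (cohort, problems). `cohort` holds merged records that meet
    `criteria`; `problems` lists slide IDs that could not be aligned.
    """
    meta_by_id = {row["slide_id"]: row for row in metadata}
    cohort, problems = [], []
    for image in images:
        slide_id = image["slide_id"]
        meta = meta_by_id.get(slide_id)
        if meta is None:
            problems.append((slide_id, "no metadata"))
            continue
        # Treat empty strings as missing values.
        missing = [f for f in required_fields if not meta.get(f)]
        if missing:
            problems.append((slide_id, "missing " + ", ".join(missing)))
            continue
        if all(meta.get(k) == v for k, v in criteria.items()):
            cohort.append({**image, **meta})
    # Metadata rows that refer to an image which was never ingested.
    orphaned = set(meta_by_id) - {img["slide_id"] for img in images}
    problems.extend((sid, "no image") for sid in sorted(orphaned))
    return cohort, problems


# Hypothetical records, as they might arrive from separate sources.
images = [
    {"slide_id": "S001", "format": "svs"},
    {"slide_id": "S002", "format": "ndpi"},
    {"slide_id": "S003", "format": "tiff"},
]
metadata = [
    {"slide_id": "S001", "antibody": "HER2", "source": "site_a"},
    {"slide_id": "S002", "antibody": "", "source": "site_b"},
    {"slide_id": "S004", "antibody": "HER2", "source": "site_a"},
]
cohort, problems = build_cohort(
    images,
    metadata,
    required_fields=["antibody", "source"],
    criteria={"antibody": "HER2"},
)
for slide_id, reason in problems:
    print(slide_id, reason)
```

Surfacing unmatched records explicitly, instead of silently dropping them, is the point of the sketch: with spreadsheets, these are the mismatches that tend to go unnoticed.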

Splitting the image sets for good machine learning practice


It is generally understood within the Machine Learning community that it is good practice to split the available data into three subsets, and this is referenced as part of Good Machine Learning Practice (GMLP) in the recent FDA discussion paper, fda.gov/media/122535/download. The first subset is used to train the algorithm(s). The second is not used within the training, and is used by the data scientists to validate and tune their model architecture and approaches. The third is not used at all during the training and validation steps and is used solely to test the algorithm performance on completely unseen data. To ensure complete independence of the unseen test set, the ability to restrict permissions to view the test set helps prevent inadvertent data ‘leakage’ from the test set into the training.

It is also necessary to ensure that the three sets of data (training, tuning and test) are drawn from the same distribution, so the ability to use metadata stored in the image management system to check this (insofar as it is possible) is a key feature.
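
A minimal sketch of such a split is shown below. It makes several assumptions that are not in the text above: the 60/20/20 fractions, the field names (`patient_id`, `grade`), and the choice to split at patient level so that slides from one patient never appear in more than one subset. Restricting access to the test set is a permissions matter for the image management system and is not shown.

```python
import random
from collections import defaultdict


def split_by_patient(slides, stratify_key, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign each patient to train, tune or test, stratified by metadata.

    All slides from one patient land in the same subset, so no patient
    contributes to both the training data and the unseen test set.
    """
    # Group patients by stratum (assumes one stratum value per patient).
    strata = defaultdict(set)
    for slide in slides:
        strata[slide[stratify_key]].add(slide["patient_id"])

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    assignment = {}
    for stratum in sorted(strata):
        patients = sorted(strata[stratum])
        rng.shuffle(patients)
        n_train = round(len(patients) * fractions[0])
        n_tune = round(len(patients) * fractions[1])
        for i, patient in enumerate(patients):
            if i < n_train:
                assignment[patient] = "train"
            elif i < n_train + n_tune:
                assignment[patient] = "tune"
            else:
                assignment[patient] = "test"

    subsets = {"train": [], "tune": [], "test": []}
    for slide in slides:
        subsets[assignment[slide["patient_id"]]].append(slide["slide_id"])
    return subsets, assignment


# Hypothetical cohort: 40 patients, two slides each, two tumour grades.
slides = [
    {
        "slide_id": f"P{p:02d}_S{s}",
        "patient_id": f"P{p:02d}",
        "grade": "high" if p % 2 else "low",
    }
    for p in range(40)
    for s in range(2)
]
subsets, assignment = split_by_patient(slides, stratify_key="grade")
print({name: len(ids) for name, ids in subsets.items()})
```

Shuffling within each stratum keeps the proportion of each grade roughly equal across the three subsets, which is one simple way of approximating "drawn from the same distribution" using stored metadata.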

Collaboration with Pathologists


Very often, pathologists are intimately involved in the validation of algorithms, and their scores and annotations are needed to provide the ‘ground truth’ against which the algorithm will be compared. The ability to collect pathologist scores and input, and to store these alongside the images, is invaluable (both for WSIs and TMAs) and allows the data to be analysed alongside quantitative results obtained from the algorithms. This analysis is not limited to correlation and other measures of technical performance. Because the clinico-pathological data is also available for the images, the algorithmic output can be compared to pathologist scores in subsequent survival/response analysis, paving the way for the use of algorithmic scores as an alternative to laborious and inconsistent manual pathology scoring for large-scale biomarker studies.
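
One of the correlation analyses mentioned above could look like the following sketch, which computes Spearman's rank correlation between ordinal pathologist scores and a continuous algorithm output. The choice of Spearman's coefficient and the example values are assumptions made for illustration; the appropriate statistic depends on the scoring scheme used in a given study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six slides: ordinal pathologist grades (0 to 3)
# and a continuous algorithm output (e.g. fraction of positive cells).
pathologist_scores = [0, 1, 1, 2, 3, 3]
algorithm_scores = [0.05, 0.30, 0.22, 0.55, 0.80, 0.91]
rho = spearman(pathologist_scores, algorithm_scores)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based measure is used here because manual scores are typically ordinal and tied, while algorithm outputs are continuous. Storing both against the same slide identifier is what makes this comparison straightforward.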

[1] Campanella, Gabriele, et al. "Clinical-grade computational pathology using weakly supervised deep learning on whole slide images." Nature Medicine 25.8 (2019): 1301-1309.