GITHAIGA, JOHN IRUNGU

Project Title
Machine Learning Approaches to Cancer Diagnostics in Human Body Fluids Utilizing Laser Raman Microspectrometry
Degree Name
DOCTOR OF PHILOSOPHY DEGREE IN PHYSICS
Project Summary

Near-infrared Raman spectroscopy is a vibrational spectroscopic technique capable of providing fingerprint-type information on biochemical molecules. Early detection of cancer requires highly sensitive and specific biomarkers, and biomarkers in biofluids are particularly useful because they can reflect the early presence of cancer in the body. The aim of this study was to test and evaluate novel machine learning techniques for detecting and identifying trace biomarker alterations in saliva and blood that point to the onset and progression of leukemia and breast cancer, using laser Raman spectral analysis. Measurements were made in the 500-1800 cm-1 region with a 785 nm excitation laser.

Trace biomarkers were studied by analysis of intermediate and higher-order principal components. The utility of these components in revealing subtle biochemical alterations (trace biomarkers) during cancer progression was first tested on the discrimination of prostate malignancy in a model biological tissue: metastatic androgen-insensitive (PC3) and immortalized normal (PNT1a) prostate cell lines. For prostate, breast, and leukemia malignancy, the statistical relevance of the principal components was determined using the two-sample t-test and effect size criteria.
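The screening of principal components by a two-sample t-test and an effect size can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the Welch form of the t statistic, Cohen's d as the effect size measure, the function names, and the score values are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (assumed variant) for two groups of PC scores."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def cohens_d(a, b):
    """Cohen's d (assumed effect size measure) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def rank_components(scores_a, scores_b):
    """Rank principal components by absolute effect size, largest first.

    scores_a, scores_b: one row per spectrum, one column per principal component.
    Returns a list of (pc_index, t_statistic, effect_size) tuples.
    """
    results = []
    # zip(*rows) transposes the rows so that each item is one component's scores.
    for idx, (a, b) in enumerate(zip(zip(*scores_a), zip(*scores_b))):
        results.append((idx, welch_t(a, b), cohens_d(a, b)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Hypothetical PC scores (two components) for three spectra per group.
malignant = [[0.1, 2.0], [0.2, 2.4], [0.0, 2.2]]
normal = [[0.1, 1.0], [0.3, 1.2], [0.1, 0.8]]
ranking = rank_components(malignant, normal)
```

In this toy data set the second component separates the two groups far more strongly than the first, so it is ranked first even though a variance-ordered PCA would list it later.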

For the breast cancer and leukemia studies, the concentrations of the altered trace biomarkers were estimated using a partial least squares regression model applied to the spectra of pure compounds and the biofluid spectra. Then, various optimized chemometric methods, including independent component analysis (ICA), multidimensional scaling (MDS), partial least squares discriminant analysis (PLS-DA), kernel density estimators, support vector machines (SVM), and backpropagation neural networks (BPNN), were used to analyze and classify the trace biomarker Raman spectra of blood and saliva from healthy and diseased subjects.

Results from pairwise comparison of mean intensities (peak intensity ratios) and multivariate statistical techniques showed that biochemical changes in protein, lipid, and nucleic acid components can be associated with prostate cancer, breast cancer, and leukemia progression. Four prominent regions (566 ± 0.70 cm-1, 630 cm-1, 1370 ± 0.86 cm-1, 1618 ± 1.73 cm-1) and six subtle regions (1076 cm-1, (1232, 1234 cm-1), (1276, 1278 cm-1), (1330, 1333 cm-1), (1434, 1442 cm-1), (1471, 1479 cm-1)) were identified, which can be regarded as useful biomarkers for prostate cancer diagnosis. Similarly, six spectral regions were determined (589 cm-1, 594 cm-1, 630 cm-1, 1626 cm-1, 1630 cm-1, and 1638 cm-1), which can be regarded as new biomarkers of breast cancer in blood-based breast cancer spectroscopy. The fitting model revealed that proteins, nucleic acids, and lipids in blood and saliva increased with breast malignancy, whereas the amount of glycogen decreased as breast malignancy progressed. For the leukemia data, PLS regression quantitative analysis in the fingerprint (500-1800 cm-1) region revealed that proteins and nucleic acids in leukemia patients increased with malignancy. In contrast, quantitative analysis based on the selected trace biomarker regions suggested that proteins and membranous lipids increased with leukemia malignancy, whereas nucleic acids, glycogen, and non-membranous lipids decreased.

The cross-validated models used to analyze and classify the blood and saliva Raman spectra from healthy subjects, breast tumor patients, and leukemia patients yielded diagnostic sensitivities of 46% to 100% and specificities of 71% to 100%. ICA-MDS followed by PLS-DA and ICA-MDS followed by kernel density estimators proved to be powerful diagnostic algorithms for breast cancer detection using blood and saliva, respectively, yielding diagnostic sensitivities and specificities of more than 95%. The RBF-SVM diagnostic model performed better than the linear SVM in leukemia and breast cancer diagnosis, yielding a sensitivity of up to 92%.

This difference in performance was attributed to the nonparametric capability of RBF kernel functions in handling complex spectroscopic data. The BPNN diagnostic model performed better than the linear-SVM and RBF-SVM diagnostic models in diagnosing breast cancer, potentially due to the ability of BPNN to converge on a global minimum, which allows better tolerance to noise in non-linear datasets. Although the RBF-SVM model performed better than the linear-SVM and BPNN models in diagnosing leukemia, the saliva spectra yielded poor diagnostic performance in terms of sensitivity. This could be due to the inherently small Raman scattering cross-section and the strong background fluorescence interference in saliva samples, which most likely made the Raman technique insufficiently sensitive to detect the subtle biochemical changes in human saliva. Although the number of samples involved in this study was small, the results demonstrate that analysis of Raman spectra of blood and saliva using optimized chemometric diagnostic algorithms has great potential for the noninvasive and label-free detection of breast cancer and leukemia.
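For reference, diagnostic sensitivity and specificity follow the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below computes both from true and predicted class labels. The function name and the label vectors are hypothetical and are not taken from the study's data.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary diagnostic classifier.

    y_true: actual labels; y_pred: predicted labels; positive: the "diseased" label.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical held-out labels: 1 = diseased, 0 = healthy.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
```

In a cross-validated setting, the predicted labels would be those obtained on the held-out folds rather than on the training spectra.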