Summary

In this project presented by Novartis during their 2019 Academia to Industry Quantitative Sciences Hackathon, I

Problem

Can we find the identify proteins that are most predictive of weight loss from SOMAscan assay in obese monkeys at a certain dose?

Data description and pre-processing

The data we were presented were protein abundance measurements(rfu2) of 5080 proteins from SOMAscan assay generated simultaneously. These measurements were from 33 animals from 3 different studies injected with 8 different doses at 4 timepoints(day 0,1 ,4 and 7). Compound blood concentration at these different timepoints was also made available. We note that some of these monkeys were reused for these different studies.

For the purpose of our analysis, we assume that the animals reused for different studies were independent since they were one year apart. Since, we’re interested in looking at dose-response relationship, our first task was to identify an appropriate dose to conduct our analysis.

The idea was to pick a dose where we has sufficient sample size and where log pk drug concentration wasn’t to an extent where it could result in over-expression of protein. From Figure 1, we can see that dose 0.1 has the maximum number of animals(12) so we have sufficient sample size and it results in reasonable drug concentration in blood.

percentage body weight change was recoded into factors where they were categorized as significant body weight loss(< -1) and little or no body weight loss(>= -1) in order to perform below described classification problem. All the rfu2 values were log transformed. Protein measurements at timepoint 0 was excluded because we wanted to study the effects after the dose had been administered.

Figure 1

Figure 1

Approach

Since we have small sample and thousands of protein abundance data, the selection of those proteins that are important for distinguishing the different sample classes being compared, poses a challenging problem in high dimensional data analysis.

Below are some properties I based my model selection on:

Figure 2

Figure 2

SVM-RFE[1] is recognized as one of the most effective filtering methods for genes in the bioinfomatics world. It is based on an algorithm that finds the best possible combination for classification. The drawback of it, however, is that it does it without considering the differentially significant features between the classes. Features with notable classification accuracy and also differentially significant between the classes have a predominant role in a biological aspect. Hence, to overcome this limitation, in the proposed algorithm[2], t-statistics is included along with SVM-RFE.

SVM-T-RFE algorithm

  1. The coefficient and the expression mean difference between two compared groups of each individual feature is calculated using welch’s t-test which is calculated as

  2. It is incorporated into the weight value of that particular feature by where \(t_i\) is the t-statistics between the two comapred groups and \(w_i\) is the SVM-RFE weight vector defined by Guyon et el. [7] and θ is the parameter determining the tradeoff between SVM weight score and T-statistic, and θ is the value between 0 and 1.

  3. The iteration process is followed by backward removal of the feature.

  4. The iteration process is continued until there is no feature remaining in the dataset.

  5. The smallest ranking weight will be removed by the algorithm and stored in a stack, while remaining the feature variables have a significant impact for producing weight.

  6. Finally, the feature variables which are stored in the stack will be listed in the descending order of descriptive different degree.

To gain a deeper understanding of how this algorithm works, please see the article by Xiaobo et al.[4]

Modelling

In order to compare the performance of both SVM-T-RFE and SVM-RFE in classification, a histogram of p values of the selected top 100 proteins is plotted. The p values in the left histogram were calculated using t-statistic. We can see that for the model with t-statistics, all the p-values for the top 100 proteins is less than the generally accepted 0.05 while the range is much larger for the model with SVM-RFE. This is further illustrated by the boxplot below with p-value of 0.3 as the median for the second model. (done using sigFeature package in R)

Figure 3

Figure 3

Figure 4

Figure 4

Validation

More work needs to be done in this regard since validating CV error using k-fold is computationally expensive.

However, to get a sense of model performance, the data was split into 80% training and 20% testing and the parameters from Table 1 were calculated with top 1000 proteins from each of these models.

Table 1

Table 1

We can see that SVM-RFE with t-statistics does a far better job of classifying weight loss than just SVM-RFE.

List of Top 20 Proteins

Below are the list of top proteins from SVM-RFE and SVM-T-RFE. Interestingly, we notice that the target 1620 shows up as a top protein from the SVM-RFE model. This needs to be further investigated in this future.

Top 20 Proteins from SVM-RFE

Top 20 Proteins from SVM-RFE

Top 20 Proteins from SVM-T-RFE

Top 20 Proteins from SVM-T-RFE

Conclusion and Future Directions:

High-demsional reduction is biomarker discovery is a hard problem. SVM-RFE and it’s modified method SVM-T-RFE seem promising to identify suitable biomakers with reasonable predictive accuracy. Since project was done to satisfy a curiosity to test out an approach usually used in gene expression data with SOMAscan protein expressions, I make no claims about the robustness of the identified proteins in playing a role in weight loss. However, it has brought certain proteins and dose to attention which can be further studied.

For the future steps,

References:

  1. Kai-Bo Duan, J. C. Rajapakse, Haiying Wang and F. Azuaje, “Multiple SVM-RFE for gene selection in cancer classification with expression data,” in IEEE Transactions on NanoBioscience, vol. 4, no. 3, pp. 228-234, Sept. 2005.
  2. Das, P., Roychoudhury, S., & Tripathy, S. (2017). sigFeature: An R-package for significant feature selection using SVM-RFE & t-statistic. Canadian Journal of Biotechnology, 1, 35. doi:http://dx.doi.org/10.24870/cjb.2017-a22
  3. Zheng Rong Yang, Biological applications of support vector machines, Briefings in Bioinformatics, Volume 5, Issue 4, December 2004, Pages 328–338, https://doi.org/10.1093/bib/5.4.328
  4. Xiaobo Li, Sihua Peng, Jian Chen, Bingjian Lü, Honghe Zhang, Maode Lai, SVM–T-RFE: A novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles, Biochemical and Biophysical Research Communications
  5. Yousef M, Jung S, Showe LC, Showe MK. Recursive cluster elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics. 2007;8:144. Published 2007 May 2. doi:10.1186/1471-2105-8-144
  6. Lin, X.; Li, C.; Zhang, Y.; Su, B.; Fan, M.; Wei, H. Selecting Feature Subsets Based on SVM-RFE and the Overlapping Ratio with Applications in Bioinformatics. Molecules 2018, 23, 52.
  7. I. Guyon, J. Weston, S. Barnhill, V. Vapnik Gene selection for cancer classification using support vector machines Machine Learning, 46 (2002), pp. 389-422