CX Matrix Decompositions for Tumour Classifications
===================================================

In this example, we replicate one of the experiments from the paper `CUR matrix decompositions for improved data analysis `__, which uses data from `Nielsen et al. (2002) `__. The dataset contains gene expression levels for 3,935 genes from 31 different tumours, with three cancer subtypes represented (labelled GIST, LEIO and SARC in the column headers). The question we want to answer is: can we determine the type of tumour from just a handful of the roughly 4,000 genes? We are going to do this by picking genes that have high *leverage scores*.

The first step is to import the data:

.. code:: ipython3

    import pandas
    from spalor.models import CUR
    from spalor.datasets import Nielsen2002

    # Load the expression table: one row per gene, one column per tumour.
    gex = Nielsen2002()
    gex.head()

.. parsed-literal::

                 GIST     GIST     GIST     GIST     GIST     GIST     GIST     GIST     GIST     GIST      ...     LEIO     SARC     SARC     SARC     SARC     SARC     SARC     SARC     SARC     SARC
    Gene
    TACSTD2   -1.3650  -0.7588  0.33435   1.7160  0.18766   0.1467   0.3831   0.8449  -0.7469   0.9075      ...  -0.2423  -1.9880   1.6110  -0.9822  -2.3360  -0.7156  -0.6364   1.8910  -0.4032  -0.3697
    GJB2      -0.0950   0.3063  0.63040   0.7806  0.81530  -0.9518  -0.7240  -1.0940  -0.4872  -0.6808      ...  -1.5760   0.0433   0.4723  -1.2890  -1.7290  -0.9109  -0.6991  -0.5254  -0.1763  -0.1103
    CUGBP2    -0.6385  -0.2870 -0.17250  -0.5951  0.17030   0.6095  -0.1460   0.4343  -0.8280  -0.3281      ...   0.1620  -0.0807   0.2439  -3.5830  -0.0795   0.8805   1.6600   2.0190  -0.2785  -0.2276
    KIAA0080  -0.5501   1.0980  1.11400   1.0330 -0.34850   0.0632  -0.7378   0.0826   0.6216  -1.3870      ...   0.9759   1.2240  -0.6170  -3.1070   0.6073   0.7063  -1.1070   0.5016  -0.0544  -0.7320
    CED-6     -0.4295  -3.2950 -2.00600   0.5949  0.48850  -1.3600  -0.5136  -1.5670   1.5310   0.1229      ...  -0.8084   0.2960  -0.8529  -1.9260  -0.5620   0.6970   0.8229   2.1340   2.0010   1.5360

    5 rows × 31 columns

.. code:: ipython3

    genes = gex.index.to_numpy()
    cancer_type = gex.columns
    # Transpose so that rows are tumours and columns are genes.
    data = gex.to_numpy().T

There's a function in SpaLoR for calculating the leverage scores. It requires a rank (the ``k`` argument below), but this is not the same as the number of columns we hope to sample. The leverage score of a column measures how important that column is when we construct a rank *r* approximation of the matrix. We calculate and plot the scores here:

.. code:: ipython3

    from spalor.matrix_tools import leverage_score
    from matplotlib import pyplot as plt

    # Rank-3 leverage score of every gene (column of ``data``).
    ls = leverage_score(data, k=3, axis=1)
    plt.plot(ls, 'o')
    plt.show()

.. image:: interpretable_low_rank_models_for_tumour_classification_files/interpretable_low_rank_models_for_tumour_classification_4_0.png

A good way to read this plot is that the genes whose leverage score is much larger than average are the ones that contain the most information. When we fit our data to a CX model from SpaLoR, it randomly samples genes with a probability proportional to the leverage score squared.

.. code:: ipython3

    import numpy as np
    import pandas as pd
    from spalor.models.cx import CX

    cx = CX(n_components=30, method="exact")
    C = cx.fit_transform(data)
    # Label the sampled columns with their gene names and the rows with the tumour type.
    C = pd.DataFrame(C, columns=genes[cx.cols], index=cancer_type)
    print("genes selected: ", genes[cx.cols])

.. parsed-literal::

    genes selected: ['ANXA1' 'IGKC' 'FLJ20898' 'CSF2RB' 'RNF24' 'IGKC' 'C20ORF1' 'ZFHX1B'
     'RPS27' 'CD24' 'PCOLCE' 'DUSP6' 'EPS8' 'SSBP2' 'CEP2' 'GFRA2' 'FLJ20701'
     'KIAA0008' 'KIAA0300' 'FLJ14054' 'COPEB' 'IGF2' 'TYROBP' 'IMPA2' 'RAB39'
     'OSF-2' 'APLP2' nan 'EIF2B3' 'EDN3']

Here is the same plot as before with the selected genes highlighted in red. Most of them have a high leverage score, but some do not.

.. code:: ipython3

    # Plot all rank-3 leverage scores, then overlay the sampled genes in red.
    plt.plot(ls, 'o')
    plt.plot(cx.cols, ls[cx.cols], 'or')
    plt.show()

.. image:: interpretable_low_rank_models_for_tumour_classification_files/interpretable_low_rank_models_for_tumour_classification_8_0.png

A clustermap of the genes shows that the limited gene set can separate the three different types of cancer.

.. code:: ipython3

    import seaborn as sns

    # Rows are the selected genes, columns are tumours; z-score each column.
    sns.clustermap(C.T, col_cluster=True, z_score=1)

.. image:: interpretable_low_rank_models_for_tumour_classification_files/interpretable_low_rank_models_for_tumour_classification_10_1.png
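
For readers who want to see what a leverage score is without going through SpaLoR, here is a minimal NumPy sketch. It uses the usual textbook definition (the squared column norms of the top *k* right singular vectors) and is not SpaLoR's implementation. The helper names ``column_leverage_scores`` and ``sample_columns`` are hypothetical, the random matrix only stands in for the tumour data, and the sampling here is proportional to the score itself, which may differ from what ``CX`` does internally.

.. code:: python

    import numpy as np


    def column_leverage_scores(A, k):
        """Rank-k leverage score of each column of A (hypothetical helper)."""
        # The rows of Vt are the right singular vectors of A.
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        # Squared column norms of the top-k right singular vectors.
        return np.sum(Vt[:k, :] ** 2, axis=0)


    def sample_columns(scores, n_samples, rng):
        """Draw column indices with probability proportional to the scores."""
        p = scores / scores.sum()
        return rng.choice(scores.size, size=n_samples, replace=True, p=p)


    rng = np.random.default_rng(0)
    # Synthetic stand-in for the tumour data: 31 samples by 200 "genes".
    A = rng.standard_normal((31, 200))

    scores = column_leverage_scores(A, k=3)
    cols = sample_columns(scores, n_samples=10, rng=rng)

    print(scores.sum())  # rank-k leverage scores sum to k, up to rounding
    print(cols)          # indices of the sampled columns

Because this sketch samples with replacement, the same column index can be drawn more than once. Each score lies between 0 and 1, so a score far above the average ``k / n_columns`` marks a column that the rank *k* approximation depends on heavily.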