Predicting phenotypic states from single cell analysis

15 minute read

Single-cell analysis has revolutionized biological studies by providing greater resolution to tissue profiling and allowing for rare cell characterization. By far, the most popular application of single-cell analysis is transcriptional profiling of cells; that is, determining what cellular mRNA is being expressed. I’ve been fortunate to have the opportunity to contribute to this field from my academic and industry work.

For some single-cell approaches, it is possible to link phenotypic information about the cell with the thousands of genes that a cell can express. For example, a certain phenotype could be indicated by expression of a known marker protein. The protein’s expression can be assessed by flow sorting or cellular imaging with fluorescence microscopy upstream of processing for single-cell transcriptome expression. However, it can be labor-intensive or not always possible to phenotype the cell, especially as current experiments scale to thousands of single-cell samples.

In this project, I am interested in exploring the possibility of using machine learning to train an algorithm towards identifying cellular phenotype based on gene expression alone. I’ll use one set of data that we analyzed in my academic work. In our study, we looked in the mouse brain and characterized neurons activated by a novel experience. A commonly used phenotype of an activated neuron is whether it expresses the gene Fos (otherwise known as c-Fos).

This dataset begins with 96 samples of which 48 were Fos+ and 48 were Fos-, but as you’ll see, filtering steps were applied to reduce the number of samples. The number of genes starts off with over 43,000(!) but as you see, this also gets reduced with filtering. I then use the unsupervised algorithm of PCA to see whether activated and non-activated subpopulations separate following dimensional reduction. Finally, I use logistic regression to train a subset of neurons for classification based on Fos state, then assess the accuracy of the model on a cross-validation subset.

#Import packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import data

# Import gene expression data frame
parent_path = '/Users/lacar/Documents/Gage lab/Gage Lab Dropbox/CellIsolation_MethodsPaper/PaperAndFigs/FromOneDrive_finalDocuments/Lab/AnalysisOfSequencing'
child_path = '/SaraRSEM_output/update15_0225/'
os.chdir(parent_path + child_path)
df_tpm = pd.read_table('RSEM_geneSymbol_tpm_141204_allsamples.txt', sep=' ').transpose()
# The genes are the columns (features) of the dataframe,
# while each single-cell sample are the row indexes.
# The values are the expression level of that gene for that sample.
df_tpm.head()
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012D04Rik ... Zwilch Zwint Zxda Zxdb Zxdc Zyg11a Zyg11b Zyx Zzef1 Zzz3
nc_ui_ti_C10_141204 6.54 652.15 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 22.39 0.0 0.0 0.00 0.0 0.00 0.0 22.03 0.00
nc_ui_ti_C11_141204 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.0 69.07 0.00
nc_ui_ti_C12_141204 4.98 29.90 0.0 0.0 0.0 0.0 0.0 173.27 0.0 0.0 ... 0.0 2905.22 0.0 0.0 0.00 0.0 0.00 0.0 0.00 5.87
nc_ui_ti_C7_141204 4.45 56.36 0.0 0.0 0.0 0.0 0.0 0.00 2.3 0.0 ... 0.0 156.51 0.0 0.0 2.91 0.0 200.94 0.0 62.27 0.00
nc_ui_ti_C8_141204 0.00 86.04 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.0 2.99 0.00

5 rows × 43309 columns

Data filtering

Apply a gene filter

# Filter the dataframe by removing genes that are not expressed in any cell
genes_retained_mask = df_tpm.sum(axis=0)>0
df_tpm2 = df_tpm[genes_retained_mask.index[genes_retained_mask]]
df_tpm2.head()
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012G03Rik ... Zw10 Zwilch Zwint Zxda Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3
nc_ui_ti_C10_141204 6.54 652.15 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 0.0 22.39 0.0 0.0 0.00 0.00 0.0 22.03 0.00
nc_ui_ti_C11_141204 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.0 69.07 0.00
nc_ui_ti_C12_141204 4.98 29.90 0.0 0.0 0.0 0.0 0.0 173.27 0.0 0.0 ... 0.0 0.0 2905.22 0.0 0.0 0.00 0.00 0.0 0.00 5.87
nc_ui_ti_C7_141204 4.45 56.36 0.0 0.0 0.0 0.0 0.0 0.00 2.3 0.0 ... 0.0 0.0 156.51 0.0 0.0 2.91 200.94 0.0 62.27 0.00
nc_ui_ti_C8_141204 0.00 86.04 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.0 0.0 0.00 0.00 0.0 2.99 0.00

5 rows × 24345 columns

Apply a sample filter

Start by exploring the distribution of gene counts for each sample

# Filter the dataframe by removing samples with low gene count
df_tpm2[df_tpm2 > 1].count(axis=1).hist();

png

Removing samples that have less than 3000 genes seems like a logical cutoff.

# Filter the dataframe by removing samples with low gene count (3000 genes)
samples_retained_mask = df_tpm2[df_tpm2 > 1].count(axis=1) > 3000
df_tpm3 = df_tpm2[samples_retained_mask]
# Sanity check that appropriate filter is implemented
df_tpm3[df_tpm3 > 1].count(axis=1).hist();

png

# Create a vector containing the fos label
df_tpm3['SampleName'] = df_tpm3.index  # temporarily create a sample name vector
df_tpm3.loc[:,'fos_label'] = df_tpm3['SampleName'].str.split('_', expand=True)[2]
df_tpm3.drop('SampleName', axis=1, inplace=True)
/Users/lacar/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:357: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
/Users/lacar/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df_tpm3.head()
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012G03Rik ... Zwilch Zwint Zxda Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3 fos_label
nc_ui_ti_C10_141204 6.54 652.15 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.00 22.39 0.0 0.0 0.00 0.00 0.00 22.03 0.00 ti
nc_ui_ti_C12_141204 4.98 29.90 0.0 0.0 0.0 0.00 0.0 173.27 0.0 0.0 ... 0.00 2905.22 0.0 0.0 0.00 0.00 0.00 0.00 5.87 ti
nc_ui_ti_C7_141204 4.45 56.36 0.0 0.0 0.0 0.00 0.0 0.00 2.3 0.0 ... 0.00 156.51 0.0 0.0 2.91 200.94 0.00 62.27 0.00 ti
nc_ui_ti_C8_141204 0.00 86.04 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.00 0.00 0.0 0.0 0.00 0.00 0.00 2.99 0.00 ti
nc_ui_ti_C9_141204 10.13 5.22 0.0 0.0 0.0 2.26 0.0 0.00 0.0 0.0 ... 5.87 1045.78 0.0 0.0 0.00 2.41 421.23 113.95 0.00 ti

5 rows × 24346 columns

# Replace labels with 0 and 1 for classification
df_tpm3['fos_label'].replace(to_replace='ti', value=1, inplace=True)
df_tpm3['fos_label'].replace(to_replace='tn', value=0, inplace=True)
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/generic.py:4619: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
# Show the number of examples for each group. Note that 1 = FOS+, 0 = FOS-.
df_tpm3.groupby('fos_label')['fos_label'].count()
fos_label
0    43
1    39
Name: fos_label, dtype: int64

PCA visualization

Visualize how the groups would separate without using the label.

X = df_tpm3.loc[:,:'Zzz3']
# Run PCA, ignoring the label
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)
# X1 = pca.fit(X).transform(X)
# Add PCA vectors to dataframe
df_tpm3.loc[:,'PCAx'] = X[:,0]
df_tpm3.loc[:,'PCAy'] = X[:,1]
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:357: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
df_tpm3.head()
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012G03Rik ... Zxda Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3 fos_label PCAx PCAy
nc_ui_ti_C10_141204 6.54 652.15 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 22.03 0.00 1 55265.247009 -2057.597830
nc_ui_ti_C12_141204 4.98 29.90 0.0 0.0 0.0 0.00 0.0 173.27 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 0.00 5.87 1 25373.311476 -2100.477988
nc_ui_ti_C7_141204 4.45 56.36 0.0 0.0 0.0 0.00 0.0 0.00 2.3 0.0 ... 0.0 0.0 2.91 200.94 0.00 62.27 0.00 1 66632.262116 -1961.555875
nc_ui_ti_C8_141204 0.00 86.04 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 2.99 0.00 1 83235.237261 -1817.245181
nc_ui_ti_C9_141204 10.13 5.22 0.0 0.0 0.0 2.26 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 2.41 421.23 113.95 0.00 1 54599.738613 -681.069431

5 rows × 24348 columns

# PCA, colored by ethnicities
f, ax1 = plt.subplots(1,1)
cmap = 'bwr'

points = ax1.scatter(df_tpm3['PCAx'], df_tpm3['PCAy'], c=df_tpm3['fos_label'], s=20, cmap=cmap)

# plot formatting
ax1.set(xlabel='')
ax1.set_ylabel('')
ax1.set_title('neurons clustered by FOS state', size=14)
#ax1.set_ylim(-4000,0)
#ax1.legend(loc=0, title='FOS state')
    
# figure properties
sns.set(font='Franklin Gothic Book')
ax1.tick_params(axis='both', which='both',length=0)   # remove tick marks
sns.set_style(style='white')
sns.despine(left=True, bottom=True, right=True)   #  remove frame

png

Yikes, this is not the prettiest plot. I spent time figuring out if I had done something wrong, even going back to Andrew Ng’s lecture on PCA. I eventually figured out that I forgot to transform the feature values so that they’re not spread out across so many different scales. Log2 normalization is common in the single-cell field. It’s also clear that there are some outliers here. I’ll take care of both of these modifications in the next bit of code.

# Filter the samples that are obvious outliers
df_tpm4 = df_tpm3[df_tpm3['PCAy'] < 25000]
# Use new dataframe and log transform so features are not spread across different scales
X = df_tpm4.loc[:,:'Zzz3']
X = np.log2(X+1)  # the +1 is added to deal with the log transformation
# Run PCA again
pca = PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)
# Add new PCA vectors to dataframe
df_tpm4.loc[:,'PCAx'] = X[:,0]
df_tpm4.loc[:,'PCAy'] = X[:,1]
/Users/lacar/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
df_tpm4.shape
(80, 24348)
# PCA, colored by FOS state

f, ax1 = plt.subplots(1,1)
cmap = 'bwr'

points = ax1.scatter(df_tpm4['PCAx'], df_tpm4['PCAy'], c=df_tpm4['fos_label'], s=20, cmap=cmap)

# plot formatting
ax1.set(xlabel='')
ax1.set_ylabel('')
ax1.set_title('neurons, FOS+ (red), FOS- (blue)', size=14)

# figure properties
sns.set(font='Franklin Gothic Book')
ax1.tick_params(axis='both', which='both',length=0)   # remove tick marks
sns.set_style(style='white')
sns.despine(left=True, bottom=True, right=True)   #  remove frame

png

There we go. With the exception of a few FOS+ neurons, the groups separate quite nicely as we showed in the paper. The FOS+ neurons that are amongst the FOS- neurons are discussed further in our paper (we call them pseudo-FOS+ neurons), but for the purposes of this project, I will remove them from further analysis.

# Filtered dataframe by removing pseudo-FOS+ neurons.
mask = (df_tpm4['fos_label']==1) & (df_tpm4['PCAy'] < 0)
df_tpm5 = df_tpm4[~mask]   # use ~ to get the inverse of what the mask is getting
# Group sizes moving forward
df_tpm5.groupby('fos_label')['fos_label'].count()
fos_label
0    43
1    30
Name: fos_label, dtype: int64
df_tpm5.shape
(73, 24348)
df_tpm5.head()
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012G03Rik ... Zxda Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3 fos_label PCAx PCAy
nc_ui_ti_C10_141204 6.54 652.15 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 22.03 0.00 1 -98.442336 35.550913
nc_ui_ti_C12_141204 4.98 29.90 0.0 0.0 0.0 0.00 0.0 173.27 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 0.00 5.87 1 -89.691450 58.955329
nc_ui_ti_C7_141204 4.45 56.36 0.0 0.0 0.0 0.00 0.0 0.00 2.3 0.0 ... 0.0 0.0 2.91 200.94 0.00 62.27 0.00 1 -73.873524 53.025737
nc_ui_ti_C8_141204 0.00 86.04 0.0 0.0 0.0 0.00 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 2.99 0.00 1 -89.653363 43.937353
nc_ui_ti_C9_141204 10.13 5.22 0.0 0.0 0.0 2.26 0.0 0.00 0.0 0.0 ... 0.0 0.0 0.00 2.41 421.23 113.95 0.00 1 -77.416201 33.667595

5 rows × 24348 columns

Supervised machine learning

What machine learning algorithm should I apply? This is a classification problem (FOS+ vs. FOS-) but this is also a situation where the number of m examples (73) is much, much smaller than the number of n features (24345).

I reviewed my notes from the Machine Learning course I completed and saw that logistic regression or support vector machines without a kernel (“linear”) would be good options. Andrew Ng talks about this here.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score
/Users/lacar/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
# I didn't carry over the log2 normalization, so I'll do it again here.
# Using X and y to refer to my feature set and label set, respectively

X = np.log2(df_tpm5.loc[:,:'Zzz3']+1)   # a matrix
y = df_tpm5['fos_label']  # a vector

Using all samples for training and testing

This is a first pass at trying the data on this model.

# Followed DataSchool example found online
# (http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976)

# Instantiate a logistic regression model, and fit with X and y 
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
1.0

Perfect accuracy, which is awesome, but not entirely surprising for two reasons. We used all of the samples for both training and testing and we saw that PCA split these two groups pretty cleanly. Nevertheless, let’s see how the model coefficients look.

# Examining the model coefficients
df_gene_model_coef = pd.DataFrame()
df_gene_model_coef['gene'] = X.columns
df_gene_model_coef['coef'] = np.transpose(model.coef_)
df_gene_model_coef.sort_values(by='coef', ascending=False).head(5)
gene coef
2020 Arc 0.012490
17392 Plk2 0.011933
1875 Ankrd33b 0.011126
16864 Pcdh8 0.010876
13538 Ifrd1 0.010867

While the coefficients are not very high, the top gene is Arc, a immediate early gene known for being involved in activity. This is encouraging for our model since it matches what we predict and also what we’ve shown through differential expression, which is shown in the paper.

Evaluation by splitting into training and validation sets

# Split into train and test sets. I'll use a ratio of 60/40 of train/test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Sanity check by evaluating train and test samples and shape

X_train.head(10)
0610005C13Rik 0610007P14Rik 0610009B22Rik 0610009E02Rik 0610009L18Rik 0610009O20Rik 0610010F05Rik 0610010K14Rik 0610011F06Rik 0610012G03Rik ... Zw10 Zwilch Zwint Zxda Zxdb Zxdc Zyg11b Zyx Zzef1 Zzz3
nc_ux_tn_A6_141204 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 9.227905 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.0 0.0 6.257199 0.000000 0.000000 0.000000 6.616475
nc_ui_ti_C8_141204 0.000000 6.443607 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 1.996389 0.000000
nm_ux_ti_F12_141204 3.693766 4.218781 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 10.052269 0.0 0.0 0.000000 1.847997 0.000000 3.344828 0.000000
nc_ui_tn_C1_141204 0.000000 9.720073 0.0 0.0 0.000000 0.000000 0.000000 0.000000 3.705978 0.000000 ... 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 7.086189 3.379898 0.000000
nc_ux_tn_B1_141204 0.000000 8.137811 0.0 0.0 9.276031 6.100978 0.000000 0.000000 0.000000 0.000000 ... 3.906891 0.000000 0.000000 0.0 0.0 0.000000 5.612058 0.000000 7.188440 0.000000
nm_ux_ti_F9_141204 1.169925 3.865919 0.0 0.0 0.000000 6.123708 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 8.506367 0.0 0.0 0.000000 4.922198 0.992768 4.292782 0.000000
nm_ui_tn_H5_141204 3.580145 9.489045 0.0 0.0 0.000000 0.000000 0.000000 0.000000 2.070389 0.000000 ... 0.000000 0.000000 9.886215 0.0 0.0 0.000000 0.000000 0.000000 6.649759 0.000000
nm_ux_tn_E1_141204 2.160275 5.434962 0.0 0.0 0.000000 0.000000 4.426936 0.000000 3.240314 8.639702 ... 6.805163 0.000000 7.471756 0.0 0.0 0.000000 2.912650 0.000000 5.597829 7.555816
nc_ux_tn_B2_141204 0.000000 5.245648 0.0 0.0 0.000000 5.044831 0.000000 0.000000 1.985500 0.000000 ... 0.000000 2.446256 9.375735 0.0 0.0 6.007196 5.561937 3.910733 3.897240 7.298017
nc_ui_tn_C6_141204 0.000000 5.305606 0.0 0.0 0.000000 3.711495 0.000000 0.000000 0.000000 0.000000 ... 0.000000 11.446024 3.165108 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000

10 rows × 24345 columns

print('training set shape: ', X_train.shape, 'test set shape: ', X_test.shape)
training set shape:  (43, 24345) test set shape:  (30, 24345)

Run predictions!

# Predict fos labels for the test set
predicted = model2.predict(X_test)
print(predicted)
[0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 0 0 0]
# generate class probabilities
probs = model2.predict_proba(X_test)
probs
array([[  9.79129870e-01,   2.08701297e-02],
       [  9.27710561e-04,   9.99072289e-01],
       [  2.52824449e-03,   9.97471756e-01],
       [  9.90136982e-01,   9.86301831e-03],
       [  9.94536631e-01,   5.46336891e-03],
       [  9.96892526e-01,   3.10747429e-03],
       [  9.96896054e-01,   3.10394595e-03],
       [  9.93028840e-01,   6.97116028e-03],
       [  9.98386314e-01,   1.61368561e-03],
       [  8.44309409e-01,   1.55690591e-01],
       [  4.00968687e-03,   9.95990313e-01],
       [  6.90008585e-03,   9.93099914e-01],
       [  7.28804895e-03,   9.92711951e-01],
       [  1.54732724e-03,   9.98452673e-01],
       [  9.90972984e-01,   9.02701592e-03],
       [  9.96487663e-01,   3.51233658e-03],
       [  9.97721457e-01,   2.27854258e-03],
       [  9.95794236e-01,   4.20576436e-03],
       [  7.64663538e-04,   9.99235336e-01],
       [  9.35144305e-01,   6.48556948e-02],
       [  3.57385183e-02,   9.64261482e-01],
       [  7.93948287e-01,   2.06051713e-01],
       [  9.79677545e-01,   2.03224554e-02],
       [  2.55274797e-02,   9.74472520e-01],
       [  6.76726234e-03,   9.93232738e-01],
       [  9.96743892e-01,   3.25610777e-03],
       [  7.75563327e-03,   9.92244367e-01],
       [  9.97680715e-01,   2.31928534e-03],
       [  9.92873225e-01,   7.12677514e-03],
       [  9.83361760e-01,   1.66382404e-02]])
# generate evaluation metrics
print('Accuracy score: ', metrics.accuracy_score(y_test, predicted))
print('ROC AUC: ', metrics.roc_auc_score(y_test, probs[:, 1]))
Accuracy score:  0.966666666667
ROC AUC:  0.99537037037
metrics.confusion_matrix(y_test, predicted, labels=[1, 0])
array([[11,  1],
       [ 0, 18]])
print('Confusion matrix: \n',
      'TP', 'FN\n', 'FP', 'TN\n',
      metrics.confusion_matrix(y_test, predicted, labels=[1, 0]))
Confusion matrix: 
 TP FN
 FP TN
 [[11  1]
 [ 0 18]]
print(metrics.classification_report(y_test, predicted))
             precision    recall  f1-score   support

          0       0.95      1.00      0.97        18
          1       1.00      0.92      0.96        12

avg / total       0.97      0.97      0.97        30

Therefore, all 18 FOS- neurons were predicted correctly while 11 of the 12 FOS+ neurons were predicted correctly. You can see the actual sample that was missed in the table below.

pd.DataFrame({'actual': y_test, 'predicted': predicted})
actual predicted
nm_ui_tn_H3_141204 0 0
nm_ux_ti_E10_141204 1 1
nc_ux_ti_A7_141204 1 1
nm_ui_tn_G1_141204 0 0
nc_ux_tn_A4_141204 0 0
nc_ux_tn_A2_141204 0 0
nm_ux_tn_F4_141204 0 0
nm_ui_tn_H4_141204 0 0
nc_ux_tn_B4_141204 0 0
nc_ui_ti_D7_141204 1 0
nm_ui_ti_H7_141204 1 1
nm_ux_ti_E7_141204 1 1
nm_ui_ti_H10_141204 1 1
nm_ux_ti_F8_141204 1 1
nc_ux_tn_B3_141204 0 0
nm_ux_tn_E4_141204 0 0
nc_ux_tn_A3_141204 0 0
nm_ux_tn_F5_141204 0 0
nm_ux_ti_E8_141204 1 1
nm_ui_tn_H6_141204 0 0
nc_ui_ti_D12_141204 1 1
nm_ui_tn_H1_141204 0 0
nm_ux_tn_E6_141204 0 0
nc_ui_ti_C9_141204 1 1
nm_ux_ti_E12_141204 1 1
nm_ux_tn_E2_141204 0 0
nc_ui_ti_C7_141204 1 1
nm_ux_tn_F6_141204 0 0
nm_ui_tn_G3_141204 0 0
nc_ui_tn_C2_141204 0 0

Summary

The goal of this classification project was to determine if we could train an algorithm to predict cellular phenotype based on gene expression. Using a previously published dataset based on activated neurons, I establish proof-of-principle that this is possible. Application of logistic regression leads to an accuracy score of 96.7%, which was just one actual FOS+ neuron falsely called as FOS-. This is pretty good when looking at two different cellular states for this sample size. Possible next steps are to try a different classification algorithm (like SVM) to try and get perfect accuracy, scale up and do more samples, and/or do more cellular phenotypes. Ultimately, this approach has the potential to enhance the value of single-cell applications, by providing phenotypic information in experiments where it would difficult or impossible to experimentally assess.

Updated: