# SUBAcon

### Supplementary information

An explanation of the theory and data of subacon

26MB

# Maximum Entropy Justification for Naive Bayes¶

## Theory¶

Standard input for classification. A table of predictions with the correct category.

We define our event space as the cross product of all the predictors outputs viz:.

$\mathcal{X} = \underset{L \text{ times}}{ \underbrace{A \otimes B \otimes C \otimes ... \otimes Z}}$

and a particular event labeled as $$1_A,....,L_B$$ for example. A picturegraphic of what we mean by this is given right.

The features are assumed to have finite domains ( i-th feature ki has values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.)

We define the function $$\delta_{i_A}(\mathcal{X})$$ as a function that picks out the ith column (predictor) and the Ath category. e.g.

$\begin{split}\delta_{i_A}(\mathcal{X}) = \begin{cases} 1& \text{ if } \mathcal{X} = \underset{\phantom{XXXXXXXX} i^{th} \text{position}}{C\otimes D\otimes ... A ... \otimes Z} \\ 0& \text{ otherwise } \end{cases}\end{split}$

Note that this set of functions form a basis of functions over the domain $$\mathcal{X}$$ such that any function defined on this domain can be represented as a linear combination of our “basis” functions $$f(\mathcal{X}) = \sum_{i,A} a_{i,A} \delta_{i_A}(\mathcal{X})$$.

This is a little like dummy coding for regression analysis (see also here).

NB: In what follows it doesn’t matter if the $$\chi$$ characteristic functions are multiplied by a constant factor $$g_{i_A}$$ since these can be absorded into a redefinition of the lagrange multipliers (see below) $$h_{i_A}$$. This means that even if we “quantize” a metric into categories or features with different “areas” this will only mean that the frequencies reflect this size differentiation.

For the example where each element of X is a position on a protein strand and at this position one could have nominal values {A,B,C,D,E,F,G....Z} then

$\begin{split}\mathcal{X} = \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \dots \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right)\end{split}$

In our case $$\mathcal{X} = v_1 \otimes v_2 \otimes \dots v_n$$ where the domain of the vi are the set of all subsets of a set each element of which can be represented as a bitmap (0,1,0,1,0) etc. For continuous underlying state the deltas become characteristic functions but importantly they must define the meaning of apriori equally likely outcomes.

Although it seems that any type of function - not just a characteristic function - could be used we need to be able to evaluate the partition function with it. This is only simple when we have a characteristic function.

### Simple categorical predictors¶

Our assumption here is that we can estimate $$\langle \delta_{i_A} (\chi) \rangle$$ using our gold standard and the Maximum Entropy probability distribution is given by:

$\frac{1}{Z} \exp\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \}$

Where K is the number of features for predictor i (In reality K depends on i but our notation is getting too complex). The normalisation term (partition function) is estimated as:

$\begin{split}Z = \sum_\mathcal{X} \exp\left\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \right\} =& \prod_i^L \left( \sum_A e^{h_{i_A}}\right)\end{split}$

because we have $$\sum_{i_A}^{q} \delta_{i_A} (\mathcal{X}) = 1 \quad \forall i$$ (i.e. our categorical classifiers give only one category per prediction).

It is easy to show that to satisfy $$\partial ln Z / \partial h_{i_A} = f_{i_A}$$ we need $$e^{h_{i_A}} = f_{i_A}$$. And this also implies that $$Z = \prod_i 1 = 1$$.

The probabilty is thus:

$\begin{split}P( \mathcal{X} | c) \sim& \prod_{i_A} f_{c,i_A}^{\delta_{i_A} (\mathcal{X})} \\ \implies \log P( \mathcal{X} | c) \sim& \sum_i \sum_A \delta_{i_A}(\mathcal{X}) \log f_{c,i_A} = \sum_i \delta_i(\mathcal{X}) \cdot\log f_{c,i}\end{split}$

The matrix $$f_{c,A}$$ is what we caluclated with $$X^\top Y$$ (see below and in the code.)

### Multi-locations predictors¶

For a categorical predictor above which recognises q features the $$\mathcal{X}$$ vector will only have q values viz: (1,0,0,...,0), (0,1,0,...,0), ... (0,0,0,...,1). Now we allow $$\mathcal{X}$$ to all possible vectors with 0,or 1 then the computation for Z is

$Z = \sum_{\mathcal{X}} \exp \left\{\sum_{i_A} h_{i_A} \delta_{i_A}(\mathcal{X})\right\} = \sum_{\mathcal{X}_1=0}^1\dots\sum_{\mathcal{X}_{K}=0}^1 \prod_{i_A} e^{\left( h_{i_A} \delta_{i_A}(\mathcal{X}) \right)} = \prod_{i_A}^{K} \left[ e^{h_{i_A}} + 1\right]$

then the solution of $$\partial \ln Z/\partial h_{i_A} = \langle \delta_{i_A}(\mathcal{X}) \equiv f_{i_A}$$ is $$e^{h_{i_A}}/(e^{h_{i_A}}+1) = f_{i_A}$$. or $$h_{i_A} = \ln (f_{i_A}/(1-f_{i_A}))$$. And $$Z = \prod_{i_A} (1/(1-f_{i_A}))$$. So the probability is

$\begin{split}P(\mathcal{X}) = & \prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1- \delta_{i_A}(\mathcal{X})} \\ \equiv& \prod_i \prod_{i_A}^K \left[ f_{i_A} {\delta_{i_A}(\mathcal{X})} + \left(1-f_{i_A}\right)\left({1- \delta_{i_A}(\mathcal{X})}\right) \right]\end{split}$

The latter representation is usually the way it appears in textbooks a or with say scikit-learn as bernoulli naive bayes.

This is normalized since:

$\sum_\mathcal{X} P(\mathcal{X}) = \sum_{\mathcal{X}}\prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1-\delta_{i_A}(\chi)} = \prod_i \prod_{i_A}^K \sum_{\chi_{i_a}=0}^1 f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1-\delta_{i_A}(\mathcal{X})} =\prod_i \prod_{i_A}^K 1 = 1$

NOTE: One can also split this kind of predictor up in to q mini-predictors with each mini-predictor having two states yes or no for each feature. So that the number of “features” this predictor has is 2. This used in the laplace correction for estimating frequencies.

### The Experimental Predictors¶

Here we assume that for a particular subcellular location there is a constant (category dependent) probability that a paper will be published saying it is in location Y. We include in this the probability - in any particular time interval - of no publication.

We divide up time into N slices (N is arbitrary but large small enough that only one publication - or no publication - is published in any interval).

The Maximum Entropy probability for a given count of publications $$n_f$$ if we know the subcellular location and we have estimated the expected number of publications $$\langle n_{c,f} \rangle$$ is given by [Jaynes638]

$\begin{split}P (n_f| c) \sim & \prod_f \left[ \frac{\langle n_{c,f}\rangle}{N} \right]^{n_{f}}\end{split}$

We now assume the number of time intervals N is large so the overwhelming majority contain only “no publication” events.

$\begin{split}P (n_f| c) =& N^{-N}\prod_{f^\prime} { \langle n_{c,f} \rangle }^{n_{f}} {\left( N- \sum_{f^\prime}\langle n_{c,f} \rangle \right)}^{N - \sum_{f^\prime}n_{f}} \\ =& N^{-N}\prod_{f^\prime} { \langle n_{c,f} \rangle }^{n_{f}} N^{N-M} {\left( 1- \sum_{f^\prime}\langle n_{c,f} \rangle/N \right)}^{N - M} \\ \lim_{N\rightarrow \infty}\approx& N^{-M} \prod_{f^\prime} { \langle n_{c,f} \rangle }^{n_{ f}} e^{-\sum_{f^\prime}\langle n_{c,f} \rangle} \\ =& N^{-M} \prod_{f^\prime} { \ f_{c,f} }^{n_{f}} \langle M_c \rangle^{M} e^{-\langle M_c \rangle} \quad \text{where}\; M = \sum_{f^\prime} \ n_{f} \;\text{and}\; \langle M_c\rangle = \sum_{f^\prime} \langle\ n_{c,f} \rangle\end{split}$

and $$f_{c,f}$$ is $$\langle n_{c,f} \rangle/\langle M_c\rangle$$ The effect of this is to give a Bernoulli Multinomial distribution modulated by a poisson distribution to account for the varying number of publications. The dependency on N drops out with normalisation.

## SUBAcon Data¶

The data discussed here is available as a download. To use this data you need an assess to a MySQL database (for example AWS)

gzcat subacon.sql.gz | mysql -u <user> -p -h <host> database


The basic SUBAcon type is the mysql set type ususally applied to a column call suba_location.

suba_location set('cell plate','cytoskeleton','cytosol','endoplasmic reticulum',
'extracellular','golgi','mitochondrion','nucleus','peroxisome',
'plasma membrane','plastid','unclear','vacuole','endosome')


The data is stored in MySQL as a bitmap so that (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) = 0x1238 is a bitmap of the subset ('endoplasmic reticulum','extracellular','golgi','plasma membrane','vacuole') which we name “secretory”. Also we fold cytoskeleton and cytosol together and name them simply cytosol. There are no entries in the database with ‘cell plate’ or ‘endosome’ set and we filter out all entries with ‘unclear’ only set. These entries are there for historical reasons.

### Feature counts¶

Each predictor has a set of features that it “predicts”. In this case we are working off the same data structure so that “features” and “categories” look like the same thing.

For example predotar has a feature (endoplasmic reticulum, extracellular, golgi)

-- data from suba3
select  ifnull(p.suba_location,'no location') as suba_location,count(*) as num
from tair10 left outer join pred_predotar as p using(locus) group by p.suba_location
order by p.suba_location


And a sankey digram show the mapping between the features and categories

A full list is given Predictor Graphs

### Categorical Predictors¶

We generate from each predictor indicator matrices. Something like:

select locus,
case when p.suba_location & (1<< 6)  then 1 else 0 end as 'mito',
case when p.suba_location & (1<< 10) then 1 else 0 end as 'plastid',
case when p.suba_location & 0x1238   then 1 else 0 end as 'secretory',
case when p.suba_location is NULL or
p.suba_location & ~(1<<6 | 1<<10 | 0x1238) then 1 else 0 end as 'none',

'|' as |,
case when gs.suba_location & (1<<1 | 1<<2)   then 1 else 0 end as 'cytosol',
case when gs.suba_location & (1<< 6)         then 1 else 0 end as 'mito',
case when gs.suba_location & (1<< 7)         then 1 else 0 end as 'nucleus',
case when gs.suba_location & (1<< 8)         then 1 else 0 end as 'perox',
case when gs.suba_location & (1<< 10)        then 1 else 0 end as 'plastid',
case when gs.suba_location & (0x1238)        then 1 else 0 end as 'secretory'

from gold_std as gs join pred_predotar as p using(locus);


That creates (effectively) a pair of matrcies X and Y. We then compute $$Y^\top X$$. (We use the terms X and Y to be consitent with scikit-learn )

$$X^\top Y$$ is equivalent to the categories x features matrix (for example from the data for Predotar)

-- data from suba3
SELECT
category,
sum( case when feat & (1<< 6)  then 1 else 0 end) as mitochondrion,
sum( case when feat & (1<< 10) then 1 else 0 end) as plastid,
sum( case when feat &  0x1238  then 1 else 0 end) as secretory,
sum( case when feat is NULL    then 1 else 0 end) as none
FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
gs.locus,         -- the protein
gs.suba_location, -- gold standard for the protein
pred.suba_location as feat  -- the predicted location or "feature"
FROM gold_std as gs
LEFT OUTER JOIN pred_predotar as pred
ON gs.locus = pred.locus
) AS gs2pred
JOIN ( -- MySQL attempt at a literal table of categories
-- for latin1 charset for string literal
(select _latin1'peroxisome' as category , 1<<8 as mask)
union all (select 'cytosol',    1<<1 | 1<<2)
union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)
union all (select 'plastid',         1<<10)
union all (select 'nucleus',         1<< 7)
union all (select 'mitochondrion',   1<< 6)

) AS  categories

GROUP BY category ORDER BY category;


For each row of this matrix the columns are summed and the total is divided into the cell to make a frequency matrix $$f_{c,A}$$. This is the data that goes into the predictor algorithm.

### Experimental Data¶

GFP and MSMS tables are a simple list unique triples (Locus, location, PMID). where location is one of the locations of the suba_location set. We first convert using:

SELECT
gs.locus,
gs.suba_location,
sum( case when location in ('cytosol','cytoskeleton') then 1 else 0 end) as cytosol,
sum( case when location = 'mitochondrion' then 1 else 0 end) as mitochondrion,
sum( case when location = 'nucleus'       then 1 else 0 end) as nucleus,
sum( case when location = 'peroxisome'    then 1 else 0 end) as peroxisome,
sum( case when location = 'plastid'       then 1 else 0 end) as plastid,
sum( case when location in ('vacuole','golgi','endoplasmic reticulum',
'plasma membrane','extracellular')  then 1 else 0 end) as secretory

FROM gold_std as gs
JOIN src_gfp using(locus) group by locus,gs.suba_location;


This gives a table of paper counts for each locus. (Note it’s not problematic to group with gs.suba_location since locus already uniquely locates a row in the gold_std)

This gives the $$Y^\top X$$ table (for GFP as):

and for MS/MS as

### The PPI predictor¶

Protein-Protein Interaction (PPI) information is taken from .. (??) plus data curated by our researchers (see ASURE).

We use the following SQL to create a feature list from PPIs. The src_ppi table is a list of AGI pairs (locusA, locusB) and a PMID. We turn this into a predictor by joining locusA of the src_ppi with the gold_std and then predicting locusB location to be that of locusA. We use the publication counts exactly as for the GFP/MSMS predictors.

SELECT locus,
sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as secretory,
sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as cytosol,
sum(if( suba_location & (1<< 6) ,num,0)) as mitochondrion,
sum(if( suba_location & (1<< 7) ,num,0)) as nucleus,
sum(if( suba_location & (1<< 8) ,num,0)) as peroxisome,
sum(if( suba_location & (1<<10) ,num,0)) as plastid
FROM (
SELECT ppi.locusB AS locus, -- locusB is in same compartment as locusA!
gs.suba_location AS suba_location, -- gold standard says locusA is here
count(DISTINCT ppi.paper) AS num -- we have this many papers for this pair of locusB,suba_location
FROM gold_std as gs INNER JOIN src_ppi as ppi ON ppi.locusA = gs.locus
WHERE ppi.locusA != ppi.locusB GROUP BY ppi.locusB, gs.suba_location
) as pred_src_ppi

GROUP by pred_src_ppi.locus


### Atted Predictor¶

The Atted Data is a quadruplet of (locus, locus_coex, rank, average) The following SQL converts the Atted data in to a list of publication weights for each locus. We can the join against the gold standard

SELECT
locus,
sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as secretory,
sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as cytosol,
sum(if( suba_location & (1<< 6) ,num,0)) as mitochondrion,
sum(if( suba_location & (1<< 7) ,num,0)) as nucleus,
sum(if( suba_location & (1<< 8) ,num,0)) as peroxisome,
sum(if( suba_location & (1<<10) ,num,0)) as plastid

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
gs.locus,         -- the protein
gs.suba_location, -- gold standard for the protein
pred_src_att.suba_location as att,  -- what atted "predicts"
pred_src_att.num                    -- "weight" attached to this protein
FROM gold_std as gs
INNER JOIN

( -- this is the predictor table "pred_src_att"
-- we join to the gold standard on locus and then match gs.suba_location to locus_coex
SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
WHERE att.locus != att.locus_coex
AND
(  ( gs2.suba_location & (1<<1 | 1<<2)>0  and rank < 50 and ave > .7)   -- cytosol/cytoskeleton
or ( gs2.suba_location & (1<<10)  >0      and rank < 50 and ave > .8)   -- plastid
or ( gs2.suba_location & (0x1238) >0      and rank < 50 and ave > .5)   -- secretory
or ( gs2.suba_location & (1<<7)  >0       and rank < 50 and ave > .5)   -- nucleus
or ( gs2.suba_location & (1<<6)  >0       and rank < 50 and ave > .6)   -- mito
or ( gs2.suba_location & (1<<8)  >0       and rank < 50 and ave > .5)   -- perox
)
AND rank >= 0 -- avoid bad rows

GROUP BY att.locus_coex, gs2.suba_location
) AS pred_src_att

ON gs.locus = pred_src_att.locus
) AS gs2pred

GROUP by gs2pred.locus


This renders a table in the form of a GFP/MSMS table above and we can then use it to compute as for those predictors.

The final counts table

-- data from suba3
SELECT
category,
sum( case when find_in_set('cytosol',att)>0 or find_in_set('cytoskeleton',att)>0
then num else 0 end)+1 as cytosol,
sum( case when find_in_set('mitochondrion',att)>0 then num else 0 end)+1 as mitochondrion,
sum( case when find_in_set('nucleus',att)>0       then num else 0 end)+1 as nucleus,
sum( case when find_in_set('peroxisome',att)>0    then num else 0 end)+1 as peroxisome,
sum( case when find_in_set('plastid',att)>0       then num else 0 end)+1 as plastid,
sum( case when att & 0x1238                       then num else 0 end)+1 as secretory

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
gs.locus, -- the protein
gs.suba_location, -- gold standard for the protein
pred_src_att.suba_location as att,  -- what atted "predicts"
pred_src_att.num -- number of unique papers attached to this protein
FROM gold_std as gs
INNER JOIN ( -- this is the predictor table "pred_src_att"
-- we join to the gold standard on locus and then match gs.suba_location to locus_coex
SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
WHERE att.locus != att.locus_coex
AND rank >= 0 -- avoid bad rows
AND rank < 5
AND ave > .8
GROUP BY att.locus_coex, gs2.suba_location
) AS pred_src_att
ON gs.locus = pred_src_att.locus
) AS gs2pred
JOIN ( -- MySQL attempt at a literal table of categories
(select _latin1'peroxisome' as category ,1<<8 as mask) -- for latin1 charset for string literal
union all (select 'cytosol',    1<<1 | 1<<2)
union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)