SUBAcon

Supplementary information

An explanation of the theory and data of SUBAcon


Maximum Entropy Justification for Naive Bayes

Theory

[Figure img/goldstandard_x300px.png: Standard input for classification - a table of predictions with the correct category.]

We define our event space as the Cartesian product of all the predictors' outputs, viz.:

\[\mathcal{X} = \underset{L \text{ times}}{ \underbrace{A \otimes B \otimes C \otimes ... \otimes Z}}\]

and a particular event labelled as, for example, \(1_A, \dots, L_B\). A pictorial representation of what we mean by this is given in the figure.

The features are assumed to have finite domains (the i-th feature has \(k_i\) values) and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.)

We define \(\delta_{i_A}(\mathcal{X})\) as the function that picks out the i-th column (predictor) and the A-th category, e.g.

\[\begin{split}\delta_{i_A}(\mathcal{X}) = \begin{cases} 1& \text{ if } \mathcal{X} = \underset{\phantom{XXXXXXXX} i^{th} \text{position}}{C\otimes D\otimes ... A ... \otimes Z} \\ 0& \text{ otherwise } \end{cases}\end{split}\]

Note that this set of functions form a basis of functions over the domain \(\mathcal{X}\) such that any function defined on this domain can be represented as a linear combination of our “basis” functions \(f(\mathcal{X}) = \sum_{i,A} a_{i,A} \delta_{i_A}(\mathcal{X})\).

This is a little like dummy coding for regression analysis.
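As a small sketch of this idea (NumPy only; the categories and predictions below are illustrative, not from the SUBAcon tables), the \(\delta_{i_A}\) functions amount to one-hot encoding each predictor's output:

```python
import numpy as np

# Illustrative categories and predictions for a single predictor column.
categories = ["mitochondrion", "plastid", "secretory"]
predictions = ["plastid", "plastid", "mitochondrion", "secretory"]

# delta_{i_A}(X) as a 0/1 indicator: row n, column A is 1 iff
# prediction n falls in category A (one-hot / dummy coding).
X = np.array([[1 if p == c else 0 for c in categories] for p in predictions])

# A categorical predictor gives exactly one category per prediction,
# so each row sums to 1.
assert (X.sum(axis=1) == 1).all()
```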

NB: In what follows it doesn't matter if the \(\delta\) characteristic functions are multiplied by a constant factor \(g_{i_A}\), since these can be absorbed into a redefinition of the Lagrange multipliers (see below) \(h_{i_A}\). This means that even if we "quantize" a metric into categories or features with different "areas", the frequencies will simply reflect this size differentiation.

For the example where each element of \(\mathcal{X}\) is a position on a protein strand, and at this position one could have nominal values {A,B,C,D,E,F,G,...,Z}, then

\[\begin{split}\mathcal{X} = \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right) \otimes \dots \otimes \left(\begin{matrix} A \\ B \\ C \\ D \\ \vdots \\ Z\end{matrix}\right)\end{split}\]

In our case \(\mathcal{X} = v_1 \otimes v_2 \otimes \dots \otimes v_n\) where the domain of the \(v_i\) is the set of all subsets of a set, each element of which can be represented as a bitmap (0,1,0,1,0) etc. For a continuous underlying state the deltas become characteristic functions, but importantly they must define the meaning of a priori equally likely outcomes.

Although it seems that any type of function - not just a characteristic function - could be used, we need to be able to evaluate the partition function with it, and this is only simple for a characteristic function.

Simple categorical predictors

Our assumption here is that we can estimate \(\langle \delta_{i_A} (\mathcal{X}) \rangle\) using our gold standard, and the Maximum Entropy probability distribution is given by:

\[\frac{1}{Z} \exp\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \}\]

where K is the number of features for predictor i (in reality K depends on i, but we suppress this to keep the notation manageable). The normalisation term (partition function) is estimated as:

\[\begin{split}Z = \sum_\mathcal{X} \exp\left\{ \sum_{i_A}^K h_{i_A}\delta_{i_A} (\mathcal{X}) \right\} =& \prod_i^L \left( \sum_A e^{h_{i_A}}\right)\end{split}\]

because we have \(\sum_{A}^{K} \delta_{i_A} (\mathcal{X}) = 1 \quad \forall i\) (i.e. our categorical classifiers give exactly one category per prediction).

It is easy to show that to satisfy \(\partial \ln Z / \partial h_{i_A} = f_{i_A}\) we need \(e^{h_{i_A}} = f_{i_A}\). Since \(\sum_A f_{i_A} = 1\), this also implies that \(Z = \prod_i 1 = 1\).
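Filling in the intermediate step: differentiating the product form of \(Z\) gives

\[\frac{\partial \ln Z}{\partial h_{i_A}} = \frac{e^{h_{i_A}}}{\sum_B e^{h_{i_B}}} = f_{i_A}\]

so the choice \(e^{h_{i_A}} = f_{i_A}\) satisfies this directly, and each factor \(\sum_A e^{h_{i_A}} = \sum_A f_{i_A} = 1\), hence \(Z = 1\).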

The probability is thus:

\[\begin{split}P( \mathcal{X} | c) \sim& \prod_{i_A} f_{c,i_A}^{\delta_{i_A} (\mathcal{X})} \\ \implies \log P( \mathcal{X} | c) \sim& \sum_i \sum_A \delta_{i_A}(\mathcal{X}) \log f_{c,i_A} = \sum_i \delta_i(\mathcal{X}) \cdot\log f_{c,i}\end{split}\]

The matrix \(f_{c,A}\) is what we calculated with \(Y^\top X\) (see below and in the code).
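As a toy sketch of that computation (NumPy; the arrays below are illustrative, not real SUBAcon data): X holds one predictor's indicator columns, Y the gold-standard category columns, and row-normalising the count matrix gives \(f_{c,A}\):

```python
import numpy as np

# Toy indicator matrices: 5 proteins, 2 predictor features, 2 categories.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]])  # predictor output
Y = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]])  # gold standard

counts = Y.T @ X                                # categories x features
f = counts / counts.sum(axis=1, keepdims=True)  # frequency matrix f_{c,A}

# Score a new prediction delta against each category:
# log P(X|c) ~ sum_A delta_A log f_{c,A}
delta = np.array([1, 0])
log_scores = np.log(f) @ delta
best_category = int(log_scores.argmax())
```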

Multi-location predictors

For a categorical predictor as above which recognises q features, the \(\mathcal{X}\) vector takes only q values, viz: (1,0,0,...,0), (0,1,0,...,0), ... (0,0,0,...,1). If we now allow \(\mathcal{X}\) to be any vector of 0s and 1s, then the computation for Z is

\[Z = \sum_{\mathcal{X}} \exp \left\{\sum_{i_A} h_{i_A} \delta_{i_A}(\mathcal{X})\right\} = \sum_{\mathcal{X}_1=0}^1\dots\sum_{\mathcal{X}_{K}=0}^1 \prod_{i_A} e^{\left( h_{i_A} \delta_{i_A}(\mathcal{X}) \right)} = \prod_{i_A}^{K} \left[ e^{h_{i_A}} + 1\right]\]

then the solution of \(\partial \ln Z/\partial h_{i_A} = \langle \delta_{i_A}(\mathcal{X}) \rangle \equiv f_{i_A}\) is \(e^{h_{i_A}}/(e^{h_{i_A}}+1) = f_{i_A}\), or \(h_{i_A} = \ln (f_{i_A}/(1-f_{i_A}))\). And \(Z = \prod_{i_A} (1/(1-f_{i_A}))\). So the probability is

\[\begin{split}P(\mathcal{X}) = & \prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1- \delta_{i_A}(\mathcal{X})} \\ \equiv& \prod_i \prod_{i_A}^K \left[ f_{i_A} {\delta_{i_A}(\mathcal{X})} + \left(1-f_{i_A}\right)\left({1- \delta_{i_A}(\mathcal{X})}\right) \right]\end{split}\]

The latter representation is usually the way it appears in textbooks, or in, say, scikit-learn as Bernoulli naive Bayes.
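A minimal sketch of the Bernoulli form (the frequencies below are made up for illustration): each feature contributes \(f_{i_A}\) if present and \(1-f_{i_A}\) if absent, and the distribution sums to one over all 0/1 vectors:

```python
import itertools
import numpy as np

def bernoulli_log_p(delta, f):
    """log P(X) = sum_k [delta_k log f_k + (1 - delta_k) log(1 - f_k)]."""
    delta, f = np.asarray(delta, float), np.asarray(f, float)
    return float(np.sum(delta * np.log(f) + (1 - delta) * np.log(1 - f)))

f = np.array([0.5, 0.25, 0.1])  # illustrative per-feature frequencies

# The distribution is normalised over all 2^3 possible feature vectors.
total = sum(np.exp(bernoulli_log_p(d, f))
            for d in itertools.product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-9
```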

This is normalized since:

\[\sum_\mathcal{X} P(\mathcal{X}) = \sum_{\mathcal{X}}\prod_i \prod_{i_A}^K f_{i_A}^{\delta_{i_A}(\mathcal{X})}\left(1-f_{i_A}\right)^{1-\delta_{i_A}(\mathcal{X})} = \prod_i \prod_{i_A}^K \sum_{\delta_{i_A}=0}^1 f_{i_A}^{\delta_{i_A}}\left(1-f_{i_A}\right)^{1-\delta_{i_A}} =\prod_i \prod_{i_A}^K 1 = 1\]

NOTE: One can also split this kind of predictor up into q mini-predictors, with each mini-predictor having two states - yes or no - for each feature, so that the number of "features" each mini-predictor has is 2. This is used in the Laplace correction for estimating frequencies.
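A sketch of that correction for one yes/no mini-predictor (the counts are illustrative): with two states, add-one smoothing keeps a never-observed feature from producing a zero frequency:

```python
# Illustrative counts: the feature was seen in 0 of 7 gold-standard rows.
n_yes, n = 0, 7

# Laplace (add-one) correction over the two states (yes / no):
f_yes = (n_yes + 1) / (n + 2)
f_no = (n - n_yes + 1) / (n + 2)

assert f_yes > 0 and abs(f_yes + f_no - 1.0) < 1e-12
```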

The Experimental Predictors

Here we assume that for a particular subcellular location there is a constant (category-dependent) probability that a paper will be published saying the protein is in location Y. We include in this the probability - in any particular time interval - of no publication.

We divide time into N slices (N is arbitrary, but large enough that at most one publication appears in any interval).

The Maximum Entropy probability for a given count of publications \(n_f\), if we know the subcellular location and have estimated the expected number of publications \(\langle n_{c,f} \rangle\), is given by [Jaynes638]

\[\begin{split}P (n_f| c) \sim & \prod_f \left[ \frac{\langle n_{c,f}\rangle}{N} \right]^{n_{f}}\end{split}\]

We now assume the number of time intervals N is large so the overwhelming majority contain only “no publication” events.

\[\begin{split}P (n_f| c) =& N^{-N}\prod_{f} { \langle n_{c,f} \rangle }^{n_{f}} {\left( N- \sum_{f}\langle n_{c,f} \rangle \right)}^{N - \sum_{f}n_{f}} \\ =& N^{-N}\prod_{f} { \langle n_{c,f} \rangle }^{n_{f}} N^{N-M} {\left( 1- \sum_{f}\langle n_{c,f} \rangle/N \right)}^{N - M} \\ \lim_{N\rightarrow \infty}\approx& N^{-M} \prod_{f} { \langle n_{c,f} \rangle }^{n_{f}} e^{-\sum_{f}\langle n_{c,f} \rangle} \\ =& N^{-M} \prod_{f} { f_{c,f} }^{n_{f}} \langle M_c \rangle^{M} e^{-\langle M_c \rangle} \quad \text{where}\; M = \sum_{f} n_{f} \;\text{and}\; \langle M_c\rangle = \sum_{f} \langle n_{c,f} \rangle\end{split}\]

and \(f_{c,f} = \langle n_{c,f} \rangle/\langle M_c\rangle\). The effect of this is to give a multinomial distribution modulated by a Poisson distribution that accounts for the varying number of publications. The dependency on N drops out with normalisation.
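A sketch of the resulting score (the function name and data are illustrative): taking the log of the expression above splits it into a multinomial term over the observed counts plus a Poisson term in the total count \(M\), with the category-independent \(N^{-M}\) factor dropped:

```python
import numpy as np

def publication_log_score(n, mean_n):
    """log of prod_f f_{c,f}^{n_f} * <M_c>^M * exp(-<M_c>),
    with f_{c,f} = <n_{c,f}> / <M_c> and M = sum_f n_f."""
    n, mean_n = np.asarray(n, float), np.asarray(mean_n, float)
    M, mean_M = n.sum(), mean_n.sum()
    f = mean_n / mean_M                       # multinomial frequencies
    return float(np.sum(n * np.log(f))        # multinomial term
                 + M * np.log(mean_M) - mean_M)  # Poisson term in M

# Illustrative: 2 papers for the first location, 1 for the second,
# against expected counts of 3 and 1 for this category.
score = publication_log_score([2, 1], [3, 1])
```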

SUBAcon Data

The data discussed here is available as a download. To use this data you need access to a MySQL database (for example on AWS):

gzcat subacon.sql.gz | mysql -u <user> -p -h <host> database

The basic SUBAcon type is the MySQL set type, usually applied to a column called suba_location.

`suba_location` set('cell plate','cytoskeleton','cytosol','endoplasmic reticulum',
        'extracellular','golgi','mitochondrion','nucleus','peroxisome',
        'plasma membrane','plastid','unclear','vacuole','endosome')

The data is stored in MySQL as a bitmap so that (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) = 0x1238 is a bitmap of the subset ('endoplasmic reticulum','extracellular','golgi','plasma membrane','vacuole') which we name “secretory”. Also we fold cytoskeleton and cytosol together and name them simply cytosol. There are no entries in the database with ‘cell plate’ or ‘endosome’ set and we filter out all entries with ‘unclear’ only set. These entries are there for historical reasons.
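The bitmap arithmetic can be checked with a few lines of Python (the mask() helper is just for illustration; MySQL numbers SET members from 1, so member k maps to bit 1 << (k - 1)):

```python
# MySQL SET members in declaration order; member k uses bit 1 << (k - 1).
SET_MEMBERS = ['cell plate', 'cytoskeleton', 'cytosol',
               'endoplasmic reticulum', 'extracellular', 'golgi',
               'mitochondrion', 'nucleus', 'peroxisome', 'plasma membrane',
               'plastid', 'unclear', 'vacuole', 'endosome']

def mask(*names):
    """Bitmap for a subset of SET members (illustrative helper)."""
    return sum(1 << SET_MEMBERS.index(n) for n in names)

SECRETORY = mask('endoplasmic reticulum', 'extracellular', 'golgi',
                 'plasma membrane', 'vacuole')
assert SECRETORY == 0x1238                 # as stated above
CYTOSOL = mask('cytoskeleton', 'cytosol')  # folded together as "cytosol"
```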

Feature counts

Each predictor has a set of features that it “predicts”. In this case we are working off the same data structure so that “features” and “categories” look like the same thing.

For example, Predotar has a feature (endoplasmic reticulum, extracellular, golgi).

-- data from suba3
select  ifnull(p.suba_location,'no location') as suba_location,count(*) as num
           from tair10 left outer join pred_predotar as p using(locus) group by p.suba_location
           order by p.suba_location

A Sankey diagram shows the mapping between the features and categories.

A full list is given in Predictor Graphs.

Categorical Predictors

From each predictor we generate indicator matrices. Something like:

select locus,
case when p.suba_location & (1<< 6)  then 1 else 0 end as 'mito',
case when p.suba_location & (1<< 10) then 1 else 0 end as 'plastid',
case when p.suba_location & 0x1238   then 1 else 0 end as 'secretory',
case when p.suba_location is NULL or
          p.suba_location & ~(1<<6 | 1<<10 | 0x1238) then 1 else 0 end as 'none',

'|' as `|`,
case when gs.suba_location & (1<<1 | 1<<2)   then 1 else 0 end as 'cytosol',
case when gs.suba_location & (1<< 6)         then 1 else 0 end as 'mito',
case when gs.suba_location & (1<< 7)         then 1 else 0 end as 'nucleus',
case when gs.suba_location & (1<< 8)         then 1 else 0 end as 'perox',
case when gs.suba_location & (1<< 10)        then 1 else 0 end as 'plastid',
case when gs.suba_location & (0x1238)        then 1 else 0 end as 'secretory'

from gold_std as gs join pred_predotar as p using(locus);

That creates (effectively) a pair of matrices X and Y. We then compute \(Y^\top X\). (We use the names X and Y to be consistent with scikit-learn.)

\(Y^\top X\) is equivalent to the categories x features matrix (for example, from the data for Predotar):

-- data from suba3
SELECT
category,
sum( case when feat & (1<< 6)  then 1 else 0 end) as mitochondrion,
sum( case when feat & (1<< 10) then 1 else 0 end) as plastid,
sum( case when feat &  0x1238  then 1 else 0 end) as secretory,
sum( case when feat is NULL    then 1 else 0 end) as `none`
FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
      gs.locus,         -- the protein
      gs.suba_location, -- gold standard for the protein
      pred.suba_location as feat  -- the predicted location or "feature"
     FROM gold_std as gs
      LEFT OUTER JOIN pred_predotar as pred
     ON gs.locus = pred.locus
    ) AS gs2pred
JOIN ( -- MySQL attempt at a literal table of categories
  -- for latin1 charset for string literal
  (select _latin1'peroxisome' as category , 1<<8 as mask)
  union all (select 'cytosol',    1<<1 | 1<<2)
  union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)
    -- see http://dev.mysql.com/doc/refman/5.0/en/hexadecimal-literals.html
  union all (select 'plastid',         1<<10)
  union all (select 'nucleus',         1<< 7)
  union all (select 'mitochondrion',   1<< 6)

  ) AS  categories

ON (gs2pred.suba_location & categories.mask)
GROUP BY category ORDER BY category;

For each row of this matrix the columns are summed and each cell is divided by the row total to make a frequency matrix \(f_{c,A}\). This is the data that goes into the predictor algorithm.

Experimental Data

The GFP and MSMS tables are simple lists of unique triples (locus, location, PMID), where location is one of the locations of the suba_location set. We first convert using:

SELECT
 gs.locus,
 gs.suba_location,
 sum( case when location in ('cytosol','cytoskeleton') then 1 else 0 end) as cytosol,
 sum( case when location = 'mitochondrion' then 1 else 0 end) as mitochondrion,
 sum( case when location = 'nucleus'       then 1 else 0 end) as nucleus,
 sum( case when location = 'peroxisome'    then 1 else 0 end) as peroxisome,
 sum( case when location = 'plastid'       then 1 else 0 end) as plastid,
 sum( case when location in ('vacuole','golgi','endoplasmic reticulum',
       'plasma membrane','extracellular')  then 1 else 0 end) as secretory

FROM gold_std as gs
JOIN src_gfp using(locus) group by locus,gs.suba_location;

This gives a table of paper counts for each locus. (Note that it is not problematic to group by gs.suba_location, since locus already uniquely identifies a row in gold_std.)

This gives the \(Y^\top X\) table for GFP, and likewise for MS/MS.

The PPI predictor

Protein-Protein Interaction (PPI) information is taken from .. (??) plus data curated by our researchers (see ASURE).

We use the following SQL to create a feature list from PPIs. The src_ppi table is a list of AGI pairs (locusA, locusB) and a PMID. We turn this into a predictor by joining locusA of src_ppi with gold_std and then predicting the location of locusB to be that of locusA. We use the publication counts exactly as for the GFP/MSMS predictors.

SELECT locus,
   sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as `secretory`,
   sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as `cytosol`,
   sum(if( suba_location & (1<< 6) ,num,0)) as `mitochondrion`,
   sum(if( suba_location & (1<< 7) ,num,0)) as `nucleus`,
   sum(if( suba_location & (1<< 8) ,num,0)) as `peroxisome`,
   sum(if( suba_location & (1<<10) ,num,0)) as `plastid`
FROM (
      SELECT ppi.locusB AS locus, -- locusB is in same compartment as locusA!
              gs.suba_location AS suba_location, -- gold standard says locusA is here
              count(DISTINCT ppi.paper) AS num -- we have this many papers for this pair of locusB,suba_location
         FROM gold_std as gs INNER JOIN src_ppi as ppi ON ppi.locusA = gs.locus
         WHERE ppi.locusA != ppi.locusB GROUP BY ppi.locusB, gs.suba_location
         ) as pred_src_ppi

     GROUP by pred_src_ppi.locus

Atted Predictor

The Atted data is a quadruplet of (locus, locus_coex, rank, average). The following SQL converts the Atted data into a list of publication weights for each locus. We can then join against the gold standard:

SELECT
 locus,
   sum(if( suba_location & (1<< 3 | 1<< 4 | 1<< 5 | 1<< 9 | 1<< 12) ,num,0)) as `secretory`,
   sum(if( suba_location & (1<<1 | 1<<2) ,num,0)) as `cytosol`,
   sum(if( suba_location & (1<< 6) ,num,0)) as `mitochondrion`,
   sum(if( suba_location & (1<< 7) ,num,0)) as `nucleus`,
   sum(if( suba_location & (1<< 8) ,num,0)) as `peroxisome`,
   sum(if( suba_location & (1<<10) ,num,0)) as `plastid`

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
       gs.locus,         -- the protein
       gs.suba_location, -- gold standard for the protein
       pred_src_att.suba_location as att,  -- what atted "predicts"
       pred_src_att.num                    -- "weight" attached to this protein
      FROM gold_std as gs
       INNER JOIN

       ( -- this is the predictor table "pred_src_att"
         -- we join to the gold standard on locus and then match gs.suba_location to locus_coex
         SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
              gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
              count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
         FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
         WHERE att.locus != att.locus_coex
         AND
         (  ( gs2.suba_location & (1<<1 | 1<<2)>0  and rank < 50 and ave > .7)   -- cytosol/cytoskeleton
         or ( gs2.suba_location & (1<<10)  >0      and rank < 50 and ave > .8)   -- plastid
         or ( gs2.suba_location & (0x1238) >0      and rank < 50 and ave > .5)   -- secretory
         or ( gs2.suba_location & (1<<7)  >0       and rank < 50 and ave > .5)   -- nucleus
         or ( gs2.suba_location & (1<<6)  >0       and rank < 50 and ave > .6)   -- mito
         or ( gs2.suba_location & (1<<8)  >0       and rank < 50 and ave > .5)   -- perox
         )
         AND rank >= 0 -- avoid bad rows

         GROUP BY att.locus_coex, gs2.suba_location
         ) AS pred_src_att

      ON gs.locus = pred_src_att.locus
     ) AS gs2pred

   GROUP by gs2pred.locus

This renders a table in the form of the GFP/MSMS tables above, and we can then compute frequencies as for those predictors.

The final counts table

-- data from suba3
SELECT
 category,
 sum( case when find_in_set('cytosol',att)>0 or find_in_set('cytoskeleton',att)>0
                                                   then num else 0 end)+1 as cytosol,
 sum( case when find_in_set('mitochondrion',att)>0 then num else 0 end)+1 as mitochondrion,
 sum( case when find_in_set('nucleus',att)>0       then num else 0 end)+1 as nucleus,
 sum( case when find_in_set('peroxisome',att)>0    then num else 0 end)+1 as peroxisome,
 sum( case when find_in_set('plastid',att)>0       then num else 0 end)+1 as plastid,
 sum( case when att & 0x1238                       then num else 0 end)+1 as secretory

FROM (SELECT -- join "gold standard" to "predictor" table so we can match predictions
       gs.locus, -- the protein
       gs.suba_location, -- gold standard for the protein
       pred_src_att.suba_location as att,  -- what atted "predicts"
       pred_src_att.num -- number of unique papers attached to this protein
      FROM gold_std as gs
       INNER JOIN ( -- this is the predictor table "pred_src_att"
         -- we join to the gold standard on locus and then match gs.suba_location to locus_coex
         SELECT att.locus_coex AS locus, -- locus_coex is in same compartment as locus!
              gs2.suba_location AS suba_location, -- gold standard says locus_coex is here
              count(DISTINCT att.rank,att.ave) AS num -- we have this many ranks locus_coex,suba_location
         FROM gold_std as gs2 INNER JOIN atted as att ON att.locus = gs2.locus
         WHERE att.locus != att.locus_coex
         AND rank >= 0 -- avoid bad rows
         AND rank < 5
         AND ave > .8
         GROUP BY att.locus_coex, gs2.suba_location
         ) AS pred_src_att
      ON gs.locus = pred_src_att.locus
     ) AS gs2pred
JOIN ( -- MySQL attempt at a literal table of categories
   (select _latin1'peroxisome' as category ,1<<8 as mask) -- for latin1 charset for string literal
   union all (select 'cytosol',    1<<1 | 1<<2)
   union all (select 'secretory',   0x1238 << 0) -- ensure "numeric" context with zero shift (ugh!)
   -- see http://dev.mysql.com/doc/refman/5.0/en/hexadecimal-literals.html
   union all (select 'plastid',         1<<10)
   union all (select 'nucleus',         1<< 7)
   union all (select 'mitochondrion',   1<< 6)
   ) AS  categories

WHERE gs2pred.suba_location & categories.mask
GROUP BY category ORDER BY category

References

[Jaynes638] E.T. Jaynes, "Probability Theory: The Logic of Science", Cambridge University Press (2004), ISBN 0-521-59271-2. Section 22.5, page 638.