Statistical assessment of enrichment in ranked lists - algorithms and applications

Limor Leibovich (Ph.D. Thesis Seminar)

Monday, 30.12.2013, 13:00

Taub 601

Advisor: Dr. Zohar Yakhini

Modern data analysis often faces the task of extracting characteristic features from sets of elements that have been characterized by some measurement assay or procedure. In molecular biology, for example, an experiment may yield measurements pertaining to genes, and one then asks about the properties of the genes for which these measurements were high or low. Genes are only one example: the elements can equally be genomic regions, proteins, structures, etc. A central technique for analyzing characteristic properties of sets of elements is statistical enrichment. More specifically, the experimental results can often be represented as a ranked list of elements, and we then seek enrichment of other properties of these elements at the top or bottom of the ranked list. The statistical significance of properties occurring at either end of a ranked list can be assessed using the minimum hypergeometric (mHG) statistics developed in the Yakhini research group. This approach avoids specifying an arbitrary threshold to define the end of the list.
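To make the threshold-free idea concrete, here is a minimal sketch of a minimum hypergeometric (mHG) style score for a ranked list in which each element either has the property of interest (1) or does not (0). The function names are hypothetical, and the p-value is obtained by brute-force enumeration, which is feasible only for tiny lists and is an assumption made for illustration, not the method presented in the talk.

```python
from itertools import combinations
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B marked, n drawn)."""
    upper = min(n, B)
    favourable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favourable / comb(N, n)


def mhg_statistic(labels):
    """Minimum hypergeometric tail over all cutoffs n = 1..N.

    labels: 0/1 values ordered from the top of the ranked list downwards.
    """
    N, B = len(labels), sum(labels)
    best, b = 1.0, 0
    for n, label in enumerate(labels, start=1):
        b += label  # number of marked elements among the top n
        best = min(best, hypergeom_tail(N, B, n, b))
    return best


def mhg_pvalue_bruteforce(labels):
    """Exact p-value by enumerating every placement of the B marked elements.

    Illustration only: the number of placements, C(N, B), grows very quickly.
    """
    N, B = len(labels), sum(labels)
    observed = mhg_statistic(labels)
    hits = 0
    for ones in combinations(range(N), B):
        candidate = [0] * N
        for i in ones:
            candidate[i] = 1
        # Small tolerance guards against floating-point ties.
        if mhg_statistic(candidate) <= observed + 1e-12:
            hits += 1
    return hits / comb(N, B)


if __name__ == "__main__":
    ranked = [1, 1, 0, 0, 0, 0]  # both marked elements sit at the top
    print(mhg_statistic(ranked), mhg_pvalue_bruteforce(ranked))
```

Because the minimum is taken over every cutoff, no single "top n" has to be chosen in advance; the p-value step then corrects for having scanned all cutoffs.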
This idea can be extended to the case of two rankings over a common universe of elements, for example when genes are ranked both by differential expression and by motif-presence score. In this case, a mutual enrichment analysis of the two rankings is appropriate. For practical biological science, mutual enrichment is more informative than simple correlation measures because it focuses on the top of the lists rather than on overall agreement, which may be weak even when the extremities agree. Relative ranking can be represented by permutations over the measured elements, so the statistical assessment of mutual enrichment is equivalent to characterizing properties of random permutations. Because of the size of the measure space (N! permutations of N elements), statistics over the group of permutations are difficult to compute and implement. To make the statistical assessment of mutual enrichment in ranked lists practical, we derived polynomially computable bounds for the associated tail distributions. Namely, we provide methods for computing an upper bound on the p-value of mutual enrichment at the top of two permutations drawn uniformly and independently from the group of permutations. We apply mutual enrichment assessment to study motifs in genomics measurement data.
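The two-list setting can be sketched in the same spirit. The code below assumes one natural scoring choice: minimize the hypergeometric tail over every pair of cutoffs (n1, n2), where the overlap is the number of elements shared by the top n1 of the first list and the top n2 of the second. The function names are hypothetical, and the p-value is a seeded Monte Carlo estimate used as a stand-in; it is not the polynomially computable upper bound derived in the thesis.

```python
import random
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B marked, n drawn)."""
    upper = min(n, B)
    favourable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favourable / comb(N, n)


def mutual_enrichment_score(rank_a, rank_b):
    """Minimum hypergeometric tail over all cutoff pairs (n1, n2).

    rank_a, rank_b: the same elements, each ordered from best to worst.
    """
    N = len(rank_a)
    pos_in_b = {element: i for i, element in enumerate(rank_b)}
    marked = [False] * N  # marked[i]: element at position i of rank_b is in top n1 of rank_a
    best = 1.0
    for n1 in range(1, N + 1):
        marked[pos_in_b[rank_a[n1 - 1]]] = True
        overlap = 0
        for n2 in range(1, N + 1):
            overlap += marked[n2 - 1]
            best = min(best, hypergeom_tail(N, n1, n2, overlap))
    return best


def monte_carlo_pvalue(rank_a, rank_b, trials=200, seed=0):
    """Estimate the p-value by shuffling one list (assumed null: uniform relative order)."""
    rng = random.Random(seed)
    observed = mutual_enrichment_score(rank_a, rank_b)
    shuffled = list(rank_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if mutual_enrichment_score(rank_a, shuffled) <= observed + 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
    print(mutual_enrichment_score(genes, genes))
    print(monte_carlo_pvalue(genes, genes))
```

Sampling only approximates the tail and cannot resolve very small p-values without a very large number of shuffles, which is one reason a computable upper bound is attractive in practice.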
In my talk, I will describe the algorithms and the statistical methods used for assessing enrichment in ranked lists, and demonstrate their utility in the domain of motif search. I will also briefly describe some biological results obtained through the application of these methods.