## Drona.csa.iisc.ac.in

Mach Learn (2010) 81: 333–357DOI 10.1007/s10994-010-5185-8

**Learning to rank on graphs**
Received: 31 July 2008 / Revised: 27 December 2009 / Accepted: 28 March 2010 /Published online: 29 May 2010 The Author(s) 2010

**Abstract **Graph representations of data are increasingly common. Such representations

arise in a variety of applications, including computational biology, social network analysis,

web applications, and many others. There has been much work in recent years on developing

learning algorithms for such graph data; in particular, graph learning algorithms have been

developed for both classification and regression on graphs. Here we consider graph learning

problems in which the goal is not to predict labels of objects in a graph, but rather to rank

the objects relative to one another; for example, one may want to rank genes in a biological

network by relevance to a disease, or customers in a social network by their likelihood of

being interested in a certain product. We develop algorithms for such problems of learning

to rank on graphs. Our algorithms build on the graph regularization ideas developed in the

context of other graph learning problems, and learn a ranking function in a reproducing ker-

nel Hilbert space (RKHS) derived from the graph. This allows us to show attractive stability

and generalization properties. Experiments on several graph ranking tasks in computational

biology and in cheminformatics demonstrate the benefits of our framework.

**Keywords **Graphs · Networks · Ranking · Transductive learning · Graph regularization

In an increasing number of applications of machine learning, it is of interest to analyzedata represented in the form of a graph or network structure over objects. In computationalbiology, it is of interest to analyze protein interaction data, generally represented as a graphof pair-wise interactions between proteins. In social network analysis, it is of interest to
Editor: Tong Zhang.

A preliminary version of this paper appeared in the Proceedings of the 23rd International Conferenceon Machine Learning (ICML) in 2006.

Massachusetts Institute of Technology, Cambridge, MA 02139, USAe-mail:
Mach Learn (2010) 81: 333–357
analyze social interaction data, in this case represented as a graph of interactions betweenindividuals or organizations. In web applications, it is of interest to analyze hyperlink data,which results in graphs over webpages. In addition, as is now well known, if the inputdata resides in a high-dimensional space but comes from an underlying low-dimensionalmanifold, it is often best represented as a graph that captures the local geometry of the inputspace (Tenenbaum et al. ; Roweis and Saul Belkin and Niyogi ).

There has been much work in developing learning algorithms for such graph or net-
work data, ranging from designing kernels for graphs (Kondor and Lafferty Smolaand Kondor ) to developing algorithms for classification and regression on graphs(Belkin and Niyogi Belkin et al. ; Zhou and Schölkopf Zhou et al. Herbster et al. Johnson and Zhang ). In particular, we now have a well-developedtheory for such learning problems on graphs, including a mathematical theory of graph reg-ularization (Smola and Kondor Belkin et al. ; Zhou and Schölkopf
In many graph learning problems, however, the goal is not to predict a class or real-
valued label for each object in the graph, but rather, to rank or prioritize objects in thegraph relative to one another. In computational biology, one often wants to rank or prioritizegenes or proteins such that those that are more likely to be involved in a given disease orbiological pathway appear at the top of the ranking (Morrison et al. Aerts et al. Ma et al. Agarwal and Sengupta ); the top few genes or proteins can then besubjected to biological tests to further elucidate their functional and structural properties. Insocial network analysis, one often wants to rank individuals, for example by their likelihoodof being interested in a particular product; depending on the available resources, the top

*k*individuals for some appropriate

*k *can then be targeted for advertising. In web applications,a common goal is to rank webpages by relevance to a query.

In this paper, we investigate the problem of learning to rank on graphs. There has been
much interest in ranking problems in machine learning in recent years, both due to the factthat they are distinct from the classical learning problems of classification and regression,and due to their widespread applications, ranging from information retrieval to collaborativefiltering and from computational biology to drug discovery (Cohen et al. Freund et al.

Herbrich et al. Joachims ; Crammer and Singer Agarwal et al. Burges et al. ; Rudin et al. ; Agarwal Cortes et al. ; Clemencon et al.

Cossock and Zhang ; Agarwal and Niyogi Agarwal and Sengupta Ranking problems on graphs, however, have received limited attention in the community,some notable exceptions in addition to our own earlier work (Agarwal being that ofZhou et al. () and, more recently, that of Agarwal and Chakrabarti
We note that our work differs from standard graph ranking algorithms such as PageRank
(Brin and Page ) and HITS (Kleinberg ). In particular, these algorithms yield afixed ranking for any graph and do not involve any learning from examples. In contrast, weare interested in situations where one is given examples of preferences among objects in thegraph (such as examples of one document being more relevant to a topic than another), andthe goal is to learn a ranking over the remaining objects in the graph that takes into accountthese preferences; in this case, different sets of preferences will yield different rankingson the same graph. (We note that there are some ‘personalized' variations of PageRankthat can be biased towards certain preferred objects; however this is different from learningfrom general pair-wise preferences as we do here.) Our work also differs from the above-mentioned works of Zhou et al. () and Agarwal and Chakrabarti (). In particular,Zhou et al. (consider problems in which one is given only examples of ‘positive' or‘query' objects (such as documents related to a specific topic), and the goal is to rank theremaining objects in the graph by relevance to these query objects. Agarwal and Chakrabarti
Mach Learn (2010) 81: 333–357
() incorporate pair-wise preferences; however they do so using an algorithm that isderived from PageRank-based considerations and that models rankings on graphs via edgeprobability matrices.

Here we build on the graph regularization ideas of Smola and Kondor Belkin
et al. Zhou and Schölkopf (), Zhou et al. (to develop algorithms forlearning ranking functions on graphs. The graph regularizers used in the algorithms inducea graph kernel; these algorithms can therefore be viewed as learning a ranking function in areproducing kernel Hilbert space (RKHS) derived from the graph, which allows us to showattractive stability and generalization properties for these algorithms using recent results ofAgarwal and Niyogi Experiments on several graph ranking tasks in computationalbiology and in cheminformatics show that our graph ranking algorithms yield better perfor-mance than the corresponding classification or regression algorithms on the same graphs.

The graphs used in our experiments are shown in Fig.
The paper is organized as follows. Section describes formally the problem of learning
to rank on graphs, and gives several examples of specific settings of such problems. Nextwe describe our graph ranking algorithms, first for undirected graphs in Sect. , then fordirected graphs in Sect. Section discusses stability and generalization properties ofour algorithms. Section provides experimental results; we conclude with a discussion inSect.

**2 Problem formulation**
Consider the following problem setting: we are given a weighted

*data graph G *=

*(V , E, w)*,where

*V *= {

*v*1

*, . . , vn*} is a finite set of vertices corresponding to objects or data points,

*E *⊆

*V *×

*V *is a set of edges, and

*w *:

*E*→R+ is a weight function; and in addition, weare given a small number of examples of preferences or order relationships among objectsin

*V *. The set of preferences or order relationships among elements of

*V *can be representedas a (directed) weighted graph of its own, which we shall call the

*preference graph *anddenote by

*Γ *=

*(V , Σ, τ )*, where

*Σ *⊆

*V *×

*V *and

*τ *:

*Σ*→R+; the interpretation is that if

*(vi, vj ) *∈

*Σ*, then

*vi *is to be ranked higher than (is preferred over)

*vj *, and the penalty formis-ordering such a pair is given by

*τ (vi, vj )*. The goal is to learn from

*G, Γ * a rankingfunction

*f *:

*V *→R that ranks accurately the remaining objects in

*V *; here

*f *is consideredto rank objects

*vi *∈

*V *in descending order of the scores

*f (vi)*.

The edges

*E *and weights

*w *in the data graph

*G *will typically denote similarities or
interactions between objects in

*V *, for example interactions between proteins in a proteininteraction graph or similarities between webpages in a web similarity graph. These are oftensymmetric, resulting in an undirected graph; but as we will see, in some applications thesesimilarities or interactions can be asymmetric, which requires us to consider directed graphsas well. The preference graph

*Γ *can come from a variety of considerations. For example, inweb applications, one may observe a user's clicking behavior and infer preferences amongwebpages from these clicks; in an online shopping application, one may record a user'sviewing times for different products and use these to infer preferences among products.

Often, the preferences in

*Γ *are derived from relevance labels associated with individualobjects in

*V *; we give some common examples of this situation below.

*Example 1 *(Ranking with Binary Labels) Here the objects

*vi *∈

*V *are associated with binarylabels

*yi *= +1 (denoting a ‘positive' object, such as a relevant gene) or

*yi *= −1 (denoting a‘negative' object, such as an irrelevant gene); the learner is given examples of a few positive

Mach Learn (2010) 81: 333–357
**Fig. 1 **Diagrams depicting the graphs used in our graph ranking experiments. (**a**) An *S. cerevisiae *(yeast)

protein interaction graph. This is an unweighted, undirected graph, with the presence of an edge indicating an

interaction between two proteins. The isolated vertices correspond to proteins that are given to interact only

with themselves. The ranking task here was to rank kinases above non-kinases. (**b**) A protein similarity graph.

This is a weighted, directed graph, with weights representing (approximate) similarities, with each weight

taking a value in [0*, *1]. The actual graph used in our experiments is a complete graph; the edges shown here

correspond to weights greater than 1*/*2. The ranking task here was to obtain a hierarchical ranking of proteins

in the graph with respect to a specific protein family. (**c**) and (**d**) Chemical similarity graphs over inhibitors

of (**c**) cyclooxygenase-2 (COX2) and (**d**) dihydrofolate reductase (DHFR). These are weighted, undirected

graphs, again with each weight taking a value in [0*, *1]. Again, the actual graphs used in our experiments are

complete graphs; the edges shown here correspond to weights greater than 1*/*2. The ranking task here was to

rank chemical compounds in the graphs according to their activity with respect to the corresponding target.

Further details of these graphs and our experiments are given in Sect.

objects {*vi *: *i *∈ *S*+} and a few negative objects {*vj *: *j *∈ *S*−} for some *S*+*, S*− ⊂ {1*, . . , n*},and the goal is to learn a ranking function *f *: *V *→R that ranks positive objects in *V *higherthan negative objects. In such ranking problems, also known as *bipartite *ranking problems(Freund et al. Agarwal et al. the training examples can be viewed as expressinga preference for each object in *S*+ over each object in *S*−, with a constant mis-rankingpenalty for each such positive-negative pair. The preference graph in this case is given by*Γ *= *(V , Σ, τ )*, where *Σ *= {*(vi, vj ) *: *i *∈ *S*+*, j *∈ *S*−}, and *τ (vi, vj ) *= 1 for all *(vi, vj ) *∈ *Σ*.

*Example 2 *(Ranking with Ordinal Labels) In some cases, the objects *vi *∈ *V *are associ-ated with ordinal labels or ratings *yi *∈ {1*, . . , k*} for some *k > *2 (where a larger value
Mach Learn (2010) 81: 333–357
of *yi *denotes a higher rating, such as with ratings of books using 1 to *k *stars); for each1 ≤ *r *≤ *k*, the learner is given examples of a few objects of rating *r*, say {*vi *: *i *∈ *Sr *}for some *Sr *⊂ {1*, . . , n*}, and the goal is to learn a ranking function *f *: *V *→R thatranks highly-rated objects in *V *higher than lower-rated objects. In such ranking prob-lems, also referred to as *ordinal *or *k-partite *ranking problems (Herbrich et al. Rajaram and Agarwal the training examples can be viewed as expressing a pref-erence for each higher-rated object over each lower-rated one, the penalty for mis-rankingsuch a pair being proportional to the difference in their ratings. The preference graph in thiscase is given by *Γ *= *(V , Σ, τ )*, where *Σ *= {*(vi, vj ) *: *i *∈ *Sr , j *∈ *Ss, *1 ≤ *s < r *≤ *k*}, and*τ (vi, vj ) *= *yi *− *yj *for all *(vi, vj ) *∈ *Σ*.

*Example 3 *(Ranking with Real-Valued Labels) Here the objects *vi *∈ *V *are associated withreal-valued relevance labels *yi *∈ R (such as biological activities of chemical compoundswith respect to some therapeutic target); the learner is given examples of a few objectstogether with their relevance labels, say {*(vi, yi) *: *i *∈ *S*} for some *S *⊂ {1*, . . , n*}, and thegoal is to learn a ranking function *f *: *V *→R that ranks higher-relevance objects in *V *higherthan lower-relevance objects. In such ranking problems, studied recently by Cortes et al.

() and Agarwal and Niyogi (), the training examples can be viewed as expressinga preference for each higher-relevance object over each lower-relevance one, the penaltyfor mis-ranking such a pair being proportional to the difference in their relevance values.

The preference graph in this case is given by *Γ *= *(V , Σ, τ )*, where *Σ *= {*(vi, vj ) *: *i, j *∈ *S*,*yi > yj *}, and *τ (vi, vj ) *= *yi *− *yj *for all *(vi, vj ) *∈ *Σ*.

As discussed above, given a data graph *G *and a preference graph *Γ *, our goal is to
learn from *G, Γ * a ‘good' ranking function *f *: *V *→R. We note that since *V *is assumed

to be finite and known, and our goal is to learn a ranking function only over this set, our

formulation of the problem of learning ranking functions on graphs falls under the setting

of transductive learning (Vapnik ; Joachims ; Belkin and Niyogi Zhou et

al. ). We also note that since *V * = *n*, we can represent any function *f *: *V *→R as a

column vector **f **∈ R*n *with *i*th element *fi *= *f (vi)*. We shall use these two representations

interchangeably in the rest of the paper.

A good ranking function is a function that makes few ranking mistakes. Assuming that
ties are broken uniformly at random, the expected ranking loss or penalty incurred by afunction *f *: *V *→R on a pair *(vi, vj ) *∈ *Σ *is given by
*(f, vi, vj ) *= *τ (vi, vj ) *· **I**{*f*

*i <fj *} + 1
*i *=*fj *}
where **I**{*φ*} is an indicator variable that takes the value 1 if the predicate *φ *is true and 0

otherwise. The (empirical) ranking error of *f *with respect to the preference graph *Γ *can

then be defined as

er*Γ (f ) *= 1
*(f, vi, vj ).*
*Σ * *(vi,vj)*∈*Σ*
In the following two sections, we develop regularization-based algorithms for learning from*G, Γ * a ranking function *f**G,Γ * that minimizes an approximate version of the above rank-ing error. In particular, our algorithms minimize regularized versions of a convex upperbound on the above ranking error with respect to the preference graph *Γ *; the regularizerswe use encourage smoothness of the learned function with respect to the data graph *G*.

Mach Learn (2010) 81: 333–357
**3 Ranking on undirected graphs**

In this section we treat the case when the data graph *G *= *(V , E, w) *is undirected, so that*(vi, vj ) *∈ *E *⇒ *(vj , vi) *∈ *E *and *w(vi, vj ) *= *w(vj , vi) *for all *(vi, vj ) *∈ *E*. The case of di-rected graphs is treated in Sect.

Our goal is to find a ranking function *f *: *V *→R that minimizes a suitably regularized
version of the ranking error
er*Γ (f ) *with respect to the preference graph *Γ *; in particular, we
would like to find a function that minimizes a suitable combination of the ranking error and aregularization term that penalizes complex functions. Due to its discrete nature, minimizingdirectly an objective function involving the ranking error
er*Γ (f ) *is NP-hard; we shall use
instead a convex upper bound on the ranking error. In particular, consider the followingranking loss, which we refer to as the *hinge *ranking loss due to its similarity to the hingeloss used in classification:
h*(f, vi, vj ) *= *τ (vi, vj ) *− *(fi *− *fj ) *+*,*
where for *a *∈ R, we have
*a, *if *a >*0*,*
The hinge ranking loss h*(f, vi, vj ) *is clearly convex in *f *and upper bounds the ranking loss*(f, vi, vj )*. Our algorithms will minimize a regularized version of the following h-error:
erh *(f ) *= 1
h*(f, vi , vj ).*
*Σ * *(vi,vj)*∈*Σ*
In particular, we would like to use a regularizer *SG(f ) *that encourages smoothness of thelearned function with respect to the underlying data graph *G*; this will encourage general-ization to other objects in the graph that are not included in the preference examples *Σ *.

Several such graph regularizers have been proposed in recent years (Smola and Kondor
Belkin et al. ; Zhou and Schölkopf ). These regularizers take the form
*SG(f ) *= **f***T ***Sf**

for an appropriate symmetric, positive semi-definite matrix **S **∈ R*n*×*n *derived from *G*. While

we could use any such regularizer in our algorithms, for most of this paper, we shall focus

on the Laplacian regularizer, which has the above form with **S **= **L **or **S **=

**L**, where **L **and

**L **are the unnormalized and normalized versions of the Laplacian matrix of the graph *G*,

respectively. We do so for two reasons: first, the Laplacian matrix has a nice extension to

directed graphs, which we shall discuss in Sect. ; and second, as discussed by Smola and

Kondor (), any regularizer that is invariant to permutations of vertices of the graph (in

a certain well-defined sense) is necessarily a function of the Laplacian.

The unnormalized Laplacian matrix of the graph *G *is defined as
**L **= **D **− **W***,*

where **W **is the weight matrix given by

*w(vi,vj), *if *(vi,vj)*∈*E,*
Mach Learn (2010) 81: 333–357
and **D **is the diagonal degree matrix **D **= diag*(di)*, with diagonal entries *di *given by

It is easy to show that for any *f *: *V *→R, the Laplacian regularizer using the above matrix

**L **can be written as

*SG(f ) *= **f***T ***Lf **= 1

*w(vi, vj )(fi *− *fj )*2*.*
This quantity is large for functions *f *that assign very different scores to objects in the graphthat have high similarity or interaction weights, and is small for functions that assign similarscores to such objects. Therefore, selecting a function *f *: *V *→R with a small value of theabove regularizer ensures that *f *does not vary rapidly across similar objects in the graph; inother words, that the function is smooth with respect to the graph *G*.

As in Smola and Kondor Zhou and Schölkopf we use here the normalized
version of the Laplacian matrix, which is given by
**L **= **D**−1*/*2**LD**−1*/*2 = **I***n *− **D**−1*/*2**WD**−1*/*2*,*

where **I***n *denotes the *n *× *n *identity matrix and where we have assumed *di > *0 for all *i*. The

normalized Laplacian is often preferred over the unnormalized version due to its attractive

spectral properties (Chung Smola and Kondor ); it also forms the basis for the

extension to directed graphs, discussed in Sect.

Thus, given the data graph *G *and preference graph *Γ *, our algorithm for learning a rank-
ing function *f**G,Γ * : *V *→R minimizes a combination of the empirical h-error and the nor-malized Laplacian regularizer as follows:
erh *(f ) *+ *λ***f***T *

**Lf ***,*

*f *:*V *→R
for some suitable regularization parameter *λ > *0.

In practice, the above optimization problem can be solved by reduction to a convex
quadratic program (QP), much as is done in support vector machines (SVMs). In partic-ular, introducing a slack variable *ξij *for each ordered pair *(vi, vj ) *∈ *Σ*, we can re-write theabove optimization problem as follows:
**f***T *

**Lf **+ *C*

*Σ* *(vi,vj)*∈*Σ*
*ξij *≥ *τ (vi, vj ) *− *(fi *− *fj ) *∀*(vi, vj ) *∈ *Σξij *≥ 0 ∀*(vi, vj ) *∈ *Σ,*
Mach Learn (2010) 81: 333–357
where *C *= 1*/(*2*λ)*. Introducing Lagrange multipliers and taking the Lagrangian dual thenresults in the following (convex) QP in *Σ* variables {*αij *: *(vi, vj ) *∈ *Σ*}:
*ij τ (vi , vj )*
2 *(vi,vj)*∈*Σ (vk,vl)*∈*Σ*
*(vi ,vj )*∈*Σ*
0 ≤ *αij *≤ *C*
*(vi, vj ) *∈ *Σ,*
*L*+ denotes the *(i, j )*th element of
**L**+, the pseudo-inverse of

**L**.It can be shown that,

on solving the above QP for {*αij *}, the solution **f***G,Γ * ∈ R*n *to the original problem is found

as

**f***G,Γ * =

**L**+ **a**+ − **a**− *,*

where **a**+*, ***a**− ∈ R*n *are given by

*j *:*(vi ,vj )*∈*Σ*
*i*:*(vi ,vj )*∈*Σ*
Thus, given *G, Γ *, our algorithm for learning a ranking function **f***G,Γ * consists of solving

the QP in ) for {*αij *}, and then using these to obtain **f***G,Γ * using ).

There are several observations to be made. First, note from that the ranking function
**f***G,Γ * returned by the above algorithm lies in the column-space of

**L**+, which is defined

as {**f **∈ R*n *: **f **=

**L**+**u **for some **u **∈ R}. This column-space, together with the inner product

**f***, ***g** = **f***T *

forms an RKHS with kernel
**L**+. Indeed, that it is a Hilbert space follows from

positive semi-definite. The reproducing property is also easily verified: let
**L**+ denote the *i*th

column vector of
**L**+ and **e***i *∈ R*n *denote a column vector with 1 in the *i*th position and 0

elsewhere; then for any **f **=

**L**+**u**, we have:

**f***,*

**L**+ = **f***T *

**L**+ = *(***u***T *

**L**+*)*

**L***(*

*i ) *= **u***T (*

**L**+*)***e***i *= **u***T *

**L**+**e***i *= **u***T *

Thus, the algorithm above can be viewed as learning a ranking function in the above RKHS
**L**+; indeed, the regularizer **f***T *

**Lf **used in the algorithm then corresponds to the

(squared) norm of **f **in this RKHS.

Second, as noted above, one could in principle use other graph regularizers of the form
**f***T ***Sf**, with **S **an appropriate positive semi-definite matrix derived from *G*; the algorithm

would take a similar form as above with

**L**+ being replaced by **S**−1, the inverse of **S **if **S **is

invertible and its pseudo-inverse otherwise. This would induce a different RKHS, given by

the column space of **S**−1 together with the inner product **f***, ***g** = **f***T ***Sg**, and would correspond

to learning a ranking function using a different graph kernel, given by **S**−1.

Third, we note that the optimization problems derived above resemble the RankSVM
formulations of Herbrich et al. () and Joachims although their formulations did
1The Laplacian matrix
**L **is known to be singular: indeed, it is easy to see that the unnormalized Laplacian **L**

is singular, since all its rows sum to zero; that
**L **is singular then follows from its definition in terms of **L**.

Mach Learn (2010) 81: 333–357
not involve learning with graphs and did not involve preference weights *τ *. Indeed, in viewof the above observations, the graph ranking algorithm above can be viewed as an extensionof RankSVM to graphs using a graph kernel derived from *G*.

Finally, we note that solving the QP in () using a standard QP solver can take *O(* *Σ* 3*)*
time. This can be prohibitive; for example, in ranking with real-valued labels, if we are givenlabels for *m *objects in the graph (see Sect. Example all of which have distinct labels,
then we have *Σ* = *m *, giving *O(m*6*) *time. In our experiments, we used a gradient projec-
tion algorithm for solving the above QP that is considerably more efficient; this algorithmcan be useful for regular RankSVM implementations as well, and is outlined in Appendix .

**4 Ranking on directed graphs**

The case when the data graph *G *= *(V , E, w) *is directed, so that *E *and *w *are asymmetric,can be treated similarly to the undirected case. In particular, the goal is the same: to find aranking function *f *: *V *→R that minimizes a suitably regularized convex upper bound onthe empirical ranking error
er*Γ (f ) *with respect to the preference graph *Γ *.

The convex upper bound on
er*Γ (f ) *can be chosen to be the same as before, namely,
the h-error, which was denoted by
erh *(f )*. The goal is then again to solve an optimization
problem of the form
erh *(f ) *+ *λS*
*f *:*V *→R
for some suitable regularizer *SG(f ) *(and regularization parameter *λ > *0). This is where thetechnical difference lies: there has been limited work on regularization for directed graphs,and indeed, in the form described so far, the Laplacian regularizer used in the previoussection applies only to undirected graphs.

Recently, however, an analogue of the Laplacian was proposed for directed graphs
(Chung ). This shares many nice properties with the Laplacian for undirected graphs,and in fact can also be derived via discrete analysis on directed graphs (Zhou et al. It is defined in terms of a random walk on the given directed graph. In particular, given the(weighted) directed graph *G*, let *d*+ denote the out-degree of *v*
*j *:*(vi ,vj )*∈*E*
If the directed graph *G *is strongly connected and aperiodic, one can consider the random

walk over *G *with transition probability matrix **P **defined as follows:

if *(vi, vj ) *∈ *E*,
In this case, the above random walk has a unique stationary distribution *π *= *(π*1*, . . , πn)T*with *πi > *0 for all *i*. The Laplacian of *G *is then defined
**L **= **I***n *− *Π*1*/*2**P***Π*−1*/*2 + *Π*−1*/*2**P***T Π*1*/*2 *,*

2We continue to use the notation
**L **in this section as the Laplacian defined here for directed graphs extends

the normalized Laplacian for undirected graphs.

Mach Learn (2010) 81: 333–357
where *Π *is the diagonal matrix *Π *= diag*(πi)*. In the case when *G *is not strongly connected

or is periodic, one can use what is termed a *teleporting *random walk, which effectively

allows one to jump to a random vertex (chosen uniformly from the vertices distinct from the

current vertex) with some small probability *η *(Zhou et al. the probability transition

matrix **P***(η) *for such a walk is given by

*(*1−*η)Pij *+*η( *1 *), *if *i *=*j*,
*(*1 − *η)Pij ,*
if *i *= *j *.

Such a teleporting random walk always converges to a unique and positive stationary distri-bution *π (η)*, and therefore the Laplacian
**L **for a general directed graph (and appropriate *η*)

can be defined as in using **P***(η) *and the corresponding *π (η) *in place of **P **and *π *.

As discussed by Zhou et al. ), the Laplacian matrix
**L **constructed as above can be

used in exactly the same way as in the undirected case to define a smoothness regularizer

*SG(f ) *= **f***T *

**Lf **appropriate for functions defined on the vertices of a directed graph. Thus,

the algorithmic framework developed for ranking on undirected graphs can be applied inexactly the same manner to the directed case, except for the replacement with the appropriateLaplacian matrix.

**5 Stability and generalization properties**

The RKHS view of our graph ranking algorithms allows us to show attractive stability andgeneralization properties of these algorithms using results of Agarwal and Niyogi In particular, we will consider the setting of ranking with real-valued labels (see Sect. ,Example similar results can be shown for other settings such as ranking with binary orordinal labels (Sect. , Examples and ).

Let *G *= *(V , E, w) *be a data graph. In the setting of ranking with real-valued labels, the
objects *vi *∈ *V *are associated with real-valued relevance labels *yi *∈ R; we shall assume thatthe labels are bounded and take without loss of generality *yi *∈ [0*, M*] for some *M > *0. Thelearner is given a training sample consisting of examples of objects in *V *together with theirrelevance labels, which we shall denote here as *R *= {*(vi , y ), . . , (v , y )*} ⊂ *V *×[0*, M*];
the goal is to learn from *R *a ranking function *f *: *V *→R that ranks accurately the remainingobjects in *V *. We shall assume that each *(vi , y ) *∈ *R *is drawn randomly and independently
according to some underlying joint distribution *D *over *V *× [0*, M*]. We shall further assumethat a set *Y *⊆ [0*, M*] of positive measure is included in the support of the marginal of *D*over the label space [0*, M*]; this ensures that with probability 1, all labels in *R *are distinct,
and therefore the resulting preference graph *Γ *= *(V , Σ, τ ) *contains *m *edges.

Overloading notation, let the loss or penalty incurred by a ranking function *f *: *V *→R on
any pair of labeled objects *(v, y), (v**, y**) *∈ *V *× R be
*(f, (v, y), (v**, y**)) *= *y *− *y* · **I**{*(y*−*y**)(f (v)*−*f (v**))<*0} + 1 **I**{*f (v)*=*f (v**)*} *.*

Then for any ranking function *f *: *V *→R, we can write the empirical error of *f *with respectto (the preference graph resulting from) a training sample *R *∈ *(V *× [0*, M*]*)m *as above as
Mach Learn (2010) 81: 333–357
Also, define the expected error of *f *with respect to *D *as
er*D(f ) *= **E***((v,y),(v**,y**))*∼*D*×*D f, (v, y), (v**, y**) .*

Moreover, for any *R *∈ *(V *× [0*, M*]*)m *as above, we will denote by *fR *: *V *→R the rank-ing function learned by our graph ranking algorithm (the algorithm of either Sect. orSect. depending on whether *G *is undirected or directed) from the preference graph re-sulting from *R*. Then using results of Agarwal and Niyogi (), we can show that theexpected error of the learned function, er*D(fR)*, can with high probability (over the drawof *R*) be upper bounded in terms of a quantity related to the empirical error,
We first have the following stability result for our algorithms, which shows that a small
change in the training sample has limited effect on the learned ranking function:
**Theorem 1 **(Stability of Graph Ranking) *Let R *∈ *(V *× [0*, M*]*)m and R* ∈ *(V *× [0*, M*]*)m*

differ in only one example. *Then the functions fR *: *V *→R *and fR* : *V *→R *satisfy the fol-*

lowing for all v ∈ *V *:

*R (v) *− *fR* *(v)* ≤ 8 max1≤*i*≤*n*
The above result follows from Agarwal and Niyogi (, Theorem 11), which estab-
lishes stability of all kernel-based ranking algorithms that minimize a suitable ranking error
with regularization in an RKHS (recall that
**L**+ can be viewed as the kernel matrix in our

case), and from the fact that the hinge ranking loss optimized by our algorithms, which inthe above setting can be written as
h*(f, (v, y), (v**, y**)) *= *y *− *y* − sign*(y *− *y**) *· *(f (v) *− *f (v**)) *+*,*
is 1-admissible, in the sense that for any *f*1*, f*2 : *V *→R and any *(v, y), (v**, y**) *∈ *V *× R,
h*(f*1*, (v, y), (v**, y**)) *− h*(f*2*, (v, y), (v**, y**)) *≤ *f*1*(v) *− *f*2*(v)* + *f*1*(v**) *− *f*2*(v**)* *. *(28)
See Agarwal and Niyogi for details.

Before giving our generalization result, we need to define one more ranking loss, which
we will refer to as the *γ *ranking loss (where *γ > *0):
*γ (f, (v, y), (v**, y**))*
⎪ *y *− *y* *,*
⎨ if sign*(y *− *y**) *· *(f (v) *− *f (v**)) < *0*,*
= ⎪ *y *−*y* − 1sign*(y *−*y**) *·*(f(v)*− *f(v**)),*
⎪ if 0 ≤ sign*(y *− *y**) *· *(f (v) *− *f (v**)) *≤ *γ * *y *− *y* *,*
For any *γ > *0, the corresponding empirical *γ *-error of a ranking function *f *: *V *→R withrespect to a training sample *R *∈ *(V *× [0*, M*]*)m *as above can be written as
er*γ (f ) *= 1
Then we have the following generalization result for our algorithms:
Mach Learn (2010) 81: 333–357
**Theorem 2 **(Generalization Bound for Graph Ranking) *Let γ > *0. *Then for any *0 *< δ < *1,

*with probability at least *1 − *δ over the draw of R *∈ *(V *× [0*, M*]*)m *(*according to Dm*),

2 ln*(*1*/δ)*
*D(fR) < *
*R ) *+ 32 max1≤*i*≤*n*
The above result follows from Agarwal and Niyogi Theorem 8) and from the
stability result in Theorem above.

Theorems and above apply to our algorithms for both directed and undirected graphs,
**L**+ referring to the pseudo-inverse of the appropriate Laplacian matrix in each case.

For the case of connected, undirected graphs, the diagonal elements of this pseudo-inverse,
*L*+, can be further bounded in terms of the diameter of the graph:
**Theorem 3 **(Diameter Bound for Connected, Undirected Graphs) *Let G *= *(V , E, w) be*

a connected, *weighted*, *undirected graph*, *and let *

**L ***be the *(*normalized*) *Laplacian matrix*

*of G*. *Let d *= max1≤*i*≤*n di and w*min = min*(i,j)*∈*E w(i, j )*, *and let ρ be the unweighted diam-eter of G*, *i.e.*, *the length *(*number of edges*) *of the longest path between any two vertices iand j in V *. *Then for all *1 ≤ *i *≤ *n*,
*L*+ ≤ *ρd .*
The proof of the above result is based on the proof of a similar result of Herbster et al.

(), which was given for the unnormalized Laplacian of an unweighted graph; for com-pleteness, details are included in Appendix In particular, this leads to the following:
**Corollary 1 **(Generalization Bound for Ranking on Connected, Undirected Graphs) *Let G*,

*ρ*, *d*, *and w*min *be as in Theorem **and let γ > *0. *Then for any *0 *< δ < *1, *with probability*

at least 1 − *δ over the draw of R *∈ *(V *× [0*, M*]*)m *(*according to Dm*),

2 ln*(*1*/δ)*
er*D(fR) < *
*γ λmw*min
We describe below experiments on three different graph ranking tasks in computationalbiology and in cheminformatics. These tasks were chosen for their differing characteristics,both in terms of the type of ranking problem and in terms of the structure of the graph.

The first task involves an *S. cerevisiae *(yeast) protein interaction network, in which some
of the proteins are kinases; the learner is given examples of both kinases and non-kinases,and the goal is to identify kinases in the remaining proteins, i.e., to rank the remainingproteins such that kinases are ranked higher than non-kinases. This corresponds to rankingwith binary labels (see Sect. Example on an undirected graph. We will compare ourgraph ranking algorithm on this task with a corresponding graph classification algorithm.

The second task involves a directed similarity graph; this is also a graph over proteins,
drawn from a different database, with the asymmetry arising from an approximation in thesequence alignment tool used to find the similarity of one protein to another. The proteinsin this graph are associated with a hierarchical organization; we consider a ranking prob-lem in which the learner is given examples of proteins at different levels of the hierarchy
Mach Learn (2010) 81: 333–357
with respect to a specific protein family, and the goal is to rank other proteins in the graphaccording to this hierarchy. This corresponds to ranking with ordinal labels (see Sect. ,Example on a directed graph. We will compare our graph ranking algorithm on this taskwith a corresponding algorithm for ordinal regression on graphs.

The third task involves chemical similarity graphs that describe similarities between
chemical compounds. We consider two such graphs; in each case, the learner is given (real-valued) biological activities for some of the compounds, which measure the extent to whichthe compounds inhibit the activity of a particular drug target, and the goal is to rank the re-maining compounds such that those that are most active against the target are ranked higher.

This corresponds to ranking with real-valued labels (see Sect. , Example ) on an undi-rected graph; in this case the weight matrix itself is positive semi-definite, and so we do notneed to use the Laplacian. We will compare our graph ranking algorithm on this task with acorresponding graph regression algorithm.

The following sections describe in more detail the experimental setup in each case, to-
gether with our results.

6.1 Identifying kinases in a protein interaction network
Our first graph ranking task involves an *S. cerevisiae *(yeast) protein interaction networkdownloaded from the Database of Interacting Proteins (DIP).We considered only thoseproteins for which UniProt Knowledgebase (UniProtKB) identifiers were prothisresulted in an interaction graph *G *= *(V , E, w) *over 2418 proteins, containing 4538 inter-actions (edges) in all. This is an unweighted, undirected graph, in which the presence of anedge between two proteins indicates an interaction between them (see Fig. for pairsof proteins *vi, vj *that share an interaction, we took *w(vi, vj ) *= *w(vi, vj ) *= 1.

The ranking task we considered was to identify kinases in this network. A kinase is a type
of protein (specifically, enzyme) that transfers phosphate groups from a donor moleculeto another recipient molecule. Given their central role in many diseases including cancer,kinases form prominent targets for drug development, thus making their identification animportant task in biology. In order to apply machine learning methods for this task, weobtained a list of *S. cerevisiae *proteins that are known to belong to the kinase family bysearching the UniProtKB database. This gave an initial list of 183 kinases; of these, 101were present in the protein interaction network constructed above. Thus, of the 2418 proteinsin the above graph, we had 101 ‘positive' proteins corresponding to the kinases, and 2317‘negative' proteins corresponding to the non-kinases. This gave rise to a ranking task withbinary labels (see Sect. Example ); in particular, given a small number of examples ofboth kinases and non-kinases, our goal was to rank the remaining proteins such that kinasesare ranked higher than non-kinases.

We considered a number of different splits of the 2418 proteins in the graph into training
and test sets. In particular, we considered training sets of sizes *m *∈ {120*, *240*, *360*, *480*, *600};for each size *m*, we used 10 random splits of the proteins *V *into a training set {*vi *: *i *∈ *S*} ofsize *S* = *m *and a test set *V * {*vi *: *i *∈ *S*} of size *(*2418 − *m)*, subject to a constant propor-tion of positives and negatives in each set. Specifically, for *m *= 120, we included 5 positiveexamples and 115 negative examples; for *m *= 240, we included 10 positive examples and
3The DIP database catalogs experimentally determined interactions between proteins, and is available at. We used the core *S. cerevisiae *data set dated October 14, 2008.

4The UniProtKB database is a central repository for functional information on proteins, and is available at
Mach Learn (2010) 81: 333–357
230 negative examples; and so on. As described in Example this results in a preferencegraph *Γ *= *(V , Σ, τ ) *with *Σ* = *S*+ *S*− preferences, where *S*+*, S*− denote the positive andnegative training examples in *S*, respectively. For each such training set *S *(equivalently, pref-erence graph *Γ *), we used the graph ranking algorithm described in Sect. to learn a rankingfunction *f *: *V *→R. For comparison, we also used the support vector machine (SVM) algo-rithm using the same Laplacian graph kernel to learn a binary classifier *h *: *V *→{−1*, *+1}:this has the form *h(vi) *= sign*(f (vi) *+ *b) *for some *f *: *V *→R and *b *∈ R; we used theunderlying real-valued function *f *to rank the remaining proteins in the graph. The trade-off parameter *C *for both algorithms was selected from the range {0*.*1*, *1*, *10*, *100*, *1000}using 5-fold cross-validation; in each case, the parameter value that gave the lowest aver-age ranking error across the five folds was selected for training. The number of iterations*t*max in the gradient projection algorithm for ranking (see Appendix was fixed to 1000;the learning rate *η *was selected from the range {10−6*, *10−5*, *10−4*, *10−3*, *10−2} using 5-foldcross-validation as above.

The results are shown in Fig. Figure (a) shows the results in terms of the ranking
error, which is the quantity that is approximately minimized by the graph ranking algorithm;each value shown is the average across the 10 random splits for the corresponding trainingsize *m*. We note that the ranking error in this case is simply the fraction of mis-rankedpositive-negative pairs. We also note that the area under the ROC curve (AUC), a popularranking performance measure in the binary setting, is simply one minus this ranking error(Agarwal et al. As can be seen, the graph ranking algorithm yields better rankingperformance than the corresponding classification algorithm, which uses the same graphkernel but optimizes classification accuracy.

Often in ranking, the quality of the results returned at the top of a ranked list is especially
important. One quantity used to capture this in the binary setting is the *average precision*.

In particular, let *T*+*, T*− denote the (indices of) the positive and negative test objects in *V *,respectively; then the *precision at k *of a ranking function *f *: *V *→R is defined as the pro-
**Fig. 2 **Results on the task of identifying kinases in an *S. cerevisiae *protein interaction network. This is a

bipartite ranking task on an undirected graph, the goal being to rank kinases higher than non-kinases. The

plots compare the performance of the graph ranking algorithm given in Sect. with the binary classification

SVM using the same Laplacian graph kernel; the two algorithms therefore learn a function from the same

function class. Results are shown for increasing values of the training size *m*, in terms of **(a) **ranking error

(one minus AUC); and **(b) **average precision. Each value shown in the plots is an average over 10 random

trials. For the ranking error, lower values correspond to better ranking performance; for the average precision,

higher values correspond to better ranking performance. See text for details

Mach Learn (2010) 81: 333–357
**Fig. 3 **Proteins in the Structural

Classification of Proteins (SCOP)

database are organized

hierarchically into classes, folds,

superfamilies, and families. We

considered a hierarchical ranking

task on a protein similarity graph

derived from this database. See

text for details

portion of positives in the top *k *objects returned by *f *:
prec *(f ) *= 1
*k i*∈*T*+
where *π(i) *is the position of *vi *in the ranking of test objects returned by *f *. The averageprecision of *f *is simply the average of these precision values taken at the positions ofpositive objects in the returned ranking:
AP*(f ) *= 1
+ *j*∈*T*+
Clearly, higher values of the average precision correspond to better ranking performance.

Figure shows the results in terms of the average precision. In this case, the difference inperformance of the two algorithms is smaller, although again, the graph ranking algorithmgives slightly better overall performance than the classification algorithm. Note however thatoverall, neither algorithm has good performance in terms of the average precision (the idealranking would have an average precision of 1). We discuss this further in Sect.

6.2 Hierarchical ranking of proteins in an asymmetric protein similarity graph
The second graph ranking task we consider involves a protein similarity graph derived froma different protein database, specifically, the Structural Classification of Proteins (SCOP)database.The SCOP database consists of a hierarchical organization of proteins, based ontheir 3D structure, into classes, folds, superfamilies, and families: closely related proteinsare grouped to form a protein family; related protein families form a superfamily; relatedsuperfamilies form a fold; and finally, related folds form a protein structure class (Fig. ).

The protein similarity graph we used was drawn from a graph used by Weston et al.

(),who considered a different kind of protein ranking taskon the same database. Inour experiments, we considered proteins from two classes, namely the ‘all *α*' and ‘all *β*'classes, which resulted in a similarity graph *G *= *(V , E, w) *over 3314 proteins in all. Thisis a complete, weighted, directed graph (see Fig. in which the similarity *w(vi, vj ) *of
5The SCOP database describes structural relationships between proteins, and is available at.

6The graph used by Weston et al. () is based on SCOP version 1.59 and is available at.

7The protein ranking task considered by Weston et al. ) involves ranking proteins in the graph byrelevance to a single ‘query' protein. The approach used is based on that of Zhou et al. ); see also thediscussion in Sect.
Mach Learn (2010) 81: 333–357
each protein *vi *to each other protein *vj *is based on an ‘E-value' **E***ij *computed using the

popular PSI-BLAST sequence search and alignment tool (Altschul et al. In particu-

lar, given a protein *vi *, PSI-BLAST attempts to find for each other protein *vj *in the database

the highest-scoring alignment between pairs of segments in the amino acid sequences for *vi*

and *vj *(under some appropriate alignment scoring scheme), and then returns an E-value **E***ij*

that denotes the expected number of segment pairs that would result in a similarly high-

scoring alignment if the two sequences had been drawn randomly (under an appropriate

random sequence model). Thus, the E-value can be viewed as a measure of statistical signif-

icance of the (local) similarity between the two sequences: if the sequences are similar (con-

tain segments with a high-scoring alignment), the E-value is low; and vice-versa. Following

Weston et al. (), we convert these to similarity scores by taking *w(vi, vj ) *= *e*−**E***ij /*100.

Under an exact alignment search algorithm, the E-values would satisfy **E***ij *= **E***ji*; how-

ever, due to efficiency considerations, PSI-BLAST employs an approximation that results in

asymmetric E-values, thus giving asymmetric similarity weights *w(vi, vj )*.

The ranking task we considered was to rank proteins in the above graph according to
the structural hierarchy of SCOP with respect to a specific protein family. In particular, wetook the largest protein family, an ‘all *β*' family of ‘V set domains (antibody variable like)',denoted as ‘b.1.1.1' in the SCOP database and containing 403 proteins, as our target; givena small number of examples of proteins at different levels of the hierarchy with respectto this target family, our goal was to rank the remaining proteins such that those in thetarget family would be ranked highest, those in other families but belonging to the samesuperfamily would be ranked next, and so on. This can be viewed as a ranking task withordinal labels (see Sect. Example with *k *= 5 possible ratings: a protein *vi *in the targetfamily has rating *yi *= 5; a protein in a different family but in the same superfamily hasrating *yi *= 4; a protein in a different superfamily but in the same fold has rating *yi *= 3; aprotein in a different fold but in the same class has rating *yi *= 2; and a protein in a differentclass has rating *yi *= 1. Given a small number of examples of proteins of each rating, thegoal is to rank the remaining proteins in the graph such that higher-rated proteins are rankedhigher.

We considered a number of different splits of the 3314 proteins in the graph into training
and test sets. In particular, we considered training sets of sizes *m *∈ {100*, *200*, *300*, *400*, *500};for each size *m*, we used 10 random splits of the proteins *V *into a training set {*vi *: *i *∈ *S*} ofsize *S* = *m *and a test set *V * {*vi *: *i *∈ *S*} of size *(*3314 − *m)*, subject to each having the sameproportion of proteins from the five rating classes as in the complete set *V *. As described in
Example , this results in a preference graph *Γ *= *(V , Σ, τ ) *with *Σ* =
*r * *Ss *
preferences, where for each 1 ≤ *r *≤ 5, *Sr *denotes the training examples in *S *with rating *r*.

For each such training set *S *(equivalently, preference graph *Γ *), we used the graph rank-ing algorithm described in Sect. to learn a ranking function *f *: *V *→R; in constructingthe graph Laplacian, we used a teleporting random walk with *η *= 0*.*01 (see Sect. ). Forcomparison, we also used the state-of-the-art support vector ordinal regression (SVOR)algorithm of Chu and Keerthi () using the same Laplacian graph kernel to learn anordinal regression prediction function *g *: *V *→{1*, . . , *5}: this consists of a real-valued func-tion *f *: *V *→R together with a set of thresholds *b*1*, . . , b*4 ∈ R which are used to predict arating class via *g(vi) *= min{1 ≤ *i *≤ 5 : *f (vi) *≤ *bi*}, where *b*5 = ∞; we used the underly-ing real-valued function *f *to rank the remaining proteins in the graph. We used gradientprojection algorithms for both ranking (see Appendix and ordinal regression; for bothalgorithms, the number of iterations *t*max was fixed to 1000, while the trade-off parame-ter *C *and the learning rate *η *were selected using 5-fold cross-validation from the ranges{0*.*1*, *1*, *10*, *100*, *1000} and {10−6*, *10−5*, *10−4*, *10−3*, *10−2}, respectively (in each case, the
Mach Learn (2010) 81: 333–357
**Fig. 4 **Results on the task of hierarchical ranking of proteins in an asymmetric protein similarity graph

derived from the Structural Classification of Proteins (SCOP) database. This is an ordinal ranking task on a

directed graph, the goal being to rank proteins according to their position in the SCOP hierarchy with respect

to the ‘b.1.1.1' protein family. The plots compare the performance of the graph ranking algorithm given in

Sect. with a support vector ordinal regression (SVOR) algorithm using the same Laplacian graph kernel;

the two algorithms therefore learn a function from the same function class. Results are shown for increasing

values of the training size *m*, in terms of **(a) **ranking error; and **(b) **NDCG. Each value shown in the plots is an

average over 10 random trials. For the ranking error, lower values correspond to better ranking performance;

for the NDCG, higher values correspond to better ranking performance. See text for details

parameter values that gave the lowest average ranking error across the five folds were se-lected for training).

The results are shown in Fig. Figure (a) shows the results in terms of the ranking error,
which is the quantity that is approximately minimized by the graph ranking algorithm; eachvalue shown is the average across the 10 random splits for the corresponding training size *m*.

As can be seen, the graph ranking algorithm yields better ranking performance than thecorresponding ordinal regression algorithm, which uses the same graph kernel but optimizesordinal regression prediction accuracy.

We also evaluated both algorithms in terms of the *normalized discounted cumulative gain*
(NDCG), a quantity often used in information retrieval to measure ranking quality at the topof a ranked list when there are more than two relevance levels (Järvelin and KekäläinenSpecifically, in our setting, let *T *denote the (indices of) the test objects in *V *; thenthe NDCG of a ranking function *f *: *V *→R with respect to *T *is defined as
NDCG*(f ) *= 1
log *(π(i) *+ 1*)*
where *π(i) *is the position of *vi *in the ranking of test objects returned by *f *, *yi *is the rel-evance (rating) of *vi *, and *ZT *is a normalizing constant which ensures that the maximumNDCG with respect to *T *over all functions *f *is 1. As can be seen, the NDCG accumulatesa gain value 2*yi *− 1 for each object *i *∈ *T *, discounted using a logarithmic discounting factorlog *(π(i) *+ 1*) *that depends on the position of the object in the ranked list. Clearly, higher
values of the NDCG correspond to better ranking performance. Figure (b) shows the re-sults in terms of the NDCG. In this case, the performance of the two algorithms is broadlysimilar; in particular, note that both algorithms give NDCG values close to 0.99.

Mach Learn (2010) 81: 333–357
6.3 Ranking chemical compounds by activity in chemical similarity graphs
Our third graph ranking task is from the cheminformatics domain and involves two chemical

similarity graphs. These were derived from cheminformatics data sets used by Sutherland et

al. (). Specifically, the first graph was derived from a data set containing inhibitors of

dihydrofolate reductase (DHFR). DHFR is an enzyme that is involved in the production of

folate in the human body. Folate is important for healthy cell and tissue growth; however, it

is also needed by rapidly dividing cancer cells. Therefore drugs that inhibit DHFR, thereby

interfering with folate metabolism, can be useful in treating cancer; indeed, one of the earli-

est anti-cancer drugs, methotrexate, works by inhibiting DHFR. The DHFR data set used by

Sutherland et al. consists of 361 compounds that inhibit DHFR to varying levels, to-

gether with their biological activities with respect to DHFR. Each compound in the data set

is represented as a vector of 70 chemical descriptors; we scaled each descriptor value to lie

between a minimum of 0 and a maximum of 1, and constructed a chemical similarity graph

*G *= *(V , E, w) *over the 361 compounds by taking the similarity between two compounds

*vi, vj *to be *w(vi, vj ) *= *e*−**x***i*−**x***j *2 , where **x***i, ***x***j *∈ [0*, *1]70 denote the corresponding scaled

descriptor vectors. This is a complete, weighted, undirected graph (see Fig. (c)).

The second graph was derived from a data set containing inhibitors of cyclooxygenase-2
(COX2). COX2 is an enzyme that is implicated in the generation of prostaglandins, whichare responsible for inflammation and pain; drugs that inhibit COX2 can therefore be use-ful in providing relief from these symptoms. Indeed, some of the classical non-steroidalanti-inflammatory drugs (NSAIDs), such as aspirin and ibuprofen, inhibit COX2 along withCOX1, another COX isoenzyme; some newer NSAIDs, such as celecoxib and etoricoxib,are selective inhibitors of COX2. The COX2 data set used by Sutherland et al. (con-sists of 282 compounds that inhibit COX2 to varying levels, together with their biologicalactivities with respect to COX2. Each compound in this data set is represented as a vectorof 74 chemical descriptors; again, we scaled each descriptor value to lie between 0 and 1,and constructed a chemical similarity graph *G *= *(V , E, w) *over the 282 compounds as wasdone in the case of the DHFR data set above. Again, this is a complete, weighted, undirectedgraph (see Fig.
For each of the above graphs, the ranking task we considered was to rank compounds
according to their biological activities with respect to the target of interest (DHFR or COX2).

In particular, the data sets used by Sutherland et al. include the biological activities ofthe corresponding compounds represented as pIC
value of a compound
measures its effectiveness in inhibiting a particular biological or biochemical function, andis defined as
where IC50 denotes the 50% inhibitory concentration of the compound, i.e., the concentra-tion of the compound (generally measured in mol/L) required to inhibit a given biologicalfunction (in this case, the function of DHFR or COX2) by half. Thus the higher the pIC50value of a compound, the greater its inhibition effect (and therefore biological activity). ThepIC
values in the DHFR data set range from 3.3 to 9.8; those in the COX2 data set range
from 4.0 to 9.0. In each case, we had a ranking task with real-valued labels (see Sect. ,Example ): given the pIC
values for a small number of compounds, our goal was to rank
the remaining compounds in the graph such that compounds with greater activity against thetarget would be ranked higher.

In each case, we considered a number of different splits of the compounds in the graph
into training and test sets. In particular, for both graphs, we considered training sets of sizes
Mach Learn (2010) 81: 333–357
*m *∈ {20*, *40*, *60*, *80*, *100}; for each size *m*, we used 10 random splits of the compounds *V*into a training set {*vi *: *i *∈ *S*} of size *S* = *m *and a test set *V * {*vi *: *i *∈ *S*} of size *(n *− *m)*,where *n *= 361 for the graph over DHFR inhibitors and *n *= 282 for the graph over COX2inhibitors. As described in Example , this results in a preference graph *Γ *= *(V , Σ, τ ) *with
*Σ* = {*(i, j) *: *i, j *∈ *S, yi > yj*} preferences, where *yi *∈ R denotes the biological activity

of compound *vi *. For each such training set *S *(equivalently, preference graph *Γ *), we used the

graph ranking algorithm described in Sect. to learn a ranking function *f *: *V *→R; in this

case, by construction, the weight matrix **W **is already symmetric and positive semi-definite,

so we used this as the kernel matrix (which amounts to using **f***T ***W**−1**f **as the regularizer).

For comparison, we also used the support vector regression (SVR) algorithm using the same

kernel to learn a prediction function *f *: *V *→R, and used this to rank the remaining com-

pounds in the graph. The trade-off parameter *C *for both algorithms was selected from the

range {0*.*1*, *1*, *10*, *100*, *1000} using 5-fold cross-validation; the insensitivity parameter * *in

the SVR algorithm was selected similarly from the range {0*.*01*, *0*.*05*, *0*.*1*, *0*.*5*, *1}. In each

case, the parameter values that gave the lowest average ranking error across the five folds

were selected for training. The number of iterations *t*max in the gradient projection algorithm

for ranking (see Appendix was fixed to 1000; the learning rate *η *was selected from the

range {10−6*, *10−5*, *10−4*, *10−3*, *10−2} using 5- fold cross-validation as above.

The results are shown in Fig. (DHFR) and Fig. (COX2). Figures and show
the results in terms of the ranking error, which is the quantity that is approximately mini-mized by the graph ranking algorithm; each value shown is the average across the 10 randomsplits for the corresponding training size *m*. As can be seen, in both cases, the graph rankingalgorithm yields better ranking performance than the corresponding regression algorithm,which uses the same graph kernel but optimizes prediction accuracy.

We also evaluated both algorithms in terms of the NDCG (see Sect. these results
are shown in Figs. and Again, the graph ranking algorithm gives slightly betterperformance than the regression algorithm. However the performance in terms of the NDCGdoes not always show the same monotonic behavior that is observed with the ranking error;
**Fig. 5 **Results on the task of ranking compounds in a chemical similarity graph by their inhibition activity

against dihydrofolate reductase (DHFR). The plots compare the performance of the graph ranking algorithm

given in Sect. with SVR using the same kernel; the two algorithms therefore learn a function from the same

function class. Results are shown for increasing values of the training size *m*, in terms of (**a**) ranking error;

and (**b**) NDCG. Each value shown in the plots is an average over 10 random trials. For the ranking error,

lower values correspond to better ranking performance; for the NDCG, higher values correspond to better

ranking performance. See text for details

Mach Learn (2010) 81: 333–357
**Fig. 6 **Results on the task of ranking compounds in a chemical similarity graph by their inhibition activity

against cyclooxygenase-2 (COX2). The plots compare the performance of the graph ranking algorithm given

in Sect. with SVR using the same kernel; the two algorithms therefore learn a function from the same

function class. Results are shown for increasing values of the training size *m*, in terms of (**a**) ranking error;

and (**b**) NDCG. Each value shown in the plots is an average over 10 random trials. For the ranking error,

lower values correspond to better ranking performance; for the NDCG, higher values correspond to better

ranking performance. See text for details

for example, in the case of DHFR (Fig. (b)), we see that an increase in the number oftraining examples sometimes results in a drop in NDCG. This is discussed further below.

**7 Conclusion**

Our goal in this paper has been to develop an algorithmic framework for learning rank-ing functions on graphs. Problems of ranking on graphs are increasingly common, and inmany applications, stand to benefit from a machine learning approach that can take intoaccount general preference examples. Our work builds on recent work on graph learningand regularization (Smola and Kondor Belkin and Niyogi Belkin et al. Zhou and Schölkopf Zhou et al. ; Herbster et al. Johnson and Zhang Our algorithms can be viewed as learning a ranking function in a reproducing kernel Hilbertspace (RKHS) derived from the underlying graph; using results of Agarwal and Niyogi(), we have shown that this leads to attractive stability and generalization properties.

Furthermore, we have illustrated our approach on three different graph ranking tasks incomputational biology and in cheminformatics; in each case, our graph ranking algorithmsoutperform other graph learning approaches that learn a function from the same RKHS.

There are several questions to be explored. First, we have used the hinge ranking loss
in our formulations, which results in RankSVM-type formulations (Herbrich et al. Joachims (see also Sect. It may be possible to use other ranking losses that alsoserve as convex upper bounds on the standard pair-wise ranking loss, and that lead to alter-native algorithmic formulations.

Second, for the hinge ranking loss based formulation, we have described an efficient
gradient projection algorithm for solving the dual quadratic program (QP) resulting fromthe graph ranking optimization problem; for a preference graph *Γ *= *(V , Σ, τ )*, this takes*O(n*2 */ε*2*) *time to converge to an *ε*-accurate solution, where *n*
*Σ *is the number of objects in
*V *included in *Σ *(see Appendix ). It may be possible to use for example ideas of Chapelle
Mach Learn (2010) 81: 333–357
and Keerthi to obtain yet more efficient algorithms that directly solve the primalproblem.

Finally, while our algorithms showed clear performance gains over other approaches in
terms of the pair-wise ranking error, they did not exhibit as good performance in terms of theaverage precision or NDCG. This is not surprising since the pair-wise ranking error, whichis the quantity optimized by our algorithms, measures the global ranking accuracy, whereasthe average precision and the NDCG measure local accuracy at the top of a ranking. Nev-ertheless, in many applications, ranking accuracy at the top is of considerable importance;indeed, there has been much interest recently in algorithms for optimizing ranking perfor-mance measures that focus on accuracy at the top of a returned ranking (Burges et al. Yue et al. ; Xu and Li ; Chapelle et al. ; Cossock and Zhang Taylor et al. Chakrabarti et al. ; Rudin Chapelle and Wu ). We arenot aware of any such algorithms in the context of graph ranking problems.

The author would like to thank Partha Niyogi for discussions related to regularization
and algorithmic stability, Michael Collins for discussions related to gradient-based optimization algorithms,and Shiladitya Sengupta for suggesting the kinase ranking task. Thanks also to Archana Kamal for help withplotting some of the graphs in Fig. This work was supported in part by the National Science Founda-tion (NSF) under Grant No. DMS-0732334. Any opinions, findings, and conclusions or recommendationsexpressed in this article are those of the author and do not necessarily reflect the views of the NSF.

**Appendix A: Gradient projection algorithm for solving graph ranking QP of (****)**

Consider the QP in ) that results from taking the Lagrangian dual of the optimizationproblem in our graph ranking framework. This is a QP over *Σ* variables {*αij *: *(vi, vj ) *∈*Σ *}. Let *Q(α) *denote the objective:
*αij τ (vi, vj ).*
*(vi ,vj )*∈*Σ (vk ,vl )*∈*Σ*
*(vi ,vj )*∈*Σ*
The constraint set *Ω *can be described as
*Ω *= *α *: 0 ≤ *αij *≤ *C *∀
*(vi, vj ) *∈ *Σ .*
These are simple box constraints; therefore, we can solve the QP using a gradient projectionalgorithm that starts at some initial values *α(*1*)*, and on each iteration *t *, updates *α(t) *using agradient and projection step:
*α(t*+1*) *← *PΩ α(t) *− *ηt *∇*(t) ,*
where *ηt > *0 is a learning rate; ∇*(t) *is the gradient of the objective *Q(α) *evaluated at *α(t)*;and *PΩ *denotes Euclidean projection to the above constraint set *Ω*. The projection to the boxconstraints above is straightforward: values of *αij *outside the interval [0*, C/* *Σ* ] are simplyclipped to the interval. It is well known from standard results in optimization (Bertsekas
that if *ηt *= *η/ t *for some constant *η > *0, then gradient projection converges to an*ε*-accurate solution (in our case, a solution *α *for which *Q(α) *− *Q*∗ *< ε*, where *Q*∗ denotesthe optimal value) in *O(*1*/ε*2*) *iterations. In our case, each iteration takes *O(n*2 *) *time, where
*vi *∈ *V *: *(vi, vj ) *∈ *Σ *or *(vj , vi) *∈ *Σ *for some *vj *∈ *V ,*
thus leading to *O(n*2 */ε*2*) *time for convergence to an *ε*-accurate solution. See algorithm.

Mach Learn (2010) 81: 333–357
**Algorithm **Gradient Projection for Graph Ranking QP

Preference graph *Γ *= *(V , Σ, τ )*Laplacian pseudo-inverse
*L*+ for *i, j *such that *v*
*i *and *vj *included
in preference examples *Σ*)Parameters *C, t*max*, η*
*α(*1*) *← *C/(*1000 *Σ* *) *∀*(v*
*i , vj ) *∈ *Σ *(initialize to some small values)
**For ***t *= 1 to *t*max do:

• *α(t*+1*/*2*) *← *α(t) *− *η*√ ∇*(t)*
• **For **each *(vi, vj) *∈ *Σ*:

**If ***α(t*+1*/*2*) < *0 **Then ***α(t*+1*) *← 0

**Else If ***α(t*+1*/*2*) > C/* *Σ* **Then ***α(t*+1*) *← *C/* *Σ*

**Else ***α(t*+1*) *← *α(t*+1*/*2*)*

*α(t*∗*)*, where *t*∗ = arg min1≤*t*≤*t*max+1 *Q(α(t))*
**Appendix B: Proof of Theorem **

The proof is based on the proof of a similar result of Herbster et al. (), which was givenfor the unnormalized Laplacian of an unweighted graph.

*Proof of Theorem *Since
**L**+ is positive semi-definite, we have

*L*+ ≥ 0. If
*L*+ = 0, the
result holds trivially. Therefore assume
*L*+ *> *0. Then ∃*j *such that
*L*+ *< *0 (since for all *i*,
*j *= 0; this is due to the fact that the vector *(*
*n)T *is an eigenvector
**L**+ with eigenvalue 0). Let *Pij *denote (the set of edges in) the shortest path in *G *from *vi*

to *vj *(shortest in terms of number of edges; such a path exists since *G *is connected); let *r*
be the number of edges in this path. Since **a**1 ≤

*r***a**2 for any **a **∈ R*r *, we have

√ − *il*
√ − *il*
√ ⎠ *.*
*k ,vl )*∈*Pij*
*(vk ,vl )*∈*Pij*
√ − *il*
√ − *il*
*k ,vl )*∈*Pij*
*(vk ,vl )*∈*Pij*
= *L*+*ii*
√ − *ij*
Mach Learn (2010) 81: 333–357
where the equality follows since all other terms in the sum cancel out, and the last inequalityfollows since
*L*+ *< *0. Furthermore, it is easy to show that for any **f **∈ R*n*,

**f***T *

*w(vk, vl) *√
*k ,vl )*∈*E*
*L*+ = *(*
**L**+*)T *

**L***(*

*k ,vl )*∈*E*
*k ,vl )*∈*Pij*
*k ,vl )*∈*Pij*
where the second equality follows from ((applied to **f **=

**L**+, the *i*th column of

the first inequality follows since *E *contains both *(vk, vl) *and *(vl, vk) *for all edges *(vk, vl) *∈*Pij *. Combining ), (), and (we thus get that
*L*+ ≥ *w*min
*L*+ ≤ *rdi .*
The result follows since *r *≤ *ρ *and *di *≤ *d*.

Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.C., De Moor, B.,
Marynen, P., Hassan, B., Carmeliet, P., & Moreau, Y. (2006). Gene prioritization through genomic datafusion. *Nature Biotechnology*, *24*(5), 537–544.

Agarwal, S. (2006). Ranking on graph data. In *Proceedings of the 23rd international conference on machine*
Agarwal, A., & Chakrabarti, S. (2007). Learning random walks to rank nodes in graphs. In *Proceedings of*
*the 24th international conference on machine learning*.

Agarwal, S., & Niyogi, P. (2009). Generalization bounds for ranking algorithms via algorithmic stability.

*Journal of Machine Learning Research*, *10*, 441–474.

Agarwal, S., & Sengupta, S. (2009). Ranking genes by relevance to a disease. In *Proceedings of the 8th*
*annual international conference on computational systems bioinformatics*.

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area
under the ROC curve. *Journal of Machine Learning Research*, *6*, 393–425.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997).

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. *Nucleic AcidsResearch*, *25*(17), 3389–3402.

Mach Learn (2010) 81: 333–357
Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering.

In *Advances in neural information processing systems *(Vol. 14).

Belkin, M., & Niyogi, P. (2004). Semi-supervised learning on Riemannian manifolds. *Machine Learning*, *56*,
Belkin, M., Matveeva, I., & Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs.

In *Proceedings of the 17th annual conference on learning theory*.

Bertsekas, D. (1999). *Nonlinear programming *(2nd ed.). Nashua: Athena Scientific.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In *Proceedings of*
*the 7th international world Wide Web conference*.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learn-
ing to rank using gradient descent. In *Proceedings of the 22nd international conference on machinelearning*.

Burges, C. J. C., Ragno, R., & Le, Q. V. (2007). Learning to rank with non-smooth cost functions. In *Advances*
*in neural information processing systems *(Vol. 19). Cambridge: MIT Press.

Chakrabarti, S., Khanna, R., Sawant, U., & Bhattacharyya, C. (2008). Structured learning for non-smooth
ranking losses. In *Proceedings of the 14th ACM conference on knowledge discovery and data mining*.

Chapelle, O., & Keerthi, S. S. (2010, to appear). Efficient algorithms for ranking with SVMs. *Information*
*Retrieval Journal*. doi:
Chapelle, O., & Wu, M. (2010, to appear). Gradient descent optimization of smoothed information retrieval
metrics. *Information Retrieval Journal*.
Chapelle, O., Le, Q., & Smola, A. (2007). Large margin optimization of ranking measures. In *Proceedings of*
*the NIPS-2007 workshop on machine learning for Web search*.

Chu, W., & Keerthi, S. S. (2007). Support vector ordinal regression. *Neural Computation*, *19*(3), 792–815.

Chung, F. R. K. (1997). *Spectral graph theory*. Providence: American Mathematical Society.

Chung, F. R. K. (2005). Laplacians and the Cheeger inequality for directed graphs. *Annals of Combinatorics*,
*9*, 1–19.

Clemencon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. *Annals*
*of Statistics*, *36*, 844–874.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. *Journal of Artificial Intelligence*
*Research*, *10*, 243–270.

Cortes, C., Mohri, M., & Rastogi, A. (2007). Magnitude-preserving ranking algorithms. In *Proceedings of*
*24th international conference on machine learning*.

Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. *IEEE Transactions on*
*Information Theory*, *54*(11), 5140–5154.

Crammer, K., & Singer, Y. (2005). Online ranking by projecting. *Neural Computation*, *17*(1), 145–175.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining
preferences. *Journal of Machine Learning Research*, *4*, 933–969.

Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In
*Advances in large margin classifiers *(pp. 115–132).

Herbster, M., Pontil, M., & Wainer, L. (2005). Online learning over graphs. In *Proceedings of 22nd interna-*
*tional conference on machine learning*.

Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. *ACM Transactions*
*on Information Systems*, *20*(4), 422–446.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In *Proceedings of the ACM confer-*
*ence on knowledge discovery and data mining*.

Joachims, T. (2003). Transductive learning via spectral graph partitioning. In *Proceedings of the 20th inter-*
*national conference on machine learning*.

Johnson, R., & Zhang, T. (2008). Graph-based semi-supervised learning and spectral kernel design. *IEEE*
*Transactions on Information Theory*, *54*(1), 275–288.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. *Journal of the ACM*, *46*(5), 604–
Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. In *Proceedings*
*of the 19th international conference on machine learning*.

Ma, X., Lee, H., Wang, L., & Sun, F. (2007). CGI: a new approach for prioritizing genes by combining gene
expression and protein-protein interaction data. *Bioinformatics*, *23*(2), 215–221.

Morrison, J. L., Breitling, R., Higham, D. J., & Gilbert, D. R. (2005). GeneRank: using search engine tech-
nology for the analysis of microarray experiments. *BMC Bioinformatics*, *6*, 233.

Rajaram, S., & Agarwal, S. (2005). Generalization bounds for *k*-partite ranking. In *Proceedings of the NIPS-*
*2005 workshop on learning to rank*.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. *Science*,
Mach Learn (2010) 81: 333–357
Rudin, C. (2009). The p-norm push: a simple convex ranking algorithm that concentrates at the top of the list.

*Journal of Machine Learning Research*, *10*, 2233–2271.

Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-based ranking meets boosting in the
middle. In *Proceedings of the 18th annual conference on learning theory*.

Smola, A. J., & Kondor, R. (2003). Kernels and regularization on graphs. In *Proceedings of the 16th annual*
*conference on learning theory*.

Sutherland, J. J., O'Brien, L. A., & Weaver, D. F. (2004). A comparison of methods for modeling quantitative
structure-activity relationships. *Journal of Medicinal Chemistry*, *47*(22), 5541–5554.

Taylor, M., Guiver, J., Robertson, S., & Minka, T. (2008). Softrank: optimizing non-smooth rank metrics. In
*Proceedings of the 1st international conference on Web search and data mining*.

Tenenbaum, J., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimen-
sionality reduction. *Science*, *290*(5500), 2319–2323.

Vapnik, V. N. (1998). *Statistical learning theory*. New York: Wiley.

Weston, J., Eliseeff, A., Zhou, D., Leslie, C., & Noble, W. S. (2004). Protein ranking: from local to global
structure in the protein similarity network. *Proceedings of the National Academy of Science*, *101*(17),6559–6563.

Xu, J., & Li, H. (2007). AdaRank: a boosting algorithm for information retrieval. In *Proceedings of the 30th*
*ACM SIGIR conference on research and development in information retrieval*.

Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average pre-
cision. In *Proceedings of the 30th ACM SIGIR conference on research and development in informationretrieval *(pp. 271–278).

Zhou, D., & Schölkopf, B. (2004). A regularization framework for learning from graph data. In *ICML work-*
*shop on statistical relational learning*.

Zhou, D., Weston, J., Gretton, A., Bousquet, O., & Schölkopf, B. (2004). Ranking on data manifolds. In
*Advances in neural information processing systems *(Vol. 16).

Zhou, D., Huang, J., & Schölkopf, B. (2005). Learning from labeled and unlabeled data on a directed graph.

In *Proceedings of the 22nd international conference on machine learning*.

Source: http://drona.csa.iisc.ac.in/~shivani/Publications/2010/mlj10-graphrank.pdf

Zugestellt durch Post.at Meine Mitgliedschaft bei Raiffeisen. . Informationen für Mitglieder und Kunden der Raiffeisenbank Korneuburg mit Bankstellen in l Korneuburg: Hauptplatz, Laaer Straße und Stockerauer Straße l Bisamberg l Enzersfeldl Großrußbach l Hagenbrunn l Langenzersdorf l Obergänserndorf l Würnitz

VOLUME 6, ISSUE 3 – SEPTEMBER 2015 ECFA e-BULLETIN EDITORIAL BOARD MESSAGE Forensic Accounting Committee Corruption is a major problem – Transparency International Survey. 2 (ECFA) of ICPAC has prepared Economic Scandals ……………………………………………….…. 4 the ECFA e-bulletin with the