# Introduction

- Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
- Link to the paper

# SimRank

- Input: A directed graph
*G = (V, E)*where*V*represents vertices and*E*represents edges. - SimRank defines similarity between 2 vertices (or nodes)
*i*and*j*as the average of the similarity between their in-neighbours decayed by a constant factor*C*. - Any node is maximally similar to itself (with similarity = 1).
- PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
- SimRank scores are symmetric and can be defined between all pair of vertices.
*G^2*is defined as the node pair graph such that each node in*G^2*corresponds to an ordered pair of nodes of*G*and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).- In
*G^2*, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.

# Variants

**Minimax Variant**

- Defines similarity of nodes
*i*and*j*as the minimum of maximum similarity between*i*and any in-neighbour of*j*and between*j*and any in-neighbour of*i*.

**Computing SimRank**

- A naive solution can be obtained by iteration to a fixed point.
- Space complexity is
*O(n^2)*and time complexity is*O(kn^2d) where k*is the number of iterations,*n*is the number of vertices and*d*is the average of product of in-degrees of pair of vertices. - Optimizations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.

# Different Interpretations

**Co-citation Score**

- The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
- Successive iterations improve these initial scores.

**Random Surfer-Pairs Model**

- SimRank s(a, b) can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
- Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
- Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
- Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
- Common choice for f is
**f(z) = C^z**where C ε (0, 1) - The SimRank score for two nodes, with parameter C, is the expected-f meeting distance traveling back-edges with
**f(z) = C^z**

# Evaluation

- Experiments on 2 datasets:
- Corpus of scientific research papers from ResearchIndex.
- Transcripts of undergrad students at Stanford.
- Domain specific properties used to measure similarity and compared with SimRank scores.
- Results show improvement over co-citation scores.