Input: A directed graph G = (V, E) where V represents vertices and E represents edges.
SimRank defines similarity between 2 vertices (or nodes) i and j as the average of the similarity between their in-neighbours decayed by a constant factor C.
Any node is maximally similar to itself (with similarity = 1).
PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
SimRank scores are symmetric and can be defined between all pair of vertices.
G^2 is defined as the node pair graph such that each node in G^2 corresponds to an ordered pair of nodes of G and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
In G^2, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.
Defines similarity of nodes i and j as the minimum of maximum similarity between i and any in-neighbour of j and between j and any in-neighbour of i.
A naive solution can be obtained by iteration to a fixed point.
Space complexity is O(n^2) and time complexity is O(kn^2d) where k is the number of iterations, n is the number of vertices and d is the average of product of in-degrees of pair of vertices.
Optimizations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.
The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
Successive iterations improve these initial scores.
Random Surfer-Pairs Model
SimRank s(a, b) can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
Common choice for f is f(z) = C^z where C ε (0, 1)
The SimRank score for two nodes, with parameter C, is the expected-f meeting distance traveling back-edges with f(z) = C^z
Experiments on 2 datasets:
Corpus of scientific research papers from ResearchIndex.
Transcripts of undergrad students at Stanford.
Domain specific properties used to measure similarity and compared with SimRank scores.