- Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
- Link to the paper
- Input: A directed graph G = (V, E) where V represents vertices and E represents edges.
- SimRank defines similarity between 2 vertices (or nodes) i and j as the average of the similarity between their in-neighbours decayed by a constant factor C.
- Any node is maximally similar to itself (with similarity = 1).
- PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
- SimRank scores are symmetric and can be defined between all pair of vertices.
- G^2 is defined as the node pair graph such that each node in G^2 corresponds to an ordered pair of nodes of G and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
- In G^2, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.
- Defines similarity of nodes i and j as the minimum of maximum similarity between i and any in-neighbour of j and between j and any in-neighbour of i.
- A naive solution can be obtained by iteration to a fixed point.
- Space complexity is O(n^2) and time complexity is O(kn^2d) where k is the number of iterations, n is the number of vertices and d is the average of product of in-degrees of pair of vertices.
- Optimizations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.
- The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
- Successive iterations improve these initial scores.
Random Surfer-Pairs Model
- SimRank s(a, b) can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
- Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
- Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
- Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
- Common choice for f is f(z) = C^z where C ε (0, 1)
- The SimRank score for two nodes, with parameter C, is the expected-f meeting distance traveling back-edges with f(z) = C^z
- Experiments on 2 datasets:
- Corpus of scientific research papers from ResearchIndex.
- Transcripts of undergrad students at Stanford.
- Domain specific properties used to measure similarity and compared with SimRank scores.
- Results show improvement over co-citation scores.