- Introduction to Network Analysis
- Graphs and types of graphs
- Identifying Important Nodes and Structures in Network
- Degree Centrality
- Closeness Centrality
- Betweenness Centrality
- NetworkX for data analysis
- Practical Applications
Networks are everywhere. Be it the network of friends in a university or the online social networks such as Facebook and Twitter or the transportation networks or financial networks, networks are ubiquitous in human society. A network is a collection of objects (nodes) with relationships / interconnections (edges) between them.
Since networks are so universal, analyzing them gives us valuable information about different complex phenomena. For example, by analyzing a network of people in an organization, we can find out the most influential people in that organization. Or by analyzing a network of different airports or train stations we can find out which places have the maximum risk of spreading a particular virus. Network analysis provides us with the tools that will help us answer these type of questions.
The most common and easy method to represent networks is by using graphs. Items or objects are represented by nodes or vertices and the connection or links between the items are represented by edges.
For example, in human networks, the nodes can represent the people and the edges can represent the relationship between them i.e. family ties, friendship, etc.
Based upon whether the edges are directed or not the graphs can be of two types: Directed and undirected graphs. The simple graph we saw in figure 1 was an example of an undirected graph. A directed graph is shown below:
Different nodes in a network have different importance in the network. There are different metrics we can use to measure the importance of a particular node. The importance of a node is usually associated with a sense of the node being being 'central' to the network. For example, in a communication network, nodes that disseminate information to many nodes could be the most important node. Similarly, we might have different intuitions for hubs in a transportation network, important pages on the web. For example, most important nodes could be the one that prevent the network from breaking up. Centrality measures help to identify the most important (central) nodes in a network. Some popular measures of centrality are as follows:
The most basic notion of centrality is the number of neighbors that a node has (or its degree). The underlying assumption of this centrality measure is that nodes that are important have many connections. Mathematically the degree centrality of a node in an undirected graph is given by,
Cdegree = d / (N-1)
where N is the number of nodes in the network and d is the degree of the node (number of connections).
Closeness centrality measure assumes that the most important nodes are close to other nodes in the network. Mathematically the closeness centrality of a node is defined as,
Cclose = (N -1) / S,
where N is the number of nodes in the network and S is the sum of the shortest distance from each node to the particular node.
Betweenness centrality is based upon the assumption that important nodes connect other nodes i.e they lie in between many nodes. For every pair of nodes in a connected graph, we consider the shortest path between the nodes. The betweenness centrality for a particular node is the number of these shortest paths that pass through the vertex.
As we have seen from the above discussion, different centrality measures make have different assumptions about what it means to be a “central” node. Therefore, the rankings of nodes produced by different centrality measures are different. Depending upon the context of the network and our requirement the best centrality measure varies. It is usually best practice to consider multiple centrality measures in identifying central nodes.
PageRank, named after Google co-founder Larry Page, is one of the algorithms that Google uses to rank websites in its search results. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The main assumption behind the algorithm is that more important websites are more likely to receive links from other websites. On this basis the algorithm assigns a weight to each of the nodes in a network.
The working of PageRank algorithm can be understood with the help of a simple network with three nodes, shown below:
Step 1: First, all of the websites in the network (represented by the nodes A, B and C here) are initialized with an initial value, say 1/n where n is the number of nodes in the network.
For the above example, the number of nodes is 3. So each node would be initialized with the value 1/3.
Step 2: Then the Basic PageRank update rule is repeatedly applied to each of the nodes k times. The rule is that each of the nodes shares an equal amount of its current PageRank to all the nodes it links to. Therefore, the updated PageRank is the sum of all the weights that it receives from all of the other nodes.
For the above example let us update the PageRank of node C. Node C is linked by the nodes A and B. Node A contributes 1/2 * 1/3 to the PageRank of C. Node B contributes 1 * 1/3. Therefore, the new PageRank of node C is,
Similarly, for the other nodes.
This step is repeated k times. For most of the networks, the PageRank values converge as k increases.
The Basic PageRank of a node can be interpreted as the probability that a random walk lands on the node after k random steps.
Damping: One of the problems with the above Basic PageRank algorithm is that in some networks, some nodes can suck up all of the value. To overcome this problem, we can use a damping factor α such that the random walker chooses a random node with probability 1- α. Therefore, the update rule with the damping factor is
NetworkX is a python library for analyzing graphs and networks. It allows the creation, manipulation, and study of the structure, dynamics, and functions of complex social, biological, and infrastructure networks. It is suitable for operation on large real-world graphs: e.g., graphs in excess of 10 million nodes and 100 million edges. It’s efficiency, scalability, portability makes it suitable for network and social network analysis. To learn more about NetworkX, check out its quick tutorial here: Tutorial — NetworkX documentation.
Social network analysis has numerous application in the fields of sociology, anthropology, biology, demography, communication studies, economics, geography, information science etc., to name a few. Businesses are using social network analysis to improve customer relations, perform marketing, and analyze the business needs. It also helps to find relationships and understand the dynamics of groups of people and organizations. Social network analysis is also used in security to identify organized clandestine activities such as illegal organized activities and criminal gangs.