In this hands-on assignment, we'll apply network analysis techniques (including PageRank) to Wikipedia articles. We'll measure which articles are important under different metrics, analyze the results, and examine how the metrics differ. We'll be using Python for this assignment. The main objective is to gain experience working with real graph datasets.
Overview
This assignment guides you through using network analysis techniques. Take the quiz once you complete the assignment, as the quiz asks you for the results from the assignment.
Dataset
You can download the dataset at the following link: wikisubset.zip. (The dataset is a subset derived from the Wikispeedia navigation paths dataset.) After uncompressing the download, you should see three files. The code below can be used to load the data (it expects to be run from the same directory as the data files). The dataset consists of about 4,500 articles.
import pandas as pd

###############################################################################
# load data
links = pd.read_csv('./links.tsv', sep='\t', comment='#', names=['source', 'target'])
smatrix = [line for line in open('./shortest-path-distance-matrix.txt').read().split('\n') if (line and not line.startswith('#'))]
articles = [line for line in open('./articles.tsv').read().split('\n') if (line and not line.startswith('#'))]
###############################################################################
Notice above that we have the list of articles and the list of links (source is the article containing the link, and target is the article it links to). In addition, the shortest-path distances between each pair of articles have already been computed: smatrix[i][j] is the shortest-path distance from articles[i] to articles[j].
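To make the indexing concrete, here is a minimal sketch (assuming the loading code above has been run) that maps article names to indices and looks up a distance. It also assumes the Wikispeedia matrix format, in which each distance is stored as a single digit character and '_' marks an unreachable pair.

name_to_index = {name: ii for ii, name in enumerate(articles)}

def shortest_distance(source, target):
    # Look up the single-character entry for the (source, target) pair.
    ch = smatrix[name_to_index[source]][name_to_index[target]]
    return None if ch == '_' else int(ch)

print(shortest_distance('Albert_Einstein', 'Australia'))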
Network analysis
Next, we'll compute three measures for each article: indegree, closeness, and PageRank. Template code for computing them is given below. We'll build a dictionary for each measure, mapping article name to score. Fill in all the ...s.
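For reference, closeness is typically the reciprocal of an article's average shortest-path distance to the other articles; the template's comment tells you to substitute a default distance of 10 when no path exists, and the exact convention (for example, which direction of distance to use in this directed graph) should follow your course notes. PageRank uses the standard update rule, sketched below with damping factor $\alpha$ (0.95 in the template) and $n$ articles:

$$\mathrm{PR}(a) \;=\; \frac{1-\alpha}{n} \;+\; \alpha \sum_{b \to a} \frac{\mathrm{PR}(b)}{\mathrm{outdegree}(b)}$$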
from collections import Counter

###############################################################################
# indegrees
indegrees = Counter(...)
print('Indegrees')
for xx in indegrees.most_common(25): print(xx)
print('='*120)

###############################################################################
# closeness
closeness = dict()
for jj, article in enumerate(articles):
    # if shortest distance is not defined, use default value of 10
    ...
    closeness[article] = ...
closeness = Counter(closeness)
print('Closeness')
for xx in closeness.most_common(25): print(xx)
print('='*120)

###############################################################################
# pagerank
alpha = 0.95
max_niterations = 100
tolerance = 0.000001
outdegrees = Counter(...)
# pagerank: init
pagerank = dict(...)
# pagerank: iterations
for iteration in range(max_niterations):
    print('Iteration number: ', iteration)
    # calculate
    new_pagerank = dict(...)
    for source, target in zip(links['source'], links['target']):
        new_pagerank[target] += ...
    # termination criteria
    max_delta = max(abs(new_pagerank[xx]-pagerank[xx]) for xx in pagerank.keys())
    if max_delta < tolerance: break
    # update
    pagerank = new_pagerank
pagerank = Counter(pagerank)
print('Pagerank')
for xx in pagerank.most_common(25): print(xx)
print('='*120)

###############################################################################
# ranks
indegrees_index = dict((xx, ii) for ii, (xx, ss) in enumerate(indegrees.most_common()))
closeness_index = dict((xx, ii) for ii, (xx, ss) in enumerate(closeness.most_common()))
pagerank_index = dict((xx, ii) for ii, (xx, ss) in enumerate(pagerank.most_common()))
selected = [
    'Adolf_Hitler',
    'Albert_Einstein',
    'Algorithm',
    'Arnold_Schwarzenegger',
    'Australia',
    'Atom',
    'Azerbaijan',
    'Australian_Open',
    'American_football',
    'Alexander_Graham_Bell',
    'Astronomy',
    'Asteroid',
    'Arugula',
    'Asparagus',
    'Amsterdam',
    'Abraham_Lincoln',
]
print('Ranked by indegrees', sorted(selected, key=indegrees_index.get))
print('Ranked by closeness', sorted(selected, key=closeness_index.get))
print('Ranked by pagerank', sorted(selected, key=pagerank_index.get))
###############################################################################
In the final section (ranks), we compute the rank of some selected articles (Albert_Einstein, Australia, etc.) under each measure, and print the selected articles sorted by each measure's rank.
Here's the expected output (for the first three sections) to check your work:
Indegrees
('United_States', 1551)
('United_Kingdom', 972)
('France', 959)
('Europe', 933)
('World_War_II', 751)
('England', 751)
('Germany', 743)
('India', 611)
('English_language', 598)
('London', 587)
('Japan', 573)
('Canada', 571)
('Australia', 563)
('Italy', 550)
('Spain', 539)
('Russia', 520)
('Scientific_classification', 519)
('China', 496)
('Animal', 492)
('Africa', 477)
('Latin', 443)
('North_America', 410)
('World_War_I', 404)
('Egypt', 362)
('Asia', 357)
========================================================================================================================
Closeness
('United_States', 0.5795039657560116)
('Europe', 0.533804940276006)
('United_Kingdom', 0.532569709591577)
('France', 0.530972430499481)
('Germany', 0.5160313901345291)
('World_War_II', 0.5152227445713007)
('Latin', 0.5056019332161688)
('India', 0.5053798858146684)
('English_language', 0.5053798858146684)
('England', 0.5038309982486865)
('Japan', 0.5021819768710452)
('Canada', 0.49859185441941073)
('Italy', 0.49783690244430023)
('China', 0.4956924402326082)
('Russia', 0.49441460794844255)
('Spain', 0.49425534199506066)
('Australia', 0.49282655246252677)
('French_language', 0.4820400041889203)
('London', 0.48143499633929504)
('United_Nations', 0.48058049697222804)
('North_America', 0.48048016701461377)
('Netherlands', 0.48032975060002087)
('Africa', 0.4795790789747864)
('Islam', 0.4787311492459698)
('World_War_I', 0.47650103519668735)
========================================================================================================================
Iteration number:  0
Iteration number:  1
Iteration number:  2
Iteration number:  3
Iteration number:  4
Iteration number:  5
Iteration number:  6
Iteration number:  7
Iteration number:  8
Iteration number:  9
Iteration number:  10
Iteration number:  11
Iteration number:  12
Iteration number:  13
Iteration number:  14
Iteration number:  15
Iteration number:  16
Pagerank
('United_States', 0.009897571449834277)
('France', 0.007241950112814892)
('Europe', 0.007025655924919779)
('United_Kingdom', 0.006785736680764528)
('English_language', 0.005447819147473695)
('Germany', 0.005432638584848569)
('World_War_II', 0.0051715983269865)
('Latin', 0.004901237273982133)
('India', 0.00463797805006983)
('England', 0.004578553435285589)
('Japan', 0.0042883428858490735)
('Italy', 0.0042180111631345725)
('Time_zone', 0.004215215297464881)
('Spain', 0.0041860687935004574)
('China', 0.004050332438365339)
('Currency', 0.004027197658465597)
('Russia', 0.003994326276941794)
('Canada', 0.0036086925815627933)
('Christianity', 0.0035844050472497316)
('List_of_countries_by_system_of_government', 0.0035467654096853424)
('Africa', 0.0034577987405697563)
('Australia', 0.0033547786814597594)
('United_Nations', 0.0033499045569107995)
('Islam', 0.0033117211778507637)
('French_language', 0.0032054014723937514)
========================================================================================================================
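If you'd like an independent sanity check on your filled-in code, the following sketch uses networkx (an assumption of this sketch: networkx is installed; it is not required for the assignment). Its built-in routines follow slightly different conventions from the template, for example in how unreachable or dangling pages are handled, so expect numbers close to, but not always identical to, the expected output above.

import networkx as nx
import pandas as pd

# Build a directed graph from the same links.tsv used above.
links = pd.read_csv('./links.tsv', sep='\t', comment='#', names=['source', 'target'])
G = nx.from_pandas_edgelist(links, source='source', target='target', create_using=nx.DiGraph())

# Reference values to compare against your own indegree and PageRank numbers.
indegree_check = dict(G.in_degree())
pagerank_check = nx.pagerank(G, alpha=0.95, tol=0.000001)

print(sorted(indegree_check.items(), key=lambda kv: -kv[1])[:5])
print(sorted(pagerank_check.items(), key=lambda kv: -kv[1])[:5])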
Once you complete the code and have the output (ranks of the selected articles), move on to the quiz.
Bonus: Project Ideas
As a bonus, here are some project ideas.
- Repeat the PageRank analysis, but instead of hopping to a uniformly random page every once in a while, use Wikipedia pageview statistics to assign the initial probabilities and the random-hop probabilities; a sketch appears after this list. (Data is available here: Wikipedia Pageviews data, with a separate file for each one-hour period. You can use data from any one hour for an initial analysis.)
- Do the analysis for the entire Simple Wikipedia (about 100,000 articles). The data is available via Wikipedia dumps, but the dumps are .sql files, so you'll have to import the data into MySQL and then make use of it, either by querying it directly or by exporting it to a tab-separated file. Alternatively, datasets are available at The Koblenz Network Collection, but they are missing the mapping from article ids to article titles.
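As a starting point for the first idea, here is a minimal sketch using networkx's personalized PageRank on the assignment's link data. The views counter is hypothetical: you would populate it from a Wikipedia pageviews file of your choice, and the sketch assumes networkx is installed.

import networkx as nx
import pandas as pd
from collections import Counter

links = pd.read_csv('./links.tsv', sep='\t', comment='#', names=['source', 'target'])
G = nx.from_pandas_edgelist(links, source='source', target='target', create_using=nx.DiGraph())

# Hypothetical pageview counts: start every article at 1 (smoothing, so every
# page can be a teleport destination), then add counts parsed from one hour of
# the Wikipedia Pageviews data (parsing is left to you).
views = Counter({node: 1 for node in G.nodes()})
# views.update(parsed_pageview_counts)

total = sum(views.values())
teleport = {node: views[node] / total for node in G.nodes()}

# personalization sets the random-hop probabilities, nstart the initial ones.
pagerank_views = nx.pagerank(G, alpha=0.95, personalization=teleport, nstart=teleport)
print(sorted(pagerank_views.items(), key=lambda kv: -kv[1])[:10])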