Overview

This tutorial shows how to deploy Pegasus from the Hadoop on Azure portal to compute the page rank for a simple 16-node graph. The rank calculated for a node is a measure of how well connected it is to the other nodes in the graph structure.

A graph is type of abstract mathematical structure that consists of a collection of nodes and a collection of edges that connect a subset of these nodes, pairwise. The Web is a model of a graph structure, where pages are nodes and hyperlinks are (directed) edges. The page rank of a page (node) is a measure of how many other pages have hyperlinks (direct edges) that target that page (node). The higher the value of a page's rank, the more highly connected it is to other pages on the Web. A high page rank typically indicates an important page. The page rank of a page is defined recursively, so highly ranked pages that link to it increase its rank more that poorly ranked pages do.

Pegasus

Pegasus is an open source graph mining library implemented in a distributed manner on top of Hadoop. Pegasus provides large scale algorithms for various graph mining tasks:

  • Degree
  • PageRank
  • Random Walk with Restart
  • Radius
  • Connected Components

 

This form of analysis is applicable to many networked structures other than the Web, such as computer and social networks, that model a graph. 
People from School of Computer Science, Carnegie Mellon University developed Pegasus. For more information, see the Pegasus Project site.