
Archive for September, 2011

Update 27 Sep 2011

September 27th, 2011

In our last meeting I took down these actions:

  • Resurrect NGMAST paper and use Enron + MIT-NOV – argue that MIT is not big enough to make assumptions

I have started another paper in the Clique SVN here: https://erdos.ucd.ie/svn/clique/papers/BubbleH – at the moment it's just the same as the NGMAST paper.

  • Write thesis outline and get started on background chapters – message dissemination, and the social networks side (from Milgram onwards, etc.)

I created a thesis document in the repository too: https://erdos.ucd.ie/svn/clique/papers/MattStabelerThesis – I have started to compile some notes into background chapters, based on my transfer report. I have also started a rough outline of chapters. The latest version in PDF is here.

  • Speak to Conrad about finding a non-overlapping Hierarchical CFA

Following the feedback at the CLIQUE mini-workshop, I asked Fergal, Conrad and Aaron again about community finding algorithms. They mentioned the Blondel one and the InfoMap one again.

  • Get a decent sub-set of STUDIVZ dataset

To get a sub-set of nodes in the STUDIVZ dataset I started by picking N nodes randomly and including their immediate neighbours. I repeat this L more times to include nodes up to a depth of L hops away from the seed nodes. Using 10 random seed nodes with a depth of 2 yields a network of around 3500 nodes (12% of all nodes). When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially.

Currently, I'm testing the setup with 5 seed nodes and 2 levels of the network, in the hope that there will be some overlap.

Conrad suggests we select them non-randomly – first reducing our set of nodes to those with high activity (either number of posts or total length of posts), then taking the network L hops from the remaining nodes.
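A minimal sketch of this seed-and-expand sampling, assuming the STUDIVZ network is loaded as a networkx graph; the function and variable names here are illustrative, not part of the actual tooling:

    # Sketch of the sampling described above: pick seeds (randomly, or by
    # activity as Conrad suggests), then expand the selection by L hops.
    # `activity` maps node -> post count (or total post length); assumed format.
    import random

    def sample_subgraph(G, n_seeds=5, depth=2, activity=None, rng=None):
        rng = rng or random.Random()
        if activity is None:
            seeds = rng.sample(list(G.nodes()), n_seeds)   # random seed nodes
        else:
            # keep the most active nodes and seed from those instead
            seeds = sorted(G.nodes(), key=lambda n: activity.get(n, 0),
                           reverse=True)[:n_seeds]
        selected = set(seeds)
        frontier = set(seeds)
        for _ in range(depth):                             # expand L hops
            frontier = {nbr for n in frontier
                        for nbr in G.neighbors(n)} - selected
            selected |= frontier
        return G.subgraph(selected).copy()

    # e.g. sub = sample_subgraph(G, n_seeds=5, depth=2)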

Enron Dataset Analysis

September 2nd, 2011

Comparing BubbleH and BubbleRAP on the Enron dataset produced the following results. The plot shows the results for the best run of each parameter, where "best" is determined, in this case, by delivery ratio: I selected the run with the highest delivery ratio for each CFA and each routing type, and used these as the basis for the plot. This was done automatically, and this initial version does not take secondary ordering into account – meaning that when two runs tie on the best delivery ratio, the first one encountered is picked, ignoring secondary data such as cost or latency.
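A sketch of how the selection could include that secondary ordering (highest delivery ratio, ties broken by lower cost, then lower latency); each run is assumed to be a simple record with the fields shown, which are assumptions rather than the real run format:

    # Pick the best run per (CFA, routing) pair, with tie-breaking.
    def best_runs(runs):
        best = {}
        for run in runs:
            key = (run["cfa"], run["routing"])
            # maximise delivery ratio; break ties on lower cost, then latency
            rank = (run["delivery_ratio"], -run["cost"], -run["latency"])
            if key not in best or rank > best[key][0]:
                best[key] = (rank, run)
        return {k: v[1] for k, v in best.items()}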

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE, for both BubbleRAP and BubbleH

We can see that, in terms of delivery ratio, BubbleH outperforms BubbleRAP when there is overlap (LinkClustering, HGCE), and it performs better still when there is hierarchical data (HGCE). When there is little or no overlap, BubbleH and BubbleRAP perform identically, as we expect. I wonder if we can explicitly test the effect of hierarchy by finding an algorithm that partitions into a hierarchy (i.e. without overlap)?

CFA Types

CFA             Hierarchical   Overlapping
KCLIQUE         No             Yes
HGCE            Yes            Yes
InfoMap         No             No
LinkClustering  No             Yes
????            Yes            No

Next to do: test the hypothesis – does a deep HGCE hierarchy beat a flat HGCE hierarchy?

Enron, HGCE, BubbleH Rankings - (Exp. Group: DATASET_EXPLORE2)

M  Delivery Ratio  Cost  Latency  Delivered Hops  Depth  Width
0.0000001 0.761918726 4.964306661 16372578807 5.097707153 3 87
0.0000002 0.733766234 4.909635526 16340070463 4.980074222 3 105
0.0000003 0.743108504 4.878466695 16159078259 5.010429586 2 98
0.0000004 0.66686217 4.599790532 15721736431 4.858462118 3 65
0.0000005 0.762253875 4.944449099 16483522910 5.035449299 3 87
0.0000006 0.755215752 4.998659405 16253907145 5.17961946 4 80
0.0000007 0.73150398 4.928529535 15895770224 5.107038543 3 91
0.0000008 0.750775031 4.871638039 16568740115 4.992132135 3 70
0.0000009 0.699329703 4.749811479 16050835473 4.977056251 3 95
0.000001 0.762337662 4.9228739 15679054807 5.05423971 3 88
0.000002 0.732258065 4.97666527 16414982185 5.081354769 3 74
0.000003 0.694888982 4.745789694 16134434295 4.914812805 3 90
0.000004 0.708127357 4.881650607 15674372660 5.067443649 2 84
0.000005 0.729032258 4.937704231 16027611756 5.112458338 5 61
0.00001 0.728320067 4.768831169 16197342711 4.934943917 3 76
0 0.706200251 4.959405111 15616885803 5.189772795 3 55

The table above shows the statistics for Enron, HGCE and BubbleH for DATASET_EXPLORE2, which is a new run using the original, simple datasets (without multiple runs over concatenated datasets). It still needs the depth and width data to be added.
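For the missing depth and width figures, a possible sketch, assuming the HGCE communities can be read as a parent-to-children mapping over community ids with a single (possibly synthetic) root – this structure is an assumption, not HGCE's actual output format:

    # Depth = longest root-to-leaf chain of nested communities.
    def hierarchy_depth(children, node):
        kids = children.get(node, [])
        if not kids:
            return 1
        return 1 + max(hierarchy_depth(children, c) for c in kids)

    # Width = largest number of communities found at any single level.
    def hierarchy_width(children, root):
        level, width = [root], 1
        while level:
            level = [c for n in level for c in children.get(n, [])]
            if level:
                width = max(width, len(level))
        return width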

MIT-NOV Hierarchy Analysis

September 1st, 2011

I re-ran the simulations with a simplified set of parameters for MIT-NOV and Cambridge. To make better sense of the data, I ranked each parameter passed to HGCE (which correspond to different hierarchical structures) on each metric – see the table below:

MIT-NOV, HGCE, BubbleH Rankings - (Exp. Group: DATASET_EXPLORE)

HGCE M Parameter  Del. Ratio  Cost  Hops  Latency  Depth (not ranked)  Width (not ranked)  Score?
0.001 8 5 7 5 3 15  
0.002 2 7 6 14 5 11  
0.003 13 3 4 4 3 13  
0.004 11 4 3 3 4 20  
0.005 1 2 2 12 2 17  
0.006 3 9 8 10 5 10  
0.007 6 12 12 9 5 14  
0.008 9 13 13 5 4 11  
0.009 4 11 10 8 5 14  
0.01 7 14 14 13 4 15  
0.02 14 7 9 1 3 16  
0.0 10 9 11 2 3 16  
0.1 5 1 1 11 5 16  
0.2 12 5 5 7 3 11  

This table shows the relative rankings for delivery ratio, cost, hops and latency for each value of the M parameter to HGCE (values were derived from the optimum parameters to KCLIQUE, based on the structure of its output communities; the same parameters were used for each CFA).
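For reference, a sketch of how the per-metric rankings above could be generated, assuming one result record per M value with the field names shown here (illustrative only):

    # Rank results on one metric; rank 1 is the best value for that metric.
    def rank_by_metric(results, metric, higher_is_better=True):
        ordered = sorted(results, key=lambda r: r[metric],
                         reverse=higher_is_better)
        return {r["M"]: rank for rank, r in enumerate(ordered, start=1)}

    # delivery ratio: higher is better; cost, hops and latency: lower is better
    # ratio_ranks = rank_by_metric(rows, "delivery_ratio")
    # cost_ranks  = rank_by_metric(rows, "cost", higher_is_better=False)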

Below is an image of the associated community hierarchies:

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

Shown below are the four metrics in bar chart form, for comparison of actual values:

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

From the above we can see that, generally, delivery ratio improves with more depth to the hierarchy; however, in this instance a shallow, broad structure does best overall. When ranking by latency, broad, shallow structures also appear to perform best.

Thought: should we run HGCE multiple times on the same data to see the range of different structures it comes up with? Also, we should try another hierarchical CFA (e.g. the other version of link clustering): did this, and will post results soon.

Thought: is there a way of scoring a hierarchy based on depth, width and population of communities?
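One speculative way such a score could be put together, purely as a starting point – the weights and the linear form are arbitrary assumptions, not a worked-out metric:

    # Combine depth, width and mean community size into a single score.
    def hierarchy_score(depth, width, sizes,
                        w_depth=1.0, w_width=0.5, w_pop=0.5):
        mean_size = sum(sizes) / len(sizes) if sizes else 0.0
        # reward deeper trees, penalise very broad levels,
        # reward larger communities
        return w_depth * depth - w_width * width + w_pop * mean_size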