Archive

Archive for the ‘experiments’ Category

Enron Dataset Analysis

September 2nd, 2011 No comments

Comparing BubbleH and Bubble RAP for the Enron dataset produced the following results. The plot shows the results for the best run of each parameter, best is determined, in this case, by delivery ratio. i.e. I selected the run with the highest delivery ratio for each CFA and each routing type, and used these as the basis for the plot. This was done automatically, and this initial version does not take into account secondary ordering – meaning that in a run with two identical best delivery ratios, the first to be encountered is picked, ignoring secondary data such as cost or latentcy.

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE,  for both BubbleRAP and BubbleH

Delivery Ratio, Cost, Latency and Delivered Hops for InfoMap, HGCE, LinkClustering and KCLIQUE, for both BubbleRAP and BubbleH

WE can see that in terms of Delivery Ratio, BubbleH outperforms BubbleRAP when there is overlap (LinkClustering, HGCE) and when there is Hierarchical Data, it performs better (HGCE). When there is little or no overlap, BubbleH and BubbleRAP perform identically, as we know/expect. I wonder if we can explicitly test the effect of Hierarchy  by finding an algorithm that partitions into hierarchy (i.e. without overlap)?

CFA Types

  Hierarchical Overlapping
KCLIQUE No Yes
HGCE Yes Yes
InfoMap No No
LinkClustering No Yes
???? Yes No

Next to do: Test Hypothesis – Does HGCE deep Hierarchy beat HGCE flat HEIRARCHY.

Enron, HGCE, BubbleH Rankings - (Exp. Group: DATASET_EXPLORE2)

M Delivery Ratop Cost Latency Delivered Hops Depth Width
0.0000001 0.761918726 4.964306661 16372578807 5.097707153 3 87
0.0000002 0.733766234 4.909635526 16340070463 4.980074222 3 105
0.0000003 0.743108504 4.878466695 16159078259 5.010429586 2 98
0.0000004 0.66686217 4.599790532 15721736431 4.858462118 3 65
0.0000005 0.762253875 4.944449099 16483522910 5.035449299 3 87
0.0000006 0.755215752 4.998659405 16253907145 5.17961946 4 80
0.0000007 0.73150398 4.928529535 15895770224 5.107038543 3 91
0.0000008 0.750775031 4.871638039 16568740115 4.992132135 3 70
0.0000009 0.699329703 4.749811479 16050835473 4.977056251 3 95
0.000001 0.762337662 4.9228739 15679054807 5.05423971 3 88
0.000002 0.732258065 4.97666527 16414982185 5.081354769 3 74
0.000003 0.694888982 4.745789694 16134434295 4.914812805 3 90
0.000004 0.708127357 4.881650607 15674372660 5.067443649 2 84
0.000005 0.729032258 4.937704231 16027611756 5.112458338 5 61
0.00001 0.728320067 4.768831169 16197342711 4.934943917 3 76
0 0.706200251 4.959405111 15616885803 5.189772795 3 55

The table above shows the statistics for Enron, HGCE, BubbleH,  for DATASET_EXPLORE2 which is a new run using the original, simple datasets (without multiple runs over concatenated datasets). it still needs depth,width data adding.

MIT-NOV Hierarchy Analysis

September 1st, 2011 No comments

I re-ran the simulations for a simplified set of paramets for MIT-NOV and Cambridge, to make better sense of the data, I ranked each parameter to HGCE (which correspond to different hierarchichal structures) for each metric – see table below:

MIT-NOV, HGCE, BubbleH Rankings - (Exp. Group: DATASET_EXPLORE)

HGCE, M Paramater Del.Ratio Cost Hops Latency Depth(not ranked) Width (not ranked) Score?
0.001 8 5 7 5 3 15  
0.002 2 7 6 14 5 11  
0.003 13 3 4 4 3 13  
0.004 11 4 3 3 4 20  
0.005 1 2 2 12 2 17  
0.006 3 9 8 10 5 10  
0.007 6 12 12 9 5 14  
0.008 9 13 13 5 4 11  
0.009 4 11 10 8 5 14  
0.01 7 14 14 13 4 15  
0.02 14 7 9 1 3 16  
0.0 10 9 11 2 3 16  
0.1 5 1 1 11 5 16  
0.2 12 5 5 7 3 11  

This table shows the relative rankings for Delivery Ratio, Cost, Hops and Latency for each parameter of M to HGCE (values were derived from the optimum parameters to KCLIQUE based on structure of its output communities, the same parameters were used for each CFA). (Table columns can be sorted by clicking on column headings).

Below is an image of the associated Community Hierarchies

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

Show below are the four metrics in barchart form, for comparison of actual values:

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

From the above we can see that generally, delivery ratio improves with more depth to the hierarchy, however, in this instance a shallow, broad structure does best overall. When ranking by latency, it appears that broad shallow structures perform best.

thought: should we run HGCE multiple times on the same data to see what the range of different structures it comes up with are? Also, we should get another hierarchical CFA (e.g. the other version of link clustering): Did this, and will post results soon.

thought: is there a way of scoring heirarchy based on depth, width and population of communities?

Odd results – an update

July 13th, 2011 No comments

Just before Pádraig went away, we realised that the results we were getting for Bubble and BubbleH were exactly the same in some cases. We also wanted to see whether hierarchy was making any difference to results, and so we set these goals for the period whilst Pádraig was away:

  • work out why bubbleH and bubble are the same in some cases
  • get enron and wall posts dataset
  • pick the most hierarchical looking clustering  in these
  • see whether BubbleH works better than BubbleRap
  • (concentrate on HGCE and KCLIQUE)

The first question: Is Bubble (no global parent) always the same result as BubbleH (global parent)?

  InfoMap KCLIQUE HGCE LinkClustering
MIT-NOV Yes No No No
CAMB Yes Yes Yes No
IC05 Yes Yes No No
IC06 Yes Yes No No
SS Yes Yes No No *
Enron Yes No No No
Studivz ? ? ? ?

* in the case of Social Sensing Study LinkClustering all are same for Delivery Ratio, apart from where the threshold is 0.0. (with Social Sensing we used multiple tuning values, hence multiple results, the others used only one set of parameters)

Answer: Not always.

Is the answer overlap?
I think that these results are down to the structure of the communities. InfoMap ALWAYS produces the same results for BubbleH and BubbleRAP. I wonder if this is this down to the fact that InfoMap partitions the network, and therefore there are no overlaps? Could it be that for the most part, KCLIQUE creates non-overlapping communities and hence the results? HGCE creates a large number of highly overlapping communities, which are also hierarchical. LinkClustering also creates a large number of communities and whilst edges cannot overlap, nodes can belong to multiple communities. Or is it really that the inherent Hierarchy in community structure is causing results to differ?

The question is also, why do BubbleH and BubbleRAP get EXACTLY the same results if there is no overlap? Well, this is because in the end, there is no difference between them when there is no complicated structure to the network. Even if BubbleH is using the global parent community, that is EXACTLY the same as using the Global Ranking in BubbleRAP, so when there is no overlap, each node belongs to EXACTLY one community, and has a local rank in both BubbleH and BubbleRap. The global parent in BubbleH is the same as using the global rank in BubbleRAP. In fact, we could re-write BubbleH to incorporate this explicitly, and do away with a forced global parent, but this is just implementation detail, the end results will be the same.

Second Part: get enron and wall posts dataset

I used Conrad’s version of the Enron Dataset, as he had taken the time to remove irregularities, and in case of future papers, he would have an in depth knowledge of how he processed the data, saving me a lot of time!

The connected time graph is below, showing a decent number of clusters, hopefully with some nice hierarchy!

Connected Time Graph of the Enron dataset.

Connected Time Graph of the Enron dataset.

I explored this dataset in the same way as the previous ones, by experimenting with settings for the different algorithms, InfoMap is easy to visualise, and so below is the InfoMap clustering of the dataset:

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

I also used Conrads Studivz wall post dataset, (see here), this dataset is huge, and so I haven’t worked out how to run the full set of simulations. I was able to create a connected time edgelist (connected time is based on wall post length, 1000ms per character). Below is the graph of all connections, with node size and colour related to degree, edges removed for clarity.

network_degree_colournetwork_degree

network_degree_colour_section

Studivz connected time graph close-up with edges

Enron Dataset

In order to get any clusters at all out of KCLIQUE and LinkClustering, I had to plug in some very small threshold values, this is probably due to the way in which I created the events (An event occurs when an email is sent, the duration of every event is set to 2 seconds), so for the most part, nodes are not connected, and therefore there are small overall connected times. (Conrad’s dataset did not include the content of the messages, so I was not able to give contact events any length based on message size). To simplify things,  I used the interesting parameters from KCLIQUE and LinkClustering, to inform the parameters for HGCE, specifically the –minCliqueEdgeWeight parameter (more info about HGCE parameters), which excludes edges based on weight, effectively thresholding the graph edges as with KCLIQUE and LinkClustering.

To recap, the threshold means that (in the case of KCLIQUE and LinkClustering and now HGCE) edges are removed where the connected time between individuals is lower than the threshold.

Threhold parameters used for the Enron dataset:

0.0, 0.0000001, 0.0000002, 0.0000003, 0.0000004, 0.0000005, 0.0000006, 0.0000007, 0.0000008, 0.0000009, 0.000001, 0.000002, 0.000003, 0.000004, 0.000005, 0.00001

The plot below shows the results for Delivery Ratio for BubbleRAP using no global parent (NGP), and BubbleH using global parent (GP) (in future, we can safely ignore the global parents part, as BubbleRAP should always be run without a global parent and BubbleH with a global parent).

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbelH beats BubbleRAP in all cases

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbelH beats BubbleRAP in all cases

This is a good result, it shows that in this dataset, for these parameters, BubbleH beats BubbleRAP, but now we need to consider why. I had done an earlier exploration of the enron dataset with default values for M ( 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.9), and so I looked back the results for that, and was surprised to see that BubbleRAP does much better in some cases. Below is a plot with the new data included (Solid red BubbleRAP line and blue dot dash BubbleH show this data).

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

So, BubbleRAP does better in some situations, (PS. NEED TO CHECK THIS AGAIN TO MAKE SURE BUBBLEH IS PROPERLY PLOTTED).

I started to look at then next step, hoping it will give some answers: pick the most hierarchical looking clustering  in [datasets].

I picked the best results for BubbleH, and mapped the communities to a hierarchical tree structure, shown below:

Community structure for nest run of BubbleH on Enron dataset where M = 0.000001.

Community structure for best run of BubbleH on Enron dataset where M = 0.000001

So far so good it seems to have a broad structure with some specific hierarchical clusters, I also mapped the worst run:

Community structure for best run of BubbleH on Enron dataset where M = 0.0000007

Community structure for best run of BubbleH on Enron dataset where M = 0.0000007

This too has some good structure to it, note however, that this is a plot of the  worst performing run, in a set of best parameters, (we didn’t run a whole range of values – so, this is worst of the best)

The next to show is the best BubbleRAP run, as below:

0.0000005

Community structure for best run of BubbleRAP on Enron dataset where M = 0.0000005

Interestingly, this has broad set of high level communities (as with the previous plots, but they had a global parent), but less broadness lower in the hierarchy (i.e. fewer lower communities).

TODO: plot the worst run of BubbleRAP

UPDATE: I found a better way to show the Dendograms of the Community hierarchy, and have automated the process. This page shows all of the plots for Enron without global parent and this page WITH global parent. To get a better idea of the relationship between hierarchical structure and results, I need to combine the results from runs, with the structure in the same place – so we can compare side by side what happens when there is different types of community.

Extending Datasets

May 31st, 2011 No comments

Pádraig had the idea to try to eliminate Delivery Ratio from our analysis, this way we could then concentrate specifically on Cost and Latency. He suggested that we take the normal dataset and extend it to three times the length, repeating the events in the correct time ordering. Then we could run the simulations 10 times each with a different start time in the original  portion of the dataset, this would give each message a much better chance of being delivered, over three iterations of the dataset.

extendeddatasets

I implemented this by taking a slice of the MIT dataset, processing the events contained within it, and duplicating them, adding the length of the slice to the timestamp times the iteration (i.e. MIT-NOV, MIT-NOV+1 length, MIT-NOV+2 lengths). Any contact events that were active at the start or end time of the slice, were truncated to start/end at the start/end time. I verified the resulting dataset by counting the number of connection every day, and ensuring that there was a distinct repeating pattern. For each dataset, I split the duration between the start and end time of the original dataset time period into 10 equal chunks, and calculated the corresponding start time for each. For MIT and Cambridge this equates to roughly every three days, whereas in the InfoCom datasets there is  a much smaller gap between start times (~14-20 hours).

start times

mit-nov 2004-11-10_00:00:00 - 2004-12-11_00:00:00
2004-11-10_00:00:00
2004-11-13_02:24:00
2004-11-16_04:48:00
2004-11-19_07:12:00
2004-11-22_09:36:00
2004-11-25_12:00:00
2004-11-28_14:24:00
2004-12-01_16:48:00
2004-12-04_19:12:00
2004-12-07_21:36:00
 
infocom-2005 2005-03-06_00:00:00 - 2005-03-12_00:00:00
2005-03-06_00:00:00
2005-03-06_14:24:00
2005-03-07_04:48:00
2005-03-07_19:12:00
2005-03-08_09:36:00
2005-03-09_00:00:00
2005-03-09_14:24:00
2005-03-10_04:48:00
2005-03-10_19:12:00
2005-03-11_09:36:00
 
infocom-2006 2006-04-24_00:00:00 - 2006-05-02_00:00:00
2006-04-24_00:00:00
2006-04-24_19:12:00
2006-04-25_14:24:00
2006-04-26_09:36:00
2006-04-27_04:48:00
2006-04-28_00:00:00
2006-04-28_19:12:00
2006-04-29_14:24:00
2006-04-30_09:36:00
2006-05-01_04:48:00
 
cambridge 2005-10-28_00:00:00 - 2005-12-23_00:00:00
2005-10-28_00:00:00
2005-11-02_13:30:00
2005-11-08_04:00:00
2005-11-13_18:30:00
2005-11-19_09:00:00
2005-11-24_23:30:00
2005-11-30_14:00:00
2005-12-06_04:30:00
2005-12-11_19:00:00
2005-12-17_09:30:00

Running Simulations

We decided to remove the overhead of running the CFAs with multiple parameters, by choosing a set of parameters for each, that had performed well in previous tests. The list below shows the ones we chose:

CFA Parameters

MIT-NOV
	HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1.3	0	0.25	0.2	4	100	0.7	0.5	0.1
 
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  4	1.3	0	0.25	0.3	4	100	0.9	0.9	0.1
			# Communities AverageMembers
			# 21 15
			2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 29 30 31 32 34 35 36 37 38 39 41 42 43 45 46 47 48 50 51 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 94 95 97 98 99 100 101 102 103
			6 8 14 16 18 30 41 46 53 59 61 63 78 79 86 88 91 95 101 102
			6 13 17 29 30 47 86 88 95 101
			10 25 62 65 74 83 102
			3 25 30 62 74
			4 23 30 45 57 91
			9 30 42 101
			19 27 35 37 39 81 89 102
			2 41 51 87 88 95 102 103
			15 41 43 51 67 69 87 88 95 100 103
			19 27 41 66 85 98
			15 43 46 58 61 67 69 78 79 83 85 90 95 97 100 103
			2 7 11 15 57 61 67 69 78 79 85 87 88 90 95 97 100 101 103
			6 7 8 10 11 14 15 16 17 21 30 31 34 35 38 41 43 46 55 58 61 69 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102
			8 10 15 16 17 30 38 41 46 53 63 73 80 83 85 95 97 99 101 102
			10 14 18 25 35 46 65 83 86 88 90 102
			10 17 25 35 53 72 85 88
			10 25 53 72 74 78
			27 30 50 95 97 99
			4 12 19 20 23 24 27 30 32 36 37 39 41 45 48 50 54 59 60 68 70 76 77 81 82 84 89 98
			19 27 37 39 41 68 81 98
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.01
			# k = 3
			# minimum_connected_ratio = 0.01
			# number_of_communities = 4
			# community_sizes = [5, 3, 45, 24]
			# average_size = 19.25
			103 2 67 100 87
			99 27 69
			6 7 8 10 11 13 14 15 16 17 18 21 29 30 31 34 35 38 41 43 44 46 53 55 58 59 61 63 67 73 75 78 79 80 83 85 86 88 90 91 95 97 100 101 102
			68 70 76 12 77 81 82 19 84 23 24 89 32 98 36 37 39 45 48 50 54 20 59 4 
 
	  Bubble
	  K	MCT
	  3		0.01	
 
  LinkClustering
	  BubbleH
	  ET
	  0
			# D_max: 0.591406681046
			# S_max: 0.596774193548
			# Communities: 79
			# Average size : 8
			# Nodes : 95 (seen 635 times)
			3 72
			51 73
			58 30 77 50 102 88 89 97 53 67 82 69 81 86 87 84 85 24 20 21 48 23 46 45 43 41 4 7 6 8 68 11 83 13 99 76 75 38 70 91 90 100 101 95 29 79 78 15 10 39 12 59 14 17 98 19 54 31 16 37 36 35 55 63 32 61 103
			25 72
			62 45
			74 65
			72 35
			77 19 74 20
			19 9 77
			25 84
			103 67 47
			3 30
			25 65
			44 73
			74 73
			9 2
			83 61 88 67 102 90 100 81 95 87 79 85 11 10 101 21 17 16 86 41 2 7 103
			25 74
			27 74
			91 10 17 16 30 88 41 61 62 102 83 101 86 79
			61 88 67 91 83 100 101 86 87 97 85 69 95 57 30 41 7 103 78 31 90
			3 74
			19 98 66 24
			76 64
			4 45 50 72
			102 63 30 33 101
			62 65
			24 20 48 19 32 37 77 98 89 73 54 59
			10 21 61 88 63 90 91 83 101 102 94 97 85 95 25 86 58 17 16 55 18 30 43 53 41 46 69 7 8 103 78
			1 83 30
			9 72
			98 36 77 76 89 70 68 20 84 24 39 12 59 48 23 19 54 82 45 37 50 4 32
			30 42 101
			18 84
			42 72
			90 28 103
			44 27
			25 3
			25 35
			25 62
			102 88 91 90 80 101 95 87 79 85 11 10 81 15 21 14 16 46 86 31 30 61 43 35 41 97 7 6 9 8 18
			60 81 30
			9 52
			52 101 15 16 30 41
			64 73
			39 27 19 37 50 36 98 89 4 68
			99 63 73 102 83 101 95 87 97 85 11 10 38 15 14 17 16 46 31 30 53 41 94 6 8
			73 72
			83 29 88 65 90 102 103 80 86 79 78 10 38 15 21 55 18 31 43 35 46 69 7
			61 63 66 67 102 83 100 101 95 85 10 14 16 46 41 55 69 7 8 78 90
			11 13 58 17 16 47 31 30 29 88 95 83 101 86
			23 34
			93 40
			9 57
			21 16 30 40 41 61 88 6 102 101 86 14
			27 66
			67 43 61 88 63 53 90 69 80 81 86 87 85 21 46 95 29 41 79 7 6 8 83 99 75 91 103 100 101 102 94 97 78 11 10 13 38 15 58 17 16 55 18 31 30 35 34 14
			9 42
			11 13 14 17 30 35 41 61 88 6 93 101 86 87 79
			73 66
			74 72
			3 53
			87 15 95 51 43 41 88 67 102 8 103 69 100
			7 21 43 99 75 64 95 90 101 86 79 78 10 58 55 18 31 30 29 88 6 8
			3 62
			10 88 72 17 69 85 87 53 78
			62 74
			62 28
			9 74
			77 42
			60 98 20 54 77 50
			2 51
			12 2
			91 10 101 14 31 30 53 34 99 61 74 102 83 41 95 78
			57 45 4 23
			9 25
			14 61 44 30 53 41 102 88 7 6 91 83 101 95 79
			11 83 99 61 88 63 90 102 103 100 81 69 94 97 85 91 10 27 15 21 14 16 55 95 31 30 35 41 80 79 101 6 8
 
	  Bubble
	  ET
		0.003
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/mit-nov/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:51 +0100}
			# number_of_communities = 11
			# avg size = 9
			# nodes = 95
			19 20 24 32 12 54 98 76 89 50 77 36 70 48 68 4 84 45 37 82 59 23 39 81 60
			30 102 86 14 6 91 61 41 16 101 88 79 18 53 35 63 44 9 40 93 52 72 33 42
			15 97 85 100 80 83 38 73 1
			90 95 29 46 58 8 7 75 43 66 64 94
			67 103 87 2 51 57 47
			21 55 78 34
			11 17 13
			31 10
			69 27 99
			25 65 74 3
			62 28
 
InfoCom2005
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.5	0.5	0.1
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.6	0.9	0.1
 
			# Communities AverageMembers
			# 2 27
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  4	0.009
 
			# k = 4
			# minimum_connected_ratio = 0.009
			# number_of_communities = 1
			# community_sizes = [37]
			# average_size = 37.0
			1 2 3 4 5 6 7 8 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 37 38 39 40 41 
 
	  Bubble
	  K	MCT
	  2	0.11
 
  LinkClustering
	  bubbleH
	  ET
	  0.008
			# D_max: 0.398456726708
			# S_max: 0.368421052632
			# Communities: 15
			# Average size : 7
			# Nodes : 40 (seen 99 times)
			9 30
			21 41
			12 23
			11 25 26 35 15 41 13 14 24 10 39 27 20 21 22 16 33 32 37 29 40 34 1 19 3 5 4 7 6 8 17 18
			38 36
			10 39 15 22 33 37 36 34 1 13
			24 27 39 2 14 35 34
			24 25 13 15 9 5 37 29 34
			18 12
			2 23
			11 39 27 32 30 37 40 35 5 4 28
			26 22 23 33 8 6 18 34
			9 12
			39 38 40 22 19 32 35 4
 
	  bubble
	  ET
	  0.04
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2005/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:59 +0100}
			# number_of_communities = 3
			# avg size = 14
			# nodes = 41
			40 32 39 1 27 22 15 35 29 19 24 17 4 25 8 13 30 10 9 26 14 36 5 34 11 6 16 38 21 3 23 28 41 2 12 31
			33 20 37
			7 18
 
InfoCom2006
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.7	0.5	0.2
			# Communities AverageMembers
			# 2 49
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98
 
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.3	4	100	0.6	0.5	0.2
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.02
 
			# k = 3
			# minimum_connected_ratio = 0.02
			# number_of_communities = 1
			# community_sizes = [58]
			# average_size = 58.0
			22 25 26 27 28 31 32 34 35 36 37 38 39 40 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58 60 61 63 64 66 67 70 72 73 74 75 76 77 78 79 81 82 83 84 85 86 87 88 89 90 91 93 95 98 
 
	  Bubble
	  K	MCT
	  3	0.02
 
  LinkClustering
	  BubbleH
	  ET
	  0.01
			# D_max: 0.352182263797
			# S_max: 0.407407407407
			# Communities: 95
			# Average size : 4
			# Nodes : 78 (seen 387 times)
			86 35
			38 40 49 36 35
			44 56
			83 48
			24 62 27
			68 69 21
			22 67
			36 48
			30 59
			46 44
			56 71
			91 39
			92 48
			58 94
			83 40
			56 94 40
			59 43
			80 65
			56 40 48
			24 60 62 51 75 52 77 76 88 89
			46 39 86
			47 56 40
			77 58 44 45 51 43 41 60 75 89 73 71 70 64 80 92 95 93 74
			25 38 58 49 32 57 56 88 40 43 60 75 89 53 52 86
			68 48
			46 26 39
			26 80
			39 74 89 46 54 88 53 52
			26 56
			74 97
			56 29
			54 48
			26 97
			46 85
			60 25 75 22 57 51 78
			24 62
			54 41
			76 58 82 32 44 30 51 52
			38 22 49 56 40
			47 22
			87 65 72
			26 58 53 43
			31 65
			83 73 23
			24 80
			88 57 70 28 50 77 75 89 66 67 82 31 93 81 44 84 85
			42 48
			24 62 48
			69 21
			47 45 28 43 60 88 89 73 79 64 93 81 86 84 78
			54 22
			73 29
			56 73 40 49
			80 59
			45 28
			24 62 36
			68 65
			22 48
			65 96
			42 29
			86 85
			25 44
			71 41
			51 21
			63 38 59
			47 92
			25 67
			31 35 49
			29 43
			89 58 43 37 36 53 34
			39 30
			22 85
			97 85
			92 71
			74 64 73 68 51 45 86 43 52
			76 66 91 90 31 96 93 79 78
			59 52
			77 58 32 33 86 45 51 43
			25 65
			83 23
			46 39 59
			30 48
			33 48
			46 39
			48 51 53 76 89 64 66 82 73 80 84 78
			27 49 55 56 35 61 63 40 38
			25 26 74 54 57 75 52
			26 22
			68 47
			69 65
			61 27 51 43
			28 60 88 89 64 66 67 82 80 81 86 87 84 25 44 45 42 43 34 77 76 75 74 73 72 70 91 90 93 95 79 78 51 58 98 32 57 37 50 53 52 54 31
			27 48
			37 59 34
 
	  Bubble
	  ET
	  0.09  
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2006/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:18 +0100}
			# number_of_communities = 5
			# avg size = 16
			# nodes = 78
			78 93 76 66 60 74 64 82 77 81 45 72 84 88 28 43 89 90 44 53 51 67 79 87 37 70 75 86 91 56 34 52 73 48 58 40 95 41 32 80 71 50 31 54 26 57 85 47 92 98 36 24 30 33 59 42 62 96 65 23 29 83 94 97 21
			38 35 49 63 55 27 61
			25 22
			39 46
			68 69
 
Cambridge
  HGCE
	  BubbleH
	  K	A	M	E		P	S	R	ST	MAP	Z
	  5	1.6	0	0.25	0.2	4	100	0.6	0.5	0.1
			# Communities AverageMembers
			# 4 23
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
			1 2 3 4 5 6 7 9 10 11 13 14 15 17 18 19 20 21 23 25 28 29 30 32 33 34
			2 3 5 6 7 8 9 10 11 12 13 16 18 20 21 22 25 27 28 29 31 32 33 34 35 36
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 23 24 26 29 30 31 33 35
 
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  5	1	0	0.25	0.3	4	100	0.9	0.9	0.2
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  2	0.02
			# k = 2
			# minimum_connected_ratio = 0.02
			# number_of_communities = 3
			# community_sizes = [2, 2, 11]
			# average_size = 5.0
			1 4
			24 26
			32 34 3 36 7 16 21 22 25 27 28 
 
	  Bubble
	  K	MCT
	  2	0.04
 
  LinkClustering - bubbleH
	  ET
	  0.004
			# D_max: 0.75851421306
			# S_max: 0.5
			# Communities: 12
			# Average size : 5
			# Nodes : 34 (seen 55 times)
			11 25 27 21 22 16 32 28 36 34 3 12 7
			9 5 19 15 33
			2 34
			26 35
			3 6
			13 15 6 19 20
			9 13
			10 20 14 17 23 19 18 30 1 33 2 5 4 6 15
			2 35
			24 26
			11 21 22 29 7
 
	  Bubble
	  ET
	  0.003
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/cambridge/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:24 +0100}
			# number_of_communities = 3
			# avg size = 12
			# nodes = 36
			34 21 36 22 25 32 28 3 7 16 27 12 11 29 8
			18 33 15 14 5 20 19 4 6 2 23 1 17 30 10 13 9 31
			26 35 24

I originally made the decision to use seperate parameters for Bubble and BubbleH, but in hindsight this was not a good idea, as they cannot easily be compared.

When we discussed the results, we realised the KCLIQUE seems to perform well for latency with BubbleH, despite not having much hierarchical structure. Pádraig suggested I investigate this further, to find out what is happening. In the mean time I have re-run the simulations, using the best parameters for BubbleH – to level the playing field in terms of parameters. As is obvious from the above community structures, only HGCE produces communities with one all encompassing community, and in the case of KCLIQUE for both InfoCom datasets, it produces only one community – this seems odd, as we would expect the best performance from a more complicated structure.

This resulted in the following community structure.

Community Structures

MIT-NOV
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    21	        15.80952381     	4	        87
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          20.25	            4	        46
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    11          8.636363636     	2	        25
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    78	          8.141025641     2       	67
 
INFOCOM-2005
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          40     	          40	      40
 
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          38     	          38	      38
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3	          13.66666667     	2	        36
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    14          7.071428571     	2	        32
 
INFOCOM-2006
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          73     	          73	      73
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          59                59       	59
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    5	          15.6            	2       	65
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    94          4.117021277     	2	        43
 
CAMBRIDGE
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          28.75     	      23	      36
 
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    3           6     	          3       	12
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3           12                3         18
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    11          5	                2	        15

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each Dataset, for BubbleRAP and BubbleH.

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each Dataset, for BubbleRAP and BubbleH. (OLD PLOT - SEE BELOW FOR NEW PLOT)

In the results above, there seems to be something odd happening with LinkClustering for BUBBLERap, for Cost and Hops in MIT-NOV and InfoCom2005 it has much higher values than for the other CFAs. Also, there appears to be something odd happening with Latency in the Cambridge dataset, with a maximum of ~1000 hours latency which is about 40 days. The MIT dataset is roughly the same length, but doesn’t get anywhere near the same latency.

The two plots below show Delivery Ratio and Latency for BubbleH, HGCE for the Cambridge dataset.

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

These seem to show that at the points where the new parts of the dataset are joined, there is a huge jump in deliveries of messages. It also seems that the datasets are truncated at the end. I will double check the start/end times, but this could be due to the original dataset having no events after a certain timestamp, and therefore, in the last section, the simulation stops when there are no events left to process. The huge spike in messages delivered explains the very high and very low extremes – as messages that were sent in the last iteration of message sending, get delivered very quickly. It seems odd however, that the same is not true for the first iteration. It could be, that because the last set have started to be distributed, maybe only one hop away, these nodes are already in a much better position to deliver the nodes at the start of the next section. This could be due to higher activity in the first hours of the dataset collection, perhaps even due to all the nodes being switched on in the same place at the same time. If this is the case, then we need to start the dataset slice slightly later, to avoid this. The activity traces for each dataset is shown below:

MIT-Nov Daily Connections

MIT-Nov Daily Connections

Cambridge Average Daily Connections

Cambridge Daily Connections

Infocom2005 Hourly Connections

Infocom2005 Hourly Connections

Infocom2006 hourly connections

Infocom2006 hourly connections

From these we can see that MIT has a good distribution of connectivy across the dataset, but the other three have only short spikes of activity. This in itself does not make a difference to the comparison of resulst for each datasets, as they are all relative, but it does go some way to explain the very high latency figures for Cambridge.

The issues with KCLIQUE

I looked back at the data for KCLIQUE in the NGMAST simulations, where there were multiple parameters to KCLIQUE, resulting in varying degrees of community structure, for all datasets, the delivery ratio in KCLIQUE seems to increase with member count, whereas community count peaks before delivery ratio really gets going.

commsvsstats-combined

Delivery Ratio, Cost, Community Count and Average Member count.

This makes sense, because for BubbleH, the more we know about members of a the communities, the more messages we can direct using local ranking. And, because we haven’t included an overall community for KCLIQUE (or InfoMap*, or LinkClustering) we have no way to deliver messages to nodes not in a community, until a node holding a message comes into contact with the destination node. I’m not sure if we designed it this way or whether this is just a bug.

The next step is to fix this issue, by either

  1. adjusting the BubbleH algorithm to use global rank to pass messages between nodes at the highest level
  2. Force a global community in the other community finding algorithms (this will have the same effect as 1)

* InfoMap includes every node in the communities list, but does not include a global parent community

Update:

I have now run all of the datasets with a forced global parent community, this seems to have evened the playing field somewhat:

matrixes-global-parent

Results where every CFA is forced to produce a global parent community including all nodes. (NOTE the change in axis labels from previous run)

The interesting thing is that there is very little difference in delivery ratio and cost using BUBBLERap.

Results for NGMAST11/NEMO

May 23rd, 2011 No comments

We have decided not to submit to NGMAST11 conference as the results we have were not really ready, also I found an error in the config setup for KCLIQUE BubbleH. I have therefore run all of the simulations from scratch, just to make sure it is all correct. In this re-run, I ran the HGCE algorithms far too many times for each dataset, so there is a very large number of results for the HGCE algorihtm (~700 for each dataset). We plan to submit to the Finding NEMO workshop instead.

To visualise the results so far, we have plotted the best  (see note below) results for BubbleRAP and BubbleH:

matrix

BUBBLERap and BubbleH compared for Delivery Ratio, Cost and Latency

To determine the ‘best’ run for algorithms with multiple parameters, we first filtered the results to the best 10 for latency, then ordered by delivery ratio followed by cost, this should mean that a good delivery ratio is still shown for good delivery latency, but avoids results where a abnormally low delivery ratio is chosen with a very poor delivery ratio.

Initial results – HGCE, KCLIQUE, InfoMap and LinkClustering

May 7th, 2011 No comments

We ran each community finding algorithm over the datasets: MIT-NOV, Cambridge, InfoCom 2005, and InfoCom 2006. We used bubbleRAP, BubbleH, Unlimited flood, hold & wait, and Prophet to measure the effect of the CFA on the dataset. The results below are grouped by dataset. Only the best results for BubbleH/BubbleRAP KCLIQUE and HGCE are shown. Note: The local ranking used for InfoMap is based on the rank given by the InfoMap algorithm, NOT centrality as with BubbleRAP and BubbleH.

MIT-NOV

MIT-NOV

CAMBRIDGE

CAMBRIDGE

IncoCom 2005

IncoCom 2005

InfoCom 2005

InfoCom 2005

My next plan is to link statistics about community structure (number of communities, average members etc.) to the results.

Also, I want to find out if there is an existing measure of overlappiness? Does the amount of overlap in communities affect the results in any way? Also, what about the extent to which there is ‘nesting’ of communities, does this make a difference?

I also want to visualise the communities produced in some way, so that it is clearer what the network structure for the best results is based on.

This excel file this excel file contains the combined results of each CFA for all 4 datasets, so we can compare the effect of community structure (number of communities, average member size) directly with results for each simulation run.

BubbleH vs. the rest

March 15th, 2011 No comments

After a few iterations of bug fixes in the Bubble H code, it finally gave some sensible results. Shown below for MIT-OCT and MIT-NOV, where the community finding was done using a training set, and the test was done on the test set. Results are compared with the previous CFAs and also the  routing schemes we originally compared to (Bubble, PBR, Prophet, Unlimited Flood). GCEH was run multiple times with varying parameters, and the output chosen to drive BubbleH was chosen by picking a good looking result, this is a flaw in the process.

For MIT-OCT, the data in mit-oct-training.gce_output_K4_ST0.5_MAP0.5_e0.25_0.2.dat was used, which looks like:

  0(101)
  |
   - 1(71)
  |
   - 4(40)
  |  |
  |   - 2(16)
  |  |
  |   - 3(20)
  |  |
  |   - 5(28)
  |
   - 6(63)
  |  |
  |   - 7(47)
  |  |  |
  |  |   - 8(24)
  |
   - 9(66)

For MIT-NOV the data in mit-nov-training.gce_output_K3_ST0.5_MAP0.5_e0.15_0.2.dat was used, which looks like:

  0(101)
  |
   - 1(39)
  |  |
  |   - 2(16)
  |  |
  |   - 3(22)
  |  |  |
  |  |   - 4(14)
  |  |
  |   - 5(8)
  |
   - 6(30)
  |
   - 7(24)
  |
   - 8(63)
MIT-NOV

MIT-OCT

MIT-NOV

MIT-NOV

It is clear that MIT-NOV seems to have a better delivery ratio overall, this is probably due to the increased activity in December (MIT-NOV-TEST), compared to that in November (MIT-OCT-TEST), as seen in the activity plot below.

The average number of connections (bluetooth contacts) per week in the MIT Dataset

The average number of connections (bluetooth contacts) per week in the MIT Dataset

A thorough investigation would mean running the output of the many parameters used for the GCEH algorithm, and running the simulation over the whole lot – complicated, but possible.

UPDATE: Using MIT-NOV-CHEAT dataset – i.e. allowing the use of data from the test period, gives a much better result – see below.

MIT-NOV-CHEAT

MIT-NOV-CHEAT

The candidate hierarchy was taken from similar parameters as previously: edge_list.dat.gce_output_K-3_ST-0.5_MAP-0.5_E-0.15_Z-0.2.dat which looked like:

  2(77)
  |
   - 0(52)
  |  |
  |   - 1(18)
  3(79)

This was slightly different from the previous hierarchies, in that it has less communities, but it demonstrated an accurate looking community structure – i.e. 3 communities. The results indicate that BubbleH does better here than previous attempts, which is encouraging. I also tried attempted to run the KCLIQUE version of BubbleRAP, but, as discussed before, it cannot find any communities in the training+test period.

Categories: experiments