Archive

Archive for the ‘discussions’ Category

Temporal Networks

May 31st, 2012 2 comments

Pádraig and I discussed time-respecting paths in the datasets we have, based on the paper Temporal Networks [1].

We want to detect the shortest/quickest paths between pairs of nodes that respect temporal ordering, so that we can derive a betweenness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network and calculates the betweenness centrality using the number of times a node is used in ANY path – the BubbleRAP implementation uses only the SHORTEST paths, so there is some discrepancy here. The BubbleRAP paper reads as follows:

To calculate the individual centrality value for each node, we take a numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the delivery with the shortest delay. We call this number the betweenness centrality of this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality [2].

 

The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly –  further inspection may yield a better understanding 🙂

There is a question over what observation period we should use, and how to decide which is really the 'shortest path' when paths may start/end at different times. In the BubbleRAP paper, they seem to have chosen the path with the least latency, without considering the start time of the path. In the temporal networks paper [1], the authors talk about a saw-tooth pattern that governs latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern – but not really for the betweenness centrality measure.

E.g. if we start at time 0 and measure the time it takes to get to the destination node, that is fine. However, what if there is an intermediate time, say 4, from which, if we measure the path from there, we find a new shortest path – in BubbleRAP this is taken as the shortest path, but is this correct?

I will implement the time-respecting shortest-path (latency) betweenness centrality (and maybe the shortest path in hops) as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.
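As a starting point, here is a minimal sketch (not the simulator's code) of how time-respecting earliest-arrival paths could be computed from a list of timestamped contacts, assuming instantaneous transfer on contact; the Contact record, field names and toy trace are all illustrative:

import java.util.*;

/** Minimal sketch: earliest-arrival (time-respecting) paths from a contact trace. */
public class TimeRespectingPaths {

    /** A contact event: node u meets node v at a given time (assumed symmetric). */
    record Contact(int u, int v, long time) {}

    /**
     * Earliest arrival time at every node for a message created at src at startTime,
     * assuming a copy is transferred instantaneously whenever a carrier meets a node.
     * Contacts must be sorted by time.
     */
    static Map<Integer, Long> earliestArrival(List<Contact> contacts, int src, long startTime) {
        Map<Integer, Long> arrival = new HashMap<>();
        arrival.put(src, startTime);
        for (Contact c : contacts) {
            if (c.time() < startTime) continue;  // respect the start time of the path
            Long au = arrival.get(c.u()), av = arrival.get(c.v());
            // the message can cross the contact in either direction, but only
            // if one side already holds a copy at this time
            if (au != null && au <= c.time() && (av == null || c.time() < av))
                arrival.put(c.v(), c.time());
            if (av != null && av <= c.time() && (au == null || c.time() < au))
                arrival.put(c.u(), c.time());
        }
        return arrival;
    }

    public static void main(String[] args) {
        List<Contact> trace = List.of(  // toy trace, sorted by time
                new Contact(0, 1, 2), new Contact(1, 2, 8), new Contact(0, 2, 13));
        System.out.println(earliestArrival(trace, 0, 0)); // {0=0, 1=2, 2=8}
    }
}

Betweenness would then come from recording which relay produced each first delivery, and counting a node's appearances over many (source, start time) pairs, as in the BubbleRAP quote above.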

[1] Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: http://arxiv.org/abs/1108.1780. Accessed May 15, 2012.

[2] Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: http://www.jstor.org/stable/10.2307/3033543. Accessed May 31, 2012.

Ideas about message routing

December 14th, 2011 No comments

Aaron (McDaid) and I had an interesting conversation about what we mean by message routing.

We first discussed what it means to have a single message, or a number of message copies, and coined 'ink and paper' for the 'cost' associated with making a copy of a message and passing it to another node, rather than just calling it 'cost'.

We decided that in my simulations, 'cost' did not exactly mean this: even if a node sends one message to a node it meets and keeps a copy in its archive, the 'cost' is exactly the same as if it discarded the message.
Whereas in our ink-and-paper scenario, there would be an additional cost of one 'ink and paper' for every 'copy' of the message sent, unless the original message was passed on.

I mentioned some ideas about how to estimate global properties from a local viewpoint (i.e. Graham's periodicity paper combined with vector clocks).

We then went on to pin down what 'periodic' means in terms of predicting future events. Aaron was concerned that I actually meant 'often' as opposed to 'regular': 'often' implies that some event is repeated, but not necessarily at a regular interval, whereas 'regular' implies that the event is repeated at the same interval (e.g. we meet every Thursday). He suggested that I be very careful when using the word 'periodic'.

He also expressed concern over my liberal use of the term 'dynamic networks', and suggested that the network may be better described as a temporal network, as edges are made and broken over time.

We then spent some time discussing a mechanism to determine optimum routes through a series of interactions in a temporal network of people. (I am pretty sure I have read a paper that does something similar to what follows).

Aaron suggested we model contact events as nodes, so we worked through this idea:
We have 5 Nodes A to E.
The first event at time= 0 is where each node meets itself.
A new event node is created when two nodes meet: an edge between each node and the event is created, and an edge between a previous event and the new event is created if at least one of the nodes was in the previous meeting.
Edges are labelled by the amount of time since the last meeting.

[Figure: network of contact events]
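To make the construction concrete, here is a minimal sketch of building such an event graph from a time-sorted list of meetings; the initial self-meeting events at time 0 are omitted for brevity, and the Meeting/EventEdge records are illustrative:

import java.util.*;

/** Sketch of the event-graph construction: contact events become nodes, linked
 *  when they share a participant, with edges labelled by the elapsed time. */
public class EventGraph {
    record Meeting(Set<Integer> participants, long time) {}
    record EventEdge(int fromEvent, int toEvent, long elapsed) {}

    /** Build the event graph from a list of meetings sorted by time. */
    static List<EventEdge> build(List<Meeting> meetings) {
        Set<EventEdge> edges = new LinkedHashSet<>(); // dedupes when both nodes share the previous event
        Map<Integer, Integer> lastEventOf = new HashMap<>(); // node -> index of its last event
        Map<Integer, Long> lastTimeOf = new HashMap<>();     // node -> time of that event
        for (int e = 0; e < meetings.size(); e++) {
            Meeting m = meetings.get(e);
            for (int node : m.participants()) {
                Integer prev = lastEventOf.get(node);
                if (prev != null)  // link this node's previous event to the new one,
                    edges.add(new EventEdge(prev, e, m.time() - lastTimeOf.get(node)));
                lastEventOf.put(node, e);        // labelled with the elapsed time
                lastTimeOf.put(node, m.time());
            }
        }
        return List.copyOf(edges);
    }
}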

At the end of this period, we are able to determine the best 'routes' for sending information between nodes. There are two kinds of optimal route: the time-optimal route and the hop-optimal route. Information about A received by C can travel from A to B to C to D (3 hops, at a cost of 2 + 6 + 4 = 12), or between A and C directly (1 hop, at a cost of 13).

In the case of C to D, there are two routes: C to B to D (cost 12) or C to D directly (cost 12), so how do we decide which to use? Well, we can assume that the best route is the one that minimises time primarily and hops secondarily. If we add a small epsilon value, e.g. 0.00001, to the time of each hop, this gives us:

C to B to D = 2 hops = 8.00001 + 4.00001  = 12.00002

vs

C to D  = 1 hop = 12.00001

So the second, shorter (in hops) route takes precedence, as its total time is now less: we have chosen the smaller number of hops when the time period would otherwise have been equal. The reasoning is that it is better to 'copy' (ink and paper) fewer times when the total time is the same.
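A minimal sketch of this tie-breaking rule, assuming the per-hop times are known; the epsilon value and the two candidate routes are the ones from the example above:

/** Sketch: compare candidate routes by total time plus an epsilon per hop,
 *  so that equal-time routes resolve in favour of fewer hops. */
public class RouteChooser {
    static final double EPSILON = 0.00001;

    /** Effective cost = total time + epsilon per hop. */
    static double effectiveCost(double[] hopTimes) {
        double total = 0;
        for (double t : hopTimes) total += t + EPSILON;
        return total;
    }

    public static void main(String[] args) {
        double[] viaB   = {8, 4};  // C -> B -> D: 2 hops, total time 12
        double[] direct = {12};    // C -> D: 1 hop, total time 12
        // 12.00002 vs 12.00001: the direct (fewer-hop) route wins
        System.out.println(effectiveCost(viaB) + " vs " + effectiveCost(direct));
    }
}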

The only case where there is a conflict is when there are an equal number of hops and the information is delivered after the same time period. In this case we might adopt one of a few strategies:

  1. Keep a record of all the equally optimal paths
  2. Record all paths, with a rank of efficiency
  3. Choose only paths with the most recent event
  4. Choose only paths with the oldest event
  5. Choose a path at random from the optimum paths

If we now wished to predict routes, Aaron suggested that an exhaustive mechanism would be to predict all possible future connections (assuming existing connections are regular and periodic), calculate the best routes, and send messages accordingly. This would be quite intensive, and perhaps infeasible.

Perhaps a better strategy is to use our knowledge of existing routes: on meeting a node that is on the optimal path to the destination of a message we hold, pass the message on; otherwise, keep it.

We didn’t get as far as solving the problem of how to predict the future connections, and therefore calculate the optimal path.

Extending Datasets

May 31st, 2011 No comments

Pádraig had the idea of trying to eliminate delivery ratio from our analysis, so that we could concentrate specifically on cost and latency. He suggested that we take the normal dataset and extend it to three times its length, repeating the events in the correct time order. We could then run the simulations 10 times, each with a different start time in the original portion of the dataset; this would give each message a much better chance of being delivered over the three iterations of the dataset.

[Figure: extended datasets]

I implemented this by taking a slice of the MIT dataset, processing the events contained within it, and duplicating them, adding the length of the slice times the iteration number to each timestamp (i.e. MIT-NOV, MIT-NOV + 1 length, MIT-NOV + 2 lengths). Any contact events that were active at the start or end time of the slice were truncated to start/end at those times. I verified the resulting dataset by counting the number of connections every day and checking that there was a distinct repeating pattern. For each dataset, I split the duration between the start and end times of the original dataset into 10 equal chunks and calculated the corresponding start times. For MIT and Cambridge this equates to roughly every three days, whereas in the InfoCom datasets there is a much smaller gap between start times (~14–20 hours).
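A minimal sketch of this extension, with a hypothetical Event record standing in for the simulator's contact events; the truncation and timestamp shifting follow the description above:

import java.util.*;

/** Sketch of the dataset extension: repeat a slice of the trace, shifting
 *  timestamps by the slice length for each iteration. */
public class ExtendDataset {
    record Event(int a, int b, long start, long end) {}

    /** Repeat the slice [sliceStart, sliceEnd) a number of times, truncating
     *  events that overlap the slice boundaries. */
    static List<Event> extend(List<Event> events, long sliceStart, long sliceEnd, int copies) {
        long len = sliceEnd - sliceStart;
        List<Event> out = new ArrayList<>();
        for (int i = 0; i < copies; i++)
            for (Event e : events) {
                long s = Math.max(e.start(), sliceStart), t = Math.min(e.end(), sliceEnd);
                if (s < t) out.add(new Event(e.a(), e.b(), s + i * len, t + i * len));
            }
        return out;
    }

    /** The simulation start times: the original period split into n equal chunks. */
    static long[] startTimes(long sliceStart, long sliceEnd, int n) {
        long[] starts = new long[n];
        for (int i = 0; i < n; i++) starts[i] = sliceStart + i * (sliceEnd - sliceStart) / n;
        return starts;
    }
}

For MIT-NOV (31 days), startTimes with n = 10 yields a start every 74.4 hours, i.e. every 3 days 2:24, matching the listing below.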

start times

mit-nov 2004-11-10_00:00:00 - 2004-12-11_00:00:00
2004-11-10_00:00:00
2004-11-13_02:24:00
2004-11-16_04:48:00
2004-11-19_07:12:00
2004-11-22_09:36:00
2004-11-25_12:00:00
2004-11-28_14:24:00
2004-12-01_16:48:00
2004-12-04_19:12:00
2004-12-07_21:36:00
 
infocom-2005 2005-03-06_00:00:00 - 2005-03-12_00:00:00
2005-03-06_00:00:00
2005-03-06_14:24:00
2005-03-07_04:48:00
2005-03-07_19:12:00
2005-03-08_09:36:00
2005-03-09_00:00:00
2005-03-09_14:24:00
2005-03-10_04:48:00
2005-03-10_19:12:00
2005-03-11_09:36:00
 
infocom-2006 2006-04-24_00:00:00 - 2006-05-02_00:00:00
2006-04-24_00:00:00
2006-04-24_19:12:00
2006-04-25_14:24:00
2006-04-26_09:36:00
2006-04-27_04:48:00
2006-04-28_00:00:00
2006-04-28_19:12:00
2006-04-29_14:24:00
2006-04-30_09:36:00
2006-05-01_04:48:00
 
cambridge 2005-10-28_00:00:00 - 2005-12-23_00:00:00
2005-10-28_00:00:00
2005-11-02_13:30:00
2005-11-08_04:00:00
2005-11-13_18:30:00
2005-11-19_09:00:00
2005-11-24_23:30:00
2005-11-30_14:00:00
2005-12-06_04:30:00
2005-12-11_19:00:00
2005-12-17_09:30:00

Running Simulations

We decided to remove the overhead of running the CFAs with multiple parameters by choosing, for each CFA, a set of parameters that had performed well in previous tests. The list below shows the ones we chose:

CFA Parameters

MIT-NOV
	HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1.3	0	0.25	0.2	4	100	0.7	0.5	0.1
 
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  4	1.3	0	0.25	0.3	4	100	0.9	0.9	0.1
			# Communities AverageMembers
			# 21 15
			2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 29 30 31 32 34 35 36 37 38 39 41 42 43 45 46 47 48 50 51 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 94 95 97 98 99 100 101 102 103
			6 8 14 16 18 30 41 46 53 59 61 63 78 79 86 88 91 95 101 102
			6 13 17 29 30 47 86 88 95 101
			10 25 62 65 74 83 102
			3 25 30 62 74
			4 23 30 45 57 91
			9 30 42 101
			19 27 35 37 39 81 89 102
			2 41 51 87 88 95 102 103
			15 41 43 51 67 69 87 88 95 100 103
			19 27 41 66 85 98
			15 43 46 58 61 67 69 78 79 83 85 90 95 97 100 103
			2 7 11 15 57 61 67 69 78 79 85 87 88 90 95 97 100 101 103
			6 7 8 10 11 14 15 16 17 21 30 31 34 35 38 41 43 46 55 58 61 69 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102
			8 10 15 16 17 30 38 41 46 53 63 73 80 83 85 95 97 99 101 102
			10 14 18 25 35 46 65 83 86 88 90 102
			10 17 25 35 53 72 85 88
			10 25 53 72 74 78
			27 30 50 95 97 99
			4 12 19 20 23 24 27 30 32 36 37 39 41 45 48 50 54 59 60 68 70 76 77 81 82 84 89 98
			19 27 37 39 41 68 81 98
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.01
			# k = 3
			# minimum_connected_ratio = 0.01
			# number_of_communities = 4
			# community_sizes = [5, 3, 45, 24]
			# average_size = 19.25
			103 2 67 100 87
			99 27 69
			6 7 8 10 11 13 14 15 16 17 18 21 29 30 31 34 35 38 41 43 44 46 53 55 58 59 61 63 67 73 75 78 79 80 83 85 86 88 90 91 95 97 100 101 102
			68 70 76 12 77 81 82 19 84 23 24 89 32 98 36 37 39 45 48 50 54 20 59 4 
 
	  Bubble
	  K	MCT
	  3		0.01	
 
  LinkClustering
	  BubbleH
	  ET
	  0
			# D_max: 0.591406681046
			# S_max: 0.596774193548
			# Communities: 79
			# Average size : 8
			# Nodes : 95 (seen 635 times)
			3 72
			51 73
			58 30 77 50 102 88 89 97 53 67 82 69 81 86 87 84 85 24 20 21 48 23 46 45 43 41 4 7 6 8 68 11 83 13 99 76 75 38 70 91 90 100 101 95 29 79 78 15 10 39 12 59 14 17 98 19 54 31 16 37 36 35 55 63 32 61 103
			25 72
			62 45
			74 65
			72 35
			77 19 74 20
			19 9 77
			25 84
			103 67 47
			3 30
			25 65
			44 73
			74 73
			9 2
			83 61 88 67 102 90 100 81 95 87 79 85 11 10 101 21 17 16 86 41 2 7 103
			25 74
			27 74
			91 10 17 16 30 88 41 61 62 102 83 101 86 79
			61 88 67 91 83 100 101 86 87 97 85 69 95 57 30 41 7 103 78 31 90
			3 74
			19 98 66 24
			76 64
			4 45 50 72
			102 63 30 33 101
			62 65
			24 20 48 19 32 37 77 98 89 73 54 59
			10 21 61 88 63 90 91 83 101 102 94 97 85 95 25 86 58 17 16 55 18 30 43 53 41 46 69 7 8 103 78
			1 83 30
			9 72
			98 36 77 76 89 70 68 20 84 24 39 12 59 48 23 19 54 82 45 37 50 4 32
			30 42 101
			18 84
			42 72
			90 28 103
			44 27
			25 3
			25 35
			25 62
			102 88 91 90 80 101 95 87 79 85 11 10 81 15 21 14 16 46 86 31 30 61 43 35 41 97 7 6 9 8 18
			60 81 30
			9 52
			52 101 15 16 30 41
			64 73
			39 27 19 37 50 36 98 89 4 68
			99 63 73 102 83 101 95 87 97 85 11 10 38 15 14 17 16 46 31 30 53 41 94 6 8
			73 72
			83 29 88 65 90 102 103 80 86 79 78 10 38 15 21 55 18 31 43 35 46 69 7
			61 63 66 67 102 83 100 101 95 85 10 14 16 46 41 55 69 7 8 78 90
			11 13 58 17 16 47 31 30 29 88 95 83 101 86
			23 34
			93 40
			9 57
			21 16 30 40 41 61 88 6 102 101 86 14
			27 66
			67 43 61 88 63 53 90 69 80 81 86 87 85 21 46 95 29 41 79 7 6 8 83 99 75 91 103 100 101 102 94 97 78 11 10 13 38 15 58 17 16 55 18 31 30 35 34 14
			9 42
			11 13 14 17 30 35 41 61 88 6 93 101 86 87 79
			73 66
			74 72
			3 53
			87 15 95 51 43 41 88 67 102 8 103 69 100
			7 21 43 99 75 64 95 90 101 86 79 78 10 58 55 18 31 30 29 88 6 8
			3 62
			10 88 72 17 69 85 87 53 78
			62 74
			62 28
			9 74
			77 42
			60 98 20 54 77 50
			2 51
			12 2
			91 10 101 14 31 30 53 34 99 61 74 102 83 41 95 78
			57 45 4 23
			9 25
			14 61 44 30 53 41 102 88 7 6 91 83 101 95 79
			11 83 99 61 88 63 90 102 103 100 81 69 94 97 85 91 10 27 15 21 14 16 55 95 31 30 35 41 80 79 101 6 8
 
	  Bubble
	  ET
		0.003
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/mit-nov/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:51 +0100}
			# number_of_communities = 11
			# avg size = 9
			# nodes = 95
			19 20 24 32 12 54 98 76 89 50 77 36 70 48 68 4 84 45 37 82 59 23 39 81 60
			30 102 86 14 6 91 61 41 16 101 88 79 18 53 35 63 44 9 40 93 52 72 33 42
			15 97 85 100 80 83 38 73 1
			90 95 29 46 58 8 7 75 43 66 64 94
			67 103 87 2 51 57 47
			21 55 78 34
			11 17 13
			31 10
			69 27 99
			25 65 74 3
			62 28
 
InfoCom2005
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.5	0.5	0.1
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.6	0.9	0.1
 
			# Communities AverageMembers
			# 2 27
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  4	0.009
 
			# k = 4
			# minimum_connected_ratio = 0.009
			# number_of_communities = 1
			# community_sizes = [37]
			# average_size = 37.0
			1 2 3 4 5 6 7 8 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 37 38 39 40 41 
 
	  Bubble
	  K	MCT
	  2	0.11
 
  LinkClustering
	  BubbleH
	  ET
	  0.008
			# D_max: 0.398456726708
			# S_max: 0.368421052632
			# Communities: 15
			# Average size : 7
			# Nodes : 40 (seen 99 times)
			9 30
			21 41
			12 23
			11 25 26 35 15 41 13 14 24 10 39 27 20 21 22 16 33 32 37 29 40 34 1 19 3 5 4 7 6 8 17 18
			38 36
			10 39 15 22 33 37 36 34 1 13
			24 27 39 2 14 35 34
			24 25 13 15 9 5 37 29 34
			18 12
			2 23
			11 39 27 32 30 37 40 35 5 4 28
			26 22 23 33 8 6 18 34
			9 12
			39 38 40 22 19 32 35 4
 
	  Bubble
	  ET
	  0.04
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2005/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:03:59 +0100}
			# number_of_communities = 3
			# avg size = 14
			# nodes = 41
			40 32 39 1 27 22 15 35 29 19 24 17 4 25 8 13 30 10 9 26 14 36 5 34 11 6 16 38 21 3 23 28 41 2 12 31
			33 20 37
			7 18
 
InfoCom2006
  HGCE
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.2	1	100	0.7	0.5	0.2
			# Communities AverageMembers
			# 2 49
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98
			22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98
 
	  BubbleH
	  K	A	M	E	P	S	R	ST	MAP	Z
	  3	1	0	0.25	0.3	4	100	0.6	0.5	0.2
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  3	0.02
 
			# k = 3
			# minimum_connected_ratio = 0.02
			# number_of_communities = 1
			# community_sizes = [58]
			# average_size = 58.0
			22 25 26 27 28 31 32 34 35 36 37 38 39 40 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58 60 61 63 64 66 67 70 72 73 74 75 76 77 78 79 81 82 83 84 85 86 87 88 89 90 91 93 95 98 
 
	  Bubble
	  K	MCT
	  3	0.02
 
  LinkClustering
	  BubbleH
	  ET
	  0.01
			# D_max: 0.352182263797
			# S_max: 0.407407407407
			# Communities: 95
			# Average size : 4
			# Nodes : 78 (seen 387 times)
			86 35
			38 40 49 36 35
			44 56
			83 48
			24 62 27
			68 69 21
			22 67
			36 48
			30 59
			46 44
			56 71
			91 39
			92 48
			58 94
			83 40
			56 94 40
			59 43
			80 65
			56 40 48
			24 60 62 51 75 52 77 76 88 89
			46 39 86
			47 56 40
			77 58 44 45 51 43 41 60 75 89 73 71 70 64 80 92 95 93 74
			25 38 58 49 32 57 56 88 40 43 60 75 89 53 52 86
			68 48
			46 26 39
			26 80
			39 74 89 46 54 88 53 52
			26 56
			74 97
			56 29
			54 48
			26 97
			46 85
			60 25 75 22 57 51 78
			24 62
			54 41
			76 58 82 32 44 30 51 52
			38 22 49 56 40
			47 22
			87 65 72
			26 58 53 43
			31 65
			83 73 23
			24 80
			88 57 70 28 50 77 75 89 66 67 82 31 93 81 44 84 85
			42 48
			24 62 48
			69 21
			47 45 28 43 60 88 89 73 79 64 93 81 86 84 78
			54 22
			73 29
			56 73 40 49
			80 59
			45 28
			24 62 36
			68 65
			22 48
			65 96
			42 29
			86 85
			25 44
			71 41
			51 21
			63 38 59
			47 92
			25 67
			31 35 49
			29 43
			89 58 43 37 36 53 34
			39 30
			22 85
			97 85
			92 71
			74 64 73 68 51 45 86 43 52
			76 66 91 90 31 96 93 79 78
			59 52
			77 58 32 33 86 45 51 43
			25 65
			83 23
			46 39 59
			30 48
			33 48
			46 39
			48 51 53 76 89 64 66 82 73 80 84 78
			27 49 55 56 35 61 63 40 38
			25 26 74 54 57 75 52
			26 22
			68 47
			69 65
			61 27 51 43
			28 60 88 89 64 66 67 82 80 81 86 87 84 25 44 45 42 43 34 77 76 75 74 73 72 70 91 90 93 95 79 78 51 58 98 32 57 37 50 53 52 54 31
			27 48
			37 59 34
 
	  Bubble
	  ET
	  0.09  
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2006/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:18 +0100}
			# number_of_communities = 5
			# avg size = 16
			# nodes = 78
			78 93 76 66 60 74 64 82 77 81 45 72 84 88 28 43 89 90 44 53 51 67 79 87 37 70 75 86 91 56 34 52 73 48 58 40 95 41 32 80 71 50 31 54 26 57 85 47 92 98 36 24 30 33 59 42 62 96 65 23 29 83 94 97 21
			38 35 49 63 55 27 61
			25 22
			39 46
			68 69
 
Cambridge
  HGCE
	  BubbleH
	  K	A	M	E		P	S	R	ST	MAP	Z
	  5	1.6	0	0.25	0.2	4	100	0.6	0.5	0.1
			# Communities AverageMembers
			# 4 23
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
			1 2 3 4 5 6 7 9 10 11 13 14 15 17 18 19 20 21 23 25 28 29 30 32 33 34
			2 3 5 6 7 8 9 10 11 12 13 16 18 20 21 22 25 27 28 29 31 32 33 34 35 36
			1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 23 24 26 29 30 31 33 35
 
	  Bubble
	  K	A	M	E	P	S	R	ST	MAP	Z
	  5	1	0	0.25	0.3	4	100	0.9	0.9	0.2
 
  KCLIQUE
	  BubbleH
	  K	MCT
	  2	0.02
			# k = 2
			# minimum_connected_ratio = 0.02
			# number_of_communities = 3
			# community_sizes = [2, 2, 11]
			# average_size = 5.0
			1 4
			24 26
			32 34 3 36 7 16 21 22 25 27 28 
 
	  Bubble
	  K	MCT
	  2	0.04
 
  LinkClustering
	  BubbleH
	  ET
	  0.004
			# D_max: 0.75851421306
			# S_max: 0.5
			# Communities: 12
			# Average size : 5
			# Nodes : 34 (seen 55 times)
			11 25 27 21 22 16 32 28 36 34 3 12 7
			9 5 19 15 33
			2 34
			26 35
			3 6
			13 15 6 19 20
			9 13
			10 20 14 17 23 19 18 30 1 33 2 5 4 6 15
			2 35
			24 26
			11 21 22 29 7
 
	  Bubble
	  ET
	  0.003
 
	InfoMap
			# InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/cambridge/edge_list.dat.tree}
			# Generated on {Fri, 27 May 2011 11:04:24 +0100}
			# number_of_communities = 3
			# avg size = 12
			# nodes = 36
			34 21 36 22 25 32 28 3 7 16 27 12 11 29 8
			18 33 15 14 5 20 19 4 6 2 23 1 17 30 10 13 9 31
			26 35 24

I originally made the decision to use separate parameters for Bubble and BubbleH, but in hindsight this was not a good idea, as the two cannot then easily be compared.

When we discussed the results, we realised that KCLIQUE seems to perform well for latency with BubbleH, despite not having much hierarchical structure. Pádraig suggested I investigate this further, to find out what is happening. In the meantime I have re-run the simulations using the best parameters for BubbleH – to level the playing field in terms of parameters. As is obvious from the above community structures, only HGCE produces a structure with one all-encompassing community, and in the case of KCLIQUE for both InfoCom datasets, it produces only one community – this seems odd, as we would expect the best performance from a more complicated structure.

This resulted in the following community structure.

Community Structures

MIT-NOV
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    21	        15.80952381     	4	        87
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          20.25	            4	        46
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    11          8.636363636     	2	        25
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    78	          8.141025641     2       	67
 
INFOCOM-2005
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          40     	          40	      40
 
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          38     	          38	      38
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3	          13.66666667     	2	        36
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    14          7.071428571     	2	        32
 
INFOCOM-2006
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    2	          73     	          73	      73
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    1	          59                59       	59
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    5	          15.6            	2       	65
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    94          4.117021277     	2	        43
 
CAMBRIDGE
  HGCE
    comm_count	avg_member_count	min_nodes	max_nodes
    4	          28.75     	      23	      36
 
  KCLIQUE
    comm_count	avg_member_count	min_nodes	max_nodes
    3           6     	          3       	12
 
  InfoMap
    comm_count	avg_member_count	min_nodes	max_nodes
    3           12                3         18
 
  LinkClustering
    comm_count	avg_member_count	min_nodes	max_nodes
    11          5	                2	        15

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each dataset, for BubbleRAP and BubbleH. (OLD PLOT – SEE BELOW FOR NEW PLOT)

In the results above, there seems to be something odd happening with LinkClustering for BubbleRAP: for Cost and Hops in MIT-NOV and InfoCom2005 it has much higher values than the other CFAs. Also, there appears to be something odd happening with Latency in the Cambridge dataset, with a maximum of ~1000 hours of latency, which is about 40 days. The MIT dataset is roughly the same length, but doesn't get anywhere near the same latency.

The two plots below show Delivery Ratio and Latency for BubbleH with HGCE on the Cambridge dataset.

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

These seem to show that at the points where the new parts of the dataset are joined, there is a huge jump in deliveries of messages. It also seems that the datasets are truncated at the end; I will double-check the start/end times, but this could be because the original dataset has no events after a certain timestamp, so in the last section the simulation stops when there are no events left to process. The huge spike in messages delivered explains the very high and very low extremes – messages sent in the last iteration of message sending get delivered very quickly. It seems odd, however, that the same is not true for the first iteration.

It could be that because the last set of messages has already started to be distributed, perhaps only one hop away, the carrying nodes are in a much better position to deliver those messages at the start of the next section. This could be due to higher activity in the first hours of the dataset collection, perhaps even due to all the nodes being switched on in the same place at the same time. If this is the case, then we need to start the dataset slice slightly later to avoid this. The activity traces for each dataset are shown below:

MIT-Nov Daily Connections

Cambridge Daily Connections

Infocom2005 Hourly Connections

Infocom2006 Hourly Connections

From these we can see that MIT has a good distribution of connectivity across the dataset, but the other three have only short spikes of activity. This in itself does not make a difference to the comparison of results within each dataset, as they are all relative, but it does go some way to explaining the very high latency figures for Cambridge.

The issues with KCLIQUE

I looked back at the data for KCLIQUE in the NGMAST simulations, where there were multiple parameter settings for KCLIQUE resulting in varying degrees of community structure. For all datasets, the delivery ratio for KCLIQUE seems to increase with average member count, whereas community count peaks before delivery ratio really gets going.

Delivery Ratio, Cost, Community Count and Average Member Count.

This makes sense: for BubbleH, the more we know about the members of the communities, the more messages we can direct using local ranking. And because we haven't included an overall community for KCLIQUE (or InfoMap*, or LinkClustering), we have no way to deliver messages to nodes not in a community, until a node holding a message comes into contact with the destination node. I'm not sure if we designed it this way or whether this is just a bug.

The next step is to fix this issue, by either:

  1. adjusting the BubbleH algorithm to use global rank to pass messages between nodes at the highest level, or
  2. forcing a global community in the other community finding algorithms (this will have the same effect as 1) – see the sketch below.

* InfoMap includes every node in the communities list, but does not include a global parent community
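A minimal sketch of option 2, assuming communities are represented as sets of node IDs; the class and method names are hypothetical:

import java.util.*;

/** Sketch of option 2: wrap any CFA's output with a global parent community. */
public class GlobalParent {
    /** Returns the community list with one extra community containing every node,
     *  so that BubbleH always has a top-level community to route through. */
    static List<Set<Integer>> forceGlobalParent(List<Set<Integer>> communities,
                                                Set<Integer> allNodes) {
        List<Set<Integer>> out = new ArrayList<>();
        out.add(new TreeSet<>(allNodes)); // the global parent, first in the list
        out.addAll(communities);
        return out;
    }
}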

Update:

I have now run all of the datasets with a forced global parent community; this seems to have evened the playing field somewhat:

Results where every CFA is forced to produce a global parent community including all nodes. (NOTE the change in axis labels from the previous run.)

The interesting thing is that there is very little difference in delivery ratio and cost using BubbleRAP.

The trouble with too many hops

April 12th, 2011 4 comments

In the recent paper that we submitted to SMW2011, we showed that we could increase the delivery ratio using overlapping hierarchical clustering; however, we also identified that the cost involved was much higher than we expected – up to 245 hops in some cases. To find out what was happening, we checked the behaviour of the best-performing parameters (GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2) by recording the route of each message, along with some hints as to why the message was being passed at each point: Hop Stats for GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2 (11MB uncompressed).

This is the current algorithm:


for (BubbleHMessage message : my_buffer) {
  /*
   * Identify the smallest community that Self and Destination share:
   * the Bridging Community (BC).
   */
  int my_bridge_community = BubbleHierarchyOracle.bridgeCommunity(my_node.getID(), message.dest, my_properties);

  if (process.getNode().getID() == message.dest) {
    // the encountered node is the destination: pass the message to it
    destination_met++;
    toSend.add(message);
    message.setHopCode("DM");
  } else {
    int remote_bridge = BubbleHierarchyOracle.bridgeCommunity(process.getNode().getID(), message.dest, my_properties);
    if (BubbleHierarchyOracle.isBetter(remote_bridge, my_bridge_community)) {
      // the encountered node shares a better (smaller/deeper) community with
      // message.dest than our bridging community: pass the message
      message.setHopCode(remote_bridge + "-isBetter-" + my_bridge_community + ":"
          + BubbleHierarchyOracle.whyIsBetter(remote_bridge, my_bridge_community));
      toSend.add(message);
      smaller_community_found++;
    } else if (process.getCommunities().contains(my_bridge_community)) {
      // both nodes are in the bridging community: compare local ranks,
      // and pass the message if the encountered node ranks higher
      if (process.getlocalRank(my_bridge_community) > getlocalRank(my_bridge_community)) {
        toSend.add(message);
        message.setHopCode("RANKBC-" + my_bridge_community);
        bridge_community_ranking_message_passed++;
      }
    } else {
      // the encountered node is not the destination, does not share a smaller
      // community with it, and is not in the bridging community: keep the message
    }
  }

}

The isBetter function:


public static boolean isBetter(int candidate_bridgeCommunity, int test_bridgeCommunity) {

  if (candidate_bridgeCommunity == -1 && test_bridgeCommunity == -1) {
    // neither node shares a community with the destination
    return false;
  } else if (candidate_bridgeCommunity == -1 && test_bridgeCommunity >= 0) {
    // the current (test) node shares a community, but the candidate does not
    return false;
  } else if (candidate_bridgeCommunity >= 0 && test_bridgeCommunity == -1) {
    // the candidate shares a community, but the current node does not
    return true;
  } else if (hierarchy.level(candidate_bridgeCommunity) > hierarchy.level(test_bridgeCommunity)) {
    // the candidate community sits deeper in the hierarchy
    return true;
  } else if (hierarchy.members(candidate_bridgeCommunity).length < hierarchy.members(test_bridgeCommunity).length) {
    // the candidate community has fewer members
    return true;
  } else {
    return false;
  }

}

Some information about the run:

Log first ran in simulator: 2004-12-09_23:57:24
Log Filename: [...]0.hop-stats.dat
Messages: 10506
	Hops: 276313
	Avg: 26
	Max Hop Count: 245
Earliest Hop: Wed, 10 Nov 2004 00:00:00 +0000
Latest Hop: Thu, 09 Dec 2004 23:13:29 +0000
Distribution of path lengths for GCEH-K3-E0.15-ST0.7-MAP0.5-Z0.2 using the MIT-NOV dataset

Community Structure:


 0(95)
  |
   - 1(95)
    |
     - 2(33)
    |
     - 3(30)
    |
     - 4(33)
    |
     - 5(31)
    |
     - 6(20)
    |  |
    |   - 8(7)
    |
     - 7(12)
    |
     - 9(15)
    |
     - 10(12)
    |  |
    |   - 11(5)
    |  |
    |   - 12(5)
    |
     - 13(4)
    |
     - 14(24)
    |
     - 15(15)
    |
     - 16(11)
    |
     - 17(49)
    |  |
    |   - 18(11)
    |
     - 19(41)
    |  |
    |   - 20(14)
    |
     - 21(4)
    |
     - 22(11)
    |
     - 23(5)
    |
     - 24(45)

Community Membership:


# options
# {'phi': 0.25, 'threads': 5, 'max_times_spoken_for': 1, 'minCliqueSize': 3, 'num_realizations': 100, 'epsilon': 0.14999999999999999, 'perturb_alpha': 0.20000000000000001, 'minCliqueEdgeWeight': 0.0, 'similarity_threshold': 0.69999999999999996, 'outfile': None, 'min_appearance_prop': 0.5, 'intNodeIDs': False, 'file': None, 'alpha': 1.0}
# ForceGlobalParent:true
# community_id-parent_id: members...n
0--1: 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 95 97 98 99 100 101 102 103
1-0: 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94 95 97 98 99 100 101 102 103
2-1: 6 8 10 11 14 15 16 17 18 21 29 30 31 38 41 43 46 55 61 63 79 80 83 85 86 88 90 91 94 95 97 101 102
3-1: 6 10 11 14 15 16 17 18 21 30 31 38 41 43 55 61 69 79 80 83 85 86 88 90 91 94 95 97 101 102
4-1: 6 7 8 10 14 15 16 17 18 21 29 30 31 41 46 55 58 61 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102
5-1: 6 10 11 13 14 15 16 17 18 27 30 31 35 41 46 53 61 63 69 78 79 83 85 86 88 91 95 97 99 101 102
6-1: 3 9 10 16 25 30 31 35 41 53 61 65 72 74 79 88 91 95 101 102
7-1: 6 10 14 18 25 31 62 65 83 86 88 102
8-6: 3 25 41 53 65 72 74
9-1: 6 14 16 30 41 44 53 61 73 79 83 91 95 101 102
10-1: 9 15 16 30 41 42 52 61 77 91 101 102
11-10: 9 30 42 77 101
12-10: 9 30 41 42 101
13-1: 1 30 83 101
14-1: 7 15 16 30 35 38 41 46 53 58 61 69 79 80 83 85 90 91 95 97 100 101 102 103
15-1: 10 16 30 41 53 61 73 74 79 83 91 95 99 101 102
16-1: 6 11 13 14 17 30 47 58 86 90 95
17-1: 1 2 3 7 8 10 15 21 25 27 29 31 34 38 42 43 46 47 51 52 55 57 58 61 64 65 66 67 69 72 73 74 75 78 79 80 83 85 87 88 90 93 94 95 97 98 99 100 103
18-17: 8 10 15 27 66 80 83 85 97 98 100
19-1: 1 2 6 7 8 14 15 18 21 29 33 34 38 40 43 46 47 51 55 57 58 64 66 67 72 73 75 78 80 83 85 86 87 88 90 94 95 97 100 102 103
20-19: 2 15 38 51 67 80 85 87 88 95 97 100 102 103
21-1: 28 58 90 103
22-1: 23 30 39 41 59 63 79 81 91 101 102
23-1: 45 69 72 88 99
24-1: 3 4 9 12 19 20 23 24 25 28 32 33 34 36 37 39 42 44 45 48 50 51 52 54 57 59 60 62 63 64 65 66 68 70 72 73 74 76 77 81 82 84 89 94 98

As expected, there is a high number of messages that arrive within a small number of hops, which is highly desirable; however, there is a long tail that extends up to 245 hops, which was not expected.

An annotated log file excerpt for message pair 85:3 (one of the longest paths) is shown below. Note that this message is live, meaning it has not yet been delivered. Analysis of it shows a large number of hops between the same nodes, and the same communities.

Log file excerpt

Key:

n-isBetter-n:Reason = the first (candidate) community is deemed better than
                      the second (test) community, for the reason given
RANKBC-n            = the ranks of the two nodes in the given community n were
                      compared, and the encountered node scored higher
DM                  = destination met, so the message was passed

Message ID: 8060
From 85 to 3
Status: live
Passed to node 85 at 2004-11-10 00:00:00 reason: INITIATED
Passed to node 16 at 2004-11-10 09:22:09 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 102 at 2004-11-10 09:56:10 reason: RANKBC-6
Passed to node 41 at 2004-11-10 12:04:45 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-10 14:36:56 reason: RANKBC-8
Passed to node 90 at 2004-11-10 14:40:55 reason: 17-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 95 at 2004-11-10 15:23:10 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 30 at 2004-11-10 15:26:52 reason: RANKBC-6
Passed to node 102 at 2004-11-10 15:40:22 reason: RANKBC-6
Passed to node 41 at 2004-11-10 18:06:57 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 90 at 2004-11-10 18:19:24 reason: RANKBC-8
Passed to node 95 at 2004-11-11 09:47:52 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 30 at 2004-11-11 11:46:14 reason: RANKBC-6
Passed to node 102 at 2004-11-11 12:00:38 reason: RANKBC-6
Passed to node 41 at 2004-11-11 12:55:31 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 14:05:32 reason: RANKBC-8
Passed to node 41 at 2004-11-11 14:38:50 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 15:11:56 reason: RANKBC-8
Passed to node 41 at 2004-11-11 16:37:36 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-11 17:30:20 reason: RANKBC-8
Passed to node 41 at 2004-11-11 17:56:34 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-11 18:18:04 reason: RANKBC-8
Passed to node 30 at 2004-11-15 09:20:19 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 101 at 2004-11-15 09:46:13 reason: RANKBC-6
Passed to node 41 at 2004-11-15 09:57:18 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 11 at 2004-11-15 12:47:19 reason: RANKBC-8
Passed to node 88 at 2004-11-15 13:41:22 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 38 at 2004-11-15 13:47:35 reason: RANKBC-6
Passed to node 79 at 2004-11-15 14:41:15 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 91 at 2004-11-15 14:52:48 reason: RANKBC-6
Passed to node 101 at 2004-11-15 14:58:03 reason: RANKBC-6
Passed to node 102 at 2004-11-15 14:58:48 reason: RANKBC-6
Passed to node 41 at 2004-11-15 15:26:31 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 16:04:25 reason: RANKBC-8
Passed to node 101 at 2004-11-15 16:08:44 reason: RANKBC-6
Passed to node 53 at 2004-11-15 16:14:16 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 13 at 2004-11-15 16:25:03 reason: RANKBC-8
Passed to node 91 at 2004-11-15 16:29:28 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 53 at 2004-11-15 16:30:40 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 16:41:37 reason: RANKBC-8
Passed to node 41 at 2004-11-15 16:51:04 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-15 17:31:42 reason: RANKBC-8
Passed to node 41 at 2004-11-15 18:13:21 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 09:18:35 reason: RANKBC-8
Passed to node 41 at 2004-11-16 09:24:56 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-16 11:01:28 reason: RANKBC-8
Passed to node 30 at 2004-11-16 11:01:30 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 101 at 2004-11-16 11:58:26 reason: RANKBC-6
Passed to node 102 at 2004-11-16 12:09:36 reason: RANKBC-6
Passed to node 41 at 2004-11-16 12:15:36 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 6 at 2004-11-16 13:35:34 reason: RANKBC-8
Passed to node 102 at 2004-11-16 14:12:41 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 41 at 2004-11-16 14:47:50 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 14:49:36 reason: RANKBC-8
Passed to node 41 at 2004-11-16 14:49:53 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-16 15:06:52 reason: RANKBC-8
Passed to node 101 at 2004-11-16 15:13:25 reason: RANKBC-6
Passed to node 53 at 2004-11-16 15:41:12 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 13 at 2004-11-16 15:50:43 reason: RANKBC-8
Passed to node 102 at 2004-11-16 15:55:57 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 53 at 2004-11-16 16:44:56 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 14 at 2004-11-16 16:46:28 reason: RANKBC-8
Passed to node 79 at 2004-11-16 16:52:06 reason: 6-isBetter-1:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 61 at 2004-11-16 16:53:08 reason: RANKBC-6
Passed to node 58 at 2004-11-16 17:20:54 reason: RANKBC-6
Passed to node 31 at 2004-11-16 17:21:37 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 90 at 2004-11-16 18:52:52 reason: RANKBC-6
Passed to node 101 at 2004-11-17 12:32:06 reason: 6-isBetter-17:BETTER-CANDIDATE-LESS-MEMBERS
Passed to node 41 at 2004-11-17 12:37:00 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 30 at 2004-11-17 12:51:31 reason: RANKBC-8
Passed to node 102 at 2004-11-17 12:53:32 reason: RANKBC-6
Passed to node 41 at 2004-11-17 12:59:28 reason: 8-isBetter-6:BETTER-CANDIDATE-HAS-HIGHER-LEVEL-DEPTH
Passed to node 17 at 2004-11-17 13:10:13 reason: RANKBC-8
... etc.

In the implementation, finding the 'best' bridging community involves a choice when there are two or more candidates for the best BC for a node pair. This choice is naive, because it is made by simply picking the last one in the list.

There seem to be a number of problems when considering whether a given best bridging community for one node is better than another. When making this choice, the depth of the community in the structure is compared; this is also naive, because the hierarchy levels do not relate to the importance of a community – the depth is not governed by any metric saying how deep a community is, only that it has a parent, and perhaps some children. However, the size of the communities is still perhaps a good metric to use.

There will always be a bridging community for every node pair, as there is one global parent community; this means that the global community is treated like a local community. This should not be a big problem, as it pushes messages to the top when there is no better community to be found, and these messages should soon find a node with a better community than the global one.

Some possible solutions are:

  • Compare all of the best communities for a node/message when comparing two nodes
  • Store a best-so-far community with the message, and only pass the message on to nodes with a better one (sketched below)
  • Stop using any local/global ranking, and consider only community 'best-ness'
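A minimal sketch of the second idea, reusing the isBetter oracle from the algorithm above; the message class and its fields are hypothetical:

/** Sketch: carry the best bridging community seen so far with the message,
 *  and only hand the message to nodes that improve on it. */
class BestSoFarMessage {
  int dest;
  int bestBridgeCommunity = -1; // -1: no shared community found yet

  /** Pass only if the encountered node's bridging community beats the best
   *  community this message has ever been carried under. */
  boolean shouldPassTo(int encounteredBridgeCommunity) {
    if (BubbleHierarchyOracle.isBetter(encounteredBridgeCommunity, bestBridgeCommunity)) {
      bestBridgeCommunity = encounteredBridgeCommunity;
      return true;
    }
    return false;
  }
}

This would stop the flip-flopping seen in the log above, because a message could never return to a community it had already improved upon.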

Pádraig suggested checking whether it is particular destination nodes that cause the high hop counts. The graphic below illustrates the average hop length for each node, both when the node is the source and when it is the destination. The data table is also given, and includes the standard deviations.

Average hop lengths by node, where the node is the source and where the node is the destination

Data table:

click to view data

node avg_hops_when_src stddev avg_hops_when_dest stddev
1 5.65 3.06 165.62 56.67
2 36.99 48.48 9.68 12.99
3 24.58 35.35 177.15 90.16
4 31.98 40.29 14.3 18.24
5 1 0 1 0
6 36.5 48.97 8.59 6.47
7 30.65 50 6.27 10.11
8 30.64 51.51 17.74 13.27
9 40.73 49.72 49.09 34.73
10 28.68 50.77 25.77 21.38
11 29.25 51.39 5.19 3.42
12 28.47 37.03 13.25 17.61
13 28.65 48.69 9.17 9.79
14 29.51 48.44 5.63 4.58
15 29.37 49.2 25.79 23.91
16 28.68 50.36 9.35 7.35
17 28.81 50.53 8.09 10.76
18 24.65 35.73 82.75 35.73
19 26.43 49.7 12.36 16.71
20 28.99 47.09 14.4 17.78
21 26.54 49.97 6.3 9.82
22 1 0 1 0
23 31.04 46.51 24.57 30.6
24 24.81 44.58 13.21 17.19
25 22.88 40.02 107.42 63.14
26 1 0 1 0
27 19.05 33.76 18.77 12.32
28 14.76 31.24 45.16 26.61
29 37.24 46.37 6.75 9.76
30 29.38 49.53 5.64 3.51
31 30.43 49.31 18.72 10.58
32 33.48 46 12.8 17.24
33 39.87 50.71 4.82 3.3
34 33.17 40.86 5.78 3.49
35 33.63 45.75 5.06 2.24
36 23.25 40.12 15.01 18.54
37 35.37 45.88 14.89 18.24
38 27.52 43.52 9.3 9.72
39 25.62 39.24 46.45 29.91
40 24.88 36.34 5.75 3.27
41 32.17 52.09 9.75 5.69
42 38.6 48.48 111.61 61.11
43 29.83 51.16 6.34 9.82
44 14.51 16.61 76.79 37.2
45 37.45 43.78 29.45 24.56
46 27.56 48.06 6.54 9.87
47 28.09 43.68 22.54 19.48
48 20.84 33.7 13.55 17.73
49 1 0 1 0
50 18.75 34.44 13.67 17.81
51 28.18 50.56 6.25 4.93
52 41.13 49.35 64.24 34.58
53 30.95 47.38 15.41 13.87
54 31.66 46.03 13.83 17.66
55 28.63 44.23 6.44 9.57
56 1 0 1 0
57 31.55 41.85 5.34 3.2
58 27.45 50.7 22.09 22.22
59 27.46 44.1 26.02 27.82
60 10.72 9.25 30.58 26.47
61 30.2 49.97 13.91 18.41
62 33.79 44.06 109.91 57.24
63 30.76 48.56 14.25 16.67
64 20.71 22.38 5.73 3.49
65 20.37 34.53 111.34 63.53
66 37.36 45.57 127.11 65.15
67 38.19 48.7 9.25 13.14
68 25.67 37.47 15.05 18.31
69 29.79 45.81 7.61 4.27
70 21.41 33.17 16.02 18.61
71 1 0 1 0
72 22.07 27.05 24.52 20.85
73 30.64 49.46 72.68 66.96
74 20.09 30.47 93.49 55.14
75 28.4 50.19 6.65 9.8
76 27.27 37.04 14.08 17.77
77 15.32 29.31 31.28 38.38
78 29.72 47.43 6.76 9.76
79 23.72 42.65 10.85 9.49
80 29.97 52.24 31.66 25.39
81 34.55 49.01 22.03 26.42
82 27.24 40.62 16.61 18.63
83 28.57 32.77 99.87 50.57
84 33.09 45.61 15.61 18.58
85 27.16 48.18 8.73 8
86 24.51 44.49 4.67 3.37
87 27 50.14 9.69 13.32
88 30.65 52.13 28.11 44.69
89 21.62 35.71 14.28 17.76
90 27.7 48 10.19 9.92
91 28.15 47.49 8.95 8.27
92 1 0 1 0
93 27.7 37.74 34.67 35.09
94 36.59 47.23 3.75 1.49
95 29.91 49.99 23.63 20.39
96 1 0 1 0
97 37.44 49.69 9.7 6.36
98 18.9 35.19 17.85 19.89
99 31.82 49.52 32.99 20.77
100 23.79 39.27 34.01 17.46
101 29.15 50.93 9.07 8.7
102 34.7 47.82 6.42 5.27
103 37.67 48.72 60.97 33.82

It seems that in some cases a node might be very hard to get to, and therefore has a high number of hops; however, it could also be that because the node is not seen very often, the messages just keep moving around different nodes. The way to see this would be to take measurements of the hop count (as above) at time intervals. However, it would be best to first fix the algorithm to see if the behaviour continues, then consider further analysis.

Problem Found/Solution

After a bit of bug tracking and debug output scouring, we were able to find two problems. The first was a simulator problem: community files (numbered 0 to n) were being loaded in lexicographic rather than numeric order, which meant the internal numbering of communities did not match the file names:

community.0.dat
community.1.dat
community.10.dat
community.11.dat
community.12.dat
community.13.dat
community.14.dat
community.15.dat
community.16.dat
community.17.dat
community.18.dat
community.19.dat
community.2.dat
community.20.dat
community.21.dat
community.22.dat
community.23.dat
community.24.dat
community.3.dat
community.4.dat
community.5.dat
community.6.dat
community.7.dat
community.8.dat
community.9.dat
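A minimal sketch of the fix, assuming the files follow the community.N.dat naming shown above: extract the numeric index and sort on it, rather than on the file name:

import java.util.*;

/** Sketch: load community.N.dat files in numeric rather than lexicographic order. */
public class CommunityFileOrder {
    static int index(String name) {
        return Integer.parseInt(name.split("\\.")[1]); // "community.12.dat" -> 12
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>(List.of(
                "community.0.dat", "community.1.dat", "community.10.dat",
                "community.2.dat", "community.11.dat"));
        files.sort(Comparator.comparingInt(CommunityFileOrder::index));
        System.out.println(files); // [community.0.dat, community.1.dat, community.2.dat, ...]
    }
}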

The second problem was a little more subtle: it was caused by the behaviour of the isBetter function when considering both the level of a community in the hierarchy and the number of members in a community. When an upper-level community had a smaller membership than a lower-level one, the message tended to flip-flop between communities. This was temporarily fixed by only considering the size of a community when the levels were the same.
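A sketch of that temporary fix, with the existing no-shared-community checks folded in; the Hierarchy interface is a hypothetical stand-in for the oracle's community structure:

/** Sketch of the temporary fix: member count only breaks ties between
 *  communities at the same hierarchy level, so a message cannot flip-flop
 *  between a small upper-level community and a larger lower-level one. */
class FixedComparison {
  interface Hierarchy {
    int level(int community);
    int[] members(int community);
  }

  static boolean isBetterFixed(int candidate, int test, Hierarchy hierarchy) {
    if (candidate == -1 || test == -1)        // unchanged: handle missing
      return candidate != -1 && test == -1;   // shared communities first
    int candLevel = hierarchy.level(candidate);
    int testLevel = hierarchy.level(test);
    if (candLevel != testLevel)
      return candLevel > testLevel;           // deeper level wins outright
    return hierarchy.members(candidate).length
         < hierarchy.members(test).length;    // same level: fewer members wins
  }
}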

Results after this fix (for both BubbleRAP and BubbleH) are shown below:

However!

We have since decided to test what happens when we ignore the depth of a community and only consider the member count; this might have the effect of flattening the hierarchy somewhat, but it simplifies the algorithm… watch this space for results.

Some new ideas

February 1st, 2011 No comments

Met with Pádraig; the results so far are not complete, so we still need to:

  • Run routing based on ranked centrality only (i.e. the aggregate graph of all connections) – Graham might have done this already – to give us a more complete picture of what routing towards the centre really looks like.
  • Do more randomised ranking runs, to see if random ranking can come up with a better routing rank than LBR.
  • Implement and test new LBR idea.

Next Step for LBR

Pádraig suggested a simple advancement for LBR:

Person A has a message for Person C, and has encountered Person B. A has to work out whether to pass the message on or not. Each node has a probability matrix of visiting each location at any given time.

Probability matrix of nodes being at any given location

A makes his decision by calculating the dot product of his own location probabilities against C's, and comparing that to B's calculation in relation to C. If the sum for B·C is greater than A·C, then A passes the message to B. The rationale is that when one encounters someone else who is more likely to visit the same locations as the destination person, the message should be passed on, because they are more likely to see that person.
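A minimal sketch of this decision rule, assuming each node carries a vector of location-visit probabilities; the vectors below are illustrative:

/** Sketch: pass a message for C from A to B if B's location profile overlaps
 *  C's profile more than A's does (dot-product comparison). */
public class LocationRouting {
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    /** True if the encountered node B is a better carrier for a message to C. */
    static boolean shouldPass(double[] a, double[] b, double[] c) {
        return dot(b, c) > dot(a, c);
    }

    public static void main(String[] args) {
        double[] a = {0.7, 0.2, 0.1}; // A mostly visits location 0
        double[] b = {0.1, 0.1, 0.8}; // B mostly visits location 2
        double[] c = {0.0, 0.2, 0.8}; // destination C is usually at location 2
        System.out.println(shouldPass(a, b, c)); // true: B.C = 0.66 > A.C = 0.12
    }
}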

There are some caveats… TODO: buffer zone, future prediction, past history, recent(ness?), limited locations,

Some specific details/ideas/extensions to consider:

  • We need to consider how these probability matrices will evolve over time. A node's probability matrix will change when it visits new places or alters its existing mobility patterns. Initially we can use the computed data, but more refined versions should exhibit the real-world scenario of unpredictable behaviour.
  • Determine a good threshold to use, so that messages are not sent between people of very similar rank/score; the rationale is that there may be no benefit in passing the message on, so this should reduce overhead.
  • Limit the number of locations considered to only those that are popular; this might boost the use of popular locations, in an attempt to achieve a higher probability of a message being passed on.
  • Consider a more sophisticated mechanism to predict co-location with the destination node, or a better conduit/carrier node, by predicting future interactions with nodes based on past history.
  • It may also be important to show the possibility of real-world application by implementing a scheme for automatic dissemination and update of the probability matrix using the network itself (related to previous ideas about piggybacking meta-data using network/vector clocks, which can themselves be used as a source of routing metrics, e.g. recentness of contact, latency, update routes/speeds/times, etc.).

Pádraig and I discussed the problems we may encounter with peer review and justification of the work: the problem area is not well defined, so we might have trouble showing why this work is novel and proving what our contributions are. To that end, we need to explore the literature a little more, so that we can show a solid justification for the work, or alternatively change our approach so that it fits in better with other, better-defined problems.

What sorts of messages are 'delay tolerant'? Pádraig's suggestion is that Twitter messages and Facebook update messages might be delay tolerant, as a person may not need to receive all messages and generally only wants the latest updates; it does not matter if a few are lost along the way.

How do we define the urgency of messages, and the efficiency of the network? Perhaps one type of message can be delivered within 10 time periods, and still be considered to be relevant and within acceptable delivery time, but another message may need to be delivered within 1 time period, to ensure quality of service.

There is a categorisation issue too: some messages can be considered one-to-one (direct messaging), some one-to-many (Twitter updates), many-to-many (local information) and many-to-one (sensor networks). We need to decide which of these we will consider. On this note, I said I would speak to Neil Cowzer, who is working on implicit group messaging, to see what his motivation is and whether he has a well-defined problem space that he is tackling.

Another alternative Pádraig suggested was a social-science approach, where we look at heuristic approaches to routing in complex social networks. His suggestion was that on some networks we might be able to apply certain routing techniques that do not work on others. The contribution would be defining, categorising and testing new and existing combinations of network types and routing mechanisms. This would be an interesting route to take, but would mean a step back in my research, as I would need to do more reading in this area. It would link up well with the work of Stanley Milgram, Granovetter, and Watts & Strogatz, so I should re-read some of this work, but more importantly look at the most-cited and citing documents, to see where research might be heading now.

Seminar/Presentation 29 Nov 2010

November 29th, 2010 1 comment

Presentation summary:

Routing in Human Interaction Networks

The goal of this seminar is to get some ideas from the participants, about the feasibility, and possible directions for this research.

We believe that by detecting the underlying patterns of human interaction involving movement and contacts between individuals, we can design a delay tolerant mechanism for passing messages between actors in the system (for example using mobile phones) that does not rely on fixed infrastructure. A system such as this should exhibit low cost (hop count, update overhead, delivery latency) and high efficiency (delivery ratio).

We believe that we can drive a routing policy by calculating and sharing metrics and meta-data at node level; from the contacts they make and the places they visit, without the need for a centralised system. In particular we are interested in using location to drive the messaging scheme.

We will show recent, related results of Persistence Based Routing vs existing schemes, some initial results from our simple schemes, and will ask for suggestions regarding the direction of our research into Location Based Routing.

Present: Pádraig Cunningham, Davide Cellai, Derek Green, Conrad Lee, Fergal Reid, Neil Hurley

I presented some background and ideas about routing in human interaction networks (Slides – 1.2MB). Someone noted that Delivery Ratio and Latency are directly linked, and suggested that the requirement for high delivery ratio and low latency may not be intuitive in some situations, e.g. when the person who would cause a high latency may be the only one able to get the message across.

The presentation seemed to go well; however, some parts I might not have delivered properly, meaning that I seemed to be skimming over a few bits.

Some suggestions were made:

  • Consider the use of some sort of altered vector clocks, which keep a history, and can be shared with other nodes
  • Partition the graph in the Reality Mining data, to identify the communities, then test the algorithms for only those communities
  • Strong consensus to start using periodic patterns for predicting routes
  • Neil suggested that I try to generate a dataset that does work well for LBR
  • Fergal suggested what we had talked about before: finding out at what places messages get sent

Reference to chase up:

  1. PNAS 2005/2006 – Liben-Nowell – 'Geographic routing in social networks'
  2. SneakerNet
  3. Talk to Martin Harrigan – who is using Vector Clocks in some clever way

There may have been other things I don’t remember – feel free to comment with additions!

Meeting with KM, SB and DC about Social Sensing data 7th July 2010

July 7th, 2010 No comments

Met with Kevin McCarthy, Steven Bourke and Davide Cellai about what the Social Sensing study data could be used for. We are all interested in movement patterns and prediction, and decided that we should work together.

The data itself is currently stored in MongoDB, which is a document storage database and is apparently very easy to query. The data is stored in approximately 200GB of separate files. Kevin assured us that we would be able to access this data.

Steven suggested a number of sources of similar (location) data:

  • GeoLife, by Microsoft Research.
  • SimpleGeo
  • SpotRank by Skyhook

He also described how he collected location data from Gowalla for ~2000 users in Dublin. His master's thesis was about DTNs with sensors, so his interests are in line with mine and DC's.

We agreed to meet next week to brainstorm some ideas worthy of collaboration.

Quick meeting with Paddy 6th May 2010

May 6th, 2010 No comments

Met with Paddy in his office and discussed updates so far.

Paddy wants to see something tangible by next week’s update – i.e. a worked example of how vector clocks will work in terms of location.

Also emphasised that I should not get sidetracked! (of course!)

Suggested storing temporal information (parallel vector clocks?) – e.g. a from/to range, so that we can say things like: does this time overlap that one, is it contained within it, etc.

Also thought about how to define location – bounding the location gives us something to experiment with: change the grid size and ask what the computational impact of the grid size is, and what its relevance is (e.g. too big and it becomes unrealistic).

Construct vector stamps – five separate paths across these locations, two or three of which allow dropped messages – run through and pick various vector clocks at various times, and show them. Then start generalising them.

From this we can draw out general things, e.g.: what decisions are made, what information is stored, what we put in the vector clocks, and what operators we need.

Then run a simulation and see if generalisation works. Then we can see if some things fall down, and come back and change things.

Should stick with ideas about location, not proximity, for now.

Using this, it will then be possible to write this up concisely.

Actions:

  • Generate a worked example/scenario
    • show examples of vector clocks at times
    • show the movements over time
  • Don’t get sidetracked!

Meeting with Paddy 9 March 2010

March 9th, 2010 No comments

Met with Paddy to discuss funding and an idea of direction in PhD

Paddy said that if his EU grants get signed off, he will guarantee me another year of funding from September; the only requirement will be a chunk of work resulting in a ~30 page document – something to do with autonomous software patching.

Also spoke to him about the idea of vector clocks and how we can use them for opportunistic networking – he wondered whether we could derive some utility measure for locations, perhaps based on the duration spent at a location, the number of nodes in that location, etc.

He said that perhaps a good paper would be on techniques for analysing locations for use with opportunistic networking.

Paddy said that he wants me to aim to submit to the CfP on Social Based Routing in Mobile and Delay Tolerant Networks (the deadline is in June), to split the time up until then into 6-week chunks, and, for the first chunk, to tell him what I will be doing – i.e. what problems I am solving – and, for the whole time, what my goals are.

Also spoke about my idea of using the Dublin Bus network for research and profit – he mentioned a guy in Trinity who has access to a load of transport data, and also suggested I dig out an email address of someone important at Dublin Bus and send them a formal email about getting information about the buses, CCing him (so it looks official).

He also mentioned the TextUS service based at DCU (which runs the Aircoach SMS payment service and the SMS parking scheme), who we could collaborate with to provide a TfL-type service for Dublin.

Paper reading with Davide 26 Feb 2010

February 26th, 2010 No comments

Spent a few hours going over the following paper with Davide:

Kossinets, G., Kleinberg, J., & Watts, D. (2008). The structure of information pathways in a social communication network. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’08 (p. 435). New York, New York, USA: ACM Press. doi: 10.1145/1401890.1401945.

This was a very useful session, as it meant that we were forced to really understand the paper.

The authors present an analysis of email datasets using vector clocks as a framework. They argue that the issues of out-of-date information and indirect paths are central to understanding the patterns of systemic communication.

They explore Granovetter’s theory of weak ties, which basically says that long range connections can give better information than close connections.

This paper led me to think that vector clocks would be a great way to discover the state of an unknown network from the point of view of each node. E.g. if a node exchanges vector clocks with the other nodes it meets (perhaps with some probability, and a time restriction), it could quickly build up an internal picture of the following (see the sketch after this list):

  • the number of nodes in its connected network
  • hops to other nodes (maybe?)
  • the range of other nodes (i.e. the amount of up-to-date information a node gives about other nodes) – which can be used as a routing metric
  • membership clustering using the ball of radius (i.e. the nodes it is up to date with to within some value T days, which could be based on periodicity)
  • the periodic degree of a node (using T as the period), which can also be used as a routing metric
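A minimal sketch of such an exchange, assuming a clock is a map from node ID to the freshest time the node has heard (directly or indirectly) from that ID; the merge rule and the ball-of-radius count are illustrative:

import java.util.*;

/** Sketch: a node's vector clock, merged on contact and queried for
 *  simple local estimates of the wider network's state. */
public class NodeClock {
    final int id;
    final Map<Integer, Long> lastHeard = new HashMap<>(); // node -> freshest known time

    NodeClock(int id) { this.id = id; }

    /** On meeting another node at time now: record the contact and take the
     *  element-wise maximum of the two clocks, so indirect information spreads. */
    void meet(NodeClock other, long now) {
        lastHeard.put(other.id, now);
        other.lastHeard.put(id, now);
        for (var e : other.lastHeard.entrySet())
            lastHeard.merge(e.getKey(), e.getValue(), Math::max);
        for (var e : lastHeard.entrySet())
            other.lastHeard.merge(e.getKey(), e.getValue(), Math::max);
    }

    /** Number of nodes this node knows to exist in its connected network. */
    int knownNodes() { return lastHeard.size(); }

    /** Ball of radius T: the nodes we are up to date with to within T time units. */
    int ballOfRadius(long now, long T) {
        return (int) lastHeard.values().stream().filter(t -> now - t <= T).count();
    }
}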

Each node must have a unique identity (perhaps based on its Bluetooth MAC address). Even open-world nodes that are not members of the closed world can be counted when it comes to degree calculations, but only as a ramp-up, as only closed-world nodes (i.e. others using this algorithm) will be of interest when sending messages.