
Archive for the ‘Supervisor Meetings’ Category

## Final Chapter – Next Steps

June 22nd, 2012

Met with Pádraig to discuss next steps.

We decided to go down the route of trying to classify CFAs based on synthetic datasets, but realised that there are no synthetic datasets out there that model both contacts and clustering.

We decided to be clever with the Studivz dataset. We will run the Moses CFA over it. Then pick a random community, then build out the network based on connected communities.

The idea being that we can decide how many nodes we want, and keep going until we have enough.

I implemented this. The script takes an edge list and the community allocation produced by Moses (see file in SVN).

The script picks a random community C0; then, while the total number of nodes is less than the maximum, it finds the communities connected to C0 with the largest number of links to C0, picks one of these (C1) at random, and repeats the picking process until the total number of nodes has been reached or exceeded.
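The picking process can be sketched roughly as follows. This is a hypothetical reconstruction (function and parameter names are mine, not the SVN script's); ties are broken deterministically here rather than at random, and a `mode` flag anticipates the least-connected variant used later:

```python
import random
from collections import defaultdict

def sample_connected_communities(edges, communities, max_nodes, mode="max", seed=None):
    """Greedily grow a node subset by absorbing connected communities.

    edges:        iterable of (u, v) node pairs
    communities:  list of sets of nodes (e.g. parsed from the Moses output)
    mode:         "max" absorbs the most-connected neighbouring community,
                  "min" the least-connected (for the poorly connected variant)
    """
    rng = random.Random(seed)
    # map each node to the communities containing it (communities may overlap)
    member_of = defaultdict(set)
    for ci, comm in enumerate(communities):
        for n in comm:
            member_of[n].add(ci)

    chosen = {rng.randrange(len(communities))}   # start from a random community
    nodes = set(communities[next(iter(chosen))])

    while len(nodes) < max_nodes:
        # count links between the selected node set and each outside community
        link_count = defaultdict(int)
        for u, v in edges:
            for inside, outside in ((u, v), (v, u)):
                if inside in nodes and outside not in nodes:
                    for ci in member_of[outside] - chosen:
                        link_count[ci] += 1
        if not link_count:
            break  # no connected communities left to absorb
        pick = (max if mode == "max" else min)(
            link_count, key=lambda c: (link_count[c], c))
        chosen.add(pick)
        nodes |= communities[pick]
    return chosen, nodes
```

Stopping only once the node count is reached or exceeded matches the behaviour described above: the final subset may slightly overshoot the limit.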

Initially, I did this four times, with a limit of 200 nodes, and produced the following graphs, coloured by modularity class and sized by betweenness centrality (scaled linearly between 5 and 100 in Gephi).

studivz-three-200-D

studivz-three-200-B

studivz-three-200-C

studivz-three-200-A

But then I got to thinking: what are we actually trying to test here? It’s a bit vague to just see what happens in these datasets. We could be testing a number of things:

• Which CFA works best for well connected communities?
• Which CFA works best for poorly connected communities?
• Are we testing a particular aspect of a network?
  • Density?
  • Average community size?
  • Average network size?

So I also decided to generate a set of subgraphs with poorly connected communities, adapting the algorithm to pick the communities with the lowest number of links. This resulted in the four below:

studivz-three-200-A-MIN

studivz-three-200-B-MIN

studivz-three-200-C-MIN

studivz-three-200-D-MIN

Also, perhaps we should be considering weighted links: say, the most time connected versus the least time connected?

Each of these subsets should be compared in terms of other network metrics to see if there is an effect on CFA performance.

Awesome command to do this:

EXPERIMENT_GROUP=FINAL_EVALUATION DATASETS=studivz-three-200-A,studivz-three-200-B,studivz-three-200-C,studivz-three-200-D,studivz-three-200-A-MIN,studivz-three-200-B-MIN,studivz-three-200-C-MIN,studivz-three-200-D-MIN,mit-nov,mit-oct,cambridge,social-sensing,hypertext2009,infocom-2005,infocom-2006 && for DATASET in $(echo ${DATASETS} | tr "," "\n"); do php -f scripts/stats/DatasetCommunityLinkCountStats.php OUTPUT/${EXPERIMENT_GROUP}/edgelist-graphs/${DATASET}/edge_list.dat OUTPUT/${EXPERIMENT_GROUP}/communities/${DATASET}/Moses/no-global-parent/edge_list.dat.communities.dat CSV ${DATASET} > OUTPUT/${EXPERIMENT_GROUP}/data/${DATASET}-DatasetCommLinkStats.txt; done

## Temporal Networks

Pádraig and I discussed Time Respecting Paths in the datasets we have, based on the paper Temporal Networks [1].

We want to detect the shortest/quickest paths between pairs of nodes which respect the temporal ordering of contacts, so that we can derive a betweenness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network and calculates betweenness centrality using the number of times a node is used in ANY path, whereas the BubbleRAP implementation uses only the SHORTEST paths. So there is some discrepancy here. The BubbleRAP paper reads as follows:

To calculate the individual centrality value for each node, we take a numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the delivery with the shortest delay. We call this number the betweenness centrality of this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality[2]

The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly –  further inspection may yield a better understanding 🙂

There is a question over what observation period we should use, and how to decide which is really the ‘shortest path’ when paths may start/end at different times. In the BubbleRAP paper, they seem to have chosen the path with the least latency, without considering the start time of the path. In the temporal networks paper [1], the authors talk about a saw-tooth pattern which governs latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern, but not really for the betweenness centrality measure.

e.g. if we start at time 0 and measure the time it takes to get to the destination node, that is fine. However, what if there is an intermediate time, say 4, from which, if we measure the path from there, we find a new shortest path? In BubbleRAP this is taken as the shortest path, but is this correct?

I will implement the time-respecting shortest-path (latency) betweenness centrality (and maybe the shortest path by hops), as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.
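As a starting point for that implementation, the time-respecting part can be sketched as an earliest-arrival computation: process contacts in time order, and let a node receive the message at a contact only if its partner already holds it. This is a minimal sketch (names are mine), assuming instantaneous, bidirectional contact events; latency from a given start time is then arrival minus start:

```python
def earliest_arrival(contacts, source, t_start):
    """Earliest time each node can receive a message created at `source`
    at time `t_start`, respecting the temporal order of contacts.

    contacts: list of (time, u, v) contact events (instantaneous, bidirectional)
    Returns {node: earliest_arrival_time}; latency = arrival - t_start.
    """
    arrival = {source: t_start}
    for t, u, v in sorted(contacts):          # process contacts in time order
        for a, b in ((u, v), (v, u)):
            # b can receive the message at time t if a already holds it by t
            if arrival.get(a, float("inf")) <= t and t < arrival.get(b, float("inf")):
                arrival[b] = t
    return arrival
```

Running this with different `t_start` values directly exposes the saw-tooth effect discussed above: a later start can yield a completely different (and sometimes shorter) delivery latency.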

[1] Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: http://arxiv.org/abs/1108.1780. Accessed May 15, 2012.

[2] Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: http://www.jstor.org/stable/10.2307/3033543. Accessed May 31, 2012.

## THE plan

Complete exhaustive runs on all datasets:

• MIT-ALL
• MIT-OCT
• ENRON
• Enron-a
• Enron-b
• Salathe??
• Studivz

Implement Conga – use the number of communities from InfoMap as the target number for Conga.

Implement InfoMap-H.

Above to be completed by next week’s meeting, 19th Jan.

Write a thesis chapter based on this work, with a view to making it into a journal article. Chapter and article to be completed by the end of March 2012.

Then, conduct another piece of work to tie it all in, probably based on Vector Clocks.

Finally, fill in the (rather large, and difficult) blanks in the thesis.  PC: Easy when said quickly 🙂


## Extending Datasets

Pádraig had the idea to try to eliminate Delivery Ratio from our analysis, so that we could concentrate specifically on Cost and Latency. He suggested that we take the normal dataset and extend it to three times the length, repeating the events in the correct time ordering. We could then run the simulations 10 times, each with a different start time in the original portion of the dataset; this would give each message a much better chance of being delivered over the three iterations of the dataset.

I implemented this by taking a slice of the MIT dataset, processing the events contained within it, and duplicating them, adding the length of the slice times the iteration number to each timestamp (i.e. MIT-NOV, MIT-NOV + 1 length, MIT-NOV + 2 lengths). Any contact events that were active at the start or end time of the slice were truncated to start/end at those times. I verified the resulting dataset by counting the number of connections every day and ensuring that there was a distinct repeating pattern. For each dataset, I split the duration between the start and end times of the original dataset into 10 equal chunks, and calculated the corresponding start time for each. For MIT and Cambridge this equates to roughly every three days, whereas in the InfoCom datasets there is a much smaller gap between start times (~14-20 hours).
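The duplication and truncation steps can be sketched as follows; this is a hypothetical reconstruction (names are mine, not the actual script), assuming contact events are (start, end, u, v) tuples:

```python
def extend_dataset(events, slice_start, slice_end, copies=3):
    """Truncate contact events to the slice, then append time-shifted
    copies so the slice repeats `copies` times in total.

    events: list of (start, end, u, v) contact events
    Returns the extended event list, sorted by start time.
    """
    length = slice_end - slice_start
    base = []
    for s, e, u, v in events:
        if e <= slice_start or s >= slice_end:
            continue                       # outside the slice entirely
        # truncate events active at the slice boundaries
        base.append((max(s, slice_start), min(e, slice_end), u, v))
    extended = [(s + i * length, e + i * length, u, v)
                for i in range(copies) for s, e, u, v in base]
    return sorted(extended)
```

The daily-connection-count check described above then simply verifies that the histogram of `extended` repeats with period `length`.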

Start times:

mit-nov (2004-11-10_00:00:00 - 2004-12-11_00:00:00): 2004-11-10_00:00:00, 2004-11-13_02:24:00, 2004-11-16_04:48:00, 2004-11-19_07:12:00, 2004-11-22_09:36:00, 2004-11-25_12:00:00, 2004-11-28_14:24:00, 2004-12-01_16:48:00, 2004-12-04_19:12:00, 2004-12-07_21:36:00

infocom-2005 (2005-03-06_00:00:00 - 2005-03-12_00:00:00): 2005-03-06_00:00:00, 2005-03-06_14:24:00, 2005-03-07_04:48:00, 2005-03-07_19:12:00, 2005-03-08_09:36:00, 2005-03-09_00:00:00, 2005-03-09_14:24:00, 2005-03-10_04:48:00, 2005-03-10_19:12:00, 2005-03-11_09:36:00

infocom-2006 (2006-04-24_00:00:00 - 2006-05-02_00:00:00): 2006-04-24_00:00:00, 2006-04-24_19:12:00, 2006-04-25_14:24:00, 2006-04-26_09:36:00, 2006-04-27_04:48:00, 2006-04-28_00:00:00, 2006-04-28_19:12:00, 2006-04-29_14:24:00, 2006-04-30_09:36:00, 2006-05-01_04:48:00

cambridge (2005-10-28_00:00:00 - 2005-12-23_00:00:00): 2005-10-28_00:00:00, 2005-11-02_13:30:00, 2005-11-08_04:00:00, 2005-11-13_18:30:00, 2005-11-19_09:00:00, 2005-11-24_23:30:00, 2005-11-30_14:00:00, 2005-12-06_04:30:00, 2005-12-11_19:00:00, 2005-12-17_09:30:00
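Splitting each dataset's period into 10 equal chunks and taking the start of each chunk can be sketched in a few lines (the function name is mine; it reproduces, for example, mit-nov's second start time of 2004-11-13_02:24:00):

```python
from datetime import datetime

def start_times(start, end, chunks=10, fmt="%Y-%m-%d_%H:%M:%S"):
    """Split the dataset period [start, end) into `chunks` equal chunks
    and return the simulation start time at the beginning of each."""
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    step = (t1 - t0) / chunks              # timedelta / int -> timedelta
    return [(t0 + i * step).strftime(fmt) for i in range(chunks)]
```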

Running Simulations

We decided to remove the overhead of running the CFAs with multiple parameters by choosing, for each, a set of parameters that had performed well in previous tests. The list below shows the ones we chose:

CFA Parameters

MIT-NOV HGCE Bubble K A M E P S R ST MAP Z 3 1.3 0 0.25 0.2 4 100 0.7 0.5 0.1   BubbleH K A M E P S R ST MAP Z 4 1.3 0 0.25 0.3 4 100 0.9 0.9 0.1 # Communities AverageMembers # 21 15 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 27 29 30 31 32 34 35 36 37 38 39 41 42 43 45 46 47 48 50 51 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 94 95 97 98 99 100 101 102 103 6 8 14 16 18 30 41 46 53 59 61 63 78 79 86 88 91 95 101 102 6 13 17 29 30 47 86 88 95 101 10 25 62 65 74 83 102 3 25 30 62 74 4 23 30 45 57 91 9 30 42 101 19 27 35 37 39 81 89 102 2 41 51 87 88 95 102 103 15 41 43 51 67 69 87 88 95 100 103 19 27 41 66 85 98 15 43 46 58 61 67 69 78 79 83 85 90 95 97 100 103 2 7 11 15 57 61 67 69 78 79 85 87 88 90 95 97 100 101 103 6 7 8 10 11 14 15 16 17 21 30 31 34 35 38 41 43 46 55 58 61 69 75 78 79 80 83 85 86 88 90 91 94 95 97 101 102 8 10 15 16 17 30 38 41 46 53 63 73 80 83 85 95 97 99 101 102 10 14 18 25 35 46 65 83 86 88 90 102 10 17 25 35 53 72 85 88 10 25 53 72 74 78 27 30 50 95 97 99 4 12 19 20 23 24 27 30 32 36 37 39 41 45 48 50 54 59 60 68 70 76 77 81 82 84 89 98 19 27 37 39 41 68 81 98   KCLIQUE BubbleH K MCT 3 0.01 # k = 3 # minimum_connected_ratio = 0.01 # number_of_communities = 4 # community_sizes = [5, 3, 45, 24] # average_size = 19.25 103 2 67 100 87 99 27 69 6 7 8 10 11 13 14 15 16 17 18 21 29 30 31 34 35 38 41 43 44 46 53 55 58 59 61 63 67 73 75 78 79 80 83 85 86 88 90 91 95 97 100 101 102 68 70 76 12 77 81 82 19 84 23 24 89 32 98 36 37 39 45 48 50 54 20 59 4   Bubble K MCT 3 0.01   LinkClustering BubbleH ET 0 # D_max: 0.591406681046 # S_max: 0.596774193548 # Communities: 79 # Average size : 8 # Nodes : 95 (seen 635 times) 3 72 51 73 58 30 77 50 102 88 89 97 53 67 82 69 81 86 87 84 85 24 20 21 48 23 46 45 43 41 4 7 6 8 68 11 83 13 99 76 75 38 70 91 90 100 101 95 29 79 78 15 10 39 12 59 14 17 98 19 54 31 16 37 36 35 55 63 32 61 103 25 72 62 45 74 65 72 35 77 19 74 20 19 9 77 25 84 
103 67 47 3 30 25 65 44 73 74 73 9 2 83 61 88 67 102 90 100 81 95 87 79 85 11 10 101 21 17 16 86 41 2 7 103 25 74 27 74 91 10 17 16 30 88 41 61 62 102 83 101 86 79 61 88 67 91 83 100 101 86 87 97 85 69 95 57 30 41 7 103 78 31 90 3 74 19 98 66 24 76 64 4 45 50 72 102 63 30 33 101 62 65 24 20 48 19 32 37 77 98 89 73 54 59 10 21 61 88 63 90 91 83 101 102 94 97 85 95 25 86 58 17 16 55 18 30 43 53 41 46 69 7 8 103 78 1 83 30 9 72 98 36 77 76 89 70 68 20 84 24 39 12 59 48 23 19 54 82 45 37 50 4 32 30 42 101 18 84 42 72 90 28 103 44 27 25 3 25 35 25 62 102 88 91 90 80 101 95 87 79 85 11 10 81 15 21 14 16 46 86 31 30 61 43 35 41 97 7 6 9 8 18 60 81 30 9 52 52 101 15 16 30 41 64 73 39 27 19 37 50 36 98 89 4 68 99 63 73 102 83 101 95 87 97 85 11 10 38 15 14 17 16 46 31 30 53 41 94 6 8 73 72 83 29 88 65 90 102 103 80 86 79 78 10 38 15 21 55 18 31 43 35 46 69 7 61 63 66 67 102 83 100 101 95 85 10 14 16 46 41 55 69 7 8 78 90 11 13 58 17 16 47 31 30 29 88 95 83 101 86 23 34 93 40 9 57 21 16 30 40 41 61 88 6 102 101 86 14 27 66 67 43 61 88 63 53 90 69 80 81 86 87 85 21 46 95 29 41 79 7 6 8 83 99 75 91 103 100 101 102 94 97 78 11 10 13 38 15 58 17 16 55 18 31 30 35 34 14 9 42 11 13 14 17 30 35 41 61 88 6 93 101 86 87 79 73 66 74 72 3 53 87 15 95 51 43 41 88 67 102 8 103 69 100 7 21 43 99 75 64 95 90 101 86 79 78 10 58 55 18 31 30 29 88 6 8 3 62 10 88 72 17 69 85 87 53 78 62 74 62 28 9 74 77 42 60 98 20 54 77 50 2 51 12 2 91 10 101 14 31 30 53 34 99 61 74 102 83 41 95 78 57 45 4 23 9 25 14 61 44 30 53 41 102 88 7 6 91 83 101 95 79 11 83 99 61 88 63 90 102 103 100 81 69 94 97 85 91 10 27 15 21 14 16 55 95 31 30 35 41 80 79 101 6 8   Bubble ET 0.003   InfoMap # InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/mit-nov/edge_list.dat.tree} # Generated on {Fri, 27 May 2011 11:03:51 +0100} # number_of_communities = 11 # avg size = 9 # nodes = 95 19 20 24 32 12 54 98 76 89 50 77 36 70 48 68 4 84 45 37 82 59 23 39 81 60 30 102 86 14 6 91 61 41 16 101 88 79 18 53 35 63 
44 9 40 93 52 72 33 42 15 97 85 100 80 83 38 73 1 90 95 29 46 58 8 7 75 43 66 64 94 67 103 87 2 51 57 47 21 55 78 34 11 17 13 31 10 69 27 99 25 65 74 3 62 28   InfoCom2005 HGCE Bubble K A M E P S R ST MAP Z 3 1 0 0.25 0.2 1 100 0.5 0.5 0.1 BubbleH K A M E P S R ST MAP Z 3 1 0 0.25 0.2 1 100 0.6 0.9 0.1   # Communities AverageMembers # 2 27 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41   KCLIQUE BubbleH K MCT 4 0.009   # k = 4 # minimum_connected_ratio = 0.009 # number_of_communities = 1 # community_sizes = [37] # average_size = 37.0 1 2 3 4 5 6 7 8 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 37 38 39 40 41   Bubble K MCT 2 0.11   LinkClustering bubbleH ET 0.008 # D_max: 0.398456726708 # S_max: 0.368421052632 # Communities: 15 # Average size : 7 # Nodes : 40 (seen 99 times) 9 30 21 41 12 23 11 25 26 35 15 41 13 14 24 10 39 27 20 21 22 16 33 32 37 29 40 34 1 19 3 5 4 7 6 8 17 18 38 36 10 39 15 22 33 37 36 34 1 13 24 27 39 2 14 35 34 24 25 13 15 9 5 37 29 34 18 12 2 23 11 39 27 32 30 37 40 35 5 4 28 26 22 23 33 8 6 18 34 9 12 39 38 40 22 19 32 35 4   bubble ET 0.04   InfoMap # InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2005/edge_list.dat.tree} # Generated on {Fri, 27 May 2011 11:03:59 +0100} # number_of_communities = 3 # avg size = 14 # nodes = 41 40 32 39 1 27 22 15 35 29 19 24 17 4 25 8 13 30 10 9 26 14 36 5 34 11 6 16 38 21 3 23 28 41 2 12 31 33 20 37 7 18   InfoCom2006 HGCE Bubble K A M E P S R ST MAP Z 3 1 0 0.25 0.2 1 100 0.7 0.5 0.2 # Communities AverageMembers # 2 49 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98 22 23 24 25 26 27 28 30 31 32 33 34 35 36 
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94 95 96 98   BubbleH K A M E P S R ST MAP Z 3 1 0 0.25 0.3 4 100 0.6 0.5 0.2   KCLIQUE BubbleH K MCT 3 0.02   # k = 3 # minimum_connected_ratio = 0.02 # number_of_communities = 1 # community_sizes = [58] # average_size = 58.0 22 25 26 27 28 31 32 34 35 36 37 38 39 40 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58 60 61 63 64 66 67 70 72 73 74 75 76 77 78 79 81 82 83 84 85 86 87 88 89 90 91 93 95 98   Bubble K MCT 3 0.02   LinkClustering BubbleH ET 0.01 # D_max: 0.352182263797 # S_max: 0.407407407407 # Communities: 95 # Average size : 4 # Nodes : 78 (seen 387 times) 86 35 38 40 49 36 35 44 56 83 48 24 62 27 68 69 21 22 67 36 48 30 59 46 44 56 71 91 39 92 48 58 94 83 40 56 94 40 59 43 80 65 56 40 48 24 60 62 51 75 52 77 76 88 89 46 39 86 47 56 40 77 58 44 45 51 43 41 60 75 89 73 71 70 64 80 92 95 93 74 25 38 58 49 32 57 56 88 40 43 60 75 89 53 52 86 68 48 46 26 39 26 80 39 74 89 46 54 88 53 52 26 56 74 97 56 29 54 48 26 97 46 85 60 25 75 22 57 51 78 24 62 54 41 76 58 82 32 44 30 51 52 38 22 49 56 40 47 22 87 65 72 26 58 53 43 31 65 83 73 23 24 80 88 57 70 28 50 77 75 89 66 67 82 31 93 81 44 84 85 42 48 24 62 48 69 21 47 45 28 43 60 88 89 73 79 64 93 81 86 84 78 54 22 73 29 56 73 40 49 80 59 45 28 24 62 36 68 65 22 48 65 96 42 29 86 85 25 44 71 41 51 21 63 38 59 47 92 25 67 31 35 49 29 43 89 58 43 37 36 53 34 39 30 22 85 97 85 92 71 74 64 73 68 51 45 86 43 52 76 66 91 90 31 96 93 79 78 59 52 77 58 32 33 86 45 51 43 25 65 83 23 46 39 59 30 48 33 48 46 39 48 51 53 76 89 64 66 82 73 80 84 78 27 49 55 56 35 61 63 40 38 25 26 74 54 57 75 52 26 22 68 47 69 65 61 27 51 43 28 60 88 89 64 66 67 82 80 81 86 87 84 25 44 45 42 43 34 77 76 75 74 73 72 70 91 90 93 95 79 78 51 58 98 32 57 37 50 53 52 54 31 27 48 37 59 34   Bubble ET 0.09   InfoMap # InfoMap: Generated from 
{datasets/communities/EXTENDED_DATASETS/InfoMap/infocom-2006/edge_list.dat.tree} # Generated on {Fri, 27 May 2011 11:04:18 +0100} # number_of_communities = 5 # avg size = 16 # nodes = 78 78 93 76 66 60 74 64 82 77 81 45 72 84 88 28 43 89 90 44 53 51 67 79 87 37 70 75 86 91 56 34 52 73 48 58 40 95 41 32 80 71 50 31 54 26 57 85 47 92 98 36 24 30 33 59 42 62 96 65 23 29 83 94 97 21 38 35 49 63 55 27 61 25 22 39 46 68 69   Cambridge HGCE BubbleH K A M E P S R ST MAP Z 5 1.6 0 0.25 0.2 4 100 0.6 0.5 0.1 # Communities AverageMembers # 4 23 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 1 2 3 4 5 6 7 9 10 11 13 14 15 17 18 19 20 21 23 25 28 29 30 32 33 34 2 3 5 6 7 8 9 10 11 12 13 16 18 20 21 22 25 27 28 29 31 32 33 34 35 36 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 23 24 26 29 30 31 33 35   Bubble K A M E P S R ST MAP Z 5 1 0 0.25 0.3 4 100 0.9 0.9 0.2   KCLIQUE BubbleH K MCT 2 0.02 # k = 2 # minimum_connected_ratio = 0.02 # number_of_communities = 3 # community_sizes = [2, 2, 11] # average_size = 5.0 1 4 24 26 32 34 3 36 7 16 21 22 25 27 28   Bubble K MCT 2 0.04   LinkClustering - bubbleH ET 0.004 # D_max: 0.75851421306 # S_max: 0.5 # Communities: 12 # Average size : 5 # Nodes : 34 (seen 55 times) 11 25 27 21 22 16 32 28 36 34 3 12 7 9 5 19 15 33 2 34 26 35 3 6 13 15 6 19 20 9 13 10 20 14 17 23 19 18 30 1 33 2 5 4 6 15 2 35 24 26 11 21 22 29 7   Bubble ET 0.003   InfoMap # InfoMap: Generated from {datasets/communities/EXTENDED_DATASETS/InfoMap/cambridge/edge_list.dat.tree} # Generated on {Fri, 27 May 2011 11:04:24 +0100} # number_of_communities = 3 # avg size = 12 # nodes = 36 34 21 36 22 25 32 28 3 7 16 27 12 11 29 8 18 33 15 14 5 20 19 4 6 2 23 1 17 30 10 13 9 31 26 35 24

I originally made the decision to use separate parameters for Bubble and BubbleH, but in hindsight this was not a good idea, as they cannot easily be compared.

When we discussed the results, we realised that KCLIQUE seems to perform well for latency with BubbleH, despite not having much hierarchical structure. Pádraig suggested I investigate this further, to find out what is happening. In the meantime I have re-run the simulations using the best parameters for BubbleH, to level the playing field in terms of parameters. As is obvious from the above community structures, only HGCE produces communities with one all-encompassing community, and in the case of KCLIQUE for both InfoCom datasets, it produces only one community. This seems odd, as we would expect the best performance from a more complicated structure.

This resulted in the following community structure.

Community Structures

| Dataset | CFA | comm_count | avg_member_count | min_nodes | max_nodes |
|---|---|---|---|---|---|
| MIT-NOV | HGCE | 21 | 15.80952381 | 4 | 87 |
| MIT-NOV | KCLIQUE | 4 | 20.25 | 4 | 46 |
| MIT-NOV | InfoMap | 11 | 8.636363636 | 2 | 25 |
| MIT-NOV | LinkClustering | 78 | 8.141025641 | 2 | 67 |
| INFOCOM-2005 | HGCE | 2 | 40 | 40 | 40 |
| INFOCOM-2005 | KCLIQUE | 1 | 38 | 38 | 38 |
| INFOCOM-2005 | InfoMap | 3 | 13.66666667 | 2 | 36 |
| INFOCOM-2005 | LinkClustering | 14 | 7.071428571 | 2 | 32 |
| INFOCOM-2006 | HGCE | 2 | 73 | 73 | 73 |
| INFOCOM-2006 | KCLIQUE | 1 | 59 | 59 | 59 |
| INFOCOM-2006 | InfoMap | 5 | 15.6 | 2 | 65 |
| INFOCOM-2006 | LinkClustering | 94 | 4.117021277 | 2 | 43 |
| CAMBRIDGE | HGCE | 4 | 28.75 | 23 | 36 |
| CAMBRIDGE | KCLIQUE | 3 | 6 | 3 | 12 |
| CAMBRIDGE | InfoMap | 3 | 12 | 3 | 18 |
| CAMBRIDGE | LinkClustering | 11 | 5 | 2 | 15 |

Delivery Ratio, Cost, Latency and Delivered Hops for each CFA and for each Dataset, for BubbleRAP and BubbleH. (OLD PLOT - SEE BELOW FOR NEW PLOT)

In the results above, there seems to be something odd happening with LinkClustering for BubbleRAP: for Cost and Hops in MIT-NOV and InfoCom2005 it has much higher values than the other CFAs. Also, there appears to be something odd happening with Latency in the Cambridge dataset, with a maximum of ~1000 hours latency, which is about 40 days. The MIT dataset is roughly the same length, but doesn’t get anywhere near the same latency.

The two plots below show Delivery Ratio and Latency for BubbleH, HGCE for the Cambridge dataset.

Delivery Ratio for HGCE-BubbleH, on the Cambridge dataset

Latency for HGCE-BubbleH, on the Cambridge dataset

These seem to show that at the points where the new parts of the dataset are joined, there is a huge jump in deliveries of messages. It also seems that the datasets are truncated at the end. I will double check the start/end times, but this could be due to the original dataset having no events after a certain timestamp, so that in the last section the simulation stops when there are no events left to process.

The huge spike in messages delivered explains the very high and very low extremes: messages that were sent in the last iteration of message sending get delivered very quickly. It seems odd, however, that the same is not true for the first iteration. It could be that, because the last set of messages have already started to be distributed, perhaps only one hop away, the nodes holding them are in a much better position to deliver the messages at the start of the next section.

This could be due to higher activity in the first hours of the dataset collection, perhaps even due to all the nodes being switched on in the same place at the same time. If this is the case, then we need to start the dataset slice slightly later, to avoid this. The activity traces for each dataset are shown below:

MIT-Nov Daily Connections

Cambridge Daily Connections

Infocom2005 Hourly Connections

Infocom2006 hourly connections

From these we can see that MIT has a good distribution of connectivity across the dataset, but the other three have only short spikes of activity. This in itself does not make a difference to the comparison of results for each dataset, as they are all relative, but it does go some way to explaining the very high latency figures for Cambridge.

The issues with KCLIQUE

I looked back at the data for KCLIQUE in the NGMAST simulations, where there were multiple parameters to KCLIQUE, resulting in varying degrees of community structure. For all datasets, the delivery ratio for KCLIQUE seems to increase with member count, whereas community count peaks before delivery ratio really gets going.

Delivery Ratio, Cost, Community Count and Average Member count.

This makes sense, because for BubbleH, the more we know about members of the communities, the more messages we can direct using local ranking. And because we haven’t included an overall community for KCLIQUE (or InfoMap*, or LinkClustering), we have no way to deliver messages to nodes not in a community until a node holding a message comes into contact with the destination node. I’m not sure if we designed it this way or whether this is just a bug.

The next step is to fix this issue, by either:

1. adjusting the BubbleH algorithm to use global rank to pass messages between nodes at the highest level, or
2. forcing a global community in the other community finding algorithms (this will have the same effect as 1)

* InfoMap includes every node in the communities list, but does not include a global parent community
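Forcing a global parent community amounts to prepending one community containing every node, so BubbleH always has a community that contains any destination. A minimal sketch (helper name is mine, not the simulator's):

```python
def force_global_parent(communities):
    """Prepend a parent community containing every node, so routing can
    always fall back to a community that includes the destination.

    communities: list of sets of node ids
    """
    all_nodes = set().union(*communities) if communities else set()
    return [all_nodes] + [set(c) for c in communities]
```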

Update:

I have now run all of the datasets with a forced global parent community; this seems to have evened the playing field somewhat:

Results where every CFA is forced to produce a global parent community including all nodes. (NOTE the change in axis labels from previous run)

The interesting thing is that there is very little difference in delivery ratio and cost using BubbleRAP.


## Adapting BubbleRap with global hierarchical knowledge

We discussed the results so far, and I suggested there was scope to improve the global ranking used by BubbleRAP. Pádraig hit upon a clever idea to use hierarchical clustering to drive routing, and invited Conrad in to tell us about how Hierarchical GCE works. He described how it will give some notion of stable cut-points in the hierarchical dendrogram.

We then discussed the idea of making the global routing part of BubbleRAP use this hierarchy, by selecting for routing only the minimum set of communities that contain the destination node. In the example above, routing from C to P would mean using the top-level community. However, a route between F and A would mean not passing messages to communities encompassed by N, M, O and P, as they are not in the minimum set. The thinking behind this is that we can prevent messages being sent to obscure places in the network, and thus reduce overhead, and perhaps improve delivery latency.

One possible criticism is that this might be hard to justify in terms of reality, as nodes will not be able to have pre-computed community knowledge. Conrad mentioned that he had been working on some ideas regarding vector clocks, which might be applicable. As I have said before, we could use some data within the vector clocks as a metric for routing, and also use them as a means of communicating network structure data, so that each node can calculate the state of the network based on its own knowledge; a clever mechanism to label communities could then be devised which would allow nodes to route based on this hierarchical knowledge.

Conrad will hook me up with the code to do the clustering, and I will work out how to implement this in the simulator.


## next steps

Met with Pádraig to talk about next steps, discussed the results so far with the four clustering algorithms. The next steps are:

• Stop using the simulation data for community finding.

Posit: all historic data is available and there is a process informing community finding; this data is then used for routing during the test period.

• Tune GCE so it finds more than one community
• (Determine what happens in Bubble-Degree when rank is equal – does it keep or pass?) It keeps the message – rank must be higher to pass
• Use weighted edges for GCE, weighting based on:
  • Connected time
  • Number of connections
  • Or, a threshold as in BubbleRAP
• Use a threshold to remove edges for MOSES

If we find that we get improvements based on the community finding algorithm, the next step would be to synthesise some new datasets. Alternative datasets include Cabspotting, Rollernet, Social Sensing.

We could then map out an analysis: message routing that is informed by overlapping community finding, and interesting ways of calculating those communities as time goes by. Towards a paper: ad-hoc message routing based on overlapping communities.

It might also be useful to explore the difference in utility of Overlapping vs Partitioning community finding algorithms.


## Some new ideas

Met with Pádraig; results so far are not complete, so we still need to:

• Run routing based on ranked centrality only (i.e. the aggregate graph of all connections): Graham might have done this already. This will give us a more complete picture of what routing towards the centre really looks like.
• Do more randomised ranking runs, to see if random can come up with a better routing rank than LBR.
• Implement and test new LBR idea.

Next Step for LBR

Pádraig suggested a simple advancement for LBR:

Person A has a message for Person C, and has encountered Person B. A has to work out whether to pass the message on or not. Each node has a probability matrix giving its probability of visiting each location at any time.

Probability matrix of nodes being at any given location

A makes his decision by calculating the dot product of his own location probabilities against C’s, and comparing that to B’s calculation in relation to C. If the sum for B·C is greater than A·C, then A passes the message to B. The rationale is that when one encounters someone who is more likely to visit the same locations as the destination person, the message should be passed on, because they are more likely to see that person.
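The decision rule is just a dot-product comparison; a minimal sketch, with illustrative names and location-visit probabilities represented as plain vectors:

```python
def should_forward(p_a, p_b, p_c):
    """LBR forwarding sketch: A hands the message for C to B iff B's
    location-visit probabilities align better with C's than A's do.

    p_a, p_b, p_c: per-location visit probability vectors (same length),
    e.g. p_a[i] = probability that A visits location i.
    """
    dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))
    return dot(p_b, p_c) > dot(p_a, p_c)
```

The strict `>` means ties are kept by A; the threshold idea below would widen that margin to avoid handoffs between near-equal carriers.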

There are some caveats….. TODO: buffer zone, future prediction, past history, recent(ness?), limited locations,

Some specific details/ideas/extensions to consider:

• We need to consider how these probability matrices will evolve over time. A node’s probability matrix will change when it visits new places, or alters existing mobility patterns. Initially, we can use the computed data, but more refined versions should reflect the real-world scenario of unpredictable behaviour.
• Determine a good threshold to use, so that messages are not sent between people of very similar rank/score, the rationale being that there may not be a benefit in passing it on, and therefore try to reduce overhead.
• Limit the number of locations considered to only those that are popular; this might boost the use of popular locations, in an attempt to achieve a higher probability of a message being passed on.
• Consider a more sophisticated mechanism to predict co-location with the destination node, or a better conduit/carrier node, by predicting future interactions with nodes based on past history.
• It may also be important to show the possibility of real-world application, by implementing  a scheme for automatic dissemination and update of the probability matrix using the network itself. (related to  previous ideas about piggybacking meta-data using network/vector clocks, which themselves can be used as a source of routing metrics. e.g. recentness of contact, latency, update routes/speeds/times etc.)

Pádraig and I discussed the problems we may encounter in regards to peer review and justification of the work; in that the problem area is not well defined, and therefore we might have problems showing why this work is novel, and proving what our contributions are. To that end we need to explore the literature a little more, so we might be able to show a solid justification for the work, or alternatively, change our approach so it fits in better with other, better defined problems.

What sorts of messages are ‘delay tolerant’? Pádraig’s suggestion is that Twitter messages and Facebook update messages might be delay tolerant, as a person may not need to receive all messages and generally only wants the latest updates; it does not matter if a few are lost along the way.

How do we define the urgency of messages, and the efficiency of the network? Perhaps one type of message can be delivered within 10 time periods, and still be considered to be relevant and within acceptable delivery time, but another message may need to be delivered within 1 time period, to ensure quality of service.
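One simple way to make this concrete is to attach a delivery deadline, in time periods, to each message class. The class and the numbers below are just the illustrative ones from above, not a settled design:

```python
class Message:
    """Toy delay-tolerant message whose urgency is expressed as a
    delivery deadline in time periods (illustrative sketch only)."""

    def __init__(self, created, max_periods):
        self.created = created
        self.max_periods = max_periods  # e.g. 10 for an update, 1 for urgent

    def delivered_in_time(self, delivered_at):
        # A delivery counts as in-time if it arrives within the deadline.
        return (delivered_at - self.created) <= self.max_periods


tweet = Message(created=0, max_periods=10)  # tolerant: status update
alert = Message(created=0, max_periods=1)   # urgent: tight QoS bound

assert tweet.delivered_in_time(7)
assert not alert.delivered_in_time(7)
```

The efficiency of the network could then be measured per message class, e.g. the fraction of messages delivered within their own deadline.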

There is a categorisation issue too; where some messages can be considered one-to-one (direct messaging), some one-to-many (twitter update), many-to-many (local information) and many-to-one (sensor networks). We need to decide which of these we will consider. On this note, I said I would speak to Neil Cowzer, who is working on implicit group messaging, to see what his motivation is, and to see if he has a well defined problem space that he is tackling.

Another alternative that Pádraig suggested was looking at a social science approach, where we look at heuristic approaches to routing in complex social networks. Pádraig’s suggestion was that on some networks we might be able to apply certain routing techniques which do not work on others. The contribution would be defining, categorising and testing new and existing combinations of network types and routing mechanisms. This would be an interesting route to take, but would mean a step back in my research, as I would need to do reading into this area. It would link up well with the work of Stanley Milgram, Granovetter and Watts & Strogatz etc. So I should re-read some of this work, but more importantly, take a look at the most-cited and citing documents, to see where research might be heading now.

## Picking location clusters

Met with Pádraig to discuss his ideas about location clusters. The idea is to assign a cell tower to a maximum of three clusters or ‘locations’. We would first need to remove the largest N clusters. The rank of clusters should be defined by the sum of the weights (number of reports of links between towers) of the edges from the node into the cluster, or perhaps the average.

Cell towers are linked to at most their top three communities, ranked on the strength of the ties (weight of edges) to that community.

The first step is to run this community allocation against the whole set, and see what it looks like, as it might be that we can keep the large clusters.

Once we have this done, we can re-rank the locations using a modified version of the original algorithm. In the new version, a location has a score, which is the sum of the number of times the cell towers within it have been reported; this means that some cell towers will affect the score of multiple locations. Once we have the new ranked locations, we can then rank each user based on their visits to these locations.
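A minimal sketch of the modified scoring, assuming we have the Moses cluster memberships and per-tower report counts as plain dictionaries (names are illustrative):

```python
def location_scores(location_members, tower_reports):
    """Score each location as the sum of report counts of its member towers.

    Because a tower may belong to up to three locations, its reports
    contribute to the score of every location it is a member of.
    """
    return {
        loc: sum(tower_reports.get(t, 0) for t in towers)
        for loc, towers in location_members.items()
    }


def user_ranks(user_visits, loc_scores):
    """Rank users by the summed scores of the locations they visit."""
    scores = {u: sum(loc_scores.get(l, 0) for l in locs)
              for u, locs in user_visits.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a tower reported 3 times that sits in two locations adds 3 to both locations' scores, which is exactly the double-counting described above.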

Location Clusters/Communities

After assigning cells to a maximum of three location clusters based on the AVERAGE edge weight and the SUM of weight (i.e. two separate configurations), it was possible to visualise the assigned clusters.

The top 3 allocated clusters based on MOSES output for MIT-OCT 2004, where the membership is calculated using the average weight of edges in/out of the cluster.

For the initial experiments, the dataset without any communities removed was used as the source for the ranking algorithm, using only the MIT-OCT dataset as the basis for comparison with other results.

LocationSim results for MIT-OCT using Top 3 Moses Allocated Communities for LBR based on AVERAGE, SUM weight calculation, vs RANDOM ranking allocation.

For comparison, one run with a random allocation of rankings (i.e. ranking was shuffled randomly, but only for one run, not an average) gives a hint about the improvement that ranking gives for LBR. In this case, there is no significant improvement, but for more conclusive results, multiple runs of random rankings would need to be tested. It might be interesting to try to find the best possible ranking in order to improve routing; it could be that there is no better ranking that can be achieved than that of LBR. This would explain the poor performance against the Random protocols, which are not limited to a strict hierarchy. To match Random 1.0 and 0.2, LBR may need to be more sophisticated.
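For the multiple random-ranking runs, something along these lines would give an averaged baseline. The `evaluate` hook is a stand-in for one LocationSim run returning e.g. a delivery ratio; it is a hypothetical interface, not the real simulator's:

```python
import random


def random_ranking_baseline(nodes, evaluate, runs=30, seed=0):
    """Average a delivery metric over many randomly shuffled rankings.

    `evaluate(ranking)` stands in for one simulation run that returns a
    metric (e.g. delivery ratio) for the given node ranking.
    """
    rng = random.Random(seed)  # fixed seed so the baseline is repeatable
    results = []
    for _ in range(runs):
        ranking = list(nodes)
        rng.shuffle(ranking)
        results.append(evaluate(ranking))
    return sum(results) / len(results)
```

Averaging over 30 or more shuffles would smooth out the luck of any single random ranking, which is what makes the single-run comparison above inconclusive.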


## Meeting 12 Nov 2010

We discussed progress since last week; I showed the plots from the MIT-Reality dataset, and we decided that it is probably OK for determining whether nodes are co-located or not, and gives some extra information about movements. I said that I would generate a graph of people connected during the same time period to the same cell tower.
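The graph I said I would generate could be built along these lines, assuming records of the form (time period, tower, person); people reporting the same tower in the same period become connected:

```python
import itertools
from collections import defaultdict


def colocation_graph(records):
    """Build an edge set of people seen at the same cell tower in the
    same time period. `records` is an iterable of (period, tower, person)
    tuples (an assumed input format)."""
    groups = defaultdict(set)
    for period, tower, person in records:
        groups[(period, tower)].add(person)
    edges = set()
    for people in groups.values():
        # Every pair seen together in one (period, tower) bucket is an edge.
        for a, b in itertools.combinations(sorted(people), 2):
            edges.add((a, b))
    return edges
```

The resulting edge set could be weighted later by how many periods each pair co-occurs in, if contact strength turns out to matter for routing.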

Pádraig suggested a simple policy for routing, based on the ranking of locations and consequently the people that visit them, where a person's rank is based on the rank of the towers (places) he visits, and how many different towers he visits, making a person that moves around to different, highly ranked towers more important for routing than someone that only visits low-ranked towers, or perhaps even someone who only visits one highly ranked tower.

Davide said that he might have made some notes about this in the past and would look them up, and Pádraig said that it could be calculated within some probabilistic framework. I noted that this sort of scheme would really be routing towards the center of the ranking network, which Pádraig suggested was because it was routing based on the centrality of the network. But we decided that it is worth using this to at least test the idea out against a random routing protocol.

With regards to the actual implementation, Pádraig explained his idea that we could do the ranking (at least to start with) before the simulation starts, and simply give nodes fixed ranked positions, meaning that we don’t actually have to change any of the simulator code; the rank is just a property of the node itself. So, the routing policy is as follows:

on meeting another node:
    foreach message in buffer:
        if the other node is the destination node:
            pass message
        else if the other node has a higher rank and visits a
                location that the destination node does:
            pass message
        else:
            keep message

The idea being that the protocol uses the existing contact events to drive communications, but uses the fixed location ranking to form the routing policy.
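The policy can be sketched as a small Python function; the `Node`/`Msg` structures and the `visits` map (node id → set of visited locations) are stand-ins for whatever the simulator actually provides:

```python
from collections import namedtuple

# Minimal stand-ins: rank is the fixed, precomputed node property.
Node = namedtuple("Node", ["id", "rank"])
Msg = namedtuple("Msg", ["destination"])


def should_forward(msg, me, other, visits):
    """Decide whether `me` should pass `msg` to `other` on a contact event."""
    # Deliver directly if we have met the destination itself.
    if other.id == msg.destination:
        return True
    # Otherwise pass only to a higher-ranked node that shares at least
    # one location with the destination; else keep the message.
    shares_location = bool(visits[other.id] & visits[msg.destination])
    return other.rank > me.rank and shares_location
```

Because the ranks are fixed before the simulation starts, this check needs no simulator changes, exactly as suggested above.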

Pádraig also said that we really need to have a conference target in mind, and we couldn’t really think of a suitable one, Davide suggested Ubicomp, but it has a low acceptance rate.

So, for next time, I will have generated a graph of connected individuals based on connecting to cell towers during the same time period, made sure I can generate the same results as Graham, generated  a ranking for individuals based on cell towers, developed a routing scheme based on this, and hopefully identified conference targets.

UPDATE:

Davide suggested the following calculation for determining the popularity of users, which can be used as a metric for routing.

Popularity of user $i = \sum_j \tau_{ij} ( p_j + c )$
where $\tau_{ij}$ is the time (or number of times) user $i$ has visited tower $j$, $p_j$ is the popularity of tower $j$, and $c$ is a parameter which tunes the importance of user mobility.

Pádraig suggested that $c$ should be 0.
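Davide's formula is cheap to compute once per node before a simulation run; a minimal sketch, with $\tau_i$ and the tower popularities $p_j$ as plain dictionaries:

```python
def user_popularity(tau_i, tower_popularity, c=0.0):
    """Compute popularity of user i: sum_j tau_ij * (p_j + c).

    `tau_i` maps tower j -> time (or number of visits) user i spent
    there, `tower_popularity` maps tower j -> p_j. Pádraig's suggestion
    corresponds to the default c = 0.
    """
    return sum(t * (tower_popularity.get(j, 0.0) + c)
               for j, t in tau_i.items())
```

With $c = 0$, a user's popularity is just the visit-time-weighted sum of the popularities of the towers they visit; larger $c$ rewards raw mobility regardless of where the user goes.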


## Meeting with Pádraig and Davide 21 Oct 2010

Talked again about location clustering and proximity methods, and Pádraig suggested we try to get past this, and get on with the interesting stuff.

To simplify things, we decided that it is probably best to use the grid partitioning system for considering co-location, as it gives us an implicit referencing scheme, and by considering neighboring locations to also be co-located, we can resolve the problem of split places.
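A minimal sketch of the grid scheme: map each coordinate to an integer cell id, and treat same-or-adjacent cells as co-located, which handles places split across a cell boundary. The cell size here is an assumed parameter, not a decided value:

```python
def grid_cell(lat, lon, cell_size=0.01):
    """Map a coordinate to an implicit grid cell id.

    `cell_size` is in degrees and is an illustrative assumption.
    """
    return (int(lat // cell_size), int(lon // cell_size))


def co_located(cell_a, cell_b):
    """Two nodes are co-located if their cells are the same or adjacent
    (8-neighbourhood), resolving places split across a boundary."""
    return (abs(cell_a[0] - cell_b[0]) <= 1
            and abs(cell_a[1] - cell_b[1]) <= 1)
```

The integer pair doubles as the implicit referencing scheme mentioned above: two nodes can compare cell ids directly without exchanging raw coordinates.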

With regards to the proximity calculations, we still need to verify the meaning of the signal strength value, as it does not necessarily represent actual signal strength. We decided to consider any common WiFi point spotting to be a proximity event, and will record the start and end times of these, so that we can use metrics such as duration of contact in our routing scheme(s).
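One way to sketch this: treat two nodes as in proximity during any sampling period in which they report a common WiFi access point, and merge consecutive periods into (start, end) contact events. The record format and the sampling period are assumptions here:

```python
def proximity_events(sightings_a, sightings_b, period=300):
    """Turn common WiFi access-point sightings into contact events.

    `sightings_*` are lists of (timestamp, ap_id) tuples (assumed format);
    a common sighting marks the two nodes as in proximity for that
    sampling period (`period` seconds, an assumed interval). Consecutive
    periods are merged into (start, end) intervals.
    """
    common = sorted(set(sightings_a) & set(sightings_b))
    events = []
    for ts, _ap in common:
        if events and ts <= events[-1][1] + period:
            # Contiguous with the previous event: extend it.
            events[-1][1] = max(events[-1][1], ts + period)
        else:
            events.append([ts, ts + period])
    return [(start, end) for start, end in events]
```

The (start, end) pairs give exactly the contact durations we want as routing metrics.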

Davide offered to find out more about the signal strength issue, so that we can reliably refine it later on.

Pádraig suggested that I/we need to find some good conference targets, ideally around January. I suggested ICUMT and Mobihoc, and Davide reminded me about a special issue journal call for papers that I spotted recently (Call for papers Phycom). Ubicomp 2011 is a possibility too.

By next week, I want to have the space partitioning method sorted out, and all contact/proximity events recorded. As well as coming up with an idea about the simulated on-device data structure, and some ideas about routing schemes.
