Archive

Archive for February, 2011

Adapting BubbleRap with global hierarchical knowledge

February 24th, 2011 No comments

We discussed the results so far, and I suggested there was scope to improve the global ranking used by BubbleRAP, Pádraig hatched upon a clever idea to use hierarchical clustering to drive routing. Pádraig invited Conrad in to tell us about how Hierarchical GCE works. He descbibed how it will give some notion of stable heirararchical denodograms cut-points.

heirarchicalclustering

We then discussed the idea of causing the global routing part of BubbleRAP to use this hierarchy, by selecting only the minimum set of  communities that contain the destination node for routing to. In the example above, routing from C to P would mean using the top level community. However, a route between F and A, would mean not passing messages to communities encompassed by N, M, O and P, as they are not in the minumum set. The thinking behind this, is that we can prevent messages being sent to obscure places in the network, and thus reduce overhead, and perhaps increase delivery latency.

One possible criticism, is that this might be hard to justfify in terms of reality, as nodes will not be able to have pre-computed community knowledge. Conrad mentioned that he had been working on some ideas regarding vector clocks, which might be applicable, as I have said before, we could use some data within the vector clocks as metric for routing, and also use it as a means of communicating network structure data, so that each node can calculate the state of the network based on it’s knowledge, and therefore a clever mechanism to label communities could be devised which would allow nodes to route based on this hierarchical knowledge.

Conrad will hook me up with the code to do the clustering, and I will work out how to implement this in the simulator.

Routing using communities

February 21st, 2011 No comments

It is important to make sure that each community finding algorithm (CFA) gets a chance to compete on a level playing field, so that they can be equally tested.

The dataset is MIT Reality Mining, which starts in August 200? and ends in May 200?.

The Bubble RAP algorithm uses k-KLIQUE for community finding. The graph contains the core nodes in the dataset, edges are created when a contact is made between two nodes, and labelled with the total time connected between the two nodes. Edges are removed when the total time connected is less than a defined threshold. Then it detects the communities using a k-CLIQUE algorithm, where k = 3. The threshold used by Graham is 2 days. It is 4.5 days in the BubbleRAP paper.

MIT-OCT has been used extensively by Graham, along with MIT-NOV to test his algorithms, so we will continue to use that time period for testing. However, we must make sure that there is sufficient time for all clustering algorithms to make a good effort at finding communities. I believe this was down to the fact that there was activity recorded a large number of users in the dataset for these  time periods.

The figure below shows the average number of connections per week for the whole dataset.

The average number of connections (bluetooth contacts) per week in the MIT Dataset

The average number of connections (bluetooth contacts) per week in the MIT Dataset

A note about the MIT-OCT dataset: There are two groups of nodes in the dataset, those belonging to the CORE set, and the rest, which belong to the encompasing ALL set. A node is determined to be in the CORE set if it is part of the MIT Reality Mining study, and is reporting readings. A node that is not in the CORE set, but is in the ALL set, represents a device that has been spotted by a node in the CORE set, but that is not known about apart from it’s contact trace. These are normally considered separate to the routing process, as it would not be possible to communicate with that node.

So, for the first experiment, each algorithm will generate communities from 01 Aug 2004 until 27 Sep 2004, and 01 Aug 2004 until 18 Nov 2004. The test period will be  27 Sep 2004 to 25 Oct 2004 and 18 Nov 2004 to 15 Nov 2004.

The image below shows the network generated in the MIT October training period. Some of the low weight (connected time) edges are removed for clarity. It appears to have three groups of nodes.

Visualisation of MIT OCT Training edge list. Edges are weighted by the amount of time nodes are connected, with low weight edges removed for clarity. Nodes are sized and coloured by degree.

Visualisation of MIT OCT Training edge list. Edges are weighted by the amount of time nodes are connected, with low weight edges removed for clarity. Nodes are sized and coloured by degree.

Some options for community finding algorithms:

  • Should the graph they all work on be limited by the threshold period?
    • Justification for this is that BubbleRAP determines this to be important
  • Whether or not to use CORE|ALL nodes
  • If the CFAs cannot successfully find communities for the training period, we will need to experiment with time periods that suit all algorithms.

The table below shows the results of community finding on the training datsets mit-oct and mit-nov, apart from KCLIQUE which performs very poorly before the end of December for any threshold value, so for the test period,  01 Aug 2004 to 31 Dec 2004 was used to calculate the communities.

Dataset # Communities Average Size
MIT-OCT-TRAINING – 01 Aug 2004 to 27 Sep 2004
KCLIQUE t=2days – to 31 Dec 2004 8 7
MOSES 7 35
MOSES adjusted graph * 1 12
GCE 4 60
ABL adjusted graph * 11 3
ABL 113 7
MIT-NOV-TRAINING – 01 Aug 2004 to 18 November 2004
KCLIQUE t=2days – to 31 Dec 2004 8 7
MOSES 8 42
MOSES adjusted graph * 3 7
GCE 2 51
ABL adjusted graph * 41 3
ABL 99 8

* adjusted graph means that some edges were removed from the edge list before passing to the algorithm, in this case, edges where the total connected time of nodes was less than 1 day.

Bubble results for test period Oct and Nov 2004, with training periods 01 Aug 2004 to 27 Sep 2004, and 18 Oct 2004 respectively.

Bubble results for test period Oct and Nov 2004, with training periods 01 Aug 2004 to 27 Sep 2004, and 18 Oct 2004 respectively.

Bubble results for test period Oct and Nov 2004, with training periods 01 Aug 2004 to 27 Sep 2004, and 18 Oct 2004 respectively.

Bubble results for test period Oct and Nov 2004, with training periods 01 Aug 2004 to 27 Sep 2004, and 18 Oct 2004 respectively.

average-weekly-connections-alt

Average weekly connections, showing training period and test period for MIT-OCT and MIT-NOV

next steps

February 17th, 2011 No comments

Met with Pádraig to talk about next steps, discussed the results so far with the four clustering algorithms. The next steps are:

Tasks

  • Stop using the simulation data for community finding.
communitytrainingdata

Posit: all historic data is available and there is a process that is informing community finding, this data is then being used for routing test period.

  • Tune GCE so it finds more than 1 community
  • (Determine what happens in Bubble-Degree when rank is equal – does it keep or pass) it keeps it – rank must be higher
  • Use weighted edges for GCE – Weight based on:
    • Connected time
    • Number of connections
    • Or, threshold as in BubbleRAP
  • Use threshold to remove edges for MOSES

If we find that we get improvements based on the community finding algorithm, the next step would be to synthesise some new datasets. Alternative datasets include Cabspotting, Rollernet, Social Sensing.

We could then map out an analysis: message routing that is informed by overlapping community finding, and interesting ways of caculating those communities as time goes by. Towards a paper: ad-hoc message routing based on overlapping communities.

It might also be useful to explore the difference in utility of Overlapping vs Partitioning community finding algorithms.

BubbleRAP routing using different clustering techniques

February 15th, 2011 1 comment

After a lot of wrangling with code, and spending a good deal of time reading through notes, etc. I was able to work out how to generate community clusters using LocationSim, and how to feed in clusters determined by other mechanisms.

BubbleRAP uses and adapted form of KCLIQUE clustering to generate community structures, along with a score of Betweenness centrality (referred to as Hui Ranking) globally and within clusters to drive the routing process. The adaption relates to the use of a contact threshold, (the Bubblerap paper uses 4.5 days, Graham uses 2 days), which specifies how long on average nodes must be in contact for them to be counted.

I adapted the clustering phase to use MOSES, GCE (using Conrad’s code) and the algorithm by Yong-Yeol Ahn, James P. Bagrow, Sune Lehmann in Link communities in complex networks which I have name ABL. At this stage I have tested only the basic algorithms against MIT-OCT and MIT-ALL node lists.

Dataset # Communities Average Size
MIT-OCT
KCLIQUE t=4.5days
KCLIQUE t=2days
MOSES 8 29
GCE 3 41
ABL 64 8
MIT-ALL
KCLIQUE t=4.5 days 6 6
KCLIQUE t=2 days 7 9
MOSES
GCE 1 101
ABL 16 19

I found that KCLIQUE does not find any communities for the small time slice in MIT-OCT, this could be because the threshold means that in the course of a month, participants do not spend alot of time together on average.  I will need to explore this further to verify this this.

MOSES and GCE perform badly on the denser, MIT-ALL dataset, and after speaking to Aaron and Conrad, it seems this can be expected with such a closely connected dataset. My next step is to try to use GCE using edge weights, to see if that makes a difference.

in order to get some idea of results for MIT-ALL MOSES and GCE, and MIT-OCT KCLIQUE, I used the communities from their opposite positions to populate the datasets, this meant I was able to simulate BubbleRAP for all cluster types for both MIT-OCT and MIT-ALL.

Another intersting thing I discovered, was Grahams implementation of  Bubble-Degree, which instead of using the Hui Ranking for communities, nodes are ranked by degree over a 6 hour time period. Bubble-Connections worked in the same way but the ranking is based on the number of connections in a 6 hour period. (Graham explains this in section 6.1.2 in his thesis, and in the comments of his code).

The results are below for all cluster types, and both datasets, using bubble-degree, and bubblerap. Also included is unlimited flood to show the maximum possible delivery between all pairs of nodes.

Results for Bubble and Bubble-Degree for dataset MIT-ALL, using KCLIQUE, MOSES, GCE and ABL

Results for Bubble and Bubble-Degree for dataset MIT-ALL, using KCLIQUE, MOSES, GCE and ABL

Results for Bubble and Bubble-Degree for dataset MIT-OCT, using KCLIQUE, MOSES, GCE and ABL

Results for Bubble and Bubble-Degree for dataset MIT-OCT, using KCLIQUE, MOSES, GCE and ABL

This seems to suggest that the clustering algoritm does have some effect on the routing algorithm, and further analysis may be able to show whether this is due to simply the difference in number/size of communities.

Other tasks to do:

  • Investigate recent improvements of Bubble Rap (if any)
  • Consider different network data  (e.g. can we run a similar analysis over a phonecall network?)
  • Tweaks to clustering algorithms?
  • Chase up SocialSensing dataset
  • Continue to think about Location Based Routing – is it feasable still?

Meeting with Pádraig 3 Feb 2011

February 7th, 2011 No comments

Discussed results so far – I feel that the dataset does not include enough data to give reliable information about location. We discussed other datasets, perhaps using cabspotting and geolife.

Padriag re-iterated his skepticism about using location; that all the information needed is in contact information, and therefore location data is redundant. we suggested the idea of testing LBR on the cabspotting dataset, over short periods of time, because cabs are highly mobile, they come into contact with each other more often.

Pádraig suggested that we could start looking more at communities, and continuing on from Bubble RAP, so we could then use synthetic networks. It would mean a bit of a change of track however.

I said I would try to get Bubble Rap running in the simulator, then try to run it using MOSES or GCE to determine the communities, rather than the default method. GCE has a heirarchical clustering mechanism, which would suit Bubble Rap.

I also said I would try to use the Social Sensing data to see how it performs.

Pádraig suggested again that we could follow on from what Pan Hui did, and testing highly overlapping communities, and use some of the groups community finding algorithms. This would fit better with the groups work, and therefore would be a bit easier to link in.

With Bubble added the results for MIT-OCT are show below – note that the communities used for this run, were pre-generated by Graham.  I haven’t yet  found out how he generated them.

Results with Bubble included

Results with Bubble included

Bubble routing seems to calculate Communities using k-clique clustering and weighted network analysis.

Last minute update: I generated data for use in the simulator based on MOSES communities (based on contacts between nodes). Instead of using Hui Betweenness for ranking, I simple used the number of communties a node is a member of, as the rank value, just for initial testing (of script output).  It yielded a better result than regular Bubble, with the caveat that the bubble communities are based on a start date 1 week into the period, and I need to double check that the MOSES one also does this.

moses-vs-hui-rank-with-caveats

Regular Bubble routing vs Bubble MOSES Communities with Community Rank

Categories: Uncategorized

Routing to the centre

February 2nd, 2011 No comments

Further to the previous post, I have run some more simulations which include degree rank routing, where individuals are ranked in order based on their degree. I also updated LBR to allow for randomising rank scores, and ran this 25 times to get an idea about the ability for random ranking to deliver messages.

All routing methods shown for comparison.

All routing methods shown for comparison.

Interestingly,  routing based on degree only, gives a good delivery ratio. The average of the RANDOM runs performs slightly better than the more sophisticated LBR schemes, but not quite as good as LBR basic (which is the original ranking based solely on cell towers). This suggests to me that the mechanisms we are using to rank locations are not working, perhaps due to lack of rich enough data. It will be interesting to see how this dataset compares to other datasets.

Some new ideas

February 1st, 2011 No comments

Met with Pádraig,  results so far are not complete, so we still need to:

  • Run routing based on ranked centrality only (i.e. the aggregate graph of all connections): Graham might have done this already. To give us a more complete picture of what routing towards the centre really looks like.
  • Do more randomised ranking runs, to see of random can come up with better routing rank than LBR.
  • Implement and test new LBR idea.

Next Step for LBR

Pádraig suggested a simple advancement for LBR:

Person A, has a  message for Person C, and has encountered Person B. A has to work out whether to pass the message on or not. Each node has  a probability matrix of visiting all locations at any time.

Probability matrix of nodes being at any given location

A makes his decision by calculating the dot product of his own locations against C’s locations, and comparing that to B’s calculation in relation to C. If the sum for the B.C is greater than A.C then A passes the message to B. The rationale being that when one encounters someone else, who is more likely to visit the same location as the destination person, then the message should be passed on, because they are more likely to see the other person.

There are some caveats….. TODO: buffer zone, future prediction, past history, recent(ness?), limited locations,

Some specific details/ideas/extensions to consider:

  • We need to consider how these probability matrices will evolve over time. A nodes probability matrix will change when he/it visits new places, or alters existing mobility patterns. Initially, we can use the computed data, but more refined versions should exhibit a real-world scenario of unpredictable behaviour.
  • Determine a good threshold to use, so that messages are not sent between people of very similar rank/score, the rationale being that there may not be a benefit in passing it on, and therefore try to reduce overhead.
  • Limit the number of locations considered, to only those that are popular, this might boost the use of popular locations, in an attempt achieve a higher probability of a message being passed on.
  • Consider a more sophisticated mechanism to predict co-location with the destination node, or a better conduit/carrier node, by predicting future interactions with nodes based on past history.
  • It may also be important to show the possibility of real-world application, by implementing  a scheme for automatic dissemination and update of the probability matrix using the network itself. (related to  previous ideas about piggybacking meta-data using network/vector clocks, which themselves can be used as a source of routing metrics. e.g. recentness of contact, latency, update routes/speeds/times etc.)

Pádraig and I discussed the problems we may encounter in regards to peer review and justification of the work; in that the problem area is not well defined, and therefore we might have problems showing why this work is novel, and proving what our contributions are. To that end we need to explore the literature a little more, so we might be able to show a solid justification for the work, or alternatively, change our approach so it fits in better with other, better defined problems.

What sorts of messags are ‘delay tolerant’? Pádraig’s suggestion is that twitter messages, and facebook update messages might be delay tolerant, as a person may not need to receive all messages, and generally only want to get the latest updates, it does not matter if a few are lost along the way.

How do we define the urgency of messages, and the efficiency of the network? Perhaps one type of message can be delivered within 10 time periods, and still be considered to be relevant and within acceptable delivery time, but another message may need to be delivered within 1 time period, to ensure quality of service.

There is a categorisation issue too; where some messages can be considered one-to-one (direct messaging), some one-to-many (twitter update), many to many (local information) and many-to-one (sensor networks). We need to decide which of these we will consider. On this note, I said I would speak to Neil Cowzer, who is working on implicit group messaging, to see what his motivation is, and to see if he has a well defined problem space that he is tackling.

Another alternative that Pádraig suggested, was looking at social science approach, where we look at the heuristic approaches to routing in complex social networks. Pádraig’s suggestion was that on some networks, we might be able to apply certain routing techniques, which do not work on others. The contribution would be defining, categrorising and testing new and existing combinations of network types and routing mechanisms. This would be an interesting route to take, but would mean a step back in my research, as I would need to do reading into this area. This would link up well with the work of Stanley Milgram, Granovetter and Watts & Strogatz etc. So I should re-read some of this work, but more importantly, take a look at the biggest cited, citing documents, to see where research might be heading now.