Current Tasks

January 11th, 2012

Below is a list of tasks I had before I went on holiday for Christmas. Since I came back, I have worked my way through most of them (struck through).

  • Incorporate the Blondel Louvain method into the simulator
  • Incorporate the Conga (Steve Gregory) method into the simulator
  • Consider using the InfoMap hierarchical method
  • Implement multiple Random runs giving an average result
  • Incorporate the dataset from the Salathe paper (motes in a school)
  • Complete an exhaustive run:
    • Use entire dataset for training and testing (using all core nodes)
    • over all datasets (MIT-NOV, MIT-ALL, Social Sensing, Hypertext2009, Cambridge, InfoCom05, InfoCom06, Enron-a, Studivz (selected nodes), Salathe-School)
    • using all algorithms (KCLIQUE, HGCE, InfoMap, LinkClustering, Moses, Blondel, Conga, Random)
    • selecting best of four threshold parameters (where appropriate)
  • Generate plots for all of the above
All that currently remains is to incorporate Conga, which is a little tricky, as it is not easy to integrate into the simulator (no output options that can be worked into later scripts), and there is no source code available.

Also, hierarchical InfoMap needs to be looked at…

Also, runs for MIT-ALL and Enron-a need to be set up.

    Categories: what i've been doing

    Ideas about message routing

December 14th, 2011

    Aaron (McDaid) and I had an interesting conversation about what we mean by message routing.

We first discussed what it meant to have a single message, or a number of message copies, and termed ‘ink and paper’ the ‘cost’ associated with making a copy of a message and passing it to another node, rather than just calling it ‘cost’.

We decided that in my simulations, ‘cost’ did not exactly mean this, as even if a node sent one message to a node it met, and kept a copy of it in its archive, the ‘cost’ would be exactly the same as if it discarded the message.
Whereas in our ink and paper cost scenario, there would be an additional ‘ink and paper cost’ of 1 for every ‘copy’ of the message sent, unless the original message was passed on.
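
To make the distinction concrete, here is a tiny sketch (Python; the counter names are my own and this is not the simulator’s actual accounting): the simulator charges one unit per transmission, while the ink and paper cost only counts transmissions where the sender also keeps a copy.

    def transfer(sender_keeps_copy, costs):
        """Record one hand-over of a message. `costs` holds both counters."""
        costs["simulator"] += 1              # every transmission costs one unit
        if sender_keeps_copy:
            costs["ink_and_paper"] += 1      # an extra physical copy was made
        # if the sender discards its copy, the original simply moved on: no ink used

    costs = {"simulator": 0, "ink_and_paper": 0}
    transfer(sender_keeps_copy=True, costs=costs)    # copy-and-forward
    transfer(sender_keeps_copy=False, costs=costs)   # hand over the original
    print(costs)                                     # {'simulator': 2, 'ink_and_paper': 1}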

I mentioned some ideas about how to estimate global properties from a local viewpoint (i.e. Graham’s periodicity paper combined with vector clocks).

We then went on to discuss what ‘periodic’ meant in terms of future prediction of events. Aaron was concerned that I in fact meant ‘often’ as opposed to ‘regular’: often implies that some event is repeated, but not necessarily at a regular interval, whereas regular implies that the event is repeated with the same interval (e.g. we meet every Thursday). He suggested that I be very careful when using the word ‘periodic’.

He also expressed concern over my liberal use of the term ‘dynamic networks’, and suggested that the network may be better described as a temporal network, as edges are made and broken over time.

    We then spent some time discussing a mechanism to determine optimum routes through a series of interactions in a temporal network of people. (I am pretty sure I have read a paper that does something similar to what follows).

    Aaron suggested we model contact events as nodes, so we worked through this idea:
We have 5 nodes, A to E.
The first event, at time = 0, is where each node meets itself.
A new event node is created when two nodes meet, an edge between the event and each node is created, and an edge between a previous event and the new event is created if at least one of the nodes was in the previous meeting.
    Edges are labelled by the amount of time since the last meeting.
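
To make the construction concrete, here is a rough Python sketch (the toy meeting list, names and structure are my own, and only the event-to-event edges are built):

    from collections import namedtuple

    Event = namedtuple("Event", "id time members")   # one contact event in the graph

    def build_event_graph(nodes, meetings):
        """nodes: node ids; meetings: list of (time, node_a, node_b), in time order.
        Returns (events, edges); edges link each new event back to the previous
        event of each participant, labelled with the time elapsed since it."""
        events = [Event(i, 0, frozenset((n,))) for i, n in enumerate(nodes)]   # t=0: each node meets itself
        last_event = {n: events[i] for i, n in enumerate(nodes)}
        edges = []
        for time, a, b in meetings:
            ev = Event(len(events), time, frozenset((a, b)))
            events.append(ev)
            for prev in {last_event[a], last_event[b]}:       # one edge per distinct previous event
                edges.append((prev.id, ev.id, time - prev.time))
            last_event[a] = last_event[b] = ev
        return events, edges

    events, edges = build_event_graph("ABCDE", [(2, "A", "B"), (8, "B", "C"), (12, "C", "D")])
    for src, dst, elapsed in edges:
        print(f"event {src} -> event {dst}, {elapsed} time units since the last meeting")

The toy times are chosen so that the chain of meetings A-B, B-C, C-D gets labels 2, 6 and 4, matching the example below.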

Network of events

At the end of this period, we are now able to determine the best ‘routes’ for sending information between nodes. There are two optimal routes: the time optimal route, and the hop optimal route. Information from A received by D can travel from A to B to C to D, 3 hops and a cost of 2 + 6 + 4 = 12, or between A and D directly, 1 hop at a cost of 13.

In the case of C to D, there are two routes: C to B to D (cost 12), or C to D (cost 12), so how do we decide which to use? Well, we can assume that the best route is one that minimises time primarily, and hops secondarily. If we add a small epsilon value to the time of each hop, e.g. 0.00001, this gives us:

    C to B to D = 2 hops = 8.00001 + 4.00001  = 12.00002

    vs

    C to D  = 1 hop = 12.00001

So the second, shorter (in hops) route takes precedence, as its adjusted total time is less. We have chosen the smallest number of hops where the time would otherwise have been equal. The reasoning is that it’s better to ‘copy’ (ink and paper) fewer times when the total time is the same.
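
A minimal sketch of this tie-break (plain Python, using the toy edge times from the C-to-D example; EPS and the graph format are my own):

    import heapq

    EPS = 1e-5    # small per-hop penalty: breaks ties in favour of fewer hops

    def best_route(graph, src, dst):
        """graph: {node: [(neighbour, time), ...]}. Returns (adjusted cost, path)."""
        queue = [(0.0, src, [src])]
        seen = set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == dst:
                return cost, path
            if node in seen:
                continue
            seen.add(node)
            for nxt, t in graph.get(node, []):
                if nxt not in seen:
                    heapq.heappush(queue, (cost + t + EPS, nxt, path + [nxt]))
        return None

    graph = {"C": [("B", 8), ("D", 12)], "B": [("D", 4)]}
    print(best_route(graph, "C", "D"))   # (12.00001, ['C', 'D']): one hop beats two at equal time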

The only case where there is a conflict is when there are an equal number of hops and the information is delivered after the same time period. In this case we might adopt a few strategies:

1. Keep a record of all of the optimal paths
    2. Record all paths, with a rank of efficiency
    3. Choose only paths with the most recent event
    4. Choose only paths with the oldest event
    5. Choose a path at random from the optimum paths

    If we now wished to predict routes, Aaron suggested that an exhaustive mechanism is to predict all possible future connections, assuming existing connections are regular and periodic, calculate the best routes, and send messages accordingly. This would be quite intensive and perhaps infeasible.

Perhaps a better strategy is to use our knowledge of existing routes: on meeting a node that is on the optimal path to the destination of a message we hold, pass the message on, else keep it.
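
As a sketch, that forwarding rule might look like this (the known_paths table and message format are hypothetical; how those paths would be predicted is exactly the unsolved part below):

    def should_forward(message_dest, encountered_node, known_paths):
        """Hand the message over only if the node we just met lies on our believed-optimal path."""
        path = known_paths.get(message_dest, [])
        return encountered_node in path

    known_paths = {"D": ["C", "B", "D"]}          # this node believes C -> B -> D is optimal
    print(should_forward("D", "B", known_paths))  # True: pass the message to B
    print(should_forward("D", "E", known_paths))  # False: keep carrying it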

    We didn’t get as far as solving the problem of how to predict the future connections, and therefore calculate the optimal path.

    Distributed Community finding in Dynamic Networks

December 5th, 2011

I have been thinking about this as an extension to the work we have done with BubbleH and BUBBLE Rap. The biggest drawback of these is that they assume that community information is available to them. This is not necessarily realistic, as in the real world, nodes would have to get this information from somewhere.

A solution to this is to create some sort of distributed community finding mechanism that adapts to the ever-changing dynamic network made up of the nodes doing the calculations.

In a previous post, I mentioned Vector Clocks, and whilst they seem like a nice idea, the Vector Clock idea on its own doesn’t give us a routing scheme.

I think there is a way to use some of the concepts of the vector clock idea to inform a distributed community finding scheme, which estimates, for example, the communities found by InfoMap, or, ideally, the hierarchical clustering of LinkClustering.

I think there is something in sharing edge information (i.e. edges and connected time) which will allow us to detect communities: perhaps using our assumptions about threshold values (Mean, Median, 80th percentile and 20th percentile), we can run lightweight algorithms at node level.

This can either be done by a node to calculate its own communities, or to calculate communities for other nodes.

For example, if a node knows about its own and some other nodes’ communities, it can make decisions about routing. If it has a message for a node that it doesn’t know about, it assumes that node belongs to a global community.
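
A lightweight, node-local sketch of the idea (Python; the gossip format, the choice of the median threshold and the flood-out are illustrative, not one of the CFAs used in the simulator):

    from statistics import median

    def estimate_community(me, known_edges):
        """known_edges: {(a, b): connected-time ratio} gathered from encounters/gossip."""
        threshold = median(known_edges.values())      # one of the four candidate thresholds
        kept = [edge for edge, w in known_edges.items() if w >= threshold]
        community, frontier = {me}, [me]
        while frontier:                               # flood outwards over the strong edges only
            node = frontier.pop()
            for a, b in kept:
                if node in (a, b):
                    other = b if node == a else a
                    if other not in community:
                        community.add(other)
                        frontier.append(other)
        return community

    edges = {("A", "B"): 0.4, ("B", "C"): 0.3, ("C", "D"): 0.05, ("D", "E"): 0.5}
    print(estimate_community("A", edges))   # {'A', 'B'}: the component A reaches over strong edges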

    It may also be possible to estimate betweenness centrality based on degree over time (as Graham did), in which case we have a way to calculate global rank, and perhaps LocalRank. If we can find a way to detect community size (perhaps nodes advertise membership), then we can also determine BridgingCommunities.

I would have to be careful not to step on Graham’s toes though, as his Persistence Based Routing also relies on connectivity times between nodes, but as far as I am aware, it does not do any sort of clustering of nodes.

    Categories: Ideas, Thoughts

    Selected Nodes – with fixed HGCE output

November 30th, 2011

We identified a problem with the way HGCE was being used: for all other CFAs the community structure used was identical apart from the global parent used by BubbleH, but HGCE outputs a different structure every run. I adapted my code to make sure that BubbleRAP and BubbleH make use of the SAME community structure. In the results below, selected pairs of nodes are chosen based on their contact in the training period. Also, the overall results are reported for the best run on a given metric (i.e. the best run for delivery-ratio is used to show results for cost, latency, etc.). In this case, all use delivery-ratio as the primary metric.
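
In other words, the reporting works roughly like this (a Python sketch; the numbers and field names are invented for illustration):

    runs = [   # one entry per threshold setting; values are made up
        {"threshold": "mean",   "delivery_ratio": 0.41, "cost": 120, "latency": 3600},
        {"threshold": "median", "delivery_ratio": 0.38, "cost":  90, "latency": 3100},
        {"threshold": "80pc",   "delivery_ratio": 0.45, "cost": 150, "latency": 4000},
        {"threshold": "20pc",   "delivery_ratio": 0.33, "cost":  70, "latency": 2900},
    ]

    best = max(runs, key=lambda r: r["delivery_ratio"])      # delivery ratio is the primary metric
    print(best["threshold"], best["cost"], best["latency"])  # cost/latency come from that same run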

Studivz

Social Sensing

MIT

Hypertext2009

Enron

Enron still shows a loss for BubbleH with HGCE. Detailed plot compared to previous results:

Delivery Ratio for Enron, BubbleH vs Bubble - old results vs new results - no great change

    Still working on coding up an alternative to unlimited flood which allows us to accurately measure delivery ratio for selected pairs.

    Cogs

November 30th, 2011

I have been thinking a lot recently about why we might be doing this research into delay tolerant networking, and what possible applications of the research are, and whilst waiting for a simulation to run, I thought I would write up one idea I came up with, which I call ‘Cogs’.

    The idea is probably a little too over the top for reality.

‘Cogs’ (as in cognition) are little packets of thought that encompass the idea of something: either a feeling/mood, a specific piece of information, or a directed message. These ‘cogs’ are shared between people who meet, and are spread around an area by people, and depending on the ‘cog’, are received by like-minded people.

The idea being that knowledge about what is going on around people is passed around, giving other people the opportunity to join in, get a feel for the mood in the area, and perhaps actively seek out places based on the cogs emanating from them. Some examples (a rough sketch of what a cog might carry follows these):

• I’m walking through Temple Bar and I come across a really good band playing in the open, who I enjoy listening to, and I think others will too. I emit a ‘Cog’ with a feeling (happy), location (place: Temple Bar, range: 500 meters), interests (music, band, awesome, [genre], [song title]), and likely lifetime (1 hour) attached to it. This general ‘cog’ gets broadcast to those nearby, who may also be enjoying the band; they in turn broadcast this to others, and so on, until the time runs out, or the message travels further than is useful for that message (e.g. 500m). Some users may actively subscribe to certain interests, places, or people (senders), others may just browse the incoming stream of cogs for interesting things.
• I remember something I have to tell someone, but I’m not in a rush to get it there, so I emit a direct ‘cog’ with a message attached, and make it private with my key and the key of the recipient(s). This cog, rather than being broadcast to everyone, is directed through only those people I meet that are on the shortest path (in time) to the recipient.
• I think of a witty joke and decide to leave it in the form of some digital graffiti in my favourite pub. I emit a cog with a fixed location and a very long time-out, and emit it to those people nearby who are willing to accept such items (fixed and long term), perhaps the bar staff and regular patrons. This cog is then transmitted only to those people nearby, and is not carried any great distance away.
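
A rough sketch of what a ‘cog’ might carry and how a device might decide whether to keep passing it on, based on the examples above (all field names, the distance helper and the expiry rule are hypothetical):

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Cog:
        kind: str                      # 'broadcast', 'direct' or 'fixed'
        payload: str                   # a mood, a message, a joke, ...
        interests: set = field(default_factory=set)
        origin: tuple = (0.0, 0.0)     # where it was emitted
        range_m: float = 500.0         # how far from the origin it stays useful
        expires_at: float = 0.0        # absolute time at which it dies

    def still_alive(cog, here, now, distance_m):
        """Keep carrying/re-broadcasting only while the cog is in range and unexpired."""
        return now < cog.expires_at and distance_m(cog.origin, here) <= cog.range_m

    band_cog = Cog(kind="broadcast", payload="great band playing",
                   interests={"music", "awesome"}, origin=(53.345, -6.263),
                   range_m=500, expires_at=time.time() + 3600)   # lives for one hour
    print(still_alive(band_cog, here=(53.345, -6.263), now=time.time(),
                      distance_m=lambda a, b: 0.0))              # True: still within the hour
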
    Categories: Ideas, Thoughts, What Matt Says

    Selected Pairs evaluation

November 23rd, 2011

Pádraig suggested that we use only those nodes that have communicated in the training phase as the message pairs for the testing phase. To this end, I implemented this as follows:

Given the entire edge list for the training period, take each edge and create a pair from the source and destination node. Do not create a pair if the pairing already exists (in either direction). At simulation startup for the test period, send messages between only those pairs, in both directions, so that messages are sent A->B and B->A. This reduces the number of messages sent, and it can be argued that only nodes that already know each other will want to communicate in future. This will give us a much better picture.
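
A sketch of that pair construction (Python; variable names are illustrative):

    def build_message_pairs(training_edges):
        """training_edges: iterable of (src, dst). Returns the list of messages to send."""
        pairs = set()
        for src, dst in training_edges:
            if src != dst:
                pairs.add(frozenset((src, dst)))     # A->B and B->A collapse to one pair
        messages = []
        for pair in pairs:
            a, b = sorted(pair)
            messages.extend([(a, b), (b, a)])        # send in both directions
        return messages

    print(build_message_pairs([("A", "B"), ("B", "A"), ("B", "C")]))
    # e.g. [('A', 'B'), ('B', 'A'), ('B', 'C'), ('C', 'B')] (pair order may vary)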

    Enron-Split-A

Enron Split - A results

    Prophet and Hold&Wait

    Hypertext2009

Hypertext2009-split results

    Prophet and Hold&Wait

    Studivz-split

Studivz-split results

    Social Sensing Split

Social Sensing Split results

    Very strange for Undelivered hops, but these charts will change when they use ONLY the best run for a given metric, not the best of any.

    Enron Evaluation

November 17th, 2011

    We took a limited period in the Enron dataset, and split it into two parts, a training phase, and a testing phase:

Activity in the Enron dataset, hourly, daily, weekly, showing possible periods that could be used for training/testing

    In this case, the first period  shown in the above plot was used.

For HGCE the community structure is shown below:

Community Structure created by HGCE on Enron-split-a

    Results of Routing are not good for HGCE:

Results for Enron-split-a dataset

Variation in parameters (threshold) for delivery ratio on HGCE is shown below:

    Benchmark results are shown below:

In these, Unlimited Flood achieves nearly a 50% delivery ratio.

    Categories: Datasets, experiments

    Studivz Evaluation

November 14th, 2011

    We explored a small portion of the Studivz dataset; around 4 months between Oct 2007 and Feb 2008 (illustrated below).

Studivz Weekly and Daily Activity, showing limited periods used, and message peaks (new year)

Because the dataset is so large, it was not practical to use all of the 28,000 nodes for testing, so we took a small number of nodes (10) and travelled a number of hops (2) out into their connected peers in a connected time graph, and used only this set of nodes to run our simulation. In this instance, we used 385 nodes. We again used MEAN, MEDIAN, 80th and 20th percentiles of the connected time ratios (described here) as thresholds on edge weights for community finding using HGCE, KCLIQUE and LinkClustering. (Moses, Random and InfoMap are not thresholded).

    Values used:

    • Mean: 0.0001171252,
    • Median: 5.66346e-05,
    • 80th Percentile: 0.0001579604,
    • 20th Percentile: 1.1558e-06
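
The node-selection step described above amounts to a two-hop walk out from the seed nodes; a rough sketch (the graph format and seeds are illustrative):

    from collections import deque

    def select_nodes(graph, seeds, hops=2):
        """graph: {node: set of neighbours}. Returns the seeds plus everything within `hops`."""
        selected = set(seeds)
        frontier = deque((s, 0) for s in seeds)
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for neighbour in graph.get(node, ()):
                if neighbour not in selected:
                    selected.add(neighbour)
                    frontier.append((neighbour, depth + 1))
        return selected

    graph = {1: {2, 3}, 2: {4}, 3: set(), 4: {5}}
    print(select_nodes(graph, seeds=[1], hops=2))   # {1, 2, 3, 4}: node 5 is three hops away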

For InfoMap, clustering this reduced node set resulted in the graph below:

InfoMap Clustering of the reduced node set in Studivz. All edges shown.

    KCLIQUE performed quite poorly on this for 80Pc, as there are some clear structures that it did not find.

KCLIQUE 80PC Threshold on Studivz Limited period/nodes

It performed slightly better with 20th percentile thresholding, but again a number of clear structures were missed.

KCLIQUE 20PC Threshold on Studivz Limited period/nodes

    LinkClustering for 80pc is shown below:

LinkClustering studivz dataset - 80pc

    HGCE finds some structure to the network, and successfully finds hierarchy:

HGCE Community Structure for Studivz. Each node represents a community.

    Result of Routing:

Studivz Dataset results of routing

    For Prophet UF and H&W the results are below:

Studivz Delivery Ratio for Unlimited Flood, Hold and Wait, and Prophet

It seems that even Unlimited Flood can only penetrate to a 20% delivery ratio.

    Categories: Datasets, experiments

    Hypertext2009 Evaluation

November 14th, 2011

Finally finished running the experiments on the Hypertext2009 dataset. This used all available nodes, but was split into two parts. The dataset is around 3 days long, so I used day 1 to “train” the community finding algorithms, i.e. used the edge list from day 1 to create the communities. Then, the testing phase took place on days 2 and 3 (i.e. the communities from day 1 were used for routing in days 2 and 3).
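
The split itself is just a cut at a timestamp; a trivial sketch (the contact format and cut-off are illustrative):

    def split_trace(contacts, cutoff):
        """contacts: list of (timestamp, node_a, node_b). Returns (training, testing)."""
        training = [c for c in contacts if c[0] < cutoff]
        testing = [c for c in contacts if c[0] >= cutoff]
        return training, testing

    contacts = [(10, "A", "B"), (90, "B", "C"), (200, "A", "C")]
    train, test = split_trace(contacts, cutoff=100)   # e.g. the end of day 1
    print(len(train), len(test))                      # 2 1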

In the training phase, I used 4 threshold values for HGCE, KCLIQUE and LinkClustering (Ahn et al.). InfoMap, Random and Moses do not have thresholding applied. Those values relate to the MEAN, MEDIAN, 20th Percentile and 80th Percentile of the connected time ratio, i.e. ordering all edges in ascending order, pick the edge at the 20th (or 80th) percentile and use its value as the threshold value (see http://mattstabeler.co.uk/phdblog/datasets/hypertext2009 for plot).

Values used were: MEAN: 0.0035760146, MEDIAN: 0.0007309433, 80th PC: g, 20th PC: 0.0003654716

The visualisation below shows the clustering based on InfoMap, which is non-overlapping. Edges are weighted by connected time ratio, and are removed where this is lower than the 80th Percentile. Edge thickness indicates connected time, and node size indicates betweenness centrality.

InfoMap Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

This visualisation shows clustering using the KCLIQUE algorithm; this is overlapping, so some nodes will have multiple community assignments.

KCLIQUE Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

When we run the routing algorithms BubbleRAP, BubbleH, Prophet (best in class), Unlimited Flood (best possible) and Hold and Wait (baseline), we get the following results:

Best results for each metric, for each community finding algorithm. BubbleRAP vs BubbleH

    and for the rest:

Delivery Ratio for Unlimited Flood, Prophet and Hold and Wait on Hypertext 2009 - split

The community structure found by HGCE was very flat, and in fact no hierarchical structure was found at any threshold value apart from the 80th percentile threshold, where there is one sub-community.

HGCE Community hierarchy produced by HGCE for BubbleH

    Categories: Datasets, experiments

    Calculating Edge List Thresholds

November 3rd, 2011

    I had an idea to use some standard way to pick the basic threshold parameters for datasets, and it struck me that all we are doing is finding the elbow in the distribution of connected times for each edge.

    So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

sort($values);  // edge weights (connected time), lowest to highest
$pareto = $values[(int) floor((count($values) / 100) * 80)];  // value at the 80th percentile

The code above takes an array of all edge weights (connected time), sorted lowest to highest, and picks out the value at the 80th percentile, i.e. out of 200 values, it would pick the value at position 160 as the threshold value.
I compared the results against the MEAN and the MEDIAN, and found that the 80th Percentile worked well for Studivz and Hypertext2009.
    So I propose that we use this mechanism to pick the threshold values for all datasets.
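
For completeness, here is a small Python equivalent that computes all four candidate thresholds from a list of edge weights (the function names and toy values are mine; the percentile indexing mirrors the PHP line above):

    from statistics import mean, median

    def percentile(values, pct):
        ordered = sorted(values)
        return ordered[int(len(ordered) * pct / 100)]   # same floor-style index as the PHP line

    def candidate_thresholds(edge_weights):
        return {
            "mean": mean(edge_weights),
            "median": median(edge_weights),
            "80pc": percentile(edge_weights, 80),
            "20pc": percentile(edge_weights, 20),
        }

    weights = [0.1, 0.2, 0.2, 0.3, 0.9]    # toy connected-time ratios
    print(candidate_thresholds(weights))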