### Archive

Archive for the ‘what i’ve been doing’ Category

Aaron (McDaid) and I had an interesting conversation about what we mean by message routing.

We first discussed what it meant to have a single message, or a number of message copies, and termed ‘ink and paper’ to be a ‘cost’ associated with making a copy of a message and passing it to another node, rather than just calling it ‘cost’.

We decided that in my simulations, ‘cost’ did not exactly mean this, as even if a node send 1 message to a node it met, and kept a copy of it in it’s archive, the ‘cost’ would be exactly the same as if it discarded the message.
Whereas in our ink and paper cost scenario, there would be an additional 1 ‘ink and paper cost’ for every ‘copy’ of the message sent. Unless, the original message was passed on.

I mentioned some ideas about how to estimate global properties from a local viewpoint (i.e. graham’s periodicity paper combined with vector clocks).

We then went on to determine what ‘periodic’ meant, in terms of future prediction of events, Aaron was concerned that in fact I meant ‘often’ as opposed to ‘regular’, as often implies that some event is repeated, but not necessarily at a regular interval. Regular implies that the event is repeated with the same interval. (e.g. we meed every thursday), he suggested that I be very careful when using the work ‘periodic’.

He also expressed the concern over my liberal use of the term ‘dynamic networks’ and he suggested that in fact the network may better be describes as a temporal network, as edges are made and broken over time.

We then spent some time discussing a mechanism to determine optimum routes through a series of interactions in a temporal network of people. (I am pretty sure I have read a paper that does something similar to what follows).

Aaron suggested we model contact events as nodes, so we worked through this idea:
We have 5 Nodes A to E.
The first event at time= 0 is where each node meets itself.
A new node and is created when two nodes meet, an edge between each node is created, and an edge between a previous event and the new event is created if at least one of the nodes was in the previous meeting.
Edges are labelled by the amount of time since the last meeting.

At the end of this period, we are now able to determine the best ‘routes’ for sending information between nodes. There are two optimal routes, the time optimal route, and the hop optimal route. Information about A received by C can travel from A to B to C to D, 3 hops and a cost of 2 + 6 + 4 = 12, or between A and C directly, 1 hop at a cost of 13.

In the case of C to D, there are two routes, C to B to D (cost 12), or C to D (cost 12), so how do we decide which to use. Well, we can assume that the best route, is one that minimises hops primarily, and time secondarily. If we add a small epsilon value to the time, e.g. 0.00001,

This gives us:

C to B to D = 2 hops = 8.00001 + 4.00001  = 12.00002

vs

C to D  = 1 hop = 12.00001

So the second, shorter (hops) route takes precedence, as the total time is less. So we have chosen the shortest number of hops when the time period would normally have been equal. The reckoning is that it’s better to ‘copy’  (ink and paper) fewer times, when the total time is the same.

The only case when there is conflict, is when there are an equal number of hops, and the information is delivered after the same time peroid. In this case we might adopt a few strategies:

1. Keep a record of all most optimum paths
2. Record all paths, with a rank of efficiency
3. Choose only paths with the most recent event
4. Choose only paths with the oldest event
5. Choose a path at random from the optimum paths

If we now wished to predict routes, Aaron suggested that an exhaustive mechanism is to predict all possible future connections, assuming existing connections are regular and periodic, calculate the best routes, and send messages accordingly. This would be quite intensive and perhaps infeasible.

Perhaps a better strategy is to use our knowledge of existing routes, and on meeting a node on the optimal path to the destination for a message we hold, pass the message, else keep it.

We didn’t get as far as solving the problem of how to predict the future connections, and therefore calculate the optimal path.

Categories:

## Distributed Community finding in Dynamic Networks

I Have been thinking about this as an extension to the work we have done with BubbleH and BUBBLE Rap. The biggest drawbacks of these is that they assume that community information is available to them. This is not necessarily realistic, as in the real-world, nodes would have to get this information from some where.

A solution to this is to create some sort of distributed community finding mechanism, that adapts to the ever changing dynamic network that is made up by the nodes doing the calculations.

In a previous post, I mentioned Vector Clocks and whilst they like a nice idea, the Vector Clock idea on it’s own doesn’t give us a routing scheme.

I think there is a way to use some of the concepts of the vector clock idea to inform a distributed community finding scheme, which estimates, for example, InfoMap routing, or, ideally, the hierarchical clustering of LinkClustering.

I think there is something in sharing edge infomation (i.e. edges and connected time) which will allow us to detect communities: perhaps using our assumptions about threshold values (Mean, Mediam, 80th percentile and 20th percentile), we can run lightweight algorithms at node level.

This can be either done by a node to calculate it’s own communities, or to calculate communities for other  nodes.

For example, if a node knows about itself and some other nodes communities, it can make decisions about routing. If it has a message for a node that it doesn’t know about, it assumes it belongs to a global community.

It may also be possible to estimate betweenness centrality based on degree over time (as Graham did), in which case we have a way to calculate global rank, and perhaps LocalRank. If we can find a way to detect community size (perhaps nodes advertise membership), then we can also determine BridgingCommunities.

I would have to be careful not to step on Graham’s toes though, as his Persistance Based Routing also relies on connectivity times between nodes, but as far as I am aware, it does not do any sort of clustering of nodes.

Categories:

## Selected Nodes – with fixed HGCE output

We identified a problem with the way HGCE was being used, for all other CFAs the community structure used was identical apart from the global parent used by BubbleH, but HGCE outputs a different structure every run. I adapted my code to make sure that BubbleRAP and BubbleH make use of the SAME community structure. In the results below, selected pairs of nodes are chosen based on their contact in the training period. Also, the overall results are reported for the best value for a given metric (i.e. the best run for delivery-ratio is used to show results for cost, latency, etc.). In this case, all use delivery-ratio as the primary metric.

Studivz

Social Sensing

MIT

Hypertext2009

Enron

Enron still shows a lose for BubbleH for HGCE. Detailed plot compared to previous results:

Delivery Ratio for Enron, BubbleH vs Bubble - old results vs new results - no great change

Still working on coding up an alternative to unlimited flood which allows us to accurately measure delivery ratio for selected pairs.

Categories:

## Cogs

I have been thinking alot recently about why we might be doing this research into delay tolerant networking, and what possible applications of the research are, and whilst waiting for a simulation to run, I thought I would write up one idea I came up with, which I call – Cogs.

The idea is probably a little too over the top for reality.

‘Cogs’ (as in cognition) are little packets of thought, that encompass the idea of something, either a feeling/mood, a specific piece of information, or a directed message. These ‘cogs’ are shared between people who meet, and are spread around an area by people,  and depending on the ‘cog’, are received by like minded people.

The idea being that knowledge about what is going on around people, is passed around, giving other people the opportunity to join in, get a feel for the mood in the area, and perhaps actively seek out places based on the cogs eminating from them. Some examples:

• I’m walking though temple bar, and a come across a really good band playing in the open, who I enjoy listening to, and I think others will to. I emit a ‘Cog’ with a feeling (happy), location (place: temple bar, range: 500meters),  interests (music,band,awesome,[genre],[song title]), and likely lifetime (1 hour) attached to it, this general ‘cog’ gets broadcast to those nearby, who may also be enjoying the band, they in turn broadcast this to others, and so on, until the time runs out, or the message travels further than is useful for that message (e.g. 500m). Some users may actively subscribe to certain interests, places, or people (senders), others may just browse the incoming stream of cogs for interesting things.
• I remember something I have to tell someone, but I’m not in a rush to get it there, so a emit a direct ‘cog’ with a message attached, and make it private with key, and the key of the recipient(s). This cog, rather than be broadcast to everyone, is directed through only those people I meet, that are on the shortest path (in time) to the recipient.
• I think of a witty joke, decide leave it in the form of some digital graffiti in my favourite pub, I emit a cog, with a fixed location, and a very long time-out, I emit it to those people nearby who are willing to accept such items (fixed and long term), perhaps the bar staff and regular patrons. This cog is then transmitted only to those people nearby, and is not carried any great distance away.
Categories:

## Selected Pairs evaluation

Pádraig suggested that we use only those nodes that have communicated in the training phase, as the message pairs  for the testing phase. To this end, I implemented this as follows:

Given the entire edge list for the training period, take each edge, and create a pair from the source and destination node. Do not create a pair of the pairing exists (either direction). At simulation startup for the test period, send messages between only those pairs, in both directions. So that messages are sent:  A->B and B->A. This reduces the number of messages sent, and it can be argued that only nodes that know each other already, will want to communicate in future. This will give us a much better picture.

Enron-Split-A

Enron Split - A results

Prophet and Hold&Wait

Hypertext2009

Hypertext2009-split results

Prophet and Hold&Wait

Studivz-split

Studivz-split results

Social Sensing Split

Social Sensing Split results

Very strange for Undelivered hops, but these charts will change when they use ONLY the best run for a given metric, not the best of any.

Categories:

## Enron Evaluation

We took a limited period in the Enron dataset, and split it into two parts, a training phase, and a testing phase:

Activity in the Enron dataset, hourly, daily, weekly, showing possible periods that could be used for training/testing

In this case, the first period  shown in the above plot was used.

For HGCE the comminity structure is shown below:

Community Structure created by HGCE on Enron-split-a

Results of Routing are not good for HGCE:

Results for Enron-split-a dataset

Variation in parameters (threshold) for delivery ratio, on HGCE are shown below:

Benchmark results are shown below:

In which Unlimited flood makes nearly 50% delivery ratio.

Categories:

## Studivz Evaluation

We explored a small portion of the Studivz dataset; around 4 months between Oct 2007 and Feb 2008 (illustrated below).

Studivz Weekly and Daily Activity, showing limited periods used, and message peaks (new year)

Because the dataset is so large, it was not practical to use all of the 28,000 nodes for testing, so, we took a small number of nodes (10) and travelled a number of hops (2) into their connected peers in a connected time graph, and used only this set of nodes to run out simulation. In this instance, we used 385 nodes. We again used MEAN,MEDIAN,80th and 20th percentiles in connected time ratios (described here) as thresholds on edge weights for Community finding using HGCE,KCLIQUE and LinkClustering. (Moses, Random and InfoMap are not thresholded).

Values used:

• Mean: 0.0001171252,
• Median: 5.66346e-05,
• 80th Percentile: 0.0001579604,
• 20th Percentile: 1.1558e-06

For InfoMap, this resulted in the graph below:

InfoMap Clustering of the reduced node set in Studivz. All edges shown.

KCLIQUE performed quite poorly on this for 80Pc, as there are some clear structures that it did not find.

KCLIQUE 80PC Threshold on Studivz Limited period/nodes

It performed slightly better on 20th percentile thresholding, but again a number of clear structures missed.

KCLIQUE 20PC Threshold on Studivz Limited period/nodes

LinkClustering for 80pc is shown below:

HGCE finds some structure to the network, and successfully finds hierarchy:

HGCE Community Strucutre for Studivz. Each node represents a community.

Result of Routing:

Studivz Dataset results of routing

For Prophet UF and H&W the results are below:

Studivz Delivery Ratio for unlimitedflood, hold and wait, and Prophet

It seems that even unlimited flood can only pentrate to 20% delivery ratio

Categories:

## Hypertext2009 Evaluation

Finally finished running the experiments on the Hypertext2009 dataset. This used all available nodes, but was split into two parts. The dataset is around 3 days long, so I used day 1 to “train” the community finding algorithms, i.e. used the edge list from day 1 to create the communities. Then, the testing phase took place on days 2 and 3. (i.e. the comminities from day 1, were used for routing in days 2 and 3).

In the training phase, I used 4 threshold values for the HGCE, KCLIQUE and LinkClustering (ahn et. al.). InfoMap, Random and Moses do not have thresholding applied. Those values relate to the MEAN, MEDIAN, 20th Percentile and 80th Percentile of the connected time ratio. i.e. ordering all edges in ascending order, pick the edge at the 20th percentile, and 80th percentile as use it’s value as the threshold value (see http://mattstabeler.co.uk/phdblog/datasets/hypertext2009 for plot).

Values used were: MEAN: 0.0035760146,MEDIAN: 0.0007309433,80th PC: g, 20th PC: 0.0003654716

The visualization below show the clustering based on InfoMap, which is non-overlapping. Edges are related to connected time ratio, and are removed where this is lower than the 80th Percentile. Edge thickness indicates connected time, and node size indicates betweeness centrality.

Without Numbers: InfoMap Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

Without Numbers: InfoMap Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

This visualisation shows clustering using the KCLIQUE algorithm, this is overallping, so some nodes will have multiple community assignments.

KCLIQUE Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

When we run the routing algoritms BubbleRAP, BubbleH, Prophet (best in class), Unlimited Flood (best possible) and Hold and wait (baseline), we get the following results:

Best results for each metric, for each community finding algorithm. BubbleRAP vs BubbleH

and for the rest:

Delivery Ratio for Unlimited Flood, Prophet and Hold and Wait on Hypertext 2009 - split

The community structure found by HGCE was very flat, and in fact found no hierarchical structure at any threshold value apart from for the 80th percentile threshold, where there is one sub community.

HGCE Community hierarchy produced by HGCE for BubbleH

Categories:

## Calculating Edge List Thresholds

November 3rd, 2011 1 comment

I had an idea to use some standard way to pick the basic threshold parameters for datasets, and it struck me that all we are doing is finding the elbow in the distribution of connected times for each edge.

So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

\$pareto = \$values[floor((count(\$values) / 100) * 80)];

The code above, takes an array of all edge weights (connected time), sorted lowest to highest, and picks out the value at the 80th percentile. i.e out of 200 values, it would pick the value at position 160 as the threshold value.
I compared the results for the MEAN average, and the MEDIAN average, and found that the 80th Percentile worked well for Studivz and Hypertext2009.
So I propose that we use this mechanism to pick the threshold values for all datasets.
Categories:

## Studivz

I took some time to explore the Studivz wall posts dataset, to see whether it would be useful to use.

The first step was to extract a sub-set of the data, as the entire dataset is a little large to run in LocationSim (some of the CFAs can’t quite handle 28k nodes), scaling it up to this many nodes is a work in progress (it might involve writing something more lightweight to do some of the raw processing).

The approach I have taken so far, is to pick N nodes randomly, and included their immediate neighbours. I do the L more times to get the nodes a depth of L hops away from the source.  Using 10 random nodes, with a depth of 2 yields a network of around 3049 nodes (~10% of all nodes).  When reduced to 5 seed nodes, we get ~1000 nodes (~4%).   Going the other way, 100 seed nodes, with a depth of 1 gives 14571 nodes covering ~50% of the network. These figures change depending on which nodes are selected at random initially. Two other paramters affect the results of this, the first is a threshold, where nodes with a connnected time less than this are not included, the second is the value used to seed the random number generator (if 0, then automatically choose a seed).

In the end I settled on three parameters in the table below – note that the number of nodes in the final set is highly subjective to the initially chosen nodes, so this is very random.

## Studivz Random Node Choices

N L # Nodes
3 2 213
4 2 914
10 2 3049

Interestingly, despite the source or seed nodes being picked at random, the entire graph is connected in all configurations, the graphic below shows the connected_time graph and InfoMap clusterings for N=3, L=2.

InfoMap clustering of Studivz dataset, where N=3 and L=2

This is a promising start, since there are distinct clusters of nodes, which we expected, as this is the concatenation of three egocentric networks, but also there are connections between each egocentric network, meaning there is a route to every other node. However, we can’t tell from this graph how often these contacts occur.

Looking at the whole dataset, we can get an idea about how active it is over time by measuring the number of connections in a given time period, below show the number of weekly connections for the entire dataset.

Weekly number of connections in the Studivz dataset

It shows that this social network seems to have become increasingly popular  over time, with a peak of just over 10,000 wall posts made in Jan 2007. If we were to pick a period to concentrate on, it should probably be from October 2006 onwards.

### Studivz N=3, L=2

Initial results for each metric are shown below:

Delivery Ratio for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Cost for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Latency for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Delivery ratio is very poor for all runs, to see what the maximum possible delivery ratio is, we can look at the results for flooding the network below:

Delivery Ratio plot of Unlimited Flood on Studivz 3 2 0 0

This achieves a delivery ratio of roughly 65 percent, so we have a bit of work to do to be able to match this!

### Studivz 4 2 0 0

When we add another nodes to the initial seed set, we get a step up in the total number of nodes, 914 to be exact, this is currently running through the simulator.

Studivz 4 2 0 0

UPDATE:

Below is the weekly activity during the set using 914 nodes (4,2,0,0)

Weekly activity in STUDIVZ 4,2,0,0

The results on the larger dataset are shown below, these runs were taking considerably longer, and highlighted a couple of minor bugs in the simulator (not closing files properly! which means that file not found, too many open files messages kept occurring).

Delivery Ratio, Cost, Latency, Average Delivered Hops and Average Undelivered Hops for STUDIVZ with 4 seed nodes and a depth of 2.

We see here that BubbleH is doing well in terms of delivery ratio compared to bubbleRAP , link clustering, which created a huge number of communities does particularly well  (at ~3o% for BubbleRAP and BubbleH), this adds weight to the idea that a large number of communities does well, and in fact, (in this case only, where there is only on set of parameters) we see that the Average cost is roughly the same as with the other CFAs.  BubbleH also performs well in terms of cost.  Latency very high for all CFAs as the dataset is very long.

Unlimited Flood and Prophet on STUDIVZ 4 2 0 0

However, we see from the Unlimited flood run, that we have a way to go to match the best possible delivery ratio, at around 90% delivery ration, it beats BubbleH hands down. Some consolation though, the advanced Prophet algorithm also only gets around 52% delivery ratio.

Categories: