Archive for the ‘Thoughts’ Category

Why does HGCE do well?

July 27th, 2012

I just realised that it's probably doing well because it's agglomerative, a bit like how the interaction networks form over time (node by node, from an ego-centric view), and also because it's robust to perturbations, as characterised by the network itself. This would make it a great choice for an algorithm to adapt for a distributed version, where each node learns its network over time and has to account for perturbations in the graph. Might discuss this in the conclusions.

Dynamic Dataset Visualisation for DTN Simulations

June 13th, 2012

I was thinking about a way to visualise the flow of messages around a DTN network during simulation runs.

At the moment, we (PC and I) visualise the contact networks we are working with using the aggregate graph of the period of interest: nodes are sized based on betweenness centrality, and edge thickness represents the edge's weight, based on the total connected time or connected time ratio between nodes. This network is laid out using some force-directed layout, usually Gephi's Force Atlas (1 or 2) or OpenOrd, plus a combination of techniques such as non-overlap of nodes and expansion/contraction. We sometimes remove low-weight edges for clarity. Often, we also colour the nodes based on their community assignment.

This works well, and with a bit of manipulation, results in nice enough looking graphs.

I would like to have a nice way to visualise the flow of messages around this network for a given routing algorithm. I imagine a visualisation much like the above, but with some adaptations that allow other information to be displayed:

  • The temporal connections of the edges – at some points in time, edges do not exist, when nodes are not in contact.
  • The data transmitted across edges – as time goes on, edges may be used to transmit messages
  • The data held by nodes at a given point in time – the buffers of nodes fill and empty over time, as data is transmitted

I would like to see a visualisation where the nodes are initially laid out in some appropriate manner, e.g. force-directed based on the aggregate graph, so that nodes can be tracked over time if they happen to move (based on some criteria). The nodes are sized based on betweenness centrality (or some other ranking/partition), along with some measure of their data storage, for example a dotted border, where each dot represents a message. Perhaps a shadowed version records the accumulated count of all messages the node has seen so far.

In this visualisation, edges are only shown when they are active, and fade (with some function) as they become inactive (perhaps proportional to their connection time). Perhaps they are still shown using their aggregate graph weight as a thickness, or perhaps this is the accumulated weight (perhaps showing the total aggregate as a shadow). Edges are decorated with information, such as dashes, each dash representing a message transmitted along the path. Each dash is coloured based on how recently it was transmitted, perhaps in a binary fashion, where recent (tunable) messages are coloured, say, red, and non-recent messages fade to blue as they reach a time limit. (Do we keep edges greyed out, but visible, when they are inactive?) Dashes are located at the end of the edge where the sending node is, to indicate the direction of the transmission – perhaps the series of dashes terminates with an arrow.
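
As a very rough sketch of the fading and recency colouring described above (the function shapes, windows and colour endpoints below are placeholder choices, nothing is decided):

// Edge opacity: 1.0 while active, fading towards 0 once the edge goes inactive,
// over a window proportional to how long the edge was last connected.
function edge_opacity($now, $last_active, $last_connected_duration, $min_fade = 10.0) {
  $window = max($min_fade, $last_connected_duration);
  $age = $now - $last_active;
  if ($age <= 0) return 1.0;                // edge is currently active
  return max(0.0, 1.0 - ($age / $window));  // linear fade; could be any function
}

// Dash colour: recent transmissions are red, older ones fade to blue until
// they reach the (tunable) time limit.
function dash_colour($now, $sent_at, $recent = 60.0, $limit = 600.0) {
  $age = $now - $sent_at;
  if ($age <= $recent) return array(255, 0, 0);  // recent: red
  $t = min(1.0, ($age - $recent) / ($limit - $recent));
  return array((int) round(255 * (1 - $t)), 0, (int) round(255 * $t));  // red fading to blue
}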

When edges become active, they are emphasised in some way, perhaps flashing or briefly pulsing. An alternative orthogonal emphasis is shown when a message is transmitted, so that it is obvious when an edge is activated/de-activated, when a message is transmitted, or when both occur at the same time (i.e. they are easily distinguishable).

In the case of both nodes and edges, dashes/dots represent messages, and should be displayed at a uniform size; where the number of dashes exceeds the length of the edge, or the circumference of the circle, they flow outwards from the line, so they appear stacked. In cases where the number of dashes exceeds the limits of the edge more than, say, 3 times, dashes are concatenated into an alternative symbol (e.g. a block) representing a larger number of messages, e.g. 10. Alternatively, a stacked buffer-style viz. can be displayed beside the edge/node.

Dynamic interactions:

When a given dash (representing a message) is selected, the path it has taken through the network is highlighted. Perhaps a side panel of messages, nodes and edges is shown, with sortable numerical statistics, so that important items are easily identifiable. These can be highlighted on the main viz. when selected (e.g. hovered over).

When the cursor is drawn across an edge, node, or series of edges, different modes cause different effects:

  • All Messages, All paths – the paths of any message traversing the edges are highlighted
    • Only most recent N messages across edge
    • Only messages within time period
  • All messages transmitted/received by this node
    • All messages originating/terminating at this node
    • Recent/N/Time period of Messages
The layout parameters can be tweaked during visualisation, perhaps switching between layouts based on the aggregate graph, accumulative graph and instantaneous graph.
Something enabling start/stop/pause/rewind/fast-forward/fast-reverse/repeat-period should also be implemented. It is also quite important to be able to drill down into the viz. to see interactions in small sections.

Data driving the viz:

The source data for the viz. could come in the form of three pieces of information:

  1. The list of aggregate edges between nodes (including weightings), and (maybe) a list of nodes – e.g. in some graph format which allows metadata for nodes and edges
  2. The list of contacts between nodes – start time, end time, node id, node id
  3. The list of movements of messages between nodes – message id, source node, target node, transmission start time, transmission end time (assuming there may be some duration of transmission) – and maybe including a list of all messages with source and destination (for message initialisation, a start time might be included).

It would be up to the viz. software to calculate the aggregate graphs, order the contacts, and arrange for messages to be transmitted at the correct times.
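
For example, a minimal sketch of how the viz. software might build the aggregate edge weights and the replay order from the contact list (the field names are placeholders for whatever format we settle on):

// $contacts: one row per contact, each with 'from', 'to', 'start', 'end'
$aggregate = array();  // undirected edge key => total connected time
foreach ($contacts as $c) {
  $key = min($c['from'], $c['to']) . '-' . max($c['from'], $c['to']);
  if (!isset($aggregate[$key])) $aggregate[$key] = 0;
  $aggregate[$key] += $c['end'] - $c['start'];
}
// order contacts by start time so they can be replayed
usort($contacts, function ($a, $b) {
  if ($a['start'] == $b['start']) return 0;
  return ($a['start'] < $b['start']) ? -1 : 1;
});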

 

 

Categories: Ideas, Thoughts, What Matt Says

Temporal Networks

May 31st, 2012

Pádraig and I discussed Time Respecting Paths in the datasets we have, based on the paper – Temporal Networks [1].

We want to detect the shortest/quickest paths between pairs of nodes which respect the temporal paths, so that we can derive a betweenness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network and calculates the betweenness centrality using the number of times a node is used in ANY path – the BubbleRAP implementation uses only the SHORTEST paths. So there is some discrepancy here. The BubbleRAP paper reads as follows:

To calculate the individual centrality value for each node, we take a numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the delivery with the shortest delay. We call this number the betweenness centrality of this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality [2]

 

The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly –  further inspection may yield a better understanding 🙂

There is a question over what observation period we should use, and how to decide which is really the 'shortest path' when paths may start/end at different times. In the BubbleRAP paper, they seem to have chosen the path with the least latency, without considering the start time of the path. In the temporal networks paper [1], the authors talk about a saw-tooth pattern which governs latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern – but not really for the betweenness centrality measure.

E.g. if we start at time 0 and measure the time it takes to get to the destination node, that is fine. However, what if there is an intermediate time, say 4, where, if we measure the path from there, we find a new shortest path? In BubbleRAP, this is taken as the shortest path, but is this correct?

I will implement the time-respecting, shortest path (latency) betweenness centrality (and maybe the shortest path by hops), as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.
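
As a starting point, here is a minimal sketch of the earliest-arrival (time-respecting) search I have in mind, run over a contact list: it only computes delivery times (latencies), not the betweenness counts, and it assumes a transfer within a contact is instantaneous.

// Earliest time each node can hold a message created at $source at time $t0,
// respecting contact start/end times. Iterates to a fixed point, so the
// contacts need not be ordered. Each contact has 'a', 'b', 'start', 'end'.
function earliest_arrival($contacts, $source, $t0) {
  $arrival = array($source => $t0);
  do {
    $changed = false;
    foreach ($contacts as $c) {
      foreach (array(array($c['a'], $c['b']), array($c['b'], $c['a'])) as $pair) {
        list($u, $v) = $pair;
        // $u can hand the message to $v if $u holds it before the contact ends
        if (isset($arrival[$u]) && $arrival[$u] <= $c['end']) {
          $handover = max($arrival[$u], $c['start']);
          if (!isset($arrival[$v]) || $handover < $arrival[$v]) {
            $arrival[$v] = $handover;
            $changed = true;
          }
        }
      }
    }
  } while ($changed);
  return $arrival;  // latency to node X is $arrival[X] - $t0
}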

[1] Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: http://arxiv.org/abs/1108.1780. Accessed May 15, 2012.

[2] Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: http://www.jstor.org/stable/10.2307/3033543. Accessed May 31, 2012.

Measuring Hierarchy

February 4th, 2012

Pádraig and I discussed a mechanism for measuring hierarchy, to give a given structure a metric. Pádraig’s suggestion was to measure the average distance to the root, and I came up with a normalised version that should also take into account the relative size of a node (in this case a community), as follows:

The root is always assumed to be a node with a weight of exactly the total size, e.g. for a community hierarchy, the size of a node represents the number of members of the community, and the root node contains all members of the network.

We measure the distance from the root to each community, and multiply the size of the community by the distance. We divide the sum of these values by the size of the root community. If the result is 1, the community structure is shallow; if the result is less than 1, the community structure does not contain all nodes; and the further the result is above 1, the more depth the structure has.

$depth_sum = 0;
$levels = array();  // number of communities at each level (used for the breadth metric below)
foreach($data as $commid=>$commdata){
  $level = $commdata['level'];
  if (!isset($levels[$level])) $levels[$level] = 0;
  $levels[$level] += 1;
  // distance from the root (level + 1), weighted by community size
  $depth_sum += (($level + 1) * count($commdata['members']));
}
$depth_metric = $depth_sum / count($nodes);  // divided by the size of the root (all nodes)

It still needs something to balance for breadth, so:

At each level of the dendrogram, we take the size of each node, multiply it by the number of nodes at this level, divide by the size of the root node, then multiply by the depth. The sum of these values, divided by the size of the root node, gives us the breadth metric.

$breadth_sum = 0;
foreach($data as $commid=>$commdata){
  // community size times the number of communities at its level,
  // normalised by the root size, then weighted by depth (level + 1)
  $breadth_sum += ($commdata['level'] + 1)
                * ($levels[$commdata['level']] * count($commdata['members']))
                / count($nodes);
}
$breadth_metric = $breadth_sum / count($nodes);

Average Overlap is simply the sum of the number of communities each node is a member of divided by the number of communities.
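
In code, using the same structures as the snippets above, and implementing it exactly as defined (I should double-check whether the denominator ought to be the number of nodes instead):

// total community memberships, summed over communities (equivalently, over nodes)
$total_memberships = 0;
foreach ($data as $commid => $commdata) {
  $total_memberships += count($commdata['members']);
}
$average_overlap = $total_memberships / count($data);  // divided by the number of communities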

 

 

It will be interesting to calculate these metrics for each dataset and each threshold value; we can plot the delivery ratio (and other metrics) in a scatter plot against breadth and depth, and see if there is a trend.

Ideas about message routing

December 14th, 2011

Aaron (McDaid) and I had an interesting conversation about what we mean by message routing.

We first discussed what it means to have a single message, or a number of message copies, and coined 'ink and paper' as the 'cost' associated with making a copy of a message and passing it to another node, rather than just calling it 'cost'.

We decided that in my simulations, 'cost' did not exactly mean this, as even if a node sent 1 message to a node it met, and kept a copy of it in its archive, the 'cost' would be exactly the same as if it had discarded the message.
Whereas in our ink and paper scenario, there would be an additional 1 'ink and paper cost' for every 'copy' of the message sent, unless the original message was passed on.

I mentioned some ideas about how to estimate global properties from a local viewpoint (i.e. Graham's periodicity paper combined with vector clocks).

We then went on to determine what 'periodic' means in terms of future prediction of events. Aaron was concerned that in fact I meant 'often' as opposed to 'regular': often implies that some event is repeated, but not necessarily at a regular interval, whereas regular implies that the event is repeated with the same interval (e.g. we meet every Thursday). He suggested that I be very careful when using the word 'periodic'.

He also expressed concern over my liberal use of the term 'dynamic networks', and suggested that the network may be better described as a temporal network, as edges are made and broken over time.

We then spent some time discussing a mechanism to determine optimum routes through a series of interactions in a temporal network of people. (I am pretty sure I have read a paper that does something similar to what follows).

Aaron suggested we model contact events as nodes, so we worked through this idea:
We have 5 nodes, A to E.
The first event, at time = 0, is where each node meets itself.
A new event node is created when two nodes meet; an edge is created between the new event and the previous event of each of the two nodes (i.e. an edge between a previous event and the new event exists if at least one of the nodes was in that previous meeting).
Edges are labelled by the amount of time since the last meeting.
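
A rough sketch of building this event graph from a time-ordered list of meetings (the structures and field names are placeholders):

// $meetings: list of arrays with 'time', 'a', 'b', in time order.
$events = array();      // event id => array('time' => ..., 'members' => ...)
$edges = array();       // list of array(from_event, to_event, time_since_last_meeting)
$last_event = array();  // physical node => id of the last event it took part in
$next_id = 0;

foreach ($nodes as $n) {  // the initial self-meeting events at time 0
  $events[$next_id] = array('time' => 0, 'members' => array($n));
  $last_event[$n] = $next_id;
  $next_id++;
}

foreach ($meetings as $m) {
  $id = $next_id++;
  $events[$id] = array('time' => $m['time'], 'members' => array($m['a'], $m['b']));
  foreach (array($m['a'], $m['b']) as $n) {
    // link from the previous event this node took part in, labelled with the gap
    $prev = $last_event[$n];
    $edges[] = array($prev, $id, $m['time'] - $events[$prev]['time']);
    $last_event[$n] = $id;
  }
}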

[Figure: the network of contact events for nodes A to E]

At the end of this period, we are now able to determine the best ‘routes’ for sending information between nodes. There are two optimal routes, the time optimal route, and the hop optimal route. Information about A received by C can travel from A to B to C to D, 3 hops and a cost of 2 + 6 + 4 = 12, or between A and C directly, 1 hop at a cost of 13.

In the case of C to D, there are two routes, C to B to D (cost 12), or C to D directly (cost 12), so how do we decide which to use? Well, we can assume that the best route is the one that minimises hops primarily, and time secondarily. If we add a small epsilon value to the time of each hop, e.g. 0.00001, this gives us:

C to B to D = 2 hops = 8.00001 + 4.00001  = 12.00002

vs

C to D  = 1 hop = 12.00001

So the second, shorter (in hops) route takes precedence, as the total time is less; we have chosen the route with the fewest hops when the time would otherwise have been equal. The reckoning is that it's better to 'copy' (ink and paper) fewer times when the total time is the same.

The only case where there is a conflict is when there are an equal number of hops and the information is delivered after the same time period. In this case we might adopt a few strategies:

  1. Keep a record of all most optimum paths
  2. Record all paths, with a rank of efficiency
  3. Choose only paths with the most recent event
  4. Choose only paths with the oldest event
  5. Choose a path at random from the optimum paths

If we now wished to predict routes, Aaron suggested that an exhaustive mechanism is to predict all possible future connections, assuming existing connections are regular and periodic, calculate the best routes, and send messages accordingly. This would be quite intensive and perhaps infeasible.

Perhaps a better strategy is to use our knowledge of existing routes: on meeting a node that is on the optimal path to the destination of a message we hold, we pass the message on; otherwise we keep it.
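
In code terms, that rule would be something like the following (assuming each message carries its destination and each node keeps a table of its best known paths):

// $best_path[$dest] is the list of nodes on the best known route to $dest.
// On meeting $encountered, hand over any message whose best path goes through it.
function should_forward($message, $encountered, $best_path) {
  $dest = $message['destination'];
  if ($encountered == $dest) return true;
  return isset($best_path[$dest]) && in_array($encountered, $best_path[$dest]);
}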

We didn’t get as far as solving the problem of how to predict the future connections, and therefore calculate the optimal path.

Distributed Community finding in Dynamic Networks

December 5th, 2011

I have been thinking about this as an extension to the work we have done with BubbleH and BubbleRAP. The biggest drawback of these is that they assume that community information is available to them. This is not necessarily realistic, as in the real world, nodes would have to get this information from somewhere.

A solution to this is to create some sort of distributed community finding mechanism that adapts to the ever-changing dynamic network made up by the nodes doing the calculations.

In a previous post, I mentioned Vector Clocks, and whilst they seem like a nice idea, the Vector Clock idea on its own doesn't give us a routing scheme.

I think there is a way to use some of the concepts of the vector clock idea to inform a distributed community finding scheme, which estimates, for example, InfoMap routing, or, ideally, the hierarchical clustering of LinkClustering.

I think there is something in sharing edge information (i.e. edges and connected time) which will allow us to detect communities: perhaps using our assumptions about threshold values (mean, median, 80th percentile and 20th percentile), we can run lightweight algorithms at node level.
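
Something like the following could run at node level, over whatever connected-time information a node has observed or been sent so far (only a sketch; how that information is gathered and shared is the open question):

// $connected_times: connected time per known edge, as observed or shared so far
function local_thresholds($connected_times) {
  $n = count($connected_times);
  if ($n == 0) return null;
  sort($connected_times);
  return array(
    'mean'   => array_sum($connected_times) / $n,
    'median' => $connected_times[(int) floor($n / 2)],
    'p20'    => $connected_times[(int) floor($n * 0.2)],
    'p80'    => $connected_times[(int) floor($n * 0.8)],
  );
}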

This can either be done by a node to calculate its own communities, or to calculate communities for other nodes.

For example, if a node knows about its own communities and those of some other nodes, it can make decisions about routing. If it has a message for a node that it doesn't know about, it assumes that node belongs to a global community.

It may also be possible to estimate betweenness centrality based on degree over time (as Graham did), in which case we have a way to calculate global rank, and perhaps LocalRank. If we can find a way to detect community size (perhaps nodes advertise membership), then we can also determine BridgingCommunities.

I would have to be careful not to step on Graham's toes though, as his Persistence Based Routing also relies on connectivity times between nodes, but as far as I am aware, it does not do any sort of clustering of nodes.

Categories: Ideas, Thoughts

Cogs

November 30th, 2011

I have been thinking a lot recently about why we might be doing this research into delay tolerant networking, and what the possible applications of the research are, and whilst waiting for a simulation to run, I thought I would write up one idea I came up with, which I call Cogs.

The idea is probably a little too over the top for reality.

'Cogs' (as in cognition) are little packets of thought that encompass the idea of something: either a feeling/mood, a specific piece of information, or a directed message. These 'cogs' are shared between people who meet, are spread around an area by people, and, depending on the 'cog', are received by like-minded people.

The idea is that knowledge about what is going on around people is passed around, giving other people the opportunity to join in, get a feel for the mood in the area, and perhaps actively seek out places based on the cogs emanating from them. Some examples:

  • I'm walking through Temple Bar, and I come across a really good band playing in the open, who I enjoy listening to, and I think others will too. I emit a 'Cog' with a feeling (happy), location (place: Temple Bar, range: 500 metres), interests (music, band, awesome, [genre], [song title]), and likely lifetime (1 hour) attached to it. This general 'cog' gets broadcast to those nearby, who may also be enjoying the band; they in turn broadcast it to others, and so on, until the time runs out, or the message travels further than is useful for that message (e.g. 500m). Some users may actively subscribe to certain interests, places, or people (senders); others may just browse the incoming stream of cogs for interesting things. (A sketch of what such a cog might look like follows this list.)
  • I remember something I have to tell someone, but I'm not in a rush to get it there, so I emit a directed 'cog' with a message attached, and make it private with my key and the key of the recipient(s). This cog, rather than being broadcast to everyone, is passed only through those people I meet who are on the shortest path (in time) to the recipient.
  • I think of a witty joke and decide to leave it in the form of some digital graffiti in my favourite pub. I emit a cog with a fixed location and a very long time-out, to those people nearby who are willing to accept such items (fixed and long-term), perhaps the bar staff and regular patrons. This cog is then transmitted only to people nearby, and is not carried any great distance away.
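
As a sketch, the first example above might look something like this as a data structure (every field and value is illustrative only):

$cog = array(
  'type'      => 'general',  // general broadcast, rather than directed or fixed-location
  'feeling'   => 'happy',
  'location'  => array('place' => 'Temple Bar', 'range_m' => 500),
  'interests' => array('music', 'band', 'awesome', '[genre]', '[song title]'),
  'lifetime'  => 3600,       // seconds before the cog expires
  'created'   => time(),
);
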
Categories: Ideas, Thoughts, What Matt Says

Calculating Edge List Thresholds

November 3rd, 2011

I had an idea to use some standard way to pick the basic threshold parameters for datasets, and it struck me that all we are doing is finding the elbow in the distribution of connected times for each edge.

So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

// pick the value at the 80th percentile of the (ascending sorted) edge weights
$pareto = $values[(int) floor((count($values) / 100) * 80)];

The code above takes an array of all edge weights (connected time), sorted lowest to highest, and picks out the value at the 80th percentile, i.e. out of 200 values, it would pick the value at position 160 as the threshold value.
I compared the results against the MEAN and the MEDIAN, and found that the 80th percentile worked well for Studivz and Hypertext2009.
So I propose that we use this mechanism to pick the threshold values for all datasets.

Odd results – an update

July 13th, 2011

Just before Pádraig went away, we realised that the results we were getting for Bubble and BubbleH were exactly the same in some cases. We also wanted to see whether hierarchy was making any difference to results, and so we set these goals for the period whilst Pádraig was away:

  • work out why bubbleH and bubble are the same in some cases
  • get enron and wall posts dataset
  • pick the most hierarchical looking clustering  in these
  • see whether BubbleH works better than BubbleRap
  • (concentrate on HGCE and KCLIQUE)

The first question: Is Bubble (no global parent) always the same result as BubbleH (global parent)?

Dataset   InfoMap   KCLIQUE   HGCE   LinkClustering
MIT-NOV   Yes       No        No     No
CAMB      Yes       Yes       Yes    No
IC05      Yes       Yes       No     No
IC06      Yes       Yes       No     No
SS        Yes       Yes       No     No *
Enron     Yes       No        No     No
Studivz   ?         ?         ?      ?

* In the case of the Social Sensing Study LinkClustering, all are the same for delivery ratio, apart from where the threshold is 0.0. (With Social Sensing we used multiple tuning values, hence multiple results; the others used only one set of parameters.)

Answer: Not always.

Is the answer overlap?
I think that these results are down to the structure of the communities. InfoMap ALWAYS produces the same results for BubbleH and BubbleRAP. I wonder if this is down to the fact that InfoMap partitions the network, and therefore there are no overlaps? Could it be that, for the most part, KCLIQUE creates non-overlapping communities, hence the results? HGCE creates a large number of highly overlapping communities, which are also hierarchical. LinkClustering also creates a large number of communities, and whilst edges cannot overlap, nodes can belong to multiple communities. Or is it really that the inherent hierarchy in the community structure is causing the results to differ?

The question is also: why do BubbleH and BubbleRAP get EXACTLY the same results when there is no overlap? Well, this is because, in the end, there is no difference between them when there is no complicated structure to the network. Even if BubbleH is using the global parent community, that is EXACTLY the same as using the global ranking in BubbleRAP: when there is no overlap, each node belongs to EXACTLY one community, and has a local rank in both BubbleH and BubbleRAP, and the global parent in BubbleH plays the same role as the global rank in BubbleRAP. In fact, we could re-write BubbleH to incorporate this explicitly, and do away with a forced global parent, but this is just implementation detail; the end results would be the same.
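
Roughly, the BubbleRAP forwarding decision (as I read the paper – so this is only a sketch, not our simulator code) is the following; with exactly one community per node plus a global parent, BubbleH has no extra hierarchy to walk, so it ends up making the same choice:

// $community[$x] is the member list of the (single) community node $x belongs to;
// $local_rank and $global_rank are the usual BubbleRAP rankings.
function bubble_rap_forward($carrier, $met, $dest, $community, $local_rank, $global_rank) {
  $met_in_dest     = in_array($dest, $community[$met]);
  $carrier_in_dest = in_array($dest, $community[$carrier]);
  if ($met_in_dest) {
    // inside the destination's community: bubble up by local rank
    return !$carrier_in_dest || $local_rank[$met] > $local_rank[$carrier];
  }
  // outside the destination's community: bubble up by global rank
  return !$carrier_in_dest && $global_rank[$met] > $global_rank[$carrier];
}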

Second part: get the Enron and wall post datasets

I used Conrad's version of the Enron dataset, as he had taken the time to remove irregularities, and in the case of future papers, he would have an in-depth knowledge of how he processed the data, saving me a lot of time!

The connected time graph is below, showing a decent number of clusters, hopefully with some nice hierarchy!

Connected Time Graph of the Enron dataset.

I explored this dataset in the same way as the previous ones, by experimenting with settings for the different algorithms. InfoMap is easy to visualise, and so below is the InfoMap clustering of the dataset:

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

I also used Conrad's Studivz wall post dataset (see here). This dataset is huge, and so I haven't yet worked out how to run the full set of simulations. I was able to create a connected time edgelist (connected time is based on wall post length, 1000ms per character). Below is the graph of all connections, with node size and colour related to degree, and edges removed for clarity.

[Figures: Studivz connected time graph with node size and colour by degree; Studivz connected time graph close-up with edges]

Enron Dataset

In order to get any clusters at all out of KCLIQUE and LinkClustering, I had to plug in some very small threshold values. This is probably due to the way in which I created the events (an event occurs when an email is sent, and the duration of every event is set to 2 seconds), so for the most part nodes are not connected, and therefore the overall connected times are small. (Conrad's dataset did not include the content of the messages, so I was not able to give contact events a length based on message size.) To simplify things, I used the interesting parameters from KCLIQUE and LinkClustering to inform the parameters for HGCE, specifically the –minCliqueEdgeWeight parameter (more info about HGCE parameters), which excludes edges based on weight, effectively thresholding the graph edges as with KCLIQUE and LinkClustering.

To recap, the threshold means that (in the case of KCLIQUE and LinkClustering and now HGCE) edges are removed where the connected time between individuals is lower than the threshold.

Threshold parameters used for the Enron dataset:

0.0, 0.0000001, 0.0000002, 0.0000003, 0.0000004, 0.0000005, 0.0000006, 0.0000007, 0.0000008, 0.0000009, 0.000001, 0.000002, 0.000003, 0.000004, 0.000005, 0.00001

The plot below shows the results for delivery ratio for BubbleRAP using no global parent (NGP) and BubbleH using a global parent (GP) (in future, we can safely ignore the global parent part, as BubbleRAP should always be run without a global parent and BubbleH with one).

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbleH beats BubbleRAP in all cases

This is a good result: it shows that in this dataset, for these parameters, BubbleH beats BubbleRAP, but now we need to consider why. I had done an earlier exploration of the Enron dataset with default values for M (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9), so I looked back at the results for that, and was surprised to see that BubbleRAP does much better in some cases. Below is a plot with the new data included (the solid red BubbleRAP line and the blue dot-dash BubbleH line show this data).

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

So, BubbleRAP does better in some situations, (PS. NEED TO CHECK THIS AGAIN TO MAKE SURE BUBBLEH IS PROPERLY PLOTTED).

I started to look at the next step, hoping it will give some answers: pick the most hierarchical looking clustering in [datasets].

I picked the best results for BubbleH, and mapped the communities to a hierarchical tree structure, shown below:

Community structure for best run of BubbleH on Enron dataset where M = 0.000001

So far so good: it seems to have a broad structure with some specific hierarchical clusters. I also mapped the worst run:

Community structure for worst run of BubbleH on Enron dataset where M = 0.0000007

This too has some good structure to it; note, however, that this is a plot of the worst performing run in a set of best parameters (we didn't run a whole range of values – so this is the worst of the best).

The next thing to show is the best BubbleRAP run, as below:

Community structure for best run of BubbleRAP on Enron dataset where M = 0.0000005

Interestingly, this has a broad set of high-level communities (as with the previous plots, but they had a global parent), but less breadth lower in the hierarchy (i.e. fewer lower communities).

TODO: plot the worst run of BubbleRAP

UPDATE: I found a better way to show the dendrograms of the community hierarchy, and have automated the process. This page shows all of the plots for Enron without a global parent, and this page WITH a global parent. To get a better idea of the relationship between hierarchical structure and results, I need to combine the results from runs with the structure in the same place, so we can compare side by side what happens when there are different types of community.

BubbleH and SMW11

April 19th, 2011

Our paper for SMW11 was accepted, and now we will be able to include the results from the bug-fixed version of BubbleH, which performs very well. An addition that Pádraig suggested was to ignore level information in the hierarchy, and simply use community size. This has the benefit of making the algorithm simpler while still incorporating hierarchical structure. I ran the two versions side by side and found that they perform almost identically, with ignoring levels being slightly better in some cases!
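
A sketch of the comparison I mean (not the simulator code): given the communities a carrier shares with a message's destination, the original version picks the deepest one by level, and the new version simply picks the smallest:

// $shared: communities (each with 'level' and 'members') containing both carrier and destination
function pick_community($shared, $use_levels) {
  usort($shared, function ($x, $y) use ($use_levels) {
    if ($use_levels) return $y['level'] - $x['level'];      // deepest level first
    return count($x['members']) - count($y['members']);     // smallest community first
  });
  return count($shared) > 0 ? $shared[0] : null;
}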

MIT-NOV Dataset, with levels included and not included for multiple parameters to H-GCE

This plot is very hard to read, but it is possible to see the similarities at a broad level. The best performing run was with H-GCE parameters of K=3, E=0.15, ST=0.9, MAP=0.9 and Z=2. Its structure seems to be relatively flat, with a large number of communities:


  0(90)
  |
   - 1(5)
  |
   - 2(5)
  |
   - 3(8)
  |
   - 4(8)
  |
   - 5(8)
  |
   - 6(4)
  |
   - 7(28)
  |
   - 8(27)
  |
   - 9(22)
  |
   - 10(17)
  |
   - 11(13)
  |
   - 12(20)
  |
   - 13(16)
  |
   - 14(26)
  |  |
  |   - 15(12)
  |
   - 16(7)
  |
   - 17(4)
  |
   - 18(4)
  |
   - 19(3)
  |
   - 20(8)
  |
   - 21(8)
  |
   - 22(3)
  |
   - 23(13)
  |
   - 24(8)
  |
   - 25(27)

This gives us the following graphic for the SMW11 paper.


MIT-NOV Dataset, for multiple parameters to H-GCE, compared to Bubble, PROPHET and Flood

The latency is shown below, which seems to follow the same trend as the previous version, but with BubbleH actually beating all but Flood at the end.

Latency in MIT-NOV dataset for Bubble, BubbleH, PROPHET and Flood.

This is a positive result, but whilst doing some work towards the next paper for NGMAST11, I realised that we should be doing runs with multiple parameters to K-Clique; however, for this paper we probably don't need to worry so much. Also, for reference, the average value for BubbleH is included in the plot below.

[Figure: MIT-NOV best run compared to Bubble, PROPHET and Flood, with the BubbleH average included]

We need to consider whether to evaluate multiple parameters to BubbleRAP, and see whether this affects the algorithm. We also need to consider whether the hierarchy is really making things better, or whether it is the sheer number of communities, because the best performing run has a quite flat structure.