Archive for the ‘what i’ve been doing’ Category

Why HGCE does well?

July 27th, 2012 No comments

I just realise that it’s probably doing well because its agglomerative, a bit like how the interaction networks form over time (node by node from an ego centric view) and also it’s robust to perturbations – as characterised by the network itself. This would make over great choice for an algorithm to adapt for a distributed version, where each node learns its network over time and has to account for perturbations in the graph. Might discuss this in conclusions.

Final Chapter – Next Steps

June 22nd, 2012 1 comment

Met with Pádraig to discuss next steps.

We decided to go along the route of trying to classify CFAs based on synthetic datasets. We realised that there are no synthetic datasets out there that model both contacts and clustering.

We decided to be clever with the Studivz dataset. We will run the Moses CFA over it. Then pick a random community, then build out the network based on connected communities.

The idea being that we can decide how many nodes we want, and keep going until we have enough.


I implemented this. The script takes an edge list, the community allocation by Moses. (see file in SVN)

The script picks a random community C0, then, while the total number of nodes is less than the max number of nodes, it picks the community (C1) connected to C0 that has the largest number of links to C0. Then, it picks one of these communities at random, and repeats the picking process until the total number of nodes has been reached or exceeded.

Initially, I did this four times, with a limit of 200 nodes and produced the following graphs, coloured by modularity class, and sized by betweeness centrality (sized linearly between 5 and 100 in gephi).





But then I got to thinking what are we actually trying to test here? It’s a bit vague to just see what happens in these datasets. We could be testing a number of things:

  • Which CFA works best for well connected communities
  • Which CFA works best for poorly connected communities
  • Are we testing a particular aspect of a network?
    • Density?
    • Avg. community size
    • Average network size

So I also decides to generate a set of nodes in poorly connected datasets, so adapted the algorithm to pick the communities with the lowest number of links. Resulting in the four below:





Also, perhaps we should be considering weighted links, perhaps the most time connected vs the least time connected?

Each of these subsets should be compared in terms of other network metrics to see if there is an effect on CFA performance.

awesome command to do this:

EXPERIMENT_GROUP=FINAL_EVALUATION DATASETS=studivz-three-200-A,studivz-three-200-B,studivz-three-200-C,studivz-three-200-D,studivz-three-200-A-MIN,studivz-three-200-B-MIN,studivz-three-200-C-MIN,studivz-three-200-D-MIN,mit-nov,mit-oct,cambridge,social-sensing,hypertext2009,infocom-2005,infocom-2006  && for DATASET in $(echo ${DATASETS} | tr "," "\n"); do php -f scripts/stats/DatasetCommunityLinkCountStats.php OUTPUT/${EXPERIMENT_GROUP}/edgelist-graphs/${DATASET}/edge_list.dat OUTPUT/${EXPERIMENT_GROUP}/communities/${DATASET}/Moses/no-global-parent/edge_list.dat.communities.dat  CSV ${DATASET} > OUTPUT/${EXPERIMENT_GROUP}/data/${DATASET}-DatasetCommLinkStats.txt; done

Dynamic Dataset Visualisation for DTN Simulations

June 13th, 2012 No comments

I was thinking about a way to visualise the flow of messages around a DTN network during simulations runs.

At the moment, we (PC and I) visualise the contact networks we are working with using the aggregate graph of the period of interest, nodes are sized based on betweenness centrality, edge thickness represents its weight, based on  the total connected time or connected time ratio between nodes. This network is laid out using some force directed layout, usually using the Gephi software’s Force Atlas (1 or 2) layout, or OpenOrd, and a combination of techniques such as non-overalap of nodes, and expansion/contraction. We sometimes remove low weight edges for clarity. Often, we also colour the nodes based on their community assignment.

This works well, and with a bit of manipulation, results in nice enough looking graphs.

I would like to have a nice way to visualise the flow of messages around this network for a given routing algorithm. I imagine a vizualisation much as the above, but with some adaptions. The adaptions allow for other information to be displayed:

  • The temporal connections of the edges – at some points in time, edges do not exist, when nodes are not in contact.
  • The data transmitted across edges – as time goes on, edges may be used to transmit messages
  • The data held by nodes at a given point in time – the buffers of nodes fill and empty over time, as data is transmitted

I would like to see a vizualisation, where the nodes are initially laid out in some appropriate manner e.g. force directed, based on the aggregate graph, so that nodes can be tracked over time, if they happen to move (based on some criteria). The nodes are sized based on Betweenness Centrality (or some other ranking/partition), along with some measure of their data storage, for example, a dotted border, where each dot represents a message. Perhaps a shadowed version records the accumulated count so far of all messages the node has seen.

In this visualisation, edges are only shown when they are active, and fade (with some function) as they become inactive (perhaps proportional to their connection time), perhaps they are still shown using their aggragate graph weight as a thickness, or perhaps this is the accumulated weight (perhaps showing the total aggregate as a shadow). Edges are decorated with information, such as dashes, each dash representing a message transmitted along the path. Each Dash is coloured based in how recently it was transmitted, perhaps in a binary fashion, where recently (tunable) is colours, say red, and non recent messages fade to blue as they reach a time limit. (do we keep edges greyed out, but vizible when they are inactive?) Dashes are located at end of the edge where the sending node is, to indicate the direction of the transmission – perhaps these series of dashes is terminates with an arrow.

When edges become active, they are emphasised in some way, perhaps flashing or briefly pulsing. An alternative orthogonal emphasis is shown when a message is transmitted, so that it is obvious when an edge is activated/de-activated, when a message is transmitted, or when both occur at the same time (i.e. they are easily distinguishable).

In the case of nodes and edges, dashes/dots represent messages, and should be displayed in a uniform size, and where the number of dashes exceeds the length of the edge, or circumference of the circle, they flow outwards from the line, so they appear stacked. In cases where the number of dashes exceeds the limits of the edge more than, say, 3 times, dashes are concatenated into an alternative symbol (e.g. a block) representing a larger number of messages e.g. 10.  Alternatively, a stacked buffer style viz. can be displayed beside the edge/node.

Dynamic interactions:

When a given dash is selected, representing a message, the path it has taken through the network is highlighted, perhaps a side panel of messages, nodes and edges is shown, with numerical statistics with sorting, so that important items are easily identifiable. These can be highlighted on the main viz. when selected (e.g. hovered over).

When an edge, node, or series of edges is drawn across, different modes cause different effects:

  • All Messages, All paths – the paths of any message traversing the edges is highligted
    • Only most recent N messages across edge
    • Only messages within time period
  • All messages transmitted/received by this node
    • All messages originating/terminating at this node
    • Recent/N/Time period of Messages
The layout parameters can be tweaked during vizulisation, perhaps switching between layout based on aggregate graph, accumulative graph and instantaneous graph.
Something enabling start/stop/pause/rewind/fast-forward/fast-reverse/repeat-period should also be implemented. It is also quite important to be able to drill down into the viz. to see interactions in small sections.

Data driving the viz:

The source data for the viz. could come in the form of three pieces of information:

  1. The list of aggregate edges between nodes (inlcuding weightings), and (maybe) a list of nodes – e.g. in some graph format which allows meta data for nodes and edges
  2. The list of contacts between nodes – start time, end time, node id, node id
  3. The list of movements of messages between nodes. message id, Source node, target node, transmission start time, transmission end time (assuming there may be some duration of transmission), and maybe including a list of all messages with source and destination. (for message initialisation a start time might be included).

It would be up to the viz. software to calculate the aggregate graphs, order the contacts, arrange for messages to be transmitted at correct times.



Categories: Ideas, Thoughts, What Matt Says

Temporal Networks

May 31st, 2012 2 comments

Pádraig and  I discussed Time Respecting Paths  in the datasets we have based on the paper – Temporal Networks[1].

We want to detect the shortest/quickest paths between pairs of nodes, which respect the the temporal paths, so that we can derive a betweeness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network, and calculates the betweeness centrality using the number of times a node is used in ANY path  – the BubbleRAP implementation uses only the SHORTEST paths. So there is some discrepancy here. The BubbeRAP paper reads as follows:

To calculate the individual centrality value for each node, we take an numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the de- livery with the shortest delay. We call this number the betweenness centrality of this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality[2]


The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly –  further inspection may yield a better understanding 🙂

There is a question over what observation period we should use, and how to decide which is really the ‘shortest path’  when paths may start/end at different times. In the Bubble RAP paper, they seem to have chosen the path with the least latency, but not considering the start time of the path. In the temporal networks paper [1], the authors talk about a Saw-tooth pattern which governs latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern – but not really for the Betweeness Centrality measure.

e.g. If we start at time 0, and measure the time it takes to get to the destination node, that is fine. however, what if there is an intermediate time, 4, where, if we measure the path from there, we find a new shortest path – in Bubble RAP, this is taken as the shortest path, but is this correct?

I will implement the Time Respecting, shortest path (latency) betweeness centrality, (and maybe the shortest path – hops), as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.

[1]Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: Accessed May 15, 2012..

[2]Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: Accessed May 31, 2012.

Updated Hierarchy Stats

February 16th, 2012 No comments

It turns out that the two clumps for Studivz-three-B related to K=3 and K=4. So we ran more runs on Enron and Mit to see what happens, the following are only for K=3.

In the case of the MIT datasets, I included the 4 thresholds we have been using (20th,80th,mean and median thresholds)  and a range of 20 values from the lowest edge weight to the highest. I will also run this on the Enron Subsets.





Enron (all)

The effect of Hierarchy and Overlap on routing

February 13th, 2012 No comments

Following on from the previous post, I implemented the algorithms that evaluate the structure of the hierarchy, and plotted this for all datasets against the main routing metrics.

However, this showed a confusing picture, so Pádraig suggested I run multiple parameters to HGCE over one dataset, and see if it made a difference. We chose the Studivz-three-B dataset, as it is large enough to give a nice general picture.

The parameters were as follows:


M (threshold) = 0,1e-05,2e-05,3e-05,4e-05,6e-05,6e-05,7e-05,8e-05,9e-05,6.83085891329e-05,2.04848750968e-05,8.6207182699e-05,5.97475523656e-06

the other parameters were kept at their default (see here)

Again, on first look, this seemed confusing, but by removing the two very low delivery ratio results (leaving 24 out of 26), the visualizations seemed to show a trend towards increased hierarchical structure meaning increased delivery ratio.

It is quite interesting to explore the data by filtering out different runs, e.g removing all values where K=3, or 4 makes a difference: Hierarchy Data Studivz-B

I suspect the two clumps in delivery ratio form either side of the point an important Threshold parameter.


Measuring Hierarchy

February 4th, 2012 No comments

Pádraig and I discussed a mechanism for measuring hierarchy, to give a given structure a metric. Pádraig’s suggestion was to measure the average distance to the root, and I came up with a normalised version that should also take into account the relative size of a node (in this case a community), as follows:

The root is always assumed to be a node with a weight of exactly the total size. e.g. for community hierarchy, the size of the node represents the number of members of the community. The root node in this instance contains all members of of the network.

We measure the distance from the root to each community, and multiply the size of the community by the distance. We divide the sum of these values by the size of the root community. If the results is 1, then the community structure is shallow, if the result it less than 1, then the community structure does not contain all nodes, the further the result is upwards of 1 the more depth the structure has.

$depth_sum = 0;
$levels = array();
foreach($data as $commid=>$commdata){
  $levels[$commdata['level']] +=1;
  $depth_sum += (($commdata['level'] + 1) * count($commdata['members']));
$depth_metric = $depth_sum / count($nodes);

It still needs something to balance for breadth, so,

At each level of the dendogram, we take the size of each node, multiply it by the number of nodes at this level, and divide by the size of the root node, then multiply by the depth. The sum of the results for each node divided by the size of the root node gives us the breadth metric.

$breadth_sum = 0;
foreach($data as $commid=>$commdata){
  $breadth_sum += (
                    ($commdata['level']+1) *
                    ($levels[$commdata['level']] *
                    / count($nodes)
$breadth_metric = $breadth_sum / count($nodes);

Average Overlap is simply the sum of the number of communities each node is a member of divided by the number of communities.



It will be interesting to calculate these metrics for each dataset and each threshold value, we can plot the delivery ratio (and other metrics)  in a scatter plot for breadth and depth, and see if there is a trend.

Exhaustive Simulations

January 18th, 2012 No comments

I re-ran all of the simulations, as I found an odd quirk in the MIT-NOV dataset, where the Hui Centrality was not calculated correctly, and resulted in NaN being recorded for every node in every local community, which gave very odd results… this was down to a dataset error, which I have rectified, and in future cases when NaN is returned, it is replaced with a rank of 0.0. I think it was down to a divide by zero error in the Hui  Ranking code.

This results in less pronounced results for MIT-NOV than we previously looked at. Other results were note affected.

The only other notable quirk was that Moses did not identify communities for IncoCom-2005, IncoCom-2005 and MIT-ALL. The plots below show results for each Dataset.

Still working on getting a decent run out of Salathe-School, and Studivz.

Categories: experiments

THE plan

January 12th, 2012 No comments

Complete exhaustive runs on all datasets:

  • Enron-a
  • Enron-b
  • Salathe??
  • Studivz

Implement Conga – use the number of communities from InfoMap as the target number for Conga.

Implement InfoMap-H.

Above to be completed by next weeks meeting, 19th Jan.

Write a thesis chapter based on this work, with a view to making it into a journal article. Chapter and Artical to be completed by the end of March 2012.

Then, conduct another piece of work to tie it all in, probably based on Vector Clocks.

Finally, fill in the (rather large, and difficult) blanks in the thesis.  PC: Easy when said quickly 🙂

Current Tasks

January 11th, 2012 No comments

Below is a list of tasks I had before I went on holiday for Christmas, since I came back, I have worked my way through most of them (strikethrough).

  • Incorporate the Blondel Louvain method into simulator
  • Incorporate the  Conga (steve ????) method into simulator
  • Consider using the InfoMap hierarchical method
  • Implement multiple Random runs giving an average result
  • Incorporate the dataset from the Salathe paper (motes in a school)
  • Complete an exhaustive run:
    • Use entire dataset for training and testing (using all core nodes)
    • over all datasets (MIT-NOV, MIT-ALL, Social Sensing, Hypertext2009, Cambridge, InfoCom05, InfoCom06, Enron-a, Studivz (selected nodes), Salathe-School dataset
    • using all algorithms (KCLIQUE, HGCE, InfoMap, LinkClustering, Moses, Blondel, Conga, Random)
    • selecting best of four threshold parameters (where appropriate)
  • Generate plots for all of the above
  • All that currently remains, is to incorporate Conga, which is a little tricky, as it is not easily to integrate it into the simulator (no output options that can be worked into later scripts), and there is no source code available.

    Also, heirarchical InfoMap needs to be looked at….

    Also, runs for MIT-ALL and Enron-a need to be set-up.

    Categories: what i've been doing