Recording

We decided to go along the route of trying to classify CFAs based on synthetic datasets. We realised that there are no synthetic datasets out there that model both contacts and clustering.

We decided to be clever with the Studivz dataset. We will run the Moses CFA over it. Then pick a random community, then build out the network based on connected communities.

The idea being that we can decide how many nodes we want, and keep going until we have enough.

I implemented this. The script takes an edge list, the community allocation by Moses. (see file in SVN)

The script picks a random community C0, then, while the total number of nodes is less than the max number of nodes, it picks the community (C1) connected to C0 that has the largest number of links to C0. Then, it picks one of these communities at random, and repeats the picking process until the total number of nodes has been reached or exceeded.

Initially, I did this four times, with a limit of 200 nodes and produced the following graphs, coloured by modularity class, and sized by betweeness centrality (sized linearly between 5 and 100 in gephi).

But then I got to thinking what are we actually trying to test here? It’s a bit vague to just see what happens in these datasets. We could be testing a number of things:

- Which CFA works best for well connected communities
- Which CFA works best for poorly connected communities
- Are we testing a particular aspect of a network?
- Density?
- Avg. community size
- Average network size

So I also decides to generate a set of nodes in poorly connected datasets, so adapted the algorithm to pick the communities with the lowest number of links. Resulting in the four below:

Also, perhaps we should be considering weighted links, perhaps the most time connected vs the least time connected?

Each of these subsets should be compared in terms of other network metrics to see if there is an effect on CFA performance.

awesome command to do this:

EXPERIMENT_GROUP=FINAL_EVALUATION DATASETS=studivz-three-200-A,studivz-three-200-B,studivz-three-200-C,studivz-three-200-D,studivz-three-200-A-MIN,studivz-three-200-B-MIN,studivz-three-200-C-MIN,studivz-three-200-D-MIN,mit-nov,mit-oct,cambridge,social-sensing,hypertext2009,infocom-2005,infocom-2006 && for DATASET in $(echo ${DATASETS} | tr "," "\n"); do php -f scripts/stats/DatasetCommunityLinkCountStats.php OUTPUT/${EXPERIMENT_GROUP}/edgelist-graphs/${DATASET}/edge_list.dat OUTPUT/${EXPERIMENT_GROUP}/communities/${DATASET}/Moses/no-global-parent/edge_list.dat.communities.dat CSV ${DATASET} > OUTPUT/${EXPERIMENT_GROUP}/data/${DATASET}-DatasetCommLinkStats.txt; done

At the moment, we (PC and I) visualise the contact networks we are working with using the aggregate graph of the period of interest, nodes are sized based on betweenness centrality, edge thickness represents its weight, based on the total connected time or connected time ratio between nodes. This network is laid out using some force directed layout, usually using the Gephi software’s Force Atlas (1 or 2) layout, or OpenOrd, and a combination of techniques such as non-overalap of nodes, and expansion/contraction. We sometimes remove low weight edges for clarity. Often, we also colour the nodes based on their community assignment.

This works well, and with a bit of manipulation, results in nice enough looking graphs.

I would like to have a nice way to visualise the flow of messages around this network for a given routing algorithm. I imagine a vizualisation much as the above, but with some adaptions. The adaptions allow for other information to be displayed:

- The temporal connections of the edges – at some points in time, edges do not exist, when nodes are not in contact.
- The data transmitted across edges – as time goes on, edges may be used to transmit messages
- The data held by nodes at a given point in time – the buffers of nodes fill and empty over time, as data is transmitted

I would like to see a vizualisation, where the nodes are initially laid out in some appropriate manner e.g. force directed, based on the aggregate graph, so that nodes can be tracked over time, if they happen to move (based on some criteria). The nodes are sized based on Betweenness Centrality (or some other ranking/partition), along with some measure of their data storage, for example, a dotted border, where each dot represents a message. Perhaps a shadowed version records the accumulated count so far of all messages the node has seen.

In this visualisation, edges are only shown when they are active, and fade (with some function) as they become inactive (perhaps proportional to their connection time), perhaps they are still shown using their aggragate graph weight as a thickness, or perhaps this is the accumulated weight (perhaps showing the total aggregate as a shadow). Edges are decorated with information, such as dashes, each dash representing a message transmitted along the path. Each Dash is coloured based in how recently it was transmitted, perhaps in a binary fashion, where recently (tunable) is colours, say red, and non recent messages fade to blue as they reach a time limit. (do we keep edges greyed out, but vizible when they are inactive?) Dashes are located at end of the edge where the sending node is, to indicate the direction of the transmission – perhaps these series of dashes is terminates with an arrow.

When edges become active, they are emphasised in some way, perhaps flashing or briefly pulsing. An alternative orthogonal emphasis is shown when a message is transmitted, so that it is obvious when an edge is activated/de-activated, when a message is transmitted, or when both occur at the same time (i.e. they are easily distinguishable).

In the case of nodes and edges, dashes/dots represent messages, and should be displayed in a uniform size, and where the number of dashes exceeds the length of the edge, or circumference of the circle, they flow outwards from the line, so they appear stacked. In cases where the number of dashes exceeds the limits of the edge more than, say, 3 times, dashes are concatenated into an alternative symbol (e.g. a block) representing a larger number of messages e.g. 10. Alternatively, a stacked buffer style viz. can be displayed beside the edge/node.

**Dynamic interactions:**

When a given dash is selected, representing a message, the path it has taken through the network is highlighted, perhaps a side panel of messages, nodes and edges is shown, with numerical statistics with sorting, so that important items are easily identifiable. These can be highlighted on the main viz. when selected (e.g. hovered over).

When an edge, node, or series of edges is drawn across, different modes cause different effects:

- All Messages, All paths – the paths of any message traversing the edges is highligted
- Only most recent N messages across edge
- Only messages within time period
- All messages transmitted/received by this node
- All messages originating/terminating at this node
- Recent/N/Time period of Messages

The layout parameters can be tweaked during vizulisation, perhaps switching between layout based on aggregate graph, accumulative graph and instantaneous graph.

Something enabling start/stop/pause/rewind/fast-forward/fast-reverse/repeat-period should also be implemented. It is also quite important to be able to drill down into the viz. to see interactions in small sections.

**Data driving the viz:**

The source data for the viz. could come in the form of three pieces of information:

- The list of aggregate edges between nodes (inlcuding weightings), and (maybe) a list of nodes – e.g. in some graph format which allows meta data for nodes and edges
- The list of contacts between nodes – start time, end time, node id, node id
- The list of movements of messages between nodes. message id, Source node, target node, transmission start time, transmission end time (assuming there may be some duration of transmission), and maybe including a list of all messages with source and destination. (for message initialisation a start time might be included).

It would be up to the viz. software to calculate the aggregate graphs, order the contacts, arrange for messages to be transmitted at correct times.

]]>

The presentation seemed to go well, there were plenty of people in the room, it was well received and I got a few questions. It was definitely a good place to submit this work. The session chair in my session was Scott Burleigh, from NASA, who was one of the authors of RFC5050 – a DTN Specification, and he seemed to be quite important in the field.

There were a number of talks about much lower level stuff, but all very interesting. I think I need to read up a bit on the DTN Bundle Specification and the DTN Routing Specification (based on PRoPHET I think), and include a small section on these in the thesis background.

I was also chatting to Bodgan Ghita from Plymouth University, who suggested that I could apply for a research/developer position in DTN security, but he needs someone to start in July. I don’t think I’ll be done by then – plus I’m not sure it’s my kind of thing.

Another interesting guy was Jeorg Ott – who gave a keynote(?) talk which did some nice review of the general DTN area, “DTN – is it for humans?” In which he basically answered yes, but there needs to be thought about the real-world applications of DTN based systems.

He gave an interesting simulation scenario that he had done, a market trader auction system, where the vendor advertises the prices of items, and local users bid/barter the prices, expanding the traders rearch. His idea worked in the same way as the cogs idea I wrote about while back – http://mattstabeler.co.uk/phdblog/archives/1949. The main emphasis was determining how far the data needed to travel from the physical location, i.e. in what location context should it exist, and when no-one is there, the data is deleted/lost – as it no longer has context.

Very few of the talks included real-world datasets, and it seemed to be acceptable to use synthetic datasets. However, there were no papers that were very similar to our work.

Also, Santorini was pretty awesome definitely a place for couples/families though!

]]>We want to detect the shortest/quickest paths between pairs of nodes, which respect the the temporal paths, so that we can derive a betweeness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network, and calculates the betweeness centrality using the number of times a node is used in ANY path – the BubbleRAP implementation uses only the SHORTEST paths. So there is some discrepancy here. The BubbeRAP paper reads as follows:

To calculate the individual centrality value for each node, we take an numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the de- livery with the shortest delay. We call this number the

betweenness centralityof this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality[2]

The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly – further inspection may yield a better understanding

There is a question over what observation period we should use, and how to decide which is really the ‘shortest path’ when paths may start/end at different times. In the Bubble RAP paper, they seem to have chosen the path with the least latency, but not considering the start time of the path. In the temporal networks paper [1], the authors talk about a Saw-tooth pattern which governs latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern – but not really for the Betweeness Centrality measure.

e.g. If we start at time 0, and measure the time it takes to get to the destination node, that is fine. however, what if there is an intermediate time, 4, where, if we measure the path from there, we find a new shortest path – in Bubble RAP, this is taken as the shortest path, but is this correct?

I will implement the Time Respecting, shortest path (latency) betweeness centrality, (and maybe the shortest path – hops), as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.

[1]Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: http://arxiv.org/abs/1108.1780. Accessed May 15, 2012..

[2]Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: http://www.jstor.org/stable/10.2307/3033543. Accessed May 31, 2012.

]]>In the case of the MIT datasets, I included the 4 thresholds we have been using (20th,80th,mean and median thresholds) and a range of 20 values from the lowest edge weight to the highest. I will also run this on the Enron Subsets.

]]>

However, this showed a confusing picture, so Pádraig suggested I run multiple parameters to HGCE over one dataset, and see if it made a difference. We chose the Studivz-three-B dataset, as it is large enough to give a nice general picture.

The parameters were as follows:

K=3,4

M (threshold) = 0,1e-05,2e-05,3e-05,4e-05,6e-05,6e-05,7e-05,8e-05,9e-05,6.83085891329e-05,2.04848750968e-05,8.6207182699e-05,5.97475523656e-06

the other parameters were kept at their default (see here)

Again, on first look, this seemed confusing, but by removing the two very low delivery ratio results (leaving 24 out of 26), the visualizations seemed to show a trend towards increased hierarchical structure meaning increased delivery ratio.

It is quite interesting to explore the data by filtering out different runs, e.g removing all values where K=3, or 4 makes a difference: Hierarchy Data Studivz-B

I suspect the two clumps in delivery ratio form either side of the point an important Threshold parameter.

]]>

The root is always assumed to be a node with a weight of exactly the total size. e.g. for community hierarchy, the size of the node represents the number of members of the community. The root node in this instance contains all members of of the network.

We measure the distance from the root to each community, and multiply the size of the community by the distance. We divide the sum of these values by the size of the root community. If the results is 1, then the community structure is shallow, if the result it less than 1, then the community structure does not contain all nodes, the further the result is upwards of 1 the more depth the structure has.

$depth_sum = 0; $levels = array(); foreach($data as $commid=>$commdata){ $levels[$commdata['level']] +=1; $depth_sum += (($commdata['level'] + 1) * count($commdata['members'])); } $depth_metric = $depth_sum / count($nodes);

It still needs something to balance for breadth, so,

At each level of the dendogram, we take the size of each node, multiply it by the number of nodes at this level, and divide by the size of the root node, then multiply by the depth. The sum of the results for each node divided by the size of the root node gives us the breadth metric.

$breadth_sum = 0; foreach($data as $commid=>$commdata){ $breadth_sum += ( ($commdata['level']+1) * ( ($levels[$commdata['level']] * count($commdata['members'])) / count($nodes) ) ); } $breadth_metric = $breadth_sum / count($nodes);

Average Overlap is simply the sum of the number of communities each node is a member of divided by the number of communities.

It will be interesting to calculate these metrics for each dataset and each threshold value, we can plot the delivery ratio (and other metrics) in a scatter plot for breadth and depth, and see if there is a trend.

]]>I re-ran all of the simulations, as I found an odd quirk in the MIT-NOV dataset, where the Hui Centrality was not calculated correctly, and resulted in NaN being recorded for every node in every local community, which gave very odd results… this was down to a dataset error, which I have rectified, and in future cases when NaN is returned, it is replaced with a rank of 0.0. I think it was down to a divide by zero error in the Hui Ranking code.

This results in less pronounced results for MIT-NOV than we previously looked at. Other results were note affected.

The only other notable quirk was that Moses did not identify communities for IncoCom-2005, IncoCom-2005 and MIT-ALL. The plots below show results for each Dataset.

]]>~~MIT-ALL~~~~MIT-OCT~~- ENRON
~~Enron-a~~~~Enron-b~~- Salathe??
- Studivz

Implement Conga – use the number of communities from InfoMap as the target number for Conga.

Implement InfoMap-H.

Above to be completed by **next weeks meeting, 19th Jan.**

Write a thesis chapter based on this work, with a view to making it into a journal article. Chapter and Artical to be completed by the **end of March 2012.**

Then, conduct another piece of work to tie it all in, probably based on Vector Clocks.

Finally, fill in the (rather large, and difficult) blanks in the thesis. *PC:* Easy when said quickly