Archive for the ‘What Pádraig Says’ Category

Temporal Networks

May 31st, 2012

Pádraig and I discussed time-respecting paths in the datasets we have, based on the Temporal Networks paper [1].

We want to detect the shortest/quickest paths between pairs of nodes which respect the temporal ordering of contacts, so that we can derive a betweenness centrality measure.

The existing implementation we have, based on the BubbleRAP paper, floods the network and calculates the betweenness centrality using the number of times a node is used in ANY path – the BubbleRAP implementation uses only the SHORTEST paths. So there is some discrepancy here. The BubbleRAP paper reads as follows:

To calculate the individual centrality value for each node, we take a numerical approach. First we carry out a large number of emulations of unlimited flooding with different uniformly distributed traffic patterns created using the HaggleSim emulator, which can replay the collected mobility traces and emulate different forwarding strategies for every contact event. Then we count the number of times a node acts as a relay for other nodes on all the shortest delay deliveries. Here the shortest delay delivery refers to the case when the same message is delivered to the destination through different paths, and we only count the delivery with the shortest delay. We call this number the betweenness centrality of this node in this temporal graph. Of course, we can normalise it to the highest value found. Here we use unlimited flooding since it can explore the largest range of delivery alternatives with the shortest delay. This definition captures the spirit of the Freeman centrality [2].

 

The Freeman centrality paper, which I only skimmed, doesn’t appear to refer to temporal graphs directly –  further inspection may yield a better understanding 🙂

There is a question over what observation period we should use, and how to decide which is really the ‘shortest path’ when paths may start/end at different times. In the BubbleRAP paper, they seem to have chosen the path with the least latency, without considering the start time of the path. In the temporal networks paper [1], the authors describe a saw-tooth pattern in latency (as regular connections create a latency lag between connections), and mention a method to normalise over this pattern – but not specifically for the betweenness centrality measure.

For example, if we start at time 0 and measure the time it takes to reach the destination node, that is fine. However, what if there is an intermediate time, say 4, from which measuring gives a new shortest path – in BubbleRAP this is taken as the shortest path, but is this correct?

I will implement the time-respecting shortest-path (latency) betweenness centrality (and maybe the shortest path in hops), as per BubbleRAP, and I will also investigate the normalised version as per the Temporal Networks paper.
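
As a starting point, here is a minimal sketch of the earliest-arrival (‘foremost’) computation I have in mind, assuming the contacts are available as time-sorted (from, to, time) records and treating each contact as instantaneous and bidirectional; the function and field names are mine, not the simulator’s:

function foremost_arrival_times(array $contacts, $source, $start) {
    $arrival = array($source => $start);
    foreach ($contacts as $c) {                  // contacts must be sorted by time
        if ($c['time'] < $start) {
            continue;                            // before the observation window
        }
        $pairs = array(array($c['from'], $c['to']), array($c['to'], $c['from']));
        foreach ($pairs as $pair) {
            list($u, $v) = $pair;
            // $u can relay if it already holds the message by this contact time
            if (isset($arrival[$u]) && $arrival[$u] <= $c['time']) {
                if (!isset($arrival[$v]) || $c['time'] < $arrival[$v]) {
                    $arrival[$v] = $c['time'];   // earlier delivery found for $v
                }
            }
        }
    }
    return $arrival;   // node => earliest delivery time; unreached nodes are absent
}

Sweeping $start across the observation period should reproduce the saw-tooth latency pattern described in [1]; recording which relays lie on the foremost paths (rather than just the arrival times) is the extra bookkeeping needed for the betweenness measure.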

[1] Holme P, Saramäki J. Temporal Networks. 2011:1-28. Available at: http://arxiv.org/abs/1108.1780. Accessed May 15, 2012.

[2] Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977. Available at: http://www.jstor.org/stable/10.2307/3033543. Accessed May 31, 2012.

The effect of Hierarchy and Overlap on routing

February 13th, 2012

Following on from the previous post, I implemented the algorithms that evaluate the structure of the hierarchy, and plotted this for all datasets against the main routing metrics.

However, this showed a confusing picture, so Pádraig suggested I run HGCE with multiple parameter settings over one dataset and see if it made a difference. We chose the Studivz-three-B dataset, as it is large enough to give a nice general picture.

The parameters were as follows:

K=3,4

M (threshold) = 0,1e-05,2e-05,3e-05,4e-05,6e-05,6e-05,7e-05,8e-05,9e-05,6.83085891329e-05,2.04848750968e-05,8.6207182699e-05,5.97475523656e-06

The other parameters were kept at their defaults (see here).

Again, on first look this seemed confusing, but after removing the two very low delivery ratio results (leaving 24 out of 26), the visualizations seemed to show a trend of increased hierarchical structure corresponding to increased delivery ratio.

It is quite interesting to explore the data by filtering out different runs; e.g. removing all values where K=3 or K=4 makes a difference: Hierarchy Data Studivz-B

I suspect the two clumps in delivery ratio form on either side of an important threshold parameter value.

 

Measuring Hierarchy

February 4th, 2012

Pádraig and I discussed a mechanism for measuring hierarchy, to give a given structure a metric. Pádraig’s suggestion was to measure the average distance to the root, and I came up with a normalised version that should also take into account the relative size of a node (in this case a community), as follows:

The root is always assumed to be a node with a weight of exactly the total size; e.g. for a community hierarchy, the size of a node represents the number of members of that community, and the root node contains all members of the network.

We measure the distance from the root to each community and multiply the size of the community by that distance. We then divide the sum of these values by the size of the root community. If the result is 1, the community structure is shallow; if the result is less than 1, the community structure does not contain all nodes; and the further the result is above 1, the more depth the structure has.
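
Written out as a formula (my notation, just restating the prose above), with d(c) the depth of community c counting the root as 1, |c| its size and |R| the size of the root:

\[ \mathrm{depth} = \frac{1}{\lvert R \rvert} \sum_{c} d(c)\,\lvert c \rvert \]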

// $data:  community id => array('level' => depth below the root, 'members' => node ids)
// $nodes: all nodes in the network (the root community)
$depth_sum = 0;
$levels = array();   // communities per level, reused for the breadth metric below
foreach ($data as $commid => $commdata) {
    $level = $commdata['level'];
    $levels[$level] = isset($levels[$level]) ? $levels[$level] + 1 : 1;
    // distance from the root (level + 1) weighted by the community size
    $depth_sum += ($level + 1) * count($commdata['members']);
}
$depth_metric = $depth_sum / count($nodes);   // 1 = shallow, > 1 = deeper structure

It still needs something to balance for breadth, so,

At each level of the dendrogram, we take the size of each node, multiply it by the number of nodes at that level, and divide by the size of the root node, then multiply by the node’s depth. The sum of these values, divided by the size of the root node, gives us the breadth metric.

$breadth_sum = 0;
foreach ($data as $commid => $commdata) {
    $level = $commdata['level'];
    // node depth * (communities at this level * community size) / root size
    $breadth_sum += ($level + 1)
                  * ($levels[$level] * count($commdata['members']))
                  / count($nodes);
}
$breadth_metric = $breadth_sum / count($nodes);
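
In the same notation, with n_l the number of communities at level l, the code above computes:

\[ \mathrm{breadth} = \frac{1}{\lvert R \rvert} \sum_{c} d(c)\,\frac{n_{\mathrm{level}(c)}\,\lvert c \rvert}{\lvert R \rvert} \]

As a quick sanity check, a toy hierarchy with a root of 10 members and two level-1 communities of 5 members each gives depth = (1·10 + 2·5 + 2·5)/10 = 3 and breadth = (1 + 2 + 2)/10 = 0.5.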

Average Overlap is simply the sum of the number of communities each node is a member of divided by the number of communities.
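
For completeness, the same style of loop for the overlap figure (this just restates the definition above and reuses the $data array from the snippets above):

$membership_sum = 0;
foreach ($data as $commid => $commdata) {
    // each member of this community counts as one community membership
    $membership_sum += count($commdata['members']);
}
$average_overlap = $membership_sum / count($data);   // divided by the number of communities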

 

 

It will be interesting to calculate these metrics for each dataset and each threshold value; we can then plot the delivery ratio (and other metrics) in a scatter plot against breadth and depth, and see if there is a trend.

THE plan

January 12th, 2012

Complete exhaustive runs on all datasets:

  • MIT-ALL
  • MIT-OCT
  • ENRON
  • Enron-a
  • Enron-b
  • Salathe??
  • Studivz

Implement Conga – use the number of communities from InfoMap as the target number for Conga.

Implement InfoMap-H.

Above to be completed by next week’s meeting, 19th Jan.

Write a thesis chapter based on this work, with a view to making it into a journal article. Chapter and article to be completed by the end of March 2012.

Then, conduct another piece of work to tie it all in, probably based on Vector Clocks.

Finally, fill in the (rather large, and difficult) blanks in the thesis.  PC: Easy when said quickly 🙂

Update 27 Sep 2011

September 27th, 2011

In our last meeting I took down these actions:

  • Resurrect NGMAST paper and use Enron + MIT-NOV – argue that MIT is not big enough to make assumptions

I have started another paper in the Clique SVN here: https://erdos.ucd.ie/svn/clique/papers/BubbleH – at the moment it’s just the same as the NGMAST paper.

  • Write thesis outline and get started on background chapters – message dissemination, and the social networks side (from Milgram onwards, etc.)

I created a thesis document in the repository too: https://erdos.ucd.ie/svn/clique/papers/MattStabelerThesis – I have started to compile some notes into background chapters, based on my transfer report. I have also started a rough outline of chapters. The latest version in PDF is here.

  • Speak to Conrad about finding a non-overlapping Hierarchical CFA

Since the feedback at the CLIQUE mini-workshop, I have asked Fergal, Conrad and Aaron again about community finding algorithms. They mentioned the Blondel one and the InfoMap one again.

  • Get a decent sub-set of STUDIVZ dataset

To get a sub-set of nodes in the STUDIVZ dataset I started by picking N nodes randomly and including their immediate neighbours. I do this L more times to get the nodes up to a depth of L hops away from the seeds. Using 10 random nodes with a depth of 2 yields a network of around 3500 nodes (12% of all nodes). When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially.

Currently, I’m testing the setup with 5 seed nodes and 2 levels of network, with the hope that there will be some overlap.

Conrad suggests that we take them non-randomly – by first reducing our set of nodes to those with high activity (either number of posts, or total length of posts), then using the network L hops from the remaining nodes.
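
A minimal sketch of the sampling procedure (the function name and adjacency-list layout are mine); Conrad’s variant just swaps the random seeds for a pre-filtered high-activity list:

// Snowball sample: start from $seeds and add all neighbours, repeated $depth times.
function snowball_sample(array $adjacency, array $seeds, $depth) {
    $sample = array_fill_keys($seeds, true);
    for ($hop = 0; $hop < $depth; $hop++) {
        $frontier = array();
        foreach (array_keys($sample) as $node) {
            if (!isset($adjacency[$node])) {
                continue;                          // node has no recorded neighbours
            }
            foreach ($adjacency[$node] as $neighbour) {
                if (!isset($sample[$neighbour])) {
                    $frontier[$neighbour] = true;  // new node, one hop further out
                }
            }
        }
        $sample += $frontier;                      // union with the new layer
    }
    return array_keys($sample);
}

// e.g. 5 random seed nodes, 2 levels out:
// $subset = snowball_sample($adjacency, (array) array_rand($adjacency, 5), 2);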

BubbleH and SMW11

April 19th, 2011

Our paper for SMW11 was accepted, and now we will be able to include the results from the bug-fixed version of BubbleH, which performs very well. An addition that Pádraig suggested was to ignore level information in the hierarchy and simply use community size. This has the benefit of making the algorithm simpler while still incorporating hierarchical structure. I ran the two versions side by side and found that they perform almost identically, with ignoring levels being slightly better in some cases!

MIT-NOV Dataset, with levels included and not included for multiple parameters to H-GCE

This plot is very hard to read, but it is possible to see the similarities at a broad level. The best performing run was with H-GCE parameters of K=3, E=0.15, ST=0.9, MAP=0.9 and Z=2. Its structure seems to be relatively flat, with a large number of communities:


  0(90)
  |
   - 1(5)
  |
   - 2(5)
  |
   - 3(8)
  |
   - 4(8)
  |
   - 5(8)
  |
   - 6(4)
  |
   - 7(28)
  |
   - 8(27)
  |
   - 9(22)
  |
   - 10(17)
  |
   - 11(13)
  |
   - 12(20)
  |
   - 13(16)
  |
   - 14(26)
  |  |
  |   - 15(12)
  |
   - 16(7)
  |
   - 17(4)
  |
   - 18(4)
  |
   - 19(3)
  |
   - 20(8)
  |
   - 21(8)
  |
   - 22(3)
  |
   - 23(13)
  |
   - 24(8)
  |
   - 25(27)

This gives us the following graphic for the SMW11 paper.


MIT-NOV Dataset, for multiple parameters to H-GCE, compared to Bubble, PROPHET and Flood

The latency is shown below, which seems to follow the same trend as the previous version, but with BubbleH actually beating all but Flood at the end.

Latency in MIT-NOV dataset for Bubble, BubbleH, PROPHET and Flood.

This is a positive result, but whilst doing some work towards the next paper for NGMAST11, I realised that we should be doing runs with multiple parameters to K-Clique; however, for this paper we probably don’t need to worry so much. Also, for reference, the average value for BubbleH is included in the plot below.

bubbleh-bubble-prophet-flood-ignorelevelstrue-bestrun-withavg

We need to consider whether to evaluate multiple parameters to BubbleRAP, and see whether this affects the algorithm. We also need to consider whether the hierarchy is really making things better, or whether it is the sheer number of communities, because the best performing run has quite a flat structure.

Adapting BubbleRap with global hierarchical knowledge

February 24th, 2011

We discussed the results so far, and I suggested there was scope to improve the global ranking used by BubbleRAP; Pádraig hit upon a clever idea to use hierarchical clustering to drive routing. Pádraig invited Conrad in to tell us about how Hierarchical GCE works. He described how it will give some notion of stable hierarchical dendrogram cut-points.

heirarchicalclustering

We then discussed the idea of making the global routing part of BubbleRAP use this hierarchy, by selecting only the minimum set of communities that contain the destination node for routing to. In the example above, routing from C to P would mean using the top level community. However, a route between F and A would mean not passing messages to communities encompassed by N, M, O and P, as they are not in the minimum set. The thinking behind this is that we can prevent messages being sent to obscure places in the network and thus reduce overhead, perhaps at the cost of increased delivery latency.
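
One way to read ‘minimum set’ is the smallest community in the hierarchy that covers both the current carrier and the destination (the lowest branch containing them both). A sketch of that selection step, with the community structure assumed to be an array of community id => array('members' => node ids, 'level' => depth) rather than HGCE’s actual output format:

// Find the smallest community containing both the carrier and the destination;
// in the global phase, messages would only be passed to relays inside it.
function minimum_enclosing_community(array $communities, $carrier, $destination) {
    $best = null;
    foreach ($communities as $commid => $commdata) {
        $members = $commdata['members'];
        if (in_array($carrier, $members) && in_array($destination, $members)) {
            if ($best === null ||
                count($members) < count($communities[$best]['members'])) {
                $best = $commid;    // smaller enclosing community found
            }
        }
    }
    return $best;    // null would mean falling back to the whole network
}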

One possible criticism is that this might be hard to justify in terms of reality, as nodes will not be able to have pre-computed community knowledge. Conrad mentioned that he had been working on some ideas regarding vector clocks, which might be applicable. As I have said before, we could use some data within the vector clocks as a metric for routing, and also use them as a means of communicating network structure data, so that each node can calculate the state of the network based on its own knowledge; a clever mechanism to label communities could then be devised which would allow nodes to route based on this hierarchical knowledge.

Conrad will hook me up with the code to do the clustering, and I will work out how to implement this in the simulator.

Some new ideas

February 1st, 2011

Met with Pádraig; results so far are not complete, so we still need to:

  • Run routing based on ranked centrality only (i.e. the aggregate graph of all connections): Graham might have done this already. This will give us a more complete picture of what routing towards the centre really looks like.
  • Do more randomised ranking runs, to see if random can come up with a better routing rank than LBR.
  • Implement and test new LBR idea.

Next Step for LBR

Pádraig suggested a simple advancement for LBR:

Person A has a message for Person C, and has encountered Person B. A has to work out whether to pass the message on or not. Each node has a probability matrix of visiting each location at any time.

Probability matrix of nodes being at any given location

A makes his decision by calculating the dot product of his own location probabilities with C’s, and comparing that to B’s calculation in relation to C. If the value for B·C is greater than A·C, then A passes the message to B. The rationale is that when one encounters someone who is more likely to visit the same locations as the destination person, the message should be passed on, because they are more likely to see that person.
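
As a sketch (my function name; $p_a, $p_b and $p_c are the rows of the probability matrix for A, B and the destination C, keyed by location id):

// A forwards to B only if B's co-location score with the destination C
// (the dot product of location-visit probabilities) beats A's own score.
function should_forward(array $p_a, array $p_b, array $p_c) {
    $score_a = 0;
    $score_b = 0;
    foreach ($p_c as $location => $prob) {
        $score_a += (isset($p_a[$location]) ? $p_a[$location] : 0) * $prob;
        $score_b += (isset($p_b[$location]) ? $p_b[$location] : 0) * $prob;
    }
    return $score_b > $score_a;
}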

There are some caveats….. TODO: buffer zone, future prediction, past history, recent(ness?), limited locations,

Some specific details/ideas/extensions to consider:

  • We need to consider how these probability matrices will evolve over time. A node’s probability matrix will change when he/it visits new places, or alters existing mobility patterns. Initially, we can use the computed data, but more refined versions should exhibit a real-world scenario of unpredictable behaviour.
  • Determine a good threshold to use, so that messages are not sent between people of very similar rank/score; the rationale being that there may be no benefit in passing the message on, and this would help reduce overhead.
  • Limit the number of locations considered to only those that are popular; this might boost the use of popular locations, in an attempt to achieve a higher probability of a message being passed on.
  • Consider a more sophisticated mechanism to predict co-location with the destination node, or a better conduit/carrier node, by predicting future interactions with nodes based on past history.
  • It may also be important to show the possibility of real-world application, by implementing  a scheme for automatic dissemination and update of the probability matrix using the network itself. (related to  previous ideas about piggybacking meta-data using network/vector clocks, which themselves can be used as a source of routing metrics. e.g. recentness of contact, latency, update routes/speeds/times etc.)

Pádraig and I discussed the problems we may encounter in regards to peer review and justification of the work; in that the problem area is not well defined, and therefore we might have problems showing why this work is novel, and proving what our contributions are. To that end we need to explore the literature a little more, so we might be able to show a solid justification for the work, or alternatively, change our approach so it fits in better with other, better defined problems.

What sorts of messages are ‘delay tolerant’? Pádraig’s suggestion is that Twitter messages and Facebook update messages might be delay tolerant, as a person may not need to receive all messages and generally only wants the latest updates; it does not matter if a few are lost along the way.

How do we define the urgency of messages, and the efficiency of the network? Perhaps one type of message can be delivered within 10 time periods, and still be considered to be relevant and within acceptable delivery time, but another message may need to be delivered within 1 time period, to ensure quality of service.

There is a categorisation issue too: some messages can be considered one-to-one (direct messaging), some one-to-many (Twitter update), some many-to-many (local information) and some many-to-one (sensor networks). We need to decide which of these we will consider. On this note, I said I would speak to Neil Cowzer, who is working on implicit group messaging, to see what his motivation is, and whether he has a well-defined problem space that he is tackling.

Another alternative that Pádraig suggested was looking at a social science approach, where we examine heuristic approaches to routing in complex social networks. Pádraig’s suggestion was that on some networks we might be able to apply certain routing techniques which do not work on others. The contribution would be defining, categorising and testing new and existing combinations of network types and routing mechanisms. This would be an interesting route to take, but would mean a step back in my research, as I would need to do reading in this area. It would link up well with the work of Stanley Milgram, Granovetter and Watts & Strogatz etc. So I should re-read some of this work, but more importantly, take a look at the most cited and citing documents, to see where research might be heading now.

Picking location clusters

January 21st, 2011

Met with Pádraig to discuss his ideas about location clusters. The idea is to assign a cell tower to a maximum of three clusters or ‘locations’. We would first need to remove the largest N clusters. The rank of clusters should be defined by the sum of the weight (number of reports of links between towers) of the edges from the node into the cluster, or perhaps the average.


Cell towers are linked to at most their top three communities, ranked on the strength of the ties (weight of edges) to that community.

The first step is to run this community allocation against the whole set and see what it looks like, as it might be that we can keep the large clusters.

Once we have this done, we can re-rank the locations using a modified version of the original algorithm. In the new version, a location has a score, which is the sum of the number of times the cell towers within it have been reported; this means that some cell towers will affect the score of multiple locations. Once we have the newly ranked locations, we can then rank each user based on their visits to these locations.
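
A sketch of the allocation step, assuming the MOSES output is available as $communities (community id => list of tower ids) and the tower graph as $weights (tower => neighbour => report count); the ‘sum’/‘average’ switch corresponds to the two ranking options mentioned above, and all names are mine:

// Allocate each cell tower to at most its top three communities, ranked on
// the summed (or average) weight of its edges into each community.
function allocate_towers(array $communities, array $weights, $mode = 'sum') {
    $allocation = array();
    foreach (array_keys($weights) as $tower) {
        $scores = array();
        foreach ($communities as $commid => $members) {
            $total = 0;
            $edges = 0;
            foreach ($members as $member) {
                if (isset($weights[$tower][$member])) {
                    $total += $weights[$tower][$member];
                    $edges += 1;
                }
            }
            if ($edges > 0) {
                $scores[$commid] = ($mode == 'average') ? $total / $edges : $total;
            }
        }
        arsort($scores);                         // strongest ties first
        $allocation[$tower] = array_slice(array_keys($scores), 0, 3);
    }
    return $allocation;    // tower id => up to three community ids
}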

Location Clusters/Communities

After assigning cells to a maximum of three location clusters based on the AVERAGE edge weight and the SUM of weight (i.e. two separate configurations), it was possible to visualise the assigned clusters.

The top 3 allocated clusters based on MOSES output for MIT-OCT 2004, where the membership is calculated using the average weight of edges in/out of the cluster.

For the initial experiments, the dataset without any communities removed was used as the source for the ranking algorithm, using only the MIT-OCT dataset as the basis for comparison with other results.


LocationSim results for MIT-OCT using Top 3 Moses Allocated Communities for LBR based on AVERAGE, SUM weight calculation, vs RANDOM ranking allocation.

For comparison, one run with a random allocation of rankings (i.e. ranking was shuffled randomly, but only for one run, not an average) gives a hint about the improvement that ranking gives for LBR. In this case, there is no significant improvement, but for more conclusive results, multiple runs of random rankings would need to be tested. It might be interesting to try to find the best possible ranking in order to improve routing; it could be that there is no better ranking that can be achieved than that of LBR. This would explain the poor performance against the Random protocols, which are not limited to a strict hierarchy. To match Random 1.0 and 0.2, LBR may need to be more sophisticated.

MIT Cell Tower Community Finding

January 19th, 2011

Had a meeting with Pádraig; we discussed using MOSES to get the list of communities of cell towers.

To recap, two cell towers are linked if, during a contact (Bluetooth) between two people, they are spotted by either person. This generates a huge furball when visualised. I installed MOSES on a server and ran it on the graph generated from MIT Oct 2004. It produced 66 communities. There are 468 nodes in the dataset. The average number of communities per node is 2.17.

Pádraig suggested visualising the data and colouring each community in turn, so that we might be able to get an idea of which communities we can remove (as they are too big), which will leave us with smaller subsets of the graph that identify locations better.

We can then use these communities as the ‘locations’ in location-based routing. We need to determine whether it matters if a node reports multiple ‘locations’ at the same time.

I started to view the community colours as suggested, but it still showed a very large furball, so I decided to see what happened when the highly connected nodes are removed. In the image below, we can see that when the top 100 most highly connected nodes are removed, it becomes clear that there are distinct groups of cell towers.

MIT Oct 2004, when two users see each other, the cell towers they see are linked. This version has the top 100 highest connected (degree) nodes removed. Edges show community membership as given by MOSES.

I sent the MOSES output and edge list on to Aaron McDaid and Daniel Archambault, who generated the Euler diagram below using Tulip.

The layout in the visualization is based solely on the communities found by MOSES. Tulip attempts to lay out the sets in an Euler diagram such that communities are placed nearby if they share nodes in common.

Euler Diagram generated using Tulip from the output of MOSES for cell towers connected in MIT Oct 2004.

I have yet to speak to Aaron in more detail about what this diagram means, but if I have interpreted the visualization correctly, the similarly coloured areas are collections of nearby nodes, separated into distinct clusters rather than overlapping ones. If it is possible to extract this clustering, it might provide exactly the location clustering we need, if we remove the very large clusters (light brown/green).

I took some time to review the way that cell towers are linked, to make sure that it was making a fair linking, ready for when MOSES did its thing. As it is, it is a little loose in how it determines whether two cells are linked, as it links all cells that are seen during an entire contact period. This means that cells seen at quite different points of a journey get linked when two people are travelling together. I plan to work on a stricter approach, where the durations of the cell sightings for each person are compared, and cells are linked only when it is clear that they were seen at the same time. However, I continued using the results we already have.
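
The stricter linking I have in mind looks roughly like this (structures assumed: $sightings_a and $sightings_b are the two people’s cell sightings during one contact, each entry being array('cell' => id, 'start' => t1, 'end' => t2)):

// Link two cells only when the two people's sightings of them actually
// overlap in time, rather than merely falling within the same contact.
function linked_cells(array $sightings_a, array $sightings_b) {
    $links = array();
    foreach ($sightings_a as $sa) {
        foreach ($sightings_b as $sb) {
            $overlap = min($sa['end'], $sb['end']) - max($sa['start'], $sb['start']);
            if ($overlap > 0) {
                $links[] = array($sa['cell'], $sb['cell']);   // seen at the same time
            }
        }
    }
    return $links;    // cell pairs to add (or increment) in the tower graph
}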

The images below show the graphs, when the top N communities are removed. The size of the node relates to the number of times a cell tower is reported, using binning.

The largest number of times a node is spotted is 9291 and the smallest is 1, so
(9291 - 1) / 10 bins = a bin size of 929.
For example, if a node is spotted 1043 times, then it will be placed into bin 2.

The width of the edge relates to the number of times an edge is reported between its two vertices, again using binning. The maximum number of times an edge is spotted is 3676, the minimum 1. The average, however, is only 8.38, and the median is only 3, so the binning really only distinguishes the very highly seen edges.
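
The binning itself is just the following (a sketch; the clamp keeps the maximum value inside the top bin):

// Map a raw count onto a 1-indexed bin, e.g. bin_value(1043, 1, 9291) == 2.
function bin_value($value, $min, $max, $bins = 10) {
    $size = ($max - $min) / $bins;                      // (9291 - 1) / 10 = 929
    $bin  = (int) floor(($value - $min) / $size) + 1;
    return min($bin, $bins);                            // clamp $max into the top bin
}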

The colour of the edges relates to community membership. Because nodes can be members of multiple communities, and therefore have multiple edges, I made the decision to create only one edge and distinguish it using the concatenation of the community names (e.g. Community_0Community4…), so that edges which represent the same set of communities will have the same colour. This might not have been the best way to do it, but the software used does not show multiple edges between the same nodes. An alternative would be to make the width of the edge equal to the number of communities it represents.

Minus 1 Community

Minus 15 Communities

Minus 30 Communities

Minus 40 Communities

As communities are removed the graph becomes clearer; when 40 communities are removed, it becomes more obvious that there are distinct communities, and the high-degree nodes stand out.

eps and png files for other numbers removed

The goal is: given any cell tower id, we can identify which community it belongs to, and by extension, the ‘location’ the node is at.

An idea I have for finally making this data usable, so that we have distinct communities: ignore the top N communities (e.g. 30); order each community by the summed weight of its edges; then take the highest-degree node, assign it to the highest-ranked community it belongs to, and remove that node from further consideration. Continue until all nodes are assigned to a community. This may need some more thought.
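
A first sketch of that greedy assignment (all structures assumed: $rank maps community id to its summed edge weight, $memberships maps tower id to its MOSES communities, $degree maps tower id to its degree in the tower graph, and $ignored lists the top N community ids to skip):

// Greedy disjoint assignment: visit towers in descending degree order and give
// each one to the highest-ranked of its communities, ignoring the top N.
function greedy_assign(array $rank, array $memberships, array $degree, array $ignored) {
    arsort($degree);                              // highest-degree towers first
    $assigned = array();
    foreach (array_keys($degree) as $tower) {
        $best = null;
        $candidates = isset($memberships[$tower]) ? $memberships[$tower] : array();
        foreach ($candidates as $commid) {
            if (in_array($commid, $ignored)) {
                continue;                         // one of the very large communities
            }
            if ($best === null || $rank[$commid] > $rank[$best]) {
                $best = $commid;
            }
        }
        $assigned[$tower] = $best;                // null if only ignored communities remain
    }
    return $assigned;                             // tower id => single community id
}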