Archive

Archive for the ‘Thoughts’ Category

BubbleRAP using Heirarchical Community Structure

March 3rd, 2011 2 comments

The existing BubbleRAP mechanism works as follows:

on N connected with encountered node P
for each message M (for destination D) in buffer of N
if P == D
    pass message M to P
else
    if N and P share a community with D
        if LocalRank(P) > LocalRank(N)
            pass message M to P (and delete from buffer)
    else
        if (P shares a community with D) OR (GlobalRank(P) > GlobalRank(N))
            pass message M to P (and delete from buffer)
// keep message

The proposed version of Bubble-H (Bubble Heirarchical) works as follows

on N connected with encountered node P
for each message M (for destination D) in buffer of N
if P == D
    pass message
else
    if P is in a CLOSER community, that also contains D
        pass message M to P
    else if P and N share CLOSE community with D
            if(LocalRank(P) > LocalRank(N))
                pass message M to P
            else
                keep message
    // we need to still push messages to the top when there is no overlap
    else if(P is in a BRIDGING community with D, that N is not)
      pass message M to P
    else
      keep message

Bubble-H as above uses the notion of CLOSE(R) communities, and BRIDGING communities:

heirarchicalclustering-1

  • A CLOSER community, is the one that is lower down in the hierarchical structure of communities, for example, when N has the message destined for O, on meeting P, he would pass the message, as P is in a community lower in the hierarchy. Being CLOSER suggests that P is somehow more likely to be connected to O.
  • The shared CLOSE community is one that is low down in the heirarchy scale, that the destination is a member of, but that both nodes also share. They compare local ranks to decide who should have the message. For example. N and M share a CLOSE community with O and P.
  • A BRIDGING community, is at a lowest point that  bridges a gap between branches of the structure, in the example, the community C2 containing CDEA, GF and H, would be considered a bridge. (not the best example?). This is handy for when a node who is not in a low level community with the destination needs to get the message closer.

I think there might be something missing from this new algorithm, and I am not convinced it is any different to the existing BubbleRAP protocol, especially as nodes are not clustered hierarchically.

A note on hierarchical GCE: this algorithm produces communities that are hierarchically organised, however, the nodes within these communities can and will overlap, so that a node in a community in one partition of the tree, may also appear in a community on the other side, which means I have now realised that the illustration above is not a correct representation.

UPDATE (3/3/11): Pádraig’s comments make sense, and so the following is a better algorithm:

on N connected with encountered node P
for each message M (for destination D) in buffer of N
Identify bridging community (BC), smallest community containing N and D.
IF P ==D
   then pass the message
ELSE IF P shares a community with D that is smaller than BC
    then pass the message.
ELSE IF P is more central in BC than N
    then pass the message.

UPDATE (7/3/11): I have managed to get Conrad’s code to run today  and output the correct data, and it seems that the output is very different from what I had imagined; the MIT-Nov-Training dataset produces lots of non-heirarchical communities, and one or two parent-child structures (depending on parameters) – this means that in some/most cases, there will nodes who do not have bridging communities.

I am currently implementing BubbleH in the simulator and need to decide what to do in the case of no bridging community; two choices as I see it are: When a bridging community is not found either

  1. use global rank (as per original bubble rap), or
  2. Keep message until a bridging/smaller community is found.

My favourite is option 2, as this helps to hold the message by default, until a useful encounter comes along (a small world property?).

Vector clocks to drive delay tolerant networking

March 1st, 2011 1 comment

Vector Clocks can be used to build a knowledge of the network surrounding each node, from the local perspective of that node. This means that it would be possible to calculate some metrics about the global network, at each individual node. However, we already know [1] that in human networks, the 6 hour degree approximates the betweeness centrality of the node, which can be used as a useful metric for routing [2], so why should we complicate things further?

Benefits of vector clocks

Maintaining a vector clock [3] for each node encountered can give us extra information about the network. Assuming in our vector clock updates, we also transmit information about the network as seen by each node, we can tell both the structure of the network, and a number of extra things about the network that simply counting degree cannot.

Each node can know how out of date other nodes are, simply by checking it’s records, and seeing when an update was last received about that node, this gives some notion of distance from itself to other nodes (a.k.a. ball of radius, latency).

Indirect paths and essential links can be deduced at a global level by looking at the routes that are used to update vector clocks; where an update is received about an unknown node during an encounter, we can mark the encountered node as a possible carrier for the unknown node. And where one node is always used as a conduit for information updates about certain nodes, we know that it connects some other portion of the network.

The rate that other nodes update us with information (update rate) gives us a notion of how often contact is made with that node, the update amount, tells us how much knowledge about the network that node delivers . A node that has a high update rate is one that is encountered often, and one that has a high update amount may be well connected to other portions of the network.

Derived knowledge

Using these metrics, we can start to make other observations about the network. We now have sufficient information to attempt to predict future events [4][5]; Using the knowledge of update rate and out-of-dateness We may anticipate future encounters, based on the notion of periodicity [1], or perhaps even by simple Markov chains [6].

We can also try to calculate a notion of distance from one node to another using the update amount, out-of-dateness and out knowledge of the network structure. A nodes knowledge of the network also allows it to calculate things like it’s own centrality, degree and community membership, as well as giving it hints as to the same metrics for other nodes.

Vector clocks for location

Extending the idea further, it would be possible to share information about node movements along with network structure. Assuming nodes could detect their locations using some globally known scheme (e.g. simply using a grid-based approach to specifying location based on Latitude/Longitude), the vector clock updates can pass along updates about the last time a node visited a location, and perhaps the duration of the visit. This, combined with updates from other other nodes about their movements can give each node a picture of movements of other nodes. This in turn would allow us to make predictions[4][5] about where nodes will be in the future.

Usage

The combination of these metrics, gives us rich information to provide to a routing scheme, for example, it may be possible to adapt CAR [7] to use these metrics in it’s calculations, or a hybrid approach to BubbleRAP [2], where the global and/or local rank are based on these metrics. We may also want to devise our own scheme for routing based on this specific knowledge, however there are a large number of schemes already proposed, and it would seem  sensible to improve an existing scheme, rather than create yet another one.

Refs

1. Williamson G, Cellai D, Dobson S, Nixon P. Self-management of Routing on Human Proximity Networks Spyropoulos T, Hummel KA, eds. 2009;5918:1-12. Available at: http://www.springerlink.com/index/10.1007/978-3-642-10865-5 [Accessed July 7, 2010].

2. Hui P, Crowcroft J, Yoneki E. BUBBLE Rap: Social-based Forwarding in Delay Tolerant Networks. Networks. 2008:241-250. Available at: http://portal.acm.org/citation.cfm?doid=1374618.1374652.

3. Kossinets G, Kleinberg J, Watts D. The structure of information pathways in a social communication network. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’08. New York, New York, USA: ACM Press; 2008:435. Available at: http://portal.acm.org/citation.cfm?doid=1401890.1401945.

4. Song C, Qu Z, Blumm N, Barabási A-L. Limits of predictability in human mobility. Science (New York, N.Y.). 2010;327(5968):1018-21. Available at: http://www.ncbi.nlm.nih.gov/pubmed/20167789.

5. Ashbrook D, Starner T. Using GPS to learn significant locations and predict movement across multiple users. Personal and Ubiquitous Computing. 2003;7(5):275-286. Available at: http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/s00779-003-0240-0 [Accessed July 31, 2010].

6. Musolesi M, Piraccini M, Fodor K, Corradi A, A. Supporting Energy-Efficient Uploading Strategies for Continuous Sensing Applications on Mobile Phones. Pervasive. 2010. Available at: http://www.springerlink.com/index/WH71427029706513.pdf [Accessed September 3, 2010].

7. Musolesi M, Mascolo C. CAR: Context-Aware Adaptive Routing for Delay-Tolerant Mobile Networks. IEEE Transactions on Mobile Computing. 2009;8(2):246-260. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4585387.

MIT Cell Tower Community Finding

January 19th, 2011 1 comment

Had a meeting with Pádraig; we discussed using MOSES to get the list of communities of cell towers.

To recap, two cell towers are linked, if during a contact (Bluetooth) between two people, they are spotted by either person. This generates a huge furball of when visualised. I installed MOSES on a server, and ran it on the graph generated from MIT Oct 2004. It produced 66 Communities. There are 468 nodes in the dataset. The average number of communities per node is 2.17.

Padraig suggested visualising the data, and colouring each community in turn, so that we might be able to get an idea about which communities we can remove (as they are too big), and will leave us with smaller subsets of the graph, which identify locations better.

We can then use these communities as the ‘locations’ in location based routing. We need to determine whether it matters if a node report multiple ‘locations’ at the same time.

I started to view the communities colours as suggested, but it still showed a very large furball, so I decided to see what happenned when the highly connected nodes are removed. In the image below, we can see that when the top  100 highly connected nodes are removed, it becomed clear that there are distinct groups of cell towers.

MIT Oct 2004, when two users see each other, the cell towers they see are linked. This version has the top 100 highest connected (degree) nodes removed.

MIT Oct 2004, when two users see each other, the cell towers they see are linked. This version has the top 100 highest connected (degree) nodes removed. Edges show community membership as given by MOSES.

I sent on the moses output, and list edges to Aaron McDaid and Daniel Archambault, who genereated the Euler diagram  below using Tulip.

the layout in the visualization is based solely on the communities found by moses. Tulip attempts to lay out the sets in an Euler diagram such that communities are placed nearby if they share nodes in common.

Euler Diagram generated using Tulip from the output of MOSES for cell towers connected in MIT Oct 2004.

Euler Diagram generated using Tulip from the output of MOSES for cell towers connected in MIT Oct 2004.

I have yet to speak to Aaron in more detail about what this diagram means, but if I have interpreted the visualization correctly, the similar coloured areas are collections of nearby nodes; seperated into distinct clusters, rather than overlapping ones. If it is possible to extract this clustering, it might provide exactly the location clustering we need, if we remove the very large clusters (light brown/green).

I took some time to review the way that cell towers are linked, to make sure that it was making a fair linking, ready for when MOSES did it’s thing. As it is, it is a little loose in how it determined whether two cells are linked, as it links all cells that are seen in an entire contact period. This means that cells that are linked when two people are travelling together. I plan to work on a more strict approach, where the duration of the cell sightings for each person are compared, and cells linked only when it is clear that they are at the same time. However, I continued using the results we already have.

The images below show the graphs, when the top N communities are removed. The size of the node relates to the number of times a cell tower is reported, using binning.

The most number of times a node is spotted is 9291, the smallest is 1, so
9291 - 1 / 10 bins = bin size of 929.
For example, if a node is spotted 1043 times, then it will be placed into bin 2.

The width of the edge relates to the number of times an edge is reported between its two vertices, again using binning. The most number of times an edge is spotted is 3676, minimum 1. The average however, is only 8.38, and the median is only 3, so the binning really only distinguishes the very highly seen nodes.

The colour of the edges is related to the community membership. Because nodes can be members of multiple communities, and therefore have multiple edges, I made the decision to create only one edge, and distinguish it using the concatenation of the community names (e.g. Community_0Community4…), so nodes that edges that make up the same communities, will have the same colour. This might not have been the best way to do it, but the software used does not show multiple edges between the same nodes. An alternative would be to make the width of the edge equal to the number of communities it represents.

Minus 1 Community

Minus 1 Community

Minus 15 Communities

Minus 15 Communities

Minus 30 Communities

Minus 30 Communities

Minus 40 Communities

Minus 40 Communities

As communities are removed, the graph becomes clearer, where 40 communities are removed, it becomes more obvious that there are distinct communities and the high degree nodes are more obvious.

eps and png files for other numbers removed

The goal is; given any cell tower id, we can identify which community it belongs to, and by extension, the ‘location’ the node is at.

An idea I have for finally making this data usable, so that we have distinct communities. Ignore the top N communities (e.g. 30). Order each community by summed weight of edges, then take the highest degree node, and assign it to the community it belongs to, that has the highest rank, then remove that node from further consideration. Continue until all nodes are assigned to a community. This may need some more thought.

Updated plots for Community Structure

December 14th, 2010 No comments

In the last meeting I had with Pádraig we went through the spectral clustering technique, and discovered that there were some errors in my calculations of the the clusters. I took the time to re-do the community calculations,  and to run the tests again, to make sure all is correct. I used the MIT  dataset, and constrained it to the month of October 2004 (as before). Messages are created and initially sent at the first time step. The selected communities for each week are taken as the biggest group, which in practice were roughly the same set of connected nodes. I chose to omit the nodes with a value of 0.0 in the matrix (V) (see  data pipelines).

Delivery Ratio over time

mit-oct-start-community-lbr-lineplot

This plot shows the delivery ratio over time for MIT-OCT dataset, for LBR communities (where the community is taken from weeks 1, 2, 3, 4 and all 4 weeks), PBR, Prophet, Unlimited Flood and Random(1.0, 0.2, 0.0).

Final Delivery Ratio and Cost

mit-oct-start-community-lbr-barchart

This chart shows delivery ratio and cost for each Protocol, note that unlimited flood does not report cost.


In these updated plots, it is encouraging to see that with the corrected community structure, LBR actually performs well, and in fact for the community structure in week 1, it performs better overall than PBR, with weeks 2, 3, 4, all, close behind. It is also good to see that LBR is beating Random 0.0, and 0.2 consistently. LBR week 1 Community also tracks a small way behind Flooding for a short while around 05/10 – 08/10 which is quite interesting. The communities are listed below (or can be viewed here), for further analysis.

UPDATE:

Having given this a little more thought, this is a little mis-leading, so I have generated new plots where all Protocols are constrained to the same community nodes. Unfortunately, this yields a similar spread of results between the protocols, as did the results that did not consider community structure. Only the structure discovered in week 1 increases the delivery ratio.

Delivery Ratio for each community, for each protocol

mit-oct-communities

This plot shows delivery ratios for each community generated from different weeks and all weeks in Oct 2004 for the MIT Reality Mining dataset

cost-dratio

This shows the final delivery ratios and costs for each protocol, for MIT OCT 2004, communitites

What I plan to do now is to go back to the ranking algorithm, and use the linked cells mechanism to re-calculate the rankings. It might also be useful to start thinking about being able to use the real-time location information (in this case cell towers/linked cell towers) in a routing algorithm directly. This would let us start to implement some simplistic location predictions.

I also created a clearer graph visualisation for the cell tower links, this shows that there is a componant that  encompasses a large percentage of the reported cells overall. If we were to consider this one location, it is possible that it covers a very large area.

connected-cell-towers_new

This shows the cell towers that are connected when at least 100 messages are passed and reported at both cell towers, edges are numbered where this is more than 300.

Routing with location

November 3rd, 2010 No comments

I have started to think again about how to route using location, ignoring contact traces for the moment, I considered what the simplest metric for routing might be. The simplest method, as I seem to remember talking with Davide about previously, is simply comparing knowledge about the set of places visited by each node.


For example, in the above scenario, where the numbered nodes are places visited by nodes, and the lines build the network of places visited by each node (a line represents the route between two places, direction of the arrows shows any ordering of places), if node A has a message for node C, and at some time, meets node B, he can consider whether to pass the message to node B, or keep it to himself, by comparing the intersection (?) of the set of places that each node has with that of the destination node C. Based on the assumption that the more locations it visits in common with the destination node, the more likely it will be to pass the message to the destination.

So, in this case where:

A \cap B = (3,7)
A \cap C = (3,7)
B \cap C = (3,7,8)

Node B has more places in common with C than node A does, so node A passes the message to B, because, B is more likely to visit the same location as C, than A is. This simplistic scheme may work for some situations, but does not consider when two nodes do not meet in the same places, so how to bridge the gap. This is something to think about later on, but an initial guess is to use some other scheme to find a node that does know the destination node, then use this simple scheme.

This scheme is probably not sophisticated enough to be efficient in large networks, and in previous thoughts about routing schemes for location (in which I was mostly using the term vector clocks) I considered a number of things that can be used to inform, or be considered in a routing protocol:

  • Locations visited
  • The times locations are visited
    • The length of time spent at a location
    • The length of time between visits
  • The ordering of visits to locations
  • Common popular locations
    • Use of sinks/post offices
    • Use of data mules (proxy nodes that visit popular locations frequently)
    • The number of routes that pass through a place (degree)
  • The mobility of a node, e.g. the number of places it knows about
  • Prediction of future visits to locations based on historic knowledge
  • Sharing of knowledge between nodes

The edges in the graph, could easily be labelled with the times nodes travelled between places, such that we can build a (historic?) vector clock of all traversals in the graph.

I have started to think that using location for routing may be helpful because, knowing WHERE a meeting takes place, might help to consider the importance of that meeting to the routing protocol, whereas when we only know that a meeting has taken place, but not in what context, it is hard to distinguish whether it is an important meeting. However, I haven’t fully formed this idea.

Location clustering so far.

October 21st, 2010 No comments

I have had a chance to look at the code that Derek sent me, namely the CFinder software for finding k-Cliques, and an implementation of the MDS  algorithm, I struggled to get these to make any sense of the data that I have, and started to think about an alternative approach.

Despite the decision we made in a previous meeting, to consider location readings globally, regardlessof order/time, I decided to consider using the nature of the data to inform the clustering algorithm, namely, using the time ordering to indicate the relationship between readings. This meant going back to the idea of using readings from  individuals, to determine the places that are relevant to them. This of course may have issues when comparing places from different individuals, but no more than when implementing a simulator (or real world) system, where individuals track their own movement patterns.

By taking time ordering into consideration, we can implicitly detect when a node is moving from one place to another, by measuring the distance between the current reading, and the centroid of previous readings from a known place.

- take each location reading for a person, and measure the distance
            between it and the center of the current cluster
    - if the distance of reading from the center of current cluster is greater
        than the mean distance of points in the cluster to the center + alpha,
        then it is outside of the cluster.
        - read ahead by X (to make sure this is not just an erroneous reading)
        - if X readings are beyond the threshold of Current cluster
          - then we have left the location,
              - record the current cluster if it has more than
                  N location readings
              - start a new cluster with the current reading
        - else
          - discard the erroneous reading
    - else
      - add to list of cluster points
      - re-calculate cluster centre

Where alpha is a metric that allows the cluster to grow, alpha can be fixed (say 50m) or a function of the mean distance to the centre of the cluster. The number of readings to read ahead (X) should probably be fixed at say, 10, so that if the person goes away and comes back very quickly, they are considered to have not left.

The minimum number of readings required to consider a cluster valid should be relatively large, say 10 or 20, so that only places where the user lingers for some time, are considered. If we use a number too high, then we will miss some important places. A number too small would mean that places where users just pause for a while will be recorded, which may not be important places.

The algoritm still need improvement, so that it also considers the time between readings; for example, rather that considering the number of readings in a cluster, consider the amount time between the first and most recent reading, and use this to determine whether a place is recorded. This means that we should be able to record important places even if readings were not taken for some reason.