Archive for December, 2010

MIT Contacts – Cell Tower stats

December 17th, 2010 No comments

I processed the contact logs, rather than just the hop logs, and after ~6 hours it completed! Now all contacts in the whole MIT dataset have associated cell towers. If a cell tower could not be found for the contact duration, then a second search was done for a record of a cell tower finishing  in the previous hour. The timestamps are preserved so that they can be excluded later. The following shows the stats for this data, in regards to cell towers.

Row Count:	114046

Both Have Cells:		28804	 (25%)
Neither have cells:		34587	 (30%)
One side has cells:		50655	 (44%)
	From has cells:		30557	 (27%)
	To has cells:		20098	 (18%)
Nodes share cell:		4045	 (4%)
First start time:		Thu, 01 Jan 2004 04:03:17 +0000
Last end time:			Thu, 05 May 2005 03:39:29 +0100

This means that out of 114046 contacts between 01 Jan 2004 and 05 May 2005, only 4045 nodes report at least one cell in common at time of contact, but 28804 report different cells, and a further 50655 report cells for at least one of the nodes. This means, that if we assume that a one sided cell sighting at time of contact determines a cell sighting for both nodes, then we have cell towers for ~73% of  contacts.

The assumption is that if any of the cell towers seen in the contact period, then the nodes are considered to share a cell. However, this can be further refined if needs be, to be more strict, where a contact period is very long.

Below is a visualisation of the connected cell towers for MIT-Oct 2004, it it very confused and has a huge central mass, but there are clear connections of towers on the periphery. I think this can be further refined to be more strict about cell tower connections (as at the moment, the dataset contains more than 1 cell tower for any given contact period). Also, at the moment it links two cells even when two nodes are on contact for a long period of time, in which they could conceivably be moving together.


Cell towers where an edge is made when either node in a contact event has seen one of the cell towers.


December 16th, 2010 No comments

Met with Pádraig and Davide, we need to get out of the data cleansing phase, and get into some more interesting areas!

I need to generate the linked cell towers graph based on contacts not just message passing. (as this was what should have been done in the first place!), and use the last known cell tower (within a time limit) as the current cell tower, if none has been seen. This will hopefully mean that we will have cell towers reported for all contacts, and with any luck there should be a larger proportion of contact events that report the same tower.

Using the graph, link the main connected towers into the ranking algorithm.

Categories: Uncategorized

Location and Persistence Based Routing (LPBR) thoughts

December 16th, 2010 No comments

I think that once we have some robust way of relying upon what a location is, we can start to be a little bit more sophisticated with the routing protocol. i.e. making real-time decisions about routing based on the current history, rather than the broad ranking.

For example, extending the idea in PBR which uses duration of contact, and including the idea of Prophet and various others, which use the time since the last meeting with a node, we can use these for locations, and contacts;

Node A keeps a record of contacts and locations, and interrogates
            other nodes on encounter.
On encountering another node B, pass the message for C:
     if B spends more time than A, at any location that node C does
     or B has visited the destination nodes most popular location(s)
           more recently than A
     or B has seen the destination node more recently than A
          ?(and it has a regular pattern (periodicity) of seeing the destination
           node which can be predicted, and means that B will see C
           sooner than A)?
      or B has spent more time than A, with node C in the past

This effectively pushes the message to to the places that the destination node often visits, and to the nodes that see the destination more often.

In a large network, with many very weakly connected communities, it would be difficult to penetrate the destination communities with messages, however, an enhancement to get messages sent widely is to use some other broadcasting mechanism first, such as spray and wait, where there are a limited copies of a message, of which half are shared at every contact until each node has one copy.

At this point the Spray and wait protocol would hold onto the message waiting to see the destination. However, if we then start LPBR routing from this point, we could find that one or some of those messages are with nodes in the destination community, and therefore a greater likelihood of getting the message to the destination.

Categories: Ideas, projects

Updated plots for Community Structure

December 14th, 2010 No comments

In the last meeting I had with Pádraig we went through the spectral clustering technique, and discovered that there were some errors in my calculations of the the clusters. I took the time to re-do the community calculations,  and to run the tests again, to make sure all is correct. I used the MIT  dataset, and constrained it to the month of October 2004 (as before). Messages are created and initially sent at the first time step. The selected communities for each week are taken as the biggest group, which in practice were roughly the same set of connected nodes. I chose to omit the nodes with a value of 0.0 in the matrix (V) (see  data pipelines).

Delivery Ratio over time


This plot shows the delivery ratio over time for MIT-OCT dataset, for LBR communities (where the community is taken from weeks 1, 2, 3, 4 and all 4 weeks), PBR, Prophet, Unlimited Flood and Random(1.0, 0.2, 0.0).

Final Delivery Ratio and Cost


This chart shows delivery ratio and cost for each Protocol, note that unlimited flood does not report cost.

In these updated plots, it is encouraging to see that with the corrected community structure, LBR actually performs well, and in fact for the community structure in week 1, it performs better overall than PBR, with weeks 2, 3, 4, all, close behind. It is also good to see that LBR is beating Random 0.0, and 0.2 consistently. LBR week 1 Community also tracks a small way behind Flooding for a short while around 05/10 – 08/10 which is quite interesting. The communities are listed below (or can be viewed here), for further analysis.


Having given this a little more thought, this is a little mis-leading, so I have generated new plots where all Protocols are constrained to the same community nodes. Unfortunately, this yields a similar spread of results between the protocols, as did the results that did not consider community structure. Only the structure discovered in week 1 increases the delivery ratio.

Delivery Ratio for each community, for each protocol


This plot shows delivery ratios for each community generated from different weeks and all weeks in Oct 2004 for the MIT Reality Mining dataset


This shows the final delivery ratios and costs for each protocol, for MIT OCT 2004, communitites

What I plan to do now is to go back to the ranking algorithm, and use the linked cells mechanism to re-calculate the rankings. It might also be useful to start thinking about being able to use the real-time location information (in this case cell towers/linked cell towers) in a routing algorithm directly. This would let us start to implement some simplistic location predictions.

I also created a clearer graph visualisation for the cell tower links, this shows that there is a componant that  encompasses a large percentage of the reported cells overall. If we were to consider this one location, it is possible that it covers a very large area.


This shows the cell towers that are connected when at least 100 messages are passed and reported at both cell towers, edges are numbered where this is more than 300.

Cell tower connections based on contact events

December 10th, 2010 No comments

Using the log of hop events, I was able to generate a list of cell towers being used at the time of the hop, from this I built a list of the cell towers that are linked in this way, with a count of the number of times they had been connected (linked_cell list txt, or linked_cells json version ). From this a created a graph where the weight of the edges is the count.

connected-cell-towersEdges with a weight (or number of times nodes passed a message and reported different cell towers) less than 100 are removed for clarity, the colour of the edges shows the weight of the edge increasing from blue to green.

The tool that I have used to visualise these graphs is called the network workbench, and it reports the following about the graph:

Nodes: 84
Isolated nodes: 9
Node attributes present: label
Edges: 149

No self loops were discovered.
No parallel edges were discovered.
Edge attributes:
Did not detect any nonnumeric attributes
Numeric attributes:
min    max    mean
weight     1    2700    103.42953

This network seems to be valued.

Average total degree: 3.547619047619045
Average in degree: 1.7738095238095228
Average out degree: 1.7738095238095228
This graph is not weakly connected.
There are 14 weakly connected components. (9 isolates)
The largest connected component consists of 67 nodes.
This graph is not strongly connected.
There are 61 strongly connected components.
The largest strongly connected component consists of 24 nodes.

Density (disregarding weights): 0.02137
Additional Densities by Numeric Attribute
densities (weighted against standard max)
weight: 2.21041
densities (weighted against observed max)
weight: 0.00082


Edges are coloured linearly from white to black depending on weight, edges with a weight greater than 300 are labelled.

It seems that there are is a group of towers that are strongly connected, by large number of messages being passed, and nodes reporting them as being the cell towers used. One possible reason for this,  is perhaps because the places where the nodes are, are popular, and as such, require more cell towers to cover the demand. These results are promising, because it means that if we pre-process the data, and ignore connections where there is a low weight, we can use groups of towers to give a notion of place. What this means is, when a node reports a tower in the known cluster of towers, we can assume that it is in roughly the same place as any other node who also reports a tower in the same cluster.

Community Clustering part 2

December 8th, 2010 No comments

I calculated the communities as described  in data-pipelines and from this generated two lists of nodes for every week in October, and for all weeks.  I took the largest list for each, and used that as the community for which to test the LBR algorithm.


Updated plot, showing the delivery ratio for LBR-Community, one (dotted) line for the communities found in each week, and overall.

It seems that, as expected, routing within the community gives an advantage to the algorithm, and in fact, when the community knowledge is derived from the full dataset, then LBR performs better that the average of the rotated dataset. This plot is slightly misleading however, as the data for the previous algorithms is the rotated averages, whereas the data for LBR-Community is for straight runs. Below is the plot against straight runs of each other algorithm.


This plot shows straight runs of all algorithms, and the updated LBR-Community runs.

In this plot, it is obvious that the difference is less profound as with the average plots, however, I am considering running the rotation scheme again for LBR-Community.

LBR Location Analysis – MIT Dataset (Oct)

December 8th, 2010 No comments

I wanted to see what the real picture was when using cell towers for location, so for a run of LBR on the MIT Dataset for October, I collected details about every message hop. I compared the timestamps for the hops to the cell tower information in the database, to find out which cell tower each node involved in the message hop was recorded as using.

Colocated: (same cell) 2247

Not-Colocated: (different cells) 15688

One sided: (only one sees cell) 13452

No data: (no cells seen) 3211

Count: (total number of hops evaluated) 31387

These results are slightly concerning, as roughly half of the hops between nodes happened when nodes reported different locations (cell towers ID),  10% of hops did not record any cell tower information at all, and in 42% of hops, only one node reported a location. This seems to demonstrate that we cannot rely on cell tower information to give us an accurate idea of the co-location of people. However, it might still prove useful to use the cell tower location for generating general mobility statistics, which could be used as routing metrics. Davide mentioned he had some ideas about these sort of statistics.

Categories: Uncategorized

Community Clustering part 1

December 1st, 2010 No comments

I have visualised the graph of connections in the MIT Dataset, for the month of October split into 5 parts: Weeks 1 to 4, and the whole month. Each week represents a new set of edges, built after the end of the previous week. So each graph shows the edges that are formed only in that week.


The list of edges and matrix data is in the zip file here: mit-graph.

Categories: Uncategorized