Archive for October, 2010

Graham’s Thesis

October 27th, 2010

I was lucky enough to read Graham Williamson’s thesis draft. It reads very well, and it justifies my line of research, i.e. enhancing existing routing protocols with location information.

Graham’s Persistence Based Routing (PBR) is efficient while remaining reasonably reliable. I believe it is possible to further enhance his routing protocol with location knowledge. Using his hybrid scheme (Binary Spray and Focus + PBR) as a basis, I believe it can be made more efficient and more reliable.


His hypothesis is:

Link persistence is an effective measurement as the basis for building a practical and flexible routing protocol in human contact networks in order to maximise delivery ratio.

He lists his main contributions as follows:

  • An analysis of the metrics which form the basis for routing in delay-tolerant networks (Chapter 4).
  • A procedure for generating synthetic contact traces which include periodicity and well-defined community structure (Chapter 5).
  • A study of structural metrics for routing and the effect of community-awareness on routing performance (Chapter 6: Section 6.2).
  • A self-managing mechanism for adapting window sizes inferred from periodicities observed in the contact pattern of nodes (Chapter 6: Section 6.3).
  • A metric, and protocol based on this, for routing in human contact networks which is based on the persistence of links (Chapter 7).
  • A hybrid degree-persistence protocol able to maintain performance with limited routing information (Chapter 7: Section 7.4).

He identifies four areas for future research:

  1. Augmenting his PBR scheme with a message replication phase and identifying situations that respond well to this
  2. Using composite metrics to balance competing routing demands
  3. Investigating the combination of link persistence with link prediction
  4. Considering other dissemination goals such as publish-subscribe and anycast


ContactSim is Graham’s powerful discrete contact event simulator, which he used to process data, run his experiments and generate results. I had a preview of this a few months ago, but have not looked at it since. However, as he has now documented some of the main aspects of it, it may be possible to adapt this simulator for our own research, by augmenting it to include a notion of location. I will contact Graham for advice on how he thinks this could be achieved. The source code may be available in the group repository, so I will start investigating this.

Meeting with Pádraig and Davide 21 Oct 2010

October 21st, 2010

Talked again about location clustering and proximity methods, and Pádraig suggested we try to get past this, and get on with the interesting stuff.

To simplify things, we decided that it is probably best to use the grid partitioning system for considering co-location, as it gives us an implicit referencing scheme, and by considering neighboring locations to also be co-located, we can resolve the problem of split places.
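
As a sketch of how this might work (the cell size of 0.001 degrees, roughly 100 m of latitude, is an assumption, not a value we have fixed), two readings fall in the same place if their grid cells are identical or neighbouring:

```python
import math

def cell_id(lat, lon, cell_deg=0.001):
    """Implicit grid reference for a lat/long reading.

    cell_deg is an assumed cell size (~100 m in latitude), not a
    value we have settled on yet.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def co_located(reading_a, reading_b, cell_deg=0.001):
    """Same or neighbouring cell counts as co-located, which resolves
    the problem of a place being split across a cell boundary."""
    row_a, col_a = cell_id(*reading_a, cell_deg)
    row_b, col_b = cell_id(*reading_b, cell_deg)
    return abs(row_a - row_b) <= 1 and abs(col_a - col_b) <= 1
```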

With regard to the proximity calculations, we still need to verify the meaning of the signal value, as it does not necessarily represent signal strength. We decided to consider any common WiFi access point sighting to be a proximity event, and will record the start and end times of these events, so that we can use metrics such as duration of contact in our routing scheme(s).

Davide offered to find out more about the signal strength issue, so that we can refine this reliably later on.

Pádraig suggested that I/we need to find some good conference targets, ideally around January. I suggested ICUMT and MobiHoc, and Davide reminded me about a special issue journal call for papers that I spotted recently (Call for papers: PhyCom). Ubicomp 2011 is a possibility too.

By next week, I want to have the space partitioning method sorted out and all contact/proximity events recorded, as well as an idea about the simulated on-device data structure and some ideas about routing schemes.

Location clustering so far.

October 21st, 2010

I have had a chance to look at the code that Derek sent me, namely the CFinder software for finding k-cliques, and an implementation of the MDS algorithm. I struggled to get these to make any sense of the data that I have, and started to think about an alternative approach.

Despite the decision we made in a previous meeting to consider location readings globally, regardless of order/time, I decided to use the nature of the data to inform the clustering algorithm, namely using the time ordering to indicate the relationship between readings. This meant going back to the idea of using readings from individuals to determine the places that are relevant to them. This may cause issues when comparing places from different individuals, but no more than when implementing a simulated (or real-world) system, where individuals track their own movement patterns.

By taking time ordering into consideration, we can implicitly detect when a node is moving from one place to another, by measuring the distance between the current reading, and the centroid of previous readings from a known place.

- take each location reading for a person, and measure the distance
  between it and the centre of the current cluster
  - if the distance of the reading from the centre of the current cluster
    is greater than the mean distance of points in the cluster to the
    centre + alpha, then it is outside of the cluster:
    - read ahead by X readings (to make sure this is not just an
      erroneous reading)
    - if all X readings are beyond the threshold of the current cluster:
      - then we have left the location:
        - record the current cluster if it has more than
          N location readings
        - start a new cluster with the current reading
    - else:
      - discard the erroneous reading
  - else:
    - add the reading to the list of cluster points
    - re-calculate the cluster centre

Where alpha is a parameter that allows the cluster to grow; alpha can be fixed (say 50m) or a function of the mean distance to the centre of the cluster. The number of readings to read ahead (X) should probably be fixed at, say, 10, so that if the person goes away and comes back very quickly, they are considered not to have left.

The minimum number of readings required to consider a cluster valid should be relatively large, say 10 or 20, so that only places where the user lingers for some time are considered. If we use a number that is too high, we will miss some important places; a number that is too small would mean that places where users just pause briefly are recorded, which may not be important places.

The algorithm still needs improvement, so that it also considers the time between readings; for example, rather than considering the number of readings in a cluster, consider the amount of time between the first and most recent reading, and use this to determine whether a place is recorded. This means that we should be able to record important places even if readings were not taken for some reason.
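
The read-ahead clustering above might be sketched as follows; this is my own rendering, using planar distances in metres, with the parameter defaults taken from the suggested values (alpha = 50 m, X = 10, N = 10):

```python
from math import dist  # planar distance; good enough for a sketch

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def mean_dist(points, centre):
    return sum(dist(p, centre) for p in points) / len(points)

def cluster_places(readings, alpha=50.0, lookahead=10, min_size=10):
    """readings: time-ordered (x, y) positions for one person, in metres.

    alpha, lookahead (X) and min_size (N) are the tunable parameters
    discussed in the text; returns the centres of the recorded places.
    """
    places = []
    cluster = [readings[0]]
    i = 1
    while i < len(readings):
        centre = centroid(cluster)
        threshold = mean_dist(cluster, centre) + alpha
        if dist(readings[i], centre) > threshold:
            # read ahead to check this is not just an erroneous reading
            window = readings[i:i + lookahead]
            if len(window) == lookahead and all(
                    dist(r, centre) > threshold for r in window):
                # we have left the location: record the cluster if big enough
                if len(cluster) >= min_size:
                    places.append(cluster)
                cluster = [readings[i]]  # start a new cluster here
            # else: discard the erroneous reading (skip it)
        else:
            cluster.append(readings[i])  # grow the current cluster
        i += 1
    if len(cluster) >= min_size:
        places.append(cluster)
    return [centroid(p) for p in places]
```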

Data Analysis – Co-location using WiFi access points

October 19th, 2010

In the last post, I described a method for deciding on co-location based on sightings of common WiFi access points, which co-located users if they saw the same WiFi access point at roughly the same time. We refined this method to discriminate based on the approximate distance, inferred from the signal reading (which could be signal strength, or signal quality). Pseudocode for this algorithm:

  • For each consecutive timestep window between start and end:
    • Retrieve all readings within that window, for all users.
      • For each user:
        • pick the strongest signal from the list of readings for each access point (as there may be multiple)
        • rank the readings in order of signal strength
        • discard the readings for access points where:
          • signal strength divided by the strongest signal strength is less than alpha
        • For each other user:
          • perform the same signal processing as above
          • if the intersection of the two users’ remaining lists of access points is non-empty, consider them co-located

The variables in this algorithm are the start and end time, the size of the window and the alpha value.
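
A minimal Python sketch of this windowed check, assuming readings have already been grouped per user for one window, and that a larger value means a stronger signal (as discussed below, this may actually be inverted in our data):

```python
from itertools import combinations

def filter_readings(readings, alpha):
    """readings: {ap_id: [signal, ...]} for one user in one window.

    Keep, per access point, the strongest signal, then drop APs whose
    strength falls below alpha times the user's strongest signal.
    (Explicit ranking is unnecessary: the ratio test implies it.)
    """
    best = {ap: max(sigs) for ap, sigs in readings.items()}
    strongest = max(best.values())
    return {ap for ap, s in best.items() if s / strongest >= alpha}

def co_locations(window_readings, alpha):
    """window_readings: {user: {ap_id: [signal, ...]}} for one window.

    Returns the set of co-located user pairs for that window.
    """
    visible = {u: filter_readings(r, alpha)
               for u, r in window_readings.items() if r}
    pairs = set()
    for (u1, aps1), (u2, aps2) in combinations(visible.items(), 2):
        if aps1 & aps2:  # a shared access point survives the filter
            pairs.add(frozenset((u1, u2)))
    return pairs
```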

There were a few factors to consider before we could implement this. Firstly, it was important to find out what the values for signal strength actually meant; the data ranged from 11 to 110. I used an N95 phone, with the Campaignr software installed, to record a number of readings at varying distances from known access points. I instructed the software to upload to a script that simply recorded the data in a text file. This showed that the ‘signal_strength’ value increases as the phone moves away from the access point. Another factor to consider was the range of values reported in the database. Davide suggested plotting the values of all signal strengths for all readings against the readings that would be selected using a given alpha value. The figure below shows the data for alpha values of 0 (all readings), 0.8, and 1.0 (only the strongest readings).

The second figure shows all readings, split into individuals.

The next step is to decide what alpha value to use, and store the co-locations in the database. To this end, I have written a script to parse all 3.6 million records, calculate the alpha value (strongest signal strength over signal strength) and write the value back into the database. This will allow us to quickly see the effect of using different alpha values for decision making, without having to repeat the calculations every time. It will also allow us to plot the effect of different alpha values against each other very quickly.
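
A sketch of such a back-write script, with hypothetical table and column names, grouping per user per timestep; because the reported value increases with distance, the strongest signal in a group is taken to be the smallest value:

```python
import sqlite3
from collections import defaultdict

def write_back_alpha(conn):
    """Compute alpha = strongest / signal for every WiFi reading and
    store it in an 'alpha' column.

    The table and column names here are hypothetical placeholders.
    Since the reported value *increases* with distance, the 'strongest'
    signal in a (user, timestep) group is the *smallest* value.
    """
    rows = conn.execute(
        "SELECT id, user_id, timestep, signal_strength FROM wifi_readings"
    ).fetchall()
    strongest = defaultdict(lambda: float("inf"))
    for _id, user, step, sig in rows:
        strongest[(user, step)] = min(strongest[(user, step)], sig)
    conn.executemany(
        "UPDATE wifi_readings SET alpha = ? WHERE id = ?",
        [(strongest[(user, step)] / sig, _id)
         for _id, user, step, sig in rows],
    )
    conn.commit()
```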

Data clustering – initial views

October 8th, 2010

I have started to look at clustering the location readings into important places. To give me a feel of what the data looks like, I have visualised all of the locations.

Because of the limitations of the browser, it was not possible to visualise all ~40,000 points in the dataset, so it is split into 4 lots of ~10,000. These images are only intended to give a rough idea of how the data is spread out.

My initial thoughts are that there is too much density, and therefore the clustering algorithm has to be highly selective when choosing a place.

Categories: Uncategorized

Meeting with Pádraig and Davide 6th Oct 2010

October 6th, 2010

Met to discuss the next steps.

Pádraig described a method for testing the quality of a community (i.e. whether it has been partitioned correctly) by means of an infinite(?) random walk, described in a notation based on Huffman(?) codes. The length of the descriptions between communities determines the quality of the partitioning, in that a random walk should spend much more time in an actual community, as there are few links out of it. He related this to BUBBLE Rap, in that it might be important to test the quality of the communities used in the algorithm, to test how good the community finding algorithm is (e.g. if done on the fly).

We also spent some time discussing how to define a place, from a series of location readings. Firstly we need to consider whether we determine a location in terms of an individual node, or the global set of readings. After some discussion we decided for the purposes of this simulation, we will determine it globally. This will at least give us an idea about what the data looks like. If it turns out that this seems to produce ‘places’ that encompass too wide an area, we should consider revising this decision.

Secondly, we need to find a means of partitioning the space. Last week we discussed clustering the readings using a k-clique algorithm. I contacted Derek Greene, who supplied me with some links and advice about achieving this. His feeling was that this is more a distance-based clustering problem with weighted links, rather than an unweighted k-clique problem (I have yet to explore this), and he suggested I initially explore the data using something like MultiDimensional Scaling (MDS), pointing me to a Java-based implementation.

Davide suggested another approach to the partitioning of locations: a grid-based approach, splitting the landscape based on latitude/longitude and a size dimension, effectively making a grid of the planet, where all nodes implicitly know that a lat/long is part of a location if it falls within a given grid square.

These two methods have their benefits: the first lets us get a really good idea of locations based on the data we have, in a fairly straightforward way, without the worry of partitioning the same place into two locations; the second means that all nodes could calculate all possible locations and refer to them when necessary. We decided in the end to use a clustering algorithm.

I talked through my initial analysis of the Social Sensing Study data, and described how I had found hints of a Bluetooth collection problem: most devices had a very low number of device readings, i.e. there were very few occasions when they actually detected Bluetooth devices. This means that using this dataset for detecting proximity using Bluetooth alone may not be very realistic.

Entity   WiFi    WiFi %   BT      BT %   Location   Location %
1        10545   12       2416    3      10905      12
2        70724   79       18980   21     74522      83
3        55242   62       10303   12     72661      81
4        20624   23       2496    3      22760      25
5        27171   30       5745    6      29265      33
6        62719   70       6251    7      66243      74
7        37571   42       5916    7      46798      52
8        18284   20       6081    7      22155      25
9        43976   49       3406    4      44720      50
10       40316   45       8976    10     43216      48
11       57190   64       15957   18     70937      79
12       76976   86       1542    2      79010      88
13       43285   48       7706    9      47767      54
14       29630   33       20801   23     30127      34
15       40417   45       6665    7      48979      55
16       26864   30       1088    1      27693      31
17       23075   26       13154   15     29570      33
18       52615   59       11070   12     54305      61
19       60182   67       5251    6      64364      72
(for each type of data, the number and percentage of readings out of roughly 89,280 possible readings – one every 30 seconds for 1 month)

I suggested an alternative solution, using WiFi co-location records instead (see the paper (cite?!!) that used a similar thing in their evaluation). I envisage using the logs of WiFi access points in view to determine whether one device is co-located with another. There are three mechanisms for doing this that I thought of afterwards:

  1. If two devices can see the same WiFi access point, then they are proximate
  2. If two devices can see the same WiFi access points, and the one with the strongest signal is the same, then they are proximate
  3. If the lists of WiFi access points that the two devices can see overlap by at least N, then they are proximate

Davide suggested an algorithm that combines these three, to account for situations where two nodes are between two WiFi access points, close to each other, but where the signal strengths would result in them not being considered co-located under mechanism 2 above. It is based on the difference between the signal strengths of the top 2 (or, extended, top N) access points.
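
The three mechanisms can be written as simple predicates (a sketch; the names are mine, and ‘strongest’ here assumes larger values are stronger, which may need flipping for our data):

```python
def proximate_common_ap(aps_a, aps_b):
    """Mechanism 1: any shared access point means proximate."""
    return bool(set(aps_a) & set(aps_b))

def proximate_same_strongest(readings_a, readings_b):
    """Mechanism 2: shared APs and the same strongest AP.

    readings are {ap_id: signal}; larger is assumed stronger here.
    """
    if not (set(readings_a) & set(readings_b)):
        return False
    return max(readings_a, key=readings_a.get) == \
           max(readings_b, key=readings_b.get)

def proximate_overlap(aps_a, aps_b, n):
    """Mechanism 3: at least n shared access points."""
    return len(set(aps_a) & set(aps_b)) >= n
```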

We decided that this was worth investigating further.

We talked about an abstract database structure, into which any dataset of this nature could be placed, so that we can compare different datasets. It was much the same as before.

Entity Table
  Entity ID
  Identifier (BT MAC address?)

Contact Table (effectively lists edges between nodes)
  Contact ID (instance ID)
  Entity ID (source)
  Entity ID (destination)
  Contact start time
  Contact end time

Places Table
  Location ID
  Entity ID (the entity that knows this location)
  Location centre (lat/long/alt(?))
  Location radius in metres
  Location Name (?)
Pádraig suggested that we will need to think about the on-node data structure at some point soon, but that it will be determined by the way that nodes choose to communicate.

Node A:
  Visits locations: L1, L2, L3, L4
  Sees persons: p1, p2, p3
Node B:
  Visits locations: L7, L2, L1, L9
  Sees persons: p2, p8, p6
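
A minimal sketch of that per-node state (the field names are hypothetical), with the kind of comparison two nodes might make on contact:

```python
from dataclasses import dataclass, field

@dataclass
class NodeKnowledge:
    """Sketch of the per-node state in the example above."""
    visited_locations: list = field(default_factory=list)
    seen_persons: list = field(default_factory=list)

node_a = NodeKnowledge(["L1", "L2", "L3", "L4"], ["p1", "p2", "p3"])
node_b = NodeKnowledge(["L7", "L2", "L1", "L9"], ["p2", "p8", "p6"])

# On contact, a pair of nodes might compare knowledge, e.g. shared locations:
shared = set(node_a.visited_locations) & set(node_b.visited_locations)
```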

However, we decided to concentrate first on getting the places clustered, and getting the proximity records converted into contact windows/periods.

So for the next meeting I said that I would try to have the clustering algorithm implemented and tested on the dataset, as well as checking the Bluetooth records issue and, assuming the data is missing, using the WiFi mechanism to derive contact windows/periods. I also said I would think about the data structure for the nodes.

Categories: Uncategorized

Ideas about Entities, Contacts and Places

October 4th, 2010

To define a format for storing data about interaction and movement of people, it might be useful to start with an abstract view of what the data means. This may help when making policies regarding how data is manipulated.

There are some decisions to make regarding how to interpret different situations:


  1. We assume that an entity is defined solely by their mobile device.
  2. We assume that in most cases the entity carries that device with them.


  1. If two entities see each other (via Bluetooth), we consider them to be able to communicate.
    • are there some time constraints?
  2. If only one entity can see the other, this may still mean that they are able to communicate.
  3. If an entity sees an unknown device, does it still:
    • Attempt to communicate with it?
    • Consider it in its degree calculations?
    • Share data about it?


  1. Entities calculate the meaningful places that they visit.
    • What algorithm do they use to calculate the place?
    • But when do they update this information?
  2. Do entities also track routes between places, and how do they calculate this?
    • What are the characteristics of a route?
  3. Do entities conform to a shared view of places and location? or do they keep their own knowledge of places, and attempt to find overlaps with their peers?
Categories: Ideas