Data Pipelines

March 29th, 2011

A few posts have been exploring the outcome of various simulations, all of which have been derived from some existing datasets. I thought it might be a good idea to document the pipelines that I have employed to get to this point, so that I can identify any parts that could be failing without my realising.

Firstly, the datasets that I have been analysing are as follows:

MIT Reality Mining Dataset

This is probably the most important dataset that we have used. For the purposes of the simulation, the data has come from the same source, but in two separate formats: Graham’s internal format for ContactSim, and my own version, which was simply a MySQL dump from the authors with a few small tweaks. Another version has been made available by Martin Harrigan, who contacted the authors directly and was provided with a Matlab version of the data; however, I have not yet explored this.

  author = {Nathan Eagle and Alex (Sandy) Pentland},
  title = {{CRAWDAD} data set mit/reality (v. 2005-07-01)},
  howpublished = {Downloaded from},
  month = jul,
  year = 2005

For the purposes of the simulation, Graham had this to say:

Oh, and yep, the dataset was generated from devicespan table with an SQL query I have now totally forgotten. IIRC there was some quirk in the data that meant that entries in the table link device IDs to people IDs so you had to join one of the other tables and look up the person’s device ID… But no, it doesn’t use any of the other information in the dataset, just the bluetooth contacts. Any inconsistent data is thrown away (e.g. there are a good number of sightings where the end time is before the start time). I made a decision that with brief encounters where the start and end time is the same in the dataset that I would assume a contact of a fixed minimum duration occurred — the scan time is 5 minutes so you could choose something like 2.5 minutes on average but I think I went for just a second. Or you could just throw them away… This is implemented the MITDataset class IIRC.

The data for the simulator is held in a file called resultsset.csv in the format below, which simply lists the start and end timestamps and the two nodes that are connected.


"2004-07-23 12:45:11","2004-07-23 12:45:11",100,108
"2004-07-23 12:26:46","2004-07-23 12:51:41",100,88
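The cleaning rules Graham describes can be sketched roughly as below. This is not the MITDataset class itself, only an illustration in Python: the function name clean_contacts and the constant names are my own, the one-second minimum duration follows his "I think I went for just a second", and the third sample row (end before start) is invented to exercise the discard rule.

```python
import csv
import io
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
MIN_DURATION = timedelta(seconds=1)  # assumption: "just a second"


def clean_contacts(rows):
    """Yield (start, end, node_a, node_b) tuples from four-column rows."""
    for start_text, end_text, node_a, node_b in rows:
        start = datetime.strptime(start_text, TIME_FORMAT)
        end = datetime.strptime(end_text, TIME_FORMAT)
        if end < start:
            continue  # inconsistent sighting: thrown away
        if end == start:
            end = start + MIN_DURATION  # brief encounter: fixed minimum duration
        yield start, end, int(node_a), int(node_b)


# The first two rows are the examples above; the third is hypothetical.
sample = io.StringIO(
    '"2004-07-23 12:45:11","2004-07-23 12:45:11",100,108\n'
    '"2004-07-23 12:26:46","2004-07-23 12:51:41",100,88\n'
    '"2004-07-23 13:00:00","2004-07-23 12:59:00",100,90\n'
)
contacts = list(clean_contacts(csv.reader(sample)))
for contact in contacts:
    print(contact)
```

Choosing 2.5 minutes instead would only mean changing MIN_DURATION.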


This data came from the same source, but has been used separately by Graham and myself; however, I have not used any of my version of the dataset for LocationSim.

Simulation Pipeline

For the simulator, the pipeline is simple: at the start of each simulation run (there can be several), the dataset is read in from the file, and the contact events are ordered by time. Nodes are generated in one of three ways:

  • The CORE set is determined by the configuration file, in the node-list property. If node-list is set to CORE, the simulator uses the NUMBER_IN_STUDY variable to determine how many nodes to load, from 1 to N.
  • When ALL is set in the config file, the simulator loads the dataset and calculates the node ids directly from the data.
  • Alternatively, a comma-separated list of node ids can be provided in the config file, and only these nodes are counted in the simulation run.

The above applies to the MIT Reality Mining dataset; I have not fully investigated the others yet.
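As a rough illustration of the three node-list modes, here is a Python sketch. It is not the simulator's code: select_nodes and its arguments are names I have made up, and contacts are assumed to be (start, end, node_a, node_b) tuples.

```python
def select_nodes(node_list, contacts, number_in_study):
    """Return the set of node ids to simulate.

    node_list mirrors the node-list property: "CORE", "ALL", or a
    comma-separated string of node ids.
    """
    setting = node_list.strip()
    if setting.upper() == "CORE":
        # CORE: nodes 1..N, where N is NUMBER_IN_STUDY
        return set(range(1, number_in_study + 1))
    if setting.upper() == "ALL":
        # ALL: every node id that appears in the contact data
        nodes = set()
        for _start, _end, node_a, node_b in contacts:
            nodes.update((node_a, node_b))
        return nodes
    # Otherwise: an explicit comma-separated list of ids
    return {int(part) for part in setting.split(",") if part.strip()}


# Hypothetical contacts; strings stand in for timestamps here.
contacts = [
    ("2004-07-23 12:45:11", "2004-07-23 12:45:12", 100, 108),
    ("2004-07-23 12:26:46", "2004-07-23 12:51:41", 100, 88),
]
ordered = sorted(contacts, key=lambda contact: contact[0])  # time-order the events

print(select_nodes("CORE", ordered, 3))
print(select_nodes("ALL", ordered, 3))
print(select_nodes("88, 100", ordered, 3))
```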

Metrics Pipeline

To generate the metrics that I have used for LBR, I used my version of the dataset, held in a MySQL database, to generate the location metric from the cell towers; the details are discussed in this post. The data used for calculating the routing metric comes from the cellspan table, of which the following is an example row:

oid    endtime                starttime              person_oid    celltower_oid
       2004-07-23 12:23:56    2004-07-23 12:22:34    94            38

In order to make this available to the simulator, I created a script that takes the values and generates the XML to copy and paste into the configuration file, listing the metric value for each node, for example:

  <property name="user_76" value="872285082" />
  <property name="user_44" value="836721726" />
  <property name="user_84" value="824237644" />
  <property name="user_92" value="802016509" />
  <property name="user_69" value="728832754" />

This is then available to the simulator without the need for additional coding.
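The generation step might look something like the sketch below. It is written in Python for illustration and is not the original script; property_lines is a made-up name, and sorting by descending value is an assumption based only on the ordering of the example lines.

```python
from xml.sax.saxutils import quoteattr


def property_lines(metrics):
    """Turn {node_id: metric_value} into <property> lines for the config file."""
    lines = []
    # Assumption: highest metric first, as in the example output.
    ranked = sorted(metrics.items(), key=lambda item: item[1], reverse=True)
    for node_id, value in ranked:
        name = quoteattr(f"user_{node_id}")  # quoteattr adds the surrounding quotes
        lines.append(f"<property name={name} value={quoteattr(str(value))} />")
    return lines


metrics = {44: 836721726, 76: 872285082, 84: 824237644}
print("\n".join(property_lines(metrics)))
```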

Community Finding Pipeline

The graph visualisations in the last post were obtained by running the simulator and instructing it to perform the ‘OutputAggregateGraphTask’ task. Each time the task is scheduled to run (weekly, or in another run, only at the end), it analyses the contacts between individual nodes in the period since the start (or since the last run of the task), and generates an edge between two nodes if they have received an onContact with each other. When the task runs, it writes a log file named for the timestamp it is recording at, e.g. 20041024_2348.dat, which is simply a list of edges.
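A minimal Python sketch of what such an aggregation step does is given below. It is not the OutputAggregateGraphTask code: the function names are invented, contacts are assumed to be (start, end, node_a, node_b) tuples, and treating a contact as belonging to a period when its start time falls inside it is my assumption.

```python
from datetime import datetime


def aggregate_edges(contacts, period_start, period_end):
    """Return sorted undirected edges for contacts starting in the period."""
    edges = set()
    for start, _end, node_a, node_b in contacts:
        if period_start <= start < period_end and node_a != node_b:
            # Store each pair once, whichever node saw the other
            edges.add((min(node_a, node_b), max(node_a, node_b)))
    return sorted(edges)


def log_file_name(timestamp):
    """Name the log file for the timestamp being recorded."""
    return timestamp.strftime("%Y%m%d_%H%M") + ".dat"


# Hypothetical contacts for illustration
contacts = [
    (datetime(2004, 10, 20, 9, 0), datetime(2004, 10, 20, 9, 5), 100, 88),
    (datetime(2004, 10, 21, 9, 0), datetime(2004, 10, 21, 9, 5), 88, 100),
    (datetime(2004, 11, 2, 9, 0), datetime(2004, 11, 2, 9, 5), 100, 108),
]
run_time = datetime(2004, 10, 24, 23, 48)
edges = aggregate_edges(contacts, datetime(2004, 10, 17, 23, 48), run_time)
print(log_file_name(run_time), edges)
```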

I used the Network Workbench to generate the visualisations, using the force-directed (with annotation) algorithm and setting the distance measure high to encourage it to settle.

I am yet to publish the community finding results (using spectral clustering), but I have processed the data in the way discussed with Pádraig and detailed in his slides (pages 2-7). I automated a number of the processes by writing a script (called generateMatrixFromGraph.php) which takes in a list of log files generated by the graphing task (above). It generates a matrix and the associated commands for Octave (an open source numerical processor, similar to Matlab), and writes a file with the data (e.g. All Weeks 20041024_2348.matrix.dat).
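The matrix-building part of that script could be sketched as follows. This is Python rather than the original PHP, and several things are assumptions: the function names, that edges arrive as pairs of node ids, and that the matrix is a symmetric 0/1 adjacency matrix (the slides may call for a weighted or otherwise transformed matrix). The Octave commands themselves are not reproduced here.

```python
def build_matrix(edge_lists):
    """Combine several edge lists into a symmetric 0/1 adjacency matrix."""
    node_ids = sorted({node for edges in edge_lists for edge in edges for node in edge})
    index = {node: position for position, node in enumerate(node_ids)}
    matrix = [[0] * len(node_ids) for _ in node_ids]
    for edges in edge_lists:
        for node_a, node_b in edges:
            matrix[index[node_a]][index[node_b]] = 1
            matrix[index[node_b]][index[node_a]] = 1
    return node_ids, matrix


def matrix_text(matrix):
    """Render the matrix as space-separated rows, one row per line."""
    return "\n".join(" ".join(str(value) for value in row) for row in matrix)


# Two hypothetical weekly edge lists
week_1 = [(88, 100)]
week_2 = [(100, 108)]
node_ids, matrix = build_matrix([week_1, week_2])
print(node_ids)
print(matrix_text(matrix))
```

Keeping node_ids alongside the matrix matters later, because the rows and columns of the Octave output have to be mapped back to node ids.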

It then runs the Octave program, passing it the generated code file, which in turn writes the results for the variables S and V to two files (e.g. All Weeks 20041024_2348.octave_output.S.txt). I then imported these space-separated files into MS Excel so that I could manipulate the data to find the clusters. I numbered each row/column with the node id, then sorted the data by the column of V whose corresponding column in S is the first to have a non-trivial value. For example, column 18 in week 3 gives a list of values for nodes which, when ordered, splits into negative and positive numbers; the communities are determined by positive vs negative values.
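The manual Excel step could be approximated with a short Python sketch like the one below. It does not compute S and V; it only takes them as input. The function name, the tolerance used to decide what counts as "non-trivial", and the example numbers are all assumptions for illustration.

```python
def split_by_sign(node_ids, s_diagonal, v_columns, tolerance=1e-9):
    """Split nodes into two groups using the first non-trivial column.

    s_diagonal[k] is the k-th diagonal value of S, and v_columns[k] is
    the matching column of V, with one entry per node in node_ids.
    """
    for value, column in zip(s_diagonal, v_columns):
        if abs(value) > tolerance:  # assumption: non-trivial means non-zero
            ranked = sorted(zip(column, node_ids))
            negative = [node for entry, node in ranked if entry < 0]
            positive = [node for entry, node in ranked if entry >= 0]
            return negative, positive
    raise ValueError("no non-trivial value found in S")


# Hypothetical values: the first column is trivial, the second splits the nodes.
node_ids = [1, 2, 3, 4]
s_diagonal = [0.0, 0.5]
v_columns = [[0.5, 0.5, 0.5, 0.5], [-0.6, -0.4, 0.3, 0.7]]
negative, positive = split_by_sign(node_ids, s_diagonal, v_columns)
print(negative, positive)
```

Automating this step would also make it easier to check that the node ids attached to each row are the same ones used in the visualisation.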

I am concerned that the nodes discovered in the visualisation of the network are not the same nodes that are partitioned by the spectral analysis.

This post is perhaps still a work in progress, or I might transfer it to a stand-alone page, which I can use to track similar data manipulations.

MOSES Community Finding and Allocation Pipeline

Nodes represent cell towers, and are linked based on user co-location, as described here. An edge list is produced, which is passed to the MOSES program (run on a server as a binary). MOSES outputs multiple grouped lists of nodes which represent communities. These communities overlap, so any given node may be a member of multiple communities.

When visualised, this network looks like a furball, and due to the large overlap of some communities, it does not make sense to treat each community as a location. Therefore we applied an allocation algorithm to the data, which associated any given node with a maximum of 3 locations. These locations are then scored based on the total number of reports for every node which is a member of the community/cluster/location. Each user in the dataset is then scored based on the number of times they have reported any node within the community. This results in a score for every user, which can be used as a ranking for the Location Based Routing/Ranking algorithm.
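The two scoring steps can be sketched in Python as follows. This covers only the scoring, not the allocation algorithm or MOSES itself; the function names and data shapes are assumptions (communities as a mapping from location id to a set of cell tower ids, reports as (user, tower) pairs), and the example data is invented.

```python
from collections import Counter


def score_locations(communities, reports):
    """Score each location by the total reports of its member towers."""
    tower_counts = Counter(tower for _user, tower in reports)
    return {
        location: sum(tower_counts[tower] for tower in towers)
        for location, towers in communities.items()
    }


def score_users(communities, reports):
    """Count, per location, how often each user reported a member tower."""
    scores = {location: Counter() for location in communities}
    for user, tower in reports:
        for location, towers in communities.items():
            if tower in towers:
                scores[location][user] += 1
    return scores


# Hypothetical data: tower 2 belongs to both locations (overlap is allowed).
communities = {"A": {1, 2}, "B": {2, 3}}
reports = [(94, 1), (94, 2), (76, 2), (76, 3), (76, 3)]

location_scores = score_locations(communities, reports)
user_scores = score_users(communities, reports)
print(location_scores)
print({location: dict(counts) for location, counts in user_scores.items()})
```

How the per-location counts are combined into a single ranking per user is not shown here.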
