About Datasets

September 14th, 2010 Leave a comment Go to comments

I have spent a couple of days looking at the quality of the datasets I have parsed (Cabspotting and Geolife) and have had a couple problems. When I imported the data, I made the decision to use the MySQL datetime format for timestamps. This resulted in very poor performance when querying the database (maybe). So, I decided to convert the timestamps to integers, which represented unix time.

ALTER TABLE `entity_locations` ADD `unixtime` INT NULL;

Query OK, 25075904 rows affected (7 min 40.11 sec)

The coversion was a simple case of create an new column, then updating the new column with converted timestamps:

UPDATE entity_locations SET unixtime = UNIX_TIMESTAMP(timestamp);

Query OK, 25075904 rows affected (11 min 13.38 sec)

Then adding an index (I should have added this when adding the extra column, but may not have made much difference)

ALTER TABLE `entity_locations` ADD INDEX ( `unixtime` );

Query OK, 25075904 rows affected (15 min 21.79 sec)

All seemed well and fine, however, I had noticed during testing a sample dataset, that the MySQL datetime field equated to 23 seconds earlier than the new unixtime field. (as parsed by PHP), the ameded dataset maintained this difference, meaning it was not a compound error. This will not be a major problem, unless comparing to another dataset, using a small granularity.

One problem I did notice however, in the Cabspotting dataset, the timestamp column  had ‘on update current timestamp’ set meaning that when the new unixtime column was updated, the timestamp column was given a new timestamp value. I rectified this by removing the trigger, and setting the timestamp based on the unixtime. (not accounting for the 23 seconds).

Another issue I noticed with some of the data, was that the timestamp was set to the time of insert, for some of the geolife data, as well as some records which were zeroed, and some which were in the future. This may mean the need for a full import. Luckily I have kept all of the data and parsing scripts.

I still need to do some thorough analyis of the datasets in other ways as mentioned before. And I also want to convert the social sensing type data into the same format as these two datasets. There is also the CenceMe dataset to consider; it may not be suitable for parsing, as the location data could be very sparse. This of course might be useful for comparing to thicker datasets.

Geolife dataset

Initial analysis of the datasets to find the concentration of readings shows that there are a handful of users who collect data from around apri/mayl 2008, then October 2008, another set of users contribute data in earnest, with fewer contributing from the original set of users. Readings tail off from april 2009, with vert few users contributing after the beggining of August 2009.


Show a good amount of data is recoreded for most users between 17 May 2008 and 09 Jun 2008, with only a small number of exceptions.

  1. No comments yet.
  1. No trackbacks yet.