## Thinning sensor time sequences geometrically

Thursday, June 13th, 2013 at 8:13 pm

Apparently the following terribly obvious idea isn’t terribly obvious, and everyone who has been collecting time series data from, say, temperature sensors has simply been stuffing it into their flavour-of-the-month database as rows of the form (time, value, sensor_id), in copious quantities and with not much idea of what to do with it.

If you’ve spent your life generating machine tool paths and thinning out the points so that the controllers don’t get overloaded, the first thing you say is: “Hang on, why don’t you just fit some straight line segments to all this data?”

Our good friends at Open Energy Monitor immediately began running with the idea as soon as I mentioned it — and show every sign of comprehending the depth of the science behind the practice which I have unfortunately not had time to investigate in our machine tooling case.

Here is a section from a temperature time sequence that contains 10090 values collected over seven days:

Here’s what it looks like when you thin it to a tolerance of 0.25 degrees, which results in 210 points: a compression factor of about 50.

The important thing about these algorithms is to model the lines, not the points, because that’s what you will be integrating with. This means a record in the database is of the form:

`(time0, value0, time1, value1, tolerance, sensor_id)`

Now you might think this is unnecessary redundancy because each time and value is listed twice in the database, when you should be able to get away with simply listing the sample points once, but it’s actually the linear segments that define the curve. Databases are very poor at joining rows against sequentially subsequent rows to recover these records, so don’t try to do it that way. The tolerance value refers to how well the line segment from (time0, value0) to (time1, value1) fits the data.
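A minimal sketch of such a record and how you might evaluate it, assuming a simple namedtuple representation (the type and function names here are illustrative, not from any particular database schema):

```python
from collections import namedtuple

# One thinned segment: a straight line from (time0, value0) to (time1, value1),
# carrying the worst-case deviation of the raw samples it replaces.
Segment = namedtuple("Segment", "time0 value0 time1 value1 tolerance sensor_id")

def interpolate(seg, t):
    """Value on the line segment at time t (time0 <= t <= time1)."""
    lam = (t - seg.time0) / (seg.time1 - seg.time0)
    return seg.value0 + lam * (seg.value1 - seg.value0)

seg = Segment(0.0, 20.0, 600.0, 23.0, 0.25, "sensor1")
print(interpolate(seg, 300.0))  # midpoint of the line: 21.5
```

Because each record carries both endpoints, any single row can be evaluated or integrated on its own, with no join against its neighbour.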

The algorithm I use to do this thinning is pretty simple, using an explicit stack in place of recursion:

```
segs = [ (t0, v0, t1, v1), (t1, v1, t2, v2), ... ]  # input
thinsegs = [ ]                                      # output

cstack = [ (0, len(segs)-1) ]
while cstack:
    i0, i1 = cstack.pop()
    di, dtol = GreatestDeviation(segs, i0, i1)

    if dtol > maxtolerance:
        cstack.append((di+1, i1))
        cstack.append((i0, di))
    else:
        thinsegs.append((segs[i0][0], segs[i0][1], segs[i1][2], segs[i1][3], dtol))
```

All we do is fit a line to each run of segments, find the element of greatest deviation (measured at its (t1, v1) point), and split at that place if it exceeds the pre-set tolerance. This is essentially the Douglas–Peucker line simplification algorithm applied to segments rather than points.
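The deviation test could be sketched like this; GreatestDeviation here is a hypothetical stand-in that measures vertical distance from the chord at each segment’s (t1, v1) point, which is one of several reasonable deviation measures:

```python
def GreatestDeviation(segs, i0, i1):
    """Find the segment in segs[i0..i1] whose end point (t1, v1) deviates
    most from the chord joining the start of segs[i0] to the end of segs[i1].
    Returns (index, deviation)."""
    t0, v0 = segs[i0][0], segs[i0][1]
    t1, v1 = segs[i1][2], segs[i1][3]
    di, dtol = i0, 0.0
    for i in range(i0, i1):          # interior join points only
        t, v = segs[i][2], segs[i][3]
        chord = v0 + (v1 - v0) * (t - t0) / (t1 - t0)  # value on the chord at t
        d = abs(v - chord)
        if d > dtol:
            di, dtol = i, d
    return di, dtol

# Two segments forming a spike at t=1: the chord is flat, deviation is 1.
segs = [(0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 2.0, 0.0)]
print(GreatestDeviation(segs, 0, 1))  # (0, 1.0)
```

When the range contains a single segment the loop is empty and the deviation is zero, so the main loop above terminates.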

Doing it this way in a batch is quite a bit easier than attempting to thin on an ongoing basis at load-time, but it is simple-minded because on an evenly curved path it will tend to split into powers of 2, which means it’s on average around 25% worse than optimal. And sometimes it’s possible to fit lines whose endpoints are not on the sequence, but which fit the data better. I’ve never done this; it’s hard, doesn’t gain much and may have disadvantages. But it is an important part of the science. One thing I do like about this algorithm is that it preserves corners well (although there are some exceptions). I also think the tolerance should be variable, so it can be tighter where the slope is long and shallow, because the low-frequency undulations on a long stable sequence could contain important data.

A notable observation about this curve is that it looks as if it contains exponential decay segments going up and down as the person in the room switched a fan heater on and off to keep warm. The decay curves going down are much clearer, and I wonder if they are all tending towards the ambient outside temperature at a rate given by the insulation rating of the room (very poor).

So we tried fitting exponential decay curves to this data, made possible by the addition of a further parameter to the segment record, called ex.

```
v = v0 + f * (exp(-(t - t0) * ex) - 1)

where:
    f = (v1 - v0) / (exp(-(t1 - t0) * ex) - 1.0)
```
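In code, evaluating such a segment might look like the following (a sketch using math.exp; the function name is illustrative, but the variable names follow the formula above). By construction f is chosen so the curve passes exactly through both endpoints:

```python
import math

def exp_segment_value(t, t0, v0, t1, v1, ex):
    """Value at time t of an exponential decay segment with rate ex,
    passing exactly through (t0, v0) and (t1, v1)."""
    f = (v1 - v0) / (math.exp(-(t1 - t0) * ex) - 1.0)
    return v0 + f * (math.exp(-(t - t0) * ex) - 1.0)

# At t = t0 the exponential term is zero, so the value is exactly v0;
# at t = t1 the f factor cancels, so the value is v1 (up to rounding).
print(exp_segment_value(0.0, 0.0, 25.0, 600.0, 20.0, 0.01))
print(exp_segment_value(600.0, 0.0, 25.0, 600.0, 20.0, 0.01))
```

As t grows, the value tends towards the asymptote v0 - f, which is where the ambient-temperature interpretation would come in.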

This reduced the segment count to 161, which is promising. I can’t plot the curved segments using the annotated timeline, so they show up as straight lines below. However, I can plot the exponent values which show that the decay rate is pretty steady across the intervals whenever the heater is switched off:

So that’s the fun and games we can play with one sensor alone. Imagine what we can do when we have 100 sensors all in the same house and can relate the time displacements of the temperature changes between one sensor and the next.