Freesteel Blog » 2017 » February
For a number of years I have been familiar with the observation that the sophistication of, in particular, time series data analysis is adversely impacted by the use of the Excel spreadsheet program. More recently I have discovered exactly how it is an irreparably deficient application and I am convinced that its use should be abolished from all non-small business accounting applications (ie everything except what it was originally intended for).
Hitherto I did not attach much importance to this view, owing to the fact that it is considered an anti-Microsoft bias as well as a “lost cause” because “everyone uses it”. However, on learning the existence of a large body of signal processing theory which is all but inaccessible to users of Excel due to its fundamental nature, I submit my observations for consideration below.
My first remark is that if data scientists don’t know about the benefits and substantial applications of multi-source data combinations, Kalman filters and seasonal adjustments reliant on the autoregressive-moving-average model, they are missing an important part of their job and are deferring the implementations of these concepts to mere “estimation by eye” from the graphs.
My second remark is that when external software exists that can be used to, say, calculate and compensate for the seasonal adjustment, it generally requires the data to be submitted in a time series format, and this requires a great deal of preparation of the spreadsheet data. Thus the appearance of being able to open up and immediately (and supposedly do) work with the spreadsheet data within seconds is deceptive, because there is now a longer route for the data to move it back out in a form to be processed and re-imported back into the spreadsheet for familiar viewing.
Let us consider a couple of time-series data sets. For example, the monthly national GDP and the employment statistics, or imagine one minute intervals of temperature and electricity use in a building.
What elements of the data are required to perform any useful processes on it, beyond simply the production of visual graphs?
For time series data (which a great proportion of data can be described as being), the existence of a reliable datetime value is paramount. Excel may in theory have a datetime cell type, but it is not visibly distinguishable from an unstructured string type with its ambiguous American and English date orderings. As such, it cannot be consistently used because improper use does generate an unignorable error (eg anything in column A must be in this datetime form or you can’t save the file).
Furthermore, just the datetime is not enough, because there are time series intervals (for example, monthly or quarterly data) and these cannot always be approximated by a single time point. By convention quarterly intervals can either be represented by the end of the quarter (eg sales results) or the start of the quarter (eg sales targets) but both need to be boxed into the same entity in any subsequent regression model.
Finally, when you have different data series from different data sources they usually work to different sample rates, so you cannot represent them adequately as a single row per time sample. This would apply to the power use for the heating system which is provided every minute, when the average outdoor temperature is recorded daily.
Accordingly, the primary dimension of the data points, the datetimes, are problematical. But what of the data point values, the measured quantities? If they are each recorded into a single spreadsheet cell we will invariably be lacking an associated standard deviation/confidence interval for them. The standard deviation is an crucial input to the Kalman filter for the determination of the weight applied to each individual measure.
Take the example of the monthly political polling data prior to an election. These are often supplied by different agencies and almost always come with a confidence interval that depends on the size of sample so we know to take less notice of a poll which defies the steady trend when it has a wide margin of error. But then if there are more polls with the same wide margin of error that are also in line with that new trend, the best guess of the trend will be pulled in the new direction as much as it would have been by one very good accurate poll with a narrow margin of error. This balancing of the estimations from the aggregation of the location and accuracy of the measures is optimized by the Kalman filter, and should not be done by eye from the charts themselves merely because it can’t easily be applied in Excel and we’re too lazy to convert our working platform to something where it could have been easily applied.
And this brings me to the final point about Excel, which apparently can do anything because it can run programmed macros. Really? Who can honestly think, if they have every stopped to consider it, that it is a good idea to co-mingle software with data? You might as well nail your vinyl record onto the record player and then parcel-tape it into a heavy cardboard box to prevent interchangeability.
The co-mingling of data and code with no reliable means of disconnection leads to dangerous and ridiculous practices, such as copying the data into and out of the spreadsheet by hand just to access the service of the macros.
Come on folks. If you’re going to call yourself data scientists, you cannot rely on a single tool that prevents you from applying the last fifty years of data science know-how optimally and productively — and then rely on its inadequacy as an excuse to not challenge yourself to learn and master the amazing mathematical technology that would have been at your disposal had you not chosen ignorance over education.
We have got to get beyond the pointless issue of formatting data into “machine readable form” for the mere purpose of making graphs for visual titillation, and get to grips with actual intelligence and effective control theory.
There is, for example, nothing smart about having to control a car with a human looking out through a window for white lines on the tarmac and the traffic lights on the verge in order to move the steering wheel and pedals in response. Smart is getting the data to feed-back directly into a computer that controls these motions optimally while you sit back in awe having merely specified the destination. But if someone out there building the tech has dared to embed a copy of Excel within the process chain between the sensor acquisition and the motor control actuators, then we are doomed.
Thanks to the famous Go to considered harmful letter of 1968. There was a heated debate about it, and 30 years later it was unconscionable that programming languages could even be conceived of to include a goto statement. Kids these days probably don’t even know what one is.
But just think about all that programming wasted and how much further on we could have been without the inclusion of that single statement which caused so much unnecessary expense and buggy code throughout the years, and then imagine how much damage is being caused by inappropriate use of this inadequate data non-analysis tool up to now and for the next 20 years before it too finally gets buried in the ash-can of history and people don’t even remember that we ever used it in the first place.
This has been the moment of truth.
Friday, February 10th, 2017 at 1:08 pm - Machining
Here’s some bad out-of-focus video from the latest steel milling.
The milling was bodged because somebody used the wrong size collet which didn’t grip properly and so the cutter slowly slid inside, leaving steps on the job.
I’ve lost patience for blogging, or filming or documenting things properly. It’s all been out of sorts. Maybe I’m low on VitaminD.
I did feel content while doing a day’s work yanking out lots of small trees by their roots on Saturday.
Progress has been on-going with a number of datalogging and ESP8266 projects, such as this Sonoff S20 smart Socket hack where I got help to install Micropython to turn on and off a vacuum cleaner.
(I am trying to do the same with my heating system boiler, except the ESP8266 I’m using keeps dropping out for no reason.)
I have the view that the “Internet” part of the “Internet of Things” is the problem, with all these servers, gateways, analytics, and remote controls from anywhere in the world. None of it does anything useful, because the value of data and control of a thing is inversely proportional to your remoteness from it in time and space.
The answer is therefore going to be nearby. It’s not ever going to be found from pushing it out to some faraway server in China.
What we really want are “Things with Internet Technology”. That is stuff with Wifi, Webpages and Wisdom. No nonsense, fun, and easily reprogrammable on an escalator of betterness to a high plane than the bare ground we start with.
So, in addition to the cheap and multiples-of cave logger work, I’m proposing a mains cut-off switch that does not contain its own microcontroller — let alone its own pre-installed software. Those are added extra onto the minimally necessary hardware.
All you get from it are four pins on the junction box: Ground, 5V, GPIO-output for on-off, Analog-in for the current draw.
There are about a billion pis, arduinos, microbits and esps out in the world that people know about and are waiting for something like this to plug into, so there is no reason to force your own choice onto them just because you’re familiar with it. That’s what I think.