Mysterious temperature drift.. [[ solved ]]



Extensive testing helps reveal many problems that would otherwise stay dormant. It is a somewhat daunting task to run a selection of appropriate tests, analyze the results and then make the right improvements. It is something that should not only be done during product development, it should also be done rigorously, which quite often means it is a bit boring. This post is NOT about that.
This post is about solving those mysterious bugs and problems that appear seemingly at random and are an absolute pain to fix, which also makes them NOT boring.




It all started back in December, during an internal demo. We were still using a slightly older version of the electronics and during one of the many valve activations, I saw the SRV's temperature measurement suddenly jump a few degrees. This was unexpected to say the least - we had tested this hardware quite a lot before that and used the same sensor in other products and frankly never had any wrong readings. Although the temperature settled pretty quickly - a clear indication it was a glitch - the reason it jumped was anything but clear.
In situations like this one, what you typically do is try to replicate the issue in a more controlled environment (read: in a place where nobody can judge you for not knowing all the answers, such as your desk or lab). Once the problem is replicated, you use your favourite method to investigate whether it's a software or hardware issue, dig deep into the details and then, sometime way past midnight, you aim to declare victory. At that point you collate all the data, put it in a document (or perhaps an email with a clear timestamp) and let all your colleagues (and perhaps bosses) know how clever and hardworking (remember "past midnight" and "timestamp") you are. A clear win-win situation. If you are lucky enough to pull off a few of these successful investigations in, say, a year, you quickly gain your coworkers' respect and you become your bosses' "go-to-person".
Such was not the case here.
I initially spent a couple of evenings trying to replicate it, but no matter how hard or clever I tried to be, I couldn't. As you can imagine, I eventually categorized this as a "non issue" and concluded that it must have been caused by some kind of fluke and not by any software or hardware malfunction. Well, at least not a hardware one, the software can always be updated later :)
Days passed. I was busy with other things and had almost forgotten about this "non issue" when I got a Skype call from our CTO, Piotr, who was in the middle of yet another demo, this time with potential investors and an important distributor. He was not happy, as the problem had appeared again... during the demo. Definitely not a good time for the SRV to report 30 deg C (even if only briefly) in a perfectly heated and cooled room in front of a demanding and critical audience.

It had just gotten really serious. This temperature thing thought it could play hide and seek with me... Too much was at stake. I had to maintain my "go-to-person" status, I had to act.
I was relentless. I had to be. First, I examined every bit of firmware code to see if there was anything that could have caused this glitch. I happily concluded this was not the case. I then methodically examined every sample PCBA I had and again concluded that nothing about the PCB or assembly was even remotely suspicious. The stepper motor, wires and connectors were next. Then I reprogrammed the unit to drive the valve up and down at fixed intervals, hooked the PCB up to a few external temperature sensors and oscilloscopes and... almost immediately noticed that, due to some minor timing difference, every 100 or so motor activations the temperature was measured while the motor was running. And that's when the reading was way higher than it should have been. OK, I thought, at least there is a reason why I couldn't replicate the problem before - I simply didn't do enough activations.
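For the curious, here is a minimal Python sketch of why the overlap was so rare. The numbers are made up for illustration (they are not the SRV's real timings) and the function name is hypothetical: a motor that runs for 100 ms every 10 s and a temperature sample taken every 10.1 s drift slowly against each other and only coincide once every 101 activations.

```python
def overlapping_activations(motor_period_ms, motor_on_ms, sample_period_ms, n_activations):
    """Return the indices of motor activations during which a temperature
    sample is taken. All timings are in integer milliseconds; the motor
    starts at 0, motor_period_ms, 2 * motor_period_ms, ... and samples are
    taken at 0, sample_period_ms, 2 * sample_period_ms, ...
    """
    hits = []
    for k in range(n_activations):
        start = k * motor_period_ms
        end = start + motor_on_ms
        # Time of the first sample at or after the motor start (integer ceiling).
        first_sample = -(-start // sample_period_ms) * sample_period_ms
        if first_sample < end:
            hits.append(k)
    return hits


# Assumed, illustrative timings: motor every 10 s for 100 ms, sample every 10.1 s.
print(overlapping_activations(10_000, 100, 10_100, 500))
```

With these assumed values, a short bench test of a few dozen activations would never catch a sample taken while the motor is running, which matches why my first attempts to replicate the glitch came up empty.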
But why was the temperature off? Interference from the motor, noise on the data lines, temperature rise due to high current? Was there any pattern to it all?
My bet was that it was a PCB layout issue. After all, it was not my design - Piotr had done it himself - so it must have been his fault :) The fact that I was the one who had reviewed the design and given it my thumbs up simply slipped my mind at that busy busy time.
I cut, drilled, desoldered and resoldered the board, removed pieces of it and added new ones. Still the conclusion was not clear. I tried different conditions and power sources, got a bit closer to the answer but was not quite there...
On the third attempt, quite late at night (it had to be done at night for the timestamp reasons above), I was on the phone with Piotr when, to paraphrase him, he had an epiphany. He suggested I attach another sensor on a separate board, connected with a bunch of cables to my test SRV, and run my normal test again. Brilliant, such a simple test and I couldn't figure it out myself. As you can imagine, the result was that there was no problem in that scenario, which proved the sensor got warmer because heat was being transferred through the PCB - a simple design error that was even simpler to correct. Surprisingly, when Piotr heard about the results even later that night, he wasn't convinced. We all try to rationalize sometimes, who can blame him :)
We obviously implemented the necessary changes immediately, made and tested new PCBs and put this problem to bed once and for all :)
This situation reminds me why I like product design so much. It's an incredible feeling to not just make a product but also to iron out the tiny little glitches that almost always appear.
It's even better to be able to share the experience with such a great audience.
Thank you.

Novo's one and only "go-to-person" - Taimur

 


Comments

  1. Thanks for the insight. I was wondering how the temperature sensor was measuring the air rather than the hot water - I had not considered the board.

    This also shows that the board will make an excellent remote temperature sensor with “probe” for in duct or strap on pipe applications.

    One attraction of the open interface but robust TRV is only needing one hub for a range of needs.
