The Power of Power Cycling (growthalytics.com)
181 points by nkurz on Aug 24, 2015 | 81 comments


No, it's not acceptable to power cycle to fix problems.

Yesterday, I had to power-cycle a CNC milling machine. The machine (a Tormach 1100) started the spindle with the spindle RPM about a tenth of the value specified. This resulted in pushing an end mill into the workpiece (it wasn't spinning fast enough to cut properly), snapping off the end mill (like a drill bit) which went flying off, and damaging the workpiece. This is not a huge deal (it's about $25 in damage, and everybody wears safety goggles around milling machines) but it shouldn't happen. We discovered that all spindle speeds were far slower than they should have been; entering a new spindle speed did cause the spindle drive to change speed, but the speeds were about an order of magnitude low. The spindle speed control is digital and computer-controlled.

Power-cycling the machine (which runs Windows Embedded Standard, a Windows XP derivative) cleared the problem, and I was able to run several jobs successfully. But someone else had reported a spindle overspeed on Friday. Something has gone badly wrong in spindle speed control on that machine.

The "Recommended Best Practices" for this machine tool actually says "Reboot controller once a day to force both Mach3 and Windows to restart."[1] That's not good.

[1] http://www.tormach.com/uploads/883/SB0036_Mach3_Best_Practic... (a PDF file with the wrong suffix)


I would state this a different way. It's not an acceptable solution for the developer/designer/manufacturer to propose. However, for troubleshooting, it's the single best tool we have (and with all of the black boxes, the only tool in many cases).


Ouch!

I recall being told about a high-end CNC machine that power cycled and moved the tool to home straight through the workpiece.


You are right. Not acceptable to have end-users power cycle.

That doesn't mean that individual components can't be power cycled on fault without affecting the rest of the system. That, of course, means writing the system in a certain way that allows that (allows fault isolation). You can do it with Erlang, OS processes, containers, separate hosts, etc.

In other words, there is nothing preventing the company from auto-power-cycling a component that crashed, reporting the error, and issuing a fix later, while you still get to use the machine.
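As a sketch of that idea (Python, with hypothetical names; Erlang's supervisor trees do this natively, and OS processes or containers give stronger isolation): a supervisor catches the crash, logs it, and restarts the component to a fresh state while the rest of the system keeps running.

```python
def flaky_component(attempt):
    # Hypothetical component for illustration: crashes on its first
    # run, then behaves after being restarted fresh.
    if attempt == 0:
        raise RuntimeError("simulated fault")
    return "ok"

def supervise(task, max_restarts=3):
    """Run the task; on a crash, report the error and restart it from
    a clean state instead of taking the whole system down."""
    for attempt in range(max_restarts + 1):
        try:
            return attempt, task(attempt)  # (restarts needed, result)
        except Exception as err:
            print(f"component crashed ({err!r}); restarting")
    raise RuntimeError("giving up after repeated crashes")
```

A real system would do this across process or machine boundaries so a misbehaving component cannot scribble over its neighbors' state; this in-process version only illustrates the restart-and-report loop.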


Hi! Blog post author here. My initial draft had this at the bottom:

"Disclaimer: If you have a recurring problem, power cycling will not fix your root cause. "

At some point I must have decided the whole thing was too long and that sentence didn't make the cut.

I personally would not go as far as to say: "it's not acceptable to power cycle to fix problems".

I see it more like: "not all problems can be fixed with power cycling".


The PDF with the wrong suffix crashed Google Chrome Mac. But then I power cycled (reopened) the browser and now it downloads fine :)


If you're using Windows to run a milling machine, you're doing it wrong. It's not Windows' problem that you're using it inappropriately. That's like taking a 1990 Ford Taurus with 250k on it to a track day and then complaining that the engine blew up. Technically the engine did fail, but the driver should have known better.


Windows Embedded is quite common in factories where I live. What alternatives would you suggest that are vendor supported?


If your vendor thinks that Windows is an acceptable control for a CNC milling machine, consider another vendor. Yes vendors might choose to use Windows, and think that it's OK. But just because a vendor chooses to use Windows doesn't somehow make you obligated to buy it and suffer, does it?

Sure they might be the only vendor, or someone inside your organization might choose them anyhow despite your protestations. But that doesn't change the fact that using Windows for these kinds of things is foolish.

Just because big companies do it doesn't make it smart, does it? I mean, if it did, then startups wouldn't be able to exist would they? Startups are able to be a thing because big companies sometimes do stupid stuff.


They all use Windows. Mine use Windows NT4 and Windows XP. That's the industry. If you want industrial CNC machines you don't get a choice.


There's a tremendous difference between using Windows to run the GUI and using Windows to do the real-time control of the servos and various other hardware.

Most of the time these machines run a small microprocessor which receives commands over some port and interprets them as it is instructed to. This is what's called a "hard realtime" system as it always responds within a certain amount of time, guaranteed, provably as per the design.

What Tormach is doing is eliminating the dedicated gcode interpreter hardware/controller and performing those operations strictly in software, on a program running on a PC. There's some utility to that, but pretending that it's as good as having a dedicated, realtime gcode interpreter is not honest.


"force both Mach3 and Windows to restart."

I would guess Windows is used for the higher-level functionality (GUI, possibly format conversions), while Mach3 does the lower-level stuff that requires precise timing.

Adding a second OS makes sense as it makes it way easier to keep the real-time stuff real-time.

Using XP embedded for the top layer shouldn't be that big of a risk. It may have lots of known exploits, but you can remove lots of the attack surface, and an alternative GUI may not have seen much security auditing.


This is just ideology.


I'd argue it's actually more about practicality and reliability. Windows isn't particularly well-suited for hard-real-time situations (like, presumably, a CNC milling machine). This is an environment that's usually better suited to RTLinux, VxWorks, etc. or some one-off dedicated program written in something like Ada.


I don't see why this is attracting so much skepticism when "crash-only programming" and the "chaos monkey" are popular ideas on HN. There really are only two ways to do software reliability:

- "failure is not an option": you don't get to reboot your rocket controller when it has a floating point error, or your Therac when you're doing an X-ray. This involves investing a lot of time, effort, review, and formal validation into getting it right. It's completely incompatible with Agile and RERO.

- "s--- happens": the cost of failure is small and you can just accept it, issue refunds/apologies and move on. Or show the fail whale. This is much easier and cheaper. This environment moves towards powercycling and redeploying software/EC2 instances/Docker containers whenever something happens. You monitor observed reliability and make a commercial decision as to whether it's unacceptable and you need to fix some bugs.

Almost all of HN works in the second area.


Crash-only is rather different from power cycling. The unit of granularity in a crash-only metaphor is at a higher level (process, thread, green thread or other schedulable entity), the whole point being that you delegate to a supervisor tree for restarting crashed subtasks/processes to a known good state while maintaining the uptime and integrity of the entire system (i.e. crashes are not typically user-visible). It works really well because in any complex system running on top of so many layers, the state space expands combinatorially so that performing intricate error diagnosis and recovery procedures will often be a losing proposition from the number of code paths you'd need to properly exercise.

Power cycling is different. A system that expects you to power cycle often because it consistently fragments or cannot dynamically update/reread its configuration is broken and a major annoyance. In many ways, systems that require lots of power cycling are as such because they're designed in a way antithetical to crash-only. It's a failure to enforce boundaries and separation of concerns.


Power cycling needs to happen at the level of individual components. In other words, sub-systems might power cycle, but the user impact should be as small as possible.

OS processes, Erlang processes, and separate machines (containers) can reliably do that, because they isolate fault propagation.

In a shared-memory system, if something crashes you don't know what the state of the rest of the system is. Maybe one bad client wrote over the memory of the other 999,999 clients. So just restarting that one thread is not safe.

Another overlooked aspect is that to do crash-only right, with recovery, you need stable storage: somewhere you can persist a known good state to restore from and continue. That might be hard or easy depending on the environment.
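A minimal sketch of what such stable storage could look like (Python; the file name and state shape are hypothetical): write the known-good state atomically so that a crash mid-write can never corrupt the stored copy, and restore it on restart.

```python
import json
import os
import tempfile

STATE_FILE = "service_state.json"  # hypothetical location

def save_state(state, path=STATE_FILE):
    """Persist a known-good state atomically: write to a temp file,
    fsync, then rename over the target, so a crash mid-write leaves
    the previous good copy intact."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on POSIX

def load_state(path=STATE_FILE, default=None):
    """After a crash or restart, restore the last known-good state,
    falling back to a default if nothing usable is on disk."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {} if default is None else default
```

The write-temp-then-rename pattern is the usual way to get this guarantee on a POSIX filesystem; on environments without that primitive, as the comment says, it can be considerably harder.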

> You monitor observed reliability and make a commercial decision as to whether it's unacceptable and you need to fix some bugs.

Yep. But "you" here is the developer, not the end user. When you go to buy socks on a website, it is not your job to monitor and restart their back-end system or switch IPs. They should be doing that; you are just buying socks.

> Almost all of HN works in the second area.

Because the first is very hard. NASA does it, medical device manufacturers do it. Critical security modules have it. It is very expensive though.


"Crash-only" means that a subsystem should bug out early when it detects a problem rather than soldiering on and potentially making things worse. The bugging out itself notifies a monitor process which can make a decision on how to recover. The crash-only paradigm was popularized by Erlang, which was created to drive highly reliable pieces of telecom equipment. So it's not entirely without merit.


Hi! I did some research before writing this blog post (okay, I spent 5 minutes reading the wikipedia article on power cycling).

It turns out rocket controllers get power cycled too.

source: https://en.wikipedia.org/wiki/Power_cycling


You probably didn't mean it this way, but I'd like to point out that formal validation certainly isn't incompatible with "agile" approaches, but it won't work the way that most people practice it.

At the systems level you often have a mix of both classes (like your Xray example, typically). And there is often a third path "failure has a well defined mitigation path" in the mix, also.


There's a middle ground there where failure is an option, but with lots of redundancy and modularity so that such a failure is isolated and quickly resolved (at least temporarily). This is the Erlang/OTP approach to software reliability.


Power cycling is a work-around, not a "fix". Pretty much any time you end up needing to power cycle something, it's due to an underlying bug, which remains after the fact. In fact, in many cases it may have intentionally been left unfixed, since the people shipping these products know that most users will try power cycling before complaining about a problem. It has become so commonplace that we apparently now see it as acceptable that we have to regularly reboot our routers, PCs, phones, etc. just to keep them working.


Thing is, that underlying problem may be in a system that's out of anyone's direct control. Case in point: a clinical system which occasionally pops a memory leak and needs a bounce. The leak turns out to be a known bug in a library, not their code. And it's fairly elderly, so it hasn't been and isn't going to be fixed.

I mean, it shouldn't happen, but if "should"s were horses I'd have a ponyburger for brunch.


There's no doubt Power Cycling helps. But I would argue it helps for people who 'consume' products. It is anything but a help for Engineers, for people who design products.

There are so many examples I could provide: you are a software engineer, and the program you develop works well for a day and then crashes. You can either investigate the problem, find the memory leak, and fix it, or power cycle your program.

-> If you design a product, don't go for the easy stuff, don't power cycle just to get back to a well known scenario. Investigate and fix.


Yes, power cycling most often treats the symptom, not the problem. As a consumer, you often don't have a way to treat the problem, so that's your only recourse. As an engineer, or really anyone capable of and responsible for fixing the problem, often you do not want to power cycle, because if the problem goes away you may no longer have a way of reproducing it. That makes most problems harder to fix, and possibly impossible to confirm fixed.

I often reduce how highly I classify the importance of bug reports that I can't reproduce. It's often not efficient to work on something with so little information to go on. Further information on the bug, or a reproducible error case, helps vastly.


The real problems start when your support organization thinks that rebooting the customer's machine is a solution.....

Then the engineering team never gets to fix the problem, and your support team starts eating up all the resources because they get constant calls from the customers asking about the same problems.

(thankful for not being in that job anymore!)


Experience allows an engineer to tell when a problem is one that needs fixing and when it's one that just needs a reboot. So many problems with technology are transient, as in not reproducible in any meaningful (or commonly occurring) way, yet we spend the most energy (in my opinion) on those.


Well, if it works for a day:

    1 0 * * * kill -HUP $(cat /var/run/my_server.pid) >> /var/log/my_server/error.log
You think I'm kidding, but I've done this in production. Now I'm more likely to use god/monit.

In fact, not joking at all: this is how I make Finder not slow down. Every 4 hours, I restart it via crontab.


On the other hand, there are languages and platforms like Erlang, which are designed around "managed" power cycling your processes.


You have a point, but your point is conceptually "state is bad, build things with the least state possible".

Even though I agree with this, there is a fundamental difference between computer science and software engineering, and in actual software engineering there will often be hacks, and bad state. Even if sometimes it's not yours, but a library you're using.

I agree with you, but this article takes a more "real life, pragmatic" approach, versus your thesis, which is true but a little more idealistic.


Really? My definition would have been the opposite. When you call yourself an engineer you're saying you're like structural engineer building a bridge or an aeronautical engineer building a plane. You're saying you have a professional and moral obligation to do a good job.

Any company that expects its developers to take pride in their work should track crashes and strive to produce software that doesn't routinely need to be restarted.


You have to be careful there, though. When it comes to computer software, even perfect code can require a restart if different code on the same machine is interfering with it. The story has gotten much better, but even now I'll occasionally have to reboot a game or 3D visualization app because the graphics driver got wedged. It's not the application's fault; it's the driver's fault.

The reason power cycling is such a useful tool is that most computer systems are running in an environment where the things outside of their control far outnumber the things within it.


If we all had ECC on our machines, then SW engineers would have one less excuse to blame it on HW.

Power-cycle-related story: my machine crashed two weeks ago while playing a movie, and my 3TB HDD showed only 2TB in SMART after the power reboot! After googling around, letting the HDD cool off, and playing with WD tools, several power reboots later my 3TB partition is back, but not without losing a couple of MB off the end of the disk. Poof, just gone. Not deleted, gone! My old partition is now bigger than the 3TB drive.

SMART reports a 3TB disk again, and I'm on the lookout for a spare drive.

No, really, not all power cycles are equal.


This comment has two parts; the first one I think deserves more attention. When your program can actively and randomly rot while it's in production, the only valid recovery method really is "power cycling" the application. As the OS is also affected by this rot, power cycling the OS is a valid solution to many problems as well.

The push for making broad use of consumer hardware to run production environments to save (a veritable fistful of) money has only made this problem worse.

It makes me wonder if we shouldn't be adding a module to the Linux kernel to occasionally reload blocks of code from disk.


"Occasionally reload blocks of code from disk": it's not that simple. An ECC system avoids most HW-related issues but does not protect you from SW errors, which probably account for most rotting.

Why reboot it in the first place? If the HW has ECC, the SW does not rot unless it is poorly written.


The OP assumes many things about 'turning it off and on again'. Firstly, it assumes that turning something off and restarting will bring it back to some default state. That isn't always true, especially for computers. Say some new code has been injected. Restarting, going through the motions of boot and login, may allow that code to do something (harvest login creds) that it otherwise might not.

And the OP assumes that once 'off' has been achieved that "on" is always a possibility. Forget the car. Think airplane that has lost instruments, but the engines are still working. Turning things off might move you from bad to worse. Maybe you stick with the partially-working machine rather than risk bricking things.


> And the OP assumes that once 'off' has been achieved that "on" is always a possibility.

Or that to do this is a simple operation with no other side-effects. As one of the other comments here points out, power-cycling is probably acceptable for consumer products, but with systems that are more mission-critical, should really be closer to a last-resort method.


Chernobyl started with them 'power cycling' the reactor, an on-off-on test for a backup system. It did not go well.


The OP wrote a lighthearted observation of the reality in front of him.


I feel like the majority of problems that are resolved with power cycling could be prevented on the engineering side if engineers merely understood state machines better.

The reason why power cycling works at all is because the machine's state is inconsistent with the program's, and power cycling gives the machine and the program a chance to start over from scratch.

So many programs out there, especially so in embedded devices where power cycling is common, don't really have an explicit state machine model of operation. As such, the debugging of these types of errors is near impossible. If you have an explicit state machine model of operation, and you have a bug that is remedied by a power cycle, you can quite easily trace the bug back to a specific state transition. Of course, this becomes more untenable with increases in system and hardware complexity, but on the level of a driver or a single executable, an explicit state machine model works wonders.
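As a sketch of that idea (Python; the spindle controller, its states, and its events are hypothetical, for illustration): declaring the legal transitions up front means an illegal one fails loudly at the exact transition, which is precisely the trace you want when hunting a bug that a power cycle would otherwise paper over.

```python
# Legal transitions declared explicitly: (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "start"): "spinning_up",
    ("spinning_up", "at_speed"): "running",
    ("running", "stop"): "idle",
}

class SpindleController:
    """Hypothetical device controller with an explicit state machine."""

    def __init__(self):
        self.state = "idle"

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            # Fail at the transition, where the bug is traceable,
            # rather than drifting into an inconsistent state.
            raise ValueError(f"illegal transition: {event!r} in state {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state
```

With the transition table as the single source of truth, "machine state inconsistent with program state" becomes a logged, reproducible error instead of a mystery that only a reboot clears.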


Hm, is that the answer to the question why do we actually sleep?


As an extreme example of power cycling, I was refactoring a slow function in some simulation code and replacing some hash-based look ups with some pointer arithmetics.

I must have spent several hours writing and re-writing that function, and I still couldn't get the tests to pass. I went over all the code character by character, debugging, etc, and I just couldn't figure out where it wasn't working. Eventually, I just nuked all the work I did that morning, rewrote that entire routine from scratch, and it worked perfectly.


Just after reading this I went to get a coke from the vending machine down the hall. There's a bottle sitting on the robotic arm halfway down and stuck. Display reads OUT OF SERVICE.

I yanked the power cord, put it back in. The steppers re-homed themselves and the bottle made it to the bottom.

Free Coke.


For anyone who hasn't seen it yet, the AI koan on power cycling:

'A novice was trying to fix a broken Lisp machine by turning the power off and on.

Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”

Knight turned the machine off and on.

The machine worked.'

See the rest at http://www.catb.org/jargon/html/koans.html


This is going in a different direction than the cited article, but there are related aspects when it comes to software aging and software rejuvenation for high availability. https://en.wikipedia.org/wiki/Software_rejuvenation

There were some interesting reads in the AT&T Technical Journals about the 5ESS switches (e.g. Kintala, Bernstein, Wang, "Components for Software Fault Tolerance and Rejuvenation") a long time ago. The journals are not freely accessible, but I have found other texts from the authors on the same topic: http://www.crosstalkonline.org/storage/issue-archives/2004/2... (defense-related web site) or http://srejuv.ee.duke.edu/shaman02_secure.pdf (other author)


In this thread: Lots of people criticizing a perfectly valid first diagnostic step that they all use frequently. Yes, we all know that it's not a FIX for some long term problem, but frequently you're looking at a simple confluence of rare events that caused an issue that can be resolved by resetting the system to a known state. If the issue continues with a frequency that is troubling, then you move on to other steps, but if you fire up a kernel debugger and bus sniffer every time you have a minor glitch, you're wasting time. And I wager everyone commenting otherwise knows that.


I don't see many people criticising here quite frankly.

As per your comment, you are right: if there is a kernel bug and you'd rather restart your computer to quickly get back to what you were doing and avoid distraction, all good.

If you are part of a kernel engineering team and you quickly restart your computer and pretend it never happened, then maybe you didn't give it your best shot.


> And I wager everyone commenting otherwise knows that.

Not really. If it happens once, it can / will happen again. In my opinion a bug is a bug and should be fixed. But I do agree this attitude does seem prevalent.


The value of power cycling is to determine how intermittent the problem is.

If the problem only ever occurs once and power cycling fixes it then that is effectively a permanent fix.

If it comes back next year and power cycling fixes it again, you know there's something wrong but it probably won't be an actual problem for any practical purposes unless it gets worse.

If you find yourself needing to power cycle every month or every week, or even worse, increasingly more often, chances are you're going to realize that the root cause must be investigated and fixed in a matter of months or weeks or even days.


Another way of saying,

"Have you tried turning it off and on again ?"

- from Graham Linehan's excellent The IT Crowd; the phrase is spoken by Roy from technical support, the IT dept.


It doesn't work unless you say it with the correct accent.



I've heard an alternative version of the opening joke.

It ends with the software engineer basically saying, "let's try it again and see if we can reproduce it".


Personally, I'd rather just fix the problem :)

  :~$ uname -r && uptime 
  3.2.0-4-amd64
  01:33:09 up 461 days,  3:50, 13 users,


And what do you do when the machine goes down? How do you test that all services come back up again? Your infrastructure should be resilient against servers and services going down. But you can't test that if they never go down. See Netflix chaos monkey, or CoreOS automatically restarting after updates (which are rather frequent, every couple weeks).


Do you hot patch things or what?


There is/was Ksplice for the kernel, and everything else can be upgraded without a reboot.


Nice try :P

    # uptime
    14:34:38 up 587 days, 21:12,  1 user,  load average: 0.00, 0.00, 0.00


Grandparent wins on users (13 times 400+ days)

Seriously: I'm assuming that these are both servers running a single application or a small group of related applications. The OP was talking mainly about client-style devices.


I had to recently power cycle my refrigerator.

The bottom ice maker stopped working, and no matter what setting I tried, I couldn't make it create ice.

My current theory is that I changed the ice settings while the freezer door was open a few days earlier. I noticed this because the water in the door shuts off if you open the other door.

Sure enough, after leaving the fridge unplugged for 15 minutes I now have ice.


Must be an LG; I had the same problem... And yeah, power cycling fixed it for a couple weeks, then it did it again. The wife got mad and called the repair guys, and they said to turn down the temp in the freezer. Apparently something in the software works better with the temp turned down to the minimum, because it's not like it can't make ice at 10 degrees or whatever it was previously set to.


Cosmic rays flip bits all the time:

"one error per month per 256 MiB of ram was expected"

http://stackoverflow.com/questions/2580933/cosmic-rays-what-...


While I 100% support power cycling as a troubleshooting step, it doesn't answer the question as to why something went wrong. For fixing the air conditioning in my car once in a blue moon, fine. For figuring out why my internet drops out once per day, it's only a triage step.


My philosophy is that the first time something funky and unexplainable happens, power cycle and don't spend a bunch of time on it. But keep an eye out for the same thing, and if it happens again, then investigate.

Using it to solve a daily problem is a terrible idea, if you have any control over the system.


Obligatory link to Jim Gray's "Why Do Computers Stop and What Can Be Done About It?": http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

TL;DR: Most software bugs that make it past testing are transient "heisenbugs". That is, they're the kind of bug that goes away when you restart the program.

Related: This is actually a core tenet of the Erlang ecosystem -- spend any length of time around Erlangers and you're bound to hear the phrase "let it crash". Erlang actually has support for this built into the system: Supervisor processes exist to automatically "power cycle" your code if an unhandled error occurs.


There doesn't even need to be a bug in the first place. Think of a system with memory pressure due to memory fragmentation. This could lead to failed memory requests for applications that would succeed on a less-long-running system. (For this reason some systems even disallow dynamic memory allocation during runtime.)


I was thinking about this for IoT yesterday. Basically, if you wanted a machine with permanent uptime, it would be two identical basic machines that periodically power cycle each other. On error, hard-reset and reload the OS/app from the working device onto the problem device.


Watchdog processes are actually extremely common in embedded systems. Servers have them. Even the code in the SMM on a Chromebook has a watchdog.
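A minimal software-only sketch of the watchdog pattern (Python, illustrative names; real embedded watchdogs are usually hardware timers that the firmware must service): the supervised loop must call kick() regularly, and if it stops doing so for `timeout` seconds, the expiry callback fires, which in an embedded system would be a reset or power cycle.

```python
import threading
import time

class Watchdog:
    """Software watchdog sketch: fires on_expire if kick() is not
    called within `timeout` seconds."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._last_kick = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def kick(self):
        # Called by the supervised loop to prove it is still alive.
        self._last_kick = time.monotonic()

    def _watch(self):
        # Poll well below the timeout so expiry is detected promptly.
        while not self._stop.wait(self.timeout / 10):
            if time.monotonic() - self._last_kick > self.timeout:
                self.on_expire()  # in hardware: reset the system
                self._last_kick = time.monotonic()  # avoid immediate re-fire

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The key property is that the watchdog cannot be fooled by a hung main loop: only active, periodic kicks keep the reset at bay.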


I have a wife and four kids (two of whom are now living on their own) and I can tell you that it's sometimes impossible to power cycle them. Especially when they're young and excited - and the next day is Christmas.



I once called a repairman to look at my dishwasher because it was stuck repeating the wash cycle over and over. He power cycled my dishwasher and the problem was fixed.


I have to admit - I thought this would be an article about bicycling and the usefulness of physical fitness in being a better (XYZ).

Not disappointed, but not what I expected.


Here's a question. Is it actually common practice to test your code by leaving it to run for weeks while inputting random stuff and observing if it leaks memory, sets bad values or crashes?

If not, why not? That'd catch any "mad user bugs" and all kinds of accumulated collateral damages from the sea of complexities which might be overlooked if routines are only tested individually.
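A bounded, miniature version of such a soak test might look like this in Python (the component under test, the iteration count, and the leak budget are placeholders): hammer the routine with random inputs, check an invariant on every iteration, and verify that memory use stays bounded.

```python
import random
import tracemalloc

def component_under_test(data):
    # Hypothetical stand-in for the routine being soaked.
    return sorted(data)

def soak(iterations=10_000, leak_budget_bytes=1_000_000):
    """Feed random inputs for many iterations, asserting correctness
    each time and checking for accumulated memory growth at the end."""
    random.seed(0)  # reproducible random inputs
    tracemalloc.start()
    baseline = tracemalloc.get_traced_memory()[0]
    for _ in range(iterations):
        data = [random.randint(-1000, 1000)
                for _ in range(random.randint(0, 50))]
        out = component_under_test(data)
        assert out == sorted(data)  # invariant check on every iteration
    current = tracemalloc.get_traced_memory()[0]
    tracemalloc.stop()
    assert current - baseline < leak_budget_bytes, "possible memory leak"
    return iterations
```

A real soak test runs this loop for days or weeks against the whole system rather than one function, but the shape of the check is the same: random input, per-iteration invariants, and a resource-growth budget.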


This is along the lines of "soak testing"

It seems so common it's mandatory in some places/fields, and totally unknown in others, unfortunately.


It's this kind of 'fixing' the issue that is the reason we still have these kinds of problems. I'm the kind of guy that debugs things for three days to find the issue. Yup, even if "it isn't worth it". I should warn you, I hit people in the face with a shovel if something isn't right. And frankly, if you are an IT professional, you should too.



If you are power cycling your computer, it is not sufficient to shut it down and restart it from the power button.

You must disconnect its power by turning off the PSU or yanking the cable. I suspect a majority of tech people know this, but your PSU passively supplies power to a few things on the board, and this can be enough for controller errors to persist across a reboot.


It is great until you power cycle a machine that was behaving strangely, only to find out that it was due to an HDD that was starting to fail, and now it does not boot anymore.


In a production environment, I prefer frequent power cycling. You don't want to be in a situation where a dev/ops person is afraid to power cycle a server with very high uptime. If there turns out to be a problem anywhere, better to find it sooner rather than later, when recovery is more likely; in the worst case, fail over or restore from a backup.


Yep, but power cycle when there are people available, with enough time to fix any issue that may appear.

The worst time for power cycling a machine is during an emergency.


From the title, I hoped this would be an analysis of cyclists' power or something related to the fancy torque meters they love to use.



