
Does someone know why these systems - Stratus, Tandem/HP NonStop, etc. - appear to be relics of the past?

I can see how Google is better off spending money on software engineers than on hardware (thus, fault-tolerant distributed systems); but there's lots of systems that don't need Google-scale computer power, nor Google-style ultra-low-cost-per-operation hardware, but that really need to stay up - utilities, banks, every system at the center of a big enterprise.

Is the issue that modern software just isn't reliable enough to make hardware failures an important part of the downtime? Or is it just that systems like https://kb.vmware.com/selfservice/microsites/search.do?langu... are cheaper today? (I guess those actually work, right? The level of complexity involved is frightening...)



In the old days 100% of computers were doing civilization-critical work: air traffic control, Social Security checks, nuclear reactor monitoring. Computers were expensive, and automation provided the most gain in those applications. There just aren't that many civilization-critical systems.

In the modern era we continue to have the same small number of civilization critical systems, we still have exactly one air traffic control system. However we have billions of, well, filler. Ringtones and social media and spam. Therefore rounded down 0% of modern computers are doing civilization critical work and safe secure systems superficially appear to be a thing of the past even if the number of systems and employees remain constant.

In 1960 the average ALU was calculating your paycheck, and it was rather important to get that done on time and correctly. In 2010 the average ALU is a countdown timer in your microwave oven, or your digital thermostat, or my car's transmission processor; maybe it's a video game system or other human interface device (desktop, laptop, phone), but numerically that's unlikely. Things are sloppier now because they're sloppy-tolerant applications.


Great explanation. In other words, the share of computers performing some kind of critical task has probably gone from around 90% to somewhere closer to 0%.


In absolute numbers they're still there. Air traffic control and baggage sorting applications still run APL, and your bank still runs COBOL. It has not gone away.


While I hear about COBOL in banks quite often, all my friends actually working there say Java and Oracle, and all the related job adverts I have seen so far were for Java or C++. Could someone please provide some relevant facts?


OK, so I'm probably revealing something about myself... appreciate HN's value for what you say rather than who you are. But as you asked...

Flexcube now owned by Oracle (core banking, basically a general ledger if used to that extent) and VisionPlus (credit cards) are done in multiple languages, but the core is COBOL. If you're in banking you probably have a guess who I worked for.

These systems go deep into an organisation's nervous system, and in the case of Flexcube, for example, come from a prominent bank's nervous system.

Most in-house mainframe programming today, however, is on the middle layer providing APIs with JSON feeds, with the heavy lifting left to 3rd parties, largely with workforces in India and sometimes China.

This is a world away from trading or investment management. While for a layman 'finance' is often a catch-all, banking IT and trading IT are different worlds. Java dominates investment/trading, with C# doing a bit but not much. C++ is mainly plugins for traders that need to do things fast, written when self-taught VBA won't cut it, but Java has largely filled that void, supplemented by C#.

Edit: Around 5-10 years ago, a lot of IT work in banking was about unifying each country's code-base around a common core; the larger the bank, the more tedious the task. Now it is all about AML and KYC, if you're interested in the banking non-front-office space. ML is a joke; quants do what quants do and win 50% of the time.


It must be quite common to be in a big financial corp, work in IT and not see a green screen (mainframe terminal) in your entire working life there. I did a stint in one of the mainframe teams in my old workplace (Big Card Network™) but the vast majority of the people in my age bracket who were in IT worked with java and javascript, on the web applications that make up the corp's front-end.

As zhte415 lets on, most of the people working on the mainframes were third-party contractors. I don't think the company even advertises COBOL and mainframe roles. When I was hired (as part of a graduate programme) the job description didn't say a word about mainframes. I had to raise a little hell to get into one of the mainframe teams, and it wasn't even particularly easy.

I mean, you'd think with all the stuff people say about how the old guard is retiring and those ancient systems will need people who know how to maintain them, they'd have been waiting for me with open arms. Not quite.

My hunch is that there's a supply/demand thing going on. It's not that all those big banks etc. don't need modern-educated software engineers who can tend to the ancient tech. They do. Except, those new-generation software engineers don't really care about the ancient tech, and the corps can buy cheap foreign labour to tend to the mainframes. There's no supply and all demand is covered. So they advertise for the roles that they do need filled, which is to say, everything on mobile platforms, the web etc.


Going into mainframe work, while also keeping your eyes open for multiple techs after college, is a great way to get into architecture and an international career at a comparably very young age, if you're into that.


So, I work for one of the larger US banks, so I'm qualified to answer this question!

I'll narrow this down to one particular section of our consumer bank, because that serves as a fairly "clean" example, not much affected by outsourcing or other complications.

We have a "core banking system" -- the computer that keeps track of the account balances for customers and moves around the bits that represent dollars. This is implemented on 1960s technology: ours doesn't happen to be in Cobol, but it is written in a different language of similar vintage. The system is pretty rock-solid: as an example, it has only a couple hundred milliseconds at most for all of the processing needed to approve or decline a credit-card transaction in real-time, and it has no difficulty checking balances, restrictions, and business rules to enforce that -- that isn't so impressive until you consider that even the slowest transactions need to meet that limit and we still have to allow for network latencies. We struggle to find qualified programmers for this system: the few who are actually skilled in it can command pretty decent salaries and benefits, and we'll search around the world to find them. The system is not abandonware: we'd love if it were written in something more modern, but a complete rewrite would be an incredible undertaking (this system developed over several decades, and that is difficult to recreate). On the other hand, we ARE looking at questions like how to get it to run in the cloud.

So that's all true, but if you look through our technical job listings, you'll mostly find us looking for Java developers, PostgreSQL DBAs, Angular web-developers, and other such positions. That is because the core banking system is a tiny portion of what we do. In my specific area, we have roughly 20 development sprint teams (and a few other support folks not on those teams). Of those, ONE team has core banking system developers on it (along with some developers with other skill sets). By that estimate, it makes up 2% to 5% of the work we do.

The fact is, keeping track of the balance in your account is only ONE TINY PART of what your bank does. We have to keep track of your personal information (email address, mailing address, login id, etc). We have to serve you up a website. We have to process scanned checks, send out marketing emails, analyze traffic to detect fraud, and hundreds of other things. The core banking system (and other similar systems in similar companies) may be written in 1960s languages because the existing systems are robust and well-tested, but for that exact reason, they don't require a huge amount of development work.


> the few who are actually skilled in it can command pretty decent salaries and benefits, and we'll search around the world to find them.

Just out of curiosity, what kind of numbers are you talking about?

> On the other hand, we ARE looking at questions like how to get it to run in the cloud.

Call me a masochist but that sounds really fun.


> Just out of curiosity, what kind of numbers are you talking about?

Unfortunately, that's exactly the kind of thing my employer does not want me to talk about.

> Call me a masochist but that sounds really fun.

Oh, it is. It's actually a really interesting project, and from what I've seen so far I think it may well be fully successful and go live with our actual customer records by mid-year 2017.


The core banking system (the one that actually processes transactions) may still be in COBOL. The other add-on services are probably in other languages. I know a few banks here (Thailand) that have successfully transitioned their core banking to Java, so I would guess many large banks have done the transition too. However, I doubt smaller banks would have done the transition.


Intuitively I'd feel like the larger the bank, the more work such a transition would be and thus the smaller the likelihood of it having happened.


You're correct. It is a massive transition, middle-layer upon middle layer.

Oracle markets a completely different code-base (which is fully Java) for smaller banks, even single-currency code-bases, distinct from those for larger banks, with small third-party 'partners' providing implementation.


The larger the bank (or any company, really) the more labor they've been devoting to merger cleanup.


And nuclear power plants are still managed by systems from the 70s-80s, simply because they still operate and still do what they have to.


Nobody wants to be that guy who has to rewrite the nuclear power plant software.

I do wonder what they use. I assume off-site computers aren't really an option for networking reasons, so Stratus, NonStop, z mainframes, etc probably. I wonder if they have a backup mainframe, or if the redundancy in one of those things is enough.


>Nobody wants to be that guy who has to rewrite the nuclear power plant software.

Nobody is allowed to rewrite the nuclear power plant software. My first job was at Westinghouse Nuclear Division. Those Fortran libraries were off-limits in terms of changes.


No surprise, same story in energy balance circle management to calculate the optimal schedule.


> energy balance circle management

WTF? What is energy balance circle management?


When you are creating a schedule for all the power plants, taking into consideration predictions for home and industrial users, and trading as well. There can be domestic and foreign trading involved. Not sure what it is called in English.


Is there a good move here? Update or not?


I used to work at a computer museum, and we had an old computer from the 60s that was used in a plant. They came back and took it because they were upgrading their main computer and wanted to keep redundancy, i.e. a backup for their backup.

In short, yes. They always have two machines running ready to swap in. I have heard some also run multiple live mainframes to check that the results agree, and discard and restart calculations if they don't.


For example systems like RC 4000.

http://brinch-hansen.net/papers/1967b.pdf


Also I heard part of the reason is that those systems are trivially maintainable on-site; if something breaks, you take out a soldering iron and a box of spare capacitors, and go fix it.


Agreed - but also consider 'multiple points of redundancy'.

If you have your service spread across 10K servers, well, it doesn't matter if some go down.

Instead of making them all 100% super fault-tolerant for a crazy expensive unit price, you can make them relatively cheap and replace them when they fail.
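The arithmetic behind that trade-off can be sketched with a simple binomial model. The availability figures below are made up for illustration, and the model assumes failures are independent (which, as others in this thread point out, is optimistic):

```python
from math import comb

def availability_at_least(n, k, a):
    """Probability that at least k of n independent servers,
    each with availability a, are up at the same time."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# One hypothetical "five nines" fault-tolerant box...
print(availability_at_least(1, 1, 0.99999))

# ...vs. a pool of ten cheap "two nines" boxes, where the service
# keeps running as long as any 7 of the 10 are up.
print(availability_at_least(10, 7, 0.99))
```

Under these invented numbers, the commodity pool matches the expensive box while tolerating the loss of several machines, which is roughly the economic argument for replacing cheap nodes as they fail.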


Worked at a big co switching off Tandem. Went from a giant Tandem machine that cost 2 million dollars per year to rent to about 6 off-the-shelf Dell servers for 6k each, one-time fee. Performance at the end was about 3x the old system. It was a glorified billing system. If one node went down, who cares; the other 5 did the job.

Most people don't value 99.999 vs 99.9999 reliability as worth millions of dollars per system per year. Space shuttles may disagree, but not billing platforms.


Note that the space shuttle didn't use processors running in lockstep, like the Tandem machines do (from what I understand). It used 5 single-core computers. Four of them would run the primary software, each of the four controlling a single control channel.

For something like the elevons, control was accomplished by connecting 3 actuators to 3 of the channels, with voting being accomplished by physical force - if the 3 computers disagreed, the 2 would overpower the one. (Things like thrusters used electronic voting, close to the thruster itself.)

This seems closer to modern architectures with multiple computers, than the mainframe idea of redundant hardware.
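The electronic-voting case can be sketched as a plain majority voter. This is a toy model, not the actual shuttle logic; a strict-majority check is exactly what fails in the 2-1-1 and 1-1-1-1 splits quoted in the incident story downthread:

```python
from collections import Counter

def majority_vote(channel_outputs):
    """Return the value a strict majority of channels agree on;
    raise when no majority exists (a split vote)."""
    value, votes = Counter(channel_outputs).most_common(1)[0]
    if votes > len(channel_outputs) // 2:
        return value
    raise RuntimeError("no majority among channels")

print(majority_vote([42, 42, 42, 41]))  # one faulty channel is outvoted
```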


Note that all those computers shared the same bus, so the complete system wasn't as redundant as hoped, as this story[1] retells:

"At 12:12 GMT 13 May 2008, a space shuttle was loading its hypergolic fuel for mission STS-124 when a 3-1 disagreement occurred among its General Purpose Computers (GPC 4 disagreed with the other GPCs). Three seconds later, the split became 2-1-1 (GPC 2 now disagreed with GPC 4 and the other two GPCs). This required that the launch countdown be stopped.

During the subsequent troubleshooting, the remaining two GPCs disagreed (1-1-1-1 split). This was a complete system disagreement. However, none of the GPCs were faulty. The fault was in the FA 2 Multiplexer Demultiplexer. This fault was a crack in a diode. This crack was perpendicular to the normal current flow and completely through the current path. As a crack opened up, it changed the diode into another type of component ... a capacitor.

Because some of the bits in the signal are smaller than they should have been, some of the GPC receivers could not see these bits. The ability to see these bits depends on the sensitivity of the receiver, which is a function of manufacturing variances, temperature, and its power supply voltage.

From the symptoms, it is apparent that the receiver in GPC 4 was the least sensitive and saw the errors before the other three GPC. This caused GPC 4 to disagree with the other three. Then, as the crack in the diode widened, the bits became shorter to the point where GPC 2 could no longer see these bits; which caused it to disagree with the other GPC. At this point, the set of messages that was received correctly by GPC 4 was different from the set of messages that was correctly received by GPC 2 which was different again from the set of messages that was correctly received by GPC 1 and GPC 3. This process continued until GPC 1 and GPC 3 also disagreed with all the other GPC."

[1] Adapted from https://c3.nasa.gov/dashlink/projects/79/wiki/test_stories_s...


Is there a website which collects/links to such anecdotes and surfaces interesting ones? I would love to read more such stories.


Tandem computers did not run software in lockstep to achieve fault tolerance. They were shared-nothing parallel clusters before shared-nothing parallel clusters were cool, with message-based communication between processes independent of the node each was running on. This gave near-linear scalability as the number of nodes grew. I worked on parallel sorting and parallel data loading back in the 1980s, scaling up to 256 nodes, and Tandem's NonStop SQL developed and commercialized similar parallel database query evaluation in the late 1980s, ahead of its time.
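A toy illustration of the shared-nothing style: each node owns its own state, and the only coupling between nodes is messages. The class names and message format here are invented; the real NonStop machines implemented this with process pairs and an OS-level message system:

```python
import queue

class Node:
    """A node with strictly private state; communication is by message only."""
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.balances = {}          # node-local state, never shared

    def send(self, other, msg):
        other.inbox.put((self.name, msg))

    def step(self):
        sender, msg = self.inbox.get()
        if msg[0] == "credit":
            _, account, amount = msg
            self.balances[account] = self.balances.get(account, 0) + amount
            return self.balances[account]

a, b = Node("A"), Node("B")
a.send(b, ("credit", "acct-1", 100))
print(b.step())  # A never touches B's memory, only its inbox
```

Because senders address a process, not a memory location, the receiving process can live on any node, which is what makes near-linear scaling across nodes possible.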


Also I think the processors ran different, independently developed software, the idea being that it would be unlikely for independent codebases to have the same bugs.


The four main CPUs ran the same Primary Flight Software. There was a separate backup computer running the Backup Flight Software, which was developed by an independent team, to take over if the PFS failed.

There's a bit of debate as to what would have happened if it ever took over.


I expect they did know, but it'd be an interesting exercise to not tell either development team if they were writing the "real" or the "backup" software, and just swap them around to keep people on their toes.


So....

Years ago, I had a customer that had a Tandem which was quite exciting to me as I had yet to encounter one up to that point of my career.

Imagine my letdown when I eventually discovered that the Tandem was used for FTP. ¯\_(ツ)_/¯


Five-way redundancy looks good on paper, but if you use the same vendor and hardware purchased at the same time, you have a single point of failure: when one server fails, there's a high probability that the others will fail simultaneously.


It depends on what you are trying to protect and what options you have if one component fails (replacement).

For highly critical systems that is an issue, but it is being addressed. Take as an example the space shuttle or the Airbus (3/2) primary/secondary flight computers [1], where the backup is built by different companies with different processors ...

[1] https://ifs.host.cs.st-andrews.ac.uk/Resources/CaseStudies/A...


By running statistics on a large enough infrastructure you can see how frequently 2 hard drives in the same RAID fail within days, or even hours!

Same brand, same batch, same operating time, same running temperature, even the same vibration does the trick. :(


We had 3 hard drives (the legendary IBM DeathStars) from the same production batch fail within hours of each other. Part of me was like "Yeah! Awesome quality control!" and a larger part was like "Oh shit." We ended up losing the array and had to restore from backup.


Why?


If the hardware failure is due to a manufacturing problem, then it affects every machine. If it's a problem with the production batch, then again, unless you bought from different batches, you haven't reduced risk much.

And if it's a software problem, then again the redundancy isn't helping you.

Basically you haven't succeeded in getting 5 independent tries at the problem -- the failures are highly correlated.

It's the same reasoning that led to the financial crisis -- you have a CDO or an MBS that has a bunch of different obligations so you think you've diversified away risk, but in actuality there is one big factor that causes everything to fail at once.
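A quick simulation makes the correlation point concrete. All the probabilities here are invented for illustration; the point is only the shape of the result:

```python
import random

def p_all_fail(n, p_indiv, p_common, trials=100_000, seed=1):
    """Estimate P(all n replicas fail) when a common-mode event
    (same batch, same firmware bug) can take out every replica at once."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        if random.random() < p_common:          # shared defect fires
            hits += 1
        elif all(random.random() < p_indiv for _ in range(n)):
            hits += 1                           # all fail independently
    return hits / trials

print(p_all_fail(5, 0.01, 0.0))    # independent only: vanishingly rare
print(p_all_fail(5, 0.01, 0.001))  # a small common mode sets the floor
```

However many replicas you add, the all-fail probability never drops below the common-mode term, which is the "one big factor" in the CDO analogy.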


Also: same shipping path. If I buy four disks from the same shop, then those four disks came in the same delivery to the shop (so they were all dropped at the same time), were handled by the same picker, thrown into the same parcel, and dropped a couple of times the same way.


Given the existence of things like Erlang/OTP, I doubt software is the big issue here.

Rather, the issue is that individual machines are cheap enough that you can have machine-level redundancy at a much lower cost than trying to engineer a single machine with redundant components.

The issue with machine-level redundancy is that they tend to run their own isolated operating systems (which is great from a fault-tolerance perspective, but is rather wasteful from a computational perspective). Operating systems like Plan 9 were billed to help bridge this gap and make clusters of machines communicating over 9P feel like a single unified whole, but they never seemed to catch on (besides maybe the concept of a Beowulf cluster).


They aren't; as VLM points out, they just aren't something you see in large quantities. For some applications you can achieve 100% uptime with networks and clusters; for others you still use doubly or triply redundant processor networks, and various manufacturers have specialty chips for those markets (generally Health, Life, Safety (HLS) type systems).

Sometimes people developed redundant but not non-stop systems. I talked with a VP at Citibank when working at NetApp; they had a number of systems which ran on schedules of alternate days, so one would process transaction records for a while, then another would take over and repeat. They had three identical systems where one was essentially a hot standby for the other two, and new versions of code would be deployed on one, which would run the same transaction records; they would check for the same output, so they could do a 'walking' upgrade of software. Back when Tandems and big Sun iron ruled the roost, those machines were too expensive to have an extra one that was essentially a spare. These days, however, it's much more economical to do that.
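That "same transaction records, check for the same output" step can be sketched like this. The function names are hypothetical, and the real systems compared batch outputs rather than Python callables:

```python
def shadow_compare(trusted, candidate, records):
    """Run the candidate build on the same transaction records as the
    trusted build; any divergence blocks the walking upgrade."""
    return [r for r in records if trusted(r) != candidate(r)]

# Hypothetical stand-ins for the old and new builds.
old_build = lambda txn: txn["amount"] * 2
new_build = lambda txn: txn["amount"] + txn["amount"]

records = [{"amount": a} for a in (10, 25, 40)]
print(shadow_compare(old_build, new_build, records))  # [] -> safe to promote
```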


Modern software hasn't had the capability until recently. These Stratus VOS systems get updates by direct assembly patching over the running kernel. Linux ksplice and kpatch are relatively new, and AFAIK only Oracle supports their use in production, over their own UEK Linux kernel.

Source: I work with several ex-Stratus employees who have told me some of their war stories.


Red Hat too, since 7.2 on x64 (with some support requirements). No experience with it, though.


For my part (sample size of one means nothing of course)...

The stratus server I worked with in around 2002 was, I believe, one of the first to move towards standard x86 configurations and Windows OS.

They did so because, no matter how reliable your server, executives were jumping on Windows as an application server bandwagon and developers were targeting it.

There were several driver and firmware bugs that would BSOD the whole server, and bring the entire fault tolerant OS down.

Moreover, storage was a hack job. They couldn't make hardware RAID cards work in that configuration, and Windows software RAID was a joke, so you were stuck with Veritas storage software. Twice we applied Windows Service Packs only to find it would BSOD due to a Veritas software incompatibility. If it was a more average server I'd apply said update to a test server first, but no one could afford a test Stratus so you were always testing in production, which doesn't help reliability.

Here's a picture of it. It became my homeserver for a few years but frankly it was less reliable than a desktop.

http://imgur.com/a/7010z


One concern is that there is a limit to how reliable a single box can really be, no matter what sort of engineering you throw at it. Consider http://thedailywtf.com/articles/Designed-For-Reliability - because the box was designed to never ever go down, when it did go down, not only was there no failover, but it took twenty-four hours to reboot.


I would think it's largely because the cost difference between a normal server and something like a Stratus grew significantly larger over time.

You could more easily justify the cost when the upcharge wasn't so huge.

And, of course, we all got better at reliable distributed systems with commodity servers. So aside from the cost gap, the uptime gap was closing as well.


I'd tune that a bit and guess that the price of non-Stratus hardware fell through the floor.

In the mid-90s, data centers full of commodity hardware would have been cost prohibitive.


This issue has been fixed for good. If you need uptime, then virtualization is good enough. You can migrate VMs to other hosts at a moment's notice. As long as the failure isn't sudden, you probably have enough time to migrate the VM elsewhere. Good servers, workstations, and UPSes have enough monitoring to warn you.


"every system at the center of a big enterprise"...that's the thing. A lot of enterprises actually don't need this kind of uptime or can get by with other types of redundancy.

Your comment about software, though, is a good point. The last relatively common OS that has that kind of uptime is OpenVMS.


They're not cost effective.

It's possible to get close enough with off the shelf hardware and good system design, so nobody wants to pay 5-10x more for special hardware.


Oh yeah, Mainframes are still built this way.


> Is the issue that modern software just isn't reliable enough to make hardware failures an important part of the downtime?

I think the issue is more about networks never being reliable enough for such reliable computers to ever make much sense. A data center simply cannot give you five nines availability over the internet, so it doesn't matter how reliable everything inside is, you would still need geo redundancy and all that fault-tolerance in a distributed system.
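The underlying point is just serial-chain math: every hop on the path must be up, so availabilities multiply. The figures below are made up:

```python
def end_to_end(*availabilities):
    """Availability of a serial chain: the product of its parts."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# A near-perfect data center is still capped by the network path to it.
print(end_to_end(0.999999, 0.999, 0.9995))  # well short of five nines
```

No matter how reliable the box at the end of the chain, the product is dominated by its least reliable link, hence geo-redundancy.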



