Finally, a few months into the job, I wrote “break” where I meant “continue,” and caused our payments servers to shut down one by one over the course of several weeks.
So, they don't write tests at justin.tv, and they don't do automatic deployment? Sounds like a great place to work at...
1) We've started introducing tests - as you grow I think they're essential. I would say though, not all bugs can be caught by tests (and in fact, I once went over the last 10 bugs we had fixed and discovered that none of them could have been caught by any testing framework). Here are some kinds of bugs you can't catch with testing:
- The code assumes that an external service is highly available
- The page looks glitchy in IE
- If you use that flash api, computers with an nvidia graphics card will show a green screen
- If you open and close a socket quickly, flash will hard-crash the entire browser...sometimes
- Under peak load, when a vacuum is running, this new query will hose the DB
That said, it sounds like this bug COULD have been caught by testing (though I don't know the details and can't be sure). And tests can be darn useful, when used in the right circumstance. But in the hierarchy of important things for running a production website, here's what I would list in priority order:
- Extensive monitoring and alerting
- Detailed, easily accessible logging
- Code review required and taken seriously
- Hiring good developers
- Automated deployment and rollback
- Integration testing
- Unit testing
We've been working our way down the hierarchy :-)
2) We have 100% automated deployment and rollback, and a neat internal tool called Brigade for managing it that I really should get around to open sourcing some day...
For what it's worth, the payment servers were one of the only parts of our system that did have (manual) tests in place at that time (because it took 24 hours to do a restart). The bug in question was in DB timeout code, and all tests passed (until several weeks later, when the DB had problems). Now, clearly coverage was insufficient, and I do agree that adding more testing (as we now have) is a good idea. But some bugs are always going to slip past. Code reviews are what could have caught this bug (which we also now take more seriously).
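The post doesn't show the actual code, but the failure mode is easy to sketch. In this hypothetical worker loop (the DBTimeout exception, the job handling, and the retry policy are all invented for illustration, not Justin.tv's real payment code), a single `break` where `continue` was intended makes the process quietly stop working the first time the DB times out:

```python
# Hypothetical sketch: a payment worker loop where a DB timeout should be
# retried/skipped. Writing "break" where "continue" was meant makes the
# worker exit the first time the DB hiccups, so servers die one by one
# as timeouts start occurring.

class DBTimeout(Exception):
    pass

def process_jobs(jobs, handle, buggy=False):
    processed = []
    for job in jobs:
        try:
            handle(job)
        except DBTimeout:
            if buggy:
                break      # bug: silently stops the whole worker
            continue       # intended: skip this job, keep serving
        processed.append(job)
    return processed
```

With the correct `continue`, a timeout only skips one job; with the buggy `break`, the first timeout kills the whole loop, which matches servers dropping off one by one once the DB started having problems weeks later. Notably, all tests pass as long as the DB never times out, which is exactly why this slipped through.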
Every hotshot thinks they don't need tests until they try writing tests for a while, then they usually become evangelists.
My shock and amazement come from the 24 hours it takes to restart your payments system. Whaaaaat? If that is correct, you guys have far, far more issues than not having tests.
How do you restart a system which holds open long-lived connections without kicking everyone off?
If you want to be kind to your users, you stop accepting connections on a new server, wait a period of time (the longer the better), and then restart. It's a matter of how many users you're willing to disrupt for how quickly you can restart the system.
The canonical example at Justin.tv is the video system: people broadcast for days to weeks at a time. If you restart the server they're connected to, their stream will be disrupted (even if the auto-reconnect works).
We have a separate system that handles most of the complexity which is stateless, but sometimes you need to restart the actual connection-holding-daemon itself. How would you suggest doing that without disrupting service?
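One common answer, sketched below, is a drain step: stop accepting new connections on the daemon, then wait as long as you can afford before killing whatever is left. The method names (`stop_accepting`, `active_connections`, `shutdown`) are hypothetical placeholders, not Justin.tv's actual API:

```python
import time

def drain_and_stop(server, max_wait_seconds, poll_interval=1.0):
    """Graceful-restart sketch: stop taking new connections, then wait
    (up to a deadline) for existing ones to finish before shutting down.
    `server` is any object with .stop_accepting(), .active_connections(),
    and .shutdown() -- assumed names for illustration only."""
    server.stop_accepting()            # new clients land on other machines
    deadline = time.time() + max_wait_seconds
    while server.active_connections() > 0 and time.time() < deadline:
        time.sleep(poll_interval)      # the longer you wait, the fewer users you kick
    remaining = server.active_connections()
    server.shutdown()                  # disrupts whoever is still connected
    return remaining                   # connections we disrupted anyway
```

This makes the trade-off from the comment above concrete: `max_wait_seconds` directly controls how many users you disrupt versus how quickly the restart completes, and for streams that run for days-to-weeks, no affordable deadline gets `remaining` to zero.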
Unit and Integration Tests don't always catch everything but cluster immune systems do a pretty good job on the important stuff:
When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. This would deploy the change incrementally, one machine at a time. That process would continually monitor the health of those machines, as well as the cluster as a whole, to see if the change was causing problems. If it didn't like what was going on, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong.
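The deploy loop described above can be sketched roughly like this (all the helper functions are assumed placeholders, since the real system isn't shown): roll out one machine at a time, watch health continuously, and on trouble revert everything touched and lock deployments.

```python
# Sketch of a "cluster immune system" deploy loop, with hypothetical
# helpers (deploy_to, healthy, revert, lock_deploys) standing in for the
# real infrastructure.

def immune_deploy(machines, deploy_to, healthy, revert, lock_deploys):
    deployed = []
    for machine in machines:
        deploy_to(machine)
        deployed.append(machine)
        # check every machine deployed so far, plus the cluster as a whole
        if not all(healthy(m) for m in deployed) or not healthy("cluster"):
            for m in deployed:
                revert(m)          # fast revert everywhere we touched
            lock_deploys()         # block deploys until someone investigates
            return False
    return True
```

The key property is that a bad change only ever reaches a small slice of the fleet before the revert fires, which is why this catches whole classes of problems that unit tests structurally can't (load behavior, interactions with live traffic, and so on).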
Monitoring tends to be far more useful than tests for web apps. And despite how advanced this system sounds, it's usually far less work to monitor a web app than it is to test it.
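As a rough illustration of how little machinery basic monitoring needs (the check structure and URLs here are invented, not a real Justin.tv setup), a poller like this already catches the "external service is down" class of bugs that no test suite will:

```python
# Minimal monitoring sketch with a hypothetical check list. `fetch` is
# injected so it can be any HTTP client; it takes a URL and returns a
# status code. Returns the alert strings a pager system would fire.

def check_health(fetch, checks):
    """checks: list of (name, url, expected_status) tuples."""
    alerts = []
    for name, url, expected_status in checks:
        try:
            status = fetch(url)
        except Exception:
            status = None              # unreachable counts as failing
        if status != expected_status:
            alerts.append(f"ALERT {name}: got {status}, want {expected_status}")
    return alerts
```

Run this from cron against a handful of health endpoints and wire the alerts to email or a pager, and you have a crude version of the first item in the priority list above, for a fraction of the effort of meaningful test coverage.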
There seems to be a general assumption that writing [unit?] tests is always the "right thing" to do, even at a startup, and no matter what the code in question does. In my experience that's absolutely not the case.
Just to be clear: Bill's opinion on this isn't universally shared at JTV.
When you're two guys in a garage, unit tests probably slow you down. But at some point they begin to dramatically speed things up: working on another person's code becomes orders of magnitude easier when there's a comprehensive test suite to validate your changes.
The problem is that at some point between first employee and a dozen (or more) engineers, you start to wish you'd spent a bit more time on testing code. Writing those tests from scratch becomes "a project", and it only gets done with herculean effort.
Having been at JTV, looking back it almost seems insane to try to do as much work as we were without good coverage. If it matters, it should get a test.
I think a lot of people see tests as slowing things down, but they also provide a lot of value in documenting intentions, making upgrades easier, etc.
It also matters a lot more when you're hiring relatively new/inexperienced people and throwing them in the deep end. I disagree with Bill mostly because I see the purpose of testing differently than he does: the point is not to catch All Of The Bugs; the point is to make it easier to confidently work on code that you didn't write.
There's a lot of counter-intuitive stuff that happens in production code for a big website, and it's not totally fair to throw inexperienced people at it without at least some safety belts in place.
As much as I dislike pair programming overall, I think that's something I would prefer to use as a safety belt here (because it's only expensive during the period when a new employee is still learning). Pair programming hasn't ever been tried at justin.tv either, though, as far as I recall. I'm actually probably going to give it a try at ZeroCater for new engineers (we're hiring!).
The only time tests aren't the right thing to do is when you can guarantee that you will completely trash that system at some point and that the code is never going to see the light of a production machine. Most code gets refactored out of its initial state, and that's where your tests offer huge benefits. I can say that I didn't truly appreciate the full value of tests until the first time I had to do a massive refactoring of a well-tested app.
Being able to change core fundamental components of a system and trust that everything will Just Work™ when you're done is an awesome feeling.
While I'm not a unit test nazi by any means, if you don't have a QA and deployment process in place to catch things like this, unit and functional tests may be time well spent.
Things are not quite as wild-west at Justin.tv as they once were. We now have QA (Thanks B!), monitoring and deploy management.
I think most small teams start with minimal QA and testing (not the recommended approach), and thus things like this happen often. Rapid development becomes blitzkrieg.
That's what caught my eye too. I like how it's soon followed by "Pushing code fast and often does have a cost, but the benefit in productivity is well worth it." That's certainly true if it's mostly _tested_ code, but there is nothing productive about spending a day trying to find and fix a bug like this in production. Been there, done that.
I'm trying to bootstrap a startup now, and if it fails to gain traction then justin.tv would be one of the first places I'd interview. (Hey, I like watching competitive TF2, and they provide a hell of a platform for casts, I must say.) Now, I know this blog post was written partly for recruitment purposes, but it's sort of making me think twice about whether I'd want to interview there. Bleh.
We're not the wild west we once were, but neither are we a paragon of testing. When I started Justin.tv, I had 1.5 years of software development experience, never on a team larger than 2 developers; I've now been at it for almost 7 years and yeah I don't think I'd do everything the same from the beginning.
Here are some things we now do at Justin.tv because of experiences like the one you mentioned:
- Everything gets code reviewed
- We have started to introduce tests (sorry Bill, they're really quite helpful in the right circumstance)
- We have extensive monitoring for all systems in case something DOES slip through
- We have fully automated deployment (and rollback)
There's a cost to all of these things, either in setup time or in constant maintenance. When you have few users, maybe they're not worth it - as you grow they become essential.
sorry Bill, they're really quite helpful in the right circumstance
I don't think I've ever claimed they're never useful ;) But I do still believe that, out of the things you listed, they have the lowest utility for the amount of effort involved.
Personally, I would do fully-automated deploy and rollback first, closely followed by monitoring, then code reviews, and then if problems were still slipping through the net I'd tell people to start writing unit tests.
Cool to learn that a bigger site tosses programmers into the ring from the start. What's even better is that they don't have meltdowns over mistakes, and they make sure every incident is a chance to learn and improve rather than a reason to worry about losing one's job. There should be more of this in the dev community. I think I'd be much more comfortable in an environment like this, where I'm being challenged, as opposed to spending 3-6 months working under someone else and only half getting it. It may have some implications for their business, but they're confident enough to figure out a fix, so more power to them.
There is a middle ground where you work with some controls in place that don't bring the whole production system down. Fat fingering something (as described in the article) can happen no matter what your experience level is. More than showing that they treat new hires well, I think this reflects poorly on their dev environment...
Really good article. Its purpose is to motivate talented people to join Justin.tv, and I think it does that very well. The first thing I did after reading was look for the available positions, and I actually found one that excited me.
Didn't apply, though. Why, you might ask?
Although the article promotes "doing" more than "doing it absolutely the right and best way," the job description asks for more "doing it the right and best way" than "getting it done." And I don't blame them for it: they have a successful product and would like to keep it that way. What better way to do that than hiring the best of the best?
For now, I am just going to wait and work on improving my skills to that point.
I followed Dreamhack's streams on Justin.tv and it was a really awful experience for me, with several outages right in the middle of live performances and such.
I do remember someone from the staff coming into the overcrowded channels and expressing astonishment that the chat was actually up and surviving the load.
Besides that, some phrases they translated into my language are just preposterous; it's usually better not to translate a site at all than to do it that way.
I did read all the comments where they've pointed out their many improvements; I hope it keeps getting better and better, since a platform like that has great potential.
I subscribed to the NASL stream, paid $25, and suffered a similarly awful experience with video quality. Any question about how to fix it in their chat was met with anger from mods, whose answers were not to spam even though I asked once or twice at most, and now my comments are getting downvoted in this thread. I really liked Justin.tv and had intended to pay for more streams, but as far as I can see it seems like a big joke.
I've had nothing but great experiences from the Justin.tv/Twitch.tv team regarding their tech. NASL and Dreamhack both just had fairly crappy streaming setups and equipment, in my opinion.
When I watched NASL, everything played great for the majority of the first day. Then my stream started randomly locking up for no reason: just the video; the audio would continue to play, and then the stream would catch up a few seconds later.
I'm on a high end gaming PC with oodles of RAM / SSD / i7 on a 50 down 10 up connection so I don't think that watching a stream in 1080p would tax my computer too much.
I also don't think it was the NASL team's fault because the same thing happens on other streams periodically.
It seems to be a bandwidth issue from justin.tv to the end user, but it could be anything; all I know is that it's quite annoying.
I would like to see how many happy OSX users are on Justin.tv. It feels like they didn't test for usability at all. I asked them how it could be possible to code a player that plays HD flash at 1fps. Yes, 1 f p s, whereas other sites can play it fluidly(1). It's as if breaking stuff is their motto, and answering user questions on how to solve it, or explaining why it happens, is not.
Overall, their product is good enough to make me pay for a stream or two. But for a user who is used to free video, and who eventually pays for yours, when you can't deliver: ouch.
We use almost 100% OSX at the office (with smatterings of Ubuntu) so I have to say: yes, we have a lot of happy OSX users. If anything it's the IE users on Windows who should be annoyed.
I watch from home on my macbook air constantly and from work on an old mac mini, so I don't think it's a general OSX issue. What kind of hardware are you on? Which OS version?
Explain why my streams at 720p and up are unplayable on my top-end MacBook Air running 10.6. I have asked Justin.tv support how to fix it and had no response. Honestly, I would not give a shit, but since I paid $25, I would expect to see what I pay for... huh?
That's very strange. Maybe it's a flash version issue? Because I watch NASL all the time on my air (which sounds like the same hardware as yours but maybe slightly worse) and it runs great.
Well, flash drains my battery like nothing else, but that's just flash video. Same thing happens to me on YouTube.
Well, me too. I paid for the NASL stream and couldn't watch it. Surely you can understand that throwing $25 away without getting the service kind of sucks. By contrast, the GSL games from gomtv stream (and replay from the archives) in HD beautifully, without any stutters.
So I want to be a fan, a paying fan, but it's kind of a bummer that I can't.
You're assuming that they actually have real control over player performance on your particular hardware setup, when in fact they have almost no control over it. Go ask Adobe and Apple why flash video (especially anything that isn't h.264) playback occasionally (definitely if you're talking about pre-10.6) sucks on OS X.
Why is your reading comprehension so bad? I said other sites play HD streams very well. I've got a high end Macbook air 10.6 for your information, but they already should know that since I asked their support about it and failed to get a response.
Please give me examples of these other sites that take heterogeneous user generated live streams of varying codecs, profiles, etc. that are being delivered to the flash player that perform much better than Justin on your Macbook.
I think you do not understand the underlying technology well enough to be pointing fingers at Justin's player.