How to Avoid One of the Costliest Mistakes in Software Engineering

by Edmond Lau

A few weeks ago, a young startup engineer reached out to me for advice. Her CEO had given her team 4 weeks to rewrite the web product from scratch. That was over 3 months ago. The project was now 2 months behind, the team was working overtime at 80+ hours per week, and a laundry list of bugs remained. Stress levels were high, and each passing day of not shipping meant that business opportunities at their fledgling startup might fall through. What should she do?

Being caught in the middle of a rewrite project that’s behind schedule sucks. There’s no ignoring it. I’ve been there myself, earlier in my career. One rewrite project I worked on was originally scheduled for 4 months but didn’t ship until after 9 months. Another lasted for 9 months until it was ultimately abandoned. Extremely talented engineers worked grueling hours both times, and some even burned out. One of the worst parts about long rewrites is that when a critical new market opportunity comes up that requires building on top of the code being rewritten, you have to make a difficult choice: do you defer (and possibly lose) the opportunity by building on top of the new codebase and waiting for the new codebase to ship, or do you build it on top of the old codebase and then rewrite it a second time in the new? Neither choice is that appealing.

The rewrite projects I worked on were painful but formative experiences. Since then, I’ve learned to be extremely wary any time someone proposes an ambitious rewrite project. I’m not the only one to feel this way.

The Story of Google Docs

Earlier this year, I interviewed Sam Schillace, the VP of Engineering at Box and formerly the Director of Google Apps, for my upcoming book, The Effective Engineer. One question I asked him was, “What’s the most costly mistake that you’ve seen engineering teams make?”

His response: “Trying to rewrite something from scratch. That’s the cardinal sin.”

Schillace joined Google in 2006 as part of a startup acquisition. His 4-person team had built Writely, the product that eventually formed the basis for Google Docs. The team had built and released a prototype in C#, intending it to be an experiment, only to find it spread virally and explode to half a million users. When Google acquired their startup, their first order of business was to migrate to Google’s infrastructure to leverage its capacity and scaling capabilities. Thousands of users continued to sign up every day, and the team was expending energy patching up a codebase that they knew couldn’t scale.

From my time at Google, I knew that Google didn’t support C# in its software stack.[1] So when Schillace told me about this cardinal sin, I naturally had to follow up and ask, “When you were converting Writely to Google Docs, did you guys have to rewrite everything from scratch?”

He looked at me and said, “We did indeed, although there’s an interesting lesson in there.” When they embarked on the rewrite, one of his co-founders had actually pushed for rewriting parts of the codebase while they were translating it. There was a lot they didn’t like about the codebase they had built. So while translating the application to Java, why not rewrite it at the same time? Why not take the time to make a few improvements? The temptation to do a major rewrite was extremely high. Who wants to rewrite the codebase to an intermediate state that they would have to throw away again?

Schillace had to fight hard to push against that logic. And in the end, he convinced the team to set a very limited scope for their rewrite and to defer other improvements. They set a clear goal of getting the site up and running on its feet in Google’s data centers, and they took the shortest possible path toward achieving that goal. They ramped up on and integrated the product with 12 different internal Google technologies. They spent a week running the codebase through a series of regular expressions and then fixed up tens to hundreds of thousands of file errors to get the code to compile. The semantics of Java and C# would sometimes differ — string comparison in C#, for example, uses logical comparison for double equals (==) while Java checks for reference equality — and they had to painstakingly visit each instance and iron out all the differences.
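To make the string comparison difference concrete, here is a minimal Java sketch. It is not code from Writely; the class and method names are hypothetical and exist only for illustration. It shows why a check that was carried over character for character from C# can compile cleanly in Java and still behave differently.

```java
import java.util.Objects;

public class StringEqualityDemo {
    // Hypothetical: a C# check `a == b` carried over unchanged.
    // In Java, == on Strings compares object references, not characters.
    static boolean translatedLiterally(String a, String b) {
        return a == b;
    }

    // The fix applied at each call site: compare contents (null-safe).
    static boolean translatedCorrectly(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String stored = "writely";
        // Same characters, but a distinct object, as with most strings
        // that are built or read at runtime.
        String typed = new String("writely");

        System.out.println(translatedLiterally(stored, typed)); // false
        System.out.println(translatedCorrectly(stored, typed)); // true
    }
}
```

Because the literal translation still compiles, the compiler cannot flag these cases, which is presumably why each instance had to be visited by hand.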

“That was really, really painful,” Schillace told me. It was the epitome of a grind. After 12 weeks, they ended up with what he described as a “funky, weird, mangled chewed-on code base that didn’t look right.” But it was a codebase that ran successfully in Google’s data centers. His team set the record for being the fastest Google acquisition ever to port onto Google infrastructure. If they had bundled improvements with the rewrite, they certainly would not have finished nearly as quickly, and the project could easily have expanded in scope. If, like many engineers, they had focused on “doing the right thing” and keeping code quality high, they might have eschewed mangling the codebase through a bunch of regular expressions. Instead, they did what was necessary to get Writely up and running as soon as possible.

Lessons Learned

The message isn’t that we shouldn’t ever rewrite or throw away code. We’re always developing software with imperfect information about how it’s going to be used or what the eventual requirements might be. Eliot Horowitz, a co-founder of MongoDB, once remarked at a conference that you should think of code as having a half-life of 3-5 years, so it needs to be refreshed periodically.[2] Jeff Dean, the architect behind Google technologies like MapReduce, Bigtable, and many more, said to design your software to scale by 10x, after which the best design will likely look quite different.

The problem surfaces when we rewrite entire codebases from scratch and in one go. We tend to significantly underestimate the cost and overestimate the benefit.

When we’re inexperienced, we underestimate because we have little idea of how long things will take. We don’t have enough clout or knowledge to say no to aggressive schedules. We lack the prioritization and technical skills to effectively reduce variance in the timelines. We think to ourselves, A good engineering team would be able to do this. We just have to work hard enough and demonstrate that we’re good.

And when we’re experienced, we underestimate partly because estimation is still hard and partly because we’re overconfident about our abilities. It’s a case of illusory superiority. Ask 100 drivers about their driving abilities, and 80% will rate themselves as above average.[3] Ask 100 teaching faculty, and 68% will rate themselves in the top 25% for teaching ability.[4] The same lopsidedness shows up in estimation of IQ, test scores, memory, job performance, and more. So it’s no wonder that many software engineers believe that deadline slips only happen to average or below-average teams and that they themselves are immune to the types of deadline slips that have plagued software rewrites for decades.

When deadlines for rewrites do slip, we often delude ourselves with optimistic beliefs that maybe we’ll just work harder and find ways to make up for lost time. We convince ourselves that there’s no other alternative. It might work once or twice, but as a long-term strategy, it’s not sustainable. You can’t sprint to the end of a marathon.

The best strategy, then, is to be skeptical of the value of rewriting entire codebases from scratch. When you do have to do it, as Schillace did, tackle it with a limited set of goals so that you get as quickly as possible to a state where you’re again making incremental improvements. And if you do find yourself joining the ranks of other software engineers who fell prey to a slipping rewrite project, you need to have an honest but difficult conversation about how to bridge the gap between your desired business target and your underestimate, whether by cutting features, resetting the project to a more realistic timeline, or treating the project as a sunk cost and abandoning it altogether. Tactics exist for each of these, but none is as impactful as how you initially frame the rewrite project.

  1. Google’s social networking site Orkut was originally written and running in C#, but they too had to migrate to a Java stack. 

  2. FirstMark CTO Summit, 2014. 

  3. McCormick IA, et al. “Comparative perceptions of driver ability: a confirmation and expansion”, PubMed.

  4. “Illusory superiority”, Wikipedia. 
