The most epic developer fails of 2017: Lessons learned

As organizations increasingly pin the success of their businesses to the innovation and quality of software shipped by their in-house engineering teams, developers are working under a veritable microscope.

Software engineers are central to digital disruption. For good or bad, they now receive much attention throughout the organization. When things are good, they're rock stars. When things go wrong, they're reviled. 

For some developers, things have gone really wrong in 2017. The year has seen more than its share of major developer-caused failures, from the freezing of millions of dollars in cryptocurrency to fatal car software bugs to leaks of sensitive data on GitHub.

As we close out the year, TechBeacon has collected stories of some of the most notable developer fails. We asked experts what they think went wrong and what you can learn from other people's big mistakes.

Gartner Magic Quadrant for Software Test Automation 2017

Production database deleted on day one

This summer, the development community was captivated by the story of a junior developer who deleted a production database on his first day on the job. While running a script to create a local database, he used the configuration values shown in a company wiki-style document rather than those output by the script.

Those values, inexplicably, pointed to a production database. Once they were loaded into the automated tests, the tests scrubbed the database. By all indications, the company's backups had never been tested, which made the situation worse. The developer was fired on the spot, but industry consensus is that the wrong person got fired.
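One inexpensive backstop for this kind of accident is a guard that refuses to run anything destructive against a host that is not explicitly local. The following is a minimal sketch of that idea, not a description of what the company in the story ran. The allow-list, the function names, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: the only hosts a destructive script may touch.
SAFE_HOSTS = {"localhost", "127.0.0.1", "::1"}


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive operation points at a non-local database."""


def assert_safe_target(database_url: str) -> str:
    """Return the host if it is on the allow-list, otherwise raise."""
    host = urlparse(database_url).hostname
    if host not in SAFE_HOSTS:
        raise UnsafeTargetError(f"refusing to wipe database on host {host!r}")
    return host


def reset_database(database_url: str, wipe) -> None:
    """Run the destructive `wipe` callable only after the guard passes."""
    assert_safe_target(database_url)
    wipe(database_url)
```

With a guard like this in the setup script, pasting a production connection string from documentation would fail loudly instead of scrubbing data. It complements, rather than replaces, keeping production credentials out of shared documents.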

"My response to this situation is simple: Fire the CTO," said Todd Williams, a development project management consultant and author of Filling Execution Gaps. "That a developer could possibly do this is the problem. And the fact that this document had the production system configuration information in it is even worse."

It's a situation that a new developer should never have been put into, let alone fired for, agreed Shefik Macauley, a longtime software architect and member of the Forbes Technology Council.

"This is certainly a developer nightmare, but I wouldn't have fired him—especially when it's just his first day!" Shefik said. "Safeguards need to be put in place, particularly when on-boarding new developers to a team. In the case of junior developers, they should not be granted direct access to the production server."

Sensitive data exposed on GitHub

GitHub has become a major resource for development shops in recent years, but sloppy content management and data controls have also made it a huge point of exposure when software teams aren't careful about what they publish there.

This point was brought firmly into focus in June when a third-party security researcher stumbled onto a huge mistake made by a development team from outsourcer Tata Consultancy Services. A developer on the team managed to publish a boatload of source code, development plans, and documents related to development work for nine global financial institutions.

"This is just a mouse click away for just about anybody," said Williams, adding that the potential for exposing sensitive information through poor controls and configurations isn't restricted to GitHub. Similar situations could arise on other content storage platforms, such as Google Docs, Dropbox, Box, or any other cloud sharing app that can be made public. "As with most developer fails, this is at its root a controls issue." 

For developers, though, GitHub has been a particular problem, since many developers rush to commit code without sanitizing what goes into their repositories, including things such as plain-text credentials, encryption keys, and OAuth tokens. It's such a problem that a growing number of scanning tools are designed specifically to root out sensitive data in your repos. The trick is that organizations need to build in processes to not only avoid these exposures but also spot-check for offenders.
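To give a sense of how such scanning tools work, here is a minimal sketch in Python. It checks text line by line against a handful of regular expressions. The three rules and the `scan_text` helper are illustrative assumptions, and real scanners ship far larger rule sets and also inspect commit history.

```python
import re

# Illustrative rules only; the names and patterns are assumptions for this sketch.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def scan_text(text: str):
    """Return (line_number, rule_name) pairs for every suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A check like this could run as a pre-commit hook or in continuous integration, so that a flagged line blocks the push rather than being discovered after publication.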

Buckle up: Bug disables seatbelts, airbags

One of the biggest examples of life-or-death software quality issues this year came by way of Fiat Chrysler Automobiles, which had to recall 1.25 million Dodge Ram pickup trucks due to a potentially fatal bug in its onboard computer software.

The error can deactivate the truck's airbags and seatbelt pretensioners during a rollover crash if the undercarriage strikes something while the vehicle is driving off-road, which is exactly the situation in which a driver is most likely to roll over.

"I think it really speaks to the heart of software development's potentially life-threatening implications," said Dave Hatter, a software developer with 25 years of experience who works for a financial institution. He said that while the complexity of software in our lives is making it harder to write tests that can cover every single potential risk, manufacturers must have a higher standard of testing for software that can threaten the physical well-being of a customer. "You really have a moral duty in my mind to try to make sure that the software is as robust and stable as it can be."

Internal Windows builds moved into production

As organizations speed up their deployment cadences, release management becomes more important than ever. There's nothing like a good old-fashioned public screw-up to show why that is, and you can thank someone from the Microsoft Windows 10 development team for this particular example.

Earlier this year, the team accidentally pushed internal builds into production. For the most part, the builds were sent only to users in Microsoft's Windows Insiders testing group, but a small contingent of regular mobile device users also received the update, which bricked their devices.

"This really sounds like human error on the build team," said Hatter. "In my mind, release management really is the issue here."

Wallet bug evaporates $280 million

The perceived stability of cryptocurrency platforms took a big hit on November 6 when a software bug rendered $280 million inaccessible to Ethereum cryptocurrency wallet holders.

This debacle came as the culmination of a cascade of screw-ups by cryptocurrency platform vendor Parity. In July a bug in the company's platform allowed attackers to steal $32 million in the Ethereum value token, ether. When the company went in to fix that particular bug, it introduced a new bug that didn't manifest itself until November. That's when a user somehow managed to trigger it to gain sole access to dozens of multi-signature wallets and then delete the code library that supports those wallets.
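Public write-ups of the incident describe the shared library as code whose ownership had never been initialized, which let an outsider claim it and then destroy it. The following is a simplified Python model of that failure pattern, offered as a sketch under that assumption. It is not Parity's contract code, and every class and method name here is hypothetical.

```python
class WalletLibrary:
    """Toy stand-in for shared wallet logic; not Parity's actual contract."""

    def __init__(self):
        self.owner = None   # ownership never set at deploy time: the latent bug
        self.alive = True

    def init_wallet(self, caller: str) -> None:
        # Flaw: anyone may claim ownership while the owner is still unset.
        if self.owner is None:
            self.owner = caller

    def kill(self, caller: str) -> None:
        if caller != self.owner:
            raise PermissionError("only the owner may destroy the library")
        self.alive = False

    def withdraw(self, balance: int, amount: int) -> int:
        """Return the new balance after withdrawing `amount`."""
        if amount > balance:
            raise ValueError("insufficient funds")
        return balance - amount


class Wallet:
    """Holds funds but delegates all of its logic to the shared library."""

    def __init__(self, library: WalletLibrary, balance: int):
        self.library = library
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        if not self.library.alive:
            raise RuntimeError("wallet logic is gone; funds are frozen")
        self.balance = self.library.withdraw(self.balance, amount)
```

In this model, one stranger calling `init_wallet` and then `kill` freezes every wallet that depends on the library, even though no balance changes. Setting the owner at construction time would close the hole in the sketch.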

Parity is still scrambling to fix this situation, and we'll likely have lots of lessons to learn from it in the months to come. But for now, said Hatter, the top lesson is that while speed is always paramount in development, rushing a project can have dire consequences.

"They fixed that bug in a hurry, which caused a new glitch, which essentially evaporated almost $300 million," he said. "There's often such a rush to be first, such a rush to fix something, or such a rush to have a killer app that you're not testing as much as you should."

Apple's 'I' autocorrect

While the consequences weren't nearly as monetarily significant as the Ethereum screw-up, similar regression issues also plagued Apple's development team in a very public fashion. In an update rushed to production to address the KRACK Wi-Fi vulnerability, Apple introduced an autocorrect bug that replaced "I" with "A" followed by a Unicode question-mark symbol. The quirk irritated Apple users for over a week following the iOS 11.1 release.

The root cause of the glitch is unknown, but at least one development expert speculated that it could be part of an intentional programming trigger that should have been eliminated prior to going live.

"It seems like an intentional trigger that perhaps Apple configured internally to execute a programming sequence in iOS, one that Apple overlooked to disable when shipping that version of software to the general public," Shefik said. "It is not uncommon for development teams to create code triggers to execute certain programming sequences, such as for streamlined quality assurance and regression testing purposes. However, the triggers must be disabled for outside use, and even such disabling must be a part of the quality assurance testing."

Regardless of the real root cause, this autocorrect bug offers evidence of how even the most seemingly trivial software quality issues introduced by developers can greatly impact a brand's reputation.

Memory leak exposes sensitive data 

Content delivery network vendor Cloudflare put sensitive data from some of the biggest brands on the web at risk, including Uber, Fitbit, and OkCupid, through a bug in its codebase that leaked the contents of server memory. The bug, in production for over five months, potentially exposed customers' passwords, cookies, and authentication tokens. The flaw was in Cloudflare's HTML parser, which leaked data in roughly 1 in every 3.3 million HTTP requests. That worked out to as many as 120,000 leakages per piece of exposed data in a single day.

The flaw came about amid a transition the company was making from a legacy Ragel-based parser that was becoming onerous for the team to maintain. The bug itself was there for years, but it was mitigated by the way the code was previously written. As the team made the transition and massaged the code, the leak manifested itself.
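Cloudflare's published postmortem traced the overrun to an end-of-buffer check that tested the parser's pointer for equality with the end position, so a pointer that stepped past the end never matched. The Python sketch below models that pattern on a simulated block of memory. The `parse` function, its escape-skipping rule, and the sample bytes are assumptions made for illustration, not the generated Ragel code.

```python
def parse(memory: bytes, start: int, end: int, safe: bool) -> bytes:
    """Copy bytes from the buffer memory[start:end], skipping escaped bytes.

    `memory` simulates the process heap: the request buffer is followed by
    unrelated data. With safe=False the end check uses equality, as in the
    latent bug; with safe=True it uses a greater-than-or-equal comparison.
    """
    out = bytearray()
    p = start
    while p < len(memory):                  # physical end of simulated memory
        at_end = (p >= end) if safe else (p == end)
        if at_end:
            break
        if memory[p:p + 1] == b"\\":        # escape: skip this byte and the next
            p += 2                          # may jump past `end` in one step
            continue
        out.append(memory[p])
        p += 1
    return bytes(out)


# A 3-byte buffer that ends in an escape, followed by someone else's data.
memory = b"ab\\" + b"SECRET"
leaky = parse(memory, 0, 3, safe=False)     # b"abECRET": reads past the buffer
fixed = parse(memory, 0, 3, safe=True)      # b"ab": stops at the buffer end
```

The equality check behaves correctly as long as the pointer only ever advances one byte at a time near the end of the buffer, which is how a bug like this can sit unnoticed for years until surrounding code changes.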

"[This] points out the frailty of modern systems to latent bugs in old software that can be suddenly exposed and exploited through the smallest of code changes," said Gunter Ollmann, CTO of security for Microsoft and advisor for Vectra Networks.

Good controls keep mistakes from becoming epic fails

The real lesson IT leaders should take from these developer fails is that no developer or development team should ever be a single point of failure. If organizations want to set their developers up for success, they need control-based backstops that keep simple mistakes from turning into epic fails.
