Incompetence at Scale – Grey Enlightenment

As everyone knows, CrowdStrike has been implicated in the largest IT failure ever, leading to mass outages and millions of people being inconvenienced. Flights were grounded. Hospital systems were down. For 3-4 hours early that Friday morning, the world was put on pause until CrowdStrike cobbled together a fix. Remarkably, CrowdStrike stock was only down 11% that day.

Or consider the AT&T data leak last week.

And let’s not forget the May 2017 Equifax data breach, which ranks among the worst and was only a preview of the tidal wave of hacks and leaks to come, made worse by the rise of ransomware. Or the Boeing 737 MAX 9 fiasco.

Why do such incidents persist? First, affected companies incur only small, temporary reputational damage and fines, and then it’s business as usual. Being hacked or having data leaked is treated merely as a cost of doing business. A few weeks of bad press, a drawn-out class action lawsuit, and some customers defecting in the aftermath of a hack or leak are cheaper than investing in better security. Even Congressional hearings are not enough of a deterrent. It also helps that memories are short: seven years after the breach, Equifax is still the most popular consumer credit rating agency.

Second, it’s a matter of asymmetry and incentives. An employee who is a cog in a system, unless directly responsible, does not incur any consequences for contributing to the problem. Employees and lower management have little direct incentive to do more than the minimum or to speak up if there is gross negligence. There is also a culture of secrecy: tech employees are paid a lot of money, which effectively acts as hush money, and they are often prohibited from talking to the press or independent media (like blogs or podcasts) about internal matters.

Working at a high-paying tech company is a dream job for many people: who wants to give that up (and be blacklisted from further employment) by going to the media or being a nuisance? Like many things in life, whistleblowing is subject to survivorship bias: you only hear about the whistleblowers whose revelations led to large payouts or media coverage, or who built a career from it, not those who destroyed their professional careers and got nothing for their trouble.

During the ’70s, car fires were much more common. There was a scandal in which Ford knowingly sold a car, the Pinto, that had a risk of catching fire due to a defect in the gas tank, and had set aside money for lawsuits based on the small but known probability of a Pinto catching fire. The obvious question is, why would Ford knowingly sell cars that could catch fire? The idea of a company setting aside money for lawsuits instead of fixing the problem comes off as yet another cynical manifestation of corporate greed.

But as Milton Friedman reasoned, making a car impervious to fire would make it commercially unviable. So given a choice between cars with a tiny but definite risk of catching fire and no cars at all, the former is preferable. A similar rationalization goes into data leaks and hacks: a certain amount of failure is tolerated and baked in. Of course, like a lot of Friedman’s insights, this is a false dichotomy. It’s possible to make cars safer without sacrificing profitability or having no cars, and as Volvo showed, safety can be turned into part of the brand.

Modifying the Pinto was inexpensive: “The Ford recall placed a polyethylene shield between the tank and likely causes of puncture, lengthened the filler tube, and improved the tank filler seal in the event of a collision.” Domestic car manufacturers were threatened more by foreign competition and changing American tastes favoring fuel-efficient compact cars than by the extra costs of making cars safer.

However, this reasoning or calculation breaks down for interconnected systems or failures in which there are many potential victims per incident. Cars are isolated nodes in a system, and a car fire typically does not affect other cars, but planes have hundreds of passengers. A software upgrade can affect thousands of systems and a mistake can have a cascade or multiplicative effect as we saw with CrowdStrike. A data leak puts millions of users’ data at risk.

Moreover, the Friedman-style intellectual hand-waving, which holds that the free market will fix the problem as hacked or leaky companies lose customers to better-run ones, is wishful thinking and unsupported by reality. Friedman’s assumptions overlook network effects: once a company has achieved market dominance by locking users into a network, the cost of defection is high and the threat of competition no longer applies, unlike in low-friction or interchangeable markets such as food or toothpaste.

Indeed, customers voluntarily continue to patronize companies that have had breaches. Almost every major tech company has had a breach, sometimes several, Yahoo in particular. LinkedIn is still very popular despite being the poster child of data leaks. There are entire databases which sell or aggregate leaked data, providing a treasure trove for criminals who can use this information in the furtherance of other crimes (it helps that people reuse passwords). Just as cars became safer due to regulation, it will take a major overhaul in which negligence and incompetence can no longer be treated as just a cost of doing business.