Mitigating the Cost of Downtime in the Age of Artificial Intelligence
Predicting hardware failures with reasonable accuracy can yield substantial savings for service providers with large capital investments in information technology (IT). For enterprise customers operating costly high-performance computing (HPC) and artificial intelligence (AI) platforms at scale, the availability of their service offering is everything: A downed system locks out customers and traps resources, rendering them useless at great cost. An outage of this sort can cause a loss of revenue, reduce employee productivity, and damage a company’s brand. Depending on the industry, studies put the average cost of unplanned downtime anywhere from $100,000 to well over $500,000 an hour.[1] [2]
Enter Predictive Failure Analysis (PFA)
By evaluating large amounts of historical data, predictive failure analysis (PFA) can offer valuable insights into the likelihood of an outage. Chipsets, circuit boards, hard drives, and soldered connections all have a finite useful life. Trends in historical failure data may point to a time horizon for future failures.
For large equipment or automobile manufacturers (and their customers), PFA can potentially offer improvements to asset life expectancy leading to reduced future spend of up to 5%. PFA can also be used to schedule maintenance when operators and technicians are more freely available and cheaper, creating efficiencies and savings of up to 20%.[3]
Numerous factors can affect the performance of PFA. Its accuracy depends on the average workload of the system in question, the scope of historical data used, and the machine learning (ML) or deep learning (DL) algorithms involved.
Linear and polynomial regressions are often used to estimate remaining useful life (RUL), while Long Short-Term Memory (LSTM) and random forest algorithms can be used to refine failure predictions with varying degrees of success.[4] [5] [6]
While it’s clear that there are tangible benefits with PFA, it is not foolproof. To provide the same level of service availability and to protect the value of capital investments, it is wise to consider augmenting any support program using PFA with a remote hands contract.
Remote Hands as an Insurance Policy
Using remote hands in conjunction with PFA offers benefits for both planned and unplanned outages.
A planned outage can be scheduled well in advance, thereby allowing resources to be assigned when they are freely available and most affordable. Remote hands providers also enjoy economies of scale: With plentiful resources to deploy, significant savings can be had over hiring, training, and managing your own technicians. (One fully burdened engineer can easily cost six figures or more annually, depending on the requisite skill sets.)
With any unplanned outage, the restoration of service is the primary concern. An unplanned outage without a remote hands contract will typically take longer to resolve, because the resources sent to troubleshoot and solve the problem onsite would first need to be verified, insured, and onboarded. A remote hands provider, with resources at scale, will have already vetted and assigned resources, resulting in a faster response time.
Strategic Technology Investment
Depending on the nature of the service, a sustained outage for a large environment could easily cost millions of dollars. A remote hands contract, whether on its own or as part of a broader disaster recovery plan that includes PFA, can be treated as an expense line item, or potentially even capitalized as part of a larger software or service subscription.
A slight increase in operational costs could very well protect you from the millions of dollars in losses associated with a prolonged service outage.
What approach should you use to justify an investment in remote hands support? Corporate finance metrics vary from company to company, but a few illustrations can be drawn.
Case Study
You are evaluating a remote hands contract for $250,000 to cover three environments in data centers in North America for one year. Your estimated cost of downtime is $100,000 an hour all in (including stranded or idle resources, lost revenue, brand impact, etc.). The last time you experienced an outage due to failed hardware, your application was down for six hours. The net impact to the company was $600,000.
Corporate finance won’t approve IT spend unless an investment clears a hurdle rate (sometimes known as a minimum acceptable rate of return or MARR) of 10%.
An anticipated benefit of a remote hands contract is a shorter mean time to recovery (MTTR) for an unplanned outage. Estimates suggest that MTTR can be reduced significantly. Had MTTR been 50% lower for the previous outage, service would have been restored three hours sooner, saving $300,000.
Should corporate finance authorize the purchase of this remote hands contract as an insurance policy to help reduce future unplanned downtime?
We’ll use a simple formula for ROI:
ROI = (Net Profit / Cost of Investment) * 100
In this case, the net profit would be the $300,000 savings minus the cost of the $250,000 contract, or $50,000.
ROI = (($300,000 - $250,000) / $250,000) * 100
$50,000 divided by $250,000 is 20%, or double the MARR required by finance.
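The same arithmetic can be written as a short script, which makes it easy to substitute your own figures. This is a minimal sketch: the variable and function names are illustrative, and the inputs are the case study numbers.

```python
def roi_percent(savings, cost):
    """ROI = (Net Profit / Cost of Investment) * 100, with net profit = savings - cost."""
    return (savings - cost) * 100 / cost


# Case study inputs
downtime_cost_per_hour = 100_000  # dollars, all in
outage_hours = 6                  # duration of the previous outage
mttr_reduction = 0.50             # assumed improvement with remote hands
contract_cost = 250_000           # one-year contract, three environments
hurdle_rate_percent = 10          # MARR required by corporate finance

# Hours of downtime avoided, priced at the hourly cost of downtime.
savings = downtime_cost_per_hour * outage_hours * mttr_reduction
roi = roi_percent(savings, contract_cost)
clears_hurdle = roi >= hurdle_rate_percent

print(f"Savings: ${savings:,.0f}")
print(f"ROI: {roi:.0f}%")
print(f"Clears {hurdle_rate_percent}% hurdle: {clears_hurdle}")
```

The result is only as reliable as the assumed 50% MTTR reduction, so it is worth rerunning the calculation with more conservative values before presenting it to finance.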
(Keep in mind, this return is measured against only one outage. With multiple unplanned outages in a single year the savings would be considerably higher.)
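To see how the return changes with outage frequency, the sketch below computes the ROI for a given number of outages per year and the break-even number of downtime hours the contract must save to clear the hurdle rate. The function names are illustrative. It also assumes, purely for illustration, that every outage resembles the previous one (three hours saved at $100,000 an hour), which real outages will not.

```python
def annual_roi_percent(outages, hours_saved_per_outage, cost_per_hour, contract_cost):
    """ROI for a year in which each outage is shortened by the same number of hours."""
    savings = outages * hours_saved_per_outage * cost_per_hour
    return (savings - contract_cost) * 100 / contract_cost


def break_even_hours(contract_cost, cost_per_hour, hurdle_rate_percent):
    """Total downtime hours that must be avoided for ROI to equal the hurdle rate."""
    required_savings = contract_cost * (100 + hurdle_rate_percent) / 100
    return required_savings / cost_per_hour


# Case study figures; the outage counts are hypothetical.
for outages in (1, 2, 3):
    roi = annual_roi_percent(outages, 3, 100_000, 250_000)
    print(f"{outages} outage(s): ROI = {roi:.0f}%")

hours_needed = break_even_hours(250_000, 100_000, 10)
print(f"Hours of downtime to avoid to clear the hurdle: {hours_needed:.2f}")
```

Under these assumptions the contract clears the 10% hurdle once it avoids 2.75 hours of downtime in the year, which is slightly less than the three hours saved in the single-outage scenario.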
Corporate finance should approve the investment in the remote hands contract.
Summary
Artificial intelligence has made great strides in the field of predictive failure analysis, and the effectiveness of PFA is likely to keep improving in the coming months and years.
In the meantime, the need for investment protection remains. A strategic investment in a remote hands contract can help mitigate the financial impact of unplanned outages, while helping you capitalize on the flexibility of planned downtime windows.
Notes
[1] https://medium.com/@brijesh_soni/why-random-forests-outperform-decision-trees-a-powerful-tool-for-complex-data-analysis-47f96d9062e7
[2] Yadav, D. K., Kaushik, A., & Yadav, N. (n.d.). Predicting machine failures using machine learning and deep learning algorithms. ScienceDirect. https://www.elsevier.com/locate/smse
[3] https://www.bakerhughes.com/bently-nevada/blog/unplanned-downtime-key-disruptor-industry
[4] https://medium.com/@jatin2707/machine-failure-prediction-a-comprehensive-guide-524726c3b1fd
[5] https://www.atlassian.com/incident-management/kpis/cost-of-downtime
[6] From “Predictive Maintenance: Deloitte’s Approach” https://www2.deloitte.com/content/dam/Deloitte/us/Documents/process-and-operations/us-predictive-maintenance.pdf
Topics: Remote hands, machine learning (ML), artificial intelligence (AI), deep learning (DL), productivity, data centers, cloud, predictive failure analysis (PFA), recurrent neural network (RNN), high-performance computing (HPC), corporate finance, strategic investments, linear regression, Long Short-Term Memory (LSTM), random forests, portfolio theory.