A software engineer receives an alert that a customer is experiencing a critical incident. If the engineer is lucky, this alert hits during normal work hours, so the incident is only interrupting regular tasks. If not, the engineer is most likely sacrificing a meal, personal time, or sleep.
Unfortunately for employees of just about any enterprise today, the odds of a critical alert arriving are higher than they would like. On average, a typical problem will take them just over two hours to respond to, diagnose, and repair. Atypical problems may run for days or weeks.
Only after the incident is handled can the engineer return to regular work or personal time, and that requires both context switching and the ability to shake off the stress and tension of a work emergency.
Best case, worst case
The best-case scenario for the engineer is a few uninterrupted hours until the next alert. The worst case, though, is what companies are increasingly seeing among their engineering and technical teams: more alerts, more late nights, more burnout, and more employee churn as a direct result.
A recent study surfaced a potentially catastrophic trend: alerts and alarms have increased more than 35% this past year, with 19% of that increase marked “critical.” The challenges of digital transformation, combined with accelerating pressure on digital services, are among the biggest problems businesses face right now, and they are affecting customers’ experience and satisfaction as well as the bottom line.
Welcome to the world of managing a successful enterprise in post-pandemic times. If you are an engineer or in IT operations, you are likely to stay with your current employer for two or three years before leaving for better opportunities and (hopefully) less stress. But our study found that, because of these increased interruptions and time requirements, people are simply packing up and leaving. If you are in leadership, you are in a constant scramble to bring qualified team members on board in a market that grows deeper and more competitive every day.
The frantic digital transformation that took place in nearly every industry throughout 2020 only made the work environment more stressful for back-end teams. While their companies reported record profits, employers also saw record turnover and unemployment and struggled to support engineers who were working significantly longer hours with very few mental breaks.
Caught in a downward spiral
These challenges only accelerate the global engineering shortage. With an estimated 40 million technical jobs already going unfilled, the shortage of skilled software engineers is expected to more than double in the next decade. The human impact of this increased demand on a company’s own engineers can be significant.
Large-scale working from home meant companies could keep their doors open and find talented people in new areas. However, for many engineers, it also meant that they were never truly off duty.
Monitoring the human impact
Our hypothetical engineer might face critical alerts all night, and then all week. He or she may be constantly patching where possible to keep customers happy, without time to address underlying issues. Our study found that working hours were much less consistent in 2020 for more than a third of engineers, and that inconsistency meant they ended up working more than two extra hours per day. That means we’re asking people to work up to an extra day a week, or 10 to 12 extra work weeks a year. That isn’t sustainable.
That kind of “all hands on deck” effort is possible to sustain through a period of emergency action. But when limited resources and people stretch it to months, morale evaporates, productivity drops and people burn out. When there is no reasonable end in sight, engineers will start looking for greener pastures.
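As a rough check on the extra-hours estimate above, the arithmetic can be sketched in a few lines. The 5-day week, 8-hour day, and 48 working weeks a year are illustrative assumptions, not figures from the study.

```python
# Back-of-the-envelope check of the extra-hours estimate.
# Assumptions (not from the study): 5 workdays a week, 8-hour workdays,
# and 48 working weeks a year.

def extra_load(extra_hours_per_day=2, workdays_per_week=5,
               hours_per_workday=8, working_weeks_per_year=48):
    """Convert extra daily hours into extra days per week and weeks per year."""
    extra_hours_per_week = extra_hours_per_day * workdays_per_week
    extra_days_per_week = extra_hours_per_week / hours_per_workday
    extra_hours_per_year = extra_hours_per_week * working_weeks_per_year
    hours_per_work_week = workdays_per_week * hours_per_workday
    extra_weeks_per_year = extra_hours_per_year / hours_per_work_week
    return {
        "extra_days_per_week": extra_days_per_week,
        "extra_weeks_per_year": extra_weeks_per_year,
    }

print(extra_load())
```

Under these assumptions, two extra hours a day comes to 1.25 extra workdays a week and 12 extra work weeks a year. Assuming 40 affected weeks instead of 48 gives 10 weeks, the lower end of the range.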
Monitoring team health and operations should be a top priority for any IT leadership team. Organizations need to consider skill distribution and escalation patterns among their engineers, and conduct post-mortem reviews to weed out repetitive efforts.
In addition, actively managing resources to help spread the workload equally is crucial to making everyone’s life a little easier. If our engineer goes on to have a particularly challenging couple of days, for example, the manager might consider adjusting duties and shifting workloads for the rest of the week.
Building a better system
For companies that are serious about monitoring and protecting team health, there are technology tools that will help ease the burden on their engineers. Machine learning and artificial intelligence for IT operations (AIOps) can help teams respond to work needs in real time across the entire organization, ensure they don’t get bugged with false alarms, and enable automation that spares them the interruption entirely.
AIOps can sort through alerts, routing them to the right person at the right time and even automating responses when they don’t require an engineer’s immediate attention. Streamlining broken processes and measuring productivity among team members will help identify where needs are felt most deeply. Automation can also empower your first-line responders and keep your higher-skilled developers doing what they do best: coding.
Reducing this alert noise boosts overall efficiency by cutting down on interruptions and limiting the number of responders needed to address issues. Together, these benefits provide long stretches of regular working hours for teams and minimize intrusions on personal time.
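The sorting and routing described above can be illustrated with a minimal rule-based sketch. It is not any vendor's API or a real AIOps model: the `Alert` fields, the `AUTO_REMEDIABLE` table, and the action names are all hypothetical, and a production system would typically learn these rules rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    severity: str  # "critical", "warning", or "info"

# Hypothetical table: alert summaries that have a known, safe automated fix.
AUTO_REMEDIABLE = {"disk usage high": "rotate-logs"}

def triage(alerts):
    """Return (alert, action) pairs: suppress repeats, automate, page, or queue."""
    seen = set()
    decisions = []
    for alert in alerts:
        key = (alert.service, alert.summary)
        if key in seen:
            # Same service and summary already handled: treat as noise.
            decisions.append((alert, "suppress-duplicate"))
            continue
        seen.add(key)
        if alert.severity != "critical" and alert.summary in AUTO_REMEDIABLE:
            # Non-critical and a known fix exists: no human needed.
            decisions.append((alert, "auto:" + AUTO_REMEDIABLE[alert.summary]))
        elif alert.severity == "critical":
            # Critical alerts interrupt the on-call engineer.
            decisions.append((alert, "page-on-call"))
        else:
            # Everything else waits for first-line responders.
            decisions.append((alert, "queue-first-line"))
    return decisions
```

In this sketch, only the first critical alert for a given service and summary pages anyone. Repeats are suppressed, known low-severity problems are handled by automation, and the rest wait in a queue for first-line responders.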
Digital transformation isn’t going to slow, and neither are the challenges generated by an increasingly digital marketplace. That’s why keeping your engineers and technical teams happy and motivated will only become more important as the global talent shortage grows worse.
Adopting AIOps will help companies address two urgent business outcomes: increasing customer happiness and protecting the mental health of the workers who deliver it. Automating the systems that generate noise and removing toil and repetitive work are the keys to accomplishing both of those goals.
About the Author:
Michael Cucchi, vice president of product at PagerDuty