At Sensu Summit 2019, Adam Westman, Sr. Engineering Manager at Target, introduced us to GoAlert, their on-call scheduling and notification open source project. In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.
Adam Westman discusses GoAlert at Sensu Summit 2019.
Scale and challenges at Target
As the eighth largest retailer in the United States, Target’s scale and scope present a lot of interesting challenges for their monitoring and engineering teams. Across the country, they have:
- More than 1,800 stores
- 30 million in-store guests each week
- 100 million monthly online shoppers
- 40 distribution centers
- Two wholly owned data centers, with extensive workloads in the cloud
- More than 350,000 worldwide team members
- An additional 100,000+ team members during the holiday season
Their goal is to provide delightful guest experiences, which means making sure each store’s computing, monitoring, and telemetry needs are met. When shoppers arrive in stores or online, Target wants to ensure that the entire guest experience is smooth, whether they’re using in-store kiosks to register for their new baby or upcoming wedding, or simply paying in a checkout line. In order to provide those seamless experiences, Target team members need to be able to use store-critical platforms with ease.
The work Adam’s team does supports the daily tasks of millions of people, and it’s a responsibility they don’t take lightly. The amazing scale at which they operate requires creative solutions to some very difficult challenges.
With such a massive enterprise, Target needs a monitoring platform that’s automated, reliable, and built to scale, particularly during the holiday season when they have dramatic spikes in in-store and online traffic. It’s also important that their tools and processes align with company values: Target is committed to being guest-focused, experience-centric, and team-empowered.
Target’s technology journey
As Adam’s team set out to revise their monitoring practices, they first had to acknowledge that they had siloed development, operations, and support teams. Within this structure, Adam’s team consisted of a few hundred support engineers whose incentives were tied to resolving alerts. The team cared about tickets being generated in their work queues when things were broken or about to break, and they were 100% focused on break/fix work.
At the same time, their on-call system was pretty simple. If a support member needed to get in touch with the engineering team, they’d look at the on-call calendar, find out who was on call, and reach out to them (or their manager).
Overall, the process was sufficient for where they were at the time, but then Target moved toward a product-based organization with dedicated, durable, full-stack teams. Now, engineers would need operational visibility into their products, and they’d be accountable for designing, building, and supporting them. Adam’s team became an internal SaaS provider of monitoring and telemetry.
So much changed so quickly, and Adam’s team needed to evolve their practices and change the culture around monitoring. One of the first changes was vocabulary-related. Instead of calling themselves the Monitoring team, they switched to the Measurement team. Measurement “felt proactive,” and they wanted teams to think about performance and readiness before they moved any code into production.
Next, they needed to adopt new measurement practices and tools that would satisfy the following needs:
- Be easy to use. Team engagement improves when it’s simple to get started.
- Automatically engage the right engineer at the right time. They wanted to define shift durations, handover times, and escalation policies as well as reduce alert fatigue, which they’d accomplish with a product that would engage relevant engineers when it’s time to take action.
- Provide self-service. They wanted users to be able to define their own contact methods, set their notification parameters, make schedule adjustments, etc.
- Provide a mobile experience. Because administrators are on-the-go, they need to be able to access tools from their mobile devices.
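To make the "right engineer at the right time" requirement a little more concrete, here is a minimal sketch of how a rotation with a fixed shift duration and handover time could decide who is on call. This is not GoAlert's implementation; the `on_call` function, its parameters, and the sample names are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone


def on_call(participants, first_handoff, shift_length, now):
    """Return the participant whose shift contains `now`.

    Hypothetical model: shifts are back to back, each lasts
    `shift_length`, and the first one starts at `first_handoff`.
    """
    if now < first_handoff:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (now - first_handoff) // shift_length
    # Wrap around the participant list so the rotation repeats.
    return participants[shifts_elapsed % len(participants)]


team = ["ana", "bo", "cy"]
start = datetime(2019, 9, 2, 9, 0, tzinfo=timezone.utc)  # Monday 09:00 UTC handover
week = timedelta(days=7)

print(on_call(team, start, week, datetime(2019, 9, 10, 8, 0, tzinfo=timezone.utc)))
# prints "bo"
```

A real scheduler also has to handle overrides, time zones, and escalation to the next person when nobody acknowledges, which is where a purpose-built product earns its keep.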
From a technical standpoint, they cared about:
- Reliability. The monitoring platform needed to be extremely reliable so engineers could take prompt action to prevent a potentially negative guest experience.
- Security. It must adhere to security best practices and open standards to keep customer data secure.
- Minimal shared dependencies between the on-call product and engineers’ other products and tools.
- Scalability. With the new product-based structure, they suddenly had thousands of engineers who wanted data delivered to them in near real time. They needed to be able to process thousands of alerts per second and support thousands of users located in multiple countries.
- Open source. Target believes deeply in open source, which they both use and contribute to.
It was a tall order, so Adam formed a team of software, infrastructure, and support engineers to build GoAlert.
Today, GoAlert is the single on-call scheduling and notification product at Target. Through their investment in GoAlert, they’ve reduced monitoring maintenance costs, simplified and improved engineer engagement, and built a stable, scalable product that can be enhanced as teams’ needs shift.
Because GoAlert is easy to use, adoption took off: they have 3,500 internal customers that use GoAlert, and since their initial deployment two years ago, GoAlert has successfully processed more than 1.5 million alerts. Target engineers get reliable, customizable notifications on critical alerts without having to constantly watch dashboards or work queues.
As of June 2019, GoAlert is available as open source for anyone (i.e., not just internal folks) to use. Because Target runs the exact same code in production before each public release, GoAlert users can be confident that it’s already been tested at scale.
Streamlining monitoring and alerting with GoAlert and Sensu Go
The real value comes from closely tying monitoring to alerting. Using GoAlert alongside Sensu Go, Adam’s team can get real-time updates when a failure occurs, respond quickly and appropriately, and ultimately save time because everything is automated and in sync.
For example, when a critical alert pops up on the Sensu dashboard, the alert details are passed to GoAlert.
A critical event as it appears within the Sensu dashboard.
The check details get automatically passed to GoAlert.
GoAlert can then notify users of the alert by email, SMS, or Slack.
A text message from GoAlert.
GoAlert supports two-way messaging, so users can acknowledge the alert with a two-character text reply, and GoAlert auto-replies to confirm the acknowledgment.
If instead users manually resolve the check in Sensu, the status will automatically change to CLOSED in GoAlert, and they’ll also get a notification via SMS (or email or Slack, depending on their preferences).
Once resolved in Sensu, the alert closes in GoAlert.
Users can also subscribe to status change notifications to keep tabs on alerts that they care about.
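The flow above (a failing check opens an alert, a resolved check closes it) can be sketched in a few lines. This is only an illustration, not the code of the handler published on Bonsai. It assumes GoAlert's generic API accepts a POST to `/api/v2/generic/incoming` with `token`, `summary`, `details`, `dedup`, and `action=close` parameters, so check the GoAlert documentation before relying on those names. The `goalert_request` function, the host name, and the token are hypothetical.

```python
from urllib.parse import urlencode

# Sensu Go check status codes: 0 = OK, 1 = warning, 2 = critical.
STATUS_OK = 0


def goalert_request(event, base_url, token):
    """Build the (url, form_body) pair for one Sensu Go event.

    Nothing is sent here; the caller decides how to POST it.
    """
    entity = event["entity"]["metadata"]["name"]
    check = event["check"]["metadata"]["name"]
    key = f"{entity}/{check}"
    params = {
        "summary": key,
        "details": event["check"].get("output", "").strip(),
        # Assumed: the same dedup key lets a later OK event close this alert.
        "dedup": key,
    }
    if event["check"]["status"] == STATUS_OK:
        # Assumed: action=close resolves the matching open alert.
        params["action"] = "close"
    query = urlencode({"token": token})
    url = base_url.rstrip("/") + "/api/v2/generic/incoming?" + query
    return url, urlencode(params)


event = {
    "entity": {"metadata": {"name": "web01"}},
    "check": {
        "metadata": {"name": "check-http"},
        "status": 2,
        "output": "CRITICAL: 503\n",
    },
}
url, body = goalert_request(event, "https://goalert.example.com", "HYPOTHETICAL-TOKEN")
print(url)
print(body)
```

Using the entity and check names as the dedup key is one reasonable choice: repeated failures of the same check update a single alert instead of paging the engineer again, which lines up with the goal of reducing alert fatigue.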
To see GoAlert in action, check out this portion of Adam’s talk. Ready to give GoAlert a try? Head over to Bonsai to download the Sensu Go GoAlert handler.