AxeCrafted Blog

Wandering thoughts from a wandering mind.


Simple Anomaly Detection in Databricks Using Mean and Standard Deviation

Posted on June 23rd, 2025

Data quality and monitoring are becoming critical challenges in modern data platforms. Whether you’re detecting silent failures after a deploy or uncovering unexpected user behavior, being able to spot anomalies quickly can prevent lost revenue, degraded user experience, or compliance risks. Here’s how we tackled this problem using a simple statistical approach combined with generative AI.

In the land of data, regular patterns are comforting companions. When your morning coffee is brewed exactly at 7:30 or when your cat is expecting you to wake up and feed it at precisely 6:00, it's the regularity and predictability that make the experience delightful. But occasionally, these friendly patterns break - a missed coffee alarm or a loud meow announcing you messed up. Welcome to the intriguing world of anomalies.

At our company, Consumidor Positivo, we've had problems with deploys that would unexpectedly affect our systems. We would only find out what happened much later, despite having robust unit testing and integration testing. Even worse, identifying the cause and correlating it with a deploy took considerable time. This led us to think about simple ways to detect when data patterns shift suddenly so we could be alerted and diagnose problems faster.

Since then, we've built a system called ADULA (the name is an internal joke which I can't tell here, sadly - unless we hire you, of course) responsible for detecting anomalies in our events data, which has also evolved to diagnose them automatically for us (to the best of its ability).

Foundation and Assumptions

Before diving deeper, let's establish some assumptions we'll be using to build this product:

  • Normality Assumption: we assume event counts roughly follow a normal distribution - while real-world data does not necessarily follow this, it’s a useful approximation, especially for large event counts. For example, the number of PageViews on Tuesdays at 11:00 AM can be modeled as approximately normal.
  • Stationarity: we assume the behavior of events does not change significantly over short periods - gradual changes are manageable, but abrupt changes may require recalibration. Of course, using the same example as above, the event volume changes over time and this number (hopefully) drifts upward, but this is a good enough approximation to detect sudden changes.
  • Weighted Importance of Recent Data: recent events are more predictive of current behavior - older events gradually lose influence, ensuring the model adapts to new patterns. This accounts for the drift mentioned above (one concrete weighting scheme is sketched below).
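For concreteness, one common way to implement this weighting is an exponential decay with a half-life $h$; the post doesn't prescribe the exact scheme, so take the form below as an assumed example:

$$
w_i = 2^{-\Delta t_i / h}, \qquad
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad
\sigma_w = \sqrt{\frac{\sum_i w_i \left(x_i - \bar{x}_w\right)^2}{\sum_i w_i}}
$$

where $x_i$ is the event count observed $\Delta t_i$ days ago. With this choice, an observation that is $h$ days old counts for half as much as one from today.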

Defining Anomalies Clearly

In order to detect an anomaly, one must first define what it is - for this, we apply a common statistical rule called the "three-sigma rule".

An event is flagged as an anomaly when its count deviates more than three standard deviations ($3\sigma$) from the (weighted) mean. Under a normal distribution, counts that far from the mean should occur naturally with very low probability - less than 0.3% of the data would lie outside the three-sigma region, as depicted below.

Three Sigma Rule Example
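As a minimal sketch (plain Python, with function and argument names that are mine rather than ADULA's), the rule boils down to a single comparison against the weighted baseline:

```python
def is_anomaly(observed_count: float, weighted_mean: float, weighted_std: float,
               n_sigmas: float = 3.0) -> bool:
    """Flag an observation whose count deviates more than n_sigmas
    standard deviations from the (recency-weighted) mean."""
    return abs(observed_count - weighted_mean) > n_sigmas * weighted_std


# Example with a baseline of 245.6 +/- 12.4 PageViews (as in the table further below):
is_anomaly(198.0, 245.6, 12.4)   # True  -> a ~3.8 sigma drop
is_anomaly(250.0, 245.6, 12.4)   # False -> well within the expected band
```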

Implementation

First, we defined our constraints: alerts every 30 minutes provided a good trade-off between cost and detection accuracy for our case. We considered and tested a streaming rolling-window query framework for continuous evaluation in smaller 1-minute windows, but we were quite happy with the proof of concept running every 30 minutes, which was simple to set up and economical in terms of cluster costs.

Second, we needed to define which breakdowns to use when checking whether something is an anomaly. The more fine-grained you go, the more likely you are to get false positives, or even to fail to find enough data to work with. In our case, since event volume varies significantly by hour of day and day of week, we broke it down into:

  • Write Key: basically tells us event origin, which could come from different websites/systems
  • Event Name: we want to discover anomalies for each event
  • Day of Week: consider variations throughout the week
  • Hour of Day: consider variations throughout the day

Finally, since we run this on 30-minute windows, the last breakdown level determines whether the event falls in the 0–29 or 30–59 minute range. While averaging different 30-minute periods could provide greater precision, this simpler approach was sufficient for our needs.
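To make these keys concrete, here is a hedged PySpark sketch. The table name `events.raw_events` and column names such as `event_timestamp`, `write_key`, and `event_name` are assumptions about our schema, not its actual definition:

```python
from pyspark.sql import functions as F

# `spark` is the session Databricks provides in notebooks and jobs.
events = spark.table("events.raw_events")  # hypothetical raw events table

keyed = events.select(
    "write_key",
    "event_name",
    # Spark's dayofweek() is 1 (Sunday) .. 7 (Saturday); shift to a 0-based convention.
    (F.dayofweek("event_timestamp") - 1).alias("weekday"),
    F.hour("event_timestamp").alias("hour"),
    # Bucket each event into the 0-29 or 30-59 minute half of the hour.
    F.when(F.minute("event_timestamp") < 30, F.lit(0)).otherwise(F.lit(30)).alias("minute"),
)
```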

Using the last 90 days of data, we then create a table containing the baseline (event count average, weighted by recency) for each of the breakdowns above, along with the weighted standard deviation for that sample. This workflow runs every week and recalculates the baseline. The final table looks something like this (a sketch of the computation follows the table):

| event      | write_key | weekday | hour | minute | avg   | std  |
|------------|-----------|---------|------|--------|-------|------|
| PageViewed | abc       | 2       | 14   | 0      | 245.6 | 12.4 |
| PageViewed | abc       | 5       | 9    | 30     | 312.8 | 15.2 |
| PageViewed | abc       | 0       | 20   | 30     | 128.4 | 9.7  |
| PageViewed | abc       | 4       | 11   | 0      | 475.1 | 20.3 |
| PageViewed | abc       | 6       | 16   | 30     | 382.5 | 18.9 |
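Below is a minimal sketch of how such a baseline could be computed. It assumes a hypothetical table of half-hourly counts per breakdown, an exponential recency weight with an assumed half-life, and the standard weighted-mean and weighted-variance formulas; the real job surely differs in the details:

```python
from pyspark.sql import functions as F

HALF_LIFE_DAYS = 30  # assumed decay half-life, not necessarily the value we use

# Hypothetical table with one row per breakdown per 30-minute slot per day:
# (write_key, event_name, event_date, weekday, hour, minute, cnt)
counts = spark.table("monitoring.half_hourly_counts").where(
    F.col("event_date") >= F.date_sub(F.current_date(), 90)
)

# Exponential recency weight: a count from HALF_LIFE_DAYS ago weighs half as much as today's.
weighted = counts.withColumn(
    "w",
    F.pow(F.lit(2.0), -F.datediff(F.current_date(), F.col("event_date")) / F.lit(HALF_LIFE_DAYS)),
)

# Weighted mean and weighted standard deviation per breakdown.
wmean = F.sum(F.col("w") * F.col("cnt")) / F.sum("w")
baseline = (
    weighted.groupBy("write_key", "event_name", "weekday", "hour", "minute")
    .agg(
        wmean.alias("avg"),
        F.sqrt(F.sum(F.col("w") * F.col("cnt") * F.col("cnt")) / F.sum("w") - wmean * wmean).alias("std"),
    )
)

baseline.write.mode("overwrite").saveAsTable("monitoring.event_baseline")  # hypothetical target table
```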

Then, another workflow is responsible for aggregating our real-time data every 30 minutes, and comparing the data volume during that time window against the baseline table. If an event count deviates more than $3\sigma$ from the average, this triggers an anomaly. In our case, we were only interested in discovering "upper-bound anomalies" for error tracking events (that is, the number of error events is increasing), and "lower-bound anomalies" for business metrics tracking events (that is, a business metric is decreasing in volume, like page views).
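A hedged sketch of that comparison step could look like the following; the table names, the `cnt` column, and the `is_error_event` flag are all hypothetical placeholders for our actual setup:

```python
from pyspark.sql import functions as F

# Counts for the 30-minute window that just closed, keyed like the baseline,
# plus a hypothetical is_error_event flag per event.
window_counts = spark.table("monitoring.current_window_counts")  # hypothetical
baseline = spark.table("monitoring.event_baseline")              # hypothetical

joined = window_counts.join(
    baseline, on=["write_key", "event_name", "weekday", "hour", "minute"]
).withColumn("z_score", (F.col("cnt") - F.col("avg")) / F.col("std"))

# Upper-bound anomalies for error events (volume spiking up),
# lower-bound anomalies for business events (volume dropping).
anomalies = joined.where(
    (F.col("is_error_event") & (F.col("z_score") > 3))
    | (~F.col("is_error_event") & (F.col("z_score") < -3))
)
```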

To help clarify how the pieces fit together, here’s a visual summary of the system architecture we implemented:

ADULA Architecture: From Raw Events to Diagnosed Anomalies

We'll dive into how the language models may help us soon.

For V1, That Is All - And Then What?

That was our proof of concept. We spun it up (with testing and tweaking) in just two days. It worked wonders. On the very first day, it detected a shift in our user authentication event count. We discovered that a recent deploy had prevented some users from authenticating, and we rolled it back immediately. If the anomaly system had not been in place, we probably would have only discovered the issue once a business employee noticed a drop in the authentication rate while checking dashboards.

After using the anomaly detection system for a couple of weeks, we noticed some opportunities to make it better. First, a 0.3% anomaly threshold across hundreds of events still ends up triggering a lot of false positives, and each one takes a while to diagnose. So we thought: what if we could run an automatic diagnosis on the anomaly trigger - perhaps using GenAI?

Automating Diagnosis

Our events are JSON payloads sent from different applications, which are standardized using a JSON Schema file containing field descriptions and data types. A page view event, for example, contains the website, the page path with some standard UTMs for attribution, the user agent, and some other fields. A form submitted event will contain the form ID, the step ID, and so on.
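Purely as an illustration (these field names are made up and not our actual JSON Schema), a page view payload might look something like this:

```python
page_viewed = {
    "event": "PageViewed",
    "write_key": "abc",
    "timestamp": "2025-06-23T11:00:12Z",
    "properties": {
        "website": "example.com",
        "path": "/offers",
        "utm_source": "newsletter",
        "utm_medium": "email",
        "user_agent": "Mozilla/5.0",
    },
}
```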

An LLM agent looking at this could recognize that these breakdowns are potentially important for diagnosing an anomaly - for example, a drop in form submissions might be concentrated on a specific form and a specific step - so the first step is to share a sample of the event payloads with an agent tasked with identifying the most valuable breakdowns for analysis.

These breakdowns are then parsed into a query template that compares event volume for each breakdown value during the anomaly window against historical averages - this way, it is possible to determine whether a specific breakdown is the main driver of the anomaly. For example, if the volume drop is concentrated in one form ID, that is probably the anomaly driver.
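In spirit, the query template behind this looks like the sketch below. The breakdown column `form_id`, the table names, and the column names are hypothetical; the real template is generated for each breakdown the agent suggests:

```python
from pyspark.sql import functions as F

BREAKDOWN = "form_id"  # hypothetical: one of the breakdowns suggested by the agent

# Counts per breakdown value during the anomaly window (hypothetical view over the
# raw events, restricted to the anomalous write_key / event / 30-minute window).
window_by_value = (
    spark.table("monitoring.anomaly_window_events")
    .groupBy(BREAKDOWN)
    .agg(F.count("*").alias("cnt"))
)

# Historical average and std per breakdown value for the same weekday/hour slot
# (hypothetical table, built analogously to the event-level baseline).
history_by_value = spark.table("monitoring.breakdown_baseline").select(BREAKDOWN, "avg", "std")

diagnosis = (
    window_by_value.join(history_by_value, on=BREAKDOWN, how="left")
    .withColumn("variation", F.col("cnt") - F.col("avg"))
    .withColumn("z_score", F.col("variation") / F.col("std"))
)

# Each value's share of the total volume variation, as reported in the final diagnosis.
total_variation = diagnosis.agg(F.sum(F.abs("variation"))).first()[0]
diagnosis = diagnosis.withColumn(
    "variation_share_of_total", F.abs(F.col("variation")) / F.lit(total_variation)
)
```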

The volume variation data is consolidated into a final prompt which is sent to an "Anomaly Detection Analyst Agent", which evaluates the overall volume and each breakdown's variation share, and identifies the most likely causes of the anomaly. The agent consolidates a final report which is sent to a Slack channel along with the anomaly alert, so users can quickly understand the likely root causes.
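As a rough sketch of this final step (the webhook URL and the `call_analyst_agent` helper are placeholders, not our actual integration), the consolidate-and-alert flow might look like this:

```python
import json
import requests  # Slack incoming webhooks accept a plain JSON POST

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook URL


def call_analyst_agent(prompt: str) -> str:
    """Hypothetical placeholder for whichever LLM client backs the
    'Anomaly Detection Analyst Agent'; returns a markdown report."""
    raise NotImplementedError


def report_anomaly(anomaly: dict, breakdown_stats: list[dict]) -> None:
    # Consolidate the volume-variation data into a single prompt for the agent.
    prompt = (
        f"Event {anomaly['event_name']} (workspace {anomaly['write_key']}) deviated "
        f"{anomaly['z_score']:.2f} sigma from its baseline over the last 30 minutes.\n"
        f"Per-breakdown variations:\n{json.dumps(breakdown_stats, indent=2)}\n"
        "Identify the most likely drivers and summarize them."
    )
    report = call_analyst_agent(prompt)

    # Post the anomaly alert together with the agent's report to Slack.
    requests.post(SLACK_WEBHOOK_URL, json={"text": report}, timeout=10)
```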

A diagnosis looks something like this:

AI Diagnostics for event ProductViewed (workspace `abc`) is complete.
Summary:
Diagnosis: Explainable 
Confidence: 0.85 
Most significant drivers: 
* location -> `offer-popup`
  * variation_share_of_total -> **43%**
  * z_score -> **-4.88**
* channel_source -> `Direct`
  * variation_share_of_total -> **26%**
  * z_score -> **-4.23**
Notes:
* The bulk of the drop is driven by the offer-popup location and Direct channel.

Equipped with this, it's much easier to identify the responsible deploy - first, because it was usually merged within the last 30 minutes; and second, because the anomaly's origin is typically well determined by the automatic diagnosis.

A Note on Trade-offs

While this system has been highly effective for us, it’s not without limitations. Statistical assumptions like normality and stationarity don’t always hold, especially for low-frequency events. False positives can still occur despite tuning, and diagnosing rare or multi-causal anomalies remains challenging. Still, by prioritizing simplicity, interpretability, and responsiveness, we created a system that adds tangible value without introducing complex infrastructure or heavy ML models.

Building the Pretty Unnecessary Stuff

Obvious next step? Build an API and UI! Given how well users engaged with the system, we decided to wrap it in a UI and add some management features to support them:

  • Config: users can create configs to change the $\sigma$ threshold to make an event more or less sensitive, and also specify breakdowns in advance that will be sent to the anomaly agent
  • Visibility and Simulator: users can visualize the anomaly (as depicted below) and simulate the effect of changing the threshold directly in the UI
  • Resolve/Suppress: users can manually resolve an anomaly and give a reason, which is later used as feedback by the system. They can also suppress an anomaly, if it is expected
  • Visual Aids: the UI showcases cards that help diagnose the anomaly further using the same information that was provided to the agent - analysts can have their own interpretation

This is what it looks like:

Unnecessary UI Example

The green bands show the region where the event volume variation is expected - any datapoint that falls outside of this region is an anomaly and triggers the process described above. There is more to the UI, but I can't show everything here.

Final Thoughts

Building ADULA was a reminder that simple tools, combined thoughtfully, can solve real problems. You don’t always need deep learning models to add intelligence to your stack - sometimes all it takes is a good baseline, smart breakdowns, and a bit of generative guidance. If you’re facing similar challenges, we’d love to exchange ideas - or maybe even work together.


Leonardo Machado

Some guy from Brazil. Loves his wife, cats, coffee, and data. Often found trying to make sense of numbers or cooking something questionable.