When working with data, it is quite important to be able to detect outliers within that data. This guide will give a basic introduction on outliers within data and a simple way to detect these outliers.
First, lets take a look at Wikipedia’s trusty definition of outlier:
“In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can be an indication of exciting possibility, but can also cause serious problems in statistical analyses.”
Interesting. Well how exactly can we define an outlier when looking at our own data? In order to make a quick definition, let’s first talk about standard deviations. Standard deviations are used to measure the dispersion within data. Simply put, a low standard deviation means that values are typically close to the mean value in a set of values, where as a high standard deviation means that the values aren’t always so close to the mean.
Luckily, Sigma makes this easy to compute! We have the StdDev function to do this for you. If you want to find the standard deviation of a column, you can add a new summary and use this function.
Here, I have some sample sales data:
And I would like to find the standard deviation of the total sales. I simply add a new summary:
And use the formula
Ok, now what? We have the standard deviation, and we know that the standard deviation associates with how far a set of values typically varies from the average, so let’s find the average! We can use the summary here as well, and use Sigma’s Avg function.
The next step here is to determine what we determine an outlier. There is a nice rule out there about data that follows a normal distribution, outlined by our friend Wikipedia. Essentially, we can expect 68% of values to fall within one standard deviation of the mean, 95% to fall within two standard deviations of the mean, and 99.7% to fall within three.
Determining exactly what we consider an outlier is a big step and can depend on the use case, however, we will use 3 standard deviations away from the mean to determine that in this example.
I love using summaries, so I will use those again to find the upper and lower bounds of what we can consider “non-outliers”. First, I find the upper bound with the following formula:
[Average] + (3*[Standard Deviation])
And the lower bound with the following formula:
[Average] - (3 * [Standard Deviation])
Now, we can check if our daily sales numbers fall within this range! We can add a new column and use the following formula to see if each value falls within the range we defined:
If(Between([Total sales], [Lower Bound], [Upper Bound]), True, False)
Where True will denote that the value is an outlier. Here are the results:
We see only false values here, but remember we chose a large interval here. Let’s take a look at the column details:
You can see that the hypothesis of 99.7% of values lying within 3 standard deviations was almost true, just off by 0.1%. Filtering to true on the [Outlier?] column, we can see the outliers here:
Want to Dig Deeper?
In certain use cases, you may want to determine outliers within a certain window. In that case, we can follow the same steps, but use the MovingAvg and MovingStdDev functions.
For the really bold, take a look at this article on using Snowpark for ML models in Sigma. Machine learning is a great way to detect outliers!