gwloggeR
is a package for automatized flagging of outliers, level shifts and temporal changes in raw air and hydrostatic pressure data.
Here you find the basic building-blocks of the package. Before continuing, make sure you have correctly installed the last stable version of gwloggeR
.
At its core gwloggeR
consists of several detect_
functions which take as input a vector of observations, and return a boolean vector of the same length specifying TRUE for each positive detection, and FALSE otherwise.
Currently, there are three detect_
functions defined:
gwloggeR::detect_outliers()
which detects outliers.gwloggeR::detect_levelshifts()
which detects level shifts.gwloggeR::detect_temporalchanges()
which detects changes that have a temporal effect with an exponential decay.These functions are mainly to be used with raw air or hydrostatic pressure data.
The underlying assumption for air pressure data is that it is normally distributed.
Lets generate some sample air pressure data. Based on Belgium, our Gaussian parameter estimates for air pressure are: \(\hat{\mu} = 1033 \; cmH_2O\) and \(\hat{\sigma} = 9.5 \; cmH_2O\)
n <- 100L
mu <- 1033
sigma <- 9.5
x <- rnorm(n, mean = mu, sd = sigma)
plot(x, type = 'l', ylim = c(1000, 1100))
Now lets add an outlier to x
. An outlier we define as a single extreme value (e.g. \(6 \times \hat{\sigma}\)).
Assuming we do not have any a-priori information about x
, we execute:
gwloggeR::detect_outliers(x, plot = TRUE)
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [ reached getOption("max.print") -- omitted 90 entries ]
Note the optional plot = TRUE
argument. In that case, besides the boolean vector output signifying that the 5-th observation is an outlier, we also get a plot explaining why it is an outlier.
We can improve detection methods by supplying a-priori information about x
. For example, in this case we know that x
represents air pressure measurements in Belgium. As such, we make an apriori
object and supply this information to the detection function.
apriori <- gwloggeR::apriori('air pressure', units = 'cmH2O')
gwloggeR::detect_outliers(x = x, apriori = apriori, plot = TRUE)
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [ reached getOption("max.print") -- omitted 90 entries ]
Without a-priori information the detect_outliers()
function will assume normality of x
and estimate the \(\hat{\mu}\) and \(\hat{\sigma}\) parameters in a robust way. When we supply a-priori information, then other extra assumptions might apply and the detection result will improve. Wonder what exact assumptions and how it all fits together? Then read the Air pressure vignette.
The underlying assumption for hydrostatic pressure is that it is a pure random walk with a Gaussian error term. We also assume that the variance of the error term depends on the time-interval between the observations. For example, if a time-interval is 5 minutes, then we do not expect a change of more than 10 cmH2O.
Armed with these assumptions, lets generate some sample data.
n <- 100L
x <- cumsum(c(1200, rnorm(n - 1L, mean = 0, sd = 10/6)))
timestamps <- seq(as.POSIXct('2000-01-01'), by = '5 min', length.out = n)
plot(timestamps, x, type = 'l')
Now lets add an outlier to our observations x
. Here we also define an outlier as a single extreme value in the random walk, but which does not have any effect on subsequent values.
x[5L] <- x[5L] - 15
plot(timestamps, x, type = 'l')
Our x
is not normally distributed. Remember that detect_outliers()
without any a-priori information assumes normality of x
? So if we do not provide any a-priori information about x
, the outlier detection will be bad. To improve it, we state that x
is hydrostatic pressure data.
apriori <- gwloggeR::apriori('hydrostatic pressure', units = 'cmH2O')
Note also, that in this case, we also need to supply timestamps
to the detection function. This is because the underlying model also considers time-intervals between observations.
gwloggeR::detect_outliers(x, timestamps = timestamps, apriori = apriori, plot = TRUE)
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [ reached getOption("max.print") -- omitted 90 entries ]
See how the outlier is caught?
Contrary to outliers, level shifts do have a lasting constant effect on the random walk. Lets simulate one:
x[6L:9L] <- x[6L:9L] - 15
plot(timestamps, x, type = 'l')
Now let us see how detect_levelshifts()
handles this.
gwloggeR::detect_levelshifts(x = x, timestamps = timestamps, apriori = apriori, plot = TRUE)
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#> [ reached getOption("max.print") -- omitted 90 entries ]
The 5-th observation is the start of a level shift down (blue background), and the 10-th observation is the start of a level shift up (red background).
Contrary to level shifts that have a constant effect, temporal changes decay exponentially with time. First we undo the level shift, and then simulate a temporal change with a decay factor of 0.7:
x[5L:9L] <- x[5L:9L] + 15
x[5L:15L] <- x[5L:15L] - 15*0.7^(0:10)
plot(timestamps, x, type = 'l')
Now lets see how detect_temporalchanges()
reacts:
gwloggeR::detect_temporalchanges(x = x, timestamps = timestamps, apriori = apriori, plot = TRUE)
#> [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
#> [ reached getOption("max.print") -- omitted 90 entries ]
The temporal change starts at the 5-th observation until it decays around the 9-th.
Now if all this was very obvious to you, you can give gwlogger
a try on the real data. If you still wonder how all this works, you can read more about the algorithm in the Hydrostatic pressure vignette.
Hmm… wonder what’s next? Me too!