(Mostly for our internal use since the source data are not publicly available.)
In the six weeks following the beginning of September, the SLO1 error budget dropped to 25%. In mid-October the trend reversed, and now, at the end of October, the budget is back at 50%.
Looking at the metrics in our (non-public) Grafana (Packit boards -> (Prod/Stg) Accepted status time), we can see that the average “accepted status time” did increase by approximately 1 second at the beginning of September, and cases above 15 s started to appear, so the error budget started to be consumed.
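To make the link between slow cases and the budget concrete, here is a minimal sketch of how the remaining error budget could be computed from a list of accepted status times. Only the 15 s threshold comes from the text above; the 98% target, the function name `error_budget_remaining`, and the sample data are assumptions for illustration, and the real Grafana panel may compute this differently (for example over a rolling time window).

```python
from typing import Sequence

THRESHOLD_S = 15.0  # from the text: cases above 15 s consume the budget
SLO_TARGET = 0.98   # ASSUMPTION: the target share of fast cases is not stated here


def error_budget_remaining(
    durations_s: Sequence[float],
    threshold_s: float = THRESHOLD_S,
    slo_target: float = SLO_TARGET,
) -> float:
    """Return the fraction of the error budget that is left.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been violated.
    """
    if not durations_s:
        # No events in the window, so nothing could have consumed the budget.
        return 1.0
    # Number of slow cases the SLO tolerates for this many events.
    allowed_slow = (1.0 - slo_target) * len(durations_s)
    # Number of cases that actually exceeded the threshold.
    actual_slow = sum(1 for d in durations_s if d > threshold_s)
    return 1.0 - actual_slow / allowed_slow


# Hypothetical sample: 1000 events, 15 of them slower than 15 s.
sample = [3.0] * 985 + [20.0] * 15
print(f"{error_budget_remaining(sample):.0%} of the error budget left")
```

With these assumed numbers, 20 slow cases per 1000 events would be tolerated, so 15 slow cases leave about a quarter of the budget. This shows how a fairly small number of >15 s cases can consume most of it.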
The cause is still unknown, but the changes we made in the last week of August, and which could therefore have contributed to the problem, were:
We need to keep watching the metrics, experiment with the workers (their numbers and types), and give the changes more time (at least 2 weeks) before we can tell whether they change the trend.
Last updated: October 1, 2024 at 1:49 PM UTC