The author uses a systemd timer to schedule their backups. For backups going to a remote host I prefer adding a little bit of variance to the execution time to avoid consistently hitting some hotspot.
From the timer I use to back up my server with Borg to rsync.net:
[Timer]
OnUnitActiveSec=24h
RandomizedDelaySec=1h
This will run the backup script every 24 hours with a random delay of up to 1 hour, so every 24.5 hours on average. This causes the job to nicely rotate around the day.
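As a rough illustration of that drift, here is a toy Python model of the schedule as described above (next start = previous start + 24h + up to 1h of uniform random delay); it models the description, not systemd's internals:

# Toy model: each run starts 24h after the previous one, plus a uniformly
# random delay of up to 1h (per the description above, not systemd's internals).
import random

starts = [0.0]                      # hours since the first run
for _ in range(1000):
    starts.append(starts[-1] + 24 + random.uniform(0, 1))

gaps = [b - a for a, b in zip(starts, starts[1:])]
print("average interval: %.2f h" % (sum(gaps) / len(gaps)))        # ~24.50
print("time of day of early runs:", [round(s % 24, 2) for s in starts[:6]])
# The start time creeps later by ~0.5h per day on average, so over weeks
# it wanders around the whole clock.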
That’s really such a nice solution to the problem.
Can you imagine not reading the docs and never discovering those options? So you spin up a database to save state about runs just to implement the delay. And then you need a dashboard to monitor the various parts of the system for debugging.
This works, but compared to using systemd it has the drawback that the range of possible times stays anchored to the time configured in cron. With the systemd timer example I gave, the next cycle is instead measured from when the previous job finished.
So if it initially runs the script between 0:00 and 1:00 and the script takes 1.5 hours to finish then the next run will be between 1:30 and 2:30 the next day, instead of 0:00 and 1:00.
> So if it initially runs the script between 0:00 and 1:00 and the script takes 1.5 hours to finish then the next run will be between 1:30 and 2:30 the next day
Shouldn't it be between 1:30 and 3:30? The earliest case is a 0:00 start finishing at 1:30, and the latest is a 1:00 start finishing at 2:30 plus the next run's own delay of up to 1 hour. I'm just nitpicking, of course, that's a nice solution.
Another straightforward solution - albeit one not supported by schedulers like cron, where start times are given as fixed, absolute values - is to use an interval that never recurs at the same time of day, i.e. one that doesn't divide evenly into 24 hours.
Seems like a bad idea in terms of making sense of your logs, e.g. you might not notice if the command you want to run starts taking unusually long on some days because of some error.
I would guess that most workloads have some sort of slow time when it’s appropriate to schedule backups. Wouldn’t having backups rotate through the day potentially cause slowness during a more active time for users?
Not OP, but one good reason to do that is to reduce the probability and frequency of collisions with other periodically running jobs.
Let's say you run your jobs on the hour, and you have one job running every 4h and another every 24h. Without planning, because 4 divides 24, you have a one-in-four chance of them colliding, with the 24h job running at the same time as the 4h job.
If you add more 4h jobs, the probability that one of those 4h jobs collides with the 24h job every time increases.
The more jobs you have, the higher the probability that some intervals will be divisors of others.
Using prime numbers for scheduling reduces the probability that those jobs collide at any given time. If you create a job every 5h and another every 23h, the 23h job will only collide with the 5h job every 115h, since LCM(5, 23) = 115.
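A quick sketch of that arithmetic (illustrative Python, counting whole hours and assuming both jobs start together at hour 0, the worst case):

# Count how often two periodic jobs fire in the same hour over one year,
# assuming both are aligned to the top of an hour and start together.
def collisions(period_a, period_b, horizon_hours=365 * 24):
    hits_a = set(range(0, horizon_hours, period_a))
    hits_b = set(range(0, horizon_hours, period_b))
    return len(hits_a & hits_b)

print(collisions(4, 24))   # 365 - the daily job always lands on a 4h slot
print(collisions(5, 23))   # 77  - only at multiples of LCM(5, 23) = 115h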
Interestingly, this technique is used in nature by cicadas, which evolved long, prime-numbered periodical life cycles to avoid gaining a predator that can sync up with them[0].
If you have a bunch of computers each on a fixed timer, and their clocks are synchronized (say, with NTP), then on the least common multiple of all of those timers you'll get a stampede of requests from all of the computers.
If you're on a sufficiently large network, that surge can cause failures. And a fixed retry policy will just cause the same stampede to recur on the retry intervals; you want to add jitter to ensure that you spread the load out.
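A common way to spread those retries out is exponential backoff with random jitter; a generic sketch (not tied to any particular system mentioned here):

# Exponential backoff with "full jitter": each client sleeps a random amount
# up to the current backoff cap, so synchronized clients fan out instead of
# retrying in lockstep.
import random
import time

def retry_with_jitter(operation, attempts=5, base=1.0, cap=60.0):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            backoff = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))   # the jitter does the spreading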
> If you're on a sufficiently large network, that surge can cause failures.
The more prime numbers you use, the rarer a stampede affecting a given percentage of nodes will be. To an extent, that makes a bigger network safer.
Adding more nodes with the same prime numbers means the peak load scales linearly with the network size. So wouldn't that mean that a "sufficiently large" network is no worse off than a small network?
Also a backup is a bulk upload that can easily run a thousand times slower than normal and still succeed. Even if every server triggers at once, that shouldn't inherently cause failures.
So that it backs up at a different time each day? For instance, if it is set to 11h and you start it at 4pm one day, then the backup would next run at:
3h, then 14h, 1h, 12h, 23h, ...etc. So you won't have the backup running at the same time every day.
Personally speaking I would prefer that the backup runs at the same time every day, but some people don't.
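For the curious, that 11h rotation is easy to check with a bit of mod-24 arithmetic (illustrative Python):

# Time of day (hours, mod 24) of successive runs at an 11h interval,
# starting at 16:00. It visits all 24 hours before repeating, because
# gcd(11, 24) = 1.
start, interval = 16, 11
print([(start + k * interval) % 24 for k in range(24)])
# [16, 3, 14, 1, 12, 23, 10, 21, 8, 19, 6, 17, 4, 15, 2, 13, 0, 11, 22, 9, 20, 7, 18, 5]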
Well, it's not that the number is prime. There are many non-prime intervals that don't back up at the same time every day, like 25hrs or 22hrs. What matters is how long the interval takes to line up with a full day again: the larger the least common multiple of the interval and 24 hours, the longer it takes to repeat.
Depending on a more exact statement of the goal, the best constant interval to avoid repeating parts of the day on nearby dates would be to multiply a day's length by the golden ratio; this would be about every 38.833 hours.
Since the golden ratio is irrational, you'll technically never repeat the same time. But if I remember correctly, it's the best number to space out the times uniformly throughout the day and also distantly between nearby days.
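A rough way to see both effects (illustrative Python; 24h times the golden ratio is the ~38.833h interval mentioned above):

# Compare how evenly different intervals spread the run's time of day.
# 22h shares a factor with 24 and only ever hits even hours; 25h cycles
# through all 24 hours; 24h x golden ratio never exactly repeats and
# fills the day in a low-discrepancy way.
PHI = (1 + 5 ** 0.5) / 2

def distinct_times_of_day(interval_hours, runs=50):
    return len({round((k * interval_hours) % 24, 2) for k in range(runs)})

for interval in (22, 25, 24 * PHI):
    print(round(interval, 3), "->", distinct_times_of_day(interval), "distinct times in 50 runs")
# 22     -> 12 (even hours only)
# 25     -> 24 (every hour, then it repeats)
# 38.833 -> 50 (a new time of day on every run)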
It possibly makes the process unnecessarily slow. People tend to choose “round” numbers for their cronjobs. Probably most commonly minute 0 of a given hour for an hourly or daily job. Thus on e.g. 0:00 UTC there might be hundreds of clients running their backups.
I don't have a strict need to run my backups at a fixed point in time (e.g. within the night hours). By not hitting a hotspot I have a better chance of getting a larger share of the target's bandwidth for my needs (both network and disk IO).
The random delay ensures that the job runs at a different point in time every day, with most of these points in time being expected to have a light load. If it accidentally hits a hotspot on one day it will be fine the next.