The author uses a systemd timer to schedule their backups. For backups going to a remote host I prefer adding a little bit of variance to the execution time to avoid consistently hitting some hotspot.
From the timer I use to back up my server with Borg to rsync.net:
[Timer]
OnUnitActiveSec=24h
RandomizedDelaySec=1h
This will run the backup script every 24 hours with a random delay of up to 1 hour, so every 24.5 hours on average. This causes the job to nicely rotate around the day.
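As a rough illustration of that drift, here is a toy Python model of the schedule as described above (next start = previous start + 24h + up to 1h of uniform random delay); it models the description, not systemd's internals:

# Toy model: each run starts 24h after the previous one, plus a uniformly
# random delay of up to 1h (per the description above, not systemd's internals).
import random

starts = [0.0]                      # hours since the first run
for _ in range(1000):
    starts.append(starts[-1] + 24 + random.uniform(0, 1))

gaps = [b - a for a, b in zip(starts, starts[1:])]
print("average interval: %.2f h" % (sum(gaps) / len(gaps)))        # ~24.50
print("time of day of early runs:", [round(s % 24, 2) for s in starts[:6]])
# The start time creeps later by ~0.5h per day on average, so over weeks
# it wanders around the whole clock.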
That’s really such a nice solution to the problem.
Can you imagine not reading the docs and never discovering those options? So you spin up a database to save state about runs just to implement the delay. And then you need a dashboard to monitor the various parts of the system for debugging.
This works, but compared to using systemd it has the drawback that the range of possible times stays anchored to the time configured in cron. With the systemd timer example I gave, the next cycle is instead measured from when the previous job finished.
So if it initially runs the script between 0:00 and 1:00 and the script takes 1.5 hours to finish then the next run will be between 1:30 and 2:30 the next day, instead of 0:00 and 1:00.
> So if it initially runs the script between 0:00 and 1:00 and the script takes 1.5 hours to finish then the next run will be between 1:30 and 2:30 the next day
Shouldn't it be between 1:30 and 3:30? The earliest case is a 0:00 start finishing at 1:30, and the latest is a 1:00 start finishing at 2:30 plus the next run's own delay of up to 1 hour. I'm just nitpicking, of course, that's a nice solution.
Another straightforward solution - albeit one not supported by schedulers like cron, where start times are given as fixed, absolute values - is to use an interval that never recurs at the same time of day, i.e. one that doesn't divide evenly into 24 hours.
Seems like a bad idea in terms of making sense of your logs, e.g. you might not notice if the command you want to run starts taking unusually long on some days because of some error.
I would guess that most workloads have some sort of slow time when it’s appropriate to schedule backups. Wouldn’t having backups rotate through the day potentially cause slowness during a more active time for users?
Not OP, but one good reason to do that is to reduce the probability and frequency of collisions with other periodically running jobs.
Let's say you run your jobs on the hour, and you have one job running every 4h and another every 24h. Without planning, because 4 divides 24, you have a one-in-four chance of them colliding, with the 24h job running at the same time as the 4h job.
If you add more 4h jobs, the probability that one of those 4h jobs collides with the 24h job every time increases.
The more jobs you have, the higher the probability that some intervals will be divisors of others.
Using prime numbers for scheduling reduces the probability that those jobs collide at any given time. If you create a job every 5h and another every 23h, the 23h job will only collide with the 5h job every 115h, since LCM(5, 23) = 115.
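A quick sketch of that arithmetic (illustrative Python, counting whole hours and assuming both jobs start together at hour 0, the worst case):

# Count how often two periodic jobs fire in the same hour over one year,
# assuming both are aligned to the top of an hour and start together.
def collisions(period_a, period_b, horizon_hours=365 * 24):
    hits_a = set(range(0, horizon_hours, period_a))
    hits_b = set(range(0, horizon_hours, period_b))
    return len(hits_a & hits_b)

print(collisions(4, 24))   # 365 - the daily job always lands on a 4h slot
print(collisions(5, 23))   # 77  - only at multiples of LCM(5, 23) = 115h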
Interestingly, this technique is used in nature by cicadas, which evolved long, prime-numbered periodical life cycles to avoid gaining a predator that can sync up with them[0].
If you have a bunch of computers each on a fixed timer, and their clocks are synchronized (say, with NTP), then on the least common multiple of all of those timers you'll get a stampede of requests from all of the computers.
If you're on a sufficiently large network, that surge can cause failures. And a fixed retry policy will just cause the same stampede to recur on the retry intervals; you want to add jitter to ensure that you spread the load out.
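A common way to spread those retries out is exponential backoff with random jitter; a generic sketch (not tied to any particular system mentioned here):

# Exponential backoff with "full jitter": each client sleeps a random amount
# up to the current backoff cap, so synchronized clients fan out instead of
# retrying in lockstep.
import random
import time

def retry_with_jitter(operation, attempts=5, base=1.0, cap=60.0):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            backoff = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))   # the jitter does the spreading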
> If you're on a sufficiently large network, that surge can cause failures.
The more prime numbers you use, the rarer a stampede affecting a given percentage of nodes will be. To an extent, that makes a bigger network safer.
Adding more nodes with the same prime numbers means the peak load scales linearly with the network size. So wouldn't that mean that a "sufficiently large" network is no worse off than a small network?
Also a backup is a bulk upload that can easily run a thousand times slower than normal and still succeed. Even if every server triggers at once, that shouldn't inherently cause failures.
So that it backs up at a different time each day? For instance, if it is set to 11h and you start it at 4pm one day, then the backup would next run at:
3h, then 14h, 1h, 12h, 23h, ...etc. So you won't have the backup running at the same time every day.
Personally speaking I would prefer that the backup runs at the same time every day, but some people don't.
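For the curious, that 11h rotation is easy to check with a bit of mod-24 arithmetic (illustrative Python):

# Time of day (hours, mod 24) of successive runs at an 11h interval,
# starting at 16:00. It visits all 24 hours before repeating, because
# gcd(11, 24) = 1.
start, interval = 16, 11
print([(start + k * interval) % 24 for k in range(24)])
# [16, 3, 14, 1, 12, 23, 10, 21, 8, 19, 6, 17, 4, 15, 2, 13, 0, 11, 22, 9, 20, 7, 18, 5]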
Well, it's not that the number is prime. There are many non-prime intervals that don't back up at the same time every day, like 25hrs or 22hrs. What matters is how long the interval takes to line up with a full day again: the larger the least common multiple of the interval and 24 hours, the longer it takes to repeat.
Depending on a more exact statement of the goal, the best constant interval to avoid repeating parts of the day on nearby dates would be to multiply a day's length by the golden ratio; this would be about every 38.833 hours.
Since the golden ratio is irrational, you'll technically never repeat the same time. But if I remember correctly, it's the best number to space out the times uniformly throughout the day and also distantly between nearby days.
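A rough way to see both effects (illustrative Python; 24h times the golden ratio is the ~38.833h interval mentioned above):

# Compare how evenly different intervals spread the run's time of day.
# 22h shares a factor with 24 and only ever hits even hours; 25h cycles
# through all 24 hours; 24h x golden ratio never exactly repeats and
# fills the day in a low-discrepancy way.
PHI = (1 + 5 ** 0.5) / 2

def distinct_times_of_day(interval_hours, runs=50):
    return len({round((k * interval_hours) % 24, 2) for k in range(runs)})

for interval in (22, 25, 24 * PHI):
    print(round(interval, 3), "->", distinct_times_of_day(interval), "distinct times in 50 runs")
# 22     -> 12 (even hours only)
# 25     -> 24 (every hour, then it repeats)
# 38.833 -> 50 (a new time of day on every run)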
It possibly makes the process unnecessarily slow. People tend to choose “round” numbers for their cronjobs. Probably most commonly minute 0 of a given hour for an hourly or daily job. Thus on e.g. 0:00 UTC there might be hundreds of clients running their backups.
I don't have a strict need to run my backups at a fixed point in time (e.g. within the night hours). By not hitting a hotspot I have a better chance of getting a larger share of the target's bandwidth for my needs (both network and disk IO).
The random delay ensures that the job runs at a different point in time every day, with most of these points in time being expected to have a light load. If it accidentally hits a hotspot on one day it will be fine the next.