I have an app written in Go (I don't think the programming language is relevant though...) that should act as a scheduler. It connects to a PostgreSQL database and calculates at what time is the next run for some tasks. The app sleeps until it's time to run the next task and then runs it (it actually publishes a message in a queue and another app is in charge with running the task).
In order to reduce the risk of not running a scheduled task at a given time, this app should be run on multiple nodes and only one node should publish the message to run the task.
The table in the DB has these two columns:
id serial
next_run int4
When the app starts it performs this query:
SELECT * FROM task;
After this it calculates all the times at which the tasks should be run. When it's time to run the first task, all nodes calculate the following next run and they perform this query:
UPDATE task SET next_run = ${new_next_run} WHERE id = ${id} AND next_run = ${current_next_run};
If the number of affected rows is 1, it means the node was the first one to reach the database and it is the one that will publish the message to run the task. In theory this should work and I even created a small prototype to test this produces the expected result.
In practice, there is a scenario in which this logic fails. When the app starts, it checks for all tasks and calculates their next runs. If the next task should run in less than 3 seconds (approx.), when the update query is performed, it doesn't check the "where" clause against the real database, but against what was saved in the cache. Or at least this is what I suppose happens...
If the first task to run is in more than 3 seconds, the tests pass:
T: node 1 -> rows affected = 1; publish message
T+100 milliseconds: node 2 -> rows affected = 0; don't publish message
T+130 milliseconds: node 3 -> rows affected = 0; don't publish message
..........
T+300 milliseconds: node n -> rows affected = 0; don't publish message
If the first task to run is in less than 3 seconds, the tests fail and my log looks like this:
T: node 1 -> rows affected = 1; publish message
T+100 milliseconds: node 2 -> rows affected = 1; publish message
T+130 milliseconds: node 3 -> rows affected = 0; don't publish message
T+180 milliseconds: node 4 -> rows affected = 1; publish message
T+200 milliseconds: node 5 -> rows affected = 1; publish message
T+220 milliseconds: node 6 -> rows affected = 0; don't publish message
..........
T+300 milliseconds: node n -> rows affected = 0; don't publish message
The only explanation making sense is that the UPDATE
is performed using the cache, not the real database and that's why it gives unpredictable results. But I know in PostgreSQL there's no way to flush the cache using a query, so I come back to the question in the title: How to query PostgreSQL real database, not its cache?
Later edit:
The pseudocode:
start program
declare global tasks variable
read all tasks from DB
SELECT * FROM task
for each task in tasks
calculate next_run and save it in the global tasks variable
if calculated next_run != next_run from database then
update next_run in database
UPDATE task SET next_run = ${next_run} WHERE id = ${id}
endif
endfor
for true
sort tasks by next_run asc
extract first task from tasks
if next_run > now
sleep (next_run - now) seconds
endif
read a fresh instance of task from DB
SELECT * FROM task WHERE id = ${id}
calculate subsequent next_run
update subsequent next_run in tasks
update subsequent next_run in DB
BEGIN;
SELECT id FROM task WHERE id = ${id} AND next_run = ${current_next_run} FOR UPDATE;
UPDATE task SET next_run = ${subsequent_next_run} WHERE id = ${id};
COMMIT;
if transaction succeeded then // rows in result set and rows affected
run the task
endif
endfor
question from:
https://stackoverflow.com/questions/66051975/how-to-query-postgresql-real-database-not-its-cache