Scheduled Task Watchdog

Friday has a scheduler. It fires prompts on a cron schedule – a joke every morning, a lunch reminder at noon, a health check every five minutes. The scheduler works by polling a database table every 30 seconds, finding tasks that are due, and calling session.prompt().

One afternoon, a task fired. Nine seconds later, the process was killed. The task's last_run_at was already stamped. The task was already marked inactive. Claude never sent the message. Nobody knew.

Every time the process crashes or restarts mid-turn – a SIGSEGV, a /kill command, a systemd restart – any task that was in flight is silently lost. The database says it ran. It didn't.

Why `prompt()` Is Synchronous

The heartbeat tick before the watchdog looks like this:

async #tick(): Promise<void> {
    const now = new Date().toISOString();
    const dueTasks = this.#tasks.findDue(now);

    for (const task of dueTasks) {
        if (this.#session.streaming) break;

        const prompt = `[Scheduled task: ${task.description}]\n${task.prompt}`;
        this.#session.prompt(prompt);

        if (task.cron_expression === 'once') {
            this.#tasks.deactivate(task.id);
        } else {
            const nextRun = nextRunFromCron(task.cron_expression);
            this.#tasks.updateNextRun(task.id, nextRun, now);
        }
    }
}

This is from src/scheduler/heartbeat.ts (before)

Call prompt(). Immediately stamp last_run_at and deactivate if it's a one-shot task. The word "immediately" is the problem.

session.prompt() is synchronous by design. It queues the prompt, kicks off #streamTurn() in the background, and returns { status: 202 }. It doesn't wait for the model to respond – it can't. A single turn can take thirty seconds or more as Claude reads files, makes tool calls, and streams a response. Blocking the heartbeat for that long would mean no other tasks could fire while one is in progress.

So from the database's perspective, the task is done the instant it's dispatched. From the model's perspective, the task hasn't started yet. There's a gap between those two moments – and in a long-running daemon, something always eventually happens in the gap.

Making prompt() return a Promise would be the obvious fix, but it would mean a larger refactor that touches every caller. The bridge calls prompt(). The Telegram handler calls prompt(). Tests call prompt(). Changing the return type from a synchronous result to a Promise would ripple outward through the entire codebase. The scheduler's problem shouldn't become everyone's problem.

`fired_at`: The Missing State

The schema has two timestamps that matter: next_run_at (when the task should fire) and last_run_at (when it last completed). There is no representation for "dispatched but not yet completed." The task is either due or done. The gap between those states is invisible.

The fix is a nullable fired_at column. We set it when a task is dispatched. We clear it when the turn completes. If the process dies in between, fired_at stays set – and that's the signal.

A status enum (pending, fired, completed, failed) would be more expressive, but it would require migrating existing data and complicating every query. The timestamp is simpler. It can double as a boolean flag – is the task in-flight? – and as a diagnostic – when exactly was it fired?

The migration is one line of SQL, wrapped in a try/catch because SQLite doesn't support ALTER TABLE ADD COLUMN IF NOT EXISTS:

try {
    db.exec(`ALTER TABLE scheduled_tasks ADD COLUMN fired_at TEXT`);
} catch {
    // column already exists
}

This is from src/db/schema.ts

This is the standard SQLite migration pattern. It's ugly every time.

With fired_at in place, the task store gains four new methods:

markFired(id: number, firedAt: string): void {
    this.#db
        .prepare('UPDATE scheduled_tasks SET fired_at = ? WHERE id = ?')
        .run(firedAt, id);
}

resetFired(id: number): void {
    this.#db
        .prepare('UPDATE scheduled_tasks SET fired_at = NULL WHERE id = ?')
        .run(id);
}

findInterrupted(): ScheduledTask[] {
    return this.#db
        .prepare(
            'SELECT * FROM scheduled_tasks WHERE fired_at IS NOT NULL AND last_run_at IS NULL'
        )
        .all() as ScheduledTask[];
}

resetForRetry(id: number, nextRunAt: string): void {
    this.#db
        .prepare('UPDATE scheduled_tasks SET next_run_at = ? WHERE id = ?')
        .run(nextRunAt, id);
}

This is from src/db/tasks.ts

markFired and resetFired bracket the in-flight window. findInterrupted finds tasks that were in flight when the process died – fired_at is set but last_run_at is still null. resetForRetry re-queues them.

The existing findDue query also needs a guard:

findDue(now: string): ScheduledTask[] {
    return this.#db
        .prepare(
            'SELECT * FROM scheduled_tasks WHERE is_active = 1 AND next_run_at <= ? AND fired_at IS NULL'
        )
        .all(now) as ScheduledTask[];
}

The AND fired_at IS NULL clause prevents double-firing. Without it, a task could be dispatched, take longer than 30 seconds, and appear as due again on the next heartbeat tick. The heartbeat would dispatch it a second time while the first turn is still running.

resetForRetry is a separate method rather than a special case of updateNextRun. The existing updateNextRun(id, nextRunAt, lastRunAt) always sets both columns. Recovery needs to reset next_run_at to "now" without touching last_run_at – which should stay null to indicate the task never actually completed. Rather than overloading one method with null semantics, a separate method keeps the intent explicit.

The One-Shot Listener

The heartbeat needs to know when a turn finishes. Not when prompt() returns – that's immediate and meaningless. When the model has actually finished responding, tool calls and all.

The Session class got an onTurnComplete() method:

#turnCompleteListeners: Set<() => void> = new Set();

onTurnComplete(fn: () => void): () => void {
    this.#turnCompleteListeners.add(fn);
    return () => this.#turnCompleteListeners.delete(fn);
}

This is from src/acp/session.ts

The callbacks fire in the finally block of #streamTurn():

async #streamTurn(): Promise<void> {
    try {
        for await (const message of this.#query!) {
            // ... process messages
        }
    } catch (err) {
        this.#buffer.push({
            type: 'error',
            message: err instanceof Error ? err.message : String(err),
        });
    } finally {
        this.#streaming = false;
        this.#done = true;
        this.#query = null;

        const listeners = [...this.#turnCompleteListeners];
        this.#turnCompleteListeners.clear();
        for (const fn of listeners) {
            fn();
        }

        const next = this.#queue.shift();
        if (next) {
            this.prompt(next);
        }
    }
}

The ordering in the finally block matters. Listeners fire after #streaming = false so callers see the correct state, but before the message queue drains so the callback runs before the next queued prompt starts a new turn. Getting this wrong would mean the heartbeat's completion callback fires after the next prompt has already started, which could lead to state confusion.

The pattern is deliberately minimal. A full EventEmitter would work, but it brings subscription management, error handling for listener exceptions, and a lifecycle that outlives any single turn. The one-shot pattern – register, fire once, auto-clear – is exactly the right scope. The Set is cleared after draining, so listeners don't accumulate across turns.

Recovery at Startup

If a task is interrupted, the process is dead. Recovery only matters when it comes back. Running a recovery scan on every heartbeat tick would be wasteful – interrupted tasks can only exist after a crash, and a crash means a restart.

The Heartbeat.start() method calls #recoverInterrupted() before starting the interval timer:

start(): void {
    if (this.#timer) return;
    this.#recoverInterrupted();
    this.#timer = setInterval(() => this.#tick(), this.#intervalMs);
}

This is from src/scheduler/heartbeat.ts

The recovery logic scans for tasks where fired_at is set but last_run_at is null – tasks that were dispatched but never completed:

#recoverInterrupted(): void {
    const now = new Date().toISOString();
    const interrupted = this.#tasks.findInterrupted();
    for (const task of interrupted) {
        this.#tasks.resetFired(task.id);
        this.#tasks.resetForRetry(task.id, now);
    }
    if (interrupted.length > 0) {
        const list = interrupted
            .map((t) => `• ${t.description}`)
            .join('\n');
        this.#session.prompt(
            `⚠️ The following scheduled tasks were interrupted before completing and have been re-queued:\n${list}`
        );
    }
}

For each interrupted task: clear fired_at, set next_run_at to now. Then send a single prompt listing everything that was recovered. The next heartbeat tick picks them up as due tasks and dispatches them normally.

The Telegram notification is the key detail. Silent recovery would be correct – the tasks would re-run – but invisible. The notification means you see what happened: "The following scheduled tasks were interrupted before completing and have been re-queued: • Morning joke, • Lunch reminder." That's the difference between a system that heals silently and one that heals visibly. Both are better than a system that doesn't heal at all, but visible healing builds trust.

The Refactor Detour

Adding onTurnComplete() to the Session class meant opening a file that was already too large. The listener mechanism itself was small – a Set, a registration method, a drain loop in finally. But the class it was being added to had accumulated responsibilities that made every change feel precarious.

Three concerns were extracted into their own classes in the same branch: ToolApprover (the permission gate for tool calls), UsageTracker (token and cost accumulation), and TurnRenderer (Telegram message rendering). None of these were part of the watchdog feature. But touching the Session class for the listener mechanism made the existing complexity impossible to ignore.

The alternative – leaving a 400-line class and coming back later – never works. "Later" means never, or it means a larger, scarier refactor when the class has grown to 600 lines. Extracting while you're already in the file, with the context loaded and the tests passing, is the cheapest time to do it.

The extractions didn't change behaviour. They moved code out and left behind delegation. The Session class shrank. The new classes got their own tests. The total test count for the branch landed at 285.

This is a pattern worth naming: the refactor that comes along for the ride. You open a file to add one thing, and the file tells you it needs three things removed first. You can ignore it – ship the feature, close the PR. But the cost of ignoring it compounds. Every future change to that file carries the weight of the complexity you chose not to address. The watchdog feature was the reason to open the file. The extraction was the price of being honest about what was inside.

Conclusion

The heartbeat tick now works in two stages:

async #tick(): Promise<void> {
    const now = new Date().toISOString();
    const dueTasks = this.#tasks.findDue(now);

    for (const task of dueTasks) {
        if (this.#session.streaming) break;

        const delayMs =
            new Date(now).getTime() - new Date(task.next_run_at).getTime();
        const lateNote =
            delayMs > LATE_THRESHOLD_MS
                ? `\n⚠️ This task was scheduled for ${task.next_run_at} but is running ${formatDelay(delayMs)} late.`
                : '';
        const prompt = `[Scheduled task: ${task.description}]\n${task.prompt}${lateNote}`;

        const nextRun =
            task.cron_expression === 'once'
                ? now
                : nextRunFromCron(task.cron_expression);

        this.#tasks.markFired(task.id, now);
        this.#session.onTurnComplete(() => {
            const completedAt = new Date().toISOString();
            this.#tasks.resetFired(task.id);
            this.#tasks.updateNextRun(task.id, nextRun, completedAt);
            if (task.cron_expression === 'once') {
                this.#tasks.deactivate(task.id);
            }
        });
        this.#session.prompt(prompt);
    }
}

This is from src/scheduler/heartbeat.ts (after)

Stage one: stamp fired_at and dispatch the prompt. Stage two, in the one-shot callback: stamp last_run_at, clear fired_at, advance the schedule, and deactivate if it's a one-shot task.

If the process dies between stage one and stage two, fired_at stays set. On next startup, recovery detects it, resets the task, and sends a Telegram notification listing what was recovered.

The late-task detection is a bonus that came from the same refactor. If a task runs more than two minutes past its scheduled time – because the session was busy, or the process was restarting – the prompt includes a note about the delay. The model can decide whether to mention it or just carry on. It's a small thing, but it closes another visibility gap: not just "did it run?" but "did it run on time?"

Before: the heartbeat dispatched a task and immediately declared it complete. The database was optimistic. If the process died mid-turn, the task was gone – marked as done, never actually finished, no one the wiser.

After: the database is honest. A task is only marked complete when the model has finished responding. If the process dies, the interrupted state is visible and automatically recovered. The gap between "dispatched" and "completed" is no longer invisible – it's a column in the database that the system actively monitors and heals.

The implementation is small. A nullable column. Four store methods. A one-shot listener. A recovery sweep at startup. The before/after of the heartbeat tick is the clearest summary: what used to be a single step – fire and forget – is now two steps with a callback between them. The database no longer lies about what happened. And when something goes wrong, the system tells you about it instead of pretending everything is fine.