Multimedia Attachments — Building Friday

Friday is a Telegram bot that talks to Claude. Text in, text out. It works. But try sending a screenshot, a voice note, a PDF – anything that isn't plain text – and nothing happens. No error. No acknowledgement. The message arrives at the bot and vanishes.

The bridge registers a single handler: message:text. Photos, documents, voice notes, audio files, video – none of them have a handler. No handler means no processing. Telegram delivers the message, the bot shrugs, and the message disappears.

The fix needs to cover two separate features that are really one story: making Friday receive files properly, and then making sure those files stick around long enough to be useful. The first version gets attachments working. The second version makes them work well.

File Paths Over Content Blocks

The first decision was how Claude should see attachments. The obvious approach – base64-encoding images into special content blocks, using the vision API, building MIME-type-specific pipelines – would have meant a lot of plumbing for each file type.

We went the other way. Attachments get downloaded to disk. Their absolute file paths get injected into the prompt as plain text:

[Attached: /home/friday/meta/attachments/a1b2c3d4/photo.jpg (photo)]

Claude Code already has a Read tool that opens images natively, and its other tools can inspect file contents and run commands on files. The prompt is still a string. The file is still a file. Claude figures out what to do with it.

This means every file type – photos, documents, voice notes, video – goes through the same pipeline: download, inject the path, move on. No MIME-type branching. No special handling for images versus documents. One pipeline that handles everything.
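To make the "path as plain text" idea concrete, here is a minimal sketch of how such a prompt could be assembled. The AttachmentRef shape and the attachmentLine and buildPrompt helpers are hypothetical names for illustration, not Friday's actual code; only the bracketed line format comes from the example above.

```typescript
// Hypothetical shape of one downloaded attachment (field names are assumptions).
interface AttachmentRef {
    filePath: string; // absolute path on disk
    label: string;    // e.g. 'photo', 'document', 'voice'
}

// Render one attachment in the bracketed format shown above.
function attachmentLine(ref: AttachmentRef): string {
    return `[Attached: ${ref.filePath} (${ref.label})]`;
}

// Combine attachment lines with an optional caption into a plain-text prompt.
function buildPrompt(refs: AttachmentRef[], caption?: string): string {
    const lines = refs.map(attachmentLine);
    if (caption && caption.trim() !== '') {
        lines.push(caption.trim());
    }
    return lines.join('\n');
}

const examplePrompt = buildPrompt(
    [{ filePath: '/home/friday/meta/attachments/a1b2c3d4/photo.jpg', label: 'photo' }],
    'What is in the bottom-left corner?'
);
console.log(examplePrompt);
```

Nothing in this sketch cares what kind of file it is handed, which is the whole point: the label is just a hint for Claude.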

Transcribing voice notes, extracting text from PDFs, pulling frames from video – each of those is its own feature with its own plumbing. The uniform approach ships once and covers everything. Voice transcription comes later, as a targeted improvement to one specific weak spot.

One Pipeline for Everything

Every attachment type – photo, document, voice, audio, video, video note – goes through the same #handleAttachment method on the bridge. The per-type handlers extract the file ID and a label, then delegate:

async handlePhoto(ctx) {
    const photo = ctx.message.photo[ctx.message.photo.length - 1];
    await this.#handleAttachment(
        ctx.chat.id,
        ctx.message.message_id,
        ctx.message.media_group_id,
        photo.file_id,
        'photo',
        ctx.message.caption,
        photo.file_size
    );
}

This is from src/telegram/bridge.ts

Telegram provides multiple resolutions for each photo, from thumbnail to full size. We take the last element in the photo array, which in practice is the highest available resolution. Getting that wrong would mean sending Claude a 90×90 thumbnail.
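The bridge relies on array order. If you would rather not depend on it, a defensive variant could compare dimensions instead. This is a sketch, not what bridge.ts does; pickLargestPhoto is a hypothetical helper, and the PhotoSize interface lists only the Telegram fields it needs.

```typescript
// Subset of Telegram's PhotoSize object used by this sketch.
interface PhotoSize {
    file_id: string;
    width: number;
    height: number;
    file_size?: number;
}

// Pick the variant with the most pixels, regardless of array order.
function pickLargestPhoto(sizes: PhotoSize[]): PhotoSize {
    if (sizes.length === 0) {
        throw new Error('photo array is empty');
    }
    return sizes.reduce((best, current) =>
        current.width * current.height > best.width * best.height ? current : best
    );
}

const exampleSizes: PhotoSize[] = [
    { file_id: 'thumb', width: 90, height: 90 },
    { file_id: 'full', width: 1280, height: 960 },
    { file_id: 'medium', width: 320, height: 240 },
];
console.log(pickLargestPhoto(exampleSizes).file_id); // prints "full"
```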

The #handleAttachment method downloads the file, then hands it to the collector:

async #handleAttachment(
    chatId, messageId, groupId, fileId, label, caption, fileSize
) {
    let filePath;
    try {
        ({ filePath } = await downloadFile(
            this.#bot.api, this.#token, fileId, fileSize
        ));
    } catch (err) {
        if (err instanceof AttachmentTooLargeError) {
            await this.#bot.api.sendMessage(
                chatId, '⚠️ File exceeds the 20 MB Telegram download limit.'
            );
            return;
        }
        throw err;
    }

    this.#collector.collect(groupId, {
        filePath, label, caption, chatId, messageId,
    });
}

This is from src/telegram/bridge.ts

Download, catch errors, hand off. Same code path for every file type.
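The per-type handlers all do the same two things: find the file ID and pick a label. A hypothetical helper that does this for any message might look like the sketch below. The IncomingMessage interface is a trimmed-down stand-in for Telegram's Message object, and the label strings are assumptions rather than the exact labels Friday uses.

```typescript
// Minimal slices of Telegram's Message object; only the fields used here.
interface FileLike {
    file_id: string;
    file_size?: number;
}

interface IncomingMessage {
    photo?: FileLike[];
    document?: FileLike;
    voice?: FileLike;
    audio?: FileLike;
    video?: FileLike;
    video_note?: FileLike;
}

interface ExtractedAttachment {
    fileId: string;
    label: string;
    fileSize: number | undefined;
}

// Return the first attachment found on the message, or null for plain text.
function extractAttachment(message: IncomingMessage): ExtractedAttachment | null {
    // Photos arrive as an array of sizes; take the last one, as handlePhoto does.
    const sizes = message.photo ?? [];
    const photo = sizes.length > 0 ? sizes[sizes.length - 1] : undefined;

    const candidates: Array<[string, FileLike | undefined]> = [
        ['photo', photo],
        ['document', message.document],
        ['voice', message.voice],
        ['audio', message.audio],
        ['video', message.video],
        ['video note', message.video_note],
    ];

    for (const [label, file] of candidates) {
        if (file) {
            return { fileId: file.file_id, label, fileSize: file.file_size };
        }
    }
    return null;
}

console.log(extractAttachment({ voice: { file_id: 'v1', file_size: 1234 } }));
```

Whether you keep one handler per type or fold them into a helper like this is a matter of taste; either way the download path stays identical.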

Media Group Batching

Send three photos as a Telegram album and the bot receives three separate updates. Each one carries the same media_group_id, but they arrive individually – not as a batch.

Without batching, each photo would trigger its own prompt to Claude. Three photos, three separate "here's a file" messages. That's noisy and wasteful.

The MediaGroupCollector handles this with a debounce. When an update has a group ID, the collector stashes the entry and starts (or restarts) a 500ms timer. When the timer fires, all collected entries for that group flush as a single prompt. Solo attachments – no group ID – flush immediately:

collect(groupId: string | undefined, entry: CollectedEntry): void {
    if (!groupId) {
        this.#onFlush?.([entry]);
        return;
    }

    const existing = this.#pending.get(groupId);
    if (existing) {
        clearTimeout(existing.timer);
        existing.entries.push(entry);
        existing.timer = setTimeout(
            () => this.#flush(groupId),
            this.#debounceMs
        );
    } else {
        const timer = setTimeout(
            () => this.#flush(groupId),
            this.#debounceMs
        );
        this.#pending.set(groupId, { entries: [entry], timer });
    }
}

This is from src/telegram/media-group-collector.ts

The updates typically arrive within 200ms of each other, so 500ms is generous. When the timer fires, the flush callback receives all entries at once and builds a single prompt with all the file paths listed together.
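The excerpt above shows collect but not the flush side. A self-contained sketch of the whole debounce might look like this. DebouncedGroupCollector is a hypothetical generic stand-in, not the real MediaGroupCollector, and its constructor shape is an assumption.

```typescript
// Generic debounce-by-key collector. Only the collect() behaviour mirrors the
// excerpt above; the class name and constructor are assumptions.
class DebouncedGroupCollector<T> {
    private pending = new Map<
        string,
        { entries: T[]; timer: ReturnType<typeof setTimeout> }
    >();
    private debounceMs: number;
    private onFlush: (entries: T[]) => void;

    constructor(debounceMs: number, onFlush: (entries: T[]) => void) {
        this.debounceMs = debounceMs;
        this.onFlush = onFlush;
    }

    collect(groupId: string | undefined, entry: T): void {
        // Solo attachments have no group ID and flush immediately.
        if (!groupId) {
            this.onFlush([entry]);
            return;
        }

        const existing = this.pending.get(groupId);
        if (existing) {
            // Another update for the same album: restart the timer.
            clearTimeout(existing.timer);
        }
        const entries = existing ? [...existing.entries, entry] : [entry];
        const timer = setTimeout(() => this.flush(groupId), this.debounceMs);
        this.pending.set(groupId, { entries, timer });
    }

    private flush(groupId: string): void {
        const group = this.pending.get(groupId);
        if (!group) return;
        // Remove the group before calling back so a late update starts a new batch.
        this.pending.delete(groupId);
        this.onFlush(group.entries);
    }
}
```

With a 500ms debounce, three collect calls carrying the same group ID produce one onFlush call with three entries, half a second after the last update arrives. The cost is that every album waits that extra half second before Claude sees it.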

The 20 MB Wall

Telegram's hosted Bot API has a hard limit: getFile can only serve files up to 20 MB, so bots cannot download anything larger. This is a platform constraint. The only way around it is to run a self-hosted Bot API server, which is out of scope here.

The downloader checks the size before downloading:

const MAX_FILE_SIZE = 20_000_000;

if (fileSize !== undefined && fileSize > MAX_FILE_SIZE) {
    throw new AttachmentTooLargeError();
}

const file = await api.getFile(fileId);

if (file.file_size !== undefined && file.file_size > MAX_FILE_SIZE) {
    throw new AttachmentTooLargeError();
}

This is from src/telegram/file-downloader.ts

The first check uses the size from the original message – catching the obvious case before making a network call. The second uses the size returned by getFile, for cases where the message didn't include a size but the file is still too large.

When the error bubbles up, the bridge sends a user-facing message explaining the limit. No silent failure. You know why your file didn't go through.
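For readers who want to try the size check in isolation, here is a standalone sketch. assertDownloadable and fileDownloadUrl are hypothetical helper names, and the error message is made up. The URL format is the one Telegram documents for fetching a file once getFile has returned its file_path.

```typescript
const MAX_FILE_SIZE = 20_000_000;

class AttachmentTooLargeError extends Error {
    constructor() {
        super('File exceeds the Bot API download limit');
        this.name = 'AttachmentTooLargeError';
    }
}

// Throw when a known size is over the limit; unknown sizes pass through
// so that the second check (after getFile) can catch them.
function assertDownloadable(fileSize: number | undefined): void {
    if (fileSize !== undefined && fileSize > MAX_FILE_SIZE) {
        throw new AttachmentTooLargeError();
    }
}

// getFile returns a file_path; the bytes are then fetched from this URL.
function fileDownloadUrl(token: string, filePath: string): string {
    return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

assertDownloadable(undefined);   // size unknown: allowed through
assertDownloadable(5_000_000);   // under the limit: allowed through
console.log(fileDownloadUrl('123:ABC', 'photos/file_0.jpg'));
```

The download URL embeds the bot token, which is presumably why downloadFile takes the token as a parameter. It also means the URL should never end up in logs or prompts.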

The One-Turn Problem

The first version of the attachment pipeline deletes files right after session.prompt() resolves. Simple lifecycle: download, use, delete.

It breaks immediately in practice.

Send a screenshot. Claude sees it, responds. Ask a follow-up about the bottom-left corner – the file is gone. Claude gets a path to a file that no longer exists. Queued messages make it worse: by the time the second message is dequeued, the attachment from the first is already deleted.

The pipeline works for single-turn interactions. Anything beyond that and you hit a wall of missing files. It's a one-shot mechanism pretending to be a conversation feature.

UUID Folders and 24-Hour Retention

Each download creates a folder under meta/attachments/ named with a random UUID. The file lives inside it for 24 hours:

export async function makeAttachmentGroupDir(
    attachmentsDir?: string
): Promise<string> {
    const base = attachmentsDir ?? join(process.cwd(), 'meta', 'attachments');
    const groupDir = join(base, randomUUID());
    await mkdir(groupDir, { recursive: true });
    return groupDir;
}

This is from src/telegram/file-downloader.ts

One folder per download, not per media group. Grouping by media_group_id would have required caching the group-to-folder mapping across the async download boundary – extra complexity for no real benefit, since all files persist for the same 24-hour window regardless.
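Writing the downloaded bytes into that folder could be sketched as follows. saveAttachment and its parameters are hypothetical names, and the file-name sanitising is an extra precaution we are assuming here rather than something the excerpt shows.

```typescript
import { randomUUID } from 'node:crypto';
import { mkdir, writeFile } from 'node:fs/promises';
import { basename, join } from 'node:path';

// Write downloaded bytes into a fresh UUID-named folder and return the path.
async function saveAttachment(
    attachmentsDir: string,
    suggestedName: string,
    bytes: Uint8Array
): Promise<string> {
    const groupDir = join(attachmentsDir, randomUUID());
    await mkdir(groupDir, { recursive: true });

    // Keep only the final path segment so a name like '../x' cannot escape
    // the folder; fall back to a fixed name if nothing usable is left.
    const candidate = basename(suggestedName);
    const safeName =
        candidate === '' || candidate === '.' || candidate === '..'
            ? 'attachment'
            : candidate;

    const filePath = join(groupDir, safeName);
    await writeFile(filePath, bytes);
    return filePath;
}
```

Because each call creates its own folder, two files with the same name never collide, and deleting one attachment is always a single recursive remove.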

The AttachmentCleaner service handles expiration. It runs on boot and every 24 hours after that, scanning for folders whose modification time is older than the retention threshold:

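async run(): Promise<void> {
    let entries;
    try {
        entries = await readdir(this.#attachmentsDir, {
            withFileTypes: true,
        });
    } catch {
        // Attachments directory is missing or unreadable: nothing to clean.
        return;
    }

    // Folders last modified before this timestamp have expired.
    const threshold = Date.now() - this.#retentionMs;
    // Tally of non-directory entries that were left in place.
    let skipped = 0;

    for (const entry of entries) {
        if (!entry.isDirectory()) {
            skipped++;
            continue;
        }

        const dirPath = join(this.#attachmentsDir, entry.name);
        // The folder may vanish between readdir and stat; ignore it if so.
        const stats = await stat(dirPath).catch(() => null);
        if (!stats) continue;

        if (stats.mtimeMs < threshold) {
            await rm(dirPath, { recursive: true });
        }
    }
}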

This is from src/telegram/attachment-cleaner.ts

The folder structure makes cleanup straightforward: delete old directories recursively, skip anything that isn't a directory. Legacy flat files from before the refactor are left alone – they're harmless, and special-casing their cleanup isn't worth the code.

Twenty-four hours is generous. Most multi-turn conversations about an attachment happen within minutes. But the retention window is configurable via telegram.attachment-retention-hours, so tightening it later is a one-line config change.
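The expiry rule itself is small enough to sketch on its own. The helper names below are hypothetical, and reading the value from telegram.attachment-retention-hours is left out.

```typescript
// Convert the configured hours into milliseconds.
function retentionMs(hours: number): number {
    return hours * 60 * 60 * 1000;
}

// A folder is expired when its mtime is older than now minus the retention.
function isExpired(mtimeMs: number, nowMs: number, retention: number): boolean {
    return mtimeMs < nowMs - retention;
}

// Upper bound on how long a folder can survive when the cleaner only runs
// once per sweep interval: it may just miss one sweep and wait for the next.
function maxLifetimeMs(retention: number, sweepInterval: number): number {
    return retention + sweepInterval;
}

const now = Date.parse('2024-01-02T12:00:00Z');
const day = retentionMs(24);
console.log(isExpired(now - day - 1, now, day)); // true: just past 24 hours
console.log(isExpired(now - day + 1, now, day)); // false: just under 24 hours
console.log(maxLifetimeMs(day, day) / retentionMs(1)); // 48
```

The last line is worth keeping in mind: with a 24-hour retention and a 24-hour sweep, a folder created right after a sweep can live for almost 48 hours. Retention is a minimum, not an exact lifetime.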

Whisper Transcription

Voice notes are the pipeline's blind spot. They arrive as file paths to audio files that Claude can't read. You get [Attached: /path/to/voice.ogg (voice)] in the prompt, and Claude has no idea what you said.

The Transcriber service shells out to the Whisper CLI:

async transcribe(filePath: string): Promise<string | null> {
    const tmpDir = await mkdtemp(join(tmpdir(), 'whisper-'));
    try {
        await execAsync(
            `${this.#binary} "${filePath}" --model ${this.#model} --output_format txt --output_dir "${tmpDir}"`
        );

        const stem = basename(filePath, extname(filePath));
        const txtPath = join(tmpDir, `${stem}.txt`);
        const text = await readFile(txtPath, 'utf8');
        const trimmed = text.trim();

        if (!trimmed) return null;

        return trimmed;
    } catch {
        return null;
    } finally {
        await rm(tmpDir, { recursive: true }).catch(() => {});
    }
}

This is from src/telegram/transcriber.ts

Whisper has a well-maintained Python CLI, so we shell out to it rather than pulling in a Node.js binding or WASM build. A child process is a handful of lines with zero new npm dependencies – and the feature is optional anyway. If Whisper isn't installed, the transcriber just isn't available.
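One way to decide at boot whether the transcriber should exist at all is to probe for the binary. This is a sketch under the assumption that running the binary with --help is a cheap enough check; isBinaryAvailable is a hypothetical helper, not something from transcriber.ts.

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Resolve to true when the binary starts and exits cleanly, false otherwise
// (for example ENOENT when it is not on PATH).
async function isBinaryAvailable(binary: string): Promise<boolean> {
    try {
        await execFileAsync(binary, ['--help']);
        return true;
    } catch {
        return false;
    }
}

isBinaryAvailable('whisper').then((available) => {
    console.log(available ? 'transcription enabled' : 'transcription disabled');
});
```

If the probe comes back false, the bridge can simply skip creating a transcriber and every audio file falls through to the plain file-path line.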

Whisper writes output to <output_dir>/<input_stem>.txt – so voice.ogg produces voice.txt. You need to know the input filename to find the output. The transcriber computes the expected path deterministically.
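The path computation, plus a variant of the invocation that passes arguments as an array instead of one interpolated string, can be sketched like this. The helper names are hypothetical. Swapping exec for execFile is our suggestion rather than what transcriber.ts does: with no shell in between, a file name containing a double quote or $(...) cannot change the command.

```typescript
import { execFile } from 'node:child_process';
import { basename, extname, join } from 'node:path';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Build the argument vector. Each element reaches Whisper verbatim, with no
// shell to interpret quotes or substitutions inside a file name.
function whisperArgs(filePath: string, model: string, outputDir: string): string[] {
    return [
        filePath,
        '--model', model,
        '--output_format', 'txt',
        '--output_dir', outputDir,
    ];
}

// Whisper names its output after the input stem: voice.ogg -> voice.txt.
function transcriptPath(filePath: string, outputDir: string): string {
    const stem = basename(filePath, extname(filePath));
    return join(outputDir, `${stem}.txt`);
}

// Hypothetical wrapper; `binary` would be the configured Whisper executable.
async function runWhisper(
    binary: string,
    filePath: string,
    model: string,
    outputDir: string
): Promise<string> {
    await execFileAsync(binary, whisperArgs(filePath, model, outputDir));
    return transcriptPath(filePath, outputDir);
}

console.log(basename(transcriptPath('/tmp/a/voice.ogg', '/tmp/out'))); // prints "voice.txt"
```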

When the collector flushes, it checks whether each entry is audio and whether a transcriber is available:

this.#collector.onFlush(async (entries) => {
    const lines = await Promise.all(
        entries.map(async (e) => {
            const isAudio =
                e.label === 'voice' || e.label.startsWith('audio');
            if (isAudio && this.#transcriber) {
                const transcript =
                    await this.#transcriber.transcribe(e.filePath);
                if (transcript !== null) {
                    return `[Transcription: ${transcript}]`;
                }
            }
            return `[Attached: ${e.filePath} (${e.label})]`;
        })
    );
    // ...
});

This is from src/telegram/bridge.ts

Voice notes and audio files arrive as [Transcription: <text>] instead of a useless file path. Everything else still comes through as a path for Claude to read directly.

Conclusion

Send a screenshot. Ask about the bottom-left corner three turns later – the file is still there. Send a voice note while driving – Claude gets the transcription, not a useless file path. Send an album of three photos – they arrive as a single prompt.

If Whisper isn't installed, you get file paths instead of transcriptions. If a file is too large, you get a message telling you why. Nothing crashes, nothing silently drops.

Two passes. The first got files flowing. The second made them stick around and made audio readable.

With voice transcription working, you can answer AskUserQuestion prompts by talking instead of typing. That's not a convenience feature – it's what made voice-exclusive development sessions possible. Walk away from the keyboard, keep building.