Processing podcasts part 2
Async workflows, dealing with large files, vector embeddings, ChatGPT, MySQL full-text indices - Processing podcasts
Introduction
This is the second part of this series. You can check out part 1 here.
The code from the last part has a bug. Transcribing a podcast episode currently looks like this:
public function transcribe(string $filePath): ?string
{
    $response = $this->client->audio()->transcribe([
        'model' => 'whisper-1',
        'file' => fopen($filePath, 'r'),
    ]);

    return $response->text;
}
A 1-hour audio file in MP3 format is about 130MB. You cannot send a file that large to OpenAI: the transcription API rejects uploads over 25MB.
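Before fixing it properly, the transcribe method can at least fail fast instead of waiting for the API to reject the upload. A minimal guard (my addition, assuming the 25MB limit):
if (filesize($filePath) > 25 * 1024 * 1024) {
    // Whisper rejects uploads larger than 25MB, so don't even try
    throw new RuntimeException('Audio file exceeds the 25MB upload limit');
}
That only surfaces the problem earlier, though. So what can we do about the file itself?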
Chunking large audio files
I learned from smarter engineers that there is one secret when it comes to dealing with large files, datasets, or database tables: avoid dealing with large stuff. Divide and conquer.
Here’s what we can do:
Chunk the audio file into smaller pieces, say 10MB each
Send them to OpenAI in parallel
Store the content of each chunk temporarily
Once every chunk has been processed, concatenate the contents and save the result to the episode (see the sketch below)
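Steps 2-4 map naturally onto Laravel's job batching. Here's a rough sketch of the idea (TranscribeChunkJob is a hypothetical name here; the actual jobs are built later in the post):
$chunkJobs = collect(File::files($episode->chunk_folder_path))
    ->map(fn ($chunk) => new TranscribeChunkJob($episode, $chunk->getPathname()));

Bus::batch($chunkJobs)
    ->then(function () use ($episode) {
        // All chunks succeeded: concatenate the stored chunk
        // transcriptions and save the full text to the episode
    })
    ->dispatch();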
This is the file structure I’ll use:
1 is the podcast’s ID. Each podcast has its own folder
82..126 are the episode IDs. Each episode has its own folder
In the episode’s folder, there’s the large audio file that the user uploaded
Finally, the chunks are located in the chunks folder
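Reconstructed from that description, the tree looks roughly like this (episode.mp3 stands in for the uploaded file; in practice Laravel stores it under a generated name):
podcasts/
└── 1/
    ├── 82/
    │   ├── episode.mp3
    │   └── chunks/
    │       ├── 0.mp3
    │       ├── 1.mp3
    │       └── ...
    ├── ...
    └── 126/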
Chunking a file is relatively easy:
class FileService
{
    public function chunk(
        string $filePath,
        string $destinationFolder,
        int $chunkSize = 1024 * 1024 * 10, // 10MB per chunk
    ): void {
        if (!file_exists($filePath)) {
            return;
        }

        $file = fopen($filePath, 'rb');

        $chunkNumber = 0;

        if (!is_dir($destinationFolder)) {
            mkdir($destinationFolder, 0777, true);
        }

        while (!feof($file)) {
            $chunkData = fread($file, $chunkSize);

            if ($chunkData) {
                $chunkFileName =
                    $destinationFolder .
                    DIRECTORY_SEPARATOR .
                    "{$chunkNumber}.mp3";

                file_put_contents($chunkFileName, $chunkData);

                $chunkNumber++;
            }
        }

        fclose($file);
    }
}
The method does the following steps:
It accepts the path of the file we want to chunk and the destination folder
It opens the file for reading in binary mode
It creates the destination folder in case it doesn’t exist yet
It starts a loop that runs until it reaches the end of the file
In each iteration, it reads the next chunk of bytes from the current position
And writes them into a file called X.mp3, where X is the current chunk number
Finally, it closes the file
That’s quite simple, actually. When a user uploads an episode, the app needs to store the original file and chunk it into smaller pieces. Chunking should run in the background since it deals with large files.
The controller looks like this:
class PodcastEpisodeController extends Controller
{
    public function store(Request $request, Podcast $podcast)
    {
        /** @var Episode $episode */
        $episode = Episode::create([
            'podcast_id' => $podcast->id,
            'title' => $request->title,
        ]);

        // podcasts/{podcast_id}/{episode_id}
        $path = 'podcasts' .
            DIRECTORY_SEPARATOR .
            $episode->podcast->id .
            DIRECTORY_SEPARATOR .
            $episode->id;

        Storage::createDirectory($path);

        $episode->audio_file_path = storage_path(
            'app' .
            DIRECTORY_SEPARATOR .
            $request->file('audio_file')->store($path)
        );

        $episode->save();

        Bus::chain([
            new ChunkEpisodeJob($episode->refresh()),
            new TranscribeEpisodeJob($episode),
        ])
            ->dispatch();

        return $episode;
    }
}
First, it creates the episode, then it creates the folder for it and stores the uploaded file. Finally, it dispatches the two jobs that chunk the episode and then transcribe it. They need to run in this order (first chunking, then transcribing the chunks), which is why they are chained instead of dispatched separately.
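Worth noting: if ChunkEpisodeJob fails, the chain stops and TranscribeEpisodeJob never runs. Chains accept a catch callback if you want to react to that (a sketch of my own, not from the article):
Bus::chain([
    new ChunkEpisodeJob($episode->refresh()),
    new TranscribeEpisodeJob($episode),
])
    ->catch(function (Throwable $e) use ($episode) {
        // A job in the chain failed, so the episode won't be
        // transcribed. Log it so it can be retried later.
        Log::error("Episode {$episode->id} processing failed: {$e->getMessage()}");
    })
    ->dispatch();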
ChunkEpisodeJob is just a wrapper for the FileService call:
class ChunkEpisodeJob implements ShouldQueue
{
    use Queueable;
    use Batchable;

    public function __construct(private Episode $episode)
    {
    }

    public function handle(FileService $fileService): void
    {
        $fileService->chunk(
            $this->episode->audio_file_path,
            $this->episode->chunk_folder_path,
        );
    }
}
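The chunk_folder_path attribute isn't shown in this excerpt. One plausible implementation is an accessor on the Episode model that derives the folder from the audio file's location (an assumption on my part, not necessarily the article's code):
class Episode extends Model
{
    // Hypothetical accessor: keeps the chunks next to the uploaded
    // audio file, e.g. storage/app/podcasts/1/82/chunks
    public function getChunkFolderPathAttribute(): string
    {
        return dirname($this->audio_file_path) .
            DIRECTORY_SEPARATOR .
            'chunks';
    }
}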
Transcribing the chunks
To transcribe all the small chunks, we need two jobs: