Computer Science Simplified

Processing podcasts part 2

Async workflows, dealing with large files, vector embeddings, ChatGPT, MySQL full-text indices - Processing podcasts

Martin Joo
Aug 27, 2024

Introduction
Introduction

This is the second part of this series. You can check out part 1 here.

There's a bug in the last part. Transcribing a podcast episode looks like this:

public function transcribe(string $filePath): ?string
{
    $response = $this->client->audio()->transcribe([
        'model' => 'whisper-1',
        'file' => fopen($filePath, 'r'),
    ]);

    return $response->text;
}

A 1-hour audio file in mp3 is about 130MB. You cannot send a file that large to OpenAI - the audio transcription API caps uploads at 25MB. So what can we do?


Chunking large audio files

I learned from smarter engineers that there is one secret when it comes to dealing with large files, datasets, or database tables. Avoid dealing with large stuff. Divide and conquer.

Here’s what we can do:

  • Chunk the audio file into smaller pieces. Let’s say 10MB each

  • Send them to OpenAI in a parallel way

  • Store the content of each chunk temporarily

  • Once every chunk has been processed, concatenate the contents and save the result to the episode
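The last step above - merging the per-chunk transcripts in order - can be sketched like this. The one-transcript-file-per-chunk layout and the file names are my assumption, not from the article:

```php
<?php

// Sketch: once every chunk's transcript is stored (one .txt per chunk,
// named after the chunk number), merge them back in order.
function mergeChunkTranscripts(string $chunkFolder): string
{
    $files = glob($chunkFolder . DIRECTORY_SEPARATOR . '*.txt');

    // glob() sorts alphabetically, so 10.txt would come before 2.txt;
    // sort by the numeric chunk index instead.
    usort(
        $files,
        fn (string $a, string $b) =>
            (int) basename($a, '.txt') <=> (int) basename($b, '.txt')
    );

    return implode(' ', array_map(
        fn (string $file) => trim(file_get_contents($file)),
        $files
    ));
}
```

The numeric sort matters: with more than 10 chunks, a plain alphabetical sort would splice the transcript together in the wrong order.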

This is the file structure I’ll use:

  • 1 is the podcast’s ID. Each podcast has a folder

  • 82..126 are the episode IDs. Each episode has its folder

  • In the episode’s folder, there’s the large audio file that the user uploaded

  • Finally, chunks are located in the chunk folder
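The bullet points describe a layout roughly like this (a reconstruction - the original screenshot isn't reproduced here, and the file names are illustrative):

```
storage/app/podcasts/
└── 1/                   ← podcast ID
    ├── 82/              ← episode ID
    │   ├── episode.mp3  ← the original upload
    │   └── chunks/
    │       ├── 0.mp3
    │       ├── 1.mp3
    │       └── ...
    ├── 83/
    └── ...
```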

Chunking a file is relatively easy:

class FileService
{
    public function chunk(
        string $filePath,
        string $destinationFolder,
        int $chunkSize = 1024 * 1024
    ): void {
        if (!file_exists($filePath)) {
            return;
        }

        $file = fopen($filePath, 'rb');

        if ($file === false) {
            return;
        }

        $chunkNumber = 0;

        // Only create the destination folder if it doesn't exist yet,
        // otherwise mkdir() emits a warning
        if (!is_dir($destinationFolder)) {
            mkdir($destinationFolder, 0777, true);
        }

        while (!feof($file)) {
            // Read at most $chunkSize bytes from the current position
            $chunkData = fread($file, $chunkSize);

            if ($chunkData) {
                $chunkFileName = 
                    $destinationFolder .
                    DIRECTORY_SEPARATOR . 
                    "{$chunkNumber}.mp3";

                file_put_contents($chunkFileName, $chunkData);

                $chunkNumber++;
            }
        }

        fclose($file);
    }
}

The method does the following steps:

  • It accepts the file path we want to chunk and the destination folder

  • It opens the file for reading in binary mode

  • It creates the destination folder in case it doesn’t exist

  • It starts a loop until it reaches the end of the file

  • In the loop, it reads at most $chunkSize bytes from the current position

  • And writes them into a file called N.mp3 where N is the current chunk number

  • Finally, it closes the file
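To see the steps above in action, here's a condensed, standalone version of the method together with a small demo. The 2.5MB input file and the 1MB chunk size are made up for the example:

```php
<?php

// Condensed, standalone version of FileService::chunk
function chunkFile(string $filePath, string $destinationFolder, int $chunkSize): void
{
    if (!file_exists($filePath)) {
        return;
    }

    if (!is_dir($destinationFolder)) {
        mkdir($destinationFolder, 0777, true);
    }

    $file = fopen($filePath, 'rb');
    $chunkNumber = 0;

    while (!feof($file)) {
        $chunkData = fread($file, $chunkSize);

        if ($chunkData) {
            file_put_contents(
                $destinationFolder . DIRECTORY_SEPARATOR . "{$chunkNumber}.mp3",
                $chunkData
            );

            $chunkNumber++;
        }
    }

    fclose($file);
}

// Demo: a 2.5MB file chunked into 1MB pieces yields 3 chunks
// (1MB + 1MB + 0.5MB)
$source = tempnam(sys_get_temp_dir(), 'episode');
file_put_contents($source, str_repeat('x', (int) (2.5 * 1024 * 1024)));

$chunkFolder = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'chunks_demo_' . uniqid();
chunkFile($source, $chunkFolder, 1024 * 1024);
```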

That’s quite simple, actually. When a user uploads an episode, the app needs to store the original file and chunk it into smaller pieces. Chunking should run in the background since it deals with large files.

The controller looks like this:

class PodcastEpisodeController extends Controller
{
    public function store(Request $request, Podcast $podcast)
    {
        /** @var Episode $episode */
        $episode = Episode::create([
            'podcast_id' => $podcast->id,
            'title' => $request->title,
        ]);

        $path = 'podcasts' . 
            DIRECTORY_SEPARATOR . 
            $episode->podcast->id . 
            DIRECTORY_SEPARATOR . 
            $episode->id;

        Storage::createDirectory($path);

        $episode->audio_file_path = storage_path(
            'app' . 
            DIRECTORY_SEPARATOR . 
            $request->file('audio_file')->store($path)
        );

        $episode->save();

        Bus::chain([
            new ChunkEpisodeJob($episode->refresh()),
            new TranscribeEpisodeJob($episode),
        ])
            ->dispatch();

        return $episode;
    }
}

First, it creates the episode, then it creates the folder for it and uploads the file. Finally, it can dispatch the two jobs for chunking the episode and then transcribing it. They need to run in order - first chunking, and then transcribing the chunks.

ChunkEpisodeJob is just a wrapper for the FileService call:

class ChunkEpisodeJob implements ShouldQueue
{
    use Queueable;
    use Batchable;

    public function __construct(private Episode $episode)
    {
    }

    public function handle(FileService $fileService): void
    {
        $fileService->chunk(
            $this->episode->audio_file_path,
            $this->episode->chunk_folder_path,
        );
    }
}

Transcribing the chunks

To transcribe all the small chunks, we need two jobs:

Keep reading with a 7-day free trial

Subscribe to Computer Science Simplified to keep reading this post and get 7 days of free access to the full post archives.
