Processing podcasts part 3
Introduction
In the first two parts of this post, we built a basic app that can:
Transcribe a podcast episode
Summarize it
Create OpenAI embeddings
And run full-text search in the content
If you missed them, you can read Part 1 here and Part 2 here.
There are two other features I’m going to build in this part:
Selecting an episode and recommending similar ones based on the content and topics
Asking questions about a particular episode
Recommending similar episodes
The embeddings can be used to determine how similar two texts are. Since an embedding is a list of coordinates, we're not really talking about text anymore but about vectors.
Cosine similarity is the simplest (and a perfectly fine) method to compare two vectors. The first step is calculating the dot product of the two vectors:

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ

In code, it looks like this:
private function dotProduct(array $a, array $b): float
{
    $products = array_map(fn ($ax, $bx) => $ax * $bx, $a, $b);

    return array_sum($products);
}
It multiplies the corresponding elements of the two arrays and returns the sum of the products.
A higher dot product means a greater similarity in direction. For example, if two vectors are 90 degrees apart, their dot product is zero; if they point in similar directions, it is large.
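A quick standalone sketch of the function above illustrates this (the perpendicular example vectors here are mine, not from the original post):

```php
<?php

// Standalone copy of the dotProduct() method above
function dotProduct(array $a, array $b): float
{
    return array_sum(array_map(fn ($ax, $bx) => $ax * $bx, $a, $b));
}

echo dotProduct([1, 2], [1, 3]) . "\n";  // 7: similar directions, large product
echo dotProduct([1, 2], [-2, 1]) . "\n"; // 0: the vectors are 90 degrees apart
```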
For the next part, we need to determine the magnitude of each vector, in other words, its length:

|a| = √(a₁² + a₂² + … + aₙ²)

In code, it looks like this:
private function magnitude(array $a): float
{
    return sqrt(array_sum(array_map(fn ($x) => $x * $x, $a)));
}
The whole cosine similarity calculation looks like this:
private function cosineSimilarity(array $a, array $b): float
{
    $dotProduct = $this->dotProduct($a, $b);

    $magnitudeA = $this->magnitude($a);
    $magnitudeB = $this->magnitude($b);

    return $dotProduct / ($magnitudeA * $magnitudeB);
}
Without going into every mathematical detail, cosine similarity returns a number between -1 and 1:
1 means that the vectors are identical in direction
0 means that the vectors are orthogonal
-1 means that the vectors are diametrically opposed
Let's see these three cases.
The vectors [1,2] and [1,3] point in almost identical directions; the cosine similarity is 0.9899.
The vectors [1,2] and [-2,1] are orthogonal; the cosine similarity is exactly 0.
The vectors [1,2] and [-2,-2] are diametrically opposed; the cosine similarity is -0.9487.
With this simple function, we can determine how similar two vectors are.
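To sanity-check the numbers above, here is a standalone sketch that combines the three methods into plain functions:

```php
<?php

// Plain-function versions of the class methods shown above
function dotProduct(array $a, array $b): float
{
    return array_sum(array_map(fn ($ax, $bx) => $ax * $bx, $a, $b));
}

function magnitude(array $a): float
{
    return sqrt(array_sum(array_map(fn ($x) => $x * $x, $a)));
}

function cosineSimilarity(array $a, array $b): float
{
    return dotProduct($a, $b) / (magnitude($a) * magnitude($b));
}

// Almost identical in direction
echo round(cosineSimilarity([1, 2], [1, 3]), 4) . "\n";   // 0.9899
// Diametrically opposed
echo round(cosineSimilarity([1, 2], [-2, -2]), 4) . "\n"; // -0.9487
```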
To do this, the controller queries the other episodes and calls the SimilarityService class:
class EpisodeController extends Controller
{
    public function recommendations(
        Episode $episode,
        SimilarityService $similarityService,
    ) {
        $episodes = Episode::query()
            ->select('id', 'title', 'embeddings')
            ->whereNot('id', $episode->id)
            ->whereNotNull('embeddings')
            ->get();

        $similarEpisodeIds = $similarityService
            ->getMostSimilarModels(
                $episode->getEmbeddings(),
                $episodes,
                3,
            );

        return Episode::query()
            ->select('id', 'title')
            ->whereIn('id', $similarEpisodeIds)
            ->get();
    }
}
There are two helper methods in SimilarityService that retrieve the similarities for a collection of models:
class SimilarityService
{
    public function getMostSimilarModels(
        array $embeddings,
        Collection $models,
        int $numberOfMatches = 3,
    ): Collection {
        return $this->getSimilarities($models, $embeddings)
            ->sortByDesc('similarity')
            ->take($numberOfMatches)
            ->pluck('model_id');
    }

    private function getSimilarities(
        Collection $models,
        array $inputEmbedding,
    ): Collection {
        $similarities = [];

        foreach ($models as $model) {
            /** @var HasEmbeddings $model */
            $similarities[] = [
                'model_id' => $model->getId(),
                'similarity' => $this->cosineSimilarity(
                    $inputEmbedding,
                    $model->getEmbeddings(),
                ),
            ];
        }

        return collect($similarities);
    }
}
And that’s it. This is the basic idea of a content-based recommendation system. Of course, it’s simple, but it works.
In my database, episodes #1101, #1102, and #1103 are related to each other but not to the other 1,100 episodes.
The recommendations for #1101 are exactly #1102 and #1103.
To sum it up, these are the important steps:
Transcribing the audio files using OpenAI
Creating embeddings from the content using OpenAI
Determining cosine similarities using our brain
Displaying recommendations to the user
The next feature is going to be a walk in the park.
Asking questions about the content
Asking questions is the simplest feature of all. We can use the chat API and provide the content as context:
public function answer(string $question, string $context): ?string
{
    $response = $this->client->chat()->create([
        'model' => 'gpt-4-turbo',
        'messages' => [
            [
                'role' => 'user',
                'content' => "Based on the following context, answer the question. Context: $context. Question: $question",
            ],
        ],
    ]);

    if (empty($response->choices)) {
        return null;
    }

    return $response->choices[0]->message->content;
}
Here are a few things you need to consider:
The 1-hour-long HackersIncorporated episode I'm testing with contains 91k characters. On average, 1 character is about 0.25 tokens in the OpenAI API, so one request costs roughly 22,750 tokens. That's a lot, and it can get expensive if you process lots of text.
Only the gpt-4-turbo model can handle a text that large: it accepts up to 128k tokens worth of text. Other models, such as gpt-3.5, can only accept something like 8k tokens, so with them you need to send chunks or summaries of the content. That works, but the result will be lower quality. And of course, gpt-4 is more expensive.
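If you go the chunking route, even a naive character-based splitter works as a starting point. This is just a sketch (chunkContent is my own name, not from the repository); a real implementation would split on sentence boundaries and count tokens, not characters:

```php
<?php

// Naive chunker: splits the transcript into roughly fixed-size pieces.
// Note: str_split() counts bytes, so for UTF-8 text you would want
// mb_str_split() instead, to avoid cutting a multibyte character in half.
function chunkContent(string $content, int $chunkSize = 8000): array
{
    return str_split($content, $chunkSize);
}

$chunks = chunkContent(str_repeat('a', 20000), 8000);
echo count($chunks); // 3 chunks: 8000 + 8000 + 4000 characters
```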
The API works as expected: it can process an hour's worth of text in just 7-10 seconds.
If the request takes too long, you can process it asynchronously:
The controller dispatches the job and returns a 202 Accepted response
The job runs in the background for a long time
When it finishes, it dispatches a notification
The notification notifies the client via websocket and shows the result
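In Laravel, that flow could be sketched with a queued job. This is only an outline under my own naming: AnswerQuestionJob, AnswerReady, and OpenAIService are illustrative, not taken from the repository.

```php
<?php

namespace App\Jobs;

use App\Models\Episode;
use App\Notifications\AnswerReady; // hypothetical broadcast notification
use App\Services\OpenAIService;    // hypothetical wrapper around the chat API
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class AnswerQuestionJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(
        private Episode $episode,
        private string $question,
    ) {}

    public function handle(OpenAIService $openAI): void
    {
        // The long-running API call happens on the queue worker
        $answer = $openAI->answer($this->question, $this->episode->transcription);

        // Push the result to the client, e.g. via a broadcast notification
        $this->episode->user->notify(new AnswerReady($answer));
    }
}
```

The controller then only has to run AnswerQuestionJob::dispatch($episode, $question); and return the 202 Accepted response.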
That's it. We have a basic app that can process podcast episodes. It can:
Transcribe and summarize episodes
Recommend similar episodes based on their content
Search the content effectively
Answer questions about an episode
If you have any questions just leave a comment!
Check out the repo here: https://github.com/Computer-Science-Simplified/podcast-searching