Reading Time: 4 minutes

I’m currently working on a series of episodes for a Podcast I’ll be publishing soon. The Podcast will be in Italian and I wanted to make sure to publish the episode transcripts together with the audio episodes.

The idea of manually typing all the episodes text wasn’t really appealing to me so I started looking around.

What are the tools out there?

From a quick Google search, it seems that some companies are offering a mix of automated and human-driven transcription services.

I wasn’t really interested in that for now. I was, of course, just interested in consuming an API I could push my audio to and get back some text in a reasonable amount of time.

For this reason, I started looking for speech-to-text APIs and, of course, the usual suspects figured among the first results.

To be quite honest, I didn’t spend too much time investigating the solutions above. I probably spent more time reading about them to write this blog post.

I decided to go with Google Cloud because I’ve never used GCP before and wanted to give it a try. Additionally, the documentation for it seemed quite straightforward, as well as the support for Italian as language to transcribe from (the podcast is in Italian). I also had a few free credits available because I’ve never used GCP for personal use before.

Setting up

If you want to try transcribing your episodes too, follow this quick setup guide to get started.

Head over to Google Cloud and set up an account. Make sure you create a project and enable the Speech-to-Text API. If you forget to do so gcloud will be able to take care of that for you, later.

Google Cloud Speech-to-Text
Google Cloud Speech-to-Text

Second thing I did was installing gcloud, the CLI Google Cloud provides for interacting with the APIs. This time I was only interested in testing the API so it seemed to me that this tool was the only way to get started quickly.

Additionally, there’s not much you can do from the Google Cloud Web Console if you want to deal with Speech-to-Text APIs.

Get your file ready for transcription

Sampling rate for your audio file should be at least 16 kHz for better results. Additionally, GCP recommends a lossless codec. I only had an mp3 of my episode handy at the time so I gave it a try anyway and it worked well enough.

Make sure you know the sample rate of your file, though, because specifying a wrong one might lead to poor results.

You can usually verify the sample rate by getting info on your file from your Mac’s Finder:

File's context menu
Just click Get Info on your file’s context menu
File's Finder Info section with sample rate
There’s the sample rate

You can read more about the recommended settings on the Best Practices section.

Upload your episode to the bucket

GCP needs your file to be available from a Storage Bucket so, go ahead and create one.

Storage Bucket creation example

You’ll be able to upload your episode from there.

Time to transcribe

Once you have your episode file up there in the cloud go back to your local machine terminal were you have configured the gcloud tool.

Gcloud used to trigger the speech-to-text transcription

If your episode lasts longer than 60 seconds (😬) you’ll want to use recognize-long-running and most likely specify --async.

As I said before, make sure you specify the right --sample-rate: in my case 44100. This will help GCP transcribe your file with better results.

The --async switch creates a long-running asynchronous operation. It took around 5 minutes for me to have the operation complete.

Oddly, I wasn’t able to find any reference to the asynchronous operation from my Google Cloud Console. So, if you want to be able to know what happened to your transcription job, make sure you take note of the operation identifier. You’ll need it to query the speech operations API for information about your transcription job.

The speech operation metadata

The transcribed data

Once your transcription operation is complete the describe command will return the transcript excerpts, together with the confidence rate.

The speech transcript excerpt

I wasn’t particularly interested in the confidence rate, I only wanted a big blob of text to be able to review and use for SEO purposes as well as to be able to include it with the episode. For this reason, jq to the resque!

I love jq, you can achieve so much with when it comes to manipulate JSON.

In my case, I only wanted to concatenate all the transcript fields and save them to a file. Here’s how I did:

$ ./bin/gcloud ml speech operations describe <your-transcription-operation-id> | jq '.response.results[].alternatives[].transcript' > my-transcript.txt

And that’s it!

Conclusion

I thought of sharing the steps above because they’ve been useful to me in producing the transcripts. I think GCP Speech-to-Text works quite well with Italian but, of course, the transcript is not suitable to be used as it is, unless your accent is perfect. Mine wasn’t 😅.

If you want to know more about my journey towards publishing my first podcast follow me on Twitter were I’ll be sharing more about it.

Photo by Malte Wingen on Unsplash

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.