cybrkyd

Speech to text - comparing Google Docs and Whisper AI

 Tue, 04 Jun 2024 09:48 UTC
Image: CC BY 4.0 by cybrkyd

At the end, this felt surreal. I felt as though I was in George Orwell's Nineteen Eighty-Four, using the speakwrite.

I dislike typing on a keyboard. I don't hate it; one does, after all, need to use that tool in one's day job, doesn't one? No, I just think that a keyboard slows me down. I'm not the fastest typist in the world and sometimes I catch my mind racing ahead whilst my fingers struggle to keep up. How fast is a thought?

Speech to text on a phone

I honestly had no idea before a few days ago that Google Docs (via Chrome) has speech-to-text ability. I found this out after attempting to use Gboard on my phone via Markor. Just press the little microphone icon and talk.

Nice! That works, except when you tell it "new paragraph". Then, the recording stops and you need to press the little microphone icon again.

OK, so let's dictate in one huge paragraph, not a problem. That works, and works very well, but then I find myself constantly wandering back to the screen to check for mistakes. Soon, with a lot of words spoken (and a lot of mistakes by Google, not me), I end up with a jumbled mess of words. I'm now at the point of losing my flow and my train of thought, fixated instead on the mistakes. I find that I constantly need to stop to make corrections.

Google Docs on the PC

I had a much better experience when using Google Docs on the PC, but only insofar as the mic stays on when pressed and doesn't disable itself when you ask for a new paragraph. Unfortunately, the mis-heard words are still present, as is the confusion over whether to insert a full stop or to actually write "full stop".

I find myself once more looking at the screen as it types and the result is, sadly, the same jumble of mis-heard words, mis-punctuated sentences and there goes my train of thought all over again. I'm constantly pausing to manually fix one of the many errors.

This is more of a hindrance than a help.

Whisper AI

It was love at first soundbite! Sadly, I do not have a graphics card in my PC but who cares? I've re-encoded three-hour videos before with just the CPU and boat-loads of RAM and it worked just fine. It takes longer but again, who cares? I have no plans to dictate War and Peace and get Whisper AI to transcribe it!

I approached this as if I were using a dictaphone (remember those?); speak naturally, ramble on, record everything in natural speech. Keep the flow going; lose your train of thought on your own terms but keep rambling on and on. Just how I like it. And, I have a recording to play back to myself and importantly, to refer back to.

I attempted to use my phone...no go! Android does not make it easy to record from a Bluetooth mic. Apple, on the other hand, allows this and it works very well with the Voice Memos app. How does Bluetooth factor in here? Well, I'm one of those who likes to pace as I think; it helps! And preferably without anything in my hands.

The only challenge — in true-to-form-Apple-style — is moving the recordings from the device to Linux. I cheated...I shared the files from the Voice Memos app to VLC and then copied them via USB from the VLC folder to my PC.
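
With the recordings on the PC, a CPU-only transcription run could be scripted along these lines. This is a minimal sketch rather than the exact setup used here: it assumes the open-source openai-whisper command-line tool and ffmpeg are installed, and the "base" model, the "transcripts" folder and the function names are illustrative choices only.

```python
import subprocess
from pathlib import Path


def whisper_command(audio: Path, model: str = "base",
                    out_dir: Path = Path("transcripts")) -> list[str]:
    """Build a CPU-only command line for the openai-whisper CLI.

    Assumption: the `whisper` executable is on the PATH. The model name
    and output folder are placeholders, not values from the post.
    """
    return [
        "whisper", str(audio),
        "--model", model,
        "--device", "cpu",        # no graphics card, so stay on the CPU
        "--fp16", "False",        # half precision is not used on the CPU
        "--output_format", "txt",
        "--output_dir", str(out_dir),
    ]


def transcribe_all(recordings: list[Path]) -> None:
    """Transcribe each recording in turn, stopping on the first failure."""
    for audio in recordings:
        subprocess.run(whisper_command(audio), check=True)
```

Calling `transcribe_all(sorted(Path("memos").glob("*.m4a")))` would then work through a folder of voice memos one file at a time; the "memos" folder name is, again, only an example.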

My very first attempt yielded a four-minute recording which took about one-and-a-half minutes to transcribe. Not bad. And not a single mistake, not a single out-of-place punctuation mark. Perhaps I got lucky. A few other tests later and yes, nothing is perfect; there were a few mistakes but no big deal. I still have my recordings available to cross-reference and can quickly and easily make any corrections.
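
To put that first result in numbers, the arithmetic can be sketched as below. The four-minute and ninety-second figures come from the test above; the function names are made up for illustration, and the estimate assumes transcription time scales linearly with recording length, which may not hold on every machine or for every model.

```python
def realtime_factor(audio_seconds: float, transcribe_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    return audio_seconds / transcribe_seconds


def estimate_transcribe_seconds(audio_seconds: float, factor: float) -> float:
    """Rough processing time for a recording, assuming linear scaling."""
    return audio_seconds / factor


# Four minutes of speech transcribed in about one and a half minutes.
factor = realtime_factor(4 * 60, 1.5 * 60)
print(f"{factor:.2f}x real time")  # prints "2.67x real time"

# Hypothetical half-hour ramble at the same rate.
half_hour = estimate_transcribe_seconds(30 * 60, factor)
print(f"{half_hour / 60:.2f} minutes")
```

At that rate a half-hour ramble would need a little over eleven minutes of processing, still comfortably faster than typing it out.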

Lessons learnt

The Google Docs approach seems to analyse each word as it comes in, failing to get the "gist" of what I'm saying in relation to the sentence construction. This is evidenced by the fact that the narrator needs to explicitly request a "full stop" or a "new paragraph". It's fine for speech-to-text if you really, really hate typing.

Whisper AI seems to tackle speech-to-text transcription by paying attention to the nuances and patterns in the speech, inferring that a pause might mean a comma or a full stop. A slight change in tone or volume of the voice is still "heard" correctly, even in the presence of some background noise.

For me, making a long voice note should be a natural process, speaking the words as I think and not having to worry about the tool capturing it correctly. By recording actual words using my voice, I can also refer back to those sound files if I need to. Whisper AI is therefore my sub-titles.
