The acronym TTS is well known among those who develop call centre software, GPS car navigation devices and software for the blind. It stands for ‘Text To Speech’, more commonly known as voice synthesis: the conversion of written text (e.g. ‘Take the first turn on the left into Coronation Street’) into a computer-generated voice.
STT would therefore mean ‘Speech To Text’, and is usually called voice recognition. Voice recognition has been around for many years now, and is used in a simple form in call centres (‘Which size of pizza would you like – small, medium or large?’) and in a more sophisticated form in dictation software. Converting speech into text is unsurprisingly very difficult and quite computationally intensive.
It shouldn’t be surprising because we … don’t … speak … with … spaces … in … between … words, weactuallyspeakinacontinuousflow. Working out where one word begins and another one ends is tricky enough, but there are even more difficult problems. Take accents, for example: I’m a native English speaker and I still find it difficult to follow what some die-hard Scousers say. Or take the inconvenient fact that many words have the same pronunciation: way, whey, weigh (these are known as homophones).
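To get a feel for the word-boundary problem, here is a minimal sketch in Python that splits the run-together example above back into words. It is only an analogy: a real recogniser works on audio rather than letters, and the tiny vocabulary and the "fewest words wins" rule are assumptions made up for illustration.

```python
# Hypothetical toy vocabulary. "act" and "ally" are included on purpose,
# so that "actually" can be split in more than one way.
VOCAB = {"we", "actually", "speak", "in", "a", "continuous", "flow", "act", "ally"}


def segment(text, vocab=VOCAB):
    """Split `text` into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no complete split exists.
    """
    # best[i] holds the best split found so far for text[:i].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


print(segment("weactuallyspeakinacontinuousflow"))
```

Even with nine words in the vocabulary there are already two valid readings (‘actually’ versus ‘act ally’), and the program has to pick one by a rule of thumb. With a vocabulary of tens of thousands of words and noisy audio instead of clean letters, the number of competing readings grows very quickly.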
There are various clever tricks that programmers have used to make voice recognition actually possible, such as getting users to train dictation software to their accent and using context to decide which words might have been said. But it’s still tricky, and it’s not quite there yet.
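The ‘using context’ trick can be sketched very roughly in Python: given the word that came before, pick whichever of way, whey and weigh most often follows it. The counts below are invented for illustration, and looking only one word back is a simplifying assumption; real dictation software uses far richer statistical models.

```python
# Hypothetical counts of word pairs, as if tallied from a large body of text.
PAIR_COUNTS = {
    ("the", "way"): 120,
    ("the", "whey"): 3,
    ("and", "whey"): 9,
    ("and", "way"): 1,
    ("to", "weigh"): 40,
    ("to", "way"): 1,
}

# Words that sound identical, so the audio alone cannot tell them apart.
SOUND_ALIKES = ("way", "whey", "weigh")


def pick_word(previous_word, candidates=SOUND_ALIKES, counts=PAIR_COUNTS):
    """Return the candidate seen most often after `previous_word`.

    Pairs missing from the table count as zero; on a tie the earliest
    candidate in `candidates` is returned.
    """
    return max(candidates, key=lambda word: counts.get((previous_word, word), 0))


for previous in ("the", "and", "to"):
    print(previous, pick_word(previous))
```

With these made-up counts, ‘the’ is followed by ‘way’, ‘and’ by ‘whey’ (as in ‘curds and whey’) and ‘to’ by ‘weigh’. When the previous word has never been seen, the function simply falls back to the first candidate, which is exactly the sort of guess that makes real transcripts go wrong.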
Of course, I’m an unabashed optimist about technology, and I’m as interested in its societal effects as in the way it actually works. So, let’s just imagine it’s 2017 and voice recognition is not only extremely good, but extremely widespread. Your mobile phone can transcribe all your phone calls, and some techies even have jewellery that will transcribe everything within earshot.
What does this do? It dramatically reduces the portion of our life that is not digitised and searchable. Already, we can refer to emails, text messages, photos (pretty much all of which are now digital) and instant messages whenever we want to check what someone said or did. As with all technology, this has its upsides and downsides. I frequently search through old correspondence to find out someone’s favourite music or where an old friend works now. But it does mean that even private conversations online could eventually become public, and so I have to watch what I say. This is most apparent when it comes to legal proceedings – it’s perfectly possible for someone to demand to see your emails or IMs if you’re involved in a suit, and deleting them all can look very suspicious.
With some effort, though, you can remain fairly discreet online. I’m not convinced you can do the same when it comes to talking out loud. If people begin transcribing all their conversations, all the time, it’ll be impossible to not slip up. I suppose you could go ‘off-the-record’ and turn it off, but who would know if you really did? After all, why not transcribe everything? Imagine how useful it would be during meetings – notetaking would be vastly simplified (although not eliminated). Imagine how tempting it would be to try and look at the conversations your friends had about you. Imagine how people might post conversations directly to Facebook. Live.
And for more mundane purposes, imagine having every spoken word on radio and TV transcribed. While I enjoy listening to some podcasts, there are only a few occasions (gym, coach, planes) where I can do that; I’d rather just read transcripts most of the time. This would suddenly free up vast amounts of high quality material, and Radio 4 would suddenly become one of the web’s most popular destinations.
This is not science fiction. It could be done quite easily now – I could wear a small lapel microphone, connect it to my iPod and set it recording all day. When I get home, I could upload the recording to my computer and run it through a voice recognition program on a PC. It’d pick up what I said pretty reliably, and it’d probably get a reasonable percentage of what other people said. There are probably some people who already do this.
In a couple of years, I can imagine this process happening even more smoothly, where the recording is automatically synchronised with my computer and uploaded to Google’s servers, which crunch through it with the power of a million PCs and return it to me, a few minutes later, with 99.9% reliability, with each speaker identified and each conversation handily logged and cross-referenced in Google Mail.
It’s coming. It’s not that difficult. The question is, how will we deal with it?