Technology for speech recognition had its origins at Bell Labs in the early 1950s with a system for single-speaker digit recognition. However, commercially viable recognition systems did not appear until some four decades later, by which time typical system vocabularies exceeded that of an average human.1 Today’s speech recognition systems offer significant advantages to those who do a lot of writing on a computer, such as office workers, literary authors, journalists, thesis writers and translators.
The highest-quality, best-known commercial speech recognition software today is from Nuance, which has served as the basis for voicemail recognition applications for Cisco, Apple’s Siri and other popular platforms. The commercial speech recognition solution most frequently encountered by users of desktop computers for years has been Dragon NaturallySpeaking (DNS) from Nuance, which is currently available for seven languages in DNS version 13 (eight languages are often claimed, but this counts US and UK English as separate languages). However, most of the current investment and development in speech recognition is in the mobile device market, where Nuance now offers high-quality recognition with nearly 40 language settings. The APIs for DNS and mobile or web solutions are available to developers, in some cases at no cost.2
Writers and translators have a long history of working with Dragon NaturallySpeaking in the languages for which it is available — English (US and UK), Dutch, French, Italian, German, Spanish and Japanese. But recent developments with mobile apps have seen an expansion of automated speech recognition to many other languages and interest groups, including students3 (for transcribing notes) and hearing-disabled persons.4 One of the most popular tools is the free Dragon Dictation app for iOS.
I started using DNS for speech recognition a decade ago, but its incompatibility with the Windows operating system at the time Windows XP Service Pack 2 was released soon derailed that effort. It was not until four years ago that I began using it again, when I observed a colleague working in “mixed mode”, combining speech recognition with some work with a keyboard or mouse, because this is often a more efficient way of invoking commands or other shortcuts. I was shocked to discover that she could translate and edit about 10,000 words of high-quality English legal text before I finished banging away on the keyboard to get my first draft of a 3,000-word job.
Later, I became a “true believer” in speech recognition when I realized how much more relaxed I am speaking my texts and how having my hands free allows me to touch the screen and use my fingers to mark points of reference for untangling particularly long, nasty German patent claim sentences. The quality of the draft also tends to be better, though identifying “dictos” (transcription errors by the automated speech recognition) can be tricky – the errors cannot be found with a spell checker – so different methods are needed for effective post-editing.
It was not until I moved to Portugal in 2013 that I fully appreciated the disadvantages of the unavailability of commercially useful speech recognition in languages such as Portuguese. I spent two years looking for options to share with colleagues in my new country, but it wasn’t until Professor David Hardisty at Universidade Nova in Lisbon tested speech recognition with some students using the Macintosh Yosemite operating system that I became aware of a possible usable solution.
The first public demonstration by the Lisbon university class was at SDL Day, held at the university in January of this year. Demonstrations were given by Professor Hardisty and two students in the Masters program – Isabel Rocha and Joana Bernardo. Joana has some disabilities from cerebral palsy, and she showed how her typing difficulties could be largely overcome using the integrated speech recognition on the Macintosh. Unfortunately, no such possibility was known at the time for PCs running Microsoft Windows or Linux.
Research into possible solutions for speech recognition in Portuguese and other languages not served by Dragon NaturallySpeaking on Microsoft Windows was frustrating. Time and again I read that research efforts and investments in speech recognition were almost entirely devoted to mobile platforms. It was claimed that the recognition quality on mobile platforms was superior to that on desktop computers! This infuriated me until one day I decided to see what was available and how much “better” it was. I had recently bought an iPhone 4S after my third Android smartphone gave up the ghost, so I had a look in the App Store and found Nuance’s free “Dragon Dictation” as well as a Dragon remote microphone app. The latter is also available for Android devices, but is useful only when working with DNS and is thus limited to the languages available for Nuance’s Macintosh and Windows Dragon applications.
I began to experiment enthusiastically with the Dragon Dictation app on my iPhone, dictating blog posts in the bathtub or translations (from printouts of the source texts while hanging out in the fields with my ducks and goats, or enjoying a jug of sangria in a noisy cantinho).5 I went on the road, showing the possibilities at conferences in Porto, Seville and Zagreb and developing a 3-stage integrated translation workflow consisting of:
Although this had the disadvantage of not having ready access to my reference material, the flexibility and ergonomics of the process were appealing. And if I really needed to use reference material, see TM matches, etc., I could do so sitting at my desk with the source text loaded in a memoQ project, stepping through the segments with arrow keys and using the pop-up terminology hits in the source text or the results displayed in the Translation Pane.
Speech recognition dominated much of the agenda at the memoQ Day at Universidade Nova in April. Kilgray’s product manager, Gábor Ugray, attended and was intrigued by the new possibilities, contacting Nuance afterward and inviting their representatives to attend memoQfest in May 2015. By then, further apps had been developed that made use of the remote Nuance servers: myEcho6 for iOS and the Swype7 virtual keyboard (another Nuance app) for iOS and Android. myEcho allows dictation of text from an iPhone or iPad at the cursor location of any connected PC running Windows, while Swype offers convenient switching of keyboards and dictation languages. Curiously, most Swype users seem to be unaware of its dictation features.
Tiago Neto, a translator and veterinary researcher in the north of Portugal, began a months-long study of optimal virtualization solutions8 to allow Macintosh Yosemite speech recognition to be used with Windows applications, as well as automated training techniques and iPad workflows.
Monolingual authoring and editing processes with memoQ and other CAT tools were also subject to intense study with the new speech recognition possibilities, leading to insights such as the possibility of dictating text changes in a separate file and updating TMs in a memoQ project by importing only those changes using the monolingual review9 feature of memoQ, thus avoiding possible chaos from segmentation problems in re-imported unchanged text passages.
At memoQfest, speech recognition veteran Jim Wardell10 noted that the additional languages available on the Nuance servers make speech recognition accessible to about another two billion people. He spoke about the extreme accuracy today of speech recognition for practiced speakers and showed the use of the Swype virtual keyboard in a browser on an Android tablet to translate in the memoQ WebTrans server solution. But at the time, no option other than myEcho was known for direct dictation of languages like Portuguese, Arabic, Chinese or Russian from mobile devices into an application running on Microsoft Windows, until Professor Hardisty began to test remote editing workflows and discovered that the Swype keyboard could be used in tools like TeamViewer.
This opened up the floodgates of discovery and experimentation, as it was found that most remote keyboard apps for iOS and Android have some degree of compatibility with the Swype keyboard and Nuance speech recognition. Some have buffering issues, others are not stable for dictating longer chunks of text, but the principle works, and with some attention paid by developers, the difficulties can be resolved quickly.
We are close to the point where those who need to write text on a computer will be able to work comfortably in most software applications using integrations with mobile applications, web browsers and other means. The cost of these solutions is lower than the old PC software solution Dragon NaturallySpeaking, and the recognition accuracy of the remote servers is considered better. At memoQfest, the Nuance representatives also revealed that their online servers have met the highest security standards and are trusted by the US government and IBM, among others.
Trainability was thought to be an issue at first, but research in recent months has developed many possibilities to bulk-train custom vocabularies for Macintosh OS and mobile device recognition. One simple way of doing this for Swype, for example, is to select a group of words in a text file and use the Swype key for them to be learned. Automation features on a Macintosh can be harnessed to quickly bulk load thousands of custom words.
The health advantages of getting one’s hands off the keyboard and mouse for most of the work involved in writing are clear. The spread of speech recognition to schools, companies and home offices can reduce the appalling rates of stress injuries which have resulted from the ubiquitous use of computers, while at the same time allowing output volumes to be maintained or possibly increased and better texts obtained through more focused, relaxed work. And as one colleague said, “It’s harder to speak a stupid-sounding sentence than to type one.”
Kevin Lossner is a consultant, an instructor in language service technologies and processes, and a German-to-English translator, mostly of legal and scientific texts. His blog, Translation Tribulations, is a popular source of information on translation technology and unconventional practical methods, interoperability, coffee and cookies, ethics, and interesting scandals.
Web links last accessed on 14 June 2015.
1 Huang, Xuedong; Baker, James; Reddy, Raj. "A Historical Perspective of Speech Recognition". Communications of the ACM.
2 See http://www.nuance.com/for-developers/index.htm
5 See http://www.translationtribulations.com/2015/04/free-good-quality-speech-recognition.html for an example of a blog post dictated in a noisy restaurant.
8 See http://www.translationtribulations.com/2015_04_01_archive.html (in Portuguese)
9 See http://www.translationtribulations.com/2013/10/the-next-big-cat-feature-to-copy.html.
10 A recording of Jim’s talk is available at https://youtu.be/icKcrs4CAls