Voice computing

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Jim Schwoebel (talk | contribs) at 22:42, 10 November 2018 (Created page with 'thumb|The Amazon echo, an example of a voice computer. '''Voice computing''' is the discipline that develops hardware or softwa...'). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
The Amazon Echo, an example of a voice computer.

Voice computing is the discipline that develops hardware or software to process voice inputs. [1]

It spans many other fields including human-computer interaction, conversational computing, linguistics, natural language processing, automatic speech recognition, audio engineering, digital signal processing, cloud computing, data science, ethics, law, and information security.

Voice computing has grown increasingly significant with the advent of smart speakers such as the Amazon Echo and Google Home, the shift toward serverless computing, and the improved accuracy of speech recognition and text-to-speech models.

History

Voice computing has a rich history. [2] Inspired by the human vocal tract, Wolfgang von Kempelen built an acoustic-mechanical speech machine that produced the earliest synthetic speech sounds. Thomas Edison later invented dictation machines to record audio and play it back in corporate settings. In the 1950s and 1960s, Bell Labs, IBM, and others made early attempts to build automatic speech recognition systems. However, it was not until the 1980s, when Hidden Markov Models were used to recognize vocabularies of up to 1,000 words, that speech recognition systems became practical.

Date Event
1784 Wolfgang von Kempelen creates the acoustic-mechanical speech machine.
1879 Thomas Edison invents the first dictation machine.
1952 Bell Labs releases Audrey, capable of recognizing spoken digits with 90% accuracy.
1962 IBM Shoebox can recognize up to 16 words.
1971 Harpy is created, which can understand over 1,000 words.
1986 IBM Tangora uses Hidden Markov Models to predict phonemes in speech.
2006 The National Security Agency begins research into hotword detection during normal conversations.
2008 Google launches a voice application, bringing speech recognition to mobile devices.
2011 Apple releases Siri on the iPhone.
2014 Amazon releases the Amazon Echo, bringing voice computing to the public at large.

Around 2011, Siri emerged on the Apple iPhone as the first voice assistant widely accessible to consumers. This innovation led to a dramatic shift toward voice-first computing architectures. Sony released the PS4 in North America in 2013 (70+ million devices), Amazon released the Amazon Echo in 2014 (30+ million devices), Microsoft released Cortana in 2015 (400 million Windows 10 users), Google released Google Assistant in 2016 (2 billion monthly active users on Android phones), and Apple released the HomePod in 2018 (500,000 devices sold, with 1 billion devices active with iOS/Siri). These releases, along with advances in cloud infrastructure (e.g. Amazon Web Services) and audio codecs, solidified the field of voice computing and made it widely relevant to the public at large.

Hardware

A voice computer is an assembly of hardware and software used to process voice inputs.

Note that voice computers do not necessarily need a screen; the traditional Amazon Echo, for example, has none. In other embodiments, traditional laptop computers or mobile phones can serve as voice computers. Moreover, with the advent of IoT-enabled devices such as cars and televisions, interfaces for voice computers have multiplied.

As of September 2018, more than 20,000 types of devices are compatible with Amazon Alexa. [3]

Software

Voice computing software can read/write, record, clean, encrypt/decrypt, playback, transcode, transcribe, compress, publish, featurize, model, and visualize voice files.

Here are some popular software packages related to voice computing:

  • FFmpeg - for transcoding audio files from one format to another (e.g. .WAV to .MP3). [4]
  • Audacity - for recording and filtering audio. [5]
  • SoX - for manipulating audio files and removing environmental noise. [6]
  • Natural Language Toolkit - for featurizing transcripts with things like parts of speech. [7]
  • LibROSA - for visualizing audio file spectrograms and featurizing audio files. [8]
  • openSMILE - for featurizing audio files with things like mel-frequency cepstral coefficients. [9]
  • PocketSphinx - for transcribing speech files into text. [10]
  • pyttsx3 - for speaking text aloud (text-to-speech). [11]
  • PyCryptodome - for encrypting and decrypting audio files. [12]
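Some of the basic operations listed above, such as reading and writing voice files, can be performed with Python's standard library alone. The following is a minimal sketch, assuming nothing beyond the built-in wave module; the file name tone.wav and the 16 kHz sample rate are illustrative choices, not requirements:

```python
import math
import struct
import wave

# Write one second of a 440 Hz sine tone to a mono 16-bit WAV file,
# then read it back and compute its duration.
RATE = 16000  # samples per second; a common rate for speech audio

with wave.open("tone.wav", "wb") as out:
    out.setnchannels(1)  # mono
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(RATE)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / RATE)))
        for t in range(RATE)
    )
    out.writeframes(frames)

with wave.open("tone.wav", "rb") as inp:
    duration = inp.getnframes() / inp.getframerate()
    print(duration)  # 1.0
```

Tools such as FFmpeg or SoX would typically be used for the heavier operations (transcoding, noise removal), while this kind of low-level access is useful for featurization pipelines.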

Applications

Voice computing applications span many industries including voice assistants, healthcare, e-Commerce, finance, supply chain, agriculture, text-to-speech, security, marketing, customer support, recruiting, cloud computing, microphone design, and podcasting. Voice technology is projected to grow at a CAGR of 19-25% by 2025, making it an attractive industry for startups and investors alike. [13]

Use case Example Product or Startup
Voice assistants Cortana [14], Amazon Alexa [15], Siri [16], Google Assistant [17], Apple HomePod [18], Jasper [19], and Nala [20]
Healthcare Cardiocube [21], Toneboard [22], Suki [23], Praktice.ai [24], Corti [25], and Syllable [26]
e-Commerce Cerebel [27], Voysis [28], Mindori [29], Twiggle [30], and AddStructure [31]
Finance Kasisto [32], Personetics [33], Voxo [34], and Active Intelligence [35]
Supply chain and manufacturing Augury [36], Kextil [37], 3DSignals [38], Voxware [39], and Otosense [40]
Agriculture Agvoice [41]
Text-to-speech Lyrebird [42] and VocalID [43]
Security Pindrop Security [44] and Aimbrain [45]
Marketing Convirza [46], Dialogtech [47], Invoca [48], and Veritonic [49]
Customer support Cogito [50], Afiniti [51], Aaron.ai [52], Blueworx [53], Servo.ai [54], SmartAction, and Chatdesk [55]
Recruiting SurveyLex [56] and Voice glance [57]
Speech-to-text Voicebase [58], Speechmatics [59], Capio [60], Nuance, and Spitch [61]
Cloud computing AWS [62], GCP [63], IBM Watson [64], and Microsoft Azure [65]
Microphone/speaker design Bose [66] and Audio Technica [67]
Podcasting Anchor [68] and iTunes [69]

In the United States, telephone recording laws vary by state: in some states it is legal to record a conversation with the consent of only one party, while in others the consent of all parties is required.

Moreover, COPPA (the Children's Online Privacy Protection Act) is a significant law protecting minors on the internet. With an increasing number of minors interacting with voice computing devices (e.g. the Amazon Echo), the Federal Trade Commission recently relaxed the COPPA rule so that kids can issue voice searches and commands. [70]

Lastly, the GDPR is a European law that governs the right to be forgotten, among many other provisions, for EU citizens. The GDPR also makes clear that companies must take explicit measures to obtain consent if audio recordings are made, and must define the purpose and scope of how these recordings will be used (e.g. for training purposes). The bar for valid consent is much higher under the GDPR: consent must be freely given, specific, informed, and unambiguous; tacit consent is no longer enough. [71]

Together, these laws leave it unclear how voice computing technology will be regulated in the future.

Research Conferences

There are many research conferences that relate to voice computing. Some of these include:

  • Interspeech 2018 [72]
  • AVEC 2018 [73]
  • FG 2018 [74]
  • ACII 2019 [75]

Developer community

Google Assistant has roughly 2,000 actions as of January 2018. [76]

There are over 50,000 Alexa skills worldwide as of September 2018. [77]

In June 2017, Google released AudioSet,[78] a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. It contains 1,010,480 clips labeled as human speech, or 2,793.5 hours of audio in total. [79] It was released in conjunction with the IEEE ICASSP 2017 conference. [80]

In November 2017, the Mozilla Foundation released the Common Voice Project, a collection of speech files intended to contribute to the larger open-source machine learning community.[81][82] The voicebank is currently 12 GB in size, comprising more than 500 hours of English-language voice data collected from 112 countries since the project's inception in June 2017. [83] This dataset has already enabled projects such as DeepSpeech, an open-source transcription model. [84]
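Speech datasets of this kind typically distribute their transcripts as delimited metadata files alongside the audio clips. The sketch below shows how such metadata can be parsed with Python's csv module; the column names (path, sentence) and clip filenames are illustrative stand-ins, not a claim about any project's exact schema:

```python
import csv
import io

# Hypothetical metadata in the general style of open speech datasets:
# one row per clip, tab-separated, with a header line.
SAMPLE_TSV = (
    "path\tsentence\n"
    "clip_0001.mp3\thello world\n"
    "clip_0002.mp3\tvoice computing\n"
)

def load_transcripts(tsv_text):
    """Map each clip filename to its transcript."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["path"]: row["sentence"] for row in reader}

transcripts = load_transcripts(SAMPLE_TSV)
print(transcripts["clip_0001.mp3"])  # hello world
```

A mapping like this is the usual first step before pairing audio features with labels for training a transcription model.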

In August 2018, Jim Schwoebel [85] released a book, Introduction to Voice Computing in Python, with an accompanying GitHub repository.[86] The book contains 10 chapters, and the repository contains more than 200 starter scripts to help developers begin programming voice computing applications in Python. [87]

References

  1. ^ Schwoebel, J. (2018). An Introduction to Voice Computing in Python. Boston; Seattle, Atlanta: NeuroLex Laboratories. https://neurolex.ai/voicebook
  2. ^ Timeline for Speech Recognition. https://medium.com/swlh/the-past-present-and-future-of-speech-recognition-technology-cf13c179aaf
  3. ^ Voicebot.AI. https://voicebot.ai/2018/09/02/amazon-alexa-now-has-50000-skills-worldwide-is-on-20000-devices-used-by-3500-brands/
  4. ^ FFmpeg. https://www.ffmpeg.org/
  5. ^ Audacity. https://www.audacityteam.org/
  6. ^ SoX. http://sox.sourceforge.net/
  7. ^ NLTK. https://www.nltk.org/
  8. ^ LibROSA. https://librosa.github.io/librosa/
  9. ^ OpenSMILE. https://www.audeering.com/technology/opensmile/
  10. ^ PocketSphinx. https://github.com/cmusphinx/pocketsphinx
  11. ^ Pyttsx3. https://github.com/nateshmbhat/pyttsx3
  12. ^ Pycryptodome. https://pycryptodome.readthedocs.io/en/latest/
  13. ^ Businesswire. https://www.businesswire.com/news/home/20180417006122/en/Global-Speech-Voice-Recognition-Market-2018-Forecast
  14. ^ Cortana. https://www.microsoft.com/en-us/cortana
  15. ^ Amazon Alexa. https://developer.amazon.com/alexa
  16. ^ Siri. https://www.apple.com/siri/
  17. ^ Google Assistant. https://assistant.google.com/#?modal_active=none
  18. ^ HomePod. https://www.apple.com/homepod/
  19. ^ Jasper https://jasperproject.github.io/
  20. ^ Nala. https://github.com/jim-schwoebel/nala
  21. ^ Cardiocube. https://www.cardiocube.com/
  22. ^ Toneboard. https://toneboard.com/
  23. ^ Suki. https://www.suki.ai/
  24. ^ Praktice.ai. https://praktice.ai/
  25. ^ Corti. https://corti.ai/
  26. ^ Syllable. https://www.syllable.ai/
  27. ^ Cerebel. https://map.startuplithuania.lt/companies/cerebel
  28. ^ Voysis. https://voysis.com/
  29. ^ Mindori. http://mindori.com/
  30. ^ Twiggle. https://www.twiggle.com/
  31. ^ AddStructure. https://www.crunchbase.com/organization/addstructure
  32. ^ Kasisto. https://kasisto.com/
  33. ^ Personetics. https://personetics.com/
  34. ^ Voxo. https://www.voxo.ai/
  35. ^ Active Intelligence. https://active.ai/
  36. ^ Augury. https://www.augury.com/
  37. ^ Kextil. http://www.kextil.com/
  38. ^ 3DSignals. https://www.3dsig.com/
  39. ^ Voxware. https://www.voxware.com/
  40. ^ Otosense. https://www.otosense.com/
  41. ^ Agvoice. https://agvoiceglobal.com/
  42. ^ Lyrebird. https://lyrebird.ai/
  43. ^ VocalID. https://vocalid.ai/
  44. ^ Pindrop. https://www.pindrop.com/
  45. ^ Aimbrain. https://aimbrain.com/
  46. ^ Convirza. https://www.convirza.com/
  47. ^ Dialogtech. https://www.dialogtech.com/
  48. ^ Invoca. https://www.invoca.com/
  49. ^ Veritonic. https://veritonic.com/
  50. ^ Cogito. https://www.cogitocorp.com/
  51. ^ Afiniti. https://www.afiniti.com/
  52. ^ Aaron.ai. https://aaron.ai/
  53. ^ Blueworx. https://www.blueworx.com/
  54. ^ Servo.ai. https://www.servo.ai/
  55. ^ Chatdesk. https://chatdesk.com/
  56. ^ SurveyLex. https://www.surveylex.com/
  57. ^ Voice glance. https://voiceglance.com/
  58. ^ Voicebase. https://www.voicebase.com/
  59. ^ Speechmatics. https://www.speechmatics.com/
  60. ^ Capio. https://www.capio.ai/
  61. ^ Spitch. https://www.spitch.ch/
  62. ^ AWS. https://aws.amazon.com/
  63. ^ GCP. https://cloud.google.com/
  64. ^ IBM Watson. https://www.ibm.com/watson/
  65. ^ Microsoft Azure. https://azure.microsoft.com/en-us/
  66. ^ Bose speakers. https://www.bose.com/en_us/shop_all/speakers/speakers.html
  67. ^ Audio Technica. https://www.audio-technica.com/cms/site/c35da94027e94819/index.html
  68. ^ Anchor. https://anchor.fm/
  69. ^ iTunes. https://www.apple.com/itunes/
  70. ^ Techcrunch. https://techcrunch.com/2017/10/24/ftc-relaxes-coppa-rule-so-kids-can-issue-voice-searches-and-commands/
  71. ^ IAPP. https://iapp.org/news/a/how-do-the-rules-on-audio-recording-change-under-the-gdpr/
  72. ^ Interspeech 2018. http://interspeech2018.org/
  73. ^ AVEC 2018. http://avec2018.org/
  74. ^ 2018 FG. https://fg2018.cse.sc.edu/
  75. ^ ACII 2019. http://acii-conf.org/2019/
  76. ^ Voicebot.ai. https://voicebot.ai/2018/01/24/google-assistant-app-total-reaches-nearly-2400-thats-not-real-number-really-1719/
  77. ^ Voicebot.ai. https://voicebot.ai/2018/09/02/amazon-alexa-now-has-50000-skills-worldwide-is-on-20000-devices-used-by-3500-brands/
  78. ^ Google AudioSet. https://research.google.com/audioset/
  79. ^ Audioset data. https://research.google.com/audioset/dataset/speech.html
  80. ^ Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017, March). Audio Set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 776-780). IEEE.
  81. ^ Common Voice Project. https://voice.mozilla.org/
  82. ^ Common Voice Project. https://blog.mozilla.org/blog/2017/11/29/announcing-the-initial-release-of-mozillas-open-source-speech-recognition-model-and-voice-dataset/
  83. ^ Mozilla's large repository of voice data will shape the future of machine learning. https://opensource.com/article/18/4/common-voice
  84. ^ DeepSpeech. https://github.com/mozilla/DeepSpeech
  85. ^ Jim Schwoebel. https://www.linkedin.com/in/jimschwoebel
  86. ^ Introduction to Voice Computing in Python. https://www.amazon.com/Introduction-Voice-Computing-Python-Schwoebel-ebook/dp/B07HLGBRWZ
  87. ^ Voicebook. https://github.com/jim-schwoebel/voicebook