

SpeechToText: An open-source software for automatic
detection and transcription of voice recordings in digital
forensics
Miguel Negrão (a,b), Patricio Domingues (a,b,c)

(a) School of Technology and Management - Polytechnic Institute of Leiria, Leiria, Portugal
(b) Computer Science and Communication Research Centre, Portugal
(c) Instituto de Telecomunicações, Portugal

Abstract
Voice is the most natural way for humans to communicate with each
other, and more recently, to interact with voice-controlled digital machines.
Although text is predominant on digital platforms, voice and video are be-
coming increasingly important, with communication applications supporting
voice messages and videos. This is relevant for digital forensic examinations,
as content held in voice format can contain important evidence for the investi-
gation. In this paper, we present the open-source SpeechToText software,
which resorts to state-of-the-art Voice Activity Detection (VAD) and Auto-
matic Speech Recognition (ASR) modules to detect voice content, and then
to transcribe it to text. This allows integrating voice content into the reg-
ular flow of a digital forensic investigation, with transcribed audio indexed
by text search engines. Although SpeechToText can be run independently,
it also provides a Jython-based software module for the well-known Autopsy
software. The paper also analyzes the availability, storage location and audio
format of voice-recorded content in 14 popular Android applications featur-
ing voice recordings. SpeechToText achieves 100% accuracy for detecting
voice in unencrypted audio/video files, a word error rate (WER) of 27.2%
when transcribing English voice messages by non-native speakers and a WER
of 7.80% for the test-clean set of LibriSpeech. It achieves a real time factor
of 0.15 for the detection and transcription process on a medium-range lap-
top, meaning that one minute of speech is processed in roughly nine seconds.
https://doi.org/10.1016/j.fsidi.2021.301223
Keywords: voice recordings, automatic speech recognition, automatic
speech transcription, digital forensics, Android applications

Preprint submitted to Forensic Science International: Digital Investigation, July 15, 2021

1. Introduction
In the last two decades, the world has gone digital. For instance, few of us
dare to leave home without our smartphone, arguably the most representative
tool of the digitized world. The omnipresence of digital devices in our lives
means that a criminal investigation or civil litigation will often need to an-
alyze digital devices linked to the case under consideration. This is one of
the three factors that have contributed to the swelling backlogs of digital forensic
laboratories, together with the increasing number of devices per case and the
volume of data per device (Quick & Choo, 2014).
Examinations of electronic devices are performed by digital forensic ex-
perts who apply procedures, tools, and techniques to extract and interpret
data found on the devices (Casey, 2011). Communications that have oc-
curred between individuals and whose traces remain on the devices can pro-
vide important clues for the investigation and thus need to be analyzed and
interpreted. Although many communications through digital channels are
in the form of written text, voice still plays an important role in the digital
world, as it is the most natural way for humans to communicate. For in-
stance, many platforms allow voice communication, either through calls or
through recorded voice messages (Azfar et al., 2016; Nouwens et al., 2017).
Examples include Facebook, WhatsApp, Signal, and Telegram, to name just
a few. Voice is also used for interacting with machines, as it demands much
less attention than writing, it is hands-free, and thus can be performed si-
multaneously with other tasks, such as driving, walking, etc. Some examples
include voice interaction with digital personal assistants such as Amazon
Echo and Apple Siri, vehicles, and many other devices.
While text can be parsed, indexed, and searched in a relatively easy
manner, voice data are more difficult to integrate into a digital forensic ex-
amination. To be useful, voice data need to be transcribed to text in a
digital format, a process known as speech-to-text. The no-tech approach
to speech-to-text is simply to have someone listen to and transcribe the audio.
This is a human-intensive task, and thus expensive and slow, especially if
the quality of transcription is relevant. As digital forensic laboratories are
often money, human resources, and time constrained (Casey & Souvignet,
2020), a faster and more economical approach is needed. This is the case
for automated speech-to-text, where software transcribes audio files of voice
communication, a process called Automatic Speech Recognition (ASR).
Although speech-to-text is a faculty seen as basic for humans, it is, in
reality, a quite complex process involving a significant part of the human
brain (Carlini & Wagner, 2018). The same occurs in the digital world, where
for many years, speech-to-text software was quite limited and error-prone (S
& E, 2016). Since the (re)emergence of artificial intelligence (AI), and in
particular due to the advance of deep learning-based techniques (LeCun et al.,
2015), automatic speech recognition software has evolved tremendously (Yu
& Deng, 2015; Watanabe et al., 2017). ASR has moved from systems that
required good audio and articulated speech to deliver mediocre results to systems
achieving low error rates in transcriptions, surpassing humans in speed and
cost. Nonetheless, automatic speech recognition is still not perfect (Errattahi
et al., 2018). Examples of usage include using voice to interact with a smartphone
to write an email or take a note, thus bypassing the need to use the
(virtual) keyboard. Another example is the automatic captioning of videos
on YouTube and Bing.
In this paper, we apply ASR to digital forensics. Specifically, we aim
to automatically transcribe meaningful audio files found in digital devices
to text so that the content can be indexed for searches and included in
digital forensic reports. In a digital forensic search it is common for audio
files to be considered of interest if they contain a high percentage of human
speech, and have duration above a minimum threshold. The main motivation
is to automate, in digital forensic cases, 1) the detection of speech, and
2) speech transcription, all with the goal of speeding up
the handling of audio recordings.
For this purpose, we developed a software suite that relies on several
open-source libraries, namely INA’s inaSpeechSegmenter (Doukhan et al.,
2018a,b) for speech detection, and Mozilla’s DeepSpeech (Amodei et al.,
2016) for speech transcription. As a proof-of-concept, we built a software
module called SpeechToText for the well-known Autopsy forensic software
and assessed its behavior with a set of audio files from popular Android
applications. The main contributions of this paper are as follows: 1) De-
velopment of an open-source software suite able to detect audio recordings
with speech and to transcribe the content to text through ASR, without the
need to export potentially sensitive files to the cloud; 2) Development of the
SpeechToText software module for Autopsy; 3) Analysis of the location of
the audio files and the audio coding formats for 14 Android applications fea-
turing voice recording, including 12 well-known communication applications;
4) Assessment of the transcription accuracy and speed performance of the
developed software with audio files from the studied applications, and with
the test-clean set from the LibriSpeech corpus (Panayotov et al., 2015).
The remainder of this paper is structured as follows. Section 2 reviews
related work. Next, Section 3 describes the software stack that composes
SpeechToText, while Section 4 presents the experimental setup and the main
accuracy and performance results. Section 5 discusses the results. Finally,
Section 6 concludes the paper and presents avenues for future work.

2. Related Work
We now review related work. We first focus on speech detection, then on
speech transcription, and conclude with scientific works that target Android
applications that allow recording and exchanging audio messages. In this
work, we only deal with healthy recording files, that is, files that are not
corrupted. If the need arises, recovery techniques to repair damaged audio
files can be applied (Heo et al., 2019).

2.1. Speech Detection


Speech detection is also known as Voice Activity Detection (VAD) and,
simply put, consists of detecting the presence of human speech in an audio
signal. Speech detection is a step that precedes speech transcription, as only
audio recordings with speech need to be processed. As automatic speech
transcription is computationally expensive, VAD plays an important role in
containing computational costs, and hence execution times. VAD accuracy is
important, as an erroneous voice detection can classify background noise as
voice and thus induce wasteful speech recognition, or worse, it can mistakenly
omit voice segments, yielding an incomplete transcription.
Voice detection used to be performed through signal processing (Chang
et al., 2006), and then evolved to machine learning approaches (Ying et al.,
2011), as they yielded superior results. More recently, VAD has transi-
tioned to deep learning (Zhang & Wu, 2013; Wang & Chen, 2018). As
explained later, in this work, we resort to inaSpeechSegmenter (Doukhan
et al., 2018a,b) for VAD. The software has a convolutional neural network
(CNN) based architecture and classifies segments of the audio file with activ-
ity as containing speech, music or noise, and for those that contain speech,
the gender of the speaker. Regarding gender detection accuracy, the software
achieves “a frame-level gender detection F-measure of 96.52 and a hourly
women speaking time percentage error below 0.6%” (Doukhan et al., 2018a).
The library fits our main requirements: 1) it is available under an open-source
license and 2) it is able to work locally, without the need to access the Internet.
This last requirement is particularly important in digital forensics, where the
use of commercial cloud-based services, which require transferring case
evidence to a third party, is often forbidden or at least strongly discouraged.
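For illustration, a minimal Python sketch of how inaSpeechSegmenter can be invoked is shown below. The file name is hypothetical; the Segmenter API and the label set ('male', 'female', 'music', 'noise', 'noEnergy') follow the library's documentation, so details may differ between versions.

# Minimal sketch: detect speech segments with inaSpeechSegmenter.
# 'evidence.wav' is a hypothetical input file.
from inaSpeechSegmenter import Segmenter

seg = Segmenter()                 # loads the CNN-based segmentation models
segments = seg('evidence.wav')    # list of (label, start_s, end_s) tuples

# Keep only the segments classified as speech and sum their duration.
speech = [(start, end) for label, start, end in segments
          if label in ('male', 'female')]
speech_duration = sum(end - start for start, end in speech)
print('%d speech segments, %.1f s of speech' % (len(speech), speech_duration))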

2.2. Speech Transcription


In a similar way to speech detection, usage of deep learning has also
profoundly improved the quality of automatic speech recognition, yielding
a huge boost in accuracy (Yu & Deng, 2015; Watanabe et al., 2017). This
is the case for DeepSpeech, an ASR architecture by Baidu (Amodei et al.,
2015, 2016). In SpeechToText, we resort to Mozilla’s DeepSpeech implemen-
tation (Mozilla DeepSpeech), which is one of the core components of our
software. In this paper, we refer to Baidu’s DeepSpeech architecture sim-
ply as DeepSpeech, while MozDeepSpeech refers to Mozilla’s DeepSpeech
implementation.
As written in MozDeepSpeech’s GitHub repository, MozDeepSpeech is
“an open-source embedded (offline, on-device) speech-to-text engine which can
run in real time on devices ranging from a Raspberry Pi 4 to high power GPU
servers” [1]. Technical details of Baidu’s DeepSpeech architecture are described
in (Hannun et al., 2014) and in (Amodei et al., 2015, 2016). Baidu’s Deep-
Speech team devoted major efforts to training, namely: 1) building and then
using large datasets for training, and 2) resorting to multiple GPUs and opti-
mization techniques to speed up training (Hannun et al., 2014). DeepSpeech’s
architecture has been enhanced with high performance computing training
techniques and larger datasets, yielding more accurate results (Amodei et al.,
2015), and the project is still active. Technically, DeepSpeech relies on recur-
rent neural networks (RNN). Furthermore, to identify wrongly transcribed
words and correct them, DeepSpeech also comprises an N-gram language
model, selecting the highest probability sequence for N words (Guo et al.,
2019).
At the time of this writing, MozDeepSpeech has support for English and
Mandarin through pre-trained models. However, MozDeepSpeech’s authors
alert that the English pre-trained model has some biases towards US English
and male voices, due to its training dataset (Mozilla DeepSpeech). Models
have also been made available by third parties [2] for other languages such as
German, Spanish, French and Italian.

[1] GPU stands for graphics processing unit.
[2] https://gitlab.com/Jaco-Assistant/Scribosermo
There are other open-source ASR projects which, like DeepSpeech, work
offline. Examples include Facebook’s Wav2letter++ (Pratap et al., 2019),
Kaldi (Ravanelli et al., 2019) and CMUSphinx (CMUSphinx). Peinl et al.
provide a review of offline open-source ASR systems on constrained edge
hardware (Peinl et al., 2020). They also compare open-source systems to
commercial cloud-based ASR systems from Amazon (Amazon Transcribe),
Google (Google Speech-to-Text) and Microsoft (Microsoft Speech Services),
reporting that the Word Error Rate (WER) (Morris et al., 2004) for these
systems was, respectively, 12.3%, 8.6% and 9.4%, while MozDeepSpeech fared
worse with a 29.0% word error rate.
Nonetheless, as stated earlier, commercial products are not viable for
some digital forensic environments, not only due to costs but also due to the
need to interact with cloud services, and the consequent loss of privacy.
Filippidou and Moussiades (Filippidou & Moussiades, 2020) provide a
brief review on ASR projects and products, and benchmark three cloud-based
closed-source solutions: Google, Wit from wit.ai (Wit.ai), and IBM Watson.
For benchmarking, they use English recordings from three individuals whose
native language is Greek. They report that Google’s solution yielded the
lowest WER, and note that WER, regardless of the benchmarked solution,
was substantially dependent on the speaker. For instance, WER for Google
ranged from 16.60% to 24.85%, with Wit’s WER varying between 23.28%
and 58.87%. Our experiments with SpeechToText also confirm the lower
accuracy of ASR for non-native English speakers.

2.3. Digital Forensics of Audio Messaging Android Applications


Due to the importance of communication applications for digital forensics,
namely in mobile platforms, numerous scientific studies cover the most impor-
tant ones. Dargahi et al. analyze three popular mobile voice over IP applica-
tions available on the Google Play store: Viber, Skype, and WhatsApp Mes-
senger (Dargahi et al., 2017), noting that audio and video files could be recov-
ered on a rooted Android phone. Azfar et al. analyze three Android commu-
nication applications: Line, WeChat, and Viber (Azfar et al., 2016). Anglano
et al. study the Telegram Messenger on Android smartphones (Anglano et al.,
2017), while Wu et al. analyze forensic artifacts left by the WeChat applica-
tion on Android devices (Wu et al., 2017). Zoom video conference software
is analyzed by Mahr et al. for the Windows 10 and Android platforms, identi-
fying the location of relevant files, namely SQLite databases (Mahr et al., 2021).
In our work, besides all the above-cited Android applications, we also study
Google Duo, Evernote, and Microsoft Teams, although our analysis is solely
focused on audio recording capabilities and consequent forensic artifacts of
the applications.

3. SpeechToText
SpeechToText is a set of applications that provides speech detection and
transcription for the Autopsy software (see Table 1). SpeechToText is avail-
able online under a GPLv3 license [3]. It supports both the Linux and Windows
operating systems and can make use of a GPU supporting the Compute
Unified Device Architecture (CUDA) for increased performance. Autopsy [4]
is a popular open-source digital forensic software (Barr-Smith et al., 2021).
It uses The Sleuth Kit [5], a collection of command line tools, to analyze disk
images and recover files from them. The functionality of Autopsy can be
extended through three types of modules: 1) File ingest; 2) Datasource
ingest; and 3) Report. SpeechToText encompasses two modules for Au-
topsy: a datasource ingest module called vad check ingest.py (henceforth
vad check ingest), and a report module called ast report.py (hence-
forth ast report). These two modules are supported by an application
called deepspeech csv, with deepspeech csv itself requiring another ap-
plication, named ina speech segmenter (see Figure 1). Next, we describe
deepspeech csv, ina speech segmenter, and their execution flow.

[3] https://github.com/labcif/AutopsySpeechToText
[4] https://www.sleuthkit.org/autopsy/
[5] https://www.sleuthkit.org

3.1. deepspeech csv and ina speech segmenter


deepspeech csv is a command line program based on MozDeepSpeech’s
C++ native client. It acts as a frontend to MozDeepSpeech’s engine, adding
one main feature: the ability to perform VAD using comma-separated values
(CSV) files generated by ina speech segmenter.
While MozDeepSpeech’s original native client receives an audio file and
transcribes it using TensorFlow (Abadi et al., 2016), outputting text and
metadata, it does not perform VAD. As speech transcription is a computing-
heavy task, it is important that audio files are first assessed for the exis-
tence of speech, through VAD, so that only audio segments with speech are
considered for speech transcription. VAD is the role of the Python script
ina speech segmenter. For each audio file, ina speech segmenter cre-
ates one CSV file with one line per segment, each line containing the segment
start and end times, plus a classification (“male”, “female”, “music”, “noise”
or “noEnergy”).
The deepspeech csv application receives a list of 16kHz mono 16-bit
pulse code modulation (PCM) encoded WAV audio files and a directory
where corresponding ina speech segmenter-produced CSV files are located.
For each audio file, deepspeech csv looks for a matching CSV file (same
root name but with suffix “.csv”). It then extracts from the audio file the
segments containing speech and feeds them to the MozDeepSpeech engine,
obtaining the corresponding transcription. The segments are processed in
the order they appear in the CSV file and gender information is not used
for the transcription. As a result, for each audio file, a corresponding text
file with the transcribed text is created in the directory containing the CSV
files. By default, in the text file, each segment has its start time prepended,
although this can be disabled from the command line. The deepspeech csv
application also requires command line parameters specifying the names of
two files that harbor the neural network model needed by MozDeepSpeech’s
transcription engine. These two files are specific to the spoken language (e.g.,
English).
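To make this flow concrete, the following Python sketch approximates what deepspeech csv does for a single file (the real program is written in C++ on top of MozDeepSpeech's native client). File names are hypothetical, the model and scorer file names are those of the MozDeepSpeech v0.9.3 English release, and the CSV is assumed to contain one "start, end, label" row per segment, as described above.

# Sketch (not the actual C++ implementation): transcribe the speech
# segments listed in an ina speech segmenter CSV file.
import csv
import wave

import numpy as np
import deepspeech  # MozDeepSpeech Python bindings

SAMPLE_RATE = 16000  # deepspeech csv expects 16 kHz mono 16-bit PCM WAV

model = deepspeech.Model('deepspeech-0.9.3-models.pbmm')
model.enableExternalScorer('deepspeech-0.9.3-models.scorer')

with wave.open('message.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

lines = []
with open('message.csv', newline='') as f:
    for start, end, label in csv.reader(f):
        if label.strip() not in ('male', 'female'):
            continue  # only segments with speech are transcribed
        segment = audio[int(float(start) * SAMPLE_RATE):int(float(end) * SAMPLE_RATE)]
        lines.append('[%ss] %s' % (start, model.stt(segment)))

print('\n'.join(lines))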

3.2. vad check ingest


The vad check ingest ingest module for Autopsy identifies and tags all
files in the Autopsy case’s data sources which have speech, and if configured
to do so, automatically transcribes them, adding the transcribed text as
an artifact associated with the file. It relies on ina speech segmenter and
deepspeech csv, as follows.
After being activated through Autopsy’s ingest module interface, vad check
ingest iterates through all files found in the Autopsy case’s data
sources, and if the file is of type audio or video, it is sent to an external
process (ina speech segmenter) which detects whether the file has speech
content or not, and the gender of the speaker (male/female).

Figure 1: Software components of the SpeechToText module. Autopsy’s data
source ingest module (VAD_CHECK_INGEST) and report module (AST_REPORT) pass
audio files to INA_SPEECH_SEGMENTER, which performs Voice Activity Detection
(VAD); the resulting CSV file is consumed by DEEPSPEECH_CSV, which runs
MozDeepSpeech (ASR) and produces the transcribed text.

The file type
is determined by Autopsy’s File Type Identification Module which detects
the file type based on its internal byte signature. This module must be run
before running vad check ingest. If the file is a video file, then ffmpeg [6]
is used to determine if the file has audio streams. For both audio and video
files, ffmpeg is then used to convert the audio to a 16kHz mono 16-bit PCM-
encoded WAV file, the format expected by ina speech segmenter and
deepspeech csv, and needed by MozDeepSpeech engine. The audio is
upsampled or downsampled to 16kHz depending on the original audio file’s
sample rate. In the case of a stereo or multi-channel audio file, all channels
are summed. The ffmpeg processes are run in parallel using a pool of threads
of size n, with n the number of available CPU cores.
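As an illustration of this conversion step, the sketch below invokes ffmpeg through Python's subprocess module with the parameters implied above (mono, 16 kHz, signed 16-bit PCM WAV); the input file name is hypothetical and error handling is kept minimal.

# Sketch: convert an audio/video file to the 16 kHz mono 16-bit PCM WAV
# format expected by ina speech segmenter and deepspeech csv.
import subprocess

def to_wav_16k_mono(src, dst):
    # -vn drops any video stream, -ac 1 downmixes to mono,
    # -ar 16000 resamples to 16 kHz, pcm_s16le selects 16-bit PCM.
    subprocess.run(['ffmpeg', '-y', '-i', src,
                    '-vn', '-ac', '1', '-ar', '16000', '-acodec', 'pcm_s16le',
                    dst],
                   check=True, capture_output=True)

to_wav_16k_mono('voice_message.opus', 'voice_message.wav')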
The vad check ingest module has a configuration panel (see Figure 2)
with parameters that determine which files are marked as “interesting items”
through an Autopsy tag. Given that a device might contain many audio
files without speech, such as music collections and operating system audio
notifications, it is important to be able to find the files on the device that
have speech activity. A file is tagged if both the percentage of the audio file
which has speech activity and the total duration of segments with speech
are above the thresholds set in the configuration panel. A 5s file containing 4s
of speech (for example, the sentence “How are you today?”) will not
be tagged if, for instance, the minimum amount of speech required is 20s.
The panel also has an option to determine whether files which have been
tagged should be automatically transcribed using MozDeepSpeech. Since
transcribing files is an operation that can take a significant amount of time,
if the user expects that many files with speech activity will be found and
needs to quickly transcribe a subset of all files, the user can manually tag
only these files and do a selective transcription using ast report. Finally,
the language assumed for the transcription can be selected from a drop-down
menu. The module currently has English and Mandarin Chinese available,
but as more MozDeepSpeech language models become available, these can be
used by the software by placing them in a specific folder.

[6] https://www.ffmpeg.org/

Table 1: SpeechToText’s applications

Application | Description | Output
ina speech segmenter (Python 3) | Use VAD to identify segments with spoken speech. | One CSV file per audio file.
deepspeech csv (C++) | Transcribe a set of audio files. | One text file per audio file containing transcribed text.
vad check ingest (Jython 2.7) | Ingest module for Autopsy. Analyze and transcribe speech. | Transcribed text in Autopsy result tree.
ast report (Jython 2.7) | Report module for Autopsy. Produce HTML/CSV reports. | Transcribed text in HTML/CSV.
deep speech iss.py (Python 3; independent from Autopsy) | Detect and transcribe text from audio files. | One text file per audio file containing transcribed text.
The actual detection of the files with speech is done by ina speech
segmenter. After ina speech segmenter has executed, vad check ingest
sums the duration of all segments labeled as female or male by ina speech
segmenter, obtaining the total duration of segments with speech (male +
female), and tags the file as “Audio file with speech” (if total dur. > 0),
“Audio file with speech - male” (if male dur. > 0) and “Audio file with
speech - female” (if female dur. > 0). If the duration of the audio file,
which is determined by ffmpeg, is less than the minimum total duration of
segments with speech set in the configuration panel, the file is excluded and
not passed to ina speech segmenter, reducing ina speech segmenter’s
and deepspeech csv’s runtimes.

Figure 2: Details of the Autopsy user interface for vad check ingest: (a) new
artifacts created by vad check ingest; (b) vad check ingest’s configuration panel.
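The tagging rule just described can be summarized by the short Python sketch below; the segment list and threshold values are illustrative, not the module's defaults.

# Sketch of the "interesting item" decision: a file is tagged only if the
# speech percentage and the total speech duration both reach the thresholds
# set in the configuration panel.
def should_tag(segments, file_duration_s, min_speech_pct=30.0, min_speech_s=20.0):
    # segments: iterable of (start_s, end_s, label) tuples from the VAD step
    speech_s = sum(end - start for start, end, label in segments
                   if label in ('male', 'female'))
    speech_pct = 100.0 * speech_s / file_duration_s if file_duration_s else 0.0
    return speech_s >= min_speech_s and speech_pct >= min_speech_pct

# Example from the text: a 5 s file with 4 s of speech is not tagged when
# at least 20 s of speech is required.
print(should_tag([(0.5, 4.5, 'female')], 5.0))  # False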
After ina speech segmenter has completed analyzing all files, the files in which
speech was detected and which match the required conditions will be displayed
in the interesting items node of Autopsy’s Directory Tree panel. When the
transcription option is activated in the configuration panel, these files are also
sent to deepspeech csv. For each file, the written transcript returned by
deepspeech csv is imported into the Autopsy case as a blackboard artifact of
type “TSK EXTRACTED TEXT”. The transcripts appear on the graphical
user interface of Autopsy, in “Results”, “Extracted Content”, “Extracted
Text” (see left of Figure 2).
If Autopsy’s keyword search ingest module is active in the case under
analysis, then after running vad check ingest any matches in the tran-
scribed text will automatically appear in the Directory Tree panel, in the
keyword matches node (see Figure 3).

Figure 3: Keyword search of text transcribed with vad check ingest.

3.3. ast report


The ast report module can create an HTML or CSV file containing
text transcribed from audio files with speech. The user can select whether to
create the report using all files containing the tag “Transcribe” or “Tran-
scribed” (see Figure 4). In the first case, the selected files will be first
converted to a 16kHz mono 16-bit PCM-encoded WAV file and transcribed
using deepspeech csv, with the written transcript imported into the case
as a blackboard artifact, similarly to vad check ingest. In this case, the
user should first manually tag files to be transcribed with the “Transcribe”
tag. In the second case, the report will be created using all files which were
previously transcribed using vad check ingest. If the corresponding option
was selected during transcription, the transcripts in the report will have each
segment prepended with the corresponding time.

3.4. deepspeech iss.py


A command line Python script, named deepspeech iss.py, is also pro-
vided inside the SpeechToText module to transcribe files. It accepts a results
directory and a list of audio or video files, and proceeds similarly to the Au-
topsy plugin. First, ffmpeg is run on all files to convert them to 16kHz
mono 16-bit PCM-encoded WAV files, subsequently ina speech segmenter
generates CSV files for all WAV files, and finally deepspeech csv generates
text files with the transcribed text. The deepspeech iss.py script can be
run on a machine without an installed Autopsy version, as it has no depen-
dency on Autopsy.

Figure 4: The ast report module’s configuration panel.

Table 2: Results of the SpeechToText performance test, transcribing 2619 voice recordings
from the test-clean set of the LibriSpeech corpus. Proc. = processor type. ffmpeg = elapsed
time running ffmpeg and ffprobe. Ina = ina speech segmenter elapsed time. DP = deepspeech
csv elapsed time. Ing = total ingest duration. RTF = real time factor.

Proc. | ffmpeg (s) | Ina (s) | DP (s) | Ing (s) | RTF
CPU   | 65.9       | 1329.3  | 6619.6 | 8049.5  | 0.42
GPU   | 67.0       | 415.1   | 2520.1 | 3047.4  | 0.16

3.5. SpeechToText’s performance and accuracy


Transcribing audio files using an RNN is a computationally intensive task.
GPUs can provide a very important speedup of CNN inference and training
(Oh & Jung, 2004; Chellapilla et al., 2006), and the same applies to RNNs
(Li et al., 2014). MozDeepSpeech is implemented using TensorFlow, “an in-
terface for expressing machine learning algorithms, and an implementation
for executing such algorithms” (Abadi et al., 2016). TensorFlow supports
many different platforms and is able to use multiple CPUs or GPUs. The
ina speech segmenter and deepspeech csv versions bundled inside the
SpeechToText module both support Nvidia GPUs via the CUDA frame-
work (Nickolls et al., 2008), and fall back to the CPU if no CUDA-compatible
GPU is available.
To test the performance and accuracy of the SpeechToText module, the test-
clean set from the LibriSpeech corpus (Panayotov et al., 2015) was imported
into an Autopsy case. The set consists of 2620 recordings with clean speech
by native English speakers. vad check ingest was run twice on this set,
once using the GPU and a second time using the CPU [7]. Of the 2620 files,
ina speech segmenter determined 2619 to have speech. The total time
duration of those 2619 files is 19450.4s. The files contain almost no silence.
The tests were conducted on a laptop with an Intel(R) Core(TM) i7-
8750H CPU (with base frequency 2.20GHz and max turbo frequency 4.10
GHz) and a 6 GiB GeForce RTX 2060 Mobile GPU. The results can be seen
in Table 2.
The real time factor [8] (RTF) of the Autopsy ingest using GPU was 0.16
and using CPU was 0.42. As expected, using a GPU will significantly reduce
the time needed to process a forensic image using the SpeechToText module.
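These real time factors follow directly from the ingest durations in Table 2 and the 19450.4 s of audio; for the GPU run, for instance,

\[ \mathrm{RTF} = \frac{T_{\mathrm{ingest}}}{T_{\mathrm{audio}}} = \frac{3047.4\,\mathrm{s}}{19450.4\,\mathrm{s}} \approx 0.16 . \]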
Regarding the accuracy of the transcription, the WER for this set when
transcribed with the SpeechToText module was 7.80%. This is somewhat
higher than the value of 5.3% reported for Baidu’s DeepSpeech 2 (Amodei et al., 2015)
for the same set, and very close to the value of 7.06% given by Mozilla for
the same version of MozDeepSpeech (v0.9.3) [9].
The WER was calculated using the formula

\[ \mathrm{WER} = \frac{\sum_{i=1}^{M} \mathrm{lev}(t_i, r_i)}{\sum_{i=1}^{M} \mathrm{len}(r_i)}, \]

where M is the number of audio files transcribed, r_i is the reference text, t_i is
the transcribed text, len is the number of words in a sentence, and lev(t_i, r_i) is the
Levenshtein distance between the transcribed and reference text at the word level.
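A straightforward way to compute this corpus-level WER is sketched below in Python; the reference and transcript lists are hypothetical, and the Levenshtein distance is computed over word tokens, as in the formula above.

# Sketch: corpus-level WER as the sum of word-level Levenshtein distances
# divided by the total number of reference words.
def levenshtein(a, b):
    # edit distance between two token sequences (dynamic programming)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def corpus_wer(references, transcripts):
    num = sum(levenshtein(t.split(), r.split())
              for r, t in zip(references, transcripts))
    den = sum(len(r.split()) for r in references)
    return num / den

print(corpus_wer(['how are you today'], ['how were you today']))  # 0.25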

[7] By setting CUDA_VISIBLE_DEVICES=-1.
[8] The real time factor is the elapsed time for the transcription process divided by the
duration of the audio file.
[9] https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3
4. Transcription of audio from mobile applications featuring voice
recordings
The effectiveness of the SpeechToText module was evaluated by simulat-
ing a real-world scenario: the generation of transcripts of conversations found
in audio files extracted from a smartphone during a forensic investigation.
Different applications save audio files which can contain conversations of in-
terest for forensic investigators. Some of those applications are video and
audio recorders, (audio) note-taking applications, mobile messaging applica-
tions, which allow the exchange of audio or video messages, and collaboration
and videoconference applications.
Messaging applications (MA) enable the smartphone to send and receive
messages containing text, photos, audio, video, and files through a mobile or
fixed internet connection to and from individuals on a contact list or group.
MAs have become very popular, being used monthly by several billion people
(Facebook, 2017). In some regions, such as Brazil, communicating via audio
messages exchanged in MAs is widespread (Maros et al., 2020). Collaboration
and videoconference applications, such as Zoom and Microsoft Teams also
allow the exchange of audio messages.
The effectiveness of the SpeechToText module was evaluated by setting up
an experiment where different Android applications which can generate audio
files containing speech were installed on a smartphone. The objectives of the
experiment were: 1) to determine whether the SpeechToText module was
able to automatically detect the audio files and subsequently automatically
transcribe them; 2) to determine where audio messages or notes are saved on
the device by each application; and 3) to determine the audio coding format
used by the audio files.

4.1. Experimental setup


For this experiment, a Motorola Moto G4 Play running Android 7.1.1
was used. Table 3 lists the applications which were installed on the device.
Amongst these are some of the most used MAs, namely WhatsApp, with
two billion users in 2020 (Facebook, 2020), Facebook Messenger with 1.3
billion users in 2017 (Facebook, 2017) and WeChat with 1.2 billion users in
2020 (Tencent, 2020).
For each application, a voice message was sent from the device being
tested (A) to a different device (B), and another message was sent from
device B to device A. Each message consisted of reading the same excerpt of
the Universal Declaration of Human Rights, in English, with the duration of
around 1 minute. The messages were spoken by four different individuals, all
non-native English speakers. In the case of Houseparty, a video message
(Facemail) was sent, since HouseParty currently does not offer audio-only
voice messages. Regarding the Evernote application, a note was created
containing recorded sound. In the case of the Moto Camera One application,
a video was recorded. A total of 26 voice recordings were created using the
applications.

Table 3: Applications installed on the smartphone for forensic analysis of audio files.
The columns “Subj. From” and “Subj. To” show which subject sent the message from
or to the smartphone.

App. name | App. Type | Feature name | Subj. From | Subj. To
Moto Camera One | Photo and Video | Video recording | a | -
Evernote | Note taking | Audio in note | a | -
Zoom | Videoconferencing | Meet and Chat | a | a
Microsoft Teams | Collaboration | Voice message | a | d
Signal | Messaging | Voice message | a | a
Skype | Videoconferencing | Voice message | a | a
Snapchat | Messaging | Voice note | a | a
Viber | Messaging | Voice message | a | a
Telegram | Messaging | Voice message | a | c
WeChat | Messaging | Voice message | a | a
Facebook Messenger | Messaging | Voice recording | a | b
WhatsApp | Messaging | Voice message | a | b
GoogleDuo | Video Chat | Voice mail | a | a
Houseparty | Videoconferencing | Facemail | a | a
The extraction and forensic analysis of the smartphone were conducted on
the same computer used for deepspeech csv’s performance analysis, running
Linux (Debian 10) and Autopsy 4.17.0, and making use of the Nvidia GPU.
The version of the SpeechToText module used was built with DeepSpeech version
v0.9.3-0-gf2e9c85 and TensorFlow version v2.3.0-6-g23ad988.
Given that the smartphone was rooted, forensic images of the data and
cache Android partitions were created by block-level copying. This was
achieved by flashing the Team Win Recovery Project (TWRP [10]) custom re-
covery image and using adb to run dd. Since Autopsy cannot process images
containing the Samsung-developed F2FS filesystem, the images were instead
loop mounted in read-only mode onto a directory, and that directory was
imported as logical files.
During the initial import of the directories as new data sources, Autopsy’s
File Type Identification module was run. Subsequently the vad check
ingest ingest module was run with file selection conditions dur ≥ 20s and
sp ≥ 30%, where dur is the audio stream duration and sp is the percentage
of speech content. The ingest process was also configured to automatically
transcribe all files matching these criteria.

4.2. Results
Autopsy’s File Type Identification module in the initial ingest found a
total of 50 audio or video files. The subsequent vad check ingest ingest
process found and transcribed 28 files, all of which contained voice recordings
created purposely for this experiment, although several were duplicate files
saved in a different location by the applications, and one was a preliminary
voice recording test. Of the other 22 files which were not selected for tran-
scription, 20 had duration less than 20s and the remaining two were ringtones

10
https://twrp.me/

17
Table 4: Results of the ingest process. Rec = number of experiment voice recordings
found. Tr. Files = files transcribed. A/V = audio/video. Dur = total duration of the 30
audio/video files. Ina = ina speech segmenter elapsed time. DP = deepspeech csv elapsed
time. Ing = ingest duration. RTF = real time factor.

Rec | Tr. Files | A/V  | Dur (s) | Ina (s) | DP (s) | Ing (s) | RTF
20  | 30        | 29/1 | 1638.0  | 28.4    | 202.31 | 233.2   | 0.15

Table 5: Detection of files with speech by application

App. name | Voice recordings automatically detected | Audio coding format
Moto Camera One | Yes | aac
Evernote | Yes | amr nb
Zoom | Yes | amr nb
Microsoft Teams | Yes | aac
Signal | No | -
Skype | Yes | aac
Snapchat | No | -
Viber | Yes | aac
Telegram | Yes | opus
WeChat | No (found manually) | headless amr nb
Facebook Messenger | Yes | aac
WhatsApp | Yes | opus
GoogleDuo | Yes | aac
Houseparty | No | -

Table 6: Path of audio files generated by applications containing voice recordings

Application | Location in device
Evernote | /media/0/Android/data/com.evernote/files/Temp/Shared/
Facebook Messenger | /data/com.facebook.orca/cache/fb temp/
Google Duo | /media/0/Duo/
Moto Camera One | /media/0/DCIM/Camera/
Skype | /data/com.skype.raider/cache/071419883E91F454201B5D2F53337571DEC9A86AF8670441660DBB7B35232C43/RNManualFileCache/
Teams | /media/0/MicrosoftTeams/Media/VoiceMessages/
Telegram | /media/0/Telegram/TelegramAudio/
Viber | /media/0/Android/data/com.viber.voip/files/.ptt/
WeChat | /media/0/tencent/MicroMsg/24a59238ae32c61707a6e7c0d3db9ed9/voice2/1a/09/
WhatsApp | /media/0/WhatsApp/Media/WhatsAppVoiceNotes/202013/
Zoom | /data/us.zoom.videomeetings/data/[email protected]/zwzyt6-yqia [email protected]/



Files corresponding to 18 of the 26 voice recordings created were automati-
cally transcribed (see Table 5). In total, voice recordings were found auto-
matically for 10 out of the 14 tested applications. No audio or video files were
detected by Autopsy’s File Type Identification module in the application-
specific storage of HouseParty, Signal, and Snapchat. By manual search, it
was possible to find the two voice recordings exchanged through WeChat.
WeChat saved the voice recordings as AMR audio files, but without the cor-
responding header, thus not being detected by Autopsy. Using a script [11],
the headerless AMR files were converted to WAV files and imported into the
Autopsy case, resulting in a total of 20 voice recordings transcribed.
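A repair along the lines of the referenced script can be sketched as follows: prepend the standard AMR-NB magic header to the raw payload and let ffmpeg decode the result. The file names are hypothetical, and the sketch assumes the payload really consists of raw single-channel AMR-NB frames.

# Sketch: make a WeChat headerless AMR-NB recording playable and convert
# it to WAV so that it can be imported and transcribed.
import subprocess

def repair_headerless_amr(src, amr_out, wav_out):
    with open(src, 'rb') as f:
        payload = f.read()
    with open(amr_out, 'wb') as f:
        f.write(b'#!AMR\n' + payload)   # standard AMR-NB file magic
    subprocess.run(['ffmpeg', '-y', '-i', amr_out,
                    '-ac', '1', '-ar', '16000', '-acodec', 'pcm_s16le',
                    wav_out],
                   check=True, capture_output=True)

repair_headerless_amr('msg_0.amr', 'msg_0_fixed.amr', 'msg_0.wav')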
Regarding the accuracy of the transcription, the WER was measured for
the 20 transcribed voice recordings. Each recording contained an introduc-
tion identifying the application used to send the message and the device from
which it was sent. The text corresponding to this speech segment was deleted,
and only the transcription relating to the excerpt of the Universal Declara-
tion of Human Rights was considered. The WER for the files transcribed in
the ingest was 27.2%, ranging from a minimum of 17.0% to a maximum of
35.0%.

[11] https://gist.github.com/Kronopath/c94c93d8279e3bac19f2/c9ee08870c445b882c0bf47be4f77836fa2d048e
The first sentence from the transcribed text of one of the audio messages
which has a WER close to the average (26.4%) is reproduced below:
universal declaration of human rights article on all human beings
are born free and equal in dignity and rights there and though
with reason and conscious and shoulderwards one another in a
spirit of brotherhood
The elapsed time for the ingest process, including the additional WeChat
files, was 233.26s, corresponding to a real time factor of 0.15 (see Table 4).
The elapsed time includes converting the files to 16kHz mono 16-bit PCM-
encoded WAV format, running ina speech segmenter to detect files with
speech and provide segmentation, and subsequently executing deepspeech
csv to transcribe the files using MozDeepSpeech. The elapsed time does
not include the previous ingest which ran Autopsy’s File Type Identification
module.
Table 5 shows the audio coding format used by each application to store
the voice messages. Table 6 lists the paths where audio files generated by the
applications containing the voice recordings are stored. Facebook Messenger,
Skype, and Zoom store the audio files in the data partition; the other ap-
plications store the audio files in the media partition, which can be accessed
directly. To access Facebook Messenger, Skype, and Zoom audio files either
rooting or an equivalent method is required.

5. Discussion
The audio files generated by most of the applications used are easily found,
as they are saved as standard audio files in the application-specific storage.
In the case of WeChat, the audio files were saved as headerless audio files. On
the other hand, audio/video files from the HouseParty, Signal, and Snapchat
apps were not detected by Autopsy’s File Type Identification module, possibly due
to encryption, and thus were left out of the experiment. Therefore, to use
the SpeechToText plugin to process files from these three applications, other
tools must first be used to extract the audio/video files.
Considering the WeChat processed files, a total of 20 out of 26 voice
recordings were found and transcribed. All the voice recordings of the ex-
periment for which a corresponding audio file was found by Autopsy (con-
sidering also the processed WeChat files) were correctly identified by vad check
ingest as containing speech. Two files which did not contain any
speech (ringtones) were also analyzed by ina speech segmenter due to
fulfilling the duration condition (dur > 20s). Neither of them was labeled
by vad check ingest as fulfilling the 30% speech content condition. This
corresponds to an identification accuracy of 100%.
Manson et al. (2013) reported that the average word rate for a telephone
conversation between English speakers in a particular study was 214 WPM.
A skilled typist has typing speed in the range of 70 WPM (Snyder et al.,
2014), which corresponds to a 3.1 real-time factor. The real-time factor for
the ingest process in this experiment was 0.15, which is 20.4 times faster than
a skilled typist. It can be concluded that using the SpeechToText module
will speed up the process of obtaining transcriptions of voice conversations
on a smartphone. Moreover, computers can work 24/7 and GPU capabilities
have been increasing regularly, which should further boost performance. It
should be noted that the figure 20.4x is a conservative estimate, since it does
not include the time necessary for manually determining which files should
be transcribed from a larger list of audio files.
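The 20.4x figure follows from the rates quoted above:

\[ \frac{214\ \mathrm{WPM\ (speech)}}{70\ \mathrm{WPM\ (typing)}} \approx 3.1, \qquad \frac{214/70}{0.15} \approx 20.4 . \]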
The WER was somewhat high at 27.2%, although one must take into
account that the sound files produced by the combinations of the apps and
smartphones used were not of high quality. Some had a considerable amount
of noise and artifacts. Furthermore, none of the speakers were native English
speakers, a factor known to lower transcription performance. In addition, the
text had a somewhat difficult vocabulary in terms of pronunciation. Despite a
relatively high WER, it was possible to understand most of the text produced
by the plugin. The quality of transcription produced is considered adequate
for the purpose of quickly finding relevant information in the triage stage
of a forensic investigation, which was the original goal. Moreover, even if
the WER values preclude the direct use of the automated transcriptions in
formal forensic reports, SpeechToText’s transcripts can serve as drafts for
human transcribers, still saving precious time.
As described before, for the test-clean set of LibriSpeech the output of
the SpeechToText plugin had a WER of 7.8%, which is close to the value of
7.06% given by Mozilla for the same version of MozDeepSpeech (v0.9.3) [12];
the plugin therefore has the expected accuracy when used with speech from
native English speakers and good-quality audio recordings.

[12] https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3
Using the SpeechToText plugin together with Autopsy’s keyword search
allows quickly narrowing down conversations of interest from numerous record-
ings present on a forensic image in an automated way. It is possible that the
SpeechToText plugin will mistranscribe one of the words defined as relevant
keywords by the forensic examiner. Taking that into account, one can add
likely misspellings and similar-sounding words to the items in the keyword
list to improve the probability of a keyword match.

6. Conclusion
This work presents the SpeechToText plugin, a digital forensic tool lever-
aging open-source machine learning software for automatic detection and
transcription of voice recordings.
We showed that the SpeechToText plugin can automatically detect, with
high accuracy, audio files containing speech in a forensic image of a smartphone. The
audio files were generated by different applications, namely by messaging
applications such as WhatsApp and Facebook Messenger. We also demon-
strated that the plugin can automatically transcribe files where speech was
detected, or files manually tagged, generating transcribed text with a WER
which is adequate for the purpose of discovering relevant information in a
forensic investigation. We showed that the detection and transcription, given
adequate hardware, can be performed faster than real-time, therefore signifi-
cantly decreasing the time required to find relevant information contained in
audio files present in a forensic image, and freeing valuable human resources.
We also showed that most of the popular messaging applications studied store audio
or video messages unencrypted on the device, listing the locations where such
files are likely to be found and the corresponding audio coding formats.
As future work, we plan to add support for other ASR engines, making it
possible for users to select the most convenient ASR engine and model. We
also plan on testing the software with recordings that mimic real-world con-
ditions such as poor signal-to-noise ratio, or multiple overlapping voices in a
single recording.

Acknowledgment
This work was partially supported by CIIC under the FCT project UIDB-
04524-2020, and by FCT/MCTES and EU funds under the project UIDB/EEA/50008/2020.

References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M.,
Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R.,
Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden,
P., Wicke, M., Yu, Y., & Zheng, X. (2016). TensorFlow: A System for
Large-Scale Machine Learning. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16) (pp. 265–283).

Amazon Transcribe (2021). Amazon Transcribe Speech To Text. URL:
https://aws.amazon.com/transcribe/ [Online; accessed 28. Jan. 2021].

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E.,
Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G. et al. (2016).
Deep speech 2: End-to-End Speech Recognition in English and Mandarin.
In M. F. Balcan, & K. Q. Weinberger (Eds.), Proceedings of The 33rd
International Conference on Machine Learning (pp. 173–182). New York,
New York, USA: PMLR volume 48 of Proceedings of Machine Learning
Research. URL: http://proceedings.mlr.press/v48/amodei16.html.

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro,
B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel,
J., Fan, L., Fougner, C., Han, T., Hannun, A., Jun, B., LeGresley, P.,
Lin, L., Narang, S., Ng, A., Ozair, S., Prenger, R., Raiman, J., Satheesh,
S., Seetapun, D., Sengupta, S., Wang, Y., Wang, Z., Wang, C., Xiao, B.,
Yogatama, D., Zhan, J., & Zhu, Z. (2015). Deep Speech 2: End-to-End
Speech Recognition in English and Mandarin. arXiv:1512.02595 [cs] , .
arXiv:1512.02595.

Anglano, C., Canonico, M., & Guazzone, M. (2017). Forensic analysis of


Telegram Messenger on Android smartphones. Digital Investigation, 23 ,
31–49. doi:10.1016/j.diin.2017.09.002.

Azfar, A., Choo, K.-K. R., & Liu, L. (2016). An Android Communication
App Forensic Taxonomy. Journal of Forensic Sciences, 61 , 1337–1350.
doi:10.1111/1556-4029.13164.

Barr-Smith, F., Farrant, T., Leonard-Lagarde, B., Rigby, D., Rigby, S., &
Sibley-Calder, F. (2021). Dead Man’s Switch: Forensic Autopsy of the
Nintendo Switch. Forensic Science International: Digital Investigation,
(p. 301110). URL: https://doi.org/10.1016/j.fsidi.2021.301110.
doi:10.1016/j.fsidi.2021.301110.

Carlini, N., & Wagner, D. (2018). Audio Adversarial Examples: Targeted


Attacks on Speech-to-Text. In 2018 IEEE Security and Privacy Workshops
(SPW). IEEE. doi:10.1109/spw.2018.00009.

Casey, E. (2011). Digital Evidence and Computer Crime. Cambridge, MA,


USA: Academic Press. URL: https://www.elsevier.com/books/digit
al-evidence-and-computer-crime/casey/978-0-08-092148-8.

Casey, E., & Souvignet, T. R. (2020). Digital transformation risk manage-


ment in forensic science laboratories. Forensic Science International , 316 ,
110486. URL: https://doi.org/10.1016/j.forsciint.2020.110486.
doi:10.1016/j.forsciint.2020.110486.

Chang, J.-H., Kim, N. S., & Mitra, S. K. (2006). Voice activity detection
based on multiple statistical models. IEEE Transactions on Signal Pro-
cessing, 54 , 1965–1976.

Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolu-
tional neural networks for document processing. In Tenth International
Workshop on Frontiers in Handwriting Recognition. Suvisoft.

CMUSphinx (2021). CMUSphinx. URL: https://github.com/cmusphinx


[Online; accessed 28. Jan. 2021].

Dargahi, T., Dehghantanha, A., & Conti, M. (2017). Forensics Analy-


sis of Android Mobile VoIP Apps. In Contemporary Digital Forensic
Investigations of Cloud and Mobile Applications (pp. 7–20). Elsevier.
doi:10.1016/b978-0-12-805303-4.00002-2.

Doukhan, D., Carrive, J., Vallet, F., Larcher, A., & Meignier, S. (2018a).
An open-source speaker gender detection framework for monitoring gender
equality. In Acoustics Speech and Signal Processing (ICASSP), 2018 IEEE
International Conference On. IEEE.

Doukhan, D., Lechapt, E., Evrard, M., & Carrive, J. (2018b). INA’S MIREX
2018 music and speech detection system. In Music Information Retrieval
Evaluation eXchange (MIREX 2018).

Errattahi, R., El Hannani, A., & Ouahmane, H. (2018). Automatic Speech
Recognition Errors Detection and Correction: A Review. Procedia Com-
puter Science, 128 , 32 – 37. doi:10.1016/j.procs.2018.03.005. 1st
International Conference on Natural Language and Speech Processing.

Facebook (2017). Messenger. https://www.facebook.com/messenger/posts/1530169047102770.

Facebook (2020). WhatsApp. https://blog.whatsapp.com/two-billion-users-connecting-the-world-privately.

Filippidou, F., & Moussiades, L. (2020). A Benchmarking of IBM, Google


and Wit Automatic Speech Recognition Systems. In I. Maglogiannis, L. Il-
iadis, & E. Pimenidis (Eds.), IFIP Advances in Information and Com-
munication Technology (pp. 73–82). Springer International Publishing.
doi:10.1007/978-3-030-49161-1 7.

Google Speech-to-Text (2021). Google Cloud Speech API. URL:
https://cloud.google.com/speech-to-text [Online; accessed 28. Jan. 2021].

Guo, J., Sainath, T. N., & Weiss, R. J. (2019). A Spelling Correction Model
for End-to-end Speech Recognition. In ICASSP 2019 - 2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE. doi:10.1109/icassp.2019.8683745.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E.,
Prenger, R., Satheesh, S., Sengupta, S., Coates, A., & Ng, A. Y. (2014).
Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567
[cs] , . arXiv:1412.5567.

Heo, H.-S., So, B.-M., Yang, I.-H., Yoon, S.-H., & Yu, H.-J. (2019). Auto-
mated recovery of damaged audio files using deep neural networks. Digital
Investigation, 30 , 117–126. doi:10.1016/j.diin.2019.07.007.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 ,
436–444. doi:10.1038/nature14539.

Li, B., Zhou, E., Huang, B., Duan, J., Wang, Y., Xu, N., Zhang, J., &
Yang, H. (2014). Large scale recurrent neural network on GPU. In 2014
International Joint Conference on Neural Networks (IJCNN) (pp. 4062–
4069). doi:10.1109/IJCNN.2014.6889433.

Mahr, A., Cichon, M., Mateo, S., Grajeda, C., & Baggili, I. (2021). Zooming
into the pandemic! A forensic analysis of the Zoom Application. Forensic
Science International: Digital Investigation, 36 , 301107. doi:10.1016/j.
fsidi.2021.301107.

Manson, J. H., Bryant, G. A., Gervais, M. M., & Kline, M. A. (2013). Con-
vergence of speech rate in conversation predicts cooperation. Evolution
and Human Behavior , 34 , 419–426. doi:10.1016/j.evolhumbehav.2013.
08.001.

Maros, A., Almeida, J., Benevenuto, F., & Vasconcelos, M. (2020). Analyz-
ing the Use of Audio Messages in WhatsApp Groups. In Proceedings of
The Web Conference 2020 WWW ’20 (pp. 3005–3011). Taipei, Taiwan:
Association for Computing Machinery. doi:10.1145/3366423.3380070.

Microsoft Speech Services (2021). Microsoft Azure Speech Services. URL:


https://azure.microsoft.com/en-us/services/cognitive-services/speech-services/ [Online; accessed 28. Jan. 2021].

Morris, A. C., Maier, V., & Green, P. D. (2004). From WER and RIL to MER
and WIL: Improved evaluation measures for connected speech recognition.
In INTERSPEECH .

Mozilla DeepSpeech (2021). Mozilla DeepSpeech. URL:
https://github.com/mozilla/DeepSpeech [Online; accessed 01 Feb. 2021].

Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008). Scalable Parallel
Programming with CUDA. Queue, 6 , 40–53. doi:10.1145/1365490.1365
500.

Nouwens, M., Griggio, C. F., & Mackay, W. E. (2017). ”WhatsApp is for


family;Messenger is for friends”. In Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems. ACM. doi:10.1145/3025453.
3025484.

Oh, K.-S., & Jung, K. (2004). GPU implementation of neural networks.


Pattern Recognition, 37 , 1311–1314. doi:10.1016/j.patcog.2004.01.0
13.

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech:
An ASR corpus based on public domain audio books. In 2015 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 5206–5210). doi:10.1109/ICASSP.2015.7178964.
Peinl, R., Rizk, B., & Szabad, R. (2020). Open Source Speech Recognition
on Edge Devices. In 2020 10th International Conference on Advanced
Computer Information Technologies (ACIT). IEEE. doi:10.1109/acit49
673.2020.9208978.
Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchin-
sky, V., & Collobert, R. (2019). Wav2Letter++: A Fast Open-source
Speech Recognition System. In ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
doi:10.1109/icassp.2019.8683535.
Quick, D., & Choo, K.-K. R. (2014). Impacts of increasing volume of digital
forensic data: A survey and future research challenges. Digital Investiga-
tion, 11 , 273–294. doi:10.1016/j.diin.2014.09.002.
Ravanelli, M., Parcollet, T., & Bengio, Y. (2019). The Pytorch-kaldi Speech
Recognition Toolkit. In ICASSP 2019 - 2019 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
doi:10.1109/icassp.2019.8683713.
S, K., & E, C. (2016). A review on automatic speech recognition architec-
ture and approaches. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 9 , 393–404. doi:10.14257/ijsip.2
016.9.4.34.
Snyder, K. M., Ashitaka, Y., Shimada, H., Ulrich, J. E., & Logan, G. D.
(2014). What skilled typists don’t know about the QWERTY keyboard.
Attention, Perception, & Psychophysics, 76 , 162–171.
Tencent (2020). Tencent Announces 2020 First Quarter Results. https:
//cdc-tencent-com-1258344706.image.myqcloud.com/uploads/2020
/05/18/13009f73ecab16501df9062e43e47e67.pdf.
Wang, D., & Chen, J. (2018). Supervised Speech Separation Based on Deep
Learning: An Overview. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26 , 1702–1726. doi:10.1109/taslp.2018.2842159.

Watanabe, S., Delcroix, M., Metze, F., & Hershey, J. R. (Eds.) (2017). New
Era for Robust Speech Recognition. Springer International Publishing.
doi:10.1007/978-3-319-64680-0.

Wit.ai (2021). Wit.ai. URL: https://wit.ai [Online; accessed 28. Jan.


2021].

Wu, S., Zhang, Y., Wang, X., Xiong, X., & Du, L. (2017). Forensic analysis of
WeChat on Android smartphones. Digital Investigation, 21 , 3–10. doi:10
.1016/j.diin.2016.11.002.

Ying, D., Yan, Y., Dang, J., & Soong, F. K. (2011). Voice Activity Detection
Based on an Unsupervised Learning Framework. IEEE Transactions on
Audio, Speech, and Language Processing, 19 , 2624–2633. doi:10.1109/ta
sl.2011.2125953.

Yu, D., & Deng, L. (2015). Automatic Speech Recognition. Springer London.
URL: https://doi.org/10.1007/978-1-4471-5779-3. doi:10.1007/97
8-1-4471-5779-3.

Zhang, X.-L., & Wu, J. (2013). Deep Belief Networks Based Voice Activity
Detection. IEEE Transactions on Audio, Speech, and Language Processing,
21 , 697–710. doi:10.1109/tasl.2012.2229986.
