Speech To Text Preprint Version
Abstract
Voice is the most natural way for humans to communicate with each other and, more recently, to interact with voice-controlled digital machines. Although text is predominant in digital platforms, voice and video are becoming increasingly important, with communication applications supporting voice messages and videos. This is relevant for digital forensic examinations, as content held in voice format can hold relevant evidence for the investigation. In this paper, we present the open-source SpeechToText software, which resorts to state-of-the-art Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR) modules to detect voice content and then transcribe it to text. This allows integrating voice content into the regular flow of a digital forensic investigation, with transcribed audio indexed by text search engines. Although SpeechToText can be run independently, it also provides a Jython-based software module for the well-known Autopsy software. The paper also analyzes the availability, storage location and audio format of voice-recorded content in 14 popular Android applications featuring voice recordings. SpeechToText achieves 100% accuracy for detecting voice in unencrypted audio/video files, a word error rate (WER) of 27.2% when transcribing English voice messages by non-native speakers, and a WER of 7.80% for the test-clean set of LibriSpeech. It achieves a real time factor of 0.15 for the detection and transcription process on a medium-range laptop, meaning that one minute of speech is processed in roughly nine seconds.
https://doi.org/10.1016/j.fsidi.2021.301223
Keywords: voice recordings, automatic speech recognition, automatic speech transcription, digital forensics, Android applications

Preprint submitted to Forensic Science International: Digital Investigation, July 15, 2021
1. Introduction
In the last two decades, the world has gone digital. For instance, few of us dare to leave home without our smartphone, possibly the best-representing tool of the digitized world. The omnipresence of digital devices in our lives means that a criminal investigation or civil litigation will often need to analyze digital devices linked to the case under consideration. This is one of the three factors that have contributed to the swelling backlogs of digital forensic laboratories, together with the increasing number of devices per case and the volume of data per device (Quick & Choo, 2014).
Examinations of electronic devices are performed by digital forensic experts who apply procedures, tools, and techniques to extract and interpret data found on the devices (Casey, 2011). Communications that have occurred between individuals and whose traces remain on the devices can provide important clues for the investigation and thus need to be analyzed and interpreted. Although many communications through digital channels are in the form of written text, voice still plays an important role in the digital world, as it is the most natural way for humans to communicate. For instance, many platforms allow voice communication, either through calls or through recorded voice messages (Azfar et al., 2016; Nouwens et al., 2017). Examples include Facebook, WhatsApp, Signal, and Telegram, to name just a few. Voice is also used for interacting with machines, as it demands much less attention than writing and is hands-free, and thus can be used simultaneously with other tasks, such as driving or walking. Examples include voice interaction with digital personal assistants such as Amazon Echo and Apple Siri, with vehicles, and with many other devices.
While text can be parsed, indexed, and searched with relative ease, voice data are more difficult to integrate into a digital forensic examination. To be useful, voice data need to be transcribed to text in a digital format, a process identified as speech-to-text. The no-tech approach to speech-to-text is simply to have someone listen to the audio and transcribe it. This is a human-intensive task, and thus expensive and slow, especially when the quality of the transcription matters. As digital forensic laboratories are often constrained in money, human resources, and time (Casey & Souvignet, 2020), a faster and more economical approach is needed. This is the case
for automated speech-to-text, where software transcribes audio files of voice
communication, a process called Automatic Speech Recognition (ASR).
Although speech-to-text is a faculty seen as basic for humans, it is, in reality, a quite complex process involving a significant part of the human brain (Carlini & Wagner, 2018). The same occurs in the digital world, where for many years speech-to-text software was quite limited and error prone (S & E, 2016). Since the (re)emergence of artificial intelligence (AI), and in particular due to the advances in deep learning-based techniques (LeCun et al., 2015), automatic speech recognition software has evolved tremendously (Yu & Deng, 2015; Watanabe et al., 2017). ASR has moved from systems that required good audio and articulated speech to deliver mediocre results to systems achieving low error rates in transcription, surpassing humans in speed and cost. Nonetheless, automatic speech recognition is still not perfect (Errattahi et al., 2018). Examples of usage include using voice to interact with a smartphone to write an email or take a note, thus bypassing the need to use the (virtual) keyboard. Another example is the automatic captioning of videos on YouTube and Bing.
In this paper, we apply ASR to digital forensics. Specifically, we aim to automatically transcribe meaningful audio files found on digital devices to text, so that the content can be indexed for searches and included in digital forensic reports. In a digital forensic examination, it is common for audio files to be considered of interest if they contain a high percentage of human speech and have a duration above a minimum threshold. The main motivation is to automate, in digital forensic cases, 1) the detection of speech and 2) speech transcription, with the goal of speeding up the handling of audio recordings.
For this purpose, we developed a software suite that relies on several
open-source libraries, namely INA’s inaSpeechSegmenter (Doukhan et al.,
2018a,b) for speech detection, and Mozilla’s DeepSpeech (Amodei et al.,
2016) for speech transcription. As a proof-of-concept, we built a software
module called SpeechToText for the well-known Autopsy forensic software
and assessed its behavior with a set of audio files from popular Android
applications. The main contributions of this paper are as follows: 1) De-
velopment of an open-source software suite able to detect audio recordings
with speech and to transcribe the content to text through ASR, without the
need to export potentially sensitive files to the cloud; 2) Development of the
SpeechToText software module for Autopsy; 3) Analysis of the location of
the audio files and the audio coding formats for 14 Android applications featuring voice recording, including 12 well-known communication applications;
4) Assessment of the transcription accuracy and speed performance of the
developed software with audio files from the studied applications, and with
the test-clean set from the LibriSpeech corpus (Panayotov et al., 2015).
The remainder of this paper is structured as follows. Section 2 reviews related work. Next, Section 3 describes the software stack that composes SpeechToText, while Section 4 presents the experimental setup and the main accuracy and performance results. Section 5 discusses the results. Finally, Section 6 concludes the paper and presents avenues for future work.
2. Related Work
We now review related work. We first focus on speech detection, then on speech transcription, and conclude with scientific works that target Android applications that allow recording and exchanging audio messages. In this work, we only deal with healthy recording files, that is, files that are not corrupted. If the need arises, recovery techniques to repair damaged audio files can be applied (Heo et al., 2019).
The inaSpeechSegmenter library achieves "a frame-level gender detection F-measure of 96.52 and a hourly women speaking time percentage error below 0.6%" (Doukhan et al., 2018a). The library fits our main requirements: 1) it is available under an open-source license and 2) it is able to work locally, without the need to access the Internet. This last requirement is particularly important in digital forensics, where the use of commercial cloud-based services, which require transferring case evidence to a third party, is often forbidden or at least strongly advised against.
1 GPU stands for graphics processing unit.
The MozDeepSpeech documentation alerts that the English pre-trained model has some biases towards US English and male voices, due to its training dataset (Mozilla DeepSpeech). Models for other languages, such as German, Spanish, French and Italian, have also been made available by third parties2.
There are other open-source ASR projects which, like DeepSpeech, work offline. Examples include Facebook's Wav2letter++ (Pratap et al., 2019), Kaldi (Ravanelli et al., 2019) and CMUSphinx (CMUSphinx). Peinl et al. provide a review of offline open-source ASR systems on constrained edge hardware (Peinl et al., 2020). They also compare open-source systems to commercial cloud-based ASR systems from Amazon (Amazon Transcribe), Google (Google Speech-to-Text) and Microsoft (Microsoft Speech Services), reporting that the Word Error Rate (WER) (Morris et al., 2004) for these systems was, respectively, 12.3%, 8.6% and 9.4%, while MozDeepSpeech fared worse, with a 29.0% word error rate.
Nonetheless, as stated earlier, commercial products are not viable for
some digital forensic environments, not only due to costs but also due to the
need to interact with cloud services, and the consequent loss of privacy.
Filippidou and Moussiades (Filippidou & Moussiades, 2020) provide a brief review of ASR projects and products, and benchmark three cloud-based closed-source solutions: Google, Wit from wit.ai (Wit.ai), and IBM Watson. For benchmarking, they use English recordings from three individuals whose native language is Greek. They report that Google's solution yielded the lowest WER, and note that the WER, regardless of the benchmarked solution, was substantially dependent on the speaker. For instance, the WER for Google ranged from 16.60% to 24.85%, with Wit's WER varying between 23.28% and 58.87%. Our experiments with SpeechToText also confirm the lower accuracy of ASR for non-native English speakers.
2 https://gitlab.com/Jaco-Assistant/Scribosermo
Azfar et al. examine Android communication applications such as Line, WeChat, and Viber (Azfar et al., 2016). Anglano
et al. study the Telegram Messenger on Android smartphones (Anglano et al.,
2017), while Wu et al. analyze forensic artifacts left by the WeChat applica-
tion on Android devices (Wu et al., 2017). The Zoom video conference software is analyzed by Mahr et al. for the Windows 10 and Android platforms, identifying the location of relevant files, namely SQLite databases (Mahr et al., 2021).
In our work, besides all the above-cited Android applications, we also study
Google Duo, Evernote, and Microsoft Teams, although our analysis is solely
focused on audio recording capabilities and consequent forensic artifacts of
the applications.
3. SpeechToText
SpeechToText is a set of applications that provides speech detection and transcription for the Autopsy software (see Table 1). SpeechToText is available online under a GPLv3 license3. It supports both the Linux and Windows operating systems and can make use of a GPU that supports the Compute Unified Device Architecture (CUDA) for increased performance. Autopsy4 is a popular open-source digital forensics tool (Barr-Smith et al., 2021). It uses The Sleuth Kit5, a collection of command line tools, to analyze disk images and recover files from them. The functionality of Autopsy can be extended through three types of modules: 1) File ingest; 2) Data source ingest; and 3) Report. SpeechToText encompasses two modules for Autopsy: a data source ingest module called vad_check_ingest.py (henceforth vad_check_ingest), and a report module called ast_report.py (henceforth ast_report). These two modules are supported by an application called deepspeech_csv, with deepspeech_csv itself requiring another application, named ina_speech_segmenter (see Figure 1). Next, we describe deepspeech_csv, ina_speech_segmenter, and their execution flow.
3 https://github.com/labcif/AutopsySpeechToText
4 https://www.sleuthkit.org/autopsy/
5 https://www.sleuthkit.org
The deepspeech_csv application is based on MozDeepSpeech's native client and adds one main feature: the ability to perform VAD using comma-separated values (CSV) files generated by ina_speech_segmenter.
While MozDeepSpeech's original native client receives an audio file and transcribes it using TensorFlow (Abadi et al., 2016), outputting text and metadata, it does not perform VAD. As speech transcription is a computationally heavy task, it is important that audio files are first assessed for the existence of speech, through VAD, so that only audio segments with speech are considered for speech transcription. VAD is the role of the Python script ina_speech_segmenter. For each audio file, ina_speech_segmenter creates one CSV file with one line per segment, each line containing the segment start and end times, plus a classification ("male", "female", "music", "noise" or "noEnergy").
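As an illustration, the following sketch runs the underlying inaSpeechSegmenter library from Python and writes a CSV file with the layout just described (start time, end time, label per line). It is a minimal sketch assuming the library's documented Python API; the exact interface and column order of the bundled ina_speech_segmenter application may differ.

import csv
from inaSpeechSegmenter import Segmenter

def segment_to_csv(wav_path, csv_path):
    seg = Segmenter()             # loads the CNN-based segmentation models
    segmentation = seg(wav_path)  # list of (label, start_s, end_s) tuples
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for label, start, end in segmentation:
            # labels: 'male', 'female', 'music', 'noise' or 'noEnergy'
            writer.writerow([f"{start:.2f}", f"{end:.2f}", label])
    return segmentation

if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for label, start, end in segment_to_csv("message.wav", "message.csv"):
        print(f"{label}: {start:.2f}s - {end:.2f}s")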
The deepspeech_csv application receives a list of 16kHz mono 16-bit pulse code modulation (PCM) encoded WAV audio files and a directory where the corresponding ina_speech_segmenter-produced CSV files are located. For each audio file, deepspeech_csv looks for a matching CSV file (same root name but with the suffix ".csv"). It then extracts from the audio file the segments containing speech and feeds them to the MozDeepSpeech engine, obtaining the corresponding transcription. The segments are processed in the order they appear in the CSV file, and gender information is not used for the transcription. As a result, for each audio file, a corresponding text file with the transcribed text is created in the directory containing the CSV files. By default, each segment in the text file has its start time prepended, although this can be disabled from the command line. The deepspeech_csv application also requires command line parameters specifying the names of two files that harbor the neural network model needed by MozDeepSpeech's transcription engine. These two files are specific to the spoken language (e.g., English).
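To make this flow concrete, below is a hedged sketch of the segment-extraction and transcription loop. It uses DeepSpeech's Python bindings (v0.9.x) rather than the native client that deepspeech_csv is based on, and it assumes the start,end,label CSV layout described above; file names are illustrative.

import csv
import wave

import numpy as np
from deepspeech import Model

SAMPLE_RATE = 16000  # deepspeech_csv expects 16kHz mono 16-bit PCM WAV input

def transcribe_with_csv(wav_path, csv_path, model_path, scorer_path, out_path):
    model = Model(model_path)                # language-specific acoustic model (.pbmm)
    model.enableExternalScorer(scorer_path)  # language-specific scorer file

    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    lines = []
    with open(csv_path, newline="") as f:
        for start, end, label in csv.reader(f):
            if label not in ("male", "female"):   # only speech segments are transcribed
                continue
            lo, hi = int(float(start) * SAMPLE_RATE), int(float(end) * SAMPLE_RATE)
            text = model.stt(audio[lo:hi])        # run ASR on the speech segment
            lines.append(f"[{float(start):.1f}s] {text}")  # prepend the start time

    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")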
Figure 1: SpeechToText's execution flow. Within Autopsy, the data source ingest module (vad_check_ingest) and the report module (ast_report) pass audio files to ina_speech_segmenter, which performs Voice Activity Detection (VAD) and writes a CSV file; deepspeech_csv then feeds the speech segments to MozDeepSpeech (ASR) to produce the transcribed text.
For each audio or video file, vad_check_ingest launches an external process (ina_speech_segmenter) which detects whether the file has speech content or not, and the gender of the speaker (male/female). The file type is determined by Autopsy's File Type Identification module, which detects the file type based on its internal byte signature. This module must be run before running vad_check_ingest. If the file is a video file, then ffmpeg6 is used to determine if the file has audio streams. For both audio and video files, ffmpeg is then used to convert the audio to a 16kHz mono 16-bit PCM-encoded WAV file, the format expected by ina_speech_segmenter and deepspeech_csv, and needed by the MozDeepSpeech engine. The audio is upsampled or downsampled to 16kHz depending on the original audio file's sample rate. In the case of a stereo or multi-channel audio file, all channels are summed. The ffmpeg processes are run in parallel using a pool of threads of size n, with n the number of available CPU cores.
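A minimal sketch of this conversion step follows, calling ffmpeg through a thread pool; the flags are standard ffmpeg options, and the input file names are hypothetical.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def to_wav_16k_mono(src, dst):
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vn",                    # ignore any video stream
         "-ac", "1",               # downmix all channels to mono
         "-ar", "16000",           # resample to 16 kHz
         "-acodec", "pcm_s16le",   # 16-bit signed PCM
         dst],
        check=True, capture_output=True)
    return dst

# Hypothetical inputs; the thread pool mirrors the "one thread per CPU core" scheme.
files = ["voice_note.opus", "video_message.mp4"]
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    wav_files = list(pool.map(lambda f: to_wav_16k_mono(f, f + ".wav"), files))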
The vad_check_ingest module has a configuration panel (see Figure 2) with parameters that determine which files are marked as "interesting items" through an Autopsy tag. Given that a device might contain many audio files without speech, such as music collections and operating system audio notifications, it is important to be able to find the files on the device that do have speech activity. A file is tagged if both the percentage of the audio file which has speech activity and the total duration of segments with speech are above the thresholds set in the configuration panel. A 5s file containing 4s of speech (as an example, the sentence "How are you today?") will not be tagged if, for instance, the minimum amount of speech required is 20s.

6 https://www.ffmpeg.org/

Table 1: SpeechToText's applications
The panel also has an option to determine whether files which have been tagged should be automatically transcribed using MozDeepSpeech. Since transcribing files is an operation that can take a significant amount of time, a user who expects that many files with speech activity will be found, and who needs to quickly transcribe only a subset of them, can manually tag those files and perform a selective transcription using ast_report. Finally, the language assumed for the transcription can be selected from a drop-down menu. The module currently has English and Mandarin Chinese available, but as more MozDeepSpeech language models become available, they can be used by the software by placing them in a specific folder.
Figure 2: Details of the Autopsy user interface for vad_check_ingest: (a) new artifacts created by vad_check_ingest; (b) vad_check_ingest's configuration panel.

The actual detection of the files with speech is done by ina_speech_segmenter. After ina_speech_segmenter has executed, vad_check_ingest sums the duration of all segments labeled as female or male by ina_speech_segmenter, obtaining the total duration of segments with speech (male + female), and tags the file as "Audio file with speech" (if total dur. > 0), "Audio file with speech - male" (if male dur. > 0) and "Audio file with speech - female" (if female dur. > 0). If the duration of the audio file, which is determined by ffmpeg, is less than the minimum total duration of segments with speech set in the configuration panel, the file is excluded and not passed to ina_speech_segmenter, reducing ina_speech_segmenter's and deepspeech_csv's runtimes.
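A minimal sketch of this tagging rule is shown below; the threshold values are the ones used in the experiment of Section 4 (20s of speech, 30% speech content), and the segment tuples follow the (label, start, end) layout used earlier.

MIN_SPEECH_SECONDS = 20.0
MIN_SPEECH_PERCENT = 30.0

def speech_tags(segments, file_duration):
    """Return the tags to apply to one audio file, or [] if it is not tagged."""
    male = sum(e - s for l, s, e in segments if l == "male")
    female = sum(e - s for l, s, e in segments if l == "female")
    speech = male + female
    percent = 100.0 * speech / file_duration if file_duration else 0.0
    if speech < MIN_SPEECH_SECONDS or percent < MIN_SPEECH_PERCENT:
        return []
    tags = ["Audio file with speech"]
    if male > 0:
        tags.append("Audio file with speech - male")
    if female > 0:
        tags.append("Audio file with speech - female")
    return tags

# The 5s example from the text: 4s of speech fails the 20s minimum.
print(speech_tags([("female", 0.5, 4.5)], 5.0))   # []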
After ina_speech_segmenter has completed analyzing all files, the files where speech was detected and which match the required conditions are displayed in the interesting items node of Autopsy's Directory Tree panel. When the transcription option is activated in the configuration panel, these files are also sent to deepspeech_csv. For each file, the written transcript returned by deepspeech_csv is imported into the Autopsy case as a blackboard artifact of type "TSK_EXTRACTED_TEXT". The transcripts appear in the graphical user interface of Autopsy, under "Results", "Extracted Content", "Extracted Text" (see left of Figure 2).
If Autopsy's keyword search ingest module is active in the case under analysis, then after running vad_check_ingest any matches in the transcribed text will automatically appear in the Directory Tree panel, in the keyword matches node (see Figure 3).
Figure 3: Keyword search of text transcribed with vad_check_ingest.
Figure 4: The ast_report module's configuration panel.
text files with the transcribed text. The deepspeech_iss.py script can be run on a machine without an installed Autopsy version, as it has no dependency on Autopsy.
The ina_speech_segmenter and deepspeech_csv versions bundled inside the SpeechToText module both support Nvidia GPUs via the CUDA framework (Nickolls et al., 2008), and fall back to the CPU if no CUDA-compatible GPU is available.
To test the performance and accuracy of the SpeechToText module, the test-clean set from the LibriSpeech corpus (Panayotov et al., 2015) was imported into an Autopsy case. The set consists of 2620 recordings with clean speech by native English speakers. vad_check_ingest was run twice on this set, once using the GPU and a second time using the CPU7. Of the 2620 files, ina_speech_segmenter determined 2619 to have speech. The total duration of those 2619 files is 19450.4s. The files contain almost no silence.
The tests were conducted on a laptop with an Intel(R) Core(TM) i7-
8750H CPU (with base frequency 2.20GHz and max turbo frequency 4.10
GHz) and a 6 GiB GeForce RTX 2060 Mobile GPU. The results can be seen
in Table 2.
The real time factor8 (RTF) of the Autopsy ingest was 0.16 using the GPU and 0.42 using the CPU. As expected, using a GPU significantly reduces the time needed to process a forensic image with the SpeechToText module. Regarding the accuracy of the transcription, the WER for this set when transcribed with the SpeechToText module was 7.80%. This is somewhat higher than the value of 5.3% reported for Baidu's DeepSpeech 2 (Amodei et al., 2015) on the same set, and very close to the value of 7.06% given by Mozilla for the same version of MozDeepSpeech (v0.9.3)9.
The WER was calculated using the formula

\[
\mathrm{WER} = \frac{\sum_{i=1}^{M} \mathrm{lev}(t_i)}{\sum_{i=1}^{M} \mathrm{len}(r_i)},
\]

where lev(t_i) is the word-level Levenshtein distance between transcript t_i and its reference r_i, len(r_i) is the number of words in reference r_i, and M is the number of files.
7 By setting CUDA_VISIBLE_DEVICES=-1.
8 The real time factor is the elapsed time for the transcription process divided by the duration of the audio file.
9 https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3
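For reference, a small self-contained sketch of this computation (the word-level Levenshtein distance summed over all files and divided by the total number of reference words):

def word_levenshtein(hyp, ref):
    """Word-level edit distance between a hypothesis and a reference string."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (hw != rw)))  # substitution
        prev = cur
    return prev[-1]

def corpus_wer(transcripts, references):
    errors = sum(word_levenshtein(t, r) for t, r in zip(transcripts, references))
    words = sum(len(r.split()) for r in references)
    return errors / words

# One substitution over five reference words gives a WER of 0.2.
print(corpus_wer(["universal declaration of human right"],
                 ["universal declaration of human rights"]))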
4. Transcription of audio from mobile applications featuring voice
recordings
The effectiveness of the SpeechToText module was evaluated by simulat-
ing a real-world scenario: the generation of transcripts of conversations found
in audio files extracted from a smartphone during a forensic investigation.
Different applications save audio files which can contain conversations of in-
terest for forensic investigators. Some of those applications are video and
audio recorders, (audio) note-taking applications, mobile messaging applica-
tions, which allow the exchange of audio or video messages, and collaboration
and videoconference applications.
Messaging applications (MA) enable the smartphone to send and receive
messages containing text, photos, audio, video, and files through a mobile or
fixed internet connection to and from individuals on a contact list or group.
MAs have become very popular, being used monthly by several billion people
(Facebook, 2017). In some regions, such as Brazil, communicating via audio
messages exchanged in MAs is widespread (Maros et al., 2020). Collaboration
and videoconference applications, such as Zoom and Microsoft Teams, also allow the exchange of audio messages.
The effectiveness of the SpeechToText module was evaluated by setting up
an experiment where different Android applications which can generate audio
files containing speech were installed on a smartphone. The objectives of the
experiment were: 1) to determine whether the SpeechToText module was
able to automatically detect the audio files and subsequently automatically
transcribe them; 2) to determine where audio messages or notes are saved on
the device by each application; and 3) to determine the audio coding format
used by the audio files.
Table 3: Applications installed on smartphone for forensic analysis of audio files. The
columns “Subj. To” and “Subj. From” show which subject sent the message from or to
the smartphone.
The recorded messages consisted of readings from the Universal Declaration of Human Rights, in English, with a duration of around 1 minute each. The messages were spoken by four different individuals, all non-native English speakers. In the case of Houseparty, a video message (Facemail) was sent, since Houseparty currently does not offer audio-only voice messages. Regarding the Evernote application, a note was created containing recorded sound. In the case of the Moto Camera One application, a video was recorded. A total of 26 voice recordings were created using the applications.
The extraction and forensic analysis of the smartphone were conducted on the same computer used for deepspeech_csv's performance analysis, running Linux (Debian 10) and Autopsy 4.17.0, and making use of the Nvidia GPU. The version of SpeechToText used was built with DeepSpeech version v0.9.3-0-gf2e9c85 and TensorFlow version v2.3.0-6-g23ad988.
Given that the smartphone was rooted, forensic images of the data and cache Android partitions were created by block-level copying. This was achieved by flashing the Team Win Recovery Project (TWRP10) custom recovery image and using adb to run dd. Since Autopsy cannot process images containing the Samsung-developed F2FS filesystem, the images were instead loop-mounted in read-only mode onto a directory, and that directory was imported as logical files.
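A minimal sketch of this acquisition step under a custom recovery follows; the by-name block-device paths and mount points are hypothetical placeholders (they vary per device and are not given in the paper), and adb, dd and mount are the only tools used, as described above.

import subprocess

# Hypothetical by-name paths; the actual block devices differ between models.
PARTITIONS = {
    "data":  "/dev/block/bootdevice/by-name/userdata",
    "cache": "/dev/block/bootdevice/by-name/cache",
}

for name, dev in PARTITIONS.items():
    image = f"{name}.img"
    # Block-level copy streamed through adb exec-out while booted into TWRP.
    with open(image, "wb") as out:
        subprocess.run(["adb", "exec-out", "dd", f"if={dev}", "bs=4M"],
                       stdout=out, check=True)
    # Loop-mount the image read-only (mount point assumed to exist) so the
    # directory can be added to Autopsy as logical files.
    subprocess.run(["sudo", "mount", "-o", "ro,loop", "-t", "f2fs",
                    image, f"/mnt/{name}"], check=True)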
During the initial import of the directories as new data sources, Autopsy's File Type Identification module was run. Subsequently, the vad_check_ingest ingest module was run with the file selection conditions dur ≥ 20s and sp ≥ 30%, where dur is the audio stream duration and sp is the percentage of speech content. The ingest process was also configured to automatically transcribe all files matching these criteria.
4.2. Results
In the initial ingest, Autopsy's File Type Identification module found a total of 50 audio or video files. The subsequent vad_check_ingest ingest process found and transcribed 28 files, all of which contained voice recordings created purposely for this experiment, although several were duplicate files saved in a different location by the applications, and one was a preliminary voice recording test. Of the other 22 files which were not selected for transcription, 20 had a duration of less than 20s and the remaining two were ringtones.
10 https://twrp.me/
Table 4: Results of the ingest process. Rec = number of experiment voice recordings found. Tr. Files = number of files transcribed. A/V = audio/video. Dur = total duration of the 30 audio/video files. Ina = ina_speech_segmenter elapsed time. DS = deepspeech_csv elapsed time. Ing = ingest duration. RTF = real time factor.
Columns: Rec | Tr. Files | A/V | Dur (s) | Ina (s) | DS (s) | Ing (s) | RTF
Table 6: Path of audio files generated by applications containing voice recordings
Evernote: /media/0/Android/data/com.evernote/files/Temp/Shared/
Facebook Messenger: /data/com.facebook.orca/cache/fb temp/
Google Duo: /media/0/Duo/
Moto Camera One: /media/0/DCIM/Camera/
Skype: /data/com.skype.raider/cache/071419883E91F454201B5D2F53337571DEC9A86AF8670441660DBB7B35232C43/RNManualFileCache/
Teams: /media/0/MicrosoftTeams/Media/VoiceMessages/
Telegram: /media/0/Telegram/TelegramAudio/
Viber: /media/0/Android/data/com.viber.voip/files/.ptt/
WeChat: /media/0/tencent/MicroMsg/24a59238ae32c61707a6e7c0d3db9ed9/voice2/1a/09/
WhatsApp: /media/0/WhatsApp/Media/WhatsAppVoiceNotes/202013/
Zoom: /data/us.zoom.videomeetings/data/[email protected]/zwzyt6-yqia [email protected]/
11 https://gist.github.com/Kronopath/c94c93d8279e3bac19f2/c9ee08870c445b882c0bf47be4f77836fa2d048e
The average WER of the voice messages transcribed during the ingest was 27.2%, ranging from a minimum of 17.0% to a maximum of 35.0%.
The first sentence from the transcribed text of one of the audio messages, which has a WER close to the average (26.4%), is reproduced below:
universal declaration of human rights article on all human beings
are born free and equal in dignity and rights there and though
with reason and conscious and shoulderwards one another in a
spirit of brotherhood
The elapsed time for the ingest process, including the additional WeChat files, was 233.26s, corresponding to a real time factor of 0.15 (see Table 4). The elapsed time includes converting the files to the 16kHz mono 16-bit PCM-encoded WAV format, running ina_speech_segmenter to detect files with speech and provide segmentation, and subsequently executing deepspeech_csv to transcribe the files using MozDeepSpeech. The elapsed time does not include the previous ingest, which ran Autopsy's File Type Identification module.
Table 5 shows the audio coding format used by each application to store the voice messages. Table 6 lists the paths where the audio files generated by the applications containing the voice recordings are stored. Facebook Messenger, Skype, and Zoom store the audio files in the data partition, while the other applications store the audio files in the media partition, which can be accessed directly. To access Facebook Messenger, Skype, and Zoom audio files, either rooting or an equivalent method is required.
5. Discussion
The audio files generated by most of the applications used are easily found, as they are saved as standard audio files in application-specific storage. In the case of WeChat, the audio files were saved in a headerless audio format. On the other hand, audio/video files from the Houseparty, Signal, and Snapchat apps were not detected by Autopsy's File Type Identification module, possibly due to encryption, and thus were left out of the experiment. Therefore, to use the SpeechToText plugin to process files from these three applications, other tools must first be used to extract the audio/video files.
Considering the processed WeChat files, a total of 20 out of the 26 voice recordings were found and transcribed. All the voice recordings of the experiment for which a corresponding audio file was found by Autopsy (considering also the processed WeChat files) were correctly identified by vad_check_ingest as containing speech. Two files which did not contain any speech (ringtones) were also analyzed by ina_speech_segmenter due to fulfilling the duration condition (dur > 20s). Neither of them was labeled by vad_check_ingest as fulfilling the 30% speech content condition. This corresponds to an identification accuracy of 100%.
Manson et al. (2013) reported that the average word rate for a telephone conversation between English speakers in a particular study was 214 WPM. A skilled typist has a typing speed in the range of 70 WPM (Snyder et al., 2014), which corresponds to a real-time factor of roughly 3.1 for transcribing such speech by typing. The real-time factor for the ingest process in this experiment was 0.15, which is 20.4 times faster than a skilled typist. It can be concluded that using the SpeechToText module will speed up the process of obtaining transcriptions of voice conversations found on a smartphone. Besides, computers can work 24/7 and GPU capabilities have been increasing regularly, which should further boost performance. It should be noted that the 20.4x figure is a conservative estimate, since it does not include the time necessary for manually determining which files should be transcribed from a larger list of audio files.
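As a quick check, the two figures above follow directly from the cited rates:

\[
\frac{214\ \text{WPM}}{70\ \text{WPM}} \approx 3.1,
\qquad
\frac{3.1}{0.15} \approx 20.4 .
\]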
The WER was somewhat high at 27.2%, although one must take into
account that the sound files produced by the combinations of the apps and
smartphones used were not of high quality. Some had a considerable amount
of noise and artifacts. Furthermore, none of the speakers is a native English speaker, a factor known to lower transcription performance. In addition, the text had a somewhat difficult vocabulary in terms of pronunciation. Despite a
relatively high WER, it was possible to understand most of the text produced
by the plugin. The quality of transcription produced is considered adequate
for the purpose of quickly finding relevant information in the triage stage
of a forensic investigation, which was the original goal. Moreover, even if
the WER values preclude the direct use of the automated transcriptions in
formal forensic reports, SpeechToText’s transcripts can serve as drafts for
human transcribers, still saving precious time.
As described before, for the test-clean set of LibriSpeech the output of the SpeechToText plugin had a WER of 7.8%, which is close to the value of 7.06% given by Mozilla for the same version of MozDeepSpeech (v0.9.3)12. The plugin therefore has the expected accuracy when used with speech from native English speakers and good-quality audio recordings.
12 https://github.com/mozilla/DeepSpeech/releases/tag/v0.9.3
Using the SpeechToText plugin together with Autopsy's keyword search allows quickly narrowing down conversations of interest among the numerous recordings present on a forensic image, in an automated way. It is possible that the SpeechToText plugin will mistranscribe one of the words defined as relevant keywords by the forensic examiner. Taking that into account, one can add misspellings and similar-sounding words to the items in the keyword list to improve the probability of a keyword match.
6. Conclusion
This work presents the SpeechToText plugin, a digital forensic tool leveraging open-source machine learning software for the automatic detection and transcription of voice recordings.
We showed that the SpeechToText plugin can automatically detect, with high accuracy, audio files containing speech in a forensic image of a smartphone. The
audio files were generated by different applications, namely by messaging
applications such as WhatsApp and Facebook Messenger. We also demon-
strated that the plugin can automatically transcribe files where speech was
detected, or files manually tagged, generating transcribed text with a WER
which is adequate for the purpose of discovering relevant information in a
forensic investigation. We showed that the detection and transcription, given
adequate hardware, can be performed faster than real-time, therefore signifi-
cantly decreasing the time required to find relevant information contained in
audio files present in a forensic image, and freeing valuable human resources.
We also showed that the most popular messaging applications store audio
or video messages unencrypted on the device, listing the locations where such
files are likely to be found and corresponding audio coding formats.
As future work, we plan to add support for other ASR engines, making it possible for users to select the most convenient ASR engine and model. We also plan to test the software with recordings that mimic real-world conditions, such as a poor signal-to-noise ratio or multiple overlapping voices in a single recording.
Acknowledgment
This work was partially supported by CIIC under the FCT project UIDB-
04524-2020, and by FCT/MCTES and EU funds under the project UIDB/EEA/50008/2020.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M.,
Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R.,
Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden,
P., Wicke, M., Yu, Y., & Zheng, X. (2016). TensorFlow: A System for
Large-Scale Machine Learning. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16) (pp. 265–283).
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E.,
Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G. et al. (2016).
Deep speech 2: End-to-End Speech Recognition in English and Mandarin.
In M. F. Balcan, & K. Q. Weinberger (Eds.), Proceedings of The 33rd
International Conference on Machine Learning (pp. 173–182). New York,
New York, USA: PMLR volume 48 of Proceedings of Machine Learning
Research. URL: http://proceedings.mlr.press/v48/amodei16.html.
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro,
B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel,
J., Fan, L., Fougner, C., Han, T., Hannun, A., Jun, B., LeGresley, P.,
Lin, L., Narang, S., Ng, A., Ozair, S., Prenger, R., Raiman, J., Satheesh,
S., Seetapun, D., Sengupta, S., Wang, Y., Wang, Z., Wang, C., Xiao, B.,
Yogatama, D., Zhan, J., & Zhu, Z. (2015). Deep Speech 2: End-to-End
Speech Recognition in English and Mandarin. arXiv:1512.02595 [cs].
Azfar, A., Choo, K.-K. R., & Liu, L. (2016). An Android Communication
App Forensic Taxonomy. Journal of Forensic Sciences, 61 , 1337–1350.
doi:10.1111/1556-4029.13164.
Barr-Smith, F., Farrant, T., Leonard-Lagarde, B., Rigby, D., Rigby, S., &
Sibley-Calder, F. (2021). Dead Man’s Switch: Forensic Autopsy of the
Nintendo Switch. Forensic Science International: Digital Investigation,
(p. 301110). URL: https://doi.org/10.1016/j.fsidi.2021.301110.
doi:10.1016/j.fsidi.2021.301110.
Chang, J.-H., Kim, N. S., & Mitra, S. K. (2006). Voice activity detection
based on multiple statistical models. IEEE Transactions on Signal Pro-
cessing, 54 , 1965–1976.
Chellapilla, K., Puri, S., & Simard, P. (2006). High performance convolu-
tional neural networks for document processing. In Tenth International
Workshop on Frontiers in Handwriting Recognition. Suvisoft.
Doukhan, D., Carrive, J., Vallet, F., Larcher, A., & Meignier, S. (2018a).
An open-source speaker gender detection framework for monitoring gender
equality. In Acoustics Speech and Signal Processing (ICASSP), 2018 IEEE
International Conference On. IEEE.
Doukhan, D., Lechapt, E., Evrard, M., & Carrive, J. (2018b). INA’S MIREX
2018 music and speech detection system. In Music Information Retrieval
Evaluation eXchange (MIREX 2018).
Errattahi, R., El Hannani, A., & Ouahmane, H. (2018). Automatic Speech
Recognition Errors Detection and Correction: A Review. Procedia Com-
puter Science, 128 , 32 – 37. doi:10.1016/j.procs.2018.03.005. 1st
International Conference on Natural Language and Speech Processing.
Guo, J., Sainath, T. N., & Weiss, R. J. (2019). A Spelling Correction Model
for End-to-end Speech Recognition. In ICASSP 2019 - 2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE. doi:10.1109/icassp.2019.8683745.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E.,
Prenger, R., Satheesh, S., Sengupta, S., Coates, A., & Ng, A. Y. (2014).
Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567 [cs].
Heo, H.-S., So, B.-M., Yang, I.-H., Yoon, S.-H., & Yu, H.-J. (2019). Auto-
mated recovery of damaged audio files using deep neural networks. Digital
Investigation, 30 , 117–126. doi:10.1016/j.diin.2019.07.007.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 ,
436–444. doi:10.1038/nature14539.
Li, B., Zhou, E., Huang, B., Duan, J., Wang, Y., Xu, N., Zhang, J., &
Yang, H. (2014). Large scale recurrent neural network on GPU. In 2014
International Joint Conference on Neural Networks (IJCNN) (pp. 4062–
4069). doi:10.1109/IJCNN.2014.6889433.
Mahr, A., Cichon, M., Mateo, S., Grajeda, C., & Baggili, I. (2021). Zooming
into the pandemic! A forensic analysis of the Zoom Application. Forensic
Science International: Digital Investigation, 36 , 301107. doi:10.1016/j.
fsidi.2021.301107.
Manson, J. H., Bryant, G. A., Gervais, M. M., & Kline, M. A. (2013). Con-
vergence of speech rate in conversation predicts cooperation. Evolution
and Human Behavior , 34 , 419–426. doi:10.1016/j.evolhumbehav.2013.
08.001.
Maros, A., Almeida, J., Benevenuto, F., & Vasconcelos, M. (2020). Analyz-
ing the Use of Audio Messages in WhatsApp Groups. In Proceedings of
The Web Conference 2020 WWW ’20 (pp. 3005–3011). Taipei, Taiwan:
Association for Computing Machinery. doi:10.1145/3366423.3380070.
Morris, A. C., Maier, V., & Green, P. D. (2004). From WER and RIL to MER
and WIL: Improved evaluation measures for connected speech recognition.
In INTERSPEECH .
Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008). Scalable Parallel
Programming with CUDA. Queue, 6 , 40–53. doi:10.1145/1365490.1365
500.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech:
An ASR corpus based on public domain audio books. In 2015 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 5206–5210). doi:10.1109/ICASSP.2015.7178964.
Peinl, R., Rizk, B., & Szabad, R. (2020). Open Source Speech Recognition
on Edge Devices. In 2020 10th International Conference on Advanced
Computer Information Technologies (ACIT). IEEE. doi:10.1109/acit49
673.2020.9208978.
Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchin-
sky, V., & Collobert, R. (2019). Wav2Letter++: A Fast Open-source
Speech Recognition System. In ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
doi:10.1109/icassp.2019.8683535.
Quick, D., & Choo, K.-K. R. (2014). Impacts of increasing volume of digital
forensic data: A survey and future research challenges. Digital Investiga-
tion, 11 , 273–294. doi:10.1016/j.diin.2014.09.002.
Ravanelli, M., Parcollet, T., & Bengio, Y. (2019). The Pytorch-kaldi Speech
Recognition Toolkit. In ICASSP 2019 - 2019 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
doi:10.1109/icassp.2019.8683713.
S, K., & E, C. (2016). A review on automatic speech recognition architec-
ture and approaches. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 9 , 393–404. doi:10.14257/ijsip.2
016.9.4.34.
Snyder, K. M., Ashitaka, Y., Shimada, H., Ulrich, J. E., & Logan, G. D.
(2014). What skilled typists don’t know about the QWERTY keyboard.
Attention, Perception, & Psychophysics, 76 , 162–171.
Tencent (2020). Tencent Announces 2020 First Quarter Results. https://cdc-tencent-com-1258344706.image.myqcloud.com/uploads/2020/05/18/13009f73ecab16501df9062e43e47e67.pdf.
Wang, D., & Chen, J. (2018). Supervised Speech Separation Based on Deep
Learning: An Overview. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26 , 1702–1726. doi:10.1109/taslp.2018.2842159.
Watanabe, S., Delcroix, M., Metze, F., & Hershey, J. R. (Eds.) (2017). New
Era for Robust Speech Recognition. Springer International Publishing.
doi:10.1007/978-3-319-64680-0.
Wu, S., Zhang, Y., Wang, X., Xiong, X., & Du, L. (2017). Forensic analysis of
WeChat on Android smartphones. Digital Investigation, 21 , 3–10. doi:10
.1016/j.diin.2016.11.002.
Ying, D., Yan, Y., Dang, J., & Soong, F. K. (2011). Voice Activity Detection
Based on an Unsupervised Learning Framework. IEEE Transactions on
Audio, Speech, and Language Processing, 19 , 2624–2633. doi:10.1109/ta
sl.2011.2125953.
Yu, D., & Deng, L. (2015). Automatic Speech Recognition. Springer London.
URL: https://doi.org/10.1007/978-1-4471-5779-3. doi:10.1007/97
8-1-4471-5779-3.
Zhang, X.-L., & Wu, J. (2013). Deep Belief Networks Based Voice Activity
Detection. IEEE Transactions on Audio, Speech, and Language Processing,
21 , 697–710. doi:10.1109/tasl.2012.2229986.