Search | arXiv e-print repository

Developing Acoustic Models for Automatic Speech Recognition in Swedish

Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified… ▽ More This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 16 pages, 7 figures

MSC Class: 68T10 ACM Class: I.5.0; I.2.0; I.2.7

Journal ref: European Student Journal of Language and Speech, 1999

arXiv:2401.06588 [pdf, other]

doi 10.1016/j.specom.2005.05.005

Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints

Authors: Giampiero Salvi

Abstract: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolut… ▽ More This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency. △ Less

Submitted 12 January, 2024; originally announced January 2024.

ACM Class: I.5.0; I.2.7; E.4

Journal ref: Speech Communication Volume 48, Issue 7, July 2006, Pages 802-818

arXiv:2401.05717 [pdf, other]

doi 10.1016/j.specom.2006.07.009

Segment Boundary Detection via Class Entropy Measurements in Connectionist Phoneme Recognition

Authors: Giampiero Salvi

Abstract: This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of th… ▽ More This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural network based procedure. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 msec of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries. △ Less

Submitted 12 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

ACM Class: I.5.0; I.2.7; E.4

Journal ref: Speech Communication Volume 48, Issue 12, December 2006, Pages 1666-1676

arXiv:2307.06701 [pdf, other]

S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction

Authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi

Abstract: We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capab… ▽ More We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the ST-PixelCNN's ability at handling spatiotemporal information, S-HR-VQVAE can better deal with chief challenges in video prediction. These include learning spatiotemporal information, handling high dimensional data, combating blurry prediction, and implicit modeling of physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques both in quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters. △ Less

Submitted 11 June, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: 14 pages, 7 figures, 3 tables. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence on 2023-07-12

ACM Class: I.2.10; I.4.10; I.4.5; I.4.2; I.2.6

arXiv:2208.04554 [pdf, other]

Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation

Authors: Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi

Abstract: We propose a multi-layer variational autoencoder method, we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previou… ▽ More We propose a multi-layer variational autoencoder method, we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results demonstrate that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than the baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, providing further verification of the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces the decoding search time, making the method particularly suitable for high-load tasks and ii) allows to increase the codebook size without incurring the codebook collapse problem. △ Less

Submitted 9 August, 2022; originally announced August 2022.

Comments: 12 pages plus supplementary material. Submitted to BMVC 2022

ACM Class: I.4; I.2

arXiv:2106.06147 [pdf, other]

doi 10.1109/TPAMI.2022.3194311

NAAQA: A Neural Architecture for Acoustic Question Answering

Authors: Jerome Abdelnour, Jean Rouat, Giampiero Salvi

Abstract: The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of var… ▽ More The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of variable duration scenes, and scenes built with elementary sounds that differ between training and test set. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by ~17 percentage points. On the other hand, frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% of accuracy on the AQA task with ~4 times fewer parameters than the previously explored VQA model. We evaluate the perfomance of NAAQA on an independent data set reconstructed from DAQA. We also test the addition of a MALiMo module in our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2 as well as NAAQA to foster research in this newly emerging machine learning task. △ Less

Submitted 12 January, 2024; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) in April 2021 (first revision February 2022)

ACM Class: I.2.7; I.2.10; I.5.0

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, Volume: 45 Issue: 4, Page(s): 4997-5009

arXiv:2009.05184 [pdf, other]

STEP-GAN: A Step-by-Step Training for Multi Generator GANs with application to Cyber Security in Power Systems

Authors: Mohammad Adiban, Arash Safari, Giampiero Salvi

Abstract: In this study, we introduce a novel unsupervised countermeasure for smart grid power systems, based on generative adversarial networks (GANs). Given the pivotal role of smart grid systems (SGSs) in urban life, their security is of particular importance. In recent years, however, advances in the field of machine learning, have raised concerns about cyber attacks on these systems. Power systems, amo… ▽ More In this study, we introduce a novel unsupervised countermeasure for smart grid power systems, based on generative adversarial networks (GANs). Given the pivotal role of smart grid systems (SGSs) in urban life, their security is of particular importance. In recent years, however, advances in the field of machine learning, have raised concerns about cyber attacks on these systems. Power systems, among the most important components of urban infrastructure, have, for example, been widely attacked by adversaries. Attackers disrupt power systems using false data injection attacks (FDIA), resulting in a breach of availability, integrity, or confidential principles of the system. Our model simulates possible attacks on power systems using multiple generators in a step-by-step interaction with a discriminator in the training phase. As a consequence, our system is robust to unseen attacks. Moreover, the proposed model considerably reduces the well-known mode collapse problem of GAN-based models. Our method is general and it can be potentially employed in a wide range of one of one-class classification tasks. The proposed model has low computational complexity and outperforms baseline systems about 14% and 41% in terms of accuracy on the highly imbalanced publicly available industrial control system (ICS) cyber attack power system dataset. △ Less

Submitted 10 September, 2020; originally announced September 2020.

arXiv:1902.11280 [pdf, ps, other]

From Visual to Acoustic Question Answering

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the new task of Acoustic Question Answering (AQA) to promote research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed by a combination of elementary sounds and answering questions that relate the position and properties of these sounds. The kind of relational questions asked, require that the models perform non-trivial reasoning in order to answer… ▽ More We introduce the new task of Acoustic Question Answering (AQA) to promote research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed by a combination of elementary sounds and answering questions that relate the position and properties of these sounds. The kind of relational questions asked, require that the models perform non-trivial reasoning in order to answer correctly. Although similar problems have been extensively studied in the domain of visual reasoning, we are not aware of any previous studies addressing the problem in the acoustic domain. We propose a method for generating the acoustic scenes from elementary sounds and a number of relevant questions for each scene using templates. We also present preliminary results obtained with two models (FiLM and MAC) that have been shown to work for visual reasoning. △ Less

Submitted 28 February, 2019; originally announced February 2019.

arXiv:1902.09705 [pdf, other]

doi 10.1109/TCDS.2018.2882140

Beyond the Self: Using Grounded Affordances to Interpret and Describe Others' Actions

Authors: Giovanni Saponaro, Lorenzo Jamone, Alexandre Bernardino, Giampiero Salvi

Abstract: We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It fi… ▽ More We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans. △ Less

Submitted 25 February, 2019; originally announced February 2019.

Comments: code available at https://github.com/gsaponaro/tcds-gestures, IEEE Transactions on Cognitive and Developmental Systems

Journal ref: IEEE Transactions on Cognitive and Developmental Systems, vol. 12, no. 2, pp. 209-221, June 2020

arXiv:1811.10561 [pdf, other]

CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR (Johnson et al. 2017). We generate acoustic scenes by leveraging a bank elementary sounds. We also provide a number of func… ▽ More We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR (Johnson et al. 2017). We generate acoustic scenes by leveraging a bank elementary sounds. We also provide a number of functional programs that can be used to compose questions and answers that exploit the relationships between the attributes of the elementary sounds in each scene. We provide AQA datasets of various sizes as well as the data generation code. As a preliminary experiment to validate our data, we report the accuracy of current state of the art visual question answering models when they are applied to the AQA task without modifications. Although there is a plethora of question answering tasks based on text, image or video data, to our knowledge, we are the first to propose answering questions directly on audio streams. We hope this contribution will facilitate the development of research in the area. △ Less

Submitted 26 November, 2018; originally announced November 2018.

Comments: NeurIPS 2018 Visually Grounded Interaction and Language (ViGIL) Workshop

arXiv:1804.02772 [pdf, other]

Active Mini-Batch Sampling using Repulsive Point Processes

Authors: Cheng Zhang, Cengiz Öztireli, Stephan Mandt, Giampiero Salvi

Abstract: The convergence speed of stochastic gradient descent (SGD) can be improved by actively selecting mini-batches. We explore sampling schemes where similar data points are less likely to be selected in the same mini-batch. In particular, we prove that such repulsive sampling schemes lowers the variance of the gradient estimator. This generalizes recent work on using Determinantal Point Processes (DPP… ▽ More The convergence speed of stochastic gradient descent (SGD) can be improved by actively selecting mini-batches. We explore sampling schemes where similar data points are less likely to be selected in the same mini-batch. In particular, we prove that such repulsive sampling schemes lowers the variance of the gradient estimator. This generalizes recent work on using Determinantal Point Processes (DPPs) for mini-batch diversification (Zhang et al., 2017) to the broader class of repulsive point processes. We first show that the phenomenon of variance reduction by diversified sampling generalizes in particular to non-stationary point processes. We then show that other point processes may be computationally much more efficient than DPPs. In particular, we propose and investigate Poisson Disk sampling---frequently encountered in the computer graphics community---for this task. We show empirically that our approach improves over standard SGD both in terms of convergence speed as well as final model performance. △ Less

Submitted 20 June, 2018; v1 submitted 8 April, 2018; originally announced April 2018.

arXiv:1711.09714 [pdf, other]

doi 10.1109/TSMCB.2011.2172420

Language Bootstrapping: Learning Word Meanings From Perception-Action Association

Authors: Giampiero Salvi, Luis Montesano, Alexandre Bernardino, José Santos-Victor

Abstract: We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, ro… ▽ More We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions, and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions, and effects. We show that the robot is able form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow to incorporate context in the speech recognition task. We believe that the encouraging results with our approach may afford robots with a capacity to acquire language descriptors in their operation's environment as well as to shed some light as to how this challenging process develops with human infants. △ Less

Submitted 27 November, 2017; originally announced November 2017.

Comments: code available at https://github.com/giampierosalvi/AffordancesAndSpeech

ACM Class: I.2.9; I.2.10; I.2.7; I.2.6

Journal ref: in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Volume: 42 Issue: 3, year 2012, pages 660-671

arXiv:1711.09055 [pdf, other]

doi 10.21437/GLU.2017-17

Interactive Robot Learning of Gestures, Language and Affordances

Authors: Giovanni Saponaro, Lorenzo Jamone, Alexandre Bernardino, Giampiero Salvi

Abstract: A growing field in robotics and Artificial Intelligence (AI) research is human-robot collaboration, whose target is to enable effective teamwork between humans and robots. However, in many situations human teams are still superior to human-robot teams, primarily because human teams can easily agree on a common goal with language, and the individual members observe each other effectively, leveragin… ▽ More A growing field in robotics and Artificial Intelligence (AI) research is human-robot collaboration, whose target is to enable effective teamwork between humans and robots. However, in many situations human teams are still superior to human-robot teams, primarily because human teams can easily agree on a common goal with language, and the individual members observe each other effectively, leveraging their shared motor repertoire and sensorimotor resources. This paper shows that for cognitive robots it is possible, and indeed fruitful, to combine knowledge acquired from interacting with elements of the environment (affordance exploration) with the probabilistic observation of another agent's actions. We propose a model that unites (i) learning robot affordances and word descriptions with (ii) statistical recognition of human gestures with vision sensors. We discuss theoretical motivations, possible implementations, and we show initial results which highlight that, after having acquired knowledge of its surrounding environment, a humanoid robot can generalize this knowledge to the case when it observes another agent (human partner) performing the same motor actions previously executed during training. △ Less

Submitted 24 November, 2017; originally announced November 2017.

Comments: code available at https://github.com/gsaponaro/glu-gestures

Journal ref: International Workshop on Grounding Language Understanding (GLU), Satellite of Interspeech 2017

arXiv:1711.08992 [pdf, other]

doi 10.1109/TCDS.2019.2927941

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

Authors: Kalin Stefanov, Jonas Beskow, Giampiero Salvi

Abstract: This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robus… ▽ More This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker dependent setting. However, in a speaker independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions. △ Less

Submitted 18 July, 2019; v1 submitted 24 November, 2017; originally announced November 2017.

Comments: 10 pages, IEEE Transactions on Cognitive and Developmental Systems

ACM Class: I.2; I.4; I.5

arXiv:1610.00520 [pdf, other]

Semi-supervised Learning with Sparse Autoencoders in Phone Classification

Authors: Akash Kumar Dhaka, Giampiero Salvi

Abstract: We propose the application of a semi-supervised learning method to improve the performance of acoustic modelling for automatic speech recognition based on deep neural net- works. As opposed to unsupervised initialisation followed by supervised fine tuning, our method takes advantage of both unlabelled and labelled data simultaneously through mini- batch stochastic gradient descent. We tested the m… ▽ More We propose the application of a semi-supervised learning method to improve the performance of acoustic modelling for automatic speech recognition based on deep neural net- works. As opposed to unsupervised initialisation followed by supervised fine tuning, our method takes advantage of both unlabelled and labelled data simultaneously through mini- batch stochastic gradient descent. We tested the method with varying proportions of labelled vs unlabelled observations in frame-based phoneme classification on the TIMIT database. Our experiments show that the method outperforms standard supervised training for an equal amount of labelled data and provides competitive error rates compared to state-of-the-art graph-based semi-supervised learning techniques. △ Less

Submitted 3 October, 2016; originally announced October 2016.

Comments: 5 pages, 1 figure, 2 tables

arXiv:1606.09163 [pdf, other]

Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing

Authors: Akash Kumar Dhaka, Giampiero Salvi

Abstract: We present a systematic analysis on the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs) and Hidden Markov Models (HMMs). The objective is to reduce the latency of the system by reducing the number of future feature frames required to estimate the cu… ▽ More We present a systematic analysis on the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs) and Hidden Markov Models (HMMs). The objective is to reduce the latency of the system by reducing the number of future feature frames required to estimate the current output. Our tests performed on the TIMIT database show that the performance does not degrade when the input window is shifted up to 5 frames in the past compared to common practice (no future frame). This corresponds to improving the latency by 50 ms in our settings. Our tests also show that the best results are not obtained with the symmetric window commonly employed, but with an asymmetric window with eight past and two future context frames, although this observation should be confirmed on other data sets. The reduction in latency suggested by our results is critical for specific applications such as real-time lip synchronisation for tele-presence, but may also be beneficial in general applications to improve the lag in human-machine spoken interaction. △ Less

Submitted 29 June, 2016; originally announced June 2016.

Comments: 4 pages, 3 figures

arXiv:1305.4544 [pdf, other]

Efficient Image Retargeting for High Dynamic Range Scenes

Authors: Govind Salvi, Puneet Sharma, Shanmuganathan Raman

Abstract: Most of the real world scenes have a very high dynamic range (HDR). The mobile phone cameras and the digital cameras available in markets are limited in their capability in both the range and spatial resolution. Same argument can be posed about the limited dynamic range display devices which also differ in the spatial resolution and aspect ratios. In this paper, we address the problem of display… ▽ More Most of the real world scenes have a very high dynamic range (HDR). The mobile phone cameras and the digital cameras available in markets are limited in their capability in both the range and spatial resolution. Same argument can be posed about the limited dynamic range display devices which also differ in the spatial resolution and aspect ratios. In this paper, we address the problem of displaying the high contrast low dynamic range (LDR) image of a HDR scene in a display device which has different spatial resolution compared to that of the capturing digital camera. The optimal solution proposed in this work can be employed with any camera which has the ability to shoot multiple differently exposed images of a scene. Further, the proposed solutions provide the flexibility in the depiction of entire contrast of the HDR scene as a LDR image with an user specified spatial resolution. This task is achieved through an optimized content aware retargeting framework which preserves salient features along with the algorithm to combine multi-exposure images. We show the proposed approach performs exceedingly well in the generation of high contrast LDR image of varying spatial resolution compared to an alternate approach. △ Less

Submitted 20 May, 2013; originally announced May 2013.

Showing 1–17 of 17 results for author: Salvi, G