Search | arXiv e-print repository

MergeUp-augmented Semi-Weakly Supervised Learning for WSI Classification

Authors: Mingxi Ouyang, Yuqiu Fu, Renao Yan, ShanShan Shi, Xitong Ling, Lianghui Zhu, Yonghong He, Tian Guan

Abstract: Recent advancements in computational pathology and artificial intelligence have significantly improved whole slide image (WSI) classification. However, the gigapixel resolution of WSIs and the scarcity of manual annotations present substantial challenges. Multiple instance learning (MIL) is a promising weakly supervised learning approach for WSI classification. Recently research revealed employing… ▽ More Recent advancements in computational pathology and artificial intelligence have significantly improved whole slide image (WSI) classification. However, the gigapixel resolution of WSIs and the scarcity of manual annotations present substantial challenges. Multiple instance learning (MIL) is a promising weakly supervised learning approach for WSI classification. Recently research revealed employing pseudo bag augmentation can encourage models to learn various data, thus bolstering models' performance. While directly inheriting the parents' labels can introduce more noise by mislabeling in training. To address this issue, we translate the WSI classification task from weakly supervised learning to semi-weakly supervised learning, termed SWS-MIL, where adaptive pseudo bag augmentation (AdaPse) is employed to assign labeled and unlabeled data based on a threshold strategy. Using the "student-teacher" pattern, we introduce a feature augmentation technique, MergeUp, which merges bags with low-priority bags to enhance inter-category information, increasing training data diversity. Experimental results on the CAMELYON-16, BRACS, and TCGA-LUNG datasets demonstrate the superiority of our method over existing state-of-the-art approaches, affirming its efficacy in WSI classification. △ Less

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2407.07764 [pdf, other]

PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Authors: Tongkun Guan, Chengyu Lin, Wei Shen, Xiaokang Yang

Abstract: Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax… ▽ More Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods 2.03%/1.22%/2.00%, 1.83%, and 4.62% gains on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2406.18054 [pdf, other]

Leveraging Pre-trained Models for FF-to-FFPE Histopathological Image Translation

Authors: Qilai Zhang, Jiawen Li, Peiran Liao, Jiali Hu, Tian Guan, Anjia Han, Yonghong He

Abstract: The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereb… ▽ More The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereby improving the image quality for diagnostic purposes. In this paper, we propose Diffusion-FFPE, a method for FF-to-FFPE histopathological image translation using a pre-trained diffusion model. Specifically, we employ a one-step diffusion model as the generator and fine-tune it with LoRA adapters using adversarial learning objectives. To ensure that the model effectively captures both global structural information and local details, we propose a multi-scale feature fusion (MFF) module. This module utilizes two VAE encoders to extract features of varying image sizes and performs feature fusion before feeding them into the UNet. Furthermore, we utilize a pre-trained vision-language model for histopathology as the backbone for the discriminator to further improve performance We conducted FF-to-FFPE translation experiments on the TCGA-NSCLC datasets, and our method achieved better performance compared to other methods. The code and models are released at https://github.com/QilaiZhang/Diffusion-FFPE. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.10900 [pdf, other]

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

Abstract: Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine… ▽ More Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine their validity. These motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at the minimum cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.00672 [pdf, other]

Task-oriented Embedding Counts: Heuristic Clustering-driven Feature Fine-tuning for Whole Slide Image Classification

Authors: Xuenian Wang, Shanshan Shi, Renao Yan, Qiehe Sun, Lianghui Zhu, Tian Guan, Yonghong He

Abstract: In the field of whole slide image (WSI) classification, multiple instance learning (MIL) serves as a promising approach, commonly decoupled into feature extraction and aggregation. In this paradigm, our observation reveals that discriminative embeddings are crucial for aggregation to the final prediction. Among all feature updating strategies, task-oriented ones can capture characteristics specifi… ▽ More In the field of whole slide image (WSI) classification, multiple instance learning (MIL) serves as a promising approach, commonly decoupled into feature extraction and aggregation. In this paradigm, our observation reveals that discriminative embeddings are crucial for aggregation to the final prediction. Among all feature updating strategies, task-oriented ones can capture characteristics specifically for certain tasks. However, they can be prone to overfitting and contaminated by samples assigned with noisy labels. To address this issue, we propose a heuristic clustering-driven feature fine-tuning method (HC-FT) to enhance the performance of multiple instance learning by providing purified positive and hard negative samples. Our method first employs a well-trained MIL model to evaluate the confidence of patches. Then, patches with high confidence are marked as positive samples, while the remaining patches are used to identify crucial negative samples. After two rounds of heuristic clustering and selection, purified positive and hard negative samples are obtained to facilitate feature fine-tuning. The proposed method is evaluated on both CAMELYON16 and BRACS datasets, achieving an AUC of 97.13% and 85.85%, respectively, consistently outperforming all compared methods. △ Less

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.05363 [pdf, other]

LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stabilit… ▽ More In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: Accepted to ICRA 2024

arXiv:2404.12777 [pdf, other]

EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation

Authors: Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang

Abstract: In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology. However, its application to large-scale, high-resolution scenes (exceeding 4k$\times$4k pixels) is hindered by the excessive computational requirements for managing a large number of Gaussians. Addressing this, we introduce 'EfficientGS', an advanced approach that optimizes 3DGS for high-res… ▽ More In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology. However, its application to large-scale, high-resolution scenes (exceeding 4k$\times$4k pixels) is hindered by the excessive computational requirements for managing a large number of Gaussians. Addressing this, we introduce 'EfficientGS', an advanced approach that optimizes 3DGS for high-resolution, large-scale scenes. We analyze the densification process in 3DGS and identify areas of Gaussian over-proliferation. We propose a selective strategy, limiting Gaussian increase to key primitives, thereby enhancing the representational efficiency. Additionally, we develop a pruning mechanism to remove redundant Gaussians, those that are merely auxiliary to adjacent ones. For further enhancement, we integrate a sparse order increment for Spherical Harmonics (SH), designed to alleviate storage constraints and reduce training overhead. Our empirical evaluations, conducted on a range of datasets including extensive 4K+ aerial images, demonstrate that 'EfficientGS' not only expedites training and rendering times but also achieves this with a model size approximately tenfold smaller than conventional 3DGS while maintaining high rendering fidelity. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.03187 [pdf, other]

AGL-NET: Aerial-Ground Cross-Modal Global Localization with Varying Scales

Authors: Tianrui Guan, Ruiqi Xian, Xijun Wang, Xiyang Wu, Mohamed Elnoor, Daeun Song, Dinesh Manocha

Abstract: We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps. AGL-NET tackles two critical challenges: bridging the representation gap between image and points modalities for robust feature matching, and handling inherent scale discrepancies between global view and local view. To address these challenges, AGL-NET leverages a unified network… ▽ More We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps. AGL-NET tackles two critical challenges: bridging the representation gap between image and points modalities for robust feature matching, and handling inherent scale discrepancies between global view and local view. To address these challenges, AGL-NET leverages a unified network architecture with a novel two-stage matching design. The first stage extracts informative neural features directly from raw sensor data and performs initial feature matching. The second stage refines this matching process by extracting informative skeleton features and incorporating a novel scale alignment step to rectify scale variations between LiDAR and map data. Furthermore, a novel scale and skeleton loss function guides the network toward learning scale-invariant feature representations, eliminating the need for pre-processing satellite maps. This significantly improves real-world applicability in scenarios with unknown map scales. To facilitate rigorous performance evaluation, we introduce a meticulously designed dataset within the CARLA simulator specifically tailored for metric localization training and assessment. The code and dataset will be made publicly available. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.13235 [pdf, other]

AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments

Authors: Mohamed Elnoor, Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Tianrui Guan, Vignesh Rajagopal, Dinesh Manocha

Abstract: We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps: general knowledge map; traversability history map; and current proprioception map; which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for… ▽ More We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps: general knowledge map; traversability history map; and current proprioception map; which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for navigation. The general knowledge map encodes terrains semantically segmented from visual sensing, and represents a terrain's typically expected traversability. The traversability history map encodes the robot's recent proprioceptive measurements on a terrain and its semantic segmentation as a cost map. Further, the robot's present proprioceptive measurement is encoded as a cost map in the current proprioception map. As the general knowledge map and traversability history map rely on semantic segmentation, we evaluate the reliability of the visual sensory data by estimating the brightness and motion blur of input RGB images and accordingly combine the three cost maps to obtain the coupled traversability cost map used for navigation. Leveraging this adaptive coupling, the robot can depend on the most reliable input modality available. Finally, we present a novel planner that selects appropriate gaits and velocities for traversing challenging outdoor environments using the coupled traversability cost map. We demonstrate AMCO's navigation performance in different real-world outdoor environments and observe 10.8%-34.9% reduction w.r.t. two stability metrics, and up to 50% improvement in terms of success rate compared to current navigation methods. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages

arXiv:2403.11193 [pdf, other]

Neural Markov Random Field for Stereo Matching

Authors: Tongfan Guan, Chen Wang, Yun-Hui Liu

Abstract: Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, the hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by… ▽ More Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, the hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model, where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory, to prevent convergence issues and retain stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks $1^{st}$ on both KITTI 2012 and 2015 leaderboards among all published methods while running faster than 100 ms. This approach significantly outperforms prior global methods, e.g., lowering D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The codes at https://github.com/aeolusguan/NMRF △ Less

Submitted 21 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.10858 [pdf, other]

RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

Authors: Hongbo Chu, Qiehe Sun, Jiawen Li, Yuxuan Chen, Lizhong Zhang, Tian Guan, Anjia Han, Yonghong He

Abstract: Histopathological whole slide image (WSI) analysis with deep learning has become a research focus in computational pathology. The current paradigm is mainly based on multiple instance learning (MIL), in which approaches with Transformer as the backbone are well discussed. These methods convert WSI tasks into sequence tasks by representing patches as tokens in the WSI sequence. However, the feature… ▽ More Histopathological whole slide image (WSI) analysis with deep learning has become a research focus in computational pathology. The current paradigm is mainly based on multiple instance learning (MIL), in which approaches with Transformer as the backbone are well discussed. These methods convert WSI tasks into sequence tasks by representing patches as tokens in the WSI sequence. However, the feature complexity brought by high heterogeneity and the ultra-long sequences brought by gigapixel size makes Transformer-based MIL suffer from the challenges of high memory consumption, slow inference speed, and lack of performance. To this end, we propose a retentive MIL method called RetMIL, which processes WSI sequences through hierarchical feature propagation structure. At the local level, the WSI sequence is divided into multiple subsequences. Tokens of each subsequence are updated through a parallel linear retention mechanism and aggregated utilizing an attention layer. At the global level, subsequences are fused into a global sequence, then updated through a serial retention mechanism, and finally the slide-level representation is obtained through a global attention pooling. We conduct experiments on two public CAMELYON and BRACS datasets and an public-internal LUNG dataset, confirming that RetMIL not only achieves state-of-the-art performance but also significantly reduces computational overhead. Our code will be accessed shortly. △ Less

Submitted 16 March, 2024; originally announced March 2024.

Comments: under review

arXiv:2403.09606 [pdf, ps, other]

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Authors: Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, Furong Huang

Abstract: Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on e… ▽ More Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs' reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs' strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.07719 [pdf, other]

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

Authors: Jiawen Li, Yuxuan Chen, Hongbo Chu, Qiehe Sun, Tian Guan, Anjia Han, Yonghong He

Abstract: Histopathological whole slide images (WSIs) classification has become a foundation task in medical microscopic imaging processing. Prevailing approaches involve learning WSIs as instance-bag representations, emphasizing significant instances but struggling to capture the interactions between instances. Additionally, conventional graph representation methods utilize explicit spatial positions to co… ▽ More Histopathological whole slide images (WSIs) classification has become a foundation task in medical microscopic imaging processing. Prevailing approaches involve learning WSIs as instance-bag representations, emphasizing significant instances but struggling to capture the interactions between instances. Additionally, conventional graph representation methods utilize explicit spatial positions to construct topological structures but restrict the flexible interaction capabilities between instances at arbitrary locations, particularly when spatially distant. In response, we propose a novel dynamic graph representation algorithm that conceptualizes WSIs as a form of the knowledge graph structure. Specifically, we dynamically construct neighbors and directed edge embeddings based on the head and tail relationships between instances. Then, we devise a knowledge-aware attention mechanism that can update the head node features by learning the joint attention score of each neighbor and edge. Finally, we obtain a graph-level embedding through the global pooling process of the updated head, serving as an implicit representation for the WSI classification. Our end-to-end graph representation learning approach has outperformed the state-of-the-art WSI analysis methods on three TCGA benchmark datasets and in-house test sets. Our code is available at https://github.com/WonderLandxD/WiKG. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2402.10340 [pdf, other]

Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

Authors: Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, Amrit Singh Bedi

Abstract: In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplo… ▽ More In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplored yet extremely critical. LLMs and VLMs are highly susceptible to adversarial inputs, prompting a significant inquiry into the safety of robotic systems. This concern is important because robotics operate in the physical world where erroneous actions can result in severe consequences. This paper explores this issue thoroughly, presenting a mathematical formulation of potential attacks on LLM/VLM-based robotic systems and offering experimental evidence of the safety challenges. Our empirical findings highlight a significant vulnerability: simple modifications to the input can drastically reduce system effectiveness. Specifically, our results demonstrate an average performance deterioration of 19.4% under minor input prompt modifications and a more alarming 29.1% under slight perceptual changes. These findings underscore the urgent need for robust countermeasures to ensure the safe and reliable deployment of advanced LLM/VLM-based robotic systems. △ Less

Submitted 16 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2312.05490 [pdf, other]

Shapley Values-enabled Progressive Pseudo Bag Augmentation for Whole Slide Image Classification

Authors: Renao Yan, Qiehe Sun, Cheng Jin, Yiqing Liu, Yonghong He, Tian Guan, Hao Chen

Abstract: In computational pathology, whole slide image (WSI) classification presents a formidable challenge due to its gigapixel resolution and limited fine-grained annotations. Multiple instance learning (MIL) offers a weakly supervised solution, yet refining instance-level information from bag-level labels remains complex. While most of the conventional MIL methods use attention scores to estimate instan… ▽ More In computational pathology, whole slide image (WSI) classification presents a formidable challenge due to its gigapixel resolution and limited fine-grained annotations. Multiple instance learning (MIL) offers a weakly supervised solution, yet refining instance-level information from bag-level labels remains complex. While most of the conventional MIL methods use attention scores to estimate instance importance scores (IIS) which contribute to the prediction of the slide labels, these often lead to skewed attention distributions and inaccuracies in identifying crucial instances. To address these issues, we propose a new approach inspired by cooperative game theory: employing Shapley values to assess each instance's contribution, thereby improving IIS estimation. The computation of the Shapley value is then accelerated using attention, meanwhile retaining the enhanced instance identification and prioritization. We further introduce a framework for the progressive assignment of pseudo bags based on estimated IIS, encouraging more balanced attention distributions in MIL models. Our extensive experiments on CAMELYON-16, BRACS, and TCGA-LUNG datasets show our method's superiority over existing state-of-the-art approaches, offering enhanced interpretability and class-wise insights. We will release the code upon acceptance. △ Less

Submitted 15 May, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: submitted to IEEE TRANSACTIONS ON MEDICAL IMAGING

arXiv:2312.05286 [pdf, other]

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Authors: Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang

Abstract: Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose FreeReal, a real-dom… ▽ More Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images.GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal. △ Less

Submitted 10 July, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: Accepted by ECCV2024

arXiv:2310.14566 [pdf, other]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129… ▽ More We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench. △ Less

Submitted 25 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted to CVPR 2024

arXiv:2310.07191 [pdf, other]

$pκ$-Curves: Interpolatory curves with curvature approximating a parabola

Authors: Zhihao Wang, Juan Cao, Tuan Guan, Zhonggui Chen, Yongjie Jessica Zhang

Abstract: This paper introduces a novel class of fair and interpolatory curves called $pκ$-curves. These curves are comprised of smoothly stitched Bézier curve segments, where the curvature distribution of each segment is made to closely resemble a parabola, resulting in an aesthetically pleasing shape. Moreover, each segment passes through an interpolated point at a parameter where the parabola has an extr… ▽ More This paper introduces a novel class of fair and interpolatory curves called $pκ$-curves. These curves are comprised of smoothly stitched Bézier curve segments, where the curvature distribution of each segment is made to closely resemble a parabola, resulting in an aesthetically pleasing shape. Moreover, each segment passes through an interpolated point at a parameter where the parabola has an extremum, encouraging the alignment of interpolated points with curvature extrema. To achieve these properties, we tailor an energy function that guides the optimization process to obtain the desired curve characteristics. Additionally, we develop an efficient algorithm and an initialization method, enabling interactive modeling of the $pκ$-curves without the need for global optimization. We provide various examples and comparisons with existing state-of-the-art methods to demonstrate the curve modeling capabilities and visually pleasing appearance of $pκ$-curves. △ Less

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2307.06344 [pdf, other]

The Whole Pathological Slide Classification via Weakly Supervised Learning

Authors: Qiehe Sun, Jiawen Li, Jin Xu, Junru Cheng, Tian Guan, Yonghong He

Abstract: Due to its superior efficiency in utilizing annotations and addressing gigapixel-sized images, multiple instance learning (MIL) has shown great promise as a framework for whole slide image (WSI) classification in digital pathology diagnosis. However, existing methods tend to focus on advanced aggregators with different structures, often overlooking the intrinsic features of H\&E pathological slide… ▽ More Due to its superior efficiency in utilizing annotations and addressing gigapixel-sized images, multiple instance learning (MIL) has shown great promise as a framework for whole slide image (WSI) classification in digital pathology diagnosis. However, existing methods tend to focus on advanced aggregators with different structures, often overlooking the intrinsic features of H\&E pathological slides. To address this limitation, we introduced two pathological priors: nuclear heterogeneity of diseased cells and spatial correlation of pathological tiles. Leveraging the former, we proposed a data augmentation method that utilizes stain separation during extractor training via a contrastive learning strategy to obtain instance-level representations. We then described the spatial relationships between the tiles using an adjacency matrix. By integrating these two views, we designed a multi-instance framework for analyzing H\&E-stained tissue images based on pathological inductive bias, encompassing feature extraction, filtering, and aggregation. Extensive experiments on the Camelyon16 breast dataset and TCGA-NSCLC Lung dataset demonstrate that our proposed framework can effectively handle tasks related to cancer detection and differentiation of subtypes, outperforming state-of-the-art medical image classification methods based on MIL. The code will be released later. △ Less

Submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.10003 [pdf, other]

C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction

Authors: Luoyuan Xu, Tao Guan, Yuesong Wang, Wenkai Liu, Zhaojie Zeng, Junle Wang, Wei Yang

Abstract: There is an emerging effort to combine the two popular 3D frameworks using Multi-View Stereo (MVS) and Neural Implicit Surfaces (NIS) with a specific focus on the few-shot / sparse view setting. In this paper, we introduce a novel integration scheme that combines the multi-view stereo with neural signed distance function representations, which potentially overcomes the limitations of both methods.… ▽ More There is an emerging effort to combine the two popular 3D frameworks using Multi-View Stereo (MVS) and Neural Implicit Surfaces (NIS) with a specific focus on the few-shot / sparse view setting. In this paper, we introduce a novel integration scheme that combines the multi-view stereo with neural signed distance function representations, which potentially overcomes the limitations of both methods. MVS uses per-view depth estimation and cross-view fusion to generate accurate surfaces, while NIS relies on a common coordinate volume. Based on this strategy, we propose to construct per-view cost frustum for finer geometry estimation, and then fuse cross-view frustums and estimate the implicit signed distance functions to tackle artifacts that are due to noise and holes in the produced surface reconstruction. We further apply a cascade frustum fusion strategy to effectively captures global-local information and structural consistency. Finally, we apply cascade sampling and a pseudo-geometric loss to foster stronger integration between the two architectures. Extensive experiments demonstrate that our method reconstructs robust surfaces and outperforms existing state-of-the-art methods. △ Less

Submitted 14 August, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by ICCV2023

arXiv:2306.06236 [pdf, other]

iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning

Authors: Xiyang Wu, Rohan Chandra, Tianrui Guan, Amrit Singh Bedi, Dinesh Manocha

Abstract: Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for… ▽ More Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations. We model two distinct incentives for agents' strategies: Behavioral Incentive for high-level decision-making based on their driving behavior or personality and Instant Incentive for motion planning for collision avoidance based on the current traffic state. Our approach enables agents to infer their opponents' behavior incentives and integrate this inferred information into their decision-making and motion-planning processes. We perform experiments on two simulation environments, Non-Cooperative Navigation and Heterogeneous Highway. In Heterogeneous Highway, results show that, compared with centralized training decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in mild and chaotic traffic, with 48.1% higher success rate and 80.6% longer survival time in chaotic traffic. We also compare with a decentralized training decentralized execution (DTDE) baseline IPPO and demonstrate a higher episodic reward of 12.7% and 6.3% in mild traffic and chaotic traffic, 25.3% higher success rate, and 13.7% longer survival time. △ Less

Submitted 21 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.12437 [pdf, other]

SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition

Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Fuxiao Liu, Dinesh Manocha

Abstract: We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception. Our formulation supports various… ▽ More We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve the recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts under different video inputs. By sharing the same objective with the task, our proposed SCP can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama, NECDrone), which consist of scenes with single-agent and multi-agent actions. We further evaluate our approach on ground camera videos to verify the effectiveness and generalization and achieve a 1.0-3.6% improvement on dataset SSV2. We integrate our method into the ROS2 as well. △ Less

Submitted 28 August, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: IROS2024

arXiv:2303.17778 [pdf, other]

CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition

Authors: Tianrui Guan, Aswath Muthuselvam, Montana Hoover, Xijun Wang, Jing Liang, Adarsh Jagan Sathyamoorthy, Damon Conover, Dinesh Manocha

Abstract: We present CrossLoc3D, a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting. Cross-source point cloud data corresponds to point sets captured by depth sensors with different accuracies or from different distances and perspectives. We address the challenges in terms of developing 3D place recognition methods that account for the representati… ▽ More We present CrossLoc3D, a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting. Cross-source point cloud data corresponds to point sets captured by depth sensors with different accuracies or from different distances and perspectives. We address the challenges in terms of developing 3D place recognition methods that account for the representation gap between points captured by different sources. Our method handles cross-source data by utilizing multi-grained features and selecting convolution kernel sizes that correspond to most prominent features. Inspired by the diffusion models, our method uses a novel iterative refinement process that gradually shifts the embedding spaces from different sources to a single canonical space for better metric learning. In addition, we present CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of point cloud data from both aerial and ground LiDAR scans. The point clouds in CS-Campus3D have representation gaps and other features like different views, point densities, and noise patterns. We show that our CrossLoc3D algorithm can achieve an improvement of 4.74% - 15.37% in terms of the top 1 average recall on our CS-Campus3D benchmark and achieves performance comparable to state-of-the-art 3D place recognition method on the Oxford RobotCar. The code and CS-CAMPUS3D benchmark will be available at github.com/rayguan97/crossloc3d. △ Less

Submitted 29 September, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.14502 [pdf, other]

VERN: Vegetation-aware Robot Navigation in Dense Unstructured Outdoor Environments

Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Tianrui Guan, Mason Russell, Damon Conover, Jason Pusey, Dinesh Manocha

Abstract: We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through, from the ones that must be circumvented. Using the vegetation classific… ▽ More We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through, from the ones that must be circumvented. Using the vegetation classification and 2D lidar scans, our method constructs a vegetation-aware traversability cost map that accurately represents the pliable and non-pliable obstacles with lower, and higher traversability costs, respectively. Our cost map construction accounts for misclassifications of the vegetation and further lowers the risk of collisions, freezing and entrapment in vegetation during navigation. Furthermore, we propose holonomic recovery behaviors for the robot for scenarios where it freezes, or gets physically entrapped in dense, pliable vegetation. We demonstrate our method on a Boston Dynamics Spot robot in real-world unstructured environments with sparse and dense tall grass, bushes, trees, etc. We observe an increase of 25-90% in success rates, 10-90% decrease in freezing rate, and up to 65% decrease in the false positive rate compared to existing methods. △ Less

Submitted 25 March, 2023; originally announced March 2023.

Comments: 8 Pages, 5 figures

arXiv:2303.01589 [pdf, other]

doi 10.1109/ICRA48891.2023.10160564

AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Celso M. de Melo, Stephen M. Nogar, Aniket Bera, Dinesh Manocha

Abstract: We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also presen… ▽ More We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset. △ Less

Submitted 2 March, 2023; originally announced March 2023.

Comments: Accepted for publication at ICRA 2023

arXiv:2211.00288 [pdf, other]

Self-supervised Character-to-Character Distillation for Text Recognition

Authors: Tongkun Guan, Wei Shen, Xue Yang, Qi Feng, Zekun Jiang, Xiaokang Yang

Abstract: When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text fea… ▽ More When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code is available at https://github.com/TongkunGuan/CCD. △ Less

Submitted 18 August, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted by ICCV2023

arXiv:2209.07725 [pdf, other]

VINet: Visual and Inertial-based Terrain Classification and Adaptive Navigation over Unknown Terrain

Authors: Tianrui Guan, Ruitao Song, Zhixian Ye, Liangjun Zhang

Abstract: We present a visual and inertial-based terrain classification network (VINet) for robotic navigation over different traversable surfaces. We use a novel navigation-based labeling scheme for terrain classification and generalization on unknown surfaces. Our proposed perception method and adaptive scheduling control framework can make predictions according to terrain navigation properties and lead t… ▽ More We present a visual and inertial-based terrain classification network (VINet) for robotic navigation over different traversable surfaces. We use a novel navigation-based labeling scheme for terrain classification and generalization on unknown surfaces. Our proposed perception method and adaptive scheduling control framework can make predictions according to terrain navigation properties and lead to better performance on both terrain classification and navigation control on known and unknown surfaces. Our VINet can achieve 98.37% in terms of accuracy under supervised setting on known terrains and improve the accuracy by 8.51% on unknown terrains compared to previous methods. We deploy VINet on a mobile tracked robot for trajectory following and navigation on different terrains, and we demonstrate an improvement of 10.3% compared to a baseline controller in terms of RMSE. △ Less

Submitted 1 March, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

arXiv:2209.05722 [pdf, other]

GrASPE: Graph based Multimodal Fusion for Robot Navigation in Unstructured Outdoor Environments

Authors: Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Jing Liang, Tianrui Guan, Utsav Patel, Dinesh Manocha

Abstract: We present a novel trajectory traversability estimation and planning algorithm for robot navigation in complex outdoor environments. We incorporate multimodal sensory inputs from an RGB camera, 3D LiDAR, and the robot's odometry sensor to train a prediction model to estimate candidate trajectories' success probabilities based on partially reliable multi-modal sensor observations. We encode high-di… ▽ More We present a novel trajectory traversability estimation and planning algorithm for robot navigation in complex outdoor environments. We incorporate multimodal sensory inputs from an RGB camera, 3D LiDAR, and the robot's odometry sensor to train a prediction model to estimate candidate trajectories' success probabilities based on partially reliable multi-modal sensor observations. We encode high-dimensional multi-modal sensory inputs to low-dimensional feature vectors using encoder networks and represent them as a connected graph. The graph is then used to train an attention-based Graph Neural Network (GNN) to predict trajectory success probabilities. We further analyze the number of features in the image (corners) and point cloud data (edges and planes) separately to quantify their reliability to augment the weights of the feature graph representation used in our GNN. During runtime, our model utilizes multi-sensor inputs to predict the success probabilities of the trajectories generated by a local planner to avoid potential collisions and failures. Our algorithm demonstrates robust predictions when one or more sensor modalities are unreliable or unavailable in complex outdoor environments. We evaluate our algorithm's navigation performance using a Spot robot in real-world outdoor environments. We observe an increase of 10-30% in terms of navigation success rate and a 13-15% decrease in false positive estimations compared to the state-of-the-art navigation methods. △ Less

Submitted 16 May, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

arXiv:2207.13848 [pdf, other]

Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Nianxiong Tan, Xiaopeng Yu, Hongzhong Zheng, Jianyi Meng, Xiaolang Yan, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is a fundamental building block in numerous scientific applications. One critical task of SpGEMM is to compute or predict the structure of the output matrix (i.e., the number of nonzero elements per output row) for efficient memory allocation and load balance, which impact the overall performance of SpGEMM. Existing work either precisely calculates the… ▽ More Sparse general matrix multiplication (SpGEMM) is a fundamental building block in numerous scientific applications. One critical task of SpGEMM is to compute or predict the structure of the output matrix (i.e., the number of nonzero elements per output row) for efficient memory allocation and load balance, which impact the overall performance of SpGEMM. Existing work either precisely calculates the output structure or adopts upper-bound or sampling-based methods to predict the output structure. However, these methods either take much execution time or are not accurate enough. In this paper, we propose a novel sampling-based method with better accuracy and low costs compared to the existing sampling-based method. The proposed method first predicts the compression ratio of SpGEMM by leveraging the number of intermediate products (denoted as FLOP) and the number of nonzero elements (denoted as NNZ) of the same sampled result matrix. And then, the predicted output structure is obtained by dividing the FLOP per output row by the predicted compression ratio. We also propose a reference design of the existing sampling-based method with optimized computing overheads to demonstrate the better accuracy of the proposed method. We construct 625 test cases with various matrix dimensions and sparse structures to evaluate the prediction accuracy. Experimental results show that the absolute relative errors of the proposed method and the reference design are 1.56\% and 8.12\%, respectively, on average, and 25\% and 156\%, respectively, in the worst case. △ Less

Submitted 27 July, 2022; originally announced July 2022.

Comments: This paper has been submitted to the IEEE International Conference on Parallel and Distributed Systems (ICPADS). 8 pages, 2 fgures, 3 tables

ACM Class: F.2.1; G.3; D.1.3; G.1.3

arXiv:2206.07244 [pdf, other]

OpSparse: a Highly Optimized Framework for Sparse General Matrix Multiplication on GPUs

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Linyong Huang, Hongzhong Zheng, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., $nsparse$ and $spECK$) adopt several alg… ▽ More Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., $nsparse$ and $spECK$) adopt several algorithms to tackle the challenges of global load balance, local load balance, and allocation of the result matrix. While these libraries focus on the high-level algorithm design for SpGEMM, they neglect several low-level architecture-specific optimizations, which causes inefficient implementations in their libraries. In this paper, we classify their inefficient implementations into seven categories. Based on our observations, we propose a highly optimized SpGEMM library called $OpSparse$. The optimizations in $OpSparse$ include 1) optimizing the binning method by improving the utilization of the shared memory, 2) optimizing the hashing method by reducing the access to the hash table, 3) improving the trade-off between hash collision rate and hardware utilization in the hashing method by setting appropriate binning ranges, 4) reducing the overheads of global memory utilization by minimizing the global memory usage of the metadata, and 5) improving the execution parallelism by overlapping global memory allocation with kernel execution. Performance evaluations with 26 commonly used matrices on an Nvidia Tesla V100 GPU show that $OpSparse$ achieves up to $27.8\times$, $1.81\times$, and $2.04\times$ performance speedup over three state-of-the-art libraries: $cuSPARSE$, $nsparse$, and $spECK$, respectively. △ Less

Submitted 14 June, 2022; originally announced June 2022.

Comments: This paper has been submitted to the IEEE Access since May 7, 2022, and is currently under review by IEEE Access. 20 pages, 11 fgures, 5 tables

MSC Class: 68-02; 68W10; 65F50 ACM Class: D.1.3; G.1.3

arXiv:2206.06611 [pdf, other]

Accelerating CPU-Based Sparse General Matrix Multiplication With Binary Row Merging

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Hongzhong Zheng, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bounded application with vast and irregular memory accesses, considering the memory access efficiency is of critical importance for SpGEMM's performance. Yet, the existing methods put less consideration into the memory subsystem and achieved suboptimal… ▽ More Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bounded application with vast and irregular memory accesses, considering the memory access efficiency is of critical importance for SpGEMM's performance. Yet, the existing methods put less consideration into the memory subsystem and achieved suboptimal performance. In this paper, we thoroughly analyze the memory access patterns of SpGEMM and their influences on the memory subsystem. Based on the analysis, we propose a novel and more efficient accumulation method named BRMerge for the multi-core CPU architectures. The BRMerge accumulation method follows the row-wise dataflow. It first accesses the $B$ matrix, generates the intermediate lists for one output row, and stores these intermediate lists in a consecutive memory space, which is implemented by a ping-pong buffer. It then immediately merges these intermediate lists generated in the previous phase two by two in a tree-like hierarchy between two ping-pong buffers. The architectural benefits of BRMerge are 1) streaming access patterns, 2) minimized TLB cache miss rate, and 3) reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM. Based on the BRMerge accumulation method, we propose two SpGEMM libraries named BRMerge-Upper and BRMerge-Precise, which use different allocation methods. Performance evaluations with 26 commonly used benchmarks on two CPU servers show that the proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries. △ Less

Submitted 19 August, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: This work has been accepted by IEEE Access (DOI:10.1109/ACCESS.2022.3193937). There are 12 pages, 6 fgures, 2 tables

MSC Class: 68-02; 68W10; 65F50 ACM Class: D.1.3; G.1.3

arXiv:2206.05840 [pdf, other]

GAN based Data Augmentation to Resolve Class Imbalance

Authors: Sairamvinay Vijayaraghavan, Terry Guan, Jason, Song

Abstract: The number of credit card fraud has been growing as technology grows and people can take advantage of it. Therefore, it is very important to implement a robust and effective method to detect such frauds. The machine learning algorithms are appropriate for these tasks since they try to maximize the accuracy of predictions and hence can be relied upon. However, there is an impending flaw where in ma… ▽ More The number of credit card fraud has been growing as technology grows and people can take advantage of it. Therefore, it is very important to implement a robust and effective method to detect such frauds. The machine learning algorithms are appropriate for these tasks since they try to maximize the accuracy of predictions and hence can be relied upon. However, there is an impending flaw where in machine learning models may not perform well due to the presence of an imbalance across classes distribution within the sample set. So, in many related tasks, the datasets have a very small number of observed fraud cases (sometimes around 1 percent positive fraud instances found). Therefore, this imbalance presence may impact any learning model's behavior by predicting all labels as the majority class, hence allowing no scope for generalization in the predictions made by the model. We trained Generative Adversarial Network(GAN) to generate a large number of convincing (and reliable) synthetic examples of the minority class that can be used to alleviate the class imbalance within the training set and hence generalize the learning of the data more effectively. △ Less

Submitted 12 June, 2022; originally announced June 2022.

arXiv:2205.03517 [pdf, other]

AdaptiveON: Adaptive Outdoor Local Navigation Method For Stable and Reliable Actions

Authors: Jing Liang, Kasun Weerakoon, Tianrui Guan, Nare Karapetyan, Dinesh Manocha

Abstract: We present a novel outdoor navigation algorithm to generate stable and efficient actions to navigate a robot to reach a goal. We use a multi-stage training pipeline and show that our approach produces policies that result in stable and reliable robot navigation on complex terrains. Based on the Proximal Policy Optimization (PPO) algorithm, we developed a novel method to achieve multiple capabiliti… ▽ More We present a novel outdoor navigation algorithm to generate stable and efficient actions to navigate a robot to reach a goal. We use a multi-stage training pipeline and show that our approach produces policies that result in stable and reliable robot navigation on complex terrains. Based on the Proximal Policy Optimization (PPO) algorithm, we developed a novel method to achieve multiple capabilities for outdoor navigation tasks, namely alleviating the robot's drifting, keeping the robot stable on bumpy terrains, avoiding climbing on hills with steep elevation changes, and avoiding collisions. Our training process mitigates the reality (sim-to-real) gap by introducing generalized environmental and robotic parameters and training with rich features of Lidar perception in a high-fidelity Unity simulator. We evaluate our method in both simulation and real world environments using Clearpath Husky and Jackal robots. Further, we compare our method against the state-of-the-art approaches and observe that, in the real world it improves stability by at least 30.7% on uneven terrains, reduces drifting by 8.08% and decreases the elevation changes by 14.75%. △ Less

Submitted 6 December, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: 10 pages

arXiv:2203.15459 [pdf, other]

Influence of Communication Among Shared Developers on the Productivity of Open Source Software Projects

Authors: Sairamvinay Vijayaraghavan, Jinxiao Song, Terry Guan, Seongwoo Choi, Sutej Kulkarni

Abstract: Many software developers rely on open source software for developing their applications and writing their source codes. Measuring an independent project's overall productivity is still an open problem for many technology companies. In this project, we address to bridge the gap of analyzing which are the most important features for prediction of a productivity based system. We have chosen to collec… ▽ More Many software developers rely on open source software for developing their applications and writing their source codes. Measuring an independent project's overall productivity is still an open problem for many technology companies. In this project, we address to bridge the gap of analyzing which are the most important features for prediction of a productivity based system. We have chosen to collect data from GitHub via their application programming interfaces (API) and analyze the data we gathered to understand the relation between the average time to close an issue and the features that we collected. Since most of the data we gathered were not Gaussian, we had to preprocess the data using outlier detection and applying transformations before statistical modeling. The best model we observed was polynomial regression with degree 5. Overall, we noticed that there are many aspects of software development that make developers increase their productivity. △ Less

Submitted 25 March, 2022; originally announced March 2022.

arXiv:2203.10694 [pdf, other]

FAR: Fourier Aerial Video Recognition

Authors: Divya Kothandaraman, Tianrui Guan, Xijun Wang, Sean Hu, Ming Lin, Dinesh Manocha

Abstract: We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition. Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent (which is typically small) from the background. Our disentanglement technique operates in the frequency domain to characterize the extent of temporal change of spatial pixels, and exploits convol… ▽ More We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition. Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent (which is typically small) from the background. Our disentanglement technique operates in the frequency domain to characterize the extent of temporal change of spatial pixels, and exploits convolution-multiplication properties of Fourier transform to map this representation to the corresponding object-background entangled features obtained from the network. To encapsulate contextual information and long-range space-time dependencies, we present a novel Fourier Attention algorithm, which emulates the benefits of self-attention by modeling the weighted outer product in the frequency domain. Our Fourier attention formulation uses much fewer computations than self-attention. We have evaluated our approach on multiple UAV datasets including UAV Human RGB, UAV Human Night, Drone Action, and NEC Drone. We demonstrate a relative improvement of 8.02% - 38.69% in top-1 accuracy and up to 3 times faster over prior works. △ Less

Submitted 18 July, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: ECCV 2022 Poster paper

arXiv:2203.03382 [pdf, other]

Self-supervised Implicit Glyph Attention for Text Recognition

Authors: Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao, Xiaokang Yang, Wei Shen

Abstract: The attention mechanism has become the \emph{de facto} module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text anno… ▽ More The attention mechanism has become the \emph{de facto} module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and or character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is character category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when handling languages with larger character categories. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks. △ Less

Submitted 15 May, 2023; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: CVPR2023

arXiv:2202.12873 [pdf, other]

TerraPN: Unstructured Terrain Navigation using Online Self-Supervised Learning

Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Tianrui Guan, Jing Liang, Dinesh Manocha

Abstract: We present TerraPN, a novel method that learns the surface properties (traction, bumpiness, deformability, etc.) of complex outdoor terrains directly from robot-terrain interactions through self-supervised learning, and uses it for autonomous robot navigation. Our method uses RGB images of terrain surfaces and the robot's velocities as inputs, and the IMU vibrations and odometry errors experienced… ▽ More We present TerraPN, a novel method that learns the surface properties (traction, bumpiness, deformability, etc.) of complex outdoor terrains directly from robot-terrain interactions through self-supervised learning, and uses it for autonomous robot navigation. Our method uses RGB images of terrain surfaces and the robot's velocities as inputs, and the IMU vibrations and odometry errors experienced by the robot as labels for self-supervision. Our method computes a surface cost map that differentiates smooth, high-traction surfaces (low navigation costs) from bumpy, slippery, deformable surfaces (high navigation costs). We compute the cost map by non-uniformly sampling patches from the input RGB image by detecting boundaries between surfaces resulting in low inference times (47.27% lower) compared to uniform sampling and existing segmentation methods. We present a novel navigation algorithm that accounts for a surface's cost, computes cost-based acceleration limits for the robot, and dynamically feasible, collision-free trajectories. TerraPN's surface cost prediction can be trained in ~25 minutes for five different surfaces, compared to several hours for previous learning-based segmentation methods. In terms of navigation, our method outperforms previous works in terms of success rates (up to 35.84% higher), vibration cost of the trajectories (up to 21.52% lower), and slowing the robot on bumpy, deformable surfaces (up to 46.76% slower) in different scenarios. △ Less

Submitted 22 June, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 10 pages, 6 figures

arXiv:2202.02800 [pdf, other]

doi 10.14778/3489496.3489508

Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Authors: Renzhi Wu, Bolin Ding, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan, Jingren Zhou

Abstract: Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typical… ▽ More Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility. △ Less

Submitted 6 February, 2022; originally announced February 2022.

Comments: Published at International Conference on Very Large Data Bases (VLDB) 2022

Journal ref: PVLDB, 15(2): 272 - 284, 2022

arXiv:2110.12663 [pdf, other]

doi 10.1109/TCSVT.2022.3156390

Industrial Scene Text Detection with Refined Feature-attentive Network

Authors: Tongkun Guan, Chaochen Gu, Changsheng Lu, Jingzheng Tu, Qi Feng, Kaijie Wu, Xinping Guan

Abstract: Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded character structures, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods locate low-contrast text areas inaccurately. In this paper, we propose a refined feature-attentive network (RFN) to s… ▽ More Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded character structures, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods locate low-contrast text areas inaccurately. In this paper, we propose a refined feature-attentive network (RFN) to solve the inaccurate localization problem. Specifically, we design a parallel feature integration mechanism to construct an adaptive feature representation from multi-resolution features, which enhances the perception of multi-scale texts at each scale-specific level to generate a high-quality attention map. Then, an attentive refinement network is developed by the attention map to rectify the location deviation of candidate boxes. In addition, a re-scoring mechanism is designed to select text boxes with the best rectified location. Moreover, we construct two industrial scene text datasets, including a total of 102156 images and 1948809 text instances with various character structures and metal parts. Extensive experiments on our dataset and four public datasets demonstrate that our proposed method achieves the state-of-the-art performance. △ Less

Submitted 29 March, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

arXiv:2109.06250 [pdf, other]

TNS: Terrain Traversability Mapping and Navigation System for Autonomous Excavators

Authors: Tianrui Guan, Zhenpeng He, Ruitao Song, Dinesh Manocha, Liangjun Zhang

Abstract: We present a terrain traversability mapping and navigation system (TNS) for autonomous excavator applications in an unstructured environment. We use an efficient approach to extract terrain features from RGB images and 3D point clouds and incorporate them into a global map for planning and navigation. Our system can adapt to changing environments and update the terrain information in real-time. Mo… ▽ More We present a terrain traversability mapping and navigation system (TNS) for autonomous excavator applications in an unstructured environment. We use an efficient approach to extract terrain features from RGB images and 3D point clouds and incorporate them into a global map for planning and navigation. Our system can adapt to changing environments and update the terrain information in real-time. Moreover, we present a novel dataset, the Complex Worksite Terrain (CWT) dataset, which consists of RGB images from construction sites with seven categories based on navigability. Our novel algorithms improve the mapping accuracy over previous SOTA methods by 4.17-30.48% and reduce MSE on the traversability map by 13.8-71.4%. We have combined our mapping approach with planning and control modules in an autonomous excavator navigation system and observe 49.3% improvement in the overall success rate. Based on TNS, we demonstrate the first autonomous excavator that can navigate through unstructured environments consisting of deep pits, steep hills, rock piles, and other complex terrain features. △ Less

Submitted 5 May, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

arXiv:2104.11896 [pdf, other]

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Authors: Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, Dinesh Manocha

Abstract: We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids. M3DeTR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transfo… ▽ More We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids. M3DeTR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transformers. We perform extensive ablation experiments that highlight the benefits of fusing representation and scale, and modeling the relationships. Our method achieves state-of-the-art performance on the KITTI 3D object detection dataset and Waymo Open Dataset. Results show that M3DeTR improves the baseline significantly by 1.48% mAP for all classes on Waymo Open Dataset. In particular, our approach ranks 1st on the well-known KITTI 3D Detection Benchmark for both car and cyclist classes, and ranks 1st on Waymo Open Dataset with single frame point cloud input. Our code is available at: https://github.com/rayguan97/M3DETR. △ Less

Submitted 22 October, 2021; v1 submitted 24 April, 2021; originally announced April 2021.

arXiv:2103.04233 [pdf, other]

GANav: Efficient Terrain Segmentation for Robot Navigation in Unstructured Outdoor Environments

Authors: Tianrui Guan, Divya Kothandaraman, Rohan Chandra, Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Dinesh Manocha

Abstract: We propose GANav, a novel group-wise attention mechanism to identify safe and navigable regions in off-road terrains and unstructured environments from RGB images. Our approach classifies terrains based on their navigability levels using coarse-grained semantic segmentation. Our novel group-wise attention loss enables any backbone network to explicitly focus on the different groups' features with… ▽ More We propose GANav, a novel group-wise attention mechanism to identify safe and navigable regions in off-road terrains and unstructured environments from RGB images. Our approach classifies terrains based on their navigability levels using coarse-grained semantic segmentation. Our novel group-wise attention loss enables any backbone network to explicitly focus on the different groups' features with low spatial resolution. Our design leads to efficient inference while maintaining a high level of accuracy compared to existing SOTA methods. Our extensive evaluations on the RUGD and RELLIS-3D datasets shows that GANav achieves an improvement over the SOTA mIoU by 2.25-39.05% on RUGD and 5.17-19.06% on RELLIS-3D. We interface GANav with a deep reinforcement learning-based navigation algorithm and highlight its benefits in terms of navigation in real-world unstructured terrains. We integrate our GANav-based navigation algorithm with ClearPath Jackal and Husky robots, and observe an increase of 10% in terms of success rate, 2-47% in terms of selecting the surface with the best navigability and a decrease of 4.6-13.9% in trajectory roughness. Further, GANav reduces the false positive rate of forbidden regions by 37.79%. Code, videos, and a full technical report are available at https://gamma.umd.edu/offroad/. △ Less

Submitted 17 June, 2022; v1 submitted 6 March, 2021; originally announced March 2021.

arXiv:2009.13631 [pdf, other]

Tempura: A General Cost Based Optimizer Framework for Incremental Data Processing (Extended Version)

Authors: Zuozhi Wang, Kai Zeng, Botong Huang, Wei Chen, Xiaozong Cui, Bo Wang, Ji Liu, Liya Fan, Dachuan Qu, Zhenyu Hou, Tao Guan, Chen Li, Jingren Zhou

Abstract: Incremental processing is widely-adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper,… ▽ More Incremental processing is widely-adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which can not only unify various existing techniques to generate an optimal incremental plan, but also allow the developer to add their rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We conduct a thorough experimental evaluation of Tempura in various incremental processing scenarios to show its effectiveness and efficiency. △ Less

Submitted 28 September, 2020; originally announced September 2020.

Comments: 19 pages, 8 figures. The short version of this paper is accepeted at VLDB 2021 (PVLDB Volume 14, Issue 1)

ACM Class: H.2.4

arXiv:2009.04924 [pdf, other]

doi 10.1007/978-3-030-32692-0_27

Ultrasound Liver Fibrosis Diagnosis using Multi-indicator guided Deep Neural Networks

Authors: Jiali Liu, Wenxuan Wang, Tianyao Guan, Ningbo Zhao, Xiaoguang Han, Zhen Li

Abstract: Accurate analysis of the fibrosis stage plays very important roles in follow-up of patients with chronic hepatitis B infection. In this paper, a deep learning framework is presented for automatically liver fibrosis prediction. On contrary of previous works, our approach can take use of the information provided by multiple ultrasound images. An indicator-guided learning mechanism is further propose… ▽ More Accurate analysis of the fibrosis stage plays very important roles in follow-up of patients with chronic hepatitis B infection. In this paper, a deep learning framework is presented for automatically liver fibrosis prediction. On contrary of previous works, our approach can take use of the information provided by multiple ultrasound images. An indicator-guided learning mechanism is further proposed to ease the training of the proposed model. This follows the workflow of clinical diagnosis and make the prediction procedure interpretable. To support the training, a dataset is well-collected which contains the ultrasound videos/images, indicators and labels of 229 patients. As demonstrated in the experimental results, our proposed model shows its effectiveness by achieving the state-of-the-art performance, specifically, the accuracy is 65.6%(20% higher than previous best). △ Less

Submitted 10 September, 2020; originally announced September 2020.

Comments: Jiali Liu and Wenxuan Wang are equal contribution

Journal ref: Machine Learning in Medical Imaging 2019

arXiv:2004.10976 [pdf, other]

OF-VO: Efficient Navigation among Pedestrians Using Commodity Sensors

Authors: Jing Liang, Yi-Ling Qiao, Tianrui Guan, Dinesh Manocha

Abstract: We present a modified velocity-obstacle (VO) algorithm that uses probabilistic partial observations of the environment to compute velocities and navigate a robot to a target. Our system uses commodity visual sensors, including a mono-camera and a 2D Lidar, to explicitly predict the velocities and positions of surrounding obstacles through optical flow estimation, object detection, and sensor fusio… ▽ More We present a modified velocity-obstacle (VO) algorithm that uses probabilistic partial observations of the environment to compute velocities and navigate a robot to a target. Our system uses commodity visual sensors, including a mono-camera and a 2D Lidar, to explicitly predict the velocities and positions of surrounding obstacles through optical flow estimation, object detection, and sensor fusion. A key aspect of our work is coupling the perception (OF: optical flow) and planning (VO) components for reliable navigation. Overall, our OF-VO algorithm using learning-based perception and model-based planning methods offers better performance than prior algorithms in terms of navigation time and success rate of collision avoidance. Our method also provides bounds on the probabilistic collision avoidance algorithm. We highlight the realtime performance of OF-VO on a Turtlebot navigating among pedestrians in both simulated and real-world scenes. A demo video is available at https://gamma.umd.edu/ofvo/ △ Less

Submitted 8 June, 2021; v1 submitted 23 April, 2020; originally announced April 2020.

Comments: 10 pages

arXiv:2004.06042 [pdf, other]

Adversarial Style Mining for One-Shot Unsupervised Domain Adaptation

Authors: Yawei Luo, Ping Liu, Tao Guan, Junqing Yu, Yi Yang

Abstract: We aim at the problem named One-Shot Unsupervised Domain Adaptation. Unlike traditional Unsupervised Domain Adaptation, it assumes that only one unlabeled target sample can be available when learning to adapt. This setting is realistic but more challenging, in which conventional adaptation approaches are prone to failure due to the scarce of unlabeled target data. To this end, we propose a novel A… ▽ More We aim at the problem named One-Shot Unsupervised Domain Adaptation. Unlike traditional Unsupervised Domain Adaptation, it assumes that only one unlabeled target sample can be available when learning to adapt. This setting is realistic but more challenging, in which conventional adaptation approaches are prone to failure due to the scarce of unlabeled target data. To this end, we propose a novel Adversarial Style Mining approach, which combines the style transfer module and task-specific module into an adversarial manner. Specifically, the style transfer module iteratively searches for harder stylized images around the one-shot target sample according to the current learning state, leading the task model to explore the potential styles that are difficult to solve in the almost unseen target domain, thus boosting the adaptation performance in a data-scarce scenario. The adversarial learning framework makes the style transfer module and task-specific module benefit each other during the competition. Extensive experiments on both cross-domain classification and segmentation benchmarks verify that ASM achieves state-of-the-art adaptation performance under the challenging one-shot setting. △ Less

Submitted 13 April, 2020; originally announced April 2020.

Comments: Preprint

arXiv:2003.05395 [pdf, other]

Frozone: Freezing-Free, Pedestrian-Friendly Navigation in Human Crowds

Authors: Adarsh Jagan Sathyamoorthy, Utsav Patel, Tianrui Guan, Dinesh Manocha

Abstract: We present Frozone, a novel algorithm to deal with the Freezing Robot Problem (FRP) that arises when a robot navigates through dense scenarios and crowds. Our method senses and explicitly predicts the trajectories of pedestrians and constructs a Potential Freezing Zone (PFZ); a spatial zone where the robot could freeze or be obtrusive to humans. Our formulation computes a deviation velocity to avo… ▽ More We present Frozone, a novel algorithm to deal with the Freezing Robot Problem (FRP) that arises when a robot navigates through dense scenarios and crowds. Our method senses and explicitly predicts the trajectories of pedestrians and constructs a Potential Freezing Zone (PFZ); a spatial zone where the robot could freeze or be obtrusive to humans. Our formulation computes a deviation velocity to avoid the PFZ, which also accounts for social constraints. Furthermore, Frozone is designed for robots equipped with sensors with a limited sensing range and field of view. We ensure that the robot's deviation is bounded, thus avoiding sudden angular motion which could lead to the loss of perception data of the surrounding obstacles. We have combined Frozone with a Deep Reinforcement Learning-based (DRL) collision avoidance method and use our hybrid approach to handle crowds of varying densities. Our overall approach results in smooth and collision-free navigation in dense environments. We have evaluated our method's performance in simulation and on real differential drive robots in challenging indoor scenarios. We highlight the benefits of our approach over prior methods in terms of success rates (up to 50% increase), pedestrian-friendliness (100% increase) and the rate of freezing (> 80% decrease) in challenging scenarios. △ Less

Submitted 11 March, 2020; originally announced March 2020.

Comments: 9 pages, 6 figures

arXiv:2002.03038 [pdf, other]

DenseCAvoid: Real-time Navigation in Dense Crowds using Anticipatory Behaviors

Authors: Adarsh Jagan Sathyamoorthy, Jing Liang, Utsav Patel, Tianrui Guan, Rohan Chandra, Dinesh Manocha

Abstract: We present DenseCAvoid, a novel navigation algorithm for navigating a robot through dense crowds and avoiding collisions by anticipating pedestrian behaviors. Our formulation uses visual sensors and a pedestrian trajectory prediction algorithm to track pedestrians in a set of input frames and provide bounding boxes that extrapolate the pedestrian positions in a future time. Our hybrid approach com… ▽ More We present DenseCAvoid, a novel navigation algorithm for navigating a robot through dense crowds and avoiding collisions by anticipating pedestrian behaviors. Our formulation uses visual sensors and a pedestrian trajectory prediction algorithm to track pedestrians in a set of input frames and provide bounding boxes that extrapolate the pedestrian positions in a future time. Our hybrid approach combines this trajectory prediction with a Deep Reinforcement Learning-based collision avoidance method to train a policy to generate smoother, safer, and more robust trajectories during run-time. We train our policy in realistic 3-D simulations of static and dynamic scenarios with multiple pedestrians. In practice, our hybrid approach generalizes well to unseen, real-world scenarios and can navigate a robot through dense crowds (~1-2 humans per square meter) in indoor scenarios, including narrow corridors and lobbies. As compared to cases where prediction was not used, we observe that our method reduces the occurrence of the robot freezing in a crowd by up to 48%, and performs comparably with respect to trajectory lengths and mean arrival times to goal. △ Less

Submitted 7 February, 2020; originally announced February 2020.

Comments: 8 pages, 5 figures

arXiv:1912.01118 [pdf, other]

Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-LSTMs

Authors: Rohan Chandra, Tianrui Guan, Srujan Panuganti, Trisha Mittal, Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha

Abstract: We present a novel approach for traffic forecasting in urban traffic scenarios using a combination of spectral graph analysis and deep learning. We predict both the low-level information (future trajectories) as well as the high-level information (road-agent behavior) from the extracted trajectory of each road-agent. Our formulation represents the proximity between the road agents using a weighted… ▽ More We present a novel approach for traffic forecasting in urban traffic scenarios using a combination of spectral graph analysis and deep learning. We predict both the low-level information (future trajectories) as well as the high-level information (road-agent behavior) from the extracted trajectory of each road-agent. Our formulation represents the proximity between the road agents using a weighted dynamic geometric graph (DGG). We use a two-stream graph-LSTM network to perform traffic forecasting using these weighted DGGs. The first stream predicts the spatial coordinates of road-agents, while the second stream predicts whether a road-agent is going to exhibit overspeeding, underspeeding, or neutral behavior by modeling spatial interactions between road-agents. Additionally, we propose a new regularization algorithm based on spectral clustering to reduce the error margin in long-term prediction (3-5 seconds) and improve the accuracy of the predicted trajectories. Moreover, we prove a theoretical upper bound on the regularized prediction error. We evaluate our approach on the Argoverse, Lyft, Apolloscape, and NGSIM datasets and highlight the benefits over prior trajectory prediction methods. In practice, our approach reduces the average prediction error by approximately 75% over prior algorithms and achieves a weighted average accuracy of 91.2% for behavior prediction. Additionally, our spectral regularization improves long-term prediction by up to 70%. △ Less

Submitted 5 August, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: Accepted to RAL/IROS 2020 as a dual journal and conference submission

arXiv:1905.02607 [pdf, other]

doi 10.1145/3328912

PocketCare: Tracking the Flu with Mobile Phones using Partial Observations of Proximity and Symptoms

Authors: Wen Dong, Tong Guan, Bruno Lepri, Chunming Qiao

Abstract: Mobile phones provide a powerful sensing platform that researchers may adopt to understand proximity interactions among people and the diffusion, through these interactions, of diseases, behaviors, and opinions. However, it remains a challenge to track the proximity-based interactions of a whole community and then model the social diffusion of diseases and behaviors starting from the observations… ▽ More Mobile phones provide a powerful sensing platform that researchers may adopt to understand proximity interactions among people and the diffusion, through these interactions, of diseases, behaviors, and opinions. However, it remains a challenge to track the proximity-based interactions of a whole community and then model the social diffusion of diseases and behaviors starting from the observations of a small fraction of the volunteer population. In this paper, we propose a novel approach that tries to connect together these sparse observations using a model of how individuals interact with each other and how social interactions happen in terms of a sequence of proximity interactions. We apply our approach to track the spreading of flu in the spatial-proximity network of a 3000-people university campus by mobilizing 300 volunteers from this population to monitor nearby mobile phones through Bluetooth scanning and to daily report flu symptoms about and around them. Our aim is to predict the likelihood for an individual to get flu based on how often her/his daily routine intersects with those of the volunteers. Thus, we use the daily routines of the volunteers to build a model of the volunteers as well as of the non-volunteers. Our results show that we can predict flu infection two weeks ahead of time with an average precision from 0.24 to 0.35 depending on the amount of information. This precision is six to nine times higher than with a random guess model. At the population level, we can predict infectious population in a two-week window with an r-squared value of 0.95 (a random-guess model obtains an r-squared value of 0.2). These results point to an innovative approach for tracking individuals who have interacted with people showing symptoms, allowing us to warn those in danger of infection and to inform health researchers about the progression of contact-induced diseases. △ Less

Submitted 7 May, 2019; originally announced May 2019.

Showing 1–50 of 55 results for author: Guan, T