
Embodied AI in Virtual Environments with Alignment of Linguistic and Visual Inputs
Ph.D. Project: Jason Armitage

Computational implementations of time-structured tasks in the real world and in virtual environments reduce to processing sets of sequences over a collection of information sources. This project proposes new methods and resources for multimodal and embodied AI tasks in which task-relevant information must be aligned over sequences of visual and linguistic inputs. Our work identifies sequence alignment as an implied or explicit process in three tasks of this kind and presents contributions for navigating environments and acting on 3D visual assets given linguistic inputs:

1) We propose a new method for estimating the mutual information between multiple variables derived from cross-modal inputs, enhancing the performance of 2D vision-language models on tasks with 3D multi-object scenes.

2) We examine the performance of Score Distillation Sampling in generative tasks on 3D scenes conditioned on text, and address nonlinearities in processing a scene with limited views by combining gradient thresholds with piecewise linear updates.

3) We present a novel priority map module that conducts a hierarchical process of high-level alignment between textual spans and visual perspectives, enhancing the performance of transformer-based systems on the embodied AI task of Vision-and-Language Navigation.

Illustrative sketches of the three contributions follow.
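
For the first contribution, the sketch below estimates a mutual information lower bound between two embedding variables with a small critic network, in the style of the Donsker-Varadhan bound used by neural estimators such as MINE. The critic architecture, embedding dimensions, and the restriction to two variables are assumptions for illustration; the project's multi-variable estimator is not reproduced here.

    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        """Scores embedding pairs: high on joint samples, low on shuffled ones."""
        def __init__(self, dim_x, dim_y, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x, y):
            return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

    def mi_lower_bound(critic, x, y):
        # Donsker-Varadhan bound: E_p(x,y)[T] - log E_p(x)p(y)[exp T].
        joint = critic(x, y).mean()
        y_marg = y[torch.randperm(y.size(0))]   # shuffle pairing -> product of marginals
        log_mean_exp = torch.logsumexp(critic(x, y_marg), dim=0) \
            - torch.log(torch.tensor(y.size(0), dtype=torch.float))
        return joint - log_mean_exp

    # Maximising the bound w.r.t. the critic yields the MI estimate, which can
    # then serve as an auxiliary signal when training a vision-language model.
    critic = Critic(512, 512)
    x, y = torch.randn(64, 512), torch.randn(64, 512)
    print(mi_lower_bound(critic, x, y))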
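
For the second contribution, the summary above does not spell out how gradient thresholds and piecewise linear updates are combined, so the following is a minimal sketch under assumptions: the raw Score Distillation Sampling gradient is clamped elementwise, and the step size follows a piecewise linear schedule. The renderer, denoiser, noising scheme, and schedule knots are toy stand-ins, not the project's implementation.

    import torch

    def piecewise_linear_lr(step, knots=(0, 500, 1000), vals=(1e-2, 5e-3, 1e-4)):
        """Step size interpolated linearly between (knot, value) pairs."""
        step = min(max(step, knots[0]), knots[-1])
        for k0, k1, v0, v1 in zip(knots, knots[1:], vals, vals[1:]):
            if step <= k1:
                return v0 + (step - k0) / (k1 - k0) * (v1 - v0)
        return vals[-1]

    def sds_update(params, render, denoiser, text_emb, step, thresh=1.0):
        img = render(params)                    # differentiable render of the scene
        noise = torch.randn_like(img)
        t = torch.rand(())                      # schematic timestep
        noisy = (1 - t) * img + t * noise       # schematic forward noising
        eps_pred = denoiser(noisy, t, text_emb)
        grad = (eps_pred - noise).detach()      # SDS gradient (weighting omitted)
        grad = grad.clamp(-thresh, thresh)      # gradient threshold
        img.backward(gradient=grad)             # push the gradient to the scene params
        with torch.no_grad():
            params -= piecewise_linear_lr(step) * params.grad
            params.grad = None
        return params

    # Toy stand-ins (hypothetical): a sigmoid "renderer" and an identity denoiser.
    params = torch.randn(3, 32, 32, requires_grad=True)
    for step in range(10):
        sds_update(params, torch.sigmoid, lambda x, t, c: x, text_emb=None, step=step)

The clamp bounds the per-element update magnitude where the loss surface is strongly nonlinear, while the schedule shrinks the step size in linear segments as optimisation proceeds.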
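
For the third contribution, one schematic reading of a priority map is a two-level scoring: weight textual spans by estimated relevance, then score visual perspectives against the span-weighted summary to obtain a priority distribution over views. The layer choices and dimensions below are assumptions for illustration; the module developed in the project is more involved.

    import torch
    import torch.nn as nn

    class PriorityMap(nn.Module):
        """Schematic two-level alignment: weight textual spans, then score
        visual perspectives against the span-weighted text summary."""
        def __init__(self, dim):
            super().__init__()
            self.span_scorer = nn.Linear(dim, 1)
            self.scale = dim ** -0.5

        def forward(self, span_emb, view_emb):
            # span_emb: (n_spans, dim); view_emb: (n_views, dim)
            span_w = torch.softmax(self.span_scorer(span_emb).squeeze(-1), dim=0)
            text_summary = span_w @ span_emb                # (dim,)
            scores = view_emb @ text_summary * self.scale   # (n_views,)
            return torch.softmax(scores, dim=0)             # priority over views

    pm = PriorityMap(dim=128)
    priorities = pm(torch.randn(5, 128), torch.randn(8, 128))
    # `priorities` could re-weight view features before a transformer policy
    # selects the agent's next action in Vision-and-Language Navigation.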
