Prev: 2022.10.16 Next: 2022.10.18

Summary for 2022-10-17, created on 2022-10-27

UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image arxiv:2210.09477 📈 511

Dani Valevski, Matan Kalman, Yossi Matias, Yaniv Leviathan

**Abstract:** We present UniTune, a simple and novel method for general text-driven image editing. UniTune gets as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high semantic and visual fidelity to the input image. UniTune uses text, an intuitive interface for art-direction, and does not require additional inputs, like masks or sketches. At the core of our method is the observation that with the right choice of parameters, we can fine-tune a large text-to-image diffusion model on a single image, encouraging the model to maintain fidelity to the input image while still allowing expressive manipulations. We used Imagen as our text-to-image model, but we expect UniTune to work with other large-scale models as well. We test our method in a range of different use cases, and demonstrate its wide applicability.

You Only Live Once: Single-Life Reinforcement Learning arxiv:2210.08863 📈 205

Annie S. Chen, Archit Sharma, Sergey Levine, Chelsea Finn

**Abstract:** Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, usually starting from scratch. However, in many real-world situations, the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once in a single trial. For example, imagine a disaster relief robot tasked with retrieving an item from a fallen building, where it cannot get direct supervision from humans. It must retrieve this object within one test-time trial, and must do so while tackling unknown obstacles, though it may leverage knowledge it has of the building before the disaster. We formalize this problem setting, which we call single-life reinforcement learning (SLRL), where an agent must complete a task within a single episode without interventions, utilizing its prior experience while contending with some form of novelty. SLRL provides a natural setting to study the challenge of autonomously adapting to unfamiliar situations, and we find that algorithms designed for standard episodic reinforcement learning often struggle to recover from out-of-distribution states in this setting. Motivated by this observation, we propose an algorithm, $Q$-weighted adversarial learning (QWALE), which employs a distribution matching strategy that leverages the agent's prior experience as guidance in novel situations. Our experiments on several single-life continuous control problems indicate that methods based on our distribution matching formulation are 20-60% more successful because they can more quickly recover from novel states.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them arxiv:2210.09261 📈 188

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei

**Abstract:** BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the task for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

Morig: Motion-aware rigging of character meshes from point clouds arxiv:2210.09463 📈 153

Zhan Xu, Yang Zhou, Li Yi, Evangelos Kalogerakis

**Abstract:** We present MoRig, a method that automatically rigs character meshes driven by single-view point cloud streams capturing the motion of performing characters. Our method is also able to animate the 3D meshes according to the captured point cloud motion. MoRig's neural network encodes motion cues from the point clouds into features that are informative about the articulated parts of the performing character. These motion-aware features guide the inference of an appropriate skeletal rig for the input mesh, which is then animated based on the point cloud motion. Our method can rig and animate diverse characters, including humanoids, quadrupeds, and toys with varying articulation. It accounts for occluded regions in the point clouds and mismatches in the part proportions between the input mesh and captured character. Compared to other rigging approaches that ignore motion cues, MoRig produces more accurate rigs, well-suited for re-targeting motion from captured characters.

Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection arxiv:2210.09186 📈 152

Tiago P. Peixoto, Alec Kirkley

**Abstract:** The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.

Deep Bidirectional Language-Knowledge Graph Pretraining arxiv:2210.09338 📈 92

Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang, Jure Leskovec

**Abstract:** Pretraining a language model (LM) on text has been shown to help various downstream NLP tasks. Recent works show that a knowledge graph (KG) can complement text data, offering structured background knowledge that provides a useful scaffold for reasoning. However, these works are not pretrained to learn a deep fusion of the two modalities at scale, limiting the potential to acquire fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale. Specifically, our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities. We pretrain this model by unifying two self-supervised reasoning tasks, masked language modeling and KG link prediction. DRAGON outperforms existing LM and LM+KG models on diverse downstream tasks including question answering across general and biomedical domains, with +5% absolute gain on average. In particular, DRAGON achieves notable performance on complex reasoning about language and knowledge (+10% on questions involving long contexts or multi-step reasoning) and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art results on various BioNLP tasks. Our code and trained models are available at https://github.com/michiyasunaga/dragon.

A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models arxiv:2210.08804 📈 60

Yingcan Wei, Matthias Langer, Fan Yu, Minseok Lee, Kingsley Liu, Jerry Shi, Joey Wang

**Abstract:** Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework, that combines a high-performance GPU embedding cache with an hierarchical storage architecture, to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support and (4) light-weight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that our HPS can dramatically reduce end-to-end inference latency, achieving 5~62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, the HPS can also greatly increase the inference QPS.

Differentiable Physics Simulation of Dynamics-Augmented Neural Objects arxiv:2210.09420 📈 58

Simon Le Cleac'h, Hong-Xing Yu, Michelle Guo, Taylor A. Howell, Ruohan Gao, Jiajun Wu, Zachary Manchester, Mac Schwager

**Abstract:** We present a differentiable pipeline for simulating the motion of objects that represent their geometry as a continuous density field parameterized as a deep network. This includes Neural Radiance Fields (NeRFs), and other related models. From the density field, we estimate the dynamical properties of the object, including its mass, center of mass, and inertia matrix. We then introduce a differentiable contact model based on the density field for computing normal and friction forces resulting from collisions. This allows a robot to autonomously build object models that are visually and dynamically accurate from still images and videos of objects in motion. The resulting Dynamics-Augmented Neural Objects (DANOs) are simulated with an existing differentiable simulation engine, Dojo, interacting with other standard simulation objects, such as spheres, planes, and robots specified as URDFs. A robot can use this simulation to optimize grasps and manipulation trajectories of neural objects, or to improve the neural object models through gradient-based real-to-simulation transfer. We demonstrate the pipeline to learn the coefficient of friction of a bar of soap from a real video of the soap sliding on a table. We also learn the coefficient of friction and mass of a Stanford bunny through interactions with a Panda robot arm from synthetic data, and we optimize trajectories in simulation for the Panda arm to push the bunny to a goal location.

What Makes Convolutional Models Great on Long Sequence Modeling? arxiv:2210.09298 📈 40

Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey

**Abstract:** Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved. It requires sophisticated parameterization and initialization schemes. As a result, S4 is less intuitive and hard to use. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure that the weights for convolving with closer neighbors are larger than the more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses S4 on Long Range Arena and Speech Command datasets. 2) When plugging SGConv into standard language and vision models, it shows the potential to improve both efficiency and performance.

Visual Debates arxiv:2210.09015 📈 40

Avinash Kori, Ben Glocker, Francesca Toni

**Abstract:** The natural way of obtaining different perspectives on any given topic is by conducting a debate, where participants argue for and against the topic. Here, we propose a novel debate framework for understanding the classifier's reasoning for making a particular prediction by modelling it as a multiplayer sequential zero-sum game. The players aim to maximise their utilities by adjusting their arguments with respect to other players' counterarguments. The contrastive nature of our framework encourages players to put forward diverse arguments, picking up the reasoning trails missed by their opponents. Thus, our framework answers the question: why did the classifier make a certain prediction?, by allowing players to argue for and against the classifier's decision. In the proposed setup, given the question and the classifier's latent knowledge, both agents take turns in proposing arguments to support or contradict the classifier's decision; arguments here correspond to the selection of specific features from the discretised latent space of the continuous classifier. By the end of the debate, we collect sets of supportive and manipulative features, serving as an explanation depicting the internal reasoning of the classifier. We demonstrate our Visual Debates on the geometric SHAPE and MNIST datasets for subjective validation, followed by the high-resolution AFHQ dataset. For further investigation, our framework is available at \url{https://github.com/koriavinash1/VisualDebates}.

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation arxiv:2210.09549 📈 29

Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, Quan Bai

**Abstract:** Recently, diffusion models have been proven to perform remarkably well in text-to-image synthesis tasks in a number of studies, immediately presenting new study opportunities for image generation. Google's Imagen follows this research trend and outperforms DALLE2 as the best model for text-to-image generation. However, Imagen merely uses a T5 language model for text processing, which cannot ensure learning the semantic information of the text. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice in image processing. To address these issues, we propose the Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and involved in the diffusion model, effectively improving the quality of generated images. On top of that, we also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from the CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model by using three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.

Keep Me Updated! Memory Management in Long-term Conversations arxiv:2210.08750 📈 27

Sanghwan Bae, Donghyun Kwak, Soyoung Kang, Min Young Lee, Sungdong Kim, Yuin Jeong, Hyeri Kim, Sang-Woo Lee, Woomyoung Park, Nako Sung

**Abstract:** Remembering important information from the past and continuing to talk about it in the present are crucial in long-term conversations. However, previous literature does not deal with cases where the memorized information is outdated, which may cause confusion in later conversations. To address this issue, we present a novel task and a corresponding dataset of memory management in long-term conversations, in which bots keep track of and bring up the latest information about users while conversing through multiple sessions. In order to support more precise and interpretable memory, we represent memory as unstructured text descriptions of key information and propose a new mechanism of memory management that selectively eliminates invalidated or redundant information. Experimental results show that our approach outperforms the baselines that leave the stored memory unchanged in terms of engagingness and humanness, with larger performance gap especially in the later sessions.

Table-To-Text generation and pre-training with TabT5 arxiv:2210.09162 📈 22

Ewa Andrejczuk, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Yasemin Altun

**Abstract:** Encoder-only transformer models have been successfully applied to different table understanding tasks, as in TAPAS (Herzig et al., 2020). A major limitation of these architectures is that they are constrained to classification-like tasks such as cell selection or entailment detection. We present TABT5, an encoder-decoder model that generates natural language text based on tables and textual inputs. TABT5 overcomes the encoder-only limitation by incorporating a decoder component and leverages the input structure with table specific embeddings and pre-training. TABT5 achieves new state-of-the-art results on several domains, including spreadsheet formula prediction with a 15% increase in sequence accuracy, QA with a 2.5% increase in sequence accuracy and data-to-text generation with a 2.5% increase in BLEU.

Robust Imitation of a Few Demonstrations with a Backwards Model arxiv:2210.09337 📈 20

Jung Yeon Park, Lawson L. S. Wong

**Abstract:** Behavior cloning of expert demonstrations can speed up learning optimal policies in a more sample-efficient way over reinforcement learning. However, the policy cannot extrapolate well to unseen states outside of the demonstration data, creating covariate shift (agent drifting away from demonstrations) and compounding errors. In this work, we tackle this issue by extending the region of attraction around the demonstrations so that the agent can learn how to get back onto the demonstrated trajectories if it veers off-course. We train a generative backwards dynamics model and generate short imagined trajectories from states in the demonstrations. By imitating both demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto these paths. With optimal or near-optimal demonstrations, the learned policy will be both optimal and robust to deviations, with a wider region of attraction. On continuous control domains, we evaluate the robustness when starting from different initial states unseen in the demonstration data. While both our method and other imitation learning baselines can successfully solve the tasks for initial states in the training distribution, our method exhibits considerably more robustness to different initial states.

CS-MLGCN : Multiplex Graph Convolutional Networks for Community Search in Multiplex Networks arxiv:2210.08811 📈 20

Ali Behrouz, Farnoosh Hashemi

**Abstract:** Community Search (CS) is one of the fundamental tasks in network science and has attracted much attention due to its ability to discover personalized communities with a wide range of applications. Given any query nodes, CS seeks to find a densely connected subgraph containing query nodes. Most existing approaches usually study networks with a single type of proximity between nodes, which defines a single view of a network. However, in many applications such as biological, social, and transportation networks, interactions between objects span multiple aspects, yielding networks with multiple views, called multiplex networks. Existing CS approaches in multiplex networks adopt pre-defined subgraph patterns to model the communities, which cannot find communities that do not have such pre-defined patterns in real-world networks. In this paper, we propose a query-driven graph convolutional network in multiplex networks, CS-MLGCN, that can capture flexible community structures by learning from the ground-truth communities in a data-driven fashion. CS-MLGCN first combines the local query-dependent structure and global graph embedding in each type of proximity and then uses an attention mechanism to incorporate information on different types of relations. Experiments on real-world graphs with ground-truth communities validate the quality of the solutions we obtain and the efficiency of our model.

Unsupervised Object-Centric Learning with Bi-Level Optimized Query Slot Attention arxiv:2210.08990 📈 15

Baoxiong Jia, Yu Liu, Siyuan Huang

**Abstract:** The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by (1) initializing Slot-Attention modules with learnable queries and (2) optimizing the model with bi-level optimization. With simple code adjustments on the vanilla Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on both synthetic and complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin (~10%). We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits excellent potential for concept binding and zero-shot learning. We hope our effort could provide a single home for the design and learning of slot-based models and pave the way for more challenging tasks in object-centric learning. Our implementation is publicly available at https://github.com/Wall-Facer-liuyu/BO-QSA.

Affective Idiosyncratic Responses to Music arxiv:2210.09396 📈 13

Sky CH-Wang, Evan Li, Oliver Li, Smaranda Muresan, Zhou Yu

**Abstract:** Affective responses to music are highly personal. Despite consensus that idiosyncratic factors play a key role in regulating how listeners emotionally respond to music, precisely measuring the marginal effects of these variables has proved challenging. To address this gap, we develop computational methods to measure affective responses to music from over 403M listener comments on a Chinese social music platform. Building on studies from music psychology in systematic and quasi-causal analyses, we test for musical, lyrical, contextual, demographic, and mental health effects that drive listener affective responses. Finally, motivated by the social phenomenon known as wǎng-yì-yún, we identify influencing factors of platform user self-disclosures, the social support they receive, and notable differences in discloser user activity.

Defects of Convolutional Decoder Networks in Frequency Representation arxiv:2210.09020 📈 10

Ling Tang, Wen Shen, Zhanpeng Zhou, Yuefeng Chen, Quanshi Zhang

**Abstract:** In this paper, we prove representation bottlenecks of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we introduce the rule of the forward propagation of such intermediate-layer spectrum maps, which is equivalent to the forward propagation of feature maps through a convolutional layer. Based on this, we find that each frequency component in the spectrum map is forward propagated independently with other frequency components. Furthermore, we prove two bottlenecks in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum, in which strong signals repetitively appears at certain frequencies.

A Mixing Time Lower Bound for a Simplified Version of BART arxiv:2210.09352 📈 8

Omer Ronen, Theo Saarinen, Yan Shuo Tan, James Duncan, Bin Yu

**Abstract:** Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, biostatistics, and causal inference. BART uses Markov Chain Monte Carlo (MCMC) to obtain approximate posterior samples over a parameterized space of sums of trees, but it has often been observed that the chains are slow to mix. In this paper, we provide the first lower bound on the mixing time for a simplified version of BART in which we reduce the sum to a single tree and use a subset of the possible moves for the MCMC proposal distribution. Our lower bound for the mixing time grows exponentially with the number of data points. Inspired by this new connection between the mixing time and the number of data points, we perform rigorous simulations on BART. We show qualitatively that BART's mixing time increases with the number of data points. The slow mixing time of the simplified BART suggests a large variation between different runs of the simplified BART algorithm and a similar large variation is known for BART in the literature. This large variation could result in a lack of stability in the models, predictions, and posterior intervals obtained from the BART MCMC samples. Our lower bound and simulations suggest increasing the number of chains with the number of data points.

Self-Supervised Learning Through Efference Copies arxiv:2210.09224 📈 8

Franz Scherr, Qinghai Guo, Timoleon Moraitis

**Abstract:** Self-supervised learning (SSL) methods aim to exploit the abundance of unlabelled data for machine learning (ML), however the underlying principles are often method-specific. An SSL framework derived from biological first principles of embodied learning could unify the various SSL methods, help elucidate learning in the brain, and possibly improve ML. SSL commonly transforms each training datapoint into a pair of views, uses the knowledge of this pairing as a positive (i.e. non-contrastive) self-supervisory sign, and potentially opposes it to unrelated, (i.e. contrastive) negative examples. Here, we show that this type of self-supervision is an incomplete implementation of a concept from neuroscience, the Efference Copy (EC). Specifically, the brain also transforms the environment through efference, i.e. motor commands, however it sends to itself an EC of the full commands, i.e. more than a mere SSL sign. In addition, its action representations are likely egocentric. From such a principled foundation we formally recover and extend SSL methods such as SimCLR, BYOL, and ReLIC under a common theoretical framework, i.e. Self-supervision Through Efference Copies (S-TEC). Empirically, S-TEC restructures meaningfully the within- and between-class representations. This manifests as improvement in recent strong SSL baselines in image classification, segmentation, object detection, and in audio. These results hypothesize a testable positive influence from the brain's motor outputs onto its sensory representations.

On the Impact of Temporal Concept Drift on Model Explanations arxiv:2210.09197 📈 8

Zhixue Zhao, George Chrysostomou, Kalina Bontcheva, Nikolaos Aletras

**Abstract:** Explanation faithfulness of model predictions in natural language processing is typically evaluated on held-out data from the same temporal distribution as the training data (i.e. synchronous settings). While model performance often deteriorates due to temporal variation (i.e. temporal concept drift), it is currently unknown how explanation faithfulness is impacted when the time span of the target data is different from the data used to train the model (i.e. asynchronous settings). For this purpose, we examine the impact of temporal variation on model explanations extracted by eight feature attribution methods and three select-then-predict models across six text classification tasks. Our experiments show that (i)faithfulness is not consistent under temporal variations across feature attribution methods (e.g. it decreases or increases depending on the method), with an attention-based method demonstrating the most robust faithfulness scores across datasets; and (ii) select-then-predict models are mostly robust in asynchronous settings with only small degradation in predictive performance. Finally, feature attribution methods show conflicting behavior when used in FRESH (i.e. a select-and-predict model) and for measuring sufficiency/comprehensiveness (i.e. as post-hoc methods), suggesting that we need more robust metrics to evaluate post-hoc explanation faithfulness.

On the convergence of policy gradient methods to Nash equilibria in general stochastic games arxiv:2210.08857 📈 8

Angeliki Giannou, Kyriakos Lotidis, Panayotis Mertikopoulos, Emmanouil-Vasileios Vlatakis-Gkaragkounis

**Abstract:** Learning in stochastic games is a notoriously difficult problem because, in addition to each other's strategic decisions, the players must also contend with the fact that the game itself evolves over time, possibly in a very complicated manner. Because of this, the convergence properties of popular learning algorithms - like policy gradient and its variants - are poorly understood, except in specific classes of games (such as potential or two-player, zero-sum games). In view of this, we examine the long-run behavior of policy gradient methods with respect to Nash equilibrium policies that are second-order stationary (SOS) in a sense similar to the type of sufficiency conditions used in optimization. Our first result is that SOS policies are locally attracting with high probability, and we show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ distance-squared convergence rate if the method's step-size is chosen appropriately. Subsequently, specializing to the class of deterministic Nash policies, we show that this rate can be improved dramatically and, in fact, policy gradient methods converge within a finite number of iterations in that case.

Anisotropic Multi-Scale Graph Convolutional Network for Dense Shape Correspondence arxiv:2210.09466 📈 7

Mohammad Farazi, Wenhui Zhu, Zhangsihao Yang, Yalin Wang

**Abstract:** This paper studies 3D dense shape correspondence, a key shape analysis application in computer vision and graphics. We introduce a novel hybrid geometric deep learning-based model that learns geometrically meaningful and discretization-independent features with a U-Net model as the primary node feature extraction module, followed by a successive spectral-based graph convolutional network. To create a diverse set of filters, we use anisotropic wavelet basis filters, being sensitive to both different directions and band-passes. This filter set overcomes the over-smoothing behavior of conventional graph neural networks. To further improve the model's performance, we add a function that perturbs the feature maps in the last layer ahead of fully connected layers, forcing the network to learn more discriminative features overall. The resulting correspondence maps show state-of-the-art performance on the benchmark datasets based on average geodesic errors and superior robustness to discretization in 3D meshes. Our approach provides new insights and practical solutions to the dense shape correspondence research.

Robot Learning Theory of Mind through Self-Observation: Exploiting the Intentions-Beliefs Synergy arxiv:2210.09435 📈 7

Francesca Bianco, Dimitri Ognibene

**Abstract:** In complex environments, where the human sensory system reaches its limits, our behaviour is strongly driven by our beliefs about the state of the world around us. Accessing others' beliefs, intentions, or mental states in general, could thus allow for more effective social interactions in natural contexts. Yet these variables are not directly observable. Theory of Mind (TOM), the ability to attribute to other agents' beliefs, intentions, or mental states in general, is a crucial feature of human social interaction and has become of interest to the robotics community. Recently, new models that are able to learn TOM have been introduced. In this paper, we show the synergy between learning to predict low-level mental states, such as intentions and goals, and attributing high-level ones, such as beliefs. Assuming that learning of beliefs can take place by observing own decision and beliefs estimation processes in partially observable environments and using a simple feed-forward deep learning model, we show that when learning to predict others' intentions and actions, faster and more accurate predictions can be acquired if beliefs attribution is learnt simultaneously with action and intentions prediction. We show that the learning performance improves even when observing agents with a different decision process and is higher when observing beliefs-driven chunks of behaviour. We propose that our architectural approach can be relevant for the design of future adaptive social robots that should be able to autonomously understand and assist human partners in novel natural environments and tasks.

Learning Control Admissibility Models with Graph Neural Networks for Multi-Agent Navigation arxiv:2210.09378 📈 7

Chenning Yu, Hongzhan Yu, Sicun Gao

**Abstract:** Deep reinforcement learning in continuous domains focuses on learning control policies that map states to distributions over actions that ideally concentrate on the optimal choices in each step. In multi-agent navigation problems, the optimal actions depend heavily on the agents' density. Their interaction patterns grow exponentially with respect to such density, making it hard for learning-based methods to generalize. We propose to switch the learning objectives from predicting the optimal actions to predicting sets of admissible actions, which we call control admissibility models (CAMs), such that they can be easily composed and used for online inference for an arbitrary number of agents. We design CAMs using graph neural networks and develop training methods that optimize the CAMs in the standard model-free setting, with the additional benefit of eliminating the need for reward engineering typically required to balance collision avoidance and goal-reaching requirements. We evaluate the proposed approach in multi-agent navigation environments. We show that the CAM models can be trained in environments with only a few agents and be easily composed for deployment in dense environments with hundreds of agents, achieving better performance than state-of-the-art methods.

Model Predictive Control via On-Policy Imitation Learning arxiv:2210.09206 📈 7

Kwangjun Ahn, Zakaria Mhammedi, Horia Mania, Zhang-Wei Hong, Ali Jadbabaie

**Abstract:** In this paper, we leverage the rapid advances in imitation learning, a topic of intense recent focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results and performance guarantees for data-driven Model Predictive Control (MPC) for constrained linear systems. In its simplest form, imitation learning is an approach that tries to learn an expert policy by querying samples from an expert. Recent approaches to data-driven MPC have used the simplest form of imitation learning known as behavior cloning to learn controllers that mimic the performance of MPC by online sampling of the trajectories of the closed-loop MPC system. Behavior cloning, however, is a method that is known to be data inefficient and suffer from distribution shifts. As an alternative, we develop a variant of the forward training algorithm which is an on-policy imitation learning method proposed by Ross et al. (2010). Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to theoretically bound the number of online MPC trajectories needed to achieve optimal performance. We validate our results through simulations and show that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.

Modeling the Lighting in Scenes as Style for Auto White-Balance Correction arxiv:2210.09090 📈 7

Furkan Kınlı, Doğa Yılmaz, Barış Özcan, Furkan Kıraç

**Abstract:** Style may refer to different concepts (e.g. painting style, hairstyle, texture, color, filter, etc.) depending on how the feature space is formed. In this work, we propose a novel idea of interpreting the lighting in the single- and multi-illuminant scenes as the concept of style. To verify this idea, we introduce an enhanced auto white-balance (AWB) method that models the lighting in single- and mixed-illuminant scenes as the style factor. Our AWB method does not require any illumination estimation step, yet contains a network learning to generate the weighting maps of the images with different WB settings. Proposed network utilizes the style information, extracted from the scene by a multi-head style extraction module. AWB correction is completed after blending these weighting maps and the scene. Experiments on single- and mixed-illuminant datasets demonstrate that our proposed method achieves promising correction results when compared to the recent works. This shows that the lighting in the scenes with multiple illuminations can be modeled by the concept of style. Source code and trained models are available on https://github.com/birdortyedi/lighting-as-style-awb-correction.

Multi-Agent Automated Machine Learning arxiv:2210.09084 📈 7

Zhaozhi Wang, Kefan Su, Jian Zhang, Huizhu Jia, Qixiang Ye, Xiaodong Xie, Zongqing Lu

**Abstract:** In this paper, we propose multi-agent automated machine learning (MA2ML) with the aim to effectively handle joint optimization of modules in automated machine learning (AutoML). MA2ML takes each machine learning module, such as data augmentation (AUG), neural architecture search (NAS), or hyper-parameters (HPO), as an agent and the final performance as the reward, to formulate a multi-agent reinforcement learning problem. MA2ML explicitly assigns credit to each agent according to its marginal contribution to enhance cooperation among modules, and incorporates off-policy learning to improve search efficiency. Theoretically, MA2ML guarantees monotonic improvement of joint optimization. Extensive experiments show that MA2ML yields the state-of-the-art top-1 accuracy on ImageNet under constraints of computational cost, e.g., $79.7\%/80.5\%$ with FLOPs fewer than 600M/800M. Extensive ablation studies verify the benefits of credit assignment and off-policy learning of MA2ML.

Histopathological Image Classification based on Self-Supervised Vision Transformer and Weak Labels arxiv:2210.09021 📈 7

Ahmet Gokberk Gul, Oezdemir Cetin, Christoph Reich, Tim Prangemeier, Nadine Flinner, Heinz Koeppl

**Abstract:** Whole Slide Image (WSI) analysis is a powerful method to facilitate the diagnosis of cancer in tissue samples. Automating this diagnosis poses various issues, most notably caused by the immense image resolution and limited annotations. WSIs commonly exhibit resolutions of 100Kx100K pixels. Annotating cancerous areas in WSIs on the pixel level is prohibitively labor-intensive and requires a high level of expert knowledge. Multiple instance learning (MIL) alleviates the need for expensive pixel-level annotations. In MIL, learning is performed on slide-level labels, in which a pathologist provides information about whether a slide includes cancerous tissue. Here, we propose Self-ViT-MIL, a novel approach for classifying and localizing cancerous areas based on slide-level annotations, eliminating the need for pixel-wise annotated training data. Self-ViT- MIL is pre-trained in a self-supervised setting to learn rich feature representation without relying on any labels. The recent Vision Transformer (ViT) architecture builds the feature extractor of Self-ViT-MIL. For localizing cancerous regions, a MIL aggregator with global attention is utilized. To the best of our knowledge, Self-ViT- MIL is the first approach to introduce self-supervised ViTs in MIL-based WSI analysis tasks. We showcase the effectiveness of our approach on the common Camelyon16 dataset. Self-ViT-MIL surpasses existing state-of-the-art MIL-based approaches in terms of accuracy and area under the curve (AUC).

Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity arxiv:2210.09579 📈 6

Abhishek Gupta, Aldo Pacchiano, Yuexiang Zhai, Sham M. Kakade, Sergey Levine

**Abstract:** Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results -- while in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that provide the agent with some hints about how the task should be completed. The idea of this type of ``reward-shaping'' has been often discussed in the literature, and is often a critical part of practical applications, but there is relatively little formal characterization of how the choice of reward shaping can yield benefits in sample complexity. In this work, we build on the framework of novelty-based exploration to provide a simple scheme for incorporating shaped rewards into RL along with an analysis tool to show that particular choices of reward shaping provably improve sample efficiency. We characterize the class of problems where these gains are expected to be significant and show how this can be connected to practical algorithms in the literature. We confirm that these results hold in practice in an experimental evaluation, providing an insight into the mechanisms through which reward shaping can significantly improve the complexity of reinforcement learning while retaining asymptotic performance.

Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models arxiv:2210.09545 📈 6

Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun

**Abstract:** Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. Although the clean weights of PLMs are readily available, existing methods have ignored this information in defending NLP models against backdoor attacks. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models. Specifically, we leverage the clean pre-trained weights via two complementary techniques: (1) a two-step Fine-mixing technique, which first mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained weights, then fine-tunes the mixed weights on a small subset of clean data; (2) an Embedding Purification (E-PUR) technique, which mitigates potential backdoors existing in the word embeddings. We compare Fine-mixing with typical backdoor mitigation methods on three single-sentence sentiment classification tasks and two sentence-pair classification tasks and show that it outperforms the baselines by a considerable margin in all scenarios. We also show that our E-PUR method can benefit existing mitigation methods. Our work establishes a simple but strong baseline defense for secure fine-tuned NLP models against backdoor attacks.

Extensible Proxy for Efficient NAS arxiv:2210.09459 📈 6

Yuhong Li, Jiajie Li, Cong Han, Pan Li, Jinjun Xiong, Deming Chen

**Abstract:** Neural Architecture Search (NAS) has become a de facto approach in the recent trend of AutoML to design deep neural networks (DNNs). Efficient or near-zero-cost NAS proxies are further proposed to address the demanding computational issues of NAS, where each candidate architecture network only requires one iteration of backpropagation. The values obtained from the proxies are considered the predictions of architecture performance on downstream tasks. However, two significant drawbacks hinder the extended usage of Efficient NAS proxies. (1) Efficient proxies are not adaptive to various search spaces. (2) Efficient proxies are not extensible to multi-modality downstream tasks. Based on the observations, we design a Extensible proxy (Eproxy) that utilizes self-supervised, few-shot training (i.e., 10 iterations of backpropagation) which yields near-zero costs. The key component that makes Eproxy efficient is an untrainable convolution layer termed barrier layer that add the non-linearities to the optimization spaces so that the Eproxy can discriminate the performance of architectures in the early stage. Furthermore, to make Eproxy adaptive to different downstream tasks/search spaces, we propose a Discrete Proxy Search (DPS) to find the optimized training settings for Eproxy with only handful of benchmarked architectures on the target tasks. Our extensive experiments confirm the effectiveness of both Eproxy and Eproxy+DPS. Code is available at https://github.com/leeyeehoo/GenNAS-Zero.

Review Learning: Alleviating Catastrophic Forgetting with Generative Replay without Generator arxiv:2210.09394 📈 6

Jaesung Yoo, Sunghyuk Choi, Ye Seul Yang, Suhyeon Kim, Jieun Choi, Dongkyeong Lim, Yaeji Lim, Hyung Joon Joo, Dae Jung Kim, Rae Woong Park, Hyeong-Jin Yoon, Kwangsoo Kim

**Abstract:** When a deep learning model is sequentially trained on different datasets, it forgets the knowledge acquired from previous data, a phenomenon known as catastrophic forgetting. It deteriorates performance of the deep learning model on diverse datasets, which is critical in privacy-preserving deep learning (PPDL) applications based on transfer learning (TL). To overcome this, we propose review learning (RL), a generative-replay-based continual learning technique that does not require a separate generator. Data samples are generated from the memory stored within the synaptic weights of the deep learning model which are used to review knowledge acquired from previous datasets. The performance of RL was validated through PPDL experiments. Simulations and real-world medical multi-institutional experiments were conducted using three types of binary classification electronic health record data. In the real-world experiments, the global area under the receiver operating curve was 0.710 for RL and 0.655 for TL. Thus, RL was highly effective in retaining previously learned knowledge.

Learning Diversified Feature Representations for Facial Expression Recognition in the Wild arxiv:2210.09381 📈 6

Negar Heidari, Alexandros Iosifidis

**Abstract:** Diversity of the features extracted by deep neural networks is important for enhancing the model generalization ability and accordingly its performance in different learning tasks. Facial expression recognition in the wild has attracted interest in recent years due to the challenges existing in this area for extracting discriminative and informative features from occluded images in real-world scenarios. In this paper, we propose a mechanism to diversify the features extracted by CNN layers of state-of-the-art facial expression recognition architectures for enhancing the model capacity in learning discriminative features. To evaluate the effectiveness of the proposed approach, we incorporate this mechanism in two state-of-the-art models to (i) diversify local/global features in an attention-based model and (ii) diversify features extracted by different learners in an ensemble-based model. Experimental results on three well-known facial expression recognition in-the-wild datasets, AffectNet, FER+, and RAF-DB, show the effectiveness of our method, achieving the state-of-the-art performance of 89.99% on RAF-DB, 89.34% on FER+ and the competitive accuracy of 60.02% on AffectNet dataset.

6th Place Solution to Google Universal Image Embedding arxiv:2210.09377 📈 6

S. Gkelios, A. Kastellos, S. Chatzichristofis

**Abstract:** This paper presents the 6th place solution to the Google Universal Image Embedding competition on Kaggle. Our approach is based on the CLIP architecture, a powerful pre-trained model used to learn visual representation from natural language supervision. We also utilized the SubCenter ArcFace loss with dynamic margins to improve the distinctive power of class separability and embeddings. Finally, a diverse dataset has been created based on the test's set categories and the leaderboard's feedback. By carefully crafting a training scheme to enhance transfer learning, our submission scored 0.685 on the private leaderboard.

RibSeg v2: A Large-scale Benchmark for Rib Labeling and Anatomical Centerline Extraction arxiv:2210.09309 📈 6

Liang Jin, Shixuan Gu, Donglai Wei, Kaiming Kuang, Hanspeter Pfister, Bingbing Ni, Jiancheng Yang, Ming Li

**Abstract:** Automatic rib labeling and anatomical centerline extraction are common prerequisites for various clinical applications. Prior studies either use in-house datasets that are inaccessible to communities, or focus on rib segmentation that neglects the clinical significance of rib labeling. To address these issues, we extend our prior dataset (RibSeg) on the binary rib segmentation task to a comprehensive benchmark, named RibSeg v2, with 660 CT scans (15,466 individual ribs in total) and annotations manually inspected by experts for rib labeling and anatomical centerline extraction. Based on the RibSeg v2, we develop a pipeline including deep learning-based methods for rib labeling, and a skeletonization-based method for centerline extraction. To improve computational efficiency, we propose a sparse point cloud representation of CT scans and compare it with standard dense voxel grids. Moreover, we design and analyze evaluation metrics to address the key challenges of each task. Our dataset, code, and model are available online to facilitate open research at https://github.com/M3DV/RibSeg

Mitigating Covertly Unsafe Text within Natural Language Systems arxiv:2210.09306 📈 6

Alex Mei, Anisha Kabir, Sharon Levy, Melanie Subbiah, Emily Allaway, John Judge, Desmond Patton, Bruce Bimber, Kathleen McKeown, William Yang Wang

**Abstract:** An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

DE-CROP: Data-efficient Certified Robustness for Pretrained Classifiers arxiv:2210.08929 📈 6

Gaurav Kumar Nayak, Ruchit Rawal, Anirban Chakraborty

**Abstract:** Certified defense using randomized smoothing is a popular technique to provide robustness guarantees for deep neural networks against l2 adversarial attacks. Existing works use this technique to provably secure a pretrained non-robust model by training a custom denoiser network on entire training data. However, access to the training set may be restricted to a handful of data samples due to constraints such as high transmission cost and the proprietary nature of the data. Thus, we formulate a novel problem of "how to certify the robustness of pretrained models using only a few training samples". We observe that training the custom denoiser directly using the existing techniques on limited samples yields poor certification. To overcome this, our proposed approach (DE-CROP) generates class-boundary and interpolated samples corresponding to each training sample, ensuring high diversity in the feature space of the pretrained classifier. We train the denoiser by maximizing the similarity between the denoised output of the generated sample and the original training sample in the classifier's logit space. We also perform distribution level matching using domain discriminator and maximum mean discrepancy that yields further benefit. In white box setup, we obtain significant improvements over the baseline on multiple benchmark datasets and also report similar performance under the challenging black box setup.

Routine Usage of AI-based Chest X-ray Reading Support in a Multi-site Medical Supply Center arxiv:2210.10779 📈 5

Karsten Ridder, Alexander Preuhs, Axel Mertins, Clemens Joerger

**Abstract:** Research question: How can we establish an AI support for reading of chest X-rays in clinical routine and which benefits emerge for the clinicians and radiologists. Can it perform 24/7 support for practicing clinicians? 2. Findings: We installed an AI solution for Chest X-ray in a given structure (MVZ Uhlenbrock & Partner, Germany). We could demonstrate the practicability, performance, and benefits in 10 connected clinical sites. 3. Meaning: A commercially available AI solution for the evaluation of Chest X-ray images is able to help radiologists and clinical colleagues 24/7 in a complex environment. The system performs in a robust manner, supporting radiologists and clinical colleagues in their important decisions, in practises and hospitals regardless of the user and X-ray system type producing the image-data.

Review of the state of the art in autonomous artificial intelligence arxiv:2210.10659 📈 5

Petar Radanliev, David De Roure

**Abstract:** This article presents a new design for autonomous artificial intelligence (AI), based on the state-of-the-art algorithms, and describes a new autonomous AI system called AutoAI. The methodology is used to assemble the design founded on self-improved algorithms that use new and emerging sources of data (NEFD). The objective of the article is to conceptualise the design of a novel AutoAI algorithm. The conceptual approach is used to advance into building new and improved algorithms. The article integrates and consolidates the findings from existing literature and advances the AutoAI design into (1) using new and emerging sources of data for teaching and training AI algorithms and (2) enabling AI algorithms to use automated tools for training new and improved algorithms. This approach is going beyond the state-of-the-art in AI algorithms and suggests a design that enables autonomous algorithms to self-optimise and self-adapt, and on a higher level, be capable to self-procreate.

Learning Less Generalizable Patterns with an Asymmetrically Trained Double Classifier for Better Test-Time Adaptation arxiv:2210.09834 📈 5

Thomas Duboudin, Emmanuel Dellandréa, Corentin Abgrall, Gilles Hénaff, Liming Chen

**Abstract:** Deep neural networks often fail to generalize outside of their training distribution, in particular when only a single data domain is available during training. While test-time adaptation has yielded encouraging results in this setting, we argue that, to reach further improvements, these approaches should be combined with training procedure modifications aiming to learn a more diverse set of patterns. Indeed, test-time adaptation methods usually have to rely on a limited representation because of the shortcut learning phenomenon: only a subset of the available predictive patterns is learned with standard training. In this paper, we first show that the combined use of existing training-time strategies, and test-time batch normalization, a simple adaptation method, does not always improve upon the test-time adaptation alone on the PACS benchmark. Furthermore, experiments on Office-Home show that very few training-time methods improve upon standard training, with or without test-time batch normalization. We therefore propose a novel approach using a pair of classifiers and a shortcut patterns avoidance loss that mitigates the shortcut learning behavior by reducing the generalization ability of the secondary classifier, using the additional shortcut patterns avoidance loss that encourages the learning of samples specific patterns. The primary classifier is trained normally, resulting in the learning of both the natural and the more complex, less generalizable, features. Our experiments show that our method improves upon the state-of-the-art results on both benchmarks and benefits the most to test-time batch normalization.

Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving arxiv:2210.09539 📈 5

Eli Bronstein, Mark Palatucci, Dominik Notz, Brandyn White, Alex Kuefler, Yiren Lu, Supratik Paul, Payam Nikdel, Paul Mougin, Hongge Chen, Justin Fu, Austin Abrams, Punit Shah, Evan Racah, Benjamin Frenkel, Shimon Whiteson, Dragomir Anguelov

**Abstract:** We demonstrate the first large-scale application of model-based generative adversarial imitation learning (MGAIL) to the task of dense urban self-driving. We augment standard MGAIL using a hierarchical model to enable generalization to arbitrary goal routes, and measure performance using a closed-loop evaluation framework with simulated interactive agents. We train policies from expert trajectories collected from real vehicles driving over 100,000 miles in San Francisco, and demonstrate a steerable policy that can navigate robustly even in a zero-shot setting, generalizing to synthetic scenarios with novel goals that never occurred in real-world driving. We also demonstrate the importance of mixing closed-loop MGAIL losses with open-loop behavior cloning losses, and show our best policy approaches the performance of the expert. We evaluate our imitative model in both average and challenging scenarios, and show how it can serve as a useful prior to plan successful trajectories.

CEIP: Combining Explicit and Implicit Priors for Reinforcement Learning with Demonstrations arxiv:2210.09496 📈 5

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

**Abstract:** Although reinforcement learning has found widespread use in dense reward settings, training autonomous agents with sparse rewards remains challenging. To address this difficulty, prior work has shown promising results when using not only task-specific demonstrations but also task-agnostic albeit somewhat related demonstrations. In most cases, the available demonstrations are distilled into an implicit prior, commonly represented via a single deep net. Explicit priors in the form of a database that can be queried have also been shown to lead to encouraging results. To better benefit from available demonstrations, we develop a method to Combine Explicit and Implicit Priors (CEIP). CEIP exploits multiple implicit priors in the form of normalizing flows in parallel to form a single complex prior. Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism to condition the implicit priors. In three challenging environments, we find the proposed CEIP method to improve upon sophisticated state-of-the-art techniques.

Understanding CNN Fragility When Learning With Imbalanced Data arxiv:2210.09465 📈 5

Damien Dablain, Kristen N. Jacobson, Colin Bellinger, Mark Roberts, Nitesh Chawla

**Abstract:** Convolutional neural networks (CNNs) have achieved impressive results on imbalanced image data, but they still have difficulty generalizing to minority classes and their decisions are difficult to interpret. These problems are related because the method by which CNNs generalize to minority classes, which requires improvement, is wrapped in a blackbox. To demystify CNN decisions on imbalanced data, we focus on their latent features. Although CNNs embed the pattern knowledge learned from a training set in model parameters, the effect of this knowledge is contained in feature and classification embeddings (FE and CE). These embeddings can be extracted from a trained model and their global, class properties (e.g., frequency, magnitude and identity) can be analyzed. We find that important information regarding the ability of a neural network to generalize to minority classes resides in the class top-K CE and FE. We show that a CNN learns a limited number of class top-K CE per category, and that their number and magnitudes vary based on whether the same class is balanced or imbalanced. This calls into question whether a CNN has learned intrinsic class features, or merely frequently occurring ones that happen to exist in the sampled class distribution. We also hypothesize that latent class diversity is as important as the number of class examples, which has important implications for re-sampling and cost-sensitive methods. These methods generally focus on rebalancing model weights, class numbers and margins; instead of diversifying class latent features through augmentation. We also demonstrate that a CNN has difficulty generalizing to test data if the magnitude of its top-K latent features do not match the training set. We use three popular image datasets and two cost-sensitive algorithms commonly employed in imbalanced learning for our experiments.

Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning arxiv:2210.09452 📈 5

Kangning Liu, Weicheng Zhu, Yiqiu Shen, Sheng Liu, Narges Razavian, Krzysztof J. Geras, Carlos Fernandez-Granda

**Abstract:** Learning representations for individual instances when only bag-level labels are available is a fundamental challenge in multiple instance learning (MIL). Recent works have shown promising results using contrastive self-supervised learning (CSSL), which learns to push apart representations corresponding to two different randomly-selected instances. Unfortunately, in real-world applications such as medical image classification, there is often class imbalance, so randomly-selected instances mostly belong to the same majority class, which precludes CSSL from learning inter-class differences. To address this issue, we propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from the bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of pseudo labels. We evaluate ItS2CLR on three medical datasets, showing that it improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in terms of both bag and instance level accuracy.

Deformably-Scaled Transposed Convolution arxiv:2210.09446 📈 5

Stefano B. Blumberg, Daniele Raví, Mou-Cheng Xu, Matteo Figini, Iasonas Kokkinos, Daniel C. Alexander

**Abstract:** Transposed convolution is crucial for generating high-resolution outputs, yet has received little attention compared to convolution layers. In this work we revisit transposed convolution and introduce a novel layer that allows us to place information in the image selectively and choose the `stroke breadth' at which the image is synthesized, whilst incurring a small additional parameter cost. For this we introduce three ideas: firstly, we regress offsets to the positions where the transpose convolution results are placed; secondly we broadcast the offset weight locations over a learnable neighborhood; and thirdly we use a compact parametrization to share weights and restrict offsets. We show that simply substituting upsampling operators with our novel layer produces substantial improvements across tasks as diverse as instance segmentation, object detection, semantic segmentation, generative image modeling, and 3D magnetic resonance image enhancement, while outperforming all existing variants of transposed convolutions. Our novel layer can be used as a drop-in replacement for 2D and 3D upsampling operators and the code will be publicly available.

A Pilot Study on Teacher-Facing Real-Time Classroom Game Dashboards arxiv:2210.09427 📈 5

Luke Swanson, David Gagnon, Jennifer Scianna

**Abstract:** Educational games are an increasingly popular teaching tool in modern classrooms. However, the development of complementary tools for teachers facilitating classroom gameplay is lacking. We present the results of a participatory design process for a teacher-facing, real-time game data dashboard. This two-phase process included a workshop to elicit teachers' requirements for such a tool, and a pilot study of our dashboard prototype. We analyze post-gameplay survey and interview data to understand teachers' experiences with the tool in terms of evidence of co-design, feasibility, and effectiveness. Our results indicate the participatory design yielded a tool both useful for and usable by teachers within the context of a real class gameplay session. We advocate for the continued development of data-driven teacher tools to improve the effectiveness of games deployed in the classroom.

A Transfer Learning Based Approach for Classification of COVID-19 and Pneumonia in CT Scan Imaging arxiv:2210.09403 📈 5

Gargi Desai, Nelly Elsayed, Zag Elsayed, Murat Ozer

**Abstract:** The world is still overwhelmed by the spread of the COVID-19 virus. With over 250 Million infected cases as of November 2021 and affecting 219 countries and territories, the world remains in the pandemic period. Detecting COVID-19 using the deep learning method on CT scan images can play a vital role in assisting medical professionals and decision authorities in controlling the spread of the disease and providing essential support for patients. The convolution neural network is widely used in the field of large-scale image recognition. The current method of RT-PCR to diagnose COVID-19 is time-consuming and universally limited. This research aims to propose a deep learning-based approach to classify COVID-19 pneumonia patients, bacterial pneumonia, viral pneumonia, and healthy (normal cases). This paper used deep transfer learning to classify the data via Inception-ResNet-V2 neural network architecture. The proposed model has been intentionally simplified to reduce the implementation cost so that it can be easily implemented and used in different geographical areas, especially rural and developing regions.

Neural Contact Fields: Tracking Extrinsic Contact with Tactile Sensing arxiv:2210.09297 📈 5

Carolina Higuera, Siyuan Dong, Byron Boots, Mustafa Mukadam

**Abstract:** We present Neural Contact Fields, a method that brings together neural fields and tactile sensing to address the problem of tracking extrinsic contact between object and environment. Knowing where the external contact occurs is a first step towards methods that can actively control it in facilitating downstream manipulation tasks. Prior work for localizing environmental contacts typically assume a contact type (e.g. point or line), does not capture contact/no-contact transitions, and only works with basic geometric-shaped objects. Neural Contact Fields are the first method that can track arbitrary multi-modal extrinsic contacts without making any assumptions about the contact type. Our key insight is to estimate the probability of contact for any 3D point in the latent space of object shapes, given vision-based tactile inputs that sense the local motion resulting from the external contact. In experiments, we find that Neural Contact Fields are able to localize multiple contact patches without making any assumptions about the geometry of the contact, and capture contact/no-contact transitions for known categories of objects with unseen shapes in unseen environment configurations. In addition to Neural Contact Fields, we also release our YCB-Extrinsic-Contact dataset of simulated extrinsic contact interactions to enable further research in this area. Project repository: https://github.com/carolinahiguera/NCF

CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection arxiv:2210.09267 📈 5

Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, Dragomir Anguelov

**Abstract:** Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.

ZooD: Exploiting Model Zoo for Out-of-Distribution Generalization arxiv:2210.09236 📈 5

Qishi Dong, Awais Muhammad, Fengwei Zhou, Chuanlong Xie, Tianyang Hu, Yongxin Yang, Sung-Ho Bae, Zhenguo Li

**Abstract:** Recent advances on large-scale pre-training have shown great potentials of leveraging a large set of Pre-Trained Models (PTMs) for improving Out-of-Distribution (OoD) generalization, for which the goal is to perform well on possible unseen domains after fine-tuning on multiple training domains. However, maximally exploiting a zoo of PTMs is challenging since fine-tuning all possible combinations of PTMs is computationally prohibitive while accurate selection of PTMs requires tackling the possible data distribution shift for OoD tasks. In this work, we propose ZooD, a paradigm for PTMs ranking and ensemble with feature selection. Our proposed metric ranks PTMs by quantifying inter-class discriminability and inter-domain stability of the features extracted by the PTMs in a leave-one-domain-out cross-validation manner. The top-K ranked models are then aggregated for the target OoD task. To avoid accumulating noise induced by model ensemble, we propose an efficient variational EM algorithm to select informative features. We evaluate our paradigm on a diverse model zoo consisting of 35 models for various OoD tasks and demonstrate: (i) model ranking is better correlated with fine-tuning ranking than previous methods and up to 9859x faster than brute-force fine-tuning; (ii) OoD generalization after model ensemble with feature selection outperforms the state-of-the-art methods and the accuracy on most challenging task DomainNet is improved from 46.5\% to 50.6\%. Furthermore, we provide the fine-tuning results of 35 PTMs on 7 OoD datasets, hoping to help the research of model zoo and OoD generalization. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/zood.

Packed-Ensembles for Efficient Uncertainty Estimation arxiv:2210.09184 📈 5

Olivier Laurent, Adrien Lafage, Enzo Tartaglione, Geoffrey Daniel, Jean-Marc Martinez, Andrei Bursuc, Gianni Franchi

**Abstract:** Deep Ensembles (DE) are a prominent approach achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, hardware limitations of real-world systems constrain to smaller ensembles and lower capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single common backbone and forward pass to improve training and inference speeds. PE is designed to work under the memory budget of a single standard neural network. Through extensive studies we show that PE faithfully preserve the properties of DE, e.g., diversity, and match their performance in terms of accuracy, calibration, out-of-distribution detection and robustness to distribution shift.

AIM 2022 Challenge on Instagram Filter Removal: Methods and Results arxiv:2210.08997 📈 5

Furkan Kınlı, Sami Menteş, Barış Özcan, Furkan Kıraç, Radu Timofte, Yi Zuo, Zitao Wang, Xiaowen Zhang, Yu Zhu, Chenghua Li, Cong Leng, Jian Cheng, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Tianzhi Ma, Zihan Gao, Wenxin He, Woon-Ha Yeo, Wang-Taek Oh, Young-Il Kim, Han-Cheol Ryu, Gang He

**Abstract:** This paper introduces the methods and the results of AIM 2022 challenge on Instagram Filter Removal. Social media filters transform the images by consecutive non-linear operations, and the feature maps of the original content may be interpolated into a different domain. This reduces the overall performance of the recent deep learning strategies. The main goal of this challenge is to produce realistic and visually plausible images where the impact of the filters applied is mitigated while preserving the content. The proposed solutions are ranked in terms of the PSNR value with respect to the original images. There are two prior studies on this task as the baseline, and a total of 9 teams have competed in the final phase of the challenge. The comparison of qualitative results of the proposed solutions and the benchmark for the challenge are presented in this report.

Robust Planning for Human-Robot Joint Tasks with Explicit Reasoning on Human Mental State arxiv:2210.08879 📈 5

Anthony Favier, Shashank Shekhar, Rachid Alami

**Abstract:** We consider the human-aware task planning problem where a human-robot team is given a shared task with a known objective to achieve. Recent approaches tackle it by modeling it as a team of independent, rational agents, where the robot plans for both agents' (shared) tasks. However, the robot knows that humans cannot be administered like artificial agents, so it emulates and predicts the human's decisions, actions, and reactions. Based on earlier approaches, we describe a novel approach to solve such problems, which models and uses execution-time observability conventions. Abstractly, this modeling is based on situation assessment, which helps our approach capture the evolution of individual agents' beliefs and anticipate belief divergences that arise in practice. It decides if and when belief alignment is needed and achieves it with communication. These changes improve the solver's performance: (a) communication is effectively used, and (b) robust for more realistic and challenging problems.

Industry-Scale Orchestrated Federated Learning for Drug Discovery arxiv:2210.08871 📈 5

Martijn Oldenhof, Gergely Ács, Balázs Pejó, Ansgar Schuffenhauer, Nicholas Holway, Noé Sturm, Arne Dieckmann, Oliver Fortmeier, Eric Boniface, Clément Mayer, Arnaud Gohier, Peter Schmidtke, Ritsuya Niwayama, Dieter Kopecky, Lewis Mervin, Prakash Chandra Rathi, Lukas Friedrich, András Formanek, Peter Antal, Jordon Rahaman, Adam Zalewski, Ezron Oluoch, Manuel Stößel, Michal Vančo, David Endico

**Abstract:** To apply federated learning to drug discovery we developed a novel platform in the context of European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which was comprised of 10 pharmaceutical companies, academic research labs, large industrial companies and startups. To the best of our knowledge, The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.

How "troll" are you? Measuring and detecting troll behavior in online social networks arxiv:2210.08786 📈 5

Fatima Ezzeddine, Luca Luceri, Omran Ayoub, Ihab Sbeity, Gianluca Nogara, Emilio Ferrara, Silvia Giordano

**Abstract:** The detection of state-sponsored trolls acting in misinformation operations is an unsolved and critical challenge for the research community, with repercussions that go beyond the online realm. In this paper, we propose a novel approach for the detection of troll accounts, which consists of two steps. The first step aims at classifying trajectories of accounts' online activities as belonging to either a troll account or to an organic user account. In the second step, we exploit the classified trajectories to compute a metric, namely "troll score", which allows us to quantify the extent to which an account behaves like a troll. Experimental results show that our approach identifies accounts' trajectories with an AUC close to 99% and, accordingly, classify trolls and organic users with an AUC of 97%. Finally, we evaluate whether the proposed solution can be generalized to different contexts (e.g., discussions about Covid-19) and generic misbehaving users, showing promising results that will be further expanded in our future endeavors.

Bag of Tricks for Developing Diabetic Retinopathy Analysis Framework to Overcome Data Scarcity arxiv:2210.09558 📈 4

Gitaek Kwon, Eunjin Kim, Sunho Kim, Seongwon Bak, Minsung Kim, Jaeyoung Kim

**Abstract:** Recently, diabetic retinopathy (DR) screening utilizing ultra-wide optical coherence tomography angiography (UW-OCTA) has been used in clinical practices to detect signs of early DR. However, developing a deep learning-based DR analysis system using UW-OCTA images is not trivial due to the difficulty of data collection and the absence of public datasets. By realistic constraints, a model trained on small datasets may obtain sub-par performance. Therefore, to help ophthalmologists be less confused about models' incorrect decisions, the models should be robust even in data scarcity settings. To address the above practical challenging, we present a comprehensive empirical study for DR analysis tasks, including lesion segmentation, image quality assessment, and DR grading. For each task, we introduce a robust training scheme by leveraging ensemble learning, data augmentation, and semi-supervised learning. Furthermore, we propose reliable pseudo labeling that excludes uncertain pseudo-labels based on the model's confidence scores to reduce the negative effect of noisy pseudo-labels. By exploiting the proposed approaches, we achieved 1st place in the Diabetic Retinopathy Analysis Challenge.

An Improved Structured Mesh Generation Method Based on Physics-informed Neural Networks arxiv:2210.09546 📈 4

Xinhai Chen, Jie Liu, Junjun Yan, Zhichao Wang, Chunye Gong

**Abstract:** Mesh generation remains a key technology in many areas where numerical simulations are required. As numerical algorithms become more efficient and computers become more powerful, the percentage of time devoted to mesh generation becomes higher. In this paper, we present an improved structured mesh generation method. The method formulates the meshing problem as a global optimization problem related to a physics-informed neural network. The mesh is obtained by intelligently solving the physical boundary-constrained partial differential equations. To improve the prediction accuracy of the neural network, we also introduce a novel auxiliary line strategy and an efficient network model during meshing. The strategy first employs a priori auxiliary lines to provide ground truth data and then uses these data to construct a loss term to better constrain the convergence of the subsequent training. The experimental results indicate that the proposed method is effective and robust. It can accurately approximate the mapping (transformation) from the computational domain to the physical domain and enable fast high-quality structured mesh generation.

Team Flow at DRC2022: Pipeline System for Travel Destination Recommendation Task in Spoken Dialogue arxiv:2210.09518 📈 4

Ryu Hirai, Atsumoto Ohashi, Ao Guo, Hideki Shiroma, Xulin Zhou, Yukihiko Tone, Shinya Iizuka, Ryuichiro Higashinaka

**Abstract:** To improve the interactive capabilities of a dialogue system, e.g., to adapt to different customers, the Dialogue Robot Competition (DRC2022) was held. As one of the teams, we built a dialogue system with a pipeline structure containing four modules. The natural language understanding (NLU) and natural language generation (NLG) modules were GPT-2 based models, and the dialogue state tracking (DST) and policy modules were designed on the basis of hand-crafted rules. After the preliminary round of the competition, we found that the low variation in training examples for the NLU and failed recommendation due to the policy used were probably the main reasons for the limited performance of the system.

Track Targets by Dense Spatio-Temporal Position Encoding arxiv:2210.09455 📈 4

Jinkun Cao, Hao Wu, Kris Kitani

**Abstract:** In this work, we propose a novel paradigm to encode the position of targets for target tracking in videos using transformers. The proposed paradigm, Dense Spatio-Temporal (DST) position encoding, encodes spatio-temporal position information in a pixel-wise dense fashion. The provided position encoding provides location information to associate targets across frames beyond appearance matching by comparing objects in two bounding boxes. Compared to the typical transformer positional encoding, our proposed encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information. Moreover, the designed DST encoding can represent the location of a single-frame object and the evolution of the location of the trajectory among frames uniformly. Integrated with the DST encoding, we build a transformer-based multi-object tracking model. The model takes a video clip as input and conducts the target association in the clip. It can also perform online inference by associating existing trajectories with objects from the new-coming frames. Experiments on video multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) datasets demonstrate the effectiveness of the proposed DST position encoding.

Deepfake Text Detection: Limitations and Opportunities arxiv:2210.09421 📈 4

Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, Bimal Viswanath

**Abstract:** Recent advances in generative models for language have enabled the creation of convincing synthetic text or deepfake text. Prior work has demonstrated the potential for misuse of deepfake text to mislead content consumers. Therefore, deepfake text detection, the task of discriminating between human and machine-generated text, is becoming increasingly critical. Several defenses have been proposed for deepfake text detection. However, we lack a thorough understanding of their real-world applicability. In this paper, we collect deepfake text from 4 online services powered by Transformer-based tools to evaluate the generalization ability of the defenses on content in the wild. We develop several low-cost adversarial attacks, and investigate the robustness of existing defenses against an adaptive attacker. We find that many defenses show significant degradation in performance under our evaluation scenarios compared to their original claimed performance. Our evaluation shows that tapping into the semantic information in the text content is a promising approach for improving the robustness and generalization performance of deepfake text detection schemes.

Face Pasting Attack arxiv:2210.09153 📈 4

Niklas Bunzel, Lukas Graner

**Abstract:** Cujo AI and Adversa AI hosted the MLSec face recognition challenge. The goal was to attack a black box face recognition model with targeted attacks. The model returned the confidence of the target class and a stealthiness score. For an attack to be considered successful the target class has to have the highest confidence among all classes and the stealthiness has to be at least 0.5. In our approach we paste the face of a target into a source image. By utilizing position, scaling, rotation and transparency attributes we reached 3rd place. Our approach took approximately 200 queries per attack for the final highest score and about ~7.7 queries minimum for a successful attack. The code is available at https://github.com/bunni90/FacePastingAttack .

Sparse Kronecker Product Decomposition: A General Framework of Signal Region Detection in Image Regression arxiv:2210.09128 📈 4

Sanyou Wu, Long Feng

**Abstract:** This paper aims to present the first Frequentist framework on signal region detection in high-resolution and high-order image regression problems. Image data and scalar-on-image regression are intensively studied in recent years. However, most existing studies on such topics focused on outcome prediction, while the research on image region detection is rather limited, even though the latter is often more important. In this paper, we develop a general framework named Sparse Kronecker Product Decomposition (SKPD) to tackle this issue. The SKPD framework is general in the sense that it works for both matrices (e.g., 2D grayscale images) and (high-order) tensors (e.g., 2D colored images, brain MRI/fMRI data) represented image data. Moreover, unlike many Bayesian approaches, our framework is computationally scalable for high-resolution image problems. Specifically, our framework includes: 1) the one-term SKPD; 2) the multi-term SKPD; and 3) the nonlinear SKPD. We propose nonconvex optimization problems to estimate the one-term and multi-term SKPDs and develop path-following algorithms for the nonconvex optimization. The computed solutions of the path-following algorithm are guaranteed to converge to the truth with a particularly chosen initialization even though the optimization is nonconvex. Moreover, the region detection consistency could also be guaranteed by the one-term and multi-term SKPD. The nonlinear SKPD is highly connected to shallow convolutional neural networks (CNN), particular to CNN with one convolutional layer and one fully connected layer. Effectiveness of SKPDs is validated by real brain imaging data in the UK Biobank database.

Nish: A Novel Negative Stimulated Hybrid Activation Function arxiv:2210.09083 📈 4

Yildiray Anaguna, Sahin Isik

**Abstract:** Activation functions play a crucial role in the performance and stability of neural networks. In this study, we propose a novel non-monotonic activation function is called Negative Stimulated Hybrid Activation Function (Nish). It behaves like a Rectified Linear Unit (ReLU) function for values greater than zero, and a sinus-sigmoidal function for values less than zero. The proposed function incorporates the sigmoid and sine wave, allowing new dynamics over traditional ReLU activations. We evaluate robustness of the Nish for different combinations of well-established architectures as well as recently proposed activation functions using on various well-known benchmarks. The results indicate that the accuracy rates obtained by the proposed activation function are slightly higher than those obtained using the set of weights calculated by Mish activation.

Reversing Image Signal Processors by Reverse Style Transferring arxiv:2210.09074 📈 4

Furkan Kınlı, Barış Özcan, Furkan Kıraç

**Abstract:** RAW image datasets are more suitable than the standard RGB image datasets for the ill-posed inverse problems in low-level vision, but not common in the literature. There are also a few studies to focus on mapping sRGB images to RAW format. Mapping from sRGB to RAW format could be a relevant domain for reverse style transferring since the task is an ill-posed reversing problem. In this study, we seek an answer to the question: Can the ISP operations be modeled as the style factor in an end-to-end learning pipeline? To investigate this idea, we propose a novel architecture, namely RST-ISP-Net, for learning to reverse the ISP operations with the help of adaptive feature normalization. We formulate this problem as a reverse style transferring and mostly follow the practice used in the prior work. We have participated in the AIM Reversed ISP challenge with our proposed architecture. Results indicate that the idea of modeling disruptive or modifying factors as style is still valid, but further improvements are required to be competitive in such a challenge.

Predicting Dense and Context-aware Cost Maps for Semantic Robot Navigation arxiv:2210.08952 📈 4

Yash Goel, Narunas Vaskevicius, Luigi Palmieri, Nived Chebrolu, Cyrill Stachniss

**Abstract:** We investigate the task of object goal navigation in unknown environments where the target is specified by a semantic label (e.g. find a couch). Such a navigation task is especially challenging as it requires understanding of semantic context in diverse settings. Most of the prior work tackles this problem under the assumption of a discrete action policy whereas we present an approach with continuous control which brings it closer to real world applications. We propose a deep neural network architecture and loss function to predict dense cost maps that implicitly contain semantic context and guide the robot towards the semantic goal. We also present a novel way of fusing mid-level visual representations in our architecture to provide additional semantic cues for cost map prediction. The estimated cost maps are then used by a sampling-based model predictive controller (MPC) for generating continuous robot actions. The preliminary experiments suggest that the cost maps generated by our network are suitable for the MPC and can guide the agent to the semantic goal more efficiently than a baseline approach. The results also indicate the importance of mid-level representations for navigation by improving the success rate by 7 percentage points.

DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models arxiv:2210.08933 📈 4

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, LingPeng Kong

**Abstract:** Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is difficult due to the discrete nature of text. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.

Forget Unlearning: Towards True Data-Deletion in Machine Learning arxiv:2210.08911 📈 4

Rishav Chourasia, Neil Shah, Reza Shokri

**Abstract:** Unlearning has emerged as a technique to efficiently erase information of deleted records from learned models. We show, however, that the influence created by the original presence of a data point in the training set can still be detected after running certified unlearning algorithms (which can result in its reconstruction by an adversary). Thus, under realistic assumptions about the dynamics of model releases over time and in the presence of adaptive adversaries, we show that unlearning is not equivalent to data deletion and does not guarantee the "right to be forgotten." We then propose a more robust data-deletion guarantee and show that it is necessary to satisfy differential privacy to ensure true data deletion. Under our notion, we propose an accurate, computationally efficient, and secure data-deletion machine learning algorithm in the online setting based on noisy gradient descent algorithm.

Reducing Collision Checking for Sampling-Based Motion Planning Using Graph Neural Networks arxiv:2210.08864 📈 4

Chenning Yu, Sicun Gao

**Abstract:** Sampling-based motion planning is a popular approach in robotics for finding paths in continuous configuration spaces. Checking collision with obstacles is the major computational bottleneck in this process. We propose new learning-based methods for reducing collision checking to accelerate motion planning by training graph neural networks (GNNs) that perform path exploration and path smoothing. Given random geometric graphs (RGGs) generated from batch sampling, the path exploration component iteratively predicts collision-free edges to prioritize their exploration. The path smoothing component then optimizes paths obtained from the exploration stage. The methods benefit from the ability of GNNs of capturing geometric patterns from RGGs through batch sampling and generalize better to unseen environments. Experimental results show that the learned components can significantly reduce collision checking and improve overall planning efficiency in challenging high-dimensional motion planning tasks.

MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling arxiv:2210.08753 📈 4

Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Zhengyi Ma

**Abstract:** Personalized chatbots focus on endowing the chatbots with a consistent personality to behave like real users and further act as personal assistants. Previous studies have explored generating implicit user profiles from the user's dialogue history for building personalized chatbots. However, these studies only use the response generation loss to train the entire model, thus it is prone to suffer from the problem of data sparsity. Besides, they overemphasize the final generated response's quality while ignoring the correlations and fusions between the user's dialogue history, leading to rough data representations and performance degradation. To tackle these problems, we propose a self-supervised learning framework MCP for capturing better representations from users' dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervised signals hidden in user dialog history, and generate the pre-training samples for enhancing the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards the contrastive objectives and use these pre-trained encoders for generating user profiles while personalized response generation. Experimental results on two real-world datasets show a significant improvement in our proposed model MCP compared with the existing methods.

ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design arxiv:2210.09573 📈 3

Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, Yingyan Lin

**Abstract:** Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.

A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4 arxiv:2210.09529 📈 3

Yiming Li, Zhifang Guo, Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan, Kazushige Ouchi

**Abstract:** In this paper, we describe in detail our system for DCASE 2022 Task4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model which learns event-level representations and predicts sound event categories and boundaries directly, while the latter is based on the widely adopted frame-classification scheme, under which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training using unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly-labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model and achieves psds1 of 0.420 and psds2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.

SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning arxiv:2210.09524 📈 3

Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao

**Abstract:** Estimating age from a single speech is a classic and challenging topic. Although Label Distribution Learning (LDL) can represent adjacent indistinguishable ages well, the uncertainty of the age estimate for each utterance varies from person to person, i.e., the variance of the age distribution is different. To address this issue, we propose selective variance label distribution learning (SVLDL) method to adapt the variance of different age distributions. Furthermore, the model uses WavLM as the speech feature extractor and adds the auxiliary task of gender recognition to further improve the performance. Two tricks are applied on the loss function to enhance the robustness of the age estimation and improve the quality of the fitted age distribution. Extensive experiments show that the model achieves state-of-the-art performance on all aspects of the NIST SRE08-10 and a real-world datasets.

No Pairs Left Behind: Improving Metric Learning with Regularized Triplet Objective arxiv:2210.09506 📈 3

A. Ali Heydari, Naghmeh Rezaei, Daniel J. McDuff, Javier L. Prieto

**Abstract:** We propose a novel formulation of the triplet objective function that improves metric learning without additional sample mining or overhead costs. Our approach aims to explicitly regularize the distance between the positive and negative samples in a triplet with respect to the anchor-negative distance. As an initial validation, we show that our method (called No Pairs Left Behind [NPLB]) improves upon the traditional and current state-of-the-art triplet objective formulations on standard benchmark datasets. To show the effectiveness and potentials of NPLB on real-world complex data, we evaluate our approach on a large-scale healthcare dataset (UK Biobank), demonstrating that the embeddings learned by our model significantly outperform all other current representations on tested downstream tasks. Additionally, we provide a new model-agnostic single-time health risk definition that, when used in tandem with the learned representations, achieves the most accurate prediction of subjects' future health complications. Our results indicate that NPLB is a simple, yet effective framework for improving existing deep metric learning models, showcasing the potential implications of metric learning in more complex applications, especially in the biological and healthcare domains.

Enabling Heterogeneous Domain Adaptation in Multi-inhabitants Smart Home Activity Learning arxiv:2210.09499 📈 3

Md Mahmudur Rahman, Mahta Mousavi, Peri Tarr, Mohammad Arif Ul Alam

**Abstract:** Domain adaptation for sensor-based activity learning is of utmost importance in remote health monitoring research. However, many domain adaptation algorithms suffer with failure to operate adaptation in presence of target domain heterogeneity (which is always present in reality) and presence of multiple inhabitants dramatically hinders their generalizability producing unsatisfactory results for semi-supervised and unseen activity learning tasks. We propose \emph{AEDA}, a novel deep auto-encoder-based model to enable semi-supervised domain adaptation in the existence of target domain heterogeneity and how to incorporate it to empower heterogeneity to any homogeneous deep domain adaptation architecture for cross-domain activity learning. Experimental evaluation on 18 different heterogeneous and multi-inhabitants use-cases of 8 different domains created from 2 publicly available human activity datasets (wearable and ambient smart homes) shows that \emph{AEDA} outperforms (max. 12.8\% and 8.9\% improvements for ambient smart home and wearables) over existing domain adaptation techniques for both seen and unseen activity learning in a heterogeneous setting.

FPGA Hardware Acceleration for Feature-Based Relative Navigation Applications arxiv:2210.09481 📈 3

Ramchander Rao Bhaskara, Manoranjan Majji

**Abstract:** Estimation of rigid transformation between two point clouds is a computationally challenging problem in vision-based relative navigation. Targeting a real-time navigation solution utilizing point-cloud and image registration algorithms, this paper develops high-performance avionics for power and resource constrained pose estimation framework. A Field-Programmable Gate Array (FPGA) based embedded architecture is developed to accelerate estimation of relative pose between the point-clouds, aided by image features that correspond to the individual point sets. At algorithmic level, the pose estimation method is an adaptation of Optimal Linear Attitude and Translation Estimator (OLTAE) for relative attitude and translation estimation. At the architecture level, the proposed embedded solution is a hardware/software co-design that evaluates the OLTAE computations on the bare-metal hardware for high-speed state estimation. The finite precision FPGA evaluation of the OLTAE algorithm is compared with a double-precision evaluation on MATLAB for performance analysis and error quantification. Implementation results of the proposed finite-precision OLTAE accelerator demonstrate the high-performance compute capabilities of the FPGA-based pose estimation while offering relative numerical errors below 7%.

Leveraging Non-dialogue Summaries for Dialogue Summarization arxiv:2210.09474 📈 3

Seongmin Park, Dongchan Shin, Jihwa Lee

**Abstract:** To mitigate the lack of diverse dialogue summarization datasets in academia, we present methods to utilize non-dialogue summarization data for enhancing dialogue summarization systems. We apply transformations to document summarization data pairs to create training data that better befit dialogue summarization. The suggested transformations also retain desirable properties of non-dialogue datasets, such as improved faithfulness to the source text. We conduct extensive experiments across both English and Korean to verify our approach. Although absolute gains in ROUGE naturally plateau as more dialogue summarization samples are introduced, utilizing non-dialogue data for training significantly improves summarization performance in zero- and few-shot settings and enhances faithfulness across all training regimes.

Early Diagnosis of Retinal Blood Vessel Damage via Deep Learning-Powered Collective Intelligence Models arxiv:2210.09449 📈 3

Pranjal Bhardwaj, Prajjwal Gupta, Thejineaswar Guhan, Kathiravan Srinivasan

**Abstract:** Early diagnosis of retinal diseases such as diabetic retinopathy has had the attention of many researchers. Deep learning through the introduction of convolutional neural networks has become a prominent solution for image-related tasks such as classification and segmentation. Most tasks in image classification are handled by deep CNNs pretrained and evaluated on imagenet dataset. However, these models do not always translate to the best result on other datasets. Devising a neural network manually from scratch based on heuristics may not lead to an optimal model as there are numerous hyperparameters in play. In this paper, we use two nature-inspired swarm algorithms: particle swarm optimization (PSO) and ant colony optimization (ACO) to obtain TDCN models to perform classification of fundus images into severity classes. The power of swarm algorithms is used to search for various combinations of convolutional, pooling, and normalization layers to provide the best model for the task. It is observed that TDCN-PSO outperforms imagenet models and existing literature, while TDCN-ACO achieves faster architecture search. The best TDCN model achieves an accuracy of 90.3%, AUC ROC of 0.956, and a Cohen kappa score of 0.967. The results were compared with the previous studies to show that the proposed TDCN models exhibit superior performance.

A tradeoff between universality of equivariant models and learnability of symmetries arxiv:2210.09444 📈 3

Vasco Portilheiro

**Abstract:** We prove an impossibility result, which in the context of function learning says the following: under certain conditions, it is impossible to simultaneously learn symmetries and functions equivariant under them using an ansatz consisting of equivariant functions. To formalize this statement, we carefully study notions of approximation for groups and semigroups. We analyze certain families of neural networks for whether they satisfy the conditions of the impossibility result: what we call ``linearly equivariant'' networks, and group-convolutional networks. A lot can be said precisely about linearly equivariant networks, making them theoretically useful. On the practical side, our analysis of group-convolutional neural networks allows us generalize the well-known ``convolution is all you need'' theorem to non-homogeneous spaces. We additionally find an important difference between group convolution and semigroup convolution.

Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints arxiv:2210.09440 📈 3

Omid Rohanian, Hannah Jauncey, Mohammadmahdi Nouriborji, Bronner P. Gonçalves, Christiana Kartsonaki, ISARIC Clinical Characterisation Group, Laura Merson, David Clifton

**Abstract:** Processing information locked within clinical health records is a challenging task that remains an active area of research in biomedical NLP. In this work, we evaluate a broad set of machine learning techniques ranging from simple RNNs to specialised transformers such as BioBERT on a dataset containing clinical notes along with a set of annotations indicating whether a sample is cancer-related or not. Furthermore, we specifically employ efficient fine-tuning methods from NLP, namely, bottleneck adapters and prompt tuning, to adapt the models to our specialised task. Our evaluations suggest that fine-tuning a frozen BERT model pre-trained on natural language and with bottleneck adapters outperforms all other strategies, including full fine-tuning of the specialised BioBERT model. Based on our findings, we suggest that using bottleneck adapters in low-resource situations with limited access to labelled data or processing capacity could be a viable strategy in biomedical text mining. The code used in the experiments are going to be made available at https://github.com/omidrohanian/bottleneck-adapters.

Sufficient Exploration for Convex Q-learning arxiv:2210.09409 📈 3

Fan Lu, Prashant Mehta, Sean Meyn, Gergely Neu

**Abstract:** In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.

Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes arxiv:2210.09389 📈 3

Istiak Ahmad, Fahad AlQurashi, Rashid Mehmood

**Abstract:** Knowledge is central to human and scientific developments. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial NLP and machine learning ingredient. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label Bangla news article textual dataset curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science \& Technology) providing five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences contained in 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset another (balanced) dataset comprising 320,000 news articles with 40,000 articles in each of the eight news categories. Potrika contains both the datasets (raw and balanced) to suit a wide range of NLP research. By far, to the best of our knowledge, Potrika is the largest and the most extensive dataset for news classification.

Tight Analysis of Extra-gradient and Optimistic Gradient Methods For Nonconvex Minimax Problems arxiv:2210.09382 📈 3

Pouria Mahdavinia, Yuyang Deng, Haochuan Li, Mehrdad Mahdavi

**Abstract:** Despite the established convergence theory of Optimistic Gradient Descent Ascent (OGDA) and Extragradient (EG) methods for the convex-concave minimax problems, little is known about the theoretical guarantees of these methods in nonconvex settings. To bridge this gap, for the first time, this paper establishes the convergence of OGDA and EG methods under the nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings by providing a unified analysis through the lens of single-call extra-gradient methods. We further establish lower bounds on the convergence of GDA/OGDA/EG, shedding light on the tightness of our analysis. We also conduct experiments supporting our theoretical results. We believe our results will advance the theoretical understanding of OGDA and EG methods for solving complicated nonconvex minimax real-world problems, e.g., Generative Adversarial Networks (GANs) or robust neural networks training.

Bridging the Gap between Artificial Intelligence and Artificial General Intelligence: A Ten Commandment Framework for Human-Like Intelligence arxiv:2210.09366 📈 3

Ananta Nair, Farnoush Banaei-Kashani

**Abstract:** The field of artificial intelligence has seen explosive growth and exponential success. The last phase of development showcased deep learnings ability to solve a variety of difficult problems across a multitude of domains. Many of these networks met and exceeded human benchmarks by becoming experts in the domains in which they are trained. Though the successes of artificial intelligence have begun to overshadow its failures, there is still much that separates current artificial intelligence tools from becoming the exceptional general learners that humans are. In this paper, we identify the ten commandments upon which human intelligence is systematically and hierarchically built. We believe these commandments work collectively to serve as the essential ingredients that lead to the emergence of higher-order cognition and intelligence. This paper discusses a computational framework that could house these ten commandments and suggests new architectural modifications that could lead to the development of smarter, more explainable, and generalizable artificial systems inspired by a neuromorphic approach.

TorchDIVA: An Extensible Computational Model of Speech Production built on an Open-Source Machine Learning Library arxiv:2210.09334 📈 3

Sean Kinahan, Julie Liss, Visar Berisha

**Abstract:** The DIVA model is a computational model of speech motor control that combines a simulation of the brain regions responsible for speech production with a model of the human vocal tract. The model is currently implemented in Matlab Simulink; however, this is less than ideal as most of the development in speech technology research is done in Python. This means there is a wealth of machine learning tools which are freely available in the Python ecosystem that cannot be easily integrated with DIVA. We present TorchDIVA, a full rebuild of DIVA in Python using PyTorch tensors. DIVA source code was directly translated from Matlab to Python, and built-in Simulink signal blocks were implemented from scratch. After implementation, the accuracy of each module was evaluated via systematic block-by-block validation. The TorchDIVA model is shown to produce outputs that closely match those of the original DIVA model, with a negligible difference between the two. We additionally present an example of the extensibility of TorchDIVA as a research platform. Speech quality enhancement in TorchDIVA is achieved through an integration with an existing PyTorch generative vocoder called DiffWave. A modified DiffWave mel-spectrum upsampler was trained on human speech waveforms and conditioned on the TorchDIVA speech production. The results indicate improved speech quality metrics in the DiffWave-enhanced output as compared to the baseline. This enhancement would have been difficult or impossible to accomplish in the original Matlab implementation. This proof-of-concept demonstrates the value TorchDIVA will bring to the research community. Researchers can download the new implementation at: https://github.com/skinahan/DIVA_PyTorch

Predicting Dynamic Stability from Static Features in Power Grid Models using Machine Learning arxiv:2210.09266 📈 3

Maurizio Titz, Franz Kaiser, Johannes Kruse, Dirk Witthaut

**Abstract:** A reliable supply with electric power is vital for our society. Transmission line failures are among the biggest threats for power grid stability as they may lead to a splitting of the grid into mutual asynchronous fragments. New conceptual methods are needed to assess system stability that complement existing simulation models. In this article we propose a combination of network science metrics and machine learning models to predict the risk of desynchronisation events. Network science provides metrics for essential properties of transmission lines such as their redundancy or centrality. Machine learning models perform inherent feature selection and thus reveal key factors that determine network robustness and vulnerability. As a case study, we train and test such models on simulated data from several synthetic test grids. We find that the integrated models are capable of predicting desynchronisation events after line failures with an average precision greater than $0.996$ when averaging over all data sets. Learning transfer between different data sets is generally possible, at a slight loss of prediction performance. Our results suggest that power grid desynchronisation is essentially governed by only a few network metrics that quantify the networks ability to reroute flow without creating exceedingly high static line loadings.

Improving Contrastive Learning on Visually Homogeneous Mars Rover Images arxiv:2210.09234 📈 3

Isaac Ronald Ward, Charles Moore, Kai Pak, Jingdao Chen, Edwin Goh

**Abstract:** Contrastive learning has recently demonstrated superior performance to supervised learning, despite requiring no training labels. We explore how contrastive learning can be applied to hundreds of thousands of unlabeled Mars terrain images, collected from the Mars rovers Curiosity and Perseverance, and from the Mars Reconnaissance Orbiter. Such methods are appealing since the vast majority of Mars images are unlabeled as manual annotation is labor intensive and requires extensive domain knowledge. Contrastive learning, however, assumes that any given pair of distinct images contain distinct semantic content. This is an issue for Mars image datasets, as any two pairs of Mars images are far more likely to be semantically similar due to the lack of visual diversity on the planet's surface. Making the assumption that pairs of images will be in visual contrast - when they are in fact not - results in pairs that are falsely considered as negatives, impacting training performance. In this study, we propose two approaches to resolve this: 1) an unsupervised deep clustering step on the Mars datasets, which identifies clusters of images containing similar semantic content and corrects false negative errors during training, and 2) a simple approach which mixes data from different domains to increase visual diversity of the total training dataset. Both cases reduce the rate of false negative pairs, thus minimizing the rate in which the model is incorrectly penalized during contrastive training. These modified approaches remain fully unsupervised end-to-end. To evaluate their performance, we add a single linear layer trained to generate class predictions based on these contrastively-learned features and demonstrate increased performance compared to supervised models; observing an improvement in classification accuracy of 3.06% using only 10% of the labeled data.

Confound-leakage: Confound Removal in Machine Learning Leads to Leakage arxiv:2210.09232 📈 3

Sami Hamdan, Bradley C. Love, Georg G. von Polier, Susanne Weis, Holger Schwender, Simon B. Eickhoff, Kaustubh R. Patil

**Abstract:** Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.

Data-Driven Joint Inversions for PDE Models arxiv:2210.09228 📈 3

Kui Ren, Lu Zhang

**Abstract:** The task of simultaneously reconstructing multiple physical coefficients in partial differential equations from observed data is ubiquitous in applications. In this work, we propose an integrated data-driven and model-based iterative reconstruction framework for such joint inversion problems where additional data on the unknown coefficients are supplemented for better reconstructions. Our method couples the supplementary data with the PDE model to make the data-driven modeling process consistent with the model-based reconstruction procedure. We characterize the impact of learning uncertainty on the joint inversion results for two typical model inverse problems. Numerical evidences are provided to demonstrate the feasibility of using data-driven models to improve joint inversion of physical models.

Sub-8-bit quantization for on-device speech recognition: a regularization-free approach arxiv:2210.09188 📈 3

Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris

**Abstract:** For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences arxiv:2210.09151 📈 3

Mudit Verma, Katherine Metcalf

**Abstract:** Specifying rewards for reinforcement learned (RL) agents is challenging. Preference-based RL (PbRL) mitigates these challenges by inferring a reward from feedback over sets of trajectories. However, the effectiveness of PbRL is limited by the amount of feedback needed to reliably recover the structure of the target reward. We present the PRIor Over Rewards (PRIOR) framework, which incorporates priors about the structure of the reward function and the preference feedback into the reward learning process. Imposing these priors as soft constraints on the reward learning objective reduces the amount of feedback required by half and improves overall reward recovery. Additionally, we demonstrate that using an abstract state space for the computation of the priors further improves the reward learning and the agent's performance.

Flipped Classroom: Effective Teaching for Time Series Forecasting arxiv:2210.08959 📈 3

Philipp Teutsch, Patrick Mäder

**Abstract:** Sequence-to-sequence models based on LSTM and GRU are a most popular choice for forecasting time series data reaching state-of-the-art performance. Training such models can be delicate though. The two most common training strategies within this context are teacher forcing (TF) and free running (FR). TF can be used to help the model to converge faster but may provoke an exposure bias issue due to a discrepancy between training and inference phase. FR helps to avoid this but does not necessarily lead to better results, since it tends to make the training slow and unstable instead. Scheduled sampling was the first approach tackling these issues by picking the best from both worlds and combining it into a curriculum learning (CL) strategy. Although scheduled sampling seems to be a convincing alternative to FR and TF, we found that, even if parametrized carefully, scheduled sampling may lead to premature termination of the training when applied for time series forecasting. To mitigate the problems of the above approaches we formalize CL strategies along the training as well as the training iteration scale. We propose several new curricula, and systematically evaluate their performance in two experimental sets. For our experiments, we utilize six datasets generated from prominent chaotic systems. We found that the newly proposed increasing training scale curricula with a probabilistic iteration scale curriculum consistently outperforms previous training strategies yielding an NRMSE improvement of up to 81% over FR or TF training. For some datasets we additionally observe a reduced number of training iterations. We observed that all models trained with the new curricula yield higher prediction stability allowing for longer prediction horizons.

A Treatise On FST Lattice Based MMI Training arxiv:2210.08918 📈 3

Adnan Haider, Tim Ng, Zhen Huang, Xingyu Na, Antti Veikko Rosti

**Abstract:** Maximum mutual information (MMI) has become one of the two de facto methods for sequence-level training of speech recognition acoustic models. This paper aims to isolate, identify and bring forward the implicit modelling decisions induced by the design implementation of standard finite state transducer (FST) lattice based MMI training framework. The paper particularly investigates the necessity to maintain a preselected numerator alignment and raises the importance of determinizing FST denominator lattices on the fly. The efficacy of employing on the fly FST lattice determinization is mathematically shown to guarantee discrimination at the hypothesis level and is empirically shown through training deep CNN models on a 18K hours Mandarin dataset and on a 2.8K hours English dataset. On assistant and dictation tasks, the approach achieves between 2.3-4.6% relative WER reduction (WERR) over the standard FST lattice based approach.

Data-Driven Short-Term Daily Operational Sea Ice Regional Forecasting arxiv:2210.08877 📈 3

Timofey Grigoryev, Polina Verezemskaya, Mikhail Krinitskiy, Nikita Anikin, Alexander Gavrikov, Ilya Trofimov, Nikita Balabin, Aleksei Shpilman, Andrei Eremchenko, Sergey Gulev, Evgeny Burnaev, Vladimir Vanovskiy

**Abstract:** Global warming made the Arctic available for marine operations and created demand for reliable operational sea ice forecasts to make them safe. While ocean-ice numerical models are highly computationally intensive, relatively lightweight ML-based methods may be more efficient in this task. Many works have exploited different deep learning models alongside classical approaches for predicting sea ice concentration in the Arctic. However, only a few focus on daily operational forecasts and consider the real-time availability of data they need for operation. In this work, we aim to close this gap and investigate the performance of the U-Net model trained in two regimes for predicting sea ice for up to the next 10 days. We show that this deep learning model can outperform simple baselines by a significant margin and improve its quality by using additional weather data and training on multiple regions, ensuring its generalization abilities. As a practical outcome, we build a fast and flexible tool that produces operational sea ice forecasts in the Barents Sea, the Labrador Sea, and the Laptev Sea regions.

PTDE: Personalized Training with Distillated Execution for Multi-Agent Reinforcement Learning arxiv:2210.08872 📈 3

Yiqun Chen, Hangyu Mao, Tianle Zhang, Shiguang Wu, Bin Zhang, Jianye Hao, Dong Li, Bin Wang, Hongxing Chang

**Abstract:** Centralized Training with Decentralized Execution (CTDE) has been a very popular paradigm for multi-agent reinforcement learning. One of its main features is making full use of the global information to learn a better joint $Q$-function or centralized critic. In this paper, we in turn explore how to leverage the global information to directly learn a better individual $Q$-function or individual actor. We find that applying the same global information to all agents indiscriminately is not enough for good performance, and thus propose to specify the global information for each agent to obtain agent-specific global information for better performance. Furthermore, we distill such agent-specific global information into the agent's local information, which is used during decentralized execution without too much performance degradation. We call this new paradigm Personalized Training with Distillated Execution (PTDE). PTDE can be easily combined with many state-of-the-art algorithms to further improve their performance, which is verified in both SMAC and Google Research Football scenarios.

Differential Evolution based Dual Adversarial Camouflage: Fooling Human Eyes and Object Detectors arxiv:2210.08870 📈 3

Jialiang Sun, Tingsong Jiang, Wen Yao, Donghua Wang, Xiaoqian Chen

**Abstract:** Recent studies reveal that deep neural network (DNN) based object detectors are vulnerable to adversarial attacks in the form of adding the perturbation to the images, leading to the wrong output of object detectors. Most current existing works focus on generating perturbed images, also called adversarial examples, to fool object detectors. Though the generated adversarial examples themselves can remain a certain naturalness, most of them can still be easily observed by human eyes, which limits their further application in the real world. To alleviate this problem, we propose a differential evolution based dual adversarial camouflage (DE_DAC) method, composed of two stages to fool human eyes and object detectors simultaneously. Specifically, we try to obtain the camouflage texture, which can be rendered over the surface of the object. In the first stage, we optimize the global texture to minimize the discrepancy between the rendered object and the scene images, making human eyes difficult to distinguish. In the second stage, we design three loss functions to optimize the local texture, making object detectors ineffective. In addition, we introduce the differential evolution algorithm to search for the near-optimal areas of the object to attack, improving the adversarial performance under certain attack area limitations. Besides, we also study the performance of adaptive DE_DAC, which can be adapted to the environment. Experiments show that our proposed method could obtain a good trade-off between the fooling human eyes and object detectors under multiple specific scenes and objects.

Test-Time Training for Graph Neural Networks arxiv:2210.08813 📈 3

Yiqi Wang, Chaozhuo Li, Wei Jin, Rui Li, Jianan Zhao, Jiliang Tang, Xing Xie

**Abstract:** Graph Neural Networks (GNNs) have made tremendous progress in the graph classification task. However, a performance gap between the training set and the test set has often been noticed. To bridge such gap, in this work we introduce the first test-time training framework for GNNs to enhance the model generalization capacity for the graph classification task. In particular, we design a novel test-time training strategy with self-supervised learning to adjust the GNN model for each test graph sample. Experiments on the benchmark datasets have demonstrated the effectiveness of the proposed framework, especially when there are distribution shifts between training set and test set. We have also conducted exploratory studies and theoretical analysis to gain deeper understandings on the rationality of the design of the proposed graph test time training framework (GT3).

Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference arxiv:2210.08803 📈 3

Joey Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Alex Liu, Daniel Abel, Gems Guo, Jianbing Dong, Jerry Shi, Kunlun Li

**Abstract:** In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open source, GPU-accelerated integration framework for click-through rate estimation. It optimizes both training and inference, whilst enabling model training at scale with model-parallel embeddings and data-parallel neural networks. In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with an hierarchical storage architecture, to realize low-latency retrieval of embeddings for online model inference tasks. In the MLPerf v1.0 DLRM model training benchmark, Merlin HugeCTR achieves a speedup of up to 24.6x on a single DGX A100 (8x A100) over PyTorch on 4x4-socket CPU nodes (4x4x28 cores). Merlin HugeCTR can also take advantage of multi-node environments to accelerate training even further. Since late 2021, Merlin HugeCTR additionally features a hierarchical parameter server (HPS) and supports deployment via the NVIDIA Triton server framework, to leverage the computational capabilities of GPUs for high-speed recommendation model inference. Using this HPS, Merlin HugeCTR users can achieve a 5~62x speedup (batch size dependent) for popular recommendation models over CPU baseline implementations, and dramatically reduce their end-to-end inference latency.

Rethinking Trajectory Prediction via "Team Game" arxiv:2210.08793 📈 3

Zikai Wei, Xinge Zhu, Bo Dai, Dahua Lin

**Abstract:** To accurately predict trajectories in multi-agent settings, e.g. team games, it is important to effectively model the interactions among agents. Whereas a number of methods have been developed for this purpose, existing methods implicitly model these interactions as part of the deep net architecture. However, in the real world, interactions often exist at multiple levels, e.g. individuals may form groups, where interactions among groups and those among the individuals in the same group often follow significantly different patterns. In this paper, we present a novel formulation for multi-agent trajectory prediction, which explicitly introduces the concept of interactive group consensus via an interactive hierarchical latent space. This formulation allows group-level and individual-level interactions to be captured jointly, thus substantially improving the capability of modeling complex dynamics. On two multi-agent settings, i.e. team sports and pedestrians, the proposed framework consistently achieves superior performance compared to existing methods.

Embedded Silicon-Organic Integrated Neuromorphic System arxiv:2210.12064 📈 2

Shengjie Zheng, Ling Liu, Junjie Yang, Jianwei Zhang, Tao Su, Bin Yue, Xiaojian Li

**Abstract:** The development of artificial intelligence (AI) and robotics are both based on the tenet of "science and technology are people-oriented", and both need to achieve efficient communication with the human brain. Based on multi-disciplinary research in systems neuroscience, computer architecture, and functional organic materials, we proposed the concept of using AI to simulate the operating principles and materials of the brain in hardware to develop brain-inspired intelligence technology, and realized the preparation of neuromorphic computing devices and basic materials. We simulated neurons and neural networks in terms of material and morphology, using a variety of organic polymers as the base materials for neuroelectronic devices, for building neural interfaces as well as organic neural devices and silicon neural computational modules. We assemble organic artificial synapses with simulated neurons from silicon-based Field-Programmable Gate Array (FPGA) into organic artificial neurons, the basic components of neural networks, and later construct biological neural network models based on the interpreted neural circuits. Finally, we also discuss how to further build neuromorphic devices based on these organic artificial neurons, which have both a neural interface friendly to nervous tissue and interact with information from real biological neural networks.

A Novel Feature Representation for Malware Classification arxiv:2210.09580 📈 2

John Musgrave, Temesguen Messay-Kebede, David Kapp, Anca Ralescu

**Abstract:** In this study we have presented a novel feature representation for malicious programs that can be used for malware classification. We have shown how to construct the features in a bottom-up approach, and analyzed the overlap of malicious and benign programs in terms of their components. We have shown that our method of analysis offers an increase in feature resolution that is descriptive of data movement in comparison to tf-idf features.

A novel statistical methodology for quantifying the spatial arrangements of axons in peripheral nerves arxiv:2210.09554 📈 2

Abida Sanjana Shemonti, Emanuele Plebani, Natalia P. Biscola, Deborah M. Jaffey, Leif A. Havton, Janet R. Keast, Alex Pothen, M. Murat Dundar, Terry L. Powley, Bartek Rajwa

**Abstract:** A thorough understanding of the neuroanatomy of peripheral nerves is required for a better insight into their function and the development of neuromodulation tools and strategies. In biophysical modeling, it is commonly assumed that the complex spatial arrangement of myelinated and unmyelinated axons in peripheral nerves is random, however, in reality the axonal organization is inhomogeneous and anisotropic. Present quantitative neuroanatomy methods analyze peripheral nerves in terms of the number of axons and the morphometric characteristics of the axons, such as area and diameter. In this study, we employed spatial statistics and point process models to describe the spatial arrangement of axons and Sinkhorn distances to compute the similarities between these arrangements (in terms of first- and second-order statistics) in various vagus and pelvic nerve cross-sections. We utilized high-resolution TEM images that have been segmented using a custom-built high-throughput deep learning system based on a highly modified U-Net architecture. Our findings show a novel and innovative approach to quantifying similarities between spatial point patterns using metrics derived from the solution to the optimal transport problem. We also present a generalizable pipeline for quantitative analysis of peripheral nerve architecture. Our data demonstrate differences between male- and female-originating samples and similarities between the pelvic and abdominal vagus nerves.

Adaptive Oracle-Efficient Online Learning arxiv:2210.09385 📈 2

Guanghui Wang, Zihao Hu, Vidya Muthukumar, Jacob Abernethy

**Abstract:** The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponentially-large (or even infinite) space of decisions and select that which performed the best on any dataset. But despite the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while performing well in worst-case settings, they do not adapt well to friendly environments. In this paper we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to the small-loss environment, under a particular condition which we call approximability (which is spiritually related to sufficient conditions provided by Dudík et al., [2020]). We identify a series of real-world settings, including online auctions and transductive online classification, for which approximability holds. We also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the oracle-efficient setting.

On Accelerated Perceptrons and Beyond arxiv:2210.09371 📈 2

Guanghui Wang, Rafael Hanashiro, Etash Guha, Jacob Abernethy

**Abstract:** The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $γ> 0$. A foundational result is that Perceptron converges after $Ω(1/γ^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to $Ω(\sqrt{\log n}/γ)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also lead to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) For the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) We provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) For the classical $p$-norm Perceptron problem, we provide an algorithm with $Ω(\sqrt{(p-1)\log n}/γ)$ convergence rate, while existing algorithms suffer the $Ω({(p-1)}/γ^2)$ convergence rate.

Unsupervised Optimal Power Flow Using Graph Neural Networks arxiv:2210.09277 📈 2

Damian Owerko, Fernando Gama, Alejandro Ribeiro

**Abstract:** Optimal power flow (OPF) is a critical optimization problem that allocates power to the generators in order to satisfy the demand at a minimum cost. Solving this problem exactly is computationally infeasible in the general case. In this work, we propose to leverage graph signal processing and machine learning. More specifically, we use a graph neural network to learn a nonlinear parametrization between the power demanded and the corresponding allocation. We learn the solution in an unsupervised manner, minimizing the cost directly. In order to take into account the electrical constraints of the grid, we propose a novel barrier method that is differentiable and works on initially infeasible points. We show through simulations that the use of GNNs in this unsupervised learning context leads to solutions comparable to standard solvers while being computationally efficient and avoiding constraint violations most of the time.

Turbocharging Solution Concepts: Solving NEs, CEs and CCEs with Neural Equilibrium Solvers arxiv:2210.09257 📈 2

Luke Marris, Ian Gemp, Thomas Anthony, Andrea Tacchetti, Siqi Liu, Karl Tuyls

**Abstract:** Solution concepts such as Nash Equilibria, Correlated Equilibria, and Coarse Correlated Equilibria are useful components for many multiagent machine learning algorithms. Unfortunately, solving a normal-form game could take prohibitive or non-deterministic time to converge, and could fail. We introduce the Neural Equilibrium Solver which utilizes a special equivariant neural network architecture to approximately solve the space of all games of fixed shape, buying speed and determinism. We define a flexible equilibrium selection framework, that is capable of uniquely selecting an equilibrium that minimizes relative entropy, or maximizes welfare. The network is trained without needing to generate any supervised training data. We show remarkable zero-shot generalization to larger games. We argue that such a network is a powerful component for many possible multiagent algorithms.

A Unified Algorithm for Stochastic Path Problems arxiv:2210.09255 📈 2

Christoph Dann, Chen-Yu Wei, Julian Zimmert

**Abstract:** We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the scale of rewards $B_\star$ is unknown. We show that there is no price for adaptation, and our regret bound matches that with a known $B_\star$. We also provide a scale adaptation procedure for the special case of stochastic longest paths (SLP) where all rewards are non-negative. However, unlike in SSP, we show through a lower bound that there is an unavoidable price for adaptation.

Provable Phase Retrieval with Mirror Descent arxiv:2210.09248 📈 2

Jean-Jacques Godeme, Jalal Fadili, Xavier Buet, Myriam Zerrad, Michel Lequime, Claude Amra

**Abstract:** In this paper, we consider the problem of phase retrieval, which consists of recovering an $n$-dimensional real vector from the magnitude of its $m$ linear measurements. We propose a mirror descent (or Bregman gradient descent) algorithm based on a wisely chosen Bregman divergence, hence allowing to remove the classical global Lipschitz continuity requirement on the gradient of the non-convex phase retrieval objective to be minimized. We apply the mirror descent for two random measurements: the \iid standard Gaussian and those obtained by multiple structured illuminations through Coded Diffraction Patterns (CDP). For the Gaussian case, we show that when the number of measurements $m$ is large enough, then with high probability, for almost all initializers, the algorithm recovers the original vector up to a global sign change. For both measurements, the mirror descent exhibits a local linear convergence behaviour with a dimension-independent convergence rate. Our theoretical results are finally illustrated with various numerical experiments, including an application to the reconstruction of images in precision optics.

Learning Object Affordance with Contact and Grasp Generation arxiv:2210.09245 📈 2

Haoming Li, Xinzhuo Lin, Yang Zhou, Xiang Li, Jiming Chen, Qi Ye

**Abstract:** Understanding object affordance can help in designing better and more robust robotic grasping. Existing work in the computer vision community formulates the object affordance understanding as a grasping pose generation problem, which treats the problem as a black box by learning a mapping between objects and the distributions of possible grasping poses for the objects. On the other hand, in the robotics community, estimating object affordance represented by contact maps is of the most importance as localizing the positions of the possible affordance can help the planning of grasping actions. In this paper, we propose to formulate the object affordance understanding as both contacts and grasp poses generation. we factorize the learning task into two sequential stages, rather than the black-box strategy: (1) we first reason the contact maps by allowing multi-modal contact generation; (2) assuming that grasping poses are fully constrained given contact maps, we learn a one-to-one mapping from the contact maps to the grasping poses. Further, we propose a penetration-aware partial optimization from the intermediate contacts. It combines local and global optimization for the refinement of the partial poses of the generated grasps exhibiting penetration. Extensive validations on two public datasets show our method outperforms state-of-the-art methods regarding grasp generation on various metrics.

Conditional Neural Processes for Molecules arxiv:2210.09211 📈 2

Miguel Garcia-Ortegon, Andreas Bender, Sergio Bacallado

**Abstract:** Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in QSAR modelling, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.

Marksman Backdoor: Backdoor Attacks with Arbitrary Target Class arxiv:2210.09194 📈 2

Khoa D. Doan, Yingjie Lao, Ping Li

**Abstract:** In recent years, machine learning models have been shown to be vulnerable to backdoor attacks. Under such attacks, an adversary embeds a stealthy backdoor into the trained model such that the compromised models will behave normally on clean inputs but will misclassify according to the adversary's control on maliciously constructed input with a trigger. While these existing attacks are very effective, the adversary's capability is limited: given an input, these attacks can only cause the model to misclassify toward a single pre-defined or target class. In contrast, this paper exploits a novel backdoor attack with a much more powerful payload, denoted as Marksman, where the adversary can arbitrarily choose which target class the model will misclassify given any input during inference. To achieve this goal, we propose to represent the trigger function as a class-conditional generative model and to inject the backdoor in a constrained optimization framework, where the trigger function learns to generate an optimal trigger pattern to attack any target class at will while simultaneously embedding this generative backdoor into the trained model. Given the learned trigger-generation function, during inference, the adversary can specify an arbitrary backdoor attack target class, and an appropriate trigger causing the model to classify toward this target class is created accordingly. We show empirically that the proposed framework achieves high attack performance while preserving the clean-data performance in several benchmark datasets, including MNIST, CIFAR10, GTSRB, and TinyImageNet. The proposed Marksman backdoor attack can also easily bypass existing backdoor defenses that were originally designed against backdoor attacks with a single target class. Our work takes another significant step toward understanding the extensive risks of backdoor attacks in practice.

Fast Gaussian Process Predictions on Large Geospatial Fields with Prediction-Point Dependent Basis Functions arxiv:2210.09168 📈 2

Frida Marie Viset, Rudy Helmons, Manon Kok

**Abstract:** In order to perform GP predictions fast in large geospatial fields with small-scale variations, a computational complexity that is independent of the number of measurements $N$ and the size of the field is crucial. In this setting, GP approximations using $m$ basis functions requires $\mathcal{O}(Nm^2+m^3)$ computations. Using finite-support basis functions reduces the required number of computations to perform a single prediction to $\mathcal{O}(m^3)$, after a one-time training cost of $O(N)$. The prediction cost increases with increasing field size, as the number of required basis functions $m$ grows with the size of the field relative to the size of the spatial variations. To prevent the prediction speed from depending on field size, we propose leveraging the property that a subset of the trained system is a trained subset of the system to use only a local subset of $m'\ll m$ finite-support basis functions centered around each prediction point to perform predictions. Our proposed approximation requires $\mathcal{O}(m'^3)$ operations to perform each prediction after a one-time training cost of $\mathcal{O}(N)$. We show on real-life spatial data that our approach matches the prediction error of state-of-the-art methods and that it performs faster predictions, also compared to state-of-the-art approximations that lower the prediction cost of $\mathcal{O}(m^3)$ to $\mathcal{O}(m\log(m))$ using a conjugate gradient solver. Finally, we demonstrate that our approach can perform fast predictions on a global bathymetry dataset using millions of basis functions and tens of millions of measurements on a laptop computer.

Statistical, Robustness, and Computational Guarantees for Sliced Wasserstein Distances arxiv:2210.09160 📈 2

Sloan Nietert, Ritwik Sadhu, Ziv Goldfeld, Kengo Kato

**Abstract:** Sliced Wasserstein distances preserve properties of classic Wasserstein distances while being more scalable for computation and estimation in high dimensions. The goal of this work is to quantify this scalability from three key aspects: (i) empirical convergence rates; (ii) robustness to data contamination; and (iii) efficient computational methods. For empirical convergence, we derive fast rates with explicit dependence of constants on dimension, subject to log-concavity of the population distributions. For robustness, we characterize minimax optimal, dimension-free robust estimation risks, and show an equivalence between robust sliced 1-Wasserstein estimation and robust mean estimation. This enables lifting statistical and algorithmic guarantees available for the latter to the sliced 1-Wasserstein setting. Moving on to computational aspects, we analyze the Monte Carlo estimator for the average-sliced distance, demonstrating that larger dimension can result in faster convergence of the numerical integration error. For the max-sliced distance, we focus on a subgradient-based local optimization algorithm that is frequently used in practice, albeit without formal guarantees, and establish an $O(ε^{-4})$ computational complexity bound for it. Our theory is validated by numerical experiments, which altogether provide a comprehensive quantitative account of the scalability question.

Data Subsampling for Bayesian Neural Networks arxiv:2210.09141 📈 2

Eiji Kawasaki, Markus Holzmann

**Abstract:** Markov Chain Monte Carlo (MCMC) algorithms do not scale well for large datasets leading to difficulties in Neural Network posterior sampling. In this paper, we apply a generalization of the Metropolis Hastings algorithm that allows us to restrict the evaluation of the likelihood to small mini-batches in a Bayesian inference context. Since it requires the computation of a so-called "noise penalty" determined by the variance of the training loss function over the mini-batches, we refer to this data subsampling strategy as Penalty Bayesian Neural Networks - PBNNs. Its implementation on top of MCMC is straightforward, as the variance of the loss function merely reduces the acceptance probability. Comparing to other samplers, we empirically show that PBNN achieves good predictive performance for a given mini-batch size. Varying the size of the mini-batches enables a natural calibration of the predictive distribution and provides an inbuilt protection against overfitting. We expect PBNN to be particularly suited for cases when data sets are distributed across multiple decentralized devices as typical in federated learning.

Advanced Characterization-Informed Framework and Quantitative Insight to Irradiated Annular U-10Zr Metallic Fuels arxiv:2210.09104 📈 2

Fei Xu, Lu Cai, Daniele Salvato, Fidelma Dilemma, Luca Capriotti, Tiankai Yao

**Abstract:** U-10Zr-based metallic nuclear fuel is a promising fuel candidate for next-generation sodium-cooled fast reactors.The research experience of the Idaho National Laboratory for this type of fuel dates back to the 1960s. Idaho National Laboratory researchers have accumulated a considerable amount of experience and knowledge regarding fuel performance at the engineering scale. The limitation of advanced characterization and lack of proper data analysis tools prevented a mechanistic understanding of fuel microstructure evolution and properties degradation during irradiation. This paper proposed a new workflow, coupled with domain knowledge obtained by advanced post-irradiation examination methods, to provide unprecedented and quantified insights into the fission gas bubbles and pores, and lanthanide distribution in an annular fuel irradiated in the Advanced Test Reactor. In the study, researchers identify and confirm that the Zr-bearing secondary phases exist and generate the quantitative ratios of seven microstructures along the thermal gradient. Moreover, the distributions of fission gas bubbles on two samples of U-10Zr advanced fuels were quantitatively compared. Conclusive findings were obtained and allowed for evaluation of the lanthanide transportation through connected bubbles based on approximately 67,000 fission gas bubbles of the two advanced samples.

Machine Learning Technique Predicting Video Streaming Views to Reduce Cost of Cloud Services arxiv:2210.09078 📈 2

Mahmoud Darwich

**Abstract:** Video streams tremendously occupied the highest portion of online traffic. Multiple versions of a video are created to fit the user's device specifications. In cloud storage, Keeping all versions of frequently accessed video streams in the repository for the long term imposes a significant cost paid by video streaming providers. Generally, the popularity of a video changes each period of time, which means the number of views received by a video could be dropped, thus, the video must be deleted from the repository. Therefore, in this paper, we develop a method that predicts the popularity of each video stream in the repository in the next period. On the other hand, we propose an algorithm that utilizes the predicted popularity of a video to compute the storage cost, and then it decides whether the video will be kept or deleted from the cloud repository. The experiment results show a cost reduction of the cloud services by 15% compared to keeping all video streams.

Forward-Backward Latent State Inference for Hidden Continuous-Time semi-Markov Chains arxiv:2210.09058 📈 2

Nicolai Engelmann, Heinz Koeppl

**Abstract:** Hidden semi-Markov Models (HSMM's) - while broadly in use - are restricted to a discrete and uniform time grid. They are thus not well suited to explain often irregularly spaced discrete event data from continuous-time phenomena. We show that non-sampling-based latent state inference used in HSMM's can be generalized to latent Continuous-Time semi-Markov Chains (CTSMC's). We formulate integro-differential forward and backward equations adjusted to the observation likelihood and introduce an exact integral equation for the Bayesian posterior marginals and a scalable Viterbi-type algorithm for posterior path estimates. The presented equations can be efficiently solved using well-known numerical methods. As a practical tool, variable-step HSMM's are introduced. We evaluate our approaches in latent state inference scenarios in comparison to classical HSMM's.

Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations arxiv:2210.08902 📈 2

Julia El Zini, Mariette Awad

**Abstract:** Contrastive explanation methods go beyond transparency and address the contrastive aspect of explanations. Such explanations are emerging as an attractive option to provide actionable change to scenarios adversely impacted by classifiers' decisions. However, their extension to textual data is under-explored and there is little investigation on their vulnerabilities and limitations. This work motivates textual counterfactuals by laying the ground for a novel evaluation scheme inspired by the faithfulness of explanations. Accordingly, we extend the computation of three metrics, proximity,connectedness and stability, to textual data and we benchmark two successful contrastive methods, POLYJUICE and MiCE, on our suggested metrics. Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not obvious in both models. More interestingly, the generated contrastive texts are more attainable with POLYJUICE which highlights the significance of latent representations in counterfactual search. Finally, we perform the first semantic adversarial attack on textual recourse methods. The results demonstrate the robustness of POLYJUICE and the role that latent input representations play in robustness and reliability.

Space, Time, and Interaction: A Taxonomy of Corner Cases in Trajectory Datasets for Automated Driving arxiv:2210.08885 📈 2

Kevin Rösch, Florian Heidecker, Julian Truetsch, Kamil Kowol, Clemens Schicktanz, Maarten Bieshaar, Bernhard Sick, Christoph Stiller

**Abstract:** Trajectory data analysis is an essential component for highly automated driving. Complex models developed with these data predict other road users' movement and behavior patterns. Based on these predictions - and additional contextual information such as the course of the road, (traffic) rules, and interaction with other road users - the highly automated vehicle (HAV) must be able to reliably and safely perform the task assigned to it, e.g., moving from point A to B. Ideally, the HAV moves safely through its environment, just as we would expect a human driver to do. However, if unusual trajectories occur, so-called trajectory corner cases, a human driver can usually cope well, but an HAV can quickly get into trouble. In the definition of trajectory corner cases, which we provide in this work, we will consider the relevance of unusual trajectories with respect to the task at hand. Based on this, we will also present a taxonomy of different trajectory corner cases. The categorization of corner cases into the taxonomy will be shown with examples and is done by cause and required data sources. To illustrate the complexity between the machine learning (ML) model and the corner case cause, we present a general processing chain underlying the taxonomy.

Cerebrovascular Segmentation via Vessel Oriented Filtering Network arxiv:2210.08868 📈 2

Zhanqiang Guo, Yao Luan, Jianjiang Feng, Wangsheng Lu, Yin Yin, Guangming Yang, Jie Zhou

**Abstract:** Accurate cerebrovascular segmentation from Magnetic Resonance Angiography (MRA) and Computed Tomography Angiography (CTA) is of great significance in diagnosis and treatment of cerebrovascular pathology. Due to the complexity and topology variability of blood vessels, complete and accurate segmentation of vascular network is still a challenge. In this paper, we proposed a Vessel Oriented Filtering Network (VOF-Net) which embeds domain knowledge into the convolutional neural network. We design oriented filters for blood vessels according to vessel orientation field, which is obtained by orientation estimation network. Features extracted by oriented filtering are injected into segmentation network, so as to make use of the prior information that the blood vessels are slender and curved tubular structure. Experimental results on datasets of CTA and MRA show that the proposed method is effective for vessel segmentation, and embedding the specific vascular filter improves the segmentation performance.

Social Biases in Automatic Evaluation Metrics for NLG arxiv:2210.08859 📈 2

Mingqi Gao, Xiaojun Wan

**Abstract:** Many studies have revealed that word embeddings, language models, and models for specific downstream tasks in NLP are prone to social biases, especially gender bias. Recently these techniques have been gradually applied to automatic evaluation metrics for text generation. In the paper, we propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics and discover that social biases are also widely present in some model-based automatic evaluation metrics. Moreover, we construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks. Results show that given gender-neutral references in the evaluation, model-based evaluation metrics may show a preference for the male hypothesis, and the performance of them, i.e. the correlation between evaluation metrics and human judgments, usually has more significant variation after gender swapping.

Break The Spell Of Total Correlation In betaTCVAE arxiv:2210.08794 📈 2

Zihao Chen, Qiang Li, Bing Guo, Yan Shen

**Abstract:** This paper proposes a way to break the spell of total correlation in betaTCVAE based on the motivation of the total correlation decomposition. An iterative decomposition path of total correlation is proposed, and an explanation for representation learning ability of VAE from the perspective of model capacity allocation. Newly developed objective function combines latent variable dimensions into joint distribution while relieving independent distribution constraint of the marginal distribution in combination, leading to latent variables with a more manipulable prior distribution. The novel model enables VAE to adjust the parameter capacity to divide dependent and independent data features flexibly. Experimental results on various datasets show an interesting relevance between model capacity and the latent variable grouping size, called the "V"-shaped best ELBO trajectory. Additional experiments demonstrate that the proposed method obtains better disentanglement performance with reasonable parameter capacity allocation. Finally, we design experiments to show the limitations of estimating total correlation with mutual information, identifying its source of estimation deviation.

ReasonChainQA: Text-based Complex Question Answering with Explainable Evidence Chains arxiv:2210.08763 📈 2

Minjun Zhu, Yixuan Weng, Shizhu He, Kang Liu, Jun Zhao

**Abstract:** The ability of reasoning over evidence has received increasing attention in question answering (QA). Recently, natural language database (NLDB) conducts complex QA in knowledge base with textual evidences rather than structured representations, this task attracts a lot of attention because of the flexibility and richness of textual evidence. However, existing text-based complex question answering datasets fail to provide explicit reasoning process, while it's important for retrieval effectiveness and reasoning interpretability. Therefore, we present a benchmark \textbf{ReasonChainQA} with explanatory and explicit evidence chains. ReasonChainQA consists of two subtasks: answer generation and evidence chains extraction, it also contains higher diversity for multi-hop questions with varying depths, 12 reasoning types and 78 relations. To obtain high-quality textual evidences for answering complex question. Additional experiment on supervised and unsupervised retrieval fully indicates the significance of ReasonChainQA. Dataset and codes will be made publicly available upon accepted.

Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors arxiv:2210.08121 📈 2

Niklas Freymuth, Nicolas Schreiber, Philipp Becker, Aleksander Taranovic, Gerhard Neumann

**Abstract:** Humans intuitively solve tasks in versatile ways, varying their behavior in terms of trajectory-based planning and for individual steps. Thus, they can easily generalize and adapt to new and changing environments. Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting, making it difficult for them to imitate human behavior in case of versatile demonstrations. Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility. To facilitate generalization to novel task configurations, we do not directly match the agent's and expert's trajectory distributions but rather work with concise geometric descriptors which generalize well to unseen task configurations. We empirically validate our method on various robot tasks using versatile human demonstrations and compare to imitation learning algorithms in a state-action setting as well as a trajectory-based setting. We find that the geometric descriptors greatly help in generalizing to new task configurations and that combining them with our distribution-matching objective is crucial for representing and reproducing versatile behavior.

Adversarial and Safely Scaled Question Generation arxiv:2210.09467 📈 1

Sreehari Sankar, Zhihang Dong

**Abstract:** Question generation has recently gained a lot of research interest, especially with the advent of large language models. In and of itself, question generation can be considered 'AI-hard', as there is a lack of unanimously agreed sense of what makes a question 'good' or 'bad'. In this paper, we tackle two fundamental problems in parallel: on one hand, we try to solve the scaling problem, where question-generation and answering applications have to be applied to a massive amount of text without ground truth labeling. The usual approach to solve this problem is to either downsample or summarize. However, there are critical risks of misinformation with these approaches. On the other hand, and related to the misinformation problem, we try to solve the 'safety' problem, as many public institutions rely on a much higher level of accuracy for the content they provide. We introduce an adversarial approach to tackle the question generation safety problem with scale. Specifically, we designed a question-answering system that specifically prunes out unanswerable questions that may be generated, and further increases the quality of the answers that are generated. We build a production-ready, easily-plugged pipeline that can be used on any given body of text, that is scalable and immune from generating any hate speech, profanity, or misinformation. Based on the results, we are able to generate more than six times the number of quality questions generated by the abstractive approach, with a perceived quality being 44% higher, according to a survey of 168 participants.

Data-driven Modeling of Mach-Zehnder Interferometer-based Optical Matrix Multipliers arxiv:2210.09171 📈 1

Ali Cem, Siqi Yan, Yunhong Ding, Darko Zibar, Francesco Da Ros

**Abstract:** Photonic integrated circuits are facilitating the development of optical neural networks, which have the potential to be both faster and more energy efficient than their electronic counterparts since optical signals are especially well-suited for implementing matrix multiplications. However, accurate programming of photonic chips for optical matrix multiplication remains a difficult challenge. Here, we describe both simple analytical models and data-driven models for offline training of optical matrix multipliers. We train and evaluate the models using experimental data obtained from a fabricated chip featuring a Mach-Zehnder interferometer mesh implementing 3-by-3 matrix multiplication. The neural network-based models outperform the simple physics-based models in terms of prediction error. Furthermore, the neural network models are also able to predict the spectral variations in the matrix weights for up to 100 frequency channels covering the C-band. The use of neural network models for programming the chip for optical matrix multiplication yields increased performance on multiple machine learning tasks.

Asymptotic-Preserving Neural Networks for hyperbolic systems with diffusive scaling arxiv:2210.09081 📈 1

Giulia Bertaglia

**Abstract:** With the rapid advance of Machine Learning techniques and the deep increment of availability of scientific data, data-driven approaches have started to become progressively popular across science, causing a fundamental shift in the scientific method after proving to be powerful tools with a direct impact in many areas of society. Nevertheless, when attempting to analyze the dynamics of complex multiscale systems, the usage of standard Deep Neural Networks (DNNs) and even standard Physics-Informed Neural Networks (PINNs) may lead to incorrect inferences and predictions, due to the presence of small scales leading to reduced or simplified models in the system that have to be applied consistently during the learning process. In this Chapter, we will address these issues in light of recent results obtained in the development of Asymptotic-Preserving Neural Networks (APNNs) for hyperbolic models with diffusive scaling. Several numerical tests show how APNNs provide considerably better results with respect to the different scales of the problem when compared with standard DNNs and PINNs, especially when analyzing scenarios in which only little and scattered information is available.

Attribute Inference Attacks in Online Multiplayer Video Games: a Case Study on Dota2 arxiv:2210.09028 📈 1

Pier Paolo Tricomi, Lisa Facciolo, Giovanni Apruzzese, Mauro Conti

**Abstract:** Did you know that over 70 million of Dota2 players have their in-game data freely accessible? What if such data is used in malicious ways? This paper is the first to investigate such a problem. Motivated by the widespread popularity of video games, we propose the first threat model for Attribute Inference Attacks (AIA) in the Dota2 context. We explain how (and why) attackers can exploit the abundant public data in the Dota2 ecosystem to infer private information about its players. Due to lack of concrete evidence on the efficacy of our AIA, we empirically prove and assess their impact in reality. By conducting an extensive survey on $\sim$500 Dota2 players spanning over 26k matches, we verify whether a correlation exists between a player's Dota2 activity and their real-life. Then, after finding such a link ($p\!<\!0.01$ and $ρ>0.3$), we ethically perform diverse AIA. We leverage the capabilities of machine learning to infer real-life attributes of the respondents of our survey by using their publicly available in-game data. Our results show that, by applying domain expertise, some AIA can reach up to 98% precision and over 90% accuracy. This paper hence raises the alarm on a subtle, but concrete threat that can potentially affect the entire competitive gaming landscape. We alerted the developers of Dota2.

Regret Bounds for Learning Decentralized Linear Quadratic Regulator with Partially Nested Information Structure arxiv:2210.08886 📈 1

Lintao Ye, Ming Chi, Vijay Gupta

**Abstract:** We study the problem of learning decentralized linear quadratic regulator under a partially nested information constraint, when the system model is unknown a priori. We propose an online learning algorithm that adaptively designs a control policy as new data samples from a single system trajectory become available. Our algorithm design uses a disturbance-feedback representation of state-feedback controllers coupled with online convex optimization with memory and delayed feedback. We show that our online algorithm yields a controller that satisfies the desired information constraint and enjoys an expected regret that scales as $\sqrt{T}$ with the time horizon $T$.

A Unitary Transform Based Generalized Approximate Message Passing arxiv:2210.08861 📈 1

Jiang Zhu, Xiangming Meng, Xupeng Lei, Qinghua Guo

**Abstract:** We consider the problem of recovering an unknown signal ${\mathbf x}\in {\mathbb R}^n$ from general nonlinear measurements obtained through a generalized linear model (GLM), i.e., ${\mathbf y}= f\left({\mathbf A}{\mathbf x}+{\mathbf w}\right)$, where $f(\cdot)$ is a componentwise nonlinear function. Based on the unitary transform approximate message passing (UAMP) and expectation propagation, a unitary transform based generalized approximate message passing (GUAMP) algorithm is proposed for general measurement matrices $\bf{A}$, in particular highly correlated matrices. Experimental results on quantized compressed sensing demonstrate that the proposed GUAMP significantly outperforms state-of-the-art GAMP and GVAMP under correlated matrices $\bf{A}$.

Protein Sequence and Structure Co-Design with Equivariant Translation arxiv:2210.08761 📈 1

Chence Shi, Chuanrui Wang, Jiarui Lu, Bozitao Zhong, Jian Tang

**Abstract:** Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods.

CNT (Conditioning on Noisy Targets): A new Algorithm for Leveraging Top-Down Feedback arxiv:2210.09505 📈 0

Alexia Jolicoeur-Martineau, Alex Lamb, Vikas Verma, Aniket Didolkar

**Abstract:** We propose a novel regularizer for supervised learning called Conditioning on Noisy Targets (CNT). This approach consists in conditioning the model on a noisy version of the target(s) (e.g., actions in imitation learning or labels in classification) at a random noise level (from small to large noise). At inference time, since we do not know the target, we run the network with only noise in place of the noisy target. CNT provides hints through the noisy label (with less noise, we can more easily infer the true target). This give two main benefits: 1) the top-down feedback allows the model to focus on simpler and more digestible sub-problems and 2) rather than learning to solve the task from scratch, the model will first learn to master easy examples (with less noise), while slowly progressing toward harder examples (with more noise).

S$^3$-NeRF: Neural Reflectance Field from Shading and Shadow under a Single Viewpoint arxiv:2210.08936 📈 0

Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong

**Abstract:** In this paper, we address the "dual problem" of multi-view scene reconstruction in which we utilize single-view images captured under different point lights to learn a neural scene representation. Different from existing single-view methods which can only recover a 2.5D scene representation (i.e., a normal / depth map for the visible surface), our method learns a neural reflectance field to represent the 3D geometry and BRDFs of a scene. Instead of relying on multi-view photo-consistency, our method exploits two information-rich monocular cues, namely shading and shadow, to infer scene geometry. Experiments on multiple challenging datasets show that our method is capable of recovering 3D geometry, including both visible and invisible parts, of a scene from single-view images. Thanks to the neural reflectance field representation, our method is robust to depth discontinuities. It supports applications like novel-view synthesis and relighting. Our code and model can be found at https://ywq.github.io/s3nerf.

Prev: 2022.10.16 Next: 2022.10.18