Summary for 2021-06-22, created on 2021-12-20

Dangers of Bayesian Model Averaging under Covariate Shift arxiv:2106.11905 📈 45

Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, Andrew Gordon Wilson

**Abstract:** Approximate Bayesian inference for neural networks is considered a robust alternative to standard training, often providing good performance on out-of-distribution data. However, Bayesian neural networks (BNNs) with high-fidelity approximate inference via full-batch Hamiltonian Monte Carlo achieve poor generalization under covariate shift, even underperforming classical estimation. We explain this surprising result, showing how a Bayesian model average can in fact be problematic under covariate shift, particularly in cases where linear dependencies in the input features cause a lack of posterior contraction. We additionally show why the same issue does not affect many approximate inference procedures, or classical maximum a-posteriori (MAP) training. Finally, we propose novel priors that improve the robustness of BNNs to many sources of covariate shift.
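
The failure mode has a minimal linear-Gaussian illustration (a sketch of our own, not the paper's code): a feature that is identically zero during training gets no posterior contraction, so its coefficient stays at the prior, and the posterior predictive blows up once covariate shift makes that feature nonzero, while the MAP weight sits harmlessly at the prior mode. In this conjugate toy the BMA mean still matches MAP; the paper's point is that for nonlinear BNNs the uncontracted directions also corrupt the averaged prediction itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, tau2 = 100, 0.1, 1.0           # observation noise and prior variances

X = np.zeros((n, 2))
X[:, 0] = rng.normal(size=n)              # informative feature
# X[:, 1] stays identically zero at training time (a "dead" feature)
y = 2.0 * X[:, 0] + rng.normal(scale=np.sqrt(sigma2), size=n)

# Conjugate posterior N(mu, Sigma) with Sigma = (X^T X / sigma2 + I / tau2)^{-1}
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu = Sigma @ X.T @ y / sigma2             # also the MAP estimate here

x_shift = np.array([1.0, 5.0])            # covariate shift: the dead feature wakes up
w_samples = rng.multivariate_normal(mu, Sigma, size=2000)
bma_preds = w_samples @ x_shift

print("MAP prediction:        ", x_shift @ mu)     # ~ 2.0, stable
print("BMA predictive std dev:", bma_preds.std())  # ~ 5.0, prior-dominated
```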

Randomness In Neural Network Training: Characterizing The Impact of Tooling arxiv:2106.11872 📈 25

Donglin Zhuang, Xingyao Zhang, Shuaiwen Leon Song, Sara Hooker

**Abstract:** The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how our choice of tooling introduces randomness into deep neural network training. We conduct large-scale experiments across different types of hardware, accelerators, state-of-the-art networks, and open-source datasets, to characterize how tooling choices contribute to the level of non-determinism in a system, the impact of said non-determinism, and the cost of eliminating different sources of noise. Our findings are surprising, and suggest that the impact of non-determinism is nuanced. While top-line metrics such as top-1 accuracy are not noticeably impacted, model performance on certain parts of the data distribution is far more sensitive to the introduction of randomness. Our results suggest that deterministic tooling is critical for AI safety. However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to $746\%$, $241\%$, and $196\%$ on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training. The source code used in this paper is available at https://github.com/usyd-fsalab/NeuralNetworkRandomness.
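
For readers who want to probe this on their own models, the main tooling knobs live in a handful of framework settings. The sketch below (assuming PyTorch; the paper's own experiment code is at the repository linked above) collects the standard switches that trade speed for determinism:

```python
import os
import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    """Pin the usual sources of tooling non-determinism in a PyTorch run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuned kernel selection
    # Required by some cuBLAS ops (CUDA >= 10.2) when determinism is enforced:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Error out on any op that lacks a deterministic implementation:
    torch.use_deterministic_algorithms(True)

make_deterministic(42)
```

The overheads quoted in the abstract come precisely from kernels like these running in their slower deterministic variants.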

Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers arxiv:2106.12442 📈 24

Apratim Bhattacharyya, Daniel Olmeda Reino, Mario Fritz, Bernt Schiele

**Abstract:** Accurate prediction of pedestrian and bicyclist paths is integral to the development of reliable autonomous vehicles in dense urban environments. The interactions between vehicles and pedestrians or bicyclists have a significant impact on the trajectories of traffic participants, e.g. stopping or turning to avoid collisions. Although recent datasets and trajectory prediction approaches have fostered the development of autonomous vehicles, the vehicle-pedestrian (bicyclist) interactions they model are sparse. In this work, we propose Euro-PVI, a dataset of pedestrian and bicyclist trajectories. In particular, our dataset covers more diverse and complex interactions in dense urban scenarios compared to existing datasets. To address the challenges in predicting future trajectories with dense interactions, we develop a joint inference model that learns an expressive multi-modal shared latent space across agents in the urban scene. This enables our Joint-$β$-cVAE approach to better model the distribution of future trajectories. We achieve state-of-the-art results on the nuScenes and Euro-PVI datasets, demonstrating the importance of capturing interactions between the ego-vehicle and pedestrians (bicyclists) for accurate predictions.

Towards Biologically Plausible Convolutional Networks arxiv:2106.13031 📈 23

Roman Pogodin, Yash Mehta, Timothy P. Lillicrap, Peter E. Latham

**Abstract:** Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.
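
In its simplest, hard form, the "sleep phase" amounts to periodically pulling every position's weights in a locally connected layer toward the pool average; the paper implements a soft, biologically plausible version of this via lateral connections and Hebbian plasticity. A minimal sketch of the hard variant (our illustration, not the authors' code):

```python
import numpy as np

positions, fan_in = 49, 9                 # e.g. a 7x7 grid with 3x3 receptive fields
rng = np.random.default_rng(0)
W = rng.normal(size=(positions, fan_in))  # one weight vector per position (no sharing)

def sleep_phase(W: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Move each position's weights toward the pool mean.

    strength=1.0 recovers exact convolutional weight sharing;
    smaller values only partially equalize the pool.
    """
    shared = W.mean(axis=0, keepdims=True)
    return (1.0 - strength) * W + strength * shared

W = sleep_phase(W, strength=0.5)          # interleave with normal training updates
```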

ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences arxiv:2106.12027 📈 23

Yanjun Gao, Ting-hao Huang, Rebecca J. Passonneau

**Abstract:** Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.

Differentiable Programming of Reaction-Diffusion Patterns arxiv:2107.06862 📈 21

Alexander Mordvintsev, Ettore Randazzo, Eyvind Niklasson

**Abstract:** Reaction-Diffusion (RD) systems provide a computational framework that governs many pattern formation processes in nature. Current RD system design practices boil down to trial-and-error parameter search. We propose a differentiable optimization method for learning the RD system parameters to perform example-based texture synthesis on a 2D plane. We do this by representing the RD system as a variant of Neural Cellular Automata and using task-specific differentiable loss functions. RD systems generated by our method exhibit robust, non-trivial 'life-like' behavior.
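
For context, the trial-and-error practice the paper replaces usually starts from a hand-tuned system such as Gray-Scott; one explicit Euler step is shown below (a generic textbook sketch in numpy, with the standard parameter names f, k, Du, Dv, not values from the paper). Writing the same update in a differentiable framework is what makes these parameters learnable from a texture loss.

```python
import numpy as np

def laplacian(Z: np.ndarray) -> np.ndarray:
    # 5-point stencil with periodic boundary conditions
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4.0 * Z)

def gray_scott_step(U, V, Du=0.16, Dv=0.08, f=0.035, k=0.065, dt=1.0):
    UVV = U * V * V                        # reaction term
    U = U + dt * (Du * laplacian(U) - UVV + f * (1.0 - U))
    V = V + dt * (Dv * laplacian(V) + UVV - (f + k) * V)
    return U, V

U, V = np.ones((128, 128)), np.zeros((128, 128))
V[60:68, 60:68] = 0.5                      # seed a perturbation
for _ in range(5000):
    U, V = gray_scott_step(U, V)           # spot/stripe patterns emerge in V
```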

On the Diversity and Limits of Human Explanations arxiv:2106.11988 📈 20

Chenhao Tan

**Abstract:** A growing effort in NLP aims to build datasets of human explanations. However, the term explanation encompasses a broad range of notions, each with different properties and ramifications. Our goal is to provide an overview of the diverse types of explanations and human limitations, and to discuss implications for collecting and using explanations in NLP. Inspired by prior work in psychology and cognitive sciences, we group existing human explanations in NLP into three categories: proximal mechanism, evidence, and procedure. These three types differ in nature and have implications for the resultant explanations. For instance, procedures are not considered explanations in psychology and connect with a rich body of work on learning from instructions. The diversity of explanations is further evidenced by the proxy questions that are needed for annotators to interpret and answer open-ended why questions. Finally, explanations may require different, often deeper, understandings than predictions, which casts doubt on whether humans can provide useful explanations in some tasks.

NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs arxiv:2106.12144 📈 12

Mikhail Galkin, Jiapeng Wu, Etienne Denis, William L. Hamilton

**Abstract:** Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector. Such a shallow lookup results in a linear growth of memory consumption for storing the embedding matrix and incurs high computational costs when working with real-world KGs. Drawing parallels with subword tokenization commonly used in NLP, we explore the landscape of more parameter-efficient node embedding strategies with possibly sublinear memory requirements. To this end, we propose NodePiece, an anchor-based approach to learn a fixed-size entity vocabulary. In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types. Given such a fixed-size vocabulary, it is possible to bootstrap an encoding and embedding for any entity, including those unseen during training. Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks while retaining less than 10% of explicit nodes in a graph as anchors and often having 10x fewer parameters.
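
The core tokenization step is easy to sketch: hash each entity to its k nearest anchors by graph distance, so only the anchor vocabulary needs embeddings. The snippet below is our minimal illustration of that idea (relation-type context and the authors' actual hashing are omitted):

```python
from collections import deque

def bfs_distances(adj: dict, source) -> dict:
    """Unweighted shortest-path distances from source over an adjacency dict."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def tokenize(adj: dict, anchors: list, entity, k: int = 2) -> list:
    """Return the k closest anchors: the entity's 'subword' tokens."""
    dist = bfs_distances(adj, entity)
    reachable = sorted((dist[a], a) for a in anchors if a in dist)
    return [a for _, a in reachable[:k]]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(tokenize(adj, anchors=[0, 4], entity=2))   # -> [0, 4]
```

An unseen entity gets a token sequence the same way, which is what keeps the vocabulary fixed-size while coverage stays open-ended.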

Algorithmic Recourse in Partially and Fully Confounded Settings Through Bounding Counterfactual Effects arxiv:2106.11849 📈 12

Julius von Kügelgen, Nikita Agarwal, Jakob Zeitler, Afsaneh Mastouri, Bernhard Schölkopf

**Abstract:** Algorithmic recourse aims to provide actionable recommendations to individuals to obtain a more favourable outcome from an automated decision-making system. As it involves reasoning about interventions performed in the physical world, recourse is fundamentally a causal problem. Existing methods compute the effect of recourse actions using a causal model learnt from data under the assumption of no hidden confounding and modelling assumptions such as additive noise. Building on the seminal work of Balke and Pearl (1994), we propose an alternative approach for discrete random variables which relaxes these assumptions and allows for unobserved confounding and arbitrary structural equations. The proposed approach only requires specification of the causal graph and confounding structure and bounds the expected counterfactual effect of recourse actions. If the lower bound is above a certain threshold, i.e., on the other side of the decision boundary, recourse is guaranteed in expectation.

DP-SGD vs PATE: Which Has Less Disparate Impact on Model Accuracy? arxiv:2106.12576 📈 10

Archit Uniyal, Rakshit Naidu, Sasikanth Kotti, Sahib Singh, Patrik Joslin Kenfack, Fatemehsadat Mireshghallah, Andrew Trask

**Abstract:** Recent advances in differentially private deep learning have demonstrated that application of differential privacy, specifically the DP-SGD algorithm, has a disparate impact on different sub-groups in the population, leading to a significantly larger drop in model utility for sub-populations that are under-represented (minorities) compared to well-represented ones. In this work, we aim to compare PATE, another mechanism for training deep learning models using differential privacy, with DP-SGD in terms of fairness. We show that PATE does have a disparate impact too; however, it is much less severe than that of DP-SGD. We draw insights from this observation on what might be promising directions in achieving better fairness-privacy trade-offs.

RootPainter3D: Interactive-machine-learning enables rapid and accurate contouring for radiotherapy arxiv:2106.11942 📈 10

Abraham George Smith, Jens Petersen, Cynthia Terrones-Campos, Anne Kiil Berthelsen, Nora Jarrett Forbes, Sune Darkner, Lena Specht, Ivan Richter Vogelius

**Abstract:** Organ-at-risk contouring is still a bottleneck in radiotherapy, with many deep learning methods falling short of promised results when evaluated on clinical data. We investigate the accuracy and time-savings resulting from the use of an interactive-machine-learning method for an organ-at-risk contouring task. We compare the method to the Eclipse contouring software and find strong agreement with manual delineations, with a Dice score of 0.95. Annotations created using corrective-annotation also take less time as more images are annotated, resulting in substantial time savings compared to manual methods: hearts take 2 minutes and 2 seconds to delineate on average, after 923 images have been delineated, compared to 7 minutes and 1 second when delineating manually. Our experiment demonstrates that interactive-machine-learning with corrective-annotation provides a fast and accessible way for non-computer-scientists to train deep-learning models to segment their own structures of interest as part of routine clinical workflows. Source code is available at \href{https://github.com/Abe404/RootPainter3D}{this HTTPS URL}.

On Positivity Bias in Negative Reviews arxiv:2106.12056 📈 9

Madhusudhan Aithal, Chenhao Tan

**Abstract:** Prior work has revealed that positive words occur more frequently than negative words in human expressions, which is typically attributed to positivity bias, a tendency for people to report positive views of reality. But what about the language used in negative reviews? Consistent with prior work, we show that English negative reviews tend to contain more positive words than negative words, using a variety of datasets. We reconcile this observation with prior findings on the pragmatics of negation, and show that negations are commonly associated with positive words in negative reviews. Furthermore, in negative reviews, the majority of sentences with positive words express negative opinions based on sentiment classifiers, indicating some form of negation.
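
The basic counting analysis is straightforward to reproduce with any opinion lexicon; the word lists below are tiny illustrative stand-ins, not the lexicons used in the paper:

```python
import re
from collections import Counter

POS = {"good", "great", "nice", "love", "best", "like"}
NEG = {"bad", "terrible", "awful", "hate", "worst", "poor"}

def polarity_counts(review: str) -> Counter:
    """Tally positive vs. negative lexicon tokens in one review."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return Counter(
        "pos" if t in POS else "neg" for t in tokens if t in POS or t in NEG
    )

negative_review = "I wanted to like this; the staff seemed nice, but it was not good."
print(polarity_counts(negative_review))   # positive words dominate
```

Note how the positive words here sit under negation ("not good"), which is exactly the association with negation that the paper quantifies.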

Variance-Aware Off-Policy Evaluation with Linear Function Approximation arxiv:2106.11960 📈 9

Yifei Min, Tianhao Wang, Dongruo Zhou, Quanquan Gu

**Abstract:** We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.

Provably Efficient Representation Learning in Low-rank Markov Decision Processes arxiv:2106.11935 📈 9

Weitong Zhang, Jiafan He, Dongruo Zhou, Amy Zhang, Quanquan Gu

**Abstract:** The success of deep reinforcement learning (DRL) is due to the power of learning a representation that is suitable for the underlying exploration and exploitation task. However, existing provable reinforcement learning algorithms with linear function approximation often assume the feature representation is known and fixed. In order to understand how representation learning can improve the efficiency of RL, we study representation learning for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose a provably efficient algorithm called ReLEX that can simultaneously learn the representation and perform exploration. We show that ReLEX always performs no worse than a state-of-the-art algorithm without representation learning, and will be strictly better in terms of sample efficiency if the function class of representations enjoys a certain mild "coverage" property over the whole state-action space.

Towards Automated Evaluation of Explanations in Graph Neural Networks arxiv:2106.11864 📈 9

Vanya BK, Balaji Ganesan, Aniket Saxena, Devbrat Sharma, Arvind Agarwal

**Abstract:** Explaining Graph Neural Network predictions to end users of AI applications in easily understandable terms remains an unsolved problem. In particular, we do not have well-developed methods for automatically evaluating explanations in ways that are closer to how users consume those explanations. Based on recent application trends and our own experiences in real-world problems, we propose automatic evaluation approaches for GNN explanations.

Credal Self-Supervised Learning arxiv:2106.11853 📈 9

Julian Lienen, Eyke Hüllermeier

**Abstract:** Self-training is an effective approach to semi-supervised learning. The key idea is to let the learner itself iteratively generate "pseudo-supervision" for unlabeled instances based on its current hypothesis. In combination with consistency regularization, pseudo-labeling has shown promising performance in various domains, for example in computer vision. To account for the hypothetical nature of the pseudo-labels, these are commonly provided in the form of probability distributions. Still, one may argue that even a probability distribution represents an excessive level of informedness, as it suggests that the learner precisely knows the ground-truth conditional probabilities. In our approach, we therefore allow the learner to label instances in the form of credal sets, that is, sets of (candidate) probability distributions. Thanks to this increased expressiveness, the learner is able to represent uncertainty and a lack of knowledge in a more flexible and more faithful manner. To learn from weakly labeled data of that kind, we leverage methods that have recently been proposed in the realm of so-called superset learning. In an exhaustive empirical evaluation, we compare our methodology to state-of-the-art self-supervision approaches, showing competitive to superior performance especially in low-label scenarios incorporating a high degree of uncertainty.

A Survey on Human-aware Robot Navigation arxiv:2106.11650 📈 9

Ronja Möller, Antonino Furnari, Sebastiano Battiato, Aki Härmä, Giovanni Maria Farinella

**Abstract:** Intelligent systems are increasingly part of our everyday lives and have been integrated seamlessly to the point where it is difficult to imagine a world without them. Physical manifestations of those systems, on the other hand, in the form of embodied agents or robots, have so far been used only for specific applications and are often limited to functional roles (e.g. in the industry, entertainment and military fields). Given the current growth and innovation in the research communities concerned with the topics of robot navigation, human-robot interaction and human activity recognition, it seems like this might soon change. Robots are increasingly easy to obtain and use, and their acceptance in general is growing. However, the design of a socially compliant robot that can function as a companion needs to take various areas of research into account. This paper is concerned with the navigation aspect of a socially-compliant robot and provides a survey of existing solutions for the relevant areas of research as well as an outlook on possible future directions.

Repulsive Deep Ensembles are Bayesian arxiv:2106.11642 📈 9

Francesco D'Angelo, Vincent Fortuin

**Abstract:** Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this not only affects the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.
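
The flavor of the update is easy to convey in a toy weight-space sketch (ours, with an RBF kernel; the paper derives the exact form and its function-space variant): each member follows its log-posterior gradient plus a kernel term that pushes members apart.

```python
import numpy as np

def rbf(X: np.ndarray, h: float = 1.0):
    """RBF kernel matrix K and the gradient of K(X_i, X_j) wrt X_i."""
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))
    gradK = -diff / h ** 2 * K[..., None]         # (n, n, d)
    return K, gradK

def repulsive_step(X, grad_log_p, lr=1e-2):
    K, gradK = rbf(X)
    drive = grad_log_p(X)                          # pull toward the posterior
    repulsion = gradK.sum(axis=1) / K.sum(axis=1, keepdims=True)
    return X + lr * (drive - repulsion)            # repulsion prevents collapse

# Toy posterior N(0, I), so grad log p(w) = -w
particles = np.random.default_rng(0).normal(3.0, 0.1, size=(10, 2))
for _ in range(500):
    particles = repulsive_step(particles, lambda W: -W)
# The particles now spread over N(0, I) instead of stacking on the mode.
```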

Sparsistent Model Discovery arxiv:2106.11936 📈 8

Georges Tod, Gert-Jan Both, Remy Kusters

**Abstract:** Discovering the partial differential equations underlying spatio-temporal datasets from very limited and highly noisy observations is of paramount interest in many scientific fields. However, it remains an open question when model discovery algorithms based on sparse regression can actually recover the underlying physical processes. In this work, we show that the design matrices used to infer the equations by sparse regression can violate the irrepresentability condition (IRC) of the Lasso, even when derived from analytical PDE solutions (i.e. without additional noise). Sparse regression techniques which can recover the true underlying model under violated IRC conditions are therefore required, leading to the introduction of the randomised adaptive Lasso. We show that once the latter is integrated within the deep learning model discovery framework DeepMod, a wide variety of nonlinear and chaotic canonical PDEs can be recovered: (1) at up to $\mathcal{O}(2)$ higher noise-to-sample ratios than state-of-the-art algorithms, and (2) with a single set of hyperparameters, which paves the way towards truly automated model discovery.
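
A sketch of what a randomised adaptive Lasso can look like in practice (our stability-selection-style reading, with illustrative thresholds and scaling ranges, not the authors' exact procedure): adaptive weights come from an initial ridge fit, randomisation from per-feature penalty scalings, and the support is read off selection frequencies across runs.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def randomized_adaptive_lasso(X, y, alpha=0.1, n_runs=100, weakness=0.5,
                              threshold=0.8, seed=0):
    rng = np.random.default_rng(seed)
    beta0 = Ridge(alpha=1.0).fit(X, y).coef_
    X_ad = X * np.abs(beta0)                  # adaptive reweighting of the design
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        scales = rng.uniform(weakness, 1.0, size=X.shape[1])  # random penalties
        fit = Lasso(alpha=alpha, max_iter=5000).fit(X_ad * scales, y)
        counts += fit.coef_ != 0
    return counts / n_runs >= threshold       # stable support across runs

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
print(randomized_adaptive_lasso(X, y))        # True only at indices 0 and 3
```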

On the importance of cross-task features for class-incremental learning arxiv:2106.11930 📈 8

Albin Soutif--Cormerais, Marc Masana, Joost Van de Weijer, Bartłomiej Twardowski

**Abstract:** In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever-growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference from task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e. distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features and the knowledge transfer between tasks. This is especially important when tasks contain a limited amount of data.

MEAL: Manifold Embedding-based Active Learning arxiv:2106.11858 📈 8

Deepthi Sreenivasaiah, Johannes Otterbach, Thomas Wollmann

**Abstract:** Image segmentation is a common and challenging task in autonomous driving. The availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning that suggests promising patches extracted from the full image at each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation and Projection (UMAP) to model representativeness with entropy as an uncertainty measure to model informativeness. We apply our proposed method to the autonomous driving datasets CamVid and Cityscapes and perform a quantitative comparison with state-of-the-art baselines. We find that our active learning method achieves better performance compared to previous methods.

A Vertical Federated Learning Framework for Graph Convolutional Network arxiv:2106.11593 📈 8

Xiang Ni, Xiaolong Xu, Lingjuan Lyu, Changhua Meng, Weiqiang Wang

**Abstract:** Recently, Graph Neural Networks (GNNs) have achieved remarkable success in various real-world problems on graph data. However, in most industries data exists in the form of isolated islands, and data privacy and security are important issues. In this paper, we propose FedVGCN, a federated GCN learning paradigm for the privacy-preserving node classification task under a vertically partitioned data setting, which can be generalized to existing GCN models. Specifically, we split the computation graph data into two parts. For each iteration of the training process, the two parties transfer intermediate results to each other under homomorphic encryption. We conduct experiments on benchmark data and the results demonstrate the effectiveness of FedVGCN in the case of GraphSAGE.

Q-Learning Lagrange Policies for Multi-Action Restless Bandits arxiv:2106.12024 📈 7

Jackson A. Killian, Arpita Biswas, Sanket Shah, Milind Tambe

**Abstract:** Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed. However, previous work only studies the offline setting where the problem dynamics are known. We address this restrictive assumption, designing the first algorithms for learning good policies for multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as $t\rightarrow{}\infty$. However, MAIQL relies on learning Q-functions and indices on two timescales, which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound of LPQL's return as $t\rightarrow{}\infty$. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.

Physics-Informed Deep Reversible Regression Model for Temperature Field Reconstruction of Heat-Source Systems arxiv:2106.11929 📈 7

Zhiqiang Gong, Weien Zhou, Jun Zhang, Wei Peng, Wen Yao

**Abstract:** Temperature monitoring during the lifetime of heat-source components in engineering systems is essential to guarantee their normal operation and working life. However, prior methods, which mainly use interpolation to reconstruct the temperature field from limited monitoring points, require large amounts of temperature tensors for an accurate estimation. This may decrease the availability and reliability of the system and sharply increase the monitoring cost. To solve this problem, this work develops a novel physics-informed deep reversible regression model for temperature field reconstruction of heat-source systems (TFR-HSS), which can better reconstruct the temperature field from limited monitoring points without supervision. First, we define the TFR-HSS task mathematically and model it numerically, transforming the task into an image-to-image regression problem. Then this work develops the deep reversible regression model, which can better learn the physical information, especially over the boundary. Finally, considering the physical characteristics of heat conduction as well as the boundary conditions, this work proposes a physics-informed reconstruction loss comprising four training losses and jointly learns the deep surrogate model with these losses without supervision. Experimental studies have been conducted on typical two-dimensional heat-source systems to demonstrate the effectiveness of the proposed method.

LV-BERT: Exploiting Layer Variety for BERT arxiv:2106.11740 📈 7

Weihao Yu, Zihang Jiang, Fei Chen, Qibin Hou, Jiashi Feng

**Abstract:** Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to a large architecture space of more than billions of candidates, while training a single candidate model from scratch already requires a huge computation cost, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE test set, 1.8 higher than the strong baseline ELECTRA-small.

ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations arxiv:2106.15320 📈 6

Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, Jian Wu

**Abstract:** We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are added and as millions of older theses and dissertations are converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Our assessment of state-of-the-art figure extraction systems is that they do not function well on scanned PDFs because they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of these concerns the training value of data augmentation techniques applied to born-digital documents used to train models better suited to figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.

It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning arxiv:2106.12066 📈 6

Alexey Tikhonov, Max Ryabinin

**Abstract:** Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back the progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning, even when applied to other languages in a zero-shot manner. Also, we demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.
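
The recipe is short enough to sketch end to end (model choice, token indices, and pooling below are illustrative assumptions, not the authors' exact pipeline): pull per-head attention weights between the pronoun and each candidate antecedent, then fit a linear classifier on them.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)

def head_features(sentence: str, pronoun_idx: int, candidate_idx: int) -> np.ndarray:
    """Attention from the pronoun token to a candidate token, for every head."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions: one (1, heads, seq, seq) tensor per layer
    att = torch.stack(out.attentions).squeeze(1)   # (layers, heads, seq, seq)
    return att[:, :, pronoun_idx, candidate_idx].flatten().numpy()

# With a labeled corpus of (sentence, pronoun_idx, candidate_idx, label) tuples:
# X = np.stack([head_features(s, p, c) for s, p, c, _ in data])
# y = [label for *_, label in data]
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```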

Test-time Collective Prediction arxiv:2106.12012 📈 6

Celestine Mendler-Dünner, Wenshuo Guo, Stephen Bates, Michael I. Jordan

**Abstract:** An increasingly common setting in machine learning involves multiple parties, each with their own data, who want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents to make better predictions than they would individually, but may not be willing to release their data or model parameters. In this work, we explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model without relying on external validation, model retraining, or data pooling. Our approach takes inspiration from the literature in social science on human consensus-making. We analyze our mechanism theoretically, showing that it converges to inverse mean-squared-error (MSE) weighting in the large-sample limit. To compute error bars on the collective predictions we propose a decentralized Jackknife procedure that evaluates the sensitivity of our mechanism to a single agent's prediction. Empirically, we demonstrate that our scheme effectively combines models with differing quality across the input space. The proposed consensus prediction achieves significant gains over classical model averaging, and even outperforms weighted averaging schemes that have access to additional validation data.
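
A numerical sketch of the limiting behaviour (our stand-in, with made-up agent predictions; the paper's mechanism reaches this weighting decentrally, without a coordinator): combine agents by inverse MSE and attach a leave-one-agent-out jackknife error bar.

```python
import numpy as np

def consensus(preds: np.ndarray, mses: np.ndarray) -> float:
    """Inverse-MSE weighted combination of one test point's predictions."""
    w = 1.0 / mses
    return float(np.dot(w, preds) / w.sum())

def jackknife_se(preds: np.ndarray, mses: np.ndarray) -> float:
    """Standard error from leave-one-agent-out consensus predictions."""
    n = len(preds)
    loo = np.array([
        consensus(np.delete(preds, i), np.delete(mses, i)) for i in range(n)
    ])
    return float(np.sqrt((n - 1) / n * ((loo - loo.mean()) ** 2).sum()))

preds = np.array([1.9, 2.1, 3.5])   # the third agent is poorly calibrated...
mses = np.array([0.1, 0.2, 5.0])    # ...so its inverse-MSE weight is tiny
print(consensus(preds, mses), "+/-", jackknife_se(preds, mses))
```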

Rank-one matrix estimation with groupwise heteroskedasticity arxiv:2106.11950 📈 6

Joshua K. Behne, Galen Reeves

**Abstract:** We study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. This problem is motivated by applications in clustering and community detection where latent variables can be partitioned into a fixed number of known groups (e.g., users and items) and the blocks of the matrix correspond to different types of pairwise interactions (e.g., user-user, user-item, or item-item interactions). In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and the latent variables. These formulas describe the weak recovery thresholds for the problem and reveal invariance properties with respect to certain scalings of the noise variance. We also derive an approximate message passing algorithm and a gradient descent algorithm and show empirically that these algorithms achieve the information-theoretic limits in certain regimes.

Robust Regression Revisited: Acceleration and Improved Estimation Rates arxiv:2106.11938 📈 6

Arun Jambulapati, Jerry Li, Tselil Schramm, Kevin Tian

**Abstract:** We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the robust gradient descent framework of Prasad et al., a first-order method using biased gradient queries, or the Sever framework of Diakonikolas et al., an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art. For the general case of smooth GLMs (e.g. logistic regression), we show that the robust gradient descent framework of Prasad et al. can be accelerated, and show our algorithm extends to optimizing the Moreau envelopes of Lipschitz GLMs (e.g. support vector machines), answering several open questions in the literature. For the well-studied case of robust linear regression, we present an alternative approach obtaining improved estimation rates over prior nearly-linear time algorithms. Interestingly, our method starts with an identifiability proof introduced in the context of the sum-of-squares algorithm of Bakshi and Prasad, which achieved optimal error rates while requiring large polynomial runtime and sample complexity. We reinterpret their proof within the Sever framework and obtain a dramatically faster and more sample-efficient algorithm under fewer distributional assumptions.

Local policy search with Bayesian optimization arxiv:2106.11899 📈 6

Sarah Müller, Alexander von Rohr, Sebastian Trimpe

**Abstract:** Reinforcement learning (RL) aims to find an optimal policy by interaction with an environment. Consequently, learning complex behavior requires a vast number of samples, which can be prohibitive in practice. Nevertheless, instead of systematically reasoning and actively choosing informative samples, policy gradients for local search are often obtained from random perturbations. These random samples yield high variance estimates and hence are sub-optimal in terms of sample complexity. Actively selecting informative samples is at the core of Bayesian optimization, which constructs a probabilistic surrogate of the objective from past samples to reason about informative subsequent ones. In this paper, we propose to join both worlds. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient. Based on the model, the algorithm decides where to query a noisy zeroth-order oracle to improve the gradient estimates. The resulting algorithm is a novel type of policy search method, which we compare to existing black-box algorithms. The comparison reveals improved sample complexity and reduced variance in extensive empirical evaluations on synthetic objectives. Further, we highlight the benefits of active sampling on popular RL benchmarks.

Dynamic Customer Embeddings for Financial Service Applications arxiv:2106.11880 📈 6

Nima Chitsazan, Samuel Sharpe, Dwipam Katariya, Qianyu Cheng, Karthik Rajasethupathy

**Abstract:** As financial services (FS) companies have experienced drastic technology driven changes, the availability of new data streams provides the opportunity for more comprehensive customer understanding. We propose Dynamic Customer Embeddings (DCE), a framework that leverages customers' digital activity and a wide range of financial context to learn dense representations of customers in the FS industry. Our method examines customer actions and pageviews within a mobile or web digital session, the sequencing of the sessions themselves, and snapshots of common financial features across our organization at the time of login. We test our customer embeddings using real world data in three prediction problems: 1) the intent of a customer in their next digital session, 2) the probability of a customer calling the call centers after a session, and 3) the probability of a digital session to be fraudulent. DCE showed performance lift in all three downstream problems.

Lifted Model Checking for Relational MDPs arxiv:2106.11735 📈 6

Wen-Chi Yang, Jean-François Raskin, Luc De Raedt

**Abstract:** Model checking has been developed for verifying systems with stochastic and non-deterministic behavior. It is used to provide guarantees about such systems. While most model checking methods focus on propositional models, various probabilistic planning and reinforcement learning frameworks deal with relational domains, for instance, STRIPS planning and relational Markov Decision Processes. Using propositional model checking in relational settings requires one to ground the model, which leads to the well known state explosion problem and intractability. We present pCTL-REBEL, a lifted model checking approach for verifying pCTL properties on relational MDPs. It extends REBEL, the relational Bellman update operator, which is a lifted value iteration approach for model-based relational reinforcement learning, toward relational model checking. pCTL-REBEL is lifted, which means that rather than grounding, the model exploits symmetries and reasons at an abstract relational level. Theoretically, we show that the pCTL model checking approach is decidable for relational MDPs, even for possibly infinite domains, provided that the states have a bounded size. Practically, we contribute algorithms and an implementation of lifted relational model checking, and we show that the lifted approach improves the scalability of the model checking approach.

A Deep Latent Space Model for Graph Representation Learning arxiv:2106.11721 📈 6

Hanxuan Yang, Qingchao Kong, Wenji Mao

**Abstract:** Graph representation learning is a fundamental problem for modeling relational data and benefits a number of downstream applications. Traditional Bayesian graph models and recent deep-learning-based GNNs either suffer from impracticability or lack interpretability; thus, combined models for undirected graphs have been proposed to overcome these weaknesses. As a large portion of real-world graphs are directed graphs (of which undirected graphs are special cases), in this paper we propose a Deep Latent Space Model (DLSM) for directed graphs to incorporate the traditional latent-variable-based generative model into deep learning frameworks. Our proposed model consists of a graph convolutional network (GCN) encoder and a stochastic decoder, which are layer-wise connected by a hierarchical variational auto-encoder architecture. By specifically modeling the degree heterogeneity using node random factors, our model possesses better interpretability in both community structure and degree heterogeneity. For fast inference, stochastic gradient variational Bayes (SGVB) is adopted using a non-iterative recognition model, which is much more scalable than traditional MCMC-based methods. Experiments on real-world datasets show that the proposed model achieves state-of-the-art performance on both link prediction and community detection tasks while learning interpretable node embeddings. The source code is available at https://github.com/upperr/DLSM.

Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation arxiv:2106.11612 📈 6

Jiafan He, Dongruo Zhou, Quanquan Gu

**Abstract:** We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.

Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering arxiv:2106.11575 📈 6

Gangwoo Kim, Hyunjae Kim, Jungsoo Park, Jaewoo Kang

**Abstract:** One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models' performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.

SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning arxiv:2106.11562 📈 6

Sungmin Cha, Beomyoung Kim, Youngjoon Yoo, Taesup Moon

**Abstract:** This paper introduces a solid state-of-the-art baseline for the class-incremental semantic segmentation (CISS) problem. While recent CISS algorithms utilize variants of the knowledge distillation (KD) technique to tackle the problem, they fail to fully address the critical challenges in CISS that cause catastrophic forgetting: the semantic drift of the background class and the multi-label prediction issue. To better address these challenges, we propose a new method, dubbed SSUL-M (Semantic Segmentation with Unknown Label with Memory), by carefully combining techniques tailored for semantic segmentation. Specifically, we claim three main contributions: (1) defining unknown classes within the background class to help learn future classes (helping plasticity), (2) freezing the backbone network and past classifiers with binary cross-entropy loss and pseudo-labeling to overcome catastrophic forgetting (helping stability), and (3) utilizing tiny exemplar memory for the first time in CISS to improve both plasticity and stability. The extensive experiments show the effectiveness of our method, achieving significantly better performance than the recent state-of-the-art baselines on the standard benchmark datasets. Furthermore, we justify our contributions with thorough ablation analyses and discuss the different nature of the CISS problem compared to traditional class-incremental learning targeting classification. The official code is available at https://github.com/clovaai/SSUL.

IQ-Learn: Inverse soft-Q Learning for Imitation arxiv:2106.12142 📈 5

Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, Stefano Ermon

**Abstract:** In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence, but it doesn't utilize any information involving the environment's dynamics. Many existing methods that exploit dynamics information are difficult to train in practice due to an adversarial optimization process over reward and policy approximators, or biased, high-variance gradient estimators. We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating that our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn), obtains state-of-the-art results in offline and online imitation learning settings, significantly outperforming existing methods both in the number of required environment interactions and in scalability to high-dimensional spaces, often by more than 3x.

Exploiting Negative Learning for Implicit Pseudo Label Rectification in Source-Free Domain Adaptive Semantic Segmentation arxiv:2106.12123 📈 5

Xin Luo, Wei Chen, Yusong Tan, Chen Li, Yulin He, Xiaogang Jia

**Abstract:** It is desirable to transfer the knowledge stored in a well-trained source model onto a non-annotated target domain in the absence of source data. However, state-of-the-art methods for source-free domain adaptation (SFDA) are subject to strict limits: 1) access to internal specifications of source models is a must; and 2) pseudo labels should be clean during self-training, making critical tasks relying on semantic segmentation unreliable. Aiming at these pitfalls, this study develops a domain adaptive solution to semantic segmentation with pseudo label rectification (namely \textit{PR-SFDA}), which operates in two phases: 1) \textit{Confidence-regularized unsupervised learning}: a maximum squares loss regularizes the target model to ensure confidence in predictions; and 2) \textit{Noise-aware pseudo label learning}: negative learning enables tolerance to noisy pseudo labels in training, while positive learning achieves fast convergence. Extensive experiments have been performed on the domain adaptive semantic segmentation benchmark \textit{GTA5 $\to$ Cityscapes}. Overall, \textit{PR-SFDA} achieves a performance of 49.0 mIoU, which is very close to that of the state-of-the-art counterparts. Note that the latter demand access to the source model's internal specifications, whereas, in sharp contrast, the \textit{PR-SFDA} solution needs none.

PALMAR: Towards Adaptive Multi-inhabitant Activity Recognition in Point-Cloud Technology arxiv:2106.11902 📈 5

Mohammad Arif Ul Alam, Md Mahmudur Rahman, Jared Q Widberg

**Abstract:** With the advancement of deep neural networks and computer vision-based Human Activity Recognition, employment of Point-Cloud Data technologies (LiDAR, mmWave) has attracted a lot of interest due to its privacy-preserving nature. Given the high promise of accurate PCD technologies, we develop PALMAR, a multi-inhabitant activity recognition system that employs efficient signal processing and novel machine learning techniques to track individual persons, towards an adaptive multi-inhabitant tracking and HAR system. More specifically, we propose (i) a voxelized feature representation-based real-time PCD fine-tuning method, (ii) efficient clustering (DBSCAN and BIRCH), Adaptive Order Hidden Markov Model based multi-person tracking and crossover ambiguity reduction techniques, and (iii) a novel adaptive deep learning-based domain adaptation technique to improve the accuracy of HAR in the presence of data scarcity and diversity (device, location and population diversity). We experimentally evaluate our framework and systems using (i) real-time PCD collected by three devices (3D LiDAR and 79 GHz mmWave) from 6 participants, (ii) one publicly available 3D LiDAR activity dataset (28 participants) and (iii) an embedded hardware prototype system, which provided promising HAR performance in the multi-inhabitant scenario (96%), with a 63% improvement in multi-person tracking over the state-of-the-art framework, without significant loss of system performance on the edge computing device.
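
One building block is easy to demonstrate: separating a point-cloud frame into per-person clusters. The sketch below fakes a frame and runs DBSCAN on it (eps and min_samples are illustrative, not the tuned values from the paper):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic mmWave/LiDAR frame: two people plus background noise, (x, y, z) in metres
person_a = rng.normal([1.0, 2.0, 1.0], 0.15, size=(60, 3))
person_b = rng.normal([3.5, 1.0, 1.0], 0.15, size=(60, 3))
noise = rng.uniform([0.0, 0.0, 0.0], [5.0, 5.0, 2.0], size=(20, 3))
frame = np.vstack([person_a, person_b, noise])

labels = DBSCAN(eps=0.4, min_samples=10).fit_predict(frame)
for lbl in sorted(set(labels)):
    tag = "noise" if lbl == -1 else f"person cluster {lbl}"
    print(tag, (labels == lbl).sum(), "points")
```

Per-frame clusters like these would then be associated over time (the paper uses an Adaptive Order Hidden Markov Model) to resolve crossover ambiguity.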

Failing with Grace: Learning Neural Network Controllers that are Boundedly Unsafe arxiv:2106.11881 📈 5

Panagiotis Vlantis, Michael M. Zavlanos

**Abstract:** In this work, we consider the problem of learning a feed-forward neural network (NN) controller to safely steer an arbitrarily shaped planar robot in a compact and obstacle-occluded workspace. Unlike existing methods that depend strongly on the density of data points close to the boundary of the safe state space to train NN controllers with closed-loop safety guarantees, we propose an approach that lifts such assumptions on the data that are hard to satisfy in practice and instead allows for graceful safety violations, i.e., of a bounded magnitude that can be spatially controlled. To do so, we employ reachability analysis methods to encapsulate safety constraints in the training process. Specifically, to obtain a computationally efficient over-approximation of the forward reachable set of the closed-loop system, we partition the robot's state space into cells and adaptively subdivide the cells that contain states which may escape the safe set under the trained control law. To do so, we first design appropriate under- and over-approximations of the robot's footprint to adaptively subdivide the configuration space into cells. Then, using the overlap between each cell's forward reachable set and the set of infeasible robot configurations as a measure for safety violations, we introduce penalty terms into the loss function that penalize this overlap in the training process. As a result, our method can learn a safe vector field for the closed-loop system and, at the same time, provide numerical worst-case bounds on safety violation over the whole configuration space, defined by the overlap between the over-approximation of the forward reachable set of the closed-loop system and the set of unsafe states. Moreover, it can control the tradeoff between computational complexity and tightness of these bounds. Finally, we provide a simulation study that verifies the efficacy of the proposed scheme.

A Unified Framework for Conservative Exploration arxiv:2106.11692 📈 5

Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du

**Abstract:** We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains including digital marketing, healthcare, production, finance, etc. For multi-armed bandits, linear bandits and tabular RL, specialized algorithms and theoretical analyses were proposed in previous work. In this paper, we present a unified framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, our framework gives a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. We strengthen the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDPs. For upper bounds, our framework turns a certain nonconservative upper-confidence-bound (UCB) algorithm into a conservative algorithm with a simple analysis. For multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDPs.

Probabilistic Attention for Interactive Segmentation arxiv:2106.15338 📈 4

Prasad Gabbur, Manjot Bilkhu, Javier Movellan

**Abstract:** We provide a probabilistic interpretation of attention and show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference. The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters. This approach is useful for cases in which external agents, e.g., annotators, provide inference-time information about the correct values of some tokens, e.g., the semantic category of some pixels, and this new information needs to propagate to other tokens in a principled manner. We illustrate the approach on an interactive semantic segmentation task in which annotators and models collaborate online to improve annotation efficiency. Using standard benchmarks, we observe that key adaptation boosts model performance ($\sim10\%$ mIoU) in the low feedback regime and value propagation improves model responsiveness in the high feedback regime. A PyTorch layer implementation of our probabilistic attention model will be made publicly available here: https://github.com/apple/ml-probabilistic-attention.
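
The EM reading can be sketched compactly: attention weights play the role of E-step responsibilities, and a key-adaptation M-step moves each key toward the queries that attend to it. The NumPy sketch below illustrates that interpretation under simplifying assumptions (single head, no value adaptation, illustrative names); it is not the paper's PyTorch layer.

```python
import numpy as np

def em_attention_step(Q, K, V, lr=1.0):
    """One EM-style adaptation step for dot-product attention.
    E-step: responsibilities = softmax(Q K^T / sqrt(d)).
    M-step: move each key toward the queries that attend to it."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    R = np.exp(logits)
    R /= R.sum(axis=-1, keepdims=True)             # (n_queries, n_keys)
    out = R @ V                                    # standard attention output
    # M-step: responsibility-weighted mean of the queries assigned to each key.
    weights = R.sum(axis=0, keepdims=True).T       # (n_keys, 1)
    K_new = (R.T @ Q) / np.maximum(weights, 1e-8)
    K = (1 - lr) * K + lr * K_new
    return out, K

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out, K = em_attention_step(Q, K, V, lr=0.5)
print(out.shape, K.shape)
```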

Pure Exploration in Kernel and Neural Bandits arxiv:2106.12034 📈 4

Yinglun Zhu, Dongruo Zhou, Ruoxi Jiang, Quanquan Gu, Rebecca Willett, Robert Nowak

**Abstract:** We study pure exploration in bandits, where the dimension of the feature representation can be much larger than the number of arms. To overcome the curse of dimensionality, we propose to adaptively embed the feature representation of each arm into a lower-dimensional space and carefully deal with the induced model misspecifications. Our approach is conceptually very different from existing works that can either only handle low-dimensional linear bandits or passively deal with model misspecifications. We showcase the application of our approach to two pure exploration settings that were previously under-studied: (1) the reward function belongs to a possibly infinite-dimensional Reproducing Kernel Hilbert Space, and (2) the reward function is nonlinear and can be approximated by neural networks. Our main results provide sample complexity guarantees that only depend on the effective dimension of the feature spaces in the kernel or neural representations. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the efficacy of our methods.

Tracking Instances as Queries arxiv:2106.11963 📈 4

Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Ying Shan, Bin Feng, Wenyu Liu

**Abstract:** Recently, query-based deep networks have attracted considerable attention owing to their end-to-end pipelines and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation. However, how to establish a query-based video instance segmentation (VIS) framework with an elegant architecture and strong performance remains an open problem. In this paper, we present \textbf{QueryTrack} (i.e., tracking instances as queries), a unified query-based VIS framework fully leveraging the intrinsic one-to-one correspondence between instances and queries in QueryInst. The proposed method obtains 52.7 / 52.3 AP on the YouTube-VIS-2019 / 2021 datasets, winning second place in the YouTube-VIS Challenge at CVPR 2021 \textbf{with a single online end-to-end model, single-scale testing \& a modest amount of training data}. We also provide QueryTrack-ResNet-50 baseline results on the YouTube-VIS-2021 val set as a reference for the VIS community.

Asynchronous Stochastic Optimization Robust to Arbitrary Delays arxiv:2106.11879 📈 4

Alon Cohen, Amit Daniely, Yoel Drori, Tomer Koren, Mariano Schain

**Abstract:** We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient descent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, which can be significantly larger than the average delay, especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
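
The update rule itself is simple to state in code: at step $t$, apply a stochastic gradient evaluated at the stale iterate from step $t - d_t$. A toy sketch, with an assumed quadratic objective and synthetic delays:

```python
import numpy as np

def delayed_sgd(grad_fn, x0, delays, lr=0.01):
    """Sketch of SGD with arbitrary gradient delays: at step t the update
    uses the stochastic gradient evaluated at the iterate from step t - d_t.
    `grad_fn(x)` returns a stochastic gradient; `delays[t]` is d_t."""
    history = [np.array(x0, dtype=float)]
    x = history[0].copy()
    for t, d in enumerate(delays):
        stale = history[max(0, t - d)]   # iterate the worker saw
        x = x - lr * grad_fn(stale)      # apply the stale gradient now
        history.append(x.copy())
    return x

# Toy quadratic f(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
grad = lambda x: x + 0.1 * rng.normal(size=x.shape)
delays = rng.integers(0, 20, size=5000)  # heterogeneous, possibly large delays
print(delayed_sgd(grad, np.ones(3) * 5.0, delays))
```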

Multiple Organ Failure Prediction with Classifier-Guided Generative Adversarial Imputation Networks arxiv:2106.11878 📈 4

Xinlu Zhang, Yun Zhao, Rachael Callcut, Linda Petzold

**Abstract:** Multiple organ failure (MOF) is a severe syndrome with a high mortality rate among Intensive Care Unit (ICU) patients. Early and precise detection is critical for clinicians to make timely decisions. An essential challenge in applying machine learning models to electronic health records (EHRs) is the pervasiveness of missing values. Most existing imputation methods are involved in the data preprocessing phase, failing to capture the relationship between data and outcome for downstream predictions. In this paper, we propose classifier-guided generative adversarial imputation networks (Classifier-GAIN) for MOF prediction to bridge this gap, by incorporating both observed data and label information. Specifically, the classifier takes imputed values from the generator (imputer) to predict task outcomes and provides additional supervision signals to the generator by joint training. The classifier-guided generator imputes missing values with label-awareness during training, improving the classifier's performance during inference. We conduct extensive experiments showing that our approach consistently outperforms classical and state-of-the-art neural baselines across a range of missing data scenarios and evaluation metrics.

Machine learning for risk assessment in gender-based crime arxiv:2106.11847 📈 4

Ángel González-Prieto, Antonio Brú, Juan Carlos Nuño, José Luis González-Álvarez

**Abstract:** Gender-based crime is one of the most concerning scourges of contemporary society. Governments worldwide have invested substantial economic and human resources to radically eliminate this threat. Despite these efforts, providing accurate predictions of the risk that a victim of gender violence has of being attacked again is still a very hard open problem. The development of new methods for issuing accurate, fair and quick predictions would allow police forces to select the most appropriate measures to prevent recidivism. In this work, we propose to apply Machine Learning (ML) techniques to create models that accurately predict the recidivism risk of a gender-violence offender. The relevance of the contribution of this work is threefold: (i) the proposed ML method outperforms the preexisting risk assessment algorithm based on classical statistical techniques, (ii) the study has been conducted through an official specific-purpose database with more than 40,000 reports of gender violence, and (iii) two new quality measures are proposed for assessing the effective police protection that a model supplies and the overload in the invested resources that it generates. Additionally, we propose a hybrid model that combines the statistical prediction methods with the ML method, permitting authorities to implement a smooth transition from the preexisting model to the ML-based model. This hybrid nature enables a decision-making process to optimally balance between the efficiency of the police system and the aggressiveness of the protection measures taken.

Data Augmentation for Opcode Sequence Based Malware Detection arxiv:2106.11821 📈 4

Niall McLaughlin, Jesus Martinez del Rincon

**Abstract:** Data augmentation has been successfully used in many areas of deep learning to significantly improve model performance. Typically, data augmentation simulates realistic variations in data in order to increase the apparent diversity of the training set. However, for opcode-based malware analysis, where deep learning methods are already achieving state-of-the-art performance, it is not immediately clear how to apply data augmentation. In this paper we study different methods of data augmentation, starting with basic methods using fixed transformations and moving to methods that adapt to the data. We propose a novel data augmentation method based on using an opcode embedding layer within the network and its corresponding opcode embedding matrix to perform adaptive data augmentation during training. To the best of our knowledge this is the first paper to carry out a systematic study of different augmentation methods applied to opcode-sequence-based malware classification.
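
One way the adaptive idea could look in code, as a hedged sketch: with some probability, replace an opcode by a near neighbour in the embedding space. The embedding matrix, replacement probability, and neighbour count below are placeholders, not the paper's exact procedure.

```python
import numpy as np

def augment_opcodes(seq, emb, p=0.1, k=3, rng=None):
    """Adaptive augmentation sketch: with probability p, replace an opcode
    by one of its k nearest neighbours in the embedding space.
    `emb` is a (vocab_size, dim) opcode embedding matrix (random here; in
    the paper's setting it would come from the network's embedding layer)."""
    rng = rng or np.random.default_rng()
    # Cosine similarity between all opcode embeddings.
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = norm @ norm.T
    out = []
    for op in seq:
        if rng.random() < p:
            nn = np.argsort(-sim[op])[1:k + 1]  # neighbours, excluding self
            out.append(int(rng.choice(nn)))
        else:
            out.append(op)
    return out

rng = np.random.default_rng(0)
emb = rng.normal(size=(256, 32))  # stand-in for a learned embedding
print(augment_opcodes([10, 42, 42, 7, 99], emb, p=0.5, rng=rng))
```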

Distributional Gradient Matching for Learning Uncertain Neural Dynamics Models arxiv:2106.11609 📈 4

Lenart Treven, Philippe Wenk, Florian Dörfler, Andreas Krause

**Abstract:** Differential equations in general and neural ODEs in particular are an essential technique in continuous-time system identification. While many deterministic learning algorithms have been designed based on numerical integration via the adjoint method, many downstream tasks such as active learning, exploration in reinforcement learning, robust control, or filtering require accurate estimates of predictive uncertainties. In this work, we propose a novel approach towards estimating epistemically uncertain neural ODEs, avoiding the numerical integration bottleneck. Instead of modeling uncertainty in the ODE parameters, we directly model uncertainties in the state space. Our algorithm - distributional gradient matching (DGM) - jointly trains a smoother and a dynamics model and matches their gradients via minimizing a Wasserstein loss. Our experiments show that, compared to traditional approximate inference methods based on numerical integration, our approach is faster to train, faster at predicting previously unseen trajectories, and in the context of neural ODEs, significantly more accurate.

Statistical Analysis of Perspective Scores on Hate Speech Detection arxiv:2107.02024 📈 3

Hadi Mansourifar, Dana Alsagheer, Weidong Shi, Lan Ni, Yan Huang

**Abstract:** Hate speech detection has become a hot topic in recent years due to the exponential growth of offensive language in social media. It has been shown that state-of-the-art hate speech classifiers are effective only when tested on data with the same feature distribution as the training data. As a consequence, model architecture plays only a secondary role in improving current results. With such diverse data distributions, relying on low-level features is the main cause of deficiency due to natural bias in the data, which is why high-level features are needed to avoid biased judgements. In this paper, we statistically analyze the Perspective Scores and their impact on hate speech detection. We show that different hate speech datasets are very similar when it comes to extracting their Perspective Scores. Finally, we show that over-sampling the Perspective Scores of a hate speech dataset can significantly improve the generalization performance when tested on other hate speech datasets.

Joint Learning of Portrait Intrinsic Decomposition and Relighting arxiv:2106.15305 📈 3

Mona Zehni, Shaona Ghosh, Krishna Sridhar, Sethu Raman

**Abstract:** Inverse rendering is the problem of decomposing an image into its intrinsic components, i.e. albedo, normal and lighting. To solve this ill-posed problem from a single image, state-of-the-art methods in shape from shading mostly resort to supervised training on all the components on either synthetic or real datasets. Here, we propose a new self-supervised training paradigm that 1) reduces the need for full supervision on the decomposition task and 2) takes into account the relighting task. We introduce new self-supervised loss terms that leverage the consistencies between multi-lit images (images of the same scene under different illuminations). Our approach is applicable to multi-lit datasets. We apply our training approach in two settings: 1) train on a mixture of synthetic and real data, 2) train on real datasets with limited supervision. We showcase the effectiveness of our training paradigm on both intrinsic decomposition and relighting and demonstrate how the model struggles in both tasks without the self-supervised loss terms in limited supervision settings. We provide results of comprehensive experiments on the SfSNet, CelebA and Photoface datasets and verify the performance of our approach on images in the wild.

Stock Market Analysis with Text Data: A Review arxiv:2106.12985 📈 3

Kamaladdin Fataliyev, Aneesh Chivukula, Mukesh Prasad, Wei Liu

**Abstract:** Stock market movements are influenced by public and private information shared through news articles, company reports, and social media discussions. Analyzing these vast sources of data can give market participants an edge in making profit. However, the majority of studies in the literature are based on traditional approaches that fall short in analyzing unstructured, vast textual data. In this study, we review the extensive existing literature on text-based stock market analysis. We present the input data types and cover the main textual data sources and variations, followed by feature representation techniques. We then cover the analysis techniques and create a taxonomy of the main stock market forecast models. Importantly, we discuss representative work in each category of the taxonomy, analyzing their respective contributions. Finally, this paper presents findings on unaddressed open problems and gives suggestions for future work. The aim of this study is to survey the main stock market analysis models, text representation techniques for financial market prediction, and shortcomings of existing techniques, and to propose promising directions for future research.

PatentNet: A Large-Scale Incomplete Multiview, Multimodal, Multilabel Industrial Goods Image Database arxiv:2106.12139 📈 3

Fangyuan Lei, Da Huang, Jianjian Jiang, Ruijun Ma, Senhong Wang, Jiangzhong Cao, Yusen Lin, Qingyun Dai

**Abstract:** In the deep learning area, large-scale image datasets have brought a breakthrough in the success of object recognition and retrieval. Nowadays, as the embodiment of innovation, industrial goods are significantly more diverse, and their incomplete multiview, multimodal and multilabel nature differs from traditional datasets. In this paper, we introduce an industrial goods dataset, namely PatentNet, with numerous highly diverse, accurate and detailed annotations of industrial goods images and corresponding texts. In PatentNet, the images and texts are sourced from design patents. With over 6M images and corresponding texts of industrial goods, manually labeled and checked by professionals, PatentNet is the first ongoing industrial goods image database whose variety is wider than that of industrial goods datasets previously used for benchmarking. PatentNet organizes millions of images into 32 classes and 219 subclasses based on the Locarno Classification Agreement. Through extensive experiments on image classification, image retrieval and incomplete multiview clustering, we demonstrate that PatentNet is much more diverse, complex and challenging, offering higher potential than existing industrial image datasets. Furthermore, the characteristics of incomplete multiview, multimodal and multilabel data in PatentNet offer unparalleled opportunities to the artificial intelligence community and beyond.

Bounds on Causal Effects and Application to High Dimensional Data arxiv:2106.12121 📈 3

Ang Li, Judea Pearl

**Abstract:** This paper addresses the problem of estimating causal effects when adjustment variables in the back-door or front-door criterion are partially observed. For such scenarios, we derive bounds on the causal effects by solving two non-linear optimization problems, and demonstrate that the bounds are sufficiently tight. Using this optimization method, we propose a framework for dimensionality reduction that allows one to trade bias for estimation power, and demonstrate its performance using simulation studies.

Near-Optimal Linear Regression under Distribution Shift arxiv:2106.12108 📈 3

Qi Lei, Wei Hu, Jason D. Lee

**Abstract:** Transfer learning is essential when sufficient data comes from the source domain, with scarce labeled data from the target domain. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different transfer learning settings including covariate shift and model shift. We also consider when data are generated from either linear or general nonlinear models. We show that linear minimax estimators are within an absolute constant of the minimax risk even among nonlinear estimators for various source/target distributions.

Towards Consistent Predictive Confidence through Fitted Ensembles arxiv:2106.12070 📈 3

Navid Kardan, Ankit Sharma, Kenneth O. Stanley

**Abstract:** Deep neural networks are behind many of the recent successes in machine learning applications. However, these models can produce overconfident decisions while encountering out-of-distribution (OOD) examples or making a wrong prediction. This inconsistent predictive confidence limits the integration of independently-trained learning models into a larger system. This paper introduces the separable concept learning framework to realistically measure the performance of classifiers in the presence of OOD examples. In this setup, several instances of a classifier are trained on different parts of a partition of the set of classes. Later, the performance of the combination of these models is evaluated on a separate test set. Unlike current OOD detection techniques, this framework does not require auxiliary OOD datasets and does not separate classification from detection performance. Furthermore, we present a new strong baseline for more consistent predictive confidence in deep models, called fitted ensembles, where overconfident predictions are rectified by transformed versions of the original classification task. Fitted ensembles can naturally detect OOD examples without requiring auxiliary data by observing contradicting predictions among their components. Experiments on MNIST, SVHN, CIFAR-10/100, and ImageNet show that fitted ensembles significantly outperform conventional ensembles on OOD examples and can scale.
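
The detection mechanism, observing contradicting predictions among components, admits a very small sketch. The pairwise-disagreement score below is an illustrative stand-in, not necessarily the exact statistic used in the paper:

```python
import numpy as np

def ood_score(probs):
    """Disagreement-based OOD score sketch: `probs` has shape
    (n_members, n_samples, n_classes). Samples on which ensemble members
    contradict each other (different argmax predictions) get high scores."""
    preds = probs.argmax(axis=-1)                 # (members, samples)
    n = preds.shape[0]
    # Fraction of member pairs that disagree on each sample.
    disagree = sum((preds[i] != preds[j]).astype(float)
                   for i in range(n) for j in range(i + 1, n))
    return disagree / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=(5, 4))   # 5 members, 4 samples
print(ood_score(probs))
```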

Exploring the Representational Power of Graph Autoencoder arxiv:2106.12005 📈 3

Maroun Haddad, Mohamed Bouguessa

**Abstract:** While representation learning has yielded great success on many graph learning tasks, there is little understanding of the structures that are being captured by these embeddings. For example, we wonder if the topological features, such as the Triangle Count, the Degree of the node and other centrality measures, are concretely encoded in the embeddings. Furthermore, we ask if the presence of these structures in the embeddings is necessary for better performance on downstream tasks, such as clustering and classification. To address these questions, we conduct an extensive empirical study over three classes of unsupervised graph embedding models and seven different variants of Graph Autoencoders. Our results show that five topological features: the Degree, the Local Clustering Score, the Betweenness Centrality, the Eigenvector Centrality, and Triangle Count are concretely preserved in the first layer of the graph autoencoder that employs the SUM aggregation rule, under the condition that the model preserves the second-order proximity. We provide further evidence for the presence of these features by revealing a hierarchy in the distribution of the topological features in the embeddings of the aforementioned model. We also show that a model with such properties can outperform other models on certain downstream tasks, especially when the preserved features are relevant to the task at hand. Finally, we evaluate the suitability of our findings through a test case study related to social influence prediction.

Speeding Up OPFython with Numba arxiv:2106.11828 📈 3

Gustavo H. de Rosa, João Paulo Papa

**Abstract:** A graph-inspired classifier, known as Optimum-Path Forest (OPF), has proven to be a state-of-the-art algorithm comparable to Logistic Regressors and Support Vector Machines in a wide variety of tasks. Recently, its Python-based version, denoted OPFython, has been proposed to provide a friendlier framework and a faster prototyping environment. Nevertheless, Python-based algorithms are slower than their C-based counterparts, impacting their performance when confronted with large amounts of data. Therefore, this paper proposes a simple yet highly efficient speed-up using the Numba package, which accelerates Numpy-based calculations and attempts to increase the algorithm's overall performance. Experimental results show that the proposed approach achieved better results than the naïve Python-based OPF and sped up its distance measurement calculation.
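
The kind of speed-up involved is easy to demonstrate. Below is a generic Numba example for a pairwise-distance hot loop, which stands in for, but is not, OPFython's actual code:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def pairwise_sq_dist(X):
    """JIT-compiled pairwise squared Euclidean distances, the kind of
    hot loop that dominates OPF's arc-weight computation."""
    n, d = X.shape
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            D[i, j] = s
    return D

X = np.random.rand(500, 16)
D = pairwise_sq_dist(X)  # first call compiles; later calls run at C-like speed
```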

A Clustering-based Framework for Classifying Data Streams arxiv:2106.11823 📈 3

Xuyang Yan, Abdollah Homaifar, Mrinmoy Sarkar, Abenezer Girma, Edward Tunstel

**Abstract:** The non-stationary nature of data streams strongly challenges traditional machine learning techniques. Although some solutions have been proposed to extend traditional machine learning techniques for handling data streams, these approaches either require an initial label set or rely on specialized design parameters. The overlap among classes and the labeling of data streams constitute other major challenges for classifying data streams. In this paper, we propose a clustering-based data stream classification framework to handle non-stationary data streams without utilizing an initial label set. A density-based stream clustering procedure is used to capture novel concepts with a dynamic threshold, and an effective active label querying strategy is introduced to continuously learn the new concepts from the data streams. The sub-cluster structure of each cluster is explored to handle the overlap among classes. Experimental results and quantitative comparison studies reveal that the proposed method provides statistically better or comparable performance than the existing methods.

Exemplars-guided Empathetic Response Generation Controlled by the Elements of Human Communication arxiv:2106.11791 📈 3

Navonil Majumder, Deepanway Ghosal, Devamanyu Hazarika, Alexander Gelbukh, Rada Mihalcea, Soujanya Poria

**Abstract:** The majority of existing methods for empathetic response generation rely on the emotion of the context to generate empathetic responses. However, empathy is much more than generating responses with an appropriate emotion. It also often entails subtle expressions of understanding and personal resonance with the situation of the other interlocutor. Unfortunately, such qualities are difficult to quantify and the datasets lack the relevant annotations. To address this issue, in this paper we propose an approach that relies on exemplars to cue the generative model on fine stylistic properties that signal empathy to the interlocutor. To this end, we employ dense passage retrieval to extract relevant exemplary responses from the training set. Three elements of human communication -- emotional presence, interpretation, and exploration -- along with sentiment are additionally introduced using synthetic labels to guide the generation towards empathy. The human evaluation is also extended by these elements of human communication. We empirically show that these approaches yield significant improvements in empathetic response quality in terms of both automated and human-evaluated metrics. The implementation is available at https://github.com/declare-lab/exemplary-empathy.

Self-Supervised Iterative Contextual Smoothing for Efficient Adversarial Defense against Gray- and Black-Box Attack arxiv:2106.11644 📈 3

Sungmin Cha, Naeun Ko, Youngjoon Yoo, Taesup Moon

**Abstract:** We propose a novel and effective input-transformation-based adversarial defense method against gray- and black-box attacks, which is computationally efficient and does not require any adversarial training or retraining of a classification model. We first show that a very simple iterative Gaussian smoothing can effectively wash out adversarial noise and achieve substantially high robust accuracy. Based on this observation, we propose Self-Supervised Iterative Contextual Smoothing (SSICS), which aims to reconstruct the original discriminative features from the Gaussian-smoothed image in a context-adaptive manner, while still smoothing out the adversarial noise. From experiments on ImageNet, we show that our SSICS achieves both high standard accuracy and very competitive robust accuracy for gray- and black-box attacks, e.g., transfer-based PGD attack and score-based attack. Notably, our defense is free of computationally expensive adversarial training, yet approaches its robust accuracy via input transformation.
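
The baseline observation is reproducible in a few lines. A sketch of the non-adaptive iterative Gaussian smoothing (SSICS itself replaces this with a learned, self-supervised, context-adaptive reconstruction):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def iterative_smooth(img, sigma=1.0, n_iter=5):
    """Repeatedly apply a Gaussian blur to wash out adversarial noise;
    the simple (non-adaptive) baseline observed in the abstract above."""
    out = img.astype(float)
    for _ in range(n_iter):
        # Blur height and width; leave the channel axis untouched.
        out = gaussian_filter(out, sigma=(sigma, sigma, 0))
    return out

img = np.random.rand(224, 224, 3)  # stand-in for an (attacked) input image
print(iterative_smooth(img).shape)
```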

Zero-Shot Chinese Character Recognition with Stroke-Level Decomposition arxiv:2106.11613 📈 3

Jingye Chen, Bin Li, Xiangyang Xue

**Abstract:** Chinese character recognition has attracted much research interest due to its wide applications. Although it has been studied for many years, some issues in this field have not been completely resolved yet, e.g. the zero-shot problem. Previous character-based and radical-based methods have not fundamentally addressed the zero-shot problem since some characters or radicals in test sets may not appear in training sets under a data-hungry condition. Inspired by the fact that humans can generalize to know how to write characters unseen before if they have learned stroke orders of some characters, we propose a stroke-based method by decomposing each character into a sequence of strokes, which are the most basic units of Chinese characters. However, we observe that there is a one-to-many relationship between stroke sequences and Chinese characters. To tackle this challenge, we employ a matching-based strategy to transform the predicted stroke sequence to a specific character. We evaluate the proposed method on handwritten characters, printed artistic characters, and scene characters. The experimental results validate that the proposed method outperforms existing methods on both character zero-shot and radical zero-shot tasks. Moreover, the proposed method can be easily generalized to other languages whose characters can be decomposed into strokes.
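
The matching-based strategy can be sketched as nearest-neighbour decoding over stroke sequences. The stroke inventories below are hypothetical toy values; real tables enumerate each character's canonical stroke order over the basic stroke set:

```python
import difflib

def match_character(pred_strokes, stroke_table):
    """Matching-based decoding sketch: map a predicted stroke sequence to
    the character whose canonical stroke sequence is most similar."""
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()
    return max(stroke_table, key=lambda ch: similarity(pred_strokes, stroke_table[ch]))

# Hypothetical stroke-id inventories for three characters.
table = {"十": [1, 2], "土": [1, 2, 1], "王": [1, 2, 1, 1]}
print(match_character([1, 2, 1], table))  # -> "土"
```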

Multi-layered Semantic Representation Network for Multi-label Image Classification arxiv:2106.11596 📈 3

Xiwen Qu, Hao Che, Jun Huang, Linchuan Xu, Xiao Zheng

**Abstract:** Multi-label image classification (MLIC) is a fundamental and practical task, which aims to assign multiple possible labels to an image. In recent years, many deep convolutional neural network (CNN) based approaches have been proposed which model label correlations to discover semantics of labels and learn semantic representations of images. This paper advances this research direction by improving both the modeling of label correlations and the learning of semantic representations. On the one hand, besides the local semantics of each label, we propose to further explore global semantics shared by multiple labels. On the other hand, existing approaches mainly learn the semantic representations at the last convolutional layer of a CNN. But it has been noted that the image representations of different layers of CNN capture different levels or scales of features and have different discriminative abilities. We thus propose to learn semantic representations at multiple convolutional layers. To this end, this paper designs a Multi-layered Semantic Representation Network (MSRN) which discovers both local and global semantics of labels through modeling label correlations and utilizes the label semantics to guide the semantic representations learning at multiple layers through an attention mechanism. Extensive experiments on four benchmark datasets including VOC 2007, COCO, NUS-WIDE, and Apparel show a competitive performance of the proposed MSRN against state-of-the-art models.

Continuous-Depth Neural Models for Dynamic Graph Prediction arxiv:2106.11581 📈 3

Michael Poli, Stefano Massaroli, Clayton M. Rabideau, Junyoung Park, Atsushi Yamashita, Hajime Asama, Jinkyoo Park

**Abstract:** We introduce the framework of continuous-depth graph neural networks (GNNs). Neural graph differential equations (Neural GDEs) are formalized as the counterpart to GNNs where the input-output relationship is determined by a continuum of GNN layers, blending discrete topological structures and differential equations. The proposed framework is shown to be compatible with static GNN models and is extended to dynamic and stochastic settings through hybrid dynamical system theory. Here, Neural GDEs improve performance by exploiting the underlying dynamics geometry, further introducing the ability to accommodate irregularly sampled data. Results prove the effectiveness of the proposed models across applications, such as traffic forecasting or prediction in genetic regulatory networks.

Learning-Based Practical Light Field Image Compression Using A Disparity-Aware Model arxiv:2106.11558 📈 3

Mohana Singh, Renu M. Rameshan

**Abstract:** Light field technology has increasingly attracted the attention of the research community with its many possible applications. The lenslet array in commercial plenoptic cameras helps capture both the spatial and angular information of light rays in a single exposure. While the resulting high dimensionality of light field data enables its superior capabilities, it also impedes its extensive adoption. Hence, there is a compelling need for efficient compression of light field images. Existing solutions are commonly composed of several separate modules, some of which may not have been designed for the specific structure and quality of light field data. This increases the complexity of the codec and results in impractical decoding runtimes. We propose a new learning-based, disparity-aided model for compression of 4D light field images capable of parallel decoding. The model is end-to-end trainable, eliminating the need for hand-tuning separate modules and allowing joint learning of rate and distortion. The disparity-aided approach ensures the structural integrity of the reconstructed light fields. Comparisons with the state of the art show encouraging performance in terms of PSNR and MS-SSIM metrics. Also, there is a notable gain in the encoding and decoding runtimes. Source code is available at https://moha23.github.io/LF-DAAE.

Bayesian Neural Networks: Essentials arxiv:2106.13594 📈 2

Daniel T. Chang

**Abstract:** Bayesian neural networks utilize probabilistic layers that capture uncertainty over weights and activations, and are trained using Bayesian inference. Since these probabilistic layers are designed to be drop-in replacements of their deterministic counterparts, Bayesian neural networks provide a direct and natural way to extend conventional deep neural networks to support probabilistic deep learning. However, it is nontrivial to understand, design and train Bayesian neural networks due to their complexities. We discuss the essentials of Bayesian neural networks including duality (deep neural networks, probabilistic models), approximate Bayesian inference, Bayesian priors, Bayesian posteriors, and deep variational learning. We use TensorFlow Probability APIs and code examples for illustration. The main problem with Bayesian neural networks is that the architecture of deep neural networks makes it quite redundant, and costly, to account for uncertainty for a large number of successive layers. Hybrid Bayesian neural networks, which use a few probabilistic layers judiciously positioned in the networks, provide a practical solution.
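
For readers without the TensorFlow Probability examples at hand, the core ingredient, a mean-field variational layer trained with the reparameterization trick, can be sketched from scratch in PyTorch. This is a generic illustration of the hybrid idea, not the paper's code:

```python
import torch
import torch.nn as nn

class BayesLinear(nn.Module):
    """Minimal mean-field variational linear layer: weights are sampled
    with the reparameterization trick at every forward pass."""
    def __init__(self, n_in, n_out, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))  # softplus(rho) = std
        self.prior_std = prior_std

    def forward(self, x):
        std = torch.nn.functional.softplus(self.rho)
        w = self.mu + std * torch.randn_like(std)  # reparameterization trick
        return x @ w.t()

    def kl(self):
        # KL(q || p) between the factorized Gaussian posterior and N(0, prior_std^2),
        # added to the negative log-likelihood during deep variational learning.
        std = torch.nn.functional.softplus(self.rho)
        var_ratio = (std / self.prior_std) ** 2
        return 0.5 * (var_ratio + (self.mu / self.prior_std) ** 2
                      - 1 - var_ratio.log()).sum()

# Hybrid network: deterministic trunk, single judiciously placed probabilistic head.
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), BayesLinear(64, 10))
x = torch.randn(8, 16)
print(net(x).shape)
```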

Long-term Cross Adversarial Training: A Robust Meta-learning Method for Few-shot Classification Tasks arxiv:2106.12900 📈 2

Fan Liu, Shuyu Zhao, Xuelong Dai, Bin Xiao

**Abstract:** Meta-learning models can quickly adapt to new tasks using few-shot labeled data. However, despite achieving good generalization on few-shot classification tasks, it is still challenging to improve the adversarial robustness of the meta-learning model in few-shot learning. Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT remains computationally expensive. On the other hand, meta-learning models trained with AT suffer a significant drop in accuracy on the original clean images. This paper proposes a meta-learning method on adversarially robust neural networks called Long-term Cross Adversarial Training (LCAT). LCAT updates the meta-learning model parameters across the natural and adversarial sample distribution directions over the long term, to improve both adversarial and clean few-shot classification accuracy. Due to cross-adversarial training, LCAT only needs half the adversarial training epochs of AQ, resulting in a low adversarial training cost. Experimental results show that LCAT achieves superior performance on both clean and adversarial few-shot classification accuracy compared to SOTA adversarial training methods for meta-learning models.

Diabetic Retinopathy Detection using Ensemble Machine Learning arxiv:2106.12545 📈 2

Israa Odeh, Mouhammd Alkasassbeh, Mohammad Alauthman

**Abstract:** Diabetic Retinopathy (DR) is among the world's leading causes of vision loss in diabetic patients. DR is a microvascular disease that affects the eye retina, causing vessel blockage and therefore cutting the main source of nutrition for the retina tissues. Treatment for this visual disorder is most effective when it is detected in its earliest stages, as severe DR can result in irreversible blindness. Nonetheless, DR identification requires the expertise of ophthalmologists, which is often expensive and time-consuming. Therefore, automatic detection systems were introduced, aiming to facilitate the identification process and make it available globally in a time- and cost-efficient manner. However, due to the limited reliable datasets and medical records for this particular eye disease, the obtained prediction accuracies were too unsatisfactory for eye specialists to rely on as diagnostic systems. Thus, we explored an ensemble-based learning strategy, merging a substantial selection of well-known classification algorithms into one sophisticated diagnostic model. The proposed framework achieved the highest accuracy rates among all other common classification algorithms in the area. Four sub-datasets were generated to contain the top 5 and top 10 features of the Messidor dataset, selected by InfoGainEval. and WrapperSubsetEval.; accuracies of 70.7% and 75.1% were achieved on the InfoGainEval. top-5 and the original dataset, respectively. The results imply the impressive performance of the sub-dataset, which significantly contributes to a less complex classification process.

Neural Fashion Image Captioning : Accounting for Data Diversity arxiv:2106.12154 📈 2

Gilles Hacheme, Noureini Sayouti

**Abstract:** Image captioning has increasingly large domains of application, and fashion is not an exception. Having automatic item descriptions is of great interest for fashion web platforms, which sometimes host hundreds of thousands of images. This paper is one of the first to tackle image captioning for fashion images. To address dataset diversity issues, we introduce the InFashAIv1 dataset, containing almost 16,000 African fashion item images with their titles, prices, and general descriptions. We also use the well-known DeepFashion dataset in addition to InFashAIv1. Captions are generated using the Show and Tell model, made of a CNN encoder and an RNN decoder. We show that jointly training the model on both datasets improves caption quality for African-style fashion images, suggesting transfer learning from Western-style data. The InFashAIv1 dataset is released on Github to encourage works with more diversity inclusion.

Lagrangian dual framework for conservative neural network solutions of kinetic equations arxiv:2106.12147 📈 2

Hyung Ju Hwang, Hwijae Son

**Abstract:** In this paper, we propose a novel conservative formulation for solving kinetic equations via neural networks. More precisely, we formulate the learning problem as a constrained optimization problem with constraints that represent the physical conservation laws. The constraints are relaxed into the residual loss function via Lagrangian duality. By imposing the physical conservation properties of the solution as constraints of the learning problem, we demonstrate far more accurate approximations of the solutions in terms of errors and the conservation laws, for the kinetic Fokker-Planck equation and the homogeneous Boltzmann equation.
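
The relaxation admits a compact primal-dual sketch: minimize the residual plus multiplier-weighted conservation violations, then take a dual-ascent step on the multipliers. The function names and the specific update rule below are illustrative assumptions, not the paper's implementation:

```python
import torch

def lagrangian_dual_step(model, x, residual_fn, conservation_fn, lam, opt, lam_lr=1e-2):
    """One primal-dual step: `residual_fn(model, x)` returns the kinetic
    equation's residual at collocation points x, `conservation_fn(model, x)`
    the per-law conservation violations, and `lam` holds the multipliers."""
    opt.zero_grad()
    residual = residual_fn(model, x).pow(2).mean()  # PDE residual loss
    violation = conservation_fn(model, x)           # e.g. mass/momentum/energy drift
    loss = residual + (lam * violation).sum()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Dual ascent; multipliers for equality constraints are unconstrained.
        lam += lam_lr * violation.detach()
    return float(loss)
```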

Learning Identity-Preserving Transformations on Data Manifolds arxiv:2106.12096 📈 2

Marissa Connor, Kion Fallah, Christopher Rozell

**Abstract:** Many machine learning techniques incorporate identity-preserving transformations into their models to generalize their performance to previously unseen data. These transformations are typically selected from a set of functions that are known to maintain the identity of an input when applied (e.g., rotation, translation, flipping, and scaling). However, there are many natural variations that cannot be labeled for supervision or defined through examination of the data. As suggested by the manifold hypothesis, many of these natural variations live on or near a low-dimensional, nonlinear manifold. Several techniques represent manifold variations through a set of learned Lie group operators that define directions of motion on the manifold. However, these approaches are limited because they require transformation labels when training their models and they lack a method for determining which regions of the manifold are appropriate for applying each specific operator. We address these limitations by introducing a learning strategy that does not require transformation labels and developing a method that learns the local regions where each operator is likely to be used while preserving the identity of inputs. Experiments on MNIST and Fashion MNIST highlight our model's ability to learn identity-preserving transformations on multi-class datasets. Additionally, we train on CelebA to showcase our model's ability to learn semantically meaningful transformations on complex datasets in an unsupervised manner.

A Simple Baseline for Batch Active Learning with Stochastic Acquisition Functions arxiv:2106.12059 📈 2

Andreas Kirsch, Sebastian Farquhar, Yarin Gal

**Abstract:** In active learning, new labels are commonly acquired in batches. However, common acquisition functions are only meant for one-sample acquisition rounds at a time, and when their scores are used naively for batch acquisition, they result in batches lacking diversity, which deteriorates performance. On the other hand, state-of-the-art batch acquisition functions are costly to compute. In this paper, we present a novel class of stochastic acquisition functions that extend one-sample acquisition functions to the batch setting by observing how one-sample acquisition scores change as additional samples are acquired and modelling this difference for additional batch samples. We simply acquire new samples by sampling from the pool set using a Gibbs distribution based on the acquisition scores. Our acquisition functions are both vastly cheaper to compute and outperform other batch acquisition functions.
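
The acquisition rule is short enough to sketch directly: sample a batch without replacement from a Gibbs (softmax) distribution over one-sample scores. The temperature and the source of the scores here are placeholders:

```python
import numpy as np

def stochastic_batch_acquire(scores, batch_size, temperature=1.0, rng=None):
    """Sketch of batch acquisition by sampling from a Gibbs distribution
    over one-sample acquisition scores, instead of taking the top-k
    (which yields redundant, low-diversity batches)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(scores, dtype=float) / temperature
    logits -= logits.max()           # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(scores), size=batch_size, replace=False, p=p)

rng = np.random.default_rng(0)
scores = rng.random(1000)            # e.g. BALD or entropy scores on the pool set
print(stochastic_batch_acquire(scores, batch_size=10, rng=rng))
```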

FLEA: Provably Fair Multisource Learning from Unreliable Training Data arxiv:2106.11732 📈 2

Eugenia Iofinova, Nikola Konstantinov, Christoph H. Lampert

**Abstract:** Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.

MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI arxiv:2106.11731 📈 2

Taro Langner, Andrés Martínez Mora, Robin Strand, Håkan Ahlström, Joel Kullberg

**Abstract:** UK Biobank (UKB) conducts large-scale examinations of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging of 100,000 subjects, with 70,000 follow-up sessions, enables measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. It was evaluated in cross-validation for baseline characteristics such as age, height, weight, and sex, but also measurements of body composition, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status. It predicted subsequently released test data covering twelve body composition metrics with a 3% median error. The proposed system can automatically analyze one thousand subjects within ten minutes, providing individual confidence intervals. The underlying methodology utilizes convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UKB image data.

Learning Dynamical Systems from Noisy Sensor Measurements using Multiple Shooting arxiv:2106.11712 📈 2

Armand Jordana, Justin Carpentier, Ludovic Righetti

**Abstract:** Modeling dynamical systems plays a crucial role in capturing and understanding complex physical phenomena. When physical models are not sufficiently accurate or are hard to describe by analytical formulas, one can use generic function approximators such as neural networks to capture the system dynamics directly from sensor measurements. Currently, methods for learning the parameters of these neural networks are highly sensitive to the inherent instability of most dynamical systems of interest, which in turn prevents the study of very long sequences. In this work, we introduce a generic and scalable method based on multiple shooting to learn latent representations of indirectly observed dynamical systems. We achieve state-of-the-art performance on systems observed directly from raw images. Further, we demonstrate that our method is robust to noisy measurements and can handle complex dynamical systems, such as chaotic ones.
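
The multiple-shooting idea can be sketched in a few lines: split the trajectory into segments, integrate each from a free shooting state, and penalize continuity defects between segments. The Euler integrator, the linear model, and the toy spiral below are simplifying assumptions for illustration, not the paper's latent-space setup:

```python
import torch

def multiple_shooting_loss(dynamics, s, y_obs, dt, lam=1.0):
    """Integrate each segment from a free shooting state s[i] (all segments
    in parallel) and penalize data mismatch plus the continuity defects
    between a segment's end and the next segment's start."""
    n_segments, dim = s.shape
    T = y_obs.shape[0]
    seg_len = T // n_segments
    y, sim = s, []
    for _ in range(seg_len):
        sim.append(y)
        y = y + dt * dynamics(y)                 # one Euler step per segment
    sim = torch.stack(sim, dim=1).reshape(-1, dim)[:T]
    data_loss = ((sim - y_obs) ** 2).mean()
    defect = ((y[:-1] - s[1:]) ** 2).mean()      # gluing constraint, as a penalty
    return data_loss + lam * defect

# Toy use: fit a linear dynamics model to a noisy decaying spiral.
t = torch.arange(200, dtype=torch.float32) * 0.01
y_true = torch.stack([torch.exp(-0.1 * t) * torch.cos(2 * t),
                      torch.exp(-0.1 * t) * torch.sin(2 * t)], dim=1)
y_obs = y_true + 0.02 * torch.randn_like(y_true)
model = torch.nn.Linear(2, 2, bias=False)
s = torch.nn.Parameter(y_obs[::20].clone())      # 10 segments of length 20
opt = torch.optim.Adam([*model.parameters(), s], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = multiple_shooting_loss(model, s, y_obs, dt=0.01)
    loss.backward()
    opt.step()
print(float(loss))
```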

Categorising Fine-to-Coarse Grained Misinformation: An Empirical Study of COVID-19 Infodemic arxiv:2106.11702 📈 2

Ye Jiang, Xingyi Song, Carolina Scarton, Ahmet Aker, Kalina Bontcheva

**Abstract:** The spread of COVID-19 misinformation over social media has already drawn the attention of many researchers. According to Google Scholar, about 26,000 COVID-19-related misinformation studies have been published to date. Most of these studies focus on 1) detecting and/or 2) analysing the characteristics of COVID-19-related misinformation. However, the study of the social behaviours related to misinformation is often neglected. In this paper, we introduce a fine-grained annotated misinformation tweets dataset including social behaviour annotations (e.g. comments on or questions about the misinformation). The dataset not only allows social behaviour analysis but is also suitable for both evidence-based and non-evidence-based misinformation classification tasks. In addition, we introduce leave-claim-out validation in our experiments and demonstrate that misinformation classification performance can differ significantly when applied to real-world unseen misinformation.

The Hitchhiker's Guide to Prior-Shift Adaptation arxiv:2106.11695 📈 2

Tomas Sipka, Milan Sulc, Jiri Matas

**Abstract:** In many computer vision classification tasks, class priors at test time often differ from priors on the training set. In the case of such prior shift, classifiers must be adapted correspondingly to maintain close to optimal performance. This paper analyzes methods for adaptation of probabilistic classifiers to new priors and for estimating new priors on an unlabeled test set. We propose a novel method to address a known issue of prior estimation methods based on confusion matrices, where inconsistent estimates of decision probabilities and confusion matrices lead to negative values in the estimated priors. Experiments on fine-grained image classification datasets provide insight into the best practice of prior shift estimation and classifier adaptation, and show that the proposed method achieves state-of-the-art results in prior adaptation. Applying the best practice to two tasks with naturally imbalanced priors, learning from web-crawled images and plant species classification, increased the recognition accuracy by 1.1% and 3.4% respectively.
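
For context, the classifier-adaptation step that prior-shift methods build on is the standard reweighting of posteriors by the ratio of new to training priors; the paper's contribution concerns estimating the new priors reliably. A minimal sketch:

```python
import numpy as np

def adapt_to_new_priors(probs, train_priors, test_priors):
    """Standard prior-shift correction: reweight predicted posteriors by
    the ratio of new to training class priors and renormalize."""
    w = np.asarray(test_priors) / np.asarray(train_priors)
    adapted = probs * w
    return adapted / adapted.sum(axis=1, keepdims=True)

probs = np.array([[0.7, 0.3], [0.4, 0.6]])  # classifier outputs on two samples
print(adapt_to_new_priors(probs, train_priors=[0.5, 0.5], test_priors=[0.9, 0.1]))
```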

MMD-MIX: Value Function Factorisation with Maximum Mean Discrepancy for Cooperative Multi-Agent Reinforcement Learning arxiv:2106.11652 📈 2

Zhiwei Xu, Dapeng Li, Yunpeng Bai, Guoliang Fan

**Abstract:** In the real world, many tasks require multiple agents to cooperate with each other under the condition of local observations. To solve such problems, many multi-agent reinforcement learning methods based on Centralized Training with Decentralized Execution have been proposed. One representative class of work is value decomposition, which decomposes the global joint Q-value $Q_\text{jt}$ into individual Q-values $Q_a$ to guide individuals' behaviors, e.g. VDN (Value-Decomposition Networks) and QMIX. However, these baselines often ignore the randomness in the situation. We propose MMD-MIX, a method that combines distributional reinforcement learning and value decomposition to alleviate the above weaknesses. Besides, to improve data sampling efficiency, we draw inspiration from REM (Random Ensemble Mixture), a robust RL algorithm, to explicitly introduce randomness into MMD-MIX. The experiments demonstrate that MMD-MIX outperforms prior baselines in the StarCraft Multi-Agent Challenge (SMAC) environment.

Universal Domain Adaptation in Ordinal Regression arxiv:2106.11576 📈 2

Boris Chidlovskii, Assem Sadek, Christian Wolf

**Abstract:** We address the problem of universal domain adaptation (UDA) in ordinal regression (OR), which attempts to solve classification problems in which labels are not independent, but follow a natural order. We show that the UDA techniques developed for classification and based on the clustering assumption, under-perform in OR settings. We propose a method that complements the OR classifier with an auxiliary task of order learning, which plays the double role of discriminating between common and private instances, and expanding class labels to the private target images via ranking. Combined with adversarial domain discrimination, our model is able to address the closed set, partial and open set configurations. We evaluate our method on three face age estimation datasets, and show that it outperforms the baseline methods.

Recurrent Neural Network from Adder's Perspective: Carry-lookahead RNN arxiv:2106.12901 📈 1

Haowei Jiang, Feiwei Qin, Jin Cao, Yong Peng, Yanli Shao

**Abstract:** The recurrent network architecture is a widely used model in sequence modeling, but its serial dependency hinders the computation parallelization, which makes the operation inefficient. The same problem was encountered in the serial adder at the early stage of digital electronics. In this paper, we discuss the similarities between the recurrent neural network (RNN) and the serial adder. Inspired by the carry-lookahead adder, we introduce a carry-lookahead module to the RNN, which makes it possible for the RNN to run in parallel. We then design the method of parallel RNN computation, and finally propose the Carry-lookahead RNN (CL-RNN). CL-RNN offers advantages in parallelism and a flexible receptive field. Through a comprehensive set of tests, we verify that CL-RNN can perform better than existing typical RNNs in sequence modeling tasks that are specially designed for RNNs.

NAX: Co-Designing Neural Network and Hardware Architecture for Memristive Xbar based Computing Systems arxiv:2106.12125 📈 1

Shubham Negi, Indranil Chakraborty, Aayush Ankit, Kaushik Roy

**Abstract:** In-Memory Computing (IMC) hardware using Memristive Crossbar Arrays (MCAs) is gaining popularity to accelerate Deep Neural Networks (DNNs), since it alleviates the "memory wall" problem associated with the von-Neumann architecture. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters, such as kernel size and depth, and on hardware architecture parameters, such as crossbar size. However, co-optimization of both network and hardware parameters presents a challenging search space comprising different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX -- an efficient neural architecture search engine that co-designs the neural network and the IMC-based hardware architecture. NAX explores the aforementioned search space to determine the kernel and corresponding crossbar sizes for each DNN layer, to achieve optimal tradeoffs between hardware efficiency and application accuracy. Our results from NAX show that the networks have heterogeneous crossbar sizes across different network layers, and achieve optimal hardware efficiency and accuracy considering the non-idealities in crossbars. On CIFAR-10 and Tiny ImageNet, our models achieve 0.8% and 0.2% higher accuracy, and 17% and 4% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.

Prevention and Resolution of Conflicts in Social Navigation -- a Survey arxiv:2106.12113 📈 1

Reuth Mirsky, Xuesu Xiao, Justin Hart, Peter Stone

**Abstract:** With the approaching goal of having robots collaborate in shared human-robot environments, navigation in this context becomes both crucial and desirable. Recent developments in robotics have encountered and tackled some of the challenges of navigating in mixed human-robot environments, and in recent years we observe a surge of related work that specifically targets the question of how to handle conflicts between agents in social navigation. These contributions offer models, algorithms, and evaluation metrics; however, as this research area is inherently interdisciplinary, many of the relevant papers are not comparable and there is no standard vocabulary among researchers. The main goal of this survey is to bridge this gap by proposing such a common language, using it to survey existing work, and highlighting open problems. It starts by defining a conflict in social navigation and offers a detailed taxonomy of its components. The survey then maps existing work while discussing papers using the framing of the proposed taxonomy. Finally, this paper proposes some future directions and problems that are currently at the frontier of social navigation, to help focus research efforts.

BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes arxiv:2106.12091 📈 1

Zhengchun Liu, Rajkumar Kettimuthu, Michael E. Papka, Ian Foster

**Abstract:** Supercomputer FCFS-based scheduling policies result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node*time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.

A Federated Data-Driven Evolutionary Algorithm for Expensive Multi/Many-objective Optimization arxiv:2106.12086 📈 1

Jinjin Xu, Yaochu Jin, Wenli Du

**Abstract:** Data-driven optimization has found many successful applications in the real world and received increased attention in the field of evolutionary optimization. Most existing algorithms assume that the data used for optimization is always available on a central server for construction of surrogates. This assumption, however, may fail to hold when the data must be collected in a distributed way and is subject to privacy restrictions. This paper aims to propose a federated data-driven evolutionary multi-/many-objective optimization algorithm. To this end, we leverage federated learning for surrogate construction so that multiple clients collaboratively train a radial-basis-function-network as the global surrogate. Then a new federated acquisition function is proposed for the central server to approximate the objective values using the global surrogate and estimate the uncertainty level of the approximated objective values based on the local models. The performance of the proposed algorithm is verified on a series of multi/many-objective benchmark problems by comparing it with two state-of-the-art surrogate-assisted multi-objective evolutionary algorithms.

The Rate of Convergence of Variation-Constrained Deep Neural Networks arxiv:2106.12068 📈 1

Gen Li, Yuantao Gu, Jie Ding

**Abstract:** Multi-layer feedforward networks have been used to approximate a wide range of nonlinear functions. An important and fundamental problem is to understand the learnability of a network model through its statistical risk, or the expected prediction error on future data. To the best of our knowledge, the rate of convergence of neural networks shown by existing works is bounded by at most the order of $n^{-1/4}$ for a sample size of $n$. In this paper, we show that a class of variation-constrained neural networks, with arbitrary width, can achieve the near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small positive constant $\delta$. It is equivalent to $n^{-1+2\delta}$ under the mean squared error. This rate is also observed in numerical experiments. The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived. Our result also provides insight into the phenomenon that deep neural networks do not easily suffer from overfitting when the number of neurons and learning parameters rapidly grows with $n$ or even surpasses $n$. We also discuss the rate of convergence with regard to other network parameters, including the input dimension, network layer, and coefficient norm.

A Practical & Unified Notation for Information-Theoretic Quantities in ML arxiv:2106.12062 📈 1

Andreas Kirsch, Yarin Gal

**Abstract:** A practical notation can convey valuable intuitions and concisely express new ideas. Information theory is of importance to machine learning, but the notation for information-theoretic quantities is sometimes opaque. We propose a practical and unified notation and extend it to include information-theoretic quantities between observed outcomes (events) and random variables. This includes the point-wise mutual information known in NLP and mixed quantities such as specific surprise and specific information in the cognitive sciences and information gain in Bayesian optimal experimental design. We apply our notation to prove a version of Stirling's approximation for binomial coefficients mentioned by MacKay (2003) using new intuitions. We also concisely rederive the evidence lower bound for variational auto-encoders and variational inference in approximate Bayesian neural networks. Furthermore, we apply the notation to a popular information-theoretic acquisition function in Bayesian active learning which selects the most informative (unlabelled) samples to be labelled by an expert and extend this acquisition function to the core-set problem with the goal of selecting the most informative samples given the labels.
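
For reference, the approximation in question is the entropy form of Stirling's approximation for binomial coefficients from MacKay (2003); in our transcription of the standard statement, with $H_2$ the binary entropy:

```latex
% For large N, Stirling's approximation applied to a binomial coefficient gives
\log_2 \binom{N}{r} \simeq N \, H_2\!\left(\tfrac{r}{N}\right),
\qquad
H_2(p) = p \log_2 \tfrac{1}{p} + (1-p) \log_2 \tfrac{1}{1-p}.
```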

Machine Learning for Model Order Selection in MIMO OFDM Systems arxiv:2106.11633 📈 1

Brenda Vilas Boas, Wolfgang Zirwas, Martin Haardt

**Abstract:** A variety of wireless channel estimation methods, e.g., MUSIC and ESPRIT, rely on prior knowledge of the model order. Therefore, it is important to correctly estimate the number of multipath components (MPCs) which compose such channels. However, environments with many scatterers may generate MPCs which are closely spaced. This clustering of MPCs, in addition to noise, makes the model order selection task difficult in practice for currently known algorithms. In this paper, we exploit the multidimensional characteristics of MIMO orthogonal frequency division multiplexing (OFDM) systems and propose a machine learning (ML) method capable of determining the number of MPCs with higher accuracy than state-of-the-art methods in almost coherent scenarios. Moreover, our results show that the proposed ML method offers enhanced reliability.
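
For context on the baselines such an ML method competes with, classical model order selection is often done with information-theoretic criteria on the eigenvalues of the sample covariance matrix. The sketch below implements the standard MDL criterion of Wax and Kailath (1985) as one such baseline; this is our choice of illustration, not necessarily the comparison set used in the paper.

```python
import numpy as np

def mdl_order(eigs, N):
    """Classical MDL model order selection (Wax & Kailath, 1985).

    eigs: eigenvalues of the sample covariance matrix (assumed positive),
    N: number of snapshots. Returns the estimated number of sources/MPCs.
    """
    p = len(eigs)
    eigs = np.sort(eigs)[::-1]
    scores = []
    for k in range(p):
        tail = eigs[k:]                       # the p-k smallest eigenvalues
        geo = np.exp(np.mean(np.log(tail)))   # geometric mean
        arith = np.mean(tail)                 # arithmetic mean
        loglik = -N * (p - k) * np.log(geo / arith)
        penalty = 0.5 * k * (2 * p - k) * np.log(N)
        scores.append(loglik + penalty)
    return int(np.argmin(scores))
```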

Online Ordering Platform City Distribution Based on Genetic Algorithm arxiv:2106.11578 📈 1

Yu Du

**Abstract:** Since the rise of takeaway ordering platforms, the M platform has led the industry with its high-quality service. Growing order volumes drive platforms to compete on distribution cost, which rises rapidly when delivery routes are planned poorly. By analyzing the current state of platform distribution, we study the vehicle routing problem for urban distribution on the M platform with the goal of minimizing distribution cost. Considering constraints on customers' expected delivery times and vehicle conditions, we model the different arrival times in the vehicle routing problem using three soft time windows and solve the resulting problem with a genetic algorithm (GA). The results show that our model and algorithm produce vehicle routes superior to the original model in terms of distribution cost and delivery time, thus providing decision support for the M platform to reduce urban distribution costs in the future.
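
As an illustration of the general approach, the sketch below runs a minimal GA on a single-vehicle routing problem with one linear lateness penalty per customer standing in for the paper's three soft time windows. The encoding (customer permutations), operators, and parameters are generic textbook choices, not the paper's.

```python
import random

def route_cost(route, dist, due, penalty=10.0):
    # Travel cost plus a linear lateness penalty; a single soft window per
    # customer stands in for the paper's three-tier soft time windows.
    t, cost, prev = 0.0, 0.0, 0
    for c in route:
        t += dist[prev][c]                     # treat distance as travel time
        cost += dist[prev][c] + penalty * max(0.0, t - due[c])
        prev = c
    return cost + dist[prev][0]                # return to depot (node 0)

def ga_vrp(dist, due, pop_size=50, gens=200, mut_rate=0.2):
    n = len(dist) - 1                          # customers 1..n, depot 0
    pop = [random.sample(range(1, n + 1), n) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda r: route_cost(r, dist, due))
        elite = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randint(1, n - 1)     # one-point order crossover
            child = a[:cut] + [c for c in b if c not in a[:cut]]
            if random.random() < mut_rate:     # swap mutation
                i, j = random.sample(range(n), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = elite + children
    return min(pop, key=lambda r: route_cost(r, dist, due))
```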

Better Algorithms for Individually Fair $k$-Clustering arxiv:2106.12150 📈 0

Deeparnab Chakrabarty, Maryam Negahbani

**Abstract:** We study data clustering problems with $\ell_p$-norm objectives (e.g. $k$-Median and $k$-Means) in the context of individual fairness. The dataset consists of $n$ points, and we want to find $k$ centers such that (a) the objective is minimized, while (b) respecting the individual fairness constraint that every point $v$ has a center within a distance at most $r(v)$, where $r(v)$ is $v$'s distance to its $(n/k)$th nearest point. Jung, Kannan, and Lutz [FORC 2020] introduced this concept and designed a clustering algorithm with provable (approximate) fairness and objective guarantees for the $\ell_\infty$ or $k$-Center objective. Mahabadi and Vakilian [ICML 2020] revisited this problem to give a local-search algorithm for all $\ell_p$-norms. Empirically, their algorithms outperform Jung et al.'s by a large margin in terms of cost (for $k$-Median and $k$-Means), but they incur a noticeable loss in fairness. In this paper, our main contribution is to use Linear Programming (LP) techniques to obtain better algorithms for this problem, both in theory and in practice. We prove that by modifying known LP rounding techniques, one gets a worst-case guarantee on the objective which is much better than in MV20, and empirically, this objective is extremely close to the optimal. Furthermore, our theoretical fairness guarantees are comparable with MV20 in theory, and empirically, we obtain noticeably fairer solutions. Although solving the LP *exactly* might be prohibitive, we demonstrate that in practice, a simple sparsification technique drastically improves the run-time of our algorithm.
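
The fairness constraint itself is easy to compute directly. The sketch below derives each point's radius $r(v)$ as the distance to (roughly) its $(n/k)$th nearest point; the exact indexing convention, including whether a point counts itself, is our assumption, so treat it as illustrative.

```python
import numpy as np

def fairness_radii(X, k):
    # r(v): distance from each point to roughly its (n/k)-th nearest point,
    # following Jung, Kannan, and Lutz (2020); indexing conventions vary.
    n = len(X)
    m = int(np.ceil(n / k))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(D, axis=1)[:, m]   # column 0 is the point itself (distance 0)

X = np.random.rand(100, 2)
r = fairness_radii(X, k=5)
# A k-clustering is individually fair if every point v has a center within r[v].
```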

The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference arxiv:2106.12026 📈 0

R. Kenny Jones, Aalia Habib, Rana Hanocka, Daniel Ritchie

**Abstract:** We propose the Neurally-Guided Shape Parser (NGSP), a method that learns how to assign fine-grained semantic labels to regions of a 3D shape. NGSP solves this problem via MAP inference, modeling the posterior probability of a label assignment conditioned on an input shape with a learned likelihood function. To make this search tractable, NGSP employs a neural guide network that learns to approximate the posterior. NGSP finds high-probability label assignments by first sampling proposals with the guide network and then evaluating each proposal under the full likelihood. We evaluate NGSP on the task of fine-grained semantic segmentation of manufactured 3D shapes from PartNet, where shapes have been decomposed into regions that correspond to part instance over-segmentations. We find that NGSP delivers significant performance improvements over comparison methods that (i) use regions to group per-point predictions, (ii) use regions as a self-supervisory signal or (iii) assign labels to regions under alternative formulations. Further, we show that NGSP maintains strong performance even with limited labeled data or as shape regions undergo artificial corruption. Finally, we demonstrate that NGSP can be directly applied to CAD shapes found in online repositories and validate its effectiveness with a perceptual study.
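
The search procedure described here, sampling from the guide and then rescoring under the full likelihood, follows a generic pattern that can be sketched independently of NGSP's actual implementation. In the sketch below, `sample_proposal` and `full_log_likelihood` are hypothetical callables standing in for the guide network and the learned likelihood; nothing here is NGSP's real API.

```python
def map_search_by_rescoring(regions, sample_proposal, full_log_likelihood,
                            num_proposals=100):
    # Guide-then-rescore MAP search: draw candidate label assignments from an
    # approximate posterior (the guide) and keep the one that scores highest
    # under the full likelihood.
    best, best_score = None, float("-inf")
    for _ in range(num_proposals):
        assignment = sample_proposal(regions)              # proposal from the guide
        score = full_log_likelihood(regions, assignment)   # full-likelihood rescore
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score
```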

Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection arxiv:2106.11921 📈 0

Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixe, Jose M. Alvarez

**Abstract:** Deep neural networks have reached high accuracy on object detection, but their success hinges on large amounts of labeled data. To reduce this dependency on labels, various active learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards high-performing classes and can lead to acquired datasets that are not representative of the test set data. In this work, we propose a unified framework for active learning that considers both the uncertainty and the robustness of the detector, ensuring that the network performs well across all classes. Furthermore, our method leverages auto-labeling to suppress potential distribution drift while boosting the performance of the model. Experiments on PASCAL VOC07+12 and MS-COCO show that our method consistently outperforms a wide range of active learning methods, yielding up to a 7.7% improvement in mAP, or up to 82% reduction in labeling cost. Code will be released upon acceptance of the paper.

Neural Distributed Image Compression using Common Information arxiv:2106.11723 📈 0

Nitish Mital, Ezgi Ozyilkan, Ali Garjani, Deniz Gunduz

**Abstract:** We present a novel deep neural network (DNN) architecture for compressing an image when a correlated image is available as side information only at the decoder. This problem is known as distributed source coding (DSC) in information theory. In particular, we consider a pair of stereo images, which generally have high correlation with each other due to overlapping fields of view, and assume that one image of the pair is to be compressed and transmitted, while the other image is available only at the decoder. In the proposed architecture, the encoder maps the input image to a latent space, quantizes the latent representation, and compresses it using entropy coding. The decoder is trained to extract the common information between the input image and the correlated image, using only the latter. The received latent representation and the locally generated common information are passed through a decoder network to obtain an enhanced reconstruction of the input image. The common information provides a succinct representation of the relevant information at the receiver. We train and demonstrate the effectiveness of the proposed approach on the KITTI and Cityscape datasets of stereo image pairs. Our results show that the proposed architecture is capable of exploiting the decoder-only side information, and outperforms previous work on stereo image compression with decoder side information.
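
To make the described pipeline concrete, here is a hypothetical PyTorch sketch of the data flow: encode and quantize the input image, extract common information from the side image available only at the decoder, and decode both jointly. All layer shapes and module names are our placeholders, not the paper's architecture; real systems replace hard rounding with a differentiable surrogate during training and add entropy coding.

```python
import torch
import torch.nn as nn

class DSCCodecSketch(nn.Module):
    # Placeholder modules illustrating the described data flow only.
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 5, 2, 2), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 5, 2, 2))
        self.common = nn.Sequential(nn.Conv2d(3, ch, 5, 2, 2), nn.ReLU(),
                                    nn.Conv2d(ch, ch, 5, 2, 2))
        self.decoder = nn.Sequential(nn.ConvTranspose2d(2 * ch, ch, 5, 2, 2, 1),
                                     nn.ReLU(),
                                     nn.ConvTranspose2d(ch, 3, 5, 2, 2, 1))

    def forward(self, x, side):
        z = torch.round(self.encoder(x))   # quantized latent (entropy-coded in practice)
        w = self.common(side)              # common information from the side image only
        return self.decoder(torch.cat([z, w], dim=1))
```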

Test Distribution-Aware Active Learning: A Principled Approach Against Distribution Shift and Outliers arxiv:2106.11719 📈 0

Andreas Kirsch, Tom Rainforth, Yarin Gal

**Abstract:** Expanding on MacKay (1992), we argue that conventional model-based methods for active learning - like BALD - have a fundamental shortfall: they fail to directly account for the test-time distribution of the input variables. This can lead to pathologies in the acquisition strategy, as what is maximally informative for model parameters may not be maximally informative for prediction: for example, when the data in the pool set is more dispersed than that of the final prediction task, or when the distribution of pool and test samples differs. To correct this, we revisit an acquisition strategy that is based on maximizing the expected information gained about possible future predictions, referring to this as the Expected Predictive Information Gain (EPIG). As EPIG does not scale well for batch acquisition, we further examine an alternative strategy, a hybrid between BALD and EPIG, which we call the Joint Expected Predictive Information Gain (JEPIG). We consider using both for active learning with Bayesian neural networks on a variety of datasets, examining the behavior under distribution shift in the pool set.
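
For orientation, the contrast between BALD and EPIG can be written compactly. The following is our paraphrase in mutual-information notation, with $\Theta$ the model parameters, $Y$ and $Y_*$ the labels at a candidate pool point $x$ and a test point $x_*$, and $p_*(x_*)$ the test-time input distribution; consult the paper for the exact definitions.

```latex
% Our paraphrase of the contrast drawn in the abstract:
\mathrm{BALD}(x) = \operatorname{I}\!\left(\Theta ; Y \mid x\right),
\qquad
\mathrm{EPIG}(x) = \mathbb{E}_{p_*(x_*)}\!\left[\operatorname{I}\!\left(Y ; Y_* \mid x, x_*\right)\right].
```

The outer expectation over $p_*(x_*)$ is what makes EPIG test-distribution-aware: a pool point is valued only insofar as its label informs predictions where they will actually be made.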

DARTS-PRIME: Regularization and Scheduling Improve Constrained Optimization in Differentiable NAS arxiv:2106.11655 📈 0

Kaitlin Maile, Erwan Lecarpentier, Hervé Luga, Dennis G. Wilson

**Abstract:** Differentiable Architecture Search (DARTS) is a recent neural architecture search (NAS) method based on a differentiable relaxation. Due to its success, numerous variants analyzing and improving parts of the DARTS framework have recently been proposed. By considering the problem as a constrained bilevel optimization, we present and analyze DARTS-PRIME, a variant including improvements to architectural weight update scheduling and regularization towards discretization. We propose a dynamic schedule based on per-minibatch network information to make architecture updates more informed, as well as proximity regularization to promote well-separated discretization. Our results in multiple domains show that DARTS-PRIME improves both performance and reliability, with results comparable to the state of the art in differentiable NAS.

Reinforcement Learning for Physical Layer Communications arxiv:2106.11595 📈 0

Philippe Mary, Visa Koivunen, Christophe Moy

**Abstract:** In this chapter, we give comprehensive examples of applying RL to optimize the physical layer of wireless communications by defining different classes of problems and possible solutions to handle them. In Section 9.2, we present the basic theory needed to address an RL problem, i.e., Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs), as well as two very important and widely used RL algorithms, Q-learning and SARSA. We also introduce the deep reinforcement learning (DRL) paradigm, and the section ends with an introduction to the multi-armed bandits (MAB) framework. Section 9.3 focuses on toy examples to illustrate how the basic concepts of RL are employed in communication systems. We present applications extracted from the literature with simplified system models, using notation similar to that of Section 9.2. In Section 9.3, we also focus on modeling RL problems, i.e., how action and state spaces and rewards are chosen. The chapter concludes in Section 9.4 with prospective thoughts on RL trends and ends with a review of the broader state of the art in Section 9.5.
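
Since the chapter leans on Q-learning and SARSA, a minimal tabular Q-learning update is useful for orientation. This is a generic textbook sketch, not code from the chapter:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    SARSA differs only in replacing max_a' Q(s', a') with Q(s', a') for the
    action a' actually taken by the current policy.
    """
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q

# Toy usage: 4 states, 2 actions, one observed transition (s=0, a=1) -> (r=1, s'=2).
Q = np.zeros((4, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
```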
