Readings

Info

This is a growing list of papers and implementations I find interesting.

Long-Tailed Recognition

Large-Scale Long-Tailed Recognition in an Open World
- In real-world scenarios, new unseen classes or samples frequently show up within the tail classes
- The paper tackles this with dynamic embeddings that bring in an associative memory to aid prediction of long-tailed classes
- The model essentially combines direct image features with embeddings from other classes

Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
- Proposes domain randomization: train deep learning models on synthetic data, then transfer them to real-world data
- The idea is that with sufficient variability in the textures of synthetic data, real-world data becomes just another variation of the synthetic data
- It works surprisingly well for such a simple technique: varying image textures essentially makes CNNs more robust to texture variations

Online Learning Rate Adaptation with Hypergradient Descent
- Reduces the need for learning rate scheduling for SGD, SGD with Nesterov momentum, and Adam
- Uses hypergradients (gradients w.r.t. the learning rate) obtained via reverse-mode automatic differentiation to update the learning rate dynamically alongside the weight updates
- Little additional computation because it needs just one additional copy of the original gradients stored in memory
- Severely under-appreciated paper
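
To make the hypergradient idea concrete, here is a minimal pure-Python sketch of the SGD variant as I understand it: the learning rate is nudged by the dot product of the current and previous gradients, and the previous gradient is the one extra copy kept in memory. The names `sgd_hd` and `quadratic_grad`, the toy quadratic objective, and the values of `alpha` and `beta` are my own assumptions for illustration, not taken from the paper's code.

```python
def sgd_hd(grad_fn, theta, alpha=0.01, beta=0.001, steps=50):
    """SGD with hypergradient descent on the learning rate (toy sketch)."""
    prev_grad = [0.0] * len(theta)  # the one extra gradient copy kept in memory
    for _ in range(steps):
        grad = grad_fn(theta)
        # Hypergradient step: alpha grows while consecutive gradients agree
        # in direction and shrinks when they start to oppose each other.
        alpha += beta * sum(g * pg for g, pg in zip(grad, prev_grad))
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        prev_grad = grad
    return theta, alpha


def quadratic_grad(theta, target=(1.0, 2.0)):
    """Gradient of the hypothetical objective sum((theta_i - target_i)^2)."""
    return [2.0 * (t - c) for t, c in zip(theta, target)]


theta, alpha = sgd_hd(quadratic_grad, [5.0, -3.0])
```

Starting from a deliberately small `alpha`, the learning rate increases on its own during the first few steps, which is the behaviour that removes most of the manual scheduling.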

DARTS: Differentiable Architecture Search
- Neural architecture search algorithm based on gradient descent and a continuous relaxation of the architecture space
- A good move towards automatic architecture design for neural networks
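
The continuous relaxation can be sketched in a few lines: instead of picking one operation per edge, every candidate operation is applied and the outputs are mixed with softmax weights over learnable architecture parameters. The scalar toy operations in `CANDIDATE_OPS` and the function names below are hypothetical stand-ins for illustration (real candidates would be convolutions, pooling, skip connections, and so on), and the gradient-based training of the parameters is left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scalar stand-ins for the candidate operations on one edge.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, arch_params):
    """Relaxed edge: softmax-weighted sum of all candidate operations."""
    weights = softmax(arch_params)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(arch_params):
    """After the search, keep only the operation with the largest parameter."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(names)), key=lambda i: arch_params[i])
    return names[best]
```

Because `mixed_op` is differentiable in `arch_params`, the architecture choice can be optimized with gradient descent and only discretized at the end.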

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Scales all dimensions of a CNN (resolution, depth, and width) together using a single compound coefficient
- Uses neural architecture search to design the baseline network that is then scaled up
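
A rough sketch of compound scaling, assuming the usual formulation where depth, width, and resolution multipliers are fixed bases raised to the compound coefficient `phi`. The default bases below are the values I recall the paper reporting for its baseline network; treat them, and the function name `compound_scale`, as assumptions rather than a reference implementation.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = alpha ** phi        # more layers
    width = beta ** phi         # more channels per layer
    resolution = gamma ** phi   # larger input images
    # FLOPs scale roughly with depth * width^2 * resolution^2, so choosing
    # alpha * beta^2 * gamma^2 close to 2 makes each unit of phi about double the cost.
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

The point of the single coefficient is that a compute budget translates into one number to tune instead of three.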

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
- Unlike existing Hessian-based pruning methods, this one prunes in the Kronecker-factored eigenbasis (KFE)
- Reports a 10x reduction in model size and an 8x reduction in FLOPs on Wide ResNet32 (WRN32)

Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
- Introduces Variational Adam (Vadam), an alternative to variational inference via dropout.
- Vadam perturbs the network's weights during backpropagation, giving uncertainty estimates at low computational cost.
- Not as good as dropout in terms of performance, but a good direction for computationally cheaper options.

A Unified Approach to Interpreting Model Predictions
- Introduces SHAP (SHapley Additive exPlanations)
- "SHAP assigns each feature an importance value for a particular prediction"
  - Higher positive SHAP values (red) = increase the probability of the class
  - Higher negative SHAP values (blue) = decrease the probability of the class
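
To show what "an importance value per feature for a particular prediction" means, here is a brute-force Shapley value computation on a tiny model. This is not how the SHAP library computes its values (it relies on faster approximations and model-specific algorithms); the names `shapley_values` and `toy_model`, and the choice to "remove" a feature by replacing it with a baseline value, are assumptions made for this sketch.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; cost is exponential in the number of features."""
    n = len(x)

    def value(subset):
        # Features outside the subset are replaced by their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to this subset.
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(z):
    """Hypothetical linear model used only for this example."""
    return 2.0 * z[0] - 3.0 * z[1] + 1.0 * z[2]


values = shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

For this linear model the values come out as the coefficients times the feature shifts, and they sum to the difference between the prediction and the baseline prediction, which is the additive property the paper builds on.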

Hierarchical interpretations for neural network predictions
- Given a prediction from a deep neural network, agglomerative contextual decomposition (ACD) produces a hierarchical clustering of input features alongside each cluster's contribution to the final prediction.
- The hierarchical clustering is then optimized to identify learned clusters driving the DNN's predictions.

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing
- Shows through scatterplots that toy datasets that look very different can share nearly identical summary statistics such as mean, standard deviation, and Pearson correlation
- This paper emphasises the need to always visualize your data
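
A heavily simplified sketch of the core loop as I understand it: nudge one point at a time and keep the move only if the summary statistics, rounded to two decimals, are unchanged. The target-shape fitness and the annealing temperature from the paper are left out, and the function names, step count, and noise scale are assumptions chosen for illustration.

```python
import random
from math import sqrt


def summary(points):
    """Means, standard deviations, and Pearson correlation, rounded to 2 decimals."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    r = sum((x - mx) * (y - my) for x, y in points) / ((n - 1) * sx * sy)
    return tuple(round(v, 2) for v in (mx, my, sx, sy, r))


def perturb_keeping_stats(points, steps=2000, scale=0.1, seed=0):
    """Randomly move points, rejecting any move that changes the rounded stats."""
    rng = random.Random(seed)
    points = list(points)
    target = summary(points)
    for _ in range(steps):
        i = rng.randrange(len(points))
        old = points[i]
        points[i] = (old[0] + rng.gauss(0, scale), old[1] + rng.gauss(0, scale))
        if summary(points) != target:
            points[i] = old  # revert: this move would change the statistics
    return points


rng = random.Random(42)
original = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(50)]
moved = perturb_keeping_stats(original)
```

Even this stripped-down version lets the points drift while the reported statistics stay the same, which is why the numbers alone cannot tell the datasets apart.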

Netron
- Easily visualize your saved deep learning models (PyTorch .pth, TensorFlow .pb, MXNet .model, ONNX, and more)
- You can even check out each node's documentation quickly in the interface

BRITS
- If you have missing data in your time series and use existing imputation methods, BRITS is an alternative: it learns the missing values via a bidirectional recurrent dynamical system

DCCA: Deep Canonical Correlation Analysis
- Learns complex non-linear transformations such that the resulting transformed data have high linear correlation
- A parametric non-linear extension of canonical correlation analysis (CCA), and an alternative to non-parametric methods like kernel canonical correlation analysis (KCCA)
- Shown to learn higher correlation representations than CCA/KCCA
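
A toy illustration of the objective, not of DCCA itself: two views can be strongly related yet have zero linear correlation, and a non-linear transformation of one view can expose the relationship. Here the transformation (squaring) is hand-picked, whereas DCCA would learn one network per view; the data and the `pearson` helper are assumptions for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# View 2 depends on view 1 only through its square.
view1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
view2 = [x * x for x in view1]

raw = pearson(view1, view2)                            # linear correlation of the raw views
transformed = pearson([x * x for x in view1], view2)   # after a non-linear map on view 1
```

The raw views are linearly uncorrelated while the transformed view correlates perfectly, which is the kind of gap a learned transformation is meant to close.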

An Empirical Analysis of Proximal Policy Optimization with Kronecker-factored Natural Gradients
- Covers the 2 SOTA methods for deep RL at the time (2018 / early 2019): PPO and ACKTR
- Attempts to combine the PPO objective with K-FAC natural gradient optimization: PPOKFAC
- Does not improve sample complexity; stick with either PPO or ACKTR for now