Additional Readings


This is a growing list of papers and implementations I find interesting.

Long Tailed Recognition

  • Large-Scale Long-Tailed Recognition in an Open World
    • Real-world scenarios frequently contain new, unseen classes as well as tail classes with very few samples
    • The paper tackles this with dynamic embeddings that bring an associative memory to aid the prediction of long-tailed classes
    • In essence, the model combines direct image features with embeddings recalled from other classes
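
As a rough illustration of combining a direct feature with embeddings from other classes, the sketch below adds an attention-weighted mix of class centroids to an image feature. The function name `memory_augmented_feature`, the dot-product softmax attention, and the scalar `gate` are assumptions made for this example; the paper learns these components rather than fixing them.

```python
import numpy as np

def memory_augmented_feature(direct, centroids, gate=1.0):
    """Add an attention-weighted mix of class centroids to a direct feature.

    direct:    (d,) feature extracted from the image itself
    centroids: (num_classes, d) one stored embedding per class (the "memory")
    gate:      how strongly the recalled memory is mixed in (hypothetical scalar)
    """
    scores = centroids @ direct              # similarity to each class centroid
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    memory = weights @ centroids             # feature recalled from memory
    return direct + gate * memory, weights

centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
direct = np.array([0.2, 0.9])
enriched, weights = memory_augmented_feature(direct, centroids)
print(enriched, weights)
```

A tail-class sample with a weak direct feature can then borrow structure from the centroids of better-populated classes it resembles.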

Better Generalization (Overfitting Prevention or Regularization)

  • Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
    • They propose domain randomization: train deep learning models on synthetic data, then transfer them to real-world data
    • The idea is that with sufficient variability in the textures of synthetic data, real-world data becomes another variation of the synthetic data
    • It works surprisingly well for such a simple technique: varying image textures makes CNNs more robust to texture variation
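
A minimal sketch of the texture-randomization idea, assuming NumPy and a binary object mask standing in for a real simulator. The helper names `random_texture` and `randomize_scene` and the three texture types are illustrative choices, not the paper's rendering pipeline.

```python
import numpy as np

def random_texture(height, width, rng):
    """Return an (H, W, 3) texture: solid colour, gradient, or checkerboard."""
    kind = rng.choice(["solid", "gradient", "checker"])
    c1, c2 = rng.random(3), rng.random(3)    # two random RGB colours in [0, 1)
    if kind == "solid":
        mix = np.zeros((height, width))
    elif kind == "gradient":
        mix = np.tile(np.linspace(0.0, 1.0, width), (height, 1))
    else:
        cell = int(rng.integers(2, 9))       # checker cell size in pixels
        rows, cols = np.indices((height, width))
        mix = ((rows // cell + cols // cell) % 2).astype(float)
    # Blend the two colours per pixel according to the pattern.
    return (1.0 - mix)[..., None] * c1 + mix[..., None] * c2

def randomize_scene(object_mask, rng):
    """Render one image with independent random textures for object and background."""
    h, w = object_mask.shape
    background = random_texture(h, w, rng)
    foreground = random_texture(h, w, rng)
    return np.where(object_mask[..., None], foreground, background)

rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                      # the object's position is the label
img_a = randomize_scene(mask, rng)
img_b = randomize_scene(mask, rng)
```

The label (here, the mask) stays fixed while the appearance changes on every draw, so a model trained on such images cannot rely on any particular texture.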


  • Online Learning Rate Adaptation with Hypergradient Descent
    • Reduces the need for learning rate scheduling with SGD, SGD with Nesterov momentum, and Adam
    • Uses the concept of hypergradients (gradients w.r.t. learning rate) obtained via reverse-mode automatic differentiation to dynamically update learning rates in real-time alongside weight updates
    • Adds little extra computation and memory: it only needs one additional copy of the previous gradients stored in memory
    • Severely under-appreciated paper
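
The hypergradient update for plain SGD can be sketched in a few lines. This is a pure-Python illustration on a toy quadratic, not the authors' implementation; `sgd_hd`, `grad_fn`, and the chosen values of `alpha` and `beta` are assumptions made for the example.

```python
def sgd_hd(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
    """SGD whose learning rate alpha is itself updated by gradient descent."""
    prev_grad = [0.0] * len(theta)  # the one extra gradient copy kept in memory
    for _ in range(steps):
        grad = grad_fn(theta)
        # The derivative of the loss w.r.t. alpha is -(grad . prev_grad),
        # so a descent step on alpha adds beta * (grad . prev_grad).
        alpha += beta * sum(g * p for g, p in zip(grad, prev_grad))
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        prev_grad = grad
    return theta, alpha

# Toy problem: minimise 0.5 * ||theta||^2, whose gradient is theta itself.
theta, alpha = sgd_hd(lambda th: list(th), [5.0, -3.0],
                      alpha=0.01, beta=1e-3, steps=200)
print(theta, alpha)
```

While consecutive gradients point the same way, their dot product is positive and the learning rate grows; once the steps start to overshoot, the dot product turns negative and the rate shrinks.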

Network Compression

  • Energy-constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking
    • More production applications of DNNs run in low-energy environments such as self-driving cars, VR goggles, and drones
    • As such, it's critical to optimize a DNN not only for its primary performance (accuracy etc.) but also for its energy consumption
    • During DNN training, the paper adds an energy budget constraint on top of the other optimization objectives
    • This allows optimization of multiple objectives simultaneously (top-1 accuracy and energy consumption for example)
    • It's done through weighted sparse projection and layer input masking
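
One way to picture a projection onto an energy budget: after a training step, zero out weights until the energy of the remaining ones fits the budget, while changing the weights as little as possible. The sketch below is a greedy approximation under assumed per-weight energy costs; `project_to_energy_budget` and the cost numbers are hypothetical, and layer input masking is not covered.

```python
def project_to_energy_budget(weights, energy_costs, budget):
    """Greedy sketch: keep the weights that matter most per unit of energy.

    Keeping weight i costs energy_costs[i] (assumed positive); dropping it
    moves the weight vector by weights[i] ** 2 in squared distance. Choosing
    what to keep is a knapsack problem, approximated here by ranking weights
    on squared magnitude per unit of energy.
    """
    order = sorted(range(len(weights)),
                   key=lambda i: weights[i] ** 2 / energy_costs[i],
                   reverse=True)
    projected = [0.0] * len(weights)
    used = 0.0
    for i in order:
        if used + energy_costs[i] <= budget:
            projected[i] = weights[i]
            used += energy_costs[i]
    return projected, used

weights = [0.9, -0.1, 0.5, 0.05]
costs = [2.0, 1.0, 1.0, 3.0]   # hypothetical energy cost of each nonzero weight
projected, used = project_to_energy_budget(weights, costs, budget=3.0)
print(projected, used)
```

Applying such a projection after every optimizer step keeps the model inside the budget throughout training instead of compressing it only at the end.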

Network Pruning

Bayesian Deep Learning

  • Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
    • Introduces Variational Adam (Vadam), an alternative to variational inference via dropout
    • Vadam perturbs the network's weights during backpropagation, yielding uncertainty estimates at low computational cost
    • Not as good as dropout in terms of performance, but a promising direction for computationally cheaper options


  • A Unified Approach to Interpreting Model Predictions
    • Introduces SHAP (SHapley Additive exPlanations)
    • "SHAP assigns each feature an importance value for a particular prediction"
      • Higher positive SHAP values (red) = increase the probability of the class
      • Higher negative SHAP values (blue) = decrease the probability of the class
  • Hierarchical interpretations for neural network predictions
    • Given a prediction from a deep neural network, agglomerative contextual decomposition (ACD) produces a hierarchical clustering of the input features, along with each cluster's contribution to the final prediction
    • The hierarchical clustering is then optimized to identify the learned clusters that drive the DNN's predictions
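
To make the SHAP entry above concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every feature subset. It does not use the SHAP library; `shapley_values`, the toy model, and the choice to "remove" a feature by swapping in a baseline value are assumptions for this example, and the enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for predict(x), relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features outside the subset are replaced by their baseline value.
        masked = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(masked)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Toy model with one interaction term between features 0 and 2.
    return 2 * z[0] - 3 * z[1] + z[0] * z[2]

phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

Positive values push the output above the baseline prediction and negative values pull it below; the values always sum to the difference between the prediction and the baseline.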



  • Netron
    • Easily visualize your saved deep learning models (PyTorch .pth, TensorFlow .pb, MXNet .model, ONNX, and more)
    • You can even check out each node's documentation quickly in the interface

Missing Values

    • If missing data in your time series is a problem and existing imputation methods fall short, BRITS is an alternative: it learns the missing values with a bidirectional recurrent dynamical system


  • DCCA: Deep Canonical Correlation Analysis
    • Learns complex non-linear transformations of two views of the data such that the transformed representations are highly linearly correlated
    • A parametric alternative to kernel canonical correlation analysis (KCCA), the non-parametric, non-linear extension of canonical correlation analysis (CCA)
    • Shown to learn higher correlation representations than CCA/KCCA
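
The quantity being maximized can be sketched with NumPy as the sum of canonical correlations between two views. In DCCA the views would be the outputs of two trained networks; here fixed transformations stand in for them, and `total_correlation` and the regularizer `reg` are assumptions for this illustration.

```python
import numpy as np

def total_correlation(h1, h2, reg=1e-4):
    """Sum of canonical correlations between two views (rows are samples)."""
    n = h1.shape[0]
    h1 = h1 - h1.mean(axis=0)
    h2 = h2 - h2.mean(axis=0)
    # Regularised covariance of each view and the cross-covariance.
    s11 = h1.T @ h1 / (n - 1) + reg * np.eye(h1.shape[1])
    s22 = h2.T @ h2 / (n - 1) + reg * np.eye(h2.shape[1])
    s12 = h1.T @ h2 / (n - 1)

    def inv_sqrt(mat):
        # Inverse matrix square root via eigendecomposition (mat is symmetric).
        vals, vecs = np.linalg.eigh(mat)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The singular values of t are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False).sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
linear_view = x @ np.array([[2.0, 1.0], [0.0, 1.0]])
cubed_view = x ** 3

corr_linear = total_correlation(x, linear_view)
corr_cubed = total_correlation(x, cubed_view)
corr_undone = total_correlation(x, np.cbrt(cubed_view))
print(corr_linear, corr_cubed, corr_undone)
```

The cubed view is only partially linearly correlated with the original, while undoing the cube with a non-linear map recovers nearly the full correlation. That is the kind of transformation DCCA is meant to learn.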

Deep Reinforcement Learning