
ljing2007/deep-learning

 
 


Here are my personal deep learning notes. I've written this cheatsheet to keep track of my knowledge, but you can also use it as a guide for learning deep learning.

| 🗂 Data | 🧠 Layers | 📉 Loss | 📈 Metrics | 🔥 Training | ✅ Production |
|---|---|---|---|---|---|
| Pytorch dataset | Weight init | Cross entropy | | Optimizers | Ensemble |
| Pytorch dataloader | Activations | Weight Decay | | Transfer learning | TTA |
| Split | Self Attention | Label Smoothing | | Clean mem | Pseudolabeling |
| Normalization | Trained CNN | Mixup | | Half precision | Webserver (Flask) |
| Data augmentation | CoordConv | SoftF1 | | Multiple GPUs | Distillation |
| Deal imbalance | | | | Precomputation | Pruning |
| | | | | Set seed | Quantization (int8) |
| | | | | | TorchScript |
| | | | | | ONNX |

🗂 Data

Balance the data

If you cannot get more data for the underrepresented classes, you can compensate for the imbalance in code:

  • Fix it on the dataloader sampler:
    • Weighted Random Sampler
      • torch.utils.data.WeightedRandomSampler(weights=[…])
    • Subsample majority class. But you can lose important data.
      • catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="downsampling")
    • Oversample minority class. But you can overfit.
      • catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="upsampling")
  • Fix it on the loss function:
    • CrossEntropyLoss(weight=[…])
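
The two fixes above can be sketched as follows. This is a minimal example, assuming inverse class frequency as the weighting scheme (other schemes work too); the `labels` tensor is made up for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical labels of an imbalanced 2-class dataset (8 vs. 2 samples)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

class_counts = torch.bincount(labels)        # samples per class
class_weights = 1.0 / class_counts.float()   # assumption: inverse-frequency weights

# Fix on the sampler: one weight per *sample*
sample_weights = class_weights[labels]
sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# loader = torch.utils.data.DataLoader(ds, batch_size=32, sampler=sampler)

# Fix on the loss: one weight per *class*
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```

Use one fix or the other; applying both at the same time usually over-corrects the imbalance.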
Custom BalanceClassSampler

import random

import numpy as np
import torch

class BalanceClassSampler(torch.utils.data.Sampler):
    """
    Allows you to create stratified sample on unbalanced classes.
    Inspired from Catalyst's BalanceClassSampler:
    https://catalyst-team.github.io/catalyst/_modules/catalyst/data/sampler.html#BalanceClassSampler

    Args:
        labels: list of class label for each elem in the dataset
        mode: Strategy to balance classes. Must be one of [downsampling, upsampling]
    """

    def __init__(self, labels:list[int], mode:str = "upsampling"):

        labels = np.array(labels)
        self.unique_labels = set(labels)

        ########## STEP 1:
        # Compute the final_num_samples_per_label
        # An Integer
        num_samples_per_label = {label: (labels == label).sum() for label in self.unique_labels}

        if   mode == "upsampling":   self.final_num_samples_per_label = max(num_samples_per_label.values())
        elif mode == "downsampling": self.final_num_samples_per_label = min(num_samples_per_label.values())
        else:                        raise Exception("mode should be: \"downsampling\" or \"upsampling\"")

        ########## STEP 2:
        # Compute actual indices of every label.
        # A dictionary of lists
        self.indices_per_label = {label: np.arange(len(labels))[labels==label].tolist() for label in self.unique_labels}


    def __iter__(self): #-> Iterator[int]:

        indices = []
        for label in self.unique_labels:

            label_indices = self.indices_per_label[label]

            repeat_all_elements  = self.final_num_samples_per_label // len(label_indices)
            pick_random_elements = self.final_num_samples_per_label %  len(label_indices)

            indices += label_indices * repeat_all_elements # repeat the list several times
            indices += random.sample(label_indices, k=pick_random_elements)  # pick random idxs without repetition

        assert len(indices) == self.__len__()
        np.random.shuffle(indices) # Inplace shuffle the list

        return iter(indices)
    

    def __len__(self) -> int:
        return self.final_num_samples_per_label * len(self.unique_labels)

Split in train and validation

  • Training set: used for learning the parameters of the model.
  • Validation set: used for evaluating the model while training. Don’t create a random validation set! Manually create one so that it matches the distribution of your data. Usually 10% or 20% of your train set.
    • Alternative: N-fold cross-validation, usually with N = 10.
  • Test set: used to get a final estimate of how well the network works.
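
One way to get a validation set that matches the class distribution is a stratified split. The sketch below is an assumption about what "matching the distribution" means (here: same class proportions); `stratified_split` is a hypothetical helper, not a library function.

```python
import numpy as np

def stratified_split(labels, valid_frac=0.2, seed=0):
    """Return (train_idx, valid_idx) keeping the class proportions of `labels`."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)   # indices of this class
        rng.shuffle(idx)                        # shuffle inside the class only
        n_valid = max(1, int(round(valid_frac * len(idx))))
        valid_idx.extend(idx[:n_valid].tolist())
        train_idx.extend(idx[n_valid:].tolist())
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical dataset: 80 samples of class 0 and 20 of class 1
labels = [0] * 80 + [1] * 20
train_idx, valid_idx = stratified_split(labels, valid_frac=0.2)
```

With time series or grouped data (several images per patient, for example), class proportions are not enough and the split should also respect time or groups.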

Normalization

Scale the inputs to have mean 0 and a variance of 1. Also linear decorrelation/whitening/pca helps a lot. Normalization parameters are obtained only from train set, and then applied to both train and valid sets.

  • Option 1: Standardization x = (x - x.mean()) / x.std() Most used
    1. Mean subtraction: Center the data to zero. x = x - x.mean() fights vanishing and exploding gradients
    2. Standardize: Put the data on the same scale. x = x / x.std() improves convergence speed and accuracy
  • Option 2: PCA Whitening
    1. Mean subtraction: Center the data at zero. x = x - x.mean()
    2. Decorrelation or PCA: Rotate the data until there is no correlation anymore.
    3. Whitening: Put the data on the same scale. whitened = decorrelated / np.sqrt(eigVals + 1e-5)
  • Option 3: ZCA whitening Zero component analysis (ZCA).
  • Other options (rarely used):
    • (x-x.min()) / (x.max()-x.min()): Values from 0 to 1
    • 2*(x-x.min()) / (x.max()-x.min()) - 1: Values from -1 to 1
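
A minimal NumPy sketch of options 1 and 2, assuming tabular data of shape (samples, features). The data is random and only for illustration; note that the statistics come from the train set and are reused on the validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 500 train and 100 valid samples, 3 features
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
x_train = rng.normal(size=(500, 3)) @ mixing + 5.0
x_valid = rng.normal(size=(100, 3)) @ mixing + 5.0

# Option 1: standardization with train-set statistics only
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_train_std = (x_train - mean) / std
x_valid_std = (x_valid - mean) / std        # reuse the train statistics

# Option 2: PCA whitening
centered = x_train - mean                              # 1. mean subtraction
cov = np.cov(centered, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)
decorrelated = centered @ eig_vecs                     # 2. rotate: no correlation left
whitened = decorrelated / np.sqrt(eig_vals + 1e-5)     # 3. same scale on every axis
```

After whitening, the covariance matrix of `whitened` is approximately the identity.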

Data augmentation

  • Cutout: Remove parts
    • Parameter: Choose the right square size, 16px for example.
  • Mixup: Mix 2 samples (both x & y) x = λxᵢ + (1−λ)xⱼ & y = λyᵢ + (1−λ)yⱼ. Fast.ai doc
    • Parameter: Choose λ by sampling a Beta distribution with α=β=0.4 or 0.2 (this way the images will rarely be strongly mixed)
  • CutMix: Mix 2 samples in some parts. Fast.ai doc
  • AugMix: Does not lose information.
  • RandAugment
  • AutoAugment
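
The Mixup formula above can be sketched in NumPy as follows. It assumes one-hot targets and mixes each sample with a randomly chosen partner from the same batch; `mixup_batch` is a hypothetical helper name.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.4, rng=None):
    """Mix every sample of the batch with a random partner from the same batch."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # λ ~ Beta(α, α), mostly close to 0 or 1
    perm = rng.permutation(len(x))          # partner index j for every sample i
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

# Hypothetical batch: 4 samples with 3 features, 2 classes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.4, rng=rng)
```

Using a single λ for the whole batch is a simplification; sampling one λ per sample is also common.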

WandB post with TF2 code

Image data aug

| Augmentation | Description | Pillow |
|---|---|---|
| Rotate | Rotate some degrees | pil_img.rotate() |
| Translate | | pil_img.transform() |
| Shear | Affine transform | pil_img.transform() |
| Autocontrast | Equalize the histogram (linear) | PIL.ImageOps.autocontrast() |
| Equalize | Equalize the histogram (non-linear) | PIL.ImageOps.equalize() |
| Posterize | Reducing pixel bits | PIL.ImageOps.posterize() |
| Solarize | Inverting colors above a threshold | PIL.ImageOps.solarize() |
| Color | | PIL.ImageEnhance.Color() |
| Contrast | | PIL.ImageEnhance.Contrast() |
| Brightness | | PIL.ImageEnhance.Brightness() |
| Sharpness | Sharpens or blurs the image | PIL.ImageEnhance.Sharpness() |

Interpolation modes for rotate, translate, or affine transforms:

  • Image.BILINEAR
  • etc

🧠 Model

Weight init

Depends on the model's architecture. Try to avoid vanishing or exploding outputs. blog1, blog2.

  • Constant value: Very bad
  • Random:
    • Uniform: From 0 to 1. Or from -1 to 1. Bad
    • Normal: Mean 0, std=1. Better
  • Xavier initialization: Good for MLPs with tanh activation func. paper
    • Uniform:
    • Normal:
  • Kaiming initialization: Good for MLPs with ReLU activation func. (a.k.a. He initialization) paper
    • Uniform
    • Normal
    • When you use Kaiming, you have to change ReLU(x) to max(x,0) - 0.5 so that the activations keep a mean close to 0
  • Delta-Orthogonal initialization: Good for vanilla CNNs (10000 layers). Read this paper
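import torch.nn as nn

def weight_init(m):

	# LINEAR: Xavier-uniform weights and a small constant bias
	if isinstance(m, nn.Linear):
		nn.init.xavier_uniform_(m.weight)  # in-place version (xavier_uniform is deprecated)
		if m.bias is not None:
			m.bias.data.fill_(0.01)

	# CONVS: Xavier-uniform weights with the ReLU gain and zero bias
	classname = m.__class__.__name__
	if classname.find('Conv') != -1:
		nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('relu'))
		if m.bias is not None:
			nn.init.zeros_(m.bias)

# Applies weight_init recursively to every submodule
model.apply(weight_init)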

Activations

reference

  • Softmax: Single-label classification (last layer)
  • Sigmoid: Multi-label classification (last layer)
  • Hyperbolic tangent:
  • ReLU: Non-linearity component of the net (hidden layers) check this paper
  • ELU: Exponential Linear Unit. paper
  • SELU: Scaled Exponential Linear Unit. paper
  • PReLU or Leaky ReLU:
  • GLU: Gated Linear Unit. (from TabNet paper) blog linear1(x) * sigmoid(linear2(x))
  • SERLU:
  • Smoother ReLU. Differentiable. BEST
    • GeLU: Gaussian Error Linear Units. Used in transformers. paper. (2016)
    • Swish: x * sigmoid(x) paper (2017)
    • Elish: xxxx paper (2018)
    • Mish: x * tanh( ln(1 + e^x) ) paper (2019)
    • myActFunc 1 = 0.5 * x * ( tanh(x) + 1 )
    • myActFunc 2 = 0.5 * x * ( tanh (x+1) + 1)
    • myActFunc 3 = x * ((x+x+1)/(abs(x+1) + abs(x)) * 0.5 + 0.5)
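
Some of the smooth activations above, written out in NumPy as a sketch. The GELU here is the tanh approximation (an assumption; the exact form uses the Gaussian CDF), and the code is not numerically hardened for very large inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))   # x * tanh(softplus(x))

def gelu_tanh(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def my_act_func_1(x):
    return 0.5 * x * (np.tanh(x) + 1.0)

x = np.linspace(-5.0, 5.0, 11)
outputs = {f.__name__: f(x) for f in (swish, mish, gelu_tanh, my_act_func_1)}
```

Since (tanh(x) + 1) / 2 = sigmoid(2x), myActFunc 1 is the same function as x * sigmoid(2x), that is, a Swish with a steeper gate.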

CoordConv

import torch

class AddCoord2D(torch.nn.Module):
    def __init__(self, size):
        super().__init__()

        # Row (i) and column (j) coordinates, normalized to the range (0, 1]
        i_coord = torch.linspace(start=1/size, end=1, steps=size).view(size, -1).expand(-1, size)
        j_coord = torch.linspace(start=1/size, end=1, steps=size).view(-1, size).expand(size, -1)
        # Buffer instead of plain attribute: it follows the module on .to(device)
        self.register_buffer("coords", torch.stack([i_coord, j_coord]))

    def forward(self, x): # x shape: [BS, C, X, Y]
        BS = x.shape[0]
        # Append the 2 coordinate channels: output shape [BS, C+2, X, Y]
        return torch.cat((x, self.coords.expand(BS,-1,-1,-1)), dim=1)

🧐 Regularization

Dropout

During training, some neurons will be deactivated randomly. Hinton, 2012, Srivastava, 2014

Weight regularization

Weight penalty: Regularization in the loss function (penalize high weights). The weight decay hyper-parameter is usually 0.0005.

Visually, the weights only can take a value inside the blue region, and the red circles represent the minimum. Here, there are 2 weight variables.

| L1 (LASSO) | L2 (Ridge) | Elastic Net |
|---|---|---|
| wr1 | wr2 | wr3 |
| Shrinks coefficients to 0. Good for variable selection | Most used. Makes coefficients smaller | Tradeoff between variable selection and small coefficients |
| Penalizes the sum of absolute weights | Penalizes the sum of squared weights | Combination of the 2 before |
| loss + wd * weights.abs().sum() | loss + wd * weights.pow(2).sum() | |
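
The penalties can be added to any loss as sketched below. The tiny linear model and random data are placeholders, and the 50/50 Elastic Net mix (`l1_ratio`) is an assumption, not something the notes prescribe.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)                       # placeholder model
x, target = torch.randn(8, 4), torch.randn(8, 2)    # placeholder data
wd = 0.0005                                         # weight decay hyper-parameter

loss = torch.nn.functional.mse_loss(model(x), target)
weights = model.weight                              # penalize weights, not biases

loss_l1 = loss + wd * weights.abs().sum()           # L1 (LASSO)
loss_l2 = loss + wd * weights.pow(2).sum()          # L2 (Ridge)

l1_ratio = 0.5                                      # assumption: equal mix of L1 and L2
loss_elastic = loss + wd * (l1_ratio * weights.abs().sum()
                            + (1 - l1_ratio) * weights.pow(2).sum())

loss_l2.backward()                                  # gradients now include the penalty
```

In practice the L2 variant is usually obtained through the `weight_decay` argument of the optimizer instead of being added to the loss by hand.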

DropConnect

During training, some connections (weights) are deactivated randomly, instead of whole neurons as in Dropout. LeCun, 2013. This is very useful in the first layers.

Distillation

Knowledge Distillation (teacher-student): A teacher model teaches a student model.

  • Smaller student model → faster model.
    • Model compression: Less memory and computation
    • To generalize and avoid outliers.
    • Used in NLP transformers.
    • paper
  • Bigger student model → more accurate model.
    • Useful when you have extra unlabeled data (kaggle competitions)
    • 1. Train the teacher model with the labeled dataset.
    • 2. With the extra unlabeled dataset, generate pseudo labels (soft or hard labels)
    • 3. Train a student model on both labeled and pseudo-labeled datasets.
    • 4. Student becomes teacher and repeat -> 2.
    • Paper: When Does Label Smoothing Help?
    • Paper: Noisy Student
    • Video: Noisy Student

📉 Loss

Loss function

  • Regression
    • MBE: Mean Bias Error: mean(GT - pred) It could determine if the model has positive bias or negative bias.
    • MAE: Mean Absolute Error (L1 loss): mean(|GT - pred|) The most simple.
    • MSE: Mean Squared Error (L2 loss): mean((GT-pred)²) Penalizes large errors more than MAE. Most used
    • RMSE: Root Mean Squared Error: sqrt(MSE) Proportional to MSE. Value closer to MAE.
    • Percentage errors:
      • MAPE: Mean Absolute Percentage Error
      • MSPE: Mean Squared Percentage Error
      • RMSPE: Root Mean Squared Percentage Error
  • Classification
    • Cross Entropy: Single-label classification. Usually with softmax. nn.CrossEntropyLoss.
      • NLL: Negative Log Likelihood is the simplified version of cross entropy for one-hot encoded targets, see this nn.NLLLoss()
    • Binary Cross Entropy: Multi-label classification. Usually with sigmoid. nn.BCELoss
    • Hinge: Multi class SVM Loss nn.HingeEmbeddingLoss()
    • Focal loss: Similar to BCE but scaled down, so the network focuses more on incorrect and low confidence labels than on increasing its confidence in the already correct labels. -(1-p)^gamma * log(p) paper
  • Segmentation
    • Pixel-wise cross entropy
    • IoU (F0): (Pred ∩ GT)/(Pred ∪ GT) = TP / (TP + FP + FN)
    • Dice (F1): 2 * (Pred ∩ GT)/(Pred + GT) = 2·TP / (2·TP + FP + FN)
      • Range from 0 (worst) to 1 (best)
      • In order to formulate a loss function which can be minimized, we'll simply use 1 − Dice
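
A small NumPy sketch of IoU and Dice on binary masks, with made-up masks for illustration. It assumes at least one positive pixel in the prediction or the ground truth (otherwise the denominators are zero).

```python
import numpy as np

def iou_and_dice(pred, gt):
    """IoU and Dice for two binary masks of the same shape."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = np.sum(pred & gt)       # predicted and true
    fp = np.sum(pred & ~gt)      # predicted but not true
    fn = np.sum(~pred & gt)      # true but not predicted
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(dice)

pred = [1, 1, 1, 0, 0, 0]        # hypothetical flattened masks
gt   = [1, 1, 0, 1, 0, 0]
iou, dice = iou_and_dice(pred, gt)   # TP=2, FP=1, FN=1
dice_loss = 1 - dice
```

Both scores are linked by Dice = 2·IoU / (1 + IoU), so they always rank two predictions in the same order. To train with it, the hard masks are replaced by predicted probabilities (soft Dice).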

Label Smoothing

Smooth the one-hot target label.

LabelSmoothingCrossEntropy(eps:float=0.1, reduction='mean')
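
What the smoothing does can be sketched in NumPy. This assumes the common convention where a mass of eps is spread uniformly over all K classes, so the target becomes (1 - eps) * one_hot + eps / K; both function names are hypothetical.

```python
import numpy as np

def smooth_one_hot(labels, num_classes, eps=0.1):
    """(1 - eps) * one_hot + eps / num_classes"""
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1 - eps) + eps / num_classes

def label_smoothing_cross_entropy(logits, labels, eps=0.1):
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = smooth_one_hot(labels, logits.shape[1], eps)
    return float(-(targets * log_probs).sum(axis=1).mean())  # reduction='mean'

targets = smooth_one_hot(np.array([0]), num_classes=4, eps=0.1)
# targets[0] is [0.925, 0.025, 0.025, 0.025]

logits = np.array([[2.0, 0.5, 0.1, -1.0]])
loss_smooth = label_smoothing_cross_entropy(logits, np.array([0]), eps=0.1)
loss_plain = label_smoothing_cross_entropy(logits, np.array([0]), eps=0.0)
```

With eps=0 the function reduces to the plain cross entropy.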

Reference

📈 Metrics

Classification Metrics

Dataset with 5 disease images and 20 normal images. If the model predicts all images to be normal, its accuracy is 80%, and its F1-score is 0.89 when "normal" is taken as the positive class (but 0 when "disease" is the positive class).

  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • F1 Score: 2 * (Prec*Rec)/(Prec+Rec)
    • Precision: TP / (TP + FP) = TP / predicted positives
    • Recall: TP / (TP + FN) = TP / actual positives
  • Dice Score: 2 * (Pred ∩ GT)/(Pred + GT)
  • ROC, AUC:
  • Log loss:
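
The disease/normal example can be checked with a few lines of plain Python. `binary_metrics` is a hypothetical helper; it returns 0 for precision, recall, or F1 when their denominator is zero, which is a convention and not the only possible choice.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["disease"] * 5 + ["normal"] * 20
y_pred = ["normal"] * 25                      # the model predicts everything as normal

acc_n, prec_n, rec_n, f1_n = binary_metrics(y_true, y_pred, positive="normal")
acc_d, prec_d, rec_d, f1_d = binary_metrics(y_true, y_pred, positive="disease")
```

With "normal" as the positive class the F1 looks fine (about 0.89), but with "disease" as the positive class it is 0, so the rare class should be the positive one.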

🔥 Train

Learning Rate

How big the steps are during training.

  • Max LR: Compute it with LR Finder (lr_find())
  • LR schedule:
    • Constant: Never use.
    • Reduce it gradually: By steps, by a decay factor, with LR annealing, etc.
      • Flat + Cosine annealing: Flat start, and then at 50%-75%, start dropping the lr based on a cosine anneal.
    • Warm restarts (SGDWR, AdamWR):
    • OneCycle: Use LRFinder to know your maximum lr. Good for Adam.
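
The "flat + cosine annealing" schedule can be written as a plain function of the training step. This is a sketch with assumed parameter names (`flat_frac`, `min_lr`), not the API of any library.

```python
import math

def flat_cos_lr(step, total_steps, max_lr, flat_frac=0.75, min_lr=0.0):
    """Keep max_lr for the first flat_frac of training, then cosine-anneal to min_lr."""
    flat_steps = int(flat_frac * total_steps)
    if step < flat_steps:
        return max_lr
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)   # goes 0 -> 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [flat_cos_lr(s, total_steps=100, max_lr=0.1) for s in range(101)]
```

Here the lr stays at 0.1 for the first 75 steps and then follows half a cosine wave down to 0 at step 100.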

Batch size

Number of samples to learn simultaneously.

  • Batch size = 1: Train each sample individually. (Online gradient descent) ❌
  • Batch size = length(dataset): Train the whole dataset at once, as a batch. (Batch gradient descent) ❌
  • Batch size = number: Train disjoint groups of samples (Mini-batch gradient descent). ✅
    • Usually a power of 2. 32 or 64 are good values.
    • Too low (like 4): Lots of updates. Very noisy random updates in the net (bad).
    • Too high (like 512): Few updates. Very general common updates (bad).
      • Faster computation. Takes advantage of GPU mem. But sometimes it does not fit in memory (CUDA Out Of Memory)

Some people are trying to make a batch size finder according to this paper.

Number of epochs

Times to learn the whole dataset.

  • Train until the model starts overfitting (validation loss begins to increase) (early stopping)
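
A minimal early-stopping rule in plain Python. The `patience` parameter (how many epochs without improvement are tolerated) is an assumption that the notes do not mention, and the validation losses are made up.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch      # new best: keep this checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                 # no improvement for `patience` epochs
    return best_epoch, len(valid_losses) - 1         # never triggered

# Hypothetical validation losses: they start to increase after epoch 2
valid_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
best_epoch, stop_epoch = early_stopping_epoch(valid_losses, patience=3)
```

Training stops at epoch 5 and the weights saved at epoch 2 are the ones to keep.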

Optimizers

| Optimizer | Description | Paper | Fast.ai 2 | Score |
|---|---|---|---|---|
| SGD | Basic method. new_w = w - lr * grad_w | | SGD(lr=0.1) | |
| SGD with Momentum | Speed it up with momentum, usually mom=0.9 | | SGD(lr=0.1, mom=0.9) | |
| AdaGrad | Adaptive lr | 2011 | - | |
| RMSProp | Similar to momentum but with the gradient squared. | 2012 | RMSProp(lr=0.1) | |
| Adam | Momentum + RMSProp. | 2014 | Adam(lr=0.1, wd=0) | |
| LARS | Compute lr for each layer with a certain trust. | 2017 | Larc(lr=0.1, clip=False) | |
| LARC | Original LARS clipped to be always less than lr | | Larc(lr=0.1, clip=True) | |
| AdamW | Adam + decoupled weight decay | 2017 | | |
| AMSGrad | Worse than Adam in practice. (AdamX: new version) | 2018 | | |
| QHAdam | Quasi-Hyperbolic Adam | 2018 | QHAdam(lr=0.1) | |
| LAMB | LARC with Adam | 2019 | Lamb(lr=0.1) | |
| NovoGrad | . | 2019 | | |
| Lookahead | Stabilizes training at the rest of training. | 2019 | Lookahead(SGD(lr=0.1)) | |
| RAdam | Rectified Adam. Stabilizes training at the start. | 2019 | RAdam(lr=0.1) | |
| Ranger | RAdam + Lookahead. | 2019 | ranger() | ⭐⭐⭐ |
| RangerLars | RAdam + Lookahead + LARS. (aka Over9000) | 2019 | | ⭐⭐⭐ |
| Ralamb | RAdam + LARS. | 2019 | | |
| Selective-Backprop | Faster training by focusing on the biggest losers. | 2019 | | |
| DiffGrad | Solves Adam’s "overshoot" issue | 2019 | | |
| AdaMod | Optimizer with memory | 2019 | | |
| DeepMemory | DiffGrad + AdaMod | | | |

  • SGD: new_w = w - lr * gradient_w
  • SGD with Momentum: Usually mom=0.9.
    • mom=0.9 means that 10% of the step is the current gradient and 90% is the direction I went last time.
    • step = (0.1 * gradient_w) + (0.9 * prev_step), then new_w = w - lr * step
    • Other common values are 0.5, 0.7 and 0.99.
  • RMSProp (Adaptive lr) From 2012. Similar to momentum but with the gradient squared.
    • sq_avg = (0.1 * gradient_w²) + (0.9 * prev_sq_avg), then new_w = w - lr * gradient_w / sqrt(sq_avg)
    • If the gradient is not very volatile, take greater steps. Otherwise, take smaller steps.
  • DiffGrad
  • AdaMod
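
The three update rules can be tried on a toy problem. The sketch below minimizes f(w) = w² (minimum at w = 0) in plain Python; the function, learning rates, and the `eps` term are assumptions for illustration, and the momentum uses the exponential-moving-average form (10% gradient, 90% previous step), which differs slightly from the default of `torch.optim.SGD`.

```python
import math

def grad(w):
    return 2.0 * w                    # gradient of f(w) = w**2

def sgd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def sgd_momentum(w, lr=0.1, mom=0.9, steps=200):
    step = 0.0
    for _ in range(steps):
        step = (1 - mom) * grad(w) + mom * step       # 10% new gradient, 90% last step
        w = w - lr * step
    return w

def rmsprop(w, lr=0.01, alpha=0.9, eps=1e-8, steps=200):
    sq_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        sq_avg = (1 - alpha) * g * g + alpha * sq_avg  # moving average of g²
        w = w - lr * g / (math.sqrt(sq_avg) + eps)     # eps avoids division by zero
    return w

w_sgd, w_mom, w_rms = sgd(1.0), sgd_momentum(1.0), rmsprop(1.0)
```

All three move from w = 1 towards the minimum. Adam combines the `step` average of momentum with the `sq_avg` average of RMSProp.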

Optimizers in Fast.ai

You can build every optimizer by doing 2 things:

  1. Stats: keep track of what is going on with the parameters
  2. Steppers: Figure out how to update the parameters

TODO: Read:

Set seed

import os
import random

import numpy as np
import torch

def seed_everything(seed):
	os.environ['PYTHONHASHSEED'] = str(seed)
	random.seed(seed)         # Random
	np.random.seed(seed)      # Numpy
	torch.manual_seed(seed)   # Pytorch
	torch.cuda.manual_seed(seed)
	torch.backends.cudnn.deterministic = True
	torch.backends.cudnn.benchmark     = False
	#tf.random.set_seed(seed) # Tensorflow

Clean mem

Read this

import gc

import torch

def clean_mem():
	gc.collect()
	torch.cuda.empty_cache()

Multiple GPUs

learn.to_parallel()

Reference

https://dev.fast.ai/distributed

Half precision

learn.to_fp16()
learn.to_fp32()

Reference

http://dev.fast.ai/callback.fp16

✅ Production

Webserver

SERVER (Flask)

import numpy as np
import torch
from torchvision import models
import torchvision.transforms as transforms
from PIL import Image
from flask import Flask, jsonify, request
import json


app = Flask(__name__)
app.config['JSON_SORT_KEYS'] = False

classes = json.load(open('imagenet_classes.json'))
model = models.densenet121(pretrained=True)
model.eval()

def pre_process(image_file):
    my_transforms = transforms.Compose([transforms.Resize(255),
                                        transforms.CenterCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    image = Image.open(image_file).convert('RGB') # force 3 channels (grayscale or RGBA uploads)
    return my_transforms(image).unsqueeze(0) # unsqueeze is for the BS dim

def post_process(logits):
    vals, idxs = logits.softmax(1).topk(5)
    vals = vals[0].numpy()
    idxs = idxs[0].numpy()
    result = {}
    for idx, val in zip(idxs, vals):
        result[classes[idx]] = round(float(val), 4)
    return result

def get_prediction(image_file):
    with torch.no_grad():
        image_tensor  = pre_process(image_file)
        output = model.forward(image_tensor)
        return post_process(output)

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        image_file = request.files['my_img_file']
        result_dict = get_prediction(image_file)
        #return jsonify(result_dict)
        #return json.dumps(result_dict)
        return result_dict

if __name__ == '__main__':
    app.run()

Run server

FLASK_ENV=development FLASK_APP=app.py flask run

CLIENT (command line)

curl -X POST -F my_img_file=@cardigan.jpg http://localhost:5000/predict

CLIENT (python)

import requests
resp = requests.post("http://localhost:5000/predict",
                     files={"my_img_file": open('cardigan.jpg','rb')})
print(resp.json())

Example server response

{
  "cardigan": 0.7083, 
  "wool": 0.0837, 
  "suit": 0.0431, 
  "Windsor_tie": 0.031, 
  "trench_coat": 0.0307
}

Quantization

3 options

| Method | What | Accuracy | Pytorch API |
|---|---|---|---|
| Dynamic Quantization | Weights only | Good | qmodel = torch.quantization.quantize_dynamic(model, dtype=torch.qint8) |
| Post Training Quantization | Weights and activations | Good | model.qconfig = torch.quantization.default_qconfig; torch.quantization.prepare(model, inplace=True); torch.quantization.convert(model, inplace=True) |
| Quantization-Aware Training | Weights and activations | Best | torch.quantization.prepare_qat -> torch.quantization.convert |

Reference

Pruning

import torch.nn.utils.prune as prune

parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
    (model.fc2, 'weight'),
    (model.fc3, 'weight'),
)

Percentage Pruning

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
class ThresholdPruning(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"
    def __init__(self, threshold): self.threshold = threshold
    def compute_mask(self, tensor, default_mask): return torch.abs(tensor) > self.threshold
    
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=ThresholdPruning,
    threshold=0.01
)

See pruning results

def pruned_info(model):
    print("Weights pruned:")
    print("==============")
    total_pruned, total_weights = 0,0
    for name, chil in model.named_children():
        layer_pruned  = torch.sum(chil.weight == 0)
        layer_weights = chil.weight.nelement()
        total_pruned += layer_pruned
        total_weights  += layer_weights

        print(name, "\t{:.2f}%".format(100 * float(layer_pruned)/ float(layer_weights)))
    print("==============")
    print("Total\t{:.2f}%".format(100 * float(total_pruned)/ float(total_weights)))
    
# Weights pruned:
# ==============
# conv1  1.85%
# conv2  8.10%
# fc1    19.76%
# fc2    10.66%
# fc3    9.40%
# ==============
# Total  17.90%

Iterative magnitude pruning is an iterative process of removing connections (Prune/Train/Repeat):

  1. Train a big model
  2. Do early stopping
  3. Compress model
    • Prune: Find the 15% of weights with the smallest magnitude and set them to zero.
    • Train: Then finetune the model until it reaches within 99.5% of its original validation accuracy.
    • Repeat: Then prune another 15% of the smallest magnitude weights and finetune.

After successive rounds, you will have pruned 15%, 30%, 45%, 60%, 75%, and 90% of the weights of your original model.

Reference

TorchScript

An intermediate representation of a PyTorch model

torch_script = torch.jit.script(MyModel())
torch_script.save("my_model_script.pt")

Reference

ONNX

torch.onnx.export(model, img, f, verbose=False, opset_version=11)  # Export to onnx

# Check onnx model
import onnx

model = onnx.load(f)  # load onnx model
onnx.checker.check_model(model)  # check onnx model
print(onnx.helper.printable_graph(model.graph))  # print a human readable representation of the graph
print('Export complete. ONNX model saved to %s\nView with https://github.com/lutzroeder/netron' % f)

Reference

🧐 Improve generalization and avoid overfitting

(try in that order)

  1. Get more data
    • Similar datasets: Get a similar dataset for your problem.
    • Create your own dataset
      • Segmentation annotation with Polygon-RNN++
    • Synthetic data: Virtual objects and scenes instead of real images. Infinite possibilities of lighting, colors, angles...
  2. Data augmentation: Augment your current data. (albumentations for faster aug. using the GPU)
    • Test time augmentation (TTA): The same augmentations will also be applied when we are predicting (inference). It can improve our results if we run inference multiple times for each sample and average out the predictions.
    • AutoAugment: RL for data augmentation. It transfers NOT THE WEIGHTS but the policies of how to do data augmentation.
  3. Regularization
    • Dropout. Usually 0.5
    • Weight penalty: Regularization in the loss function (penalize high weights). Usually 0.0005
      • L1 regularization: penalizes the sum of absolute weights.
      • L2 regularization: penalizes the sum of squared weights by a factor, usually 0.01 or 0.1.
      • Weight decay: wd * w. Sometimes mathematically identical to L2 reg.
  4. Reduce model complexity: Limit the number of hidden layers and the number of units per layer.
    • Generalizable architectures?: Add more batchnorm layers, more densenets...
  5. Ensembles: Gather a bunch of models to give a final prediction. kaggle ensembling guide
    • Combination methods:
      • Ensembling: Merge final output (average, weighted average, majority vote, weighted majority vote).
      • Meta ensembling: Same but use a new model to produce the final output. (also called stacking or blending)
    • Models generation techniques:
      • Stacking: Just use different classifiers algorithms.
      • Bagging (Bootstrap aggregating): Each model trained with a subset of the training data. Used in random forests. Prob of sample being selected: 0.632 Prob of sample in Out Of Bag 0.368
      • Boosting: The predictors are not made independently, but sequentially. Used in gradient boosting.
      • Snapshot Ensembling: Only for neural nets. M models for the cost of 1. Thanks to SGD with restarts you have several local minima that you can average. paper.
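
Two of the points above in code: where the 0.632 of bagging comes from, and how the combination methods can disagree. The predictions and weights are made-up numbers.

```python
import math
import numpy as np

def prob_in_bootstrap(n):
    """Probability that a given sample appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n          # tends to 1 - 1/e ≈ 0.632

p_in = prob_in_bootstrap(1000)
p_out_of_bag = 1.0 - p_in                      # ≈ 0.368

# Hypothetical class probabilities of 3 models for one sample (2 classes)
preds = np.array([[0.7, 0.3],
                  [0.6, 0.4],
                  [0.1, 0.9]])
model_weights = np.array([0.5, 0.3, 0.2])      # e.g. based on validation scores

average = preds.mean(axis=0)                   # simple average
weighted_average = model_weights @ preds       # weighted average
majority_vote = np.bincount(preds.argmax(axis=1)).argmax()
```

Here the simple average picks class 1 (one model is very confident), while the majority vote and the weighted average pick class 0.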

Other tricks:

  • Label Smoothing: Smooth the one-hot target label
  • Knowledge Distillation: A bigger trained net (teacher) helps to train the network (student) paper

🕓 Train faster

  • Transfer learning: Use a pretrained model and retrain it with your data.
    1. Replace last layer
    2. Fine-tune new layers
    3. Fine-tune more layers (optional)
  • Batch Normalization: Add BatchNorm layers after your convolutions and linear layers to make things easier for your net and train faster.
  • Precomputation
    1. Freeze the layers you don’t want to modify
    2. Calculate the activations of the last frozen layer (for your entire dataset)
    3. Save those activations to disk
    4. Use those activations as the input of your trainable layers
  • Half precision (fp16)
  • Multiple GPUs
  • 2nd order optimization

Normalization inside network:

  • Batch Normalization paper
  • Layer Normalization paper
  • Instance Normalization paper
  • Group Normalization paper

Supervised DL

  • Structured
  • Unstructured
    • Vision: Image, Video. Check my vision repo
    • Audio: Sound, music, speech. Check my audio repo. Audio overview
    • NLP: Text, Genomics. Check my NLP repo
    • Knowledge Graph (KG): Graph Neural Networks (GNN)
    • Trees
      • math expressions
      • syntax
      • Models: Tree-LSTM, RNNGrammar (RNNG).
      • Tree2seq by Polish notation. Open question: only for binary trees?

Autoencoder

  • Standard autoencoders: Made to reconstruct the input. No continuous latent space.
    • Simple Autoencoder: Same input and output net with a smaller middle hidden layer (bottleneck layer, latent vector).
    • Denoising Autoencoder (DAE): Adds noise to the input to learn how to remove noise.
    • Only have a reconstruction loss (pixel mean squared error for example)
  • Variational Autoencoder (VAE): Initially trained as a reconstruction problem, but later we can play with the latent vector to generate new outputs. The latent space needs to be continuous.
    • Latent vector: Is modified by adding gaussian noise (normal distribution, mean and std vectors) during training.
    • Loss: loss = reconstruction loss + latent loss
      • Reconstruction loss: Keeps the output similar to the input (mean squared error)
      • Latent loss: Keeps the latent space continuous (KL divergence)
    • Disentangled Variational Autoencoder (β-VAE): Improved version. Each parameter of the latent vector is devoted to tweaking 1 characteristic. paper.
      • β too small: Overfitting. It learns to reconstruct your training data, but it won't generalize
      • β too big: Loses high definition details. Worse performance.
  • Hierarchical VAE (HVAE):
    • Can be thought of as a series of VAEs stacked on top of each other
  • NVAE: Hierarchical VAE to the extreme
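
The VAE loss above can be sketched in NumPy. It assumes the encoder outputs a mean and a log-variance vector (a common parameterization of the "mean and std vectors"), uses a summed squared error as reconstruction loss, and exposes the β of the β-VAE as a parameter.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Latent vector = mean + std * gaussian noise (used during training)."""
    std = np.exp(0.5 * log_var)
    return mu + std * rng.standard_normal(mu.shape)

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # Reconstruction loss: keeps the output similar to the input
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Latent loss: KL divergence between N(mu, std²) and N(0, 1)
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return recon + beta * kl, recon, kl

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))                       # hypothetical batch: 4 samples, 6 features
mu, log_var = np.ones((4, 2)), np.zeros((4, 2))   # hypothetical encoder output (2 latent dims)
z = reparameterize(mu, log_var, rng)              # what the decoder would receive
total, recon, kl = vae_loss(x, x_recon=x, mu=mu, log_var=log_var, beta=4.0)
```

With a perfect reconstruction only the latent loss remains: each latent dimension with mean 1 and std 1 contributes 0.5, so kl = 1.0 and total = β · kl = 4.0.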

Neural Representations

  • 2D: [x,y]->[R,G,B]
  • 3D: [x,y,z]->[R,G,B,alpha]
  • Input coordinates with sine & cos (positional encoding) NeRF
  • Replacing the ReLU activations with sine functions SIREN
  • Input coordinates into a Fourier feature space Fourier
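
A sketch of the sine and cosine positional encoding of the input coordinates in NumPy. The frequencies 2^k · π follow the NeRF formulation; the number of frequencies and the output ordering (all sines first, then all cosines, per coordinate) are choices of this example.

```python
import numpy as np

def positional_encoding(coords, num_freqs=4):
    """Map each coordinate p to sin(2^k π p) and cos(2^k π p) for k = 0..num_freqs-1."""
    coords = np.asarray(coords, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi            # π, 2π, 4π, 8π, ...
    angles = coords[..., None] * freqs                        # shape (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*coords.shape[:-1], -1)                # shape (..., D * 2 * num_freqs)

# 2D case: 5 pixel positions [x, y] in the range [0, 1]
xy = np.random.default_rng(0).uniform(size=(5, 2))
encoded = positional_encoding(xy, num_freqs=4)                # 2 coords -> 16 features
```

The network then learns [encoded x, y] -> [R, G, B] instead of [x, y] -> [R, G, B], which helps it represent high-frequency detail.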

Improvements over NeRF

| Description | Website | Video | Paper |
|---|---|---|---|
| NeRF in the Wild | web | 3:41 | Aug 2020 |
| NeRF++ | | | Oct 2020 |
| Deformable NeRF (nerfies) | web | 7:26 | Nov 2020 |
| NeRF with time dimension | web | 2:21 | Nov 2020 |
| NeRF with better weight init | web | 3:54 | Dec 2020 |

Graph Neural Networks

Semi-supervised DL

Check this kaggle discussion

Reinforcement Learning

Reinforcement learning reference


Resources


Antor TODO

Automatic feature engineering

How to start a competition/ML project

  1. Data exploration: what does the data that we are going to work with look like?
  2. Think about input representation
    • Is it redundant?
    • Does it need to be converted to something else?
    • Keep the most entropy, such that you can still reconstruct the raw data
  3. Look at the metric
    • Does it make sense?
    • Is it differentiable?
    • Can I build a good enough equivalent of the metric?
  4. Build a toy model and overfit it with 1 or a few samples
    • To make sure that nothing is really broken

JPEG: 2 levels of compression:

  • Entropy
  • Chroma

LIDAR

  • Projections: bad representation (complicated things with voxels)
  • Dense matrix (antor): It's a depth map, I think. Not projections: the native output of the sensor, but condensed into a dense matrix

Unordered set (point cloud, molecules)

  • Point net
  • transformer without positional encoding
    • AtomTransformer (by antor)
    • MoleculeTransformer (by antor)

TODO

About

🦅 Deep Learning awesome cheatsheet
