Here are my personal deep learning notes. I wrote this cheatsheet to keep track of my knowledge, but you can also use it as a guide for learning deep learning.
🗂 Data | 🧠 Layers | 📉 Loss | 📈 Metrics | 🔥 Training | ✅ Production |
---|---|---|---|---|---|
Pytorch dataset | Weight init | Cross entropy | | Optimizers | Ensemble |
Pytorch dataloader | Activations | Weight Decay | | Transfer learning | TTA |
Split | Self Attention | Label Smoothing | | Clean mem | Pseudolabeling |
Normalization | Trained CNN | Mixup | | Half precision | Webserver (Flask) |
Data augmentation | CoordConv | SoftF1 | | Multiple GPUs | Distillation |
Deal imbalance | | | | Precomputation | Pruning |
 | | | | Set seed | Quantization (int8) |
 | | | | | TorchScript |
 | | | | | ONNX |
If you cannot get more data for the underrepresented classes, you can fix the imbalance in code:
- Fix it in the dataloader `sampler`:
  - Weighted Random Sampler: `torch.utils.data.WeightedRandomSampler(weights=[…])`
  - Subsample the majority class, but you can lose important data: `catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="downsampling")`
  - Oversample the minority class, but you can overfit: `catalyst.data.sampler.BalanceClassSampler(labels=ds.targets, mode="upsampling")`
- Fix it in the loss function: `CrossEntropyLoss(weight=[…])`
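The `weights=[…]` arguments above are left open in these notes. A minimal sketch of one common choice, inverse class frequency, could look like this. The toy dataset, the 90/10 split and the variable names are assumptions made for illustration, not part of the original notes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced dataset (hypothetical): 90 samples of class 0, 10 of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
features = torch.randn(len(labels), 4)
dataset = TensorDataset(features, labels)

# Sampler route: one weight per sample, the inverse frequency of its class.
class_counts = torch.bincount(labels)                 # tensor([90, 10])
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

# Loss route: keep the sampling untouched and weight each class in the loss.
class_weights = len(labels) / (len(class_counts) * class_counts.float())
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```

With these weights both classes carry the same total sampling mass, so batches are balanced on average. Usually only one of the two routes is used at a time.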
Custom BalanceClassSampler
import random
import numpy as np
import torch

class BalanceClassSampler(torch.utils.data.Sampler):
    """
    Allows you to create a stratified sample on unbalanced classes.
    Inspired by Catalyst's BalanceClassSampler:
    https://catalyst-team.github.io/catalyst/_modules/catalyst/data/sampler.html#BalanceClassSampler
    Args:
        labels: list of class labels, one for each element in the dataset
        mode: Strategy to balance classes. Must be one of [downsampling, upsampling]
    """
    def __init__(self, labels: list[int], mode: str = "upsampling"):
        labels = np.array(labels)
        self.unique_labels = set(labels)
        ########## STEP 1:
        # Compute the final_num_samples_per_label
        # An integer (cast to a plain int so list repetition below is safe)
        num_samples_per_label = {label: (labels == label).sum() for label in self.unique_labels}
        if mode == "upsampling": self.final_num_samples_per_label = int(max(num_samples_per_label.values()))
        elif mode == "downsampling": self.final_num_samples_per_label = int(min(num_samples_per_label.values()))
        else: raise ValueError('mode should be: "downsampling" or "upsampling"')
        ########## STEP 2:
        # Compute the actual indices of every label.
        # A dictionary of lists
        self.indices_per_label = {label: np.arange(len(labels))[labels == label].tolist() for label in self.unique_labels}

    def __iter__(self):  # -> Iterator[int]
        indices = []
        for label in self.unique_labels:
            label_indices = self.indices_per_label[label]
            repeat_all_elements = self.final_num_samples_per_label // len(label_indices)
            pick_random_elements = self.final_num_samples_per_label % len(label_indices)
            indices += label_indices * repeat_all_elements  # repeat the list several times
            indices += random.sample(label_indices, k=pick_random_elements)  # pick random idxs without repetition
        assert len(indices) == self.__len__()
        np.random.shuffle(indices)  # In-place shuffle of the list
        return iter(indices)

    def __len__(self) -> int:
        return self.final_num_samples_per_label * len(self.unique_labels)
- Training set: used for learning the parameters of the model.
- Validation set: used for evaluating the model while training. Don’t create a random validation set! Manually create one so that it matches the distribution of your data. Usually `10%` or `20%` of your train set.
  - N-fold cross-validation. Usually `10` folds.
- Test set: used to get a final estimate of how well the network works.
Scale the inputs to have mean 0 and variance 1. Linear decorrelation (whitening, PCA) also helps a lot. Normalization parameters are computed only from the train set, and then applied to both the train and validation sets.
- Option 1: Standardization `x = (x - x.mean()) / x.std()`. Most used.
  - Mean subtraction: Center the data at zero. `x = x - x.mean()`. Fights vanishing and exploding gradients.
  - Standardize: Put the data on the same scale. `x = x / x.std()`. Improves convergence speed and accuracy.
- Option 2: PCA Whitening
  - Mean subtraction: Center the data at zero. `x = x - x.mean()`
  - Decorrelation or PCA: Rotate the data until there is no correlation anymore.
  - Whitening: Put the data on the same scale. `whitened = decorrelated / np.sqrt(eigVals + 1e-5)`
- Option 3: ZCA whitening (zero component analysis).
- Other options, not used:
  - `(x - x.min()) / (x.max() - x.min())`: Values from 0 to 1
  - `2 * (x - x.min()) / (x.max() - x.min()) - 1`: Values from -1 to 1
- In the case of images, the scale is already known (0 to 255), so normalizing is not strictly necessary.
- neural networks data preparation
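As a rough illustration of options 1 and 2, here is a small NumPy sketch. The toy data, the `mixing` matrix and the variable names are assumptions made up for this example. The point it tries to show is that all statistics come from the train set and are then reused on the validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated toy data: 500 train and 100 validation rows, 3 features.
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.1]])
x_train = rng.normal(size=(500, 3)) @ mixing + 5.0
x_valid = rng.normal(size=(100, 3)) @ mixing + 5.0

# Option 1: standardization, with statistics taken from the train set only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train_std = (x_train - mean) / std
x_valid_std = (x_valid - mean) / std      # reuse the train statistics

# Option 2: PCA whitening.
centered = x_train - mean                             # mean subtraction
cov = centered.T @ centered / len(centered)           # feature covariance
eig_vals, eig_vecs = np.linalg.eigh(cov)              # cov is symmetric
decorrelated = centered @ eig_vecs                    # rotate onto the principal axes
whitened = decorrelated / np.sqrt(eig_vals + 1e-5)    # same scale on every axis
```

After whitening, the covariance of `whitened` is close to the identity matrix. Rotating back with `whitened @ eig_vecs.T` would give the ZCA variant.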
- Cutout: Remove parts
  - Parameter: Choose the right square size, for example 16px.
- Mixup: Mix 2 samples (both x & y): `x = λxᵢ + (1−λ)xⱼ` and `y = λyᵢ + (1−λ)yⱼ`. Fast.ai doc
  - Parameter: Choose `λ` by sampling a Beta distribution with α=β=0.4 or 0.2 (this way the images will only rarely be mixed)
- CutMix: Mix 2 samples in some parts. Fast.ai doc
- AugMix: Does not lose information.
- RandAugment
- AutoAugment
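The Mixup formula above can be sketched in a few lines of PyTorch. This is a simplified version that draws a single `λ` per batch and pairs each sample with a random partner from the same batch. The function name `mixup_batch` and the toy tensors are hypothetical.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y_onehot, alpha=0.4):
    """Mix every sample i with a random partner j from the same batch."""
    lam = float(np.random.beta(alpha, alpha))        # λ ~ Beta(α, α)
    perm = torch.randperm(x.size(0))                 # partner index j for every i
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

images = torch.rand(8, 3, 32, 32)                    # hypothetical image batch
labels = F.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
mixed_images, mixed_labels, lam = mixup_batch(images, labels)
```

With a small α, the Beta distribution puts most of its mass near 0 and 1, so most batches stay close to one of the two original samples.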
Augmentation | Description | Pillow |
---|---|---|
Rotate | Rotate some degrees | pil_img.rotate() |
Translate | | pil_img.transform() |
Shear | Affine transform | pil_img.transform() |
Autocontrast | Equalize the histogram (linear) | PIL.ImageOps.autocontrast() |
Equalize | Equalize the histogram (non-linear) | PIL.ImageOps.equalize() |
Posterize | Reducing pixel bits | PIL.ImageOps.posterize() |
Solarize | Inverting colors above a threshold | PIL.ImageOps.solarize() |
Color | | PIL.ImageEnhance.Color() |
Contrast | | PIL.ImageEnhance.Contrast() |
Brightness | | PIL.ImageEnhance.Brightness() |
Sharpness | Sharpen or blurs the image | PIL.ImageEnhance.Sharpness() |
Interpolation options when rotating, translating or applying an affine transform:
- Image.BILINEAR
- etc
Depends on the model's architecture. Try to avoid vanishing or exploding outputs. blog1, blog2.
- Constant value: Very bad
- Random:
- Uniform: From 0 to 1. Or from -1 to 1. Bad
- Normal: Mean 0, std=1. Better
- Xavier initialization: Good for MLPs with tanh activation func. paper
- Uniform:
- Normal:
- Kaiming initialization: Good for MLPs with ReLU activation func. (a.k.a. He initialization) paper
- Uniform
- Normal
- When you use Kaiming, you have to shift ReLU so that `ReLU(x)` equals `max(x, 0) - 0.5`, for a correct mean (0)
- Delta-Orthogonal initialization: Good for vanilla CNNs (10000 layers). Read this paper
import torch
import torch.nn as nn

def weight_init(m):
    # LINEAR
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)  # in-place version; xavier_uniform (no underscore) is deprecated
        m.bias.data.fill_(0.01)
    # CONVS
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('relu'))
        nn.init.zeros_(m.bias)

model.apply(weight_init)  # applies weight_init recursively to every submodule
- Softmax: Single-label classification (last layer)
- Sigmoid: Multi-label classification (last layer)
- Hyperbolic tangent:
- ReLU: Non-linearity component of the net (hidden layers) check this paper
- ELU: Exponential Linear Unit. paper
- SELU: Scaled Exponential Linear Unit. paper
- PReLU or Leaky ReLU:
- GLU: Gated Linear Unit. (from TabNet paper) blog `linear1(x) * sigmoid(linear2(x))`
- SERLU:
- Smoother ReLU. Differentiable. BEST
  - GeLU: Gaussian Error Linear Units. Used in transformers. paper. (2016)
  - Swish: `x * sigmoid(x)` paper (2017)
  - Elish: `xxxx` paper (2018)
  - Mish: `x * tanh( ln(1 + e^x) )` paper (2019)
  - myActFunc 1 = `0.5 * x * ( tanh(x) + 1 )`
  - myActFunc 2 = `0.5 * x * ( tanh (x+1) + 1)`
  - myActFunc 3 = `x * ((x+x+1)/(abs(x+1) + abs(x)) * 0.5 + 0.5)`
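Some of the formulas above translate directly into PyTorch. The sketch below writes Swish, Mish and GLU by hand. The function names and the `linear1`/`linear2` arguments are illustrative choices, not an existing API.

```python
import torch
import torch.nn.functional as F

def swish(x):
    # x * sigmoid(x)
    return x * torch.sigmoid(x)

def mish(x):
    # x * tanh(ln(1 + e^x)); softplus computes ln(1 + e^x) in a numerically stable way
    return x * torch.tanh(F.softplus(x))

def glu(x, linear1, linear2):
    # One linear branch carries the values, the other one gates them
    return linear1(x) * torch.sigmoid(linear2(x))

x = torch.linspace(-3, 3, steps=7)
print(swish(x))
print(mish(x))
```

Recent PyTorch versions also ship these as `F.silu` (Swish) and `F.mish`, which are usually preferable to hand-written versions.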
import torch

class AddCoord2D(torch.nn.Module):
    def __init__(self, len):
        super(AddCoord2D, self).__init__()
        i_coord = torch.linspace(start=1/len, end=1, steps=len).view(len, -1).expand(-1, len)
        j_coord = torch.linspace(start=1/len, end=1, steps=len).view(-1, len).expand(len, -1)
        self.coords = torch.stack([i_coord, j_coord])  # shape: [2, len, len]
        print(self.coords.shape)

    def forward(self, x):  # x shape: [BS, C, X, Y]
        BS = x.shape[0]
        return torch.cat((x, self.coords.expand(BS, -1, -1, -1)), dim=1)
During training, some neurons will be deactivated randomly. Hinton, 2012, Srivastava, 2014
Weight penalty: Regularization in the loss function (penalizes high weights). The weight decay hyper-parameter is usually `0.0005`.
Visually, the weights can only take a value inside the blue region, and the red circles represent the minimum. Here, there are 2 weight variables.
At training and inference, some connections (weights) will be deactivated permanently. LeCun, 2013. This is very useful in the first layers.
Knowledge Distillation (teacher-student): a teacher model teaches a student model.
- Smaller student model → faster model.
- Model compression: Less memory and computation
- To generalize and avoid outliers.
- Used in NLP transformers.
- paper
- Bigger student model → more accurate model.
- Useful when you have extra unlabeled data (kaggle competitions)
- 1. Train the teacher model with the labeled dataset.
- 2. With the extra unlabeled dataset, generate pseudo labels (soft or hard labels)
- 3. Train a student model on both the labeled and pseudo-labeled datasets.
- 4. Student becomes teacher and repeat -> 2.
- Paper: When Does Label Smoothing Help?
- Paper: Noisy Student
- Video: Noisy Student
- Regression
  - MBE: Mean Bias Error: `mean(GT - pred)`. It can tell whether the model has a positive or a negative bias.
  - MAE: Mean Absolute Error (L1 loss): `mean(|GT - pred|)`. The simplest.
  - MSE: Mean Squared Error (L2 loss): `mean((GT-pred)²)`. Penalizes large errors more than MAE. Most used.
  - RMSE: Root Mean Squared Error: `sqrt(MSE)`. Grows with MSE, but is in the same units as the target, so its value is closer to MAE.
  - Percentage errors:
    - MAPE: Mean Absolute Percentage Error
    - MSPE: Mean Squared Percentage Error
    - RMSPE: Root Mean Squared Percentage Error
- Classification
  - Cross Entropy: Single-label classification. Usually with softmax. `nn.CrossEntropyLoss`.
    - NLL: Negative Log Likelihood is the one-hot encoded target simplified version, see this `nn.NLLLoss()`
  - Binary Cross Entropy: Multi-label classification. Usually with sigmoid. `nn.BCELoss`
  - Hinge: Multi class SVM Loss `nn.HingeEmbeddingLoss()`
  - Focal loss: Similar to BCE but scaled down, so the network focuses more on incorrect and low confidence labels than on increasing its confidence in the already correct labels. `-(1-p)^gamma * log(p)` paper
- Segmentation
  - Pixel-wise cross entropy
  - IoU (F0): `(Pred ∩ GT)/(Pred ∪ GT)` = `TP / (TP + FP + FN)`
  - Dice (F1): `2 * (Pred ∩ GT)/(Pred + GT)` = `2·TP / (2·TP + FP + FN)`
    - Range from `0` (worst) to `1` (best)
    - In order to formulate a loss function which can be minimized, we'll simply use `1 − Dice`
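A minimal sketch of the `1 − Dice` loss and of IoU on binary masks, assuming predictions arrive as probabilities in `[0, 1]` with shape `[batch, H, W]`. The function names and the `eps` smoothing term are assumptions of this example, not something taken from the notes.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """1 - Dice, computed per sample on flattened masks and then averaged."""
    probs = probs.flatten(start_dim=1)
    target = target.flatten(start_dim=1)
    intersection = (probs * target).sum(dim=1)
    dice = (2 * intersection + eps) / (probs.sum(dim=1) + target.sum(dim=1) + eps)
    return 1 - dice.mean()

def iou_score(pred_mask, target, eps=1e-6):
    """Intersection over union for hard (0/1) masks, averaged over the batch."""
    pred_mask = pred_mask.flatten(start_dim=1)
    target = target.flatten(start_dim=1)
    intersection = (pred_mask * target).sum(dim=1)
    union = pred_mask.sum(dim=1) + target.sum(dim=1) - intersection
    return ((intersection + eps) / (union + eps)).mean()

gt = torch.tensor([[[1.0, 1.0], [0.0, 0.0]]])      # one 2x2 ground-truth mask
pred = torch.tensor([[[1.0, 0.0], [0.0, 0.0]]])    # covers half of the object
print(soft_dice_loss(pred, gt), iou_score(pred, gt))
```

In this toy case the intersection is 1 pixel, so Dice is 2·1/(1+2) ≈ 0.667 and IoU is 1/2, which matches the TP/FP/FN forms above (TP=1, FP=0, FN=1).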
Smooth the one-hot target label.
LabelSmoothingCrossEntropy(eps:float=0.1, reduction='mean')
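To make the smoothing concrete, here is a small sketch of one common formulation, where the one-hot target is mixed with a uniform distribution: `(1 - eps) * one_hot + eps / num_classes`. The helper names are hypothetical and this is not the fastai implementation itself.

```python
import torch
import torch.nn.functional as F

def smooth_one_hot(targets, num_classes, eps=0.1):
    """Turn class indices into smoothed one-hot vectors."""
    one_hot = F.one_hot(targets, num_classes).float()
    return one_hot * (1 - eps) + eps / num_classes

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross entropy against the smoothed targets, averaged over the batch."""
    log_probs = F.log_softmax(logits, dim=-1)
    soft_targets = smooth_one_hot(targets, logits.size(-1), eps)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

logits = torch.randn(4, 5)
targets = torch.tensor([0, 2, 1, 4])
print(smooth_one_hot(torch.tensor([2]), num_classes=4))
print(label_smoothing_ce(logits, targets))
```

With `eps=0.1` and 4 classes, the true class gets 0.925 and every other class gets 0.025, so the network is never pushed towards infinitely confident logits.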
Dataset with 5 disease images and 20 normal images. If the model predicts all images to be normal, its accuracy is 80%, and the F1-score of such a model (taking normal as the positive class) is about 0.89.
- Accuracy: `(TP + TN) / (TP + TN + FP + FN)`
- F1 Score: `2 * (Prec*Rec)/(Prec+Rec)`
  - Precision: `TP / (TP + FP)` = `TP / predicted positives`
  - Recall: `TP / (TP + FN)` = `TP / actual positives`
- Dice Score: `2 * (Pred ∩ GT)/(Pred + GT)`
- ROC, AUC:
- Log loss:
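The disease/normal example can be checked by hand with a few lines of plain Python. The helper `classification_metrics` is a hypothetical name. The sketch also shows how much the F1 score depends on which class is treated as positive.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 5 disease and 20 normal images, and a model that always predicts "normal".
normal_as_positive = classification_metrics(tp=20, tn=0, fp=5, fn=0)
disease_as_positive = classification_metrics(tp=0, tn=20, fp=0, fn=5)
print(normal_as_positive)
print(disease_as_positive)
```

Accuracy is 0.8 in both cases, but F1 drops from about 0.89 to 0 once the disease class is the positive one, which is usually the more honest view of such a model.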
How big the steps are during training.
- Max LR: Compute it with LR Finder (`lr_find()`)
- LR schedule:
  - Constant: Never use.
  - Reduce it gradually: By steps, by a decay factor, with LR annealing, etc.
  - Flat + Cosine annealing: Flat start, and then at 50%-75%, start dropping the lr based on a cosine anneal.
  - Warm restarts (SGDWR, AdamWR):
  - OneCycle: Use LRFinder to know your maximum lr. Good for Adam.
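As an illustration of the OneCycle schedule in plain PyTorch (these notes mostly use fastai), here is a small sketch with `torch.optim.lr_scheduler.OneCycleLR`. The tiny model, the `max_lr=0.1` value and the number of steps are placeholders. In practice `max_lr` would come from the LR finder.

```python
import torch

model = torch.nn.Linear(10, 2)                     # hypothetical tiny model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
total_steps = 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=total_steps)

lrs = []
for step in range(total_steps):
    # A real loop would run forward and backward here before optimizer.step()
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])    # lr used at this step
    scheduler.step()

print(lrs[0], max(lrs), lrs[-1])
```

With the default settings the lr starts at `max_lr / 25`, warms up to `max_lr` over the first 30% of the steps, and then anneals down to a value far below the starting one.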
Number of samples to learn simultaneously.
- `Batch size = 1`: Train each sample individually. (Online gradient descent) ❌
- `Batch size = length(dataset)`: Train the whole dataset at once, as a batch. (Batch gradient descent) ❌
- `Batch size = number`: Train disjoint groups of samples (Mini-batch gradient descent). ✅
  - Usually a power of 2. `32` or `64` are good values.
  - Too low, like `4`: Lots of updates. Very noisy random updates in the net (bad).
  - Too high, like `512`: Few updates. Very general common updates (bad).
    - Faster computation. Takes advantage of GPU memory. But sometimes it does not fit (CUDA Out Of Memory)

Some people are trying to make a batch size finder according to this paper.
Times to learn the whole dataset.
- Train until it starts overfitting (validation loss begins to increase) (early stopping)
- http://dev.fast.ai/optimizer
- https://github.com/jettify/pytorch-optimizer
- https://github.com/lessw2020/Best-Deep-Learning-Optimizers
Optimizer | Description | Paper | Fast.ai 2 | Score |
---|---|---|---|---|
SGD | Basic method. `new_w = w - lr * grad_w` | | `SGD(lr=0.1)` | |
SGD with Momentum | Speed it up with momentum, usually `mom=0.9` | | `SGD(lr=0.1, mom=0.9)` | |
AdaGrad | Adaptive lr | 2011 | - | |
RMSProp | Similar to momentum but with the gradient squared. | 2012 | `RMSProp(lr=0.1)` | |
Adam | Momentum + RMSProp. | 2014 | `Adam(lr=0.1, wd=0)` | ⭐ |
LARS | Compute lr for each layer with a certain trust. | 2017 | `Larc(lr=0.1, clip=False)` | |
LARC | Original LARS clipped to be always less than lr | | `Larc(lr=0.1, clip=True)` | |
AdamW | Adam + decoupled weight decay | 2017 | | |
AMSGrad | Worse than Adam in practice. (AdamX: new version) | 2018 | | |
QHAdam | Quasi-Hyperbolic Adam | 2018 | `QHAdam(lr=0.1)` | |
LAMB | LARC with Adam | 2019 | `Lamb(lr=0.1)` | |
NovoGrad | . | 2019 | | |
Lookahead | Stabilizes training at the rest of training. | 2019 | `Lookahead(SGD(lr=0.1))` | |
RAdam | Rectified Adam. Stabilizes training at the start. | 2019 | `RAdam(lr=0.1)` | |
Ranger | RAdam + Lookahead. | 2019 | `ranger()` | ⭐⭐⭐ |
RangerLars | RAdam + Lookahead + LARS. (aka Over9000) | 2019 | | ⭐⭐⭐ |
Ralamb | RAdam + LARS. | 2019 | | |
Selective-Backprop | Faster training by focusing on the biggest losers. | 2019 | | |
DiffGrad | Solves Adam’s "overshoot" issue | 2019 | | |
AdaMod | Optimizer with memory | 2019 | | |
DeepMemory | DiffGrad + AdaMod | | | |
- SGD: `new_w = w - lr * gradient_w`
- SGD with Momentum: Usually `mom=0.9`.
  - `mom=0.9` means that `10%` is the normal derivative and `90%` is the same direction I went last time: `step = (0.1 * gradient_w) + (0.9 * previous_step)`, then `new_w = w - lr * step`
  - Other common values are `0.5`, `0.7` and `0.99`.
- RMSProp (Adaptive lr) From 2012. Similar to momentum but with the gradient squared.
  - `sq_avg = (0.1 * gradient_w²) + (0.9 * previous_sq_avg)`, then `new_w = w - lr * gradient_w / sqrt(sq_avg)`
  - If the gradient is not so volatile, take greater steps. Otherwise, take smaller steps.
- DiffGrad
- AdaMod
You can build every optimizer by doing 2 things:
- Stats: keep track of what is going on with the parameters
- Steppers: Figure out how to update the parameters
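The "stats + stepper" idea can be sketched with NumPy on a toy problem. Each optimizer below keeps one running statistic (an average of gradients or of squared gradients) and uses it to decide the update. The quadratic loss, the hyper-parameter values and the function names are assumptions made for this example.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * sum(w**2) (hypothetical example)
    return w

def sgd_momentum(w, lr=0.1, mom=0.9, steps=200):
    step = np.zeros_like(w)                            # stat: running average of gradients
    for _ in range(steps):
        step = mom * step + (1 - mom) * grad(w)        # 90% previous direction, 10% new gradient
        w = w - lr * step                              # stepper
    return w

def rmsprop(w, lr=0.01, sqr_mom=0.9, eps=1e-8, steps=500):
    sq_avg = np.zeros_like(w)                          # stat: running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        sq_avg = sqr_mom * sq_avg + (1 - sqr_mom) * g ** 2
        w = w - lr * g / (np.sqrt(sq_avg) + eps)       # stepper: divide by the gradient scale
    return w

w0 = np.array([3.0, -2.0])
w_momentum = sgd_momentum(w0.copy())
w_rmsprop = rmsprop(w0.copy())
print(w_momentum, w_rmsprop)
```

Both runs move the weights from `[3, -2]` towards the minimum at zero. Adam combines the two statistics in a single update.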
TODO: Read:
- Efficient BackProp (1998, Yann LeCun)
- LR finder
- Superconvergence
- A disciplined approach to neural network hyper-parameters (2018, Leslie Smith)
- The 1cycle policy
import os
import random
import numpy as np
import torch

def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)                          # Random
    np.random.seed(seed)                       # Numpy
    torch.manual_seed(seed)                    # Pytorch
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    #tf.random.set_seed(seed)                  # Tensorflow
Read this
import gc
import torch

def clean_mem():
    gc.collect()               # free unreferenced Python objects first
    torch.cuda.empty_cache()   # then release cached GPU memory
learn.to_parallel()
learn.to_fp16()
learn.to_fp32()
import numpy as np
import torch
from torchvision import models
import torchvision.transforms as transforms
from PIL import Image
from flask import Flask, jsonify, request
import json

app = Flask(__name__)
app.config['JSON_SORT_KEYS'] = False
classes = json.load(open('imagenet_classes.json'))
model = models.densenet121(pretrained=True)
model.eval()

def pre_process(image_file):
    my_transforms = transforms.Compose([transforms.Resize(255),
                                        transforms.CenterCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    image = Image.open(image_file).convert('RGB')  # force 3 channels (grayscale or RGBA uploads)
    return my_transforms(image).unsqueeze(0)  # unsqueeze is for the BS dim

def post_process(logits):
    vals, idxs = logits.softmax(1).topk(5)
    vals = vals[0].numpy()
    idxs = idxs[0].numpy()
    result = {}
    for idx, val in zip(idxs, vals):
        result[classes[idx]] = round(float(val), 4)
    return result

def get_prediction(image_file):
    with torch.no_grad():
        image_tensor = pre_process(image_file)
        output = model(image_tensor)
    return post_process(output)

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        image_file = request.files['my_img_file']
        result_dict = get_prediction(image_file)
        #return jsonify(result_dict)
        #return json.dumps(result_dict)
        return result_dict

if __name__ == '__main__':
    app.run()
FLASK_ENV=development FLASK_APP=app.py flask run
curl -X POST -F my_img_file=@cardigan.jpg http://localhost:5000/predict
import requests
resp = requests.post("http://localhost:5000/predict",
files={"my_img_file": open('cardigan.jpg','rb')})
print(resp.json())
{
"cardigan": 0.7083,
"wool": 0.0837,
"suit": 0.0431,
"Windsor_tie": 0.031,
"trench_coat": 0.0307
}
Method | What | Accuracy | Pytorch API |
---|---|---|---|
Dynamic Quantization | Weights only | Good | qmodel = torch.quantization.quantize_dynamic(model, dtype=torch.qint8) |
Post Training Quantization | Weights and activations | Good | model.qconfig = torch.quantization.default_qconfig torch.quantization.prepare(model, inplace=True) torch.quantization.convert(model, inplace=True) |
Quantization-Aware Training | Weights and activations | Best | torch.quantization.prepare_qat -> torch.quantization.convert |
import torch
import torch.nn.utils.prune as prune

parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
    (model.fc2, 'weight'),
    (model.fc3, 'weight'),
)
# Globally prune the 20% of weights with the smallest magnitude
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Custom method: prune every weight whose magnitude is below a threshold
class ThresholdPruning(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"
    def __init__(self, threshold): self.threshold = threshold
    def compute_mask(self, tensor, default_mask): return torch.abs(tensor) > self.threshold

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=ThresholdPruning,
    threshold=0.01
)

def pruned_info(model):
    print("Weights pruned:")
    print("==============")
    total_pruned, total_weights = 0, 0
    for name, chil in model.named_children():
        layer_pruned = torch.sum(chil.weight == 0)
        layer_weights = chil.weight.nelement()
        total_pruned += layer_pruned
        total_weights += layer_weights
        print(name, "\t{:.2f}%".format(100 * float(layer_pruned) / float(layer_weights)))
    print("==============")
    print("Total\t{:.2f}%".format(100 * float(total_pruned) / float(total_weights)))
# Weights pruned:
# ==============
# conv1 1.85%
# conv2 8.10%
# fc1 19.76%
# fc2 10.66%
# fc3 9.40%
# ==============
# Total 17.90%
Iterative magnitude pruning is an iterative process of removing connections (Prune/Train/Repeat):
- Train a big model
- Do early stopping
- Compress model
- Prune: Find the 15% of weights with the smallest magnitude and set them to zero.
- Train: Then finetune the model until it reaches within 99.5% of its original validation accuracy.
- Repeat: Then prune another 15% of the smallest magnitude weights and finetune.
Along the way you get versions of your original model with 15%, 30%, 45%, 60%, 75%, and 90% of the weights pruned.
- Code:
- Papers:
- Deep Compression (2015)
- Train Large, Then Compress (2020)
- Neural Networks are Surprisingly Modular (2020)
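The prune/train/repeat loop described above could be sketched with `torch.nn.utils.prune` roughly as follows. The tiny model and the empty `finetune` function are placeholders invented for this example. A real run would train in `finetune` until validation accuracy recovers. An integer `amount` is used so that every round removes 15% of the original number of weights.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 4))  # hypothetical model
params = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
total = sum(m.weight.nelement() for m, _ in params)     # 384 weights
per_round = int(0.15 * total)                           # 15% of the original weights

def finetune(model):
    """Placeholder: train here until accuracy is close to the original again."""
    pass

def sparsity(params):
    zeros = sum(int((m.weight == 0).sum()) for m, _ in params)
    return zeros / total

history = []
for round_idx in range(3):
    # Remove the `per_round` smallest-magnitude weights among those still alive
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=per_round)
    finetune(model)
    history.append(sparsity(params))

print(history)   # fraction of zeroed weights after each round
```

Already pruned weights stay at zero because each call combines its new mask with the existing one, so the sparsity grows by roughly 15 points per round.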
torch_script = torch.jit.script(MyModel())
torch_script.save("my_model_script.pt")
torch.onnx.export(model, img, f, verbose=False, opset_version=11) # Export to onnx
# Check onnx model
import onnx
model = onnx.load(f) # load onnx model
onnx.checker.check_model(model) # check onnx model
print(onnx.helper.printable_graph(model.graph)) # print a human readable representation of the graph
print('Export complete. ONNX model saved to %s\nView with https://github.com/lutzroeder/netron' % f)
- Get more data
- Similar datasets: Get a similar dataset for your problem.
- Create your own dataset
- Segmentation annotation with Polygon-RNN++
- Synthetic data: Virtual objects and scenes instead of real images. Infinite possibilities of lighting, colors, angles...
- Data augmentation: Augment your current data. (albumentations for faster aug.)
- Test time augmentation (TTA): The same augmentations are also applied when predicting (inference). It can improve the results if we run inference multiple times for each sample and average the predictions.
- AutoAugment: RL for data augmentation. It transfers NOT THE WEIGHTS but the policies of how to do data augmentation.
- Regularization
  - Dropout. Usually `0.5`
  - Weight penalty: Regularization in the loss function (penalizes high weights). Usually `0.0005`
    - L1 regularization: penalizes the sum of absolute weights.
    - L2 regularization: penalizes the sum of squared weights by a factor, usually `0.01` or `0.1`.
    - Weight decay: `wd * w`. Sometimes mathematically identical to L2 reg.
- Reduce model complexity: Limit the number of hidden layers and the number of units per layer.
- Generalizable architectures?: Add more batchnorm layers, more densenets...
- Ensembles: Gather a bunch of models to give a final prediction. kaggle ensembling guide
  - Combination methods:
    - Ensembling: Merge final output (average, weighted average, majority vote, weighted majority vote).
    - Meta ensembling: Same but use a new model to produce the final output. (also called stacking or blending)
  - Models generation techniques:
    - Stacking: Just use different classifier algorithms.
    - Bagging (Bootstrap aggregating): Each model is trained with a subset of the training data. Used in random forests. Prob of a sample being selected: `0.632`. Prob of a sample being Out Of Bag: `0.368`
    - Boosting: The predictors are not made independently, but sequentially. Used in gradient boosting.
    - Snapshot Ensembling: Only for neural nets. M models for the cost of 1. Thanks to SGD with restarts you have several local minima that you can average. paper.
- Label Smoothing: Smooth the one-hot target label
- Knowledge Distillation: A bigger trained net (teacher) helps the network paper
- Transfer learning: Use a pretrained model and retrain with your data.
- Replace last layer
- Fine-tune new layers
- Fine-tune more layers (optional)
- Batch Normalization: Add BatchNorm layers after your convolutions and linear layers to make things easier for your net and train faster.
- Precomputation
- Freeze the layers you don’t want to modify
- Calculate the activations the last layer from the frozen layers(for your entire dataset)
- Save those activations to disk
- Use those activations as the input of your trainable layers
- Half precision (fp16)
- Multiple GPUs
- 2nd order optimization
- Structured
- Tabular
- xDeepFM
- Andres solution to ieee-fraud-detection
- NODE: Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data paper
- Continuous variables: Feed them directly to the network
- Categorical variable: Use embeddings
- Collaborative filtering: When you have users and items. Useful for recommendation systems.
- Singular Value Decomposition (SVD)
- Metrics: Mean Average Precision (MAP)
- Time series
- Arimax
- IoT sensors
- Geospatial: Do Kaggle course
- Unstructured
- Vision: Image, Video. Check my vision repo
- Audio: Sound, music, speech. Check my audio repo. Audio overview
- NLP: Text, Genomics. Check my NLP repo
- Knowledge Graph (KG): Graph Neural Networks (GNN)
- Trees
- math expressions
- syntax
- Models: Tree-LSTM, RNNGrammar (RNNG).
- Tree2seq by Polish notation. Question: only for binary trees?
- Standard autoencoders: Made to reconstruct the input. No continuous latent space.
- Simple Autoencoder: Same input and output net with a smaller middle hidden layer (bottleneck layer, latent vector).
- Denoising Autoencoder (DAE): Adds noise to the input to learn how to remove noise.
- Only have a reconstruction loss (pixel mean squared error for example)
- Variational Autoencoder (VAE): Initially trained as a reconstruction problem, but later we can play with the latent vector to generate new outputs. The latent space needs to be continuous.
- Latent vector: Is modified by adding gaussian noise (normal distribution, mean and std vectors) during training.
- Loss: `loss = reconstruction loss + latent loss`
- Reconstruction loss: Keeps the output similar to the input (mean squared error)
- Latent loss: Keeps the latent space continuous (KL divergence)
- Disentangled Variational Autoencoder (β-VAE): Improved version. Each parameter of the latent vector is devoted to tweaking 1 characteristic. paper.
- β too small: Overfitting. It learns to reconstruct your training data, but it won't generalize
- β too big: Loses high definition details. Worse performance.
- Hierarchical VAE (HVAE):
- Can be thought of as a series of VAEs stacked on top of each other
- NVAE: Hierarchical VAE to the extreme
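The VAE loss described above (reconstruction loss plus a KL latent loss, weighted by β in the β-VAE case) could be sketched like this. It assumes the encoder outputs a mean `mu` and a log-variance `logvar` per latent dimension. The function names and this parameterization are assumptions of this example rather than something fixed by the notes.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + std * noise, so gradients can flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """Reconstruction loss + beta * KL(q(z|x) || N(0, I)), averaged over the batch."""
    batch_size = x.size(0)
    recon = F.mse_loss(recon_x, x, reduction="sum") / batch_size
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
    return recon + beta * kl, recon, kl

x = torch.rand(4, 10)                              # hypothetical inputs
mu, logvar = torch.ones(4, 2), torch.zeros(4, 2)   # hypothetical encoder outputs
z = reparameterize(mu, logvar)
total, recon, kl = vae_loss(x, x, mu, logvar, beta=4.0)
print(total, recon, kl)
```

Setting `beta=1.0` gives the plain VAE loss. Larger values push the latent distribution harder towards the standard normal, which is the trade-off listed in the β bullets.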
- 2D: [x,y]->[R,G,B]
- 3D: [x,y,z]->[R,G,B,alpha]
- Input coordinates with sine & cos (positional encoding) NeRF
- Replacing the ReLU activations with sine functions SIREN
- Input coordinates into a Fourier feature space Fourier
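A minimal sketch of the sine and cosine positional encoding for input coordinates, in the spirit of NeRF: each coordinate is expanded at frequencies `2^k · π`. The function name, the number of frequencies and the feature ordering (all sines first, then all cosines, per coordinate) are choices made for this example, not the reference implementation.

```python
import math
import torch

def positional_encoding(coords, num_freqs=6):
    """coords: [N, D] in [-1, 1]  ->  features: [N, D * 2 * num_freqs]."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=coords.dtype)) * math.pi
    angles = coords.unsqueeze(-1) * freqs                              # [N, D, num_freqs]
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # [N, D, 2 * num_freqs]
    return enc.flatten(start_dim=1)

xy = torch.rand(5, 2) * 2 - 1          # five hypothetical 2D pixel coordinates
features = positional_encoding(xy)     # would be fed to the MLP instead of raw [x, y]
print(features.shape)
```

The encoded features, rather than the raw `[x, y]` or `[x, y, z]` values, are then passed to the network, which helps it represent high-frequency detail.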
Description | Website | Video | Paper |
---|---|---|---|
NeRF in the Wild | web | 3:41 | Aug 2020 |
NeRF++ | | | Oct 2020 |
Deformable NeRF (nerfies) | web | 7:26 | Nov 2020 |
NeRF with time dimension | web | 2:21 | Nov 2020 |
NeRF with better weight init | web | 3:54 | Dec 2020 |
- Type of graph data
- Graph Databases
- Knowledge Graphs (KG): Describes real-world entities and their interrelations
- Social Networks
- Transport Graphs
- Molecules (including proteins): Make predictions about their properties and reactions.
- Models
- GNN Graph Neural Network, 2009
- DeepWalk: Online Learning of Social Representations, 2014
- GraphSage, 2017
- Relational inductive biases, DL, and graph networks, 2018
- KGCN: Knowledge Graph Convolutional Network, 2019
- Survey papers
- A Gentle Introduction to GNN Medium, Feb 2019
- GNN: A Review of Methods and Applications: Dec 2018, last revised Jul 2019
- A Comprehensive Survey on GNN: Jan 2019, last revised Aug 2019
- Application examples:
- Smell molecules
- Newton vs the machine: Solving the 3-body problem using DL (Not using graphs)
Check this kaggle discussion
- Ladder Networks
- GANs
- Clustering like KMeans
- Variational Autoencoder (VAE)
- Pseudolabeling: Retrain with predicted test data as new labels.
- label propagation and label spreading tutorial
- Best resources:
- Openai spinning up: Probably the best one.
- Udacity repo: Good free repo for the paid course.
- theschool.ai move 37
- Reinforcement Learning: An Introduction: Best book
- Q-learning
- Policy gradients
- A3C
- C51
- Rainbow
- Implicit Quantile
- Evolutionary Strategy
- Genetic Algorithms
Reinforcement learning reference
- fast.ai
- deeplearning.ai
- deep learning book
- Weights & Biases by OpenAI
- DL cheatsheets
- How to train your resnet
- Pytorch DL course
- Trask book
- mlexplained
- Fast.ai tabular: Does not really work well
- Problems:
- DL cannot see the frequency of an item
- Items that do not appear in the train set
- Manually align 2 distributions:
- Microsoft Malware Prediction
- CPMP Solution: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84069
- Data exploration: what the data we are going to work with looks like
- Think about input representation
- Is it redundant?
- Does it need to be converted to something else?
- The most entropy that you can reconstruct the raw data
- Look at the metric
- Does it make sense?
- Is it differentiable?
- Can I build a good enough equivalent metric?
- Build a toy model and overfit it with 1 or a few samples
- To make sure that nothing is really broken
- Entropy
- Choram
- Projections (BAD REPRESENTATION) (complicated things with voxels)
- Dense matrix (antor)
  - It's a depth map, I think
  - Not projections
  - Native output of the sensor but condensed in a dense matrix
- Point net
- transformer without positional encoding
- AtomTransformer (by antor)
- MoleculeTransformer (by antor)
- Multi-Task Learning: Train a model on a variety of learning tasks
- Meta-learning: Learn new tasks with minimal data using prior knowledge.
- N-Shot Learning
- Zero-shot: 0 training examples of that class.
- One-shot: 1 training example of that class.
- Few-shot: 2...5 training examples of that class.
- Models
- Naive approach: re-training the model on the new data would severely overfit.
- Siamese Networks (2015): Knows whether two inputs are the same or not. (The 2 feature extractors share weights)
- Matching Networks (2016) Weighted nearest-neighbor classifier applied within an embedding space.
- Model-Agnostic Meta-Learning (MAML) (2017)
- Prototypical Networks (2017): Better nearest-neighbor classifier of embeddings.
- Meta-Learning for Semi-Supervised classification (2018) Extensions of Prototypical Networks. SotA.
- Meta-Transfer Learning (MTL) (2018)
- Online Meta-Learning (2019)
- Neural Turing machine. paper, code
- Neural Arithmetic Logic Units (NALU) paper
- Remember the math:
- Matrix calculus
- Einsum: link 1, link 2
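As a quick reminder of the einsum notation, here are a few NumPy examples. The arrays are arbitrary toy values. `torch.einsum` accepts the same index strings.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)
x = np.random.rand(5, 2, 3)
y = np.random.rand(5, 3, 4)

matmul = np.einsum("ik,kj->ij", a, b)             # matrix product: sum over the shared index k
transpose = np.einsum("ij->ji", a)                # swap the axes
row_sums = np.einsum("ij->i", a)                  # a dropped index is summed over
batched = np.einsum("bik,bkj->bij", x, y)         # batched matrix product, b is kept
```

The rule of thumb: indices that appear on the left but not on the right of `->` are summed over, and repeated indices across inputs are multiplied element by element.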
`nvidia-smi dmon`: Check that sm% is close to 100% for good GPU usage.