Quantization reduces the models' memory footprint. Both models are post-trained with quantization of the Mixture-of-Experts (MoE) weights, which are quantized to 4.25 bits per parameter. Since the MoE weights account for the bulk of the total parameter count, quantizing them enables the larger model to fit on a single 80GB GPU, while the smaller model can run on systems with as little as 16GB of memory.
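As a rough sanity check, the memory figures follow from simple arithmetic: parameters × bits per parameter. The 4.25 figure is consistent with 4-bit weights plus block-wise scale overhead (e.g., one 8-bit scale shared across 32 weights adds 0.25 bits per parameter). The sketch below is illustrative only: the parameter counts are hypothetical (not stated above), and it ignores non-MoE weights, activations, and KV cache, which add to the total.

```python
def quantized_weight_bytes(num_params: float, bits_per_param: float = 4.25) -> float:
    """Bytes needed to store weights quantized at `bits_per_param` bits each."""
    return num_params * bits_per_param / 8

# Hypothetical parameter counts for illustration only; they are not given above.
models = [
    ("larger model (assumed ~120B MoE params)", 120e9),
    ("smaller model (assumed ~20B MoE params)", 20e9),
]

for name, params in models:
    gib = quantized_weight_bytes(params) / 2**30
    # Quantized MoE weights only; other weights and runtime buffers are extra.
    print(f"{name}: ~{gib:.0f} GiB for quantized MoE weights")
```

Under these assumed sizes, the quantized MoE weights come to roughly 60 GiB and 10 GiB respectively, which is how the models can fit within the 80GB and 16GB budgets mentioned above.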
Let's look at alternatives: