Qwen used an A100 80G GPU to test inference speed and memory footprint[1]. Some issues were reported: the memory usage of AWQ models could not be recorded reliably across multiple devices, and 14B GPTQ models showed an unexpectedly large memory footprint at an input context of 30720 tokens[1]. GPTQ-Int8 results are not reported because of problems with AutoGPTQ[1].
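As a rough illustration of how such numbers are typically gathered, here is a minimal sketch of measuring generation speed and peak GPU memory with Hugging Face Transformers and PyTorch. This is not Qwen's actual benchmark script; the model id, prompt, and generation length are illustrative assumptions.

```python
# Minimal sketch: single-GPU inference speed and peak-memory measurement.
# Assumptions: the model id and max_new_tokens below are placeholders,
# not the settings Qwen used in their published benchmarks.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-14B-Chat-AWQ"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

prompt = "Explain quantization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()  # clear peak-memory counters
start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Speed: {new_tokens / elapsed:.1f} tokens/s")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

Note that peak memory grows with context length, which is why anomalies like the 14B GPTQ footprint at 30720 tokens only surface when long inputs are included in the sweep.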
Let's look at alternatives: