Microsoft Research Develops a 2-Billion-Parameter Model That Can Run on Your CPU

Trained on a massive 4 trillion token dataset, BitNet b1.58 2B4T performs strongly across tasks like language understanding, math, and coding


Microsoft Research has introduced BitNet b1.58 2B4T, a highly efficient 2-billion parameter language model that uses only 1.58 bits per weight—significantly less than traditional 16- or 32-bit precision.

Despite its compact size, the model matches the performance of full-precision counterparts while running efficiently on both GPUs and CPUs.

Trained on a massive 4 trillion token dataset, BitNet b1.58 2B4T performs strongly across tasks like language understanding, math, coding, and conversation. Microsoft has released the model weights on Hugging Face along with open-source code for deployment.

Key innovations include:

  • BitLinear layers that quantize weights to ternary values {-1, 0, +1} using an absmean quantization scheme
  • 8-bit activation quantization using the absmax strategy, applied per token
  • Squared ReLU (ReLU²) activations in the feed-forward layers
  • Rotary Position Embeddings (RoPE) and a bias-free design, similar to LLaMA architectures
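The first three items above can be illustrated with a minimal NumPy sketch. This is not Microsoft's implementation (which lives in bitnet.cpp and the released training code); it is a simplified illustration of the absmean ternary weight scheme, per-token absmax 8-bit activation quantization, and the ReLU² activation, with function names chosen here for clarity:

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} using an absmean scale (sketch)."""
    scale = np.mean(np.abs(w)) + 1e-8           # absmean scaling factor
    w_q = np.clip(np.round(w / scale), -1, 1)   # round, then clip to ternary values
    return w_q.astype(np.int8), scale

def absmax_int8_per_token(x: np.ndarray):
    """Quantize activations to 8-bit integers, one absmax scale per token (row)."""
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 127.0 + 1e-8
    x_q = np.clip(np.round(x / scale), -128, 127)
    return x_q.astype(np.int8), scale

def relu_squared(x: np.ndarray):
    """Squared ReLU (ReLU²), the feed-forward activation used by the model."""
    return np.square(np.maximum(x, 0.0))

# Toy example: a 2x3 weight matrix collapses to ternary values plus one scale.
w = np.array([[0.6, -0.2, 0.05], [-0.9, 0.3, 0.0]])
w_q, w_scale = absmean_ternary(w)
x_q, x_scale = absmax_int8_per_token(np.array([[1.0, -2.54, 0.0]]))
```

The payoff of the ternary scheme is that matrix multiplies against `w_q` reduce to additions and subtractions (multiplying by -1, 0, or +1), which is what makes efficient CPU inference plausible; the per-matrix `scale` restores the original magnitude afterwards.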

The training process includes pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO).

BitNet b1.58 2B4T shows it’s possible to dramatically reduce the computational demands of language models without compromising performance, representing a major step toward more accessible and energy-efficient AI.

However, achieving this level of performance relies on Microsoft’s custom framework, bitnet.cpp, which currently supports only select hardware.

Notably missing from the compatibility list are GPUs — the backbone of today’s AI infrastructure.

While bitnets show strong potential, especially for low-resource environments, hardware compatibility remains a significant hurdle — and may continue to be a limiting factor in broader adoption.

"The current execution paths within transformers do not contain the specialized, highly optimized computational kernels required to leverage the advantages of the BitNet architecture. Running the model via transformers will likely result in inference speeds and energy usage comparable to, or potentially worse than, standard full-precision models within this framework on both CPU and GPU," Microsoft said in a blog post.