Only weights are quantized with symmetric quantization.
The quantized weights are stored in column-major order per expert.
The quantization block size can be specified. If not provided, column-wise quantization is used, i.e., a whole column shares one scaling factor.
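
For illustration, the following NumPy sketch shows one way such symmetric, block-wise quantization can be computed for a single expert's weight matrix. The function and parameter names are assumptions made for this example, not part of the operator.

```python
import numpy as np

def quantize_expert_weight(w, block_size=None, bits=4):
    """Sketch of symmetric, block-wise weight quantization (illustrative names).

    w          : (n, k) float weights of one expert, e.g. (inter_size, hidden_size) for FC1.
    block_size : block length along the K dimension; None means the whole
                 column shares one scale (column-wise quantization).
    """
    n, k = w.shape
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    blk = block_size if block_size is not None else k
    assert k % blk == 0, "K must be divisible by the block size"
    blocks = w.reshape(n, k // blk, blk)
    scales = np.abs(blocks).max(axis=-1) / qmax     # symmetric: zero point is 0
    scales = np.where(scales == 0.0, 1.0, scales)   # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(blocks / scales[..., None]), -qmax - 1, qmax)
    return q.reshape(n, k).astype(np.int8), scales  # scales: (n, k // blk)
```
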
The SwiGLU (Swish-Gated Linear Unit) activation function is defined as:

g = xW + b
l = xV + c
G = clamp(g, max=limit)
L = clamp(l, min=-limit, max=limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)

where x is the input, W and V are weight matrices, b and c are bias vectors, and alpha, beta and limit are constant float parameters.
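
As a reference, here is a minimal NumPy sketch of the activation above. The default values of alpha, beta and limit are placeholders for the example, not the operator's defaults.

```python
import numpy as np

def swiglu(x, W, b, V, c, alpha=1.702, beta=1.0, limit=7.0):
    """SwiGLU as defined above; default parameter values are illustrative only."""
    g = x @ W + b                                   # gate branch
    l = x @ V + c                                   # linear branch
    G = np.minimum(g, limit)                        # clamp(g, max=limit)
    L = np.clip(l, -limit, limit)                   # clamp(l, min=-limit, max=limit)
    return G * (1.0 / (1.0 + np.exp(-alpha * G))) * (L + beta)
```
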
When swiglu_fusion=0, the two GEMMs are not fused; they are provided as FC1 and FC3 in the inputs.

When swiglu_fusion=1, the two GEMMs are fused so that g and l are computed by a single GEMM (FC1), with g and l interleaved within each row of size 2 * inter_size.

When swiglu_fusion=2, the two GEMMs are fused, and g and l are concatenated within each row.
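
The difference between the two fused layouts can be illustrated with a small Python sketch that splits a fused FC1 output row back into g and l. The g-before-l ordering assumed here is for illustration only.

```python
def split_fused_fc1(fc1_out, inter_size, swiglu_fusion):
    """Split a fused FC1 output of shape (..., 2 * inter_size) into g and l."""
    if swiglu_fusion == 1:
        # Interleaved layout; g-before-l ordering is an assumption for illustration.
        g, l = fc1_out[..., 0::2], fc1_out[..., 1::2]
    elif swiglu_fusion == 2:
        # Concatenated layout: first inter_size entries are g, the rest are l.
        g, l = fc1_out[..., :inter_size], fc1_out[..., inter_size:]
    else:
        # swiglu_fusion == 0: g and l come from separate FC1 and FC3 GEMMs.
        raise ValueError("unfused mode: g and l come from separate FC1/FC3 GEMMs")
    return g, l
```
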
#### Version

This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

#### Attributes

<dl>
<dd>Beta parameter used in activation function.</dd>
<dt><tt>activation_type</tt> : string</dt>
<dd>Activation function to use. Choose from relu, gelu, silu, swiglu and identity. Default is relu.</dd>
<dt><tt>block_size</tt> : int</dt>
<dd>Size of each quantization block along the K (input feature) dimension. Must be a power of two and ≥ 16 (e.g., 16, 32, 64, 128). If provided, both hidden_size and inter_size must be divisible by the block size. Otherwise, there is no blocking and a whole column shares one scaling factor.</dd>
<dt><tt>expert_weight_bits</tt> : int</dt>
<dd>Number of bits used in quantized weights. Default is 4 bits.</dd>
<dt><tt>k</tt> : int</dt>
</dl>

#### Inputs

<dl>
<dt><tt>input</tt> : T</dt>
<dd>2D tensor with shape (num_tokens, hidden_size), or 3D tensor with shape (batch_size, sequence_length, hidden_size)</dd>
<dt><tt>router_probs</tt> : T</dt>
<dd>2D tensor with shape (num_tokens, num_experts)</dd>
<dt><tt>fc1_experts_weights</tt> : T1</dt>
<dd>3D tensor with shape (num_experts, fusion_size * inter_size, hidden_size / pack_size). The fusion_size is 2 for fused swiglu, or 1 otherwise. The pack_size is 8 / expert_weight_bits (see the shape sketch after this list).</dd>
<dt><tt>fc1_scales</tt> : T2</dt>
<dd>2D tensor with shape (num_experts, fusion_size * inter_size), or 3D tensor with shape (num_experts, fusion_size * inter_size, hidden_size / block_size) when block_size is provided.</dd>
<dt><tt>fc1_experts_bias</tt> (optional) : T</dt>
<dd>2D optional tensor with shape (num_experts, fusion_size * inter_size)</dd>
<dt><tt>fc2_experts_weights</tt> : T1</dt>
<dd>3D tensor with shape (num_experts, hidden_size, inter_size / pack_size)</dd>
<dt><tt>fc2_scales</tt> : T2</dt>
<dd>2D tensor with shape (num_experts, hidden_size), or 3D tensor with shape (num_experts, hidden_size, inter_size / block_size) when block_size is provided.</dd>
<dt><tt>fc2_experts_bias</tt> (optional) : T</dt>
<dd>2D optional tensor with shape (num_experts, hidden_size)</dd>
<dt><tt>fc3_experts_weights</tt> (optional) : T1</dt>
<dd>3D optional tensor with shape (num_experts, inter_size, hidden_size / pack_size)</dd>
<dt><tt>fc3_scales</tt> (optional) : T2</dt>
<dd>2D optional tensor with shape (num_experts, inter_size), or 3D optional tensor with shape (num_experts, inter_size, hidden_size / block_size) when block_size is provided.</dd>
<dt><tt>fc3_experts_bias</tt> (optional) : T</dt>
<dd>2D optional tensor with shape (num_experts, inter_size)</dd>
</dl>
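
To make the packed weight and scale shapes above concrete, the following Python sketch computes them for made-up sizes and shows one possible way two 4-bit values could be packed per byte. All names, sizes, and the nibble order are illustrative assumptions, not taken from the operator spec.

```python
import numpy as np

# Illustrative sizes; not taken from the spec.
num_experts, hidden_size, inter_size = 8, 3072, 1536
expert_weight_bits, block_size = 4, 32
fusion_size = 2                                   # 2 for fused swiglu, otherwise 1

pack_size = 8 // expert_weight_bits               # 2 for 4-bit, 1 for 8-bit weights

print("fc1_experts_weights:", (num_experts, fusion_size * inter_size, hidden_size // pack_size))
print("fc1_scales (blocked):", (num_experts, fusion_size * inter_size, hidden_size // block_size))
print("fc2_experts_weights:", (num_experts, hidden_size, inter_size // pack_size))
print("fc2_scales (blocked):", (num_experts, hidden_size, inter_size // block_size))

# One possible way to pack two 4-bit values per byte (nibble order is an assumption).
def pack_int4(q):
    u = (q & 0x0F).astype(np.uint8)               # keep low nibble of each two's-complement value
    return u[..., 0::2] | (u[..., 1::2] << 4)     # pairs of values -> one byte
```
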
#### Outputs
<dl>
<dt><tt>output</tt> : T</dt>
<dd>2D tensor with shape (num_tokens, hidden_size), or 3D tensor with shape (batch_size, sequence_length, hidden_size)</dd>
</dl>