vllm.model_executor.layers.fused_moe.fused_moe_method_base ¶
FusedMoEMethodBase ¶
Bases: QuantizeMethodBase
Source code in vllm/model_executor/layers/fused_moe/fused_moe_method_base.py
__init__ ¶
__init__(moe: FusedMoEConfig)
apply abstractmethod ¶
apply(
layer: Module,
x: Tensor,
router_logits: Tensor,
top_k: int,
renormalize: bool,
use_grouped_topk: bool = False,
topk_group: int | None = None,
num_expert_group: int | None = None,
global_num_experts: int = -1,
expert_map: Tensor | None = None,
custom_routing_function: Callable | None = None,
scoring_func: str = "softmax",
routed_scaling_factor: float = 1.0,
e_score_correction_bias: Tensor | None = None,
apply_router_weight_on_input: bool = False,
activation: str = "silu",
enable_eplb: bool = False,
expert_load_view: Tensor | None = None,
logical_to_physical_map: Tensor | None = None,
logical_replica_count: Tensor | None = None,
) -> Tensor | tuple[Tensor, Tensor]
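To make the routing parameters concrete, here is an illustrative pure-Python sketch (not vLLM code) of the token-routing step that `apply` implementations perform before dispatching tokens to experts: score the `router_logits` with `scoring_func`, pick the `top_k` experts, and optionally `renormalize` the selected weights so they sum to one.

```python
import math

def route(router_logits, top_k, renormalize=True, scoring_func="softmax"):
    """Sketch: return (weights, expert_ids) for one token's router logits."""
    if scoring_func == "softmax":
        # numerically stable softmax over all experts
        m = max(router_logits)
        exps = [math.exp(v - m) for v in router_logits]
        total = sum(exps)
        scores = [e / total for e in exps]
    else:
        # e.g. scoring_func="sigmoid": score each expert independently
        scores = [1.0 / (1.0 + math.exp(-v)) for v in router_logits]
    # select the top_k highest-scoring experts
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    expert_ids = ranked[:top_k]
    weights = [scores[i] for i in expert_ids]
    if renormalize:
        # rescale the selected weights to sum to 1
        s = sum(weights)
        weights = [w / s for w in weights]
    return weights, expert_ids
```

Real implementations do this on batched tensors (and may use grouped top-k via `topk_group`/`num_expert_group`), but the per-token logic is the same.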
Source code in vllm/model_executor/layers/fused_moe/fused_moe_method_base.py
create_weights abstractmethod ¶
create_weights(
layer: Module,
num_experts: int,
hidden_size: int,
intermediate_size_per_partition: int,
params_dtype: dtype,
**extra_weight_attrs,
)
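As a rough guide to what `create_weights` registers, the sketch below computes the tensor shapes an unquantized implementation would typically use. The parameter names `w13_weight` and `w2_weight` follow vLLM's unquantized fused-MoE convention but are an assumption here; the fused gate/up projection accounts for the factor of 2.

```python
def fused_moe_weight_shapes(num_experts, hidden_size,
                            intermediate_size_per_partition):
    """Sketch: shapes for fused MoE expert weights (assumed naming)."""
    # gate and up projections are fused along dim 1, hence the factor of 2
    w13_shape = (num_experts, 2 * intermediate_size_per_partition, hidden_size)
    # down projection maps back to the hidden size
    w2_shape = (num_experts, hidden_size, intermediate_size_per_partition)
    return {"w13_weight": w13_shape, "w2_weight": w2_shape}
```

A quantized method would register additional parameters (scales, zero points) with `params_dtype`-dependent layouts, which is why this method is abstract.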
Source code in vllm/model_executor/layers/fused_moe/fused_moe_method_base.py
get_fused_moe_quant_config abstractmethod ¶
get_fused_moe_quant_config(
layer: Module,
) -> FusedMoEQuantConfig | None
maybe_make_prepare_finalize ¶
maybe_make_prepare_finalize() -> (
FusedMoEPrepareAndFinalize | None
)
select_gemm_impl ¶
select_gemm_impl(
prepare_finalize: FusedMoEPrepareAndFinalize,
layer: Module,
) -> FusedMoEPermuteExpertsUnpermute
Source code in vllm/model_executor/layers/fused_moe/fused_moe_method_base.py

uses_weight_scale_2_pattern ¶
uses_weight_scale_2_pattern() -> bool
Returns True if this quantization method uses the 'weight_scale_2' pattern for per-tensor weight scales (e.g., FP4 variants); returns False otherwise.
Subclasses that use the 'weight_scale_2' pattern instead of the standard 'weight_scale' pattern should override this method.
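A minimal sketch of the override described above, using stand-in class names (the real base class lives in vLLM): the base default returns False, and an FP4-style method overrides it to signal the alternate scale-parameter name.

```python
class FusedMoEMethodBaseSketch:
    """Stand-in for the base class; default is the 'weight_scale' pattern."""

    def uses_weight_scale_2_pattern(self) -> bool:
        # standard methods store per-tensor scales under 'weight_scale'
        return False


class Fp4MoEMethodSketch(FusedMoEMethodBaseSketch):
    """Hypothetical FP4 variant that uses the 'weight_scale_2' pattern."""

    def uses_weight_scale_2_pattern(self) -> bool:
        # FP4 variants store per-tensor scales under 'weight_scale_2'
        return True
```

Callers can then branch on this flag when mapping checkpoint scale names onto layer parameters.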