Interleaved Thinking¶
Introduction¶
Interleaved thinking allows models to reason between tool calls, enabling more sophisticated decision-making after receiving tool results. This feature helps models chain multiple tool calls with reasoning steps in between and make nuanced decisions based on intermediate results.
Important: Interleaved thinking increases token usage and response latency. Consider your budget and performance requirements when enabling this feature.
How Interleaved Thinking Works¶
With interleaved thinking, the model can (as sketched after this list):
- Reason about the results of a tool call before deciding what to do next
- Chain multiple tool calls with reasoning steps in between
- Make more nuanced decisions based on intermediate results
- Provide transparent reasoning for its tool selection process
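Conceptually, the conversation alternates between assistant reasoning, tool calls, and tool results. Below is a minimal, illustrative sketch of such a message history; the `reasoning` field name follows the example later on this page, and the `get_weather` tool and its values are placeholders, not part of any real API response.

```python
# Illustrative message history for one interleaved-thinking exchange.
# The assistant's reasoning is carried alongside its tool calls, so the model
# can reason again over each tool result before acting or answering.
messages = [
    {"role": "user", "content": "What's the weather in San Francisco?"},
    {
        "role": "assistant",
        "reasoning": "I need current conditions, so I should call get_weather first.",
        "tool_calls": [...],  # placeholder for a call to get_weather(location="San Francisco, CA")
    },
    {"role": "tool", "name": "get_weather", "content": "72°F and sunny", "tool_call_id": "..."},
    # The model reasons over the tool result before producing the final answer
    # (or before issuing another tool call if more information is needed).
    {"role": "assistant", "content": "It's currently 72°F and sunny in San Francisco."},
]
```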
Supported Models¶
vLLM currently supports the following interleaved thinking models:
| Model Series | Reasoning Parser Name |
|---|---|
| moonshotai/Kimi-K2-Thinking | kimi_k2 |
| MiniMaxAI/MiniMax-M2 | minimax_m2 |
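The parser name from the table is passed to `vllm serve` via `--reasoning-parser`. As a minimal illustration for the Kimi model (parallelism and tool-calling flags are omitted here; a complete MiniMax-M2 launch command appears in the example below):

```bash
# Illustrative: pass the reasoning parser name from the table when serving.
# Tool-calling flags (--tool-call-parser, --enable-auto-tool-choice) are shown
# in the full MiniMax-M2 example below.
vllm serve moonshotai/Kimi-K2-Thinking \
    --reasoning-parser kimi_k2
```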
Example Usage¶
To use interleaved thinking with tool calls, serve a supported model with its matching reasoning and tool-call parsers (see the serve command at the top of the snippet below), then enable tool calls in your chat completion request. Here's an example:
"""
vllm serve MiniMaxAI/MiniMax-M2 \
    --tensor-parallel-size 4 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    --enable-auto-tool-choice
"""
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
def get_current_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    if unit == "celsius":
        return f"The current temperature in {location} is 22°C."
    else:
        return f"The current temperature in {location} is 72°F."
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., 'San Francisco, CA'",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location", "unit"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather in Fahrenheit like in San Francisco?"}]
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
tool_call = response.choices[0].message.tool_calls[0].function
messages.append(
    {
        "role": "assistant",
        "tool_calls": response.choices[0].message.tool_calls,
        "reasoning": response.choices[0].message.reasoning,  # append reasoning
    }
)
# Simulate tool execution
available_tools = {"get_weather": get_current_weather}
completion_tool_calls = response.choices[0].message.tool_calls
for call in completion_tool_calls:
    tool_to_call = available_tools[call.function.name]
    args = json.loads(call.function.arguments)
    result = tool_to_call(**args)
    messages.append(
        {
            "role": "tool",
            "content": result,
            "tool_call_id": call.id,
            "name": call.function.name,
        }
    )
response_2 = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
print(response_2.choices[0].message.content)
This example demonstrates how to set up interleaved thinking with tool calls using a weather retrieval function. The key step is appending the assistant's reasoning back into the message history together with its tool calls, so the model can reason about the tool results before generating the final response.
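The example above handles a single tool round. A natural extension is a loop that keeps calling the model, carrying its reasoning forward, until it stops requesting tools. The following is only a sketch built on the example above: it reuses `client`, `tools`, and `available_tools` from that snippet, starts from a fresh `messages` list, relies on the same `reasoning` field name, and uses an illustrative `MAX_ROUNDS` safety limit.

```python
import json

MAX_ROUNDS = 5  # illustrative safety limit on tool-calling rounds

messages = [{"role": "user", "content": "Compare the weather in San Francisco and New York."}]

for _ in range(MAX_ROUNDS):
    response = client.chat.completions.create(
        model=client.models.list().data[0].id,
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    message = response.choices[0].message

    # No tool calls means the model has produced its final answer.
    if not message.tool_calls:
        print(message.content)
        break

    # Carry the interleaved reasoning and the tool calls back into the history.
    messages.append(
        {
            "role": "assistant",
            "tool_calls": message.tool_calls,
            "reasoning": message.reasoning,
        }
    )

    # Execute every requested tool and append its result.
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = available_tools[call.function.name](**args)
        messages.append(
            {
                "role": "tool",
                "content": result,
                "tool_call_id": call.id,
                "name": call.function.name,
            }
        )
```

Because the reasoning is re-appended after every round, the model can chain several tool calls with a reasoning step between each, which is exactly the behavior interleaved thinking enables.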