vllm.v1.core.sched.interface ¶
SchedulerInterface ¶
Bases: ABC
Source code in vllm/v1/core/sched/interface.py
add_request abstractmethod ¶
add_request(request: Request) -> None
Add a new request to the scheduler's internal queue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | Request | The new request being added. | required |
finish_requests abstractmethod ¶
finish_requests(
request_ids: Union[str, Iterable[str]],
finished_status: RequestStatus,
) -> None
Finish the requests in the scheduler's internal queue. If the request is not in the queue, this method will do nothing.
This method is called in two cases:

1. When the request is aborted by the client.
2. When the frontend process detects a stop string of the request after de-tokenizing its generated tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request_ids | Union[str, Iterable[str]] | A single request ID or an iterable of request IDs. | required |
| finished_status | RequestStatus | The finished status of the given requests. | required |
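Since `request_ids` accepts either a single string or an iterable of strings, an implementation typically normalizes the argument first. A minimal sketch of that pattern (the helper name is illustrative, not vLLM's actual internals):

```python
from typing import Iterable, Union


def normalize_request_ids(request_ids: Union[str, Iterable[str]]) -> list[str]:
    # A bare string is itself an Iterable[str], so check for str first
    # to avoid iterating over its characters.
    if isinstance(request_ids, str):
        return [request_ids]
    return list(request_ids)
```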
get_kv_connector ¶
get_kv_connector() -> Optional[KVConnectorBase_V1]

Return the KV connector configured for this scheduler, or None if no KV connector is in use.
get_request_counts abstractmethod ¶
has_finished_requests abstractmethod ¶
has_finished_requests() -> bool
Returns True if there are finished requests that need to be cleared.

NOTE: This is different from `not self.has_unfinished_requests()`.
The scheduler maintains an internal list of the requests finished in the previous step. This list is returned from the next call to schedule(), to be sent to the model runner in the next step to clear cached states for these finished requests.
This method checks if this internal list of finished requests is non-empty. This information is useful for DP attention.
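The distinction between the three predicates can be sketched with a toy scheduler that keeps a list of requests finished in the previous step (all names here are illustrative stand-ins, not vLLM's actual fields):

```python
class ToyScheduler:
    def __init__(self) -> None:
        self.waiting: list[str] = []            # unfinished requests
        self.finished_req_ids: list[str] = []   # finished last step, not yet reported

    def has_unfinished_requests(self) -> bool:
        return bool(self.waiting)

    def has_finished_requests(self) -> bool:
        # Can be True even when no unfinished work remains: these IDs still
        # need to reach the model runner so it can clear their cached state.
        return bool(self.finished_req_ids)

    def has_requests(self) -> bool:
        return self.has_unfinished_requests() or self.has_finished_requests()
```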
has_requests ¶
has_requests() -> bool
Returns True if there are unfinished requests, or finished requests not yet returned in SchedulerOutputs.
has_unfinished_requests ¶
has_unfinished_requests() -> bool
Returns True if there are unfinished requests in the scheduler's internal queue.
make_stats abstractmethod ¶
make_stats() -> Optional[SchedulerStats]
Make a SchedulerStats object for logging.
The SchedulerStats object is created for every scheduling step.
reset_prefix_cache abstractmethod ¶
reset_prefix_cache() -> bool
Reset the prefix cache for KV cache.
This is particularly required when the model weights are live-updated.
resume_request abstractmethod ¶
resume_request(
request_id: str,
prompt_token_ids: Optional[list[int]] = None,
finish_forever: Optional[bool] = False,
) -> None
Resume a leftover request.
This method is called when the client wants to resume a previously leftover request.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request_id | str | The ID of the request to be resumed. | required |
| prompt_token_ids | Optional[list[int]] | If provided, the new prompt token IDs to use for the resumed request. If None, the original prompt token IDs will be used. | None |
| finish_forever | Optional[bool] | If True, the resumed request will be marked as finished after processing the current prompt tokens. If False, the request will continue to generate tokens as usual. | False |
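The prompt-override behavior described above can be sketched as a small helper (the function is a hypothetical illustration, not part of the interface):

```python
from typing import Optional


def resolve_resume_prompt(original: list[int],
                          override: Optional[list[int]] = None) -> list[int]:
    # If new prompt token IDs are provided, they replace the original prompt;
    # otherwise the request resumes with its original prompt token IDs.
    return override if override is not None else original
```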
schedule abstractmethod ¶
schedule() -> SchedulerOutput
Schedule the requests to process in this scheduling step.
The scheduling decision is made at the iteration level. Each scheduling step corresponds to a single forward pass of the model. Therefore, this method is called repeatedly by a busy loop in the engine.
Essentially, the scheduler produces a dictionary of {req_id: num_tokens} that specifies how many tokens to process for each request in this scheduling step. For example, num_tokens can be as large as the number of prompt tokens for new requests, or it can be 1 for requests that are auto-regressively generating new tokens one at a time. It can also fall somewhere in between in the case of chunked prefills, prefix caching, speculative decoding, etc.
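The {req_id: num_tokens} decision can be illustrated with a simplified per-request budget function (a sketch, not vLLM's actual scheduling policy):

```python
def tokens_to_schedule(num_prompt_tokens: int,
                       num_computed_tokens: int,
                       token_budget: int) -> int:
    """How many tokens to process for one request in this step."""
    if num_computed_tokens < num_prompt_tokens:
        # Prefill: process the remaining prompt tokens, capped by the step's
        # token budget (the cap is what produces a chunked prefill).
        remaining = num_prompt_tokens - num_computed_tokens
        return min(remaining, token_budget)
    # Decode: auto-regressive generation proceeds one token at a time.
    return 1
```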
Additionally, the scheduler also returns useful data about each request or the batch as a whole. The model runner will use this information in preparing inputs to the model.
Returns:
| Type | Description |
|---|---|
| SchedulerOutput | A SchedulerOutput object containing information about the scheduled requests. |
shutdown abstractmethod ¶
update_draft_token_ids abstractmethod ¶
update_draft_token_ids(
draft_token_ids: DraftTokenIds,
) -> None
Update the draft token ids for the scheduled requests.
update_from_output abstractmethod ¶
update_from_output(
scheduler_output: SchedulerOutput,
model_runner_output: ModelRunnerOutput,
) -> dict[int, EngineCoreOutputs]
Update the scheduler state based on the model runner output.
This method is called after the model runner has processed the scheduled requests. The model runner output includes generated token IDs, draft token IDs for the next step, etc. The scheduler uses this information to update its state, check for finished requests, and return the output for each request.
Returns:
| Type | Description |
|---|---|
| dict[int, EngineCoreOutputs] | A dict of client index to EngineCoreOutputs object containing the outputs for each request originating from that client. |
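The per-client grouping of the return value can be sketched as follows (the plain dicts stand in for vLLM's actual per-request outputs and EngineCoreOutputs):

```python
from collections import defaultdict


def group_outputs_by_client(outputs: list[dict]) -> dict[int, list[dict]]:
    # Each per-request output carries the index of the client that submitted
    # the request; responses are batched per client before being returned.
    grouped: dict[int, list[dict]] = defaultdict(list)
    for out in outputs:
        grouped[out["client_index"]].append(out)
    return dict(grouped)
```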