Predict Protocol - Version 2

This document proposes a predict/inference API independent of any specific ML/DL framework and model server. The proposed APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol, both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server.

For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.

The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.

HTTP/REST

A compliant server must implement the health, metadata, and inference APIs described in this section.

The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document, $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
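For example, a hypothetical response object would be described in this notation as follows (the field names here are purely illustrative, not part of the protocol):

    {
      "name" : $string,
      "ready" : $boolean,
      "inputs" : [ $object, ... ],
      "version" : $string #optional
    }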

All strings in all contexts are case-sensitive.

For KFServing, the server must recognize the following URLs. The versions portion of the URL is shown as optional to allow implementations that don't support versioning, or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies). A request sketch using these URLs follows the list.

Health:

GET v2/health/live

GET v2/health/ready

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready

Server Metadata:

GET v2

Model Metadata:

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]

Inference:

POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
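
As a concrete illustration of these URL patterns, the sketch below issues one request per endpoint. The server address, model name, and version are hypothetical assumptions, and the inference payload is only a placeholder whose exact schema is defined later in the protocol; any HTTP client would do, Python's requests library is used here for brevity.

    import requests

    BASE = "http://localhost:8000"  # hypothetical server address
    MODEL = "mymodel"               # hypothetical model name
    VERSION = "1"                   # optional; omit to let the server choose a version

    # Health endpoints
    requests.get(f"{BASE}/v2/health/live")
    requests.get(f"{BASE}/v2/health/ready")
    requests.get(f"{BASE}/v2/models/{MODEL}/versions/{VERSION}/ready")

    # Server metadata, and model metadata with the version omitted
    # (the server chooses a version based on its own policies)
    requests.get(f"{BASE}/v2")
    requests.get(f"{BASE}/v2/models/{MODEL}")

    # Inference; placeholder body following the shape of the inference
    # request object specified later in the protocol
    payload = {"inputs": [{"name": "input0", "shape": [1, 2],
                           "datatype": "FP32", "data": [1.0, 2.0]}]}
    requests.post(f"{BASE}/v2/models/{MODEL}/versions/{VERSION}/infer", json=payload)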

Health

A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.
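
Because the response body is empty, a client needs only the status code. A minimal check, assuming the requests library and a hypothetical base URL, might look like:

    import requests

    def is_live(base_url: str) -> bool:
        # A 200 status code indicates true; a 4xx status code indicates false.
        resp = requests.get(f"{base_url}/v2/health/live")
        return resp.status_code == 200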

Server Live

The "server live" API indicates if the inference server is able to receive and respond to metadata and inference requests. The "server live" API can be used directly to implement the Kubernetes livenessProbe.

Server Ready

The "server ready" health API indicates if all the models are ready for inferencing.
