Capture data
You can log the inputs to your SageMaker real-time endpoint and the inference outputs to Amazon S3 by enabling a feature called Data Capture. It is commonly used to record information that can be used for training, debugging, and monitoring. Amazon SageMaker Model Monitor automatically parses this captured data and compares metrics from the data with a baseline that you create for the model. For more information about Model Monitor, see Monitor models for data and model quality, bias, and explainability.
To prevent impact to inference requests, Data Capture stops capturing requests at high levels of disk usage. It is recommended that you keep disk utilization below 75% to ensure that Data Capture continues capturing requests.
To capture data, you must use SageMaker hosting services. This requires that you create a SageMaker model, define an endpoint configuration, and create an HTTPS endpoint.
The steps required to enable data capture are similar whether you use the Amazon SDK for Python (Boto) or the SageMaker Python SDK. If you use the Amazon SDK, define the DataCaptureConfig dictionary, along with the required fields, within the CreateEndpointConfig method to enable data capture. If you use the SageMaker Python SDK, import the DataCaptureConfig class and initialize an instance of it. Then, pass this object to the data_capture_config parameter of the sagemaker.model.Model.deploy() method.
To use the following code snippets, replace the italicized placeholder text in the example code with your own information.
How to enable data capture
Specify the data capture configuration. You can capture the request payload, the response payload, or both with this configuration. The following code snippets demonstrate how to enable data capture using the Amazon SDK for Python (Boto) and the SageMaker Python SDK.
- Amazon SDK for Python (Boto)
-
Configure the data that you want to capture with the DataCaptureConfig dictionary when you create an endpoint with the CreateEndpointConfig method. Set EnableCapture to the boolean value True. In addition, provide the following mandatory parameters:
-
EndpointConfigName: the name of the endpoint configuration. You will use this name when you make a CreateEndpoint request.
-
ProductionVariants: a list of the models that you want to host at this endpoint. Define a dictionary data type for each model.
-
DataCaptureConfig: a dictionary data type in which you specify an integer value that corresponds to the initial percentage of data to sample (InitialSamplingPercentage), the Amazon S3 URI where you want the captured data to be stored, and a capture options (CaptureOptions) list. Specify either Input or Output for CaptureMode within the CaptureOptions list.
You can optionally specify how SageMaker should encode the captured data by passing key-value pair arguments to the CaptureContentTypeHeader dictionary.
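As a minimal sketch, the optional CaptureContentTypeHeader could look like the following; the content types listed here are assumed examples for your payloads, not requirements:

```python
# Optional - tell SageMaker which content types to treat as CSV and which as
# JSON lines when encoding captured payloads. The values below are assumed examples.
capture_content_type_header = {
    "CsvContentTypes": ["text/csv"],
    "JsonContentTypes": ["application/json"],
}
```

If you use it, pass this dictionary as the CaptureContentTypeHeader key of DataCaptureConfig in the snippet below.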
# Create a low-level SageMaker service client (boto3 is assumed to be configured).
import boto3
sagemaker_client = boto3.client('sagemaker')

# Create an endpoint config name.
endpoint_config_name = '<endpoint-config-name>'
# The name of the production variant.
variant_name = '<name-of-production-variant>'
# The name of the model that you want to host.
# This is the name that you specified when creating the model.
model_name = '<The_name_of_your_model>'
instance_type = '<instance-type>'
#instance_type='ml.m5.xlarge' # Example
# Number of instances to launch initially.
initial_instance_count = <integer>
# Sampling percentage. Choose an integer value between 0 and 100
initial_sampling_percentage = <integer>
# The S3 URI of where to store captured data in S3
s3_capture_upload_path = 's3://<bucket-name>/<data_capture_s3_key>'
# Specify either Input, Output, or both
capture_modes = [ "Input", "Output" ]
#capture_mode = [ "Input"] # Example - If you want to capture input only
endpoint_config_response = sagemaker_client.create_endpoint_config(
EndpointConfigName=endpoint_config_name,
# List of ProductionVariant objects, one for each model that you want to host at this endpoint.
ProductionVariants=[
{
"VariantName": variant_name,
"ModelName": model_name,
"InstanceType": instance_type, # Specify the compute instance type.
"InitialInstanceCount": initial_instance_count # Number of instances to launch initially.
}
],
DataCaptureConfig= {
'EnableCapture': True, # Whether data should be captured or not.
'InitialSamplingPercentage' : initial_sampling_percentage,
'DestinationS3Uri': s3_capture_upload_path,
'CaptureOptions': [{"CaptureMode" : capture_mode} for capture_mode in capture_modes] # Example - Use list comprehension to capture both Input and Output
}
)
For more information about other endpoint configuration options, see the CreateEndpointConfig API in the Amazon SageMaker Service API Reference guide.
- SageMaker Python SDK
-
Import DataCaptureConfig from the sagemaker.model_monitor module. Enable data capture by setting EnableCapture to the boolean value True.
Optionally provide arguments for the following parameters:
SamplingPercentage: an integer value that corresponds to the percentage of data to sample. If you do not provide a sampling percentage, SageMaker samples a default of 20% (20) of your data.
DestinationS3Uri: the Amazon S3 URI that SageMaker uses to store the captured data. If you do not provide one, SageMaker stores the captured data in "s3://<default-session-bucket>/model-monitor/data-capture".
from sagemaker.model_monitor import DataCaptureConfig
# Set to True to enable data capture
enable_capture = True
# Optional - Sampling percentage. Choose an integer value between 0 and 100
sampling_percentage = <int>
# sampling_percentage = 30 # Example 30%
# Optional - The S3 URI of where to store captured data in S3
s3_capture_upload_path = 's3://<bucket-name>/<data_capture_s3_key>'
# Specify either Input, Output or both.
capture_modes = ['REQUEST','RESPONSE'] # In this example, we specify both
# capture_mode = ['REQUEST'] # Example - If you want to only capture input.
# Configuration object passed in when deploying Models to SM endpoints
data_capture_config = DataCaptureConfig(
enable_capture = enable_capture,
sampling_percentage = sampling_percentage, # Optional
destination_s3_uri = s3_capture_upload_path, # Optional
capture_options = [{"CaptureMode": capture_mode} for capture_mode in capture_modes]
)
Deploy your model
Deploy your model and create an HTTPS endpoint with DataCapture enabled.
- Amazon SDK for Python (Boto3)
-
Provide the endpoint configuration to SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration.
Once you have your model and endpoint configuration, use the CreateEndpoint API to create your endpoint. The endpoint name must be unique within an Amazon Region in your Amazon account.
The following creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker uses the endpoint to provision resources and deploy models.
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = '<endpoint-name>'
# The name of the endpoint configuration associated with this endpoint.
endpoint_config_name='<endpoint-config-name>'
create_endpoint_response = sagemaker_client.create_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=endpoint_config_name)
For more information, see the CreateEndpoint API.
- SageMaker Python SDK
-
Define a name for your endpoint. This step is optional. If you do not provide one, SageMaker creates a unique name for you:
from datetime import datetime
endpoint_name = f"DEMO-{datetime.utcnow():%Y-%m-%d-%H%M}"
print("EndpointName =", endpoint_name)
Deploy your model to a real-time HTTPS endpoint with the model object's built-in deploy() method. Provide the name of the Amazon EC2 instance type to deploy this model to in the instance_type field, along with the initial number of instances to run the endpoint on in the initial_instance_count field:
initial_instance_count=<integer>
# initial_instance_count=1 # Example
instance_type='<instance-type>'
# instance_type='ml.m4.xlarge' # Example
# Uncomment if you did not define this variable in the previous step
#data_capture_config = <name-of-data-capture-configuration>
model.deploy(
initial_instance_count=initial_instance_count,
instance_type=instance_type,
endpoint_name=endpoint_name,
data_capture_config=data_capture_config
)
View captured data
Create a Predictor object from the SageMaker Python SDK Predictor class. You will use the object returned by the Predictor class to invoke your endpoint in a future step. Provide the name of your endpoint (defined earlier as endpoint_name), along with serializer and deserializer objects for the serializer and deserializer, respectively. For information about serializer types, see the Serializers class in the SageMaker Python SDK.
from sagemaker.predictor import Predictor
from sagemaker.serializers import <Serializer>
from sagemaker.deserializers import <Deserializers>
predictor = Predictor(endpoint_name=endpoint_name,
serializer = <Serializer_Class>,
deserializer = <Deserializer_Class>)
# Example
#from sagemaker.predictor import Predictor
#from sagemaker.serializers import CSVSerializer
#from sagemaker.deserializers import JSONDeserializer
#predictor = Predictor(endpoint_name=endpoint_name,
# serializer=CSVSerializer(),
# deserializer=JSONDeserializer())
In the following code example scenario, we invoke the endpoint with sample validation data that we have stored locally in a CSV file named validation_with_predictions. Our sample validation set contains labels for each input.
Within the with statement, we first open the validation set CSV file, then split each row of the file on the comma character ",", and store the two returned objects in the label and input_cols variables. For each row, the input (input_cols) is passed to the predictor object's built-in method Predictor.predict().
Suppose the model returns a probability, which ranges between 0 and 1.0. If the probability returned by the model is greater than 80% (0.8), we assign the prediction the integer label 1. Otherwise, we assign it the integer label 0.
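In isolation, that thresholding rule can be sketched as a small helper (label_from_probability is a hypothetical name; the loop below inlines the same logic):

```python
def label_from_probability(probability: float, cutoff: float = 0.8) -> str:
    """Map a model probability to a binary string label using the cutoff."""
    return "1" if probability > cutoff else "0"

print(label_from_probability(0.92))  # prints: 1
print(label_from_probability(0.45))  # prints: 0
```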
from time import sleep
validate_dataset = "validation_with_predictions.csv"
# Cut off threshold of 80%
cutoff = 0.8
limit = 200 # Need at least 200 samples to compute standard deviations
i = 0
with open(f"test_data/{validate_dataset}", "w") as validation_file:
validation_file.write("probability,prediction,label\n") # CSV header
with open("test_data/validation.csv", "r") as f:
for row in f:
(label, input_cols) = row.split(",", 1)
probability = float(predictor.predict(input_cols))
prediction = "1" if probability > cutoff else "0"
validation_file.write(f"{probability},{prediction},{label}\n")
i += 1
if i > limit:
break
print(".", end="", flush=True)
sleep(0.5)
print()
print("Done!")
Because you enabled data capture in the previous steps, the request and response payloads, along with some additional metadata, are saved in the Amazon S3 location that you specified in DataCaptureConfig. The delivery of capture data to Amazon S3 can take a couple of minutes.
View the captured data by listing the data capture files stored in Amazon S3. The format of the Amazon S3 path is: s3:///{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl.
Expect to see different files from different time periods, organized based on the hour in which the invocation occurred. Run the following to print out the contents of a single capture file:
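Here, capture_file is simply the list of JSON lines read from one capture file that you have downloaded from the S3 path above (for example, with the AWS CLI). A self-contained sketch, using two fabricated placeholder records in place of a real download:

```python
import json

# Two fabricated placeholder records in the captured-data JSON-lines format.
sample_records = [
    {"captureData": {"endpointInput": {"data": "1,2,3"}}, "eventVersion": "0"},
    {"captureData": {"endpointOutput": {"data": "0.7"}}, "eventVersion": "0"},
]
with open("captured-data.jsonl", "w") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# Read the file back the same way you would read a real downloaded capture file.
with open("captured-data.jsonl") as f:
    capture_file = f.readlines()  # one JSON record per line

print(len(capture_file))  # prints: 2
```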
print("\n".join(capture_file[-3:-1]))
These are SageMaker-specific JSON-line formatted files. The following is a sample response taken from a real-time endpoint that we invoked with CSV/text data:
{"captureData":{"endpointInput":{"observedContentType":"text/csv","mode":"INPUT",
"data":"69,0,153.7,109,194.0,105,256.1,114,14.1,6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0\n",
"encoding":"CSV"},"endpointOutput":{"observedContentType":"text/csv; charset=utf-8","mode":"OUTPUT","data":"0.0254181120544672","encoding":"CSV"}},
"eventMetadata":{"eventId":"aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee","inferenceTime":"2022-02-14T17:25:49Z"},"eventVersion":"0"}
{"captureData":{"endpointInput":{"observedContentType":"text/csv","mode":"INPUT",
"data":"94,23,197.1,125,214.5,136,282.2,103,9.5,5,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1\n",
"encoding":"CSV"},"endpointOutput":{"observedContentType":"text/csv; charset=utf-8","mode":"OUTPUT","data":"0.07675473392009735","encoding":"CSV"}},
"eventMetadata":{"eventId":"aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee","inferenceTime":"2022-02-14T17:25:49Z"},"eventVersion":"0"}
In the previous example, the capture_file object is a list. Index the first element of the list to view a single inference request.
import json

# The capture_file object is a list. Index the first element to view a single inference request
print(json.dumps(json.loads(capture_file[0]), indent=2))
This returns a response similar to the following. The values returned will differ depending on your endpoint configuration, SageMaker model, and the captured data:
{
"captureData": {
"endpointInput": {
"observedContentType": "text/csv", # data MIME type
"mode": "INPUT",
"data": "50,0,188.9,94,203.9,104,151.8,124,11.6,8,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0\n",
"encoding": "CSV"
},
"endpointOutput": {
"observedContentType": "text/csv; charset=character-encoding",
"mode": "OUTPUT",
"data": "0.023190177977085114",
"encoding": "CSV"
}
},
"eventMetadata": {
"eventId": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
"inferenceTime": "2022-02-14T17:25:06Z"
},
"eventVersion": "0"
}
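To work with a parsed record programmatically, you can pull the input and output payloads out of the fields shown above. A brief sketch, using the sample values from the record above (with the input data shortened):

```python
import json

# A single captured record in the format shown above (input data shortened).
record_json = (
    '{"captureData": {"endpointInput": {"observedContentType": "text/csv", '
    '"mode": "INPUT", "data": "50,0,188.9", "encoding": "CSV"}, '
    '"endpointOutput": {"observedContentType": "text/csv; charset=utf-8", '
    '"mode": "OUTPUT", "data": "0.023190177977085114", "encoding": "CSV"}}, '
    '"eventMetadata": {"eventId": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", '
    '"inferenceTime": "2022-02-14T17:25:06Z"}, "eventVersion": "0"}'
)
record = json.loads(record_json)

# Extract the request payload and the model's prediction from the record.
model_input = record["captureData"]["endpointInput"]["data"]
model_output = float(record["captureData"]["endpointOutput"]["data"])
print(model_input)   # prints: 50,0,188.9
print(model_output)  # prints: 0.023190177977085114
```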