Overview
We run SageMaker Studio at my company, and I found that a considerable share of the bill came from idle notebook instances.
I looked into how to fix this and am writing down what I found.
Core concept
As a global policy in SageMaker Studio, plant a script so that idle termination is set up when a user's JupyterServer is first created.
In more detail: register one SageMaker Lifecycle Configuration of type JupyterServer, then apply it to the entire SageMaker Domain.
Registering the Lifecycle Configuration
Registration script
Add the configuration under the name install-autoshutdown-extension.
#!/bin/bash
# install_autoshutdown.sh is a file holding the idle stop script from the next section
LCC_CONTENT=$(openssl base64 -A -in install_autoshutdown.sh)
aws sagemaker create-studio-lifecycle-config \
--studio-lifecycle-config-name install-autoshutdown-extension \
--studio-lifecycle-config-content "$LCC_CONTENT" \
--studio-lifecycle-config-app-type JupyterServer
Once registered, the new configuration shows up in the SageMaker console.
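Before registering, it can help to confirm that the encoded script is a single base64 line and is not too large. The sketch below uses two hypothetical helpers, `encode_lcc` and `lcc_fits_limit`. The 16384-character default is an assumption about the content length limit and should be checked against the current API reference.

```shell
#!/bin/bash
# Hypothetical helpers; names and the default limit are assumptions, not part of the AWS CLI.

# Encode a script file as one base64 line (same result as `openssl base64 -A`).
encode_lcc() {
  base64 < "$1" | tr -d '\n'
}

# Succeed (exit 0) if the encoded content fits within the limit.
# The default of 16384 characters is an assumed limit; verify before relying on it.
lcc_fits_limit() {
  local content="$1" limit="${2:-16384}"
  [ "${#content}" -le "$limit" ]
}
```

With these in place, `LCC_CONTENT=$(encode_lcc install_autoshutdown.sh)` followed by `lcc_fits_limit "$LCC_CONTENT" || echo "script too large"` gives an early warning before the `create-studio-lifecycle-config` call is made.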
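Idle stop script
I set the idle timeout (TIMEOUT_IN_MINS) to 60 minutes.
Reading through the script, its job is to install the idle_checker extension published by AWS and activate it alongside the Jupyter server when the server starts.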
#!/bin/bash
# https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples/blob/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh
# This script installs the idle notebook auto-checker server extension to SageMaker Studio
# The original extension has a lab extension part where users can set the idle timeout via a Jupyter Lab widget.
# In this version the script installs the server side of the extension only. The idle timeout
# can be set via a command-line script, which this script also creates and places in the
# user's home folder
#
# Installing the server side extension does not require Internet connection (as all the dependencies are stored in the
# install tarball) and can be done via VPCOnly mode.
set -eux
# timeout in minutes
export TIMEOUT_IN_MINS=60
# Should already be running in user home directory, but just to check:
cd /home/sagemaker-user
# By working in a directory starting with ".", we won't clutter up users' Jupyter file tree views
mkdir -p .auto-shutdown
# Create the command-line script for setting the idle timeout
cat > .auto-shutdown/set-time-interval.sh << EOF
#!/opt/conda/bin/python
import json
import requests
TIMEOUT=${TIMEOUT_IN_MINS}
session = requests.Session()
# Getting the xsrf token first from Jupyter Server
response = session.get("http://localhost:8888/jupyter/default/tree")
# calls the idle_checker extension's interface to set the timeout value
response = session.post("http://localhost:8888/jupyter/default/sagemaker-studio-autoshutdown/idle_checker",
            json={"idle_time": TIMEOUT, "keep_terminals": False},
            params={"_xsrf": response.headers['Set-Cookie'].split(";")[0].split("=")[1]})
if response.status_code == 200:
    print("Succeeded, idle timeout set to {} minutes".format(TIMEOUT))
else:
    print("Error!")
    print(response.status_code)
EOF
chmod +x .auto-shutdown/set-time-interval.sh
# "wget" is not part of the base Jupyter Server image, you need to install it first if needed to download the tarball
sudo yum install -y wget
# You can download the tarball from GitHub or alternatively, if you're using VPCOnly mode, you can host on S3
wget -O .auto-shutdown/extension.tar.gz https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension/raw/main/sagemaker_studio_autoshutdown-0.1.5.tar.gz
# Or instead, could serve the tarball from an S3 bucket in which case "wget" would not be needed:
# aws s3 --endpoint-url [S3 Interface Endpoint] cp s3://[tarball location] .auto-shutdown/extension.tar.gz
# Installs the extension
cd .auto-shutdown
tar xzf extension.tar.gz
cd sagemaker_studio_autoshutdown-0.1.5
# Activate studio environment just for installing extension
# Default to "jupyter-server" when the variable is unset (no inner quotes, so the value stays clean)
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-jupyter-server}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
    eval "$(conda shell.bash hook)"
    conda activate studio
fi;
pip install --no-dependencies --no-build-isolation -e .
jupyter serverextension enable --py sagemaker_studio_autoshutdown
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
    conda deactivate
fi;
# Restarts the jupyter server
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver
# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30
# Calling the script to set the idle timeout and activate the extension
/home/sagemaker-user/.auto-shutdown/set-time-interval.sh
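The 60-minute value reaches the generated Python helper because the heredoc delimiter `EOF` in the script is unquoted, so the shell substitutes `${TIMEOUT_IN_MINS}` while the file is being written. Here is a minimal sketch of that behaviour, with hypothetical function names and a throwaway value of 45 instead of the real script:

```shell
#!/bin/bash
# Hypothetical demo: how an unquoted heredoc bakes a shell variable into
# generated text, as the lifecycle script does for TIMEOUT.
TIMEOUT_IN_MINS=45

render_helper() {
  # Unquoted EOF: ${TIMEOUT_IN_MINS} is expanded when the text is produced.
  cat << EOF
TIMEOUT=${TIMEOUT_IN_MINS}
EOF
}

render_literal() {
  # Quoted 'EOF': the text is emitted verbatim, with no expansion.
  cat << 'EOF'
TIMEOUT=${TIMEOUT_IN_MINS}
EOF
}

render_helper    # prints: TIMEOUT=45
render_literal   # prints: TIMEOUT=${TIMEOUT_IN_MINS}
```

So the timeout is fixed when the lifecycle script runs. To use a different value, change `TIMEOUT_IN_MINS` in the script itself.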
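Applying to the entire SageMaker Domain
In the script below, fill in your own lifecycle configuration ARN and SageMaker Studio domain ID to apply the configuration to the whole domain.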
#!/bin/bash
aws sagemaker update-domain --domain-id <DOMAIN_ID> \
--default-user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "LifecycleConfigArn": "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension",
      "InstanceType": "system"
    },
    "LifecycleConfigArns": [
      "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension"
    ]
  }
}'
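The ARN appears twice in that JSON, so building it once avoids the two copies drifting apart. A rough sketch follows. The helper names `lcc_arn` and `default_user_settings` are hypothetical, and the account ID is a placeholder. No AWS call is made.

```shell
#!/bin/bash
# Hypothetical helpers; names and argument order are illustrative only.

# Print the ARN of a Studio lifecycle configuration.
lcc_arn() {
  local region="$1" account_id="$2" name="$3"
  printf 'arn:aws:sagemaker:%s:%s:studio-lifecycle-config/%s' \
    "$region" "$account_id" "$name"
}

# Print the JSON for --default-user-settings, given an ARN.
default_user_settings() {
  local arn="$1"
  cat << EOF
{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "LifecycleConfigArn": "${arn}",
      "InstanceType": "system"
    },
    "LifecycleConfigArns": ["${arn}"]
  }
}
EOF
}

# Example with a placeholder account ID; this only prints the JSON.
ARN=$(lcc_arn ap-northeast-2 123456789012 install-autoshutdown-extension)
default_user_settings "$ARN"
```

The printed JSON can then be passed as `--default-user-settings "$(default_user_settings "$ARN")"` in the `update-domain` call.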
Applying to specific users only
If you need to apply the configuration to specific users only, you can do so as follows.
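aws sagemaker update-user-profile --domain-id d-abc123 \
--user-profile-name my-existing-user \
--user-settings '{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "LifecycleConfigArn": "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension",
      "InstanceType": "system"
    },
    "LifecycleConfigArns": [
      "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension"
    ]
  }
}'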
Lifecycle Configuration deletion script
Use this if the need arises.
#!/bin/bash
aws sagemaker delete-studio-lifecycle-config \
--studio-lifecycle-config-name install-autoshutdown-extension