Overview

We run SageMaker Studio at my company, and I found that idle notebook instances were accounting for a considerable share of the bill.

I looked into how to fix this and am writing up what I found.

Key concept

As a global policy in SageMaker Studio, install a script that enables idle termination when a user's JupyterServer is first created.

In more detail: register a configuration of type JupyterServer as a SageMaker Lifecycle Configuration, then apply it to the entire SageMaker Domain.

Registering the Lifecycle Configuration

Registration script

Add the configuration under the name install-autoshutdown-extension.

#!/bin/bash

# install_autoshutdown.sh is a file containing the Idle Stop script shown later in this post
LCC_CONTENT=$(openssl base64 -A -in install_autoshutdown.sh)

aws sagemaker create-studio-lifecycle-config \
--studio-lifecycle-config-name install-autoshutdown-extension \
--studio-lifecycle-config-content "$LCC_CONTENT" \
--studio-lifecycle-config-app-type JupyterServer
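
Before registering, it may be worth checking that the encoded script fits the length limit of the API. The following is a minimal sketch rather than part of the original setup: the helper names `encode_lcc` and `fits_lcc_limit` are hypothetical, it uses coreutils `base64` in place of `openssl` (the encoded text is the same), and it assumes the documented maximum of 16384 characters for the lifecycle configuration content.

```shell
#!/bin/bash
# Hypothetical helper: encode a script as single-line base64.
# Uses coreutils base64 plus tr instead of openssl; the encoded text is the same.
encode_lcc() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: succeed only if the encoded content fits the API limit.
# Assumption: 16384 is the maximum length of the lifecycle config content.
fits_lcc_limit() {
  local encoded
  encoded="$(encode_lcc "$1")"
  [ "${#encoded}" -le 16384 ]
}
```

For example, `fits_lcc_limit install_autoshutdown.sh && echo "OK to register"` could be run just before the registration script.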

Once registered, you can confirm in the console that the configuration has been added.
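
The same check can also be done from the command line. This is a small sketch, assuming the AWS CLI is configured for the same account and region; the function name `lcc_arn` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the ARN of a registered Studio lifecycle configuration.
# Assumes the AWS CLI is configured for the account and region that hold the config.
lcc_arn() {
  aws sagemaker describe-studio-lifecycle-config \
    --studio-lifecycle-config-name "$1" \
    --query 'StudioLifecycleConfigArn' \
    --output text
}
```

Calling `lcc_arn install-autoshutdown-extension` should print the ARN, which is the value needed when applying the configuration to the domain.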

Idle Stop script

I set the idle timeout to 60 minutes.

Reading through the script, its job is to start the idle_checker extension provided by AWS together with the Jupyter server at startup.

#!/bin/bash
# https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples/blob/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh
# This script installs the idle notebook auto-checker server extension to SageMaker Studio
# The original extension has a lab extension part where users can set the idle timeout via a Jupyter Lab widget.
# In this version the script installs the server side of the extension only. The idle timeout
# can be set via a command-line script which will also be created by this script and placed into the
# user's home folder
#
# Installing the server side extension does not require Internet connection (as all the dependencies are stored in the
# install tarball) and can be done via VPCOnly mode.

set -eux

# timeout in minutes
export TIMEOUT_IN_MINS=60

# Should already be running in user home directory, but just to check:
cd /home/sagemaker-user

# By working in a directory starting with ".", we won't clutter up users' Jupyter file tree views
mkdir -p .auto-shutdown

# Create the command-line script for setting the idle timeout
cat > .auto-shutdown/set-time-interval.sh << EOF
#!/opt/conda/bin/python
import json
import requests
TIMEOUT=${TIMEOUT_IN_MINS}
session = requests.Session()
# Getting the xsrf token first from Jupyter Server
response = session.get("http://localhost:8888/jupyter/default/tree")
# calls the idle_checker extension's interface to set the timeout value
response = session.post("http://localhost:8888/jupyter/default/sagemaker-studio-autoshutdown/idle_checker",
            json={"idle_time": TIMEOUT, "keep_terminals": False},
            params={"_xsrf": response.headers['Set-Cookie'].split(";")[0].split("=")[1]})
if response.status_code == 200:
    print("Succeeded, idle timeout set to {} minutes".format(TIMEOUT))
else:
    print("Error!")
    print(response.status_code)
EOF
chmod +x .auto-shutdown/set-time-interval.sh

# "wget" is not part of the base Jupyter Server image, you need to install it first if needed to download the tarball
sudo yum install -y wget
# You can download the tarball from GitHub or alternatively, if you're using VPCOnly mode, you can host on S3
wget -O .auto-shutdown/extension.tar.gz https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension/raw/main/sagemaker_studio_autoshutdown-0.1.5.tar.gz

# Or instead, could serve the tarball from an S3 bucket in which case "wget" would not be needed:
# aws s3 --endpoint-url [S3 Interface Endpoint] cp s3://[tarball location] .auto-shutdown/extension.tar.gz

# Installs the extension
cd .auto-shutdown
tar xzf extension.tar.gz
cd sagemaker_studio_autoshutdown-0.1.5

# Activate studio environment just for installing extension
# Default to "jupyter-server" when unset (no inner quotes, which would become part of the value)
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-jupyter-server}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
    eval "$(conda shell.bash hook)"
    conda activate studio
fi;
pip install --no-dependencies --no-build-isolation -e .
jupyter serverextension enable --py sagemaker_studio_autoshutdown
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
    conda deactivate
fi;

# Restarts the jupyter server
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver

# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30

# Calling the script to set the idle timeout and activate the extension
/home/sagemaker-user/.auto-shutdown/set-time-interval.sh
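
The least obvious step in the script above is how the generated Python helper pulls the `_xsrf` token out of the `Set-Cookie` header: it takes the first `;`-separated part and then the value after `=`. The same parsing can be sketched in shell for illustration only; the function name `xsrf_from_cookie` and the sample header in the usage note are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper mirroring the Python expression
#   response.headers['Set-Cookie'].split(";")[0].split("=")[1]
# Takes a Set-Cookie header value and prints the value of its first cookie.
xsrf_from_cookie() {
  printf '%s\n' "$1" | cut -d';' -f1 | cut -d'=' -f2
}
```

For a header such as `_xsrf=2|abc|def|123; Path=/jupyter/default`, this prints `2|abc|def|123`, which is the value the script passes as the `_xsrf` parameter.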

Applying to the entire SageMaker Domain

Fill in your own lifecycle configuration ARN and SageMaker Studio domain ID in the script below to apply it domain-wide.

#!/bin/bash

aws sagemaker update-domain --domain-id <SageMaker Studio domain ID> \
--default-user-settings '{
"JupyterServerAppSettings": {
  "DefaultResourceSpec": {
    "LifecycleConfigArn": "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension",
    "InstanceType": "system"
   },
   "LifecycleConfigArns": [
     "arn:aws:sagemaker:ap-northeast-2:<ACCOUNT_ID>:studio-lifecycle-config/install-autoshutdown-extension"
   ]
}}'
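
After the update, the domain defaults can be read back to confirm that the configuration is attached. This is a sketch under the assumption that the AWS CLI is configured for the same account and region; the function name `domain_lcc_arns` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the lifecycle config ARNs attached to the
# JupyterServer defaults of a Studio domain.
domain_lcc_arns() {
  aws sagemaker describe-domain \
    --domain-id "$1" \
    --query 'DefaultUserSettings.JupyterServerAppSettings.LifecycleConfigArns' \
    --output text
}
```

Running `domain_lcc_arns` with the domain ID should list the ARN of install-autoshutdown-extension if the update went through.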

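Applying to specific users only

If you only need to apply it to specific users, you can do so as shown below. Note that this example sets KernelGatewayAppSettings with a different lifecycle configuration; for the auto-shutdown setup described here, use JupyterServerAppSettings and the ARN of install-autoshutdown-extension instead.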

aws sagemaker update-user-profile --domain-id d-abc123 \
--user-profile-name my-existing-user \
--user-settings '{
"KernelGatewayAppSettings": {
  "LifecycleConfigArns":
    ["arn:aws:sagemaker:us-east-2:123456789012:studio-lifecycle-config/install-pip-package-on-kernel"]
  }
}'
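
For the auto-shutdown configuration specifically, the per-user call would target the JupyterServer settings rather than the KernelGateway ones. The following is a sketch that mirrors the domain-wide settings from earlier in this post; the function name `apply_lcc_to_user` is hypothetical, and the domain ID, user profile name, and ARN are supplied by the caller.

```shell
#!/bin/bash
# Hypothetical helper: attach a JupyterServer lifecycle config to one user profile.
# Arguments: domain ID, user profile name, lifecycle config ARN.
apply_lcc_to_user() {
  local domain_id="$1" user="$2" arn="$3"
  local settings
  # Same structure as the domain-wide default user settings, built for a single ARN.
  settings="$(printf '{"JupyterServerAppSettings":{"DefaultResourceSpec":{"LifecycleConfigArn":"%s","InstanceType":"system"},"LifecycleConfigArns":["%s"]}}' "$arn" "$arn")"
  aws sagemaker update-user-profile \
    --domain-id "$domain_id" \
    --user-profile-name "$user" \
    --user-settings "$settings"
}
```

A call would then look like `apply_lcc_to_user d-abc123 my-existing-user <lifecycle config ARN>`.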

Lifecycle Configuration deletion script

Use this if the need arises.

#!/bin/bash

aws sagemaker delete-studio-lifecycle-config \
--studio-lifecycle-config-name install-autoshutdown-extension
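
To confirm that the deletion took effect, the list of lifecycle configurations can be checked by name. This is a minimal sketch, assuming a configured AWS CLI; the function name `lcc_exists` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a lifecycle config with exactly this name exists.
lcc_exists() {
  local names name
  names="$(aws sagemaker list-studio-lifecycle-configs \
    --name-contains "$1" \
    --query 'StudioLifecycleConfigs[].StudioLifecycleConfigName' \
    --output text)"
  # --name-contains is a substring filter, so compare each returned name exactly.
  for name in $names; do
    if [ "$name" = "$1" ]; then
      return 0
    fi
  done
  return 1
}
```

After running the deletion script, `lcc_exists install-autoshutdown-extension` should return a non-zero status.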

Reference