Renewable Energy and High Performance Computing

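A drawback of generating electricity from renewable energy such as wind or solar is that the output is not constant. Put simply, when the wind does not blow or the weather hides the sun, generation drops, so these sources cannot serve as a stable energy supply. Producing a lot of power at times when it is not needed is also a problem. Lancium, a company in Texas, USA, wants to run high-performance computing data centers on surplus renewable energy.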

https://www.hpcwire.com/2022/08/02/inside-an-ambitious-play-to-shake-up-hpc-and-the-texas-grid/

Inside an Ambitious Play to Shake Up HPC and the Texas Grid With HPC demand ballooning and Moore’s law slowing down, modern supercomputers often undergo exhaustive efficiency efforts aimed at ameliorating exorbitant energy bills and correspondingly large carbon footprints. Others, meanwhile, are asking: is min-maxing the best option, or are there easier paths to reducing the bills and emissions of...

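West Texas reportedly has so much surplus wind generation capacity that producers end up paying to generate power. The article describes running data centers on this nearly free electricity. Lancium's CTO is Andrew Grimshaw, a former University of Virginia professor and longtime researcher in High Performance Computing. His LinkedIn profile describes the company as follows.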

At Lancium I am leading the effort to drastically reduce the carbon footprint of high performance computing by using intermittent renewable energy (wind and solar) to power our infrastructure. Due to the intermittent nature of wind and solar, the infrastructure must be able to deal with variations in power availability. To accomplish this we migrate jobs in "space and time". By "in space" we mean to migrate a running job from a site that is losing power, to a site that has plenty of power. By "in time" we mean suspend (persist to disk) jobs (and turn of the machine on which they are running), and then restart them later when power is available.

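He explains that he is working on using renewable energy to drastically reduce the carbon footprint of high performance computing (a term that usually refers to scientific and engineering computation and simulation).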

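High Performance Computing mostly runs long-running jobs rather than interactive ones. A job may take anywhere from tens of minutes or a few hours to a week or even a month. So if a data center can no longer stay powered because the renewable supply in its region has run short, its jobs must be moved to another data center in their current state and continue running there. Restarting them from scratch would not be acceptable.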
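The "space and time" decision from the quote above can be sketched as a small placement function. This is only an illustration, not Lancium's scheduler: the `Site` and `Job` classes, the `available_kw` and `required_kw` fields, and the `plan` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    available_kw: float  # hypothetical: power the site can currently supply


@dataclass
class Job:
    name: str
    site: str            # name of the site the job is running on
    required_kw: float   # hypothetical: power the job's machines draw


def plan(job, sites):
    """Return ("keep", site), ("migrate", site) or ("suspend", None)."""
    by_name = {s.name: s for s in sites}
    current = by_name[job.site]

    # The current site still has enough power: nothing to do.
    if current.available_kw >= job.required_kw:
        return ("keep", current.name)

    # Migrate "in space": move to the other site with the most power to spare.
    candidates = [
        s for s in sites
        if s.name != job.site and s.available_kw >= job.required_kw
    ]
    if candidates:
        best = max(candidates, key=lambda s: s.available_kw)
        return ("migrate", best.name)

    # Migrate "in time": no site can host the job now, so persist it to disk
    # and restart it later when power is available again.
    return ("suspend", None)
```

A real system would also have to weigh how long the checkpoint and transfer take against how long the power shortage is expected to last; the sketch ignores that.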

To migrate a job in its current state, that state must be recorded and either stored somewhere or sent over the network to another location. A recorded copy of a computer system's current state is called a checkpoint or snapshot.
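As a minimal sketch of the idea, the toy job below saves its own state after every step and resumes from the last checkpoint when it is started again. It is application-level checkpointing written for illustration only: the function names, the `stop_after` parameter (which simulates losing power), and the state layout are assumptions of this example. Transparent checkpointers capture the whole process instead, whereas this sketch saves only what the program chooses to save.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename it over the old checkpoint,
    # so an interruption mid-write never leaves a half-written file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # With no checkpoint on disk, start from the initial state.
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after simulates a power loss after that many steps in this run;
    the function then returns None instead of the final total.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # interrupted; progress so far is on disk
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        steps_this_run += 1
    return state["total"]
```

Calling `run` a second time with the same path continues from the saved step rather than from zero, which is the property a long-running HPC job needs in order to survive a shutdown or a move to another site.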

According to the HPCwire article, Lancium uses DMTCP, a checkpointing framework that supports MPI applications. Its brief introduction reads as follows.

DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints a single-host or distributed computation in user-space -- with no modifications to user code or to the O/S. It works on most Linux applications, including Python, Matlab, R, GUI desktops, MPI, etc. It is robust and widely used (on Sourceforge since 2007).

Lancium appears to target HPC workloads, while Facebook (Meta) recently published the paper Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models (USENIX). It explains that checkpoints are needed when deep learning training has to be moved, partway through, to another server or cluster.

In addition to failure recovery, checkpoints are needed for moving training processes across different nodes or clusters. This shift may be required in cases such as server maintenance (e.g. critical security patches that could not be postponed), hardware failures, network issues, and resource optimization/re-allocation.

Problems

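Frankly, however cheap the power is, I am not sure how cost-effective it is to run servers only while power is cheap and switch them off the rest of the time. The investment in servers, networking, cooling, and other data center equipment must be enormous. Is it a good idea to leave that equipment off just because electricity is expensive?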
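The trade-off in this question can be made concrete with a back-of-the-envelope model. The sketch below is my own simplification, not a figure from the article: it assumes hardware cost is amortized evenly over every wall-clock hour whether the server runs or not, and every number used with it is made up for illustration.

```python
def cost_per_compute_hour(capex_per_hour, power_kw, price_per_kwh, uptime_fraction):
    """Cost of one hour of actual computation on one server.

    capex_per_hour:  hardware cost amortized over every wall-clock hour.
    power_kw:        power drawn while the server is running.
    price_per_kwh:   electricity price paid while running.
    uptime_fraction: share of wall-clock hours the server is powered on.
    """
    if not 0 < uptime_fraction <= 1:
        raise ValueError("uptime_fraction must be in (0, 1]")
    # Idle hours still cost capex, so capex per *useful* hour grows as uptime falls.
    return capex_per_hour / uptime_fraction + power_kw * price_per_kwh


def break_even_uptime(capex_per_hour, power_kw, normal_price, cheap_price):
    """Uptime at which a cheap-power, part-time server matches an always-on one.

    Solves capex/u + power*cheap_price == capex + power*normal_price for u.
    """
    saving = power_kw * (normal_price - cheap_price)
    return capex_per_hour / (capex_per_hour + saving)
```

With hypothetical inputs of $0.50 per hour of amortized hardware and a 0.5 kW server, an always-on machine paying $0.08 per kWh costs about $0.54 per compute hour, while a machine running on free power only 80% of the time costs $0.625. Under these made-up numbers the free-power site would need to be up roughly 93% of the time to break even, which suggests why the answer depends heavily on how expensive the hardware is relative to the energy it uses.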

In particular, jobs that depend on data need that data to be kept synchronized across all data centers at all times. Maintaining this will also take considerable cost and effort. Can the energy savings offset the extra spending? Judging from the recent ISCA 2022 paper Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product, this may be a scenario worth considering for an organization with infrastructure on the scale of Facebook's.

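Perhaps less reliable but cheaper hardware could offset the cost of redundancy. Since hardware can be serviced or replaced while a data center is offline, the cost of hardware problems would likely be lower than with the conventional approach.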

Of course, energy prices will keep rising, so the outcome of this calculation will keep changing.

