WRF on the Cloud

= Objectives =
 
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:  
 
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud
  
= Call Notes =
== November 28, 2018 ==
=== WRF Benchmarking ===
* Emulating the WRF 2016 12/4/1.3 km grids
* Purpose: estimate costs for CPUs, RAM, and storage
* CPU: 8 cores: a 5.5-day run completes in 4 wall-clock days; 24 cores: 3 days
* RAM: ~22 GB per run (2.5 GB/core)
* Storage
** Tested netCDF4 with compression against netCDF with no compression
** Compression saves a lot of space relative to uncompressed netCDF: output is about 1/3 the size (~70% compression); see the sketch after this list
** The HDF5 and netCDF4 libraries with compression need to be linked into downstream programs
** Estimate about 5.8 TB for the year; grows to 16.9 TB without compression
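The compression comparison above can be exercised outside of WRF itself. The following is a minimal sketch, assuming the netCDF4-python bindings are available; the same conversion can also be done with the stock netCDF <code>nccopy -d 4</code> utility. The file names are hypothetical placeholders.

<syntaxhighlight lang="python">
# Minimal sketch: rewrite an uncompressed WRF output file with deflate (zlib)
# compression via the netCDF4-python library. File names are hypothetical.
from netCDF4 import Dataset

with Dataset("wrfout_d01_2016-01-01_uncompressed.nc") as src, \
     Dataset("wrfout_d01_2016-01-01_compressed.nc", "w", format="NETCDF4") as dst:

    # Copy dimensions, preserving the unlimited (Time) dimension
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if dim.isunlimited() else len(dim))

    # Copy each variable with compression enabled (complevel 4 is a reasonable default)
    for name, var in src.variables.items():
        out = dst.createVariable(name, var.datatype, var.dimensions,
                                 zlib=True, complevel=4)
        # Copy variable attributes (skip _FillValue, which must be set at creation)
        out.setncatts({a: var.getncattr(a) for a in var.ncattrs()
                       if a != "_FillValue"})
        out[:] = var[:]

    # Copy global attributes
    dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
</syntaxhighlight>

Deflate level 4 is a middle ground between compression ratio and write speed; downstream programs only need HDF5/netCDF-4-enabled builds to read the result.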
  
=== Conceptual Approach to WRF on the Cloud ===
 
* Cluster management would launch a head node and compute nodes
* 77 5.5-day chunks: 20 computers for 16 days, or 80 computers for 4 days (see the sketch after this list)
* Head node running constantly
* Compute nodes running over the length of the project
* Memory-optimized machines performed better than compute-optimized machines for CAMx
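The machine-count trade-off above is simple arithmetic on the benchmarking numbers. The sketch below just makes the assumptions explicit: 77 chunks, ~4 wall-clock days per chunk from the 8-core benchmark, and one chunk per machine at a time.

<syntaxhighlight lang="python">
# Back-of-the-envelope schedule estimate for the annual WRF run, using the
# benchmark numbers above: 77 chunks of 5.5 simulated days, ~4 wall-clock
# days per chunk on an 8-core machine, one chunk per machine at a time.
import math

N_CHUNKS = 77
DAYS_PER_CHUNK = 4.0  # wall-clock days per 5.5-day chunk (8-core benchmark above)

def campaign_days(n_machines: int) -> float:
    """Wall-clock days to finish all chunks, one chunk per machine at a time."""
    rounds = math.ceil(N_CHUNKS / n_machines)  # machines work through chunks back to back
    return rounds * DAYS_PER_CHUNK

for n in (20, 80):
    print(f"{n:3d} machines -> ~{campaign_days(n):.0f} wall-clock days "
          f"(~{N_CHUNKS * DAYS_PER_CHUNK:.0f} compute-node days either way)")
# 20 machines -> ~16 wall-clock days; 80 machines -> ~4 wall-clock days
</syntaxhighlight>

Either way the total compute is the same (~308 node-days); the choice mainly trades head-node/queue runtime against how many instances run concurrently.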
=== Storage Analysis ===
* AWS
** Don't want to use local (instance) storage because it will need to be moved/migrated
** Put the data on a storage appliance (S3) while running, then push it off to longer-term storage (Glacier); see the sketch after this list
** Glacier is archival; access must be requested through the console, with response times listed as 1-5 minutes
* Azure
** Fast and slower data lake storage for offline
** Managed disks for online
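A minimal sketch of the AWS flow described above (S3 while running, Glacier afterwards) using boto3. The bucket name, prefix, 30-day transition window, and file name are placeholder assumptions, not decisions from the call.

<syntaxhighlight lang="python">
# Minimal sketch of the AWS flow above: push WRF output to S3 during the run,
# and let a lifecycle rule transition it to Glacier for long-term storage.
# Bucket name, prefix, file name, and the 30-day window are placeholders.
import boto3

BUCKET = "ladco-wrf-output"   # hypothetical bucket name
PREFIX = "wrf2016/"

s3 = boto3.client("s3")

# 1) Upload a finished output file from a compute node
s3.upload_file("wrfout_d01_2016-01-01_compressed.nc",
               BUCKET, PREFIX + "wrfout_d01_2016-01-01_compressed.nc")

# 2) Lifecycle rule: transition objects under the prefix to Glacier after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "wrf-output-to-glacier",
            "Filter": {"Prefix": PREFIX},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
</syntaxhighlight>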
=== Data Transfer Analysis ===
* Estimate based on ~5.8 TB of compressed annual output
* AWS
** Internet transfer will cost ~$928 for 5.5 TB (see the arithmetic sketch after this list)
** Snowball: about 10 days to get the data off via shipped disk; costs ~$200 for the entire WRF run (the smallest appliance was 50 TB)
* Azure
** Online transfer
** Data Box option (similar to Snowball)
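The internet-versus-Snowball choice comes down to arithmetic on volume, effective egress price, and available bandwidth. The sketch below only works through that comparison using the estimates above; the sustained bandwidth is a placeholder assumption, not a measured value.

<syntaxhighlight lang="python">
# Illustration of the internet-vs-Snowball arithmetic. The dollar figures come
# from the estimates above; the sustained bandwidth is a placeholder assumption.
DATA_TB = 5.5               # volume behind the ~$928 internet-transfer estimate
QUOTED_EGRESS_USD = 928.0   # AWS internet egress estimate from above
SNOWBALL_USD = 200.0        # flat Snowball cost from above (~10-day turnaround)
LINK_MBPS = 500             # assumed sustained outbound bandwidth (placeholder)

data_gb = DATA_TB * 1024
implied_rate = QUOTED_EGRESS_USD / data_gb               # effective $/GB in the quote
transfer_days = data_gb * 8 * 1024 / LINK_MBPS / 86400   # GB -> megabits -> days

print(f"Internet: ~${QUOTED_EGRESS_USD:.0f} (~${implied_rate:.2f}/GB), "
      f"~{transfer_days:.1f} days at {LINK_MBPS} Mbps")
print(f"Snowball: ~${SNOWBALL_USD:.0f} flat, ~10 day turnaround (50 TB minimum appliance)")
</syntaxhighlight>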
  
=== Cluster Management Tools (interface analysis) ===
* 3-4 tools seemed to work best across several cloud solutions
* Alces Flight (works on AWS and Azure): used to bring up 40 nodes and set up a Torque queuing system; trouble with using an AMI (need to pay for an AMI with this solution); can use Docker if we want containers, but Ramboll is not positioned to use containers for this project
* CfnCluster: slower development, but it has been reincarnated as AWS ParallelCluster, with improved tools, and it is in the Python Package Index (can be installed with pip); lets you spin everything up from the command line and could be scripted (see the sketch after this list)
* Haven't yet explored AWS ParallelCluster/CfnCluster in detail; similar to experience with StarCluster; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tools
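Because AWS ParallelCluster installs from PyPI and is driven from the command line, the bring-up and tear-down can be scripted. The sketch below wraps the CLI from Python; it assumes the v2-era <code>pcluster</code> subcommands and a <code>~/.parallelcluster/config</code> that already points at the custom WRF AMI, and the cluster name is hypothetical.

<syntaxhighlight lang="python">
# Sketch of scripting the AWS ParallelCluster CLI (pip install aws-parallelcluster)
# from Python, assuming the v2-era "pcluster" subcommands and a
# ~/.parallelcluster/config that already names the custom WRF AMI.
import subprocess

CLUSTER = "ladco-wrf-test"   # hypothetical cluster name

def run(cmd):
    """Echo a command, run it, and fail loudly if it returns nonzero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Bring up the head node and compute fleet described in the config
run(["pcluster", "create", CLUSTER])

# Interactive login to the head node; WRF chunk jobs are then submitted to the
# cluster's scheduler (e.g. with qsub) from there
run(["pcluster", "ssh", CLUSTER])

# Once the chunk output has been pushed off to S3, tear the cluster down
run(["pcluster", "delete", CLUSTER])
</syntaxhighlight>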
=== Next Steps ===
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
* LADCO to create a login for Ramboll in our AWS organization
* Ramboll to explore AWS ParallelCluster and then prototype with the LADCO WRF AMI
* Next call 12/5 @ 3 Central
