WRF on the Cloud
Objectives
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:
- Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications
- Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
- Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud
Call Notes
November 28, 2018
WRF Benchmarking
- Emulating the 2016 WRF configuration on the 12/4/1.3-km nested grids
- Purpose: estimate CPU, RAM, and storage requirements for costing
- CPU: a 5.5-day run takes ~4 days of wall-clock time on 8 cores, or ~3 days on 24 cores
- RAM: ~22 GB per run (~2.5 GB/core)
- Storage
  - Tested netCDF4 (compressed) output against classic netCDF with no compression
  - Compression saves a lot of space: output is roughly 1/3 the size of uncompressed netCDF (~70% savings); see the sketch after this list
  - Downstream programs need to be linked against HDF5 and netCDF4 libraries built with compression support
  - Estimated ~5.8 TB for the year with compression, growing to ~16.9 TB without it
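For reference, a minimal sketch of that kind of re-compression using the netCDF4-python package; the file names and deflation level are illustrative assumptions, not the settings Ramboll benchmarked:

```python
# Sketch: copy an uncompressed WRF output file into a netCDF-4 file with
# zlib deflation, the conversion behind the ~70% storage savings noted above.
# File names and complevel are placeholders.
from netCDF4 import Dataset

src = Dataset("wrfout_d01_uncompressed.nc", "r")
dst = Dataset("wrfout_d01_compressed.nc", "w", format="NETCDF4")

# Copy global attributes and dimensions
dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
for name, dim in src.dimensions.items():
    dst.createDimension(name, None if dim.isunlimited() else len(dim))

# Copy each variable with deflation and shuffle enabled
for name, var in src.variables.items():
    out = dst.createVariable(name, var.datatype, var.dimensions,
                             zlib=True, complevel=4, shuffle=True)
    out.setncatts({a: var.getncattr(a) for a in var.ncattrs()
                   if a != "_FillValue"})  # _FillValue must be set at creation
    out[:] = var[:]

src.close()
dst.close()
```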
Conceptual Approach to WRF on the Cloud
- Cluster management would launch a head node and compute nodes
- 77 chunks of 5.5 days each; roughly 20 machines for 16 days, or 80 machines for 4 days (see the arithmetic check after this list)
- Head node running constantly
- Compute nodes run only for the length of the project
- Memory optimized machines performed better than compute optimized for CAMx
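A quick arithmetic check of that sizing, using the 8-core benchmark above (~4 days of wall clock per 5.5-day chunk); the numbers come from the call notes, the script is only illustrative:

```python
# Rough cluster-sizing check using the benchmark figures from the call notes.
chunks = 77               # 5.5-day simulation chunks covering the annual run
wall_days_per_chunk = 4   # ~4 days wall clock per chunk on an 8-core machine

machine_days = chunks * wall_days_per_chunk   # 308 machine-days of compute

for machines in (20, 80):
    print(f"{machines} machines -> ~{machine_days / machines:.1f} days")
# 20 machines -> ~15.4 days (about 16); 80 machines -> ~3.9 days (about 4)
```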
Storage Analysis
- AWS
  - Don't want to use local (instance) storage because the data would need to be moved/migrated off the nodes
  - Put the data on a storage appliance (S3) while running, then push it off to longer-term storage (Glacier); see the sketch after this list
  - Glacier is archival storage: access requests are submitted through the console, with listed response times of 1-5 minutes
- Azure
  - Fast and slower data lake storage tiers for offline data
  - Managed disks for online storage
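A minimal sketch of that AWS staging-then-archiving pattern using boto3; the bucket name, key prefix, and 30-day transition window are placeholder assumptions, not project decisions:

```python
# Sketch: stage WRF output in S3 during the run, then let a lifecycle rule
# transition it to Glacier for long-term archival. Names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "ladco-wrf-output"   # hypothetical bucket name

# Stage a finished output file in S3 while the run continues
s3.upload_file("wrfout_d01_2016-01-01", bucket, "wrfout/wrfout_d01_2016-01-01")

# Lifecycle rule: move objects under wrfout/ to Glacier after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-wrf-output",
            "Filter": {"Prefix": "wrfout/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```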
Data Transfer Analysis
- Estimates based on the ~5.8 TB compressed annual output
- AWS
  - Internet transfer would cost ~$928 for the ~5.5 TB of output (see the transfer-time estimate after this list)
  - Snowball: ~10 days to get the data off on a physical device; ~$200 for the entire WRF run (smallest appliance was 50 TB)
- Azure
  - Online transfer
  - Data Box option (similar to Snowball)
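For context on the online-versus-appliance tradeoff, a rough transfer-time estimate for the ~5.8 TB dataset at a few assumed sustained bandwidths; these are illustrative rates, not measured ones:

```python
# Illustrative online transfer times for ~5.8 TB at assumed sustained bandwidths.
dataset_tb = 5.8
dataset_bits = dataset_tb * 1e12 * 8   # decimal TB -> bits

for label, bps in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    days = dataset_bits / bps / 86400
    print(f"{label}: ~{days:.1f} days")
# 100 Mbps: ~5.4 days; 1 Gbps: ~0.5 days; 10 Gbps: ~0.05 days
```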
Cluster Management Tools (interface analysis)
- Evaluated several options; 3-4 tools seemed to work best across multiple cloud providers
- Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a TORQUE queuing system; had trouble using a custom AMI (an AMI must be paid for with this solution); it can use Docker for containers, but Ramboll is not positioned to use containers for this project
- CFN (CfnCluster): development had slowed, but it has been reincarnated as AWS ParallelCluster with improved tools; it is published in the Python Package Index (installable with pip), lets you spin everything up from the command line, and could be scripted
- Haven't yet explored AWS ParallelCluster/CfnCluster in detail; it is similar to past experience with StarCluster; it seems to be the best solution because you can use your own custom AMI, and instance types are independent of the cluster management tool
Next Steps
- LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
- LADCO to create a login for Ramboll in our AWS organization
- Ramboll to explore AWS ParallelCluster and then prototype with the LADCO WRF AMI
- Next call: 12/5 @ 3:00 Central