How to Reduce Costs & Increase ROI of HPC Cloud Computing - Part II

This article is a follow-up to our previous piece, “How Much Does a CAE Job Cost You in the Cloud?” We will show that cloud resources running at 2X the speed of on-premises hardware, combined with one additional software license per engineer, can raise an engineer's productivity by 4X while saving the company $2.6 million compared with achieving the same result on premises. More available compute resources allow more simulations, which fosters innovation and yields better results, such as improved materials, geometries, physics, and functionality. This ultimately leads to a more efficient and cost-effective workflow.


In Part I of this series, we calculated the cost of an automotive CFD application with 100 million cells, simulated with Siemens STAR-CCM+ on 16 AMD-powered Azure HBv3 compute nodes. Several readers were amazed, even incredulous, at how little such a simulation costs. The run took 30 hours on one Azure HBv3 node and 1.5 hours on an HPC cluster of 16 HBv3 nodes, and the latter cost just $122.46!


The cost of such a complex simulation is astonishingly low. Imagine that an innovative idea for a complex scenario can be tested immediately by running a 16-node simulation within 1.5 hours, for $122.46! This is "innovation at your fingertips" at a very low cost, and it can significantly influence the quality of the company's next-generation product, its cost savings, and its return on investment (ROI).


One reader argued that using 16 compute nodes would be more expensive than just running this simulation on one compute node. Let’s examine that (using the Bill of Materials list from the previous article):

Figure 1: Simulation times and cost for 1, 2, 4, 8, and 16 HBv3 nodes, running a STAR-CCM+ simulation for a 100 million cell automotive CFD application.


As you can see, the opposite is true: decreasing the number of compute nodes makes your parallel simulation runs slightly more expensive. In addition, on fewer nodes the job takes much longer to complete. This effect holds for all application codes that scale well as the number of nodes increases. Many CFD codes, such as STAR-CCM+, Fluent, and CFX, belong to this group of highly scalable codes, while Finite Element Analysis codes, especially the implicit ones, tend to exhibit limited scalability, sometimes only up to 2-4 nodes. Even in those cases, it makes sense to benchmark these codes on suitable compute instances to identify the best-suited instance and the scalability sweet spot.
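To make the scaling argument concrete, here is a minimal Python sketch that uses only the two run times quoted above (30 hours on 1 node, 1.5 hours on 16 nodes). It compares billable node-hours and derives the parallel efficiency. The function names are illustrative, and the sketch deliberately leaves out the per-node-hour price, license fees, and the 2-, 4-, and 8-node data points from Figure 1, since they are not stated in the text.

```python
# Wall-clock hours quoted in the article for the 100-million-cell STAR-CCM+ case.
RUN_HOURS = {1: 30.0, 16: 1.5}  # number of HBv3 nodes -> hours


def node_hours(nodes: int, hours: float) -> float:
    """Billable compute: node count multiplied by wall-clock hours."""
    return nodes * hours


def parallel_efficiency(nodes: int, hours: float, base_hours: float) -> float:
    """Speedup over the single-node run, divided by the node count."""
    return (base_hours / hours) / nodes


nh_1 = node_hours(1, RUN_HOURS[1])
nh_16 = node_hours(16, RUN_HOURS[16])
eff_16 = parallel_efficiency(16, RUN_HOURS[16], base_hours=RUN_HOURS[1])

print(f"1 node:   {nh_1:.0f} node-hours")
print(f"16 nodes: {nh_16:.0f} node-hours, efficiency {eff_16:.0%}")
```

With these inputs the 16-node run consumes 24 node-hours against 30 for the single node, which is why the larger cluster comes out cheaper as long as billing is proportional to node-hours.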

Now we will go one step further. In this article, we investigate how using more computing resources in the cloud can reduce the company’s overall costs and dramatically increase the productivity of its engineers.


Comparing Total Cost of Cloud with On-Premises Simulations

One of our large customers in Silicon Valley currently has 250 simulation engineers using the UberCloud Engineering Simulation Platform on a public cloud infrastructure, with more than 30 different UberCloud application software containers. In 2022, these engineers consumed HPC cloud resources amounting to $12.5 million, or $50K of cloud consumption per engineer, which is equivalent to each engineer running an average of about one simulation job per day. This includes evaluating the results, refining geometry and physics, storing data in pre-assigned storage in the cloud, and summarizing the results in management reports.
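The per-engineer figure follows directly from the numbers above, as this short Python sketch shows. The value of 250 working days per year is an assumption introduced here only to translate "about one simulation job per day" into a rough per-job budget; it is not a figure from the customer.

```python
ANNUAL_CLOUD_SPEND_USD = 12_500_000  # from the article (2022)
ENGINEERS = 250                      # from the article
WORKING_DAYS_PER_YEAR = 250          # assumption, not from the article

# Average annual cloud consumption per engineer.
spend_per_engineer = ANNUAL_CLOUD_SPEND_USD / ENGINEERS

# Rough budget per job if each engineer runs one job per working day.
spend_per_job = spend_per_engineer / WORKING_DAYS_PER_YEAR

print(f"Per engineer per year: ${spend_per_engineer:,.0f}")
print(f"Per job (assumed {WORKING_DAYS_PER_YEAR} jobs/year): ${spend_per_job:,.0f}")
```

Under that assumption the average budget per job is on the order of $200, the same order of magnitude as the $122.46 run from Part I.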


This customer use case can be used to extrapolate the cost for a moderately sized group of, say, 10 senior simulation engineers (assuming they are what we like to call ‘power users’). The estimated annual cost for such a group on premises is:

$2,000K in salaries and incidental costs for 10 engineers

$800K for 10 annual simulation software licenses (e.g. Ansys, Siemens, Dassault)

$200K for one HPC support specialist (IT)

$500K estimated TCO (Total Cost of Ownership) for an HPC server for 10 engineers

Total cost on-prem = $3.5 million

The estimated annual cost in the cloud is:

$2,000K in salaries and incidental costs for 10 engineers

$800K for 10 annual simulation software licenses (e.g. Ansys, Siemens, Dassault)

$150K for UberCloud software licenses, support, and maintenance

$500K for cloud consumption for 10 engineers

Total cost cloud = $3.45 million


This comparison shows that costs in the cloud are similar to the costs on premises. An interesting observation is that the annual cost of UberCloud’s software licenses, services, support, and maintenance ($150K) is just under 20% of the group's simulation software cost ($800K). As the next section shows, for this additional investment of roughly 20% of the simulation software cost, an engineer can become 100% (or more) more productive.
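The two cost breakdowns can be checked with a few lines of Python. This is only a sketch of the arithmetic: the dictionary keys are illustrative labels, and all amounts are the article's estimates in thousands of US dollars.

```python
# Estimated annual cost for 10 engineers, in thousands of US dollars (K$).
on_prem = {
    "salaries_and_incidentals": 2000,
    "cae_licenses": 800,
    "hpc_support_specialist": 200,
    "hpc_server_tco": 500,
}
cloud = {
    "salaries_and_incidentals": 2000,
    "cae_licenses": 800,
    "ubercloud_platform": 150,
    "cloud_consumption": 500,
}

total_on_prem = sum(on_prem.values())
total_cloud = sum(cloud.values())

# Platform cost relative to the simulation software license cost.
platform_share = cloud["ubercloud_platform"] / cloud["cae_licenses"]

print(f"On-prem: ${total_on_prem}K, cloud: ${total_cloud}K")
print(f"Platform cost as share of CAE licenses: {platform_share:.2%}")
```

The totals come to $3,500K and $3,450K, and the platform share works out to 18.75%, which the text rounds to 20%.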

Figure 2: STAR-CCM+ benchmark results for Azure HBv2, HBv3, and HBv4, compared with a 4-year old HPC server based on Intel Skylake CPUs.


Cost Savings With Faster and More Compute Resources

Hyperscalers like AWS, Azure, and Google Cloud update their computing infrastructure once or twice a year on average, and at least with every new processor generation. A good example is Azure's HB series of high-performance cloud instances, which has been updated with each generation of AMD EPYC processors (HBv2 with Rome 2021, HBv3 with Milan 2022, HBv4 with Genoa 2023). Figure 2 above shows STAR-CCM+ benchmark results for Azure HBv2, HBv3, and HBv4, compared with a 4-year-old HPC server based on Intel Skylake CPUs.


In the following simplified analysis, we assume that compute nodes (based on cloud instances) are just 2X faster than the on-prem HPC cluster. Cloud hardware that is 2X faster saves 50% of the CAE license cost and makes an engineer 2X more productive (in this example, an engineer can run two jobs in the cloud in the time it takes to run one job on-prem, both with just one software license). At the same time, however, the engineer consumes twice as many resources in the cloud. To achieve the same 2X productivity increase on-prem, you would have to hire one additional engineer for each existing engineer, at $200K per year each, and add more compute resources to your high-performance cluster.


Now, a simple calculation for 10 engineers shows:

2X faster cloud hardware saves 50% of the CAE software license cost = -$400K

2X faster cloud = 2X more productive engineers, saves = -$2,000K (by not hiring another 10 engineers)

2X more productive engineers = 2X more cloud consumption, costs = +$500K

Additional UberCloud license cost = $0.00

Total = -$1.9 million


This results in savings of $1.9 million. In other words, $1.9 million is what you would have to spend to achieve the same 2X productivity and results on premises as in the cloud.
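The first step of the calculation can be written out as a small Python sketch. The variable names are illustrative, the amounts are the article's estimates in thousands of US dollars, and negative values denote savings relative to reaching the same 2X productivity on premises.

```python
ENGINEERS = 10
CAE_LICENSES_K = 800          # 10 annual simulation software licenses
ENGINEER_COST_K = 200         # salary and incidentals per engineer
CLOUD_PER_ENGINEER_K = 50     # annual cloud consumption per engineer

# Negative = avoided cost, positive = additional cost (all in K$).
license_saving = -CAE_LICENSES_K // 2           # 50% of CAE license cost
avoided_hires = -ENGINEERS * ENGINEER_COST_K    # no second team of 10
extra_cloud = ENGINEERS * CLOUD_PER_ENGINEER_K  # 2X jobs = 2X consumption
extra_platform = 0                              # no extra UberCloud license

net_2x = license_saving + avoided_hires + extra_cloud + extra_platform
print(f"Net effect of 2X faster cloud hardware: {net_2x}K")
```

The net effect is -1900, matching the $1.9 million stated above.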

Now, in addition to 2X faster cloud resources (resulting in 2X more productive engineers), let’s invest in one additional simulation software license of $80K for each engineer, which makes engineers another 2X more productive (resulting in a 4X productivity boost). On the other hand, cloud consumption also increases by 2X. Again, a simple calculation for 10 engineers shows:


One additional simulation software license for each engineer = 10 x $80K = + $800K

Additional cloud consumption = 10 x $50K = + $500K

Another 2X more productive engineers, saves = -$2,000K

Additional UberCloud license cost = $0.00

Total = -$700K


Total savings from 2X faster cloud hardware and 2 software licenses per engineer = -$1.9 million (2X faster hardware) - $700K (one additional software license per engineer) = -$2.6 million. In other words, $2.6 million is what you would have to spend to achieve the same 4X productivity and results on premises as in the cloud.
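The second step and the combined total can be sketched the same way in Python. Names are illustrative, amounts are the article's estimates in thousands of US dollars, and the -1900 for the first step is taken from the calculation above rather than recomputed.

```python
ENGINEERS = 10
EXTRA_LICENSE_K = 80       # one additional CAE license per engineer
CLOUD_PER_ENGINEER_K = 50  # additional annual cloud consumption per engineer
ENGINEER_COST_K = 200      # salary and incidentals per engineer

NET_2X_HARDWARE_K = -1900  # result of the first step, from the article

# Negative = avoided cost, positive = additional cost (all in K$).
extra_licenses = ENGINEERS * EXTRA_LICENSE_K     # +800
extra_cloud = ENGINEERS * CLOUD_PER_ENGINEER_K   # +500
avoided_hires = -ENGINEERS * ENGINEER_COST_K     # -2000
extra_platform = 0                               # no extra UberCloud license

net_second_license = extra_licenses + extra_cloud + avoided_hires + extra_platform
net_total = NET_2X_HARDWARE_K + net_second_license

print(f"Second license per engineer: {net_second_license}K")
print(f"Combined 4X scenario:        {net_total}K")
```

This reproduces the -$700K for the second step and the combined -$2.6 million.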


In summary, for our example above, 2X faster cloud resources and one additional software license per engineer make engineers 4X more productive and save the company $2.6 million that it would otherwise have to spend to achieve the same 4X on premises. This example calculation can easily be extrapolated to more engineers, more cloud resources, and more software licenses, which will result in increasing cost savings compared to the on-prem model.

Finally, a 4X productivity increase, as described above, sounds high but is still conservative. Several of our customers achieve much higher productivity gains for their engineers. The Silicon Valley customer mentioned at the beginning reports up to a 42X productivity gain in the cloud, achieved by using GPUs for a consumer electronics simulation code that was developed with GPUs in mind.

