The Next Generation of Computing Scalability
Jul 19, 2012
Editor's Note: (This article was updated in January 2013 to reflect the name of AIR's new catastrophe modeling platform, Touchstone™.) Touchstone offers a degree of scalability unimaginable just a few years ago. In this article, AIR Vice President of Technical Services Peter Lewis and Vice President and Chief Software Architect Boris Davidson explain how the adoption of Microsoft High Performance Computing technology is making this possible.
As catastrophe models increase in complexity and sophistication and as companies seek faster and more advanced analytics, AIR continues to improve the technology infrastructure on which our software platforms run. With the release of Touchstone™, AIR has built a new core analytical platform and adopted Microsoft High Performance Computing (HPC) server technology. The new HPC-based architecture, in combination with many other design innovations, vastly improves the performance and scalability of loss analytics in Touchstone. This translates to more analyses in far less time, enabling faster and better-informed underwriting decisions and improved sensitivity analyses.
The Evolution of Job Management at AIR
AIR's software applications have continually increased in detail and sophistication, and with that growth have come increased demands on computing resources. Recognizing that catastrophe modeling needed to expand beyond a single machine and single engine, AIR implemented distributed analysis processing via an internally developed Job Manager component in CLASIC/2™. This was the very first step toward cluster-based computing capabilities for AIR software products. The Job Manager at the time allowed catastrophe modeling jobs to be distributed across multiple machines (a single engine per machine), which improved performance and enabled companies to better utilize their hardware resources. When multi-core processors became more prevalent in the mid-2000s, companies found that they had more processing power under the hood but no good way to take advantage of it, because software had not originally been designed to leverage the additional cores.
AIR recognized that large computationally intensive jobs would greatly benefit not only from being distributed across multiple machines, but also distributed across multiple cores on each machine. This capability became available in Job Manager 2.0, which was released in 2006. This update significantly enhanced CLASIC/2's resource management capabilities and allowed a single catastrophe modeling job to be distributed to up to 32 cores, providing dramatic performance improvements at the time. But as the resolution of models continued to increase and exposure data sets grew ever larger, AIR began to explore additional approaches to achieve even greater scalability.
To that end, for Touchstone, AIR has adopted the Microsoft HPC Server product as the underlying cluster computing technology. Moreover, in order to meet new functional and performance requirements of Touchstone, AIR engineers extended the HPC Server framework with a number of innovative solutions to support highly scalable orchestration of analysis tasks, distributed data staging, operational logging, progress tracking, and failure handling. This new HPC-based analysis workflow architecture results in significant improvements in job scalability, cluster resource utilization, and overall system throughput.
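The orchestration ideas named above (splitting a job into tasks, tracking progress, and handling failures) can be illustrated with a minimal sketch. This is not AIR's implementation and does not use the Microsoft HPC Server API; it assumes Python's standard concurrent.futures pool as a stand-in for a cluster scheduler, and the names simulate_losses, partition, and run_job are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate_losses(event_ids):
    """Hypothetical stand-in for one analysis task: a toy loss per event."""
    return sum((event_id % 7) * 1000.0 for event_id in event_ids)


def partition(items, n_tasks):
    """Split items into n_tasks near-equal contiguous chunks."""
    size, extra = divmod(len(items), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_job(event_ids, n_tasks, task=simulate_losses,
            executor_cls=ThreadPoolExecutor, max_retries=2, on_progress=None):
    """Run one job as n_tasks parallel tasks and return the combined loss.

    Assumption: a local worker pool plays the role of the cluster scheduler.
    A failed task is resubmitted up to max_retries times before the job fails.
    """
    chunks = partition(event_ids, n_tasks)
    results = [None] * len(chunks)
    attempts = [0] * len(chunks)
    done = 0
    with executor_cls(max_workers=n_tasks) as pool:
        pending = {pool.submit(task, chunk): i for i, chunk in enumerate(chunks)}
        while pending:
            # Iterate over a snapshot so resubmitted tasks are picked up
            # by the next pass of the outer loop.
            for future in as_completed(list(pending)):
                i = pending.pop(future)
                try:
                    results[i] = future.result()
                except Exception:
                    attempts[i] += 1
                    if attempts[i] > max_retries:
                        raise
                    pending[pool.submit(task, chunks[i])] = i
                    continue
                done += 1
                if on_progress:
                    on_progress(done, len(chunks))
    return sum(results)
```

Calling run_job(list(range(100)), 4) runs the toy analysis as four tasks and returns the same total as a single call to simulate_losses. For CPU-bound work on one machine, passing executor_cls=ProcessPoolExecutor would spread tasks across cores; on a real cluster, a job scheduler would take the place of the pool.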
A Visit to Redmond
The Microsoft HPC Server system provides a degree of scalability that was virtually unimaginable a few years ago. With the right hardware in place, large, computationally complex jobs can now be scaled in parallel across hundreds of cores. Once the decision had been made to move forward with Microsoft's HPC Server technology, AIR's engineering team developed a proof of concept demonstrating that Touchstone would not only exceed the capabilities of AIR's Job Manager 2.0 but also substantially exceed the performance and scalability of the current generation of catastrophe modeling software.
In March 2012, AIR technical staff spent a week at Microsoft's campus in Redmond, Washington. While there, AIR stress-tested Touchstone in collaboration with Microsoft Performance Engineers to ensure that modeling runs performed using Touchstone were able to successfully scale across dozens and even hundreds of cores simultaneously.
AIR has developed a strong and ongoing business partnership with Microsoft, with both parties equally committed to meeting the needs of AIR clients. The partnership has helped us to ensure that our implementation of HPC is fully vetted by the experts at Microsoft and seamlessly scales across the large computing clusters used by many of our clients.
In Redmond, AIR conducted large-scale performance testing on physical servers. These tests resulted in a number of additional optimizations subsequently implemented within the Touchstone system. Furthermore, extensive performance testing has been conducted on AIR's in-house virtualized infrastructure.
What Does Scalability Mean for You?
Industry trends indicate that portfolio sizes are growing and that companies are increasingly opting to run larger event sets to better capture tail risk. Companies also want to perform sensitivity analyses more frequently and to make underwriting decisions in real time. To meet this growing demand for modeling throughput, companies will continue to expand their computing infrastructure with the expectation that modeling software applications will be able to fully leverage it.
AIR's Touchstone is able to utilize all available cores and distribute analysis workloads seamlessly, allowing unparalleled efficiency in hardware resource utilization and business workflows. With Touchstone, companies can seamlessly perform simultaneous analyses and generate results at different levels of detail and with different loss perspectives, depending on the job requirements. And it's not just the modeling runs that can be scaled across multiple cores. Large data imports that used to take hours can now be effectively distributed across multiple cores, dramatically reducing the amount of time spent getting data into AIR software.
AIR continues to work closely with Microsoft and considers our work on Touchstone to be part of a long-term collaboration. As with any major endeavor, it is critical to go from concepts and ideas to implementation and reality. We've done that with Touchstone. But this in no way means that we are done—just as Rome was not built in a day, neither will Touchstone be.