Artificial Intelligence (AI) techniques can be used to design and build intelligent "agents" that can accomplish specific tasks efficiently. AIR is pioneering the application of AI techniques to solve portfolio optimization problems for the insurance industry using a branch of AI known as Reinforcement Learning (RL). RL methodologies are commonly used in the field of robotics, but they are also being adapted and applied to address large-scale and complex optimization problems.
Sequential Decision-Making
Central to achieving a satisfactory solution to a practical problem is deciding how to frame the problem. The portfolio optimization problem is frequently formulated as a "0-1 knapsack problem," which is a type of "NP-hard problem" (nondeterministic polynomial-time hard problem, among the most computationally demanding problem categories in computational complexity theory).
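To make the 0-1 knapsack framing concrete, here is a minimal brute-force sketch with invented numbers (the item values, weights, and capacity are illustrative only; real portfolios are far too large for exhaustive search, which is why heuristic and RL methods are needed):

```python
from itertools import combinations

# Toy 0-1 knapsack: each item is either taken (1) or not (0); choose the
# subset that maximizes total value without exceeding the capacity.
values = [60, 100, 120]
weights = [10, 20, 30]
capacity = 50

best_value, best_subset = 0, ()
for r in range(len(values) + 1):
    for subset in combinations(range(len(values)), r):
        w = sum(weights[i] for i in subset)
        v = sum(values[i] for i in subset)
        if w <= capacity and v > best_value:
            best_value, best_subset = v, subset

print(best_value, best_subset)  # 220 (1, 2)
```

Exhaustive enumeration like this grows as 2^n, which is exactly why the 0-1 knapsack formulation is NP-hard and motivates the approximate methods discussed below.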
Common risk metrics such as "tail value at risk" (TVaR) and "average annual loss" (AAL) are used by insurance companies to measure the marginal impact of adding a policy into a portfolio. While marginal impact is a good proxy for the short-term implications of deciding to write a policy, it does not reveal long-term implications, such as the adverse effect of stacking a portfolio with highly correlated policies.
AI techniques, however, can take both the short- and long-term implications of decisions into account and can also bring some measure of automation to the policy selection process. To use the AI techniques discussed here, portfolio optimization needs to be formulated as a sequential decision-making problem, or, still more specifically, as a Markov Decision Process (MDP). An MDP is a mathematical framework for modeling decision-making processes and problems in which outcomes are partly stochastic and partly under the decision-maker's control.
MDPs are framed as a 4-tuple: [S, A, P(·,·), R(·,·)]
where:
4-tuple indicates that the problem is defined by an ordered set consisting of 4 elements;
S denotes a finite set of states;
A represents a finite set of actions available in a given state;
P indicates probabilities with respect to S and A such that:
the expression P(s,s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s′ at time t + 1;
and R, expressed as R(s,s′), is the expected immediate reward (reinforcement) received after executing action a in state s and transitioning into state s′.
These relationships are illustrated schematically in Figure 1.
The basic Markov Decision Process framework is simple: a decision-making agent acts on its environment, receives feedback on whether the action had a positive or negative effect, and selects and executes successive actions (a_t) one after another until a predetermined stopping condition is met.
Information about the environment is automatically communicated to the decision-making agent with each new t + 1 iteration of an executed action (s_{t+1}). Based on the new state of the environment, the decision-making agent executes an action that causes the environment to transition into a new state in keeping with the transition probabilities, P(s,s′). Following this action, the decision-making agent receives a reward, or reinforcement (R_t), that reflects the desirability of the new state.
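The agent-environment loop described above can be sketched in code. The two states, two actions, transition probabilities, and rewards below are invented solely to illustrate the 4-tuple [S, A, P, R] and the sampling of s′ and R at each step:

```python
import random

random.seed(0)

# A toy MDP: S and A are small finite sets; the probabilities and rewards
# are made up for illustration.
S = ["s0", "s1"]
A = ["a0", "a1"]
# P[(s, a)] -> list of (next_state, probability) pairs
P = {
    ("s0", "a0"): [("s0", 0.7), ("s1", 0.3)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 0.4), ("s1", 0.6)],
    ("s1", "a1"): [("s1", 1.0)],
}
# R[(s, next_state)] -> expected immediate reward for that transition
R = {("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
     ("s1", "s0"): -1.0, ("s1", "s1"): 0.5}

def step(s, a):
    """Sample s' from P(s, .) and return (s', immediate reward)."""
    states, probs = zip(*P[(s, a)])
    s_next = random.choices(states, weights=probs)[0]
    return s_next, R[(s, s_next)]

# The agent acts until a stopping condition is met (here: a fixed horizon).
s, total = "s0", 0.0
for t in range(10):
    a = random.choice(A)   # a placeholder policy; RL learns a better one
    s, r = step(s, a)
    total += r
```

At each iteration the environment's new state and reward flow back to the agent, exactly as in the MDP loop of Figure 1.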
Artificial Intelligence Framework
The Reinforcement Learning framework makes possible the use of automated optimal decision-making capabilities in uncertain, dynamic environments, such as insurance companies' changing risk profiles. The particular RL algorithm that is used to address the portfolio optimization problem is known as the "Q-Learning algorithm."^{2} Q-Learning, like many other RL algorithms, stems from dynamic programming, which makes use of the Bellman Optimality Equations, which take the form:

V*(s) = max_a Σ_{s′} P(s,s′)[R(s,s′) + γV*(s′)]  (Equation 1)

Q*(s,a) = Σ_{s′} P(s,s′)[R(s,s′) + γ max_{a′} Q*(s′,a′)]  (Equation 2)
The utility of the Bellman equations is their ability to characterize state-action optimality. In these equations, γ is the discount factor for future rewards and Q*(s,a) is the value of the optimal action a that maximizes (or minimizes) the expected cumulative reward from state s.
In effect, however, RL algorithms adapt the Bellman Optimality Equations into a kind of update rule for the iterative improvement of desired value functions. This "update rule" for basic Q-Learning is a derivative of Equation 2 and is expressed as follows:

Q(s,a) ← Q(s,a) + α[R(s,s′) + γ max_{a′} Q(s′,a′) − Q(s,a)]  (Equation 3)

where α is the learning rate, which controls how quickly new observations override old estimates.
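A minimal sketch of the standard tabular Q-Learning update (as in Sutton and Barto) on a toy three-state chain shows the rule in action; the environment and its rewards are invented for illustration:

```python
import random

random.seed(1)

# Toy chain: moving "right" from state i leads to i+1; reaching the last
# state ends the episode with reward 1. All numbers are illustrative.
n_states, actions = 3, ["left", "right"]
alpha, gamma = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == "right" else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for _ in range(200):                      # episodes
    s = 0
    while s != n_states - 1:
        a = random.choice(actions)        # exploratory behavior policy
        s_next, r = step(s, a)
        # The Q-Learning update: a sampled form of the Bellman optimality
        # equation, nudging Q(s, a) toward the one-step lookahead target.
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
```

After enough episodes the learned values reflect the one-step lookahead: "right" is correctly valued above "left" in the starting state, even though the agent behaved randomly while learning.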
Figure 2 adapts the basic Markov Decision Process schematic in Figure 1 to illustrate how the Q-Learning framework is applied to the insurance portfolio optimization problem.
In Figure 2, the "environment" is a portfolio or set of portfolios and the "decision-making agent" is a Q-Learning application. The agent executes a policy selection decision, which immediately changes the state of the (portfolio/policy) environment. That new "state" is conveyed back to the agent, along with the "reward" information indicating whether the change advances toward (or away from) the optimization goal.
Accommodating Uncertainty
Both the occurrence and frequency of catastrophic events are uncertain, as are the intensity of the events and the damage and loss caused by them. A solution methodology that is able to account for this uncertainty, without having to make unrealistic simplifying assumptions, will give decision-makers a competitive advantage.
The Q-Learning technique outlined above is neither too sophisticated for a nonspecialist user to understand and implement nor unduly limited in its ability to address challenging, complex optimization problems. The Q-Learning framework can also handle uncertainty, which is probably its most important advantage.
Q-Learning is considered to be one of the most important breakthroughs in Reinforcement Learning. Q-Learning operates, in effect, by "looking" one step ahead (or more, depending on the problem). In this way, the value of an action a in state s at time t converges toward the value of the action that consistently yields the maximum reward in the new state, s_{t+1} = s′, that the environment transitions into at time t + 1.
This iterative process is illustrated in Figure 3.
The decision-making agent depicted in Figure 3 receives portfolio performance information (Step 1) and, based on that information, selects the "best fit" policy from a pool of available policies (Step 2). That selection changes the state of the portfolio, and that new state, along with a valuation of whether it is desirable, is conveyed back to the agent (Step 3). Finally, the valuation and new state information is stored, and a new selection is made (Step 4).
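The four steps above can be sketched as a loop, with the caveat that everything below is hypothetical: the policy pool, the risk numbers, the reward function, and the simple value-tracking rule are stand-ins invented for illustration, not the production system's actual design:

```python
import random

random.seed(2)

# Hypothetical policy groups: (premium, standalone_risk) pairs. The reward
# is a crude stand-in: premium gained minus a penalty for added risk.
pool = [(random.uniform(10, 100), random.uniform(5, 80)) for _ in range(20)]

portfolio, premium, risk = [], 0.0, 0.0
values = {i: 0.0 for i in range(len(pool))}   # learned value per policy

for _ in range(8):
    available = [i for i in range(len(pool)) if i not in portfolio]
    # Steps 1-2: observe portfolio state, select the "best fit" policy
    # (epsilon-greedy over the learned values).
    if random.random() < 0.2:
        choice = random.choice(available)
    else:
        choice = max(available, key=lambda i: values[i])
    p, r_add = pool[choice]
    # Step 3: the selection changes the portfolio state; a reward reflects
    # whether the change is desirable.
    reward = p - 0.5 * r_add
    # Step 4: store the valuation and repeat the selection.
    values[choice] += 0.5 * (reward - values[choice])
    portfolio.append(choice)
    premium += p
    risk += r_add
```

The essential pattern is that selection, feedback, and value storage repeat until a stopping condition (here, a fixed number of selections) is reached.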
Importantly, the logic of this algorithm allows Q-Learning to operate without having to employ the transition probability and immediate reward information that have been used in dynamic programming. These probability and reward assumptions have often been criticized as being unrealistic, rendering the resulting solution methodologies impractical.
A Case Study
An experimental case study was undertaken to compare the performance of the Q-Learning portfolio optimization method with that of other heuristic algorithms, namely Genetic Algorithms (GA) and Stochastic Steepest Ascent (SSA). The data for this case study consisted of 500 policy groups from a residential book in Florida. The premise was that an insurer wanted to identify policy groups suitable for incorporation into its overall portfolio. The insurer also wanted to be able to minimize its exposure while achieving, at the least, a threshold level of premium.
The total premium and TVaR for the entire book were USD 26,074,040 and USD 104,164,319, respectively. This simply means that the TVaR risk metric ranges from 0—if the insurer does not write anything—to USD 104,164,319—if the insurer writes all of the available policy groups—while the amount of premiums that can be collected ranges from 0 to USD 26,074,040. To produce a profit, the insurer needs to collect at least USD 8,500,000 in premiums.
This situation describes an optimization problem whose objective is to minimize TVaR while satisfying the constraint: Premium_{TOTAL} ≥ USD 8,500,000.
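A scaled-down sketch of this formulation makes the structure explicit. The five policy groups, their premiums, and their risk contributions below are invented, and treating risk as additive across groups is a deliberate simplification (real TVaR is not additive); the real 500-group book is far too large for the brute force used here, which is why heuristic and RL methods are applied:

```python
from itertools import combinations

# Toy version of the case study: minimize total risk subject to
# total premium >= a threshold. All numbers are illustrative.
premiums = [3.0, 2.5, 4.0, 1.5, 2.0]   # USD millions
risks =    [9.0, 6.0, 14.0, 3.5, 5.0]  # simplified additive risk measure
threshold = 8.5                         # premium constraint (USD millions)

best = None   # (risk, premium, subset) of the best feasible selection
for r in range(1, len(premiums) + 1):
    for subset in combinations(range(len(premiums)), r):
        prem = sum(premiums[i] for i in subset)
        risk = sum(risks[i] for i in subset)
        if prem >= threshold and (best is None or risk < best[0]):
            best = (risk, prem, subset)
```

The feasible selection with the lowest risk here takes four of the five groups, skipping the group with the worst risk-to-premium ratio, which mirrors the intuition behind the full-scale optimization.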
Because of the stochastic nature of all three optimization methods, the performance comparison between them was conducted as a single-factor, two-level experiment that was run 50 times. The mean and standard deviation of the TVaR produced by each method were determined, and a confidence interval was computed for the difference in the TVaRs between Q-Learning and GA, the two best-performing approaches, to quantify how much the expectations differ.
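The confidence-interval computation can be illustrated as follows. The 50 run results below are simulated from made-up distributions, and a normal approximation is used for the paired difference (the study's exact procedure is not described), so this is a sketch of the statistical idea rather than a reproduction of the analysis:

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)

# Simulated final TVaR values for two methods over 50 runs (illustrative).
runs = 50
method_a = [random.gauss(7.42e6, 2.0e4) for _ in range(runs)]  # "Q-Learning"
method_b = [random.gauss(7.91e6, 2.7e4) for _ in range(runs)]  # "GA"

# Paired differences and a two-sided 98% confidence interval for their mean.
diffs = [b - a for a, b in zip(method_a, method_b)]
mean_d = statistics.mean(diffs)
se = statistics.stdev(diffs) / runs ** 0.5
z = NormalDist().inv_cdf(0.99)     # 98% two-sided leaves 1% in each tail
low, high = mean_d - z * se, mean_d + z * se
```

An interval that lies entirely above zero, as in the study, indicates that one method reliably produces lower TVaR than the other.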
The statistical parameters of the final TVaR values for all three methodologies are listed in Table 1, while Figure 4 presents the minimum TVaR yielded by each method after each run.
Table 1. Statistical parameters of the final TVaR values

Method                  Mean (USD)   Std. Deviation (USD)
Q-Learning Framework     7,423,433                 19,645
GA                       7,913,001                 26,701
SSA                      8,269,521                 43,060
A 98% confidence interval was computed for the paired difference between the TVaRs yielded by Q-Learning and the GA. In other words, 98% of the time, Q-Learning yielded TVaR values between USD 411,203 and USD 567,934 less than those produced by the GA. The significance of this difference is that the Q-Learning approach consistently yields a better result.
Conclusion
Reinforcement Learning techniques, by virtue of their ability to adapt to a stochastic environment, have the potential to advance the insurance portfolio optimization task by delivering superior solutions in the face of uncertainty. Tested against the commonly used Genetic Algorithm in optimizing a book of policy groups, Q-Learning was found to deliver statistically significant superior TVaRs while achieving similar premium levels.
^{1} See "Portfolio Optimization for Insurance Companies," January 2011; "Portfolio Optimization for Reinsurers," March 2012; "Managing Wind Pool Risk with Portfolio Optimization," June 2012.
^{2} For more information on Q-Learning, see "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto.