Alex Dombrovski
Department of Psychiatry
University of Pittsburgh
Laboratory studies of decision-making often involve choosing among a few actions. Yet in natural environments, we encounter a multitude of options whose values may be unknown. Because our cognitive capacity is bounded, deciding whether to exploit an action with known value or to search for better alternatives becomes hard in complex environments. In reinforcement learning, approaches to the intractable exploration/exploitation tradeoff typically involve controlling the temperature parameter of the softmax policy or encouraging the selection of uncertain options. To what extent such approaches capture the range of human behavior remains unclear, in part because they do not consider the memory constraints on maintaining multiple learned values across episodes.
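As an illustration of the temperature-based approach mentioned above, the following minimal sketch shows how a softmax policy shifts from near-deterministic exploitation to near-uniform exploration as temperature rises. The function name, the three example values, and the two temperatures are assumptions chosen for illustration, not parameters from the work described here.

```python
import math

def softmax(values, temperature):
    """Return choice probabilities for a softmax policy at a given temperature."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical learned values for three actions.
values = [1.0, 2.0, 3.0]

# Low temperature: choice concentrates on the highest-valued action (exploitation).
p_cold = softmax(values, temperature=0.1)

# High temperature: choice probabilities flatten toward uniform (exploration).
p_hot = softmax(values, temperature=10.0)
```

Note that the temperature changes only how values are read out into choices; it does not address how many values must be held in memory.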
We describe how selectively maintaining high-value actions in a manner that reduces information content helps to resolve the exploration/exploitation dilemma during a reinforcement-based timing task. By definition, the information content (i.e., Shannon’s entropy) of the value representation controls the shift from exploration to exploitation. When subjective values for different actions are similar, the entropy is high, inducing exploration. Under selective maintenance, entropy declines as the agent preferentially maps the most valuable parts of the environment and forgets the rest, facilitating exploitation. We demonstrate in silico that this memory-constrained algorithm performs as well as cognitively demanding uncertainty-driven exploration, even though the latter yields a more accurate representation of the contingency.
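The relationship between selective maintenance and entropy can be sketched in a few lines. This is a toy illustration under stated assumptions, not the model used in the study: the action space is a small discrete set, values are normalized to a probability distribution before computing entropy, unchosen actions decay toward zero at a fixed rate, and the same action is chosen on every trial. The function names and the `learning_rate` and `decay` parameters are hypothetical.

```python
import math

def shannon_entropy(values):
    """Shannon entropy (bits) of non-negative values normalized to sum to 1."""
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)

def selective_maintenance_step(values, chosen, reward,
                               learning_rate=0.3, decay=0.2):
    """Move the chosen action's value toward the reward; decay all others."""
    updated = []
    for i, v in enumerate(values):
        if i == chosen:
            # Delta-rule update for the action that was just taken.
            updated.append(v + learning_rate * (reward - v))
        else:
            # Assumed forgetting: unchosen values shrink toward zero.
            updated.append(v * (1.0 - decay))
    return updated

# Eight actions with identical initial values: entropy is maximal.
values = [0.5] * 8
h_start = shannon_entropy(values)

# Repeatedly choose and reward one action while the rest are forgotten.
for _ in range(20):
    values = selective_maintenance_step(values, chosen=3, reward=1.0)
h_end = shannon_entropy(values)
```

In this sketch, entropy starts at its maximum for eight equally valued actions and falls as the representation concentrates on the maintained action, which is the qualitative pattern the paragraph above describes.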
Human behavior is best captured by a selective maintenance model. Information dynamics consistent with selective maintenance are most pronounced in better-performing subjects, in those with higher non-verbal intelligence, and in learnable vs. unlearnable contingencies. High entropy recruited a dorsal attention network, the activity of which was blunted in individuals with borderline personality disorder. In summary, when the action space is large, strategic maintenance of value information reduces cognitive load and facilitates the transition from exploration to exploitation.