Ir al contenido principalSaltar al contenido

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Abstract

arXiv:2607.01480v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent selfdistillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local u

Asesor Virtual 24h - Abre el chat para consultasAsesor Virtual 24h
Hablar por WhatsApp con nuestro agenteLlámanos al teléfono