KAN We Flow?
Advancing Robotic Manipulation with 3D Flow Matching via KAN & RWKV
Anonymous ICRA submission
Overview of KAN-We-Flow. The policy receives a noised action together with a condition composed of three encoded parts: a point-cloud perception embedding, a robot-state embedding, and a time embedding. The concatenated representation is processed by a lightweight RWKV-KAN U-shaped backbone instead of a large UNet-style backbone: RWKV mixes long-range sequential/spatial context with linear complexity, while KAN performs learnable spline-based feature calibration. A straight-line flow is then learned with conditional consistency flow matching to produce a one-step velocity field, so actions are generated at real-time inference speed; an additional action-consistency regularization aligns Euler-extrapolated trajectories with demonstrations to stabilize training.
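To make the one-step generation concrete, the PyTorch-style sketch below shows how a straight-line (rectified) flow turns sampled noise into actions with a single Euler step of the learned velocity field. The `velocity_net` interface, the `OneStepFlowPolicy` wrapper, and the tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class OneStepFlowPolicy(nn.Module):
    """Sketch of one-step action generation from a learned velocity field.

    `velocity_net` (assumed interface) maps (noised_action, time, condition)
    to a velocity with the same shape as the action.
    """

    def __init__(self, velocity_net: nn.Module):
        super().__init__()
        self.velocity_net = velocity_net

    @torch.no_grad()
    def act(self, cond: torch.Tensor, horizon: int, action_dim: int) -> torch.Tensor:
        batch = cond.shape[0]
        a0 = torch.randn(batch, horizon, action_dim, device=cond.device)  # noise at t = 0
        t = torch.zeros(batch, device=cond.device)                        # flow time
        v = self.velocity_net(a0, t, cond)                                # predicted straight-line velocity
        return a0 + v                                                     # single Euler step from t = 0 to t = 1
```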
Simulation Robot Experiments
Simulation results for Hammer, Door, and Pen.
Simulation results for Assembly, Disassemble, and Hammer.
Simulation results for Door Close, Door Open, and Button Press Wall.
Simulation results for Drawer Close, Drawer Open, and Hand Insert.
Simulation results for Faucet Close, Faucet Open, and Stick Pull.
Simulation results for Window Close, Window Open, and Stick Push.
Abstract
Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient: iteratively denoising from noise to actions requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a velocity field that supports one-step generation, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight yet highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV layer first performs efficient sequence/spatial mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to the RWKV outputs, providing feature-wise nonlinear calibration of the action mapping. Moreover, we introduce Action Consistency Regularization (ACR), a lightweight auxiliary loss that aligns Euler-extrapolated action trajectories with expert demonstrations, providing additional supervision that stabilizes training and improves policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on the Adroit, Meta-World, and DexArt benchmarks.
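As a rough sketch of how the flow-matching objective and the ACR term described above might combine during training: Euler-extrapolating the noisy action along the predicted velocity yields a one-step action estimate that can be regressed onto the expert action. The linear noise-to-action interpolation, the MSE forms, and the weight `lambda_acr` below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def flow_matching_with_acr(velocity_net, expert_actions, cond, lambda_acr=0.1):
    """One training step of straight-line flow matching plus an ACR-style term.

    Assumed setup: noise at t = 0, expert action at t = 1, linear interpolation
    in between, and MSE for both the flow-matching and consistency terms.
    """
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], device=expert_actions.device)
    t_exp = t.view(-1, *([1] * (expert_actions.dim() - 1)))

    a_t = (1.0 - t_exp) * noise + t_exp * expert_actions   # point on the straight path
    v_target = expert_actions - noise                       # constant target velocity
    v_pred = velocity_net(a_t, t, cond)

    fm_loss = F.mse_loss(v_pred, v_target)                  # standard flow-matching loss
    a_hat = a_t + (1.0 - t_exp) * v_pred                     # Euler extrapolation to t = 1
    acr_loss = F.mse_loss(a_hat, expert_actions)             # align with demonstrations
    return fm_loss + lambda_acr * acr_loss
```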
Comparison of Accuracy, Parameters, and Inference Time
Comparison of KAN-We-Flow with the state-of-the-art methods FlowPolicy and DP3 in terms of accuracy, parameter count, and inference time. (a) KAN-We-Flow achieves superior success rates on challenging tasks across benchmarks; (b) our approach reduces parameters by 86.8% compared with FlowPolicy and DP3; (c) compared with DP3, KAN-We-Flow cuts inference time by 92.6% on the Adroit–Pen task, enabling real-time control.