Finetuning PI05 for a custom task, robot and environment

Finetuning PI's Pi05 for a custom embodied autonomy use case on ~$10K of hardware.

Brandon Ong

2026-02-26

Background

When we first got started with model-based control for humanoid robots, we found that existing sources focused on the literature required to understand the core topics around it: perception, planning and control. These deep dives were useful in mapping how we got to the current state of the technology, but we were searching for something more engineering-focused: sources grounded in the desire to actually deploy such robots in custom use cases and environments.

This article aims to bridge the gap in practical knowledge required to deploy a first working prototype of an AI model controlling robots for under $10K with open source software. We will focus on the 10,000-foot view of what we think are the core foundations needed to do so with embodied autonomy, which we define as the use of AI models to control robot manipulation. Embodied autonomy is distinct from conventional, ROS/RL-based autonomy in its focus on deep-learning-based methods, similar to the techniques used to train AI models like Anthropic's Claude and ChatGPT. The emerging term for such AI models is Vision Language Action (VLA) models. Variants of VLAs are also increasingly used for self-driving.

Custom task and environment

Our goal is to get a first working prototype of a robot with two manipulation arms that can clean dirty sink surfaces littered with trash objects, at the lowest cost possible. Embodied autonomy is a good fit for dealing with varying toilet conditions and configurations (number of taps, type of sinks, soap dispensers, trash objects, etc.). Accomplishing this with conventional ROS methods would require extensive hand-written coverage for every configuration.

Why we chose a fixed manipulation setup, not a humanoid immediately

To optimize for iteration speed, we narrowed the scope of our AI model, as we recognized it would be non-trivial to produce an end-to-end pipeline given the constraints. We planned to take a progressive approach, starting with fixed bi-manipulation arms (shown), then to mobile manipulators like Galaxea's R1 Lite (used by Physical Intelligence) and eventually to full humanoid robots with legs, arms and hands all controlled by AI models.

Given the team's prior experience with ROS-based control, we planned to navigate the robots to points where the AI model would take over. Hence, a fixed manipulation setup was a good fit for our requirements and expectations for a first prototype. Ultimately, our ambition was to progressively hand off more autonomy from ROS-based control to the AI model.

Hardware and Software (<$10K)

Hardware

We used 2x AgileX Piper Arms (~$3000 per arm), an ORBBEC Gemini 336L for the top view camera and 2x RealSense D405 for the wrist cameras, a common choice amongst the research community. We chose these cameras for their relatively low price and because they met the minimum requirements we were looking for. We did experience unpredictable camera disconnections, but the cameras were still acceptable for an initial prototype. There is a new category of "robotics-specific cameras", like the Zed cameras, which we are seeing used more by the research community but which cost significantly more ($100s vs. $1000s).

Software

For the core software stack, we use @physical_int's (PI) OpenPi repository on GitHub, @huggingface's (HF) LeRobot library for collecting robot training data with teleoperation, LeRobot's data visualizer to explore our datasets, HF's Hub for storing checkpoints and datasets, and @wandb for experiment tracking. We use Lambda Labs for running training scripts in the cloud, and Modal for remote inference. The advantage of using PI's OpenPi repository is that all training and inference scripts are written in PyTorch and JAX, which makes them simple to adapt to other tools, meeting researchers where they are. We elaborate on our experience with NVIDIA, HuggingFace and AI2's stacks below.

Custom Embodiment

DROID and ALOHA are common "full stack" setups in the research community. You can replicate these robots wholesale, or mix and match parts, but their bill of materials (BOM) will likely be in the range of >$20K. The single Franka arm used in DROID already costs ~$8K, while a full ALOHA setup costs ~$20K for hardware alone. However, the advantage of following these full stack setups is that most research models and open source tools were built with them.

If you choose to build your own setup from scratch, the placement height and view angles of your cameras will determine your custom embodiment. In general, it is wise not to deviate too far from the camera positions used in the training data of open source models (if fine-tuning is preferred over training a model from scratch) in order to benefit most from cross-embodiment transfer, though precise details on camera positions and angles are not always available. Our team eventually chose purely custom measurements and accepted the tradeoffs that come with that choice.

Model Selection

When selecting amongst the available open source models, it is wise to consider the procedures used to train them. For example, the earlier versions of NVIDIA's GR00T models were trained on data with only one top camera and one wrist camera (single arm). Such models will likely struggle with bi-manipulation tasks and hence were not strong candidates for our use case. On the other hand, the advantage of PI models is their broad data mix in terms of embodiments, data sources and training stages, which PI uses to train a general model that adapts quickly. We were surprised that the finetuned Pi05 base model in the video adapted to our custom task and embodiment with ~400 training episodes (more details in the data section below), which adds up to 10 hours of finetuning data. PI's guidelines suggest between 1 and 20 hours of finetuning data for task adaptation. The base model without finetuning fails miserably.
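As a rough sanity check on the data budget, the figures above can be turned into per-episode numbers. This is back-of-the-envelope arithmetic only: it assumes the 30 FPS recording rate mentioned in the Inference section also applies to data collection, and it yields an average, not the length of any actual episode.

```python
# Figures quoted in the text (approximate).
EPISODES = 400      # ~400 finetuning episodes
TOTAL_HOURS = 10    # ~10 hours of finetuning data
FPS = 30            # assumed recording rate, taken from the Inference section

# Average episode length in seconds.
avg_episode_s = TOTAL_HOURS * 3600 / EPISODES

# Average frames per episode, per camera, and total frames per camera.
frames_per_episode = avg_episode_s * FPS
total_frames_per_camera = frames_per_episode * EPISODES

print(f"average episode length: {avg_episode_s:.0f} s")
print(f"frames per episode (per camera): {frames_per_episode:.0f}")
print(f"total frames (per camera): {total_frames_per_camera:.0f}")
```

An average of about a minute and a half per episode is a useful planning number when estimating how long a data collection session will take.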

The core challenge we faced with @NVIDIARobotics's GR00T models was resolving dependencies with tooling from their ecosystem. For example, setting up the dependencies for TensorRT is non-trivial but required to run inference with their GR00T models. We did not have much success setting up HF's X-VLA, or AI2's Molmo-Act ecosystem.

Inference

The most difficult part of building the inference pipeline is understanding the hardware APIs and integrating them with the AI model outputs. The AI model outputs need to be matched correctly to the format the arms expect. Unfortunately, there is currently no standard specification that all AI models and robot hardware follow. Once all components are integrated, the Pi05 model takes as input the images from the 3 camera feeds (1 top camera, 2 wrist cameras) and the 7 input joint states (for AgileX Piper Arms there are 6 motor/joint states and 1 gripper/end effector state). The AI model then outputs n 7-dimensional arrays, where n is the action horizon (elaborated below). The system is controlled at a high frequency via CAN bus to ensure reactive and smooth motion.
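To make the output format concrete, here is a minimal sketch of unpacking one action chunk into per-step commands. It is not the Piper SDK or the OpenPi API: the `ArmCommand` type, the function name, and the ordering (6 joint values followed by 1 gripper value) are assumptions for illustration, and the real ordering and units must be checked against your hardware API and training data.

```python
from dataclasses import dataclass
from typing import List, Sequence

NUM_JOINTS = 6                 # joint states per arm, as described in the text
ACTION_DIM = NUM_JOINTS + 1    # plus one gripper value


@dataclass
class ArmCommand:
    """Hypothetical container for one control step of one arm."""
    joint_positions: List[float]
    gripper: float


def chunk_to_commands(chunk: Sequence[Sequence[float]]) -> List[ArmCommand]:
    """Split an (n, 7) action chunk into n commands.

    Assumes each action is laid out as [joint_1..joint_6, gripper].
    """
    commands = []
    for step, action in enumerate(chunk):
        if len(action) != ACTION_DIM:
            raise ValueError(
                f"step {step}: expected {ACTION_DIM} values, got {len(action)}"
            )
        commands.append(
            ArmCommand(
                joint_positions=[float(v) for v in action[:NUM_JOINTS]],
                gripper=float(action[NUM_JOINTS]),
            )
        )
    return commands


# Example: a dummy chunk with an action horizon of 3.
dummy_chunk = [[0.1 * i] * NUM_JOINTS + [1.0] for i in range(3)]
commands = chunk_to_commands(dummy_chunk)
print(len(commands), commands[0])
```

Validating the shape at this boundary catches mismatches between the model output and the arm interface early, before anything is sent over the CAN bus.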

Our cameras record 30 FPS videos of 224x224 pixel views in RGB only. We experimented with RGB-D cameras, but did not observe performance improvements significant enough to warrant the additional complexity. In addition, most open source AI models use predominantly RGB images in their pretraining datasets, which is important for cross-embodiment transfer. It is worth noting that depth information can be captured in different ways, e.g. Zed cameras are stereo cameras that use onboard computation to triangulate depth information, while other cameras, like the Intel RealSense and Efference cameras, collect depth information in the form of infrared readings. However, we were unable to harness this sensor data to improve our model performance.

When running model inference, there are three important components to consider: the action horizon; asynchronous vs. synchronous inference; and real-time chunking (RTC) with an action queue threshold. These choices are constrained by the model inference latency of your setup.

In our first iteration, we chose a conservative setup with an action horizon of 30, synchronous inference, and no real-time chunking. In this setup, all camera feeds for training and inference are configured to 30Hz. The AI model generates 30 future actions (an action horizon of 30), and once those 30 actions have been executed completely (synchronous), a new set of 30 actions is generated based on the most recent camera feeds (no real-time chunking and no action queue maintained).
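The synchronous loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the OpenPi client or our production code: `get_observation`, `infer_chunk` and `execute_action` are hypothetical callables standing in for the camera/joint readers, the policy call, and the arm interface, and pacing at the control rate is omitted.

```python
from typing import Any, Callable, Sequence

Action = Sequence[float]


def run_synchronous(
    get_observation: Callable[[], Any],
    infer_chunk: Callable[[Any], Sequence[Action]],
    execute_action: Callable[[Action], None],
    num_chunks: int,
    action_horizon: int = 30,
) -> int:
    """Observe, infer one full chunk, execute all of it, then repeat.

    Returns the number of actions executed.
    """
    executed = 0
    for _ in range(num_chunks):
        obs = get_observation()      # latest camera frames and joint states
        chunk = infer_chunk(obs)     # blocking call: the arms wait here
        if len(chunk) != action_horizon:
            raise ValueError(
                f"expected {action_horizon} actions, got {len(chunk)}"
            )
        for action in chunk:
            execute_action(action)   # a real loop would pace this at 30Hz
            executed += 1
    return executed


# Example with stub components.
sent = []
count = run_synchronous(
    get_observation=lambda: {"images": [], "state": [0.0] * 7},
    infer_chunk=lambda obs: [[0.0] * 7 for _ in range(30)],
    execute_action=sent.append,
    num_chunks=2,
)
print(count, len(sent))
```

The blocking `infer_chunk` call is where the arms sit still between chunks, which is the cost of this simpler design.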

The alternative is to enable asynchronous inference and real-time chunking with an action queue threshold. Instead of waiting for all 30 actions to finish executing (assuming an action horizon of 30), we maintain an action queue and define a threshold on how full it is. Based on the PI papers, an action queue threshold of 50% performs reasonably well empirically. For example, at a threshold of 50%, once the arms have executed the first 15 actions and the queue drops to 50% full, the AI model runs inference to generate the next 30 actions from the most recent state (camera feeds and joint states), which refills the queue to nearly full (~99%). Because the queue always holds actions for the arms to execute, the arms never idle. The advantage of this inference setting is smoother arm motion: the abrupt stops that cause the arms to jerk in the synchronous setting happen because there are no actions left to execute while the model is running inference. For the asynchronous setting described here, we were unable to find robust open source implementations we could integrate quickly into our pipeline. HuggingFace has an asynchronous implementation, but it did not work for us.
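The queue threshold mechanism can be explored with a small tick-based simulation. This is a toy model, not PI's real-time chunking algorithm and not the LeRobot implementation: it assumes inference takes a fixed number of control ticks, and it uses a naive refill policy in which the leading actions of a new chunk that are already stale on arrival are dropped and the rest replace the queue. All names are hypothetical.

```python
from collections import deque
from typing import Tuple


def simulate_action_queue(
    total_ticks: int,
    horizon: int = 30,
    threshold: float = 0.5,
    latency_ticks: int = 3,
) -> Tuple[int, int]:
    """Simulate a threshold-triggered action queue.

    Returns (idle_ticks, inference_calls), where idle_ticks counts control
    ticks on which the queue was empty and the arms had nothing to execute.
    """
    queue = deque(range(horizon))   # assume the first chunk is already there
    arrival_tick = None             # tick at which a requested chunk lands
    idle_ticks = 0
    inference_calls = 1

    for tick in range(total_ticks):
        # A requested chunk arrives; its first latency_ticks actions describe
        # moments that have already passed, so they are discarded.
        if arrival_tick is not None and tick >= arrival_tick:
            queue = deque(list(range(horizon))[latency_ticks:])
            arrival_tick = None

        # Trigger inference when the queue falls to the threshold.
        if arrival_tick is None and len(queue) <= threshold * horizon:
            arrival_tick = tick + latency_ticks
            inference_calls += 1

        # Execute one action per control tick, or idle if none is left.
        if queue:
            queue.popleft()
        else:
            idle_ticks += 1

    return idle_ticks, inference_calls


# Low latency relative to the queue size versus very high latency.
print(simulate_action_queue(total_ticks=100, latency_ticks=3))
print(simulate_action_queue(total_ticks=200, latency_ticks=120))
```

In this toy model the arms never idle as long as inference finishes before the remaining half of the queue is consumed. Once latency exceeds the action horizon, every new chunk is stale on arrival and the queue runs dry.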

As mentioned above, one of the core considerations when choosing an inference setup is model inference latency. The inference latency reported by PI for Pi0 is 76ms for local inference and 86ms for remote inference over a Wi-Fi connection to an off-board NVIDIA RTX 4090 GPU. This enables PI to run asynchronous inference at 50Hz, with inference triggered every 0.5s after 25 actions are executed. However, we were only able to achieve a latency of ~4000ms per inference call from Modal on an A100 in US-East-2, which is roughly 50x slower. We ran inference from a US region because of the lack of GPUs available in Singapore, but we did not expect a difference of this magnitude.
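One way to read these latency figures is to convert them into the number of control steps that elapse while a single inference call is in flight. The sketch below is plain arithmetic on the numbers quoted above; the function names are ours and the 30Hz case assumes the control rate of our first iteration.

```python
def actions_elapsed(latency_ms: float, control_hz: float) -> float:
    """Control steps that pass while one inference call is in flight."""
    return latency_ms / 1000.0 * control_hz


def refill_period_s(actions_between_calls: int, control_hz: float) -> float:
    """Seconds between inference calls when triggering every N actions."""
    return actions_between_calls / control_hz


# PI's reported setting: 25 actions between calls at 50Hz.
print("refill period (s):", refill_period_s(25, 50))

# Latencies quoted in the text, at 50Hz and at our 30Hz control rate.
for label, latency_ms in [("local", 76), ("remote Wi-Fi", 86), ("ours", 4000)]:
    for hz in (50, 30):
        steps = actions_elapsed(latency_ms, hz)
        print(f"{label:>12} @ {hz}Hz: {steps:.1f} control steps per call")
```

At PI's reported latencies, only about four control steps pass per call at 50Hz, well within the 25 actions remaining in the queue. At ~4000ms, 120 steps pass at 30Hz, which is four times an action horizon of 30, so the queue cannot be kept filled.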

As such, we would need low-latency, locally networked GPUs, or a migration to a local RTX 4090 inference setup, before considering integrating asynchronous inference into our pipeline. In addition, older hardware might limit the input frequency of the robot arms. For example, earlier generations of Franka and UR5e arms can only execute actions at 20Hz. Even if we can produce 25 actions each time, these arms will be constrained by their execution frequency. As far as we know, the limit on the AgileX Piper arms is 100Hz, but we did not get to test this.

Embedded Inference

Lately, attention has been growing around NVIDIA's Jetson AGX Thor and AGX Orin, GPUs specifically designed for robotics and on-board inference. The key distinction between the two chips is that Thor has a Blackwell chip, while Orin has an Ampere chip. The implications of these chip architectures are what is relevant here. A Blackwell chip runs on Ubuntu 24.04 and CUDA 13+, while an Ampere chip runs on Ubuntu 22.04 and CUDA 12+. A similar pattern can be observed across NVIDIA's product lines, e.g. 4090 vs. 5090; H200 vs. B200.

What this means for developers using open source software is that a chip that has been around for a while is more widely supported. We faced much difficulty adapting open source scripts for inference on Thor and Orin, even with NVIDIA's official containers. We filed multiple GitHub issues, which are attached below for reference. For Orin, the best open source solution available was a hacky workaround that cuts inference time by 4x, but even with this improvement, we are still far from the latency reported by PI.

Our conclusion on Thor and Orin is that they are only useful if one is willing to write custom inference scripts and optimize them from scratch to get them working. For quick prototyping, it is likely more promising to choose a cloud provider with locally networked GPUs, and to ensure you have a high-bandwidth Wi-Fi connection to the robot, ideally a dedicated robot network connection for off-board inference. In the PI paper, inference on a local RTX 4090 GPU is merely 10ms faster than off-board inference over Wi-Fi (76ms vs. 86ms).

GitHub Issues:

Jetson Thor support for Pi05 GitHub Issue
NVIDIA's Official OpenPi Jetson Container does not work for Jetson Thor

Hierarchical Planner

In 2025, when test-time scaling emerged with large language models (LLMs), the use of hierarchical planners with VLAs at inference time emerged with it. The general idea is to incorporate a separate "planner" model which thinks and senses before generating actions that are executed by the robot.

Previously, you could only provide the model a single high level instruction as input: "Clean the dining table." Now with a hierarchical planner, the model will break down the high level instruction, and sequentially come up with downstream instructions for future actions based on updated observations:

"Walk to the dishwasher and open it"
"Carry the cups on the table and load it into the dishwasher"
"Go back to the top rack of the dishwasher and pick up the cups inside"

An interesting experiment done by PI involved the use of a "human oracle", where a human guided the model with instructions in order to complete a task, simulating a 2-level system (one model as the planner, the other as the action generator). Surprisingly, in the majority of ablations, the model-based planner resulted in better performance than the human oracle.

For Pi05, the same Pi05 model is used as the planner for breaking down high level instructions, and these language tokens are passed to a separate Pi05 model for downstream action generation. You can read more about PI's implementation in the HiPlanner and Pi05 papers. The core insight is that training a single model to understand the task end-to-end in a unified architecture is what works best. Unfortunately for the open source community, the OpenPi repository does not provide an implementation of their high level planner, as described in this GitHub issue. We did find a hacky workaround in this GitHub Issue but were unable to replicate the results. Openpi-COMET is an alternative reference point, which used Qwen3-VL-30B-A3B-Instruct as the hierarchical planner for a downstream Pi05 model controlling a robot in the BEHAVIOR simulation environment.
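The two-level control flow can be sketched as a loop in which a planner proposes the next subtask from the latest observation and a policy turns that subtask into actions. This is a generic illustration only: it is not PI's planner (which OpenPi does not ship), and `plan_subtask`, `infer_chunk` and the other callables are hypothetical stand-ins for a vision-language model call, a VLA policy call, and the robot interface.

```python
from typing import Any, Callable, List, Optional, Sequence

Action = Sequence[float]


def run_hierarchical(
    task: str,
    get_observation: Callable[[], Any],
    plan_subtask: Callable[[Any, str], Optional[str]],
    infer_chunk: Callable[[Any, str], Sequence[Action]],
    execute_action: Callable[[Action], None],
    max_subtasks: int = 20,
) -> List[str]:
    """Alternate between planning a subtask and executing one action chunk.

    The planner returns None when it considers the task complete.
    Returns the list of subtasks that were issued.
    """
    issued: List[str] = []
    for _ in range(max_subtasks):
        obs = get_observation()
        subtask = plan_subtask(obs, task)   # high level: language in, language out
        if subtask is None:
            break
        issued.append(subtask)
        for action in infer_chunk(obs, subtask):   # low level: language to actions
            execute_action(action)
    return issued


# Example with stub components and a scripted planner.
script = iter(["Walk to the dishwasher and open it", "Load the cups", None])
executed = []
issued = run_hierarchical(
    task="Clean the dining table.",
    get_observation=lambda: {"images": [], "state": [0.0] * 7},
    plan_subtask=lambda obs, task: next(script),
    infer_chunk=lambda obs, subtask: [[0.0] * 7 for _ in range(30)],
    execute_action=executed.append,
)
print(issued, len(executed))
```

Replanning after every chunk is one design choice among several; a planner could also be called at a lower rate than the policy to save on inference cost.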

We believe it is likely that most of the frontier robotics companies today use some form of hierarchical planner. @Figure_robot describes this with System 0+1+2 in Helix 2, and @GoogleDeepMind's Gemini Robotics 1.5 adapts Gemini as the planner. Galaxea's OpenGalaxea repository is another one to monitor for open source implementations of a 2-system architecture. We have yet to experiment with OpenGalaxea.

Part 2 is in the works, which covers the remaining components of our end-to-end pipeline (model evaluation, data collection, model training, simulation and teleoperation).