Introducing the Trossen SDK: Open-Source Robotics Data Collection for Physical AI

Want to jump right into all the technical details?
Check out the landing page: Data Collection SDK
Just take me to GitHub!
Robotics does not just have a model problem. It has a data problem.
If you are training manipulation policies, building imitation learning workflows, or trying to turn demonstrations into training-ready datasets, you already know where things tend to break down. Joint states come from one place. Camera streams come from another. Mobile base odometry lives somewhere else. Timing gets messy. Storage formats drift. Every hardware change creates more one-off code. And before long, the path from “record a task” to “train a model” becomes fragile, slow, and difficult to reuse.
That is a bad foundation for physical AI.
We think the field needs better shared infrastructure for robotics data collection. Something more modular. More reusable. More extensible. Something that can help the community move faster, instead of forcing every team to rebuild the same plumbing over and over again.
That is why we are open-sourcing the Trossen SDK.
What is the Trossen SDK?
The Trossen SDK is an open-source C++ framework for robotics and physical AI data collection. It is designed to record synchronized, multi-modal episodes from robot arms, cameras, and mobile bases, then move that data into formats that modern training pipelines can actually use. It is configuration-driven, hardware-agnostic by design, and built to be extended without requiring changes to the core library.
At a practical level, the goal is simple: reduce the friction between physically demonstrating a task and training a model on the resulting data.
For teams working in robotics research, imitation learning, teleoperation, and embodied AI, that gap is often where momentum gets lost. The Trossen SDK is built to help close it.
Why open-source robotics data collection matters
We are open-sourcing the Trossen SDK because robotics needs stronger common tooling.
In machine learning, progress accelerated as the ecosystem began to converge around shared frameworks, patterns, and abstractions. Robotics still has too much fragmentation at the data layer. Too many pipelines are custom, brittle, and tightly bound to a single setup. That slows iteration, makes collaboration harder, and creates unnecessary reinvention.
Open source is one of the clearest ways to improve that.
It gives researchers and engineers a real starting point instead of another blank page. It creates a path for the community to fork, test, refine, and extend the framework. It also increases the likelihood that the field will standardize on better ways to capture, structure, and hand off robotics data.
That is the larger point of this release.
Yes, we want people to use the SDK. But we also want to help push the ecosystem toward a more durable software layer for physical AI.
The problem with most robotics data collection workflows
For many teams, data collection still looks like a patchwork of scripts, middleware dependencies, bag files, format conversions, and hardware-specific workarounds. That approach can work for a proof of concept, but it rarely scales well.
The moment you add a new sensor, swap a camera, change recording rates, or move from one robot setup to another, you often end up rewriting infrastructure that should have been modular in the first place.
That is exactly the bottleneck the Trossen SDK is meant to address.
Instead of treating robotics data collection as a pile of temporary hacks, the SDK treats it as a real system. The architecture separates hardware abstraction, producers, buffering, storage backends, and session orchestration into distinct layers. That makes the pipeline easier to reason about, extend, and keep stable as the system grows. The SDK also supports high-frequency recording, lock-free queuing between producers and storage, and a direct path to downstream dataset conversion.
What makes the Trossen SDK different
1. A real architecture, not a script pile
The SDK is built around a clean separation of concerns. Hardware components, producers, sinks, backends, and session management are treated as independent parts of the system rather than collapsed into one fragile application. That matters because it allows teams to change one part of the stack without destabilizing everything else.
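To make the idea of separated concerns concrete, here is a minimal C++ sketch. It is not the SDK's actual API: the names `Sample`, `Producer`, `Backend`, `Session`, `FakeArmProducer`, and `InMemoryBackend` are hypothetical, chosen only to show how a producer, a storage backend, and a session can be kept independent of each other.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: one timestamped sample from one stream.
struct Sample {
  std::string stream;
  std::uint64_t timestamp_ns;
  std::vector<double> values;
};

// Producer: polls one data source. It knows nothing about storage.
class Producer {
 public:
  virtual ~Producer() = default;
  virtual Sample poll(std::uint64_t now_ns) = 0;
};

// Backend: persists samples. It knows nothing about hardware.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual void write(const Sample& sample) = 0;
};

// Stand-in for a robot arm; returns fixed joint values.
class FakeArmProducer : public Producer {
 public:
  Sample poll(std::uint64_t now_ns) override {
    return Sample{"arm/joint_states", now_ns, {0.0, 0.1, 0.2}};
  }
};

// Stand-in for a file-based backend; keeps samples in memory.
class InMemoryBackend : public Backend {
 public:
  void write(const Sample& sample) override { samples.push_back(sample); }
  std::vector<Sample> samples;
};

// Session: the only piece that wires producers to a backend.
class Session {
 public:
  explicit Session(Backend& backend) : backend_(backend) {}

  void add_producer(std::unique_ptr<Producer> producer) {
    assert(producer != nullptr);
    producers_.push_back(std::move(producer));
  }

  // Poll every producer once and hand each sample to the backend.
  void tick(std::uint64_t now_ns) {
    for (auto& producer : producers_) backend_.write(producer->poll(now_ns));
  }

 private:
  Backend& backend_;
  std::vector<std::unique_ptr<Producer>> producers_;
};
```

In a layout like this, swapping `InMemoryBackend` for a different backend, or adding a camera producer, would not require touching `Session` or the existing producers.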
2. Configuration-driven robotics recording
Recording behavior is defined via JSON configuration rather than being hidden in source code. Hardware addresses, resolutions, poll rates, teleoperation pairings, duration limits, and backend parameters can all be adjusted without recompiling. That makes iteration faster and reduces the cost of adapting the stack to new experiments or hardware layouts.
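As a rough illustration of why this matters, the sketch below treats recording parameters as plain data and derives timing from them. The struct and its field names are assumptions made for this example, not the SDK's real configuration schema, and the JSON shown in the comment is equally hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical in-memory form of one config entry. A JSON file might
// express the same thing as (illustrative keys only):
//   { "address": "192.0.2.10", "poll_rate_hz": 200, "max_duration_s": 30 }
struct ArmRecordingConfig {
  std::string address;    // placeholder hardware address
  double poll_rate_hz;    // how often the arm state is sampled
  double max_duration_s;  // episode duration limit
};

// Time between two polls, derived from the configured rate.
inline std::chrono::nanoseconds poll_period(const ArmRecordingConfig& cfg) {
  assert(cfg.poll_rate_hz > 0.0);
  return std::chrono::nanoseconds(
      static_cast<std::int64_t>(1e9 / cfg.poll_rate_hz));
}

// Upper bound on samples in one episode at the configured rate.
inline std::uint64_t max_samples(const ArmRecordingConfig& cfg) {
  return static_cast<std::uint64_t>(cfg.poll_rate_hz * cfg.max_duration_s);
}
```

With values like these living in a file rather than in source, changing the poll rate from 200 Hz to 500 Hz is an edit to data, and the polling loop that consumes `poll_period` stays the same.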
3. Hardware extensibility without touching core code
The registry-based design is a major part of the SDK’s philosophy. New sensors, producers, and backends can be added through interfaces and registration, not by editing central dispatch code or rewriting the framework. If you want to support new hardware, the intent is to make that feel like a first-class extension, not a forked maintenance burden.
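For readers unfamiliar with the pattern, a registry of factories can be sketched in a few lines of C++. This is a generic illustration of the technique under assumed names (`Sensor`, `SensorRegistry`, `DepthCamera`), not the SDK's actual interfaces or registration mechanism.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical interface that every sensor implements.
class Sensor {
 public:
  virtual ~Sensor() = default;
  virtual std::string describe() const = 0;
};

// Maps a type name (for example, one read from a config file) to a factory.
class SensorRegistry {
 public:
  using Factory = std::function<std::unique_ptr<Sensor>()>;

  static SensorRegistry& instance() {
    static SensorRegistry registry;  // constructed on first use
    return registry;
  }

  // Returns false if the type name was already registered.
  bool add(const std::string& type, Factory factory) {
    return factories_.emplace(type, std::move(factory)).second;
  }

  std::unique_ptr<Sensor> create(const std::string& type) const {
    auto it = factories_.find(type);
    if (it == factories_.end()) {
      throw std::out_of_range("unknown sensor type: " + type);
    }
    return it->second();
  }

 private:
  std::unordered_map<std::string, Factory> factories_;
};

// Extension code: a new sensor registers itself under a name.
// Nothing in SensorRegistry has to change to support it.
class DepthCamera : public Sensor {
 public:
  std::string describe() const override { return "depth_camera"; }
};

static const bool depth_camera_registered = SensorRegistry::instance().add(
    "depth_camera",
    []() -> std::unique_ptr<Sensor> { return std::make_unique<DepthCamera>(); });
```

The core only ever calls `create("depth_camera")` with a name it was given, so there is no central `switch` statement to edit when new hardware arrives.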
4. Built for synchronized, high-throughput data collection
The producer-to-sink path uses a lock-free queue, so sensor polling is not blocked by disk writes or serialization overhead. That is especially important for high-frequency streams such as arm state recording, where performance problems can quietly degrade data quality. The design is meant to preserve throughput and timing integrity under real recording conditions.
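One common way to build such a path is a single-producer, single-consumer ring buffer over atomics. The sketch below shows that general technique in C++17; it is an assumption about the kind of queue involved, not the SDK's implementation, and `SpscQueue` is a hypothetical name.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Bounded ring buffer that is safe for exactly one producer thread and
// one consumer thread. One slot is kept empty to tell "full" from
// "empty", so it holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
  static_assert(Capacity >= 2, "need at least one usable slot");

 public:
  // Producer thread only. Never blocks: returns false when full so the
  // caller can decide whether to drop or retry.
  bool try_push(T value) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false;
    slots_[head] = std::move(value);
    // Release publishes the slot write to the consumer.
    head_.store(next, std::memory_order_release);
    return true;
  }

  // Consumer thread only. Returns std::nullopt when empty.
  std::optional<T> try_pop() {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;
    T value = std::move(slots_[tail]);
    // Release hands the slot back to the producer.
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return value;
  }

 private:
  std::array<T, Capacity> slots_{};
  std::atomic<std::size_t> head_{0};  // next slot to write
  std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

The polling thread calls `try_push` and returns immediately either way, while a separate writer thread drains the queue with `try_pop` and absorbs the cost of serialization and disk I/O.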
5. A direct path from recording to training
Recorded episodes can be converted to the LeRobot V2 format, producing outputs such as Parquet for structured state data and MP4 for video streams. That shortens the distance from capture to experimentation and avoids a lot of the manual cleanup that typically slows teams down after a recording session.
Why MCAP is part of the story
One of the most important choices in the SDK is its storage approach.
At the core of the storage layer is TrossenMCAP, built on the open MCAP standard. That matters because robotics data is inherently multi-channel and time-sensitive. You are dealing with different stream types, different rates, and a need to preserve synchronization and metadata in a way that remains usable later.
This is not just about file format preference. It is about building a better foundation for capture, inspection, reuse, and downstream processing.
Compared with fragmented flat-file approaches, the MCAP-based workflow keeps a recording episode structured and portable. Compared with middleware-bound approaches, it gives teams a more direct path to handling robotics data without dragging in a full ROS dependency chain.
Who the Trossen SDK is for
The Trossen SDK is for technical users who are tired of rebuilding the same infrastructure every time a project changes.
That includes:
- Robotics researchers collecting demonstration data
- Engineers building imitation learning or teleoperation pipelines
- Teams evaluating sensor stacks and capture workflows
- Developers who want a C++ data collection foundation without requiring ROS
- Contributors who want to help shape open infrastructure for robotics and physical AI
If your work depends on reliable data capture, synchronized recording, and a path to modern training workflows, this project is for you.
Why we want the community involved
We are not releasing this as a static artifact.
We are releasing it because we want people to use it, challenge it, and help improve it.
The current release is an initial implementation. The architecture was built to grow, and we plan to keep pushing it forward with broader hardware support, stronger backends, tighter integration with training workflows, and continued refinement based on real use. The roadmap should not be shaped in isolation. It should be shaped by the people actually building, testing, and deploying robotics data pipelines.
That is why community participation is not a side note here. It is part of the strategy.
Try the Trossen SDK and help shape the roadmap
If you are working in robotics, embodied AI, manipulation, teleoperation, or data infrastructure, we want you to try the Trossen SDK.
Put it in your workflow. Use it with your hardware. Test the edges. Find what feels strong and what still needs work.
Then tell us.
Submit an issue on GitHub, or message us directly from the bottom of the SDK page: Data Collection SDK
Open an issue. Request a feature. Share feedback. Fork the project. Extend support for new hardware. Help us understand where the biggest opportunities are for improvement and standardization.
We believe robotics needs stronger open foundations.
This is our contribution toward that future.
Now we want to build it with the community.
Explore the Trossen SDK
Review the architecture, supported workflows, and setup details on our SDK page.
Test it in your environment
Try it with your own robotics data collection workflow and see where it can save time and reduce friction.
Give feedback
Tell us what worked, what did not, and what would make the SDK more useful to your team.
Contribute
File issues, request features, or fork the project and help move open robotics infrastructure forward.