Atlas AI Agent Developer Experience Feedback
In this post I share some of my experiences as a developer of Atlas AI agents. My goal is to define agents in a modular, reproducible way that is easy to extend and maintain; a workflow that facilitates governance; and an evaluation framework that is flexible, transparent, and functional. All my feedback should be read in that light. I am not after the best user experience in the browser UI, although that matters too. This post is about everything the end user does not see.
Observability
When we call client.agents.chat(), the SDK returns a final answer, structured data items, and a reasoning trace. Here is a representative response:
{
  "response": {
    "messages": [{
      "role": "agent",
      "content": {
        "text": "Short answer: N08 flows to two manifolds ...",
        "type": "text"
      },
      "data": [{
        "type": "instance",
        "view": {"space": "sp_process_domain_model", "externalId": "ProcessEquipment", "version": "1.0.0"},
        "instanceId": {"space": "sp_inst_sol_early_anomaly_detection", "externalId": "cb58108b-..."},
        "properties": {
          "name": "N08",
          "downstream": [
            {"externalId": "c5440d61-...", "name": "East Side Production Manifold"},
            {"externalId": "c5440d64-...", "name": "East Side Test Manifold"}
          ]
        }
      }],
      "reasoning": [
        {"content": [{"text": "Executed Find Process Equipment", "type": "text"}]}
      ]
    }]
  }
}
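To make the schema concrete, here is a minimal Python sketch that pulls the answer text, instance names, and reasoning strings out of a response shaped like the one above. The dict is an abridged copy of that JSON; `summarize` is our own helper, not an SDK method:

```python
# Abridged copy of the response JSON above, as a plain dict.
response = {
    "response": {
        "messages": [{
            "role": "agent",
            "content": {"text": "Short answer: N08 flows to two manifolds ...", "type": "text"},
            "data": [{
                "type": "instance",
                "properties": {
                    "name": "N08",
                    "downstream": [
                        {"externalId": "c5440d61-...", "name": "East Side Production Manifold"},
                        {"externalId": "c5440d64-...", "name": "East Side Test Manifold"},
                    ],
                },
            }],
            "reasoning": [{"content": [{"text": "Executed Find Process Equipment", "type": "text"}]}],
        }]
    }
}

def summarize(resp: dict) -> dict:
    """Flatten the answer text, instance names, and reasoning strings."""
    msg = resp["response"]["messages"][0]
    return {
        "answer": msg["content"]["text"],
        "instances": [d["properties"]["name"] for d in msg.get("data", [])],
        "reasoning": [c["text"] for step in msg.get("reasoning", []) for c in step["content"]],
    }

print(summarize(response))
```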
The reasoning field tells us a tool was called. That is the entire observability surface.
We do not get:
- Tool call inputs (the query the agent constructed)
- Tool call outputs (raw results before LLM processing)
- Intermediate reasoning (why the agent chose a tool, a filter, or stopped paginating)
- Per-tool latency
- Error and retry traces (a tool called six times shows six identical confirmation strings -- no indication of what changed between attempts or why they failed)
- Token usage per LLM call
When a tool call fails silently, we have no way to distinguish a bad query from a platform limitation from a discarded result. The only recourse is to open the browser UI, re-ask the question, and inspect the trace manually. For bulk evaluation this does not scale.
We need full observability of what the agent does if we are to understand why it fails and to evaluate it properly. An opt-in debug mode on the SDK that returns the complete execution trace -- tool inputs, tool outputs, intermediate reasoning, per-step timing -- would let us diagnose failures programmatically and build evaluators that assess query construction, not just final answers.
Context Engineering
The agent receives context from multiple sources: our YAML instructions, tool descriptions, data model metadata (view and property docstrings in CDF), and a platform-injected system prompt invisible to us.
We have no documentation on what each source is for, how they compose, or what the agent actually sees as its assembled prompt. Without this, instruction engineering is guesswork. We cannot tell whether unexpected agent behavior stems from our instructions, the platform prompt, or the data model metadata. We do not know if our instructions are in direct conflict with Cognite's system prompts.
Request: Document the context composition pipeline. Expose the full assembled prompt.
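A toy illustration of why the composition order matters. The sources and orderings below are assumptions for the sake of the example; we have no visibility into what Atlas actually does:

```python
# Illustration only: two plausible ways a platform could assemble the
# prompt from its sources. We do not know which (if either) Atlas uses --
# that is exactly the ambiguity we want documented.
platform_prompt = "Always answer in tables."        # invisible to us (assumed)
our_instructions = "Answer in short prose."         # from our YAML
view_docstring = "ProcessEquipment: physical unit"  # data model metadata

def compose(order: list[str]) -> str:
    sources = {
        "platform": platform_prompt,
        "instructions": our_instructions,
        "metadata": view_docstring,
    }
    return "\n\n".join(sources[name] for name in order)

# Later sections typically win conflicts, so ordering decides whether our
# instruction overrides the platform's -- or the other way around.
a = compose(["platform", "instructions", "metadata"])
b = compose(["instructions", "platform", "metadata"])
print(a != b)  # True: different assembled prompts, different behavior
```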
Show Us the System Prompt
Atlas AI agents ship with a platform-managed system prompt we cannot see. We need to know what is already on the canvas. Does the platform prompt instruct table formatting? Source citation? Uncertainty handling? Are our instructions overriding or conflicting with default behaviors? We are not asking to modify it, just to see it.
runPythonCode in the SDK
The toolkit supports runPythonCode - custom Python functions the agent can call. In the browser UI these run in a Pyodide sandbox. The SDK has no such sandbox, so the agent fails whenever runPythonCode tools are present in its configuration - even if the question never triggers the Python tool.
To evaluate via the SDK in our CI pipelines, we must strip the Python tools from the agent config, adding deployment complexity. The evaluated agent then lacks tools that the production agent has. We lose eval coverage and we are measuring a different system.
SDK parity with the UI agent runtime is a prerequisite for meaningful evaluation.
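Until then, the workaround looks something like this sketch. The config shape is illustrative, not the exact toolkit schema:

```python
# Workaround sketch: filter runPythonCode tools out of an agent config
# before SDK-based evaluation. The config layout here is illustrative.
def strip_python_tools(config: dict) -> dict:
    stripped = dict(config)
    stripped["tools"] = [
        t for t in config.get("tools", []) if t.get("type") != "runPythonCode"
    ]
    return stripped

config = {
    "name": "anomaly-agent",
    "tools": [
        {"type": "queryKnowledgeGraph", "name": "Find Process Equipment"},
        {"type": "runPythonCode", "name": "compute_kpi"},
    ],
}
eval_config = strip_python_tools(config)
print([t["name"] for t in eval_config["tools"]])  # ['Find Process Equipment']
```

This is exactly the problem: `eval_config` is a different system from `config`, and the evaluation results describe the former.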
Performance improvements
Chat completions are very slow - up to several minutes for relatively simple questions. The overhead is significant, which hurts user-friendliness and makes evaluation a time-consuming effort.
I ran a simple test: the same question to the same agent via the browser UI and via the SDK in a notebook. The UI needed 2 min 15 s; the SDK had not finished after 10 minutes. Querying the API directly is fast, so the overhead lies elsewhere.
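For anyone repeating the measurement, the harness need not be fancy. `timed` is our own helper, and the `time.sleep` is a placeholder for the real `client.agents.chat(...)` call:

```python
import time
from contextlib import contextmanager

# Minimal timing harness for quantifying the overhead.
@contextmanager
def timed(label: str, out: dict):
    start = time.perf_counter()
    yield
    out[label] = time.perf_counter() - start

timings: dict[str, float] = {}
with timed("agent_call", timings):
    time.sleep(0.01)  # placeholder for the real SDK call
print(f"agent_call took {timings['agent_call']:.2f}s")
```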
Documentation
An alpha release does not need production-grade documentation. But the features it ships should be documented. We should not have to reverse-engineer the SDK response schema, discover tool limitations by trial and error, or guess at supported query patterns.
A short reference covering the response structure, known limitations, and basic code examples would be sufficient at this stage.
CDF Built-in Agent Evaluation
CDF has a built-in agent evaluation feature. We need transparency on what metrics are computed, which LLM-based evaluators are used, what their prompts and success criteria are, and how scores are aggregated. If we cannot interpret the scores, we cannot act on them.
What we need from a CDF evaluation framework:
- Transparent, multi-metric evaluation. Each metric exposed individually -- not a combined pass/fail. Users need to see why something passed or failed.
- Scheduled evaluation runs. Manual triggering does not scale. Define a test suite, run it on a schedule or on agent config changes.
- Conversation evaluation. Not all agent usage is a one-shot question. Follow-up questions and longer conversations need evaluation support too.
- Debug data. Being able to drill into what went wrong.
- AI review. Have a pre-trained agent evaluate the evaluation and suggest concrete improvements. Could be very useful and a time saver.
- Custom evaluators. The possibility to write our own evaluators would be very cool!
We tested the built-in evaluator with a deliberately corrupted reference response. We changed one entity ID out of 50 to a non-existent value. The candidate response returned the correct 50 IDs -- missing the fake one. The evaluation still passed. The evaluator either is not doing entity-level comparison, or is doing it too loosely to catch a single missing entity in a list of 50. We need to be able to tell whether our agent achieves 100% recall.
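For comparison, here is the entity-level check we expected, sketched in Python. `entity_metrics` is our own function, and the IDs are synthetic stand-ins for the 50-entity test:

```python
# Sketch of the entity-level check we expected the built-in evaluator to
# perform: exact ID comparison with per-metric output, not one pass/fail.
def entity_metrics(reference_ids: set[str], candidate_ids: set[str]) -> dict:
    missing = reference_ids - candidate_ids
    extra = candidate_ids - reference_ids
    recall = len(reference_ids & candidate_ids) / len(reference_ids)
    return {"recall": recall, "missing": sorted(missing), "extra": sorted(extra)}

# Reproducing our corrupted-reference test: 50 real IDs in the candidate,
# and a reference where one ID was swapped for a non-existent value.
real_ids = {f"id-{i:03d}" for i in range(50)}
reference = real_ids | {"id-fake"}
candidate = real_ids

m = entity_metrics(reference, candidate)
print(m["recall"])   # 50/51 -- strictly below 1.0, so this should not pass
print(m["missing"])  # the fake ID shows up as missing, with a name attached
```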
Modular definition
Currently we need to construct a monolithic YAML file for deployment. While this works, I have ended up with a custom setup where I split the definition into config.yaml, instructions.md, and tools.yaml, then combine them into the final YAML in a GitHub workflow before deployment. It would be nice if the Cognite Toolkit supported such a modular approach natively; smaller files are easier to change and review under version control.
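For reference, the assembly step in that workflow boils down to something like this sketch. The filenames match our setup, but the YAML layout is illustrative and assumes the instructions go into a literal block scalar:

```python
import tempfile
from pathlib import Path
from textwrap import indent

# Sketch of merging config.yaml, instructions.md, and tools.yaml into the
# single file the toolkit deploys. The output layout is illustrative.
def build_agent_yaml(config_dir: Path) -> str:
    config = (config_dir / "config.yaml").read_text().rstrip()
    instructions = (config_dir / "instructions.md").read_text().rstrip()
    tools = (config_dir / "tools.yaml").read_text().rstrip()
    return "\n".join([
        config,
        "instructions: |",
        indent(instructions, "  "),  # block scalar body must be indented
        tools,
        "",
    ])

# Demo with throwaway files standing in for the real split definition.
tmp = Path(tempfile.mkdtemp())
(tmp / "config.yaml").write_text("name: anomaly-agent\n")
(tmp / "instructions.md").write_text("Answer concisely.\n")
(tmp / "tools.yaml").write_text("tools:\n  - type: queryKnowledgeGraph\n")
merged = build_agent_yaml(tmp)
print(merged)
```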
Also, see next point.
runPythonCode modular design
If my agent has Python code as part of its tooling, I want that code to be unit tested - especially if the agent ends up in production with end users who rely on it. Today my workflow is a disgusting mess of resolving relative imports to build a self-contained script that can make it into the final YAML. Python code should be written with separation of concerns, modularity, and testability in mind, so we need a much, much better way of introducing Python code to the toolkit. An approach more similar to Functions makes more sense, but I’ll leave the solutioning up to you. I just know that the current setup is not elegant or dev friendly.
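As a minimal illustration of the workflow I am after: the tool logic lives in an ordinary module with an ordinary test, and only the build step worries about bundling it into the YAML. `rolling_mean` is an example tool body, not an Atlas helper:

```python
# tools/rolling_mean.py -- tool logic as a normal, importable module,
# so it can be unit tested long before it is inlined into any YAML.
# (rolling_mean is an example tool, not part of Atlas.)
def rolling_mean(values: list[float], window: int) -> list[float]:
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# tests/test_rolling_mean.py -- runs in CI on every change.
def test_rolling_mean():
    assert rolling_mean([1.0, 2.0, 3.0, 4.0], window=2) == [1.0, 1.5, 2.5, 3.5]

test_rolling_mean()
```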
Thanks for reading all the way to the end :) Have a great weekend!
Sincerely,
Anders Brakestad
