OpenAI AgentKit Guide: How to Build AI Agents

Oct 24, 2025

TL;DR

AgentKit combines Agent Builder, the Agents SDK, ChatKit, and Evals on the Responses API so you can plan, build, measure, and ship real agents. Start with one job and minimal tools, track accuracy, latency, and cost with Evals, then ship via ChatKit or the Apps SDK with versions pinned and a rollback plan.


Modern agents are not just chat. They read, plan, call tools, and return results inside a product experience users can trust. OpenAI’s AgentKit pulls these moving parts into one place so you can design a workflow, give it the right tools, measure quality, and ship a usable interface without stitching five systems together. In short, it tries to turn the messy early steps of agent building into a repeatable path from idea to running feature. 

AgentKit is more than a new API name. It is a set of components that work together. You get a visual Agent Builder for mapping the flow, an Agents SDK when you need code-level control, an embeddable ChatKit UI, and Evals to track accuracy, latency, and cost over time. All of it sits on top of OpenAI’s Responses layer, which handles model calls, tool use, and connectors. If you are planning to get a first agent into production, these pieces give you shared scaffolding across design, development, and operations. 

What is OpenAI AgentKit

Think of AgentKit as a factory floor for agentic features. You lay out the process in Agent Builder, wire approved tools, run tests with Evals, and attach a chat surface through ChatKit or ship inside ChatGPT with the Apps SDK. Each piece has a clear job, and none of them force you into a single development style. You can sketch with a canvas, then move hot paths into code when scale or complexity demands it. 

OpenAI’s own description is simple. Agents are systems that complete tasks on your behalf. Building them used to mean juggling orchestration glue, brittle prompt files, ad hoc tests, and a custom front end. AgentKit’s goal is to replace that pile with a standard set of parts that share vocabulary and telemetry. That matters because the same artifacts your designer shows on day one are the ones your on-call engineer debugs on day thirty. 

Here is how the pieces fit:

Agent Builder gives you a model of the workflow. You define inputs, connect a start node to an agent, add decision points, and specify which tools the agent may call. Versioning is built in, which means you can freeze a working flow before you try a new prompt or a different model. This keeps experiments honest, and it keeps rollback simple when a change underperforms. 

Agents SDK exists for the moments when a canvas is not enough. Maybe you need tighter CI integration, custom retries, or a handoff that crosses several internal services. The SDK exposes programmatic control and still targets the same goal as the canvas, which is a stable agent that can plan, call tools, and report what happened. You can start visually, then migrate critical paths to code without throwing work away. 
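
For orientation, here is a minimal sketch with the Python Agents SDK (the `openai-agents` package), assuming an `OPENAI_API_KEY` in the environment. The agent names, instructions, and handoff target are placeholders, not part of any official example.

```python
from agents import Agent, Runner

# Placeholder specialist the triage agent can hand off to.
billing = Agent(
    name="Billing agent",
    instructions="Resolve billing questions and summarize the resolution.",
)

triage = Agent(
    name="Triage agent",
    instructions="Read the request and hand off to the right specialist.",
    handoffs=[billing],
)

# Synchronous helper; Runner.run is the async variant.
result = Runner.run_sync(triage, "I was charged twice this month.")
print(result.final_output)
```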

ChatKit solves the last mile. Many teams stall at the interface because a good chat surface is more than a textbox. ChatKit brings streaming, buttons and forms, widget actions, and theming that you can drop into a web or mobile app. Advanced docs cover self-hosting and custom authentication when you need to meet stricter requirements. 

Evals closes the loop. You create small datasets that look like real inputs, run comparisons across prompts or models, and track changes in accuracy, latency, and token spend. This turns subjective “seems better” into evidence your team can trust, which is the difference between a flashy demo and a feature that improves month after month. 

There is also a distribution path inside ChatGPT. The Apps SDK lets you package your logic for ChatGPT, powered by MCP so the same app can talk to your services through standard connectors. It is in preview, which means you can build and test now, with broader distribution opening later. For some teams, that becomes a channel to reach users without shipping a separate interface. 

If you remember one idea, make it this. AgentKit is not a single feature. It is a set of roles. Builder for layout, SDK for control, ChatKit for experience, Evals for truth, and Apps SDK for reach. Once you see those roles, planning your first agent becomes a matter of assigning work to the right part and writing down how you will measure progress. 


AgentKit architecture, components, and when to use each

AgentKit is not one thing. It is a small set of parts that fit together so a team can move from idea to a working agent without rebuilding the same plumbing. The platform gives you a visual canvas in Agent Builder, a code surface in the Agents SDK, an embeddable interface with ChatKit, first-party evaluation tools in Evals, and a distribution path inside ChatGPT through the Apps SDK. All of this sits on top of the Responses API, which handles tool calling, connectors, and model interaction. 

  • Agent Builder is where most teams start because it turns the workflow into a shared artifact. You lay out inputs, add an agent node, wire tools and branches, then version the flow when it is good enough to freeze. That versioning matters once you begin to compare prompts or swap models, since you can roll back to a known good state without guessing what changed. OpenAI’s docs position Builder as the fastest path from sketch to working path, with releases you can manage in the same place. 

  • The Agents SDK takes over when the canvas stops being the best lever. If you need tight CI, custom retry logic, or handoffs that span several internal services, writing the orchestration in code gives you control without losing alignment with the rest of AgentKit. The SDK targets the same goals as the canvas, which means you can prototype visually and migrate hot paths to code without discarding the design. 

  • Great agents stall when the interface is weak. ChatKit exists to remove that roadblock. It is a drop-in chat surface with streaming, actions, and widgets, so you can show tool usage, collect structured inputs, and make the experience feel native to your app. The reference explains how to theme, embed, and handle events on your server, which shortens the distance from a working flow to a usable product. 

  • Evals lets you create small datasets that mirror real inputs, grade runs, and compare prompts or models over time. Because evals live inside the same platform, the results are easy to tie back to the specific version of a flow or prompt you shipped. That makes conversations about accuracy, latency, and cost concrete rather than subjective. 

  • Under every agent flow is the Responses API. It is the stateful runtime that unifies chat, tools, and multi-step work, and it replaces older patterns that split these concerns. If you are calling tools, streaming results, or managing long tasks, you are relying on Responses even if you start from the canvas (a short code sketch follows this list). The Azure OpenAI documentation mirrors this design, which is useful context for teams that also deploy on Azure.

  • For your own systems, you can expose capabilities through the Model Context Protocol and register them as tools or connectors. MCP gives you a stable way to connect models to data and actions without bespoke adapters for each app.

  • For users inside ChatGPT, the new Apps SDK lets you package logic and interface elements so your app runs where users already are, with MCP under the hood. Both routes reduce integration friction and give you a cleaner upgrade path as the ecosystem grows. 
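
As a concrete picture of how Responses and MCP meet (the sketch referenced in the Responses bullet above), here is a hedged example of a single Responses call that exposes a hosted MCP connector as a tool. The server label, URL, and model name are placeholders, and the field names follow the Responses documentation as of this writing; verify them against the current reference.

```python
from openai import OpenAI

client = OpenAI()

# One Responses call: the model can invoke tools exposed by the MCP server
# while the runtime manages the multi-step tool loop.
response = client.responses.create(
    model="gpt-4.1",  # placeholder; pin the model you actually ship
    input="Summarize open billing tickets and draft a status update.",
    tools=[
        {
            "type": "mcp",
            "server_label": "ticketing",              # hypothetical internal MCP server
            "server_url": "https://example.com/mcp",  # placeholder URL
            "require_approval": "always",             # keep a human gate on actions
        }
    ],
)

print(response.output_text)
```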


How to choose among these surfaces, in one sentence each: Start in Agent Builder when speed and shared understanding matter. Reach for the Agents SDK when you need code-level control or CI. Use ChatKit when you want to embed a polished chat UI without building it yourself. Lean on Evals any time a change could affect accuracy, latency, or cost. Prefer Responses for any multi-step work with tools. Pick MCP and the Apps SDK when distribution inside ChatGPT or standard connectors will save you time.

Plan your first agent: goals, tools, and contracts

Start by writing the job to be done in a single sentence. Make it outcome-focused, not model-focused. “File a ticket from a conversation and post the link in Slack” is clearer than “use GPT-4o to summarize and act.” This sentence becomes the north star for your canvas in Agent Builder and the acceptance test you encode later in Evals.

List only the tools the agent truly needs. For each one, define a typed schema, scope, and safe defaults. The Responses API expects structured tool definitions, and your descriptions influence when the model decides to call them. Keep scopes narrow. Give every tool an allow-list for destinations and a maximum runtime. If you connect external systems, prefer connectors or MCP servers so you are using supported patterns rather than bespoke adapters. 
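
To make “typed schema, scope, and safe defaults” concrete, here is a hedged example of a function tool definition in the Responses API format. The tool name, fields, and enum values are illustrative, not a prescribed contract.

```python
# Enums act as allow-lists, and strict mode rejects arguments
# that drift from the schema.
create_ticket_tool = {
    "type": "function",
    "name": "create_ticket",
    "description": "File a support ticket. Call only after the user confirms the summary.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "One-line summary of the issue"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
            "queue": {"type": "string", "enum": ["billing", "auth", "other"]},
        },
        "required": ["title", "priority", "queue"],
        "additionalProperties": False,
    },
    "strict": True,
}
```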

Decide where the agent’s knowledge will come from. If the task uses files or internal data, document which sources are trusted and how they will be retrieved. If the experience is chat-led, plan the inputs the UI must collect and which pieces belong in structured widgets versus free text. This is where ChatKit helps, since it supports streaming, actions, and form-like widgets you can wire to your backend. 

Write two budgets before you place a single node. First, a latency budget that users will accept for a normal run and for a slow path. Second, a cost budget per successful task. Add a simple rule for retries and escalation when the agent cannot proceed. These numbers guide your branching in Builder and your comparison runs in Evals. 

Finally, capture contracts. Pin the model family you intend to start with and the tool versions you will allow. Describe what success looks like as an automated check in Evals, then note any manual review that is required for risky actions. This turns “let’s try AgentKit” into an implementation plan that the team can build, measure, and ship. 
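
One way to keep these budgets and contracts in a single reviewable place is a small config object checked into the repo. Everything below is illustrative, with made-up numbers and a placeholder model name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentContract:
    job: str = "File a ticket from a conversation and post the link in Slack"
    model: str = "gpt-4.1"            # pinned model family (placeholder name)
    flow_version: str = "v1"          # Agent Builder version to roll back to
    p50_latency_s: float = 4.0        # soft target for a normal run
    p95_latency_s: float = 12.0       # hard cap for the slow path
    max_cost_per_task_usd: float = 0.05
    max_retries: int = 2              # after this, escalate to a human
```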


Build the agent in AgentKit: step by step

1. Sketch the flow in Agent Builder
• Name the user input, list the allowed tools, and define a clear “done” condition.
• Keep this first draft simple so you can version and compare later.

2. Add the agent node and first tool
• Give the agent a short, literal job description.
• Point the node at the Responses runtime.
• Wire a typed tool schema with a one-sentence description that tells the model when to call it (a code sketch follows this list).

3. Connect data and actions via connectors or MCP
• Prefer built-in connectors or an MCP server over ad hoc HTTP calls.
• If exposing an internal service, stand up an MCP server so you have a stable, reusable interface.

4. Add control flow and bounded retries
• Create a happy path and a recovery path for predictable failures.
• Keep retries explicit and capped to avoid loops.
• Save a baseline version once the happy path works end to end.

5. Set performance and cost budgets
• Define soft targets for p50 and hard caps for p95 latency.
• Set a maximum cost per completed task.
• Add token and rate limits to the most expensive nodes.

6. Manage secrets and configuration early
• Store keys and private endpoints as environment secrets.
• Use per-tenant scopes and keep permissions narrow.
• Never paste credentials into prompts or node descriptions.

7. Stand up a simple UI with ChatKit
• Embed ChatKit in a thin web page to test in context.
• Use streaming and widgets for structured inputs and actions.
• If you plan to distribute inside ChatGPT later, note what will move to the Apps SDK.

8. Version and prepare rollback
• Pin the model family, tool versions, and the flow version that produced your baseline.
• Write a one-line rollback condition and keep it with the release notes.
• Treat each change as a new version so you can revert in two clicks if quality drops.
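
Pulling steps 2 and 4 together, here is a hedged sketch with the Python Agents SDK that wires one typed tool and caps the loop. The tool body and URL are placeholders for your own service.

```python
from agents import Agent, Runner, function_tool

@function_tool
def create_ticket(title: str, priority: str) -> str:
    """File a support ticket and return its URL."""
    # Replace with a call to your ticketing system; keep scopes narrow and log it.
    return f"https://tickets.example.com/new?title={title}&priority={priority}"

agent = Agent(
    name="Ticket filer",
    instructions=(
        "Summarize the conversation, call create_ticket once, "
        "then return the ticket link."  # the literal 'done' condition
    ),
    tools=[create_ticket],
)

# max_turns bounds the plan/act loop so a confused run cannot spin forever.
result = Runner.run_sync(agent, "Customer cannot log in since Monday.", max_turns=4)
print(result.final_output)
```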


Test, evaluate, and iterate with Evals

Great agents are measured, not guessed. Evals is OpenAI’s built-in way to create small datasets, attach graders, and track runs so you can prove that a change actually helps. You can configure evals in the dashboard or programmatically with the API, and the same approach applies whether your agent lives behind ChatKit or a custom UI. 

Write evals that reflect real tasks

Start with a dozen inputs that mirror production, including a few tricky cases. Define graders that mark each run as pass, partial, or fail, and keep the grading criteria close to your user story. Because evals attach to specific flows and prompts, you preserve context for audits and future tuning. 

Compare variants and models the same way every time

Use eval runs to compare a prompt change, a tool setting, or a different model under identical conditions. Keep one frozen baseline and only change a single variable per run. The API makes this repeatable, and reports capture results you can review with your team. 

Track accuracy, latency, and cost together

Quality that blows your budget is not a win. Include latency and token usage in your decision process so you can trade accuracy against speed and cost with eyes open. Responses exposes usage data you can surface alongside eval scores, which turns subjective “feels faster” into numbers. 
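
A bare-bones version of that loop, written by hand rather than through the Evals dashboard, might look like the sketch below. The cases, prompts, grading rule, and model name are all placeholders.

```python
import time

from openai import OpenAI

client = OpenAI()

# Hand-rolled comparison: same cases, two prompt variants, pass/fail grading,
# plus latency and token usage per variant.
CASES = [
    {"input": "Customer cannot log in since Monday.", "must_contain": "log in"},
    {"input": "Refund request for order 1042.", "must_contain": "refund"},
]
PROMPTS = {
    "baseline": "Summarize the request and state the next action.",
    "candidate": "Summarize the request in one sentence, then state the next action.",
}

for name, system_prompt in PROMPTS.items():
    passed, latencies, tokens = 0, [], 0
    for case in CASES:
        start = time.monotonic()
        resp = client.responses.create(
            model="gpt-4.1",            # placeholder; pin the model you actually evaluate
            instructions=system_prompt,
            input=case["input"],
        )
        latencies.append(time.monotonic() - start)
        tokens += resp.usage.total_tokens
        passed += case["must_contain"] in resp.output_text.lower()
    print(name, f"pass={passed}/{len(CASES)}",
          f"slowest={max(latencies):.1f}s", f"tokens={tokens}")
```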

Make it part of the build loop

Run a quick eval after each change and before each release. If you ship a ChatKit UI, use evals to optimize the experience without guesswork. For deeper customization, the open source Evals project and docs show how to define your own tasks and graders. The goal is simple. Keep a tight feedback loop so your agent improves month after month. 

Ship the experience with ChatKit and the Apps SDK

You have a working flow and passing evals. Now it needs a front end that feels real. OpenAI gives you two shipping paths that cover most product plans: embed a chat surface in your app with ChatKit, or distribute inside ChatGPT using the Apps SDK. Both routes speak the same language as your AgentKit flows and the Responses API, which keeps the handoff from build to ship clean. 

ChatKit in your product. ChatKit is a batteries-included chat interface that streams tokens, renders widgets, and wires actions back to your server. You drop the component into your web or mobile app, theme it, and attach handlers for events like tool progress or step completion. The official guide covers the basics, and the advanced integration docs show how to self-host, add custom widgets, and tune performance once traffic grows. Use ChatKit when you want full control of the experience and data flow in your own stack.

Practical setup looks like this: connect ChatKit to your agent backend, expose only the inputs you truly need as form widgets, and stream intermediate updates so users can see progress during longer tool calls. Keep an audit trail of every user action and tool step in your logs. The ChatKit repo and docs list the supported widgets, props, and patterns for server callbacks, which shortens the time from working flow to shippable UI. 

Apps SDK inside ChatGPT. If your users already live in ChatGPT, publish the same capability as an app. The Apps SDK lets you register tools, define UI surface elements, and connect to your backend through MCP. You build a small server, describe each tool with a schema and scopes, and follow the developer guidelines for review and deployment. The reference includes annotations such as a read-only hint on tools, which helps the model plan safely and keeps your intent clear. Use the Apps SDK when distribution and low-friction onboarding matter more than owning every pixel.

MCP and connectors. Both paths benefit from standard connectors. Build or adopt an MCP server to expose private data and actions as first-class tools with typed inputs and explicit permissions. MCP is the protocol OpenAI documents for connecting models to external systems through a stable contract, and the platform guide walks through server setup and connector lifecycle. This keeps your integration portable across ChatKit, AgentKit, and ChatGPT apps.
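
A minimal MCP server sketch using the `FastMCP` helper from the official Python SDK (the `mcp` package) looks roughly like this. The server name, tool, and data source are placeholders for your own systems.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticketing")

@mcp.tool()
def get_open_tickets(queue: str) -> list[str]:
    """Return open ticket titles for a queue (typed input, read-only action)."""
    # Replace with a real query against your ticketing system.
    return [f"{queue}: example ticket"]

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport
```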

A simple rule helps teams choose. Ship ChatKit when you need a native product experience under your brand, with full control of auth, analytics, and feature rollout. Ship with the Apps SDK when you want users to find and use your agent inside ChatGPT, with MCP doing the heavy lifting for safe tool access. You can do both from the same backend once your contracts and evals are in place. 

Operate in production: logging, alerts, and rollback

Shipping the flow is the midpoint, not the finish. Production agents need durable evidence, fast containment, and guardrails that keep costs and latency in check.

Instrument traces you can trust

Capture inputs, tool arguments, model choices, intermediate steps, and outcomes for every run. OpenAI’s evaluation stack supports trace grading, which means end-to-end runs can be assessed with graders that attach to the same traces you view during debugging. Keep these artifacts exportable so audits and postmortems do not depend on a console login.

Alerts and dashboards that matter

Create near-real-time alerts for three classes of issues: quality regressions, latency or error spikes, and cost anomalies. Pair alerts with dashboards that track pass rate from Evals, p95 latency, and tokens or requests per successful task. The goal is simple. See quality, speed, and spend in the same view so operators can act on facts. For governance language and evidence patterns, map your ops checks to NIST AI RMF outcomes so security and risk teams can reuse the same signals.
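
The thresholds themselves can be tiny and explicit. The sketch below assumes you already log one record per run with `passed`, `latency_s`, and `tokens` fields; the numbers and field names are illustrative.

```python
from statistics import quantiles

# runs: one dict per completed task, e.g. {"passed": True, "latency_s": 3.2, "tokens": 1800}.
# Assumes enough runs for the percentile to be meaningful.
def check_runs(runs: list[dict]) -> list[str]:
    alerts = []
    pass_rate = sum(r["passed"] for r in runs) / len(runs)
    p95_latency = quantiles([r["latency_s"] for r in runs], n=20)[-1]
    tokens_per_success = sum(r["tokens"] for r in runs) / max(sum(r["passed"] for r in runs), 1)
    if pass_rate < 0.90:
        alerts.append(f"quality regression: pass rate {pass_rate:.0%}")
    if p95_latency > 12.0:
        alerts.append(f"latency spike: p95 {p95_latency:.1f}s")
    if tokens_per_success > 6000:
        alerts.append(f"cost anomaly: {tokens_per_success:.0f} tokens per success")
    return alerts
```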

Pin releases and plan rollbacks

Every release should pin a flow version, a model family, and tool definitions. Write a one-line rollback rule for each change. When a regression appears in evals or in production traces, roll back first, then investigate. Versioning in AgentKit’s Builder and code surfaces is designed for this kind of safe iteration.

Put cost and rate controls close to the hot paths

Treat budget as a reliability concern. Track token use per step and per task, then add rate limits and max token caps where spend concentrates. The Responses runtime reports usage alongside multi-step tool work, which makes it possible to watch cost trends as you tune prompts or swap models. Use those numbers in release decisions, not only offline benchmarks.
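
In practice that can be as simple as capping the expensive step and logging its usage object alongside the run. The model name and cap below are placeholders.

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4.1",          # placeholder; use the pinned model from your release
    input="Draft the weekly summary for the billing queue.",
    max_output_tokens=800,    # hard cap on this step's spend
)

# Log these next to latency and outcome so cost trends are visible per step.
u = resp.usage
print({"input_tokens": u.input_tokens,
       "output_tokens": u.output_tokens,
       "total_tokens": u.total_tokens})
```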

Keep the UI and distribution in sync

If you embed with ChatKit, log widget actions and server callbacks so user behavior and tool effects share an identifier. If you also ship inside ChatGPT with the Apps SDK, mirror the same tool schemas and scopes through MCP. One contract for tools reduces drift across surfaces and keeps your audit trail consistent. 

Runbooks beat heroics

Write short runbooks for the on-call rotation: where to look first, which metrics define “bad,” who to page for connectors, and how to roll back. Align the runbooks to the same eval pass criteria you use before releases. NIST’s RMF sets the expectation that operations, measurement, and governance share the same playbook, not three different ones.

When to prefer the Agents SDK over Agent Builder

Agent Builder is the shortest path to a working flow that everyone can see and discuss. The Agents SDK is what you reach for when control and integration matter more than canvas speed. The SDK gives you programmatic access to the same agent primitives, which means you can bring agents under CI, write tests, review changes, and ship through the pipelines your team already uses. 

Choose the SDK when orchestration must live inside your codebase. If you need custom retries, fine-grained branching, or dynamic decisions that depend on your own services, the SDK is a better fit. It works on top of the Responses API, so the model can call multiple tools within one request while your code controls state, error handling, and logging. That level of control is hard to maintain on a canvas once complexity rises.

Use the SDK when you care about exact streaming behavior and backpressure. The streaming guides show how to stream tokens and intermediate events, which lets you update your UI progressively, throttle output, or fall back when a tool is slow. These patterns reward code-level handling, especially if you pair an SDK backend with a ChatKit front end.
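
As a rough sketch of that pattern with the Responses API’s streaming events (the model name is a placeholder, and the event handling stands in for your own UI plumbing):

```python
from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="gpt-4.1",  # placeholder model name
    input="Explain the refund policy in two sentences.",
    stream=True,
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)  # push to your UI instead of stdout
    elif event.type == "response.completed":
        print()  # run finished; flush any buffered state here
```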

Security and compliance can also push you to code. If your deployment requires custom authentication, strict data residency, or self-hosting of UI components, you can run ChatKit on your own infrastructure and bind it to an SDK backend. The SDK keeps secrets in your environment, enforces your own access policies, and emits logs in the format your observability stack expects.

Tooling is cleaner in code when contracts are complex. Function tools with JSON schemas, connectors, and MCP servers all benefit from typed definitions and version control. In the SDK you can validate arguments, pin tool versions, and roll changes forward behind flags, while still exposing those tools to ChatKit or to ChatGPT apps later. 

A practical migration path looks like this. Prototype the workflow in Builder so product and design can agree on shape and copy. As reliability needs grow, move hot paths to the SDK one by one. Keep your Evals and tool schemas identical between the canvas and code so scores remain comparable and rollbacks are simple. 

Rule of thumb signals for the SDK

  • You need agents in CI with unit tests and code reviews. 

  • You require custom streaming, throttling, or fallbacks tied to your UI. 

  • You must self host UI components or enforce bespoke auth and logging. 

  • You are managing complex tool contracts or MCP servers that change often. 

Conclusion

AgentKit gives you the parts to turn an idea into a working agent: a canvas to sketch the flow, a code surface for control, a ready UI, and evals that turn changes into evidence. Start small. Define one job, wire only the tools you need, pin a version, and track accuracy, latency, and cost on every change. If your use case outgrows the canvas, move hot paths to the SDK and keep the same contracts and evals. Ready to ship your first build? Grab the checklist, run a two week proof, and decide go or no go with numbers instead of hunches.