The core problem in API migration is not syntactic translation; it is semantic matching under ambiguity. An Apigee proxy policy does not map to a Mulesoft HTTP policy in a one-to-one substitution: the mapping depends on what the policy chain is doing, what the surrounding policies imply about data flow, and whether the behaviour can be reproduced with a single Mulesoft component or requires composition. A rule-based approach, meaning a static list of 1:1 policy substitutions, can handle the easy cases but breaks as soon as policies are chained, customised with JavaScript, or doing something the documented mapping table does not anticipate.
Policy semantic matching via RAG. The system maintains a vector index over Mulesoft and Apigee/Tibco documentation, migration guides, and accumulated human-reviewed migration examples. When the system encounters a source policy, it retrieves the most semantically relevant Mulesoft equivalents. This is appropriate for LLM use: the task is retrieval-augmented generation, not open-ended reasoning. The model's job is to rank and explain matches, not to invent them. The risk — coverage gaps in the doc corpus for undocumented runtime behaviour — is real but bounded and improvable over time.
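The retrieval step can be illustrated with a minimal sketch, assuming embeddings are already computed and held in memory. The names `IndexEntry` and `top_k_matches` are hypothetical, the policy labels in any test data are illustrative rather than a verified mapping, and a production system would use a real embedding model and vector store instead of a Python list.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class IndexEntry:
    """One indexed item: a candidate target policy plus its embedding (hypothetical shape)."""
    target_policy: str              # illustrative label for a Mulesoft equivalent
    source: str                     # e.g. "docs" or "reviewed_example"
    embedding: tuple[float, ...]    # assumed to be precomputed elsewhere


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_matches(query_embedding: tuple[float, ...],
                  index: list[IndexEntry],
                  k: int = 3) -> list[tuple[float, IndexEntry]]:
    """Rank index entries by similarity to the source policy's embedding."""
    scored = [(cosine(query_embedding, entry.embedding), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The similarity scores returned here are what the model would be asked to rank and explain; they do not by themselves establish that a candidate behaves the same way at runtime.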
Migration code generation. Given a matched policy and its context in the source API, the LLM generates the corresponding Mulesoft XML or DataWeave. This is the highest-leverage AI task and the one most prone to errors. Generation is highly reliable for simple policies with clean 1:1 mappings. It is not reliable for complex chained policies with custom scripting, where the model can produce plausible-looking code that fails at runtime. The model routing layer exists specifically to manage this: it sends simple policies to a fast, cheap model and complex or ambiguous ones to a stronger model with more context.
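As a rough sketch of the routing decision, the following assumes three signals are available for each policy: whether it carries custom scripting, how long its chain is, and whether a documented mapping exists. The type `PolicyContext`, the function `choose_model_tier`, the tier names, and the chain-length cut-off are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyContext:
    """Routing inputs for one source policy (hypothetical shape)."""
    name: str
    has_custom_script: bool
    chain_length: int               # number of policies in the chain it belongs to
    has_documented_mapping: bool


def choose_model_tier(policy: PolicyContext, max_simple_chain: int = 1) -> str:
    """Return "fast" for simple 1:1 cases and "strong" for anything complex or ambiguous."""
    if policy.has_custom_script:
        return "strong"             # custom scripting is where plausible-but-wrong output is likely
    if policy.chain_length > max_simple_chain:
        return "strong"             # chained policies need more surrounding context
    if not policy.has_documented_mapping:
        return "strong"             # no known mapping, so treat as ambiguous
    return "fast"
```

Defaulting to the stronger model whenever any signal is present trades cost for safety; the cut-offs would need tuning against reviewed migrations.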
Confidence scoring for review routing. The system must decide which migrations need human eyes and which do not. A fixed threshold on a single metric does not capture the variance well — a policy can have high syntactic similarity to a known mapping but semantically different runtime behaviour. The scoring function combines RAG retrieval confidence, structural complexity of the source policy, and known-risky pattern detection (custom scripts, undocumented features, unusual chaining). The output is not a pass/fail — it is a score with a reason, so the human reviewer knows exactly what triggered the flag.
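One way the scoring function might look is sketched below. The weights (0.3 for complexity, 0.2 per risky pattern), the 0.6 and 0.5 reason cut-offs, and the 0.7 review threshold are placeholder assumptions, as are the names `ReviewDecision` and `score_migration`. The point is the shape of the output: a score plus the reasons behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewDecision:
    score: float                    # 0.0 (no confidence) to 1.0 (full confidence)
    needs_review: bool
    reasons: tuple[str, ...]        # what triggered the flag, for the human reviewer


def score_migration(retrieval_confidence: float,
                    complexity: float,
                    risky_patterns: list[str],
                    review_threshold: float = 0.7) -> ReviewDecision:
    """Combine retrieval confidence, structural complexity (0..1) and risky patterns."""
    reasons: list[str] = []
    score = retrieval_confidence

    if retrieval_confidence < 0.6:
        reasons.append("low retrieval confidence")

    score -= 0.3 * complexity       # assumed weight
    if complexity > 0.5:
        reasons.append("high structural complexity")

    for pattern in risky_patterns:
        score -= 0.2                # assumed penalty per pattern
        reasons.append(f"risky pattern: {pattern}")

    score = max(0.0, min(1.0, score))
    # Any risky pattern forces review, even when the combined score stays high.
    needs_review = score < review_threshold or bool(risky_patterns)
    return ReviewDecision(score, needs_review, tuple(reasons))
```

With these placeholder weights, a policy with 0.95 retrieval confidence and a custom script is still routed to review, which is the case a single-metric threshold would miss.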
Corpus feedback loop. Every human correction is a labelled migration example. Over the first 20–30 APIs, the system accumulates a ground-truth corpus that did not exist at day zero. Subsequent migrations benefit from this — retrieval results improve as the vector index includes real migration examples alongside documentation. This is the mechanism that makes the system get better over time; without it, the 20th migration is as uncertain as the first.
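The feedback loop can be sketched as a small store of reviewed examples. The classes `MigrationExample` and `ExampleCorpus` and their fields are hypothetical; in a real system each recorded example would also be embedded and added to the vector index, for which the in-memory list here is only a stand-in.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MigrationExample:
    """A human-reviewed migration, kept as a labelled example (hypothetical shape)."""
    source_policy: str
    generated_output: str
    corrected_output: str
    reviewer_note: str = ""

    @property
    def was_corrected(self) -> bool:
        # True when the reviewer changed what the model generated.
        return self.generated_output != self.corrected_output


@dataclass
class ExampleCorpus:
    examples: list[MigrationExample] = field(default_factory=list)

    def record_review(self, source_policy: str, generated_output: str,
                      corrected_output: str, reviewer_note: str = "") -> MigrationExample:
        """Store the reviewed result so later retrievals can draw on it."""
        example = MigrationExample(source_policy, generated_output,
                                   corrected_output, reviewer_note)
        self.examples.append(example)
        return example

    def examples_for(self, source_policy: str) -> list[MigrationExample]:
        """Return every reviewed example recorded for a given source policy."""
        return [e for e in self.examples if e.source_policy == source_policy]
```

Keeping the generated output alongside the correction preserves what the model got wrong, not just the right answer, which is useful when deciding which patterns to flag as risky.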