cline-vertex-gw
I love coding with AI agents, but my wallet absolutely hates it. If you have ever run Cline or other agentic loops, you know exactly what I mean. The token multiplier effect can turn a 20-minute debugging session into a three-figure cloud bill before you even finish your first coffee.
Frustratingly, Google Cloud Vertex AI offers cheap, reliable access to models like Claude and Gemini, but its REST API is completely different from the standard OpenAI and Ollama endpoints that most developer tools expect.
To bridge this gap and keep my cloud costs from spiraling out of control, I built cline-vertex-gw. It is a lightweight, high-performance gateway written in Go that translates Ollama and OpenAI API requests into native Vertex AI calls on the fly, while aggressively crushing token overhead using in-flight compression pipelines.
graph TD
    Client[Cline / Ollama / OpenAI Client] -->|Ollama / OpenAI Dialect| GW[cline-vertex-gw]
    GW -->|1. Normalize Whitespace| Pipe[Compression Pipeline]
    GW -->|2. Collapse Env Blocks| Pipe
    GW -->|3. Byte-Budget Trim| Pipe
    GW -->|4. Hashed Dedup| Pipe
    Pipe -->|Google GenAI SDK / REST| Vertex[Google Cloud Vertex AI]
Bridging the API gap in real-time
Most translation proxies are fragile wrappers that break on streaming responses or tool-calling payloads. cline-vertex-gw was designed from the ground up to support native streaming (SSE and NDJSON) and bidirectional tool/function-calling translations.
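To make the two streaming framings concrete, here is a minimal sketch of how one text delta could be framed as an OpenAI-style SSE event and as an Ollama-style NDJSON line. The type and function names (`openAIChunk`, `ollamaChunk`, `sseFrame`, `ndjsonLine`) are hypothetical, and the payloads are trimmed to a few fields; this is not the gateway's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the streamed payloads; real chunks carry more fields.
type delta struct {
	Content string `json:"content"`
}
type choice struct {
	Index int   `json:"index"`
	Delta delta `json:"delta"`
}
type openAIChunk struct {
	Choices []choice `json:"choices"`
}
type ollamaMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
type ollamaChunk struct {
	Message ollamaMessage `json:"message"`
	Done    bool          `json:"done"`
}

// sseDone is the sentinel event that ends an OpenAI-style stream.
const sseDone = "data: [DONE]\n\n"

// sseFrame wraps one text delta in a server-sent event: "data: <json>\n\n".
func sseFrame(text string) (string, error) {
	b, err := json.Marshal(openAIChunk{Choices: []choice{{Index: 0, Delta: delta{Content: text}}}})
	if err != nil {
		return "", err
	}
	return "data: " + string(b) + "\n\n", nil
}

// ndjsonLine wraps one text delta in a single newline-terminated JSON object.
func ndjsonLine(text string, done bool) (string, error) {
	b, err := json.Marshal(ollamaChunk{Message: ollamaMessage{Role: "assistant", Content: text}, Done: done})
	if err != nil {
		return "", err
	}
	return string(b) + "\n", nil
}

func main() {
	for _, piece := range []string{"Hel", "lo"} {
		s, _ := sseFrame(piece)
		n, _ := ndjsonLine(piece, false)
		fmt.Print(s, n)
	}
	last, _ := ndjsonLine("", true)
	fmt.Print(sseDone, last)
}
```

The same upstream delta feeds both framings, so a gateway only needs to pick the framing that matches the dialect the client spoke.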
When Cline sends a request with tool definitions, the gateway intercepts and translates the tools, tool_choice, and tool_calls schemas into the exact shape expected by the upstream model (like Anthropic messages or Gemini function declarations). The same thing happens on the return path, allowing tool execution to work transparently without the client ever knowing it is talking to Google Cloud.
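As a rough illustration of the request-path half of that translation, the sketch below maps an OpenAI-style `tools` array onto a simplified function-declaration shape. `openAITool`, `functionDeclaration`, and `toFunctionDeclarations` are hypothetical names, not the gateway's types, and the JSON Schema in `parameters` is passed through untouched, which a real translation would likely need to adapt for each upstream model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// openAITool mirrors one entry of an OpenAI chat-completions "tools" array.
type openAITool struct {
	Type     string `json:"type"`
	Function struct {
		Name        string          `json:"name"`
		Description string          `json:"description"`
		Parameters  json.RawMessage `json:"parameters"`
	} `json:"function"`
}

// functionDeclaration is a hypothetical, simplified stand-in for the
// declaration shape an upstream model expects.
type functionDeclaration struct {
	Name        string          `json:"name"`
	Description string          `json:"description,omitempty"`
	Parameters  json.RawMessage `json:"parameters,omitempty"`
}

// toFunctionDeclarations decodes OpenAI-style tools and keeps only the
// "function" entries, copying name, description, and parameter schema.
func toFunctionDeclarations(raw []byte) ([]functionDeclaration, error) {
	var tools []openAITool
	if err := json.Unmarshal(raw, &tools); err != nil {
		return nil, fmt.Errorf("decode tools: %w", err)
	}
	out := make([]functionDeclaration, 0, len(tools))
	for _, t := range tools {
		if t.Type != "function" {
			continue // tool kinds this sketch does not handle
		}
		out = append(out, functionDeclaration{
			Name:        t.Function.Name,
			Description: t.Function.Description,
			Parameters:  t.Function.Parameters,
		})
	}
	return out, nil
}

func main() {
	in := []byte(`[{"type":"function","function":{"name":"read_file","description":"Read a file","parameters":{"type":"object","properties":{"path":{"type":"string"}}}}}]`)
	decls, err := toFunctionDeclarations(in)
	if err != nil {
		panic(err)
	}
	b, _ := json.Marshal(decls)
	fmt.Println(string(b))
}
```

The return path would run the mirror image of this mapping, turning the model's function-call output back into OpenAI-style `tool_calls`.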
The secret sauce: token crushing compression
The real magic of the gateway lies in its in-flight prompt compression stack. When you are using an agent like Cline, it injects a massive <environment_details> block at the end of every user turn. This block lists your open editor tabs, visible files, working directory, and a complete recursive file tree.
Because chat APIs are stateless and the client resends the full conversation history with every request, Cline ends up re-shipping this mostly static environment payload on every single turn. By turn fifteen, each request carries fifteen copies of nearly the same file tree, and you pay for all of them.
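// CollapseEnvBlocks collapses the environment block in every turn except the
// latest user turn. The collapseEnvBlocks flag and the collapseInContent
// helper are defined elsewhere in the gateway.
func CollapseEnvBlocks(contents []*genai.Content) []*genai.Content {
	if !collapseEnvBlocks || len(contents) == 0 {
		return contents
	}
	// We only care about the latest user turn's environment details.
	lastUserIdx := -1
	for i := len(contents) - 1; i >= 0; i-- {
		c := contents[i]
		if c == nil {
			continue
		}
		if c.Role == genai.RoleUser || c.Role == "user" {
			lastUserIdx = i
			break
		}
	}
	out := make([]*genai.Content, len(contents))
	for i, c := range contents {
		// Keep nil entries and the latest user turn untouched.
		if c == nil || i == lastUserIdx {
			out[i] = c
			continue
		}
		// For an older turn, strip out the giant environment details block.
		// The two extra return values are discarded here, since Go rejects
		// unused variables.
		nc, _, _ := collapseInContent(c)
		out[i] = nc
	}
	return out
}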
The gateway intercepts the content array and locates the last user turn. It preserves that final turn’s environment snapshot verbatim (since the model needs to know the “right now” state) but collapses the environment block in all older history turns down to a single-line placeholder.
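The collapsing step itself can be sketched at the string level. The version below is an assumption about how such a helper might work, not the project's `collapseInContent`: `collapseEnvText`, `envBlockRe`, and the placeholder wording are all hypothetical, and it operates on plain text rather than on `genai.Content` parts.

```go
package main

import (
	"fmt"
	"regexp"
)

// envBlockRe matches one <environment_details> block, newlines included.
// (?s) lets "." match "\n"; ".*?" keeps the match non-greedy.
var envBlockRe = regexp.MustCompile(`(?s)<environment_details>.*?</environment_details>`)

// envPlaceholder is a hypothetical single-line replacement.
const envPlaceholder = "<environment_details>[collapsed by gateway]</environment_details>"

// collapseEnvText replaces every environment block that is longer than the
// placeholder. It returns the new text, the bytes saved, and the number of
// blocks collapsed.
func collapseEnvText(text string) (out string, saved, n int) {
	out = envBlockRe.ReplaceAllStringFunc(text, func(block string) string {
		if len(block) <= len(envPlaceholder) {
			return block // already collapsed, or too small to be worth it
		}
		n++
		return envPlaceholder
	})
	return out, len(text) - len(out), n
}

func main() {
	turn := "fix the failing test\n<environment_details>\n# VSCode Open Tabs\nmain.go\n# Current Working Directory Files\na.go\nb.go\n</environment_details>"
	out, saved, n := collapseEnvText(turn)
	fmt.Printf("collapsed %d block(s), saved %d bytes\n%s\n", n, saved, out)
}
```

Skipping blocks that are no longer than the placeholder keeps the function idempotent, so running it over history that was already collapsed changes nothing.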
Combined with lossless whitespace normalization and a smart sliding-window deduplicator that replaces repetitive copy-pasted logs with backward references, this pipeline routinely slices 40% to 70% off my prompt token consumption on long debugging runs.
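For a feel of the other two stages, here is a small sketch of whitespace normalization and hash-based deduplication with backward references. Everything in it is an assumption made for illustration: `normalizeWhitespace`, `dedupBlocks`, the `window` and `minLen` parameters, and the reference wording are hypothetical, and the gateway's real rules may differ.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"regexp"
	"strings"
)

var (
	trailingWS = regexp.MustCompile(`[ \t]+\n`) // spaces or tabs before a newline
	blankRuns  = regexp.MustCompile(`\n{3,}`)   // two or more blank lines in a row
)

// normalizeWhitespace unifies line endings, strips trailing whitespace, and
// squeezes runs of blank lines down to a single blank line.
func normalizeWhitespace(s string) string {
	s = strings.ReplaceAll(s, "\r\n", "\n")
	s = trailingWS.ReplaceAllString(s, "\n")
	return blankRuns.ReplaceAllString(s, "\n\n")
}

// dedupBlocks replaces a block with a short backward reference when an
// identical block of at least minLen bytes appeared within the last
// `window` positions. Blocks are compared by SHA-256 hash.
func dedupBlocks(blocks []string, window, minLen int) []string {
	out := make([]string, len(blocks))
	seen := make(map[[32]byte]int) // hash -> index of the last full copy
	for i, b := range blocks {
		out[i] = b
		if len(b) < minLen {
			continue // too short for a reference to save anything
		}
		h := sha256.Sum256([]byte(b))
		if j, ok := seen[h]; ok && i-j <= window {
			out[i] = fmt.Sprintf("[duplicate of block %d omitted]", j)
			continue // keep pointing at the full copy, not at this reference
		}
		seen[h] = i
	}
	return out
}

func main() {
	logLine := "ERROR dial tcp 10.0.0.1:5432: connection refused"
	blocks := []string{logLine, "trying a fix", logLine}
	for _, b := range dedupBlocks(blocks, 8, 20) {
		fmt.Println(b)
	}
	fmt.Printf("%q\n", normalizeWhitespace("a  \n\n\n\nb"))
}
```

Only a full copy is recorded in `seen`, so a reference never points at another reference; once the original slides out of the window, the next occurrence is kept in full and becomes the new target.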
Spin it up in minutes
Getting it running is straightforward. You just need your Google Cloud application default credentials and a single binary build.
# Build the binary
go build -o cline-vertex-gw .
# Set your project variables and launch it
export GOOGLE_CLOUD_PROJECT=my-awesome-gcp-project
export GOOGLE_CLOUD_LOCATION=us-central1
export GATEWAY_AUTH_TOKEN=super-secure-shared-secret
./cline-vertex-gw
Once up, you can point Cline’s Ollama or OpenAI-Compatible provider at http://127.0.0.1:11434 and watch your GCP billing graphs flatten out into beautiful, cost-effective lines.
The project is fully open source and is my go-to helper for day-to-day agentic coding.
- Codebase & Contributing: f0o/cline-vertex-gw on GitHub