Documentation
Complete API reference and integration guide for OllaBridge Cloud
Overview
OllaBridge Cloud is a free LLM gateway that provides an OpenAI-compatible API backed by Ollama. It runs on Hugging Face Spaces and can be enhanced with GPU boost nodes from Google Colab.
Architecture: Browser → OllaBridge Cloud (HF Spaces) → Ollama (CPU) or GPU Relay Node (Colab T4)
Quick Start
Test the API
curl https://ruslanmv-ollabridge.hf.space/health
curl https://ruslanmv-ollabridge.hf.space/ollama/v1/models
curl -X POST https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
API Endpoints
All endpoints are available at https://ruslanmv-ollabridge.hf.space
LLM Inference
| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /ollama/v1/chat/completions | OpenAI-compatible chat completions (streaming + non-streaming) |
| GET | /ollama/v1/models | List available models (OpenAI format) |
| POST | /ollama/api/chat | Ollama native chat endpoint (passthrough) |
| GET | /ollama/api/tags | List models (Ollama native format) |
| POST | /ollama/api/embeddings | Generate text embeddings |
System
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Gateway health + relay statistics |
| GET | /ollama/status | Ollama + relay node status |
| GET | /docs | Interactive Swagger API documentation |
Device Relay
| Method | Endpoint | Description |
| --- | --- | --- |
| WS | /relay/connect | WebSocket relay for GPU boost nodes |
| POST | /device/start | Start device pairing flow |
Chat Completions
Request Body
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | qwen2.5:1.5b | Model name |
| messages | array | required | Chat messages ({role, content}) |
| temperature | float | 0.7 | Sampling temperature |
| max_tokens | int | null | Max tokens to generate |
| stream | bool | false | Enable SSE streaming |
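The defaults above can be folded into a small request builder. A sketch in Python — the helper name build_request is ours for illustration, not part of any SDK:

```python
# Documented defaults for optional chat-completions fields.
DEFAULTS = {
    "model": "qwen2.5:1.5b",
    "temperature": 0.7,
    "max_tokens": None,
    "stream": False,
}

def build_request(messages, **overrides):
    """Merge caller overrides onto the documented defaults.

    `messages` is the only required field; anything not
    overridden falls back to the gateway's defaults.
    """
    return {**DEFAULTS, "messages": messages, **overrides}

body = build_request(
    [{"role": "user", "content": "Hello!"}],
    temperature=0.2,
)
# body keeps the default model but uses the overridden temperature.
```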
Streaming Response (SSE)
data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}
data: {"choices":[{"finish_reason":"stop"}],"usage":{...}}
data: [DONE]
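A client consumes this stream by reading data: lines until the [DONE] sentinel. A minimal Python parser for frames like the ones above — the helper name iter_deltas is ours, and lines can be any iterable of decoded SSE lines (e.g. from requests' iter_lines):

```python
import json

def iter_deltas(lines):
    """Yield content fragments from an SSE chat-completions stream."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blanks / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":           # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:            # final frame may carry no delta
            yield delta["content"]

sample = [
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # → Hello!
```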
Available Models
Models available on the free CPU tier:
| Model | Size | Speed (CPU) | Quality |
| --- | --- | --- | --- |
| qwen2.5:1.5b | 1 GB | ~8 tok/s | Good (default) |
| qwen2.5:0.5b | 400 MB | ~15 tok/s | Basic |
| phi3:mini | 2.3 GB | ~3 tok/s | Better |
| gemma2:2b | 1.6 GB | ~5 tok/s | Good |
Note: With a GPU boost node (Colab T4), you can run 7-14B models at 20-40 tok/s.
3D Avatar Chatbot Integration
Connect the 3D Avatar Chatbot to OllaBridge Cloud:
Settings Panel Configuration
| Setting | Value |
| --- | --- |
| Provider | Custom / OpenAI-compatible |
| Base URL | https://ruslanmv-ollabridge.hf.space/ollama/v1 |
| Model | qwen2.5:1.5b |
| API Key | (leave empty) |
WebSocket Relay
GPU nodes connect to the relay via WebSocket. The node dials out to the cloud — no port forwarding needed.
Protocol
| Direction | Type | Description |
| --- | --- | --- |
| Node → Cloud | hello | Announce models and capabilities |
| Node → Cloud | ping | Heartbeat |
| Cloud → Node | req | Forward a chat request |
| Node → Cloud | res | Return the response |
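The payload schema beyond the type field is not documented here, so as a sketch, a boost node's side of the handshake might assemble frames like the following — the models, gpu, id, and body fields are our assumptions; only the type values come from the table above:

```python
import json

def hello_frame(models, gpu="T4"):
    """Node → Cloud: announce what this boost node can serve.

    Field names other than `type` are illustrative assumptions.
    """
    return json.dumps({"type": "hello", "models": models, "gpu": gpu})

def ping_frame():
    """Node → Cloud: heartbeat so the gateway keeps the node routable."""
    return json.dumps({"type": "ping"})

def res_frame(req_id, body):
    """Node → Cloud: answer a forwarded `req` identified by `req_id`."""
    return json.dumps({"type": "res", "id": req_id, "body": body})

# A node would send these over a WebSocket to /relay/connect, e.g.
# with the `websockets` package (connection code omitted here):
#   async with websockets.connect(CLOUD_WS + "/relay/connect") as ws:
#       await ws.send(hello_frame(["qwen2.5:7b"]))
print(json.loads(hello_frame(["qwen2.5:7b"]))["type"])  # → hello
```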
Google Colab GPU Boost
Use a free Colab T4 GPU as a boost worker:
- Open the colab_gpu_node.ipynb notebook in Google Colab
- Select Runtime → Change runtime type → T4 GPU
- Set the OllaBridge Cloud URL and run all cells
- The node connects via WebSocket or exposes via ngrok
SDK Examples
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://ruslanmv-ollabridge.hf.space/ollama/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
JavaScript (fetch)
const response = await fetch(
  "https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: false,
    }),
  }
);
const data = await response.json();
console.log(data.choices[0].message.content);