Documentation
Complete API reference and integration guide for OllaBridge Cloud
Overview
OllaBridge Cloud is a free LLM gateway that provides an OpenAI-compatible API backed by Ollama. It runs on Hugging Face Spaces and can be enhanced with GPU boost nodes from Google Colab.
Architecture: Browser → OllaBridge Cloud (HF Spaces) → Ollama (CPU) or GPU Relay Node (Colab T4)
Quick Start
Test the API
curl https://ruslanmv-ollabridge.hf.space/health
curl https://ruslanmv-ollabridge.hf.space/ollama/v1/models
curl -X POST https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
API Endpoints
All endpoints are available at https://ruslanmv-ollabridge.hf.space
LLM Inference
| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /ollama/v1/chat/completions | OpenAI-compatible chat completions (streaming + non-streaming) |
| GET | /ollama/v1/models | List available models (OpenAI format) |
| POST | /ollama/api/chat | Ollama native chat endpoint (passthrough) |
| GET | /ollama/api/tags | List models (Ollama native format) |
| POST | /ollama/api/embeddings | Generate text embeddings |
System
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Gateway health + relay statistics |
| GET | /ollama/status | Ollama + relay node status |
| GET | /docs | Interactive Swagger API documentation |
Device Relay
| Method | Endpoint | Description |
| --- | --- | --- |
| WS | /relay/connect | WebSocket relay for GPU boost nodes |
| POST | /device/start | Start device pairing flow |
Chat Completions
Request Body
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | qwen2.5:1.5b | Model name |
| messages | array | required | Chat messages ({role, content}) |
| temperature | float | 0.7 | Sampling temperature |
| max_tokens | int | null | Max tokens to generate |
| stream | bool | false | Enable SSE streaming |
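The defaults above can be folded into a small request builder. A sketch in Python — the helper name build_request is ours for illustration, not part of any SDK:

```python
# Documented defaults for optional chat-completions fields.
DEFAULTS = {
    "model": "qwen2.5:1.5b",
    "temperature": 0.7,
    "max_tokens": None,
    "stream": False,
}

def build_request(messages, **overrides):
    """Merge caller overrides onto the documented defaults.

    `messages` is the only required field; anything not
    overridden falls back to the gateway's defaults.
    """
    return {**DEFAULTS, "messages": messages, **overrides}

body = build_request(
    [{"role": "user", "content": "Hello!"}],
    temperature=0.2,
)
# body keeps the default model but uses the overridden temperature.
```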
Streaming Response (SSE)
data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}
data: {"choices":[{"finish_reason":"stop"}],"usage":{...}}
data: [DONE]
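A client consumes this stream by reading data: lines until the [DONE] sentinel. A minimal Python parser for frames like the ones above — the helper name iter_deltas is ours, and lines can be any iterable of decoded SSE lines (e.g. from requests' iter_lines):

```python
import json

def iter_deltas(lines):
    """Yield content fragments from an SSE chat-completions stream."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blanks / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":           # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:            # final frame may carry no delta
            yield delta["content"]

sample = [
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # → Hello!
```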
Available Models
Models available on the free CPU tier:
| Model | Size | Speed (CPU) | Quality |
| --- | --- | --- | --- |
| qwen2.5:1.5b | 1 GB | ~8 tok/s | Good (default) |
| qwen2.5:0.5b | 400 MB | ~15 tok/s | Basic |
| phi3:mini | 2.3 GB | ~3 tok/s | Better |
| gemma2:2b | 1.6 GB | ~5 tok/s | Good |
Note: With a GPU boost node (Colab T4), you can run 7-14B models at 20-40 tok/s.
3D Avatar Chatbot Integration
Connect the 3D Avatar Chatbot to OllaBridge Cloud:
Settings Panel Configuration
| Setting | Value |
| --- | --- |
| Provider | Custom / OpenAI-compatible |
| Base URL | https://ruslanmv-ollabridge.hf.space/ollama/v1 |
| Model | qwen2.5:1.5b |
| API Key | (leave empty) |
WebSocket Relay
GPU nodes connect to the relay via WebSocket. The node dials out to the cloud — no port forwarding needed.
Protocol
| Direction | Type | Description |
| --- | --- | --- |
| Node → Cloud | hello | Announce models and capabilities |
| Node → Cloud | ping | Heartbeat |
| Cloud → Node | req | Forward a chat request |
| Node → Cloud | res | Return the response |
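The payload schema beyond the type field is not documented here, so as a sketch, a boost node's side of the handshake might assemble frames like the following — the models, gpu, id, and body fields are our assumptions; only the type values come from the table above:

```python
import json

def hello_frame(models, gpu="T4"):
    """Node → Cloud: announce what this boost node can serve.

    Field names other than `type` are illustrative assumptions.
    """
    return json.dumps({"type": "hello", "models": models, "gpu": gpu})

def ping_frame():
    """Node → Cloud: heartbeat so the gateway keeps the node routable."""
    return json.dumps({"type": "ping"})

def res_frame(req_id, body):
    """Node → Cloud: answer a forwarded `req` identified by `req_id`."""
    return json.dumps({"type": "res", "id": req_id, "body": body})

# A node would send these over a WebSocket to /relay/connect, e.g.
# with the `websockets` package (connection code omitted here):
#   async with websockets.connect(CLOUD_WS + "/relay/connect") as ws:
#       await ws.send(hello_frame(["qwen2.5:7b"]))
print(json.loads(hello_frame(["qwen2.5:7b"]))["type"])  # → hello
```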
Google Colab GPU Boost
Use a free Colab T4 GPU as a boost worker:
- Open the colab_gpu_node.ipynb notebook in Google Colab
- Select Runtime → Change runtime type → T4 GPU
- Set the OllaBridge Cloud URL and run all cells
- The node connects via WebSocket or exposes via ngrok
SDK Examples
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://ruslanmv-ollabridge.hf.space/ollama/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
JavaScript (fetch)
const response = await fetch(
  "https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: false,
    }),
  }
);
const data = await response.json();
console.log(data.choices[0].message.content);