Documentation

Complete API reference and integration guide for OllaBridge Cloud

Overview

OllaBridge Cloud is a free LLM gateway that provides an OpenAI-compatible API backed by Ollama. It runs on Hugging Face Spaces and can be enhanced with GPU boost nodes from Google Colab.

Architecture: Browser → OllaBridge Cloud (HF Spaces) → Ollama (CPU) or GPU Relay Node (Colab T4)

Quick Start

Test the API

```bash
# Health check
curl https://ruslanmv-ollabridge.hf.space/health

# List available models
curl https://ruslanmv-ollabridge.hf.space/ollama/v1/models

# Chat completion
curl -X POST https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```

API Endpoints

All endpoints are available at https://ruslanmv-ollabridge.hf.space

LLM Inference

| Endpoint | Description |
|---|---|
| POST /ollama/v1/chat/completions | OpenAI-compatible chat completions (streaming + non-streaming) |
| GET /ollama/v1/models | List available models (OpenAI format) |
| POST /ollama/api/chat | Ollama native chat endpoint (passthrough) |
| GET /ollama/api/tags | List models (Ollama native format) |
| POST /ollama/api/embeddings | Generate text embeddings |
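The embeddings endpoint takes an Ollama-native request body. A minimal Python sketch, assuming the gateway passes through Ollama's standard /api/embeddings shape (a JSON body with model and prompt fields, returning an embedding array); the helper names are illustrative, not part of any SDK:

```python
import json
import urllib.request

GATEWAY = "https://ruslanmv-ollabridge.hf.space"

def build_embeddings_request(prompt, model="qwen2.5:1.5b"):
    """Build the JSON body for Ollama's native /api/embeddings endpoint."""
    return {"model": model, "prompt": prompt}

def get_embedding(prompt, model="qwen2.5:1.5b"):
    """POST the request and return the embedding vector (a list of floats)."""
    body = json.dumps(build_embeddings_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{GATEWAY}/ollama/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Example (requires network access to the gateway):
#   vector = get_embedding("Hello, world!")
#   print(len(vector))  # dimensionality depends on the model
```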

System

| Endpoint | Description |
|---|---|
| GET /health | Gateway health + relay statistics |
| GET /ollama/status | Ollama + relay node status |
| GET /docs | Interactive Swagger API documentation |

Device Relay

| Endpoint | Description |
|---|---|
| WS /relay/connect | WebSocket relay for GPU boost nodes |
| POST /device/start | Start device pairing flow |

Chat Completions

Request Body

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | qwen2.5:1.5b | Model name |
| messages | array | required | Chat messages ({role, content}) |
| temperature | float | 0.7 | Sampling temperature |
| max_tokens | int | null | Max tokens to generate |
| stream | bool | false | Enable SSE streaming |
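The defaults above can be captured in a small helper. A sketch in Python (the helper name is illustrative, not part of any SDK):

```python
def build_chat_payload(messages, model="qwen2.5:1.5b", temperature=0.7,
                       max_tokens=None, stream=False):
    """Assemble a request body for /ollama/v1/chat/completions.

    Defaults mirror the table above; max_tokens is omitted when None,
    letting the server decide how many tokens to generate.
    """
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "stream": stream,
    }
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload

# A minimal non-streaming request body:
body = build_chat_payload([{"role": "user", "content": "Hello!"}])
```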

Streaming Response (SSE)

```
# Server-Sent Events format
data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}

data: {"choices":[{"finish_reason":"stop"}],"usage":{...}}

data: [DONE]
```
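Consuming the stream amounts to reading data: lines, skipping the [DONE] sentinel, and pulling delta.content out of each chunk. A minimal hand-rolled parser sketch (no SDK assumed):

```python
import json

def extract_delta(sse_line):
    """Return the text delta carried by one SSE line, or None.

    None is returned for non-data lines, the [DONE] sentinel, and chunks
    without content (e.g. the final finish_reason/usage chunk).
    """
    line = sse_line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("content")

# Reassemble the example stream shown above:
stream = [
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"id":"chatcmpl-ollabridge","choices":[{"delta":{"content":"!"}}]}',
    'data: [DONE]',
]
text = "".join(t for line in stream if (t := extract_delta(line)))
```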

Available Models

Models available on the free CPU tier:

| Model | Size | Speed (CPU) | Quality |
|---|---|---|---|
| qwen2.5:1.5b | 1 GB | ~8 tok/s | Good (default) |
| qwen2.5:0.5b | 400 MB | ~15 tok/s | Basic |
| phi3:mini | 2.3 GB | ~3 tok/s | Better |
| gemma2:2b | 1.6 GB | ~5 tok/s | Good |
Note: With a GPU boost node (Colab T4), you can run 7-14B models at 20-40 tok/s.

3D Avatar Chatbot Integration

Connect the 3D Avatar Chatbot to OllaBridge Cloud:

Settings Panel Configuration

| Setting | Value |
|---|---|
| Provider | Custom / OpenAI-compatible |
| Base URL | https://ruslanmv-ollabridge.hf.space/ollama/v1 |
| Model | qwen2.5:1.5b |
| API Key | (leave empty) |

WebSocket Relay

GPU nodes connect to the relay via WebSocket. The node dials out to the cloud — no port forwarding needed.

Protocol

| Direction | Type | Description |
|---|---|---|
| Node → Cloud | hello | Announce models and capabilities |
| Node → Cloud | ping | Heartbeat |
| Cloud → Node | req | Forward a chat request |
| Node → Cloud | res | Return the response |
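The message types above can be illustrated with a few constructors. This is a hedged sketch: only the type names are documented, so the other JSON field names (models, capabilities, id, payload) are assumptions:

```python
import json

def hello_message(models, capabilities=None):
    """Node → Cloud: announce which models this GPU node can serve.

    Field names other than "type" are illustrative assumptions.
    """
    return {"type": "hello", "models": models, "capabilities": capabilities or {}}

def ping_message():
    """Node → Cloud: heartbeat so the gateway knows the node is alive."""
    return {"type": "ping"}

def res_message(request_id, payload):
    """Node → Cloud: return the completed response for a forwarded req."""
    return {"type": "res", "id": request_id, "payload": payload}

# On the wire each message would be one JSON text frame, e.g.:
frame = json.dumps(hello_message(["qwen2.5:7b"]))
```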

Google Colab GPU Boost

Use a free Colab T4 GPU as a boost worker:

  1. Open the colab_gpu_node.ipynb notebook in Google Colab
  2. Select Runtime → Change runtime type → T4 GPU
  3. Set the OllaBridge Cloud URL and run all cells
  4. The node connects via WebSocket or exposes via ngrok

SDK Examples

Python (OpenAI SDK)

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://ruslanmv-ollabridge.hf.space/ollama/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
```

JavaScript (fetch)

```javascript
const response = await fetch(
  "https://ruslanmv-ollabridge.hf.space/ollama/v1/chat/completions",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:1.5b",
      messages: [{ role: "user", content: "Hello!" }],
      stream: false,
    }),
  }
);

const data = await response.json();
console.log(data.choices[0].message.content);
```