This RFC presents a proposal for the community takeover of the Gateway infrastructure.
It is based on the Foundation RFP: Babylon Gateway, specifically https://drive.google.com/file/d/18m-Y5A2Te_1CJlVS8rww0HDHtbJvwg4O/view?usp=sharing [1].
The goal is to find an architecture with the highest possible availability within a very tight budget.
However, the proposal should be flexible enough that it can be easily scaled up later.
0. Basic Idea:
The core intent of this proposal is a decentralized, community-led operation of the Babylon Gateway infrastructure.
Instead of a single entity managing everything, the idea is to recruit experienced community members (such as validator runners, or other members who already have or have had gateway servers running, or others who possess the necessary expertise) to each maintain an individual gateway server. These servers would be integrated into pools, routing requests based on geographical proximity to ensure low latency and high resilience.
Important Note: This RFC is a conceptual framework for a community takeover. It is not a personal service application, and I am not proposing to operate the entire infrastructure myself. Instead, the suggested compensation would be distributed among the participating community operators.
The following sections outline the technical details of this proposed implementation.
1. Technical Setup:
- Domain Management: Hosted via Cloudflare
- Traffic Management: Cloudflare Pools with Geo-Steering
- Three pool groups: Europe, Americas, Asia (can be expanded later if necessary)
- Each pool group consists of a blue pool and a green pool (for updates, see “High Availability” in [1])
- A total of eight pools will be created (Europe-blue, Europe-green, Americas-blue, Americas-green, Asia-blue, Asia-green, Stokenet-blue, Stokenet-green)
- Connectivity (Gateway server to pool): Implemented via Cloudflare Tunnels
- Health Checks: A custom script on each server will be queried by Cloudflare. The script reports “OK” only when the database is fully up to date.
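The health-check behavior above could be sketched as follows. This is a minimal illustration, not a finished implementation: the endpoint paths (`/status` on localhost, a reference Gateway at `example-gateway`), the response field names, and the lag threshold are all assumptions and would need to match the actual Gateway API.

```python
"""Health-check sketch: answer Cloudflare's probe with "OK" only when the
local Gateway database is close enough to the network's ledger tip.
All endpoint URLs and field names here are illustrative assumptions."""
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_LAG = 5  # acceptable state-version lag before the server is pulled from the pool

def fetch_state_version(url: str) -> int:
    # Assumes a status endpoint reporting the latest committed ledger state version.
    with urllib.request.urlopen(url, timeout=3) as resp:
        body = json.load(resp)
    return body["ledger_state"]["state_version"]

def is_healthy(local_version: int, network_version: int, max_lag: int = MAX_LAG) -> bool:
    # Healthy only if the local database has (nearly) caught up to the network tip.
    return network_version - local_version <= max_lag

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            local = fetch_state_version("http://localhost:8080/status")       # hypothetical
            network = fetch_state_version("https://example-gateway/status")   # hypothetical
            ok = is_healthy(local, network)
        except Exception:
            ok = False  # any error counts as unhealthy; Cloudflare drains the server
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"OK" if ok else b"LAGGING")
```

Cloudflare's health check would then simply probe this endpoint over the tunnel; a 503/“LAGGING” response removes the server from pool rotation until it catches up.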
2. Security & Redundancy:
Security: Since all traffic runs through Cloudflare, server IP addresses remain private (via Cloudflare Tunnels), benefiting from Cloudflare’s native DDoS protection and security layers.
Metrics (“Monitoring” in [1]) can be collected via push-based mechanisms (e.g. Prometheus Pushgateway or Grafana Agent), avoiding the need to expose inbound ports on Gateway servers. The collected metrics can then be displayed on a public page.
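The push-based approach could look roughly like this, using only the standard library. The Pushgateway URL, job name, and metric names are illustrative assumptions; the body uses the Prometheus text exposition format, which the Pushgateway accepts at /metrics/job/&lt;job&gt;.

```python
"""Push-metrics sketch: a Gateway server periodically pushes metrics to a
Prometheus Pushgateway instead of exposing an inbound /metrics port.
URL, job, and metric names below are illustrative assumptions."""
import urllib.request

def render_metrics(metrics: dict) -> str:
    # Prometheus text exposition format: one "name value" line per metric.
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

def push_metrics(pushgateway_url: str, job: str, metrics: dict) -> None:
    # The Pushgateway accepts the exposition-format body via PUT at /metrics/job/<job>.
    body = render_metrics(metrics).encode()
    req = urllib.request.Request(
        f"{pushgateway_url}/metrics/job/{job}",
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )
    urllib.request.urlopen(req, timeout=5)

# Example payload an operator might push every scrape interval (hypothetical names):
sample = {"gateway_ledger_state_version": 123456789, "gateway_db_lag_seconds": 0.4}
```

Since the push originates from the server, no inbound firewall hole is needed, which fits the Cloudflare-Tunnel-only connectivity model above.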
Rate Limiting can be enabled in Cloudflare and set to a value that allows unrestricted wallet operation. In the current Gateway configuration [1], the rate limit is 1,550 requests per minute. This value is relatively high and could be reduced to match the wallet’s actual load behavior. Since multiple wallets may share a single IP address (e.g. behind carrier or office NAT), any reduction must be handled carefully.
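To make the shared-IP concern concrete, a quick back-of-the-envelope calculation of the per-wallet budget under the current limit:

```python
# Per-wallet request budget when several wallets share one public IP.
RATE_LIMIT_PER_MIN = 1550  # current Gateway rate limit per [1]

def per_wallet_budget(wallets_behind_nat: int) -> float:
    # Requests per minute available to each wallet sharing the same IP.
    return RATE_LIMIT_PER_MIN / wallets_behind_nat

# e.g. 10 wallets behind one NAT: 155 requests/minute each (~2.6 req/s),
# which is why the limit should not be lowered aggressively.
```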
Redundancy: Cloudflare is the only Single Point of Failure (SPOF).
Historically, Cloudflare has had very high uptime, so the suggestion is to leave this as a known SPOF.
Cloudflare downtime is considered acceptable within the >99.9% SLA target.
Best-effort manual disaster bypass:
If Cloudflare goes down, servers could be temporarily opened for direct access and closed again afterward.
3. Updates:
In each pool group, either the blue or the green pool is active at any given time.
The following describes a cost-efficient variant that requires no new servers: when an update is due, individual servers are removed from the active pool and updated. After the update, each server connects to the inactive pool. Once enough servers have migrated, the inactive pool is promoted to active, and the remaining servers are updated in turn.
Only one server is planned for Stokenet access. For its updates, a new server would be provisioned and synchronized, after which the old one can be terminated.
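The rotation logic above can be modeled in a few lines. This is a conceptual sketch only: the pool/server model is hypothetical, and the actual pool membership changes would be made through Cloudflare's API rather than in-memory sets.

```python
"""Blue/green rotation sketch for one pool group (hypothetical model;
the real pool updates would go through Cloudflare's API)."""
from dataclasses import dataclass, field

@dataclass
class PoolGroup:
    active: str = "blue"
    pools: dict = field(default_factory=lambda: {"blue": set(), "green": set()})

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def update_server(self, server: str) -> None:
        # Remove from the active pool, update, re-attach to the inactive pool.
        self.pools[self.active].discard(server)
        # ... run the actual Gateway update on `server` here ...
        self.pools[self.inactive].add(server)

    def maybe_promote(self, quorum: int) -> bool:
        # Once enough servers have migrated, the inactive pool becomes active.
        if len(self.pools[self.inactive]) >= quorum:
            self.active = self.inactive
            return True
        return False

# Example walk-through for a two-server group:
europe = PoolGroup(pools={"blue": {"srv1", "srv2"}, "green": set()})
europe.update_server("srv1")          # srv1 drained, updated, joins green
europe.maybe_promote(quorum=1)        # green is now active; srv2 updates next
```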
This is a minimal proposal due to the current situation. If development gains momentum, additional server groups can be added and updates optimized. The basic concept does not need to be adapted for this.
4. Pending Transactions Observability:
From [1]:
“When a transaction is submitted to a node via a Gateway (and has not yet been committed),
its status can be read only from the same Gateway it was submitted to, as the information is
available only in that Gateway’s database. If the request is routed to another Gateway, the
status will not be available.
A similar limitation applies when a Gateway attempts to read the status of a transaction from
a different node than the one it was submitted to, before the transaction is committed. Other
nodes are not aware of the transaction and therefore cannot report its status.”
To ensure that a pending transaction is always queried from the same server, session affinity (cookie-based, or Cloudflare’s ip_cookie variant) is enabled at the load balancer. During connection setup, a cookie is sent to the client and used to route subsequent requests (see Session affinity · Cloudflare Load Balancing docs).
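A sketch of what the corresponding load balancer configuration might look like. The field names follow Cloudflare's load balancer API as I understand it, but should be verified against the current docs; the pool IDs and hostname are placeholders.

```python
# Sketch of a Cloudflare load balancer config combining geo-steering with
# session affinity (field names per Cloudflare's LB API — verify before use;
# pool IDs and hostname are placeholders).
load_balancer = {
    "name": "gateway.example.org",
    "steering_policy": "geo",        # route by geographical proximity
    "session_affinity": "cookie",    # pin a client to the origin it first hit
    "session_affinity_ttl": 1800,    # seconds a pending tx remains routable
    "default_pools": ["europe-blue-id", "americas-blue-id", "asia-blue-id"],
    "fallback_pool": "europe-blue-id",
}
```

The affinity TTL would need to comfortably exceed the time a submitted transaction can remain pending, so that status polls keep landing on the submitting server.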
To address the second limitation (reading the status of a pending transaction from a different node), the node, database server, and API server could be combined on a single powerful server (see section 6).
5. Operations:
The setup is a one-time configuration. Anyone wishing to operate a Gateway server can contact the Cloudflare account administrator (suggested: RAC).
After successful verification, they will receive credentials to connect their server to the pools.
Currently (assuming Radixtools queries can be blocked), a single server can connect to multiple pools. If query volume and network usage increase, servers can be dedicated to specific pools.
6. Server Architecture:
In the Annex of [1], an architecture consisting of one writer DB server, three read replicas, five load balancers, and seven Gateway instances is proposed.
The load balancers could be omitted in the architecture proposed above and replaced by Cloudflare.
Whether it makes sense to bundle the database servers, node, and API on stronger servers
should be evaluated and discussed.
With bundling, two servers per pool group and one server for Stokenet (seven servers in total) could be sufficient. This architecture prioritizes cost and operational simplicity over maximum scalability and is intended as a temporary solution.
If our (financial) situation improves, DB and API layers can be separated without redesign.
7. Financials & Compensation:
Cloudflare Costs: Business Plan (for small and medium-sized businesses; includes a contractual SLA promising 100% uptime for the service): $200/month (covered by RAC)
Estimated traffic, etc.: $50
Total: $250/month
Server Costs:
Option a: Servers could be rented by RAC, and community members who volunteer to operate them would be compensated for their effort (“Adopt a Server”).
Option b: Community members organize and operate the servers themselves, connect them to the pools, and are reimbursed for server costs as part of their compensation.
Option a) allows RAC to control distribution and providers, option b) enables a more open architecture.
Server Specs:
A server capable of handling a 2 TB database plus a data-generating node is estimated at $500–$600 per month.
Examples:
Hetzner (AMD EPYC™ 9454P, 2 × 3.84 TB NVMe, 256 GB RAM): ~€320
OVH similar configuration: ~€450
Admin Compensation:
Intended as a budget rate for administrative tasks rather than a full commercial salary,
reflecting the current tight financial situation.
Total Budget:
Starting with seven operators (including one Stokenet server), the total monthly cost could be kept under $6,000, provided server rental plus compensation stays below $800 per gateway.
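The budget target checks out arithmetically, using only the figures already given in this section:

```python
# Budget check for the seven-operator starting configuration.
CLOUDFLARE = 200 + 50        # Business plan + estimated traffic, $/month
PER_SERVER_BUDGET = 800      # server rental + operator compensation, $/month
OPERATORS = 7                # six mainnet servers + one Stokenet server

total = CLOUDFLARE + OPERATORS * PER_SERVER_BUDGET
# 250 + 7 * 800 = 5,850 $/month, i.e. under the $6,000 target
```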
Compensation would be subject to performance and uptime (with deductions for unannounced downtimes).
Server costs (server rental, compensation for operators) account for the largest share of the concept’s costs. It may be possible to further reduce costs by using alternative providers to Cloudflare (e.g., Bunny CDN https://bunny.net).