API Rate Limits When Working With Neural Networks — and How to Handle Them
Working with neural networks is more relevant today than ever. However, if you’re using a remote service to interact with AI agents (for example, the OpenAI API), you’ll inevitably run into several limitations, such as:
- maximum request size
- maximum number of requests per time unit
- rate limits for throughput and concurrency
As we know, API calls to remote services always come with limits. And when you’re processing large datasets — especially inside loops — these constraints quickly become a problem.
Why “just send everything at once” doesn’t work
You might say:
“Why not combine multiple requests into one and then parse the result?”
In practice, this doesn’t always work. OpenAI models (and many others) have a peculiar trait: when working with lists or arrays, they often silently drop or skip one or more items.
That’s why the most reliable way is to process each array element individually. It may be slower, but it guarantees accuracy and consistency of the output.
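For illustration, here is a minimal sketch of that per-item approach. The translateItem() helper below is hypothetical and simply stands in for a single API call:
// Hypothetical helper that sends one item to the API and returns its translation
declare function translateItem(item: string): Promise<string>;

async function translateAll(items: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const item of items) {
    // One request per element: slower, but nothing gets silently dropped
    results.push(await translateItem(item));
  }
  return results;
}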
What rate limiting is — and why it matters
For these situations, we have a mechanism called rate limiting — a way to restrict how frequently you can send requests. It helps prevent service overload and ensures fair resource distribution.
If you ignore rate limits, you’ll quickly encounter errors like:
Error: 429 Too Many Requests
or even temporary API key suspension.
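For reference, the OpenAI Node.js SDK surfaces this as an APIError whose status property carries the HTTP code, so a 429 can be detected explicitly. A rough sketch (the exact error shape may vary between SDK versions):
import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment
const client = new OpenAI();

async function tryCompletion(prompt: string) {
  try {
    return await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (error) {
    // The SDK exposes the HTTP status on APIError instances
    if (error instanceof OpenAI.APIError && error.status === 429) {
      console.warn("Rate limited: back off and retry later");
      return null;
    }
    throw error;
  }
}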
My case: translating large datasets

In one of my projects, I needed to translate large volumes of text into Russian using the OpenAI API. At first, I just looped through everything and fired off requests — but I soon hit the limits.
The solution was simple: add a request-rate controller. Here’s a minimal example of a client that respects request limits per minute.
This approach allows you to smoothly distribute the load and stay within API limits — even when processing thousands of items.
// translationClient.ts
import Config from "@app/config";
import Bottleneck from "bottleneck";
import { createRequire } from "node:module";
import { z } from "zod";
import zodToJsonSchema, { JsonSchema7Type } from "zod-to-json-schema";
const TranslationSchema = z.object({
genres: z.array(z.string()).describe("Genres"),
features: z.array(z.string()).describe("Features"),
descriptionFull: z.string().describe("Full Description"),
supportedLanguagesText: z
.array(z.string())
.describe("Supported languages text"),
supportedLanguagesVoice: z
.array(z.string())
.describe("Supported languages voice"),
});
export type TranslationInput = z.infer<typeof TranslationSchema>;
const TranslationJsonSchema = zodToJsonSchema(TranslationSchema, {
$refStrategy: "none",
}) as JsonSchema7Type;
type JsonSchemaObject = JsonSchema7Type & {
properties?: Record<string, unknown>;
required?: string[];
};
// OpenAI's strict structured-output mode expects every property to be listed as required
const schemaObject = TranslationJsonSchema as JsonSchemaObject;
if (schemaObject && schemaObject.properties) {
  schemaObject.required = Object.keys(schemaObject.properties);
}
// Load the OpenAI SDK at runtime via CommonJS require
const require = createRequire(import.meta.url);
const openAiModuleName = "openai";
const { default: OpenAI } = require(openAiModuleName) as { default: any };
const client = new OpenAI({ apiKey: Config.openAiKey });
export const openAiLimiter = new Bottleneck({
// Minimum spacing between calls so we stay under the rate limit
// For 500 requests per minute -> ~8.33 per second -> ~120 ms interval
minTime: 120,
// Use the reservoir to cap total calls per window
reservoir: 500, // maximum calls per window
reservoirRefreshInterval: 60 * 1000, // refresh the allowance every minute
reservoirRefreshAmount: 500
});
// Schedule a call through the limiter and retry 429 errors with exponential backoff
const limitCall = async <TRes>(
  call: () => Promise<TRes>,
  maxRetries = 5,
): Promise<TRes> => {
  let attempt = 0;
  let backoff = 500;
  while (true) {
    try {
      return await openAiLimiter.schedule(() => call());
    } catch (error: any) {
      attempt++;
      if (attempt > maxRetries) {
        throw error;
      }
      if (error?.status === 429) {
        // Wait, then double the delay before the next attempt
        await new Promise((res) => setTimeout(res, backoff));
        backoff *= 2;
      } else {
        throw error;
      }
    }
  }
};
export async function getGameTranslation(
lang: string,
input: TranslationInput,
) {
return await limitCall<TranslationInput>(async () => {
const completion = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: `You are a professional translator. Preserve meaning and tone. Return ONLY valid JSON`,
},
{
role: "user",
content: [
`Translate this object to ${lang} language: ${JSON.stringify(input)}`,
"Keep formatting where sensible",
"If there are names or bock codes, do not translate them",
"Return every property from the original object even if it should be empty.",
"Input",
].join("\n\n"),
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "Translation",
schema: TranslationJsonSchema,
strict: true,
},
},
});
const raw = completion.choices[0]?.message?.content ?? "{}";
const translation = TranslationSchema.parse(JSON.parse(raw));
return translation;
})
}
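With the limiter in place, the calling code can stay simple: all calls are scheduled up front, and Bottleneck spaces out the actual requests so the per-minute cap is never exceeded. A rough usage sketch, where loadGames() is a hypothetical loader for the records to translate:
// Usage sketch: loadGames() is a hypothetical data loader
import { getGameTranslation, type TranslationInput } from "./translationClient";

declare function loadGames(): Promise<TranslationInput[]>;

const games: TranslationInput[] = await loadGames();

// Every call goes through openAiLimiter inside getGameTranslation,
// so all of them can be scheduled at once without breaching the limit.
const translated = await Promise.all(
  games.map((game) => getGameTranslation("Russian", game)),
);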
Improved version with Langfuse monitoring
import Config from "@app/config";
import Bottleneck from "bottleneck";
import { OpenAI } from "openai/client";
import { z } from "zod";
import zodToJsonSchema, { JsonSchema7Type } from "zod-to-json-schema";
import { observeOpenAI } from "@langfuse/openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";
const TranslationSchema = z.object({
genres: z.array(z.string()).describe("Genres"),
features: z.array(z.string()).describe("Features"),
descriptionFull: z.string().describe("Full Description"),
supportedLanguagesText: z
.array(z.string())
.describe("Supported languages text"),
supportedLanguagesVoice: z
.array(z.string())
.describe("Supported languages voice"),
});
export type TranslationInput = z.infer<typeof TranslationSchema>;
const TranslationJsonSchema = zodToJsonSchema(TranslationSchema, {
$refStrategy: "none",
}) as JsonSchema7Type;
type JsonSchemaObject = JsonSchema7Type & {
properties?: Record<string, unknown>;
required?: string[];
};
// OpenAI's strict structured-output mode expects every property to be listed as required
const schemaObject = TranslationJsonSchema as JsonSchemaObject;
if (schemaObject && schemaObject.properties) {
  schemaObject.required = Object.keys(schemaObject.properties);
}
// Initialize OpenTelemetry with a Langfuse span processor so every OpenAI call is traced
const sdk = new NodeSDK({
spanProcessors: [
new LangfuseSpanProcessor({
publicKey: Config.langfusePublicKey,
secretKey: Config.langfuseSecretKey,
baseUrl: Config.langfuseHost,
}),
],
});
sdk.start();
// Wrap the OpenAI client with Langfuse tracing
const tracedOpenAI = observeOpenAI(new OpenAI({ apiKey: Config.openAiKey }), {
// Configure trace-level attributes for all API calls
traceName: "my-openai-trace", // Name for the trace
sessionId: "user-session-123", // Track user session
userId: "user-abc", // Track user identity
tags: ["openai-integration"], // Add searchable tags
});
export const openAiLimiter = new Bottleneck({
// Minimum spacing between calls so we stay under the rate limit
// For 500 requests per minute -> ~8.33 per second -> ~120 ms interval
minTime: 120,
// Use the reservoir to cap total calls per window
reservoir: 500, // maximum calls per window
reservoirRefreshInterval: 60 * 1000, // refresh the allowance every minute
reservoirRefreshAmount: 500,
});
// Schedule a call through the limiter and retry 429 errors with exponential backoff
const limitCall = async <TRes>(
  call: () => Promise<TRes>,
  maxRetries = 5,
): Promise<TRes> => {
  let attempt = 0;
  let backoff = 500;
  while (true) {
    try {
      return await openAiLimiter.schedule(() => call());
    } catch (error: any) {
      attempt++;
      if (attempt > maxRetries) {
        throw error;
      }
      if (error?.status === 429) {
        // Wait, then double the delay before the next attempt
        await new Promise((res) => setTimeout(res, backoff));
        backoff *= 2;
      } else {
        throw error;
      }
    }
  }
};
export async function getGameTranslation(
lang: string,
input: TranslationInput,
) {
return await limitCall<TranslationInput>(async () => {
const completion = await tracedOpenAI.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: `You are a professional translator. Preserve meaning and tone. Return ONLY valid JSON`,
},
{
role: "user",
content: [
`Translate this object to ${lang} language: ${JSON.stringify(input)}`,
"Keep formatting where sensible",
"If there are names or bock codes, do not translate them",
"Return every property from the original object even if it should be empty.",
"Input",
].join("\n\n"),
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "Translation",
schema: TranslationJsonSchema,
strict: true,
},
},
});
const raw = completion.choices[0]?.message?.content ?? "{}";
const translation = TranslationSchema.parse(JSON.parse(raw));
return translation;
});
}
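One practical note: OpenTelemetry exports spans asynchronously, so in short-lived scripts it is worth shutting the SDK down once the batch is done; otherwise the final traces may never reach Langfuse. A minimal sketch, assuming the sdk instance above is accessible at the call site:
// Flush any buffered spans to Langfuse before the process exits
await sdk.shutdown();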
Share your experience
What approaches do you use? Share in the comments how you handle rate limits, process request queues, or distribute API load across services — I’d love to learn from your methods.