API Rate Limits When Working With Neural Networks — and How to Handle Them
Working with neural networks is more relevant today than ever. However, if you’re using a remote service to interact with AI agents (for example, the OpenAI API), you’ll inevitably run into several limitations, such as:
- maximum request size
- maximum number of requests per time unit
- rate limits for throughput and concurrency
As we know, API calls to remote services always come with limits. And when you’re processing large datasets — especially inside loops — these constraints quickly become a problem.
Why “just send everything at once” doesn’t work
You might say:
“Why not combine multiple requests into one and then parse the result?”
In practice, this doesn’t always work. OpenAI models (and many others) have a peculiar trait: when working with lists or arrays, they often silently drop or skip one or more items.
That’s why the most reliable way is to process each array element individually. It may be slower, but it guarantees accuracy and consistency of the output.
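For illustration, here is a minimal sketch of that per-item approach. The translateItem() helper below is hypothetical and simply stands in for a single API call:
// Hypothetical helper that sends one item to the API and returns its translation
declare function translateItem(item: string): Promise<string>;

async function translateAll(items: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const item of items) {
    // One request per element: slower, but nothing gets silently dropped
    results.push(await translateItem(item));
  }
  return results;
}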
What rate limiting is — and why it matters
For these situations, we have a mechanism called rate limiting — a way to restrict how frequently you can send requests. It helps prevent service overload and ensures fair resource distribution.
If you ignore rate limits, you’ll quickly encounter errors like:
Error: 429 Too Many Requests
or even temporary API key suspension.
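For reference, the OpenAI Node.js SDK surfaces this as an APIError whose status property carries the HTTP code, so a 429 can be detected explicitly. A rough sketch (the exact error shape may vary between SDK versions):
import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment
const client = new OpenAI();

async function tryCompletion(prompt: string) {
  try {
    return await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (error) {
    // The SDK exposes the HTTP status on APIError instances
    if (error instanceof OpenAI.APIError && error.status === 429) {
      console.warn("Rate limited: back off and retry later");
      return null;
    }
    throw error;
  }
}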
My case: translating large datasets

In one of my projects, I needed to translate large volumes of text into Russian using the OpenAI API. At first, I just looped through everything and fired off requests — but I soon hit the limits.
The solution was simple: add a request-rate controller. Here’s a minimal example of a client that respects request limits per minute.
This approach allows you to smoothly distribute the load and stay within API limits — even when processing thousands of items.
// translationClient.ts
import Config from "@app/config";
import Bottleneck from "bottleneck";
import { createRequire } from "node:module";
import { z } from "zod";
import zodToJsonSchema, { JsonSchema7Type } from "zod-to-json-schema";
const TranslationSchema = z.object({
genres: z.array(z.string()).describe("Genres"),
features: z.array(z.string()).describe("Features"),
descriptionFull: z.string().describe("Full Description"),
supportedLanguagesText: z
.array(z.string())
.describe("Supported languages text"),
supportedLanguagesVoice: z
.array(z.string())
.describe("Supported languages voice"),
});
export type TranslationInput = z.infer<typeof TranslationSchema>;
const TranslationJsonSchema = zodToJsonSchema(TranslationSchema, {
$refStrategy: "none",
}) as JsonSchema7Type;
type JsonSchemaObject = JsonSchema7Type & {
properties?: Record<string, unknown>;
required?: string[];
};
// OpenAI's strict structured-output mode expects every property to be listed as required
const schemaObject = TranslationJsonSchema as JsonSchemaObject;
if (schemaObject && schemaObject.properties) {
  schemaObject.required = Object.keys(schemaObject.properties);
}
// Load the OpenAI SDK at runtime via CommonJS require
const require = createRequire(import.meta.url);
const openAiModuleName = "openai";
const { default: OpenAI } = require(openAiModuleName) as { default: any };
const client = new OpenAI({ apiKey: Config.openAiKey });
export const openAiLimiter = new Bottleneck({
// Minimum spacing between calls so we stay under the rate limit
// For 500 requests per minute -> ~8.33 per second -> ~120 ms interval
minTime: 120,
// Use the reservoir to cap total calls per window
reservoir: 500, // maximum calls per window
reservoirRefreshInterval: 60 * 1000, // refresh the allowance every minute
reservoirRefreshAmount: 500
});
// Schedule a call through the limiter and retry 429 errors with exponential backoff
const limitCall = async <TRes>(
  call: () => Promise<TRes>,
  maxRetries = 5,
): Promise<TRes> => {
  let attempt = 0;
  let backoff = 500;
  while (true) {
    try {
      return await openAiLimiter.schedule(() => call());
    } catch (error: any) {
      attempt++;
      if (attempt > maxRetries) {
        throw error;
      }
      if (error?.status === 429) {
        // Wait, then double the delay before the next attempt
        await new Promise((res) => setTimeout(res, backoff));
        backoff *= 2;
      } else {
        throw error;
      }
    }
  }
};
export async function getGameTranslation(
lang: string,
input: TranslationInput,
) {
return await limitCall<TranslationInput>(async () => {
const completion = await client.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: `You are a professional translator. Preserve meaning and tone. Return ONLY valid JSON`,
},
{
role: "user",
content: [
`Translate this object to ${lang} language: ${JSON.stringify(input)}`,
"Keep formatting where sensible",
"If there are names or bock codes, do not translate them",
"Return every property from the original object even if it should be empty.",
"Input",
].join("\n\n"),
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "Translation",
schema: TranslationJsonSchema,
strict: true,
},
},
});
const raw = completion.choices[0]?.message?.content ?? "{}";
const translation = TranslationSchema.parse(JSON.parse(raw));
return translation;
})
}
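With the limiter in place, the calling code can stay simple: all calls are scheduled up front, and Bottleneck spaces out the actual requests so the per-minute cap is never exceeded. A rough usage sketch, where loadGames() is a hypothetical loader for the records to translate:
// Usage sketch: loadGames() is a hypothetical data loader
import { getGameTranslation, type TranslationInput } from "./translationClient";

declare function loadGames(): Promise<TranslationInput[]>;

const games: TranslationInput[] = await loadGames();

// Every call goes through openAiLimiter inside getGameTranslation,
// so all of them can be scheduled at once without breaching the limit.
const translated = await Promise.all(
  games.map((game) => getGameTranslation("Russian", game)),
);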
Improved version with Langfuse monitoring
import Config from "@app/config";
import Bottleneck from "bottleneck";
import { OpenAI } from "openai/client";
import { z } from "zod";
import zodToJsonSchema, { JsonSchema7Type } from "zod-to-json-schema";
import { observeOpenAI } from "@langfuse/openai";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";
const TranslationSchema = z.object({
genres: z.array(z.string()).describe("Genres"),
features: z.array(z.string()).describe("Features"),
descriptionFull: z.string().describe("Full Description"),
supportedLanguagesText: z
.array(z.string())
.describe("Supported languages text"),
supportedLanguagesVoice: z
.array(z.string())
.describe("Supported languages voice"),
});
export type TranslationInput = z.infer<typeof TranslationSchema>;
const TranslationJsonSchema = zodToJsonSchema(TranslationSchema, {
$refStrategy: "none",
}) as JsonSchema7Type;
type JsonSchemaObject = JsonSchema7Type & {
properties?: Record<string, unknown>;
required?: string[];
};
// OpenAI's strict structured-output mode expects every property to be listed as required
const schemaObject = TranslationJsonSchema as JsonSchemaObject;
if (schemaObject && schemaObject.properties) {
  schemaObject.required = Object.keys(schemaObject.properties);
}
// Initialize OpenTelemetry with a Langfuse span processor so every OpenAI call is traced
const sdk = new NodeSDK({
spanProcessors: [
new LangfuseSpanProcessor({
publicKey: Config.langfusePublicKey,
secretKey: Config.langfuseSecretKey,
baseUrl: Config.langfuseHost,
}),
],
});
sdk.start();
// Wrap the OpenAI client with Langfuse tracing
const tracedOpenAI = observeOpenAI(new OpenAI({ apiKey: Config.openAiKey }), {
// Configure trace-level attributes for all API calls
traceName: "my-openai-trace", // Name for the trace
sessionId: "user-session-123", // Track user session
userId: "user-abc", // Track user identity
tags: ["openai-integration"], // Add searchable tags
});
export const openAiLimiter = new Bottleneck({
// Minimum spacing between calls so we stay under the rate limit
// For 500 requests per minute -> ~8.33 per second -> ~120 ms interval
minTime: 120,
// Use the reservoir to cap total calls per window
reservoir: 500, // maximum calls per window
reservoirRefreshInterval: 60 * 1000, // refresh the allowance every minute
reservoirRefreshAmount: 500,
});
// Schedule a call through the limiter and retry 429 errors with exponential backoff
const limitCall = async <TRes>(
  call: () => Promise<TRes>,
  maxRetries = 5,
): Promise<TRes> => {
  let attempt = 0;
  let backoff = 500;
  while (true) {
    try {
      return await openAiLimiter.schedule(() => call());
    } catch (error: any) {
      attempt++;
      if (attempt > maxRetries) {
        throw error;
      }
      if (error?.status === 429) {
        // Wait, then double the delay before the next attempt
        await new Promise((res) => setTimeout(res, backoff));
        backoff *= 2;
      } else {
        throw error;
      }
    }
  }
};
export async function getGameTranslation(
lang: string,
input: TranslationInput,
) {
return await limitCall<TranslationInput>(async () => {
const completion = await tracedOpenAI.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: `You are a professional translator. Preserve meaning and tone. Return ONLY valid JSON`,
},
{
role: "user",
content: [
`Translate this object to ${lang} language: ${JSON.stringify(input)}`,
"Keep formatting where sensible",
"If there are names or bock codes, do not translate them",
"Return every property from the original object even if it should be empty.",
"Input",
].join("\n\n"),
},
],
response_format: {
type: "json_schema",
json_schema: {
name: "Translation",
schema: TranslationJsonSchema,
strict: true,
},
},
});
const raw = completion.choices[0]?.message?.content ?? "{}";
const translation = TranslationSchema.parse(JSON.parse(raw));
return translation;
});
}
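One practical note: OpenTelemetry exports spans asynchronously, so in short-lived scripts it is worth shutting the SDK down once the batch is done; otherwise the final traces may never reach Langfuse. A minimal sketch, assuming the sdk instance above is accessible at the call site:
// Flush any buffered spans to Langfuse before the process exits
await sdk.shutdown();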
Share your experience
What approaches do you use? Share in the comments how you handle rate limits, process request queues, or distribute API load across services — I’d love to learn from your methods.