Limiting Text Length in OpenAI: Word Count for Video Generation
While developing my video content generation pipeline, I needed to control the maximum length of generated text so that the final video wouldn’t end up too long.
What are the options? One is to cap the word count of the generated text. If your starting point is the desired video length in minutes, the conversion is: number of words = video length in minutes × average reading speed (words per minute). For English narration, the average reading speed is roughly 100–120 words per minute.
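For example (a minimal sketch; the helper name and the 110 words-per-minute default are my own illustration, not part of the pipeline):

```python
def words_for_video_length(minutes: float, words_per_minute: int = 110) -> int:
    """Approximate word budget for a video of the given length."""
    return int(minutes * words_per_minute)


# A 3-minute video at ~110 words per minute allows roughly 330 words.
print(words_for_video_length(3))  # 330
```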
So, what about interacting with the neural network itself? There are two main approaches:
Approach one: explicitly state in the prompt that the generated text must not exceed a certain length (a word or character count). In my experience this works only moderately well, and I wouldn't rely on it alone. It also helps to state an acceptable margin of error so the model doesn't get confused.
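As a rough illustration of approach one (the wording and numbers below are mine, not a canonical prompt):

```python
# Hypothetical prompt for approach one: state the limit and an explicit margin of error.
word_limit = 330
prompt = (
    "Write a voice-over script for a short video. "
    f"Keep it under {word_limit} words; exceeding the limit by up to 10% is acceptable, "
    "but never more than that."
)
```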
Approach two: set the max_tokens parameter (max_completion_tokens for newer models), which hard-limits the model's output. Along with it, I recommend passing temperature in the range of 0.5–0.8, so the model stays specific in its responses and less verbose.
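Here is a minimal sketch of approach two using the official openai client (the model name and prompt are placeholders; depending on the model and library version, the limit parameter is max_tokens or max_completion_tokens):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short voice-over script about the ocean."}],
    max_completion_tokens=450,  # hard cap on the generated output
    temperature=0.6,            # keeps the answers specific and less verbose
)
print(response.choices[0].message.content)
```

In my pipeline these parameters are derived from the desired word count, as shown in the code below.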
```python
import re

from src.types import LanguageType


class TextUtils:
    END_SENTENCE_CHARS = (".", "!", "?", ",", ";", "。", "\n")

    # Approximate tokens-per-word ratios by language.
    RATIO = {
        "en": 1.4,
        "ru": 1.6,
        "es": 1.6,
        "de": 1.6,
        "zh": 2.0,
        "ja": 2.0,
        "ko": 2.0,
    }

    @staticmethod
    def get_tokens_per_word(language: LanguageType) -> float:
        value = TextUtils.RATIO.get(language)
        if not value:
            raise ValueError(f"Unsupported language: {language}")
        return value

    @staticmethod
    def split_by_tokens(text: str, max_tokens: int, overlap_tokens: int = 0):
        """
        Split text into chunks of at most max_tokens tokens, with an overlap
        of overlap_tokens between consecutive chunks. Uses tiktoken (cl100k_base).
        """
        if max_tokens <= 0:
            raise ValueError("max_tokens must be > 0")
        if overlap_tokens < 0:
            raise ValueError("overlap_tokens must be >= 0")

        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)

        chunks = []
        start = 0
        n = len(tokens)
        while start < n:
            end = min(start + max_tokens, n)
            chunk_tokens = tokens[start:end]
            chunks.append(enc.decode(chunk_tokens))
            if end >= n:
                break
            start = end - overlap_tokens if overlap_tokens > 0 else end
        return chunks

    @staticmethod
    def group_text_into_sentences(
        text: str, max_words_length: int | None = 20
    ) -> list[list[dict]]:
        new_sent_split_pattern = "|".join(map(re.escape, TextUtils.END_SENTENCE_CHARS))
        raw = [w.strip() for w in re.split(new_sent_split_pattern, text) if w.strip()]
        replacements = {"#": "", "---": ""}
        raw = [{"word": TextUtils.replace(w, replacements)} for w in raw if w.strip()]
        return TextUtils.group_words_dict_into_sentences(raw, max_words_length)

    @staticmethod
    def replace(value: str, replacements: dict) -> str:
        for old, new in replacements.items():
            value = value.replace(old, new)
        return value

    @staticmethod
    def group_words_dict_into_sentences(
        words: list[dict],
        max_words_length: int | None = 20,
    ) -> list[list[dict]]:
        sentences = []
        current_sentence = []
        for word in words:
            current_sentence.append(word)
            has_end_char = (
                word.get("word", "").strip().endswith(TextUtils.END_SENTENCE_CHARS)
            )
            current_sentence_words = "".join(
                [w.get("word", "") for w in current_sentence if w.get("word")]
            )
            is_length_exceeded = (
                max_words_length and len(current_sentence_words) >= max_words_length
            )
            should_split = has_end_char or is_length_exceeded
            # Start a new sentence at end-of-sentence punctuation
            # or once the accumulated text reaches max_words_length characters.
            if should_split:
                sentences.append(current_sentence)
                current_sentence = []
        # Add remaining words if any
        if current_sentence:
            sentences.append(current_sentence)
        return sentences
```
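To make the intended usage concrete, here is a small sketch of how these helpers are called (the sample text is mine):

```python
text = "First sentence. Second one! And a third, slightly longer sentence?"

chunks = TextUtils.split_by_tokens(text, max_tokens=8, overlap_tokens=2)
# -> list of string chunks, each at most 8 tokens long, overlapping by 2 tokens

sentences = TextUtils.group_text_into_sentences(text, max_words_length=20)
# -> list of groups, each a list of {"word": ...} dicts, with a group closed
#    once its accumulated text reaches roughly 20 characters
```

The next snippet uses the tokens-per-word ratios to derive the token limit that is passed to the model: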
```python
from src.text.text_utils import TextUtils
from src.types import LanguageType


class ModelApiConfig:
    @staticmethod
    def get_tokens_limit_for_words(
        words_amount: int, lang: LanguageType, temperature: float = 0.5
    ):
        # Tokens per word for the language, multiplied by a 1.3x safety margin.
        tokens_to_generate = int(
            TextUtils.get_tokens_per_word(lang) * words_amount * 1.3
        )
        return {
            "max_completion_tokens": tokens_to_generate,
            # "max_tokens": tokens_to_generate,
            "temperature": temperature,
        }


# Agent example (a method on the pipeline class) that applies those parameters.
def get_agent(self, words_to_generate: int = 300, language: LanguageType = "en"):
    return AtomicAgent[
        VisualSubtitlesMakerInputSchema,
        VisualSubtitlesMakerOutputSchema,
    ](
        config=AgentConfig(
            client=instructor.from_openai(self.client, mode=instructor.Mode.JSON),
            model=Config.gpt_model(),
            model_api_parameters=ModelApiConfig.get_tokens_limit_for_words(
                words_to_generate, language
            ),
            system_prompt_generator=SystemPromptGenerator(
                background=[
                    """You are the director of a cinematic film.""",
                    """Based on the story told in the subtitles, you must create atmospheric and visually diverse scenes.""",
                ],
                steps=[
                    """For each scene, produce a single visual description:
                    - The key action/object of the scene
                    - The setting (place, atmosphere, time)
                    - The emotions/mood
                    """,
                ],
                output_instructions=[
                    """Write in the style of an image-generation prompt (cinematic).""",
                    """Do not add slides with calls to subscribe, like, or comment.""",
                ],
            ),
        )
    )
```
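With the defaults shown here, 300 English words become int(1.4 × 300 × 1.3) = 546, so max_completion_tokens is set to 546 for that agent.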
How do you calculate the token limit from the number of words? The code above does exactly that for several languages. On average, one English word is roughly 1.3–1.4 tokens; the RATIO table uses 1.4 for English, and get_tokens_limit_for_words adds a further 1.3× safety margin. Make sure the limit you pass is an integer: in my experience, a fractional value is silently ignored and the model behaves as if there were no limit at all, which can drain your balance. It's also worth checking the generated text afterward; with the setup above I reached over 90% accuracy in hitting the target length.
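If you want to automate that check, a simple post-generation guard is enough (a sketch; the function names and the 10% tolerance are my own choice, not part of the pipeline):

```python
def count_words(text: str) -> int:
    return len(text.split())


def within_budget(generated: str, target_words: int, tolerance: float = 0.10) -> bool:
    """True if the generated text stays within the word budget plus the tolerance."""
    return count_words(generated) <= target_words * (1 + tolerance)


# Example: flag outputs that overshoot a 300-word target by more than 10%.
print(within_budget("some generated script ...", target_words=300))
```

Leave comments and share your experience. Good luck with your development!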