Prompting the World: Tailoring AI Prompts for Multilingual Performance

admin

2026/06/10 14:16:20

A SaaS company spent six weeks optimizing a customer-support prompt in English. The prompt instructed the model to classify incoming tickets, route them to the correct team, and generate a first-response draft. In English, the prompt achieved 94% classification accuracy and a 4.2/5.0 quality score on the response drafts.

The company rolled out to French, German, and Japanese markets. They had a human translator render the English prompt literally and deployed it. French accuracy dropped to 67%, German to 71%, and Japanese to 58%. In French, the model was misrouting tickets to billing instead of technical support. In German, it was generating responses that were technically correct but tonally wrong: too casual for enterprise customers. In Japanese, it was failing to extract intent from polite, indirect customer inquiries.

The company spent three weeks debugging the model, the data, and the fine-tuning. The problem was the prompt. It had been translated, not localized. And in prompt engineering, translation is not localization.

Prompt engineering is language-specific. A prompt that works in English will not necessarily work in French, German, Japanese, or Arabic. The reason is not that the model is weak in those languages, but that instruction-following itself varies by language. The same model, with the same architecture and the same weights, follows instructions differently depending on the language they are written in. Translating a prompt does not transfer its performance.

Why prompts fail across languages

LLM instruction-following is not uniform across languages. A model trained on 90% English data and 5% French data will not follow French instructions with the same precision as English instructions — even if it speaks French fluently. The gap is not linguistic capability. It is instruction-following capability, and it varies by language, by task, and by prompt structure.

Training data composition. LLMs are trained on internet-scale data that is heavily skewed toward English. Widely used public instruction-tuning datasets such as FLAN and Alpaca are also predominantly English. The model learns to follow instructions in English from massive exposure. It learns to follow instructions in French from significantly less exposure. The result: the model understands French text fluently but follows French instructions less precisely than English instructions.

Linguistic structure and instruction clarity. English instructions benefit from English's relatively rigid word order and explicit sentence structure. French allows more flexibility. Japanese often omits subjects and relies heavily on context. A prompt structure that works in English (imperative statements, explicit step-by-step instructions) may be unnecessarily rigid in French and insufficiently explicit in Japanese. The prompt must be restructured for the linguistic norms of the target language, not translated word-for-word.

Cultural communication patterns. English prompt engineering relies on direct, explicit instructions: “Do X, then do Y, then output Z.” This works in English because English communication norms value directness. In Japanese, indirect instructions with context are more effective: “Considering the customer's situation, an appropriate response might be...” The model has learned Japanese communication patterns from its training data. A translated English prompt that is overly direct departs from the patterns the model has seen in Japanese text, which can reduce compliance.

Tokenization and semantic density. Different languages tokenize differently. English is relatively token-efficient (roughly 1.3 tokens per word with common subword tokenizers). Chinese can cost anywhere from under one to several tokens per character, depending on the tokenizer's vocabulary. German compounds are long words that are typically split into several subword tokens. Japanese mixes kanji, hiragana, and katakana, each with different tokenization behavior. A prompt that fits comfortably within a model's context window in English may run short of room in another language, not because the character count is higher, but because the token count is higher. The prompt must be adapted for token efficiency, not just character count.
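To make the token-count point concrete, here is a minimal sketch that compares the size of the same instruction across languages. The function names are hypothetical, and the default counter simply counts UTF-8 bytes, which is only a crude stand-in. Real numbers depend on the tokenizer of the model you deploy, so pass that tokenizer's counting function as `count` before drawing conclusions.

```python
from typing import Callable, Dict


def utf8_byte_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts UTF-8 bytes.

    Assumption for illustration only. Real token counts depend on the
    model's tokenizer; swap in its counter for real measurements.
    """
    return len(text.encode("utf-8"))


def relative_cost(
    prompts: Dict[str, str],
    baseline: str = "en",
    count: Callable[[str], int] = utf8_byte_count,
) -> Dict[str, float]:
    """Return each prompt's size as a multiple of the baseline prompt's size."""
    base = count(prompts[baseline])
    return {lang: round(count(text) / base, 2) for lang, text in prompts.items()}


# The same instruction in three languages.
prompts = {
    "en": "Extract the invoice number.",
    "fr": "Commencez par extraire le numéro de facture.",
    "ja": "請求書番号を抽出してください。",
}
ratios = relative_cost(prompts)
```

A ratio well above 1.0 for a target language is a signal to check the context budget for that language's prompt, examples, and expected output together.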

Few-shot example transfer. Few-shot examples are a core technique in prompt engineering. But examples that work in English do not necessarily transfer to other languages. A few-shot example that relies on English-specific idioms, cultural references, or linguistic patterns will not help the model understand the task in French or Japanese. The examples must be localized, not translated. And in some languages, few-shot examples are less effective than they are in English — requiring more examples, or different example structures, to achieve the same performance.

Five dimensions of prompt localization

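Instruction structure adaptation. The structure of instructions must be adapted to the target language's linguistic norms. English prompts use imperative chains: “Extract the invoice number. Then validate the date. Then output JSON.” French prompts benefit from more connective structure: “Commencez par extraire le numéro de facture. Ensuite, validez la date. Enfin, générez la sortie JSON.” (“Start by extracting the invoice number. Next, validate the date. Finally, generate the JSON output.”) Japanese prompts benefit from polite request forms with the steps linked into a continuous flow: “請求書番号を抽出してください。その後、日付を検証し、JSONで出力します。” (“Please extract the invoice number. After that, validate the date and output it as JSON.”) A fully context-first version would also open with the situation before making the request. The instruction structure must match the language's natural communicative patterns, not the English original.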
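One way to keep separately engineered instructions per language is a small registry keyed by language code. The sketch below reuses the three example instructions from the paragraph above. The names `INSTRUCTION_TEMPLATES` and `instructions_for` are hypothetical, and failing loudly on a missing language is a design assumption, not a requirement.

```python
# Hand-localized instructions, one entry per language (strings from the text above).
INSTRUCTION_TEMPLATES = {
    "en": "Extract the invoice number. Then validate the date. Then output JSON.",
    "fr": (
        "Commencez par extraire le numéro de facture. "
        "Ensuite, validez la date. Enfin, générez la sortie JSON."
    ),
    "ja": "請求書番号を抽出してください。その後、日付を検証し、JSONで出力します。",
}


def instructions_for(lang: str) -> str:
    """Return the hand-localized instructions for lang.

    Raises instead of silently falling back to English, so a missing
    localization is caught before deployment rather than in production.
    """
    try:
        return INSTRUCTION_TEMPLATES[lang]
    except KeyError:
        raise LookupError(f"No localized instructions for language {lang!r}") from None
```

Raising on an unknown language avoids the failure mode described earlier, where an English-shaped prompt quietly ends up serving a market it was never tuned for.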

Tone and register calibration. Prompt tone must be calibrated for the target language's formality norms. An English prompt for customer service might use a friendly, informal tone: “You're helping a customer who...” A Japanese prompt for the same use case requires a formal, respectful tone: “お客様とのサポートをしていただけませ…” Using an informal tone in Japanese prompts can reduce model compliance because it conflicts with the register the model expects for that type of task.

Task decomposition by language capability. Some tasks that can be done in a single prompt in English require decomposition in other languages. The model's capability in the target language may be strong for simple tasks but weaker for complex, multi-step tasks. The prompt must be redesigned to match the model's capability profile in each language, rather than assuming that English capability transfers.
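As a rough sketch of what per-language decomposition could look like, the code below runs one combined step for English and three smaller steps for Japanese. The step names, the `PLANS` table, and the `call_model` signature are all hypothetical. Which languages need decomposition is something your own testing would decide.

```python
from typing import Callable, Dict, List

# Hypothetical per-language plans: one combined step where the single prompt
# tested well, several smaller steps where it did not.
PLANS: Dict[str, List[str]] = {
    "en": ["classify_route_and_draft"],
    "ja": ["extract_intent", "classify_and_route", "draft_reply"],
}


def run_plan(lang: str, ticket: str, call_model: Callable[[str, str], str]) -> List[str]:
    """Run each step in order, feeding the previous step's output into the next."""
    outputs = []
    current = ticket
    for step in PLANS[lang]:
        current = call_model(step, current)
        outputs.append(current)
    return outputs


def fake_model(step: str, text: str) -> str:
    """Stand-in for a real model call; it only records which step ran."""
    return f"{step}({text})"


trace = run_plan("ja", "ticket", fake_model)
```

Keeping the plan in data rather than in code makes it cheap to merge steps back together once a language's single-prompt performance catches up.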

Few-shot example localization. Few-shot examples must be culturally and linguistically appropriate for the target language. An English few-shot example about “thanking the customer for their patience” translates poorly to Japanese, where the cultural norms around customer patience and gratitude are different. The examples must reflect the target culture's communication patterns, not just the target language's vocabulary.
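A minimal sketch of keeping few-shot examples per language, so that a French prompt never carries English examples. The example tickets, the team labels, and the `build_prompt` helper are all made up for illustration. Real examples should be written or reviewed by someone who knows the target market.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-written examples per language: (ticket text, team label).
FEW_SHOT: Dict[str, List[Tuple[str, str]]] = {
    "en": [
        ("I can't log in to my account.", "technical_support"),
        ("I was charged twice this month.", "billing"),
    ],
    "fr": [
        ("Je n'arrive pas à me connecter à mon compte.", "technical_support"),
        ("J'ai été facturé deux fois ce mois-ci.", "billing"),
    ],
}


def build_prompt(lang: str, instructions: str, ticket: str) -> str:
    """Assemble instructions, same-language examples, and the new ticket.

    The "Ticket:" and "Team:" field labels stay in English here for brevity;
    they may need localizing too.
    """
    lines = [instructions, ""]
    for text, label in FEW_SHOT[lang]:
        lines += [f"Ticket: {text}", f"Team: {label}", ""]
    lines += [f"Ticket: {ticket}", "Team:"]
    return "\n".join(lines)


prompt = build_prompt("fr", "Classez le ticket.", "Ma facture est incorrecte.")
```

Because the example lists are independent per language, a language that needs more examples to reach the same performance can simply carry a longer list.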

Output format specification. Output format instructions (“output JSON,” “use XML tags”) must be specified in ways the model understands reliably in each language. In English, “output valid JSON” is a reliable instruction. In Japanese, the model may need “JSONで出力してください” (please output in JSON) plus a formatting example to achieve the same reliability. The output format specification must be adapted for each language's instruction-following patterns.
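Format reliability is easy to measure per language if every model output goes through the same validator. The sketch below assumes a hypothetical output schema with the keys `category`, `team`, and `draft`. Substitute the fields your own prompt asks for.

```python
import json
from typing import List, Optional

# Hypothetical schema for the ticket-routing output.
REQUIRED_KEYS = {"category", "team", "draft"}


def parse_ticket_output(raw: str) -> Optional[dict]:
    """Return the parsed object if raw is a JSON object with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def format_compliance(outputs: List[str]) -> float:
    """Fraction of raw model outputs that pass parse_ticket_output."""
    if not outputs:
        return 0.0
    return sum(parse_ticket_output(o) is not None for o in outputs) / len(outputs)
```

Comparing `format_compliance` across languages on the same test tickets shows where an added formatting example is actually needed.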

A prompt localization framework

Localizing a prompt is not translating it. It is re-engineering it for the target language's instruction-following behavior. A systematic framework has four steps:

Step 1: Baseline performance measurement. Before localizing, measure the prompt's performance in the source language across all relevant metrics: accuracy, tone, format compliance, edge-case handling. This is the baseline. The goal of localization is not to replicate the English prompt — it is to achieve equivalent task performance in the target language. The baseline defines what “equivalent” means.
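A baseline can be as simple as a few numbers computed over a labeled test set. The sketch below covers two of the metrics named above, routing accuracy and format compliance. The `Result` fields are assumptions about what you would record per test ticket. Tone and edge-case handling would need their own scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    expected_team: str   # human-assigned label
    predicted_team: str  # team the model routed to
    format_ok: bool      # output parsed and matched the expected format


def baseline_metrics(results: List[Result]) -> dict:
    """Compute routing accuracy and format compliance over a labeled test set."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    correct = sum(r.expected_team == r.predicted_team for r in results)
    well_formed = sum(r.format_ok for r in results)
    return {"n": n, "accuracy": correct / n, "format_compliance": well_formed / n}
```

Running the same function on the source-language results and on each target language gives directly comparable numbers.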

Step 2: Error analysis by language. Translate the prompt literally and test it in the target language. Analyze the errors: Are they comprehension errors (model didn't understand the task)? Compliance errors (model understood but didn't follow instructions)? Or output errors (model followed instructions but output is wrong)? Different error types require different localization strategies. Comprehension errors need clearer instructions. Compliance errors need restructured prompts. Output errors need better few-shot examples.
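Once each failed test case has been labeled by a reviewer, tallying the labels shows which localization strategy to try first. The sketch below uses the three error types and the strategy mapping from the paragraph above. The function name and the label strings are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Mapping taken from the paragraph above: error type -> localization strategy.
STRATEGY = {
    "comprehension": "clarify the instructions",
    "compliance": "restructure the prompt",
    "output": "improve the few-shot examples",
}


def prioritize(error_labels: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (error type, count, strategy) tuples, most frequent first."""
    counts = Counter(error_labels)
    unknown = set(counts) - set(STRATEGY)
    if unknown:
        raise ValueError(f"unknown error labels: {sorted(unknown)}")
    return [(kind, n, STRATEGY[kind]) for kind, n in counts.most_common()]
```

Running this per language keeps the analysis honest: the dominant error type in Japanese may not be the dominant one in German.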

Step 3: Prompt restructuring. Restructure the prompt for the target language's instruction-following patterns. This may involve: changing imperative chains to context-first structures (for Japanese), adding connecting words (for French), decomposing complex tasks into simpler subtasks (for languages with weaker complex-task performance), or adding explicit formatting examples (for languages where format compliance is weaker). The restructured prompt should be tested against the baseline metrics.

Step 4: Few-shot example localization and output format calibration. Localize few-shot examples for cultural and linguistic appropriateness. Calibrate output format instructions for the target language. Test the localized prompt on a held-out test set that matches real-world usage. Iterate until performance is within 5 percentage points of the English baseline.
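The stopping rule in Step 4 can be written down as a small check. This is a sketch with a hypothetical function name. It treats accuracies as fractions between 0 and 1 and measures the gap in percentage points, with the 5-point default taken from the text above.

```python
def within_target(
    baseline_acc: float, localized_acc: float, max_gap_points: float = 5.0
) -> bool:
    """True if localized accuracy is no more than max_gap_points below baseline.

    Accuracies are fractions in [0, 1]; the gap is measured in percentage points.
    A tiny tolerance absorbs floating-point rounding at the boundary.
    """
    gap_points = (baseline_acc - localized_acc) * 100
    return gap_points <= max_gap_points + 1e-9
```

With the figures from the opening example, `within_target(0.94, 0.67)` fails the check, since 67% is 27 points below the 94% English baseline, while any localized result of 89% or better passes.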

The cost of not localizing

The SaaS company with the 94% → 67% French accuracy drop eventually fixed the problem. They stopped translating prompts and started localizing them. The process took four weeks and cost $38K: $22K for a bilingual prompt engineer, $8K for localized few-shot example creation, and $8K for testing and iteration.

The cost of not fixing it: three months of misrouted French tickets, a 23% increase in average resolution time, and a customer satisfaction score that dropped from 4.2 to 3.1 in the French market. The company estimated the revenue impact at $340K in French-market churn during the three-month period.

Prompt translation is not prompt localization. And prompt localization is not optional if you want consistent AI performance across languages. The model is not the problem. The prompt is.

Artlangs Translation provides LLM prompt engineering localization across 230+ language pairs: instruction structure adaptation, tone and register calibration, few-shot example localization, and output format specification for consistent AI performance across languages. We work with AI-native companies, SaaS platforms, and enterprise AI teams in San Francisco, London, Berlin, Tokyo, and Singapore. Because your AI's performance should not depend on what language your customers speak.
