Prompt Caching Infrastructure: Reducing LLM Cost and Latency


Blake Crosley

Mar 17, 2026


Updated December 11, 2025

December 2025 update: Anthropic prefix caching cuts costs by up to 90% and latency by up to 85% for long prompts. OpenAI enables automatic caching by default (50% cost savings). 31% of LLM calls are semantically similar to earlier requests, a massive inefficiency in deployments without caching infrastructure.² Cache reads cost $0.30/M tokens versus $3.00/M for fresh processing (Anthropic). A multi-tier caching architecture (semantic → prefix → inference) maximizes the savings.

Anthropic's prompt caching reduces costs by up to 90% and latency by up to 85% for long prompts.¹ OpenAI delivers a 50% cost reduction with automatic caching that is enabled by default. Research shows that 31% of LLM calls are semantically similar to earlier requests, which is a massive inefficiency in deployments without caching infrastructure.² Organizations running AI applications in production waste significant money if they lack a sound caching strategy.

Prompt caching operates at several levels, from provider-side prefix caching that reuses KV cache computation to application-level semantic caching that returns earlier responses for similar queries. Understanding each layer, and when to deploy it, lets organizations optimize both cost and latency for their specific workload patterns.

Caching fundamentals

LLM inference cost comes from two sources: processing input tokens and generating output tokens. Caching strategies target both:

Input token caching (prefix caching)

Every LLM request runs its input tokens through the model's attention mechanism, producing key-value pairs that are stored in the KV cache. When multiple requests share an identical prefix (a system prompt, few-shot examples, or document context), that KV computation is repeated needlessly.

The prefix caching fix: store the computed KV values for common prefixes. Subsequent requests with a matching prefix skip the recomputation and start from the cached state.

Cost impact:
- Anthropic: cache reads cost $0.30/M tokens versus $3.00/M for fresh processing (90% savings)
- OpenAI: 50% discount on cached tokens
- Google: pricing varies with the context window

Latency impact: skipping the prefix computation cuts time-to-first-token by 50-85%, depending on prefix length.

Output caching (semantic caching)

Some requests deserve identical responses: repeated questions, deterministic queries, or lookups that need no fresh generation.

The semantic caching fix: store response output keyed by semantically similar input, and return the cached response for matching queries without calling the LLM.

Cost impact: a cached response needs no API call, so a cache hit saves 100%.

Latency impact: responses return in milliseconds, versus seconds for LLM inference.
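The lookup described above can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `SemanticCache` class, its 0.8 threshold, and the sample queries are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar stored query, if similar enough."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache(threshold=0.8)
cache.put("how do I reset my password", "Open Settings > Security > Reset password.")

hit = cache.get("how do I reset my password please")   # near-duplicate wording
miss = cache.get("what is the refund policy")          # unrelated query
```

A real deployment would replace the linear scan with a vector index and the toy embedding with a sentence-embedding model; the control flow stays the same.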

The caching hierarchy

Production systems typically implement several caching layers:

Request → Semantic Cache (100% savings) → Prefix Cache (50-90% savings) → Full Inference
                ↓                               ↓                               ↓
         Cached response                 Cached KV state                Fresh computation

Each layer captures a different optimization opportunity, depending on how requests resemble one another.
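To make the ordering of the layers concrete, here is a small simulation of the request path. Everything in it is a stand-in chosen for illustration: the response cache uses exact-match keys rather than embeddings, `kv_prefixes` merely records which prefixes have been seen, and `fake_llm` is a hypothetical placeholder for a provider call.

```python
from collections import Counter

semantic_cache = {}     # stand-in: exact-match keys instead of embeddings
kv_prefixes = set()     # stand-in for provider-side cached KV state
served_by = Counter()   # which tier handled each request


def fake_llm(prefix: str, query: str) -> str:
    """Hypothetical model call; a real system would call the provider here."""
    return f"answer to: {query}"


def handle(prefix: str, query: str) -> str:
    # Tier 1: a response-cache hit skips the model entirely.
    if query in semantic_cache:
        served_by["semantic_cache"] += 1
        return semantic_cache[query]
    # Tier 2: a known prefix means only the suffix needs fresh computation.
    if prefix in kv_prefixes:
        served_by["prefix_cache"] += 1
    else:
        # Tier 3: nothing is cached, so the whole prompt is processed.
        served_by["full_inference"] += 1
        kv_prefixes.add(prefix)
    response = fake_llm(prefix, query)
    semantic_cache[query] = response
    return response


SYSTEM = "You are a support agent."
handle(SYSTEM, "How do I log in?")    # full inference
handle(SYSTEM, "How do I log out?")   # prefix cache
handle(SYSTEM, "How do I log in?")    # semantic cache
```

Checking the cheapest, highest-saving tier first is the design choice that matters here; the later tiers only see the traffic the earlier ones could not serve.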

Provider-level prompt caching

Anthropic Claude

Anthropic offers the most configurable prompt caching:³

Pricing:
- Cache writes: 25% above the base input price
- Cache reads: 90% discount (10% of the base price)
- Break-even: 2+ cache hits per cached prefix
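The pricing rules above can be turned into a quick cost comparison. This is a back-of-the-envelope sketch: it uses the $3.00/M base price quoted earlier, considers only the shared prefix, and assumes every request arrives while the cache entry is still alive. The function names are illustrative.

```python
BASE = 3.00           # $/M input tokens, the base price quoted earlier
WRITE = BASE * 1.25   # cache write: 25% premium
READ = BASE * 0.10    # cache read: 90% discount


def cost_with_cache(prefix_mtok: float, requests: int) -> float:
    """The first request writes the cache; every later request reads it."""
    return prefix_mtok * (WRITE + READ * (requests - 1))


def cost_without_cache(prefix_mtok: float, requests: int) -> float:
    """Every request pays the full base price for the prefix."""
    return prefix_mtok * BASE * requests


# Compare a 10,000-token prefix (0.01 M tokens) over 1, 2, and 10 requests.
for n in (1, 2, 10):
    cached = cost_with_cache(0.01, n)
    uncached = cost_without_cache(0.01, n)
    print(f"{n:>2} requests: cached ${cached:.4f} vs uncached ${uncached:.4f}")
```

Under these multipliers a single request costs more with caching (the write premium), and the premium is already recovered on the second request that uses the prefix, so a 2+ hit target leaves a comfortable margin.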

Requirements:
- Minimum of 1,024 tokens per cache checkpoint
- Up to 4 cache checkpoints per request
- Cache lifetime: 5 minutes, refreshed on each hit (a 1-hour lifetime is available as an option at a higher write price)
- Up to 5 conversation turns can be cached

Implementation:

import anthropic

client = anthropic.Anthropic()

# Mark content for caching with cache_control
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert assistant for our enterprise software...",
            "cache_control": {"type": "ephemeral"}  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": "How do I configure user permissions?"}]
)

Best practices:
- Put static content (system prompt, documentation) at the start of the prompt
- Put dynamic content (user input, conversation) at the end
- Place cache checkpoints at natural boundaries
- Monitor the cache hit rate to confirm the optimization is working

OpenAI

OpenAI implements automatic caching with no code changes:⁴

Pricing:
- Cached tokens: 50% of the base input price
- No cache write premium

Requirements:
- Minimum of 1,024 tokens to be eligible for caching
- Cache hits occur in 128-token increments
- Cache lifetime: 5-10 minutes of inactivity

Automatic behavior:
- Prompts of 1,024 tokens or more are cached automatically
- The system detects matching prefixes across requests
- No API changes required
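The 1,024-token minimum and the 128-token increments combine into a simple rule for how much of a matching prefix can be served from cache. The helper below is one reading of those two rules, written for illustration; the function name is an assumption and the provider's actual matching logic is not public in this detail.

```python
MIN_CACHEABLE = 1024   # minimum prompt length eligible for caching
INCREMENT = 128        # cache hits are counted in 128-token steps


def cacheable_tokens(matching_prefix_tokens: int) -> int:
    """Largest cacheable length not exceeding the matching prefix."""
    if matching_prefix_tokens < MIN_CACHEABLE:
        return 0
    return (matching_prefix_tokens // INCREMENT) * INCREMENT


for n in (1000, 1024, 1500, 2048):
    print(n, "->", cacheable_tokens(n))
```

The practical takeaway is that a shared prefix just under a 128-token boundary leaves some tokens uncached, so padding is never needed but trimming volatile content out of the prefix helps.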

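Monitoring:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # prompt caching applies to gpt-4o and newer models
    messages=[...],
)

# Inspect usage for cache hits
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total input tokens: {response.usage.prompt_tokens}")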

Google Gemini

Google provides context caching for Gemini models:⁵

Pricing:
- Varies with the size of the cached context and how long it is kept
- Storage fee for cached content

Features:
- Explicit cache creation and management
- Configurable time-to-live
- Cache sharing across requests

Implementation:

import datetime

import google.generativeai as genai
from google.generativeai import caching

# product_docs is assumed to hold the documentation text, loaded elsewhere

# Create the cached content
cache = caching.CachedContent.create(
    model='models/gemini-1.5-pro-001',
    display_name='product-documentation',
    system_instruction="You are a product expert...",
    contents=[product_docs],
    ttl=datetime.timedelta(hours=1)
)

# Use the cached content in a request
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I configure feature X?")

Amazon Bedrock

AWS offers prompt caching in preview for supported models:⁶

Requirements:
- Claude 3.5 Sonnet requires a minimum of 1,024 tokens per checkpoint
- The second checkpoint requires 2,048 tokens

The implementation pattern mirrors Anthropic's cache_control approach within the Bedrock API structure.

vLLM prefix caching

Self-hosted inference with vLLM includes automatic prefix caching:⁷

Architecture

vLLM's Automatic Prefix Caching (APC) stores KV blocks in a hash table, which enables cache reuse without a tree structure:

Core design:
- All KV blocks live in the block pool from the start
- Hash-based lookup for prefix matching
- O(1) operations for block management
- Preserves the memory efficiency of PagedAttention
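The hash-based lookup can be illustrated with a toy version of block hashing. This sketch is not vLLM's code: the block size of 4, the SHA-256 chaining scheme, and the `kv_store` dictionary are assumptions chosen to show the idea that each block's hash covers the whole prefix leading up to it.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; kept tiny for illustration


def block_hashes(token_ids):
    """Hash each full block together with its parent's hash,
    so one hash identifies the entire prefix up to that block."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


kv_store = {}  # block hash -> placeholder for that block's KV tensors


def cached_prefix_len(token_ids):
    """Return how many leading tokens are already cached, then register the rest."""
    hits, still_matching = 0, True
    for h in block_hashes(token_ids):
        if still_matching and h in kv_store:
            hits += 1
        else:
            still_matching = False
            kv_store[h] = "kv-block"
    return hits * BLOCK_SIZE


first = cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
second = cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45])
```

Because every lookup is a dictionary access on a fixed-size key, matching a prefix costs one O(1) probe per block, with no tree traversal.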

Configuration

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # enable APC
    gpu_memory_utilization=0.90,
)

Performance impact

vLLM with PagedAttention shows 14-24x higher throughput than baseline implementations.⁸ Prefix caching adds:

- A 10x cost difference between cached and uncached tokens
- Large latency reductions for matching prefixes
- Memory efficiency through KV block sharing

Security considerations

vLLM supports cache isolation for shared environments:

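from vllm import SamplingParams

# A per-request cache salt prevents cross-tenant cache access.
# In recent vLLM releases the salt is a field of the prompt, not a generate() keyword.
outputs = llm.generate(
    {"prompt": "...", "cache_salt": "tenant-123"},  # isolate the cache per tenant
    sampling_params=SamplingParams(...),
)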

Mixing the cache salt into the block hash prevents timing attacks in which an attacker infers cached content by observing latency.
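Why a salt isolates tenants is easiest to see on a single block hash. The snippet below is a simplified illustration and not vLLM's hashing code: `block_hash` and the tenant names are hypothetical.

```python
import hashlib


def block_hash(token_ids, cache_salt=""):
    """Hash a block of tokens, with the salt mixed into the hashed payload."""
    return hashlib.sha256(f"{cache_salt}|{token_ids}".encode()).hexdigest()


prompt = [101, 102, 103, 104]

tenant_a = block_hash(prompt, cache_salt="tenant-a")
tenant_a_again = block_hash(prompt, cache_salt="tenant-a")
tenant_b = block_hash(prompt, cache_salt="tenant-b")
```

Identical prompts from the same tenant still map to the same cache entry, while another tenant's identical prompt maps to a different one, so its latency reveals nothing about what the first tenant has cached.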

LMCache extension

LMCache extends vLLM with advanced caching capabilities:⁹

Features:
- KV cache reuse across engine instances
- Multi-tier storage (GPU → CPU RAM → disk)
- Caching of non-prefix content
- 3-10x latency reduction in benchmarks

Architecture:

vLLM Engine → LMCache → GPU VRAM (hot)
                     → CPU RAM (warm)
                     → Local Disk (cold)
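The hot/warm/cold layout can be sketched as a chain of small LRU stores. This is a conceptual toy, not the LMCache API: the `TieredKVStore` class, the tier capacities, and the promote-on-read policy are assumptions made for the example.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVStore:
    """Toy three-tier store: an entry evicted from a full tier is demoted to the next."""

    def __init__(self, capacities=(2, 4, 8)):
        self.names = ("gpu", "cpu", "disk")
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def put(self, key: str, value: str, tier: int = 0) -> None:
        if tier >= len(self.tiers):
            return  # fell off the coldest tier: dropped
        store = self.tiers[tier]
        store[key] = value
        store.move_to_end(key)
        if len(store) > self.capacities[tier]:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self.put(old_key, old_value, tier + 1)

    def get(self, key: str) -> Optional[str]:
        for store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value)  # promote to the hot tier on access
                return value
        return None

    def where(self, key: str) -> Optional[str]:
        for name, store in zip(self.names, self.tiers):
            if key in store:
                return name
        return None


store = TieredKVStore()
for k in ("a", "b", "c"):
    store.put(k, f"kv-{k}")

location_before = store.where("a")   # demoted when "c" arrived
value = store.get("a")               # read promotes it again
location_after = store.where("a")
```

The trade-off is the usual one for tiered storage: colder tiers hold far more KV state but cost a transfer back to GPU memory before the cached blocks can be used.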

Semantic caching

Semantic caching returns earlier responses for queries that are semantically similar (not merely identical):

GPTCache

GPTCache provides open-source semantic caching for LLM applications:¹⁰

Architecture:

Query → Embedding → Vector Search → Similarity Check → Response/API Call
              ↓           ↓              ↓
         BERT/OpenAI   Milvus/FAISS   Threshold (0.8)

Components:
- LLM Adapter: connects to the various LLM providers
- Embedding Generator: converts queries into vectors
- Vector Store: similarity search (Milvus, FAISS, Zilliz)
- Cache Manager: storage and retrieval
- Similarity Evaluator: threshold-based matching

Implementation:

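from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Embedding model and storage backends (SQLite + FAISS here; swap as needed)
onnx = Onnx()
manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)

# Initialize the semantic cache
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()  # reads OPENAI_API_KEY from the environment

# Use the cached OpenAI call
response = openai.ChatCompletion.create(
    model='gpt-4',
    messages=[{"role": "user", "content": "What is machine learning?"}]
)
# Semantically similar queries ("Explain ML", "Define machine learning")
# will return the cached response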

Performance

GPTCache achieves significant efficiency gains:¹¹

- API call reduction: up to 68.8% across query categories
- Cache hit rate: 61.6% to 68.8%
- Accuracy: 97%+ positive hit rate
- Latency reduction: 40-50% on cache hits, up to 100x for full hits

Advanced techniques

VectorQ adaptive thresholds:¹²

A static similarity threshold (e.g., 0.8) performs poorly across diverse queries. VectorQ learns embedding-specific threshold regions that adapt to query complexity:

- Simple factual queries: higher threshold (stricter matching)
- Open-ended queries: lower threshold (more reuse)
- Ambiguous queries: dynamic adjustment

SCALM pattern detection:

SCALM improves on GPTCache through pattern detection and frequency analysis:
- Improves the cache hit ratio by 63%
- Reduces token usage by 77%
- Identifies patterns in high-frequency cache entries

When to use semantic caching

Good candidates:
- FAQ-style queries with a limited answer space
- Lookup queries (product information, documentation)
- Deterministic responses (calculations, formatting)
- High-traffic applications with repeated queries

Poor candidates:
- Creative generation that requires uniqueness
- Personalized responses (user-specific context)
- Time-sensitive information
- Query patterns with little repetition

Implementation patterns

Chat applications

Chat systems benefit from both prefix and semantic caching:

System prompt caching:

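# The static system prompt sits at the start of the request and is cached
system_prompt = """
You are a customer support agent for Acme Corp...
[2000+ tokens of guidelines and knowledge]
"""

# With the Anthropic API the system prompt is a top-level parameter,
# passed as system=system alongside messages=messages
system = [
    # cache_control as in the Anthropic example earlier, e.g. {"type": "ephemeral"}
    {"type": "text", "text": system_prompt, "cache_control": {...}}
]

# The dynamic conversation follows the cached prefix
messages = [
    {"role": "user", "content": user_message}
]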

Conversation history caching: Anthropic supports caching up to 5 conversation turns, which reduces the cost of multi-turn conversations.

RAG applications

Retrieval-augmented generation caches the retrieved context:

# Cache structure for RAG
cached_context = {
    "system": system_prompt,           # always cached
    "documents": retrieved_chunks,      # cached per query cluster
    "examples": few_shot_examples       # constant across requests
}

# Only the user query varies
dynamic_content = {
    "query": user_question
}

Document chunk caching: when many queries retrieve the same documents, prefix caching eliminates redundant processing of the shared context.
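Prefix caching only helps if the shared context is byte-for-byte identical across requests, and retrievers often return the same chunks in a different rank order. One way to keep the prefix stable is to order chunks deterministically before building the prompt. The sketch below assumes chunks arrive as a mapping from a document ID to its text; `build_prompt` and the sample data are hypothetical.

```python
def build_prompt(system, chunks, question):
    """Return (cacheable_prefix, dynamic_suffix) with documents in a fixed order."""
    ordered = [f"[{doc_id}]\n{text}" for doc_id, text in sorted(chunks.items())]
    prefix = system + "\n\n" + "\n\n".join(ordered)
    suffix = f"\n\nQuestion: {question}"
    return prefix, suffix


system = "Answer using only the documents provided."

# Two retrievals return the same chunks in a different rank order.
retrieval_1 = {"doc-7": "Refunds take 5 days.", "doc-2": "Shipping is free over $50."}
retrieval_2 = {"doc-2": "Shipping is free over $50.", "doc-7": "Refunds take 5 days."}

prefix_1, suffix_1 = build_prompt(system, retrieval_1, "How long do refunds take?")
prefix_2, suffix_2 = build_prompt(system, retrieval_2, "Is shipping free?")
```

Sorting by ID discards the retriever's relevance ordering, so this fits workloads where the model does not depend on chunk order; otherwise, caching per query cluster as outlined above is the safer choice.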

Agentic workflow

Agent systems with tool calling benefit from prefix caching:

System prompt → Tool definitions → Conversation history → Current query
    (cached)       (cached)           (partially cached)      (dynamic)

Optimization techniques:
- Place tool definitions after the system prompt for combined caching
- Batch similar agent requests to increase cache hits
- Implement semantic caching for common tool calls
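With Anthropic-style cache_control markers, one way to cache the system prompt and the tool definitions together is to put a marker on the last block of each static section. The sketch below only builds the request arguments and makes no API call; the tool names, schemas, and the `mark_last_for_caching` helper are hypothetical, and the exact placement rules should be checked against the provider's documentation.

```python
import copy

tools = [
    {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the status of an order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "issue_refund",  # hypothetical tool
        "description": "Refund an order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]


def mark_last_for_caching(blocks):
    """Return a copy with a cache marker on the final block of a static section."""
    marked = copy.deepcopy(blocks)
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


request_kwargs = {
    "system": mark_last_for_caching(
        [{"type": "text", "text": "You are an order-support agent."}]
    ),
    "tools": mark_last_for_caching(tools),
    # The current query is dynamic, so it carries no marker.
    "messages": [{"role": "user", "content": "Where is order 1234?"}],
}
```

Keeping the tool list in a fixed order matters as much as the marker: reordering or editing any definition changes the prefix and invalidates the cached state after that point.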

Measurement and monitoring

Metrics to track

Cache hit rate:

cache_hit_rate = cached_tokens / total_input_tokens
# Target: >50% for high-repetition workloads

Cost savings:

savings = (uncached_cost - cached_cost) / uncached_cost
# Compute with provider-specific token prices

Latency impact:

latency_improvement = (uncached_ttft - cached_ttft) / uncached_ttft
# Measure time-to-first-token when analyzing prefix caching

Provider monitoring

Anthropic:

response = client.messages.create(...)
usage = response.usage

print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Fresh input tokens: {usage.input_tokens}")

OpenAI:

response = client.chat.completions.create(...)
cached = response.usage.prompt_tokens_details.cached_tokens
print(f"Cache hit ratio: {cached / response.usage.prompt_tokens:.2%}")
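The two providers report cached tokens differently, which matters when computing a comparable hit rate. As I understand the usage fields, Anthropic's input_tokens excludes tokens read from or written to the cache, whereas OpenAI's cached_tokens is a subset of prompt_tokens. The helpers below normalize both under that assumption; the function names and sample numbers are illustrative.

```python
def anthropic_hit_rate(input_tokens, cache_read_input_tokens, cache_creation_input_tokens):
    """Share of all input tokens served from cache.
    Assumes input_tokens does NOT include cached reads or writes."""
    total = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
    return cache_read_input_tokens / total if total else 0.0


def openai_hit_rate(prompt_tokens, cached_tokens):
    """Share of prompt tokens served from cache.
    Assumes cached_tokens is already counted inside prompt_tokens."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Example: a 2,000-token request whose first part was cached.
anthropic_rate = anthropic_hit_rate(
    input_tokens=200, cache_read_input_tokens=1800, cache_creation_input_tokens=0
)
openai_rate = openai_hit_rate(prompt_tokens=2000, cached_tokens=1792)
```

Dividing Anthropic's cache reads by input_tokens alone would overstate the hit rate, so it is worth fixing one definition and using it on every dashboard.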

Alerting

Configure alerts for:
- Cache hit rate below a threshold (e.g., under 30%)
- Cache write cost exceeding the savings from hits
- Latency degradation despite caching
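The three alert conditions can be expressed as a single check that runs over aggregated metrics. This is a minimal sketch: the `check_alerts` function, its parameter names, and the alert labels are assumptions, and the 30% default simply mirrors the example threshold above.

```python
def check_alerts(hit_rate, write_cost, read_savings, cached_ttft, baseline_ttft,
                 min_hit_rate=0.30):
    """Return the names of any alert conditions that currently hold."""
    alerts = []
    if hit_rate < min_hit_rate:
        alerts.append("low_hit_rate")
    if write_cost > read_savings:
        alerts.append("write_cost_exceeds_savings")
    if cached_ttft >= baseline_ttft:
        alerts.append("no_latency_gain")
    return alerts


healthy = check_alerts(hit_rate=0.62, write_cost=1.20, read_savings=9.50,
                       cached_ttft=0.4, baseline_ttft=1.8)
degraded = check_alerts(hit_rate=0.12, write_cost=4.00, read_savings=1.10,
                        cached_ttft=1.9, baseline_ttft=1.8)
```

In practice these inputs would come from the usage fields and latency measurements described above, aggregated over a window long enough to smooth out cold starts.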

Best practices

Prompt structure optimization

- Static content first: place system prompts, few-shot examples, and documentation at the start
- Dynamic content last: user input, conversation history, and request-specific variables go at the end
- Group similar content: arrange similar prompts so that they share a common prefix

Cache management strategy

- Warm up the system: pre-populate the cache with common prefixes before going live in production
- Set cache TTLs appropriately: balance storage cost against hit rate
- Partition the cache by tenant: prevent cross-tenant cache leakage in multi-tenant systems

Security considerations

- Avoid caching sensitive data: do not cache responses that contain PII or confidential information
- Implement cache isolation: use per-tenant cache salts or separate cache partitions
- Monitor cache access patterns: detect anomalous behavior that may indicate an attack

Testing and validation

- Measure a baseline first: establish cost and latency without caching
- A/B test caching strategies: compare approaches against real traffic
- Monitor correctness: confirm that cached responses remain appropriate over time

Conclusion

Prompt caching is essential infrastructure for production LLM deployments. With 31% of queries showing semantic similarity, organizations without caching pay substantially more for redundant computation. Combining provider-level prefix caching (50-90% savings) with application-level semantic caching (100% savings on a hit) optimizes both cost and latency.

The key is understanding your workload patterns: identify which parts of the prompt repeat across requests, measure the cache hit rate, and optimize prompt structure to maximize cache reuse. With the right caching in place, production AI applications can cut costs severalfold while improving the user experience through faster responses.

