A Coding Tutorial on the Model Context Protocol, Focusing on Semantic Chunking, Dynamic Token Management, and Context Relevance Scoring for Efficient LLM Interactions
Managing context effectively is a critical challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed available token windows. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using Sentence-Transformers, and scores each chunk based on recency, importance, and relevance. You’ll learn how to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most pertinent pieces of context. Along the way, we’ll cover token counting with a GPT-2 tokenizer, context-window optimization strategies, and interactive sessions that let you query and visualize your dynamic context in real time.
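Before diving into the implementation, here is a minimal sketch of the scoring idea at the heart of the protocol: each chunk receives a recency score, a user-assigned importance, and a semantic-similarity score against the current query, and these are blended with configurable weights. The weights below match the manager's defaults, but the per-chunk numbers are made-up placeholders purely for illustration.

import numpy as np

# Default weights from the tutorial's ModelContextManager (0.3 / 0.3 / 0.4).
recency_weight, importance_weight, semantic_weight = 0.3, 0.3, 0.4

# Hypothetical per-chunk signals, each already normalized to [0, 1].
recency = np.array([1.0, 0.6, 0.2])      # newer chunks score higher
importance = np.array([1.0, 0.8, 0.5])   # assigned when each chunk is added
similarity = np.array([0.9, 0.4, 0.7])   # cosine similarity to the query

# Weighted blend; chunks scoring above the relevance threshold stay in context.
scores = (recency_weight * recency
          + importance_weight * importance
          + semantic_weight * similarity)
print(scores)  # [0.96 0.58 0.49]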
import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm
We import essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules, such as time and gc, support timestamping and memory cleanup, while tqdm.notebook provides interactive progress bars for chunk processing in Colab.
@dataclass
class ContextChunk:
    """A chunk of text with metadata for the Model Context Protocol."""
    text: str
    embedding: Optional[torch.Tensor] = None
    importance: float = 1.0
    timestamp: float = 0.0
    metadata: Dict[str, Any] = None

    def __post_init__(self):
        # Default to an empty metadata dict and stamp the chunk with its creation time.
        if self.metadata is None:
            self.metadata = {}
        if self.timestamp == 0.0:
            self.timestamp = time.time()
The ContextChunk dataclass encapsulates a single segment of text along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that each chunk is stamped with the current time upon creation and that metadata defaults to an empty dictionary if none is provided.
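To see what __post_init__ does in practice, here is a small, hypothetical snippet (not part of the original tutorial) that constructs a chunk by hand and inspects the defaults it fills in:

# Assumes the ContextChunk dataclass defined above is already in scope.
chunk = ContextChunk(text="MCP keeps only the most relevant context.")
print(chunk.timestamp)   # stamped with time.time() by __post_init__
print(chunk.metadata)    # defaults to an empty dict
print(chunk.importance)  # 1.0 unless a different score is supplied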
class ModelContextManager:
    """
    Manager for implementing Model Context Protocol in LLMs on Google Colab.
    Handles context window optimization, token management, and relevance scoring.
    """

    def __init__(
        self,
        max_context_length: int = 8192,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        relevance_threshold: float = 0.7,
        recency_weight: float = 0.3,
        importance_weight: float = 0.3,
        semantic_weight: float = 0.4,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the Model Context Manager.
        Args:
            max_context_length: Maximum number of tokens in context window
            embedding_model: Model to use for text embeddings
            relevance_threshold: Threshold for chunk relevance to be included
            recency_weight: Weight for recency in relevance calculation
            importance_weight: Weight for importance in relevance calculation
            semantic_weight: Weight for semantic similarity in relevance calculation
            device: Device to run computations on
        """
        self.max_context_length = max_context_length
        self.device = device
        self.chunks = []
        self.current_token_count = 0
        self.relevance_threshold = relevance_threshold
        self.recency_weight = recency_weight
        self.importance_weight = importance_weight
        self.semantic_weight = semantic_weight

        # Load the sentence-embedding model, installing the package on first use if needed.
        try:
            from sentence_transformers import SentenceTransformer
            print(f"Loading embedding model {embedding_model}...")
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")
        except ImportError:
            print("Installing sentence-transformers...")
            import subprocess
            subprocess.check_call(["pip", "install", "sentence-transformers"])
            from sentence_transformers import SentenceTransformer
            self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
            print(f"Embedding model loaded successfully on {self.device}")

        # GPT-2 tokenizer is used purely for counting tokens in each chunk.
        try:
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.check_call(["pip", "install", "transformers"])
            from transformers import GPT2Tokenizer
            self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
        """
        Add a new chunk of text to the context manager.
        Args:
            text: The text content to add
            importance: Importance score (0-1)
            metadata: Additional metadata for the chunk
        """
        with torch.no_grad():
            embedding = self.embedding_model.encode(text, convert_to_tensor=True)

        chunk = ContextChunk(
            text=text,
            embedding=embedding,
            importance=importance,
            timestamp=time.time(),
            metadata=metadata or {}
        )

        self.chunks.append(chunk)
        self.current_token_count += len(self.tokenizer.encode(text))

        # Prune the context if adding this chunk exceeded the token budget.
        if self.current_token_count > self.max_context_length:
            self.optimize_context()
    def optimize_context(self) -> None:
        """Optimize context by removing less relevant chunks to fit within token limit."""
        if not self.chunks:
            return

        print("Optimizing context window...")

        scores = self.score_chunks()
        sorted_indices = np.argsort(scores)[::-1]

        # Greedily keep the highest-scoring chunks that fit within the token budget.
        new_chunks = []
        new_token_count = 0
        for idx in sorted_indices:
            chunk = self.chunks[idx]
            chunk_tokens = len(self.tokenizer.encode(chunk.text))
            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            elif chunk.importance > self.relevance_threshold * 1.5:
                # A highly important chunk that does not fit: try to evict an
                # already-included, low-scoring chunk to make room for it.
                for i, included_chunk in enumerate(new_chunks):
                    included_idx = sorted_indices[i]
                    if scores[included_idx] < self.relevance_threshold:
                        included_tokens = len(self.tokenizer.encode(included_chunk.text))
                        if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                            new_chunks[i] = chunk
                            new_token_count += chunk_tokens - included_tokens
                            break

        removed = len(self.chunks) - len(new_chunks)
        self.chunks = new_chunks
        self.current_token_count = new_token_count
        print(f"Context optimized: removed {removed} chunks, {new_token_count} tokens in use")

        # Release memory that is no longer referenced.
        gc.collect()
        if self.device == "cuda":
            torch.cuda.empty_cache()

    def score_chunks(self, query: str = None) -> np.ndarray:
        """
        Score chunks based on recency, importance, and semantic relevance.
        Args:
            query: Optional query to calculate semantic relevance against
        Returns:
            Array of scores for each chunk
        """
        if not self.chunks:
            return np.array([])

        current_time = time.time()
        max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0

        # Recency: the newest chunk scores 1.0, the oldest scores 0.0.
        recency_scores = np.array([
            1.0 - ((current_time - chunk.timestamp) / max_age)
            for chunk in self.chunks
        ])

        importance_scores = np.array([chunk.importance for chunk in self.chunks])

        if query is not None:
            # Cosine similarity between each chunk and the query, min-max normalized.
            query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
            similarity_scores = np.array([
                torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                for chunk in self.chunks
            ])
            similarity_scores = (similarity_scores - similarity_scores.min()) / \
                                (similarity_scores.max() - similarity_scores.min() + 1e-8)
        else:
            similarity_scores = np.ones(len(self.chunks))

        final_scores = (
            self.recency_weight * recency_scores +
            self.importance_weight * importance_scores +
            self.semantic_weight * similarity_scores
        )
        return final_scores
    def retrieve_context(self, query: str = None, k: int = None) -> str:
        """
        Retrieve the most relevant context for a given query.
        Args:
            query: The query to retrieve context for
            k: The maximum number of chunks to return (None = all relevant chunks)
        Returns:
            String containing the combined relevant context
        """
        if not self.chunks:
            return ""

        scores = self.score_chunks(query)
        relevant_indices = np.where(scores >= self.relevance_threshold)[0]
        # Order the surviving chunks from most to least relevant.
        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]

        if k is not None:
            relevant_indices = relevant_indices[:k]

        relevant_texts = [self.chunks[i].text for i in relevant_indices]
        return "\n\n".join(relevant_texts)
    def get_stats(self) -> Dict[str, Any]:
        """Get statistics about the current context state."""
        return {
            "chunk_count": len(self.chunks),
            "token_count": self.current_token_count,
            "max_tokens": self.max_context_length,
            "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
            "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
            "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
        }
    def visualize_context(self):
        """Visualize the current context window distribution."""
        try:
            import matplotlib.pyplot as plt
            import pandas as pd

            if not self.chunks:
                print("No chunks to visualize")
                return

            scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]

            # Tabular view of the chunk statistics (handy for further inspection).
            df = pd.DataFrame({
                'Size (tokens)': chunk_sizes,
                'Age (seconds)': relative_times,
                'Importance': importance,
                'Score': scores
            })

            fig, axs = plt.subplots(2, 2, figsize=(14, 10))

            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')

            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')

            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')

            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')

            plt.tight_layout()
            plt.show()
        except ImportError:
            print("Please install matplotlib and pandas for visualization")
            print('!pip install matplotlib pandas')
The ModelContextManager class orchestrates the end-to-end handling of context for LLMs by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most pertinent chunks, and convenient utilities for monitoring and visualizing context statistics.
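As a quick illustration of how the manager is used on its own, the sketch below (example strings and parameter values are invented for demonstration) adds two chunks of different importance and then retrieves whatever is relevant to a query:

manager = ModelContextManager(max_context_length=1024)
manager.add_chunk(
    "MCP scores chunks by recency, importance, and semantic similarity.",
    importance=0.9
)
manager.add_chunk(
    "Unrelated filler text that should rank lower for this query.",
    importance=0.2
)

print(manager.retrieve_context(query="How are chunks scored?"))
print(manager.get_stats())  # chunk_count, token_count, usage_percentage, ...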
class MCPColabDemo:
    """Demonstration of Model Context Protocol in Google Colab with a Language Model."""

    def __init__(
        self,
        model_name: str = "google/flan-t5-base",
        max_context_length: int = 2048,
        device: str = "cuda" if torch.cuda.is_available() else "cpu"
    ):
        """
        Initialize the MCP Colab demo with a specified model.
        Args:
            model_name: Hugging Face model name
            max_context_length: Maximum context length for the MCP manager
            device: Device to run the model on
        """
        self.device = device
        self.context_manager = ModelContextManager(
            max_context_length=max_context_length,
            device=device
        )

        # Load the seq2seq model and its tokenizer, installing transformers if missing.
        try:
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            print(f"Loading model {model_name}...")
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
        except ImportError:
            print("Installing transformers...")
            import subprocess
            subprocess.check_call(["pip", "install", "transformers"])
            from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            print(f"Model loaded successfully on {device}")
    def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
        """
        Add a document to the context by chunking it appropriately.
        Args:
            text: Document text
            chunk_size: Size of each chunk in characters
            overlap: Overlap between chunks in characters
        """
        chunks = []
        # Slide a character window over the document with the requested overlap.
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
            if len(chunk) > 20:
                chunks.append(chunk)

        print(f"Adding {len(chunks)} chunks to context...")
        for i, chunk in enumerate(tqdm(chunks)):
            # Weight chunks near the beginning and end of the document slightly higher.
            pos = i / len(chunks)
            importance = 1.0 - 0.5 * min(pos, 1 - pos)
            self.context_manager.add_chunk(
                text=chunk,
                importance=importance,
                metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
            )
    def process_query(self, query: str, max_new_tokens: int = 256) -> str:
        """
        Process a query using the context manager and model.
        Args:
            query: The query to process
            max_new_tokens: Maximum number of tokens in response
        Returns:
            Model response
        """
        self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})
        relevant_context = self.context_manager.retrieve_context(query=query)

        prompt = f"Context: {relevant_context}\n\nQuestion: {query}\n\nAnswer:"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        print("Generating response...")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
            )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Keep the model's answer in context so follow-up questions can reference it.
        self.context_manager.add_chunk(
            response,
            importance=0.9,
            metadata={"type": "response", "query": query}
        )
        return response
    def interactive_session(self):
        """Run an interactive session in the notebook."""
        from IPython.display import clear_output

        print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []

        while True:
            query = input("\nYour query: ")

            if query.lower() == 'exit':
                break

            if query.lower() == 'stats':
                print("\nContext Statistics:")
                stats = self.context_manager.get_stats()
                for key, value in stats.items():
                    print(f"{key}: {value}")
                self.context_manager.visualize_context()
                continue

            if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
                clear_output(wait=True)
                print("Context cleared!")
                continue

            response = self.process_query(query)
            conversation_history.append((query, response))

            print("\nResponse:")
            print(response)
            print("\n" + "-" * 50)

            stats = self.context_manager.get_stats()
            print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens "
                  f"({stats['usage_percentage']:.1f}%)")
The MCPColabDemo class ties the context manager to a seq2seq LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device, and provides utility methods for chunking and ingesting entire documents, processing user queries by prepending only the most relevant context, and running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
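A typical Colab workflow with this class might look like the sketch below; the file name and query are illustrative placeholders rather than part of the tutorial code:

demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)

# Hypothetical document path, used only for illustration.
with open("my_long_document.txt") as f:
    demo.add_document(f.read(), chunk_size=512, overlap=50)

answer = demo.process_query("Summarize the key points of the document.")
print(answer)

# demo.interactive_session()  # optionally drop into the interactive loop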
def run_mcp_demo():
    """Run a simple demo of the Model Context Protocol."""
    print("Running Model Context Protocol Demo...")

    context_manager = ModelContextManager(max_context_length=4096)

    print("Adding sample chunks...")
    context_manager.add_chunk(
        "The Model Context Protocol (MCP) is a framework for managing context "
        "windows in large language models. It helps optimize token usage and improve relevance.",
        importance=1.0
    )
    context_manager.add_chunk(
        "Context management involves techniques like sliding windows, chunking, "
        "and relevance filtering to handle large documents efficiently.",
        importance=0.8
    )

    # Add filler chunks with gradually decreasing importance to exercise optimization.
    for i in range(10):
        context_manager.add_chunk(
            f"This is test chunk {i} with some filler content to simulate a larger context "
            f"window that needs optimization. This helps demonstrate the MCP functionality "
            f"for context window management in language models on Google Colab.",
            importance=0.5 - (i * 0.02)
        )

    stats = context_manager.get_stats()
    print("\nInitial Statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")

    query = "How does the Model Context Protocol work?"
    print(f"\nRetrieving context for: '{query}'")
    context = context_manager.retrieve_context(query)
    print(f"\nRelevant context:\n{context}")

    print("\nVisualizing context:")
    context_manager.visualize_context()

    print("\nDemo complete!")
The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.
if __name__ == "__main__":
    run_mcp_demo()
Finally, this standard Python entry-point guard ensures that the run_mcp_demo() function executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.
In conclusion, you now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes the context fragments that truly matter for your queries. The ModelContextManager equips you with tools to balance semantic relevance, temporal freshness, and user-assigned importance, while the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating alternative LLM backends to tailor the system to your domain-specific workflows; one such variation is sketched below. Ultimately, this approach lets you build concise yet highly relevant prompts, resulting in more accurate and efficient responses from your language models.
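As one example of the kind of tailoring mentioned above, the sketch below re-instantiates the manager with an alternative Sentence-Transformers encoder and re-balanced weights; the specific values are illustrative assumptions, not tuned recommendations:

tuned_manager = ModelContextManager(
    max_context_length=4096,
    embedding_model="sentence-transformers/all-mpnet-base-v2",  # assumed larger encoder
    relevance_threshold=0.6,   # admit more borderline chunks into context
    recency_weight=0.2,
    importance_weight=0.3,
    semantic_weight=0.5,       # weights still sum to 1.0
)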