Build a Low-Footprint AI Coding Assistant with Mistral Devstral

In this tutorial, we provide a Colab-friendly guide designed for users facing disk space constraints. Running large language models like Mistral can be challenging in environments with limited storage and memory. This tutorial demonstrates how to deploy the Devstral Small (devstral-small-2505) model using aggressive 4-bit quantization, proactive cache management, and memory-efficient token generation. The result is a small disk and memory footprint with responsive generation, making the setup ideal for debugging code, writing small tools, or prototyping on the go.

Installation of Essential Packages

To get started, install the necessary lightweight packages:

!pip install -q kagglehub mistral-common bitsandbytes transformers --no-cache-dir
!pip install -q accelerate torch --no-cache-dir

The --no-cache-dir flag prevents pip from keeping downloaded wheels around, minimizing disk usage. These packages cover model download (kagglehub), tokenization (mistral-common), 4-bit quantization (bitsandbytes), and model loading and inference (transformers, accelerate, torch).
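
The snippets that follow also assume a handful of imports. A minimal set, based on the packages installed above, might look like this (the module paths follow the documented mistral-common and transformers layouts; adjust them if your installed versions differ):

import os
import gc
import shutil

import torch
import kagglehub
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest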

Cache Management

To maintain a minimal disk footprint, we define a function to clean up unnecessary files:

def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()

This proactive cleanup helps free up space before and after key operations.
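
For example, running the cleanup once right after installation reclaims any space the pip or kagglehub caches may already be holding:

# Reclaim space left behind by package installation before downloading the model
cleanup_cache()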

Model Initialization

Next, we define the LightweightDevstral class, which handles model loading and text generation:

class LightweightDevstral:
    def __init__(self):
        print("Downloading model (streaming mode)...")
        # Download the weights via kagglehub; force_download=False reuses an existing copy
        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        # 4-bit NF4 quantization with double quantization keeps the memory footprint minimal
        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        # Devstral ships a Tekken tokenizer file; load it with mistral-common
        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')

        cleanup_cache()
        print("Lightweight assistant ready! (~2 GB disk usage)")

Memory-Efficient Generation

The generate method keeps memory usage in check: it runs under torch.inference_mode(), frees the prompt tensor immediately after generation, and decodes only the newly generated tokens:

    def generate(self, prompt, max_tokens=400):
        """Memory-efficient generation"""
        # Build the chat request and tokenize it with mistral-common
        tokenized = self.tokenizer.encode_chat_completion(
            ChatCompletionRequest(messages=[UserMessage(content=prompt)])
        )
        input_ids = torch.tensor([tokenized.tokens])
        if torch.cuda.is_available():
            input_ids = input_ids.to(self.model.device)

        with torch.inference_mode():
            output = self.model.generate(
                input_ids=input_ids,
                max_new_tokens=max_tokens,
                temperature=0.6,
                top_p=0.85,
                do_sample=True,
                # use the EOS id exposed by the underlying mistral-common tokenizer as pad id
                pad_token_id=self.tokenizer.instruct_tokenizer.tokenizer.eos_id,
                use_cache=True
            )[0]

        # Free the prompt tensor and any cached GPU memory right away
        del input_ids
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Decode only the newly generated tokens (skip the prompt)
        return self.tokenizer.decode(output[len(tokenized.tokens):].tolist())
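
With both methods in place, the assistant can be instantiated and queried. The interactive session in the next section expects a module-level assistant variable, so we create it here; the prompt below is purely illustrative:

# Create the assistant once; the interactive session below reuses this global
assistant = LightweightDevstral()

# Illustrative one-off query (example prompt, not from the original notebook)
print(assistant.generate("Write a Python function that checks whether a string is a palindrome.", max_tokens=200))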

Interactive Coding Mode

We introduce Quick Coding Mode, a lightweight interactive loop that sends short coding prompts to the assistant instance created above, capped at five prompts per session to keep memory usage predictable:

def quick_coding():
    """Lightweight interactive session"""
    print("\nQUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")

    session_count = 0
    max_sessions = 5

    while session_count < max_sessions:
        prompt = input(f"\n[{session_count+1}/{max_sessions}] Your prompt: ")
        if prompt.lower() in ['exit', 'quit', '']:
            break
        try:
            result = assistant.generate(prompt, max_tokens=300)
            print("Solution:")
            print(result[:500])
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except Exception as e:
            print(f"Error: {str(e)[:100]}...")
        session_count += 1

    print("\nSession complete! Memory cleaned.")

Disk Usage Monitoring

Finally, we provide a disk usage monitor:

def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        # Parse `df -h /` to report used and available space on the root filesystem
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"Disk: {used} used, {available} available")
    except Exception:
        print("Disk usage check unavailable")

This tutorial demonstrates how to run Mistral’s Devstral Small model in space-constrained environments like Google Colab without compromising usability or speed. The model loads in a 4-bit quantized format, generates text under inference mode with conservative token limits, and promptly clears memory and disk caches after use.
