Build a Low-Footprint AI Coding Assistant with Mistral Devstral
In this tutorial, we provide a Colab-friendly guide designed for users facing disk space constraints. Running large language models such as Mistral's Devstral can be challenging in environments with limited storage and memory. This tutorial demonstrates how to deploy the devstral-small model using aggressive 4-bit quantization, proactive cache management, and memory-efficient token generation. The result is a setup with a small disk and memory footprint that is still practical for debugging code, writing small tools, or prototyping on the go.
Installation of Essential Packages
To get started, install the necessary lightweight packages:
!pip install -q kagglehub mistral-common bitsandbytes transformers --no-cache-dir
!pip install -q accelerate torch --no-cache-dir
This installation ensures that no cache is stored, minimizing disk usage. Essential libraries for efficient model loading and inference are also included.
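As a quick sanity check, the snippet below (illustrative, not part of the original tutorial code) confirms that the freshly installed libraries import cleanly and reports whether a GPU is available before anything is downloaded:

# Illustrative sanity check: verify the key libraries load after installation
import torch
import transformers
import kagglehub          # used later to fetch the Devstral weights
import mistral_common     # provides the Tekken tokenizer utilities

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())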
Cache Management
To maintain a minimal disk footprint, we define a function to clean up unnecessary files:
import gc
import os
import shutil

def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()
This proactive cleanup helps free up space before and after key operations.
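To illustrate the intent, the short sketch below (hypothetical; report_free_space is a helper added here only for demonstration) wraps a call to cleanup_cache() with before-and-after readings of free disk space:

import shutil

def report_free_space(label):
    """Hypothetical helper: print the currently free disk space in GB."""
    free_gb = shutil.disk_usage('/').free / 1e9
    print(f"[{label}] free disk space: {free_gb:.1f} GB")

report_free_space("before cleanup")
cleanup_cache()
report_free_space("after cleanup")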
Model Initialization
Next, we define the LightweightDevstral class, which handles model loading and text generation:
import torch
import kagglehub
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

class LightweightDevstral:
    def __init__(self):
        print("Downloading model (streaming mode)...")
        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        # 4-bit NF4 quantization keeps the weights as small as possible
        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        # Devstral ships its tokenizer as a Tekken file
        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')

        cleanup_cache()
        print("Lightweight assistant ready! (~2 GB disk usage)")
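The generate method covered in the next section is also part of this class. Once the class is defined, a single instance is created and reused for the rest of the session; the interactive mode later in the tutorial expects it in a variable named assistant:

# One shared instance; the download and 4-bit load happen here
assistant = LightweightDevstral()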
Memory-Efficient Generation
The generate method employs memory-safe practices:
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

def generate(self, prompt, max_tokens=400):   # method of LightweightDevstral
    """Memory-efficient generation"""
    tokenized = self.tokenizer.encode_chat_completion(
        ChatCompletionRequest(messages=[UserMessage(content=prompt)])
    )
    input_ids = torch.tensor([tokenized.tokens])
    if torch.cuda.is_available():
        input_ids = input_ids.to(self.model.device)

    with torch.inference_mode():
        output = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=max_tokens,
            temperature=0.6,
            top_p=0.85,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
            use_cache=True
        )[0]

    # Release the input tensor and any cached GPU memory right away
    del input_ids
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # Decode only the newly generated tokens (skip the prompt)
    return self.tokenizer.decode(output[len(tokenized.tokens):].tolist())
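Here is an illustrative one-off call (the prompt is just an example) showing how the method is used outside the interactive loop:

# Illustrative single-shot generation with the shared assistant instance
demo_prompt = "Write a Python function that checks whether a string is a palindrome."
print(assistant.generate(demo_prompt, max_tokens=200))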
Interactive Coding Mode
We introduce Quick Coding Mode, allowing users to submit short coding prompts directly:
def quick_coding():
    """Lightweight interactive session"""
    print("\nQUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")

    session_count = 0
    max_sessions = 5   # cap the session length to keep a single run bounded

    while session_count < max_sessions:
        prompt = input(f"\n[{session_count+1}/{max_sessions}] Your prompt: ")
        if prompt.lower() in ['exit', 'quit', '']:
            break

        try:
            # Uses the shared LightweightDevstral instance named `assistant`
            result = assistant.generate(prompt, max_tokens=300)
            print("Solution:")
            print(result[:500])   # truncate long answers to keep the console readable

            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except Exception as e:
            print(f"Error: {str(e)[:100]}...")

        session_count += 1

    print("\nSession complete! Memory cleaned.")
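Starting a session is then a single call (assuming the assistant instance created earlier is in scope):

# Launch the capped interactive session
quick_coding()

The max_sessions cap keeps any single run short, which pairs well with the per-prompt gc.collect() and GPU cache clearing inside the loop.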
Disk Usage Monitoring
Finally, we provide a disk usage monitor:
def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        # Parse `df -h /` output (works on Linux hosts such as Colab)
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"Disk: {used} used, {available} available")
    except Exception:
        print("Disk usage check unavailable")
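An illustrative way to use it is to take a reading right after the model has loaded:

# Illustrative usage: check the footprint once setup is done
check_disk_usage()   # prints a line of the form "Disk: <used> used, <available> available"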
This tutorial demonstrates how to leverage the capabilities of Mistral’s Devstral model in space-constrained environments like Google Colab, without compromising usability or speed. The model loads in a highly compressed format, performs efficient text generation, and ensures memory is promptly cleared after use.