
A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

In this tutorial, we will build a Regression Language Model (RLM), which predicts continuous numerical values directly from text sequences. Unlike traditional models that classify or generate text, our focus is on training a transformer-based architecture that learns the quantitative relationships embedded within natural language descriptions.

Target Audience Analysis

The target audience for this implementation primarily includes:

  • Data scientists and machine learning engineers seeking to enhance their understanding of regression models in natural language processing.
  • Business analysts looking to leverage AI for predictive analytics in various domains, such as finance, marketing, and operations.
  • Academic researchers interested in exploring advanced machine learning techniques and their applications.

Pain Points: The audience may struggle with:

  • Understanding the complexities of implementing transformer models for regression tasks.
  • Finding reliable resources and tutorials that provide a clear, step-by-step approach.
  • Visualizing model performance and interpreting results effectively.

Goals: Their goals include:

  • Developing a robust understanding of regression models in the context of NLP.
  • Applying these models to real-world data for predictive insights.
  • Enhancing their technical skills and knowledge in AI and machine learning.

Interests: The audience is likely interested in:

  • Hands-on coding examples and practical applications of machine learning techniques.
  • Visualizations that help in understanding model behavior and performance.
  • Networking with other professionals in the AI and machine learning community.

Communication Preferences: They prefer:

  • Clear, concise, and structured content that is easy to follow.
  • Technical details accompanied by practical examples.
  • Interactive and engaging formats, such as tutorials and webinars.

Implementation Overview

We will generate synthetic text-to-number data, tokenize it, and train a lightweight transformer encoder to map linguistic cues to real-valued targets. By the end of this tutorial, you will understand how to implement RLMs from scratch, visualize their learning behavior, and test their generalization on unseen examples.
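The setup cell is not shown in this post, so the snippet below is a minimal sketch of the imports and device handle that the later code assumes; the names match how they are used throughout the tutorial.

import re
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Use a GPU when available; all later snippets reference this handle.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")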

Generating Synthetic Data

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data"""
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data

This function creates a synthetic dataset that pairs natural-language sentences with numerical targets. The templates vary not only in wording but also in scale: some map the mentioned number directly to the target, while others divide it by 10 or 100 (for speeds and percentages), so the model must learn template-dependent relationships rather than simply copying the number it sees.
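As a quick sanity check, we can generate the dataset and print a few pairs; the exact samples will differ from run to run because templates and values are drawn at random.

data = generate_synthetic_data(n_samples=2000)
print(f"Generated {len(data)} pairs")
for text, target in data[:3]:
    print(f"{text!r} -> {target:.3f}")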

Tokenization

class SimpleTokenizer:
    def __init__(self):
        # Reserve index 0 for padding and index 1 for out-of-vocabulary words
        self.word2idx = {"<pad>": 0, "<unk>": 1}
        self.idx2word = {0: "<pad>", 1: "<unk>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build vocabulary from texts"""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices, padding or truncating to max_len"""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]  # 1 = <unk>
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))  # 0 = <pad>
        else:
            indices = indices[:max_len]
        return indices

This tokenizer builds a word-level vocabulary from the training texts, reserving index 0 for padding and index 1 for unknown words, and converts each sentence into a fixed-length sequence of token indices (padded or truncated to max_len) that the model can process.
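A short usage sketch, assuming the data list generated earlier:

# Build the vocabulary from the training texts, then encode one sentence
texts = [text for text, _ in data]
tokenizer = SimpleTokenizer()
tokenizer.fit(texts)

print("Vocabulary size:", tokenizer.vocab_size)
print(tokenizer.encode("The temperature is 25.5 degrees"))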

Creating the Dataset

class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)

This class packages our text-number pairs into a PyTorch Dataset, preparing them for batching and training.
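One way to wire this into training and validation loaders is sketched below; the 80/20 split and batch size of 32 are illustrative choices, not values prescribed by the original post.

# Split the synthetic data and wrap each half in a DataLoader
split = int(0.8 * len(data))
train_dataset = RLMDataset(data[:split], tokenizer)
val_dataset = RLMDataset(data[split:], tokenizer)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)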

Building the Regression Language Model

class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regression head: pooled encoder output -> hidden layer -> single scalar
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        # Mean-pool the encoder outputs over non-padding positions only
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output

The model combines token and positional embeddings, passes them through a multi-layer transformer encoder with a padding mask, mean-pools the encoder outputs over non-padding positions, and feeds the pooled representation into a small two-layer regression head that outputs a single continuous value.
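Instantiating the model is straightforward; the parameter count printed below is simply a convenient sanity check and not part of the original code.

model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size).to(device)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")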

Training the Model

def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses

We train the model using Adam and MSE loss, iterating over mini-batches to backpropagate and update weights while tracking training and validation losses.
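Calling the trainer and plotting the two loss curves makes the learning behavior easy to inspect. The matplotlib plot below is an illustrative addition on top of what train_rlm returns, not code from the original post.

import matplotlib.pyplot as plt

train_losses, val_losses = train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001)

plt.plot(train_losses, label="Train loss")
plt.plot(val_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.title("RLM training curves")
plt.show()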

Testing Predictions

After training, we can test the model's predictions on unseen examples:

model.eval()  # disable dropout for inference
test_examples = [
    "The temperature is 25.5 degrees",
    "I rate this 8.0 out of ten",
    "The price is 45.0 dollars",
    "75.0 percent complete"
]
with torch.no_grad():
    for text in test_examples:
        tokens = torch.tensor([tokenizer.encode(text)]).to(device)
        prediction = model(tokens).item()
        print(f"Input: {text}")
        print(f"Predicted value: {prediction:.4f}\n")

In conclusion, we successfully designed, trained, and evaluated a Regression Language Model capable of predicting continuous values from textual inputs. This implementation demonstrates how combining positional embeddings, transformer encoders, and a regression head enables the model to capture the numerical semantics embedded in language.

For the complete code and additional resources, please visit our GitHub page.