Code Embeddings
18 Apr 2023

In this post, I would like to explore the idea of using embedding vectors to represent code snippets, and to compute cosine similarity scores between a few examples. I will compare OpenAI's text-embedding-ada-002 with two open-source models, SantaCoder and Salesforce CodeGen. The OpenAI model was trained on general text data, but it is the only embedding model the company currently offers. The other two models were trained on code to generate code, so we need to do some hacking to get embedding vectors out of them.
Setup
from itertools import combinations
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
How do we get the embedding vectors from a model? We can use the transformer part of the model to get the hidden states, the layer-by-layer outputs of the transformer blocks. Thanks to a conversation with Vlad, I decided to use the last hidden state. The shape of this output is (batch_size, seq_len, hidden_size), so we need to aggregate over the seq_len dimension to get a "summary" of the whole sequence. We could take the mean, the max, or some other reduction. I tried the max, but the similarity scores were all close to 1, so let's stick with the mean here.
It is worth spelling out the variety of choices involved. Echoing my previous post, the embeddings from a generative model were never trained to definitively tell things apart.
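To make the pooling choice concrete, here is a minimal sketch of the two options, with a random tensor standing in for a real hidden state:
hidden = torch.randn(1, 5, 8)  # a stand-in last hidden state: (batch_size=1, seq_len=5, hidden_size=8)
mean_pooled = hidden.mean(dim=1)[0]       # shape: (hidden_size,)
max_pooled = hidden.max(dim=1).values[0]  # shape: (hidden_size,)
print(mean_pooled.shape, max_pooled.shape)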
class EmbeddingModel:
    """
    Wrapper class to get the embedding vector from a model.
    """

    def __init__(self, tokenizer, model):
        self.tokenizer = tokenizer
        self.model = model

    def get_embedding(self, code):
        with torch.no_grad():
            inputs = self.tokenizer(code, return_tensors="pt").to(self.model.device)
            # here we use the transformer part (not the lm head, which is used for generation).
            outputs = self.model.transformer(**inputs)
        # we could use the last_hidden_state, or the second to last...
        # the shape is (batch_size, seq_len, hidden_size), so we aggregate over the seq_len dimension.
        # we choose to take the mean here.
        return outputs.last_hidden_state.mean(dim=1)[0]
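As an aside, reaching into self.model.transformer is not the only way to get this tensor; if I read the transformers API right, passing output_hidden_states=True to the full model's forward pass should yield the same last hidden state. A sketch (the helper name is mine):
def get_embedding_alt(tokenizer, model, code):
    with torch.no_grad():
        inputs = tokenizer(code, return_tensors="pt").to(model.device)
        # hidden_states is a tuple with one entry per layer (plus the input embeddings);
        # the last entry is the final transformer output, (batch_size, seq_len, hidden_size)
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)[0]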
def cosine_similarity(embedding1, embedding2):
    """
    Compute the cosine similarity between two embedding vectors.
    """
    return embedding1.dot(embedding2) / embedding1.norm() / embedding2.norm()
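As a sanity check, this hand-rolled version should agree with PyTorch's built-in torch.nn.functional.cosine_similarity on random vectors:
import torch.nn.functional as F

a, b = torch.randn(2048), torch.randn(2048)
assert torch.isclose(cosine_similarity(a, b), F.cosine_similarity(a, b, dim=0))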
Get some snippets of code to compare.
code_snippets = []
code_snippets.append("""def print_hello_world():
    print("hello world")
""")
code_snippets.append("""def do_something():
    a = 1
    b = a + 1
    return b
""")
code_snippets.append("""def hello_world():
    return "hello world"
""")
OpenAI text-embedding-ada-002
Here we just initialize and define one function for later use.
# openai
import openai

openai.api_key = "YOUR_API_KEY"

def get_openai_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return torch.tensor(openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"])
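A quick smoke test (this calls the API, so it needs a valid key; ada-002 vectors are 1536-dimensional):
embedding = get_openai_embedding("def hello_world(): return 'hello world'")
print(embedding.shape)  # torch.Size([1536])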
SantaCoder
# bigcode/santacoder
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True).to(device)
santacoder = EmbeddingModel(tokenizer, model)
Let’s see how the model generates code from a prompt.
code = """def binary_search():
"""
inputs = santacoder.tokenizer(code, return_tensors="pt").to(device)
outputs = santacoder.model.generate(**inputs, max_length=512)
print(santacoder.tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
# don't think `truncate_before_pattern` had any effect here
Output:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def binary_search():
    # 1. Get the input
    n = int(input())
    # 2. Initialize the list
    arr = [0] * n
    # 3. Get the elements
    for i in range(n):
        arr[i] = int(input())
    # 4. Find the index
    index = binary_search_recursive(arr, 0, n - 1)
    # 5. Print the result
    print(index)

def binary_search_recursive(arr, left, right):
    # 1. Base case
    if left > right:
        return -1
    # 2. Get the middle index
    mid = (left + right) // 2
    # 3. Check if the element is present at the middle index
    if arr[mid] == mid:
        return mid
    # 4. Check if the element is smaller than the middle index
    elif arr[mid] < mid:
        return binary_search_recursive(arr, left, mid - 1)
    # 5. Check if the element is greater than the middle index
    else:
        return binary_search_recursive(arr, mid + 1, right)

if __name__ == '__main__':
    binary_search()<|endoftext|>...
The truncate_before_pattern argument didn't seem to take effect here, and I saw a lot more code after <|endoftext|>. It doesn't affect this experiment, so I'll leave it as is.
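If the trailing text did bother us, a simple workaround (just a sketch, not what truncate_before_pattern is meant for) would be to cut the decoded string at the end-of-text token manually:
decoded = santacoder.tokenizer.decode(outputs[0])
print(decoded.split("<|endoftext|>")[0])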
What does a SantaCoder embedding vector look like?
embedding = santacoder.get_embedding("def hello_world(): return 'hello world'")
print(embedding.shape)
print(embedding)
Output:
torch.Size([2048])
tensor([-0.2162, 0.7902, -1.0303, ..., 1.9904, -1.1341, -2.7802],
device='cpu')
Time to compute the cosine similarity scores using the embedding vectors. We denote SIM local as the results from the open-source model, and SIM openai as the results from the OpenAI model.
for code1, code2 in combinations(code_snippets, 2):
    sim_local = cosine_similarity(santacoder.get_embedding(code1), santacoder.get_embedding(code2)).item()
    sim_openai = cosine_similarity(get_openai_embedding(code1), get_openai_embedding(code2)).item()
    print(code1)
    print("----------")
    print(code2)
    print(f"SIM local: {sim_local:.2f} SIM openai: {sim_openai:.2f}")
    print("=================================")
Output:
def print_hello_world():
    print("hello world")
----------
def do_something():
    a = 1
    b = a + 1
    return b
SIM local: 0.79 SIM openai: 0.81
=================================
def print_hello_world():
    print("hello world")
----------
def hello_world():
    return "hello world"
SIM local: 0.90 SIM openai: 0.93
=================================
def do_something():
    a = 1
    b = a + 1
    return b
----------
def hello_world():
    return "hello world"
SIM local: 0.84 SIM openai: 0.82
=================================
You know… They agree quite well. If I don’t want to use the OpenAI model, I can just use SantaCoder.
Salesforce CodeGen
We use only the 2B-parameter model for this experiment.
# salesforce codegen-2B-mono
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono").to(device)
codegen = EmbeddingModel(tokenizer, model)
Same here. Let’s see how the model generates code from a prompt.
code = """def binary_search():
"""
inputs = codegen.tokenizer(code, return_tensors="pt").to(device)
outputs = codegen.model.generate(**inputs, max_length=512)
print(codegen.tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
Output:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
def binary_search():
    print("Binary Search")
    arr = list(map(int, input("Enter the array: ").split()))
    key = int(input("Enter the key: "))
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == key:
            print("Element found at index: ", mid)
            break
        elif arr[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    else:
        print("Element not found")
Examine the CodeGen embedding vector.
embedding = codegen.get_embedding("def hello_world(): return 'hello world'")
print(embedding.shape)
print(embedding)
Output:
torch.Size([2560])
tensor([ 1.2350, 1.5245, 1.9772, ..., 3.1278, -2.5663, -1.3473],
device='cpu')
Compute the cosine similarity scores. Here SIM local is from the Salesforce model, and SIM openai from the OpenAI model.
for code1, code2 in combinations(code_snippets, 2):
    sim_local = cosine_similarity(codegen.get_embedding(code1), codegen.get_embedding(code2)).item()
    sim_openai = cosine_similarity(get_openai_embedding(code1), get_openai_embedding(code2)).item()
    print(code1)
    print("----------")
    print(code2)
    print(f"SIM local: {sim_local:.2f} SIM openai: {sim_openai:.2f}")
    print("=================================")
Output:
def print_hello_world():
    print("hello world")
----------
def do_something():
    a = 1
    b = a + 1
    return b
SIM local: 0.98 SIM openai: 0.81
=================================
def print_hello_world():
    print("hello world")
----------
def hello_world():
    return "hello world"
SIM local: 0.99 SIM openai: 0.93
=================================
def do_something():
    a = 1
    b = a + 1
    return b
----------
def hello_world():
    return "hello world"
SIM local: 0.98 SIM openai: 0.82
=================================
Here CodeGen gives scores all close to 1; it doesn't seem to differentiate between the snippets at all. I tried the 6B model as well, but the results were similar. CodeGen should be a decent model for code generation, but apparently the embedding vector obtained from its last hidden state is not suitable for code similarity.
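My guess, and it is only a guess, is that the CodeGen embeddings share a large common component that dominates every vector and pushes all cosine similarities toward 1. A quick diagnostic sketch would be to mean-center the embeddings before comparing them:
# sketch: subtract the mean embedding, then recompute the pairwise similarities
embeddings = torch.stack([codegen.get_embedding(code) for code in code_snippets])
centered = embeddings - embeddings.mean(dim=0)
for i, j in combinations(range(len(code_snippets)), 2):
    print(i, j, cosine_similarity(centered[i], centered[j]).item())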