How to use abokbot/wikipedia-embedding with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("abokbot/wikipedia-embedding")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
We use the MS MARCO encoder msmarco-MiniLM-L-6-v3 from the sentence-transformers library to encode the text from the dataset abokbot/wikipedia-first-paragraph.
The dataset contains the first paragraphs of the English "20220301.en" version of the Wikipedia dataset.
The output is an embedding tensor of size [6458670, 384].
It was obtained by running the following code.
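from datasets import load_dataset
from sentence_transformers import SentenceTransformer
# split="train" is an assumption: without a split, load_dataset returns a
# DatasetDict, which cannot be indexed by column name.
dataset = load_dataset("abokbot/wikipedia-first-paragraph", split="train")
# MS MARCO bi-encoder that produces 384-dimensional embeddings
bi_encoder = SentenceTransformer('msmarco-MiniLM-L-6-v3')
# Truncate each input to at most 256 tokens
bi_encoder.max_seq_length = 256
# Encode every first paragraph into a single embedding tensor
wikipedia_embedding = bi_encoder.encode(dataset["text"], convert_to_tensor=True, show_progress_bar=True)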
This operation took 35 minutes on a Google Colab notebook with a GPU.
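For a sense of scale, the size of the [6458670, 384] tensor can be estimated with simple arithmetic. The sketch below assumes the embeddings are stored as float32 (4 bytes per value), which the card does not state; the helper name `embedding_bytes` is hypothetical.

```python
# Rough memory footprint of the [6458670, 384] embedding tensor.
# Assumption (not stated in the card): values are stored as float32.
NUM_PARAGRAPHS = 6_458_670
EMBEDDING_DIM = 384
BYTES_PER_FLOAT32 = 4


def embedding_bytes(rows: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw size in bytes of a dense rows x dim matrix."""
    return rows * dim * bytes_per_value


total_bytes = embedding_bytes(NUM_PARAGRAPHS, EMBEDDING_DIM)
# Report the size in decimal gigabytes and binary gibibytes.
print(f"{total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
```

Under that assumption the tensor takes roughly 9.9 GB, so loading it in full may not fit on a small GPU or in a low-memory notebook.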
More information on the MS MARCO encoders is available at https://www.sbert.net/docs/pretrained-models/ce-msmarco.html
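Once computed, embeddings like these are typically used for similarity search. The following is a minimal sketch of that idea and is not part of the original pipeline: it ranks corpus rows by cosine similarity to a query vector using NumPy. The function name `top_k_cosine`, the toy vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def top_k_cosine(query_embedding, corpus_embeddings, k=3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    corpus = np.asarray(corpus_embeddings, dtype=np.float32)
    query = np.asarray(query_embedding, dtype=np.float32)
    # Normalize so that the dot product equals cosine similarity.
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    # Sort by descending score and keep the first k row indices.
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Toy stand-in for the real embedding tensor: 4 "paragraphs", 3 dimensions.
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

indices, scores = top_k_cosine(toy_query, toy_corpus, k=2)
print(indices.tolist())
```

With the real data, the query vector would come from encoding a question with the same msmarco-MiniLM-L-6-v3 model, and the returned indices would point at rows of abokbot/wikipedia-first-paragraph. The sentence-transformers library also provides `util.semantic_search`, which may be a better fit for a corpus of this size.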