Matryoshka Representation Learning
Paper • 2205.13147 • Published • 26
How to use checkthisout/finetuned_arctic with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("checkthisout/finetuned_arctic")
sentences = [
"How have algorithms in hiring and credit decisions been shown to impact existing inequities, according to the context?",
"Shoshana Zuboff. The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of\nPower. Public Affairs. 2019.\n64. Angela Chen. Why the Future of Life Insurance May Depend on Your Online Presence. The Verge. Feb.\n7, 2019.\nhttps://www.theverge.com/2019/2/7/18211890/social-media-life-insurance-new-york-algorithms-big\ndata-discrimination-online-records\n68",
"SECTION TITLE\nFOREWORD\nAmong the great challenges posed to democracy today is the use of technology, data, and automated systems in \nways that threaten the rights of the American public. Too often, these tools are used to limit our opportunities and \nprevent our access to critical resources or services. These problems are well documented. In America and around \nthe world, systems supposed to help with patient care have proven unsafe, ineffective, or biased. Algorithms used \nin hiring and credit decisions have been found to reflect and reproduce existing unwanted inequities or embed \nnew harmful bias and discrimination. Unchecked social media data collection has been used to threaten people’s",
"ways and to the greatest extent possible; where not possible, alternative privacy by design safeguards should be \nused. Systems should not employ user experience and design decisions that obfuscate user choice or burden \nusers with defaults that are privacy invasive. Consent should only be used to justify collection of data in cases \nwhere it can be appropriately and meaningfully given. Any consent requests should be brief, be understandable \nin plain language, and give you agency over data collection and the specific context of use; current hard-to\nunderstand notice-and-choice practices for broad uses of data should be changed. Enhanced protections and"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("checkthisout/finetuned_arctic")
# Run inference
sentences = [
'What are the implications of surveillance technologies on the rights and opportunities of underserved communities?',
'limits its focus to both government and commercial use of surveillance technologies when juxtaposed with \nreal-time or subsequent automated analysis and when such systems have a potential for meaningful impact \non individuals’ or communities’ rights, opportunities, or access. \nUNDERSERVED COMMUNITIES: The term “underserved communities” refers to communities that have \nbeen systematically denied a full opportunity to participate in aspects of economic, social, and civic life, as \nexemplified by the list in the preceding definition of “equity.” \n11',
'manage risks associated with activities or business processes common across sectors, such as the use of \nlarge language models (LLMs), cloud-based services, or acquisition. \nThis document defines risks that are novel to or exacerbated by the use of GAI. After introducing and \ndescribing these risks, the document provides a set of suggested actions to help organizations govern, \nmap, measure, and manage these risks. \n \n \n1 EO 14110 defines Generative AI as “the class of AI models that emulate the structure and characteristics of input \ndata in order to generate derived synthetic content. This can include images, videos, audio, text, and other digital',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
InformationRetrievalEvaluator| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.805 |
| cosine_accuracy@3 | 0.925 |
| cosine_accuracy@5 | 0.965 |
| cosine_accuracy@10 | 0.97 |
| cosine_precision@1 | 0.805 |
| cosine_precision@3 | 0.3083 |
| cosine_precision@5 | 0.193 |
| cosine_precision@10 | 0.097 |
| cosine_recall@1 | 0.805 |
| cosine_recall@3 | 0.925 |
| cosine_recall@5 | 0.965 |
| cosine_recall@10 | 0.97 |
| cosine_ndcg@10 | 0.8921 |
| cosine_mrr@10 | 0.8663 |
| cosine_map@100 | 0.868 |
| dot_accuracy@1 | 0.805 |
| dot_accuracy@3 | 0.925 |
| dot_accuracy@5 | 0.965 |
| dot_accuracy@10 | 0.97 |
| dot_precision@1 | 0.805 |
| dot_precision@3 | 0.3083 |
| dot_precision@5 | 0.193 |
| dot_precision@10 | 0.097 |
| dot_recall@1 | 0.805 |
| dot_recall@3 | 0.925 |
| dot_recall@5 | 0.965 |
| dot_recall@10 | 0.97 |
| dot_ndcg@10 | 0.8921 |
| dot_mrr@10 | 0.8663 |
| dot_map@100 | 0.868 |
sentence_0 and sentence_1| sentence_0 | sentence_1 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence_0 | sentence_1 |
|---|---|
What groups are involved in the processes that require cooperation and collaboration? |
processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, |
Why is collaboration among different sectors important in these processes? |
processes require the cooperation of and collaboration among industry, civil society, researchers, policymakers, |
What did the panelists emphasize regarding the regulation of technology before it is built and instituted? |
(before the technology is built and instituted). Various panelists also emphasized the importance of regulation |
MatryoshkaLoss with these parameters:{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
eval_strategy: stepsper_device_train_batch_size: 20per_device_eval_batch_size: 20num_train_epochs: 5multi_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: stepsprediction_loss_only: Trueper_device_train_batch_size: 20per_device_eval_batch_size: 20per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 5max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Falsehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseeval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseeval_use_gather_object: Falsebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | cosine_map@100 |
|---|---|---|
| 1.0 | 40 | 0.8449 |
| 1.25 | 50 | 0.8586 |
| 2.0 | 80 | 0.8693 |
| 2.5 | 100 | 0.8702 |
| 3.0 | 120 | 0.8703 |
| 3.75 | 150 | 0.8715 |
| 4.0 | 160 | 0.8659 |
| 5.0 | 200 | 0.8680 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m