Jacob Chanyeol Choi & Junseong Kim
May 29, 2024
Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement
We are proud to present Linq-Embed-Mistral, a breakthrough in the realm of text retrieval.
Abstract
Linq-Embed-Mistral represents a significant leap forward in text retrieval technology, leveraging the robust foundations of E5-mistral and Mistral-7B-v0.1. The model achieves an average score of 68.2 across 56 datasets in the Massive Text Embedding Benchmark (MTEB) as of May 29, 2024, making it the highest-ranking publicly accessible model and third overall. In terms of retrieval performance, it scores 60.2, placing it first among all models listed on the MTEB leaderboard. This outstanding performance underscores its superior capability in enhancing search precision and reliability.
Within the Mistral model series, Linq-Embed-Mistral focuses on creating and integrating sophisticated synthetic datasets, boosting the retrieval score from 56.9 (E5-mistral) and 59.0 (SFR) to 60.2. Our advanced data refinement methods significantly enhance the model’s ability to identify misleading documents and ensure more accurate results. The model’s rapid training and evaluation capabilities facilitate the development of customized solutions for specialized domains, allowing for quick deployment and fine-tuning to meet specific professional needs.
Introduction
In recent years, the convergence of large language models (LLMs) and information retrieval (IR) has garnered significant attention. In particular, effective text retrieval is pivotal to integrating LLMs with IR systems, as it largely determines the overall system’s capability. Enhancing text retrieval is also crucial for frameworks like Retrieval-Augmented Generation (RAG), which incorporate current external information to overcome the static nature of LLMs, thus delivering dependable and dynamic answers to user inquiries. This blog explores extensive experiments focused on improving text retrieval using advanced data refinement methods, including sophisticated data crafting, data filtering, and negative mining techniques. These methods are applied to both (1) existing benchmark datasets and (2) highly tailored synthetic datasets generated via LLMs.
Recent studies highlight the efficacy of LLMs in generating synthetic data, primarily for enhancing human-labeled datasets or improving performance. This motivates us to investigate a critical question:
Can we rely on LLM-generated data to improve retrieval performance? If not, how can we enhance its quality for this specific task?
We employ advanced methods, such as data crafting with extensive prompt engineering, data filtering, and negative mining guided by teacher models, all highly tailored to each task, to improve the quality of the synthetic data generated by LLMs. Our efforts aim to create high-quality triplet datasets (query, positive example, negative example), significantly improving text retrieval performance.
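To make the negative-mining step concrete, below is a minimal sketch of teacher-guided hard-negative selection. The teacher model ID, thresholds, and helper names are illustrative assumptions, not our exact recipe, which is tuned per task.

```python
# Minimal sketch of teacher-guided hard-negative mining for triplet construction.
# The teacher model ID, thresholds, and helper names are assumptions for
# illustration; the production pipeline is tuned per task.
from sentence_transformers import SentenceTransformer, util

teacher = SentenceTransformer("intfloat/e5-mistral-7b-instruct")  # hypothetical teacher

def mine_hard_negatives(query, positive, candidates, margin=0.05, floor=0.30, top_k=7):
    """Score candidates with the teacher; keep hard but trustworthy negatives."""
    embs = teacher.encode([query, positive] + candidates, normalize_embeddings=True)
    sims = util.cos_sim(embs[0:1], embs[1:])[0]    # query vs. positive + candidates
    pos_sim = float(sims[0])
    kept = []
    for cand, sim in zip(candidates, sims[1:]):
        sim = float(sim)
        # Drop likely false negatives (score too close to the positive's) and
        # trivially easy candidates (score below the relevance floor).
        if floor <= sim <= pos_sim - margin:
            kept.append((sim, cand))
    kept.sort(reverse=True)                         # hardest negatives first
    return [cand for _, cand in kept[:top_k]]
```

The margin-style filter is the kind of check that catches misleading documents: a generated “negative” scoring nearly as high as the positive is more likely mislabeled than hard, so it is discarded rather than trained on.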
Research Highlight
Similar to SFR, our Linq-Embed-Mistral represents a notable progression in text-embedding models, leveraging the robust foundations of E5-mistral and Mistral-7B-v0.1.
The key experimental points are:
- Linq-Embed-Mistral performs well on the MTEB benchmark, with an average score of 68.2 across 56 datasets. This places it 1st among publicly accessible models listed on the MTEB leaderboard and 3rd overall.
- The model shows a significant enhancement in retrieval performance, ranking 1st among all models listed on the MTEB leaderboard with a score of 60.2.
- Within the Mistral model series, a suite of models based on the foundational Mistral architecture, SFR enhances E5-mistral by adding a specially curated dataset of MTEB tasks. In contrast, our approach focuses solely on creating and integrating more sophisticated synthetic datasets. This has increased our model’s retrieval score from 56.9 for E5-mistral and 59.0 for SFR to 60.2.
Our contribution points are as follows:
- Our proposed Data Refinement Methods, which include sophisticated data crafting, filtering, and negative mining, significantly enhance the model’s ability to identify misleading documents. By improving the quality of the benchmark dataset and addressing issues in the synthetic dataset generated by GPT-4, these methods ensure more accurate and reliable results.
- We propose Homogeneous Task Ordering and Mixed Task Fine-tuning, which enhance model performance by promoting better generalization and training stability, especially when mixed task fine-tuning is limited to within 20 steps. Here, homogeneous task ordering provides precise insights into task-ordering effects, whereas mixed task fine-tuning mitigates catastrophic forgetting (a minimal scheduling sketch follows this list).
- We design a Streamlined Evaluation, which uses 4-bit precision and a light retrieval evaluation set. This speeds up validation while introducing negligible performance differences compared with the full-scale evaluation. Our design allows a single GPU to evaluate one checkpoint in approximately 5 hours, with retrieval tasks specifically taking around 4 hours.
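As referenced above, here is a minimal scheduling sketch of homogeneous task ordering followed by a short mixed-task tail. The data layout, holdout rule, and batch contents are illustrative assumptions; only the high-level recipe (homogeneous per-task batches in a fixed order, then at most about 20 mixed steps) comes from our setup.

```python
import random

def schedule_batches(task_batches, mixed_steps=20, holdout_per_task=2, seed=0):
    """Yield homogeneous per-task batches in a fixed order, then a short
    mixed-task fine-tuning tail capped at `mixed_steps` to curb forgetting.

    task_batches: dict mapping task name -> list of batches for that task.
    """
    rng = random.Random(seed)
    homogeneous, mixed_pool = [], []
    for task in sorted(task_batches):                  # deterministic task order
        batches = list(task_batches[task])
        mixed_pool.extend(batches[:holdout_per_task])  # reserve a few for mixing
        homogeneous.extend(batches[holdout_per_task:])
    yield from homogeneous                             # phase 1: one task at a time
    rng.shuffle(mixed_pool)                            # phase 2: interleaved tasks
    yield from mixed_pool[:mixed_steps]
```

Because phase 1 is strictly ordered, the effect of each task’s position on a checkpoint is easy to attribute, while the short shuffled tail revisits every task once more before training ends.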
Full Evaluation on MTEB
The Massive Text Embedding Benchmark (MTEB) stands as the most comprehensive benchmark for evaluating embedding models, incorporating 56 datasets across seven task types: classification, clustering, pair classification, re-ranking, retrieval, STS, and summarization.
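To reproduce a slice of these numbers, the sketch below uses the public mteb package with the checkpoint published on the Hugging Face Hub as Linq-AI-Research/Linq-Embed-Mistral. It mirrors our streamlined setup by loading the weights in 4-bit and scoring a single light retrieval task rather than all 56 datasets; the quantization settings and task choice are illustrative, and forwarding model_kwargs assumes a recent sentence-transformers release.

```python
# Minimal sketch: 4-bit evaluation on one light retrieval task with `mteb`.
# Quantization settings and the task choice are illustrative assumptions.
import torch
from mteb import MTEB
from sentence_transformers import SentenceTransformer
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = SentenceTransformer(
    "Linq-AI-Research/Linq-Embed-Mistral",
    model_kwargs={"quantization_config": bnb, "device_map": "auto"},  # 4-bit weights
)
evaluation = MTEB(tasks=["SciFact"])  # one small retrieval dataset, not the full suite
evaluation.run(model, output_folder="results/linq-embed-mistral-4bit")
```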
Comparison with publicly accessible models
Comparison with commercial models
Comparison with the models that top the MTEB leaderboard, highlighting the first- and second-place items in each task using bold and underlined formatting.
Acknowledgments
Contributions to the development of Linq-Embed-Mistral were made by:
Dr. Junseong Kim (junseong.kim@getlinq.com; Project Leader; GPT Data Strategy, Experiment Design, Technical Guidance and Advice)
Dr. Seolhwa Lee (seolhwa.lee@getlinq.com; AI Researcher; Dataset Filtering & Mining, Modeling Experiments, GPT Data Generation)
Jihoon Kwon (jihoon.kwon@wecoverai.com; AI Intern; Benchmark Datasets for Training, Dataset Filtering & Mining, Data Training Pipelines, Modeling Experiments, MTEB Evaluation)
Sangmo Gu (sangmo.gu@wecoverai.com; AI Intern; GPT Data Strategy, GPT Data Filtering, GPT Data Generation)
Yejin Kim (yjkim.stat@yonsei.ac.kr; AI Intern; Baseline Models)
Minkyung Cho (kveldsstjerne@snu.ac.kr; AI Intern; Baseline GPT Data)
Prof. Jy-yong Sohn (jyyong.sohn@getlinq.com; Advisor; Technical Guidance and Advice)
Dr. Chanyeol Choi (jacob.choi@getlinq.com; Advisor; Technical Guidance and Advice).