# Eedi

Code for Eedi - Mining Misconceptions in Mathematics.

## Update

- [2024/12/13] 🔥 We achieved 11th place on the private leaderboard (and 9th on the public leaderboard)! Below, we briefly introduce our solution.

## Environment

```bash
pip install -r requirements.txt
```

> [!CAUTION]
> When training with DeepSpeed ZeRO-3, there may be issues with saving model parameters. We strongly recommend using ZeRO-2 for training.
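
For reference, a minimal ZeRO-2 configuration might look like the sketch below. This is an illustrative DeepSpeed config dict, not our exact training setup; the batch-size and precision values are placeholders.

```python
# Illustrative DeepSpeed config using ZeRO stage 2, as recommended above.
# All values are placeholders, not the settings we actually trained with.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # ZeRO-2: shard optimizer state and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```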

## Method

### Overview

Our approach consists of two main parts: retrieval and reranking.

1. We first use a text embedding model to encode the query (the math problem, its options, and related metadata) and the misconceptions separately. By computing the cosine similarity between the feature vectors, we identify the top-25 misconceptions most similar to each query (see the sketch after this list).
2. In the reranking phase, we train a single-tower model to predict the similarity between the query and each retrieved misconception, and reorder the retrieved misconceptions by the predicted scores.
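
A minimal sketch of the retrieval step, assuming the query and misconception embeddings have already been computed by the embedding model described below; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_embs: torch.Tensor, misconception_embs: torch.Tensor, k: int = 25):
    q = F.normalize(query_embs, dim=-1)          # (num_queries, dim)
    m = F.normalize(misconception_embs, dim=-1)  # (num_misconceptions, dim)
    scores = q @ m.T                             # pairwise cosine similarities
    return scores.topk(k, dim=-1).indices        # indices of the top-k misconceptions
```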

### Retrieval

In the retrieval phase, we use Qwen2.5-32B-Instruct as the text embedding model. Following standard practice, we append an EOS token to the end of the input text and use the output feature vector of that last token as the embedding of the entire text. The input format is shown below.

TEMPLATE = """Instruct: Given a math multiple-choice question query with answers and related knowledge, retrieve the misconception that are pertinent to the incorrect answer of the query.\nQuery: Question:{QuestionText}Construct Name:{ConstructName}\n\nSubject Name:{SubjectName}\n\nCorrect Answer:{CorrectAnswer}Incorrect Answer:{Answer}"""

Our training data primarily consists of three parts:

- A GPT-4o-mini synthetic dataset: generated by providing seed examples to GPT-4o-mini, which then creates new Questions and Answers. These Questions and Answers share the same Subject, ConstructName, and misconceptions as the provided examples.
- MalAlgoQA: each entry consists of a question, four options, and a rationale for each of the four options. We treat the rationale for each incorrect option as its corresponding misconception.
- Eedi training dataset: the official training dataset provided by the competition.

The sizes of the three datasets are shown below:

| Dataset | Size |
| --- | --- |
| GPT-4o-mini synthetic dataset | 1869 |
| MalAlgoQA | 807 |
| Eedi training dataset | 1869 |

Some details of data processing:

- The training of our retrieval model is divided into three stages, for two reasons: 1) the three datasets vary in quality, and training first on the lower-quality GPT-4o-mini synthetic dataset and MalAlgoQA, then fine-tuning on the Eedi training dataset, yields better results; 2) the GPT-4o-mini synthetic dataset and MalAlgoQA cannot be mixed for training due to differences in their formats.
- We use Qwen2.5-Math-7B-Instruct, which we trained in the early stages of the competition, to relabel the misconceptions in the GPT-4o-mini synthetic dataset. While this may sound unusual, it did result in improvements.
- For each training example, we randomly sampled 25 negatives from the positive misconceptions of the other examples in the training set (see the sketch after this list).
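
A hedged sketch of this sampling rule; the `positive_misconceptions` field and the `sample_negatives` helper are illustrative, not our actual data schema:

```python
import random

def sample_negatives(example, dataset, k=25):
    # Pool: positive misconceptions of every *other* training example.
    pool = {m for other in dataset if other is not example
            for m in other["positive_misconceptions"]}
    pool -= set(example["positive_misconceptions"])  # never sample the example's own positives
    return random.sample(sorted(pool), k)
```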

### Reranking

Similar to the retrieval phase, we use Qwen2.5-32B-Instruct as the base model. Likewise, we append an EOS token to the end of the input text, and map the output feature vector of that last token to a scalar similarity score. For efficiency, we only rerank the top-18 of the 25 retrieved misconceptions. The input format is shown below.

TEMPLATE = """Instruct: Given a math multiple-choice question query with answers and related knowledge, retrieve the misconception that are pertinent to the incorrect answer of the query.\nQuery: Question:{QuestionText}\n\nConstruct Name:{ConstructName}\n\nSubject Name:{SubjectName}\n\nCorrect Answer:{CorrectAnswer}\n\nIncorrect Answer:{Answer}\n\nMisconception:{DOC}"""

In selecting positive and negative samples, we follow the "mine hard negatives" approach: we use the model introduced in the previous section to retrieve the top-150 misconceptions for each query and randomly sample 25 of them as negatives, as sketched below.
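
A sketch of this step; `retrieve_top_k` stands for a hypothetical wrapper around the retrieval model from the previous section:

```python
import random

def mine_hard_negatives(query, positives, retrieve_top_k, k=25):
    # Retrieve the top-150 candidates, drop the true misconception(s),
    # then randomly keep k of the remainder as hard negatives.
    candidates = retrieve_top_k(query, k=150)
    pool = [m for m in candidates if m not in positives]
    return random.sample(pool, k)
```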

## Conclusion

Since this was our first encounter with this type of task, there are many areas where this approach could be improved:

1. Synthetic data: the official training set for this competition does not cover all misconceptions. Synthesizing data from misconceptions that do not appear in the training set is crucial for achieving better results on the private leaderboard. We did work on this, but abandoned it because we did not observe improvements on the public leaderboard.
2. Selection of positive and negative samples: as mentioned in the retrieval section, for each training example we randomly sampled 25 negatives from the positive misconceptions of other examples. This led to significant improvements on the public leaderboard, but may carry a risk of overfitting.
3. Stronger base model and larger batch size: using a more powerful base model and a larger batch size would bring notable improvements.

## Acknowledgements

We express our deepest gratitude to the following individuals and projects, from which our solution has drawn significant inspiration!