There are two leaderboards: one for WikiHop and one for MedHop. Each compares the accuracy of different methods on the hidden test set of its respective dataset. All models are evaluated in the standard (unmasked) setting.
Planning to submit a model? Submit your code on CodaLab.
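For context, the accuracy reported below is simply the fraction of test queries whose predicted candidate matches the gold answer, scaled to a percentage. A minimal sketch in Python, assuming a hypothetical format in which predictions and gold answers are dicts mapping query ids to answer strings (the actual submission format is defined on CodaLab):

```python
def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Percentage of queries whose predicted candidate equals the gold answer.

    Queries missing from `predictions` count as wrong.
    """
    correct = sum(predictions.get(qid) == ans for qid, ans in gold.items())
    return 100.0 * correct / len(gold)


# Toy example with made-up query ids: one of two queries is correct -> 50.0
print(accuracy({"WH_1": "london"}, {"WH_1": "london", "WH_2": "paris"}))
```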
WikiHop:

# | Model / Reference | Affiliation | Date | Accuracy [%] |
---|---|---|---|---|
1 | RealFormer-large (single) | [anonymized] | January 2021 | 84.4 |
2 | ETC-large (single) | [anonymized] | May 2020 | 82.3 |
3 | Longformer (single) | AI2 | March 2020 | 81.9 |
4 | Path-based GCN (ensemble) | Zhejiang University (ZJU) | September 2019 | 78.3 |
5 | Chen et al. (2019) | UT Austin | September 2019 | 76.5 |
6 | QIT (ensemble) | [anonymized] | March 2023 | 76.5 |
7 | ChainEx (single) | [anonymized] | May 2019 | 74.9 |
8 | JDReader (ensemble) | JD AI Research | March 2019 | 74.3 |
9 | DynSAN (ensemble) | Samsung Research (SRC-B) | March 2019 | 73.8 |
10 | CEG (single) | Beijing University of Posts and Telecommunications | December 2019 | 73.6 |
11 | [anonymized] | [anonymized] | October 2020 | 73.1 |
12 | QIT (single) | [anonymized] | February 2023 | 72.7 |
13 | ECGN | [anonymized] | March 2020 | 72.6 |
14 | Path-based GCN (single) | Zhejiang University (ZJU) | July 2019 | 72.5 |
15 | [anonymized] | [anonymized] | December 2019 | 72.3 |
16 | ClueReader v1 (single) | Qufu Normal University | September 2020 | 72.0 |
17 | DynSAN basic (single) | Samsung Research (SRC-B) | February 2019 | 71.4 |
18 | Entity-GCN v2 (ensemble) | University of Amsterdam & University of Edinburgh | November 2018 | 71.2 |
19 | [anonymized] | [anonymized] | November 2019 | 71.0 |
20 | HDEGraph | JD AI Research | February 2019 | 70.9 |
21 | CFC | Salesforce Research | September 2018 | 70.6 |
22 | [anonymized] | [anonymized] | November 2018 | 69.6 |
23 | EPAr | UNC-NLP | February 2019 | 69.1 |
24 | BAG | University of Sydney | March 2019 | 69.0 |
25 | [anonymized] | [anonymized] | September 2018 | 67.6 |
26 | Entity-GCN v1 | University of Amsterdam & University of Edinburgh | May 2018 | 67.6 |
27 | SimpleMemNet | [anonymized] | September 2018 | 66.9 |
28 | [anonymized] | [anonymized] | November 2018 | 66.5 |
29 | MHQA-GRN | IBM & University of Rochester | August 2018 | 65.4 |
30 | Jenga | Facebook AI Research | February 2018 | 65.3 |
31 | Vanilla CoAttention Model | Nanyang Technological University | December 2017 | 59.9 |
32 | Coref-GRU | Carnegie Mellon University | April 2018 | 59.3 |
33 | BiDAF (Seo et al. '17) | Initial Benchmarks | September 2017 | 42.9 |
34 | Most Frequent Given Candidate | Initial Benchmarks | September 2017 | 38.8 |
35 | Document-cue | Initial Benchmarks | September 2017 | 36.7 |
36 | FastQA (Weissenborn et al. '17) | Initial Benchmarks | September 2017 | 25.7 |
37 | TF-IDF | Initial Benchmarks | September 2017 | 25.6 |
38 | Random Candidate | Initial Benchmarks | September 2017 | 11.5 |
MedHop:

# | Model / Reference | Affiliation | Date | Accuracy [%] |
---|---|---|---|---|
1 | MedKGQA | [anonymized] | November 2020 | 64.8 |
2 | EPAr | UNC-NLP | February 2019 | 60.3 |
3 | Most Frequent Given Candidate | Initial Benchmarks | September 2017 | 58.4 |
4 | Vanilla CoAttention Model | Nanyang Technological University | December 2017 | 58.1 |
5 | BiDAF (Seo et al. '17) | Initial Benchmarks | September 2017 | 47.8 |
6 | ClueReader v1 | Qufu Normal University | September 2020 | 46.0 |
7 | Document-cue | Initial Benchmarks | September 2017 | 44.9 |
8 | FastQA (Weissenborn et al. '17) | Initial Benchmarks | September 2017 | 23.1 |
9 | Random Candidate | Initial Benchmarks | September 2017 | 13.9 |
10 | TF-IDF | Initial Benchmarks | September 2017 | 9.0 |