The QAngaroo Leaderboards

There are two leaderboards, one for WikiHop and one for MedHop. Each compares the accuracy of different methods on the hidden test set of the respective dataset. All models are evaluated in the standard (unmasked) setting.
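
For reference, accuracy here is the fraction of questions for which the predicted candidate exactly matches the gold answer. Below is a minimal scoring sketch in Python, assuming the datasets' published JSON format (a list of samples, each with "id" and "answer" fields) and a hypothetical predictions file that maps each question id to an answer string:

```python
import json

def accuracy(gold_path: str, pred_path: str) -> float:
    """Fraction of questions whose predicted answer exactly
    matches the gold answer string."""
    # Gold file: a JSON list of samples, each carrying "id" and
    # "answer" (the published WikiHop/MedHop format).
    with open(gold_path) as f:
        gold = {sample["id"]: sample["answer"] for sample in json.load(f)}

    # Predictions: assumed here to be a JSON dict mapping id -> answer;
    # check the CodaLab submission instructions for the exact format.
    with open(pred_path) as f:
        pred = json.load(f)

    correct = sum(pred.get(qid) == answer for qid, answer in gold.items())
    return correct / len(gold)

if __name__ == "__main__":
    # Example usage against a local dev split and prediction file.
    print(f"Accuracy: {100 * accuracy('dev.json', 'predictions.json'):.1f}%")
```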

Planning to submit a model? Submit your code on CodaLab.

WikiHop

| # | Model / Reference | Affiliation | Date | Accuracy [%] |
|---|-------------------|-------------|------|--------------|
| 1 | RealFormer-large (single) | [anonymized] | January 2021 | 84.4 |
| 2 | ETC-large (single) | [anonymized] | May 2020 | 82.3 |
| 3 | Longformer (single) | AI2 | March 2020 | 81.9 |
| 4 | Path-based GCN (ensemble) | Zhejiang University (ZJU) | September 2019 | 78.3 |
| 5 | Chen et al. (2019) | UT Austin | September 2019 | 76.5 |
| 6 | QIT (ensemble) | [anonymized] | March 2023 | 76.5 |
| 7 | ChainEx (single) | [anonymized] | May 2019 | 74.9 |
| 8 | JDReader (ensemble) | JD AI Research | March 2019 | 74.3 |
| 9 | DynSAN (ensemble) | Samsung Research (SRC-B) | March 2019 | 73.8 |
| 10 | CEG (single) | Beijing University of Posts and Telecommunications | December 2019 | 73.6 |
| 11 | [anonymized] | [anonymized] | October 2020 | 73.1 |
| 12 | QIT (single) | [anonymized] | February 2023 | 72.7 |
| 13 | ECGN | [anonymized] | March 2020 | 72.6 |
| 14 | Path-based GCN (single) | Zhejiang University (ZJU) | July 2019 | 72.5 |
| 15 | [anonymized] | [anonymized] | December 2019 | 72.3 |
| 16 | ClueReader v1 (single) | Qufu Normal University | September 2020 | 72.0 |
| 17 | DynSAN basic (single) | Samsung Research (SRC-B) | February 2019 | 71.4 |
| 18 | Entity-GCN v2 (ensemble) | University of Amsterdam & University of Edinburgh | November 2018 | 71.2 |
| 19 | [anonymized] | [anonymized] | November 2019 | 71.0 |
| 20 | HDEGraph | JD AI Research | February 2019 | 70.9 |
| 21 | CFC | Salesforce Research | September 2018 | 70.6 |
| 22 | [anonymized] | [anonymized] | November 2018 | 69.6 |
| 23 | EPAr | UNC-NLP | February 2019 | 69.1 |
| 24 | BAG | University of Sydney | March 2019 | 69.0 |
| 25 | [anonymized] | [anonymized] | September 2018 | 67.6 |
| 26 | Entity-GCN v1 | University of Amsterdam & University of Edinburgh | May 2018 | 67.6 |
| 27 | SimpleMemNet | [anonymized] | September 2018 | 66.9 |
| 28 | [anonymized] | [anonymized] | November 2018 | 66.5 |
| 29 | MHQA-GRN | IBM & University of Rochester | August 2018 | 65.4 |
| 30 | Jenga | Facebook AI Research | February 2018 | 65.3 |
| 31 | Vanilla CoAttention Model | Nanyang Technological University | December 2017 | 59.9 |
| 32 | Coref-GRU | Carnegie Mellon University | April 2018 | 59.3 |
| 33 | BiDAF (Seo et al. '17) | Initial Benchmarks | September 2017 | 42.9 |
| 34 | Most Frequent Given Candidate | Initial Benchmarks | September 2017 | 38.8 |
| 35 | Document-cue | Initial Benchmarks | September 2017 | 36.7 |
| 36 | FastQA (Weissenborn et al. '17) | Initial Benchmarks | September 2017 | 25.7 |
| 37 | TF-IDF | Initial Benchmarks | September 2017 | 25.6 |
| 38 | Random Candidate | Initial Benchmarks | September 2017 | 11.5 |

MedHop

| # | Model / Reference | Affiliation | Date | Accuracy [%] |
|---|-------------------|-------------|------|--------------|
| 1 | MedKGQA | [anonymized] | November 2020 | 64.8 |
| 2 | EPAr | UNC-NLP | February 2019 | 60.3 |
| 3 | Most Frequent Given Candidate | Initial Benchmarks | September 2017 | 58.4 |
| 4 | Vanilla CoAttention Model | Nanyang Technological University | December 2017 | 58.1 |
| 5 | BiDAF (Seo et al. '17) | Initial Benchmarks | September 2017 | 47.8 |
| 6 | ClueReader v1 | Qufu Normal University | September 2020 | 46.0 |
| 7 | Document-cue | Initial Benchmarks | September 2017 | 44.9 |
| 8 | FastQA (Weissenborn et al. '17) | Initial Benchmarks | September 2017 | 23.1 |
| 9 | Random Candidate | Initial Benchmarks | September 2017 | 13.9 |
| 10 | TF-IDF | Initial Benchmarks | September 2017 | 9.0 |