What Is KBQA and What Are Its Benchmarks?

7 Jun 2024


(1) Silei Xu, Computer Science Department, Stanford University, Stanford, CA, with equal contribution {silei@cs.stanford.edu};

(2) Shicheng Liu, Computer Science Department, Stanford University, Stanford, CA, with equal contribution {shicheng@cs.stanford.edu};

(3) Theo Culhane, Computer Science Department, Stanford University, Stanford, CA {tculhane@cs.stanford.edu};

(4) Elizaveta Pertseva, Computer Science Department, Stanford University, Stanford, CA {pertseva@cs.stanford.edu};

(5) Meng-Hsi Wu, Computer Science Department, Stanford University, Stanford, CA, and Ailly.ai {jwu@ailly.ai};

(6) Sina J. Semnani, Computer Science Department, Stanford University, Stanford, CA {sinaj@cs.stanford.edu};

(7) Monica S. Lam, Computer Science Department, Stanford University, Stanford, CA {lam@cs.stanford.edu}.

Abstract and Introduction

Related Work

Semantic Parsing for Wikidata

WikiWebQuestions (WWQ) Dataset



Experiment with QALD-7

Conclusions, Limitations, Ethical Considerations, Acknowledgements, and References

A. Examples of Recovering from Entity Linking Errors

2.1 KBQA

The KBQA task aims to make large knowledge bases accessible through natural language. One common approach is semantic parsing, where a natural language query is translated into a formal logical form, which is then executed against the knowledge base to retrieve an answer. To handle large KBs, one line of work formulates semantic parsing as a multi-staged search problem: it retrieves entities and then expands the query graph according to the relationships between their properties and the query (Yih et al., 2015, 2016; Luo et al., 2018).
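As a minimal illustration of the semantic parsing pipeline described above (the knowledge base, property names, and logical-form format here are invented toys, not the representation used by any of the cited systems): a question is mapped to a structured triple pattern, which is then executed against the KB.

```python
# Toy sketch of semantic parsing for KBQA (hypothetical example).
# A question is translated into a triple-pattern logical form,
# which is then executed against a tiny in-memory knowledge base.

# Knowledge base as (subject, property, object) triples.
KB = [
    ("Mount Everest", "height_m", 8849),
    ("K2", "height_m", 8611),
    ("Mount Everest", "located_in", "Himalayas"),
]

def execute(logical_form, kb):
    """Execute a single triple pattern; '?x' marks the variable to return."""
    subj, prop, obj = logical_form
    results = []
    for s, p, o in kb:
        if (subj in ("?x", s)) and prop == p and (obj in ("?x", o)):
            results.append(o if obj == "?x" else s)
    return results

# Hypothetical parser output for "How tall is Mount Everest?":
lf = ("Mount Everest", "height_m", "?x")
print(execute(lf, KB))  # [8849]
```

Real systems target a full query language (e.g. SPARQL over Wikidata) rather than single triple patterns, but the translate-then-execute structure is the same.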

Lan and Jiang (2020) add constraints to the staged query graph generation method. Another popular method is to use seq2seq models obtained by fine-tuning pretrained language models. Das et al. (2021) first find other queries that contain semantically similar subparts, and construct a new logical form by combining the similar subparts of the found queries. Ye et al. (2022) search over the KB based on predefined rules to derive a set of candidate logical forms, rank them, and generate the final logical form.

Cao et al. (2022b) first generate a “sketch” program and then fill in its arguments. Gu and Su (2022) use dynamic program induction to generate query structures. Based on a user query, Shu et al. (2022) retrieve entities, example logical forms, and related schema. Unlike Freebase, Wikidata does not have a fixed schema.

Another approach to KBQA is based on graph retrieval (Dong et al., 2015; Miller et al., 2016; Sun et al., 2018, 2019; Mavromatis and Karypis, 2022; Sen et al., 2021; Vivona and Hassani, 2019; Verga et al., 2021). These methods predict answers directly within a subgraph extracted around the topic entity in the question. Yu et al. (2023) combine semantic parsing with retrieval and achieve the state-of-the-art on the WebQuestionsSP dataset (Yih et al., 2016).

However, retrieval-based methods cannot handle entire categories of questions, such as unanswerable questions and questions like “the tallest mountain” in which no entity is mentioned by name. They also offer poor interpretability and do not support query optimization.

2.2 KBQA Benchmarks

Most of the early KBQA benchmarks are based on Freebase (Berant et al., 2013; Yih et al., 2016; Talmor and Berant, 2018). Recently, new benchmarks have been created for Wikidata (Cao et al., 2022a; Saha et al., 2019). However, these benchmarks are created using rule-based synthesis or paraphrases, which are easier for semantic parsers than naturally phrased user questions.

CSQA collects human-written questions for single triples and constructs complex questions using fixed rules with very limited natural language variety (Saha et al., 2019). KQA Pro first synthesizes queries with canonical natural language and then crowdsources human paraphrases (Cao et al., 2022a).

Campagna et al. (2019) show that a model can achieve significantly higher accuracy on paraphrased data than on real-world data, even for query types unseen in training. We thus base our WikiWebQuestions dataset on WebQuestionsSP (Yih et al., 2016), whose data were collected from real-world users via the Google Suggest API.

2.3 LLMs for Semantic Parsing

Shin et al. (2021) show the promise of few-shot prompting LLMs for semantic parsing. They use constrained decoding to enforce the syntax of the formal language, and achieve results comparable to a smaller fine-tuned BART model (Lewis et al., 2020) on datasets with small database schemas. Rubin et al. (2022) fine-tune a small retriever to obtain the most relevant few-shot examples for each input.

Niu et al. (2023) use a few-shot prompted Codex model to break down the natural language input to make the task easier for a smaller semantic parser. LLMs have also been applied to semantic parsing on relational databases (Hu et al., 2022; Poesia et al., 2022; Li et al., 2023; An et al., 2023; Nan et al., 2023; Arora et al., 2023). The schemas used in these projects are very small when compared to Wikidata.
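A rough sketch of how a few-shot prompt for semantic parsing might be assembled (the exemplar pairs, property names, and prompt layout below are invented placeholders, not real Wikidata identifiers or the format of any cited system):

```python
# Hypothetical few-shot prompt construction for LLM-based semantic
# parsing. Exemplars pair a question with its target logical form;
# the property names are illustrative, not real Wikidata IDs.

EXEMPLARS = [
    ("what is the capital of france",
     "SELECT ?x WHERE { wd:France wdt:capital ?x }"),
    ("who wrote hamlet",
     "SELECT ?x WHERE { wd:Hamlet wdt:author ?x }"),
]

def build_prompt(question, exemplars=EXEMPLARS):
    """Concatenate (question, logical form) pairs, then the new question."""
    parts = ["Translate each question into a SPARQL query.\n"]
    for q, lf in exemplars:
        parts.append(f"Question: {q}\nSPARQL: {lf}\n")
    parts.append(f"Question: {question}\nSPARQL:")
    return "\n".join(parts)

prompt = build_prompt("what is the tallest mountain")
```

The LLM completes the text after the final `SPARQL:`; constrained decoding, as in Shin et al. (2021), would restrict that completion to syntactically valid queries.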

2.4 Entity Linking

Entity linking involves finding the named entities in a query, and linking them to the corresponding entities in the knowledge graph so that the query can be executed using the proper entities as reference points. The current state-of-the-art entity linking model on the WebQuestionsSP dataset is ReFinED (Ayoola et al., 2022).

ReFinED uses a bidirectional transformer over the query to predict the most likely named-entity mentions, then combines that information with embeddings computed for every entity in the knowledge base to predict which entity each mention refers to. Prior to ReFinED, the state of the art was ELQ (Li et al., 2020), which similarly generates embeddings for each entity in the knowledge base and combines them with predicted mention spans to produce likely entities.
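The embedding-based disambiguation step shared by these systems can be caricatured as follows (the vectors, candidate entities, and similarity function here are hand-made for illustration; the real models learn mention and entity representations jointly):

```python
import math

# Toy sketch of embedding-based entity disambiguation: score a detected
# mention against precomputed entity embeddings and keep the best match.
# These vectors are hand-made placeholders, not learned representations.

ENTITY_EMBEDDINGS = {
    "Q90 (Paris, France)": [0.9, 0.1, 0.0],
    "Q47899 (Paris Hilton)": [0.1, 0.8, 0.2],  # hypothetical QID
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def link(mention_embedding, candidates=ENTITY_EMBEDDINGS):
    """Return the candidate entity most similar to the mention embedding."""
    return max(candidates, key=lambda e: cosine(mention_embedding, candidates[e]))

# A mention embedding close to the "city" direction:
print(link([0.8, 0.2, 0.1]))  # Q90 (Paris, France)
```

In practice, mention detection and candidate ranking operate over millions of entities, so the embeddings are precomputed and retrieved with approximate nearest-neighbor search rather than scored exhaustively.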

This paper is available on arXiv under a CC 4.0 license.