Task 4: Question Answering Challenge

Task definition

The goal of the task is to develop a solution capable of answering general-knowledge questions typical of popular TV quiz shows, such as Fifteen to One (PL: Jeden z dziesięciu).

The participants will be provided with:

  • a small development set (questions and answers)

  • test set A (without answers; results will be shown immediately on the leaderboard)

  • test set B (without answers; results will be shown on the leaderboard during the last week of the competition).

No training set will be provided this time, but participants are free to use any available offline resources. Thus, caching Wikipedia is fine, but querying Google on the fly is not; the system should be able to run without an Internet connection. The answering process also has to be completely automated and must not involve any form of human assistance.

Questions (and answers)

Questions and answers were retrieved from various sources and may correspond to actual questions asked in such quiz shows in the past.

The questions could be grouped according to the type of answer they seek, for example:

  • A name of a specific entity:

    • Q: Jak nazywa się bohaterka gier komputerowych z serii Tomb Raider?

    • A: Lara Croft

  • A name of a more general category of entities:

    • Q: Paź królowej to gatunek których owadów?

    • A: motyli

  • True or false value (“tak” or “nie”):

    • Q: Czy w przypadku skrócenia kadencji Sejmu ulega skróceniu kadencja Senatu?

    • A: Tak

  • One of the options, which are given in the question:

    • Q: Co zabiera Wenus więcej czasu: obieg dookoła Słońca czy obrót dookoła osi?

    • A: Obrót dookoła osi

  • Words that complete a given sentence, quote or expression:

    • Q: Proszę dokończyć powiedzenie: „piłka jest okrągła, a bramki są…”

    • A: dwie

  • Numbers:

    • Q: Ile pełnych tygodni ma rok kalendarzowy?

    • A: 52

  • Alternative names for a given entity:

    • Q: Jaki przydomek nosił Ludwik I, król Franków i syn Karola Wielkiego?

    • A: Pobożny

Please note that questions can be expressed in various ways, e.g.:

  • Proszę rozwinąć skrót CIA.

  • Tę samą nazwę noszą pocisk miotany ręką, kamień półszlachetny i owoc południowy. Jaką?

  • Ten urodzony w XIX w. Nantes francuski pisarz uchodzi za prekursora literatury fantastycznonaukowej. O kogo chodzi?

  • Festiwal filmowy Dwa Brzegi odbywa się jednocześnie w dwóch miejscowościach na dwóch brzegach Wisły. Jedna z nich to Janowiec, a druga?

  • Tadeusz Andrzej Bonawentura, walczył o niepodległość USA. Jak brzmi jego nazwisko?

  • Zobaczyć... i umrzeć – o które miasto chodzi?

Answers should also be provided as in a quiz, i.e. they should consist of just a few words that satisfy the question, rather than a whole sentence or document as in some other QA tasks. Note also that gold answers:

  • can contain prepositions:

    • Q: W którym mieście trasa drogi krzyżowej przebiega ulicą Via Dolorosa?

    • A: w Jerozolimie

  • can be inflected:

    • Q: Symbolem którego pierwiastka jest Cr?

    • A: chromu 

  • can contain punctuation:

    • Q: W jakim filmie mężczyzna w białym garniturze zrywa lilie ze stawu?

    • A: „Noce i dnie”

  • for person names, include both the first name and the surname:

    • Q: Który brytyjski pisarz wprowadził do literatury pojęcie Wielkiego Brata?

    • A: George Orwell

The following types of questions were removed from the set:

  • about current issues (“Kto jest prezydentem Francji?”)

  • seeking multiple entities in an answer (“Proszę podać nazwy dwóch państw, przez które przepływa Nil.”); NOTE: questions whose answer consists of exactly two items are valid, and the items can be given in either order: “Co występuje w powiedzeniu razem z makiem i oznacza nic?” – pasternak i figa / figa i pasternak

  • related to the rules of spelling (“Przez jakie ‘h’ piszemy słowo ‘charyzma’?”)

  • those requiring longer explanations (“Jakie pokrewieństwo łączyło reżysera Jana Łomnickiego i aktora Tadeusza?”)

Development and test data

Development and test data are provided as UTF-8-encoded text files: in.tsv contains the questions (one per line), while expected.tsv contains the gold answers; when more than one answer variant is accepted for a question, the variants are separated by tabs.

The data has been published in the following repository: https://github.com/poleval/2021-question-answering 

Gold-standard test data annotation has been published in the "secret" branch (https://github.com/poleval/2021-question-answering/tree/secret/test-B).
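For orientation, the snippet below sketches how the files can be loaded in Python. The dev-0 directory name is only an assumption about the repository layout; adjust the paths to the actual folders (e.g. those for test set A or B).

```python
from pathlib import Path

def read_lines(path: str) -> list[str]:
    """Read a UTF-8 text file and return its lines (one record per line)."""
    return Path(path).read_text(encoding="utf-8").splitlines()

# Questions: one per line in in.tsv (directory name is an assumption).
questions = read_lines("dev-0/in.tsv")

# Gold answers: one line per question; alternative variants are tab-separated.
gold = [line.split("\t") for line in read_lines("dev-0/expected.tsv")]

assert len(questions) == len(gold)
print(questions[0], "->", " / ".join(gold[0]))
```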

Submission format

The submission file should contain just the answers, one per line, in the same order as the questions.
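A minimal Python sketch of writing such a file is shown below; the out.tsv file name is an assumption, the firm requirements being plain UTF-8 text with one answer per line, in question order.

```python
def write_predictions(answers: list[str], path: str = "out.tsv") -> None:
    """Write one answer per line, in the same order as the questions in in.tsv."""
    with open(path, "w", encoding="utf-8") as f:
        for answer in answers:
            # Keep each answer on a single line.
            f.write(answer.strip().replace("\n", " ") + "\n")
```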

Evaluation

The task will be evaluated by comparing the known answers (gold standard) to those provided by the participating systems (predictions). Specifically, we will compute accuracy as the number of matching answers divided by the number of questions in the test set.

Checking if the two answers match will depend on the question type:

  1. For non-numerical questions, we will assess textual similarity. To that end, a Levenshtein distance will be computed between the two (lowercased) strings and if it is less than ½ of the length of the gold standard answer, we accept the candidate answer.

  2. For numerical questions (e.g. In which year…), we will assess numerical similarity. Specifically, we will use a regular expression to extract a sequence of characters that could be interpreted as a number. If such sequences can be found in both answers and represent the same number, we accept the prediction.

For some questions, more than one answer text is available, e.g. Richard I, Richard Cœur de Lion and Richard the Lionheart. In such cases the answer that has the best match with the candidate will be used.
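The matching rules can be approximated with the Python sketch below. This is not the official evaluation script: the number-extraction regular expression, the treatment of decimal separators, and the per-question numerical flag are assumptions made for illustration.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

NUMBER_RE = re.compile(r"-?\d+(?:[.,]\d+)?")   # assumed number pattern

def extract_number(text: str):
    match = NUMBER_RE.search(text)
    return float(match.group().replace(",", ".")) if match else None

def matches(gold: str, prediction: str, numerical: bool) -> bool:
    """Compare a single gold variant with a prediction, following the rules above."""
    if numerical:
        g, p = extract_number(gold), extract_number(prediction)
        return g is not None and g == p
    gold, prediction = gold.lower(), prediction.lower()
    return levenshtein(gold, prediction) < 0.5 * len(gold)

def accuracy(gold_variants, predictions, numerical_flags) -> float:
    """gold_variants: list of lists of accepted answers; the best-matching variant counts."""
    hits = sum(
        any(matches(g, pred, num) for g in variants)
        for variants, pred, num in zip(gold_variants, predictions, numerical_flags)
    )
    return hits / len(predictions)
```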

Baseline

The WIKI_SEARCH baseline uses the question as a query to the Wikipedia search service and returns the title of the first returned article as the answer, provided that the title does not overlap with the question.

Specifically, the following procedure is used:

  1. Split the question into tokens using spaCy (model pl_core_news_sm), ignoring one-character tokens,
  2. Send the space-separated tokens as a query to the Search API of the Polish Wikipedia,
  3. For each of the returned articles:
    1. Split its title into tokens with spaCy,
    2. If none of the tokens of the title has at least 50% overlap (measured as in Evaluation) with any of the tokens of the question:
      1. remove the part of the title starting from ‘(’, if present,
      2. return the title as an answer,
    3. Otherwise, continue to the next result,
  4. If no answer is found at this point, remove the first of the question tokens and jump back to (2).
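A rough Python reimplementation of this procedure is sketched below. The MediaWiki search endpoint and spaCy calls are standard, but the exact overlap computation and the lack of error handling are simplifications; treat it as an illustration rather than the reference baseline code.

```python
import requests
import spacy

NLP = spacy.load("pl_core_news_sm")                 # Polish spaCy model
WIKI_API = "https://pl.wikipedia.org/w/api.php"     # MediaWiki Action API

def levenshtein(a: str, b: str) -> int:
    """Same edit distance as in the Evaluation sketch."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def overlaps(token: str, question_tokens: list[str]) -> bool:
    """Roughly 50% overlap, measured with the same criterion as in Evaluation."""
    token = token.lower()
    return any(levenshtein(token, q.lower()) < 0.5 * len(q) for q in question_tokens)

def wiki_search_answer(question: str) -> str:
    # 1. Tokenise the question and drop one-character tokens.
    q_tokens = [t.text for t in NLP(question) if len(t.text) > 1]
    while q_tokens:
        # 2. Query the Polish Wikipedia search API with the remaining tokens.
        params = {"action": "query", "list": "search",
                  "srsearch": " ".join(q_tokens), "format": "json"}
        hits = requests.get(WIKI_API, params=params).json()["query"]["search"]
        # 3. Take the first title whose tokens do not overlap with the question.
        for hit in hits:
            title = hit["title"]
            if not any(overlaps(t.text, q_tokens) for t in NLP(title)):
                return title.split("(")[0].strip()   # drop a disambiguation suffix
        # 4. No usable title: drop the first question token and retry.
        q_tokens = q_tokens[1:]
    return ""
```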

Video introduction