Every year, the Dialogue conference is held in Moscow, bringing together linguists and data analysts to discuss what natural language is and how to teach a machine to understand and process it. The conference traditionally includes shared-task competitions (tracks) called Dialogue Evaluation. Both representatives of large companies building Natural Language Processing (NLP) solutions and individual researchers can take part. It may seem that a mere student cannot compete with systems that seasoned specialists at big companies have been building for years, but Dialogue Evaluation is exactly the kind of event where, in the final standings, a student can end up above a big-name company.
This year marks the 9th Dialogue Evaluation. The number of competitions varies from year to year; NLP tasks such as Sentiment Analysis, Word Sense Induction, Automatic Spelling Correction, and Named Entity Recognition have already served as track topics.
This year, four groups of organizers prepared the following tracks:
- Generation of headlines for news articles.
- Anaphora and coreference resolution.
- Morphological analysis on material from low-resource languages.
- Automatic analysis of a type of ellipsis (gapping).
Today we will talk about the last of them: what ellipsis is and why we should teach a machine to reconstruct it in text, how we created a new corpus for this task, how the competition went, and what results the participants achieved.
AGRR-2019 (Automatic Gapping Resolution for Russian)
In the fall of 2018, we faced a research challenge related to ellipsis, the intentional omission of a chain of words in a text that can be reconstructed from context. How do you automatically find such an omission in a text and fill it in correctly? This is easy for a native speaker, but not easy to teach a machine. It quickly became clear that this was good material for a competition, and we got to work.
Organizing a competition on a new topic has its own special features, and in many ways they seem like advantages to us. One of the main ones is the creation of a corpus (a large collection of annotated texts to learn from). What should it look like, and how big should it be? For many tasks there are established standards for data representation that you can build on. For example, the IO/BIO/IOBES markup schemes have been worked out for named entity recognition, and the CoNLL format is traditionally used for syntactic and morphological parsing, so there is no need to invent anything; you just follow the guidelines.
In our case, it was up to us to assemble the corpus and formulate the task ourselves.
Here’s the problem…
Here we will inevitably have to make a popular-science linguistic digression about what ellipsis is in general and about gapping as one of its kinds.
Whatever your ideas about language, it is hard to dispute that the surface level of expression (text or speech) is not the only one. The spoken phrase is the tip of the iceberg; the iceberg itself includes pragmatic evaluation, construction of syntactic structure, selection of lexical material, and so on. Ellipsis is a phenomenon that beautifully connects the surface level with the deeper ones. It is the omission of repeated elements of syntactic structure. If we imagine the syntactic structure of a sentence as a tree, and in this tree we can distinguish identical subtrees, then often (but not always) the repeated elements are deleted to make the sentence sound natural. This deletion is called ellipsis (example 1).
(1) I never got a call back, and I don't understand why [I never got a call back].
The omissions produced by ellipsis can be unambiguously reconstructed from the linguistic context. Compare the first example with the second (2), where there is also an omission, but what exactly is missing is unclear. That case is not ellipsis.
Gapping is one of the frequent types of ellipsis. Consider example (3) and see how it works.
(3) I mistook her for Italian and him for Swedish.
The example contains two sentences (clauses), coordinated with each other. The first clause has a verb (linguists would rather say "predicate"), mistook, and its dependents: I, her, and for Italian. The second clause has no overt verb; there are only syntactically unconnected "remnants", him and for Swedish, yet we understand how the omission is to be reconstructed.
To reconstruct the omission, we turn to the first clause and copy its entire structure (example 4), replacing only those parts for which there are "parallel" remnants in the incomplete clause. We copy the predicate mistook; her is replaced by him, and for Italian is replaced by the remnant for Swedish. No parallel remnant was found for I, so we copy it without replacement.
(4) I mistook her for Italian, and him [I mistook] for Swedish.
It seems that all we need to do to fill the gap is to determine whether there is gapping in the sentence, find the incomplete clause and the associated complete clause (from which the material for reconstruction is taken), and then figure out which remnants there are in the incomplete clause and what they correspond to in the complete clause. These conditions seem sufficient to fill the gap effectively. In this way, we try to mimic the process in the mind of a person reading or hearing a text that may contain gapping.
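The copy-and-replace procedure can be sketched as a toy function. Everything here is illustrative: the role names, the clause representation, and the function itself are hypothetical, not part of any competition code.

```python
def resolve_gapping(full_clause, remnants):
    """Reconstruct an elided clause from the full one: copy every
    element, replacing those that have a 'parallel' remnant in the
    incomplete clause. `full_clause` is a list of (role, phrase)
    pairs; `remnants` maps a role to its remnant phrase."""
    return [(role, remnants.get(role, phrase)) for role, phrase in full_clause]

full = [("subj", "I"), ("pred", "mistook"),
        ("obj", "her"), ("obl", "for Italian")]
restored = resolve_gapping(full, {"obj": "him", "obl": "for Swedish"})
# restored reads out as: I mistook him for Swedish
```

The role with no remnant (here the subject I) is copied unchanged, exactly as in example (4).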
So, what’s the purpose of this?
Understandably, someone hearing about ellipsis and its processing complexities for the first time may ask a legitimate question: "Why?" To skeptics we would suggest reading the fathers of linguistics; to everyone else we will explain that if solving an applied problem yields material that can be useful in theoretical research, that is already a sufficient answer to the question of the purpose of such an activity.
Theorists have been studying ellipsis in different languages for about 50 years, describing its limitations and highlighting patterns common across languages. That said, we are not aware of any corpus that illustrates any type of ellipsis with more than a few hundred examples. This is partly due to the rarity of the phenomenon (in our data, for example, gapping occurs in no more than 5 sentences out of 10,000). So the creation of such a corpus is an important result in itself.
In applied text processing, the rarity of the phenomenon allows you to simply ignore it: a syntactic parser's inability to recover gapping omissions will certainly not cause many errors. But rare phenomena make up a vast and motley linguistic periphery. Experience with such a problem should in itself be of interest to anyone who wants to build systems that work not only on simple, short, clean texts with common vocabulary, that is, on "spherical texts in a vacuum", which are almost never found in nature.
Few parsers can boast an effective system for identifying and resolving ellipsis. ABBYY's internal parser, however, has a module responsible for recovering omissions, based on manually written rules. This capability is what enabled us to create a large corpus for the competition. The potential benefit to the original parser lies in replacing the slow module; in addition, while working on the corpus, we performed a detailed error analysis of the current system.
How we built the corpus
Our corpus is intended primarily for training automated systems, which means it is extremely important that it be voluminous and diverse. With this in mind, we structured the data collection as follows. For the corpus we selected texts of different genres: from technical documentation and patents to news and social media posts. All of them were tagged with the ABBYY parser. Over the course of a month, we distributed the data among annotator linguists. The annotators were asked, without changing the markup, to rate it on the following scale:
0 – there is no gapping in the sentence, the markup is irrelevant.
1 – there is gapping, and the markup is correct.
2 – the gapping is there, but there is something wrong with the markup.
3 – a complicated case: is it gapping at all?
We ended up using each of the groups. Examples from category 1 went into the positive class of our dataset. To save time, we chose not to re-label the examples from categories 2 and 3 by hand, but they came in handy later for evaluating the resulting corpus: from them we can judge which cases the original system consistently mislabels and which therefore do not make it into our corpus. Finally, by including in the corpus the examples the annotators classified as category 0, we gave the systems an opportunity to "learn from others' mistakes", that is, not just to imitate the behavior of the original system but to work better than it.
Each example was evaluated by two annotators. In the end, a little more than half of the sentences from the original data made it into the corpus. These made up the entire positive class and part of the negative class. We decided to make the negative class twice as large as the positive one, so that, on the one hand, the classes would be comparable in size, and on the other, the preponderance of the negative class found in real language would be preserved.
To keep this proportion, we had to add more negative examples to the corpus beyond the category 0 examples described above. Here is example (5) of category 0, which can confuse not only a machine but also a person.
(5) But by then Jack was in love with Cindy Page, now Mrs. Jack Swaik.
In the second clause, in love is not restored: what is meant is that Cindy Page is now Mrs. Jack Swaik because she married him.
In general, for such a relatively rare syntactic phenomenon as gapping, almost any random sentence in the language is a negative example, because the probability of finding gapping in a random sentence is tiny. However, using such negative examples can lead to overfitting on punctuation marks. In our corpus, candidates for the negative class were selected by simple criteria: presence of a verb, presence of a comma or dash, and a minimum sentence length of 6 tokens.
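These selection criteria can be sketched in a few lines. The function below is a hypothetical illustration; in particular, the `has_verb` flag is assumed to come from an external morphological tagger, which we do not reproduce here.

```python
import re

def is_negative_candidate(sentence, has_verb, min_tokens=6):
    """Keep a sentence as a negative-class candidate only if it has
    a verb (per an external tagger), a comma or dash, and at least
    `min_tokens` whitespace-separated tokens."""
    if not has_verb:
        return False
    if len(sentence.split()) < min_tokens:
        return False
    # a comma, hyphen, or en/em dash somewhere in the sentence
    return bool(re.search(r"[,\u2013\u2014-]", sentence))
```

A short sentence or one with no comma or dash is rejected even if it contains a verb, which is what keeps the negative class from being trivially distinguishable by punctuation alone.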
For the contest, we split off a dev part (in a 1:5 ratio) from the training corpus, which participants were asked to use to tune their systems. The final versions of the systems were trained on the combined train and dev parts. We checked the test set by hand; its size is one tenth of train + dev. Here are the exact numbers of examples by class:
In addition to the manually checked training data, we provided a file with raw markup from the original system. It contains over 100,000 examples, and participants could use this data to augment the training sample if they wished. Looking ahead, we will say that only one participant figured out how to significantly enlarge the training corpus with this noisy data without losing quality.
We deliberately rejected the use of third-party parsers and developed a markup scheme in which all the elements of interest are marked linearly in the text string. We used two types of markup. The first, human-readable one is designed for working with the annotation and is convenient for error analysis of the resulting systems. In this scheme, square brackets inside the sentence mark all the elements of gapping, and each pair of brackets is labeled with the name of the corresponding element. We used the following notation:
Here are examples of gapping sentences with bracket markup.
The bracket markup is suitable for analyzing the material. The corpus itself, however, stores the data in a different format, which can easily be converted to bracket form if desired. One line corresponds to one sentence. The columns indicate the presence of gapping in the sentence, and for each possible label, its column contains the character offsets of the beginning and end of the segment corresponding to that element. This is what the offset markup looks like for the bracket markup shown above.
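Converting the offset columns to bracket form is a mechanical operation. The sketch below assumes a simplified span list of (label, start, end) triples and a `[...]_label` rendering; the actual corpus column layout and bracket style may differ.

```python
def offsets_to_brackets(sentence, spans):
    """Render character-offset spans as labeled brackets.
    `spans` is a list of (label, start, end) triples, assumed
    non-overlapping. Inserting from right to left keeps the
    earlier offsets valid as the string grows."""
    for label, start, end in sorted(spans, key=lambda s: -s[1]):
        sentence = (sentence[:start] + "[" + sentence[start:end]
                    + "]_" + label + sentence[end:])
    return sentence

marked = offsets_to_brackets(
    "I mistook her for Italian, and him for Swedish.",
    [("cV", 2, 9), ("R1", 31, 34)])
# marked == "I [mistook]_cV her for Italian, and [him]_R1 for Swedish."
```

Working from character offsets rather than token indices is exactly what makes the format robust to different tokenizers, which matters for the evaluation metric described below.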
Tasks for participants
AGRR-2019 participants could solve any of three tasks:
- Binary classification. You need to determine if there is gapping in the sentence.
- Gapping resolution. You need to recover the positions of the gap (V) and of the controlling verb (cV).
- Full markup. You need to define offsets for all gapping elements.
Each successive task subsumes the previous one: any markup is possible only in sentences for which binary classification yields the positive class (gapping is present), and full markup also includes finding the boundaries of the elided and controlling predicates.
For the binary classification task, we used the standard metrics of precision and recall, and participants' results were ranked by F-measure.
For the gapping resolution and full markup tasks, we decided to use a character-level F-measure, because the source texts were not tokenized and we did not want differences in the participants' tokenizers to affect the results. True-negative examples did not contribute to the character-level F-measure; a separate F-measure was computed for each markup element, and the final result was obtained by macro-averaging over the entire corpus. With this way of computing the metric, false positives are penalized noticeably, which is important when positive examples in real data are many times rarer than negative ones.
The course of the competition
In parallel with corpus collection, we were accepting entries for the competition and ended up registering over 40 participants. We then released the training corpus, and the competition began. Participants had 4 weeks to build their models.
The evaluation phase went as follows: the participants received 20 thousand sentences without markup, with the test corpus hidden among them. Teams had to mark up this data with their systems, and we then evaluated the results on the test corpus. Mixing the test set into a large amount of data ensured that, however much one might want to, it could not be annotated manually in the few days allotted for the run (automatic markup).
Nine teams made it to the finals, including representatives of two IT companies, researchers from Moscow State University, MIPT, Higher School of Economics, and IPPI RAS.
All but one team participated in all three tasks. Under the terms of AGRR-2019, all teams published their solution code. A summary table with the results can be found in our repository, along with links to the teams' published solutions and brief descriptions.
Almost all teams showed strong results. Here are the scores of the prize-winning teams' solutions:
A detailed description of the top solutions will soon be available in the participants' articles in the Dialogue proceedings.
In this article, we have described how to formulate a task, prepare a corpus, and run a competition around a rare linguistic phenomenon. The NLP community benefits from such work because competitions make it possible to compare different architectures and approaches on concrete material, and linguists get a corpus of a rare phenomenon, with the possibility of extending it using the winners' solutions. The collected corpus is several times larger than existing gapping corpora, not only for Russian but for any language. All the data and links to the participants' solutions can be found in our GitHub repository.
On May 30, at a special Dialogue session dedicated to the automatic gapping resolution competition, the results of AGRR-2019 will be summarized. We will talk about the organization of the competition and elaborate on the contents of the created corpus, and the participants will present the architectures with which they solved the problem.
NLP Advanced Research Group