%
% File ACL2016.tex
%
\documentclass[11pt]{article}
\usepackage{acl2016}
\usepackage{times}
\usepackage{latexsym}
%\aclfinalcopy % Uncomment this line for the final submission
%\def\aclpaperid{***} % Enter the acl Paper ID here
% To expand the titlebox for more authors, uncomment
% below and set accordingly.
% \addtolength\titlebox{.5in}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{url}
\usepackage{graphicx}
\usepackage{color}
\usepackage{xspace}
\newcommand{\txt}[1]{{\it "#1"}}
\newcommand{\tool}[1]{{\sc #1}\xspace}
\newcommand{\bruno}[1]{{\textcolor{blue}{ #1 }}}
\newcommand{\karen}[1]{{\textcolor{blue}{ #1 }}}
\newcommand{\zl}{{\sc ZombiLingo}\xspace}
\title{Zombies with a (Quality) Purpose:\\
Producing Dependency Syntax Annotations With \zl}
\author{
Bruno Guillaume\\
Inria Nancy Grand-Est/LORIA\\
{\tt Bruno.Guillaume@loria.fr}
\And
Karën Fort\\
Université Paris-Sorbonne/EA STIH\\
{\tt karen.fort@paris-sorbonne.fr}
}
%\Keywords{Crowdsourcing, GWAP, Language Resources, Annotation, Dependency Syntax}}
\begin{document}
\maketitle
\begin{abstract}
We present \zl, a game with a purpose allowing for the large-scale annotation of dependency syntax relations for French. We detail in particular its training and evaluation mechanisms and discuss some current biases of the game that we are now addressing.
\end{abstract}
% ========================================================================================================================
\section{Games With A Purpose and (Quality) Language Resources Production}
Games With A Purpose (GWAPs) are crowdsourcing platforms with a more or less gamified interface. Although they are sometimes categorized as serious games, the main purpose of GWAPs is to produce data, not to train people (players end up being well-trained on the task, but not on the domain).
The constantly growing need for language resources in natural language processing (NLP) has fostered the development of such platforms, with some impressive successes~\cite{Chamberlain2013}.
Some GWAPs, like \tool{JeuxDeMots}\footnote{See: \url{jeuxdemots.org}}, use the participants' basic knowledge of the language, in this particular case to collect associations of ideas \cite{Lafourcade2008}. This is also the case for the word-sense disambiguation games described in \cite{Jurgens2014} or \cite{Venhuizen2013}. Other games, like \tool{Phrase Detectives}\footnote{See: \url{anawiki.essex.ac.uk/phrasedetectives}}, rely on the players' school knowledge, for example to annotate co-reference~\cite{Chamberlain2008}. These GWAPs have demonstrated their efficiency in producing dynamic and reliable language resources. Although they involve some degree of training within the game (especially \tool{Phrase Detectives}), none of them relies as much on the learning capabilities of the participants as \tool{FoldIt} does \cite{Khatib2011a} (in a totally different domain). In this GWAP, designed to help solve protein folding puzzles, the players, who have no previous knowledge of biochemistry, go through various training phases to become able to perform the (increasingly complex) tasks at hand.
\zl is an attempt at using the learning capabilities of the players to have them perform dependency annotation, a task that is considered complex and at which at least some gamification attempts have failed~\cite{Hana2012}. To ensure that the game allows for the production of quality annotations, we designed both a training procedure for the players and an evaluation procedure for the annotations.
We first present the game, then give details on the training provided to the players, before explaining the means of evaluation used throughout the game to ensure annotation quality. Finally, we discuss the current limits of the system and propose ways to improve it.
% ========================================================================================================================
\section{Decomposing the Complexity of the Task}
\label{sec:zl}
\zl has been under development since the end of 2013. The design phase itself took around six months of work with a student trainee (we give details on this and on the gamification of \zl in \cite{Fort2014}). The development was done in several steps, by a student and a part-time engineer\footnote{The game is available at \url{zombilingo.org}.}. We were finally able to hire a full-time engineer, who will work on the game in the coming years.
\subsection{General Mechanism}
For these reasons, the game is for the moment limited to annotating (freely available) French corpora, using the Sequoia corpus~\cite{Candito2012} as a reference. The input corpus is pre-annotated with the \tool{Talismane} parser~\cite{Urieli2013}, then fed into the game, where the players correct the pre-annotations. The resulting annotated corpora are directly and freely available on the game's website\footnote{Generally (depending on the input corpus) under a CC BY-NC-SA license, see: \url{zombilingo.org/information}.}.
It is important to note that the players are not aware that they are correcting annotations. Instead, they are shown a sentence with a highlighted item (lexical unit) and are asked to select the item linked to it by the relation being processed. See Figure~\ref{fig:interface} for an example with the {\bf a\_obj} relation (in French, the relation {\bf a\_obj} links a verb to an argument introduced by the preposition ``à'', as well as to other realizations of this argument).
\begin{figure*}
\begin{center}
\includegraphics[width=12cm]{images/ZL_general.png}
\caption{Annotation interface of ZombiLingo}
\label{fig:interface}
\end{center}
\end{figure*}
As the full dependency syntax annotation of a sentence would be too complex a task for players, we turned it into a sequence of tasks, each focusing on one relation, or one type of relation, at a time. This allows for a focused training of the players.
% ========================================================================================================================
\subsection{Training Mechanism}
Training has been proven to be one of the most effective ways to improve the quality and speed of annotation~\cite{Dandapat2009,Bayerl2011} and it is all the more important in GWAPs, where the participants are not directly accessible for explanations.
In \zl, beginners are offered a limited list of phenomena (relations), which they can only start playing once they have completed the dedicated tutorial, much like in \tool{Phrase Detectives} or \tool{FoldIt}. This phase requires the correct annotation of ten examples of the phenomenon (each in a different sentence) selected from the reference sentences. Players can access instructions\footnote{These instructions are built from the annotation guide and adapted for non-linguists.} at any time (see Professor Frankenperrier in the top right corner of the interface in Figure~\ref{fig:interface}) and are given feedback in case of error (see Figure~\ref{fig:error}).
The tutorial ends only when the player has correctly annotated ten examples. Therefore, the training of a bad player may cover many more than ten sentences. During this phase, the player cannot score any points.
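As a rough illustration, under the simplifying assumption that a player annotates each training item correctly with a fixed probability $q$, the expected number of training items is $10/q$: a player who is right only half of the time will thus see about twenty sentences on average before leaving the tutorial.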
% ========================================================================================================================
\section{Producing Quality Annotations}
\label{sec:eval}
\subsection{Evaluating Against a Reference}
\label{sec:ref}
We observed that decomposing the task so as to focus on one dependency relation (one phenomenon) at a time, coupled with an initial training of the players on each phenomenon, is not enough to ensure quality in the long run.
A participant who went through the training phase some time ago may have forgotten parts of the rules or may have developed some misunderstanding of the phenomenon. This can lead to errors in the annotation. Another typical case is that of players who are tempted to play very quickly, ignoring the rules once trained.
%Obviously, this issue does not appear with all players and it would be annoying for them to be forced to retrain.
To deal with this, we added a specific mechanism to the game, which reuses the set of gold-standard annotations during play to check that the players are still applying the rules.
For each player $p$ and for each phenomenon $x$, we define a score $\alpha(p,x) \in [0;1]$.
This score is intended to record how confident we can be in player $p$ annotating phenomenon $x$.
When an item is selected in the game, it is either a normal item, taken from the pre-annotated sentences, with probability $\alpha(p,x)$, or an item taken from the gold standard, with probability $1 - \alpha(p,x)$.
When a gold-standard item is proposed to the player, the mechanism -- which we term ``control phase'' -- runs as follows:
\begin{itemize}
\item if the player's answer is correct, $\alpha(p,x)$ is increased, meaning that we give more confidence to this player $p$ for this phenomenon $x$ (a maximum of $0.95$ is imposed on $\alpha(p,x)$ to ensure that all players will still be given gold-standard items at some future point in the game),
\item if the player's answer is wrong, the player is warned about the error, $\alpha(p,x)$ is decreased, and we give less confidence to $p$ playing $x$,
\item if a player gives three wrong answers on the same phenomenon, his/her ability to play this phenomenon is deactivated and s/he is invited to go back to the training phase, like a new player.
\end{itemize}
The goal of the last point is to prevent players from using a strategy where they focus on the quantity of items rather than on the quality of the annotation.
If a participant plays too fast (or simply badly), s/he will probably give wrong answers and will have to go through the training phase again, losing time without scoring. A (very) bad player will end up retraining over and over again. Therefore, playing fast is not a winning long-term strategy in \zl.
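Leaving aside the exact adjustment steps, which are implementation parameters of the game (noted here $\delta^{+}$ and $\delta^{-}$), the control mechanism can be summarized schematically as follows: a gold-standard item is drawn with probability $1 - \alpha(p,x)$ and, when the player answers it, the score is updated as
\[
\alpha(p,x) \leftarrow
\left\{
\begin{array}{ll}
\min\bigl(0.95,\ \alpha(p,x) + \delta^{+}\bigr) & \mbox{if the answer is correct,}\\
\max\bigl(0,\ \alpha(p,x) - \delta^{-}\bigr) & \mbox{if it is wrong.}
\end{array}
\right.
\]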
\begin{figure*}
\begin{center}
\includegraphics[width=12cm]{images/ZL_ErreurTuto2.png}
\caption{Feedback given to the player when annotating a reference sentence}
\label{fig:error}
\end{center}
\end{figure*}
\subsection{Selecting Annotations}
In \zl, the selection of the best annotations does not directly depend on the number of participants who ``voted'' for them. Instead, the system relies on the confidence it has in each player, which itself evolves according to the player's performance on reference annotations.
The pre-annotations are assigned an initial confidence score of 5. If a player confirms a pre-annotation, the confidence score of this annotation is increased by the confidence score of the player on this phenomenon (see Section~\ref{sec:ref}) at that point in the game.
If, on the contrary, a player corrects the pre-annotation, a new annotation is created and initialized with the confidence score of the player.
The best annotation at time $T$ is the one with the highest confidence score at that time.
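Schematically, writing $\alpha_p$ for the confidence score of player $p$ on the phenomenon at the time of his/her answer, the confidence score of an annotation $a$ at time $T$ can be written as
\[
c_T(a) = c_0(a) + \sum_{p \in \mathrm{Confirm}_T(a)} \alpha_p
\]
where $c_0(a)$ is 5 if $a$ is a pre-annotation and the confidence score of its creator if $a$ was produced by a player, and $\mathrm{Confirm}_T(a)$ is the set of players who confirmed $a$ before time $T$; the annotation selected at time $T$ is then the one maximizing $c_T$.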
% ========================================================================================================================
\section{Discussion}
\label{sec:discuss}
The current version of the game heavily relies on the pre-annotation tool.
When a new sentence is added to the system, it is automatically annotated by a parser (currently the \tool{Talismane} parser), so the phenomena proposed to the players depend on this pre-annotation.
If the parser has predicted a dependency labeled $x$ with the governor $t_g$ and the dependent $t_d$,
the player will be asked a question like \emph{``what is the dependent of governor $t_g$ with respect to relation $x$?''}\footnote{For some relations, the question is formulated the other way round: \emph{``what is the governor of dependent $t_d$ with respect to relation $x$?''}.}
This way of building game items presented to the player raises two problems.
\subsection{Phenomena to Remove (Noise)}
The first problem appears when the parser is wrong and there is no relation $x$ with governor $t_g$. In this case, the player may be influenced by the system and may try to find an item (lexical unit) in the sentence that would answer the question. In the game interface, the player can click on two crossed bones, to the right of the sentence (see Figure~\ref{fig:interface}), to express that \emph{``There is no such phenomenon here''}. But this ``crossed bones'' answer is not integrated into the training phase or the control phase. We therefore suspect that it is underused by most players, generating some noise in the produced data.
Indeed, looking at the database, we can observe that, among the 10 top players, the usage rate of the crossed bones ranges from 0.15\% to 2.83\%, i.e., very low rates showing that the players do not use them enough.
Focusing, for instance, on the {\bf de\_obj} relation\footnote{This corresponds to arguments introduced by the preposition ``de'', but it also covers some other realizations of the same type of argument, like the clitic \emph{en} or the use of the complementizer \emph{que}.}, we manually explored 50 cases of rejection (i.e., cases where a player clicked the crossed bones) for this phenomenon:
\begin{itemize}
\item in 13 cases, the rejection is wrong (11 concern reference sentences and 2 correctly pre-annotated sentences). This shows that the instructions given to the players are still not clear enough and that the training should be improved (5 of these 13 cases concern the use of the complementizer \emph{que}, 2 the realization with the clitic \emph{en});
\item in the 37 remaining cases, the players correctly rejected wrong pre-annotations.
\end{itemize}
We believe that these results can be improved, at least partly, by modifying the training and control phases so that they include cases where the phenomenon at hand is absent. Another lever is the interface itself, as the crossed bones are not visible enough.
\subsection{Phenomena to Add (Silence)}
The second problem is more difficult to tackle. If some relation $x$ with the governor $t_g$ exists but was not predicted by the parser (not even with a wrong dependent), the player will never be asked the corresponding question (silence). We have introduced a first game mode to deal with this in the case where the parser predicts a relation from $t_g$ to $t_d$, but with some other label $x'$ which is often confused with $x$. In this game mode, the player is given $t_g$ and $t_d$ and has to choose between $x$ and $x'$ for the relation (see Figure~\ref{fig:alt}).
We have identified two such pairs $x$/$x'$ of relations in the \tool{Talismane} pre-annotation: tense auxiliary ({\bf aux.tps}) vs.\ passive auxiliary ({\bf aux.pass}) and verb modifier ({\bf mod}) vs.\ verb argument ({\bf p\_obj.o}).
\begin{figure*}
\begin{center}
\includegraphics[width=12cm]{images/ZL_alt.png}
\caption{Game mode where the user has to select the right relation}
\label{fig:alt}
\end{center}
\end{figure*}
One possible (partial) solution to this problem would be to use more than one pre-annotation tool, and in particular to apply parsers that provide probabilities on their output.
% ========================================================================================================================
\section{Results and Prospects}
As of mid-October 2015, the game has 422 registered players, who have produced nearly 59,000 annotations. However, only 8 players have participated intensively and produced more than 1,000 annotations.
The first evaluation results show an average accuracy of 86\% on the reference sentences for the 10 most active players (best and worst omitted) and of more than 89\% for the 30 most active players (who played easier cases).
These numbers are very encouraging. However, the interface and the training phase have to be improved to take into account the fact that a phenomenon may not appear in a given sentence, thereby reducing the noise in the produced data. We also need to investigate the pre-annotation process in order to find more efficient ways to qualify the produced annotations and the potential ambiguities.
These issues will be addressed in the coming months, together with the integration of a new design and game features, which will hopefully allow us to attract more players and produce better annotations.
Finally, we plan to develop versions of \zl for other languages, to provide freely available corpora with quality dependency syntax annotations.
% \section*{Acknowledgements}
% \zl is funded by Inria and Délégation générale à la langue française et aux langues de France (DGLFLF). A number of students have participated in the project in the past, including Hadrien Chastant and Valentin Stern and we want to thank them here. We also thank Antoine Chenardin, who participated in the development as an engineer.
\bibliographystyle{acl2016}
\bibliography{gwap}
\end{document}