Publications

The Case of Fake News and Automatic Content Generation in the Era of Big Data and Machine Learning

Published as a book chapter in Médias sociaux : perspectives sur les défis liés à la cybersécurité, la gouvernementalité algorithmique et l’intelligence artificielle, 2021

Recommended citation: Garneau, N. (2021, February). The case of fake news and automatic content generation in the era of big data and machine learning. In Médias sociaux : perspectives sur les défis liés à la cybersécurité, la gouvernementalité algorithmique et l’intelligence artificielle (pp. 139-147). https://www.pulaval.com/produit/medias-sociaux-perspectives-sur-les-defis-lies-a-la-cybersecurite-la-gouvernementalite-algorithmique-et-l-intelligence-artificielle

Analogy Training Multilingual Encoders

To appear in the 35th AAAI Conference on Artificial Intelligence, 2021

Language encoders encode words and phrases in ways that capture their local semantic relatedness, but are known to be globally inconsistent. Global inconsistency can seemingly be corrected for, in part, by leveraging signals from knowledge bases, but previous results are partial and limited to monolingual English encoders. We extract a large-scale multilingual, multi-word analogy dataset from Wikidata for diagnosing and correcting global inconsistencies, and implement a four-way Siamese BERT architecture for grounding multilingual BERT (mBERT) in Wikidata through analogy training. We show that analogy training not only improves the global consistency of mBERT and the isomorphism of its language-specific subspaces, but also leads to significant gains on downstream tasks such as bilingual dictionary induction and sentence retrieval.
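For readers curious about what a four-way Siamese setup looks like in practice, here is a minimal, hedged sketch of an analogy scorer over a shared mBERT encoder; the mean pooling, the scoring head, and the binary objective it implies are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: a four-way Siamese analogy scorer over a single shared mBERT encoder.
# The pooling, head, and objective are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SiameseAnalogyScorer(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared across the four slots
        dim = self.encoder.config.hidden_size
        self.score = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def embed(self, batch):
        out = self.encoder(**batch).last_hidden_state          # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)            # (B, T, 1)
        return (out * mask).sum(1) / mask.sum(1)                # mean pooling over tokens

    def forward(self, a, b, c, d):
        ea, eb, ec, ed = (self.embed(x) for x in (a, b, c, d))
        # The head scores whether the analogy a : b :: c : d holds.
        return self.score(torch.cat([ea, eb, ec, ed], dim=-1)).squeeze(-1)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = SiameseAnalogyScorer()
enc = lambda s: tok(s, return_tensors="pt", padding=True)
logit = model(enc(["Paris"]), enc(["France"]), enc(["Berlin"]), enc(["Germany"]))
```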

Recommended citation: Garneau, N., Hartmann, M., Sandholm, A., Ruder, S., Vulic, I., & Søgaard, A. (2021, February). Analogy Training Multilingual Encoders. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.

Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations

Published in Proceedings of the 13th International Conference on Natural Language Generation, 2020

Plumitifs (dockets) were initially a tool for law clerks. Nowadays, they are used as summaries presenting all the steps of a judicial case. Information concerning the parties’ identity, the jurisdiction in charge of administering the case, and some information relating to the nature and the course of the proceeding are available through plumitifs. They are publicly accessible but barely understandable; they are written using abbreviations and refer to provisions of the Criminal Code of Canada, which makes them hard to reason about. In this paper, we propose a simple yet efficient multi-source language generation architecture that leverages both the plumitif and the Criminal Code’s content to generate intelligible plumitif descriptions. It goes without saying that ethical considerations arise when these sensitive documents are made readable and available at scale, legitimate concerns that we address in this paper.
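As a rough illustration of the multi-source idea (not the paper's actual architecture), one can condition a generic sequence-to-sequence model on both sources by concatenating the docket entry with the cited Criminal Code provision; the model name, prompt format, and placeholder texts below are all assumptions.

```python
# Hedged illustration of multi-source conditioning: concatenate the docket entry and
# the cited Criminal Code provision before feeding a generic seq2seq model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

plumitif_entry = "ART 334 A) CCC ..."               # abbreviated docket line (placeholder)
code_provision = "334. Except where otherwise ..."  # text of the cited provision (placeholder)
source = f"plumitif: {plumitif_entry} provision: {code_provision}"

ids = tok(source, return_tensors="pt", truncation=True).input_ids
description = tok.decode(model.generate(ids, max_new_tokens=60)[0], skip_special_tokens=True)
print(description)  # an untrained model will not produce useful text; this shows the plumbing only
```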

Recommended citation: Beauchemin, D., Garneau, N., Gaumond, E., Déziel, P. L., Khoury, R., & Lamontagne, L. (2020, December). Generating Intelligible Plumitifs Descriptions: Use Case Application with Ethical Considerations. In Proceedings of the 13th International Conference on Natural Language Generation (pp. 15-21). https://www.aclweb.org/anthology/2020.inlg-1.3/

PhD Thesis Proposal: Generalizing Pre-Trained Neural Language Models

Published in 2020

In its most generic form, a language model is trained to predict the next symbol given a history of symbols; symbols can be, for example, words or characters. Trained on a corpus, a language model can therefore predict the probability of a sequence of symbols of indefinite length. Language models are used in several applications of natural language processing, such as machine translation, where they are used to select the most likely translation into the target language from a list of candidates. In recent years, neural networks have become the state of the art in language modelling. It has specifically been shown that these pre-trained language models are inherently feature extractors (language representations) on top of which one can add a simple classifier to accomplish a specific task such as text classification or sequence labelling. The Web contains a colossal amount of textual data, and we now see ever larger models, in terms of number of parameters, pushing the state of the art even further. However, these powerful pre-trained language models can only be transferred effectively to tasks in the same language, which for the moment is mainly English.
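The definition above corresponds to the standard chain-rule factorization of a sequence's probability, written out here for reference:

```latex
% Probability of a sequence of symbols w_1, ..., w_T under a language model:
% each factor is the model's next-symbol prediction given the history.
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```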

Recommended citation: Garneau, N. (2020). PhD Thesis Proposal: Generalizing Pre-Trained Neural Language Models. /files/language-models.pdf

A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings: Making the Method Robustly Reproducible as Well

Published in Proceedings of the 12th Language Resources and Evaluation Conference, 2020

In this paper, we reproduce the experiments of Artetxe et al. (2018b) regarding the robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. We show that the reproduction of their method is indeed feasible with some minor assumptions. We further investigate the robustness of their model by introducing four new languages that are less similar to English than the ones proposed by the original paper. In order to assess the stability of their model, we also conduct a grid search over sensible hyperparameters. We then propose key recommendations that apply to any research project in order to deliver fully reproducible research.
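For context, the core of the reproduced method is a self-learning loop that alternates between fitting an orthogonal mapping and re-inducing a translation dictionary. The sketch below shows that skeleton only; it omits the unsupervised initialization, CSLS retrieval, and stochastic dictionary induction that make the full method of Artetxe et al. (2018b) robust, and it assumes L2-normalized embedding rows.

```python
# Minimal sketch of a self-learning loop for cross-lingual embedding mapping:
# alternate between (1) orthogonal Procrustes on the current dictionary and
# (2) re-inducing the dictionary by nearest neighbours.
import numpy as np

def orthogonal_map(X, Z, pairs):
    # X: (n, d) source embeddings, Z: (m, d) target embeddings (rows L2-normalized)
    # pairs: list of (source_idx, target_idx) translation candidates
    src = np.array([i for i, _ in pairs])
    tgt = np.array([j for _, j in pairs])
    U, _, Vt = np.linalg.svd(X[src].T @ Z[tgt])
    return U @ Vt  # orthogonal W minimizing ||X[src] W - Z[tgt]||_F

def induce_dictionary(XW, Z):
    sims = XW @ Z.T                        # cosine similarity for unit-norm rows
    return list(enumerate(sims.argmax(axis=1)))

def self_learn(X, Z, seed_pairs, iters=10):
    pairs = seed_pairs
    for _ in range(iters):
        W = orthogonal_map(X, Z, pairs)
        pairs = induce_dictionary(X @ W, Z)
    return W, pairs
```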

Recommended citation: Garneau, N., Godbout, M., Beauchemin, D., Durand, A., & Lamontagne, L. (2020). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings: Making the method robustly reproducible as well. Proceedings of the 12th Language Resources and Evaluation Conference. https://www.aclweb.org/anthology/2020.lrec-1.681/

Contextual generation of word embeddings for out of vocabulary words in downstream tasks

Published in the Canadian Conference on Artificial Intelligence, 2019

Over the past few years, the use of pre-trained word embeddings to solve natural language processing tasks has considerably improved performance across the board. However, even though these embeddings are trained on gigantic corpora, the vocabulary is fixed, and thus numerous out-of-vocabulary words appear in specific downstream tasks. Recent studies have proposed models able to generate an embedding for an out-of-vocabulary word given its morphology and its context. These models assume that we have sufficient textual data in hand to train them. In contrast, we specifically tackle the case where such data is no longer available and we rely only on pre-trained embeddings. As a solution, we introduce a model that predicts meaningful embeddings from the spelling of a word as well as from the context in which it appears for a downstream task, without the need for pre-training on a given corpus. We thoroughly test our model on a joint tagging task in three different languages. Results show that our model helps consistently on all languages, outperforms other ways of handling out-of-vocabulary words, and can be integrated into any neural model to predict out-of-vocabulary words.
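A minimal sketch of the kind of predictor described above, combining a character-level encoder of the spelling with encoders of the left and right context, is given below; the layer types, sizes, and the way the summaries are combined are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch: predict an embedding for an OOV word from its spelling and its
# sentence context. Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class OOVEmbedder(nn.Module):
    def __init__(self, n_chars, char_dim, word_dim, hidden):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_rnn = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        self.ctx_rnn = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.project = nn.Linear(4 * hidden, word_dim)

    def forward(self, char_ids, left_ctx, right_ctx):
        # char_ids: (B, L) character indices of the OOV word's spelling
        # left_ctx / right_ctx: (B, T, word_dim) embeddings of surrounding words
        _, (h_chars, _) = self.char_rnn(self.char_emb(char_ids))
        _, (h_left, _) = self.ctx_rnn(left_ctx)
        _, (h_right, _) = self.ctx_rnn(right_ctx)
        spell = torch.cat([h_chars[0], h_chars[1]], dim=-1)   # (B, 2*hidden) spelling summary
        ctx = torch.cat([h_left[0], h_right[1]], dim=-1)      # (B, 2*hidden) context summary
        return self.project(torch.cat([spell, ctx], dim=-1))  # predicted word embedding
```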

Recommended citation: Garneau, Nicolas, Jean-Samuel Leboeuf, and Luc Lamontagne. "Contextual generation of word embeddings for out of vocabulary words in downstream tasks." Canadian AI 2019 (2019). https://link.springer.com/chapter/10.1007/978-3-030-18305-9_60

Predicting and interpreting embeddings for out of vocabulary words in downstream tasks

Published in the Conference on Empirical Methods in Natural Language Processing, 2018

We propose a novel way to handle out-of-vocabulary (OOV) words in downstream natural language processing (NLP) tasks. We implement a network that predicts useful embeddings for OOV words based on their morphology and on the context in which they appear. Our model also incorporates an attention mechanism indicating the focus allocated to the left context words, the right context words, or the word’s characters, hence making the prediction more interpretable. The model is a "drop-in" module that is jointly trained with the downstream task’s neural network, thus producing embeddings specialized for the task at hand. When the task is mostly syntactic, we observe that our model focuses most of its attention on surface-form characters. On the other hand, for more semantic tasks, the network allocates more attention to the surrounding words. In all our tests, the module helps the network to achieve better performance in comparison to the use of simple random embeddings.
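The attention mechanism described above can be pictured as a learned softmax weighting over three fixed-size summaries (left context, right context, characters); the sketch below is an illustrative reading of that idea, not the paper's exact module.

```python
# Hedged sketch of a three-way attention over left context, right context, and
# character summaries; vector sizes and the scoring function are assumptions.
import torch
import torch.nn as nn

class ThreeSourceAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, left, right, chars):
        # left, right, chars: (B, dim) summaries of the three information sources
        sources = torch.stack([left, right, chars], dim=1)                 # (B, 3, dim)
        weights = torch.softmax(self.scorer(sources).squeeze(-1), dim=1)   # (B, 3)
        # The weighted sum is the predicted OOV embedding; the weights themselves are
        # inspectable, which is where the interpretability claim comes from.
        return (weights.unsqueeze(-1) * sources).sum(dim=1), weights
```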

Recommended citation: Garneau, Nicolas, Jean-Samuel Leboeuf, and Luc Lamontagne. "Predicting and interpreting embeddings for out of vocabulary words in downstream tasks." EMNLP 2018 (2018): 331. https://www.aclweb.org/anthology/W18-5439/