Showing 1–8 of 8 results for author: Antoun, W

  1. arXiv:2407.20595  [pdf, other]

    cs.DL cs.CL

    Harvesting Textual and Structured Data from the HAL Publication Repository

    Authors: Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary

    Abstract: HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text…

    Submitted 30 July, 2024; originally announced July 2024.

  2. arXiv:2309.13322  [pdf, other]

    cs.CL

    From Text to Source: Results in Detecting Large Language Model-Generated Content

    Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah

    Abstract: The widespread use of Large Language Models (LLMs), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. Addressing these concerns necessitates the development of robust methods to detect and attribute text generated by LLMs. This paper investigates "Cross-Model Detection," by evaluating whether a classifier trained to disting…

    Submitted 27 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

    Comments: Accepted to COLING-LREC 2024

  3. arXiv:2306.05871  [pdf, ps, other]

    cs.CL

    Towards a Robust Detection of Language Model Generated Text: Is ChatGPT that Easy to Detect?

    Authors: Wissam Antoun, Virginie Mouilleron, Benoît Sagot, Djamé Seddah

    Abstract: Recent advances in natural language processing (NLP) have led to the development of large language models (LLMs) such as ChatGPT. This paper proposes a methodology for developing and evaluating ChatGPT detectors for French text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. The proposed method involves translating an English dataset into Fr…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: Accepted to TALN 2023

  4. arXiv:2306.01497  [pdf, other]

    cs.CL

    Data-Efficient French Language Modeling with CamemBERTa

    Authors: Wissam Antoun, Benoît Sagot, Djamé Seddah

    Abstract: Recent advances in NLP have significantly improved the performance of language models on a variety of tasks. While these advances are largely driven by the availability of large amounts of data and computational power, they also benefit from the development of better training methods and architectures. In this paper, we introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 ar…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada

  5. arXiv:2103.04353  [pdf, other]

    cs.CL

    Empathetic BERT2BERT Conversational Model: Learning Arabic Language Generation with Little Data

    Authors: Tarek Naous, Wissam Antoun, Reem A. Mahmoud, Hazem Hajj

    Abstract: Enabling empathetic behavior in Arabic dialogue agents is an important aspect of building human-like conversational models. While Arabic Natural Language Processing has seen significant advances in Natural Language Understanding (NLU) with language models such as AraBERT, Natural Language Generation (NLG) remains a challenge. The shortcomings of NLG encoder-decoder models are primarily due to the…

    Submitted 7 March, 2021; originally announced March 2021.

  6. arXiv:2012.15520  [pdf, other]

    cs.CL

    AraGPT2: Pre-Trained Transformer for Arabic Language Generation

    Authors: Wissam Antoun, Fady Baly, Hazem Hajj

    Abstract: Recently, pre-trained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging in comparison to other NLP advances primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced…

    Submitted 7 March, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

  7. arXiv:2012.15516  [pdf, other]

    cs.CL

    AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding

    Authors: Wissam Antoun, Fady Baly, Hazem Hajj

    Abstract: Advances in English language representation enabled a more sample-efficient pre-training task by Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Which, instead of training a model to recover masked tokens, it trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. On the other hand, curr…

    Submitted 7 March, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

  8. arXiv:2003.00104  [pdf, ps, other]

    cs.CL

    AraBERT: Transformer-based Model for Arabic Language Understanding

    Authors: Wissam Antoun, Fady Baly, Hazem Hajj

    Abstract: The Arabic language is a morphologically rich language with relatively few resources and a less explored syntax compared to English. Given these limitations, Arabic Natural Language Processing (NLP) tasks like Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA), have proven to be very challenging to tackle. Recently, with the surge of transformers based models, lan…

    Submitted 7 March, 2021; v1 submitted 28 February, 2020; originally announced March 2020.

    Comments: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France (2020)