Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting
Krause, Lea,
Tufa, Wondimagegnhue,
Baez Santamaria, Selene,
Daza, Angel,
Khurana, Urja,
and Vossen, Piek
In Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
2023
While the fluency and coherence of Large Language Models (LLMs) in text generation have seen significant improvements, their competency in generating appropriate expressions of uncertainty remains limited. Using a multilingual closed-book QA task and GPT-3.5, we explore how well LLMs are calibrated and express certainty across a diverse set of languages, including low-resource settings. Our results reveal strong performance in high-resource languages but a marked performance decline in lower-resource languages. Across all languages, we observe an exaggerated expression of confidence in the model, which does not align with the correctness or likelihood of its responses. Our findings highlight the need for further research into the accurate calibration of LLMs, especially in multilingual settings.
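The abstract does not spell out how calibration is scored, but a standard metric for this kind of study is the Expected Calibration Error (ECE): predictions are binned by stated confidence, and the gap between average confidence and actual accuracy is averaged across bins. The sketch below is a minimal, hypothetical illustration of that metric, not the paper's actual evaluation code; the example inputs are invented to mimic the overconfidence pattern the abstract describes.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: the bin-weighted average gap between
    mean stated confidence and empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Assign each prediction to one confidence bin (first bin includes 0).
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if not mask.any():
            continue
        bin_acc = correct[mask].mean()       # fraction of answers that were correct
        bin_conf = confidences[mask].mean()  # average confidence the model expressed
        ece += (mask.sum() / len(confidences)) * abs(bin_conf - bin_acc)
    return ece

# Hypothetical data: uniformly high confidence, but only 2 of 5 answers correct,
# yielding a large ECE -- the overconfidence pattern the abstract reports.
confs = [0.95, 0.90, 0.99, 0.85, 0.92]
right = [1, 0, 0, 1, 0]
print(f"ECE: {expected_calibration_error(confs, right):.3f}")
```

A well-calibrated model would have an ECE near zero: answers given with, say, 90% confidence would be correct about 90% of the time.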