Research

My research interests lie broadly in the fields of Machine Learning, Natural Language Processing, and Computational Linguistics. My current research crush is distributional semantics, the explainability of LLMs, and BERTology.

My aim is to contribute effectively to the development of Language Models. I am excited about discovering what kind of linguistic information LLMs use to carry out a task, and about deepening the general understanding of black-box models.

Publications

  • Salogni at GeoLingIt: Geolocalization by Fine-tuning BERT
    Ilaria Salogni
    [code]

    Abstract: The recent growing interest in low-resource languages has been significantly bolstered by transformer-based models. By fine-tuning three such models, two based on BERT and one on RoBERTa, I aim to geolocate sequences exhibiting non-standard language varieties, relying solely on their linguistic content. I find that, since the information contained in the embeddings is all we need to carry out this complex task, an architecture with fewer task-specific layers leads to better results. Furthermore, models pre-trained on miscellaneous corpora generalize better than those trained exclusively on tweets. The work also shows that the greater availability of resources for a given regional variety positively affects the model's performance on it.
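
    Below is a minimal sketch of this kind of fine-tuning with the Hugging Face transformers library. The model name, label set, and example data are illustrative assumptions, not the exact GeoLingIt configuration.

    # Minimal sketch: fine-tuning an Italian BERT to classify a text's region.
    # Model name, labels, and data are illustrative, not the GeoLingIt setup.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    REGIONS = ["Lombardia", "Sicilia", "Veneto"]  # hypothetical subset of regions

    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-italian-cased", num_labels=len(REGIONS))

    train = Dataset.from_dict({
        "text": ["Bella raga, andiamo in centro", "Amunì, si parte"],
        "label": [0, 1],  # indices into REGIONS
    })
    train = train.map(
        lambda b: tokenizer(b["text"], truncation=True,
                            padding="max_length", max_length=128),
        batched=True)

    args = TrainingArguments(output_dir="geo-bert", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train).train()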

Research Projects

  • From classification to quantification: directly estimating cause-of-death prevalence in a verbal autopsy dataset
    Master's Thesis in Human Language Technologies @Unipi
    [code]

    Abstract: Verbal autopsies are standardized textual questionnaires that gather information about the symptoms experienced by the deceased in the days preceding death; they were developed to provide basic death registration even in countries with weak civil registration systems. These documents pertain to epidemiology, where the aim aligns more with quantification (estimating the prior distribution of the classes) than with classification, understood as minimizing errors in assigning the correct pathology to each individual death. We developed novel text quantification techniques that leverage the dataset's intrinsic hierarchical nature. Surprisingly, many of the supposedly more sophisticated quantification methods failed to outperform a simple Classify-and-Count approach (sketched below). However, using a hierarchical classification algorithm instead of a flat one yielded interesting outcomes, as it produced the same results with a shorter training time. By testing hierarchical quantification algorithms in various settings, this work introduces the concept of hierarchical quantification for the first time. Nevertheless, our results show no significant improvement over algorithms that disregard the hierarchical nature of the labels.
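
    Here is a minimal sketch of the Classify-and-Count baseline, using scikit-learn; the texts and cause-of-death classes are illustrative placeholders, not data from the thesis.

    # Classify-and-Count (CC): classify every test document, then report the
    # relative frequency of each predicted class as the estimated prevalence.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["fever and cough for a week", "chest pain, short of breath"]
    train_labels = ["infectious", "cardiovascular"]  # hypothetical classes
    test_texts = ["high fever, persistent cough", "sudden chest pain",
                  "fever and chills"]

    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

    pred = clf.predict(vec.transform(test_texts))
    classes, counts = np.unique(pred, return_counts=True)
    print(dict(zip(classes, counts / counts.sum())))  # estimated class priors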

  • A quantitative analysis of morphological complexity using French verbal system data
    Assignment in Computational Psycholinguistics course @Unipi
    [code]

    Abstract: As Marzi, Pirrelli and Ferro argue, quantitatively assessing the comparative complexity of inflectional systems across languages is a hard task because of typological diversity. The processing-oriented approach I reproposed consists in training a Temporal Self-Organizing Map (TSOM, a recurrent variant of Kohonen's Self-Organizing Maps) on a set of top-frequency French verb forms, with no added information about the morphosemantic and morphosyntactic content conveyed by the forms, and assessing the behaviour of the network after training. The conclusion discusses how morphological inflection can be seen as "a collective, emergent system, whose global self-organization rests on a surprisingly small handful of language-independent principles of word coactivation and competition".
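
    For illustration, here is a highly simplified, self-contained sketch of the temporal map dynamics, under my own assumptions; it is not Marzi, Pirrelli and Ferro's implementation. Each node holds input weights over a character alphabet and temporal weights over the node that was active at the previous time step.

    # Simplified TSOM-like sketch (illustrative only, not the authors' code):
    # nodes carry input (symbol) weights and temporal (unit-to-unit) weights.
    import numpy as np

    ALPHABET = list("abcdefghijklmnopqrstuvwxyz#")  # '#' marks word boundaries
    SIDE, N = 10, 100                               # 10x10 map, assumed size
    rng = np.random.default_rng(0)
    W_in = rng.random((N, len(ALPHABET)))           # input weights
    W_t = rng.random((N, N))                        # temporal weights
    coords = np.array([(i // SIDE, i % SIDE) for i in range(N)])

    def train_word(word, lr=0.1, sigma=2.0, alpha=0.5):
        prev = None
        for ch in "#" + word + "#":
            x = np.zeros(len(ALPHABET))
            x[ALPHABET.index(ch)] = 1.0
            # activation: input match plus temporal expectation from prev node
            act = -np.linalg.norm(W_in - x, axis=1)
            if prev is not None:
                act = (1 - alpha) * act + alpha * W_t[:, prev]
            bmu = int(np.argmax(act))               # best-matching unit
            # Gaussian neighbourhood update of the input weights
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            W_in += lr * h[:, None] * (x - W_in)
            if prev is not None:                    # Hebbian temporal update
                W_t[bmu, prev] += lr * (1 - W_t[bmu, prev])
            prev = bmu

    for form in ["parle", "parlons", "parlez"]:     # top-frequency forms, say
        train_word(form)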

  • An overview of some huge changes that took place in Natural Language Processing in the past 10 years
    Assignment in Digital Culture Seminar @Unipi
    [report]

    Abstract: I wrote this report for the Digital Culture Seminar course, taking the opportunity to investigate, in a piece of personal research, how linguistic information is represented within LLMs. To do so, I first discuss some huge changes that took place in Natural Language Processing in the past 10 years: the new role of corpora and of the classical pipeline, and the introduction of embeddings.

  • A domain-adaptation assessment in a semi-automatic annotation pipeline
    Assignment in Computational Linguistics II course @Unipi
    [report]

    Abstract: This hands-on project consists in revising the semi-automatic annotation of a textual corpus, carried out with the trainable UDPipe pipeline using a data-driven model, with the aim of creating a gold corpus after manual correction by two human annotators. The texts were deliberately chosen from genres and eras different from those of the model's training corpus, in order to observe the difficulties raised by domain adaptation; for the same purpose, the gold corpus was used to evaluate two other automatic annotation models for Italian.
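
    As a pointer to how the automatic step works, here is a minimal sketch of annotating raw text with a pre-trained model through the ufal.udpipe Python bindings; the model file name is an assumption, not necessarily the one used in the project.

    # Minimal sketch: raw text -> CoNLL-U with a UDPipe model (file name assumed).
    from ufal.udpipe import Model, Pipeline, ProcessingError

    model = Model.load("italian-isdt-ud-2.5-191206.udpipe")
    if model is None:
        raise RuntimeError("cannot load UDPipe model")
    pipeline = Pipeline(model, "tokenize",
                        Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

    error = ProcessingError()
    conllu = pipeline.process("Nel mezzo del cammin di nostra vita.", error)
    if error.occurred():
        raise RuntimeError(error.message)
    print(conllu)  # annotation to be manually corrected by human annotators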

Please have a look at my Curriculum Vitae for a comprehensive list of my projects.