Volume 3, Issue 1, Fall 2023, *Special Issue Edition
DOI: https://doi.org/10.47289/AIEJ20231109
History
Received 19 June 2023
Accepted 25 October 2023
Published 13 November 2023
Contact
Aditya Joshi
217-B, School of Computer Science & Engineering, University of New South Wales, Kensington, NSW 2052
Email: aditya.joshi@unsw.edu.au
Acknowledgements
None
Disclosure of Funding
None
Large language models (LLMs) are typically evaluated on the basis of task-based benchmarks such as MMLU. Such benchmarks do not examine the behaviour of LLMs in specific contexts. This is particularly true in the LGBTI+ context where social stereotypes may result in variation in LGBTI+ terminology. Therefore, domain-specific lexicons or dictionaries may be useful as a representative list of words against which the LLM’s behaviour needs to be evaluated. This paper presents a methodology for evaluation of LLMs using an LGBTI+ lexicon in Indian languages. The methodology consists of four steps: formulating NLP tasks relevant to the expected behaviour, creating prompts that test LLMs, using the LLMs to obtain the output and, finally, manually evaluating the results. Our qualitative analysis shows that the three LLMs we experiment on are unable to detect underlying hateful content. Similarly, we observe limitations in using machine translation as means to evaluate natural language understanding in languages other than English. The methodology presented in this paper can be useful for LGBTI+ lexicons in other languages as well as other domain-specific lexicons. The work done in this paper opens avenues for responsible behaviour of LLMs in the Indian context, especially with prevalent social perception of the LGBTI+ community.
Keywords
Language Models, Fairness, Queerphobia, Evaluation, Natural Language Processing, Indian Languages
NOTE: This paper contains words that are offensive towards the LGBTI+ community and are used solely for the purposes of the study.
Natural language processing (NLP) is a branch of artificial intelligence that deals with computational approaches that operate on text and text-related problems such as sentiment detection. Large language models (LLMs) are an advancement in NLP that represent language and solve NLP problems using stacks of neural networks [1]. LLMs are trained on web corpora scraped from sources such as Wikipedia, social media conversations and discussion forums. Social biases expressed by authors find their way into the source data, thereby posing risks to responsible behaviour of LLMs when presented with hateful and discriminatory input. Evaluation of LLMs in terms of their behaviour in specific contexts assumes importance.
Despite legal reforms and progressive verdicts (cite: Navtej Singh Johar verdict., NALSA 2014, HIV AIDS ACT 2017, Mental Healthcare Act, TG Act) upholding LGBTI+ rights, sexual- and gender minorities in India continue to be disenfranchised and marginalized due to heteropatriarchal socio-cultural norms. Multiple studies among LGBTI+ communities In India highlight experiences and instances of verbal abuse [3, 4, 5, 6, 7], including those experienced by the communities on virtual platforms [8, 9]. Some studies have indicated verbal abuse as among the most common forms of abuse experienced by subsets of LGBTI+ communities in Indian settings [10]. Past work examines news reportage regarding LGBTI+ community in English language [11]. Further, qualitative studies exploring experiences of users on gay dating- and other social media platform, detail accounts of individuals who experience bullying, verbal abuse, harassment, and blackmail due to their expressed and perceived sexual orientation and gender expression [23, 24]. Culture, religious beliefs and legal situation of LGBTI+ people majorly shapes the frameworks of representing LGBTI+ people in newspapers and television (https://humsafar.org/wp-content/uploads/2018/03/pdf_last_line_SANCHAAR-English-Media-Reference-Guide-7th-April-2015-with-Cover.pdf; Accessed on 19th June, 2023). The media in turn shapes up the opinion of its’end users. In India where LGBTI+ people often face marginalization [2], these words reflect social perception of LGBTI+ people. While the language and etiquette surrounding LGBTI+ terminologies continues to evolve globally, the Indian context presents challenges due to the presence of multiple spoken languages and different socio-lingual nuances that may not be entirely understood or documented in existing research or broader literature.
India has 22+ official languages which includes English. Table 1 shows the number of native speakers in India and GPT-4 accuracy on translated MMLU for top-spoken Indian languages. This paper focuses on words referring to LGBTI+ people in some of the Indian languages (those among the top-spoken are highlighted in boldface in the table). The words are grouped into three groups based on their source: social jargon, pejoratives and popular culture. Social jargon refers to jargon pertaining to traditional communities or social groups. An additional challenge posed in identifying and tagging words as “hateful, discriminatory, or homo-/transphobic” lies in recognizing contextual layers in instances where the term is used. For instance, the term “hijra” that is often used by non-LGBTI+ individuals pejoratively is a valid gender identity within Indian contexts. In such instances, usage of the word itself does not intend toward or account for verbal abuse and recognizing its usage as pejorative could depend on the context.
Table 1: Number of native speakers and GPT-4 accuracy for top-spoken Indian languages
Language |
Native Speakers in India (As from Wikipedia; Source: Census 2011) |
GPT-4 3-shot accuracy on MMLU |
English |
259,678 |
85.5% |
Hindi |
528 million |
-Not reported- |
Bengali |
97.2 million |
73.2% |
Marathi |
83 million |
66.7% |
Telugu |
81.1 million |
62.0% |
Tamil |
69.9 million |
-Not reported- |
Punjabi |
33 million |
71.4% |
Use of languages other than English adds a new dimension to the evaluation of LLMs, particularly as users also use transliteration where they write Indian language words using the Latin script used for English. The recent model, GPT-4, reports multilingual ability on MMLU [12], a benchmark consisting of multiple-choice STEM questions in English. To report performance on languages other than English, MMLU datasets are translated into the target language (say, an Indian language), and then tested on GPT-4. However, given the value of evaluating them in the LGBTI+ context in languages other than English, we investigate the research question:
“How do LLMs perform when the input contains LGBTI+ words in Indian languages?”
Our method of evaluation rests on the premise that the words in the lexicon may be used in two scenarios. The scenarios refer to two kinds of input. The first kind of input is where the words are used in a descriptive, un-offensive manner. This may be to seek information about the words. For example, the sentence “What does the word ‘gaandu’ mean?” contains the word ‘gaandu’, an offensive Hindi word used for effeminate men or gay men. The second kind of input is where the words are used in an offensive manner. This refers to hateful sentences such as “Hey, did you look at the gaandu!” contains the word ‘gaandu’ which refers to the anal receptive partner in a MSM relationship. In some instances, the word itself may not be pejorative in its essence. For instance, “Hijra” as an identity is well acknowledged and accepted as a self-identity by many transgender individuals in India. However, even though the word itself is not offensive, it could be used to demean and bully men perceived or presenting as effeminate, impotent and would be considered an abuse in those instances.
The lexicon provides us the words of interest. The performance of LLMs is evaluated using a four-step methodology that uncovers a qualitative and quantitative understanding of behaviour of LLMs. The research presented in this paper opens avenues to investigate a broader theme of research:
Strategies can be put in place to evaluate LLMs on domain-specific dictionaries of words.
The four-step methodology to conduct our evaluation is guided by the two scenarios: descriptive and offensive. The four steps in our method are: task formulation, prompt engineering, LLM usage and manual evaluation. We present our findings via quantitative and qualitative analyses.
In NLP research, LLMs are typically evaluated using natural language understanding [13] benchmarks such as GLUE [14], Big-Bench [15] and MMLU. These benchmarks provide publicly available datasets along with associated leaderboards that summarise advances in the field. GLUE provides datasets for NLP tasks such as sentiment classification for English language datasets. However, NLU benchmarks do not take into account domain-specific behaviour. Such domain-specific behaviour may be required in the context of the LGBTI+ vocabulary. Our work presents a method to evaluate this behaviour.
This work relates to evaluation of LLMs using dictionaries. Past work shows how historical changes in meanings of words may be evaluated using LLMs [16]. Historical meanings of words are tested on the output of LLMs. This relates to old meanings of words. Social jargon words in our lexicon represent traditional communities of LGBTI+ people. They relate to the historical understanding of these words. Historical meanings also change over time. LLMs have been evaluated in terms of change of meaning over time [17]. This relates to pejoratives in our lexicon. The words have evolved in meaning over time - sometimes, the LGBTI+ sense gets added over time. The ability of LLMs to expand abbreviations helps to understand their contextual understanding [18]. This pertains to the two scenarios in which LGBTI+ words may be used. They may be offensive in some context while not in others. While these methods show how LLMs understand the meaning of words in the dictionaries, they do not account for the two scenarios. Given our lexicon, such a distinction is necessary in the evaluation. Our work is able to show the distinction.
The lexicon used in this work was presented in the “Queer in AI” social at NAACL 2021 by the lead author as an invited talk (https://sites.google.com/view/queer-in-ai/naacl-2021; Accessed on 22nd November, 2023). It consists of 38 words: 18 used as social jargon, 17 as pejoratives and 3 in popular culture. The words are primarily in Hindi and Marathi (12 and 9 respectively) but also include words in other languages.
Figure 1 shows the four-step methodology used for evaluation. The LGBTI lexicon acts as the input. Based on the expected behaviours, we formulate NLP tasks in the first step. For each of the tasks, we engineer prompts that serve as inputs to the LLM. Prompts contain placeholders for words in the lexicon. The LLMs are then used to generate the output for prompts with each word provided in a separate prompt. The outputs are manually evaluated to produce accuracy values for a pair of LLM and NLP task. These values indicate the proportion of words in the lexicon for which the model is able to produce the correct response.
Figure 1: Four-step method used for evaluation
Task Formulation: We map the two scenarios of expected usage to three NLP tasks. These are research problems in NLP that have benchmark datasets and approaches of their own. The three tasks are:
We experiment with zero-shot formulation of the tasks. This means that we use the foundation models as it is, and do not provide any labeled examples as a part of the input. The model must figure out the task based on the question in the input.
The next step is prompt engineering. For each task described above, we define textual prompts (or ‘prompts’) as input. Prompts are textual inputs provided to the language models. The LLM must produce a response to the prompt as the prediction. Since the text in a prompt determines the output of the LLM, we define three prompts per task. This allows for giving the model the best chance to produce the correct output.
We experimented with sentences in Indian languages as prompts. However, two of the models we experiment with did not produce any output. As a result, we used prompts that mix words in English and Indian languages. Such code-mixing is common in bilingual Indian language speakers who effectively use Indian language words in a sentence with the syntactic structure of English or vice versa. For each of the tasks, the prompts are as follows:
LLM Usage: The prompts are provided as inputs to LLMs in the sentence completion mode. We experiment with three language models: GPT-Neo, GPT-J and GPT-3, and one web-based demonstration: ChatGPT.
GPT-Neo [19] and GPT-J [20] are open-source models. They were trained on the Pile dataset which is reported to contain biased content. GPT-3 [21] is a proprietary language model, and was trained on 45TB of data which was manually filtered for biased and harmful content. We use GPT-Neo and GPT-J models with 1.3 billion and 6 billion parameters respectively. The GPT-3 model consists of 175 billion parameters which is significantly larger.
We use Google Colab environment with an A100 GPU for our experiments on GPT-Neo and GPT-J. Beam search with a width of 5 is used. For GPT-3, we use the Open AI playground and test on the text-davinci-003 model which is reported to be the best performing model among the options provided in the playground at the time of running the experiments.
ChatGPT was used via its online interface. ChatGPT is a GPT-based model that employs reinforcement learning via human feedback.
Manual Evaluation: The output for every prompt-word pair is recorded. A human evaluator manually evaluates every output. The human evaluator is familiar with the words in the dataset. The evaluation is done in terms of the following questions:
As stated above, we use three prompts per task. To avoid the impact of ineffective prompts on the performance of a model, we report the highest value of accuracy across all prompts for a task as the accuracy of the language model on the task.
Table 2 shows the accuracy values for the three tasks using words in our lexicons. In general, GPT-3 is the best performing model. It produces an accuracy of 81.57%, 82% and 61% for question-answering, machine translation and hate speech detection respectively. ChatGPT which is built on the top of GPT-3 does slightly poorly with 76.31% for question-answering. The ChatGPT tool blocked all inputs for machine translation and hate speech detection by stating that the input contained potentially offensive content. Therefore, the values have not been reported.
GPT-Neo is the worst performing model. It produces 0% accuracy for machine translation. We observe that several outputs of GPT-Neo are in fact transliterations of words in the native script. This is incorrect despite the prompt being ‘Translate into English’. However, it detects hateful content in the case of 47% words.
Table 2: Accuracy values of LLMs with respect to the three tasks using words in our lexicon.
Question-answering |
Machine translation |
Hate speech detection |
||
Is the answer correct? (%) |
Is the answer partially correct? (%) |
Is the translation correct? (%) |
Is the hate label correct? (%) |
|
GPT-3 |
68 |
81.57 |
82 |
61 |
GPT-J |
26 |
39.47 |
18 |
29 |
ChatGPT |
50 |
76.31 |
- |
- |
GPT-Neo |
5 |
28.94 |
0 |
47 |
We also observe that the absolute accuracy values are higher for question-answering as compared to hate speech detection. The models perform better when tasked with describing and translating words in the lexicon as compared to detecting hateful usage of the words.
We manually analyse outputs generated by the LLMs for the three tasks which were marked as incorrect by the evaluator. The following are some examples of errors:
Table 3: Incorrect translations produced by the LLMs
Tritiya panthi (‘third gender’) |
Samalingi (‘homosexual’) |
|
GPT-J |
“three-faced” or “triple-faced.” |
“to be gay” or “to be lesbian.” |
GPT-3 |
Third Gender |
’coming out’ or ‘coming out of the closet’ |
GPT-Neo |
“I am a woman.”, |
’coming out’ or ‘coming out of the closet’ |
The poor performance of the models on machine translation and their inability to translate fundamental notions in the LGBTI+ vocabulary highlight the limitation of translation as a mechanism to evaluate multilingual ability of LLMs. Recent LLMs have claimed multilingual ability using translated versions of benchmarks such as MMLU. Our evaluation suggests that using translated English datasets to make claims about Indian languages ignores their unique variations. Table 4 shows some words in our lexicon (indicated in bold in the middle column) and their corresponding translations to English. The English word ‘sister-in-law’ can be translated as ‘Saali’ or ‘Boudi’ if it is a sister of one’s wife or husband. The latter is used in a homophobic sense towards effeminate gay men. Translation of sentences containing ‘sister-in-law’ to Bangla is likely to generate one among the two words - thereby changing the queer-phobic implications. Similar situation is observed in case of word ‘Mamu’ which is a word for maternal uncle in Bangla and Urdu language. The word is often used as a public tease word for men suspected or assumed to be gay. The adjective ‘meetha’ in Hindi is typically used for sweetmeats/ foods to indicate sweetness. However, when used for a man (as in a ‘he is meetha’), it refers to the condescending implication that the person may be queer. This is not true for the adjective ‘pyaara’ which is used with animate entities to indicate sweetness/likeability (‘he is a sweet boy’ returns ‘wah ek pyara ladka hai’ in Google Translate as of 29th May, 2023 where ‘sweet’ and ‘pyaara’ are the aligned words, although pyara means ‘lovable’). This example shows that translation of Hindi sentences to English may lose out the queerphobic intent since both words map to the English word ‘sweet’. Similarly, the words ‘Gud’,’paavli kam’, ‘Chakka’ (meaning a ball stroke scoring six runs in cricket but used in a derogatory sense for transgender or effiminate people) and ‘thoku’ (meaning a striker but used derogatorily towards male partner engaging in the act of anal sex) are metaphorically used in an offensive sense towards LGBTI people. These words, when translated into English, do not carry the hurtful intent.
Table 4: Example words in our lexicon showing inadequacies of translation
English word |
Indian language word |
Queerphobic? |
sister-in-law |
Saali (Bengali) (sister of one’s wife) |
No |
Boudi (Bengali) (sister of one’s husband) |
Yes |
|
Maternal Uncle |
Mamu (Bengali, Urdu) |
Yes |
sweet |
Meetha (Hindi) (adjective for an object) |
Yes |
Pyaara (Hindi) (adjective for a person) |
No |
|
Jaggery (cane sugar) |
Gud (Hindi) |
Yes |
Three-quarter |
Paavali kam (Hindi) |
Yes |
Sixer |
Chakka (Hindi, Marathi, Gujarati, Punjabi and multiple Indian languages) |
Yes |
Banger |
Thoku (Hindi) |
Yes |
Perversion |
Vikruti (Hindi, Marathi) |
Yes |
Pervert |
Vikrut (Hindi, Marathi) |
Yes |
We identify the following limitations of our work:
Despite the above limitations, the work reports a useful evaluation of LLMs in the context of the Indian language LGBTI+ vocabulary. The evaluation approach reported in the paper can find applications in similar analyses based on lexicons or word lists.
LLMs trained on web data may learn from biases present in the data. We show how LLMs can be evaluated using a domain-specific, language-specific lexicon. Our lexicon is a LGBTI+ vocabulary in Indian languages. Our evaluation covers two scenarios in which the words in the lexicon may be used in the input to LLMs: (a) in an objective sense to seek information, (b) in a subjective sense when the words are used in an offensive manner. We first identify three natural language processing (NLP) tasks related to the scenarios: question-answering, machine translation and hate speech detection. We design prompts corresponding to the three tasks and use three LLMs (GPT-Neo, GPT-J and GPT-3) and a web-based tool (ChatGPT) to obtain sentence completion outputs with the input as the prompts containing words in the lexicon. Our manual evaluation shows that the LLMs perform with a best accuracy of 61-82%. All the models perform better on question-answering and machine translation as compared to hate speech detection. This indicates that the models are able to computationally understand the meaning of the words in the lexicon but do not predict the underlying hateful implications of some of these words. GPT-3 outperforms GPT-Neo and GPT-J on the three tasks. A qualitative analysis of our evaluation uncovers errors corresponding to inappropriate definitions, incomplete contextual understanding and incorrect translation. These error categories serve as basis to examine the behaviour of future LLMs.
A wider implication of this research would be toward strengthening language models for enhanced hate-speech detection that also recognizes contexts as per socio-linguistic nuances and unique variations. While the presented research starts on a smaller premise, the scope can be expanded by a more detailed understanding of Indian LGBTI+ terminologies and contexts, and training LLMs in these contexts. This research thus holds the potential toward making virtual spaces safer for Indian LGBTI+ and contribute substantially toward research on performance of LLMs on multilingual platforms.
In general, we observe that the language models have a limited translation ability for Indian languages. This may indicate that using translated benchmark datasets may result in inaccurate claims about the LLM’s multilingual ability. Our four-step method was conducted on an Indian language LGBTI+ lexicon. The method is equally applicable to any other language. It can also find utility in the context of responsible AI when tasked with evaluating LLMs on other domain-specific lexicons with certain expected behaviours.
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Kindly add Sanchar at number 2
[2] Chakrapani, V., Newman, P. A., Shunmugam, M., Rawat, S., Mohan, B. R., Baruah, D., & Tepjan, S. (2023). A scoping review of lesbian, gay, bisexual, transgender, queer, and intersex (LGBTI+I+) people’s health in India. PLOS Global Public Health, 3(4), e0001362.
[3] Madelaine Adelman PhD & Kathryn Woods MSW (2006) Identification Without Intervention: Transforming the Anti-LGBTI+ School Climate, Journal of Poverty, 10:2, 5-26, DOI: 10.1300/J134v10n02_02
[4] Chakrapani V, Newman PA, Shunmugam M, McLuckie A, Melwin F. (2007) Structural violence against kothi–identified men who have sex with men in Chennai, India: a qualitative investigation. AIDS Education & Prevention. 19(4):346-64.
[5] Biello KB, Thomas BE, Johnson BE, Closson EF, Navakodi P, Dhanalakshmi A, Menon S, Mayer KH, Safren SA, Mimiaga MJ. (2017) Transactional sex and the challenges to safer sexual behaviors: A study among male sex workers in Chennai, India. AIDS care. 29(2):231-8.
[6] Chakrapani, V., Newman, P. A., & Shunmugam, M. (2020). Stigma toward and mental health of hijras/trans women and self-identified men who have sex with men in India. In N. Nakamura & C. H. Logie (Eds.), LGBTI+ mental health: International perspectives and experiences (pp. 103–119). American Psychological Association. https://doi.org/10.1037/0000159-008
[7] Experiences of bullying in schools: A survey among sexual/gender minority youth in Tamil Nadu (2018) http://orinam.net/content/wp-content/uploads/2019/06/UNESCO_researchbrief.pdf
[8] S. Abraham, A. Saju. (2022) Cyber bullying in the LGBTI+ and delineating Indian Government’s role for LGBTI+ in the cyberspace. International Journal of Science and Research (DOI: 10.21275/SR22617213031)
[9] Maji S, Abhiram AH.(2023) “Mental Health Cost of Internet”: A Mixed-Method Study of Cyberbullying among Indian Sexual Minorities. Telematics and Informatics Reports. 100064.
[10] Srivastava A, Davis JP, Patel P, Daniel EE, Karkal S, Rice E.(2022) Polyvictimization, sex work, and depressive symptoms among transgender women and men who have sex with men. Journal of interpersonal violence. 37(13-14):NP11089-109.
[11] Kumari G. Reportage of Decriminalizing LGBTI+ Community in India by Supreme Court: Content Study of Indian Newspapers in English Language. Journal of Media Research. 2019 Sep 1;12(3):76-102.
Badgett, M. V. (2014). The economic cost of stigma and the exclusion of LGBT people: A case study of India.
[12] Hendrycks, D, and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[13] Allen, J. (1995). Natural language understanding. Benjamin-Cummings Publishing Co., Inc..
[14] Wang, A., and Singh, A., Michael, J. and Hill, F. and Levy, O. and Bowman, SR. (2019) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR 2019.
[15] Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., ... & Kim, H. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint
[16] Manjavacas, E., and Fonteyn, L.(2022) “Non-Parametric Word Sense Disambiguation for Historical Languages.” In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities.
[17] Giulianelli, M., Del Tredici, M, and Fernández, R. (2020) “Analysing lexical semantic change with contextualised word representations.” In Proceedings of ACL.
[18] Shanqing C., et al. (2022.) Context-Aware Abbreviation Expansion Using Large Language Models. In Proceedings of NAACL.
[19] Black, S., Gao, L., Wang, P. , Leahy, C., Biderman, S. (2021) GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, Zenodo. https://doi.org/10.5281/zenodo.5297715.
[20] Wang, Ben and Komatsuzaki, Aran. (2021) GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, https://github.com/kingoflolz/mesh-transformer-jax.
[21] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
[22] Vanita, R. (Ed.). (2013). Queering India: Same-sex love and eroticism in Indian culture and society. Routledge.
[23] Birnholtz J, Rawat S, Vashista R, Baruah D, Dange A, Boyer AM. (2020). Layers of marginality: an exploration of visibility, impressions, and cultural context on geospatial apps for men who have sex with men in Mumbai, India. Social Media+ Society. 6(2):2056305120913995.
[24] Pinch A, Birnholtz J, Rawat S, Bhatter A, Baruah D, Dange A. (2022) “Someone Else Is Behind The Screen”: Visibility, Privacy, and Trust on Geosocial Networking Apps in India. Social Media+ Society.(3):2056305122112607