Itihasa: A large-scale corpus for Sanskrit to English translation

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Itihasa: A large-scale corpus for Sanskrit to English translation. / Aralikatte, Rahul Rajendra; de Lhoneux, Miryam; Kunchukuttan, Anoop; Søgaard, Anders.

Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics, 2022. p. 191–197.

Harvard

Aralikatte, RR, de Lhoneux, M, Kunchukuttan, A & Søgaard, A 2022, Itihasa: A large-scale corpus for Sanskrit to English translation. in Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics, pp. 191–197, 8th Workshop on Asian Translation (WAT2021), Online, 05/08/2021. https://doi.org/10.18653/v1/2021.wat-1.22

APA

Aralikatte, R. R., de Lhoneux, M., Kunchukuttan, A., & Søgaard, A. (2022). Itihasa: A large-scale corpus for Sanskrit to English translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021) (pp. 191–197). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.wat-1.22

Vancouver

Aralikatte RR, de Lhoneux M, Kunchukuttan A, Søgaard A. Itihasa: A large-scale corpus for Sanskrit to English translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics. 2022. p. 191–197. https://doi.org/10.18653/v1/2021.wat-1.22

Author

Aralikatte, Rahul Rajendra; de Lhoneux, Miryam; Kunchukuttan, Anoop; Søgaard, Anders. / Itihasa: A large-scale corpus for Sanskrit to English translation. Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics, 2022. pp. 191–197.

Bibtex

@inproceedings{6a0e43960a2b4863857b11e2b98df027,
  title = "Itihasa: A large-scale corpus for Sanskrit to English translation",
  abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
  author = "Aralikatte, {Rahul Rajendra} and {de Lhoneux}, Miryam and Anoop Kunchukuttan and Anders S{\o}gaard",
  year = "2022",
  doi = "10.18653/v1/2021.wat-1.22",
  language = "English",
  pages = "191--197",
  booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
  publisher = "Association for Computational Linguistics",
  note = "8th Workshop on Asian Translation (WAT2021) ; Conference date: 05-08-2021 Through 06-08-2021",
}

RIS

TY  - GEN
T1  - Itihasa
T2  - 8th Workshop on Asian Translation (WAT2021)
AU  - Aralikatte, Rahul Rajendra
AU  - de Lhoneux, Miryam
AU  - Kunchukuttan, Anoop
AU  - Søgaard, Anders
PY  - 2022
Y1  - 2022
N2  - This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
AB  - This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
U2  - 10.18653/v1/2021.wat-1.22
DO  - 10.18653/v1/2021.wat-1.22
M3  - Article in proceedings
SP  - 191
EP  - 197
BT  - Proceedings of the 8th Workshop on Asian Translation (WAT2021)
PB  - Association for Computational Linguistics
Y2  - 5 August 2021 through 6 August 2021
ER  -

ID: 300449427

Department of Communication