Itihasa: A large-scale corpus for Sanskrit to English translation
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Itihasa: A large-scale corpus for Sanskrit to English translation. / Aralikatte, Rahul Rajendra; de Lhoneux, Miryam; Kunchukuttan, Anoop; Søgaard, Anders.
Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics, 2022. p. 191–197.
RIS
TY - GEN
T1 - Itihasa: A large-scale corpus for Sanskrit to English translation
T2 - 8th Workshop on Asian Translation (WAT2021)
AU - Aralikatte, Rahul Rajendra
AU - de Lhoneux, Miryam
AU - Kunchukuttan, Anoop
AU - Søgaard, Anders
PY - 2022
Y1 - 2022
N2 - This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
AB - This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
U2 - 10.18653/v1/2021.wat-1.22
DO - 10.18653/v1/2021.wat-1.22
M3 - Article in proceedings
SP - 191
EP - 197
BT - Proceedings of the 8th Workshop on Asian Translation (WAT2021)
PB - Association for Computational Linguistics
Y2 - 5 August 2021 through 6 August 2021
ER -
ID: 300449427