Itihasa: A large-scale corpus for Sanskrit to English translation

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, the Ramayana and the Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
Original language: English
Title of host publication: Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Publisher: Association for Computational Linguistics
Publication date: 2022
Pages: 191–197
Publication status: Published - 2022
Event: 8th Workshop on Asian Translation (WAT2021) - Online
Duration: 5 Aug 2021 to 6 Aug 2021

Conference

Conference: 8th Workshop on Asian Translation (WAT2021)
City: Online
Period: 05/08/2021 to 06/08/2021

ID: 300449427