Itihasa: A large-scale corpus for Sanskrit to English translation
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, the Ramayana and the Mahabharata. We first describe the motivation behind curating such a dataset, then present an empirical analysis that brings out its nuances. Finally, we benchmark standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, which underscores the complexity of the dataset.
Original language | English |
---|---|
Title of host publication | Proceedings of the 8th Workshop on Asian Translation (WAT2021) |
Publisher | Association for Computational Linguistics |
Publication date | 2022 |
Pages | 191–197 |
Publication status | Published - 2022 |
Event | 8th Workshop on Asian Translation (WAT2021), Online, 5 Aug 2021 → 6 Aug 2021 |
Conference
Conference | 8th Workshop on Asian Translation (WAT2021) |
---|---|
City | Online |
Period | 05/08/2021 → 06/08/2021 |
ID: 300449427