Tutorial
Aligning LLMs to Low-Resource Languages

Abstract

This tutorial is a detailed guide to collecting data for aligning large language models (LLMs) with low-resource languages (LRLs).
It addresses the challenge of data scarcity in these languages and introduces a pipeline for generating high-quality data, using Swahili as the primary example. The tutorial covers strategies for dataset collection and for aligning LLMs to LRLs, with guidance on producing and using high-quality data for language technology development in under-resourced languages.

Notebooks

AYA_Notebook
LRL_Notebook

Organizers

Nazar Beknazarov
Toloka AI
Ahmet Üstün
Cohere for AI
Marzieh Fadaee
Cohere for AI

Natalia Fedorova
Toloka AI
Sergey Koshelev
Toloka AI
Alisa Smirnova
Toloka AI