Tokenization for i+1 language learning is non-trivial
When automatically generating flashcards for language learning, it is straightforward at the word level to create prompts that contain only one unknown word. However, multi-word expressions or n-grams (e.g. idioms, typical noun phrases) are self-contained and behave as atomic units: a learner may know every word in an idiom and still not know the idiom itself. Segmenting a sentence into unknown structures therefore has to go beyond the word level, which makes it non-trivial.
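As a rough illustration of the point above, here is a minimal sketch of a greedy longest-match segmenter that treats multi-word expressions as single units before counting unknowns. The lexicon `MWES`, the function names, and the whitespace-based word splitting are all assumptions made for this example; a real system would need a proper tokenizer, lemmatization, and a much larger expression inventory.

```python
from typing import List, Set, Tuple

Unit = Tuple[str, ...]

# Hypothetical lexicon of multi-word expressions (lowercase word tuples).
MWES: Set[Unit] = {("kick", "the", "bucket"), ("of", "course")}
MAX_MWE_LEN = max(len(m) for m in MWES)


def segment(sentence: str) -> List[Unit]:
    """Split a sentence into units, preferring the longest known expression."""
    words = sentence.lower().split()  # simplification: whitespace splitting only
    units: List[Unit] = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, falling back to a single word.
        for n in range(min(MAX_MWE_LEN, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if n == 1 or span in MWES:
                units.append(span)
                i += n
                break
    return units


def unknown_units(sentence: str, known: Set[Unit]) -> List[Unit]:
    """Return the units of the sentence that the learner does not know yet."""
    return [u for u in segment(sentence) if u not in known]


def is_i_plus_one(sentence: str, known: Set[Unit]) -> bool:
    """True if the sentence contains exactly one distinct unknown unit."""
    return len(set(unknown_units(sentence, known))) == 1


# The learner knows every individual word, but not the idiom.
known = {("he",), ("will",), ("kick",), ("the",), ("bucket",), ("soon",)}
print(segment("he will kick the bucket soon"))
print(is_i_plus_one("he will kick the bucket soon", known))
```

In this sketch a purely word-level check would find no unknown words in the sentence, while the unit-level check flags the idiom as the single unknown item. Greedy longest-match is only one possible strategy and will mis-segment literal uses of an expression.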