For evaluation, we collect datasets from the VLSP project.
The data is a collection of articles from the "Politics – Society" section of the Vietnamese newspaper "Tuổi trẻ" (The Youth).
All words have been manually spell-checked and segmented by linguists from the Vietnam Lexicography Center.
There are 3 datasets, plus a copy of each with diacritics removed:
Features of datasets:
Our tool, vnSegmenter.js, achieves the fastest execution time on all datasets (roughly 5x faster than the next-fastest tool and 10x-100x faster than the slowest one).
vnTokenizer seems to have the longest execution time on all datasets.
DongDu, UETsegmenter, and JVnTextpro have average execution times.
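
To make relative figures such as "5x faster" concrete, here is a minimal sketch of how execution time could be measured and turned into ratios. It is not the benchmark harness used for the results above: the helper names `time_segmenter` and `relative_slowdown`, the `repeats` parameter, and the whitespace-splitting stand-in segmenter are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


def time_segmenter(
    segment: Callable[[str], List[str]],
    sentences: List[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes.

    `segment` is any callable mapping a sentence to a list of tokens
    (hypothetical interface; real tools may need a wrapper).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            segment(sentence)
        best = min(best, time.perf_counter() - start)
    return best


def relative_slowdown(times: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Express each tool's time as a multiple of the baseline tool's time."""
    base = times[baseline]
    if base <= 0:
        raise ValueError("baseline time must be positive")
    return {name: t / base for name, t in times.items()}


if __name__ == "__main__":
    # Stand-in segmenter: plain whitespace splitting, NOT a real Vietnamese segmenter.
    sample = ["Tuổi trẻ là một tờ báo"] * 1000
    elapsed = time_segmenter(str.split, sample)
    print(f"whitespace split: {elapsed:.6f} s")
```

With measured times for several tools collected in a dictionary, `relative_slowdown(times, "vnSegmenter.js")` would give each tool's time as a multiple of the fastest one, which is how a statement like "10x-100x" can be read.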