Results

For evaluation, we use datasets from the VLSP project.
The corpus is a collection of articles from the "Politics – Society" section of the Vietnamese newspaper "Tuổi trẻ" (The Youth).
All words have been manually spell-checked and segmented by linguists from the Vietnam Lexicography Center.

There are three datasets, plus a copy of each with diacritics removed:

  • The first dataset contains 7K sentences (about 2M syllables), one sentence per line, with no HTML tags.
  • The second dataset contains 27K sentences; it is similar to the first dataset but still has HTML tags.
  • The third dataset contains 70K sentences, in the same format as the second dataset.
  • The 7K-n, 27K-n, and 70K-n datasets contain the same 7K, 27K, and 70K sentences as the first, second, and third datasets respectively, with diacritics removed.
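Diacritic-free variants of this kind can be produced with a simple Unicode normalization pass. The following is a minimal sketch, assuming UTF-8 text and a JavaScript runtime that supports `String.prototype.normalize`. The function name `removeDiacritics` is hypothetical, and this is not necessarily how the 7K-n, 27K-n, and 70K-n files were actually generated.

```javascript
// Hypothetical helper: strip Vietnamese diacritics from a string.
function removeDiacritics(text) {
  return text
    .normalize('NFD')                 // split base letters from combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop combining marks (tones, circumflex, breve, horn)
    .replace(/đ/g, 'd')               // đ has no canonical decomposition, so map it by hand
    .replace(/Đ/g, 'D');
}

console.log(removeDiacritics('tuổi trẻ')); // tuoi tre
```

The underscore that joins syllables is left untouched, so a segmented file keeps its word boundaries after the conversion.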

Features of datasets:

  • The syllables in multisyllabic words are connected by "_".
  • Data files still contain HTML tags, empty lines, and special characters.
  • Input files have extension *.raw; output files have extension *.seg.
  • All words are in lowercase.
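The format described above is straightforward to read back. The following is a minimal sketch, assuming words on a line are separated by whitespace and syllables within a word by "_"; the helper names `parseSegLine` and `countSyllables` are hypothetical and are not part of any of the evaluated tools.

```javascript
// Hypothetical helper: split one line of a *.seg file into words,
// where each word is an array of its syllables.
function parseSegLine(line) {
  const trimmed = line.trim();
  if (trimmed === '') return [];  // data files may contain empty lines
  return trimmed.split(/\s+/).map((word) => word.split('_'));
}

// Hypothetical helper: total number of syllables on a parsed line.
function countSyllables(words) {
  return words.reduce((total, syllables) => total + syllables.length, 0);
}

const words = parseSegLine('sinh_viên học tiếng_việt');
console.log(words.length);          // 3
console.log(countSyllables(words)); // 5
```

This sketch does not strip HTML tags or treat special characters differently; a tag would simply come back as one or more ordinary tokens.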



Our tool, vnSegmenter.js, has the fastest execution time on all datasets: about 5x faster than the next-fastest tool and 10x-100x faster than the slowest one. vnTokenizer appears to have the longest execution time on all datasets. DongDu, UETsegmenter, and JVnTextPro have intermediate execution times.
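
Execution-time comparisons of this kind can be reproduced with a small timing harness. The following is a minimal sketch, assuming Node.js and a segmenter exposed as a synchronous function that takes one line of text. The names `timeSegmenter`, `speedup`, and the stand-in `identity` segmenter are hypothetical; this is neither the benchmark code used for the results above nor the vnSegmenter.js API.

```javascript
// Hypothetical harness: time one synchronous segmenter over an array of lines.
function timeSegmenter(segment, lines) {
  const start = process.hrtime.bigint();  // monotonic clock, in nanoseconds
  const output = lines.map((line) => segment(line));
  const elapsedNs = process.hrtime.bigint() - start;
  return { output, elapsedMs: Number(elapsedNs) / 1e6 };
}

// Speedup of the faster tool over the slower one: slower time / faster time.
function speedup(fastMs, slowMs) {
  return slowMs / fastMs;
}

// Stand-in segmenter for illustration only: returns the line unchanged.
const identity = (line) => line;

const result = timeSegmenter(identity, ['học sinh', 'sinh viên']);
console.log(result.output.length); // 2
console.log(speedup(2, 10));       // 5
```

For a fair comparison, each tool should be run on the same input files on the same machine, preferably several times, with process start-up cost either included for all tools or excluded for all of them.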