
For evaluation, we collected datasets from the VLSP project.
They consist of articles from the "Politics – Society" section of the Vietnamese newspaper "Tuổi trẻ" (The Youth).
All words have been manually spell-checked and segmented by linguists from the Vietnam Lexicography Center.

There are 3 datasets, plus a copy of each with diacritics removed:

The first dataset includes 7K sentences (about 2M syllables), one sentence per line, with no HTML tags.
The second dataset includes 27K sentences; it is similar to the first dataset but still has HTML tags.
The third dataset includes 70K sentences and is otherwise the same as the second dataset.
The 7K-n, 27K-n, and 70K-n datasets contain the 7K, 27K, and 70K sentences from the first, second, and third datasets respectively, with diacritics removed.
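
The script used to build the diacritic-free copies is not shown here. As a minimal sketch of how such a copy could be produced, assuming plain Unicode decomposition is enough, the function below (the name strip_diacritics is hypothetical) removes tone and vowel marks and maps "đ" to "d" by hand, since that letter does not decompose.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, keeping the base Latin letters."""
    # "đ"/"Đ" are standalone letters, not base + combining mark,
    # so Unicode decomposition does not touch them; map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits a letter such as "ổ" into "o" plus its combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("tuổi_trẻ việt_nam"))
```

The "_" separators and all other ASCII characters pass through unchanged, so the same function could be applied to both input and output files.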

Features of datasets:

The syllables in multisyllabic words are connected by "_".
Data files still contain HTML tags, empty lines, and special characters.
Input files have extension *.raw; output files have extension *.seg.
All words are in lowercase.
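
To illustrate the format described above, here is a small sketch of reading one line of a segmented file. It assumes words are separated by whitespace and that "_" appears only as a syllable connector; the function name parse_segmented_line is hypothetical and not part of any tool mentioned here.

```python
def parse_segmented_line(line: str) -> list[list[str]]:
    """Split a segmented line into words, each word a list of syllables."""
    # Words are separated by whitespace; syllables inside a word by "_".
    return [word.split("_") for word in line.split()]


if __name__ == "__main__":
    words = parse_segmented_line("học_sinh đi học")
    print(words)  # [['học', 'sinh'], ['đi'], ['học']]
```

An empty line yields an empty list, so blank lines in the data files need no special handling in this sketch.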

Our tool, vnSegmenter.js, has the fastest execution time across all datasets (roughly 5x faster than the next-fastest tool and 10x-100x faster than the slowest one). vnTokenizer appears to have the longest execution time on all datasets. DongDu, UETsegmenter, and JVnTextpro have average execution times.

Results

1. Datasets

2. Execution time

3. Precision percentage

4. F1-score
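
The exact scoring procedure behind these numbers is not described here. One common way to score word segmentation, shown below as a sketch under that assumption, is to turn each sentence into a set of word spans over syllable positions and count the spans that the gold and predicted segmentations share. The names word_spans and score are hypothetical.

```python
def word_spans(segmented: str) -> set[tuple[int, int]]:
    """Return (start, end) syllable offsets for each word in a segmented line."""
    spans = set()
    start = 0
    for word in segmented.split():
        length = len(word.split("_"))  # number of syllables in this word
        spans.add((start, start + length))
        start += length
    return spans


def score(gold: str, predicted: str) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for one predicted line."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with identical boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Gold has 3 words; the prediction wrongly splits the first word in two.
    print(score("học_sinh đi học", "học sinh đi học"))
```

In this example two of the four predicted words match the gold segmentation, giving a precision of 0.5 and a recall of 2/3. For a whole dataset, the counts would normally be summed over all lines before dividing.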