Commit 1c07b9e6 authored by Tom N Harris

Use external program to split pages into words

An external tokenizer marks word boundaries by inserting extra spaces into the input text.
The text is sent to the program on STDIN and the tokenized result is read back from STDOUT.

A good choice for Chinese and Japanese is MeCab:
http://sourceforge.net/projects/mecab/
Run it with the command line 'mecab -O wakati'.
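
The STDIN/STDOUT exchange described above can be sketched roughly as follows. This is an illustration in Python, not the code from this commit: the function name `tokenize_external` and its parameters are hypothetical, and the sketch assumes the tokenizer reads UTF-8 text on STDIN and writes the same text back with spaces between words.

```python
import subprocess


def tokenize_external(text, command):
    """Pipe text through an external tokenizer and return the words.

    `command` is the tokenizer's argument list, for example
    ["mecab", "-O", "wakati"] if MeCab is installed (an assumption;
    any program that echoes its input with spaces between words
    would work the same way).
    """
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),   # written to the tokenizer's STDIN
        stdout=subprocess.PIPE,       # tokenized text is read from STDOUT
        check=True,                   # raise if the tokenizer exits non-zero
    )
    # The tokenizer marks word boundaries with whitespace, so a plain
    # whitespace split recovers the individual words.
    return result.stdout.decode("utf-8").split()
```

With MeCab available, a call would look like `tokenize_external(page_text, ["mecab", "-O", "wakati"])`, where `page_text` is a placeholder for the page content. Passing the command as a list avoids shell quoting problems when the text or arguments contain special characters.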

Parent commit: 6c528220