10 packages returned for Tags:"Tokenization"
- 19,012 total downloads
- last updated 8/25/2022
- Latest version: 188.8.131.52
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
- 9,349 total downloads
- last updated 11/22/2015
- Latest version: 0.6.5
TextMatch is a library for searching inside texts using Lucene query expressions. Supports all types of Lucene query expressions - boolean, wildcard, fuzzy. Options are available for tweaking tokenization, such as case-sensitivity and word stemming.