public class RegexTokenizer extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
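A regex-based tokenizer that extracts tokens either by splitting the text on the provided regex pattern (if gaps is true) or by repeatedly matching the pattern (if gaps is false).
Optional parameters also allow filtering tokens using a minimal length.
It returns an array of strings that can be empty.
Constructor and Description |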
---|
RegexTokenizer() |
RegexTokenizer(String uid) |
Modifier and Type | Method and Description |
---|---|
BooleanParam | gaps(): Indicates whether regex splits on gaps (true) or matches tokens (false). |
boolean | getGaps() |
int | getMinTokenLength() |
String | getPattern() |
IntParam | minTokenLength(): Minimum token length, >= 0. |
Param<String> | pattern(): Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. |
RegexTokenizer | setGaps(boolean value) |
RegexTokenizer | setMinTokenLength(int value) |
RegexTokenizer | setPattern(String value) |
String | uid() |
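The two modes controlled by gaps can be illustrated with a minimal sketch that uses only java.util.regex. It is not Spark's implementation: the class name GapsSketch and the method tokenize are hypothetical names introduced here, and the sketch assumes that "splits on gaps" means splitting the text on every match of the pattern, while "matches tokens" means collecting every match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; GapsSketch and tokenize are hypothetical names,
// not part of the RegexTokenizer API.
final class GapsSketch {

    // gaps == true:  the pattern describes delimiters, so split the text on it.
    // gaps == false: the pattern describes tokens, so collect every match.
    static List<String> tokenize(String text, String pattern, boolean gaps) {
        List<String> tokens = new ArrayList<>();
        Pattern compiled = Pattern.compile(pattern);
        if (gaps) {
            for (String piece : compiled.split(text)) {
                tokens.add(piece);
            }
        } else {
            Matcher matcher = compiled.matcher(text);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, Spark ML";
        // Split on whitespace: punctuation stays attached to the token.
        System.out.println(tokenize(text, "\\s+", true));   // [Hello,, Spark, ML]
        // Match word characters: punctuation is dropped.
        System.out.println(tokenize(text, "\\w+", false));  // [Hello, Spark, ML]
    }
}
```

The same input yields different tokens depending on the mode, so the pattern should be written for the mode in use: a delimiter pattern when gaps is true, a token pattern when gaps is false.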
Methods inherited from UnaryTransformer: setInputCol, setOutputCol, transform, transformSchema
Methods inherited from Transformer: copy, transform, transform, transform
Methods inherited from Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
Methods inherited from Params: clear, copyValues, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, setDefault, shouldOwn, validateParams
public RegexTokenizer(String uid)
public RegexTokenizer()
public String uid()
public IntParam minTokenLength()
public RegexTokenizer setMinTokenLength(int value)
public int getMinTokenLength()
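A minimal sketch of the length filter follows, in plain Java. It assumes that tokens shorter than the configured minimum are dropped and that a negative minimum is rejected, in line with the ">= 0" constraint. MinLengthSketch and filterByLength are hypothetical names, not part of the RegexTokenizer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not Spark's implementation.
final class MinLengthSketch {

    // Keeps the tokens whose length is at least minTokenLength.
    static List<String> filterByLength(List<String> tokens, int minTokenLength) {
        if (minTokenLength < 0) {
            throw new IllegalArgumentException("minTokenLength must be >= 0");
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= minTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "", "to", "spark");
        System.out.println(filterByLength(tokens, 1));   // [a, to, spark]
        System.out.println(filterByLength(tokens, 3));   // [spark]
        System.out.println(filterByLength(tokens, 10));  // []
    }
}
```

The last call shows how the result can be an empty array: every token is shorter than the minimum.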
public BooleanParam gaps()
public RegexTokenizer setGaps(boolean value)
public boolean getGaps()
public Param<String> pattern()
Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.
Default: "\\s+"
public RegexTokenizer setPattern(String value)
public String getPattern()
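The default pattern is shown above as a Java string literal. The sketch below, using only java.util.regex, illustrates that the literal "\\s+" denotes the three-character regex \s+ (one or more whitespace characters) and what splitting on it produces. DefaultPatternSketch, DEFAULT_PATTERN, and splitOnDefault are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch only; not Spark's implementation.
final class DefaultPatternSketch {

    // Java source escapes the backslash, so this literal is the regex \s+.
    static final String DEFAULT_PATTERN = "\\s+";

    // Splits the text on runs of whitespace (spaces, tabs, newlines).
    static String[] splitOnDefault(String text) {
        return Pattern.compile(DEFAULT_PATTERN).split(text);
    }

    public static void main(String[] args) {
        System.out.println(DEFAULT_PATTERN.length());  // 3
        System.out.println(Arrays.toString(splitOnDefault("one  two\tthree\nfour")));
        // [one, two, three, four]
    }
}
```

When passing a custom pattern to setPattern from Java source, the same escaping applies: each regex backslash is written as two backslashes in the string literal.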