# Tokenizer

Tokenizer — parses any string, slice or infinite buffer into tokens.

Main features:

* High performance.
* No regexp.
* Provides a [simple API](https://pkg.go.dev/github.com/bzick/tokenizer).
* Supports [integer](#integer-number) and [float](#float-number) numbers.
* Supports [quoted strings or other "framed"](#framed-string) strings.
* Supports [injections](#injection-in-framed-string) in quoted or "framed" strings.
* Supports unicode.
* [Customization of tokens](#user-defined-tokens).
* Autodetects whitespace symbols.
* Parses any data syntax (xml, [json](https://github.com/bzick/tokenizer/blob/master/example_test.go), yaml) and any programming language.
* Single pass through the data.
* Parses [infinite incoming data](#parse-buffer) and doesn't panic.

Use cases:

- Parsing html, xml, [json](./example_test.go), yaml and other text formats.
- Parsing huge or infinite texts.
- Parsing any programming language.
- Parsing templates.
- Parsing formulas.

For example, parsing the SQL `WHERE` condition `user_id = 119 and modified > "2020-01-01 00:00:00" or amount >= 122.34`:

```go
// define custom token keys
const (
	TEquality     = 1
	TDot          = 2
	TMath         = 3
	TDoubleQuoted = 4
)

// configure tokenizer
parser := tokenizer.New()
parser.DefineTokens(TEquality, []string{"<", "<=", "==", ">=", ">", "!="})
parser.DefineTokens(TDot, []string{"."})
parser.DefineTokens(TMath, []string{"+", "-", "/", "*", "%"})
parser.DefineStringToken(TDoubleQuoted, `"`, `"`).SetEscapeSymbol(tokenizer.BackSlash)

// create a token stream
stream := parser.ParseString(`user_id = 119 and modified > "2020-01-01 00:00:00" or amount >= 122.34`)
defer stream.Close()

// iterate over each token
for stream.IsValid() {
	if stream.CurrentToken().Is(tokenizer.TokenKeyword) {
		field := stream.CurrentToken().ValueString()
		// ...
	}
	stream.Next()
}
```

The token stream:

```
string:  user_id = 119 and modified > "2020-01-01 00:00:00" or amount >= 122.34
tokens: |user_id| =| 119| and| modified| >| "2020-01-01 00:00:00"| or| amount| >=| 122.34|
        |   0   | 1|  2 |  3 |    4    | 5|          6           | 7 |   8   | 9 |   10  |

0:  {key: TokenKeyword, value: "user_id"}                 token.Value() == "user_id"
1:  {key: TEquality,    value: "="}                       token.Value() == "="
2:  {key: TokenInteger, value: "119"}                     token.ValueInt() == 119
3:  {key: TokenKeyword, value: "and"}                     token.Value() == "and"
4:  {key: TokenKeyword, value: "modified"}                token.Value() == "modified"
5:  {key: TEquality,    value: ">"}                       token.Value() == ">"
6:  {key: TokenString,  value: "\"2020-01-01 00:00:00\""} token.ValueUnescaped() == "2020-01-01 00:00:00"
7:  {key: TokenKeyword, value: "or"}                      token.Value() == "or"
8:  {key: TokenKeyword, value: "amount"}                  token.Value() == "amount"
9:  {key: TEquality,    value: ">="}                      token.Value() == ">="
10: {key: TokenFloat,   value: "122.34"}                  token.ValueFloat() == 122.34
```

More examples:

- [JSON parser](./example_test.go)

## Begin

### Create and parse

```go
import (
	"github.com/bzick/tokenizer"
)

parser := tokenizer.New()
parser.AllowKeywordUnderscore()
// ... and other configuration code
```

There are two ways to **parse a string or slice**:

- `parser.ParseString(str)`
- `parser.ParseBytes(slice)`

The package also allows you to **parse an endless stream** of data into tokens. For parsing, pass an `io.Reader` from which data will be read chunk by chunk:

```go
fp, err := os.Open("data.json") // huge JSON file
// check err, configure tokenizer ...

stream := parser.ParseStream(fp, 4096).SetHistorySize(10)
defer stream.Close()
for stream.IsValid() {
	// ...
	stream.Next()
}
```
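Because `ParseStream` accepts any `io.Reader`, the same loop works for sockets, pipes or in-memory readers. The snippet below is a minimal, self-contained sketch built only from calls shown in this README; `strings.NewReader` merely stands in for a real stream, and the keyword/integer handling is illustrative:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/bzick/tokenizer"
)

func main() {
	parser := tokenizer.New()

	// strings.NewReader stands in for any io.Reader (file, socket, pipe, ...).
	src := strings.NewReader("limit 10 offset 20")

	// same stream setup as the file example above
	stream := parser.ParseStream(src, 4096).SetHistorySize(10)
	defer stream.Close()

	for stream.IsValid() {
		token := stream.CurrentToken()
		switch {
		case token.Is(tokenizer.TokenKeyword):
			fmt.Println("keyword:", token.ValueString())
		case token.Is(tokenizer.TokenInteger):
			fmt.Println("integer:", token.GetInt())
		}
		stream.Next()
	}
}
```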
## Embedded tokens

- `tokenizer.TokenUnknown` — unspecified token key.
- `tokenizer.TokenKeyword` — keyword, any combination of letters, including unicode letters.
- `tokenizer.TokenInteger` — integer value.
- `tokenizer.TokenFloat` — float/double value.
- `tokenizer.TokenString` — quoted string.
- `tokenizer.TokenStringFragment` — fragment of a framed (quoted) string.

### Unknown token — `tokenizer.TokenUnknown`

A token is marked as `TokenUnknown` if the parser detects a token it does not know:

```go
parser.ParseString(`one!`)
```

```
{
	{
		Key: tokenizer.TokenKeyword
		Value: "one"
	},
	{
		Key: tokenizer.TokenUnknown
		Value: "!"
	}
}
```

By default, `TokenUnknown` tokens are added to the stream. To exclude them from the stream, use the `tokenizer.StopOnUndefinedToken()` method:

```
{
	{
		Key: tokenizer.TokenKeyword
		Value: "one"
	}
}
```

Please note that if the `tokenizer.StopOnUndefinedToken` setting is enabled, the string may not be fully parsed. To find out whether the string was fully parsed, compare the length of the parsed string `stream.GetParsedLength()` with the length of the original string.

### Keywords

Any word that is not a custom token is stored in a single token with the key `tokenizer.TokenKeyword`.

The word can contain unicode characters, numbers (see `tokenizer.AllowNumbersInKeyword()`) and underscores (see `tokenizer.AllowKeywordUnderscore()`).

```go
parser.ParseString(`one two четыре`)
```

```
tokens: {
	{
		Key: tokenizer.TokenKeyword
		Value: "one"
	},
	{
		Key: tokenizer.TokenKeyword
		Value: "two"
	},
	{
		Key: tokenizer.TokenKeyword
		Value: "четыре"
	}
}
```

### Integer number

Any integer is stored as one token with the key `tokenizer.TokenInteger`.

```go
parser.ParseString(`223 999`)
```

```
tokens: {
	{
		Key: tokenizer.TokenInteger
		Value: "223"
	},
	{
		Key: tokenizer.TokenInteger
		Value: "999"
	},
}
```

To get an int64 from the token value, use `token.GetInt()`:

```go
stream := parser.ParseString("123")
fmt.Printf("Token is %d", stream.CurrentToken().GetInt()) // Token is 123
```

### Float number

Any float number is stored as one token with the key `tokenizer.TokenFloat`. A float number may

- have a point, for example `1.2`
- have an exponent, for example `1e6`
- have a lower `e` or upper `E` letter in the exponent, for example `1E6`, `1e6`
- have a sign in the exponent, for example `1e-6`, `1e6`, `1e+6`

```
parser.ParseString(`1.3e-8`):
{
	{
		Key: tokenizer.TokenFloat
		Value: "1.3e-8"
	},
}
```

To get a float64 from the token value, use `token.GetFloat()`:

```go
stream := parser.ParseString("1.3e2")
fmt.Printf("Token is %v", stream.CurrentToken().GetFloat()) // Token is 130
```

### Framed string

Strings that are framed with tokens are called framed strings. An obvious example is a quoted string like `"one two"`, where the quotes are the edge tokens.

You can define and customize a framed string through `tokenizer.DefineStringToken()`:

```go
const TokenDoubleQuotedString = 10
parser.DefineStringToken(TokenDoubleQuotedString, `"`, `"`).SetEscapeSymbol('\\')

stream := parser.ParseString(`"two \"three"`)
```

```
{
	{
		Key: tokenizer.TokenString
		Value: "\"two \\\"three\""
	},
}
```

To get a framed string without the edge tokens and escape characters, use the `token.ValueUnescaped()` method:

```go
v := stream.CurrentToken().ValueUnescaped() // result: two "three
```

The method `token.StringKey()` returns the string-token key defined in `DefineStringToken`:

```go
stream.CurrentToken().StringKey() == TokenDoubleQuotedString // true
```

### Injection in framed string

Strings can contain expression substitutions that can be parsed into tokens, for example `"one {{two}} three"`. Fragments of the string before, between and after substitutions will be stored in tokens as `tokenizer.TokenStringFragment`.

```go
const (
	TokenOpenInjection  = 1
	TokenCloseInjection = 2
	TokenQuotedString   = 3
)

parser := tokenizer.New()
parser.DefineTokens(TokenOpenInjection, []string{"{{"})
parser.DefineTokens(TokenCloseInjection, []string{"}}"})
parser.DefineStringToken(TokenQuotedString, `"`, `"`).AddInjection(TokenOpenInjection, TokenCloseInjection)

parser.ParseString(`"one {{ two }} three"`)
```

Tokens:

```
{
	{
		Key: tokenizer.TokenStringFragment,
		Value: "one"
	},
	{
		Key: TokenOpenInjection,
		Value: "{{"
	},
	{
		Key: tokenizer.TokenKeyword,
		Value: "two"
	},
	{
		Key: TokenCloseInjection,
		Value: "}}"
	},
	{
		Key: tokenizer.TokenStringFragment,
		Value: "three"
	},
}
```

Use cases:

- parse templates
- parse placeholders

## User defined tokens

New tokens can be defined via the `DefineTokens` method:

```go
const (
	TokenCurlyOpen    = 1
	TokenCurlyClose   = 2
	TokenSquareOpen   = 3
	TokenSquareClose  = 4
	TokenColon        = 5
	TokenComma        = 6
	TokenDoubleQuoted = 7
)

// json parser
parser := tokenizer.New()
parser.
	DefineTokens(TokenCurlyOpen, []string{"{"}).
	DefineTokens(TokenCurlyClose, []string{"}"}).
	DefineTokens(TokenSquareOpen, []string{"["}).
	DefineTokens(TokenSquareClose, []string{"]"}).
	DefineTokens(TokenColon, []string{":"}).
	DefineTokens(TokenComma, []string{","}).
	DefineStringToken(TokenDoubleQuoted, `"`, `"`).SetSpecialSymbols(tokenizer.DefaultStringEscapes)

stream := parser.ParseString(`{"key": [1]}`)
```
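The resulting stream can then be walked token by token. The following is a minimal, self-contained sketch that repeats the configuration above and reacts to a few token kinds using only methods already shown in this README (`Is`, `ValueUnescaped`, `GetInt`); a complete JSON parser lives in [example_test.go](./example_test.go):

```go
package main

import (
	"fmt"

	"github.com/bzick/tokenizer"
)

const (
	TokenCurlyOpen    = 1
	TokenCurlyClose   = 2
	TokenSquareOpen   = 3
	TokenSquareClose  = 4
	TokenColon        = 5
	TokenComma        = 6
	TokenDoubleQuoted = 7
)

func main() {
	// same configuration as in the snippet above
	parser := tokenizer.New()
	parser.
		DefineTokens(TokenCurlyOpen, []string{"{"}).
		DefineTokens(TokenCurlyClose, []string{"}"}).
		DefineTokens(TokenSquareOpen, []string{"["}).
		DefineTokens(TokenSquareClose, []string{"]"}).
		DefineTokens(TokenColon, []string{":"}).
		DefineTokens(TokenComma, []string{","}).
		DefineStringToken(TokenDoubleQuoted, `"`, `"`).SetSpecialSymbols(tokenizer.DefaultStringEscapes)

	stream := parser.ParseString(`{"key": [1]}`)
	defer stream.Close()

	// react only to the token kinds we care about; everything else is skipped
	for stream.IsValid() {
		token := stream.CurrentToken()
		switch {
		case token.Is(TokenCurlyOpen):
			fmt.Println("object starts")
		case token.Is(tokenizer.TokenString):
			fmt.Printf("string value: %s\n", token.ValueUnescaped())
		case token.Is(tokenizer.TokenInteger):
			fmt.Println("integer value:", token.GetInt())
		}
		stream.Next()
	}
}
```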
## Known issues

* The zero byte `\0` is ignored in the source string.

## Benchmark

Parse string/bytes

```
pkg: tokenizer
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
BenchmarkParseBytes
    stream_test.go:251: Speed: 70 bytes string with 19.689µs: 3555284 byte/sec
    stream_test.go:251: Speed: 7000 bytes string with 848.163µs: 8253130 byte/sec
    stream_test.go:251: Speed: 700000 bytes string with 75.685945ms: 9248744 byte/sec
    stream_test.go:251: Speed: 11093670 bytes string with 1.16611538s: 9513355 byte/sec
BenchmarkParseBytes-8    158481    7358 ns/op
```

Parse infinite stream

```
pkg: tokenizer
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
BenchmarkParseInfStream
    stream_test.go:226: Speed: 70 bytes at 33.826µs: 2069414 byte/sec
    stream_test.go:226: Speed: 7000 bytes at 627.357µs: 11157921 byte/sec
    stream_test.go:226: Speed: 700000 bytes at 27.675799ms: 25292856 byte/sec
    stream_test.go:226: Speed: 30316440 bytes at 1.18061702s: 25678471 byte/sec
BenchmarkParseInfStream-8    433092    2726 ns/op
PASS
```
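The numbers above come from the package's own tests (`stream_test.go`). To measure throughput on your own data, a benchmark along the following lines can be run with `go test -bench .` — a minimal sketch, not the package's actual benchmark code:

```go
package tokenizer_test

import (
	"bytes"
	"testing"

	"github.com/bzick/tokenizer"
)

// BenchmarkParseMyData parses the same byte slice b.N times and reports
// throughput via b.SetBytes. The sample data is arbitrary.
func BenchmarkParseMyData(b *testing.B) {
	parser := tokenizer.New()
	data := bytes.Repeat([]byte(`user_id = 119 and amount >= 122.34 `), 200)

	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		stream := parser.ParseBytes(data)
		for stream.IsValid() {
			stream.Next()
		}
		stream.Close()
	}
}
```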