Tokenization
Within a search engine, tokenization is the process of splitting text into “tokens”, both during querying and indexing. Tokens are the basic units for finding matches between queries and records.
Separators and non-separators
Algolia’s tokenizer divides characters into two classes: non-separators and separators.
Non-separators are alphanumeric characters, and separators are non-alphanumeric characters like spaces and hyphens (-).
Turning a string into tokens (tokenizing) happens character-by-character. The tokenizer identifies the longest groups of contiguous characters belonging to the same class (separator or non-separator), and creates a token for each group.
For example, the string Hello, World! results in four tokens:
- Hello (non-separator)
- , (a comma with a trailing space) (separator)
- World (non-separator)
- ! (separator)
Hello and World consist of non-separator characters, while , (with a trailing space) and ! consist of separators.
Only non-separator characters are indexed, and thus searchable, by default. In the example above, only Hello and World are indexed. Whether a user searches for Hello, World! or hello world, any record with these tokens matches.
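The grouping rule above can be sketched in a few lines of Python. This is only an illustration, not Algolia's implementation: the function names tokenize and indexed_tokens are hypothetical, str.isalnum stands in for the engine's own character classification, and lowercasing is an assumed simplification to mimic case-insensitive matching.

```python
from itertools import groupby

def tokenize(text):
    """Group contiguous characters of the same class into (token, kind) pairs."""
    tokens = []
    # groupby starts a new group each time str.isalnum flips between True and False
    for is_alnum, chars in groupby(text, key=str.isalnum):
        kind = "non-separator" if is_alnum else "separator"
        tokens.append(("".join(chars), kind))
    return tokens

def indexed_tokens(text):
    """Keep only non-separator tokens (lowercasing is an assumption of this sketch)."""
    return [tok.lower() for tok, kind in tokenize(text) if kind == "non-separator"]

print(tokenize("Hello, World!"))
# Both strings reduce to the same indexed tokens in this sketch
print(indexed_tokens("Hello, World!") == indexed_tokens("hello world"))
```

In this sketch, the query and the record are tokenized the same way, which is why punctuation and case differences don't affect the comparison.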
Index separators
You can customize which characters are indexed using separatorsToIndex.
Including a character in this setting has these consequences:
- It’s tokenized as a non-separator.
- It’s not combined with adjacent characters. The tokenizer always puts the character alone in its own token, even if it appears next to other non-separators, or even next to itself.
- It’s indexed.
For example, if separatorsToIndex includes #@ (hash and at sign), then the string #@lgolia!! is tokenized as:
- # (non-separator)
- @ (non-separator)
- lgolia (non-separator)
- !! (separator)
Since # and @ are included in separatorsToIndex, the tokens #, @, and lgolia are indexed. Even though they appear next to each other, # and @ are separate tokens.
Now, when a user searches for #, @, or LGOLIA!!, this record matches.
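As a rough illustration of the three consequences listed above, the following Python sketch tokenizes a string while treating every character in a separators_to_index argument as a standalone non-separator token. The function and parameter names are hypothetical, and str.isalnum is an assumed stand-in for the engine's character classes.

```python
def tokenize(text, separators_to_index=""):
    """Return (token, kind) pairs; indexed separators always form one-character tokens."""
    tokens = []
    for ch in text:
        if ch in separators_to_index:
            # Treated as a non-separator, but never merged with its neighbors
            tokens.append((ch, "non-separator"))
            continue
        kind = "non-separator" if ch.isalnum() else "separator"
        can_extend = (
            tokens
            and tokens[-1][1] == kind
            and tokens[-1][0][-1] not in separators_to_index
        )
        if can_extend:
            tokens[-1] = (tokens[-1][0] + ch, kind)
        else:
            tokens.append((ch, kind))
    return tokens

def indexed_tokens(text, separators_to_index=""):
    """Non-separator tokens only (lowercasing is an assumption of this sketch)."""
    return [t.lower() for t, kind in tokenize(text, separators_to_index)
            if kind == "non-separator"]

print(tokenize("#@lgolia!!", separators_to_index="#@"))
print(indexed_tokens("#@lgolia!!", separators_to_index="#@"))
```

With an empty separators_to_index, the sketch falls back to the default behavior, where # and @ are grouped into a single separator token and aren't indexed.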
Sequence expressions
Although characters in separatorsToIndex are tokenized on their own, their order is preserved when they’re adjacent to a non-separator token.
For example, if @ is included in separatorsToIndex, then the string alice@wonderland is interpreted as alice @ wonderland (all tokens must be adjacent, in this order).
The phrase alice @ wonderland (with spaces in between) has the same tokens, but with no restrictions on order.
A search for alice@wonderland returns records with alice@wonderland and alice @ wonderland (with spaces), but not records with wonderland @ alice or alice was @ wonderland.
When tokens must occur in a particular order, it’s known as a sequence expression.
Algolia always creates sequence expressions when alphanumeric characters surround a hyphen (-), even if the hyphen isn’t included in separatorsToIndex.
For example, the term real-time creates a sequence expression.
The query real-time matches records with real time and real-time, but not real [...] time, time real, or time [...] real ([...] indicates other words in the string).
The query real time, without a hyphen, matches any records with those two words, regardless of word order or proximity.
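The difference between a hyphenated and a non-hyphenated query can be modeled with a small Python sketch. It is a deliberately simplified model, not the engine's matching logic: the helper names are hypothetical, only ASCII letters and digits count as word characters, and features such as typo tolerance and prefix matching are ignored.

```python
import re

def words(text):
    """Lowercased alphanumeric tokens, in order of appearance (ASCII only in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_sequence(record_words, sequence):
    """True if the sequence appears as adjacent tokens, in the same order."""
    n = len(sequence)
    return any(record_words[i:i + n] == sequence
               for i in range(len(record_words) - n + 1))

def matches(query, record):
    """Hyphenated terms become sequence expressions; other terms match in any position."""
    record_words = words(record)
    for term in query.split():
        parts = words(term)
        if "-" in term and len(parts) > 1:
            if not contains_sequence(record_words, parts):
                return False
        elif not all(part in record_words for part in parts):
            return False
    return True

print(matches("real-time", "real time analytics"))   # adjacent, in order
print(matches("real-time", "real world time"))       # another word in between
print(matches("real time", "time is real"))          # no hyphen: order is ignored
```

The sequence check compares sliding windows of the record's tokens against the query's tokens, which is one simple way to express "adjacent and in this order".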
Sequence expressions limitation
Sequence expression matching relies on word positions: all tokens must be adjacent.
Indexing keeps the positions of only the first 1,000 words of each attribute. For words beyond this limit, sequence expression matching doesn’t work.
Mitigation and solution
To mitigate the issue, you can:
- Transform the query, for example from real-time to real time
- Use smaller records
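Both mitigations can be approximated on the client side. The Python sketch below is an assumption-laden illustration: relax_hyphens and exceeds_position_limit are hypothetical helper names, the hyphen rewrite applies only between letters or digits, and counting whitespace-separated words is just a rough proxy for how the engine counts word positions.

```python
import re

POSITION_LIMIT = 1000  # number of word positions kept per attribute, per the text above

def relax_hyphens(query):
    """Replace hyphens that sit between two letters or digits with a space."""
    return re.sub(r"(?<=[^\W_])-(?=[^\W_])", " ", query)

def exceeds_position_limit(attribute_text, limit=POSITION_LIMIT):
    """Rough check for attributes long enough to hit the position limit."""
    return len(attribute_text.split()) > limit

print(relax_hyphens("real-time analytics"))
print(exceeds_position_limit("word " * 1500))
```

Rewriting the query removes the sequence expression, so the match no longer depends on stored positions, at the cost of also matching records where the words aren't adjacent. The length check could flag attributes that are candidates for splitting into smaller records.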