Splitting and concatenation
Algolia improves search relevance by splitting long words into shorter ones and combining (concatenating) short words into longer ones.
This helps users find results even when their query doesn’t exactly match your indexed records.
You can adjust this behavior in the Algolia dashboard or with the typoTolerance parameter.
To learn more about query processing, see Tokenization.
Splitting
When processing user queries,
Algolia attempts to improve relevance by splitting a single query term into two separate words.
This helps return results when users accidentally concatenate words, like typing katherinejohnson instead of katherine johnson.
Algolia splits query words into only two parts to improve relevance without sacrificing performance.
This reduces the number of generated tokens,
keeping search fast and efficient while still improving matches for concatenated terms.
For example, the query jamesearljones is split into james and earljones, not into james, earl, and jones.
How splits work
For each word in a query,
Algolia evaluates every possible two-part split.
For example, the query katherinejohnson could generate the following splits:
- katherinejohnson
- k, atherinejohnson
- ka, therinejohnson
- kat, herinejohnson
- kath, erinejohnson
- kathe, rinejohnson
- kather, inejohnson
- katheri, nejohnson
- katherin, ejohnson
- katherine, johnson
- katherinej, ohnson
- katherinejo, hnson
- katherinejoh, nson
Algolia splits query terms only if they’re at least as long as the value defined by minWordSizefor1Typo.
By default, this is 4 characters,
so terms shorter than this (such as car) aren’t split,
while terms of at least this length (such as kath and katherinejohnson) can be.
The first part of the split can be up to 12 characters long, while the second part can be any length.
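The length rules above can be sketched in a few lines of Python. This is an illustrative model only, not Algolia's implementation: the function name candidate_splits and its parameters are hypothetical, with min_word_size standing in for the default minWordSizefor1Typo of 4 and max_first_part for the 12-character limit on the first part.

```python
def candidate_splits(term, min_word_size=4, max_first_part=12):
    """Return the two-part splits of `term` allowed by the length rules.

    Hypothetical helper: `min_word_size` stands in for minWordSizefor1Typo,
    `max_first_part` for the 12-character cap on the first part.
    """
    if len(term) < min_word_size:
        return []  # too short to be split at all
    # The second part must keep at least one character.
    last_cut = min(len(term) - 1, max_first_part)
    return [(term[:i], term[i:]) for i in range(1, last_cut + 1)]


splits = candidate_splits("katherinejohnson")
print(len(splits))               # 12
print(splits[8])                 # ('katherine', 'johnson')
print(splits[-1])                # ('katherinejoh', 'nson')
print(candidate_splits("car"))   # []
```

The sketch only enumerates candidates. Whether a candidate becomes an alternative search term depends on the contents of your index.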
Algolia uses a split as an alternative search term if both parts of the split exist as distinct words in your index.
For example, if katherine and johnson are both in your records, Algolia adds katherine johnson as an alternative search term.
If either part is missing from your records, Algolia ignores this split.
Alternative search terms are treated as sequence expressions, which means that the split terms must be next to each other and in the same order in an attribute.
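To make the sequence requirement concrete, here is a minimal Python sketch. It assumes a simplified tokenizer (lowercase, split on non-word characters) and hypothetical helper names (words_of, is_sequence_match, accepted_splits). Algolia's real tokenization and matching are more involved.

```python
import re


def words_of(text):
    # Assumption: lowercase and split on non-word characters.
    return re.findall(r"\w+", text.lower())


def is_sequence_match(first, second, attribute_text):
    """True if `first` is immediately followed by `second` in the attribute."""
    words = words_of(attribute_text)
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def accepted_splits(splits, attribute_values):
    """Keep the splits that match as a sequence in at least one attribute value."""
    return [
        (first, second)
        for first, second in splits
        if any(is_sequence_match(first, second, value) for value in attribute_values)
    ]


# Hypothetical attribute values, for illustration only.
attribute_values = ["Katherine Johnson, mathematician", "Johnson Space Center"]
splits = [("kath", "erinejohnson"), ("katherine", "johnson")]
print(accepted_splits(splits, attribute_values))  # [('katherine', 'johnson')]
print(is_sequence_match("katherine", "johnson", "Johnson, Katherine"))  # False
```

The second call returns False because both words are present but in the wrong order.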
Algolia may generate multiple splits.
For example, it can split nowhere into no and where, or now and here.
It selects the split that matches the most records.
A split may not be used if the original query term yields better results.
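As a rough illustration of choosing between competing splits, the following Python sketch counts how many records contain each split as an adjacent word pair and keeps the most frequent one. The helper names, the simplified tokenizer, and the sample records are all assumptions. The sketch also leaves out the comparison against the original, unsplit term, because the criteria for "better results" aren't specified here.

```python
import re


def count_matching_records(first, second, records):
    """Count records where `first` is immediately followed by `second`."""
    hits = 0
    for text in records:
        words = re.findall(r"\w+", text.lower())  # simplified tokenizer
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            hits += 1
    return hits


def best_split(term, records, max_first_part=12):
    """Return the two-part split of `term` matching the most records, or None."""
    best, best_hits = None, 0
    for i in range(1, min(len(term) - 1, max_first_part) + 1):
        first, second = term[:i], term[i:]
        hits = count_matching_records(first, second, records)
        if hits > best_hits:  # on ties, the earliest split wins in this sketch
            best, best_hits = (first, second), hits
    return best


# Hypothetical records, for illustration only.
records = ["Now here we are", "There is no where to hide", "Right now here"]
print(best_split("nowhere", records))  # ('now', 'here')
```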
Concatenation
Algolia concatenates tokens to improve matching for acronyms and contractions.
Concatenation during indexing
During indexing, Algolia combines tokens separated by:
- . (period)
- ' (apostrophe)
- ® (registered symbol)
- © (copyright symbol)
This helps index acronyms such as B.C.E. and contractions such as don't.
For example, hello.world creates the tokens hello, ., and world, and then helloworld after concatenation.
The . character is a separator and isn’t indexed by default (see the separatorsToIndex parameter).
Algolia doesn’t index tokens shorter than three characters.
For example, B.C.E. creates B, ., C, ., E, and BCE.
It indexes only BCE, not B, C, E, or the . separator.
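The two examples above can be reproduced with a small Python model. It's a sketch under stated assumptions, not Algolia's tokenizer: index_tokens is a hypothetical function, the separators aren't indexed (the default), case is left untouched, and the special handling of numbers is ignored.

```python
import re

# The four separators listed above: period, apostrophe, ®, ©.
SEPARATOR_PATTERN = "[.'®©]"


def index_tokens(text, min_length=3):
    """Toy model of indexing-time concatenation for one attribute value."""
    indexed = []
    for chunk in text.split():
        parts = [p for p in re.split(SEPARATOR_PATTERN, chunk) if p]
        candidates = list(parts)
        if len(parts) > 1:
            candidates.append("".join(parts))  # the concatenated token
        # Tokens shorter than `min_length` characters aren't indexed.
        indexed.extend(t for t in candidates if len(t) >= min_length)
    return indexed


print(index_tokens("hello.world"))  # ['hello', 'world', 'helloworld']
print(index_tokens("B.C.E."))       # ['BCE']
```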
Concatenation at query time
Algolia performs the same concatenation in search queries as it does during indexing. It also uses:
- Bigram concatenation. Algolia merges each pair of adjacent tokens for the first five words in the query.
- All-word concatenation. Algolia combines all query words into a single token when there are three or more words.
These concatenation methods increase the chance of matching product names or long phrases written without spaces.
For example, the search query a wonderful day in the neighborhood results in these tokens:
- Initial tokenization: a, wonderful, day, in, the, neighborhood
- Bigram concatenation: awonderful, wonderfulday, dayin, inthe
- All-word concatenation: awonderfuldayintheneighborhood
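The same tokens can be derived with a short Python sketch. The function name query_concatenations and its parameters are hypothetical. It assumes whitespace tokenization and reads "the first five words" as producing bigrams only between those five words, which matches the example above.

```python
def query_concatenations(query, bigram_word_limit=5, all_word_minimum=3):
    """Return (bigrams, all_word_token) for a whitespace-tokenized query."""
    words = query.split()
    head = words[:bigram_word_limit]
    # Merge each pair of adjacent words within the first five words.
    bigrams = [a + b for a, b in zip(head, head[1:])]
    # Combine every word into one token for queries of three or more words.
    all_words = "".join(words) if len(words) >= all_word_minimum else None
    return bigrams, all_words


bigrams, all_words = query_concatenations("a wonderful day in the neighborhood")
print(bigrams)    # ['awonderful', 'wonderfulday', 'dayin', 'inthe']
print(all_words)  # awonderfuldayintheneighborhood
```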
Concatenation with numbers
Algolia applies specific logic for concatenating tokens with numbers and separators:
- If the first token starts with a number, Algolia doesn’t merge it with the tokens that follow. For example, m.55 creates m55, but 5.mm forms 5 and mm, not 5mm. This avoids misinterpreting floating point numbers, so 1.3GB isn’t treated as 13GB.
- When a number appears next to a separator, Algolia indexes each adjacent non-separator token individually, regardless of length. For example, 3.GB creates 3, ., and GB. Algolia indexes 3 and GB but not 3GB, because it starts with a number.
- Algolia skips bigram concatenation when two adjacent tokens both start or end with digits. This prevents irrelevant combinations in queries such as XC90 2020 Volvo, where merging the terms into XC902020 would reduce relevance and produce inaccurate matches.
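The number rules can be expressed as two small predicates. This Python sketch is an interpretation of the examples above, not Algolia's code: it assumes a merge across a separator is blocked when the merged token would start with a digit, and that a bigram is skipped when each of the two tokens starts or ends with a digit. Both function names are hypothetical.

```python
def can_merge_across_separator(left, right):
    """Assumption: merging is blocked when the result would start with a digit."""
    return not left[0].isdigit()


def touches_digit(token):
    """True if the token starts or ends with a digit."""
    return token[0].isdigit() or token[-1].isdigit()


def can_form_bigram(left, right):
    """Assumption: skip the bigram when both tokens start or end with digits."""
    return not (touches_digit(left) and touches_digit(right))


print(can_merge_across_separator("m", "55"))   # True: m.55 creates m55
print(can_merge_across_separator("5", "mm"))   # False: 5.mm stays 5 and mm
print(can_merge_across_separator("1", "3GB"))  # False: 1.3GB isn't 13GB
print(can_form_bigram("XC90", "2020"))         # False: no XC902020
```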
Algolia applies specific logic to hyphenated attributes, which can affect search behavior. For example, for hyphenated ISBN or part numbers, all-word concatenation or careful attribute formatting helps ensure good search relevance. For more information, see Searching in hyphenated attributes.
Improve relevance for single word and ambiguous queries
In some cases, especially with short or ambiguous queries,
Algolia may split or interpret terms in unexpected ways.
For example, a search for Augusta might return a less relevant result that contains August a,
because Algolia interprets this as a better match based on frequency or attribute position.
To improve relevance in these cases:
- Use unordered attributes. Make sure most of your searchable attributes (such as description, name, or title) are unordered. This removes any differences in ranking due to the position of the matches.
- Try prioritizing exact matches for single word queries. Consider setting exactOnSingleWordQuery to word. This boosts exact matches when the user’s query is only one word.
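As a sketch, the two suggestions above correspond to index settings along these lines, shown here as a plain Python dictionary. The attribute names are hypothetical placeholders. How you apply the settings (dashboard or an API client) is left out, so check the parameter names and values against the API reference before using them.

```python
import json

# Hypothetical attribute names; replace them with your own.
settings = {
    "searchableAttributes": [
        "unordered(name)",
        "unordered(title)",
        "unordered(description)",
    ],
    # Favor records that contain the exact word for one-word queries.
    "exactOnSingleWordQuery": "word",
}

print(json.dumps(settings, indent=2))
```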
If specific queries don’t return the expected results, consider:
- Adding a keywords attribute. Add a keywords field to your records with the exact words you want user queries to match. Put the attribute at the top of your searchable attributes list. This favors exact matches with the defined keywords, due to the Attribute criterion.
- Turning off typo tolerance for specific words. Use disableTypoToleranceOnWords to require exact matches for specified words. Be cautious with this approach, as users must spell the word exactly for it to match.
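For illustration, a record and settings that follow both suggestions might look like the Python dictionaries below. The record contents, attribute names, and word list are invented for this example, and the way the settings are applied is left out.

```python
import json

# Hypothetical record with a dedicated keywords attribute.
record = {
    "objectID": "1",
    "name": "Augusta",
    "keywords": ["augusta"],
}

settings = {
    # keywords comes first, so matches in it rank highest on the Attribute criterion.
    "searchableAttributes": ["keywords", "unordered(name)"],
    # Queries must spell these words exactly to match them.
    "disableTypoToleranceOnWords": ["augusta"],
}

print(json.dumps({"record": record, "settings": settings}, indent=2))
```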
After making changes, test your configuration with a variety of queries. Only apply updates that improve relevance. If possible, A/B test the configuration before rolling it out to all users.