Splitting and concatenation

Algolia improves search relevance by splitting long words into shorter ones and combining (concatenating) short words into longer ones. This helps users find results even when their query doesn’t exactly match your indexed records. You can adjust this behavior in the Algolia dashboard or with the typoTolerance parameter.

To learn more about query processing, see Tokenization.

Splitting

When processing user queries, Algolia attempts to improve relevance by splitting a single query term into two separate words. This helps return results when users accidentally concatenate words, like typing katherinejohnson instead of katherine johnson.

Algolia splits each query word into at most two parts. This limits the number of generated tokens, which keeps search fast while still improving matches for concatenated terms. For example, the query jamesearljones is split into james and earljones, not into james, earl, and jones.

How splits work

For each word in a query, Algolia evaluates every possible two-part split. For example, the query katherinejohnson could generate the following splits:

  • katherinejohnson (the original term, unsplit)
  • k, atherinejohnson
  • ka, therinejohnson
  • kat, herinejohnson
  • kath, erinejohnson
  • kathe, rinejohnson
  • kather, inejohnson
  • katheri, nejohnson
  • katherin, ejohnson
  • katherine, johnson
  • katherinej, ohnson
  • katherinejo, hnson
  • katherinejoh, nson

Algolia splits query terms only if they’re at least as long as the value defined by minWordSizefor1Typo. By default, this is 4 characters, so shorter terms (such as car) aren’t split, while terms of 4 or more characters (such as kath and katherinejohnson) can be.

The first part of the split can be up to 12 characters long, while the second part can be any length.

Algolia uses a split as an alternative search term if both parts of the split exist as distinct words in your index. For example, if katherine and johnson are both in your records, Algolia adds katherine johnson as an alternative search term. If either word is missing from your records, Algolia ignores the split.
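
The split rules described so far can be sketched in a few lines of Python. This is an illustration only, not Algolia's implementation: the function names and the vocabulary set are hypothetical, and the defaults assume a minimum word size of 4 characters and a first part of at most 12 characters, as stated above.

```python
def candidate_splits(term, min_word_size=4, max_first_part=12):
    """Return every two-part split of `term` as (first, second) pairs."""
    if len(term) < min_word_size:
        return []  # too short to split, for example "car"
    # The first part is capped at max_first_part characters,
    # and the second part must keep at least one character.
    last_cut = min(max_first_part, len(term) - 1)
    return [(term[:i], term[i:]) for i in range(1, last_cut + 1)]


def usable_splits(term, vocabulary):
    """Keep only the splits whose two parts both exist as indexed words."""
    return [
        (first, second)
        for first, second in candidate_splits(term)
        if first in vocabulary and second in vocabulary
    ]


# Hypothetical set of words found in the index.
vocabulary = {"katherine", "johnson", "james"}
print(usable_splits("katherinejohnson", vocabulary))
```

With this made-up vocabulary, only the katherine and johnson split survives, because it is the only candidate whose two parts are both known words.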

Alternative search terms are treated as sequence expressions, which means that the split terms must be next to each other and in the same order in an attribute.

Algolia may generate multiple splits. For example, it can split nowhere into no and where, or now and here. It selects the split that matches the most records. Algolia may not use a split if the original query term yields better results.
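
As a rough sketch of how one split could be chosen over another, the following Python counts the records in which the two parts appear next to each other and in order, then keeps the split with the highest count. The helper names and sample records are hypothetical, whitespace tokenization is a simplification, and the comparison against the original, unsplit term is not modeled.

```python
def phrase_count(first, second, records):
    """Count records where `first` is immediately followed by `second`."""
    count = 0
    for record in records:
        words = record.lower().split()
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            count += 1
    return count


def best_split(splits, records):
    """Return the split that matches the most records, or None if none match."""
    scored = [(phrase_count(f, s, records), (f, s)) for f, s in splits]
    best_count, best = max(scored, default=(0, None))
    return best if best_count > 0 else None


# Made-up records for illustration.
records = [
    "There is no where to hide",
    "now here we go",
    "Right now here it is",
]
print(best_split([("no", "where"), ("now", "here")], records))
```

In this sample data, now here appears as an adjacent pair in two records and no where in one, so the sketch picks now and here.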

Concatenation

Algolia concatenates tokens to improve matching for acronyms and contractions.

Concatenation during indexing

During indexing, Algolia combines tokens separated by:

  • . (period)
  • ' (apostrophe)
  • ® (registered symbol)
  • © (copyright symbol)

This helps index acronyms such as B.C.E. and contractions such as don't.

For example, hello.world creates the tokens hello, ., and world, and then helloworld after concatenation. The . character is a separator and isn’t indexed by default (see the separatorsToIndex parameter).

Algolia doesn’t index tokens shorter than three characters. For example, B.C.E. creates B, ., C, ., E, and BCE. It indexes only BCE, not B, C, E, or the . separator.
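
The two examples above can be reproduced with a short Python sketch. It is a simplified model under stated assumptions, not Algolia's tokenizer: the function name is hypothetical, it handles a single chunk of text without spaces, it assumes separators aren't indexed, and it ignores the number handling described later on this page.

```python
import re

# The four separators listed above: period, apostrophe, ®, and ©.
SEPARATORS = r"[.'®©]"


def indexed_tokens(chunk, min_length=3):
    """Split `chunk` on separators, add the concatenation, drop short tokens."""
    parts = [p for p in re.split(SEPARATORS, chunk) if p]
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))  # for example, hello + world
    # Tokens shorter than min_length characters aren't indexed.
    return [t for t in tokens if len(t) >= min_length]


print(indexed_tokens("hello.world"))  # ['hello', 'world', 'helloworld']
print(indexed_tokens("B.C.E."))       # ['BCE']
```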

Concatenation at query time

Algolia performs the same concatenation in search queries as it does during indexing. It also uses:

  • Bigram concatenation. Algolia merges each pair of adjacent tokens for the first five words in the query.
  • All-word concatenation. Algolia combines all query words into a single token when there are three or more words.

These concatenation methods increase the chance of matching product names or long phrases written without spaces. For example, the search query a wonderful day in the neighborhood results in these tokens:

  • Initial tokenization: a, wonderful, day, in, the, neighborhood
  • Bigram concatenation: awonderful, wonderfulday, dayin, inthe
  • All-word concatenation: awonderfuldayintheneighborhood
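
Both query-time methods can be sketched in Python as follows. The function names are hypothetical, and the limits (first five words for bigrams, at least three words for all-word concatenation) are taken from the description above. This is a minimal illustration, not Algolia's query processor.

```python
def bigram_concatenations(words, max_words=5):
    """Merge each pair of adjacent words among the first `max_words` words."""
    head = words[:max_words]
    return [a + b for a, b in zip(head, head[1:])]


def all_word_concatenation(words, min_words=3):
    """Join all words into one token when there are enough of them."""
    return "".join(words) if len(words) >= min_words else None


words = "a wonderful day in the neighborhood".split()
print(bigram_concatenations(words))
# ['awonderful', 'wonderfulday', 'dayin', 'inthe']
print(all_word_concatenation(words))
# awonderfuldayintheneighborhood
```

Because only the first five words take part in bigram concatenation, the sketch produces no theneighborhood token, which matches the example above.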

Concatenation with numbers

Algolia applies specific logic for concatenating tokens with numbers and separators:

  • If a concatenated token would start with a number, Algolia doesn’t create it. For example, m.55 creates m55, but 5.mm forms 5 and mm, not 5mm. This avoids misinterpreting floating point numbers, so 1.3GB isn’t treated as 13GB.
  • When a number appears next to a separator, Algolia indexes each adjacent non-separator token individually, regardless of length. For example, 3.GB creates 3, ., and GB. Algolia indexes 3 and GB but not 3GB, because it starts with a number.
  • Algolia skips bigram concatenation when two adjacent tokens both start or end with digits. This prevents irrelevant combinations in queries such as XC90 2020 Volvo, where merging the terms into XC902020 would reduce relevance and produce inaccurate matches.
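
The digit check for bigrams might look like the following Python sketch. The function names are hypothetical, and the sketch assumes one reading of the rule: a bigram is skipped when each of the two tokens has a digit as its first or last character. Algolia's exact condition may differ.

```python
def touches_digit(token):
    """True if the token's first or last character is a digit."""
    return token[:1].isdigit() or token[-1:].isdigit()


def should_skip_bigram(left, right):
    """Assumed rule: skip when both tokens start or end with a digit."""
    return touches_digit(left) and touches_digit(right)


print(should_skip_bigram("XC90", "2020"))  # True: XC902020 isn't generated
print(should_skip_bigram("red", "shoes"))  # False: redshoes can be generated
```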

Algolia applies specific logic to hyphenated attributes, which can affect search behavior. For example, for hyphenated ISBN or part numbers, all-word concatenation or careful attribute formatting helps ensure good search relevance. For more information, see Searching in hyphenated attributes.

Improve relevance for single word and ambiguous queries

In some cases, especially with short or ambiguous queries, Algolia may split or interpret terms in unexpected ways. For example, a search for Augusta might return a less relevant result that contains August a, because Algolia interprets this as a better match based on frequency or attribute position.

To improve relevance in these cases:

  • Use unordered attributes. Make sure most of your searchable attributes (such as description, name, or title) are unordered. This removes any differences in ranking due to the position of the matches.
  • Try prioritizing exact matches for single word queries. Consider setting exactOnSingleWordQuery to word. This boosts exact matches when the user’s query is only one word.

If specific queries don’t return the expected results, consider:

  • Adding a keywords attribute. Add a keywords field to your records with the exact words you want user queries to match. Put the attribute at the top of your searchable attributes list. This favors exact matches with the defined keywords, due to the Attribute criterion.
  • Turning off typo tolerance for specific words. Use disableTypoToleranceOnWords to require exact matches for specified words. Be cautious with this approach, as users must spell the word exactly for it to match.
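
Taken together, these suggestions could translate into index settings like the ones below, written here as a Python dictionary. The attribute names and the word in disableTypoToleranceOnWords are placeholders for illustration, and not every setting is needed for every index. Apply the settings in the dashboard or with your API client.

```python
import json

# Illustrative settings only: adapt attribute names and words to your records.
settings = {
    "searchableAttributes": [
        "keywords",                # exact keywords, listed first
        "unordered(name)",         # match position within the attribute is ignored
        "unordered(title)",
        "unordered(description)",
    ],
    # Favor exact matches when the query is a single word.
    "exactOnSingleWordQuery": "word",
    # Placeholder word that must be spelled exactly to match.
    "disableTypoToleranceOnWords": ["augusta"],
}

print(json.dumps(settings, indent=2))
```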

After making changes, test your configuration with a variety of queries. Only apply updates that improve relevance. If possible, A/B test the configuration before rolling it out to all users.
