Sunday, October 15, 2023

Perform accent-insensitive search using OpenSearch


We often want our text search to be agnostic of accent marks. Accent-insensitive search, also called diacritics-agnostic search, returns the same results for queries that may or may not contain Latin characters such as à, è, Ê, ñ, and ç. Diacritics are marks added to letters to indicate a difference in pronunciation. In recent years, words with diacritics have trickled into mainstream English, such as café or protégé. Well, touché! OpenSearch has the answer!

OpenSearch is a scalable, flexible, and extensible open-source software suite for your search workloads. OpenSearch can be deployed in three different modes: self-managed open-source OpenSearch, the managed Amazon OpenSearch Service, and Amazon OpenSearch Serverless. All three deployment modes are powered by Apache Lucene and offer text analytics using Lucene analyzers.

In this post, we demonstrate how to perform accent-insensitive search using OpenSearch to handle diacritics.

Solution overview

Lucene analyzers are Java libraries used to analyze text while indexing and searching documents. An analyzer consists of a tokenizer and filters: the tokenizer splits the incoming text into one or more tokens, and the filters transform the tokens by modifying or removing unnecessary characters.

OpenSearch supports custom analyzers, which let you configure different combinations of these building blocks. A custom analyzer can contain character filters, a tokenizer, and token filters. To enable diacritic-insensitive search, we configure a custom analyzer that uses the ASCII folding token filter.

ASCII folding converts alphabetic, numeric, and symbolic Unicode characters that aren't in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if such equivalents exist. For example, the filter changes "à" to "a". This allows search engines to return results agnostic of the accent.
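You can try this folding directly, without creating an index, by using the _analyze API with the asciifolding token filter (this assumes you have a running OpenSearch cluster to send the request to):

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "café protégé"
}

The response contains the folded tokens "cafe" and "protege".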

In this post, we configure accent-insensitive search using the ASCII folding filter supported in OpenSearch Service. We ingest a set of European movie titles with diacritics and verify search results with and without the diacritics.

Create an index with a custom analyzer

We first create the index asciifold_movies with the custom analyzer custom_asciifolding:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
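Before ingesting data, you can sanity-check the custom analyzer with the _analyze API (a verification step we add here for illustration; it isn't required for the solution):

GET asciifold_movies/_analyze
{
  "analyzer": "custom_asciifolding",
  "text": "fête"
}

Because preserve_original is set to true, the response contains both the folded token "fete" and the original token "fête" at the same position, which is what makes both spellings searchable.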

Ingest sample data

Next, we ingest sample data with Latin characters into the index asciifold_movies:

POST _bulk
{ "index" : { "_index" : "asciifold_movies", "_id":"1"} }
{  "title" : "Jour de fête"}
{ "index" : { "_index" : "asciifold_movies", "_id":"2"} }
{  "title" : "La gloire de mon père" }
{ "index" : { "_index" : "asciifold_movies", "_id":"3"} }
{  "title" : "Le roi et l’oiseau" }
{ "index" : { "_index" : "asciifold_movies", "_id":"4"} }
{  "title" : "Être et avoir" }
{ "index" : { "_index" : "asciifold_movies", "_id":"5"} }
{  "title" : "Kirikou et la sorcière"}
{ "index" : { "_index" : "asciifold_movies", "_id":"6"} }
{  "title" : "Señora Acero"}
{ "index" : { "_index" : "asciifold_movies", "_id":"7"} }
{  "title" : "Señora garçon"}
{ "index" : { "_index" : "asciifold_movies", "_id":"8"} }
{  "title" : "Jour de fete"}

Query the index

Now we query the asciifold_movies index for words with and without Latin characters.

Our first query uses an accented character:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fête"
    }
  }
}

Our second query uses a spelling of the same word without the accent mark:

GET asciifold_movies/_search
{
  "query": {
    "match": {
      "title": "fete"
    }
  }
}

In the preceding queries, the search terms "fête" and "fete" return the same results:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.7361701,
    "hits": [
      {
        "_index": "asciifold_movies",
        "_id": "8",
        "_score": 0.7361701,
        "_source": {
          "title": "Jour de fete"
        }
      },
      {
        "_index": "asciifold_movies",
        "_id": "1",
        "_score": 0.42547938,
        "_source": {
          "title": "Jour de fête"
        }
      }
    ]
  }
}

Similarly, try comparing results for "señora" and "senora" or "sorcière" and "sorciere." The accent-insensitive results are produced by the ASCII folding filter used in the custom analyzer.

Enable aggregations for fields with accents

Now that we have enabled accent-insensitive search, let's look at how to make aggregations work with accents.

Try the following query on the index:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following response:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 1
        },
        {
          "key" : "Jour de fête",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorcière",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon père",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l’oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Señora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Señora garçon",
          "doc_count" : 1
        },
        {
          "key" : "Être et avoir",
          "doc_count" : 1
        }
      ]
    }
  }

Create accent-insensitive aggregations using a normalizer

In the previous example, the aggregation returns two different buckets, one for "Jour de fête" and one for "Jour de fete." We can make the aggregation produce a single bucket for the field, regardless of the diacritics. This is achieved using a normalizer.

The normalizer supports a subset of character and token filters. Using just the defaults, it offers a simple way to standardize Unicode text in a language-independent manner, standardizing different forms of the same character and enabling diacritic-agnostic aggregations.

Let's modify the index mapping to include the normalizer. Delete the previous index, then create a new index with the following mapping and ingest the same dataset:

PUT /asciifold_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_asciifolding": {
          "tokenizer": "standard",
          "filter": [
            "my_ascii_folding"
          ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "filter": "asciifolding"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_asciifolding",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256,
            "normalizer": "custom_normalizer"
          }
        }
      }
    }
  }
}
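As a quick check, the _analyze API also accepts a normalizer, so you can confirm how values in the keyword subfield are normalized (again, an optional verification step we add here, assuming the index above has been created):

GET asciifold_movies/_analyze
{
  "normalizer": "custom_normalizer",
  "text": "Jour de fête"
}

The response contains the single normalized value "Jour de fete", which is why both spellings now land in the same aggregation bucket.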

After you ingest the same dataset, try the following query:

GET asciifold_movies/_search
{
  "size": 0,
  "aggs": {
    "test": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}

We get the following results:

"aggregations" : {
    "test" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Jour de fete",
          "doc_count" : 2
        },
        {
          "key" : "Etre et avoir",
          "doc_count" : 1
        },
        {
          "key" : "Kirikou et la sorciere",
          "doc_count" : 1
        },
        {
          "key" : "La gloire de mon pere",
          "doc_count" : 1
        },
        {
          "key" : "Le roi et l'oiseau",
          "doc_count" : 1
        },
        {
          "key" : "Senora Acero",
          "doc_count" : 1
        },
        {
          "key" : "Senora garcon",
          "doc_count" : 1
        }
      ]
    }
  }

Comparing the results, we can see that the aggregations for "Jour de fête" and "Jour de fete" are rolled up into one bucket with doc_count=2.

Summary

In this post, we showed how to enable accent-insensitive search and aggregations by designing the index mapping to apply ASCII folding to search tokens and to normalize the keyword field for aggregations. You can use the OpenSearch query DSL to implement a range of search features, providing a flexible foundation for structured and unstructured search applications. The open-source OpenSearch community has also extended the product with support for natural language processing, machine learning algorithms, custom dictionaries, and a wide variety of other plugins.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the Author

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise in correlating end-user signals with search engine behavior has helped many customers improve their search experience. Her favorite pastime is hiking the New England trails and mountains.


