Multilingual Emoji Search with E5 Embeddings

I built a fully static emoji search tool that runs multilingual semantic retrieval in the browser. It supports Chinese and English text queries, emoji-to-emoji reverse search, skin-tone variants, keyboard copy, and offline use after the first visit. No query leaves the browser.

You can try the emoji search tool.

The design is a small retrieval system under browser constraints: choose a multilingual embedding model, precompute and compress the corpus, route different query types efficiently, and keep the scoring behavior calibrated enough for interactive controls to make sense. The most useful lesson had little to do with the UI. It was finding where a lightweight embedding pipeline still needs measurement and post-processing.

1. Product and System Constraints

The tool had a few constraints that shaped the implementation:

Query by Chinese or English text, combining exact keyword matching with semantic retrieval.
Paste an emoji and retrieve semantically related emoji.
Support skin-tone selection without duplicating semantic entries.
Keep inference local: no server-side model calls, query logging, or API dependency.
Stay deployable as static assets.
Load the model on first visit, cache it, and make repeat visits work offline.

Those constraints made the model choice less about maximum benchmark quality and more about the quality-latency-size tradeoff. I used Xenova/multilingual-e5-small: multilingual, 384 dimensions, and about 30 MB after int8 quantization. A larger E5 model would likely improve retrieval quality, but the added download and browser inference cost did not fit the target experience.

The corpus contains 1,914 emoji entries. Each entry has Chinese and English names plus CLDR keywords; flag entries come from cldr-annotations-derived-full. At build time, a Node script embeds the corpus once, applies the vector post-processing described below, and writes roughly 1.1 MB of artifacts.

At runtime, the main thread only handles UI. Model inference and vector math run in a Web Worker. Query routing has three paths:

Exact keyword hits use a dictionary fast path for lower latency and more predictable results.
Text-semantic queries run the E5 embedding model and cosine retrieval.
Pasted-emoji reverse lookups skip model inference and compare against prebuilt vectors directly.
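
The three paths above can be sketched as a small routing function. This is a minimal sketch, not the tool's actual code: routeQuery, keywordIndex, emojiIndex, and the returned object shape are hypothetical names, and checking for a pasted emoji before the keyword dictionary is an assumed ordering.

```javascript
// Hypothetical router. All names and the return shape are illustrative.
const SKIN_TONE = /[\u{1F3FB}-\u{1F3FF}]/gu;

function routeQuery(raw, keywordIndex, emojiIndex) {
  const query = raw.trim();
  if (query === "") return { path: "empty" };

  // Pasted emoji: its vector was built ahead of time, so no model call.
  const stripped = query.replace(SKIN_TONE, "");
  if (emojiIndex.has(stripped)) {
    return { path: "reverse", index: emojiIndex.get(stripped) };
  }

  // Exact keyword hit: dictionary fast path, also no model call.
  const key = query.toLowerCase();
  if (keywordIndex.has(key)) {
    return { path: "keyword", indices: keywordIndex.get(key) };
  }

  // Everything else needs the embedding model.
  return { path: "semantic", text: query };
}
```

Only the last branch pays for inference, which is what keeps the common cases fast and predictable.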

Fig. 1 · Build artifacts feed directly into the runtime Worker. The only per-query model call is for free-text semantic search.

2. Reverse Search Without Runtime Inference

Emoji reverse search has a useful shortcut: if the query is already an emoji, its vector was computed at build time. Runtime inference is unnecessary.

The implementation normalizes the input by stripping skin-tone modifiers (U+1F3FB to U+1F3FF), then uses Intl.Segmenter to extract the first grapheme cluster. That matters for ZWJ sequences such as 🤦‍♂️, where JavaScript string length is not the same as user-perceived character length. After that, the worker looks up Map<emoji, index> and dot-products the query vector against the 1,914 stored int8 vectors.
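
A rough sketch of that lookup is below. The names firstEmoji, reverseSearch, emojiIndex, and vectors are hypothetical, and the int8 scale of 127 per unit is an assumption about how the vectors were packed, not something the build is known to use.

```javascript
// Sketch only: artifact names and the int8 scale are assumptions.
const SKIN_TONE_MODIFIERS = /[\u{1F3FB}-\u{1F3FF}]/gu;

// Strip skin-tone modifiers, then return the first grapheme cluster.
function firstEmoji(input) {
  const stripped = input.replace(SKIN_TONE_MODIFIERS, "");
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  for (const { segment } of segmenter.segment(stripped)) return segment;
  return "";
}

// Score the pasted emoji against every stored int8 vector (row-major).
function reverseSearch(input, emojiIndex, vectors, dim) {
  const q = emojiIndex.get(firstEmoji(input));
  if (q === undefined) return null; // not a known emoji
  const count = vectors.length / dim;
  const scores = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    let dot = 0;
    for (let j = 0; j < dim; j++) {
      dot += vectors[q * dim + j] * vectors[i * dim + j];
    }
    scores[i] = dot / (127 * 127); // undo the assumed scale on both operands
  }
  return scores;
}
```

Segmenting after stripping means a toned ZWJ sequence resolves to the same key as its untoned form, so one stored vector serves every skin-tone variant.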

Text search still runs the model per query, which takes tens of milliseconds with q8 inference. Reverse search is just vector lookup plus dot products, so it is both simpler and faster.

3. Diagnosing the Retrieval Failure

After wiring up reverse search, the similarity slider had almost no visible effect. That was a signal that the score distribution, not the UI, was broken.

For a pasted 🐼 query, the raw cosine scores against the other 1,913 emoji looked like this:

top-1      0.972  (🦊)
top-10     0.955  (🐭)
median     0.901
bottom-10% 0.886  (90% of emoji score above this)

A threshold of 0.84 was therefore close to meaningless: it admitted almost every emoji. The likely cause is embedding anisotropy. E5-style retrieval prepends a "passage: " prefix to corpus items, and that shared context can add a common component to document vectors. The corpus mean confirmed the suspicion: for unit-length vectors spread evenly around the origin, ||mean|| would be close to 0, but here it was around 0.7, large enough to dominate cosine comparisons.
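
The two measurements used in this diagnosis, the norm of the corpus mean and the quantiles of a score list, can be reproduced with a few helper functions. This is a minimal sketch: corpusMean, l2Norm, and quantile are hypothetical names, and vec is assumed to be a flat row-major array of unit-length embeddings.

```javascript
// Diagnostic sketch. Helper names and the flat row-major layout are assumptions.
function corpusMean(vec, n, dim) {
  const mean = new Float64Array(dim);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < dim; j++) mean[j] += vec[i * dim + j] / n;
  }
  return mean;
}

function l2Norm(v) {
  let sum = 0;
  for (const x of v) sum += x * x;
  return Math.sqrt(sum);
}

// Nearest-rank quantile over a sorted copy of the scores, with p in [0, 1].
function quantile(scores, p) {
  const sorted = Float64Array.from(scores).sort();
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank))];
}
```

For unit vectors, l2Norm(corpusMean(...)) is bounded by 1, so a value near 1 means the vectors mostly point the same way, and comparing quantile(scores, 0.1) with quantile(scores, 0.5) shows how little room a threshold has to work with.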

This failure mode is related to the issue described in All-but-the-Top (Mu & Viswanath, ICLR 2018): a small number of common directions can dominate distances in embedding space. The minimal fix is to subtract the corpus mean and renormalize:

JavaScript

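// Assumes vec is a flat row-major array of n * dim floats and mean has dim entries.
for (let i = 0; i < n; i++) {
  let n2 = 0;
  for (let j = 0; j < dim; j++) {
    vec[i * dim + j] -= mean[j]; // remove the shared corpus direction
    n2 += vec[i * dim + j] ** 2;
  }
  // Renormalize to unit length; fall back to 1 so a zero vector does not divide by zero.
  const nrm = Math.sqrt(n2) || 1;
  for (let j = 0; j < dim; j++) vec[i * dim + j] /= nrm;
}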

Corpus vectors are mean-centered at build time, and runtime text-query embeddings subtract the same corpus mean before cosine search. Otherwise the query and corpus would live in different vector spaces.
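
On the query side, the same transform might look like the sketch below. The names centerQuery and search are hypothetical, mean is assumed to ship with the build artifacts, and the int8 scale of 127 is an assumption. The default threshold of 0.20 is the one shown in Fig. 2.

```javascript
// Sketch: function names, the artifact layout, and the int8 scale are assumptions.
function centerQuery(embedding, mean) {
  const out = new Float32Array(embedding.length);
  let n2 = 0;
  for (let j = 0; j < out.length; j++) {
    out[j] = embedding[j] - mean[j]; // same corpus mean used at build time
    n2 += out[j] * out[j];
  }
  const nrm = Math.sqrt(n2) || 1; // avoid dividing by zero
  for (let j = 0; j < out.length; j++) out[j] /= nrm;
  return out;
}

// Cosine search of a centered unit query against int8 corpus rows.
function search(query, vectors, dim, threshold = 0.2) {
  const hits = [];
  const count = vectors.length / dim;
  for (let i = 0; i < count; i++) {
    let dot = 0;
    for (let j = 0; j < dim; j++) dot += query[j] * vectors[i * dim + j];
    const score = dot / 127; // only the corpus side is quantized
    if (score >= threshold) hits.push({ index: i, score });
  }
  return hits.sort((a, b) => b.score - a.score);
}
```

Skipping centerQuery would compare an uncentered query against centered corpus vectors, which is the mismatch described above.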

After centering, the scores became interpretable:

🐼 vs 🐻     0.963  ->   0.570
🐼 vs 🍜     0.909  ->  -0.011
🐼 vs 😀     0.920  ->   0.201

Before centering, all three pairs looked highly similar. After centering, the animal pair stays high, the food pair moves near zero, and the generic face emoji lands in between. That is the behavior I want from a threshold slider.

Fig. 2 · Cosine distribution of 🐼 against the other 1,913 emoji on a shared [−1, 1] x-axis. Solid line = cosine 0; dashed line = default threshold 0.20. Before centering, scores collapse into 0.85-0.95. After centering, the distribution spreads across roughly [−0.24, 0.62], so the threshold becomes useful.

4. Why I Stopped at Mean-Centering

The All-but-the-Top paper recommends subtracting the mean and removing the top K = D / 100 principal components. With 384-dimensional E5 vectors, that suggests K ≈ 4. I tested that as an ablation:

Post-processing    🐼 top-1    🐼 vs 🐻    🐼 vs 🍜
Mean only          0.628       0.572       -0.009
Mean + 4 PCs       0.531       0.478       -0.090
Mean + 8 PCs       0.376       0.343        0.004

For this corpus, removing principal components did not help. It reduced scores for related emoji along with unrelated ones, which made the distribution narrower rather than more useful.
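
For reference, the component-removal step in that ablation can be approximated without a linear algebra library. The sketch below uses power iteration with deflation, which is one possible way to do it and not necessarily how the ablation was run. The names topComponent and removeTopComponents are hypothetical, and x is assumed to be flat, row-major, and already mean-centered.

```javascript
// Sketch: power iteration is a swapped-in technique, and all names are illustrative.
// Returns the dominant principal direction of mean-centered, row-major data.
function topComponent(x, n, dim, iters = 200) {
  // Deterministic, non-uniform starting vector.
  let v = Float64Array.from({ length: dim }, (_, j) => Math.sin(j + 1));
  for (let t = 0; t < iters; t++) {
    const w = new Float64Array(dim); // w = X^T (X v)
    for (let i = 0; i < n; i++) {
      let proj = 0;
      for (let j = 0; j < dim; j++) proj += x[i * dim + j] * v[j];
      for (let j = 0; j < dim; j++) w[j] += proj * x[i * dim + j];
    }
    let norm = 0;
    for (let j = 0; j < dim; j++) norm += w[j] * w[j];
    norm = Math.sqrt(norm);
    if (norm === 0) break; // no variance left to find
    for (let j = 0; j < dim; j++) w[j] /= norm;
    v = w;
  }
  return v;
}

// Removes the top k principal directions in place, one at a time.
function removeTopComponents(x, n, dim, k) {
  for (let c = 0; c < k; c++) {
    const v = topComponent(x, n, dim);
    for (let i = 0; i < n; i++) {
      let proj = 0;
      for (let j = 0; j < dim; j++) proj += x[i * dim + j] * v[j];
      for (let j = 0; j < dim; j++) x[i * dim + j] -= proj * v[j];
    }
  }
  return x;
}
```

Rows would still need L2 normalization afterward, as in the mean-centering step, before their dot products can be read as cosines.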

My interpretation is that the paper’s default recipe was designed around word2vec/GloVe-style word embeddings, where frequency-driven common directions are a major source of distortion. E5 is a contrastively trained sentence encoder. After removing the corpus mean, additional component removal appears to discard semantic signal rather than just nuisance directions.

So the final choice is deliberately simple: mean-center, L2-normalize, quantize to int8, and keep the full 384-dimensional space. That gives a better quality-size-latency tradeoff than applying the full paper recipe blindly.
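
The last step of that pipeline, packing unit vectors into int8, could look like the following. This is a sketch under an assumed scheme (symmetric quantization with a fixed scale of 127); quantizeInt8 and dequantizedDot are hypothetical names, and the real build may pack vectors differently.

```javascript
// Sketch: symmetric int8 quantization with a fixed scale of 127 (assumed scheme).
function quantizeInt8(unitVectors) {
  const out = new Int8Array(unitVectors.length);
  for (let i = 0; i < unitVectors.length; i++) {
    const q = Math.round(unitVectors[i] * 127);
    out[i] = Math.max(-127, Math.min(127, q)); // clamp to a symmetric range
  }
  return out;
}

// Approximate cosine between two quantized rows stored in flat arrays.
function dequantizedDot(a, offsetA, b, offsetB, dim) {
  let dot = 0;
  for (let j = 0; j < dim; j++) dot += a[offsetA + j] * b[offsetB + j];
  return dot / (127 * 127);
}
```

At one byte per dimension, 1,914 vectors of 384 dimensions take 734,976 bytes, which leaves the rest of the roughly 1.1 MB artifact budget for names, keywords, and the mean vector.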

5. Evaluation Notes and Next Steps

This is not a benchmark, but I wanted the checks to cover the failure mode that actually broke the product:

Score distribution: unrelated emoji should not all sit above the default threshold.
Local semantic sanity checks: animal-to-animal pairs should score above animal-to-food pairs.
Query-path parity: the same threshold should behave reasonably for text search and pasted-emoji reverse search.
Latency and packaging: first load should stay acceptable, and repeat visits should avoid network requests and model reloads.

If I were turning this into a more formal retrieval project, I would add a small hand-labeled evaluation set across Chinese queries, English queries, and emoji reverse queries; compare multilingual-e5-small against smaller and larger embedding models; measure quantization error before and after int8 packing; and tune thresholds against precision@K rather than by visual inspection.
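
As a starting point for that evaluation, precision@K over a hand-labeled set takes only a few lines. This is a sketch with hypothetical names (precisionAtK, meanPrecisionAtK) and an assumed data shape: each labeled query carries its ranked result ids and a Set of relevant ids.

```javascript
// Sketch: names and the labeled-query shape are assumptions for illustration.
// Fraction of the top k ranked ids that appear in the relevant set.
function precisionAtK(ranked, relevant, k) {
  if (k <= 0) return 0;
  let hits = 0;
  for (const id of ranked.slice(0, k)) {
    if (relevant.has(id)) hits++;
  }
  return hits / k; // missing results count as misses
}

// Mean precision@K across labeled queries: [{ ranked: [...], relevant: Set }].
function meanPrecisionAtK(queries, k) {
  if (queries.length === 0) return 0;
  let total = 0;
  for (const { ranked, relevant } of queries) {
    total += precisionAtK(ranked, relevant, k);
  }
  return total / queries.length;
}
```

Running meanPrecisionAtK separately on the Chinese, English, and emoji reverse subsets would show whether one query path lags the others.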

I built the first version with Claude Code, which helped compress implementation time. The part I would not delegate away is the retrieval judgment: checking the score distribution, identifying the embedding-space failure mode, and choosing the simplest post-processing step that fixed the product behavior without over-engineering the implementation.