WZH

Building Emoji Search with Claude Code in Half an Hour

Sat, 25 Apr 2026 00:00:00 GMT

Built a semantic emoji search with Claude Code. Fully static, the model runs in the browser. A short writeup below.

1. Requirements

Search emoji by text in Chinese or English; pasting an emoji should also surface similar ones.
Show emoji glosses in both languages.
Skin-tone toggle.
Tunable retrieval params: top-K, similarity threshold, etc.
Click to copy; Enter copies the first result.
Fully static and local-only — privacy first.
The model loads on first visit and is cached afterwards; offline-capable from the second visit, with first load under 30s.

2. Overall Design

The model is Xenova/multilingual-e5-small: multilingual, 384-dim, ~30 MB after int8 quantization — small enough to live in the browser's IndexedDB. e5-base would be better quality but isn't worth the cost in the browser.

1914 emoji, each annotated with Chinese and English names plus CLDR keywords (flag entries derive from cldr-annotations-derived-full). At build time, embeddings are computed once in Node, centered, and quantized — totaling roughly 1.1 MB of artifacts.

At runtime the main thread only handles UI; model inference and vector math live in a Web Worker. Queries take one of three paths:

Pure keyword queries hit a dictionary lookup.
Text-semantic queries go through the model embedding + cosine.
Pasted-emoji reverse lookups skip the model entirely and just dot-product against the prebuilt vectors.

3. Notable Bits

3.1 Emoji Reverse Search

The tool needs to support pasting an emoji to find similar ones (e.g. paste 🐼, get back 🦊 🐵 🐻). There's a free optimization here: since the query is an emoji, its vector was already computed at build time and lives in emoji-vectors.i8.bin. At runtime we just look up the index and use it directly — no need to run the model again.

The implementation is straightforward: strip skin-tone modifiers (U+1F3FB to U+1F3FF) from the input, use Intl.Segmenter to extract the first grapheme cluster (which handles ZWJ sequences like 🤦‍♂️), look up Map<emoji, index>, and dot-product directly against the 1914 stored int8 vectors.

Text search runs the model every query (~tens of ms with q8). Reverse search has none of that overhead — it's just dot products, and noticeably faster.
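The lookup path can be sketched in a few lines. This is Python rather than the site's JavaScript, and every name in it (`INDEX_OF`, `VECTORS`, `reverse_search`) and the 4-dimensional toy vectors are hypothetical. It assumes the input is already a single emoji, since the Python standard library has no counterpart to Intl.Segmenter, and it assumes the stored vectors were unit-normalized before quantization, so that ranking by raw int8 dot product approximates ranking by cosine.

```python
import re

# Fitzpatrick skin-tone modifiers, U+1F3FB..U+1F3FF
SKIN_TONE = re.compile("[\U0001F3FB-\U0001F3FF]")

def strip_skin_tone(text):
    """Drop skin-tone modifiers so that a toned emoji maps to its base form."""
    return SKIN_TONE.sub("", text)

def reverse_search(query, index_of, vectors, top_k=3):
    """Rank stored emoji by dot product with the query emoji's own stored vector."""
    qi = index_of.get(strip_skin_tone(query))
    if qi is None:
        return []  # unknown emoji: a caller could fall back to text search
    q = vectors[qi]
    scores = []
    for emoji, i in index_of.items():
        if i == qi:
            continue  # do not return the query itself
        scores.append((sum(a * b for a, b in zip(q, vectors[i])), emoji))
    scores.sort(reverse=True)
    return [emoji for _, emoji in scores[:top_k]]

# Toy data with made-up values standing in for the prebuilt int8 vectors
INDEX_OF = {"👍": 0, "👎": 1, "🐼": 2}
VECTORS = [
    [100, 20, -5, 0],
    [-90, 25, 0, 10],
    [5, -10, 120, 30],
]

RESULT = reverse_search("👍🏽", INDEX_OF, VECTORS, top_k=2)
print(RESULT)
```

No model call appears anywhere in this path, which is the whole point: the cost is one dictionary lookup plus one pass of dot products.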

3.2 The e5 Distribution Bias

Right after wiring up reverse search, dragging the similarity slider had no visible effect. I checked the cosine distribution from 🐼 to every other emoji:

top-1   0.972  (🦊)
top-10  0.955  (🐭)
p50     0.901
p90     0.886  ← 90% of emoji sit above this

A 0.84 threshold basically lets every one of the 1913 other emoji through. Likely cause: e5 is trained with a passage: prefix on every passage, so all doc vectors get pulled toward a shared direction by that common context. Computing the mean confirms it: ||mean|| ≈ 0.7, far from negligible.

This isn't a new problem. The fix has been around for years: All-but-the-Top (Mu & Viswanath, ICLR 2018) discusses how a few common directions dominate distance in word embeddings. The remedy is to subtract the mean and renormalize:

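// Center every row on the corpus mean, then rescale it back to unit length.
// vec is a flat row-major array of n * dim values; mean has dim values.
for (let i = 0; i < n; i++) {
  let n2 = 0; // squared norm of the centered row
  for (let j = 0; j < dim; j++) {
    vec[i * dim + j] -= mean[j];
    n2 += vec[i * dim + j] ** 2;
  }
  const nrm = Math.sqrt(n2);
  for (let j = 0; j < dim; j++) vec[i * dim + j] /= nrm;
}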

Subtract the mean at build time, store the centered vectors, and at runtime subtract the same mean from query embeddings — keeping the feature spaces aligned. After:

🐼 vs 🐻     0.963  →   0.570
🐼 vs 🍜     0.909  →  -0.011
🐼 vs 😀     0.920  →   0.201

The distribution opens up. The same threshold now works for both text and reverse search.
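The query-side step can be illustrated with a toy example. This is a minimal Python sketch, not the site's code: `center_and_normalize`, `MEAN`, `DOC`, and `QUERY` are hypothetical, and the 3-dimensional vectors are made up so that both share a large common component along the first axis.

```python
import math

def center_and_normalize(vec, mean):
    """Subtract the stored corpus mean, then rescale to unit length."""
    centered = [v - m for v, m in zip(vec, mean)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def cosine(a, b):
    """Plain cosine similarity; normalizes both inputs itself."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

MEAN = [0.9, 0.0, 0.0]     # shared direction (made-up values)
DOC = [0.95, 0.3, 0.0]     # would be centered once at build time
QUERY = [0.95, -0.3, 0.0]  # would be centered at query time with the same mean

RAW = cosine(DOC, QUERY)
CENTERED = cosine(center_and_normalize(DOC, MEAN),
                  center_and_normalize(QUERY, MEAN))
print(f"raw cosine: {RAW:.3f}, centered cosine: {CENTERED:.3f}")
```

The two toy vectors point in opposite directions once the shared component is gone, yet their raw cosine is still above 0.8. Using one stored mean on both sides is what keeps document and query vectors comparable.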

The All-but-the-Top paper actually recommends going further: subtract the mean and remove the top K = D/100 principal components (D=384 → K≈4). I tried it:

K	🐼 top-1	🐼 vs 🐻	🐼 vs 🍜
0 (mean only)	0.628	0.572	-0.009
4 (paper default)	0.531	0.478	-0.090
8	0.376	0.343	0.004

But the result is counterintuitive: as K grows, the cosine to related emoji (🐼-🐻) gets squashed alongside everything else — the distribution narrows instead of widening. My guess is the paper targeted word2vec / GloVe — Zipfian-frequency-dominated word vectors with many shared directions — whereas e5 is a contrastively trained sentence encoder whose training objective already does a lot of the work compressing shared directions. After removing the mean, what's left is mostly real semantics, and removing more is just dropping signal.

Final pick: K=0, first-order is enough.
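For reference, the component-removal step that was tried and rejected looks roughly like this. It is a sketch under stated assumptions: it is pure Python with power iteration standing in for a real PCA or SVD routine, the function names and the toy data are hypothetical, and the final renormalization is added here because the vectors are compared by cosine.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_components(rows, k, iters=100, seed=0):
    """Approximate the top-k principal directions of already-centered rows
    by power iteration with deflation."""
    rng = random.Random(seed)
    dim = len(rows[0])
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # w = X^T (X v): one multiplication by the scatter matrix
            w = [0.0] * dim
            for r in rows:
                s = dot(r, v)
                for j in range(dim):
                    w[j] += s * r[j]
            # Deflate: drop the directions that were already extracted
            for c in comps:
                p = dot(w, c)
                w = [x - p * y for x, y in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        comps.append(v)
    return comps

def remove_components(rows, comps):
    """Subtract each row's projection onto every component, then renormalize."""
    out = []
    for r in rows:
        for c in comps:
            p = dot(r, c)
            r = [x - p * y for x, y in zip(r, c)]
        norm = math.sqrt(dot(r, r))
        out.append([x / norm for x in r])
    return out

# Toy centered data (made up): almost all variance lies along the first axis
ROWS = [[3.0, 0.1, 0.0], [-3.0, 0.0, 0.2], [2.0, -0.1, 0.1], [-2.0, 0.0, -0.3]]
COMPS = top_components(ROWS, k=1)
CLEANED = remove_components(ROWS, COMPS)
```

With K=0 the `remove_components` call is a no-op apart from the renormalization, which matches the mean-only variant that was kept.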

Site Changelog

Sun, 19 Oct 2025 00:00:00 GMT

(Photo by Volodymyr Dobrovolskyy on Unsplash)

Changelog

2025-11

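Improved the sorting algorithm for the Cards masonry layout and optimized image loading performance.
Cleaned up and standardized domain usage.
Fixed some known issues; content styling and interaction improvements.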

2025-10

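Built the first version of the site (Astro framework, Pure theme, borrowing heavily from the styling of joeytoday's blog).
Set up the deployment pipeline (Cloudflare Workers).
Set up the object storage pipeline (Cloudflare R2).
Integrated image processing (Cloudflare Transform Images).
Built a bot that automatically uploads posted images to the image host (Telegram Bots).
Added the Cards content type.
Removed the Comments module, the post sharing module, the reading-time indicator, and others.
Content styling and interaction improvements.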

How Many Toilet Stalls Does a Company Need?

Sun, 11 Apr 2021 00:00:00 GMT

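1. The Waiting Problem

Each floor of my office has only 4 toilet stalls for at least two hundred people, so every trip involves some waiting. While waiting one day, I came up with this question:

Suppose 4 stalls serve 200 people and each person takes 20 minutes on average. During working hours the stalls are never idle (whenever someone leaves, someone else goes straight in), and before the workday officially begins 4 people have already started at arbitrary times. If I walk in at a random moment during the workday and find all 4 stalls occupied with nobody queuing ahead of me, how long will I have to wait?

Back to the problem. Since we assume everyone takes exactly 20 minutes (strictly speaking, 20 minutes is only the expected value; we fix it here to keep things simple), the best case is that someone walks out just as I arrive and I wait 0 minutes. The worst case is that all 4 people sat down just as I arrived and I wait the full 20 minutes. So all we need is the expected time until the first person comes out within that 20-minute window. And because only the 20 minutes before my arrival matter, the problem becomes: draw 4 numbers uniformly at random between 0 and 20; the smallest of them is my waiting time.

1.1 Monte Carlo Method

Being a programmer, let's start with a trusty Monte Carlo simulation, with per-person time $T = 20$ and number of stalls $N = 4$, over 1 million trials: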

import random

NUM_TESTS = 1000000  # Num of tests
NUM_TOILETS = 4      # Num of toilets
TOILET_TIME = 20     # Average toilet time

total_time = 0
for _ in range(NUM_TESTS):
    total_time += min([random.random() * TOILET_TIME for _ in range(NUM_TOILETS)])

print(f'Num of tests: {NUM_TESTS:,}')
print(f'Num of toilets: {NUM_TOILETS}')
print(f'Average toilet time: {TOILET_TIME} mins')
print(f'You need to wait for {total_time / NUM_TESTS:.3f} mins.')

The simulation prints:

Num of tests: 1,000,000
Num of toilets: 4
Average toilet time: 20 mins
You need to wait for 3.996 mins.

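In this setting we wait about 4 minutes.

Try a few other values and it quickly becomes clear that the average wait is $\frac{T}{N + 1}$.

1.2 Analytical Method

Now let's derive it.

Let each person take time $T$, let there be $N$ stalls, and let $t_i$, $i \in \{1, 2, \cdots, N\}$, be the moment within the last $T$-minute window at which stall $i$ was entered, each independent and uniform on $[0, T]$. By the analysis above, the wait is $x = \min(t_1, t_2, \cdots, t_N)$, and we want its expectation $E[x]$.

First, compute the cumulative distribution function (CDF) of $x$: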

$$
\begin{aligned}
\operatorname{CDF}(x) &= P(\min_i t_i \le x) \\
&= 1 - P(\min_i t_i > x) \\
&= 1 - \prod_{i = 1}^{N} P(t_i > x) \\
&= 1 - (1 - \frac{x}{T})^N
\end{aligned}
$$

Differentiating the CDF gives the probability density function (PDF) of $x$:

$$
\begin{aligned}
\operatorname{PDF}(x) &= (-1) \cdot (N) \cdot (1 - \frac{x}{T})^{N - 1} \cdot (-\frac{1}{T}) \\
&= \frac{N}{T} \cdot (1 - \frac{x}{T})^{N - 1}
\end{aligned}
$$

The expectation $E[x]$ is then:

$$
\begin{aligned}
E[x] &= \int_0^T x \cdot \operatorname{PDF}(x) ~ dx \\
&= \int_0^T x \cdot \frac{N}{T} \cdot (1 - \frac{x}{T})^{N - 1} ~ dx
\end{aligned}
$$

We integrate by substitution:

$$
\begin{aligned}
t(x) &= 1 - \frac{x}{T} \\
t'(x) &= -\frac{1}{T} \\
x &= (1 - t) \cdot T \\
E[x] &= \int_0^T x \cdot \frac{N}{T} \cdot (1 - \frac{x}{T})^{N - 1} ~ dx \\
&= - N \int_0^T x \cdot (1 - \frac{x}{T})^{N - 1} \cdot (-\frac{1}{T}) ~ dx \\
&= - N \int_1^0 (1 - t) \cdot T \cdot t^{N - 1} ~ dt \\
&= - NT \int_1^0 (t^{N - 1} - t^N) ~ dt \\
&= - NT \int_1^0 (\frac{t^N}{N} - \frac{t^{N + 1}}{N + 1})' ~ dt \\
&= - NT \cdot (0 - \frac{1}{N (N + 1)}) \\
&= \frac{T}{N + 1}
\end{aligned}
$$

This agrees with the Monte Carlo result.
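As a third, independent check, the expectation integral can be evaluated numerically and compared with the closed form $\frac{T}{N + 1}$. This is a minimal sketch using a midpoint rule; the function name `expected_wait` and the step count are choices made for this example only.

```python
def expected_wait(T, N, steps=20_000):
    """Midpoint-rule estimate of the integral of x * (N/T) * (1 - x/T)^(N-1)
    over [0, T], i.e. the expected minimum of N uniform draws on [0, T]."""
    h = T / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h  # midpoint of the k-th sub-interval
        total += x * (N / T) * (1 - x / T) ** (N - 1)
    return total * h

for N in (2, 4, 10):
    numeric = expected_wait(20, N)
    closed_form = 20 / (N + 1)
    print(f"N={N}: numeric={numeric:.4f}, T/(N+1)={closed_form:.4f}")
```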

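2. A Perceptual Illusion: The Inspection Paradox

Recall the conclusion of the previous section:

Let each person take time $T$, let there be $N$ stalls, and let $t_i$, $i \in \{1, 2, \cdots, N\}$, be the moment within the last $T$-minute window at which stall $i$ was entered. If we walk in at a random moment and find every stall occupied with nobody waiting, the expected wait $x$ is: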

$$E[x] = \frac{T}{N + 1}$$

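In real life, though, the wait often feels much longer than that. Take my office: with $N=4$ stalls and $T=20$ minutes per person, I should wait $\frac{T}{N + 1} = \frac{20}{4 + 1} = 4$ minutes. Yet I feel I often wait more than 10 minutes, and the longest wait I remember was over 15. Why?

This is in fact a sampling bias problem.

Under the earlier setup, every visit has a fixed length, each stall's start time is random and independent, and we arrive at an arbitrary moment (one sample), so the sampled waiting time follows a uniform distribution. In reality, visits vary in length and are more plausibly Gaussian with mean $T$ than uniform. That makes the "intervals" that partition each stall's timeline uneven. If we still arrive at a uniformly random moment (one sample), we land in the long-wait intervals more often and in the short-wait intervals less often, so we really do wait longer (and not just in perception)!

Let's rerun the Monte Carlo simulation. We replace the fixed $T=20$ with a normal distribution with mean $T=20$ and standard deviation $5$, keep everything else unchanged, and run 1 million trials: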

import random

NUM_TESTS = 1000000    # Num of tests
NUM_TOILETS = 4        # Num of toilets
TOILET_TIME = 20       # Average toilet time
TOILET_TIME_SIGMA = 5  # Standard deviation of toilet time

total_time = 0
for _ in range(NUM_TESTS):
    waiting_time = []
    while len(waiting_time) < NUM_TOILETS:
        t = random.normalvariate(TOILET_TIME, TOILET_TIME_SIGMA)
        if t > 0:
            waiting_time.append(t)
    total_time += min(waiting_time)

print(f'Num of tests: {NUM_TESTS:,}')
print(f'Num of toilets: {NUM_TOILETS}')
print(f'Average toilet time: {TOILET_TIME} mins')
print(f'You need to wait for {total_time / NUM_TESTS:.3f} mins.')

The simulation prints:

Num of tests: 1,000,000
Num of toilets: 4
Average toilet time: 20 mins
You need to wait for 14.857 mins.

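Amazing: the wait jumps from 4 minutes to 15! 😂

Of course, the Gaussian here is assumed out of habit, with a hand-picked spread for the visit length, and neither is necessarily right. The point is only to build intuition for the phenomenon. Researchers have studied it too. For example, the American computer scientist Allen Downey looked at average class sizes at Purdue University: the average obtained by surveying students was 90, while the true average reported by the registrar was 35. He used this to illustrate the inspection paradox (Wikipedia), which roughly says:

If we wait some predefined time $t$ and then look at how large the renewal interval containing $t$ is, we should expect it to be larger than a renewal interval of average size.

For the precise mathematical statement and proof, see Wikipedia or the relevant papers.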
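The class-size version of the paradox is easy to reproduce with arithmetic alone. In this sketch the class sizes are made up (they are not Downey's data) and the function names are hypothetical. The registrar averages over classes, while a survey of students averages over students, so a class of size $s$ is counted $s$ times.

```python
def mean_size(sizes):
    """Average class size as the registrar sees it: one vote per class."""
    return sum(sizes) / len(sizes)

def experienced_mean_size(sizes):
    """Average class size as students see it: a class of size s is reported
    by s students, so each class is weighted by its own size."""
    return sum(s * s for s in sizes) / sum(sizes)

# Hypothetical department: four small seminars and one large lecture
SIZES = [10, 10, 10, 10, 160]

print(f"per-class average:   {mean_size(SIZES):.1f}")
print(f"per-student average: {experienced_mean_size(SIZES):.1f}")
```

The size-weighted average equals $\mu + \sigma^2 / \mu$, where $\mu$ and $\sigma^2$ are the per-class mean and variance, so the gap grows with the spread of the sizes and vanishes only when every class has the same size.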

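Here are a few more phenomena related to sampling bias:

While waiting for a bus, the sign says one comes every 10 minutes, yet the wait feels far longer than 10 minutes.
While driving, whichever route you take, you always seem to hit red lights.
Celebrities, professional esports players, and livestream sellers all seem to make good money, but that is because the ones who make money get the exposure.
Airlines often say load factors are too low and they are losing money, yet flights often feel crowded when you are on one.
Allen Downey: on a social network, the average user has 44 friends, while the average friend of a user has 104 friends.
Allen Downey: the average sentence per federal inmate is 3.6 years, but the average sentence among all inmates currently serving is 13 years.

Sometimes these phenomena are explained with survivorship bias (Wikipedia) instead. Survivorship bias and the inspection paradox both come down to differences in how the data is counted or sampled. I won't go further here.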

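3. Takeaways

What does this teach us?

First, on a personal level, waits for buses and toilet stalls really do run longer than the average; that is simply how the world works. When it happens, take it easy. 🙂

Second, if you were designing the company's restroom capacity, what would effectively reduce the wait? Reusing the Gaussian Monte Carlo simulation from above, let's compute how the actual wait depends on the number of stalls: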

import random

NUM_TESTS = 1000000    # Num of tests
TOILET_TIME = 20       # Average toilet time
TOILET_TIME_SIGMA = 5  # Standard deviation of toilet time

for num_toilets in range(2, 22, 2):

    total_time1 = 0
    for _ in range(NUM_TESTS):
        total_time1 += min([random.random() * TOILET_TIME for _ in range(num_toilets)])

    total_time2 = 0
    for _ in range(NUM_TESTS):
        waiting_time = []
        while len(waiting_time) < num_toilets:
            t = random.normalvariate(TOILET_TIME, TOILET_TIME_SIGMA)
            if t > 0:
                waiting_time.append(t)
        total_time2 += min(waiting_time)

    print(f'| {num_toilets} | {total_time1 / NUM_TESTS:.3f} | {total_time2 / NUM_TESTS:.3f} |')

The simulation prints:

Stalls	Uniform wait (min)	Gaussian wait (min)
2	6.664	17.181
4	3.999	14.858
6	2.854	13.664
8	2.224	12.881
10	1.823	12.311
12	1.538	11.857
14	1.332	11.491
16	1.177	11.178
18	1.052	10.908
20	0.951	10.667

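The uniform wait drops markedly, while the Gaussian wait barely moves. Where does the gap come from? With more stalls, total capacity certainly grows in proportion, yet the perceived improvement seems small. What this overlooks is a key assumption we made: every time we arrive, all stalls are occupied and nobody is queuing.

In fact, as the number of stalls grows, it becomes less common to find them all occupied. As an extreme example, going from 4 stalls to 20 means there is usually a free stall when we arrive and the wait is practically 0, and those arrivals never get sampled.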
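How quickly "all stalls occupied" becomes rare can be illustrated with the Erlang B formula. This is a rough sketch under assumptions that go beyond the setup above: arrivals are treated as a Poisson process, a person who finds every stall busy leaves instead of queuing, and the offered load of 3 (on average three stalls' worth of demand) is a made-up number, not a measurement.

```python
def erlang_b(offered_load, stalls):
    """Probability that all stalls are busy in a loss system, computed with
    the standard recurrence B(0) = 1, B(n) = a*B(n-1) / (n + a*B(n-1))."""
    b = 1.0
    for n in range(1, stalls + 1):
        b = offered_load * b / (n + offered_load * b)
    return b

LOAD = 3.0  # hypothetical offered load: arrival rate times mean visit length

for stalls in (4, 8, 12, 20):
    print(f"{stalls:2d} stalls: P(all busy) = {erlang_b(LOAD, stalls):.2e}")
```

Under these assumptions the chance of finding every stall taken is roughly one in five with 4 stalls and becomes negligible well before 20, which is the part of the benefit that the conditional simulation cannot see.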

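A more careful treatment would factor in the total headcount on the floor, peak restroom hours, and so on, which turns this into a complex planning problem. Either way, when there are only 4 stalls and waits regularly exceed 10 or even 15 minutes, adding stalls is always a good thing.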

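4. References

Expectation of Minimum of $n$ i.i.d. uniform random variables (2014, Mathematics Stack Exchange)
What is the most advanced math you have ever used in daily life? (2017, 王赟 Maigo)
Is the bus always late? You have probably fallen into the waiting-time paradox (2018, 李雷 et al.)
Can livestream selling make big money? Why does the bus never come? (2020, 李永乐)
Which mathematical facts would people without some mathematical background refuse to believe? (2020, Zhihu answer, with screenshots of a proof of the inspection paradox)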