长字符串如何建立索引

场景分析

库中有一张表记录着所有客户文件上传或者、客户上传的BASE64图片资源，file_store_record 表中有两个字段 name original_name 分别取记录文件名称和文件的原文件名称，由于文件名称是唯一的，有些做法会在该字段下建立唯一索引，并且都是通过时间戳（秒||毫秒）+ 文件后缀的形式去生成唯一值

1
2
3

name				original.name
15634508992959536.jpg		1寸白底.jpg
15634510398049541.jpg		1寸白底.jpg

索引优化

首先我们分析下上面的情况，能如何去优化，或者说有哪些问题，首先name 的字段类型是 VARCHAR，VARCHAR 存储可变长度的 M 个字符，大小 0-65535 字节，如果是 UTF-8 编码的话，一个字符占 3 字节，如果是 UTF-8mb4，一个字符占 4 字节，UTF8mb4 varchar(10)=40字节；从上面的例子可以看出已经超过了 20 字符，name 字段 varchar(20)，索引的存储并不会存储你改列的实际大小，只会存储改字段定义类型所占的字节大小；第二是索引是存储在内存中的，索引的查找也是会设计到比较的，虽然是存储在内存中，一般还是建议索引的类型和长度越短越好，第三的思路是进行比较部分长度，其实在前12，13，14 位的区分度就比较高，可以建立 前缀索引

思路一
mysql> alter table SUser add index index(name(12));
mysql> alter table SUser add index index(name(13));
mysql> alter table SUser add index index(name(14));

思路二
拆分name
name_pre, suffix
name_pre 单独存储 15634508992959536
suffix 存储 .jpg
name 可以用 BIGINT 存储，存储空间为 8 字节
虽然很明显了减少了索引的存储空间，但是索引的检索过程中，还是会进行比较长检索

思路三
在思路一的基础上进行改造，由于时间戳的前面部分基本都是一样的，区分度是非常的低，也会导致检索浪费，
如果直接反转 name_pre，
15634508992959536->63595929980543651
15634510398049541->14594089301543651
索引前半部分的区分度变得非常的高，不过也增加业务操作成本，需要进行反转拼接

思路四
索取的区分度虽然提高了，但是字段的长度还是没有太大的降低，BITLOG 8字节 长度也是比较高的，还有一种情况，
类似于人脸机的文件存储格式为 FACE_1166914936_20190719T124010508_121.jpg，这样就比较的尴尬了，
后面了解到有一种 crc32 ，当然需要增加额外的存储空间，不过查询的方式相比反转会更稳定点，
应该冲突的概率比较低

检错能力极强，开销小，易于用编码器及检测电路实现。从其检错能力来看，它所不能发现的错误的几率仅为0.0047%以下。从性能上和开销上考虑，均远远优于奇偶校验及算术和校验等方式。因而，在数据存储和数据通讯领域，CRC无处不在：著名的通讯协议X.25的FCS（帧检错序列）采用的是CRC-CCITT，ARJ、LHA等压缩工具软件采用的是CRC32，磁盘驱动器的读写采用了CRC16，通用的图像存储格式GIF、TIFF等也都用CRC作为检错手段。

CRC32 百度百科

测试情况，由于样本不多，不过也能分析出大概情况

# 区分度并不高
mysql> select
    ->   count(*),
    ->   count(distinct left(name,8)) as L8,
    ->   count(distinct left(name,9)) as L9,
    ->   count(distinct left(name,10)) as L10,
    ->   count(distinct left(name,30)) as L30,
    ->   count(distinct left(name,35)) as L35
    -> from file_store_record;
+----------+----+----+-----+-----+-----+
| count(*) | L8 | L9 | L10 | L30 | L35 |
+----------+----+----+-----+-----+-----+
|       13 |  8 |  8 |   8 |  11 |  13 |
+----------+----+----+-----+-----+-----+
1 row in set (0.00 sec)

# 前缀索引
mysql> select
    ->   count(*),
    ->   count(distinct left(reverse(name),4)) as L4,
    ->   count(distinct left(reverse(name),5)) as L5,
    ->   count(distinct left(reverse(name),6)) as L6,
    ->   count(distinct left(reverse(name),8)) as L7
    -> from file_store_record;
+----------+----+----+----+----+
| count(*) | L4 | L5 | L6 | L7 |
+----------+----+----+----+----+
|       13 |  1 |  9 | 13 | 13 |
+----------+----+----+----+----+
1 row in set (0.00 sec)

# 前缀索引 倒序存储的方式
# 10 78 136	154
mysql> select
    ->   count(*),
    ->   count(distinct substr(reverse(name),5,1)) as L1,
    ->   count(distinct substr(reverse(name),5,2)) as L2,
    ->   count(distinct substr(reverse(name),5,3)) as L3,
    ->   count(distinct substr(reverse(name),5,4)) as L4
    -> from file_store_record;
+----------+----+----+----+----+
| count(*) | L1 | L2 | L3 | L4 |
+----------+----+----+----+----+
|       13 |  9 | 13 | 13 | 13 |
+----------+----+----+----+----+
1 row in set (0.00 sec)

# crc32
alter table file_store_record add name_crc int unsigned, add index(id_card_crc);
UPDATE file_store_record SET name_crc = crc32(name);
mysql> select
    ->     count(*),
    ->     count(distinct left(name_crc,1)) as L1,
    ->     count(distinct left(name_crc,2)) as L2,
    ->     count(distinct left(name_crc,3)) as L3,
    ->     count(distinct left(name_crc,4)) as L4,
    ->     count(distinct left(name_crc,5)) as L5
    ->     from file_store_record;
+----------+----+----+----+----+----+
| count(*) | L1 | L2 | L3 | L4 | L5 |
+----------+----+----+----+----+----+
|       13 |  4 | 12 | 13 | 13 | 13 |
+----------+----+----+----+----+----+
1 row in set (0.00 sec)