2025/12/26 12:18:39
网站建设
项目流程
网页制作培训要多少钱,网站seo策划,网站域名最便宜,网站建设dw站点建设一、前言: token是什么
LLM只做一个事情#xff0c;就是吃掉token吐出token#xff0c;token是LLM#xff08;大语言模型#xff09;的基本元素。token与LLM的关系#xff0c;相当于乐高积木与乐高工厂#xff0c;我的世界方块与我的世界游戏。那么token到底是什么呢就是吃掉token吐出tokentoken是LLM大语言模型的基本元素。token与LLM的关系相当于乐高积木与乐高工厂我的世界方块与我的世界游戏。那么token到底是什么呢有人翻译成令牌有人翻译成词源。我们不妨换个概念理解token就是最小操作、最小信息单元的意思。这个最小是相对于LLM要处理的原始文本来说的。举个栗子当一个句子文本输入到电脑中天然就就具有字符级别的切分。如果不打算继续拆分或组合我们可以通过一个映射关系将现有这些字符转换为整数数组称为编码过程。编码后数组内的元素就是token元素取值就等于token取值。LLM可以吃掉这个token数组并吐出新数组。对这个新数组按前前述的映射进行逆转换称为解码过程。解码后我们就能得到人类可以理解的文本了。/* by 01022.hk - online tools website : 01022.hk/zh/calorie.html */ // 原句子 我有一个 apple. // 句子拆分 [我,有,一,个, ,a,p,p,l,e,.,\0] // 编码为整数数组 [1,2,3,4,5,6,7,7,8,9,10,11]从实际应用看主流LLM几乎不用纯字符级级别切分而是为了更好效果使用BPE/WordPiece/SentencePiece等子词sub-word算法。此时hello大概率是1个或2个token而不是5个。对于中文来说我有一个 可能切成了 我/有/一/个也可能是我有/一个取决于词表。在字词算法中单个token拎出来会存在不可解释性因为是打散的词根。但是无论怎么处理LLM传入传出的都是一个整数数组数组元素的数量就是token数量也是LLM服务的计费标准。再从实际应用看主流LLM几乎都采用BPE或BBPE方式进行Tokenizer。我们接下来继续了解BPE。二、BPE(字节对编码)字节对编码 是一种简单的数据压缩形式这种方法用数据中不存的一个字节表示最常出现的连续字节数据。这样的替换需要重建全部原始数据。编码过程如下/* by 01022.hk - online tools website : 01022.hk/zh/calorie.html */ // wiki的BPE案例 aaabdaaabac: aaZ //“aa”出现次数最多用中没有出现的“Z”替换 ZabdZabac: aaZ, ZaY //同上更新替换表 YbdYbac: aaZ, ZaY, YbX //同上更新替换表 XdXac:aaZ, ZaY, YbX // 无可用替换我们将aaabdaaabac通过BPE方式编码成了XdXac。解码时只需要对附带的 替换表(aaZ, ZaY, YbX)按顺序逆向操作就能得到原信息。BPE 用“比字符大、比单词小”的子词当积木之所以能流行主要是因为其编码后的token数量适中处于单字符切分全词切分之间。相对与全词切分BPE是子词切分不仅可以控制上限避免词库膨胀还能最小可退到字节/字符最大可保留整词粒度随频率动态伸缩。就算预见新的词组也无所谓不存在未登录词的问题。而且一套算法与英语、阿拉伯语语言无关都是一套处理方式。还具有词表可读性好在一定效果下计算成本低等特点。三、BPE Tokenizer一个BPE Tokenizer主要功能可分为1.训练处理得到词表2.编解码。词表的训练上面已经做了示意接下来我们主要针对编解码部分。训练好的BPE的数据主要包括三个部分vocab.json符号 → id 的字典merges.txt按合并顺序排列的“信息对”tokenizer_config.json预处理规则(regex文本)、特殊标记。另外常见的还有tokenizer.json文件他是Hugging Face 生态把“原本分散的三份文件”压进一个JSON文件。典型的结构如下在不同版本中merges可能会有字符串和数组两种对象存储方式解析时候需要注意// cl100k_base { version: 1.0, truncation: null, padding: null, added_tokens: [ //特殊token { id: 100257, content: |endoftext|, single_word: false, lstrip: false, rstrip: false, normalized: false, special: true } ], normalizer: null, pre_tokenizer: { // 有的有有的没有因此regex需要预先硬编码 type: Sequence, pretokenizers: [ { type: Split, // 预处理分割“防呆尺” pattern: { Regex: (?i:s|t|re|ve|m|ll|d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}][\\r\\n]*|\\s*[\\r\\n]|\\s(?!\\S)|\\s }, behavior: Removed, //一般默认写死命中正则的片段保留没命中的扔掉与invert 配合。 invert: true //一般默认写死把“命中/没命中”反转——最终只保留上面正则抓到的那些片段其余全部丢弃。 }, { type: ByteLevel, add_prefix_space: false, trim_offsets: true, use_regex: false } ] }, post_processor: null, decoder: { type: ByteLevel, add_prefix_space: true, trim_offsets: true, use_regex: true }, model: { type: BPE, dropout: null, unk_token: null, continuing_subword_prefix: , end_of_word_suffix: , fuse_unk: false, byte_fallback: false, vocab: { !: 0, |endofprompt|: 100276 }, merges: [ ĠCon veyor // 或 [Ġp,ain] ] } }通过读取预先的数据BPE Tokenizer就可以用了其核心的功能就是编码和解码即Encode和Decode。四、Tokenizer的C#实现在python中可以直接用HuggingFace的AutoTokenizer载入本地权重。在C#中我们可以拉取SharpToken (2.0.4)和 TiktokenSharp(1.2.0)计算Token。但是如果我们要自己在C# 开发LLM尽管很少有人这么干一个好的Tokenizer就很重要了需要更多自定义的功能如支持huggingFac的tokenizer.json数据并灵活的处理special token充分优化。于是就有了LumTokenizer这个项目。主要功能实现如下:读取tokenizer.json数据如果没有regex内置了3种pretoken的regexRegex50KBase≈GPT-2 的 5 万级别基础词表RegexCl100KBase≈OpenAI CLIP / GPT-3.5 / GPT-4 使用的 10 万级别 CL-100K 词表RegexO200KBase≈Meta LLaMA、Mistral 等开源模型偏好的 20 万级别 O-200K 词表高效的特殊token切分如果是模型训练用tokenizer需要单独高效处理特殊token。因为特殊token的目的是正文出现越少越好因此一般不会出现在词表中需要通过单独切分的机制进行识别和切分。高效的缓存机制LumTokenizer 在分词阶段订制了一套SpanDictionary, 为了实现高效的切片搜索也就是说一个stirng可以基于NET的Span特性切成多个Slice而SpanDictionary可以直接基于Span 执行Key的匹配Span无法作为传统Dictionary的泛型极大节省了子串string转换的开销。Benchmark测试如下在含有中文这种多字节字符的长文500字符左右处理时具有很好的性能。MethodtextMeanErrorStdDevRatioRatioSDGen0AllocatedAlloc RatioSharpToken_cl100k_baseChinese122.99 us2.314 us2.273 us5.710.120.73249.1 KB1.19TiktokenSharp_cl100k_baseChinese96.00 us1.829 us2.106 us4.450.110.48836.34 KB0.83LumTokenizer_cl100k_baseChinese21.56 us0.268 us0.251 us1.000.020.61047.63 KB1.00SharpToken_cl100k_baseEnglish26.77 us0.520 us0.639 us1.020.030.67148.38 KB0.74TiktokenSharp_cl100k_baseEnglish20.21 us0.383 us0.376 us0.770.020.42725.51 KB0.49LumTokenizer_cl100k_baseEnglish26.13 us0.495 us0.509 us1.000.030.915511.31 KB1.00SharpToken_cl100k_baseMixed90.97 us1.580 us1.478 us3.780.090.854510.9 KB1.23TiktokenSharp_cl100k_baseMixed63.85 us1.274 us1.564 us2.650.080.48836.74 KB0.76LumTokenizer_cl100k_baseMixed24.08 us0.465 us0.435 us1.000.030.70198.83 KB1.00具体可以去仓库看详细代码。[MemoryDiagnoser] public class CompareBenchmark { internal GptEncoding _sharpToken; internal TikToken _tikToken; internal BPETokenizer _tokenizer1; internal BPETokenizer _tokenizer2; [GlobalSetup] public void Setup() { _sharpToken GptEncoding.GetEncoding(cl100k_base); _tikToken TikToken.GetEncodingAsync(cl100k_base).ConfigureAwait(false).GetAwaiter().GetResult(); _tokenizer1 BPETokenizer.CreateTokenizer( D:\Data\Personal\AI\llm\tokenizer\cl100k.txt, true, RegexType.RegexCl100KBase); _tokenizer2 BPETokenizer.CreateTokenizer( D:\Data\Personal\AI\llm\tokenizer\qw_tokenizer.json, false, RegexType.RegexCl100KBase); } // 1. 声明参数源 public IEnumerablestring TextSamples() { yield return TextCatalog.English; yield return TextCatalog.Chinese; yield return TextCatalog.Mixed; } // 2. 每个方法改成带参数 [Benchmark] [ArgumentsSource(nameof(TextSamples))] public int SharpToken_cl100k_base(string text) { var encoded _sharpToken.Encode(text); var decoded _sharpToken.Decode(encoded); return encoded.Count; } [Benchmark] [ArgumentsSource(nameof(TextSamples))] public int TiktokenSharp_cl100k_base(string text) { var encoded _tikToken.Encode(text); var decoded _tikToken.Decode(encoded); return encoded.Count; } [Benchmark(Baseline true)] [ArgumentsSource(nameof(TextSamples))] public int LumTokenizer_cl100k_base(string text) { var encoded _tokenizer1.Encode(text, false); var decoded _tokenizer1.Decode(encoded, false); return encoded.Count; } public int LumTokenizer_qwen150k(string text) { var encoded _tokenizer2.Encode(text, false); var decoded _tokenizer2.Decode(encoded, false); return encoded.Count; } }五、单元测试现在单元测试可以说是越来越重要了因为只有具有了完善的单元测试才能放心的让ai去优化修改已有代码。本文这个BPE Tokenizer项目单元测试分了5类。P0_BasicTest基础测试测试编解码数据读取词表完善性等主要功能P1_RobustnessTests鲁棒性测试针对边缘条件如仅空字符、仅特殊字符、超长文本、越界id等情况P2_VocabBpeTests编解码准确性要求正确的对原文进行分割并准确编码通过几种特定情况下的案例进行兜底。P3_ChineseSubwordTests中文字符测试其中也包含了token压缩率的检验。主要是考虑在代码编写过程中可能导致部分尾字节或特殊混编情况下不能准确字节合并的bug。P4_EnglishSubwordTests英文字符测试目的同上部分bug出现时尽管decode正常但encode编码也可能未达到预期忽略了某些合并环节导致压缩率过高。编解码准确度与常用库比较LumTokenizer_cl100k_base 34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29 King Lear, one of Shakespeares darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.|im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end| SharpToken_cl100k_base 34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29 King Lear, one of Shakespeares darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.|im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end| TikTokenr_cl100k_base 34655,61078,11,832,315,42482,596,77069,323,1455,73135,11335,11,10975,279,3446,315,279,46337,323,12280,12970,61078,11,889,65928,813,26135,11,439,568,1587,813,3611,38705,11,4184,311,52671,323,74571,13,61078,753,8060,439,264,7126,2995,360,3933,5678,323,813,1917,304,63355,323,31926,16134,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,1822,91,318,5011,91,29,15339,220,220,57668,53901,27,91,318,6345,91,29 King Lear, one of Shakespeares darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.|im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end| LumTokenizer_qwen150k 33555,59978,11,825,315,41382,594,75969,323,1429,72035,11088,11,10742,279,3364,315,279,45237,323,12011,12681,59978,11,879,64828,806,25079,11,438,566,1558,806,3527,37605,11,4092,311,51571,323,73471,13,59978,748,7901,438,264,6981,2922,360,3848,5561,323,806,1879,304,62255,323,30826,13,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645,151644,14990,220,220,108386,151645 King Lear, one of Shakespeares darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.|im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end||im_start|hello 你好|im_end|六、最后LumTokenizer这个项目现在版本是1.0.6.1整体效果较好很快速稳定现在自己训练模型就在用它尽管目前某些常用习惯写死了但大家需要的可自行适配和扩展。MiniGPT和MiniMind都是很好的LLM学习入门python项目但C#基本没有。Tokenier是C#开发LLM的重要环节奈何.Net生态还是差很多资料也少现在AI生成的内容都千篇一律很多现有库更新的又很慢。真要用C#来干LLM真是难上加难估计也没人这么干。如果您觉得有收获的话请多多支持本系列。再次感谢您的阅读本案例及更加完整丰富的机器学习模型案例的代码已全部开源新朋友们可以关注公众号回复Tokenizer查看仓库地址获取全部完整代码实现。