2026/3/8 17:23:45
Mechanism analysis
Key files and classes
File path: langchain_text_splitters/character.py
Class name: RecursiveCharacterTextSplitter
Core entry function: _split_text
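Parsing steps and source analysis

| Step | Description | Example / details |
|---|---|---|
| 1. Separator fallback | Try the separators in the order `separators = ["\n\n", "\n", " ", ""]` (the list can be customized). | Split into paragraphs on `\n\n` first; if any piece is longer than `chunk_size` characters, fall back to `\n` for that piece, and so on. |
| 2. Recursive splitting | Repeat step 1 on every over-long piece until all pieces are ≤ `chunk_size` or the separators are used up. | If a sentence-level piece is still too long, the empty string finally hard-splits it character by character. |
| 3. Merge | Join the "good pieces" in order into chunks that are as long as possible while staying ≤ `chunk_size`. | If adding one more piece would exceed `chunk_size`, close the chunk and start a new one. |
| 4. Overlap | After a chunk A1 has been merged, the algorithm steps back by `chunk_overlap` so that the next chunk A2 contains the tail of A1. Special case: if the last piece of A1 is longer than `chunk_overlap`, no overlap is forced, so that pieces stay semantically intact. | For two adjacent chunks A1 and A2: head of A2 = overlapping tail of A1. |

The overall logic of these steps lives in `_split_text`; the key points are annotated with comments:

```python
def _split_text(self, text: str, separators: list[str]) -> list[str]:
    """Split incoming text and return chunks."""
    final_chunks = []
    # Get appropriate separator to use
    # (pick the separator for the current level)
    separator = separators[-1]
    new_separators = []
    for i, _s in enumerate(separators):
        _separator = _s if self._is_separator_regex else re.escape(_s)
        if _s == "":
            separator = _s
            break
        # Quick check whether the separator occurs in the text;
        # if it does not, fall back to the next separator.
        if re.search(_separator, text):
            separator = _s
            new_separators = separators[i + 1 :]
            break

    _separator = separator if self._is_separator_regex else re.escape(separator)
    splits = _split_text_with_regex(
        text, _separator, keep_separator=self._keep_separator
    )

    # Now go merging things, recursively splitting longer texts.
    # A "good piece" is a piece whose length is less than chunk_size.
    _good_splits = []
    _separator = "" if self._keep_separator else separator
    for s in splits:
        if self._length_function(s) < self._chunk_size:
            _good_splits.append(s)
        else:
            # On hitting a piece that is not good, first merge the good
            # pieces accumulated so far.
            if _good_splits:
                merged_text = self._merge_splits(_good_splits, _separator)
                final_chunks.extend(merged_text)
                _good_splits = []
            if not new_separators:
                # This looks as if it could emit an over-long chunk, but the
                # last default separator is "" (character-level splitting),
                # so with the default separators this line is effectively
                # never reached.
                final_chunks.append(s)
            else:
                # Over-long piece: split it recursively with the
                # lower-level separators.
                other_info = self._split_text(s, new_separators)
                final_chunks.extend(other_info)
    if _good_splits:
        merged_text = self._merge_splits(_good_splits, _separator)
        final_chunks.extend(merged_text)
    return final_chunks
```

The merge and overlap logic lives in `_merge_splits`; again the key points are annotated with comments:

```python
def _merge_splits(self, splits: Iterable[str], separator: str) -> list[str]:
    # We now want to combine these smaller pieces into medium size
    # chunks to send to the LLM.
    separator_len = self._length_function(separator)

    docs = []
    current_doc: list[str] = []
    total = 0
    # `splits` holds the good pieces; join them into chunks that are as
    # large as possible.
    for d in splits:
        _len = self._length_function(d)
        # If adding this piece would exceed chunk_size, close the current
        # chunk and start a new one.
        if (
            total + _len + (separator_len if len(current_doc) > 0 else 0)
            > self._chunk_size
        ):
            if total > self._chunk_size:
                logger.warning(
                    f"Created a chunk of size {total}, "
                    f"which is longer than the specified {self._chunk_size}"
                )
            if len(current_doc) > 0:
                doc = self._join_docs(current_doc, separator)
                if doc is not None:
                    docs.append(doc)
                # Keep on popping if:
                # - we have a larger chunk than in the chunk overlap
                # - or if we still have any chunks and the length is long
                # This loop is the core of the overlap handling: pieces are
                # dropped from the front of current_doc (which holds all
                # pieces of one chunk) until at most chunk_overlap remains,
                # so the next chunk A2 starts with the tail of A1. If the
                # last piece of A1 is longer than chunk_overlap, everything
                # is popped and no overlap is produced.
                while total > self._chunk_overlap or (
                    total + _len + (separator_len if len(current_doc) > 0 else 0)
                    > self._chunk_size
                    and total > 0
                ):
                    total -= self._length_function(current_doc[0]) + (
                        separator_len if len(current_doc) > 1 else 0
                    )
                    current_doc = current_doc[1:]
        current_doc.append(d)
        total += _len + (separator_len if len(current_doc) > 1 else 0)
    doc = self._join_docs(current_doc, separator)
    if doc is not None:
        docs.append(doc)
    return docs
```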
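To see the four steps at work without installing anything, the logic can be condensed into a small standalone sketch. This is not the library code, only an approximation under stated assumptions: length is measured with `len`, separators are literal strings rather than regular expressions, separators are dropped when splitting and re-inserted when merging (roughly what happens when separators are not kept), and anything `_join_docs` does beyond joining is ignored. The names `merge_splits` and `recursive_split` are hypothetical and chosen only for illustration.

```python
def merge_splits(splits, separator, chunk_size, chunk_overlap):
    """Steps 3 and 4: merge good pieces into chunks, keeping an overlap."""
    sep_len = len(separator)
    docs = []
    current = []  # pieces of the chunk being built
    total = 0     # length of that chunk, separators included
    for piece in splits:
        piece_len = len(piece)
        if total + piece_len + (sep_len if current else 0) > chunk_size:
            if current:
                docs.append(separator.join(current))
                # Drop pieces from the front until the remainder fits in
                # chunk_overlap and leaves room for the incoming piece.
                while total > chunk_overlap or (
                    total + piece_len + (sep_len if current else 0) > chunk_size
                    and total > 0
                ):
                    total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                    current = current[1:]
        current.append(piece)
        total += piece_len + (sep_len if len(current) > 1 else 0)
    if current:
        docs.append(separator.join(current))
    return docs


def recursive_split(text, separators, chunk_size, chunk_overlap):
    """Steps 1 and 2: pick a separator, recurse on over-long pieces."""
    separator = separators[-1]
    remaining = []
    for i, sep in enumerate(separators):
        if sep == "":
            separator = sep
            break
        if sep in text:
            separator = sep
            remaining = separators[i + 1:]
            break
    # Assumption: the empty separator means a character-by-character split.
    if separator == "":
        pieces = list(text)
    else:
        pieces = [p for p in text.split(separator) if p]

    chunks, good = [], []
    for piece in pieces:
        if len(piece) < chunk_size:
            good.append(piece)
            continue
        if good:
            chunks.extend(merge_splits(good, separator, chunk_size, chunk_overlap))
            good = []
        if remaining:
            chunks.extend(recursive_split(piece, remaining, chunk_size, chunk_overlap))
        else:
            chunks.append(piece)
    if good:
        chunks.extend(merge_splits(good, separator, chunk_size, chunk_overlap))
    return chunks


# Overlap: "cc" closes the first chunk and opens the second.
print(merge_splits(["aaaa", "bb", "cc", "dddd"], " ", 10, 3))
# ['aaaa bb cc', 'cc dddd']

# Last piece of the first chunk (6 chars) exceeds chunk_overlap: no overlap.
print(merge_splits(["aa", "bbbbbb", "cc"], " ", 10, 3))
# ['aa bbbbbb', 'cc']

text = "aaaa bb cc dddd\n\neeeeeeeeeeee ff"
print(recursive_split(text, ["\n\n", "\n", " ", ""], 10, 3))
# ['aaaa bb cc', 'cc dddd', 'eeeeeeeeee', 'eeeee', 'ff']
```

In the last call both paragraphs are too long, so each is re-split on spaces; the 12-character word has no usable separator left except the empty string and is therefore cut character by character, with three characters carried over as overlap.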