Using LLMs for Content Extraction and Formatting

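With large language models (LLMs) of every kind flourishing, a lot of basic grunt work can now be handed over to them entirely. Extracting information from text is a prime example: an LLM makes it remarkably easy. LLM API prices are also very low now, with vendors either handing out free tokens or heading toward free tiers, and running models locally with ollama lets you work completely offline.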

No more regular expressions or classic NLP pipelines: just throw an LLM at the problem and brute-force it.

A Structured Output Tool: Instructor

Instructor is a Pydantic-based tool that, combined with an LLM, stores the model's output in structured objects.

Example:

import instructor
from openai import OpenAI
from pydantic import BaseModel

# Define the target data structure
class UserInfo(BaseModel):
    name: str
    age: int

# Wrap the OpenAI client so that it accepts response_model
client = instructor.from_openai(OpenAI())

# Brute force: let the LLM do the extraction
user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "Extract the user name: 'John is 20 years old'"}],
)

So how does this seemingly magical feature actually work?

How Instructor Works

In practice, structured output still relies on the LLM itself: under the hood, the Instructor library calls each vendor's function calling (tools) feature.

The sample code above actually sends the following request to the LLM:

"""
{
    'args': (),
    'kwargs': {
        'messages': [
            {
                'role': 'user',
                'content': "Extract the user name: 'John is 20 years old'",
            }
        ],
        'model': 'gpt-3.5-turbo',
        'tools': [
            {
                'type': 'function',
                'function': {
                    'name': 'UserInfo',
                    'description': 'Correctly extracted `UserInfo` with all the required parameters with correct types',
                    'parameters': {
                        'properties': {
                            'name': {'title': 'Name', 'type': 'string'},
                            'age': {'title': 'Age', 'type': 'integer'},
                        },
                        'required': ['age', 'name'],
                        'type': 'object',
                    },
                },
            }
        ],
        'tool_choice': {'type': 'function', 'function': {'name': 'UserInfo'}},
    },
}
"""

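Once the data structure is defined with Pydantic, Instructor adds the custom type to tools and lets the LLM handle it together with the user's prompt.

The LLM can then return a correctly shaped JSON structure (though this still depends on the LLM's coding and reasoning ability).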
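The round trip described here can be imitated offline. What follows is a rough sketch and not Instructor's actual implementation: `build_tool` and `parse_tool_arguments` are hypothetical helper names, the description string is shortened, and the model's reply is a hard-coded string standing in for the tool-call arguments. It assumes Pydantic v2, and uses only `model_json_schema()` and `model_validate_json()`.

```python
from pydantic import BaseModel, ValidationError


class UserInfo(BaseModel):
    name: str
    age: int


def build_tool(model: type[BaseModel]) -> dict:
    """Hypothetical helper: wrap a Pydantic model's JSON schema
    in an OpenAI-style tool definition."""
    return {
        "type": "function",
        "function": {
            "name": model.__name__,
            "description": f"Correctly extracted `{model.__name__}`",
            "parameters": model.model_json_schema(),
        },
    }


def parse_tool_arguments(model: type[BaseModel], arguments: str) -> BaseModel:
    """Hypothetical helper: validate the JSON arguments string of a
    tool call against the model and return an instance."""
    return model.model_validate_json(arguments)


# The tool definition that would be sent alongside the prompt.
tool = build_tool(UserInfo)

# Stand-in for the arguments string a model might return (assumed, not real output).
simulated_arguments = '{"name": "John", "age": 20}'
user = parse_tool_arguments(UserInfo, simulated_arguments)

# A reply that does not fit the schema is rejected instead of silently accepted.
try:
    parse_tool_arguments(UserInfo, '{"name": "John", "age": "twenty"}')
    rejected = False
except ValidationError:
    rejected = True
```

The validation step is what makes the result trustworthy: if the model returns something that does not match the schema, the caller gets an error rather than a malformed object.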

Conclusion

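Structured output is an "extra" feature built on top of function calling. Originally only OpenAI offered it, and open-source alternatives had to hack the prompt to get the same effect, which is what I used to do. Times have changed, though: plenty of open-source models that run locally now support it. In particular, the entire qwen2.5 series does, and it comes in a range of parameter sizes to pick from as needed. The small models can run even on a CPU.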
