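Easy Python Project Packaging with Hatchling

This article shows how to package a Python project with hatchling.

In day-to-day work, packaging Python code as a third-party module and uploading it to a hosting site is not a frequent need, but knowing how to package Python code is still an essential skill.

Python packaging means turning our own Python code into an installable third-party module, either for later reuse or for open-source sharing, just like commonly used third-party modules such as requests and numpy. Common Python packaging tools include: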

setuptools
hatchling
Flit
PDM

These are also the tools offered in the official Python packaging tutorial. This article covers one of them, hatchling.


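Hatchling is the build backend of the Hatch project: it reads the configuration in pyproject.toml and builds the source distribution and the wheel. Hatch itself is the project manager that adds environment isolation and publishing on top of it. Used together, Hatch and Hatchling cover configuring, versioning, declaring dependencies for, and publishing Python packages to PyPI. The main features are:

Project configuration: configure the project through the pyproject.toml file
Versioning: manage the project's version
Dependencies: declare the dependencies the project requires
Publishing to PyPI: publish the project to the Python Package Index (PyPI)

Hatchling's plugin system lets users extend its functionality.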

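Project overview

This article uses a small project as an example of how to package Python code with hatchling. The project uses the tiktoken module to count the tokens in a text or in a list of texts.

The overall project layout is as follows: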

.
├── README.md
├── dist
│   ├── token_counter-0.0.1-py3-none-any.whl
│   └── token_counter-0.0.1.tar.gz
├── pyproject.toml
├── src
│   └── token_counter
│       ├── __init__.py
│       └── token_count.py
└── tests
    ├── __init__.py
    ├── package_test.py
    └── test_token_count.py

The functionality lives in the token_count.py script of the src/token_counter module. The code is as follows:

# -*- coding: utf-8 -*-
# @place: Pudong, Shanghai
# @file: token_count.py
# @time: 2024/1/22 17:45
import tiktoken

from typing import List, Union

class TokenCounter(object):
    def __init__(self, model="gpt-3.5-turbo"):
        """
        :param model: name of model, type: string
        """
        self.model = model

    def count(self, _input: Union[List[str], str]) -> Union[List[int], int]:
        """
        :param _input: user input, type: str or List[str]
        :return: Return the number of tokens used by text, type int or List[int]
        """
        try:
            encoding = tiktoken.encoding_for_model(self.model)
        except KeyError:
            print("Warning: model not found. Using cl100k_base encoding.")
            encoding = tiktoken.get_encoding("cl100k_base")

        if isinstance(_input, list):
            token_count_list = []
            for text in _input:
                token_count_list.append(len(encoding.encode(text)))
            return token_count_list
        elif isinstance(_input, str):
            return len(encoding.encode(_input))
        else:
            raise NotImplementedError(f"not support data type for {type(_input)}, please use str or List[str].")

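The script uses the tiktoken module to count the tokens of an input text or list of texts: a str input returns a single int, a list input returns one count per element, and any other type raises NotImplementedError.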
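The str / list dispatch described above can be tried without installing tiktoken or downloading an encoding. The following is a minimal sketch, not the project's code: `count_tokens` and `whitespace_encode` are hypothetical names introduced here, and the whitespace splitter is only a stand-in for `encoding.encode`, so its counts differ from real tiktoken counts.

```python
from typing import Callable, List, Union


def count_tokens(
    encode: Callable[[str], list], _input: Union[List[str], str]
) -> Union[List[int], int]:
    """Mirror the str / list dispatch of TokenCounter.count with a pluggable encoder."""
    if isinstance(_input, list):
        # One count per element, in the same order as the input list.
        return [len(encode(text)) for text in _input]
    if isinstance(_input, str):
        return len(encode(_input))
    raise NotImplementedError(
        f"not support data type for {type(_input)}, please use str or List[str]."
    )


def whitespace_encode(text: str) -> list:
    # Hypothetical stand-in for encoding.encode: splits on whitespace only.
    # Real tiktoken encodings use byte-pair encoding and give different counts.
    return text.split()


print(count_tokens(whitespace_encode, "who are you?"))             # 3 whitespace pieces
print(count_tokens(whitespace_encode, ["who are you?", "hello"]))  # [3, 1]
```

Passing the encoder in as an argument is a design choice of this sketch only; the original class resolves the encoding from the model name inside `count`.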

Under the tests module, the code is tested with unit tests (the unittest module). The code (tests/test_token_count.py) is as follows:

# -*- coding: utf-8 -*-
# @place: Pudong, Shanghai
# @file: test_token_count.py
# @time: 2024/1/22 17:53
import unittest

from src.token_counter.token_count import TokenCounter

class TestTokenCounter(unittest.TestCase):
    def setUp(self):
        self.token_cnt = TokenCounter()

    def test_case1(self):
        text = "who are you?"
        tokens_cnt = self.token_cnt.count(_input=text)
        self.assertEqual(tokens_cnt, 4)

    def test_case2(self):
        texts = ["who are you?", "How's it going on?"]
        tokens_cnt = self.token_cnt.count(_input=texts)
        self.assertEqual(tokens_cnt, [4, 6])

    def test_case3(self):
        with self.assertRaises(NotImplementedError) as cm:
            self.token_cnt.count(_input=23)
        the_exception = cm.exception
        self.assertEqual(str(the_exception), "not support data type for <class 'int'>, please use str or List[str].")

if __name__ == '__main__':
    suite = unittest.TestSuite()
    suite.addTest(TestTokenCounter('test_case1'))
    suite.addTest(TestTokenCounter('test_case2'))
    suite.addTest(TestTokenCounter('test_case3'))
    run = unittest.TextTestRunner()
    run.run(suite)

Unit tests have no effect on packaging, but they are included here to keep the project complete.
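As an alternative to adding each test method to the suite by hand, the standard unittest loader can collect every `test_*` method of a test case. The sketch below assumes a hypothetical `TestExample` class standing in for `TestTokenCounter`, so that it runs without tiktoken; `run_all` is also a name introduced here.

```python
import unittest


class TestExample(unittest.TestCase):
    """Hypothetical stand-in for TestTokenCounter, so the sketch runs on its own."""

    def test_upper(self):
        self.assertEqual("token".upper(), "TOKEN")

    def test_length(self):
        self.assertEqual(len(["a", "b"]), 2)


def run_all(case: type) -> unittest.TestResult:
    # loadTestsFromTestCase collects every test_* method defined on the class.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_all(TestExample)
print(result.testsRun, result.wasSuccessful())
```

With this approach a newly added test method is picked up automatically, without another `suite.addTest` line.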

Packaging the project

Packaging also requires a configuration file, pyproject.toml, with the following contents:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "token_counter"
version = "0.0.1"
dependencies = [
  "tiktoken >= 0.5.0",
]
authors = [
  { name="jclian", email="[email protected]" },
]
description = "A package for token count using tiktoken"
readme = "README.md"
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

[project.urls]
Homepage = "https://github.com/percent4/package_python_project"

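This configuration file declares hatchling as the build backend, along with the usual packaging settings such as the project name, version number, and dependencies. The settings are self-explanatory, so they are not covered in more detail here.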

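Next comes the build itself. Install the build tooling and then run the build with the following commands:

python3 -m pip install --upgrade build
python3 -m pip install hatchling
python3 -m build

Running python3 -m build builds the project, after which two files appear in the dist directory: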

token_counter-0.0.1.tar.gz
token_counter-0.0.1-py3-none-any.whl
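The wheel's file name encodes its compatibility: in `token_counter-0.0.1-py3-none-any.whl`, `py3` is the Python tag, `none` the ABI tag, and `any` the platform tag, which is what one would expect for a pure-Python package. The following sketch splits such a name into its parts. `WheelName` and `parse_wheel_filename` here are hand-written, hypothetical helpers for illustration; the third-party `packaging` library provides a more robust `packaging.utils.parse_wheel_filename`.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]  # optional build tag, usually absent
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into the fields of the wheel naming scheme."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    # Hyphens inside the distribution name are replaced by underscores
    # when the wheel is built, so splitting on "-" is safe here.
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


info = parse_wheel_filename("token_counter-0.0.1-py3-none-any.whl")
print(info.distribution, info.version)
print(info.python_tag, info.abi_tag, info.platform_tag)
```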

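The .tar.gz and .whl files are exactly the kind of distribution files used when installing third-party Python modules. With them built, the remaining step is to upload them to a hosting site such as PyPI.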

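The twine tool is generally used to upload the distribution files.

Twine is a utility for interacting with PyPI, used to publish Python packages. It uploads already-built distributions, which simplifies the process and removes the need to run setup.py. To publish your own package on PyPI, install Twine, prepare the distribution files, and upload them with the twine upload command.

Uploading is not covered here and is left for a later article.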

Experiment

After building the project, install the token_counter module locally with the following command:

pip3 install dist/token_counter-0.0.1-py3-none-any.whl
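Before importing the module, one quick way to confirm that pip registered the distribution is to query its metadata. This is a small sketch using the standard `importlib.metadata` module; `installed_version` is a hypothetical helper name, and it returns None instead of raising when the distribution is missing.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the wheel above was installed, otherwise None.
print(installed_version("token_counter"))
```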

Once the installation finishes, test the module with the following example code:

# -*- coding: utf-8 -*-
# @place: Pudong, Shanghai
# @file: package_test.py
# @time: 2024/1/22 21:08
from token_counter.token_count import TokenCounter

text = "who are you?"
print(TokenCounter().count(_input=text))

Output: