Python AIOHTTP Basics

Posted on 2024-03-29 in tech, Python

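Preface

I have recently been writing a crawler in Python to download manga (see my earlier posts about LANraragi). Each manga typically has its own web page.

All the data on that page can be extracted with BeautifulSoup, and fetching the page takes a single request. Downloading images is different: every image needs its own request, and the response bodies tend to be large. Just as nobody cooks steamed buns one at a time, there is no need to download images one after another; doing so wastes far too much time.

AIOHTTP is an asynchronous HTTP client/server module, which makes it a good fit for this scenario. Asynchronous programming lets you squeeze the most out of the network (and the remote server). This post focuses on basic usage and examples of AIOHTTP and does not explain the underlying principles in detail.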

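How it works

AIOHTTP is an HTTP library built on Python's coroutine mechanism. HTTP needs no introduction. Python coroutines are built on the asyncio module, use the async and await keywords, and are meant to make IO-bound tasks more efficient.

Using coroutines means dealing with concurrency (synchronizing tasks and guarding shared data with mutual exclusion), so keep this in mind when writing code.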
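As a minimal sketch of the mutual-exclusion point, the snippet below protects a shared counter with asyncio.Lock. The names add_one and count_to are made up for illustration, and asyncio.sleep(0) stands in for a real IO call that would give other coroutines a chance to run in the middle of the update.

```python
import asyncio

async def add_one(counter, lock):
    # Hypothetical helper: the lock keeps the read-pause-write sequence
    # from interleaving with other coroutines.
    async with lock:
        current = counter["value"]
        await asyncio.sleep(0)  # yield to the event loop, as a real IO call would
        counter["value"] = current + 1

async def count_to(n):
    counter = {"value": 0}
    lock = asyncio.Lock()
    # Run n increments concurrently; the lock serializes the critical section.
    await asyncio.gather(*(add_one(counter, lock) for _ in range(n)))
    return counter["value"]

if __name__ == "__main__":
    print(asyncio.run(count_to(100)))
```

Without the lock, several coroutines could read the same old value before any of them writes, and updates would be lost.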

Example 1: hello world

Copied directly from the official site:

import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://python.org') as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])

            html = await response.text()
            print("Body:", html[:15], "...")

asyncio.run(main())

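The aiohttp.ClientSession object is responsible for sending all requests.
If a function body uses await, async with, or async for, the function must be asynchronous, that is, defined with async def.
To call an asynchronous function:
- from synchronous code, use asyncio.run(async_function(args));
- from asynchronous code, use await async_function(args), as in html = await response.text() above.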
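The two calling rules can be tried out without any network access. The sketch below uses made-up functions (fetch_number, caller, sync_entry) and asyncio.sleep as a stand-in for IO. It also shows that calling an async function without await only creates a coroutine object and does not run the body.

```python
import asyncio

async def fetch_number(n):
    # Hypothetical async function; the sleep stands in for network IO.
    await asyncio.sleep(0.01)
    return n * 2

async def caller():
    # Asynchronous context: use await.
    return await fetch_number(21)

def sync_entry():
    # Synchronous context: asyncio.run starts an event loop and runs the coroutine.
    return asyncio.run(caller())

def call_without_await():
    # Calling an async function only creates a coroutine object; nothing runs yet.
    coro = fetch_number(1)
    is_coroutine = asyncio.iscoroutine(coro)
    coro.close()  # close it explicitly to avoid a "never awaited" warning
    return is_coroutine

if __name__ == "__main__":
    print(sync_entry())
    print(call_without_await())
```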

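Example 2: `ClientSession`

aiohttp.ClientSession has its own connection pool. Usually a program creates a single instance and uses it throughout, so that connections are reused.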

import aiohttp
import asyncio

async def get(url, session, headers=None, cookies=None):
    # headers and cookies are passed the same way as with requests.get();
    # None defaults avoid sharing one mutable dict across calls
    async with session.get(url, headers=headers, cookies=cookies) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        text = await get("http://python.org", session)
        print(text[:50])
        text = await get("https://docs.aiohttp.org/en/stable/", session)
        print(text[:50])

asyncio.run(main())

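Example 3: Downloading files

To download large files such as photos or videos, read the response body in chunks instead of loading it all into memory at once: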

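import aiohttp

async def download_image(session: aiohttp.ClientSession, uri: str, file_path: str):
    async with session.get(uri) as response:
        # Fail early on 4xx/5xx instead of saving an error page as an image
        response.raise_for_status()
        with open(file_path, 'wb') as file:
            # Read the body in 8 KiB chunks so it is never held in memory whole
            while True:
                chunk = await response.content.read(8192)
                if not chunk:
                    break
                file.write(chunk)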
Example 4: Running multiple tasks

This is a feature of asyncio, not of AIOHTTP.

Suppose we want to download several images. We could write:

async def main():
    site = 'https://gustaavv.github.io/MarkDownImages/'
    images = ['image-20230722142231405.png', 'image-20230722142307844.png']
    async with aiohttp.ClientSession() as session:
        for im in images:
            await download_image(session, site + im, im)

asyncio.run(main())

However, the downloads still run one after another, not concurrently.

The right approach is asyncio.gather, which runs a series of tasks concurrently.

async def main():
    site = 'https://gustaavv.github.io/MarkDownImages/'
    images = ['image-20230722142231405.png', 'image-20230722142307844.png']
    async with aiohttp.ClientSession() as session:
        tasks = [download_image(session, site + im, im) for im in images]
        result_list = await asyncio.gather(*tasks)

asyncio.run(main())

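gather() takes awaitables as arguments, here the coroutine objects returned by calling async functions, as in gather(f1(), f2()). The call gather(*[f1(), f2()]) unpacks the list and is equivalent to gather(f1(), f2()).

Because of the await, the program moves on to the next statement only after every download has finished. result_list holds the return values in the same order as tasks; here each one is None, since download_image returns nothing.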
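To see the difference between the two versions of main() without touching the network, the sketch below replaces the real download with a hypothetical fake_download that just sleeps. It runs the same jobs once in a loop and once through asyncio.gather, and returns both result lists with their elapsed times.

```python
import asyncio
import time

async def fake_download(name, delay):
    # Hypothetical stand-in for download_image: sleeps instead of doing IO.
    await asyncio.sleep(delay)
    return name

async def sequential(jobs):
    # Each await finishes before the next download starts.
    return [await fake_download(name, delay) for name, delay in jobs]

async def concurrent(jobs):
    # All coroutines are scheduled together; results keep the input order.
    return await asyncio.gather(*(fake_download(n, d) for n, d in jobs))

async def compare():
    jobs = [("a.png", 0.2), ("b.png", 0.1), ("c.png", 0.2)]
    t0 = time.perf_counter()
    seq = await sequential(jobs)
    t1 = time.perf_counter()
    con = await concurrent(jobs)
    t2 = time.perf_counter()
    return seq, con, t1 - t0, t2 - t1

if __name__ == "__main__":
    seq, con, seq_time, con_time = asyncio.run(compare())
    print(seq, f"{seq_time:.2f}s")
    print(con, f"{con_time:.2f}s")
```

The sequential run takes roughly the sum of the delays, while the gather run takes roughly the longest single delay. Note that the second job finishes first in the concurrent run, yet it still appears second in the result list.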

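Summary

In my crawler project, AIOHTTP plays a fairly small role: it only implements concurrent image downloads. Most of the work goes into parsing HTML.

The project uses the template method design pattern: the parent class implements concurrent downloading with AIOHTTP, and a subclass only has to implement the abstract method parse_html(). AIOHTTP was only needed while building the framework and does not come up again afterwards. Since I may need it again in the future, I wrote this post to record the basic usage of AIOHTTP and asyncio.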
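The template method structure might look roughly like the sketch below. Only parse_html() comes from the description above; BaseCrawler, LineCrawler, fetch_image, and crawl are hypothetical names, and fetch_image is a placeholder where the real project would make the AIOHTTP request.

```python
import asyncio
from abc import ABC, abstractmethod

class BaseCrawler(ABC):
    """Hypothetical parent class: owns the concurrency, delegates the parsing."""

    @abstractmethod
    def parse_html(self, html: str) -> list:
        """Return the image URLs found in html (site-specific)."""

    async def fetch_image(self, url: str) -> bytes:
        # Placeholder for the real AIOHTTP download; returns fake bytes.
        await asyncio.sleep(0)
        return url.encode()

    async def crawl(self, html: str) -> list:
        # Template method: parse first, then download everything concurrently.
        urls = self.parse_html(html)
        return await asyncio.gather(*(self.fetch_image(u) for u in urls))

class LineCrawler(BaseCrawler):
    """Toy subclass: treats every non-empty line as an image URL."""

    def parse_html(self, html):
        return [line.strip() for line in html.splitlines() if line.strip()]

if __name__ == "__main__":
    print(asyncio.run(LineCrawler().crawl("a.png\n b.png\n")))
```

Supporting a new site then means writing one more subclass with its own parse_html(), without touching the download code.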
