Speeding Up Inference with Batch Inference

This document shows how to implement two different model inference approaches with the DaoAI Deep Learning SDK and compares their performance. The goal is to demonstrate the performance gain of batch inference over the traditional approach of multi-threaded, single-image inference.

Code Structure

The code consists of two core functions:

1. normal_inference():

This function uses the traditional multi-threaded approach to run inference one image at a time. For each batch it creates one thread per image, 16 threads per batch, and processes the images in parallel.

By measuring the processing time of each batch (excluding the first and last batches), it computes the average inference time per image.

Example code:

// Standard headers used below (the DaoAI Deep Learning SDK header for
// your installation must also be included):
#include <algorithm>
#include <chrono>
#include <filesystem>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

int normal_inference()
{
    try
    {
        DaoAI::DeepLearning::initialize();

        std::string root = "C:/Users/daoai/Downloads/";
        std::string model_zip_path = root + "dami_fast.dwm";
        std::string images_path = root + "dami_data/";
        std::string out_path = images_path + "out/";

        if (!std::filesystem::exists(out_path))
        {
            std::filesystem::create_directory(out_path);
        }

        std::vector<std::string> image_files;
        for (const auto& entry : std::filesystem::directory_iterator(images_path))
        {
            if (entry.is_regular_file())
            {
                auto ext = entry.path().extension().string();
                std::transform(ext.begin(), ext.end(), ext.begin(), ::tolower);
                if (ext == ".png" || ext == ".jpg" || ext == ".jpeg")
                {
                    image_files.push_back(entry.path().string());
                }
            }
        }

        DaoAI::DeepLearning::Vision::Classification model(model_zip_path, DaoAI::DeepLearning::DeviceType::GPU);
        const size_t batch_size = 16;
        size_t num_batches = (image_files.size() + batch_size - 1) / batch_size;

        double total_inference_time_excluding = 0.0;
        int total_images_excluding = 0;

        for (size_t batch_idx = 0; batch_idx < num_batches; ++batch_idx)
        {
            auto batch_start_time = std::chrono::high_resolution_clock::now();

            std::vector<std::thread> threads;
            size_t start_index = batch_idx * batch_size;
            size_t end_index = std::min(start_index + batch_size, image_files.size());

            for (size_t i = start_index; i < end_index; ++i)
            {
                threads.emplace_back([&, i]() {
                    try
                    {
                        DaoAI::DeepLearning::Image daoai_image(image_files[i]);
                        auto prediction = model.inference(daoai_image);
                    }
                    catch (const std::exception& e)
                    {
                        std::cerr << "Error processing image " << image_files[i] << ": " << e.what() << "\n";
                    }
                });
            }

            for (auto& t : threads)
            {
                if (t.joinable())
                    t.join();
            }

            auto batch_end_time = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> batch_duration = batch_end_time - batch_start_time;
            std::cout << "Batch " << (batch_idx + 1) << " took " << batch_duration.count() << " ms" << std::endl;

            if (batch_idx != 0 && batch_idx != num_batches - 1)
            {
                total_inference_time_excluding += batch_duration.count();
                total_images_excluding += (end_index - start_index);
            }
        }

        if (total_images_excluding > 0)
        {
            double avg_time_per_image = total_inference_time_excluding / total_images_excluding;
            std::cout << "Average inference time per image (excluding first and last batch): "
                << avg_time_per_image << " ms" << std::endl;
        }
    }
    catch (const std::exception& e)
    {
        std::cerr << "Exception occurred: " << e.what() << "\n";
        return 1;
    }
    return 0;
}

2. batch_inference():

This function demonstrates the SDK's built-in batch inference feature. Instead of creating threads manually, batch inference lets the model process multiple images in a single inference call.

Using a batch size of 16 as an example, the function measures the total inference time for the whole image set and the average inference time per image.

Example code:

int batch_inference()
{
    try
    {
        DaoAI::DeepLearning::initialize();

        std::string root = "C:/Users/daoai/Downloads/";
        std::string model_zip_path = root + "dami_fast.dwm";
        std::string images_path = root + "dami_data/";

        DaoAI::DeepLearning::Vision::Classification model(model_zip_path, DaoAI::DeepLearning::DeviceType::GPU);
        model.setBatchSize(16);

        std::vector<DaoAI::DeepLearning::Image> images;
        for (const auto& entry : std::filesystem::directory_iterator(images_path))
        {
            if (entry.is_regular_file())
            {
                auto ext = entry.path().extension().string();
                std::transform(ext.begin(), ext.end(), ext.begin(), ::tolower);
                if (ext == ".png" || ext == ".jpg" || ext == ".jpeg")
                {
                    images.emplace_back(entry.path().string());
                }
            }
        }

        auto start = std::chrono::high_resolution_clock::now();
        auto prediction = model.inference(images);
        auto end = std::chrono::high_resolution_clock::now();

        std::chrono::duration<double, std::milli> inference_time = end - start;
        std::cout << "Total inference time for " << images.size() << " images: "
            << inference_time.count() << " ms" << std::endl;

        if (!images.empty())
        {
            double per_image_time = inference_time.count() / images.size();
            std::cout << "Average inference time per image: "
                << per_image_time << " ms" << std::endl;
        }

    }
    catch (const std::exception& e)
    {
        std::cerr << "Exception occurred: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}

Performance Comparison

Taking a classification model as an example:

Model type | Hardware   | Image size | Avg time per image (single-image inference) | Avg time per image (built-in batch inference)
-----------|------------|------------|---------------------------------------------|----------------------------------------------
Fast       | 2080 Super | 96×96      | 19.17 ms                                    | 2.9 ms
Balanced   | 2080 Super | 96×96      | 11.45 ms                                    | 3.05 ms
Accurate   | 2080 Super | 96×96      | 25.72 ms                                    | 3.97 ms

For the complete set of measurements, refer to the table of inference runtimes for each model, model type, and model size.