InferLLM
Project Links: https://github.com/MegEngine/InferLLM
InferLLM is a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project. llama.cpp puts almost all of its core code and kernels in a single file and uses a large number of macros, which makes it difficult for developers to read and modify. InferLLM has the following features:
- Simple structure that is easy to get started with and learn; the framework part is decoupled from the kernel part.
- High efficiency; most of the kernels from llama.cpp have been ported.
- Defines a dedicated KV storage type for easy caching and management (see the illustrative sketch below).
- Compatible with multiple model formats (currently only alpaca Chinese and English int4 models are supported).
- Supports both CPU and GPU, with optimizations for Arm, x86, CUDA and riscv-vector; it can also be deployed on mobile phones with acceptable speed.
In short, InferLLM is a simple and efficient LLM inference framework that can deploy quantized LLM models locally with good inference speed.
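To give a feel for the dedicated KV storage idea mentioned above, the sketch below shows a minimal per-layer key/value cache that grows along the sequence dimension as tokens are decoded. It is purely illustrative: the class name, members and interface are hypothetical and do not mirror InferLLM's actual KV storage implementation.

```cpp
// Illustrative sketch only: a minimal per-layer KV cache buffer.
// The class name and interface are hypothetical and do NOT reflect
// InferLLM's real KV storage type.
#include <cstddef>
#include <cstring>
#include <vector>

class KvCacheSketch {
public:
    // Each token stores n_heads * head_dim floats for K and the same for V.
    KvCacheSketch(size_t n_heads, size_t head_dim, size_t max_tokens)
        : m_stride(n_heads * head_dim),
          m_keys(max_tokens * m_stride),
          m_values(max_tokens * m_stride) {}

    // Append the K/V vectors of one newly decoded token.
    void append(const float* k, const float* v) {
        std::memcpy(m_keys.data() + m_count * m_stride, k, m_stride * sizeof(float));
        std::memcpy(m_values.data() + m_count * m_stride, v, m_stride * sizeof(float));
        ++m_count;
    }

    // Number of tokens currently cached; attention only reads this prefix.
    size_t size() const { return m_count; }

    const float* keys() const { return m_keys.data(); }
    const float* values() const { return m_values.data(); }

    // Drop the cached history, e.g. when starting a new conversation.
    void reset() { m_count = 0; }

private:
    size_t m_stride;              // floats stored per token (n_heads * head_dim)
    size_t m_count = 0;           // tokens cached so far
    std::vector<float> m_keys;    // layout: [max_tokens, n_heads * head_dim]
    std::vector<float> m_values;  // layout: [max_tokens, n_heads * head_dim]
};
```

Preallocating the buffers up front avoids reallocations during decoding, which is one reason a dedicated storage type makes cache management simpler than scattering K/V tensors through the model code.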
Latest News
- 2023.08.16: Added support for the Llama-2-7B model.
- 2023.08.08: Optimized performance on Arm by rewriting the int4 matmul kernel with Arm assembly and kernel packing.
- Before: supported the chatglm/chatglm2, baichuan, alpaca and ggml-llama models.