diff --git a/.github/ISSUE_TEMPLATE/documentation_improvement.md b/.github/ISSUE_TEMPLATE/documentation_improvement.md new file mode 100644 index 00000000..e2df6546 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/documentation_improvement.md @@ -0,0 +1,15 @@ +--- +name: Documentation improvement +about: Create a report to help us improve +title: '[DOC] ' +labels: 'documentation' +assignees: '' + +--- + +## What part of the documentation do you want to update? + + +## What changes do you want to make? + + diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index b88884e5..329b69a1 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -11,13 +11,13 @@ repos: additional_dependencies: [types-pyyaml] exclude: tests/ - - repo: local - hooks: - - id: check-pruna-pro - name: Check for pruna_pro - entry: > - bash -c "git diff --cached --name-status | grep -v \"^D\" | grep -E \"^[A-Z]\\s+docs/\" | cut -f2- | while IFS= read -r file; do if [ -f \"$file\" ] && grep -q \"pruna_pro\" \"$file\"; then echo \"Error: pruna_pro found in staged file $file\"; exit 1; fi; done || [ $? -eq 1 ] && exit 1 || exit 0" - language: system - stages: [pre-commit] - types: [text] - files: '^docs/' \ No newline at end of file + # - repo: local + # hooks: + # - id: check-pruna-pro + # name: Check for pruna_pro + # entry: > + # bash -c "git diff --cached --name-status | grep -v \"^D\" | grep -E \"^[A-Z]\\s+docs/\" | cut -f2- | while IFS= read -r file; do if [ -f \"$file\" ] && ! grep -q \"pruna_pro\" \"$file\"; then echo \"Error: pruna_pro not found in staged file $file\"; exit 1; fi; done || [ $? 
-eq 1 ] && exit 1 || exit 0" + # language: system + # stages: [pre-commit] + # types: [text] + # files: '^docs/' \ No newline at end of file diff --git a/README.md b/README.md index 006898fe..fec61025 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ Element - **Simply make AI models faster, cheaper, smaller, greener!** + **Simply make AI models faster, cheaper, smaller, greener!** Element
@@ -86,17 +86,8 @@ pip install -e . ## Pruna Cool Quick Start -Before we start: Pruna allows to collect [a minimal set of aggregated, non-personal telemetry data](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/telemetry.html) to help us identify popular algorithms and improve the product. Telemetry is enabled by default because your participation helps us make Pruna better. However, if you'd prefer not to share this, you can always disable telemetry with: -```python -from pruna.telemetry import set_telemetry_metrics - -set_telemetry_metrics(False) # disable telemetry for current session -set_telemetry_metrics(False, set_as_default=True) # disable telemetry globally -``` - - -Getting started with Pruna is easy-peasy pruna-squeezy! +Getting started with Pruna is easy-peasy pruna-squeezy! First, load any pre-trained model. Here's an example using Stable Diffusion: @@ -105,7 +96,7 @@ from diffusers import StableDiffusionPipeline base_model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") ``` -Then, use Pruna's `smash` function to optimize your model. You can customize the optimization process using `SmashConfig`: +Then, use Pruna's `smash` function to optimize your model. Pruna provides a variety of compression and optimization algorithms that you can combine to get the best possible results. You can customize the optimization process using `SmashConfig`: ```python from pruna import smash, SmashConfig @@ -113,6 +104,7 @@ from pruna import smash, SmashConfig # Create and smash your model smash_config = SmashConfig() smash_config["cacher"] = "deepcache" +smash_config["compiler"] = "stable_fast" smashed_model = smash(model=base_model, smash_config=smash_config) ``` @@ -124,18 +116,6 @@ smashed_model("An image of a cute prune.").images[0]
-Pruna provides a variety of different compression and optimization algorithms, allowing you to combine different algorithms to get the best possible results: - -```python -from pruna import smash, SmashConfig - -# Create and smash your model -smash_config = SmashConfig() -smash_config["cacher"] = "deepcache" -smash_config["compiler"] = "stable_fast" -smashed_model = smash(model=base_model, smash_config=smash_config) -``` - You can then use our evaluation interface to measure the performance of your model: ```python @@ -143,332 +123,66 @@ from pruna.evaluation.task import Task from pruna.evaluation.evaluation_agent import EvaluationAgent from pruna.data.pruna_datamodule import PrunaDataModule -task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string("LAION256")) -eval_agent = EvaluationAgent(task) +task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string("LAION256")) +eval_agent = EvaluationAgent(task) eval_agent.evaluate(smashed_model) ``` - This was the minimal example, but you are looking for the maximal example? You can check out our [documentation][documentation] for an overview of all supported [algorithms][docs-algorithms] as well as our tutorials for more use-cases and examples. - ## Pruna Heart Pruna Pro Pruna has everything you need to get started on optimizing your own models. To push the efficiency of your models even further, we offer Pruna Pro. To give you a glimpse of what is possible with Pruna Pro, let us consider three of the most widely used diffusers pipelines and see how much smaller and faster we can make them. In addition to popular open-source algorithms, we use our proprietary Auto Caching algorithm. We compare the fidelity of the compressed models. Fidelity measures the similarity between the images of the compressed models and the images of the original model. 
-### Stable Diffusion XL - -For [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), we compare Auto Caching with [DeepCache](https://github.com/horseee/DeepCache) (available with Pruna). We combine these caching algorithms with torch.compile to get an additional **9%** reduction in inference latency, and we use [HQQ](https://github.com/mobiusml/hqq) 8-bit quantization to reduce the size of the model from **8.8GB** to **6.7GB**. - -SDXL Benchmark - -### FLUX [dev] -For [FLUX [dev]](https://huggingface.co/black-forest-labs/FLUX.1-dev), we compare Auto Caching with the popular [TeaCache](https://github.com/ali-vilab/TeaCache) algorithm. In this case, we used [Stable Fast](https://github.com/chengzeyi/stable-fast) to reduce the latency of Auto Caching by additional **13%**, and [HQQ](https://github.com/mobiusml/hqq) with 8-bit reduced the size of FLUX from **33GB** to **23GB**. - -FLUX [dev] Benchmark - -### HunyuanVideo -For [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo), we compare Auto Caching with [TeaCache](https://github.com/ali-vilab/TeaCache). Applying [HQQ](https://github.com/mobiusml/hqq) 8-bit quantization to the model reduced the size from **41GB** to **29GB**. - -HunyuanVideo Benchmark - - - -## Pruna Cool Algorithm Overview - -Since Pruna offers a broad range of compression algorithms, the following table provides an overview of all methods available in Pruna and those exclusive to Pruna Pro. For a detailed description of each algorithm, have a look at our [documentation](https://docs.pruna.ai/en/stable/). - - +
- - - - - - - - - - - - - + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + +
Algorithm
Pruna Pro
Type
Hardware
Model Format
CPUGPU🤗 Transformers CausalLM🤗 Diffusers Pipeline🤗 Transformers Whispertorch ModuleStable Diffusion XLFLUX [dev]HunyuanVideo
AWQquantizerCheckCheck
GPTQquantizerCheckCheck
HQQquantizerCheckCheckCheck
Int8quantizerCheckCheckCheck
QUANTOquantizerCheckCheckCheckCheck
Torch DynamicquantizerCheckCheckCheckCheck
HIGGSCheckquantizerCheckCheck
torchaoCheckquantizerCheckCheckCheckCheckCheckCheck
PERPCheckrecovererCheckCheckCheckCheck
c_translatecompilerCheckCheck
IPEXCheckcompilerCheckCheck
Stable FastcompilerCheckCheck
torch.compilecompilerCheckCheckCheckCheckCheckCheck
x-fastCheckcompilerCheckCheckCheckCheckCheck
DeepCache1cacherCheckCheckCheck
Adaptive CachingCheckcacherCheckCheckCheck
Auto CachingCheckcacherCheckCheckCheck
FLUX Caching2CheckcacherCheckCheckCheck
Periodic CachingCheckcacherCheckCheckCheck
HYPER3CheckdistillerCheckCheckCheck
Structured PruningprunerCheckCheckCheck
Unstructured PruningprunerCheckCheckCheckCheck
ifwbatcherCheckCheck
ws2tbatcherCheckCheck +
+ For Stable Diffusion XL, we compare Auto Caching with DeepCache (available with Pruna). We combine these caching algorithms with torch.compile to get an additional 9% reduction in inference latency, and we use HQQ 8-bit quantization to reduce the size of the model from 8.8GB to 6.7GB. +
+ SDXL Benchmark +
+
+ For FLUX [dev], we compare Auto Caching with the popular TeaCache algorithm. In this case, we use Stable Fast to reduce the latency of Auto Caching by an additional 13%, and HQQ 8-bit quantization reduces the size of FLUX from 33GB to 23GB. +
+ FLUX [dev] Benchmark +
+
+ For HunyuanVideo, we compare Auto Caching with TeaCache. Applying HQQ 8-bit quantization to the model reduced the size from 41GB to 29GB. +
+ HunyuanVideo Benchmark +
-1. Only available for unet-based diffusers pipelines.
-2. Only available for FLUX models.
-3. Only available for FLUX, SD-XL, SD-v1-4, SD-v1-5, SD-3.5.
+## Pruna Cool Algorithm Overview
+
+Since Pruna offers a broad range of compression algorithms, the following table provides a high-level overview of the techniques available in Pruna. For a detailed description of each algorithm, have a look at our [documentation](https://docs.pruna.ai/en/stable/).
+
+
+| Technique | Description | Speed | Memory | Accuracy |
+| --- | --- | --- | --- | --- |
+| Batching | Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing overall processing time. | ✅ | ❌ | 〰️ |
+| Caching | Stores intermediate results of computations to speed up subsequent operations, reducing inference time by reusing previously computed results. | ✅ | 〰️ | 〰️ |
+| Compilation | Optimizes the model with instructions for specific hardware. | ✅ | ➖ | 〰️ |
+| Distillation | Trains a smaller, simpler model to mimic a larger, more complex model. | ✅ | ✅ | ❌ |
+| Quantization | Reduces the precision of weights and activations, lowering memory requirements. | ✅ | ✅ | ❌ |
+| Pruning | Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. | ✅ | ✅ | ❌ |
+| Recovering | Restores the performance of a model after compression. | 〰️ | 〰️ | ✅ |
+
+✅ (improves), ➖ (stays the same), 〰️ (could worsen), ❌ (worsens)

@@ -486,7 +200,7 @@ If you can not find an answer to your question or problem in our [documentation] The Pruna package was made with 💜 by the Pruna AI team. [Contribute to the repository][docs-contributing] to become part of the Pruna family! - + Contributors are displayed in a random order to avoid any perceived ranking. @@ -495,13 +209,13 @@ Contributors are displayed in a random order to avoid any perceived ranking. If you use Pruna in your research, feel free to cite the project! 💜 -``` - @misc{pruna, +```bibtex +@misc{pruna, title = {Efficient Machine Learning with Pruna}, year = {2023}, note = {Software available from pruna.ai}, - url={https://www.pruna.ai/} - } + url = {https://www.pruna.ai/} +} ```
diff --git a/docs/contributions/contributions_toc.rst b/docs/contributions/contributions_toc.rst deleted file mode 100644 index 90bcb5f6..00000000 --- a/docs/contributions/contributions_toc.rst +++ /dev/null @@ -1,10 +0,0 @@ -.. toctree:: - :hidden: - :maxdepth: 1 - :caption: Contributing - - /docs_pruna/contributions/how_to_contribute - /docs_pruna/contributions/opening_an_issue - /docs_pruna/contributions/adding_algorithm - /docs_pruna/contributions/adding_metric - /docs_pruna/contributions/adding_dataset \ No newline at end of file diff --git a/docs/contributions/how_to_contribute.rst b/docs/contributions/how_to_contribute.rst index a18614c6..f24fc8c3 100644 --- a/docs/contributions/how_to_contribute.rst +++ b/docs/contributions/how_to_contribute.rst @@ -1,11 +1,11 @@ How to Contribute 💜 =============================== -Since you landed on this part of the documentation, we want to first of all say thank you! 💜 +Since you landed on this part of the documentation, we want to first of all say thank you! 💜 Contributions from the community are essential to improving |pruna|, we appreciate your effort in making the repository better for everyone! -Please make sure to review and adhere to the `Pruna Code of Conduct `_ before contributing to Pruna. -Any violations will be handled accordingly and result in a ban from the Pruna community and associated platforms. +Please make sure to review and adhere to the `Pruna Code of Conduct `_ before contributing to Pruna. +Any violations will be handled accordingly and result in a ban from the Pruna community and associated platforms. Contributions that do not adhere to the code of conduct will be ignored. There are various ways you can contribute: @@ -13,6 +13,7 @@ There are various ways you can contribute: - Have a question? Discuss with us on `Discord `_ or check out the :doc:`/resources/faq` - Have an idea for a new tutorial? Open an issue with a :ref:`feature-request` or chat with us on `Discord `_ - Found a bug? 
Open an issue with a :ref:`bug-report` +- Want to improve the documentation? Open an issue with a :ref:`documentation-improvement` - Want a new feature? Open an issue with a :ref:`feature-request` - Have a new algorithm to add? Check out: :doc:`adding_algorithm` - Have a new metric to add? Check out: :doc:`adding_metric` @@ -75,7 +76,7 @@ You can then also install the pre-commit hooks with ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You are now ready to work on your contribution. Check out a branch on your forked repository and start coding! -When committing your changes, we recommend to follow the `Conventional Commit Guidelines `_. +When committing your changes, we recommend following the `Conventional Commit Guidelines `_. .. code-block:: bash @@ -84,13 +85,13 @@ When committing your changes, we recommend to follow the `Conventional Commit Gu git commit -m "feat: new amazing feature setup" git push origin feat/new-feature -Make sure to develop your contribution in a way that is well documented, concise and easy to maintain. +Make sure to develop your contribution in a way that is well documented, concise and easy to maintain. We will do our best to have your contribution integrated and maintained into |pruna| but reserve the right to reject contributions that we do not feel are in the best interest of the project. 4. Run the tests ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -We have a comprehensive test suite that is designed to catch potential issues before they are merged into |pruna|. +We have a comprehensive test suite that is designed to catch potential issues before they are merged into |pruna|. When you make a contribution, it is highly recommended to not only run the existing tests but also to add new tests that cover your contribution. You can run the tests by running the following command: @@ -109,9 +110,9 @@ If you want to run only the tests with a specific marker, e.g. fast CPU tests, y 5.
Create a Pull Request ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Once you have made your changes and tested them, you can create a Pull Request. +Once you have made your changes and tested them, you can create a Pull Request. We will then review your Pull Request and get back to you as soon as possible. -If there are any questions along the way, please do not hesitate to reach out on `Discord `_. +If there are any questions along the way, please do not hesitate to reach out on `Discord `_. diff --git a/docs/contributions/opening_an_issue.rst b/docs/contributions/opening_an_issue.rst index ae513e98..a958b237 100644 --- a/docs/contributions/opening_an_issue.rst +++ b/docs/contributions/opening_an_issue.rst @@ -1,6 +1,27 @@ Opening an Issue =============================== + +.. _documentation-improvement: + +Documentation Improvement +------------------------- + +All bits help! We appreciate your interest in improving |pruna|’s documentation. + +Our documentation is built with `Sphinx `_ and `Read the Docs `_. + +The current setup relies on reStructuredText (rst) files and requires us to build and evaluate the documentation on our side. +This means you cannot preview documentation changes on your local machine; however, you can still edit the documentation and open a pull request with your changes. + +When opening a documentation improvement issue on GitHub, you will encounter the following template to help you structure your suggestion. Make sure to fill out all sections applicable to your suggestion so that we can integrate it into Pruna as fast as possible: + +.. literalinclude:: issue_templates/documentation_improvement.md + :language: markdown + :linenos: + :lines: 9- + + .. _bug-report: Bug Report @@ -33,7 +54,7 @@ When opening a bug report on GitHub, you will encounter the following template t Feature Request --------------- -We appreciate your interest in improving |pruna|!
Feature requests help shape the project, and we welcome ideas that align with our mission. +We appreciate your interest in improving |pruna|! Feature requests help shape the project, and we welcome ideas that align with our mission. Before submitting your feature request, consider the following points to ensure your request is clear and actionable: @@ -48,4 +69,4 @@ When opening a feature request on GitHub, you will encounter the following templ .. literalinclude:: issue_templates/feature_request.md :language: markdown :linenos: - :lines: 9- \ No newline at end of file + :lines: 9- diff --git a/docs/user_manual/telemetry.rst b/docs/contributions/telemetry.rst similarity index 100% rename from docs/user_manual/telemetry.rst rename to docs/contributions/telemetry.rst diff --git a/docs/tutorials/asr_whisper.ipynb b/docs/tutorials/asr_whisper.ipynb deleted file mode 100644 index a8896a0d..00000000 --- a/docs/tutorials/asr_whisper.ipynb +++ /dev/null @@ -1,194 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 100% faster Whisper Transcription" - ] - }, - { - "cell_type": "raw", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "\n", - " \"Open\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This tutorial demonstrates how to use the `pruna` package to optimize any custom whisper model. We will use the `openai/whisper-large-v3` model as an example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if you are not running the latest version of this tutorial, make sure to install the matching version of pruna\n", - "# the following command will install the latest version of pruna\n", - "!pip install pruna" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Loading the ASR model\n", - "\n", - "First, load your ASR model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from transformers import AutoModelForSpeechSeq2Seq\n", - "\n", - "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n", - "torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n", - "\n", - "model_id = \"openai/whisper-large-v3\"\n", - "\n", - "model = AutoModelForSpeechSeq2Seq.from_pretrained(\n", - " model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True\n", - ")\n", - "model.to(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Initializing the Smash Config\n", - "\n", - "Next, initialize the smash_config. Since the compiler requires a processor, we add it to the smash_config." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "from pruna import SmashConfig\n", - "\n", - "# Initialize the SmashConfig\n", - "smash_config = SmashConfig()\n", - "smash_config.add_processor(model_id)\n", - "smash_config['compiler'] = 'c_whisper'\n", - "# uncomment the following line to quantize the model to 8 bits\n", - "# smash_config['c_whisper_weight_bits'] = 8" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Smashing the Model\n", - "\n", - "Now, you can smash the model, which will take approximately 2 minutes on a T4 GPU." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pruna import smash\n", - "\n", - "# Smash the model\n", - "smashed_model = smash(\n", - " model=model,\n", - " smash_config=smash_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4. 
Preparing the Input" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "from transformers import AutoProcessor\n", - "\n", - "processor = AutoProcessor.from_pretrained(model_id)\n", - "\n", - "dataset = load_dataset(\"distil-whisper/librispeech_long\", \"clean\", split=\"validation\")\n", - "sample = dataset[0][\"audio\"]\n", - "input_features = processor(sample[\"array\"], sampling_rate=sample[\"sampling_rate\"], return_tensors=\"pt\").input_features\n", - "input_features = input_features.cuda().half()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5. Running the Model\n", - "\n", - "Finally, run the model to transcribe the audio file." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "# Display the result\n", - "results = smashed_model(input_features)\n", - "processor.decode(results, skip_special_tokens=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Wrap Up\n", - "\n", - "Congratulations! You have successfully smashed an ASR model. You can now use the `pruna` package to optimize any custom ASR model. The only parts that you should modify are step 1, 4 and 5 to fit your use case." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "pruna", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/docs/tutorials/index.rst b/docs/tutorials/index.rst new file mode 100644 index 00000000..6fd219d0 --- /dev/null +++ b/docs/tutorials/index.rst @@ -0,0 +1,66 @@ +.. 
_pruna_tutorials: + +Tutorials Pruna +=============== + +These tutorials will guide you through using |pruna| to optimize your models. Looking for |pruna_pro| tutorials? Check out the :ref:`pruna_pro_tutorials` page. + +.. grid:: 1 2 2 2 + + .. grid-item-card:: Transcribe 2 hours of audio in 2 minutes with Whisper + :text-align: center + :link: ./asr_tutorial.ipynb + + Speed up ASR with ``c_whisper`` ``compilation`` and ``whisper_s2t`` ``batching``. + + .. grid-item-card:: Smash your Computer Vision model with only a CPU + :text-align: center + :link: ./cv_cpu.ipynb + + ``Compile`` your model with ``torch_compile`` and ``openvino`` for faster inference. + + .. grid-item-card:: Speed Up and Quantize any Diffusion Model + :text-align: center + :link: ./diffusion_quantization_acceleration.ipynb + + Speed up ``diffusers`` with ``torch_compile`` ``compilation`` and ``hqq_diffusers`` ``quantization``. + + .. grid-item-card:: Evaluating with CMMD using EvaluationAgent + :text-align: center + :link: ./evaluation_agent_cmmd.ipynb + + ``Evaluate`` image generation quality with ``CMMD`` and ``EvaluationAgent``. + + .. grid-item-card:: Run your Flux model with half the memory + :text-align: center + :link: ./flux_small.ipynb + + Speed up your image generation model with ``torch_compile`` ``compilation`` and ``hqq_diffusers`` ``quantization``. + + .. grid-item-card:: Make your LLMs 4x smaller + :text-align: center + :link: ./llms.ipynb + + Speed up your LLM inference with ``gptq`` ``quantization``. + + .. grid-item-card:: 2x smaller Sana diffusers in action + :text-align: center + :link: ./sana_diffusers_int8.ipynb + + Optimize your ``diffusion`` model with ``hqq_diffusers`` ``quantization`` in 8 bits. + + .. grid-item-card:: Make Stable Diffusion 3x Faster with DeepCache + :text-align: center + :link: ./sd_deepcache.ipynb + + Optimize your ``diffusion`` model with ``deepcache`` ``caching``. + + +..
toctree:: + :hidden: + :maxdepth: 1 + :caption: Pruna + :glob: + + ./* + diff --git a/docs/contributions/adding_algorithm.rst b/docs/user_manual/adding_algorithm.rst similarity index 92% rename from docs/contributions/adding_algorithm.rst rename to docs/user_manual/adding_algorithm.rst index 9f9da18e..99838c3e 100644 --- a/docs/contributions/adding_algorithm.rst +++ b/docs/user_manual/adding_algorithm.rst @@ -1,29 +1,28 @@ -Adding an Algorithm +Customize Algorithms ==================== -Adding the Algorithm to ``pruna.algorithms`` --------------------------------------------- - -If you’ve developed a new method or want to integrate a missing algorithm into |pruna|, we welcome your contribution! This tutorial guides you through the steps to integrate a new compression algorithm, making it available for all users. +If you’ve developed a new method or want to integrate a missing algorithm into |pruna|, we welcome your contribution! This tutorial guides you through the steps to integrate a new compression algorithm, making it available for all users. If anything is unclear or you want to discuss your contribution before opening a PR, please reach out on `Discord `_ anytime! If this is your first time contributing to |pruna|, please refer to the :ref:`how-to-contribute` guide for more information. +Add a Custom Algorithm +---------------------- + We’ll use **Superfast**, an example compiler, to demonstrate the process. -0. Identifying the Algorithm Group -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 1. Identify the Algorithm Group +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The first step is to identify the algorithm group. This is important because it determines the folder in which the algorithm should be placed. You can find the list of all algorithm groups in the :doc:`Compression Algorithms <../../compression>` section and determine which group fits your algorithm best by reviewing the algorithm group descriptions. -1. 
Creating the Compiler Class -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 2. Create the Compiler Class +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ First, navigate to ``pruna/algorithms/compilation/`` and create ``superfast.py``. - -2. Defining Compiler Attributes -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 3. Define Compiler Attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Define the new compiler by inheriting from ``PrunaCompiler`` and define key attributes for the compiler. These attributes are used to provide information about the algorithm to the user, other functions in the package and even the documentation. @@ -40,10 +39,10 @@ These attributes are used to provide information about the algorithm to the user class SuperfastCompiler(PrunaCompiler): """ Implement Superfast Compiler using the superfast package. - + This compiler compiles anything with zero compilation time and 100x speedup. """ - + algorithm_name = "superfast" references = {"GitHub": "/url/to/GitHub"} tokenizer_required = False @@ -54,8 +53,9 @@ These attributes are used to provide information about the algorithm to the user compatible_algorithms = dict(quantizer=["quanto"]) -Explanation -^^^^^^^^^^^^ +Step 4. Add Algorithm Attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + - docstring: The docstring should be concise and describe the algorithm in a way that is easy to understand. The description paragraph of the algorithm will be used to automatically generate the algorithm's documentation. - ``algorithm_name``: Identifier used to activate the algorithm, name should be in snake case. - ``references``: A dictionary of any references that can be provided for the algorithm, typically a link to the GitHub repository or a paper. @@ -65,8 +65,8 @@ Explanation - Additionally, you might have to specify a saving function. We provide more details on this in the section below. -Defining Hyperparameters -^^^^^^^^^^^^^^^^^^^^^^^^ +Step 5. 
Define Hyperparameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Define hyperparameters using `ConfigSpace `_, allowing users to configure the backend and mode. Everything that configures the algorithm or specifies the algorithm's behavior should be a hyperparameter. @@ -81,13 +81,13 @@ Everything that configures the algorithm or specifies the algorithm's behavior s CategoricalHyperparameter("mode", choices=["mode1", "mode2"], default_value="mode1", meta=dict(desc="The mode to use for the Superfast compiler.")), ] -Users can now configure hyperparameters via ``smash_config["superfast_backend"] = "backend2"``. +Users can now configure hyperparameters via ``smash_config["superfast_backend"] = "backend2"``. Make sure to include descriptions of the hyperparameters with the ``desc`` key in the ``meta`` dictionary. This will be used later to document the hyperparameters in the algorithm's documentation. -Checking Model Compatibility -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 6. Check Model Compatibility +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Ensure the compiler only runs on supported models. In our example, the Superfast compiler is compatible with any model that is a subclass of ``torch.nn.Module``: @@ -101,8 +101,8 @@ Ensure the compiler only runs on supported models. In our example, the Superfast Users can bypass this check using ``experimental=True`` when calling ``smash``, but results may be unpredictable. -Handling External Dependencies -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 7. Handle External Dependencies +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If the compiler requires external packages, isolate their imports: @@ -116,8 +116,9 @@ If the compiler requires external packages, isolate their imports: Make sure that the dependencies are listed in ``pyproject.toml`` if they are not already included. -Implementing the Compilation Process -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 8. 
Implement the Compilation Process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + The ``_apply()`` function integrates superfast with Pruna: @@ -132,63 +133,9 @@ The ``_apply()`` function integrates superfast with Pruna: Note that the ``smash_config`` prefix wrapper automatically prefixes hyperparameters with the algorithm name (``superfast_``). If a user sets ``smash_config["superfast_backend"]``, it will be mapped correctly to ``"backend"`` in ``get_hyperparameters()``. -Full Implementation -^^^^^^^^^^^^^^^^^^^^ - -Here’s the complete ``superfast.py`` implementation: - -.. code-block:: python - - from typing import Any, Dict - import torch - from ConfigSpace import CategoricalHyperparameter - from pruna.algorithms.compilation import PrunaCompiler - from pruna.config.smash_config import SmashConfigPrefixWrapper - - class SuperfastCompiler(PrunaCompiler): - """ - Implement Superfast Compiler using the superfast package. - - This compiler compiles anything with zero compilation time and 100x speedup. - """ - - algorithm_name = "superfast" - references = {"GitHub": "/url/to/GitHub"} - tokenizer_required = False - processor_required = False - dataset_required = False - run_on_cpu = True - run_on_cuda = True - compatible_algorithms = dict(quantizer=["quanto"]) - - def get_hyperparameters(self) -> list: - return [ - CategoricalHyperparameter("backend", choices=["backend1", "backend2"], default_value="backend1"), - CategoricalHyperparameter("mode", choices=["mode1", "mode2"], default_value="mode1"), - ] - - def model_check_fn(self, model: Any) -> bool: - return isinstance(model, torch.nn.Module) - - def import_algorithm_packages(self) -> Dict[str, Any]: - from superfast import compile_func - return dict(compile_func=compile_func) - - def _apply(self, model: Any, smash_config: SmashConfigPrefixWrapper) -> Any: - compile_func = self.import_algorithm_packages()["compile_func"] - return compile_func(model, smash_config["backend"], smash_config["mode"]) - -.. 
container:: hidden_code - - .. code-block:: python - - # test instantiation of compiler - SuperfastCompiler() - - +Step 9. Determine the Saving Function +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Determining a Suitable Saving Function ----------------------------------------- Saving e.g. a compiled or quantized model can be tricky and requires careful consideration. To determine the correct saving function for your algorithm, consider the decision tree below. .. mermaid:: @@ -206,21 +153,19 @@ Saving e.g. a compiled or quantized model can be tricky and requires careful con G -->|Yes| L["SAVE_FUNCTIONS.pickled"] G -->|No| M["Introduce new saving function."] -The first decision is whether the original saving function can be retained. +The first decision is whether the original saving function can be retained. For example, GPTQ-quantized transformers models still support ``.from_pretrained`` and ``.save_pretrained``, making retention possible. -If the original function cannot be retained, we consider how long the algorithm takes to apply. -If it is quick (e.g., a caching helper), we can reapply it after loading. -The key distinction is whether the modifications persist when saving. For instance, “step caching cacher” attaches a helper that is discarded by ``diffusers`` upon saving, so the model can be saved and reloaded normally before reapplying the function. +If the original function cannot be retained, we consider how long the algorithm takes to apply. +If it is quick (e.g., a caching helper), we can reapply it after loading. +The key distinction is whether the modifications persist when saving. For instance, “step caching cacher” attaches a helper that is discarded by ``diffusers`` upon saving, so the model can be saved and reloaded normally before reapplying the function. In contrast, compilation is irreversible—once compiled, a model cannot be saved in its compiled form, so we must save it beforehand and reapply compilation after loading. 
-If neither approach works, we must introduce a new saving function or use ``SAVE_FUNCTIONS.pickled``. We implement a new saving function following the existing saving-function pattern as well as introducing a matching loading function. +If neither approach works, we must introduce a new saving function or use ``SAVE_FUNCTIONS.pickled``. To do so, we implement a new saving function that follows the existing saving-function pattern, together with a matching loading function. Otherwise, we can resort to saving the model in pickled format, but be aware that pickled models pose security risks and are generally not trusted by the community. - - -Testing the Algorithm ----------------------- +Step 10. Test the Algorithm +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To integrate the algorithm into the test suite, we navigate to ``tests/algorithms/testers/compilation.py`` and add the following Tester Class: @@ -239,14 +184,14 @@ To integrate the algorithm into the test suite, we navigate to ``tests/algorithm dummy_algorithm_tester = types.ModuleType("pruna.algorithms.testers.compilation") dummy_algorithm_tester.AlgorithmTesterBase = ABC sys.modules["base_tester"] = dummy_algorithm_tester - + .. code-block:: python from base_tester import AlgorithmTesterBase from pruna.algorithms.compilation.superfast import SuperfastCompiler from pruna import PrunaModel - + class TestSuperfast(AlgorithmTesterBase): """Tester class for the Superfast algorithm.""" @@ -271,6 +216,60 @@ This Tester class automatically parametrizes an integration test at ``tests/algo Additionally, a test is created to check that ``model_check_fn`` rejects a non-compatible model. Before opening a PR, make sure to run the test suite locally to ensure the algorithm is working as expected. + +Full Implementation +------------------- + +Here’s the complete ``superfast.py`` implementation: + +..
code-block:: python + + from typing import Any, Dict + import torch + from ConfigSpace import CategoricalHyperparameter + from pruna.algorithms.compilation import PrunaCompiler + from pruna.config.smash_config import SmashConfigPrefixWrapper + + class SuperfastCompiler(PrunaCompiler): + """ + Implement Superfast Compiler using the superfast package. + + This compiler compiles anything with zero compilation time and 100x speedup. + """ + + algorithm_name = "superfast" + references = {"GitHub": "/url/to/GitHub"} + tokenizer_required = False + processor_required = False + dataset_required = False + run_on_cpu = True + run_on_cuda = True + compatible_algorithms = dict(quantizer=["quanto"]) + + def get_hyperparameters(self) -> list: + return [ + CategoricalHyperparameter("backend", choices=["backend1", "backend2"], default_value="backend1"), + CategoricalHyperparameter("mode", choices=["mode1", "mode2"], default_value="mode1"), + ] + + def model_check_fn(self, model: Any) -> bool: + return isinstance(model, torch.nn.Module) + + def import_algorithm_packages(self) -> Dict[str, Any]: + from superfast import compile_func + return dict(compile_func=compile_func) + + def _apply(self, model: Any, smash_config: SmashConfigPrefixWrapper) -> Any: + compile_func = self.import_algorithm_packages()["compile_func"] + return compile_func(model, smash_config["backend"], smash_config["mode"]) + +.. container:: hidden_code + + .. 
code-block:: python + + # test instantiation of compiler + SuperfastCompiler() + Conclusion ---------- diff --git a/docs/contributions/adding_dataset.rst b/docs/user_manual/adding_dataset.rst similarity index 89% rename from docs/contributions/adding_dataset.rst rename to docs/user_manual/adding_dataset.rst index b3d82345..e0bc7b49 100644 --- a/docs/contributions/adding_dataset.rst +++ b/docs/user_manual/adding_dataset.rst @@ -1,18 +1,21 @@ -Adding a Dataset -=============================== +Customize Datasets +================== -Our interface makes it easy to add :doc:`your own dataset <../user_manual/dataset>`. +Our interface makes it easy to add :doc:`your own dataset <../user_manual/dataset>`. Additionally, we provide a variety of :doc:`preconfigured datasets <../user_manual/dataset>` that can be readily used in SmashConfig for calibration or evaluation. -If you’d like to contribute a new dataset to our supported list, follow these two quick steps. +If you’d like to contribute a new dataset to our supported list, follow these two quick steps. If anything is unclear or you want to discuss your contribution before opening a PR, please reach out on `Discord `_ anytime! If this is your first time contributing to |pruna|, please refer to the :ref:`how-to-contribute` guide for more information. -1. Define the Dataset Setup -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Add a Custom Dataset +-------------------- -First, create a setup method to prepare the training, validation, and test splits. -This usually involves downloading or generating the dataset. +Step 1. Define the Dataset Setup +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +First, create a setup method to prepare the training, validation, and test splits. +This usually involves downloading or generating the dataset. For a text generation dataset, add the setup method in ``pruna/data/datasets/text_generation.py``: .. code-block:: python @@ -59,7 +62,7 @@ with the matching collate function and any defaults (e.g. 
the default image size base_datasets["NewDataset"] = (setup_new_dataset, "text_generation_collate", {}) -Ensure the dataset follows the expected format specified in the :doc:`collate function <../user_manual/dataset>`. +Ensure the dataset follows the expected format specified in the :doc:`collate function <../user_manual/dataset>`. The collate function aggregates several samples into a batch and converts them to the expected format. Now, users can add the dataset like this: @@ -74,16 +77,16 @@ Now, users can add the dataset like this: .. container:: hidden_code - + .. code-block:: python - + # test if dataloader works as expected for batch in smash_config.test_dataloader(): break -2. Add a Test +Step 2. Add a Test ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To verify that the dataset loads correctly, add it to ``tests/data/test_datamodule.py`` by parameterizing ``test_dm_from_string`` @@ -94,11 +97,11 @@ To verify that the dataset loads correctly, add it to ``tests/data/test_datamodu pytest.param("NewDataset", dict(img_size=512), marks=pytest.mark.slow) -Include necessary arguments for the collate function and mark the test as slow if needed. +Include necessary arguments for the collate function and mark the test as slow if needed. We categorize a test as slow if it requires several minutes to download and prepare the dataset. This ensures it runs appropriately in CI, either on GitHub Actions or nightly tests. Conclusion -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +---------- -That’s it! Your dataset is now available for everyone to use in Pruna. 💜 +That’s it! Your dataset is now available for everyone to use in |pruna|. 
💜 diff --git a/docs/contributions/adding_metric.rst b/docs/user_manual/adding_metric.rst similarity index 96% rename from docs/contributions/adding_metric.rst rename to docs/user_manual/adding_metric.rst index 34339c20..bb75c877 100644 --- a/docs/contributions/adding_metric.rst +++ b/docs/user_manual/adding_metric.rst @@ -1,5 +1,5 @@ -Adding a Metric -=============================== +Customize Metrics +================= This guide will walk you through the process of adding a new metric to Pruna's evaluation system. @@ -12,20 +12,20 @@ Understanding Pruna's Metric System |pruna| has two main types of metrics that live under ``pruna/evaluation/metrics``: -1. **Base Metrics** - Inherit from ``BaseMetric`` and compute values directly without maintaining state. These metrics usually require isolated inference computation. Examples: ``GPUMemoryMetric``, ``ElapsedTimeMetric``. -2. **Stateful Metrics** - Inherit from ``StatefulMetric`` and maintain internal state across multiple computations. State here refers to the information that is accumulated across multiple batches. Examples: all metrics under ``TorchMetricWrapper`` like ``Accuracy``, ``CLIPScore``. +1. **Base Metrics** - Inherit from ``BaseMetric`` and compute values directly without maintaining state. These metrics usually require isolated inference computation. Examples: ``GPUMemoryMetric``, ``ElapsedTimeMetric``. +2. **Stateful Metrics** - Inherit from ``StatefulMetric`` and maintain internal state across multiple computations. State here refers to the information that is accumulated across multiple batches. Examples: all metrics under ``TorchMetricWrapper`` like ``Accuracy``, ``CLIPScore``. When adding a new metric to |pruna|, you should place your implementation in ``pruna/evaluation/metrics`` directory to ensure it's properly integrated with the rest of the system. Use snake_case for the file name (e.g., ``your_new_metric.py``). 
In |pruna|, we evaluate metrics by sharing inference runs across multiple metrics whenever possible. This means that |pruna| runs inference once for all compatible metrics. - + - **Stateful metrics** are preferred for most use cases, especially quality metrics, as they can share inference results across multiple metrics - **Base metrics** are primarily used when isolated inference is required (e.g., for GPU memory metrics where sharing inference would distort results) .. note:: If you are confused about which type of metric to implement, you will likely need to implement stateful metrics. Base metrics are typically only used for specialized performance measurements that require isolated inference. -We use PascalCase for the class names (e.g, ``YourNewMetric``) and NumPy style docstrings for documentation. +We use PascalCase for the class names (e.g., ``YourNewMetric``) and NumPy style docstrings for documentation. Base Metrics ~~~~~~~~~~~~ @@ -44,10 +44,10 @@ Base metrics inherit from the ``BaseMetric`` class and implement the ``compute() def __init__(self): super().__init__() # Initialize any parameters your metric needs - + def compute(self, model, dataloader): '''Run inference on the model and compute the metric value.''' - + outputs = run_inference(model, dataloader) result = some_calculation(outputs) return result @@ -93,11 +93,11 @@ Here's a complete example showing all required methods: self.metric_name = "your_metric_name" self.default_call_type = "y_gt" self.call_type = call_type if call_type else self.default_call_type - + # Initialize state variables self.add_state("total", torch.zeros(1)) self.add_state("count", torch.zeros(1)) - + def update(self, inputs, ground_truths, predictions): # Update the state variables based on the current batch # Pass the inputs, ground_truths and predictions and the call_type to the metric_data_processor to get the data in the correct format @@ -105,13 +105,12 @@ Here's a complete example showing all required methods:
batch_result = some_calculation(*metric_data) self.total += batch_result self.count += 1 - + def compute(self): # Compute the final metric value using the accumulated state if self.count == 0: return 0 return self.total / self.count - When to Use Each Type ~~~~~~~~~~~~~~~~~~~~~ @@ -119,12 +118,12 @@ When to Use Each Type - **Use Stateful Metrics when**: Your metric can share inference with other metrics without affecting results (most quality metrics fall into this category) - **Use Basic Metrics when**: Your metric requires isolated inference or would produce incorrect results if inference were shared (e.g., performance metrics like GPU memory usage) -By using stateful metrics whenever possible, |pruna| can efficiently evaluate multiple metrics with just a single inference pass. +By using stateful metrics whenever possible, |pruna| can efficiently evaluate multiple metrics with just a single inference pass. Registering Your Metric ----------------------- -After implementing your metric, you need to register it with Pruna's ``MetricRegistry`` system. +After implementing your metric, you need to register it with Pruna's ``MetricRegistry`` system. The simplest way to do this is with the ``@MetricRegistry.register`` decorator: @@ -140,7 +139,7 @@ The simplest way to do this is with the ``@MetricRegistry.register`` decorator: self.param1 = param1 self.param2 = param2 self.metric_name = "your_metric_name" - + Thanks to this registry system, everyone using |pruna| can now refer to your metric by name without having to create instances directly! One important thing: the registration happens when your module is imported. To ensure your metric is always available, we suggest importing it in ``pruna/evaluation/metrics/__init__.py`` file. 
@@ -194,7 +193,7 @@ Once you've implemented your metric, everyone can use it in Pruna's evaluation p metrics = [ 'clip_score', - 'your_new_metric_name' + 'your_new_metric_name' ] data_module = PrunaDataModule.from_string('LAION256') @@ -206,6 +205,6 @@ Once you've implemented your metric, everyone can use it in Pruna's evaluation p results = eval_agent.evaluate(model) - + diff --git a/docs/user_manual/configure.rst b/docs/user_manual/configure.rst new file mode 100644 index 00000000..8274154d --- /dev/null +++ b/docs/user_manual/configure.rst @@ -0,0 +1,320 @@ +Define a SmashConfig +==================== + +This guide provides an introduction to configuring model optimization strategies with |pruna|. + +Model optimization configuration relies on the ``SmashConfig`` class. +The ``SmashConfig`` class provides a flexible dictionary-like interface for configuring model optimization strategies. +It manages algorithms, hyperparameters, and additional components like tokenizers, processors and datasets. + +Basic Configuration Workflow +---------------------------- + +|pruna| follows a simple workflow for configuring model optimization strategies: + +.. 
mermaid:: + :align: center + + graph LR + User -->|creates| SmashConfig + User -->|loads| PreTrainedModel["Pre-trained Model"] + + subgraph "Configuration Components" + SmashConfig --- Algorithm["Algorithm Selection"] + SmashConfig --- Hyperparameters + SmashConfig --- Tokenizer["Tokenizer (optional)"] + SmashConfig --- Processor["Processor (optional)"] + SmashConfig --- Dataset["Dataset (optional)"] + end + + SmashConfig -->|configures| SmashFn["smash() function"] + PreTrainedModel -->|input to| SmashFn + SmashFn -->|returns| OptimizedModel["Optimized PrunaModel"] + + style User fill:#bbf,stroke:#333,stroke-width:2px + style PreTrainedModel fill:#bbf,stroke:#333,stroke-width:2px + style SmashConfig fill:#bbf,stroke:#333,stroke-width:2px + style SmashFn fill:#bbf,stroke:#333,stroke-width:2px + style OptimizedModel fill:#bbf,stroke:#333,stroke-width:2px + +Let's see what that looks like in code. + +.. code-block:: python + + from pruna import SmashConfig + + smash_config = SmashConfig() + + # Activate IFW batching + smash_config['batcher'] = 'ifw' + + # Set IFW batching parameters + smash_config['ifw_weight_bits'] = 16 + smash_config['ifw_group_size'] = 4 + + # Add a tokenizer + smash_config.add_tokenizer('bert-base-uncased') + +Configure Algorithms +-------------------- + +|pruna| implements an extensible architecture for optimization algorithms. +Each algorithm has its own impact on the model in terms of speed, memory and accuracy. +The table below provides a general overview of the impact of each algorithm group. + +.. list-table:: + :widths: 10 60 10 10 10 + :header-rows: 1 + + * - Technique + - Description + - Speed + - Memory + - Accuracy + * - ``batcher`` + - Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing processing time. + - ✅ + - ❌ + - ~ + * - ``cacher`` + - Stores intermediate results of computations to speed up subsequent operations.
+ - ✅ + - ~ + - ~ + * - ``compiler`` + - Optimizes the model with instructions for specific hardware. + - ✅ + - ➖ + - ~ + * - ``distiller`` + - Trains a smaller, simpler model to mimic a larger, more complex model. + - ✅ + - ✅ + - ❌ + * - ``quantizer`` + - Reduces the precision of weights and activations, lowering memory requirements. + - ✅ + - ✅ + - ❌ + * - ``pruner`` + - Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. + - ✅ + - ✅ + - ❌ + * - ``recoverer`` + - Restores the performance of a model after compression. + - ~ + - ~ + - ✅ + +✅(improves), ➖(stays the same), ~(could worsen), ❌(worsens) + +.. tip:: + + The :doc:`Algorithm Overview ` page provides a more detailed overview of each algorithm within the different groups, + as well as additional information on hardware requirements, compatibility with other algorithms, and required components for each algorithm. + +Configure Algorithm Groups +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To activate an algorithm, you assign its name to the corresponding algorithm group in the ``SmashConfig``. +The group names are listed in the table above, and the specific algorithms are listed on the :doc:`Algorithm Overview ` page. + +Let's activate the ``ifw`` algorithm as a ``batcher``: + +.. code-block:: python + + from pruna import SmashConfig + + smash_config = SmashConfig() + + # Activate IFW batching + smash_config['batcher'] = 'ifw' + +Configure Algorithm Hyperparameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Each algorithm has its own set of hyperparameters that control its behavior. +These are automatically prefixed with the algorithm name and are also listed under each algorithm in the :doc:`Algorithm Overview `. + +Let's add the ``ifw_weight_bits`` and ``ifw_group_size`` hyperparameters for the ``ifw`` batcher we defined above: + +..
code-block:: python + + from pruna import SmashConfig + + smash_config = SmashConfig() + + # Activate IFW batching + smash_config['batcher'] = 'ifw' + + # Set IFW batching parameters + smash_config['ifw_weight_bits'] = 16 + smash_config['ifw_group_size'] = 4 + +Configure Components +-------------------- + +Some algorithms require a tokenizer, processor or dataset to be added to the ``SmashConfig``. +For example, the :doc:`Algorithm Overview ` shows that the ``gptq`` quantizer requires a dataset and a tokenizer. + +.. list-table:: + :widths: 10 90 10 + :header-rows: 1 + + * - Component + - Description + - Function + * - ``tokenizer`` + - Tokenizes the input text. + - ``add_tokenizer()`` + * - ``processor`` + - Processes the input data. + - ``add_processor()`` + * - ``data`` + - Loads a dataset. + - ``add_data()`` + +.. note:: + + If you try to activate an algorithm that requires a dataset, tokenizer or processor and haven’t added them to the ``SmashConfig``, you will receive an error. + Make sure to add them before activating the algorithm! If you want to know which algorithms require a dataset, tokenizer or processor, you can look at the :doc:`Algorithm Overview `. + +Configure Tokenizers and Processors +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +|pruna| integrates directly with the ``transformers`` library. +This means you can use the same tokenizers and processors as in the ``transformers`` library. + +.. tabs:: + + .. tab:: String Identifier + + Use a string identifier to load a tokenizer or processor from the Hugging Face Hub. + + .. code-block:: python + + from pruna import SmashConfig + + smash_config = SmashConfig() + + # Add a tokenizer and a processor using string identifiers + smash_config.add_tokenizer('facebook/opt-125m') + smash_config.add_processor('openai/whisper-large-v3') + + .. tab:: Loading Directly + + Load a tokenizer or processor directly from the Hugging Face Hub with your own configuration. + + ..
code-block:: python + + from pruna import SmashConfig + from transformers import AutoProcessor, AutoTokenizer + + smash_config = SmashConfig() + + # Load a tokenizer directly from the Hugging Face Hub + tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m") + smash_config.add_tokenizer(tokenizer) + + # Load a processor directly from the Hugging Face Hub + processor = AutoProcessor.from_pretrained("openai/whisper-large-v3") + smash_config.add_processor(processor) + +Now that we've set up the tokenizer and processor, we can use them to process our data. + +Configure Datasets +^^^^^^^^^^^^^^^^^^ + +|pruna| provides a variety of pre-configured datasets for different tasks. +You can add a built-in dataset with a string identifier, or add a custom dataset with a collate function. +The table below lists all available datasets. + +.. list-table:: + :header-rows: 1 + + * - Task + - Built-in Dataset ID + - Custom Collate Function + - Collate Function Arguments + * - Text Generation + - `WikiText `_, `SmolTalk `_, `SmolSmolTalk `_, `PubChem `_, `OpenAssistant `_, `C4 `_ + - ``text_generation_collate`` + - ``text: str`` + * - Image Generation + - `LAION256 `_, `OpenImage `_, `COCO `_ + - ``image_generation_collate`` + - ``image: PIL.Image.Image``, ``text: str`` + * - Image Classification + - `ImageNet `_, `MNIST `_, `CIFAR10 `_ + - ``image_classification_collate`` + - ``image: PIL.Image.Image``, ``label: int`` + * - Audio Processing + - `CommonVoice `_, `AIPodcast `_ + - ``audio_collate`` + - ``audio: Optional[torch.Tensor]``, ``path: Optional[str]``, ``sentence: str`` + * - Question Answering + - `Polyglot `_ + - ``question_answering_collate`` + - ``question: str``, ``answer: str`` + +As with tokenizers and processors, you can use a string identifier for a built-in dataset or take a more custom approach with a collate function. +Additionally, you can create a fully custom ``PrunaDataModule`` and use it in your workflow. + +.. tabs:: + + ..
tab:: String Identifier + + Use a string identifier to add one of the built-in datasets listed in the table above. + + .. code-block:: python + + from pruna import SmashConfig + + smash_config = SmashConfig() + smash_config.add_tokenizer('bert-base-uncased') + + # Add a built-in dataset using a string identifier + smash_config.add_data('WikiText') + + .. tab:: Custom Collate Function + + Use a custom collate function to add a custom dataset as a ``(train, val, test)`` tuple. + + In this case, you need to specify the ``collate_fn`` to use for the dataset. + The ``collate_fn`` is a function that takes a list of individual data samples and returns a batch of data in a unified format. + Your dataset must follow the format expected by the ``collate_fn``, as listed in the table above. + + .. code-block:: python + + from pruna import SmashConfig + from pruna.data.utils import split_train_into_train_val_test + from datasets import load_dataset + + # Load custom datasets + train_ds = load_dataset("SamuelYang/bookcorpus")["train"] + train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42) + + # Add to SmashConfig + smash_config = SmashConfig() + smash_config.add_tokenizer("bert-base-uncased") + smash_config.add_data( + (train_ds, val_ds, test_ds), + collate_fn="text_generation_collate" + ) + + .. tab:: PrunaDataModule + + You can also create a fully custom ``PrunaDataModule`` and use it in your workflow. + This process is more flexible but also more complex. It allows for more control over the dataset and the data loading process. + The process for defining a ``PrunaDataModule`` is described on the :doc:`Evaluation ` page; a basic example of adding it to the ``SmashConfig`` is shown below. + + .. code-block:: python + + from pruna import SmashConfig + from pruna.data.pruna_datamodule import PrunaDataModule + + # Load PrunaDataModule + data = PrunaDataModule(...)
+ + # Add to SmashConfig + smash_config = SmashConfig() + smash_config.add_data(data) diff --git a/docs/user_manual/customize.rst b/docs/user_manual/customize.rst new file mode 100644 index 00000000..96f61e21 --- /dev/null +++ b/docs/user_manual/customize.rst @@ -0,0 +1,39 @@ +Customize components +==================== + +|pruna| is designed to be customizable. You can add your own algorithms, datasets, and metrics to the package. + +.. grid:: 1 3 3 3 + + .. grid-item-card:: Add an algorithm + :text-align: center + :link: ./adding_algorithm.rst + + Steps to integrate a new compression algorithm, making it available in the ``SmashConfig``. + + .. grid-item-card:: Add a dataset + :text-align: center + :link: ./adding_dataset.rst + + Steps to integrate a new dataset, making it available in the ``SmashConfig``. + + .. grid-item-card:: Add a metric + :text-align: center + :link: ./adding_metric.rst + + Steps to integrate a new metric, making it available in the |pruna| evaluation pipeline. + +.. tip:: + + You can also contribute your own algorithms, datasets, and metrics back to |pruna|. + Take a look at the :doc:`contributing guide ` to learn more. + If anything is unclear or you want to discuss your contribution before opening a PR, please reach out on `Discord `_ anytime! + +.. toctree:: + :maxdepth: 1 + :caption: Customize components + :hidden: + + adding_algorithm + adding_dataset + adding_metric diff --git a/docs/user_manual/dataset.rst b/docs/user_manual/dataset.rst deleted file mode 100644 index 8094ad37..00000000 --- a/docs/user_manual/dataset.rst +++ /dev/null @@ -1,106 +0,0 @@ -Datasets -========================= - -|pruna| provides a variety of pre-configured datasets for different tasks. This guide will help you understand how to use datasets in your |pruna| workflow.
- -Available Datasets -------------------- - -|pruna| currently supports the following datasets categorized by task: - -Text Generation -^^^^^^^^^^^^^^^ - -| ``WikiText``: Wikipedia text dataset for language modeling -| ``SmolTalk``: Everyday conversation dataset -| ``SmolSmolTalk``: Lightweight version of SmolTalk -| ``PubChem``: Chemical compound dataset in SELFIES format -| ``OpenAssistant``: Instruction-following dataset -| ``C4``: Large-scale web text dataset - -Image Classification -^^^^^^^^^^^^^^^^^^^^ - -| ``ImageNet``: Large-scale image classification dataset -| ``MNIST``: Handwritten digit classification dataset - -Text-to-Image -^^^^^^^^^^^^^^^^^^^^ - -| ``COCO``: Image captioning dataset -| ``LAION256``: Subset of LAION artwork dataset -| ``OpenImage``: Image quality preferences dataset - -Audio Processing -^^^^^^^^^^^^^^^^^^^^ - -| ``CommonVoice``: Multi-language speech dataset -| ``AIPodcast``: AI-focused podcast audio dataset - -Question Answering -^^^^^^^^^^^^^^^^^^^^ - -| ``Polyglot``: Fact completion dataset - -Using Datasets ---------------- - -There are two main ways to use datasets in |pruna|: - -1. Using String Identifier -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -What makes using the already implemented datasets so easy is that you can simply use the dataset's string identifier to add it to your :doc:`SmashConfig `: - -.. code-block:: python - - from pruna import SmashConfig - - smash_config = SmashConfig() - smash_config.add_tokenizer("bert-base-uncased") - smash_config.add_data("WikiText") - -2. Using Custom Datasets -^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can also pass your own datasets as a tuple of ``(train, validation, test)`` datasets: - -.. 
code-block:: python - - from pruna import SmashConfig - from pruna.data.utils import split_train_into_train_val_test - from datasets import load_dataset - - # Load custom datasets - train_ds = load_dataset("SamuelYang/bookcorpus")["train"] - train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42) - - # Add to SmashConfig - smash_config = SmashConfig() - smash_config.add_tokenizer("bert-base-uncased") - smash_config.add_data( - (train_ds, val_ds, test_ds), - collate_fn="text_generation_collate" - ) - -In this case, you need to specify the ``collate_fn`` to use for the dataset. The ``collate_fn`` is a function that takes a list of individual data samples and returns a batch of data in a unified format. -Your dataset will have to adhere to the formats expected by the ``collate_fn`` and this will be checked during a quick compatibility check when adding the dataset to the ``smash_config``. - - -.. autofunction:: pruna.data.collate.text_generation_collate -.. autofunction:: pruna.data.collate.image_generation_collate -.. autofunction:: pruna.data.collate.image_classification_collate -.. autofunction:: pruna.data.collate.audio_collate -.. autofunction:: pruna.data.collate.question_answering_collate - - - -.. _prunadatamodule: - -Accessing the PrunaDataModule directly -------------------------------------- - -You can also create and access the PrunaDataModule directly and use it in your workflow, e.g., if you want to pass it to the :doc:`evaluation agent `. - -.. autoclass:: pruna.data.pruna_datamodule.PrunaDataModule - :members: from_string, from_datasets diff --git a/docs/user_manual/evaluate.rst b/docs/user_manual/evaluate.rst new file mode 100644 index 00000000..b7ab5530 --- /dev/null +++ b/docs/user_manual/evaluate.rst @@ -0,0 +1,230 @@ +Evaluate optimizations with the Evaluation Agent +================================================ + +This guide provides an introduction to evaluating model optimizations with |pruna|. 
+
+Evaluation helps you understand how compression affects your models across different dimensions, from output quality to resource requirements.
+This knowledge is essential for making informed decisions about which compression techniques work best for your specific needs.
+
+Basic Evaluation Workflow
+-------------------------
+
+|pruna| follows a simple workflow for evaluating model optimizations:
+
+.. mermaid::
+    :align: center
+
+    graph LR
+        User -->|creates| Task
+        User -->|creates| EvaluationAgent
+        Task -->|uses| PrunaDataModule
+        Task -->|defines| Metrics
+        Metrics -->|includes| StatefulMetric
+        Metrics -->|includes| StatelessMetric
+        PrunaDataModule -->|provides data| EvaluationAgent
+        PrunaModel -->|provides predictions| EvaluationAgent
+        EvaluationAgent -->|evaluates| PrunaModel
+        EvaluationAgent -->|returns| Evaluation_Results
+        User -->|configures| EvaluationAgent
+
+        subgraph Metric_Types
+            StatefulMetric
+            StatelessMetric
+        end
+
+        style User fill:#bbf,stroke:#333,stroke-width:2px
+        style Task fill:#bbf,stroke:#333,stroke-width:2px
+        style EvaluationAgent fill:#bbf,stroke:#333,stroke-width:2px
+        style PrunaDataModule fill:#bbf,stroke:#333,stroke-width:2px
+        style PrunaModel fill:#bbf,stroke:#333,stroke-width:2px
+        style Evaluation_Results fill:#bbf,stroke:#333,stroke-width:2px
+        style Metrics fill:#bbf,stroke:#333,stroke-width:2px
+
+Let's see what that looks like in code.
+
+.. code-block:: python
+
+    from pruna import PrunaModel
+    from pruna.data.pruna_datamodule import PrunaDataModule
+    from pruna.evaluation.evaluation_agent import EvaluationAgent
+    from pruna.evaluation.task import Task
+
+    # Load a previously smashed and saved model
+    optimized_model = PrunaModel.from_pretrained("saved_model/")
+
+    # Create and configure the Task
+    task = Task(
+        request=["clip_score", "psnr"],
+        datamodule=PrunaDataModule.from_string('LAION256'),
+        device="cpu"
+    )
+
+    # Create the EvaluationAgent
+    eval_agent = EvaluationAgent(task)
+
+    # Evaluate the model and collect the results
+    results = eval_agent.evaluate(optimized_model)
+
+Evaluation Components
+---------------------
+
+The |pruna| package provides a variety of evaluation metrics to assess your models.
+In this section, we'll introduce the components involved in an evaluation: the ``Task``, its metrics and data, and the ``EvaluationAgent``.
+
+Task
+^^^^
+
+The ``Task`` defines what you want to evaluate your model on. It requires a set of metrics and a :ref:`PrunaDataModule ` to perform the evaluation.
+
+.. code-block:: python
+
+    from pruna.evaluation.task import Task
+    from pruna.data.pruna_datamodule import PrunaDataModule
+
+    task = Task(
+        request="image_generation_quality",
+        datamodule=PrunaDataModule.from_string('LAION256'),
+        device="cpu"
+    )
+
+Metrics
+~~~~~~~
+
+Metrics are the core components that calculate specific performance indicators. There are two main types of metrics:
+
+- **Stateful Metrics**: Metrics that maintain internal state and accumulate information across multiple batches. These are typically used for quality assessment.
+- **Stateless Metrics**: Metrics that compute values directly from inputs without maintaining state across batches.
+
+The ``Task`` accepts metrics in three ways:
+
+.. tabs::
+
+    .. tab:: Predefined Options
+
+        As a plain text request from predefined options (e.g., ``image_generation_quality``).
+
+        .. code-block:: python
+
+            from pruna.evaluation.task import Task
+            from pruna.data.pruna_datamodule import PrunaDataModule
+
+            # Create the task
+            task = Task(
+                request="image_generation_quality",
+                datamodule=PrunaDataModule.from_string('LAION256'),
+                device="cpu"
+            )
+
+    .. tab:: List of Metric Names
+
+        As a list of metric names (e.g., [``"clip_score"``, ``"psnr"``]).
+
+        .. code-block:: python
+
+            from pruna.evaluation.task import Task
+            from pruna.data.pruna_datamodule import PrunaDataModule
+
+            # Create the task
+            task = Task(
+                request=["clip_score", "psnr"],
+                datamodule=PrunaDataModule.from_string('LAION256'),
+                device="cpu"
+            )
+
+    .. tab:: List of Metric Instances
+
+        As a list of metric instances, which provides more flexibility in configuring the metrics.
+
+        .. code-block:: python
+
+            from pruna.evaluation.task import Task
+            from pruna.data.pruna_datamodule import PrunaDataModule
+            from pruna.evaluation.metrics.metric_psnr import PSNR
+
+            # Initialize the metrics
+            metrics = [
+                PSNR()
+            ]
+
+            # Create the task
+            task = Task(
+                request=metrics,
+                datamodule=PrunaDataModule.from_string('LAION256'),
+                device="cpu"
+            )
+
+.. note::
+
+    You can find the full list of available metrics in the :ref:`Metric Overview ` section.
+
+PrunaDataModule
+~~~~~~~~~~~~~~~
+
+The ``PrunaDataModule`` defines the data you want to evaluate your model on.
+Data modules are a core component of the evaluation framework, providing standardized access to datasets for evaluating model performance before and after optimization.
+
+They offer the following functionality:
+
+- Standard dataloaders for training, validation, and testing
+- Integration with appropriate collate functions for different data types
+- Support for dataset size limitations for faster evaluation
+- Compatibility with tokenizers for text-based tasks
+
+You can create the ``PrunaDataModule`` that you pass to the ``Task`` in two ways:
+
+.. tabs::
+
+    .. tab:: From String
+
+        As a plain text request from predefined options (e.g., ``LAION256``).
+
+        .. code-block:: python
+
+            from pruna.data.pruna_datamodule import PrunaDataModule
+
+            # Create the data module
+            datamodule = PrunaDataModule.from_string('LAION256')
+
+    .. tab:: From Datasets
+
+        As a tuple of train, validation, and test datasets, which provides more flexibility in configuring the data module.
+
+        .. code-block:: python
+
+            from pruna.data.pruna_datamodule import PrunaDataModule
+            from pruna.data.utils import split_train_into_train_val_test
+            from transformers import AutoTokenizer
+            from datasets import load_dataset
+
+            # Load the tokenizer
+            tokenizer = AutoTokenizer.from_pretrained("gpt2")
+
+            # Load a custom dataset and split it into train, validation, and test sets
+            train_ds = load_dataset("SamuelYang/bookcorpus")["train"]
+            train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42)
+
+            # Create the data module
+            datamodule = PrunaDataModule.from_datasets(
+                datasets=(train_ds, val_ds, test_ds),
+                collate_fn="text_generation_collate",
+                tokenizer=tokenizer,
+                collate_fn_args={"max_seq_len": 512},
+                dataloader_args={"batch_size": 16, "num_workers": 4}
+            )
+
+EvaluationAgent
+^^^^^^^^^^^^^^^
+
+The ``EvaluationAgent`` evaluates the performance of your model.
+It takes a ``Task``, runs inference on the model, passes the ground truth and predictions to the task's metrics, and returns the collected results.
+
diff --git a/docs/user_manual/evaluation.rst b/docs/user_manual/evaluation.rst
deleted file mode 100644
index 8934704b..00000000
--- a/docs/user_manual/evaluation.rst
+++ /dev/null
@@ -1,330 +0,0 @@
-.. _evaluation:
-
-Evaluation Metrics
-===================
-
-The |pruna| package provides helpful evaluation tools to assess your models. In this section, we'll introduce the evaluation metrics you can use with the package.
-
-Evaluation helps you understand how compression affects your models across different dimensions - from output quality to resource requirements. This knowledge is essential for making informed decisions about which compression techniques work best for your specific needs.
-
-.. _quicktutorial:
-
-Quick Tutorial
---------------
-
-Before we start, here's a simple example showing how to evaluate your models using |pruna|.
-
-The rest of this guide provides more detailed explanations of each component and additional features available for model evaluation.
-
-.. code-block:: python
-
-    import copy
-
-    from diffusers import StableDiffusionPipeline
-
-    from pruna import smash, SmashConfig
-    from pruna.data.pruna_datamodule import PrunaDataModule
-    from pruna.evaluation.evaluation_agent import EvaluationAgent
-    from pruna.evaluation.task import Task
-
-    # Load data and set up smash config
-    smash_config = SmashConfig()
-    smash_config['cacher'] = 'deepcache'
-
-    # Load the base model
-    model_path = "CompVis/stable-diffusion-v1-4"
-    pipe = StableDiffusionPipeline.from_pretrained(model_path)
-
-    # Smash the model
-    copy_pipe = copy.deepcopy(pipe)
-    smashed_pipe = smash(copy_pipe, smash_config)
-
-    # Define the task and the evaluation agent
-    metrics = ['clip_score', 'psnr']
-    task = Task(metrics, datamodule=PrunaDataModule.from_string('LAION256'))
-    eval_agent = EvaluationAgent(task)
-
-    # Evaluate base model, all models need to be wrapped in a PrunaModel before passing them to the EvaluationAgent
-    first_results = eval_agent.evaluate(pipe)
-    print(first_results)
-
-    # Evaluate smashed model
-    smashed_results = eval_agent.evaluate(smashed_pipe)
-    print(smashed_results)
-
-
-.. code-block:: python
-
-    # Base model result output
-    {'clip_score_y_x': 28.0828}
-
-    # Smashed model result output
-    {'clip_score_y_x': 28.4500, 'psnr_pairwise_y_gt': 18.7465}
-
-Evaluation Framework
---------------------
-
-The evaluation framework in |pruna| consists of several key components:
-
-Task
-^^^^
-Processes user requests and converts them into a set of metrics. The ``Task`` accepts metrics in three ways:
-
-- As a plain text request from predefined options (e.g., ``image_generation_quality``)
-- As a list of metric names (e.g., [``"clip_score"``, ``"psnr"``]) (see :ref:`Available Metrics ` below)
-- As a list of metric instances
-
-In addition to metrics, ``Task`` requires a :ref:`PrunaDataModule ` to perform the evaluation.
-
-.. autoclass:: pruna.evaluation.task.Task
-
-Currently, ``Task`` supports the following plain textrequests:
-
-- ``image_generation_quality``: Creates metrics for evaluating image generation models (``clip_score``, ``pairwise_clip_score``, ``psnr``)
-
-
-.. code-block:: python
-
-    from pruna.evaluation.task import Task
-    from pruna.data.pruna_datamodule import PrunaDataModule
-
-    task = Task("image_generation_quality", datamodule=PrunaDataModule.from_string('LAION256'))
-
-EvaluationAgent
-^^^^^^^^^^^^^^^
-The main entry point for evaluating models. The ``EvaluationAgent``:
-
-- Takes a ``Task`` object that defines what metrics to use
-- Provides methods to evaluate any model
-- Handles the evaluation process, including separating metrics by execution strategy
-- Runs inference on the model to generate predictions
-- Caches predictions to avoid redundant computations
-- Passes ground truth data and predictions to the appropriate metrics
-- Collects and returns results from all metrics
-
-.. autoclass:: pruna.evaluation.evaluation_agent.EvaluationAgent
-    :members: evaluate
-
-.. container:: hidden_code
-
-    .. code-block:: python
-
-        from pruna.evaluation.task import Task
-        from pruna.data.pruna_datamodule import PrunaDataModule
-
-        data_module = PrunaDataModule.from_string('LAION256')
-        data_module.limit_datasets(10)
-
-        task = Task("image_generation_quality", datamodule=data_module)
-
-.. code-block:: python
-
-    from pruna.evaluation.evaluation_agent import EvaluationAgent
-
-    eval_agent = EvaluationAgent(task)
-
-
-For the full example running evaluation please see :ref:`Quick Tutorial ` above.
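The ``EvaluationAgent`` bullets above mention caching predictions and passing them to the metrics. The metrics section describes a pairwise mode in which the first evaluated model's outputs are cached and reused as the reference for later models. The following minimal plain-Python sketch illustrates only that caching idea. ``ToyPairwiseAgent`` and ``mean_abs_diff`` are hypothetical names, and this is not how pruna's ``EvaluationAgent`` is implemented.

```python
from typing import Callable, Dict, List, Optional

Metric = Callable[[List[float], List[float]], float]


class ToyPairwiseAgent:
    """Hypothetical toy agent, NOT pruna's EvaluationAgent."""

    def __init__(self, metrics: Dict[str, Metric]) -> None:
        self.metrics = metrics
        # Predictions of the first evaluated model, reused as the reference.
        self.reference: Optional[List[float]] = None

    def evaluate(self, model: Callable[[float], float], inputs: List[float]) -> Dict[str, float]:
        preds = [model(x) for x in inputs]
        if self.reference is None:
            # First model: cache its outputs; there is nothing to compare against yet.
            self.reference = preds
            return {}
        # Subsequent models: score every pairwise metric against the cached reference.
        return {name: fn(self.reference, preds) for name, fn in self.metrics.items()}


def mean_abs_diff(reference: List[float], preds: List[float]) -> float:
    """Average absolute difference between reference and new predictions."""
    return sum(abs(r - p) for r, p in zip(reference, preds)) / len(reference)


agent = ToyPairwiseAgent({"mean_abs_diff": mean_abs_diff})
inputs = [1.0, 2.0, 3.0]
base = agent.evaluate(lambda x: 2 * x, inputs)           # stands in for the base model
smashed = agent.evaluate(lambda x: 2 * x + 0.5, inputs)  # stands in for the smashed model
print(base, smashed)
```

Evaluating the base model first matters in this sketch. Whichever model is evaluated first becomes the reference that all later pairwise scores are relative to.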
-
-.. _metrics:
-
-Metrics
--------
-
-Metrics help quantify different aspects of model performance, from output quality to resource requirements. The |pruna| package includes metrics for both quality assessment and resource utilization.
-
-When using the ``EvaluationAgent``, all metrics are executed automatically as part of the evaluation pipeline. The agent handles model inference, data preparation, and passing the appropriate inputs to each metric, eliminating the need to run metrics individually.
-
-Metrics can operate in both single-model and pairwise modes:
-
-- In single-model mode, each evaluation produces independent scores for the model being evaluated.
-- In pairwise mode, metrics compare a subsequent model against the first model evaluated by the agent. Usually, this is used to compare the base model (first model) with its smashed version (subsequent model). The first model's outputs are cached and used as a reference point for all following evaluations. The pairwise comparison produces a single score that quantifies the relationship (e.g., similarity or difference) between the two models.
-
-Our metrics fall into two implementation categories that work differently under the hood:
-
-Base Metrics
-^^^^^^^^^^^^^
-Simple metrics that compute values directly from inputs without maintaining state across batches. Examples include:
-- Model Architecture metrics
-- Energy consumption metrics
-- Memory usage metrics
-
-`elapsed_time `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures inference time, latency, and throughput.
-
-:Evaluation on CPU: Yes.
-:Required:
-    A PrunaModel object that defines the model to evaluate.
-    A DataLoader object that defines the dataloader to evaluate the model on.
-:Parameters:
-
-    | ``n_iterations``: Number of inference iterations to measure (default 100).
-    | ``n_warmup_iterations``: Number of warmup iterations before measurement (default 10).
-    | ``device``: Device to run inference on (default "cuda").
-    | ``timing_type``: Type of timing to use ("sync" or "async", default "sync").
-
-`gpu_memory `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures peak GPU memory usage during model loading and execution.
-
-:Evaluation on CPU: No.
-:Required:
-    Path to the PrunaModel to evaluate.
-    A DataLoader object that defines the dataloader to evaluate the model on.
-    The model class to load the model from the path.
-:Parameters:
-
-    | ``mode``: Memory measurement mode ("disk", "inference", or "training").
-    | ``gpu_indices``: List of GPU indices to monitor (default all available GPUs).
-
-`energy `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures energy consumption in kilowatt-hours (kWh) and CO2 emissions in kilograms (kg).
-
-:Evaluation on CPU: Yes.
-:Description: Measures energy consumption in kilowatt-hours (kWh) and CO2 emissions in kilograms (kg).
-:Required:
-    A PrunaModel object that defines the model to evaluate.
-    A DataLoader object that defines the dataloader to evaluate the model on.
-:Parameters:
-
-    | ``n_iterations``: Number of inference iterations to measure (default 100).
-    | ``n_warmup_iterations``: Number of warmup iterations before measurement (default 10).
-    | ``device``: Device to run inference on (default "cuda").
-
-`model_architecture `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures the number of parameters and MACs (multiply-accumulate operations) in the model.
-
-:Evaluation on CPU: Yes.
-:Required:
-    A PrunaModel object that defines the model to evaluate.
-    A DataLoader object that defines the dataloader to evaluate the model on.
-:Parameters:
-
-    | ``device``: Device to evaluate the model on (default "cuda").
-
-Stateful Metrics
-^^^^^^^^^^^^^^^^^
-Metrics that maintain internal state and accumulate information across multiple batches. These are typically used for quality assessment.
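The following minimal plain-Python sketch makes the stateless/stateful distinction concrete. All names are hypothetical and this is not pruna code. The stateful class follows the ``update()``/``compute()`` pattern that TorchMetrics uses: it accumulates totals across batches and computes one result at the end. The stateless function only ever sees a single batch.

```python
from typing import List


def batch_mean_abs_error(preds: List[float], targets: List[float]) -> float:
    """Stateless: the result depends only on the batch passed in."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


class RunningMeanAbsError:
    """Stateful: accumulates error totals across batches, then computes once."""

    def __init__(self) -> None:
        self.total_error = 0.0
        self.count = 0

    def update(self, preds: List[float], targets: List[float]) -> None:
        # Add this batch's contribution to the running state.
        self.total_error += sum(abs(p - t) for p, t in zip(preds, targets))
        self.count += len(preds)

    def compute(self) -> float:
        # Dataset-level result over every sample seen so far.
        return self.total_error / self.count


batches = [([1.0, 2.0], [1.0, 3.0]), ([5.0], [2.0])]
per_batch = [batch_mean_abs_error(p, t) for p, t in batches]

metric = RunningMeanAbsError()
for p, t in batches:
    metric.update(p, t)
overall = metric.compute()
print(per_batch, overall)
```

Here the per-batch errors are 0.5 and 3.0. The accumulated metric gives 4/3 over all three samples, which differs from averaging the two batch values. This is one reason dataset-level quality metrics are typically stateful.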
-
-Most of our stateful metrics are implemented using the TorchMetricsWrapper, which adapts metrics from the `TorchMetrics `_ library to work within our evaluation framework. This allows us to leverage the robust implementations provided by TorchMetrics while maintaining a consistent API.
-
-`clip_score `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures the similarity between images and text using the CLIP model.
-
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics CLIPScore implementation.
-
-`pairwise_clip_score `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Measures the similarity between images of first and subsequent models using the CLIP model.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics CLIPScore implementation.
-
-`cmmd `_
-""""""""""""""""""""""""""""""""""""""""""""
-
-CMMD measures the distributional discrepancy between two sets of images or text by computing Maximum Mean Discrepancy (MMD) in the CLIP embedding space. It captures both semantic and visual alignment.
-
-Key Benefits:
-
-- **Distribution-Free:** Does not rely on any assumptions about the underlying feature distribution.
-- **Unbiased Estimation:** Provides a statistically unbiased measure of the discrepancy between two image sets.
-- **Sample Efficiency:** Achieves reliable estimates even with smaller image samples, making it suitable for rapid evaluations.
-- **Human-Aligned:** Demonstrates better agreement with human perceptual assessments of image quality compared to FID.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters:
-
-    | ``clip_model_name``: Name of the CLIP model to use (default "openai/clip-vit-large-patch14-336").
-    | ``call_type``: Call type to use for the metric (default "gt_y"). For pairwise evaluation pass "pairwise" or "pairwise_gt_y".
-    | ``device``: Device to run the metric on (default "cuda").
-
-
-
-`accuracy `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the proportion of correct predictions in classification tasks.
-
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions. TorchMetrics requires a 'task' parameter to be set to 'binary', 'multiclass', or 'multilabel'. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
-:Parameters: Accepts all parameters from the TorchMetrics Accuracy implementation (task, num_classes, threshold, etc.).
-
-`precision `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the proportion of positive identifications that were actually correct.
-
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions. TorchMetrics requires a 'task' parameter to be set to 'binary', 'multiclass', or 'multilabel'. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
-:Parameters: Accepts all parameters from the TorchMetrics Precision implementation (task, num_classes, threshold, etc.).
-
-`recall `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the proportion of actual positives that were identified correctly.
-
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions. TorchMetrics requires a 'task' parameter to be set to 'binary', 'multiclass', or 'multilabel'. Each task type may have additional specific requirements - please refer to the TorchMetrics documentation for details.
-:Parameters: Accepts all parameters from the TorchMetrics Recall implementation (task, num_classes, threshold, etc.).
-
-`perplexity `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures how well a probability model predicts a text sample.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics Perplexity implementation.
-
-`fid `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the similarity between generated and real image distributions using the Frechet Distance between Gaussian distributions fitted to the Inception embeddings of the generated and real images.
-
-FID compares the **distribution** of real and generated images in a high-dimensional feature space. Since it estimates **mean and covariance statistics**, smaller sample sizes can introduce high variance, making the metric less stable. Large-scale evaluations often use **tens of thousands of images**, but for practical use, smaller sample sizes may still provide a reasonable approximation.
-
-**Computation Considerations**
-
-When generating images and computing FID on **thousands to tens of thousands of samples**, the process can take **multiple hours to several days**, even on a high-end GPU like an **A100 or RTX 4090**. On mid-range GPUs like a **3060 or 4060**, it can take **significantly longer**. A rough approximation using **a few thousand images** may still take **several hours**, even with strong hardware.
-
-:Evaluation on CPU: No (impractical due to the high computational cost)
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics FrechetInceptionDistance implementation (feature extraction parameters, etc.).
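The perplexity entry above can be made concrete with the standard definition: perplexity is the exponential of the mean negative log-likelihood that the model assigns to the true tokens. This plain-Python sketch assumes, for illustration, that the per-token probabilities of the correct tokens are already available. The TorchMetrics implementation instead works from the model's per-token scores over the vocabulary together with the target tokens.

```python
import math
from typing import List


def perplexity(token_probs: List[float]) -> float:
    """Perplexity from the probabilities a model assigned to each true token."""
    # Mean negative log-likelihood over the sequence.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # guessing uniformly among 4 options
confident = perplexity([0.9, 0.8, 0.95])        # mostly confident, correct predictions
print(uniform, confident)
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, like a uniform guess among four options. More confident correct predictions push perplexity toward its minimum of 1, so lower is better.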
-
-`psnr `_
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the peak signal-to-noise ratio (PSNR) between two images.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics PSNR implementation.
-
-`ssim `_
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the structural similarity index (SSIM) between two images.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics SSIM implementation.
-
-`lpips `_
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-Measures the Learned Perceptual Image Patch Similarity (LPIPS) between two images.
-
-:Evaluation on CPU: Yes.
-:Required: Inputs, ground truth and predictions.
-:Parameters: Accepts all parameters from the TorchMetrics LPIPS implementation.
\ No newline at end of file
diff --git a/docs/user_manual/optimize.rst b/docs/user_manual/optimize.rst
new file mode 100644
index 00000000..408b05dd
--- /dev/null
+++ b/docs/user_manual/optimize.rst
@@ -0,0 +1,208 @@
+Optimize your first model
+=========================
+
+This guide provides a quick introduction to optimizing AI models with |pruna|.
+
+You'll learn how to use Pruna's core functionality to make your models faster, smaller, cheaper, and greener.
+For installation instructions, see :doc:`Installation `.
+
+Basic Optimization Workflow
+---------------------------
+
+|pruna| follows a simple workflow for optimizing models:
+
+.. mermaid::
+    :align: center
+
+    graph LR
+        A[Load Model] --> B[Define SmashConfig]
+        B --> C[Optimize Model]
+        C --> D[Evaluate Model]
+        D --> E[Run Inference]
+        style A fill:#bbf,stroke:#333,stroke-width:2px
+        style B fill:#bbf,stroke:#333,stroke-width:2px
+        style C fill:#bbf,stroke:#333,stroke-width:2px
+        style D fill:#bbf,stroke:#333,stroke-width:2px
+        style E fill:#bbf,stroke:#333,stroke-width:2px
+
+Let's see what that looks like in code.
+
+.. code-block:: python
+
+    from pruna import smash, SmashConfig
+    from diffusers import StableDiffusionPipeline
+    from pruna.data.pruna_datamodule import PrunaDataModule
+    from pruna.evaluation.evaluation_agent import EvaluationAgent
+    from pruna.evaluation.task import Task
+
+    # Load the model
+    model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+
+    # Create and configure SmashConfig
+    smash_config = SmashConfig()
+    smash_config["cacher"] = "deepcache"
+
+    # Optimize the model
+    optimized_model = smash(model=model, smash_config=smash_config)
+
+    # Evaluate the model
+    metrics = ['clip_score', 'psnr']
+    task = Task(metrics, datamodule=PrunaDataModule.from_string('LAION256'))
+    eval_agent = EvaluationAgent(task)
+    eval_agent.evaluate(optimized_model)
+
+    # Run inference
+    optimized_model("A serene landscape with mountains").images[0]
+
+Step-by-Step Optimization Workflow
+----------------------------------
+
+Step 1: Load a pretrained model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First, load any model using its original library, like ``transformers`` or ``diffusers``:
+
+.. code-block:: python
+
+    from diffusers import StableDiffusionPipeline
+
+    base_model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+
+
+Step 2: Define optimizations with a ``SmashConfig``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After loading the model, we can define a ``SmashConfig`` to customize the optimizations we want to apply.
+This ``SmashConfig`` is a dictionary-like object that configures which optimizations to apply to your model.
+You can specify multiple optimization algorithms from different categories, such as batching, caching, and quantization.
+
+For now, let's just use a ``cacher`` to accelerate the model during inference.
+
+.. code-block:: python
+
+    from pruna import SmashConfig
+
+    smash_config = SmashConfig()
+    smash_config["cacher"] = "deepcache"  # Accelerate the model with caching
+
+Pruna supports a wide range of algorithms for specific optimizations, each with different trade-offs.
+To understand how to configure the right one for your scenario, see :doc:`Define a SmashConfig `.
+
+Step 3: Apply optimizations with ``smash``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``smash()`` function is the core of Pruna. It takes your model and a ``SmashConfig`` and applies the specified optimizations.
+Let's use the ``smash()`` function to apply the configured optimizations:
+
+.. code-block:: python
+
+    from pruna import smash
+
+    optimized_model = smash(model=base_model, smash_config=smash_config)
+
+
+The ``smash()`` function returns a ``PrunaModel``, a wrapper that provides a standardized interface for the optimized model, so you can still use it as you would use the original model.
+
+Step 4: Evaluate the optimized model with the ``EvaluationAgent``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To evaluate the optimized model, define a ``Task`` with the metrics and data to use, pass it to an ``EvaluationAgent``, and call ``evaluate()`` on the optimized model.
+
+.. code-block:: python
+
+    from pruna.data.pruna_datamodule import PrunaDataModule
+    from pruna.evaluation.evaluation_agent import EvaluationAgent
+    from pruna.evaluation.task import Task
+
+    metrics = ['clip_score', 'psnr']
+    task = Task(metrics, datamodule=PrunaDataModule.from_string('LAION256'))
+    eval_agent = EvaluationAgent(task)
+    eval_agent.evaluate(optimized_model)
+
+To understand how to run more complex evaluation workflows, see :doc:`Evaluate a model `.
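Step 4 requests the ``psnr`` metric. To help read that number, recall the standard definition: PSNR compares two images through their mean squared error, PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below assumes pixel values in [0, 1] stored as flat Python lists. It illustrates the formula only and is not pruna's or TorchMetrics' implementation.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], prediction: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


ref = [0.0, 0.5, 1.0, 0.25]
pred = [0.1, 0.4, 0.9, 0.35]  # every pixel off by 0.1
score = psnr(ref, pred)
print(score)
```

A per-pixel error of 0.1 gives an MSE of 0.01 and therefore about 20 dB. Identical images give infinite PSNR, so higher values mean the optimized model's outputs stay closer to the reference.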
+
+Step 5: Run inference with the optimized model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To run inference with the optimized model, use the same interface as the original model.
+
+.. code-block:: python
+
+    optimized_model("A serene landscape with mountains").images[0]
+
+Example use cases
+-----------------
+
+Let's look at some specific examples for different model types.
+
+Example 1: Diffusion Model Optimization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from diffusers import StableDiffusionPipeline
+    from pruna import smash, SmashConfig
+
+    # Load the model
+    model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
+
+    # Create and configure SmashConfig
+    smash_config = SmashConfig()
+    smash_config["cacher"] = "deepcache"
+    smash_config["compiler"] = "stable_fast"
+
+    # Optimize the model
+    optimized_model = smash(model=model, smash_config=smash_config)
+
+    # Generate an image
+    optimized_model("A serene landscape with mountains").images[0]
+
+Example 2: Large Language Model Optimization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    from pruna import smash, SmashConfig
+
+    # Load the model and its tokenizer
+    model_id = "facebook/opt-125m"
+    model = AutoModelForCausalLM.from_pretrained(model_id)
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    # Create and configure SmashConfig (GPTQ requires a tokenizer and calibration data)
+    smash_config = SmashConfig()
+    smash_config.add_tokenizer(model_id)
+    smash_config.add_data("WikiText")
+    smash_config["quantizer"] = "gptq"  # Apply GPTQ quantization
+
+    # Optimize the model
+    optimized_model = smash(model=model, smash_config=smash_config)
+
+    # Tokenize a prompt and generate a continuation
+    inputs = tokenizer("The best way to learn programming is", return_tensors="pt").to(optimized_model.device)
+    outputs = optimized_model.generate(**inputs, max_new_tokens=20)
+    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+
+
+Example 3: Speech Recognition Optimization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from transformers import AutoModelForSpeechSeq2Seq
+    from pruna import smash, SmashConfig
+    import torch
+
+    # Load the model
+    model_id = "openai/whisper-large-v3"
+    model = AutoModelForSpeechSeq2Seq.from_pretrained(
+        model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
+    ).to("cuda")
+
+    # Create and configure SmashConfig
+    smash_config = SmashConfig()
+    smash_config.add_processor(model_id)  # Required for Whisper
+    smash_config["compiler"] = "c_whisper"
+    smash_config["batcher"] = "whisper_s2t"
+
+    # Optimize the model
+    optimized_model = smash(model=model, smash_config=smash_config)
+
+    # Use the model for transcription
+    optimized_model("audio_file.wav")
+
diff --git a/docs/user_manual/save_load.rst b/docs/user_manual/save_load.rst
index 4ab1b0cb..01377a60 100644
--- a/docs/user_manual/save_load.rst
+++ b/docs/user_manual/save_load.rst
@@ -1,10 +1,12 @@
-Saving and Loading Pruna Models
-===============================
+Save and Load Models
+=====================
 
-After smashing a model using |pruna|, you can save it to disk and load it later using the built-in save and load functionality.
+This guide provides a quick introduction to saving and loading optimized AI models with |pruna|.
 
-Saving and Loading Models
--------------------------
+You will learn how to save and load a ``PrunaModel`` after smashing a model using |pruna|. Haven't smashed a model yet? Check out the :doc:`optimize guide ` to learn how to do that.
+
+Saving a ``PrunaModel``
+-----------------------
 
 To save a smashed model, use the ``PrunaModel.save_pretrained()`` or ``PrunaModel.save_to_hub()`` method. This method saves all necessary model files and as well as the smash configuration to the specified directory:
 
@@ -46,13 +48,20 @@ To save a smashed model, use the ``PrunaModel.save_pretrained()`` or ``PrunaMode
         smashed_model = smash(model=base_model, smash_config=smash_config)
 
         # Save the model
-        smashed_model.save_to_hub("PrunaAI/smashed-stable-diffusion-v1-4")
+        smashed_model.save_to_hub("PrunaAI/stable-diffusion-v1-4-smashed")
+
+    .. tip::
+
+        When saving models to the hub, we recommend using a suffix like ``-smashed`` to indicate that the model has been smashed with |pruna|.
 
 The save operation will:
 1. Save the model weights and architecture, including information on how to load the model later on
 2. Save the ``smash_config`` (including tokenizer and processor if present, data will be detached and not reloaded)
 
+Loading a ``PrunaModel``
+------------------------
+
 To load a previously saved ``PrunaModel``, use the ``PrunaModel.from_pretrained()`` or ``PrunaModel.from_hub()`` class method:
 
 .. tabs::
 
@@ -74,6 +83,7 @@ To load a previously saved ``PrunaModel``, use the ``PrunaModel.from_pretrained(
         loaded_model = PrunaModel.from_hub("PrunaAI/smashed-stable-diffusion-v1-4")
 
 The load operation will:
+
 1. Load the model architecture and weights and cast them to the device specified in the SmashConfig
 2. Restore the smash configuration
 
@@ -103,8 +113,6 @@ you should also load the smashed model as follows:
 
 Depending on the saving function of the algorithm combination not all keyword arguments are required for loading (e.g. some are set by the algorithm combination itself). In that case, we discard and log a warning about unused keyword arguments.
 
-
-
 Algorithm Reapplication
 ~~~~~~~~~~~~~~~~~~~~~~~~
 Some algorithms, particularly compilers and certain quantization methods, need to be reapplied after loading, as, for example, a compiled model can be rarely saved in its compiled state.
@@ -118,10 +126,4 @@ Set ``verbose=True`` when loading if you want to see warning messages as well as
 
     from pruna import PrunaModel
 
-    loaded_model = PrunaModel.from_pretrained("saved_model/", verbose=True)
-
-``PrunaModel`` Function Documentation
----------------------------------------------
-
-.. autoclass:: pruna.engine.pruna_model.PrunaModel
-    :members: from_pretrained, from_hub, save_to_hub, save_pretrained
\ No newline at end of file
+    loaded_model = PrunaModel.from_pretrained("saved_model/", verbose=True)
\ No newline at end of file
diff --git a/docs/user_manual/smash.rst b/docs/user_manual/smash.rst
deleted file mode 100644
index fd6aff7d..00000000
--- a/docs/user_manual/smash.rst
+++ /dev/null
@@ -1,57 +0,0 @@
-smash
-=========================
-
-The ``smash`` function is the main function in |pruna| for optimizing models. In the following sections we will show you how to use it.
-
-Calling the ``smash`` Function
----------------------------------------------
-
-In preparation to using ``smash``, we have to load our model and define a ``SmashConfig``. As an example, we will take a simple model by loading the ``ViT-B/16`` model from ``torchvision``.
-
-.. code-block:: python
-
-    import torchvision
-
-    base_model = torchvision.models.vit_b_16(weights="ViT_B_16_Weights.DEFAULT").cuda()
-
-Next, we will define a :doc:`SmashConfig ` and activate the ``torch_compile`` compiler.
-
-.. code-block:: python
-
-    from pruna import SmashConfig
-    smash_config = SmashConfig()
-    smash_config['compiler'] = 'torch_compile'
-
-We are now ready to call the ``smash`` function!
-
-We can pass the model and the ``SmashConfig`` to the ``smash`` function as follows:
-
-.. code-block:: python
-
-    from pruna import smash
-
-    smashed_model = smash(
-        model=base_model,
-        smash_config=smash_config,
-    )
-
-The resulting smashed model can be used in the same way as the original one.
- -We perform compatibility checks to ensure that the model is compatible with the algorithms that you have selected at the beginning of the ``smash`` process. If you wish to skip these checks, you can set the ``experimental`` flag to ``True``: - -.. code-block:: python - - smashed_model = smash( - model=base_model, - smash_config=smash_config, - experimental=True, - ) - -Please note that this can lead to undefined behavior or difficult-to-debug errors. - -Importantly, the returned model offers save and load functionality that allows you to save the model and load it in its smashed state, see :doc:`save_load`. - -``smash`` Function Documentation ---------------------------------------------- - -.. autofunction:: pruna.smash.smash \ No newline at end of file diff --git a/docs/user_manual/smash_config.rst b/docs/user_manual/smash_config.rst deleted file mode 100644 index f6509d67..00000000 --- a/docs/user_manual/smash_config.rst +++ /dev/null @@ -1,67 +0,0 @@ -SmashConfig -========================= - -``SmashConfig`` is an essential tool in |pruna| for configuring parameters to optimize your models. This manual explains how to define and use a ``SmashConfig``. - -Defining a simple ``SmashConfig`` ---------------------------------- - -Define a ``SmashConfig`` using the following snippet: - -.. code-block:: python - - from pruna import SmashConfig - smash_config = SmashConfig() - -After creating an empty ``SmashConfig``, you can set activate a algorithm by adding it to the ``SmashConfig``: - -.. code-block:: python - - smash_config['quantizer'] = 'hqq' - -Additionally, you can overwrite :doc:`the defaults of the algorithm ` you have added by setting the hyperparameters in the ``SmashConfig``: - -.. code-block:: python - - smash_config['hqq_weight_bits'] = 4 - -You're done! You created your ``SmashConfig`` and can now :doc:`pass it to the smash function. 
` - - -Adding a Dataset, Tokenizer or Processor ----------------------------------------- - -Some algorithms require a dataset, tokenizer or processor to be passed to the ``SmashConfig``. -For example, the ``gptq`` quantizer requires a dataset and a tokenizer. We can pass them to the ``SmashConfig`` e.g. as follows: - -.. code-block:: python - - from pruna import SmashConfig - smash_config = SmashConfig() - smash_config.add_tokenizer("facebook/opt-125m") - smash_config.add_data("WikiText") - -As you can see in this example, we can add a dataset simply by passing the name of the dataset. However, the ``add_data()`` function also supports other input formats. For more information, see the :doc:`dataset documentation `. - -We can now activate the ``gptq`` quantizer by adding it to the ``SmashConfig``: - -.. code-block:: python - - smash_config['quantizers'] = 'gptq' - -Similarly, we can add a processor to the ``SmashConfig`` if required, like for example by the ``c_whisper`` compiler: - -.. code-block:: python - - from pruna import SmashConfig - smash_config = SmashConfig() - smash_config.add_processor("openai/whisper-large-v3") - smash_config['compiler'] = 'c_whisper' - -If you try to activate a algorithm that requires a dataset, tokenizer or processor and haven't added them to the ``SmashConfig``, you will receive an error. Make sure to add them before activating the algorithm! If you want to know which algorithms require a dataset, tokenizer or processor, you can look at :doc:`the compression algorithm overview `. - -``SmashConfig`` Documentation ---------------------------------------------- - -.. autoclass:: pruna.config.smash_config.SmashConfig - :members: add_data, add_tokenizer, add_processor, save_to_json, load_from_json, flush_configuration, load_dict