{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Llama-2-7b Inference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we compile and deploy the Hugging Face [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model for tensor parallel inference on Neuron using the `Neuronx-Distributed` package.\n", "\n", "> Note: This model is not currently optimized for performance on neuronx-distributed. For optimized llama-2 inference use transformers-neuronx.\n", "\n", "The example has the following main sections:\n", "\n", "1. Set up the Jupyter Notebook\n", "1. Install dependencies\n", "1. Download the model\n", "1. Trace the model\n", "1. Perform greedy sampling\n", "1. Benchmark sampling\n", "\n", "This Jupyter Notebook can be run on a Trn1 instance (`trn1.32xlarge`). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up the Jupyter Notebook\n", "\n", "The following steps set up Jupyter Notebook and launch this tutorial:\n", "1. Clone the [Neuronx-Distributed](https://github.com/aws-neuron/neuronx-distributed.git) repo to your instance using\n", "```\n", "git clone https://github.com/aws-neuron/neuronx-distributed.git\n", "```\n", "\n", "2. Navigate to the `examples/inference` samples folder\n", "```\n", "cd neuronx-distributed/example/inference/\n", "```\n", "\n", "3. Copy the tutorial notebook `llama2_inference.ipynb` to the `example/inference/` directory. \n", "```\n", "wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb\n", "```\n", "\n", "4. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies\n", "This tutorial requires the following pip packages:\n", "\n", " - `torch-neuronx`\n", " - `neuronx-cc`\n", " - `sentencepiece`\n", " - `transformers`\n", " - `neuronx-distributed`\n", "\n", "You can install `neuronx-distributed` using the [setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). Most of other packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install transformers==4.35.2 sentencepiece " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the model \n", "Use of this model is governed by the Meta license. In order to download the model weights and tokenizer follow the instructions in [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). \n", "\n", "Once granted access, you can download the model. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_path = \"/home/ubuntu/models/Llama-2-7b-chat-hf\"\n", "traced_model_path = \"/home/ubuntu/models/llama-2-7b-chat-hf-trace\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Trace the model\n", "\n", "Now we can trace the model using the LlamaRunner script. This saves the traced model to `traced_model_path`. Tracing the 7b model can take up to 70 minutes.\n", "\n", "In this sample we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from llama2.llama2_runner import LlamaRunner\n", "\n", "max_context_length = 128\n", "max_new_tokens = 384\n", "batch_size = 2\n", "tp_degree = 32\n", "\n", "runner = LlamaRunner(model_path=model_path,\n", "                     tokenizer_path=model_path)\n", "\n", "runner.trace(traced_model_path=traced_model_path,\n", "             tp_degree=tp_degree,\n", "             batch_size=batch_size,\n", "             max_context_length=max_context_length,\n", "             max_new_tokens=max_new_tokens)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "Now let's load the traced model and perform autoregressive (greedy) sampling." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generate_ids, outputs = runner.generate_on_neuron(prompt=[\"I believe the meaning of life is\", \"The color of the sky is\"],\n", "                                                  traced_model_path=traced_model_path)\n", "\n", "for idx, output in enumerate(outputs):\n", "    print(f\"output {idx}: {output}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmarking\n", "\n", "Here we benchmark the per-token latency for greedy sampling." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = runner.benchmark_sampling(traced_model_path)" ] } ],
"metadata": { "kernelspec": { "display_name": "aws_neuron_venv_pytorch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }