Michael Stewart

Self-Hosting Large Language Models Using AWS SageMaker

This week, OpenAI released the latest iteration of their family of instruction-trained large language models, GPT-4. Accompanying the model was a technical report that left many once again concerned about how “open” the company really is.

Although OpenAI offers little transparency about the data and weights used to train the model, there are open-source LLMs that can be deployed to your cloud platform of choice.

The good folk over at Hugging Face host a large library of natural language models that are free to download and use, including some quite similar to the GPT family. One of these is Bloom, from BigScience, which they describe as the “first multilingual LLM trained in complete transparency, to change this status quo — the result of the largest collaboration of AI researchers ever involved in a single research project.” The Bloom model can handle text generation, summarization, classification, translation, and more. Sounds great.

Cloud platforms like Microsoft Azure and Amazon Web Services are making the deployment of Hugging Face language models increasingly accessible with new sets of libraries and services. Let’s go through the process of deploying a Bloom text generation model to AWS.

Since I don’t have the resources/funds available to deploy the full 175B-parameter Bloom model (although I’m happy to try if anyone wants to donate), I went with a smaller version, bloom-3b, which, as the name suggests, is a 3B-parameter alternative.

Steps

The very first step is to download the model from the Hugging Face repository. Make sure Git LFS is installed before cloning, since the model weights are stored with it.

git clone https://huggingface.co/bigscience/bloom-3b

Next, we need to archive these files into a tarball (a .tar.gz file). The tar utility is native to Unix-based systems, while on Windows you will have to use a tool such as 7-Zip.
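
Alternatively, a few lines of Python will do it on any platform. A minimal sketch, assuming the repository was cloned into a local bloom-3b directory:

import tarfile

def exclude_git(tarinfo):
    # The .git directory holds repository history, not model artifacts; skip it.
    return None if ".git" in tarinfo.name.split("/") else tarinfo

# SageMaker expects the model files at the root of the archive,
# so add the directory's contents rather than the directory itself.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("bloom-3b", arcname=".", filter=exclude_git)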

Jump into your AWS account and drop the tarball into any S3 bucket.
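
The upload can be done through the console, the AWS CLI, or a couple of lines of boto3; the bucket name below is a placeholder:

import boto3

s3 = boto3.client("s3")
# Replace the bucket name with your own, and note the resulting S3 URI for later.
s3.upload_file("model.tar.gz", "your-bucket-name", "bloom-3b/model.tar.gz")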


Next is to create the model, endpoint configuration and endpoint in AWS SageMaker. There are two options here: one is through code using the SageMaker SDK, and the other is doing it manually through the AWS Console. Although the code option isn’t that complicated, for simplicity’s sake, let’s go through it in the console.
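
For reference, here is a minimal sketch of the SDK route. The role ARN, bucket name and version pins are assumptions you would swap for your own:

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://your-bucket-name/bloom-3b/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # your SageMaker IAM role
    transformers_version="4.26",  # choose versions with a matching prebuilt image
    pytorch_version="1.13",
    py_version="py39",
    env={
        "HF_MODEL_ID": "bigscience/bloom-3b",
        "HF_TASK": "text-generation",
    },
)

# deploy() creates the model, endpoint configuration and endpoint in one call.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.2xlarge",
)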


Navigate to SageMaker and go to Inference > Models.

Create a model and give it a name. If you don’t have an existing SageMaker IAM role, let AWS create one for you. Check “Provide model artifacts and inference image location” and paste the S3 location of the tarball into the field for model artifacts. For the location of the inference code image, we can either build our own or reference an existing image. There are some listed here; you just need to change the region to match the AWS region you are deploying in. We also need to assign two environment variables:
HF_MODEL_ID=bigscience/bloom-3b
HF_TASK=text-generation
Go ahead and create the model.
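
For completeness, the rough boto3 equivalent of this console step, with placeholder names; the image URI must be copied from the list linked above for your region:

import boto3

sm = boto3.client("sagemaker")

# Placeholder image URI: copy the real one for your region from the list above.
IMAGE_URI = "<huggingface-inference-image-for-your-region>"

sm.create_model(
    ModelName="bloom-3b-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    PrimaryContainer={
        "Image": IMAGE_URI,
        "ModelDataUrl": "s3://your-bucket-name/bloom-3b/model.tar.gz",
        "Environment": {
            "HF_MODEL_ID": "bigscience/bloom-3b",
            "HF_TASK": "text-generation",
        },
    },
)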


Move over to endpoint configurations and create a new one. Give it a name and scroll down to “Variants”, where we should select the model that we just created. Before moving on, select edit and update the instance type, since running the model needs some significant resources. I went with ml.g4dn.2xlarge, but I believe you should be able to get away with ml.g4dn.xlarge. Create the endpoint configuration.
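
The boto3 version of the same step, reusing the placeholder model name from above:

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="bloom-3b-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "bloom-3b-model",
            "InitialInstanceCount": 1,
            "InstanceType": "ml.g4dn.2xlarge",
        }
    ],
)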


The last step is to create the endpoint itself. Give it a name, assign the endpoint configuration and hit create.
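
And in boto3, again with placeholder names:

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint(
    EndpointName="bloom-3b-endpoint",
    EndpointConfigName="bloom-3b-config",
)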

The creation process may take a while; in my experience it can vary from 5 minutes up to 30. You can check on the progress in CloudWatch by filtering on “/aws/sagemaker” and finding the endpoint.
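
If you would rather have a script block until the endpoint is ready, boto3 ships a waiter for exactly this; the endpoint name is again a placeholder:

import boto3

sm = boto3.client("sagemaker")

# Polls the endpoint status and returns once it reaches InService.
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="bloom-3b-endpoint")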


Once the endpoint is created, it will move to “InService” status and we can test it through an API client such as Postman. Click into the endpoint and copy the URL.

Open up your API client and paste the URL in as a POST request. You will need to use AWS Signature authorization, providing your access key, secret key and region, and setting the service name to “sagemaker”.

Update the body for your request and check the response.
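
The same request can be made from code, which also spares you the manual signing. A boto3 sketch, using the standard Hugging Face inference payload and the placeholder endpoint name from earlier:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="bloom-3b-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "The best thing about self-hosting an LLM is"}),
)

# The text-generation handler returns a list of generated sequences.
print(json.loads(response["Body"].read()))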

Not too bad, honestly. From my testing, it does quite well on simple text generation tasks. You can also try out the 175B parameter model here.


If you are just experimenting, you’ll want to go ahead and delete everything you just created so as not to incur unwanted costs. If you want to keep going, you can integrate your SageMaker endpoint with API Gateway and use it in your application.
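
Cleanup can be scripted too; the same placeholder names apply, deleting in reverse order of creation:

import boto3

sm = boto3.client("sagemaker")

sm.delete_endpoint(EndpointName="bloom-3b-endpoint")
sm.delete_endpoint_config(EndpointConfigName="bloom-3b-config")
sm.delete_model(ModelName="bloom-3b-model")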

Worth It?

The question now becomes whether it is viable to “roll your own” with a model like Bloom, or better to use OpenAI. There isn’t really a simple answer. If data sovereignty is of the utmost importance, then of course it makes sense to host your own model and API, keeping control over the data instead of handing it over to our tech overlords in sunny California. On the other hand, using GPT-3/4 requires very little setup time and may yield better results. Other variables to consider are performance and operational cost.

Unsurprisingly, in terms of pure compute costs, running Bloom to churn out text content is going to be significantly cheaper than running the same task on OpenAI’s GPT-3. Moreover, depending on your use case, you could opt for a smaller, much cheaper model such as bloom-3b. OpenAI also offers different models that are cheaper and better suited to a variety of tasks. Based on this thread, the maximum cost is unclear, but probably on the order of a few hundred dollars a month. If your OpenAI bill is running way past that, or you are looking to massively scale your production use of language models, then self-hosting may well be the way to go.

Conclusion

Accessibility to language models has come a long way in a short period of time, thanks to OpenAI, Hugging Face and open-source models. While OpenAI helps reduce the barrier to entry for this technology, the trade-off is a lack of transparency. That lack of transparency is not insignificant, appears to be deliberate, and raises reasonable concerns. Thankfully, open-source solutions help to solve this problem as well. OpenAI and Microsoft are in a frantically paced race with Google, Facebook, and other researchers and enthusiasts. One project to keep an eye on, Stanford Alpaca, is tuned at a very low cost using instruction data generated from OpenAI’s text-davinci-003.

I am not so concerned with any one model or company becoming a “monopoly”. My personal opinion is that there should be competition between them all, and even within a single system we can use one or many models. There is a bigger picture to consider, philosophical rather than technical, that includes ethics, “AGI” (yeah, I had to put it in there), robust evaluation of output, regulation, data ownership and transparency.

“The future always comes too fast and in the wrong order.” — Alvin Toffler

