Serve Hugging Face Models Locally over HTTP Using Transformers & Flask

Background

I am working on an internal RAG application and am currently in the initial data ingestion phase, while also testing different models to see which one fits the job at hand best. The project will need a multi-model architecture, so quick testing of different models is a must.

As of the first week of October 2024, the meta-llama/Llama-3.2 series of models seems like the "best" fit for the job (testing in progress). For quick testing you go to the model page, e.g. Meta-llama-3.2-1B, copy the default code, and run it on your local machine. Local, because of internal data and security reasons.

import torch
from transformers import pipeline

token = "your-hf-token"
model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=token,
)
messages = [
    {
        "role": "system",
        "content": "Your are a expert full stack developer.",
    },
    {"role": "user", "content": "Who are you?"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Now this is all well and good, but I got stuck when pairing it with a separate document-loading project to create simple summaries for embeddings. Before I go any further: there are plenty of ways to do this, llama.cpp, LitServe and many more, some very complex and some very simple.

But I wanted to keep a few things very simple and easy to manage, as this is the initial phase of the project. A few Google searches ("serve huggingface model locally") later, I couldn't find a guide that did exactly this; some came close, but none were quite what I was looking for. There may well be one out there, but my search terms probably weren't good enough.

Time to write one myself. It really turned out to be a simple task, thanks to the Transformers library and Flask. Since the model was already in the local cache, I didn't have to worry about downloading it again.

from flask import Flask, request, jsonify
from transformers import pipeline
import torch

token = "your-hf-token"
model_id = "meta-llama/Llama-3.2-1B-Instruct"

app = Flask(__name__)

# Initialize the pipeline
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    token=token,
)


@app.route("/chat", methods=["POST"])
def chat():
    # Get input data from request
    data = request.json

    # Perform inference
    input_text = data.get("text", "")
    results = pipe(
        input_text,
        max_new_tokens=256,
    )

    # Return response
    return jsonify(results)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Testing works as expected.
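For reference, here is roughly how you can hit the /chat endpoint from Python; a minimal sketch using the requests library, assuming the server above is running on localhost:5000.

import requests

# Assumes the Flask server above is running locally on port 5000
resp = requests.post(
    "http://localhost:5000/chat",
    json={"text": "Summarize what a RAG pipeline does in one sentence."},
    timeout=120,  # generation can take a while on CPU/MPS
)
resp.raise_for_status()
print(resp.json())

Note that the endpoint passes a plain string (not a chat-style messages list) to the pipeline, so no chat template is applied and the returned "generated_text" includes the prompt followed by the continuation.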

And that's it. You can run multiple models simultaneously by running multiple instances of the script with different model IDs (and different ports), or add more endpoints to the Flask app to serve different models or tasks, which makes it easy to compare them side by side. A sketch of the extra-endpoint idea follows below.
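As an illustration of the extra-endpoint idea, here is a minimal sketch of a second route backed by a summarization pipeline; the /summarize route, the model ID, and the port are just examples, and in practice you could add the route to the /chat script above instead of running it standalone.

from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Example second pipeline for a different task; swap in whichever model you are evaluating
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


@app.route("/summarize", methods=["POST"])
def summarize():
    data = request.json
    text = data.get("text", "")
    results = summarizer(text, max_length=128, min_length=16)
    return jsonify(results)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)  # different port so it can run next to the /chat server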

Update

I did some more digging and found that there is an official guide for this, here. It's more of a suggestion than a full guide, but the following is the code snippet from it.

# server.py

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route
from transformers import pipeline
import asyncio


async def homepage(request):
    payload = await request.body()
    string = payload.decode("utf-8")
    response_q = asyncio.Queue()
    await request.app.model_queue.put((string, response_q))
    output = await response_q.get()
    return JSONResponse(output)


async def server_loop(q):
    pipe = pipeline(model="google-bert/bert-base-uncased")
    while True:
        (string, response_q) = await q.get()
        out = pipe(string)
        await response_q.put(out)


app = Starlette(
    routes=[
        Route("/", homepage, methods=["POST"]),
    ],
)


@app.on_event("startup")
async def startup_event():
    q = asyncio.Queue()
    app.model_queue = q
    asyncio.create_task(server_loop(q))

Start the server with the following command:

uvicorn server:app

And query the server with the following command:

curl -X POST -d "test [MASK]" http://localhost:8000/
#[{"score":0.7742936015129089,"token":1012,"token_str":".","sequence":"test."},...]

So there you have it: two ways to serve Hugging Face models locally over HTTP, one using Flask and the other using Starlette. Both are simple and easy to set up. Note that in either case performance depends on your machine; in my testing, even the 1B model on a 36GB Mac M3 Pro took 10-15 seconds to respond. I tried a different setup using the llama.cpp inference server and the performance was much better. Post on that coming soon.