How to Create an MCP (Model Context Protocol) Server
A walkthrough for creating a simple MCP-style server.
The Model Context Protocol (MCP) is an open protocol, introduced by Anthropic in late 2024, that standardizes how LLM applications connect to external tools and data sources. The official specification is built on JSON-RPC and has official SDKs; this guide does not implement that specification. Instead, we'll build a simplified, MCP-inspired server for model inference and context management, which is a good way to get a feel for the underlying ideas. Below, I'll walk you through creating such a server using Python, FastAPI, and a basic model context management system. The server will handle model loading, context storage, and inference requests.
Prerequisites
Python 3.8+
Basic understanding of REST APIs
Familiarity with machine learning models (e.g., using Hugging Face Transformers)
Installed dependencies: fastapi, uvicorn, transformers, torch, and pydantic (torch is required because Transformers loads the model as a PyTorch model)
Step 1: Define the MCP Specification
For this example, the MCP server will:
Store model contexts (e.g., loaded models and their configurations).
Accept inference requests with input data and context IDs.
Return predictions or context updates.
Use a simple JSON-based protocol for communication, illustrated just below.
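To make the protocol concrete, here is the exchange the /infer endpoint will implement in Step 3. The context_id value is illustrative; the field names match the Pydantic models defined below.

Request:

POST /infer
{"context_id": "550e8400-e29b-41d4-a716-446655440000", "text": "I love this movie!"}

Response:

{"context_id": "550e8400-e29b-41d4-a716-446655440000", "prediction": "positive", "confidence": 0.9991}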
Step 2: Set Up the Project
Create a project directory and install the required packages:
mkdir mcp-server
cd mcp-server
pip install fastapi uvicorn transformers torch pydantic
Step 3: Implement the MCP Server
Below is a sample implementation of an MCP-style server using FastAPI. The server loads a pre-trained DistilBERT model for sentiment classification and manages contexts.
Server Code
Create a file named mcp_server.py:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import Optional
import torch
import uuid
import logging

# Initialize FastAPI app
app = FastAPI(title="MCP Server")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# In-memory storage for model contexts
contexts = {}

# Model and tokenizer (loaded once at startup)
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()  # inference only; disables dropout

# Pydantic models for request/response validation
class InferenceRequest(BaseModel):
    # Optional[str] keeps the code compatible with Python 3.8+
    context_id: Optional[str] = None
    text: str

class InferenceResponse(BaseModel):
    context_id: str
    prediction: str
    confidence: float

class ContextCreateResponse(BaseModel):
    context_id: str

# Create a new context
@app.post("/context", response_model=ContextCreateResponse)
async def create_context():
    context_id = str(uuid.uuid4())
    contexts[context_id] = {"model": MODEL_NAME, "state": "active"}
    logger.info(f"Created context: {context_id}")
    return {"context_id": context_id}

# Perform inference
@app.post("/infer", response_model=InferenceResponse)
async def infer(request: InferenceRequest):
    # Reject requests that reference a context that does not exist
    if request.context_id and request.context_id not in contexts:
        raise HTTPException(status_code=404, detail="Context not found")

    # Create a new context if none was provided
    context_id = request.context_id or str(uuid.uuid4())
    if context_id not in contexts:
        contexts[context_id] = {"model": MODEL_NAME, "state": "active"}
        logger.info(f"Created temporary context: {context_id}")

    # Tokenize input
    inputs = tokenizer(request.text, return_tensors="pt", truncation=True, padding=True)

    # Perform inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    prediction_id = logits.argmax().item()
    confidence = logits.softmax(dim=1)[0][prediction_id].item()
    prediction = "positive" if prediction_id == 1 else "negative"

    logger.info(f"Inference completed for context: {context_id}")
    return {
        "context_id": context_id,
        "prediction": prediction,
        "confidence": confidence,
    }

# Delete a context
@app.delete("/context/{context_id}")
async def delete_context(context_id: str):
    if context_id not in contexts:
        raise HTTPException(status_code=404, detail="Context not found")
    del contexts[context_id]
    logger.info(f"Deleted context: {context_id}")
    return {"message": f"Context {context_id} deleted"}

# Health check
@app.get("/health")
async def health():
    return {"status": "healthy", "model": MODEL_NAME}

# Run the server
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Step 4: Explanation of the Code
FastAPI Setup: The server uses FastAPI for creating RESTful endpoints.
Model Loading: A pre-trained DistilBERT model is loaded for sentiment analysis.
Context Management: Contexts are stored in memory with unique IDs, tracking model state; because storage is in-process, contexts are lost when the server restarts.
Endpoints:
POST /context: Creates a new context and returns a unique context_id.
POST /infer: Performs inference on input text, using an existing or new context.
DELETE /context/{context_id}: Deletes a context.
GET /health: Checks server and model status.
Pydantic Models: Used for request/response validation (see the example below).
Logging: Tracks context creation, inference, and deletion.
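Because Pydantic validates every request body, malformed input never reaches the model. For example, posting an empty JSON object to /infer (which is missing the required text field) makes FastAPI respond with an HTTP 422 error describing the missing field:

curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{}'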
Step 5: Run the Server
Start the server by running:
python mcp_server.py
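Note that the first launch downloads the DistilBERT weights from the Hugging Face Hub, so it may take a minute. During development, you can also start the app with the uvicorn CLI, which supports auto-reloading on code changes:

uvicorn mcp_server:app --host 0.0.0.0 --port 8000 --reload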
The server will be available at http://localhost:8000. You can access the interactive API documentation at http://localhost:8000/docs.
Step 6: Test the Server
Use curl or a tool like Postman to test the endpoints.
Create a Context:
curl -X POST http://localhost:8000/context
Response:
{"context_id": "550e8400-e29b-41d4-a716-446655440000"}
Perform Inference:
curl -X POST http://localhost:8000/infer \
-H "Content-Type: application/json" \
-d '{"context_id": "550e8400-e29b-41d4-a716-446655440000", "text": "I love this movie!"}'
Response:
{
"context_id": "550e8400-e29b-41d4-a716-446655440000",
"prediction": "positive",
"confidence": 0.9991
}
Delete a Context:
curl -X DELETE http://localhost:8000/context/550e8400-e29b-41d4-a716-446655440000
Response:
{"message": "Context 550e8400-e29b-41d4-a716-446655440000 deleted"}
Step 7: Scaling and Improvements
For a production-ready MCP server, consider:
Persistent Storage: Use a database (e.g., Redis, PostgreSQL) so contexts survive restarts, as sketched after this list.
Authentication: Add API key or OAuth2 for secure access.
Model Management: Support multiple models and dynamic loading.
Load Balancing: Deploy with a reverse proxy (e.g., Nginx) and scale with multiple workers.
Error Handling: Add retry mechanisms and detailed error responses.
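As a sketch of the first point, the in-memory contexts dict could be swapped for a small Redis-backed store. This assumes a Redis server on localhost:6379 and the redis package (pip install redis); the RedisContextStore name and key scheme are illustrative, not part of any standard:

import uuid
import redis

# decode_responses=True makes redis-py return str instead of bytes
store = redis.Redis(host="localhost", port=6379, decode_responses=True)

class RedisContextStore:
    """Illustrative drop-in replacement for the in-memory contexts dict."""

    def create(self, model_name: str) -> str:
        context_id = str(uuid.uuid4())
        # Each context is a Redis hash under a namespaced key
        store.hset(f"context:{context_id}", mapping={"model": model_name, "state": "active"})
        return context_id

    def exists(self, context_id: str) -> bool:
        return store.exists(f"context:{context_id}") == 1

    def delete(self, context_id: str) -> None:
        store.delete(f"context:{context_id}")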
TL;DR
This guide demonstrated how to create a basic MCP-style server using Python and FastAPI. The server handles model contexts and inference requests, providing a foundation for more complex systems. You can extend it by adding features like model versioning, context persistence, or request queuing.
For further details on FastAPI, visit fastapi.tiangolo.com. To explore Hugging Face models, check huggingface.co. For the official Model Context Protocol specification, see modelcontextprotocol.io.