Running AI models requires substantial hardware resources, often beyond the capacity of standard servers or virtual machines. To address this, enterprise software can leverage AI models hosted in the cloud or on specialized on-premises machines.
Over the past 6-8 months, we’ve witnessed tremendous progress in the capabilities, pricing, and features of both cloud and on-premises AI model hosting.
Many of our clients seek to incorporate AI into their businesses, and we’ve been approached multiple times for guidance on selecting the best hosting model.
Drawing on our years of expert cloud experience, we are well-equipped to help clients choose the right AI hosting platform. Whether it’s scalability, cost-effectiveness, or specialized features, we’ve successfully guided clients in selecting the hosting platform that best aligns with their business goals.
To help address common questions, we created this primer on popular hosting platforms. Below is a brief overview of some of the more popular cloud and on-premises AI hosting platforms we work with:
Popular Cloud AI Hosting Platforms
Azure AI Studio
Azure AI Studio, a cloud-based offering from Microsoft, is designed to empower businesses with advanced AI capabilities. It provides tools and services that make building, deploying, and managing AI models at scale easier. With Azure AI Studio, users can leverage pre-built models or create custom solutions tailored to specific needs, all within a highly secure and scalable cloud environment.
- Overview: Azure AI Studio supports the deployment of a wide range of models from a model catalog. It offers a playground to test prompts, fine-tuning support, content filters (violence/hate/etc.), and Prompt Flow (a Logic app-like builder supporting chaining of prompts, logic, other tools, and execution tracing).
- Models: Almost 2,000 commercial and open-source models are available.
- Deployment: Serverless pay-as-you-go (PAYG) and Managed Compute are offered, but the prices vary from model to model.
- Pricing: Token-based for serverless and per hour for Managed Compute.
- RAG: Supported through Azure AI Search.
- Data Privacy: Customer data is not available to other customers or to OpenAI, and is not used to train or improve any Microsoft or third-party products or services. A BAA is offered for HIPAA compliance.
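Because serverless deployments bill per token while Managed Compute bills per hour, a quick back-of-the-envelope comparison can guide the choice between them. A minimal sketch (the prices passed in are placeholders, not actual Azure rates):

```python
def cheaper_deployment(monthly_tokens: int,
                       serverless_price_per_1k_tokens: float,
                       managed_price_per_hour: float,
                       hours_per_month: float = 730.0) -> str:
    """Compare the monthly cost of serverless (token-based) pricing
    against an always-on Managed Compute instance (hourly pricing)."""
    serverless_cost = monthly_tokens / 1000 * serverless_price_per_1k_tokens
    managed_cost = hours_per_month * managed_price_per_hour
    return "serverless" if serverless_cost <= managed_cost else "managed compute"
```

At low volumes serverless usually wins; past the break-even token volume, an always-on instance becomes cheaper.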
Azure OpenAI Service
Azure OpenAI Service is a cloud-based platform that brings the power of OpenAI’s advanced language models to Microsoft Azure’s secure and scalable infrastructure. This service enables developers and businesses to integrate AI capabilities, such as natural language processing and conversational AI, into their applications.
- Overview: API service accessible in Azure. Offers a playground to test prompts and content filters (violence/hate/etc.).
- Models: Several GPT flavors with varying context sizes, model sizes, and prices.
- Deployment: Can run globally (requests routed to whatever region has capacity, higher throughput limits, latency may vary) or locked to a specific region.
- Pricing: Token-based. Pay-as-you-go (PAYG) and Provisioned Throughput Units (PTU) are offered (PTU only if you have a Microsoft account team).
- RAG: Supported through Azure AI Search.
- Data Privacy: Customer data is not available to other customers or to OpenAI, and is not used to train or improve any Microsoft or third-party products or services. A BAA is offered for HIPAA compliance.
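Because pricing is token-based, it is worth estimating cost per call alongside the call itself. A minimal sketch assuming the `openai` Python package; the deployment name, API version, and per-1K-token prices are placeholders:

```python
def estimate_call_cost(prompt_tokens: int, completion_tokens: int,
                       input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Token-based cost: input and output tokens are usually priced differently."""
    return (prompt_tokens / 1000 * input_price_per_1k
            + completion_tokens / 1000 * output_price_per_1k)


def ask_azure_openai(prompt: str) -> str:
    """Sketch of a chat call; expects AZURE_OPENAI_ENDPOINT and
    AZURE_OPENAI_API_KEY to be set in the environment."""
    import os
    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",  # placeholder; use your deployment's version
    )
    resp = client.chat.completions.create(
        model="my-gpt4o-deployment",  # placeholder deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```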
AWS Bedrock
AWS Bedrock is Amazon’s cloud service that simplifies the deployment of foundational AI models. It provides access to powerful pre-trained models, enabling businesses to build and scale AI applications quickly within the secure and flexible AWS environment.
- Overview: AWS Bedrock has some of the most popular models running and ready for access without any particular deployment needed. It offers a playground to test prompts, fine-tuning support, content filters (violence/hate/etc.), Prompt Flows (a Logic App-like builder supporting chaining of prompts, logic, other tools, and execution tracing), Agents, governance, and auditability.
- Models: 30-40 of the most popular models already running today. Custom models can be imported if based on Mistral, Flan, or LLaMA.
- Deployment: Models are already running.
- Pricing: Token-based pricing
- RAG: Supported through Knowledge Bases.
- Data Privacy: Customer data is not available to other customers and not used to train/improve any products or services. It also offers a BAA for HIPAA compliance.
- Other: Guardrails, Bedrock’s content-filtering feature (violence, hate, explicit content, etc.), can be configured per application, and every call can be fully audited.
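Because the models are already running, calling one from Python is a single request through the `boto3` package (assuming AWS credentials are configured); the model ID below is a placeholder for whichever Bedrock model you enable:

```python
def build_converse_messages(prompt: str) -> list:
    """Bedrock's Converse API expects each message's content
    as a list of content blocks, not a bare string."""
    return [{"role": "user", "content": [{"text": prompt}]}]


def ask_bedrock(prompt: str,
                model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> str:
    import boto3  # pip install boto3

    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    resp = client.converse(modelId=model_id,
                           messages=build_converse_messages(prompt))
    return resp["output"]["message"]["content"][0]["text"]
```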
GCP Vertex AI
GCP Vertex AI is Google Cloud’s platform for developing, deploying, and managing machine learning models. It provides a unified interface for building custom models, automating workflows, and leveraging pre-trained models. Vertex AI integrates seamlessly with other Google Cloud services, offering tools for scalable and efficient AI solutions.
- Overview: Vertex AI offers several popular models, some of which are already running and ready for access without any special deployment needed. It also offers a playground to test prompts and fine-tuning support. With their Gemini and PaLM models, this hosting platform provides content filters (violence/hate/etc.) and grounding (connecting model output to verifiable sources of information to prevent hallucinations).
- Models: 80-90 of the most popular models.
- Deployment: Some models are already running (managed APIs), and some must be deployed to a specific machine size.
- Pricing: Token-based pricing for managed API models, and per-hour pricing for models you deploy.
- RAG: They provide a reference architecture for you to build it using their document AI technology.
- Data Privacy: Customer data is not available to other customers and not used to train/improve any products or services. Offers a BAA for HIPAA compliance.
- Other: It supports direct Google Colab Enterprise integration, a playground for testing prompts, full call auditing, content filters for Gemini and PaLM models, and Apache Airflow via several operators.
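Calling a managed Gemini model through the `vertexai` SDK (from the `google-cloud-aiplatform` package) is similarly compact; the project ID and model name below are placeholders. The plain-dict safety-settings helper mirrors the shape of the content-filter configuration but does not use the SDK's enum types:

```python
_ALLOWED_THRESHOLDS = {"BLOCK_NONE", "BLOCK_ONLY_HIGH",
                       "BLOCK_MEDIUM_AND_ABOVE", "BLOCK_LOW_AND_ABOVE"}


def build_safety_settings(thresholds: dict) -> list:
    """Normalize a {category: threshold} mapping into a list of
    settings dicts, validating the threshold names first."""
    for category, threshold in thresholds.items():
        if threshold not in _ALLOWED_THRESHOLDS:
            raise ValueError(f"unknown threshold {threshold!r} for {category}")
    return [{"category": c, "threshold": t} for c, t in thresholds.items()]


def ask_gemini(prompt: str) -> str:
    """Sketch of a managed-API call; requires GCP credentials."""
    import vertexai
    from vertexai.generative_models import GenerativeModel

    vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders
    model = GenerativeModel("gemini-1.5-flash")
    return model.generate_content(prompt).text
```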
Hugging Face Enterprise
Hugging Face Enterprise is a cloud-based platform offering advanced tools for deploying and managing state-of-the-art machine learning models. It provides access to a wide range of pre-trained models, including those for natural language processing and computer vision, with extensive support for customization and fine-tuning.
- Overview: Hugging Face Enterprise offers many open-source and commercial models and fine-tunings, some of which have a playground to test with.
- Models: It has over 800,000 base and fine-tuned models.
- Deployment: The models must be deployed, but you can choose whether Hugging Face runs them on AWS, Azure, or GCP instances.
- Pricing: Pricing is per hour.
- RAG: Not built in. Instead, RAG models can be deployed, and Python code must be written and run in your environment to invoke them and wire them together with other models.
- Data Privacy: Customer data is not available to other customers and is not used to train/improve any products or services. It offers a BAA for HIPAA compliance, but it is very expensive.
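As noted under RAG, Hugging Face leaves the retrieval wiring to your own code. A toy, dependency-free sketch of that pattern, with bag-of-words word overlap standing in for a real deployed embedding model:

```python
def score(query: str, doc: str) -> float:
    """Toy relevance score: word overlap. A real pipeline would call a
    deployed embedding model and rank by cosine similarity instead."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)


def build_rag_prompt(query: str, documents: list, top_k: int = 2) -> str:
    """Retrieve the top_k most relevant documents and stuff them
    into the prompt sent to the generation model."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The generation step would then pass the returned prompt to whichever deployed model you chose.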
Our Favorite Alternatives to Cloud Hosting AI Platforms
BentoML
BentoML is an open-source platform designed to deploy, manage, and scale machine learning models across environments such as on-premises servers, data centers, edge devices, and embedded systems.
- Overview: BentoML is a development library for building AI applications with Python. It contains everything you need to boot up an open-source model of your choice and make it accessible as an API endpoint for your application. Typically, you would download an open-source model from Hugging Face, package it with BentoML, export it as a Docker image, and run it anywhere you like.
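The workflow above — pull an open-source model, wrap it, expose it as an API — reduces to a short service file. A sketch assuming BentoML's 1.2+ service API and the `transformers` library; the model and class names are illustrative, and the guarded import keeps the file loadable even where those packages are absent:

```python
try:
    import bentoml
    from transformers import pipeline
    HAVE_DEPS = True
except ImportError:          # sketch stays importable without the packages
    HAVE_DEPS = False

MODEL_ID = "distilgpt2"      # placeholder: any Hugging Face text-generation model


def clamp_tokens(requested: int, limit: int = 256) -> int:
    """Guard against oversized generation requests at the API edge."""
    return max(1, min(requested, limit))


if HAVE_DEPS:
    @bentoml.service(resources={"gpu": 1})
    class TextGeneration:
        def __init__(self):
            self.pipe = pipeline("text-generation", model=MODEL_ID)

        @bentoml.api
        def generate(self, prompt: str, max_new_tokens: int = 64) -> str:
            n = clamp_tokens(max_new_tokens)
            return self.pipe(prompt, max_new_tokens=n)[0]["generated_text"]
```

From there, `bentoml serve` runs the endpoint locally, and `bentoml build` plus `bentoml containerize` produce the portable Docker image described above.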
Hugging Face TGI
Hugging Face TGI (Text Generation Inference) is designed to deploy and manage Hugging Face models across environments such as on-premises servers, data centers, edge devices, and embedded systems.
- Overview: Hugging Face TGI is a development toolkit for deploying and serving LLMs. It has built-in support for batching multiple API requests and supports quantization, token streaming, and telemetry (via OpenTelemetry and Prometheus). Specialized versions are available for different GPU lines (Nvidia, AMD, AWS Inferentia). Delivered as a Docker image, you typically boot it up with parameters anywhere you like.
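Once the TGI container is running, generation is a plain HTTP POST to its `/generate` route. A stdlib-only sketch; the base URL assumes a local container with the server port mapped to 8080, so adjust it to your setup:

```python
import json
from urllib import request


def build_generate_payload(prompt: str, max_new_tokens: int = 64) -> dict:
    """TGI's /generate route takes an 'inputs' string plus a
    'parameters' object (max_new_tokens, temperature, etc.)."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def tgi_generate(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = request.Request(
        f"{base_url}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["generated_text"]
```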
NVIDIA Triton Inference Server
NVIDIA Triton Inference Server is a powerful platform for deploying and managing large-scale AI models. It supports multiple frameworks and provides a unified interface for model serving.
- Overview: Triton Inference Server is an open-source software toolkit developed by Nvidia to serve one or more models of many types concurrently on Nvidia GPUs. It supports model ensembles (allowing multiple models to be chained together), a C and Java API (to link directly with application code), metrics (via Prometheus), and both HTTP and gRPC APIs. It is available as a Docker image.
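Triton's HTTP API follows the KServe v2 inference protocol: a POST to `/v2/models/<name>/infer` carrying named, typed input tensors. A stdlib-only sketch; the model name, tensor name, and port are placeholders that must match your model's configuration:

```python
import json
from urllib import request


def build_infer_body(input_name: str, values: list) -> dict:
    """KServe v2 request: each input carries a name, shape, datatype, and data."""
    return {
        "inputs": [{
            "name": input_name,
            "shape": [1, len(values)],
            "datatype": "FP32",
            "data": values,
        }]
    }


def triton_infer(model: str, body: dict,
                 base_url: str = "http://localhost:8000") -> dict:
    req = request.Request(
        f"{base_url}/v2/models/{model}/infer",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)  # outputs return in the same named-tensor format
```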
Which Hosting Solution Will Work for Your Business?
When selecting the best hosting solution for your AI model, several factors must be considered, such as scalability, cost, deployment options, and platform-specific features.
Azure AI Studio, Azure OpenAI Service, AWS Bedrock, and GCP Vertex AI are strong contenders for cloud-based flexibility and ease of use. If you don’t require a BAA, Hugging Face Enterprise is an incredibly cost-effective option. Hugging Face TGI, BentoML, and Nvidia Triton offer robust deployment options for those preferring on-premises solutions.
Regardless of which hosting platform you choose for your AI solution, our Rōnin Consulting team can help. Contact us today to learn how we can set up your team with the right on-premises or cloud platform.