Master building AI-powered applications with the OpenAI API through practical examples, authentication guides, and production-ready best practices.
Large Language Models (LLMs) have revolutionized how we build intelligent applications, and the OpenAI API provides one of the most accessible entry points into this technology. Whether you're a seasoned developer or new to AI, this comprehensive guide will take you from basic concepts to building production-ready applications with LLMs.
Understanding Large Language Models
Before diving into implementation, it's crucial to understand what LLMs are and how they work. Large Language Models are neural networks trained on vast amounts of text data to understand and generate human-like text. They excel at tasks like text completion, translation, summarization, and even code generation.
The key insight behind LLMs is that language understanding emerges from predicting the next word in a sequence. By training on billions of words, these models develop sophisticated understanding of grammar, context, reasoning, and even world knowledge.
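To make that idea concrete, here is a toy illustration (made-up probabilities, not a real model) of what "predict the next word" means as a procedure; real LLMs learn these probabilities from data with neural networks rather than a hand-written table:

# next_word_toy.py -- illustrative only
TOY_MODEL = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "slept": 0.1},
}

def predict_next_word(context):
    """Return the most likely next word for a two-word context."""
    candidates = TOY_MODEL.get(tuple(context), {"<unknown>": 1.0})
    return max(candidates, key=candidates.get)

print(predict_next_word(["the", "cat"]))  # -> "sat"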
Setting Up Your Development Environment
Let's start by setting up a robust development environment for working with the OpenAI API:
# requirements.txt
openai==1.3.7
python-dotenv==1.0.0
pydantic==2.5.0
pydantic-settings  # BaseSettings lives in this package in pydantic v2 (used in config.py below)
aiohttp==3.9.1
tenacity==8.2.3
tiktoken           # token counting (token_manager.py below)
redis              # dynamic configuration (config_manager.py below)
pytest             # testing examples
pytest-asyncio     # async test support
Create a virtual environment and install dependencies:
python -m venv llm_env
source llm_env/bin/activate # On Windows: llm_env\Scripts\activate
pip install -r requirements.txt
Set up your environment variables in a .env file:
OPENAI_API_KEY=your_api_key_here
OPENAI_ORG_ID=your_org_id_here # Optional
OpenAI API Fundamentals
Authentication and Basic Setup
# openai_client.py
import os
import logging
from typing import Optional, Dict, Any, List

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


class OpenAIClient:
    def __init__(self, api_key: Optional[str] = None, organization: Optional[str] = None):
        self.client = OpenAI(
            api_key=api_key or os.getenv("OPENAI_API_KEY"),
            organization=organization or os.getenv("OPENAI_ORG_ID")
        )
        self.logger = self._setup_logging()

    def _setup_logging(self):
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(__name__)

    async def chat_completion(self,
                              messages: List[Dict[str, str]],
                              model: str = "gpt-3.5-turbo",
                              temperature: float = 0.7,
                              max_tokens: Optional[int] = None,
                              stream: bool = False) -> Dict[str, Any]:
        """
        Create a chat completion with proper error handling
        """
        try:
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                stream=stream
            )

            if stream:
                return self._handle_streaming_response(response)
            else:
                return self._handle_standard_response(response)
        except Exception as e:
            self.logger.error(f"OpenAI API error: {e}")
            raise

    def _handle_standard_response(self, response) -> Dict[str, Any]:
        """Handle standard API response"""
        return {
            "content": response.choices[0].message.content,
            "model": response.model,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
                "total_tokens": response.usage.total_tokens
            },
            "finish_reason": response.choices[0].finish_reason
        }

    def _handle_streaming_response(self, response):
        """Handle streaming API response"""
        for chunk in response:
            if chunk.choices[0].delta.content is not None:
                yield chunk.choices[0].delta.content
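Because chat_completion returns a generator of text chunks when stream=True, callers can print tokens as they arrive. Here is a minimal usage sketch, assuming a valid API key is configured in the .env file from earlier:

# stream_demo.py -- illustrative usage of the client above
import asyncio
from openai_client import OpenAIClient

async def stream_demo():
    client = OpenAIClient()
    # With stream=True the method returns a generator of text chunks
    chunks = await client.chat_completion(
        messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
        stream=True
    )
    for chunk in chunks:
        print(chunk, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(stream_demo())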
Understanding OpenAI Models
OpenAI offers several models with different capabilities and pricing:
# model_selector.py
from enum import Enum
from dataclasses import dataclass
from typing import Dict, Optional


class ModelCapability(Enum):
    TEXT_GENERATION = "text_generation"
    CODE_GENERATION = "code_generation"
    ANALYSIS = "analysis"
    CREATIVE_WRITING = "creative_writing"
    REASONING = "reasoning"


@dataclass
class ModelInfo:
    name: str
    context_length: int
    input_cost_per_1k: float  # USD
    output_cost_per_1k: float  # USD
    capabilities: list
    description: str


class ModelSelector:
    def __init__(self):
        self.models = {
            "gpt-4": ModelInfo(
                name="gpt-4",
                context_length=8192,
                input_cost_per_1k=0.03,
                output_cost_per_1k=0.06,
                capabilities=[ModelCapability.TEXT_GENERATION, ModelCapability.REASONING,
                              ModelCapability.CODE_GENERATION, ModelCapability.ANALYSIS],
                description="Most capable model, best for complex reasoning"
            ),
            "gpt-4-turbo": ModelInfo(
                name="gpt-4-turbo-preview",
                context_length=128000,
                input_cost_per_1k=0.01,
                output_cost_per_1k=0.03,
                capabilities=[ModelCapability.TEXT_GENERATION, ModelCapability.REASONING,
                              ModelCapability.CODE_GENERATION, ModelCapability.ANALYSIS],
                description="Faster GPT-4 with longer context window"
            ),
            "gpt-3.5-turbo": ModelInfo(
                name="gpt-3.5-turbo",
                context_length=4096,
                input_cost_per_1k=0.001,
                output_cost_per_1k=0.002,
                capabilities=[ModelCapability.TEXT_GENERATION, ModelCapability.CODE_GENERATION],
                description="Fast and cost-effective for most tasks"
            )
        }

    def recommend_model(self,
                        task_type: ModelCapability,
                        complexity: str = "medium",
                        budget_conscious: bool = False) -> str:
        """Recommend the best model for a given task"""
        if budget_conscious and task_type in [ModelCapability.TEXT_GENERATION,
                                              ModelCapability.CODE_GENERATION]:
            return "gpt-3.5-turbo"

        if complexity == "high" or task_type == ModelCapability.REASONING:
            return "gpt-4"

        if task_type == ModelCapability.ANALYSIS:
            return "gpt-4-turbo"

        return "gpt-3.5-turbo"

    def calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Calculate the cost for a given request"""
        if model not in self.models:
            raise ValueError(f"Unknown model: {model}")

        model_info = self.models[model]
        input_cost = (input_tokens / 1000) * model_info.input_cost_per_1k
        output_cost = (output_tokens / 1000) * model_info.output_cost_per_1k

        return input_cost + output_cost
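A quick usage sketch of the selector; the numbers below follow the rate table hard-coded above, not live OpenAI pricing:

# selector_demo.py -- illustrative usage of ModelSelector
from model_selector import ModelSelector, ModelCapability

selector = ModelSelector()

model = selector.recommend_model(ModelCapability.REASONING, complexity="high")
print(model)  # -> "gpt-4"

# Estimate cost for a request that used 1,200 prompt tokens and 300 completion tokens
cost = selector.calculate_cost("gpt-4", input_tokens=1200, output_tokens=300)
print(f"${cost:.4f}")  # -> $0.0540 with the rates defined above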
Building Your First LLM Application
Let's create a practical application that demonstrates core concepts:
# llm_assistant.py
import asyncio
import json
from datetime import datetime
from typing import List, Dict, Any, Optional

from openai_client import OpenAIClient
from model_selector import ModelSelector, ModelCapability


class LLMAssistant:
    def __init__(self):
        self.client = OpenAIClient()
        self.model_selector = ModelSelector()
        self.conversation_history = []
        self.system_prompt = """You are a helpful AI assistant. You provide accurate,
helpful, and concise responses. If you're unsure about something, you clearly
state your uncertainty."""

    async def chat(self,
                   user_message: str,
                   context: Optional[Dict] = None,
                   model_override: Optional[str] = None) -> Dict[str, Any]:
        """
        Chat with the assistant
        """
        # Prepare messages
        messages = [{"role": "system", "content": self.system_prompt}]

        # Add conversation history (last 10 exchanges = 20 messages)
        messages.extend(self.conversation_history[-20:])

        # Add current message
        messages.append({"role": "user", "content": user_message})

        # Select appropriate model
        model = model_override or self.model_selector.recommend_model(
            ModelCapability.TEXT_GENERATION,
            complexity="medium"
        )

        try:
            # Get response
            response = await self.client.chat_completion(
                messages=messages,
                model=model,
                temperature=0.7
            )

            # Update conversation history
            self.conversation_history.extend([
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": response["content"]}
            ])

            # Calculate cost
            cost = self.model_selector.calculate_cost(
                model,
                response["usage"]["prompt_tokens"],
                response["usage"]["completion_tokens"]
            )

            return {
                "response": response["content"],
                "model_used": model,
                "tokens_used": response["usage"]["total_tokens"],
                "estimated_cost": cost,
                "timestamp": datetime.now().isoformat()
            }
        except Exception as e:
            return {
                "error": str(e),
                "timestamp": datetime.now().isoformat()
            }

    async def summarize_text(self, text: str, max_length: int = 200) -> str:
        """
        Summarize long text
        """
        prompt = f"""Please summarize the following text in approximately {max_length} words:

{text}

Summary:"""

        messages = [{"role": "user", "content": prompt}]

        response = await self.client.chat_completion(
            messages=messages,
            model="gpt-3.5-turbo",
            temperature=0.3,
            max_tokens=int(max_length * 1.5)  # Allow some buffer
        )

        return response["content"]

    async def analyze_sentiment(self, text: str) -> Dict[str, Any]:
        """
        Analyze sentiment of text
        """
        prompt = f"""Analyze the sentiment of the following text and return a JSON response with:
- sentiment: positive/negative/neutral
- confidence: 0.0 to 1.0
- key_emotions: list of detected emotions
- explanation: brief explanation

Text: {text}

Return only valid JSON:"""

        messages = [{"role": "user", "content": prompt}]

        response = await self.client.chat_completion(
            messages=messages,
            model="gpt-3.5-turbo",
            temperature=0.1
        )

        try:
            return json.loads(response["content"])
        except json.JSONDecodeError:
            return {"error": "Failed to parse sentiment analysis"}

    def clear_history(self):
        """Clear conversation history"""
        self.conversation_history = []

    def save_conversation(self, filename: str):
        """Save conversation to file"""
        with open(filename, 'w') as f:
            json.dump(self.conversation_history, f, indent=2)

    def load_conversation(self, filename: str):
        """Load conversation from file"""
        with open(filename, 'r') as f:
            self.conversation_history = json.load(f)


# Example usage
async def main():
    assistant = LLMAssistant()

    # Simple chat
    response = await assistant.chat("Explain quantum computing in simple terms")
    print(f"Assistant: {response['response']}")
    print(f"Cost: ${response['estimated_cost']:.4f}")

    # Text summarization
    long_text = """
    Large Language Models (LLMs) are a type of artificial intelligence model designed to
    understand and generate human-like text. These models are trained on vast amounts of text
    data from the internet, books, articles, and other sources. The training process involves
    predicting the next word in a sequence, which allows the model to learn patterns in language,
    grammar, facts about the world, and even some reasoning abilities. LLMs have become incredibly
    powerful and versatile, capable of tasks ranging from answering questions and writing code
    to creative writing and language translation. Popular examples include GPT models from OpenAI,
    Claude from Anthropic, and various models from Google, Meta, and other organizations.
    """

    summary = await assistant.summarize_text(long_text, max_length=50)
    print(f"Summary: {summary}")

    # Sentiment analysis
    sentiment = await assistant.analyze_sentiment("I love this new product! It's amazing and works perfectly.")
    print(f"Sentiment: {sentiment}")


if __name__ == "__main__":
    asyncio.run(main())
Advanced Features and Best Practices
Implementing Rate Limiting and Retry Logic
# rate_limiter.py
import asyncio
import time
from typing import Optional

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APIError

from openai_client import OpenAIClient


class RateLimiter:
    def __init__(self, max_requests_per_minute: int = 60):
        self.max_requests = max_requests_per_minute
        self.requests = []
        self.lock = asyncio.Lock()

    async def acquire(self):
        """Acquire permission to make a request, waiting if the window is full"""
        async with self.lock:
            now = time.time()

            # Remove requests older than 1 minute
            self.requests = [req_time for req_time in self.requests if now - req_time < 60]

            # If the window is full, wait until the oldest request expires
            # (note: asyncio.Lock is not reentrant, so we wait here instead of recursing)
            if len(self.requests) >= self.max_requests:
                oldest_request = min(self.requests)
                wait_time = 60 - (now - oldest_request)
                await asyncio.sleep(max(wait_time, 0))

                # Re-prune the window after waiting
                now = time.time()
                self.requests = [req_time for req_time in self.requests if now - req_time < 60]

            # Record this request
            self.requests.append(now)


class RobustOpenAIClient(OpenAIClient):
    def __init__(self, rate_limit: int = 60):
        super().__init__()
        self.rate_limiter = RateLimiter(rate_limit)

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        retry=retry_if_exception_type((RateLimitError, APIError))
    )
    async def chat_completion_with_retry(self, *args, **kwargs):
        """Chat completion with rate limiting and retry logic"""
        await self.rate_limiter.acquire()
        return await self.chat_completion(*args, **kwargs)
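Usage is the same as the base client. A short sketch, assuming the .env configuration from earlier:

# robust_demo.py -- illustrative usage of the rate-limited, retrying client
import asyncio
from rate_limiter import RobustOpenAIClient

async def robust_demo():
    client = RobustOpenAIClient(rate_limit=30)  # cap at 30 requests/minute
    result = await client.chat_completion_with_retry(
        messages=[{"role": "user", "content": "Give me three project name ideas."}]
    )
    print(result["content"])

asyncio.run(robust_demo())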
Token Management and Context Optimization
# token_manager.py
import tiktoken
from typing import List, Dict, Tuple


class TokenManager:
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model
        self.encoding = tiktoken.encoding_for_model(model)
        self.max_tokens = self._get_max_tokens(model)

    def _get_max_tokens(self, model: str) -> int:
        """Get maximum tokens for model"""
        model_limits = {
            "gpt-3.5-turbo": 4096,
            "gpt-4": 8192,
            "gpt-4-turbo-preview": 128000,
            "gpt-4-32k": 32768
        }
        return model_limits.get(model, 4096)

    def count_tokens(self, text: str) -> int:
        """Count tokens in text"""
        return len(self.encoding.encode(text))

    def count_message_tokens(self, messages: List[Dict[str, str]]) -> int:
        """Count tokens in a message list (approximate, including per-message overhead)"""
        total_tokens = 0
        for message in messages:
            # Account for message structure overhead
            total_tokens += 4  # Every message has overhead
            for key, value in message.items():
                total_tokens += self.count_tokens(value)
                if key == "name":
                    total_tokens += 1
        total_tokens += 2  # Every conversation has overhead
        return total_tokens
    def optimize_context(self,
                         messages: List[Dict[str, str]],
                         max_response_tokens: int = 500) -> List[Dict[str, str]]:
        """
        Optimize message context to fit within token limits
        """
        # Reserve tokens for response
        available_tokens = self.max_tokens - max_response_tokens

        # Always keep system message if present
        optimized_messages = []
        has_system = bool(messages) and messages[0].get("role") == "system"
        if has_system:
            optimized_messages = [messages[0]]
            messages = messages[1:]

        # Calculate system message tokens
        current_tokens = self.count_message_tokens(optimized_messages)

        # Add messages from most recent, working backwards, while
        # preserving chronological order after the system message
        insert_position = 1 if has_system else 0
        for message in reversed(messages):
            message_tokens = self.count_message_tokens([message])
            if current_tokens + message_tokens <= available_tokens:
                optimized_messages.insert(insert_position, message)
                current_tokens += message_tokens
            else:
                break

        return optimized_messages
    def truncate_text(self, text: str, max_tokens: int,
                      preserve_end: bool = False) -> str:
        """
        Truncate text to fit within token limit
        """
        tokens = self.encoding.encode(text)

        if len(tokens) <= max_tokens:
            return text

        if preserve_end:
            truncated_tokens = tokens[-max_tokens:]
        else:
            truncated_tokens = tokens[:max_tokens]

        return self.encoding.decode(truncated_tokens)
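A short sketch of how the manager might be used before sending a request; as noted above, the counts are approximate:

# token_demo.py -- illustrative usage of TokenManager
from token_manager import TokenManager

manager = TokenManager(model="gpt-3.5-turbo")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Moby Dick."},
]

print(manager.count_message_tokens(messages))  # approximate prompt size

# Trim an overly long conversation so roughly 500 tokens remain for the reply
trimmed = manager.optimize_context(messages, max_response_tokens=500)
print(len(trimmed))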
Production Deployment Considerations
Environment Configuration
# config.py
from typing import Optional, List

from pydantic import field_validator
from pydantic_settings import BaseSettings, SettingsConfigDict  # separate package in pydantic v2


class OpenAIConfig(BaseSettings):
    # Tolerate unrelated keys in the shared .env file
    model_config = SettingsConfigDict(env_prefix="OPENAI_", env_file=".env", extra="ignore")

    api_key: str
    organization_id: Optional[str] = None
    max_requests_per_minute: int = 60
    default_model: str = "gpt-3.5-turbo"
    max_tokens_per_request: int = 2000
    default_temperature: float = 0.7

    @field_validator('api_key')
    @classmethod
    def api_key_must_be_set(cls, v):
        if not v:
            raise ValueError('OpenAI API key must be set')
        return v


class AppConfig(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    debug: bool = False
    log_level: str = "INFO"
    cache_enabled: bool = True
    cache_ttl: int = 3600
    max_conversation_length: int = 50

    # Security
    allowed_origins: List[str] = ["*"]
    api_key_required: bool = True
# config_manager.py
import json
from typing import Any, Optional

import redis

from config import OpenAIConfig, AppConfig


class ConfigManager:
    def __init__(self, redis_host: str = "localhost", redis_port: int = 6379):
        self.redis_client = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True
        )
        self.openai_config = OpenAIConfig()
        self.app_config = AppConfig()

    def get_config(self, key: str, default: Any = None) -> Any:
        """Get configuration value with Redis override capability"""
        # Check Redis first for dynamic config
        redis_value = self.redis_client.get(f"config:{key}")
        if redis_value:
            try:
                return json.loads(redis_value)
            except json.JSONDecodeError:
                return redis_value

        # Fall back to environment config
        return getattr(self.app_config, key, default)

    def set_config(self, key: str, value: Any, ttl: Optional[int] = None):
        """Set configuration value in Redis"""
        if isinstance(value, (dict, list)):
            value = json.dumps(value)

        if ttl:
            self.redis_client.setex(f"config:{key}", ttl, value)
        else:
            self.redis_client.set(f"config:{key}", value)
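A brief usage sketch, assuming a local Redis instance and the .env file from earlier:

# config_demo.py -- illustrative usage of ConfigManager
from config_manager import ConfigManager

config = ConfigManager()

# Static value from environment-backed settings
print(config.get_config("max_conversation_length"))  # -> 50 unless overridden

# Dynamic override stored in Redis that expires after an hour
config.set_config("max_conversation_length", 100, ttl=3600)
print(config.get_config("max_conversation_length"))  # -> 100 until the key expires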
Error Handling and Monitoring
# monitoring.py
import logging
import time
from typing import Dict, Any, Optional
from functools import wraps
import asyncio


class LLMMetrics:
    def __init__(self):
        self.request_count = 0
        self.error_count = 0
        self.total_tokens = 0
        self.total_cost = 0.0
        self.response_times = []

    def record_request(self,
                       tokens_used: int,
                       cost: float,
                       response_time: float,
                       success: bool = True):
        """Record metrics for a request"""
        self.request_count += 1
        if not success:
            self.error_count += 1

        self.total_tokens += tokens_used
        self.total_cost += cost
        self.response_times.append(response_time)

        # Keep only last 1000 response times
        if len(self.response_times) > 1000:
            self.response_times = self.response_times[-1000:]

    def get_stats(self) -> Dict[str, Any]:
        """Get current statistics"""
        avg_response_time = (
            sum(self.response_times) / len(self.response_times)
            if self.response_times else 0
        )
        error_rate = (
            self.error_count / self.request_count
            if self.request_count > 0 else 0
        )

        return {
            "total_requests": self.request_count,
            "total_errors": self.error_count,
            "error_rate": error_rate,
            "total_tokens": self.total_tokens,
            "total_cost": self.total_cost,
            "avg_response_time": avg_response_time,
            "avg_tokens_per_request": (
                self.total_tokens / self.request_count
                if self.request_count > 0 else 0
            )
        }


def monitor_llm_calls(metrics: LLMMetrics):
    """Decorator to monitor LLM API calls"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)

                # Extract metrics from result
                tokens_used = result.get("tokens_used", 0)
                cost = result.get("estimated_cost", 0)
                response_time = time.time() - start_time

                metrics.record_request(
                    tokens_used, cost, response_time, success=True
                )
                return result
            except Exception as e:
                response_time = time.time() - start_time
                metrics.record_request(0, 0, response_time, success=False)
                raise
        return wrapper
    return decorator
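One way to wire this up is sketched below; the decorator can wrap any coroutine that returns the dict shape produced by LLMAssistant.chat:

# monitoring_demo.py -- illustrative wiring of LLMMetrics around the assistant
import asyncio
from llm_assistant import LLMAssistant
from monitoring import LLMMetrics, monitor_llm_calls

metrics = LLMMetrics()

async def monitored_demo():
    assistant = LLMAssistant()
    # Wrap the bound method so every call records tokens, cost, and latency
    monitored_chat = monitor_llm_calls(metrics)(assistant.chat)

    await monitored_chat("What is retrieval-augmented generation?")
    print(metrics.get_stats())

asyncio.run(monitored_demo())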
Testing Your LLM Applications
# test_llm.py
import pytest
import asyncio
from unittest.mock import Mock, patch

from llm_assistant import LLMAssistant


class TestLLMAssistant:
    @pytest.fixture
    def assistant(self):
        return LLMAssistant()

    @pytest.fixture
    def mock_openai_response(self):
        return {
            "content": "This is a test response",
            "model": "gpt-3.5-turbo",
            "usage": {
                "prompt_tokens": 10,
                "completion_tokens": 20,
                "total_tokens": 30
            },
            "finish_reason": "stop"
        }

    @pytest.mark.asyncio
    async def test_chat_basic(self, assistant, mock_openai_response):
        """Test basic chat functionality"""
        with patch.object(assistant.client, 'chat_completion',
                          return_value=mock_openai_response):
            response = await assistant.chat("Hello, world!")

            assert response["response"] == "This is a test response"
            assert response["model_used"] == "gpt-3.5-turbo"
            assert response["tokens_used"] == 30

    @pytest.mark.asyncio
    async def test_summarization(self, assistant):
        """Test text summarization"""
        mock_response = {
            "content": "Short summary of the text.",
            "model": "gpt-3.5-turbo",
            "usage": {"total_tokens": 25}
        }

        with patch.object(assistant.client, 'chat_completion',
                          return_value=mock_response):
            summary = await assistant.summarize_text("Long text here...", max_length=10)
            assert "summary" in summary.lower()

    def test_conversation_history(self, assistant):
        """Test conversation history management"""
        # Add some conversation history
        assistant.conversation_history = [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ]

        assert len(assistant.conversation_history) == 2

        assistant.clear_history()
        assert len(assistant.conversation_history) == 0


# Performance testing
@pytest.mark.performance
class TestLLMPerformance:
    @pytest.mark.asyncio
    async def test_concurrent_requests(self):
        """Test handling multiple concurrent requests"""
        assistant = LLMAssistant()

        # Mock the API calls
        with patch.object(assistant.client, 'chat_completion') as mock_chat:
            mock_chat.return_value = {
                "content": "Response",
                "model": "gpt-3.5-turbo",
                "usage": {"prompt_tokens": 10, "completion_tokens": 10, "total_tokens": 20}
            }

            # Create 10 concurrent requests
            tasks = [
                assistant.chat(f"Question {i}")
                for i in range(10)
            ]

            responses = await asyncio.gather(*tasks)

            assert len(responses) == 10
            assert all("response" in resp for resp in responses)
Best Practices and Production Tips
1. Cost Management
- Monitor token usage closely
- Use appropriate models for different task complexities
- Implement caching for repeated queries
- Set up budget alerts and limits
2. Security
- Never expose API keys in client-side code
- Implement proper authentication and authorization
- Validate and sanitize user inputs
- Log access patterns for security monitoring
3. Performance Optimization
- Use streaming for long responses
- Implement request batching where possible
- Cache frequent responses (a caching sketch follows these lists)
- Optimize prompt length and structure
4. Error Handling
- Implement comprehensive retry logic
- Handle rate limits gracefully
- Provide fallback responses for failures
- Monitor error rates and patterns
5. User Experience
- Provide progress indicators for long requests
- Implement proper loading states
- Allow users to interrupt long-running operations
- Provide clear error messages
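To make the caching advice from the cost and performance lists concrete, here is a minimal in-memory sketch; a real deployment would more likely back this with Redis, as in the configuration section above:

# response_cache.py -- illustrative TTL cache for repeated prompts
import hashlib
import time
from typing import Dict, Optional, Tuple


class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}

    def _key(self, prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        entry = self._store.get(self._key(prompt, model))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, prompt: str, model: str, response: str) -> None:
        self._store[self._key(prompt, model)] = (time.time(), response)

Check the cache before calling the API and store the response afterwards; identical prompts then cost nothing on repeat.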
Conclusion
Building applications with LLMs and the OpenAI API opens up incredible possibilities for creating intelligent, responsive applications. By following the patterns and practices outlined in this guide, you'll be well-equipped to build robust, production-ready AI applications.
Remember to start simple, monitor everything, and iterate based on real user feedback. The AI landscape is rapidly evolving, so stay curious and keep experimenting with new capabilities as they become available.
The key to success with LLMs lies not just in understanding the technology, but in thoughtful application design, careful testing, and continuous optimization based on real-world usage patterns.
David Childs
Consulting Systems Engineer with over 10 years of experience building scalable infrastructure and helping organizations optimize their technology stack.