Operations Runbook
Comprehensive operational guide for tenant onboarding, service configuration, monitoring, contextual messaging operations, troubleshooting, and Terraform workflows.
Operations and Configuration Runbook
This guide covers day-to-day operational procedures for PlatformXe, from initial tenant setup through service configuration, monitoring, and troubleshooting.
Tenant onboarding
API key creation and scope assignment
Every tenant receives one or more API keys with scoped access. Create keys through the portal or the admin dashboard.
// TypeScript SDK -- verify key works
const client = new PlatformXeClient({
apiKey: 'pxk_live_your_key_here',
});
const health = await client.healthCheck();
console.log(health.status); // "ok"
# Python SDK
from platformxe import PlatformXeClient
client = PlatformXeClient(api_key="pxk_live_your_key_here")
// Go SDK
client := platformxe.NewClient(platformxe.ClientConfig{
APIKey: "pxk_live_your_key_here",
})
Recommended scope sets by use case:
| Use case | Scopes |
|---|---|
| Full-stack app | messaging:send, storage:upload, storage:read, permissions:check, events:read |
| Authorization only | permissions:check, permissions:manage, permissions:audit |
| Background jobs | exports:create, events:manage, webhooks:manage |
| Read-only dashboard | storage:read, events:read, permissions:check |
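When keys are provisioned by script, it can be useful to compare a key's scopes against the recommended set before handing the key to a service. The sketch below hard-codes the table above; `RECOMMENDED_SCOPES`, `UseCase`, and `missingScopes` are illustrative names and are not part of the PlatformXe SDK.

```typescript
// Recommended scope sets, copied from the table above.
// RECOMMENDED_SCOPES and missingScopes are hypothetical helpers, not SDK exports.
type UseCase = 'full-stack' | 'authorization-only' | 'background-jobs' | 'read-only-dashboard';

const RECOMMENDED_SCOPES: Record<UseCase, string[]> = {
  'full-stack': ['messaging:send', 'storage:upload', 'storage:read', 'permissions:check', 'events:read'],
  'authorization-only': ['permissions:check', 'permissions:manage', 'permissions:audit'],
  'background-jobs': ['exports:create', 'events:manage', 'webhooks:manage'],
  'read-only-dashboard': ['storage:read', 'events:read', 'permissions:check'],
};

// Returns the recommended scopes that the key does not yet have.
function missingScopes(useCase: UseCase, keyScopes: string[]): string[] {
  const granted = new Set(keyScopes);
  return RECOMMENDED_SCOPES[useCase].filter((scope) => !granted.has(scope));
}

console.log(missingScopes('read-only-dashboard', ['storage:read', 'events:read']));
// prints [ 'permissions:check' ]
```

An empty result means the key already covers the recommended set for that use case.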
Service processor initial configuration
After creating an API key, configure processors for each service your tenant will use.
// Enable and configure storage
await client.storage.updateProcessor({
enabled: true,
config: {
maxFileSizeMb: 25,
allowedMimeTypes: ['image/jpeg', 'image/png', 'application/pdf'],
moderationEnabled: true,
},
});
// Enable OCR
await client.ocr.updateProcessor({
enabled: true,
config: {
confidenceThreshold: 0.85,
supportedDocumentTypes: ['NIN_SLIP', 'DRIVERS_LICENSE'],
},
});
// Enable identity resolution
await client.identity.updateProcessor({
enabled: true,
config: { retryOnFailure: true, maxRetries: 3 },
});
# Python equivalents
client.storage.update_processor(enabled=True, config={"maxFileSizeMb": 25})
client.ocr.update_processor(enabled=True, config={"confidenceThreshold": 0.85})
client.identity.update_processor(enabled=True, config={"retryOnFailure": True})
// Go equivalents
client.Storage.UpdateProcessor(map[string]interface{}{"enabled": true, "config": map[string]interface{}{"maxFileSizeMb": 25}})
client.Ocr.UpdateProcessor(map[string]interface{}{"enabled": true, "config": map[string]interface{}{"confidenceThreshold": 0.85}})
Channel setup for contextual messaging
Create channels before threads can be opened. Each channel maps to a domain entity type.
await client.threads.createChannel({
slug: 'booking',
displayName: 'Booking Conversations',
entityType: 'BOOKING',
participantRoles: ['GUEST', 'HOST', 'PLATFORM'],
defaultVisibility: ['ALL'],
lifecycleRules: {
autoClose: { onEntityStatus: ['CHECKED_OUT', 'CANCELLED'] },
autoArchive: { afterClosedDays: 90 },
},
});
Webhook and event subscription setup
// Create a webhook endpoint
const webhook = await client.webhooks.create({
url: 'https://your-app.com/webhooks/platformxe',
events: ['email.message.*', 'permissions.role.*'],
secret: 'whsec_your_signing_secret',
});
// Create an event subscription
await client.events.createSubscription({
eventTypes: ['BOOKING_CONFIRMED', 'BOOKING_CANCELLED'],
webhookUrl: 'https://your-app.com/events',
isActive: true,
});
Service configuration
Processor types and defaults
PlatformXe has 7 configurable processor types. Each controls runtime behaviour for its service.
| Processor | Key settings | Default values |
|---|---|---|
| Messaging | retryMaxAttempts, retryDelayMs, deadLetterAfter | 3 attempts, 2000ms delay, dead-letter after 5 failures |
| Storage | maxFileSizeMb, allowedMimeTypes, moderationEnabled | 10MB, all types, moderation off |
| OCR | confidenceThreshold, supportedDocumentTypes | 0.80 threshold, all document types |
| PDF | defaultPageSize, defaultMargins | A4, 20/15/20/15 margins |
| QR | defaultSize, defaultFormat, brandColor | 256px, PNG, black |
| Exports | maxConcurrentJobs, retentionDays | 2 concurrent, 7 day retention |
| Identity | retryOnFailure, maxRetries, cacheTtlSeconds | Retry on, 2 retries, 3600s cache |
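To reason about what a tenant ends up with after a partial update, it may help to model the defaults and overlay the overrides. This is a minimal sketch using the Storage defaults from the table above. It assumes that settings omitted from an update keep their default values, which this runbook does not confirm; `StorageConfig`, `STORAGE_DEFAULTS`, and `withDefaults` are illustrative names, not SDK types.

```typescript
// Storage defaults taken from the table above ("all types" is modelled as 'all').
// StorageConfig, STORAGE_DEFAULTS and withDefaults are hypothetical, not SDK exports.
interface StorageConfig {
  maxFileSizeMb: number;
  allowedMimeTypes: string[] | 'all';
  moderationEnabled: boolean;
}

const STORAGE_DEFAULTS: StorageConfig = {
  maxFileSizeMb: 10,
  allowedMimeTypes: 'all',
  moderationEnabled: false,
};

// Shallow-merge overrides on top of defaults; unspecified keys keep their default.
// ASSUMPTION: the service merges partial configs this way.
function withDefaults<T extends object>(defaults: T, overrides: Partial<T>): T {
  return { ...defaults, ...overrides };
}

const effectiveStorage = withDefaults(STORAGE_DEFAULTS, { maxFileSizeMb: 50, moderationEnabled: true });
console.log(effectiveStorage);
// maxFileSizeMb and moderationEnabled are overridden; allowedMimeTypes stays 'all'
```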
Recommended configurations by industry
Property management:
// Storage: large files, image moderation on
await client.storage.updateProcessor({
enabled: true,
config: { maxFileSizeMb: 50, moderationEnabled: true },
});
// PDF: A4 page size (margins stay at their defaults unless defaultMargins is set)
await client.pdf.updateProcessor({
enabled: true,
config: { defaultPageSize: 'A4' },
});
Healthcare:
// Identity: cache TTL set to 0, retries raised to 5
await client.identity.updateProcessor({
enabled: true,
config: { retryOnFailure: true, maxRetries: 5, cacheTtlSeconds: 0 },
});
// OCR: high confidence for medical documents
await client.ocr.updateProcessor({
enabled: true,
config: { confidenceThreshold: 0.95 },
});
Legal services:
// Exports: longer retention for compliance
await client.exports.updateProcessor({
enabled: true,
config: { retentionDays: 90, maxConcurrentJobs: 1 },
});
Monitoring and health
Health check endpoint
const health = await client.healthCheck();
// health.status: "ok" | "degraded" | "down"
// health.timestamp: ISO 8601
# Python SDK
health = client.health_check()
// Go SDK
health, err := client.HealthCheck()
Usage monitoring
Track consumption against plan limits on a monthly basis.
const usage = await client.usage.summary({ month: '2026-04' });
console.log(`Emails: ${usage.emailsSent}`);
console.log(`API calls: ${usage.apiCalls}`);
console.log(`Storage: ${usage.storageUsedMb}MB`);
console.log(`Permission checks: ${usage.permissionChecks}`);
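A usage summary is most useful when compared against the plan's limits. The sketch below flags metrics that have reached a given fraction of their limit. The field names come from the snippet above; the limit values are made-up example numbers rather than real PlatformXe plan limits, and `metricsNearLimit` is a hypothetical helper.

```typescript
// Field names mirror those read from the usage summary above.
interface UsageSummary {
  emailsSent: number;
  apiCalls: number;
  storageUsedMb: number;
  permissionChecks: number;
}

// Returns the metrics whose usage is at or above `threshold` (a fraction) of the limit.
function metricsNearLimit(usage: UsageSummary, limits: UsageSummary, threshold = 0.8): string[] {
  const metrics = Object.keys(limits) as (keyof UsageSummary)[];
  return metrics.filter((m) => usage[m] >= limits[m] * threshold);
}

// Example numbers only; substitute your plan's actual limits.
const exampleLimits: UsageSummary = { emailsSent: 10000, apiCalls: 100000, storageUsedMb: 5000, permissionChecks: 500000 };
const exampleUsage: UsageSummary = { emailsSent: 8500, apiCalls: 42000, storageUsedMb: 4900, permissionChecks: 120000 };

console.log(metricsNearLimit(exampleUsage, exampleLimits));
// prints [ 'emailsSent', 'storageUsedMb' ]
```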
Webhook delivery monitoring
Check webhook delivery health by inspecting the webhook resource and testing delivery.
const webhooks = await client.webhooks.list();
for (const wh of webhooks.data.webhooks) {
console.log(`${wh.name}: ${wh.isActive ? 'active' : 'disabled'}`);
}
Event log monitoring
Monitor event processing by querying the event log.
const log = await client.events.log({
from: new Date(Date.now() - 3600000).toISOString(), // last hour
limit: '100',
});
Contextual messaging operations
Channel lifecycle rules configuration
Lifecycle rules control automatic thread state transitions. Configure them on channel creation or update.
Auto-close rules close threads when the associated entity reaches a terminal status:
await client.threads.updateChannel('ch_abc', {
lifecycleRules: {
autoClose: {
onEntityStatus: ['CHECKED_OUT', 'CANCELLED', 'EXPIRED'],
},
autoArchive: {
afterClosedDays: 90,
},
inactivityClose: {
afterDays: 30,
warningBeforeDays: 3,
},
},
});
Your application forwards entity status changes to trigger lifecycle evaluation:
await client.threads.entityEvent({
channelSlug: 'booking',
entityId: 'BK-2026-00451',
event: 'STATUS_CHANGED',
newStatus: 'CHECKED_OUT',
});
// If 'CHECKED_OUT' is in the autoClose list, the thread closes automatically
Escalation rule authoring
Escalation rules use JSON Logic conditions to match against flag data and trigger actions automatically.
Condition format (JSON Logic):
{
"in": [{ "var": "flag.reason" }, ["SAFETY", "EMERGENCY"]]
}
{
"and": [
{ "in": [{ "var": "flag.reason" }, ["DISPUTE"]] },
{ "==": [{ "var": "flag.severity" }, "HIGH"] }
]
}
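To test a condition locally before saving a rule, a small evaluator for the operators used above (var, in, ==, and) is enough. This is a minimal sketch and not the evaluator PlatformXe runs: it returns a plain boolean for and, whereas full JSON Logic returns the first falsy or last operand, and it supports no other operators. The sample flag object is an assumed shape based on the flag.reason and flag.severity paths above.

```typescript
// Resolve a dotted path such as "flag.reason" against the data object.
function getVar(data: unknown, path: string): unknown {
  return path.split('.').reduce<unknown>(
    (acc, key) => (acc !== null && typeof acc === 'object' ? (acc as Record<string, unknown>)[key] : undefined),
    data,
  );
}

// Evaluate a rule built from var, in, == and and. Anything else throws.
function evaluate(rule: unknown, data: unknown): unknown {
  if (Array.isArray(rule)) return rule.map((r) => evaluate(r, data));
  if (rule === null || typeof rule !== 'object') return rule; // literal value
  const entries = Object.entries(rule as Record<string, unknown>);
  if (entries.length !== 1) throw new Error('A rule must have exactly one operator');
  const [op, args] = entries[0]!;
  if (op === 'var') return getVar(data, String(args));
  const values = (Array.isArray(args) ? args : [args]).map((a) => evaluate(a, data));
  switch (op) {
    case 'in': return Array.isArray(values[1]) && values[1].includes(values[0]);
    case '==': return values[0] == values[1]; // JSON Logic "==" is loose equality
    case 'and': return values.every(Boolean); // simplified: always a boolean
    default: throw new Error(`Unsupported operator: ${op}`);
  }
}

const sampleFlag = { flag: { reason: 'DISPUTE', severity: 'HIGH' } };
const safetyRule = { in: [{ var: 'flag.reason' }, ['SAFETY', 'EMERGENCY']] };
const disputeRule = {
  and: [
    { in: [{ var: 'flag.reason' }, ['DISPUTE']] },
    { '==': [{ var: 'flag.severity' }, 'HIGH'] },
  ],
};

console.log(evaluate(safetyRule, sampleFlag));  // false
console.log(evaluate(disputeRule, sampleFlag)); // true
```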
Action types:
| Action | Description | Config fields |
|---|---|---|
CREATE_ISSUE | Create an issue in the connected issue tracker | title, priority, assignee |
NOTIFY_WEBHOOK | Send a webhook notification | webhookUrl, includeThread |
SEND_EMAIL | Send an alert email | to, templateId |
ASSIGN_AGENT | Auto-assign a platform agent | agentPool, strategy |
CLOSE_THREAD | Force-close the thread | reason |
Autonomous escalation setup
Configure rules for safety, cleanliness, and refund scenarios that trigger without human intervention.
await client.threads.setEscalationConfig('ch_abc', {
flagReasons: [
{ code: 'SAFETY', label: 'Safety concern', severity: 'HIGH' },
{ code: 'CLEANLINESS', label: 'Cleanliness issue', severity: 'MEDIUM' },
{ code: 'REFUND', label: 'Refund request', severity: 'LOW' },
{ code: 'EMERGENCY', label: 'Emergency', severity: 'HIGH' },
],
rules: [
{
id: 'rule-safety',
name: 'Safety auto-escalation',
trigger: 'PARTICIPANT_FLAG',
conditions: { in: [{ var: 'flag.reason' }, ['SAFETY', 'EMERGENCY']] },
actions: [
{ type: 'CREATE_ISSUE', config: { title: 'SAFETY: {{thread.subject}}', priority: 'URGENT' } },
{ type: 'NOTIFY_WEBHOOK', config: { webhookUrl: 'https://ops.example.com/safety-alerts' } },
],
priority: 1,
isActive: true,
},
{
id: 'rule-cleanliness',
name: 'Cleanliness follow-up',
trigger: 'PARTICIPANT_FLAG',
conditions: { in: [{ var: 'flag.reason' }, ['CLEANLINESS']] },
actions: [
{ type: 'ASSIGN_AGENT', config: { agentPool: 'housekeeping', strategy: 'round-robin' } },
],
priority: 2,
isActive: true,
},
{
id: 'rule-refund',
name: 'Refund request routing',
trigger: 'PARTICIPANT_FLAG',
conditions: { in: [{ var: 'flag.reason' }, ['REFUND']] },
actions: [
{ type: 'SEND_EMAIL', config: { to: 'refunds@example.com', templateId: 'tmpl_refund_alert' } },
],
priority: 3,
isActive: true,
},
],
});
Thread lifecycle processing
Inactivity close: Threads with no messages for the configured afterDays period are closed automatically. A system warning message is sent warningBeforeDays days before closure.
Auto-archive: Closed threads are archived after afterClosedDays. Archived threads remain queryable but are excluded from inbox views.
Retention: Message and thread data is retained per the organization's data retention policy. Archived thread content is immutable.
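The timing implied by these rules can be worked out ahead of time, for example to tell a participant when a quiet thread will close. The sketch below assumes the inactivity window is measured from the last message and the archive window from the moment of closure, in whole 24-hour days; `inactivitySchedule` and `archiveAt` are hypothetical helpers, not SDK functions.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface InactivityClose {
  afterDays: number;
  warningBeforeDays: number;
}

// Given the last message time, compute when the warning and the close are due.
function inactivitySchedule(lastMessageAt: Date, rule: InactivityClose): { warnAt: Date; closeAt: Date } {
  const closeAt = new Date(lastMessageAt.getTime() + rule.afterDays * DAY_MS);
  const warnAt = new Date(closeAt.getTime() - rule.warningBeforeDays * DAY_MS);
  return { warnAt, closeAt };
}

// Archive time is measured from the moment the thread closed.
function archiveAt(closedAt: Date, afterClosedDays: number): Date {
  return new Date(closedAt.getTime() + afterClosedDays * DAY_MS);
}

const schedule = inactivitySchedule(new Date('2026-04-01T00:00:00Z'), { afterDays: 30, warningBeforeDays: 3 });
console.log(schedule.warnAt.toISOString());  // 2026-04-28T00:00:00.000Z
console.log(schedule.closeAt.toISOString()); // 2026-05-01T00:00:00.000Z
console.log(archiveAt(schedule.closeAt, 90).toISOString()); // 2026-07-30T00:00:00.000Z
```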
Audit trail verification
Every thread action produces an audit event. Query the event log to verify the trail.
const log = await client.events.log({
eventType: 'THREAD_',
entityId: 'th-001',
});
// Returns: THREAD_CREATED, THREAD_MESSAGE_SENT, THREAD_CLOSED, etc.
Troubleshooting
Common error codes
| Error code | HTTP | Cause | Resolution |
|---|---|---|---|
| UNAUTHORIZED | 401 | Missing or invalid API key | Verify the x-api-key header value |
| FORBIDDEN | 403 | API key lacks required scope | Add the missing scope to the API key |
| PLAN_REQUIRED | 403 | Feature requires a higher plan | Upgrade tenant plan (e.g., Federation requires Enterprise) |
| NOT_FOUND | 404 | Resource does not exist | Verify the resource ID |
| RATE_LIMITED | 429 | Rate limit exceeded | Back off and retry. See rate limits below |
| VALIDATION_ERROR | 400 | Invalid request body | Check the error message for field-level details |
| CONFLICT | 409 | Duplicate or conflicting state | Check for existing resources with the same unique fields |
| PROVIDER_ERROR | 502 | Upstream provider failure | Retry; the circuit breaker will fail over automatically |
| PROCESSOR_DISABLED | 400 | Service processor is disabled | Enable the processor via updateProcessor |
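In client code, the main decision is whether a failed call is worth retrying. The sketch below groups the codes from the table above by the kind of resolution the table suggests; only RATE_LIMITED and PROVIDER_ERROR are treated as retryable. The grouping labels and the `isRetryable` helper are illustrative and not part of the SDK.

```typescript
// Error codes from the table above, grouped by suggested resolution.
// The ErrorAction labels are hypothetical categories chosen for this sketch.
type ErrorAction = 'retry' | 'fix-request' | 'fix-credentials' | 'change-config';

const ERROR_ACTIONS: Record<string, ErrorAction> = {
  UNAUTHORIZED: 'fix-credentials',
  FORBIDDEN: 'fix-credentials',
  PLAN_REQUIRED: 'change-config',
  NOT_FOUND: 'fix-request',
  RATE_LIMITED: 'retry',
  VALIDATION_ERROR: 'fix-request',
  CONFLICT: 'fix-request',
  PROVIDER_ERROR: 'retry',
  PROCESSOR_DISABLED: 'change-config',
};

// Unknown codes are treated as non-retryable to avoid retry loops on new errors.
function isRetryable(code: string): boolean {
  return ERROR_ACTIONS[code] === 'retry';
}

console.log(isRetryable('PROVIDER_ERROR'));   // true
console.log(isRetryable('VALIDATION_ERROR')); // false
```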
Rate limiting behaviour
| Route class | Limit | Scope |
|---|---|---|
| Permission checks (check, resolve, batch) | 5,000/hr | permissions:check |
| Permission mutations (CRUD) | 500/hr | permissions:manage |
| Permission audit (logs, export) | 100/hr | permissions:audit |
| All other routes | 1,000/hr | Per API key |
When a request is rate limited, the API returns a 429 response with a Retry-After header giving the number of seconds until the next allowed request. SDKs with retry enabled (the default) handle this automatically with exponential backoff.
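For callers that do not use an SDK, or that disable its retry, the wait time can be computed by hand. This is a minimal sketch and not the SDK's actual algorithm: it prefers the Retry-After value when one is present and otherwise falls back to capped exponential backoff. The `baseMs` and `maxMs` defaults are assumptions chosen for illustration.

```typescript
// Compute how long to wait before retry number `attempt` (0-based).
// retryAfter is the raw Retry-After header value in seconds, or null if absent.
// baseMs and maxMs are illustrative defaults, not documented SDK settings.
function retryDelayMs(attempt: number, retryAfter: string | null, baseMs = 1000, maxMs = 60000): number {
  if (retryAfter !== null && retryAfter.trim() !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  // No usable header: double the delay on each attempt, up to maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

console.log(retryDelayMs(0, '30')); // 30000
console.log(retryDelayMs(3, null)); // 8000
```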
Circuit breaker states
PlatformXe uses circuit breakers for external provider calls (email, SMS, identity resolution). The three states are:
| State | Behaviour |
|---|---|
| CLOSED | Normal operation. Requests go to the primary provider |
| OPEN | Primary provider has failed repeatedly. Requests route to the next provider in the fallback chain |
| HALF_OPEN | Testing if the primary provider has recovered. A small percentage of requests probe the primary |
Circuit breakers reset automatically. No manual intervention is required. The health check endpoint reflects provider circuit breaker states.
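The three states can be pictured as a small state machine. The sketch below is a generic circuit breaker, not PlatformXe's implementation: the failure threshold, the cooldown, and the rule that a single failed probe reopens the breaker are all assumptions, since the runbook does not document them.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Minimal three-state breaker. failureThreshold and cooldownMs are illustrative values.
class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold = 3, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Current state; an OPEN breaker moves to HALF_OPEN once the cooldown elapses.
  getState(now: number): BreakerState {
    if (this.state === 'OPEN' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state;
  }

  // A successful call (including a successful probe) closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  // A failed probe, or too many consecutive failures, opens the breaker.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}

const breaker = new CircuitBreaker(2, 1000);
breaker.recordFailure(0);
breaker.recordFailure(10);
console.log(breaker.getState(20));   // OPEN
console.log(breaker.getState(1500)); // HALF_OPEN
breaker.recordSuccess();
console.log(breaker.getState(1600)); // CLOSED
```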
Escalation action failures
If an escalation rule action fails (e.g., webhook timeout, email delivery failure):
- The action failure is logged in the event log
- The flag remains in PENDING state
- The action is retried up to 3 times with exponential backoff
- After all retries fail, the action is marked FAILED and a THREAD_ESCALATION_FAILED event is emitted
- Manual intervention: review the flag and re-trigger escalation or process manually
// Query for failed escalation actions
const log = await client.events.log({
eventType: 'THREAD_ESCALATION_FAILED',
});
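The retry behaviour described above can be modelled to see when an action ends up FAILED. This sketch reads "retried up to 3 times" as one initial attempt plus three retries, which is an interpretation rather than documented behaviour; the one-second base delay is also an assumption. Delays are returned instead of slept so the schedule is easy to inspect.

```typescript
type ActionStatus = 'SUCCEEDED' | 'FAILED';

// Runs `attempt` once, then retries up to maxRetries times with doubling delays.
// runWithRetries is a hypothetical model, not the platform's scheduler.
function runWithRetries(
  attempt: () => boolean,
  maxRetries = 3,
  baseDelayMs = 1000,
): { status: ActionStatus; attempts: number; delaysMs: number[] } {
  const delaysMs: number[] = [];
  for (let retry = 0; retry <= maxRetries; retry++) {
    if (attempt()) return { status: 'SUCCEEDED', attempts: retry + 1, delaysMs };
    // Schedule the next retry unless the retry budget is exhausted.
    if (retry < maxRetries) delaysMs.push(baseDelayMs * 2 ** retry);
  }
  return { status: 'FAILED', attempts: maxRetries + 1, delaysMs };
}

const alwaysFails = runWithRetries(() => false);
console.log(alwaysFails.status);   // FAILED
console.log(alwaysFails.delaysMs); // [ 1000, 2000, 4000 ]
```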
Provider failover chain
For messaging services, PlatformXe uses a multi-provider fallback chain. If the primary provider fails, requests automatically route to the next available provider. The order is configured per tenant and is not exposed publicly.
Failed messages enter a persistent retry queue. Monitor queue health through the usage summary endpoint.
Terraform operations
Initial workflow
# 1. Initialize the provider
terraform init
# 2. Preview changes
terraform plan -var="platformxe_api_key=pxk_live_..."
# 3. Apply changes
terraform apply -var="platformxe_api_key=pxk_live_..."
The -var form above is shown for brevity. In practice, store your API key in a .tfvars file or an environment variable rather than passing it on the command line, where it can end up in shell history.
# Using environment variable
export PLATFORMXE_API_KEY="pxk_live_your_key_here"
terraform plan
terraform apply
Resource import for existing infrastructure
If you have resources already created through the portal or SDK, import them into Terraform state before managing them as code.
# Import a role
terraform import platformxe_permissions_role.agent role_abc123
# Import a channel
terraform import platformxe_threads_channel.booking ch_abc123
# Import a processor
terraform import platformxe_storage_processor.config proc_abc123
After importing, run terraform plan to verify the imported state matches your configuration. Fix any drift before applying new changes.
State management best practices
- Remote state: Use a remote backend (S3, GCS, Terraform Cloud) for team environments.
- State locking: Enable state locking to prevent concurrent modifications.
- Workspaces: Use separate workspaces for staging and production tenants.
- Sensitive values: Mark API keys as sensitive = true in variable definitions.
variable "platformxe_api_key" {
type = string
sensitive = true
}
Processor resource lifecycle
Processor resources are singletons per service per organization. They are created on first apply and updated in place on subsequent applies. Destroying a processor resource resets it to default values (it does not disable the service).
# View current processor state
terraform state show platformxe_storage_processor.config
# Refresh state from remote (replaces the deprecated "terraform refresh" command)
terraform apply -refresh-only
Handling plan changes
When changing your PlatformXe plan (e.g., upgrading from Basic to Enterprise), some resources may become available or unavailable. Run terraform plan after plan changes to detect drift:
terraform plan
# If federation resources are now available, they will show as "to create"