Operations Runbook

Comprehensive operational guide for tenant onboarding, service configuration, monitoring, contextual messaging operations, troubleshooting, and Terraform workflows.

This guide covers day-to-day operational procedures for PlatformXe, from initial tenant setup through service configuration, monitoring, and troubleshooting.

Tenant onboarding

API key creation and scope assignment

Every tenant receives one or more API keys with scoped access. Create keys through the portal or the admin dashboard.

// TypeScript SDK -- verify key works
const client = new PlatformXeClient({
  apiKey: 'pxk_live_your_key_here',
});

const health = await client.healthCheck();
console.log(health.status); // "ok"

# Python SDK
from platformxe import PlatformXeClient

client = PlatformXeClient(api_key="pxk_live_your_key_here")

// Go SDK
client := platformxe.NewClient(platformxe.ClientConfig{
    APIKey: "pxk_live_your_key_here",
})

Recommended scope sets by use case:

Use case	Scopes
Full-stack app	`messaging:send`, `storage:upload`, `storage:read`, `permissions:check`, `events:read`
Authorization only	`permissions:check`, `permissions:manage`, `permissions:audit`
Background jobs	`exports:create`, `events:manage`, `webhooks:manage`
Read-only dashboard	`storage:read`, `events:read`, `permissions:check`
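
Before handing a key to a team, it can help to confirm that its scopes cover the intended use case. The following is a minimal sketch in plain TypeScript rather than an SDK call; `USE_CASE_SCOPES` and `missingScopes` are hypothetical names, and the scope sets are copied from the table above.

```typescript
// Hypothetical helper: scope sets transcribed from the table above.
const USE_CASE_SCOPES: Record<string, string[]> = {
  'full-stack': ['messaging:send', 'storage:upload', 'storage:read', 'permissions:check', 'events:read'],
  'authorization-only': ['permissions:check', 'permissions:manage', 'permissions:audit'],
  'background-jobs': ['exports:create', 'events:manage', 'webhooks:manage'],
  'read-only-dashboard': ['storage:read', 'events:read', 'permissions:check'],
};

// Returns the scopes a key still needs for the given use case, in table order.
function missingScopes(useCase: string, granted: string[]): string[] {
  const required = USE_CASE_SCOPES[useCase];
  if (required === undefined) {
    throw new Error(`Unknown use case: ${useCase}`);
  }
  const have = new Set(granted);
  return required.filter((scope) => !have.has(scope));
}
```

An empty result means the key already covers the use case; any returned scope is one to add in the portal or admin dashboard.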

Service processor initial configuration

After creating an API key, configure processors for each service your tenant will use.

// Enable and configure storage
await client.storage.updateProcessor({
  enabled: true,
  config: {
    maxFileSizeMb: 25,
    allowedMimeTypes: ['image/jpeg', 'image/png', 'application/pdf'],
    moderationEnabled: true,
  },
});

// Enable OCR
await client.ocr.updateProcessor({
  enabled: true,
  config: {
    confidenceThreshold: 0.85,
    supportedDocumentTypes: ['NIN_SLIP', 'DRIVERS_LICENSE'],
  },
});

// Enable identity resolution
await client.identity.updateProcessor({
  enabled: true,
  config: { retryOnFailure: true, maxRetries: 3 },
});

# Python equivalents
client.storage.update_processor(enabled=True, config={"maxFileSizeMb": 25})
client.ocr.update_processor(enabled=True, config={"confidenceThreshold": 0.85})
client.identity.update_processor(enabled=True, config={"retryOnFailure": True})

// Go equivalents
client.Storage.UpdateProcessor(map[string]interface{}{"enabled": true, "config": map[string]interface{}{"maxFileSizeMb": 25}})
client.Ocr.UpdateProcessor(map[string]interface{}{"enabled": true, "config": map[string]interface{}{"confidenceThreshold": 0.85}})

Channel setup for contextual messaging

Channels must exist before threads can be opened in them. Each channel maps to a domain entity type.

await client.threads.createChannel({
  slug: 'booking',
  displayName: 'Booking Conversations',
  entityType: 'BOOKING',
  participantRoles: ['GUEST', 'HOST', 'PLATFORM'],
  defaultVisibility: ['ALL'],
  lifecycleRules: {
    autoClose: { onEntityStatus: ['CHECKED_OUT', 'CANCELLED'] },
    autoArchive: { afterClosedDays: 90 },
  },
});

Webhook and event subscription setup

// Create a webhook endpoint
const webhook = await client.webhooks.create({
  url: 'https://your-app.com/webhooks/platformxe',
  events: ['email.message.*', 'permissions.role.*'],
  secret: 'whsec_your_signing_secret',
});

// Create an event subscription
await client.events.createSubscription({
  eventTypes: ['BOOKING_CONFIRMED', 'BOOKING_CANCELLED'],
  webhookUrl: 'https://your-app.com/events',
  isActive: true,
});
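
The webhook `events` array above uses trailing wildcards such as `email.message.*`. The exact matching rules are not spelled out in this runbook, so the sketch below assumes that a trailing `.*` matches any event name beginning with the prefix before it, and that any other pattern must match exactly. `matchesEventPattern` and `isSubscribed` are hypothetical helpers for checking a subscription list locally.

```typescript
// Assumed semantics: "a.b.*" matches "a.b.<anything non-empty>"; other patterns match exactly.
function matchesEventPattern(pattern: string, eventName: string): boolean {
  if (pattern.endsWith('.*')) {
    const prefix = pattern.slice(0, -1); // keep the trailing dot, e.g. "email.message."
    return eventName.startsWith(prefix) && eventName.length > prefix.length;
  }
  return pattern === eventName;
}

// True if any pattern in the subscription list matches the event name.
function isSubscribed(patterns: string[], eventName: string): boolean {
  return patterns.some((pattern) => matchesEventPattern(pattern, eventName));
}
```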

Service configuration

Processor types and defaults

PlatformXe has 7 configurable processor types. Each controls runtime behaviour for its service.

Processor	Key settings	Default values
Messaging	`retryMaxAttempts`, `retryDelayMs`, `deadLetterAfter`	3 attempts, 2000ms delay, dead-letter after 5 failures
Storage	`maxFileSizeMb`, `allowedMimeTypes`, `moderationEnabled`	10MB, all types, moderation off
OCR	`confidenceThreshold`, `supportedDocumentTypes`	0.80 threshold, all document types
PDF	`defaultPageSize`, `defaultMargins`	A4, 20/15/20/15 margins
QR	`defaultSize`, `defaultFormat`, `brandColor`	256px, PNG, black
Exports	`maxConcurrentJobs`, `retentionDays`	2 concurrent, 7 day retention
Identity	`retryOnFailure`, `maxRetries`, `cacheTtlSeconds`	Retry on, 2 retries, 3600s cache
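
To reason about what a tenant ends up with after a partial `updateProcessor` call, the defaults can be overlaid with the tenant's overrides. This sketch assumes that settings you do not specify keep the default from the table above, which the table implies but does not state outright. `PROCESSOR_DEFAULTS` and `effectiveConfig` are hypothetical names, and only three processors are transcribed for brevity.

```typescript
type Config = Record<string, unknown>;

// Defaults transcribed from the table above (subset).
const PROCESSOR_DEFAULTS: Record<string, Config> = {
  storage: { maxFileSizeMb: 10, moderationEnabled: false },
  ocr: { confidenceThreshold: 0.8 },
  identity: { retryOnFailure: true, maxRetries: 2, cacheTtlSeconds: 3600 },
};

// Overlay tenant overrides on the defaults; unspecified keys keep their default.
function effectiveConfig(processor: string, overrides: Config): Config {
  const defaults = PROCESSOR_DEFAULTS[processor];
  if (defaults === undefined) {
    throw new Error(`No defaults recorded for processor: ${processor}`);
  }
  return { ...defaults, ...overrides };
}
```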

Recommended configurations by industry

Property management:

// Storage: large files, image moderation on
await client.storage.updateProcessor({
  enabled: true,
  config: { maxFileSizeMb: 50, moderationEnabled: true },
});

// PDF: A4 page size; margins stay at their defaults
await client.pdf.updateProcessor({
  enabled: true,
  config: { defaultPageSize: 'A4' },
});

Healthcare:

// Identity: cache TTL set to 0 so lookups are not cached, retry enabled
await client.identity.updateProcessor({
  enabled: true,
  config: { retryOnFailure: true, maxRetries: 5, cacheTtlSeconds: 0 },
});

// OCR: high confidence for medical documents
await client.ocr.updateProcessor({
  enabled: true,
  config: { confidenceThreshold: 0.95 },
});

Legal services:

// Exports: longer retention for compliance
await client.exports.updateProcessor({
  enabled: true,
  config: { retentionDays: 90, maxConcurrentJobs: 1 },
});

Monitoring and health

Health check endpoint

const health = await client.healthCheck();
// health.status: "ok" | "degraded" | "down"
// health.timestamp: ISO 8601

health = client.health_check()

health, err := client.HealthCheck()

Usage monitoring

Track consumption against plan limits on a monthly basis.

const usage = await client.usage.summary({ month: '2026-04' });

console.log(`Emails: ${usage.emailsSent}`);
console.log(`API calls: ${usage.apiCalls}`);
console.log(`Storage: ${usage.storageUsedMb}MB`);
console.log(`Permission checks: ${usage.permissionChecks}`);
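
To turn the summary into an alert, compare each figure with the tenant's plan limit. The sketch below is illustrative only: the plan limits are not part of the usage response shown above, so `limits` is a hypothetical object you would supply yourself, and `metricsNearLimit` is a hypothetical helper.

```typescript
interface UsageSummary {
  emailsSent: number;
  apiCalls: number;
  storageUsedMb: number;
  permissionChecks: number;
}

// Returns the metrics whose usage is at or above `threshold` (a fraction of the limit).
function metricsNearLimit(
  usage: UsageSummary,
  limits: UsageSummary, // hypothetical: plan limits in the same units as the usage fields
  threshold = 0.8,
): string[] {
  const keys: (keyof UsageSummary)[] = ['emailsSent', 'apiCalls', 'storageUsedMb', 'permissionChecks'];
  return keys.filter((key) => limits[key] > 0 && usage[key] / limits[key] >= threshold);
}
```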

Webhook delivery monitoring

Check webhook delivery health by inspecting the webhook resource and testing delivery.

const webhooks = await client.webhooks.list();

for (const wh of webhooks.data.webhooks) {
  console.log(`${wh.name}: ${wh.isActive ? 'active' : 'disabled'}`);
}

Event log monitoring

Monitor event processing by querying the event log.

const log = await client.events.log({
  from: new Date(Date.now() - 3600000).toISOString(), // last hour
  limit: '100',
});

Contextual messaging operations

Channel lifecycle rules configuration

Lifecycle rules control automatic thread state transitions. Configure them on channel creation or update.

Auto-close rules close threads when the associated entity reaches a terminal status. The example below also sets auto-archive and inactivity-close rules:

await client.threads.updateChannel('ch_abc', {
  lifecycleRules: {
    autoClose: {
      onEntityStatus: ['CHECKED_OUT', 'CANCELLED', 'EXPIRED'],
    },
    autoArchive: {
      afterClosedDays: 90,
    },
    inactivityClose: {
      afterDays: 30,
      warningBeforeDays: 3,
    },
  },
});

Your application forwards entity status changes to trigger lifecycle evaluation:

await client.threads.entityEvent({
  channelSlug: 'booking',
  entityId: 'BK-2026-00451',
  event: 'STATUS_CHANGED',
  newStatus: 'CHECKED_OUT',
});
// If 'CHECKED_OUT' is in the autoClose list, the thread closes automatically
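
The auto-close decision described above can be modelled locally, for example in a unit test that checks which statuses your application must forward. This is a sketch of the documented behaviour under the assumption that only `STATUS_CHANGED` events are evaluated; `shouldAutoClose` is a hypothetical helper, not an SDK method.

```typescript
interface LifecycleRules {
  autoClose?: { onEntityStatus: string[] };
}

interface EntityEvent {
  event: string;
  newStatus?: string;
}

// True when the event carries a status listed in the channel's autoClose rule.
function shouldAutoClose(rules: LifecycleRules, evt: EntityEvent): boolean {
  if (evt.event !== 'STATUS_CHANGED' || evt.newStatus === undefined) {
    return false;
  }
  return rules.autoClose?.onEntityStatus.includes(evt.newStatus) ?? false;
}
```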

Escalation rule authoring

Escalation rules use JSON Logic conditions to match against flag data and trigger actions automatically.

Condition format (JSON Logic):

{
  "in": [{ "var": "flag.reason" }, ["SAFETY", "EMERGENCY"]]
}

{
  "and": [
    { "in": [{ "var": "flag.reason" }, ["DISPUTE"]] },
    { "==": [{ "var": "flag.severity" }, "HIGH"] }
  ]
}
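
To test a condition before saving it, a small evaluator for the four operators used above (`var`, `in`, `==`, `and`) is enough. The sketch below is not the reference JSON Logic implementation: it compares with strict equality, `and` returns a boolean instead of the last evaluated value, and every other operator throws. `evaluate` and `getVar` are hypothetical names.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Resolve a dotted path such as "flag.reason" against the data object.
function getVar(path: string, data: Json): Json {
  let current: Json = data;
  for (const key of path.split('.')) {
    if (current === null || typeof current !== 'object' || Array.isArray(current)) {
      return null;
    }
    current = current[key] ?? null;
  }
  return current;
}

// Evaluate a rule using only the operators shown in the examples above.
function evaluate(rule: Json, data: Json): Json {
  if (Array.isArray(rule)) return rule.map((item) => evaluate(item, data));
  if (rule === null || typeof rule !== 'object') return rule; // literal value
  const op = Object.keys(rule)[0];
  if (op === undefined) return rule;
  const args = evaluate(rule[op] ?? null, data);
  const list = Array.isArray(args) ? args : [args];
  const first = list[0] ?? null;
  const second = list[1] ?? null;
  switch (op) {
    case 'var':
      return typeof first === 'string' ? getVar(first, data) : null;
    case '==':
      return first === second; // strict; reference JSON Logic uses loose equality
    case 'in':
      return Array.isArray(second) && second.includes(first);
    case 'and':
      return list.every((value) => Boolean(value));
    default:
      throw new Error(`Unsupported operator: ${op}`);
  }
}
```

Calling `evaluate(condition, { flag: { reason: 'SAFETY', severity: 'HIGH' } })` with the first condition above should yield `true`; the shape of the data object is an assumption based on the `flag.reason` and `flag.severity` paths.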

Action types:

Action	Description	Config fields
`CREATE_ISSUE`	Create an issue in the connected issue tracker	`title`, `priority`, `assignee`
`NOTIFY_WEBHOOK`	Send a webhook notification	`webhookUrl`, `includeThread`
`SEND_EMAIL`	Send an alert email	`to`, `templateId`
`ASSIGN_AGENT`	Auto-assign a platform agent	`agentPool`, `strategy`
`CLOSE_THREAD`	Force-close the thread	`reason`

Autonomous escalation setup

Configure rules for safety, cleanliness, and refund scenarios that trigger without human intervention.

await client.threads.setEscalationConfig('ch_abc', {
  flagReasons: [
    { code: 'SAFETY', label: 'Safety concern', severity: 'HIGH' },
    { code: 'CLEANLINESS', label: 'Cleanliness issue', severity: 'MEDIUM' },
    { code: 'REFUND', label: 'Refund request', severity: 'LOW' },
    { code: 'EMERGENCY', label: 'Emergency', severity: 'HIGH' },
  ],
  rules: [
    {
      id: 'rule-safety',
      name: 'Safety auto-escalation',
      trigger: 'PARTICIPANT_FLAG',
      conditions: { in: [{ var: 'flag.reason' }, ['SAFETY', 'EMERGENCY']] },
      actions: [
        { type: 'CREATE_ISSUE', config: { title: 'SAFETY: {{thread.subject}}', priority: 'URGENT' } },
        { type: 'NOTIFY_WEBHOOK', config: { webhookUrl: 'https://ops.example.com/safety-alerts' } },
      ],
      priority: 1,
      isActive: true,
    },
    {
      id: 'rule-cleanliness',
      name: 'Cleanliness follow-up',
      trigger: 'PARTICIPANT_FLAG',
      conditions: { in: [{ var: 'flag.reason' }, ['CLEANLINESS']] },
      actions: [
        { type: 'ASSIGN_AGENT', config: { agentPool: 'housekeeping', strategy: 'round-robin' } },
      ],
      priority: 2,
      isActive: true,
    },
    {
      id: 'rule-refund',
      name: 'Refund request routing',
      trigger: 'PARTICIPANT_FLAG',
      conditions: { in: [{ var: 'flag.reason' }, ['REFUND']] },
      actions: [
        { type: 'SEND_EMAIL', config: { to: 'refunds@example.com', templateId: 'tmpl_refund_alert' } },
      ],
      priority: 3,
      isActive: true,
    },
  ],
});
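
A rule that references a reason code missing from `flagReasons` can never fire, so it is worth linting the config before submitting it. The sketch below is a hypothetical local check, not platform behaviour: it only understands conditions shaped like `{ in: [{ var: 'flag.reason' }, [...]] }`, as used in the rules above, and ignores any other shape.

```typescript
interface EscalationConfig {
  flagReasons: { code: string }[];
  rules: { id: string; conditions: unknown }[];
}

// Collect reason codes from conditions shaped like
// { in: [{ var: 'flag.reason' }, ['A', 'B']] }; other shapes yield an empty list.
function reasonCodes(conditions: unknown): string[] {
  if (typeof conditions !== 'object' || conditions === null) return [];
  const operands = (conditions as { in?: unknown }).in;
  if (!Array.isArray(operands) || operands.length !== 2) return [];
  const [needle, haystack] = operands;
  const isReasonVar =
    typeof needle === 'object' && needle !== null && (needle as { var?: unknown }).var === 'flag.reason';
  if (!isReasonVar || !Array.isArray(haystack)) return [];
  return haystack.filter((code): code is string => typeof code === 'string');
}

// Map each rule id to the reason codes it references that are not declared.
function undeclaredReasons(config: EscalationConfig): Record<string, string[]> {
  const declared = new Set(config.flagReasons.map((reason) => reason.code));
  const problems: Record<string, string[]> = {};
  for (const rule of config.rules) {
    const unknownCodes = reasonCodes(rule.conditions).filter((code) => !declared.has(code));
    if (unknownCodes.length > 0) problems[rule.id] = unknownCodes;
  }
  return problems;
}
```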

Thread lifecycle processing

Inactivity close: Threads with no messages for afterDays days are closed automatically. A warning system message is sent warningBeforeDays days before closure.

Auto-archive: Closed threads are archived afterClosedDays days after they close. Archived threads remain queryable but are excluded from inbox views.

Retention: Message and thread data is retained per the organization's data retention policy. Archived thread content is immutable.
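
The timing rules above reduce to simple date arithmetic. The sketch below assumes the platform counts whole 24-hour days from the last message (for inactivity close) and from the close time (for auto-archive); the exact clock the service uses is not documented here. `inactivitySchedule` and `archiveAt` are hypothetical helpers.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface InactivityClose {
  afterDays: number;
  warningBeforeDays: number;
}

// When the warning message and the inactivity close would fall due.
function inactivitySchedule(lastMessageAt: Date, rule: InactivityClose): { warnAt: Date; closeAt: Date } {
  const closeAt = new Date(lastMessageAt.getTime() + rule.afterDays * DAY_MS);
  const warnAt = new Date(closeAt.getTime() - rule.warningBeforeDays * DAY_MS);
  return { warnAt, closeAt };
}

// When a closed thread would be archived.
function archiveAt(closedAt: Date, afterClosedDays: number): Date {
  return new Date(closedAt.getTime() + afterClosedDays * DAY_MS);
}
```

With `afterDays: 30` and `warningBeforeDays: 3`, a thread whose last message arrived on day 0 would be warned on day 27 and closed on day 30.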

Audit trail verification

Every thread action produces an audit event. Query the event log to verify the trail.

const log = await client.events.log({
  eventType: 'THREAD_',
  entityId: 'th-001',
});
// Returns: THREAD_CREATED, THREAD_MESSAGE_SENT, THREAD_CLOSED, etc.

Troubleshooting

Common error codes

Error code	HTTP	Cause	Resolution
`UNAUTHORIZED`	401	Missing or invalid API key	Verify the `x-api-key` header value
`FORBIDDEN`	403	API key lacks required scope	Add the missing scope to the API key
`PLAN_REQUIRED`	403	Feature requires a higher plan	Upgrade tenant plan (e.g., Federation requires Enterprise)
`NOT_FOUND`	404	Resource does not exist	Verify the resource ID
`RATE_LIMITED`	429	Rate limit exceeded	Back off and retry. See rate limits below
`VALIDATION_ERROR`	400	Invalid request body	Check the error message for field-level details
`CONFLICT`	409	Duplicate or conflicting state	Check for existing resources with the same unique fields
`PROVIDER_ERROR`	502	Upstream provider failure	Retry; the circuit breaker will fail over automatically
`PROCESSOR_DISABLED`	400	Service processor is disabled	Enable the processor via `updateProcessor`

Rate limiting behaviour

Route class	Limit	Scope
Permission checks (check, resolve, batch)	5,000/hr	`permissions:check`
Permission mutations (CRUD)	500/hr	`permissions:manage`
Permission audit (logs, export)	100/hr	`permissions:audit`
All other routes	1,000/hr	Per API key

When a request is rate limited, the API returns a 429 response with a Retry-After header giving the number of seconds to wait before the next request. SDKs have retry enabled by default and handle this automatically with exponential backoff.
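
If you call the API without an SDK, or with SDK retries turned off, the wait time can be derived as follows. This is a sketch, not the SDK's actual algorithm: it honours a numeric Retry-After value when present and otherwise falls back to capped exponential backoff. The `baseMs` and `maxMs` defaults are assumptions, and `retryDelayMs` is a hypothetical helper.

```typescript
// Delay in milliseconds before retry number `attempt` (0-based).
// Prefers the server's Retry-After value (seconds) over the computed backoff.
function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs = 1000, // assumed starting delay
  maxMs = 60_000, // assumed cap
): number {
  const hasHeader = retryAfterHeader !== null && retryAfterHeader.trim() !== '';
  const seconds = hasHeader ? Number(retryAfterHeader) : NaN;
  if (Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```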

Circuit breaker states

PlatformXe uses circuit breakers for external provider calls (email, SMS, identity resolution). The three states are:

State	Behaviour
CLOSED	Normal operation. Requests go to the primary provider
OPEN	Primary provider has failed repeatedly. Requests route to the next provider in the fallback chain
HALF_OPEN	Testing if the primary provider has recovered. A small percentage of requests probe the primary

Circuit breakers reset automatically. No manual intervention is required. The health check endpoint reflects provider circuit breaker states.
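
The three states can be pictured as a small state machine. The sketch below illustrates the general pattern and is not PlatformXe's implementation: the failure threshold and cooldown are assumed values, and in HALF_OPEN it lets every request probe the primary, whereas the table above describes probing with only a small percentage of requests.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  // Both defaults are assumptions chosen for illustration.
  constructor(failureThreshold = 3, cooldownMs = 30_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Should the next request go to the primary provider? `now` is a millisecond timestamp.
  allowPrimary(now: number): boolean {
    if (this.state === 'OPEN' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: let requests probe the primary
    }
    return this.state !== 'OPEN';
  }

  // A successful call resets the failure count and closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  // A failed probe, or too many failures while closed, opens the breaker.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}
```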

Escalation action failures

If an escalation rule action fails (e.g., webhook timeout, email delivery failure):

1. The action failure is recorded in the event log.
2. The flag remains in the PENDING state.
3. The action is retried up to 3 times with exponential backoff.
4. If every retry fails, the action is marked FAILED and a THREAD_ESCALATION_FAILED event is emitted.
5. An operator must then intervene: review the flag and either re-trigger the escalation or handle it manually.

// Query for failed escalation actions
const log = await client.events.log({
  eventType: 'THREAD_ESCALATION_FAILED',
});
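
The retry sequence for a failing escalation action can be simulated as below. This sketch reads "retried up to 3 times" as three retries after the initial attempt, and the base delay is an assumption; neither is confirmed by the runbook. Delays are recorded rather than awaited so the example stays synchronous. `runEscalationAction` is a hypothetical helper.

```typescript
type ActionStatus = 'SUCCEEDED' | 'FAILED';

interface ActionResult {
  status: ActionStatus;
  attempts: number;
  backoffMs: number[]; // delay scheduled before each retry
}

// Run `action` once, then retry up to `maxRetries` times with exponential backoff.
function runEscalationAction(
  action: () => boolean, // returns true on success
  maxRetries = 3,
  baseDelayMs = 1000, // assumed starting delay
): ActionResult {
  const backoffMs: number[] = [];
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    if (attempt > 0) {
      backoffMs.push(baseDelayMs * 2 ** (attempt - 1));
    }
    if (action()) {
      return { status: 'SUCCEEDED', attempts: attempt + 1, backoffMs };
    }
  }
  return { status: 'FAILED', attempts: maxRetries + 1, backoffMs };
}
```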

Provider failover chain

For messaging services, PlatformXe uses a multi-provider fallback chain. If the primary provider fails, requests automatically route to the next available provider. The order is configured per tenant and is not exposed publicly.

Failed messages enter a persistent retry queue. Monitor queue health through the usage summary endpoint.
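
Conceptually, the fallback chain walks the providers in order until one accepts the message. The sketch below models that idea only; the `Provider` shape and `sendWithFailover` are hypothetical, and the real chain order and availability checks are internal to PlatformXe.

```typescript
interface Provider {
  name: string;
  available: boolean; // for example, false while its circuit breaker is OPEN
}

// Try providers in chain order. Returns the name of the provider that accepted
// the message, or null if none did (the message would then enter the retry queue).
function sendWithFailover(
  chain: Provider[],
  send: (provider: Provider) => boolean,
): string | null {
  for (const provider of chain) {
    if (!provider.available) continue;
    if (send(provider)) return provider.name;
  }
  return null;
}
```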

Terraform operations

Initial workflow

# 1. Initialize the provider
terraform init

# 2. Preview changes
terraform plan -var="platformxe_api_key=pxk_live_..."

# 3. Apply changes
terraform apply -var="platformxe_api_key=pxk_live_..."

Store your API key in a .tfvars file or environment variable rather than passing it on the command line, where it would be recorded in shell history and visible in process listings.

# Using environment variable
export PLATFORMXE_API_KEY="pxk_live_your_key_here"
terraform plan
terraform apply

Resource import for existing infrastructure

If you have resources already created through the portal or SDK, import them into Terraform state before managing them as code.

# Import a role
terraform import platformxe_permissions_role.agent role_abc123

# Import a channel
terraform import platformxe_threads_channel.booking ch_abc123

# Import a processor
terraform import platformxe_storage_processor.config proc_abc123

After importing, run terraform plan to verify the imported state matches your configuration. Fix any drift before applying new changes.

State management best practices

Remote state: Use a remote backend (S3, GCS, Terraform Cloud) for team environments.
State locking: Enable state locking to prevent concurrent modifications.
Workspaces: Use separate workspaces for staging and production tenants.
Sensitive values: Mark API keys as sensitive = true in variable definitions.

variable "platformxe_api_key" {
  type      = string
  sensitive = true
}

Processor resource lifecycle

Processor resources are singletons per service per organization. They are created on first apply and updated in place on subsequent applies. Destroying a processor resource resets it to default values (it does not disable the service).

# View current processor state
terraform state show platformxe_storage_processor.config

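# Refresh state from remote (the standalone "terraform refresh" command is deprecated since Terraform v0.15.4)
terraform apply -refresh-only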

Handling plan changes

When you change your PlatformXe plan (e.g., upgrade from Basic to Enterprise), some resources may become available or unavailable. Run terraform plan after any such change to detect drift:

terraform plan
# If federation resources are now available, they will show as "to create"