MySQL HeatWave: Identifying and Testing Failover in High Availability DB Systems

In a high availability MySQL HeatWave DB system, automatic failover is a critical mechanism that ensures business continuity when the primary instance becomes unavailable. Understanding how to identify when a failover has occurred and how to properly test failover scenarios is essential for database administrators managing mission-critical applications.

This guide walks you through recognizing failover events through error messages, understanding the failover process, and implementing best practices for testing failover scenarios without disrupting production systems.

Understanding Failover in High Availability

What is Failover?

Failover is the automatic process where one of the secondary MySQL instances is promoted to become the new primary instance when the current primary fails or becomes unavailable. The promotion happens automatically without manual intervention, minimizing downtime and ensuring continuous availability.

Common Causes of Failover

Failover can be triggered by several scenarios:

  1. Availability Domain Failure: Complete outage of the data center hosting the primary instance
  2. Primary Database Instance Failure: Hardware failure, software crash, or critical system error on the primary instance
  3. Network Connectivity Issues: Loss of network connectivity between the primary instance and the MySQL HeatWave Service management layer
  4. Block Storage Issues: Persistent storage failures affecting the primary instance's ability to read or write data
  5. Maintenance Activities: Certain planned maintenance operations that affect the primary instance

The Failover Process

When failover occurs:

  1. MySQL HeatWave Service detects the primary instance failure
  2. One of the secondary instances is automatically selected for promotion
  3. The selected secondary is promoted to primary status
  4. The DB system endpoint is reassigned to the newly promoted primary instance
  5. The IP address remains the same, but all existing connections are closed
  6. Client applications must reconnect to resume operations
  7. If a HeatWave cluster was attached, it's automatically recreated and data is reloaded
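
At any point, you can confirm which instance currently holds the primary role by querying the Group Replication membership table. The following is a minimal sketch using the mysql.connector driver; the endpoint, user, and password are placeholders, and it assumes your account can read performance_schema:

import mysql.connector

# Connect through the DB system endpoint (placeholder credentials).
conn = mysql.connector.connect(
    host='<DB_SYSTEM_ENDPOINT>',
    user='admin',
    password='your_password'
)
cursor = conn.cursor()
# Each group member reports its current role (PRIMARY or SECONDARY).
cursor.execute(
    "SELECT MEMBER_HOST, MEMBER_ROLE, MEMBER_STATE "
    "FROM performance_schema.replication_group_members"
)
for host, role, state in cursor.fetchall():
    print(f"{host}: {role} ({state})")
conn.close()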

Identifying a Failover: Error Messages and Indicators

Primary Indicators of Failover

1. Console Notification

The most straightforward way to verify if a failover has occurred is through the OCI Console:

Steps to Check:

  1. Open the navigation menu
  2. Select Databases > HeatWave MySQL > DB Systems
  3. Choose your compartment from the List Scope
  4. Click the name of your DB system to open the DB System Details page

If a failover has occurred, a message is displayed stating: "Current placement () differs from preferred placement, due to failover or maintenance activity."

This message indicates that the current primary instance is not in your originally preferred location due to automatic failover.

2. MySQL Event Notifications

When a failover happens, a MySQL - Automatic Recovery event is emitted on the DB system with the additionalDetails.isFailover property set to true.

Configure event rules to receive notifications when these events occur.

Error Messages During Failover

When a failover is in progress or has just occurred, applications may encounter various error messages. Understanding these errors helps you identify failover events and implement appropriate retry logic.

Error 1290 (HY000): Read-Only Mode Error

Error Message:

ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option 
so it cannot execute this statement

What It Means:
This error occurs when attempting to execute write operations (INSERT, UPDATE, DELETE, DDL statements) on an instance that is in super-read-only mode.

When You'll See It:

  • During the brief window when a primary is being demoted to secondary
  • If your application still has connections to the old primary after failover
  • When attempting writes to a secondary instance

What to Do:

  • Implement connection retry logic in your application (see the sketch after this list)
  • Close existing connections and reconnect to get routed to the new primary
  • Ensure your application uses the DB system endpoint, not instance-specific IPs
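
As an illustration of that retry logic, the sketch below retries a write that fails with error 1290 by discarding the stale connection and reconnecting; get_connection is a hypothetical factory standing in for your own data-access layer:

import time
from mysql.connector import Error

ER_OPTION_PREVENTS_STATEMENT = 1290  # server is in (super) read-only mode

def run_write(get_connection, sql, params, max_retries=3):
    """Execute a write, reconnecting if it lands on a read-only instance."""
    for attempt in range(max_retries):
        conn = get_connection()  # hypothetical factory returning a fresh connection
        try:
            cursor = conn.cursor()
            cursor.execute(sql, params)
            conn.commit()
            conn.close()
            return
        except Error as e:
            conn.close()  # drop the stale connection to the old primary
            if e.errno != ER_OPTION_PREVENTS_STATEMENT or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before reconnecting (1s, 2s, 4s)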

Error 3100 (HY000): Replication Hook Error

Error Message:

ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'

What It Means:
This error indicates the system is starting the failover process, and transaction commits are being blocked or rolled back.

Context:
This error can also indicate transaction size errors when the message includes "Transaction of size X exceeds specified limit".

When You'll See It:

  • At the beginning of a failover event
  • When large transactions exceed the group replication transaction size limit
  • During the brief period when the primary instance is being transitioned

What to Do:

  • Transactions in progress will be rolled back automatically
  • Implement exponential backoff retry logic
  • Reconnect and retry the transaction
  • For transaction size errors, break large transactions into smaller batches (see the sketch after this list)
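
For the batching case, a minimal sketch (the orders table and its columns are hypothetical):

BATCH_SIZE = 10_000

def insert_in_batches(conn, rows):
    """Commit every BATCH_SIZE rows so no single transaction exceeds the limit."""
    cursor = conn.cursor()
    for i in range(0, len(rows), BATCH_SIZE):
        cursor.executemany(
            "INSERT INTO orders (id, amount) VALUES (%s, %s)",  # hypothetical table
            rows[i:i + BATCH_SIZE]
        )
        conn.commit()  # each commit is its own, smaller replicated transaction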

Error 2013 (HY000): Connection Lost During Query

Error Message:

ERROR 2013 (HY000): Lost connection to MySQL server during query

What It Means:
The connection to the MySQL server was lost while executing a query, typically because the primary instance became unavailable during failover.

When You'll See It:

  • During active failover when the primary instance goes offline
  • When long-running queries are interrupted by failover
  • During network disruptions affecting the primary

What to Do:

  • Close the broken connection
  • Establish a new connection to the DB system endpoint
  • Retry the query on the new primary
  • Implement idempotent query patterns where possible (see the sketch below)
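
An idempotent write is one that can be safely re-run after a lost connection because it converges to the same final state. A minimal sketch using a hypothetical account_balance table with a primary key on account_id:

def upsert_balance(conn, account_id, balance):
    """Safe to retry: re-running cannot insert a duplicate row."""
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO account_balance (account_id, balance) "
        "VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE balance = VALUES(balance)",
        (account_id, balance)
    )
    conn.commit()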

Error Log Indicators

MySQL error logs provide detailed information about failover events. You can query the Performance Schema to view these logs.

Viewing Error Logs

Connect to your DB system using MySQL Shell or MySQL Client and run:

SELECT * FROM performance_schema.error_log 
WHERE SUBSYSTEM = 'Repl' 
ORDER BY LOGGED DESC 
LIMIT 50;

Key Error Codes Indicating Failover

MY-013213 and MY-011507: Primary Election Change

These errors indicate that a failover has occurred and a secondary instance has been promoted to primary through a Group Replication primary election change.

MY-011608: Transaction Size Error

This error indicates that a transaction exceeded the group_replication_transaction_size_limit.

MY-011566: Super Read-Only Mode Change

This error indicates that super_read_only has been switched off when an instance is promoted from secondary to primary, or switched on when demoted from primary to secondary.

What It Indicates:

  • An instance role change has occurred
  • The instance is transitioning between primary and secondary status
  • Part of the normal failover process
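
To scan for these codes programmatically, you can filter the error_log table directly; the sketch below assumes conn is an open mysql.connector connection:

FAILOVER_CODES = ('MY-013213', 'MY-011507', 'MY-011566', 'MY-011608')

def recent_failover_events(conn, minutes=60):
    """Return recent error-log entries whose codes suggest a role change."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT LOGGED, ERROR_CODE, DATA "
        "FROM performance_schema.error_log "
        "WHERE ERROR_CODE IN (%s, %s, %s, %s) "
        "AND LOGGED > DATE_SUB(NOW(), INTERVAL %s MINUTE) "
        "ORDER BY LOGGED DESC",
        FAILOVER_CODES + (minutes,)
    )
    return cursor.fetchall()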

Additional Failover Indicators

Binary Log Position Differences

After a failover, the current binary log file name and position of the new primary may differ from those of the old primary.

This is because binary logs of each instance are managed independently. Each transaction may be written to different binary log files and positions across instances.

To Check Binary Log Position:

SHOW MASTER STATUS;
-- On MySQL 8.4 and later, use: SHOW BINARY LOG STATUS;

Compare before and after failover to identify changes.

Connection String Behavior

While the DB system endpoint IP address remains constant during failover, all existing connections are terminated and must be re-established. Monitor your application's connection pool for sudden spikes in connection closures and new connection attempts.
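
If you use mysql.connector's built-in connection pooling, point the pool at the DB system endpoint; after a failover the pool's configuration stays valid and only the individual connections need refreshing. A minimal sketch with placeholder credentials:

from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(
    pool_name='app_pool',
    pool_size=10,
    host='<DB_SYSTEM_ENDPOINT>',  # endpoint survives failover; connections do not
    user='admin',
    password='your_password',
    database='your_database'
)

# Stale pooled connections fail on first use after a failover; catch the error,
# discard the connection, and request a new one from the pool.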

Testing Failover: Process and Best Practices

Why Test Failover?

Regular failover testing is essential for:

  • Validating your disaster recovery procedures
  • Ensuring applications handle failovers gracefully
  • Training operations teams on recovery processes
  • Identifying configuration issues before real failures occur
  • Meeting compliance and audit requirements
  • Building confidence in your high availability architecture

Switchover vs. Failover

Failover: Automatic, unplanned promotion of a secondary to primary due to primary instance failure.

Switchover: Manual, planned promotion of a secondary to primary for testing, maintenance, or optimization purposes.

When you perform a switchover, the preferred placement and current placement change to the newly selected placement of the primary instance, and the DB system endpoint IP address does not change.

Use switchover to safely test failover scenarios without actually causing a failure.

Step-by-Step Failover Testing Process

Phase 1: Set Up a Test DB System

Never test failover on production systems during business hours.

  1. Create a Non-Production HA DB System

    • Use a development or staging environment
    • Configure it identically to production (same shape, configuration, HA settings)
    • Ensure the system has high availability enabled
  2. Import Representative Data

    • Load a production-like dataset
    • Include tables with primary keys (HA requirement)
    • Consider data sensitivity and masking requirements
  3. Deploy a Sample Application

    • Use a test application that mimics production workload
    • Implement connection pooling and retry logic
    • Add monitoring and logging capabilities
    • Configure the application to use the DB system endpoint

Phase 2: Establish Baseline Monitoring

Before triggering switchover, establish baseline metrics (a small collection script follows this list):

  1. Application Performance Metrics

    • Query response times
    • Transaction throughput
    • Connection pool statistics
    • Error rates
  2. Database Metrics

    • Active connections
    • Current binary log position
    • Replication lag (should be minimal in HA systems)
    • Current primary instance location
  3. System Health Indicators

    • CPU and memory utilization
    • Storage I/O metrics
    • Network latency between availability domains
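
A minimal sketch for capturing a few of these database metrics before the test, assuming conn is an open mysql.connector connection:

def capture_baseline(conn):
    """Record pre-switchover metrics for later comparison."""
    cursor = conn.cursor()
    baseline = {}
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    baseline['active_connections'] = cursor.fetchone()[1]
    # Use SHOW BINARY LOG STATUS on MySQL 8.4 and later.
    cursor.execute("SHOW MASTER STATUS")
    binlog_file, binlog_position = cursor.fetchone()[:2]
    baseline['binlog'] = (binlog_file, binlog_position)
    return baseline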

Phase 3: Execute Switchover

Switchover causes a short period of downtime while the DB system endpoint is redirected to the newly promoted primary instance, requiring all database connections to be reopened.

Using the Console:

  1. Navigate to your DB System details page
  2. Click the Actions menu
  3. Select Switchover
  4. Choose the target availability domain or fault domain (different from current)
  5. Confirm the switchover operation

Using OCI CLI:

oci mysql db-system update \
  --db-system-id <YOUR_DB_SYSTEM_OCID> \
  --availability-domain <TARGET_AD> \
  --force

Modern Switchover Behavior (for systems created/upgraded after February 25, 2025):

New transactions are blocked when switchover is initiated, running transactions are allowed to complete, and uncommitted transactions are rolled back when the connection to the previous primary is broken.

Phase 4: Monitor the Switchover

During Switchover (typically 30-60 seconds):

  1. Watch Application Behavior

    • Monitor for connection errors (Error 1290, 2013, 3100)
    • Observe connection pool drain and refill
    • Track any failed transactions
    • Verify automatic retry mechanisms
  2. Check Error Logs

    SELECT * FROM performance_schema.error_log 
    WHERE LOGGED > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
    ORDER BY LOGGED DESC;
    
  3. Verify HeatWave Cluster Status (if applicable)
    When the current placement changes in a switchover, the HeatWave cluster is detached from the previous primary instance and either reused (if the new primary is in the same AD) or deleted and recreated (if it is in a different AD).

    SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped';
    

After Switchover:

  1. Verify New Primary Location

    • Check OCI Console for current placement
    • Confirm switchover to target AD/FD succeeded
    • Verify "Current placement" matches target
  2. Validate Application Recovery

    • All connections successfully re-established
    • Transaction throughput returns to baseline
    • Query performance is consistent
    • No persistent error conditions
  3. Check Data Consistency

    • Verify recent transactions completed successfully
    • Confirm no data loss occurred
    • Test read and write operations
    • Validate application state consistency

Phase 5: Document and Analyze

Key Metrics to Record:

Metric                Before   During    After
Total Downtime        N/A      ___ sec   N/A
Connections Dropped   0        ___       0
Failed Transactions   0        ___       0
Recovery Time         N/A      ___ sec   N/A
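
One way to fill in the Total Downtime and Recovery Time rows is to run a small availability probe against the DB system endpoint for the duration of the switchover. A minimal sketch with placeholder credentials:

import time
import mysql.connector
from mysql.connector import Error

def measure_downtime(host, user, password, interval=1.0):
    """Probe the endpoint once per interval; return the unavailable window in seconds."""
    first_failure = None
    while True:
        try:
            conn = mysql.connector.connect(
                host=host, user=user, password=password, connect_timeout=5
            )
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchall()
            conn.close()
            if first_failure is not None:
                return time.time() - first_failure  # endpoint is back; report downtime
        except Error:
            if first_failure is None:
                first_failure = time.time()  # first observed failure
        time.sleep(interval)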

Analysis Questions:

  • Did the application handle failover gracefully?
  • Were retry mechanisms effective?
  • Did any transactions fail permanently?
  • How quickly did the system return to normal operation?
  • Were there any unexpected issues?

Switchover to Return to Preferred Placement

After testing, you can switch back to the original preferred placement:

Option 1: Switchover Back
Perform another switchover to return the primary to the original AD/FD.

Option 2: Update Preferred Placement
You can perform a switchover that selects the existing primary instance; this changes the preferred placement to match the current placement with no downtime.

Best Practices for Handling Failover

1. Test on Non-Production Systems First

Critical Rule: Never test failover on production systems during business hours or peak traffic periods.

Why This Matters:

  • Switchover causes brief downtime (30-60 seconds)
  • Connections are forcibly closed
  • Uncommitted transactions are rolled back
  • Long-running queries may be aborted

Best Practice:

  • Maintain a dedicated test environment that mirrors production
  • Perform regular failover drills quarterly or bi-annually
  • Test during maintenance windows for production validation
  • Document all test results and improvements made

2. Use Switchover to Simulate Failover

Recommendation: Use manual switchover operations to test disaster recovery procedures rather than artificially creating failures.

Benefits:

  • Controlled, predictable process
  • Ability to schedule during low-traffic periods
  • Minimizes risk of actual data loss
  • Provides clean before/after states for analysis

Testing Scenarios:

  1. Same AD, Different Fault Domain: Minimal impact, tests intra-datacenter failover
  2. Different AD: Simulates datacenter failure, tests cross-AD latency and behavior
  3. With Active HeatWave Cluster: Tests analytics workload recovery
  4. Under Load: Tests failover during peak transaction volume

3. Implement Robust Application-Level Retry Logic

Connection Retry Pattern:

import time
import mysql.connector
from mysql.connector import Error

def connect_with_retry(max_retries=5, base_delay=1):
    """Connect to the DB system endpoint with exponential backoff."""
    for attempt in range(max_retries):
        try:
            connection = mysql.connector.connect(
                host='<DB_SYSTEM_ENDPOINT>',  # always the endpoint, never an instance IP
                database='your_database',
                user='admin',
                password='your_password',
                connect_timeout=10
            )
            return connection
        except Error as e:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            print(f"Connection failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)

Key Principles:

  • Always use the DB system endpoint, never instance-specific IPs
  • Implement exponential backoff (1s, 2s, 4s, 8s, 16s)
  • Limit retry attempts to avoid infinite loops
  • Make transactions idempotent when possible
  • Log all retry attempts for monitoring

4. Move Primary Back After Unplanned Failover

After an automatic failover occurs in production:

The current placement differs from the preferred placement, and a message is displayed indicating "Current placement differs from preferred placement, due to failover or maintenance activity."

Decision Points:

Option A: Leave in New Location

  • System is stable and operational
  • Performance is acceptable
  • No immediate reason to change

Option B: Switchover Back to Preferred Location

  • Original location has specific advantages
  • Schedule switchover during next maintenance window
  • Perform during low-traffic period

Best Practice: Don't rush to move back immediately after failover. Validate the stability of the current state first, then plan a controlled switchover during a maintenance window.

5. Configure Connection Timeouts Appropriately

Recommended Settings:

SET GLOBAL connect_timeout = 10;
SET GLOBAL interactive_timeout = 28800;
SET GLOBAL wait_timeout = 28800;

For Applications Using PrivateLink:

  • Set shorter connect_timeout values (5 seconds)
  • Implement aggressive retry logic
  • Failovers and switchovers take longer when using Query PrivateLink endpoints

6. Test with HeatWave Cluster

If your DB system has a HeatWave cluster attached:

Additional Considerations:

  1. Cluster Behavior During Switchover

    • If the new primary is in the same AD as the previous primary, the existing HeatWave cluster is reused and reattached
    • If the new primary is in a different AD, the existing HeatWave cluster is deleted and a new one is created
  2. Data Reload

    • Tables are automatically reloaded from HeatWave Storage Layer
    • Monitor reload progress with SHOW GLOBAL STATUS LIKE 'rapid%'; (see the polling sketch after this list)
  3. Long-Running Queries

    • When the current placement changes in a switchover, long-running queries offloaded to the HeatWave cluster can be aborted
    • Reconnect and rerun queries after cluster is ready
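
A small polling sketch for the reload wait, assuming conn is an open mysql.connector connection and that rapid_plugin_bootstrapped reports YES once the cluster is ready (treat the exact status value as an assumption to verify on your system):

import time

def wait_for_heatwave(conn, poll_seconds=10):
    """Block until the HeatWave cluster reports it is bootstrapped."""
    cursor = conn.cursor()
    while True:
        cursor.execute("SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped'")
        _, value = cursor.fetchone()
        if value == 'YES':  # assumption: YES indicates the cluster is ready
            return
        time.sleep(poll_seconds)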

Key Takeaways

Identifying and testing failover in MySQL HeatWave high availability DB systems is a critical skill for database administrators and DevOps teams. By understanding the error messages that indicate failover, implementing proper monitoring, and regularly testing switchover scenarios in non-production environments, you can ensure your applications handle failover gracefully with minimal disruption.

Key takeaways:

  1. Recognize failover indicators: Watch for specific error codes (1290, 3100, 2013) and console messages
  2. Use switchover for testing: Never test on production; use manual switchover to simulate failover safely
  3. Monitor thoroughly: Track error logs, transaction status, and application behavior
  4. Implement retry logic: Applications must handle connection failures and transaction rollbacks gracefully
  5. Plan recovery: After automatic failover, assess whether to switch back during a maintenance window
