Monitoring Data Replication - Overview

Overview

Replication is a continuous activity, and details of ongoing replication activity are shown in the Data Replication Monitor in the CommCell Console. See View Data Replication Monitor for step-by-step instructions.

From the Replication Monitor you can:

All other job-based activity, such as Recovery Point creation, is reflected in the Job Controller. See Controlling Jobs in Job Management for comprehensive information.

Job Phases

CDR utilizes phases to perform three types of operations - initial data transfer or baselining, smart synchronization, and continuous data replication. The sequence of these phases is listed below along with details of CDR activities during each phase, and the consequence of an interruption, such as a temporary loss of connectivity:

Job Phase

Associated Activity

Comments

Baseline Scan

For Windows only, start NTFS journaling on the source to track any file operations that occur during the entire Baseline phase.

Scan source path to obtain the number of files and bytes to transfer.

Generate Collect File.

The Replication Pair will show a Job State of Preparing for Replication in the Data Replication Monitor.

If this phase is interrupted:

  • For Windows, it can resume again at the same point.
  • For Unix, it will start over.

A Full Re-Sync will start at this phase.

Baseline

(For Windows)

Calculate checksums on the source and destination to identify files that will be sent to the destination.

Data is transferred from the Replication Pair source path to the destination path using the checksum.

If this phase is interrupted, it can resume again at the same point.
SmartSync Scan

Create a non-persistent snapshot; for Windows, compare it to the change journal.

Scan snapshot and generate a new Collect File for any files or directories that were added or data that was modified since the beginning of the Baseline Scan phase.

For Windows:
  • During the SmartSync Scan phase, CDR requires a short period of no disk activity on the source to begin monitoring the source path. If there is significant I/O on the source, this can fail, although CDR will continue making successive attempts to find a short period of inactivity. When you have multiple active Replication Pairs on the same source computer, CDR requires this short period of no disk activity for all of them, even for Replication Pairs that use different drives as their source. For example, if a Replication Pair on the Client is already replicating with F:\ as its source, and you create a new Replication Pair on the same Client with G:\ as its source, too much I/O on F:\ can cause CDR to fail to begin monitoring G:\.

For Unix:

  • To create the non-persistent snapshot on the source, CDR requires a short period of time, during which no files are deleted, created, or renamed. (File writes will not affect the snapshot.) If one of these three operations occurs while the snapshot is being taken, the process of creating the snapshot will begin again.

If this phase is interrupted:

  • For Windows, it can resume again at the same point.
  • For Unix, it will start over.

A Smart Re-Sync will start at this phase.

Processing Orphan Files

Compare the Collect File to the destination to identify orphan files, and apply orphan file settings. Any data that was deleted on the replication source during the Baselining phases is treated according to your settings for Orphan Files.

If this phase is interrupted:

  • For Windows, it will resume again from the beginning of this phase; however, if the snapshot is no longer available, it will return to the SmartSync Scan phase.
  • For Unix, it will return to the SmartSync Scan phase.
Checksum Calculation

(On Windows only)

Calculate checksums on the source and destination to identify files that have changed since Baseline Scan.

If this phase is interrupted, it will resume again from the beginning of this phase; however, if the snapshot is no longer available, it will return to the SmartSync Scan phase.
SmartSync

Transfer all changed files from the new Collect File to the destination.

If this phase is interrupted:
  • For Windows, it will resume again from the beginning of this phase; however, if the snapshot is no longer available, it will return to the SmartSync Scan phase.
  • For Unix, it will return to the SmartSync Scan phase.
Updating Smart Sync

(On Windows only)

Compare time stamps on source and destination and update.

Temporary snapshot is deleted.

If this phase is interrupted, it will resume again from the beginning of this phase.
Replication

Data is continuously replicated from the source to the destination. Log Transfer and Log Replay activity is ongoing. For more information, refer to Replication Logs.

The Replication Pair will show a Job State of Replicating in the Data Replication Monitor.

If the Replication phase is interrupted, replication will, if possible, begin again from the last log replayed on the destination when it is restarted. If this is not possible, the Replication Pair will return to the Baseline Scan phase (a Full Re-Sync) or to the SmartSync Scan phase (a Smart Re-Sync), depending on the nature and duration of the interruption. Note that if a user manually restarts replication by choosing Start Full Resync, the Replication Pair will return to the Baseline Scan phase.
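The restart behavior listed for the Baselining and SmartSync phases can be summarized as a small lookup. The following Python sketch is illustrative only and is not part of CDR: the function name, the constants, and the tuple it returns are hypothetical, and it assumes that "return to the SmartSync Scan phase" means starting that phase from its beginning. The Replication phase is left out because its restart point depends on the nature and duration of the interruption.

```python
SAME_POINT = "same point"
BEGINNING = "beginning of phase"

# Phase groupings taken from the phase descriptions above.
WINDOWS_ONLY = {"Baseline", "Checksum Calculation", "Updating Smart Sync"}
RESUMABLE_ON_WINDOWS = {"Baseline Scan", "Baseline", "SmartSync Scan"}
SNAPSHOT_DEPENDENT = {"Processing Orphan Files", "Checksum Calculation", "SmartSync"}
KNOWN_PHASES = RESUMABLE_ON_WINDOWS | SNAPSHOT_DEPENDENT | {"Updating Smart Sync"}


def restart_point(phase, os_name, snapshot_available=True):
    """Return (phase, position) at which an interrupted phase picks up again."""
    os_key = os_name.lower()
    if os_key not in ("windows", "unix"):
        raise ValueError(f"unsupported operating system: {os_name}")
    if phase not in KNOWN_PHASES:
        raise ValueError(f"phase not covered by this sketch: {phase}")

    if os_key == "unix":
        if phase in WINDOWS_ONLY:
            raise ValueError(f"{phase} applies to Windows only")
        if phase in SNAPSHOT_DEPENDENT:
            # Unix returns to the SmartSync Scan phase.
            return ("SmartSync Scan", BEGINNING)
        # Baseline Scan and SmartSync Scan start over on Unix.
        return (phase, BEGINNING)

    # Windows
    if phase in RESUMABLE_ON_WINDOWS:
        return (phase, SAME_POINT)
    if phase in SNAPSHOT_DEPENDENT and not snapshot_available:
        # Without the snapshot, Windows falls back to SmartSync Scan.
        return ("SmartSync Scan", BEGINNING)
    return (phase, BEGINNING)
```

For example, restart_point("SmartSync", "Windows", snapshot_available=False) yields the SmartSync Scan phase, matching the fallback described for a missing snapshot.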

During the SmartSync phases, new files and directories are copied in their entirety, but modified files do not always need to be copied in full. Modified files below a certain size threshold are copied again as complete files, while files above that size are broken into blocks, and only the changed blocks are copied to the destination computer.

Files smaller than 256 KB are copied in their entirety whether or not they match the destination. For files larger than 256 KB, only the changed blocks are transferred; the default block size for hashing is 64 KB. The default values of the minimum file size and the block size for hashing can be configured in Replication Set Properties. See Create a Replication Set for step-by-step instructions.
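The size threshold and block hashing described above can be illustrated with a short Python sketch. This is not CDR's implementation: the hash algorithm (SHA-256), the function names, and the in-memory byte strings are assumptions made for illustration, and only the 256 KB and 64 KB defaults come from the text. A file of exactly 256 KB is treated here as block-comparable, which is also an assumption.

```python
import hashlib

MIN_FILE_SIZE = 256 * 1024  # default minimum file size, per the text
BLOCK_SIZE = 64 * 1024      # default block size for hashing, per the text


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of data (hash algorithm is an assumption)."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_send(source, destination,
                   min_file_size=MIN_FILE_SIZE, block_size=BLOCK_SIZE):
    """Return a list of (offset, bytes) pairs to copy to the destination."""
    if len(source) < min_file_size:
        # Small files are copied whole, whether or not they match.
        return [(0, source)]

    src = block_hashes(source, block_size)
    dst = block_hashes(destination, block_size)
    changed = []
    for index, digest in enumerate(src):
        # A block is sent if the destination lacks it or its hash differs.
        if index >= len(dst) or digest != dst[index]:
            offset = index * block_size
            changed.append((offset, source[offset:offset + block_size]))
    return changed
```

With a 512 KB file in which a single byte changes at offset 70000, only the second 64 KB block (offset 65536) is returned. The sketch does not handle a destination file that is longer than the source.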

Interruptions and Restarts

By default, CDR handles interruptions by seamlessly restarting replication, but if that is not possible, Smart Re-Sync will be started. However, some interruptions will require a Full Re-Sync. The following sections describe the restart behavior when a phase is interrupted:

Smart Re-Sync

Smart Re-Sync is the default behavior of CDR when activities are interrupted and cannot be seamlessly restarted at the same point again. In general, CDR endeavors to do the following in such cases, wherever possible:

  • continue logging on the source
  • continue replaying logs on the destination which were received before the interruption
  • restart activities exactly where they were interrupted, or as close to that point as possible

For examples of common types of interruptions, and how Smart Re-Sync handles the recovery, refer to System Behavior When Replication Is Interrupted.

For a detailed listing of each phase, and the specifics of the exact point at which Smart Re-Sync restarts activities, refer to Job Phases.

Full Re-Sync

Full Re-Sync should be necessary only in cases such as the following:

  • the data on the destination is altered by means outside of the replication process, for example, manually deleted or modified
  • an interruption is of long enough duration that the logs overflow on the source

In such a case, all existing content in the destination path is considered inconsistent and Full Re-Sync is recommended to rebuild it again based on the current data in the specified source path. When you start replication from the Replication Set or Replication Pair level, you can specify Full Re-Sync, causing the Replication Pair to begin at the Baseline Scan phase.

Data Replication will be interrupted if a hard disk used for either a source or destination is put into the 'standby' state through the power schema configuration. It will be necessary to abort activity for all affected Replication Sets and restart them again using Start Full Resync after such an event.

Changes That Interrupt Data Replication

Changes to the following configuration items will not be effective until data replication activity has been interrupted and restarted:

The following will require data replication to be interrupted and restarted:

  • On Windows, if chkdsk is run on a hard disk used for either a source or destination, the affected Replication Pairs in the Replicating state must be aborted and restarted using Smart Re-Sync.
  • By default, CDR will always replicate only the new or updated data in the source path. If data is deleted on the destination, since there has been no change on the source, that data will not be replicated again, unless you abort the Replication Pair and perform the following to recopy the data from the source to the destination again:
    • On Windows, perform a Full Re-Sync.
    • On UNIX, perform a Smart Re-Sync.

System Behavior When Replication Is Interrupted

There are several ways in which data replication activity can be interrupted, and CDR recovers from each of them in a similar manner. The table below provides a listing of common causes of interruption, and the effect of them on Baselining, SmartSync, and data replication, as well as how CDR recovers from them.

Interruption

Effect Of Interruption & Smart Re-Sync

Abort a Replication Pair during Baselining phases

Baselining activities stop on the source.

When the Replication Pair is restarted, Baselining activities will resume, restarting at the beginning of the phase if necessary, then SmartSync and data replication activities will begin automatically.

Abort a Replication Pair during SmartSync phases

Logging stops on the source.

When the Replication Pair is restarted, SmartSync activities will resume, restarting at the beginning of a phase if necessary, and data replication activities will begin automatically.

Abort a Replication Pair during Replication phase

Logging stops on the source.

When the Replication Pair is restarted, for NTFS or UNIX, Smart Re-Sync will continue the data replication activities automatically; for FAT file systems, Full Re-Sync will be necessary.

Suspend a Replication Set

Baselining, SmartSync, and data replication activities stop for all Replication Pairs, but any logging activities will continue on the source.

When the Replication Set is resumed:

  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.
Graceful or non-graceful shutdown of the source computer

The destination computer continues to replay the logs it has received.

When the source computer and software are running again, Replication Pair(s) will be in the System Aborted state for some time, then Smart Re-Sync will be performed.

Graceful or non-graceful shutdown of the destination computer

Logging continues on the source.

When the destination computer and software are running again:

  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.
CDR software shutdown on the source

All CDR-related activities stop.

When the software is restarted, CDR will start Smart Re-Sync.

CDR software shutdown on the destination

Logging continues on the source.
  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.
Replication Service is stopped on the source

Baselining, SmartSync, and data replication activities stop for all Replication Pairs, but logging continues on the source, and the destination computer continues to replay the logs it had received before the service was stopped.

When the Replication Service is started again:

  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.
Replication Service is suspended on the destination

Baselining, SmartSync, and data replication activities stop for all Replication Pairs, and log replay stops on the destination, but logging continues on the source.

When the Replication Service is started again:

  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.
Interruption of network connectivity (source and/or destination)

Baselining, SmartSync, and data replication activities stop for all Replication Pairs, but logging continues on the source, and the destination computer continues to replay the logs it had received before the network connectivity was interrupted.

When network connectivity is restored:

  • For any Replication Pairs that were performing data replication, CDR will transfer the accumulated logs to the destination, and data replication will continue.
  • For Replication Pairs that were in the Baselining or SmartSync phases, how activities begin again will depend on the exact phase the Replication Pairs were in, as well as the operating system type.

If the network interruption is for a significant amount of time, the following will occur:

  • On Windows, the status of the Replication Pair will become Failed, and will need to be restarted manually with Smart Re-Sync when connectivity is restored.
  • On UNIX, CDR will continue to retry sending the logs to the destination computer until network connectivity is restored.
Source computer runs out of log space (Windows)

-- or --

Source computer tries to create new entries in a log before the old entries have been transferred to the destination (UNIX)

Logging will stop, all logs will be deleted, and all Replication Pairs will be System Aborted.
  • On Windows, the system will wait 3 minutes, then check space on the log volume. If there is sufficient space, a Smart Re-Sync will occur; if not, the Replication Pair will be Aborted.
  • On UNIX, a Smart Re-Sync will occur.
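The recovery from a log overflow described above can be sketched as a small decision function. This Python example is a sketch under stated assumptions, not CDR's logic: the function name and parameters are hypothetical, and because the text says only "sufficient space", the required_free_bytes threshold is an invented stand-in. Only the 3-minute wait on Windows and the two outcomes come from the text.

```python
import shutil
import time


def action_after_log_overflow(os_name, log_volume=".", required_free_bytes=0,
                              wait_seconds=180, free_bytes=None):
    """Return the outcome after the source runs out of log space.

    required_free_bytes is a hypothetical threshold for "sufficient space".
    free_bytes can be supplied directly; otherwise the log volume is checked.
    """
    if os_name.lower() != "windows":
        # On UNIX, a Smart Re-Sync occurs.
        return "Smart Re-Sync"

    # On Windows, wait (3 minutes by default), then check the log volume.
    time.sleep(wait_seconds)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(log_volume).free
    if free_bytes >= required_free_bytes:
        return "Smart Re-Sync"
    return "Aborted"
```

Passing wait_seconds=0 and an explicit free_bytes value makes the function easy to exercise without waiting or touching a real volume.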

For instructions on restarting replication after it has been interrupted, see Start/Suspend/Resume/Abort Data Replication Activity.

Job States

The Data Replication Monitor shows the state of each Replication Pair. These states are briefly described below:

New Pair The Replication Pair has been created, but no activity has taken place yet.
Preparing for Replication CDR is scanning the source paths, preparing for initial transfer or Full Re-Sync.
Baseline For detailed information, see Baseline.
Initial Sync For detailed information, see Baseline Scan.
SmartSync Scan For detailed information, see SmartSync Scan.
SmartSync For detailed information, see SmartSync.
Processing For detailed information, see Processing Orphan Files.
Replicating Data is being continuously replicated.
Replicating (Not verifiable) The most recent communication between the CommServe and CDR Client indicated the job was in the Replicating state, but this cannot be verified because communication has been interrupted.
Suspended Replication activity has been temporarily halted, either by a user, or because communication between the source and destination has been interrupted. Logs continue to be written on the source.
Pending There has been a temporary interruption and CDR is attempting to reconnect and resume operations.
Failed Phase failed to complete, or log transfer has stopped, perhaps for connectivity issues; logs continue to be written on the source.
Paused CDR is trying to resume replication activity.
Stopped Replication activity has been halted by one of the following:
  • the Replication Log cannot be read on the destination computer
  • the system, because communication between the source and destination has been interrupted and cannot be successfully resumed by the system
  • the system, because the replication destination has run out of space
  • a source or destination throttling condition that has not been cleared; if the condition is on a source computer, the job will be System Aborted at first, and if on a destination computer, the job will be Paused at first
System Aborted For CDR on Windows only, a Replication Pair will be in this state for 3 minutes if the source disk hosting replication logs runs out of space, after which the system will attempt to restart.

To see more information about a particular Replication Pair, see View details of data replication activities.

You can change the state of a Replication Pair, or several at the same time. See Change the State of Replication Pair.

Considerations

  • The status of all Replication Pairs is not immediately updated when one Replication Pair is resumed. For instance, when all Replication Pairs had been placed in the Paused state, and you Resume one of them, a prompt will ask if you want all Pairs to be resumed. If you choose to do so, all the Replication Pairs that were placed in the Paused state will Resume, and be placed back in the same state they were in previously. However, the CommCell Console will not immediately reflect the status of all the other Replication Pairs that were resumed, and they may still be shown in the Paused state. The CommCell Console will properly synchronize and display the correct state of the Replication Pairs within a few minutes.
  • During SmartSync of application data, Data Replication Monitor may display more than the actual number of files transferred.

Job Details

The following information is available in the Data Replication Monitor:

Active When the symbol is green, it indicates recent activity for the Replication Pair; an orange symbol indicates no recent activity. An exclamation point preceding the symbol indicates that some files were not copied successfully to the destination computer during replication. To see these files, see View the failed files for a Replication Pair for step-by-step instructions.
Phase The current phase of the job; for more detailed information see Job Phases.

General

Job ID A unique number allocated by the Job Manager for the operation.
State The current state of the Replication Pair; for more detailed information see Job States.
Last Update Time The date and time of the CommServe when the Job Manager last updated the Data Replication Monitor.
Pair Abort Reason For a Replication Pair that was aborted, the reason is listed.
Last Error The most recent error message for this Replication Pair.

Initial Sync Information

Start Time The date and time of the CommServe when data replication activity began for the Replication Pair.
Number of Files To Be Transferred The files remaining to be transferred for the Replication Log file currently being replayed on the destination.
Number of Files Already Transferred The files transferred for the Replication Log file currently being replayed on the destination.
Data To Be Transferred during Initial Sync On Source The aggregate size of all files to be transferred between the source and destination for the Replication Pair. The actual data transferred may differ slightly from this number, based on whether a given file actually gets transferred in full or in part.
Data Transferred during Initial Sync On Destination The sum of all data already transferred between the source and destination for the Replication Pair.
Throughput Unit The rate of data transfer during Baseline phase, in GB/hour.
Progress The percentage of files transferred for the Replication Log file currently being replayed on the destination.

Replicating State Information

Last Log Played Time The date and time of the CommServe when the most recent Replication Log was played on the destination computer.
Replicated Data The sum of all data transferred between the source and destination machines since the Start Time.
Attempts The number of attempts at replication the system has made for the Replication Pair.
Latest Source Log The number of the most recent Replication Log that was created on the source computer.
Latest Destination Log The number of the most recent Replication Log that was replayed on the destination computer. If this number is lower than the Latest Source Log number, it indicates that the destination computer has not yet replayed all of the Replication Logs that have been created on the source computer.
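The comparison between Latest Source Log and Latest Destination Log can be expressed as a simple difference. The Python sketch below is illustrative only: the PairStatus class and its field names are hypothetical, and it assumes that Replication Logs are numbered with consecutive integers, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class PairStatus:
    """Hypothetical holder for two values read from the monitor."""
    latest_source_log: int
    latest_destination_log: int

    def logs_behind(self):
        # Logs created on the source but not yet replayed on the destination.
        return max(0, self.latest_source_log - self.latest_destination_log)

    def is_caught_up(self):
        return self.logs_behind() == 0
```

A pair with a Latest Source Log of 120 and a Latest Destination Log of 117 would report three logs still to be replayed under this assumption.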

Configuration

Pair ID A unique number allocated by the Job Manager that identifies the Replication Pair.
Source Path The path on the source computer for the Replication Pair.
Destination Path The path on the destination computer for the Replication Pair.
Replication Set The name of the Replication Set.
Replication Type The type of replication configured for the Replication Set. (See Data Replication Type.)
Client The CDR Client that is the source computer for the Replication Pair.
Destination Host The CDR Client that is the destination computer for the Replication Pair.

Attempts

The following information is available in the Attempts window:

Phase The phase that the Replication Pair was in at the time of the attempted activity.
State Current state of the Replication Pair.
Start Time The date and time of the CommServe when the attempted activity began for the Replication Pair.
End Time The date and time of the CommServe when the attempted activity ended for the Replication Pair.
Elapsed Time The amount of time that elapsed while the activity was being attempted for the Replication Pair.
Files to Transfer Files to be transferred to the destination computer for the Replication Pair, based on the initial scan.
Files Transferred Files already transferred to the destination computer for the Replication Pair.
Data Transferred The sum of all data already transferred between the source and destination during the attempted activity.
Data to Transfer The aggregate size of all files to be transferred between the source and destination for the Replication Pair. The actual data transferred may differ slightly from this number, based on whether a given file actually gets transferred in full or in part.