TROUBLESHOOTING PUBLIC FOLDER REPLICATION USING EVENT LOGS


The purpose of this Bulletin is to serve as a guide to troubleshooting public folder replication problems. It will not tell you exactly how to fix every possible replication problem. However, it will show you how to isolate every possible replication problem so that you focus your troubleshooting on the point of failure. This information was published for customers here, here and here.

 


The content in this Bulletin was written by Bill Long.

 

This is broken down as follows:

 

1. Introduction

2. Troubleshooting the Replication of New Changes

–       Hierarchy Replication

–       Content Replication

–       Troubleshooting Steps

3. Troubleshooting the Replication of Existing Data

–       Troubleshooting

4. Troubleshooting Replica Deletion

–       Troubleshooting

5. Common Problems

6. Conclusion

7. Additional Reading

8. Credits

9. Tech Bulletin archive and subscription info

 

Here it goes:

 

1. Introduction

 

This Bulletin is intended to take you from a problem description like “The content on my old server isn’t replicating to my new server” to a much narrower problem description like “My old server isn’t responding to the status requests from my new server, therefore the new server doesn’t know it’s missing data and isn’t trying to backfill. This means the problem is actually with the old server.” This Bulletin will also describe how to identify a few of the most common replication problems. Before I get into the details of troubleshooting, I want to give an overview of my general approach to these issues.

 

The best troubleshooting tool for public folder replication is the application log. In order to isolate a replication problem, you must be able to follow the replication events in the log to see exactly where the process is breaking down. Typically, you should turn up Diagnostics Logging on Replication Incoming and Replication Outgoing to maximum as a starting point for troubleshooting. Each time a store sends or receives a replication message, it will log an event to that effect (assuming logging is turned up). The various kinds of replication messages can be differentiated by the ‘Type’ shown in the description of the event. I prefer to focus on the Type rather than the event ID for several reasons:

 

– Event IDs change between versions of Exchange. For instance, in Exchange 5.5 an outbound backfill request was a 3014. In Exchange 2000 and 2003 it’s a 3016.

 

– Incoming and outgoing event IDs are different for each type. An outgoing hierarchy message is a 3018, while an incoming hierarchy message is 3028.

 

– Status requests and status messages use the same event IDs, even though they are two different types of messages. Thus, you can’t distinguish between them from event ID alone.

 

Focusing on the Type is a little easier. You can easily correlate these with Event IDs by examining your application log. There are only 7 types, and you can see if the message is outgoing or incoming by looking at the Category of the event. If you focus on types instead of event IDs, all you need to remember is:

 

Hierarchy – 0x2

Content – 0x4

Backfill Request – 0x8

Backfill Response – 0x80000002 (for hierarchy) or 0x80000004 (for content)

Status – 0x10

Status Request – 0x20
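
As a convenience, the list above can be kept as a small lookup table. The following Python sketch is only an illustration: the names REPLICATION_TYPES and describe_type are hypothetical and are not part of any Exchange tool.

```python
# Replication message types exactly as listed above.
REPLICATION_TYPES = {
    0x2: "Hierarchy",
    0x4: "Content",
    0x8: "Backfill Request",
    0x10: "Status",
    0x20: "Status Request",
    0x80000002: "Backfill Response (hierarchy)",
    0x80000004: "Backfill Response (content)",
}


def describe_type(type_field: str) -> str:
    """Translate a 'Type' value such as '0x2' into a readable name."""
    value = int(type_field, 16)  # int() accepts the '0x' prefix with base 16
    return REPLICATION_TYPES.get(value, f"Unknown type {type_field}")
```

For example, describe_type("0x20") returns "Status Request". Whether the message was outgoing or incoming still has to be read from the Category of the event.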

 

Also note that Replication Errors logging is rarely helpful. Even when replication is working normally, most servers will generate lots of replication error events, such as event ID 3093 indicating there was an error reading a property. In most cases the property has no effect on replication and the error can be ignored. I recommend leaving Replication Errors logging at None unless you’re looking for something specific, such as the problem described in this blog post.

 

2. Troubleshooting the Replication of New Changes

 

To troubleshoot public folder replication, you must first be familiar with the normal message flow that is expected when replication is working. Based on that knowledge, you can identify the point of failure when a problem arises. Before I discuss how to troubleshoot the replication of new changes, let’s describe what the expected behavior is.

 

–       Hierarchy Replication

 

The replication of new hierarchy changes takes place whenever a folder is created or deleted, or a change is made to the properties of a public folder, such as the replica list, client permissions, description, administrative note, or storage limits. Note that this does not include the email addresses on a mail-enabled folder. The email addresses are stored on the directory object in the Active Directory, so changing them does not result in hierarchy replication. Only changing the properties stored in the public store itself will cause hierarchy replication.

 

Every 15 minutes (by default), the store broadcasts any changes that were made to the folder properties in one or more hierarchy replication messages. With logging turned up to Maximum on Replication Outgoing, you’ll see an event like this on the server where the hierarchy was modified:

 

Event Type: Information

Event Source:     MSExchangeIS Public Store

Event Category:   Replication Outgoing Messages

Event ID:   3018

Description:

An outgoing replication message was issued.

 

Type: 0x2

 Message ID: <91ACCD0758385549A56A10971798985572D5@bilongexch1.bilong.test>

 Database “First Storage Group\Public Folder Store (BILONGEXCH1)”

 CN min: 1-72CF, CN max: 1-72D3

 RFIs: 1

 1) FID: 1-6994, PFID: 1-1, Offset: 28

      IPM_SUBTREE\NewFolder

 

Notice the “Type: 0x2” at the beginning of the description, identifying this as a hierarchy replication message.
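
If you have exported event descriptions as text, a few regular expressions are enough to pull out the fields that matter for troubleshooting. This is a minimal sketch that assumes the description layout shown in the sample event above; the function name parse_replication_event is hypothetical.

```python
import re

# Description text copied from the sample 3018 event above.
SAMPLE_DESCRIPTION = r"""An outgoing replication message was issued.

Type: 0x2
 Message ID: <91ACCD0758385549A56A10971798985572D5@bilongexch1.bilong.test>
 Database "First Storage Group\Public Folder Store (BILONGEXCH1)"
 CN min: 1-72CF, CN max: 1-72D3
 RFIs: 1
 1) FID: 1-6994, PFID: 1-1, Offset: 28
      IPM_SUBTREE\NewFolder
"""

# Assumed field layout, based only on the sample event.
PATTERNS = {
    "type": r"Type:\s*(0x[0-9A-Fa-f]+)",
    "message_id": r"Message ID:\s*(<[^>]+>)",
    "cn_min": r"CN min:\s*([0-9A-Fa-f]+-[0-9A-Fa-f]+)",
    "cn_max": r"CN max:\s*([0-9A-Fa-f]+-[0-9A-Fa-f]+)",
}


def parse_replication_event(description: str) -> dict:
    """Return the Type, Message ID and CN range, or None for missing fields."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, description)
        result[name] = match.group(1) if match else None
    return result
```

The Message ID is the value to carry into message tracking and into the application log of the receiving server.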

 

A hierarchy replication message is sent from the originating server directly to all other public stores. There’s no concept of topology for the replication of new changes. All changes to the hierarchy are sent directly from the server on which the changes were made to all other servers that have a public store associated with the same hierarchy. Every other server should log an incoming event showing a type 0x2 when they successfully process the incoming replication message.

 

–       Content Replication

 

Content replication takes place whenever a message is created or deleted, or the properties of a message are changed. The times at which the store broadcasts content changes for a folder can be modified by changing the replication schedule on that folder, but by default the changes will be broadcast every 15 minutes just like the hierarchy. A content replication message is identified by the type 0x4 in the event description. Once again, there is no concept of topology for the broadcast of new changes. When the content of a folder is modified on any given replica, that replica will send a 0x4 message directly to all other replicas of the folder. And again, every receiving server should log an incoming 0x4 event when they successfully process the incoming message.

 

–       Troubleshooting Steps

 

Those are the two most basic scenarios for replication. When new hierarchy changes or new content changes are not making it from one server to another, troubleshooting is very straightforward.

 

1. Did the server generate an outbound replication message?

 

So you made a change to a folder, or you added content to a folder on a particular server, and that content didn’t make it to some other server. The first question to answer is: did the originating server (the one you made the change on) broadcast the changes? When troubleshooting, it’s important to keep track of what server you’re working against when you make changes. There are several ways to accomplish this. In ESM, you can right-click on the Public Folders node and choose “Connect To” to point to a particular store. For the most part ESM will make any changes on the specified store, but be aware of one exception – Client Permissions. Prior to Exchange 2003 Sp2, when you changed Client Permissions through ESM, ESM would attempt to make the change against a store that houses a replica of the folder, even though this isn’t necessary since the permissions are stored as a property of the folder in the hierarchy. With the 2003 Sp2 version of ESM, this was changed so that it now does make the change on the hierarchy you’ve pointed it to. When you’re testing hierarchy replication by making changes through ESM, you should avoid using the permissions for testing since it may be hard to predict which store the change will be made against, unless you’re running ESM from 2003 Sp2. If you’re using Outlook and you want to verify which replica of a folder you are hitting, you can use MFCMAPI or a similar tool to view the PR_REPLICA_SERVER property of the folder. This will show you the name of the server that Outlook is using to access the content of that folder.

 

If the replication schedule is Always for the folder in question (which is always true for the hierarchy), and you don’t see an outbound 0x2 or 0x4 within 15 minutes, then you know something is wrong on the originating server. If the server is not generating any outbound hierarchy or content broadcasts, the replication agent may have failed to start. One of the common scenarios is described in KB272999. The important thing to note here is the 3079 event:

 

Event ID: 3079

Source: MSExchangeIS Public

Type: Error

Category: Replication Errors

Description:

Unexpected replication thread error 0x3f0.

 

EcGetReplMsg

EcReplStartup

FReplAgent

 

This will be logged even with no additional logging turned up when you mount the public store. If the 3079 includes the “EcReplStartup” function, this means the replication agent failed to start and no new changes will be broadcast until the problem is corrected and the store is mounted again.
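
When reviewing a large exported log, the check described here can be automated. The sketch below assumes the events are already available as (event ID, description) pairs; that input shape and the function name find_agent_startup_failures are assumptions made for illustration.

```python
def find_agent_startup_failures(events):
    """Return the 3079 events whose description names EcReplStartup."""
    return [
        (event_id, description)
        for event_id, description in events
        if event_id == 3079 and "EcReplStartup" in description
    ]


# Descriptions taken from the sample events in this Bulletin.
SAMPLE_EVENTS = [
    (3079, "Unexpected replication thread error 0x3f0.\n"
           "EcGetReplMsg\nEcReplStartup\nFReplAgent"),
    (3018, "An outgoing replication message was issued.\nType: 0x2"),
]
```

Any match means no new changes will be broadcast from that store until the underlying problem is corrected and the store is mounted again.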

 

Hierarchy replication is also vulnerable to certain permissions problems when Exchange 5.5 public stores are present in the organization. When an Exchange 2000 or 2003 server sends a hierarchy replication message to an Exchange 5.5 server, it must produce a ptagACLData property (the 5.5-style permissions based on legacyExchangeDN) from the ptagNTSD property (the 2000-style permissions based on SID). This means each SID must be converted to a legacyExchangeDN. This SID to legacyExchangeDN conversion can fail for several reasons. For instance, if a SID resolves to more than one user, an event like this may be generated:

 

Event ID: 9528

Category: General

Source: MSExchangeIS

Type: Error

Description:

The SID S-1-5-21-408248388-469072634-37170099-1391 was found on 2 users in the DS, so the store cannot map this SID to a unique user.

 

The users involved are:

/DC=com/DC=company/CN=Users/CN=User1

/DC=com/DC=company/CN=Users/CN=User2

 

Since the SID can’t be converted to a legacyExchangeDN, the store will fail to generate an outbound hierarchy broadcast message.

 

2. Was the message addressed to the server that didn’t receive the change?

 

If the originating server generated the outbound message, the next step is to be sure it was addressed to the server that didn’t receive the data. The easiest way to verify this is to track the message. In message tracking, you can just track the message ID that was reported in the outbound replication event. In the Message History window, the To: line will be visible. If the store that didn’t receive the changes is not listed here, then again the focus should be on the originating server. Does it see that server in the organization? Does that server have email addresses on its public store object? Does the originating server see that store in the replica list on the folder in question?

 

3. Did the message make it to the destination server?

 

Once you’ve verified that the message was addressed to the destination server, the next question to answer is – did the message make it that far? You can determine this through message tracking. If message tracking says the message was delivered to the store, but you don’t see an incoming replication event acknowledging the message, see the “Common Problems” section below.
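
The three questions above amount to matching each outgoing Message ID against the incoming events on the other server. A rough sketch of that comparison follows, assuming each event has already been reduced to a dictionary with "type" and "message_id" keys; the function name and the sample Message IDs are hypothetical.

```python
def find_unacknowledged(outgoing_events, incoming_events):
    """Return outgoing events whose Message ID was never logged as incoming."""
    received_ids = {event["message_id"] for event in incoming_events}
    return [
        event for event in outgoing_events
        if event["message_id"] not in received_ids
    ]


# Hypothetical data: two hierarchy broadcasts sent, only one logged on arrival.
OUTGOING_ON_SENDER = [
    {"type": "0x2", "message_id": "<AAAA0001@server1.example.test>"},
    {"type": "0x2", "message_id": "<AAAA0002@server1.example.test>"},
]
INCOMING_ON_RECEIVER = [
    {"type": "0x2", "message_id": "<AAAA0001@server1.example.test>"},
]
```

Each Message ID this returns is a candidate for message tracking: either it was never addressed to the other server, it did not arrive, or it arrived and was ignored.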

 

3. Troubleshooting the Replication of Existing Data

 

When new changes replicate, but old unchanged stuff doesn’t, you have a backfill problem. The most typical situation where hierarchy backfill occurs is when a new public store is created. The most typical content backfill scenario is when a public store has been added to the replica list of a folder.

 

When you have a backfill problem like this, it may have already occurred to you that there’s an easy workaround – just make a change to all the items. By doing this you circumvent the broken backfill process and replicate everything as new changes. Despite the fact that I wrote both of the tools that are typically used for this (PFDAVAdmin and ModifyItems), it’s usually best to troubleshoot the backfill process and fix the root cause. If you just change everything to make it replicate, you may end up with the same backfill problem in the future when the replicas get out of sync again. That said, let’s move on to discussing backfill. To understand the backfill process, it’s first necessary to understand how changes are tracked.

 

Every folder and message in the store is assigned a Change Number (CN) when it is created and every time it is modified. When replication occurs, the CNs on each object are used to determine whether that object needs to be replicated. A group of CNs is called a CNSet. The CNSet for a particular folder on a particular server is called status information. This status information is included on every replication message. Every message type 0x2 contains the hierarchy status for the sending server. Likewise, every message type 0x4 contains the status information for that particular folder for the sending server. All other replication message types contain status information for their respective folders as well.

 

When a new public store mounts for the first time, it sends a status request (type 0x20) for the hierarchy to all the existing public stores. Similarly, when a new store is added to the replica list of a folder, that store will send a 0x20 to all other replicas of that folder. Like every replication message, a status request contains a CNSet of all CNs for the folder in question (or the hierarchy) that the originating store has, and asks the other stores to respond if they have CNs that the originator doesn’t. Note that prior to Exchange 2003 Sp2, not every replica was asked to respond to the status request, so some stores would ignore the status request even if they had changes that the originating store did not. A 2003 Sp2 server will ask for responses from all replicas, and will respond even when the originating server did not specifically ask it to, as long as it has changes that the originating server does not. This can greatly improve the decisions made during the backfill process. The unique thing about a status request is that it doesn’t contain any data to replicate – it just has a list of change numbers. The other stores respond with status messages (0x10), which list their own CNSets for that same folder (or the hierarchy). When the originating server receives the 0x10 messages, it compares the CNSet contained within to its own CNSet. If the 0x10 contains changes that the store doesn’t already have, the backfill process begins.
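
The comparison at the end of this exchange is a set difference. The sketch below models a CNSet as a plain Python set of change-number strings, which is a deliberate simplification for illustration and not how the store represents CNSets internally; the function name and the sample CNs are hypothetical.

```python
def missing_changes(local_cnset: set, reported_cnset: set) -> set:
    """Return the CNs another store reports that the local store lacks."""
    return reported_cnset - local_cnset


# Hypothetical change numbers for one folder.
LOCAL = {"1-72CF", "1-72D0"}
REPORTED_IN_STATUS = {"1-72CF", "1-72D0", "1-72D1", "1-72D3"}
```

A non-empty result corresponds to the case where the backfill process begins. An empty result means the 0x10 told the store nothing new.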

 

The first step in the backfill process is to add entries to the backfill array for the folder in question. These entries have a CNSet that describes the missing changes, and a timeout value describing when the store will request the missing data. The backfill timeout will vary depending on the situation. In the case of a new public store being brought online or a new replica of a folder being added, the initial timeout is 15 minutes.

 

Backfill entries may be added to the backfill array during the course of normal operation as well. Consider a situation where a particular public store has broadcast two changes in two separate 0x2 messages. Let’s say the administrator deletes the first 0x2 message out of the queue, but the second one makes it through. When the other servers receive this 0x2, they will find that the CNSet in the status information contains CNs that they never got. As a result, they will create backfill entries for that data. Backfill entries for missing data that was discovered during the normal course of replication will start with a timeout of 6 hours if the data is available in the same Routing Group (RG), or 12 hours if it is only available in a different RG. Each time a backfill request is issued, the next timeout will be 12 and then 24 hours for intra-RG requests, or 24 and 48 hours for inter-RG requests.
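
The timeout progression for entries discovered during normal replication can be written down as a small function. This sketch uses only the intervals stated above, and it assumes that the timeout stays at the last interval for any later requests; the names are hypothetical.

```python
# Initial timeout for a new public store or a newly added replica.
NEW_REPLICA_INITIAL_MINUTES = 15

# Timeouts in hours for entries discovered during normal replication.
INTRA_RG_HOURS = (6, 12, 24)   # data available in the same Routing Group
INTER_RG_HOURS = (12, 24, 48)  # data only available in a different RG


def timeout_hours(same_routing_group: bool, requests_issued: int) -> int:
    """Return the timeout in effect after a number of backfill requests."""
    schedule = INTRA_RG_HOURS if same_routing_group else INTER_RG_HOURS
    # Assumption: once the last interval is reached, it is reused.
    index = min(requests_issued, len(schedule) - 1)
    return schedule[index]
```

With no requests issued yet, timeout_hours(True, 0) gives the initial 6 hours and timeout_hours(False, 0) gives 12.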

 

Every five minutes the store will check to see if any backfill entries have reached their timeout. If they have, a backfill request (type 0x8) is issued for the missing CNs, and the timeout is set to the next interval. A backfill request is not a broadcast; it is directed at a single server – one of the servers that previously indicated it had the missing CNs in the status information it sent to the requesting server. When that server receives the incoming 0x8, it immediately processes the request and responds with one or more backfill responses (0x80000002 for hierarchy or 0x80000004 for content), which contain the actual data for the requested change numbers. Like backfill requests, backfill responses are not broadcasts – they are sent only to the requesting server.

 

If the requesting server successfully processes the incoming backfill response, the CNs it contained are cleared from the backfill array on that store. Actually, any incoming message that contains CNs that are outstanding in the backfill array will cause those CNs to be cleared from the array.
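
To make the life cycle of a backfill entry concrete, here is a toy model of the backfill array. It is a sketch under simplifying assumptions: CNs are plain strings, time is counted in minutes, and the BackfillEntry class and both function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackfillEntry:
    missing_cns: set      # change numbers the store still needs
    due_at_minutes: int   # when a 0x8 should be issued for them


def due_entries(backfill_array, now_minutes):
    """Entries whose timeout has been reached (the periodic check)."""
    return [e for e in backfill_array if e.due_at_minutes <= now_minutes]


def apply_incoming(backfill_array, incoming_cns):
    """Clear CNs carried by any incoming message; drop satisfied entries."""
    for entry in backfill_array:
        entry.missing_cns -= incoming_cns
    return [e for e in backfill_array if e.missing_cns]
```

Note that apply_incoming does not care which message type delivered the CNs, which mirrors the point that any incoming message containing outstanding CNs clears them from the array.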

 

–       Troubleshooting

 

As you can see, there are a lot more questions to answer when troubleshooting the backfill process.

 

1. Does the store know it’s missing data?

 

First you should determine if the server even realizes that other stores have changes that it needs to request. Unfortunately, there is no supported tool or utility that will let you view the backfill array directly to see if it has anything in it. However, there are other more indirect ways of figuring this out.

 

One way is to wait. If the server knows it’s missing data, it will be requesting it at least once every 24 or 48 hours. This means you can simply turn up logging and wait to see if a 0x8 message ever goes out. If you never see a 0x8 for the folder in question, but you are seeing 0x8’s for other folders, you may have hit the outstanding backfill limit, which we’ll discuss shortly.

 

Another option is to make sure the server receives the latest status information. Remember, the server only sends a status request that one time after you add the new replica. After that, the only status information it receives will be through the normal course of replication. So if its initial attempt to get status was lost because the 0x20 or the 0x10 in response was lost or deleted, it may sit there indefinitely and not even realize that it’s missing anything. There are several ways to make sure the server has received status information for a folder.

 

– Go to a server that has all the data and make a change to the folder by adding, deleting, or modifying a message. In the case of the hierarchy, create, delete, or change the properties of a folder. The resulting 0x4 or 0x2 will contain status information for that folder or the hierarchy, respectively. When the server that’s missing the data successfully processes the incoming replication message, you know that it has added any appropriate entries to the backfill array.

 

– Use the Synchronize Content option in Exchange 2003 ESM. This is a well-hidden but very useful option. To find it, go under the Public Folders tree and go to the folder in question. Highlight the folder in the left-hand pane. In the right-hand pane click the Status tab. Right-click on the server that has all the data and choose Synchronize Content. This does two things – it causes the server to issue a status request 0x20 for the folder, and it causes it to immediately timeout any backfill entries. Notice that I said you should Synchronize Content from the server that already has the data. You may wonder why you would do that, when it’s the other server that has the backfill entries that need to be timed out. Remember that at this point we’re just trying to ensure that the server missing the data KNOWS it has something to backfill. To that end, we can use Synchronize Content from the server that has the data to send a 0x20 to the server that doesn’t. In this case we’re not really interested in seeing a status 0x10 response to the 0x20. We just want the store missing content to receive a replication message for the folder from a store with content, so it can add the appropriate entries to the backfill array. The 0x20 from the server with the data serves this purpose. Note that in Exchange 2003 Sp2, Synchronize Content is now available for the hierarchy by right-clicking on the Public Folders node itself.

 

– Use the Replication Flags registry value (KB813629). If you put this value in place, along with the Enable Replication Messages At Startup value from Q321082, it causes the store to send a status request 0x20 for every folder on startup. Again, you would want to use this on the server that has the content – the point of this step is to get the server that has content to send its status information to the server that’s missing content.

 

– Use 2003 ESM to send a backfill response. In 2003 Sp1, you could use the Send Hierarchy option to send a hierarchy backfill response and the Send Contents option to send a folder content backfill response. In 2003 Sp2, both of these options became Resend Changes. This sends a backfill response for the range of data you specify, but you probably shouldn’t specify the whole range of data since that might satisfy all outstanding backfill entries and end up working around the original problem. Instead, specify a range of only a day or two. This causes a 0x80000002 or 0x80000004 to go to the target server, which again serves the purpose of giving it status information for the store that has the data.

 

Once you’ve used one of these options to force status information, and you’ve verified that the store missing the data received the incoming message by watching the application log, then you know it knows it’s missing the data.

 

2. Does the store request the missing data?

 

After you’ve made sure the store knows it needs to backfill some data, does it ever issue a backfill request? Recall that after it has tried to backfill the data a couple of times, you may have to wait 24 or 48 hours for the next backfill request, since that will be the longest timeout interval for intra-RG and inter-RG backfills, respectively. There is one way to speed this up, and that is to use Synchronize Content again, but this time from the server that’s missing the data. This will immediately time out the backfill entries for that folder. However, you may still find that the store does not issue a backfill request for the folder you’re focusing on. If this is the case, watch the app log for the next 24-48 hours. If the store is sending backfill requests for other folders, but not for the folder you’re focusing on, it may have hit the outstanding backfill limit.

 

When you experience a situation where you’ve added replicas of a lot of folders to a new store, and replication seems fine at first but then grinds to a halt over the next day or two, you have probably hit the outstanding backfill limit. The outstanding backfill limit is a mechanism intended to throttle replication. By default, the store will only allow 50 outstanding backfill requests at a time. Once it has 50 outstanding, it will re-request those 50 over and over until they are satisfied. Once any one outstanding entry has been satisfied, that opens up a slot under the limit for a new set of data to be requested. This means that if 50 requests are having problems being satisfied for whatever reason, replication cannot proceed.
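
The throttling effect is easy to see in a toy model. The sketch below only illustrates the slot arithmetic described above; the function name, the folder names and the idea of a simple waiting list are assumptions, not a description of the store's internals.

```python
DEFAULT_OUTSTANDING_BACKFILL_LIMIT = 50


def next_requests(waiting, outstanding_count,
                  limit=DEFAULT_OUTSTANDING_BACKFILL_LIMIT):
    """Return the waiting requests that fit into the free slots."""
    free_slots = max(limit - outstanding_count, 0)
    return waiting[:free_slots]


# Hypothetical folders still waiting to be requested.
WAITING = ["FolderA", "FolderB", "FolderC"]
```

With 50 requests outstanding, next_requests(WAITING, 50) returns an empty list, so nothing new is requested. As soon as one is satisfied, next_requests(WAITING, 49) lets exactly one more through.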

 

If you are seeing this behavior, you should watch the application log to see what the store is requesting. You’ll be seeing periodic 0x8 messages for the current 50 outstanding backfill requests, and you’ll find that no backfill response is received, which is why they’re still outstanding. At that point you should change your focus to troubleshooting one of the folders the store is currently trying to backfill, since resolving the problem will allow it to move on to other folders.

 

There is one other option, and that is to increase the Outstanding Backfill Limit (OBL). You can do this by creating a registry value called Replication Outstanding Backfill Limit under the registry key for that store. The maximum value is 5000 decimal. However, once you do this the replication floodgates will open and it will be hard to determine which 50 folders caused it to choke. You’ll need to postpone troubleshooting until things settle down again. Typically I recommend leaving the limit at 50 and fixing the problem, instead of working around it by increasing the limit.

 

If the OBL doesn’t appear to be a problem, and you still aren’t seeing outgoing 0x8 messages for the folder in question, see the Common Problems section below.

 

3. Does the other store respond to the request?

 

Once you have a backfill request to focus on, you need to determine if the backfill target ever got the request. Check the application log on that server for the incoming 0x8. You can also search the application log for the message ID mentioned in the outgoing event from the sending side. If you can find no sign of it in the application log, use message tracking to see how far it got. If it received the 0x8, it should respond almost immediately with one or more 0x80000002 or 0x80000004 messages (you will often see many backfill responses to a single backfill request, since the changes are not all sent in a single message). Of course, the time it takes to generate the backfill response messages will vary based on the data in the folder and the replication message size limit. For instance, if you set the maximum replication message size to 1 GB, the responding server could try to pack the entire hierarchy into a single backfill response, which might take an hour or more just to pack up!

 

4. Does the requesting server get the response?

 

Now it’s time to check the application log on the requesting server to see if it received the backfill response. If not, track the message and see how far it got. If it received the backfill response and logged it in the application log, then that backfill request should have been satisfied and it should be able to move on.

 

As mentioned earlier, if you find that message tracking shows that one of these messages was delivered to the store, yet the application log does not show the incoming replication message, have a look at “Common Problems” below.

 

4. Troubleshooting Replica Deletion

 

You removed an old server from the replica list on all your folders. However, when you go to Public Folder Instances for the old store in ESM, you still see a bunch of folders there. This is due to a problem with the replica deletion process. In the Exchange 2003 Sp2 version of ESM, if you try to delete a public store in this state, ESM presents a dialog stating:

 

“You cannot delete this public folder store because it contains folder replicas. To avoid data loss, right click the public folder store and use Move Replicas to move the replicas to a different server. It may take several hours until the content is replicated to the new server and the local replicas are removed.”

 

When you remove a store from the replica list on a folder, that store does not immediately delete the data. Instead, it sends out a special 0x20 status request to all the other replicas. This is called a Replica Delete Pending Status Request (RDPSR), and cannot be distinguished from a normal status request in the application log. An RDPSR contains a flag that indicates the replica is pending deletion. When the other stores receive this 0x20, they respond with a special 0x10 called a Replica Delete Pending Ack (RDPA). The RDPA indicates that it’s ok to delete that data – but the other stores only send this 0x10 if they already have all the CNs that the pending deletion replica has. The replica will only be deleted once the store has received a 0x10 indicating that someone else has the data.
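
The condition for sending the acknowledgement is a subset test. As with any model of CNSets in plain code, the following is a simplified sketch: CNs are represented as strings in Python sets, and the function names are hypothetical.

```python
def should_send_rdpa(local_cns: set, pending_deletion_cns: set) -> bool:
    """A replica acknowledges only if it already has every CN being deleted."""
    return pending_deletion_cns <= local_cns


def replica_can_be_deleted(acks_received) -> bool:
    """One acknowledgement from any other replica is enough."""
    return any(acks_received)


# Hypothetical state: the old server holds one CN the new server lacks.
OLD_SERVER = {"1-10", "1-11", "1-12"}
NEW_SERVER = {"1-10", "1-11"}
```

Here should_send_rdpa(NEW_SERVER, OLD_SERVER) is False, so the folder stays in Public Folder Instances until the new server has backfilled the missing CN.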

 

This means that if you delete the store before Public Folder Instances is empty, you are probably losing data. Only 2003 Sp2 ESM will stop you from doing this – in older versions you must manually check Public Folder Instances to see if it’s ok to delete the store. You should always check Public Folder Instances before deleting a public store, and when 2003 Sp2 ESM gives you this warning, you should not try to ignore it or work around it – instead, you should troubleshoot the replica deletion process.

 

Note that prior to Exchange 2003 Sp2, the server that was removed from the replica list only sends the RDPSR once. If no one responds, you’ll see that the folder just stays in Public Folder Instances indefinitely, unless you add the store back to the replica list and then remove it again, causing a new RDPSR to be sent. 2003 Sp2 changed this behavior so that the store will retry every hour until it gets an RDPA from someone.

 

–       Troubleshooting

 

This is almost the same as troubleshooting the backfill process.

 

1. Did the pending deletion replica send a 0x20?

 

Unless you already had logging turned up when you removed the replica, you won’t know. Fortunately, you can just add the replica back and then remove it again. Then watch the application log for the 0x20.

 

2. Did the 0x20 reach the other replicas?

 

You should know the drill by now. Check the application logs on the other replicas to see if they received the 0x20.

 

3. Did any other replica respond with a 0x10?

 

This is the part you’ll probably end up focusing on. If a replica received the 0x20 from the pending deletion replica, but did not respond with a 0x10, that means that the pending deletion replica has data that the other replica doesn’t. Since you know it just received a 0x20 from that replica, then you also know that it already knows what data it’s missing. Therefore, you’d expect to see a backfill request for that folder every 24-48 hours. Watch the application log, and troubleshoot it exactly like the normal backfill process described earlier.

 

4. Did the pending deletion replica receive the 0x10?

 

Once any other replica has all the data, that replica should respond with a 0x10. When the pending deletion replica receives that 0x10, it will finally be willing to delete that data. That doesn’t mean it will happen immediately, though. If there are clients using that replica, it won’t be deleted until later during online maintenance. If you want, you could speed this up by dismounting and mounting the store to disconnect the clients.

 

5. Common Problems

 

You may find that a server sent some type of replication message to another server, but the receiving server never logged the incoming message in the application log. However, message tracking says it was delivered locally to the store on that server. This behavior usually indicates either a problem with the ReplicationState table or a permissions problem on the SMTP virtual server.

 

Let’s cover the easy one first. One problem that causes an incoming replication message to be ignored by the receiving server is a problem with the ReplicationState, or ReplState, table. Note that a problem with the ReplState table may also cause the server to fail to issue backfill requests (0x8) for some folders, so this information also applies to that situation. Each public store uses its ReplState table to track the state of replication for any replicated folders. The table contains multiple rows for each folder – one row per replica. It’s possible for the rows in the ReplState table to get out of sync with the replica list, such that it has extra or missing rows. Sometimes you can get it to sync up again just by making a change such as removing a server from the replica list, applying the change, and then immediately adding it back, but this doesn’t always work. Fortunately, a ReplState test was added to isinteg. See KB889331 for Exchange 2003, or KB892485 for Exchange 2000. As long as you have the updated isinteg.exe and store.exe, you can use isinteg to correct the problem with the ReplState table. If you run only the ReplState test, it is typically very fast – less than a minute even on a large public store. Once isinteg has been run, you may still need to go back and make a change to the folder to get the ReplState table to sync up with the replica list. After they’re in sync, the server should be able to process the incoming replication messages, or should begin issuing backfill requests normally.

 

The other common problem that causes an incoming replication message to be ignored is an issue specific to Exchange 2003. An Exchange 2003 server requires that the sending server has the Send As right on the receiving SMTP virtual server. That is, if ServerA is Exchange 2003, and ServerB is sending a PF replication message to ServerA, ServerB must have Send As on ServerA’s SMTP virtual server. Otherwise, ServerA does not process the incoming replication message. This permission is normally granted through the Exchange Domain Servers group. If the Send As right is the problem, all incoming replication messages from a particular server will fail. I find it easiest to identify this problem with a network trace taken while a replication message is being transferred from one server to another. The conversation should go like this:

 

ServerA: 220 ServerA.microsoft.com Microsoft ESMTP MAIL Service…

 

ServerB: EHLO ServerB.microsoft.com

 

ServerA: 250-ServerA.microsoft.com Hello
         250-TURN
         250-SIZE
         250-ETRN
         250-PIPELINE
         250-DSN
         250-ENHANCEDSTATUSCODES
         250-8bitmime
         250-BINARYMIME
         250-CHUNKING
         250-VRFY
         250-X-EXPS GSSAPI NTLM LOGIN
         250-X-EXPS=LOGIN
         250-AUTH GSSAPI NTLM LOGIN
         250-AUTH=LOGIN
         250-X-LINK2STATE
         250-X-EXCH50
         250 OK
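
If you have the EHLO response as text (copied out of a network trace, for example), a small script can check what ServerA advertises on its X-EXPS line. The following is a minimal sketch, not a tool that ships with Exchange. The function names are hypothetical, and the choice to look for GSSAPI and NTLM specifically on X-EXPS is an assumption based on the trace shown here.

```python
from typing import Dict, List, Set


def parse_ehlo(lines: List[str]) -> Dict[str, List[str]]:
    """Map each advertised keyword (upper-cased) to its parameters.

    The first line is treated as the greeting and skipped. Both the
    "X-EXPS GSSAPI NTLM LOGIN" and the "X-EXPS=LOGIN" spellings are folded
    into the same keyword.
    """
    caps: Dict[str, List[str]] = {}
    for raw in lines[1:]:
        line = raw.strip()
        if not line.startswith("250"):
            continue
        body = line[4:].strip()  # drop the leading "250-" or "250 "
        if not body:
            continue
        parts = body.replace("=", " ", 1).split()
        caps.setdefault(parts[0].upper(), []).extend(p.upper() for p in parts[1:])
    return caps


def missing_exps_mechanisms(caps: Dict[str, List[str]]) -> Set[str]:
    """Return which of GSSAPI/NTLM are NOT advertised under X-EXPS (assumed check)."""
    return {"GSSAPI", "NTLM"} - set(caps.get("X-EXPS", []))


response = [
    "250-ServerA.microsoft.com Hello",
    "250-X-EXPS GSSAPI NTLM LOGIN",
    "250-X-EXPS=LOGIN",
    "250-X-EXCH50",
    "250 OK",
]
missing = missing_exps_mechanisms(parse_ehlo(response))
print("missing mechanisms:", sorted(missing))
```

An empty result means the integrated authentication options are being advertised. A non-empty result points at the authentication settings on the SMTP virtual server.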

 

The important part here is that ServerA must be advertising the GSSAPI NTLM LOGIN options. If you don’t see these in ServerA’s response to the EHLO, it’s usually because Integrated Windows Authentication has been unchecked on the SMTP virtual server. This is mentioned in step 1 of KB843106 and step 3 of KB842273. As long as these authentication verbs appear, you should see ServerB try to use them:

 

ServerB: X-EXPS GSSAPI

 

ServerA: 334 GSSAPI supported

 

ServerB: <a bunch of base64 encoded data>

 

ServerA: 334 <more base64 encoded stuff>

 

ServerB: CRLF

 

ServerA: 235 2.7.0 Authentication successful.

 

If authentication does not succeed, you may have a Kerberos problem or an issue with the computer account for ServerB. Next the servers will transmit linkstate information. After that, they finally get around to the business of transferring email:

 

ServerB: MAIL FROM:<ServerB-IS@microsoft.com>

 

ServerA: 250 2.1.0 ServerB-IS@microsoft.com....Sender OK

 

ServerB: RCPT TO:<ServerA-IS@microsoft.com> NOTIFY=NEVER

 

ServerA: 250 2.1.5 ServerA-IS@microsoft.com

 

ServerB: XEXCH50 2404 2

 

ServerA: 354 Send binary data

 

It’s this last response to the XEXCH50 verb that’s important. If the response is “354 Send binary data”, then everything is fine, at least as far as permissions to the SMTP virtual server are concerned. If the GSSAPI NTLM login options were not advertised, or the authentication attempt failed, then it’s expected that ServerA will instead respond with “504 Need to authenticate”. If those steps succeeded, but ServerA still says “504 Need to authenticate” instead of “354 Send binary data”, then ServerB does not have the Send As right on ServerA’s SMTP virtual server. There are several ways this could happen. For one, when you delegate rights such as Exchange Full Administrator in ESM, that user or group inherits a deny on the Send As right. Therefore, using ESM to delegate admin rights to the computer account, the Exchange Domain Servers group, or some other group that contains the Exchange servers will break public folder replication. Another possibility is that the computer account is not in the Exchange Domain Servers group, which is how it normally has the Send As right. You’ll need to evaluate the permissions on the SMTP virtual server and determine why the computer account for the sending server does not have the proper rights. See KB843106 and KB842273 for more details about the “504 Need to authenticate” problem.
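
The reasoning in the paragraph above is a small decision tree, and it can be sketched in Python as follows. This is only an illustration of the logic. The function name and its three inputs are hypothetical, and you would fill them in by hand from what you see in the network trace.

```python
def diagnose_xexch50(auth_advertised: bool, auth_succeeded: bool,
                     xexch50_reply: str) -> str:
    """Classify ServerA's reply to XEXCH50 using the reasoning in the text.

    auth_advertised: the EHLO response listed the GSSAPI NTLM LOGIN options
    auth_succeeded:  the trace shows "235 2.7.0 Authentication successful."
    xexch50_reply:   ServerA's reply line, e.g. "354 Send binary data"
    """
    code = xexch50_reply.strip().split(" ", 1)[0]
    if code == "354":
        return "OK: permissions on the SMTP virtual server are fine"
    if code == "504":
        if not auth_advertised:
            return "504 expected: authentication options were not advertised"
        if not auth_succeeded:
            return "504 expected: the authentication attempt failed"
        return "ServerB lacks the Send As right on ServerA's SMTP virtual server"
    return "Unexpected reply: not covered by this walkthrough"


# The healthy case from the trace above.
print(diagnose_xexch50(True, True, "354 Send binary data"))
# Authentication worked, yet XEXCH50 was refused: the Send As case.
print(diagnose_xexch50(True, True, "504 Need to authenticate"))
```

Only the last 504 branch points at the Send As right. The two branches before it send you back to the earlier steps of the trace.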

 

6. Conclusion

 

You may have noticed as you read through this document that SP2 for Exchange 2003 contains several important enhancements under the hood to prevent replication issues and assist in troubleshooting them. Environments with multiple public stores can really see a huge benefit from SP2, especially when it comes to moving replicas between servers, and adding and removing public stores.

 

7. Additional Reading

 

How to troubleshoot public folder replication problems in Exchange 2000 Server and in Exchange Server 2003

http://support.microsoft.com/default.aspx?scid=kb;EN-US;842273

 

How to troubleshoot the “504 need to authenticate first” SMTP protocol error

http://support.microsoft.com/default.aspx?scid=kb;EN-US;843106

 

Backfill requests for some public folders are never completed on an Exchange Server 2003 computer

http://support.microsoft.com/default.aspx?scid=kb;EN-US;889331

 

Backfill requests for some public folders are never completed on the Exchange 2000 Server or Exchange Server 2003 computer

http://support.microsoft.com/default.aspx?scid=kb;EN-US;892485

 

Update to send status request messages in Exchange 2000 Server

http://support.microsoft.com/default.aspx?scid=kb;EN-US;813629

 

XADM: 3044 and 3079 Events Occur Starting Information Store and Public Folder Replication Does Not Work

http://support.microsoft.com/default.aspx?scid=kb;EN-US;272999

 

8. Credits

 

Thanks to Bill Long (CPR) for writing this material!

Thanks to Dave Whitney (EXCHANGE) for his review of this and his comments / suggestions.

 

9. Tech Bulletin archive and subscription info

 

To subscribe to weekly Exchange Technical Bulletins please follow this link:

 

http://AutoGroup/JoinGroup.asp?GroupAlias=exchtb

 

Find older Exchange Technical Bulletins at \\nbtools\Share\Exchange_Technical_Bulletins

Find Tech Bulletins also at:

 

http://kiosk/sites/ex55

http://kiosk/sites/ex2000

http://kiosk/sites/ex2003

Another archive: http://emsweb/sites/Exchange/Exchange%20Technical%20Bulletins/Forms/All.aspx

 

Searchable Help file with all Bulletins (created and maintained by Mike Lagase):

 

\\exhelp\public\Docs\Exchange Tech Bulletins (CHM Format with Searchable Index)

 

NEW! Publicly available Bulletins aka Exchange Insider articles:

http://www.microsoft.com/technet/prodtechnol/exchange/2003/insider/default.mspx

 

Any questions? Please e-mail ninob. 

 

Nino Bilic

 

Microsoft Exchange Server – Beta

ninob@microsoft.com, 469-775-7265, 11:00AM – 8:00PM CST

 

Manager:

Jason Stine, jstine@microsoft.com, (425) 703 9360

Delighting our customers is our top priority. We welcome your comments and suggestions about how we can improve the support we provide to you. Please e-mail us at managers@microsoft.com.
