2012年3月29日星期四

BCPBatchSize and Replication hanging

I have transactional replication setup between 2 sql 2000 sp3a servers
running on 2 4 way Dell servers with 16 gig of memory each. The servers
are running windows 2003. Everything is enterprise edition and the sql
servers are using about 14 gig memory.
We restore the database to the second server and then enable publication
to that server turning of the generation of SPs but enabling the use of
column name for the distribution.
The Distribution runs every 10 minutes and normally keeps up just fine.
(before we started using column names).
Up until this past weekend we did not need the column name option in the
replication distribution and everything ran fine. Some table structures
changed but not that big of changes but we do not have about 60 ident
columns that we are dealing with.
Starting this Monday we needed to use the column name option because we
are replicating to some large tables that have identity columns on them.
We have enabled the not for replication property on the columns.
What I am seeing is that replication now just hangs at times. The
server just goes idle and distribution job just keeps running and says
"it is waiting for response from the backend co". We have rebuilt it a
couple of times and each time the distribution job ran fine for a while
and then would just hang. Once it would hang there was no getting past
that point.
I have traced the replication process and it just stops sending data. I
have also enabled the output of the distribution job to a file and it
shows more data being sent than I see in the sql trace. The trace is
not missing data because it always stops at the same spot (I tried it
several times) and the output file always stops also but shows alot more
data being sent out.
I tried changing the bcpbatchsize for the distribution agent all the way
down to 1 or 2 but the trace still shows many more commands per batch
than 1 or 2. I have enabled the tsql batch start and end to trace the
batches allow with the statement start and end.
I am wondering what the bcpbatchsize really affects if it does not
change the number of commands per batch? Since we have some tables with
large rows I am thinking that the 10,000 commands per batch is too much
now that we are trying to include column names.
How do I really change the batch size?
Anybody else seen this problem when including column names?
Thanks,
Ron
fiddling with the bcpbatchsize is not always a good idea. What this does is
determines how many rows are processed before they are committed. Setting
this to a high value means your tlog on your subscriber will grow and the
bcp process (performed by odbcbcp) will chug along longer before sending
messages back to the SQL Server agent. This will cause the "waiting for
response from the backend".
using a small value for bcpbatchsize will take longer, but is more chatty
and you won't get this message.
Think of it like the way your girlfriend blows your check. Give her a large
amount of money and you won't hear from her for weeks. On the other hand
give her a small amount of money, and she will be back in a few hours. Give
her to little money and she will never come back. So you have to balance the
amount of money you give her with how frequently you want to see her. Same
with bcpbatchsize - but I digress.
If your subscriber is a multiproc machine think about using UseInprocLoader
switch.
But, I'm a little confused - are you regenerating your snapshot? Are the
tables whose column names you have changed large?
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
"Ron" <rgellenbecker@.wonderboxtech.com> wrote in message
news:Xns95E29DD9C85B5WBT39312565@.207.46.248.16...
> I have transactional replication setup between 2 sql 2000 sp3a servers
> running on 2 4 way Dell servers with 16 gig of memory each. The servers
> are running windows 2003. Everything is enterprise edition and the sql
> servers are using about 14 gig memory.
> We restore the database to the second server and then enable publication
> to that server turning of the generation of SPs but enabling the use of
> column name for the distribution.
> The Distribution runs every 10 minutes and normally keeps up just fine.
> (before we started using column names).
> Up until this past weekend we did not need the column name option in the
> replication distribution and everything ran fine. Some table structures
> changed but not that big of changes but we do not have about 60 ident
> columns that we are dealing with.
> Starting this Monday we needed to use the column name option because we
> are replicating to some large tables that have identity columns on them.
> We have enabled the not for replication property on the columns.
> What I am seeing is that replication now just hangs at times. The
> server just goes idle and distribution job just keeps running and says
> "it is waiting for response from the backend co". We have rebuilt it a
> couple of times and each time the distribution job ran fine for a while
> and then would just hang. Once it would hang there was no getting past
> that point.
> I have traced the replication process and it just stops sending data. I
> have also enabled the output of the distribution job to a file and it
> shows more data being sent than I see in the sql trace. The trace is
> not missing data because it always stops at the same spot (I tried it
> several times) and the output file always stops also but shows alot more
> data being sent out.
> I tried changing the bcpbatchsize for the distribution agent all the way
> down to 1 or 2 but the trace still shows many more commands per batch
> than 1 or 2. I have enabled the tsql batch start and end to trace the
> batches allow with the statement start and end.
> I am wondering what the bcpbatchsize really affects if it does not
> change the number of commands per batch? Since we have some tables with
> large rows I am thinking that the 10,000 commands per batch is too much
> now that we are trying to include column names.
> How do I really change the batch size?
> Anybody else seen this problem when including column names?
> Thanks,
> Ron
|||Thanks for the response.
I was changing the bcpbatchsize because of the problem I am having were
the replication just stopped. I let it sit for over 24 hrs and saw no
activity, no error message, it just said it "waiting for response from
the backend co". It never timed out. I was using the default profile
at that time. The destination server was not showing much activity and
the trace showed no command flowing between the servers.
The other problem I then saw using the trace was that even when I set
the bcpbatchsize very small and I was able to keep replication running
(after doing complete resetup) I would see more commands sent per batch
than I had specified. I set it to two and during the trace saw more
than two commands between eatch batch start and end in the trace. My
understanding of the bcpbatchsize is as you defined it. I just did not
see it actually doing that in the trace. That is why I am wondering
just what it actually does?
Even with the bcpbatchsize at 2 the replication eventually froze again
on the same message, "waiting for a response from the backend co". The
problem was that it was still sending the big batches. I wanted to make
them smaller as a test and was having no luck.
We don't use the snapshot because it takes way too long for our 70 gig
databases. We can backup and then restore the database in about an
hour.
Some of the tables are quite large (lots of columns and some the columns
get pretty big) and some of the column names get a little wordy.
We have gone back to removing the idents (which takes about an hour
right now it took about 1/2 hr 6 months ago on slower hardware) and are
no longer setting the "use column names" in the replication and it has
been running fine for over a week. I don't want to keep removing the
idents because the time it takes is getting longer quickly and the
window to setup replication is shrinking. I can get about 6 hours now
before users start to complain. Our next upgrade is coming up soon and
I will be redoing the replication setup again because several tables
will have structure changes.
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in
news:#1pAG5k$EHA.3424@.TK2MSFTNGP11.phx.gbl:

> fiddling with the bcpbatchsize is not always a good idea. What this
> does is determines how many rows are processed before they are
> committed. Setting this to a high value means your tlog on your
> subscriber will grow and the bcp process (performed by odbcbcp) will
> chug along longer before sending messages back to the SQL Server
> agent. This will cause the "waiting for response from the backend".
> using a small value for bcpbatchsize will take longer, but is more
> chatty and you won't get this message.
> Think of it like the way your girlfriend blows your check. Give her a
> large amount of money and you won't hear from her for weeks. On the
> other hand give her a small amount of money, and she will be back in a
> few hours. Give her to little money and she will never come back. So
> you have to balance the amount of money you give her with how
> frequently you want to see her. Same with bcpbatchsize - but I
> digress.
> If your subscriber is a multiproc machine think about using
> UseInprocLoader switch.
> But, I'm a little confused - are you regenerating your snapshot? Are
> the tables whose column names you have changed large?
>
|||There is a condition where the buffers are exhausted/depleted this might
account for what you are seeing. Normally you get this when the agent
starts.
There is another condition, which again you might be running into. When you
push a tonne of data over the wire it takes some time for the data to be
committed. This time is a function of how much data and bcpbatchsize and
your log size. I have on occasion waited overnight for a snapshot to be
applied.
I would advise you to look into another strategy for deploying these large
snapshots. For instance you might want to break your publication into
pieces related to DRI. Put the large tables into separate publications, i.e.
one publication for each large table and tables related to it by DRI.
Then do a nosync subscription with these tables and use another method for
deploying the snapshot. Like bcp, or copying the snapshot to a cd,
tapedrive, network share and then applying it to the subscriber from this
location.
I prefer to use bcp as I have a window into what is being applied and I can
control it in a highly granular fashion. It will require more work, but the
problem is with large snapshot you have to start off at square one with each
hiccup.
SQL 2005 will restart the snapshot application where it fails so it should
fix this.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
"Ron" <rgellenbecker@.wonderboxtech.com> wrote in message
news:Xns95E37DDC6CCD5WBT39312565@.207.46.248.16...
> Thanks for the response.
> I was changing the bcpbatchsize because of the problem I am having were
> the replication just stopped. I let it sit for over 24 hrs and saw no
> activity, no error message, it just said it "waiting for response from
> the backend co". It never timed out. I was using the default profile
> at that time. The destination server was not showing much activity and
> the trace showed no command flowing between the servers.
> The other problem I then saw using the trace was that even when I set
> the bcpbatchsize very small and I was able to keep replication running
> (after doing complete resetup) I would see more commands sent per batch
> than I had specified. I set it to two and during the trace saw more
> than two commands between eatch batch start and end in the trace. My
> understanding of the bcpbatchsize is as you defined it. I just did not
> see it actually doing that in the trace. That is why I am wondering
> just what it actually does?
> Even with the bcpbatchsize at 2 the replication eventually froze again
> on the same message, "waiting for a response from the backend co". The
> problem was that it was still sending the big batches. I wanted to make
> them smaller as a test and was having no luck.
> We don't use the snapshot because it takes way too long for our 70 gig
> databases. We can backup and then restore the database in about an
> hour.
> Some of the tables are quite large (lots of columns and some the columns
> get pretty big) and some of the column names get a little wordy.
> We have gone back to removing the idents (which takes about an hour
> right now it took about 1/2 hr 6 months ago on slower hardware) and are
> no longer setting the "use column names" in the replication and it has
> been running fine for over a week. I don't want to keep removing the
> idents because the time it takes is getting longer quickly and the
> window to setup replication is shrinking. I can get about 6 hours now
> before users start to complain. Our next upgrade is coming up soon and
> I will be redoing the replication setup again because several tables
> will have structure changes.
>
>
> "Hilary Cotter" <hilary.cotter@.gmail.com> wrote in
> news:#1pAG5k$EHA.3424@.TK2MSFTNGP11.phx.gbl:
>
|||Thanks for the info.
My problem is not with deploying the snapshots, we have figured out how
to do the replication setup without using snapshots.
The problem is that we want to start using column names in the
replication.
I have done more testing and tracing and have narrowed down my problem.
We are using 1 way transactional replication. When the system updates
records, replication is issuing deletes and then inserts. The deletes
are the part that are taking a long time (over an hour). The inserts
takes seconds. It is the same number of rows deleted and inserted.
If I delete the rows outside of the update statement on the publisher
and then just insert the data on the publisher the process takes seconds
for the deletes and inserts to replicate. It is when I do update
statments to the publisher that the distribution takes a very long time.
The log reader gets the data in seconds in either case. The odd part is
the during the trace I can see that the distribution is just doing
deletes and then inserts. I have even taken and recreated the deletes
that replication appears to be using and they run very quickly against
the publisher and through the distribution process.
The table has the same structure on both the publisher and subscriber.
There are no idents in this table and all foreign keys have been removed
on the subscriber. The delete statement is using the clustered primary
to select the records and the deletes are being done one row at a time.
I have done the deletes one row at time outside of the update statement
and it took about 2 min to execute and about 2 min to replicate.
Why do the delete statments created by replication for the updates at
the publisher take so much longer to execute when using column names in
replication? Updating the same records without column names takes
seconds.
Thanks,
Ron
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in
news:usZrKu#$EHA.1564@.TK2MSFTNGP09.phx.gbl:

> There is a condition where the buffers are exhausted/depleted this
> might account for what you are seeing. Normally you get this when the
> agent starts.
> There is another condition, which again you might be running into.
> When you push a tonne of data over the wire it takes some time for the
> data to be committed. This time is a function of how much data and
> bcpbatchsize and your log size. I have on occasion waited overnight
> for a snapshot to be applied.
> I would advise you to look into another strategy for deploying these
> large snapshots. For instance you might want to break your
> publication into pieces related to DRI. Put the large tables into
> separate publications, i.e. one publication for each large table and
> tables related to it by DRI.
> Then do a nosync subscription with these tables and use another method
> for deploying the snapshot. Like bcp, or copying the snapshot to a cd,
> tapedrive, network share and then applying it to the subscriber from
> this location.
> I prefer to use bcp as I have a window into what is being applied and
> I can control it in a highly granular fashion. It will require more
> work, but the problem is with large snapshot you have to start off at
> square one with each hiccup.
> SQL 2005 will restart the snapshot application where it fails so it
> should fix this.
>
sql

没有评论:

发表评论