I have been continuing to attempt to spider my message board from outside the threads software. I am having trouble with mass duplication: 25 copies of the same page in the results.

The fellow who writes the ZOOM spider has been working with me on this. Here are his comments:


We made a few attempts and did some thorough looking at the problem. I think
we were able to track down the core of the problem, but there doesn't seem
to be any easy solution to this.

The problem is largely (if not completely) caused by the new URLs used by
UBB and the way it passes extra parameters in the URL to track how a
user got to a thread (i.e. from which forum index, etc.). There is also a
lot of inconsistent naming, with varying parameters that mean similar
things. I can't see how this new version of UBB can be very friendly to
search engines: it just gives out too many different URLs for the exact
same page.

I'll explain this in more detail later, but first, I'll describe my testing
setup. I used the following as my start URL:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/

And this is my skip list:
---------------------start
?ubb=newpost
?ubb=markallread
?ubb=mycookies
/ubb/newuser
/ubb/cfrm
/ubb/calendar
/ubb/search
/ubb/faq
/fpart/all/
/ubb/showprofile
/ubb/dosearch
/showflat/sticky/
page/1/fpart/1
/ubb/printthread
----------------------end

This simple setup was able to index correctly for the most part.
Because of the start URL, many other links such as
"ubbthreads.php?ubb=showday" were automatically ignored because they were
considered external to the "ubbthreads.php/ubb/" folder.
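
To make the filtering concrete, here is a minimal Python sketch of the two
rules described above. It assumes that a skip entry matches when it appears
anywhere in the URL and that "external" means the URL does not begin with
the start URL; Zoom's real matching rules may differ, and the function name
should_crawl is made up for illustration.

```python
# Sketch only: assumes substring matching for skip entries and a simple
# prefix test for the start folder. Not Zoom's actual implementation.
START_URL = "http://ambergriscaye.com/forum/ubbthreads.php/ubb/"

SKIP_LIST = [
    "?ubb=newpost",
    "?ubb=markallread",
    "?ubb=mycookies",
    "/ubb/newuser",
    "/ubb/cfrm",
    "/ubb/calendar",
    "/ubb/search",
    "/ubb/faq",
    "/fpart/all/",
    "/ubb/showprofile",
    "/ubb/dosearch",
    "/showflat/sticky/",
    "page/1/fpart/1",
    "/ubb/printthread",
]


def should_crawl(url: str) -> bool:
    """Return True if the URL is under the start folder and hits no skip entry."""
    if not url.startswith(START_URL):
        # e.g. "ubbthreads.php?ubb=showday" sits outside "ubbthreads.php/ubb/"
        return False
    return not any(entry in url for entry in SKIP_LIST)
```

Under these assumptions the forum index pages pass the filter, while
"ubbthreads.php?ubb=showday" is rejected by the prefix test alone, without
needing its own skip entry.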

Indexing with this setup gave me all the message posts, but with some
duplication.

Here's the crux of it:

The forum indexes are accessed as such:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/0
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/2

These URLs are important and we need to index them. They are the listing of
threads for one of the forums ("Board/1") and each of the pages contains
different threads. We need to crawl these indexes to find the threads, so we
can't simply skip "page/1" etc. Note that in the above, "page/0" is the same
as "page/1". Yet if we skip "page/0", we might not find a "page/1" link
given by UBB, and so miss a forum.

Now, when you click on a thread from, let's say, page 1 of the above board,
UBB carries the "page/1" part of the URL across in order to remember where
you came from. So you get the following:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/2
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/3

All of these go to the same thread, with "fpart/2" and "fpart/3" pointing to
the 2nd and 3rd pages of that thread.

But if this thread was linked from the second page of the board index, it
would have URLs like:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/2/fpart/1

And that's the problem. The page parameter is merely a tracking mechanism.
It doesn't actually change the page, and yet it can be anything. It makes it
impossible to determine if the pages are the same.

The idea of simply skipping "page/2" and "page/3" etc. won't work. This is
because you'd then be skipping all threads which were only linked from the
second and 3rd pages of the forum index.
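
The collision can be shown in a few lines of Python. This is only a sketch
for comparing URLs: it strips the "page/x" segment from "showflat" URLs so
that two links to the same thread page produce the same key. The function
name canonical_thread_url is made up, this is not something Zoom does, and
the stripped form is meant as a comparison key only (whether UBB would
serve such a URL is not known here).

```python
import re

# Matches the tracking segment only when it follows a showflat thread number,
# so forum index URLs like ".../postlist/Board/1/page/2" are left alone.
_TRACKING_PAGE = re.compile(r"(/showflat/Number/\d+)/page/\d+")


def canonical_thread_url(url: str) -> str:
    """Drop the 'page/x' tracking segment from a showflat URL (comparison key)."""
    return _TRACKING_PAGE.sub(r"\1", url)
```

With this key, the "page/1/fpart/1" and "page/2/fpart/1" links to thread 349
compare as equal, while the "postlist" index pages keep their distinct
"page/x" values.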

To me, it would seem to be a flaw in the design of the URL naming method in
UBB. Google, Yahoo, etc. would all be looking at many, many versions of the
same page with URLs like these. They might be filtering some out based on a
percentage of how similar they are, but the site can't rate well in terms of
PageRank when this happens.

We provide a method of detecting duplicate pages but it is useless here
because the same page looks different on each load (due to the chatbox on
the side and also the "Generated in x seconds" message down the bottom).
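
As an illustration of why content comparison fails here, the Python sketch
below hashes two copies of the same page that differ only in the "Generated
in x seconds" footer. The HTML in the usage note is invented, the function
names are made up, and this is not how Zoom's duplicate detection works
internally; it only shows the general effect. The chatbox would need
similar treatment and is not handled.

```python
import hashlib
import re

# Assumed wording of the footer, based on the message quoted above.
_GENERATED = re.compile(r"Generated in [\d.]+ seconds")


def naive_fingerprint(html: str) -> str:
    """Hash the page exactly as served."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def stable_fingerprint(html: str) -> str:
    """Hash the page after removing the timing footer text."""
    return naive_fingerprint(_GENERATED.sub("", html))
```

For two hypothetical loads such as "<p>Hello</p><div>Generated in 0.123
seconds</div>" and the same markup with "0.456", naive_fingerprint gives
different values while stable_fingerprint gives the same one.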

So is there any solution? These are what come to mind:

- If there is an option within UBB to turn off the feature of remembering
which page of the forum index you came from (so that it would drop the
"page/x" parameter in all the "showflat" thread URLs), then this would cure
it.

- From what I can tell, this feature seems only evident in the "Previous
Topic" "Index" "Next Topic" links, down the bottom of a thread (before the
"Quick Reply" box). If you can edit your UBB template, and enclose these
links with <!--ZOOMSTOPFOLLOW--> ... <!--ZOOMRESTARTFOLLOW-->, then Zoom
will actually NOT crawl these links.

There might be other places within UBB's complex interface that contain
links like these though, and if you can find them and do the same, that
would help minimize the chances of Zoom finding different links to the same
thread.
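
The effect of the two tags can be sketched in Python as follows. This only
emulates the behaviour described above (links between the tags are not
followed); it is not Zoom's code, and the function name links_to_follow and
the simple href pattern are assumptions made for the example.

```python
import re

# Everything between the two comment tags is dropped before links are read.
_NO_FOLLOW_REGION = re.compile(
    r"<!--ZOOMSTOPFOLLOW-->.*?<!--ZOOMRESTARTFOLLOW-->", re.DOTALL
)
# Simplified link pattern: double-quoted href attributes only.
_HREF = re.compile(r'href="([^"]+)"')


def links_to_follow(html: str) -> list:
    """Return href values found outside the no-follow regions."""
    visible = _NO_FOLLOW_REGION.sub("", html)
    return _HREF.findall(visible)
```

If the "Previous Topic" / "Index" / "Next Topic" links in the template are
wrapped in the two tags, a crawler behaving like this sketch would still
pick up every other link on the thread page but none of the wrapped ones.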

I hope that helps somewhat!
==============================