Joined: Nov 2006
Posts: 173
member
I have been continuing to attempt to spider my message board from outside the threads software. I have trouble with mass duplication: 25 copies of the same page in the results. The fellow who writes the ZOOM spider has been working with me on this. Here are his comments:
We made a few attempts and took a thorough look at the problem. I think we were able to track down the core of it, but there doesn't seem to be any easy solution.
The problem is largely (if not completely) caused by the new URLs used by UBB and the way it passes extra parameters in the URL to track how a user got to a thread (i.e. from which forum index, etc.). There is also a lot of inconsistent naming, with varying parameters that mean similar things. I can't see how this new version of UBB can be very friendly to search engines. It just gives out too many different URLs to the exact same page.
I'll explain this in more detail later, but first, I'll describe my testing setup. I used the following as my start URL: http://ambergriscaye.com/forum/ubbthreads.php/ubb/
And this is my skip list:
---------------------start
?ubb=newpost
?ubb=markallread
?ubb=mycookies
/ubb/newuser
/ubb/cfrm
/ubb/calendar
/ubb/search
/ubb/faq
/fpart/all/
/ubb/showprofile
/ubb/dosearch
/showflat/sticky/
page/1/fpart/1
/ubb/printthread
----------------------end
This simple setup was actually able to index correctly for the most part. Because of the start URL, many other links such as "ubbthreads.php?ubb=showday" were automatically ignored because they were considered external to the "ubbthreads.php/ubb/" folder.
Indexing with this setup gave me all the message posts, but with some duplication.
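The filtering described above can be sketched roughly as follows. This is only an illustration in Python, not Zoom's actual code: the function name `should_crawl` is made up, and it assumes that the start URL works as a plain prefix test and that each skip list entry is matched as a substring of the URL.

```python
# Hypothetical sketch of start-URL scoping plus a skip list.
# Assumption: skip entries are matched as plain substrings of the URL.
START_URL = "http://ambergriscaye.com/forum/ubbthreads.php/ubb/"

SKIP_LIST = [
    "?ubb=newpost", "?ubb=markallread", "?ubb=mycookies",
    "/ubb/newuser", "/ubb/cfrm", "/ubb/calendar", "/ubb/search",
    "/ubb/faq", "/fpart/all/", "/ubb/showprofile", "/ubb/dosearch",
    "/showflat/sticky/", "page/1/fpart/1", "/ubb/printthread",
]


def should_crawl(url: str) -> bool:
    """Return True if the URL is under the start folder and not skipped."""
    # Anything outside the start URL's folder counts as external.
    if not url.startswith(START_URL):
        return False
    # Reject the URL if any skip entry appears anywhere in it.
    return not any(entry in url for entry in SKIP_LIST)
```

Under these assumptions, "ubbthreads.php?ubb=showday" is rejected by the prefix test alone, without needing its own skip entry.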
Here's the crux of it:
The forum indexes are accessed as such:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/0
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/2
These URLs are important and we need to index them. They are the listing of threads for one of the forums ("Board/1") and each of the pages contains different threads. We need to crawl these indexes to find the threads, so we can't simply skip "page/1" etc. Note that in the above, "page/0" is the same as "page/1". Yet if we skip "page/0", we might not find a "page/1" link given by UBB, and miss a forum.
Now, when you click on a thread from, let's say, page 1 of the above board, it carries across the "page/1" part of the URL in order to remember where it came from. So you get the following:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/2
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/3
All of these go to the same thread, with "fpart/2" and "fpart/3" pointing to the 2nd and 3rd pages of that thread.
But if this thread was linked from the second page of the board index, it would have URLs like:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/2/fpart/1
And that's the problem. The page parameter is merely a tracking mechanism. It doesn't actually change the page, and yet it can be anything. It makes it impossible to determine if the pages are the same.
The idea of simply skipping "page/2" and "page/3" etc. won't work. This is because you'd then be skipping all threads which were only linked from the second and third pages of the forum index.
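To make the duplication concrete, here is a rough Python sketch of how the tracking parameter could be stripped so that different URLs for the same thread page compare equal. This is not something Zoom is described as doing here. The helper `canonical_thread_url` is hypothetical, and it assumes thread URLs always have the "showflat/Number/N/page/N" shape shown above.

```python
import re

# Assumption: the tracking "page/N" always directly follows "showflat/Number/N".
_TRACKING_PAGE = re.compile(r"(/showflat/Number/\d+)/page/\d+")


def canonical_thread_url(url: str) -> str:
    """Drop the 'page/N' tracking segment from a showflat thread URL.

    Board index URLs (postlist/.../page/N) are left untouched, because
    there the page number really does select different content.
    """
    return _TRACKING_PAGE.sub(r"\1", url)
```

With this, the "page/1/fpart/2" and "page/2/fpart/2" variants of thread 349 both reduce to ".../showflat/Number/349/fpart/2". It does not address the other equivalences mentioned earlier, such as "page/0" versus "page/1" on the board index.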
To me, it would seem to be a flaw in the design of the URL naming method in UBB. Google, Yahoo, etc. would all be looking at many many versions of the same page with URLs like these. They might be filtering some out based on a percentage of how similar they are, but it can't rate well in terms of PageRank when this happens.
We provide a method of detecting duplicate pages but it is useless here because the same page looks different on each load (due to the chatbox on the side and also the "Generated in x seconds" message down the bottom).
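As a rough illustration of why content comparison breaks down, the Python sketch below hashes two copies of the same page that differ only in the timing footer. The sample HTML, the exact "Generated in ... seconds" wording, and the `fingerprint` helper are all assumptions made for the example. It says nothing about how Zoom's own duplicate detection is implemented.

```python
import hashlib
import re

# Assumption: the footer looks like "Generated in 0.012 seconds".
_TIMING = re.compile(r"Generated in [\d.]+ seconds")


def fingerprint(html: str, strip_timing: bool = False) -> str:
    """Hash the page text, optionally removing the timing footer first."""
    if strip_timing:
        html = _TIMING.sub("", html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Two loads of the same (made-up) thread page.
page_a = "<p>Thread 349</p><p>Generated in 0.012 seconds</p>"
page_b = "<p>Thread 349</p><p>Generated in 0.047 seconds</p>"

same_raw = fingerprint(page_a) == fingerprint(page_b)
same_stripped = fingerprint(page_a, True) == fingerprint(page_b, True)
```

Here `same_raw` is False and `same_stripped` is True. The chatbox content would need similar treatment, which is harder because it is not a fixed pattern.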
So is there any solution? These are what come to mind:
- If there is an option within UBB to turn off the feature of remembering which page of the forum index you came from (so that it would drop the "page/x" parameter in all the "showflat" thread URLs), then this would cure it.
- From what I can tell, this feature seems only evident in the "Previous Topic" "Index" "Next Topic" links, down the bottom of a thread (before the "Quick Reply" box). If you can edit your UBB template, and enclose these links with <!--ZOOMSTOPFOLLOW--> ... <!--ZOOMRESTARTFOLLOW-->, then Zoom will actually NOT crawl these links.
There might be other places within UBB's complex interface that contain links like these though, and if you can find them and do the same, that would help minimize the chances of Zoom finding different links to the same thread.
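The effect of the two tags can be pictured with a small Python sketch. It is only a model of the behaviour described above, not Zoom's implementation: `followable_links` is a made-up helper, the sample template is invented, and links are pulled out with a simple regular expression rather than a real HTML parser.

```python
import re

# Everything between the two Zoom comment tags is dropped before link extraction.
_NO_FOLLOW = re.compile(
    r"<!--ZOOMSTOPFOLLOW-->.*?<!--ZOOMRESTARTFOLLOW-->", re.DOTALL
)
# Simplification: only double-quoted href attributes are recognised.
_HREF = re.compile(r'href="([^"]+)"')


def followable_links(html: str) -> list[str]:
    """Return hrefs that fall outside any STOPFOLLOW/RESTARTFOLLOW region."""
    visible = _NO_FOLLOW.sub("", html)
    return _HREF.findall(visible)


# Invented template fragment with the navigation links wrapped in the tags.
sample = (
    '<a href="/ubb/showflat/Number/349/fpart/2">2</a>'
    "<!--ZOOMSTOPFOLLOW-->"
    '<a href="/ubb/showflat/Number/348/page/1">Previous Topic</a>'
    '<a href="/ubb/showflat/Number/350/page/1">Next Topic</a>'
    "<!--ZOOMRESTARTFOLLOW-->"
)
links = followable_links(sample)
```

In this example only the "fpart/2" link remains in `links`, and the wrapped "Previous Topic" and "Next Topic" links are never seen by the crawler.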
I hope that helps somewhat! ==============================