|
Joined: Nov 2006
Posts: 173
member
I have been continuing to attempt to spider my message board from outside the threads software. I have trouble with mass duplication: 25 copies of the same page in the results. The fellow who writes the Zoom spider has been working with me on this. Here are his comments:
We made a few attempts and looked at the problem thoroughly. I think we were able to track down the core of the problem, but there doesn't seem to be any easy solution to this.
The problem is largely (if not completely) caused by the new URLs used by UBB and the way it passes extra parameters in the URL to track how a user got to a thread (i.e., from which forum index, etc.). There are also a lot of inconsistently named or varying parameters that mean similar things. I can't see how this new version of UBB can be very friendly to search engines: it just gives out too many different URLs for the exact same page.
I'll explain this in more detail later, but first, I'll describe my testing setup. I used the following as my start URL: http://ambergriscaye.com/forum/ubbthreads.php/ubb/
And this is my skip list:
---------------------start
?ubb=newpost
?ubb=markallread
?ubb=mycookies
/ubb/newuser
/ubb/cfrm
/ubb/calendar
/ubb/search
/ubb/faq
/fpart/all/
/ubb/showprofile
/ubb/dosearch
/showflat/sticky/
page/1/fpart/1
/ubb/printthread
----------------------end
This simple setup was able to index correctly for the most part. Because of the start URL, many other links such as "ubbthreads.php?ubb=showday" were automatically ignored because they were considered external to the "ubbthreads.php/ubb/" folder.
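A minimal sketch in Python of how the start URL and the skip list above interact. It assumes the spider keeps a link only if it begins with the start URL and contains none of the skip-list entries as a substring. `should_crawl` is a hypothetical name, not part of Zoom, and Zoom's real matching rules may differ.

```python
START_URL = "http://ambergriscaye.com/forum/ubbthreads.php/ubb/"

# Skip list from the post above, one entry per item.
SKIP_LIST = [
    "?ubb=newpost", "?ubb=markallread", "?ubb=mycookies",
    "/ubb/newuser", "/ubb/cfrm", "/ubb/calendar", "/ubb/search",
    "/ubb/faq", "/fpart/all/", "/ubb/showprofile", "/ubb/dosearch",
    "/showflat/sticky/", "page/1/fpart/1", "/ubb/printthread",
]


def should_crawl(url: str) -> bool:
    """Assumed rule: stay under the start URL and avoid skip-list substrings."""
    if not url.startswith(START_URL):
        return False  # treated as external to the "ubbthreads.php/ubb/" folder
    return not any(entry in url for entry in SKIP_LIST)
```

Under these assumptions, "ubbthreads.php?ubb=showday" is rejected by the prefix check alone, without needing its own skip-list entry.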
Indexing with this setup gave me all the message posts, but with some duplication.
Here's the crux of it:
The forum indexes are accessed as such:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/0
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/postlist/Board/1/page/2
These URLs are important and we need to index them. They are the listings of threads for one of the forums ("Board/1"), and each page contains different threads. We need to crawl these indexes to find the threads, so we can't simply skip "page/1" etc. Note that in the above, "page/0" is the same as "page/1". Yet if we skip "page/0", we might not find a "page/1" link given by UBB, and miss a forum.
Now, when you click on a thread from, let's say, page 1 of the above board, it carries the "page/1" part of the URL across in order to remember where you came from. So you get the following:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/1
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/2
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/1/fpart/3
All of these go to the same thread, with "fpart/2" and "fpart/3" pointing to the 2nd and 3rd pages of that thread.
But if this thread was linked from the second page of the board index, it would have URLs like:
http://ambergriscaye.com/forum/ubbthreads.php/ubb/showflat/Number/349/page/2/fpart/1
And that's the problem. The page parameter is merely a tracking mechanism. It doesn't actually change the page, and yet it can be anything. It makes it impossible to determine if the pages are the same.
The idea of simply skipping "page/2" and "page/3" etc. won't work. This is because you'd then be skipping all threads which were only linked from the second and 3rd pages of the forum index.
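To make the point concrete, here is a sketch in Python of why the "page" value carries no identity for a thread URL: once it is dropped, the variants collapse to one key per thread page. `thread_key` is a hypothetical helper, and the URL layout is inferred only from the examples above. Nothing here is a Zoom or UBB feature.

```python
import re
from typing import Optional, Tuple

# Matches thread URLs like .../showflat/Number/349/page/2/fpart/3
SHOWFLAT = re.compile(
    r"/showflat/Number/(?P<number>\d+)"
    r"(?:/page/\d+)?"              # tracking value, ignored on purpose
    r"(?:/fpart/(?P<fpart>\d+))?"
    r"/?$"
)


def thread_key(url: str) -> Optional[Tuple[int, int]]:
    """Return (thread number, thread page), or None for non-thread URLs."""
    match = SHOWFLAT.search(url)
    if match is None:
        return None
    # Assumption from the examples: a missing fpart shows the first page.
    return int(match.group("number")), int(match.group("fpart") or 1)
```

A crawler that could key its "already seen" set on something like `thread_key(url)` instead of the raw URL would fetch each thread page once, no matter which index page linked to it.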
To me, it would seem to be a flaw in the design of the URL naming method in UBB. Google, Yahoo, etc. would all be looking at many many versions of the same page with URLs like these. They might be filtering some out based on a percentage of how similar they are, but it can't rate well in terms of PageRank when this happens.
We provide a method of detecting duplicate pages but it is useless here because the same page looks different on each load (due to the chatbox on the side and also the "Generated in x seconds" message down the bottom).
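A small Python illustration of why content-based duplicate detection breaks down here. It assumes the detector compares a checksum of the raw page (the post does not say how Zoom actually does it), and the two HTML strings are made-up stand-ins for two loads of the same thread.

```python
import zlib


def page_checksum(html: str) -> int:
    """Checksum of the raw page bytes (assumed detection method)."""
    return zlib.crc32(html.encode("utf-8"))


# Made-up stand-ins for two loads of the same thread page.
first_load = "<p>Same post text</p><div>Generated in 0.021 seconds</div>"
second_load = "<p>Same post text</p><div>Generated in 0.034 seconds</div>"

# The post content is identical, but the footer differs on every load,
# so the two loads no longer look like the same page to the detector.
same_page = page_checksum(first_load) == page_checksum(second_load)
```

Here `same_page` comes out false even though the thread content did not change.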
So is there any solution? These are what come to mind:
- If there is an option within UBB to turn off the feature of remembering which page of the forum index you came from (so that it would drop the "page/x" parameter in all the "showflat" thread URLs), then this would cure it.
- From what I can tell, this feature seems only evident in the "Previous Topic" "Index" "Next Topic" links, down the bottom of a thread (before the "Quick Reply" box). If you can edit your UBB template, and enclose these links with <!--ZOOMSTOPFOLLOW--> ... <!--ZOOMRESTARTFOLLOW-->, then Zoom will actually NOT crawl these links.
There might be other places within UBB's complex interface that contain links like these though, and if you can find them and do the same, that would help minimize the chances of Zoom finding different links to the same thread.
I hope that helps somewhat!
==============================
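The effect of the two comment tags can be sketched in Python as follows. This is only an approximation of the behaviour described above (links between the tags are not followed). `followable_links` is a hypothetical function, the regular expressions are simplifications rather than Zoom's actual parser, and treating an unclosed stop tag as running to the end of the page is an assumption.

```python
import re
from typing import List

# Everything from a stop tag to the next restart tag
# (or, by assumption, to the end of the page if no restart follows).
NO_FOLLOW = re.compile(
    r"<!--ZOOMSTOPFOLLOW-->.*?(?:<!--ZOOMRESTARTFOLLOW-->|\Z)",
    re.DOTALL,
)
HREF = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def followable_links(html: str) -> List[str]:
    """Return hrefs that sit outside any stop/restart region."""
    visible = NO_FOLLOW.sub("", html)
    return HREF.findall(visible)
```

With this model, wrapping the "Previous Topic" / "Index" / "Next Topic" links removes them from the crawl queue while the rest of the page is still crawled normally.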
|
|
|
|
Joined: Nov 2006
Posts: 173
member
So I am wondering if the URL naming method could be changed in a future version to be more PageRank-friendly and eliminate the spidering duplicates. I am also wondering how I can get these three things done:
================================
- If there is an option within UBB to turn off the feature of remembering which page of the forum index you came from (so that it would drop the "page/x" parameter in all the "showflat" thread URLs), then this would cure it.
- From what I can tell, this feature seems only evident in the "Previous Topic" "Index" "Next Topic" links, down the bottom of a thread (before the "Quick Reply" box). If you can edit your UBB template, and enclose these links with <!--ZOOMSTOPFOLLOW--> ... <!--ZOOMRESTARTFOLLOW-->, then Zoom will actually NOT crawl these links.
- Find other places within UBB's complex interface that contain links like these, and do the same.
++++++++++++++++++++++++++++++
|
|
|
|
Joined: Jun 2006
Posts: 9,242 Likes: 1
Former Developer
Unfortunately, removing the page portion from the URL might reduce duplicates in search engines, but it would make the forums very quirky for everyone trying to navigate them. When browsing, making a post, adding to favorites, etc., you'd always get sent back to the 1st page of the forum instead of the page you were on, and then you'd have to browse back to that page to resume reading.
The only option that might be possible is moving some of these things into the session to store your current page. So it's possible we can address this in a future version, but there are always going to be duplicate links at some point if you judge by the URL alone. Take the listing of topics, for example. You might have a link to a topic that is on page 1 of the forum, but a few days later, the topic itself may not have changed but it's now on page 20. The page number is actually used here to determine which topics to display, so it can't be tracked internally.
|
|
|
|
Joined: Nov 2006
Posts: 173
member
it's proving practically impossible to spider. and if i have trouble, google will too.
i get like 150,000+ pages indexed. thing runs forever. other large boards get 10,000 or so.
mass dupes. google will punt. old version was remarkably well indexed in google.
bummer bigtime for all of us
|
|
|
|
Joined: Nov 2006
Posts: 173
member
here's my exclude list that results in only the showflat lines being indexed:
page/1/fpart/1
ubb=newpost
ubb=markallread
ubb=mycookies
/ubb/newuser
/ubb/cfrm
/ubb/calendar
/ubb/search
/ubb/faq
/fpart/all/
/ubb/showprofile
/ubb/dosearch
/showflat/sticky/
/ubb/printthread
ubb=addfavuser
ubb=newreply
ubb=showprofile
ubb=sendprivate
ubbthreads.php/ubb/showthreaded
&Searchpage=
still a bazillion dupes.
why have a "search engine friendly URLs" option when a spider can't even function accurately in there?
|
|
|
|
Joined: Jun 2006
Posts: 16,299 Likes: 116
SE Friendly URLs simply allow bots to access sections they would normally not be able to reach because of special characters such as ?, =, and &.
|
|
|
|
Joined: Jun 2006
Posts: 9,242 Likes: 1
Former Developer
We should be able to improve on this for a future version, since spiders aren't logged in and they don't need to really keep track of things the same way a logged in user does. It won't be in 7.1 because that's pretty much frozen, but something I can look into for 7.2.
|
|
|
|
Joined: Nov 2006
Posts: 173
member
THANKS! good idea.
for now, can you tell me if (#1) below is an option, and if not, how to do (#2) below?
#1 - If there is an option within UBB to turn off the feature of remembering which page of the forum index you came from (so that it would drop the "page/x" parameter in all the "showflat" thread URLs), then this would cure it.
#2 - From what I can tell, this feature seems only evident in the "Previous Topic" "Index" "Next Topic" links, down the bottom of a thread (before the "Quick Reply" box). If you can edit your UBB template, and enclose these links with <!--ZOOMSTOPFOLLOW--> ... <!--ZOOMRESTARTFOLLOW-->, then Zoom will actually NOT crawl these links.
|
|
|
|
Joined: Jun 2006
Posts: 9,242 Likes: 1
Former Developer
#1 currently isn't an option.
#2 would just require an edit to the showflat.tpl file. You can see where these links are made (previouslink, currentlink, nextlink). You'd just need to surround them with the zoom comments that were given.
|
|
|
|
Joined: Nov 2006
Posts: 173
member
where is that "showflat.tpl" file located? spent the last ten minutes digging thru directories, searching drive, searching for .tpl files containing the word "previouslink"... no luck finding it
|
|
|
|
Joined: Jun 2006
Posts: 9,242 Likes: 1
Former Developer
If you edit it directly on the server then it's in the templates/default directory.
|
|
|
|
Joined: Nov 2006
Posts: 173
member
ok found the file. it's not on my hard drive after the install
i found this code
<div>
{if $prevlinkstart}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/previous.gif" alt="{$lang.PREV_THREAD}" border="0" {$images.previous} />
{$prevlinkstart} {$lang.PREV_THREAD} {$prevlinkstop}
{/if}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/all.gif" alt="{$alttext}" border="0" {$images.all} />
{$currentlinkstart} {$linktext} {$currentlinkstop}
{if $nextlinkstart}
{$nextlinkstart} {$lang.NEXT_THREAD} {$nextlinkstop}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/next.gif" alt="{$lang.NEXT_THREAD}" border="0" {$images.next} />
{/if}
</div>
should i make it like this?
<div>
<!--ZOOMSTOPFOLLOW-->
{if $prevlinkstart}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/previous.gif" alt="{$lang.PREV_THREAD}" border="0" {$images.previous} />
{$prevlinkstart} {$lang.PREV_THREAD} {$prevlinkstop}
{/if}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/all.gif" alt="{$alttext}" border="0" {$images.all} />
{$currentlinkstart} {$linktext} {$currentlinkstop}
{if $nextlinkstart}
{$nextlinkstart} {$lang.NEXT_THREAD} {$nextlinkstop}
<img style="vertical-align: middle" src="{$config.BASE_URL}/images/{$style_array.general}/next.gif" alt="{$lang.NEXT_THREAD}" border="0" {$images.next} />
{/if}
{* restart tag sits after the closing if, so it is emitted even when there is no next link *}
<!--ZOOMRESTARTFOLLOW-->
</div>
and do i upload showflat.tpl in ascii format?
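One thing worth checking with an edit like this is that every rendered page still contains a restart tag for each stop tag, whichever {if} branches fire. A rough Python check, assuming you have the rendered HTML of a thread page as a string. `markers_balanced` is a hypothetical helper, not part of UBB or Zoom.

```python
import re

MARKER = re.compile(r"<!--ZOOM(STOP|RESTART)FOLLOW-->")


def markers_balanced(html: str) -> bool:
    """True if stop and restart tags strictly alternate and end closed."""
    following = True  # links are followed until a stop tag is seen
    for kind in MARKER.findall(html):
        if kind == "STOP":
            if not following:
                return False  # second stop without a restart in between
            following = False
        else:
            if following:
                return False  # restart without a preceding stop
            following = True
    return following
```

A thread page that has no next-topic link is a useful test case, since any tag placed inside {if $nextlinkstart} would be missing from that page.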
|
|
|
|
Joined: Jun 2006
Posts: 16,299 Likes: 116
templates/default/showflat.tpl
|
|
|
|
Joined: Nov 2006
Posts: 173
member
found it
does my language above look correct and do i upload the file in ascii format?
|
|
|
|
Joined: Nov 2006
Posts: 173
member
thru the control panel i got
Control panel message /web/disk1/virtual/ambergriscaye.com/htdocs/forum/templates/default/showflat.tpl is not writeable. Please fix the permissions on this file and try again.
|
|
|
|
Joined: Jun 2006
Posts: 16,299 Likes: 116
the template files need to be writable (chmod 666) in order to use the editor in the cp.
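For reference, the same permission change can be made from a script. A minimal Python sketch, assuming a POSIX server and that it runs as the owner of the file. `make_writable` is just an illustrative name.

```python
import os
import stat

# rw-rw-rw-, the same bits as "chmod 666"
MODE_666 = (
    stat.S_IRUSR | stat.S_IWUSR
    | stat.S_IRGRP | stat.S_IWGRP
    | stat.S_IROTH | stat.S_IWOTH
)


def make_writable(path: str) -> None:
    """Set read/write for owner, group, and others on one file."""
    os.chmod(path, MODE_666)
```

It would be called with the path from the control panel message, e.g. `make_writable("/web/disk1/virtual/ambergriscaye.com/htdocs/forum/templates/default/showflat.tpl")`.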
|
|
|
|
Joined: Jun 2006
Posts: 9,242 Likes: 1
Former Developer
Hmm, I had changed those permissions when doing the install. Will need to look at that again.
Changes look good. Go ahead and upload in ascii.
|
|
|
|
Joined: Nov 2006
Posts: 173
member
thanks. looks like the board is working ok, will start another index and see if that change contains the dupes a bit....
|
|
|
|
Joined: Nov 2006
Posts: 173
member
the guy who writes the spider script (a damn good one i might add) has helped me a LOT and i have worked for a week non stop indexing with different permutations and i can't get less than 125,000 files to index on the board with mass dupes. very disappointing. have pretty much given up.
worked great before with my 6.x board, and i am currently successfully indexing several 6.x ultimate boards.
threads 7 seems unindexable without mass dupes. bad for all of us who want our boards indexed by google.
|
|
|