O'Reilly Web Spam

Another in the line of folks who should know better, O’Reilly has web spam on some of their sites. This doesn’t appear to be quite as bad as the WordPress web spam because they aren’t using CSS to make the content invisible to visitors. The placement of these “ads” is rather out of the way though, at the bottom of the left hand column.

With a little bit of looking around I was able to find these ads on oreillynet.com (on the article pages also), windowsdevcenter.com (on the article pages also), macdevcenter.com (on the article pages also), ondotnet.com (on the article pages also), onjava.com (on the article pages also), onlamp.com (on the article pages also), perl.com (on the article pages also) and xml.com (on the article pages also).

The numbers involved don’t appear to be quite as bad as the WordPress incident either, with Google finding fewer than 600 pages with these ads for oreillynet.com. If the other seven sites have a similar number of pages with ads then all told it would be fewer than 5000 pages. A lot of these ads point to freehotelsearch.com, which seems to offer a legitimate service (I only looked up reservations, I didn’t actually place one).

I think one could argue that these ads aren’t completely wrong. The argument would come down to intent: are these links there in hopes that people will actually click on them, or are they more of an effort to trick search engines into increasing the importance of the sites they point to? They are links, so it is possible that someone might click on them, but they aren’t nearly as prominent as the rest of the ads on these sites. I’m leaning more towards the idea that these ads are there to boost search engine ranking rather than to work as traditional ads. Tim is going to have a tough time making this look legit.

UPDATE 8:30am 24 Aug 2005: Tim O’Reilly has posted an initial response to the complaints about the ads. The short version: while not completely wrong (and not nearly as bad as the WordPress spam) these types of ads aren’t good for the long term.

Ranked Search, Merging Google And Yahoo

Last week I discovered Twingine via Russell. Twingine puts the results of your query from Google and Yahoo! into frames so that you can compare the results side by side. This seemed like an interesting idea, but the interface isn’t particularly useful; it’s too much work to visually compare the two. Then I remembered Matt’s announcement of using Yahoo’s search APIs at WordPress.org, which started me thinking about the availability of APIs from Yahoo! and from Google.

It seemed like there should be some way of combining these two resources, taking the search results from both Google and Yahoo! and mixing them in some semi-meaningful way. So last night I started putting together the Ranked Search website. You enter a query and the site requests the top 10 results from both Yahoo! and Google via their search APIs, giving each link a rank. The first link gets a rank of ten, the second a rank of nine, and so on down to a rank of one for the tenth link in each result set. The idea being that the links with the highest rank are more likely to be what you are looking for. Then I look for links that appear in both sets, merging them into one, with a new rank that is the sum of their original ranks. All of the unique links from each set are then merged in and the new set is sorted by rank. The highest potential score is 20 (where Google and Yahoo! both return the same link in the #1 position) and the lowest possible score is 1. It is really basic stuff.
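The ranking described above can be sketched in a few lines of Python. This is only an illustration, not the code behind Ranked Search: the function names are made up for the example, the two input lists stand in for whatever the Google and Yahoo! APIs return, and the alphabetical tie-break is an assumption since the post doesn’t say how ties are ordered.

```python
def rank_results(urls, top=10):
    """Give the first URL a rank of `top`, the next `top - 1`, and so on."""
    ranks = {}
    for position, url in enumerate(urls[:top]):
        # If an engine repeats a URL, keep its first (highest) rank.
        ranks.setdefault(url, top - position)
    return ranks


def merge_ranked(google_urls, yahoo_urls, top=10):
    """Merge two result lists; a URL found in both gets the sum of its ranks."""
    combined = {}
    for result_set in (google_urls, yahoo_urls):
        for url, rank in rank_results(result_set, top).items():
            combined[url] = combined.get(url, 0) + rank
    # Highest combined rank first; ties are broken alphabetically
    # (an assumption, chosen only to make the order predictable).
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    google = ["http://a.example/", "http://b.example/", "http://c.example/"]
    yahoo = ["http://b.example/", "http://a.example/", "http://d.example/"]
    for url, score in merge_ranked(google, yahoo):
        print(score, url)
```

A link that both engines put in the #1 spot scores 20, and a link that only one engine returns in tenth place scores 1, matching the range given above.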

Making requests out over the Internet to both Google and Yahoo! isn’t the fastest thing in the west. So I put in some basic caching for every query. The result sets from every query are cached in a PostgreSQL database and are used when an exact query match is found and the results are less than 12 hours old. If the results are more than 12 hours old the query is sent off to Google and Yahoo! and the new results are cached again.
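Here is a rough sketch of that 12 hour caching rule. To keep it self-contained it uses Python’s built-in sqlite3 module instead of PostgreSQL, and the table name, column names and the `fetch` callback are all hypothetical; the real site’s schema isn’t described here.

```python
import json
import sqlite3
import time

MAX_AGE_SECONDS = 12 * 60 * 60  # cached results are good for 12 hours


def open_cache(path=":memory:"):
    """Open (or create) the cache table. SQLite stands in for PostgreSQL here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_cache ("
        " query TEXT PRIMARY KEY,"
        " results TEXT NOT NULL,"
        " fetched_at REAL NOT NULL)"
    )
    return conn


def cached_search(conn, query, fetch, now=None):
    """Return cached results for an exact query match less than 12 hours old,
    otherwise call fetch(query) and store the fresh results."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT results, fetched_at FROM search_cache WHERE query = ?",
        (query,),
    ).fetchone()
    if row is not None and now - row[1] < MAX_AGE_SECONDS:
        return json.loads(row[0])
    results = fetch(query)  # the slow trip out to the search engines
    conn.execute(
        "INSERT OR REPLACE INTO search_cache (query, results, fetched_at)"
        " VALUES (?, ?, ?)",
        (query, json.dumps(results), now),
    )
    conn.commit()
    return results
```

The `now` argument exists only so the expiry rule can be exercised without waiting 12 hours; in normal use you would leave it out.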

Everything is very plain and basic right now; consider it an experiment. If you have any additional thoughts, leave a comment or use my contact form to drop me a note.

Most of this information is also available on the about page for Ranked Search.

My Web 2.0, By Yahoo!

It really is amazing how quickly concepts can spread. Tagging data (URLs, images, etc) has impressed a lot of people as a “better way” to organize content. Normally when having these types of discussions you point to del.icio.us, Flickr and more recently Technorati. Today a new, much larger, player is added to that list, Yahoo!. Their announcement about My Web 2.0 emphasizes that the reasoning behind this is to capitalize on the community knowledge behind allowing virtually anyone to tag websites. JeremY! has some thoughts on why this is important.

My first impression is that this looks pretty darn cool! Go start at http://myweb2.search.yahoo.com/ (assuming you have a Yahoo! account already) and simply do some searches on the web and start tagging them. You do this via the “Save” link for the site in the search results page. When you save a link it asks you for a description and tags, with the tags field providing suggestions (presumably from tags that other people are using). Permissions can be set to restrict this to just yourself, your community or everyone. Another feature I don’t remember seeing before is the “View as XML” link. It turns out that Yahoo! is identifying RSS/ATOM feeds and then providing a link. So if a site has the “View as XML” link on it in the search results, you know it has a feed. Nice to see search engines trying to divine what features sites have and do something with that knowledge.

You can import an existing list of links from IE, Yahoo! Bookmarks or an RSS feed. Although I haven’t tried it yet, I suspect you’ll be able to import your del.icio.us bookmarks into My Web 2.0 via the RSS import feature. There are some other features that you’d expect from this type of service, like the top 100 most popular sites and browsing by tags. Of course you can also search on just your own set of links. One thing I couldn’t find was a way to see how many other people had saved the same URL. This is something that you can do in del.icio.us and it’s rather disappointing to see that left out here. So far that is the most obvious feature that is missing. I should note here that so far the site seems to respond very quickly, which is something that del.icio.us has had problems with, either being down or just extremely slow.

As JeremY! noted, they’ve exposed My Web 2.0 via the Yahoo Search API, which was a very smart move. In the future I’d like to see more of this approach, where virtually every feature is exposed (to some degree) via an API that we can get our hands on the same day a new feature is released. For now this trend is still pretty close to the bleeding edge, but as time goes on and things mature a bit more, companies that don’t provide APIs will be missing the boat in a major, major way.

It is my sincere hope that Yahoo!’s My Web 2.0 doesn’t get completely overrun by those trying to game and spam search engines. Although Google is still the number one target for this type of “attack”, Yahoo! is a big enough player to attract the attention of those who would do evil in this regard. Since you need a Yahoo! account in order to use this feature the obvious spot to defend yourself is at the account creation process. Shore up your defenses Yahoo!, I’m sure the bad guys will be coming with a renewed effort.

The new My Web 2.0 looks impressive.

Wait a minute. I can’t find any way to syndicate my bookmarks (saved sites) via an RSS/ATOM feed? What is up with that? I can’t find any mention of it in the FAQ. Come on guys, you covered so many other features on launch, how could you possibly leave that one out? I was about to mention how this was going to be a del.icio.us killer, but I doubt anyone will give up on del.icio.us until they can get a feed of their links. Fix that and I’ll likely give up on del.icio.us and move to My Web 2.0.

In the meantime I’ll play with this a bit more and look at the API features to see what is possible.

UPDATE 2:30pm 29 Jun 2005: As Toby pointed out in the comments below, you can get a feed of your links via the API. Links for this should be plastered all over the place in My Web 2.0; tagging and feeds go hand in hand in many ways. Interestingly, you don’t need a valid application id (appid) in the URL for it to work. So a feed for a My Web 2.0 account looks like http://api.search.yahoo.com/MyWebService/rss/urlSearch.xml?appid=somestrangeid&yahooid=somestrangeid. You can get a feed for anyone that you have a Yahoo username for. I’m going to guess that permissions tie into this somehow; a link that I mark as private shouldn’t show up in a public feed. Thanks for the pointer Toby.
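If you want to build these feed URLs for a handful of usernames, something like the following would do it. It only assembles the URL string from the example above and makes no request; the function name and the default appid value are placeholders of my own, and the only parameters assumed are the two shown in that URL.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL above.
FEED_ENDPOINT = "http://api.search.yahoo.com/MyWebService/rss/urlSearch.xml"


def myweb_feed_url(yahoo_id, app_id="somestrangeid"):
    """Build the My Web 2.0 feed URL for a Yahoo! username.

    app_id is a placeholder; per the note above, the service did not
    appear to require a valid one.
    """
    query = urlencode({"appid": app_id, "yahooid": yahoo_id})
    return FEED_ENDPOINT + "?" + query


if __name__ == "__main__":
    print(myweb_feed_url("somestrangeid"))
```

Using urlencode means a username with unusual characters still ends up properly escaped in the query string.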

Google Sitemaps

Last summer I wrote some thoughts about something like a mod_ping for Apache so that search engines could be easily notified when pages on a site change. I was trying to abstract the idea of pings and trackbacks in use by blogs into a general feature that could be used for any site, even one made out of static files.

The announcement about Google Sitemaps reminded me very much of my mod_ping idea. It isn’t the same, but the goal seems to be the same: providing a way for search engines to discover URLs and when they change. SearchEngineWatch has an article about it which provides a brief overview of what Google is up to. More information can be found in the help for their sitemap generator tool (and the SourceForge site for the tool), the Google Sitemaps FAQ and the Sitemap protocol page.

They specifically mention a hope that servers (Apache & IIS) will support this in the future. In the meantime you can manually ping Google for sitemap updates using something like curl, wget or even your web browser I suppose. I’d expect this feature to be built into certain web tools, like blogs and content management systems. I’m sure someone will get around to writing a tool for WordPress that generates a sitemap file, adds to it each time an entry is published and then pings Google to let them know it has been updated.
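As a rough idea of what such a tool would have to do, here is a small Python sketch that writes a sitemap and builds a ping URL. Treat the details as assumptions: the namespace is the one from the current sitemaps.org protocol rather than whatever Google’s original schema used, and the ping endpoint and its `sitemap` parameter name are hypothetical stand-ins, so check the Sitemap protocol page and FAQ for the real values.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Namespace from the sitemaps.org protocol (an assumption for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for (url, last_modified) pairs.

    last_modified is a date string such as "2005-06-03", or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_ping_url(ping_endpoint, sitemap_url):
    """Build the URL to request after the sitemap changes.

    ping_endpoint and the "sitemap" parameter name are hypothetical.
    """
    return ping_endpoint + "?" + urlencode({"sitemap": sitemap_url})


if __name__ == "__main__":
    print(build_sitemap([
        ("http://example.com/", "2005-06-03"),
        ("http://example.com/about/", None),
    ]))
    # search.example.com is a placeholder, not a real ping service.
    print(build_ping_url("http://search.example.com/ping",
                         "http://example.com/sitemap.xml"))
```

A blog tool would call build_sitemap each time an entry is published, write the result to sitemap.xml, and then fetch the ping URL with curl, wget or an HTTP library.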

Will other web search companies adopt this? Keep an eye on Yahoo!, MSN, AOL, A9 and IceRocket to see if this goes anywhere. I don’t think that this will be limited to the “traditional” search folks; I’d think that someone at Technorati, PubSub and maybe Bloglines might come up with some clever uses for this. If we are really lucky people will learn from history and come up with something like feedmesh for sitemap pings.

For now I’ve whipped up a very basic sitemap file at http://joseph.randomnetworks.com/sitemap.xml and pinged Google to let them know about it.