Thursday, April 19, 2007

Finding files over the net

Lately I've been really upset with the fact that finding files over the internet is much harder than it should be. There are billions of files out there just waiting to be downloaded, and no one actually knows where they are...

Now you're probably gonna say "That's bull, you can search for the file on Google and find it easily", but that's where you're wrong. You can only find files which are meant to be found. What does that mean? Take MP3 files, for example: they often don't want to be found, so certain search engines (Google, for example) could index them easily but simply won't. You may also think that search engines find links to files within pages, which is wrong too: they only index the text in the page, not the anchors themselves. That means you can only find files over the net by finding pages that lead you to files.

But what if you're looking for a file by its file name instead of a description? What about all the files that are posted in forums all over the world? You'd have to enter a description or a file name in a search engine, and then hunt through each result page for a link to the file you're looking for (often hiding behind a very big "DOWNLOAD" link). This process can be annoying, frustrating and time-consuming.

So I got tired of looking for files over the internet myself. For a start, I wrote a Python-based web crawler that searches for a file using a search string, takes each webpage in the search results, and reads all the links from that page. That gave me a quick and much more comfortable way of looking for files over the internet, without all the fuss of digging all over the net just to find a small file (whose exact name I usually know, which just makes it more frustrating).
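The original crawler isn't shown here, but the link-reading step it describes can be sketched along these lines — a small parser that pulls every anchor out of a page and keeps only the ones that look like direct file links. The function names and the extension list are my own illustration, not the actual script:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_file_links(html, extensions=(".mp3", ".dll", ".zip")):
    """Return only the links whose target ends with a wanted file extension."""
    parser = LinkExtractor()
    parser.feed(html)
    return [link for link in parser.links
            if link.lower().endswith(extensions)]


page = '<a href="/files/song.mp3">DOWNLOAD</a> <a href="/about.html">About</a>'
print(extract_file_links(page))  # ['/files/song.mp3']
```

Running something like this over each page in the search results is all it takes to turn "pages that lead you to files" into the files themselves.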

The script was very useful for downloading specific files which aren't often downloaded: drivers, DLLs, MP3s, old games, firmware, etc. So I decided to make a webpage out of it: http://www.findthatfile.com. I'm caching the results for webpages that I crawl to speed up the search a bit, so certain download pages which generate download links on the fly (such as download.com) won't work. But I don't really care about those sites, since my crawler helps people find files which are buried in the depths of the internet, not shareware applications which are already indexed on commercial download sites.
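The caching described above could look something like this: one file per URL, keyed by a hash of the URL, checked before any download happens. The cache location, function names and the `fetch` callable are assumptions for the sketch, not the site's real code — and it also shows exactly why on-the-fly download pages break: a stale cached copy is returned instead of a freshly generated link.

```python
import hashlib
import os
import tempfile

# Hypothetical cache location; the real site would use its own storage.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "crawler_cache")


def cache_path(url):
    """Map a URL to a cache file named by the hash of that URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".html")


def fetch_cached(url, fetch):
    """Return the page for `url`, reading from the cache when possible.

    `fetch` is any callable that downloads a URL and returns its HTML;
    it is only invoked on a cache miss.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = cache_path(url)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

A page crawled once is never downloaded again, which is exactly the speed-up the post mentions, at the cost of missing links that are generated per-request.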

And of course, because of copyright issues, each file links back to the page where I found it.
