Inércia Sensorial

31 January 2009

Web crawling services

Filed under: Poetria — Tags: — inerte @ 03:44

The number of web apps that need to crawl the web in some form or another is huge, and it’s growing every day. Either I am the stupidest person on Earth and can’t Google properly, or no one is selling web crawling services.

Folks, someone needs to do this. A metered service (like S3) where customers can query your app for crawling results.

I am going to give you two reasons why I should do this myself.

Reason number one:
It’s cheaper, and crawling is not someone else’s core competency. How does FriendFeed index all these webpages? Who cares? They shouldn’t be doing this. Writing a good web crawler is hard. They need the *data* when it is *new*.

Reason number two:
I have so many ideas, but I want to focus on prototyping them instead of writing the crawler. It would really help devs around the world if they could just use some API to crawl webpages.

Did I say API? Yes, that’s the point. Someone needs to write a crawler with an API:

POST /api/i=http://www.example.com/file.html
user=name
pass=word
when=00 00,12 * * 1-5
expires=2592000

Yeah, that’s the crontab syntax. “when” would also accept “once” and “onchange”.
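
To make that concrete, here’s a minimal sketch of how the service might turn “when” into a next fetch time. It’s just an illustration in Python; I’m assuming the third-party croniter library for the crontab case, and the “once”/“onchange” handling is only my guess at the semantics.

from datetime import datetime
from croniter import croniter  # third-party crontab parser

def next_fetch_time(when, last_fetch=None):
    """Figure out when a crawl request should run next, based on its "when" field."""
    now = datetime.now()
    if when == "once":
        return now                     # fetch a single time, right away
    if when == "onchange":
        return now                     # re-check on the crawler's own revisit schedule
    # otherwise treat it as crontab syntax, e.g. "00 00,12 * * 1-5"
    return croniter(when, last_fetch or now).get_next(datetime)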

“expires” is the number of seconds from now after which this crawl is no longer needed.

This request would return an “id”, to be used later, when the customer is ready to download the webpage from us.
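
From the customer’s side, the whole insert call could look something like this sketch. The host name is made up, and I’m assuming the service takes form-encoded fields and answers with JSON containing the id:

import requests

resp = requests.post(
    "https://crawlservice.example/api/i=http://www.example.com/file.html",  # hypothetical host
    data={
        "user": "name",
        "pass": "word",
        "when": "00 00,12 * * 1-5",  # crontab syntax, or "once" / "onchange"
        "expires": 2592000,          # not needed anymore after 30 days
    },
)
crawl_id = resp.json()["id"]         # assuming a JSON body like {"id": 111}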

Of course there’s also:

POST /api/i=regex
format=rss
content_regex=some_string(.*)sucks?

So you know when someone says your product sucks. And:

POST /api/i=regex
name=(jpg,gif)
width=LT200
height=LT200
type=image

LT is Less Than; there would also be GT and EQ.
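
Server-side, matching a crawled image against a rule like that could boil down to something like this sketch. The LT/GT/EQ encoding and the rule/image shapes are my own assumptions:

import re

def match_dimension(rule, value):
    """Check a number against a constraint like "LT200", "GT640" or "EQ16"."""
    op, threshold = rule[:2], int(rule[2:])
    if op == "LT":
        return value < threshold
    if op == "GT":
        return value > threshold
    if op == "EQ":
        return value == threshold
    raise ValueError("unknown operator: " + op)

def match_image_rule(rule, image):
    """rule and image are plain dicts, e.g. image = {"name": "logo.gif", "width": 120, "height": 80}."""
    return (re.search(r"\.(jpg|gif)$", image["name"]) is not None
            and match_dimension(rule["width"], image["width"])
            and match_dimension(rule["height"], image["height"]))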

So, /api/i= inserts a crawling request. You can retrieve webpages with /api/g=

POST /api/g=http://www.example.com/file.html
only=#some_node_id .some_node_class

“only” takes a selector: the example above is CSS-style, but XPath would work too.
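
A sketch of what the service could do with “only”, using lxml and the XPath equivalent of the selector above (the library choice is my assumption):

import requests
from lxml import html

page = requests.get("http://www.example.com/file.html")
tree = html.fromstring(page.content)

# XPath equivalent of "#some_node_id .some_node_class"
nodes = tree.xpath(
    '//*[@id="some_node_id"]'
    '//*[contains(concat(" ", normalize-space(@class), " "), " some_node_class ")]'
)
fragments = [html.tostring(node, encoding="unicode") for node in nodes]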

Since the customer would pay for data transferred, it would suck to have the customer query /api/g= every time he needs something. And that’s not much different from writing your own crawler, is it? Actually it is, because of robots.txt, HTML parsing, server load, and much more. But a lot of people think that writing crawlers is easy and scalable.

Anyway! The magic happens when you crawl a webpage and it matches some rule set by one of your customers. Now you just need to tell them which of the ids previously returned by /api/i= are ready. They connect to your server and download the files.

And if you have ids:

POST /api/g=111,112,113
compress=True

Which would return the results for insert requests 111, 112, and 113 in a zip file.
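
On the customer’s side, picking up a batch could look roughly like this; again the host name is invented, and I’m assuming the zip comes back as the raw response body:

import io
import zipfile
import requests

resp = requests.post(
    "https://crawlservice.example/api/g=111,112,113",  # hypothetical host
    data={"compress": "True"},
)
with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
    for name in archive.namelist():
        body = archive.read(name)  # the crawled page for one insert id
        print(name, len(body), "bytes")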

One more good thing: economies of scale. Everyone needs the newest RSS feeds. You can have dozens of customers requesting the same feed, but you only need to grab it once.
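
Internally, that dedup is just keying fetches by URL instead of by customer. A toy sketch of the idea:

from collections import defaultdict

subscriptions = defaultdict(set)  # url -> insert-request ids waiting on it

def subscribe(url, insert_id):
    subscriptions[url].add(insert_id)

def on_fetched(url):
    # one download of the feed, and every id subscribed to it becomes "ready"
    return subscriptions[url]

subscribe("http://example.com/feed.rss", 111)
subscribe("http://example.com/feed.rss", 207)
print(on_fetched("http://example.com/feed.rss"))  # {111, 207} from a single fetch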

This service would have nothing to do with search, Google, deep web, semantic web, whatever. Just make sure people will know when a webpage is updated.

23 January 2009

Wikipedia with just links

Filed under: Perils of Software Development — Tags: — inerte @ 03:02

So I just read that Britannica is planning a new version of its website, mostly to “compete” with Wikipedia. Britannica’s CEO also talks about the relationship between Google’s results and Wikipedia’s pages.

Anyway, one criticism is valid: sometimes Wikipedia articles are not the best source of information about a topic. And this is being written by someone who searches Google daily for terms followed by the word “wikipedia”, because Wikipedia usually has results that are good enough.

So here’s my idea: dump Wikipedia’s database of article titles and let people submit links under those titles. Then let users vote, Reddit/Digg style. Find a way to deal with spammers’ bots and you’re ready to go.

For example, the page for “Design_pattern_(computer_science)” (which I have open in a tab right now) would just be a bunch of links to other sites, which users can vote on so you can sort the links by relevancy.
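
A minimal sketch of what I mean, as an in-memory toy (spam handling and persistence left out, obviously):

from collections import defaultdict

links = defaultdict(list)  # article title -> list of [url, votes]

def submit(title, url):
    links[title].append([url, 0])

def vote(title, url, delta):  # delta is +1 or -1, Reddit/Digg style
    for entry in links[title]:
        if entry[0] == url:
            entry[1] += delta

def page(title):
    """Links for one article title, most upvoted first."""
    return sorted(links[title], key=lambda e: e[1], reverse=True)

submit("Design_pattern_(computer_science)", "http://example.com/patterns-intro")
vote("Design_pattern_(computer_science)", "http://example.com/patterns-intro", 1)
print(page("Design_pattern_(computer_science)"))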

As I’ve said on my previous post, I’ve been feeling lazy lately, otherwise I would go and implement this.

You might ask, why not just add the links that users would upvote to Wikipedia? Well, few articles on Wikipedia accept “full” links in their references section. Most links are there as citations, used to justify small excerpts of text or individual facts, not whole webpages. And the “external links” section of Wikipedia articles isn’t sorted, and that’s the whole point of my idea :p

Now go and build it and remember to buy me a beer if you make a buck.

22 January 2009

How the fuck is trim in Python?

Filed under: Perils of Software Development — Tags: — inerte @ 20:54

HTFITIP is a website where you can see what a function from one language is called in another (or find an implementation of it).

The idea came to me after the eleventh time I asked myself, how the fuck is trim in Python? It’s called strip by the way.

The website is very simple. On the front page, there’s this form:

How the fuck is _____’s _____ in _____?

In the first blank goes the “source” programming language name, in the second the function/method name, and in the third the “target” programming language. The source language is optional.

A good source for the… source programming language is PHP’s (massive) function list. First of all, PHP seems to have a function for everything, and there are no namespaces, so it’s easier for us to build an index of it. Secondly, it’s a widely known language. Thirdly, it’s the first language of a lot of people, so it’s quite likely that many of them will want to know the equivalent of PHP’s function X in a new language they’re learning; if we have to start somewhere, it might as well be this way.

When the user selects the programming language from the source dropdown, an Ajax call automatically gets all the function names, so when the user types a name in the second field, it auto-completes. Then he selects the “target” language in the third field.

How the fuck is PHP’s trim in Python? (click submit)

If the relationship has already been established in the database, we present the answer to the user:

PHP’s trim in Python is called strip. (a link to Python’s manual page on strip would be nice)

If not, we let the user create this relationship. If there’s no equivalent, say, Python doesn’t have an array_intersect_uassoc function in any module, we let the user type an implementation, wiki-style.

Web 2.0 baby, the users provide all the content, we make all the money.

And the website backend is simple: a list of programming languages and their functions, the relationships between functions/methods, and some kind of wiki feature to let people submit implementations.
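
A back-of-the-envelope version of that backend, sketched in Python; the only seed data is the trim/strip example from this post:

# (language, function) -> {language: equivalent function}
equivalents = {
    ("PHP", "trim"): {"Python": "strip"},
}

# wiki-style implementations, for functions with no direct equivalent
implementations = {}

def how_the_fuck(source_lang, func, target_lang):
    answer = equivalents.get((source_lang, func), {}).get(target_lang)
    if answer:
        return "%s's %s in %s is called %s." % (source_lang, func, target_lang, answer)
    return implementations.get((func, target_lang), "No answer yet; submit one!")

print(how_the_fuck("PHP", "trim", "Python"))  # PHP's trim in Python is called strip.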

I’ve been feeling lazy lately, so go ahead and implement this if you want. Should be a useful tool.
