Inércia Sensorial

2009-08-23

Apache + mod_wsgi + Django + lighttpd

Filed under: Perils of Software Development — inerte @ 18:45

I’ve previously written about how to configure Apache, mod_python and Django, and how to put lighttpd behind Apache.

Recently I decided to host my most visited website on a different VPS provider[1], and started a quest to update my knowledge about Django deployment. I did things differently this time, using mod_wsgi (the recommended way of deploying Apache and Django), and configuring Apache behind lighttpd for dynamic content (in other words, lighttpd will serve static media).

I did everything below in the last couple of days, and did not write things down as I was doing them, because it involved a lot of experimentation (trial and error) for me. As such, I am writing this article from memory and by checking my config files. If you encounter any problems, please leave a comment and I will clarify any omissions.

Here’s how to do it:

Install the usual suspects

I chose Ubuntu as my Linux distro, and installing anything on it is a breeze: sudo apt-get install package-name. This part is well covered around the web, so I won’t go into detail about how it’s done. Suffice it to say, some of the packages I installed were apache2, libapache2-mod-wsgi, and lighttpd.

Configure Apache and mod_wsgi to load your project

Since lighttpd will act as the primary server for my domain, I decided to move Apache to port 81:

sudo vi /etc/apache2/ports.conf

Overwrite existing ip:port lines with these:

NameVirtualHost 127.0.0.1:81
Listen 81

On newer Ubuntu installations, the place to put your own Python modules has changed to /usr/local/lib/python2.6/dist-packages/. Therefore, I uploaded Django, my project and the other necessary modules (the ones not installed by apt-get) to this directory, leaving me with the following structure:

/usr/local/lib/python2.6/dist-packages/django/
/usr/local/lib/python2.6/dist-packages/project_name/
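A quick sanity check that the interpreter picks these up (just a sketch; the append mirrors the django.wsgi file below and is harmless if the directory is already on sys.path):

import sys
sys.path.append('/usr/local/lib/python2.6/dist-packages')

import django
import project_name  # your project's package

print django.VERSION  # e.g. (1, 1, 0, 'final', 0)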

The mod_wsgi documentation has an excellent article on Django integration, but it’s fairly lengthy. You should read it anyway, since there are lots of options that you might want to use. Here’s a cheatsheet:

Create a document root for your domain name:

sudo mkdir /var/www/example.com

Create the file which will be loaded by mod_wsgi with your project configuration:

sudo mkdir /usr/local/lib/python2.6/dist-packages/project_name/apache/
sudo vi /usr/local/lib/python2.6/dist-packages/project_name/apache/django.wsgi

With these contents:

import sys
import os

sys.path.append('/usr/local/lib/python2.6/dist-packages')
os.environ['DJANGO_SETTINGS_MODULE'] = 'project_name.settings'

import django.core.handlers.wsgi

application = django.core.handlers.wsgi.WSGIHandler()
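If you want to sanity-check this file before involving Apache, you can serve the same application object with Python’s built-in wsgiref server (a throwaway test, not part of the deployment):

# Throwaway test: serve the WSGI app on http://127.0.0.1:8000/
from wsgiref.simple_server import make_server
make_server('127.0.0.1', 8000, application).serve_forever()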

Create a domain configuration file for Apache:

sudo vi /etc/apache2/sites-available/example.com

With these contents:

<VirtualHost 127.0.0.1:81>
ServerName example.com
ServerAdmin [email protected]

DocumentRoot /var/www/example.com

Alias /media/ /usr/local/lib/python2.6/dist-packages/django/contrib/admin/media/
<Directory /usr/local/lib/python2.6/dist-packages/django/contrib/admin/media>
Options -Indexes
Order deny,allow
Allow from all
</Directory>

Alias /project_media_dir/ /usr/local/lib/python2.6/dist-packages/project_name/templates/project_media_dir/
<Directory /usr/local/lib/python2.6/dist-packages/project_name/templates/project_media_dir>
Options -Indexes
Order deny,allow
Allow from all
</Directory>

WSGIScriptAlias / /usr/local/lib/python2.6/dist-packages/project_name/apache/django.wsgi
WSGIDaemonProcess example.com
WSGIProcessGroup example.com

<Directory /usr/local/lib/python2.6/dist-packages/project_name/apache>
Order deny,allow
Allow from all
</Directory>
</VirtualHost>

Activate it:

cd /etc/apache2/sites-enabled/
sudo ln -s ../sites-available/example.com
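After reloading Apache, it should already answer on port 81. A quick check from the server itself (a sketch; the Host header matters because this is a name-based virtual host):

# Does Apache serve the Django app on port 81?
import urllib2

req = urllib2.Request('http://127.0.0.1:81/')
req.add_header('Host', 'example.com')  # match the vhost's ServerName
print urllib2.urlopen(req).read()[:200]  # first bytes of the front page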

Configure lighttpd to proxy non-static media requests to Apache

I used MySQL Performance Blog’s “Lighttpd as reverse proxy” article as the basis for my own configuration. Therefore, we’ll also have a URL, http://example.com/server-status, which requires authentication and lets us see Apache’s server status.

Create a directory for error logs:

sudo mkdir /var/log/lighttpd/example.com

Create a domain configuration file for lighttpd:

sudo vi /etc/lighttpd/conf-available/20-example.com.conf

With these contents:

server.modules += ( "mod_auth",
                    "mod_status",
                    "mod_proxy",
)

$HTTP["host"] =~ "(^|\.)example\.com$" {
    $HTTP["url"] !~ "\.(js|css|gif|jpg|png|ico|txt|swf|html|htm)$" {
        proxy.server = ( "" => (
            ( "host" => "127.0.0.1", "port" => 81 )
        ))
    }

    server.document-root = "/var/www/example.com/"
    server.errorlog = "/var/log/lighttpd/example.com/error.log"
    dir-listing.activate = "disable"

    auth.backend = "htpasswd"
    auth.backend.htpasswd.userfile = "/var/www/.htpasswd"
    auth.require = ( "/server-status" => (
            "method"  => "basic",
            "realm"   => "status",
            "require" => "valid-user"
        )
    )
}

A few lines in the configuration above are worth mentioning:

$HTTP["host"] =~ "(^|\.)example\.com$" {

This wraps the directives inside so they apply only to requests for example.com (the regex also matches its subdomains).

$HTTP["url"] !~ "\.(js|css|gif|jpg|png|ico|txt|swf|html|htm)$" {
    proxy.server = ( "" => (
        ( "host" => "127.0.0.1", "port" => 81 )
    ))
}

These send any request for a document not ending in one of the specified extensions to IP 127.0.0.1, port 81, where Apache lives. Essentially, everything that is static content (or more accurately, whatever matches the |-separated regular expression) will be served by lighttpd itself.
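If you want to see which requests fall on which side, the same regular expression can be tested with a few lines of Python (the sample paths are made up):

# Which requests stay with lighttpd, and which get proxied to Apache?
import re

static = re.compile(r'\.(js|css|gif|jpg|png|ico|txt|swf|html|htm)$')
for url in ['/media/css/base.css', '/logo.png', '/blog/42/', '/server-status']:
    print url, '->', 'lighttpd' if static.search(url) else 'Apache (proxied)'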

cd /etc/lighttpd/conf-enabled/
sudo ln -s ../conf-available/20-example.com.conf

Create symlinks so lighttpd can find your project’s and Django’s admin static content:

sudo ln -s /usr/local/lib/python2.6/dist-packages/django/contrib/admin/media/ /var/www/example.com/media
sudo ln -s /usr/local/lib/python2.6/dist-packages/project_name/templates/project_media_dir/ /var/www/example.com/project_media_dir/

Finally, restart everything so the new configuration takes effect:

sudo /etc/init.d/apache2 restart
sudo /etc/init.d/lighttpd restart

So, what actually happens?

When a visitor goes to your website (example.com), the request will hit lighttpd first. If the document path does not end with one of the extensions in our static content list, the request will be proxied to Apache on port 81; otherwise, lighttpd will serve the file itself.

And that’s it, if my memory is correct. Did I miss anything? Comment at will.

[1] Linode, if you’re curious. Mainly because bandwidth is cheaper. If you’re looking for a Linode referral, Linode discount code or Linode promotion code, sign up using this link to credit me as the referrer. Thanks 🙂

2009-08-06

I’m glad Sony back-pedaled on this Wipeout ad incarnation

Filed under: General — inerte @ 01:02

So…. insane

2009-06-14

Skydiving at Boituva 06/06/2009

Filed under: Image — inerte @ 18:59

Sérgio, Mi and I jumped 🙂

2009-05-27

Replacing multiple lines with just one, in PHP

Filed under: PHP — inerte @ 21:41

I banged my head a bit to find the regular expression, but here it goes:

// Collapse any run of line breaks (\r\n, \n or \r, in any combination) into a single \n
$string = preg_replace("/(\r\n|\n|\r)+/", "\n", $string);

2009-02-16

How to scp files with spaces in the name

Filed under: Programming — inerte @ 18:59

A quick Linux command-line tip. Just wrap the name in double and single quotes. Here it goes:

scp user@host:"'has spaces here'" .

2009-02-04

Referrer test

Filed under: Programming — inerte @ 23:01

I’ve made a test page to see how referrers (both server-side and JavaScript) will work if Google changes its URL query string on search results pages:

http://www.inerciasensorial.com.br/referrer-test/index.php

2009-01-31

Web crawling services

Filed under: Poetria — inerte @ 03:44

The number of web apps that need to crawl the web in some form or another is huge, and it’s growing every day; so either I am the stupidest person on Earth and can’t Google properly, or no one is selling web crawling services.

Folks, someone needs to do this. A metered service (like S3) where customers can query your app for crawling results.

I am going to give you two reasons why I should do this myself.

Reason number one:
It’s cheaper and not someone else’s core competency. How does FriendFeed index all these webpages? Who cares? They shouldn’t be doing this. Writing a good web crawler is hard. They need the *data* when it’s *new*.

Reason number two:
I have so many ideas, but I want to focus on prototyping them instead of writing the crawler. It would really help devs around the world if they could just use some API to crawl webpages.

Did I say API? Yes, that’s the point. Someone needs to write a crawler with an API:

POST /api/i=http://www.example.com/file.html
user=name
pass=word
when=00 00,12 * * 1-5
expires=2592000

Yeah, that’s the crontab syntax. “when” would also accept “once” and “onchange”.

“expires” is the number of seconds from now after which this crawl won’t be needed anymore.

This request would return an “id”, to be used later, when the customer is ready to download the webpage from us.
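For illustration, here’s how a client might register a crawl from Python (everything is hypothetical: the host name, the endpoint shape and the field names all come from the sketch above):

# Hypothetical client for the crawl-registration endpoint sketched above.
import urllib
import urllib2

fields = urllib.urlencode({
    'user': 'name',
    'pass': 'word',
    'when': '00 00,12 * * 1-5',  # crontab syntax: midnight and noon, Mon-Fri
    'expires': 2592000,          # 30 days from now, in seconds
})
# Supplying a data argument makes urllib2 issue a POST.
response = urllib2.urlopen(
    'http://crawler.example/api/i=http://www.example.com/file.html', fields)
crawl_id = response.read()  # the id to use later with /api/g=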

Of course there’s also:

POST /api/i=regex
format=rss
content_regex=some_string(.*)sucks?

So you know when someone says your product sucks. And:

POST /api/i=regex
name=(jpg,gif)
width=LT200
height=LT200
type=image

LT is Less Than; there would also be GT and EQ.

So, /api/i= is for inserting a crawling request. You retrieve webpages with /api/g=:

POST /api/g=http://www.example.com/file.html
only=#some_node_id .some_node_class

The “only” parameter takes a selector (the example above is CSS-style; XPath would work too).

Since the customer would pay for data transferred, it would suck to have the customer query /api/g= every time he needs something. And that’s not much different from writing your own crawler, is it? Actually it is, because of robots.txt, HTML parsing, server load, and much more. But a lot of people think that writing crawlers is easy and scalable.

Anyway! The magic happens when you crawl a webpage and it matches some rule set by one of your customers. Now you just need to tell them which of the ids previously returned by /api/i= are ready. They connect to your server and download the files.

And if you have ids:

POST /api/g=111,112,113
compress=True

Which would return the insert requests with ids 111, 112 and 113 in a zip file.
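The matching download call, again purely hypothetical:

# Download crawls 111, 112 and 113 as a single zip (hypothetical API).
import urllib
import urllib2

data = urllib.urlencode({'compress': 'True'})
zip_bytes = urllib2.urlopen('http://crawler.example/api/g=111,112,113', data).read()
with open('crawls.zip', 'wb') as f:
    f.write(zip_bytes)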

One more good thing: economy of scale. Everyone needs the newest RSS feeds. You can have dozens of customers requesting the same feed, but you’ll only need to grab it once.

This service would have nothing to do with search, Google, deep web, semantic web, whatever. Just make sure people will know when a webpage is updated.

2009-01-23

Wikipedia with just links

Filed under: Perils of Software Development — inerte @ 03:02

So I just read how Britannica is planning a new website version, mostly to “compete” with Wikipedia. Britannica’s CEO also talks about the relationship between Google’s results and Wikipedia’s pages.

Anyway, one criticism is valid: sometimes Wikipedia articles are not the best source of information about a topic. And this is being written by someone who searches daily on Google for terms followed by the word “wikipedia”, because Wikipedia usually has results that are good enough.

So here’s my idea: dump Wikipedia’s database of article titles, and let people submit links for these titles. And let users vote, Reddit/Digg style. Find a way to deal with spammers’ bots and you’re ready to go.

For example, the Wikipedia article for “Design_pattern_(computer_science)” (which I have open in a tab right now) would be just a bunch of links to other sites, which users can vote on however you like so you can sort the links by relevancy.
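The data model is almost embarrassingly small; a toy sketch in Python (all names invented):

# Article titles map to user-submitted links with vote counts.
links = {
    'Design_pattern_(computer_science)': [
        {'url': 'http://example.com/patterns-intro', 'votes': 42},
        {'url': 'http://example.com/gof-cheatsheet', 'votes': 17},
    ],
}

# Sort a title's links by votes, Reddit/Digg style:
ranked = sorted(links['Design_pattern_(computer_science)'],
                key=lambda link: link['votes'], reverse=True)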

As I’ve said on my previous post, I’ve been feeling lazy lately, otherwise I would go and implement this.

You might ask, why not just add the links that users would upvote to Wikipedia? Well, rare are the articles on Wikipedia that accept “full” links in their references section. Most links are there as citations, used to justify small excerpts of text or facts, not whole webpages. And the “external links” sections of Wikipedia articles aren’t sorted, and that’s the whole point of my idea :p

Now go and build it and remember to buy me a beer if you make a buck.

2009-01-22

How the fuck is trim in Python?

Filed under: Perils of Software Development — inerte @ 20:54

HTFITIP is a website where you can see what a function from one language is called in another (or find an implementation).

The idea came to me after the eleventh time I asked myself, how the fuck is trim in Python? It’s called strip by the way.
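For the record, the equivalence in two lines of Python:

# PHP's trim() maps to Python's str.strip():
print '  padded string  '.strip()  # 'padded string'
print 'xxpaddedxx'.strip('x')      # like trim($s, 'x'): 'padded'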

The website is very simple. On the front page, there’s this form:

How the fuck is _____’s _____ in _____?

In the first blank goes the “source” programming language name, in the second the function/method name, and in the third the “target” programming language. The source language is optional.

A good source for the… source programming language is PHP’s (massive) functions list. First of all, PHP seems to have a function for everything, and there are no namespaces, so it’s easier for us to build an index of it. Secondly, it’s a widely known language. Thirdly, it’s the first language of a lot of people, so it’s quite possible that many people will want to know the equivalent of PHP’s function X in a new language they’re learning; if we have to start somewhere, it might as well be this way.

When the user selects the programming language from the source dropdown, an Ajax call automatically fetches all its function names, so when the user types a name in the second field, it auto-completes. Then he selects the “target” language in the third field.

How the fuck is PHP’s trim in Python? (click submit)

If the relationship has already been established in the database, we present the answer to the user:

PHP’s trim in Python is called strip. (a link to Python’s manual page on strip would be nice)

If not, we can let the user make this relationship. If that’s not possible (say, Python doesn’t have an array_intersect_uassoc function in any module), we let the user type an implementation, wiki-style.

Web 2.0 baby, the users provide all the content, we make all the money.

And the website backend is simple: a list of programming languages and their functions, the relationships between functions/methods, and some kind of wiki to let people submit implementations.

I’ve been feeling lazy lately, so go ahead and implement this if you want. It should be a useful tool.

Nanomachine regulation

Filed under: Poetria — inerte @ 16:11

Effective immediately

1) Nanomachine development requires a government license;

2) Nanomachines have a limited production count; when it is exhausted, the nanomachine will self-destruct. Example: a tomato nanomachine will not be able to make more than XX tomatoes;

3) Hacking nanomachines is punishable by death. False accusations of nanomachine hacking equal 10 years in prison;

4) These policies are in effect worldwide;

Rationale:

1) Control;

2) There’s a risk that everything will turn into tomatoes if tomato-nanomachines go wild;

3) None shall be able to make a nuclear bomb-nanomachine, and risk is minimized by 1);

4) Sea micronations or evil governments can’t protect someone nanomachining nuclear bombs.

