pronghorn http://blog.pronghorn.org webserver posterous.com

Mon, 02 May 2011 13:09:00 -0700 Snippet: Double-Word Compare-And-Swap in the D language http://blog.pronghorn.org/snippet-double-word-compare-and-swap-in-the-d

Update: My changes to DMD have been pulled in. DMD's inline assembler now supports cmpxchg16b.

 

I'm currently working on a bunch of lock-free data structures including a lock-free FIFO queue which is primarily used for the event queue.

Implementing such lock-free structures, however, is plagued by a major obstacle: the risk of the ABA problem, see http://en.wikipedia.org/wiki/ABA_problem

One approach to overcome this problem is described in "An optimistic approach to lock-free FIFO queues": each node pointer is paired with a tag, and the tag is incremented in the same atomic operation that exchanges the pointer - see http://people.csail.mit.edu/edya/publications/OptimisticFIFOQueue-journal.pdf

This approach, however, requires the cmpxchg16b CPU instruction on x86-64, or cmpxchg8b on 32-bit x86. On Linux, the availability of these instructions is indicated by the flags cx16 and cx8, respectively, in /proc/cpuinfo.
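To make the tag idea concrete, here is a minimal sketch in D. It does not use the inline assembler at all; it assumes a D runtime whose core.atomic offers a 16-byte cas (signalled by has128BitCAS). The Tagged struct and the update function are hypothetical names chosen for illustration and are not taken from pronghorn or from the paper.

```d
import core.atomic : atomicLoad, cas, has128BitCAS;

// Assumption: the target supports a 16-byte CAS (cmpxchg16b on x86-64).
static assert(has128BitCAS, "this sketch needs a 16-byte compare-and-swap");

/// Hypothetical tagged reference: a value (e.g. a node address) plus a counter.
struct Tagged
{
    ulong value; // the payload that is exchanged
    ulong tag;   // incremented on every successful exchange
}

/// Replaces the value and bumps the tag in a single atomic step.
/// Fails if either the value or the tag changed since `seen` was read.
bool update(shared(Tagged)* slot, Tagged seen, ulong newValue)
{
    return cas(slot, seen, Tagged(newValue, seen.tag + 1));
}

unittest
{
    align(16) shared Tagged head;                  // starts as value 0, tag 0
    align(16) Tagged stale = atomicLoad(head);     // snapshot of "A"

    assert(update(&head, stale, 0xB));             // A -> B, tag becomes 1
    align(16) Tagged current = atomicLoad(head);
    assert(update(&head, current, stale.value));   // B -> A, tag becomes 2

    // The value equals the stale snapshot again, but the tag does not,
    // so the ABA situation is detected and the exchange is rejected.
    assert(!update(&head, stale, 0xC));
}
```

With a single-word CAS, the last call would succeed, because the value alone looks unchanged. The extra tag word is what makes the double-word variant necessary.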

Since neither GDC nor DMD implements the cmpxchg16b instruction in its inline assembler yet, I had to add it on my own. Just check out my changes to GDC at https://bitbucket.org/nischu7/gdc/changeset/d8a2a73fb3d8.

As soon as you've rebuilt GDC you'll be able to compile this function: 

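// by Niklas Schulze / nischu7 <n7@pronghorn.org>
//
// Double-word compare-and-swap: atomically compares the 16 bytes at `here`
// with `ifThis` and, if they are equal, replaces them with `writeThis`.
// Returns true if the exchange took place.
//
// requires cmpxchg16b CPU instruction; `here` must be 16-byte aligned
// required GDC patch: http://bitbucket.org/nischu7/gdc/changeset/d8a2a73fb3d8
bool dwcas16(T, V1, V2)(T* here, const V1 ifThis, const V2 writeThis)
{
    // all three operands have to be exactly two 64 bit words wide
    static assert(T.sizeof == 16 && V1.sizeof == 16 && V2.sizeof == 16);

    asm
    {
        /*
        mov RAX, ifThis; // copy ifThis[0] to RAX
        mov RDX, 8[ifThis]; // copy ifThis[1] to RDX

        mov RBX, writeThis; // copy writeThis[0] to RBX
        mov RCX, 8[writeThis]; // copy writeThis[1] to RCX
        */

        // slightly faster?
        lea RSI, ifThis;
        mov RAX, [RSI];   // RDX:RAX = expected value
        mov RDX, 8[RSI];

        lea RSI, writeThis;
        mov RBX, [RSI];   // RCX:RBX = replacement value
        mov RCX, 8[RSI];

        mov RSI, here;

        lock; // lock always needed to make this op atomic
        cmpxchg16b [RSI]; // sets the zero flag if the exchange succeeded

        setz AL; // return value: AL = 1 on success, 0 otherwise
    }
}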

http://files.posterous.com/user_profile_pics/1292591/IMGP8902_kontrast.jpg http://posterous.com/users/4wzCXMauzsUp j.n. schulze n7 j.n. schulze
Mon, 07 Mar 2011 14:18:00 -0800 Sp(ee)dy Protocol http://blog.pronghorn.org/speedy-protocol

Ever heard of SPDY before? It's a protocol draft based on HTTP introduced by the Google Chromium Project. In contrast to HTTP, SPDY supports request multiplexing over a single TCP connection, which leads to a much higher efficiency. SPDY also compresses the HTTP headers carried on top of it, to mention just two features.

Implementing SPDY in existing web servers is no walk in the park since it requires major changes to the core server logic. "mod_spdy" provides SPDY support for Apache but does not support request multiplexing, which would require hacking the core.

Apart from visiting the CeBIT last Saturday, I've spent my weekend implementing SPDY in pronghorn. To make multiplexing work, I still have to put in some additional effort. But the results look really encouraging so far.

Sat, 26 Feb 2011 14:42:00 -0800 Let's go on! http://blog.pronghorn.org/lets-go-on

After things had become quiet around the project, I'm glad to announce that everything has turned out well:

All barriers are broken down - let's focus on developing again ;-)

Sun, 28 Nov 2010 14:27:00 -0800 gdc w/ d2.0.50 - GC crash @amd64 http://blog.pronghorn.org/gdc-w-d2050-gc-crash-amd64

To my great joy, GDC has been merged with the current D frontend (which is 2.0.50, to express it numerically).

But apparently, the garbage collector is a mere wreck when it comes to 64 bit builds. When compiled in 64 bit mode, pronghorn crashes with a segfault after exactly 94 sequential requests unless we disable the GC by calling "GC.disable()".

And as I don't want to abandon 64 bit support for now, there's no way to bypass this bug except disabling the GC and doing explicit memory management as we did in the good old times (which I still prefer over garbage collection). But as Walter Bright points out, garbage-collected programs are (usually) faster. Moreover, D is designed as a garbage-collected language - some features, such as array concatenation, rely on the GC. Garbage collection, however, isn't well-suited for realtime applications such as server daemons, since the collections the GC performs at arbitrary times temporarily block all threads. Hence we need to tune the GC behaviour anyway and perform manual collections whenever the server is idle.
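The "disable, then collect while idle" idea can be sketched with the GC interface in core.memory. The function collectWhileIdle and the notion of an idle hook are assumptions made for illustration; how pronghorn actually detects idle time is not shown here.

```d
import core.memory : GC;

/// Hypothetical idle hook: the event loop would call this whenever
/// no requests are pending.
void collectWhileIdle()
{
    GC.collect();   // run a full collection at a moment of our choosing
    GC.minimize();  // hand unused memory back to the OS where possible
}

unittest
{
    GC.disable();               // suppress automatic collections
    scope (exit) GC.enable();   // every disable() needs a matching enable()

    auto scratch = new ubyte[](64 * 1024); // allocations still work as usual
    scratch = null;                        // drop the only reference

    collectWhileIdle();         // explicit collections are still allowed
}
```

Note that GC.disable() only suppresses automatic collections; the runtime may still collect if it runs out of memory, so this reduces pauses rather than ruling them out.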

In a few days we'll see whether I've been able to fix that GC bug on my own, whether "ibuclaw" has fixed it, or whether the official 64 bit DMD has been published.

32 bit builds, however, are working fine.

Sun, 07 Nov 2010 09:57:00 -0800 64bit DMD is in sight! http://blog.pronghorn.org/64bit-dmd-is-in-sight

As recent tweets of @WalterBright show, he's working hard on 64bit DMD. I'm really looking forward to finally producing 64bit builds of pronghorn.

I'm well aware that the GDC branch of goshawk (who merges GDC with the recent D2 frontend from time to time) produces working 64bit binaries, but unfortunately it's still at D2.0.39.

Sun, 07 Nov 2010 05:02:00 -0800 Improving string comparison performance by about 1100% http://blog.pronghorn.org/improving-string-comparison-performance-by-ab

Usually, strings in D are compared either this way…

Snippet 1 – The convenient way

if(stringA == stringB) // This will give you a good (if not even best) performance in most cases since arrays are handled by reference.
{
    // strings are equal
}

…or this way…

Snippet 2 – The OldSchool-C-way

if(strcmp(stringA, stringB) == 0)
{
    // strings are equal
}

Let’s do a little benchmark first ;–)

We’ll compare two strings 100 million times to get meaningful numbers.

The convenient way (complete program)

import std.stdio;
import std.perf;

void main()
{
    auto pc = new PerformanceCounter();

    string str = "POST";

    pc.start();

    for(uint i=0;i<100_000_000;++i)
    {
        if(str == "POST")
        {
            // strings are equal
        }
    }

    pc.stop();
    writefln("Execution took %d ms", pc.milliseconds);
}

Note that “str” isn’t constant. A constant string wouldn’t be representative.

result: 3984ms

The OldSchool way (complete program)

import std.stdio;
import std.perf;
import std.string; // for (str)cmp

void main()
{
    auto pc = new PerformanceCounter();

    string str = "POST";

    pc.start();

    for(uint i=0;i<100_000_000;++i)
    {
        if(cmp(str, "POST") == 0) // cmp() is similar to strcmp()
        {
            // strings are equal
        }
    }

    pc.stop();
    writefln("Execution took %d ms", pc.milliseconds);
}

result: 3388ms

3984ms vs 3388ms – cmp() is roughly 17% faster, which won’t knock your socks off

Fortunately, I’ve got an ace up my sleeve, and you hopefully have a little-endian CPU.

Using the CMP (compare) instruction of your CPU, we are able to compare 4 characters at once very quickly, without the internal loop of the strcmp() function. Simply treat the 4 characters as a 32 bit integer and let the magic happen. Don’t forget to bit-shift the characters into reverse order: on a little-endian CPU, the first character ends up in the lowest byte.

I’ve wrapped everything into this nifty template

template str4_cmp(const char[] m, const char c0, const char c1, const char c2, const char c3)
{
    const char[] str4_cmp = "*(cast(uint*)"~m~")==(('"~c3~"'<<24)|('"~c2~"'<<16)|('"~c1~"'<<8)|'"~c0~"')";
}

So let’s put everything together…

import std.stdio;
import std.perf;

template str4_cmp(const char[] m, const char c0, const char c1, const char c2, const char c3)
{
    const char[] str4_cmp = "*(cast(uint*)"~m~")==(('"~c3~"'<<24)|('"~c2~"'<<16)|('"~c1~"'<<8)|'"~c0~"')";
}

void main()
{
    auto pc = new PerformanceCounter();

    string str = "POST";

    pc.start();

    for(uint i=0;i<100_000_000;++i)
    {
        if(mixin(str4_cmp!("str", 'P', 'O', 'S', 'T')))
        {
            // strings are equal
        }
    }

    pc.stop();
    writefln("Execution took %d ms", pc.milliseconds);
}

…and finally, compile and execute it.

result: 276ms. That’s roughly 12 times faster than cmp() (3388ms), an improvement of about 1100%!

Pronghorn uses this trick to determine the HTTP request method (“GET ” <– note the trailing space, “POST”, etc.). The only drawback is that the string length has to be a multiple of 2 (using an unsigned short) or 4 (using an unsigned integer, as shown here). Hence we had to come up with a hybrid approach to transparently replace existing strcmp()/strncmp() calls.
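For comparison, here is a more defensive sketch of the same trick without string mixins. It is not pronghorn's actual code: pack4, startsWith4 and the two constants are hypothetical names, and the length check plus the memcpy-based load are additions that the template above does not have.

```d
import core.stdc.string : memcpy;

// Assumption: same byte order as in the article.
version (LittleEndian) {} else
    static assert(0, "this sketch assumes a little-endian CPU");

/// Packs four characters into a uint so that c0 lands in the lowest byte.
uint pack4(char c0, char c1, char c2, char c3) pure nothrow @nogc @safe
{
    return (cast(uint) c3 << 24) | (cast(uint) c2 << 16)
         | (cast(uint) c1 << 8) | c0;
}

/// True if `s` begins with the four characters packed into `expected`.
bool startsWith4(const(char)[] s, uint expected) nothrow @nogc @trusted
{
    if (s.length < 4)
        return false;          // never read past the end of the slice
    uint first;
    memcpy(&first, s.ptr, 4);  // 4-byte load that does not rely on alignment
    return first == expected;
}

enum uint POST = pack4('P', 'O', 'S', 'T'); // evaluated at compile time
enum uint GET_ = pack4('G', 'E', 'T', ' '); // note the trailing space

unittest
{
    assert(startsWith4("POST /upload HTTP/1.1", POST));
    assert(startsWith4("GET / HTTP/1.1", GET_));
    assert(!startsWith4("GET / HTTP/1.1", POST));
    assert(!startsWith4("PO", POST)); // too short: rejected, not read
}
```

The length check costs one extra comparison per call, so this version trades a little speed for not reading beyond short inputs.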

I highly recommend reading http://www.codeproject.com/KB/string/optimize_strings.aspx?msg=1009488 for further details on this

Note that the code snippets have been compiled with DMD 2.0.49 without any optimizations (-O has been omitted).

Wed, 06 Oct 2010 13:44:00 -0700 Untitled http://blog.pronghorn.org/29851800

As you might have noticed, there hasn't been any development progress for weeks because of a couple of other projects I'm involved in, which require a lot of effort as well.

With that said, I hope I'll get around to finishing the current release within the next few months - in a stable manner, naturally.

Ideally, the GDC compiler will have been merged with the current D2 Frontend, or even better, a native Linux x86_64 version of DMD2 will have been released then.

Thu, 22 Jul 2010 05:13:00 -0700 the new virtual host map - unfolded http://blog.pronghorn.org/the-new-virtual-host-map-unfolded

Pronghorn's virtualHostMap class provides an internal API for creating, accessing and deleting virtual host objects.

But how's the data organized? The most common way is probably to simply store the hostnames in a hashmap. We did so prior to pronghorn release 0.8 since it's the simplest and probably most efficient way - until it comes to wildcard subdomains.

What's so tricky about wildcard subdomains? Having a configured wildcard subdomain "*.example.com" and a request to "dharmainitiative.foo.bar.example.com", several lookups are required to find the appropriate virtual host which makes the lookup process quite expensive. The following steps are needed to find the associated virtual host:

VirtualHost: *.example.com, Request: dharmainitiative.foo.bar.example.com

  1. lookup dharmainitiative.foo.bar.example.com (doesn't exist)
  2. lookup *.foo.bar.example.com (doesn't exist)
  3. lookup *.bar.example.com (doesn't exist)
  4. lookup *.example.com (oh, we finally got it)

As you can see, four steps are required in this example.
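The four lookups listed above can be reproduced with a flat associative array. This is only a sketch of the pre-0.8 idea, not pronghorn's code: the VirtualHost struct, findHost and the lookups counter are hypothetical and exist purely to make the cost visible.

```d
import std.string : indexOf;

/// Hypothetical stand-in for pronghorn's virtual host object.
struct VirtualHost
{
    string name;
}

/// Looks up `hostname` in a flat map, falling back to wildcard entries.
/// `lookups` reports how many hash map probes were needed.
VirtualHost* findHost(VirtualHost[string] hosts, string hostname,
                      out size_t lookups)
{
    // step 1: exact match
    ++lookups;
    if (auto exact = hostname in hosts)
        return exact;

    // steps 2..n: strip the leftmost label and retry with a wildcard
    auto rest = hostname;
    for (;;)
    {
        const dot = rest.indexOf('.');
        if (dot < 0)
            return null;                // no labels left to strip
        rest = rest[dot + 1 .. $];
        ++lookups;
        if (auto wild = ("*." ~ rest) in hosts)
            return wild;
    }
}

unittest
{
    auto hosts = ["*.example.com": VirtualHost("wildcard")];
    size_t lookups;
    auto host = findHost(hosts, "dharmainitiative.foo.bar.example.com", lookups);
    assert(host !is null && host.name == "wildcard");
    assert(lookups == 4); // the four steps listed above
}
```

Every probe hashes a full string and allocates a new key for the wildcard form, which is where the cost of deep subdomains comes from.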

 

As mentioned before, pronghorn 0.8 takes another approach, which is explained in the following lines.

keyword: tree

All hostnames are organized in a tree (as in the Domain Name System, DNS). When a new virtual host is registered with the virtual host map, each of its hostnames is inserted into the tree by splitting it into its labels ("example.com" becomes an array of "example" and "com"), which are then processed in reverse order ("com", "example" - the TLD is represented by the first level of the tree). If the root node of the tree doesn't contain a child node named "com", one is created. The same happens with the "example" label: if the "com" node doesn't hold a child node named "example", a new node is created and linked to the "com" node.

Lookups work in a similar way. Starting at the root node, the "com" label is looked up, and if its child label "example" doesn't exist, the "com" node is checked for a child label named "*" - a wildcard one. Of course, the string comparisons needed to find the desired label can't be avoided entirely. But compared to a flat hashmap, lookups are performed incrementally - step by step, hostname label by hostname label. This way there's no need to compare long strings at once, which saves time and makes this approach more efficient. And last but not least, memory is saved since each label is stored only once (at its respective position in the tree).
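A minimal sketch of such a label tree follows. It is an illustration under assumptions, not the virtualHostMap implementation: Node, insert and lookup are hypothetical names, a wildcard is assumed to cover all remaining labels, and there is no backtracking to a wildcard higher up once a deeper exact label has matched.

```d
import std.array : split;
import std.range : retro;

/// Hypothetical stand-in for pronghorn's virtual host object.
struct VirtualHost
{
    string name;
}

/// One label of a hostname; the root node represents the empty label.
final class Node
{
    Node[string] children; // child labels, e.g. "com" -> "example"
    VirtualHost* host;     // set if a registered hostname ends here
}

/// Inserts `hostname` label by label, TLD first, creating missing nodes.
void insert(Node root, string hostname, VirtualHost* host)
{
    auto node = root;
    foreach (label; hostname.split(".").retro)
    {
        if (auto child = label in node.children)
            node = *child;
        else
        {
            auto created = new Node;
            node.children[label] = created;
            node = created;
        }
    }
    node.host = host;
}

/// Walks the tree label by label; falls back to a "*" child if present.
VirtualHost* lookup(Node root, string hostname)
{
    auto node = root;
    foreach (label; hostname.split(".").retro)
    {
        if (auto child = label in node.children)
            node = *child;
        else if (auto wild = "*" in node.children)
            return (*wild).host; // assumption: wildcard covers the rest
        else
            return null;
    }
    return node.host;
}

unittest
{
    auto root = new Node;
    auto wildcard = new VirtualHost("wildcard");
    auto www = new VirtualHost("www");
    insert(root, "*.example.com", wildcard);
    insert(root, "www.example.com", www);

    assert(lookup(root, "www.example.com") is www);
    assert(lookup(root, "dharmainitiative.foo.bar.example.com") is wildcard);
    assert(lookup(root, "example.org") is null);
}
```

In this sketch the deep wildcard request is resolved after three short label comparisons ("com", "example", then the failed "bar" plus the "*" check), and the labels "com" and "example" are stored only once for both hostnames.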

Deleting hostnames from the tree is the hardest part. Before deleting a label, it needs to be ensured that it isn't shared with other hostnames. But as of now, the new hostname implementation works like a charm.

Thu, 22 Jul 2010 03:53:00 -0700 pronghorn outperforms lighttpd and nginx http://blog.pronghorn.org/pronghorn-outperforms-lighttpd-and-nginx-see

As this benchmark shows, pronghorn 0.8 outperforms lighttpd and nginx with ease. Apache is beaten as well, but it's like taking candy from a baby.

A 10 kilobyte file has been served 10,000 times at 6 concurrency levels (1, 10, 100, 200, 500 and 1000 simultaneous requests). In this scenario, pronghorn performs best at 100 concurrent requests (handling ~55,000 requests/s). The requests have been generated using ApacheBench (ab) on a Linux 2.6 system with the following specs:

  • Intel Xeon X3220
  • 8 GB DDR2-800 ECC RAM
  • 2x Samsung HD103HJ @ RAID-1
  • 3ware 9650SE-2LP RAID controller
  • chassis/board/power supply: Dell PowerEdge R200 (shouldn't matter)

The server side was powered by another Xen domU on the same physical machine (with just 1GB of RAM).

software versions used:

  • Linux 2.6.26
  • pronghorn 0.8 (single-threaded, epoll, sendfile)
  • lighttpd 1.4.19 (epoll, sendfile)
  • nginx 0.7.62
  • Apache 2.2.12 (worker mpm)

 

[Image: Bench1]

 

But, as nice as this might look, be aware that pronghorn 0.8 isn't stable yet and the final performance might differ slightly from these results.

Thu, 22 Jul 2010 03:04:00 -0700 lua api: new features implemented http://blog.pronghorn.org/lua-api-new-features-implemented

see the lua api wiki page for details

Thu, 22 Jul 2010 02:57:00 -0700 first post http://blog.pronghorn.org/first-post-3

just switched from self-hosted wordpress to posterous
