
Memcache with quicklz

Memcache is a fast, distributed key-value store. Today you would call it a non-persistent NoSQL store, but memcache has been around since 2003 (initially written by Brad Fitzpatrick), long before the NoSQL hype. Memcache has a very broad user base, which includes quite a few top-100 sites. You can connect to the memcache server (called memcached) via an ASCII or a binary protocol, but for most programming languages there are already client APIs that make it easy to use memcache. For PHP there are even two libraries: PECL memcached, which uses libmemcached, and PECL memcache. I will use PECL memcache for this article. But hey, that’s the easy stuff; now it's time to move on.


Compressing the cache

OK, so it would be great to compress this cache to fit even more data into it. It also makes sense to do the compression in the client library: that way you not only save space in the cache but also speed up the transfer to the cache.

And guess what, PECL memcache already has this feature built in. If you add the MEMCACHE_COMPRESSED flag to your store operation, the library will compress your data using zlib. You can even define a threshold so that every entry above a certain size gets compressed automatically. PHP even sets this threshold to a certain value initially, but in my point of view this is a bug.
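A minimal sketch of both variants, assuming the PECL memcache extension is loaded and a memcached server listens on 127.0.0.1:11211. The host, port, key names and threshold values are assumptions for illustration, not the values from my setup:

```php
<?php
// Sketch only: needs the PECL memcache extension and a running memcached.
// Host, port, keys and the threshold values are illustrative assumptions.

function store_examples(Memcache $cache, string $payload): void
{
    // Variant 1: request zlib compression explicitly for one entry.
    $cache->set('wiki_chunk_explicit', $payload, MEMCACHE_COMPRESSED, 0);

    // Variant 2: compress every value larger than 20000 bytes automatically,
    // but only if compression saves at least 20% of the size.
    $cache->setCompressThreshold(20000, 0.2);
    $cache->set('wiki_chunk_auto', $payload, 0, 0);
}

if (extension_loaded('memcache')) {
    $cache = new Memcache();
    if (@$cache->connect('127.0.0.1', 11211)) {
        store_examples($cache, str_repeat('<abstract>lorem ipsum</abstract>', 2000));
        // Decompression happens transparently on read.
        echo strlen($cache->get('wiki_chunk_auto')), " bytes read back\n";
        $cache->close();
    }
}
```

Without the extension or a reachable server the script simply does nothing, so it is safe to try on any box.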

Compression is always an architectural tradeoff. By compressing you gain speed on the transmission and space in the storage, but you lose time compressing and decompressing on the client. In nearly all cases, or at least in the cases where your caching is designed well, you gain much more from compression than you lose.
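To get a feeling for this tradeoff on your own data, something like the following sketch might help. It uses PHP's gzcompress() and gzuncompress() (zlib) directly rather than going through memcache, and the payload is synthetic, so the numbers only hint at what real cache entries would do:

```php
<?php
// Sketch: measure what zlib compression costs and saves for one payload.
// The payload is synthetic; real cache entries will compress differently.

function measure_compression(string $payload, int $level = 6): array
{
    $start = microtime(true);
    $compressed = gzcompress($payload, $level);
    $compressSeconds = microtime(true) - $start;

    $start = microtime(true);
    $restored = gzuncompress($compressed);
    $uncompressSeconds = microtime(true) - $start;

    return [
        'original_bytes'     => strlen($payload),
        'compressed_bytes'   => strlen($compressed),
        'compress_seconds'   => $compressSeconds,
        'uncompress_seconds' => $uncompressSeconds,
        'lossless'           => $restored === $payload,
    ];
}

$payload = str_repeat('<doc><title>Example</title><abstract>Some text</abstract></doc>', 5000);
$result = measure_compression($payload);
printf(
    "%d -> %d bytes, compress %.4fs, uncompress %.4fs\n",
    $result['original_bytes'],
    $result['compressed_bytes'],
    $result['compress_seconds'],
    $result['uncompress_seconds']
);
```

Compare the saved bytes with your network speed: if the time saved on the wire is larger than the compress and uncompress times, compression pays off.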


Compressing the cache even better

Ever since Urban’s talk about quicklz at the Webtuesday I have wanted to implement the memcache compression with this algorithm. Due to the high workload at ETH, the project got delayed over and over again, and I also got stuck at benchmarking memcache correctly (also because of the above-mentioned bug).


So I added quicklz to PECL memcache 3.0.5, which was not too hard. The compression happens in the file memcache_pool.c in the function mmc_compress, and the decompression in mmc_uncompress accordingly. I added the files quicklz.c and quicklz.h to the folder and changed the functions accordingly (see the patch here). As I’m a bit lazy I ignored most of the error handling. (Compression shouldn’t fail, right!?) Don’t forget to add quicklz.c to config9.m4 so that it gets compiled as well. As the PECL build & install did not work properly for me, I did the installation manually: “phpize; ./configure; make; make install” in the main source folder does the job. Thanks a lot to Alvaro Videla for the help; seems like this Twitter thingy is useful for something after all.


Benchmarking it

It was a bit hard to get a good benchmark. I used the abstract dump from Wikipedia, which is a big XML document. I split it into chunks of 512k (“split -b 512k -a 3 enwiki-latest-abstract1.xml wiki_”, Unix is awesome). I then uploaded all these files into memcached with a small PHP script and downloaded all of them in random order three times in a row. I repeated these steps for 10 iterations. I started memcached on the local machine with 10 gigs of storage and the debugging option “-vv” on an extra-large Amazon EC2 instance.


Uncompressed, the files needed 642 MB of cache storage; the average write time was 3.89 seconds and the read time 5.26 seconds (keep in mind that the read time always covers three read passes). Compressed with zlib, the storage space sank to 87 MB, but at the same time the read and write times increased significantly: the average write time was now 27.66 seconds and the read time went up to 19.55 seconds. But now to the result we are really interested in. I used quicklz level one and did not get a big difference for level two; for level three I got crashes, so I could not do any measurements. The result for quicklz: the read time of 6.68 seconds is pretty good, as it comes close to the uncompressed one, but what is really impressive is the write performance, which at 3.54 seconds is even better than uncompressed. Unfortunately quicklz, with 118 MB, uses way more storage than I expected.
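To put the three results side by side, here is a small sketch that expresses them relative to the uncompressed baseline. The numbers are the ones measured above; the function name is just made up for illustration:

```php
<?php
// Numbers copied from the benchmark above (MB of cache storage and seconds).
$results = [
    'uncompressed' => ['mb' => 642, 'write' => 3.89,  'read' => 5.26],
    'zlib'         => ['mb' => 87,  'write' => 27.66, 'read' => 19.55],
    'quicklz'      => ['mb' => 118, 'write' => 3.54,  'read' => 6.68],
];

// Express each variant relative to the uncompressed baseline.
function relative_to_baseline(array $results, string $baseline = 'uncompressed'): array
{
    $base = $results[$baseline];
    $relative = [];
    foreach ($results as $name => $row) {
        $relative[$name] = [
            'size_percent' => round($row['mb'] / $base['mb'] * 100, 1),
            'write_factor' => round($row['write'] / $base['write'], 2),
            'read_factor'  => round($row['read'] / $base['read'], 2),
        ];
    }
    return $relative;
}

foreach (relative_to_baseline($results) as $name => $row) {
    printf("%-12s size %5.1f%%  write x%.2f  read x%.2f\n",
        $name, $row['size_percent'], $row['write_factor'], $row['read_factor']);
}
```

So zlib shrinks the data to about 13.6% of its size at roughly seven times the write time, while quicklz gets to about 18.4% with a write time slightly below the uncompressed one.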


Keep in mind that these measurements were done in an unusual setting: the machine was way faster than a usual web box, so the compression did of course benefit from this. On the other hand, the memcached servers (or at least not all of them) are usually not on the same box, so the network speed matters. If you have a slow network, the size of the cache content matters a lot. And in the end you will not compress as much data as I did, so compression speed should not make a huge difference in your web app. Also, if you have a smart caching strategy, reading from the cache will always be faster than calculating the data, no matter how fast the compression is.


Conclusion

Well, in the end I’m not really happy with the size of the data quicklz produces, even if the speed of the algorithm is impressive. At the end of the day, for most web apps the cache size matters much more than the compression speed, because the speed improvement comes from the cache itself. That’s the reason why I did not improve the patch to the point where one could use it in production.


Mobile App Hackathon

My friend Jonas is now working for the early-stage startup betacular. They are organizing a hackathon in the night from 30 April to 1 May 2011 in Zurich, where you have the possibility to use their brand-new API. You not only get free food and beverages, but you might also win a 1000 CHF cash prize. Register now at http://socialapphackathon.eventbrite.com/ and let the world see that you participate at techup.ch.


Make it human (or how to crack a CAPTCHA)

A CAPTCHA is a picture that should be able to tell a computer from a human: a human should be able to read the content, but a computer should not. But sometimes it's pretty easy to make your computer look human by solving CAPTCHAs. There are some quite good CAPTCHAs out there, e.g. reCAPTCHA from Google, but some people still prefer to write their own. During a boring weekend I tried to show how easy such a self-written CAPTCHA is to crack.

As I don't want to offend the creator of this CAPTCHA or reveal his identity, I created an emulator for this CAPTCHA on my own server, which just randomly delivers one of 100 downloaded CAPTCHAs. The address of the emulator is http://leo.buettiker.org/captcha/emulator.php. Such a CAPTCHA is delivered as an animated GIF and looks like the following picture:

So for cracking it I used PHP as the scripting language with WideImage for the image handling, Tesseract-OCR for OCR and ImageMagick for handling the GIF. To start, let us define some variables and import the WideImage script.

<?php
include ".\image\WideImage.php";

$path = './captcha/';
$tmpCaptchaName = 'captcha.gif';
$resultFile = 'result.jpg';
$imagemagickconvert ='C:\Users\Leo\Desktop\PHP\ImageMagick-6.6.6-4\convert.exe';
$tesseract = '"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"';
$img = null;
$picOffStep = 4;
$initialOff = 51;
$picOff = $initialOff;

Then we need to download the image with the script. We use curl for this. We set CURLOPT_COOKIESESSION to 1 so that each request gets a new image instead of sticking to one session. We save the image to disk for later use.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://leo.buettiker.org/captcha/emulator.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($ch, CURLOPT_COOKIESESSION, 1); 
$content=curl_exec($ch);
file_put_contents($path.$tmpCaptchaName, $content);
curl_close($ch);

Then we create an image in which we assemble the CAPTCHA word; it is kept in the $img variable. With an image tool we see that the animated image consists of 38 single pictures. As we only need the first part of the animation (the spot goes from left to right), we iterate over the first 19 frames of the animation. With the command

convert captcha.gif[0] captcha0.jpg

we are able to extract the first picture of the animation. Right afterwards we load the image again, crop a rectangular part out of the spot and insert it into the result image. The position of the spot was found manually with an image program and is hardcoded. This is all done in the following snippet.

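// Empty canvas that collects the visible strip of every frame.
$img = WideImage::createPaletteImage(91,27);
for($i = 0; $i < 19; ++$i) {
	echo ".";
	$file = "image$i.jpg";
	// Extract frame $i of the animated GIF with ImageMagick.
	exec ($imagemagickconvert.' '.$path.$tmpCaptchaName.'['.$i.'] '.$path.$file);
	// Crop the strip under the spot and paste it at its place on the canvas.
	$img = $img->merge(WideImage::load($path.$file)->crop($picOff,9,23,27),$picOff-$initialOff);
	// The spot moves $picOffStep pixels to the right per frame.
	$picOff += $picOffStep;
}
echo "\n";
$img->saveToFile($path.$resultFile);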

Now we have a picture like the following saved on the disk:

We now use Tesseract-OCR to do OCR on the image. We do this with the command:

tesseract.exe result.jpg toutput -l eng letters

As we know that the words in the CAPTCHA consist only of digits and lowercase letters, we save these characters in the letters config file. It looks like:

tessedit_char_whitelist 0123456789abcdefghijklmnopqrstuvwxyz

To do this automatically, we add the following lines to our script:

exec($tesseract.' '.$path.$resultFile.' '.$path."toutput -l eng letters");
echo preg_replace('/\s/', '', file_get_contents($path."toutput.txt"))."\n";

Now we get an output that looks like:

...................
Tesseract Open Source OCR Engine with Leptonica
fpbmd

It would now be easy to use this extracted information to submit a form. But on the other hand, who wants to do something like this :-) The script does not work 100% reliably. Especially the difference between z's and 2's is sometimes hard for Tesseract. But this could probably be optimized with training, or you just ignore all CAPTCHAs containing one of these characters and try again with another one.
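The "ignore and retry" idea could look roughly like the following sketch. The function names and the set of confusable characters are my assumptions here, based only on the z/2 problem described above:

```php
<?php
// Sketch: decide whether an OCR result should be discarded and retried.
// The set of confusable characters is an assumption based on the z/2 problem
// described above; extend it if other pairs cause trouble.

function is_ambiguous(string $ocrText, string $confusable = 'z2'): bool
{
    // strpbrk() returns false when none of the listed characters occur.
    return strpbrk($ocrText, $confusable) !== false;
}

function is_plausible(string $ocrText): bool
{
    // The CAPTCHA words only contain digits and lowercase letters.
    return preg_match('/^[0-9a-z]+$/', $ocrText) === 1 && !is_ambiguous($ocrText);
}

foreach (['fpbmd', 'a2cz9', 'Fp bmd'] as $candidate) {
    echo $candidate, ': ', is_plausible($candidate) ? 'use it' : 'fetch a new CAPTCHA', "\n";
}
```

In the script you would wrap the download, merge and OCR steps in a loop and repeat them until is_plausible() accepts the result.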

The lesson from this is surely that it is not too hard to crack homebrewed CAPTCHAs. So I would recommend that you stick, whenever you can, with a well-known CAPTCHA like reCAPTCHA. On the other hand, a CAPTCHA is never really safe: according to Wikipedia, letting humans solve CAPTCHAs can be quite cheap.


Named parameters in Java (bgl-style)

Sometimes one would like to have named parameters in Java. During a lecture on how to use the Boost Graph Library (BGL) I came across how named parameters are implemented in the BGL. How to use named parameters in the BGL is explained here. Let me introduce an example of why one might need named parameters.

public class Example {

  public static void outputName(String title, String firstName, String middleName, String lastName) {
    // Compare string content with isEmpty(); != only compares references.
    System.out.print((!title.isEmpty() ? title+". " : "Mr./Mrs. "));
    System.out.print((!firstName.isEmpty() ? firstName+" " : ""));
    System.out.print((!middleName.isEmpty() ? middleName.charAt(0)+". " : ""));
    System.out.print(lastName);
    System.out.println();
  }
  
  public static void main(String[] args) {
    outputName("Dr", "Franz", "Peter", "Frosch");
    outputName("", "Heidi", "", "Peter");
    outputName("Dr", "", "", "Müller");
    outputName("", "Beat", "Sepp", "Wolf");
  }
}

This example shows how a call to a method with a lot of parameters can be hard to read, easy to misunderstand and a source of errors. So it would be nice if one could name the parameters as below.

/* NOT WORKING CODE*/
public class ExampleNotWorkingNamed {
  public static void outputName(String title = "", String firstName = "", String middleName = "", String lastName = "") {
    System.out.print((!title.isEmpty() ? title+". " : "Mr./Mrs. "));
    System.out.print((!firstName.isEmpty() ? firstName+" " : ""));
    System.out.print((!middleName.isEmpty() ? middleName.charAt(0)+". " : ""));
    System.out.print(lastName);
    System.out.println();
  }

  public static void main(String[] args) {
    outputName(title="Dr", firstName="Franz", middleName="Peter", lastName="Frosch");
    outputName(firstName="Heidi", lastName="Peter");
    outputName(title="Dr", lastName="Müller");
    outputName(firstName="Beat", middleName="Sepp", lastName="Wolf");
  }
}

In some programming languages something like this works; unfortunately, in Java it doesn't. This is the point where we implement named parameters like in the BGL. A call then looks like below:

outputName(first("Beat").middle("Sepp").last("Wolf"));

The full code for the same example with named parameters looks like this:

import api.NamedParameterNames;
import static api.NamedParameterNamesFactory.*;

public class NamedExample {
 
  public static void outputName(NamedParameterNames names) {
    System.out.print((names.getTitle() != null ? names.getTitle()+". " : "Mr./Mrs. "));
    System.out.print((names.getFirst() != null ? names.getFirst()+" " : ""));
    System.out.print((names.getMiddle() != null ? names.getMiddle().charAt(0)+". " : ""));
    System.out.print((names.getLast() != null ? names.getLast() : ""));
    System.out.println();
  }

  public static void main(String[] args) {
    outputName(title("Dr").first("Franz").middle("Peter").last("Frosch"));
    outputName(first("Heidi").last("Peter"));
    outputName(title("Dr").last("Müller"));
    outputName(first("Beat").middle("Sepp").last("Wolf"));
  }
}

As the code above shows, we need to implement a factory whose static methods are brought in via the static import in the main file. They all return an object of the type NamedParameterNames, which contains exactly the same methods as the factory and returns itself from each of them, so that we can use method chaining. The code looks like this:

package api;

public class NamedParameterNamesFactory {
  public static NamedParameterNames title(String title) {
    NamedParameterNames p = new NamedParameterNames();
    p.title(title);
    return p;
  }

  public static NamedParameterNames first(String firstname) {
    NamedParameterNames p = new NamedParameterNames();
    p.first(firstname);
    return p;
  }
  
  public static NamedParameterNames middle(String middle) {
    NamedParameterNames p = new NamedParameterNames();
    p.middle(middle);
    return p;
  }

  public static NamedParameterNames last(String last) {
    NamedParameterNames p = new NamedParameterNames();
    p.last(last);
    return p;
  }
}

package api;

public class NamedParameterNames {
  private String title;
  private String first;
  private String middle;
  private String last;
  
  protected NamedParameterNames(){
    super();
  }
  
  public String getTitle() {
    return title;
  }

  public NamedParameterNames title(String title) {
    this.title = title;
    return this;
  }

  public String getFirst() {
    return first;
  }

  public NamedParameterNames first(String firstname) {
    this.first = firstname;
    return this;
  }

  public String getMiddle() {
    return middle;
  }

  public NamedParameterNames middle(String middle) {
    this.middle = middle;
    return this;
  }

  public String getLast() {
    return last;
  }

  public NamedParameterNames last(String last) {
    this.last = last;
    return this;
  }
}

There are some advantages and some disadvantages of doing this. Let's start with the advantages:

  • Calls to the method are easy to read
  • Might reduce errors caused by wrong calls

But there are also quite some disadvantages:

  • A lot of code to write for the factory and the parameter object (so this might only pay off if the API gets used by a lot of people)
  • Not all Java programmers might understand this, and they might misuse the API by writing ugly code. A more "java-ish" way of doing this is the builder pattern from Josh Bloch's "Effective Java 2nd Edition".
  • If you need a lot of parameters in your method, this might be an indicator that your OO design is bad. Think about it again!

More about the named parameter idiom on Stack Overflow


Jira status

After my last post I thought it might also be helpful to publish in my Skype status how many open Jira tickets I have.

You can also get each Jira search result as an RSS feed. Your browser indicates the link to the result as RSS. This URL might look something like:

http://jira.example.com/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?&&resolution=-1&assigneeSelect=specificuser&assignee=leo.buettiker&sorter/field=priority&sorter/order=DESC&tempMax=100&reset=true&decorator=none

To call this URL even if you have no valid session, you can add your user credentials at the end. This looks like:

&os_username=$username&os_password=$password

You could now use an XML or RSS parser to interpret the returned feed. But for my purpose even that is too much; I only count how many items are in the feed. The PHP snippet to do this looks like:

$jiraRss = file_get_contents($url);
$jiraCount = substr_count($jiraRss,'<item>');	
$jiraMessage = $jiraCount?" and $jiraCount open Jira Issues":"";
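If you do want to go the parser route, a sketch with SimpleXML might look like this. The sample feed is made up for illustration and much smaller than what Jira really returns, and the function name is my own:

```php
<?php
// Sketch: count feed items with SimpleXML instead of substr_count().
// The sample feed is made up for illustration; a real Jira feed has more fields.

function count_feed_items(string $rssXml): int
{
    libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
    $feed = simplexml_load_string($rssXml);
    if ($feed === false || !isset($feed->channel)) {
        return 0; // not parseable or not an RSS document
    }
    return count($feed->channel->item);
}

$sampleFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="0.92">
  <channel>
    <title>Hypothetical search result</title>
    <item><title>[DEMO-1] First issue</title></item>
    <item><title>[DEMO-2] Second issue</title></item>
  </channel>
</rss>
XML;

$jiraCount = count_feed_items($sampleFeed);
$jiraMessage = $jiraCount ? " and $jiraCount open Jira Issues" : "";
echo $jiraMessage, "\n";
```

The advantage over counting substrings is that an `<item>` mentioned inside an issue description cannot inflate the number.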

There might be a lot of other cool use cases you can implement just as simply (tickets you currently work on, tickets closed in the last week, etc.). It's just a little bit sad that there is no REST API for Skype, which would make it easier to change the status across platforms.


Mailstatus in Skype

You all know the troubles with overflowing inboxes. I'm a big fan of Inbox Zero and I found a lot of ways to work fast with my mails. I switched off notifications for incoming mails, and I use a lot of filtering and a good folder structure.

But sometimes my own laziness gets in my way. So I started to put "Inbox Zero" into my Skype status whenever I got my inbox empty. But after some time I decided to automate this message.

I know that Patrice does automated Skype updates with his Mac. After a quick search I found out that on Windows Skype has a COM API, and they even provide a little PHP example. With PHP it is also pretty easy to access an IMAP inbox (MS Exchange also provides IMAP access). So I wrote a quick script that updates my Skype message:

$mail = imap_open('{mail.example.com}INBOX','leo.buettiker', 'password');

// Create a Skype4COM object:
$skype = new COM("Skype4COM.Skype");

// Create a conversion object:
$convert = $skype->convert;
$convert->language = "en";

// Start the Skype client:
if (!$skype->client()->isRunning()) {
  $skype->client()->start(true, true);
}


while(true) {
	imap_check($mail);
	$number = imap_num_msg($mail);
	$skype->CurrentUserProfile()->MoodText= 
		"Leo has currently $number mails in his inbox";
	sleep(5);
}

This not only demonstrates how you can overcome your own laziness with open communication and automated tools. In my point of view it's also a nice example of what is possible with PHP outside of classical website rendering.


PHP Quine

OK, Mirko made me lose a hell of a lot of time again. He wrote about his implementation of a quine in Ruby. Quines are just programs that can replicate themselves without opening a file (also not their own source, 'cause that would be too easy in PHP). As usual I had to try this in PHP myself. I found the article from Patrick Schneider very helpful. He explains a quite cool approach with a base64-encoded DNA pretty clearly. I just wrote the solution a bit shorter, which brought it down to 159 chars (you have to have it all on one line):


<?=($dna='PD89KCRkbmE9JyonKT9zdHJfcmVwbGFjZShjaHIoNDIpLCAkZG5hLCBiYXNlNjRfZGVjb2RlKCRkbmEpKTonJz8+Cg==n')?
str_replace(chr(42), $dna, base64_decode($dna)):''?>

Unfortunately Mirko did not allow my copy-paste solution (damn academics!). And to me the solution with a generator does not feel too natural, as using another program to generate a quine is probably not how it was supposed to be. So with the help of diff I tried to find my own solution:

php quine | diff -u quine -

I still nearly got a knot in the brain (much nicer in Swiss German: "chnopf im chopf"). But after some trying I had a solution which, with 113 characters, is even shorter:


<?=($a=array (
  0 => '<?=($a=',
  1 => ')?$a[0].var_export($a,1).$a[1]:"";',
))?$a[0].var_export($a,1).$a[1]:"";

By the way, as a nice start for the language of your choice you should look in the messy c2 wiki (although not all solutions there might work).


delicious friends

I had the idea to write a script that finds out which people you might add to your delicious network, based on the links you have in common. Probably this idea was unconsciously influenced by my co-worker Stefan, who had a similar idea for tilllate users. Unfortunately you don't get the information you need for this out of the del.icio.us API. So I wrote a little screen scraper that pulls the information directly from the del.icio.us frontend. Unfortunately, the first few times I ran into the Yahoo "you shall not steal" guard. (Even though I waited one second between requests, it seems that for the frontend you have to wait even longer.) This weekend I increased the delay between requests massively and it worked (but very slowly).

The idea is pretty simple:
  • The script gets your last 100 links
  • It then looks up which of these links have been saved by other people as well
  • For each such link it stores the usernames of all people who saved it
  • Afterwards it calculates how often each username was stored
  • It orders the usernames by occurrence
  • It prints out the first 100 usernames, with the information whether they are a fan of you or in your network

As the script is very slow because of the throttling, I can only give you a small sample of output here. If you're close to my "network", the likelihood of finding yourself on the list is high. As you see, I did not spend a lot of time on HTML formatting ;-) I used PHP for scripting, XML_HTMLSax3 for parsing and Cache_Lite for, well, caching.
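
The counting and ordering steps from the list above can be sketched as follows. This is not the original script: the scraping is left out, and the function name, usernames and URLs are made up for illustration:

```php
<?php
// Sketch of the counting and ranking steps only; the scraping is left out.
// $saversByLink maps each of your links to the usernames that also saved it.
// All usernames and URLs here are made up for illustration.

function rank_users(array $saversByLink, string $self, int $limit = 100): array
{
    $counts = [];
    foreach ($saversByLink as $usernames) {
        // Count each user at most once per link.
        foreach (array_unique($usernames) as $username) {
            if ($username === $self) {
                continue; // you always share links with yourself
            }
            $counts[$username] = ($counts[$username] ?? 0) + 1;
        }
    }
    arsort($counts);                       // most shared links first
    return array_slice($counts, 0, $limit, true);
}

$saversByLink = [
    'http://example.com/a' => ['leo', 'alice', 'bob'],
    'http://example.com/b' => ['leo', 'alice'],
    'http://example.com/c' => ['leo', 'carol', 'alice', 'bob'],
];

foreach (rank_users($saversByLink, 'leo') as $username => $sharedLinks) {
    echo "$username: $sharedLinks shared links\n";
}
```

The fan and network flags would then be looked up only for the usernames that survive the cut, which keeps the number of throttled requests down.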

Scaling is not about...

Amdahl's Law, do you remember it from school?! But it's not important for this article. I hear and read the word scaling so often lately that I am earnestly thinking about giving it a fixed field on my bullshit bingo card. Like a lot of words on a bullshit bingo card, scaling is often misused.

After talking with Mirko and reading a lot of blogs I really think the world needs yet another one. So this article tries to kill 4 common misunderstandings about scaling, because scaling is not about…

[more after the jump]


What's php like?

Lots of functions listed on PHP.net- thank god!

[stumbled over this on twitter]