July 24, 2007

Failure Story on ICFP07…

Posted in Programming, Scheme at 9:25 pm by pmatos

My last weekend was spent on working out a solution in PLT-Scheme for the ICFP07 task. Amazingly fun, well-written, well-imagined task and poor code development on my part briefly resumes it. It was nonetheless, a lot of fun!!!

If you’re not confortable with the task description, you should look at it to understand what’s coming next. So I had to read a huge string from a file, Endo DNA, formed by 4 characters I, C, F and P. My approach was “Let’s get something done and then I’ll optimize if I need to.” And so it was…

Reading a file is pretty easy:

(define (get-line-from-file str)
    (let ([filefp (open-input-file str)])
        (read-line filefp)
        (close-input-port filefp))))

I now, started to think that it would be nice not only to have it in the string, but also to optimize prepends to this string and character skips and since th DNA string was global, I decided to make it global in my module also.

  (define *dna* "")
  (define *dna-index* 0)

  (define (set-dna! str)
    (set! *dna* str)
    (set! *dna-index* 0))

  (define (dna-prefix? str)
    (string-prefix? str *dna* 0 (string-length str) *dna-index* (string-length *dna*)))
  (define (dna-drop! n)
    (set! *dna-index* (+ *dna-index* n)))
  (define (dna-substring init end)
    (cond [(or (>= init end) (>= (+ *dna-index* init) (string-length *dna*))) ""]
           (substring *dna* (+ *dna-index* init) (if (> (+ *dna-index* end) (string-length *dna*))
                                                     (string-length *dna*)
                                                     (+ *dna-index* end)))]))
  (define (dna-preppend! str)
    (if (> (string-length str) *dna-index*)
        (set-dna! (string-append str (string-drop *dna* *dna-index*)))
          (set! *dna-index* (- *dna-index* (string-length str)))
          (string-copy! *dna* *dna-index* str))))                                

  (define (dna-ref n)
    (if (>= n (dna-length))
        (string-ref *dna* (+ *dna-index* n))))
  (define (dna-length)
    (- (string-length *dna*) *dna-index*))
  (define (find-smallest-dna-suffix s i)
    (let loop ([k 0])
      (let ([sstr (dna-substring (+ k i) (+ k i (string-length s)))])
        (cond [(string=? sstr "") #f]
              [(string=? s sstr) (+ i k (string-length s))]
              [else (loop (+ k 1))]))))
  (define (dna)
    (dna-substring 0 (dna-length)))

This code basically defines a dna string globally and functions to handle this dna string. Also an index to the start of the real DNA string is maintained, because if I skip n chars I don’t want to be dropping chars and then when prepending ending up putting them again on it. So I used this little trick just to have one constant sized string capable of growing if needed. At this moment I considered using evectors but then again… I thought, later I optimize… Now, the code as you may see in the task manually is pretty written almost in a schemish style (if you look to the lets) and with Cish assignments and loops, so I decided to go ahead and do it in the most straighforward way possible. One problem was the finish function which could be called from any other function and it would finish the processing. After some messing around with Scheme and thinking about it I thought it would be a good time to put call/cc to good use.

  (define exit-now (void))

  (define (execute)
    (call/cc (λ (exit)
               (set! exit-now exit)
               (let loop ([iteration 0] [time (current-inexact-milliseconds)])
                 (printf "Execution Iteration : ~a~n"  iteration)
                 (let ([p (begin0 (pattern))]
                       [t (begin0 (template))])
                   (printf "pattern ~a~n" p)
                   (printf "template ~a~n" t)
                   (matchreplace p t)
                   (printf "len(rna) = ~a~n" (/ (rna-length) 7))
                   (printf "Time taken: ~a~n" (- (current-inexact-milliseconds) time))
                   (loop (+ iteration 1) (current-inexact-milliseconds)))))))

  (define (finish)
    (exit-now *rna*))

Once the function execute is called, it sets the exit-now variable to the continuation of execute and the function finish calls it with the rna global variable that we will look into later.

Defining pattern was just looking to the task manual and almost blindly copy it:

    (define (pattern)
    (let repeat ([p ()] [lvl 0])
      (cond [(dna-prefix? "C")
             (dna-drop! 1)
             (repeat (cons 'I p) lvl)]
            [(dna-prefix? "F")
             (dna-drop! 1)
             (repeat (cons 'C p) lvl)]
            [(dna-prefix? "P")
             (dna-drop! 1)
             (repeat (cons 'F p) lvl)]
            [(dna-prefix? "IC")
             (dna-drop! 2)
             (repeat (cons 'P p) lvl)]
            [(dna-prefix? "IP")
             (dna-drop! 2)
             (repeat (cons `(! ,(nat)) p) lvl)]
            [(dna-prefix? "IF")
             (dna-drop! 3)
             (repeat (cons `(? ,(consts)) p) lvl)]
            [(dna-prefix? "IIP")
             (dna-drop! 3)
             (repeat (cons '( p) (+ lvl 1))]
            [(or (dna-prefix? "IIC") (dna-prefix? "IIF"))
             (dna-drop! 3)
             (if (zero? lvl)
                 (reverse! p)
                 (repeat (cons ') p) (- lvl 1)))]
            [(dna-prefix? "III")
             (rna-append! (dna-substring 3 10))
             (dna-drop! 10)
             (repeat p lvl)]
             (printf "finish: called by pattern can't match beginning of DNA: ~a...~n" (dna-substring 0 10))

And the same was try with the template function:

  (define (template)
    (let repeat ([t ()])
      (cond [(dna-prefix? "C")
             (dna-drop! 1)
             (repeat (cons 'I t))]
            [(dna-prefix? "F")
             (dna-drop! 1)
             (repeat (cons 'C t))]
            [(dna-prefix? "P")
             (dna-drop! 1)
             (repeat (cons 'F t))]
            [(dna-prefix? "IC")
             (dna-drop! 2)
             (repeat (cons 'P t))]
            [(or (dna-prefix? "IF") (dna-prefix? "IP"))
             (dna-drop! 2)
             (repeat (cons `(_ ,(nat) ,(nat)) t))]
            [(or (dna-prefix? "IIC") (dna-prefix? "IIF"))
             (dna-drop! 3)
             (reverse! t)]
            [(dna-prefix? "IIP")
             (dna-drop! 3)
             (repeat (cons `(|| ,(nat)) t))]
            [(dna-prefix? "III")
             (rna-append! (dna-substring 3 10))
             (dna-drop! 10)
             (repeat t)]
             (printf "finish: called by template can't match beginning of DNA: ~a...~n" (dna-substring 0 10))

The nat and consts functions seems that could do a little tweaking but again I postponed. Anyway it was Saturday, first competition day for me and I still had the whole day and still sunday so I just wrote:

  (define (nat)
    (cond [(dna-prefix? "P")
           (dna-drop! 1)
          [(or (dna-prefix? "I") (dna-prefix? "F"))
           (dna-drop! 1)
           (* 2 (nat))]
          [(dna-prefix? "C")
           (dna-drop! 1)
           (+ (* 2 (nat)) 1)]
           (printf "finish: called by nat. dna is empty.~n")

  (define (consts)
    (cond [(dna-prefix? "C")
           (dna-drop! 1)
           (string-append "I" (consts))]
          [(dna-prefix? "F")
           (dna-drop! 1)
           (string-append "C" (consts))]
          [(dna-prefix? "P")
           (dna-drop! 1)
           (string-append "F" (consts))]
          [(dna-prefix? "IC")
           (dna-drop! 2)
           (string-append "P" (consts))]
          [else ""]))

The next one were the straightforward matchreplace and replace itself which were the most bug-ridden functions I had in the competition but not amazingly were quite simple to implement:

  (define (matchreplace p t)
    (call/cc (λ (return)
               (let loop ([i 0] [e ()] [c ()] [cp p])
                 (for-each (λ (envel) (print-seq-info "e" envel)) e)
                 (if (null? cp)
                       (printf "successful match of length: ~a~n" i)
                       (for-each (λ (envel) (print-seq-info "e" envel)) e)
                       (dna-drop! i)
                       (replace t (reverse! e)))
                     (match (car cp)
                       ((? symbol? (or 'I 'C 'F 'P))
                        (let ([dna-symb (string->symbol (string (dna-ref i)))])
                          (if (eqv? dna-symb (car cp))
                              (loop (+ i 1) e c (cdr cp)))))
                       (`(! ,n)
                        (if (not (> (+ i n) (dna-length)))
                            (loop (+ i n) e c (cdr cp))))
                       (`(? ,s)
                        (let ([smallest-suffix (find-smallest-dna-suffix s i)])
                          (if smallest-suffix (loop smallest-suffix e c (cdr cp)))))
                       ((? symbol? '()
                        (loop i e (cons i c) (cdr cp)))
                       ((? symbol? '))
                        (loop i (cons (dna-substring (car c) i) e) (cdr c) (cdr cp)))))))

  (define (replace tpl e)
    (let ([env (list->vector e)])
      (letrec ([env-ref (λ (n) (if (>= n (vector-length env)) "" (vector-ref env n)))])
        (let loop ([r ""] [t tpl])
          (if (null? t)
              (dna-preppend! r)
              (cond [(symbol? (car t)) (loop (string-append r (symbol->string (car t))) (cdr t))]
                    [(list? (car t))
                     (if (eqv? (first (car t)) '_)
                         (loop (string-append r (protect (second (car t)) (env-ref (third (car t))))) (cdr t))
                         (loop (string-append r (asnat (string-length (env-ref (second (car t)))))) (cdr t)))]))))))
              (match (car t)
                ((? symbol? (or 'I 'C 'F 'P))
                 (loop (string-append r (symbol->string (car t))) (cdr t)))
                (`(_ ,l ,n)
                 (loop (string-append r (protect l (env-ref n))) (cdr t)))
                (`(|| ,n)
                 (loop (string-append r (asnat (string-length (env-ref n)))) (cdr t)))))))))

Where the print functions are just debugging aids:

  (define (print-seq-info name seq)
    (printf "~a = ~a...(~a bases)~n" name (if (< (string-length seq) 10) (string-length seq) (substring seq 0 10)) (string-length seq)))
  (define (print-dna-info)
    (printf "dna = ~a... (~a bases)~n" (dna-substring 0 10) (dna-length)))

And follows the utility functions for matchreplace:

  (define (protect l d)
    (if (zero? l) d (protect (- l 1) (dna-quote d))))

  (define (dna-quote d)
    (printf "quote: d=NOTSHOWN~n")
    (cond [(string-prefix? "I" d) (string-append "C" (dna-quote (string-drop d 1)))]
          [(string-prefix? "C" d) (string-append "F" (dna-quote (string-drop d 1)))]
          [(string-prefix? "F" d) (string-append "P" (dna-quote (string-drop d 1)))]
          [(string-prefix? "P" d) (string-append "IC" (dna-quote (string-drop d 1)))]
          [else ""]))

  (define (asnat n)
    (cond [(zero? n) "P"]
          [(even? n) (string-append "I" (asnat (quotient n 2)))]
          [(odd? n) (string-append "C" (asnat (quotient n 2)))]))

This finishes up the hard part DNA->RNA, which I had it finished by saturday afternoon… Now, I needed the bitmap part working. Incredibly, using PLT-Scheme functions to handle bitmaps was easy and produced great results, however, didn’t allow me to run this on my server, without X. This is because I needed mred.ss library which require X to be running, when in fact, no window is being displayed and apparently, no X is in fact being using directly by me. The RNA variable which I didn’t show was defined similarly to the DNA variable and is of no use to show it here. The RNA to image part is of not much interest (because I never got to really optimize it) and you can find the code online. However, with this much code and a little bit more for bitmap generation was enough to generate this from the prefix found in the last task page after about 140 iterations:

First the part where optimization worked beautifully… the replace function:

  (define (replace tpl e)
    (let ([env (list->vector e)])
      (letrec ([env-ref (λ (n) (if (>= n (vector-length env)) "" (vector-ref env n)))])
        (let loop ([r '()] [t tpl])
          (if (null? t)
              (dna-preppend! (apply string-append (reverse! r)))
              (cond [(symbol? (car t)) (loop (cons (symbol->string (car t)) r) (cdr t))]
                    [(list? (car t))
                     (if (eqv? (first (car t)) '_)
                         (loop (cons (protect (second (car t)) (env-ref (third (car t)))) r) (cdr t))
                         (loop (cons (asnat (string-length (env-ref (second (car t))))) r) (cdr t)))]))))))

This version removed the string-appends from all over the body to apply a string-append to a list of strings which is built in reverse order during the loop. The difference of these two versions is quite big.

Now, using PLT-Scheme profiler I started to optimize everything that should up worse than light green (in DrScheme) and decided to remove the match at matchreplace by something similar which in fact didn’t improve much, if at all, this function. This is what is not in place of the match call:

           (let ([obj (car cp)])
             (if (symbol? obj)
                 (cond [(eqv? obj '()
                        (loop i e (cons i c) (cdr cp))]
                       [(eqv? obj '))
                        (let ([substr (dna-substring (car c) i)])
                          (loop i (cons substr e) (cdr c) (cdr cp)))]
                        (let ([dna-symb (string->symbol (string (dna-ref i)))])
                          (if (eqv? dna-symb (car cp))
                              (loop (+ i 1) e c (cdr cp))
                              (return (printf "failed match~n"))))])
                 (if (eqv? (first obj) '!)
                     (if (not (> (+ i (second obj)) (dna-length))) (loop (+ i (second obj)) e c (cdr cp)) (return (printf "failed match~n")))
                     (let ([smallest-suffix (find-smallest-dna-suffix (second obj) i)]) (if smallest-suffix
                                                                                            (loop smallest-suffix e c (cdr cp))
                                                                                            (return (printf "failed match~n"))))))))))))

I also tried to optimize consts and nat by removing all those function calls without much success:

  (define (nat)
    ;; Search for the first P
    (let ([p-pos (let loop ([i 0])
                   (let ([cc (dna-ref i)])
                     (cond [(string? cc) (printf "finish: called by nat. After eating ~a bases can't find a P starting with ~a..." i (dna-substring 0 10))
                           [(char=? cc #P) i]
                           [else (loop (+ i 1))])))])
        (let loop ([pos (- p-pos 1)] [val 0])
          (cond [(= pos -1) val]
                [(char=? (dna-ref pos) #C) (loop (- pos 1) (+ (* 2 val) 1))]
                [else (loop (- pos 1) (* 2 val))]))
        (dna-drop! (+ p-pos 1))))) 

  (define (consts)
    (let const-loop ([i 0] [transf ()])
      (let ([cc (dna-ref i)])
        (if (and (char=? cc #I)
                 (not (char=? (dna-ref (+ i 1)) #C)))
              (dna-drop! i)
              (apply string (reverse! transf)))
            (cond [(char=? cc #C) (const-loop (+ i 1) (cons #I transf))]
                  [(char=? cc #F) (const-loop (+ i 1) (cons #C transf))]
                  [(char=? cc #P) (const-loop (+ i 1) (cons #F transf))]
                  [else (const-loop (+ i 2) (cons #P transf))])))))

And I even optimized template and pattern by replacing calls from

(dna-prefix? "C")


(char=? (dna-ref 0) #C)

without much, or any, success. These benchmark explain why I tried:

> (time (begin (let loop ([i 0])
                 (unless (>= i 3000000)
                   (cond [(dna-prefix? "IP") "C"]
                         [(dna-prefix? "IF") "P"]
                         [else "K"])
                   (loop (+ i 1))))))
cpu time: 32678 real time: 32821 gc time: 3660
> (time (begin (let loop ([i 0])
                 (unless (>= i 3000000)
                   (cond [(dna-prefix? "I") (if (char=? (dna-ref 1) #\P) "C" "P")]
                         [else "K"])
                   (loop (+ i 1))))))
cpu time: 34970 real time: 36296 gc time: 1664
> (time (begin (let loop ([i 0])
                 (unless (>= i 3000000)
                   (cond [(and (char=? (dna-ref 0) #\I) (char=? (dna-ref 1) #\P) "C" "P")]
                         [else "K"])
                   (loop (+ i 1))))))
cpu time: 38086 real time: 38345 gc time: 0

The optimizations took me almost all saturday night and sunday. Sunday evening I was getting desperate. Then I though that with these datastructures there wasn’t probably much more I could do but I was tired and had to work on the next day so I sent and email to the list to let my frustration go when I was advised to go by #oasis @ freenode.org where people where talking about datastructures and algorithms. I went by to the channel full of people and went to sleep to do nothing else. I’m doing basically ~1 iterations / sec when I should probably be doing 20000 iterations / sec.

Yesterday, I went to look into the PLT-Scheme manual and remembered myself that PLT Scheme string are unicode and probably perform even worse than normal C-strings (in PLT-Scheme, byte-strings) due to that. I shall try change to byte-string and see the results! :-) The more or less cleanified code of what I had is here. Overall, I already saw many possible optimizations, other that are possibly I certainly have not yet come to think about them but it was a whole lot of fun! :-)


  1. JJ said,

    Thanks for sharing your story with us. The more problem we see thrown at Scheme, and their attempted solutions, the better.

    It be interesting to see you investigate this under a different, faster Scheme? Like Bigloo or Gambit-C.

  2. Alexander Kiel said,

    I participate on ICFP too. I’m sorry that you don’t get your DNA to RNA converter reasonably fast. Our first implementation was estimated to run about 72 hours. We took Java but this doesn’t matter.

    The key problem was the DNA StringBuffer. The matchreplace function skip almost every time nearly the whole DNA (later bettern known as the green zone) and the replace function append this part again to the rest of the DNA. In Java this means that at every iteration the whole 7 MB DNA String would be copied to a new buffer.

    On Saturday I wrote a StringBuffer which use a linked list of strings opposite to a big array of chars. So I rediced the big copy actions to pointer handling. But this was not so easy. The string list gets fast fragmented in many small parts. So I wrote a automatic intern defragmenter for it :-)

    At Saturday afternoon we had a working DNA RNA converter with runs only about one minute. So the real work could begin. This year it was like the last where implementing the converter or the UM (last year) was not the point. It was considered that one could implement such thing at instant and concentrate oneself on the real task :-)

    best wishes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: