Clojure script to parse IIS logs from a directory and show slowest pages

Just having some fun today. The code runs actually pretty fast: It reads over 1.5 million lines from 1.1GB of logs and sorts them by time-taken in 5 seconds. I bet there’s room for improvement but it was enough for my needs.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
(ns iis_log
  (:require clojure.contrib.duck-streams clojure.contrib.io clojure.string clojure.test))
 
(def #^{:doc "Where the log files reside."} logs-directory "D:/userdata/jnetto/Desktop/zina_inside/W3SVC93479707")
 
(def #^{:doc "How many lines to show from logs."} top-to-show 25)
 
;; IIS log lines starting with # are comments
(defn is-not-comment?
  [^String s]
  (not= (.substring s 0 1) "#"))
 
(defn remove-comments
  [file-contents]
  (filter is-not-comment? file-contents))
 
;; Values on logs are separated by a space. Therefore, "-" is shown as a value when there is nothing there, instead of a space,
;; which would mess up the separator.
(defn nil-if-hyphen
  [s]
  (if (not= s "-") s))
 
(defstruct token :uri :time-taken :username)
 
(defn relevant-line-info
  "Returns a map with the path + querystring, time-taken and username from a log line."
  [line]
  (let [line-tokens (clojure.string/split line #" ")
        cs-uri-stem (nth line-tokens 6)
        cs-uri-query (nth line-tokens 7)
        cs-username (nth line-tokens 9)]
    (struct-map token :uri (str cs-uri-stem (nil-if-hyphen cs-uri-query))
                      :time-taken (read-string (nth line-tokens 20))
                      :username (nil-if-hyphen cs-username))))
 
(def accesses (ref []))
 
(doseq [child (.listFiles (clojure.contrib.io/file logs-directory))]
  (println "Processing" child) ;; Just for debugging purposes, to show something is happening
  (let [file-contents (clojure.contrib.duck-streams/read-lines child)
        file-without-comments (remove-comments file-contents)
        tokens (map relevant-line-info file-without-comments)]
    (dosync (alter accesses concat (take top-to-show tokens))))) ;; We 'take 'top-to-show here as an optimization. If we will print
                                                                 ;; only top-to-show, why take more from each file?
 
;; Until here it's actually pretty fast. 'sort-by and 'reverse are actually the slowest parts of this script.
(doseq [line (take top-to-show (reverse (sort-by :time-taken @accesses)))] 
  (println line))
  • Matt

    You should be able to write
    tokens (map #(relevant-line-info %) file-without-comments)astokens (map relevant-line-info file-without-comments)since it’s a one argument function.

  • Well observed Matt. The code grew organically and I left the redundant #() there. Thanks for noticing, it is updated now.