Clojure script to parse IIS logs from a directory and show slowest pages

Just having some fun today. The code runs actually pretty fast: It reads over 1.5 million lines from 1.1GB of logs and sorts them by time-taken in 5 seconds. I bet there’s room for improvement but it was enough for my needs.

(ns iis_log
  (:require clojure.contrib.duck-streams clojure.contrib.io clojure.string clojure.test))

(def #^{:doc "Where the log files reside."} logs-directory "D:/userdata/jnetto/Desktop/zina_inside/W3SVC93479707")

(def #^{:doc "How many lines to show from logs."} top-to-show 25)

;; IIS log lines starting with # are comments
(defn is-not-comment?
  [^String s]
  (not= (.substring s 0 1) "#"))

(defn remove-comments
  [file-contents]
  (filter is-not-comment? file-contents))

;; Values on logs are separated by a space. Therefore, "-" is shown as a value when there is nothing there, instead of a space,
;; which would mess up the separator.
(defn nil-if-hyphen
  [s]
  (if (not= s "-") s))

(defstruct token :uri :time-taken :username)

(defn relevant-line-info
  "Returns a map with the path + querystring, time-taken and username from a log line."
  [line]
  (let [line-tokens (clojure.string/split line #" ")
        cs-uri-stem (nth line-tokens 6)
        cs-uri-query (nth line-tokens 7)
        cs-username (nth line-tokens 9)]
    (struct-map token :uri (str cs-uri-stem (nil-if-hyphen cs-uri-query))
                      :time-taken (read-string (nth line-tokens 20))
                      :username (nil-if-hyphen cs-username))))

(def accesses (ref []))

(doseq [child (.listFiles (clojure.contrib.io/file logs-directory))]
  (println "Processing" child) ;; Just for debugging purposes, to show something is happening
  (let [file-contents (clojure.contrib.duck-streams/read-lines child)
        file-without-comments (remove-comments file-contents)
        tokens (map relevant-line-info file-without-comments)]
    (dosync (alter accesses concat (take top-to-show tokens))))) ;; We 'take 'top-to-show here as an optimization. If we will print
                                                                 ;; only top-to-show, why take more from each file?

;; Until here it's actually pretty fast. 'sort-by and 'reverse are actually the slowest parts of this script.
(doseq [line (take top-to-show (reverse (sort-by :time-taken @accesses)))] 
  (println line))

Posted

in

by

Tags:

Comments

2 responses to “Clojure script to parse IIS logs from a directory and show slowest pages”

  1. Matt Avatar
    Matt

    You should be able to write
    tokens (map #(relevant-line-info %) file-without-comments)astokens (map relevant-line-info file-without-comments)since it’s a one argument function.

  2. Julio Nobrega Netto Avatar

    Well observed Matt. The code grew organically and I left the redundant #() there. Thanks for noticing, it is updated now.