Just having some fun today. The code runs actually pretty fast: It reads over 1.5 million lines from 1.1GB of logs and sorts them by time-taken in 5 seconds. I bet there’s room for improvement but it was enough for my needs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | (ns iis_log (:require clojure.contrib.duck-streams clojure.contrib.io clojure.string clojure.test)) (def #^{:doc "Where the log files reside."} logs-directory "D:/userdata/jnetto/Desktop/zina_inside/W3SVC93479707") (def #^{:doc "How many lines to show from logs."} top-to-show 25) ;; IIS log lines starting with # are comments (defn is-not-comment? [^String s] (not= (.substring s 0 1) "#")) (defn remove-comments [file-contents] (filter is-not-comment? file-contents)) ;; Values on logs are separated by a space. Therefore, "-" is shown as a value when there is nothing there, instead of a space, ;; which would mess up the separator. (defn nil-if-hyphen [s] (if (not= s "-") s)) (defstruct token :uri :time-taken :username) (defn relevant-line-info "Returns a map with the path + querystring, time-taken and username from a log line." [line] (let [line-tokens (clojure.string/split line #" ") cs-uri-stem (nth line-tokens 6) cs-uri-query (nth line-tokens 7) cs-username (nth line-tokens 9)] (struct-map token :uri (str cs-uri-stem (nil-if-hyphen cs-uri-query)) :time-taken (read-string (nth line-tokens 20)) :username (nil-if-hyphen cs-username)))) (def accesses (ref [])) (doseq [child (.listFiles (clojure.contrib.io/file logs-directory))] (println "Processing" child) ;; Just for debugging purposes, to show something is happening (let [file-contents (clojure.contrib.duck-streams/read-lines child) file-without-comments (remove-comments file-contents) tokens (map relevant-line-info file-without-comments)] (dosync (alter accesses concat (take top-to-show tokens))))) ;; We 'take 'top-to-show here as an optimization. If we will print ;; only top-to-show, why take more from each file? ;; Until here it's actually pretty fast. 'sort-by and 'reverse are actually the slowest parts of this script. (doseq [line (take top-to-show (reverse (sort-by :time-taken @accesses)))] (println line)) |
You should be able to write
tokens (map #(relevant-line-info %) file-without-comments)astokens (map relevant-line-info file-without-comments)since it’s a one argument function.
Comment by Matt — 2011-09-24 @ 00:47
Well observed Matt. The code grew organically and I left the redundant #() there. Thanks for noticing, it is updated now.
Comment by Julio Nobrega Netto — 2011-09-24 @ 15:46