
elmo-msgdb-get-message-id-from-buffer's performance issue



I tested the performance of elmo-msgdb-get-message-id-from-buffer by
re-syncing a large localdir folder (about 31000 messages) with
threading enabled.

I redefined elmo-msgdb-get-message-id-from-buffer in each of the ways
shown below and measured the time to re-sync.  Please note that the
measurement is very rough and the function was not byte-compiled.
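
For anyone who wants to repeat this kind of rough measurement, here is
a minimal sketch of a wall-clock timer.  The macro name
my-elapsed-seconds is hypothetical, and this is only one possible way
to take the time, not necessarily how the figures in this post were
obtained.

```elisp
;; Hypothetical helper, not part of ELMO or Wanderlust.
(defmacro my-elapsed-seconds (&rest body)
  "Evaluate BODY once and return the elapsed wall-clock time in seconds."
  (declare (indent 0) (debug t))
  ;; Use an uninterned symbol so BODY cannot clash with our variable.
  (let ((start (make-symbol "start")))
    `(let ((,start (float-time)))
       ,@body
       (- (float-time) ,start))))

;; Stand-in workload; replace it with the re-sync operation to time.
(my-elapsed-seconds
  (sleep-for 0.05))
```

A single run like this includes garbage collection and disk cache
effects, so small differences between runs should not be read as
significant.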

Environment: ThinkPad X201s (Core i7-640LM, 2.13GHz), Windows 7 (x64)
Emacs 24.1.50 (locally built)


A. as loose as possible
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (elmo-field-body "message-id")))
    (if msgid
	(if (string-match "<.+>" msgid)
	    (match-string 0 msgid)
	  (concat "<" msgid ">"))	; Invalid message-id.
      ;; no message-id, so put dummy msgid.
      (concat "<"
	      (if (elmo-unfold-field-body "date")
		  (timezone-make-date-sortable (elmo-unfold-field-body "date"))
		(md5 (string-as-unibyte (buffer-string))))
	      (nth 1 (eword-extract-address-components
		      (or (elmo-field-body "from") "nobody"))) ">"))))

B. A without assuming the buffer is narrowed
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (std11-field-body "message-id")))
    (if msgid
	(if (string-match "<.+>" msgid)
	    (match-string 0 msgid)
	  (concat "<" msgid ">"))	; Invalid message-id.
...

C. A with more strict regexp
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (elmo-field-body "message-id")))
    (if msgid
	(if (string-match "\\`[ \n\t]*\\(<.+>\\)[ \n\t]*\\'" msgid)
	    (match-string 1 msgid)
	  (concat "<" msgid ">"))	; Invalid message-id.
...

D. C with elmo-unfold-field-body
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (elmo-unfold-field-body "message-id")))
    (if msgid
	(if (string-match "\\`[ \t]*\\(<.+>\\)[ \t]*\\'" msgid)
	    (match-string 1 msgid)
	  (concat "<" msgid ">"))	; Invalid message-id.
...

E. Using the lexical analyzer
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (elmo-unfold-field-body "message-id")))
    (if msgid
	(or (let* ((tokens (std11-parse-msg-ids-string msgid))
		   (id (assq 'msg-id tokens)))
	      (setq id
		    (unless (assq 'msg-id (delq id tokens))
		      (std11-addr-to-string (cdr id))))
	      ;; Return nil when result is "".
	      (when (> (length id) 0) id))
	    (concat "<" msgid ">"))	; Invalid message-id.
...

F. Combination of E and C
(defun elmo-msgdb-get-message-id-from-buffer ()
  (let ((msgid (elmo-unfold-field-body "message-id")))
    (if msgid
	(or (let* ((tokens (std11-parse-msg-ids-string msgid))
		   (id (assq 'msg-id tokens)))
	      (setq id
		    (unless (assq 'msg-id (delq id tokens))
		      (std11-addr-to-string (cdr id))))
	      ;; Return nil when result is "".
	      (when (> (length id) 0) id))
	    (if (string-match "\\`[ \n\t]*\\(<.+>\\)[ \n\t]*\\'" msgid)
		(match-string 1 msgid)
	      (concat "<" msgid ">")))	; Invalid message-id.
...


Result (time to re-sync):
A  190 sec  (as loose as possible)
B  189 sec  (A without assuming the buffer is narrowed)
C  188 sec  (A with more strict regexp)
D  189 sec  (C with elmo-unfold-field-body)
E  256 sec  (using the lexical analyzer)
F  254 sec  (combination of E and C)

I think the differences among A, B, C and D are within the margin of
error.  At least in my environment (the results may differ on older
Emacsen or other systems):

1. It would be better for elmo-msgdb-get-message-id-from-buffer not to
assume that the buffer is narrowed to the header, for robustness and
maintainability.

2. In elmo-msgdb-get-message-id-from-buffer, the choice of
header-extraction function and of matching regexp has little effect on
performance.

3. Using the lexical analyzer clearly affects performance: E and F
took roughly 35% longer than A through D.  If we introduce the lexical
analyzer to extract the Message-ID, I would like a customizable option
to disable it.
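
As a minimal sketch of what such an option could look like: the names
my-msgid-use-lexical-analyzer, my-msgid-by-lexer and my-extract-msgid
are hypothetical, and the default of nil is only an assumption.  The
lexer part follows variant E, and the fallback is the regexp from
variant C.  The std11 functions come from FLIM, so the sketch checks
that they are defined before calling them.

```elisp
;; Hypothetical option; no such variable exists in ELMO today.
(defcustom my-msgid-use-lexical-analyzer nil
  "Non-nil means try the std11 lexical analyzer before the regexp."
  :type 'boolean
  :group 'mail)

(defun my-msgid-by-lexer (msgid)
  "Parse the field body MSGID with the std11 lexer, as in variant E.
Return nil unless exactly one non-empty msg-id token is found."
  (let* ((tokens (std11-parse-msg-ids-string msgid))
         (id (assq 'msg-id tokens)))
    (when (and id (not (assq 'msg-id (remq id tokens))))
      (let ((str (std11-addr-to-string (cdr id))))
        (when (> (length str) 0) str)))))

(defun my-extract-msgid (msgid)
  "Return a Message-ID extracted from the field body MSGID."
  (or
   ;; Slow path: only when the user asked for it and FLIM is loaded.
   (and my-msgid-use-lexical-analyzer
        (fboundp 'std11-parse-msg-ids-string)
        (fboundp 'std11-addr-to-string)
        (my-msgid-by-lexer msgid))
   ;; Fast path: the strict regexp from variant C.
   (if (string-match "\\`[ \n\t]*\\(<.+>\\)[ \n\t]*\\'" msgid)
       (match-string 1 msgid)
     (concat "<" msgid ">"))))   ; Invalid message-id.
```

With the option left at nil, the function behaves like the regexp-only
variants, so users with large folders would not pay for the lexer
unless they opt in.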

BTW, in my localdir folders, I found only one Message-ID: header with
a comment, and that message was spam.
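
To illustrate why a comment matters, here is a small sketch comparing
the regexps of variants A and C on a field body with a trailing
comment.  The function names my-msgid-loose and my-msgid-strict and the
sample address are made up for illustration; only the regexps and the
fallback are taken from the listings above.

```elisp
(defun my-msgid-loose (msgid)
  "Extract a Message-ID from MSGID with the regexp of variant A."
  (if (string-match "<.+>" msgid)
      (match-string 0 msgid)
    (concat "<" msgid ">")))

(defun my-msgid-strict (msgid)
  "Extract a Message-ID from MSGID with the regexp of variant C."
  (if (string-match "\\`[ \n\t]*\\(<.+>\\)[ \n\t]*\\'" msgid)
      (match-string 1 msgid)
    (concat "<" msgid ">")))

;; The loose regexp picks the bracketed part and drops the comment.
(my-msgid-loose "<foo@example.com> (a comment)")
;; => "<foo@example.com>"

;; The strict regexp does not match, so the whole body gets wrapped.
(my-msgid-strict "<foo@example.com> (a comment)")
;; => "<<foo@example.com> (a comment)>"
```

For a body without a comment, both functions return the same string,
which fits with such headers being rare in practice.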

-- 
Kazuhiro Ito