Duplicate detection

This is an old version of the document!


If an email is delivered to Benno MailArchiv for archiving more than once, Benno MailArchiv's duplicate detection takes effect.

How it works

During archiving, Benno MailArchiv generates a SHA256 checksum for each archived email. This is logged internally in the Benno MailArchiv journal and used for consistency checks (compliance).
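As a minimal sketch of the idea (not Benno MailArchiv's actual code), a SHA-256 checksum over a complete raw message can be computed as follows. The function name `email_checksum` and the sample message are assumptions made for illustration.

```python
import hashlib


def email_checksum(raw_message: bytes) -> str:
    """Return the SHA-256 hex digest of the complete raw message.

    Headers and body are hashed together, byte for byte.
    """
    return hashlib.sha256(raw_message).hexdigest()


# Hypothetical sample message, used only for illustration.
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hi\r\n"
    b"\r\n"
    b"Hello Bob\r\n"
)
print(email_checksum(raw))
```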

Duplicate detection

The checksum is always generated over the entire email. This makes it possible to check, at the moment an email is archived, whether a mail with the same checksum already exists in the archive. An identical checksum means that the mail to be archived has already been stored in the archive and is therefore a duplicate. In that case, archiving of the mail is aborted and the detection of the duplicate is logged in the archiving log file /var/log/benno/archive.log. In addition, the skipped archiving operation is recorded in the journal together with the checksum and the note "DUPLICATE".
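The check can be sketched with a toy in-memory archive. This is an illustration under stated assumptions, not Benno MailArchiv's implementation: the class `Archive`, its attributes, and the status value "ARCHIVED" are hypothetical. Only the "DUPLICATE" marker comes from the description above.

```python
import hashlib


class Archive:
    """Toy in-memory archive; all names here are illustrative only."""

    def __init__(self):
        self.stored = {}    # checksum -> raw message
        self.journal = []   # list of (checksum, status) entries

    def archive(self, raw_message: bytes) -> bool:
        """Store the message unless its checksum is already known.

        Returns True if the message was stored, False if it was a duplicate.
        """
        checksum = hashlib.sha256(raw_message).hexdigest()
        if checksum in self.stored:
            # Archiving is aborted; the journal still records the attempt.
            self.journal.append((checksum, "DUPLICATE"))
            return False
        self.stored[checksum] = raw_message
        self.journal.append((checksum, "ARCHIVED"))  # hypothetical status
        return True


archive = Archive()
mail = b"Subject: Hi\r\n\r\nHello\r\n"
print(archive.archive(mail))  # True: first delivery is stored
print(archive.archive(mail))  # False: second delivery is a duplicate
```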

Thanks to this effective duplicate detection, emails can be handed to Benno MailArchiv for archiving any number of times. They are reliably recognized as duplicates and their archiving is aborted, so that each mail enters the mail archive only once. The SHA256 checksum makes it virtually certain that emails are never mistakenly identified as duplicates and therefore erroneously left unarchived. Because the entire email is used to generate the checksum, this also matters for consistency checks of individual archived emails and of the archive as a whole: flipping a single bit of an archived email is enough to change its checksum, so corruption of the mail or the archive can be detected.
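The effect of a single flipped bit can be illustrated with a short sketch. The helper `flip_bit` and the sample message are assumptions made for this example. The stored checksum stands in for the value recorded at archiving time.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


original = b"Subject: Invoice 42\r\n\r\nAmount due: 100 EUR\r\n"
stored_checksum = hashlib.sha256(original).hexdigest()

# Simulate corruption of the archived copy by flipping one bit.
corrupted = flip_bit(original, 5)
is_consistent = hashlib.sha256(corrupted).hexdigest() == stored_checksum
print(is_consistent)  # False: the corruption is detected
```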

Multiple mail delivery in complex environments

While emails typically (and especially in on-premises installations) reach the mail archive via a single, uniform path, in complex environments (e.g., in larger hosting infrastructures) emails may be transported to the archive multiple times and simultaneously via different paths. For example, different MTAs or transport methods and types (SMTP, IMAP, etc.) could be responsible for this.

An example

In a highly complex infrastructure, Benno MailArchiv receives a specific email "M" for archiving via three different routes. From the user's perspective in the mail client (that is, in terms of text and content), all three copies are the same email. However, in the part typically not visible to the user, each copy carries its own set of email headers, inserted along its particular transport route.

According to Benno MailArchiv's duplicate detection, the three copies are three different emails: generating the SHA256 checksum for each of them yields three different checksums "C1", "C2", and "C3". Although the content is identical from the user's perspective, the differing headers mean that Benno MailArchiv correctly identifies them as three different emails and archives them separately. The user would therefore see the email three times in the archive, because all three copies match the same search criteria based on the email body.
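A small sketch makes this concrete. The message and the three `Received` headers below are invented for illustration. Only the trace header differs between the copies, yet the full-message checksums are all different.

```python
import hashlib

# Hypothetical message core, identical in all three copies.
core = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Quarterly report\r\n"
    b"Message-Id: <1234@example.com>\r\n"
    b"\r\n"
    b"Please find the report attached.\r\n"
)

# Each transport route prepends its own (invented) trace header.
routes = [
    b"Received: from mta1.example.net by archive.example.net\r\n",
    b"Received: from mta2.example.net by archive.example.net\r\n",
    b"Received: from imap.example.net by archive.example.net\r\n",
]

copies = [route + core for route in routes]
checksums = {hashlib.sha256(copy).hexdigest() for copy in copies}
print(len(checksums))  # 3: every copy has its own checksum
```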

To archive each email only once in this situation, the following approach is recommended:

An email is uniquely identified by the headers listed below and additionally by the body (message text) of the email:

Envelope-From - X-REAL-MAILFROM

Envelope-To - X-REAL-RCPTTO

Return-Path

Subject

Message-Id

Date

From

To

Cc

Body

Two emails, M1 and M2, that do not differ in any of these fields are identical in terms of content and sender/recipient assignment. If any one of these fields differs, M1 and M2 are not identical.

Other email headers, such as Received, DKIM signatures, etc., are not directly related to the email's content. These headers are more a part of an email's envelope (similar to stamps and sticky notes on a contract being processed by a company).

On this basis, two checksums would be calculated for each email. First, the standard checksum required for compliance would be generated over the entire email, as before. In addition, a second checksum would be generated only over the fields listed above. Comparing this second checksum makes it easy to detect duplicates among emails that appear identical from the user's perspective.
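The two-checksum idea could be sketched as follows. This is an illustration under assumptions, not the product's implementation: the function names, the whitespace normalization of header values, and the simple body split at the first blank line are choices made for this example. A real implementation would have to define the canonical form precisely.

```python
import hashlib
from email import message_from_bytes

# Header names taken from the field list above; the envelope sender and
# recipient are assumed to be carried in X-REAL-MAILFROM and X-REAL-RCPTTO.
IDENTITY_HEADERS = [
    "X-REAL-MAILFROM", "X-REAL-RCPTTO", "Return-Path", "Subject",
    "Message-Id", "Date", "From", "To", "Cc",
]


def full_checksum(raw: bytes) -> str:
    """Standard checksum over the entire message (compliance)."""
    return hashlib.sha256(raw).hexdigest()


def split_body(raw: bytes) -> bytes:
    """Return the raw bytes after the first blank line (simplified)."""
    for separator in (b"\r\n\r\n", b"\n\n"):
        _, found, body = raw.partition(separator)
        if found:
            return body
    return b""


def content_checksum(raw: bytes) -> str:
    """Second checksum over the identifying headers and the body only."""
    message = message_from_bytes(raw)
    digest = hashlib.sha256()
    for name in IDENTITY_HEADERS:
        for value in message.get_all(name, []):
            # Collapse folding whitespace so re-folded headers compare equal.
            normalized = " ".join(str(value).split())
            digest.update(f"{name.lower()}:{normalized}\n".encode("utf-8", "replace"))
    digest.update(b"\n")
    digest.update(split_body(raw))
    return digest.hexdigest()


# Hypothetical copies of one email that arrived via two routes.
core = (
    b"From: alice@example.com\r\nTo: bob@example.com\r\n"
    b"Subject: Quarterly report\r\nMessage-Id: <1234@example.com>\r\n"
    b"\r\nPlease find the report attached.\r\n"
)
m1 = b"Received: from mta1.example.net\r\n" + core
m2 = b"Received: from mta2.example.net\r\n" + core

print(full_checksum(m1) == full_checksum(m2))        # False
print(content_checksum(m1) == content_checksum(m2))  # True
```

In this sketch the full checksum still distinguishes the two copies, while the content checksum treats them as the same email, which is the property the simplified duplicate detection relies on.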

Compliance requirements according to the GoBD

According to the German principles of proper accounting (GoBD), it must be possible to restore every email from the archive in its original state (i.e., the email including all headers, attachments, etc.). Furthermore, it must be possible to verify that an email has not been manipulated, which is achieved using the standard checksum.

However, if several emails (identical in content and text) M1, M2 and M3 (different copies of the same email in the above sense) arrive for archiving, how should these be handled with regard to their different headers?

A legal view of this situation

According to the information available to us, there is no legal obligation to archive multiple versions of an email that differ only in their headers. Nevertheless, from a pragmatic point of view, all copies of the email in question (M1, M2, M3, etc.) should be archived. From a purely formal standpoint, and as can be verified immediately from the different checksums, these are de facto different emails. Therefore, for legal certainty, all versions of the email should be archived. Technically, the simplified duplicate detection described above, which uses two checksums, would make it easy to archive only the first of several copies with identical content.

To arrive at a legally compliant solution, we advise operators to discuss the matter with a legal advisor of their choice first, and only then to decide on and implement a specific form of duplicate detection.

Until further notice, we assume that it could be legally sufficient to apply the simplified duplicate detection and thus to archive only one of several arriving copies of an email. We also assume, until further notice, that a suitable explanation or written record of the matter in the procedural documentation (which the GoBD makes mandatory) should be sufficient to achieve legally compliant archiving.

The decision on which type of duplicate detection to apply, and with it the responsibility towards the tax authorities, lies solely with the operator.

Legal Notice / Disclaimer

This document does not constitute legal advice. It serves only for general information purposes. We assume no liability for the accuracy or completeness of the information provided. All liability is excluded.

duplikatserkennung.1511443299.txt.gz · Last modified: 2017/11/23 13:21 by lwsystems