Maillog Analysis for Postfix and IMSVA (Part I)

[Leer este artículo en español]

Part I. Postfix Maillog Performance Indicators

The following procedure should work for any mail server or Anti-Spam appliance. This particular analysis was made with the Postfix bundled in IMSVA 8.2 but it could be easily translated to any other platform.

 

The first point to identify what is the actual performance of Postfix is to define certain indicators that can help us to measure behavior patterns. The following list shows some of the most important for the analysis.

  • Number of Inbound connections
  • Number of Outbound connections
  • Number of connections to the Scanner (only for IMSVA)
  • Mail delivered to Internet
  • Mail permanently rejected
    • Rejected mail because of nonexistent domain
    • Rejected mail because of nonexistent mailbox
    • Mean lifetime before rejection
    • Maximum lifetime before rejection
    • Minimum lifetime before rejection
    • Mail temporarily rejected
      • Mail rejected because of timeout or connection rejection
        • Top ten of these domains
  • Mail rejected for any other reason
    • Top ten of these reasons
    • Average queuing time
    • Maximum queuing time
    • Minimum queuing time
    • Number of mailboxes
Next, you have to obtain the actual data for each of this indicators. We’ll show you here how to do this manually which may help you in writing scripts to automate the process, our objective here is to show how to filter the maillog file to understand such information.

 

DETERMINING THE NUMBER OF INBOUND CONNECTIONS IN MAILLOG

 

First, you have to know that Postfix doesn’t really distinguishes between connections from our LAN and the ones from the Internet, so if you search for the total number of connections you’ll end up with a value that includes both. In this analysis we’ll focus on connections received only from the internal network, this is, mails that leave our Organization. To obtain this value there are some cases you have to consider according to your Postfix configuration.
  • The first option is when you have a Postfix that only sends outgoing mail, so you can assume that the total number of connections is the same as the number of connection from the internal network (LAN). In such case, you have to filter maillog, looking for all lines that contain the following text: “ connect from “. For IMSVA, this number will also contain connections to the Scanner on IP 127.0.0.1, so you have to exclude these lines in the filter.
  • The second option is when you have a Postfix that sends and receives mail from both your LAN and Internet and also the relay configuration allows connections from certain IP addresses only. In this case you have to filter maillog looking for strings that contain “ connect from unknown[Internal_Mail_server_IP]”. If Postfix receives mail from more than one server on your LAN (you have two or more HUB servers for example) but all of them belong to the same subnet, then you can make a filter like this “ connect from unknown[192.168.75.”. If IP address are from different subnets then you can make different filters, one for each IP address. In this last scenario you can merge all the lines at the end or you can analyze each group of lines individually  to detect differences in the behavior of each internal mail server.
To execute the filters you can do it directly in the local Postfix server using the “cat” command or you can extract the log and process it with applications like “Notepad++” which can help you in these kind of operations.
If you will filter maillog directly in the local Postfix server, these are the commands to execute:

 

For case 1, Postfix (IMSVA) for outbound mail only
# cat /var/log/maillog | grep ” connect from unknown” | grep –v 127.0.0.1 |wc –l

 

For case 2, Postfix (IMSVA) for outbound and inbound mail for only one internal server
# cat /var/log/maillog | grep ” connect from unknown[Internal_Mail_IP]” |wc –l

 

For case2, Postfix (IMSVA) for outbound and inbound mail for several internal mail servers in the same subnet
# cat /var/log/maillog | grep ” connect from unknown[192.168.75.” |wc –l

 

For case 2, Postfix (IMSVA) for outbound and inbound mail for several internal mail servers in different subnets
# cat /var/log/maillog | grep ” connect from unknown[Internal_Mail1_IP]” |wc –l
[…]
# cat /var/log/maillog | grep ” connect from unknown[Internal_MailN_IP]” |wc –l
At the end, just add all the results to obtain the total number of connections from  the internal network.

 

The number obtained from these filters will give you the total number of inbound connections from the internal network.

 

 

DETERMINING THE NUMBER OF OUTBOUND CONNECTIONS IN MAILLOG

 

This number can be obtained with the following formula:

 

OUTBOUND CONNECTIONS = MAIL DELIVERED TO INTERNET + MAIL PERMANENTLY REJECTED + MAIL TEMPORARILY REJECTED

 

Notice for IMSVA, that this number doesn’t include the number of mails / connections to the Scanner (IP 127.0.0.1) because we want to distinguish between already processed mail and mail that is waiting to be processed.

 

 

DETERMINING THE NUMBER OF CONNECTIONS TO IMSVA SCANNER IN MAILLOG

 

For IMSVA, Scanner refers to the module that is responsible of processing mail before it is delivered. There are two Postfix instances: the first listens on port 25 or 2500 (whether you have IP Profiler or not) and is responsible for sending mail to the Scanner, and the second one listens on port 10026 and is responsible for receiving mail from the Scanner and perform the final delivery. Because of this configuration, the Scanner produces its own lines in maillog that reflect the passing of the mail between both Postfix instances. Knowing how many connections are received from the Scanner will help you in identifying what is the actual mail processing capability of IMSVA.

 

The command to obtain this number is similar to the ones already used:
# cat /var/log/maillog | grep ” connect from unknown[127.0.0.1]” |wc –l

 

In this scenario is not possible, at least in an easy way, to distinguish between connections to the Scanner for incoming or outgoing mail when IMSVA processes mail for both directions. Despite of this, the final number will help you know if the Scanner has enough available connections in order to increase its processing capacity and speed.

 

 

DETERMINING THE AMOUNT OF MAIL DELIVERED TO INTERNET

 

To obtain this number you can use this filter:
# cat /var/log/maillog | grep “status=sent”|grep –v 127.0.0.1 |wc -l

 

Notice that this filter excludes all mails successfully delivered to the Scanner by excluding IP 127.0.0.1. If Postfix (IMSVA) is only for outgoing mail then this filter should be enough to obtain the correct number.

 

In those cases where Postfix (IMSVA) processes mail for both incoming and outgoing mail, this number indicates the total number of connections for both the LAN and Internet. In such cases you’ll also have to exclude the IP addresses of all the internal mail servers as shown in the following filter:
# cat /var/log/maillog | grep “status=sent”|grep –v 127.0.0.1|grep –v 192.168.75.5| wc –l

 

If you have several mail servers but they all belong to the same subnet just omit the last octet, for example “192.168.75.”. If this is not the case you’ll have to use “grep –v [IP]” for each IP address to exclude.

 

 

DETERMINING THE AMOUNT OF MAIL WITH PERMANENT REJECTION

 

To obtain this number you can use the following filer:
# cat /var/log/maillog | grep “status=bounced”| wc –l

 

In this case, we didn’t exclude the localhost IP (127.0.0.1) and in those cases where Postfix accepts both incoming and outgoing connections, this filter doesn’t exclude either the IP addresses of the internal mail servers. This is because it’s convenient to have the total number of rejected mail and then obtain the number for both the Scanner and the internal mail servers. To obtain these last numbers you just have to add a “grep” with the value of the IP you want to filter as shown in the following example, where we obtain the number of permanent rejected mail for IMSVA Scanner on localhost:
# cat /var/log/maillog | grep “status=bounced”|grep 127.0.0.1| wc –l

 

Ideally, the value for these sub-filters should be “0”, otherwise this could indicate a problem in the performance of IMSVA scan engines.

 

 

DETERMINING THE AMOUNT OF PERMANENT REJECTED MAIL BECAUSE OF NON-EXISTENT DOMAINS

 

To obtain this value you can use the following filter:
# cat /var/log/maillog | grep “status=bounced”| grep “dsn=5.4.4” |wc –l

 

We identify this condition based on the status response code instead of using the “human readable description”, because the numerical code indicates the same condition for all possible descriptions. This way you can avoid filtering for every single and unique description, considering that each mail server is able to establish its own description and also that each mail administrator is also able to change the text of the description and language to enforce Organization security policies.

 

 

DETERMINING THE AMOUNT OF PERMANENT REJECTED MAIL BECAUSE OF NON-EXISTENT MAILBOXES

 

There are several smtp response codes that fall under this category but they all point to the same kind of problem: there is a problem with the recipient’s mailbox that is responsible for an unsuccessfully delivery. This number, according to the Organization specific mail flow and from what you can see on the field, may be assumed as the difference between the total permanent rejected mail and the mail rejected because of non-existent domains, this is:

 

NON-EXISTENT MAILBOXES = TOTAL PERMANENT REJECTED – MAIL REJECTED BECAUSE OF NON-EXISTENT DOMAINS

 

If you want to make sure that these two categories are included in at least 90% to 95% of the total permanent rejected mail, you can filter each of the response codes included in the “bounced” state. As a result of these sub-filters you’ll be sure enough if you have to consider an additional category or if these two should be enough for your analysis.

 

 

DETERMINING THE MEAN, MINIMUM AND MAXIMUM LIFETIME BEFORE A TEMPORARILY REJECTION

 

These values represent the time a mail spends since it is first received by Postfix up to its final delivery to the recipient’s domain. At this point, Postfix receives an 5XX response code indicating the mail could not be delivered. The mean lifetime tells you how much time is spent in mails that will never be delivered to its destination because of a permanent failure or a policy violation. The importance of this number comes when you end up spending too much resources in mails that should never had been received in the first place.

 

For this scenario, the total number of connections belong to a single and unique mail, this is, if you see 100 permanently rejected connections, you can assume those connections were generated by exactly 100 mails. This is true, because this kind of mails have only one shot to be delivered, if they were queued because of a temporarily rejection before this one shot, then the first queue action counts as a temporary rejection. This is important because, as you’ll see later in this post, when we talk about temporarily rejections, one single mail may end up generating at least 2 connections.

 

You can use the following filter to obtain the three values (mean, minimum and maximum lifetime in Postfix):
# cat maillog|grep status=bounced|awk ‘{print $9}’|grep delay=|sed ‘s/delay=//g’|sed ‘s/,//g’ > delayBounced.txt; cat maillog|grep status=bounced|awk ‘{print $10}’|grep delay=|sed ‘s/delay=//g’|sed ‘s/,//g’ >> delayBounced.txt; cat delayBounced.txt |awk ‘NR == 1 { max=$1; min=$1; sum=0 }{ if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf “Min: %d\tMax: %d\tAverage: %f\n”, min, max, sum/NR}’

 

This command does the following:
  • cat maillog|grep status=bounced|. This section filters all lines that contain “status=bounced” which include all mails with a permanent rejection.
  • awk ‘{print $9}’. This section presents only the ninth column from the set previously filtered. This column contains the parameter “delay=” which in maillog represents the time any mail spends from its arrival up to the time it was delivered or in this case, rejected.
  • grep delay=. In some cases, the ninth column may include a parameter called “conn_use” instead of “delay”, this parameter indicates the number of times a connection has been reused because of lack of available connections. Because of this, we “grep” only those lines including “delay”.
  • sed ‘s/delay=//g’. So far, we ended up with a set of lines that look like this “delay=342,”. This section deletes the word “delay=” so we can now have a column of numerical values like “342,”.
  • sed ‘s/,//g’. This section deletes the final comma “,” at the end of the numerical value.
  • > delayBounced.txt. This section saves the final set with numerical values in a file called delayBounced.txt.
  • You’ll see now that following sections are repeated to include the output of the tenth line instead of the ninth. This is because in those lines where the ninth column says “conn_use”, the “delay” parameter is moved to the tenth. At the end of the command you can see that instead of overwriting the delayBounced.txt file, we just append the resulting set to the one we already had with the mechanism “>>” which adds the resulting lines at the end of the file.
  • The last section of the command is the one that calculates the mean, minimum and maximum values. The minimum and maximum values are important add a context to the mean value. Let’s suppose that the final mean is 30 minutes and values for min and max are 2 seconds and 8 days. In this case, for the mean to have a value of 30 minutes, most of the values must be closer to a range between 2 seconds and 1 hour and therefore, the number of mails that spent more than 4 hours to be delivered must be small. On the other hand, if we have the same min and max values but a mean of 7 days, then this means that most of the mails spent more than 5 days to be rejected while just a few ones were rejected in seconds or in a few hours.

 

These three values together will give you an idea about how much time are these kind of mails been delayed inside Postfix and the determine the root cause and course of action.

 

 

DETERMINING THE AMOUNT OF MAIL WITH TEMPORARY REJECTION

 

A temporary rejection is identified by a response code 4XX and it means that at the time of delivery, the SMTP Server was not able to receive that mail. This implies, by SMTP rules, that this mail should be queued to retry a later delivery up to the final lifetime of the mail; in such a case, the mail should be deleted and an NDR should be sent to the sender informing of the failure.

 

To determine this value you can use the following filter to look for all the lines that include the sentence “status=deferred”.
# cat maillog|grep status=deferred|wc -l

 

The result will give you the total number of connections used in temporarily delivery failures. Because this value is the number of connections, another indicator is the number of unique mails that produced this number. Both numbers are different because once a mail is queued, its delivery should be tried as many times until it is finally accepted or it comes to its lifetime limit. Based on this, one single unique mail may produce at least two connections (the first try when the mail got queued for the first time and at least a second try when it was successfully delivered).

 

To determine the amount of mail you can use the following command:
# cat maillog|grep status=deferred|awk ‘{print $6}’|sort -u|wc –l

 

This filter takes all the lines from maillog that contain the phrase “status=deferred” that includes all temporarily rejected mails. Then it prints the sixth column that contains the Postfix Internal ID, which is unique to every mail received by Postfix. If a mail consumes 10 connections, then the internal ID will show up 10 times indicating this particular mail has used 10 connections and all of them got a temporal rejection. Finally the filter orders all lines and deletes al repeated ID’s, which will deliver the total amount of mail generating these failed connections.

 

If you also want to know how many connections have been generated by each mail, you can use the following filter:
# cat maillog|grep status=deferred|awk ‘{print $6}’|sort|uniq -c|sort –nr > countDeferredMail.txt

 

This command behaves exactly the same but it uses the “uniq” command to count all the times an Internal ID has been repeated. At the end, it orders the final set from higher to lower with “sort –nr”. The output can be sent to a file “countDeferredMail.txt” or you can just substitute this parameter with a “|more” to analyze the data directly without using a local file. This last case is shown below:
# cat maillog|grep status=deferred|awk ‘{print $6}’|sort|uniq -c|sort –nr |more

 

 

DETERMINING THE AMOUNT OF TEMPORARILY REJECTED MAIL BECAUSE OF TIMEOUT OR CONNECTION REFUSED

 

Just as we explained on mails rejected because of non-existent domains, it is advisable to determine this value by using the numerical value of the response code and not the human readable description, as this last one may change from server to server. This situation is represented by the numerical value of DSN 4.4.1 (Delivery Status Notification).
You can use the following command to obtain this value:
# cat maillog|grep status=deferred|grep dsn=4.4.1|wc –l

 

The result will give you the total of connections used to deliver mail to domains that have one of the following problems:
  • The domain exists but it is not associated with a mail server. It may have one of the following public DNS records: MX, CNAME or A. As an SMTP rule, as long as a domain has at least an A record, even when it doesn’t have an MX record, the SMTP Client must retry the delivery until it is successful or the mail lifetime expires. This is an SMTP rule, so all SMTP servers should behave exactly the same way.
  • The destination domain has the IP of your Postfix / IMSVA in a black list with a temporarily rejection response code.
  • The domain exists and the Mail server exists but it is not responding with the numerical response code 220 to initiate the transaction.
  • There is a device in the middle: load balancer, firewall, switch, etc. That is not allowing the mail transaction.

 

To determine the total number of mails generating these connections you can use the following command:
# cat maillog|grep status=deferred|grep dsn=4.4.1|awk ‘{print $6}’|sort|uniq -c|sort -nr|wc –l

 

If you want to obtain the Top Ten of the domains generating this failure, you can use the following filter:
# cat maillog|grep status=deferred|grep dsn=4.4.1|awk ‘{print $15}’|grep “\[” > dsn441.txt;cat maillog|grep status=deferred|grep dsn=4.4.1|awk ‘{print $18}’|grep “\[” >> dsn441.txt; cat dsn441.txt | sort|uniq -c|sort -nr|more

 

As a result, you’ll get an ordered list of the domains that generate the most connections because of this reason along with its name and IP address. If you want to save the list for a later analysis, just substitute the last “more” with “ > listaDNS441.txt”.

 

 

DETERMINING THE AMOUNT OF MAIL WITH TEMPORARILY REJECTION BECAUSE ANY OTHER REASONS

 

You can easily determine this number by subtracting the number of connections with connection problems obtained before from the total of temporarily rejected mail. This is:

 

MAIL REJECTED BY OTHER REASONS = TOTAL OF TEMPORARILY REJECTED MAIL – TOTAL OF TEMPORARILY REJECTED MAIL WITH CONNECTION PROBLEMS

 

However, this number by itself doesn’t allow you to identify any other reasons that may exist at the time of the rejection. To obtain a more exact value, you can obtain a Top Ten of these Other Reasons. You can use the following filter to obtain this value:
# cat maillog|grep status=deferred|grep -v dsn=4.4.1|awk ‘{print $11″ “$16” “$17” “$18” “$19” “$20” “$21” “$22” “$23” “$24” “$25” “$26” “$27” “$28” “$29” “$30” “$31” “$32}’|grep dsn= > dsn4XX.txt;cat maillog|grep status=deferred|grep -v dsn=4.4.1|awk ‘{print $12” “$17” “$18” “$19” “$20” “$21” “$22” “$23” “$24” “$25” “$26” “$27” “$28” “$29” “$30” “$31” “$32” “$33}’|grep dsn= >> dsn4XX.txt;cat dsn4XX.txt | sort|uniq -c|sort -nr|more

 

Just as with the other examples already shown, you can save the list by replacing the las “|more” with “> listaDSN4xx.txt”.

 

You should notice that the output may show lines like the following:
  25 dsn=4.2.2, 451 4.2.2 user over quota; cannot receive new mail: user1@domain.com (in reply to RCPT TO command))
     24 dsn=4.2.2, 451 4.2.2 user over quota; cannot receive new mail: user2@domain.com (in reply to RCPT TO command))
     23 dsn=4.2.2, 451 4.2.2 user over quota; cannot receive new mail: user3@domain.com (in reply to RCPT TO command))

 

Here you can see that all lines show the same error but for a different user. If you want to eliminate this error to see any other errors, you may exclude the DSN status code. For example, at the end of the filter you can add “|grep –v dsn=4.2.2|more” to exclude all lines that show DSN 4.2.2 for “user over quota” errors. With this modification you will see all the other reasons that doesn’t have to do with user over quota conditions.

 

Another way to see this responses is to filter exclusively for the DSN, this is because the numerical code always refers to the same reason even when they may be differences in the descriptions. To obtain an ordered DSN list without the descriptions you can use the following command:
# cat maillog|grep status=deferred|awk ‘{print $11}’|grep dsn= > dsn4XX.txt;cat maillog|grep status=deferred|awk ‘{print $12}’|grep dsn= >> dsn4XX.txt;cat dsn4XX.txt | sort|uniq -c|sort -nr|more

 

The output of this command will give you a list of all temporarily rejection codes along with its counter as shown below:
  76661 dsn=4.4.1,
   4590 dsn=4.4.3,
   3299 dsn=4.7.7,
   2989 dsn=4.7.1,
   1066 dsn=4.3.2,
    572 dsn=4.0.0,

 

You may use any of these commands depending on the information you need to obtain.

 

 

DETERMINING THE MEAN, MINIMUM AND MAXIMAL LIFETIME OF THE MAIL BEFORE THE TEMPORAL REJECTION

 

This value indicates how much time a mail is being stored inside Postfix without being able to be delivered. It also indicates how bad is the performance degradation because of this queuing, for example, it’s not the same to have a mean value of 30 minutes than 12 hours for this indicator. Very high values may indicate a possible situation where our IP is being black listed by these domains, or the domains may exist but there is no mail server anymore listening on that IP. These and many other reasons may produce a spent of resources on mails that no matter what you do, they’ll never be able to be delivered successfully so your Postfix is just wasting network connections, storage, cpu, and memory processing them.

 

The command to obtain this value is the same we used to determine the same value for mails with a permanent rejection except for the reason as shown below:
# cat maillog|grep status=bounced|awk ‘{print $9}’|grep delay=|sed ‘s/delay=//g’|sed ‘s/,//g’ > delayBounced.txt; cat maillog|grep status=bounced|awk ‘{print $10}’|grep delay=|sed ‘s/delay=//g’|sed ‘s/,//g’ >> delayBounced.txt; cat delayBounced.txt |awk ‘NR == 1 { max=$1; min=$1; sum=0 }{ if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;} END {printf “Min: %d\tMax: %d\tAverage: %f\n”, min, max, sum/NR}’

 

 

DETERMINING THE AMOUNT OF ACTIVE MAILBOXES

 

If you’re using IMSVA, you should know by now that it uses Postfix as its MTA server, acting like an SMTP Proxy, so no local mailboxes live inside the box, they are at the protected internal mail server. The number of protected mailboxes is important because it gives you a context for all the indicators we’ve seen. When you put the data together you’ll be able to link the numerical values of the indicators to the number of mailboxes that were active during the analysis.

 

To obtain this value you can use the following filter:
# cat maillog |grep from=|awk ‘{print $7}’| sort|uniq -c|sort -nr|wc -l

 

If you have more than one maillog files available for your analysis, then you can execute the same command for all files as shown in the filter below:
# cat maillog maillog.1 maillog.2 maillog.3 maillog.4 .. maillog.N|grep from=|awk ‘{print $7}’| sort|uniq -c|sort -nr|wc –l

 

The output of those commands will give the number of the mailboxes that were active during the maillog capture. If you want to know exactly which mailbox are they, you can replace the las “|wc –l” with “> listMailbox.txt” to generate a file with the complete list.

 

You should notice that this number represents only the amount of mailboxes Postfix has seen using the service during the period of time the maillog has captured. If you want to know a more exact number you should execute this command on the most amount of maillog files to evaluate different times and days.

 

 

Conclusion

The commands presented here along with the explanation of what each of the indicators represent will give you enough information to create a table that gathers all the necessary information you need to know what the actual performance of your Postfix or IMSVA device is. On the next post (Part 2. Postfix Maillog indicators interpretation), we’ll show you some examples with real data that may help you in giving a more precise meaning to these numbers.

PART I. Postfix Maillog Performance Indicators

PART II. Maillog Indicators Interpretation

For more information we invite you to visit our Postfix section. If you want to learn more about the rules to create, transmit and process and email you can check out our publication on The SMTP Protocol Fundamentals.

Remember that you can send us your comments and questions to our Twitter account: @redinskala where you’ll also find more information and security tips.

Thanks for your visit!

 

2 thoughts on “Maillog Analysis for Postfix and IMSVA (Part I)

  1. Pingback: Maillog Analysis for Postfix and IMSVA (Part II) | RedinSkala

  2. Pingback: Análisis del archivo Maillog para Postfix e IMSVA (Parte I) | RedinSkala

Comments are closed.