Module 2
Website Logging for Security
Last updated on: 1 October 2024
Any website that’s exposed to the internet is under constant attack. At the very least, it’s being deluged by undirected attacks from hordes of bots operated by criminal actors. More concerning are targeted attacks; even unskilled attackers, given perseverance and luck, can find vulnerabilities in a website.
Ideally, website owners should be aware of the threats they are facing. In particular, they will want to know if an attacker is close to finding, or has recently found, a vulnerability in their site. And if a vulnerability is exploited, they will want to know where the vulnerability is and how long it has been exploited. Website logs can support all of these needs.
On the other hand, excessive logging can present a risk to the users of web sites. If a site logs sensitive information, and those logs are acquired by an adversary (e.g., seizure by law enforcement or compromise by hackers), then sensitive information could easily end up in the wrong hands.
This subtopic will cover approaches to website logging to maximize utility to website owners and minimize risk to site users.
After completing this subtopic, the practitioner should be able to do the following:
- Understand what built-in web server logs do and do not record
- Add application-level logging that supports the detection and investigation of attacks
- Keep sensitive data out of logs
- Perform basic log analysis to spot potential attacks
Even with the best possible skills, dedication, processes, and intentions, it’s nearly impossible to develop a website that’s completely resistant to every kind of attack. Given enough time and bad luck, every site will have a security incident. When that happens, it’s important to have logging in place that supports the detection and investigation of security events. At the same time, it’s important that a website’s logs don’t pose additional risks themselves. This subtopic will teach you how to approach logging to maximize a site’s security. It will discuss built-in server logs, application-level logging, keeping sensitive data out of logs, and basic log analysis.
Various web platforms have their own logging systems. They can be relied upon to record data on every request and response, but are generally not sufficient for all incident response needs. Let’s go over what’s available in some common frameworks’ logs.
Apache is the most popular full-featured web server on the internet, serving more active sites than any other. By default, it logs events to files on the web server’s filesystem. There are two files: access_log and error_log. The access log contains structured information about each request, while the error log contains more semi-structured data about things that have gone wrong.
The access log has one line per entry, with a configurable format. The default format consists of the following fields, each separated by a space:
- The IP address of the client that made the request
- The identity of the client as reported by identd (almost always just "-")
- The userid of the authenticated user, if HTTP authentication was used (otherwise "-")
- The time the request was received, enclosed in square brackets
- The request line sent by the client, enclosed in double quotes (e.g., "GET / HTTP/1.1"). These fields will always have spaces.
- The HTTP status code returned (e.g., 200)
- The size of the response, in bytes
Here’s an example:
127.0.0.1 - - [13/Dec/2023:13:55:36 -0700] "GET / HTTP/1.1" 200 2326
Note that each Apache server can be configured to log more or less data. For more information, see the Apache documentation. For more about the Apache access log and how to use it, see this article.
The error log consists of a mix of messages from Apache that are in a semi-structured format and error messages from websites running on the server, without any enforced structure. The default structure of error log entries from Apache itself is one line per entry, with the following fields, again separated by spaces:
- The time the error occurred, enclosed in square brackets
- The module producing the message and its severity level
- The process ID (and, on some platforms, the thread ID) that handled the request
- The client IP address and port, for errors related to a specific request
- The error message itself
This article provides more information about using the Apache error log.
IIS is the default Windows web server and is also very widely used. Like Apache, IIS by default logs requests to the web server’s filesystem. There are several log formats available, but the default is the W3C format, which logs fields including the following, separated by spaces:
- The date and time of the request
- The client IP address
- The HTTP method (GET, POST, HEAD, etc.)
- The URI stem requested (/, /index.htm, /posts/34/reply, etc.)
- The HTTP status code returned (200, 404, 500, etc.)
- The protocol version (HTTP/1.1)
Note that the default logs do not log the query string, so for example a request to http://example.com/profile?profileID=34 will only log /profile. For more information on the IIS access logs, see the Microsoft documentation.
Error logs under IIS are slightly more complicated. Depending on the error, they may go to the HTTP.SYS HTTPERR log file or to the Windows event log.
The HTTPERR file contains protocol-level errors and is in a structured format, with fields including the following, separated by spaces:
- The date and time of the request
- The client IP address and port
- The server IP address and port
- The protocol version (HTTP/1.1)
- The HTTP method (GET, POST, HEAD, etc.)
- The URL requested
- The HTTP status code returned (200, 404, 500, etc.)
- The reason the request failed
For more information on the error log, see the Microsoft documentation.
The Windows event log contains errors generated by the application server (e.g., ASP.NET) or the application itself. These are available in the Windows Event Viewer and are semi-structured, including:
- The date and time of the event
- The severity level (Information, Warning, Error)
- The source of the event and an event ID
- The error message itself
For more information on finding error logs on Windows, see this article.
Depending on how you count, nginx may be the most popular web server on the internet. However, it is fairly limited, usually acting as a reverse proxy in front of a back-end web server or serving static files.
The default access logs are similar to the default Apache logs, but with the following fields added at the end of each line:
- The referring page (the Referer header sent by the client)
- The user-agent string identifying the client’s browser or tool
For more information about nginx logs, see the official documentation.
nginx error logs are semi-structured, with the following fields, separated by spaces:
- The date and time of the error
- The severity level
- The process and thread IDs that handled the request
- The error message, often followed by context such as the client address and the request line
For more information, see this article.
If a site is behind a CDN, it’s often useful to see the logs of the requests to the CDN, as opposed to the requests from the CDN to the origin site. Each CDN provider provides logs differently and has different pricing structures for logging.
When setting up server logging, there are a few steps that should be taken to maximize the security value of the logs.
Make sure the logs contain at least the IP address of the requestor, full URI requested (including the query string), time taken to serve the request, response size, referer, and user-agent. This information can be extremely helpful when investigating an incident.
Try to get the logs off of the web server as quickly as possible. If the server itself is compromised, attackers will likely try to hide their tracks by deleting or modifying the server logs. Some ways of accomplishing this include:
- Streaming log entries in near-real time to a separate, hardened log server or third-party logging service (a minimal sketch of this approach follows below)
- Frequently copying log files to separate storage that the web server cannot modify or delete
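As one illustration of the streaming approach, here is a minimal Python sketch that follows a web server’s access log and forwards each new line to a remote syslog collector. The log file path and the collector hostname and port are assumptions for illustration only; in practice, a dedicated agent (for example, the operating system’s own syslog forwarder or your logging provider’s shipper) is usually a better fit.

import time
import logging
import logging.handlers

# Hypothetical collector address; replace with your own log server or service.
shipper = logging.getLogger("log-shipper")
shipper.setLevel(logging.INFO)
shipper.addHandler(
    logging.handlers.SysLogHandler(address=("logs.example.internal", 514))
)

def follow(path):
    """Yield lines as they are appended to a log file (similar to tail -f)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(1)

# Assumed Apache log location; adjust for your server.
for line in follow("/var/log/apache2/access.log"):
    shipper.info(line)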
Even when fully configured, built-in server logs miss a lot of important information. Some examples:
- The parameters sent in POST requests
- The identity of the logged-in user making each request
- Application errors and the internal context around them (e.g., the database query or file that failed)
Much of this information isn’t included for good reason: some of it has implications for user privacy, and some (like useful error logging) requires insight into the application itself, so it can’t be captured by the web server.
The main purpose of application-level logging in a web application is to overcome the limitations of server logging. There are numerous articles describing best practices for logging; several are linked in the resources for this module.
These resources should get you set up with the knowledge you need to integrate security logging into an existing (or new) web application.
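To make this concrete, here is a minimal Python sketch of application-level security logging. The event names, fields, and log file path are illustrative assumptions rather than a prescribed format; the key idea is to emit one structured record per security-relevant event so that log shippers and command-line tools can parse it.

import json
import logging

security_log = logging.getLogger("security")
security_log.setLevel(logging.INFO)
# Hypothetical log destination; in production this would feed your central log pipeline.
security_log.addHandler(logging.FileHandler("/var/log/myapp/security.log"))

def log_security_event(event, path, source_ip, user_ref=None, detail=None):
    """Write one JSON object per line so log tooling can parse it."""
    security_log.info(json.dumps({
        "event": event,          # e.g. "login_failure" or "validation_error"
        "path": path,            # the URL path that triggered the event
        "source_ip": source_ip,
        "user": user_ref,        # an internal ID or hashed session, never raw credentials
        "detail": detail,
    }))

# Example usage (values are made up):
# log_security_event("login_failure", "/login", "203.0.113.7", detail="unknown username")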
When overcoming the limitations of built-in server logging, we want to make sure that we don’t put site users at risk. Logs are frequently less well protected than production databases. First, logs are not as obvious a target as a production database, so people tend not to focus on them as much when putting security measures in place. Second, more users at an organization are often given access to logs than are granted access to a production database. Third, logs tend to be sent to many different systems, whereas production databases tend to stay centralized. Because of this, it’s worth considering redacting sensitive information in logs.
This article presents some general best practices for handling sensitive data during logging. Here are some approaches to consider for specific sorts of data:
It is recommended practice not to include sensitive information in GET parameters, which is why GET parameters are typically logged while POST parameters are not. However, it can be extremely useful to have access to information about POST parameters when responding to an attack. One approach is to log a redacted version of POST parameter values that preserves likely attack signatures while masking the underlying user data, as in the following Python sketch:
import re

def redact(target):
    # Substrings commonly seen in injection attacks are kept as-is; everything
    # else is masked (digits become "1", other word characters become "A",
    # spaces are kept, and all remaining characters become "*").
    keep = ['select', 'where', 'from', 'and', 'script', 'on', 'src', '../', '<', '>']
    output = ''
    i = 0
    while i < len(target):
        matched = False
        for j in keep:
            if target[i:i+len(j)].lower() == j.lower():
                # Keep the interesting substring verbatim and skip past it
                output = output + target[i:i+len(j)]
                i = i + len(j)
                matched = True
                break
        if not matched:
            if ' ' == target[i:i+1]:
                output = output + ' '
            elif re.search(r'\d', target[i:i+1]):
                # Check digits before \w, since \w also matches digits
                output = output + '1'
            elif re.search(r'\w', target[i:i+1]):
                output = output + 'A'
            else:
                output = output + '*'
            i = i + 1
    return output
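For example, with this function a parameter value like select * from users would be logged as select * from AAAAA, preserving the likely attack signature while masking the actual data, and an ordinary value like name=alice&id=1234 would be reduced to AAAA*AAAAA*AA*1111.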
If a request causes an error that looks like an attempt to hack or bypass controls, the website should aggressively log the request information. Examples include:
- Database errors caused by malformed queries (possible SQL injection)
- Attempts to read files that don’t exist or that sit outside the web root (possible path traversal)
- Failed permission checks, such as a user requesting another user’s data (possible access control bypass)
If any of these happen, it’s a good idea to log the request, as well as internal information (e.g., database query, filename, etc.). In the good case, there’s simply a bug in the site, and the logs provide plenty of debugging information. In the bad case, the site is being compromised, and the logs make it easier to find where the compromise occurred, so that forensics is more effective.
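As an illustration, here is a minimal Python sketch of this kind of aggressive logging around a failing database query. The request object and its attributes are assumptions modeled on typical web frameworks, not any specific API.

import logging

security_log = logging.getLogger("security")

def run_query(db, query, params, request):
    """Run a query and, if it fails, log enough context to investigate later."""
    try:
        return db.execute(query, params)
    except Exception as exc:
        # A failing query may be a simple bug -- or a sign of an injection attempt.
        security_log.error(
            "query failed: %s | query=%r | path=%s | source_ip=%s | user_agent=%s",
            exc,
            query,
            request.path,                       # assumed framework attribute
            request.remote_addr,                # assumed framework attribute
            request.headers.get("User-Agent", "-"),
        )
        raise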
Logging the identity of a logged-in user can be dangerous, but there are steps that can be taken to mitigate the danger. It’s questionable to log session cookies, but a hash of a session ID can be used to track a user’s activity across the site. Also, if the web server has a queryable directory of active user sessions, then either an internal ID can be used in logs, or the existing session IDs can be hashed to identify the log entries of a logged-in user. This will allow site owners to identify an active attacker, while making the identities in the logs useless to a threat actor on their own.
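For example, here is a minimal Python sketch of hashing a session ID with a keyed hash (HMAC) before writing it to a log. The key handling and truncation length are illustrative choices; the important property is that someone who obtains the logs cannot recover or reuse the original session cookie.

import hmac
import hashlib

# Secret key known only to the application; assumed to be loaded from configuration.
LOG_HASH_KEY = b"replace-with-a-long-random-secret"

def loggable_session_ref(session_id):
    """Return a short, stable identifier for a session that is safe to log."""
    digest = hmac.new(LOG_HASH_KEY, session_id.encode(), hashlib.sha256).hexdigest()
    return digest[:16]  # long enough to correlate log entries, useless for hijacking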
Read through the following example commands, which use common Unix tools like awk, sort, uniq, and grep to perform the analysis on Apache and Nginx logs.
awk is a powerful command-line tool for manipulating text files in Unix-like operating systems. It has a simple syntax. The basic structure of an awk command is as follows:
awk 'pattern { action }' file
For example, let’s consider the following text file (we will call it example.txt):
apple red 5
banana yellow 10
pear green 15
Orange orange 20
awk scans the input file line by line and performs the specified action on each line where the pattern matches. awk automatically splits each line of input into fields based on whitespace (by default). Fields can be referenced using $1, $2, etc., where $1 refers to the first field, $2 to the second, and so on.
For example, to print the first column with awk, we can use:
awk '{ print $1 }' example.txt
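Run against example.txt above, this prints each fruit name on its own line:
apple
banana
pear
Orange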
We can also use conditional filtering. For example, to print lines where the third column is greater than 10:
awk '$3 > 10 {print $1, $3}' example.txt
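With the sample file above, this prints only the lines whose third column exceeds 10:
pear 15
Orange 20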
To use a custom delimiter with awk, use the -F option followed by the delimiter character. For example, if we have a comma-delimited file, we can use -F',' (enclosing the delimiter character in single quotes) to specify a comma (,) as the delimiter.
awk -F',' '{print $1, $3}' comma-delimited.txt
We can also do calculations using awk. This command calculates the sum of the values in the third field across all lines and prints the total at the end. END is a special pattern used to execute statements after the last record is processed.
awk '{total += $3} END {print "Total:", total}' example.txt
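For example.txt above, this prints Total: 50.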
There are some built-in variables in awk. For example, NR is a built-in variable that represents the current record number; it increments by one for each line read from the input file(s). If you want to print line numbers in addition to line content, you could use the following:
awk '{print NR, $0}' example.txt
Spend some time playing around with the following awk commands. You can use a log from your own web server or use practice ones, such as this collection.
Identify the total number of requests recorded in the access log.
cat apache_access.log | wc -l
Determine the most frequently requested URLs.
awk '{print $7}' apache_access.log | sort | uniq -c | sort -nr | head -5
This awk command prints the seventh column from each line of the log and pipes the output into the sort command. sort is used to sort lines of text alphabetically or numerically; by default, it sorts in ascending order. After sorting, the uniq -c command counts the occurrences of each unique line in the sorted output. The sort -nr command then sorts the output numerically (-n) in reverse order (-r), so the lines with the highest counts appear first. Finally, the head -5 command displays the first 5 lines of the input.
Find out the top 5 IP addresses making requests to the server.
awk '{print $1}' apache_access.log | sort | uniq -c | sort -nr | head -5
Analyze the distribution of request methods.
awk '{print $6}' apache_access.log | sort | uniq -c
Count the total number of requests in an Nginx access log.
cat nginx_access.log | wc -l
Identify the most requested URLs and their corresponding status codes.
awk '{print $7, $9}' nginx_access.log | sort | uniq -c | sort -nr | head -5
Calculate the average size of responses (in bytes).
awk '{sum+=$10} END {print "Average request size:", sum/NR, "bytes"}' nginx_access.log
This awk command calculates the average response size by summing up the values in the 10th column (in the default log format, this is the size of the response body in bytes) for all lines in the nginx_access.log file. It then divides the total by the number of lines (NR) to get the average size in bytes, and prints the result along with a descriptive message.
Make sure that the 10th column of your nginx_access.log file actually represents the response size in bytes, as the accuracy of the calculation depends on the column indexing.
Determine the top 5 user agents accessing the server.
awk -F'"' '{print $6}' nginx_access.log | sort | uniq -c | sort -nr | head -5
This command uses awk to set the field separator (-F) to double quotes ("), then extracts the 6th field from each line of the nginx_access.log file. With the default combined log format, this field is the user-agent string, which is enclosed within double quotes. The extracted user-agent strings are then piped to sort to arrange them alphabetically. uniq -c is used to count the occurrences of each unique user agent. The output is piped again to sort -nr to sort the results numerically in descending order based on the count. Finally, head -5 displays the top 5 user agents with the highest occurrence counts.
Analyze the distribution of requests by hour of the day.
awk '{print $4}' nginx_access.log | cut -c 14-15 | sort | uniq -c
awk is used to extract the 4th field ($4) from each line of the nginx_access.log file, which contains the timestamp. The cut command is then applied to extract characters 14 to 15 from each timestamp, which correspond to the hour. The extracted hour values are piped to sort to arrange them in ascending order, and uniq -c counts the occurrences of each unique hour value. The output will display the count of log entries for each hour in the log file.
Count the number of error entries in the Apache error log.
cat apache_error.log | grep 'error' | wc -l
And the same for the Nginx error log:
cat nginx_error.log | grep 'error' | wc -l
Identify the most common errors in the Apache error log. Here, awk '{print $NF}' reads each line of input data, splits it into fields (separated by whitespace by default), and then prints the value of the last field from each line.
cat apache_error.log | grep 'error' | awk '{print $NF}' | sort | uniq -c | sort -nr | head -5
The number at the beginning of each line shows how many times a particular error occurred in the log. In this case, “2047” means that the error with the last field “757” occurred 2047 times.
The last field represents different things in each line. It could be a file path, a specific action, or some other identifier related to the error. For instance, “757” or “154” could be error codes or unique identifiers, while “/home/mysite/public_html/new/wp-content/plugins/woocommerce/includes/data-stores/abstract-wc-order-data-store-cpt.php:100” could be a file path and line number where the error occurred.
The same analysis for the Nginx error log:
cat nginx_error.log | grep 'error' | awk '{print $NF}' | sort | uniq -c | sort -nr | head -5
Group error lines by their first field:
cat apache_error.log | grep 'error' | awk '{print $1}' | sort | uniq -c
List the ten most common error messages, ignoring the first three fields of each line:
cat apache_error.log | grep 'error' | awk '{$1=""; $2=""; $3=""; print}' | sort | uniq -c | sort -nr | head -10
For this exercise, we use log files from this collection (the same collection as the other files in this practice section).
In this task we are going to use regular expressions. Regular expressions (regex) are like powerful search tools that help you find specific patterns in data. For example, if you’re investigating suspicious network traffic and you know that malicious requests often contain certain patterns of characters, you can use regex to search through logs or traffic captures to find those requests. Regex allows you to define flexible search patterns. For example:
- [a-z] range - Matches a single character in the range “a” to “z”. Case sensitive. For example, [g-s] matches any one character between g and s inclusive.
- [A-Z] range - Matches a single character in the range “A” to “Z”. Case sensitive.
- [0-9] range - Matches a single digit in the range “0” to “9”.
We can also use quantifiers to match a specified number of the previous token: {1,3} will match 1 to 3 occurrences, {3} will match exactly 3, and {3,} will match 3 or more.
[a-d]{3} matches any sequence of exactly three characters, each of which can be any lowercase letter from ‘a’ to ’d’. So it would match strings like ‘abc’, ‘bda’, ‘cad’, etc.
Some characters have special meanings within regexes. These characters are:
Symbol | Name | Description |
---|---|---|
\ | Backslash | Used to escape a special character |
^ | Caret | Beginning of a string |
$ | Dollar sign | End of a string |
. | Period or dot | Matches any single character |
| | Vertical bar or pipe symbol | Matches previous OR next character/group |
? | Question mark | Match zero or one of the previous |
* | Asterisk or star | Match zero, one or more of the previous |
+ | Plus sign | Match one or more of the previous |
( ) | Opening and closing parenthesis | Group characters |
[ ] | Opening and closing square bracket | Matches a range of characters |
{ } | Opening and closing curly brace | Matches a specified number of occurrences of the previous |
In our task, we will use a backslash to escape the special character “\”.
You can read more about regex here: https://en.wikipedia.org/wiki/Regular_expression
If you check the provided nginx access log, you can see these kinds of lines:
181.214.166.113 - - [15/Feb/2024:15:05:19 -0500] "[L\x9E\x804\xD9-\xFB&\xA3\x1F\x9C\x19\x95\x12\x8F\x0C\x89\x05\x81" 400 181 "-" "-"
45.95.169.184 - - [15/Feb/2024:15:49:27 -0500] "\x10 \x00\x00BBBB\xBA\x8C\xC1\xABDAAA" 400 181 "-" "-"
As you can see, both lines contain \x followed by exactly two characters in hexadecimal notation (so they use the digits 0-9 and the letters A to F), such as \x9C, \x10, \xBA, etc. To match these sequences we can use the pattern ‘\\x[a-fA-F0-9]{2}’, where \\x matches a literal backslash followed by x, [a-fA-F0-9] is our token (a single hexadecimal character), and {2} is a quantifier meaning exactly two of them.
We will use the grep command to search for the specified pattern in text. For example:
grep 'abcd'
will filter all lines containing the string “abcd”.
The “-E” option in the grep command enables the use of extended regular expressions for pattern matching. For example:
grep -E 'abcd\\[0-9]{2}'
filters text like abcd\34, abcd\47, etc.
For these exercises, we use nginx log files from this collection (the same collection as the other files in this practice section).
Use the ‘\\x[a-fA-F0-9]{3}’ regex to filter requests from the nginx access.log containing a suspicious payload. This regex matches a sequence starting with ‘\x’ followed by exactly three hexadecimal characters (0-9, a-f, or A-F). How many lines are there?
Correct answer: 131 lines
Command(s) to execute: grep -E '\\x[a-fA-F0-9]{3}' nginx_access.log|wc|awk '{print $1}'
Which IP address sent the most requests containing such payloads, and how many were there?
Correct answer: 222.186.13.131, 19 lines
Command(s) to execute: grep -E '\\x[a-fA-F0-9]{2}' nginx_access.log|sort|awk '{print $1}'| sort | uniq -c | sort -nr
Look at the error log by running more nginx_error.log. You can quit this command with ctrl+c or press the “q” key to return to the command prompt. Excluding “PHP Notice” errors, what kind of critical errors can you find in the log?
Correct answer: SSL handshaking errors
Command(s) to execute:
more nginx_error.log
cat nginx_error.log|grep -v "PHP"|grep crit
Attackers also tried to access a forbidden file. Which file was it?
Correct answer: .git/config
Command(s) to execute: cat nginx_error.log|grep -v "PHP"|grep forbidden
This skill check will be much easier if you’ve first completed the practice exercise above.
You are given an nginx access log from a website under attack to investigate, which you can download here.
Locate a suspicious path that is being targeted, extract the IP addresses that are sending suspicious requests, and find out which countries those IPs are in (you can use geoIP databases, described in more detail in the malicious infrastructure learning path, for this). You can use standard CLI tools like awk, grep, sort, and uniq. To find out AS numbers and countries, we recommend using relevant online lookup services.
Hint: ipinfo.io provides a convenient way of looking up IP details; you can use curl to fetch those.
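If you prefer to script the lookups rather than calling curl for each address, here is a minimal Python sketch using the same ipinfo.io endpoint. The IP addresses shown are placeholders to replace with the ones you extract from the log; note that unauthenticated requests to ipinfo.io are rate limited, so a larger batch may require an API token.

import json
import urllib.request

# Placeholder IPs; replace with the addresses you extracted from the access log.
ips = ["203.0.113.7", "198.51.100.23"]

for ip in ips:
    with urllib.request.urlopen("https://ipinfo.io/" + ip + "/json") as resp:
        info = json.load(resp)
    # "org" typically contains the AS number and name, "country" the two-letter code.
    print(ip, info.get("country"), info.get("org"))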
Log Files - Apache (free): An overview of how to read log files in the Apache web server.
Understanding the Apache Access and Error Log, Resource 1 (free): First of two pieces on how to read the Apache web server’s logs.
Understanding the Apache Access and Error Log, Resource 2 (free): Second of two pieces on how to read the Apache web server’s logs.
Server-side logging (free): An analysis of logs within the Microsoft IIS server.
IIS Error Logs and Other Ways to Find ASP.Net Failed Requests (free): Another look at IIS logs and how we can search for application errors therein.
Configuring logging on nginx (free): Documentation by the NGINX web server on how to configure and work with logs.
A guide to NGINX logs (free): An overview of different NGINX logs and their formats.
Security Log: Best Practices for Logging and Management (free): An analysis of when logs are useful, how we can analyze them, and what policies we can create around them.
OWASP logging cheat sheet and vocabulary, Resource 1 (free): A guide from OWASP on what purpose logs should serve, how we should analyze them, and a standard vocabulary for them.
OWASP logging cheat sheet and vocabulary, Resource 2 (free): A companion OWASP guide covering the same topics.
Keep Sensitive Data Out of Your Logs: 9 Best Practices (free): Thorough logging can also end up including sensitive data, which could put users at risk. This guide looks at how we can adapt our logging practices to exclude sensitive data from logs.
Congratulations on finishing Module 2!