Parse log files with PowerShell

by Vlad Drumea

In this post I’ll go over some examples of how to parse log files with PowerShell, specifically my blog’s raw access log from cPanel.

Access log file structure

Each line in the access log represents a resource being accessed.

1xx.xxx.xxx.xxx - - [08/Nov/2024:20:38:03 +0200] "GET /2024/04/15/fix-ssl-certificate-error-1416f086-in-sqlcmd-on-linux/ HTTP/1.1" 200 33216 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 OPR/114.0.0.0"

And the format is as follows:

Log row portion, followed by its explanation:

1xx.xxx.xxx.xxx
  The IP address of the visitor. Note that I’ve redacted it here because I don’t want to disclose anyone’s IP address.

[08/Nov/2024:20:38:03 +0200]
  Timestamp containing the date and time when the page/resource was accessed.

"GET /2024/04/15/fix-ssl-certificate-error-1416f086-in-sqlcmd-on-linux/ HTTP/1.1"
  Access method, in this case GET (content download).
  Page or resource, in this case the page for a blog post.
  The communication protocol used.

200
  The HTTP response status code that the visitor got when accessing this page.

33216
  Content size of the accessed page/resource, in bytes.

"https://www.google.com/"
  Referrer from where the visitor got to this page. In this case the visitor got here from Google.

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 OPR/114.0.0.0"
  Details about the visitor’s user agent.

Parse access log files with PowerShell to get useful info

First, load the log file name in a variable to make things easier.
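
A minimal sketch of that step. The path is a hypothetical placeholder, not the real location of the log, so swap in wherever you saved the raw access log.

```powershell
# Hypothetical path; replace it with the location of your downloaded raw access log.
$LogFile = 'C:\Temp\access.log'

# Optional sanity check before running the longer pipelines.
if (-not (Test-Path -LiteralPath $LogFile)) {
    Write-Warning "Log file not found: $LogFile"
}
```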

Get referrer URLs
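
The sketch below pieces the command together from the steps listed in the breakdown further down. Wrapping it in a function, and the name Get-ReferrerCount, are assumptions made here so the example can be run against any file; the pipeline stages themselves are the ones the breakdown describes.

```powershell
# Hypothetical wrapper; the pipeline inside follows the steps described in the breakdown.
function Get-ReferrerCount {
    param(
        [Parameter(Mandatory)]
        [string]$Path   # path to the raw access log
    )

    Select-String -Path $Path -Pattern '"GET .*" 200 [0-9]+ "http.*' |  # successful GETs that have an http(s) referrer
        Select-String -Pattern 'vladdba.com' -NotMatch |                # drop lines that mention the blog itself
        ForEach-Object { $_.Line.Split(' ')[10] } |                     # keep only the referrer field
        Group-Object |                                                  # count occurrences per referrer
        Where-Object Name -ne '"-"' |                                   # discard empty referrers
        Sort-Object Count -Descending |                                 # most frequent referrer first
        Select-Object Count, Name
}
```

With the log path stored in $LogFile, this would be called as Get-ReferrerCount -Path $LogFile.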


The output lists each referrer next to its hit count, with the most frequent referrer first.


Breaking down the PowerShell command

Command, followed by its explanation:

Select-String $LogFile -Pattern '"GET .*" 200 [0-9]+ "http.*'
  Searches through the log file and outputs lines that match a specific pattern.
  "GET .*" - Matches lines with "GET" followed by any characters (like a page or resource path).
  200 - Ensures the HTTP response code is 200 (OK/success).
  [0-9]+ - Matches a content size of one or more digits.
  "http.* - Ensures the line contains a referrer starting with "http".

Select-String -notmatch 'vladdba.com'
  Filters out any results that have my own blog as the referrer.

ForEach-Object { $_.Line.Split(' ')[10] }
  Splits each matching line into an array of strings, using spaces as the delimiter, and outputs the 11th element from the split array (array indexing starts at 0).

Group-Object
  Groups the extracted items (from the previous step) by their value and outputs the values and the count of their occurrences.

Where-Object name -ne '"-"'
  I might still get some stragglers where the value (referrer) is "-", which doesn’t help much, so I’m filtering it out here.

Sort-Object Count -descending
  Sorts the grouped items by the Count property in descending order (the referrer that accounts for the most visits will be first).

Select-Object count, name
  Selects and outputs only the Count and Name properties for the sorted items.
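
To see why index 10 lands on the referrer, here is a small illustration that splits the sample line from earlier and prints each piece with its index. The IP address is a made-up documentation address standing in for the redacted one, and the user agent is shortened.

```powershell
# Sample line; the IP is a made-up documentation address, the user agent is shortened.
$line = '203.0.113.7 - - [08/Nov/2024:20:38:03 +0200] "GET /2024/04/15/fix-ssl-certificate-error-1416f086-in-sqlcmd-on-linux/ HTTP/1.1" 200 33216 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"'

# Split on single spaces, exactly like the ForEach-Object step does.
$parts = $line.Split(' ')

# Print the first 11 pieces next to their index.
0..10 | ForEach-Object { '{0,2}: {1}' -f $_, $parts[$_] }

$parts[6]    # requested path
$parts[8]    # status code
$parts[10]   # referrer, still wrapped in double quotes
```

The user agent contains spaces of its own, so everything after index 10 ends up split into a variable number of pieces.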

Get referrer URLs while excluding major search engines

This is a minor variation on the above code that also excludes major search engines (Google, Yandex, Bing, Baidu, etc.).


There’s not much to explain here: the records matching any major search engine are excluded at the same step where visits referred from my own blog are excluded.
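
A small sketch of just that exclusion step, run against three made-up log lines instead of the real log. The exclusion list here is illustrative (the blog plus a handful of search engines) and is an assumption rather than the exact list used for the results.

```powershell
# Illustrative exclusion list: the blog itself plus a few search engines (extend as needed).
$Exclude = 'vladdba.com|google|yandex|bing|baidu'

# Made-up sample lines standing in for the matches coming out of the first Select-String.
$sample = @(
    '203.0.113.7 - - [08/Nov/2024:20:38:03 +0200] "GET /2024/04/15/a/ HTTP/1.1" 200 100 "https://www.google.com/" "Mozilla/5.0"'
    '203.0.113.8 - - [08/Nov/2024:20:39:03 +0200] "GET /2024/04/15/a/ HTTP/1.1" 200 100 "https://example.org/links" "Mozilla/5.0"'
    '203.0.113.9 - - [08/Nov/2024:20:40:03 +0200] "GET /2024/04/15/a/ HTTP/1.1" 200 100 "https://vladdba.com/" "Mozilla/5.0"'
)

# Keep only the lines that match none of the alternatives, then pull out the referrer.
$kept = $sample |
    Select-String -Pattern $Exclude -NotMatch |
    ForEach-Object { $_.Line.Split(' ')[10] }

$kept   # only the example.org referrer is left
```

Keep in mind that the pattern is tested against the whole line, not just the referrer field, and Select-String is case-insensitive by default, so a user agent containing one of these words would also cause the line to be dropped.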

Get pages by hits
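
A sketch of the command, pieced together from the breakdown that follows. The function wrapper and its name are hypothetical, and the grouping and sorting steps are assumed to be the same as in the referrer example, since only the differences are described.

```powershell
# Hypothetical wrapper around the pipeline described in the breakdown.
function Get-PageHit {
    param(
        [Parameter(Mandatory)]
        [string]$Path   # path to the raw access log
    )

    # Blog post URLs (/2xxx...) or the site root (/), served with a 200 status.
    $pattern = '"GET \/2[0-9][0-9][0-9].*" 200 [0-9]+|"GET \/ .*" 200 [0-9]+'

    Select-String -Path $Path -Pattern $pattern |
        ForEach-Object { "https://vladdba.com" + $_.Line.Split(' ')[6] } |  # rebuild the full URL from the requested path
        Group-Object |                                                      # count hits per URL (assumed step)
        Sort-Object Count -Descending |                                     # most visited page first (assumed step)
        Select-Object Count, Name
}
```

With the log path stored in $LogFile, this would be called as Get-PageHit -Path $LogFile.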


Breaking down the PowerShell command

To avoid pointless repetition, I’ll only focus on the differences.

Command, followed by its explanation:

Select-String $LogFile -Pattern '"GET \/2[0-9][0-9][0-9].*" 200 [0-9]+|"GET \/ .*" 200 [0-9]+'
  Searches the log file for any lines matching one of two patterns (the | between the patterns represents an OR) and outputs the result.
  First pattern: "GET \/2[0-9][0-9][0-9].*" 200 [0-9]+
  Lines that contain GET followed by a space, then /2 and three digits (e.g. /2024), that have a 200 status code and any valid content size. This would match any valid blog post link.
  Second pattern: "GET \/ .*" 200 [0-9]+
  Lines that contain GET followed by / and a space (the root of the site) that have a 200 status code and a valid content size. This would only match https://vladdba.com/

ForEach-Object { "https://vladdba.com"+$_.Line.Split(' ')[6] }
  Similar to the previous examples, this extracts the relevant portion from each line, while also prepending https://vladdba.com to the extracted value.

Get media accessed from other sites

This one’s actually what got me into parsing access log files with PowerShell in the first place.

Specifically, it was after the incident where a college decided to not only plagiarize my content, but also offload its storage and bandwidth costs for the images by hot-linking them directly from my site. You can read the blog post detailing everything here.
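
A sketch of the command, pieced together from the breakdown further down. The function wrapper and its name are hypothetical, and the grouping and sorting steps are assumed to match the earlier examples.

```powershell
# Hypothetical wrapper around the pipeline described in the breakdown.
function Get-HotlinkedMedia {
    param(
        [Parameter(Mandatory)]
        [string]$Path   # path to the raw access log
    )

    # Referrers to ignore: the blog itself, search engines and RSS readers.
    $exclude = 'vladdba.com|google|yandex|bing|duckduckgo|yahoo|baidu|perplexity|feedly|newsblur'

    Select-String -Path $Path -Pattern '"GET \/wp-content\/upload.*" 200 [0-9]+ "http.*' |
        Select-String -Pattern $exclude -NotMatch |
        # Indexing with [6,10] returns two elements (resource and referrer);
        # adding that array to a string joins the elements with a space.
        ForEach-Object { "https://vladdba.com" + $_.Line.Split(' ')[6,10] } |
        Group-Object |                     # count hits per resource/referrer pair (assumed step)
        Sort-Object Count -Descending |    # most requested pair first (assumed step)
        Select-Object Count, Name
}
```

With the log path stored in $LogFile, this would be called as Get-HotlinkedMedia -Path $LogFile.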


Note that this command also outputs the referrer URL, not just the hit count and resource URL.


In most cases, images and other media hosted on a site are accessed by the site itself when rendering, so the referrer is a page on the site.
In situations such as hot-linking, the referrer is a site that you don’t own but has your media embedded in its pages.

For cases where your images are shown in image search results, scraped by search engines, or shown in a news/RSS reader, the referrer is a search engine or RSS reader.
But here I’ve already filtered out referrers such as search engines and RSS readers.

If you’re wondering how the output looked when Borders College was mooching off of me, then here’s the output from two months’ worth of activity.


Breaking down the PowerShell command

Again, I’ll only focus on the relevant differences.

Command, followed by its explanation:

Select-String $LogFile -Pattern '"GET \/wp-content\/upload.*" 200 [0-9]+ "http.*'
  "GET \/wp-content\/upload.*" - Matches GET requests for paths starting with /wp-content/upload (standard WordPress media upload location) followed by any number of characters.

Select-String -notmatch 'vladdba.com|google|yandex|bing|duckduckgo|yahoo|baidu|perplexity|feedly|newsblur'
  Excluding any results where the referrer is my own blog, a search engine or an RSS reader.

ForEach-Object { "https://vladdba.com"+$_.Line.Split(' ')[6,10]}
  Extracting the resource as well as the referrer and prepending https://vladdba.com to the resource.

Conclusion

PowerShell is not only great at automating stuff and writing scripts, it can also come in handy when trying to make sense of large raw access log files such as the ones from your website.
