Module 2
Basic file analysis
Last updated on: 26 July 2024
Once you have a piece of malware on your analysis VM, the next step is to figure out what’s in it. A piece of malware may use multiple files; in that case, apply the techniques in this section to each file. There are a few different ways to get an idea of what kind of file you’re dealing with. Note that some malware is tricky about this, hiding malicious content in innocuous-looking files or creating files that are several valid types at once (a classic example being the GIFAR, a file that is both a valid GIF image and a valid Java archive). Because of this, when evaluating malware files, we need to perform a deeper analysis of file types and contents. Beyond basic file extensions, we’ll examine file headers and signatures, as well as string contents.
After completing this subtopic, the practitioner should be able to do the following:
For many operating systems, file extensions are very important to how the system treats the file. File names (and thus extensions) are not actually part of the file, but part of the file metadata in the filesystem. As such, they are easily changed, and don’t reliably reveal anything about the content of the file. Even so, they’re a good first step in analysis. There is a practically unlimited set of file extensions (they’re just letters at the end of a filename), and there is no enforced registry. There can be no exhaustive list of extensions, and many extensions have multiple possible meanings. Several lists of common extensions are included in the resources at the end of this module.
Many file formats contain distinctive data structures, often called signatures or magic bytes, that identify the format. Usually these appear at the start of the file, but sometimes they appear elsewhere. For example, GIF files start with the string “GIF89a” (or, less commonly, “GIF87a”), while Windows executables (PE format) start with “MZ”. These headers are critical, as most (if not all) software that uses a file will not process it without the correct signatures. For example, if you try to run a file that ends in “.exe” in Windows, but the file doesn’t contain a proper PE file header, Windows will not execute the file.
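The header check described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real file-identification tool: the `SIGNATURES` table and the function names are hypothetical, and the table covers only the formats named in the text.

```python
# Illustrative signature table: only the formats mentioned in the text.
SIGNATURES = {
    b"GIF89a": "GIF image (89a)",
    b"GIF87a": "GIF image (87a)",
    b"MZ": "DOS/Windows executable (MZ header)",
}


def identify_by_header(data: bytes) -> str:
    """Return a label for the first signature that matches, else 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of a file and compare them to the table."""
    with open(path, "rb") as handle:
        return identify_by_header(handle.read(16))
```

Note that the function never looks at the file name, so renaming a file does not change the answer. A matching signature is still only a hint, since the rest of the file may not follow the format.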
In many cases, it is possible to determine more about a file format by looking at additional file content. For example, both regular ZIP archives and Java archive (JAR) files are in ZIP format. If you rename a .jar file to .zip, standard ZIP tools will extract it just fine. However, JAR files typically contain strings (such as “MANIFEST.MF”) that most other ZIP files do not.
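As a rough sketch of this idea, the following Python function uses the standard `zipfile` module to check whether a ZIP archive lists a manifest entry at `META-INF/MANIFEST.MF`, the conventional location in a JAR. The function name `looks_like_jar` is hypothetical, and the result is only a heuristic: a manifest entry suggests a JAR but proves nothing about what the archive really does.

```python
import zipfile


def looks_like_jar(path: str) -> bool:
    """Return True if the file is a ZIP archive that lists a JAR manifest."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # namelist() only reads the archive's table of contents;
        # nothing is extracted to disk.
        return "META-INF/MANIFEST.MF" in archive.namelist()
```

Listing entry names rather than extracting them is a deliberate choice here, since extracting untrusted archives carries more risk than reading their index.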
In some cases, files won’t even have the same basic format, but it will be difficult to distinguish between them. For example, both Java class files (compiled bytecode) and Mach-O universal binaries start with the byte sequence 0xCAFEBABE. The code that the file command uses to tell the two apart requires many heuristics.
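To give a feel for what such a heuristic looks like, here is a simplified Python sketch. It is not the logic the file command actually uses. It relies on two facts about the formats: in a Java class file, the four bytes after the magic number hold a minor and a major version number, and the lowest major version is 45; in a Mach-O universal binary, the same four bytes hold a big-endian count of embedded architectures, which is normally a small number. The function name and the threshold-based rule are assumptions made for illustration.

```python
import struct

CAFEBABE = b"\xca\xfe\xba\xbe"
LOWEST_JAVA_MAJOR = 45  # major version of the earliest Java class files


def classify_cafebabe(header: bytes) -> str:
    """Guess what a file starting with 0xCAFEBABE is, from its first 8 bytes."""
    if len(header) < 8 or header[:4] != CAFEBABE:
        return "not a 0xCAFEBABE file"
    # Read bytes 4..7 as one big-endian unsigned 32-bit integer.
    (value,) = struct.unpack(">I", header[4:8])
    # Java: value = (minor << 16) | major, so it is at least 45.
    # Mach-O universal: value = number of architectures, usually tiny.
    if value < LOWEST_JAVA_MAJOR:
        return "probably a Mach-O universal binary"
    return "probably a Java class file"
```

The word "probably" in the return values is intentional: a crafted file could satisfy either branch, which is why real tools layer several checks on top of one another.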
Since the number of file types is immense, it makes sense to use a tool with a database of file types. The most common of these is the “file” command on Linux. Since it’s open-source, you can see how it came to a particular decision about a particular file. A similar tool is TrID. While it’s not open-source, it may give better results on certain files.
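If you want to call the file command from a script, for instance to label every sample in a folder, a small wrapper like the one below may help. It is a sketch that assumes the `file` utility is installed and on the PATH (as it is on REMnux); the function name `identify_with_file` is hypothetical. The `--brief` option asks `file` to omit the file name from its output.

```python
import shutil
import subprocess
from typing import Optional


def identify_with_file(path: str) -> Optional[str]:
    """Return the description printed by `file`, or None if it is missing."""
    if shutil.which("file") is None:
        return None  # the utility is not installed on this system
    result = subprocess.run(
        ["file", "--brief", "--", path],  # "--" guards against odd file names
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()
```

The command is passed as a list rather than a shell string, so unusual characters in a sample's file name are not interpreted by a shell.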
Another useful tool for file analysis is the “strings” command. This Unix utility prints the sequences of printable characters it finds in a file (by default, runs of at least four characters), which can be incredibly useful for spotting patterns such as URLs. It won’t work well on encrypted, compressed, or encoded data, but it is still a useful first look.
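The core idea behind strings is simple enough to sketch in Python: scan the raw bytes for runs of printable ASCII characters of some minimum length. The function below is a rough approximation for illustration, not a reimplementation of the real utility, and its name is hypothetical.

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII at least `min_length` bytes long."""
    # 0x20 (space) through 0x7e (~) is the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

For example, calling `extract_strings(open(path, "rb").read())` on a sample and filtering the results for `http` is a quick way to look for embedded URLs. Text stored in other encodings, such as UTF-16, will not be found by this ASCII-only version.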
Lastly, a hex editor will display binary files in a human-readable format. Typically, it will show both a hexadecimal and an ASCII representation of the file data, which can be helpful in detecting patterns. There are many hex editors; Wikipedia has a comparison of some, and REMnux comes with one called wxHexEditor.
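The side-by-side view a hex editor provides can be approximated with a short Python function, which may be handy when you only want to peek at the first bytes of a file. This is an illustrative sketch with a hypothetical function name; it prints an offset, the bytes in hexadecimal, and the same bytes as ASCII with a dot standing in for anything unprintable.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and an ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Replace non-printable bytes with "." in the text column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)
```

Running `print(hexdump(open(path, "rb").read(64)))` on two different samples makes it easy to compare how the files start.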
For a more advanced guide on how to capture and do preliminary analysis on an Android app, we recommend checking out this excellent guide from the PiRogue Tool Suite.
Here’s a quick article on static reverse engineering of file formats. Read through it and make sure that you have understood it. If possible, discuss this article with a mentor or someone else with deep knowledge of file format reverse engineering.
Complete the Malware Introductory (free) exercises on TryHackMe
Open up the REMnux VM you set up in the previous subtopic’s practice exercises.
Conduct the following tasks:
[…] file command.
[…] file command on […] in a hex editor. Do you see any major differences between them, especially in how the files start?
Show the work above to a mentor or peer who will confirm that you have correctly carried out the exercises.
Common file name extensions in Windows
Free. Guide by Microsoft outlining commonly encountered file extensions in Windows.
List of filename extensions | Wikipedia
Free. Comprehensive list of file extensions used by various software.
TrID
Free. Program for Windows and Linux to identify file types based on binary signatures.
File extensions and file type definitions
Free. TrID’s list of over 16,000 known file extensions.
File
Free. Command line program for Unix-like systems to identify files by type.
Comparison of hex editors
Free. List and comparison of hex editors for directly editing binary files.
Wikibooks / Reverse Engineering File Formats
Free. Comprehensive guide to reverse engineering file formats.
Beginner guide: how to handle a potentially malicious mobile app
Free. Introduction to handling suspicious Android apps with initial data collection and analysis steps.