Analyzing Malicious PDF Files - Part 12
1
00:00:11,160 --> 00:00:11,910
Hello everyone.
1
2
00:00:11,940 --> 00:00:16,230
Let us start analyzing a bunch of malicious PDF files.
2
3
00:00:16,310 --> 00:00:23,270
So again, we'll be coming back to our FLARE collection of tools.
3
4
00:00:23,310 --> 00:00:29,280
You can see there is a folder for PDF. When you go inside you will find that there
is a tool called PDF
4
5
00:00:29,280 --> 00:00:38,070
parser, which is critical for parsing the complete PDF file and extracting malicious
artifacts.
5
6
00:00:38,140 --> 00:00:43,460
Then there is a shortcut for pdf-parser and there is a shortcut for pdfid.
6
7
00:00:43,480 --> 00:00:47,450
So these are basically the compiled executables of the actual program.
7
8
00:00:47,490 --> 00:00:51,670
If you go inside pdf-parser, you'll see that it's simply a Python file.
8
9
00:00:51,700 --> 00:00:57,810
We can run it just like we ran the other OLE file analysis tools in the
previous videos.
9
10
00:00:58,240 --> 00:01:05,020
The original pdfid tool is not here, so we can just open the properties of the
shortcut and we
10
11
00:01:05,020 --> 00:01:08,240
can see exactly where this shortcut is pointing.
11
12
00:01:08,450 --> 00:01:11,880
You can see that it's in Program Files/pdfid.
12
13
00:01:12,010 --> 00:01:18,550
So I don't really enjoy running the shortcuts. It's always better to run the Python
program directly
13
14
00:01:18,640 --> 00:01:23,110
so that in case there is any error you can look at it and try to resolve it.
14
15
00:01:23,110 --> 00:01:25,480
So let's quickly go to Program Files.
15
16
00:01:27,000 --> 00:01:36,010
Open pdfid, and just copy it and move it to the FLARE folder.
16
17
00:01:36,080 --> 00:01:41,510
So you now have both pdf-parser and pdfid in the same FLARE directory.
17
18
00:01:41,510 --> 00:01:43,640
So pdfid is more of a
18
19
00:01:45,100 --> 00:01:51,650
meta information tool which gives you a bunch of information about the PDF file.
19
20
00:01:51,770 --> 00:01:57,260
For example, how many pages there are, whether there is any JavaScript inside it,
and things like that.
20
21
00:01:57,260 --> 00:02:02,870
Whereas pdf-parser does a more in-depth parsing of the contents of the PDF file.
21
22
00:02:02,900 --> 00:02:06,410
So let's begin with using pdfid.
22
23
00:02:06,410 --> 00:02:07,350
For the first file.
23
24
00:02:12,680 --> 00:02:22,600
So I will come to my pdfid directory and my files are stored in course files/PDF
files/PDF examples
24
25
00:02:22,610 --> 00:02:24,350
I have three examples here.
25
26
00:02:24,350 --> 00:02:28,700
So we'll be using them one by one.
26
27
00:02:28,700 --> 00:02:37,420
We will run
>python pdfid.py
followed by the location of the file.
27
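The same step can also be driven from a script. This is only a minimal sketch, assuming pdfid.py has been copied into the working directory as described above; the helper names build_pdfid_command and run_pdfid, and the file name in the usage comment, are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_pdfid_command(pdfid_script, pdf_path):
    """Return the argument list equivalent to `python pdfid.py <file>`."""
    return [sys.executable, str(pdfid_script), str(pdf_path)]


def run_pdfid(pdfid_script, pdf_path):
    """Run pdfid.py on one file and return its text report.

    Both paths are checked first so that a wrong location shows up as a
    clear error instead of a confusing interpreter message.
    """
    for path in (pdfid_script, pdf_path):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    result = subprocess.run(
        build_pdfid_command(pdfid_script, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return result.stdout


# Usage (hypothetical file name, adjust to your own layout):
# print(run_pdfid("pdfid.py", "PDF examples/example1.pdf"))
```

Running the Python file directly like this, rather than through the shortcut, keeps any error message visible so it can be investigated.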
28
00:02:39,500 --> 00:02:43,360
So once you press enter it will give us a bunch of information.
28
29
00:02:43,360 --> 00:02:47,660
For example this PDF file has 26 objects inside it.
29
30
00:02:47,660 --> 00:02:53,240
If you remember from our previous discussion, we talked about how
the body of a PDF file
30
31
00:02:53,240 --> 00:02:57,920
consists of different objects, and all those objects will begin with
31
32
00:02:58,010 --> 00:03:01,650
'obj' and end with 'endobj'.
32
33
00:03:01,880 --> 00:03:08,780
So there are 26 'obj' and 26 'endobj' entries, so all the objects are closed properly.
33
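To make the obj/endobj check concrete, here is a rough sketch that counts these keywords in raw PDF bytes. It is not how pdfid itself is implemented; it is a simplified stand-in that assumes a plain regular-expression match is good enough, and the names count_structure_keywords, is_balanced and SAMPLE are made up for illustration. Stream data that happens to contain one of these words would skew the counts.

```python
import re

# Keywords that open and close objects and streams in a PDF body.
STRUCTURE_KEYWORDS = (b"obj", b"endobj", b"stream", b"endstream")

# Tiny hand-written sample, not a real malicious file.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)


def count_structure_keywords(data):
    """Count whole-word occurrences of each structural keyword."""
    counts = {}
    for keyword in STRUCTURE_KEYWORDS:
        # \b keeps "obj" from matching inside "endobj",
        # and "stream" from matching inside "endstream".
        pattern = rb"\b" + re.escape(keyword) + rb"\b"
        counts[keyword.decode()] = len(re.findall(pattern, data))
    return counts


def is_balanced(counts):
    """True when every opened object and stream is also closed."""
    return (
        counts["obj"] == counts["endobj"]
        and counts["stream"] == counts["endstream"]
    )
```

A mismatch between the opening and closing counts does not prove anything by itself, but it is a hint that the file is malformed and deserves a closer look.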
34
00:03:08,990 --> 00:03:15,500
There are nine streams. Again, the body of the PDF file contains streams, and
these streams hold the
34
35
00:03:15,500 --> 00:03:17,060
data.
35
36
00:03:17,240 --> 00:03:18,990
Then there is one cross-reference.
36
37
00:03:19,010 --> 00:03:21,690
There is one trailer and one startxref.
37
38
00:03:21,770 --> 00:03:23,790
There are three pages.
38
39
00:03:23,960 --> 00:03:27,050
There is one JavaScript entry as well.
39
40
00:03:27,050 --> 00:03:33,800
There is a /JS tag, and it has been picked up by pdfid
40
41
00:03:33,800 --> 00:03:35,100
as well.
41
42
00:03:35,200 --> 00:03:37,190
There is an open action as well.
42
43
00:03:37,190 --> 00:03:42,640
So what I mean by open action here is that once you launch the PDF file, whatever
is
43
44
00:03:42,650 --> 00:03:46,370
marked as open action will be immediately executed.
44
45
00:03:46,670 --> 00:03:55,060
So it's very important to understand all these meta properties that we have got
from pdfid.
45
46
00:03:55,160 --> 00:04:00,710
We already know a bunch of them but there are some of them which are new and the
important ones are things
46
47
00:04:00,710 --> 00:04:04,860
like /JS, /JavaScript, /AA, /OpenAction,
47
48
00:04:04,920 --> 00:04:05,720
/XFA, and /URI.
48
49
00:04:05,720 --> 00:04:11,930
So /URI tells us whether there is any URI present inside the PDF. The
embedded file
49
50
00:04:11,930 --> 00:04:12,470
tells us
50
51
00:04:12,470 --> 00:04:20,540
whether there is any embedded file, for example an executable or a Flash file, inside
the PDF. So the interesting
51
52
00:04:20,540 --> 00:04:22,340
parts here are the JavaScript entries.
52
53
00:04:22,370 --> 00:04:28,550
We know that this file contains JavaScript and that an open action is
performed as well, which
53
54
00:04:28,550 --> 00:04:33,980
means that as soon as we launch the PDF, the PDF tries to do something
without, you know, asking
54
55
00:04:33,980 --> 00:04:37,410
you for any kind of permission.
55
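The combination just described, JavaScript plus an action that fires automatically, can be expressed as a small check. This is a hedged sketch rather than pdfid's own logic: it counts the keyword names mentioned above in the raw bytes and does not decode obfuscated names or look inside compressed streams. The names count_keywords, auto_runs_script and SUSPICIOUS_NAMES are made up for illustration.

```python
import re

# Names called out in this walkthrough as worth checking first.
SUSPICIOUS_NAMES = (
    b"/JS", b"/JavaScript", b"/AA", b"/OpenAction",
    b"/XFA", b"/URI", b"/EmbeddedFile",
)


def count_keywords(data):
    """Count each name, requiring that it is not the prefix of a longer name."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead stops /JS from matching /JSFoo or /AA from matching /AAPL.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode()] = len(re.findall(pattern, data))
    return counts


def auto_runs_script(counts):
    """True when a script is present together with an automatic trigger."""
    has_script = counts["/JS"] > 0 or counts["/JavaScript"] > 0
    has_trigger = counts["/OpenAction"] > 0 or counts["/AA"] > 0
    return has_script and has_trigger
```

A True result here is only a heuristic signal that the file deserves a closer look with pdf-parser, not a verdict that it is malicious.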
56
00:04:37,410 --> 00:04:47,260
All you have to do is just open that PDF itself. Let us run it for our second file as
well.
56
57
00:04:47,260 --> 00:04:49,410
For this file we get something similar.
57
58
00:04:49,510 --> 00:04:54,770
There are 12 objects and two streams, and it has two pages.
58
59
00:04:54,910 --> 00:05:03,580
And again it has JavaScript inside it and it performs an open action as well, and there
is no embedded file
59
60
00:05:03,820 --> 00:05:06,950
and there is no URI inside that PDF file
60
61
00:05:08,580 --> 00:05:13,110
Let us try with our third example
61
62
00:05:13,230 --> 00:05:18,060
We have eight objects, one stream, and one page.
62
63
00:05:18,060 --> 00:05:19,470
There is no JavaScript.
63
64
00:05:19,470 --> 00:05:25,900
In this case there is one XFA entry and no URI. That's it.
64
65
00:05:25,920 --> 00:05:31,950
So this is how we first collect some static information about the PDF file
using pdfid, and
65
66
00:05:31,950 --> 00:05:37,230
this can help us make a heuristic assessment of the PDF file by looking
at the number
66
67
00:05:37,230 --> 00:05:43,020
of pages, whether it has any JavaScript in it, whether it performs an
open action, and
67
68
00:05:43,020 --> 00:05:44,350
things like that.
68
69
00:05:44,370 --> 00:05:50,940
So once we have some kind of static heuristics about the PDF file, the next thing
that we can do
69
70
00:05:50,940 --> 00:05:56,710
is we can start using pdf-parser to actually look into these elements.