(Do not be afraid of)
PHP Compiler Internals
Sebastian Bergmann
June 13th 2009
Who I Am
Sebastian Bergmann
Involved in the PHP project since 2000
Creator of PHPUnit
Co-Founder and Principal Consultant with thePHP.cc
Under PHP's Hood
Extensions (date, dom, gd, json, mysql, pcre, pdo, reflection, session, standard, …)
PHP Core: Request Management; File and Network Operations
Zend Engine: Compilation and Execution; Memory and Resource Allocation
Server API (SAPI) (mod_php, FastCGI, CLI, ...)
This slide contains material by Sara Golemon
How PHP executes code
Lexical Analysis
Converts the source from a sequence of characters into a
sequence of tokens
Syntax Analysis
Analyzes a sequence of tokens to determine their grammatical
structure
Bytecode Generation
Generates bytecode based on the information gathered by
analyzing the source code
Bytecode Execution
Lexical Analysis
Scan a sequence of characters
<?php
if (TRUE) {
    print '*';
}
?>
Lexical Analysis
Scan a sequence of characters
<?php            T_OPEN_TAG
if (TRUE) {      T_IF
                 T_WHITESPACE
                 (
                 T_STRING
                 )
                 T_WHITESPACE
                 {
                 T_WHITESPACE
    print '*';   T_PRINT
                 T_WHITESPACE
                 T_CONSTANT_ENCAPSED_STRING
                 ;
                 T_WHITESPACE
}                }
                 T_WHITESPACE
?>               T_CLOSE_TAG
Lexical Analysis
Scan a sequence of characters
T_OPEN_TAG <?php
T_IF if
T_WHITESPACE
(
T_STRING TRUE
)
T_WHITESPACE
{
T_WHITESPACE
T_PRINT print
T_WHITESPACE
T_CONSTANT_ENCAPSED_STRING '*'
;
T_WHITESPACE
}
T_WHITESPACE
T_CLOSE_TAG ?>
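The token sequence above can be reproduced with a very small hand-written scanner. The following is a minimal sketch in C, not PHP's actual scanner: the `toy_` names and token kinds are hypothetical, keywords are matched case-sensitively, and only the constructs of the example script (without the open and close tags) are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not PHP's real token numbers. */
typedef enum {
    TOK_IF, TOK_PRINT, TOK_STRING, TOK_CONSTANT_ENCAPSED_STRING,
    TOK_WHITESPACE, TOK_CHAR, TOK_END
} toy_token;

/*
 * Scan one token starting at src[*pos] and advance *pos past it.
 * Returns TOK_END once the input is exhausted.
 */
toy_token toy_next_token(const char *src, size_t *pos)
{
    size_t start = *pos;
    unsigned char c = (unsigned char)src[start];

    if (c == '\0') {
        return TOK_END;
    }
    if (isspace(c)) {                    /* run of whitespace */
        while (isspace((unsigned char)src[*pos])) (*pos)++;
        return TOK_WHITESPACE;
    }
    if (c == '\'') {                     /* single-quoted string, no escapes */
        (*pos)++;
        while (src[*pos] != '\0' && src[*pos] != '\'') (*pos)++;
        if (src[*pos] == '\'') (*pos)++;
        return TOK_CONSTANT_ENCAPSED_STRING;
    }
    if (isalpha(c) || c == '_') {        /* keyword or other name */
        size_t len;
        while (isalnum((unsigned char)src[*pos]) || src[*pos] == '_') (*pos)++;
        len = *pos - start;
        if (len == 2 && strncmp(src + start, "if", 2) == 0) return TOK_IF;
        if (len == 5 && strncmp(src + start, "print", 5) == 0) return TOK_PRINT;
        return TOK_STRING;
    }
    (*pos)++;                            /* single character such as ( ) { } ; */
    return TOK_CHAR;
}

/* Scan the whole input into out[]; returns the number of tokens written. */
size_t toy_scan(const char *src, toy_token *out, size_t max)
{
    size_t pos = 0, n = 0;
    toy_token t;

    while ((t = toy_next_token(src, &pos)) != TOK_END) {
        assert(n < max);                 /* caller must provide enough room */
        out[n++] = t;
    }
    return n;
}
```

Calling `toy_scan()` on the text between the tags, `"if (TRUE) {\nprint '*';\n}"`, yields the same order of token kinds as the slide: the keyword, whitespace, single characters, a name, and the quoted string.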
Lexical Analysis
Scanner Generators
You do not want to write a scanner by hand
At least not when the code for the scanner should be efficient and maintainable
Tools such as flex or re2c generate the code for a scanner from a set of rules
<ST_IN_SCRIPTING>"if" {
    return T_IF;
}
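The `<ST_IN_SCRIPTING>` prefix is a start condition: the rule is only active while the scanner is inside a PHP block. The sketch below illustrates that idea with a plain lookup table in C. It is an assumption-laden illustration, not what flex or re2c emit (generated scanners compile their rules into a state machine), and all `toy_` names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner states and token kinds for this sketch. */
typedef enum { TOY_INITIAL, TOY_IN_SCRIPTING } toy_state;
typedef enum { TOY_T_INLINE_HTML, TOY_T_IF, TOY_T_ELSE, TOY_T_STRING } toy_token;

typedef struct {
    toy_state   state;   /* start condition in which the rule is active */
    const char *text;    /* literal text the rule matches */
    toy_token   token;   /* token returned by the rule's action */
} toy_rule;

const toy_rule toy_rules[] = {
    { TOY_IN_SCRIPTING, "if",   TOY_T_IF   },
    { TOY_IN_SCRIPTING, "else", TOY_T_ELSE },
};

/* Return the token for a lexeme, honoring the current start condition. */
toy_token toy_match(toy_state state, const char *lexeme)
{
    size_t i;

    for (i = 0; i < sizeof toy_rules / sizeof toy_rules[0]; i++) {
        if (toy_rules[i].state == state &&
            strcmp(toy_rules[i].text, lexeme) == 0) {
            return toy_rules[i].token;
        }
    }
    /* No rule matched: a plain name inside PHP code, inline HTML outside. */
    return state == TOY_IN_SCRIPTING ? TOY_T_STRING : TOY_T_INLINE_HTML;
}
```

With this table, the text `if` becomes `TOY_T_IF` only in the scripting state; outside of it the same characters are treated as inline HTML.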
Lexical Analysis
PHP Tokens
T_ABSTRACT T_CONCAT_EQUAL T_ELSE T_FUNCTION
T_AND_EQUAL T_CONST T_ELSEIF T_FUNC_C
T_ARRAY T_CONSTANT_ENCAPSED_STRING T_EMPTY T_GLOBAL
T_ARRAY_CAST T_CONTINUE T_ENCAPSED_AND_WHITESPACE T_GOTO
T_AS T_CURLY_OPEN T_ENDDECLARE T_HALT_COMPILER
T_BAD_CHARACTER T_DEC T_ENDFOR T_IF
T_BOOLEAN_AND T_DECLARE T_ENDFOREACH T_IMPLEMENTS
T_BOOLEAN_OR T_DEFAULT T_ENDIF T_INC
T_BOOL_CAST T_DIR T_ENDSWITCH T_INCLUDE
T_BREAK T_DIV_EQUAL T_ENDWHILE T_INCLUDE_ONCE
T_CASE T_DNUMBER T_END_HEREDOC T_INLINE_HTML
T_CATCH T_DOC_COMMENT T_EVAL T_INSTANCEOF
T_CHARACTER T_DO T_EXIT T_INT_CAST
T_CLASS T_DOLLAR_OPEN_CURLY_BRACES T_EXTENDS T_INTERFACE
T_CLASS_C T_DOUBLE_ARROW T_FILE T_ISSET
T_CLONE T_DOUBLE_CAST T_FINAL T_IS_EQUAL
T_CLOSE_TAG T_DOUBLE_COLON T_FOR T_IS_GREATER_OR_EQUAL
T_COMMENT T_ECHO T_FOREACH T_IS_IDENTICAL
Lexical Analysis
PHP Tokens
T_IS_NOT_EQUAL T_OBJECT_CAST T_SR_EQUAL
T_IS_NOT_IDENTICAL T_OBJECT_OPERATOR T_START_HEREDOC
T_IS_SMALLER_OR_EQUAL T_OLD_FUNCTION T_STATIC
T_LINE T_OPEN_TAG T_STRING
T_LIST T_OPEN_TAG_WITH_ECHO T_STRING_CAST
T_LNUMBER T_OR_EQUAL T_STRING_VARNAME
T_LOGICAL_AND T_PAAMAYIM_NEKUDOTAYIM T_SWITCH
T_LOGICAL_OR T_PLUS_EQUAL T_THROW
T_LOGICAL_XOR T_PRINT T_TRY
T_METHOD_C T_PRIVATE T_UNSET
T_MINUS_EQUAL T_PUBLIC T_UNSET_CAST
T_ML_COMMENT T_PROTECTED T_USE
T_MOD_EQUAL T_REQUIRE T_VAR
T_MUL_EQUAL T_REQUIRE_ONCE T_VARIABLE
T_NAMESPACE T_RETURN T_WHILE
T_NS_C T_SL T_WHITESPACE
T_NEW T_SL_EQUAL T_XOR_EQUAL
T_NUM_STRING T_SR
Syntax Analysis
Analyze a sequence of tokens
Syntax Analysis
Parser Generators
You do not want to write a parser by hand
At least not when the code for the parser should be efficient and maintainable
Tools such as bison or lemon generate the code for a parser from a set of rules
T_IF '(' expr ')' { ... }
statement { ... }
elseif_list else_single { ... }
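To get a feeling for what such a rule checks, here is a hand-written recursive-descent sketch in C for a heavily reduced grammar. It is only an illustration under stated assumptions: bison generates a table-driven parser rather than code of this shape, the grammar below is far smaller than PHP's, and all `p_` names and token kinds are hypothetical.

```c
#include <stddef.h>

/* Hypothetical token kinds; P_EXPR stands for an already recognized expression. */
typedef enum { P_IF, P_PRINT, P_EXPR, P_LPAREN, P_RPAREN,
               P_LBRACE, P_RBRACE, P_SEMI, P_END } p_token;

typedef struct {
    const p_token *toks;  /* token stream, terminated by P_END */
    size_t pos;           /* index of the next unread token */
} p_parser;

/* Consume the next token if it is of kind t. */
int p_accept(p_parser *p, p_token t)
{
    if (p->toks[p->pos] == t) {
        p->pos++;
        return 1;
    }
    return 0;
}

/*
 * statement: P_IF '(' expr ')' statement
 *          | '{' statement* '}'
 *          | P_PRINT expr ';'
 */
int p_statement(p_parser *p)
{
    if (p_accept(p, P_IF)) {
        return p_accept(p, P_LPAREN) && p_accept(p, P_EXPR)
            && p_accept(p, P_RPAREN) && p_statement(p);
    }
    if (p_accept(p, P_LBRACE)) {
        while (p->toks[p->pos] != P_RBRACE && p->toks[p->pos] != P_END) {
            if (!p_statement(p)) return 0;
        }
        return p_accept(p, P_RBRACE);
    }
    if (p_accept(p, P_PRINT)) {
        return p_accept(p, P_EXPR) && p_accept(p, P_SEMI);
    }
    return 0;
}

/* Returns 1 if the whole token stream is exactly one valid statement. */
int p_parse(const p_token *toks)
{
    p_parser p;

    p.toks = toks;
    p.pos = 0;
    return p_statement(&p) && p.toks[p.pos] == P_END;
}
```

The token stream for `if (TRUE) { print '*'; }` (with whitespace dropped) is accepted, while a stream that omits the opening parenthesis is rejected.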
PHP Bytecode
Disassembling with vld
1 <?php
2 if (TRUE) {
3 print '*';
4 }
5 ?>
sb@thinkpad ~ % php -dextension=vld.so -dvld.active=1 -dvld.execute=0 if.php
filename: /home/sb/if.php
function name: (null)
number of ops: 8
compiled vars: none
line     #  op          fetch  ext  return  operands
-------------------------------------------------------------------------------
   2     0  EXT_STMT
         1  JMPZ                            true, ->6
   3     2  EXT_STMT
         3  PRINT                   ~0      '%2A'
         4  FREE                            ~0
   4     5  JMP                             ->6
   6     6  EXT_STMT
         7  RETURN                          1
PHP Bytecode
Disassembling with bytekit-cli
1 <?php
2 if (TRUE) {
3 print '*';
4 }
5 ?>
sb@thinkpad ~ % bytekit if.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/if.php
Function: main
Number of oplines: 8
line  #   opcode     result  operands
-----------------------------------------------------------------------------
2     0   EXT_STMT
      1   JMPZ               true, ->6
3     2   EXT_STMT
      3   PRINT      ~0      '*'
      4   FREE               ~0
4     5   JMP                ->6
6     6   EXT_STMT
      7   RETURN             1
PHP Bytecode
Bytecode visualization with bytekit-cli
1 <?php
2 if (TRUE) {
3 print '*';
4 }
5 ?>
sb@thinkpad ~ % bytekit --graph /tmp --format svg if.php
PHP Bytecode
Disassembling with bytekit-cli
1 <?php
2 $a = 1;
3 $b = 2;
4 print $a + $b;
5 ?>
sb@thinkpad ~ % bytekit add.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/add.php
Function: main
Number of oplines: 10
Compiled variables: !0 = $a, !1 = $b
line  #   opcode     result  operands
-----------------------------------------------------------------------------
2     0   EXT_STMT
      1   ASSIGN             !0, 1
3     2   EXT_STMT
      3   ASSIGN             !1, 2
4     4   EXT_STMT
      5   ADD        ~2      !0, !1
      6   PRINT      ~3      ~2
      7   FREE               ~3
6     8   EXT_STMT
      9   RETURN             1
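How such a listing is executed can be sketched with a toy interpreter in C. This is a simplification under several assumptions: the real Zend Engine dispatches through handler functions and uses znode operands, whereas here operands are plain integers, EXT_STMT and FREE are left out, and every `toy_` name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical opcodes covering only the add.php example. */
typedef enum { OP_ASSIGN, OP_ADD, OP_PRINT, OP_RETURN } toy_opcode;

typedef struct {
    toy_opcode opcode;
    int result;   /* index of the temporary (~n) that receives the result */
    int op1;      /* first operand; meaning depends on the opcode */
    int op2;      /* second operand; meaning depends on the opcode */
} toy_op;

/*
 * Execute ops until OP_RETURN. cv[] holds compiled variables (!0, !1, ...),
 * tmp[] holds temporaries (~0, ~1, ...). Each OP_PRINT stores its operand
 * in printed[]; the function returns the number of values printed.
 */
size_t toy_execute(const toy_op *ops, int *printed, size_t max_printed)
{
    int cv[4] = {0};
    int tmp[4] = {0};
    size_t n = 0;
    const toy_op *op;

    for (op = ops; op->opcode != OP_RETURN; op++) {
        switch (op->opcode) {
        case OP_ASSIGN:              /* cv[op1] = literal op2 */
            cv[op->op1] = op->op2;
            break;
        case OP_ADD:                 /* tmp[result] = cv[op1] + cv[op2] */
            tmp[op->result] = cv[op->op1] + cv[op->op2];
            break;
        case OP_PRINT:               /* record tmp[op1]; print itself yields 1 */
            if (n < max_printed) printed[n++] = tmp[op->op1];
            tmp[op->result] = 1;
            break;
        case OP_RETURN:              /* not reached: the loop stops first */
            break;
        }
    }
    return n;
}
```

Feeding it the oplines `ASSIGN !0, 1`, `ASSIGN !1, 2`, `ADD ~2 !0, !1`, `PRINT ~3 ~2`, `RETURN` records a single printed value, the sum of the two variables.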
PHP Bytecode
List of Opcodes
NOP IS_NOT_EQUAL POST_INC ADD_VAR UNSET_DIM
ADD IS_SMALLER POST_DEC BEGIN_SILENCE UNSET_OBJ
SUB IS_SMALLER_OR_EQUAL ASSIGN END_SILENCE FE_RESET
MUL CAST ASSIGN_REF INIT_FCALL_BY_NAME FE_FETCH
DIV QM_ASSIGN ECHO DO_FCALL EXIT
MOD ASSIGN_ADD PRINT DO_FCALL_BY_NAME FETCH_R
SL ASSIGN_SUB JMPZ RETURN FETCH_DIM_R
SR ASSIGN_MUL JMPNZ RECV FETCH_OBJ_R
CONCAT ASSIGN_DIV JMPZNZ RECV_INIT FETCH_W
BW_OR ASSIGN_MOD JMPZ_EX SEND_VAL FETCH_DIM_W
BW_AND ASSIGN_SL JMPNZ_EX SEND_VAR FETCH_OBJ_W
BW_XOR ASSIGN_SR CASE SEND_REF FETCH_RW
BW_NOT ASSIGN_CONCAT SWITCH_FREE NEW FETCH_DIM_RW
BOOL_NOT ASSIGN_BW_OR BRK FREE FETCH_OBJ_RW
BOOL_XOR ASSIGN_BW_AND BOOL INIT_ARRAY FETCH_IS
IS_IDENTICAL ASSIGN_BW_XOR INIT_STRING ADD_ARRAY_ELEMENT FETCH_DIM_IS
IS_NOT_IDENTICAL PRE_INC ADD_CHAR INCLUDE_OR_EVAL FETCH_OBJ_IS
IS_EQUAL PRE_DEC ADD_STRING UNSET_VAR FETCH_FUNC_ARG
PHP Bytecode
List of Opcodes
FETCH_DIM_FUNC_ARG INIT_STATIC_METHOD_CALL
FETCH_OBJ_FUNC_ARG ISSET_ISEMPTY_VAR
FETCH_UNSET ISSET_ISEMPTY_DIM_OBJ
FETCH_DIM_UNSET PRE_INC_OBJ
FETCH_OBJ_UNSET PRE_DEC_OBJ
FETCH_DIM_TMP_VAR POST_INC_OBJ
FETCH_CONSTANT POST_DEC_OBJ
EXT_STMT ASSIGN_OBJ
EXT_FCALL_BEGIN INSTANCEOF
EXT_FCALL_END DECLARE_CLASS
EXT_NOP DECLARE_INHERITED_CLASS
TICKS DECLARE_FUNCTION
SEND_VAR_NO_REF RAISE_ABSTRACT_ERROR
CATCH ADD_INTERFACE
THROW VERIFY_ABSTRACT_CLASS
FETCH_CLASS ASSIGN_DIM
CLONE ISSET_ISEMPTY_PROP_OBJ
INIT_METHOD_CALL HANDLE_EXCEPTION
Extending the Compiler
Test First!
Zend/tests/unless.phpt
--TEST--
unless statement
--FILE--
<?php
unless (FALSE) {
    print 'unless FALSE is TRUE, this is printed';
}
unless (TRUE) {
    print 'unless TRUE is TRUE, this is printed';
}
?>
--EXPECT--
unless FALSE is TRUE, this is printed
Extending the Compiler
Add token for unless to the scanner
Add rule for unless to the parser
Generate bytecode for unless in the compiler
Add token for unless to ext/tokenizer
Add unless scanner token
Zend/zend_language_scanner.l
<ST_IN_SCRIPTING>"if" {
    return T_IF;
}
<ST_IN_SCRIPTING>"unless" {
    return T_UNLESS;
}
<ST_IN_SCRIPTING>"elseif" {
    return T_ELSEIF;
}
<ST_IN_SCRIPTING>"endif" {
    return T_ENDIF;
}
<ST_IN_SCRIPTING>"else" {
    return T_ELSE;
}
Add unless parser rule
Zend/zend_language_parser.y
%token T_NAMESPACE
%token T_NS_C
%token T_DIR
%token T_NS_SEPARATOR
%token T_UNLESS
.
.
unticked_statement:
'{' inner_statement_list '}'
| T_IF '(' expr ')' {
.
.
| T_UNLESS '(' expr ')' {
zend_do_unless_cond(&$3, &$4 TSRMLS_CC);
} statement {
zend_do_if_after_statement(&$4, 1 TSRMLS_CC);
} {
zend_do_if_end(TSRMLS_C);
}
.
.
How if is compiled
Zend/zend_compile.c
void zend_do_if_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
zend_do_if_cond() is called when an if statement is compiled
Both of its arguments are of type znode:
typedef struct _znode {
    int op_type;
    union {
        zval constant;
        zend_uint var;
        zend_uint opline_num;
        zend_op_array *op_array;
        zend_op *jmp_addr;
        struct {
            zend_uint var;
            zend_uint type;
        } EA;
    } u;
} znode;
How if is compiled
Zend/zend_compile.c
void zend_do_if_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number =
        get_next_op_number(CG(active_op_array));
    zend_op *opline =
        get_next_op(CG(active_op_array) TSRMLS_CC);
Allocate a new opline in the current oparray
An opline is a zend_op:
struct _zend_op {
    opcode_handler_t handler;
    znode result;
    znode op1;
    znode op2;
    ulong extended_value;
    uint lineno;
    zend_uchar opcode;
};
How if is compiled
Zend/zend_compile.c
void zend_do_if_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number =
        get_next_op_number(CG(active_op_array));
    zend_op *opline =
        get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
Set the opcode of the new opline to JMPZ (jump if zero)
How if is compiled
Zend/zend_compile.c
void zend_do_if_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number =
        get_next_op_number(CG(active_op_array));
    zend_op *opline =
        get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
    opline->op1 = *cond;
Set the first operand of the new opline to the if condition
How if is compiled
Zend/zend_compile.c
void zend_do_if_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number =
        get_next_op_number(CG(active_op_array));
    zend_op *opline =
        get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
    opline->op1 = *cond;
    closing_bracket_token->u.opline_num =
        if_cond_op_number;
    SET_UNUSED(opline->op2);
    INC_BPC(CG(active_op_array));
}
Perform bookkeeping tasks, such as marking the second operand of the
new opline as unused and incrementing the backpatching counter for the
current oparray
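The opline number is remembered in closing_bracket_token because the jump target is not yet known when the condition is compiled: the body has not been emitted. The following C sketch shows this backpatching idea in isolation. It is an assumption about the general technique, not a copy of what zend_do_if_after_statement() does, and all `bp_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for this sketch. */
typedef enum { BP_JMPZ, BP_PRINT, BP_RETURN } bp_opcode;

typedef struct {
    bp_opcode opcode;
    int target;          /* jump target for BP_JMPZ, -1 while unknown */
} bp_op;

typedef struct {
    bp_op ops[16];
    int last;            /* number of oplines emitted so far */
} bp_op_array;

/* Append an opline and return its number. */
int bp_emit(bp_op_array *a, bp_opcode opcode)
{
    assert(a->last < 16);
    a->ops[a->last].opcode = opcode;
    a->ops[a->last].target = -1;
    return a->last++;
}

/* Compile "if (cond) { print ...; }" followed by the implicit return. */
void bp_compile_if(bp_op_array *a)
{
    /* The condition comes before the body, so the target is still unknown. */
    int if_cond_op_number = bp_emit(a, BP_JMPZ);

    bp_emit(a, BP_PRINT);                 /* body of the if statement */

    /* Backpatch: the jump skips the body and lands on the next opline. */
    a->ops[if_cond_op_number].target = a->last;

    bp_emit(a, BP_RETURN);
}
```

After `bp_compile_if()` has run on an empty array, opline 0 is the conditional jump and its target is the number of the opline that follows the body.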
Add unless to compiler
Zend/zend_compile.c
void zend_do_unless_cond
(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int unless_cond_op_number =
        get_next_op_number(CG(active_op_array));
    zend_op *opline =
        get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPNZ;
    opline->op1 = *cond;
    closing_bracket_token->u.opline_num =
        unless_cond_op_number;
    SET_UNUSED(opline->op2);
    INC_BPC(CG(active_op_array));
}
Compared to generating code for the if statement, all we have to do to
generate code for the unless statement is to use the JMPNZ (jump if not
zero) opcode instead of the JMPZ (jump if zero) opcode
Add unless to compiler
The generated bytecode
1 <?php
2 unless (FALSE) {
3 print '*';
4 }
5 ?>
sb@thinkpad ~ % bytekit unless.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/unless.php
Function: main
Number of oplines: 8
line  #   opcode     result  operands
-----------------------------------------------------------------------------
2     0   EXT_STMT
      1   JMPNZ              true, ->6
3     2   EXT_STMT
      3   PRINT      ~0      '*'
      4   FREE               ~0
4     5   JMP                ->6
6     6   EXT_STMT
      7   RETURN             1
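The effect of swapping JMPZ for JMPNZ can be checked with a toy interpreter. The C sketch below is a simplified model, not the Zend Engine's executor: the condition is stored directly in the jump opline, EXT_STMT and FREE are omitted, and all `j_` names are hypothetical.

```c
/* Hypothetical opcodes for this sketch. */
typedef enum { J_JMPZ, J_JMPNZ, J_PRINT, J_JMP, J_RETURN } j_opcode;

typedef struct {
    j_opcode opcode;
    int cond;            /* condition value for the conditional jumps */
    int target;          /* opline number to jump to */
} j_op;

/* Run the oplines and return how many J_PRINT oplines were executed. */
int j_run(const j_op *ops)
{
    int pc = 0;
    int printed = 0;

    for (;;) {
        const j_op *op = &ops[pc];

        switch (op->opcode) {
        case J_JMPZ:                     /* jump if the condition is zero */
            pc = (op->cond == 0) ? op->target : pc + 1;
            break;
        case J_JMPNZ:                    /* jump if the condition is not zero */
            pc = (op->cond != 0) ? op->target : pc + 1;
            break;
        case J_JMP:                      /* unconditional jump */
            pc = op->target;
            break;
        case J_PRINT:
            printed++;
            pc++;
            break;
        case J_RETURN:
            return printed;
        }
    }
}

/* Build the oplines for a guarded print statement and run them. */
int j_guarded_print(j_opcode jump, int cond)
{
    const j_op ops[] = {
        { jump,     cond, 3 },   /* 0: skip the body when the jump is taken */
        { J_PRINT,  0,    0 },   /* 1: body */
        { J_JMP,    0,    3 },   /* 2: jump past the end of the statement */
        { J_RETURN, 0,    0 },   /* 3 */
    };

    return j_run(ops);
}
```

`j_guarded_print(J_JMPZ, cond)` behaves like `if`, and `j_guarded_print(J_JMPNZ, cond)` behaves like `unless`: the body runs for a false condition and is skipped for a true one, which is what Zend/tests/unless.phpt expects.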
Run the test
sb@thinkpad php-5.3-unless % make test TESTS=Zend/tests/unless.phpt
Build complete.
Don't forget to run 'make test'.
=====================================================================
PHP : /usr/local/src/php/php-5.3-unless/sapi/cli/php
PHP_SAPI : cli
PHP_VERSION : 5.3.0RC3-dev
ZEND_VERSION: 2.3.0
PHP_OS : Linux 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux
INI actual : /usr/local/src/php/php-5.3-unless/tmp-php.ini
More .INIs :
CWD : /usr/local/src/php/php-5.3-unless
Extra dirs :
VALGRIND : Not used
=====================================================================
Running selected tests.
PASS unless statement [Zend/tests/unless.phpt]
=====================================================================
Number of tests : 1 1
Tests skipped : 0 ( 0.0%) --------
Tests warned : 0 ( 0.0%) ( 0.0%)
Tests failed : 0 ( 0.0%) ( 0.0%)
Expected fail : 0 ( 0.0%) ( 0.0%)
Tests passed : 1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken : 0 seconds
=====================================================================
Add unless to ext/tokenizer
ext/tokenizer/tokenizer_data.c
sb@thinkpad tokenizer % ./tokenizer_data_gen.sh
Wrote tokenizer_data.c
The End
Thank you for your interest!
These slides will be linked soon from
http://sebastian-bergmann.de/
You can vote for this talk on
http://joind.in/582
Acknowledgements
Thomas Lee, whose Python Language Internals presentation at
OSDC 2008 inspired this presentation
Stefan Esser for creating the Bytekit extension that provides
PHP bytecode access and analysis features
Derick Rethans, David Soria Parra, and Scott MacVicar for reviewing
these slides
References
http://www.php.net/manual/en/tokens.php
http://www.zapt.info/opcodes.html
Sara Golemon: "Extending and Embedding PHP"
http://derickrethans.nl/vld.php
http://bytekit.org/
http://github.com/sebastianbergmann/bytekit-cli/
License
This presentation material is published under the Attribution-Share Alike 3.0 Unported
license.
You are free:
✔ to Share – to copy, distribute and transmit the work.
✔ to Remix – to adapt the work.
Under the following conditions:
● Attribution. You must attribute the work in the manner specified by the author or
licensor (but not in any way that suggests that they endorse you or your use of the
work).
● Share Alike. If you alter, transform, or build upon this work, you may distribute the
resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this
work.
Any of the above conditions can be waived if you get permission from the copyright
holder.
Nothing in this license impairs or restricts the author's moral rights.