
Optimization Tips

Prepared for CppCon 2014

Andrei Alexandrescu, Ph.D.


Research Scientist, Facebook
[email protected]

© 2014- Andrei Alexandrescu. Do not redistribute.

Beware Compiler’s Most Vexing Inlining



Inlining

• Interacts with all other optimizations


• Final code shape/size hard to estimate
• Cost function intractable
• App costs != benchmark estimates


From Regression to Win In 2 Flags

--max-inline-insns-auto=100
--early-inlining-insns=200



“Dark Matter” Code: cdtors (ctors/dtors)

• Most affected by inlining


• “Motherhood and Apple Pie”
• Implicitly called
• Often implicitly generated
• Often trivial
◦ What’s a few stores between friends?
• Deadly effects at scale
◦ Beyond traditional advice!


I-Cache

• Spills seldom occur in microbenchmarks


• Issue in large applications
◦ Exactly where it hurts most
◦ . . . and harder to trace to causes
• Hard for compiler to assess impact
• (Don’t want to lose in microbenchmarks
either!)



Tip #1: Beware Inline Destructors

• Called everywhere, implicitly


• Not reflected in source code size
◦ . . . transitively
• Often generated automatically

• Watch destructor size carefully


One Destructor Inlined


Controlling inlining

// GCC
#define ALWAYS_INLINE inline __attribute__((__always_inline__))
#define NEVER_INLINE __attribute__((__noinline__))

// VC++:
#define ALWAYS_INLINE __forceinline
#define NEVER_INLINE __declspec(noinline)
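
A hedged usage sketch (the Connection type is my example, assuming the GCC/Clang definition above): declare the heavyweight destructor NEVER_INLINE so its body is emitted once rather than at every call site.

#include <string>
#include <vector>

#define NEVER_INLINE __attribute__((__noinline__)) // as above, GCC/Clang

struct Connection {
    std::vector<std::string> buffers;
    std::string peer;
    NEVER_INLINE ~Connection() {} // teardown stays out of callers' I-cache
};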


Defang NEVER_INLINE
Defang ALWAYS_INLINE
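
The code behind these two slides isn't reproduced here; one way to make the hints defangable for measurement (my sketch, with a made-up DISABLE_INLINE_HINTS switch, not necessarily what the deck shows) is:

// GCC/Clang
#ifndef DISABLE_INLINE_HINTS
#define ALWAYS_INLINE inline __attribute__((__always_inline__))
#define NEVER_INLINE __attribute__((__noinline__))
#else
#define ALWAYS_INLINE inline // defanged: an ordinary hint
#define NEVER_INLINE         // defanged: let the compiler decide
#endif

Building with -DDISABLE_INLINE_HINTS then shows what the forced hints are actually buying (or costing).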

Case Study: Custom shared_ptr

• The go-to solution for reference counting


• Optimized for a blend of needs, each with a
cost:
◦ Compulsive atomic refcounting
◦ Custom deleters
◦ Weak Pointer Support
• No support for intrusive reference counting
◦ Remember: first cache line is where it’s at
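
For contrast, an intrusive scheme (a sketch of mine, not code from the deck) keeps the count inside the object, in the first cache line its users touch anyway:

struct RefCounted {
    unsigned refs_ = 1; // first member: shares the object's first cache line
};

template <class T> // T derives from RefCounted; call with the concrete type
void incref(T& obj) { ++obj.refs_; }

template <class T>
void decref(T* obj) {
    if (obj && --obj->refs_ == 0) delete obj;
}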



Atomics Matter

• Atomic inc/dec: 2.5x–5x slower


• 40 years of optimizing ++/-- ripples
• 4 years of optimizing atomic inc/dec ripples
• Post inlining of course


But. . . But. . . Unwitting Sharing?

• Store the thread id at the first access to the smart ptr


◦ Debug mode only
• Compare it with new access
• assert on mismatch
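
A minimal sketch of that debug check (my code; the deck shows no implementation here):

#include <cassert>
#include <thread>

class ThreadChecker {
#ifndef NDEBUG
    mutable std::thread::id owner_{}; // empty until the first access
public:
    void check() const {
        // Racy by design, but good enough for a debug-only assert.
        if (owner_ == std::thread::id{}) owner_ = std::this_thread::get_id();
        assert(owner_ == std::this_thread::get_id()
               && "smart pointer touched from a second thread");
    }
#else
public:
    void check() const {} // release builds: zero overhead
#endif
};

Embed a ThreadChecker in the smart pointer and call check() in every member.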



Classic Implementation
• Let’s assume non-intrusive for now

template <class T>
class SingleThreadPtr {
    T* p_;
    unsigned* c_;
public:
    SingleThreadPtr() : p_(nullptr), c_(nullptr) {
    }
    SingleThreadPtr(T* p)
        : p_(p)
        , c_(p ? new unsigned(1) : nullptr) {
    }
    SingleThreadPtr(const SingleThreadPtr& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        if (c_) ++*c_;
    }
    ...


Classic Implementation (cont’d)

    SingleThreadPtr(SingleThreadPtr&& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        rhs.p_ = nullptr;
        rhs.c_ = nullptr;
    }
    ~SingleThreadPtr() {
        if (c_ && --*c_ == 0) {
            delete p_;
            delete c_;
        }
    }



Herb’s Talk “Atomic Weapons”

• Focus on MT
• Use atomic<unsigned>* for c_
• Use fetch_add(1,memory_order_relaxed)
for ++
• fetch_sub(1,memory_order_acq_rel) for --
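
A sketch of those two operations on the classic layout, with std::atomic<unsigned> as the counter (my transcription of the recipe; MultiThreadPtr is my name, not the deck's):

#include <atomic>

template <class T>
class MultiThreadPtr {
    T* p_;
    std::atomic<unsigned>* c_;
public:
    explicit MultiThreadPtr(T* p)
        : p_(p), c_(p ? new std::atomic<unsigned>(1) : nullptr) {}
    MultiThreadPtr(const MultiThreadPtr& rhs) : p_(rhs.p_), c_(rhs.c_) {
        // Increment needs atomicity only, no ordering: relaxed suffices.
        if (c_) c_->fetch_add(1, std::memory_order_relaxed);
    }
    ~MultiThreadPtr() {
        // Decrement orders this owner's writes before the deleter's reads.
        if (c_ && c_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete p_;
            delete c_;
        }
    }
};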


Task

Make this faster



(Figure omitted. Source: “Down for the Count?”, R. Shahriyar et al.)

Observation

• Many refcounts are 0 or 1


• C++ legacy code in particular!
◦ People avoided auto_ptr
◦ tr1::shared_ptr closest portable
alternative
• Some designs use shared_ptr instead of
unique_ptr for future flexibility (rightly or
wrongly)



Tip #2: Lazy Refcount Allocation

template <class T>
class SingleThreadPtr {
    T* p_;
    mutable unsigned* c_;
public:
    SingleThreadPtr() : p_(nullptr), c_(nullptr) {}
    SingleThreadPtr(T* p) : p_(p), c_(nullptr) {}
    SingleThreadPtr(const SingleThreadPtr& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        if (!p_) return;
        if (!c_) {
            // First copy: allocate the count lazily, already at 2 owners.
            c_ = rhs.c_ = new unsigned(2);
        } else {
            ++*c_;
        }
    }
    ...

Tip #2 (cont’d)

    ...
    SingleThreadPtr(SingleThreadPtr&& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        rhs.p_ = nullptr;
        //rhs.c_ = nullptr; // UNNEEDED
    }
    ~SingleThreadPtr() {
        if (!p_) return;
        if (!c_) {
        soSueMe: delete p_;
        } else if (--*c_ == 0) {
            delete c_;
            goto soSueMe;
        }
    }



Tip #2 (alternative)

    ...
    SingleThreadPtr(SingleThreadPtr&& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        rhs.p_ = nullptr;
        rhs.c_ = nullptr; // NEEDED
    }
    ~SingleThreadPtr() {
        if (!c_) {
        soSueMe: delete p_;
        } else if (--*c_ == 0) {
            delete c_;
            goto soSueMe;
        }
    }

• Folds the null test into the delete call (delete on a null pointer is a no-op)


Performance Dynamics

• One ref: p_ && (!c_ || *c_ == 1)


• Many refs: p_ && c_ && *c_ > 1
• No deallocation of c_ going down
◦ Avoid thrashing on transitions 1 ↔ 2
• We’re not above goto
◦ Dtor still a tad larger
• Ctors smaller, use zero-init
• Can control #copies better than #creations



Tip #3: Skip Last Decrement

template <class T>
class SingleThreadPtr {
    ...
    ~SingleThreadPtr() {
        if (!p_) return;
        if (!c_) {
        soSueMe: delete p_;
        } else if (*c_ == 1) {
            delete c_;
            goto soSueMe;
        } else {
            --*c_;
        }
    }
};


Motivation

• Most objects have low refcounts


• The last decrement is a large share of all
decrements
• Avoid dirtying memory on moribund objects
• Replace the interlocked decrement with an
atomic read
◦ On x86, aligned reads are atomic anyway!
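
Continuing the MultiThreadPtr sketch from earlier (still my code, not the deck's), the destructor reads first and skips the read-modify-write when this is the last owner:

    ~MultiThreadPtr() {
        if (!c_) return;
        // Sole owner: a plain atomic load, no interlocked decrement,
        // no dirtied cache line.
        if (c_->load(std::memory_order_acquire) == 1) {
            delete p_;
            delete c_;
        } else if (c_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete p_;
            delete c_;
        }
    }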



Performance Dynamics

• Dtor got a tad larger


• Competition with delete
◦ If expensive, one decref won’t matter
◦ See coming Tip
• May help deleting old unused objects
◦ One less dirty page
• Generally worth the extra test
• YMMV


Tip #4: Prefer Zero of All

• Zero is “special” to the CPU


• Special assignment
• Special comparisons
• E.g. in an enum, make 0 the most frequent
value



Tip #4: Prefer Zero of All

    ...
    SingleThreadPtr(const SingleThreadPtr& rhs)
        : p_(rhs.p_)
        , c_(rhs.c_) {
        if (!p_) return;
        if (!c_) {
            // The count now stores (#owners - 1), so the hot checks
            // compare against zero.
            c_ = rhs.c_ = new unsigned(1);
        } else {
            ++*c_;
        }
    }
    ...
...


Tip #4: Prefer Zero of All

    ...
    ~SingleThreadPtr() {
        if (!p_) return;
        if (!c_) {
        soSueMe: delete p_;
        } else if (*c_ == 0) { // no owners besides us: we are the last
            delete c_;
            goto soSueMe;
        } else {
            --*c_;
        }
    }
    ...
...



Performance Dynamics

• Code is not faster!


◦ Test is 1 cycle or less either way
• Code is smaller
• Most often inc/decref inlined
• Effect on I-Cache may become noticeable

• Weird sub-tip: make default state all zeros


◦ http://goo.gl/WZH0BS


True story: > 0.5%

enum class A { foo, bar };
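
Applying the tip to that enum might look like this (my illustration; which enumerator was actually the hot one isn't stated in the deck):

// Suppose bar is what the hot path tests for: give it the value 0.
enum class A { bar, foo }; // bar == 0

bool isBar(A a) { return a == A::bar; } // a comparison against zero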



Tip #5: Use Dedicated Allocators

• No generic allocator handles small allocs well


• Keep all refcounts together
• Heap with 1 control bit per counter
◦ Only 3.125% size overhead for 32-bit counters (1 bit per 32)
◦ Cache-friendly control bit
• Alternative: freelists
◦ No per-allocation overhead
◦ Odd cache friendliness patterns
◦ Require pointer-sized count
representation
• Best: intrusive
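
A minimal freelist sketch for the counters (my code; note how a recycled slot must be able to hold a next pointer, hence the pointer-sized count representation mentioned above; not thread-safe, matching SingleThreadPtr):

class CounterFreelist {
    union Node { Node* next; unsigned count; };
    Node* head_ = nullptr;
public:
    unsigned* allocate() {
        if (head_) {               // hot path: pop a recycled slot
            Node* n = head_;
            head_ = n->next;
            n->count = 1;
            return &n->count;
        }
        Node* n = new Node;        // cold path: grab a fresh node
        n->count = 1;
        return &n->count;
    }
    void deallocate(unsigned* c) { // push the slot back; never hits free()
        Node* n = reinterpret_cast<Node*>(c);
        n->next = head_;
        head_ = n;
    }
};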


Tip #6: Use Smaller Counters

• Vast majority of objects: < 16 refs


• Prefer 16- or 8-bit counters
• Saturate them (with hysteresis)
• On saturation: leak!
◦ Such objects are long-lived anyway
◦ You may have cycles anyway
◦ Log a leakage report on exit

• Intrusive: just use whatever bits available
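
A sketch of a saturating 8-bit counter (mine, not from the deck): once it pegs at the maximum the object is treated as immortal and deliberately leaked.

#include <cstdint>

inline void incref(std::uint8_t& c) {
    if (c != UINT8_MAX) ++c;          // saturate instead of wrapping
}

inline bool decref(std::uint8_t& c) { // true => caller deletes the object
    if (c == UINT8_MAX) return false; // pegged: never counts back down; leak on purpose
    return --c == 0;
}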



Summary


To Paraphrase John Lennon

♪ You may say I am special

But I’m not the only one. . . ♪

