
Make your custom .NET GC
"whys" and "hows"
Konrad Kokosa

1 / 117
Welcome To The World Of Custom GCs!

2 / 117
Welcome To The World Of Custom GCs!
which in .NET hardly exists so far...

3 / 117
Java

4 / 117
Java
-server -Xms24G -Xmx24G -XX:PermSize=512m -XX:+UseG1GC
-XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20
-XX:ConcGCThreads=5
-XX:InitiatingHeapOccupancyPercent=70

5 / 117
Java
-server -Xms24G -Xmx24G -XX:PermSize=512m -XX:+UseG1GC
-XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20
-XX:ConcGCThreads=5
-XX:InitiatingHeapOccupancyPercent=70

or...

-server -Xss4096k -Xms12G -Xmx12G -XX:MaxPermSize=512m


-XX:+HeapDumpOnOutOfMemoryError -verbose:gc -Xmaxf1
-XX:+UseCompressedOops -XX:+DisableExplicitGC -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:CMSFullGCsBeforeCompaction=10
-XX:CMSInitiatingOccupancyFraction=80 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing -XX:+CMSParallelRemarkEnabled
-XX:GCTimeRatio=19 -XX:+UseAdaptiveSizePolicy
-XX:MaxGCPauseMillis=500 -XX:+PrintGCTaskTimeStamps
-XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCApplicationConcurrentTime
-XX:+PrintTenuringDistribution -Xloggc:gc.log

6 / 117
Cargo cult configuring.

7 / 117
But why different/custom GCs at all?!

8 / 117
Jack of all trades is master of none.
9 / 117
Different workloads, different applications, different expectations...

10 / 117
11 / 117
"Simple" knobs

12 / 117
"Simple" knobs

GC modes

13 / 117
Workstation vs. Server Mode

14 / 117
Workstation:
- designed mostly for responsiveness needed in interactive, UI-based applications
- pauses as short as possible
- good citizen in the whole interactive environment

Server:
- designed for simultaneous, request-based processing applications
- big throughput (pauses may be unpredictable; final throughput is what matters)
- "give me all" citizen in the system

15 / 117
gc.cpp has <40 kLOC of C++

.\src\gc\gcsvr.cpp defines SERVER_GC constant and SVR namespace:

#define SERVER_GC 1
namespace SVR {
#include "gcimpl.h" // <-- defines MULTIPLE_HEAPS
#include "gc.cpp"
}

.\src\gc\gcwks.cpp defines WKS namespace:

namespace WKS {
#include "gcimpl.h"
#include "gc.cpp"
}

16 / 117
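The same compile-twice trick in miniature (an illustrative sketch with made-up names, not the actual CoreCLR sources): one function body is compiled once per namespace, and a macro switches its behavior, just as gcimpl.h/gc.cpp are compiled once for SVR and once for WKS:

```cpp
#include <cstddef>

// Imagine this body lives in a shared include, like gc.cpp does.
namespace SVR {
#define MULTIPLE_HEAPS
size_t heap_count() {
#ifdef MULTIPLE_HEAPS
    return 8;   // server GC: one heap per core (8 is just an example)
#else
    return 1;
#endif
}
#undef MULTIPLE_HEAPS
}

namespace WKS {
// MULTIPLE_HEAPS not defined here
size_t heap_count() {
#ifdef MULTIPLE_HEAPS
    return 8;
#else
    return 1;   // workstation GC: a single heap
#endif
}
}
```

The result is two independent copies of the same code, selectable at runtime by namespace.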
and then the whole gc.cpp begins...

heap_segment* gc_heap::get_segment_for_loh (size_t size
#ifdef MULTIPLE_HEAPS
, gc_heap* hp
#endif //MULTIPLE_HEAPS
)
{
#ifndef MULTIPLE_HEAPS
gc_heap* hp = 0;
#endif //MULTIPLE_HEAPS
heap_segment* res = hp->get_segment (size, TRUE);

17 / 117
Non-Concurrent vs. Concurrent Mode

18 / 117
Non-Concurrent:
- "stop the world" - all managed threads are suspended
- no work, no allocations, no nothing...
- optimal as there is no floating garbage, everything is collected

Concurrent:
- some parts of the GC run concurrently with managed threads
- normal work possible (mostly)
- produces some floating garbage
- no concurrent compacting

19 / 117
.\src\gc\gc.cpp consumes BACKGROUND_GC constant
always defined in both SVR and WKS versions
dynamic flag checked

void GCStatistics::AddGCStats(const gc_mechanisms& settings, size_t timeInMSec)


{
#ifdef BACKGROUND_GC
if (settings.concurrent)
{
bgc.Accumulate((uint32_t)timeInMSec*1000);
cntBGC++;
}
else if (settings.background_p)
{
// ...

20 / 117
            | Concurrent (false)         | Concurrent (true)
Workstation | Non-Concurrent Workstation | Background Workstation
Server      | Non-Concurrent Server      | Background Server

21 / 117
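Both axes of this matrix are plain configuration switches. For example, in the classic app.config style that this deck also uses for the other knobs (on .NET Core the equivalent settings live in runtimeconfig.json):

```xml
<configuration>
  <runtime>
    <gcServer enabled="true"/>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
```

This particular combination selects Non-Concurrent Server mode.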
Additional GC knobs:
GCNoAffinitize and GCHeapAffinitizeMask:

<configuration>
<runtime>
<gcServer enabled="true"/>
<GCHeapCount enabled="6"/>
<GCNoAffinitize enabled="true"/>
<GCHeapAffinitizeMask enabled="144"/>
</runtime>
</configuration>

Latency Modes

Latency Optimization Goals

CoreCLR comment: "Latency modes required user to have specific GC knowledge (e.g., budget, full-blocking GC). We are trying to move away from them as it makes a lot more sense for users to tell us what's the most important out of the performance aspects that make sense to them"

VM Hoarding

GCSettings.LargeObjectHeapCompactionMode

23 / 117
CLR Hosting

24 / 117
CLR Hosting
host your own .NET inside a process:
to be able to call managed code inside - e.g. SQL Server
to customize CLR runtime (including memory management)

25 / 117
26 / 117
Most interesting for us:

ICLRGCManager2:
SetGCStartupLimitsEx - sets the size of GC segments and the maximum
size of the gen0
IHostMemoryManager:
VirtualAlloc, VirtualFree, VirtualProtect, VirtualQuery - how CLR
operates on virtual memory
IHostMalloc:
Alloc/DebugAlloc, Free - native heap allocations

27 / 117
CLR Hosting 101
ICLRRuntimeHost* runtimeHost;
ICLRMetaHost *pMetaHost = nullptr;
ICLRRuntimeInfo *pRuntimeInfo = nullptr;
hr = CLRCreateInstance(CLSID_CLRMetaHost, IID_ICLRMetaHost,
(LPVOID*)&pMetaHost);
hr = pMetaHost->GetRuntime(L"v4.0.30319", IID_PPV_ARGS(&pRuntimeInfo));
hr = pRuntimeInfo->GetInterface(CLSID_CLRRuntimeHost, IID_ICLRRuntimeHost,
(LPVOID*)&runtimeHost);
ICLRControl* clrControl;
hr = runtimeHost->GetCLRControl(&clrControl);
DWORD dwReturn;
hr = runtimeHost->Start();
hr = runtimeHost->ExecuteInDefaultAppDomain(targetApp,
L"HelloWorld.Program",
L"Test", L"", &dwReturn);

28 / 117
CLR Hosting 101
ICLRGCManager2* clrGCManager;
hr = clrControl->GetCLRManager(IID_ICLRGCManager2, (void**)&clrGCManager);
SIZE_T segmentSize = 4ULL * 1024 * 1024 * 1024; // 4 GB; ULL avoids 32-bit int overflow
SIZE_T maxGen0Size = 4ULL * 1024 * 1024 * 1024;
hr = clrGCManager->SetGCStartupLimitsEx(segmentSize, maxGen0Size);

29 / 117
CLR Hosting 101
CustomHostControl customHostControl;
hr = runtimeHost->SetHostControl(&customHostControl);

...

class CustomHostControl : public IHostControl
{
virtual HRESULT GetHostManager(REFIID riid, void ** ppObject) override
{
if (riid == IID_IHostMemoryManager)
{
IHostMemoryManager *pMemoryManager = new CustomHostMemoryManager();
*ppObject = pMemoryManager;
return S_OK;
}
*ppObject = NULL;
return E_NOINTERFACE;
}
...

30 / 117
E.g. a page-locking manager:

class CustomHostMemoryManager : public IHostMemoryManager
{
virtual HRESULT VirtualAlloc(void * pAddress, SIZE_T dwSize, DWORD
flAllocationType, DWORD flProtect, EMemoryCriticalLevel eCriticalLevel,
void ** ppMem) override
{
void* result = ::VirtualAlloc(pAddress,
dwSize,
flAllocationType,
flProtect);
*ppMem = result;
BOOL locked = false;
if (flAllocationType & MEM_COMMIT)
{
locked = ::VirtualLock(*ppMem, dwSize);
}
return S_OK;
}
...
}

31 / 117
See: Non-paged CLR host project by Sasha Goldshtein and Alon Fliess at
https://archive.codeplex.com/?p=nonpagedclrhost

32 / 117
Custom GC
(aka Local GC)

33 / 117
34 / 117
35 / 117
What can be done with it?

36 / 117
What can be done with it?
Everything!

37 / 117
What can be done with it?
Everything!
Well... almost

38 / 117
Usage
Since .NET Core 2.1:

set COMPlus_GCName=f:\CoreCLR.ZeroGC\x64\Release\ZeroGC.dll

In .NET Core 2.0 (preview):

additionally required recompiling the runtime with the FEATURE_STANDALONE_GC feature enabled:

> build.cmd -buildstandalonegc

39 / 117
Implementing
regular C++ library (e.g. created in Visual Studio)
include only three files from CoreCLR:

#include "debugmacros.h"
#include "gcenv.base.h"
#include "gcinterface.h"

implement two simple exported methods:

GC_Initialize
GC_VersionInfo
implement the rest of the GC:
IGCHeap - responsible for... everything
IGCHandleManager and IGCHandleStore - responsible for handling...
handles

40 / 117
41 / 117
Is it difficult?

42 / 117
Is it difficult?
No, but it requires very deep knowledge of the runtime and... the GC

43 / 117
Implementing - cont.
extern "C" DLLEXPORT void
GC_VersionInfo(
/* Out */ VersionInfo* result
)
{
result->MajorVersion = GC_INTERFACE_MAJOR_VERSION;
result->MinorVersion = GC_INTERFACE_MINOR_VERSION;
result->BuildVersion = 0;
}

Specifying which GC API version our custom GC supports.

44 / 117
Implementing - cont.
extern "C" DLLEXPORT HRESULT
GC_Initialize(
/* In */ IGCToCLR* clrToGC,
/* Out */ IGCHeap** gcHeap,
/* Out */ IGCHandleManager** gcHandleManager,
/* Out */ GcDacVars* gcDacVars
)
{
IGCHeap* heap = new ZeroGCHeap(clrToGC);
IGCHandleManager* handleManager = new ZeroGCHandleManager();
*gcHeap = heap;
*gcHandleManager = handleManager;
return S_OK;
}

45 / 117
Implementing - cont.
Specifying pointers to our custom IGCHeap and IGCHandleManager implementations.

46 / 117
Implementing - cont.
Remembering IGCToCLR, as it provides such convenient APIs as:

SuspendEE and RestartEE methods for thread suspension
GcScanRoots for root scanning
GcStartWork and GcDone to inform the runtime

47 / 117
IGCHeap
class ZeroGCHeap : public IGCHeap
{
private:
IGCToCLR* gcToCLR;
public:
ZeroGCHeap(IGCToCLR* gcToCLR)
{
this->gcToCLR = gcToCLR;
}
// Inherited via IGCHeap
...
75 methods!
}

48 / 117
// Inherited via IGCHeap
virtual bool IsValidSegmentSize(size_t size) override;
virtual bool IsValidGen0MaxSize(size_t size) override;
virtual size_t GetValidSegmentSize(bool large_seg = false) override;
virtual void SetReservedVMLimit(size_t vmlimit) override;
virtual void WaitUntilConcurrentGCComplete() override;
virtual bool IsConcurrentGCInProgress() override;
virtual void TemporaryEnableConcurrentGC() override;
virtual void TemporaryDisableConcurrentGC() override;
virtual bool IsConcurrentGCEnabled() override;
virtual HRESULT WaitUntilConcurrentGCCompleteAsync(int millisecondsTimeout) ove
virtual bool FinalizeAppDomain(void* pDomain, bool fRunFinalizers) override;
virtual void SetFinalizeQueueForShutdown(bool fHasLock) override;
virtual size_t GetNumberOfFinalizable() override;
virtual bool ShouldRestartFinalizerWatchDog() override;
virtual Object* GetNextFinalizable() override;
virtual void SetFinalizeRunOnShutdown(bool value) override;
virtual int GetGcLatencyMode() override;
virtual int SetGcLatencyMode(int newLatencyMode) override;
virtual int GetLOHCompactionMode() override;
virtual void SetLOHCompactionMode(int newLOHCompactionMode) override;
virtual bool RegisterForFullGCNotification(uint32_t gen2Percentage, uint32_t lo
virtual bool CancelFullGCNotification() override;
virtual int WaitForFullGCApproach(int millisecondsTimeout) override;
virtual int WaitForFullGCComplete(int millisecondsTimeout) override;
virtual unsigned WhichGeneration(Object* obj) override;
virtual int CollectionCount(int generation, int get_bgc_fgc_coutn = 0) override
virtual int StartNoGCRegion(uint64_t totalSize, bool lohSizeKnown, uint64_t lo
virtual int EndNoGCRegion() override;
virtual size_t GetTotalBytesInUse() override;
virtual HRESULT GarbageCollect(int generation = -1, bool low_memory_p = false,
virtual unsigned GetMaxGeneration() override;
virtual void SetFinalizationRun(Object* obj) override;
49 / 117
virtual bool RegisterForFinalization(int gen, Object* obj) override;
virtual HRESULT Initialize() override;
virtual bool IsPromoted(Object* object) override;
virtual bool IsHeapPointer(void* object, bool small_heap_only = false) override
virtual unsigned GetCondemnedGeneration() override;
virtual bool IsGCInProgressHelper(bool bConsiderGCStart = false) override;
virtual unsigned GetGcCount() override;
virtual bool IsThreadUsingAllocationContextHeap(gc_alloc_context* acontext, int
virtual bool IsEphemeral(Object* object) override;
virtual uint32_t WaitUntilGCComplete(bool bConsiderGCStart = false) override;
virtual void FixAllocContext(gc_alloc_context* acontext, bool lockp, void* arg
virtual size_t GetCurrentObjSize() override;
virtual void SetGCInProgress(bool fInProgress) override;
virtual bool RuntimeStructuresValid() override;
virtual size_t GetLastGCStartTime(int generation) override;
virtual size_t GetLastGCDuration(int generation) override;
virtual size_t GetNow() override;
virtual Object* Alloc(gc_alloc_context* acontext, size_t size, uint32_t flags)
virtual Object* AllocLHeap(size_t size, uint32_t flags) override;
virtual Object* AllocAlign8(gc_alloc_context* acontext, size_t size, uint32_t
virtual void PublishObject(uint8_t* obj) override;
virtual void SetWaitForGCEvent() override;
virtual void ResetWaitForGCEvent() override;
virtual bool IsObjectInFixedHeap(Object* pObj) override;
virtual void ValidateObjectMember(Object* obj) override;
virtual Object* NextObj(Object* object) override;
virtual Object* GetContainingObject(void* pInteriorPtr, bool fCollectedGenOnly
virtual void DiagWalkObject(Object* obj, walk_fn fn, void* context) override;
virtual void DiagWalkHeap(walk_fn fn, void* context, int gen_number, bool walk_
virtual void DiagWalkSurvivorsWithType(void* gc_context, record_surv_fn fn, voi
virtual void DiagWalkFinalizeQueue(void* gc_context, fq_walk_fn fn) override;
virtual void DiagScanFinalizeQueue(fq_scan_fn fn, ScanContext* context) overrid

51 / 117
virtual void DiagScanHandles(handle_scan_fn fn, int gen_number, ScanContext* co
virtual void DiagScanDependentHandles(handle_scan_fn fn, int gen_number, ScanCo
virtual void DiagDescrGenerations(gen_walk_fn fn, void* context) override;
virtual void DiagTraceGCSegments() override;
virtual bool StressHeap(gc_alloc_context* acontext) override;
virtual segment_handle RegisterFrozenSegment(segment_info *pseginfo) override;
virtual void UnregisterFrozenSegment(segment_handle seg) override;
virtual void ControlEvents(GCEventKeyword keyword, GCEventLevel level) override
virtual void ControlPrivateEvents(GCEventKeyword keyword, GCEventLevel level) o
virtual void GetMemoryInfo(uint32_t * highMemLoadThreshold, uint64_t * totalPhy
virtual void SetSuspensionPending(bool fSuspensionPending) override;
virtual void SetYieldProcessorScalingFactor(uint32_t yieldProcessorScalingFacto

53 / 117
So, what MUST we implement?

54 / 117
55 / 117
56 / 117
57 / 117
Let's write a Minimum Viable Product - Zero GC
only allocating
no Garbage Collection at all

58 / 117
Zero GC
Most IGCHeap methods may be dummy:

bool ZeroGCHeap::RuntimeStructuresValid()
{
return true;
}

bool ZeroGCHeap::IsPromoted(Object * object)
{
return false;
}

unsigned ZeroGCHeap::GetCondemnedGeneration()
{
return 0;
}

59 / 117
Zero GC
IGCHeap::GarbageCollect

called by the runtime in rare cases:


GC.Collect
low-memory notification
not called by the GC itself

60 / 117
Zero GC
Trivial implementation:

HRESULT ZeroGCHeap::GarbageCollect(int generation, bool low_memory_p, int mode)
{
return NOERROR;
}

61 / 117
Zero GC
IGCHeap - allocations:

Object* ZeroGCHeap::Alloc(gc_alloc_context * acontext, size_t size, uint32_t flags)
{
// return address of a new object
// trigger GC if necessary
}

Object* ZeroGCHeap::AllocLHeap(size_t size, uint32_t flags)
{
// return address of a new object
// trigger GC if necessary
}

62 / 117
Zero GC
IGCHeap - allocations:

Object* ZeroGCHeap::Alloc(gc_alloc_context * acontext, size_t size, uint32_t flags)
{
// return address of a new object
}
Object* ZeroGCHeap::AllocLHeap(size_t size, uint32_t flags)
{
// return address of a new object
}

64 / 117
Zero GC
IGCHeap - allocations:

Object* ZeroGCHeap::Alloc(gc_alloc_context * acontext, size_t size, uint32_t flags)
{
size_t sizeWithHeader = size + sizeof(ObjHeader);
ObjHeader* address = (ObjHeader*)calloc(sizeWithHeader, sizeof(char));
return (Object*)(address + 1);
}

Object* ZeroGCHeap::AllocLHeap(size_t size, uint32_t flags)
{
size_t sizeWithHeader = size + sizeof(ObjHeader);
ObjHeader* address = (ObjHeader*)calloc(sizeWithHeader, sizeof(char));
return (Object*)(address + 1);
}

65 / 117
Zero GC
IGCHandleManager/IGCHandleStore - creating handles (pinning, strong, ...):

bool ZeroGCHandleManager::Initialize()
{
g_gcGlobalHandleStore = new ZeroGCHandleStore();
return true;
}

OBJECTHANDLE
ZeroGCHandleManager::CreateGlobalHandleOfType(Object * object, HandleType type)
{
return g_gcGlobalHandleStore->CreateHandleOfType(object, type);
}

int handlesCount = 0;
OBJECTHANDLE handles[65535];

OBJECTHANDLE
ZeroGCHandleStore::CreateHandleOfType(Object * object, HandleType type)
{
handles[handlesCount] = (OBJECTHANDLE__*)object;
return (OBJECTHANDLE)&handles[handlesCount++];
}

67 / 117
Zero GC
IGCHandleManager - storing handles:

void
ZeroGCHandleManager::StoreObjectInHandle(OBJECTHANDLE handle, Object * object)
{
Object** handleObj = (Object**)handle;
*handleObj = object;
}

bool
ZeroGCHandleManager::StoreObjectInHandleIfNull(OBJECTHANDLE handle, Object* object
{
Object** handleObj = (Object**)handle;
if (*handleObj == NULL)
{
*handleObj = object;
return true;
}
return false;
}

68 / 117
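A side note on why the handle store above works: a handle is a pointer to a slot holding the object pointer, not a pointer to the object itself, so a moving GC can later retarget the slot while the runtime keeps a stable OBJECTHANDLE. A minimal sketch (types and names are invented for illustration):

```cpp
#include <cstddef>

// A handle is a pointer-to-slot; dereferencing it yields the current object.
struct Object { int payload; };
typedef Object** SketchHandle;

static Object* g_slots[16];
static int g_next = 0;

SketchHandle create_handle(Object* obj) {
    g_slots[g_next] = obj;          // the slot stores the object pointer
    return &g_slots[g_next++];      // the handle is the slot's address
}

Object* deref(SketchHandle h) { return *h; }

// What a compacting GC would do when it moves the object:
void relocate(SketchHandle h, Object* newAddress) {
    *h = newAddress;
}
```

The extra indirection is exactly what Zero GC's `(OBJECTHANDLE)&handles[handlesCount++]` expression builds.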
And that's mostly all!
Complete Calloc-based implementation:

https://github.com/kkokosa/CoreCLR.ZeroGC

69 / 117
"Mostly"

70 / 117
Caveat #1 - write barriers

71 / 117
Remembered sets (card tables)

72 / 117
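The assembly on the next slide boils down to the following C-level logic: store the reference, check whether it points into the ephemeral (young) range, and if so dirty the 2 KB card covering the updated slot. This is an illustrative sketch with invented names; the real barrier bakes the bounds and card-table address into patched machine code:

```cpp
#include <cstdint>
#include <cstddef>

struct BarrierState {
    uint8_t* heap_base;       // base used to index the card table in this sketch
    uint8_t* ephemeral_low;   // the patched "Lower" constant
    uint8_t* ephemeral_high;  // the patched "Upper" constant
    uint8_t* card_table;      // one byte per 2 KB card
};

void write_barrier(BarrierState* s, void** slot, void* ref) {
    *slot = ref;  // the actual reference store (mov [rcx], rdx)
    uint8_t* p = (uint8_t*)ref;
    // Only stores of ephemeral objects are interesting to remember.
    if (p < s->ephemeral_low || p >= s->ephemeral_high)
        return;
    // Dirty the card covering the slot (shr rcx, 0Bh => 2 KB cards).
    size_t card = (size_t)((uint8_t*)slot - s->heap_base) >> 11;
    if (s->card_table[card] != 0xFF)
        s->card_table[card] = 0xFF;
}
```

A dirty card tells the next ephemeral collection: "scan this 2 KB region, it may contain old-to-young references."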
LEAF_ENTRY JIT_WriteBarrier_PostGrow64, _TEXT
align 8
mov [rcx], rdx
NOP_3_BYTE ; padding for alignment of constant
PATCH_LABEL JIT_WriteBarrier_PostGrow64_Patch_Label_Lower
mov rax, 0F0F0F0F0F0F0F0F0h
; Check the lower and upper ephemeral region bounds
cmp rdx, rax
jb Exit
nop ; padding for alignment of constant

PATCH_LABEL JIT_WriteBarrier_PostGrow64_Patch_Label_Upper
mov r8, 0F0F0F0F0F0F0F0F0h
cmp rdx, r8
jae Exit
nop ; padding for alignment of constant

PATCH_LABEL JIT_WriteBarrier_PostGrow64_Patch_Label_CardTable
mov rax, 0F0F0F0F0F0F0F0F0h
; Touch the card table entry, if not already dirty.
shr rcx, 0Bh
cmp byte ptr [rcx + rax], 0FFh
jne UpdateCardTable
REPRET
UpdateCardTable:
mov byte ptr [rcx + rax], 0FFh
ret
align 16
Exit:
REPRET
LEAF_END_MARKED JIT_WriteBarrier_PostGrow64, _TEXT

73 / 117
Zero GC
IGCHeap - fooling write barriers:

HRESULT ZeroGCHeap::Initialize()
{
// Not used currently
MethodTable* freeObjectMethodTable = gcToCLR->GetFreeObjectMethodTable();

WriteBarrierParameters args = {};
args.operation = WriteBarrierOp::Initialize;
args.is_runtime_suspended = true;
args.requires_upper_bounds_check = false;
args.card_table = new uint32_t[1];
// Inverted bounds (lowest > highest): no pointer ever passes the
// ephemeral range check, so the barrier never touches the card table
args.lowest_address = reinterpret_cast<uint8_t*>(~0);
args.highest_address = reinterpret_cast<uint8_t*>(1);
args.ephemeral_low = reinterpret_cast<uint8_t*>(~0);
args.ephemeral_high = reinterpret_cast<uint8_t*>(1);
gcToCLR->StompWriteBarrier(&args);
return NOERROR;
}

77 / 117
Zero GC
IGCHeap - fooling write barriers:

Still:

requires Workstation GC mode - Server GC injects JIT_WriteBarrier_SVR64 that omits ephemeral checks and crashes the runtime :(

79 / 117
Zero GC - Calloc-based - applied
> dotnet new webapi -o CoreCLR.WebApi

[HttpGet]
public IEnumerable<string> Get()
{
return new string[] { DateTime.Now.ToLongTimeString(), "value2" };
}

> dotnet build -c Release
> set COMPlus_GCName=f:\CoreCLR.ZeroGC\x64\Release\ZeroGC.dll
> dotnet run -c Release

80 / 117
Zero GC applied - results
.NET Core 2.1 with Zero GC:

.NET Core 2.1:

81 / 117
Zero GC applied - results
.NET Core 2.1 with Zero GC:

~314 MB after 24k requests (~11kB/request)

83 / 117
What's next?

84 / 117
What's next?
Calloc-based allocator is slow (each object triggers an OS call and memory zeroing)

85 / 117
What's next?
Calloc-based allocator is slow (each object triggers an OS call and memory zeroing)

Bump-pointer allocator instead of slooow calloc

86 / 117
87 / 117
Allocator::Allocate(amount)
{
if (alloc_ptr + amount <= alloc_limit)
{
// This is the fast path - we have enough memory to bump the pointer
PTR result = alloc_ptr;
alloc_ptr += amount;
return result;
}
else
{
// This is the slow path - new allocation context will be created
...
}
}

88 / 117
Thread-affinity of the allocation context structure - ensured by the runtime

89 / 117
Bump-pointer GC allocator - step #1:
// Normally both SOH and LOH allocations go through there
Object * ZeroGCHeap::Alloc(
gc_alloc_context * acontext,
size_t size,
uint32_t flags)
{
// Per thread acontext...
// acontext->alloc_ptr
// acontext->alloc_limit
}

90 / 117
Bump-pointer GC allocator - step #2:
// Normally both SOH and LOH allocations go through there
Object * ZeroGCHeap::Alloc(
gc_alloc_context * acontext,
size_t size,
uint32_t flags)
{
uint8_t* result = acontext->alloc_ptr;
uint8_t* advance = result + size;
if (advance <= acontext->alloc_limit)
{
acontext->alloc_ptr = advance;
return (Object* )result;
}
...
}

91 / 117
Bump-pointer GC allocator - step #3:
// Normally both SOH and LOH allocations go through there
Object * ZeroGCHeap::Alloc(
gc_alloc_context * acontext,
size_t size,
uint32_t flags)
{
uint8_t* result = acontext->alloc_ptr;
uint8_t* advance = result + size;
if (advance <= acontext->alloc_limit)
{
acontext->alloc_ptr = advance;
return (Object* )result;
}
int growthSize = 16 * 1024 * 1024;
uint8_t* newPages = (uint8_t*)VirtualAlloc(NULL, growthSize,
MEM_RESERVE | MEM_COMMIT,
PAGE_READWRITE);
uint8_t* allocationStart = newPages;
acontext->alloc_ptr = allocationStart + size;
acontext->alloc_limit = newPages + growthSize;
return (Object*)(allocationStart);
}

92 / 117
Bump-pointer GC allocator - step #4:
// Normally both SOH and LOH allocations go through here
Object* ZeroGCHeap::Alloc(
    gc_alloc_context* acontext,
    size_t size,
    uint32_t flags)
{
    uint8_t* result = acontext->alloc_ptr;
    uint8_t* advance = result + size;
    if (advance <= acontext->alloc_limit)
    {
        acontext->alloc_ptr = advance;
        return (Object*)result;
    }
    int beginGap = 24;                  // leave a gap before the first object
    int growthSize = 16 * 1024 * 1024;
    uint8_t* newPages = (uint8_t*)VirtualAlloc(NULL, growthSize,
                                               MEM_RESERVE | MEM_COMMIT,
                                               PAGE_READWRITE);
    uint8_t* allocationStart = newPages + beginGap;
    acontext->alloc_ptr = allocationStart + size;
    acontext->alloc_limit = newPages + growthSize;
    return (Object*)allocationStart;
}
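Why a 24-byte `beginGap`? A plausible reading (an assumption, not stated on the slide): the first object must not start at the very beginning of the segment, and 24 bytes matches the minimum managed object footprint on 64-bit - object header, `MethodTable*`, and one pointer-sized payload slot:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the runtime's object header (8 bytes on 64-bit).
struct ObjHeader { uint32_t padding; uint32_t syncBlockIndex; };

// Minimum object footprint on 64-bit: header + MethodTable* + one payload slot.
constexpr size_t MinObjectSize64 =
    sizeof(ObjHeader) + sizeof(void*) + sizeof(void*);
```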

93 / 117
Bump-pointer GC allocator - let's ignore those LOHs (thread-safety!):
// This variation is used in the rare circumstance when you want to allocate
// an object on the large object heap but the object is not big enough to
// naturally go there.
Object* ZeroGCHeap::AllocLHeap(size_t size, uint32_t flags)
{
    // calloc is already thread-safe and hands back zeroed memory
    size_t sizeWithHeader = size + sizeof(ObjHeader);
    ObjHeader* address = (ObjHeader*)calloc(sizeWithHeader, sizeof(char*));
    return (Object*)(address + 1);
}

94 / 117
Caveat #2 - allocation context is reused by the runtime (JIT!)

95 / 117
96 / 117
Fast path in EE (not changeable)

; IN:  rcx: MethodTable*
; OUT: rax: new object
LEAF_ENTRY JIT_TrialAllocSFastMP_InlineGetThread, _TEXT
    mov edx, [rcx + OFFSET__MethodTable__m_BaseSize]
    ; m_BaseSize is guaranteed to be a multiple of 8.
    INLINE_GETTHREAD r11
    mov r10, [r11 + OFFSET__Thread__m_alloc_context__alloc_limit]
    mov rax, [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr]
    add rdx, rax                      ; rdx = alloc_ptr + base size
    cmp rdx, r10                      ; does it fit below alloc_limit?
    ja AllocFailed
    mov [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr], rdx  ; bump
    mov [rax], rcx                    ; store MethodTable* at the new object
    ret
AllocFailed:
    jmp JIT_NEW                       ; slow path, ends up in GCHeap::Alloc
LEAF_END JIT_TrialAllocSFastMP_InlineGetThread, _TEXT

97 / 117
98 / 117
So why does the calloc-based approach work?
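In short: the calloc-based GC never advances `alloc_ptr`/`alloc_limit`, so they stay at zero, the JIT's bump check always fails, and every allocation falls through `JIT_NEW` into `GCHeap::Alloc`. A minimal model (the names `alloc_context_model` and `FastPathFits` are illustrative, not runtime APIs):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal model of the JIT fast path: the bump check
// (add rdx, rax / cmp rdx, r10 / ja AllocFailed) succeeds only when the
// allocation fits between alloc_ptr and alloc_limit.
struct alloc_context_model { uint8_t* alloc_ptr; uint8_t* alloc_limit; };

bool FastPathFits(const alloc_context_model& ctx, size_t size)
{
    // pointer difference is 0 when both fields are still null
    return ctx.alloc_limit - ctx.alloc_ptr >= (ptrdiff_t)size;
}
```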

99 / 117
Fast path in EE (not changeable)

; IN:  rcx: MethodTable*
; OUT: rax: new object
LEAF_ENTRY JIT_TrialAllocSFastMP_InlineGetThread, _TEXT
    mov edx, [rcx + OFFSET__MethodTable__m_BaseSize]
    ; m_BaseSize is guaranteed to be a multiple of 8.
    INLINE_GETTHREAD r11
    mov r10, [r11 + OFFSET__Thread__m_alloc_context__alloc_limit]
    mov rax, [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr]
    add rdx, rax                      ; rdx = alloc_ptr + base size
    cmp rdx, r10                      ; with alloc_ptr == alloc_limit == 0,
                                      ; this check can never succeed...
    ja AllocFailed
    mov [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr], rdx
    mov [rax], rcx
    ret
AllocFailed:
    jmp JIT_NEW                       ; ...so we always end up here
LEAF_END JIT_TrialAllocSFastMP_InlineGetThread, _TEXT

100 / 117
JIT_NEW fall-back:

HCIMPL1(Object*, JIT_New, CORINFO_CLASS_HANDLE typeHnd_)
{
    ...
    TypeHandle typeHnd(typeHnd_);
    MethodTable* pMT = typeHnd.AsMethodTable();
    newobj = AllocateObject(pMT);
    return (OBJECTREFToObject(newobj));
}
HCIMPLEND

Object* AllocateObject(MethodTable* pMT)
{
    alloc_context* acontext = GetThread()->GetAllocContext();
    Object* pObject;
    size_t size = pMT->GetBaseSize();
    uint8_t* result = acontext->alloc_ptr;
    uint8_t* advance = result + size;
    if (advance <= acontext->alloc_limit)
    {
        // the fast path retried - still fails with a zeroed context
        acontext->alloc_ptr = advance;
        pObject = (Object*)result;
    }
    else
    {
        // this is where the custom GC's Alloc finally gets called
        pObject = g_theGCHeap->Alloc(acontext, size, 0);
        if (pObject == NULL) return NULL;
    }
    pObject->RawSetMethodTable(pMT);
    return pObject;
}
101 / 117
Bump-pointer GC allocator - step #4 repeated:
// Normally both SOH and LOH allocations go through here
Object* ZeroGCHeap::Alloc(
    gc_alloc_context* acontext,
    size_t size,
    uint32_t flags)
{
    uint8_t* result = acontext->alloc_ptr;
    uint8_t* advance = result + size;
    if (advance <= acontext->alloc_limit)
    {
        acontext->alloc_ptr = advance;
        return (Object*)result;
    }
    int beginGap = 24;                  // leave a gap before the first object
    int growthSize = 16 * 1024 * 1024;
    uint8_t* newPages = (uint8_t*)VirtualAlloc(NULL, growthSize,
                                               MEM_RESERVE | MEM_COMMIT,
                                               PAGE_READWRITE);
    uint8_t* allocationStart = newPages + beginGap;
    acontext->alloc_ptr = allocationStart + size;
    acontext->alloc_limit = newPages + growthSize;
    return (Object*)allocationStart;
}

102 / 117
Or ignore alloc_ptr and alloc_limit by using custom fields
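One way to do that can be sketched as follows - assuming (as in CoreCLR's `gc_alloc_context`) that the `gc_reserved_1`/`gc_reserved_2` fields are free for the GC's own bookkeeping; the struct here is a simplified stand-in and `AllocFromReserved` is an illustrative name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Stand-in for the runtime's gc_alloc_context (assumption: the real one
// reserves gc_reserved_1/gc_reserved_2 for the GC implementation's use).
struct gc_alloc_context_model
{
    uint8_t* alloc_ptr;     // left untouched - JIT fast path never succeeds
    uint8_t* alloc_limit;
    void*    gc_reserved_1; // our private bump pointer
    void*    gc_reserved_2; // our private limit
};

// Bump-pointer allocation kept entirely in the reserved fields; old regions
// are never freed (it is a zero GC, after all).
void* AllocFromReserved(gc_alloc_context_model* acontext, size_t size)
{
    uint8_t* ptr   = (uint8_t*)acontext->gc_reserved_1;
    uint8_t* limit = (uint8_t*)acontext->gc_reserved_2;
    if (ptr == nullptr || (size_t)(limit - ptr) < size)
    {
        const size_t growthSize = 16 * 1024 * 1024;
        ptr   = (uint8_t*)calloc(growthSize, 1); // zeroed, like fresh pages
        limit = ptr + growthSize;
        acontext->gc_reserved_2 = limit;
    }
    acontext->gc_reserved_1 = ptr + size;
    return ptr;
}
```

Since `alloc_ptr` and `alloc_limit` stay zero, every allocation still routes through the GC, while the real bump state lives where the runtime never looks.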

103 / 117
Zero GC bump pointer applied - results
.NET Core 2.1:

104 / 117
Zero GC applied - results

105 / 117
Important runtime support
Events:

gcToCLR->EventSink()->FireGCCreateSegment_V1(newPages, growthSize, 0);

Threading:

CreateThread(void (*threadStart)(void*), void* arg, bool is_suspendable, const char* name)
SuspendEE(SUSPEND_REASON reason)
RestartEE(bool bFinishedGC)
GcScanRoots(promote_func* fn, int condemned, int max_gen, ScanContext* sc)

Configuration:

GetBooleanConfigValue(const char* key, bool* value)
GetIntConfigValue(const char* key, int64_t* value)
GetStringConfigValue(const char* key, const char** value)
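A typical pattern for the configuration calls: keep a built-in default and overwrite it only when the key is present. The snippet below uses a self-contained stub in place of the EE-provided interface, and the key name "ZeroGCGrowthSize" is invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stub standing in for the config service the runtime hands to the GC;
// it mimics GetIntConfigValue(const char* key, int64_t* value).
struct StubConfig
{
    bool GetIntConfigValue(const char* key, int64_t* value)
    {
        if (std::strcmp(key, "ZeroGCGrowthSize") == 0)
        {
            *value = 16 * 1024 * 1024;  // pretend the user configured 16 MB
            return true;
        }
        return false;  // unknown key: caller keeps its default
    }
};

int64_t ReadGrowthSize(StubConfig* config)
{
    int64_t growthSize = 4 * 1024 * 1024;  // built-in default
    config->GetIntConfigValue("ZeroGCGrowthSize", &growthSize);
    return growthSize;
}
```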

106 / 117
What's next?

107 / 117
What's next?
... just draw the rest of the f***ing owl!

108 / 117
Question: Why should one even care?

109 / 117
Question: Why should one even care?
learning A LOT
having GREAT FUN

creating a customized, specialized GC

or the long-awaited concurrent, compacting GC (yeah, simple...)

110 / 117
Question: What about finalizers?

111 / 117
Question: What about finalizers?
Currently ignored!
Runtime still creates and maintains the finalization thread

Hmm... AFAIK there is currently no API to communicate with it...

112 / 117
Question: What about multiple GC heaps (like in Server GC)?

113 / 117
Question: What about multiple GC heaps (like in Server GC)?
Currently ignored!

One would need to implement it - core/heap affinity, heap balancing, ...

114 / 117
Literature:
The Garbage Collection Handbook (http://gchandbook.org) - Richard Jones,
Antony Hosking, Eliot Moss

Pro .NET Memory Management (https://prodotnetmemory.com) - Konrad Kokosa

http://tooslowexception.com/zero-garbage-collector-for-net-core/
http://tooslowexception.com/zero-garbage-collector-for-net-core-2-1-and-asp-net-core-2-1/
115 / 117
That's all! Thank you! Any questions?!

116 / 117
117 / 117
