partial parallelisation of genbcode, and code that it touches #5815


Closed



@mkeskells mkeskells commented Mar 30, 2017

GenBCode has internal phases:
worker1
optimisation
worker2
worker3

This parallelises optimisation (under some circumstances) with worker1; once worker1 has finished, multiple worker2 and worker3 instances can commence.

The unit of work for the parallelisation is changed to be a source file (it was a class).
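The pipelined structure described above can be sketched roughly as follows. All names here are hypothetical and not the PR's actual code: worker1 runs serially (it needs compiler state), and its per-source-file outputs are then processed by worker2/worker3 on a thread pool, with one task per source file.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical sketch of the pipeline: worker1 is serial; worker2/worker3
// then run in parallel, one task per source file (the new unit of work).
object PipelineSketch {
  final case class SourceUnit(name: String, classes: List[String])

  def worker1(u: SourceUnit): List[String] = u.classes.map(c => "tree:" + c)
  def worker2(item: String): String       = "bytes:" + item
  def worker3(item: String): String       = item // e.g. write the class file

  def compile(units: List[SourceUnit]): List[String] = {
    // Stage 1 runs on the calling thread, in order.
    val stage1 = units.map(worker1)
    // Stages 2 and 3 run concurrently; the unit of work is a source file.
    val futures = stage1.map(items => Future(items.map(i => worker3(worker2(i)))))
    futures.flatMap(f => Await.result(f, Duration.Inf))
  }
}
```

This is only a shape: the real backend also overlaps worker1 with the optimiser under some circumstances, which a plain two-stage sketch like this does not show.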

There are some modifications to the I/O patterns.

I/O is very expensive on Windows, so reducing the file operations reduces the stat calls.

Minor inlining changes to reduce memory usage.

Partial move to NIO for performance.

Changes to data structures for thread safety.

Small changes to the IO library for type refinement.

Added a -Y option to enable/disable parallel running.
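A rough illustration of the NIO and stat-reduction points above (hypothetical names, not the PR's actual code): write each class file with a single `Files.write` call, and create parent directories at most once per package, so repeated writes into the same package do not re-stat the directory.

```scala
import java.nio.file.{Files, Path}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch: one Files.write per class file, and parent
// directories created at most once per package to cut down on stat calls.
object ClassfileWriter {
  // Concurrent set so the cache is safe if workers write in parallel.
  private val createdDirs = ConcurrentHashMap.newKeySet[Path]()

  def write(outDir: Path, internalName: String, bytes: Array[Byte]): Path = {
    val target = outDir.resolve(internalName + ".class")
    val parent = target.getParent
    // add() returns true only for the first caller, so createDirectories
    // (and its underlying stats) runs once per package directory.
    if (createdDirs.add(parent)) Files.createDirectories(parent)
    Files.write(target, bytes)
  }
}
```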

Running benchmarks on a warmed-up VM, using sbt to compile akka-actor, I get the following times on a quad-core i7 laptop running Windows 10 with an SSD.
For Windows, in summary, this change reduces total compile time by about 20%, CPU usage by about 10%, and allocation by about 2%.

The Unix results will be posted shortly, but are not expected to be as dramatic.

Variance is based on 60 compile cycles, removing the first 10 as warmup. The compile target is akka-actor.

The tool used to measure these results will be contributed in #5760 and updated in https://github.com/rorygraves/scalac_perf/tree/2.12.x_profile2

Post-processing of the results is via https://github.com/rorygraves/perf_tester

results key
baseline - the 2.12.x branch snapped at the end of March

genBcodeBase[Enabled/Disabled] - the parallelization changes (with parallelization enabled/disabled)

genBcodeBase_BT[Enabled/Disabled] - the parallelization changes (with parallelization enabled/disabled), plus an optimization for BTypes descriptor generation, which is a separate commit

ALL - summary of all 60 compile cycles

after 10 90% - ignore the first 10 cycles, and the worst 10% of the remainder

after 10 90% JVM, no GC - additionally ignore data outside the JVM/GenBCode phase, and ignore results where a GC occurred during that phase

notes

This PR builds on the work in #5800, which is withdrawn; it is a squashed and tidied-up version of that PR.

Results were gathered on a quad-core i7, Windows 10, with SSD, running Norton AV (with exclusions around the dev area).
The Unix results are not quite as dramatic, and will be added shortly.

Windows results

ALL

                  RunName	                AllWallMS	                   CPU_MS	                Allocated
              00_baseline	 10725.42 [+2.87% -0.87%]	 10502.60 [+2.71% -0.88%]	  2799.19 [+1.10% -0.99%]
  01_genBcodeBaseDisabled	  9621.73 [+2.76% -0.88%]	  9470.83 [+2.72% -0.88%]	  2781.74 [+1.11% -1.00%]
       02_genBCodeEnabled	  8979.15 [+3.68% -0.83%]	  9666.15 [+3.15% -0.86%]	  2752.37 [+1.15% -0.99%]
03_genBcodeBaseDisabled_BT	  9549.15 [+2.71% -0.87%]	  9379.17 [+2.65% -0.87%]	  2761.93 [+1.11% -1.00%]
    04_genBCodeEnabled_BT	  8836.47 [+3.48% -0.84%]	  9590.63 [+3.08% -0.87%]	  2748.42 [+1.14% -0.99%]
after 10 90%

                  RunName	                AllWallMS	                   CPU_MS	                Allocated
              00_baseline	  9741.62 [+1.04% -0.96%]	  9611.46 [+1.04% -0.96%]	  2789.80 [+1.00% -1.00%]
  01_genBcodeBaseDisabled	  8815.88 [+1.04% -0.96%]	  8686.11 [+1.03% -0.96%]	  2773.25 [+1.00% -1.00%]
       02_genBCodeEnabled	  7906.04 [+1.05% -0.95%]	  8735.07 [+1.04% -0.95%]	  2740.46 [+1.00% -1.00%]
03_genBcodeBaseDisabled_BT	  8738.31 [+1.05% -0.95%]	  8606.60 [+1.04% -0.95%]	  2751.74 [+1.00% -1.00%]
    04_genBCodeEnabled_BT	  7875.92 [+1.04% -0.95%]	  8707.64 [+1.04% -0.96%]	  2736.24 [+1.00% -1.00%]
after 10 90% JVM, no GC

                  RunName	                AllWallMS	                   CPU_MS	                Allocated
              00_baseline	  2776.93 [+1.09% -0.93%]	  2753.47 [+1.08% -0.93%]	   604.19 [+1.00% -1.00%]
  01_genBcodeBaseDisabled	  1758.43 [+1.05% -0.96%]	  1736.98 [+1.03% -0.94%]	   589.91 [+1.00% -1.00%]
       02_genBCodeEnabled	  1141.89 [+1.08% -0.94%]	  2057.29 [+1.03% -0.94%]	   583.88 [+1.00% -1.00%]
03_genBcodeBaseDisabled_BT	  1685.44 [+1.10% -0.93%]	  1660.39 [+1.11% -0.93%]	   581.33 [+1.00% -1.00%]
    04_genBCodeEnabled_BT	  1108.42 [+1.09% -0.92%]	  2043.40 [+1.04% -0.95%]	   577.50 [+1.00% -1.00%]

def packageInternalName: String = {
lazy val (packageInternalName:String, simpleName: String) = {
Member

this pattern introduces a third field, scala/scala-dev#308

Contributor Author

Now there is a feature that I was not aware of. Reasonably easy to work around it, though.
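One possible shape of such a workaround for scala/scala-dev#308 (illustrative only, not the code in this PR): replace the destructuring `lazy val (a, b) = ...`, which keeps the whole tuple in a hidden third field, with independent lazy vals.

```scala
// scala/scala-dev#308: `lazy val (a, b) = expr` stores the tuple itself in a
// hidden extra field. Sketch of a workaround: two independent lazy vals, so
// only the two String fields are retained. Class and method names are
// hypothetical.
final class InternalNameParts(internalName: String) {
  private def sep = internalName.lastIndexOf('/')

  lazy val packageInternalName: String =
    if (sep < 0) "" else internalName.substring(0, sep)

  lazy val simpleName: String =
    if (sep < 0) internalName else internalName.substring(sep + 1)
}
```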

@@ -1077,6 +1052,13 @@ abstract class BTypes {
"scala/Null",
"scala/Nothing"
)

def apply(internalName: InternalName) : ClassBType = {
classBTypeFromInternalName.getOrElseUpdate(internalName, new ClassBType(internalName))
Member

We should probably make sure there's a single ClassBType per InternalName, also under concurrent access. I haven't checked in detail if something depends on this assumption.
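One way to guarantee a single ClassBType per InternalName under concurrent access (a sketch under that assumption, not the PR's actual code) is to back the cache with a `ConcurrentHashMap` and use `computeIfAbsent`, which is atomic per key, instead of a plain `getOrElseUpdate`:

```scala
import java.util.concurrent.ConcurrentHashMap

// Sketch: computeIfAbsent is atomic per key, so two threads asking for the
// same internal name always observe the same ClassBType instance, unlike a
// non-atomic check-then-insert. ClassBType here is a stand-in, not the real
// backend class.
final class ClassBType(val internalName: String)

object ClassBTypeCache {
  private val cache = new ConcurrentHashMap[String, ClassBType]()

  def apply(internalName: String): ClassBType =
    cache.computeIfAbsent(internalName, n => new ClassBType(n))
}
```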


lrytz commented Mar 31, 2017

The above are just two random comments; before going into more detail, let's make a plan for how to get this in.

First, thanks @mkeskells for the PR, this is going to improve compiler performance a lot!

I'm a little worried about the change in its current form because of the existing code structure in GenBCode. I would very much like to clean this up before doing such a substantial change and making the code even harder to follow.

The current pattern of splitting up the backend into "components" is a bit of a red herring, because it basically puts everything in a hierarchy of traits, but ultimately everything is bunched into a single class GenBCode.

[screenshot: the GenBCode trait/class hierarchy]

I'd like to use more composition instead of inheritance, like we already do for BTypes and the optimizer.

Second, I'd like to separate the parts of the backend that can access global (and therefore Symbol, Type) from the rest, which can be parallelized. Again, we already have this for BTypes and the optimizer.

Since you have some deep experience with the backend now, maybe you have other suggestions?

I can start working on this refactoring next week.
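As a rough illustration of the composition idea (all names here are hypothetical, not the eventual refactoring): the Global-facing part stays on the compiler thread, and hands plain data to a post-processor that workers can use without touching compiler internals.

```scala
// Hypothetical sketch of composition over inheritance in the backend: the
// frontend-facing part may touch Global (Symbols, Types) and runs on the
// compiler thread; the post-processor only sees plain data, so it is safe
// to call from worker threads.
final case class GeneratedClass(internalName: String, bytes: Array[Byte])

trait FrontendAccess {                  // compiler-thread only
  def generate(): List[GeneratedClass]
}

final class PostProcessor {             // composable, parallelizable part
  def process(c: GeneratedClass): Int = c.bytes.length // e.g. write to disk
}
```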

@lrytz lrytz mentioned this pull request Mar 31, 2017
@mkeskells

@lrytz the real expert on this is @retronym, and I would defer to him on the changes and structure of the files.
I did find it hard to navigate when I was doing this work, though.

I did have an earlier version where I extracted the components into separate files to work on and then attempted to isolate global access, but ran into a few issues:

  1. it would make it hard to review the changes
  2. there are lots of things that have access patterns to global, and my drive was for performance rather than structure

I do think that there are some bits that we can easily lift (maybe in a separate PR), e.g. changes to the IO lib, and descriptor generation in BTypes.

Happy to discuss this on a call, or via email, but I think that we need to talk to @retronym; I know that he has other changes in this area.

I also have another change that affects this area, based on @retronym's use of per-run settings. I think that could also be done before considering the restructure, as it is simple point fixes and would be easier to consider now than to track after the rework. It gives further CPU and memory reductions.

I will submit this per-run change as a PR on Sunday/Monday if I get the time.

I also note that this PR is showing errors; I hope to look at that in the same timeframe.

@mkeskells

/rebuild

@mkeskells

Some parts of this have been done in other PRs. The parallelism will be addressed after the refactor that @lrytz is looking to do. @lrytz, what is the timeframe for this to complete?

@mkeskells mkeskells closed this May 24, 2017

lrytz commented May 26, 2017

@mkeskells I plan to work on this after Copenhagen, I hope to have it done in 3 weeks.
