twillouer/ToArrayBench.java

jerrinot · 2014-11-18T17:17:12Z

that's interesting one. I've simplified your setup to include just the two cases I most interested in:

@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;
    private ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[0]);
        bh.consume(bytes);
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[size]);
        bh.consume(bytes);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
//                        .addProfiler(LinuxPerfProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

and indeed the results are counter-intuitive:

Benchmark                                      (size)   Mode  Samples      Score  Score error   Units
o.o.j.s.ToArrayBench.sized_array_fixed_size        10  thrpt       20  29170.782     2361.987  ops/ms
o.o.j.s.ToArrayBench.zero_sized_array              10  thrpt       20  41963.943     2474.507  ops/ms

however when I change the sized_array_fixed_size() method to use constant there results are more aligned with expectation:

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[10]);
        bh.consume(bytes);
    }

Benchmark                                      (size)   Mode  Samples      Score  Score error   Units
o.o.j.s.ToArrayBench.sized_array_fixed_size        10  thrpt       20  47657.534      607.251  ops/ms
o.o.j.s.ToArrayBench.zero_sized_array              10  thrpt       20  43036.401      716.582  ops/ms

I can see the version the original version has lower instructions per cycle count:
This is the original sized_array_fixed_size():

   23289.588754 task-clock (msec)         #    0.639 CPUs utilized          
            14,374 context-switches          #    0.617 K/sec                  
             3,054 cpu-migrations            #    0.131 K/sec                  
               453 page-faults               #    0.019 K/sec                  
    79,254,722,599 cycles                    #    3.403 GHz                     [31.19%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   187,817,995,702 instructions              #    2.37  insns per cycle         [38.93%]
    36,922,468,827 branches                  # 1585.364 M/sec                   [38.93%]
        51,694,474 branch-misses             #    0.14% of all branches         [38.71%]
    46,459,233,680 L1-dcache-loads           # 1994.850 M/sec                   [38.77%]
       654,546,759 L1-dcache-load-misses     #    1.41% of all L1-dcache hits   [38.58%]
        57,970,689 LLC-loads                 #    2.489 M/sec                   [30.90%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
         8,075,209 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.19%]
    46,181,846,166 dTLB-loads:HG             # 1982.940 M/sec                   [31.09%]
           720,667 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.05%]
        32,330,750 iTLB-loads:HG             #    1.388 M/sec                   [31.14%]
           637,481 iTLB-load-misses:HG       #    1.97% of all iTLB cache hits  [31.39%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.40%]

      36.442513922 seconds time elapsed

vs. the zero_sized_array():

  23279.164686 task-clock (msec)         #    0.639 CPUs utilized          
            15,372 context-switches          #    0.660 K/sec                  
             3,401 cpu-migrations            #    0.146 K/sec                  
               532 page-faults               #    0.023 K/sec                  
    78,414,546,516 cycles                    #    3.368 GHz                     [31.04%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   260,885,594,417 instructions              #    3.33  insns per cycle         [39.01%]
    49,221,494,283 branches                  # 2114.401 M/sec                   [39.16%]
        69,222,874 branch-misses             #    0.14% of all branches         [39.19%]
    62,745,594,182 L1-dcache-loads           # 2695.354 M/sec                   [39.11%]
     1,121,079,378 L1-dcache-load-misses     #    1.79% of all L1-dcache hits   [38.64%]
        99,591,088 LLC-loads                 #    4.278 M/sec                   [30.57%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,673,042 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [30.94%]
    63,298,980,413 dTLB-loads:HG             # 2719.126 M/sec                   [31.00%]
         1,173,107 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [32.64%]
        32,181,074 iTLB-loads:HG             #    1.382 M/sec                   [32.58%]
           463,867 iTLB-load-misses:HG       #    1.44% of all iTLB cache hits  [32.50%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [32.42%]

      36.426116098 seconds time elapsed

The new version of sized_array_fixed_size() with constant has IPC similar to the the zero_sized_array():

      23346.537627 task-clock (msec)         #    0.639 CPUs utilized          
            14,690 context-switches          #    0.629 K/sec                  
             3,524 cpu-migrations            #    0.151 K/sec                  
               502 page-faults               #    0.022 K/sec                  
    79,594,753,008 cycles                    #    3.409 GHz                     [30.88%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   254,160,699,602 instructions              #    3.19  insns per cycle         [38.88%]
    50,744,484,027 branches                  # 2173.534 M/sec                   [38.94%]
        70,854,819 branch-misses             #    0.14% of all branches         [39.08%]
    62,767,371,751 L1-dcache-loads           # 2688.509 M/sec                   [39.04%]
       908,061,562 L1-dcache-load-misses     #    1.45% of all L1-dcache hits   [38.96%]
        83,239,667 LLC-loads                 #    3.565 M/sec                   [30.64%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,329,198 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [30.71%]
    63,590,675,367 dTLB-loads:HG             # 2723.773 M/sec                   [30.74%]
           845,789 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [30.69%]
        29,044,917 iTLB-loads:HG             #    1.244 M/sec                   [31.01%]
           489,904 iTLB-load-misses:HG       #    1.69% of all iTLB cache hits  [30.74%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [30.72%]

      36.531843880 seconds time elapsed

vs. the zero_sized_array():

      23167.723860 task-clock (msec)         #    0.638 CPUs utilized          
            14,513 context-switches          #    0.626 K/sec                  
             3,424 cpu-migrations            #    0.148 K/sec                  
               494 page-faults               #    0.021 K/sec                  
    78,630,215,762 cycles                    #    3.394 GHz                     [30.94%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   265,990,291,006 instructions              #    3.38  insns per cycle         [38.66%]
    50,353,702,113 branches                  # 2173.442 M/sec                   [38.65%]
        67,371,027 branch-misses             #    0.13% of all branches         [38.84%]
    63,736,003,459 L1-dcache-loads           # 2751.069 M/sec                   [38.83%]
     1,126,015,959 L1-dcache-load-misses     #    1.77% of all L1-dcache hits   [38.63%]
       181,894,495 LLC-loads                 #    7.851 M/sec                   [30.82%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,028,501 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [32.29%]
    59,875,036,318 dTLB-loads:HG             # 2584.416 M/sec                   [32.92%]
         1,291,552 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [32.84%]
        36,123,503 iTLB-loads:HG             #    1.559 M/sec                   [32.87%]
           716,049 iTLB-load-misses:HG       #    1.98% of all iTLB cache hits  [32.79%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [32.54%]

      36.333910926 seconds time elapsed

jerrinot · 2014-11-18T19:28:56Z

this is giving me similarly confusing results:

@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;
    private Byte[] buffer;

    @Setup
    public void setup() throws Throwable {
        buffer = new Byte[size];
        for (byte i = 0; i < size; i++) {
            buffer[i] = i;
        }
    }

    @Benchmark
    public void fast(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = Arrays.copyOf(buffer, s, Byte[].class);
        bh.consume(copy);
    }

    @Benchmark
    public void slow(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = (Byte[]) Array.newInstance(Byte[].class.getComponentType(), s);
        System.arraycopy(buffer, 0, copy, 0, s);
        bh.consume(copy);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
                .addProfiler(LinuxPerfProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

Benchmark                         (size)   Mode  Samples       Score  Score error   Units
o.o.j.s.ToArrayBench.fast             10  thrpt       20  134199.249    15573.195  ops/ms
o.o.j.s.ToArrayBench.fast:@cpi        10  thrpt        1       0.367          NaN     CPI
o.o.j.s.ToArrayBench.slow             10  thrpt       20   49499.140     2661.012  ops/ms
o.o.j.s.ToArrayBench.slow:@cpi        10  thrpt        1       0.764          NaN     CPI

perf stats for fast():

      23044.197935 task-clock (msec)         #    0.632 CPUs utilized          
            14,410 context-switches          #    0.625 K/sec                  
             3,227 cpu-migrations            #    0.140 K/sec                  
               423 page-faults               #    0.018 K/sec                  
    78,964,852,601 cycles                    #    3.427 GHz                     [30.93%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   214,945,315,883 instructions              #    2.72  insns per cycle         [38.77%]
    40,570,072,182 branches                  # 1760.533 M/sec                   [39.00%]
         8,633,173 branch-misses             #    0.02% of all branches         [38.99%]
    55,630,805,896 L1-dcache-loads           # 2414.092 M/sec                   [39.18%]
     2,727,361,923 L1-dcache-load-misses     #    4.90% of all L1-dcache hits   [38.89%]
     1,403,751,595 LLC-loads                 #   60.916 M/sec                   [30.76%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,486,830 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.88%]
    54,778,270,818 dTLB-loads:HG             # 2377.096 M/sec                   [31.78%]
           584,070 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.70%]
        23,538,830 iTLB-loads:HG             #    1.021 M/sec                   [31.62%]
           312,318 iTLB-load-misses:HG       #    1.33% of all iTLB cache hits  [31.64%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.57%]

      36.459311589 seconds time elapsed

perf stats for slow():

      23121.720993 task-clock (msec)         #    0.635 CPUs utilized          
            14,615 context-switches          #    0.632 K/sec                  
             3,168 cpu-migrations            #    0.137 K/sec                  
               463 page-faults               #    0.020 K/sec                  
    79,644,122,769 cycles                    #    3.445 GHz                     [30.98%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   104,254,073,200 instructions              #    1.31  insns per cycle         [38.76%]
    17,489,978,097 branches                  #  756.431 M/sec                   [38.87%]
        42,822,998 branch-misses             #    0.24% of all branches         [38.74%]
    21,942,957,608 L1-dcache-loads           #  949.019 M/sec                   [38.73%]
     1,022,663,441 L1-dcache-load-misses     #    4.66% of all L1-dcache hits   [38.71%]
        88,585,236 LLC-loads                 #    3.831 M/sec                   [30.86%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        11,208,770 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.33%]
    21,891,428,217 dTLB-loads:HG             #  946.791 M/sec                   [31.21%]
           807,245 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.04%]
        24,236,835 iTLB-loads:HG             #    1.048 M/sec                   [30.99%]
           387,664 iTLB-load-misses:HG       #    1.60% of all iTLB cache hits  [31.06%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.13%]

      36.428755980 seconds time elapsed

twillouer · 2014-12-09T13:15:27Z

@jerrinot thanks for your time.
Still in trouble to understand the problem :)

klinham · 2015-01-10T20:58:43Z

No idea either, can someone please elaborate on that ?

twillouer · 2015-01-23T19:51:28Z

Updated:

@State(Scope.Benchmark)
public class ToArrayBench {

    private static final int SIZE = 100;

    ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < SIZE; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[0]));
    }

    @Benchmark
    public void simple_toArray(Blackhole bh)
    {
        bh.consume(list.toArray());
    }

    @Benchmark
    public void sized_array_from_list(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[list.size()]));
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[SIZE]));
    }

    @Benchmark
    public void defensive_copy(Blackhole bh)
    {
        bh.consume(new ArrayList<>(list));
    }

    public static void main(String[] args) throws RunnerException, IOException
    {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(20)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .forks(1)
//                .addProfiler(LinuxPerfAsmProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

twillouer · 2015-01-23T21:52:52Z

Benchmark	Mode	Cnt	Score	Error	Units
ToArrayBench.defensive_copy	thrpt	200	16 714 192	± 129515,217	ops/s
ToArrayBench.simple_toArray	thrpt	200	17 918 950	± 102801,298	ops/s
ToArrayBench.sized_array_fixed_size	thrpt	200	5 799 136	± 65921,564	ops/s
ToArrayBench.sized_array_from_list	thrpt	200	5 643 162	± 85215,009	ops/s
ToArrayBench.zero_sized_array	thrpt	200	6 529 068	± 78960,062	ops/s

shipilev · 2015-01-24T00:22:08Z

Routinely, I will chew on people who can't use perfasm profiler, but this is not your fault it wasn't helping here. ;) Only in JMH 1.5+ (released yesterday) perfasm can decode the VM stubs, and VM stubs are the crucial piece of info to untangle this. See: http://cr.openjdk.java.net/~shade/scratch/ToArrayBench.java

twillouer/ToArrayBench.java

jerrinot commented Nov 18, 2014

Uh oh!

jerrinot commented Nov 18, 2014

Uh oh!

twillouer commented Dec 9, 2014

Uh oh!

klinham commented Jan 10, 2015

Uh oh!

twillouer commented Jan 23, 2015

Uh oh!

twillouer commented Jan 23, 2015

Uh oh!

shipilev commented Jan 24, 2015

Uh oh!

	@State(Scope.Benchmark)
	public class ToArrayBench {

	ArrayList<Byte> list;

	@Setup
	public void setup() throws Throwable
	{
	list = new ArrayList<>();
	for (int i = 0; i < 10; i++) {
	list.add((byte) i);
	}
	}

	@Benchmark
	public void zero_sized_array()
	{
	list.toArray(new Byte[0]);
	}

	@Benchmark
	public void simple_toArray()
	{
	list.toArray();
	}

	@Benchmark
	public void sized_array_from_list()
	{
	list.toArray(new Byte[list.size()]);
	}

	@Benchmark
	public void sized_array_fixed_size()
	{
	list.toArray(new Byte[100]);
	}

	@Benchmark
	public void defensive_copy()
	{
	new ArrayList<>(list);
	}

	public static void main(String[] args) throws RunnerException, IOException
	{
	Options opt = new OptionsBuilder().include("." + ToArrayBench.class.getSimpleName() + ".")
	.warmupIterations(20)
	.warmupTime(TimeValue.seconds(1))
	.measurementIterations(20)
	.timeUnit(TimeUnit.MILLISECONDS)
	.forks(1)
	// .addProfiler(LinuxPerfProfiler.class)
	.build();

	new Runner(opt).run();
	}
	}