@iakashpaul
Last active February 22, 2024 08:37
chsasank-benchmarking

Runs for dtypes

DEVICE=cuda && DTYPE=float32 && python benchmark.py --device ${DEVICE}  --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
DEVICE=cuda && DTYPE=float16 && python benchmark.py --device ${DEVICE}  --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
DEVICE=cuda && DTYPE=bfloat16 && python benchmark.py --device ${DEVICE}  --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
DEVICE=cuda && DTYPE=int8 && python benchmark.py --device ${DEVICE}  --dtype ${DTYPE} > H100_${DEVICE}_${DTYPE}.log
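The four commands above differ only in the dtype, so they can be generated from a list. The following is a minimal sketch, not part of the original gist: `build_runs` and `run_all` are hypothetical helper names, and the `H100` label is simply copied from the log file names above.

```python
import shlex
import subprocess

# dtypes taken from the four commands above
DTYPES = ["float32", "float16", "bfloat16", "int8"]


def build_runs(device="cuda", label="H100", dtypes=DTYPES):
    """Return (argv, log_name) pairs mirroring the shell commands above."""
    runs = []
    for dtype in dtypes:
        argv = ["python", "benchmark.py", "--device", device, "--dtype", dtype]
        runs.append((argv, f"{label}_{device}_{dtype}.log"))
    return runs


def run_all(runs):
    """Run each command and redirect its stdout to the matching log file."""
    for argv, log_name in runs:
        with open(log_name, "w") as log:
            # check=False: a failing dtype does not stop the remaining runs,
            # just as each shell line above is independent of the others
            subprocess.run(argv, stdout=log, check=False)


# Print the equivalent shell lines without executing anything
for argv, log_name in build_runs():
    print(f"{shlex.join(argv)} > {log_name}")
```

Calling `run_all(build_runs())` would execute the runs; it assumes `benchmark.py` is in the current directory.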

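| Device | Type | FP32 (TFLOPS) | BW (GB/s) | F16 (TFLOPS) | BF16 (TFLOPS) | INT8 (TOPS) |
|---|---|---|---|---|---|---|
| Apple M1 Pro CPU 10-core | CPU | 0.33 | 96 | | | 0.008 |
| Apple M1 Pro GPU 16-core | GPU | 3.74 | 176 | 4.3 | | |
| Intel Xeon 8358 60-core | CPU | 3.5 | 96 | | | |
| Intel Xeon 6330 56-core | CPU | 5.7 | 81 | NA | 0.75 | 0.02 |
| Intel Xeon 6230 40-core | CPU | 1.9 | 17.5 | NA | 0.61 | 0.014 |
| AMD Ryzen 5 3600 6-core | CPU | 0.36 | 14 | | | |
| Nvidia A100 80GB | GPU | 19 | 1490 | 32 | 33 | NA |
| Nvidia A10 24GB | GPU | 14.48 | 469 | | | |
| Nvidia V100 32GB | GPU | 13 | 766 | 84 | 9.4 | NA |
| Nvidia RTX 2070S 8GB | GPU | 8 | 376 | 37 | 5 | NA |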
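Where a raw log is included below, the summary figures appear to correspond roughly to the last (largest-size) row of that log rather than to its peak; this is an observation from the numbers, not something the gist states. A minimal sketch of extracting both values from a log, with `parse_log` and `summarize` as hypothetical helper names:

```python
def parse_log(text):
    """Split one benchmark log into matmul rows and bandwidth rows.

    Each data row becomes a tuple of three floats in column order.
    """
    sections = {"matmul": [], "bandwidth": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("size, elapsed_time"):
            current = "matmul"
        elif line.startswith("size (GB)"):
            current = "bandwidth"
        elif current and line and line[0].isdigit():
            sections[current].append(tuple(float(x) for x in line.split(",")))
    return sections


def summarize(sections):
    """Report the last-row and peak value of the third column per section."""
    tops = [row[2] for row in sections["matmul"]]
    bw = [row[2] for row in sections["bandwidth"]]
    return {
        "tops_last": tops[-1],
        "tops_peak": max(tops),
        "bw_last": bw[-1],
        "bw_peak": max(bw),
    }


# Excerpt of the Ryzen 5 3600 log (a few rows only)
sample = """benchmarking cpu using torch.float32
size, elapsed_time, tops
1722, 0.02307753562927246, 0.4425272377457036
6888, 1.7714987516403198, 0.3689508883586864
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.555152893066406e-05, 151.00588879327037
0.37962506, 0.053798246383666995, 14.1129157739704
"""
print(summarize(parse_log(sample)))
```

For CPU logs the peak bandwidth at small sizes is well above the last-row value, presumably because small buffers fit in cache, so the last row is the more conservative number to quote.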

Ryzen 5 3600 float32

benchmarking cpu using torch.float32
size, elapsed_time, tops
256, 0.0007100820541381836, 0.04725430223796394
304, 0.00020635128021240234, 0.27229745287823454
362, 0.0003533363342285156, 0.268514293066278
430, 0.0005476951599121093, 0.2903330385930698
512, 0.0006838560104370118, 0.3925321294294962
608, 0.001193690299987793, 0.37657290505300817
724, 0.0021503925323486327, 0.3529619995336488
861, 0.003527235984802246, 0.36191362514452513
1024, 0.00509192943572998, 0.421742617431252
1217, 0.008618521690368652, 0.418281783757488
1448, 0.013982748985290528, 0.43425329242394584
1722, 0.02307753562927246, 0.4425272377457036
2048, 0.0389744758605957, 0.44079795313859077
2435, 0.0737607479095459, 0.3914727896388784
2896, 0.14375429153442382, 0.3379129607436293
3444, 0.24540278911590577, 0.3329200334777477
4096, 0.37633934020996096, 0.3651995387867831
4870, 0.6344788789749145, 0.36408242047901673
5792, 0.9949984312057495, 0.3905649436100884
6888, 1.7714987516403198, 0.3689508883586864
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.555152893066406e-05, 151.00588879327037
0.00593164, 8.046627044677734e-05, 147.43171187294814
0.008388608, 0.00013594627380371095, 123.41063517654155
0.01186328, 0.0003290891647338867, 72.09766392395856
0.016777216, 0.0010283470153808593, 32.629483528546785
0.023726564, 0.0030118942260742186, 15.755243855907796
0.033554432, 0.004802894592285156, 13.97258730345578
0.047453132, 0.006703615188598633, 14.157474934034783
0.067108864, 0.009397172927856445, 14.282777281040833
0.094906264, 0.013365435600280761, 14.201746480751643
0.134217728, 0.018894267082214356, 14.207243648666582
0.189812528, 0.02665371894836426, 14.242855067821507
0.268435456, 0.03752543926239014, 14.306852166233767
0.37962506, 0.053798246383666995, 14.1129157739704
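The log columns can be related to each other with simple arithmetic. The logged numbers are consistent with `tops = 2 * size**3 / elapsed_time / 1e12` for a square matrix multiply and with `bandwidth = 2 * size_GB / elapsed_time` (the factor 2 presumably counting one read plus one write). The matrix sizes also match 256 scaled by successive powers of 2^(1/4) and truncated. The sketch below reproduces these relations; the function names are hypothetical and `benchmark.py` itself may compute them differently.

```python
import math


def matmul_sizes(start=256, steps=20):
    """Sizes spaced by a factor of 2**(1/4), truncated to integers."""
    return [int(start * 2 ** (k / 4)) for k in range(steps)]


def tops(n, elapsed_s):
    """Tera-operations per second for an n x n matmul (2 * n**3 operations)."""
    return 2 * n**3 / elapsed_s / 1e12


def bandwidth_gbs(size_gb, elapsed_s):
    """GB/s, counting the buffer twice (assumed: one read plus one write)."""
    return 2 * size_gb / elapsed_s


# First matmul row and first bandwidth row of the Ryzen 5 3600 log above
print(matmul_sizes()[:5])
print(tops(256, 0.0007100820541381836))
print(bandwidth_gbs(0.004194304, 5.555152893066406e-05))
```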

A100 float16

benchmarking cuda using torch.float16
size, elapsed_time, tops
256, 0.01777644157409668, 0.0018875786731633935
304, 0.008939647674560547, 0.006285362694985866
362, 0.009391403198242188, 0.010102415368318777
430, 0.009010767936706543, 0.01764710856132868
512, 0.009000349044799804, 0.029825005081896894
608, 0.008980417251586914, 0.050054625682405526
724, 0.009735321998596192, 0.07796422636143384
861, 0.009606742858886718, 0.1328811211824123
1024, 0.009710216522216797, 0.22115713311712432
1217, 0.00892808437347412, 0.4037787363110709
1448, 0.007884597778320313, 0.7701159849518099
1722, 0.007819414138793945, 1.3060362214777321
2048, 0.008268022537231445, 2.0778691768966437
2435, 0.009387540817260741, 3.075920127762037
2896, 0.009256076812744141, 5.24805911345917
3444, 0.009479331970214843, 8.618698556470994
4096, 0.009059309959411621, 15.1710178907408
4870, 0.010232973098754882, 22.57433922386718
5792, 0.009942054748535156, 39.08764495923313
6888, 0.019882059097290038, 32.87365936021618
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 0.008298230171203614, 1.0108912173959712
0.00593164, 0.008292603492736816, 1.4305857033186993
0.008388608, 0.00836634635925293, 2.0053217114834005
0.01186328, 0.008056378364562989, 2.9450652546762592
0.016777216, 0.007922005653381348, 4.235598088178334
0.023726564, 0.00619211196899414, 7.663480285500778
0.033554432, 0.00462348461151123, 14.5147804391772
0.047453132, 0.0036125898361206053, 26.2709768629354
0.067108864, 0.0042188167572021484, 31.814069139379033
0.094906264, 0.006849765777587891, 27.710805619231184
0.134217728, 0.007091808319091797, 37.85148214981321
0.189812528, 0.0068720817565917965, 55.241638479614764
0.268435456, 0.008846306800842285, 60.68870592967483
0.37962506, 0.009415888786315918, 80.63499232312684
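In the A100 float16 log above, elapsed time stays at roughly 8 to 10 ms from size 304 up to 5792, whereas a compute-bound matmul should scale with the cube of the size. The tops values for those sizes therefore look dominated by some fixed per-call cost; the cause is not stated in the gist, and launch, synchronization, or warm-up overhead are only guesses. One way to check this is to fit the log-log slope of elapsed time against size, as in this sketch (`scaling_exponent` is a hypothetical helper):

```python
import math


def scaling_exponent(rows):
    """Least-squares slope of log(elapsed) against log(size).

    A value near 3 suggests compute-bound matmul timing; a value
    near 0 suggests a fixed cost dominates the measurement.
    """
    xs = [math.log(n) for n, _ in rows]
    ys = [math.log(t) for _, t in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# (size, elapsed_time) rows copied from the A100 float16 log above
a100_f16 = [
    (304, 0.008939647674560547),
    (1024, 0.009710216522216797),
    (2048, 0.008268022537231445),
    (4096, 0.009059309959411621),
    (5792, 0.009942054748535156),
]

# Rows from the Ryzen 5 3600 float32 log, for comparison
ryzen_f32 = [
    (1024, 0.00509192943572998),
    (2048, 0.0389744758605957),
    (4096, 0.37633934020996096),
]

print(scaling_exponent(a100_f16))   # close to 0: flat timing
print(scaling_exponent(ryzen_f32))  # close to 3: cubic scaling
```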

A100 bfloat16

benchmarking cuda using torch.bfloat16
size, elapsed_time, tops
256, 0.015097665786743163, 0.0022224913754193185
304, 0.005758213996887207, 0.009758047899986834
362, 0.006917119026184082, 0.013716094177482947
430, 0.00832064151763916, 0.019110786068946943
512, 0.006926321983337402, 0.03875584424832877
608, 0.006831693649291992, 0.06579794807493826
724, 0.006945896148681641, 0.10927414285398761
861, 0.008063292503356934, 0.15831681183195834
1024, 0.008281826972961426, 0.25930071408290967
1217, 0.009384751319885254, 0.3841306501496171
1448, 0.008578181266784668, 0.7078487379966464
1722, 0.008532881736755371, 1.1968334275640953
2048, 0.0074500560760498045, 2.306005351990474
2435, 0.008992719650268554, 3.2109669680559523
2896, 0.008328795433044434, 5.832348586600332
3444, 0.007643985748291016, 10.688076542563643
4096, 0.008878946304321289, 15.47919637774022
4870, 0.010046720504760742, 22.992836905389876
5792, 0.00858457088470459, 45.26860007276564
6888, 0.01943228244781494, 33.63454807222059
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 0.007399177551269532, 1.1337216794535097
0.00593164, 0.008727836608886718, 1.3592463438099611
0.008388608, 0.008515739440917968, 1.9701420077962688
0.01186328, 0.008480381965637208, 2.797817373809437
0.016777216, 0.008199071884155274, 4.09246710775204
0.023726564, 0.006315088272094727, 7.514246191884141
0.033554432, 0.00802006721496582, 8.367618649725495
0.047453132, 0.008075571060180664, 11.752266594243402
0.067108864, 0.0070595979690551754, 19.012092273288914
0.094906264, 0.007315444946289063, 25.946819283533397
0.134217728, 0.008006620407104491, 33.52668696043214
0.189812528, 0.00883655548095703, 42.96075057957824
0.268435456, 0.0083709716796875, 64.1348379307911
0.37962506, 0.006459522247314453, 117.53967103614487

A100 int8

Not benchmarked yet: the torch and CUDA versions need to be revised first.

V100 float16

benchmarking cuda using torch.float16
size, elapsed_time, tops
256, 0.005288243293762207, 0.006345099901053988
304, 4.7779083251953126e-05, 1.1760151969366865
362, 5.8317184448242186e-05, 1.6268936317425347
430, 5.1641464233398436e-05, 3.079192318818098
512, 6.201267242431641e-05, 4.328719365023544
608, 4.842281341552734e-05, 9.283050535346607
724, 0.0006131649017333985, 1.2378510998498296
861, 0.00011780261993408204, 10.836386853826447
1024, 7.870197296142579e-05, 27.286274628115695
1217, 0.00018236637115478515, 19.767737895822073
1448, 0.0001292705535888672, 46.97167773653695
1722, 0.0002488374710083008, 41.04059591434817
2048, 0.0002832174301147461, 60.65964646681365
2435, 0.0006921768188476562, 41.71668996091485
2896, 0.0007654905319213867, 63.457921746037535
3444, 0.0011915206909179688, 68.5674242929489
4096, 0.0020316600799560546, 67.64859674506812
4870, 0.0030206918716430666, 76.4734093432539
5792, 0.004576373100280762, 84.91691950382248
6888, 0.00769500732421875, 84.93767589888
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 2.5534629821777342e-05, 328.51888038125117
0.00593164, 3.478527069091797e-05, 341.04319915777927
0.008388608, 3.7288665771484376e-05, 449.9280318264962
0.01186328, 4.84466552734375e-05, 489.7460901291339
0.016777216, 5.822181701660156e-05, 576.3205911356594
0.023726564, 7.894039154052735e-05, 601.1260784745152
0.033554432, 0.00010094642639160156, 664.7968273751912
0.047453132, 0.00013959407806396484, 679.8731387194808
0.067108864, 0.0001867055892944336, 718.8736475818057
0.094906264, 0.00026137828826904296, 726.1985272649019
0.134217728, 0.00035915374755859377, 747.4109843618055
0.189812528, 0.0005042552947998047, 752.8429744118316
0.268435456, 0.0007027387619018555, 763.9694024377432
0.37962506, 0.000991511344909668, 765.75031026919

V100 bfloat16

benchmarking cuda using torch.bfloat16
size, elapsed_time, tops
256, 0.02667853832244873, 0.0012577312742717067
304, 6.992816925048828e-05, 0.8035235099424206
362, 8.306503295898437e-05, 1.1421876645356601
430, 8.804798126220703e-05, 1.8059925704197128
512, 0.00011758804321289062, 2.2828465264448985
608, 0.00013427734375, 3.347634168552727
724, 0.00016186237335205078, 4.6892111630487445
861, 0.00022852420806884766, 5.586081110564057
1024, 0.000320124626159668, 6.708273817487892
1217, 0.0006142377853393555, 5.86901475624512
1448, 0.000863027572631836, 7.03575989522911
1722, 0.0012133121490478516, 8.416991541718449
2048, 0.002041149139404297, 8.416763308639903
2435, 0.0033989667892456053, 8.495324473708326
2896, 0.005703592300415039, 8.516814616722376
3444, 0.009201717376708985, 8.878723549453332
4096, 0.014371323585510253, 9.563416525571206
4870, 0.024599909782409668, 9.390384275522017
5792, 0.04042065143585205, 9.614182166082358
6888, 0.06916227340698242, 9.45018152162151
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 2.446174621582031e-05, 342.9276032049903
0.00593164, 3.101825714111328e-05, 382.46120489715605
0.008388608, 3.483295440673828e-05, 481.6478040907872
0.01186328, 4.6205520629882815e-05, 513.5005444491228
0.016777216, 5.626678466796875e-05, 596.3452896412203
0.023726564, 7.715225219726563e-05, 615.0582341869962
0.033554432, 9.870529174804688e-05, 679.8912480933719
0.047453132, 0.00013720989227294922, 691.6867466902797
0.067108864, 0.00018470287322998048, 726.668327638198
0.094906264, 0.00025992393493652345, 730.2618285090001
0.134217728, 0.0003578662872314453, 750.0998713142066
0.189812528, 0.0005018472671508789, 756.4553617183828
0.268435456, 0.0007009267807006836, 765.9443565037069
0.37962506, 0.0009888172149658202, 767.8366724493611

V100 int8

Not benchmarked yet: the torch and CUDA versions need to be revised first.

RTX 2070S float32

benchmarking cuda using torch.float32
size, elapsed_time, tops
256, 0.014125776290893555, 0.00237540445983358
304, 5.047321319580078e-05, 1.1132425388101652
362, 5.1856040954589844e-05, 1.8296008382722941
430, 6.949901580810547e-05, 2.2880036235197254
512, 7.755756378173828e-05, 3.4611125325626313
608, 0.00010981559753417969, 4.093329491378411
724, 0.00015544891357421875, 4.882677083732809
861, 0.00023860931396484374, 5.349978761466475
1024, 0.00034000873565673826, 6.315966099671126
1217, 0.0005458593368530273, 6.604211712825641
1448, 0.0008722305297851563, 6.961525166397971
1722, 0.0014461994171142579, 7.061569777408616
2048, 0.0022562503814697265, 7.614345165366353
2435, 0.004026150703430176, 7.171943595007254
2896, 0.006442856788635254, 7.539580634119545
3444, 0.009186863899230957, 8.893078820383868
4096, 0.015436434745788574, 8.903542543040686
4870, 0.02744767665863037, 8.416107813896367
5792, 0.04533388614654541, 8.572208103222879
6888, 0.08088059425354004, 8.080999455755025
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 3.304481506347656e-05, 253.85549847642136
0.00593164, 4.2486190795898435e-05, 279.2267270320988
0.008388608, 5.2881240844726565e-05, 317.26214687855725
0.01186328, 7.462501525878906e-05, 317.9437875854313
0.016777216, 9.911060333251953e-05, 338.55542062864566
0.023726564, 0.00013742446899414062, 345.30333897104794
0.033554432, 0.000186920166015625, 359.0242049880816
0.047453132, 0.0002597332000732422, 365.39904784308425
0.067108864, 0.00036420822143554685, 368.51921538446715
0.094906264, 0.0005475997924804688, 346.6263694882062
0.134217728, 0.0007352352142333985, 365.10146794300016
0.189812528, 0.0009937524795532227, 382.0116817929089
0.268435456, 0.0014146089553833008, 379.51895466018027
0.37962506, 0.002018284797668457, 376.1858192050465

RTX 2070S float16

benchmarking cuda using torch.float16
size, elapsed_time, tops
256, 0.005084848403930664, 0.006598905087133359
304, 2.4533271789550783e-05, 2.290315310652206
362, 0.0006063461303710937, 0.15647144633698648
430, 0.00015423297882080078, 1.0309986957118566
512, 3.454685211181641e-05, 7.77018569249568
608, 5.1641464233398436e-05, 8.704467053226667
724, 6.458759307861328e-05, 11.751588994439986
861, 0.000567626953125, 2.2489326043664515
1024, 9.047985076904297e-05, 23.734385387986805
1217, 0.0002730607986450195, 13.202080430030826
1448, 0.00027284622192382815, 22.254494642389318
1722, 0.0004123687744140625, 24.765304091006698
2048, 0.00050201416015625, 34.22188166694906
2435, 0.0011893272399902343, 24.27870545556251
2896, 0.0013867616653442383, 35.02868552394098
3444, 0.002403569221496582, 33.990909867422005
4096, 0.0035764694213867186, 38.428667291306034
4870, 0.006640505790710449, 34.78689926348
5792, 0.010618138313293456, 36.59883632232168
6888, 0.017475819587707518, 37.40002206269852
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 3.1566619873046876e-05, 265.74299160749246
0.00593164, 4.220008850097656e-05, 281.1197895882486
0.008388608, 5.3262710571289064e-05, 314.98990231720677
0.01186328, 7.414817810058594e-05, 319.9884421679743
0.016777216, 9.582042694091796e-05, 350.1803641585668
0.023726564, 0.00013625621795654297, 348.26394502696763
0.033554432, 0.0001840829849243164, 364.55766961618446
0.047453132, 0.00025680065155029295, 369.5717414541417
0.067108864, 0.00035834312438964844, 374.5508672131151
0.094906264, 0.0005070447921752929, 374.3506114828194
0.134217728, 0.000709366798400879, 378.4155906438423
0.189812528, 0.001004338264465332, 377.9852559954953
0.268435456, 0.0014100074768066406, 380.75749301407643
0.37962506, 0.0019683837890625, 385.7226035993798

RTX 2070S bfloat16

benchmarking cuda using torch.bfloat16
size, elapsed_time, tops
256, 0.0027062654495239257, 0.012398795545316055
304, 4.4178962707519534e-05, 1.2718480597199782
362, 5.137920379638672e-05, 1.8465808924557958
430, 6.778240203857422e-05, 2.34594814018994
512, 8.475780487060547e-05, 3.167088345548872
608, 0.00012900829315185547, 3.484360679595077
724, 0.00019524097442626953, 3.8875387209595704
861, 0.0002892017364501953, 4.4140632683228755
1024, 0.00046432018280029297, 4.6250060358105225
1217, 0.0008016824722290039, 4.496756198219868
1448, 0.0012295007705688476, 4.93863438669556
1722, 0.002044367790222168, 4.995401583239668
2048, 0.0033872127532958984, 5.071978182440201
2435, 0.005963873863220215, 4.841706315768501
2896, 0.009510970115661621, 5.107411513364935
3444, 0.01592259407043457, 5.1310423670035945
4096, 0.027903199195861816, 4.9255625674773125
4870, 0.04716935157775879, 4.8973029790157625
5792, 0.08068933486938476, 4.816144621901542
6888, 0.13136739730834962, 4.975329126829385
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 3.173351287841797e-05, 264.3453951076784
0.00593164, 4.298686981201172e-05, 275.9745022580144
0.008388608, 5.3429603576660155e-05, 314.0059981154128
0.01186328, 7.274150848388672e-05, 326.1763537012127
0.016777216, 9.78231430053711e-05, 343.0111829279259
0.023726564, 0.00013377666473388672, 354.7190243858706
0.033554432, 0.0001855134963989258, 361.7465322075003
0.047453132, 0.0002609729766845703, 363.66318538302215
0.067108864, 0.00036041736602783204, 372.3952857189337
0.094906264, 0.000502777099609375, 377.5281892263429
0.134217728, 0.0007108211517333985, 377.641345288329
0.189812528, 0.001009511947631836, 376.0481061076529
0.268435456, 0.001414942741394043, 379.4294258657132
0.37962506, 0.0020305871963500976, 373.9066814588031

Xeon 6330 bfloat16

benchmarking cpu using torch.bfloat16
size, elapsed_time, tops
256, 0.0021901369094848634, 0.01532070066244957
304, 0.0007275581359863281, 0.07722946830060035
362, 0.0006016969680786132, 0.15768046214852163
430, 0.0004467487335205078, 0.3559360957711602
512, 0.000615072250366211, 0.4364291444463229
608, 0.0010200977325439454, 0.4406552525893741
724, 0.0014643907546997071, 0.5183089592474548
861, 0.0031775951385498045, 0.4017361263280998
1024, 0.003713393211364746, 0.5783076355683746
1217, 0.005452871322631836, 0.6611141933677716
1448, 0.008942294120788574, 0.6790265117632406
1722, 0.015332889556884766, 0.666047848196651
2048, 0.02421088218688965, 0.7095928620603098
2435, 0.04082856178283691, 0.7072334779653765
2896, 0.06611626148223877, 0.7347124169301285
3444, 0.11195228099822999, 0.7297707919795908
4096, 0.1859917163848877, 0.7389520143337268
4870, 0.3130293369293213, 0.7379583276955218
5792, 0.5199612140655517, 0.7473855658145446
6888, 0.8682275295257569, 0.7527934970007311
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 1.6951560974121092e-05, 494.8575539920113
0.00593164, 2.9134750366210938e-05, 407.18660194042553
0.008388608, 3.204345703125e-05, 523.5769656076191
0.01186328, 4.3773651123046874e-05, 542.028352474074
0.016777216, 0.00019590854644775392, 171.2759989720433
0.023726564, 0.00010991096496582031, 431.74152837941864
0.033554432, 0.00019106864929199218, 351.22907001579233
0.047453132, 0.000468754768371582, 202.46463695654137
0.067108864, 0.0011372804641723634, 118.01638402157438
0.094906264, 0.0020810604095458985, 91.20952334171712
0.134217728, 0.0030328989028930663, 88.5078812696133
0.189812528, 0.00473179817199707, 80.22849711693813
0.268435456, 0.00640721321105957, 83.79164143832462
0.37962506, 0.009680533409118652, 78.43060789241413

Xeon 6330 int8

benchmarking cpu using torch.int8
size, elapsed_time, tops
256, 0.0016846656799316406, 0.019917561329652986
304, 0.004039764404296875, 0.013908961606829084
362, 0.005875968933105468, 0.016146418927687863
430, 0.00986475944519043, 0.01611939965525742
512, 0.011141633987426758, 0.024093006133833438
608, 0.018270087242126466, 0.02460368240407379
724, 0.0413280725479126, 0.01836540639827966
861, 0.07472686767578125, 0.01708294220946889
1024, 0.08651659488677979, 0.024821638563217972
1217, 0.15329647064208984, 0.023516331530011113
1448, 0.2395930290222168, 0.025343203050523455
1722, 0.3982081890106201, 0.025645977098998424
2048, 0.4549932241439819, 0.03775851655004747
2435, 0.727786374092102, 0.0396755514776178
2896, 1.2666751861572265, 0.03834956175258211
3444, 2.090726399421692, 0.039077090522508635
4096, 3.1241865873336794, 0.043991915857143654
4870, 5.981458854675293, 0.038619776815724684
5792, 11.528981041908263, 0.033707359285558985
6888, 24.265810799598693, 0.026934852642748253
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.822181701660156e-05, 144.08014778391484
0.00593164, 9.953975677490234e-05, 119.1813239691497
0.008388608, 8.499622344970703e-05, 197.38778170452736
0.01186328, 4.9614906311035155e-05, 478.2143465364728
0.016777216, 6.074905395507813e-05, 552.3449307508947
0.023726564, 0.00011255741119384766, 421.59043546475743
0.033554432, 0.00021660327911376953, 309.8238598906505
0.047453132, 0.00045404434204101565, 209.02421902975004
0.067108864, 0.0012057304382324218, 111.31652958580084
0.094906264, 0.001912236213684082, 99.26207162153382
0.134217728, 0.003130984306335449, 85.73516496292531
0.189812528, 0.004472994804382324, 84.8704442106819
0.268435456, 0.006230497360229492, 86.16822718310648
0.37962506, 0.008964014053344727, 84.6997913525919

Xeon 6230 bfloat16

benchmarking cpu using torch.bfloat16
size, elapsed_time, tops
256, 0.001166057586669922, 0.028775964741009238
304, 0.0003167867660522461, 0.17737144988794462
362, 0.0004210948944091797, 0.22530754293071226
430, 0.0005740880966186524, 0.2769853632858507
512, 0.000850057601928711, 0.31578501902805406
608, 0.0014193534851074218, 0.31670153257557215
724, 0.002286386489868164, 0.33196786779638704
861, 0.003532099723815918, 0.3614152662204194
1024, 0.005914664268493653, 0.3630778604694872
1217, 0.008795619010925293, 0.40985979741984746
1448, 0.014617276191711426, 0.41540261703771425
1722, 0.024225759506225585, 0.42155285547912696
2048, 0.038063979148864745, 0.45134191348758107
2435, 0.06337082386016846, 0.4556564676153674
2896, 0.08836662769317627, 0.5497147457144741
3444, 0.1559471845626831, 0.5238921433375465
4096, 0.2487639904022217, 0.552487332470337
4870, 0.41010868549346924, 0.5632716744880513
5792, 0.636151385307312, 0.610878974960134
6888, 1.0622331142425536, 0.6153037684294559
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 5.710124969482422e-05, 146.90760788656365
0.00593164, 7.467269897460937e-05, 158.8703791734355
0.008388608, 0.00011763572692871093, 142.6200733231942
0.01186328, 0.00030057430267333985, 78.93742009537559
0.016777216, 0.0006120920181274414, 54.81926084031005
0.023726564, 0.0010343313217163086, 45.87807311225872
0.033554432, 0.002415776252746582, 27.779420351409424
0.047453132, 0.005098819732666016, 18.613378973171983
0.067108864, 0.006124520301818847, 21.91481477498577
0.094906264, 0.00918436050415039, 20.666929168799957
0.134217728, 0.013114047050476075, 20.469307069495002
0.189812528, 0.025612187385559083, 14.822047421613195
0.268435456, 0.027147817611694335, 19.775840536394895
0.37962506, 0.0396291971206665, 19.15885698335416

Xeon 6230 int8

benchmarking cpu using torch.int8
size, elapsed_time, tops
256, 0.0011035919189453125, 0.030404746015236777
304, 0.0028478622436523436, 0.019730212767573505
362, 0.003856062889099121, 0.024604333157586422
430, 0.006611490249633789, 0.0240511585128342
512, 0.007130289077758789, 0.03764720519359018
608, 0.0132371187210083, 0.03395840389998101
724, 0.029151320457458496, 0.02603679133875408
861, 0.052414536476135254, 0.024354975696126307
1024, 0.06348373889923095, 0.033827302632706384
1217, 0.11844491958618164, 0.030435840039360992
1448, 0.21307857036590577, 0.028496787704051427
1722, 0.3561347484588623, 0.028675769888204704
2048, 0.5653518438339233, 0.030387924566576144
2435, 0.9055092811584473, 0.03188849231126473
2896, 1.6462963581085206, 0.029506496830139946
3444, 3.509342336654663, 0.02328057423029335
4096, 8.10109314918518, 0.01696548242823548
4870, 14.660134220123291, 0.01575719584360376
5792, 25.486044383049013, 0.015248011826993006
6888, 44.70684518814087, 0.014619596515778656
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 6.196498870849609e-05, 135.37657594779532
0.00593164, 9.491443634033204e-05, 124.98920561949258
0.008388608, 0.00016396045684814454, 102.32476977993892
0.01186328, 0.000305485725402832, 77.66830993072661
0.016777216, 0.0006456851959228515, 51.96716946877188
0.023726564, 0.0012836456298828125, 36.967467418817236
0.033554432, 0.0023816823959350586, 28.177083608854897
0.047453132, 0.00407719612121582, 23.277335987384127
0.067108864, 0.006319093704223633, 21.24002812464862
0.094906264, 0.009533262252807618, 19.91055348803593
0.134217728, 0.014014458656311036, 19.154179450172144
0.189812528, 0.020318937301635743, 18.683312535712137
0.268435456, 0.02902853488922119, 18.494592098733502
0.37962506, 0.04167752265930176, 18.217256486346063

M1 Pro CPU int8

benchmarking cpu using torch.int8
size, elapsed_time, tops
256, 0.004259657859802246, 0.007877259889027275
304, 0.007328653335571289, 0.007667019495556467
362, 0.01215658187866211, 0.007804484595010316
430, 0.020352959632873535, 0.007812819504794035
512, 0.03251914978027344, 0.008254688631583984
608, 0.05446903705596924, 0.008252604567584112
724, 0.09682230949401856, 0.007839173140637484
861, 0.1643320083618164, 0.007768144348296151
1024, 0.2544929265975952, 0.008438284225461425
1217, 0.4326848745346069, 0.008331630796841428
1448, 0.7112574338912964, 0.008537070397675465
1722, 1.2066871166229247, 0.008463203058453855
2048, 1.9730838775634765, 0.008707115485234763
2435, 3.3893067121505736, 0.008519537534470614
2896, 5.669154453277588, 0.008568550861039925
3444, 9.958435702323914, 0.008204050034578674
4096, 16.59977195262909, 0.008279568771439191
4870, 28.29214940071106, 0.008164901249750726
5792, 47.3701210975647, 0.00820372625553576
6888, 80.74701671600342, 0.008094367627757355
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 3.3855438232421875e-05, 247.77726823121128
0.00593164, 7.402896881103516e-05, 160.2518607314654
0.008388608, 9.772777557373046e-05, 171.67295481254942
0.01186328, 0.0001516103744506836, 156.49694215165906
0.016777216, 0.0002319812774658203, 144.64284517505448
0.023726564, 0.0004257917404174805, 111.44680249897083
0.033554432, 0.0006222724914550781, 107.8448186630866
0.047453132, 0.0009853601455688476, 96.3163209175775
0.067108864, 0.0013759851455688477, 97.54300650136226
0.094906264, 0.0025406122207641602, 74.71133392521767
0.134217728, 0.003232312202453613, 83.04750258846703
0.189812528, 0.004046845436096192, 93.8076489440148
0.268435456, 0.005482769012451172, 97.91966628190707
0.37962506, 0.008221673965454101, 92.34738852333764

M1 Pro GPU float16

benchmarking mps using torch.float16
size, elapsed_time, tops
256, 0.009627270698547363, 0.003485352500274346
304, 0.006162810325622559, 0.009117419656157253
362, 0.011301898956298828, 0.008394682731358462
430, 0.007365679740905762, 0.021588503110840648
512, 0.0011082172393798828, 0.24222277587939936
608, 0.0034063577651977537, 0.1319624816255623
724, 0.008305120468139648, 0.09139022737981042
861, 0.0017158985137939453, 0.7439570299396482
1024, 0.0019629955291748046, 1.0939829541551476
1217, 0.0027472972869873047, 1.3121880340635514
1448, 0.014558553695678711, 0.4170781597489533
1722, 0.0037976980209350588, 2.6891127308446503
2048, 0.004535555839538574, 3.787820014084051
2435, 0.0075850248336791996, 3.8068861187885794
2896, 0.022348809242248534, 2.1735582305732133
3444, 0.01922299861907959, 4.250091590128399
4096, 0.02993953227996826, 4.590551121066
4870, 0.052914762496948244, 4.365560669639642
5792, 0.0893251657485962, 4.35052656123515
6888, 0.15182690620422362, 4.304876220456225
size (GB), elapsed_time, bandwidth (GB/s)
0.004194304, 0.00025799274444580076, 32.51489888996581
0.00593164, 0.000435328483581543, 27.251329622169887
0.008388608, 0.0002537250518798828, 66.12360851124225
0.01186328, 0.0002813577651977539, 84.32879036881621
0.016777216, 0.00034019947052001955, 98.63164086854579
0.023726564, 0.0004561424255371094, 104.03138437325528
0.033554432, 0.0005932807922363281, 113.11484355837325
0.047453132, 0.000733184814453125, 129.44384843920918
0.067108864, 0.0009582281112670898, 140.06866050143364
0.094906264, 0.0012568950653076172, 151.01700471196023
0.134217728, 0.0016906261444091797, 158.7787204685692
0.189812528, 0.0022737979888916016, 166.95636897148202
0.268435456, 0.0034802913665771484, 154.26033496960062
0.37962506, 0.004375720024108886, 173.51432811440466