[AUTOTVM] TOPI integration for ARM CPU#1487
Conversation
| @@ -0,0 +1,123 @@ | |||
| # Performance Benchmark | |||
|
|
|||
| ## ARM CPU | |||
There was a problem hiding this comment.
consider put performance benchmark results in wiki for now, later we can have hosted website for the result, because they can change over time
There was a problem hiding this comment.
There was a problem hiding this comment.
How can I edit the wiki?
| Note: If a board has big.LITTLE archtiecture, we will use all big cores. | ||
| Otherwise, we will use all cores. | ||
|
|
||
| - **Firefly-RK3399 : 2 x Cortex A73 1.8Ghz+ 4 x Cortex A53 1.5Ghz** |
There was a problem hiding this comment.
only mark the cores being used(in this case big)
| parameters in [this repo](https://github.com/uwsaml/tvm-distro). | ||
| During compilation, TVM will download these operator parameters automatically. | ||
|
|
||
| But we don't tune for other devices, so you can only run benchmark for these devices. |
There was a problem hiding this comment.
Remove this line, and after this, have a quick section on how to do tuning for a new device
|
cc who may be interested in PR |
|
|
||
| logging.info("Finish loading %d records", counter) | ||
| if verbose: | ||
| logging.info("Finish loading %d records", counter) |
There was a problem hiding this comment.
consider just use logging.debug?
|
|
||
| def tune_tasks(tasks, | ||
| rpc_device_key, | ||
|
|
| early_stopping=200, | ||
| log_filename='tuning.log', | ||
|
|
||
| mea_number=5, |
| mea_number=5, | ||
| mea_parallel_num=1, | ||
| mea_timeout=20, | ||
| mea_use_ndk=False, |
There was a problem hiding this comment.
is it possible to pass in MeasureOption here? The options seem to be a bit duplicating with MeasureOption
| len(xs) - np.sum(valid_index), | ||
| self.feature_cache.size(self.fea_type)) | ||
| if self.verbose: | ||
| logging.info("train: %.2f\tobs: %d\terror: %d\tn_cache: %d", |
There was a problem hiding this comment.
consider use logging.debug and allow user to set the level
| If is not None, the tuner will first select | ||
| top-(plan_size * diversity_filter_ratio) candidates according to the cost model | ||
| and then pick batch_size of them according to the diversity metric. | ||
| verbose: int |
There was a problem hiding this comment.
consider directly rely on logging level for verbosity
There was a problem hiding this comment.
This is int not bool, so we leave this.
There was a problem hiding this comment.
verbose usually do not have the same meaning, so this argument is confusing. A better name is log_interval.
| } | ||
| pre_defined_opt = opt_table.get(model, []) | ||
|
|
||
| if not os.path.isfile(os.path.join(AUTOTVM_PRETUNED_PARAM_ROOT_PATH, "arm_cpu.log")): |
There was a problem hiding this comment.
consolidate all the logics of file system manipulation and autotvm cache into one file, say autotvm.tophub
| The raw kernel tensor | ||
| tile_size: int | ||
| Tile size of winograd transform. e.g. 2 for F(2x2, 3x3) and 4 for F(4x4, 3x3) | ||
| """ |
| these operators, it will query this log file to get the best knob values. | ||
|
|
||
| We also released pre-tuned parameters for some arm devices. You can go to | ||
| `ARM CPU Benchmark <https://github.com/merrymercy/tvm/blob/arm_cpu/apps/benchmark/README.md#arm-cpu>`_ |
|
Please also confirm the VTA CPU test cases, since these depend on availability of old rasp schedule which get removed here |
|
|
||
| E.g. For my RK3399, I use `python3 -m tvm.exec.rpc_sever --tracker=10.77.1.123:9190 --key=rk3399` | ||
|
|
||
| * For Andoird device |
| ``` | ||
|
|
||
| If you do not do tuning and run the benchmark for other devices directly, | ||
| the performance is not gauranteed (This is still doable, you can pick a most |
| CHECK(!op->condition.type().is_vector()); | ||
| Expr condition = this->Mutate(op->condition); | ||
| if (condition.type().is_vector()) { | ||
| LOG(WARNING) << "Detect vector condition in Vectorized Loop, scalarizing..."; |
There was a problem hiding this comment.
Currently it seems to pollute the logged data; ideally we would just print this once?
There was a problem hiding this comment.
This message is important, because it tells us when vectorization isn't working because of vectorize axis length vs input shape mismatch.
I'd imagine this message will mess up log during auto tuning, though.
There was a problem hiding this comment.
OK, I reverted this change.
There was a problem hiding this comment.
It does not pollute logging data. It occurs when I use "llvm" as target to build resnet-18.
https://github.com/dmlc/tvm/blob/f33fd5c03d8a2b3972e3b69a79a89d0c9754cd9e/topi/python/topi/x86/conv2d.py#L214-L218
@masahi Can I fix this by checking the length of w and only vectorize it when the length of w is a multiple of 16?
There was a problem hiding this comment.
@merrymercy Sure. I am aware of this issue. Probably 8 is a better default split factor than 16 for imagenet model.
I am planning to remove this old schedule completely and adapt AVX schedules for SSE target.
| // param.kernel_size[1]}); | ||
| // wshape = ConvertLayout(wshape, kOIHW, kernel_layout); | ||
| // wshape[kernel_layout.indexof('O')] *= param.groups; | ||
| // NNVM_ASSIGN_INPUT_SHAPE(attrs, *in_shape, Conv2DParam::kWeight, wshape); |
There was a problem hiding this comment.
Instead of commenting out, I'd sugguest remove them and leave more informative comment on why you don't do weight shape inference here.
| pre_packed = False | ||
| CO, _, KH, KW = get_const_tuple(kernel.shape) | ||
| else: | ||
| pre_packed = True |
There was a problem hiding this comment.
I'd sugguest pre_packed -> pre_computed, as this is not simply pre packing.
| copy_inputs[1] = weight | ||
| new_attrs['tile_size'] = tile_size | ||
| return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs) | ||
| else: |
There was a problem hiding this comment.
No need for else: block here. I think lint should catch this.
| return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs) | ||
| else: | ||
| # do nothing for depthwise convolution | ||
| return sym.conv2d(*copy_inputs, **new_attrs) |
There was a problem hiding this comment.
Better to return None here. When I was doing cuda winograd, returning a new conv2d symbol here caused a strange issue during InferShape. Returning None here solved the issue for me.
|
@merrymercy For winograd input/output transform, I was able to achieve minimal amount of math, like this for F(2x2, 3x3), for example. For F(2x2, 3x3), this reduces the number of add/sub for each 4x4 input tile from 64 to 32. Similar reduction exists for F(4x4, 3x3) and it is even more effective. It also allows completely removing matmul from compute definition of input/output transform. Check out here for a simple test case for this and here for how I integrated this reduction to my implementation of x86 F(4x4, 3x3). |
| s[V].unroll(r_nu) | ||
| s[V].parallel(b) | ||
| s[DD].compute_at(s[V], bb) | ||
|
|
There was a problem hiding this comment.
Can you add vectorize here somehow? I'm using different layout from yours, but I can do vectroized input/output transform. My implementation is here
There was a problem hiding this comment.
Functionally, would we expect vectorization coverage from this template already? e.g., if a configuration produces an easy-to-vectorize pattern here, would we expect llvm to vectorize already?
There was a problem hiding this comment.
I don't think llvm can auto-vectorize this.
| co, vc = cfg.define_split('tile_co', co, num_outputs=2) | ||
| oh, vh = cfg.define_split('tile_oh', oh, num_outputs=2) | ||
| ow, vw = cfg.define_split('tile_ow', ow, num_outputs=2) | ||
| elif num_tile == 3: # for gpu |
There was a problem hiding this comment.
Yes, it is for ARM Mali GPU.. They can share this function. But I didn't send code of mali in this PR
| Parameters | ||
| ---------- | ||
| outs: Array of Tensor | ||
| The computation graph description of conv2d_nchw |
There was a problem hiding this comment.
conv2d_nchw -> conv2d_winograd_weight_transform
| for network in networks: | ||
| net, params, shape, out_shape = get_network(network, batch_size=1) | ||
|
|
||
| with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']): |
There was a problem hiding this comment.
Don't your case (AlterOpLayout optimization) not enter into conv2d_NCHWc and x86 schedule? which is only suitable for x86 now. At least for me, it will report error.
| return s | ||
|
|
||
|
|
||
| @conv2d_alter_layout.register(["arm_cpu", "mali"]) |
There was a problem hiding this comment.
@FrozenGene I registered alter_layout for arm_cpu here. I didn't get any error.
There was a problem hiding this comment.
@FrozenGene I added pre-tuned parameters for pynq board, which has a Cortex-A9 cpu.
|
@masahi Did you test the performance difference between non-minimal math version and minimal math version? I tried your compute deceleration but cannot get speedup. Have llvm handled this case already? |
|
@merrymercy yes, I have scripts to compare minimal version vs non minimal version. You can run them yourself to see the difference. The scripts will dump total execution time as well as time taken for input transform, batched gemm, and output transform separately. Obviously, if your winograd kernel is completely bottlenecked by gemm, there should be no performance difference. I observed this with my GPU version and x86 avx2 version. For x86 sse target, my minimal version is consistently faster than non minimal one. The above two scripts will benchmark with sse target. I have tested on recent CPUs (Coffee lake) and old high core count Xeon (12-16 core, Sandybridge and Nehalem). On recent CPUs difference is small. On old Xeon, I don't think LLVM can do this non trivial common subexpression elimination. Even if LLVM can detect common subexpressions, it is not supposed to eliminate them I believe, because this is float ops. |
|
Thanks for the explanation! We can keep the non minimal version for ARM CPU in this PR, since it is more readable. |
|
yes, you can follow up with another PR if you find a way to improve performance later. Let's merge this first. |
|
|
||
| 'ndk': use Android NDK to create shared library. Use this for android target. | ||
|
|
||
| callable; customized build function for other backends (e.g. VTA) |
There was a problem hiding this comment.
see measure/measure_methods.py default_build_func for example
|
|
||
| 'local-nofork': use local device for measure but does not use multiprocessing. | ||
| This mode is suitable for debug, but does not support timeout and parallel. | ||
| callable: It is a customized function for measurement. |
There was a problem hiding this comment.
see measure/measure_methods.py measure_rpc for example
|
|
||
| If your device has a same SoC of the above device, you can reuse these parameters | ||
| (e.g. use `llvm -device=arm_cpu -mode=rk3399 -target=aarch64-linux-gnu` as target). | ||
| Otherwise, you need to tune for your own device, please follow this [tutorial](please_fix_this_later.html). |
|
|
||
| AUTOTVM_TOPHUB_ROOT_PATH = os.path.join(os.path.expanduser('~'), ".tvm", "tophub") | ||
|
|
||
| def load_context(target, rootpath=AUTOTVM_TOPHUB_ROOT_PATH): |
There was a problem hiding this comment.
it is still better to allow the with block
with tophub.context(target):
my codeThere was a problem hiding this comment.
Is it possible to allow user also specify their customized location of tunning logs?
| """ | ||
| TopHub: Tensor Operator Hub | ||
| To get the best performance, we typically need auto-tuning for the specific devices. | ||
| TVM releases pre-tuned parameters in TopHub (https://github.com/uwsaml/tvm-distro) |
There was a problem hiding this comment.
since tvm-distro location can change, do not use url for now
| """ | ||
| path = tempdir() | ||
| filename = path.relpath("info.json") | ||
| print("Download meta info for pre-tuned parameters") |
| } | ||
|
|
||
| inline bool WinogradConv2DInferShape(const nnvm::NodeAttrs& attrs, | ||
| std::vector<TShape>* in_shape, |
| The number of measurement task that can run in parallel. | ||
| Set this according to the number of cpu cores (for compilation) and | ||
| the number of devices you have (for measuring generate code). | ||
| do_fork: bool, optional |
There was a problem hiding this comment.
Principle one of interface design is to simplify and hide options user do not use, in this case, do_fork is only used in local. I think we should remove this, and allow user to pass in
measure_func = autotvm.measure.local_nofork(measure args)
There was a problem hiding this comment.
similarly, if pack_size, rpc_device_key etc are only arguments to rpc. I think we should have good default, and allow user to do
measure_func = autotvm.measure.rpc_(rpc_key=xxxx)There was a problem hiding this comment.
do_fork is used in local_executor not measure_func. It can be used in any mode.
| build_func='default', | ||
|
|
||
| replay_db=None, | ||
| save_to_replay_db=True): |
There was a problem hiding this comment.
can save_to_display_db become an optional callback function?
|
need to rebase against the master |
|
I am doing some refactor. Do not merge. |
|
Is it worth preserving the function |
|
@merrymercy can you add tvm.target.rasp() as per comment by @ajtulloch ? |
|
I noticed that there are some things deleted; are we removing |
|
@ajtulloch |
|
@merrymercy Have we updated the related docs? I get your PR code into tvm/master and follow the tutorial https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html, but I find that I can not train and get this information: BTW, I register one remote device named |
|
Thanks @merrymercy @masahi @ajtulloch @eqy @FrozenGene This is merged |
|
@FrozenGene can you open an discuss thread in the https://discuss.tvm.ai/ so we can followup the discussion? |
|
|
||
| new_attrs = {k: attrs[k] for k in attrs.keys()} | ||
|
|
||
| assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \ |
There was a problem hiding this comment.
I know we have merged it. But when I ran one model today, I find we can have better mechanism @merrymercy . Move this line after line 491.
assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \
"when alter_op_layout is enabled"
(if groups == 1)Because we will not change the kernel layout for depthwise conv2d.
Or we can support it in compute_conv2d function use topi.nn.dialate(inputs[1], (1, 1, dialate_h, dialate_w, 1).
This PR includes:
benchmark results