[AUTOTVM] TOPI integration for ARM CPU #1487

Merged: tqchen merged 79 commits into apache:master from merrymercy:arm_cpu on Aug 2, 2018
merrymercy commented Jul 25, 2018

Member

This PR includes:

benchmark results

  • Firefly-RK3399 : 2 x Cortex A73 1.8Ghz
--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      48.87 ms            (1.07 ms)
mobilenet            82.16 ms            (0.09 ms)
resnet-18            162.55 ms           (0.14 ms)
vgg-16               912.44 ms           (0.32 ms)
  • Raspberry Pi 3B : 4 x Cortex A53 1.2Ghz
--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      92.34 ms            (0.07 ms)
mobilenet            145.22 ms           (0.11 ms)
resnet-18            325.06 ms           (0.23 ms)
vgg-16               crashed due to out of memory
  • Huawei P20 Pro / Mate10 Pro (SoC: HiSilicon Kirin 970) : (4 x Cortex A73 2.36GHz)
--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      27.53 ms            (1.14 ms)
mobilenet            46.53 ms            (0.31 ms)
resnet-18            76.74 ms            (0.18 ms)
vgg-16               479.84 ms           (0.92 ms)
  • Google Pixel 2 (SoC: Qualcomm Snapdragon 835) : (4 × Kryo 2.35 GHz)
--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      23.57 ms            (0.42 ms)
mobilenet            40.73 ms            (0.11 ms)
resnet-18            63.95 ms            (0.03 ms)
vgg-16               407.75 ms           (9.57 ms)
  • PYNQ (2 x Cortex-A9 650MHz)
--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      452.40 ms           (0.09 ms)
mobilenet            772.16 ms           (0.25 ms)
resnet-18            1243.49 ms          (0.67 ms)
vgg-16               crashed due to out of memory

Comment thread apps/benchmark/README.md Outdated
@@ -0,0 +1,123 @@
# Performance Benchmark

## ARM CPU

Member

Consider putting the performance benchmark results in the wiki for now; later we can have a hosted website for the results, because they can change over time.

Member Author

How can I edit the wiki?

Comment thread apps/benchmark/README.md Outdated
Note: If a board has big.LITTLE architecture, we will use all big cores.
Otherwise, we will use all cores.

- **Firefly-RK3399 : 2 x Cortex A73 1.8Ghz+ 4 x Cortex A53 1.5Ghz**

Member

Only mark the cores being used (in this case, the big ones).

Comment thread apps/benchmark/README.md Outdated
parameters in [this repo](https://github.com/uwsaml/tvm-distro).
During compilation, TVM will download these operator parameters automatically.

But we don't tune for other devices, so you can only run benchmark for these devices.

Member

Remove this line, and after this, have a quick section on how to do tuning for a new device

merrymercy commented Jul 25, 2018

Member Author

cc people who may be interested in this PR:
@kevinthesun @yzhliu @Laurawly (autotuning, topi)
@masahi (cpu winograd)
@ajtulloch (mobile cpu)

Comment thread python/tvm/autotvm/record.py Outdated

logging.info("Finish loading %d records", counter)
if verbose:
logging.info("Finish loading %d records", counter)

Member

Consider just using logging.debug?
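
A minimal sketch of what this suggestion amounts to, using only the standard `logging` module. The function name `load_records` and the logger name are hypothetical and are not taken from the PR.

```python
import logging

logger = logging.getLogger("autotvm_sketch")  # hypothetical logger name


def load_records(records):
    """Count records and report the total at DEBUG level (illustrative only)."""
    counter = sum(1 for _ in records)
    # No `verbose` flag: the message is emitted only when the logger's
    # effective level is DEBUG or lower.
    logger.debug("Finish loading %d records", counter)
    return counter


if __name__ == "__main__":
    # The caller, not the library, decides whether the message is shown.
    logging.basicConfig(level=logging.DEBUG)
    load_records(["a", "b", "c"])
```

With this shape, users silence or enable the message through the logging level rather than through an extra function argument.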


def tune_tasks(tasks,
rpc_device_key,

Member

no empty line between arguments

early_stopping=200,
log_filename='tuning.log',

mea_number=5,

Member

mea -> measure

mea_number=5,
mea_parallel_num=1,
mea_timeout=20,
mea_use_ndk=False,

Member

Is it possible to pass in MeasureOption here? These options seem to duplicate those in MeasureOption.
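
One way to read this suggestion, sketched with a plain dataclass. `MeasureOptionSketch` and `tune_tasks_sketch` are hypothetical stand-ins and do not reproduce the real autotvm classes or signatures; only the default values (5, 1, 20, False) are copied from the diff above.

```python
from dataclasses import dataclass


@dataclass
class MeasureOptionSketch:
    """Hypothetical options bundle; not the autotvm MeasureOption."""
    number: int = 5
    parallel_num: int = 1
    timeout: int = 20
    use_ndk: bool = False


def tune_tasks_sketch(tasks, measure_option=None, early_stopping=200):
    """Take one options object instead of several mea_* keyword arguments."""
    if measure_option is None:
        measure_option = MeasureOptionSketch()
    # A real tuner would run measurements here; this only shows how the
    # options travel together with each task.
    return [(task, measure_option.number, measure_option.timeout)
            for task in tasks]


print(tune_tasks_sketch(["conv2d_0"], MeasureOptionSketch(timeout=60)))
```

The function signature then stays stable when new measurement options are added.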

len(xs) - np.sum(valid_index),
self.feature_cache.size(self.fea_type))
if self.verbose:
logging.info("train: %.2f\tobs: %d\terror: %d\tn_cache: %d",

Member

Consider using logging.debug and allowing the user to set the level.

If is not None, the tuner will first select
top-(plan_size * diversity_filter_ratio) candidates according to the cost model
and then pick batch_size of them according to the diversity metric.
verbose: int

Member

Consider relying directly on the logging level for verbosity.

Member Author

This is int not bool, so we leave this.

Member

`verbose` usually does not have this meaning, so the argument is confusing. A better name is log_interval.
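
A small sketch of how an integer `log_interval` could combine with the logging level. The function `train_sketch` and the logger name are hypothetical; the actual tuner code is not shown in this thread.

```python
import logging

logger = logging.getLogger("cost_model_sketch")  # hypothetical logger name


def train_sketch(n_rounds, log_interval=25):
    """Log every `log_interval` rounds; 0 or None disables periodic logging."""
    logged_rounds = []
    for i in range(1, n_rounds + 1):
        if log_interval and i % log_interval == 0:
            # How often to log is set by the argument; whether the message
            # is actually shown is still decided by the logging level.
            logger.debug("round %d of %d", i, n_rounds)
            logged_rounds.append(i)
    return logged_rounds


print(train_sketch(100, log_interval=25))
```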

Comment thread python/tvm/target.py Outdated
}
pre_defined_opt = opt_table.get(model, [])

if not os.path.isfile(os.path.join(AUTOTVM_PRETUNED_PARAM_ROOT_PATH, "arm_cpu.log")):

Member

Consolidate all the logic for file system manipulation and the autotvm cache into one file, say autotvm.tophub.

The raw kernel tensor
tile_size: int
Tile size of winograd transform. e.g. 2 for F(2x2, 3x3) and 4 for F(4x4, 3x3)
"""

Member

Need to add the return values to the docstring.

Comment thread tutorials/autotvm/tune_nnvm_arm.py Outdated
these operators, it will query this log file to get the best knob values.

We also released pre-tuned parameters for some arm devices. You can go to
`ARM CPU Benchmark <https://github.com/merrymercy/tvm/blob/arm_cpu/apps/benchmark/README.md#arm-cpu>`_

Member

link to the master version

tqchen commented Jul 25, 2018

Member

Please also confirm the VTA CPU test cases, since these depend on the availability of the old rasp schedule, which is removed here.

Comment thread apps/benchmark/README.md Outdated

E.g. For my RK3399, I use `python3 -m tvm.exec.rpc_server --tracker=10.77.1.123:9190 --key=rk3399`

* For Andoird device

Contributor

nit: Android

Comment thread apps/benchmark/README.md Outdated
```

If you do not do tuning and run the benchmark for other devices directly,
the performance is not gauranteed (This is still doable, you can pick a most

Contributor

nit: guaranteed

CHECK(!op->condition.type().is_vector());
Expr condition = this->Mutate(op->condition);
if (condition.type().is_vector()) {
LOG(WARNING) << "Detect vector condition in Vectorized Loop, scalarizing...";

Member

Why remove this?

Contributor

Currently it seems to pollute the logged data; ideally we would just print this once?

masahi commented Jul 25, 2018

Member

This message is important, because it tells us when vectorization isn't working because of vectorize axis length vs input shape mismatch.

I'd imagine this message will mess up log during auto tuning, though.

merrymercy commented Jul 29, 2018

Member Author

OK, I reverted this change.

merrymercy commented Jul 29, 2018

Member Author

It does not pollute logging data. It occurs when I use "llvm" as target to build resnet-18.

https://github.com/dmlc/tvm/blob/f33fd5c03d8a2b3972e3b69a79a89d0c9754cd9e/topi/python/topi/x86/conv2d.py#L214-L218
@masahi Can I fix this by checking the length of w and only vectorize it when the length of w is a multiple of 16?

masahi commented Jul 29, 2018

Member

@merrymercy Sure. I am aware of this issue. Probably 8 is a better default split factor than 16 for imagenet model.

I am planning to remove this old schedule completely and adapt AVX schedules for SSE target.
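
The idea discussed above (vectorize only when the width is a multiple of the split factor) can be sketched in plain Python. The helper `pick_vector_factor` and its candidate list are hypothetical; this is not the fix that was applied to the x86 schedule.

```python
def pick_vector_factor(width, candidates=(16, 8, 4, 2)):
    """Return the largest candidate that divides `width`, or 1 to skip vectorization."""
    for factor in candidates:
        if width % factor == 0:
            return factor
    return 1


# Typical ImageNet feature-map widths: 16 does not divide 56, but 8 does.
for w in (224, 56, 28, 14, 7):
    print(w, pick_vector_factor(w))
```

A schedule could then split the `w` axis by the returned factor and vectorize the inner part only when the factor is greater than 1, which avoids the non-divisible case that produces the vector-condition warning.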

Comment thread nnvm/src/top/nn/convolution.cc Outdated
// param.kernel_size[1]});
// wshape = ConvertLayout(wshape, kOIHW, kernel_layout);
// wshape[kernel_layout.indexof('O')] *= param.groups;
// NNVM_ASSIGN_INPUT_SHAPE(attrs, *in_shape, Conv2DParam::kWeight, wshape);

Member

Instead of commenting these out, I'd suggest removing them and leaving a more informative comment on why you don't do weight shape inference here.

Comment thread topi/python/topi/arm_cpu/conv2d.py Outdated
pre_packed = False
CO, _, KH, KW = get_const_tuple(kernel.shape)
else:
pre_packed = True

Member

I'd suggest pre_packed -> pre_computed, as this is not simply pre-packing.

Comment thread topi/python/topi/arm_cpu/conv2d.py Outdated
copy_inputs[1] = weight
new_attrs['tile_size'] = tile_size
return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs)
else:

Member

No need for else: block here. I think lint should catch this.

Comment thread topi/python/topi/arm_cpu/conv2d.py Outdated
return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs)
else:
# do nothing for depthwise convolution
return sym.conv2d(*copy_inputs, **new_attrs)

masahi commented Jul 25, 2018

Member

Better to return None here. When I was doing cuda winograd, returning a new conv2d symbol here caused a strange issue during InferShape. Returning None here solved the issue for me.
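
A sketch of the control flow suggested in the two comments above: no `else:` after a `return`, and `None` for the unchanged case. It uses plain dicts instead of NNVM symbols, the function name is hypothetical, and it assumes (as the comment implies) that a `None` result tells the caller to keep the original op.

```python
def alter_conv2d_layout_sketch(attrs, tile_size=4):
    """Return rewritten attributes for a regular conv2d, or None for no change."""
    if attrs.get("groups", 1) != 1:
        # Depthwise or grouped convolution: signal "leave the op as it is"
        # instead of building an identical replacement.
        return None
    # No `else:` is needed here because the branch above already returned.
    new_attrs = dict(attrs)
    new_attrs["tile_size"] = tile_size
    return new_attrs


print(alter_conv2d_layout_sketch({"groups": 1, "channels": 64}))
print(alter_conv2d_layout_sketch({"groups": 32, "channels": 32}))
```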

masahi commented Jul 25, 2018

Member

@merrymercy For winograd input/output transform, I was able to achieve minimal amount of math, like this for F(2x2, 3x3), for example.

produce temp {
  temp[0] = (d[0] - d[8])
  temp[1] = (d[1] - d[9])
  temp[2] = (d[2] - d[10])
  temp[3] = (d[3] - d[11])
  temp[4] = (d[4] + d[8])
  temp[5] = (d[5] + d[9])
  temp[6] = (d[6] + d[10])
  temp[7] = (d[7] + d[11])
  temp[8] = (d[8] - d[4])
  temp[9] = (d[9] - d[5])
  temp[10] = (d[10] - d[6])
  temp[11] = (d[11] - d[7])
  temp[12] = (d[4] - d[12])
  temp[13] = (d[5] - d[13])
  temp[14] = (d[6] - d[14])
  temp[15] = (d[7] - d[15])
}
produce V {
  V[0] = (temp[0] - temp[2])
  V[1] = (temp[1] + temp[2])
  V[2] = (temp[2] - temp[1])
  V[3] = (temp[1] - temp[3])
  V[4] = (temp[4] - temp[6])
  V[5] = (temp[5] + temp[6])
  V[6] = (temp[6] - temp[5])
  V[7] = (temp[5] - temp[7])
  V[8] = (temp[8] - temp[10])
  V[9] = (temp[9] + temp[10])
  V[10] = (temp[10] - temp[9])
  V[11] = (temp[9] - temp[11])
  V[12] = (temp[12] - temp[14])
  V[13] = (temp[13] + temp[14])
  V[14] = (temp[14] - temp[13])
  V[15] = (temp[13] - temp[15])
}

For F(2x2, 3x3), this reduces the number of add/sub for each 4x4 input tile from 64 to 32. Similar reduction exists for F(4x4, 3x3) and it is even more effective. It also allows completely removing matmul from compute definition of input/output transform.
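
The listing above can be checked against the matrix form of the input transform, V = B^T d B, in plain Python. The matrix `B_T` below is the standard F(2x2, 3x3) input-transform matrix; the helper names are hypothetical and the tile values are arbitrary integers chosen so that the comparison is exact.

```python
# B^T for F(2x2, 3x3).
B_T = [
    [1, 0, -1, 0],
    [0, 1, 1, 0],
    [0, -1, 1, 0],
    [0, 1, 0, -1],
]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transform_matmul(d):
    """Reference version: V = B^T d B for a 4x4 input tile d."""
    b = [list(col) for col in zip(*B_T)]  # B is the transpose of B^T
    return matmul(matmul(B_T, d), b)


def transform_minimal(d):
    """Same result with 16 + 16 = 32 adds/subs, following the listing above."""
    f = [x for row in d for x in row]  # row-major flatten: d[0..15]
    temp = [0] * 16
    for c in range(4):  # column-wise pass, equals B^T d
        temp[c] = f[c] - f[8 + c]
        temp[4 + c] = f[4 + c] + f[8 + c]
        temp[8 + c] = f[8 + c] - f[4 + c]
        temp[12 + c] = f[4 + c] - f[12 + c]
    v = [0] * 16
    for r in range(4):  # row-wise pass, equals temp B
        t = temp[4 * r:4 * r + 4]
        v[4 * r] = t[0] - t[2]
        v[4 * r + 1] = t[1] + t[2]
        v[4 * r + 2] = t[2] - t[1]
        v[4 * r + 3] = t[1] - t[3]
    return [v[4 * r:4 * r + 4] for r in range(4)]


tile = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
assert transform_minimal(tile) == transform_matmul(tile)
```

This only confirms that the two forms agree numerically; it says nothing about their relative speed on a given target.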

Check out here for a simple test case for this and here for how I integrated this reduction to my implementation of x86 F(4x4, 3x3).

s[V].unroll(r_nu)
s[V].parallel(b)
s[DD].compute_at(s[V], bb)

Member

Can you add vectorize here somehow? I'm using a different layout from yours, but I can do vectorized input/output transforms. My implementation is here

Contributor

Functionally, would we expect vectorization coverage from this template already? e.g., if a configuration produces an easy-to-vectorize pattern here, would we expect llvm to vectorize already?

masahi commented Jul 25, 2018

Member

I don't think llvm can auto-vectorize this.

Comment thread topi/python/topi/arm_cpu/conv2d.py Outdated
co, vc = cfg.define_split('tile_co', co, num_outputs=2)
oh, vh = cfg.define_split('tile_oh', oh, num_outputs=2)
ow, vw = cfg.define_split('tile_ow', ow, num_outputs=2)
elif num_tile == 3: # for gpu

masahi commented Jul 26, 2018

Member

Seems irrelevant for arm cpu.

Member Author

Yes, it is for the ARM Mali GPU. They can share this function, but I didn't include the Mali code in this PR.

Comment thread topi/python/topi/generic/nn.py Outdated
Parameters
----------
outs: Array of Tensor
The computation graph description of conv2d_nchw

Member

conv2d_nchw -> conv2d_winograd_weight_transform

@tqchen tqchen mentioned this pull request Jul 26, 2018
4 tasks
for network in networks:
net, params, shape, out_shape = get_network(network, batch_size=1)

with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']):

Member

Doesn't your case (AlterOpLayout optimization) enter the conv2d_NCHWc and x86 schedule, which is only suitable for x86 for now? At least for me, it reports an error.

return s


@conv2d_alter_layout.register(["arm_cpu", "mali"])

Member Author

@FrozenGene I registered alter_layout for arm_cpu here. I didn't get any error.

Member

Got it.

merrymercy commented Aug 1, 2018

Member Author

@FrozenGene I added pre-tuned parameters for pynq board, which has a Cortex-A9 cpu.

merrymercy commented Jul 27, 2018

Member Author

@masahi Did you test the performance difference between the non-minimal math version and the minimal math version? I tried your compute declaration but could not get a speedup. Does LLVM handle this case already?

masahi commented Jul 27, 2018

Member

@merrymercy yes, I have scripts to compare minimal version vs non minimal version. You can run them yourself to see the difference. The scripts will dump total execution time as well as time taken for input transform, batched gemm, and output transform separately.

Obviously, if your winograd kernel is completely bottlenecked by gemm, there should be no performance difference. I observed this with my GPU version and x86 avx2 version.

For the x86 SSE target, my minimal version is consistently faster than the non-minimal one. The above two scripts benchmark with the SSE target. I have tested on recent CPUs (Coffee Lake) and old high-core-count Xeons (12-16 cores, Sandy Bridge and Nehalem). On recent CPUs the difference is small. On the old Xeons, where my non-minimal version was surprisingly slow, I've seen a big difference.

I don't think LLVM can do this non trivial common subexpression elimination. Even if LLVM can detect common subexpressions, it is not supposed to eliminate them I believe, because this is float ops.

@merrymercy

Member Author

Thanks for the explanation! We can keep the non minimal version for ARM CPU in this PR, since it is more readable.

masahi commented Jul 27, 2018

Member

yes, you can follow up with another PR if you find a way to improve performance later. Let's merge this first.

Comment thread python/tvm/autotvm/measure/measure.py Outdated

'ndk': use Android NDK to create shared library. Use this for android target.

callable; customized build function for other backends (e.g. VTA)

Member Author

see measure/measure_methods.py default_build_func for example

Comment thread python/tvm/autotvm/measure/measure.py Outdated

'local-nofork': use local device for measure but does not use multiprocessing.
This mode is suitable for debug, but does not support timeout and parallel.
callable: It is a customized function for measurement.

Member Author

see measure/measure_methods.py measure_rpc for example

Comment thread apps/benchmark/README.md Outdated

If your device has a same SoC of the above device, you can reuse these parameters
(e.g. use `llvm -device=arm_cpu -mode=rk3399 -target=aarch64-linux-gnu` as target).
Otherwise, you need to tune for your own device, please follow this [tutorial](please_fix_this_later.html).

Member

fix this or remove for now?

Comment thread python/tvm/autotvm/tophub.py Outdated

AUTOTVM_TOPHUB_ROOT_PATH = os.path.join(os.path.expanduser('~'), ".tvm", "tophub")

def load_context(target, rootpath=AUTOTVM_TOPHUB_ROOT_PATH):

Member

it is still better to allow the with block

with tophub.context(target):
   my code

Member

Is it possible to allow users to also specify a custom location for their tuning logs?

Comment thread python/tvm/autotvm/tophub.py Outdated
"""
TopHub: Tensor Operator Hub
To get the best performance, we typically need auto-tuning for the specific devices.
TVM releases pre-tuned parameters in TopHub (https://github.com/uwsaml/tvm-distro)

Member

since tvm-distro location can change, do not use url for now

Comment thread python/tvm/autotvm/tophub.py Outdated
"""
path = tempdir()
filename = path.relpath("info.json")
print("Download meta info for pre-tuned parameters")

Member

use logging instead of print

Comment thread nnvm/src/top/nn/convolution.cc Outdated
}

inline bool WinogradConv2DInferShape(const nnvm::NodeAttrs& attrs,
std::vector<TShape>* in_shape,

Member

argument alignment

The number of measurement task that can run in parallel.
Set this according to the number of cpu cores (for compilation) and
the number of devices you have (for measuring generate code).
do_fork: bool, optional

Member

Principle one of interface design is to simplify and hide options the user does not use. In this case, do_fork is only used in local mode. I think we should remove it and allow the user to pass in

measure_func = autotvm.measure.local_nofork(measure args)

Member

Similarly, if pack_size, rpc_device_key, etc. are only arguments to rpc, I think we should have good defaults and allow the user to do

measure_func = autotvm.measure.rpc_(rpc_key=xxxx)
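
The factory pattern proposed in these two comments can be sketched as closures that bind mode-specific options. The names `local_nofork_sketch` and `rpc_sketch`, their parameters, and the returned tuples are hypothetical; they only show the interface shape, not the autotvm API.

```python
def local_nofork_sketch(number=5):
    """Build a measure function for local, non-forking measurement."""
    def measure(inputs):
        # A real implementation would build and time each input here.
        return [("local-nofork", x, number) for x in inputs]
    return measure


def rpc_sketch(rpc_key, number=5, timeout=20):
    """Build a measure function that carries the RPC-only options."""
    def measure(inputs):
        return [("rpc", rpc_key, x, number, timeout) for x in inputs]
    return measure


# The tuner only sees a callable; mode-specific options stay in the factory.
measure_func = rpc_sketch(rpc_key="rk3399")
print(measure_func(["config_0"]))
```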

Member Author

do_fork is used in local_executor not measure_func. It can be used in any mode.

Comment thread python/tvm/autotvm/measure/measure.py Outdated
build_func='default',

replay_db=None,
save_to_replay_db=True):

Member

Can save_to_replay_db become an optional callback function?

tqchen commented Jul 28, 2018

Member

need to rebase against the master

merrymercy commented Aug 1, 2018

Member Author

I am doing some refactoring. Do not merge.

@tqchen tqchen self-assigned this Aug 1, 2018
@ajtulloch

Contributor

Is it worth preserving the function tvm.target.rasp() which just redirects to tvm.target.arm_cpu('rasp3b')? There's a bunch of tutorial/discuss/stackoverflow code that mentions it, and it seems like an easy way to not break existing out-of-tree code?
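
A sketch of the kind of compatibility alias being asked for. Both functions are stand-ins that return plain dicts rather than TVM target objects, and the deprecation warning is an optional extra that the comment does not request.

```python
import warnings


def arm_cpu(model="unknown"):
    """Stand-in for the real target constructor; returns a plain dict here."""
    return {"device": "arm_cpu", "model": model}


def rasp():
    """Backward-compatible alias that forwards to arm_cpu('rasp3b')."""
    # Optional: nudge callers toward the new name without breaking them.
    warnings.warn("rasp() is kept for compatibility; prefer arm_cpu('rasp3b')",
                  DeprecationWarning, stacklevel=2)
    return arm_cpu("rasp3b")


print(rasp())
```

Existing out-of-tree code that calls `rasp()` keeps working while new code can use the more general constructor.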

tqchen commented Aug 1, 2018

Member

@merrymercy can you add tvm.target.rasp() as per comment by @ajtulloch ?

eqy commented Aug 1, 2018

Contributor

I noticed that there are some things deleted; are we removing check_correctness and automatic sanity checking for CUDA/OpenCL GPU targets, or is that currently being refactored?

@merrymercy

Member Author

@ajtulloch tvm.target.rasp added.
@eqy They are moved to another file (measure_methods.py)

FrozenGene commented Aug 2, 2018

Member

@merrymercy Have the related docs been updated? I merged your PR code into tvm/master and followed the tutorial https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html, but tuning does not work and I get this output:
[Task 1/19] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (201/1000) | 499.30 s Done
The result is 0.00 / 0.00. Is the doc out of sync with the code, or did I miss a step?

BTW, I registered a remote device named custome_device, which is not in your predefined device table.

tqchen commented Aug 2, 2018

Member

Thanks @merrymercy @masahi @ajtulloch @eqy @FrozenGene This is merged

@tqchen tqchen merged commit d3ca9c2 into apache:master Aug 2, 2018
tqchen commented Aug 2, 2018

Member

@FrozenGene can you open a discussion thread on https://discuss.tvm.ai/ so we can follow up on the discussion?

@merrymercy merrymercy deleted the arm_cpu branch August 3, 2018 23:55
tqchen pushed a commit to tqchen/tvm that referenced this pull request Aug 4, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018

new_attrs = {k: attrs[k] for k in attrs.keys()}

assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \

FrozenGene commented Aug 23, 2018

Member

I know we have merged it, but when I ran a model today I found we could use a better mechanism, @merrymercy. Move this line after line 491.

assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \
                                                                               "when alter_op_layout is enabled"
(if groups == 1)

Because we will not change the kernel layout for depthwise conv2d.

Or we can support it in the compute_conv2d function using topi.nn.dilate(inputs[1], (1, 1, dilate_h, dilate_w, 1)).
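
The first proposal (apply the dilation check only when groups == 1) can be sketched with plain dicts. The function name and the returned strings are hypothetical; only the assertion message is taken from the snippet above.

```python
def alter_layout_guard_sketch(attrs):
    """Reject dilation only on the path that rewrites the kernel layout."""
    groups = attrs.get("groups", 1)
    dilation = tuple(attrs.get("dilation", (1, 1)))
    if groups == 1:
        # Only the regular-convolution path alters the kernel layout, so
        # only this path needs to refuse dilated kernels.
        assert dilation == (1, 1), \
            "Does not support dilation when alter_op_layout is enabled"
        return "altered"
    # Depthwise convolution: the kernel layout is untouched, so dilation
    # is left for the normal compute path to handle.
    return "unchanged"


print(alter_layout_guard_sketch({"groups": 32, "dilation": (2, 2)}))
```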

tqchen pushed a commit to tqchen/tvm that referenced this pull request Mar 29, 2020