Text parser optimization (~4.5x perf)#282
Conversation
Those aspects of the text format are not optional, they must be implemented to have a correct parser. |
brian-brazil
left a comment
There was a problem hiding this comment.
Performance improvements would be great, however the format has many ways it can be represented and this code only parses a subset of the potential valid input. If you can manage to make it work with that, that'd be great.
prometheus_client/core.py
Outdated
There was a problem hiding this comment.
A __str__ would make more sense I think
There was a problem hiding this comment.
renamed it to __str__ and having repr call str because it's useful for comparing unit test output:
E First differing element 0:
E <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>
E <Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>
E
E - [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1), (u'a', {u'foo': u'baez'}, 2), (u'a', {u'foo': u'buz'}, 3)]>]
E ? -
E
E + [<Metric name: a, documentation: help, type: counter, samples: [(u'a', {u'foo': u'bar'}, 1.0), (u'a', {u'foo': u'baz'}, 2.0), (u'a', {u'foo': u'buz'}, 3.0)]>]
E ?
instead of
E First differing element 0:
E <prometheus_client.core.CounterMetricFamily object at 0x108018c50>
E <prometheus_client.core.Metric object at 0x108018cd0>
E
E - [<prometheus_client.core.CounterMetricFamily object at 0x108018c50>]
E ? ------- ------ ^
E
E + [<prometheus_client.core.Metric object at 0x108018cd0>]
There was a problem hiding this comment.
Usually str calls repr rather than the other way around. repr should usually be an instantiatable version of object, while str is more human readable.
There was a problem hiding this comment.
Modified repr to fit what it's supposed to be, still way more readable for tests outputs 👍
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
Help and label values have different escaping rules (double quote is the difference), you need two functions for this.
There was a problem hiding this comment.
added another function specific for Help
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
Please keep comments as full sentances
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
Any mix of any number of spaces and tabs is permitted
There was a problem hiding this comment.
Added some additional unit tests around this 👍
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
there can be whitespace after the brace, and basically everywhere else between things
There was a problem hiding this comment.
this should be covered, I added one additional test case in test_spaces
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
A label value could contain a }
There was a problem hiding this comment.
This already taken into account with the use of rindex, added a test case to validate this point.
prometheus_client/parser.py
Outdated
There was a problem hiding this comment.
There can be a trailing comma after the last "
There was a problem hiding this comment.
should be already covered, test_commas is validating this
41edcb8 to
a25d444
Compare
Signed-off-by: Pierre Margueritte <mfpierre@gmail.com>
a25d444 to
1d7190c
Compare
| i = 0 | ||
| while i < len(value_substr): | ||
| i = value_substr.index('"', i) | ||
| if value_substr[i - 1] != "\\": |
There was a problem hiding this comment.
What if if you have x="" as the label? i - 1 will be -1, which might have unexpected results
There was a problem hiding this comment.
added a unit-test around empty labels, but this works fine 👍
Signed-off-by: Pierre <mfpierre@gmail.com>
|
Thanks! |
|
Hey @brian-brazil, any plans to do a release soon? 🙇 |
|
I've added it to my todo list |
Hi,
As we extensively use this library for parsing prometheus text format, we have observed that for big payloads (like https://github.com/kubernetes/kube-state-metrics for a big cluster) that can contains 400k+ lines the text parsing can be quite slow. (up to 27secs)
We tried to optimize the parser by dropping the state machine and leveraging native python functions such as
indexorfindand it gives us an average of x5 performances.Here are some benchmark using
timeit:For the KSM payload (400k lines) the parsing goes from ~27sec to ~4.7sec
Note: We could go up to almost 10x performance if we dropped some edge-cases treatment (like escaping, tab/space, etc...) could we consider a "strict" parsing mode that we could optionally use for "good citizens"?