PubSubPublisher : High CPU usage #3194

@hmigneron

Description

We use Pub/Sub in a fairly conventional way:

                                                Database X
                                              /   
                                             / 
Pubsub (Topic A) --> Worker (Subscription A) 
                                             \
                                              \
                                                Database Y

Our infrastructure is on Google Cloud (we use Kubernetes). For a steady ~3000 messages/second, 6 pods with the following config are enough and run at around 80% CPU:

  "resources": {
    "requests": {
      "cpu": "500m",
      "memory": "2Gi"
    },
    "limits": {
      "cpu": "1500m",
      "memory": "3Gi"
    }
  }

We would like to use the messages in different ways and move towards something like:

                                                 Database X
                                              /   
                                             / 
Pubsub (Topic A) --> Worker (Subscription A) -- Database Y
                                             \
                                              \
                                                Pubsub (Topic B)

The only modification we made to the code is creating a new PubSubPublisher and publishing messages to the new topic. Every message from Topic A / Subscription A ends up in Topic B, so we are also effectively publishing ~3000 messages/second.

After adding that operation, CPU usage for the pods went through the roof: what 6 pods could handle at 80% CPU could not be handled by 20 pods at 100% CPU with the config noted above (we rolled the code back because the workers couldn't keep up).

The PubSubPublisher is created with very standard settings (I believe the retry settings are the defaults):

private void startPublisher() throws IOException {
    RetrySettings retrySettings = RetrySettings.newBuilder()
            .setTotalTimeout(Duration.ofSeconds(10))
            .setInitialRetryDelay(Duration.ofMillis(5))
            .setRetryDelayMultiplier(2.0)
            .setMaxRetryDelay(Duration.ofMillis(Long.MAX_VALUE))
            .setInitialRpcTimeout(Duration.ofSeconds(10))
            .setRpcTimeoutMultiplier(2)
            .setMaxRpcTimeout(Duration.ofSeconds(10))
            .build();

    publisher = Publisher.newBuilder(topic)
            .setRetrySettings(retrySettings)
            .setCredentialsProvider(FixedCredentialsProvider.create(
                    ServiceAccountCredentials.fromStream(
                            new FileInputStream(Configurations.getGoogleCloudCredentials()))))
            .build();
}
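One thing we have been wondering about is batching and threading. A hedged sketch (the builder names below are from `com.google.api.gax` as we understand them, and the threshold values are purely illustrative; the builder API has shifted across the beta releases, so this would need checking against the exact client version) of sharing a small executor and batching publishes:

```java
// Sketch only: with batching, ~3000 publishes/second become ~30 RPCs/second
// at 100 messages per batch, which should cost far less CPU per message.
ExecutorProvider sharedExecutor = InstantiatingExecutorProvider.newBuilder()
        .setExecutorThreadCount(4) // illustrative; the default scales with CPU count
        .build();

BatchingSettings batchingSettings = BatchingSettings.newBuilder()
        .setElementCountThreshold(100L)           // send after 100 messages...
        .setRequestByteThreshold(1024L * 1024L)   // ...or 1 MiB of payload...
        .setDelayThreshold(Duration.ofMillis(10)) // ...or 10 ms, whichever first
        .build();

publisher = Publisher.newBuilder(topic)
        .setExecutorProvider(sharedExecutor)
        .setBatchingSettings(batchingSettings)
        .build();
```

Whether the per-message RPC overhead explains the whole regression is something a profiler sample would have to confirm; we have not verified this configuration ourselves.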

The actual publish method is:

private void publish(byte[] message) {
    try {
        PubsubMessage pubsubMessage = PubsubMessage.newBuilder().setData(ByteString.copyFrom(message)).build();
        ApiFuture<String> future = publisher.publish(pubsubMessage);

        ApiFutures.addCallback(future, new ApiFutureCallback<String>() {
            @Override
            public void onFailure(Throwable throwable) {
                if (throwable instanceof ApiException) {
                    ApiException apiException = (ApiException) throwable;
                    logger.debug(new LogEntry().setMessage(String.format("PubSubException : ApiException. Code %s. IsRetryable ? %s",
                            apiException.getStatusCode().getCode(),
                            apiException.isRetryable())));
                }
                logger.warn(new LogEntry().setMessage("Error publishing message : " + Helper.getStackTrace(throwable)));
            }

            @Override
            public void onSuccess(String messageId) {

            }
        });
    } catch (Exception ex) {
        logger.error(new LogEntry().setMessage("An error occurred publishing a message to Pub/Sub : " + Helper.getStackTrace(ex)));
    }
}
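One thing we notice about this method is that it never applies backpressure: if Topic B slows down, publish futures and buffered messages can pile up without bound. A common pattern (a self-contained sketch only; `fakePublish` is a hypothetical stand-in for `publisher.publish`, and the cap of 100 is illustrative) is to bound the number of in-flight publishes with a `Semaphore`:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPublish {
    // Cap on outstanding (not-yet-completed) publishes; 100 is illustrative.
    private static final Semaphore inFlight = new Semaphore(100);

    // Hypothetical stand-in for publisher.publish(): completes asynchronously.
    static CompletableFuture<String> fakePublish(byte[] data, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> "id-" + data.length, pool);
    }

    // Publish n messages, never letting more than the cap be in flight at once.
    static int publishAll(int n) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger acked = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            inFlight.acquire(); // blocks the producer when the cap is reached
            fakePublish(new byte[16], pool).whenComplete((id, err) -> {
                inFlight.release(); // free a slot on success or failure
                if (err == null) acked.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return acked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("acked=" + publishAll(1000));
    }
}
```

This only bounds memory growth; we don't claim it is the cause of the CPU regression, just that it would rule out one failure mode while investigating.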

Converting our objects to byte arrays is plain Java serialization:

org.apache.commons.lang3.SerializationUtils.serialize(message);
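For reference, `SerializationUtils.serialize` is a thin convenience wrapper around Java's built-in object serialization, which is comparatively CPU-expensive per message; it is roughly equivalent to this stdlib sketch (the class and method names here are ours, not from our codebase):

```java
import java.io.*;

public class SerializeDemo {
    // Roughly what SerializationUtils.serialize does: stream the object
    // through an ObjectOutputStream into an in-memory byte buffer.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(512);
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Even a 5-char string carries a stream header plus type tags.
        byte[] bytes = serialize("hello");
        System.out.println("bytes=" + bytes.length);
    }
}
```

If serialization shows up hot in a profile, a binary format such as protobuf (which Pub/Sub messages already use on the wire) is usually much cheaper, but we would want a profiler to confirm before switching.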

Is such high CPU usage to be expected? Should queueing into Pub/Sub need that much processing power? What are we doing wrong for our publishers to consume so much?

We've seen this with different versions. We got these results with 0.39.0-beta on a box running CentOS 7. We use gRPC 1.10.0 in the project.
