
mvp vm attestation#1091

Open
jordanhendricks wants to merge 29 commits into master from jhendricks/rfd-605

Conversation

@jordanhendricks jordanhendricks commented Mar 27, 2026

closes #1067

TODO:

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Member:

most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency stuff we saw in Omicron comes up when building the TUF repo. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.

i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit, so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks (Author):

I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
api_runtime.block_on(async { vnc.halt().await });
}

// TODO: clean up attestation server.
Contributor:

This can be removed now?

Author:

done in 014950e

@hawkw (Member) left a comment:

Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Comment on lines 499 to 500
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
Member:

fwiw, i think the TODO is as easy as changing this to

Suggested change
// TODO: early return if none?
if let Some(vsock) = &self.spec.vsock {
// TODO: early return if none?
let Some(vsock) = &self.spec.vsock else { return; };

and then un-indenting everything else in the function basically.
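As a runnable sketch of that shape (with `Spec` and `describe` as hypothetical stand-ins for the PR's actual types, not the real `propolis-server` code):

```rust
// Hypothetical stand-in for the PR's spec type.
struct Spec {
    vsock: Option<String>,
}

fn describe(spec: &Spec) -> &str {
    // let-else performs the early return, so everything after it sits
    // at the function's top indentation level instead of inside an if-let
    let Some(vsock) = &spec.vsock else { return "no vsock" };
    vsock.as_str()
}

fn main() {
    assert_eq!(describe(&Spec { vsock: None }), "no vsock");
    assert_eq!(describe(&Spec { vsock: Some("cid:3".into()) }), "cid:3");
}
```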

Author:

done in 014950e

Member:

not super important but this string could be better probably

Author:

done in 014950e

Member:

turbo nit:

Suggested change
// table should be we sized this appropriately in testing, so

Author:

added a couple of commas here in 014950e

Comment on lines +707 to +728
// In a rack we only configure propolis-server with zero or
// one boot disks. It's possible to provide a fuller list,
// and in the future the product may actually expose such a
// capability. At that time, we'll need to have a reckoning
// for what "boot disk measurement" from the RoT actually
// means; it probably "should" be "the measurement of the
// disk that EDK2 decided to boot into", but that
// communication to and from the guest is a little more
// complicated than we want or need to build out today.
//
// Since as the system exists we either have no specific
// boot disk (and don't know where the guest is expected to
// end up), or one boot disk (and can determine which disk
// to collect a measurement of before even running guest
// firmware), we encode this expectation up front. If the
// product has changed such that this assert is reached,
// "that's exciting!" and "sorry for crashing your
// Propolis".
panic!(
"Unsupported VM RoT configuration: \
more than one boot disk"
);
Member:

what is the rationale for making this a panic rather than a MachineInitError? would that be easier to debug if this was hit someday later?

@jordanhendricks (Author) commented Apr 3, 2026:

IIRC from our discussions, this was such a "this should never happen" scenario that panicking seemed more appropriate. AIUI, if we have more than one boot disk specified, this isn't a user error that should get propagated back to them. I don't remember off the top of my head what happens with a MachineInitError, but I imagine that gets translated to an internal propolis error? Whereas in this circumstance I think it's a bug outside of just propolis.

Member:

Yeah, I think that's fair --- I was just kind of vaguely wondering what would surface the error more obviously if someone was changing things around here.
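The alternative being weighed could look roughly like this; `MachineInitError` here is a hypothetical stand-in enum for illustration, not the PR's actual error type:

```rust
// Hypothetical stand-in for the PR's MachineInitError; sketches the
// "return an error" alternative to the panic discussed above.
#[derive(Debug)]
enum MachineInitError {
    MultipleBootDisks(usize),
}

fn check_boot_disks(count: usize) -> Result<(), MachineInitError> {
    if count > 1 {
        // surfaces as an init failure instead of crashing Propolis
        return Err(MachineInitError::MultipleBootDisks(count));
    }
    Ok(())
}

fn main() {
    assert!(check_boot_disks(0).is_ok());
    assert!(check_boot_disks(1).is_ok());
    assert!(matches!(
        check_boot_disks(2),
        Err(MachineInitError::MultipleBootDisks(2))
    ));
}
```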

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Member:

maybe this should explicitly state that this means it will not be attested?

Author:

took a crack at this in 014950e

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
    /// Do any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if let Err(_) = send_res {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Member:

Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.
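The suggested shape, sketched with std threads so it runs without tokio (a `tokio::task::JoinHandle<T>` yields its value the same way, via `.await` instead of `.join()`); `VmInstanceConf` and `build_conf` here are hypothetical stand-ins, not the PR's actual items:

```rust
use std::thread;

// Hypothetical stand-in for the PR's VmInstanceConf.
#[derive(Debug, PartialEq)]
pub struct VmInstanceConf {
    pub uuid: u32,
    pub boot_digest: Option<String>,
}

// Instead of a task that sends over a oneshot and returns (), spawn
// work whose JoinHandle yields the config directly.
pub fn build_conf() -> VmInstanceConf {
    let handle: thread::JoinHandle<VmInstanceConf> = thread::spawn(|| {
        // ... compute the boot digest here ...
        VmInstanceConf { uuid: 42, boot_digest: Some("sha256:...".into()) }
    });
    // The consumer joins the handle to get the value; no channel and no
    // tri-state enum tracking preparing/initializing/running.
    handle.join().expect("init task panicked")
}

fn main() {
    let conf = build_conf();
    assert_eq!(conf.uuid, 42);
    assert!(conf.boot_digest.is_some());
}
```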

Author:

That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.

Member:

Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?

Comment on lines +322 to +339
let vm_conf = Arc::new(Mutex::new(None));

let log_ref = log.clone();
let vm_conf_cloned = vm_conf.clone();
tokio::spawn(async move {
match vm_conf_recv.await {
Ok(conf) => {
*vm_conf_cloned.lock().unwrap() = Some(conf);
}
Err(_e) => {
slog::warn!(
log_ref,
"lost boot digest sender, \
hopefully Propolis is stopping"
);
}
}
});
Member:

actually, upon looking slightly closer at things, it kinda feels like all of this stuff with the vm_conf_recv oneshot and the task that waits for that and then stuffs it into a mutex could all be avoided if vm_conf was a tokio::sync::watch channel...AFAICT, once it gets stuck in there once, it's never mutated again, so does it really need to be a Mutex for all time?
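Since the value is written exactly once and never mutated, a write-once cell can stand in for the `Arc<Mutex<Option<_>>>` plus relay task. This is a std-only sketch using `OnceLock` so it runs without tokio; `tokio::sync::watch`, as suggested above, is the async-aware equivalent, and `publish_conf` is a hypothetical name:

```rust
use std::sync::{Arc, OnceLock};
use std::thread;

// Sketch: the initializer publishes the config exactly once; readers
// never take a Mutex. OnceLock is the std stand-in here for the
// tokio::sync::watch channel suggested in the review comment.
pub fn publish_conf() -> Arc<OnceLock<String>> {
    let vm_conf: Arc<OnceLock<String>> = Arc::new(OnceLock::new());

    let writer = Arc::clone(&vm_conf);
    let t = thread::spawn(move || {
        // set() succeeds only for the first writer
        writer.set("boot-digest".to_string()).expect("set only once");
    });
    t.join().unwrap();
    vm_conf
}

fn main() {
    let conf = publish_conf();
    // get() returns None until set() has run, Some(..) forever after
    assert_eq!(conf.get().map(String::as_str), Some("boot-digest"));
}
```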

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Author:

I still need to do this and test on dublin.

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! TODO: block comment
Author:

in progress



Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants