EC2 Bare Metal Instances with Direct Access to Hardware (amazon.com)
385 points by jeffbarr on Nov 29, 2017 | hide | past | favorite | 127 comments


I'm really, really happy about this. I've been complaining about the lack of cloud servers with exposed performance counters to any cloud vendor that'll listen (though of course nothing ever came of that). Kudos AWS, this is really cool.


Thanks! Would love to hear more about the counters that you're interested in. We've exposed more in C5 than in previous instance types, and we are trying to make more available over time in a safe way.


I have two use cases:

- General performance analysis. For this, more counters are generally incrementally better.

- Running https://github.com/mozilla/rr. This requires the retired-branch-counter to be available (and accurate - sometimes virtualization messes that up)

The second one I actually care more about, because I've pretty much stopped trying to debug software when rr is not available, too painful ;). Feel free to email me (email is in my profile) for gory details.


For the benefit of anyone reading this, KVM and VMWare virtualization generally work. Xen has problems because of a stupid Xen workaround for a stupid Intel hardware bug from a decade ago. I can provide more details about that via email (in my profile) if desired.


Seconding paulie_a. We're running a Xen stack right now and I haven't heard of this. We've worked around a few nasty bugs with Xen and Linux doms already, but I'm wondering if we have this problem you're referring to and don't even know it.


Can you please just post the info? Intel deserves to be shamed.


One of the things the performance monitoring unit (PMU) is capable of doing is triggering an interrupt (the PMI) when a counter overflows. When combined with the ability to write to the counters, this lets you program the PMU to interrupt after a certain number of counted events. Nehalem supposedly had a bug where the PMI fires not on overflow but instead whenever the counter is zero. Xen added a workaround to set the value to 1 whenever it would instead be 0. Later this was observed on microarchitectures other than Nehalem, and Xen broadened the workaround to run on every x86 CPU. Intel never provided any help in narrowing it down, and there don't seem to be official errata for this behavior either.

This behavior is OK for statistical profiling of frequent events, but if you depend on exact counts (as rr does) or are profiling infrequent events it can mess up your day.

https://lists.xen.org/archives/html/xen-devel/2017-07/msg022... goes a little deeper and has citations.
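To make the overflow-driven sampling concrete: you pre-load the counter so that it overflows after the desired number of events. A minimal sketch of that arithmetic, assuming a 48-bit counter (common on Intel PMUs); the function names are illustrative:

```python
# Pre-load arithmetic for a PMU counter that fires the PMI on overflow.
COUNTER_BITS = 48
COUNTER_MAX = 1 << COUNTER_BITS

def preload_for_period(period):
    """Value to write to the counter so it overflows (and fires the PMI)
    after `period` more counted events."""
    assert 0 < period < COUNTER_MAX
    return COUNTER_MAX - period

# The Xen workaround described above: a counter value that would be 0 is
# bumped to 1. Harmless for statistical sampling, but it skews the exact
# event count by one -- which is exactly what trips up rr.
def xen_workaround(value):
    return 1 if value == 0 else value

print(preload_for_period(1000))  # counter start for a 1000-event period
print(xen_workaround(0))         # the off-by-one rr trips over
```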


Is this what khuey is referring to?: https://support.citrix.com/article/CTX136003


I'm assuming rr is only unavailable for multithreaded apps? How frequently is rr available for your use?


rr works fine on multithreaded (and multiprocess) applications. It does emulate a single core machine though, so depending on your workload and how much parallelism your application actually has it might be painful.


Scaleway and packet.net both offer bare metal.

The AWS machines look to be huge... hence, high cost.


BlueMix/SoftLayer does as well, costs are somewhat competitive with packet.net. No idea how scaleway is making money.


Even though they are billed hourly, the deployment times (hours, last time I checked) make it not a real replacement for cloud servers. Scaleway servers deploy in seconds and packet.net's in minutes.


Hrm, I tinkered a bit about a year ago and it was ~10 mins.


EC2 has actually exposed a subset even on the Xen instances for some of the more recent instance types.

Brendan Gregg wrote about them at http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2....


Packet.net, arguably the leader in API-driven, on-demand bare metal instances, recently blogged about this: https://www.packet.net/blog/why-we-cant-wait-for-aws-to-anno...

I am a customer of packet's, along with other virtual and dedicated hosting providers. I don't use aws ec2. I've been pleased with Packet, and their offerings are much more diverse than this initial offering from aws.


I just now took a look at Packet's web site and their data center locations. They categorize each location as either "core" or "edge", but I couldn't find anything to indicate what those terms mean in this context. Are you familiar with that distinction?

The location nearest me is an "edge", not a "core". I wonder what I would be missing out on, if it's not "core".


Edge DCs only have one type of machine (1E) and no block storage, but I think are otherwise the same.

Even in core DCs though the availability of different types of machines varies.

Love packet.net btw -- the bgp stuff is really game changing.


mcrae covered this well, but to add to that, the Edge servers are targeted for low-latency services close to users: think self-driving cars, IoT, adtech, etc. You can see more details at https://www.packet.net/edge/ and https://www.packet.net/blog/looking-over-the-edge/.


Also a happy Packet customer. We use their small instances for things like service monitoring (where VM pauses cause false positives) and for routing infrastructure where bare metal is required to achieve VoIP-acceptable jitter. They’re also one of the few cloud hosting providers to support BGP.


Scaleway.com also offers baremetal servers at a really attractive price. The CLI is just awesome, it's great to see other cloud providers joining the game.


it's just a lot harder to automate reliably without any form of userdata.

On packet.net I run CoreOS and provision with userdata specifying what Docker image to run... that's all it takes to have immutable deployment :)
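A minimal Container Linux cloud-config userdata of that shape might look like this (the unit name and image name are hypothetical):

```yaml
#cloud-config
coreos:
  units:
    - name: app.service
      command: start
      content: |
        [Unit]
        Description=Run the app container
        After=docker.service
        Requires=docker.service

        [Service]
        ExecStartPre=-/usr/bin/docker rm -f app
        ExecStart=/usr/bin/docker run --name app example/app:latest
        Restart=always
```

The host stays untouched; replacing the deployment means replacing the machine with new userdata, which is what makes it immutable.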


... not really, it just requires different infrastructure/systems.


You end up having to build your own moving parts...

Building a high-availability metadata store is not easy. And ensuring that incoming request IPs aren't spoofed is a little non-trivial to reason about.

UserData is a good way to provide a one-time token that can be used to fetch data...

Using SSH for provisioning is just plain dirty... and almost impossible to do reliably... You'll need global locks and timeouts to recover in case one of your masters crashes... Plus some garbage collection to clean up things that were not fully provisioned.

This is a LOT of unreliable state to manage. And a ton of corner cases. Having the right architecture matters for reliable automation.


Impressive hardware, but I wonder what the cost will be, considering even the regular EC2 VMs are generally more expensive than the dedicated offerings of other providers.


Head-to-head cost comparison of Amazon's AWS, IBM's Softlayer, Hetzner and Google's ComputeEngine on a machine learning benchmark:

https://rare-technologies.com/machine-learning-hardware-benc...


Interesting, but the really big difference comes when you need to push data out to end users...network egress charges.


True! Not with machine learning though. The target of this benchmark is hardcore number-crunching.


But... that's not using GPUs? What use is a ML benchmark that is only measuring CPU usage?


It is (using GPUs).


Then it needs to be rewritten, because it is impossible to tell what the machine specs are (none of them have GPU specs listed) and there is no documentation on which tests are run under the GPU and which are not. The repos contain vague information on GPU options, but there is no information on what was used in the tests.

There is nothing in this article that has any information on GPUs. It doesn't even list the actual machine instances used (would not the AWS tier name be useful here, for example?).


I guess you'll have to wait for the next part.


It has similar specs to the i3.16xlarge, so it will probably be priced similarly ($5/hour).

i3.16xlarge: 64 vCPUs, 488 GB RAM, 8 x 1900 GB NVMe SSD

i3.metal: 72 hyperthreads, 512 GB RAM, 15 TB disk

I wonder if this is the hardware for the host of the i3 series.


It's still the last generation of Xeons, but it's a beast.


What differentiates these from dedicated boxes in server rack? Is their dedicated "cloud" hardware somehow managing access to RAM/storage/etc?

On another tangent - how do Google Cloud and EC2 attach GPUs to instances - given that you can choose CPU and RAM the GPUs must somehow be modularized away from a dedicated server?


You can provision these servers just like any other instance. They work just like any other Amazon EC2 instance (same Nitro System platform as C5).

Disclaimer: I work at AWS on the team responsible for the Nitro System including EC2 Bare Metal Instances.


Is there any information about Nitro or ENA (assuming this is the "hardware accelerators" mentioned in TFA) that is publicly available? It seems like the most nifty little thing.


Here are videos from break out sessions yesterday at re:Invent https://www.youtube.com/watch?v=LabltEXk0VQ https://www.youtube.com/watch?v=o9_4uGvbvnk


There is more coming at re:Invent. We have more talks queued up tomorrow on this too.


I'll keep an eye out, thanks


how many NICs can you attach to these?


15 Elastic Network Interfaces (ENIs) can be attached, just like i3.16xlarge.


Any idea if TPM is available on the bare metal instances?


> how do Google Cloud and EC2 attach GPUs to instances - given that you can choose CPU and RAM the GPUs must somehow be modularized away from a dedicated server?

Rack A of servers has a base_server_x. Rack B of servers is base_server_x + GPU_Y.

You ask for no GPU, you get a server from rack A. You ask for a GPU, you get a server from rack B.

No magic monkey adding GPUs to instances ;)


Oh there's magic monkeys too!: https://aws.amazon.com/ec2/elastic-gpus/


They can exist in a VPC alongside virtualized resources, attach EBS volumes (due to having the ENA onboard), and are integrated with CloudWatch, etc.


I assume you can spin up new instances easily, and they’ll have per-hour billing. You don’t usually get that with dedicated servers.

It sounds a bit like MAAS [1], which lets you throw images onto real servers and manage them easily, very much like you might spin up VMs on AWS.

[1] https://maas.io


These are billed the same as any other EC2 instance: per second.


Just to point out, per second billing is incredibly recent for EC2:

https://aws.amazon.com/about-aws/whats-new/2017/10/announcin...
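As a quick sketch of what per-second granularity means in practice (the $5/hour rate is hypothetical, and the announcement notes a one-minute minimum charge):

```python
# Per-second billing: the hourly rate is prorated to the second,
# with a one-minute minimum charge per the EC2 announcement.
def cost(hourly_rate, seconds):
    billed = max(seconds, 60)  # one-minute minimum
    return hourly_rate * billed / 3600

print(cost(5.0, 90 * 60))  # 90 minutes at $5/hr -> 7.5
print(cost(5.0, 10))       # under a minute still bills 60 seconds
```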


Networking and other managed services that are integrated into AWS.


So you could magic up a virtual machine that spans more hardware than would typically be in a single, enclosed chassis?


No, that isn't possible for VMs or bare metal.


Actually, it is possible, but not on Amazon's cloud. Check out www.scalemp.com - aggregating a collection of nodes into a large, single VM.


I'm looking forward to testing out FreeBSD on these... and also bhyve, for a fully BSD virtualization stack.


With them leaning bare-metal and low-cost, I wonder if services like these could be used to bootstrap clouds in VAR form for niche OSes. Might be useful at the least for getting bugs out of the virtualization software using diverse workloads. If costs are kept minimal, it might even be profitable if the niche OS has enough users.


> Storage – 15.2 terabytes of local, SSD-based NVMe storage.

That's probably the most interesting aspect for me.

Does anyone know how that's provisioned? i.e. 8 x just-under-2 TB volumes, or something else?


It's exactly the same as with the i3.16xlarge instance type. There are eight 1900 GB drives. In an i3.16xlarge, those eight drives are passed through to the instance with PCIe passthrough but for the i3.metal instance, you avoid going through a hypervisor and IOMMU and have direct access.


Thanks.

I guess some other open questions:

- If one of those drives fails, will Amazon hot-swap them out, or do you need to migrate to a new instance? (Moving TBs of data to a new box without causing outages can be painful.)

- Is there a hardware RAID controller for those drives, or is it software only?

- Can anyone with access to one of these boxes produce some IO performance stats on them? Bonus points for stats on single drive vs concurrent across all drives (i.e is there any throttling). More points for RAID10 performance across the whole 8.


The local NVMe storage for i3.metal is the same as i3.16xlarge. There are 8 NVMe PCI devices. For i3.16xlarge those PCI devices are assigned to the instance running under the Xen hypervisor. When running i3.metal, there simply isn't a hypervisor and the PCI devices are accessed directly.

- There is no hot swap for the NVMe storage.

- The 8 NVMe devices are discrete; there is no hardware RAID controller.

- Anyone can get I/O performance stats on i3.16xlarge as a baseline. Intel VT-d can introduce some overhead from the handling (and caching) of DMA remapping requests in the IOMMU and interrupt delivery so I/O performance may be a bit higher on i3.metal, with a few microseconds lower latency.
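Since there's no hardware RAID controller, striping or mirroring across the eight devices would be software RAID (e.g. mdadm). A rough usable-capacity sketch over the 8 x 1900 GB devices, assuming RAID10 as striped mirror pairs:

```python
# Usable-capacity math for software RAID over the 8 x 1900 GB NVMe
# devices (no hardware controller, per the answer above).
DRIVES = 8
DRIVE_GB = 1900

def usable_gb(level):
    if level == "raid0":   # pure striping, no redundancy
        return DRIVES * DRIVE_GB
    if level == "raid10":  # striped mirrors: half the raw capacity
        return DRIVES * DRIVE_GB // 2
    raise ValueError(level)

print(usable_gb("raid0"))   # 15200 GB -- the full 15.2 TB
print(usable_gb("raid10"))  # 7600 GB usable with redundancy
```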


For all this progress, the billing on AWS is so damn confusing to figure out if some machine is left on unused that I won't use AWS again. GCE and Azure are miles ahead here.


It takes all of 3 clicks to figure this out using Cost Explorer.


Why did AWS support send me a 5,329-word message on how to check every region and service, then?


Cost Explorer takes [up to] 24 hours to set up, so it's not a good answer to support questions about billing.


Maybe spend a few more minutes clicking through the app's billing section before firing off a support request next time :)


AWS is incredibly complex. Are you complaining that their billing can get complex?


So if it’s truly bare, how does Amazon give and take control of the machine for provisioning? Don’t they still need some kind of hypervisor?


Most servers have some sort of "lights out" management, which gives KVM (keyboard/video/mouse) plus remote imaging and BIOS control.

With Amazon, they have complete control over the network in and out, so cutting you off and re-imaging a server is pretty trivial.

To be fair, it's not that hard to do even if you're not Amazon.

Most of the big server vendors' out-of-band interfaces have an API, so telling a server to reboot from a network image is pretty trivial. Providing a netboot infrastructure to install images with a 'userdata' script is also not that difficult.

You'll need a DHCP server, TFTP to serve the boot image, and usually an NFS server to pull the rest of the image over. With some engineering work that could be made to use HTTP.

https://wiki.centos.org/HowTos/NetworkInstallServer
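For reference, a minimal dnsmasq configuration covers the DHCP and TFTP pieces in one daemon; something like the following (addresses and paths are illustrative):

```conf
# dnsmasq.conf sketch: DHCP + TFTP for PXE netboot (illustrative values)
dhcp-range=192.168.1.100,192.168.1.200,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp
```

Clients PXE-boot pxelinux.0 from the TFTP root, and the bootloader config then points at the kernel/initrd and the install source (NFS or HTTP).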


It's a bit harder if you host something like this for the general public to use (vs administrating machines in your private DC). Normal setups aren't really hardened against someone flashing firmware, messing with UEFI, ..., all of which mean you can't entirely trust a machine coming back from customer control. I wouldn't be surprised if Amazon took this seriously and invested effort in stopping such things. At their scale, they probably can customize the hardware enough.


Everyone who sells bare metal as a service takes this seriously. As AWS builds their own hardware, especially in these newer machines, I would guess that it's not possible to flash firmware from the user machine, only from the control node.


EC2 Bare Metal instances boot from an EBS volume that is accessed via a NVMe PCI device (implemented in ASICs built by Annapurna Labs), just like virtualized C5 instances.


Why would you boot what is essentially an iSCSI target via NVMe?


NVMe is just how the storage is surfaced -- the hardware programming interface for the block device. Hardware iSCSI initiators (HBAs) also have a hardware programming interface, but at the end of the day you talk SCSI over that interface.

NVMe is a better match for the storage operations supported by EBS. A bonus is that by surfacing EBS over NVMe there is a common storage interface for both managed storage volumes and local NVMe storage.


So is that a hardware cache, a software cache, or a storage tier?


The NVMe interface to EBS volumes that is implemented in our Nitro system today is closest to an HBA with no additional caching. So, a storage tier.


Maybe they use Intel's Management Engine!


The E5-2686 v4 doesn’t have IME.


The CPU itself doesn't, but the chipset does.

Amazon is a big enough customer that it wouldn't surprise me if they could get Intel to make special ME firmware with the features they want in it.

...which also means the ME exploits discussed recently here could lead to a whole lot more fun... ;-)



I wonder what it will cost - as of now I don't see it in the price list.


Pricing will come with general availability. I suspect (and hope) everyone will be surprised and happy with the price.


This is really good news, happy to see this is an option now.

And thanks for posting this here personally @jeffbarr.


Any time!


Why fancy words when they are just offering regular dedicated servers?


These were my exact thoughts. I suppose it's almost like a step back from the framework of "virtualize everything"... what's old is new again.

Addon thoughts: nonetheless, the specs on the bare metal box are ridiculous. Buying something like that will cost you $50k (someone correct me?) - then you need to find a place to host it... that's not easy to do.


Because they're still virtualizing literally everything but the actual computer. You can attach NVMe-backed EBS volumes, snapshot them as normal, etc. You can have this thing exist in a VPC next to your virtualized components, with a 25 Gbps dedicated link. They're virtualizing the things you shouldn't need to care about, leaving you with a bare CPU and access to all the things that make AWS AWS.


Regular dedicated servers don't have VPC, EBS, etc.


Accounting question: Might this qualify as a capital expense? If you squint hard enough?

For context, AWS is coy (at least publicly) about the existing dedicated instances and CapEx vs. OpEx.


The information found here should help your finance team or accountants determine how best to classify your expenses: https://aws.amazon.com/ec2/dedicated-hosts/faqs/#Should_I_Co...

Since EC2 Bare Metal instances will use the same pricing models as all other EC2 instances (on demand, reserved instances, dedicated host, spot), the same information is relevant.


For the UK its always opex. As you never own the instance, you are hiring it as a service.


Will there be smaller instances available eventually? I'm interested in bare metal performance but I don't need an instance that huge for my current workload.


Our goal is for the majority of virtualized EC2 instances to be indistinguishable from bare metal (if not better). In most CPU- and memory-intensive benchmarks there is very little difference between a virtualized EC2 instance and bare metal, especially for smaller numbers of cores and memory sizes.


So now you can rent a dedicated server on AWS, which is nice.


AWS already had dedicated instances, but they still had a VM running on top. These are bare metal, which means you run directly on the hardware.


Interesting. I expect we'll be seeing a lot more VPS providers running on AWS with these instance types.


This will expose virtualization? As in I can run my own virtualization stack on these instances (KVM, etc)?


EC2 Bare Metal instances provide all the typical Intel processor features, including VT-x and VT-d. Yes you can use KVM.


Seems like this would be better for container farms, depending on cost.


Seems like a good way to build your own version of Joyent Triton, running containers without VMs.


including hypercontainer which uses KVM for hardware isolation with low overhead...


What is the price? I can't find it.


From _msw_ in this thread: "Pricing will come with general availability. I suspect (and hope) everyone will be surprised and happy with the price."


Awesome stuff! Great to see AWS pushing things around the bare metal problem set.


Isn't this just regular web hosting?


Not quite: this is cloud-provisioned, so you can do things like supply your own image, and it integrates with all the other AWS services like virtual machines do. Provisioning is automated and self-serve. Also per-second billing, which you couldn't get in the olden days with hosting.


Thanks Jeff!


I think Amazon is exposing themselves to far greater security risks than they realize.


Like what?


Blackhats, state actors, etc all trying to attack Amazon or colocated services. As an example (I don't know the extent of "bare metal" access, so I couldn't be sure) with the ability to run their own operating system, a client could potentially get all the way down to the NIC to form arbitrary network packets. With this they could potentially map and attack Amazon's internal network protocols (routers, etc). Any kind of vulnerability within Amazon's software stack on other servers now gets a whole lot worse. If the client did this at a very low rate, it would be difficult to detect. Firewalling off these servers only helps so much, since they could still attack colocated servers of other clients, or could potentially spoof the protocol of Amazon's own server management.

I hope they have thought this through carefully, because it potentially exposes everyone on EC2 to more, potentially worse, attacks.


The NIC that is used by EC2 Bare Metal instances is an Elastic Network Adapter (ENA) PCI device that surfaces a logical VPC Elastic Network Interface. ENA is implemented in an ASIC that we design and build.

When ENA is used in virtualized instances, Intel VT-d and SR-IOV are used to bypass the hypervisor. When ENA is used in a bare metal instance, the OS simply has direct access to the PCI device. In either case the device is a controlled surface, and VPC software defined networking deals with verifying and encapsulating network traffic.


It's all really cool that you design and build your own NICs. They are probably awesome tech designed by really smart people.

But how many hundreds of millions of lines of code are on these systems, roughly? Ballpark estimate.


You have the whole machine. There are no other colocated VMs.


That is the issue.

You have full access to the machine, so you can update firmware / tinker with the BIOS etc.

Then let the machine go back into the pool, and wait for it to dial home.

There is some mitigation, but this is a major reason a lot of vendors do not do per-second / per-minute bare metal.


You cannot update firmware / tinker with BIOS etc. I will cover this in a breakout session at re:Invent today: https://www.portal.reinvent.awsevents.com/connect/search.ww#...


Cool - I would be interested in seeing how that mitigation was done.

Is this going to be available online afterwards, or is it just an in person breakout?


OT: Job advice needed (because I think many back-end devs will be here :D)

I'm thinking about going full-stack next year. I have a bit of experience building APIs besides being mainly a front-end developer.

Is going "cloud only" a good idea? I thought about starting with AWS Lambda, S3, DynamoDB and the Serverless framework.

Are the providers hugely different or is it a good idea to spread out and do some Azure and GCP too?


That's completely off topic. In fact, the question is so broad that I cannot think of anyplace other than the water cooler or Quora to ask it.

Career advice: Never go "foobar-only". Make an effort to learn "foobar" but understand whatever is one layer below it in the stack. Want to go "cloud-only"? Learn OpenCloud, not AWS.


lol, while complaining you still gave me a decent answer, thanks :)


Of course, we're all here to help each other! Good luck!


It's definitely worthwhile to learn Lambda, S3 and serverless apps but all that stuff can be learned on the job. S3 is especially easy to use for most use-cases and any decent programmer can learn to use it in an hour or two.

However, I would definitely learn a SQL dialect and learn how RDBMSes such as Postgres work (especially what is meant by ACID) as most companies are based around a database. Don't believe the hype - SQL is not dead. Dynamo is a great technology but there are many problems it can't solve for you.

Finally, I personally don't know Azure or GCP so well. Only knowing AWS in-depth hasn't held me back so far. I've used a few of Azure's services but I've never built a serious app on it.

My recommendation is to not really worry about individual technologies and to focus on safely handling and working with data.
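On the ACID point: atomicity is the easiest property to see in action. A small sketch with Python's built-in sqlite3 module, showing that a failed transaction rolls back all of its changes:

```python
import sqlite3

# Atomicity demo: both legs of a transfer commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure
except RuntimeError:
    pass

# Alice's debit was rolled back along with the failed transaction.
balance = conn.execute("SELECT balance FROM accounts "
                       "WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```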


Yes.

I learned React and later React-Native. Selling myself as a "mobile consultant" then worked fine, nobody cared "how" I made these mobile apps.

My idea was the same with back-end, learning some framework and start selling myself as "mobile cloud consultant" or something, with the hopes that clients also don't care "how" I create these cloud back-ends.

I know SQL, worked most of my time with RDBMSs, so this wouldn't be big of an issue. As I said I already did a few back-ends, but my focus was on front-end, usability and such.

I just mentioned DynamoDB because I had the impression that it was "the AWS DB", do they offer an SQL service besides Redshift?


Yes, they have RDS: https://aws.amazon.com/rds/

It allows you to launch many common database engines, which are managed and backed up by AWS. I've been using it for a few years and for my use-case it's great.


lol, I always had the impression RDS was the AWS Redis :D


Elasticache is the AWS Redis.


Yes, AWS has a product called RDS, where you can choose an RDBMS such as MySQL, PostgreSQL, Aurora (a MySQL-compatible database), and so on.

DynamoDB is a NoSQL database.


I knew that DynamoDB is a NoSQL DB; I thought with the NoSQL hype and everyone doing MongoDB/RethinkDB back-ends now, they would simply say "In the cloud you have to use this and that's it."

RDS somehow sounded like the Redis service of AWS, hehe.


AWS offers hosted Redis/Memcache via ElastiCache


Learning your way around cloud services is a great idea, but I would be hesitant about starting with Lambda and Serverless, or doing only that. It's somewhat of a different paradigm, kind of a back-end for front-end developers, or at least people who don't want to deal with infrastructure. While that is a great thing, I think there is value in understanding what a more traditional web server on AWS looks like, with an EC2 instance, EBS volumes, AMIs, security groups, a load balancer, SSH access, etc.



