PXE is a great example of a topic that turns up a ton of search results but very little helpful content. Search for “PXE configuration” or “PXE troubleshooting” and you’ll find that the majority of posts focus on the same thing: a few DHCP options that “must” be set in order for PXE to work. Admittedly that’s how we had PXE set up until recently, but an upgrade to our imaging software forced us to revisit this configuration, make changes, and learn quite a bit along the way.
We use Specops Deploy to image PCs and recently upgraded to their latest version to support the Windows 8.1 tablets we’re beginning to test. Deploy builds on standard Microsoft deployment tools, so pretty much everything in this post should apply regardless of the imaging solution you’re using. Besides, PXE is primarily BIOS-dependent as we’ll see later.
First a little background…
If enabled in the BIOS, PXE is part of the PC’s startup sequence. It uses DHCP, the same process by which devices are dynamically assigned IP addresses, to obtain information about booting from a network location.
The basic DHCP process takes four steps:
- The client sends out a DHCP Discover broadcast.
- Each DHCP server that hears the broadcast responds with a DHCP Offer, which includes the IP address that server will provide the client.
- The client sends a DHCP Request broadcast, indicating the IP address it has selected.
- The DHCP server selected sends a DHCP ACK to the client, acknowledging the client has accepted the IP address.
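To make the first step concrete, here is a minimal sketch of what a DHCP Discover payload looks like on the wire. It only builds the bytes and does not send anything. The function name `build_discover` and the sample MAC address and transaction ID are made up for illustration; the field layout follows the standard BOOTP/DHCP header.

```python
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the options field
OPT_MESSAGE_TYPE = 53                    # DHCP Message Type option
OPT_END = 255                            # End option
DHCPDISCOVER = 1                         # message type value for Discover


def build_discover(mac: bytes, xid: int) -> bytes:
    """Return a minimal DHCPDISCOVER payload (BOOTP header plus options)."""
    if len(mac) != 6:
        raise ValueError("expected a 6-byte Ethernet MAC address")
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST (client to server)
        1,            # htype: Ethernet
        6,            # hlen: length of the MAC address
        0,            # hops
        xid,          # transaction ID chosen by the client
        0,            # secs
        0x8000,       # flags: ask servers to reply by broadcast
        b"\x00" * 4,  # ciaddr: client has no address yet
        b"\x00" * 4,  # yiaddr: filled in by the server in its Offer
        b"\x00" * 4,  # siaddr
        b"\x00" * 4,  # giaddr: filled in by a relay, if any
        mac,          # chaddr: struct pads this to 16 bytes
        b"",          # sname: 64 zero bytes
        b"",          # file: 128 zero bytes
    )
    options = bytes([OPT_MESSAGE_TYPE, 1, DHCPDISCOVER, OPT_END])
    return header + DHCP_MAGIC_COOKIE + options


# Hypothetical MAC address and transaction ID, for illustration only.
packet = build_discover(bytes.fromhex("001122334455"), 0x12345678)
```

A real PXE client adds many more options to this packet, but the fixed 236-byte header followed by the magic cookie and a list of options is the same shape you will see in a capture.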
Important Note: As mentioned, DHCP relies on broadcasts, which by definition do not traverse VLANs or subnets. If your DHCP servers are on a different VLAN from your clients, your router will need to be configured as a DHCP relay; on Cisco equipment this is done with the ip helper-address command. This will come into play for PXE as well.
Anyway, that’s the basic DHCP traffic flow, but it’s important to recognize that the initial DHCP Discover includes requests for quite a few parameters, as shown below via a Wireshark capture. Note options 60, 66, and 67: Vendor Class Identifier, TFTP Server Name, and Bootfile Name.
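If you would rather check this in a script than by eye, the options field of a DHCP packet is a simple list of code/length/value entries. The sketch below parses that list and reports which of options 60, 66, and 67 the client asked for in its Parameter Request List (option 55). The function names and the sample bytes are my own, hand-built for illustration rather than copied from a real capture.

```python
from typing import Dict, List

OPT_PAD = 0
OPT_PARAMETER_REQUEST_LIST = 55
OPT_END = 255

OPTION_NAMES = {
    60: "Vendor Class Identifier",
    66: "TFTP Server Name",
    67: "Bootfile Name",
}


def parse_options(data: bytes) -> Dict[int, bytes]:
    """Parse the DHCP options field (the bytes after the magic cookie)."""
    options: Dict[int, bytes] = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:          # single byte, no length field
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


def requested_pxe_options(options: Dict[int, bytes]) -> List[int]:
    """Return which of options 60, 66, 67 appear in the Parameter Request List."""
    requested = options.get(OPT_PARAMETER_REQUEST_LIST, b"")
    return [code for code in sorted(OPTION_NAMES) if code in requested]


# Hand-built sample: message type Discover, then a request list of 1, 3, 60, 66, 67.
sample = bytes([53, 1, 1, 55, 5, 1, 3, 60, 66, 67, 255])
for code in requested_pxe_options(parse_options(sample)):
    print(code, OPTION_NAMES[code])
```

Note the difference between a client requesting an option (its code shows up inside option 55) and a server actually sending it (the option appears on its own in the Offer). That distinction matters for everything that follows.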
The typical “recommended” configuration for PXE requires explicitly setting these three options (again: 60, 66, and 67) so that your clients receive this information directly from your DHCP server. And sure enough, if they are set a certain way, PXE will work in some (perhaps most) cases. But this is not the proper way to set up PXE, and doing so is not supported by Microsoft as per KB #259670. There are a few reasons this is not the right way to do things. First of all, by explicitly setting the PXE server in this fashion, you eliminate the option of having redundant PXE servers. Additionally, in some cases if your PXE server is down, your client PCs may hang, either briefly or indefinitely, while looking for it. The biggest problem though, and the one that started this whole process for us, is that by specifying a boot file name in option 67, you eliminate your PXE server’s ability to dynamically determine which boot file it serves to a client.
The first indication we had that anything was wrong with our PXE setup was that the new tablets we were testing (Dell Venue 11 5130s, which are Atom-based devices) would not PXE boot. Turns out they need a UEFI boot file, which is different from the boot file everything else on our network uses. Explicitly setting option 67 caused the tablets to look for the standard boot file we’d been using, and since that was not a valid boot file for the tablets, they could not find a boot device through PXE. My first attempted fix was just to remove option 67, but evidently 60, 66, and 67 all need to be set if you’re going to use the explicit method.
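To illustrate what “dynamically determine which boot file” can mean, here is a rough sketch of one way a boot server could make that choice. It is not how Specops Deploy or the Microsoft tools actually do it; I don’t know their internals. It assumes the client’s vendor class identifier follows the PXE convention of “PXEClient:Arch:xxxxx:UNDI:yyyzzz”, and the architecture numbers and boot file paths in the table are placeholders I picked for the example.

```python
from typing import Optional

# Hypothetical mapping of client architecture number to boot file.
# The paths are placeholders, not the names any particular product uses.
BOOT_FILES = {
    0: "boot/bios/pxeboot.0",      # legacy BIOS x86
    6: "boot/efi32/bootia32.efi",  # 32-bit x86 UEFI
    7: "boot/efi64/bootx64.efi",   # commonly reported by 64-bit UEFI clients
    9: "boot/efi64/bootx64.efi",   # also seen for 64-bit UEFI
}


def client_arch(vendor_class: str) -> Optional[int]:
    """Extract the architecture number from a PXE vendor class string."""
    parts = vendor_class.split(":")
    if len(parts) < 3 or parts[0] != "PXEClient" or parts[1] != "Arch":
        return None
    try:
        return int(parts[2])
    except ValueError:
        return None


def choose_boot_file(vendor_class: str) -> Optional[str]:
    """Pick a boot file for this client, or None if it is not a known PXE client."""
    arch = client_arch(vendor_class)
    if arch is None:
        return None
    return BOOT_FILES.get(arch)
```

With a single hard-coded option 67 there is no place for a lookup like this to happen: every client gets the same file name no matter what it reports about itself.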
The good news is that fixing all this is simple – sorta. You effectively just need to do two things:
- Remove options 60, 66, and 67 from your DHCP server(s). You may find references elsewhere to removing option 43 (Vendor-Specific Information) as well, but we have this option set for other purposes and it has not caused any issues.
- If your PXE server is on a different subnet/VLAN from your clients, configure your router to forward broadcasts to it, exactly as you do for your DHCP servers.
Two things will then happen when a client sends out a DHCP Discover broadcast:
- Your DHCP server(s) will respond with IP address(es) and related info.
- Your PXE server will respond with option 60, identifying itself as a boot server.
Note that second bullet point: the PXE server should be replying with option 60, but the DHCP servers should not.
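A quick way to sanity-check this split is to look at each Offer and ask two questions: does it hand out an address, and does it carry option 60? The sketch below does that for offers you have already pulled out of a capture. The `Offer` class, the server names, and the assumption that a PXE-only responder leaves the offered address at 0.0.0.0 are mine; verify the last one against your own captures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

OPT_VENDOR_CLASS = 60


@dataclass
class Offer:
    server: str   # label for whoever sent the offer
    yiaddr: str   # address offered to the client; "0.0.0.0" if none
    options: Dict[int, bytes] = field(default_factory=dict)


def role(offer: Offer) -> str:
    """Classify an offer by whether it gives an address and/or advertises PXE."""
    gives_address = offer.yiaddr != "0.0.0.0"
    says_pxe = offer.options.get(OPT_VENDOR_CLASS, b"").startswith(b"PXEClient")
    if gives_address and says_pxe:
        return "address+boot"
    if gives_address:
        return "address"
    if says_pxe:
        return "boot"
    return "unknown"


def summarize(offers: List[Offer]) -> Dict[str, List[str]]:
    """Group server labels by the role their offer plays."""
    summary: Dict[str, List[str]] = {}
    for offer in offers:
        summary.setdefault(role(offer), []).append(offer.server)
    return summary


# Hypothetical servers and addresses, for illustration only.
offers = [
    Offer("dhcp1", "10.0.0.50"),
    Offer("dhcp2", "10.0.0.51"),
    Offer("pxe1", "0.0.0.0", {OPT_VENDOR_CLASS: b"PXEClient"}),
]
print(summarize(offers))
```

In the setup described here you want your DHCP servers under "address" and your PXE server under "boot". Anything landing under "address+boot" is a DHCP server that is also sending option 60, which is worth a second look unless that box really is running a PXE service too.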
Once we realized we shouldn’t be setting these options explicitly and configured the ip helper-addresses, sure enough, the Venue tablets would successfully PXE boot, getting the proper UEFI boot file in the process. But it’s never that simple, right?
Making these changes broke PXE for all (or as it turns out, almost all) of our desktops and laptops, which had been working fine for years. We have five different models of Dell OptiPlex desktops, plus a few models of Latitude laptops: a pretty good mix of their corporate product line for the last six years.
Most of the desktops would error out during the PXE part of the boot process with a “PXE-E55 ProxyDHCP service did not respond on port 4011” error. Some of the laptops were worse, completely hanging during PXE without getting an IP address. Just to be clear, these PCs were not actually trying to boot from the network; the errors were happening during the part of the system’s start-up where it first tries to connect to the boot server and determine if it should be booting from the network.
Research into the PXE-E55 error always came back to the same supposed cause: having option 60 set on your DHCP server (without options 66 and 67) when your PXE service is running on a separate server. Essentially, in this configuration your client sees the DHCP server respond with option 60 and because of that tries to connect to it on port 4011 to network boot; since your DHCP server is not configured for PXE, nothing responds on port 4011.
I double and triple checked our DHCP servers and confirmed they should not be sending out option 60, so I decided to do some packet captures during the boot process to make sure they weren’t. The results were interesting; on a machine that generated the PXE-E55 error, I could see the DHCP Discover request go out and likewise see the DHCP servers and PXE server all send their DHCP Offer in response. The client would display an IP address (and respond to pings on this IP) but no further DHCP packets were sent out. The client never sent a DHCP Request packet and the PXE-E55 error occurred immediately after the client acquired an IP.
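When comparing captures from several machines, it helps to reduce each one to the list of DHCP message types seen (option 53) and find the first step of the exchange that never shows up. This is just a small helper along those lines; the function name is my own, and getting the message types out of the capture is left to whatever tool you use.

```python
from typing import List, Optional

# Option 53 values for the four-step exchange, in the order they should appear.
EXPECTED_STEPS = [
    (1, "Discover"),
    (2, "Offer"),
    (3, "Request"),
    (5, "ACK"),
]


def first_missing_step(message_types: List[int]) -> Optional[str]:
    """Return the first expected step that never appears in order, or None."""
    position = 0
    for code, name in EXPECTED_STEPS:
        try:
            # Search only after the previous step, so order matters.
            position = message_types.index(code, position) + 1
        except ValueError:
            return name
    return None


# One Discover followed by three Offers and nothing else,
# which is the pattern described above for the failing machine.
print(first_missing_step([1, 2, 2, 2]))
```

For the failing pattern above this reports "Request", which points at the client: the servers did their part and the next packet never left the NIC.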
My instinct at this point was that something was wrong on the client side; after all the next packet that should have been sent was from the client. But with PXE having worked for years and all of our recent changes having been made on the server and network side, logic seemingly dictated the client couldn’t be the problem.
The breakthrough was one of my techs identifying a machine that was properly getting through the PXE process without an error. I packet captured that system and saw the four-step DHCP exchange as expected.
So now I had two test PCs on the same VLAN, connected to the same switch, and connecting to the same set of servers; one worked, one didn’t.
While they were different model PCs, they both used the same Intel NIC. However, the PXE firmware on the NIC was slightly newer on the working system. I downloaded the latest Dell BIOS for the non-working machine, installed it, and lo and behold, the system started PXE booting as expected.
There’s clearly a bug in the earlier revisions of the Intel PXE firmware. Keep in mind that no traffic at all was being sent from the NIC of the non-working system after the DHCP Offers were received. It seems the NIC saw option 60 set without 66 and 67, and immediately generated the PXE-E55 error without even trying to connect to a PXE server. Interestingly, I looked through all the BIOS revision notes I could find for a Dell OptiPlex GX745 (one of our affected systems) and could not find any mention of PXE improvements or fixes. But the version number of the PXE firmware did increment and the behavior changed, so it would seem the release notes are incomplete.
Thankfully Dell provides a way to silently push a BIOS install, so we’re now updating the BIOS on all of our systems, and PXE is back to working as expected.
Your comments or questions are appreciated. Thanks for reading!