Netgear GS516TP problem 802.3ad LAG/LACP – only one direction full bandwidth

A little story

Starring: This cute device

I ordered a new GS516TP 16-port L3 switch from Netgear because I had to increase the performance of a NAS device attached to three VMware ESXi hosts in a dedicated storage LAN. I cross-checked whether the switch in question supports dynamic LACP according to IEEE 802.3ad by reading the product’s datasheet. The switch supports it, so I placed the order.

The switch arrived and I created the initial configuration, such as setting the time and the IP address. As a second step, I checked the Netgear website to verify that I had the latest firmware installed. Out of the box, the GS516TP came with firmware version 6.0.1.16, which is the latest one at the time of writing this article.

This is what the final configuration looks like.

Left: Three ESXi hosts with local datastores. Middle: Netgear GS516TP. Right: NAS device.
All network interfaces support 1 Gbit/s full duplex.

I proceeded to configure the LAG/LACP settings for the NAS device, so I opened the web GUI of the switch and browsed to “Switching” -> “LAG” -> “LAG Configuration”. I created the settings as shown in the screenshot, but I was not able to set the “STP Mode” (Spanning Tree Protocol) to “Disabled”. Every time I changed the setting, the value in the table remained “Enable”. I was confused. I toggled the setting between enable and disable again and again, but nothing seemed to happen. The setting in the table remained at “Enable”.

Nevertheless, I went on to run some performance tests from the ESXi farm to the NAS. I saw that, regardless of the direction of the data flow, only one link of the aggregated port was used, so the overall bandwidth was limited to 1 Gbit/s, or about 120 MB/s, for both writing to and reading from the NAS. I noticed that you get 90 days of free support after buying a Netgear product. Great. Because it was late in the evening, I searched for the US phone number of Netgear support, but I was not able to find it. So I simply called the number of the premium support line and explained my problem, and I was redirected to the right place. Very nice. An engineer connected to my desktop and I demonstrated the problem with the bandwidth being limited to only one port of the aggregated link. I told him that I wanted to disable the STP mode, because I do not interconnect multiple switches, but that the setting remained “Enabled”. I also told him that I suspected this was preventing the aggregated link from working properly.

The engineer played around a little bit and saw the same problem with the STP Mode setting. Afterwards he browsed to “Switching” -> “STP” -> “STP Configuration” and disabled “Spanning Tree State” globally.

I ran another performance test and saw that I still had the same limitation. The engineer therefore browsed to “Switching” -> “STP” -> “CST Port Configuration” and disabled “STP Status” at the port level on ports 1-4. Additionally, he created a new LAG group, removed the interfaces from the previous LAG group, and added them to the freshly created one. He also played around with the “STP Mode”. As the last action he tried to set it to disabled, but the STP Mode still remained “Enabled”.

I ran another (quick) performance test, and now I saw that the traffic was distributed over three ports of the aggregated link while I was transferring three virtual servers from the NAS to the local datastores of three different ESXi hosts at the same time (that is, reading from the NAS). The engineer disconnected and we believed the problem had been solved by disabling “STP Status” on the interfaces themselves. As a follow-up, he promised to reproduce the problem in the lab and then notify the engineering team about it.

After a little while and some deeper investigation, I found out that I could revert both settings (the global STP setting and the port-based STP setting) to their default values as long as I disabled the STP Mode on the LAG configuration page. I learned that the setting can in fact be switched and is applied in the background; only the value shown in the table stays at “Enabled”. After applying “Enabled”, the traffic was not distributed over the aggregated links; after applying “Disabled”, it was. OK, so far so good: as long as I know about this behaviour, I can handle it. It’s just a minor bug in the web GUI that you have to be aware of.

A few hours later, I received this message:

Regardless of this unhelpful message, I was happy that I had found the root cause of my issue, and I went on to run additional bandwidth tests. I found that I can read data from the NAS from three ESXi hosts at the same time and reach a bandwidth of 350 MB/s, nicely distributed over three ports. Wow, just as it should be. On the other hand, I found that writing to the NAS from three ESXi hosts is still not load balanced over the aggregated link; only one link at 1 Gbit/s is used.
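As a rough sanity check of these numbers, here is a small Python sketch that converts the link speed into a throughput ceiling. It ignores all protocol overhead and assumes decimal units (1 Gbit/s = 1,000,000,000 bit/s, 1 MB = 1,000,000 bytes); the function name is made up for this illustration.

```python
# Rough throughput ceiling of an aggregated link, ignoring protocol overhead.
# Assumption: decimal units (1 Gbit/s = 1,000,000,000 bit/s, 1 MB = 1,000,000 bytes).


def link_ceiling_mb_s(gbit_per_link: int, links_in_use: int) -> float:
    """Return the theoretical maximum in MB/s for the given number of links."""
    bits_per_second = gbit_per_link * 1_000_000_000 * links_in_use
    return bits_per_second / 8 / 1_000_000


one_link = link_ceiling_mb_s(1, 1)     # a single 1 Gbit/s link
three_links = link_ceiling_mb_s(1, 3)  # three 1 Gbit/s links in parallel

print(f"one link:    {one_link} MB/s")
print(f"three links: {three_links} MB/s")
```

The measured values of about 120 MB/s on one link and 350 MB/s on three links sit just below these ceilings, which fits one and three fully used 1 Gbit/s links.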

The first question I asked myself was: “What is the hashing algorithm of the aggregated link?” Unfortunately, the web GUI of the switch has no setting to specify a hashing algorithm. “OK then, the only explanation is that the switch is doing L2 destination-MAC-based hashing.”

As long as the switch only supports dst-mac for the hashing, we will always have this situation:

  • One to many hosts: Able to use more than one link, because the destination MAC in the Ethernet frames is different for each host.
  • Many to one host: Unable to use more than one link, because the destination MAC in the Ethernet frames is always the same.
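The two cases can be illustrated with a small Python sketch. It is only an assumption about how such a hash might work: the real algorithm of the GS516TP is not documented, so a simple "destination MAC modulo number of links" stands in for it, and all MAC addresses are made-up example values.

```python
# Sketch of destination-MAC-only link selection on a 4-port LAG.
# Assumption: hash = destination MAC modulo number of links.
# The real algorithm of the switch is unknown; the MACs are invented.

N_LINKS = 4

ESXI_MACS = ["00:00:5e:00:53:01", "00:00:5e:00:53:02", "00:00:5e:00:53:03"]
NAS_MAC = "00:00:5e:00:53:10"


def mac_to_int(mac: str) -> int:
    """Convert a colon-separated MAC address to an integer."""
    return int(mac.replace(":", ""), 16)


def pick_link_dst_mac(dst_mac: str, n_links: int = N_LINKS) -> int:
    """Choose the LAG member link from the destination MAC only."""
    return mac_to_int(dst_mac) % n_links


# One to many: frames addressed to three different ESXi hosts.
one_to_many = [pick_link_dst_mac(mac) for mac in ESXI_MACS]

# Many to one: frames from three ESXi hosts, all addressed to the NAS.
many_to_one = [pick_link_dst_mac(NAS_MAC) for _ in ESXI_MACS]

print("one to many uses links:", sorted(set(one_to_many)))
print("many to one uses links:", sorted(set(many_to_one)))
```

With these example addresses, the first case spreads over three links while the second always lands on a single link, no matter how many hosts are sending.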

I tried to contact Netgear with this information. Because it was now within local office hours, I looked for the phone number and followed the “My Support” process in my MyNetgear account. The following number was displayed. If you find out how to dial this phone number from Switzerland, you will be my hero. I spent a lot of time trying to work out how to establish a connection to this number, and in the end I had to give up.

In the end, I again took the detour via the US premium support number. The L1 technician at Netgear routed me to an L2 technician, but Netgear neither confirmed nor denied that this is expected behaviour and that the switch uses dst-mac to calculate the hashes. The only statement was, quote: “In case you need to change the hashing algorithm of LACP aggregated links, then you have to go by the M-Series of Netgear products”. This is the standard tactic when you do not want to confirm that one of your products has a problem: instead, you try to use the situation to generate additional business. You try to sell another device and hope that the customer is badly informed and will follow your suggestion.

What I learned is that the L2 technician was not really able to grasp my problem. He had to interrupt the phone call after every question to ask his supervisor in the background which answer he should give. After several questions, I asked the technician whether I could speak with the supervisor directly. The answer was, quote: “Uhm, the supervisor is not available”. As a last question, I asked whether I could return the product under these circumstances, because it does not support what I need and I was not aware of such a limitation. This limitation is also not highlighted in the technical datasheet; there is no information that the switch only supports dst-mac (which is unexpected, by the way). Netgear refused. Later, from a discussion in the Netgear Community forum, I learned that the L2 technician must have been held back by his supervisor from escalating my questions further to L3.

While I was looking at the M-Series... Again, Netgear is a negative example. Have a look at the datasheet of the M4100 series. If you have burned your fingers with the GS switches and want to be sure that you are buying a switch that can handle all those hashing algorithms, this is what you will find in the datasheet: “Including static (selectable hashing algorithms) OR dynamic LAGs (LACP)”.

In my reading this means: selectable hashing algorithms ARE ONLY POSSIBLE ON STATIC LAGs, but not on dynamic LAGs based on LACP. Only if you dig deep into all those manuals will you, a lot of time later, find on page 200 of the User Manual a screenshot of the web GUI with the options for configuring your LAGs. Okay, so it seems that the information in the datasheet is misleading or unclear.

I decided to get help and to share my experience more generally, so I explained my issue in the Netgear Community forum. A person with the nickname “Schumaku” replied a few times to my statements, and I had the feeling that he must be a Netgear employee himself. His aim was to question my facts, and he tried to portray me as an angry person who does not know what he is talking about. In a last post, I described my experience with customer support and said that I did not have the feeling that these issues were being taken seriously, or, if this is intended behaviour, that it does not make sense to use dst-mac hashing at all. This post was deleted/censored from the Netgear Community forum.

The end of the story: I have a huge Netgear GS516TP brick on my table. I have to find another solution with another product; I’m looking for a switch that can handle those different types of hash algorithms. Watch out carefully: this information is normally not available in the datasheets, or it is hidden, incomplete or even misleading.

Btw: I could add another tech experience from the smaller GS108Tv2 and how this switch handles LACP. The issues with the GS108Tv2 brought me to the GS516TP, but that is another crazy story.

Just for completeness: with all Netgear models I have had my hands on in the past, it was possible to enter an email address in the “System Contact” field. This information is returned when you query your switch via SNMP. With this model this is no longer possible. If you enter an email address, you get the following error message. This is just another bug, because in SNMP terms the system contact is usually an email address. No big deal, just annoying.

What is the outlook?

I’ve analysed the technical specifications of the TP Link T2600G-18TS V2.

The switch supports dynamic LACP (802.3ad) and, importantly, the hashing behaviour can be configured to our own preferences. Additionally, the switch costs half as much as the GS516TP and has a fast and modern web GUI. It also comes with a dedicated console port, including cables, for all those who feel at home on the CLI. Of course, you can also use SSH.

Final result?

  • With the TPLink switch and the src-mac & dst-mac setting, I was able to write at ~240 MB/s from three different hosts at the same time (this doubled the performance compared to the Netgear).
  • With the src-ip & dst-ip setting on the switch, I was able to write at ~350 MB/s from three different hosts at the same time, and this is the performance I expected to reach!
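A possible explanation for the gap between the two settings (an assumption, not something verified on the device) is a hash collision: with three senders and four links, two flows can end up on the same link under one hash input and on separate links under another. The Python sketch below assumes a simple "source XOR destination, modulo number of links" hash. All MAC and IP addresses are invented and deliberately chosen to show such a collision; they are not the addresses of the setup described here.

```python
# Sketch: how the hash input changes the spread of three writers over a
# 4-port LAG. Assumptions: hash = (source XOR destination) modulo number of
# links; all addresses are invented and chosen to show a collision.
import ipaddress

N_LINKS = 4

# (MAC, IP) of three hypothetical ESXi hosts and of the NAS.
HOSTS = [
    ("00:00:5e:00:53:01", "192.0.2.11"),
    ("00:00:5e:00:53:02", "192.0.2.12"),
    ("00:00:5e:00:53:05", "192.0.2.13"),
]
NAS = ("00:00:5e:00:53:10", "192.0.2.20")


def mac_to_int(mac: str) -> int:
    """Convert a colon-separated MAC address to an integer."""
    return int(mac.replace(":", ""), 16)


def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 address to an integer."""
    return int(ipaddress.IPv4Address(ip))


def link_src_dst_mac(src_mac: str, dst_mac: str, n_links: int = N_LINKS) -> int:
    """Pick a link from source and destination MAC."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % n_links


def link_src_dst_ip(src_ip: str, dst_ip: str, n_links: int = N_LINKS) -> int:
    """Pick a link from source and destination IP."""
    return (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) % n_links


# All three hosts write to the NAS at the same time.
mac_links = [link_src_dst_mac(mac, NAS[0]) for mac, _ in HOSTS]
ip_links = [link_src_dst_ip(ip, NAS[1]) for _, ip in HOSTS]

print("src-mac & dst-mac uses", len(set(mac_links)), "links:", mac_links)
print("src-ip & dst-ip uses", len(set(ip_links)), "links:", ip_links)
```

In this example two of the three hosts share a link under the MAC-based hash, while the IP-based hash places every host on its own link. Which flows collide in practice depends entirely on the real addresses and on the hash the switch actually implements.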

Below is a screenshot of the NAS device and its four interfaces while I was writing to it from three different hosts at the same time.

Below is a screenshot of the NAS device and its four interfaces while I was reading from it from three different hosts at the same time.

Conclusion?

I’ve always preferred Netgear products over TP Link products in the past, but my mindset has changed. From now on, I will use TP Link products for myself and my customers. I currently feel more confident with the TP Link JetStream product line.

Netgear GS516TP vs. TPLink T2600G-18TS-V2
Price: ~320.00$ vs. ~165.00$
SFP (Uplink) Slots: 2
10/100/1000Mbit/s-RJ45-Ports: 16 vs. 16
LAG/LACP 802.3ad: Limited and buggy vs. L2/L3 – works as expected
PoE: 8 Ports; 802.3af (15.4 W)
PoE Budget: 76 W
PoE PD: 2
Internal Bandwidth: 32 Gbit/s vs. 32 Gbit/s
19″ Rack mountable: Yes vs. Yes
Noise: 24.6 dBA vs. No fan
WebGUI: Slow, ugly and a few minor bugs vs. Fast and modern
Console / SSH: No vs. Yes

 
