bb1

A Specific Broadband Fault

Internet connection appears to drop out completely.

Initially reported to TalkTalk on 4 October 2014.

The fault is intermittent, i.e. it comes and goes. When present it causes problems such as the following.

Symptoms

  1. Web pages completely fail to load
  2. Web pages load partially
  3. DNS lookups fail
  4. Internet videos play for a few seconds then stop or do not play at all
  5. Downloads become impossible, usually hanging part-way through or not even starting
  6. Throughput can be extremely low - e.g. a measured 20kbit/s on a 3.2Mbit/s link - if it works at all

Conditions

  1. The fault happens on both wired and wireless connections
  2. The fault happens regardless of which PC is used
  3. The fault is seen by both Windows and Linux computers
  4. The Internet-connection dropouts are seen by all computers on the LAN at the same time

Investigation

Even simple pings to hosts on the Internet were seen to drop out, indicating loss of basic IP connectivity. This was not sporadic packet loss but an approximately periodic pattern: a few seconds of good connection followed by a broadly similar period with no connection at all. Overall packet loss was high. For example,

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

c:\>ping -t 8.8.8.8

Pinging 8.8.8.8 with 32 bytes of data:

 ... (output omitted)

Ping statistics for 8.8.8.8:
    Packets: Sent = 82, Received = 51, Lost = 31 (37% loss),
Approximate round trip times in milli-seconds:
    Minimum = 60ms, Maximum = 81ms, Average = 63ms

This shows heavy packet loss (37%). To see the specific dropout pattern, which should be helpful in narrowing down the cause, see bb1-s01.
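As a minimal sketch of how a periodic dropout can be told apart from sporadic loss, the following Python groups consecutive ping results into runs. The function name summarise_runs and both sample sequences are hypothetical illustrations, not the measured data (for that see bb1-s01). Both samples have the same overall loss, but only one has long outages.

```python
from itertools import groupby


def summarise_runs(replies):
    """Summarise ping results; replies holds True for a reply, False for a timeout."""
    replies = list(replies)
    # Collapse consecutive identical results into (state, run_length) pairs.
    runs = [(ok, len(list(group))) for ok, group in groupby(replies)]
    lost = replies.count(False)
    outage_lengths = [length for ok, length in runs if not ok]
    return {
        "sent": len(replies),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(replies) if replies else 0.0,
        "outages": len(outage_lengths),
        "longest_outage": max(outage_lengths, default=0),
    }


# Hypothetical sequences, each 30 pings with 12 lost.
periodic = ([True] * 6 + [False] * 4) * 3          # on-off pattern
sporadic = [True, False, True, True, False] * 6    # isolated single losses

periodic_summary = summarise_runs(periodic)
sporadic_summary = summarise_runs(sporadic)
print(periodic_summary)
print(sporadic_summary)
```

With one ping per second, the longest outage gives a rough dropout duration in seconds, which is the figure that matters to applications.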

To narrow down where the losses were occurring, the same test needed to be run against the next-hop router. That router was found with the following.

c:\>tracert 8.8.8.8

Tracing route to google-public-dns-a.google.com [8.8.8.8]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  dg1 [192.168.1.1]
  2    26 ms    27 ms    26 ms  host-78-146-176-1.as13285.net [78.146.176.1]
  3    51 ms    51 ms    51 ms  xe-11-2-0-bragg001.bre.as13285.net [78.151.225.39]
  4    50 ms    52 ms    51 ms  host-78-151-225-14.static.as13285.net [78.151.225.14]
^C

This shows the next hop as 78.146.176.1.

Then pings to the next hop were run and showed the same level of packet loss.

c:\>ping -t 78.146.176.1

Pinging 78.146.176.1 with 32 bytes of data:

 ... (output omitted)

Ping statistics for 78.146.176.1:
    Packets: Sent = 492, Received = 294, Lost = 198 (40% loss),
Approximate round trip times in milli-seconds:
    Minimum = 24ms, Maximum = 156ms, Average = 30ms

This shows heavy packet loss (40%). To see the dropout pattern see the detail in bb1-s02.

The next-hop IP address is not always 78.146.176.1, so a note has been kept of next-hop IP addresses for which the symptoms are the same. All of the following TalkTalk router interface addresses have been seen to show the same problem. A TalkTalk network engineer would be able to tell if there is anything common between the addresses (hardware, configuration, resources, night-time traffic load, etc.).

  • 2.96.16.1
  • 78.146.176.1
  • 89.240.240.1
  • 89.241.64.1
  • 89.242.64.1

Symptom summary

The above responses, and especially the dropout patterns linked from them, show loss of IP connectivity and narrow it down to somewhere between the home broadband router and the first router on the TalkTalk network. IP connectivity drops out for significant periods of between 20 and 40 seconds. These periods are long enough to cause most internet-accessing applications to fail and to report faults such as those listed above. (The dropout fault is not on the LAN or in the home router, because pings to both the LAN and WAN interfaces of the home router were 100% reliable at the same time as pings to the first TalkTalk router showed the periodic dropout pattern. Also, the fault appears not to be with the layer-2 ADSL line, because the Huawei HG523a home router reports that the ADSL line remains up and stable.)
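The reasoning in this summary (walk outwards hop by hop until loss first appears) can be sketched in Python as follows. The function name, the 5% threshold and the hop labels are assumptions made for illustration. The 0% figures stand for the "100% reliable" home-router pings and the 40% figure is taken from the next-hop test above.

```python
def first_lossy_segment(hops, threshold_pct=5.0):
    """Return (last clean point, first lossy hop), or None if no hop is lossy.

    hops is an ordered list of (name, loss_pct) pairs, nearest hop first.
    threshold_pct is an assumed cut-off for "heavy" loss.
    """
    previous = "test PC"
    for name, loss_pct in hops:
        if loss_pct > threshold_pct:
            return previous, name
        previous = name
    return None


# Illustrative figures based on the tests described in this page.
hops = [
    ("home router LAN interface", 0.0),
    ("home router WAN interface", 0.0),
    ("first TalkTalk router", 40.0),
]

segment = first_lossy_segment(hops)
print(segment)
```

The result names the two ends of the segment in which the fault is most likely to lie. It does not say which part of that segment is at fault.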

Points to note

  1. There seems to be no fault on the ADSL line when the symptoms occur
  2. No fault was seen between the home router and the DSLAM at the exchange
  3. TalkTalk have tested the line repeatedly and found it to be clear. However, the Huawei router does show changes to the ADSL line's SNR figures

So where could the fault be?

There seem to be only these possibilities:

  1. The home broadband router
  2. The layer-2 network between the home router in Aberdeen and the TalkTalk infrastructure router (possibly located in London or another big city). This layer-2 network can be subdivided into a number of parts such as
    1. The local loop between the home router and the DSLAM in the Balgownie exchange
    2. The Balgownie DSLAM, especially its response to load. If there is a problem with the configuration or operation of the Balgownie DSLAM then TalkTalk's standard testing does not reveal it and BT apparently report the Balgownie exchange as OK. However, it is possible that standard tests would fail to detect this pattern of failure
    3. The ATM network between the DSLAM and TalkTalk
  3. The TalkTalk infrastructure router or routers to which the DSL line connects

It would be unlikely that the fault would be any further into the TalkTalk network because the dropouts are seen to the first IP interface on the TalkTalk layer-3 infrastructure. However, it should be noted that, according to the IP addresses assigned for the home broadband router and its next hop, the DSL system seems to connect to different infrastructure routers or, at least, to different router interfaces. The fault seems to happen whichever infrastructure router interface is used.

When does the fault occur?

  1. It can start in the evening any time between, say, 5pm and midnight
  2. It can last until as late as 7am
  3. The problem sometimes appears to remain all night. At other times the fault clears and comes back. No stable pattern has been seen for when the problem will occur or when it will clear.

Questions

  1. Is some traffic flow causing the issue? (If this is due to traffic the ADSL connection should apparently slow connections down, not cause them to drop out completely as is happening here. So if congestion is leading to this problem are the places where oversubscription can occur properly configured?)
  2. Is there a place where interference is affecting part of the circuit? The fault has been observed when using none of the customer's wiring, i.e. on the test socket with nothing else connected. If there is interference it could be between the test socket and the exchange.
  3. Is the TalkTalk infrastructure router configured to handle congestion well on the outgoing ATM interface or subinterface? One might expect WRED or similar to drop individual packets from streams as usage approached 100%, not to suddenly drop all packets for about 30 seconds at a time.
  4. Is some load (whether on the same ATM interface or somewhere else) causing the TalkTalk infrastructure router to run out of resources, leading it to have to rebuild internal tables such as a routing table, an ARP table or a record of connections?
  5. What else could cause the layer-3 dropouts seen?
  6. Are any changes possible to the Huawei home broadband router (such as allowing incoming pings from TalkTalk) so that TalkTalk can see the symptoms directly without having to get the information via the customer?
  7. Would the SNR (Signal to Noise Ratio) or other figures shown on the Huawei home router be of use in determining whether there is interference on the local loop? The SNR figures do vary.

Update 16 October 2014

  • A TalkTalk engineer visited the customer premises and apparently changed the customer's "profile" and made no other changes. After that the connection was much more reliable. No further fault was noticed until 28 October 2014 but then the original symptoms were found to have returned exactly as before.

Update 29 October 2014

The customer was asked by TalkTalk to run on the test socket with the phone disconnected. This made no difference. The same problem was seen. Detail of that follows.

  • The same symptoms including the cyclical losses of connection happened as before.
  • An opportunity was taken to run a different test, i.e. directly between the home router and the TalkTalk router interface which at the time was the one with IP address 78.146.176.1. These tests used the Huawei router's menus. The results reported by the Huawei router were as follows. Note that the failure counts are high and consistent with the problems seen.
Date and Time: 2014-10-29 05:56:11
Host: 78.146.176.1
Number of Repetitions: 4
Success Count: 1
Failure Count: 3
Minimum Response Time: 20
Maximum Response Time: 20
Average Response Time: 20

and

Date and Time: 2014-10-29 05:56:57
Host: 78.146.176.1
Number of Repetitions: 4
Success Count: 0
Failure Count: 4
Minimum Response Time: 0
Maximum Response Time: 0
Average Response Time: 0

For comparison, a PC-based ping test showed the same as before.

c:\>ping -n 40 78.146.176.1

Pinging 78.146.176.1 with 32 bytes of data:

 ... (output omitted)

Ping statistics for 78.146.176.1:
    Packets: Sent = 40, Received = 30, Lost = 10 (25% loss),
Approximate round trip times in milli-seconds:
    Minimum = 25ms, Maximum = 27ms, Average = 26ms

This shows heavy packet loss (25%). To see the specific dropout pattern see bb1-s03.

At times when the above packet losses are happening TalkTalk have repeatedly taken the line down to run tests but seen no fault. Whatever their standard tests check apparently does not include the part of the circuit that is causing the dropouts. Further, the automated tests accessible on the TalkTalk web site do not work well for this.

  • The web-page tests may fail completely: connection to the web page gets lost as with any other web page
  • The web-page tests (including the speed test) pause while the connection drops out and continue when the connection comes back. Even though they have lost connection and paused during part of the test, when they finish they still report high bandwidth. Perhaps they report peak transfer rates. Other bandwidth checkers have been more revealing, showing an average over the period of the test. For example, where TalkTalk's speed test has reported 3.2Mbit/s another test showed 0.02Mbit/s.
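The gap between a peak figure and an average figure can be illustrated with a short Python sketch. The per-second samples below are hypothetical, chosen only to show the effect. How TalkTalk's speed test actually calculates its result is not known.

```python
def peak_and_average(rates_kbit_s):
    """Return (peak, average) for per-second throughput samples in kbit/s."""
    peak = max(rates_kbit_s)
    average = sum(rates_kbit_s) / len(rates_kbit_s)
    return peak, average


# Hypothetical test: 5 seconds at full line rate, then 35 seconds stalled.
samples = [3200] * 5 + [0] * 35

peak, average = peak_and_average(samples)
print(f"peak {peak} kbit/s, average {average:.0f} kbit/s")
```

A test that reports the peak would show the full line rate here, while one that averages over the whole 40 seconds would show one eighth of it. Longer stalls push the average lower still.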

Update 30 October 2014

Tests overnight should help narrow down where the problem occurs. The home router's Downstream SNR (signal to noise ratio) was found to correlate closely with the reported fault.

  • Whenever the link was good for 30 seconds or so the Downstream SNR read just over 18dB.
  • Whenever the connection failed for 30 seconds or so the Downstream SNR read about 7.9dB.
  • Other figures that the router reported about the ADSL circuit (line rates, attenuation, power and the Upstream SNR) were stable. The Downstream SNR was the only one which changed.

The Downstream SNR can therefore be regarded as a probable cause, i.e. the decreases in SNR are likely causing the customer's connection losses.
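One way to check a correlation like this is to log the Downstream SNR alongside the link state and compare the averages. The Python below is a minimal sketch. The function name and the readings are hypothetical, shaped to resemble the figures reported above rather than copied from the router.

```python
from statistics import mean


def mean_snr_by_link_state(samples):
    """Return (mean SNR while up, mean SNR while down) in dB.

    samples is an iterable of (downstream_snr_db, link_up) pairs.
    Both states must be present, otherwise statistics.mean raises an error.
    """
    samples = list(samples)
    up = [snr for snr, link_up in samples if link_up]
    down = [snr for snr, link_up in samples if not link_up]
    return mean(up), mean(down)


# Hypothetical readings taken by hand from the router's status page.
samples = [
    (18.1, True), (18.2, True), (7.9, False),
    (7.8, False), (18.0, True), (8.0, False),
]

up_db, down_db = mean_snr_by_link_state(samples)
print(f"link up: {up_db:.1f} dB, link down: {down_db:.1f} dB")
```

A large, consistent gap between the two averages supports the link between SNR and the dropouts. On its own it does not prove which one causes the other.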

If noise is affecting the local loop, causing the received SNR to drop from 18dB to 8dB, most possible locations have already been ruled out:

  • Home router. (Possible but unlikely because the fault does not happen during the day)
  • Customer's premises and cabling. (Can be ruled out for two reasons: they have already been seen by a TalkTalk engineer, and the fault shows up on the dedicated test socket, which omits the customer's wiring)
  • Customer's ADSL splitter. (Unlikely because two different splitters have been tried and because the fault does not occur during the day)

This leaves two obvious places where the SNR could be affected:

  • The local loop cabling.
  • The DSLAM or the customer's port thereon. (This is possible. A simple change may be to reterminate the local loop on another DSLAM or MSAN and see if the problem goes away)

Unless the fault is with the existing DSLAM the above findings suggest that a castellated pattern of noise (one with an on-off duty cycle) is caused on the local loop. This could be either by equipment or by induction from adjacent cabling and would have to be checked by the last-mile provider (likely BT Wholesale and/or Openreach). They may be able to locate the cause of the problem simply by inspection.

If an inspection does not reveal the cause and monitoring is required, it would have to be carried out at night because the problem has not been seen to occur during daylight hours. The customer will be able to advise whether the problem has occurred (or is occurring) on a specific night. Aside from the period following the TalkTalk engineer's visit on 15 October 2014 the problem has been seen nearly every night, so if things remain as they are now the fault should show itself very quickly.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License