Below is a master doc of all my BGP notes. Previously I was keeping this hidden away in a Google Doc, but I thought, why not share it with the world? If you see any issues let me know! As I always say, I’m no expert but I try my best!
These are the notes I’ve kept with me through University, certification prep and job interviews and they haven’t let me down yet, so hopefully, they can bring you up!
I did a very similar thing with my OSPF notes over here:
https://jonathansteward.co.uk/index.php/2020/11/06/ospf-master-notes/
Most importantly, if you have any questions or don’t understand something, post a comment/get in touch and I might be able to help! I always say that the best way to learn something is by teaching it!
1 – Summary
BGP was born out of the need for a more scalable routing protocol for the wider internet. These days it’s actually in use in loads of different places and not just on the “internet”.
While BGP is the de facto standard for internet routing, it is also used in data centers to help with EVPN and VXLAN, and it’s used in WANs to provide a range of different VPN options with the use of Multiprotocol BGP.
But it also works well just by itself in the WAN to share blocks of subnets from site to site, while allowing for very granular administrative control.
One of the key building blocks to making BGP what it is today is the set of path attributes, as they enable the flexibility, scalability and granularity that we’ve all come to know and love in BGP!
2 – BGP vs IGP
OSPF/EIGRP (IGP) | BGP |
Forms neighborship before sending updates | Same |
Neighbors are typically discovered using multicast groups | Neighbors are statically configured |
Does not use TCP, uses its own IP protocol number, 89/88 respectively | Uses TCP port 179 |
Advertises prefixes with lengths* | Advertises prefix/length, called NLRI |
Advertises metric information or link-state database details | Advertises a number of path attributes with a given prefix to figure out the best path |
Fast and more efficient routing | Focus on scalability, stability and administrative granularity |
Link-state/distance-vector logic | Path-vector logic (doesn’t have link details) |
BGP is commonly called a path-vector protocol, but it is more advanced than just that.
A path vector suggests you know the path and the hops from point A to point B, but not the details or costs of the path (which you would get in EIGRP).
It certainly isn’t link state, as link state means you know about all the links in the AS.
3 – Path Attributes
Origin
Where the prefix was sourced from: either the network command within BGP (origin code i), which matches routes already in the local routing table, or direct redistribution (origin code ?).
AS Path
This is just a list of the ASes that traffic would pass through to get to the destination AS.
Fewer is better
This can be altered with AS path prepending. If the path is longer it might not be as preferred; this depends on where AS path length falls in the order of path attribute selection.
For example, see the image below. Assuming you are at AS1, you have a route to AS3. Ignoring AS6, you would have two paths:
AS1 > AS2 > AS3 or AS1 > AS5 > AS4 > AS3
The second path is one hop longer, but with prepending you could pad the first path with an extra copy of AS1 so that both paths are 3 hops, which could result in load balancing:
AS1 > AS1 > AS2 > AS3 or AS1 > AS5 > AS4 > AS3
The AS Path attribute is also used to prevent loops.
If a router receives a BGP prefix whose AS path already includes the router’s own local AS, it will drop the route.
See the image below. Say a route was advertised along AS1 > AS2 > AS3 > AS6.
If AS6 then advertised the route to AS2 it would be ignored, as AS2 is already in the path.
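Here is a rough Python sketch of the two AS path ideas above: the loop check and prepending. The function names are my own for illustration, and the AS path is modelled as a simple list with the most recently added AS on the left, which is how it is normally displayed.

```python
def accepts(local_as: int, as_path: list[int]) -> bool:
    """Loop check: reject any route whose AS path already contains our own AS."""
    return local_as not in as_path


def prepend(local_as: int, as_path: list[int], times: int = 1) -> list[int]:
    """AS path as sent to an EBGP neighbour: our AS is added `times` times."""
    return [local_as] * times + as_path


# A route originated in AS1 and passed along AS1 > AS2 > AS3 > AS6.
# The newest AS is leftmost, so this is what AS6 would advertise onwards.
path_from_as6 = [6, 3, 2, 1]

print(accepts(2, path_from_as6))  # False: AS2 is already in the path
print(accepts(7, path_from_as6))  # True: a hypothetical AS7 has not seen it

# AS1 advertising its own prefix normally, and with one extra prepend.
print(len(prepend(1, [])))           # 1
print(len(prepend(1, [], times=2)))  # 2
```

The padded path is one entry longer, so a neighbour comparing on AS path length alone would see it as a less attractive way in.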
Next hop
The next hop attribute defines the IP address of the next hop towards the prefix. This will usually be based on the routing table lookup.
The next-hop IP doesn’t have to be a directly connected peer; the router can use the IGP to recursively look up the route to the BGP peer.
Alternatively, when peering within IBGP, we can set “next-hop self” on the prefixes we advertise to neighbours, which updates this attribute to point towards the advertising device’s local interface IP.
Note: with EBGP you have to be directly connected or use the BGP multihop options.
MED (Multi exit discriminator)
Intended to influence how a remote AS views/prefers your prefixes/paths. It results in the exit point with the lowest MED being preferred. This should only be passed from one AS to another, no further than one AS hop.
Don’t forget this is used after local pref, so if the remote side is already using uneven LPs then this might not work.
By default MED is only compared between routes received from the same neighbouring AS; it will not be compared between routes advertised from two different neighbour ASNs.
Local Pref (LP)
Local pref is used to determine how traffic will leave the current ASN and move into the next ASN in the path. Higher is better.
It is only kept local to an AS.
With some transit providers, you can use communities to influence what they set the local pref of your route/s to.
Atomic Aggregate
This is used when some AS numbers are lost from the AS path through aggregation. It flags that the path might not be 100% accurate or the true best path.
This should not be removed when re-advertising the aggregate route, and no NLRI of the route should be made more specific than the referenced route.
Order of path attributes
Cisco
Next hop reachable | Need to ensure the next hop is reachable and a valid route is within the routing table |
Weight (Cisco only, not a path attribute) | Cisco proprietary value, local only to the device, higher is better |
Local Pref (Well known discretionary) | How to prefer traffic outbound of the local ASN, higher is better |
Self originated | Prefers something directly connected or personally injected into BGP rather than a route learned from a remote device |
AS path (well known mandatory) | ASN’s the route passes through, shorter is better |
Origin (well known mandatory) | Prefers something that was injected by the network command than redistributed. |
MED (Optional non transitive) | Lower is better determines how the remote AS should direct traffic in |
Neighbor type | Prefers EBGP over IBGP |
IGP metric | lower IGP cost to the next hop is best |
LOAD BALANCE | At this point, if you have multiple paths where all of the attributes above are equal then you have equal-cost paths. If you have enabled BGP multipath, we load balance at this point. |
Oldest known route | The route that has been in the BGP route table the longest |
Lowest neighbour RID | Lower router ID always wins |
Lowest neighbour IP | Lowest IP of the possible paths |
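To make the ordering a bit more concrete, here is a minimal Python sketch that turns the table into a sort key. It is a simplification, not how any vendor actually implements it: I assume every next hop is already reachable, MED is compared across all paths (rather than only between paths from the same neighbour AS), and the multipath and oldest-route steps are skipped. The `Path` class and its field names are made up for this example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

# Origin preference: network command (i) beats EGP (e) beats redistributed (?).
ORIGIN_RANK = {"i": 0, "e": 1, "?": 2}


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100          # 100 is the usual default
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "i"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"
    neighbor_ip: str = "0.0.0.0"


def sort_key(p: Path) -> tuple:
    """Smaller tuple wins; 'higher is better' fields are negated."""
    return (
        -p.weight,                        # higher weight is better
        -p.local_pref,                    # higher local pref is better
        not p.locally_originated,         # self originated first
        len(p.as_path),                   # shorter AS path is better
        ORIGIN_RANK[p.origin],            # i < e < ?
        p.med,                            # lower MED is better
        not p.ebgp,                       # EBGP over IBGP
        p.igp_metric,                     # lower IGP cost to next hop
        int(IPv4Address(p.router_id)),    # lowest neighbour RID
        int(IPv4Address(p.neighbor_ip)),  # lowest neighbour IP
    )


def best_path(paths: list) -> Path:
    return min(paths, key=sort_key)


a = Path("via-isp-a", local_pref=200, as_path=(64500, 64501, 64502))
b = Path("via-isp-b", local_pref=100, as_path=(64510,))
print(best_path([a, b]).name)  # via-isp-a: local pref is checked before AS path
```

Because Python compares tuples left to right, a later field only matters when everything before it is a tie, which is the same idea as walking down the table.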
Other Path Attributes
Non-transitive attributes can’t be passed on to other ASNs; transitive attributes can.
Atomic Aggregate – Well known Discretionary
Aggregator – Optional Transitive
Community – Optional Transitive
Originator ID – Optional Non Transitive
Cluster list – Optional non-transitive
4 – Communities
A community is used to identify a group of routes or prefixes that share a common property. This allows you to identify those routes easily and then act on them. For example, you might have transit routes and external peer routes. You can apply a community to all routes coming from each of those two types of peering and then apply policy based on that community tag. As a result, you could propagate external peer routes further into your network and into different regions, but keep transit routes local to their region because of per-bit pricing across transit sessions.
Communities are also Path attributes.
A route can have more than one community
Normal communities are 32 bits long. The ranges 0x00000000 to 0x0000FFFF and 0xFFFF0000 to 0xFFFFFFFF are reserved, and the well-known communities sit in the upper range.
The recommended approach is to use the first 2 octets to represent your ASN and the last 2 for the specific community. This ensures that, regardless of what you receive from external ASNs, your communities should be unique.
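As a quick illustration of the ASN:value layout, here is a small Python sketch that packs and unpacks a normal community as a 32-bit number. The helper names are my own, and 65000:100 is just an example value.

```python
def encode_community(text: str) -> int:
    """Pack 'ASN:value' into 32 bits: ASN in the top 2 octets, value in the bottom 2."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("each half must fit in 16 bits")
    return (asn << 16) | value


def decode_community(number: int) -> str:
    """Split a 32-bit community back into 'ASN:value'."""
    return f"{number >> 16}:{number & 0xFFFF}"


print(encode_community("65000:100"))  # 4259840100
print(decode_community(0xFFFFFF01))   # 65535:65281, the well-known No Export value
```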
Extended communities
Extended communities are 64 bits long.
An extended community has a number of fields:
I bit – IANA authority bit. Indicates how the type was assigned, either first come first served with IANA or via IETF/experimental
T bit – Transitive bit. If clear (0) the community can be sent across different ASes; if set (1) it is non-transitive
Type – Defines the type of community and the way the data is laid out: AS specific with 2 octets for the AS, AS specific with 4 octets for the AS, or IPv4 address specific.
Subtype – Used to provide extra types that can be used.
Well known Communities
No Export – These routes should not be advertised outside of the local AS. For example, if you receive routes from two ISPs you would mark inbound routes as No Export, otherwise you could turn into a transit ASN
No Advertise – Do not advertise the NLRI/prefix to any peers.
No Export SubConfed – Should not be advertised out of the local ASN, not even to other member ASes when it’s part of a confederation.
None – Used to clear all communities
Blackhole – Used in DDoS scenarios to block traffic to a victim IP address. When received and accepted, the router should black hole the traffic https://tools.ietf.org/html/rfc7999
5 – IBGP/EBGP
(defaults)
IBGP | EBGP |
Peer address can be multiple L3 hops away | Peer address must be the IP of the peer’s directly connected interface, unless you configure the multihop option and then update the source address to reference a loopback |
Can’t advertise IBGP-learnt routes to another IBGP peer (full mesh or an RR would be the only way) | Can advertise routes anywhere as long as an AS number doesn’t re-occur; in that case the route is still advertised but is dropped by the receiving peer. |
Peering Via Loop back ip Address
Peering via a loopback will provide redundancy if you have more than one physical path. If one path goes down, the IP of that remote physical interface becomes unreachable.
However, a loopback will stay up as long as the device is operational, hence you will still be able to reach it via an alternative path. This obviously relies on the underlying IGP knowing about that alternative path.
IBGP
This causes no issues and is allowed out of the box
EBGP
Requires the multihop command and also the update source to be configured.
Allows for peering over a number of L3 hops to a remote device. This can be used where an ISP provides more than one link from your local device to the same remote device, making it worthwhile to peer against the loopback so that if one interface goes down the peering doesn’t.
This is one way to provide redundancy. Alternatively, and more recommended, you can set up peering sessions across each individual port.
If you update the source, that changes the source IP address on the messages outbound from that device. So it won’t use the outbound interface IP, it will use the loopback IP.
6 – Route reflectors (RR)
Peers of a RR are classed as either non-client or client, this is set at the RR and a remote peer of a RR would not be aware of this decision.
An RR and its set of peers is called a cluster.
An RR can either be in the forwarding path or sit separate from the forwarding path.
If the RR is in the forwarding path it should set “next-hop-self” on the advertised routes. If it is outside of the forwarding path this bit of config can be excluded.
For example, in the following diagram R3 doesn’t have “next-hop-self” set, so it retains the existing next-hop IP attribute and advertises it out to its neighbours.
R4 has then learnt 1.1.1.0/24 from R3, but the next hop is in fact pointing to R2 and hence R4 will forward traffic in that direction!
RRs will only advertise the best path. Like other BGP speakers, an RR will accept/learn the prefixes, run the BGP best-path selection algorithm and then only advertise the best.
BGP add-path changes this but is out of scope for this doc.
If a route is received from a non client, an RR will advertise to all clients.
If a route is received from a client, an RR will advertise to all clients and non-clients.
Cluster list – List of cluster IDs, used to track the route reflection path. When an RR reflects a route it will add its cluster ID to the list.
Cluster ID – RID of the RR (by default)
Originator ID – RID of the router that originated the route within the local AS.
Routing loops are prevented by the RR checking the cluster list for its own cluster ID. If it’s seen, the RR will not accept the route.
This can be seen in the diagram below. RR 30.30.30.30 tried to advertise back to 10.10.10.10 something it had already advertised. At this point RR 10.10.10.10 would deny the prefix and not learn it.
Additionally, the originator of the route will ignore the route when/if it sees its own RID in the originator ID.
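The reflection rules and the two loop checks can be sketched in a few lines of Python. This is only an illustration with made-up function and peer names, and I’m assuming the RR does not send a route back to the peer it learnt it from.

```python
def reflect_targets(source_type, clients, non_clients, source):
    """Which IBGP peers an RR re-advertises a best path to."""
    if source_type == "client":
        targets = clients + non_clients  # from a client: everyone
    elif source_type == "non-client":
        targets = clients                # from a non-client: clients only
    else:
        raise ValueError("source_type must be 'client' or 'non-client'")
    # Assumption: never reflect the route back to where it came from.
    return [peer for peer in targets if peer != source]


def rr_accepts(my_cluster_id, cluster_list):
    """An RR rejects a route that already carries its own cluster ID."""
    return my_cluster_id not in cluster_list


def originator_accepts(my_rid, originator_id):
    """A router ignores a route whose originator ID is its own RID."""
    return my_rid != originator_id


clients, non_clients = ["R2", "R3"], ["R4"]
print(reflect_targets("client", clients, non_clients, "R2"))      # ['R3', 'R4']
print(reflect_targets("non-client", clients, non_clients, "R4"))  # ['R2', 'R3']
print(rr_accepts("10.10.10.10", ["30.30.30.30", "10.10.10.10"]))  # False
```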
7 – Injecting routes
To advertise new routes you have two options: redistribute, or inject using the network command.
Using the network command you specify a prefix and mask, and if an exact match is found within the routing table then a BGP route is created for it.
This is the behaviour with no auto-summary configured.
When we use the auto-summary command, if a network command uses a classful address then no mask is needed, and the classful network will be injected if you have any sub-route that falls in this classful range or an exact match is seen.
e.g. if you have 10.1.1.1/32 in your routing table but have 10.0.0.0 specified (a Class A network), then 10.0.0.0/8 will be advertised!
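Here is a rough Python model of that matching logic, using the standard `ipaddress` module. It is a sketch of the behaviour described above rather than any vendor’s code, the function names are invented, and for simplicity I always pass the statement as an explicit prefix/length even in the classful case.

```python
from ipaddress import ip_network


def classful_prefixlen(network) -> int:
    """Classful mask length based on the first octet (classes A, B and C only)."""
    first_octet = int(str(network.network_address).split(".")[0])
    if first_octet < 128:
        return 8
    if first_octet < 192:
        return 16
    if first_octet < 224:
        return 24
    raise ValueError("not a class A/B/C address")


def network_statement_injects(statement, routing_table, auto_summary=False):
    """Would `network <statement>` create a BGP route given this routing table?"""
    wanted = ip_network(statement)
    routes = [ip_network(r) for r in routing_table]
    if wanted in routes:
        return True  # exact prefix/length match always works
    if auto_summary and wanted.prefixlen == classful_prefixlen(wanted):
        # Classful statement with auto-summary: any sub-route in range is enough.
        return any(r.subnet_of(wanted) for r in routes)
    return False


table = ["10.1.1.1/32"]
print(network_statement_injects("10.0.0.0/8", table))                     # False
print(network_statement_injects("10.0.0.0/8", table, auto_summary=True))  # True
```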
8 – Filtering
You can filter certain prefixes based on different things like AS path and prefix lists.
You can define a prefix-list that will match certain prefixes/lengths and, from that, you can permit/deny/prefer/de-prefer and many other things.
You can define a specific AS path to match against. This would allow you to block all routes via a specific AS. If this AS is only a transit AS then you could use that to avoid their paths while using alternative providers.
This can also be done as a regex match (at least in Juniper 😉 )
You could also use a route map or policy to match on a community and carry out certain actions as described above.
Real life example: by default, unless you are an ISP you should not be advertising out any routes that you are not originating internally within your AS. Otherwise you could accept prefixes from ASN 1, pass them through you to another session with ASN 3, and become the shortest path for all of ASN 3’s traffic towards ASN 1 originated routes.
To implement the rules to block this re-advertisement we can use filtering on communities or an AS-path regex to only permit locally originated routes to be advertised.
Alternatively you could tag all learnt routes with an “external” community and explicitly match this with a deny on your export policies.
An easier implementation would be to use the well-known “no export” community, which blocks advertisement out of the local ASN as described above.
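The export rules from the real life example could look something like this as a Python sketch. The route format, the `export_allowed` name and the 65000:999 “external” tag are all assumptions for illustration; the one real convention used is that an empty AS path (regex `^$`) means the route was originated inside your own AS.

```python
import re

LOCALLY_ORIGINATED = re.compile(r"^$")  # empty AS path = originated in our AS
EXTERNAL_TAG = "65000:999"              # hypothetical "external" community


def export_allowed(route: dict) -> bool:
    """Export policy: deny anything tagged external, then permit only local routes."""
    if EXTERNAL_TAG in route["communities"]:
        return False
    return bool(LOCALLY_ORIGINATED.match(route["as_path"]))


ours = {"prefix": "203.0.113.0/24", "as_path": "", "communities": []}
learnt = {"prefix": "198.51.100.0/24", "as_path": "64500 64501",
          "communities": [EXTERNAL_TAG]}

print(export_allowed(ours))    # True
print(export_allowed(learnt))  # False
```

Either check on its own would stop the learnt route here; running both is just belt and braces in case a route slips through untagged.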
9 – Peer groups
You can use peer groups to define a default set of actions/configuration.
This can help set default policies or filtering for all your peering sessions.
A good way of doing this could be to set up a group per ASN you peer with, so your defaults for that ASN are consistent.
Anything configured per peer overrides the peer group.
You could specify an import policy for the group setting LP 200 and then, on a specific session within the group that is currently seeing congestion, set a different import policy to tweak the LP down for select prefixes to 190, for example, hence making this a less preferred path!
10 – Routing to the internet
You have 3 general options for routing to the internet: use default routes to your ISP, use BGP, or use a mix of the two.
Using a default route is excellent if you have just one exit point and only one ISP, as all traffic will have to flow that way anyway.
Single homed
In this case the best way is a static default route.
You can set up BGP with the ISP and advertise the default south into your network while also advertising your local prefix north to the ISP, but this would gain you no benefit over just having a single default north pointing to the ISP.
You would, however, want a discard route for the private subnet ranges to ensure any traffic that won’t be routed over the internet is discarded. Otherwise, traffic that can’t find a route internally will follow the default out of the internet link, using up capacity on a possibly expensive link.
Dual-homed (Two links one ISP)
Your actions here depend on your need to alter outbound/inbound traffic levels. If not then static routes work well to load balance or failover depending on the use case.
Two default routes facing the two links work well, as you can either prefer one or load balance over the two. Without BGP this can be done with the redistribution metrics in a normal IGP (OSPF, EIGRP); with BGP it can be done with local pref.
However if you are to start working with a large BGP topology and two exit points, you might need to think about IBGP
ISP ROUTES
You have 3 ways of receiving routes from the ISP.
Default route – low memory requirements but can’t do any traffic engineering
Partial routes for ISP routes/customers – A good compromise between the full updates and default. Allows you to prefer a route via ISP 1 rather than ISP 2 as ISP 1 might have a direct connection to the destination prefix
Full BGP table – This will give you the best context past the local customers. However, this can cause issues with memory, as you will now have a full internet table for each peering connection to your ISP, and that can quickly use up memory on low performance routers
11 – BGP messages
Header
Marker – 16 octet field. If there is no authentication, or if the message is an open message, this will be all ones (RFC 4271 now requires all ones in every message). Otherwise the values were used for authentication and could detect loss of synchronization.
Length – Length of the message including the header, in octets
Type – Type of the message, 1- open, 2- update, 3- Notification, 4- keep alive
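To show how small the header really is, here is a Python sketch that builds and parses it with the standard `struct` module. The function names are my own, and this only handles the 19-byte header (16-octet marker, 2-octet length, 1-octet type), with the marker set to all ones as RFC 4271 requires.

```python
import struct

MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
HEADER_FORMAT = "!16sHB"  # network byte order: marker, length, type
HEADER_LEN = struct.calcsize(HEADER_FORMAT)  # 19 octets


def build_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prefix a message body with a BGP header."""
    marker = b"\xff" * 16
    return struct.pack(HEADER_FORMAT, marker, HEADER_LEN + len(body), msg_type) + body


def parse_header(data: bytes):
    """Return (marker, total length, type name) from the first 19 octets."""
    marker, length, msg_type = struct.unpack(HEADER_FORMAT, data[:HEADER_LEN])
    return marker, length, MESSAGE_TYPES.get(msg_type, "UNKNOWN")


keepalive = build_message(4)  # a keepalive is just the header
print(len(keepalive))               # 19
print(parse_header(keepalive)[1:])  # (19, 'KEEPALIVE')
```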
Open
Used to set up the initial session once TCP has formed.
Version – Defines the version of BGP used.
Local AS – Defines the local AS of the sender.
Hold time – Maximum number of seconds allowed between keepalives/updates before tearing down the session. Must be either 0 or at least 3
BGP Identifier – RID for BGP
Optional parameters length – Total length of the optional parameters field, a value of 0 means no optional parameters are present
Optional parameters
Update Message
Used to transfer details of the routing information between peers.
Can advertise new routes or withdraw routes
Advertises a number of routes/prefixes/NLRI that all share identical path attributes.
Withdrawn routes length – Value of 0 means no routes were withdrawn, otherwise gives the length in octets of the field “withdrawn routes”
Withdrawn routes – For each prefix you have the following
Length – Used to identify the number of significant bits in the prefix field
Prefix – IP address of the subnet, only includes the significant bits. 20.0.0.0/8 would only include 20.
Total PA length – Length of the PA field in bytes, if 0 indicates nothing is being advertised.
Path Attributes – Describes the path attributes.
Attribute type – Type of attribute like AS path or LP
Attribute length – Length of the value in bytes
Attribute – Actual attribute.
Network Layer reachability information – Lists the routes/prefixes being advertised.
Length – Length of the prefix field that is significant
Prefix – Specific prefix being advertised that matches the path attributes
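The length/prefix pairs used for both withdrawn routes and NLRI are easy to get a feel for in code. Here is a small IPv4-only Python sketch (the function names are mine): one octet for the length in bits, followed by only as many octets of the prefix as are needed to hold those bits.

```python
from ipaddress import IPv4Address, ip_network


def encode_prefix(prefix: str) -> bytes:
    """Encode 'a.b.c.d/len' as <length in bits><significant octets only>."""
    net = ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8  # round up to whole octets
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def decode_prefix(data: bytes) -> str:
    """Reverse of encode_prefix: pad the trimmed octets back out to 4."""
    prefixlen = data[0]
    nbytes = (prefixlen + 7) // 8
    padded = data[1:1 + nbytes] + bytes(4 - nbytes)
    return f"{IPv4Address(padded)}/{prefixlen}"


print(encode_prefix("20.0.0.0/8").hex())   # 0814 -> length 8, then just "20"
print(encode_prefix("10.1.1.0/24").hex())  # 180a0101
print(decode_prefix(b"\x08\x14"))          # 20.0.0.0/8
```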
Keep alive
Only a header. Used for ensuring the session stays active. A hold time of zero means no keepalives should be sent.
Notification message
Used when an error is detected, BGP session is closed straight after transmission. Acts like a remote log implementation.
Error code – Used to identify what type of error has occurred
1 – message header, 2 – Open message error, 3 – update message error, 4 – hold timer expired, 5 – State machine error, 6 – Cease.
Error subcode – Used to identify further where the issue lies. Check https://tools.ietf.org/html/rfc4271#section-4.5 for detail
Data – Provides data on the error
12 – BGP adjacency states
Idle – Nothing has been set up and all connections are being refused. Needs a start event before doing anything
Connect – Starting the TCP session. If TCP comes up, BGP sends an open message and moves to open sent. If it fails, it moves to active, starts a re-try timer and in the meantime waits for the remote side to initiate a connection
Active – Trying to set up an adjacency and is listening for a TCP connection to be set up or for the remote side to re-try. If it stays within active state there is a connectivity issue!
Open sent – Open message has been sent but no returning open message has been received. The local device will be waiting for this to be sent back. Once received, the local peer will check the message for any errors. If everything is good it will move to open confirm and send a keepalive
Open confirm – Open message has been received from the remote peer. At this point, the session is waiting on a keepalive to confirm the peering parameters are correct. If the local peer doesn’t see a keepalive before the negotiated hold timer expires, a notification message will be sent and the session will move to idle.
Established – The session is set up. At this point, update messages can be exchanged.
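The happy path through those states, plus a couple of the failure branches, fits nicely into a lookup table. This Python sketch is a heavy simplification of the real state machine: the event names are invented, and I treat any event not in the table as an error that drops the session back to Idle.

```python
# (current state, event) -> next state. Event names are made up for this sketch.
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_up"): "OpenSent",       # TCP formed, open message sent
    ("Connect", "tcp_failed"): "Active",     # retry timer starts
    ("Active", "tcp_up"): "OpenSent",
    ("OpenSent", "open_received_ok"): "OpenConfirm",
    ("OpenConfirm", "keepalive_received"): "Established",
    ("OpenConfirm", "hold_timer_expired"): "Idle",
}


def step(state: str, event: str) -> str:
    """Assumption: anything unexpected resets the session to Idle."""
    return TRANSITIONS.get((state, event), "Idle")


def run(events, state="Idle"):
    """Feed a list of events through the table and return the visited states."""
    visited = [state]
    for event in events:
        state = step(state, event)
        visited.append(state)
    return visited


print(run(["start", "tcp_up", "open_received_ok", "keepalive_received"]))
# ['Idle', 'Connect', 'OpenSent', 'OpenConfirm', 'Established']
```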
AD/Preference note
If you have a BGP route but for some reason also have the same prefix and length within your IGP, your local device may follow the route from the IGP rather than BGP, depending on the administrative distance/preference list. On Junos, for example, BGP has a preference of 170 and loses to OSPF at 10; on Cisco, IBGP (200) loses to OSPF (110) but EBGP (20) wins.
Any BGP routes that lose out this way are listed as RIB failures.
13 – Clashing ASN’s
Clashing ASNs result in issues where prefixes advertised out by one AS 12 would not be accepted at the other AS 12. If AS 12 was a large internet edge ISP this could cause huge reachability issues. This is obviously why configuration usually requires you to explicitly set the remote side’s ASN.
Private ASNs in the range 64512 to 65534 can be used with no issues, but these can only be used at local scope within a private network.
14 – TCP Establishment
Client | | | Server | | |
Start State | Action | Move to State | Start State | Action | Move to State |
Closed | Nothing can happen as the remote side will reject it. | - | Closed | Performs a passive open and readies itself for the SYN | Listen |
Closed | Performs an active open and sends a SYN message | Syn Sent | Listen | Waits for the client | - |
Syn Sent | Waits for an ACK for its SYN and a SYN from the server | - | Listen | Receives the SYN and, if it can, accepts and replies with a SYN + ACK | Syn Received |
Syn Sent | The connection to the server is final once the ACK is received. The client will send an ACK for the server’s SYN | Established | Syn Received | Waits for the ACK to the SYN it previously sent | - |
Established | Waiting for the server to complete the connection | - | Syn Received | Receives the ACK to its SYN and this completes the connection | Established |
Seq and Ack Numbers
Seq/ack numbers are used to identify how many bytes have been sent.
The initial sequence number is randomly picked when the host initiates the connection, from 0 to around 4.3 billion.
At initialization the ack will be 0.
A TCP SYN or FIN flag each uses up one sequence number, so it will trigger an increase in the SEQ number, along with the ACK on the other side once it has been received correctly.
Once 700 bytes of traffic has been sent, the returning packet after that will ack the 700 bytes by increasing the ack by 700.
Once data is sent the sending side will increase the SEQ by the number of bytes sent.
Locally on one side, the SEQ lets you identify how much you have sent, and the ack coming back tells you how much of it has been acknowledged.
With your own ack you identify how much you have received.
If re-transmission is needed, when these values are sent to the remote side they will identify if something is missing.
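Here is a toy Python walk-through of those numbers for a handshake followed by 700 bytes of data. It is not a real TCP stack: there is no loss, no windowing and no re-transmission, the initial sequence numbers are fixed at 1000 and 5000 instead of random so the output is easy to follow, and the class and function names are my own.

```python
class Endpoint:
    def __init__(self, initial_seq: int):
        self.seq = initial_seq  # sequence number of the next byte we send
        self.ack = 0            # next byte we expect from the other side


def send(sender, receiver, payload_len=0, syn=False, fin=False):
    """Send one segment; SYN and FIN each use up one sequence number."""
    consumed = payload_len + (1 if syn else 0) + (1 if fin else 0)
    segment = {"seq": sender.seq, "ack": sender.ack, "len": payload_len}
    sender.seq = (sender.seq + consumed) % 2**32           # numbers wrap at 2^32
    receiver.ack = (segment["seq"] + consumed) % 2**32
    return segment


client, server = Endpoint(1000), Endpoint(5000)

print(send(client, server, syn=True))         # {'seq': 1000, 'ack': 0, 'len': 0}
print(send(server, client, syn=True))         # {'seq': 5000, 'ack': 1001, 'len': 0}
print(send(client, server))                   # {'seq': 1001, 'ack': 5001, 'len': 0}
print(send(client, server, payload_len=700))  # {'seq': 1001, 'ack': 5001, 'len': 700}
print(send(server, client))                   # {'seq': 5001, 'ack': 1701, 'len': 0}
```

The last line is the server acknowledging the data: its ack has gone from 1001 to 1701, which is the 700 bytes the client sent.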