A Shallow Dive into WebRTC

Ever wondered how video streaming technology has evolved over the past few years? Many major online media applications, such as Zoom, Discord and Google Meet, use WebRTC (at least in part) as the backbone that lets users on the web interact smoothly through real-time audio and video. In this blog we will look at the structure of WebRTC and how it works, in semi-technical language.

WebRTC (Web Real-Time Communication) is a set of protocols and browser APIs that simplifies real-time communication over the internet. It supports audio, video and arbitrary application data, allowing developers to build robust communication channels for users. It is peer-to-peer, meaning it sets up a direct two-way communication channel between two peers.

Before getting into WebRTC, let us first review the basics. There are two widely used transport protocols:

TCP: The Transmission Control Protocol is a connection-oriented transport protocol. It ensures reliability by retransmitting packets that are lost on the network and by delivering data to the application in order.
UDP: The User Datagram Protocol is a connectionless transport protocol. No connection is set up and delivery is not tracked, so packets that are lost are not retransmitted.

Without deviating further from the topic, you can read more about these protocols over here. The main advantage of UDP over TCP here is lower latency: there is no connection handshake, and a lost packet does not hold up the packets behind it while it is retransmitted. This makes UDP a good fit for video call services, where we don't need the "reliability" part so badly. For example, you might have noticed that a call sometimes lags or a few words get skipped. In such cases, instead of compromising on speed, one can simply ask the other person to repeat what was said.
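The trade-off between retransmitting and dropping can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the function name, the 20 ms send interval and the 50 ms one-way delay are made up for illustration, and real TCP behaviour (timers, windows, congestion control) is far more involved.

```python
def delivery_times(num_packets, lost, retransmit, interval_ms=20, one_way_ms=50):
    """Return when (in ms) each packet reaches the application, or None if it never does.

    Toy model: packet i is sent at i * interval_ms and normally arrives one_way_ms later.
    A lost packet is either resent after one extra round trip (TCP-like, retransmit=True)
    or simply dropped (UDP-like, retransmit=False).
    """
    times = []
    previous = 0
    for seq in range(num_packets):
        arrival = seq * interval_ms + one_way_ms
        if seq in lost:
            if not retransmit:
                times.append(None)  # dropped; later packets are unaffected
                continue
            arrival += 2 * one_way_ms  # wait one round trip for the resend
        if retransmit:
            # In-order delivery: a packet cannot be handed over before its predecessors.
            arrival = max(arrival, previous)
            previous = arrival
        times.append(arrival)
    return times


tcp_like = delivery_times(5, lost={1}, retransmit=True)
udp_like = delivery_times(5, lost={1}, retransmit=False)
print(tcp_like)  # [50, 170, 170, 170, 170]
print(udp_like)  # [50, None, 90, 110, 130]
```

In the TCP-like run a single lost packet delays everything behind it, which on a call would show up as a freeze. In the UDP-like run one packet is missing but the rest arrive on time.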

Components

STUN/TURN (ICE) Servers

Most devices deal with two kinds of IP address: a public one that is visible on the internet and a private one that is only valid inside the local network. To establish a connection between a pair of systems, their public IP addresses are required. One issue here is that a device typically knows only its private IP and is unaware of its public IP (the IP of its router). To solve this, there are special servers called STUN servers.

The device sends a request through its router to the STUN server, and the server replies with the public IP address and port from which the request appeared to come. This way each peer learns its own public address. When a direct connection still cannot be established (for example behind a restrictive NAT or firewall), a TURN server relays the traffic between the peers instead. The procedure that collects these candidate addresses and picks a working path is called ICE (Interactive Connectivity Establishment). The process explained above is a generalized sequence of actions (the internal working is more complex and is outside the scope of this blog). To learn more about STUN and TURN servers, you can check this out.
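The request and reply used for this public-address discovery can be sketched at the byte level. The protocol normally used for this step is STUN, and the sketch below assumes the message layout from RFC 5389 (a 20-byte header and an XOR-MAPPED-ADDRESS attribute, IPv4 only). The function names are made up for illustration, and `fake_success_response` is a hypothetical helper that stands in for a real server so the example runs offline.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """Build the 20-byte STUN Binding Request header (no attributes)."""
    tid = transaction_id if transaction_id is not None else os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid


def parse_public_address(response):
    """Return (ip, port) from a Binding Success response, or None if not found."""
    msg_type, msg_len, cookie = struct.unpack("!HHI", response[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    pos, end = 20, 20 + msg_len
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", response[pos:pos + 4])
        value = response[pos + 4:pos + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and attr_len >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            port = xport ^ (MAGIC_COOKIE >> 16)  # port is XORed with the cookie's top 16 bits
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def fake_success_response(ip, port, transaction_id):
    """Hypothetical helper: the reply a server would send, built locally for testing."""
    xport = port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    attr = struct.pack("!HHBBHI", XOR_MAPPED_ADDRESS, 8, 0, 0x01, xport, xaddr)
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


request = build_binding_request()
reply = fake_success_response("203.0.113.7", 54321, request[8:20])
print(parse_public_address(reply))  # ('203.0.113.7', 54321)
```

In a real setup the request would be sent over UDP to a STUN server and the reply read from the same socket. That part is left out here so the sketch does not depend on network access.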

Signaling Servers

The last step is to establish a connection by exchanging this address information. Here comes the second component of the system, the signaling server. This is a centralized server responsible for passing each party's session description to the other peer. The session description contains the information needed to establish a secure connection over UDP, such as supported media formats and network addresses.

Note that this server only helps the peers find each other. Once the session descriptions have been exchanged, the signaling server is no longer needed and a direct two-way communication link is established between the peers. From here on, the secured UDP transmission takes place without a server in between. This was an overview of how signaling servers work. To learn more, you can visit here.
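The exchange described above can be sketched with a tiny in-memory relay. Everything here is hypothetical: the `SignalingServer` class, its methods and the placeholder payloads are made up for illustration, and WebRTC itself does not prescribe how signaling is implemented. The message names "offer" and "answer" follow the usual WebRTC terminology for the two session descriptions.

```python
class SignalingServer:
    """Hypothetical in-memory relay: it only passes messages along, never media."""

    def __init__(self):
        self.inboxes = {}

    def join(self, peer_id):
        self.inboxes[peer_id] = []

    def send(self, sender, recipient, kind, payload):
        self.inboxes[recipient].append({"from": sender, "type": kind, "payload": payload})

    def receive(self, peer_id):
        # Hand over all pending messages and empty the inbox.
        messages, self.inboxes[peer_id] = self.inboxes[peer_id], []
        return messages


server = SignalingServer()
server.join("alice")
server.join("bob")

# 1. Alice describes her session and sends it as an offer.
server.send("alice", "bob", "offer", "<alice session description>")
# 2. Bob reads the offer and replies with his own description as an answer.
offer = server.receive("bob")[0]
server.send("bob", offer["from"], "answer", "<bob session description>")
# 3. Alice reads the answer; both sides now hold each other's description.
answer = server.receive("alice")[0]
print(offer["type"], answer["type"])  # offer answer
```

After step 3 the relay has nothing left to do, which matches the point that the signaling server drops out once the descriptions have been exchanged.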

Communication Network Topology

Mesh Topology

By now it must have become apparent that this architecture has one major drawback. The connection is strictly peer-to-peer (P2P), which means that a single channel can't host more than two nodes (users). One solution for hosting multiple users in a single network is to use a mesh topology.

Mesh networks are complete graphs (each node is connected to every other node). This means that if there are N users, there are N(N-1)/2 unique channels in total, and since each channel carries media in both directions, there are N(N-1) one-way connections. So imagine that there are 10 people in a meeting. This would mean 90 connections. Now, just to host an 11th person, 20 new connections have to be established. This is not at all resource-efficient, and the total number of connections grows quadratically with the number of users. Mesh networks are secure but are only viable for a small number of users, so they aren't scalable, which is a prime requirement for real-world applications.
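The arithmetic above is easy to check with a few lines of code. The function names are made up for illustration.

```python
def mesh_channels(n):
    """Unique two-way channels in a full mesh of n users: n(n-1)/2."""
    return n * (n - 1) // 2


def mesh_connections(n):
    """One-way connections in a full mesh of n users: n(n-1)."""
    return n * (n - 1)


print(mesh_channels(10))                            # 45
print(mesh_connections(10))                         # 90
print(mesh_connections(11) - mesh_connections(10))  # 20
```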

SFU Topology

The topology that is widely used (in most large WebRTC applications) is the SFU topology, where SFU stands for Selective Forwarding Unit. This architecture uses a server as a central entity to carry the communication, but the server is treated as just another peer, which makes this topology a pseudo-serverless network.

This central server runs software that can speak WebRTC. Initially, every user establishes a peer-to-peer connection with the server. Then, continuously (every few milliseconds, depending on the setup), each user sends their video frames to the server. The server forwards each user's stream to the other participants, choosing which streams, and at what quality, each receiver gets. It does not combine them: a server that consolidates all the frames into a single image frame and sends that back to everyone is called an MCU (Multipoint Control Unit). By playing the forwarded frames in sequence, one sees a real-time video feed of the other users. Note that the number of connections is much lower than in a mesh network (it grows linearly, one connection per user). To understand more about SFU networks, proceed here.
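The difference in scaling can be made concrete with a small comparison. This is a sketch with made-up function names. It counts two-way links, and it assumes the central server forwards each incoming stream unchanged to every other participant, which is what the name Selective Forwarding Unit refers to.

```python
def mesh_links(n):
    """Two-way links in a full mesh: every pair of users is connected."""
    return n * (n - 1) // 2


def sfu_links(n):
    """Two-way links with a central server: one per user."""
    return n


def forwarding_plan(participants):
    """For each receiver, list the senders whose streams the server forwards to it."""
    return {r: [s for s in participants if s != r] for r in participants}


for n in (3, 10, 50):
    print(n, mesh_links(n), sfu_links(n))
# 3 3 3
# 10 45 10
# 50 1225 50

print(forwarding_plan(["a", "b", "c"]))
# {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b']}
```

Each user uploads a single stream to the server regardless of the meeting size, whereas in a mesh each user would upload one copy per other participant.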

Conclusion

If you have made it here, you should have a good grasp of the basics of WebRTC and its architecture. I hope this blog was insightful and that you learnt some new concepts along the way. Feel free to comment about any errors or areas for improvement. Finally, I would like to give a huge shout-out to all the software developers and students out there who build such solutions and make them open source for everyone.