Analyse your conversation history on Telegram programmatically
12 min read · Oct 24, 2017
Telegram is an instant messaging service just like WhatsApp, Facebook Messenger and WeChat. It has gained popularity in recent years for various reasons: its non-profit nature, cross-platform support, promises of security¹, and its open APIs.
In this post, we’ll use Telethon, a Python client library for the Telegram API, to count the number of messages in each of our Telegram chats.
The more well-known of Telegram’s APIs is its Bot API, an HTTP-based API for developers to interact with the bot platform. The Bot API allows developers to control Telegram bots, for example receiving messages and replying to other users.
Besides the Bot API, there’s also the Telegram API itself. This is the API used by Telegram apps for all your actions on Telegram. To name a few: viewing your chats, sending and receiving messages, changing your display picture or creating new groups. Through the Telegram API, you can programmatically do anything you can do in a Telegram app.
The Telegram API is a lot more complicated than the Bot API. You can access the Bot API through HTTP requests with standard JSON, form or query string payloads, while the Telegram API uses its own custom payload format and encryption protocol.
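To make the contrast concrete, here is a rough sketch of what a Bot API call boils down to: a plain HTTPS URL of the form https://api.telegram.org/bot<token>/<method>, with parameters in the query string. The helper name and the token below are placeholders I made up, and the snippet only builds the URL without sending anything.

```python
from urllib.parse import urlencode

BOT_API_BASE = "https://api.telegram.org"


def bot_api_url(token, method, params=None):
    """Build the URL for a Bot API call; parameters go in the query string."""
    url = "{}/bot{}/{}".format(BOT_API_BASE, token, method)
    if params:
        url += "?" + urlencode(params)
    return url


# Placeholder token, not a real one.
url = bot_api_url("123456:PLACEHOLDER", "sendMessage",
                  {"chat_id": 42, "text": "hello"})
print(url)
```

There is no equivalent one-liner for the Telegram API, which is why we lean on a client library for the rest of this post.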
MTProto is the custom encryption scheme which backs Telegram’s promises of security. It is an application layer protocol which writes directly to an underlying transport stream such as TCP or UDP, and also HTTP. Fortunately, we don’t need to concern ourselves with it directly when using a client library. On the other hand, we do need to understand the payload format in order to make API calls.
Type Language
The Telegram API is RPC-based, so interacting with the API involves sending a payload representing a function invocation and receiving a result. For example, reading the contents of a conversation involves calling the messages.getHistory function with the necessary parameters and receiving a messages.Messages in return.
Type Language, or TL, is used to represent types and functions used by the API. A TL-Schema is a collection of available types and functions. In MTProto, TL constructs are serialised into binary form before being embedded as the payload of MTProto messages. However, we can leave this to the client library we will be using.
An example of a TL-Schema (types are declared first, followed by functions with a separator in between):
auth.sentCode#efed51d9 phone_registered:Bool phone_code_hash:string send_call_timeout:int is_password:Bool = auth.SentCode;
auth.sentAppCode#e325edcf phone_registered:Bool phone_code_hash:string send_call_timeout:int is_password:Bool = auth.SentCode;
---functions---
auth.sendCode#768d5f4d phone_number:string sms_type:int api_id:int api_hash:string lang_code:string = auth.SentCode;
A TL function invocation and result using functions and types from the above TL-Schema, and equivalent binary representation (from the official documentation):
(auth.sendCode "79991234567" 1 32 "test-hash" "en")
=
(auth.sentCode
  phone_registered:(boolFalse)
  phone_code_hash:"2dc02d2cda9e615c84"
)

d16ff372 3939370b 33323139 37363534 00000001 00000020 73657409 61682d74 00006873 e77e812d
=
2215bcbd bc799737 63643212 32643230 39616463 35313665 00343863 e12b7901
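To see where some of those hex words come from, here is a minimal sketch of how TL serialises its two simplest types, assuming the rules from the MTProto serialisation documentation: an int is four little-endian bytes, and a string is a length prefix followed by its bytes, zero-padded to a multiple of four bytes. The function names are my own and are not part of any library.

```python
import struct


def tl_int(value):
    """Serialise a TL int as 4 bytes, little-endian."""
    return struct.pack("<i", value)


def tl_string(text):
    """Serialise a TL string: length prefix, bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    if len(data) < 254:
        out = bytes([len(data)]) + data
    else:
        # Longer strings use a 0xfe marker followed by a 3-byte length.
        out = b"\xfe" + len(data).to_bytes(3, "little") + data
    padding = -len(out) % 4
    return out + b"\x00" * padding


def as_words(payload):
    """Render bytes as 32-bit little-endian words in hex, like the example above."""
    return ["{:08x}".format(w) for (w,) in struct.iter_unpack("<I", payload)]


print(as_words(tl_string("79991234567")))
print(as_words(tl_int(1) + tl_int(32)))
print(as_words(tl_string("test-hash")))
```

The three printed lists line up with the second to ninth words of the first hex line above: the phone number, the two ints and the api_hash. The first word is the constructor id of the function being called.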
TL-Schema layers
The Telegram API is versioned using TL-Schema layers; each layer has a unique TL-Schema. The Telegram website contains the current TL-Schema and previous layers at https://core.telegram.org/schema.
Or so it seems. It turns out that although the latest TL-Schema layer on the Telegram website is Layer 23, at the time of writing the latest layer is actually already Layer 71. You can find the latest TL-Schema here instead.
Creating a Telegram application
You will need to obtain an api_id and api_hash to interact with the Telegram API. Follow the directions from the official documentation here: https://core.telegram.org/api/obtaining_api_id.
You will have to visit https://my.telegram.org/, log in with your phone number and the confirmation code which will be sent to you on Telegram, and fill in the form under “API Development Tools” with an app title and short name. Afterwards, you can find your api_id and api_hash at the same place.
Alternatively, the same instructions mention that for testing you can use the sample credentials found in the source code of Telegram’s apps. For convenience, I’ll be using the credentials I found in the Telegram Desktop source code on GitHub in the sample code here.
Installing Telethon
We’ll be using Telethon to communicate with the Telegram API. Telethon is a client library for the Telegram API written for Python 3 (which means you will have to use Python 3). It handles all the protocol-specific tasks for us, so we only need to know which types to use and which functions to call.
You can install Telethon with pip:
pip install telethon
Use the pip corresponding to your Python 3 interpreter; this may be pip3 instead. (Random: Recently Ubuntu 17.10 was released, and it uses Python 3 as its default Python installation.)
Creating a client
Before you can start interacting with the Telegram API, you need to create a client object with your api_id and api_hash and authenticate it with your phone number. This is similar to logging in to Telegram on a new device; you can imagine this client as just another Telegram app.
Below is some code to create and authenticate a client object, modified from the Telethon documentation:
from telethon import TelegramClient
from telethon.errors.rpc_errors_401 import SessionPasswordNeededError

# (1) Use your own values here
api_id = 17349
api_hash = '344583e45741c457fe1862106095a5eb'

phone = 'YOUR_NUMBER_HERE'
username = 'username'

# (2) Create the client and connect
client = TelegramClient(username, api_id, api_hash)
client.connect()

# Ensure you're authorized
if not client.is_user_authorized():
    client.send_code_request(phone)
    try:
        client.sign_in(phone, input('Enter the code: '))
    except SessionPasswordNeededError:
        # Two-Step Verification is enabled, so a password is needed as well
        client.sign_in(password=input('Password: '))

me = client.get_me()
print(me)
As mentioned earlier, the api_id and api_hash above are from the Telegram Desktop source code. Put your own phone number into the phone variable.
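If you use your own api_id and api_hash, you may prefer to keep them (and your phone number) out of the script itself. Here is a minimal sketch of one way to do that, assuming environment variables named TELEGRAM_API_ID, TELEGRAM_API_HASH and TELEGRAM_PHONE. Those names are made up for this example; neither Telegram nor Telethon defines them.

```python
import os


def load_credentials(env=os.environ):
    """Read api_id, api_hash and phone from environment variables.

    The variable names are hypothetical, chosen for this example only.
    """
    api_id = int(env["TELEGRAM_API_ID"])  # api_id is an integer
    api_hash = env["TELEGRAM_API_HASH"]
    phone = env["TELEGRAM_PHONE"]
    return api_id, api_hash, phone


# Demonstration with a plain dict standing in for the real environment.
example_env = {
    "TELEGRAM_API_ID": "12345",
    "TELEGRAM_API_HASH": "0123456789abcdef0123456789abcdef",
    "TELEGRAM_PHONE": "+10000000000",
}
api_id, api_hash, phone = load_credentials(example_env)
print(api_id, phone)
```

Calling load_credentials() with no arguments reads from the real environment and raises a KeyError if a variable is missing.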
Telethon will create a .session file in its working directory to persist the session details, just like how you don’t have to re-authenticate to your Telegram apps every time you close and reopen them. The file name will start with the value of the username variable. You can change it if you like, for example if you want to work with multiple sessions.
If there was no previous session, running this code will cause an authorisation code to be sent to you via Telegram. If you have enabled Two-Step Verification on your Telegram account, you will also need to enter your Telegram password. After you have authenticated once and the .session file is saved, you won’t have to authenticate again until your session expires, even if you run the script again.
If the client was created and authenticated successfully, an object representing yourself should be printed to the console. It will look similar to this (the ellipses … mean that some content was skipped):
User(is_self=True … first_name='Jiayu', last_name=None, username='USERNAME', phone='PHONE_NUMBER' …
Now you can use this client object to start making requests to the Telegram API.
Inspecting the TL-Schema
As mentioned earlier, using the Telegram API involves calling the available functions in the TL-Schema. In this case, we’re interested in the messages.getDialogs function. We’ll also need to take note of the relevant types in the function arguments. Here is a subset of the TL-Schema we’ll be using to make this request:
messages.dialogs#15ba6c40 dialogs:Vector<Dialog> messages:Vector<Message> chats:Vector<Chat> users:Vector<User> = messages.Dialogs;
messages.dialogsSlice#71e094f3 count:int dialogs:Vector<Dialog> messages:Vector<Message> chats:Vector<Chat> users:Vector<User> = messages.Dialogs;
---functions---
messages.getDialogs#191ba9c5 flags:# exclude_pinned:flags.0?true offset_date:int offset_id:int offset_peer:InputPeer limit:int = messages.Dialogs;
It’s not easy to read, but note that the messages.getDialogs function will return a messages.Dialogs, which is an abstract type for either a messages.dialogs or a messages.dialogsSlice object, both of which contain vectors of Dialog, Message, Chat and User.
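If it helps, the structure of a TL declaration can be pulled apart with a few lines of Python. This is only a rough sketch for reading schema lines like the ones above (the parse_tl_line helper is my own and ignores most of TL’s grammar), not a real TL parser.

```python
def parse_tl_line(line):
    """Split one TL declaration into its name, hex id, parameters and result type."""
    left, result = line.rstrip(";").split(" = ")
    head, *fields = left.split()
    name, _, hex_id = head.partition("#")
    # Each field looks like name:type; split on the first colon only.
    params = [tuple(field.split(":", 1)) for field in fields]
    return {"name": name, "id": hex_id, "params": params, "result": result}


line = ("messages.getDialogs#191ba9c5 flags:# exclude_pinned:flags.0?true "
        "offset_date:int offset_id:int offset_peer:InputPeer limit:int "
        "= messages.Dialogs;")
parsed = parse_tl_line(line)
print(parsed["name"], "->", parsed["result"])
for param_name, param_type in parsed["params"]:
    print("  {}: {}".format(param_name, param_type))
```

Laid out this way, it is easier to see that the function takes an offset_peer of type InputPeer and a limit, and returns a messages.Dialogs.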
Using the Telethon documentation
Fortunately, the Telethon documentation gives more details on how we can invoke this function. From https://lonamiwebs.github.io/Telethon/index.html, if you type getdialogs into the search box, you will see a result for a method called GetDialogsRequest (TL-Schema functions are represented by *Request objects in Telethon).
The documentation for GetDialogsRequest states the return type for the method as well as slightly more details about the parameters. The “Copy import to the clipboard” button is particularly useful for when we want to use this object, like right now.
The messages.getDialogs function, as well as the constructor for GetDialogsRequest, takes an offset_peer argument of type InputPeer. From the documentation for GetDialogsRequest, click through the InputPeer link to see a page describing the constructors for this type and the methods taking and returning it.
Since we want to create an InputPeer object to use as an argument for our GetDialogsRequest, we’re interested in the constructors for InputPeer. In this case, we’ll use the InputPeerEmpty constructor. Click through once again to the page for InputPeerEmpty and copy its import path to use it. The InputPeerEmpty constructor takes no arguments.
Making a request
Here is our finished GetDialogsRequest and how to get its result by passing it to our authorised client object:
from telethon.tl.functions.messages import GetDialogsRequest
from telethon.tl.types import InputPeerEmpty

# Build the request; InputPeerEmpty means no offset peer
get_dialogs = GetDialogsRequest(
    offset_date=None,
    offset_id=0,
    offset_peer=InputPeerEmpty(),
    limit=30,
)
# Calling the client with a request sends it and returns the result
dialogs = client(get_dialogs)
print(dialogs)
In my case, I got back a DialogsSlice object containing a list of dialogs, messages, chats and users, as we expected based on the TL-Schema:
DialogsSlice(count=204, dialogs=[…], messages=[…], chats=[…], users=[…])
Receiving a DialogsSlice instead of Dialogs means that not all my dialogs were returned, but the count attribute tells me how many dialogs I have in total. If you have fewer than a certain number of conversations, you may receive a Dialogs object instead, in which case all your dialogs were returned and the number of dialogs you have is just the length of the dialogs vector.
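That distinction fits into a small helper. The sketch below uses SimpleNamespace stand-ins rather than real Telethon objects so that it runs on its own; it assumes only what is described above, namely that a slice has a count attribute and a full result does not. The total_dialogs name is mine.

```python
from types import SimpleNamespace


def total_dialogs(result):
    """Return the total number of dialogs for either kind of messages.Dialogs result."""
    # A slice carries an explicit total in `count`; a full result does not,
    # so its total is simply the number of dialogs returned.
    count = getattr(result, "count", None)
    return count if count is not None else len(result.dialogs)


# Stand-ins for the two result shapes (not real Telethon objects).
full = SimpleNamespace(dialogs=["a", "b", "c"])
partial = SimpleNamespace(count=204, dialogs=["a", "b"])
print(total_dialogs(full), total_dialogs(partial))
```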
Terminology
The terminology used by the Telegram API may be a little confusing sometimes, especially with the lack of information other than the type definitions. What are “dialogs”, “messages”, “chats” and “users”?
dialogs represents the conversations from your conversation history
chats represents the groups and channels corresponding to the conversations in your conversation history
messages contains the last message sent to each conversation, like you see in your list of conversations in your Telegram app
users contains the individual users with whom you have one-on-one chats or who sent the last message to one of your groups
For example, if my chat history was this screenshot I took from the Telegram app in the Play Store:
dialogs would contain the conversations in the screenshot: Old Pirates, Press Room, Monika, Jaina…
chats would contain entries for Old Pirates, Press Room and Meme Factory.
messages would contain the messages “All aboard!” from Old Pirates, “Wow, nice mention!” from Press Room, a message representing a sent photo to Monika, a message representing Jaina’s reply and so on.
users would contain an entry for Ashley, since she sent the last message to Press Room; entries for Monika, Jaina and Kate; and an entry for Winston, since he sent the last message to Meme Factory.
(I haven’t worked with secret chats through the Telegram API yet so I’m not sure how they are handled.)
Counting messages
Our objective is to count the number of messages in each conversation. To get the number of messages in a conversation, we can use the messages.getHistory function from the TL-Schema:
messages.getHistory#afa92846 peer:InputPeer offset_id:int offset_date:date add_offset:int limit:int max_id:int min_id:int = messages.Messages
Following a similar process as previously with messages.getDialogs, we can work out how to call this with Telethon using a GetHistoryRequest. This will return either a Messages or a MessagesSlice object. A MessagesSlice has a count attribute telling us how many messages there are in a conversation, while a Messages contains all the messages in a conversation, so we can just count the messages it contains.
However, we will first have to construct the right InputPeer for our GetHistoryRequest. This time, we can’t use InputPeerEmpty since we want to retrieve the message history for a specific conversation. Instead, we have to use either the InputPeerUser, InputPeerChat or InputPeerChannel constructor depending on the nature of the conversation.
Manipulating the response data
In order to count the number of messages in each of our conversations, we will have to make a GetHistoryRequest for each conversation with the appropriate InputPeer.
The relevant InputPeer constructors take an id and, except for InputPeerChat, an access_hash. Depending on whether the conversation is a one-on-one chat, group or channel, these values are found in different places in the GetDialogsRequest response:
dialogs: a list of the conversations we want to count the messages in. Each entry contains a peer value with the type and id of the peer corresponding to that conversation, but not the access_hash.
chats: contains the id, access_hash and titles for our groups and channels.
users: contains the id, access_hash and first name for our individual chats.
In pseudocode, we have:
let counts be a mapping from conversations to message counts

for each dialog in dialogs:
    if dialog.peer is a channel:
        channel = corresponding object in chats
        name = channel.title
        id = channel.id
        access_hash = channel.access_hash
        peer = InputPeerChannel(id, access_hash)
    else if dialog.peer is a group:
        group = corresponding object in chats
        name = group.title
        id = group.id
        peer = InputPeerChat(id)
    else if dialog.peer is a user:
        user = corresponding object in users
        name = user.first_name
        id = user.id
        access_hash = user.access_hash
        peer = InputPeerUser(id, access_hash)

    history = message history for peer
    count = number of messages in history

    counts[name] = count
Converting to Python code (note that dialogs, chats and users above are members of the result of our GetDialogsRequest, which is also called dialogs):
from telethon.tl.functions.messages import GetHistoryRequest
from telethon.tl.types import (
    InputPeerChannel, InputPeerChat, InputPeerUser,
    PeerChannel, PeerChat, PeerUser,
)
from telethon.tl.types.messages import Messages

counts = {}

# create dictionaries of ids to users and chats
users = {}
chats = {}

for u in dialogs.users:
    users[u.id] = u

for c in dialogs.chats:
    chats[c.id] = c

for d in dialogs.dialogs:
    peer = d.peer
    if isinstance(peer, PeerChannel):
        id = peer.channel_id
        channel = chats[id]
        access_hash = channel.access_hash
        name = channel.title
        input_peer = InputPeerChannel(id, access_hash)
    elif isinstance(peer, PeerChat):
        id = peer.chat_id
        group = chats[id]
        name = group.title
        input_peer = InputPeerChat(id)
    elif isinstance(peer, PeerUser):
        id = peer.user_id
        user = users[id]
        access_hash = user.access_hash
        name = user.first_name
        input_peer = InputPeerUser(id, access_hash)
    else:
        # skip dialogs with an unexpected peer type
        continue

    # limit=1 is enough, since we only want the total count
    get_history = GetHistoryRequest(
        peer=input_peer,
        offset_id=0,
        offset_date=None,
        add_offset=0,
        limit=1,
        max_id=0,
        min_id=0,
    )

    history = client(get_history)
    if isinstance(history, Messages):
        # everything was returned, so count the messages directly
        count = len(history.messages)
    else:
        # a slice reports the total in its count attribute
        count = history.count

    counts[name] = count

print(counts)
Our counts object is a dictionary of chat names to message counts. We can sort and pretty print it to see our top conversations:
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for name, count in sorted_counts:
    print('{}: {}'.format(name, count))
Example output:
Group chat 1: 10000
Group chat 2: 3003
Channel 1: 2000
Chat 1: 1500
Chat 2: 300
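One caveat with keying counts by display name: two contacts who share a first name, or two groups with the same title, will overwrite each other. A possible workaround, sketched here with made-up data and a hypothetical Conversation record, is to key on the peer type and id instead and keep the name alongside the count. I include the peer type in the key on the assumption that ids are only guaranteed unique within a type.

```python
from collections import namedtuple

# Hypothetical record holding what the counting loop gathers per conversation.
Conversation = namedtuple("Conversation", "kind peer_id name count")


def index_counts(conversations):
    """Map (kind, peer_id) to (name, count) so equal names never collide."""
    return {(c.kind, c.peer_id): (c.name, c.count) for c in conversations}


sample = [
    Conversation("user", 1, "Alex", 1500),
    Conversation("user", 2, "Alex", 300),  # same first name, different person
    Conversation("chat", 1, "Group chat 1", 10000),
]
by_peer = index_counts(sample)
ranked = sorted(by_peer.values(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    print("{}: {}".format(name, count))
```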
Library magic
Telethon has some helper functions to simplify common operations. We could actually have done the above with two of these helper methods, client.get_dialogs() and client.get_message_history(), instead:
from telethon.tl.types import User

_, entities = client.get_dialogs(limit=30)

counts = []
for e in entities:
    # users have a first name, groups and channels have a title
    if isinstance(e, User):
        name = e.first_name
    else:
        name = e.title

    count, _, _ = client.get_message_history(e, limit=1)
    counts.append((name, count))

# sort by message count, largest first
counts.sort(key=lambda x: x[1], reverse=True)
for name, count in counts:
    print('{}: {}'.format(name, count))
However, I felt that it was a better learning experience to call the Telegram API methods directly first, especially since there isn’t a helper method for everything. Nevertheless, there are some things which are much simpler with the helper methods, such as how we authenticated our client in the beginning, or actions such as uploading files which would otherwise be tedious.
The full code for this example can be found as a Gist here: https://gist.github.com/yi-jiayu/7b34260cfbfa6cbc2b4464edd41def42
There’s a lot more you can do with the Telegram API, especially from an analytics standpoint. I started looking into it after thinking about one of my older projects to try to create data visualisations out of exported WhatsApp chat histories: https://github.com/yi-jiayu/chat-analytics.
Using regex to parse the plain text emailed chat history, I could generate a chart similar to the GitHub punch card repository graph showing at what times of the week a chat was most active:
However, using the “Email chat” function to export was quite hackish: you needed to manually export the conversation history for each chat, and the export would be out of date as soon as you received a new message. I didn’t pursue the project much further, but I always wondered what other insights could be pulled from chat histories.
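For reference, the punch card idea itself is simple once you have a timestamp for each message: bucket the messages by day of the week and hour. This is a minimal sketch with made-up timestamps, not the code from that project.

```python
from collections import Counter
from datetime import datetime


def punch_card(timestamps):
    """Count messages per (weekday, hour); Monday is weekday 0."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)


# Made-up timestamps for illustration.
sample = [
    datetime(2017, 10, 23, 9, 15),  # a Monday
    datetime(2017, 10, 23, 9, 50),  # the same Monday, same hour
    datetime(2017, 10, 24, 21, 5),  # a Tuesday
]
card = punch_card(sample)
print(card.most_common(1))
```

Each key in the result is one cell of the punch card, and the count determines how large its dot should be.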
With programmatic access to chat histories, there’s lots more that can be done with Telegram chats. Methods such as messages.search could be exceptionally useful. Perhaps dynamically generating statistics on conversations which peak and die down, or which are consistently active, or finding your favourite emojis or most common n-grams? The sky’s the limit (or the API rate limit, whichever is lower).
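As a taste of the n-gram idea, here is a small sketch that counts the most common word n-grams in a list of message texts. It assumes you have already pulled the texts out of the API responses, and it splits on whitespace only, which is a simplification.

```python
from collections import Counter


def top_ngrams(messages, n=2, k=3):
    """Return the k most common word n-grams across a list of message texts."""
    counts = Counter()
    for text in messages:
        words = text.lower().split()
        # Slide a window of n words across each message.
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)


sample = ["see you later", "See you tomorrow", "ok see you"]
top = top_ngrams(sample, n=2, k=1)
print(top)
```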
Updates
(2017–10–25 09:45 SGT) Modified message counting to skip unexpected dialogs
¹ Personally, I can’t comment on Telegram’s security other than to point out that Telegram conversations are not end-to-end encrypted by default, and to bring up the common refrain that Telegram’s encryption protocol is self-developed and less scrutinised than more established protocols such as the Signal Protocol.