Reversing Google Play and Micro-Protobuf applications
19 September 2012 – 22:11
I recently released a Google Play Unofficial Python API, which aims to give developers a way to query Google’s official Android application store. Such projects already exist, but they are all based on the previous version ("Android Market") and are therefore limited. My goal was to adapt those projects and port them to the latest version of Google Play.
This article first highlights the limitations of existing projects. It then focuses on the official Android client for Google Play and its internals, which are based on a Protobuf variant. Thanks to Androguard and its awesome static analysis features, I show how to automatically recover the .proto file of Google Play, enabling us to generate stubs for querying Google’s servers. Finally, I briefly introduce the unofficial API.
Existing projects
Google Play can be queried in two ways: using the official website or the Android client. The website contains pretty much all the useful information, such as app name and developer name, comments, latest version number and release date, permissions required by the app, statistics, etc. I guess one could build a simple program that queries this website and parses the pages, but it would still have one limitation: you simply cannot download apps. Well, you can, but for this you need an actual compatible phone, and as soon as you perform the install request, the application gets downloaded and installed on your phone. If you then want to retrieve it in order to analyse it, you must plug in your phone and use adb pull. Some people have managed to get Google Play to run within the emulator, but this is still a bit complicated and not straightforward: you need Java and the Android SDK, you have to customize your emulator ROM to embed Google Play, and you have to script everything yourself.
The main project I have been looking at is android-market-api, written in Java. Actually, I am a Python fan, and played much more with its Python equivalent. The goal of those projects is to simulate the network activity generated by the Android client, query Google Play servers, and parse the result. The underlying protocol used by Google Play is based on Google’s Protocol Buffers, aka Protobuf. For those who do not know, this library provides a way to encode messages into compact binary blobs before sending them over the network, and to decode them on the other side. The documentation contains plenty of details on the actual encoding format, so I won’t cover it. The only important thing to know about Protobuf is that it is much easier to decode messages if you know their structure. Messages are composed of fields, each one having a tag, a name and a type. When encoded, a message embeds the tag, value and type (only basic types, or a generic "message" type) of each field, but not their names. Therefore, the semantics of each field must be guessed, and that is not always easy.
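To make the point about missing names concrete, here is a minimal sketch (Python 3, standard library only) of a schema-less decoder. It is not taken from any of the projects mentioned above; the helper names read_varint and decode_raw are made up for this illustration, and groups (wire types 3 and 4) are deliberately not handled. It recovers the tag, wire type and raw value of each top-level field, and nothing more:

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_raw(buf):
    """Split an encoded message into (tag, wire_type, raw_value) triples."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:           # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:         # fixed 64-bit, kept as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:         # length-delimited (string, bytes or message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:         # fixed 32-bit, kept as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((tag, wire_type, value))
    return fields


# Hand-encoded sample: field 1 = "k", field 2 = "v" (both length-delimited).
sample = bytes([0x0A, 0x01, 0x6B, 0x12, 0x01, 0x76])
print(decode_raw(sample))  # [(1, 2, b'k'), (2, 2, b'v')]
```

Nothing in the output says what field 1 or field 2 mean, or whether a length-delimited value is a string or a nested message; that information only lives in the .proto file.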
When the Google Play Android client queries Google’s servers and downloads APKs, all network communications are done with Protobuf over HTTP(S). The underlying Protobuf definition used by the unofficial API projects (and based on Android Market) has been published as a .proto file. The unofficial API can forge some of those requests and interpret the results. While playing with these projects, I managed to search for Android apps, but I could not always download them. Indeed, this version of the API requires a numeric "assetId" corresponding to the app you want to download. When trying to get appropriate assetIds using other API methods such as search(), I got non-numeric values, such as v2:com.fankewong.angrybirdsbackup2sd:1:4. This type of value is rejected by Google’s servers when trying to download the app. Too bad…
A first look at Google Play Android client
The weird thing is that the non-numeric assetId problem occurs quite often, but not on all apps. I guess this is because Google updated their API when they switched to Google Play; those projects are using the old version of the API. The only way to have up-to-date information and be able to download any app would then be to analyse the updated Android client, and adapt existing projects.
Here we go! We retrieve com.android.vending-1.apk from an up-to-date Android phone using adb, and we use our favorite Android RE tools. A first look at class names highlights a pretty explicit VendingProtos class, under the com.google.android.vending.remoting.protos package. It contains references to a package named com.google.protobuf.micro, embedded within the app. This package contains classes used to encode and decode messages. It is actually part of a public project, named micro-protobuf, which is a lightweight version of Protobuf. However, the underlying protocol remains the same.
Most of the network traffic is sent over HTTPS. After installing our own CA on the phone and setting up an interception proxy such as Burp, we can sniff the traffic. From a black-box perspective, the exchanged data looks like a binary stream.
All we need now is the .proto file of Google Play to be able to decode it. But how can we get this file? It is unfortunately not embedded within the app, so we have to find another way. A paper and a tool have been published on the subject, but they work only when the studied app or program embeds some kind of metadata, used by the reflection features of Protobuf. This metadata is generally embedded in regular stubs generated with Google’s standard Protobuf compiler, called protoc. However, this is not the case here, since the Protobuf stubs embedded within the Google Play Android client were not compiled with the standard protoc. Micro-protobuf seems to remove this metadata, probably to make protocol reversing harder.
Anyway, is there a way to guess the structure of exchanged messages just by looking at the decompiled Java code of the app? Let’s go back to the VendingProtos class. It contains many nested classes, among which one named AppDataProto:
public static final class AppDataProto extends MessageMicro {
  private int cachedSize = -1;
  private boolean hasKey;
  private boolean hasValue;
  private String key_ = "";
  private String value_ = "";
  [...]
  public AppDataProto mergeFrom(CodedInputStreamMicro paramCodedInputStreamMicro) throws IOException {
    while (true) {
      int i = paramCodedInputStreamMicro.readTag();
      switch (i) {
      default:
        if (parseUnknownField(paramCodedInputStreamMicro, i))
          continue;
      case 0:
        return this;
      case 10:
        String str1 = paramCodedInputStreamMicro.readString();
        AppDataProto localAppDataProto1 = setKey(str1);
        break;
      case 18:
      }
      String str2 = paramCodedInputStreamMicro.readString();
      AppDataProto localAppDataProto2 = setValue(str2);
    }
  }
  public AppDataProto setKey(String paramString) {
    this.hasKey = 1;
    this.key_ = paramString;
    return this;
  }
  public AppDataProto setValue(String paramString) {
    this.hasValue = 1;
    this.value_ = paramString;
    return this;
  }
  [...]
}
We can guess that this class represents a Micro-Protobuf message (the extends MessageMicro part) and that it has two string fields: key and value. Their tags can be extracted from the mergeFrom() method, which decodes incoming binary messages. It is composed of a main loop (while(true)) and a switch statement. Each case, except the first and second ones, corresponds to a field. The value of each case is the binary encoding of the tag and wire type of the field. Everything is in the documentation; to skip the details, the value of each case is equal to (tag << 3) | type. For instance, 10 stands for tag 1, type 2 (string), and 18 means tag 2, type 2 (string). Thus, the actual .proto file looks as follows:
message AppDataProto {
  optional string key = 1;
  optional string value = 2;
}
Actually, type 2 is not exactly "string", but any length-delimited field. It could be a string, a series of bytes, or an embedded message. In the latter case, the code looks like this:
case 26:
  VendingProtos.AppDataProto localAppDataProto = new VendingProtos.AppDataProto();
  paramCodedInputStreamMicro.readMessage(localAppDataProto);
  DataMessageProto localDataMessageProto2 = addAppData(localAppDataProto);
  break;
This field has a tag equal to 3 (26 >> 3) and is a message whose name is AppDataProto. In order to get the structure of this sub-message, we would have to repeat the analysis on the corresponding class, and so on.
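The arithmetic on the case values can be checked with a few lines of Python. This is only an illustrative sketch: split_key and the WIRE_TYPES table are names chosen here, not something found in the app, and the wire type numbering is the one given in the Protobuf encoding documentation.

```python
# Wire type numbers as defined by the Protobuf encoding documentation.
WIRE_TYPES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    3: "start group",
    4: "end group",
    5: "32-bit",
}


def split_key(case_value):
    """Undo (tag << 3) | type: return (tag, wire_type)."""
    return case_value >> 3, case_value & 0x7


# The three case values seen in the decompiled code above.
for case_value in (10, 18, 26):
    tag, wire_type = split_key(case_value)
    print(case_value, "-> tag", tag, "/", WIRE_TYPES[wire_type])
```

All three values carry wire type 2, with tags 1, 2 and 3. The wire type alone cannot tell a string from a nested message, which is why the readString() or readMessage() call in each case is needed as well.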
Automatic analysis
We now have a way of recovering a message structure by analyzing the generated code. All we need now is to automate the process. For this, we can use Androguard, a multi-purpose framework intended to make Android reversing easier. With Androguard, we can simply open an APK, decompile it, parse its Dalvik code, and do all sorts of things. Once it is installed, one can use the provided androlyze tool to interact with the framework, and then write a script to automate everything.
Androguard lets us easily browse the available classes and find those that extend MessageMicro.
In [1]: apk = APK('com.android.vending-1.apk')
In [2]: dvm = DalvikVMFormat(apk.get_dex())
In [3]: vma = uVMAnalysis(dvm)
In [4]: proto_classes = filter(lambda c: "MessageMicro;" in c.get_superclassname(), dvm.get_classes())
In [5]: proto_class_names = map(lambda c: c.get_name(), proto_classes)
Then we extract the mergeFrom() method of each class by filtering the method list generated by dvm.get_methods_class(class_name). The basic block list of each method can be obtained with vma.get_method(m).basic_blocks.gets().
The first basic block is usually the one that implements the switch instruction. In Dalvik, a switch is often represented as a sparse-switch instruction, whose operand is a table composed of a list of values and offsets, called sparse-switch-payload. Here is an example:
invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;->readTag()I
move-result v0
sparse-switch v0, +52 (0xa4)
[...]
sparse-switch-payload sparse-switch-payload 0:9 a:a 12:12 1a:1a 22:22 2a:2a 32:32 3a:3a 42:42 4a:4a
Each (value, offset) tuple corresponds to a case of the switch; if the value matches the compared register, execution continues at the corresponding offset. Once we are able to browse each case of the switch (and its target basic block), we can determine the name of each field and its type by examining the names of the corresponding accessors. For instance, here is a typical basic block:
invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;->readString()Ljava/lang/String;
move-result-object v1
invoke-virtual v2, v1, L[...]AddressProto;->setCity(Ljava/lang/String;)L[...]AddressProto;
goto -25
Each basic block contains two accessor calls: readXXX() and setYYY(). Their goal is to read an incoming series of bytes and initialize one field of the message. XXX corresponds to the type of the field (here, string), and YYY to its name (city).
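As a rough illustration of this step, the sketch below pulls XXX and YYY out of the textual form of a basic block with a regular expression. It works on plain strings rather than on Androguard objects, so it makes no claim about how androproto.py does it; field_from_block is a hypothetical helper, and treating addYYY() like setYYY() is an assumption based on the addAppData() call seen earlier.

```python
import re

# Captures the method name in a Dalvik invoke, e.g. "...;->readString()L...".
CALL_RE = re.compile(r"->(\w+)\(")


def field_from_block(instructions):
    """Return (type, name) recovered from the readXXX() and setYYY() calls."""
    field_type = field_name = None
    for line in instructions:
        match = CALL_RE.search(line)
        if not match:
            continue
        method = match.group(1)
        if method.startswith("read") and field_type is None:
            field_type = method[len("read"):].lower()      # readString -> string
        elif method.startswith(("set", "add")) and field_name is None:
            rest = method[3:]                              # setCity -> City
            field_name = rest[:1].lower() + rest[1:]       # City -> city
    return field_type, field_name


# The basic block shown above, one instruction per string.
block = [
    "invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;->readString()Ljava/lang/String;",
    "move-result-object v1",
    "invoke-virtual v2, v1, L[...]AddressProto;->setCity(Ljava/lang/String;)L[...]AddressProto;",
    "goto -25",
]
print(field_from_block(block))  # ('string', 'city')
```

For an embedded message the recovered type is just "message"; the actual message name has to be taken from the argument type of the setter, which this sketch does not attempt.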
The simplified analysis algorithm looks like:
for each class that extends MessageMicro:
    get its mergeFrom() method
    find the sparse-switch instruction
    get the corresponding sparse-switch-payload
    index all values and offsets in a dict
    for each value, offset:
        tag = value >> 3
        get the target basic block using the offset
        find readXXX() and setYYY() calls
        type = XXX
        name = YYY
        index the tuple (tag, type, name)
Then we only need to format the output in order to generate a parsable .proto file, dealing with nested messages and groups among other things.
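The formatting step can be sketched as follows. This is a simplified illustration, not the code of the released script: format_message is a hypothetical helper, each field is assumed to be either optional or repeated, and nested messages and groups are left out entirely.

```python
def format_message(name, fields):
    """Render recovered (tag, type, name, repeated) tuples as a .proto message."""
    lines = ["message %s {" % name]
    for tag, field_type, field_name, repeated in sorted(fields):
        label = "repeated" if repeated else "optional"
        lines.append("  %s %s %s = %d;" % (label, field_type, field_name, tag))
    lines.append("}")
    return "\n".join(lines)


# Tuples as they might come out of the analysis, in arbitrary order.
fields = [
    (2, "string", "value", False),
    (1, "string", "key", False),
]
print(format_message("AppDataProto", fields))
```

Sorting by tag keeps the output stable regardless of the order in which the switch cases were visited.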
I called the resulting script androproto.py. It is released with the API code; feel free to play with it. It analyzes the target app and prints the recovered Protobuf file. I apologize for the dirty code; since Google Play is the only app using Micro-Protobuf that I’ve analyzed, this script is pretty specific. But it should work with any app using this library, with a few changes. Its output on the Google Play app looks like this:
message AckNotificationResponse {
}
message AndroidAppDeliveryData {
  optional int64 downloadSize = 1;
  optional string signature = 2;
  optional string downloadUrl = 3;
  repeated AppFileMetadata additionalFile = 4;
  repeated HttpCookie downloadAuthCookie = 5;
  optional bool forwardLocked = 6;
  optional int64 refundTimeout = 7;
  optional bool serverInitiated = 8;
  optional int64 postInstallRefundWindowMillis = 9;
  optional bool immediateStartNeeded = 10;
  optional AndroidAppPatchData patchData = 11;
  optional EncryptionParams encryptionParams = 12;
}
message AndroidAppPatchData {
  optional int32 baseVersionCode = 1;
  optional string baseSignature = 2;
  optional string downloadUrl = 3;
  optional int32 patchFormat = 4;
  optional int64 maxPatchSize = 5;
}
[...]
The resulting output is almost usable with protoc. Almost, because there is a duplicate message that you need to remove manually in order to make protoc happy. But after taking care of that detail, you have a working googleplay.proto that you can use to generate C++, Java and Python stubs for querying the Google Play API!
Building Google Play Unofficial Python API
In order to parse Google Play Protobuf messages, we dump each server response intercepted with Burp into a file, and use:
protoc --decode=ResponseWrapper googleplay.proto < dump.bin
ResponseWrapper is the root message type; it can easily be guessed by looking at the message names. Once we have an idea of what the application receives, we can start building our own API. Since we need a valid auth token from Google’s servers, we first need to authenticate. I simply reused the code from android-market-api-py. Once logged in, we need to deal with Protobuf traffic. For most API requests, the Android client does not send Protobuf messages, but only simple GET or POST requests, such as search?c=3&q=%s. In order to parse Protobuf responses, we use the generated Python module (googleplay_pb2):
message = googleplay_pb2.ResponseWrapper.FromString(data)
The resulting message can be browsed like a regular Python object. For some API methods, Google servers also return some prefetch data. A prefetch element contains a URL and raw data. It acts like a cache and can be dealt with pretty easily with a few lines of code.
The final API is pretty straightforward to use. Just follow the README. First make sure to edit googleplay.py and insert your phone’s androidID, then supply your Google credentials in config.py. You can use the provided scripts, which produce CSV output, and prettify it with pp. The output below is abridged.
$ alias pp="column -s ';' -t" # pretty-print CSV
$ python search.py earth | pp
Title                       Package name                       Creator                  Super Dev  Price    Offer Type  Version Code  Size     Rating  Num Downloads
Google Earth                com.google.earth                   Google Inc.              1          Gratuit  1           53            8.6MB    4.46    10 000 000+
Terre HD Free Edition       ru.gonorovsky.kv.livewall.earthhd  Stanislav Gonorovsky     0          Gratuit  1           33            4.7MB    4.47    1 000 000+
Earth Live Wallpaper        com.seb.SLWP                       unixseb                  0          Gratuit  1           60            687.4KB  4.06    5 000 000+
Super Earth Wallpaper Free  com.mx.spacelwpfree                Mariux                   0          Gratuit  1           2             1.8MB    4.41    100 000+
Earth And Legend            com.dvidearts.earthandlegend       DVide Arts Incorporated  0          5,99 €   1           6             6.8MB    4.82    50 000+
Earth 3D                    com.jmsys.earth3d                  Dokon Jang               0          Gratuit  1           12            3.4MB    4.05    500 000+
[...]
$ python categories.py | pp
ID                  Name
GAME                Jeux
NEWS_AND_MAGAZINES  Actualités et magazines
COMICS              BD
LIBRARIES_AND_DEMO  Bibliothèques et démos
COMMUNICATION       Communication
ENTERTAINMENT       Divertissement
EDUCATION           Enseignement
FINANCE             Finance
$ python list.py
Usage: list.py category [subcategory] [nb_results] [offset]
List subcategories and apps within them.
category: To obtain a list of supported catagories, use categories.py
subcategory: You can get a list of all subcategories available, by supplying a valid category
$ python list.py WEATHER | pp
Subcategory ID            Name
apps_topselling_paid      Top payant
apps_topselling_free      Top gratuit
apps_topgrossing          Les plus rentables
apps_topselling_new_paid  Top des nouveautés payantes
apps_topselling_new_free  Top des nouveautés gratuites
$ python list.py WEATHER apps_topselling_free | pp
Title                  Package name                                  Creator         Super Dev  Price    Offer Type  Version Code  Size     Rating  Num Downloads
La chaine météo        com.lachainemeteo.androidapp                  METEO CONSULT   0          Gratuit  1           8             4.6MB    4.38    1 000 000+
Météo-France           fr.meteo                                      Météo-France    0          Gratuit  1           11            2.4MB    3.63    1 000 000+
GO Weather EX          com.gau.go.launcherex.gowidget.weatherwidget  GO Launcher EX  0          Gratuit  1           25            6.5MB    4.40    10 000 000+
Thermomètre (Gratuit)  com.xiaad.android.thermometertrial            Mobiquité       0          Gratuit  1           60            3.6MB    3.78    1 000 000+
$ python permissions.py com.google.android.gm
android.permission.ACCESS_NETWORK_STATE
android.permission.GET_ACCOUNTS
android.permission.MANAGE_ACCOUNTS
android.permission.INTERNET
android.permission.READ_CONTACTS
android.permission.WRITE_CONTACTS
android.permission.READ_SYNC_SETTINGS
android.permission.READ_SYNC_STATS
android.permission.RECEIVE_BOOT_COMPLETED
[...]
$ python download.py com.google.android.gm
Downloading 2.7MB... Done
$ file com.google.android.gm.apk
com.google.android.gm.apk: Zip archive data, at least v2.0 to extract
Conclusion
Although there is no metadata within Micro-Protobuf applications, recovering .proto files is still doable, and it can even be done automatically. The lack of obfuscation is clearly an advantage for an attacker, since all class and method names are easy to understand. Having an unofficial Google Play API is handy for many reasons: computing statistics that aren’t available on the official front-end, looking for plagiarism, automatic malware search / downloading / analysis (Androguard to the rescue)… Feel free to browse the source, fork the project, and improve it!
10 responses to “Reversing Google Play and Micro-Protobuf applications”
Good job! I was hoping someone would provide us with a more recent .proto for Google Play. I’m trying to port this to PHP now – my monologue is available at https://github.com/splitfeed/android-market-api-php/issues/12
By Marko on 5 December 2012
Excellent post… I tried scraping the Google Play website, but recently, after 1 or 2 minutes of scraping, the HTTP responses start returning errors (code != 200) and redirect to a captcha, even when I use proxies and a process that switches between them at random. Do you know a way to bypass this restriction? Perhaps by creating a Google account inside the app process? I already pick user agents at random (my code is highly parallel).
Thanks in advance, Paolo
By paolo on 19 April 2013
Hi, thanks for the feedback. I guess Google implements some kind of throttling in order to prevent (or slow down) crawling. I didn’t try, so I don’t know if there is a way to bypass it. But I would be careful; if you’re always using the same account, Google could track all your requests and decide to block it. You should maybe throttle your requests by sleeping between each method call.
By Emilien Girault on 22 April 2013
Just thought I’d stop by here and mention I’ve written a protobuf decoder for Burp. My extension supports loading a .proto or compiled python proto module (_pb2.py) for automatic deserialization and ability to tamper messages right from Burp.
- http://www.tssci-security.com/archives/2013/05/30/decoding-and-tampering-protobuf-serialized-messages-in-burp/
- https://github.com/mwielgoszewski/burp-protobuf-decoder
By Marcin on 3 June 2013
Hi, I just discovered your code and it is a real pleasure to use. Many thanks for this work!
I was using RealApkLauncher before.
One missing feature is automatic handling of updates for downloaded APKs.
For that, one would either have to store the list of APKs to track between two runs of the program (that I can do), or retrieve the information (name and version) from the APKs already downloaded on disk (that I don’t know how to do). What do you think?
I am thinking of developing a wxPython interface for your API to download APKs the way RealApkLauncher does. I will do that tomorrow if I have time.
By Tuxicoman on 11 August 2013
Hi,
I did not know RealAPKLeecher. Tracking versions is indeed entirely possible, since the available version is returned by the API (call details(), then read the doc.details.appDetails.versionCode field in the response).
If you choose the other option, that is also doable by inspecting the application’s manifest. You can do it by hand, but you have to extract the APK, convert the manifest (binary XML -> plain XML), then parse it. Or you can use Androguard, which does this very well. There is even a method to retrieve the version. The drawback is that the tool relies on quite a few dependencies.
I have not looked at this project for months; I will have to update it one of these days to support the latest Protobuf message formats included in recent versions of Google Play…
By Emilien Girault on 11 August 2013
I have made progress. Here is the interface in progress: http://jesuislibre.net/download/wip.png
For updates, I use Androguard to retrieve the version number; apparently this function has no dependencies other than Python.
I would like to display the versionString instead of the versionCode, since it seems more meaningful to users, but it does not seem to be present in the results returned by your API. Is that expected, or did I miss something?
By Tuxicoman on 12 August 2013
Nice GUI. As for the API, I actually only retrieve what Google’s servers return, and indeed the versionString does not always seem to be present. I have not figured out why.
Feel free to fork the project on GitHub. Several people have already sent me pull requests, but I admit I do not have much time to integrate them. I prefer to let those interested integrate these changes according to their needs.
By Emilien Girault on 12 August 2013
I have published the software if you want to see what it looks like: http://tuxicoman.jesuislibre.net/2013/08/googleplaydownloader-telecharger-les-apk-sans-rien-demander-a-google.html
By Tuxicoman on 19 August 2013
Hi!
I am trying to add a feature to my script: when downloading a game, it sometimes comes with an .obb file that holds all the game data (often close to 1 GB). However, I cannot find a way to retrieve this file (https://github.com/matlink/gplaycli/issues/6). Do you have any information about this?
Thanks!
By Matlink on 22 August 2015