This looks cool. I ran this on some web crawl data I have locally, so: all the files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.<p>It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document", where the `file` utility correctly identified all such examples as "HTML document text".<p>Some woff and woff2 files it identified as "TrueType Font Data"; others as "Unknown binary data (unknown)" with low-confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".<p>I like the idea, but the current implementation can't be relied on IMO, especially not for automation.<p>A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
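For reference, the pipe check being asked for here is a one-liner in most languages. A minimal sketch in Python (not Magika's actual code, just the usual pattern, including the optional NO_COLOR convention):

```python
import os
import sys

# Emit ANSI colour escapes only when stdout is an interactive terminal
# and the user hasn't opted out via the NO_COLOR convention.
USE_COLOR = sys.stdout.isatty() and "NO_COLOR" not in os.environ

def colorize(text: str, code: str = "1;37") -> str:
    return f"\033[{code}m{text}\033[0;39m" if USE_COLOR else text

print(colorize("example.html: HTML document"))
```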
Oh man, this brings me back! Almost 10 years ago I was working on a Rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).<p>I found "magic" that could detect these and submitted a patch at <a href="https://bugs.freedesktop.org/show_bug.cgi?id=78797" rel="nofollow">https://bugs.freedesktop.org/show_bug.cgi?id=78797</a>. My patch got rejected for needing to look at the first 3 KB of the file to figure out the type. They had a hard limit: they wouldn't look past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?
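To illustrate why 256 bytes isn't enough: an xlsx file starts with the same generic `PK\x03\x04` bytes as any other ZIP, so the only way to tell them apart is to look at the archive's member names. A rough sketch of that check using Python's zipfile (the member names follow the usual OOXML layout):

```python
import zipfile

def sniff_ooxml(path: str) -> str:
    # An OOXML file's first bytes are just the generic ZIP magic, so the
    # archive's member names are what actually identify it.
    try:
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return "not a zip"
    if "xl/workbook.xml" in names:
        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    if "word/document.xml" in names:
        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    return "application/zip"
```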
As someone who has worked in a space that has to deal with uploaded files for the last few years, and who maintains a WASM libmagic Node package ( <a href="https://github.com/moshen/wasmagic">https://github.com/moshen/wasmagic</a> ), I have to say I really love seeing new entries into the file type detection space.<p>Though I have to say, when looking at the Node module, I don't understand why they released it.<p>Their docs say it's slow:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/js/README.md#L43-L44">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>It loads the model at runtime:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/js/magika.js#L74-L75">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.<p>Also, as others have mentioned, the model appears to only detect 116 file types:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/docs/supported-content-types-list.md">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>Where libmagic detects... a lot. Over 1600 last time I checked:<p><a href="https://github.com/file/file/tree/4cbd5c8f0851201d203755b76cb66ba991ffd8be/magic/Magdir">https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...</a><p>I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.
I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code. For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files. Not sure what to make of these tests but perhaps they're useful to somebody.
I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which are very regular. They should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well, but at that point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which files were causing the unexpected drop in precision or recall.
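For what it's worth, the deterministic check described above is only a few lines; a sketch (member names per the description above, plus AndroidManifest.xml as an extra APK signal, which is an assumption on my part):

```python
import zipfile

def classify_zip(path: str) -> str:
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    if "classes.dex" in names or "AndroidManifest.xml" in names:
        return "apk"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"
```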
Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver, which hosts
Kaitai Struct’s WebIDE, allowing you to view the file’s own annotated bytes.<p>[0]: <a href="https://www.alchemistowl.org/pocorgtfo/" rel="nofollow">https://www.alchemistowl.org/pocorgtfo/</a><p>[1]: <a href="https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf" rel="nofollow">https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf</a><p>Edit: just tested, and it only identifies the zip layer
I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same magic bytes at a fixed offset. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
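For the formats mentioned, the deterministic check really is just fixed bytes at fixed offsets; a minimal sketch:

```python
def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(b"\x7fELF"):      # ELF magic at offset 0
        return "ELF binary"
    if head[257:262] == b"ustar":        # POSIX (ustar) tar magic at offset 257
        return "tar archive"
    return "unknown"
```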
So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.
As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.
A somewhat surprising and genuinely useful application of the family of techniques.<p>I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.
This feels like old school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life changing than the above, but still a great option to have when I need this.
<i>Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.</i><p>"web browsers"? Odd to see this coming from Google itself. <a href="https://en.wikipedia.org/wiki/Content_sniffing" rel="nofollow">https://en.wikipedia.org/wiki/Content_sniffing</a> was widely criticised for being problematic for security.
To me the obvious use case is to first use the <i>file</i> command but then, when <i>file</i> returns "DATA" (meaning it couldn't guess the file type), call <i>magika</i>.<p>I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around <i>file</i> doing just that when I come back from vacation. I hate it when <i>file</i> cannot do its thing.<p>Put it this way: I use <i>file</i> a lot and I know at times it cannot detect a filetype. But is <i>file</i> often wrong when it does have a match? I don't think so...<p>So in most of the cases I'd have <i>file</i> correctly give me the filetype, very quickly but then in those rare cases where <i>file</i> cannot find anything, I'd then use the slower but apparently more capable <i>magika</i>.
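Something like this sketch would do it (assuming the `magika` CLI is on PATH and that `file -b` prints plain "data" when it gives up; the exact output formats may differ):

```python
import subprocess
import sys

def identify(path: str) -> str:
    # Fast path: plain old file(1).
    out = subprocess.run(["file", "-b", path],
                         capture_output=True, text=True).stdout.strip()
    if out != "data":
        return out
    # Slow path: fall back to Magika only when file(1) gives up.
    fallback = subprocess.run(["magika", path],
                              capture_output=True, text=True).stdout.strip()
    return fallback or out

if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(f"{p}: {identify(p)}")
```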
What are the use-cases for this? I mean, obviously detecting the filetype is useful, but we kinda already have plenty of tools to do that, and I cannot imagine why we need some "smart" way of doing this. If you are not a human, and you are not sure what something is (like an unknown file being uploaded to a server), you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-the-safer-side heuristic, and you wouldn't want to trust <i>that</i> thing to protect you from malicious payloads.
Reminds me of when someone asked (on StackOverflow) how to recognize binaries for different architectures, like x86 or ARM-something or Apple M1 and so on.<p>I suggested using the technique of NCD (Normalized Compression Distance), based on Kolmogorov complexity. Cilibrasi, R. was a great researcher in this area, and I think he worked at Google at some point.<p>Using AI seems to follow the same path: "learn" what represents some specific file type and then compare the unknown file to those references (AI: all the parameters; NCD: compression against a known type).
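The NCD itself is simple to sketch with an off-the-shelf compressor: compare the unknown file against one reference sample per architecture and take the smallest distance (zlib here is just a stand-in for whatever compressor you'd actually use):

```python
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance:
    # NCD(x, y) = (C(x+y) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def closest_reference(unknown: bytes, references: dict) -> str:
    # references maps a label (e.g. "x86", "arm64") to a known sample binary.
    return min(references, key=lambda label: ncd(unknown, references[label]))
```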
I wrote an implementation of libmagic in Racket a few years ago (<a href="https://github.com/jjsimpso/magic">https://github.com/jjsimpso/magic</a>). File type identification is a pretty interesting topic.<p>As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.
I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors going through the effort to get it approved for open source release. It would be great if the model training data were included (or at least documentation about how to reproduce it), but that doesn’t preclude this being useful. Thanks!
I created a demo site for Magika.
<a href="https://9revolution9.com/tools/security/file_scanner/" rel="nofollow">https://9revolution9.com/tools/security/file_scanner/</a>
MIME type detection is a very interesting thing. I wrote the media type detection for McAfee Web Gateway 7.x, and because it was a high-performance proxy, detection speed was a major focus, but also precision, especially for "container" types like MS Office, OLE-based files, etc. The base of it was a simple Lisp-like language that allowed us to write signatures very fast, and everything was combined with very aggressive caching of the data, so we avoided reading data again and again and used internal caches a lot. In tests, the detection was ~10x faster than file, and with the more flexible language we got more file types recognized precisely. There were challenges with some formats, though: OLE-based files have a FAT directory structure at the end of the file, and you needed to walk the tree to find the top-level structure to distinguish an Excel file from an Excel file embedded into Word.<p>Stream detection was also quite a fun task...
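The OLE case is a nice example of why container formats are hard. A rough sketch of the same idea using the third-party olefile package (the implementation described above was custom; the stream names below are the usual ones for Word/Excel/PowerPoint): only top-level streams are considered, so an Excel object embedded under a Word document's ObjectPool doesn't get mistaken for a standalone spreadsheet.

```python
import olefile  # third-party package: pip install olefile

def classify_ole(path: str) -> str:
    ole = olefile.OleFileIO(path)
    try:
        # Only look at top-level streams; embedded objects live deeper
        # (e.g. under ObjectPool/ in a Word document).
        top_level = {entry[0] for entry in ole.listdir() if len(entry) == 1}
    finally:
        ole.close()
    if "WordDocument" in top_level:
        return "MS Word"
    if "Workbook" in top_level or "Book" in top_level:
        return "MS Excel"
    if "PowerPoint Document" in top_level:
        return "MS PowerPoint"
    return "OLE container"
```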
At $job we have been using Apache Tika for years.<p>It works, but occasionally has bugs and weird collisions when working with billions of files.<p>Happy to see new contributions in the space.
I wonder how it performs at detecting C vs C++ vs ObjC vs ObjC++, and for bonus points: the common C/C++ subset (which is an incompatible C fork). Extra bonus points for detecting language version compatibility (e.g. C89 vs C99 vs C11...).<p>Separating C from C++ and ObjC is something the file type detection on GitHub has traditionally had problems with (though it has been getting dramatically better over time); from an "AI-powered" solution trained on the entire internet I would expect better right from the start.<p>The list here doesn't even mention any of those languages except C though:<p><a href="https://github.com/google/magika/blob/main/docs/supported-content-types-list.md">https://github.com/google/magika/blob/main/docs/supported-co...</a>
But will it let you print on Tuesday[1]?<p>1: <a href="https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161/comments/28/+index" rel="nofollow">https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...</a>
My FOSS desktop text editor performs a subset of file type identification using the first 12 bytes, detecting the type quite quickly:<p>* <a href="https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main/java/com/keenwrite/io/MediaTypeSniffer.java" rel="nofollow">https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...</a><p>There's a much larger list of file signatures at:<p>* <a href="https://github.com/veniware/Space-Maker/blob/master/FileSignatures.cs">https://github.com/veniware/Space-Maker/blob/master/FileSign...</a>
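The same idea, sketched in Python for comparison (a handful of well-known signatures, checking only the first few bytes):

```python
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

def sniff_media_type(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(12)
    for magic, media_type in SIGNATURES.items():
        if head.startswith(magic):
            return media_type
    return "application/octet-stream"
```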
Nice, aka perfect timing. I just restored "some" files (40GB) with [1], but the filetype detection of PhotoRec set some wrong file types.<p>Edit:
It would be super helpful if the "suffix" could be added as output so I can move the files to the right directory [2] ;)<p>[1] <a href="https://www.cgsecurity.org/wiki/PhotoRec" rel="nofollow">https://www.cgsecurity.org/wiki/PhotoRec</a>
[2] <a href="https://github.com/google/magika/issues/63">https://github.com/google/magika/issues/63</a>
Assuming that I've not misunderstood, how does this compare to things like TrID [0]? Apart from being open source.<p>[0] <a href="https://mark0.net/soft-trid-e.html" rel="nofollow">https://mark0.net/soft-trid-e.html</a>
I have a question: is something like Magika enough to check whether a file is malicious or not?<p>Example: users can upload a PNG file (and only PNG is accepted).
If Magika detects that the file is a PNG, does this mean the file is clean?
> Magika: AI powered fast and efficient file type identification<p>...of 116 file types, with a small proprietary model, no training code, and no dataset.<p>> We are releasing a paper later this year detailing how the Magika model was trained and its performance on large datasets.<p>And? How do you advance the industry with this Google blog post and source code that is useless without the closed-source model? All I see here is a loud marketing name and loud promises, but barely anything actually useful. A Hooli rooftop characters' side project?
Took a .dxf file and fed it to Magika.
It says with 97% confidence that it must be a PowerShell file. A classic .dwg could be "mscompress" (whatever that is) at 81%, or a GIF. Both couldn't be further from the truth.<p>Common files are categorized successfully, but well, yeah, that's not really an achievement.
Pretty much nothing more than a toy right now.
This is useful for detecting file types of unknown blobs with custom file extensions, when the file command just returns "data". Though it doesn't correctly identify Lua code for some reason; it guesses with low probability that it's either Ruby or JavaScript, or anything but Lua.
If their “Exif Tool” is <a href="https://exiftool.org/" rel="nofollow">https://exiftool.org/</a> (what else could it be?), I don’t understand why they included it in their tests. Also, how does ExifTool recognize Python and HTML files?
I wonder what the output will be on polyglot files like run-anywhere binaries produced by cosmopolitan [1]<p>[1]: <a href="https://justine.lol/cosmopolitan/" rel="nofollow">https://justine.lol/cosmopolitan/</a>
Why is this piece of code being sold as open source, when in reality it just calls into a proprietary ML blob that is tiny and useless, the actual source of the model is closed, and a properly useful large model doesn't exist?
I wonder how big of a deal it is that you'd have to retrain the model to support a new or changed file type? It doesn't seem like the repo contains training code, but I could be missing it...
After reading through all the comments, honestly I still don't get the point of this system. What is the potential practical value, or what are the applications of this model?
Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?
Can we please god stop using AI like it's a meaningful word? This is really interesting technology; it's hamstrung by association with a predatory marketing term.
> <i>So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.<p>This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand.</i><p>Pure nonsense. The rules are accurate, based on the actual formats, and not "heuristics".
Can someone please help me understand why this is useful? The article mentions malware scanning applications, but if I'm sending you a malicious PDF, won't I want to clearly mark it with a .pdf extension so that you open it in your PDF app? Their examples are all very obvious based on file extensions.