This looks cool. I ran this on some web crawl data I have locally, so: all the files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.<p>It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document", where the `file` utility correctly identified all such examples as "HTML document text".<p>Some woff and woff2 files it identified as "TrueType Font Data"; others as "Unknown binary data (unknown)" with low-confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".<p>I like the idea, but the current implementation can't be relied on IMO, especially not for automation.<p>A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
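For reference, the pipe check being asked for here is a one-liner in most languages. A minimal sketch in Python (not Magika's actual code, just the usual pattern, including the optional NO_COLOR convention):

```python
import os
import sys

# Emit ANSI colour escapes only when stdout is an interactive terminal
# and the user hasn't opted out via the NO_COLOR convention.
USE_COLOR = sys.stdout.isatty() and "NO_COLOR" not in os.environ

def colorize(text: str, code: str = "1;37") -> str:
    return f"\033[{code}m{text}\033[0;39m" if USE_COLOR else text

print(colorize("example.html: HTML document"))
```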
Oh man, this brings me back! Almost 10 years ago I was working on a Rails app trying to detect the file type of uploaded spreadsheets (xlsx files were being detected as application/zip, which is technically true but useless).<p>I found "magic" that could detect these and submitted a patch at <a href="https://bugs.freedesktop.org/show_bug.cgi?id=78797" rel="nofollow">https://bugs.freedesktop.org/show_bug.cgi?id=78797</a>. My patch got rejected for needing to look at the first 3 KB of the file to figure out the type. They had a hard limit: they wouldn't look past the first 256 bytes. Now in 2024 we're doing this with deep learning! It'd be cool if Google released some speed performance benchmarks here against the old-fashioned implementations. Obviously it'd be slower, but is it 1000x or 10^6x?
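To illustrate why 256 bytes isn't enough: an xlsx file starts with the same generic `PK\x03\x04` bytes as any other ZIP, so the only way to tell them apart is to look at the archive's member names. A rough sketch of that check using Python's zipfile (the member names follow the usual OOXML layout):

```python
import zipfile

def sniff_ooxml(path: str) -> str:
    # An OOXML file's first bytes are just the generic ZIP magic, so the
    # archive's member names are what actually identify it.
    try:
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return "not a zip"
    if "xl/workbook.xml" in names:
        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    if "word/document.xml" in names:
        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    return "application/zip"
```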
As someone who has worked in a space that has to deal with uploaded files for the last few years, and who maintains a WASM libmagic Node package ( <a href="https://github.com/moshen/wasmagic">https://github.com/moshen/wasmagic</a> ), I have to say I really love seeing new entries into the file type detection space.<p>Though I have to say, when looking at the Node module, I don't understand why they released it.<p>Their docs say it's slow:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/js/README.md#L43-L44">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>It loads the model at runtime:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/js/magika.js#L74-L75">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>They mark it as Experimental in the documentation, but it seems like it was just made for the web demo.<p>Also, as others have mentioned, the model appears to only detect 116 file types:<p><a href="https://github.com/google/magika/blob/120205323e260dad4e58778093fb220aa1991c2b/docs/supported-content-types-list.md">https://github.com/google/magika/blob/120205323e260dad4e5877...</a><p>Where libmagic detects... a lot. Over 1600 last time I checked:<p><a href="https://github.com/file/file/tree/4cbd5c8f0851201d203755b76cb66ba991ffd8be/magic/Magdir">https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...</a><p>I guess I'm confused by this release. Sure, it detected most of my list of sample files, but in a sample set of 4 zip files, it misidentified one.
I ran a quick test on 100 semi-random files I had laying around. Of those, 81 were detected correctly, 6 were detected as the wrong file type, and 12 were detected with an unspecific file type (unknown binary/generic text) when a more specific type existed. In 4 of the unspecific cases, a low-confidence guess was provided, which was wrong in each case. However, almost all of the files which were detected wrong/unspecific are of types not supported by Magika, with one exception of a JSON file containing a lot of JS code as text, which was detected as JS code. For comparison, file 5.45 (the version I happened to have installed) got 83 correct, 6 wrong, and 10 not specific. It detected the weird JSON correctly, but also had its own strange issues, such as detecting a CSV as just "data". The "wrong" here was somewhat skewed by the 4 GLSL shader code files that were in the dataset for some reason, all of which it detected as C code (Magika called them unknown). The other two "wrong" detections were also code formats that it seems it doesn't support. It was also able to output a lot more information about the media files. Not sure what to make of these tests but perhaps they're useful to somebody.
I'm extremely confused by the claim that other tools have worse precision or recall for APK or JAR files, which are very regular. They should be a valid ZIP file with `META-INF/MANIFEST.MF` present (at least), and an APK would need `classes.dex` as well, but at that point there is no other format that can be confused with APK or JAR, I believe. I'd like to see which files were causing the unexpected drop in precision or recall.
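For what it's worth, the deterministic check described above is only a few lines; a sketch (member names per the description above, plus AndroidManifest.xml as an extra APK signal, which is an assumption on my part):

```python
import zipfile

def classify_zip(path: str) -> str:
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    if "classes.dex" in names or "AndroidManifest.xml" in names:
        return "apk"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"
```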
Wonder how this would handle a polyglot[0][1] that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver, which hosts
Kaitai Struct’s WebIDE, allowing you to view the file’s own annotated bytes.<p>[0]: <a href="https://www.alchemistowl.org/pocorgtfo/" rel="nofollow">https://www.alchemistowl.org/pocorgtfo/</a><p>[1]: <a href="https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf" rel="nofollow">https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf</a><p>Edit: just tested, and it only identifies the zip layer
I don't understand why this needs to exist. Isn't file type detection inherently deterministic by nature? A valid tar archive will always have the same magic bytes at a fixed offset. An ELF binary has a universal ELF magic and header. If the magic is bad, then the file is corrupted and not a valid XYZ file. What's the value in throwing "heuristics" and probabilistic inference into a process that is black and white by design?
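For the formats mentioned, the deterministic check really is just fixed bytes at fixed offsets; a minimal sketch:

```python
def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(b"\x7fELF"):      # ELF magic at offset 0
        return "ELF binary"
    if head[257:262] == b"ustar":        # POSIX (ustar) tar magic at offset 257
        return "tar archive"
    return "unknown"
```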
So instead of spending some of their human resources to improve libmagic, they used some of their computing power to create an "open source" neural net, which is technically more accurate than the "error-prone" hand-written rules (ignoring that it supports far fewer filetypes), and which is much less effective in an adversarial context, and they want it to "help other software improve their file identification accuracy," which of course it can't since neural nets aren't introspectable. Thanks guys.
As somebody who's dealt with the ambiguity of attempting to use file signatures in order to identify file type, this seems like a pretty useful library. Especially since it seems to be able to distinguish between different types of text files based on their format/content e.g. CSV, markdown, etc.
A somewhat surprising and genuinely useful application of the family of techniques.<p>I wonder how susceptible it is to adversarial binaries or, hah, prompt-injected binaries.
This feels like old school Google. I like that it's just a static webpage that basically can't be shut down or sunsetted. It reminds me of when Google just made useful stuff and gave it away for free on a webpage, like Translate and Google Books. Obviously less life changing than the above, but still a great option to have when I need this.
<i>Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file.</i><p>"web browsers"? Odd to see this coming from Google itself. <a href="https://en.wikipedia.org/wiki/Content_sniffing" rel="nofollow">https://en.wikipedia.org/wiki/Content_sniffing</a> was widely criticised for being problematic for security.
To me the obvious use case is to first use the <i>file</i> command but then, when <i>file</i> returns "DATA" (meaning it couldn't guess the file type), call <i>magika</i>.<p>I guess I'll be writing a wrapper (only for when using my shell in interactive mode) around <i>file</i> doing just that when I come back from vacation. I hate it when <i>file</i> cannot do its thing.<p>Put it this way: I use <i>file</i> a lot and I know at times it cannot detect a filetype. But is <i>file</i> often wrong when it does have a match? I don't think so...<p>So in most of the cases I'd have <i>file</i> correctly give me the filetype, very quickly but then in those rare cases where <i>file</i> cannot find anything, I'd then use the slower but apparently more capable <i>magika</i>.
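Something like this sketch would do it (assuming the `magika` CLI is on PATH and that `file -b` prints plain "data" when it gives up; the exact output formats may differ):

```python
import subprocess
import sys

def identify(path: str) -> str:
    # Fast path: plain old file(1).
    out = subprocess.run(["file", "-b", path],
                         capture_output=True, text=True).stdout.strip()
    if out != "data":
        return out
    # Slow path: fall back to Magika only when file(1) gives up.
    fallback = subprocess.run(["magika", path],
                              capture_output=True, text=True).stdout.strip()
    return fallback or out

if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(f"{p}: {identify(p)}")
```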
What are the use-cases for this? I mean, obviously detecting the filetype is useful, but we kinda already have plenty of tools to do that, and I cannot imagine why we need some "smart" way of doing this. If you are not a human, and you are not sure what something is (like an unknown file being uploaded to a server), you would be better off just rejecting it completely, right? After all, there's absolutely no way an "AI powered" tool can be more reliable than some dumb, err-on-the-safer-side heuristic, and you wouldn't want to trust <i>that</i> thing to protect you from malicious payloads.
Reminds me of when someone asked (on StackOverflow) how to recognize binaries for different architectures, like x86 or ARM-something or Apple M1 and so on.<p>I suggested using the technique of NCD (Normalized Compression Distance), based on Kolmogorov complexity. Cilibrasi, R. was a great researcher in this area, and I think he worked at Google at some point.<p>Using AI seems to follow the same path: "learn" what represents some specific file type and then compare the unknown file to those references (AI: all the parameters; NCD: compression against a known type).
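The NCD itself is simple to sketch with an off-the-shelf compressor: compare the unknown file against one reference sample per architecture and take the smallest distance (zlib here is just a stand-in for whatever compressor you'd actually use):

```python
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance:
    # NCD(x, y) = (C(x+y) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def closest_reference(unknown: bytes, references: dict) -> str:
    # references maps a label (e.g. "x86", "arm64") to a known sample binary.
    return min(references, key=lambda label: ncd(unknown, references[label]))
```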
I wrote an implementation of libmagic in Racket a few years ago (<a href="https://github.com/jjsimpso/magic">https://github.com/jjsimpso/magic</a>). File type identification is a pretty interesting topic.<p>As others have noted, libmagic detects many more file types than Magika, but I can see Magika being useful for text files in particular, because anything written by humans doesn't have a rigid format.
I just want to say thank you for the release. There are quite a lot of complaints in the comments, but I think this is a useful and worthwhile contribution, and I appreciate the authors going through the effort to get it approved for open source release. It would be great if the model training data were included (or at least documentation about how to reproduce it), but that doesn’t preclude this being useful. Thanks!
I created a demo site for Magika.
<a href="https://9revolution9.com/tools/security/file_scanner/" rel="nofollow">https://9revolution9.com/tools/security/file_scanner/</a>
MIME type detection is a very interesting thing. I wrote the media type detection for McAfee Web Gateway 7.x, and because it was a high-performance proxy, detection speed was a major focus, but also precision, especially for "container" types like MS Office, OLE-based files, etc. The base of it was a simple Lisp-like language that allowed us to write signatures very fast, and everything was combined with very aggressive caching of the data, so we avoided reading data again and again and used internal caches a lot. In tests, the detection was ~10x faster than file, and with the more flexible language we got more file types recognized precisely. There were challenges with some formats, though: OLE-based files have a FAT directory structure at the end of the file, and you needed to walk the tree to find the top-level structure to distinguish an Excel file from an Excel file embedded into Word.<p>Stream detection was also quite a fun task...
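The OLE case is a nice example of why container formats are hard. A rough sketch of the same idea using the third-party olefile package (the implementation described above was custom; the stream names below are the usual ones for Word/Excel/PowerPoint): only top-level streams are considered, so an Excel object embedded under a Word document's ObjectPool doesn't get mistaken for a standalone spreadsheet.

```python
import olefile  # third-party package: pip install olefile

def classify_ole(path: str) -> str:
    ole = olefile.OleFileIO(path)
    try:
        # Only look at top-level streams; embedded objects live deeper
        # (e.g. under ObjectPool/ in a Word document).
        top_level = {entry[0] for entry in ole.listdir() if len(entry) == 1}
    finally:
        ole.close()
    if "WordDocument" in top_level:
        return "MS Word"
    if "Workbook" in top_level or "Book" in top_level:
        return "MS Excel"
    if "PowerPoint Document" in top_level:
        return "MS PowerPoint"
    return "OLE container"
```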
At $job we have been using Apache Tika for years.<p>It works, but occasionally has bugs and weird collisions when working with billions of files.<p>Happy to see new contributions in the space.
I wonder how it performs at detecting C vs C++ vs ObjC vs ObjC++, and for bonus points: the common C/C++ subset (which is an incompatible C fork). Extra bonus points for detecting language version compatibility (e.g. C89 vs C99 vs C11...).<p>Separating C from C++ and ObjC is something the file type detection on GitHub has traditionally had problems with (though it has been getting dramatically better over time); from an "AI-powered" solution trained on the entire internet I would expect better right from the start.<p>The list here doesn't even mention any of those languages except C though:<p><a href="https://github.com/google/magika/blob/main/docs/supported-content-types-list.md">https://github.com/google/magika/blob/main/docs/supported-co...</a>
But will it let you print on Tuesday[1]?<p>1: <a href="https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161/comments/28/+index" rel="nofollow">https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...</a>
My FOSS desktop text editor performs a subset of file type identification using the first 12 bytes, detecting the type quite quickly:<p>* <a href="https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main/java/com/keenwrite/io/MediaTypeSniffer.java" rel="nofollow">https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...</a><p>There's a much larger list of file signatures at:<p>* <a href="https://github.com/veniware/Space-Maker/blob/master/FileSignatures.cs">https://github.com/veniware/Space-Maker/blob/master/FileSign...</a>
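The same idea, sketched in Python for comparison (a handful of well-known signatures, checking only the first few bytes):

```python
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

def sniff_media_type(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(12)
    for magic, media_type in SIGNATURES.items():
        if head.startswith(magic):
            return media_type
    return "application/octet-stream"
```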
Nice, aka perfect timing. I just restored "some" files (40GB) with [1], but the filetype detection of PhotoRec set some wrong file types.<p>Edit:
It would be super helpful if the "suffix" could be added as output so I can move the files to the right directory [2] ;)<p>[1] <a href="https://www.cgsecurity.org/wiki/PhotoRec" rel="nofollow">https://www.cgsecurity.org/wiki/PhotoRec</a>
[2] <a href="https://github.com/google/magika/issues/63">https://github.com/google/magika/issues/63</a>
Assuming that I've not misunderstood, how does this compare to things like TrID [0]? Apart from being open source.<p>[0] <a href="https://mark0.net/soft-trid-e.html" rel="nofollow">https://mark0.net/soft-trid-e.html</a>
I have a question: is something like Magika enough to check whether a file is malicious or not?<p>Example: users can upload a PNG file (and only PNG is accepted).
If Magika detects that the file is a PNG, does this mean the file is clean?
> Magika: AI powered fast and efficient file type identification<p>...of 116 file types, with a small proprietary model, no training code, and no dataset.<p>> We are releasing a paper later this year detailing how the Magika model was trained and its performance on large datasets.<p>And? How do you advance the industry with this Google blog post and source code that is useless without the closed-source model? All I see here is a loud marketing name and loud promises, but barely anything actually useful. A Hooli rooftop characters' side project?
Took a .dxf file and fed it to Magika.
It says with 97% confidence that it must be a PowerShell file. A classic .dwg could be "mscompress" (whatever that is) at 81%, or a GIF. Both couldn't be further from the truth.<p>Common files are categorized successfully, but well, yeah, that's not really an achievement.
Pretty much nothing more than a toy right now.
This is useful for detecting file types of unknown blobs with custom file extensions, when the file command just returns "data". Though it doesn't correctly identify Lua code for some reason; it guesses with low probability that it's either Ruby or JavaScript, or anything but Lua.
If their “Exif Tool” is <a href="https://exiftool.org/" rel="nofollow">https://exiftool.org/</a> (what else could it be?), I don’t understand why they included it in their tests. Also, how does ExifTool recognize Python and HTML files?
I wonder what the output will be on polyglot files like run-anywhere binaries produced by cosmopolitan [1]<p>[1]: <a href="https://justine.lol/cosmopolitan/" rel="nofollow">https://justine.lol/cosmopolitan/</a>
Why is this piece of code being sold as open source, when in reality it just calls into a proprietary ML blob that is tiny and useless, the actual source of the model is closed, and a properly useful large model doesn't exist?
I wonder how big of a deal it is that you'd have to retrain the model to support a new or changed file type? It doesn't seem like the repo contains training code, but I could be missing it...
After reading through all the comments, honestly I still don't get the point of this system. What is the potential practical value, or what are the applications of this model?
Is it really common enough for files not to be annotated with a useful/correct file type extension (e.g. .mp3, .txt) that a library like this is needed?
Can we please god stop using AI like it's a meaningful word? This is really interesting technology; it's hamstrung by association with a predatory marketing term.
> <i>So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.<p>This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand.</i><p>Pure nonsense. The rules are accurate, based on the actual formats, and not "heuristics".
Can someone please help me understand why this is useful? The article mentions malware scanning applications, but if I'm sending you a malicious PDF, won't I want to clearly mark it with a .pdf extension so that you open it in your PDF app? Their examples are all very obvious based on file extensions.