Some more great probabilistic Python libraries:<p><a href="https://github.com/datamade/usaddress" rel="nofollow">https://github.com/datamade/usaddress</a> - "usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods."<p><a href="https://github.com/datamade/probablepeople" rel="nofollow">https://github.com/datamade/probablepeople</a> - "probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods."
In the same vague theme of "I don't know what I'm dealing with": <a href="https://github.com/ajalt/fuckitpy" rel="nofollow">https://github.com/ajalt/fuckitpy</a>
We built a similar tool that uses a CNN. It works on structured (and unstructured) data and provides additional info.<p><a href="https://github.com/capitalone/DataProfiler" rel="nofollow">https://github.com/capitalone/DataProfiler</a><p>The cool part is that you can “extend” the internal named-entity recognition model by refitting it with new data.<p>Out of the box, the DataProfiler recognizes something like 18 entities, including most PII data.
Why are these screenshots animated? The command is still visible in the final frame, and the final frame shows the output we're interested in, but it isn't held long enough to read and understand.
I'm admittedly not impressed by the pcap processing.<p>It identifies a bunch of fragments of HTTP headers as "YouTube Video ID".<p>Meanwhile, I can get the same info and more by running<p><pre><code> $ strings FollowTheLeader.pcap
*]?>
GET / HTTP/1.1
Host: 10.0.2.5
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.7.3rc1
Date: Sun, 14 Jul 2019 02:42:13 GMT
Content-type: text/html
Content-Length: 105
Last-Modified: Sun, 14 Jul 2019 02:41:10 GMT
<h1>My Flag Web Page</h1>
<p>Hi there! Have a flag!</p>
<p>Here is your flag: ctfa{terrific_traffic}</p></code></pre>
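For anyone without `strings` handy, the same idea can be roughly sketched in Python. This is only an approximation of what `strings` does: the `extract_strings` name is made up for illustration, and the minimum run length of 4 is an assumption chosen to mirror the usual default.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    # Printable ASCII (space through tilde) plus tab; anything else,
    # including CR/LF and binary bytes, ends the current run.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

Reading the capture with `open("FollowTheLeader.pcap", "rb").read()` and passing the bytes to `extract_strings` should surface the same kind of header lines as above, since plain HTTP travels as readable ASCII inside the packets.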
At first I thought this was going to be like Google Lens. It's instead a way to probabilistically identify things in strings. I have wished for this to exist, and made my own dumbed-down version of it before. This could be very useful for less fragile screen scraping.
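A dumbed-down version of the idea can be sketched with a handful of regular expressions. The patterns, labels, and the `identify` helper here are hypothetical illustrations, not how the tool under discussion works, and there is no probability scoring at all, just first-pass pattern matching.

```python
import re

# Hypothetical patterns for illustration only; a real tool would ship a far
# larger, tuned set and rank the candidates it finds.
PATTERNS = {
    "IPv4 address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "Email address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"https?://[^\s\"'<>]+"),
}


def identify(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

For scraping, running something like `identify` over each text node is less brittle than relying on fixed page positions, though loose patterns like these will produce false positives (the IPv4 one, for instance, does not check that each octet is at most 255).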
Bee is a really tremendous and generous developer. I use a few of their other projects near-daily (Rustscan especially has changed my life.) Definitely one of those open source devs you follow just to see whatever they come up with next.