This question starts from the observation that deep learning can learn high-level semantic vector representations of raw data such as natural language, images, and graphs. Web pages combine several of these structures, and more, so in principle there should be a way to develop a joint model that faithfully captures the joint distribution over these types of structures as they occur in raw HTML. This is an open-ended question, and any feedback or thoughts are welcome. The plan is to research and develop an open-source model that represents the features of any web page (analogous to the high-level activations of a CNN), trained unsupervised on Common Crawl data. A rough baseline sketch is given below.
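
To make the goal concrete, here is a minimal, naive baseline, not the joint model being asked about: pull HTML pages from a locally downloaded Common Crawl WARC segment, strip them to visible text, and mean-pool the hidden states of an off-the-shelf pretrained text encoder to get one vector per page. The libraries (warcio, BeautifulSoup, transformers), the encoder choice (`bert-base-uncased`), and the file name are my own assumptions for illustration; a real joint model would also need to handle layout, links, images, and DOM structure.

```python
# Baseline sketch (assumptions: warcio, bs4, transformers installed; a Common
# Crawl WARC file downloaded locally). This embeds only page text, ignoring
# layout, images, and link structure -- it is a starting point, not the goal.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def page_vector(html: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states over the page's visible text."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # shape: (dim,)

# Iterate over HTML responses in the archive (file name is hypothetical).
with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in content_type:
            continue
        html = record.content_stream().read().decode("utf-8", errors="replace")
        vec = page_vector(html)
        print(record.rec_headers.get_header("WARC-Target-URI"), vec.shape)
```

The open question is what replaces the text-only encoder here: something that ingests the full HTML (DOM tree, CSS, images, hyperlinks) and is trained unsupervised at Common Crawl scale, so that the pooled output plays the role the CNN-activation analogy suggests.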