Web Science/Part1: Foundations of the web/Web content/Problem setting for web content formats/script


We have talked about sending messages between computers on the Internet, but what do we actually want to do on the Web? We want to share content among as many people as possible! What is content? (display the following examples in different browsers) Content may be

an image (depict the image of a record cover of BandX opened in a Web browser from a file)
an audio file (run an audio for a few seconds without showing anything else in the Browser)
a piece of text (depict some description of BandX opened in a Web browser from a file)
contents from a database (depict some listing of 3 records including one record of BandX, something like the following. Leave in the example with Bobby McFerrin, because it shows something that is an artist and an album title at the same time; also from a file)

Rene's record collection

Tom Jobim: Rio revisited
* agua de beber 3:49
* aguas de marco 4:07
* chega de saudade 3:48
* corcovado 3:05

Bobby McFerrin: Bobby McFerrin
* Dance with me 4:09
* Feline 5:08

and in fact we may even want to share some executable code that lets us type a search query, drag and drop a record into our local collection, or something similar.

And we do not want these pieces of content to stand apart from each other; they should be nicely arranged and laid out.

But what happens if we just provide the content of a file as it is? Then an image looks like this (depict the content of image as a Hex stream) and an audio file looks about the same (show another Hex stream). This means that it is not sufficient to have the content; the content must be marked up to clarify what kind of content it is, how it is structured, and how it should be handled and laid out.
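
To make the point about raw bytes concrete, here is a minimal sketch in Python. The two byte strings are written out by hand as an assumption for illustration (the standard PNG signature and the start of a RIFF/WAVE header with an arbitrary size field); they are not taken from the BandX files, and `hex_stream` is a hypothetical helper name.

```python
# First bytes of two kinds of files, written out by hand for illustration.
png_start = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])  # PNG signature
wav_start = b"RIFF" + bytes([0x24, 0x08, 0x00, 0x00]) + b"WAVE"      # RIFF/WAVE header start


def hex_stream(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)


# Without knowing the format, both are just streams of numbers.
print(hex_stream(png_start))
print(hex_stream(wav_start))
```

Both lines of output are equally opaque to a human reader; only knowledge of the format tells a program that one is an image and the other is audio.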

Clearly, we could build a useful piece of software that uses the Internet to share and exchange content. However, an important lesson from what we know about the Internet so far is that it is not a good idea to prescribe a particular piece of software that may run on only one operating system. Rather, to exploit the creativity of people, the scalability of decentralization and the positive competition between organizations and individuals, one must prescribe some simple standards for describing content that can be understood by many different pieces of software on many computers, handhelds, smartphones, TV sets or tablets with different screen sizes, resolutions, and modes of interaction (PROBLEM 1).

Considering the example content of our record listing, we observe the second challenge (PROBLEM 2): we must be able to structure content of one type, such as text, into its different parts, e.g. headings, paragraphs and lists, and not just display the content on the screen as one stream of characters.

Then the third problem (PROBLEM 3) that we run into when we want to display and interact with content is that we must come up with a content format that lets us arrange different types of content in a readable manner, e.g. we must be able to arrange the record cover of BandX next to the text paragraph describing the band.

Finally, the fourth problem (PROBLEM 4): the flip side of consuming Web content is producing Web content. Producing content must be simple enough to understand for people and for the developers of different applications for different media formats.

So, how can we solve these issues? In the past you will have encountered different languages that allow you to distinguish content proper from how the content is structured and arranged. You could choose

a programming language like Java. This would be highly flexible, because you can program everything. But it is still a bad idea, because the overhead for writing down the content proper is very large.
a markup language like the Wiki syntax that you are using for editing course comments and discussions in this course. Wiki syntax is very easy. But it is still a bad idea, because Wiki syntax cannot easily be generalized to new challenges such as graphics, multimedia and complex layouts.

Then, what is the sweet spot between the generality of a programming language and the ease of Wiki syntax? In fact, the publishing industry has produced a multitude of standards to delineate between content proper (images, text, paragraphs, footers, headers, etc.) and *markup*. Examples include

* PDF
* PostScript
* EPUB
* LaTeX
* SGML (Standard Generalized Markup Language)

The inventor of the Web, Tim Berners-Lee, and the Web community at large found that SGML is a wonderful basis for describing content and structure, but a bit too complex to engineer with and, hence, produced a simpler version of it: XML, the eXtensible Markup Language.

In fact, the core idea of XML is extremely simple, even if the result is sometimes not simple, and cannot be, given the intrinsic complexity of structuring Web content. But let's revisit Rene's record collection. Beyond the sheer content, we simply put structure information in angle brackets. For example, one specifies that "Rio revisited" is the title of a record by putting the tags <recordtitle></recordtitle> around it, just like this:

 <recordtitle>Rio revisited</recordtitle>
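
The same wrapping can be sketched in code. The following uses Python's standard `xml.etree.ElementTree` module to produce the tagged title and to check that an opening tag is matched by its closing tag; `is_well_formed` is a hypothetical helper name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Wrap the content "Rio revisited" in a <recordtitle> element.
title = ET.Element("recordtitle")
title.text = "Rio revisited"
serialized = ET.tostring(title, encoding="unicode")
print(serialized)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a well-formed XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A closing tag that does not match the opening tag is rejected by the parser.
print(is_well_formed(serialized))
print(is_well_formed("<recordtitle>Rio revisited</record>"))
```

The second check fails because the closing tag names a different element than the opening tag, which is exactly the kind of error that strict markup rules let software detect.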

In order to make clear to which record this recordtitle belongs and which tracks belong to this record, we simply nest the recordtitle and the tracks within (you guessed it) the tags <record></record>. Something like a track is quite complex by itself and may carry a lot of further information. This is what may come out /Explain here what people see on the screen/

 <collection>
  <collectiontitle>This is <owner>Rene Pickhardt</owner>'s most favored record collection</collectiontitle>
  <record>
    <artist>Tom Jobim</artist> 
    <recordtitle>Rio revisited</recordtitle>
    <track> 
      <tracktitle>agua de beber</tracktitle> 
      <duration>3:49</duration>
    </track>
    <track>
      <tracktitle>aguas de marco</tracktitle>
      <duration>4:07</duration> 
    </track>
  </record>
  <record>
    <artist>Bobby McFerrin</artist> 
    <recordtitle>Bobby McFerrin</recordtitle>
    <track> 
      <tracktitle>Dance with me</tracktitle> 
      <duration>4:09</duration>
    </track>
    <track>
      <tracktitle>Feline</tracktitle> 
      <duration>5:08</duration> 
    </track>
  </record>
 </collection>

You see from this example that providing markup is easy both for the contents of a database and for pieces of running text. Next we will show you that it is very easy to produce views on the given content as well as layout instructions from such a file. A few lessons from now, we will also show how multimedia comes into play.
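
As a small preview of why such nesting is useful, here is a minimal sketch, assuming Python and its standard `xml.etree.ElementTree` module, of a program that reads a shortened copy of the collection and lists each record with its tracks. The function name `list_records` and the variable names are hypothetical; this is one possible way to read the file, not the method used later in the course.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the record collection (one track per record).
COLLECTION_XML = """
<collection>
  <collectiontitle>This is <owner>Rene Pickhardt</owner>'s most favored record collection</collectiontitle>
  <record>
    <artist>Tom Jobim</artist>
    <recordtitle>Rio revisited</recordtitle>
    <track>
      <tracktitle>agua de beber</tracktitle>
      <duration>3:49</duration>
    </track>
  </record>
  <record>
    <artist>Bobby McFerrin</artist>
    <recordtitle>Bobby McFerrin</recordtitle>
    <track>
      <tracktitle>Dance with me</tracktitle>
      <duration>4:09</duration>
    </track>
  </record>
</collection>
"""


def list_records(xml_text: str):
    """Return a list of (artist, record title, [(track title, duration), ...])."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.findall("record"):
        tracks = [
            (track.findtext("tracktitle"), track.findtext("duration"))
            for track in record.findall("track")
        ]
        records.append((record.findtext("artist"), record.findtext("recordtitle"), tracks))
    return records


root = ET.fromstring(COLLECTION_XML.strip())
# Running text with an embedded <owner> tag can still be read as plain text.
heading = "".join(root.find("collectiontitle").itertext())
print(heading)

records = list_records(COLLECTION_XML)
for artist, record_title, tracks in records:
    print(f"{artist}: {record_title}")
    for track_title, duration in tracks:
        print(f"  {track_title} {duration}")
```

Because the artist and the record title carry different tags, the program can tell them apart even for Bobby McFerrin, where both have the same text.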