F1423 Digital Texts and Multicultural Studies (Cultural Heritage Sp. Track)

Boschetti Federico

The course illustrates the complete workflow that leads from out-of-copyright bilingual printed editions to digital editions linguistically analyzed and annotated by the students.

The first part of the course is devoted to techniques for acquiring digital texts from printed editions by Optical Character Recognition (OCR). Page images are scanned by the teacher or downloaded from online repositories such as archive.org. Open source OCR engines and tools developed at the ILC-CNR of Pisa in partnership with the Perseus Project of Boston are described and used by the students during the labs. The examples suggested by the teacher are mainly based on Greek texts with English translation, because polytonic Greek poses interesting challenges, but no prior knowledge of the language is required.

Students can base their midterm projects on short texts written in different languages, provided that either the original or the translation is in Greek, Latin, English, German, French, Spanish, Italian or Venetian and that they have a good command of the other language. Typical examples are the Venetian and Italian translations of the first book of the Iliad by Casanova, or some epigrams of the Anthologia Palatina in Greek with a translation in a modern language. Text and translation can also both be in modern languages, on topics selected by the students and discussed with the teacher.

The principles of the Text Encoding Initiative (TEI) guidelines are illustrated, and students are asked to provide their texts with minimal metadata (author, title, edition, etc.), layout annotation (division into paragraphs, separation between text and critical apparatus, etc.) and anchors linking the source to the related translation.
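The encoding step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling used in the course: the function name `build_tei`, the `src`/`trg` identifier scheme, the use of `corresp` on `anchor` elements to link source and translation, and the placeholder edition statement are all assumptions made for the example. It relies only on the standard library module `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI elements without a prefix


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, author, edition, source_segments, translation_segments):
    """Build a minimal TEI tree: a header with basic metadata and two
    divisions whose paragraphs start with mutually linked anchors."""
    if len(source_segments) != len(translation_segments):
        raise ValueError("source and translation need the same number of segments")

    root = ET.Element(tei("TEI"))
    file_desc = ET.SubElement(ET.SubElement(root, tei("teiHeader")), tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    ET.SubElement(title_stmt, tei("author")).text = author
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Unpublished course exercise"
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("bibl")).text = edition

    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    # Hypothetical naming scheme: src1, src2, ... point to trg1, trg2, ... and back.
    layout = (
        ("source", "src", "trg", source_segments),
        ("translation", "trg", "src", translation_segments),
    )
    for div_type, prefix, other, segments in layout:
        div = ET.SubElement(body, tei("div"), {"type": div_type})
        for i, segment in enumerate(segments, start=1):
            p = ET.SubElement(div, tei("p"))
            anchor = ET.SubElement(
                p,
                tei("anchor"),
                {f"{{{XML_NS}}}id": f"{prefix}{i}", "corresp": f"#{other}{i}"},
            )
            anchor.tail = segment  # the segment text follows its anchor
    return root


if __name__ == "__main__":
    tree = build_tei(
        "Iliad, Book 1 (sample)",
        "Homer",
        "placeholder edition statement",  # assumption: replace with the real edition
        ["μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"],
        ["Sing, goddess, the wrath of Achilles, son of Peleus"],
    )
    print(ET.tostring(tree, encoding="unicode"))
```

The distance between consecutive anchors decides how fine-grained the later parallel display can be, so one anchor per paragraph, as here, is just one possible choice.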

The second part of the course illustrates tools for linguistic analysis, such as lemmatization, morphological analysis and syntactic parsing. During the lab, students run automated linguistic analyses on their own texts. The suite of tools for editing and annotating texts developed at the Perseus Project (Perseids) is illustrated and tested with the students. Finally, the features of Aporia, the text retrieval system developed at the ILC-CNR, are described, and the texts selected by the students, enriched with linguistic analyses, are uploaded to the platform. Texts are displayed in parallel, at the level of granularity determined by the distance between the anchors linking source and translation, and students add historical, stylistic and linguistic annotations, focusing in particular on the main differences between the original text and its translation that are due to cultural reasons.
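As a rough illustration of how anchors determine the granularity of a parallel display, the following Python sketch pairs source and translation segments from a small TEI-like sample. It does not reproduce how Aporia or Perseids work internally: the function `aligned_pairs`, the sample markup, the segmentation of the verse and the convention that each `anchor` carries `xml:id` and `corresp` attributes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative sample only: two anchored segments per language.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="source">
<p><anchor xml:id="src1" corresp="#trg1"/>μῆνιν ἄειδε θεὰ</p>
<p><anchor xml:id="src2" corresp="#trg2"/>Πηληϊάδεω Ἀχιλῆος</p>
</div>
<div type="translation">
<p><anchor xml:id="trg1" corresp="#src1"/>Sing, goddess, the wrath</p>
<p><anchor xml:id="trg2" corresp="#src2"/>of Achilles, son of Peleus</p>
</div>
</body></text></TEI>"""


def aligned_pairs(tei_xml):
    """Return (source, translation) text pairs, following each source
    anchor's corresp pointer to the matching translation anchor."""
    root = ET.fromstring(tei_xml)

    # Index the text that follows every anchor by the anchor's xml:id.
    text_by_id = {
        anchor.get(XML_ID): (anchor.tail or "").strip()
        for anchor in root.iter(TEI + "anchor")
    }

    pairs = []
    for anchor in root.findall(f".//{TEI}div[@type='source']//{TEI}anchor"):
        target_id = anchor.get("corresp", "").lstrip("#")
        pairs.append((text_by_id[anchor.get(XML_ID)], text_by_id.get(target_id, "")))
    return pairs


if __name__ == "__main__":
    for source, translation in aligned_pairs(SAMPLE):
        print(f"{source}  |  {translation}")
```

Placing anchors more densely yields shorter aligned units, which makes it easier to attach an annotation to the exact point where the translation departs from the original.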

Learning outcomes of the course
Students will learn to manage the complete digitization workflow of multilingual texts and will annotate texts, focusing their attention on cultural differences between the original work and its translation.

Materials that must be studied by the students will be discussed and motivated in class.
Readings
- A. Babeu, Rome Wasn’t Digitized in a Day: Building a Cyberinfrastructure for Digital Classicists, Washington (DC) 2011, 12-31, http://www.clir.org/pubs/reports/pub150
- D. Bamman, G. Crane. 2008. Building a Dynamic Lexicon from a Digital Library. In Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries (JCDL 2008), Pittsburgh (PA).
- D. Bamman, F. Mambrini, G. Crane. 2009. An Ownership Model of Annotation: The Ancient Greek Dependency Treebank. In TLT 2009: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories Conference, Milan, Italy: Northern European Association for Language Technology (NEALT), http://dl.tufts.edu/catalog/tufts:PB.001.002.00008
- F. Boschetti, M. Romanello, A. Babeu, D. Bamman, G. Crane. 2009. Improving OCR Accuracy for Classical Critical Editions. In M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, G. Tsakonas. Research and Advanced Technology for Digital Libraries, Proceedings, Berlin, 156-167, http://www.perseus.tufts.edu/~ababeu/ecdl2009-preprint.pdf
- L. Burnard, S. Bauman. 2013. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Charlottesville (VA), 385-404, http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf
- G. Crane. 1991. Generating and Parsing Classical Greek. In Literary and Linguistic Computing, 6, 4.
- G. Crane, A. Jones, D. Bamman, L. Cerrato, D. Mimno, D. Packel, D. Sculley, G. Weaver. 2006. Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries. In Proceedings of Research and Advanced Technology for Digital Libraries: 10th European conference, ECDL 2006, Alicante, Spain, September 17-22, 353-366, http://www.eecs.tufts.edu/~dsculley/papers/incunabula.pdf
- G. Crane, B. Almas, A. Babeu, L. Cerrato, M. Harrington, D. Bamman, H. Diakoff. 2012. Student Researchers, Citizen Scholars and the Trillion Word Library. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital libraries (JCDL 2012), Washington (DC), 213-222, http://dl.tufts.edu/catalog/tufts:PB.001.001.00023
- F. Dell'Orletta, M. Federico, S. Montemagni, V. Pirrelli. 2007. Maximum Entropy for Italian POS Tagging. In Proceedings of Workshop Evalita 2007. Intelligenza Artificiale 4, 2.
- F. Dell'Orletta. 2009. Ensemble system for Part-of-Speech tagging. In Proceedings of Evalita’09, Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, December.
- C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. Cambridge (MA), Introduction.
- E. Pianta, L. Bentivogli, C. Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, http://multiwordnet.fbk.eu/paper/MWN-India-published.pdf
- A. Roventini, A. Alonge, N. Calzolari, B. Magnini, F. Bertagna. 2000. ItalWordNet: a Large Semantic Database for Italian. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May – 2 June 2000, Volume II, Paris, The European Language Resources Association (ELRA), 783-790, http://www.lrec-conf.org/proceedings/lrec2000/pdf/129.pdf

This is an archived site of Venice International University.