Transcriber Guide

Overview

Why do a manual transcription? Manual transcriptions create a library of documents that serves as a baseline for the artificial intelligence engine. Videlicet uses AI to generate evaluations for the larger corpus of documents. Transcription engineers work with prompt engineers to train the AI engine to replicate the content of a manual transcription. This can take several back-and-forth iterations before the prompt engineers arrive at a prompt that produces near 100% accuracy when compared to the manual transcription. Manual transcriptions are an important part of training computer transcription and building an accurate corpus of data.

In this guide you will find the steps needed to create a manual transcription, along with tips for the process.
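This guide does not say how accuracy against the manual transcription is measured. As a rough illustration only, one simple possibility is a character-level similarity ratio between the two texts. The function name `similarity` and the sample strings are assumptions made for this sketch, not part of Videlicet.

```python
from difflib import SequenceMatcher


def similarity(manual: str, generated: str) -> float:
    """Return a 0.0-1.0 character-level similarity ratio between two texts."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters in long texts, so every character is compared.
    return SequenceMatcher(None, manual, generated, autojunk=False).ratio()


# Hypothetical pair: the generated text lost the all-caps "BY".
manual = "BY the Subscriber in Annapolis"
generated = "By the Subscriber in Annapolis"
print(f"{similarity(manual, generated):.3f}")
```

A score of 1.0 means the two texts are identical. A comparison like this is case-sensitive, which is one reason the manual transcription needs to preserve capitalization and spelling exactly as printed.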

Intro to Transcribing

How to Transcribe a Page

On the main toolbar, click “Browse” to go to the transcription page.
Click anywhere on a page to open it for manual transcription.

Transcription Page

Click on the “Transcribe this page” button and a new page will load where you can input your manual transcription. The zoom tool will allow you to zoom into the original text to better view the document.

Enter your transcription in the text entry form to the right of the original document. Use the zoom tool to see the text more clearly. Once complete, click the “Save manual transcription” button.

To copy the transcription text to your clipboard, press the “Copy transcription” button. You can then paste the text elsewhere without retyping it.

Transcription best practices

  • If a word or series of words is difficult to read, but one can make an educated guess, enclose the guess in square brackets with a question mark after it, inside the brackets. Multiple guesses can be included within the same brackets.

    Example: “Mr. Sheredine delivers a report on the petition of [sundry?] inhabitants of Cecil county,”

  • If a word or series of words is illegible, use square brackets with a question mark in the middle to indicate their presence. It is important to notate that a word or series of words is present, even if one cannot make it out.

    Example: “The Horses, &c. to be Enter’d with [?] [?] and Benjamin [Brooks?]”

  • Transcriptions should notate all capitalizations, all-caps words, and irregular spellings as they occur in the text. This includes mistakes/typos and inconsistent grammar.

    Example sentence that includes all-caps, capitalizations, and irregular spellings as they occur in the text: “BY the Subscriber in Annapolis, for Corn, Wheat, or Pork, good West-Indian Rum, Melassies, Loaf Sugar, Chocolate, and several Sorts of European and India Goods.”

  • All headers, including the mastheads and any headers within the text, should be transcribed.

  • Order of transcription:

    i. Masthead or header (if included)

    ii. Leftmost column, from top to bottom.

    iii. When bottom of column is reached, move to the top of the immediate column to the right.

    iv. End at bottom of rightmost column.

  • Advertisement pages are sub-organized into “text boxes” within the standard column organization. Long black lines separate the text boxes: each box starts below one long black line and ends above the next, with the advertisement text in between. Column breaks do not define text boxes, so an advertisement can run across multiple columns. Transcribe advertisement text boxes in a similar order to “news” pages, which are organized only into columns.

  • Visual images are not transcribed, and their existence does not have to be textually noted.

  • The long medial s (ſ) should be transcribed as the letter “s.” For example, the word “manifeſt” should be transcribed as “manifest.”
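The bracket and long-s conventions above lend themselves to a quick self-check before saving. The following is a minimal sketch, not a Videlicet feature: the function names `replace_long_s` and `find_markers` are made up for illustration, and the pattern assumes markers are written exactly as in the examples above.

```python
import re

# Bracketed markers as described above: "[?]" for an illegible word,
# "[guess?]" for an educated guess.
MARKER = re.compile(r"\[([^\[\]]*)\?\]")


def replace_long_s(text: str) -> str:
    """Replace the long medial s (ſ, U+017F) with a plain "s"."""
    return text.replace("\u017f", "s")


def find_markers(text: str) -> tuple[list[str], int]:
    """Return (educated guesses, count of illegible words) found in text."""
    contents = [m.group(1).strip() for m in MARKER.finditer(text)]
    guesses = [c for c in contents if c]      # e.g. "Brooks"
    illegible = len(contents) - len(guesses)  # bare "[?]" markers
    return guesses, illegible


line = "The Horses, &c. to be Enter’d with [?] [?] and Benjamin [Brooks?]"
print(find_markers(line))          # guesses and illegible-word count
print(replace_long_s("manifeſt"))  # long s replaced with "s"
```

A check like this only reports what is already marked. It cannot tell whether a guess is right, so difficult passages still need a reviewer.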

Transcription of tables

The AI engine will produce a transcription that has tables annotated in the Markdown language. Transcribers will also need to annotate tables in Markdown. A full guide to Markdown can be found here: https://www.markdownguide.org

Below is an example manual transcription of the first 2 rows of a table found in Maryland Court Records document ce457-000001-page0002.pdf

Partial table from PDF: [image: transcription partial table]

Code:

| 1 | The State of Maryland vs. George Hughes | for feloniously stealing three and a half Dolls. the property of Richard Burrell - on Information of Peter Yontz. Verdict Guilty. |
|---|---|---|
| 2 | The State of Maryland vs. Samuel Shad | for beating & much abusing John Globes &c on Inf. of J. Globes. |

Rendered:

1 The State of Maryland vs. George Hughes for feloniously stealing three and a half Dolls. the property of Richard Burrell - on Information of Peter Yontz. Verdict Guilty.
2 The State of Maryland vs. Samuel Shad for beating & much abusing John Globes &c on Inf. of J. Globes.
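For larger tables, typing the pipes by hand is error-prone. The sketch below shows one way to build the Markdown from a list of rows, following the layout of the example above, where the first transcribed row sits above the separator line. The helper name `to_markdown_table` is an assumption for illustration and is not part of Videlicet.

```python
def to_markdown_table(rows: list[list[str]]) -> str:
    """Format rows as a Markdown table; the separator follows the first row."""
    def fmt(cells: list[str]) -> str:
        # Escape pipes inside cell text so they are not read as column breaks.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("every row needs the same number of cells")
    separator = "|" + "---|" * width
    return "\n".join([fmt(rows[0]), separator] + [fmt(r) for r in rows[1:]])


rows = [
    ["1", "The State of Maryland vs. George Hughes",
     "for feloniously stealing three and a half Dolls. the property of "
     "Richard Burrell - on Information of Peter Yontz. Verdict Guilty."],
    ["2", "The State of Maryland vs. Samuel Shad",
     "for beating & much abusing John Globes &c on Inf. of J. Globes."],
]
table = to_markdown_table(rows)
print(table)
```

Each row must have the same number of cells, including empty ones, or the columns will not line up when rendered.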

The Transcriber’s Process

Transcribers will be assigned a Google Sheets spreadsheet that acts as a landing page for their transcribing work. This spreadsheet lists the dataset, filename, transcription of the page, completion status, and URL.

Transcription Process

Dataset: The dataset is the largest unit of data organization. Transcribers should list the collection that their data is derived from.

Filename: The file name is the name of the specific file that the transcriber works with. It is the text that accompanies the document, found in Browse and on the file’s page.

Transcription: The work done by the transcriber, to be saved directly on the Videlicet website. If transcribing directly in Google Sheets, press Command-Enter twice (Mac) or Ctrl-Enter twice (PC) to separate paragraphs. Refer to “Videlicet Transcription Guide” for more detailed instructions.

Status: Indicates the file’s completion status. There are three statuses: stuck, in review, and complete.

  • “stuck” means that the transcriber has a question or an issue with the page. The transcriber can resolve the issue later, or bring it to lab to be answered by Libby or Professor Dressler.

  • “in review” means that the transcriber has uploaded and saved the transcription on the Videlicet website, but a reviewer has not yet checked it for outlying issues (typos or other accidents). Do not mark a page “in review” unless the transcription has been entered and saved on the Videlicet website.

  • “complete” means that the transcription has been uploaded to the Videlicet website and checked by a reviewer.

No status means that the transcription is still in progress, with no outstanding issues or questions.

The Process

Transcribers randomly select pages from the Browse page and enter them into the spreadsheet first. Transcribers must select pages from multiple years and in multiple conditions within a given dataset, because diverse transcribing conditions are key to generating accurate and useful computer evaluations.

Transcribers can access the transcription box on the Videlicet website by clicking on their selected page, scrolling to the bottom, and clicking the “Transcribe this page” button.

Transcription work should follow the transcription best practices. Work can occur either directly on the Videlicet page or in Google Sheets. The final transcription should be entered in both locations.

When done with the transcription, press “Save manual transcription” on the Videlicet page.

Change the status in the corresponding box in Google Sheets to “in review.” A reviewer will change the status to “complete.”

If the transcriber has an issue transcribing the page, do not save the transcription to the Videlicet website. Instead, save it in Google Sheets and type “stuck” in the status column. Questions will be addressed during lab time.

If the transcriber is unable to complete a transcription in one sitting, do not save the transcription to the Videlicet website. Store it in Google Sheets, and transfer it to the Videlicet website when it is complete. Incomplete transcriptions should never be saved to the Videlicet website.
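The rules above reduce to a small decision table. The sketch below restates them in code purely as a memory aid. The names `NextStep` and `next_step` are invented for illustration, and an empty status string stands for a blank status cell.

```python
from typing import NamedTuple


class NextStep(NamedTuple):
    status: str            # value for the status column ("" means leave blank)
    save_to_website: bool  # whether to save on the Videlicet website


def next_step(has_issue: bool, finished: bool) -> NextStep:
    """Map the state of a transcription to the transcriber's next step."""
    if has_issue:
        # Keep the work in Google Sheets and raise the question in lab.
        return NextStep("stuck", False)
    if not finished:
        # Still in progress: keep it in Google Sheets only, no status yet.
        return NextStep("", False)
    # Finished with no open questions: save on the website, mark for review.
    return NextStep("in review", True)


print(next_step(has_issue=False, finished=True))
```

In every case the transcription also lives in Google Sheets. Only a reviewer moves a page from “in review” to “complete,” so that status never appears as an output here.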

Using the Search Bar in Browse

The search bar in “Browse” is used to search for a file name. The file name is cumulative: it is built from several prefixes that run from general to specific, ending in a label for an individual page. Every dataset has a different structure for file names.

Maryland Gazette

mg - m1278 - 0012
    mg       dataset
    m1278    microfilm
    0012     page number (within the whole microfilm)

Virginia Gazette

vg - 1736 - vg0002
    vg       dataset
    1736     year
    vg0002   page number (within the whole microfilm)

Maryland State Court Records

ce457 - 000001 - page0001
    ce457 - 000001   name of file in MD State Archives
    page0001         page number
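Because every file name is built from hyphen-separated parts, its cumulative prefixes can be listed mechanically. The sketch below assumes file names are written without spaces, as in the search examples in this guide (for instance “mg-m1278-0012”). The function names are hypothetical.

```python
def split_file_name(file_name: str) -> list[str]:
    """Split a file name into its hyphen-separated parts, general to specific."""
    # strip() also tolerates the spaced form used in the diagrams above.
    return [part.strip() for part in file_name.split("-")]


def prefixes(file_name: str) -> list[str]:
    """Return the cumulative prefixes of a file name, shortest first."""
    parts = split_file_name(file_name)
    return ["-".join(parts[: i + 1]) for i in range(len(parts))]


print(prefixes("mg-m1278-0012"))
print(prefixes("ce457-000001-page0001"))
```

Each prefix in the list is a narrower search term than the one before it, with the last entry being the full file name of a single page.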

What a search in “Browse” returns depends on how specific the search term is. For example, to find all documents from the Maryland Gazette, search “mg”.

To search all the documents from microfilm 1278 of the Maryland Gazette, search “mg-m1278”.

To search specifically for page 0012 of the Maryland Gazette, type the full file name, “mg-m1278-0012”.

This process limits what subset of the data is viewable.
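The narrowing behaviour can be pictured as a prefix filter over the list of file names. This is only a sketch of the idea: it assumes the search bar matches from the start of the file name, which this guide does not state outright. Apart from “mg-m1278-0012” and “vg-1736-vg0002”, the file names in the list are hypothetical.

```python
def search(file_names: list[str], query: str) -> list[str]:
    """Return the file names that start with query (assumed prefix matching)."""
    return [name for name in file_names if name.startswith(query)]


# Hypothetical listing; only the first and last names appear in this guide.
file_names = [
    "mg-m1278-0012",
    "mg-m1278-0013",
    "mg-m1279-0001",
    "vg-1736-vg0002",
]

for query in ("mg", "mg-m1278", "mg-m1278-0012"):
    print(query, "->", search(file_names, query))
```

Each longer query returns a subset of the results of the shorter one.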

Codes for Different Datasets

Maryland Gazette: mg

The Maryland Gazette files are organized by microfilm and by placement within the microfilm. Each microfilm covers multiple years of data. Refer to the table below for the specific dates each microfilm contains.

Microfilm   Dates
1007        December 3, 1738 - July 22, 1729
m1278       January 17, 1745 - December 25, 1751
m1279       January 2, 1752 - October 19, 1758
1280        October 26, 1758 - October 31, 1765 and December 10, 1765
1281        January 30, 1766 - December 26, 1771
1282        January 9, 1772 - September 10, 1779
1283        September 17, 1779 - June 28, 1787
m1284       July 5, 1787 - December 25, 1794
m1285       January 1, 1795 - December 29, 1803
m1286       January 1, 1804 - December 27, 1810
1287        January 2, 1811 - December 19, 1816
1288        January 2, 1817 - December 26, 1822
1289        January 2, 1823 - December 25, 1828
1290        January 1, 1829 - December 31, 1835
1291        January 7, 1836 - December 18, 1839
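When choosing pages from multiple years, it can help to look up which microfilm covers a given date. The sketch below does this for a few rows copied from the table above, with the labels kept exactly as they are printed there. The function name `microfilm_for` is hypothetical, and the excerpt is deliberately partial.

```python
from datetime import date
from typing import Optional

# A few rows copied from the table above (label, first date, last date).
MICROFILMS = [
    ("m1278", date(1745, 1, 17), date(1751, 12, 25)),
    ("m1279", date(1752, 1, 2), date(1758, 10, 19)),
    ("1281", date(1766, 1, 30), date(1771, 12, 26)),
]


def microfilm_for(day: date) -> Optional[str]:
    """Return the label of the microfilm whose date range contains day."""
    for label, start, end in MICROFILMS:
        if start <= day <= end:
            return label
    return None  # date falls outside the rows listed in this excerpt


print(microfilm_for(date(1750, 6, 1)))
print(microfilm_for(date(1760, 1, 1)))
```

A result of `None` only means the date is not covered by the rows included in this sketch, not that no microfilm exists for it.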

Virginia Gazette: vg

Pages are organized by the year given in the file name, not by microfilm number.

Maryland State Court Records: ce457

This is the name of the dataset given by the Maryland State Archives. It refers to the organization of its scanned court documents.