Transcriber Guide
Overview
Why do a manual transcription? Manual transcriptions create a library of documents that serves as a baseline for the artificial intelligence engine. Videlicet uses AI to generate evaluations for the larger corpus of documents. Transcription engineers will work with prompt engineers to train the AI engine to replicate the content of a manual transcription. This could entail back-and-forth iterations until the prompt engineers create a prompt that produces near 100% accuracy when compared to the manual transcription. Manual transcriptions are an important part of training computer transcription and building an accurate corpus of data. In this guide you will find the steps needed to create a manual transcription, along with tips for the process.
Intro to Transcribing
How to Transcribe a Page
On the main toolbar, click “Browse” to go to the transcription page.
Click anywhere on the page to manually transcribe.
Click on the “Transcribe this page” button and a new page will load where you can input your manual transcription. The zoom tool will allow you to zoom into the original text to better view the document.
Enter your transcription in the text entry form to the right of the original document. Once complete, click the “Save manual transcription” button.
To copy the transcription text to your clipboard, press the “Copy transcription” button. You can then paste the text elsewhere without retyping it.
Transcription best practices
-
If a word or series of words is difficult to read, but an educated guess is possible, enclose the guess in square brackets with a question mark at the end. Multiple guesses can be included within the same brackets.
Example: “Mr. Sheredine delivers a report on the petition of [sundry?] inhabitants of Cecil county,”
-
If a word or series of words is illegible, use square brackets containing only a question mark to indicate their presence. It is important to note that a word or series of words is present, even if it cannot be made out.
Example: “The Horses, &c. to be Enter’d with [?] [?] and Benjamin [Brooks?]”
-
Transcriptions should preserve all capitalization, all-caps words, and irregular spellings as they occur in the text. This includes mistakes, typos, and inconsistent grammar.
Example sentence that includes all-caps, capitalizations, and irregular spellings as they occur in the text: “BY the Subscriber in Annapolis, for Corn, Wheat, or Pork, good West-Indian Rum, Melassies, Loaf Sugar, Chocolate, and several Sorts of European and India Goods.”
-
All headers, including the mastheads and any headers within the text, should be transcribed.
-
Order of transcription:
i. Masthead or header (if included).
ii. Leftmost column, from top to bottom.
iii. When bottom of column is reached, move to the top of the immediate column to the right.
iv. End at bottom of rightmost column.
-
Advertisement pages are sub-organized into “text boxes” within the standard column organization. Long black lines denote the start and end of each advertisement: a text box starts below one long black line and ends above the next. Column breaks do not separate or define text boxes, so an advertisement can run across multiple columns. Transcribe advertisement text boxes in a similar order to “news” pages, which are organized into columns only.
-
Visual images are not transcribed, and their existence does not have to be textually noted.
-
The long medial s (ſ) should be transcribed as a regular “s.” For example, the word “manifeſt” should be transcribed as “manifest.”
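The bracket and long-s conventions above lend themselves to a quick check before a transcription is saved. The following is a minimal Python sketch and is not part of Videlicet itself; the function names `normalize_long_s` and `find_markers` are hypothetical and chosen only for illustration.

```python
import re

# Matches "[?]" (illegible) and "[guess?]" (uncertain reading).
MARKER = re.compile(r"\[([^\[\]]*)\?\]")


def normalize_long_s(text):
    """Replace the long medial s with a regular "s"."""
    return text.replace("ſ", "s")


def find_markers(text):
    """Separate bracketed markers into uncertain guesses and illegible gaps."""
    guesses = []
    illegible = 0
    for match in MARKER.finditer(text):
        inner = match.group(1).strip()
        if inner:
            guesses.append(inner)  # e.g. "Brooks" from "[Brooks?]"
        else:
            illegible += 1  # a bare "[?]"
    return {"guesses": guesses, "illegible": illegible}
```

Run on the example sentence above, `find_markers` would report one guess ("Brooks") and two illegible gaps, which gives a reviewer a short list of places to look at first.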
Transcription of tables
The AI engine will produce a transcription that has tables annotated in the Markdown language. Transcribers will also need to annotate tables in Markdown. A full guide to Markdown can be found here: https://www.markdownguide.org
Below is an example manual transcription of the first two rows of a table found in the Maryland Court Records document ce457-000001-page0002.pdf.
Partial table from PDF:
Code:
| 1 | The State of Maryland vs. George Hughes | for feloniously stealing three and a half Dolls. the property of Richard Burrell - on Information of Peter Yontz. Verdict Guilty. |
|---|---|---|
| 2 | The State of Maryland vs. Samuel Shad | for beating & much abusing John Globes &c on Inf. of J. Globes. |
Rendered:
1 | The State of Maryland vs. George Hughes | for feloniously stealing three and a half Dolls. the property of Richard Burrell - on Information of Peter Yontz. Verdict Guilty. |
---|---|---|
2 | The State of Maryland vs. Samuel Shad | for beating & much abusing John Globes &c on Inf. of J. Globes. |
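For longer tables, typing the pipes by hand gets error-prone. The sketch below shows one way to generate the same Markdown layout from a list of rows in Python. It is an illustration only: `rows_to_markdown` is a hypothetical helper, not a Videlicet tool, and it assumes (as in the example above) that the first row of the table is placed above the separator line.

```python
def rows_to_markdown(rows):
    """Render rows as a Markdown table, with the first row above the separator."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape literal pipes and pad short rows so every row has `width` cells.
        cells = [str(cell).replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "|" + "---|" * width]
    lines += [render(row) for row in rows[1:]]
    return "\n".join(lines)


table = rows_to_markdown([
    ["1", "The State of Maryland vs. George Hughes", "Verdict Guilty."],
    ["2", "The State of Maryland vs. Samuel Shad", ""],
])
```

The cell text here is shortened for space; in real use each cell would hold the full transcribed text, following the same best practices as the rest of the page.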
The Transcriber’s Process
Transcribers will be assigned a Google Sheets spreadsheet that will act as a landing page for their transcribing work. This spreadsheet will list the dataset, filename, transcription of the page, completion status, and URL.
Dataset: The dataset is the largest unit of data organization. Transcribers should list the collection that their data is derived from.
Filename: The file name is the name of the specific file that the transcriber works with. It is the text that accompanies the document, found in Browse and on the file’s page.
Transcription: Work done by the transcriber, to be saved directly onto Videlicet’s website. If transcribing directly in Google Sheets, press Command-Enter twice (Mac) or Ctrl-Enter twice (PC) to separate paragraphs. Refer to “Videlicet Transcription Guide” for more detailed instructions.
Status: Indicates the file’s completion status. There are three statuses: “stuck,” “in review,” and “complete.” “Stuck” means that the transcriber has a question or an issue with the page. The issue can be resolved later by the transcriber, or brought into lab and answered by Libby or Professor Dressler. “In review” means that the transcriber has uploaded and saved the transcription on the Videlicet website, but a reviewer has not yet checked it for outlying issues (typos or other accidents). Do not mark a page “in review” unless the transcription has been entered and saved on the Videlicet website. “Complete” means that the transcription has been uploaded to the Videlicet website and checked by a reviewer. A blank status means that the transcription is still in progress, with no outstanding issues or questions.
The Process
The transcriber first selects pages at random from the Browse page and enters them into the spreadsheet. Transcribers must select pages from multiple years and in multiple conditions within a given dataset; diverse transcribing conditions are key to generating accurate and useful computer evaluations.
Transcribers can access the transcription box on the Videlicet website by clicking on their selected page, scrolling to the bottom, and clicking the “Transcribe this page” button.
Transcription work should follow the transcriber best practices. Work can occur either directly on the Videlicet page or in the Google Sheets. The final transcription should be populated in both locations.
When done with the transcription, click the “Save manual transcription” button on the Videlicet page.
Change the status in the corresponding cell of the Google Sheets spreadsheet to “in review.” A reviewer will change the status to “complete.” If the transcriber has an issue transcribing the page, do not save the transcription to the Videlicet website. Instead, save it in the spreadsheet and type “stuck” in the status column. Questions will be addressed during lab time. If the transcriber is unable to complete a transcription in one sitting, do not save the transcription to the Videlicet website. Store it in the spreadsheet, and transfer it to the Videlicet website when it is complete. Incomplete transcriptions should never be saved to the Videlicet website.
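The status rules above can be summarized as a small consistency check. The Python sketch below is only an illustration of those rules; the guide does not describe any automated check, and the function `check_row` and its parameters `saved_on_site` and `reviewed` are hypothetical names introduced here.

```python
VALID_STATUSES = {"", "stuck", "in review", "complete"}


def check_row(status, saved_on_site, reviewed=False):
    """Return a list of problems for one spreadsheet row; empty means consistent."""
    status = status.strip().lower()
    if status not in VALID_STATUSES:
        return [f"unknown status: {status!r}"]
    problems = []
    # "in review" and "complete" both require a saved transcription on the website.
    if status in {"in review", "complete"} and not saved_on_site:
        problems.append("save the transcription on the Videlicet website first")
    # Only a reviewer's check turns "in review" into "complete".
    if status == "complete" and not reviewed:
        problems.append("a reviewer has not checked this page yet")
    # In-progress (blank) and stuck work stays in the spreadsheet only.
    if status in {"", "stuck"} and saved_on_site:
        problems.append("unfinished or stuck work should not be saved to the website")
    return problems
```

For example, `check_row("in review", saved_on_site=False)` returns one problem, because a page should not be marked “in review” before it is saved on the website.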
Using the Search Bar in Browse
The search bar in “browse” is used to search for a file name. The file name is cumulative and, through multiple prefixes, builds from general to specific to create a label for an individual page. Every dataset has a different structure for file names.
Maryland Gazette
mg | - | 1278 | - | 0012 |
---|---|---|---|---|
dataset | | microfilm | | pg # (in total microfilm) |
Virginia Gazette
vg | - | 1736 | - | vg0002 |
---|---|---|---|---|
dataset | | year | | pg # (in total microfilm) |
Maryland State Court Records
ce457 | - | 000001 | - | page0001 |
---|---|---|---|---|
name of file in MD State Archives | page # |
Search results in “Browse” depend on how specific the search term is. For example, to find all documents from the Maryland Gazette, search “mg”.
To search all the documents from microfilm 1278 of the Maryland Gazette, search “mg-m1278”.
To search specifically for page 0012 of the Maryland Gazette, type the full file name, “mg-m1278-0012”.
Each more specific search narrows the subset of the data that is viewable.
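The narrowing behavior described above can be pictured as prefix matching on file names. The Python sketch below assumes that model; how the Browse search is actually implemented is not stated in this guide. The function `search_browse` is hypothetical, and the sample names “mg-m1278-0013” and “mg-m1279-0001” are invented for illustration.

```python
def search_browse(file_names, query):
    """Return the file names that start with the query (assumed prefix match)."""
    prefix = query.strip().lower()
    return [name for name in file_names if name.lower().startswith(prefix)]


# Sample file names; the second and third are invented for this example.
names = [
    "mg-m1278-0012",
    "mg-m1278-0013",
    "mg-m1279-0001",
    "vg-1736-vg0002",
    "ce457-000001-page0001",
]

whole_dataset = search_browse(names, "mg")          # all three Maryland Gazette pages
one_microfilm = search_browse(names, "mg-m1278")    # the two pages on microfilm 1278
one_page = search_browse(names, "mg-m1278-0012")    # a single page
```

Each longer query returns a subset of the previous result, which mirrors the general-to-specific structure of the file names.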
Codes for Different Datasets
Maryland Gazette: mg
The Maryland Gazette files are organized by microfilm and by placement within the microfilm. Each microfilm covers multiple years of data. Refer to the table below for the specific dates each microfilm contains.
Microfilm | Dates |
---|---|
1007 | December 3, 1738 - July 22, 1729 |
m1278 | January 17, 1745 - December 25, 1751 |
m1279 | January 2, 1752 - October 19, 1758 |
1280 | October 26, 1758 - October 31, 1765 and December 10, 1765 |
1281 | January 30, 1766 - December 26, 1771 |
1282 | January 9, 1772 - September 10, 1779 |
1283 | September 17, 1779 - June 28, 1787 |
m1284 | July 5, 1787 - December 25, 1794 |
m1285 | January 1, 1795 - December 29, 1803 |
m1286 | January 1, 1804 - December 27, 1810 |
1287 | January 2, 1811 - December 19, 1816 |
1288 | January 2, 1817 - December 26, 1822 |
1289 | January 2, 1823 - December 25, 1828 |
1290 | January 1, 1829 - December 31, 1835 |
1291 | January 7, 1836 - December 18, 1839 |
Virginia Gazette: vg
Pages are organized by the year given in the file name, not by microfilm number.
Maryland State Court Records: ce457
This is the name of the dataset given by the Maryland State Archives. It refers to the organization of its scanned court documents.