The Peril and Promise of Historians as Data Creators: Perspective, Structure, and the Problem of Representation

Written by: Sharon Leon

Primary Source: 6floors/[bracket], November 24, 2019

[This is a working draft of a chapter in progress for an edited collection.]

Data-Driven History

Digital historians are familiar with the notion that the larger community of historians generally has been skeptical of and cautious about data-driven scholarship. The controversies surrounding Robert Fogel and Stanley Engerman’s 1974 work, Time on the Cross: The Economics of American Slavery continue to haunt computational work.[1] Regularly, historians who are suspicious of digital methods inquire as to how contemporary digital work can avoid reproducing the interpretive missteps of the era of “cliometrics.” While Time on the Cross often stands in for a whole range of historical scholarship based on quantitative methods, it undoubtedly continues to be a focal point for conversation precisely because those quantitative methods were used to argue that people enslaved in the United States had willingly collaborated with the system of slavery to make it an efficient and productive economic institution. In doing so, Fogel and Engerman made arguments about the interior life and motivations of human beings based on the material conditions and outcomes of their circumstances. In effect, they mistook correlation for causation. The combination of quantitative methods and a history of wrenching human rights violations strikes a discordant tone that hinges on the reduction of human pain and suffering to columns and rows of numbers that can be processed and calculated with an algorithm.

No one pushed back more strongly against Fogel and Engerman’s conclusions than Herbert Gutman. No stranger to quantitative methods, Gutman revisited both the materials that the authors worked with and the conclusions that they drew from that data. He argued that though the system of slavery Fogel and Engerman examined might have seemed efficient, that efficiency was achieved through the pervasive presence and threat of violence rather than through voluntary cooperation or through an adoption of the enslaver’s worldview. An analysis of the economic systems surrounding slavery could not yield knowledge about the inner thoughts, feelings, and motivations of the enslaved as they performed their labor, regardless of how productive they were.[2]

In the wake of the widespread reaction against cliometrics, historians generally have been private about their work with data, presenting only end products, narratives, and summaries, even when that work is data-driven but not all that computationally sophisticated. Because work with data is often a small part of a much larger interpretive process, many who do it never even note that they have a set of spreadsheets or a database that they used to organize and analyze their source materials. This tendency has worked to mask the role that data collection and analysis play in contemporary historical scholarship.

Public or private, large or small, scholarly engagement with data demands the same kinds of critical interrogation that all other kinds of sources undergo. Scholars can never assume that the meaning of their data is self-evident. Rather than surrendering to the easy assumption that quantification results in a simple reflection of the world as it was, historians would do well to focus on the constructedness of historical sources at every stage of their existence. In doing so, they might embrace Johanna Drucker’s call to “reconceive of all data as capta.” Drucker explains:

Differences in the etymological roots of the terms data and capta make the distinction between constructivist and realist approaches clear. Capta is “taken” actively while data is assumed to be a “given” able to be recorded and observed. From this distinction, a world of differences arises. Humanistic inquiry acknowledges the situated, partial, and constitutive character of knowledge production, the recognition that knowledge is constructed, taken, not simply given as a natural representation of pre-existing fact.[3]

In resisting the naturalization of data, historians create a situation where they can be much more careful about the claims and conclusions that rest on that data. Their hands and interpretations are only one set among the many that have touched these sources from the point of their creation to the present.

The history of the enslavement, transportation, and forced labor of African-descended peoples is one suffused with violence, destruction, and dehumanization. As this history played out around the globe over several centuries, the parties involved created a massive trove of historical documentation of the institution and its impact: families ripped apart, human beings bought and sold, cruel corporal punishment, financial systems developed and proliferated, legal justifications codified. That documentation can serve as the basis for far-reaching, data-driven scholarship on the history of slavery. But these methodological approaches are not without their ethical pitfalls, and historians of slavery have a duty to pause to consider their obligations when creating, publishing, and drawing conclusions from that data.

Treating data as capta places it in an important trajectory in the lifecycle of historical evidence. That trajectory includes the initial creation of the record, its elevation to the status of a piece of information that should be preserved, its preservation, its preparation for research access, its review by an historian, its transformation into structured data, and its publication in a digitally accessible form. In scrutinizing this lifecycle, historians can come to a renewed awareness of the constructed nature of data, and of the individuals who help to shape access to evidence about the past, including record creators, archivists, historians, and technologists.

The Events

To begin with an historical event in itself is to begin with a problem of perspective. An event’s participants each come to it with a different worldview that shapes what they remember from their experiences. Then, a whole set of cultural conditions shape what kinds of durable evidence they can create about those experiences: degrees of literacy, the strength of oral tradition, access to materials for record making and storage, relationships with people who participate in institutions that create and save accounts of events. Comparatively, the surviving evidence about slavery and the conditions it perpetuated has been overwhelmingly created by those in relative positions of power and dominance, and not by those who were enslaved. The records they created are often imbued with the sense that enslaved people were first and foremost subjects of commerce and control—people to be bought, sold, and used instrumentally in the service of the interests of their owners.

For instance, Sowande Mustakeem writes about how the journey of the Middle Passage worked in numerous and traumatic ways through a “human manufacturing process” to transform captured people of African descent into chattel property to be sold in the New World. To develop and narrate this history, Mustakeem drew on personal, professional, financial, and public accounts that represent the perspectives of sailors, ship captains, merchants, brokers, surgeons, and clerks, but few enslaved people themselves. Nonetheless, Mustakeem explains that

Transmitting often murky details, they constructed narratives, perpetuated silences, and provided insights and biases on distant places and foreign people. Although fraught with inconsistencies, embellishments, and ethnic and racial stereotyping, these varied archival sources provided fertile opportunities to widen the spectrum of bondage to include the world of slavery at sea.[4]

This description aptly captures how historians are left to assemble fragments of not entirely trustworthy sources to develop a narrative interpretation.

Additional questions and complications arise, however, when historians begin to represent those sources as data that can be categorized, quantified, and perhaps visualized. Referring to the documentation of the transatlantic slave trade, Jessica Marie Johnson has explained that, “in slaving conventions along the African coast… compilers of slave ship manifests participated in the transmutation of black flesh into integers and fractions.”[5] This process suggests a gross dehumanization through quantification—a process that can be undertaken both by record creators and by the historians studying them. Those integers and fractions are not necessarily any more trustworthy than the narrative accounts created by historical actors.

Though enslavers might dutifully record their enslaved persons as property for transport, tax purposes, or inheritance, their attention to individual characteristics and details about those people belies their dismissal of their human integrity. Sharon Block’s recent book Colonial Complexions: Race and Bodies in Eighteenth-Century America is overflowing with examples of the ways that Anglo-American colonists described people’s physical appearance in the period just before skin color became the overwhelming indicator of racial status. Block’s work documents the unstable meaning and description of physical markers in thousands of missing person advertisements from colonial newspapers. Furthermore, this instability was not limited to the description of complexion. Block explains: “Age might seem like one of the most objective features of bodily description, but it was not necessarily an exact count of the person’s years on earth. In a phenomenon known as age heaping, runaways were much more likely to be listed at an age that was a multiple of five: for example, twice as many runaways were identified as being twenty-five years old as twenty-four or twenty-six years old.”[6] This evidence of semantic variation is just one facet of a larger need for researchers to question the seemingly self-evident elements of historical records related to slavery.

In sum, historians are trained to read for historical perspective. Reading against the grain and listening for silences, they assemble fragmentary evidence from the past to create new and instructive interpretations. Even so, spaces, gaps, and ambiguities exist within the seemingly stable elements of the records, those that can be categorized and tallied. None of this will come as a surprise to historians who have been practicing the craft of research for any period of time. At the same time, when placed in juxtaposition to the ways that digital representations can make the ambiguous and fluid seem fixed, the reminder of these issues can serve as a way to heighten sensitivity to the complexity of the records themselves.

The Archives

These records, though partial and subjective, created at the point of the original events, often resided for decades in a variety of sites that predated what we would commonly recognize as archival repositories, tucked away in church basements, office storerooms, and courthouses. The ongoing threat of destruction and decay due to natural disasters, climate, and neglect was substantial. The records’ care and stewardship frequently was not in the hands of someone committed to retention and preservation for the majority of their existence. But eventually a significant number of those records have made their way into modern archival repositories, and historians would be remiss if they discounted the ways that archival professionals have shaped historical records and access to them.

Much has been written over the last fifty years about archives as sites of power and tools of state and institutional definition.[7] With Foucault and Derrida, this work stands as a theoretical interrogation of systems, power, and desire around the creation of knowledge.[8] With Steedman, Stoler, Trouillot, and the many perspectives assembled by Antoinette Burton, the work centered much more closely around the role that the development of record-keeping bureaucracies has played in supporting imperial projects in colonial contexts and the subsequent histories that can be written from those records.[9] The power of these accounts in the thinking of historians should not be underestimated. It is so great, South African historian Keith Breckenridge argues, that the narrative of the imperial archive has occasionally stood in as a scapegoat for identifying other kinds of organizational, managerial, and collaborative failures in large-scale digitization projects.[10]

Though “the archive” continues to be central to the ways that historians envision themselves and the process of writing history, an increasing number of archivists have noted that the practices of archival professionals rarely enter into these meditations.[11] Nonetheless, the records that do reside in archival repositories have been subject to various processes of shaping and framing at the hands of the archivists who tend them. The core duties of an archivist include record selection, arrangement, and description, all of which come together to form an essential context through which historians access historical materials and profoundly influence the experience an historian has when encountering historical records.

Modern archival practice in the Anglo-American context has been shaped by the influential work of Sir Hilary Jenkinson in the wake of World War I and then by T. R. Schellenberg in the wake of World War II. Jenkinson championed an approach to the work that centered on the importance of neutrality and objectivity. This approach to record collection created the impression of a passive stance on the part of the archivist.[12] On the other hand, T. R. Schellenberg’s influential guidelines for appraisal and selection of records based on their perceived secondary value argued for an essential active role for the archivist.[13] The process of selection gives the archivist the purview to assess materials and decide whether or not they are worth the resources necessary to preserve them. The necessary and ongoing practice of winnowing down archives significantly limits the records to which scholars have access. The archivist is actively shaping the historical record by making value judgements about what records are worthy of accessioning and preserving.

Once materials are accessioned into archival repositories, a processing archivist spends time considering the arrangement of the records in their collections and holdings. Often the practice is to maintain the organization provided by the original record creator, but in many instances that organization is not clear. Thus, the archivist might group records into themes and order them chronologically. This imposition of structure is necessary for facilitating access to the materials, but it can predispose an historian to read the records from a particular perspective. Framing the perspective of inquiry in response to the context built by archival arrangement can have an important and lasting effect on the outcome of a project.

In the creation of a finding aid, the archivist offers a narrative description of the records that can tend to surface some aspects of the archives and submerge others. For example, the archives of religious orders who owned enslaved people might be primarily described in terms of their organizational structure, their temporal affairs, and their religious duties. As a result, a scholar reviewing the finding aid might initially assume that the records related to slavery make up only a very small portion of the archives, when in reality the trace evidence of considerations about the enslaved and their lives pervades the full scope of the records.

The emergence of critical archival studies has helped to draw attention to the power and implication of this work. Defined as “those approaches that (1) explain what is unjust with the current state of archival research and practice, (2) posit practical goals for how such research and practice can and should change, and/or (3) provide the norms for such critique,” critical archival studies calls for a transparent and active role for archival professionals. In the words of Caswell, Sangwand, and Punzalan, “critical archival studies broadens the field’s scope beyond an inward, practice-centered orientation and builds a critical stance regarding the role of archives in the production of knowledge and different types of narratives, as well as identity construction.”[14] The result has been an increased attention to archival work with underrepresented communities, and to post-custodial possibilities for preservation and access to materials. The work of critical archival studies has made significant inroads in securing the ability of dispossessed communities to see themselves in archival sources by broadening collection practices, questioning the need for custodial control, and trying to be more forthright about the colonial and dominant contexts of record creation and preservation. Anthony Dunbar has argued for resisting silences in archival description by embracing a practice of creating counterstories:

The first counterstory approach within the archives is the development of counternarratives that bring to the surface issues of racial dis-enfranchisement that are submerged based on a socio-historical archive’s mission which is likely to have been heavily influenced by marginalizing dominant culture realities. The second counterstory approach is a socio-historical archive that exists within itself as a form of counterstory to a dominant narrative.[15]

Thus, counterstories and counter-narratives are necessary not only in the production of historical interpretation, but also in the work of archival description that shapes scholarly access.

Unfortunately, resources for doing thorough and fine-grained archival description are scarce. The finding aids that exist for many collections describe materials only at the box level, rather than at the folder level, and almost never at the item level. The persistent shortage of staff time and resources for description has been a focus of the profession for many years now, most clearly articulated by Dennis Meissner and Mark Greene’s 2005 call for “more product, less process.”[16] Collections that have already been processed are unlikely to be reprocessed. Thus, the frameworks exist to create the rich description that would highlight the counterstories for which Dunbar calls, but the capacity to implement them across the mass of archival holdings is limited. As a result, through the existence of legacy description, the practices and assumptions of earlier generations of archivists cast long shadows, highlighting certain elements and occluding others.

The Data Sets

Just as the cumulative labor of archives professionals shapes the historical record, the training, theoretical and methodological predispositions of historians shape the ways that they engage with records. Though much graduate training focuses on the narrative choices that historians make to frame and deliver their interpretive work, somewhat less attention is paid to the everyday research practices and activities that move the scholar from research question to interpretation. These practices, however, represent key decision-points in the process of transforming historical records into data sets.

Regardless of the source or content of a record, scholarly engagement with a new historical source is an iterative process of reading, questioning, contextualizing, and comparing. This deep meditative focus on a source or a set of sources is ideal, but it is often at odds with the ways that contemporary historians conduct their archival research. Limited by time and budgets, scholars often have a chance only to skim archival materials, photographing and cataloguing promising sources for thorough examination at a later date. Digital innovation does offer some significant recourse against this peril. For historians whose key materials have been digitized, the opportunity exists to return to a source over and over. For researchers who can visit a repository with a scanner or a digital camera, the creation of a personal research collection offers a way to revisit and confirm or dispute initial readings.[17] For those taking notes that summarize a particular source, rather than taking a transcription or an image representation, the perspectival process of reading and recording can forever shape the tone and insight of their historical interpretation. Without the possibility of returning to the original or a good facsimile of a source, researchers may forever be left with their first impressions.

Given a significant body of materials, historians’ grasp of the full scope of material develops slowly as they read records. Eventually, they have to step back to try to achieve some bird’s eye view of the partial depiction of the historical events and circumstances described in the sources. A survey of the source material and a thorough grasp of the research questions provide important contexts for structuring the collection and representation of the information that arises from the close reading of individual sources. This summative view can develop during a discursive process of note-taking, outlining, and writing of narrative prose.

Moreover, many historians do not even think about their work as the act of creating data, to say nothing of creating data that might be generalized and shared. Rather, most see their working process as one of creating research notes, designed specifically to answer their individual research question. The notion that other historians might productively make use of the information gleaned from their encounter with the records comes up as an afterthought, if at all. As a result, the data produced in this work mainly resides in the hands of the scholar as locally held spreadsheets, databases, or unstructured notes. The form and the content of this work product are necessarily idiosyncratic, and frequently shaped by the form of the primary sources rather than any sense of a set of possible future use-cases.

For researchers working with medium to large numbers of people or events or other aspects of quantitative information, the summative view is facilitated by the creation of structured data that represents selected entities that are presented within the sources. Unfortunately, the skills necessary to create well-formed structured data are often not a part of historians’ methodological training. As a result, researchers can find themselves constructing a data model more by chance than by intention. However, the logical structure for a data model is sometimes no more self-evident than the meaning of the information being represented. Without a thorough initial survey of the sources at hand it is very difficult to create a data model that is sufficient to represent the content of the sources. Researchers might begin to address a set of records with a preconceived notion of what they should capture as their data. The danger with that approach is that historians will proceed with an initial data model, not realizing until they have invested a significant amount of time and effort in using it that the model fails to adequately represent information that might be central to answering their research question. Then, they will be forced to revise their data model and return to the sources to remediate the missing elements in their representation. Thus, the process of deciding how that information should be categorized and coded to create a legible and useful data set is a key moment of methodological work for the scholar, because the structure for the data can significantly limit the ways it can be used and interpreted.

For some, their only formal exposure to structured and standardized data may be through bibliographic and collections metadata. Every historian is familiar with the basic fields contained in a MARC (Machine-Readable Cataloguing) record based on working with library catalogues.[18] Others might recognize the basic set of fields associated with the Dublin Core Metadata Initiative standard, because this standard serves as the lowest common denominator of information for describing many online digital collections, including the materials aggregated in the Digital Public Library of America and in Europeana.[19] However, few are familiar with the process of creating this metadata and the finer points of these specifications, not to mention the range of other descriptive standards that might be appropriate to apply to the data that they might want to capture and represent from their archival materials.

Moreover, creating a data model to represent historical information gleaned from primary sources is a very different process than that which metadata librarians and collections professionals undertake to create collections metadata. Cataloging metadata refers mostly to the process and context of the creation of the source. In developing research data sets, historians are called upon to read a set of varied primary sources and model the data about the historical people, places, and events described within the content of those sources. This is meso-level, derived data, rather than collections metadata. Thus, scholars are not capturing data about the record as an entity, but rather they are capturing data about the history that the record represents. Though there are some standards and practices that can be marshaled in the creation of a model for this kind of data, the work is much less clear-cut than creating a model for collections data that describes types of materials.

The process of designing a data model requires a kind of structural thinking that is much more akin to systems design. The methods training for contemporary historians often does not involve a specific discussion of the processes and design principles necessary to responsibly collect and model historical research data. In some programs those topics are integrated into an introduction to digital history or digital humanities course, but more frequently this training falls to the university libraries, requiring students to know that they need to take the initiative to seek out the training that they might need. In recent years, this training has most often focused on the concept of tidy data, a set of simple data formation guidelines from statistician and R creator Hadley Wickham that help to ensure that the data is maximally actionable. Wickham explains: “Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”[20] Recording data using this very simple, rectangular structure can dramatically increase the reusability of a dataset. Even so, these principles only begin to touch the many factors historians have to consider when they create a data model. They offer instruction on how to form the data, but not about how to select the appropriate variables to capture.
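The rectangular structure Wickham describes can be illustrated with a minimal sketch. It assumes a hypothetical set of "wide" research notes in which each row is a source document and the people it mentions are spread across repeated column groups; the field names, identifiers, and values are all invented for illustration and do not come from any real data set.

```python
# Hypothetical "wide" notes: one row per source document, with repeated
# column groups for each person mentioned. All values are invented.
wide_rows = [
    {"source_id": "doc-001", "year": 1790,
     "person_1_name": "Person A", "person_1_age": "25",
     "person_2_name": "Person B", "person_2_age": "40"},
    {"source_id": "doc-002", "year": 1795,
     "person_1_name": "Person C", "person_1_age": "30",
     "person_2_name": "", "person_2_age": ""},
]


def to_tidy(rows, max_people=2):
    """Return one row per person mentioned, with one column per variable."""
    tidy = []
    for row in rows:
        for i in range(1, max_people + 1):
            name = row.get(f"person_{i}_name", "")
            if not name:
                continue  # no person recorded in this slot
            tidy.append({
                "source_id": row["source_id"],
                "year": row["year"],
                "name_as_recorded": name,
                "age_as_recorded": row.get(f"person_{i}_age", ""),
            })
    return tidy


tidy_rows = to_tidy(wide_rows)
for r in tidy_rows:
    print(r)
```

The reshaping says nothing about whether name and age are the right variables to capture; it only ensures that whatever is captured sits in a form where each mention of a person is its own observation.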

Because the creation of a data model sets up a rigid structure for the collection and access of that data, it fixes a host of representational choices not only in the selection of variables, but also in the formation of observations. Each cell in a rectangular data set represents a choice that results in a new representation of the past. The first and most difficult choice often lies in what to include and how to represent it semantically. For some historians, the easiest way to collect data from an archival source is to copy the information verbatim from the document. The risk in this approach is that scholars reproduce the ontological assumptions of the record creator and in the process perpetuate the oppressive regimes of power and control that were imposed to subjugate enslaved peoples. If historians choose not to transcribe the record exactly, a host of other choices arise.

As digital humanities librarians Katie Rawson and Trevor Muñoz suggest in their essay “Against Cleaning,” there is no underlying order to be uncovered when working with data. Rather, each effort to shape the data for use results in the creation of new data.[21] Selecting a controlled vocabulary to describe a person’s racial identity cannot offer the degree of complexity necessary to fully render that aspect of identity. Indicating relationship status using contemporary heteronormative nuclear family representations fails to capture the contingency and complexity of family life and fictive kin under enslavement. For data to be computationally actionable, it requires a level of normalization that was not the common practice of the record creators themselves. Scholars may need to normalize spellings of personal names and place names, create fixed dates, or concatenate fields.
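One way to keep normalization visible as the creation of new data is to store the normalized value alongside the transcription rather than in place of it. The following is a minimal sketch under that assumption; the variant table, the place name, and the field names are hypothetical, and a real project would need rules drawn from its own sources.

```python
# Hypothetical lookup from variant spellings to a single normalized form.
PLACE_VARIANTS = {
    "st. marys": "St. Mary's",
    "saint marys": "St. Mary's",
    "st maries": "St. Mary's",
}


def normalize_place(as_recorded):
    """Keep the transcription and record the normalized form next to it."""
    # Build a comparison key: lowercase, drop straight apostrophes,
    # and collapse runs of whitespace.
    key = " ".join(as_recorded.lower().replace("'", "").split())
    normalized = PLACE_VARIANTS.get(key)
    return {
        "place_as_recorded": as_recorded,
        "place_normalized": normalized,  # None when no rule applies
        "normalization_rule": "variant lookup" if normalized else "none",
    }


for spelling in ["St. Mary's", "Saint  Marys", "St Maries", "Elsewhere"]:
    print(normalize_place(spelling))
```

Because the original spelling is never overwritten, a later reader can see which values were altered and by what rule, and can disagree with the choice.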

Then, they might undertake a process of extending their data beyond that which is clearly rendered within the sources. Historians may choose to augment their data with external information not included in their primary sources, such as geospatial information. More significantly, researchers could impute new data from the information in the records. For example, an historian might have a source that indicates the ages of a number of enslaved people at a particular point in time. They might then calculate birth years for those individuals. That might seem like a reasonable choice, but only if they have other sources to corroborate that information and they remain cognizant of the possibility that the record creator might have been practicing age heaping. Each of these decisions incrementally distances the data set from the primary source. Additional complications can arise if more than one historian is working to capture information for a data set, since different individuals might interpret the parameters for the variables slightly differently. This creates an ongoing task of quality control and review. Since this work of representation is not self-evident, objective description, it is always situated and perspectival.
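The birth-year example can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended method: the function name and field names are hypothetical, the two-year range assumes the recorded age is accurate, and the multiple-of-five check is only a crude flag prompted by Block's observation about age heaping, not a test that any particular age was in fact rounded.

```python
def estimate_birth_year(age_as_recorded, record_year):
    """Derive a birth-year range from a recorded age, marked as imputed."""
    age = int(age_as_recorded)
    return {
        "age_as_recorded": age_as_recorded,
        "record_year": record_year,
        # A person recorded as `age` during `record_year` was born in one
        # of two calendar years, depending on whether the birthday had
        # already passed when the record was made.
        "birth_year_earliest": record_year - age - 1,
        "birth_year_latest": record_year - age,
        # Crude flag only: ages that are multiples of five may be rounded.
        "possible_age_heaping": age > 0 and age % 5 == 0,
        "derived": True,  # imputed by the researcher, not in the source
    }


print(estimate_birth_year("25", 1790))
print(estimate_birth_year("24", 1790))
```

Keeping the recorded age, the derived range, and the flag in the same row makes the distance between the source and the data set part of the data itself.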

As a result, those data sets cannot stand on their own without clear and thorough documentation that accounts for the many decision points along the way. This kind of documentation needs to be more than a provenance statement or an effective footnote. For as Anthony Grafton noted in his effort to trace the origins of the footnote as a documentation technique among historians, “Some of the new forms of history rest on evidence that footnotes cannot accommodate—like the massive analyses of statistical data undertaken by historical demographers, which can be verified only when they agree to let colleagues use their computer files.”[22] The documentation for this work needs to be significantly more thorough than the data that is contained in a traditional footnote for it needs not only to point to the original source, but also to outline all of the choices the scholar has made along the way to model, capture, and transform their data.

The Linked Data

While for decades historians have privately constructed systems to capture information from their sources in the service of developing fuller and more insightful interpretation, the emergence of a widely accessible internet in the mid-1990s has fundamentally changed the range of possible approaches to this work. In 2006, Tim Berners-Lee, British architect of the World Wide Web, articulated a vision for a modern web that was made of a vast mesh of truly linked data connecting information across domains using a simple set of principles. Those principles included giving each entity a stable uniform resource identifier (URI) that could be served over the web so that users could locate it; returning useful information, structured using a standard approach, when they did; and, whenever possible, including links to other URIs that would lead to more information about more entities.[23] This vision for the semantic web has been slow to catch on, but it portends a web that is both human readable and machine readable.

More importantly, linked data holds an enormous degree of promise for bringing together scholarly work that was once siloed and disparate. The creation of linked data lets authors be explicit about the relationship between the resources that they are representing on the web, growing a set of connections and elaborating a knowledge base, link by link. Berners-Lee was not thinking about historical data when he set out the principles for the semantic web. They are designed to be general enough that the specific type of “thing” being linked does not matter. But for historians, the “thing” does matter, quite a lot. Having the ability to describe faithfully people, places, and events, and how they relate to one another, based on historical evidence is paramount to doing good digital history.

For historians who are new to working with data, using the principles of linked data to create a functional and clear data model allows them to draw upon a large set of patterns and existing vocabulary systems to represent the data they are trying to capture. The Resource Description Framework (RDF) is one such standard.[24] RDF specifies that relationships among unique pieces of data be expressed in a sentence form: subject, predicate, object. The predicate in question should be a property of a linked data vocabulary. The Linked Open Vocabularies website includes nearly 700 different vocabularies that offer standards-based ways to describe entities on the web.[25] Those vocabularies and the many others available can provide historians with a machine readable way to express their captured data, including those that describe relationships, events, locations, and transactions.
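The sentence form described above can be sketched in a few lines of Python. This is a minimal illustration rather than a working RDF toolchain: the example.org URIs and the personal names are hypothetical placeholders, the triples are held as plain tuples instead of being managed by an RDF library, and the choice of the FOAF "name" property and the Relationship vocabulary's "childOf" property is an assumption made only for the sake of the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical base URI for person records; not a real project address.
BASE = "https://example.org/person/"

# Properties borrowed from existing vocabularies (assumed suitable here).
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
REL_CHILD_OF = "http://purl.org/vocab/relationship/childOf"

triples = [
    Triple(BASE + "1", FOAF_NAME, '"Person One"'),   # object is a literal
    Triple(BASE + "2", FOAF_NAME, '"Person Two"'),   # object is a literal
    Triple(BASE + "2", REL_CHILD_OF, BASE + "1"),    # object is another URI
]


def to_ntriples(t: Triple) -> str:
    """Serialize one statement in an N-Triples-like line."""
    obj = t.obj if t.obj.startswith('"') else f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


def statements_about(subject: str, data: list) -> list:
    """Collect every statement whose subject is the given URI."""
    return [t for t in data if t.subject == subject]


for t in triples:
    print(to_ntriples(t))
```

Because each person has a URI of their own, every new statement with that URI as its subject adds to what is recorded about that person, which is what statements_about gathers.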

Nonetheless, the standards for expressing linked data are constraining. The sentence form of the standard produces a level of simplicity and fixity that does not always align with the messiness and uncertainty of historical knowledge. Furthermore, with linked data the model itself is descriptive because the mechanics of the linkages are semantic, representing not just a relationship, but a particular kind of relationship. The choice of the predicates, the properties that describe the connection between a URI and some other piece of data (another URI, a number, a date, an element of a controlled vocabulary, a string of text), is the heart of the data model, and a particularly fraught element of the work. Most existing controlled vocabularies and linked data schemas are an imperfect fit for the people, places, and events that historians would use them to describe.

Though there are issues with using linked data to model the past, the system does have a number of things to recommend it. One of the benefits of using linked data is that the system allows historians to create a stable space on the web to represent each person within a historic frame. Every single person can have a URI to accumulate knowledge about that person. This approach can go a long way towards combatting the dehumanization that was inherent in the process of enslavement and that carries through in the records that represent it. At the same time, the creation of stable URIs for historical entities makes it possible to examine them at scale through geospatial technologies, timelines, or other kinds of visualizations.

For example, having created individual records to represent a group of enslaved people, the next step might be to create a set of associations among individuals, as members of kinship networks, and as participants in events, each tied to the documentary record. This expanding web of connections begins to capture some of the important experiences for these individuals and communities, but it can never truly capture the full conditions of enslavement or the innumerable checks on everyday freedoms and independent decision making experienced by the enslaved. Furthermore, while it is possible to represent documented kinship networks, a researcher must be cognizant of and make clear the ways that enslavement constrained people’s ability to shape and sustain their own relationships. With little ability to leave the community and little control over free association, what can be said about the networks that enslaved people formed? How can their agency in day-to-day events be recognized while not minimizing the constraints of the situation? Data derived from a documentary record created by the enslavers will only show the events in which individuals participated, not those which were foreclosed to them. Given these variables, the linked data will nonetheless always be a mere surface representation of these very human lives and struggles. And in this respect, the data can never stand on its own without the support of additional interpretative framing.

With these limitations, historians need to work hard to prevent the data model itself from becoming a site of distortion and misrepresentation that projects a false degree of stability and permanence. For example, information about partner relations between enslaved adults comes in many forms. There may be few clearly documented sacramental marriages, but many couples are listed together as parents of children, and others are discussed in terms of family units in ledgers and correspondence. One way to represent these connections is to use the Relationship Vocabulary property “Spouse Of” as the predicate connecting these individuals.[26] That choice signals the likely relationship in question, but it offers no way to note the precarity and uncertainty of those relationships under slavery. The RDF structure lends an impression of stability and fixity to that relationship that likely does not reflect the historical reality. Therefore, the scholar must make those possible distortions clear throughout the many facets of the project.
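The gap between the bare statement and the evidence behind it can be made concrete with a short Python sketch. Everything here is an assumption for illustration: the person and document URIs are hypothetical, and the "evidence" and "note" fields are a project-level convention invented for this example, not a feature of RDF or of the Relationship vocabulary. The sketch shows only that the triple, once extracted, carries none of the qualification that the historian recorded alongside it.

```python
from dataclasses import dataclass, field

# The "Spouse Of" property from the Relationship vocabulary.
REL_SPOUSE_OF = "http://purl.org/vocab/relationship/spouseOf"


@dataclass
class DocumentedStatement:
    """A triple kept together with the historian's account of its basis."""
    subject: str
    predicate: str
    obj: str
    evidence: str                      # hypothetical project-defined term
    sources: list = field(default_factory=list)
    note: str = ""

    def as_triple(self) -> tuple:
        """The bare statement, as it would circulate as linked data."""
        return (self.subject, self.predicate, self.obj)

    def needs_caveat(self) -> bool:
        """True when the relationship is inferred rather than recorded."""
        return self.evidence != "sacramental_record"


statement = DocumentedStatement(
    subject="https://example.org/person/1",        # hypothetical URI
    predicate=REL_SPOUSE_OF,
    obj="https://example.org/person/2",            # hypothetical URI
    evidence="listed_together_as_parents",
    sources=["https://example.org/document/15"],   # hypothetical URI
    note="Named together as parents; no marriage record located.",
)

print(statement.as_triple())     # the caveats are absent from the triple
print(statement.needs_caveat())  # True: the project must flag this link
```

A project that adopts some convention of this kind still has to decide where the note is shown to readers, since nothing in the triple itself obliges a downstream user to look for it.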

In addition to offering a standardized and structured way for historians to represent the data they have captured from their sources on the web, linked data also offers the possibility of building out a vast network of connected and integrated information. The mechanism for creating this network was part of Tim Berners-Lee’s original plan for the semantic web, which calls for using the semantic RDF structure to link to other existing URIs for relevant data. Representations of the same individual across many projects should be tied together through “Same As” links, but the possibilities for representing individuals in different roles and different contexts are nearly endless.

Given the vast numbers of scholars and organizations working to uncover information about individual enslaved people and their lives, undoubtedly these opportunities for integration through linked data will only grow in the coming years. These links might be surfaced through inferencing, or programmatically combing the machine-readable linked data web for likely matches. But those matches must then be reviewed by historians to confirm the likelihood of similarity based on the data’s provenance and to make a decision about the appropriateness of the linkage.
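A small Python sketch can illustrate the division of labor described above, in which a program proposes likely matches and a historian decides whether to assert them. The records, URIs, field names, and the matching rule (same normalized name, birth years within two years) are all hypothetical and deliberately crude; real projects would weigh provenance and far richer evidence. The only fixed point is the owl:sameAs property, which is the standard predicate for declaring two URIs to identify the same entity.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(name: str) -> str:
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def candidate_matches(ours: list, theirs: list, max_year_gap: int = 2) -> list:
    """Propose possible matches; every one starts out unreviewed."""
    candidates = []
    for a in ours:
        for b in theirs:
            same_name = normalize(a["name"]) == normalize(b["name"])
            close_years = abs(a["birth_year"] - b["birth_year"]) <= max_year_gap
            if same_name and close_years:
                candidates.append(
                    {"ours": a["uri"], "theirs": b["uri"], "status": "needs_review"}
                )
    return candidates


def confirmed_links(candidates: list) -> list:
    """Only matches a historian has confirmed become sameAs statements."""
    return [
        (c["ours"], OWL_SAME_AS, c["theirs"])
        for c in candidates
        if c["status"] == "confirmed"
    ]


# Hypothetical records from two separate projects.
ours = [{"uri": "https://example.org/a/person/1", "name": "Person  One", "birth_year": 1800}]
theirs = [
    {"uri": "https://example.net/b/p/77", "name": "person one", "birth_year": 1801},
    {"uri": "https://example.net/b/p/78", "name": "Person One", "birth_year": 1830},
]

candidates = candidate_matches(ours, theirs)
print(candidates)                   # one proposal, still awaiting review
print(confirmed_links(candidates))  # empty until a historian confirms it
```

Keeping the review status separate from the link itself means that a match produced by the program never circulates as an asserted identity until someone has examined the evidence.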

The Replicants

Writing in 1997 about the difficulty of documenting quantitative and data-driven work, Anthony Grafton did not envision a world in which data sets created by historians from a wide variety of fields could easily be shared and accessed on the internet. Currently, however, data developed for individual ends need not stay private. Now, it can be freely shared among scholars not only in private email transactions, but also by posting data sets on the web. Posting an inert spreadsheet in a data repository, or on GitHub, or even on a project web page represents the most basic form of sharing this captured data. In thinking about the ways that digitization of archival sources has transformed historical research, Lara Putnam has warned of the dangers of the radical decontextualization of materials through search aggregation. Digitization can work to obscure the partiality of the historical record because the materials are, in many ways, divorced from their original context and arrangement as they can be plucked from the database that allows for sorting and resorting, and discourages linear access.[27] While this decontextualization is problematic for digitized sources, the issues associated with digital data sets are slightly more complicated in that they involve not only decontextualization, but also replication and recombination.

Every data set served on the web through GitHub, Figshare, or a Dataverse instance allows users to copy and reuse that data, capturing it at a particular moment in time. That user might fork or repost that data, creating another instance of the content and perhaps allowing it to diverge from the original. One of the core principles of digital preservation has been a generalized embrace of the LOCKSS program: Lots Of Copies Keeps Stuff Safe.[28] Launched at Stanford University Libraries in 1999, LOCKSS provides cultural heritage organizations access to appropriate technology and a network of other organizations, each of which is willing to house a preservation copy of the others’ digitized holdings. Sharing the burden of preservation this way makes the network fail-safe through redundancy. The redundancy here is essential to the program’s success, but it is also essential to note that the many copies that are the key to the program are not served to public users; they are part of a long-term preservation strategy. In the case of data sets, that redundancy represents a risk if they do not exist as exact copies of the original.
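Whether a circulating copy is still an exact copy can be checked mechanically. The sketch below, in Python, compares cryptographic fingerprints of a data file's bytes. It assumes that the data creator publishes a SHA-256 digest alongside the data set, which is a practice adopted here for illustration rather than something the repositories named above are known to require, and the sample rows are invented.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a data file's bytes as hex text."""
    return hashlib.sha256(data).hexdigest()


def is_exact_copy(copy: bytes, published_fingerprint: str) -> bool:
    """True only if the copy is byte-for-byte identical to the original."""
    return fingerprint(copy) == published_fingerprint


# Invented sample data standing in for a published CSV file.
original = b"id,name,year\n1,Person One,1800\n"
published = fingerprint(original)

exact_copy = bytes(original)
edited_copy = original.replace(b"1800", b"1801")  # a silent "correction"

print(is_exact_copy(exact_copy, published))   # True
print(is_exact_copy(edited_copy, published))  # False
```

A digest of this kind detects that a copy has diverged, but it says nothing about why or how; that explanation still has to come from documentation.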

Of course, for significant portions of the cultural record, multiple copies and editions are the normal course of business. In the bibliographic universe, librarians have devised a system to track and represent the relationships of those copies to one another: Functional Requirements for Bibliographic Records (FRBR). This model relates the equivalent entities of the work to its expressions, its manifestations, and individual items. The model also includes derivative works, such as translations. Finally, the model allows for new works that are related to the “original.”[29] Embracing a FRBR-like conceptual model for representing the relationships among data sets and their variations requires a shared understanding among data publishers and users about the ways that they will produce and track their variations. FRBR works in a universe of corporate bodies that work as publishers, registering their work with ISBNs. The universe of data set production and publication is significantly more freewheeling at the moment. Data publishers could embrace the use of the Digital Object Identifier (DOI) system to provide persistent unique pointers to individual datasets, minting a new DOI for each updated edition of the data.[30] This approach would help to limit some of the confusion that stems from the regular updating and augmenting of data from an individual source.

Unfortunately, minting DOIs for data sets does not solve the provenance issues that arise when many data sets are aggregated. The infrastructure of the web offers multiple ways to aggregate that data into new interfaces and applications. The promise of this aggregation is a vastly expanded research base of information about the past. But the peril of this aggregation is that with each step of combination and replication, the data about the past becomes further and further distanced from the shaping contexts of its creation. While each variable in an observation about historical people, places, or events is the product of the perspectives of record creators, archival professionals, and the historians who captured and shaped that data, almost none of that context travels with the data on the web. Most often this data travels with extremely thin contextual information, consisting of a provenance link to a digitized source or to a finding aid for an archival collection. In situations where a historian might visit a data aggregation, perform a search, and download a newly created data set to their local computer, that link to the data’s contexts of creation might be lost forever. Moreover, that historian has just created a new edition of that data that might be manipulated, augmented, and republished, divorcing it further from all necessary documentation. These data sets, loosed from the context of the individual sources from which they have been derived, and absent clear documentation about their formation, have the potential to represent a clear break in the methodological agreement of professional historians to be transparent about the materials on which their interpretive conclusions are built.

Rather than aggregating data into a new service instance that might fall out of sync with data updates or corrections, historians would be better served by making use of the affordances of the linked data web and application programming interfaces to serve the data from a single source, tied to clear and thorough documentation. Without documentation from the historian who created the data set that ties the data directly to the primary sources from which it was derived and explains the way that each cell was imputed or augmented, a researcher cannot responsibly use the data in their own work. While historians work within the common conventions of the field, those conventions can only establish a bare baseline of shared understanding in the work of interpreting data. As a result, historians must commit to publishing their data with self-reflexive position statements about the creation of the data set (in effect fieldwork journaling, not unlike that of our colleagues who do ethnographic work) because the creation of a data set is a complex process of abstraction, augmentation, and modeling that moves from historical record to rectangular data, and sometimes to linked data. This process-level exposure reveals the epistemological and ontological assumptions with which historians address and organize their work. Narrating the assumptions that have gone into that process is the only way to make the results usable by others. Thus, serving the data and the documentation through a linked data infrastructure makes it possible for external users to access the data for their own use without sacrificing their surety about its version and its context of creation.


[1]    Robert William Fogel and Stanley L. Engerman, Time on the Cross: The Economics of American Slavery, Reissue edition (New York: W. W. Norton & Company, 1995).

[2]    Herbert G. Gutman, Slavery and the Numbers Game: A Critique of Time on the Cross (Urbana: University of Illinois Press, 2003).

[3]    Johanna Drucker, “Humanities Approaches to Graphical Display,” Digital Humanities Quarterly 005, no. 1 (March 10, 2011): Paragraph 3.

[4]    Sowande M. Mustakeem, Slavery at Sea: Terror, Sex, and Sickness in the Middle Passage (Urbana: University of Illinois Press, 2016).

[5]    Jessica Marie Johnson, “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads,” Social Text 36 (2018): 65.

[6]    Sharon Block, Colonial Complexions: Race and Bodies in Eighteenth-Century America (Philadelphia: University of Pennsylvania Press, 2018), 37–38.

[7]    Elizabeth Yale, “The History of Archives: The State of the Discipline,” Book History 18, no. 1 (October 30, 2015): 332–59.

[8]    Jacques Derrida, Archive Fever: A Freudian Impression, trans. Eric Prenowitz, 1 edition (Chicago, Ill.: University of Chicago Press, 1998); Michel Foucault, The Archaeology of Knowledge: And the Discourse on Language (New York, NY: Vintage, 1982).

[9]    Carolyn Steedman, Dust, 1 edition (Manchester: Manchester University Press, 2002); Ann Laura Stoler, Along the Archival Grain: Epistemic Anxieties and Colonial Common Sense (Princeton, NJ: Princeton University Press, 2010); Michel-Rolph Trouillot, Silencing the Past: Power and the Production of History, 20th Anniversary Edition, 2nd Revised edition (Boston, Massachusetts: Beacon Press, 2015); Antoinette Burton, ed., Archive Stories: Facts, Fictions, and the Writing of History (Duke University Press Books, 2006).

[10]   Keith Breckenridge, “The Politics of the Parallel Archive: Digital Imperialism and the Future of Record-Keeping in the Age of Digital Reproduction,” Journal of Southern African Studies 40, no. 3 (May 4, 2014): 499–519.

[11]   M. L. Caswell, “‘The Archive’ Is Not an Archives: On Acknowledging the Intellectual Contributions of Archival Studies,” Reconstruction: Studies in Contemporary Culture 16, no. 1 (August 4, 2016); Terry Cook, “The Archive(s) Is a Foreign Country: Historians, Archivists, and the Changing Archival Landscape,” The Canadian Historical Review 90, no. 3 (September 16, 2009): 497–534.

[12]   Hilary Jenkinson, A Manual of Archive Administration (London?: P. Lund, Humphries & co., ltd., 1937). See also Francis X. Blouin and William G. Rosenberg, Processing the Past: Contesting Authorities in History and the Archives (New York; Oxford: Oxford University Press, 2013).

[13]   T. R. Schellenberg, Modern Archives; Principles and Techniques, Archival Classics Reprints (Chicago, Ill.: Society of American Archivists, 1996).

[14]   Michelle Caswell, T-Kay Sangwand, and Ricardo Punzalan, “Critical Archival Studies: An Introduction,” Journal of Critical Library and Information Studies 1, no. 2 (2017): 2.

[15]   Anthony W. Dunbar, “Introducing Critical Race Theory to Archival Discourse: Getting the Conversation Started,” Archival Science 6, no. 1 (March 1, 2006): 116.

[16]   Mark A. Greene and Dennis Meissner, “More Product, Less Process: Revamping Traditional Archival Processing,” The American Archivist 68, no. 2 (October 1, 2005): 208–63.

[17]   Roger Schonfeld and Jennifer Rutner, “Supporting the Changing Research Practices of Historians” (Ithaka S+R, 2012).

[18]   MARC Standards.

[19]   Dublin Core Metadata Initiative; Digital Public Library of America; Europeana.

[20]   Hadley Wickham, “Tidy Data,” Journal of Statistical Software 59, no. 10 (August 2014): 1–5.

[21]   Katie Rawson and Trevor Muñoz, “Against Cleaning,” Curating Menus, July 6, 2016.

[22]   Anthony Grafton, The Footnote: A Curious History (Harvard University Press, 1999), 15.

[23]   Tim Berners-Lee, “Linked Data,” July 27, 2006.

[24]   “RDF 1.1 Concepts and Abstract Syntax,” W3C, February 25, 2014.

[25]   Linked Open Vocabularies.

[26]   Relationship: A Vocabulary for Describing Relationships Between People.

[27]   Lara Putnam, “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast,” The American Historical Review 121, no. 2 (April 1, 2016): 392.

[28]   LOCKSS.

[29]   Barbara B. Tillett, What is FRBR? A Conceptual Model for the Bibliographical Universe (Washington, DC: Library of Congress, 2003).

[30]   The DOI System.


Sharon Leon
Sharon M. Leon is an Associate Professor of History at Michigan State University, where she specializes in digital methods with a focus on public history. She is also the Director of the Omeka suite of web publishing software platforms. As a result, Dr. Leon is often pursuing many research tracks at once. Currently, she is at work on a digital project to surface and analyze the community networks and experiences of the cohort of people enslaved and sold by the Maryland Province Jesuits in the Eighteenth and Nineteenth Centuries. Simultaneously, she is building a major methodological project on doing community-engaged digital public history. Dr. Leon received her bachelor of arts degree in American Studies from Georgetown University in 1997 and her doctorate in American Studies from the University of Minnesota in 2004. Her first book, An Image of God: the Catholic Struggle with Eugenics, was published by University of Chicago Press (May 2013). Prior to joining the History Department at MSU, Dr. Leon spent over thirteen years at George Mason University’s History Department at the Roy Rosenzweig Center for History and New Media as Director of Public Projects, where she oversaw dozens of award-winning collaborations with library, museum, and archive partners from around the country.