Analyzing and Visualizing Text with Constellate and ProQuest TDM Studio


Week 3 of the Data Jam focused on Visualizing Texts and Networks. Text analysis (also referred to as text mining, data mining, or TDM) is the use of computational methods to derive information from texts: searching, finding patterns, discovering relationships, and analyzing content to gain insights into a research question. In this workshop, we tested two platforms for sourcing text data for visualization and analysis: Constellate, a text analytics platform from JSTOR Labs; and ProQuest TDM Studio, an end-to-end text and data mining solution from ProQuest. Watch the recording and read through the workshop notes to compare the two platforms, practice running searches, and learn more about text analysis in general. Here are the key takeaways we covered:

1. Introducing Text Analysis

To understand the two platforms for sourcing text data, we reviewed the text analysis process and the vocabulary used in this method. Text analysis is often used to gain a better understanding of large volumes of content, whether to discover new resources via non-traditional search methods or to identify possible topics and research questions. We reviewed the process for getting started in text analysis and the key terms used across platforms. We also discussed some text analysis projects, like the Fan Engagement Meter, to see how text analysis draws on a number of different methods and tools for exploring research questions.

A slide entitled "Text Analysis Research Process." The slide lists the following, from top to bottom: Formulate a Research Question, Build a Corpus, Clean Data, Analyze Text, Gain Insights.
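To make the middle steps of that process concrete, here is a minimal Python sketch of "Build a Corpus," "Clean Data," and "Analyze Text." The two sample sentences, the short stopword list, and the function names `clean` and `top_terms` are our own illustrative assumptions; they are not part of either platform.

```python
import re
from collections import Counter

# Tiny hypothetical corpus standing in for the "Build a Corpus" step.
corpus = [
    "Amtrak trains connect Chicago and New York.",
    "The government funds Amtrak; the railroad serves Chicago.",
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in"}

def clean(text):
    """Clean Data: lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def top_terms(documents, n=3):
    """Analyze Text: count cleaned tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(clean(doc))
    return counts.most_common(n)

print(top_terms(corpus, 2))
```

A real project would likely swap in a fuller stopword list and a more careful tokenizer, but the shape of the workflow stays the same.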

You can find the Penn Libraries guide to Text and Data Mining at guides.library.upenn.edu/penntdm. 

2. Constellate (constellate.org)

Constellate is a new text and data analytics service from JSTOR and Portico. Constellate provides users with the ability to build datasets for analysis from a variety of sources, perform basic text analysis and visualizations, and gather with a growing community of practitioners to share text analytics materials. Constellate provides text and data analysis capabilities and access to content from a variety of databases in an open environment with teaching materials that can be used, modified, and shared.

Constellate provides content from JSTOR (journal articles, book chapters, research reports, pamphlets), Portico (journal articles, book chapters, full books), Chronicling America (historical newspapers, 1789-1963), Doc South (documents, books), South Asia Open Archives (journal articles, reports, newspapers, periodicals, pamphlets, and surveys), and Reveal Digital (alternative press, newspapers, magazines, journals).

Term frequency chart of unigram frequency across the dataset. The term "amtrak" appears in 100% of documents over time, while terms like "chicago", "railroad", and "government" fluctuate.
A term frequency chart generated by the Constellate platform, using a dataset focused on the keyword "amtrak"
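A chart like this can be approximated by computing, for each year, the share of documents that contain a given term. The sketch below is one way to do that in Python. The sample documents and the function name `document_share_by_year` are hypothetical, and we are assuming the chart reports the share of documents per year; Constellate's exact calculation is not described here.

```python
from collections import defaultdict

# Hypothetical documents: (publication year, set of unigrams in the document).
documents = [
    (1975, {"amtrak", "chicago", "railroad"}),
    (1975, {"amtrak", "government"}),
    (1980, {"amtrak", "railroad"}),
    (1980, {"amtrak", "railroad", "government"}),
]

def document_share_by_year(docs, term):
    """Return {year: fraction of that year's documents containing term}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, terms in docs:
        totals[year] += 1
        if term in terms:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

for term in ("amtrak", "railroad", "chicago"):
    print(term, document_share_by_year(documents, term))
```

In this toy data, "amtrak" appears in every document in both years, while "railroad" and "chicago" vary from year to year, which mirrors the pattern in the chart.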

 

For more on how to use Constellate, check out the How-To Guides.  

3. ProQuest TDM Studio (tdmstudio.proquest.com)

ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from news, scholarly, and other publications provided to the University of Pennsylvania Libraries via current ProQuest subscriptions. Currently, TDM Studio offers access to 176 ProQuest databases, or 51,711 publications (magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers such as the Wall Street Journal, the New York Times, and the Washington Post).

A topic list of keywords based on a dataset related to the keyword "Amtrak".
Three of the five topics produced for the Amtrak dataset, showing the frequency of topics over time

For more on how to use ProQuest TDM Studio, check out these videos, resources, and guides.  

4. Comparisons

These two platforms are both actively under development, and offer similar but distinct opportunities for researchers to engage in the text analysis process. This chart offers a quick comparison of the two platforms for sourcing, visualizing, and analyzing textual datasets. 

PROCESSING DATA
  Constellate: Jupyter Notebooks
  TDM Studio: Jupyter Notebooks

EXPORTING DATA
  Constellate: A JSON-L dataset containing the n-grams, full text, and metadata
  TDM Studio: Rolling 7-day export limit of 15 MB

BUILT-IN TOOLS FOR VISUALIZING DATA
  Constellate:
  • Number of Documents Over Time
  • Key phrases
  • Term Frequency
  • Document Categories over time
  • Category Treemap
  TDM Studio:
  • Geographic Analysis
  • Topic Modeling

DATASETS
  Constellate:
  • JSTOR
  • Portico
  • Chronicling America
  • Reveal Digital
  • Doc South
  • South Asia Open Archives
  TDM Studio:
  • 176 Databases
  • 51,711 Publications, including current newspapers

DATASET SIZE
  Constellate: 50,000 items per dataset
  TDM Studio:
  • Up to 2 million documents per dataset (10 datasets max) for Workbench Dashboard
  • Up to 10,000 documents per dataset (5 datasets max) for Visualization Dashboard

ACCESS
  Constellate: Access provided through University of Pennsylvania
  TDM Studio: Contact RDDS Team for information
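
Because Constellate exports a JSON-L dataset (one JSON object per line), the file can be read in a Jupyter Notebook with Python's standard library alone. The sketch below shows one possible approach using two made-up records. The field names `id`, `publicationYear`, and `unigramCount` are assumptions for illustration and should be checked against a real export.

```python
import io
import json
from collections import Counter

# Two hypothetical records in JSON Lines form (one JSON object per line).
# The field names "id", "publicationYear", and "unigramCount" are assumptions
# for illustration; check them against a real export.
sample_export = io.StringIO(
    '{"id": "doc-1", "publicationYear": 1975, "unigramCount": {"amtrak": 3, "chicago": 1}}\n'
    '{"id": "doc-2", "publicationYear": 1980, "unigramCount": {"amtrak": 2, "railroad": 4}}\n'
)

def read_jsonl(handle):
    """Yield one parsed record per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

def total_unigrams(records):
    """Sum per-document unigram counts into one corpus-wide Counter."""
    totals = Counter()
    for record in records:
        totals.update(record.get("unigramCount", {}))
    return totals

totals = total_unigrams(read_jsonl(sample_export))
print(totals.most_common(2))
```

For a downloaded file, the in-memory `sample_export` would be replaced by something like `open(path, encoding="utf-8")`. If the download is compressed, `gzip.open(path, "rt", encoding="utf-8")` can be passed to `read_jsonl` in the same way.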

 

 

About the Author

Emily Esten
Arnold and Deanne Kaplan Collection of Early American Judaica Curator of Digital Humanities
As the inaugural Kaplan Curator, Emily Esten spearheads projects that facilitate access to and use of Penn's Judaica collections, promoting them and making connections between them and dispersed Judaica content around the globe. She is also responsible for curating, building, and researching the Arnold and Deanne Kaplan Collection of Early American Judaica. In addition, she coordinates the Scribes of the Cairo Geniza project, including the rollout of phase II.

In addition to her role at the Penn Libraries, she is the Web Manager for Contingent Magazine and the Director of Communications for the National Emerging Museum Professionals Network. Previously, she worked at the Edward M. Kennedy Institute for the United States Senate and at Brown University.

Emily holds a Bachelor of Arts degree, with majors in history and digital humanities, from the University of Massachusetts Amherst and a Master's Degree in public humanities from Brown University.