Analyzing and Visualizing Text with Constellate and ProQuest TDM Studio
Interested in working with large volumes of published content from news, scholarly, and other publications? Constellate and ProQuest TDM Studio are library services that allow researchers to mine data from a variety of sources and start visualizing, analyzing, and exporting data right away. In this workshop, we compared these two platforms, performed searches, and learned about text analysis more generally.
Week 3 of the Data Jam focused on Visualizing Texts and Networks. Text analysis (also referred to as text mining, data mining, or TDM) is the use of computational methods to derive information from texts: to search, find patterns, discover relationships, and analyze, in order to gain insights for a research question. In this workshop, we tested two platforms for sourcing text data for visualization and analysis: Constellate, a text analytics platform from JSTOR Labs; and ProQuest TDM Studio, an end-to-end text and data mining solution from ProQuest. Watch the recording and read through the workshop notes to learn more about how the two platforms compare, how to perform searches, and text analysis more generally.
Here are the key takeaways we covered:
1. Introducing Text Analysis
To understand the two platforms for sourcing text data, we reviewed the text analysis process and the vocabulary used with this method. Text analysis is often used as a way to gain a better understanding of large volumes of content - whether to discover new resources via non-traditional search methods or to identify possible topics and research questions. We reviewed the process for getting started in text analysis and the key terms used across platforms. We also discussed some text analysis projects, like the Fan Engagement Meter, to understand how text analysis includes a number of different methods and tools for exploring research questions.
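To make the vocabulary concrete, here is a minimal sketch of two common first steps in text analysis: splitting text into tokens and counting terms and n-grams (sequences of n adjacent tokens). The sample sentence and the helper names `tokenize` and `ngrams` are invented for illustration and are not taken from the workshop or from either platform.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample text, used only to illustrate the idea.
sample = "Text analysis finds patterns in text. Text analysis supports research."

tokens = tokenize(sample)
term_counts = Counter(tokens)               # frequency of single words (unigrams)
bigram_counts = Counter(ngrams(tokens, 2))  # frequency of adjacent word pairs

print(term_counts.most_common(3))
print(bigram_counts.most_common(1))
```

Real projects usually add steps such as removing stop words or normalizing word forms, but the counting idea stays the same.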
Constellate is a new text and data analytics service from JSTOR and Portico. Constellate provides users with the ability to build datasets for analysis from a variety of sources, perform basic text analysis and visualizations, and gather with a growing community of practitioners to share text analytics materials. Constellate provides text and data analysis capabilities and access to content from a variety of databases in an open environment with teaching materials that can be used, modified, and shared.
Constellate provides content from JSTOR (journal articles, book chapters, research reports, pamphlets), Portico (journal articles, book chapters, full books), Chronicling America (historical newspapers, 1789-1963), Doc South (documents, books), South Asia Open Archives (journal articles, reports, newspapers, periodicals, pamphlets, and surveys), and Reveal Digital (alternative press, newspapers, magazines, journals).
ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from news, scholarly, and other publications provided to the University of Pennsylvania Libraries via current ProQuest subscriptions. Currently, TDM Studio offers access to 176 ProQuest databases, or 51,711 publications (magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers such as the Wall Street Journal, the New York Times, and the Washington Post).
These two platforms are both under active development and offer similar but distinct opportunities for researchers to engage in the text analysis process. This chart offers a quick comparison of the two platforms for sourcing, visualizing, and analyzing textual datasets.
A JSON-L dataset containing the n-grams, full-text and metadata
Rolling, 7-day export limit of 15MB
Built-in Tools for Visualizing Data
Number of Documents Over Time
Document Categories over time
South Asia Open Archives
51,711 Publications, including current newspapers
50,000 items per dataset
Up to 2 million documents per dataset (10 datasets max) for Workbench Dashboard
Up to 10,000 documents per dataset (5 datasets max) for Visualization Dashboard
Access provided through University of Pennsylvania
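
The comparison above mentions a JSON-L dataset export (one JSON record per line) and a built-in "Number of Documents Over Time" view. As a rough sketch of how such a file could be read and summarized outside either platform, the example below parses JSON Lines records and counts documents per year. The field names `id`, `publicationYear`, and `unigramCount`, along with the sample records, are assumptions made for illustration; check the actual export for its real schema.

```python
import io
import json
from collections import Counter

def read_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

def documents_per_year(records, year_field="publicationYear"):
    """Count records per year, skipping records without the year field."""
    counts = Counter()
    for record in records:
        year = record.get(year_field)
        if year is not None:
            counts[year] += 1
    return dict(sorted(counts.items()))

# Hypothetical records standing in for an exported dataset file.
sample = io.StringIO(
    '{"id": "doc-1", "publicationYear": 1901, "unigramCount": {"press": 2}}\n'
    '{"id": "doc-2", "publicationYear": 1901, "unigramCount": {"news": 1}}\n'
    '\n'
    '{"id": "doc-3", "publicationYear": 1950, "unigramCount": {"press": 1}}\n'
)

records = list(read_jsonl(sample))
per_year = documents_per_year(records)
print(per_year)
```

For a real export, the `io.StringIO` sample would be replaced with an opened file, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded dataset.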