Week 3 of the Data Jam focused on Visualizing Texts and Networks. Text analysis (also referred to as text mining, data mining, or TDM) is the use of computational methods to derive information from texts: searching, finding patterns, discovering relationships, and analyzing the results to gain insight into a research question. In this workshop, we tested two platforms for sourcing text data for visualization and analysis: Constellate, a text analytics platform from JSTOR Labs; and ProQuest TDM Studio, an end-to-end text and data mining solution from ProQuest. Watch the recording and read through the workshop notes to compare these two platforms, practice performing searches, and learn more about text analysis in general. Here are the key takeaways we covered:
1. Introducing Text Analysis
To understand the two platforms for sourcing text data, we first reviewed the text analysis process and the vocabulary used in this method. Text analysis is often used to gain a better understanding of large volumes of content, whether to discover new resources via non-traditional search methods or to identify possible topics and research questions. We reviewed the process for getting started in text analysis and the key terms used across platforms. We also discussed some text analysis projects, like the Fan Engagement Meter, to see how text analysis draws on a number of different methods and tools for exploring research questions.
You can find the Penn Libraries guide to Text and Data Mining at guides.library.upenn.edu/penntdm.
2. Constellate (constellate.org)
Constellate is a new text and data analytics service from JSTOR and Portico. It lets users build datasets for analysis from a variety of sources, perform basic text analysis and visualizations, and connect with a growing community of practitioners to share text analytics materials. Its analysis tools and content, drawn from a variety of databases, are offered in an open environment with teaching materials that can be used, modified, and shared.
Constellate provides content from JSTOR (journal articles, book chapters, research reports, pamphlets), Portico (journal articles, book chapters, full books), Chronicling America (historical newspapers, 1789-1963), Doc South (documents, books), South Asia Open Archives (journal articles, reports, newspapers, periodicals, pamphlets, and surveys), and Reveal Digital (alternative press, newspapers, magazines, journals).
For more on how to use Constellate, check out the How-To Guides.
3. ProQuest TDM Studio (tdmstudio.proquest.com)
ProQuest TDM Studio allows researchers to mine and computationally analyze large volumes of published content from news, scholarly, and other publications provided to the University of Pennsylvania Libraries via current ProQuest subscriptions. Currently, TDM Studio offers access to 176 ProQuest databases, or 51,711 publications (magazines, books, conference papers, dissertations and theses, scholarly journals, and current and historical newspapers such as the Wall Street Journal, the New York Times, and the Washington Post).
For more on how to use ProQuest TDM Studio, check out these videos, resources, and guides.
Both platforms are under active development and offer similar but distinct opportunities for researchers to engage in the text analysis process. This chart offers a quick comparison of the two platforms for sourcing, visualizing, and analyzing textual datasets.
| |Constellate|ProQuest TDM Studio|
|---|---|---|
|PROCESSING DATA|Jupyter Notebooks|Jupyter Notebooks|
|EXPORTING DATA|A JSON-L dataset containing the n-grams, full-text and metadata|Rolling, 7-day export limit of 15MB|
|BUILT-IN TOOLS FOR VISUALIZING DATA| | |
|DATASET SIZE|50,000 items per dataset| |
|ACCESS|Access provided through University of Pennsylvania|Contact RDDS Team for information|
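
The chart lists a JSON-L dataset (one JSON record per line) containing n-grams, full text, and metadata as an export format. As a rough illustration of what processing such a file in a notebook could look like, here is a minimal Python sketch that totals single-word (unigram) counts across all documents. The field names `id`, `title`, and `unigramCount`, the helper `aggregate_unigrams`, and the two sample records are assumptions made up for this example; check an actual export for the real field names before relying on them.

```python
import json
from collections import Counter
from typing import Iterable


def aggregate_unigrams(lines: Iterable[str], field: str = "unigramCount") -> Counter:
    """Sum per-document unigram counts across a JSON Lines dataset.

    `field` is the (assumed) name of the key that maps each word
    to the number of times it appears in one document.
    """
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Lowercase terms so "Data" and "data" are counted together.
        for term, count in record.get(field, {}).items():
            totals[term.lower()] += count
    return totals


# Two made-up records standing in for an exported dataset.
sample = [
    '{"id": "doc-1", "title": "First", "unigramCount": {"Data": 2, "text": 1}}',
    '{"id": "doc-2", "title": "Second", "unigramCount": {"data": 1, "mining": 3}}',
]

totals = aggregate_unigrams(sample)
print(totals.most_common(2))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample`, so a large export is read one record at a time rather than loaded into memory all at once.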