Identifying and improving AGU collaborations using network analysis and scientometrics

The American Geophysical Union (AGU) is an Earth and space science professional society based in the United States. Research conducted by AGU members ranges from the Earth’s deep interior to the outer planets of our solar system. However, little research exists on the AGU meeting itself. In this work, we apply network analysis and scientometrics to 17 years of AGU Fall Meetings. We are interested in the AGU network structure and what its properties can tell us about how the procedures of the AGU Fall Meeting can be enhanced to facilitate better scientific communication and collaboration. We quantify several network properties and illustrate how this type of analysis can enhance meeting planning and layout. We conclude with practical strategies for the AGU Program Committee.


Introduction
The American Geophysical Union (AGU) is an Earth and space science professional society based in the United States. AGU publishes scientific journals, sponsors meetings, and supports education and outreach efforts to promote public understanding of science. Research conducted by AGU members ranges from the Earth's deep interior to the outer planets of our solar system. Despite the "American" in its name, roughly 40% of the AGU's membership comes from outside of the U.S. Each year, the AGU hosts a Fall Meeting that draws tens of thousands of participants. The research presented at these meetings has been discussed and debated extensively. However, little research exists on the AGU meeting itself. In this work, we apply network analysis and scientometrics to seventeen years of AGU Fall Meetings. We model the AGU Fall Meetings as graphs in which presentation co-authors are connected nodes and analyze these graphs to ascertain their structure and properties. We are interested in what the structure and network properties can tell us about the scientometrics of the AGU.

Scientometrics is the science of measuring and analyzing science itself, such as a discipline's structure, growth, change, and interrelations (Hood and Wilson, 2001). Vassily Nalimov first coined the term in the 1960s, and subsequent work has focused on a discipline's methodologies and principles as well as individual researchers' scientific output (Braun, Glänzel, and Schubert, 2006; Hirsch, 2005). Here, we are using "scientometrics" in the general sense of "the science of science" to understand how science operates and can be improved. Our work is an exploration of possible approaches to developing scientometrics within the Earth and space sciences. We are interested in how science collaboration and networking are taking place and how the procedures of the AGU Fall Meeting could be enhanced to facilitate better scientific communication and collaboration. We provide suggestions on how our work can be operationalized, although it is not yet at an operational stage.

Dataset, Assumptions, and Limitations

Dataset
The data in this study came from the AGU Abstract Browser. The Abstract Browser is a publicly available database of historical abstracts presented at AGU meetings. This database contains abstracts from meetings other than the Fall Meeting, such as the Ocean Sciences Meetings; however, we limited our study to Fall Meetings only. The Fall Meetings are multi-disciplinary and provide the largest, most comprehensive subset of data available. Restricting our study to Fall Meetings provides the most data and also ensures equal coverage of the sub-domains covered by AGU.
Our study includes 17 years of data and covers the Fall Meetings from 2000 to 2017.

The AGU is divided into sections representing the subdisciplines of Earth and space science. As science evolves over the years, new sections are formed, and older ones can be merged or dissolved. The sections on which we had data to perform our analysis are listed in Table 1.

Limitations and Assumptions
The Abstract Browser contains Fall Meeting data such as sessions held, presentations given in each session (including title, authors, affiliations, and an abstract), and the AGU section in which the session was held. However, the author data contains only email address, last name, and initials. Moreover, the same author sometimes appears with only a first initial and at other times with both a first and middle initial. The first author of this study is a prime example. He appears in the abstract database as both T. W. Narock and T. Narock. This raises significant challenges for automatically disambiguating people. Further complicating this issue is the case where authors change institutions. For example, T. Narock appears with his graduate school email address and later with the email address of his affiliation post-graduation. Each author does have an organizational affiliation provided; however, this data is also messy and difficult to use for disambiguation. There is no standard naming convention, and the same institution often appears with multiple names. For example, the NASA Goddard Space Flight Center is listed as NASA/Goddard, NASA/GSFC, and NASA/Goddard Space Flight Center. Ideally, authors would be listed with their ORCID (Haak et al., 2012); however, at present, such data is not available via any public AGU interface that we are aware of.

Lacking the means to perform a large-scale crowdsourced disambiguation project, we sought other means to disambiguate authors. We considered email address to be a unique and distinguishing feature. Because the same person can appear under more than one email address, the network graphs we construct from the AGU data likely have multiple nodes representing the same person. We therefore consider the network analysis portion of our study a lower limit. We know that the actual values for network density and connected components are not lower than the values reported here, and they would likely be a bit higher had we been able to uniquely identify all authors in our dataset. Despite this limitation, we feel our analysis can still provide useful insights into the AGU meetings.
All networks consist of nodes (also called vertices) and edges (connections between the nodes). Networks can be directed or undirected. In a directed network, edges have directionality. Twitter is an example: user A can follow user B; however, user B is not obligated to follow user A back, so the edge between users A and B has a direction. In an undirected network, all edges are bidirectional by default. This is how "friending" works on Facebook. Both users (nodes) must agree to the "friendship" before a link (edge) is created. No directed edges are allowed in an undirected network.
We model each AGU section as an undirected network based on co-authorship. If A co-authored a presentation with B and C, then A, B, and C become nodes in the network with bidirectional links between each pair (e.g. A-B, A-C, B-C). We do not apply any weighting to the edges. If authors A and B co-authored a presentation at the 2000 Fall Meeting and then again at the 2010 Fall Meeting, this adds no new information to the graph. We also consider edges to be eternal when studying the temporal evolution of the network. For example, if authors A and B co-authored a presentation at the 2000 Fall Meeting, these nodes and edges persist in 2017 even if those authors never co-authored another presentation. We also note that we are measuring co-authorship and not necessarily collaboration. Our dataset does not contain references and acknowledgements used in presentations. These secondary connections (e.g. citing a paper or acknowledging a discussion) do not show up as edges in our graphs.
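The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the study's released software: the function name `build_coauthor_graph` and the toy author labels are hypothetical, and each presentation is assumed to arrive as a plain list of author identifiers (for example, email addresses).

```python
from itertools import combinations

def build_coauthor_graph(presentations):
    """Return (nodes, edges) for an undirected, unweighted co-authorship graph.

    `presentations` is an iterable of author-identifier lists, one per
    presentation (hypothetical input format). Each edge is a frozenset, so
    A-B and B-A are the same edge, and storing edges in a set means a
    repeated co-authorship adds no new information.
    """
    nodes, edges = set(), set()
    for authors in presentations:
        unique_authors = set(authors)
        nodes.update(unique_authors)
        # Link every pair of co-authors on this presentation.
        for a, b in combinations(sorted(unique_authors), 2):
            edges.add(frozenset((a, b)))
    return nodes, edges

# Toy data: A, B, and C share one presentation; A and B co-author again later.
nodes, edges = build_coauthor_graph([["A", "B", "C"], ["A", "B"]])
```

In this toy case the second presentation leaves the graph unchanged, which mirrors the unweighted-edge choice described above.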

The analysis software used in this study is freely and publicly available from Narock et al. (2019). The graph data generated from our software is available in Narock et al. (2018a).

Network Analysis

Network Density
Network density is defined as the ratio of actual connections to possible connections. Possible values for network density range from 0 (no connections at all) to 1 (everyone is connected to everyone else). Figure 1 illustrates the concept of network density on sample networks. In 1.) of Figure 1 there are three nodes and three potential connections. All three potential connections are realized because every node is connected to every other node. This is representative of the AGU case in which A, B, and C have co-authored presentations with each other, although not necessarily the same presentation. The network in 1.) has a density of 3/3 = 1.
The network shown in 2.) has the same three potential connections. However, only two of them are realized. In this example, A has co-authored a presentation with B and B has co-authored a presentation with C; yet, A has not co-authored a presentation with C. The network in 2.) has a density of 2/3 ≈ 0.67.
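The two sample networks can be checked with a short calculation. The sketch below assumes the standard count of possible connections in an undirected network, n(n-1)/2 for n nodes; the function name `network_density` is our own and is not taken from the study's software.

```python
def network_density(num_nodes, num_edges):
    """Ratio of actual to possible connections in an undirected network."""
    if num_nodes < 2:
        # With fewer than two nodes no connection is possible.
        return 0.0
    possible = num_nodes * (num_nodes - 1) // 2
    return num_edges / possible

density_1 = network_density(3, 3)  # network 1.): A-B, A-C, B-C
density_2 = network_density(3, 2)  # network 2.): A-B, B-C only
```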
It's unlikely that a real-world network such as the AGU would have a network density of 1. Given the diversity of research topics, it's unlikely that the network would be completely connected. But what are the actual density values, and how do they change over time?

To answer these questions, we first considered each AGU section to be its own network. Yearly network graphs were then created for each section using the Abstract Browser data. Next, we computed the percentage change in network density for each section. We note that

Network density decreases for all sections. This tells us that nodes are being added faster than edges. In practical terms, the rate at which new people (nodes) are attending AGU sessions is greater than the rate at which continuing attendees (nodes) are making new connections. Again, these percentage change values should be considered a lower limit given our inability to completely disambiguate the authors in our data.
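A rough sketch of how yearly density and its percentage change could be computed is given below. It assumes, as an illustration only, that edges persist once created and that the change is measured between the first and last year; the function names and the toy data are hypothetical and are not the study's actual code or results.

```python
from itertools import combinations

def yearly_density(presentations_by_year):
    """Map each year to the density of the cumulative co-authorship graph.

    Nodes and edges are never removed, so a co-authorship from an early
    meeting still counts in every later year.
    """
    nodes, edges = set(), set()
    densities = {}
    for year in sorted(presentations_by_year):
        for authors in presentations_by_year[year]:
            unique_authors = set(authors)
            nodes.update(unique_authors)
            edges.update(
                frozenset(pair)
                for pair in combinations(sorted(unique_authors), 2)
            )
        n = len(nodes)
        possible = n * (n - 1) / 2
        densities[year] = len(edges) / possible if possible else 0.0
    return densities

def percent_change(first, last):
    """Percentage change from `first` to `last` (negative means a decrease)."""
    return 100.0 * (last - first) / first

# Toy data: one new author (D) joins in 2001 with a single new edge (C-D).
toy = {2000: [["A", "B", "C"]], 2001: [["C", "D"]]}
density = yearly_density(toy)
change = percent_change(density[2000], density[2001])
```

In the toy data, one new node arrives with only one new edge, so density falls even though no existing connection is lost.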

We expect network density to decrease over time. For density to remain constant, each new node must also be accompanied by an even larger number of new edges. However, we are surprised by the extent to which density is decreasing. If a large number of new collaborations were being found at AGU, then existing nodes would have new edges at a rate comparable to new nodes being added. This appears to not be the case.
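The expectation above can be made concrete with the standard density formula for an undirected network. The notation is our own assumption rather than something taken from the study: $N$ nodes, $E$ edges, density $d$, and a new node that arrives with $k$ new edges, giving density $d'$.

```latex
\[
d = \frac{2E}{N(N-1)}, \qquad
d' = \frac{2(E+k)}{(N+1)N}
\]
\[
d' = d \;\Longleftrightarrow\; k = \frac{2E}{N-1} = dN
\]
```

Under these assumptions, holding density constant requires each new node to arrive with $dN$ new edges, a number that grows as the network grows. Any smaller $k$ lowers the density.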

In graph theory, a connected component of an undirected graph (also referred to as a component) is a maximal subgraph in which every pair of nodes is joined by a path. Figure 3 shows an example. The network in the figure consists of three connected components. Although not shown here, an isolated node not connected to any other nodes in the network is also a connected component.
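Connected components can be found with a breadth-first search over the co-authorship graph. The sketch below is a generic textbook approach with hypothetical names and toy data, not the study's implementation; it also shows that an isolated node forms a component of its own.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets.

    `edges` is an iterable of (a, b) pairs whose endpoints are in `nodes`.
    """
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components

# Toy graph: one chain of three authors, one pair, and one isolated author.
toy_nodes = {"A", "B", "C", "D", "E", "F"}
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(toy_nodes, toy_edges)
```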

Multi-Disciplinary Authors
We define a multi-disciplinary author as anyone who appears in the network graph of more than one section. Aside from the related space physics sections of SH and SM, we do not see a significant number of presentations across sections. Authors tend to stay within their primary domains.
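One way to count such authors is to invert the per-section node sets. The sketch below assumes that a multi-disciplinary author is one whose identifier appears in more than one section's graph; the function name, the section label "P", and the email addresses are hypothetical illustration data.

```python
from collections import defaultdict

def multidisciplinary_authors(section_nodes):
    """Return {author: sections} for authors found in more than one section.

    `section_nodes` maps a section label to the set of author identifiers
    (for example, email addresses) in that section's network graph.
    """
    sections_by_author = defaultdict(set)
    for section, authors in section_nodes.items():
        for author in authors:
            sections_by_author[author].add(section)
    return {
        author: sections
        for author, sections in sections_by_author.items()
        if len(sections) > 1
    }

# Toy data: only one author appears in both SH and SM.
section_nodes = {
    "SH": {"a@example.org", "b@example.org"},
    "SM": {"b@example.org", "c@example.org"},
    "P": {"d@example.org"},
}
multi = multidisciplinary_authors(section_nodes)
```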

Authors submitting to the Fall Meeting are asked to tag their abstracts with keywords from the AGU's keyword hierarchy. We note that abstracts are not exclusive to one keyword group. Authors are free to self-tag their abstracts with multiple keywords that may span multiple parts of the keyword hierarchy. This is reflected in our analysis, where the same abstract may contribute to keyword usage counts in multiple parts of the keyword hierarchy.

For clarity of display, we filtered out keyword groups that did not reach 100 occurrences during the 17 years in which we had data. Figures 5 through 8 highlight specific trends in keyword usage that were observed in our data. The full set of images showing keyword usage from all keyword categories is included in the Appendix.

To us, this is indicative of the power of simple scientometric visualizations. By simply counting keywords we can begin to identify emerging trends, which, as we discuss further in the next section, can be exploited by meeting and section leadership to better structure future Fall Meetings. More detailed analysis, such as the example above, can identify effective session planning and emerging science, which section leadership and the AGU Program Committee can also exploit.
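The keyword counting described here can be sketched with a simple tally. This is an illustration under stated assumptions rather than the study's code: the function name and toy abstracts are hypothetical, each abstract is assumed to arrive as a list of keyword-group names, and an abstract is counted once per group even if it carries several keywords from that group.

```python
from collections import Counter

def keyword_group_counts(abstracts, min_occurrences=100):
    """Count keyword-group usage and drop groups below `min_occurrences`.

    An abstract tagged with several groups contributes to each of them.
    """
    counts = Counter()
    for keyword_groups in abstracts:
        # Assumption: count each group at most once per abstract.
        counts.update(set(keyword_groups))
    return {
        group: n for group, n in counts.items() if n >= min_occurrences
    }

# Toy data with a lowered threshold so the filter is visible.
toy_abstracts = [["Astrobiology", "Education"], ["Astrobiology"], ["Hydrology"]]
kept = keyword_group_counts(toy_abstracts, min_occurrences=2)
```

Running the same tally separately for each year would give the per-year series needed to plot usage trends.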

The Planetary Science section is the primary user of Astrobiology keywords, as shown in Figure 6. Usage from 2005 to 2010 was more or less consistent. However, beginning in 2011, a sudden increase in usage is seen that continues to the present. A similar trend is seen with Education keywords in Figure 7.

It may not be surprising that planetary scientists are using astrobiology terms to tag their abstracts. Meeting attendees may even have anecdotal evidence of observing this themselves. Yet, had someone been tracking this data in 2012 and 2013, we could have seen this trend emerging. This information could have gone into meeting planning and potentially led to more physical space at the meeting venue, joint sessions, increased public outreach, and other initiatives that could have maximized the dissemination of astrobiology science.

The related trend, Figure 7, shows Union sessions having a sudden uptick in Education-related keywords. A scientometrics- and data-driven AGU could leverage this information to be proactive about joint sessions and about when and where presentations are given at the Fall Meeting. We explore this in more detail in the next section.

In regard to network density and connected components, there is no optimal network clustering value. However, lower-density networks composed of many loosely connected clusters have been shown to be beneficial (Burt, 2004). In these networks, not everyone already knows each other, and multiple clusters lead to new and unique perspectives. In contrast, when everyone knows everyone else (density = 1), you are more likely to hear the same ideas repeatedly (Burt, 2004).
For information to spread across a network, there need to be connections between the clusters. We want to avoid the scenario depicted in Figure 3 and have at least one connection between each connected component in an AGU section. Knowing how many connected components there are, the primary research topic of each (its most used keyword), and who makes up each component can be beneficial for meeting planners and section leadership. For the AGU Fall Meeting, session proposal is open to any self-organized group of up to four AGU members. Authors then opt to have their submission assigned to a particular session. We could make this process more proactive by providing section leadership with connected component data and encouraging connections between specific AGU members. This could range from informal networking events to suggesting session co-conveners. Another option is to facilitate navigation of the meeting via analytics tools built on top of the AGU's historic meeting data. A simple example is shown in Figure 10. This so-called force-directed graph adds information to a standard network graph. In a force-directed graph, the distance between two nodes is indicative of the strength of the connection. For instance, in Figure 10

We want to be clear that we are not advocating for any sort of new metric. We do not need to rank researchers nor do we need to rank the value of their work based on where it's presented. The journal impact factor does a poor enough job of this already (Shanahan, 2016). Rather, we are advocating for tools that would help attendees, especially early-career and new attendees, identify whom they might want to seek out based on their research interests. Figures 11 through 13 show an example tool we built for the AGU Open API Challenge. After identifying a researcher, possibly through a visualization like Figure 10, the user is guided through finding that researcher in the historical abstract database (Figures 11 and 12). The co-authorship network is then leveraged to identify all AGU presenters who have co-authored a presentation with the researcher of interest. Figure 13 shows an example for our colleague Peter Wiebe. For brevity, only the 2018 co-authors are shown in the figure. The Abstract column in Figure 13 lists the year of presentation, the section of the presentation, and the presentation ID. Each row in the Abstract column is a clickable link that will take the user to a web page displaying the presentation title,

Steps Toward Gender Equality
Ford and colleagues (Ford et al., 2018) have identified a gender imbalance in AGU presentations. Women are invited and assigned oral presentations less often than men. It was found that male primary conveners allocate invited abstracts and oral presentations to women less often and below the proportion of women authors. This trend was apparent regardless of whether the male primary conveners were students or at more senior career stages. Ford et al. (2018) also found that women opt for poster-only presentations more often than men.
The dataset used in this study has a longer timespan than the one used by Ford et al. (2018). However, our dataset does not include gender or career stage information. We cannot add any new information on the gender imbalance discussion. Scientometrics and network analysis may provide tools to counter this imbalance. Yet, we are cognizant that more open data may exacerbate the problem by exposing presenters to more opportunities for bias. We highlight these issues here as it is a discussion very much worth having. However, at this time, we are unable to offer any additional data, insights, or strategies.

AGU is on the cusp of an incredible milestone. Founded in 1919, the AGU will celebrate its centennial in 2019.
There is a lot we can learn from the past 100 years. Network analysis, scientometrics, and data science can help us quantify what we're doing right and identify paths toward improvement. Let's leverage open data and open science to improve how we present our science over the next 100 years. We conclude with a summary of recommendations.
• Further explore the percentage change in network density. AGU is highly invested in collaboration, as evidenced by Science Neighborhoods, Town Halls, and related events. If edges are being added at a rate far below the rate of new nodes, are these collaboration events truly effective?
• Explore connected components to identify clusters of research topics and who comprises each cluster. Combination with other datasets to identify career status (e.g. student, early career, senior researcher) can be helpful for the Program Committee in balancing session chairs. Connected component analysis may also be helpful in recommending collaboration amongst components.
• AGU covers a wide cross-section of the geosciences. Yet, the number of researchers presenting across sections appears minimal. The analysis of keywords reveals there are numerous sections interested in the same topics. AGU should take steps to enhance presentations across sections.

• Scientometric analysis can reveal emerging trends and hidden patterns. We advocate for the release of program data prior to the Fall Meeting and the development of open tools that leverage this data. Narock (2018b) presented techniques that can help operationalize this into predictive analytics.
• Unique identifiers, such as ORCID and the Global Research Identifier Database, can be used to clearly identify researchers and organizations.

• Technology and open data may help in efforts to battle gender and minority biases in science presentations.
Yet, more data and easier access to a researcher's history may lead to unintended consequences and additional biases. Our community needs to continue having discussions in this area and actively evaluate the role scientometrics might play.
• There is currently a strong push for scientific data to adhere to the FAIR principles (Wilkinson et al., 2016). We believe our science communication efforts should adhere to these principles as well.