Machine-First FAIR: Realigning Academic Data for the AI Research Revolution
https://www.digital-science.com/blog/2025/11/machine-first-fair-academic-data-for-the-ai-research-revolution/
Mon, 17 Nov 2025

The post Machine-First FAIR: Realigning Academic Data for the AI Research Revolution appeared first on Digital Science.

The best way for humankind to benefit from research is to prioritize machines over people when sharing data. Here’s why.

We often repeat the line that academic research needs to be Findable, Accessible, Interoperable and Reusable (FAIR) for humans and machines, as if both deserve equal priority. They don't: we should prioritize the machines, because machine-generated knowledge will accelerate knowledge discovery.

Humans can infer insights from sparse information in the academic literature and datasets because we can hunt down additional context online; machines, for now, cannot. To go further and faster in knowledge discovery, we need to move beyond human-powered discovery, and that means giving the machines structure and pattern to work with. Every research-generating organization should be prioritizing this.

Academia is Ignoring Decades of Advancement

Academic research generates more than 6.5 million papers and over 20 million datasets annually, each representing potential training signals for the artificial intelligence systems reshaping discovery. Yet most institutional data remains locked in formats optimized for human consumption rather than computational processing.

While most stakeholders know the theoretical merits of making data FAIR (Findable, Accessible, Interoperable, Reusable) for both humans and machines, the practical reality is starker: in an era where language models can process orders of magnitude more literature than any human researcher, we are still organizing our most valuable research assets for the wrong consumer.

The economic implications are substantial. Organizations like the Chan Zuckerberg Initiative (CZI) have committed over $3.4 billion toward AI-powered biology, funding projects ranging from their 1,024 GPU DGX SuperPOD cluster for computational biology research to the Virtual Cell Platform that aims to create predictive models of cellular behavior. The Navigation Fund, with its $1.3 billion endowment, has invested in AI infrastructure through their Voltage Park subsidiary, while simultaneously funding open science initiatives focused on machine-actionable intelligence and metadata enhancement.

Astera Institute has deployed portions of its $2.5 billion endowment to support projects like their $200 million investment in Imbue's AI agent research and their Science Entrepreneur-in-Residence program specifically targeting scientific publishing infrastructure. Meanwhile, the Allen Institute for AI demonstrates the practical returns on machine-first approaches through projects like their OLMo series of fully open language models, where complete training datasets, code, and methodologies are published in computational formats, and their Semantic Scholar platform, which processes millions of academic papers to extract structured, machine-readable knowledge graphs.


Yet the vast majority of academic institutions continue to publish their findings in PDFs or as poorly described datasets. While LLMs are getting better at ingesting multi-modal content, PDF is a format that remains surprisingly resistant to reliable automated extraction, despite decades of advancement in natural language processing. This is not merely a technical limitation. Modern large language models struggle with PDFs because these documents prioritize visual presentation over semantic structure. Critical information becomes trapped in figures, tables, and formatting that computational systems cannot reliably parse. A reaction scheme embedded as an image, a dataset described in paragraph form, or experimental parameters scattered across multiple tables represent precisely the kind of structured knowledge that could accelerate discovery if only machines could access it consistently.

The Architecture of Computational Research Infrastructure

The solution requires a fundamental reorientation toward machine-first data architecture. Rather than retrofitting human-readable outputs for computational consumption, we can take inspiration from pharma and industry writ large, which are designing their data flows to serve algorithms from the ground up, with human-friendly interfaces emerging as downstream products of that computational foundation.

Consider the transformation pathway implemented by teams working with Digital Science’s suite of computational research tools. We’re building workflows in our tools for automated knowledge extraction at scale. The extracted knowledge gains semantic coherence through integration into domain-specific knowledge graphs. Platforms like metaphacts (metaphactory) provide the infrastructure to align these signals with established ontologies while enforcing quality constraints through SHACL validation integrated into continuous deployment pipelines. The result is not merely a database of facts, but a queryable intelligence system that can answer novel questions through automated reasoning over validated relationships.

Simultaneously, the operational requirements of research continue through dedicated literature management systems. Tools like ReadCube maintain the audit trails and conflict resolution workflows that regulatory environments demand, while ensuring that every screening decision and data extraction connects to persistent identifiers. The curated evidence flows directly into the computational infrastructure rather than terminating in isolated spreadsheets.

The critical innovation lies in packaging. While human researchers expect PDFs and narrative summaries, machine learning pipelines require structured metadata that specifies exactly what each dataset contains, where to retrieve it, and how to interpret every field.

The Metadata Multiplier Effect on Repository Platforms

Academic data repositories like Figshare occupy a unique position in the machine-first FAIR ecosystem, serving as the critical junction between human research practices and computational discovery. When researchers publish datasets with comprehensive, structured metadata, these platforms transform from simple storage services into computational assets that can feed directly into AI research pipelines. The difference lies entirely in how authors describe their work at the point of deposit.

The REAL (Real-world multi-center Endoscopy Annotated video Library) – colon dataset on Figshare: https://doi.org/10.25452/figshare.plus.22202866.v2

Consider two datasets published on the same platform: one uploaded with a generic title like “experiment_data_final.xlsx” and minimal description, the other with machine-readable field descriptions, standardized vocabulary terms, and explicit links to ontologies and methodologies. The first requires human interpretation before any computational system can make sense of its contents. The second can be discovered, validated, and integrated into training pipelines automatically. Figshare’s API can surface the rich metadata to computational systems, but only if researchers have provided it in the first place.

The platform infrastructure already supports the technical requirements for machine-first FAIR. Persistent DOIs ensure stable identifiers, while structured metadata fields can accommodate everything from ORCID researcher identifiers to detailed provenance information. When authors invest time in describing their data using controlled vocabularies, specifying units of measurement, documenting collection methodologies, and linking to relevant publications, they create computational assets rather than digital archives. The same dataset that might languish undiscovered with poor metadata becomes a valuable training resource when described with machine-readable precision.

This creates a powerful feedback loop. Datasets with excellent metadata get discovered and reused more frequently, driving citation counts and demonstrating impact. Meanwhile, poorly described data remains computationally invisible regardless of its scientific value. Platforms like Figshare could amplify this effect by providing better authoring tools that encourage structured metadata entry, perhaps even using AI to suggest appropriate ontology terms or validate metadata completeness before publication. The infrastructure for machine-first FAIR already exists; it simply requires researchers to embrace metadata as a first-class research output rather than an administrative afterthought. But this is an evolving field, and new standards are emerging that repositories need to engage with.
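A metadata-completeness check of the kind suggested above is easy to sketch. The field names and the required/recommended split below are illustrative assumptions, not Figshare's actual deposit schema:

```python
# Illustrative sketch only: field names are invented, not Figshare's schema.
REQUIRED = {"title", "description", "license", "doi"}
RECOMMENDED = {"orcid", "keywords", "methodology", "units", "related_publications"}

def completeness_report(metadata: dict) -> dict:
    """Report missing required fields and a 0-1 completeness score."""
    present = {key for key, value in metadata.items() if value}
    score = (len(REQUIRED & present) + len(RECOMMENDED & present)) / (
        len(REQUIRED) + len(RECOMMENDED)
    )
    return {"missing_required": sorted(REQUIRED - present), "score": round(score, 2)}

record = {
    "title": "Multi-center endoscopy video annotations",
    "description": "Frame-level polyp annotations with collection methodology.",
    "license": "CC BY 4.0",
    "doi": "10.25452/figshare.plus.22202866",  # example DOI cited in this post
    "orcid": "0000-0002-1825-0097",            # sample iD from ORCID's docs
    "keywords": ["endoscopy", "annotation"],
}
report = completeness_report(record)  # {'missing_required': [], 'score': 0.67}
```

A repository could run exactly this sort of check at deposit time, blocking publication on missing required fields and nudging authors toward the recommended ones.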

The Croissant format, a lightweight JSON-LD descriptor based on schema.org, provides this computational bridge. A single Croissant file enables any training pipeline to hydrate datasets without custom loaders while simultaneously supporting discovery through standard web infrastructure. 
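To make this concrete, here is a simplified sketch of the kind of JSON-LD descriptor Croissant defines. The field subset shown is illustrative and the dataset name and URLs are invented; the MLCommons Croissant specification defines the full schema.org-based vocabulary:

```python
import json

# Simplified, illustrative Croissant-style descriptor. The dataset name and
# URLs are invented; see the MLCommons Croissant spec for the full vocabulary.
descriptor = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-measurements",
    "description": "Hourly sensor readings with documented units.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "FileObject",  # file-level entry a loader can resolve
            "name": "readings.csv",
            "contentUrl": "https://example.org/readings.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

# One such file travels with the dataset; any pipeline that understands the
# vocabulary can locate and load the files without a custom loader.
croissant_json = json.dumps(descriptor, indent=2)
```

Because the descriptor is plain JSON-LD, it is simultaneously crawlable by standard web infrastructure (search engines already index schema.org/Dataset markup) and loadable by training pipelines.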

Practical Implementation in Institutional Contexts

The transition to machine-first FAIR follows a predictable arc when properly resourced. Initial implementations focus on proving the fundamental workflow with narrowly scoped pilot projects. A team might select a single dataset and one sharply defined outcome (perhaps drug-target interaction prediction or materials property modeling) and implement the complete pipeline from literature extraction through validated knowledge graph construction to machine-readable packaging.

The critical insight from successful implementations is the importance of automation as the second phase. Manual processes that work for pilot projects become bottlenecks at scale. The most effective teams invest heavily in converting their proven workflows into tested, continuous integration pipelines that enforce quality gates automatically. This includes SHACL validation for knowledge graphs, automated license checking, and provenance tracking.
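A quality gate of this kind can be sketched in miniature. Real deployments express such constraints as SHACL shapes validated over RDF; the plain-Python check below only mimics the idea, with an invented license whitelist and hypothetical extracted statements (the DOI is a placeholder):

```python
# Plain-Python mimic of a SHACL-style quality gate in a CI pipeline: every
# extracted statement must carry an approved license and a provenance source.
# The license list and example statements are invented for illustration.
APPROVED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}

def validate_statements(statements: list) -> list:
    """Return violation messages; an empty list means the gate passes."""
    violations = []
    for i, stmt in enumerate(statements):
        if stmt.get("license") not in APPROVED_LICENSES:
            violations.append(f"statement {i}: unapproved license {stmt.get('license')!r}")
        if not stmt.get("source"):
            violations.append(f"statement {i}: missing provenance source")
    return violations

batch = [
    {"subject": "aspirin", "predicate": "inhibits", "object": "COX-1",
     "license": "CC-BY-4.0", "source": "doi:10.1234/example"},  # placeholder DOI
    {"subject": "aspirin", "predicate": "treats", "object": "fever",
     "license": "proprietary", "source": ""},
]
errors = validate_statements(batch)  # two violations, both on statement 1
```

Wiring a check like this into continuous integration means a bad batch fails the build rather than silently polluting the knowledge graph.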

Production deployment requires infrastructure investments that many academic institutions are not yet considering. Successful implementations provide stable, resolvable URLs for every dataset and descriptor, enable content negotiation so that both machines and humans receive appropriate formats, and implement comprehensive monitoring of data quality trends and usage patterns. This is the stack that Digital Science can provide.

Quantifying Institutional Success

Organizations can assess their progress toward machine-first FAIR through several concrete indicators. Successful implementations demonstrate that every significant dataset resolves to a persistent identifier that returns structured JSON-LD for computational consumers while maintaining readable landing pages for human users. Knowledge graphs pass automated validation, maintain stable URI schemes, and support catalogued query patterns rather than requiring ad hoc exploration.
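Serving structured JSON-LD to computational consumers and landing pages to humans from the same persistent identifier comes down to content negotiation. A minimal sketch of the routing decision (a production server would honour the full Accept-header quality-value algorithm; this only illustrates the branch):

```python
# Minimal content-negotiation sketch: one persistent URL, two audiences.
def negotiate(accept_header: str) -> str:
    """Choose a response format from a (simplified) HTTP Accept header."""
    accept = accept_header.lower()
    if "application/ld+json" in accept or "application/json" in accept:
        return "json-ld"  # structured metadata for pipelines
    return "html"         # readable landing page for humans

machine_view = negotiate("application/ld+json")                  # 'json-ld'
human_view = negotiate("text/html,application/xhtml+xml;q=0.9")  # 'html'
```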

Literature workflows leave complete audit trails with PRISMA-compliant reporting that can be generated automatically rather than assembled manually. Licensing and provenance information becomes verifiable through computational means rather than requiring human interpretation. Most importantly, the time taken from initial hypothesis to trained model decreases as institutional infrastructure matures and teams spend more of their time on discovery rather than data preparation.

The research organizations that define the next decade will not necessarily be those with the largest datasets, but rather those whose data infrastructure works most effectively at computational scale. Every day spent optimizing publishing workflows for human-readable reports while leaving data computationally inaccessible represents lost ground in an increasingly competitive landscape.

The funders backing this transformation, from CZI’s investments in computational biology to Astera’s focus on AI-native research infrastructure, are betting that machine-first approaches will determine which institutions can effectively leverage artificial intelligence for discovery. The technical architecture exists today. The standards are stable. The remaining barrier is institutional commitment to prioritizing computational accessibility over familiar but inefficient human-centered workflows.

Academic research stands at yet another technology-driven inflection point. The institutions that embrace machine-first FAIR will find their research, and their researchers, having greater impact.


Australian research well placed for adoption of National Persistent Identifier (PID) Strategy
https://www.digital-science.com/blog/2025/10/australian-research-national-persistent-identifier-strategy/
Thu, 09 Oct 2025

Digital Science has made a series of recommendations for Australia's research future in a report published into the use of PIDs in research.


Digital Science report offers “mixed score card”, makes 23 recommendations including mandatory ORCIDs for all Aussie researchers

Thursday 9 October 2025

Digital Science, a technology company serving stakeholders across the research ecosystem, has made a series of 23 recommendations for Australia's research future in a report published today on the use of persistent identifiers (PIDs) in research.

The report is the Australian National Persistent Identifier (PID) Benchmarking Toolkit, available now on Figshare.

Commissioned by the Australian Research Data Commons (ARDC), Digital Science was tasked with developing a comprehensive PID benchmarking framework and conducting a benchmarking process that could be used to monitor the effectiveness of Australia's National PID Strategy over time. The report, developed collaboratively with the ARDC, also benefited from consultation and engagement with the Australian research community.

The lead author of the report, Digital Science’s VP of Research Futures, Simon Porter, will discuss the findings at two upcoming events in Brisbane, Australia: International Data Week (13-16 October) and the eResearch Australasia Conference (20-24 October).

A unique opportunity for Australian research

“This is the first time Australia’s National PID Strategy has been benchmarked, and it represents a unique opportunity for the Australian research system to benefit from that process,” Simon Porter said.

“What we’ve seen from the benchmarking is that Australia’s adoption of ORCID for research publications across the research sector has been extremely successful – and Australia is now third in the world for including DOI (Digital Object Identifier) links with dissertations published online.

“Workflows between publishers, institutional research information systems, and ORCID are also sufficiently strong, and we can see that Australia is well placed for a more comprehensive use of the ORCID infrastructure.

“However, our comprehensive review gave Australian research a mixed score card and recommended several changes and interventions to help strengthen the national strategy,” Mr Porter said.

“One of the key issues we’ve seen is that although Australian researchers are more engaged than the global average in the practice of data citation, they trail significantly behind their European peers.

“And while ORCID and ROR adoption has been strong for publications, the use of persistent identifiers with data sets and non-traditional research outputs (NTROs) remains the exception rather than the norm. As significant publishers of NTRO items in their own right, institutions should hold themselves to the same standards that they expect from publishers – all creators should ideally be described with an ORCID and an affiliation ID (ROR).”

Natasha Simons, Director of National Coordination at the ARDC, congratulated Digital Science on the release of the National PID Benchmarking Toolkit. “The Australian Persistent Identifier Strategy is a critical national initiative to benefit the Australian people by strengthening our digital information ecosystem, the quality of our research and our capacity for effective research engagement, innovation and impact,” she said. “So it is essential to develop robust benchmarks that can track our progress and measure outcomes. The Toolkit provides us with exactly what’s needed.”

Recommendations to strengthen Australia’s research future

Some of the 23 recommendations made in the report include:

  • Australian research has progressed to the point where ORCIDs should now be mandatory for all researchers; Australian Institutions should require ORCID registration within their institutional research information management systems.
  • Australian research institutions should adopt the best practices of publishers to ensure that all authors are described by ORCIDs and affiliations via ROR.
  • Australia should join international pressure to ensure that all publishers both collect ORCID iDs and push the associated metadata into Crossref, and should avoid publishers that do not support ORCID workflows.
  • Australia should consider a national policy for publishing dissertations with DOIs in institutional repositories, formalizing the use of ORCIDs for authors and their supervisors.
  • Reports published by universities and their research centres should ideally be published in institutional repositories, with associated identifiers.
  • Ongoing benchmarking analysis of PIDs should not ignore closed access material. (e.g., ignoring closed-access publications would result in missing 35% of Australia’s research output in 2024.)
  • RAiDs (Research Activity Identifiers) should be added from “day one” of the creation of a funding grant.
  • Grants funding organizations should create persistent identifiers “as soon as is practical” – including complete metadata – to enable research funding to be visible and tracked earlier.

“We welcome the opportunity to have led this benchmarking process, and we hope our recommendations will lead to some meaningful improvements within Australian research,” Mr Porter said.

“Importantly, we’ve also demonstrated that it is possible to produce a benchmarking toolkit for PIDs, and our work may have implications for other nations and their roadmaps towards a persistent identifier future.”

Background: The importance of PIDs

Persistent identifiers (PIDs) are unique, long-lasting references to individual researchers and their work, connected to digital outputs and resources. They help connect researchers, projects, outputs, and institutions, and have become critical for:

  • Making research inputs and outputs FAIR (findable, accessible, interoperable, and reusable)
  • Enabling research outputs to be identified, tracked and cited
  • Analyzing research impact
  • Supporting national-scale research analytics

Widely used PIDs include ORCID iDs, DOIs and RORs; emerging identifiers include DOIs for grants and RAiDs for projects.
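One practical payoff of well-specified identifiers is that they can be checked offline. ORCID iDs, for instance, end in an ISO 7064 MOD 11-2 check character, so a system can reject malformed iDs before metadata is accepted:

```python
# ORCID iDs end in an ISO 7064 MOD 11-2 check character, so their
# well-formedness can be verified without calling the ORCID registry.
def orcid_check_char(base_digits: str) -> str:
    """Compute the check character for the first 15 digits of an ORCID iD."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - (total % 11)) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate an ORCID iD of the form 0000-0002-1825-0097."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_char(digits[:15]) == digits[15]

# Sample iD used in ORCID's own documentation:
ok = is_valid_orcid("0000-0002-1825-0097")  # True
```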

Note: In the report, Simon Porter declares that he is also a member of the ORCID Board.

Discover more at International Data Week (13-16 October) and the eResearch Australasia Conference (20-24 October).

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, IFI CLAIMS Patent Services, metaphacts, OntoChem, Overleaf, ReadCube, Symplectic, and Writefull – we believe when we solve problems together, we drive progress for all. Visit digital-science.com and follow Digital Science on Bluesky, on X or on LinkedIn.

Media contact

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com


Digital Science relaunches Scientometric Researcher Access to Data (SRAD) program
https://www.digital-science.com/blog/2025/07/digital-science-relaunches-scientometric-researcher-access-to-data-srad-program/
Tue, 22 Jul 2025

Digital Science has reaffirmed its commitment to supporting the global scientometric research community by relaunching the Scientometric Researcher Access to Data (SRAD) program.



Access to Altmetric and Dimensions data is now boosted with Dimensions on BigQuery for researchers in the scientometrics field

Tuesday 22 July 2025

Digital Science today reaffirms its commitment to supporting the global scientometric research community and the study of scholarly literature, by relaunching its Scientometric Researcher Access to Data (SRAD) program.

This revitalized initiative will offer scientometric researchers streamlined, no-cost access to Digital Science’s Altmetric and Dimensions data, and is now further expanded by offering access to Dimensions on BigQuery.

The SRAD program is available to scientometrics researchers involved in non-commercial scientometric studies, empowering them to more easily answer system-wide research questions about scholarly literature and its impact.

To lead this important effort and build a thriving global community of expert users, Digital Science has appointed Kathryn Weber-Boer to the position of Director Scientometrics – Scientometric Researcher Engagement. Ms Weber-Boer brings deep expertise in scientometrics, academic engagement, and advanced analytics. 

Ms Weber-Boer said: “This program plays an important role in Digital Science’s commitment to open research and improving research. I am honoured to be in the position of driving strategic outreach, program design, and community leadership, to help researchers maximize the impact of Digital Science tools.

“By expanding access to Dimensions on GBQ, we’re excited to enable scientometrics researchers to answer complex questions with big data, exploring and linking more datapoints, and connecting our world-leading Dimensions data to other open datasets.

“The SRAD program is built around key principles of accessibility, responsible data use, and community empowerment. Through tailored training and dynamic community engagement, it’s our hope that we can contribute to driving innovation in the fields of Scientometrics, Research Policy, and Innovation Studies,” she said.

About Dimensions

Part of Digital Science, Dimensions hosts the largest collection of interconnected global research data, re-imagining research discovery with access to grants, publications, clinical trials, patents and policy documents all in one place. Follow Dimensions on Bluesky, X and LinkedIn.

About Altmetric

Altmetric is a leading provider of alternative research metrics, helping everyone involved in research gauge the impact of their work. We serve diverse markets including universities, institutions, government, publishers, corporations, and those who fund research. Our powerful technology searches thousands of online sources, revealing where research is being shared and discussed. Teams can use our powerful Altmetric Explorer application to interrogate the data themselves, embed our dynamic ‘badges’ into their webpages, or get expert insights from Altmetric’s consultants. Altmetric is part of the Digital Science group, dedicated to making the research experience simpler and more productive by applying pioneering technology solutions. Find out more at altmetric.com and follow @altmetric on X and @altmetric.com on Bluesky.

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, IFI CLAIMS Patent Services, metaphacts, OntoChem, Overleaf, ReadCube, Symplectic, and Writefull – we believe when we solve problems together, we drive progress for all. Visit digital-science.com and follow Digital Science on Bluesky, on X or on LinkedIn.


Media contact

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com


Access vs Engagement – is OA enough?
https://www.digital-science.com/blog/2025/07/access-vs-engagement-is-oa-enough/
Tue, 01 Jul 2025

How do we know if Open Access research is having its intended impact?


Making research Open Access (OA) is one major step in the process, but how do we know if OA research is having its intended impact? Ann Campbell and Katie Davison share the results of their investigations and some lessons for the future of OA.

Reaching OA’s potential

One of the principal aims of Open Access (OA) has always been to democratize knowledge by making research free to read; however, that should be the starting point, not the ultimate goal. Perhaps it’s time to step back and ask ourselves, “Are we in danger of becoming preoccupied with the ‘access’ aspect of open – neglecting the other components that make research successful?”

In our rush to remove paywalls and ‘financial barriers’, could it be that we are simply equating ‘freely available’ to ‘truly accessible’? How valuable is making research content accessible without it being discoverable? And how beneficial is it for an end user to find content if they don’t see its relevance, or if they can’t act on it?

Access alone isn’t enough. If research isn’t discoverable, understandable, or actionable for the people who need it (policymakers, practitioners, researchers across regions and community organizations), then OA has fallen short of its full potential. 

The ability to get research into the hands of those who can fully capitalize on it is a crucial factor in research success, but in practice, significant gaps and disconnects are evident – particularly from a data and systems perspective. We have made huge progress in the volume of research that is technically 'open'; however, we now need to find out who is actually benefiting.

The current narrative suggests that OA articles are more likely to be cited – but our data suggests this isn't universally true, or at least that there is more to the story. In addition, citations alone don't tell us who's engaging with the content or whether it's reaching communities outside of academia.

If equity in research means the ability to publish and participate in research fairly, (regardless of location, career stage or discipline), should we accept that the measure of success is whether an article has been published OA? Or should we be measuring success based on whether the research achieves its intended aims, reaches its intended audience, and enables meaningful participation across global research communities?

This blog will look at what ‘access’, taken in isolation, is and what it isn’t. Using data from Dimensions, extracted from the Dimensions on GBQ environment alongside World Bank data on GBQ, we challenge the notion that emphasis on publishing OA is enough to ensure equitable participation. We explore what happens when we focus on access without discoverability. We assess whether research participation is happening in a balanced way or whether there are barriers to journal publication – including but not limited to Article Processing Charges (APCs) – and engagement.

To help us with this, we have conducted a benchmarking and data interpretation exercise to understand the wider problem of participation in research. 

SDGs case study

Let’s begin with a common assumption: that publishing is the ultimate goal for a researcher, and that lower-middle and low-income countries struggle to publish OA at the same rate as upper-middle and high-income countries due to the financial challenges associated with APCs.

The visual on the left (in Chart 1) shows us the number of gold OA articles published in 2023. This view alone might suggest that lower-income countries are being prevented from publishing OA compared to upper-income countries. However, benchmarking against the overall amount of research from these regions shows the reverse – low-income (LIC) and lower-middle-income countries (LMIC) are producing proportionately more OA content.

Chart 1. Open Access articles as a portion of overall research BY Income level versus as a portion of overall research AT income level. Dimensions data filtered by 2023 pub year, research article document type and SDG 4. Accessed 28/02/2025.
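The two denominators behind Chart 1 – share of global OA output BY income group versus OA share of each group's own output AT income level – can be made concrete with toy numbers. The counts below are invented for illustration, not Dimensions data:

```python
# Invented counts (not Dimensions data) illustrating Chart 1's two views.
counts = {                         # (gold OA articles, total articles)
    "high income":         (400_000, 900_000),
    "lower-middle income": (60_000, 90_000),
}

total_oa = sum(oa for oa, _ in counts.values())
share_by = {group: oa / total_oa for group, (oa, _) in counts.items()}
share_at = {group: oa / total for group, (oa, total) in counts.items()}

# BY income level: high income dominates (~87% of all OA articles).
# AT income level: lower-middle income leads (~67% of its own output is OA
# versus ~44%), reproducing the reversal the chart describes.
```

The same counts thus tell opposite stories depending on the denominator, which is why benchmarking against each region's own output matters.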

With this data in mind, we set aside the notion that a general analysis of open participation will drive further insight, and shift to participation at the journal level. For this analysis, it is useful to consider participation in these terms: where there is intent to contribute to a research topic, is that intent being met or prevented through journal selection and traditional impact measures?

To see this in action, we decided to focus this case study on Indonesian researchers’ contribution to SDG 4, Quality Education.

  • We focused on Indonesia because in 2023 it was the second-highest producer of research articles among LMICs, with a high proportion of OA content. (NB: We will not delve into the reasons behind Indonesia's high output in this piece.)
  • We focused on SDG 4 because Indonesian researchers produced a substantial, outsized amount of Quality Education research: more than any other country, and roughly 10% of all research aligned to SDG 4 (as seen in Chart 2).
Chart 2. The total publications of research aligned to SDG 4, in 2023, by country. Dimensions data filtered by 2023 pub year, research article document type and SDG 4. Accessed 28/02/2025.

In a world where participation in global research was truly balanced and contributions to knowledge were reflected proportionally, Indonesia's 10% contribution to quality education research would be mirrored at the journal level as well.

To view this, we analyzed journals publishing the most research articles aligned with SDG 4 and benchmarked them against common markers for citation impact and attention. We then assessed the representation of Indonesian research within these journals. Specifically, we calculated the proportion of SDG 4-aligned research with at least one Indonesian-affiliated researcher, aiming for a 10% representation rate. The results are shown in the visual below (Chart 3).

Chart 3. Balanced representation for Indonesia? This chart shows the journals producing some of the highest volumes of journal article content aligned to SDG 4, plotted by citation and Altmetric averages. The size of each bubble relates to the proportion of research articles in that journal with at least one author affiliated with an organization in Indonesia.

Our journal-level analysis revealed that the desired 10% participation rate was not met; the level of Indonesian research present varied markedly between journals. Notably, this imbalance occurred across access types and associated publication fees. At the top, Education and Information Technologies, our highest-cited journal and a hybrid title, showed ~2% Indonesian representation. Education Sciences, a gold title with a mid-range citation average, showed less than 1%. The largest portion of Indonesian research appeared at the bottom left, in two diamond-access regional titles with lower average scores for both citation and attention.

Therefore, one barrier may be APCs, which are usually higher for market-leading, established journals. (We’d highlight that Cogent Education comes closest to meeting the 10% participation rate; it charges an APC but also offers waivers for LIC and LMIC countries.) However, this is just one of many potential barriers to equitable participation, and one addressed by programs like Research4Life and publisher-led global discounting practices. Our focus here was viewing the research holistically, taking into account how open practices have supported or hindered participation through both journal selection and research impact.

This view (Chart 3) highlights the challenge seasoned publishers face in balancing publication preferences (what motivates or prevents a researcher from selecting a journal) with readership habits, which encompass both accessibility and discoverability, the kind of discoverability established journals typically offer. The low metrics for the diamond OA journals (bottom left, Chart 3) illustrate the challenge journals face in ensuring research reaches readers.

Publisher mediation

To look more closely at the two sides publishers must mediate between to ensure research meets its potential, we first focus on publication preferences. Many publishers aim to remove participation barriers so that quality research can be shared in a balanced, fully representational way. How can publishers work to ensure this proportional representation?

One approach is reducing APC costs; another is raising awareness. Emerald Publishing uses Dimensions data to benchmark the locale of research relative to its journal-level subjects and tries to balance Editorial Advisory Board (EAB) selection proportionally. This practice aims to show publishers and editors where the research is coming from, without compromising EAB selection quality, addressing representation at the journal level regardless of access type or other unintended barriers.

The other aspect of this publisher mediation, and the one crucial to ensuring research is seen by its intended audience, is understanding reader habits. It is important to understand the difference between making research merely openly accessible and making it accessible, findable, and usable. Access in isolation, without the discoverability to ensure the work reaches the end user, is not enough.

Below we can see the average citations for the top 100 most productive countries by access type (Table 1). From this brief view, we conclude that hybrid titles generate more citation activity because they are established journals with an established readership base.

Citation calculation   Closed   Hybrid   Gold (APC charge)   Gold (no-APC charge)
Average                1.9      3.0      1.8                 1.1
Median                 1.8      2.9      1.8                 1.0
Table 1. Average and Median citations for articles published in 2024 by access type. Dimensions data filtered by 2024 pub year, research article document type and access type including identifying non-APC journals. Accessed 27/03/2025.
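Table 1 is a straightforward group-by aggregation over article-level data. A minimal sketch, assuming a toy extract (the access-type labels and citation counts below are invented for illustration, not drawn from Dimensions):

```python
import pandas as pd

# Toy article-level data standing in for the Dimensions 2024 extract.
articles = pd.DataFrame({
    "access_type": ["closed", "hybrid", "gold_apc", "gold_no_apc"] * 3,
    "citations": [2, 3, 2, 1, 1, 3, 2, 1, 3, 3, 1, 1],
})

# Mean and median citations per access type, mirroring Table 1's layout.
summary = articles.groupby("access_type")["citations"].agg(["mean", "median"])
print(summary)
```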

It is probable that the imbalance in Indonesian representation is shaped by the age and prestige of journals themselves. For the most part, Open Access journals are younger than their subscription-based closed counterparts, and because Journal Impact Factors (JIFs) are based on a two-year citation window, newer journals (both open and closed access) are naturally disadvantaged.

As a result, newer journals that cover emerging or interdisciplinary areas, such as research aligned with the Sustainable Development Goals (SDGs), may find it difficult to achieve similar visibility and ‘reputation’. This creates a compounding effect: newer OA journals may be more inclusive and open to geographically diverse contributions, yet they lack the discoverability and citation momentum of older, established titles.

In turn, researchers from countries like Indonesia are more likely to publish in regional, Diamond OA journals – which remain under-recognized in global research metrics despite playing a crucial role in local knowledge and research ecosystems. 

This echoes the concerns raised in the Budapest Open Access Initiative 20th anniversary recommendations (BOAI20), which call for a more equitable and inclusive approach to Open Access – one that recognizes the value of diverse publication venues, fosters participation from underrepresented communities, and moves beyond outdated prestige indicators.

This points to a deeper issue: when discoverability and prestige are unequally distributed across journals, people may judge research quality based on where it’s published, rather than on the actual quality of the research.

A pattern emerges

This brings us to further consider the practice of prioritizing access above all else, how this may perpetuate bias in the system arising from assessing research quality based on its potential reach, and how that can be hindered by the journal itself.

We examined the quality of Indonesian research in high-output titles and found that when venue and discoverability practices align, Indonesian research citations are above average. This dispels any assumption about overall ‘quality’ that may arise from most Indonesian researchers prioritizing access when selecting a journal (Chart 4).

Chart 4. Quality of Indonesian research seen through balanced discoverability. Dimensions data filtered by 2023 pub year, research article document type and SDG 4. Accessed 28/02/2025.

This prompted a further question: Even when quality is demonstrable, is it being recognized globally? A parallel analysis examining citation practices across all low-income countries allowed us to test whether the patterns we observed with Indonesian research reflect broader systemic issues. We found a consistent pattern: research from low-income countries is often overlooked in citation practices, even when it is highly relevant and well-aligned with global priorities and even when it aligns closely with the focus of the citing publication.

The parallel analysis examined global research output from 2013 to 2023, focusing on contributions to Sustainable Development Goals (SDGs), excluding SDG 3 (Good Health and Well-Being) given its high proportion of research. Using author affiliations from the Dimensions database, we categorized publications by author country and matched them to World Bank income group classifications. This allowed us to compare research priorities between high-income and low-income countries over this time.
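The country-to-income-group matching step can be sketched as a simple join, here with hypothetical publication counts and a minimal two-column World Bank lookup rather than the real Dimensions/BigQuery tables:

```python
import pandas as pd

# Illustrative country-level SDG publication counts (stand-ins for the
# Dimensions extract; countries, SDGs and counts are invented).
pubs = pd.DataFrame({
    "country": ["Ethiopia", "Norway", "Malawi", "Germany"],
    "sdg": ["SDG 2", "SDG 7", "SDG 6", "SDG 17"],
    "publications": [120, 900, 80, 1100],
})

# Minimal World Bank income-group lookup table.
income = pd.DataFrame({
    "country": ["Ethiopia", "Norway", "Malawi", "Germany"],
    "income_group": ["Low income", "High income", "Low income", "High income"],
})

# Join publications to income classifications, then total by group and SDG.
merged = pubs.merge(income, on="country", how="left")
by_group = merged.groupby(["income_group", "sdg"])["publications"].sum()
print(by_group)
```

In the actual analysis, this join was performed at scale on Google BigQuery across 2013–2023 output.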

As shown in the chart below, there are clear differences in thematic focus. Researchers in low-income countries disproportionately prioritize areas like SDG 2: Zero Hunger and SDG 6: Clean Water and Sanitation – topics that directly reflect the urgent, lived realities in these regions. In contrast, high-income countries show a stronger focus on SDGs such as Affordable and Clean Energy and Partnerships for the Goals. These differing priorities demonstrate the local expertise and indigenous knowledge embedded in lower-income regions – expertise that, as shown in our citation analysis, is not being adequately acknowledged or cited in global research outputs.

Chart 5. SDG priorities ranked by publication count for low-income and high-income countries. Extracted using Dimensions data joined to World Bank data on Google BigQuery.

In critical areas such as Zero Hunger and Clean Water and Sanitation – topics where low-income countries often hold deep, practical expertise – our citation analysis reveals minimal inclusion of their work by researchers in high-income countries. Specifically, just 0.2% of references in high-income country publications on these SDGs cite publications where authors are based solely in low-income countries. In contrast, over 70% of the references come from publications with authors affiliated exclusively with high-income institutions (74% for Zero Hunger and 71% for Clean Water and Sanitation).

Even when we broaden the scope to include any contribution from a low-income country, the numbers remain stark: 1.41% for Zero Hunger and 1.22% for Clean Water and Sanitation. This is despite the fact that these regions face the most urgent realities tied to these challenges, and that their researchers are actively publishing in these areas.
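The reference-share figures above reduce to classifying each cited work by the income mix of its authors' countries. A toy sketch, with each reference represented as a set of income groups (hypothetical records, not real citation data):

```python
# Each cited reference is represented by the set of World Bank income groups
# of its authors' countries (hypothetical records for illustration).
references = [
    {"high"}, {"high"}, {"high"}, {"low"},
    {"high", "low"}, {"high"}, {"high"}, {"high"},
    {"high"}, {"high"},
]

# Share of references whose authors are based *solely* in low-income countries.
solely_low = sum(r == {"low"} for r in references) / len(references)

# Broader measure: any low-income contribution at all.
any_low = sum("low" in r for r in references) / len(references)

# Share authored exclusively from high-income institutions.
solely_high = sum(r == {"high"} for r in references) / len(references)

print(solely_low, any_low, solely_high)
```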

These findings point to a clear disconnect between where expertise exists and where it is recognized. In both Zero Hunger and Clean Water and Sanitation, areas where low-income countries have direct, practical experience, their research is vastly under-cited in high-income country publications. This underrepresentation suggests a missed opportunity to draw on locally grounded knowledge that could meaningfully shape global solutions.

Conclusion

This isn’t about a lack of relevant research. It’s about discoverability, visibility, and deeply embedded citation habits. Open Access isn’t just about making research available, it’s about making sure that research is seen, used, and respected within the global knowledge ecosystem.

Emerald has recently launched the Open Lab, which looks at the research ecosystem and how open practices impact it. Its goal is to find real solutions to some of the problems not yet addressed by open practices and some of the problems created by them.

We hope this analysis encourages thoughtful discussion on where the focus should shift, thus allowing us to effectively evaluate the success of Open Access and help ensure that all research can meet its full potential.


Authors:

Ann Campbell, Technical Solutions Manager, Digital Science
Katie Davison, Insights Analyst, Emerald Publishing

The post Access vs Engagement – is OA enough? appeared first on Digital Science.

TL;DR Shorts: Venki Ramakrishnan on open research
https://www.digital-science.com/blog/2024/10/tldr-shorts-venki-ramakrishnan-on-open-research/
Tue, 29 Oct 2024 16:46:17 +0000

Open research is for life, not just for Open Access week, which is why we are sticking with the theme and hearing from Nobel laureate and former president of the Royal Society Venki Ramakrishnan, who talks about this very topic in this week’s TL;DR Shorts episode.

The post TL;DR Shorts: Venki Ramakrishnan on open research appeared first on Digital Science.

Hot on the heels of last week’s International Open Access Week and the announcement of the latest round of Nobel prizes earlier this month, it felt right that this week’s TL;DR Tuesday offering should feature Nobel laureate and former President of the Royal Society, now Group Lead of the Structural Studies Division of the Medical Research Council’s Laboratory of Molecular Biology, Venki Ramakrishnan. In this week’s TL;DR Shorts episode, Venki talks about the opportunities and challenges of open research.

Venki Ramakrishnan talks about the opportunities and challenges of open research. Check out the video on the Digital Science YouTube channel.

Venki is a strong supporter of open research. Most research activities undertaken across the world are funded by taxpayers or charities. Venki believes that the public and other researchers have a right to access research information generated from this work. However, he also has concerns about the potential consequences of cutting out some of the steps in more traditional publishing routes.

Venki emphasises the need for research outputs to be checked for their integrity and credibility when sharing research openly. This curation step takes time and resources and therefore costs money, so we need to find a balance between sharing research information openly, widely and without financial barriers to accessing it, while also ensuring that that research is robust and of high quality. Who should fund this?

One way the open research agenda has transformed research is through institutions and funders insisting that data sets are made more available by sharing them in open repositories. This is just one example of how culture change in research is happening, and how small steps lead to a much larger research transformation and a paradigm shift in how we do research.

Subscribe now to be notified of each weekly release of the latest TL;DR Short, and catch up with the entire series here.

If you’d like to suggest future contributors for our series or suggest some topics you’d like us to cover, drop Suze a message on one of our social media channels and use the hashtag #TLDRShorts.

Research transformation report
https://www.digital-science.com/blog/2024/10/research-transformation/
Mon, 28 Oct 2024 17:54:00 +0000

Change in the era of AI, open and impact: voices from the academic community

The post Research transformation report appeared first on Digital Science.


Research transformation: Change in the era of AI, open, and impact: Voices from the academic community

Your experiences, as told to us

To understand more about how the research world is transforming, what’s influencing change, and how roles are impacted, we reached out to the research community through a global survey and in-depth interviews.

It’s clear academia is at a pivotal juncture

External pressures from an increasingly complex world are forcing rapid change in the sector.

As a society, we need answers to pressing issues and there is a growing expectation for research to deliver. But increasing demands, tightening budgets, and lack of infrastructure can stand in the way of progress. Many are turning to emerging technologies for support.

Download the Research Transformation report

In order to build tools that really speak to users’ needs, as well as talking often, it is important to understand where the space has come from and where it is moving to. We were delighted to hear how aligned our focuses were. I’m particularly excited to see where we can improve on all fronts with the inevitability and all of the benefits of open research.
Mark Hahnel
VP Open Research | Digital Science

Key findings

Several themes that emerged from our research are summarized here. For all the detail, make sure you download the full report.

Open research is transforming research, but barriers remain

82% of respondents said that open research enhancements will have the most impact on research over the next five years.

  • Open research cited as most positive change in last five years
  • Open research top change the community would like to see in the next five years
  • Challenges in open research include lack of awareness, funding, support, resources and infrastructure
  • Concerns around data security, research quality and competitiveness

I don’t think we have sufficiently thought through how we can absolutely be confident about privacy and security at the same time as we go full sail into open.

Kevin Dunn, Provost, Western Sydney University

In recent years, there has been very welcome emphasis on research culture and open research, and concern with other types of metrics and behaviors that are not as hard-nosed as they once were.

Sally Smith, Director of Research, Trinity College Dublin

Research metrics are evolving to emphasize holistic impact and inclusivity

77% of respondents expect to spend more time on ‘Research Impact and Evaluation’ over the next five years.

  • Frustration with traditional metrics, but they still hold weight
  • Call for a more holistic evaluation of research impact and quality
  • A limited shift to more responsible use of traditional metrics and introduction of alternative metrics
  • Institutes addressing academic culture issues but need greater recognition for non-traditional contributions   

AI’s transformative potential is huge, but bureaucracy and skill gaps threaten progress

69% of respondents stated that skill gaps are having an extremely high or moderate impact on their role today.

  • Emerging technologies will continue to impact roles over the next five years
  • New technology expected to drive efficiencies in data and analytics, and open research
  • Call to address AI skills gap and introduce change management strategies
  • Enthusiasm for AI tempered by concerns around ethics, security and integrity, as well as AI bias, hallucinations and impact on critical thinking

I think artificial intelligence will be a game changer in terms of the development of the tools that we use primarily to find and discover research.

Emily Hart, Science Librarian, Research Impact Lead, Syracuse University

Do we have trouble finding partners? Partners with money? Yes. Partners for research? I don’t think so. It comes down to the funding.

Michelle Vincent, Director of Research Strategy and Performance at Swinburne University of Technology

Collaboration is booming, but increasing concerns over funding and security

80% of researchers believe collaboration outside of academia is changing the way research is performed.

  • Interconnected technology and open research support greater global connectivity
  • Collaboration has multiple benefits e.g. can increase citations and enhance research quality
  • Easy to find collaborators, but scarce funding to support partnership
  • Increasing concerns around research security and ‘damaging’ collaborations

Security and risk management need a strategic and cultural overhaul

45% of respondents report an increase in the amount of time they spend on research security now compared to five years ago.

  • Security threats putting international research collaborations at risk
  • Institutions tasked to balance risk and innovation, but they aren’t equipped
  • Risk management conflicts with other priorities
  • Tendency to ‘wait and see’, rather than proactive management

Over the last five years, there has been a demonstrable investment and a positive step change in awareness and engagement across the UK HE sector around security and compliance.

Chris Buckland, Director of Security, Risk & Compliance, Cranfield University

Our report speaks loudly of the technological advancements, new research practices and global problems driving change in academia. These transformations have created both opportunities and obstacles for institutions and the sector at large.
Simon Porter
VP Research Futures | Digital Science

Key questions for our community

AI

  • How can we share a framework where we can trust the automation AI provides throughout researcher workflows?
  • How can cultural change be achieved in time for such a fast-changing phenomenon?

Open

  • Is open research an inevitability? 
  • How do we define research security in the context of open research, where academic freedom is balanced with responsibility?

Impact

  • What is ‘real impact,’ and can it be measured beyond rankings?
  • With geopolitics creating an environment that reduces opportunities for collaboration, what are the implications for academia in a more siloed research world?

Driving progress for all

Our report speaks loudly of the technological advancements, new research practices and global problems driving change in academia. These transformations have created both opportunities and obstacles for institutions and the sector at large.

At Digital Science, our goal is to advance global research by solving the community’s biggest challenges through innovative artificial intelligence (AI) technology. Our job is to make life easier for everyone in the research world—researchers, universities, funders, industry, and publishers—so that research can become open, fairer, faster, freer and more connected to drive progress for all. 

How we can help you

TL;DR Shorts: Joy Owango on open research
https://www.digital-science.com/blog/2024/10/tldr-shorts-joy-owango-on-open-research/
Tue, 22 Oct 2024 15:30:00 +0000

The theme of this year’s International Open Access Week, which kicked off yesterday, is ‘Community over Commercialisation’. In today’s TL;DR Shorts episode, Joy Owango talks about how open research is helping increase the visibility and representation of research being done in Africa.

The post TL;DR Shorts: Joy Owango on open research appeared first on Digital Science.

The theme of this year’s International Open Access Week, which kicked off yesterday, is ‘Community over Commercialisation,’ and today’s TL;DR Shorts contributor is someone who knows a huge amount about community and research culture. Joy Owango is the Founding Director of the Training Centre in Communication, or TCC Africa. In today’s episode, Joy talks about how open research is helping increase the visibility and representation of research being done in Africa.

Joy Owango discusses the impact that open research has had on the visibility and representation of African research outputs, while also reminding us that more barriers must be removed to achieve true equity in research. Check out the video on the Digital Science YouTube channel.

Joy believes that open access is one of the best things that has ever happened for researchers. It has raised the visibility of African research outputs by reducing or removing blockers to the inclusion of the amazing and impactful research being carried out, such as the fees associated with publishing novel research in top-tier journals with broad engagement. Publishers play a key role in driving this change, and overcoming this challenge requires them to fundamentally change how they work with the research community across the world.

Joy reminds us that there are also unexpected consequences that must be overcome as a global research community. An open research culture should allow more people to access existing research information and also contribute to it without financial or other barriers. While preprints and APC waivers exist, there is still a long way to go to ensure that all actors across the research ecosystem are taking advantage of these programmes of research transformation to break down barriers to inclusion in research and foster greater translation of research into real-world impact.

Subscribe now to be notified of each weekly release of the latest TL;DR Short, and catch up with the entire series here.

If you’d like to suggest future contributors for our series or suggest some topics you’d like us to cover, drop Suze a message on one of our social media channels and use the hashtag #TLDRShorts.

The University of Limpopo chooses Figshare to support its research excellence strategy
https://www.digital-science.com/blog/2024/09/university-of-limpopo-chooses-figshare-to-support-research-excellence-strategy/
Thu, 19 Sep 2024 08:00:00 +0000

The University of Limpopo has chosen Figshare to facilitate the collection, management, sharing and preservation of its research data.

The post The University of Limpopo chooses Figshare to support its research excellence strategy appeared first on Digital Science.

Thursday 19 September 2024

Figshare, a leading provider of institutional repository infrastructure that supports open research, is pleased to announce that the University of Limpopo has chosen Figshare to facilitate the collection, management, sharing and preservation of its research data.

The University of Limpopo – one of the top public universities in South Africa offering undergraduate and postgraduate qualifications, and a variety of short learning programmes – will become the 20th institution in the country using Figshare as their data repository.

Using Figshare, the University of Limpopo will be able to encourage collaboration amongst their research community and network. This will drive interdisciplinary research discovery, the cross-pollination of ideas and further knowledge sharing. With Figshare, they will now have the infrastructure required to ensure research data will be managed in line with best practices throughout the research lifecycle as part of the University’s commitment to research excellence.

Importantly, the repository will also facilitate compliance with the University’s own open research guidance and mandates but also enable researchers to easily adhere to research sharing obligations set out in a growing number of funder policies.

“The motto ‘Finding Solutions for Africa’ expresses the commitment to provide high-quality academic programmes and research outcomes that are socially relevant and responsive, giving meaningful expression to the motto.

“Choosing Figshare gives the University of Limpopo the advantage of having a sustainable platform for our research data to be well managed, discoverable and citable. The repository will promote research work by enabling access to data used for its cutting-edge research to support further studies on a global scale,” said Khomotso Maphopha, Executive Director: LIS.

Mark Hahnel, Figshare Founder and Digital Science’s VP of Open Research, said: “It’s wonderful to see the University of Limpopo become the 20th institution using Figshare in South Africa. We’re thrilled to be able to support their research excellence initiatives and strategies with Figshare infrastructure and their commitment to making research data management a priority for their research community is admirable. It’s exciting to see the global Figshare community continue to grow with leading institutions such as the University of Limpopo.”

About the University of Limpopo

The University of Limpopo was established in 1960 as the University College of the North and later retained the name University of Limpopo (UL) after the unbundling of the MEDUNSA campus from UL in 2015. The University is situated in the Limpopo Province, South Africa and has grown into a world-class university with a commitment towards offering approved and accredited high-quality programmes. The University envisions its future as one in which “the provision and quality of programmes, students’ graduate attributes, academics, culture and services are poised towards finding solutions for Africa”.

About Figshare

Figshare, a Digital Science Solution, is a provider of institutional repository infrastructure. Our solutions help institutions share, showcase and manage their research outputs in a discoverable, citable, reportable and transparent way. We support institutions in meeting the growing demands for research to become open, freer, FAIRer and more connected. We provide the flexibility and control for you to create research management workflows that work for you. We take care of implementation, updates, security and maintenance – ensuring you and your researchers can always depend on your repository, leaving you to focus on what really matters; research and its impact on the world.

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, IFI CLAIMS Patent Services, metaphacts, OntoChem, Overleaf, ReadCube, Scismic, Symplectic, and Writefull – we believe when we solve problems together, we drive progress for all. Visit digital-science.com and follow @digitalsci on X or on LinkedIn.

Media contacts

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com

Appalachian State University chooses Figshare as its new institutional repository platform
https://www.digital-science.com/blog/2024/09/appalachian-state-university-chooses-figshare/
Tue, 10 Sep 2024 13:45:00 +0000

Appalachian State University has chosen Figshare to share, showcase and manage its research outputs.

The post Appalachian State University chooses Figshare as its new institutional repository platform appeared first on Digital Science.

Tuesday 10 September 2024

Figshare, a leading provider of institutional repository infrastructure that supports open research, is pleased to announce that Appalachian State University has chosen Figshare as its new institutional repository platform to share, showcase and manage its research outputs.

Appalachian State University (App State) – part of the University of North Carolina System – chose Figshare as its new repository platform to replace the NC DOCKS consortial repository, which was created in 2007 and is slated to shut down at the end of 2024. The team at App State wanted to use the opportunity to upgrade to a modern repository that could house datasets and encompass a wide array of scholarly work, including non-traditional outputs.

App State will use Figshare to fine-tune and expand the metadata for its records across output types and disciplines. The university plans to use its new repository to showcase every type of research output, including scholarly publications, research data, media, theses and beyond, and to invite broad participation across disciplines, including the Humanities.

As part of the search for a new repository solution, the university was committed to updating its approach to open research management and championing best practices. Figshare is the ideal platform to support this with its open research features and tools for both users and administrators. A crucial part of this project for App State is building engagement and excitement for both the repository and open research practices throughout its faculty, and the university at large.

“App State is delighted to upgrade to a Figshare IR and data repository. We’re confident that the platform’s attractive and user-friendly design will sell itself to our faculty and research community. We also admire Figshare’s support of open access and look forward to increasing the university’s contribution to the OA landscape through the enthusiastic use of our new and improved institutional repository. Figshare’s dual commitment to researchers and open access makes it the perfect choice for Appalachian State,” said Natalie Foreman, Open Access Publishing Manager at Appalachian State University.

Mark Hahnel, Figshare Founder and Digital Science’s VP of Open Research, said: “We’re very happy to welcome another leading US Institution to the Figshare community and we’re excited to see Appalachian State University take advantage of our growing institutional repository functionality. It’s encouraging to see open research and the systems needed to support its progress continue to be prioritized by well-established research institutions in the US.”

About Appalachian State University

As a premier public institution, Appalachian State University prepares students to lead purposeful lives as global citizens who understand and engage their responsibilities in creating a sustainable future for all. The App State Experience promotes a spirit of inclusion that brings people together in inspiring ways to acquire and create knowledge, to grow holistically, to act with passion and determination, and to embrace diversity and difference. As one of 17 campuses in the University of North Carolina System, App State enrolls more than 21,000 students, has a low student-to-faculty ratio and offers more than 150 undergraduate and 80 graduate majors at its Boone and Hickory campuses and through App State Online.

About Figshare

Figshare, a Digital Science Solution, is a provider of institutional repository infrastructure. Our solutions help institutions share, showcase and manage their research outputs in a discoverable, citable, reportable and transparent way. We support institutions in meeting the growing demands for research to become open, freer, FAIRer and more connected. We provide the flexibility and control for you to create research management workflows that work for you. We take care of implementation, updates, security and maintenance – ensuring you and your researchers can always depend on your repository, leaving you to focus on what really matters: research and its impact on the world.

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, IFI CLAIMS Patent Services, metaphacts, OntoChem, Overleaf, ReadCube, Scismic, Symplectic, and Writefull – we believe when we solve problems together, we drive progress for all. Visit digital-science.com and follow @digitalsci on X or on LinkedIn.

Media contacts

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com

The post Appalachian State University chooses Figshare as its new institutional repository platform appeared first on Digital Science.

]]>
The next serendipitous paradigm shift for drug discovery https://www.digital-science.com/blog/2024/08/the-next-serendipitous-paradigm-shift-for-drug-discovery/ Thu, 08 Aug 2024 10:20:59 +0000 https://www.digital-science.com/?post_type=tldr_article&p=72854 AI, federated learning and vast swathes of research data available at our fingertips represent paradigm shifts for drug discovery. Our VP Open Research, Dr Mark Hahnel, discusses the serendipity and the science.

The post The next serendipitous paradigm shift for drug discovery appeared first on Digital Science.

]]>
How AI and federated learning are transforming drug discovery

If we were living in a simulation, then for humanity to continue its drive towards longer, happier lives, something drastic would need to happen every now and then: a serendipitous paradigm shift arriving at the most desperate time. The next such shift is AI. AI may be the technological Shangri-La we were crying out for to halt the heating of the planet and, ultimately, the end of humanity. The same may be true of drug discovery: the way in which we find and create new drugs may be about to transform forever.

Drug discovery has come a long way. It started with natural remedies and saw landmark serendipitous discoveries like penicillin in 1928. The mid-20th century introduced rational drug design, targeting specific biological mechanisms. Advances in genomics, high-throughput screening, and computational methods have further accelerated drug development, transforming modern medicine. However, despite these advances, fewer than 10% of drug candidates succeed in clinical trials (Thomas, D. et al. Clinical Development Success Rates and Contributing Factors 2011–2020 (BIO, QLS & Informa, 2021)). Challenges like pharmacokinetics and the complexity of diseases hamper progress. While we no longer fear smallpox or polio and have effective treatments for bacterial infections and Hepatitis C, today’s most damaging diseases are complex and hard to treat due to our limited understanding of their mechanisms.

Nature 627, S2-S5 (2024) https://doi.org/10.1038/d41586-024-00753-x

Cue paradigm shift. DeepMind’s AlphaFold has revolutionized biology by accurately predicting protein structures, a task crucial for understanding biological functions and disease mechanisms. The economics of DeepMind’s approach also produce some mind-blowing figures. The estimated replacement cost of the current Protein Data Bank archival contents (the dataset from which the AlphaFold models were built) exceeds US$20 billion, assuming an average cost of US$100,000 for regenerating each of the >200,000 experimental structures. AlphaFold has subsequently generated a database of more than 200 million structures. Some back-of-the-envelope maths suggests that generating these would have cost us $20,000,000,000,000 using the original methods.
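Those back-of-the-envelope figures are easy to verify (a quick sketch using the assumptions stated above – US$100,000 per structure, >200,000 PDB structures, ~200 million AlphaFold predictions):

```python
# Back-of-the-envelope check of the figures above. The per-structure
# cost is the assumed average stated in the text, not a measured value.
COST_PER_STRUCTURE = 100_000        # USD, assumed average

pdb_structures = 200_000            # >200,000 experimental PDB structures
alphafold_structures = 200_000_000  # ~200 million AlphaFold predictions

pdb_replacement_cost = pdb_structures * COST_PER_STRUCTURE
alphafold_equivalent_cost = alphafold_structures * COST_PER_STRUCTURE

print(f"PDB replacement cost:      ${pdb_replacement_cost:,}")       # $20,000,000,000
print(f"AlphaFold equivalent cost: ${alphafold_equivalent_cost:,}")  # $20,000,000,000,000
```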

Number of protein structures in AlphaFold. Credit: DeepMind

Of course, there are many simultaneous attempts to move the research needle using AI. A team from AI pharma startup Insilico Medicine, working with researchers at the University of Toronto, took 21 days to create 30,000 designs for molecules that target a protein linked with fibrosis (tissue scarring). They synthesized six of these molecules in the lab and then tested two in cells; the most promising one was tested in mice. The researchers concluded it was potent against the protein and showed “drug-like” qualities. All in all, the process took just 46 days. Scottish spinout Exscientia has developed a clinical pipeline for AI-designed drug candidates.

Not only does the platform generate highly optimized molecules that meet the multiple pharmacology criteria required to enter a compound into a clinical trial, it achieves this in revolutionary timescales, cutting the industry average timeline from 4.5 years to just 12 to 15 months. These companies have the technical know-how to build the models, and most likely some internal data with which to train them. But they need more.

The power of existing data

Platforms like the Dimensions Knowledge Graph, powered by metaphactory, demonstrate the potential of structured data. With over 32 billion statements, it delivers insights derived from global research and public datasets. Connecting internal knowledge with such vast external data provides a trustworthy, explainable layer for AI algorithms, enhancing their application across the pharma value chain.

Knowledge democratization bridges the gaps in the pharma value chain. Credit: metaphacts

AI is not all there is to be excited about in drug discovery. A further technological, serendipitous paradigm shift could amplify the results of AI alone. Once trained, machine-learning models can be updated as and when more data become available. ‘Federated learning’ is a machine learning technique that allows a shared model to be trained across multiple organizations holding local data samples, without sharing the underlying data. Instead of sending data to a central server, each party sends its model updates (e.g., weight changes) to the central server. This maintains privacy while pooling diverse datasets, further reducing the time and cost of the drug discovery process by improving predictive models without leaking privately held company datasets. Public data can augment local datasets held in corporate R&D departments, enriching the training process, and public data with similar characteristics can help in creating more comprehensive models. This is why we need more, better-described open academic research data.
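The mechanics can be illustrated in a few lines of Python: each party fits a shared model on its own data and only the updated weights travel to the coordinator. This is a minimal sketch with toy data and a linear model, not any particular company’s pipeline.

```python
import numpy as np

def local_update(weights, X, y, lr=0.05, epochs=5):
    """One party refines the shared model on its private data.
    Only the resulting weights leave the site -- never X or y."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

def federated_average(weight_list):
    """The coordinator combines updates without seeing any raw data."""
    return np.mean(weight_list, axis=0)

# Two hypothetical organizations with private datasets from the same process.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X_a, X_b = rng.normal(size=(50, 2)), rng.normal(size=(50, 2))
y_a, y_b = X_a @ true_w, X_b @ true_w

shared = np.zeros(2)
for _ in range(200):  # communication rounds
    updates = [local_update(shared, X_a, y_a), local_update(shared, X_b, y_b)]
    shared = federated_average(updates)

print(np.round(shared, 2))  # converges towards true_w without pooling the raw data
```

The key property is that `X_a` and `X_b` never leave their owners; common extensions include weighting the average by dataset size or adding differential-privacy noise to the updates before they are sent.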

Pharmaceutical companies should engage further both with open academic data aggregators, to help improve metadata quality, and with highly curated linked datasets like the ones supported by the Dimensions Knowledge Graph and metaphactory. The limiting factor is not AI capability; it is the amount of high-quality, well-described data that can be incorporated into the models. They need to:

  1. Acquire: Gather data from diverse sources, including internal and external datasets. Make use of federated learning.
  2. Enhance: Enrich data with metadata and standardized formats to improve utility and interoperability.
  3. Analyse: Use new models to identify patterns, trends and drug candidates.

You may be thinking that this isn’t a serendipitous leap. This is the fruition of decades of research moving us to a point where these technologies can be applied. You may be right. Either way, the timing of these paradigm changing tools does feel serendipitous. Without AI and federated learning, we could not tackle today’s complex diseases in such an efficient manner. There is a long way to go, but by continuing to curate and build on top of academic data, we can push the boundaries of what’s possible in modern medicine.

This is part of a Digital Science series on research transformation. Learn about how we’re tracking transformation here.

The post The next serendipitous paradigm shift for drug discovery appeared first on Digital Science.

]]>
The Barcelona Declaration… exploring our responsibilities as metadata consumers https://www.digital-science.com/blog/2024/07/the-barcelona-declaration-exploring-our-responsibilities-as-metadata-consumers/ Wed, 10 Jul 2024 20:22:50 +0000 https://www.digital-science.com/?post_type=tldr_article&p=72461 The Barcelona Declaration is perhaps the first document to begin to frame community responsibility with regards to consuming open metadata. Yet, it is just that, a beginning – we believe that understanding with granular detail what should be expected from each part of our ecosystem is critical in making Barcelona actionable, to drive us forward into a more open data landscape. Let’s begin the conversation…

The post The Barcelona Declaration… exploring our responsibilities as metadata consumers appeared first on Digital Science.

]]>
Towards creating responsible metadata consumers…

The first commitment of the Barcelona Declaration articulates that, ‘We will make openness the default for the research information that we use and produce’, but who ‘we’ are is critical in understanding all of our roles and responsibilities in the research ecosystem. Funders, publishers, infrastructure providers, institutions and researchers all have different ways of interacting with data in their contexts as producers, consumers and aggregators of data.

The Barcelona Declaration is perhaps the first document to begin to frame community responsibility with regards to consuming open metadata. Yet, it is just that, a beginning – we believe that understanding with granular detail what should be expected from each part of our ecosystem is critical in making Barcelona actionable, to drive us forward into a more open metadata landscape. Indeed, open metadata is only important if we commit to using it in our practice, allowing it to shape the way that we interact across the research world.

A commitment to consume, however, still requires us to pay attention to the type of open metadata that we use, the contexts in which we apply it, and the expectations that we place on others when doing so. Without explicitly articulating our roles both as creators as well as consumers of research metadata, we risk creating an open, yet untrusted research landscape.

Not all metadata are the same

There is a fundamental asymmetry between production and consumption (and also aggregation). Whilst the responsibilities associated with creating metadata are relatively easy to articulate, the responsibilities around consuming and aggregating metadata are not so well thought through, as until now this has been the less pressing issue. (Indeed, Barcelona makes it clear that we have reached a milestone in that we now need to consider this issue.) We argue that responsibilities around consumption are contextual in nature, depending on the provenance of the metadata itself, and work needs to be put into articulating these responsibilities for each participant and use case. In the context of the recent Barcelona Declaration, then, it is useful to explore some of the different ways metadata can be created, and then to explore what responsibilities could result for consumers.

Within the Barcelona Declaration there are (at least) three different sorts of metadata records that are implicitly referred to:

Open metadata records

Open metadata records are those that have been created from inception with open research principles in mind. For example, a publication created under these principles will have an ORCiD associated with each researcher and a ROR ID associated with each affiliation. Within the body of the publication (and its metadata), funding organisations will be linked to their Open Funder Registry ID (or ROR ID), and the grant itself will link to open, persistently identified grant records (for example via the Crossref grant linking system). The publication itself (along with a rich metadata representation) will be associated with a DOI, and all references that resolve to a DOI will also be openly available. When we speak about open here, we have in mind a CC0 licence for these data. Within the paper itself we might expect to see other links, such as a link to a data repository, along with other trust markers that establish the provenance of the paper and situate it within the norms of good research practice. We might have similar expectations for grants, datasets, research software code, and other research objects.
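As an illustration only, such a record might look like the following sketch; every identifier below is an invented placeholder, not a real DOI, ORCiD or ROR ID, and real records follow each registry’s own schema:

```python
# Illustrative shape of an open metadata record. All identifiers are
# invented placeholders; real records follow each registry's own schema.
open_record = {
    "doi": "10.1234/example.article",
    "license": "CC0-1.0",
    "authors": [
        {"name": "A. Researcher",
         "orcid": "https://orcid.org/0000-0000-0000-0000"},
    ],
    "affiliations": [
        {"name": "Example University",
         "ror": "https://ror.org/00000000"},
    ],
    "funding": [
        {"funder_ror": "https://ror.org/11111111",
         "grant_doi": "10.5555/example.grant"},   # e.g. a Crossref grant record
    ],
    "references": ["10.1234/cited.article"],      # DOIs, openly resolvable
    "data_link": "10.6084/example.dataset",       # link to a data repository
}

# A minimal openness check: every author and affiliation carries a PID.
assert all("orcid" in a for a in open_record["authors"])
assert all("ror" in a for a in open_record["affiliations"])
```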

Algorithmically enhanced records

Algorithmically enhanced records are metadata records that have had elements derived from algorithmic processing that was not part of the original record. The algorithm may not be open, the approach used may not be known and the probability that the metadata is correct may also not be known. (This is something of a hidden variable in many analyses today – it is generally assumed that data in an article may have statistical variances but that metadata describing an article does not.) Many publication records that have been created over time do not meet our current requirements for metadata openness. Either the technology (or identifier infrastructure) did not exist at the time that they were created, or good metadata practices have yet to take hold within the context that the record was created. For records such as these, algorithms are used to enhance the record with identifiers. Prominent examples include algorithms that are used to identify institutional affiliations, but also to reconstruct researcher identities. Algorithms can also be used to enhance the description of a record by adding links to external research classifications that would never have existed in the original metadata.

This type of data is likely to become more and more commonplace as LLMs and other AI systems become more easily and cheaply available. Hence, it is likely that for some years to come metadata will have inbuilt, statistically generated inaccuracies, which may be ignored by the community at large if they can be proven to be negligible in key analyses.

Institutionally enhanced metadata records

Institutionally enhanced metadata records are those enhanced through university processes for the purposes of institutional and government reporting. These records, harvested from multiple sources or manually curated, may have additional metadata associated with them. An author on a paper might be associated with an institutional ID, or new research classifications might be added with links to datasets. These institutional records might be made public through institutional profiles or syndicated to larger state or national initiatives.

What are our responsibilities when using and reusing research metadata?

The text of the Barcelona Declaration treats all three types of metadata that we have defined above as being on an equal footing: to be shared under a CC0 licence, allowing unrestricted reuse. Issues of licence aside, however, the way we reuse metadata should be informed by the provenance of the created information.

When considering how to implement the objectives of the Barcelona Declaration then, it is worth thinking carefully about a general approach to the responsibilities associated with reuse. As with the Barcelona Declaration, we propose these as a beginning and a discussion rather than an absolute. Refining these responsibilities will take community discussion. 

Here are three responsibilities that we think would be useful to begin the conversation:

Responsibility 1. The purpose for which a piece of metadata is intended to be used must place a limit on both the scope (types of interpretation) and range (geographical, subject or temporal extent) under which it can be responsibly used

Beyond considerations of openness, the context of the data that is being propagated needs to be considered. Metadata is generated for a purpose, and that purpose defines the accuracy and care to which the metadata is applied. It also defines the limits and responsibilities for maintaining its accuracy.

For institutions, the Barcelona Declaration explicitly identifies Current Research Information Systems (CRIS systems) as one mechanism to make research information open. It is required that all relevant research information can be exported and made open, using standard protocols and identifiers where available. This requirement builds on a movement that initially gained traction around 2010 with the VIVO and Harvard Catalyst profiles projects funded by the NIH. The key use cases for these public profiles have been expertise finding, either at the institution, state, or national level. The key insight of this movement is that information collected for internal reporting and administrative purposes could also be used to create public profiles – a single source of information efficiently driving multiple uses. In some cases the approach of CRIS-aggregated information has been taken further to create state-based portals such as the Ohio Innovation Exchange, or national open research analytics platforms such as Research Portal Denmark. Although successful, the nature of the provenance of these records means that there are practical limitations to the way the information can be reused beyond these applications.

Implicit in the name of a CRIS is a key limitation. CRISs are used to maintain/modify/aggregate information about ‘current’ researchers. There is (for an institution) no implied duty of care for the maintenance of public information about past staff. Indeed, from the perspective of expertise finding it may be inconvenient to have these profiles remain discoverable in the same way.  

Metadata within CRIS systems are also often collected for a politically aligned purpose, such as the demonstration of value to voters (which is often presented as a national purpose in the form of government reporting), and this can lead to unbalanced metadata records when used in a broader context. For instance, publications recorded for the purposes of national reporting might very accurately record the researcher affiliations within a country, but will be significantly less accurate on international affiliations on which the reporting exercise has little bearing.

Records can become unbalanced in other ways too: research can be classified to reflect the goals of individual reporting exercises (a point that we wrote about in detail in our article on FoR Classification) – both in terms of the classifications that are applied, the time and effort with which those classifications are maintained, and the scope of research classified. If there is a purpose to reusing this classification metadata in a different context, the provenance under which it was recorded must be maintained and understood.

A potential interpretation of the Barcelona Declaration could be that all metadata must be curated with the understanding that it will be used and consumed within the broader research community in perpetuity. If this is the intended interpretation, then we should be realistic about what this requires, both in terms of effort and in the structures that should be put around the codification and documentation of data curation approaches. This interpretation also instantly begs several practical questions: does the storing, and passing on, of a metadata record imply a responsibility to keep it up to date forevermore?

What inequalities would this interpretation place on the broader research community? Specifically, does this interpretation advantage the “metadata rich” (those with the infrastructure to invest in improving records) and disadvantage the “metadata poor” (those with poor embedded or post hoc mechanisms for the curation of metadata)? This concern is not hypothetical: the current lack of visibility of African research has hindered efforts to comprehensively understand, evaluate and build upon the research of African nations.

There are of course already remedies to address many of the persistence challenges associated with making institutional metadata open. One mechanism is to transfer the responsibility for the metadata from the institution to the individual researcher via their ORCiD. Within this workflow, researchers remain responsible for maintaining a public record of their outputs, and institutions can maintain responsibility for asserting when a researcher worked for them. Coupled with a national push to publish research in open access journals and repositories, the Barcelona Declaration complements the approach taken by national persistent identifier strategies as they move towards PID-optimised research cycles.

Responsibility 2. Machine-generated metadata should not be propagated beyond the systems for which it was created, without human curation or verification

Machine-generated metadata, such as the association of an institutional identifier to an address expressed as a string, research classifications, or algorithmically determined researcher IDs, are all generated within precision and recall tolerances. These tolerances are set by system providers, and are aligned with the requirements of their users. Individual statements, however, are not guaranteed against any particular record. What is more, algorithmically generated data can be regenerated as methods improve, potentially invalidating records from previous runs. This notion defines a hitherto overlooked aspect of metadata provenance. Without accompanying provenance, metadata can be considered to have ‘escaped’ from its originating system and runs the risk of being “orphaned”, with no ability to be updated or appropriately contextualised. To move an algorithmically generated metadata record out of the context of the system for which it was created must be to take ownership of the provenance and the statements that can result from its use.
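One way to keep such metadata from ‘escaping’ is to carry its provenance with it. The sketch below shows the idea; the field names, algorithm name and confidence value are all hypothetical, not a real standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class ProvenancedAssertion:
    """An algorithmically generated statement bundled with its provenance,
    so a consumer can judge whether reuse in a new context is responsible.
    Field names and values here are illustrative, not a real standard."""
    subject: str       # e.g. an affiliation string from a legacy record
    assertion: str     # e.g. the matched ROR ID
    method: str        # algorithm and version that produced the match
    confidence: float  # the originating system's own precision estimate
    generated_by: str  # source system, so updated runs can be requested

match = ProvenancedAssertion(
    subject="Dept. of Chemistry, Example Univ.",
    assertion="https://ror.org/00000000",   # placeholder identifier
    method="affiliation-matcher v2.1",      # hypothetical algorithm
    confidence=0.87,
    generated_by="example-aggregator",
)

# A downstream consumer applies its own threshold instead of treating
# the escaped assertion as ground truth.
print(asdict(match)["confidence"] >= 0.8)  # True
```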

Whilst not so much of a problem for publications (an updated version of the record can always be requested using the DOI), this is particularly problematic for algorithmically generated researcher IDs: in the case of an identifier that refers to more than one person, improved algorithms could radically change the researcher that the identifier refers to. In the case of a researcher record that is split because it was really two researchers, the existing researcher ID could end up pointing to a different researcher.

The Barcelona Declaration is right to focus on data sharing practices using standard protocols and identifiers where available. But here too, care must be taken to assess where metadata has come from, as many algorithms associate a persistent identifier with a metadata record. For instance, if an ORCiD is used instead of an internal researcher ID to refer to a researcher, but the set of assertions produced has been algorithmically generated, then communicating these assertions outside of the system in which they were generated breaks the model of trust established by ORCiD.

Responsibility 3. Ranking platforms should be independent of the data aggregations from which they are drawn

A key use case enabled by algorithmically generated metadata is comparative research performance assessment, often encoded in rankings systems. At first glance, this responsibility may appear to be incompatible with Responsibility 2 – if metadata should be strongly coupled to its provenance and context, why should it be divorced from the ranking use case? We regard this issue as being similar to the separation of evaluation bodies from those being evaluated. Because of the different choices that different scientometric platforms make with regard to precision and recall, the same ranking methodology can lead to different results when implemented over different scientometric platforms. However, rankings systems are often entangled with single systems, providing perverse incentives for institutions to engage (both in terms of investment and data quality feedback) with one dataset over another.

One benefit of the Barcelona Declaration’s focus on persistent identifiers is that information assessment models can (and should) be constructed without reference to individual scientometric datasets. By decoupling data aggregations from the rankings themselves, we allow new data aggregation services to emerge without locking in single sources of truth. In this way scientometric data sources should be treated like Large Language Models (LLMs) – extraordinarily useful, but with the ability to swap out one for another. Perhaps we need to add another R (replaceable) to the FAIR data principles for scientometric datasets.

The decoupling of data from ranking also has another effect, in that it discourages investment in the data quality of a single system, and focuses instead on either improving data at the source (for instance Crossref) or improving independent disambiguation algorithms (such as those offered by the Research Organization Registry).

To develop an independent rankings infrastructure will require agreement to use not only the persistent identification infrastructure that we have, but a commitment to develop systems that refer to external classification systems. 

Can we go further? Building on a commitment to an independent rankings infrastructure, for instance, is it reasonable to expect a common query language for scientometric research and analysis across scientometric systems?

The beginning of a conversation…

Finally, from the exploration above, we hope that we have made the case that our responsibilities as metadata consumers go beyond simple considerations of licence or platform. With the current state of the art in research infrastructure, our experiences of how to facilitate open data are not embedded in metadata and do not travel with it. How we use metadata places unclear expectations on others, and affects perceptions of trust in our analyses and in the research information system more generally. As the Barcelona Declaration moves from declaration to implementation, perhaps even blending with evolving national persistent identifier strategies, we hope that these considerations form part of the continuing conversation.

The post The Barcelona Declaration… exploring our responsibilities as metadata consumers appeared first on Digital Science.

]]>
University of the Witwatersrand chooses Figshare to support its open data goals https://www.digital-science.com/blog/2024/06/university-of-the-witwatersrand-chooses-figshare-to-support-open-data/ Tue, 18 Jun 2024 07:02:09 +0000 https://www.digital-science.com/?post_type=press-release&p=72146 The University of the Witwatersrand has chosen Figshare to support its research community with archiving, publishing, sharing and promoting their datasets.

The post University of the Witwatersrand chooses Figshare to support its open data goals appeared first on Digital Science.

]]>
Tuesday 18 June 2024

Figshare, a leading provider of institutional repository infrastructure that supports open research, is pleased to announce that the University of the Witwatersrand Library has chosen Figshare to support its research community with archiving, publishing, sharing and promoting their datasets. 

The University of the Witwatersrand (Wits) – a leading research institution in South Africa based in Johannesburg, ranked as the second-best university in Africa in 2024 (jointly with Stellenbosch University) – will become the 19th institution in the country using Figshare as their data repository. Figshare is proud to now officially work with the top nine universities in Africa.

Using Figshare, Wits will be able to ensure that all research data produced by their researchers, academics and postgraduate students will have a DOI, ensuring it is truly findable, trackable and citable. DOIs will be minted through the Figshare platform and will therefore be integrated with international data harvesters, making the research shared highly discoverable. Wits will also utilize the ability to clearly link research data to published articles and related outputs shared in other repositories, which is an important part of Figshare’s functionality. 

The Wits Library is looking forward to the potential new collaborations and partnerships the new repository may foster, enabling their research community to connect with researchers in similar fields and disciplines. The easy-to-use interface and extensive support documentation Figshare provides were also of great importance to the Wits team when selecting a tool, alongside the attribution of crucial usage metrics including views, downloads, Altmetric attention data and citations.

Importantly, the repository will ensure compliance with relevant funder requirements for data sharing and make it easy for researchers and academics to fulfil their funder responsibilities when it comes to open research. The research shared in the repository will also be in line with core Open Science and FAIR data best practices and principles. 

“As the Wits Library, we are excited to bring to Wits the Figshare data repository and related services, and look forward to assisting our academics, researchers and students by making their research output available in a citable, shareable and discoverable manner. The Wits Library will use this platform to contribute to the national data repository providing even more exposure to the important research being conducted at our institution,” said University Librarian Dr Daisy Selematsela.

Mark Hahnel, Figshare Founder and Digital Science’s VP of Open Research, said: “We’re really proud to have such a strong Figshare community in South Africa, with a large group of leading institutions, and it is exciting to see this grow even further with the addition of the prestigious University of the Witwatersrand. We’re looking forward to supporting them in meeting their open data goals and providing their research community with the infrastructure required to adhere to funder policies and open research best practices.”

About the University of the Witwatersrand

The University of the Witwatersrand (Wits) is situated in Johannesburg, South Africa – a vibrant, leading commercial city on the African continent. Wits is home to one of the largest fossil collections in the southern hemisphere and is internationally recognized as a leader in the palaeo-sciences, with Wits scientists contributing to the palaeo-sciences record for almost a century. Wits University is more than an educational brand; it is a time-honoured legacy that aims to educate and drive change through quality education and sound motivation to develop African societies. For 100 years Wits has been known for and remembered through the calibre of graduates that it produces, and the footprints of change that they leave in the public and private sectors in all fields of industry. Wits University has five faculties and 33 schools. It boasts two campuses – in Braamfontein and Parktown – that are spread over 400 acres in Johannesburg, and includes 11 libraries that create a vibrant nexus of ideas, collections, expertise and spaces in which users illuminate solutions for local and global challenges, as well as a hi-tech digitisation centre. Beyond the University of the Witwatersrand's extraordinary ranking in the top 1.3% of universities globally, our legacy equally requires that we continue to be measured by the everyday impact we make on society, the quality of our goals and the lives we change. This is how we are ranked in and among the world's universities.

About Figshare

Figshare, a Digital Science Solution, is a provider of institutional repository infrastructure. Our solutions help institutions share, showcase and manage their research outputs in a discoverable, citable, reportable and transparent way. We support institutions in meeting the growing demands for research to become more open, free, FAIR and connected. We provide the flexibility and control for you to create research management workflows that work for you. We take care of implementation, updates, security and maintenance – ensuring you and your researchers can always depend on your repository, leaving you to focus on what really matters: research and its impact on the world.

About Digital Science

Digital Science is an AI-focused technology company providing innovative solutions to complex challenges faced by researchers, universities, funders, industry and publishers. We work in partnership to advance global research for the benefit of society. Through our brands – Altmetric, Dimensions, Figshare, ReadCube, Symplectic, IFI CLAIMS Patent Services, Overleaf, Writefull, OntoChem, Scismic and metaphacts – we believe when we solve problems together, we drive progress for all. Visit www.digital-science.com and follow @digitalsci on X or on LinkedIn.

Media contacts

Simon Linacre, Head of Content, Brand & Press, Digital Science: Mobile +44 7484 381477, s.linacre@digital-science.com

David Ellis, Press, PR & Social Manager, Digital Science: Mobile +61 447 783 023, d.ellis@digital-science.com

The post University of the Witwatersrand chooses Figshare to support its open data goals appeared first on Digital Science.