Making Sense of Google Scholar Data using Scholarly

Have you ever needed to extract information from the public Google Scholar profiles of professors or researchers? I recently faced this challenge when I wanted to gather publication data for a specific author. My first attempt was to scrape the author lists and publication details directly with Beautiful Soup, but I quickly hit obstacles: Google flagged my repeated requests as bot-like behavior and blocked my progress. Things took a positive turn when I discovered the scholarly module in Python, which let me retrieve everything I needed efficiently. Let me share my experience with you.

Retrieving Author Information

To begin, I used the scholarly module to search for the desired author and retrieve their profile information. Calling search_author with the author's name returns an iterator of matching profiles, from which the first result is usually the one you want. The module provides convenient access to the author's profile and publications.

Here's the code snippet that retrieves the author's information:

from scholarly import scholarly

# Search returns an iterator of matching author profiles
search_query = scholarly.search_author('Author Name')

# Retrieve the first result from the iterator
first_author_result = next(search_query)
scholarly.pprint(first_author_result)

Classifying Publications

Beyond retrieving publication data, I wanted to classify the publications by a specific criterion: distinguishing consortium from non-consortium publications. To achieve this, I implemented a simple classifier using Python's re module, collecting the results in lists that can later be tabulated with pandas. For each publication, I parsed the author list and examined both the number of authors and the presence of specific keywords. If a publication had 100 or more authors, or contained the term "consortium" in the author list, it was classified as a consortium publication; otherwise, it was classified as non-consortium.
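As a standalone sketch, the classification rule can be expressed as a small helper function. The function name, threshold parameter, and sample strings below are illustrative, not part of the original script:

```python
import re

def classify_publication(authors_str: str, consortium_threshold: int = 100) -> bool:
    """Return True if a publication looks consortium-authored.

    Heuristic: 100 or more authors, or the word 'consortium'
    appearing anywhere in the author list.
    """
    # Split on commas, periods, semicolons, the word "and", or numbered prefixes
    authors = [a.strip() for a in re.split(r'[,.;]| and |\d\.', authors_str) if a.strip()]
    if len(authors) >= consortium_threshold:
        return True
    return any('consortium' in a.lower() for a in authors)

print(classify_publication("Jane Doe and John Smith"))          # False
print(classify_publication("Jane Doe; The ABC Consortium"))     # True
```

Isolating the rule like this makes it easy to unit-test against edge cases (odd separators, numbered author lists) before running it over a full profile.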

Here's the code snippet that classifies the publications:

import re

from scholarly import scholarly

# Fill the author record retrieved earlier so its publication list is available
author = scholarly.fill(first_author_result, sections=['publications'])

consortium_publications = []
non_consortium_publications = []
consortium_citations = []
non_consortium_citations = []

for publication in author['publications']:
    # Fill in the publication's complete metadata, including the author list
    publication_filled = scholarly.fill(publication)
    print(publication_filled['bib'].get('title', ''))

    authors_str = publication_filled['bib'].get('author', '')
    # Split on commas, periods, semicolons, the word "and", or numbered prefixes
    authors = [a.strip() for a in re.split(r'[,.;]| and |\d\.', authors_str) if a.strip()]

    num_authors = len(authors)
    print("Number of authors:", num_authors)

    # Classify: 100 or more authors, or "consortium" in the author list
    if num_authors >= 100:
        is_consortium = True
    else:
        is_consortium = any('consortium' in a.lower() for a in authors)

    if is_consortium:
        print("This publication is from a consortium.")
        consortium_publications.append(publication_filled['bib'].get('title', ''))
        consortium_citations.append(publication_filled.get('num_citations', 0))
        # Pad the non-consortium lists so all four stay the same length
        non_consortium_publications.append('')
        non_consortium_citations.append('')
    else:
        print("This publication is not from a consortium.")
        non_consortium_publications.append(publication_filled['bib'].get('title', ''))
        non_consortium_citations.append(publication_filled.get('num_citations', 0))
        # Pad the consortium lists so all four stay the same length
        consortium_publications.append('')
        consortium_citations.append('')
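Because the four lists are padded to the same length, they drop straight into a pandas table. A minimal sketch with made-up placeholder data (the titles and citation counts below are invented for illustration):

```python
import pandas as pd

# Placeholder results standing in for the output of the classification loop;
# empty strings mark the category a publication does not belong to
consortium_publications = ['Giant consortium paper', '']
non_consortium_publications = ['', 'Solo paper']
consortium_citations = [2500, '']
non_consortium_citations = ['', 42]

# Combine the parallel lists into one table, one row per publication
df = pd.DataFrame({
    'consortium_publication': consortium_publications,
    'consortium_citations': consortium_citations,
    'non_consortium_publication': non_consortium_publications,
    'non_consortium_citations': non_consortium_citations,
})
print(df.shape)  # (2, 4)
```

From here, df.to_csv() or df.describe() can export or summarize the results.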

Applications

Now, let's explore some applications of this approach to extracting Google Scholar information and classifying publications:

  1. Research Exploration: Researchers who are interested in a specific professor's work can utilize this method to gain insights into their publication history, collaborations, and involvement in consortium-based research projects.

  2. Academic Evaluation: Institutions or committees responsible for evaluating professors or researchers can leverage this technique to assess an individual's scholarly contributions and identify their participation in consortium initiatives.

  3. Funding Decision-Making: Funding agencies or organizations seeking to support research projects can employ this approach to evaluate an author's past publications and determine their involvement in consortium-based research, aiding decision-making processes.

  4. Collaboration Analysis: Researchers interested in studying collaboration patterns within a specific field or among certain authors can use this method to identify consortium-based publications and explore potential networks and partnerships.

By adapting and extending the provided code snippets, these applications can be customized to suit specific research goals and requirements.

Conclusion

In conclusion, extracting Google Scholar information and classifying publications by specific criteria can greatly enhance research exploration and analysis. With the scholarly module and Python's data manipulation capabilities, you can retrieve publication data, read metrics such as the h-index from an author's profile, and gain valuable insights. Whether you are an aspiring researcher, an academic evaluator, or simply intrigued by a professor's work, this method opens doors to a wealth of scholarly information.
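As an aside, the h-index mentioned above can also be computed directly from a list of per-publication citation counts, such as the ones collected earlier. A minimal, self-contained sketch:

```python
def h_index(citations):
    """h-index: the largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cites, start=1):
        if count >= rank:
            h = rank  # this paper still has enough citations for its rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
```

This matches the value Google Scholar reports on a profile page, up to differences in which citations Scholar counts.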

So, what are you waiting for? Dive into the world of scholarly data, adapt the snippets above to your own research goals, and embark on your research journey.

Happy coding, and may your exploration of scholarly knowledge be fruitful!