Center for Big Data Analytics and Discovery Informatics

Research Foci

The center pursues fundamental and applied research in the Data Sciences focused on:

Algorithmic Abstractions of Natural Systems and Methods and Tools for Their Analysis and Simulation
As advances in science are enabled by algorithmic abstractions of natural processes, there is an urgent need for developing, sharing, analyzing, and integrating computational abstractions or representations of the key entities, relationships, and processes of interest in the sciences. For example, progress in the life sciences has been accelerated substantially by the emergence of the Gene Ontology. Of particular interest are system-level, mechanistic, computational models of biological, cognitive, and social systems that enable the integration of different processes into coherent and rigorous representations that can be analyzed, simulated, integrated, shared, validated against experimental data, and used to guide experimental investigations. Such abstractions, coupled with formal methods for their analysis, can provide rich modeling languages with precise syntax and semantics, so that models can be analyzed systematically and efficiently for properties of interest. For example, a question of interest to a cancer biologist, such as whether the up-regulation of genes A and B and the down-regulation of gene C could take a cell from a healthy state to a cancerous state, can be translated into a reachability query against a model of the cell in which the state encodes the expression levels of the genes. A major focus of the center is the development, analysis, and application of algorithmic abstractions to advance scientific discovery, including models that link multiple levels of abstraction, e.g., molecules, cells, tissues, organs, and organisms in the case of the life sciences.
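The reachability query mentioned above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not a model used by the center: gene expression is reduced to on/off values, a hypothetical readout T stands in for the cancerous phenotype, and a made-up regulatory rule switches T on when A and B are expressed and C is not. The query is then answered by a breadth-first search over cell states.

```python
from collections import deque

# A cell state is a tuple of booleans (A, B, C, T).
# T is a hypothetical "transformed" readout, not a real gene.
HEALTHY = (False, False, True, False)


def regulatory_step(state):
    """Toy regulatory rule (an assumption): T switches on, and stays on,
    once A and B are expressed while C is not."""
    a, b, c, t = state
    return (a, b, c, t or (a and b and not c))


# Experimental perturbations named in the biologist's question.
PERTURBATIONS = {
    "up-regulate A": lambda s: (True, s[1], s[2], s[3]),
    "up-regulate B": lambda s: (s[0], True, s[2], s[3]),
    "down-regulate C": lambda s: (s[0], s[1], False, s[3]),
}


def successors(state, perturbations):
    """Yield (label, next_state) for every move available from `state`."""
    yield "regulatory step", regulatory_step(state)
    for label, apply in perturbations.items():
        yield label, apply(state)


def find_path(start, is_target, perturbations):
    """Breadth-first reachability search.

    Returns the list of move labels leading from `start` to the first
    target state found, or None if no target state is reachable.
    """
    parents = {start: None}  # state -> (previous state, label) or None
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_target(state):
            path = []
            while parents[state] is not None:
                state, label = parents[state]
                path.append(label)
            return path[::-1]
        for label, nxt in successors(state, perturbations):
            if nxt not in parents:
                parents[nxt] = (state, label)
                queue.append(nxt)
    return None


def is_cancerous(state):
    return state[3]


path = find_path(HEALTHY, is_cancerous, PERTURBATIONS)
print(path)
```

Under this toy rule, the search finds a path only when all three perturbations are available; removing "down-regulate C" from the dictionary makes the query return None. Realistic models have far larger state spaces, which is where the formal analysis methods discussed above come in.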
Big Data Management and Analytics
The volume, velocity, and variety of data are increasing at unprecedented rates. Such data include primary observational and experimental data on the one hand, and publications, databases, and knowledge bases that capture the results of analyzing such data on the other. The emergence of big data presents many challenges in managing data and in integrating diverse data, software, expertise, results, etc. Literature search to synthesize what is known is one example of a process that is becoming unmanageable given the deluge of data, publications, etc. The volume, velocity, variety, and veracity of data often hamper our ability to realize the full potential of data in many application domains. Data privacy and ethical considerations (e.g., ensuring that decision models created using machine learning algorithms do not violate individual privacy or lead to discrimination on the basis of race, ethnicity, or gender) present additional challenges. The center aims to address these challenges by developing advanced algorithms, software, and infrastructure for big data management and analytics, spanning the collection, storage, preservation, management, integration, analysis, and sharing of huge quantities of disparate data. Of particular interest are scalable machine learning algorithms that yield interpretable predictive models, and advanced cyberinfrastructure for collaborative data- and computation-enabled discovery.
Cognitive Tools That Augment and Extend Human Intellect and Abilities
While several aspects of data acquisition and routine data analytics are being increasingly automated, most of the key aspects of the scientific process remain manual, laborious, error-prone, and often hard to replicate. Effective use of data to guide discovery requires frameworks that organize the hypotheses under consideration, the data that support them, the models that have been created from the data, and the hypotheses resulting from the models. Our ability to gather digital information of all kinds is outstripping our cognitive ability to process, assimilate, and use it. Realizing the potential of data, big and small, to yield useful knowledge that informs our decisions and actions therefore calls for cognitive tools, i.e., computational tools that leverage and extend the reach of human intellect and partner with scientists, decision makers, and the public at large on a broad range of tasks in discovery (e.g., mapping the state of knowledge in a discipline and identifying gaps; formulating and prioritizing questions; designing, prioritizing, and executing experiments; drawing inferences and constructing explanations and hypotheses from the literature, databases, and knowledge bases; expressing and reasoning with scientific arguments of variable certainty and provenance; synthesizing findings from disparate observational and experimental studies; and formulating new questions, in a closed-loop fashion). The center aims to develop algorithmic or information processing abstractions of various aspects of the scientific process; develop computational artifacts (representations, processes, software) that embody such understanding; and integrate the resulting cognitive tools into collaborative human-machine systems and infrastructure to advance science.
Human-Human, Human-Machine, and Machine-Machine Collaboration
Scientific advances increasingly require collaboration among humans, between humans and machines, and among machines. While some of these collaborations are planned, others are unplanned and ad hoc. Enabling such collaborations requires sharable and communicable representations and processes, as well as mechanisms for decomposing tasks, assigning tasks to participants, combining results, and sharing relevant “mental models.” Of particular interest are tools for the documentation, replication, and communication of scientific studies; collaboration and team formation (incentivizing participants, decomposing tasks, combining results, engaging participants with different levels of expertise and abilities); communicating scientific results (across disciplinary boundaries and across levels of abstraction); and tracking scientific progress and impact. Also of interest are the realization of organizational and social structures and processes that facilitate collaboration, and mechanisms for engaging large numbers of participants with varying levels of expertise and ability in discovery and decision making.
Multidisciplinary, Interdisciplinary, and Transdisciplinary Collaborative Research in the Data Sciences
The center focuses on multidisciplinary, interdisciplinary, and transdisciplinary teams that bring together: experimental scientists in a discipline, e.g., the biomedical sciences, with information and computer scientists, mathematicians, etc., to develop algorithmic or information processing abstractions that support theoretical and experimental investigations; organizational, social, and cognitive scientists to study such teams, learn how best to organize and incentivize them, and develop a science of team science; and experimental scientists in one or more disciplines, computer and information scientists and engineers, organizational and social scientists, cognitive scientists, and philosophers of science to design, implement, and study end-to-end systems that flexibly integrate the relevant cognitive tools into complex scientific workflows to solve broad classes of problems in specific domains, e.g., understanding complex interactions between food, energy, water, environment, and populations.
The center engages in interdisciplinary research at the intersection of the data sciences and emerging data-intensive disciplines, aimed at addressing major scientific and societal challenges. Some of the current foci include:
1. Cognitive and Brain Sciences: Current projects focus on analysis, modeling, and prediction of brain networks and their dynamics, e.g., from longitudinal functional magnetic resonance imaging data
2. Life Sciences: Current projects focus on analysis, modeling, and prediction of complex molecular interactions, interfaces, and complexes, genome annotation, host-pathogen interactions, infectious disease dynamics, and integrative biology, e.g., using genomics, transcriptomics, proteomics, metabolomics, and microbiomics data
3. Health Sciences: Current projects focus on integrative analysis and predictive and causal modeling of individual health outcomes using electronic health records, genomics, socio-economic, environmental, and other data
4. Learning Sciences: Current projects focus on analysis and predictive and causal modeling of learning outcomes from data

Penn State University
