{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"institution":[{"id":[{"id":"https:\/\/ror.org\/03mb6wj31","id-type":"ROR","asserted-by":"publisher"},{"id":"https:\/\/www.isni.org\/000000041937028X","id-type":"ISNI","asserted-by":"publisher"},{"id":"https:\/\/www.wikidata.org\/entity\/Q1640731","id-type":"wikidata","asserted-by":"publisher"}],"name":"Universitat Polit\u00e8cnica de Catalunya","acronym":["UPC"]}],"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T17:55:18Z","timestamp":1770400518699,"version":"3.49.0"},"reference-count":0,"publisher":"Universitat Polit\u00e8cnica de Catalunya","license":[{"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"abstract":"<jats:p>Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of petabytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks dealing with this kind of data is limiting possible applications and probably precluding many scientific breakthroughs. \r\nIn this thesis, we describe three HPC scientific applications, spanning molecular dynamics, neuroscience analysis, and physics simulations, where we experienced firsthand the limits of the existing technologies. Thanks to our experience, we define the desirable missing functionalities, and we focus on two features that, when combined, significantly improve the way scientific data is analyzed. \r\nOn one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a spatial position (x,y,z) at a given time (t). 
If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and organize similar data together. \r\nOn the other side, approximate analytics has often been identified as a smart and flexible way to explore large datasets in a short time. Approximate analytics includes a broad family of algorithms that aim to speed up analytical workloads by relaxing the precision of the results within a specified confidence interval. For instance, if we want to know the average age in a group with 1-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want fewer I\/O operations, we need efficient data sampling, which means organizing data in such a way that we do not need to scan the whole dataset to generate a random sample of it. \r\nAccording to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature not available in the current distributed data management solutions. \r\nThis thesis aims to solve this shortcoming, and it provides novel scalable solutions. First, we describe the existing data management alternatives; then we motivate our preference for NoSQL key-value databases. Secondly, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Thirdly, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves on the state of the art for approximate spatial queries on static, mostly-read datasets. 
Later, we enhanced the data ingestion capability of our approach by introducing the AOTree, an algorithm that preserves the query performance of the D8tree even for HPC write-intensive applications. We compared our solution with PostgreSQL and plain storage, and we demonstrate that our proposal has better performance and scalability. \r\nFinally, we describe Qbeast, the novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications, providing a scalable and integrated solution for data analysis and management.<\/jats:p>\n                <jats:p>La gesti\u00f3n de BigData con informaci\u00f3n espacial est\u00e1 considerada como una tendencia esencial en el futuro de las aplicaciones cient\u00edficas y de negocio. De hecho, se generan cientos de petabytes de datos espaciales por a\u00f1o mediante instrumentos de investigaci\u00f3n, dispositivos m\u00e9dicos y redes sociales. Sin embargo, tal y como muchos autores han se\u00f1alado, la falta de entornos especializados en manejar este tipo de datos est\u00e1 limitando sus posibles aplicaciones y est\u00e1 impidiendo muchos avances cient\u00edficos. En esta tesis, describimos 3 aplicaciones cient\u00edficas HPC, que cubren los \u00e1mbitos de din\u00e1mica molecular, an\u00e1lisis neurocient\u00edfico y simulaciones f\u00edsicas, donde hemos experimentado en primera mano las limitaciones de las tecnolog\u00edas existentes. Gracias a nuestras experiencias, hemos podido definir qu\u00e9 funcionalidades ser\u00edan deseables y no existen, y nos hemos centrado en dos caracter\u00edsticas que, al combinarlas, mejoran significativamente la manera en la que se analizan los datos cient\u00edficos. Por un lado, las simulaciones cient\u00edficas generan conjuntos de datos complejos, en los que cada elemento es descrito por m\u00faltiples caracter\u00edsticas correlacionadas. 
Por ejemplo, una part\u00edcula puede tener una posici\u00f3n espacial (x, y, z) en un momento dado (t). Si queremos encontrar todos los elementos dentro de la misma \u00e1rea y periodo, o bien recorremos y analizamos todo el conjunto de datos, o bien organizamos los datos de manera que se almacenen juntos todos los elementos que comparten \u00e1rea en un momento dado. Esta segunda opci\u00f3n se conoce como Indexaci\u00f3n Multidimensional (IM) y usa diferentes t\u00e9cnicas para agrupar y organizar datos similares. Por otro lado, se suele se\u00f1alar que las anal\u00edticas aproximadas son una manera inteligente y flexible de explorar grandes conjuntos de datos en poco tiempo. Este tipo de anal\u00edticas incluyen una amplia familia de algoritmos que acelera el tiempo de procesado, relajando la precisi\u00f3n de los resultados dentro de un determinado intervalo de confianza. Por ejemplo, si queremos saber la edad media de un grupo con precisi\u00f3n de un a\u00f1o, podemos considerar s\u00f3lo un subconjunto aleatorio de todas las personas, reduciendo as\u00ed la cantidad de c\u00e1lculo. Pero si adem\u00e1s queremos menos operaciones de entrada\/salida, necesitamos un muestreo eficiente de datos, que implica organizar los datos de manera que no necesitemos recorrerlos todos para generar una muestra aleatoria. De acuerdo con nuestros an\u00e1lisis, la combinaci\u00f3n de Indexaci\u00f3n Multidimensional con Muestreo eficiente de datos (IMM) es una caracter\u00edstica vital que no est\u00e1 disponible en las soluciones actuales de gesti\u00f3n distribuida de datos. Esta tesis pretende resolver esta limitaci\u00f3n y proporciona unas soluciones novedosas que son escalables. En primer lugar, describimos las alternativas de gesti\u00f3n de datos que existen y motivamos nuestra preferencia por las bases de datos NoSQL basadas en clave-valor. 
En segundo lugar, proponemos un modelo anal\u00edtico para estudiar la influencia que tienen los modelos de datos sobre la escalabilidad y el rendimiento de este tipo de bases de datos distribuidas. En tercer lugar, usamos el modelo anal\u00edtico para dise\u00f1ar dos novedosos algoritmos IMM: el D8tree y el AOTree. Nuestra primera soluci\u00f3n, el D8tree, mejora el estado del arte actual para consultas espaciales aproximadas, cuando el conjunto de datos es est\u00e1tico y mayoritariamente de lectura. Despu\u00e9s, mejoramos la capacidad de ingesti\u00f3n introduciendo el AOTree, un algoritmo que conserva el rendimiento del D8tree incluso para aplicaciones HPC intensivas en escritura. Hemos comparado nuestra soluci\u00f3n con PostgreSQL y almacenamiento plano demostrando que nuestra propuesta mejora tanto el rendimiento como la escalabilidad. Finalmente, describimos Qbeast, el sistema que implementa los algoritmos D8tree y AOTree, e ilustramos c\u00f3mo Qbeast simplifica el flujo de trabajo de los cient\u00edficos ofreciendo una soluci\u00f3n escalable e integra<\/jats:p>","DOI":"10.5821\/dissertation-2117-131429","type":"dissertation","created":{"date-parts":[[2023,7,19]],"date-time":"2023-07-19T05:53:16Z","timestamp":1689745996000},"approved":{"date-parts":[[2019,3,22]]},"source":"Crossref","is-referenced-by-count":0,"title":["A framework for multidimensional indexes on distributed and highly-available data stores"],"prefix":"10.5821","author":[{"sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cesare","family":"Cugnasco","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"3865","container-title":[],"original-title":[],"deposited":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T06:38:28Z","timestamp":1770359908000},"score":1,"resource":{"primary":{"URL":"https:\/\/hdl.handle.net\/2117\/131429"}},"subtitle":[],"editor":[{"given":"Yolanda","family":"Becerra 
Fontal","sequence":"first","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Jordi","family":"Torres Vi\u00f1als","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[null]]},"references-count":0,"URL":"https:\/\/doi.org\/10.5821\/dissertation-2117-131429","relation":{},"subject":[]}}