I am not a database expert and have no formal computer science background, so bear with me. I want to know the kinds of real-world negative things that can happen if you use an old MongoDB version (prior to v4), which was not ACID compliant. This applies to any non-ACID-compliant database.
I understand that MongoDB can perform atomic operations, but that it doesn't "support traditional locking and complex transactions", mostly for performance reasons. I also understand the importance of database transactions, and the classic example where your database is for a bank and you're updating several records that all need to stay in sync: you want the transaction to revert back to the initial state if there's a power outage, so the credit equals the purchase, etc.
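(For context: MongoDB 4.0+ does support multi-document transactions through sessions. Below is a minimal sketch of the bank example in Python with pymongo; the collection and field names ("accounts", "balance") and the amounts are made up for illustration, and transactions require a replica set.)

from pymongo import MongoClient

# Assumes MongoDB 4.0+ running as a replica set; multi-document transactions need one.
client = MongoClient("mongodb://localhost:27017")
accounts = client.bank.accounts

def transfer(from_id, to_id, amount):
    # Either both updates commit or neither does. On pre-4.0 MongoDB each update
    # would be individually atomic, but the pair could be left half-applied.
    with client.start_session() as session:
        with session.start_transaction():
            accounts.update_one({"_id": from_id}, {"$inc": {"balance": -amount}}, session=session)
            accounts.update_one({"_id": to_id}, {"$inc": {"balance": amount}}, session=session)

transfer("alice", "bob", 100)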
But when I get into conversations about MongoDB, those of us that don't know the technical details of how databases are actually implemented start throwing around statements like:
MongoDB is way faster than MySQL and Postgres, but there's a tiny chance, like...
I'm kicking tires on BI tools, including, of course, Tableau. Part of my evaluation includes correlating the SQL generated by the BI tool with my actions in the tool.
Tableau has me mystified. My database has 2 billion things; however, no matter what I do in Tableau, the query Redshift reports as having been run is "Fetch 10000 in SQL_CURxyz", i.e. a cursor operation. In the screenshot below, you can see the cursor ids change, indicating new queries are being run -- but you don't see the original queries.
Is this a Redshift or Tableau quirk? Any idea how to see what's actually running under the hood? And why is Tableau always operating on 10000 records at a time?
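(One way to peek under the hood, if it helps: the SQL that declared the cursor should be recorded in Redshift's system tables. A rough sketch with psycopg2 is below; using STL_UTILITYTEXT as the table that captures DECLARE ... CURSOR FOR <query> text is my reading of the Redshift docs, so treat it as a starting point rather than gospel, and the connection details are placeholders.)

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="me", password="...")
cur = conn.cursor()

# Long statements are split across rows; order by xid/sequence to stitch them back together.
cur.execute("""
    SELECT xid, starttime, sequence, text
    FROM stl_utilitytext
    WHERE text ILIKE '%declare%cursor%'
    ORDER BY xid, sequence
""")
for row in cur.fetchall():
    print(row)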
Situation now: I have a data warehouse job profile that publishes a .txt file to the Data folder every morning. I open a Tableau workbook, which automatically updates the data visualisations because of a union I made. I save this workbook as an extract, and colleagues without Tableau Desktop can view it via Tableau Reader.
What I need: This reporting format is heavily dependent on me and I need to automate this.
Is this even possible without Tableau Server?
Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z"...
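(To make the question concrete, here is the kind of query I mean, sketched in PySpark; the path and column names (events, x, y) are invented.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full-scan-example").getOrCreate()

# The dataset on disk can be far larger than cluster memory.
events = spark.read.parquet("/data/events")
events.createOrReplaceTempView("events")

# With no index, this is a scan over every partition (modulo partition pruning
# and predicate pushdown), which is exactly the cost in question.
result = spark.sql("SELECT x FROM events WHERE y = 'z'")
result.explain()   # physical plan shows a file scan + filter, not an index lookup
result.show(10)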
We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model of staging tables, a star schema and an ETL to pull the data through the process.
Kimball talks about using the staging area for import, cleaning, processing and everything until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, followed by optionally taking the data through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single entity; no single responsibility here.
Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:
Upload tables: raw source system data, unmodified
Staging tables: intermediate processing, typed and cleansed
Warehouse tables
You can stick these in separate schemas and then apply differing policies for archive/backup/security etc. One of the other guys...
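(For illustration, the separate-schema split with differing security might look roughly like this, assuming a SQL Server warehouse and driving the DDL from Python via pyodbc; the schema and role names are made up.)

import pyodbc

conn = pyodbc.connect("DSN=dwh;UID=etl_admin;PWD=...", autocommit=True)
cur = conn.cursor()

for ddl in [
    "CREATE SCHEMA upload",      # raw source-system data, unmodified
    "CREATE SCHEMA staging",     # typed, cleansed, intermediate processing
    "CREATE SCHEMA warehouse",   # the star schema
    "GRANT SELECT ON SCHEMA::warehouse TO report_users",  # analysts see only the star schema
    "DENY SELECT ON SCHEMA::upload TO report_users",      # keep raw data off-limits
]:
    cur.execute(ddl)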
For some reason my MDF file is 154 GB; however, I only loaded 7 GB worth of data from flat files. Why is the MDF file so much larger than the actual source data?
More info:
Only a few tables with ~25 million rows. No large varchar fields (the biggest is varchar(300); most are less than varchar(50)). The tables are not very wide (< 20 columns). Also, none of the large tables are indexed yet; tables with indexes have less than 1 million rows. I don't use char, only varchar, for strings. Datatype is not the issue.
Turned out it was the log file, not the MDF file. The MDF file is actually 24 GB, which seems more reasonable, though still big IMHO.
UPDATE:
I fixed the problem with the LDF (log) file by changing the recovery model from FULL to SIMPLE. This is okay because this server is only used for internal development and ETL processing. In addition, before changing to SIMPLE I had to shrink the log file. Shrinking is not recommended in most cases; however, this was one of those cases where the log file should have never...
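(For reference, the two commands involved look roughly like this, shown in the common switch-then-shrink order and driven here from Python/pyodbc with autocommit on, because ALTER DATABASE cannot run inside a transaction; the database name and logical log-file name are placeholders.)

import pyodbc

conn = pyodbc.connect("DSN=etl_server;Trusted_Connection=yes", autocommit=True)
cur = conn.cursor()

# Switch to SIMPLE recovery: fine for a dev/ETL box where point-in-time restore isn't needed.
cur.execute("ALTER DATABASE [MyDb] SET RECOVERY SIMPLE")

# Then shrink the oversized log file (target size in MB).
cur.execute("USE [MyDb]")
cur.execute("DBCC SHRINKFILE (MyDb_log, 1024)")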
I have completed my graduate public policy program, but it was not at all tech heavy: some economics and econometrics, but nothing requiring any CS knowledge. A good portion of the research jobs in DC require a basic level of programming knowledge. Mostly they want people who can perform advanced search and retrieval functions with large datasets and save stuff in different formats within their servers. And they want STATA/stats knowledge, which I have some of.
My question is this: where is the best place to start learning some programming to get to this level? For instance, is Java, SQL, VBA or something else the best thing and most useful for these purposes? And, how much math do I need to write and run simple requests?
Thanks
My PHP SQL statement is failing due to the pound (#) sign. How can I get around this (other than fixing the name in the database)?
$sql = "SELECT CMCD, TK#, TECH, STATS from LIB.TICKET... moreMy PHP SQL Statement is failing due to pound (#) sign. How can I get around this. (Other than fixing the database name?)
$sql = "SELECT CMCD, TK#, TECH, STATS from LIB.TICKET FETCH FIRST 10 ROWS ONLY ";
$rs = odbc_exec($conn,$sql);
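(A common workaround, assuming this is DB2 over ODBC as LIB.TICKET and FETCH FIRST suggest, is to make TK# a delimited identifier by wrapping it in double quotes so the # is taken literally. A quick sketch in Python/pyodbc follows; the DSN and credentials are placeholders, and the same quoting works inside the original PHP string.)

import pyodbc

conn = pyodbc.connect("DSN=ibmi;UID=me;PWD=secret")
cur = conn.cursor()

# Double quotes turn TK# into an ordinary delimited column name.
sql = 'SELECT CMCD, "TK#", TECH, STATS FROM LIB.TICKET FETCH FIRST 10 ROWS ONLY'
cur.execute(sql)
for row in cur.fetchmany(10):
    print(row)

# Equivalent change in the original PHP:
#   $sql = 'SELECT CMCD, "TK#", TECH, STATS FROM LIB.TICKET FETCH FIRST 10 ROWS ONLY';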
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.
I tried putting them in a table and doing it in batches of 100. Four days later, this is still running, with only 297,268 rows deleted. (I had to select 100 IDs from an ID table, delete WHERE IN that list, then delete those 100 IDs from the IDs table.)
I tried:
DELETE FROM tbl WHERE id IN (select * from ids)
That's taking forever, too. It's hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days.
Just kind of looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of IDs.
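(For what it's worth, the pattern that usually behaves far better than row-at-a-time batches is a single set-based delete joined against an indexed, analyzed ID table. A sketch with psycopg2 follows, reusing the tbl/ids names from above; the connection string is a placeholder.)

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# An index plus fresh statistics on the ID list help the planner pick a sane plan.
cur.execute("CREATE INDEX IF NOT EXISTS ids_id_idx ON ids (id)")
cur.execute("ANALYZE ids")

# One set-based statement instead of millions of tiny ones.
cur.execute("DELETE FROM tbl USING ids WHERE tbl.id = ids.id")
print(cur.rowcount, "rows deleted")

conn.commit()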
I'm working on some IoT integrations and I am wondering where in Azure I can parse my IoT data (JSON data).
My earlier workflow was this: sensor pushes data -> IoT Hub -> Stream Analytics job -> SQL database. The Stream Analytics job works fine, but I have heard that it is not the "right" way to parse data in Azure. So what is the right and best way to do that? I need to save it to a SQL database.
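(If you do decide to parse the JSON in your own code rather than in Stream Analytics, for example in an Azure Function or any other consumer reading from IoT Hub, the core of it is just JSON parsing plus an insert. A minimal sketch follows; the field names, table, and connection string are invented.)

import json
import pyodbc

def handle_message(body, conn):
    payload = json.loads(body)   # raw JSON payload from the device/IoT Hub
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO telemetry (device_id, temperature, recorded_at) VALUES (?, ?, ?)",
        payload["deviceId"], payload["temperature"], payload["timestamp"],
    )
    conn.commit()

conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=myserver;Database=iot;UID=me;PWD=secret")
handle_message('{"deviceId": "sensor-1", "temperature": 21.5, "timestamp": "2024-01-01T00:00:00Z"}', conn)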