ALL OF US DO NOT HAVE EQUAL TALENT. YET, ALL OF US HAVE AN EQUAL OPPORTUNITY TO DEVELOP OUR TALENTS. ~ Ratan Tata
Sunday, December 21, 2008
SPSS Links
http://courses.csusm.edu/resources/spss/
http://www.stat.tamu.edu/spss.php
http://www.ats.ucla.edu/stat/spss/
http://www.ats.ucla.edu/stat/seminars/
http://faculty.vassar.edu/lowry/webtext.html
Monday, November 24, 2008
OLAP Versus Data Mining
OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data
mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Key is OLAP
OLAP tools are targeted toward simplifying and supporting interactive data analysis,
whereas the goal of data mining tools is to automate as much of the process as possible,
while still allowing users to guide the process. In this sense, data mining goes one step beyond traditional on-line analytical processing.
Key is OLAP tools
OLAP functions are essentially for user-directed data summary and comparison
(by drilling, pivoting, slicing, dicing, and other operations). Data mining covers a much broader spectrum than simple OLAP operations because it performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.
Key is OLAP Operations
Data mining is not confined to the analysis of data stored in data warehouses. It may
analyze data existing at more detailed granularities than the summarized data provided
in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia
data that are difficult to model with current multidimensional database technology. In
this context, data mining covers a broader spectrum than OLAP with respect to data
mining functionality and the complexity of the data handled.
Key is OLAP data analysis
Because data mining involves more automated and deeper analysis than OLAP,
data mining is expected to have broader applications. Data mining can help business
managers find and reach more suitable customers, as well as gain critical
business insights that may help drive market share and raise profits. In addition,
data mining can help managers understand customer group characteristics
and develop optimal pricing strategies accordingly, correct item bundling based
not on intuition but on actual item groups derived from customer purchase patterns,
reduce promotional spending, and at the same time increase the overall net
effectiveness of promotions.
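The contrast can be sketched in a few lines of Python (toy data and a deliberately naive clustering routine, purely for illustration): OLAP answers the question the user asks, while the mining step finds a grouping nobody specified.

```python
# OLAP-style analysis: user-directed summary ("total sales by region").
# Mining-style analysis: the data itself reveals structure (here, a
# naive 2-means clustering of order amounts). Toy data; the clustering
# assumes both clusters stay non-empty.
from collections import defaultdict

orders = [("East", 120.0), ("East", 95.0), ("West", 40.0),
          ("West", 38.0), ("East", 110.0), ("West", 45.0)]

# OLAP-style summary: drilling and slicing boil down to grouped aggregation.
totals = defaultdict(float)
for region, amount in orders:
    totals[region] += amount

# Mining-style step: discover two clusters of order values automatically.
def two_means(values, iters=20):
    lo, hi = min(values), max(values)            # initial centroids
    for _ in range(iters):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return lo, hi

small, large = two_means([amt for _, amt in orders])
```

The summary tells you what you asked for; the two centroids (a "small orders" group and a "large orders" group) are knowledge the user never requested, which is the step beyond OLAP the text describes.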
Tuesday, November 11, 2008
Table Expressions
Four types of table expressions:
1. Derived tables
2. Common table expressions (CTEs)
3. Views
4. Inline table-valued functions (inline TVFs), also known as parameterized views.
Benefits
Table expressions are virtual tables.
The benefit of using table expressions is a modular approach to query design, not performance.
Any table expression must meet the following three requirements.
1. Order is not guaranteed, because the rows in a relational table are not ordered. The one exception is a TOP clause with ORDER BY, where ORDER BY is used only to restrict which rows are returned, not to order the result.
2. All columns must have names.
3. All column names must be unique.
To reuse a definition, use views and inline TVFs.
If you don't need to reuse a definition, use derived tables and CTEs.
APPLY operator
The APPLY operator operates on two input tables, the second of which may be a table expression; I'll refer to them as the left and right tables. The right table is usually a derived table or an inline TVF, and it can refer to columns of the left table.
If the right table expression returns an empty set, the CROSS APPLY operator does not return the corresponding left row. To still return that left row (with NULLs from the right side), use OUTER APPLY.
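A sketch of the difference, using hypothetical Customers and Orders tables in a common "top N per group" use of APPLY (illustrative T-SQL, not tied to any real schema):

```sql
-- Hypothetical schema: Customers(custid, ...), Orders(orderid, custid, orderdate).
-- CROSS APPLY: customers with no orders disappear from the result,
-- because the right-side expression returns an empty set for them.
SELECT C.custid, A.orderid, A.orderdate
FROM Customers AS C
CROSS APPLY (SELECT TOP (3) orderid, orderdate
             FROM Orders AS O
             WHERE O.custid = C.custid
             ORDER BY orderdate DESC) AS A;

-- OUTER APPLY: customers with no orders are kept,
-- with NULLs in the right-side columns.
SELECT C.custid, A.orderid, A.orderdate
FROM Customers AS C
OUTER APPLY (SELECT TOP (3) orderid, orderdate
             FROM Orders AS O
             WHERE O.custid = C.custid
             ORDER BY orderdate DESC) AS A;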
--------------------------------------------------------------------------------
In its non-recursive form, a CTE is an alternative to derived tables, views, and inline user-defined functions (UDFs).
A derived table is a named table expression that exists for the duration of a query.
Views are also named table expressions, but they persist in the database until you explicitly drop them. Unlike with derived tables:
1. a query can refer to a view name multiple times;
2. the view can't refer to variables.
Inline UDFs have the same characteristics as views except that they can refer to the function's input arguments.
CTEs have the best features of derived tables, views, and inline UDFs—they are named table expressions that exist only for the duration of the query, that an outer query can refer to multiple times, and that can refer to variables defined in the calling batch. . . .
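A minimal runnable sketch of that last point, using Python's built-in sqlite3 (the CTE form shown is the same in T-SQL): the CTE is defined once but referenced twice in the outer query, which a derived table would force you to write out twice.

```python
# Demonstrates a non-recursive CTE referenced multiple times in one query.
# Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (yr INTEGER, amount REAL);
INSERT INTO Sales VALUES (2007, 100.0), (2007, 50.0), (2008, 200.0);
""")

row = conn.execute("""
WITH YearlyTotals AS (
    SELECT yr, SUM(amount) AS total
    FROM Sales
    GROUP BY yr
)
-- The CTE name appears twice below: once for the current year,
-- once for the previous year, to compute year-over-year growth.
SELECT cur.yr, cur.total, cur.total - prev.total AS growth
FROM YearlyTotals AS cur
JOIN YearlyTotals AS prev ON prev.yr = cur.yr - 1
""").fetchone()
```

With a derived table, the whole `SELECT yr, SUM(amount) ... GROUP BY yr` body would have to be repeated for `cur` and for `prev`.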
Friday, October 3, 2008
Data Mining Books
Data Mining Books to buy
1. Margaret H. Dunham
2. Jiawei Han
Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Data Mining Videos from Google Talk
http://www.youtube.com/watch?v=-tWS0tN8sW0 -- Day 13
http://www.youtube.com/watch?v=fmZYH3rmqDQ
http://www.youtube.com/watch?v=l4a3e__QzoY
http://www.youtube.com/watch?v=CzvgrcQhWGg
http://www.youtube.com/watch?v=xpuB9ydmBsM
http://www.youtube.com/watch?v=N5i85v0ckzY
http://www.youtube.com/watch?v=zRsMEl6PHhM
-----------------------------------------------------------
Course from Stanford University
Sunday, September 21, 2008
Statistics topics to cover for Data Mining
Data Analytics Competence Spectrum
From Pricing, Risk, Cross-selling, Retention, Time-to-Market, Ad Efficiency and Campaign Management to Acquisition and Growth, and Web Analytics, our Data Analytics capabilities cover a number of possible analyses, using different programming tools and software such as SAS, SPSS, Excel, SQL, Matlab and Siebel. The following is a sample of the techniques used for various types of analysis.
Link Analysis
Multivariate, Linear, Ordinal and Multinomial Regression
Conjoint Analysis
Logistic Regression
Factor Analysis
Survival Analysis
Decision Trees
Clustering
Bayesian Analysis
Neural Networks
Kohonen Maps
Simulation Techniques
Discriminant Analysis
Time Series Analysis
Vendor Performance Management Tool
Sales and Retail Dashboards
Contact Search Tool
Launch Performance Tracking Tool
Market Basket Analysis
Econometric Forecasting
Cycle Research for Technical Analysis
Reporting and Analysis
Optimising Cash and Accounts Payable Management
Thursday, August 7, 2008
SQL Server 2005 DMVs
Some interesting DMVs in SQL Server 2005:
sys.dm_tran_locks
sys.dm_exec_cached_plans
sys.dm_exec_query_plan
sys.dm_exec_sessions
sys.dm_tran_session_transactions
sys.dm_exec_requests
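For example, joining a few of these DMVs shows what each user session is currently executing. A sketch (requires VIEW SERVER STATE permission; note the CROSS APPLY, since sys.dm_exec_sql_text is a table-valued function):

```sql
-- Who is running what, right now.
SELECT S.session_id,
       S.login_name,
       R.status,
       R.command,
       T.text AS current_sql
FROM sys.dm_exec_sessions AS S
JOIN sys.dm_exec_requests AS R
  ON R.session_id = S.session_id
CROSS APPLY sys.dm_exec_sql_text(R.sql_handle) AS T
WHERE S.is_user_process = 1;
```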
Sunday, July 27, 2008
SSRS - Improving Performance
1. Avoid long-running reports by creating a report snapshot that runs nightly or during periods of low activity on your system.
2. Define pagination to hide overhead. For example, if a user requests a report that returns 1,000 rows of data, you can use page breaks to render the initial page of results while the report server renders the rest of the pages in the background. Without pagination, the user would have to wait until all 1,000 rows were rendered.
3. Implement filters for performance. Apply report-level filters on data regions and data groupings rather than query parameters where a cached copy or snapshot of the report can then serve multiple filter values without re-querying the source.
4. Enable drill-down for detail. Instead of giving users a report that shows all the data at once, design a summary report that presents a subset of the data and lets users drill down for more detail, using master-detail reports that take advantage of SSRS's navigation capabilities. Many report controls have a navigation property, which you can use to create a hyperlink on the control for passing specific values.
Saturday, July 26, 2008
Dimensional Vs ER Modeling
1. The relationships in a dimensional model don't represent business rules; instead, they are navigational paths used to help write reports or create graphs. The relationships in an ER model, by contrast, do represent business rules.
2. The primary goal of ER modeling is to remove all non-key data redundancy, whereas dimensional modeling controls data redundancy by conforming dimension and fact tables. A conformed table can be used in more than one dimensional model.
How to Create a Dimensional Model
1. Identify the business processes; each business process can be expressed as a data mart
---- a modular, highly focused, richly detailed, incrementally designed component of the data warehouse.
Initially, focus on single-source data marts, not multiple-source data marts.
Examples of single-source data marts are retail sales, purchase orders, shipments, and payments.
An example of a multiple-source data mart is customer profitability, which combines revenue and costs that often come from sales and inventory databases.
2. Declare the grain of the fact table, which is the level of detail that the table captures.
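As a minimal sketch of these two steps, here is a hypothetical retail-sales mart whose grain is one row per line item on a sales receipt (all table and column names are invented for illustration):

```sql
CREATE TABLE DimDate (
    DateKey   INT PRIMARY KEY,    -- surrogate key, e.g. 20080628
    FullDate  DATE NOT NULL
);

CREATE TABLE DimProduct (
    ProductKey  INT PRIMARY KEY,  -- surrogate key
    ProductName VARCHAR(100) NOT NULL
);

-- Grain: one row per line item on a sales receipt.
CREATE TABLE FactSales (
    DateKey       INT NOT NULL REFERENCES DimDate (DateKey),
    ProductKey    INT NOT NULL REFERENCES DimProduct (ProductKey),
    ReceiptNumber VARCHAR(20) NOT NULL,   -- a degenerate dimension
    Quantity      INT NOT NULL,
    SalesAmount   DECIMAL(10, 2) NOT NULL
);
```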
Saturday, June 28, 2008
SSRS -- Reporting
Report Structure Types
Structured Reports -- Relational
Structured Grouping -- OLAP
Structured Attributes -- Data mining
Structured Reports -- Relational
a. Pre-defined report structure (i.e., the report structure is static).
b. Allows interaction (drill-down, slicing) but not dynamic restructuring.
Structured Grouping -- OLAP
a. Pre-defined grouping buckets.
b. Report structure is dynamic (slice and dice, dynamic pivot, sort and filter).
c. Pre-summarized data for speed.
Structured Attributes -- Data Mining
a. Pre-defined attributes (and model).
b. The mining model calculates grouping and structure.
c. Used for prediction, exploration, and analysis.
Reporting Approaches
1. OLTP data: ODS -- Report
2. Star schema: ODS -- Staging Table -- Star Schema -- Report
3. OLAP data: ODS -- Staging Table -- Star Schema -- Cube -- Report
Here we use Extract, Transform, and Load (ETL).
4. UDM (Unified Dimensional Model)
Friday, May 30, 2008
Junk dimensions
OLTP tables are often full of flag fields and yes/no attributes, many of which are used for operational support and have no documentation except for the column names.
There are three conventional ways to deal with these attributes: discard all of the miscellaneous attributes, eliminating them from the dimensional design; incorporate the miscellaneous attributes into the fact table; or make each miscellaneous attribute a separate dimension. However, all of these options are less than ideal.
Discarding the data can be dangerous because the miscellaneous values, flags, and yes/no fields might contain valuable business data.
Including the miscellaneous attributes in the fact table could cause the fact table to swell to alarming proportions, especially if you have more than just a few miscellaneous attributes. The increased size of the fact table could cause serious performance problems because of the reduced number of records per physical I/O. Even if you tried to index these fields to minimize the performance problems, you still wouldn’t gain anything because so many of the miscellaneous fields contain flag values such as 0 and 1; Y and N; or open, pending, and closed.
And if you make each miscellaneous attribute a separate dimension, it will most likely result in a complicated dimensional design: you're looking at a much more complicated star schema and associated cube.
Column names such as Completed, Packed, Shipped, Received, Delivered, and Returned (each with yes/no data values) are very common, and they do have business value. Instead of discarding flag fields and yes/no attributes, I suggest placing them all into a single junk dimension.
The junk dimension shown in Figure 1 represents an order-fulfillment system; the column headers show some of the possible statuses an ordered item can have. Row 1 indicates that the item ordered has been picked out of the warehouse, packed for shipment, shipped, delivered, received, returned for a refund, and restocked in the warehouse. Row 9 shows an item on order that's waiting to begin the order-fulfillment process. The rows in between indicate items that are in various stages of the order-fulfillment process. This example is very simple because the process is so linear and sequential.
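One way such a junk dimension could be built is to enumerate every combination of the flags once, give each combination a surrogate key, and point the fact table at that key. A sketch with invented flag names; the INSERT uses the row-constructor (VALUES) syntax of SQL Server 2008, so on SQL Server 2005 you would use SELECT ... UNION ALL instead:

```sql
CREATE TABLE DimOrderStatusJunk (
    StatusKey INT IDENTITY(1, 1) PRIMARY KEY,
    Picked    CHAR(1) NOT NULL,   -- 'Y' / 'N'
    Packed    CHAR(1) NOT NULL,
    Shipped   CHAR(1) NOT NULL,
    Delivered CHAR(1) NOT NULL
);

-- Cross join the flag domains to generate all 16 combinations once.
INSERT INTO DimOrderStatusJunk (Picked, Packed, Shipped, Delivered)
SELECT P.v, K.v, S.v, D.v
FROM (VALUES ('Y'), ('N')) AS P(v)
CROSS JOIN (VALUES ('Y'), ('N')) AS K(v)
CROSS JOIN (VALUES ('Y'), ('N')) AS S(v)
CROSS JOIN (VALUES ('Y'), ('N')) AS D(v);
```

The fact table then carries one small StatusKey column instead of a clutch of flag columns.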
Thursday, May 29, 2008
Degenerate dimension
A degenerate dimension is not one that lacks moral structure or integrity. Instead, a degenerate dimension is a dimension that doesn't exist as its own table but is still represented in the data warehouse.
Data warehouse dimensional design requires you to include control documents such as invoices, orders, and warranties. Each of these control documents has a control number such as the invoice number, the order number, or the serial number of the item under warranty.
Degenerate dimensions are simply control numbers that are stored in the fact table of a data warehouse. These control numbers look like keys, but they don’t act like keys; they have no associated dimension to join with. Control numbers provide a way to identify which line items in the fact table were generated as a part of the same order or invoice. Let’s take a look at how to map control numbers from the OLTP database to the fact table in the data warehouse and associate them with each line item
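For example, against a hypothetical FactSales fact table, the stored control number alone is enough to reassemble and total each order, with no dimension table to join to:

```sql
-- OrderNumber lives in the fact table; it looks like a key but joins to nothing.
SELECT F.OrderNumber,
       SUM(F.SalesAmount) AS OrderTotal,
       COUNT(*)           AS LineItems
FROM FactSales AS F
GROUP BY F.OrderNumber;
```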
Labels:
data modeling,
degenerate dimension,
dimensions
Tuesday, April 29, 2008
Any BI Platform should deliver 12 capabilities
Magic Quadrant for Business Intelligence
Platforms, 2008
http://mediaproducts.gartner.com/reprints/microsoft/vol7/article3/article3.html
Gartner defines a BI platform as a software platform that delivers the 12 capabilities listed below.
Integration
BI infrastructure — All tools in the platform should use the same security, metadata, administration, portal integration, object model and query engine, and should share the same look and feel.
Metadata management — This is arguably the most important of the 12 capabilities. Not only should all tools leverage the same metadata, but the offering should provide a robust way to search, capture, store, reuse and publish metadata objects such as dimensions, hierarchies, measures, performance metrics and report layout objects.
Development — The BI platform should provide a set of programmatic development tools — coupled with a software developer's kit for creating BI applications — that can be integrated into a business process, and/or embedded in another application. The BI platform should also enable developers to build BI applications without coding by using wizard-like components for a graphical assembly process. The development environment should also support Web services in performing common tasks such as scheduling, delivering, administering and managing.
Workflow and collaboration — This capability enables BI users to share and discuss information via public folders and discussion threads. In addition, the BI application can assign and track events or tasks allotted to specific users, based on pre-defined business rules. Often, this capability is delivered by integrating with a separate portal or workflow tool.
Information Delivery
Reporting — Reporting provides the ability to create formatted and interactive reports with highly scalable distribution and scheduling capabilities. In addition, BI platform vendors should handle a wide array of reporting styles (for example, financial, operational and performance dashboards).
Dashboards — This subset of reporting includes the ability to publish formal, Web-based reports with intuitive displays of information, including dials, gauges and traffic lights. These displays indicate the state of the performance metric, compared with a goal or target value. Increasingly, dashboards are used to disseminate real-time data from operational applications.
Ad hoc query — This capability, also known as self-service reporting, enables users to ask their own questions of the data, without relying on IT to create a report. In particular, the tools must have a robust semantic layer to allow users to navigate available data sources. In addition, these tools should offer query governance and auditing capabilities to ensure that queries perform well.
Microsoft Office integration — In some cases, BI platforms are used as a middle tier to manage, secure and execute BI tasks, but Microsoft Office (particularly Excel) acts as the BI client. In these cases, it is vital that the BI vendor provides integration with Microsoft Office, including support for document formats, formulas, data "refresh" and pivot tables. Advanced integration includes cell locking and write-back.
Analysis
OLAP — This enables end users to analyze data with extremely fast query and calculation performance, enabling a style of analysis known as "slicing and dicing." This capability could span a variety of storage architectures such as relational, multidimensional and in-memory.
Advanced visualization — This provides the ability to display numerous aspects of the data more efficiently by using interactive pictures and charts, instead of rows and columns. Over time, advanced visualization will go beyond just slicing and dicing data to include more process-driven BI projects, allowing all stakeholders to better understand the workflow through a visual representation.
Predictive modeling and data mining — This capability enables organizations to classify categorical variables and estimate continuous variables using advanced mathematical techniques.
Scorecards — These take the metrics displayed in a dashboard a step further by applying them to a strategy map that aligns key performance indicators to a strategic objective. Scorecard metrics should be linked to related reports and information in order to do further analysis. A scorecard implies the use of a performance management methodology such as Six Sigma or a balanced scorecard framework.
Where the Microsoft BI stands
Strengths
Microsoft's pricing and integration with its Office (including its major CPM-led innovation of 2007, PerformancePoint Server) and SQL Server products are especially attractive to organizations that have standardized on the Microsoft information infrastructure. The bundling and pricing of its BI products makes them an economically attractive offering that will be considered by many organizations.
Microsoft's BI products appeal to the large community of Microsoft application developers. Microsoft's BI platform provides developers with infrastructure, development tools, workflow and collaboration capabilities that are held in higher regard than those of many of its competitors.
Microsoft is benefiting from developing its indirect sales and services channel and market awareness of its SQL Server, Office and SharePoint Portal installed base. As a result, Microsoft estimates that it now has around 2,000 OEM/ISV partners for its BI products. Many departmental and business unit end users who hear the Office and SharePoint integration marketing messages for BI will likely ask for the products and associated support from their IT departments.
According to the customers we contacted as part of this research, Microsoft offers the best BI software quality of all the megavendors, with over half of them reporting no problems with the software. This reflects Microsoft's focus on BI, the strength of its product line management team and the fact that much of its BI technology has been internally developed rather than acquired.
Cautions
Microsoft was late to join the BI platforms market and it is still playing catch-up. According to customers, it still lags behind pure-play vendors in terms of metadata management, reporting, and dashboard and ad hoc query capabilities. However, Microsoft is in it for the long haul and Gartner expects that it will continue to grow its BI investments in order to become a stronger competitor. Organizations that have heterogeneous applications, information infrastructure and development environments will find Microsoft's BI-related marketing and announcements to be interesting but potentially distracting, since they may not easily integrate with their existing investments in infrastructure and applications. Despite its price advantage, Microsoft will face increasing competitive pressure as BI becomes a market where strategic sourcing, of more than just BI capabilities, takes precedence over features and functions, and as the other megavendors' acquisitions coalesce into their product stacks.
Platforms, 2008
http://mediaproducts.gartner.com/reprints/microsoft/vol7/article3/article3.html
Gartner defines a BI platform as a software platform that delivers the 12 capabilities listed below.
Integration
BI infrastructure — All tools in the platform should use the same security, metadata, administration, portal integration, object model and query engine, and should share the same look and feel.
Metadata management — This is arguably the most important of the 12 capabilities. Not only should all tools leverage the same metadata, but the offering should provide a robust way to search, capture, store, reuse and publish metadata objects such as dimensions, hierarchies, measures, performance metrics and report layout objects.
Development — The BI platform should provide a set of programmatic development tools — coupled with a software developer's kit for creating BI applications — that can be integrated into a business process, and/or embedded in another application. The BI platform should also enable developers to build BI applications without coding by using wizard-like components for a graphical assembly process. The development environment should also support Web services in performing common tasks such as scheduling, delivering, administering and managing.
Workflow and collaboration — This capability enables BI users to share and discuss information via public folders and discussion threads. In addition, the BI application can assign and track events or tasks allotted to specific users, based on pre-defined business rules. Often, this capability is delivered by integrating with a separate portal or workflow tool.
Information Delivery
Reporting — Reporting provides the ability to create formatted and interactive reports with highly scalable distribution and scheduling capabilities. In addition, BI platform vendors should handle a wide array of reporting styles (for example, financial, operational and performance dashboards).
Dashboards — This subset of reporting includes the ability to publish formal, Web-based reports with intuitive displays of information, including dials, gauges and traffic lights. These displays indicate the state of the performance metric, compared with a goal or target value. Increasingly, dashboards are used to disseminate real-time data from operational applications.
Ad hoc query — This capability, also known as self-service reporting, enables users to ask their own questions of the data, without relying on IT to create a report. In particular, the tools must have a robust semantic layer to allow users to navigate available data sources. In addition, these tools should offer query governance and auditing capabilities to ensure that queries perform well.
Microsoft Office integration — In some cases, BI platforms are used as a middle tier to manage, secure and execute BI tasks, but Microsoft Office (particularly Excel) acts as the BI client. In these cases, it is vital that the BI vendor provides integration with Microsoft Office, including support for document formats, formulas, data "refresh" and pivot tables. Advanced integration includes cell locking and write-back.
Analysis
OLAP — This enables end users to analyze data with extremely fast query and calculation performance, enabling a style of analysis known as "slicing and dicing." This capability could span a variety of storage architectures such as relational, multidimensional and in-memory.
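As an illustrative sketch (not tied to any particular OLAP engine), slicing and dicing can be demonstrated on a tiny in-memory fact table; the dimension names, members and amounts below are invented for the example:

```python
from collections import defaultdict

# A tiny fact table: (region, product, quarter, amount) rows; data is invented.
sales = [
    ("East", "A", "Q1", 100), ("East", "B", "Q1", 150),
    ("West", "A", "Q1", 80),  ("West", "B", "Q1", 120),
    ("East", "A", "Q2", 110), ("West", "B", "Q2", 130),
]

# "Dice": aggregate the measure over two dimensions at once (region x quarter).
cube = defaultdict(int)
for region, product, quarter, amount in sales:
    cube[(region, quarter)] += amount

# "Slice": fix one member of a dimension (product = "A") and re-aggregate.
slice_a = defaultdict(int)
for region, product, quarter, amount in sales:
    if product == "A":
        slice_a[region] += amount
```

Real OLAP engines precompute or index these aggregations so the same operations stay fast over millions of rows, but the logical result is the same.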
Advanced visualization — This provides the ability to display numerous aspects of the data more efficiently by using interactive pictures and charts, instead of rows and columns. Over time, advanced visualization will go beyond just slicing and dicing data to include more process-driven BI projects, allowing all stakeholders to better understand the workflow through a visual representation.
Predictive modeling and data mining — This capability enables organizations to classify categorical variables and estimate continuous variables using advanced mathematical techniques.
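As a minimal sketch of classifying a categorical variable (here a toy 1-nearest-neighbour classifier; the training points and labels are invented, and production tools would use far richer techniques):

```python
import math

# Toy training set: (feature vector, class label); the values are invented.
train = [
    ((1.0, 1.0), "low"), ((1.5, 1.8), "low"),
    ((5.0, 8.0), "high"), ((6.0, 9.0), "high"),
]

def classify(point):
    """Assign the label of the nearest training example (1-nearest-neighbour)."""
    # math.dist computes the Euclidean distance between two points.
    return min(train, key=lambda pair: math.dist(pair[0], point))[1]
```

A new observation near the "low" cluster is classified "low", one near the "high" cluster "high"; BI platforms expose the same idea through algorithms such as decision trees, clustering and regression.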
Scorecards — These take the metrics displayed in a dashboard a step further by applying them to a strategy map that aligns key performance indicators to a strategic objective. Scorecard metrics should be linked to related reports and information in order to do further analysis. A scorecard implies the use of a performance management methodology such as Six Sigma or a balanced scorecard framework.
Where the Microsoft BI stands
Strengths
Microsoft's pricing and integration with its Office (including its major CPM-led innovation of 2007, PerformancePoint Server) and SQL Server products are especially attractive to organizations that have standardized on the Microsoft information infrastructure. The bundling and pricing of its BI products make them an economically attractive offering that will be considered by many organizations.
Microsoft's BI products appeal to the large community of Microsoft application developers. Microsoft's BI platform provides developers with infrastructure, development tools, workflow and collaboration capabilities that are held in higher regard than those of many of its competitors.
Microsoft is benefiting from developing its indirect sales and services channel and market awareness of its SQL Server, Office and SharePoint Portal installed base. As a result, Microsoft estimates that it now has around 2,000 OEM/ISV partners for its BI products. Many departmental and business unit end users who hear the Office and SharePoint integration marketing messages for BI will likely ask for the products and associated support from their IT departments.
According to the customers we contacted as part of this research, Microsoft offers the best BI software quality of all the megavendors, with over half of them reporting no problems with the software. This reflects Microsoft's focus on BI, the strength of its product line management team and the fact that much of its BI technology has been internally developed rather than acquired.
Cautions
Microsoft was late to join the BI platforms market and it is still playing catch-up. According to customers, it still lags behind pure-play vendors in terms of metadata management, reporting, and dashboard and ad hoc query capabilities. However, Microsoft is in it for the long haul and Gartner expects that it will continue to grow its BI investments in order to become a stronger competitor. Organizations that have heterogeneous applications, information infrastructure and development environments will find Microsoft's BI-related marketing and announcements to be interesting but potentially distracting, since they may not easily integrate with their existing investments in infrastructure and applications. Despite its price advantage, Microsoft will face increasing competitive pressure as BI becomes a market where strategic sourcing, of more than just BI capabilities, takes precedence over features and functions, and as the other megavendors' acquisitions coalesce into their product stacks.
Wednesday, April 16, 2008
Good Sites for Understanding Theoretical Data-Mining Concepts
Courtesy of Adnan Masood
http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/15-062Data-MiningSpring2003/CourseHome/index.htm
http://www.kdnuggets.com/courses/index.html
http://www.kdnuggets.com/education/usa-canada.html
http://www.kdnuggets.com/education/online.html
http://www.statistics.com/
Top Websites for Data Mining
kdnuggets.com 27
sas.com 24
kaushik.net 15
analyticbridge.com 10
datashaping.com 8
juiceanalytics.com 8
r-project.org 7
google.com 5
videolectures.net 5
analyticrecruiting.com 4
informs.org 4
webanalyticsdemystified.com 4
Sunday, March 16, 2008
Dimensional Modeling Concepts to Learn
To understand dimensional modeling, the following concepts are very important:
http://kimballgroup.com/html/kucourseMDWD.html
• Basic dimensional modeling concepts
• Conformed dimensions
• Slowly changing dimensions
• Additional concepts in dimensional modeling
-Hierarchies and snowflaking
-Degenerate and junk dimensions
-Many-to-many dimensions
• The dimensional modeling process
---------------------------------------------------------
• Populating dimension tables
-Using the slowly changing dimension wizard
-Avoiding the slowly changing dimension wizard
• Populating fact tables
-Basic fact table processing
-The surrogate key pipeline
-Advanced issues in fact table processing
--------------------------------------------------------------
A Good Link
http://kimballgroup.com/html/articles.html
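The surrogate key pipeline step listed above can be sketched as follows: before a fact row is loaded, each natural (business) key is swapped for the surrogate key assigned by the corresponding dimension table. The table and column names here are invented for the example:

```python
# Dimension lookup: natural key -> surrogate key (normally a database table).
customer_dim = {"CUST-001": 1, "CUST-002": 2}
next_key = 3

def lookup_surrogate(natural_key):
    """Return the surrogate key, inserting a new dimension row if unseen."""
    global next_key
    if natural_key not in customer_dim:
        # A "late-arriving" natural key gets a fresh surrogate key.
        customer_dim[natural_key] = next_key
        next_key += 1
    return customer_dim[natural_key]

# Incoming fact rows carry natural keys; the pipeline rewrites them.
facts_in = [("CUST-001", 99.50), ("CUST-003", 10.00)]
facts_out = [(lookup_surrogate(nk), amount) for nk, amount in facts_in]
```

In a real ETL tool this lookup is done with cached dimension tables for speed, but the substitution logic is the same.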
Types of Dimensions
In the case of slowly changing dimensions (SCDs), there are three main types:
Type 1 -> Does not maintain any history; the existing row is simply updated (overwritten).
Type 2 -> Maintains full history. When certain attributes change, a new row is inserted; otherwise the row is updated. The latest record is identified by a flag or date field, called the MRI or Most Recent Indicator.
Type 3 -> Maintains partial history. History is kept as columns rather than rows, so it is decided in advance how many levels of history you want to maintain.
There are also three commonly discussed kinds of dimensions:
Conformed dimensions, junk dimensions and degenerate dimensions.
A conformed dimension is something that can be shared by multiple fact tables or multiple data marts.
A junk dimension groups miscellaneous flag and indicator values into a single dimension.
A degenerate dimension is something dimensional in nature that exists in the fact table itself, with no dimension table of its own.
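The difference between Type 1 and Type 2 handling can be sketched in a few lines; the dimension columns and sample values below are invented for the example:

```python
from datetime import date

# Dimension rows as dicts; the "current" flag plays the MRI role.
dim = [{"key": 1, "cust_id": "C1", "city": "Pune", "current": True,
        "effective": date(2007, 1, 1)}]

def scd_type1(cust_id, new_city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["city"] = new_city

def scd_type2(cust_id, new_city, as_of):
    """Type 2: expire the current row and insert a new one (full history)."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["current"] = False          # most-recent-indicator flipped
    dim.append({"key": len(dim) + 1, "cust_id": cust_id, "city": new_city,
                "current": True, "effective": as_of})
```

After a Type 2 change the table holds both the old and the new row, and queries filter on the MRI flag (or effective dates) to pick the current version.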
Tuesday, February 12, 2008
SSRS 2000 - Query to Retrieve All the Reports Subscribed to a File Share
select DeliveryExtension, ExtensionSettings,
       CHARINDEX('\\', ExtensionSettings),
       SUBSTRING(ExtensionSettings, CHARINDEX('\\', ExtensionSettings), 80)
from Catalog c (nolock)
inner join ReportSchedule r (nolock)
    on c.ItemID = r.ReportID
inner join Subscriptions s (nolock)
    on s.SubscriptionID = r.SubscriptionID
where DeliveryExtension = 'Report Server FileShare'
  and ExtensionSettings like '%PATH %'
Wednesday, February 6, 2008
Subscribe to:
Posts (Atom)